
ci instability: status command only waits for 1 minute to resolve instance URLs #981

Closed
jgehrcke opened this issue Jul 2, 2021 · 9 comments
Assignees
Labels
type: ci-instability may or may not be a user-facing problem, too

Comments

@jgehrcke
Contributor

jgehrcke commented Jul 2, 2021

[2021-07-02T11:36:30Z] 2021-07-02T11:36:30.332Z info: logging to file: opstrace_cli_status_20210702-113630Z.log
...

[2021-07-02T11:36:31Z] 2021-07-02T11:36:31.262Z info: waiting for expected HTTP responses at these URLs: {
[2021-07-02T11:36:31Z]   "https://dd.default.prs-bk-4923-b5e-g.opstrace.io/api/v1/series": "default"
[2021-07-02T11:36:31Z] }
[2021-07-02T11:36:51Z] 2021-07-02T11:36:51.513Z info: https://dd.default.prs-bk-4923-b5e-g.opstrace.io/api/v1/series?api_key=eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6IjNjODVhN2ViMzY4YTE3OWI0NGNmYThjOTk2OGFiMTdhNmFiNzliYzIifQ.eyJpYXQiOjE2MjUyMjQ5NzksImV4cCI6MTk0MDgwMDk3OSwiYXVkIjoib3BzdHJhY2UtY2x1c3Rlci1wcnMtYmstNDkyMy1iNWUtZyIsImlzcyI6Im9wc3RyYWNlLWNsaSIsInN1YiI6InRlbmFudC1kZWZhdWx0In0.Qd-gFVYb59n9Di7vz1NaQvSGduiNzzGgUP0VnBe1xEZe1Hiv4lnn4quGmdOF0NOprUb5erD7bLV7bgHhNljgYZaWNK-oKsiCwa1Uq_9LUA3fJCtVMU9mOpZIru0QsmwN-H-wQr61U3tXIrcx_MH6Elk0AAHxCTR7eZt3zpACZw0Y5rgw4hAFftU8PwbCCdr_Weqr_mnRjcyaJ2y69uO_fHuvA9GaqUhL9elcNImxdVlgTkU2PrGxR4xw6CsGEguWlFdAmWMOTW7ow7I072Bzi6tYGhRTUJTJdwXK9LI0fG65nvxmWmRejRpRjjVcSlnc0p1wsfZOjLZq4gy2sl6QGw: still waiting for expected signal. Last error: getaddrinfo ENOTFOUND dd.default.prs-bk-4923-b5e-g.opstrace.io
[2021-07-02T11:37:16Z] 2021-07-02T11:37:16.531Z info: https://dd.default.prs-bk-4923-b5e-g.opstrace.io/api/v1/series?api_key=eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6IjNjODVhN2ViMzY4YTE3OWI0NGNmYThjOTk2OGFiMTdhNmFiNzliYzIifQ.eyJpYXQiOjE2MjUyMjQ5NzksImV4cCI6MTk0MDgwMDk3OSwiYXVkIjoib3BzdHJhY2UtY2x1c3Rlci1wcnMtYmstNDkyMy1iNWUtZyIsImlzcyI6Im9wc3RyYWNlLWNsaSIsInN1YiI6InRlbmFudC1kZWZhdWx0In0.Qd-gFVYb59n9Di7vz1NaQvSGduiNzzGgUP0VnBe1xEZe1Hiv4lnn4quGmdOF0NOprUb5erD7bLV7bgHhNljgYZaWNK-oKsiCwa1Uq_9LUA3fJCtVMU9mOpZIru0QsmwN-H-wQr61U3tXIrcx_MH6Elk0AAHxCTR7eZt3zpACZw0Y5rgw4hAFftU8PwbCCdr_Weqr_mnRjcyaJ2y69uO_fHuvA9GaqUhL9elcNImxdVlgTkU2PrGxR4xw6CsGEguWlFdAmWMOTW7ow7I072Bzi6tYGhRTUJTJdwXK9LI0fG65nvxmWmRejRpRjjVcSlnc0p1wsfZOjLZq4gy2sl6QGw: still waiting for expected signal. Last error: getaddrinfo ENOTFOUND dd.default.prs-bk-4923-b5e-g.opstrace.io
[2021-07-02T11:37:30Z] 2021-07-02T11:37:30.351Z error: error seen by saga middleware:
[2021-07-02T11:37:30Z] Error: timeout checking the status of the cluster
[2021-07-02T11:37:30Z]     at rootTaskStatusCore (/snapshot/build/packages/cli/src/status.ts:112:11)
[2021-07-02T11:37:30Z]     at rootTaskStatusCore.next (<anonymous>)
[2021-07-02T11:37:30Z]     at next (/snapshot/build/node_modules/@redux-saga/core/dist/redux-saga-core.dev.cjs.js:1161:27)
[2021-07-02T11:37:30Z]     at currCb (/snapshot/build/node_modules/@redux-saga/core/dist/redux-saga-core.dev.cjs.js:1255:7)
[2021-07-02T11:37:30Z]     at chCbAtKey (/snapshot/build/node_modules/@redux-saga/core/dist/redux-saga-core.dev.cjs.js:702:9)
[2021-07-02T11:37:30Z]     at currCb (/snapshot/build/node_modules/@redux-saga/core/dist/redux-saga-core.dev.cjs.js:1255:7)
[2021-07-02T11:37:30Z] 2021-07-02T11:37:30.352Z error: saga stack: The above error occurred in task rootTaskStatusCore
[2021-07-02T11:37:30Z]     created by rootTaskStatus
[2021-07-02T11:37:30Z]
[2021-07-02T11:37:30Z] Error: exit.

https://buildkite.com/opstrace/prs/builds/4923#58765aca-2f2f-4bd9-b19e-3469e67742c8/2087-2584

The Error: timeout checking the status of the cluster came after 1 minute. Is that timeout constant a little too small?

I think a short timeout makes sense if status is supposed to check the current state of the cluster w.r.t. some externally visible properties.

Now, the idea of "current" of course has an impact on which timeout constants should be used.

The status command is also a little underspecified, so this issue is not a big deal, and potentially not an issue at all.

This is actually also a manifestation of #834, because getaddrinfo ENOTFOUND was issued for a DNS name that was resolvable shortly before.
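
For reference, here is a minimal sketch of the redux-saga timeout pattern that produces an error like this after a fixed deadline. This is not the actual code in packages/cli/src/status.ts; the constant and helper names below are made up for illustration.

import { call, delay, race } from "redux-saga/effects";

// Hypothetical constant: the real deadline lives in the CLI source.
const STATUS_TIMEOUT_SECONDS = 60;

// Stub for illustration: in the real CLI this polls each probe URL until it
// returns the expected HTTP response.
async function waitForProbeURLsToRespond(
  urls: Record<string, string>
): Promise<void> {
  /* ... poll URLs ... */
}

function* rootTaskStatusCoreSketch(probeURLs: Record<string, string>) {
  const { timedOut } = yield race({
    done: call(waitForProbeURLsToRespond, probeURLs),
    timedOut: delay(STATUS_TIMEOUT_SECONDS * 1000)
  });
  if (timedOut) {
    // This is the error shown in the CI log above.
    throw new Error("timeout checking the status of the cluster");
  }
}

With a pattern like this, bumping the deadline is a one-constant change.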

@jgehrcke jgehrcke changed the title ci instability: status command only waits for 1 minutes to resolve instance URLs ci instability: status command only waits for 1 minute to resolve instance URLs Jul 2, 2021
@sreis
Contributor

sreis commented Jul 6, 2021

Another one. This time it failed to connect to Loki in the system tenant:

[2021-07-06T15:48:37Z] 2021-07-06T15:48:37.275Z info: waiting for expected HTTP responses at these URLs: {
[2021-07-06T15:48:37Z]   "https://cortex.default.prs-bk-4966-aa4-a.opstrace.io/api/v1/labels": "default",
[2021-07-06T15:48:37Z]   "https://loki.default.prs-bk-4966-aa4-a.opstrace.io/loki/api/v1/labels": "default",
[2021-07-06T15:48:37Z]   "https://cortex.system.prs-bk-4966-aa4-a.opstrace.io/api/v1/labels": "system",
[2021-07-06T15:48:37Z]   "https://loki.system.prs-bk-4966-aa4-a.opstrace.io/loki/api/v1/labels": "system"
[2021-07-06T15:48:37Z] }
[2021-07-06T15:48:37Z] 2021-07-06T15:48:37.396Z info: https://cortex.default.prs-bk-4966-aa4-a.opstrace.io/api/v1/labels: got expected HTTP response
[2021-07-06T15:48:42Z] 2021-07-06T15:48:42.388Z info: https://loki.system.prs-bk-4966-aa4-a.opstrace.io/loki/api/v1/labels: got expected HTTP response
[2021-07-06T15:48:52Z] 2021-07-06T15:48:52.316Z info: https://cortex.system.prs-bk-4966-aa4-a.opstrace.io/api/v1/labels: got expected HTTP response
[2021-07-06T15:49:37Z] 2021-07-06T15:49:37.289Z error: error seen by saga middleware:
[2021-07-06T15:49:37Z] Error: timeout checking the status of the cluster

But it was able to connect a few minutes earlier just after the cluster was created.

[2021-07-06T15:43:09Z] 2021-07-06T15:43:09.012Z info: https://loki.system.prs-bk-4966-aa4-a.opstrace.io/loki/api/v1/labels: got expected HTTP response

@sreis
Contributor

sreis commented Jul 9, 2021

[2021-07-09T08:42:18Z] 2021-07-09T08:42:18.704Z info: https://loki.default.schedul-bk-2633-29b-a.opstrace.io/loki/api/v1/labels: got expected HTTP response
[2021-07-09T08:43:18Z] 2021-07-09T08:43:18.541Z error: error seen by saga middleware:
[2021-07-09T08:43:18Z] Error: timeout checking the status of the cluster

https://buildkite.com/opstrace/scheduled-main-builds/builds/2633#69bd2556-ec0a-4648-ade0-a18c70a9f3c2/2782-3264

@sreis sreis self-assigned this Jul 9, 2021
@sreis sreis added the type: ci-instability may or may not be a user-facing problem, too label Jul 9, 2021
@sreis
Contributor

sreis commented Jul 9, 2021

Status command debug logs:

2021-07-09T08:42:18.704Z info: https://loki.default.schedul-bk-2633-29b-a.opstrace.io/loki/api/v1/labels: got expected HTTP response
2021-07-09T08:42:25.142Z debug: HTTP response details:
status: 502
body[:500]: <html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx</center>
</body>
</html>

2021-07-09T08:42:40.153Z debug: https://loki.system.schedul-bk-2633-29b-a.opstrace.io/loki/api/v1/labels: HTTP request failed with: Timeout awaiting 'request' for 10000ms
2021-07-09T08:42:55.161Z debug: https://loki.system.schedul-bk-2633-29b-a.opstrace.io/loki/api/v1/labels: HTTP request failed with: Timeout awaiting 'request' for 10000ms
2021-07-09T08:43:10.175Z debug: https://loki.system.schedul-bk-2633-29b-a.opstrace.io/loki/api/v1/labels: HTTP request failed with: Timeout awaiting 'request' for 10000ms
2021-07-09T08:43:18.542Z error: error seen by saga middleware:

Loki querier logs show the msg="error querying storage" err="context canceled" error:

2021-07-09 08:42:18	
level=info ts=2021-07-09T08:42:18.720615675Z caller=table_manager.go:182 msg="downloading all files for table schedul-bk-2633-29b-a_loki_index_18817"
2021-07-09 08:42:20	
level=info ts=2021-07-09T08:42:20.187205986Z caller=table_manager.go:208 msg="syncing tables"
2021-07-09 08:42:21	
level=error ts=2021-07-09T08:42:21.615384793Z caller=chunk_store.go:526 org_id=system traceID=7adb44d90c3f5183 msg="error querying storage" err="context canceled"
2021-07-09 08:42:21	
level=info ts=2021-07-09T08:42:21.64337351Z caller=table_manager.go:182 msg="downloading all files for table schedul-bk-2633-29b-a_loki_index_18817"
2021-07-09 08:42:25	
level=error ts=2021-07-09T08:42:25.140153298Z caller=chunk_store.go:526 org_id=system traceID=132da10484c2a849 msg="error querying storage" err="context canceled"
2021-07-09 08:42:40	
level=error ts=2021-07-09T08:42:40.153371878Z caller=chunk_store.go:526 org_id=system traceID=1ecc626bbc5173d0 msg="error querying storage" err="context canceled"
2021-07-09 08:42:55	
level=error ts=2021-07-09T08:42:55.161634121Z caller=chunk_store.go:526 org_id=system traceID=0ce94bf04060b47b msg="error querying storage" err="context canceled"
2021-07-09 08:43:10	
level=error ts=2021-07-09T08:43:10.175864974Z caller=chunk_store.go:526 org_id=system traceID=314b39fe0497f4bf msg="error querying storage" err="context canceled"
2021-07-09 08:43:18	
level=error ts=2021-07-09T08:43:18.557428762Z caller=chunk_store.go:526 org_id=system traceID=584e1d9cb27f764b msg="error querying storage" err="context canceled"
2021-07-09 08:43:24	
level=info ts=2021-07-09T08:43:24.52080582Z caller=table_manager.go:208 msg="syncing tables"

A couple of minutes later it was still trying to fetch data:

2021-07-09 08:44:20	
level=error ts=2021-07-09T08:44:20.132185035Z caller=table.go:184 msg="failed to initialize table schedul-bk-2633-29b-a_loki_index_18817, cleaning it up" err="RequestCanceled: request context canceled\ncaused by: context canceled"
2021-07-09 08:44:20	
level=error ts=2021-07-09T08:44:20.132288446Z caller=table.go:94 msg="failed to download table" name=schedul-bk-2633-29b-a_loki_index_18817

This could be a network hiccup preventing the queriers from fetching data from ingesters and/or S3.
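
For context on how those 10 s request timeouts interact with the overall deadline, here is a sketch of a probe loop along the lines suggested by the log output. The "Timeout awaiting 'request' for 10000ms" wording is what got emits, so got is assumed as the HTTP client; the per-request timeouts (3 s connect, 10 s request), the 60 s overall deadline and the retry interval are assumptions read off the logs, not the actual CLI constants.

import got from "got";

// Assumed overall deadline (the saga-level timeout discussed in this issue).
const OVERALL_DEADLINE_MS = 60 * 1000;

async function waitForExpectedResponse(url: string): Promise<void> {
  const deadline = Date.now() + OVERALL_DEADLINE_MS;
  while (Date.now() < deadline) {
    try {
      const resp = await got(url, {
        throwHttpErrors: false,
        // Values taken from the log messages above; assumed, not confirmed.
        timeout: { connect: 3000, request: 10000 }
      });
      if (resp.statusCode === 200) {
        return; // expected signal
      }
      // e.g. a 502 from nginx while the Loki queriers warm up: fall through and retry
    } catch {
      // DNS errors (getaddrinfo ENOTFOUND) and got timeouts land here; retry
    }
    await new Promise(r => setTimeout(r, 5000));
  }
  throw new Error("timeout checking the status of the cluster");
}

Three or four consecutive 10-15 s attempts against a slow querier are enough to exhaust a 60 s budget, which matches the timings in the debug log above.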

@jgehrcke
Contributor Author

jgehrcke commented Jul 9, 2021

Should we bump the status command timeout to e.g. 5 minutes? That might have a good impact on CI and no particularly important negative impact on anything else. :)
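
Concretely, the change would be a single constant (name made up, mirroring the sketch earlier in this thread; the real constant lives in packages/cli/src/status.ts):

// Before (assumed current value, based on the ~1 minute observed in CI):
// const STATUS_TIMEOUT_SECONDS = 60;

// After:
const STATUS_TIMEOUT_SECONDS = 300; // 5 minutes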

@jgehrcke
Contributor Author

jgehrcke commented Jul 9, 2021

[2021-07-08T20:47:13Z] 2021-07-08T20:47:13.368Z info: https://loki.default.schedul-bk-2630-95c-a.opstrace.io/loki/api/v1/labels: got expected HTTP response
[2021-07-08T20:48:13Z] 2021-07-08T20:48:13.309Z error: error seen by saga middleware:
[2021-07-08T20:48:13Z] Error: timeout checking the status of the cluster
[2021-07-08T20:48:13Z]     at rootTaskStatusCore (/snapshot/build/packages/cli/src/status.ts:110:11)
[2021-07-08T20:48:13Z]     at rootTaskStatusCore.next (<anonymous>)

https://buildkite.com/opstrace/scheduled-main-builds/builds/2630#34f62782-bcfe-419c-afa3-459b2834743c/3295-3791

@sreis
Contributor

sreis commented Jul 9, 2021

Should we bump the status command timeout to e.g. 5 minutes? That might have a good impact on CI and no particularly important negative impact on anything else. :)

I was testing locally with the request timeout increased from 10s to 60s, but it didn't address the issue.

Weirdly, it works right after the create command finishes; it's only when you run the status command a few minutes later that this happens.

@sreis
Contributor

sreis commented Jul 9, 2021

I was able to repro this issue a few times locally by deleting the loki namespace and waiting for the controller to recreate things before running the status command.
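
Rough repro sketch, for the record. The namespace name "loki" is taken from the step above; the readiness check and timings are assumptions about the environment, and the final status invocation is left as a comment because the exact CLI arguments depend on the setup.

import { execSync } from "child_process";

function sh(cmd: string): string {
  console.log(`$ ${cmd}`);
  return execSync(cmd, { encoding: "utf-8" });
}

// 1. Delete the namespace; the controller is expected to recreate it.
sh("kubectl delete namespace loki --wait=true");

// 2. Wait until the recreated Loki pods show up and are no longer starting.
const waitDeadline = Date.now() + 15 * 60 * 1000;
while (Date.now() < waitDeadline) {
  const pods = sh("kubectl get pods -n loki --no-headers || true");
  if (pods.trim().length > 0 && !/Pending|ContainerCreating|Terminating/.test(pods)) {
    break;
  }
  execSync("sleep 10");
}

// 3. Now run the status command with the usual cloud/instance/config
//    arguments; this is the window in which the 1-minute deadline tends to
//    be exceeded while the queriers warm up.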

@sreis
Contributor

sreis commented Jul 27, 2021

Closing, since we haven't seen this since the timeout was bumped. We can reopen if it shows up again.

@sreis sreis closed this as completed Jul 27, 2021
@jgehrcke
Contributor Author

[2021-10-22T12:51:40Z] 2021-10-22T12:51:40.847Z info: waiting for expected HTTP responses at these URLs: {
[2021-10-22T12:51:40Z]   "https://cortex.default.schedul-bk-3415-52c-a.opstrace.io/api/v1/labels": "default",
[2021-10-22T12:51:40Z]   "https://loki.default.schedul-bk-3415-52c-a.opstrace.io/loki/api/v1/labels": "default",
[2021-10-22T12:51:40Z]   "https://cortex.system.schedul-bk-3415-52c-a.opstrace.io/api/v1/labels": "system",
[2021-10-22T12:51:40Z]   "https://loki.system.schedul-bk-3415-52c-a.opstrace.io/loki/api/v1/labels": "system"
[2021-10-22T12:51:40Z] }
[2021-10-22T12:51:41Z] 2021-10-22T12:51:41.055Z info: https://loki.default.schedul-bk-3415-52c-a.opstrace.io/loki/api/v1/labels: got expected HTTP response
[2021-10-22T12:51:48Z] 2021-10-22T12:51:48.924Z info: https://cortex.system.schedul-bk-3415-52c-a.opstrace.io/api/v1/labels: got expected HTTP response
[2021-10-22T12:52:05Z] 2021-10-22T12:52:05.003Z info: https://loki.system.schedul-bk-3415-52c-a.opstrace.io/loki/api/v1/labels: got expected HTTP response
[2021-10-22T12:52:22Z] 2021-10-22T12:52:22.897Z info: https://cortex.default.schedul-bk-3415-52c-a.opstrace.io/api/v1/labels: still waiting for expected signal. Last error: Timeout awaiting 'request' for 10000ms
[2021-10-22T12:53:10Z] 2021-10-22T12:53:10.009Z info: https://cortex.default.schedul-bk-3415-52c-a.opstrace.io/api/v1/labels: still waiting for expected signal. Last error: Timeout awaiting 'connect' for 3000ms
[2021-10-22T12:53:15Z] 2021-10-22T12:53:15.055Z info: https://cortex.default.schedul-bk-3415-52c-a.opstrace.io/api/v1/labels: got expected HTTP response
[2021-10-22T12:53:15Z] 2021-10-22T12:53:15.056Z info: wait for data API endpoints: all probe URLs returned expected HTTP responses, continue
[2021-10-22T12:53:15Z] 2021-10-22T12:53:15.056Z info: waiting for expected HTTP responses at these URLs: {
[2021-10-22T12:53:15Z]   "https://dd.default.schedul-bk-3415-52c-a.opstrace.io/api/v1/series": "default"
[2021-10-22T12:53:15Z] }
[2021-10-22T12:53:23Z] 2021-10-22T12:53:23.156Z info: wait for DD API endpoints: all probe URLs returned expected HTTP responses, continue
[2021-10-22T12:53:23Z] 2021-10-22T12:53:23.156Z info: waiting for expected HTTP responses at these URLs: {
[2021-10-22T12:53:23Z]   "https://default.schedul-bk-3415-52c-a.opstrace.io/": "default",
[2021-10-22T12:53:23Z]   "https://system.schedul-bk-3415-52c-a.opstrace.io/": "system"
[2021-10-22T12:53:23Z] }
[2021-10-22T12:54:12Z] 2021-10-22T12:54:12.285Z info: https://default.schedul-bk-3415-52c-a.opstrace.io/: still waiting for expected signal. Last error: Timeout awaiting 'request' for 10000ms
[2021-10-22T12:54:19Z] 2021-10-22T12:54:19.294Z info: https://system.schedul-bk-3415-52c-a.opstrace.io/: still waiting for expected signal. Last error: Timeout awaiting 'connect' for 3000ms
[2021-10-22T12:55:13Z] 2021-10-22T12:55:13.415Z info: https://system.schedul-bk-3415-52c-a.opstrace.io/: still waiting for expected signal. Last error: Timeout awaiting 'connect' for 3000ms
[2021-10-22T12:55:13Z] 2021-10-22T12:55:13.415Z info: https://default.schedul-bk-3415-52c-a.opstrace.io/: still waiting for expected signal. Last error: Timeout awaiting 'request' for 10000ms
[2021-10-22T12:56:00Z] 2021-10-22T12:56:00.588Z info: https://default.schedul-bk-3415-52c-a.opstrace.io/: still waiting for expected signal. Last error: Timeout awaiting 'connect' for 3000ms
[2021-10-22T12:56:40Z] 2021-10-22T12:56:40.851Z error: error seen by saga middleware:
[2021-10-22T12:56:40Z] Error: timeout checking the status of the cluster
[2021-10-22T12:56:40Z]     at rootTaskStatusCore (/snapshot/build/packages/cli/src/status.ts:110:11)
[2021-10-22T12:56:40Z]     at rootTaskStatusCore.next (<anonymous>)
[2021-10-22T12:56:40Z]     at next (/snapshot/build/node_modules/@redux-saga/core/dist/redux-saga-core.dev.cjs.js:1161:27)
[2021-10-22T12:56:40Z]     at currCb (/snapshot/build/node_modules/@redux-saga/core/dist/redux-saga-core.dev.cjs.js:1255:7)
[2021-10-22T12:56:40Z]     at chCbAtKey (/snapshot/build/node_modules/@redux-saga/core/dist/redux-saga-core.dev.cjs.js:702:9)
[2021-10-22T12:56:40Z]     at currCb (/snapshot/build/node_modules/@redux-saga/core/dist/redux-saga-core.dev.cjs.js:1255:7)
[2021-10-22T12:56:40Z] 2021-10-22T12:56:40.851Z error: saga stack: The above error occurred in task rootTaskStatusCore
[2021-10-22T12:56:40Z]     created by rootTaskStatus
[2021-10-22T12:56:40Z] 
[2021-10-22T12:56:40Z] Error: exit.

https://buildkite.com/opstrace/scheduled-main-builds/builds/3415
