
ci instability: status command only waits for 1 minute to resolve instance URLs #981

Closed
jgehrcke opened this issue Jul 2, 2021 · 9 comments
Assignees
Labels
type: ci-instability may or may not be a user-facing problem, too

Comments

@jgehrcke
Contributor

jgehrcke commented Jul 2, 2021

[2021-07-02T11:36:30Z] 2021-07-02T11:36:30.332Z info: logging to file: opstrace_cli_status_20210702-113630Z.log
...

[2021-07-02T11:36:31Z] 2021-07-02T11:36:31.262Z info: waiting for expected HTTP responses at these URLs: {
[2021-07-02T11:36:31Z]   "https://dd.default.prs-bk-4923-b5e-g.opstrace.io/api/v1/series": "default"
[2021-07-02T11:36:31Z] }
[2021-07-02T11:36:51Z] 2021-07-02T11:36:51.513Z info: https://dd.default.prs-bk-4923-b5e-g.opstrace.io/api/v1/series?api_key=eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6IjNjODVhN2ViMzY4YTE3OWI0NGNmYThjOTk2OGFiMTdhNmFiNzliYzIifQ.eyJpYXQiOjE2MjUyMjQ5NzksImV4cCI6MTk0MDgwMDk3OSwiYXVkIjoib3BzdHJhY2UtY2x1c3Rlci1wcnMtYmstNDkyMy1iNWUtZyIsImlzcyI6Im9wc3RyYWNlLWNsaSIsInN1YiI6InRlbmFudC1kZWZhdWx0In0.Qd-gFVYb59n9Di7vz1NaQvSGduiNzzGgUP0VnBe1xEZe1Hiv4lnn4quGmdOF0NOprUb5erD7bLV7bgHhNljgYZaWNK-oKsiCwa1Uq_9LUA3fJCtVMU9mOpZIru0QsmwN-H-wQr61U3tXIrcx_MH6Elk0AAHxCTR7eZt3zpACZw0Y5rgw4hAFftU8PwbCCdr_Weqr_mnRjcyaJ2y69uO_fHuvA9GaqUhL9elcNImxdVlgTkU2PrGxR4xw6CsGEguWlFdAmWMOTW7ow7I072Bzi6tYGhRTUJTJdwXK9LI0fG65nvxmWmRejRpRjjVcSlnc0p1wsfZOjLZq4gy2sl6QGw: still waiting for expected signal. Last error: getaddrinfo ENOTFOUND dd.default.prs-bk-4923-b5e-g.opstrace.io
[2021-07-02T11:37:16Z] 2021-07-02T11:37:16.531Z info: https://dd.default.prs-bk-4923-b5e-g.opstrace.io/api/v1/series?api_key=eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6IjNjODVhN2ViMzY4YTE3OWI0NGNmYThjOTk2OGFiMTdhNmFiNzliYzIifQ.eyJpYXQiOjE2MjUyMjQ5NzksImV4cCI6MTk0MDgwMDk3OSwiYXVkIjoib3BzdHJhY2UtY2x1c3Rlci1wcnMtYmstNDkyMy1iNWUtZyIsImlzcyI6Im9wc3RyYWNlLWNsaSIsInN1YiI6InRlbmFudC1kZWZhdWx0In0.Qd-gFVYb59n9Di7vz1NaQvSGduiNzzGgUP0VnBe1xEZe1Hiv4lnn4quGmdOF0NOprUb5erD7bLV7bgHhNljgYZaWNK-oKsiCwa1Uq_9LUA3fJCtVMU9mOpZIru0QsmwN-H-wQr61U3tXIrcx_MH6Elk0AAHxCTR7eZt3zpACZw0Y5rgw4hAFftU8PwbCCdr_Weqr_mnRjcyaJ2y69uO_fHuvA9GaqUhL9elcNImxdVlgTkU2PrGxR4xw6CsGEguWlFdAmWMOTW7ow7I072Bzi6tYGhRTUJTJdwXK9LI0fG65nvxmWmRejRpRjjVcSlnc0p1wsfZOjLZq4gy2sl6QGw: still waiting for expected signal. Last error: getaddrinfo ENOTFOUND dd.default.prs-bk-4923-b5e-g.opstrace.io
[2021-07-02T11:37:30Z] 2021-07-02T11:37:30.351Z error: error seen by saga middleware:
[2021-07-02T11:37:30Z] Error: timeout checking the status of the cluster
[2021-07-02T11:37:30Z]     at rootTaskStatusCore (/snapshot/build/packages/cli/src/status.ts:112:11)
[2021-07-02T11:37:30Z]     at rootTaskStatusCore.next (<anonymous>)
[2021-07-02T11:37:30Z]     at next (/snapshot/build/node_modules/@redux-saga/core/dist/redux-saga-core.dev.cjs.js:1161:27)
[2021-07-02T11:37:30Z]     at currCb (/snapshot/build/node_modules/@redux-saga/core/dist/redux-saga-core.dev.cjs.js:1255:7)
[2021-07-02T11:37:30Z]     at chCbAtKey (/snapshot/build/node_modules/@redux-saga/core/dist/redux-saga-core.dev.cjs.js:702:9)
[2021-07-02T11:37:30Z]     at currCb (/snapshot/build/node_modules/@redux-saga/core/dist/redux-saga-core.dev.cjs.js:1255:7)
[2021-07-02T11:37:30Z] 2021-07-02T11:37:30.352Z error: saga stack: The above error occurred in task rootTaskStatusCore
[2021-07-02T11:37:30Z]     created by rootTaskStatus
[2021-07-02T11:37:30Z]
[2021-07-02T11:37:30Z] Error: exit.

https://buildkite.com/opstrace/prs/builds/4923#58765aca-2f2f-4bd9-b19e-3469e67742c8/2087-2584

The Error: timeout checking the status of the cluster came after 1 minute. Is that timeout constant a little too small?

I think a short timeout makes sense if status is supposed to check the current state of the cluster w.r.t. some externally visible properties.

Now, the idea of "current" of course has an impact on which timeout constants should be used.

The status command is also a little underspecified, so this issue is not a big deal, and potentially not an issue at all.

This is actually also a manifestation of #834, because getaddrinfo ENOTFOUND was issued for a DNS name that was resolvable shortly before.
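
For reference, here is a minimal sketch of the redux-saga timeout pattern that produces an error like this after a fixed deadline. This is not the actual code in packages/cli/src/status.ts; the constant and helper names below are made up for illustration.

import { call, delay, race } from "redux-saga/effects";

// Hypothetical constant: the real deadline lives in the CLI source.
const STATUS_TIMEOUT_SECONDS = 60;

// Stub for illustration: in the real CLI this polls each probe URL until it
// returns the expected HTTP response.
async function waitForProbeURLsToRespond(
  urls: Record<string, string>
): Promise<void> {
  /* ... poll URLs ... */
}

function* rootTaskStatusCoreSketch(probeURLs: Record<string, string>) {
  const { timedOut } = yield race({
    done: call(waitForProbeURLsToRespond, probeURLs),
    timedOut: delay(STATUS_TIMEOUT_SECONDS * 1000)
  });
  if (timedOut) {
    // This is the error shown in the CI log above.
    throw new Error("timeout checking the status of the cluster");
  }
}

With a pattern like this, bumping the deadline is a one-constant change.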

@jgehrcke jgehrcke changed the title ci instability: status command only waits for 1 minutes to resolve instance URLs ci instability: status command only waits for 1 minute to resolve instance URLs Jul 2, 2021
@sreis
Contributor

sreis commented Jul 6, 2021

Another one. This time it failed to connect to Loki in the system tenant:

[2021-07-06T15:48:37Z] 2021-07-06T15:48:37.275Z info: waiting for expected HTTP responses at these URLs: {
[2021-07-06T15:48:37Z]   "https://cortex.default.prs-bk-4966-aa4-a.opstrace.io/api/v1/labels": "default",
[2021-07-06T15:48:37Z]   "https://loki.default.prs-bk-4966-aa4-a.opstrace.io/loki/api/v1/labels": "default",
[2021-07-06T15:48:37Z]   "https://cortex.system.prs-bk-4966-aa4-a.opstrace.io/api/v1/labels": "system",
[2021-07-06T15:48:37Z]   "https://loki.system.prs-bk-4966-aa4-a.opstrace.io/loki/api/v1/labels": "system"
[2021-07-06T15:48:37Z] }
[2021-07-06T15:48:37Z] 2021-07-06T15:48:37.396Z info: https://cortex.default.prs-bk-4966-aa4-a.opstrace.io/api/v1/labels: got expected HTTP response
[2021-07-06T15:48:42Z] 2021-07-06T15:48:42.388Z info: https://loki.system.prs-bk-4966-aa4-a.opstrace.io/loki/api/v1/labels: got expected HTTP response
[2021-07-06T15:48:52Z] 2021-07-06T15:48:52.316Z info: https://cortex.system.prs-bk-4966-aa4-a.opstrace.io/api/v1/labels: got expected HTTP response
[2021-07-06T15:49:37Z] 2021-07-06T15:49:37.289Z error: error seen by saga middleware:
[2021-07-06T15:49:37Z] Error: timeout checking the status of the cluster

But it was able to connect a few minutes earlier just after the cluster was created.

[2021-07-06T15:43:09Z] 2021-07-06T15:43:09.012Z info: https://loki.system.prs-bk-4966-aa4-a.opstrace.io/loki/api/v1/labels: got expected HTTP response

@sreis
Contributor

sreis commented Jul 9, 2021

[2021-07-09T08:42:18Z] 2021-07-09T08:42:18.704Z info: https://loki.default.schedul-bk-2633-29b-a.opstrace.io/loki/api/v1/labels: got expected HTTP response
[2021-07-09T08:43:18Z] 2021-07-09T08:43:18.541Z error: error seen by saga middleware:
[2021-07-09T08:43:18Z] Error: timeout checking the status of the cluster

https://buildkite.com/opstrace/scheduled-main-builds/builds/2633#69bd2556-ec0a-4648-ade0-a18c70a9f3c2/2782-3264

@sreis sreis self-assigned this Jul 9, 2021
@sreis sreis added the type: ci-instability may or may not be a user-facing problem, too label Jul 9, 2021
@sreis
Contributor

sreis commented Jul 9, 2021

Status command debug logs:

2021-07-09T08:42:18.704Z info: https://loki.default.schedul-bk-2633-29b-a.opstrace.io/loki/api/v1/labels: got expected HTTP response
2021-07-09T08:42:25.142Z debug: HTTP response details:
status: 502
body[:500]: <html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx</center>
</body>
</html>

2021-07-09T08:42:40.153Z debug: https://loki.system.schedul-bk-2633-29b-a.opstrace.io/loki/api/v1/labels: HTTP request failed with: Timeout awaiting 'request' for 10000ms
2021-07-09T08:42:55.161Z debug: https://loki.system.schedul-bk-2633-29b-a.opstrace.io/loki/api/v1/labels: HTTP request failed with: Timeout awaiting 'request' for 10000ms
2021-07-09T08:43:10.175Z debug: https://loki.system.schedul-bk-2633-29b-a.opstrace.io/loki/api/v1/labels: HTTP request failed with: Timeout awaiting 'request' for 10000ms
2021-07-09T08:43:18.542Z error: error seen by saga middleware:

Loki querier logs show the msg="error querying storage" err="context canceled" error:

2021-07-09 08:42:18	
level=info ts=2021-07-09T08:42:18.720615675Z caller=table_manager.go:182 msg="downloading all files for table schedul-bk-2633-29b-a_loki_index_18817"
2021-07-09 08:42:20	
level=info ts=2021-07-09T08:42:20.187205986Z caller=table_manager.go:208 msg="syncing tables"
2021-07-09 08:42:21	
level=error ts=2021-07-09T08:42:21.615384793Z caller=chunk_store.go:526 org_id=system traceID=7adb44d90c3f5183 msg="error querying storage" err="context canceled"
2021-07-09 08:42:21	
level=info ts=2021-07-09T08:42:21.64337351Z caller=table_manager.go:182 msg="downloading all files for table schedul-bk-2633-29b-a_loki_index_18817"
2021-07-09 08:42:25	
level=error ts=2021-07-09T08:42:25.140153298Z caller=chunk_store.go:526 org_id=system traceID=132da10484c2a849 msg="error querying storage" err="context canceled"
2021-07-09 08:42:40	
level=error ts=2021-07-09T08:42:40.153371878Z caller=chunk_store.go:526 org_id=system traceID=1ecc626bbc5173d0 msg="error querying storage" err="context canceled"
2021-07-09 08:42:55	
level=error ts=2021-07-09T08:42:55.161634121Z caller=chunk_store.go:526 org_id=system traceID=0ce94bf04060b47b msg="error querying storage" err="context canceled"
2021-07-09 08:43:10	
level=error ts=2021-07-09T08:43:10.175864974Z caller=chunk_store.go:526 org_id=system traceID=314b39fe0497f4bf msg="error querying storage" err="context canceled"
2021-07-09 08:43:18	
level=error ts=2021-07-09T08:43:18.557428762Z caller=chunk_store.go:526 org_id=system traceID=584e1d9cb27f764b msg="error querying storage" err="context canceled"
2021-07-09 08:43:24	
level=info ts=2021-07-09T08:43:24.52080582Z caller=table_manager.go:208 msg="syncing tables"

A couple of minutes later it was still trying to fetch data:

2021-07-09 08:44:20	
level=error ts=2021-07-09T08:44:20.132185035Z caller=table.go:184 msg="failed to initialize table schedul-bk-2633-29b-a_loki_index_18817, cleaning it up" err="RequestCanceled: request context canceled\ncaused by: context canceled"
2021-07-09 08:44:20	
level=error ts=2021-07-09T08:44:20.132288446Z caller=table.go:94 msg="failed to download table" name=schedul-bk-2633-29b-a_loki_index_18817

This could be a network hiccup preventing the queriers from fetching data from ingesters and/or S3.
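
For context on how those 10 s request timeouts interact with the overall deadline, here is a sketch of a probe loop along the lines suggested by the log output. The "Timeout awaiting 'request' for 10000ms" wording is what got emits, so got is assumed as the HTTP client; the per-request timeouts (3 s connect, 10 s request), the 60 s overall deadline and the retry interval are assumptions read off the logs, not the actual CLI constants.

import got from "got";

// Assumed overall deadline (the saga-level timeout discussed in this issue).
const OVERALL_DEADLINE_MS = 60 * 1000;

async function waitForExpectedResponse(url: string): Promise<void> {
  const deadline = Date.now() + OVERALL_DEADLINE_MS;
  while (Date.now() < deadline) {
    try {
      const resp = await got(url, {
        throwHttpErrors: false,
        // Values taken from the log messages above; assumed, not confirmed.
        timeout: { connect: 3000, request: 10000 }
      });
      if (resp.statusCode === 200) {
        return; // expected signal
      }
      // e.g. a 502 from nginx while the Loki queriers warm up: fall through and retry
    } catch {
      // DNS errors (getaddrinfo ENOTFOUND) and got timeouts land here; retry
    }
    await new Promise(r => setTimeout(r, 5000));
  }
  throw new Error("timeout checking the status of the cluster");
}

Three or four consecutive 10-15 s attempts against a slow querier are enough to exhaust a 60 s budget, which matches the timings in the debug log above.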

@jgehrcke
Contributor Author

jgehrcke commented Jul 9, 2021

Should we bump the status command timeout to e.g. 5 minutes? That might have a good impact on CI and no particularly important negative impact on anything else. :)
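
Concretely, the change would be a single constant (name made up, mirroring the sketch earlier in this thread; the real constant lives in packages/cli/src/status.ts):

// Before (assumed current value, based on the ~1 minute observed in CI):
// const STATUS_TIMEOUT_SECONDS = 60;

// After:
const STATUS_TIMEOUT_SECONDS = 300; // 5 minutes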

@jgehrcke
Contributor Author

jgehrcke commented Jul 9, 2021

[2021-07-08T20:47:13Z] 2021-07-08T20:47:13.368Z info: https://loki.default.schedul-bk-2630-95c-a.opstrace.io/loki/api/v1/labels: got expected HTTP response
[2021-07-08T20:48:13Z] 2021-07-08T20:48:13.309Z error: error seen by saga middleware:
[2021-07-08T20:48:13Z] Error: timeout checking the status of the cluster
[2021-07-08T20:48:13Z]     at rootTaskStatusCore (/snapshot/build/packages/cli/src/status.ts:110:11)
[2021-07-08T20:48:13Z]     at rootTaskStatusCore.next (<anonymous>)

https://buildkite.com/opstrace/scheduled-main-builds/builds/2630#34f62782-bcfe-419c-afa3-459b2834743c/3295-3791

@sreis
Contributor

sreis commented Jul 9, 2021

Should we bump the status command timeout to e.g. 5 minutes? That might have a good impact on CI and no particularly important negative impact on anything else. :)

I was testing locally with the request timeout increased from 10s to 60s, but it didn't address the issue.

Weirdly, it works right after the create command finishes; it's only when you run the status command a few minutes later that this happens.

@sreis
Contributor

sreis commented Jul 9, 2021

I was able to repro this issue a few times locally by deleting the loki namespace and waiting for the controller to recreate things before running the status command.
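
Rough repro sketch, for the record. The namespace name "loki" is taken from the step above; the readiness check and timings are assumptions about the environment, and the final status invocation is left as a comment because the exact CLI arguments depend on the setup.

import { execSync } from "child_process";

function sh(cmd: string): string {
  console.log(`$ ${cmd}`);
  return execSync(cmd, { encoding: "utf-8" });
}

// 1. Delete the namespace; the controller is expected to recreate it.
sh("kubectl delete namespace loki --wait=true");

// 2. Wait until the recreated Loki pods show up and are no longer starting.
const waitDeadline = Date.now() + 15 * 60 * 1000;
while (Date.now() < waitDeadline) {
  const pods = sh("kubectl get pods -n loki --no-headers || true");
  if (pods.trim().length > 0 && !/Pending|ContainerCreating|Terminating/.test(pods)) {
    break;
  }
  execSync("sleep 10");
}

// 3. Now run the status command with the usual cloud/instance/config
//    arguments; this is the window in which the 1-minute deadline tends to
//    be exceeded while the queriers warm up.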

@sreis
Contributor

sreis commented Jul 27, 2021

Closing, since we haven't seen this since the timeout was bumped. We can reopen if it shows up again.

@sreis sreis closed this as completed Jul 27, 2021
@jgehrcke
Contributor Author

[2021-10-22T12:51:40Z] 2021-10-22T12:51:40.847Z info: waiting for expected HTTP responses at these URLs: {
[2021-10-22T12:51:40Z]   "https://cortex.default.schedul-bk-3415-52c-a.opstrace.io/api/v1/labels": "default",
[2021-10-22T12:51:40Z]   "https://loki.default.schedul-bk-3415-52c-a.opstrace.io/loki/api/v1/labels": "default",
[2021-10-22T12:51:40Z]   "https://cortex.system.schedul-bk-3415-52c-a.opstrace.io/api/v1/labels": "system",
[2021-10-22T12:51:40Z]   "https://loki.system.schedul-bk-3415-52c-a.opstrace.io/loki/api/v1/labels": "system"
[2021-10-22T12:51:40Z] }
[2021-10-22T12:51:41Z] 2021-10-22T12:51:41.055Z info: https://loki.default.schedul-bk-3415-52c-a.opstrace.io/loki/api/v1/labels: got expected HTTP response
[2021-10-22T12:51:48Z] 2021-10-22T12:51:48.924Z info: https://cortex.system.schedul-bk-3415-52c-a.opstrace.io/api/v1/labels: got expected HTTP response
[2021-10-22T12:52:05Z] 2021-10-22T12:52:05.003Z info: https://loki.system.schedul-bk-3415-52c-a.opstrace.io/loki/api/v1/labels: got expected HTTP response
[2021-10-22T12:52:22Z] 2021-10-22T12:52:22.897Z info: https://cortex.default.schedul-bk-3415-52c-a.opstrace.io/api/v1/labels: still waiting for expected signal. Last error: Timeout awaiting 'request' for 10000ms
[2021-10-22T12:53:10Z] 2021-10-22T12:53:10.009Z info: https://cortex.default.schedul-bk-3415-52c-a.opstrace.io/api/v1/labels: still waiting for expected signal. Last error: Timeout awaiting 'connect' for 3000ms
[2021-10-22T12:53:15Z] 2021-10-22T12:53:15.055Z info: https://cortex.default.schedul-bk-3415-52c-a.opstrace.io/api/v1/labels: got expected HTTP response
[2021-10-22T12:53:15Z] 2021-10-22T12:53:15.056Z info: wait for data API endpoints: all probe URLs returned expected HTTP responses, continue
[2021-10-22T12:53:15Z] 2021-10-22T12:53:15.056Z info: waiting for expected HTTP responses at these URLs: {
[2021-10-22T12:53:15Z]   "https://dd.default.schedul-bk-3415-52c-a.opstrace.io/api/v1/series": "default"
[2021-10-22T12:53:15Z] }
[2021-10-22T12:53:23Z] 2021-10-22T12:53:23.156Z info: wait for DD API endpoints: all probe URLs returned expected HTTP responses, continue
[2021-10-22T12:53:23Z] 2021-10-22T12:53:23.156Z info: waiting for expected HTTP responses at these URLs: {
[2021-10-22T12:53:23Z]   "https://default.schedul-bk-3415-52c-a.opstrace.io/": "default",
[2021-10-22T12:53:23Z]   "https://system.schedul-bk-3415-52c-a.opstrace.io/": "system"
[2021-10-22T12:53:23Z] }
[2021-10-22T12:54:12Z] 2021-10-22T12:54:12.285Z info: https://default.schedul-bk-3415-52c-a.opstrace.io/: still waiting for expected signal. Last error: Timeout awaiting 'request' for 10000ms
[2021-10-22T12:54:19Z] 2021-10-22T12:54:19.294Z info: https://system.schedul-bk-3415-52c-a.opstrace.io/: still waiting for expected signal. Last error: Timeout awaiting 'connect' for 3000ms
[2021-10-22T12:55:13Z] 2021-10-22T12:55:13.415Z info: https://system.schedul-bk-3415-52c-a.opstrace.io/: still waiting for expected signal. Last error: Timeout awaiting 'connect' for 3000ms
[2021-10-22T12:55:13Z] 2021-10-22T12:55:13.415Z info: https://default.schedul-bk-3415-52c-a.opstrace.io/: still waiting for expected signal. Last error: Timeout awaiting 'request' for 10000ms
[2021-10-22T12:56:00Z] 2021-10-22T12:56:00.588Z info: https://default.schedul-bk-3415-52c-a.opstrace.io/: still waiting for expected signal. Last error: Timeout awaiting 'connect' for 3000ms
[2021-10-22T12:56:40Z] 2021-10-22T12:56:40.851Z error: error seen by saga middleware:
[2021-10-22T12:56:40Z] Error: timeout checking the status of the cluster
[2021-10-22T12:56:40Z]     at rootTaskStatusCore (/snapshot/build/packages/cli/src/status.ts:110:11)
[2021-10-22T12:56:40Z]     at rootTaskStatusCore.next (<anonymous>)
[2021-10-22T12:56:40Z]     at next (/snapshot/build/node_modules/@redux-saga/core/dist/redux-saga-core.dev.cjs.js:1161:27)
[2021-10-22T12:56:40Z]     at currCb (/snapshot/build/node_modules/@redux-saga/core/dist/redux-saga-core.dev.cjs.js:1255:7)
[2021-10-22T12:56:40Z]     at chCbAtKey (/snapshot/build/node_modules/@redux-saga/core/dist/redux-saga-core.dev.cjs.js:702:9)
[2021-10-22T12:56:40Z]     at currCb (/snapshot/build/node_modules/@redux-saga/core/dist/redux-saga-core.dev.cjs.js:1255:7)
[2021-10-22T12:56:40Z] 2021-10-22T12:56:40.851Z error: saga stack: The above error occurred in task rootTaskStatusCore
[2021-10-22T12:56:40Z]     created by rootTaskStatus
[2021-10-22T12:56:40Z] 
[2021-10-22T12:56:40Z] Error: exit.

https://buildkite.com/opstrace/scheduled-main-builds/builds/3415
