This repository has been archived by the owner on Dec 26, 2023. It is now read-only.

Reporter subsystem failure restarts the process #618

Closed
fsaintjacques opened this issue Oct 19, 2023 · 6 comments · Fixed by #622 or #606

Comments

@fsaintjacques
Contributor

If the reporter sub-system fails to report a Run multiple times (after backoff), it shuts down the process. We're hitting this scenario because we have multiple dynamic workspaces hitting the same commit on a GitHub repository's master branch, and GitHub has a limit of 1000 statuses per commit & context (workspace):

ERROR 2023-10-19T14:46:01.489316015Z {"backoff":1.0954821796E10, "error":"POST https://api.github.com/repos/[redacted]/statuses/7eb12bd04be1ce5a20fd1e536525cdb61c5c6871: 422 Validation Failed [{Resource:Status Field: Code:custom Message:This SHA and context has reached the maximum number of statuses.}]", "level":"ERROR", "msg":"restarting subsystem", "name":"reporter"}

The expected behavior is that failing to notify GitHub shouldn't take down the process (which in our case corrupts some runs that we then have to cancel manually).
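For illustration, a minimal sketch of how this particular failure could be treated as non-fatal, assuming the reporter posts statuses with google/go-github (the owner, repo, and context values are placeholders, and the helper itself is hypothetical rather than otf's actual code): a 422 from the statuses endpoint is logged and dropped instead of being returned to the subsystem supervisor.

```go
package reporter

import (
	"context"
	"errors"
	"log"
	"net/http"

	"github.com/google/go-github/v55/github"
)

// reportStatus posts a commit status and swallows GitHub's "maximum number of
// statuses" 422 so a permanently rejected commit cannot take the reporter down.
// Hypothetical helper for illustration; otf's real reporter is structured differently.
func reportStatus(ctx context.Context, client *github.Client, owner, repo, sha string) error {
	status := &github.RepoStatus{
		State:   github.String("success"),
		Context: github.String("otf/my-workspace"), // placeholder context
	}
	_, _, err := client.Repositories.CreateStatus(ctx, owner, repo, sha, status)

	var ghErr *github.ErrorResponse
	if errors.As(err, &ghErr) && ghErr.Response.StatusCode == http.StatusUnprocessableEntity {
		// GitHub refuses further statuses for this SHA+context; retrying will
		// never succeed, so log it and move on rather than erroring out.
		log.Printf("dropping status update for %s: %v", sha, err)
		return nil
	}
	return err
}
```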

@leg100
Owner

leg100 commented Oct 19, 2023

Hello Francois. I'll need more info before embarking on a solution. To replicate the error, I had the reporter simply return an error:

2023/10/19 19:22:50 ERROR restarting subsystem error="fickle error" name=reporter backoff=576.536841ms
2023/10/19 19:22:52 ERROR restarting subsystem error="fickle error" name=reporter backoff=554.712678ms
2023/10/19 19:22:53 ERROR restarting subsystem error="fickle error" name=reporter backoff=1.282855436s
2023/10/19 19:22:55 ERROR restarting subsystem error="fickle error" name=reporter backoff=1.80456602s
2023/10/19 19:22:58 ERROR restarting subsystem error="fickle error" name=reporter backoff=2.396128739s
2023/10/19 19:23:02 ERROR restarting subsystem error="fickle error" name=reporter backoff=2.005529795s
2023/10/19 19:23:05 ERROR restarting subsystem error="fickle error" name=reporter backoff=5.490968261s
2023/10/19 19:23:11 ERROR restarting subsystem error="fickle error" name=reporter backoff=6.89707495s
2023/10/19 19:23:19 ERROR restarting subsystem error="fickle error" name=reporter backoff=9.403647243s
2023/10/19 19:23:29 ERROR restarting subsystem error="fickle error" name=reporter backoff=22.978989478s
2023/10/19 19:23:53 ERROR restarting subsystem error="fickle error" name=reporter backoff=32.547234336s
2023/10/19 19:24:27 ERROR restarting subsystem error="fickle error" name=reporter backoff=1m4.247044651s
2023/10/19 19:25:32 ERROR restarting subsystem error="fickle error" name=reporter backoff=56.962790539s
2023/10/19 19:26:30 ERROR restarting subsystem error="fickle error" name=reporter backoff=33.770130682s
2023/10/19 19:27:05 ERROR restarting subsystem error="fickle error" name=reporter backoff=1m13.992406476s
2023/10/19 19:28:20 ERROR restarting subsystem error="fickle error" name=reporter backoff=1m1.96730443s
2023/10/19 19:29:23 ERROR restarting subsystem error="fickle error" name=reporter backoff=1m20.532220974s
2023/10/19 19:30:44 ERROR restarting subsystem error="fickle error" name=reporter backoff=39.353501338s
2023/10/19 19:31:25 ERROR restarting subsystem error="fickle error" name=reporter backoff=33.639028942s
2023/10/19 19:31:59 ERROR restarting subsystem error="fickle error" name=reporter backoff=1m8.981556322s
2023/10/19 19:33:09 ERROR restarting subsystem error="fickle error" name=reporter backoff=1m18.121362314s
2023/10/19 19:34:29 ERROR restarting subsystem error="fickle error" name=reporter backoff=1m29.903974466s
2023/10/19 19:35:59 ERROR restarting subsystem error="fickle error" name=reporter backoff=1m27.678914538s
2023/10/19 19:37:28 ERROR restarting subsystem error="fickle error" name=reporter backoff=1m15.10068044s
2023/10/19 19:38:44 INFO gracefully shutting down server...
2023/10/19 19:38:44 DEBUG gracefully shutdown subsystem name=notifier
2023/10/19 19:38:44 DEBUG gracefully shutdown subsystem name=broker
2023/10/19 19:38:44 DEBUG gracefully shutdown subsystem name=proxy
2023/10/19 19:38:44 DEBUG gracefully shutdown subsystem name=scheduler
Error: fickle error
exited: exit status 1

And it does look like it gives up backing off after 15 minutes, taking the process with it. Is that what you are seeing?

If so, presumably for those 15 minutes it is trying and failing to send run status updates for a single commit. But I don't see how you are managing to hit the 1000 statuses per combination of commit and workspace; there should only be one run for each workspace and each run should trigger no more than half a dozen status updates?

@fsaintjacques
Contributor Author

The runs are not triggered by commits; they're triggered automatically whenever a developer's local environment restarts (due to various factors). So if you assume a couple of restarts per day, you can easily bust the 1000 limit for one commit & context in a reasonable amount of time (a couple of months). Note that the module has a really low volume of commits; my temporary fix was to fake a new commit to reset the limit.
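(Illustrative arithmetic only, assuming roughly half a dozen status updates per run as suggested above: two restarts a day is about 12 statuses a day against the same SHA and context, and 1000 / 12 ≈ 83 days, so the cap is reached in roughly a couple of months.)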

@fsaintjacques
Contributor Author

I forgot to answer your question: yes, this is what we're seeing. GitHub will 422 forever, the backoff will hit 15 minutes, and the process shuts down (but gets restarted by GKE).

@leg100
Owner

leg100 commented Oct 19, 2023

How are you creating the run? Via the API? If so, perhaps the solution is to restrict the reporter to reporting on runs that are triggered by a commit. After all, that is the purpose of the reporter.
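For illustration, a rough sketch of what that restriction could look like, assuming a run only carries VCS ingress attributes when it was triggered by a commit (all types and field names here are stand-ins, not otf's actual API):

```go
package reporter

import "context"

// Minimal stand-ins for illustration; not otf's real types.
type IngressAttributes struct{ CommitSHA string }

type Run struct {
	ID                string
	IngressAttributes *IngressAttributes // assumed to be set only for VCS-triggered runs
}

type reporter struct {
	postStatus func(ctx context.Context, run *Run) error
}

// handleRun skips runs that were not triggered by a commit, so runs created
// via the API or UI never generate GitHub commit statuses at all.
func (r *reporter) handleRun(ctx context.Context, run *Run) error {
	if run.IngressAttributes == nil {
		return nil // no associated commit, nothing to report
	}
	return r.postStatus(ctx, run)
}
```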

@fsaintjacques
Contributor Author

Yes, the runs are triggered via the API (or via the UI); that would be an acceptable solution for me. Still, the issue of taking down the process because GitHub fails is a problem.

@leg100
Owner

leg100 commented Oct 19, 2023

> Still, the issue of taking down the process because GitHub fails is a problem.

I agree. I didn't know the default exponential backoff algorithm exits after 15 minutes, so I've learned something new. And there are multiple "subsystems" that share this behaviour, not only the reporter. None of them should probably be terminating the process. I'll tackle that separately.
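For reference, the 15-minute cut-off matches the default MaxElapsedTime of github.com/cenkalti/backoff. Assuming the subsystem supervisor uses something along those lines, a sketch of keeping the retry loop alive indefinitely instead of propagating the final error (runReporter is a placeholder for the real subsystem):

```go
package supervisor

import (
	"context"
	"log"
	"time"

	"github.com/cenkalti/backoff/v4"
)

// superviseReporter keeps restarting a failing subsystem instead of letting
// the retry loop expire and take the process down. A sketch only.
func superviseReporter(ctx context.Context, runReporter func(context.Context) error) error {
	policy := backoff.NewExponentialBackOff()
	// NewExponentialBackOff defaults MaxElapsedTime to 15 minutes, after which
	// backoff.Retry gives up and returns the last error (which is what ends up
	// killing the process). Zero disables the limit so retries continue until
	// the context is cancelled.
	policy.MaxElapsedTime = 0
	policy.MaxInterval = 2 * time.Minute

	return backoff.Retry(func() error {
		if err := runReporter(ctx); err != nil {
			log.Printf("restarting subsystem: name=reporter error=%v", err)
			return err
		}
		return nil
	}, backoff.WithContext(policy, ctx))
}
```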

leg100 added a commit that referenced this issue Oct 19, 2023
leg100 added a commit that referenced this issue Oct 20, 2023
🤖 I have created a release *beep* *boop*
---


## [0.1.14](v0.1.13...v0.1.14) (2023-10-19)


### Features

* github app: [#617](#617)
* always use latest terraform version ([#616](#616)) ([83469ca](83469ca)), closes [#608](#608)


### Bug Fixes

* error 'schema: converter not found for integration.manifest' ([e53ebf2](e53ebf2))
* fixed bug where proxy was ignored ([#609](#609)) ([c1ee8d8](c1ee8d8))
* prevent modules with no published versions from crashing otf ([#611](#611)) ([84aa299](84aa299))
* skip reporting runs created via API ([#622](#622)) ([5d4527b](5d4527b)), closes [#618](#618)


### Miscellaneous

* add note re cloud block to allow CLI apply ([4f03544](4f03544))
* remove unused exchange code response ([4a966cd](4a966cd))
* upgrade vulnerable markdown go mod ([781e0f6](781e0f6))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Louis Garman <75728+leg100@users.noreply.github.com>