Reporter subsystem failure restarts the process #618
Comments
Hello Francois. I'll need more info before embarking on a solution. To replicate the error, I had the reporter simply return an error:
And it does look like it gives up backing off after 15 minutes, taking the process with it. Is that what you are seeing? If so, presumably for those 15 minutes it is trying and failing to send run status updates for a single commit. But I don't see how you are managing to hit the limit of 1000 statuses per combination of commit and workspace: there should only be one run for each workspace, and each run should trigger no more than half a dozen status updates.
The runs are triggered not by commits, but automatically whenever a developer's local environment restarts (for various reasons). So if you assume a couple of restarts per day, you can easily exceed the 1000 limit on one commit and context in a reasonable amount of time (a couple of months). Note that this module has a really low volume of commits; my temporary fix was to fake a new commit to reset the limit.
I forgot to answer your question: yes, this is what we're seeing. GitHub will return 422 forever, the backoff will hit 15 minutes, and the process shuts down (but gets restarted by GKE).
How are you creating the run? Via the API? If so, perhaps the solution is to restrict the reporter to reporting on runs that are triggered by a commit. After all, that is the purpose of the reporter.
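A minimal sketch of what such a restriction could look like, in Go. The `Run` type, the `RunSource` values and `shouldReport` are hypothetical names made up for illustration; they are not taken from the project's code.

```go
package main

import "fmt"

// RunSource is a hypothetical label for how a run was created.
type RunSource string

const (
	SourceAPI RunSource = "api" // created through the API
	SourceUI  RunSource = "ui"  // created through the web UI
	SourceVCS RunSource = "vcs" // triggered by a commit pushed to the repository
)

// Run is a hypothetical, stripped-down run record.
type Run struct {
	ID        string
	Source    RunSource
	CommitSHA string
}

// shouldReport returns true only for runs that were triggered by a commit,
// so runs created via the API or UI never produce commit statuses.
func shouldReport(r Run) bool {
	return r.Source == SourceVCS && r.CommitSHA != ""
}

func main() {
	runs := []Run{
		{ID: "run-1", Source: SourceVCS, CommitSHA: "abc123"},
		{ID: "run-2", Source: SourceAPI, CommitSHA: "abc123"},
		{ID: "run-3", Source: SourceUI},
	}
	for _, r := range runs {
		fmt.Println(r.ID, shouldReport(r))
	}
}
```

With a filter like this, repeated API-triggered runs against the same commit would not count towards the per-commit status limit.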
Yes, the runs are triggered via the API (or via the UI). That would be an acceptable solution for me. Still, taking down the process because GitHub fails is a problem.
I agree. I didn't know the default exponential backoff algorithm exits after 15 minutes, so I've learned something new. And there are multiple "subsystems" that share this behaviour, not only the reporter. Probably none of them should be terminating the process. I'll tackle that separately.
🤖 I have created a release *beep* *boop*

---

## [0.1.14](v0.1.13...v0.1.14) (2023-10-19)

### Features

* github app: [#617](#617)
* always use latest terraform version ([#616](#616)) ([83469ca](83469ca)), closes [#608](#608)

### Bug Fixes

* error 'schema: converter not found for integration.manifest' ([e53ebf2](e53ebf2))
* fixed bug where proxy was ignored ([#609](#609)) ([c1ee8d8](c1ee8d8))
* prevent modules with no published versions from crashing otf ([#611](#611)) ([84aa299](84aa299))
* skip reporting runs created via API ([#622](#622)) ([5d4527b](5d4527b)), closes [#618](#618)

### Miscellaneous

* add note re cloud block to allow CLI apply ([4f03544](4f03544))
* remove unused exchange code response ([4a966cd](4a966cd))
* upgrade vulnerable markdown go mod ([781e0f6](781e0f6))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Louis Garman <75728+leg100@users.noreply.github.com>
If the reporter subsystem repeatedly fails to report a run (after backing off), it shuts down the process. We're hitting this scenario because we have multiple dynamic workspaces reporting against the same commit on the master branch of a single GitHub repository. But GitHub has a limit of 1000 statuses per commit and context (workspace):
The expected behavior is that failing to notify GitHub shouldn't take down the process (which in our case corrupts some runs, which we then have to cancel manually).