This repository has been archived by the owner on Dec 26, 2023. It is now read-only.

Reporter subsystem failure restarts the process #618

Closed
fsaintjacques opened this issue Oct 19, 2023 · 6 comments · Fixed by #622 or #606

Comments

@fsaintjacques
Contributor

If the reporter sub-system fails to report a Run multiple times (after backoff), it shuts down the process. We're hitting this scenario because we have multiple dynamic workspaces hitting the same commit on a GitHub repository's master branch, and GitHub has a limit of 1000 statuses per commit & context (workspace):

ERROR 2023-10-19T14:46:01.489316015Z {"backoff":1.0954821796E10, "error":"POST https://api.github.com/repos/[redacted]/statuses/7eb12bd04be1ce5a20fd1e536525cdb61c5c6871: 422 Validation Failed [{Resource:Status Field: Code:custom Message:This SHA and context has reached the maximum number of statuses.}]", "level":"ERROR", "msg":"restarting subsystem", "name":"reporter"}

The expected behavior is that failing to notify GitHub shouldn't take down the process (which in our case corrupts some runs that we then have to cancel manually).
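For illustration, a minimal sketch of how this particular failure could be treated as non-fatal, assuming the reporter posts statuses with google/go-github (the owner, repo, and context values are placeholders, and the helper itself is hypothetical rather than otf's actual code): a 422 from the statuses endpoint is logged and dropped instead of being returned to the subsystem supervisor.

```go
package reporter

import (
	"context"
	"errors"
	"log"
	"net/http"

	"github.com/google/go-github/v55/github"
)

// reportStatus posts a commit status and swallows GitHub's "maximum number of
// statuses" 422 so a permanently rejected commit cannot take the reporter down.
// Hypothetical helper for illustration; otf's real reporter is structured differently.
func reportStatus(ctx context.Context, client *github.Client, owner, repo, sha string) error {
	status := &github.RepoStatus{
		State:   github.String("success"),
		Context: github.String("otf/my-workspace"), // placeholder context
	}
	_, _, err := client.Repositories.CreateStatus(ctx, owner, repo, sha, status)

	var ghErr *github.ErrorResponse
	if errors.As(err, &ghErr) && ghErr.Response.StatusCode == http.StatusUnprocessableEntity {
		// GitHub refuses further statuses for this SHA+context; retrying will
		// never succeed, so log it and move on rather than erroring out.
		log.Printf("dropping status update for %s: %v", sha, err)
		return nil
	}
	return err
}
```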

@leg100
Owner

leg100 commented Oct 19, 2023

Hello Francois. I'll need more info before embarking on a solution. To replicate the error, I had the reporter simply return an error:

2023/10/19 19:22:50 ERROR restarting subsystem error="fickle error" name=reporter backoff=576.536841ms
2023/10/19 19:22:52 ERROR restarting subsystem error="fickle error" name=reporter backoff=554.712678ms
2023/10/19 19:22:53 ERROR restarting subsystem error="fickle error" name=reporter backoff=1.282855436s
2023/10/19 19:22:55 ERROR restarting subsystem error="fickle error" name=reporter backoff=1.80456602s
2023/10/19 19:22:58 ERROR restarting subsystem error="fickle error" name=reporter backoff=2.396128739s
2023/10/19 19:23:02 ERROR restarting subsystem error="fickle error" name=reporter backoff=2.005529795s
2023/10/19 19:23:05 ERROR restarting subsystem error="fickle error" name=reporter backoff=5.490968261s
2023/10/19 19:23:11 ERROR restarting subsystem error="fickle error" name=reporter backoff=6.89707495s
2023/10/19 19:23:19 ERROR restarting subsystem error="fickle error" name=reporter backoff=9.403647243s
2023/10/19 19:23:29 ERROR restarting subsystem error="fickle error" name=reporter backoff=22.978989478s
2023/10/19 19:23:53 ERROR restarting subsystem error="fickle error" name=reporter backoff=32.547234336s
2023/10/19 19:24:27 ERROR restarting subsystem error="fickle error" name=reporter backoff=1m4.247044651s
2023/10/19 19:25:32 ERROR restarting subsystem error="fickle error" name=reporter backoff=56.962790539s
2023/10/19 19:26:30 ERROR restarting subsystem error="fickle error" name=reporter backoff=33.770130682s
2023/10/19 19:27:05 ERROR restarting subsystem error="fickle error" name=reporter backoff=1m13.992406476s
2023/10/19 19:28:20 ERROR restarting subsystem error="fickle error" name=reporter backoff=1m1.96730443s
2023/10/19 19:29:23 ERROR restarting subsystem error="fickle error" name=reporter backoff=1m20.532220974s
2023/10/19 19:30:44 ERROR restarting subsystem error="fickle error" name=reporter backoff=39.353501338s
2023/10/19 19:31:25 ERROR restarting subsystem error="fickle error" name=reporter backoff=33.639028942s
2023/10/19 19:31:59 ERROR restarting subsystem error="fickle error" name=reporter backoff=1m8.981556322s
2023/10/19 19:33:09 ERROR restarting subsystem error="fickle error" name=reporter backoff=1m18.121362314s
2023/10/19 19:34:29 ERROR restarting subsystem error="fickle error" name=reporter backoff=1m29.903974466s
2023/10/19 19:35:59 ERROR restarting subsystem error="fickle error" name=reporter backoff=1m27.678914538s
2023/10/19 19:37:28 ERROR restarting subsystem error="fickle error" name=reporter backoff=1m15.10068044s
2023/10/19 19:38:44 INFO gracefully shutting down server...
2023/10/19 19:38:44 DEBUG gracefully shutdown subsystem name=notifier
2023/10/19 19:38:44 DEBUG gracefully shutdown subsystem name=broker
2023/10/19 19:38:44 DEBUG gracefully shutdown subsystem name=proxy
2023/10/19 19:38:44 DEBUG gracefully shutdown subsystem name=scheduler
Error: fickle error
exited: exit status 1

And it does look like it gives up backing off after 15 minutes, taking the process with it. Is that what you are seeing?

If so, presumably for those 15 minutes it is trying and failing to send run status updates for a single commit. But I don't see how you are managing to hit the 1000 statuses per combination of commit and workspace; there should only be one run for each workspace and each run should trigger no more than half a dozen status updates?

@fsaintjacques
Contributor Author

The runs are not triggered by commits; they're triggered automatically whenever a developer's local environment restarts (due to various factors). So if you assume a couple of restarts per day, you can easily bust the 1000 limit for one commit & context in a reasonable amount of time (a couple of months). Note that the module has a really low volume of commits; my temporary fix was to fake a new commit to reset the limit.
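(Illustrative arithmetic only, assuming roughly half a dozen status updates per run as suggested above: two restarts a day is about 12 statuses a day against the same SHA and context, and 1000 / 12 ≈ 83 days, so the cap is reached in roughly a couple of months.)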

@fsaintjacques
Contributor Author

I forgot to answer your question: yes, this is what we're seeing. GitHub will 422 forever, the backoff will hit 15 minutes, and the process shuts down (but gets restarted by GKE).

@leg100
Owner

leg100 commented Oct 19, 2023

How are you creating the run? Via the API? If so, perhaps the solution is to restrict the reporter to reporting on runs that are triggered by a commit. After all, that is the purpose of the reporter.
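For illustration, a rough sketch of what that restriction could look like, assuming a run only carries VCS ingress attributes when it was triggered by a commit (all types and field names here are stand-ins, not otf's actual API):

```go
package reporter

import "context"

// Minimal stand-ins for illustration; not otf's real types.
type IngressAttributes struct{ CommitSHA string }

type Run struct {
	ID                string
	IngressAttributes *IngressAttributes // assumed to be set only for VCS-triggered runs
}

type reporter struct {
	postStatus func(ctx context.Context, run *Run) error
}

// handleRun skips runs that were not triggered by a commit, so runs created
// via the API or UI never generate GitHub commit statuses at all.
func (r *reporter) handleRun(ctx context.Context, run *Run) error {
	if run.IngressAttributes == nil {
		return nil // no associated commit, nothing to report
	}
	return r.postStatus(ctx, run)
}
```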

@fsaintjacques
Contributor Author

Yes, the runs are triggered via the API (or via the UI); that would be an acceptable solution for me. Still, the issue of taking down the process because GitHub fails is a problem.

@leg100
Owner

leg100 commented Oct 19, 2023

> Still, the issue of taking down the process because GitHub fails is a problem.

I agree. I didn't know the default exponential backoff algorithm exits after 15 minutes, so I've learned something new. And there are multiple "subsystems" that share this behaviour, not only the reporter. None of them should probably be terminating the process. I'll tackle that separately.
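For reference, the 15-minute cut-off matches the default MaxElapsedTime of github.com/cenkalti/backoff. Assuming the subsystem supervisor uses something along those lines, a sketch of keeping the retry loop alive indefinitely instead of propagating the final error (runReporter is a placeholder for the real subsystem):

```go
package supervisor

import (
	"context"
	"log"
	"time"

	"github.com/cenkalti/backoff/v4"
)

// superviseReporter keeps restarting a failing subsystem instead of letting
// the retry loop expire and take the process down. A sketch only.
func superviseReporter(ctx context.Context, runReporter func(context.Context) error) error {
	policy := backoff.NewExponentialBackOff()
	// NewExponentialBackOff defaults MaxElapsedTime to 15 minutes, after which
	// backoff.Retry gives up and returns the last error (which is what ends up
	// killing the process). Zero disables the limit so retries continue until
	// the context is cancelled.
	policy.MaxElapsedTime = 0
	policy.MaxInterval = 2 * time.Minute

	return backoff.Retry(func() error {
		if err := runReporter(ctx); err != nil {
			log.Printf("restarting subsystem: name=reporter error=%v", err)
			return err
		}
		return nil
	}, backoff.WithContext(policy, ctx))
}
```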

leg100 added a commit that referenced this issue Oct 19, 2023
leg100 added a commit that referenced this issue Oct 20, 2023
🤖 I have created a release *beep* *boop*
---


## [0.1.14](v0.1.13...v0.1.14) (2023-10-19)


### Features

* github app: [#617](#617)
* always use latest terraform version ([#616](#616)) ([83469ca](83469ca)), closes [#608](#608)


### Bug Fixes

* error 'schema: converter not found for integration.manifest' ([e53ebf2](e53ebf2))
* fixed bug where proxy was ignored ([#609](#609)) ([c1ee8d8](c1ee8d8))
* prevent modules with no published versions from crashing otf ([#611](#611)) ([84aa299](84aa299))
* skip reporting runs created via API ([#622](#622)) ([5d4527b](5d4527b)), closes [#618](#618)


### Miscellaneous

* add note re cloud block to allow CLI apply ([4f03544](4f03544))
* remove unused exchange code response ([4a966cd](4a966cd))
* upgrade vulnerable markdown go mod ([781e0f6](781e0f6))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Louis Garman <75728+leg100@users.noreply.github.com>