Remove regex from 404 duplicate deletion check #210
Merged
Context
Hatchet creates apps when you need them (up to a limit); when that limit is hit, it attempts to delete old apps.
One of the bottlenecks Hatchet faces is API rate limits. Hatchet relies on client rate throttling (https://www.schneems.com/2020/07/08/a-fast-car-needs-good-brakes-how-we-added-client-rate-throttling-to-the-platform-api-gem/) to slow down and spread out API requests, but at some point too many requests, and processes sleeping for too long, can result in failures.
Hatchet is also distributed as multiple test runs on multiple machines, so we have to assume that if our process is trying to delete apps, other processes might be as well, which creates race conditions. If two (or more) machines try to delete the same resource, one will get a 404 response.
The reaper is responsible for keeping a list of apps:

- hatchet/lib/hatchet/reaper.rb, line 176 in 8c80522
- hatchet/lib/hatchet/reaper.rb, line 133 in 8c80522
The default strategy is `:refresh_api_and_continue`. It sleeps for a period of time, which allows any other machines that are also cleaning up to continue deleting apps. When it wakes up, it queries the API to see how many apps are left and either stops (if enough have already been deleted) or continues trying to remove apps. Effectively it's trying to limit API requests, both duplicate delete requests and app-listing requests.
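The flow above can be sketched roughly like this. The class, method, and keyword names here are hypothetical, not Hatchet's actual implementation:

```ruby
# Hypothetical sketch of the :refresh_api_and_continue strategy.
# `api` is any object that can report how many apps exist and delete
# the oldest one; in Hatchet this would be backed by real API calls.
class ReaperSketch
  def initialize(api:, app_limit:, sleep_for: 1)
    @api = api
    @app_limit = app_limit
    @sleep_for = sleep_for
  end

  def refresh_api_and_continue
    loop do
      sleep(@sleep_for)                    # give other machines time to clean up
      break if @api.app_count < @app_limit # refresh: enough apps already deleted?
      @api.delete_oldest_app               # otherwise keep removing apps
    end
  end
end
```

Sleeping before each API check is what spreads concurrent reapers out in time, so that duplicate delete requests (and the 404s they cause) become less likely.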
404
Previously I was trying to disambiguate between "404: the URL has a typo in it" (or some other routing issue, like a DNS problem) and "404: you clearly reached a valid API endpoint, but the resource couldn't be found". To guard against the first case, I added a regex check against the message in the response body.
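The removed check looked conceptually like this. The regex and message text below are illustrative assumptions, not the exact pattern being removed:

```ruby
# Old approach (sketch): only treat a 404 as "app already deleted" when
# the response body's message matches an expected "not found" phrasing.
# A 404 from a typo'd URL or a DNS problem would not match, and could be
# surfaced as an error instead of silently retried.
def app_already_deleted?(status, body_message)
  status == 404 && !!(body_message =~ /couldn't find that app/i)
end
```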
Problem
The API is in the process of updating its error messages, so this check will no longer work.
Internal link https://salesforce-internal.slack.com/archives/C1RS6AUDR/p1718894438197959.
Change
Remove the regex. Instead, when we receive a 404 response, consider the app deleted. This will trigger the "sleep and refresh" logic.
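With the change, the check no longer inspects the body at all. A sketch with an illustrative method name:

```ruby
# New approach (sketch): any 404 on a delete request is treated as
# "the app is already gone", which hands control back to the existing
# sleep-and-refresh logic rather than parsing the response body.
def app_already_deleted?(status)
  status == 404
end
```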
Consequences: if the service is unreachable for some reason (such as a DNS issue, or if the URL changes), we will also get a 404 response and will keep trying to delete apps in a loop. With our request throttling, each attempt would take progressively longer, and eventually the tests would time out and fail.
We could check the "id" field of the JSON body, which should be `"not_found"` in this case, but it's also the same value for a bad URL. So that mechanism would help with DNS issues, but not with other 404 problems. It also introduces the possibility of a JSON parsing failure, and could break in the future if this ID is revised or changed.
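The rejected alternative would look something like this. The field name and value come from the description above; the rescue clause shows the JSON-parsing failure mode it would introduce:

```ruby
require "json"

# Rejected alternative (sketch): parse the 404 body and check its "id"
# field. A non-JSON body (e.g. an HTML error page from a proxy) raises,
# so that case has to be handled too -- one of the reasons this option
# was passed on.
def not_found_id?(body)
  JSON.parse(body)["id"] == "not_found"
rescue JSON::ParserError
  false
end
```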
I think retrying is safe (enough) here.