
Remove regex from 404 duplicate deletion check #210

Merged
merged 1 commit into from
Jun 21, 2024

Conversation

schneems
Contributor

Context

Hatchet creates apps on demand (up to a limit); when that limit is hit, it attempts to delete old apps.

One of the bottlenecks that Hatchet faces is API limits. Hatchet relies on rate throttling https://www.schneems.com/2020/07/08/a-fast-car-needs-good-brakes-how-we-added-client-rate-throttling-to-the-platform-api-gem/ to slow down and spread out API requests, but there's a point where too many requests and processes sleeping for too long can result in failures.

Hatchet is also distributed as multiple test runs on multiple machines. So we have to assume that if our process is trying to delete apps, other processes might be doing the same, which creates race conditions. If two (or more) servers try to delete the same resource, then one will get a 404 response.

The reaper is responsible for keeping a list of apps (`private def refresh_app_list`, https://github.com/heroku/hatchet/blob/8c80522eddcefaee79396353673a81a018b66af4/lib/hatchet/reaper.rb#L176) and also handling what to do when a "conflict" is found (`private def handle_conflict(conflict_message:, strategy:)`, https://github.com/heroku/hatchet/blob/8c80522eddcefaee79396353673a81a018b66af4/lib/hatchet/reaper.rb#L133).

The default strategy is `:refresh_api_and_continue`. It will sleep for a period of time, which allows any other machines that are also cleaning up to continue deleting apps. When it wakes up, it queries the API to see how many apps are left and either stops (if enough have already been deleted) or continues trying to remove apps.
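The strategy described above can be sketched roughly like this (a minimal sketch with hypothetical names, not the actual reaper implementation):

```ruby
# Minimal sketch of the :refresh_api_and_continue strategy: sleep so
# other machines can finish their own cleanup, then re-query the API
# and decide whether to keep reaping. All names here are hypothetical.
class ReaperSketch
  def initialize(limit:, sleep_for: 0, list_apps_api:)
    @limit = limit
    @sleep_for = sleep_for
    @list_apps_api = list_apps_api # stand-in for a "list apps" API call
  end

  # Called after a delete hits a conflict (e.g. another test runner
  # already deleted the app we were trying to remove).
  def refresh_api_and_continue
    sleep(@sleep_for)            # give other machines time to clean up
    app_count = @list_apps_api.call
    if app_count < @limit
      :stop      # enough apps were deleted (by us or by other runners)
    else
      :continue  # still over the limit, keep trying to remove apps
    end
  end
end
```

The sleep-before-refresh ordering matters: it trades wall-clock time for fewer "list apps" requests, which is the scarce resource here.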

Effectively it's trying to limit API requests, both for duplicate delete requests and for listing apps requests.

404

Previously I was trying to disambiguate between "404: the URL has a typo in it" (or some other issue, like a DNS routing problem) and "404: you clearly reached a valid API endpoint, but the resource couldn't be found". To guard against that, I added a regex check against the message in the response body.

Problem

The API is in the process of updating its error messages, so this check will no longer work.

Internal link https://salesforce-internal.slack.com/archives/C1RS6AUDR/p1718894438197959.

Change

Remove the regex. Instead, when we receive a 404 response, consider the app deleted. This will trigger the "sleep and refresh" logic.
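A minimal sketch of the change (hypothetical names; `response` stands in for an HTTP response object):

```ruby
# Hypothetical sketch: any 404 on delete is treated as "app already
# gone" with no regex inspection of the response body, so the caller
# can route it to the sleep-and-refresh conflict handling.
AlreadyDeleted = Class.new(StandardError)

def delete_app(response)
  case response.fetch(:status)
  when 200, 202
    :deleted
  when 404
    # Previously a regex on the body message tried to tell "resource
    # not found" apart from "bad URL"; now any 404 means another
    # process beat us to the delete.
    raise AlreadyDeleted, "404: treating app as already deleted"
  else
    raise "Unexpected status #{response.fetch(:status)}"
  end
end
```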

Consequences: If the service is unreachable for some reason (such as a DNS issue, or if the URL changes), we will also get a 404 response and will keep trying to update apps in a loop. With our request throttling, each attempt would progressively take longer and longer, and eventually the tests would time out and fail.

We could check the "id" field of the JSON body, which should be `"not_found"` in this case, but it's also the same value for a bad URL:

```
$ curl https://api.heroku.com/schemaz -H "Accept: application/vnd.heroku+json; version=3"
{
  "id": "not_found",
  "message": "The requested API endpoint was not found. Are you using the right HTTP verb (i.e. `GET` vs. `POST`), and did you specify your intended version with the `Accept` header?"
}
```

So that mechanism would help with DNS issues, but not with other 404 problems. It also introduces the possibility of a JSON parsing failure and could break in the future if this ID is revised.
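For illustration, the rejected alternative would have looked something like this (a hypothetical sketch, not code from this PR):

```ruby
require "json"

# Sketch of the rejected alternative: check the "id" field of the 404
# body. As the curl example above shows, both a deleted app and a bad
# URL return "not_found", so this cannot disambiguate the two cases,
# and a non-JSON body would raise unless rescued.
def not_found_id?(body)
  JSON.parse(body)["id"] == "not_found"
rescue JSON::ParserError
  false # e.g. an HTML error page from a proxy, not the API itself
end
```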

I think retrying is safe (enough) here.

@schneems schneems requested a review from a team as a code owner June 20, 2024 21:57
@schneems schneems marked this pull request as draft June 20, 2024 21:57
@schneems schneems force-pushed the schneems/404-means-not-found branch from 6e39548 to 5b3b1fa Compare June 20, 2024 22:04
@schneems schneems marked this pull request as ready for review June 20, 2024 22:07
@schneems schneems force-pushed the schneems/404-means-not-found branch from 5b3b1fa to 430b888 Compare June 20, 2024 22:08
@schneems schneems changed the title Use json ID equality instead of regex Remove regex from 404 duplicate deletion check Jun 20, 2024
Member

@edmorley edmorley left a comment

Agree removing this check makes sense and reduces potential future churn from further API changes :-)

@schneems schneems merged commit 688d43e into main Jun 21, 2024
7 checks passed
@schneems schneems deleted the schneems/404-means-not-found branch June 21, 2024 14:26