
Remove regex from 404 duplicate deletion check #210

Merged
merged 1 commit into from
Jun 21, 2024

Conversation

schneems
Contributor

Context

Hatchet creates apps on demand (up to a limit); when that limit is hit, it attempts to delete old apps.

One of the bottlenecks that Hatchet faces is API limits. Hatchet relies on rate throttling https://www.schneems.com/2020/07/08/a-fast-car-needs-good-brakes-how-we-added-client-rate-throttling-to-the-platform-api-gem/ to slow down and spread out API requests, but there's a point where too many requests and processes sleeping for too long can result in failures.

Hatchet is also distributed as multiple test runs on multiple machines. So we have to assume that if our process is trying to delete apps, other processes might be doing the same, which creates race conditions. If two (or more) servers try to delete the same resource, then one will get a 404 response.

The reaper is responsible for keeping a list of apps (`private def refresh_app_list`, https://github.com/heroku/hatchet/blob/8c80522eddcefaee79396353673a81a018b66af4/lib/hatchet/reaper.rb#L176) and also handling what to do when a "conflict" is found (`private def handle_conflict(conflict_message:, strategy:)`, https://github.com/heroku/hatchet/blob/8c80522eddcefaee79396353673a81a018b66af4/lib/hatchet/reaper.rb#L133).

The default strategy is `:refresh_api_and_continue`. It will sleep for a period of time, which allows any other machines that are also cleaning up to continue deleting apps. When it wakes up, it queries the API to see how many apps are left and either stops (if enough have already been deleted) or continues trying to remove apps.
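The strategy described above can be sketched roughly like this (a minimal sketch with hypothetical names, not the actual reaper implementation):

```ruby
# Minimal sketch of the :refresh_api_and_continue strategy: sleep so
# other machines can finish their own cleanup, then re-query the API
# and decide whether to keep reaping. All names here are hypothetical.
class ReaperSketch
  def initialize(limit:, sleep_for: 0, list_apps_api:)
    @limit = limit
    @sleep_for = sleep_for
    @list_apps_api = list_apps_api # stand-in for a "list apps" API call
  end

  # Called after a delete hits a conflict (e.g. another test runner
  # already deleted the app we were trying to remove).
  def refresh_api_and_continue
    sleep(@sleep_for)            # give other machines time to clean up
    app_count = @list_apps_api.call
    if app_count < @limit
      :stop      # enough apps were deleted (by us or by other runners)
    else
      :continue  # still over the limit, keep trying to remove apps
    end
  end
end
```

The sleep-before-refresh ordering matters: it trades wall-clock time for fewer "list apps" requests, which is the scarce resource here.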

Effectively it's trying to limit API requests, both for duplicate delete requests and for listing apps requests.

404

Previously I was trying to disambiguate between "404: the URL has a typo in it" (or some other issue, like a DNS routing problem) and "404: you clearly reached a valid API endpoint, but the resource couldn't be found". To guard against that, I added a regex check against the message in the response body.

Problem

The API is in the process of updating its error messages, so this check will no longer work.

Internal link https://salesforce-internal.slack.com/archives/C1RS6AUDR/p1718894438197959.

Change

Remove the regex. Instead, when we receive a 404 response, consider the app deleted. This will trigger the "sleep and refresh" logic.
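A minimal sketch of the change (hypothetical names; `response` stands in for an HTTP response object):

```ruby
# Hypothetical sketch: any 404 on delete is treated as "app already
# gone" with no regex inspection of the response body, so the caller
# can route it to the sleep-and-refresh conflict handling.
AlreadyDeleted = Class.new(StandardError)

def delete_app(response)
  case response.fetch(:status)
  when 200, 202
    :deleted
  when 404
    # Previously a regex on the body message tried to tell "resource
    # not found" apart from "bad URL"; now any 404 means another
    # process beat us to the delete.
    raise AlreadyDeleted, "404: treating app as already deleted"
  else
    raise "Unexpected status #{response.fetch(:status)}"
  end
end
```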

Consequences: If the service is unreachable for some reason (such as a DNS issue, or if the URL changes), we will also get a 404 response and will keep trying to update apps in a loop. With our request throttling, each attempt would progressively take longer and longer, and eventually the tests would time out and fail.

We could check the "id" field of the JSON body, which should be `"not_found"` in this case, but it's also the same value for a bad URL:

```
$ curl https://api.heroku.com/schemaz -H "Accept: application/vnd.heroku+json; version=3"
{
  "id": "not_found",
  "message": "The requested API endpoint was not found. Are you using the right HTTP verb (i.e. `GET` vs. `POST`), and did you specify your intended version with the `Accept` header?"
}
```

So that mechanism would help with DNS issues, but not with other 404 problems. It also introduces the possibility of a JSON parsing failure and could break in the future if this ID is revised.
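For illustration, the rejected alternative would have looked something like this (a hypothetical sketch, not code from this PR):

```ruby
require "json"

# Sketch of the rejected alternative: check the "id" field of the 404
# body. As the curl example above shows, both a deleted app and a bad
# URL return "not_found", so this cannot disambiguate the two cases,
# and a non-JSON body would raise unless rescued.
def not_found_id?(body)
  JSON.parse(body)["id"] == "not_found"
rescue JSON::ParserError
  false # e.g. an HTML error page from a proxy, not the API itself
end
```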

I think retrying is safe (enough) here.

@schneems schneems requested a review from a team as a code owner June 20, 2024 21:57
@schneems schneems marked this pull request as draft June 20, 2024 21:57
@schneems schneems force-pushed the schneems/404-means-not-found branch from 6e39548 to 5b3b1fa Compare June 20, 2024 22:04
@schneems schneems marked this pull request as ready for review June 20, 2024 22:07
@schneems schneems force-pushed the schneems/404-means-not-found branch from 5b3b1fa to 430b888 Compare June 20, 2024 22:08
@schneems schneems changed the title Use json ID equality instead of regex Remove regex from 404 duplicate deletion check Jun 20, 2024
Member

@edmorley edmorley left a comment

Agree removing this check makes sense and reduces potential future churn from further API changes :-)

@schneems schneems merged commit 688d43e into main Jun 21, 2024
7 checks passed
@schneems schneems deleted the schneems/404-means-not-found branch June 21, 2024 14:26