Rethink ECR clean-up #97

Status: Closed · wants to merge 88 commits

Conversation

@emilymdubois (Contributor) commented Mar 28, 2017

Refs #68

This PR introduces an alternative way to clean up ECR images, keeping the registry below 1,000 images on a continual basis.

I've added steps to scripts/send-job.sh that:

  • Fetch images for the given repository and sort them by image creation time.
  • Determine which of the oldest images need to be deleted to bring the registry size down to 899 images. I configured it this way since this step precedes the SNS publish step, which upon success would bring the registry size up to 900, or whatever maximum we choose.
  • Delete those images from the registry.
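
For concreteness, here is a minimal sketch of those three steps, assuming the aws CLI. max_images=899 and the region/repo variables are stand-ins, not the actual send-job.sh code:

```bash
max_images=899

# Fetch every image digest, sorted oldest-first by push time (JMESPath sort_by).
digests=($(aws ecr describe-images \
  --region "${region}" \
  --repository-name "${repo}" \
  --query 'sort_by(imageDetails, &imagePushedAt)[].imageDigest' \
  --output text))

# How many of the oldest images must go to get down to 899.
excess=$(( ${#digests[@]} - max_images ))

# Delete the oldest images. batch-delete-image accepts at most 100 ids per
# call, so a real script would chunk; one-at-a-time keeps the sketch short.
for (( i = 0; i < excess; i++ )); do
  aws ecr batch-delete-image \
    --region "${region}" \
    --repository-name "${repo}" \
    --image-ids imageDigest="${digests[i]}"
done
```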

There are a few scripts/cleanup.js features that I haven't implemented in this PR:

  1. Image tags that do not resemble GitShas
  2. Blacklist of images that should not be deleted
  3. Image tags that cannot be retrieved from GitHub

@rclark, would love to chat about the original intent of each of these features so I can evaluate if they're worth persisting.

@rclark (Contributor) commented Mar 28, 2017

Image tags that do not resemble GitShas

This would prevent known images from being deleted, e.g. if you've pushed a Git tag like v1.2.3, you will land an image with that tag in the repository. These images are less likely to represent "cruft" from ongoing dev commits.

Image tags that cannot be retrieved from GitHub

Same as above -- this could represent a named image that you manually placed into the ECR repository. Again, less likely to represent "cruft" and more likely to be something you don't want to lose.

Blacklist of images that should not be deleted

Just a way to be explicit about images that you want to keep. I don't see any clear way to accommodate this in auto-cleanup mode, but I think it is a fair tradeoff, esp. if the previous two features can be maintained.

@emilymdubois (Contributor, Author) commented

@rclark, I've added the following validations:

  • ImageTag resembles a GitSha (a 40-character hexadecimal string).
  • ImageTag exists as a GitSha on GitHub. This involves hitting the API endpoint for the owner, repository, and commit ID, and confirming that a 200 HTTP code is returned.
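
Roughly, the two validations look like this as a shell sketch (curl, and the owner, repo, and GithubAccessToken variables, are illustrative assumptions; cleanup.js does this in node):

```bash
tag="$1"

# Validation 1: a full git SHA is 40 hexadecimal characters.
if ! grep -Eq '^[a-f0-9]{40}$' <<< "${tag}"; then
  echo "skipping ${tag}: does not resemble a GitSha"
  exit 0
fi

# Validation 2: the commit exists on GitHub if the endpoint returns 200.
status=$(curl -s -o /dev/null -w '%{http_code}' \
  -H "Authorization: token ${GithubAccessToken}" \
  "https://api.github.com/repos/${owner}/${repo}/commits/${tag}")
if [ "${status}" != "200" ]; then
  echo "skipping ${tag}: GitHub returned ${status}"
fi
```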

A few thoughts:

  • The GitHub validation is designed to avoid excessive requests to the GitHub API: once enough images have been identified for deletion, no more requests are made. There is a risk of being rate limited if (1) many jobs are sent at once on a single access token, or (2) there is a systemic reason why non-200 status codes are returned for every request (e.g., GitHub is down, or the access token doesn't have appropriate permissions). However, I'm not sure there is a way to validate imageTags as GitShas with fewer requests than this approach makes.
  • I'm not sure if we need more error handling around GitHub API requests. I set this up so that any imageTags with non-200 HTTP codes are skipped over. If all the imageTags have been tested against the GitHub API and there aren't enough images that qualify for deletion, a message prints with information about the GitShas and their status codes. I feel like this is a very unlikely scenario, but I'm flagging it nevertheless in case you have ideas for making this more robust. Maybe the job shouldn't be sent if the registry is at capacity and there aren't enough viable images for deletion?

And a few questions:

  • Do you see any risks with having this run every time an image is created and uploaded to its registry?
  • Anything you see that's a ⚠️ before I descend into a dark, twisted testing hole?

@rclark (Contributor) commented Mar 29, 2017

🤔 The GitHub API requests would be a pretty significant possible point of failure.

hitting the API endpoint for the owner, repository, and commit ID, and confirming that a 200 HTTP code is returned.

Is that what the manual-mode script did? Could we instead use git to avoid removing an image with a tag that is not a SHA or tag in the repo's git history?
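
For illustration, a local clone can answer this without any network calls; a sketch, assuming the clone lives at ${repo_dir}:

```bash
# Exits 0 only if ${tag} resolves to a commit in the local clone's history;
# ^{commit} peels annotated tags to the commit they point at.
if git -C "${repo_dir}" cat-file -e "${tag}^{commit}" 2>/dev/null; then
  echo "${tag} is a known sha/tag; keep its image"
else
  echo "${tag} is unknown; candidate for deletion"
fi
```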

@emilymdubois (Contributor, Author) commented

@rclark the current status of this PR is:

  • ecr_cleanup runs the cleanup.js script using the region and repository as arguments
  • ecr_cleanup is called from docker_push, where the cleanup.js script has access to a single region

With regard to tests:

  • ecr_cleanup is tested here
  • ecr_cleanup is mocked in the docker_push test here
  • scripts/cleanup.js has its own tests here

Mind reviewing when you have a moment? Thanks!

utils.sh (outdated):

```bash
local region=$1
local repo=$2
cleanup_filepath=./scripts/cleanup.js
cleanup_filepath ${region} ${repo}
```
Review comment (Contributor):
Looks like a missing $ in front of cleanup_filepath here.

```bash
test_region="us-east-1"
test_repo="test-repo"

function cleanup_filepath() {
```
Review comment (Contributor):
Yeah since you're actually calling an external file, you can't just mask the bash function in order to mock this call. This test is working now because of the other missing $ I noted, which would actually break at runtime.

Instead you need to do something like:

  • make cleanup.js live somewhere in $PATH when conex is running, perhaps symlinked as cleanup_ecr
  • make the call to this script cleanup_ecr instead of ./path/to/cleanup.js
  • in your test, you can mask that cleanup_ecr command. I'm not 100% sure if a function will work, or if you'll need to use an alias, or put an actual dummy file in $PATH before the real cleanup_ecr. See the sketch below.
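
The dummy-file-in-$PATH option from the last bullet could look something like this sketch (names and paths illustrative):

```bash
# Put a dummy cleanup_ecr ahead of the real one for the duration of the test.
mock_bin="$(mktemp -d)"
cat > "${mock_bin}/cleanup_ecr" <<'EOF'
#!/bin/sh
# Record the call instead of actually cleaning up ECR.
echo "cleanup_ecr called with: $*" >> /tmp/cleanup_ecr.calls
EOF
chmod +x "${mock_bin}/cleanup_ecr"
export PATH="${mock_bin}:${PATH}"
```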

@arunasank (Contributor) commented Jul 4, 2017

👋 @rclark @emilymdubois I spent some time working on this today. Regarding the puzzle of validating whether a sha/tag is valid on GitHub, could one feasible solution be adding an extra tag to each image: tag for images from a tagged version and commit for images from a git commit? This would avoid any calls to the GitHub API.

To start off, we would also have to manually add these additional tags to existing images as a one-time thing, but this fixes the long-term process for cleaning up images without depending on the GitHub API.

AFAICT the GitHub API rate limits per GitHub user (https://developer.github.com/v3/#rate-limiting), and we get 60 requests per hour per Mapbox employee (we all have access to all repositories, but retrieving different people's access tokens will be hard), which would be 😕 considering the number of repositories and images we have.

@rclark (Contributor) commented Jul 5, 2017

the addition of an extra tag to each image called tag and commit for images from a tagged version, and a git commit, respectively? This would avoid any calls to the Github API

If I am following, you could accomplish this by simply prepending "commit" images with commit-${gitsha}. Then the auto cleanup script would never delete an image that has a tag that does not start with commit-. Is that basically what you had in mind?
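
At push time, the prefix idea might look like this (a sketch; ${registry}, ${repo}, and ${gitsha} are placeholders):

```bash
# Tag commit-built images with a commit- prefix so cleanup can tell them
# apart from tag builds and manually pushed images.
docker tag "${repo}:latest" "${registry}/${repo}:commit-${gitsha}"
docker push "${registry}/${repo}:commit-${gitsha}"
```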

@arunasank (Contributor) commented Jul 6, 2017

If I am following, you could accomplish this by simply prepending "commit" images with commit-${gitsha}. Then the auto cleanup script would never delete an image that has a tag that does not start with commit-. Is that basically what you had in mind?

Yep, exactly - looking for a means to avoid hitting the GitHub API while deleting images. I suggested adding a new tag.

@arunasank deployed to staging August 4, 2017 12:18 (Active)
@arunasank had a problem deploying to staging August 4, 2017 13:09 (Failure)
@arunasank deployed to staging August 4, 2017 13:12 (Active)
@arunasank force-pushed the rethink-ecr-prune branch 6 times, most recently from f212d40 to 23c699b on August 8, 2017 11:06
@arunasank (Contributor) commented Aug 9, 2017

An additional constraint that came up when I spoke to @emilymcafee last week: we should be cautious when deleting images that correspond to merge commits, leaving something like 50 merge-commit images around in each repository. The question is really about how to identify these merge commits.

I spent some time looking at GitHub payloads for various kinds of merge commits, since GitHub doesn't have a clear identifier for these commits. What I saw (a sketch of the payload fields follows the list):

Regular merge commit

  • Payload has a list of commits, none of which are distinct except for the actual merge commit, which is distinct. The head_commit also points to the merge commit and is distinct.
  • Couldn't think of any false positives here.

Squashed merge commit

  • Payload has one commit, which is distinct. The head_commit matches the payload commit.
  • False positives: regular single commits to master (happens for documentation edits, etc.)

Rebased merge commit

  • Payload has one commit, which is not distinct.
  • False positives: regular rebases to master (almost never happens)
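
For reference, the fields involved can be inspected with jq over a saved push-event payload; a sketch of the regular-merge check (payload.json is illustrative):

```bash
# payload.json is a saved GitHub push-event payload.
total=$(jq '.commits | length' payload.json)
distinct=$(jq '[.commits[] | select(.distinct)] | length' payload.json)
head_sha=$(jq -r '.head_commit.id' payload.json)
head_distinct=$(jq '.head_commit.distinct' payload.json)

# Regular merge: several commits, and only the head_commit is distinct.
if [ "${total}" -gt 1 ] && [ "${distinct}" -eq 1 ] && [ "${head_distinct}" = "true" ]; then
  echo "${head_sha}: looks like a regular merge commit"
fi
```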

I did write some code to parse the above, and it works on all my test fixtures, but a nicer way to think about this is to save the 50 latest images from the default GitHub branch, as opposed to saving merge-commit images. The two are roughly analogous, since most images on a default branch are built via a merge commit, barring noisy commits like documentation edits. The only problem I can think of with this approach is when people actively develop on the default branch.

cc/ @rclark @emilymdubois @emilymcafee

@arunasank force-pushed the rethink-ecr-prune branch 5 times, most recently from 11742b0 to 0d66127 on August 10, 2017 11:31
@arunasank deployed to staging August 10, 2017 12:11 (Active)
@arunasank deployed to staging August 10, 2017 12:35 (Active)
@arunasank deployed to staging August 10, 2017 12:47 (Active)
@arunasank deployed to staging August 10, 2017 13:17 (Active)
@rclark (Contributor) commented Sep 28, 2017

@arunasank let's hold on new commits here for a minute -- we need to be able to clearly describe the behavior we're going for. So far I'm still pretty confused, and given that the tests aren't passing, I'm not convinced that this is even doing what it intends to do.

@rclark (Contributor) commented Oct 3, 2017

After discussion with @arunasank this morning, I'm running with this description. Please let me know if this is clear.

/**
 * "Generic" images are those that were built by conex for individual git
 * commits. "Priority" images are those built by conex for merge commits or
 * tags. We want there to never be more than 900 images in the repository. If
 * the number of images in the repository is greater than 900, we begin by
 * deleting the oldest "generic" images. If we still need to delete more images
 * to get to 900, we begin deleting old "priority" images, but always leave at
 * least 50 of them in the repository.
 */
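
As a sketch in shell arithmetic (the real logic lives in cleanup.js; generic and priority stand for the current image counts):

```bash
max_images=900
min_priority=50

to_delete=$(( generic + priority - max_images ))
if [ "${to_delete}" -gt 0 ]; then
  # Delete oldest "generic" images first.
  delete_generic=$(( to_delete < generic ? to_delete : generic ))
  remaining=$(( to_delete - delete_generic ))
  # Then old "priority" images, but never dip below 50 of them.
  deletable=$(( priority > min_priority ? priority - min_priority : 0 ))
  delete_priority=$(( remaining < deletable ? remaining : deletable ))
fi
```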

Commits:

  • cleanup refactor, tests pass
  • remove extraneous node.js packages
@rclark (Contributor) commented Oct 6, 2017

Next step is more exhaustive testing on a staging system. Here are @arunasank's suggestions:

For example, I think making sure that all of these cases work would be really useful.

  • MAX_IMAGES = 10, MAX_PRIORITY_IMAGES = 5, say.
| Generic | Priority | Sum | Example |
| --- | --- | --- | --- |
| < Max Images | < Priority Max | < Max Images | GENERIC = 5, PRIORITY = 3 => GENERIC = 5, PRIORITY = 3 |
| < Max Images | < Priority Max | = Max Images | GENERIC = 6, PRIORITY = 4 => GENERIC = 6, PRIORITY = 4 |
| < Max Images | < Priority Max | > Max Images | GENERIC = 7, PRIORITY = 4 => GENERIC = 6, PRIORITY = 4 |
| < Max Images | > Priority Max | < Max Images | GENERIC = 3, PRIORITY = 6 => GENERIC = 3, PRIORITY = 6 |
| < Max Images | > Priority Max | = Max Images | GENERIC = 3, PRIORITY = 7 => GENERIC = 3, PRIORITY = 7 |
| < Max Images | > Priority Max | > Max Images | GENERIC = 3, PRIORITY = 8 => GENERIC = 2, PRIORITY = 8 |

And cases from all of the 27 possibilities. At the very least, I think we should test the cases where the total number of images is less than the max, equal to the max, and greater than the max (which you already tested).

@rclark (Contributor) commented Oct 9, 2017

This last commit sets up a testing framework where you provide the initial number of generic, priority, and custom images, as well as the expected final numbers. The test setup code:

  • creates a git repo in your local tmpdir()
  • makes the appropriate commits to it
  • mocks ECR to represent these images
  • runs cleanup.js, then compares what's left in the mock ECR to what you said is expected.
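
Roughly, the setup looks like this sketch (paths and counts illustrative; the real harness lives in the test code):

```bash
# Build a throwaway git history that the cleanup logic can classify.
repo_dir="$(mktemp -d)"
git -C "${repo_dir}" init -q
git -C "${repo_dir}" config user.name test
git -C "${repo_dir}" config user.email test@example.com
for i in 1 2 3; do
  git -C "${repo_dir}" commit -q --allow-empty -m "generic commit ${i}"
done
# ...mock ECR with images tagged by these shas, run cleanup.js, then diff
# the surviving images against the expected generic/priority/custom counts.
```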

This test code is long and convoluted, but it seems like the right way to create and persist the test rubric that @arunasank suggested we try out in staging.

Some of the tests pass, and others fail. What this means is that I still cannot easily answer the question "given ECR initial state X, what will be the cleanup outcome Y?". I'm not comfortable rolling this out until it's easy to correctly answer that question.

@rclark (Contributor) commented Oct 10, 2017

Worked through my problems with the complex test suite and uncovered a couple of issues in our logic:

  • we weren't adequately managing images in ECR that have multiple tags. It appears that a tagged-commit build through ecs-conex results in one image, with one digest and two tags. I adjusted the logic to look at all of an image's tags to determine whether it is generic, priority, or custom.

  • our detection of "merge commits" only works if the merge is performed with the --no-ff (no fast-forward) flag on git merge. According to GitHub documentation:

    • --no-ff is used when you use the "Create a merge commit" big green button on a PR
    • --no-ff is not used when you squash and merge or rebase and merge.

    What this means is that our attempt to detect merge commits will miss PRs that are squash- or rebase-merged. Those commits will not be identified as priority images and are therefore susceptible to being removed prematurely. I'll need to do some more digging to figure out if there's any other way to differentiate a squashed/rebased merge from a "generic" image (see the sketch after this list).

  • behavior in repositories with large numbers of custom images is not exactly what you might expect. As an example:

    • start with 20 generic images, 70 tagged, 850 custom
    • finish with 0 generic, 50 tagged, 850 custom
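
On the merge-commit point above: the --no-ff distinction is visible in the commit graph itself, since a true merge commit has more than one parent, while squash and rebase merges produce single-parent commits. A sketch, assuming a local clone at ${repo_dir}:

```bash
# `git rev-list --parents -n 1` prints the commit followed by one field per
# parent, so more than two fields means a true merge commit.
fields=$(git -C "${repo_dir}" rev-list --parents -n 1 "${sha}" | wc -w)
if [ "${fields}" -gt 2 ]; then
  echo "${sha}: merge commit (would be priority)"
else
  echo "${sha}: single-parent commit (generic, even if it was a squash merge)"
fi
```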

Generally speaking, I think we should consider abandoning this automated approach. We can ask support for limit increases on repositories that are filling up. We can also build better tooling for manual-mode selection of images to remove from a repository.

Automation feels complex and error prone, and the risk of removing an image that is in use is pretty serious.

@emilymdubois (Contributor, Author) commented

@rclark @arunasank are we good to close this pull request in favor of rate limit increases and tools for manually deleting images?

@rclark (Contributor) commented Mar 7, 2018

I think I am 👍 on closing here, please close if you agree @arunasank.

@arunasank closed this Mar 7, 2018
@arunasank deleted the rethink-ecr-prune branch March 7, 2018 03:28