Rethink ECR clean-up #97

Status: Closed · wants to merge 88 commits

Conversation

@emilymdubois (Contributor) commented Mar 28, 2017

Refs #68

This PR introduces an alternative way to clean up ECR images, keeping the registry below 1,000 images on a continual basis.

I've added steps to scripts/send-job.sh that:

  • Fetch images for the given repository and sort them by image creation time.
  • Determine which of the oldest images need to be deleted to bring the registry size down to 899 images. I configured it this way since this step precedes the SNS publish step, which upon success would bring the registry size up to 900, or whatever maximum we choose.
  • Delete those images from the registry.
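
For concreteness, here is a minimal sketch of those three steps, assuming the aws CLI. max_images=899 and the region/repo variables are stand-ins, not the actual send-job.sh code:

```bash
max_images=899

# Fetch every image digest, sorted oldest-first by push time (JMESPath sort_by).
digests=($(aws ecr describe-images \
  --region "${region}" \
  --repository-name "${repo}" \
  --query 'sort_by(imageDetails, &imagePushedAt)[].imageDigest' \
  --output text))

# How many of the oldest images must go to get down to 899.
excess=$(( ${#digests[@]} - max_images ))

# Delete the oldest images. batch-delete-image accepts at most 100 ids per
# call, so a real script would chunk; one-at-a-time keeps the sketch short.
for (( i = 0; i < excess; i++ )); do
  aws ecr batch-delete-image \
    --region "${region}" \
    --repository-name "${repo}" \
    --image-ids imageDigest="${digests[i]}"
done
```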

There are a few scripts/cleanup.js features that I haven't implemented in this PR:

  1. Image tags that do not resemble GitShas
  2. Blacklist of images that should not be deleted
  3. Image tags that cannot be retrieved from GitHub

@rclark, would love to chat about the original intent of each of these features so I can evaluate if they're worth persisting.

@rclark (Contributor) commented Mar 28, 2017

Image tags that do not resemble GitShas

This would prevent known images from being deleted, e.g. if you've pushed a Git tag like v1.2.3, you will land an image with that tag in the repository. These images are less likely to represent "cruft" from ongoing dev commits.

Image tags that cannot be retrieved from GitHub

Same as above -- this could represent a named image that you manually placed into the ECR repository. Again, less likely to represent "cruft" and more likely to be something you don't want to lose.

Blacklist of images that should not be deleted

Just a way to be explicit about images that you want to keep. I don't see any clear way to accommodate this in auto-cleanup mode, but I think it is a fair tradeoff, esp. if the previous two features can be maintained.

@emilymdubois (Contributor, Author) commented

@rclark, I've added the following validations:

  • ImageTag resembles a GitSha (a 40-character hexadecimal string).
  • ImageTag exists as a GitSha on GitHub. This involves hitting the API endpoint for the owner, repository, and commit ID, and confirming that a 200 HTTP code is returned.
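
Roughly, the two validations look like this as a shell sketch (curl, and the owner, repo, and GithubAccessToken variables, are illustrative assumptions; cleanup.js does this in node):

```bash
tag="$1"

# Validation 1: a full git SHA is 40 hexadecimal characters.
if ! grep -Eq '^[a-f0-9]{40}$' <<< "${tag}"; then
  echo "skipping ${tag}: does not resemble a GitSha"
  exit 0
fi

# Validation 2: the commit exists on GitHub if the endpoint returns 200.
status=$(curl -s -o /dev/null -w '%{http_code}' \
  -H "Authorization: token ${GithubAccessToken}" \
  "https://api.github.com/repos/${owner}/${repo}/commits/${tag}")
if [ "${status}" != "200" ]; then
  echo "skipping ${tag}: GitHub returned ${status}"
fi
```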

A few thoughts:

  • The GitHub validation is designed to avoid excessive requests to the GitHub API: once enough images have been identified for deletion, no more requests are made. There is a risk of being rate limited if (1) many jobs are sent at once on a single access token, or (2) there is a systemic reason why non-200 status codes are returned for every request (e.g., GitHub is down, or the access token doesn't have appropriate permissions). However, I'm not sure there is a way to validate imageTags as GitShas with fewer requests than this approach makes.
  • I'm not sure if we need more error handling around GitHub API requests. I set this up so that any imageTags with non-200 HTTP codes are skipped over. If all the imageTags have been tested against the GitHub API and there aren't enough images that qualify for deletion, a message prints with information about the GitShas and their status codes. I feel like this is a very unlikely scenario, but I'm flagging it nevertheless in case you have ideas for making this more robust. Maybe the job shouldn't be sent if the registry is at capacity and there aren't enough viable images for deletion?

And a few questions:

  • Do you see any risks with having this run every time an image is created and uploaded to its registry?
  • Anything you see that's a ⚠️ before I descend into a dark, twisted testing hole?

@rclark (Contributor) commented Mar 29, 2017

🤔 The GitHub API requests would be a pretty significant possible point of failure.

hitting the API endpoint for the owner, repository, and commit ID, and confirming that a 200 HTTP code is returned.

Is that what the manual-mode script did? Could we instead use git to avoid removing an image with a tag that is not a SHA or tag in the repo's git history?
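
For illustration, a local clone can answer this without any network calls; a sketch, assuming the clone lives at ${repo_dir}:

```bash
# Exits 0 only if ${tag} resolves to a commit in the local clone's history;
# ^{commit} peels annotated tags to the commit they point at.
if git -C "${repo_dir}" cat-file -e "${tag}^{commit}" 2>/dev/null; then
  echo "${tag} is a known sha/tag; keep its image"
else
  echo "${tag} is unknown; candidate for deletion"
fi
```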

@emilymdubois (Contributor, Author) commented

@rclark the current status of this PR is:

  • ecr_cleanup runs the cleanup.js script using the region and repository as arguments
  • ecr_cleanup is called from docker_push, where the cleanup.js script has access to a single region

With regard to tests:

  • ecr_cleanup is tested here
  • ecr_cleanup is mocked in the docker_push test here
  • scripts/cleanup.js has its own tests here

Mind reviewing when you have a moment? Thanks!

utils.sh (outdated):

```bash
local region=$1
local repo=$2
cleanup_filepath=./scripts/cleanup.js
cleanup_filepath ${region} ${repo}
```
Review comment (Contributor):
Looks like a missing $ in front of cleanup_filepath here.

```bash
test_region="us-east-1"
test_repo="test-repo"

function cleanup_filepath() {
```
Review comment (Contributor):
Yeah since you're actually calling an external file, you can't just mask the bash function in order to mock this call. This test is working now because of the other missing $ I noted, which would actually break at runtime.

Instead you need to do something like:

  • make cleanup.js live somewhere in $PATH when conex is running, perhaps symlinked as cleanup_ecr
  • make the call to this script cleanup_ecr instead of ./path/to/cleanup.js
  • in your test, you can mask that cleanup_ecr command. I'm not 100% sure if a function will work, or if you'll need to use an alias, or put an actual dummy file in $PATH before the real cleanup_ecr. See the sketch below.
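
The dummy-file-in-$PATH option from the last bullet could look something like this sketch (names and paths illustrative):

```bash
# Put a dummy cleanup_ecr ahead of the real one for the duration of the test.
mock_bin="$(mktemp -d)"
cat > "${mock_bin}/cleanup_ecr" <<'EOF'
#!/bin/sh
# Record the call instead of actually cleaning up ECR.
echo "cleanup_ecr called with: $*" >> /tmp/cleanup_ecr.calls
EOF
chmod +x "${mock_bin}/cleanup_ecr"
export PATH="${mock_bin}:${PATH}"
```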

@arunasank (Contributor) commented Jul 4, 2017

👋 @rclark @emilymdubois I spent some time working on this today. Regarding the puzzle of validating whether a sha/tag is valid on GitHub, could one feasible solution be adding an extra tag to each image: tag for images from a tagged version and commit for images from a git commit? This would avoid any calls to the GitHub API.

To start off, we would also have to manually add these additional tags to existing images as a one-time thing, but this fixes the long-term process for cleaning up images without depending on the GitHub API.

AFAICT the GitHub API rate limits per GitHub user (https://developer.github.com/v3/#rate-limiting), and we get 60 requests per hour per Mapbox employee (we all have access to all repositories, but retrieving different people's access tokens will be hard), which would be 😕 considering the number of repositories and images we have.

@rclark (Contributor) commented Jul 5, 2017

the addition of an extra tag to each image called tag and commit for images from a tagged version, and a git commit, respectively? This would avoid any calls to the Github API

If I am following, you could accomplish this by simply prepending "commit" images with commit-${gitsha}. Then the auto cleanup script would never delete an image that has a tag that does not start with commit-. Is that basically what you had in mind?
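
At push time, the prefix idea might look like this (a sketch; ${registry}, ${repo}, and ${gitsha} are placeholders):

```bash
# Tag commit-built images with a commit- prefix so cleanup can tell them
# apart from tag builds and manually pushed images.
docker tag "${repo}:latest" "${registry}/${repo}:commit-${gitsha}"
docker push "${registry}/${repo}:commit-${gitsha}"
```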

@arunasank (Contributor) commented Jul 6, 2017

If I am following, you could accomplish this by simply prepending "commit" images with commit-${gitsha}. Then the auto cleanup script would never delete an image that has a tag that does not start with commit-. Is that basically what you had in mind?

Yep, exactly - looking for a means to avoid hitting the GitHub API while deleting images. I suggested adding a new tag.

@arunasank deployed to staging August 4, 2017 12:18 (Active)
@arunasank had a problem deploying to staging August 4, 2017 13:09 (Failure)
@arunasank deployed to staging August 4, 2017 13:12 (Active)
@arunasank force-pushed the rethink-ecr-prune branch 6 times, most recently from f212d40 to 23c699b on August 8, 2017 11:06
@arunasank (Contributor) commented Aug 9, 2017

An additional constraint that came up when I spoke to @emilymcafee last week: we should be cautious when deleting images that correspond to merge commits, leaving something like 50 merge-commit images around in each repository. The question is really about how to identify these merge commits.

I spent some time looking at GitHub payloads for various kinds of merge commits, since GitHub doesn't have a clear identifier for these commits. What I saw (a sketch of the payload fields follows the list):

Regular merge commit

  • Payload has a list of commits, none of which are distinct except for the actual merge commit, which is distinct. The head_commit also points to the merge commit and is distinct.
  • Couldn't think of any false positives here.

Squashed merge commit

  • Payload has one commit, which is distinct. The head_commit matches the payload commit.
  • False positives: regular single commits to master (happens for documentation edits, etc.)

Rebased merge commit

  • Payload has one commit, which is not distinct.
  • False positives: regular rebases to master (almost never happens)
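
For reference, the fields involved can be inspected with jq over a saved push-event payload; a sketch of the regular-merge check (payload.json is illustrative):

```bash
# payload.json is a saved GitHub push-event payload.
total=$(jq '.commits | length' payload.json)
distinct=$(jq '[.commits[] | select(.distinct)] | length' payload.json)
head_sha=$(jq -r '.head_commit.id' payload.json)
head_distinct=$(jq '.head_commit.distinct' payload.json)

# Regular merge: several commits, and only the head_commit is distinct.
if [ "${total}" -gt 1 ] && [ "${distinct}" -eq 1 ] && [ "${head_distinct}" = "true" ]; then
  echo "${head_sha}: looks like a regular merge commit"
fi
```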

I did write some code to parse the above, and it works on all my test fixtures, but a nicer way to think about this is to save the 50 latest images from the default GitHub branch, as opposed to saving merge-commit images. The two are roughly analogous, since most images on a default branch are built via a merge commit, barring noisy commits like documentation edits. The only problem I can think of with this approach is when people actively develop on the default branch.

cc/ @rclark @emilymdubois @emilymcafee

@arunasank force-pushed the rethink-ecr-prune branch 5 times, most recently from 11742b0 to 0d66127 on August 10, 2017 11:31
@arunasank deployed to staging August 10, 2017 12:11 (Active)
@arunasank deployed to staging August 10, 2017 12:35 (Active)
@arunasank deployed to staging August 10, 2017 12:47 (Active)
@arunasank deployed to staging August 10, 2017 13:17 (Active)
@rclark (Contributor) commented Sep 28, 2017

@arunasank let's hold on new commits here for a minute -- we need to be able to clearly describe the behavior we're going for. So far I'm still pretty confused, and given that the tests aren't passing, I'm not convinced that this is even doing what it intends to do.

@rclark (Contributor) commented Oct 3, 2017

After discussion with @arunasank this morning, I'm running with this description. Please let me know if this is clear.

/**
 * "Generic" images are those that were built by conex for individual git
 * commits. "Priority" images are those built by conex for merge commits or
 * tags. We want there to never be more than 900 images in the repository. If
 * the number of images in the repository is greater than 900, we begin by
 * deleting the oldest "generic" images. If we still need to delete more images
 * to get to 900, we begin deleting old "priority" images, but always leave at
 * least 50 of them in the repository.
 */
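
As a sketch in shell arithmetic (the real logic lives in cleanup.js; generic and priority stand for the current image counts):

```bash
max_images=900
min_priority=50

to_delete=$(( generic + priority - max_images ))
if [ "${to_delete}" -gt 0 ]; then
  # Delete oldest "generic" images first.
  delete_generic=$(( to_delete < generic ? to_delete : generic ))
  remaining=$(( to_delete - delete_generic ))
  # Then old "priority" images, but never dip below 50 of them.
  deletable=$(( priority > min_priority ? priority - min_priority : 0 ))
  delete_priority=$(( remaining < deletable ? remaining : deletable ))
fi
```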

Commits:

  • cleanup refactor, tests pass
  • remove extraneous node.js packages
@rclark (Contributor) commented Oct 6, 2017

Next step is more exhaustive testing on a staging system. Here are @arunasank's suggestions:

For example, I think making sure that all of these cases work would be really useful.

  • MAX_IMAGES = 10, MAX_PRIORITY_IMAGES = 5, say.
| Generic | Priority | Sum | Example |
| --- | --- | --- | --- |
| < Max Images | < Priority Max | < Max Images | GENERIC = 5, PRIORITY = 3 => GENERIC = 5, PRIORITY = 3 |
| < Max Images | < Priority Max | = Max Images | GENERIC = 6, PRIORITY = 4 => GENERIC = 6, PRIORITY = 4 |
| < Max Images | < Priority Max | > Max Images | GENERIC = 7, PRIORITY = 4 => GENERIC = 6, PRIORITY = 4 |
| < Max Images | > Priority Max | < Max Images | GENERIC = 3, PRIORITY = 6 => GENERIC = 3, PRIORITY = 6 |
| < Max Images | > Priority Max | = Max Images | GENERIC = 3, PRIORITY = 7 => GENERIC = 3, PRIORITY = 7 |
| < Max Images | > Priority Max | > Max Images | GENERIC = 3, PRIORITY = 8 => GENERIC = 2, PRIORITY = 8 |

And cases from all of the 27 possibilities. At the very least, I think we should test the cases where the total number of images is less than the max, equal to the max, and greater than the max (which you already tested).

@rclark (Contributor) commented Oct 9, 2017

This last commit sets up a testing framework where you provide the initial number of generic, priority, and custom images, as well as the expected final numbers. The test setup code:

  • creates a git repo in your local tmpdir()
  • makes the appropriate commits to it
  • mocks ECR to represent these images
  • runs cleanup.js, then compares what's left in the mock ECR to what you said is expected.
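
Roughly, the setup looks like this sketch (paths and counts illustrative; the real harness lives in the test code):

```bash
# Build a throwaway git history that the cleanup logic can classify.
repo_dir="$(mktemp -d)"
git -C "${repo_dir}" init -q
git -C "${repo_dir}" config user.name test
git -C "${repo_dir}" config user.email test@example.com
for i in 1 2 3; do
  git -C "${repo_dir}" commit -q --allow-empty -m "generic commit ${i}"
done
# ...mock ECR with images tagged by these shas, run cleanup.js, then diff
# the surviving images against the expected generic/priority/custom counts.
```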

This test code is long and convoluted, but it seems like the right way to create and persist the test rubric that @arunasank suggested we try out in staging.

Some of the tests pass, and others fail. What this means is that I still cannot easily answer the question "given ECR initial state X, what will be the cleanup outcome Y?". I'm not comfortable rolling this out until it's easy to correctly answer that question.

@rclark (Contributor) commented Oct 10, 2017

Worked through my problems with the complex test suite and uncovered a couple of issues in our logic:

  • we weren't adequately managing images in ECR that have multiple tags. It appears that a tagged-commit build through ecs-conex results in one image, with one digest and two tags. I adjusted the logic to look at all of an image's tags to determine whether it is generic, priority, or custom.

  • our detection of "merge commits" only works if the merge is performed with the --no-ff (no fast-forward) flag on git merge. According to GitHub documentation:

    • --no-ff is used when you use the "Create a merge commit" big green button on a PR
    • --no-ff is not used when you squash and merge or rebase and merge.

    What this means is that our attempt to detect merge commits will miss PRs that are squash- or rebase-merged. Those commits will not be identified as priority images and are therefore susceptible to being removed prematurely. I'll need to do some more digging to figure out if there's any other way to differentiate a squashed/rebased merge from a "generic" image (see the sketch after this list).

  • behavior in repositories with large numbers of custom images is not exactly what you might expect. As an example:

    • start with 20 generic images, 70 tagged, 850 custom
    • finish with 0 generic, 50 tagged, 850 custom
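
On the merge-commit point above: the --no-ff distinction is visible in the commit graph itself, since a true merge commit has more than one parent, while squash and rebase merges produce single-parent commits. A sketch, assuming a local clone at ${repo_dir}:

```bash
# `git rev-list --parents -n 1` prints the commit followed by one field per
# parent, so more than two fields means a true merge commit.
fields=$(git -C "${repo_dir}" rev-list --parents -n 1 "${sha}" | wc -w)
if [ "${fields}" -gt 2 ]; then
  echo "${sha}: merge commit (would be priority)"
else
  echo "${sha}: single-parent commit (generic, even if it was a squash merge)"
fi
```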

Generally speaking, I think we should consider abandoning this automated approach. We can ask support for limit increases on repositories that are filling up. We can also build better tooling for manual-mode selection of images to remove from a repository.

Automation feels complex and error prone, and the risk of removing an image that is in use is pretty serious.

@emilymdubois (Contributor, Author) commented

@rclark @arunasank are we good to close this pull request in favor of rate limit increases and tools for manually deleting images?

@rclark (Contributor) commented Mar 7, 2018

I think I am 👍 on closing here, please close if you agree @arunasank.

@arunasank closed this Mar 7, 2018
@arunasank deleted the rethink-ecr-prune branch March 7, 2018 03:28