This repository has been archived by the owner on May 5, 2022. It is now read-only.

Clean out old data/runs? #750

Open
iandees opened this issue Oct 9, 2019 · 14 comments

@iandees
Member

iandees commented Oct 9, 2019

In the interest of saving money and not having an infinitely growing S3 bucket, I'd like to discuss the idea of deleting runs that are old and no longer used.

Here are the rules I'm thinking of:

  • We should always keep the latest successful run for any particular source.
  • Otherwise, delete data older than 18 months.

Another configuration I could see would be a "backoff" where we keep X number of frequent runs, Y number of monthly runs, and Z number of yearly runs.

I'm curious what others think. Are there very compelling use cases for extremely old data that I'm not thinking of? Would this be really hard to implement given our current data archive model?
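
As a rough illustration of the rule above, here is a minimal sketch of selecting which runs to delete. It is not tied to the actual machine code or schema; the Run record and its fields are hypothetical.

from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Run:
    source: str         # e.g. a source path like "xx/yy/somewhere"
    finished: datetime  # when the run completed
    success: bool       # whether it produced usable output

def runs_to_delete(runs, max_age=timedelta(days=18 * 30)):
    """Runs older than ~18 months, excluding the latest successful run per source."""
    cutoff = datetime.utcnow() - max_age
    latest_success = {}
    for run in runs:
        if run.success:
            prev = latest_success.get(run.source)
            if prev is None or run.finished > prev.finished:
                latest_success[run.source] = run
    keep = {(r.source, r.finished) for r in latest_success.values()}
    return [r for r in runs
            if r.finished < cutoff and (r.source, r.finished) not in keep]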

@nvkelso
Member

nvkelso commented Oct 9, 2019

Caveat that I'm a packrat, but I like this option since it preserves stats about how the project has grown:

Another configuration I could see would be a "backoff" where we keep X number of frequent runs, Y number of monthly runs, and Z number of yearly runs.

  • Daily for a week
  • Weekly for a month
  • Monthly for a year
  • Yearly for each year
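
A sketch of how that tiered schedule could be applied to a list of run timestamps. This is purely illustrative; the calendar bucketing (newest run per day, ISO week, month, or year) and the age thresholds are assumptions.

from datetime import datetime, timedelta

def thin_runs(timestamps, now=None):
    """Keep daily runs for a week, weekly for a month, monthly for a year,
    and yearly beyond that; return the timestamps to keep."""
    now = now or datetime.utcnow()
    keep = {}
    for ts in sorted(timestamps, reverse=True):  # newest first wins each bucket
        age = now - ts
        if age <= timedelta(days=7):
            bucket = ("day", ts.date())
        elif age <= timedelta(days=31):
            bucket = ("week", ts.isocalendar()[:2])   # (ISO year, week number)
        elif age <= timedelta(days=365):
            bucket = ("month", (ts.year, ts.month))
        else:
            bucket = ("year", ts.year)
        keep.setdefault(bucket, ts)                   # first seen = newest in bucket
    return sorted(keep.values())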

@iandees
Member Author

iandees commented Oct 9, 2019

Another option is that we could de-dupe data. This would be more technically difficult to transition to, but we could save a lot of data in S3 by hashing the output ZIPs and adjusting the database to point to the first copy of the data that is exactly the same, deleting all duplicates.

For example, the most recent 5 runs of us/au/countrywide all have the same S3 ETag, meaning the contents hash to the same value according to S3. De-duping those rows would save 1.7 GB.
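
A minimal sketch of that kind of check, grouping a set of run keys by their S3 ETag with boto3. The bucket and key names are placeholders, and note that an ETag is only a straightforward content hash for objects uploaded in a single part.

import boto3
from collections import defaultdict

s3 = boto3.client("s3")
BUCKET = "example-results-bucket"   # placeholder name

def find_duplicate_runs(keys):
    """Group S3 keys by ETag; any group with more than one key is a
    candidate for de-duplication."""
    by_etag = defaultdict(list)
    for key in keys:
        head = s3.head_object(Bucket=BUCKET, Key=key)
        by_etag[head["ETag"]].append(key)
    return {etag: ks for etag, ks in by_etag.items() if len(ks) > 1}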

@missinglink

S3 has several storage classes. We have our buckets set up to move older data to progressively cheaper storage as it ages, after which it gets deleted. Here's the Terraform config for achieving that:

resource "aws_s3_bucket" "unique-name" {
  bucket = "bucket-name"
  acl    = "private"

  versioning {
    enabled = false
  }

  lifecycle_rule {
    enabled = true

    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }

    transition {
      days          = 60
      storage_class = "GLACIER"
    }

    expiration {
      days = 240
    }
  }
}

@missinglink

I'd also be 👍 for moving the data to cold storage rather than deleting forever.

@iandees
Member Author

iandees commented Oct 9, 2019

Yep, we already have 50+ TB in Standard IA as part of a transition at 30 days. Part of the reason I want to delete old data is that people/spiders (ignoring our robots.txt) go back and download that really old data, which adds to our bandwidth bill.

@migurski
Member

migurski commented Oct 9, 2019

Can a spider access data in Glacier, or does it need to be defrosted by its owner in order to be available? I am also a packrat and I'd hate to lose information, so moving older files to Glacier where they might be publicly inaccessible does seem like a good option.

@migurski
Member

migurski commented Oct 9, 2019

Re: de-duping the data, we could switch to a content-addressable URL scheme that makes this happen automatically moving forward. I believe we already store the hash of the zip contents, so there’s no need to recalculate this.
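
A sketch of what a content-addressed layout could look like, assuming the zip's hash is already known. The key prefix and helper are hypothetical: the zip is stored once under a key derived from its hash, and each run record just points at that key.

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "example-results-bucket"   # placeholder name

def store_content_addressed(local_path, sha256_hex):
    """Upload a zip under a key derived from its content hash, skipping
    the upload if an identical object is already there."""
    key = f"cache/by-hash/{sha256_hex}.zip"   # hypothetical key layout
    try:
        s3.head_object(Bucket=BUCKET, Key=key)
        return key                            # already stored; just reference it
    except ClientError as e:
        if e.response["Error"]["Code"] != "404":
            raise
    s3.upload_file(local_path, BUCKET, key)
    return key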

@missinglink

I think you'll need a new private bucket; once the files are moved there, no one is getting at them except whoever you allow :)

@missinglink

Actually, we (geocode.earth) would be happy to hold the historic data in a bucket we control (and pay for). We could then allow anyone to take copies (so long as they do it from the same AWS datacentre).

This would actually suit us too because we'd like a copy of everything anyway.

@missinglink

missinglink commented Oct 9, 2019

How much data are we talking about exactly? (all of it)

@iandees
Member Author

iandees commented Oct 9, 2019

How much data are we talking about exactly?

Huh, I was thinking about bandwidth when I said 50TB before. We've got 6.2TB in Standard-IA, 183GB in ReducedRedundancy (from back when it was cheaper than Standard), and 186GB in Standard.

Now that we no longer have any unauthenticated requests going to the bucket, we can probably turn on Requester Pays and let anyone grab whatever they want from it. I'll file a separate ticket for that.
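
For reference, a Requester Pays setup only changes two things: the bucket owner enables it once, and callers must opt in (and pay for transfer) on each request. A minimal boto3 sketch, with placeholder bucket and key names:

import boto3

s3 = boto3.client("s3")

# Enable Requester Pays on the bucket (done once by the bucket owner).
s3.put_bucket_request_payment(
    Bucket="example-results-bucket",
    RequestPaymentConfiguration={"Payer": "Requester"},
)

# Downloaders must opt in explicitly and are billed for the transfer themselves.
s3.download_file(
    "example-results-bucket", "runs/12345/example.zip", "example.zip",
    ExtraArgs={"RequestPayer": "requester"},
)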

@iandees
Member Author

iandees commented Oct 9, 2019

It sounds like there isn't an appetite for deleting old data. That's ok – storage itself isn't really all that expensive.

Implementing a content-addressable system would be great and would help reduce waste. I have an S3 inventory running right now that will tell us just how much data is duplicated. If deduping would save a huge amount of space I'll probably try to implement that sooner than later.

Can a spider access data in glacier, or does it need to be defrosted by its owner in order to be available?

No, I don't think stuff in Glacier is accessible with a standard GetObject from S3 (which is what our CDN is doing). It has to be pulled out of the archive first and then accessed. I'm not inclined to move our data to Glacier since it's so expensive to retrieve. Why keep the data around if we're not planning on using it? :-)
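
For context, reading an object that has transitioned to Glacier is a two-step process: request a temporary restore, wait for it to finish, and then GetObject works for the duration of the restore window. A boto3 sketch with placeholder names:

import boto3

s3 = boto3.client("s3")

# Step 1: ask S3 to thaw the archived object for a few days.
s3.restore_object(
    Bucket="example-results-bucket",
    Key="runs/12345/example.zip",
    RestoreRequest={"Days": 7, "GlacierJobParameters": {"Tier": "Bulk"}},
)

# Step 2: poll head_object until the restore finishes; the "Restore" field
# reports whether the thaw is still in progress. After that, GetObject works
# as usual until the restore window expires.
status = s3.head_object(
    Bucket="example-results-bucket",
    Key="runs/12345/example.zip",
).get("Restore")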

Another option is to remove links to runs/sets older than 30 days. Maybe you have to log in with GitHub to see the links?

@andrewharvey
Contributor

Implementing a content-addressable system would be great and would help reduce waste. I have an S3 inventory running right now that will tell us just how much data is duplicated. If deduping would save a huge amount of space I'll probably try to implement that sooner than later.

I would guess there would be a ton of duplicated data. It would be nice to keep the historical data, but I'm all for deleting duplicate data.

Further waste could be reduced by making HTTP requests with an If-Modified-Since header to avoid downloading and re-running a source if it hasn't changed since the last run.
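
A sketch of that conditional fetch using the requests library; the URL and the cached timestamp are placeholders.

import requests

def fetch_if_changed(url, last_fetched_http_date):
    """Return new content, or None if the server says nothing has changed."""
    resp = requests.get(url, headers={"If-Modified-Since": last_fetched_http_date})
    if resp.status_code == 304:        # Not Modified: skip re-processing this source
        return None
    resp.raise_for_status()
    return resp.content

# e.g. fetch_if_changed("https://example.com/addresses.csv",
#                       "Wed, 09 Oct 2019 00:00:00 GMT")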

@iandees
Member Author

iandees commented Dec 26, 2019

I just ran a quick dedupe test on one of the recent S3 inventories and found that there are 3396619012547 bytes (3.4TB) in unique files and 3587042594660 bytes (3.5TB) in files that duplicate those unique files.

So building a simple file dedupe system would cut our storage bill roughly in half (from ~$2.73/day to ~$1.37/day). That's pretty good, but not as good as I thought it would be.
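
A rough sanity check of those numbers, assuming everything sat in Standard-IA at roughly $0.0125 per GB-month; the actual mix of storage classes and prices differs, so this is only approximate.

unique_bytes    = 3_396_619_012_547   # ~3.4 TB of unique files
duplicate_bytes = 3_587_042_594_660   # ~3.5 TB of exact duplicates

GIB = 2 ** 30
PRICE_PER_GB_MONTH = 0.0125           # assumed Standard-IA rate

def daily_cost(n_bytes):
    return n_bytes / GIB * PRICE_PER_GB_MONTH / 30

print(round(daily_cost(unique_bytes + duplicate_bytes), 2))  # ~2.7/day today
print(round(daily_cost(unique_bytes), 2))                    # ~1.3/day after dedupe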
