Planet file retention policy #180

Open
zerebubuth opened this Issue Nov 20, 2017 · 24 comments

Collaborator

zerebubuth commented Nov 20, 2017

We have a recurring problem with the disk on ironbelly filling up:

image

This is largely down to the storage of planet files (16TB total) and logs (7.7TB). The vast majority of both logs and planet files are old (2% of space used is for the last 3 planets, 76% is for planets from before 2017).

Old planet files can be useful for historical spelunking, but very few people actually do so (94% of requests in the last 10 weeks were for planets generated within the same period). A close approximation to an old planet can be obtained using osmium-tool's time-filter command, which can recover a point-in-time planet from the full history. Recovering several planet files in this way would likely be significantly faster than downloading the older planet files, on most internet connections. A possible downside is that there isn't enough information in the history file to recover the transactional state of the database, so timestamps alone might produce a planet which doesn't have referential integrity. If one is spelunking in older planets, this is something that one would have to be robust against anyway.
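For readers who have not used it, the point-in-time extraction described above can be scripted. This is a minimal Python sketch that only builds the `osmium time-filter` command line. The file names are hypothetical, and it assumes osmium-tool is installed and that the input is a full-history file.

```python
from datetime import datetime, timezone


def time_filter_command(history_file, when, output_file):
    """Build an argv list for `osmium time-filter`.

    `when` must be timezone-aware; it is converted to UTC and formatted
    as yyyy-mm-ddThh:mm:ssZ, the timestamp form osmium expects.
    """
    if when.tzinfo is None:
        raise ValueError("when must be timezone-aware")
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return ["osmium", "time-filter", "-o", output_file, history_file, stamp]


# Hypothetical file names, for illustration only.
cmd = time_filter_command(
    "history-latest.osh.pbf",
    datetime(2015, 1, 1, tzinfo=timezone.utc),
    "planet-2015-01-01.osm.pbf",
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where osmium-tool is available.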

Another issue with old planet files is that they preserve data which has since been redacted, which really shouldn't be available anyway.

I think we should have a policy for the number of old planets we keep around - or perhaps generate them on-demand using osmium-tool from the most recent history file. Here's a starting place:

  1. Keep all the "before current era" planet files, i.e. all the CC-BY-SA ones.
  2. Keep the last 3 good planet files.
  3. Regenerate a planet file for the 1st of each month using osmium-tool (possibly on-demand?)
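
The three rules above can be sketched as a small planning function. This is only an illustration under stated assumptions: the `era_start` cut-off is a parameter supplied by the caller (the proposal does not fix a date), every file passed in is assumed to be "good", and the function names are made up for this example.

```python
from datetime import date


def plan_retention(planet_dates, era_start, keep_latest=3):
    """Split planet file dates into (keep, delete, regenerate).

    Rule 1: keep everything dated before `era_start`.
    Rule 2: keep the `keep_latest` most recent files.
    Rule 3: for months that lose files, list the 1st of the month as a
            snapshot that could be regenerated from the history file.
    """
    dates = sorted(set(planet_dates))
    old_era = [d for d in dates if d < era_start]
    current = [d for d in dates if d >= era_start]
    latest = current[-keep_latest:] if keep_latest > 0 else []
    keep = sorted(set(old_era) | set(latest))
    delete = [d for d in current if d not in latest]
    regenerate = sorted(
        {d.replace(day=1) for d in delete if d.replace(day=1) >= era_start}
    )
    return keep, delete, regenerate


# Hypothetical dates, for illustration only.
keep, delete, regenerate = plan_retention(
    [
        date(2012, 8, 1),
        date(2017, 1, 2), date(2017, 1, 9), date(2017, 2, 6),
        date(2017, 11, 6), date(2017, 11, 13), date(2017, 11, 20),
    ],
    era_start=date(2012, 9, 12),
)
print(keep, delete, regenerate)
```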

How does that sound?

Zverik commented Nov 20, 2017

I suggest keeping one planet file for each month before 2012-09-12 and one planet file for each year after that, plus the three latest planet files as you suggest. No to regenerating planet files, since that can be done by a user.

Do we have a CC-BY-SA full history extract?

imagico commented Nov 20, 2017

Reducing the number of historic planet files readily available for download seems fine to me, but I wanted to make sure that removing them would not mean the historical data they contain is lost forever. I think the importance of the OpenStreetMap project warrants that the history of its data, including what has been redacted, is preserved for future generations. If there is already separate archive storage, or if the data, including what has been redacted, is already stored in a more compact form, that point would be moot of course.

iandees commented Nov 20, 2017

It makes sense to thin out the historical/old planet files (but not completely delete them). Can I suggest that we upload the older planet files to archive.org before we delete them? I'm sure the AWS Public Datasets team would be happy to host the historical planet files, too.

Collaborator

zerebubuth commented Nov 20, 2017

> Do we have a CC-BY-SA full history extract?

Yes. There are two on the planet server, one of which is named "final", and the other is two months later (?!). I'm not totally sure which would have been the final one before the license change, but the one dated in June is probably it.

> I think the importance of the OpenStreetMap project warrants that the history of its data including what has been redacted is preserved for future generations.

Yes, the redacted data is preserved in the database and database backups. Apologies - I should have been clearer in saying that I think continuing to distribute (via planet files) the redacted data is the problem. Continuing to keep it privately and internally until the copyright expires doesn't seem like a problem.

> Can I suggest that we upload the older planet files to archive.org before we delete them?

I think that any planet file containing redacted data is a potential legal liability. I wouldn't want to foist that onto archive.org or anyone else.

imagico commented Nov 20, 2017

> Yes, the redacted data is preserved in the database and database backups.

Then I have no objection. The most likely future use case for such data would be research. For that it would of course be convenient if the data were publicly available, but the legal issues, and ultimately the interests of those whose rights might be violated by this data, stand above this aspect of convenience.

Zverik commented Nov 20, 2017

Maybe just keep the first planet file for each month, for a start? This does not solve the issue long-term, but we will have a couple more years to think of a solution.

Firefishy added the planet label Nov 20, 2017

HolgerJeromin commented Nov 20, 2017

Zverik commented Nov 21, 2017

Yes, minute diffs are important and irreplaceable. And they don't take much space.

HolgerJeromin commented Nov 21, 2017

OK, I thought a CC-BY-SA consumer would use the last (full-history?) planet for everything they need.

Collaborator

pnorman commented Nov 21, 2017

> Yes. There are two on the planet server, one of which is named "final", and the other is two months later (?!). I'm not totally sure which would have been the final one before the license change, but the one dated in June is probably it.

Final is from before redactions started. The other is post-redaction, but pre-license change.

I'd recommend making the last 4 weeks available, as well as one file per month (every 4 weeks) for the last year, then one every 6 months.
I'm not sure what to do with history files, as I don't use them. I imagine there is very little use for old history files.
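
A tiered scheme like this can be sketched as follows. The sketch makes assumptions the comment leaves open: "every month" is read as the first file of each calendar month, "every 6 months" as the first file of each half-year, and "the last year" as 365 days. The function name is hypothetical.

```python
from datetime import date, timedelta


def thin_tiered(planet_dates, today):
    """Return the planet file dates to keep under a three-tier scheme.

    - files up to 4 weeks old: keep all
    - files up to 365 days old: keep the first file of each calendar month
    - older files: keep the first file of each half-year
    """
    keep = []
    seen_buckets = set()
    for d in sorted(set(planet_dates)):
        age = today - d
        if age <= timedelta(weeks=4):
            keep.append(d)
            continue
        if age <= timedelta(days=365):
            bucket = ("month", d.year, d.month)
        else:
            bucket = ("half-year", d.year, (d.month - 1) // 6)
        if bucket not in seen_buckets:
            seen_buckets.add(bucket)
            keep.append(d)
    return keep


# Hypothetical dates, for illustration only.
kept = thin_tiered(
    [
        date(2016, 1, 4), date(2016, 3, 7), date(2016, 8, 1),
        date(2017, 6, 5), date(2017, 6, 12),
        date(2017, 11, 6), date(2017, 11, 13), date(2017, 11, 20),
    ],
    today=date(2017, 11, 21),
)
print(kept)
```

Files not in the returned list would be the candidates for moving to offline storage.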

Files we remove should be archived to offline storage.

Regarding redacted data, we should not be considering it when planning this. If it weren't technically difficult, it would be better to remove any redacted data from all planet files, since we can't distribute it.

mmd-osm commented Nov 22, 2017

> A close approximation to an old planet can be obtained using osmium-tool's time-filter command, which can recover a point-in-time planet from the full history

> Regarding redacted data, we should not be considering it when planning this. If it weren't technically difficult, it would be better to remove any redacted data from all planet files, since we can't distribute it.

Regenerating previous planet files based on current full history files along with redaction has the downside that older versions of an object (which haven't been redacted) still "shine through", and will be used by osmium's time filter. This can cause quite a bit of havoc with object geometries.

While this is a bit off topic for this issue, creating full history files should - rather than just skipping redacted versions - at least contain a dummy marker including object id, version, an exact timestamp of the original object, and an indication that this version has been redacted.
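
To make the "shine through" effect concrete, here is a simplified Python model of a point-in-time lookup. It is not osmium's implementation, and the `redacted` field is the hypothetical dummy marker proposed above, not something current history files contain.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Version:
    version: int
    timestamp: datetime
    visible: bool = True
    redacted: bool = False  # hypothetical marker, not in real history files


def version_at(history: List[Version], when: datetime) -> Optional[Version]:
    """Return the version in effect at `when`, or None if there is none.

    A version carrying the redaction marker is treated like a deletion.
    """
    candidates = [v for v in history if v.timestamp <= when]
    if not candidates:
        return None
    latest = max(candidates, key=lambda v: v.version)
    if latest.redacted or not latest.visible:
        return None
    return latest


def year(y):
    return datetime(y, 1, 1)


# Version 2 was redacted. In `skipped` it is simply missing from the file;
# in `marked` it is present as a dummy marker.
skipped = [Version(1, year(2010)), Version(3, year(2013))]
marked = [
    Version(1, year(2010)),
    Version(2, year(2011), redacted=True),
    Version(3, year(2013)),
]

print(version_at(skipped, year(2012)))  # version 1 shines through
print(version_at(marked, year(2012)))   # None: nothing usable at that time
```

With the marker present, a consumer can at least tell that the object had a version at that time which is no longer available, instead of silently falling back to stale geometry.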

Also, I'd like to vote to keep the very first ODbL compliant planet file, along with all subsequent minutely diffs (you can probably guess why).

hbogner commented Jan 3, 2018

I'm one of those people who uses (or at least used) older planet files for history processing.
There are still people who don't know how to process a full history planet file and would like to access OSM history.
Is there really no slot for an additional disk in the server?

Member

tomhughes commented Jan 3, 2018

It already has 12 x 4TB disks - how many more should we be prepared to add for your convenience?

Note that it's not really one server as the goal is to actually have this in at least two and probably three places for redundancy. Right now our second site has a machine with 12 x 2TB disks though the two aren't quite equivalent.

hbogner commented Jan 3, 2018

I was just asking out of curiosity, not judging or ordering.

Also, a few years ago I asked whether OSM needed any more servers in Croatia, because we could host them here (at one or maybe two locations), but there was no need then.

Collaborator

zerebubuth commented Jan 3, 2018

> There are still people who don't know how to process full history planet file and would like to access OSM history.

What could be done to make it easier? It seems like it would be possible to write a server to do a point-in-time reconstruction of a planet file, streamed from osmium-tool time-filter, since it works in a single pass. Would that remove the need to store the individual old planet files?
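
As a rough idea of what the streaming part of such a server might look like, here is a Python sketch of a generator that relays a child process's output in chunks without buffering the whole file. The helper name is hypothetical, and the osmium command line in the comment is an assumption about how the filter would be invoked when writing to standard output.

```python
import subprocess
import sys


def stream_output(argv, chunk_size=64 * 1024):
    """Yield the standard output of `argv` in chunks as it is produced."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE)
    try:
        while True:
            chunk = proc.stdout.read(chunk_size)
            if not chunk:
                break
            yield chunk
    finally:
        # Runs on normal completion and if the consumer stops early.
        proc.stdout.close()
        returncode = proc.wait()
    if returncode != 0:
        raise RuntimeError(f"command failed with exit code {returncode}")


# A server would pass something like the following (hypothetical file name;
# with no -o option the output goes to standard output, so the format is
# given explicitly with -f):
#   ["osmium", "time-filter", "-f", "pbf",
#    "history-latest.osh.pbf", "2015-01-01T00:00:00Z"]
# Here a stand-in command is used so the sketch runs without osmium.
demo = [sys.executable, "-c", "import sys; sys.stdout.write('abc' * 10)"]
data = b"".join(stream_output(demo, chunk_size=4))
print(len(data))
```

An HTTP handler could write each yielded chunk straight to the response body, so memory use stays bounded regardless of planet size.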

> Is there really no slot for additional disk in the server?

The space on disk was the proximate reason for adding this issue, since running out of space on disk causes all kinds of bad things to happen. For the reasons @tomhughes already gave, maintaining really high-capacity, resilient storage is a pain. We're working on an approach (#169) which should make that easier, and any help with that would be appreciated.

I think there's a bigger issue: old planet files can contain data that we no longer want to redistribute, such as data copied from other maps that has since been redacted. I can't think of a feasible way to edit those redactions out of old planet files, and I think a better way would be to rely on the up-to-date "history" planets for historical information.

Zverik commented Jan 3, 2018

I assume we're talking about keeping fewer of the older planet files, not removing them altogether.

mmd-osm commented Jan 3, 2018

> It seems like it would be possible to write a server to do a point-in-time reconstruction of a planet file

In theory yes. However, there are potential issues with redactions in that file. As one example, versions before a redaction "shine through". This could be seen as a feature in some situations, or as causing issues with geometries in others. One could argue that a redacted version should be equivalent to a deleted version here. Please see my comment above #180 (comment)

Collaborator

zerebubuth commented Jan 4, 2018

> I assume we're talking about keeping less of older planet files, not removing them altogether.

That's the easiest thing to do, short term, and it seems that there's reasonably broad support for keeping one per month. It would still be relatively easy (although harder than just downloading the file) to recover other weeks by applying diffs to the monthly files.
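
For illustration, recovering an intermediate planet from a monthly file could be scripted roughly like this. The Python sketch only builds an `osmium apply-changes` command line; the file names are hypothetical, and it assumes the change files supplied cover the whole interval between the monthly file and the target date.

```python
def apply_changes_command(base_planet, change_files, output_file):
    """Build an argv list for `osmium apply-changes`."""
    if not change_files:
        raise ValueError("at least one change file is required")
    return ["osmium", "apply-changes", "-o", output_file, base_planet, *change_files]


# Hypothetical file names, for illustration only.
cmd = apply_changes_command(
    "planet-monthly.osm.pbf",
    ["diff-day-1.osc.gz", "diff-day-2.osc.gz"],
    "planet-recovered.osm.pbf",
)
print(" ".join(cmd))
```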

Long term (and I think I've gone a little off-topic, sorry), I'd prefer not publicly distributing old files apart from a small number of snapshots before and after the license change, and older planets of historical interest.

I think this issue is clear - we can trim down to one a month. I should file a follow-up for the remaining issues blocking a more thorough and legally defensible approach, as @mmd-osm pointed out in #180 (comment).

Collaborator

Firefishy commented Sep 19, 2018

I assume this can be closed now?

Member

tomhughes commented Sep 19, 2018

Did it actually get switched to live mode?

Member

tomhughes commented Sep 19, 2018

No, it's still in debug mode.

Collaborator

zerebubuth commented Sep 19, 2018

Good point. Seems to have been working well for the past few months. We can turn debug off - or even just close the issue. I don't mind getting the monthly reminder of what got deleted.

Member

tomhughes commented Sep 19, 2018

So it is actually deleting now then?

Collaborator

zerebubuth commented Sep 19, 2018

Yup, it got enabled back in April: openstreetmap/chef#157

image

It's hard to see whether the rate of growth has really slowed, though. Perhaps we've simply found space for other stuff to accumulate...
