
Reduce maximum changeset size to 10k changes #1259

Closed · wants to merge 1 commit into master

Conversation

woodpeck (Contributor) commented Jul 31, 2016

10k changes ought to be enough for any normal mapping activity. Automatic edits and imports are controlled by scripts anyway, so it doesn't make a difference to them (they just have to adapt to the limit). The reason for my suggestion is that large changesets are increasingly difficult to handle (frequent timeouts when trying to load and process them). The changeset size limit is returned by the API in the "capabilities" request (http://api.openstreetmap.org/api/capabilities), i.e. client software that honours that information will automatically pick up the new limit.
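The capabilities document mentioned above is easy to read with a few lines of standard-library code. A minimal sketch, assuming the response shape: the XML literal below is an abridged, illustrative stand-in for what a client would actually GET from /api/capabilities, using the `maximum_elements` attribute mentioned later in this thread.

```python
import xml.etree.ElementTree as ET

# Abridged, illustrative capabilities response; a real client would GET
# http://api.openstreetmap.org/api/capabilities instead of using a literal.
CAPABILITIES_XML = """\
<osm version="0.6" generator="OpenStreetMap server">
  <api>
    <version minimum="0.6" maximum="0.6"/>
    <changesets maximum_elements="10000"/>
  </api>
</osm>
"""

def max_changeset_elements(xml_text):
    """Read the server's changeset size limit out of a capabilities document."""
    root = ET.fromstring(xml_text)
    changesets = root.find("./api/changesets")
    return int(changesets.get("maximum_elements"))

print(max_changeset_elements(CAPABILITIES_XML))  # -> 10000
```

An editor that re-reads this value picks up a lowered limit without any code change, which is the point made above.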

pnorman (Contributor) commented Jul 31, 2016

👍

Changesets over 10k cause problems for uploads, downloads, analysis, and figuring out what was done in them. The API has gotten better at handling large changesets, but changesets over a certain size will always be a problem.

Third-party analysis tools also have problems with changesets this large, tending to time out even if they don't need to download the changeset.

Uploads over 10k are prone to timeouts, so anyone creating them will tend to be uploading by chunks, which is not ideal because interrupted connections require manual cleanup of the result.

Best practices have long been to use changesets under 10k, and software like the redaction bot and the TIGER name completion code have picked sizes in the 500-1k range for the maximum changeset size.

A statistical analysis of changeset sizes I did a couple of years ago supports this too. The 99th percentile changeset size is 1.7k and the 99.5th percentile is 3.2k. These values have been decreasing with this year being 1.4k and 2.4k. The proposed limit would only affect 0.09% of changesets this year.
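Percentile figures like these can be reproduced on any list of changeset sizes. A sketch under stated assumptions: the data here is synthetic (the log-normal parameters are invented for illustration, not fitted to the real changeset dump), and the percentile uses the simple nearest-rank method.

```python
from random import lognormvariate, seed

def percentile(values, p):
    """Nearest-rank percentile: the value p% of the way up the sorted list."""
    ordered = sorted(values)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[rank]

# Synthetic stand-in for real per-changeset object counts.
seed(42)
sizes = [int(lognormvariate(3.0, 1.5)) + 1 for _ in range(100_000)]

p99 = percentile(sizes, 99)
p995 = percentile(sizes, 99.5)
share_over_10k = sum(s > 10_000 for s in sizes) / len(sizes)
print(p99, p995, f"{share_over_10k:.2%}")
```

The real analysis would feed actual per-changeset object counts into `percentile` instead of the synthetic `sizes` list.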


planemad commented Aug 3, 2016

This would impact ongoing imports using JOSM. Filed a JOSM ticket to auto-split large changesets: https://josm.openstreetmap.de/ticket/13265#ticket


pnorman (Contributor) commented Aug 3, 2016

JOSM already auto-splits large changesets. Hopefully with this change we can get people out of the habit of using the horrible chunk upload.
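The auto-split behaviour is straightforward to picture. A sketch of batching an upload to respect the limit; the function name and the list-of-changes representation are illustrative, not JOSM's actual internals.

```python
def split_changes(changes, max_per_changeset=10_000):
    """Split a flat list of changes into changeset-sized batches.

    Illustrative sketch of the auto-split editors implement; the real
    chunking in JOSM also preserves create/modify/delete ordering.
    """
    return [changes[i:i + max_per_changeset]
            for i in range(0, len(changes), max_per_changeset)]

batches = split_changes(list(range(25_000)))
print([len(b) for b in batches])  # -> [10000, 10000, 5000]
```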


Zverik (Contributor) commented Aug 3, 2016

Let's do this. I suggest posting a warning to talk@, dev@ and imports@ mailing lists, and in two weeks merging the pull request. We need a confirmation from OWG that they are okay with this.


mmd-osm (Contributor) commented Aug 3, 2016

Are various revert scripts also happy to process 50k changesets after the change, including comments like "Reverting changesets foo, part x/y" or the like?


pnorman (Contributor) commented Aug 3, 2016

The revert scripts already have to be able to handle changesets larger than the current maximum size, as there are some 50k+1 changesets, and I believe they do this okay.

This will make it easier to revert large changesets since we'll be able to reliably download them from the API.


pnorman referenced this pull request Aug 21, 2016: Add support for changesets #111 (Merged)

SomeoneElseOSM commented Aug 22, 2016

Just a comment to echo what @pnorman said above, and to note that we've just had another example where someone created "unrevertable (by JOSM) changesets": https://help.openstreetmap.org/questions/51646/how-to-revert-3-changesets-containing-136000-untagged-nodes?page=1&focusedAnswerId=51648#51648


gravitystorm (Collaborator) commented Aug 25, 2016

Personally I've no major objections to lowering the limit. It's an arbitrary number after all, so a different arbitrary number is fine.

One minor point is that this feels like papering over the cracks. If the diff uploads are taking too long and timing out, we should find out the cause - either the code is inefficient (likely more complex for deletes than additions), or the databases are too busy. It could be better to solve those problems instead.


zerebubuth (Contributor) commented Aug 25, 2016

> If the diff uploads are taking too long and timing out, we should find out the cause - either the code is inefficient (likely more complex for deletes than additions), or the databases are too busy. It could be better to solve those problems instead.

One part of the problem is that the protocol has a design flaw; if the connection to the API server is broken, then it's very hard to tell what the state of the upload is, which can be frustrating.

For API 0.7 (or perhaps also back-ported to 0.6.x), my favourite of the many good solutions is to replace the single POST upload with two requests. The first request would be a POST to .../upload/new, returning 202 with a Location: header pointing to a unique upload UUID, and the second would PUT the osmChange data there. GETting that location would return the status of the upload. This means an interrupted POST would leave an empty, but harmless, upload location, and an interrupted PUT could be checked to see whether the request was successful and whether it has been processed yet.
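To make the two-request flow concrete, here is a toy in-memory model of the proposed protocol; the endpoint paths, the 202 status, and the state transitions follow the description above, but the class and method names are illustrative, not any actual API.

```python
import uuid

class UploadEndpoint:
    """Toy model of the proposed two-step upload:
    POST .../upload/new -> 202 plus a Location for a fresh upload id;
    PUT to that location stores the osmChange payload;
    GET on it reports status, so an interrupted client can check safely."""

    def __init__(self):
        self.uploads = {}  # upload id -> {"state": ..., "payload": ...}

    def post_new(self):
        uid = str(uuid.uuid4())
        self.uploads[uid] = {"state": "created", "payload": None}
        return 202, "/upload/" + uid

    def put(self, uid, osmchange):
        self.uploads[uid].update(state="processed", payload=osmchange)
        return 200

    def get_status(self, uid):
        return self.uploads[uid]["state"]

api = UploadEndpoint()
status, location = api.post_new()
uid = location.rsplit("/", 1)[1]
print(api.get_status(uid))   # "created": an interrupted POST leaves only this
api.put(uid, "<osmChange/>")
print(api.get_status(uid))   # "processed": a resuming client sees it succeeded
```

The design choice being modelled: because the upload location is created before any data is sent, every failure mode leaves the client with something it can GET to learn what actually happened.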

Apologies - that was pretty well off-topic for this issue. I agree with @gravitystorm and have no objection to lowering the limit, especially if we're pretty sure it'll affect <0.1% of uploads.


pnorman (Contributor) commented Aug 25, 2016

> One minor point is that this feels like papering over the cracks. If the diff uploads are taking too long and timing out, we should find out the cause - either the code is inefficient (likely more complex for deletes than additions), or the databases are too busy. It could be better to solve those problems instead.

The changeset limit is more about downloads failing and keeping changesets to an understandable size. 10k changes in a single upload is still likely to fail. When I was doing coastline replacements, 2-5k was about the maximum upload chunk size.

I'd actually prefer a maximum changeset size that could reliably be uploaded in one chunk and represents the high end of what is a reasonable amount of work for a person to do in a reasonable amount of time, but that would be controversial.

zerebubuth/openstreetmap-cgimap#111 should help the downloads somewhat, but all the other reasons for decreasing the size remain.


zerebubuth (Contributor) commented Aug 26, 2016

I'm afraid to say that zerebubuth/openstreetmap-cgimap#111 is just the beginning and doesn't yet handle changeset downloads (osmChange), only changeset metadata and discussions. I hope to be able to add osmChange downloads soon. If anyone wants to help, it would be appreciated: code, tests, docs, ad-hoc comparisons with the Rails API - anything would be helpful.


woodpeck (Contributor) commented Oct 18, 2016

Can we keep the various good ideas for the future in mind while at the same time papering over the cracks by applying this?


Zverik (Contributor) commented Dec 7, 2016

So... Are we going to merge this?


grischard (Contributor) commented Jan 31, 2017

@tomhughes reducing the maximum changeset size to 10k changes would greatly help the work of the DWG. Would it please be possible to merge it?


tomhughes (Member) commented Jan 31, 2017

Of course it's possible. The question is what is the correct way to decide on a change as significant as this...


tomhughes (Member) commented Jan 31, 2017

So to summarise the discussion so far, it seems we have decided JOSM will adapt automatically.

Presumably iD and Potlatch are extremely unlikely to generate such large changesets.

What about Merkaartor? Various command-line tools?

Other than that, it was suggested this should be announced somewhere before implementing it, which has not been done as far as I know.


Komzpa commented Jan 31, 2017

Is there any kind of histogram of object counts per changeset, so we can see whether there is actually a major drop at 10,000 objects and then a sudden spike at 50,000 objects?

If this exists only to fight API slowness, then I have to raise a concern about inefficiencies in the current API implementation, which turns each object into two 0.4 ms sleeps (2 × 0.4 ms × 50,000 = 40 s for any such API request); for details/PoC see https://github.com/Komzpa/fastmap


mmd-osm (Contributor) commented Jan 31, 2017

re Merkaartor: the sources don't seem to include the terms "capabilities" or "maximum_elements", so it's either some hardcoded value or not being checked at all. (@Krakonos)

Could this change perhaps be activated on the dev instance for testing?


tomhughes closed this in ad2b4fe Jan 31, 2017

tomhughes (Member) commented Jan 31, 2017

Apparently I'm not going to get any peace until I merge this so consider it done.


DaveF63 commented Jan 31, 2017

This proposal won't affect me but...

Are these 0.09% of edits bad data? If not, this appears to be cutting off OSM's nose to spite its face. As @gravitystorm points out, improving the code, if possible, would be a step forward, rather than the backward step of disallowing the addition of data.

Are these 10k+ uploads from specific contributors or sources?


Krakonos (Member) commented Feb 1, 2017

I just checked, and it seems neither a hardcoded limit nor the capabilities check is implemented. I do have rewriting this code on my shortlist, but even my short list seems pretty long right now. I'll at least let users know this change happened.


Krakonos referenced this pull request Feb 1, 2017: Implement full OSM API #127 (Open)

Zverik (Contributor) commented Feb 1, 2017

Just a bit of trivia: of the 8.5 million changesets made in 2016, 6796 (0.08%) contained more than 10k objects, and 6191 of those were uploaded with JOSM.

@DaveF63, please read the comments. This limit won't affect mappers, since JOSM (and soon, other editors) can split changes into multiple changesets.


DaveF63 commented Feb 1, 2017

@Zverik I did. Blocking 10k+ changesets is still on the table. iD has implemented it.
If JOSM, as claimed, already splits them, and other editors won't be affected, what is being used to upload? To find a solution you first have to understand the problem.


openstreetmap locked and limited conversation to collaborators Feb 1, 2017
