Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use libosmium for geometry building #684

Merged
merged 30 commits into from Mar 3, 2017

Conversation

lonvia
Copy link
Collaborator

@lonvia lonvia commented Feb 4, 2017

This PR replaces the libgeos-based geometry builder with the libosmium area assembly. As libosmium approaches multipolygon assembly in a slightly different way, there are some visible changes to the polygon handling, as well as some secondary changes.

Changes in Polygon Handling

  • Geometry type is determined by tagging only. Previously osm2pgsql would create polygons only when a polygon was requested by the tagtransform and the geometry could actually assembled into rings. Otherwise a linestring was added to the *_ways table. This could lead to the odd situation where a partially broken multipolygon had some of its rings in the polygon table and the unclosed ways in the ways table. New behaviour is that only the result of tagtransform (or for output-multi: the requested format for the table) is taken into account. In some corner cases, this can lead to a regression with the C-transform when keys may describe a polygon or a way.
  • Broken geometries are not recovered. The libgeos geometry builder was very forgiving with broken polygons and would create a partial solution. The libosmium assembler uses an all or nothing approach. If the geometry is bad, the entire object is discarded. There is also no more buffering of geometries to recover them. On the plus side, libosmium can handle geometries where rings touch in more than one point, which is permissible in OSM but not for OGR polygons.
  • --exclude-invalid-polygon removed. libosmium never creates invalid geometries.
  • Line merging has different results. For route relations a custom line merging algorithm has been added which seems to be much more aggressive than the one employed by libgeos.

Other Changes

  • Reprojection of geometries now moved into geometry assembler because libosmium can build geometries only in WSG84.
  • Expiry works now on binary wkb only. This actually results in much cleaner code.
  • Some streamlining of middle interface (mostly working more with the native libosmium types).
  • now unused custom nodelist_t and multinodelist_t removed.

Todo

Relations and nodes can be const, as their buffer is read only.
Ways need to be writable, so we can add node locations
directly in the object. Hand them around as pointers to
make it clear they are mutable.
Don't go through the conversation into a geos geometry
but use our new wkb reader instead.
Role out our own wkb writer class instead of using libosmiums
internal implementation details. Allows a bit of stream-lining
and removes some of the hacks added to the libosmium copy.
Go back to printf conversion because std::to_string() does not
have sufficient precision.
No longer needed. Using libosmium types now.
Get roles and ids directly from the relation member list,
avoiding intermediate id lists and additional loops through
the members.
Postgres returns wkbs in hexform whereas the parser expects a
pure binary form. So convert before usage.
@imagico
Copy link

imagico commented Feb 6, 2017

Regarding broken geometries:

In the current state of libosmium w.r.t. assembling multipolygons, i.e. without osmcode/libosmium#152 being addressed, this will have a huge impact on users in map rendering. I have used libosmium quite extensively for data preparation for landcover rendering and there are areas where 20-50 percent of the big landcover polygons are broken in ways currently not recoverable by libosmium, mostly self intersections. The visual impact on maps will be boldly visible.

That would be great for the standard style because it would bring down the number of broken geometries in the database quite quickly, faster than any QA tool can do, but it would certainly also be problematic for many osm2pgsql users.

Ideally libosmium should be able to handle these errors and you should be able to choose within osm2pgsql how strict you want to be about them.

Additional note: If you assemble polygons in geographic coordinates and reproject them into other coordinate systems you can get invalid geometries. This is rare practically with mercator but if you want to make sure you only have valid geometries you need a validity check afterwards.

@lonvia
Copy link
Collaborator Author

lonvia commented Feb 6, 2017

Additional note: If you assemble polygons in geographic coordinates and reproject them into other coordinate systems you can get invalid geometries. This is rare practically with mercator but if you want to make sure you only have valid geometries you need a validity check afterwards.

Reintroducing the validity check is something I'd like to avoid. It's expensive and we'd essentially have to reimplement whatever libgeos is doing. So the question is, does this have any practical implications for normal osm2pgsql use cases? If it only happens when using some obscure projections, then post-filtering is probably the better option. If there are practical cases for mercator projection we should check if we can catch these particular ones without going through the full-blown validation process.

@lonvia
Copy link
Collaborator Author

lonvia commented Feb 6, 2017

In the current state of libosmium w.r.t. assembling multipolygons, i.e. without osmcode/libosmium#152 being addressed, this will have a huge impact on users in map rendering. I have used libosmium quite extensively for data preparation for landcover rendering and there are areas where 20-50 percent of the big landcover polygons are broken in ways currently not recoverable by libosmium, mostly self intersections. The visual impact on maps will be boldly visible.

joto's area statistics report that approximately 50k out of 242.000k have intersections. @joto tells me that this figure actually includes polygons that are reported multiple times and the real figure is more like 20k self-intersecting polygons of which 6k are relations. That's 0.001% of polygons. It's probably true that some areas where the polygons are concentrated but with this small number, they cannot be large and deserve being fixed. I also think that it will make life easier for mappers and data consumers alike if osm2pgsql takes a more consistent approach to creating polygons.

That would be great for the standard style because it would bring down the number of broken geometries in the database quite quickly, faster than any QA tool can do, but it would certainly also be problematic for many osm2pgsql users.

Not necessarily. The average data consumer expects to get the data they see on osm.org when they process the planet. They don't need what they are not aware is not even there. And fixing broken data always comes with the danger of fixing it the wrong way.

@imagico
Copy link

imagico commented Feb 6, 2017

Regarding reprojection and validity - It is probably not a big deal in most applications - normal rendering for example does not really care if the geometry is valid or not but with

  • --exclude-invalid-polygon removed. libosmium never creates invalid geometries.

i wanted to point out that this would not imply osm2pgsql never creates invalid geometries with coordinate systems other than geographic. The most important thing is to make this clear in the documentation.

Ultimately this is a more generic problem anyway, i.e. the need for a reprojection library that is capable of validity preserving reprojection of linestring/polygon geometries.

joto's area statistics report that approximately 50k out of 242.000k have intersections. @joto tells me that this figure actually includes polygons that are reported multiple times and the real figure is more like 20k self-intersecting polygons of which 6k are relations.

I was talking about visual impact here - large polygons have a much higher likeliness of being broken than small ones - a bit like ceramic components in engineering 😄. At the same time large polygons have a high visual impact in map rendering. The vast majority of polygons in the OSM database are buildings which are obviously rarely broken.

Note i am not against being stricter with polygons in osm2pgsql, on the contrary. But you need to expect complaints from users because of this. To get an idea of the visual impact - if you take the difference in Jochen's comparison map with the removal of old style MPs: http://area.jochentopf.com/map/index.html - the impact of removing all invalid polygons osmium currently does not handle is larger than that at z6-8 (that is for water and wood/forest) in many areas, in particular northern Europe.

@woodpeck
Copy link
Contributor

woodpeck commented Feb 6, 2017

One of the things I do at $DAYJOB is sell shape files generated from OSM data, where I typically discard invalid polygons. I'd like to support imagico's assessment - nobody cares if a house goes missing, but everyone notices if a 50km segment of the Rhone is lacking its riverbank area. So yes, it might be "only" a few thousand relations but there will be many large ones and there will be complaints. On the other hand people who post-process OSM data like me will finally be free from user complaints like "but this forest is visible on the OSM web site, why is it not in your shapefile?", so +1 for dropping invalid stuff. Incidentally I made an attempt to do that 5 years ago in 8be0cfc, maybe this time it gets through ;)

@pnorman
Copy link
Collaborator

pnorman commented Feb 7, 2017

Appveyor is failing with

C:\osm2pgsql\contrib\libosmium\osmium/tags/filter.hpp(41): fatal error C1083: Cannot open include file: 'boost/iterator/filter_iterator.hpp': No such file or directory

@alex85k, could this be something that's missing in the binary blobs used to supply the osm2pgsql dependencies?

@pnorman
Copy link
Collaborator

pnorman commented Feb 7, 2017

Geometry type is determined by tagging only. Previously osm2pgsql would create polygons only when a polygon was requested by the tagtransform and the geometry could actually assembled into rings. Otherwise a linestring was added to the *_ways table. This could lead to the odd situation where a partially broken multipolygon had some of its rings in the polygon table and the unclosed ways in the ways table. New behaviour is that only the result of tagtransform (or for output-multi: the requested format for the table) is taken into account. In some corner cases, this can lead to a regression with the C-transform when keys may describe a polygon or a way.

How will this work with man_made=pier?

@pnorman
Copy link
Collaborator

pnorman commented Feb 7, 2017

I've set up https://osmium.osm2pgsql.paulnorman.ca. The data on the right is from 170125 and imported with ./osm2pgsql -d osm2pgsql_test -S ../default.style --slim --drop --flat-nodes nodes.bin --cache 34000 ~/planet-170125.osm.pbf

@lonvia
Copy link
Collaborator Author

lonvia commented Feb 7, 2017

The missing dependency for Appveyor is something we probably can (and would want to) get rid of in the code. So it's okay to leave it for now.

@lonvia
Copy link
Collaborator Author

lonvia commented Feb 7, 2017

How will this work with man_made=pier?

That's the regression. It will be fine for the lua transform, as polygon can be set according to value not only key but it won't work for the C transform anymore. If that is not acceptable than we need to reintroduce the old behaviour for ways.

@joto
Copy link
Collaborator

joto commented Feb 7, 2017

Regarding the fixing of broken multipolygons: The Mapbox data team is actively fixing broken multipolygons that have a large visible impact on the map. They have some kind of prioritized list. I am also working on Maproulette challenges to get the community involved in fixing.

@pnorman
Copy link
Collaborator

pnorman commented Feb 14, 2017

Numbers from a test with master (b9d7fcc) and this PR (0333e11)

pr

Node stats: total(3706912087), max(4630874436) in 548s
Way stats: total(389670834), max(468563887) in 2308s
Relation stats: total(4734167), max(6908001) in 3582s
Finished processing 267057024 ways in 4428 s
All indexes on planet_osm_roads created in 903s
All indexes on planet_osm_point created in 2347s
All indexes on planet_osm_line created in 5859s
All indexes on planet_osm_polygon created in 9432s
Osm2pgsql took 20310s overall
User time (seconds): 11826.25
Maximum resident set size (kbytes): 40146944
Involuntary context switches: 69054

master

Node stats: total(3706912087), max(4630874436) in 639s
Way stats: total(389670834), max(468563887) in 2864s
Relation stats: total(4734167), max(6908001) in 4775s
Finished processing 267057051 ways in 6517 s
All indexes on planet_osm_roads created in 916s
All indexes on planet_osm_point created in 2468s
All indexes on planet_osm_line created in 5876s
All indexes on planet_osm_polygon created in 9605s
Osm2pgsql took 24412s overall
User time (seconds): 29804.55
Maximum resident set size (kbytes): 39976848
Involuntary context switches: 2654796

This is saving over an hour on a 6 hour import. Interestingly, master is more parallel, probably because it has more work to do.

@jburgess777
Copy link
Contributor

That is an impressive reduction of CPU from 30k to 12k seconds. Can that be explained by the simpler geometry processing?
Given that the overall time is 20k seconds, the bottleneck must be somewhere other than osm2pgsql.

@lonvia
Copy link
Collaborator Author

lonvia commented Feb 15, 2017

That is an impressive reduction of CPU from 30k to 12k seconds. Can that be explained by the simpler geometry processing?

Yes. Using geos functions to build OSM geometries was not a straightforward process while libosmiums algorithm is tailored for the peculiarities of OSM data. There is also a lot less memory allocation going on when preparing the data as more data is processed directly in place in the libosmium buffers.

Given that the overall time is 20k seconds, the bottleneck must be somewhere other than osm2pgsql.

Our biggest bottleneck is still postgres. Given that all queries (including inserts) are done synchronous, osm2pgsql probably spends a lot of time waiting for results.

Looking at the numbers it might be worth testing if reducing the number of parallel threads won't actually be faster.

@lonvia
Copy link
Collaborator Author

lonvia commented Feb 15, 2017

Performance numbers for a planet import on a 32GB RAM machine with SSDs (-C 24000 --drop -s):

PR

Processing: Node(3467864k 5376.5k/s) Way(361033k 31.52k/s) Relation(4392840 337.86/s)  parse time: 25102s
Node stats: total(3467864331), max(4318974389) in 645s
Way stats: total(361033871), max(432971616) in 11455s
Relation stats: total(4393429), max(6435395) in 13002s
Finished processing 244505022 ways in 7390 s

master

Processing: Node(3467864k 4051.2k/s) Way(361033k 25.98k/s) Relation(4393000 254.80/s)  parse time: 31992s
Node stats: total(3467864331), max(4318974389) in 856s
Way stats: total(361033871), max(432971616) in 13894s
Relation stats: total(4393429), max(6435395) in 17242s
Finished processing 244505051 ways in 10367 s

The PR database has about 66k polygons less coming from about 20k objects. Summing up the areas of all polygons, there is about 1% less. So as suspected, there are in particular large areas missing.

@pnorman
Copy link
Collaborator

pnorman commented Feb 17, 2017

On the planet the PR is

  • 0.028% (71k/255M) fewer rows in planet_osm_polygon;
  • 0.25% (6.94e+12/2.81749e+15) less ST_Area(way) in planet_osm_polygon;
  • 0.28% (8.02e+12/2.83212457458888e+15) less way_area in planet_osm_polygon;
  • 0.8% (1M/131M) fewer rows in planet_osm_line;
  • 0.016% (2k/11M) more rows in planet_osm_roads; and
  • unchanged for planet_osm_point

@lonvia
Copy link
Collaborator Author

lonvia commented Feb 27, 2017

I have finished all testing. Code-wise this is good to go.

@pnorman pnorman added this to the 0.94.0 milestone Feb 28, 2017
@pnorman
Copy link
Collaborator

pnorman commented Feb 28, 2017

I'm also okay with it, it's now just a question of release management.

@lonvia
Copy link
Collaborator Author

lonvia commented Feb 28, 2017

@joto informed me this morning that I might have broken the expiry code. It assumes that the WKBs that are returned from the database have native endianess. That is not necessarily the case. It is trivial to fix but needs to be done before the PR can be merged.

@pnorman
Copy link
Collaborator

pnorman commented Feb 28, 2017

It assumes that the WKBs that are returned from the database have native endianess

Would this also be an issue in putting geoms into the DB?

@lonvia
Copy link
Collaborator Author

lonvia commented Feb 28, 2017

Would this also be an issue in putting geoms into the DB?

No. Each WKB geometry contains a flag that indicates the endiness used, so the DB must make sure to interpret the string correctly.

@lonvia
Copy link
Collaborator Author

lonvia commented Feb 28, 2017

It wasn't trivial after all, so I've only added a check that the endianess is as expected the native endianess of the machine. We can add an endian flip if somebody should ever hit the exception.

wkb.hpp Outdated
if (out[0] != Endian)
throw std::runtime_error(
"Endian setting of database WKB not supported.");

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we say what the endianness of osm2pgsql is in the error message?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or better yet, something like "Different endian settings of database and osm2pgsql are not supported. Database is big-endian, osm2pgsql is little-endian."

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@pnorman pnorman merged commit c49d1b9 into osm2pgsql-dev:master Mar 3, 2017
@ilovezfs
Copy link

ilovezfs commented Mar 3, 2017

@pnorman is there a new release in the offing then?

@pnorman
Copy link
Collaborator

pnorman commented Mar 3, 2017

There aren't plans for an immediate release. First we want to let any bugs after a big change like this have a a chance to shake out, but we'd like to get any other changes that involve schema changes into the same release as #668

@lonvia lonvia deleted the projection-in-output branch March 4, 2017 16:10
@ilovezfs
Copy link

@pnorman thanks. Any suggestion on how we should handle the Homebrew formula at this point? See https://github.com/Homebrew/homebrew-core/blob/master/Formula/osm2pgsql.rb#L21-L26

It's using

  patch do
    url "https://github.com/openstreetmap/osm2pgsql/pull/636.patch"
    sha256 "54aa12fe5a3ebbc9ecc02b7e5771939b99f6437f5a55b67d8835df6d8d58619a"
  end

but that seems somewhat inappropriate at this point since #636 will never be merged.

@pnorman
Copy link
Collaborator

pnorman commented Apr 28, 2017

osm2pgsql has no support for geos 3.6 in the 0.92.x branch and is not likely to get it with the need to maintain 3.5 compatibility and the need to verify a patch, particularly around memory leaks. Because we've moved away from geos, it's not something many people are likely to work on.

@ilovezfs
Copy link

@pnorman Understood. A new release would be much appreciated so we can drop the patch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants