Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversion to C++ #187

Closed
wants to merge 234 commits into from
Closed

Conversation

pnorman
Copy link
Collaborator

@pnorman pnorman commented Sep 29, 2014

C++ osm2pgsql rewrite

This PR is substantial, as this diff is thousands of lines of additions and removals.

Summary

  • Rewrite in C++98
  • Uses boost
  • Adds unit tests
  • Performance increases
  • In-memory pending tracker
  • New multi backend, which allows separation of middle tables and rendering tables, and rendering tables to be updated on multiple databases at once, with different styles or lua transforms

C++98 conversion

osm2pgsql has been rewritten in object-oriented C++98 with use of boost libraries. Any server that someone is contemplating running as rendering server should support this, as other components require support of this. In particular, any setup that can compile Mapnik should compile this without issues.

Despite the substantial rewrite, parts of osm2pgsql are still using C idioms such as void * pointers.

C++ brings substantial advantages for writing and not having to manually manage memory and prevent reading out of bounds.

Note: Do a make distclean when switching to/from this branch

Fixes #156

Boost

The version requirement of Boost has been kept to one that Ubuntu 12.04 supports out of the box, and any machine able to run Mapnik 2 should have a recent enough version of Boost.

Unit test

Unit tests have been added for the portions rewritten in the conversion. This unit test framework can easily be extended.

Fixes #186

Validation

As our goal was not to change the osm2pgsql output, a validation of the output was conducted with a planet from 130904. The rendering tables were checked for number of rows, area/length, and checks were done with selected tags. Way/relations were checked for number of rows, and total number of array members.

The only variation seen was in middle tables, which was expected. To facilitate addition of the multi-backend it became necessary to move discarding delete tags from the middle to backend. This means that the middle tables have a complete representation of OSM data, regardless of style files.

In-memory pending tracker

Previously osm2pgsql stored a pending flag on the middle ways and relations tables. This was undesirable for a few reasons

  • The data didn't need to be persisted between runs
  • After any run, pending should have been false for all rows
  • MVCC wasn't necessary, so we weren't getting any advantages from PostgreSQL
  • Loading into the tables with pending=true then updating pending=false on a later stage lead to significant table and index bloat, including a partial index which was 3.6GB after import instead of empty
  • Having pending in-memory made parts of threading easier

On a full planet import, the pending information for all of the ways is approximately 30MB, so memory usage is not an issue.

The move from in-db to in-memory for the middle was a gain of 7% on single-threaded middle ways, on top of earlier gains in other stages, and overall gains from having smaller less bloated tables.

Fixes #105

Fixes #111

Multi-backend

The new multi-backend has been developed for situations involving multiple database servers. It allows the middle slim tables to be in a different database than the rendering tables. This allows the rendering tables to be replicated with PostgreSQL replication without also replicating the slim tables, the rendering database servers to be provisioned with less database disk space, and multiple styles to be updated from one osm2pgsql process.

Because of the increased complexity of configuration when multiple servers are involved, the multi-backend takes a JSON file as a configuration, rather than increasing the already excessive number of osm2pgsql command-line options.

This feature is still in development, and will be of interest to only those running multiple rendering databases.

Performance

Although the conversion to C++ was not done for performance reasons, in practice the use of standard library data structures allows for easier performance gains. Future optimizations should be significantly easier. Overall, there is a performance gain, probably due to in-memory pending.

Performance tests were carried out on the machine used for previous testing (Hetzner EX40-SSD, i7-4770, 32GB RAM, 2x240GB SSD in RAID0) as well as extracts on a lower-end machine (i5-3570 @ 3.4 GHz, 8GB RAM, 500GB 3.5" 7200RPM HDD). These represent one of the fastest possible machines for osm2pgsql imports, and a typical developer desktop with a single mechanical drive.

Unlike some other results, performance numbers do include index creation, as that time could be different thanks to in-memory pending changes.

Versions benchmarked were 8fc8964 (current master) and 4b6489a. Pending relations are not present in a standard import, so do not impact the benchmarks. Note: Index numbers are still generating, but should not significantly depend on the changes, and will certainly be done before discussion is complete.

Command lines used were

  • osm2pgsql/osm2pgsql --create --style osm2pgsql/default.style --flat-nodes flat.nodes --cache 20000 --slim --number-processes 8 planet-130904.osm.pbf for full planet tests.
  • osm2pgsql/osm2pgsql --create --style ~/osm2pgsql/default.style --cache 6000 --slim --number-processes 4 germany-140601.osm.pbf
Stage 8fc8964 4b6489a Change
i7 SSD i5 HDD i7 SSD i5 HDD i7 SSD i5 HDD
ProcessingNodes 759 689 1018 699 +34% +1.5%
Ways 4315 587 4566 580 +5.8% -1.2%
Relations 9863 12210 9187 11636 -6.9% -4.7%
Total 14937 13486 14771 12915 -1.1 -4.2%
Pending Ways 4989 2001 3810 1287 -24% -36%
Ordering and indexes 9500 27976 ### ### +###% +###%
Grand Total 29467 43519 ### ### +###% +###%

Negative percentages are faster

Note: Germany only needed 2577MB of cache, but I didn't know this number until after I ran the import, but this doesn't effect the conclusions.

Known issues

The console output is different, and no better than before.

zerebubuth and others added 30 commits May 8, 2014 17:26
Added a very basic XML parsing unit test, which is run in the
standard autoconf manner. Also ported the existing regression
tests to be run from the autoconf test harness as well.

Part of this meant pulling most of the code into a temporary
library so that it can be re-used by all the binaries, including
the unit tests. Hopefully this won't cause any major issues - it's
all libtool-based, so should be portable, and is statically linked
to the resulting binary so needs no installation.
No actual functionality changes should have been made. However, it has been a long time since C++ was really forward-compatible from C, so there was a bunch of syntax and type stuff that needed changing.
…re, spatial_ref_sys) in the event that creating these extension fails. previously this was hardcoded for postgis bits to debian like system. for hstore it used to just fail outright
…ns instead of passing osmdata_t object. clean up main by breaking out input function pointer into seprate function call. break out output_t creation into seprate function call. moved cleanup of pointers that osmdata_t holds into its destructor
zerebubuth and others added 3 commits October 3, 2014 11:45
remove unneded module, fixes errors on default Python installation
replace variable length arrays with std::vector
@pnorman
Copy link
Collaborator Author

pnorman commented Oct 3, 2014

@zerebubuth, let's stick this on the EWG agenda for the 6th. (17:30 UTC, #osm-ewg)

rachekalmir and others added 2 commits October 5, 2014 14:19
Update parse-pbf.c replacing "protobuf_c_default_allocator" with "NULL" so that it compiles with protobuf 1.0.1.
Use "Update parse-pbf.c for protobuf 1.0.X" commit from master branch
@@ -121,6 +121,14 @@ fi
dnl Check for pthread library
AX_PTHREAD(,[AC_MSG_ERROR([no])])

dnl Check for Boost libraries
AX_BOOST_BASE([1.49], , [AC_MSG_ERROR([cannot find Boost libraries, which are are required for building avecado. Please install libboost-dev.])])
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be osm2pgsql, not avecado

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You got me! Copy paste fail!
Am 07.10.2014 18:02 schrieb "Paul Norman" notifications@github.com:

In configure.ac:

@@ -121,6 +121,14 @@ fi
dnl Check for pthread library
AX_PTHREAD(,[AC_MSG_ERROR([no])])

+dnl Check for Boost libraries
+AX_BOOST_BASE([1.49], , [AC_MSG_ERROR([cannot find Boost libraries, which are are required for building avecado. Please install libboost-dev.])])

should be osm2pgsql, not avecado


Reply to this email directly or view it on GitHub
https://github.com/openstreetmap/osm2pgsql/pull/187/files#r18552486.

@pnorman
Copy link
Collaborator Author

pnorman commented Oct 15, 2014

Looking at the Coverity scan, most of the issues raised seem to be in largely untouched C-code and have to do with memory. Nothing blocking, though we should review it later

kevinkreiser and others added 3 commits October 15, 2014 14:59
…1.48

The PPA only goes back to 1.54, so we use the standard packages for 1.48 and don't check between that and 1.54
@pnorman
Copy link
Collaborator Author

pnorman commented Oct 20, 2014

@apmon, any updates on the review? fyi, matt, kevin and myself are in the denver office this week

Check multiple boost versions with travis, drop boost requirement to 1.48
@pnorman
Copy link
Collaborator Author

pnorman commented Oct 24, 2014

@lonvia any thoughts from a gazetteer point of view?

int CopyActive;
unsigned int BufferLen;

char Buffer[BUFFER_SIZE];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably should be converted to a std::string just like in table.cpp

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed. i flirted with changing more of this many times but thought it was too risky since i wasnt much of an expert with nominatim. if we could it would be really nice to use table_t directly in here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mechanics behind filling the table are more or less directly copied from output_pgsql, so shouldn't be really a problem to use table_t. But it's nothing urgent. I'd be happy to look into it, once this is merged.

@lonvia
Copy link
Collaborator

lonvia commented Oct 24, 2014

I've mainly looked at output-gazetter now and skimmed the other changes shortly. I've put some comments in the code. Some other random musings:

  • Please check for and remove trailing white space.
  • buffer.hpp/.cpp does not seem to be used anywhere. There are includes in output-psql and output-multi but I cannot find any reference to the buffer class.
  • In output-gazetter: any particular reason why split_tags() and escape_array_record() are global methods instead of class methods?
  • Could naming conventions for variables, classes, members etc. be made consistent? (I'm aware that the C code wasn't any better but no reason to keep bad habits up.)

@pnorman
Copy link
Collaborator Author

pnorman commented Oct 25, 2014

  • Please check for and remove trailing white space

Taking this on.

pnorman added a commit to pnorman/osm2pgsql that referenced this pull request Oct 25, 2014
Also increments the version to 0.87.0

Closes osm2pgsql-dev#187

Fixes osm2pgsql-dev#156
Fixes osm2pgsql-dev#186
Fixes osm2pgsql-dev#105
Fixes osm2pgsql-dev#111
@pnorman pnorman closed this Oct 25, 2014
@pnorman pnorman deleted the cpp_conversion branch October 25, 2014 08:56
@pnorman pnorman mentioned this pull request Oct 25, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
9 participants