
openstreemap pipeline improvements #27

Merged
merged 56 commits from experimental into master on Mar 10, 2015

Conversation

missinglink
Member

PENDING REVIEW

First of all; apologies for the very large PR. This work is the result of research and development of the best way to build the OSM import pipeline over the last 3-4 months.

The result is a far more refined, stable, and tested import pipeline which we can continue to build on going forward.

The current master branch (the target of this PR) has 4 major pain points:

  1. slow imports - importing the full planet file can take up to 20 days; the pipeline is not multi-core.
  2. failure to exit(0) - the imports sometimes fail to exit cleanly, making ops scripting difficult.
  3. "Recursive process.nextTick detected" - streams-related errors.
  4. code clarity/testing - the code is only partially tested and, in some places, messy.

In order to address [1] the following work was done, which resulted in drastic speed improvements:

  • implement golang pbf parser
  • implement admin-lookup
  • upgrade to the latest suggester-pipeline module
  • simplify features.js
  • remove osm_types.js

In order to fix [2], I simply:

  • fix stats.js module via unit tests, remove complexity

Issue [3] was resolved by fixing the above issues and:

  • upgrade all dependencies.
  • fix nodejs version inconsistencies to work with both 0.10 and 0.12.

Issue [4] was tackled by:

  • implement pelias-model
  • add doc.setAddress() and doc.getAddress() functions to pelias/model
  • general clean up
  • improve global module.exports, add tests.
  • configure code linting and precommit hook, lint everything.
  • make osm data mappers clearer and easier to write/modify
  • add a unit test for every module
  • add detailed comments at the top of complex streams to explain their purpose
  • add end-to-end system test
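As a rough illustration of the setAddress()/getAddress() additions mentioned above (the real pelias/model signatures may differ; names and behaviour here are assumptions), the pattern looks something like:

```javascript
// Hypothetical sketch of a pelias/model-style Document with address accessors.
// The actual pelias/model API may differ; this only illustrates the pattern.
function Document(){
  this.address = {};
}

// store a single address component; returns `this` for chaining
Document.prototype.setAddress = function( key, value ){
  this.address[ key ] = value;
  return this;
};

// read a single address component back
Document.prototype.getAddress = function( key ){
  return this.address[ key ];
};

var doc = new Document();
doc.setAddress( 'street', 'Oranienstrasse' ).setAddress( 'number', '142' );
```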

Some extra benefits we got along the way:

  • extract address data and store in ES
  • remove dependency on having quattroshapes indexed before running
  • resolve inconsistent ways count from subsequent import runs

@sevko
Contributor

sevko commented Feb 9, 2015

Before this gets merged, I'm curious how you identified the OSM parsing as our bottleneck.

@missinglink
Member Author

In the case of OSM data it's pretty obviously a bottleneck, as we parse ~2.5B nodes and only use ~20m of them, so the faster it runs the better off we will be.

Having said that I agree the other I/O operations still need better performance testing.

@sevko
Contributor

sevko commented Feb 9, 2015

Right, but I can parse an 800mb PBF in 3 minutes with straight up osm-pbf-parser. If we assumed that the rest of our pipeline ran instantaneously, wouldn't the entire import take that long? Since it doesn't, it seems like the delays are elsewhere. Am I missing something?

@missinglink
Member Author

That is a very good question.

@sevko
Contributor

sevko commented Feb 10, 2015

Let's figure that out and discuss how we should benchmark the importer before this gets merged.

cc @dianashk , @hkrishna

@sevko
Contributor

sevko commented Feb 21, 2015

Well, it turns out that I can parse and filter the planet PBF with pure JavaScript in ~17 hours. Still don't understand why the Go parser is making a significant difference, or why our imports were taking 3 weeks.

@heffergm

My 2p, based solely on my observations running both:

  • with the old setup, no activity on multiple cores (i.e. a single CPU at 100%) regardless of thread count, and a peak of 4k requests/s to ES
  • with the experimental setup, activity on all cores, and a peak of ~30k requests/s to ES

If I had to hazard a guess, there's a limitation in the old implementation that's blocking on request/response to ES, not necessarily parsing the pbf, and the fact that it's operating only on a single core effectively limits the throughput you're going to get.

@missinglink changed the title from "there be dragons here" to "here be dragons" on Feb 23, 2015
suggester = require('pelias-suggester-pipeline'),
dbmapper = require('./stream/dbmapper');

var osm = { pbf: {}, doc: {}, address: {}, tag: {} };
Contributor

I think it'd be cleaner if you replaced the osm object with simple variable declarations, ie:

var docConstructor = require('./stream/document_constructor');
var docDenormalizer = require('./stream/denormalizer');

It doesn't really seem to serve any purpose.

Member Author

osm is exported. the object hierarchy is intended to give readability to the exported API

Contributor

in my opinion this would work better as a class, especially since it's being exported.

Member Author

why? could you make a case please?

Contributor

When you export a class you can be a lot more explicit about the interface. Do things like define accessor methods if needed. Allow multiple instances, if that's a desired behavior, and explicitly indicate it isn't by making it a singleton. Maybe it's my OOP background, but exporting an anonymous object as the face of your package feels unfinished. With that said, I'm not extremely passionate about this and will not be super upset if you keep it as is.

Member Author

A 'class' in JS represents an instantiable construct for which each instance may hold state in its individual properties while sharing methods from a prototype.

This functionality is not required in this case and so we would end up exporting a singleton with no individual state or prototypal methods.

The current implementation is equivalent and forwards compatible if (hypothetically) we decided to export a singleton in the future.

The system is designed in a way that allows individual components to be easily added/removed from the import pipeline by simply excluding/including additional streams to the pipeline.

A good analogy is this:

#! /bin/bash
$> osm_parser | admin_lookup | elasticsearch;

if another developer would like to decompose that pipeline and fork an existing stream or add a new one then they can simply re-compose the pipeline as such:

#! /bin/bash
$> osm_parser | my_fork_of_admin_lookup | something_else | elasticsearch;

I would prefer if this repository is completely functional and stateless.

I think this specific discussion would be best continued outside of this PR.

@missinglink
Member Author

thanks for the review @sevko

  1. I was hoping we could remove that functionality completely; are you working on something like that?
  2. I don't see the value of merging these: there is one schema for names and one for addresses. I assume you mean merging the address ones? These are already merged for performance reasons here https://github.com/pelias/openstreetmap/blob/experimental/stream/tag_mapper.js#L11 but left separate for readability.
  3. It does not extract admin values; it simply copies them from the poi doc to the address doc so we don't have to look them up twice.

@sevko
Contributor

sevko commented Mar 9, 2015

  1. Nope. We still need to compute way center-points until we decide to import non-point geometries (lines, polygons) into elasticsearch, so it seems like those should stick around.
  2. They hinder navigability more than improve readability, in my opinion. Two of those files have only one key-value pair, and a lot of boilerplate (including a module.exports, which a .json file wouldn't need, and the same header comment duplicated in four places). The three address_ files can be simplified to:
{
    "karlsruhe": {
        "addr:housename": "name",
        "addr:housenumber": "number",
        "addr:street": "street",
        "addr:state": "state",
        "addr:postcode": "zip",
        "addr:city": "city",
        "addr:country": "country"
    },
    "naptan": {
        "naptan:Street": "street"
    },
    "osm": {
        "postal_code": "zip"
    }
}
  3. My mistake.

@missinglink
Member Author

  1. I think we should aim for a separate module, as this functionality may be useful to other modules in the future; the code itself is covered by a stream test but the lib is not unit tested. I personally feel it's out of scope of this PR, which can be merged independently of that issue, and it would add more complexity to an already large PR.
  2. What does "hinder navigability" mean? If you are saying that code with fewer characters is better then I strongly disagree. A JSON file cannot contain comments and would be far more confusing to other developers trying to collaborate.

@sevko
Contributor

sevko commented Mar 9, 2015

I think that if we want to make any more changes, now's as good a time as ever.

  1. In that case, at least compress them into one module. One function per file seems extreme.
  2. I mean navigating three files is more difficult than navigating one, especially when they contain hardly any content. Feel free to keep them as a .js file (indeed, losing the comment would be bad), but I don't see why each deserves its own module.

Also, seems like the config/ folder contains only a single file, which is somewhat similar to the files inside schema/. Maybe merge them into one directory?

@sevko
Contributor

sevko commented Mar 9, 2015

The easier a source tree is to navigate and read, the more likely we are to get external contributors. This seems like it'll confuse people (why so many files/directories for something so simple?). It did me, at least. @dianashk , what's your take?

@@ -16,4 +20,4 @@ module.exports = function( geometry ){
}

return;
Contributor

can we return false or null here?

Member Author

return; is equivalent to return undefined;, why are return false; or return null; better?

Contributor

Returning undefined implies the function is of type void, which it isn't. If there is ever an expectation of the function returning something, it shouldn't return void at any point. null would be an appropriate substitute for an object when one cannot be created.

Contributor

Agreed. It's a semantic difference. null implies a result could not be computed, while undefined implies a variable wasn't initialized.

Member Author

Are we talking about javascript here or another language? How is this function of type 'void'?

If we are talking types in js, then exporting something which is an object seems a bit odd to me.
re: typeof null == "object".

Since you feel strongly about this, I am happy to change to accommodate.
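A tiny sketch of the convention settled on in this thread, using a hypothetical geometry helper (invented for illustration, not the actual code under review): return null explicitly when a result cannot be computed, rather than falling off the end of the function.

```javascript
// hypothetical centroid helper: returns [lon, lat] or null, never undefined
function centroid( points ){
  if( !Array.isArray( points ) || points.length === 0 ){
    return null; // explicit "no result could be computed"
  }
  var sum = points.reduce( function( acc, p ){
    return [ acc[0] + p[0], acc[1] + p[1] ];
  }, [ 0, 0 ] );
  return [ sum[0] / points.length, sum[1] / points.length ];
}
```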

@dianashk
Contributor

dianashk commented Mar 9, 2015

To address the earlier discussion:

  1. With regard to combining/moving the geoJson*.js util contents, I find the current split clean and modular. I can see the reasoning for splitting little utilities like that up into separate files. Grouping them under the util directory hides the perceived complexity of too many files, while having them in separate files means you don't have to scroll through a wall of code to find a small function not at all related to the rest of the file contents. It's not unlike node_modules/ hiding all the various dependencies.
  2. For the same reason as (1), I don't mind the schema breakup. There is a logical split between those schema items, and having them in separate files highlights that to the reader. I'm not offended by this. What could be done to make it clear that the various address parts should be merged, and to hide that complexity, is to add an index.js to that directory. Then require the schema directory instead of the individual files.

schema/index.js

module.exports.NAME_SCHEMA = require('./name_osm');
module.exports.ADDRESS_SCHEMA = merge( true, false,
  require('./address_osm'),
  require('./address_naptan'),
  require('./address_karlsruhe')
);

stream/tag_mapper.js

var schema = require('../schema');
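For illustration, a merged address schema like the one above could be consumed in a mapper roughly like this (a sketch with assumed tag and key names, not the actual tag_mapper code):

```javascript
// a merged tag -> doc.address key schema, as the proposed schema/index.js would produce
var ADDRESS_SCHEMA = {
  'addr:street': 'street',
  'addr:housenumber': 'number',
  'naptan:Street': 'street',
  'postal_code': 'zip'
};

// copy any recognised OSM tags onto the document's address object,
// ignoring tags the schema doesn't know about
function mapAddressTags( tags, doc ){
  Object.keys( tags ).forEach( function( tag ){
    var key = ADDRESS_SCHEMA[ tag ];
    if( key !== undefined && tags[ tag ] ){
      doc.address[ key ] = tags[ tag ];
    }
  });
  return doc;
}

var doc = mapAddressTags(
  { 'naptan:Street': 'High Street', 'postal_code': '10999', 'building': 'yes' },
  { address: {} }
);
```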

@sevko
Contributor

sevko commented Mar 9, 2015

One function per file still seems foreign to me, since modules are supposed to be logical groupings of like functions. More importantly, with regards to the schema files, duplicating the same header comment in three places and having a file that looks like:

/**
  Attempt to map OSM address tags to the Pelias address schema.
  On the left is the OSM tag name, on the right is corresponding
  doc.address key for which it should be mapped to.
  eg. tags['naptan:Street'] -> doc.address['street']
  @ref: http://wiki.openstreetmap.org/wiki/NaPTAN
**/

var NAPTAN_SCHEMA = {
  'naptan:Street': 'street'
};

module.exports = NAPTAN_SCHEMA;

just seems smelly. At this point, though, two people disagree and this is relatively unimportant so I don't have much of an argument. Before I +1, have we run a staging (ie full planet) import with the latest commit(s) in this branch? I don't remember whether things like the admin-lookup integration were included in our last run.

var elasticsearch = require('pelias-dbclient'),
adminLookup = require('pelias-admin-lookup'),
suggester = require('pelias-suggester-pipeline'),
dbmapper = require('./stream/dbmapper');
Contributor

why are all the stream objects defined in this project required and saved as part of the osm object except dbmapper? Would we expect the client of this module to want access to all the other objects but not dbmapper, or adminLookup/suggester for that matter? Just curious what the distinction is here.

Member Author

There are pros/cons to this approach. The dbmapper is a fairly generic stream which is used to map pelias.model.Document objects to the syntax accepted by pelias/dbclient.

Exporting it makes it easier for 3rd party consumers to do that simple mapping, but it strictly targets the pelias index, meaning consumers cannot build an index of another name.

I would love to see this functionality abstracted out to either pelias/dbclient, pelias/model or another module which is responsible for this domain and re-usable by other importers; for that reason I chose not to export the stream, in order to avoid deprecating that public API in the future.

In regards to the adminLookup/suggester modules, they are already packaged and distributed by npm, why would we export them?
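A sketch of what parameterizing the index name could look like, addressing the limitation described above. The record shape and field names here are assumptions for illustration, not the real pelias/dbclient contract:

```javascript
// build a mapping function bound to a configurable index name,
// so consumers are no longer locked to a hard-coded 'pelias' index
function createDbMapper( indexName ){
  return function( doc ){
    return {
      _index: indexName, // configurable instead of hard-coded
      _type: doc.type,
      _id: doc.id,
      data: doc
    };
  };
}

var mapper = createDbMapper( 'pelias' );
var record = mapper( { id: 'node:123', type: 'osmnode', name: 'cafe' } );
```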

Member Author

see: 9d4edc3

Contributor

Thanks for the explanation.

Contributor

I think the database mapper stream should be moved into pelias-dbclient, since we're currently repeating that same bit of code in all of our importers. If we're still keen on keeping that package generalized for anyone's use, we should then re-publish it as an elasticsearch-bulk-indexer and rewrite pelias-dbclient as a lightweight wrapper for it.

@dianashk
Contributor

Overall, what a huge difference! Definitely a lot more streamlined and polished. Tests are a great addition. 👍

suggester.streams.suggester( suggester.generators )
]);
// run import if executed directly; but not if imported via require()
if( require.main === module ){
Contributor

I missed it yesterday, but I think it's important. index.js should be used strictly for exporting your package and shouldn't be directly executed. There should be an additional app.js / server.js file that requires index.js and executes it; that entry point then also serves as an example of usage for clients. I think this is the common expectation when looking at a node project, based on the following: index.js is the default file used when requiring a directory, hence index.js makes sense as the main export point for a module; npm start will default to node server.js if a server.js file is found in the package root, so a separate entry point for execution is expected.

Member Author

This is a fair comment, but unfortunately it will need to stay this way until we can update our dependents in pelias/vagrant and our build scripts, as they currently rely on executing the import via node index.js.

I'd be happy to open an issue to resolve this after we merge.

@dianashk
Contributor

+1

@sevko
Contributor

sevko commented Mar 10, 2015

@missinglink ,

Before I +1, have we run a staging (ie full planet) import with the latest commit(s) in this branch? I don't remember whether things like the admin-lookup integration were included in our last run.

@sevko
Contributor

sevko commented Mar 10, 2015

Well, code reviews are meant to cover changes that would result in objectively cleaner/better code in addition to purely functional improvements. We all clearly have strong opinions about the former, and (dis)agree about a number of them. I think it's good that we got a chance to discuss them.

missinglink added a commit that referenced this pull request Mar 10, 2015
openstreemap pipeline improvements
@missinglink missinglink merged commit 0a6f0c5 into master Mar 10, 2015
@missinglink
Member Author

@heffergm can you please test this in our dev environment? The experimental branch is now merged to master.

$ npm publish
npm http PUT https://registry.npmjs.org/pelias-openstreetmap
npm http 201 https://registry.npmjs.org/pelias-openstreetmap
+ pelias-openstreetmap@1.0.0

@sevko
Contributor

sevko commented Mar 10, 2015

In the future, should we test branches in imports before merging them to master? I did that with both geonames and quattroshapes. Seems like that's the best way to guarantee that only stable code hits master.
