
openstreemap pipeline improvements #27

Merged
merged 56 commits from experimental into master on Mar 10, 2015

Conversation

missinglink
Member

PENDING REVIEW

First of all; apologies for the very large PR. This work is the result of research and development of the best way to build the OSM import pipeline over the last 3-4 months.

The result is a far more refined, stable, and tested import pipeline which we can continue to build on going forward.

The current master branch (the target of this PR) has 4 major pain points:

  1. slow imports - importing the full planet file can take up to 20 days; the pipeline is not multi-core.
  2. failure to exit(0) - the imports sometimes fail to exit cleanly, making ops scripting difficult.
  3. "Recursive process.nextTick detected" - streams-related errors.
  4. code clarity/testing - the code is only partially tested and, in some places, messy.

In order to address [1] the following work was done, which resulted in drastic speed improvements:

  • implement golang pbf parser
  • implement admin-lookup
  • upgrade to the latest suggester-pipeline module
  • simplify features.js
  • remove osm_types.js

In order to fix [2], I simply:

  • fix stats.js module via unit tests, remove complexity

Issue [3] was resolved by fixing the above issues and:

  • upgrade all dependencies.
  • fix nodejs version inconsistencies to work with both 0.10 and 0.12.

Issue [4] was tackled by:

  • implement pelias-model
  • add doc.setAddress() and doc.getAddress() functions to pelias/model
  • general clean up
  • improve global module.exports, add tests.
  • configure code linting and precommit hook, lint everything.
  • make osm data mappers clearer and easier to write/modify
  • add a unit test for every module
  • add detailed comments at the top of complex streams to explain their purpose
  • add end-to-end system test
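As a rough illustration of the setAddress()/getAddress() additions mentioned above (the real pelias/model signatures may differ; names and behaviour here are assumptions), the pattern looks something like:

```javascript
// Hypothetical sketch of a pelias/model-style Document with address accessors.
// The actual pelias/model API may differ; this only illustrates the pattern.
function Document(){
  this.address = {};
}

// store a single address component; returns `this` for chaining
Document.prototype.setAddress = function( key, value ){
  this.address[ key ] = value;
  return this;
};

// read a single address component back
Document.prototype.getAddress = function( key ){
  return this.address[ key ];
};

var doc = new Document();
doc.setAddress( 'street', 'Oranienstrasse' ).setAddress( 'number', '142' );
```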

Some extra benefits we got along the way:

  • extract address data and store in ES
  • remove dependency on having quattroshapes indexed before running
  • resolve inconsistent ways count from subsequent import runs

@sevko
Contributor

sevko commented Feb 9, 2015

Before this gets merged, I'm curious how you identified the OSM parsing as our bottleneck.

@missinglink
Member Author

In the case of OSM data it's pretty obviously a bottleneck, as we parse ~2.5B nodes and only use ~20m of them, so the faster it runs the better off we will be.

Having said that I agree the other I/O operations still need better performance testing.

@sevko
Contributor

sevko commented Feb 9, 2015

Right, but I can parse an 800mb PBF in 3 minutes with straight up osm-pbf-parser. If we assumed that the rest of our pipeline ran instantaneously, wouldn't the entire import take that long? Since it doesn't, it seems like the delays are elsewhere. Am I missing something?

@missinglink
Member Author

That is a very good question.

@sevko
Contributor

sevko commented Feb 10, 2015

Let's figure that out and discuss how we should benchmark the importer before this gets merged.

cc @dianashk , @hkrishna

@sevko
Contributor

sevko commented Feb 21, 2015

Well, it turns out that I can parse and filter the planet PBF with pure JavaScript in ~17 hours. Still don't understand why the Go parser is making a significant difference, or why our imports were taking 3 weeks.

@heffergm

My 2p, based solely on my observations running both:

  • with the old setup, no activity on multiple cores (i.e. a single CPU at 100%) regardless of thread count, and a peak of 4k requests/s to ES
  • with the experimental setup, activity on all cores, and a peak of ~30k requests/s to ES

If I had to hazard a guess, there's a limitation in the old implementation that's blocking on request/response to ES, not necessarily parsing the pbf, and the fact that it's operating only on a single core effectively limits the throughput you're going to get.

@missinglink changed the title from "there be dragons here" to "here be dragons" on Feb 23, 2015
suggester = require('pelias-suggester-pipeline'),
dbmapper = require('./stream/dbmapper');

var osm = { pbf: {}, doc: {}, address: {}, tag: {} };
Contributor

I think it'd be cleaner if you replaced the osm object with simple variable declarations, ie:

var docConstructor = require('./stream/document_constructor');
var docDenormalizer = require('./stream/denormalizer');

It doesn't really seem to serve any purpose.

Member Author

osm is exported. the object hierarchy is intended to give readability to the exported API

Contributor

in my opinion this would work better as a class, especially since it's being exported.

Member Author

why? could you make a case please?

Contributor

When you export a class you can be a lot more explicit about the interface. Do things like define accessor methods if needed. Allow multiple instances, if that's a desired behavior, and explicitly indicate it isn't by making it a singleton. Maybe it's my OOP background, but exporting an anonymous object as the face of your package feels unfinished. With that said, I'm not extremely passionate about this and will not be super upset if you keep it as is.

Member Author

A 'class' in JS represents an instantiable construct for which each instance may hold state in its individual properties while sharing methods from a prototype.

This functionality is not required in this case and so we would end up exporting a singleton with no individual state or prototypal methods.

The current implementation is equivalent and forwards compatible if (hypothetically) we decided to export a singleton in the future.

The system is designed in a way that allows individual components to be easily added/removed from the import pipeline by simply excluding/including additional streams to the pipeline.

A good analogy is this:

#! /bin/bash
$> osm_parser | admin_lookup | elasticsearch;

if another developer would like to decompose that pipeline and fork an existing stream or add a new one then they can simply re-compose the pipeline as such:

#! /bin/bash
$> osm_parser | my_fork_of_admin_lookup | something_else | elasticsearch;

I would prefer if this repository is completely functional and stateless.

I think this specific discussion would be best continued outside of this PR.

@missinglink
Member Author

thanks for the review @sevko

  1. I was hoping we could remove that functionality completely; are you working on something like that?
  2. I don't see the value of merging these: there is one schema for names and one for addresses. I assume you mean merging the address ones? These are already merged for performance reasons here https://github.com/pelias/openstreetmap/blob/experimental/stream/tag_mapper.js#L11 but left separate for readability.
  3. It does not extract admin values; it simply copies them from the poi doc to the address doc so we don't have to look them up twice.

@sevko
Contributor

sevko commented Mar 9, 2015

  1. Nope. We still need to compute way center-points until we decide to import non-point geometries (lines, polygons) into elasticsearch, so it seems like those should stick around.
  2. They hinder navigability more than improve readability, in my opinion. Two of those files have only one key-value pair, and a lot of boilerplate (including a module.exports, which a .json file wouldn't need, and the same header comment duplicated in four places). The three address_ files can be simplified to:
{
    "karlsruhe": {
        "addr:housename": "name",
        "addr:housenumber": "number",
        "addr:street": "street",
        "addr:state": "state",
        "addr:postcode": "zip",
        "addr:city": "city",
        "addr:country": "country"
    },
    "naptan": {
        "naptan:Street": "street"
    },
    "osm": {
        "postal_code": "zip"
    }
}
  3. My mistake.

@missinglink
Member Author

  1. I think we should aim for a separate module, as this functionality may be useful to other modules in the future; the code itself is covered by a stream test but the lib is not unit tested. I personally feel it's out of scope of this PR, which can be merged independently of that issue, and it would add more complexity to an already large PR.
  2. What does "hinder navigability" mean? If you are saying that code with fewer characters is better then I strongly disagree. A JSON file cannot contain comments and would be far more confusing to other developers trying to collaborate.

@sevko
Contributor

sevko commented Mar 9, 2015

I think that if we want to make any more changes, now's as good a time as ever.

  1. In that case, at least compress them into one module. One function per file seems extreme.
  2. I mean navigating three files is more difficult than navigating one, especially when they contain hardly any content. Feel free to keep them as a .js file (indeed, losing the comment would be bad), but I don't see why each deserves its own module.

Also, seems like the config/ folder contains only a single file, which is somewhat similar to the files inside schema/. Maybe merge them into one directory?

@sevko
Contributor

sevko commented Mar 9, 2015

The easier a source tree is to navigate and read, the more likely we are to get external contributors. This seems like it'll confuse people (why so many files/directories for something so simple?). It did me, at least. @dianashk , what's your take?

@@ -16,4 +20,4 @@ module.exports = function( geometry ){
}

return;
Contributor

can we return false or null here?

Member Author

return; is equivalent to return undefined;, why are return false; or return null; better?

Contributor

Returning undefined implies the function is of type void, which it isn't. If there is ever an expectation of the function returning something, it shouldn't return void at any point. null would be an appropriate substitute for an object when one cannot be created.

Contributor

Agreed. It's a semantic difference. null implies a result could not be computed, while undefined implies a variable wasn't initialized.

Member Author

Are we talking about javascript here or another language? How is this function of type 'void'?

If we are talking types in js, then exporting something which is an object seems a bit odd to me.
re: typeof null == "object".

Since you feel strongly about this, I am happy to change to accommodate.
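A tiny sketch of the convention settled on in this thread, using a hypothetical geometry helper (invented for illustration, not the actual code under review): return null explicitly when a result cannot be computed, rather than falling off the end of the function.

```javascript
// hypothetical centroid helper: returns [lon, lat] or null, never undefined
function centroid( points ){
  if( !Array.isArray( points ) || points.length === 0 ){
    return null; // explicit "no result could be computed"
  }
  var sum = points.reduce( function( acc, p ){
    return [ acc[0] + p[0], acc[1] + p[1] ];
  }, [ 0, 0 ] );
  return [ sum[0] / points.length, sum[1] / points.length ];
}
```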

@dianashk
Contributor

dianashk commented Mar 9, 2015

To address the earlier discussion:

  1. With regard to combining/moving the geoJson*.js util contents, I find the current split clean and modular. I can see the reasoning for splitting little utilities like that up into separate files. Grouping them under the util directory hides the perceived complexity of too many files, while having them in separate files means you don't have to scroll through a wall of code to find a small function not at all related to the rest of the file contents. It's not unlike node_modules/ hiding all the various dependencies.
  2. For the same reason as (1), I don't mind the schema breakup. There is a logical split between those schema items, and having them in separate files highlights that to the reader. I'm not offended by this. What could be done to make it clear that the various address parts should be merged, and to hide that complexity, is to add an index.js to that directory. Then require the schema directory instead of the individual files.

schema/index.js

module.exports.NAME_SCHEMA = require('./name_osm');
module.exports.ADDRESS_SCHEMA = merge( true, false,
  require('./address_osm'),
  require('./address_naptan'),
  require('./address_karlsruhe')
);

stream/tag_mapper.js

var schema = require('../schema');
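For illustration, a merged address schema like the one above could be consumed in a mapper roughly like this (a sketch with assumed tag and key names, not the actual tag_mapper code):

```javascript
// a merged tag -> doc.address key schema, as the proposed schema/index.js would produce
var ADDRESS_SCHEMA = {
  'addr:street': 'street',
  'addr:housenumber': 'number',
  'naptan:Street': 'street',
  'postal_code': 'zip'
};

// copy any recognised OSM tags onto the document's address object,
// ignoring tags the schema doesn't know about
function mapAddressTags( tags, doc ){
  Object.keys( tags ).forEach( function( tag ){
    var key = ADDRESS_SCHEMA[ tag ];
    if( key !== undefined && tags[ tag ] ){
      doc.address[ key ] = tags[ tag ];
    }
  });
  return doc;
}

var doc = mapAddressTags(
  { 'naptan:Street': 'High Street', 'postal_code': '10999', 'building': 'yes' },
  { address: {} }
);
```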

@sevko
Contributor

sevko commented Mar 9, 2015

One function per file still seems foreign to me, since modules are supposed to be logical groupings of like functions. More importantly, with regards to the schema files, duplicating the same header comment in three places and having a file that looks like:

/**
  Attempt to map OSM address tags to the Pelias address schema.
  On the left is the OSM tag name, on the right is corresponding
  doc.address key for which it should be mapped to.
  eg. tags['naptan:Street'] -> doc.address['street']
  @ref: http://wiki.openstreetmap.org/wiki/NaPTAN
**/

var NAPTAN_SCHEMA = {
  'naptan:Street': 'street'
};

module.exports = NAPTAN_SCHEMA;

just seems smelly. At this point, though, two people disagree and this is relatively unimportant so I don't have much of an argument. Before I +1, have we run a staging (ie full planet) import with the latest commit(s) in this branch? I don't remember whether things like the admin-lookup integration were included in our last run.

var elasticsearch = require('pelias-dbclient'),
adminLookup = require('pelias-admin-lookup'),
suggester = require('pelias-suggester-pipeline'),
dbmapper = require('./stream/dbmapper');
Contributor

why are all the stream objects defined in this project required and saved as part of the osm object except dbmapper? Would we expect the client of this module to want access to all the other objects but not dbmapper, or adminLookup/suggester for that matter? Just curious what the distinction is here.

Member Author

There are pros/cons to this approach. The dbmapper is a fairly generic stream which is used to map pelias.model.Document objects to the syntax accepted by pelias/dbclient.

Exporting it makes it easier for 3rd party consumers to do that simple mapping, but it strictly targets the pelias index, meaning consumers cannot build an index of another name.

I would love to see this functionality abstracted out to either pelias/dbclient, pelias/model or another module which is responsible for this domain and re-usable by other importers; for that reason I chose not to export the stream, in order to avoid deprecating that public API in the future.

In regards to the adminLookup/suggester modules, they are already packaged and distributed by npm, why would we export them?
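A sketch of what parameterizing the index name could look like, addressing the limitation described above. The record shape and field names here are assumptions for illustration, not the real pelias/dbclient contract:

```javascript
// build a mapping function bound to a configurable index name,
// so consumers are no longer locked to a hard-coded 'pelias' index
function createDbMapper( indexName ){
  return function( doc ){
    return {
      _index: indexName, // configurable instead of hard-coded
      _type: doc.type,
      _id: doc.id,
      data: doc
    };
  };
}

var mapper = createDbMapper( 'pelias' );
var record = mapper( { id: 'node:123', type: 'osmnode', name: 'cafe' } );
```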

Member Author

see: 9d4edc3

Contributor

Thanks for the explanation.

Contributor

I think the database mapper stream should be moved into pelias-dbclient, since we're currently repeating that same bit of code in all of our importers. If we're still keen on keeping that package generalized for anyone's use, we should then re-publish it as an elasticsearch-bulk-indexer and rewrite pelias-dbclient as a lightweight wrapper for it.

@dianashk
Contributor

Overall, what a huge difference! Definitely a lot more streamlined and polished. Tests are a great addition. 👍

suggester.streams.suggester( suggester.generators )
]);
// run import if executed directly; but not if imported via require()
if( require.main === module ){
Contributor

I missed it yesterday, but I think it's important. index.js should be used strictly for exporting your package and shouldn't be directly executed. There should be an additional app.js / server.js file that requires index.js and executes it; that entry point then also serves as an example of usage for clients. I think this is the common expectation when looking at a node project, based on the following: index.js is the default file used when requiring a directory, hence index.js makes sense as the main export point for a module; npm start will default to node server.js if a server.js file is found in the package root, so a separate entry point for execution is expected.

Member Author

This is a fair comment, but unfortunately it will need to stay this way until we can update our dependents in pelias/vagrant and our build scripts, as they currently rely on executing the import via node index.js.

I'd be happy to open an issue to resolve this after we merge.

@dianashk
Contributor

+1

@sevko
Contributor

sevko commented Mar 10, 2015

@missinglink ,

Before I +1, have we run a staging (ie full planet) import with the latest commit(s) in this branch? I don't remember whether things like the admin-lookup integration were included in our last run.

@sevko
Contributor

sevko commented Mar 10, 2015

Well, code reviews are meant to cover changes that would result in objectively cleaner/better code in addition to purely functional improvements. We all clearly have strong opinions about the former, and (dis)agree about a number of them. I think it's good that we got a chance to discuss them.

missinglink added a commit that referenced this pull request Mar 10, 2015
openstreemap pipeline improvements
@missinglink missinglink merged commit 0a6f0c5 into master Mar 10, 2015
@missinglink
Member Author

@heffergm can you please test this in our dev environment? The experimental branch is now merged to master.

$ npm publish
npm http PUT https://registry.npmjs.org/pelias-openstreetmap
npm http 201 https://registry.npmjs.org/pelias-openstreetmap
+ pelias-openstreetmap@1.0.0

@sevko
Contributor

sevko commented Mar 10, 2015

In the future, should we test branches in imports before merging them to master? I did that with both geonames and quattroshapes. Seems like that's the best way to guarantee that only stable code hits master.
