openstreemap pipeline improvements #27
Conversation
Before this gets merged, I'm curious how you identified the OSM parsing as our bottleneck.
In the case of OSM data it's pretty obviously a bottleneck, as we parse ~2.5B nodes and only use ~20M of them, so the faster it runs the better off we will be. Having said that, I agree the other I/O operations still need better performance testing.
Right, but I can parse an 800mb PBF in 3 minutes with straight up
That is a very good question.
Well, it turns out that I can parse and filter the planet PBF with pure JavaScript in ~17 hours. Still don't understand why the Go parser is making a significant difference, or why our imports were taking 3 weeks.
My 2p, solely based on my observations running both:
If I had to hazard a guess, there's a limitation in the old implementation that's blocking on request/response to ES, not necessarily parsing the pbf, and the fact that it's operating only on a single core effectively limits the throughput you're going to get.
```js
    suggester = require('pelias-suggester-pipeline'),
    dbmapper = require('./stream/dbmapper');

var osm = { pbf: {}, doc: {}, address: {}, tag: {} };
```
I think it'd be cleaner if you replaced the `osm` object with simple variable declarations, ie:

```js
var docConstructor = require('./stream/document_constructor');
var docDenormalizer = require('./stream/denormalizer');
```

It doesn't really seem to serve any purpose.
`osm` is exported; the object hierarchy is intended to give readability to the exported API.
in my opinion this would work better as a class, especially since it's being exported.
why? could you make a case please?
When you export a class you can be a lot more explicit about the interface. Do things like define accessor methods if needed. Allow multiple instances, if that's a desired behavior, and explicitly indicate it isn't by making it a singleton. Maybe it's my OOP background, but exporting an anonymous object as the face of your package feels unfinished. With that said, I'm not extremely passionate about this and will not be super upset if you keep it as is.
A 'class' in JS represents an instantiable construct for which each instance may hold state for its individual properties while sharing methods from a prototype.
This functionality is not required in this case, so we would end up exporting a singleton with no individual state or prototypal methods.
The current implementation is equivalent and forward-compatible if (hypothetically) we decided to export a singleton in the future.
The system is designed in a way that allows individual components to be easily added/removed from the import pipeline by simply excluding/including additional streams to the pipeline.
A good analogy is this:

```sh
#! /bin/bash
$> osm_parser | admin_lookup | elasticsearch;
```

If another developer would like to decompose that pipeline and fork an existing stream or add a new one, then they can simply re-compose the pipeline as such:

```sh
#! /bin/bash
$> osm_parser | my_fork_of_admin_lookup | something_else | elasticsearch;
```
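The same composition, expressed with Node object streams, might look roughly like this (module names and paths below are illustrative only, not the actual package layout):

```js
// sketch of re-composing the pipeline; every name here is hypothetical
var osmParser = require('./stream/osm_parser');          // parser stage
var adminLookup = require('my-fork-of-admin-lookup');    // a forked stage
var somethingElse = require('./stream/something_else');  // an extra stage
var elasticsearch = require('pelias-dbclient');          // sink stage

osmParser()
  .pipe( adminLookup() )
  .pipe( somethingElse() )
  .pipe( elasticsearch() );
```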
I would prefer that this repository remain completely functional and stateless.
I think this specific discussion would be best continued outside of this PR.
thanks for the review @sevko
```json
{
  "karlsruhe": {
    "addr:housename": "name",
    "addr:housenumber": "number",
    "addr:street": "street",
    "addr:state": "state",
    "addr:postcode": "zip",
    "addr:city": "city",
    "addr:country": "country"
  },
  "naptan": {
    "naptan:Street": "street"
  },
  "osm": {
    "postal_code": "zip"
  }
}
```
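For illustration, here's a minimal sketch of how a mapping like this might be applied to raw OSM tags; the `mapTags` helper and the example values are hypothetical, not code from this PR:

```js
// copy values from raw OSM tags into a doc.address-style object,
// renaming keys according to a schema mapping like the one above
function mapTags( schema, tags ){
  var address = {};
  for( var tag in schema ){
    if( tags.hasOwnProperty( tag ) && tags[ tag ] !== '' ){
      address[ schema[ tag ] ] = tags[ tag ];
    }
  }
  return address;
}

var KARLSRUHE_SCHEMA = {
  'addr:street': 'street',
  'addr:housenumber': 'number'
};

// returns { street: 'Main St', number: '10' }
mapTags( KARLSRUHE_SCHEMA, { 'addr:street': 'Main St', 'addr:housenumber': '10' } );
```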
I think that if we want to make any more changes, now's as good a time as ever.
Also, seems like the
The easier a source tree is to navigate and read, the more likely we are to get external contributors. This seems like it'll confuse people (why so many files/directories for something so simple?). It did me, at least. @dianashk, what's your take?
```
@@ -16,4 +20,4 @@ module.exports = function( geometry ){
}

return;
```
can we return `false` or `null` here?
`return;` is equivalent to `return undefined;`, why are `return false;` or `return null;` better?
Returning `undefined` means the function is of type `void`, which it isn't. If there is ever an expectation of the function returning something, it shouldn't return void at any point. `null` would be an appropriate substitution for an object when one cannot be created.
Agreed. It's a semantic difference: `null` implies a result could not be computed, while `undefined` implies a variable wasn't initialized.
Are we talking about javascript here or another language? How is this function of type 'void'?
If we are talking `types` in js, then exporting something which is an object seems a bit odd to me, re: `typeof null == "object"`.
Since you feel strongly about this, I am happy to change to accommodate.
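For illustration, a simplified sketch of the change being agreed to, in a geometry-mapping function like the one under review (this is not the actual implementation, just the pattern):

```js
// simplified geometry mapper: return null (rather than a bare `return;`)
// when a result cannot be computed, so callers get an explicit signal
module.exports = function( geometry ){
  if( !geometry || !Array.isArray( geometry.coordinates ) ){
    return null;
  }
  return {
    lon: geometry.coordinates[0],
    lat: geometry.coordinates[1]
  };
};
```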
To address the earlier discussion:

```js
module.exports.NAME_SCHEMA = require('../schema/name_osm');
module.exports.ADDRESS_SCHEMA = merge( true, false,
  require('./address_osm'),
  require('./address_naptan'),
  require('./address_karlsruhe')
);
```
```js
var schema = require('../schema');
```
One function per file still seems foreign to me, since modules are supposed to be logical groupings of like functions. More importantly, with regards to the schema files, duplicating the same header comment in three places and having a file that looks like:

```js
/**
  Attempt to map OSM address tags to the Pelias address schema.

  On the left is the OSM tag name, on the right is the corresponding
  doc.address key to which it should be mapped.

  eg. tags['naptan:Street'] -> doc.address['street']

  @ref: http://wiki.openstreetmap.org/wiki/NaPTAN
**/

var NAPTAN_SCHEMA = {
  'naptan:Street': 'street'
};

module.exports = NAPTAN_SCHEMA;
```

just seems smelly. At this point, though, two people disagree and this is relatively unimportant, so I don't have much of an argument. Before I +1, have we run a staging (ie full planet) import with the latest commit(s) in this branch? I don't remember whether things like the
```js
var elasticsearch = require('pelias-dbclient'),
    adminLookup = require('pelias-admin-lookup'),
    suggester = require('pelias-suggester-pipeline'),
    dbmapper = require('./stream/dbmapper');
```
why are all the stream objects defined in this project required and saved as part of the `osm` object except `dbmapper`? Would we expect the client of this module to want access to all the other objects but not `dbmapper`, or `adminLookup`/`suggester` for that matter? Just curious what the distinction is here.
There are pros/cons to this approach. The `dbmapper` is a pretty generic stream which is used to map `pelias.model.Document` objects to the syntax that is accepted by `pelias/dbclient`.
Exporting it makes it easier for 3rd party consumers to do that simple mapping, but it strictly targets the `pelias` index, meaning consumers cannot build an index of another name.
I would love to see this functionality abstracted out to either `pelias/dbclient`, `pelias/model` or another module which is responsible for this domain and re-usable by other importers; for that reason I chose not to export the stream, in order to avoid deprecating that public API in the future.
In regards to the `adminLookup`/`suggester` modules, they are already packaged and distributed by `npm`, why would we export them?
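To make the trade-off concrete, here is a rough sketch of what a `dbmapper`-style stream can look like, using `through2`; the record fields, accessor names and the hard-coded `pelias` index name are illustrative assumptions, and that hard-coding is exactly the coupling described above:

```js
var through = require('through2');

// map model Document objects into the record format the db client expects;
// getType()/getId() are assumed accessors on the Document model, and the
// fixed index name is why consumers cannot target a differently-named index
module.exports = function(){
  return through.obj( function( doc, enc, next ){
    this.push({
      _index: 'pelias',
      _type: doc.getType(),
      _id: doc.getId(),
      data: doc
    });
    next();
  });
};
```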
see: 9d4edc3
Thanks for the explanation.
I think the database mapper stream should be moved into `pelias-dbclient`, since we're currently repeating that same bit of code in all of our importers. If we're still keen on keeping that package generalized for anyone's use, we should then re-publish it as an `elasticsearch-bulk-indexer` and rewrite `pelias-dbclient` as a lightweight wrapper for it.
Overall, what a huge difference! Definitely a lot more streamlined and polished. Tests are a great addition. 👍
```js
  suggester.streams.suggester( suggester.generators )
]);

// run import if executed directly; but not if imported via require()
if( require.main === module ){
```
I missed it yesterday, but I think it's important. `index.js` should be used strictly for exporting your package and shouldn't be directly executed. There should be an additional `app.js`/`server.js` file that requires `index.js` and executes. That entry point then also serves as an example of usage for clients. I think this is the common expectation when looking at a node project, and is based on the following: `index.js` is the default file used when `require`ing a directory, hence `index.js` makes sense as the main export point for a module; `npm start` will default to `node server.js` if a `server.js` file is found in the package root, so a separate entry point for execution is expected.
This is a fair comment, but unfortunately it will need to stay this way until we can update our dependants in `pelias/vagrant` and our build scripts, as they currently rely on executing the import via `node index.js`.
I'd be happy to open an issue to resolve this after we merge.
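For reference, a minimal sketch of the split being suggested; the file contents and the `importPipeline` name are illustrative only, since the change is being deferred:

```js
// --- index.js (exports only; requiring it never starts an import) ---
// the importPipeline path/API below is hypothetical
module.exports = {
  importPipeline: require('./stream/importPipeline')
};

// --- server.js / app.js (executable entry point; also a usage example) ---
var osm = require('./index');
osm.importPipeline.import();
```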
+1
Well, code reviews are meant to cover changes that would result in objectively cleaner/better code in addition to purely functional improvements. We all clearly have strong opinions about the former, and (dis)agree about a number of them. I think it's good that we got a chance to discuss them.
openstreemap pipeline improvements
@heffergm can you please test this in our dev environment. experimental branch is now merged to master.

```sh
$ npm publish
npm http PUT https://registry.npmjs.org/pelias-openstreetmap
npm http 201 https://registry.npmjs.org/pelias-openstreetmap
+ pelias-openstreetmap@1.0.0
```
In the future, should we test branches in importers before merging them to master? I did that with both geonames and quattroshapes. Seems like that's the best way to guarantee that only stable code hits `master`.
PENDING REVIEW

First of all, apologies for the very large PR. This work is the result of research and development of the best way to build the OSM import pipeline over the last 3-4 months.

The result is a far more refined, stable and tested import pipeline which we can continue to build on going forward.

The current master branch (the target of this PR) has 4 major pain points:

1. slow imports - it can take up to 20 days to import the full planet file; it's not multi-core.
2. failure to `exit(0)` - sometimes the imports fail to exit cleanly, making ops scripting difficult.
3. `Recursive process.nextTick detected` - streams-related errors.
4. code clarity/testing - the code is only partially tested and, in some places, it's messy.

In order to address [1], the following work was done, which resulted in drastic speed improvements:

- `golang` pbf parser

In order to fix [2], I simply:

Issue [3] was resolved by fixing the above issues and:

- `0.10` and `0.12`

Issue [4] was tackled by:

Some extra benefits we got along the way: