
Use a single stream for importing records #119

Merged · 3 commits · Aug 8, 2016
Conversation

@orangejulius (Member) commented Aug 3, 2016

Previously, the WOF importer loaded all records into memory in one
stream, and then processed and indexed the records in Elasticsearch in a
second stream after the first stream was done.

This has several problems:

  • It requires that all data fit into memory. While this is not
    so bad for WOF admin data, where a reasonably new machine can handle
    things just fine, it's horrible for venue data, where there are already
    tens of millions of records, with many more likely in the future.
  • It's slower: by separating the disk and network I/O sections, they
    can't be interleaved to speed things up.
  • It doesn't give good feedback when running the importer that something
    is happening: the importer sits for several minutes loading records
    before the dbclient progress logs start displaying.

This change fixes all those issues, by processing all records in a
single stream, starting at the highest hierarchy level, and finishing at
the lowest, so that all records always have the admin data they need to
be processed.

A change like this is necessary to support Who's on First venues, and in fact this code has already been tested by importing about 1M venues from California!

Fixes #101
Connects #7 (it doesn't quite fix it, for that we need to be able to not even store all admin areas at once, for example to import geometries)
Connects #94

This new importer style requires records to be imported starting at the
top of the hierarchy and working on down.
@orangejulius orangejulius added this to the WOF Venues milestone Aug 3, 2016
@orangejulius orangejulius self-assigned this Aug 3, 2016
@orangejulius orangejulius force-pushed the single-stream branch 2 times, most recently from 1bd19ca to 5f56eb9 on August 3, 2016 20:52
// how to convert WOF records to Pelias Documents
var documentGenerator = peliasDocGenerators.create(
  hierarchyFinder.hierarchies_walker(wofRecords));
var readStream = readStream.create(directory, types, wofAdminRecords);
Contributor

It's a bit confusing that you redefine readStream here and assign readStream.create() to it. Would be great if this variable or the one in the require block had a different name.

Member Author

oh, doh! good catch
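For illustration, the shadowing problem and the likely shape of the fix (all names and the stub below are hypothetical stand-ins, not the actual importer code):

```javascript
// Hypothetical stand-in for the module whose create() builds the read stream
const readStreamModule = {
  create: (directory, types, cache) => 'stream(' + directory + ')' // stub
};

// Before: `var readStream = readStream.create(...)` reused one name for both
// the required module and the stream instance, shadowing the module.
// After: distinct names keep both accessible and the code unambiguous.
const readStream = readStreamModule.create('/data/wof', ['region'], {});
console.log(readStream);
```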

@dianashk (Contributor) commented Aug 3, 2016

Other than the one confusing variable name, code looks solid.

@orangejulius (Member, Author)

Variable name is fixed!
I also did some timing of this branch vs master. With a freshly emptied and restarted Elasticsearch each time, master runs in 9m35s, and this branch runs in 6m37s! Also worth noting there are now 420k admin records in Who's on First!

@dianashk (Contributor) commented Aug 4, 2016

:shipit:

@missinglink (Member)

:shipit:

@orangejulius orangejulius merged commit 201eb0d into master Aug 8, 2016
@orangejulius orangejulius deleted the single-stream branch August 8, 2016 20:22
orangejulius added a commit that referenced this pull request Aug 16, 2016
I somehow messed up the order when working on #119.
Since `county` records were being loaded and processed before
`macrocounty` records, it's possible that some records were missing the
`macrocounty` hierarchy elements.
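The ordering constraint that commit describes can be sketched as a simple check (the placetype list here is illustrative; the importer's actual list may differ):

```javascript
// Placetypes must be listed from the highest hierarchy level down, so every
// record's parents are already loaded and cached when the record is processed.
const placetypeOrder = [
  'country', 'macroregion', 'region', 'macrocounty', 'county',
  'locality', 'neighbourhood'
];

// The bug described above: processing 'county' before 'macrocounty' means a
// county's macrocounty parent may not be in the cache when it is looked up.
function isValidOrder(order) {
  return order.indexOf('macrocounty') < order.indexOf('county');
}

console.log(isValidOrder(placetypeOrder)); // true
console.log(isValidOrder(['county', 'macrocounty'])); // false
```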
@orangejulius orangejulius mentioned this pull request Aug 16, 2016
Successfully merging this pull request may close these issues.

Show progress when initially loading data
3 participants