
Use a single stream for importing records #119

Merged · 3 commits · Aug 8, 2016
Conversation

@orangejulius (Member) commented Aug 3, 2016

Previously, the WOF importer loaded all records into memory in one
stream, and then processed and indexed the records in Elasticsearch in a
second stream after the first stream was done.

This has several problems:

  • It requires that all data fit into memory. While this is not
    so bad for WOF admin data, where a reasonably new machine can handle
    things just fine, it's horrible for venue data, where there are already
    tens of millions of records, with many more likely in the future.
  • It's slower: by separating the disk and network I/O sections, they
    can't be interleaved to speed things up.
  • It doesn't give good feedback when running the importer that something
    is happening: the importer sits for several minutes loading records
    before the dbclient progress logs start displaying.

This change fixes all those issues, by processing all records in a
single stream, starting at the highest hierarchy level, and finishing at
the lowest, so that all records always have the admin data they need to
be processed.

A change like this is necessary to support Who's on First venues, and in fact this code has already been tested by importing about 1M venues from California!

Fixes #101
Connects #7 (it doesn't quite fix it, for that we need to be able to not even store all admin areas at once, for example to import geometries)
Connects #94

This new importer style requires records to be imported starting at the
top of the hierarchy and working on down.
@orangejulius orangejulius added this to the WOF Venues milestone Aug 3, 2016
@orangejulius orangejulius self-assigned this Aug 3, 2016
@orangejulius orangejulius force-pushed the single-stream branch 2 times, most recently from 1bd19ca to 5f56eb9 on August 3, 2016 20:52
// how to convert WOF records to Pelias Documents
var documentGenerator = peliasDocGenerators.create(
  hierarchyFinder.hierarchies_walker(wofRecords));
var readStream = readStream.create(directory, types, wofAdminRecords);
Contributor

It's a bit confusing that you redefine readStream here and assign readStream.create() to it. Would be great if this variable or the one in the require block had a different name.

Member Author

oh, doh! good catch
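For illustration, the shadowing problem and the likely shape of the fix (all names and the stub below are hypothetical stand-ins, not the actual importer code):

```javascript
// Hypothetical stand-in for the module whose create() builds the read stream
const readStreamModule = {
  create: (directory, types, cache) => 'stream(' + directory + ')' // stub
};

// Before: `var readStream = readStream.create(...)` reused one name for both
// the required module and the stream instance, shadowing the module.
// After: distinct names keep both accessible and the code unambiguous.
const readStream = readStreamModule.create('/data/wof', ['region'], {});
console.log(readStream);
```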

@dianashk (Contributor) commented Aug 3, 2016

Other than the one confusing variable name, code looks solid.

@orangejulius (Member, Author)

Variable name is fixed!
I also did some timing of this branch vs master. With a freshly emptied and restarted Elasticsearch each time, master runs in 9m35s, and this branch runs in 6m37s! Also worth noting there are now 420k admin records in Who's on First!

@dianashk (Contributor) commented Aug 4, 2016

:shipit:

@missinglink (Member)

:shipit:

@orangejulius orangejulius merged commit 201eb0d into master Aug 8, 2016
@orangejulius orangejulius deleted the single-stream branch August 8, 2016 20:22
orangejulius added a commit that referenced this pull request Aug 16, 2016
I somehow messed up the order when working on #119.
Since `county` records were being loaded and processed before
`macrocounty` records, it's possible that some records were missing the
`macrocounty` hierarchy elements.
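The ordering constraint that commit describes can be sketched as a simple check (the placetype list here is illustrative; the importer's actual list may differ):

```javascript
// Placetypes must be listed from the highest hierarchy level down, so every
// record's parents are already loaded and cached when the record is processed.
const placetypeOrder = [
  'country', 'macroregion', 'region', 'macrocounty', 'county',
  'locality', 'neighbourhood'
];

// The bug described above: processing 'county' before 'macrocounty' means a
// county's macrocounty parent may not be in the cache when it is looked up.
function isValidOrder(order) {
  return order.indexOf('macrocounty') < order.indexOf('county');
}

console.log(isValidOrder(placetypeOrder)); // true
console.log(isValidOrder(['county', 'macrocounty'])); // false
```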
@orangejulius orangejulius mentioned this pull request Aug 16, 2016
Successfully merging this pull request may close these issues.

Show progress when initially loading data
3 participants