
Store fewer objects in memory during import #7

Open
orangejulius opened this issue Nov 10, 2015 · 0 comments

Comments


orangejulius commented Nov 10, 2015

Right now this importer stores an object for every single WOF item in memory. While that object doesn't have all the fields from the WOF item in it (polygon data, most notably, isn't stored), it still takes up quite a bit of memory.

As WOF grows, or as this importer loads more fields, it will almost certainly brush up against the Node.js memory limit (which currently defaults to 2GB but is adjustable with a flag). It's probably close enough to that limit now that performance is suffering.

Thanks to #119, we can now import all venues, because only the hierarchy records have to stay in memory. For now there isn't enough data there to cause problems. But someday there might be.

When that happens we'll probably want to process items individually, and load all the required parent items on demand, without storing them, or perhaps only storing a certain number.
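The "only storing a certain number" idea above could be implemented with a small least-recently-used cache for parent records. A minimal sketch in plain JavaScript, where the `LRUCache` class, its size, and the record shape are illustrative assumptions rather than part of this importer:

```javascript
// Sketch: cap the number of parent records held in memory with an LRU cache.
// Parents would be loaded from disk on a cache miss; only `maxSize` of them
// stay resident, so memory use is bounded regardless of how large WOF grows.
class LRUCache {
  constructor(maxSize) {
    this.maxSize = maxSize;
    this.map = new Map(); // a Map iterates in insertion order
  }

  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    // re-insert to mark this entry as most recently used
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }

  set(key, value) {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxSize) {
      // evict the least recently used entry (first in iteration order)
      const oldest = this.map.keys().next().value;
      this.map.delete(oldest);
    }
  }
}
```

On a cache miss the importer would re-read the parent record from the WOF data directory, trading some disk I/O for a hard upper bound on memory.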

@riordan riordan modified the milestone: Who's on First Jan 11, 2016
@dianashk dianashk added Q3-2016 and removed Q2-2016 labels Jul 21, 2016
orangejulius added a commit that referenced this issue Aug 3, 2016
Previously, the WOF importer loaded all records into memory in one
stream, and then processed and indexed the records in Elasticsearch in a
second stream after the first stream was done.

This has several problems:
* It requires that all data fit into memory. While this is not
  _so_ bad for WOF admin data, where a reasonably new machine can handle
  things just fine, it's horrible for venue data, where there are already
  tens of millions of records.
* It's slower: by separating the disk and network I/O sections, they
  can't be interleaved to speed things up.
* It doesn't give good feedback that something is happening: the
  importer sits for several minutes loading records before the dbclient
  progress logs start displaying.

This change fixes all those issues, by processing all records in a
single stream, starting at the highest hierarchy level, and finishing at
the lowest, so that all records always have the admin data they need to
be processed.

Fixes #101
Fixes #7
Connects #94
orangejulius added a commit that referenced this issue Aug 3, 2016
orangejulius added a commit that referenced this issue Aug 3, 2016
orangejulius added a commit that referenced this issue Aug 4, 2016
@dianashk dianashk removed this from the WOF Venues milestone Aug 10, 2016
@orangejulius orangejulius removed their assignment Jul 27, 2017
@dianashk dianashk added this to the Importers milestone Aug 4, 2017