lazy parsing mode for csv plugin #919

Open
springmeyer opened this Issue Oct 18, 2011 · 2 comments

springmeyer commented Oct 18, 2011

The CSV plugin currently parses the entire document into a memory featureset upon datasource initialization.

The reasoning behind this approach is:

  • Be able to give full diagnostic error/warning output (rather than failing during rendering on a given row). This is particularly critical because csv files can be corrupt or invalid at any line.
  • Parsing each line with boost_escaped_list tokenizer is quite slow, so paying this cost up front is ideal for fast rendering
  • For featuresets in memory, rendering is very fast (faster than reading from mmapped shapefile) with csv files in the 1-15MB range
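The eager approach described above can be illustrated with a minimal Python sketch (the plugin itself is C++; this is not its code). The function name `parse_all`, the column names `x` and `y`, and the sample data are assumptions made for illustration. It parses every row once, keeps the results in memory, and collects a warning for each bad row so that problems surface at initialization rather than during rendering.

```python
import csv
import io

def parse_all(text, x_col="x", y_col="y"):
    """Eagerly parse every row; return (features, warnings)."""
    features, warnings = [], []
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        try:
            x, y = float(row[x_col]), float(row[y_col])
        except (KeyError, TypeError, ValueError) as exc:
            # Record the problem with its line number instead of failing later.
            warnings.append(f"line {reader.line_num}: skipped ({exc})")
            continue
        features.append({"x": x, "y": y, "attrs": row})
    return features, warnings

# Hypothetical sample: the second data row has an invalid x value.
SAMPLE = "x,y,name\n1.0,2.0,a\nbad,3.0,b\n4.0,5.0,c\n"
features, warnings = parse_all(SAMPLE)
```

Here `features` holds the two valid rows and `warnings` holds one entry pointing at line 3, which is the kind of full diagnostic output the first bullet refers to.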

The drawbacks of this approach are:

  • Initial parse times are very slow: about 1 second per 5 MB
  • In applications (like tilemill) where layer initialization is done repeatedly (and lazily), startup costs of more than 1-2 seconds noticeably degrade usability: https://github.com/mapbox/tilemill/issues/798
  • All attributes in the csv are parsed into memory, even when they may never be needed or referenced in the mapnik::query
  • All features are parsed and filtered even if their geometry does not intersect a given map query (the filtering is fast, but the storage is wasteful).

For CSVs over 10 MB, the plugin could switch to reading the csv rows lazily on every request. This would be much slower for rendering, but hopefully optimizations could be made to keep the speed reasonable:

  • csv file could be memory mapped
  • boost spirit could be used in place of boost_escaped_list tokenizer to try to speed up line tokenization.
  • geo-columns could perhaps be extracted from the line first (before full tokenization) and only if map intersection occurred would the entire line be parsed.
  • a quadtree spatial index could be used to enable faster skipping through lines that may not intersect a map query
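Two of these ideas, memory mapping the file and extracting the geo-columns before full tokenization, can be sketched in Python as follows. This is only an illustration under stated assumptions, not the plugin's implementation: the function `lazy_query` is hypothetical, the x and y values are assumed to be the first two columns and unquoted, and each record is assumed to occupy exactly one line.

```python
import csv
import mmap
import os
import tempfile

def lazy_query(path, bbox):
    """Yield fully parsed rows whose point lies in bbox = (minx, miny, maxx, maxy)."""
    minx, miny, maxx, maxy = bbox
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        header = next(csv.reader([mm.readline().decode("utf-8")]))
        for raw in iter(mm.readline, b""):
            # Cheap step: split off only the two leading geo-columns.
            parts = raw.split(b",", 2)
            try:
                x = float(parts[0].decode("ascii"))
                y = float(parts[1].decode("ascii"))
            except (IndexError, ValueError):
                continue  # an eager parser would have reported this row up front
            if not (minx <= x <= maxx and miny <= y <= maxy):
                continue
            # Expensive step: quote-aware tokenization, only for rows that hit.
            fields = next(csv.reader([raw.decode("utf-8")]))
            yield dict(zip(header, fields))

# Hypothetical data: one row lies outside the queried box.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write('x,y,name\n1.0,2.0,"a, inside"\n50.0,60.0,outside\n3.0,4.0,inside\n')
hits = list(lazy_query(tmp.name, (0, 0, 10, 10)))
os.remove(tmp.name)
```

The trade-off is the one noted in the issue: nothing is held in memory between requests, but every request scans the whole file, which is why a spatial index is listed as a further optimization.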
springmeyer commented Jun 3, 2015

"This would be much slower for rendering, but hopefully optimizations could be made to keep the speed reasonable"

Returning here to report that it's absolutely feasible and fast enough to 1) parse the entire file once to generate an rtree, and 2) then lazily read just the features needed. This is now being done in the GeoJSON plugin, and that landed (mostly) in 1263bc9#diff-8cf4ce9e2a4a1d18c080fde413f90868. We should now also do it with the CSV plugin.
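The two-step scheme (index once, then read lazily) might look roughly like the Python sketch below. To stay self-contained it uses a uniform grid in place of an rtree, so it shows the overall shape of the approach rather than the GeoJSON plugin's actual code. The names `build_index`, `query`, and `CELL` are hypothetical, and the sketch assumes well-formed rows with x and y as the first two unquoted columns and one record per line.

```python
import csv
import os
import tempfile
from collections import defaultdict

CELL = 10.0  # grid cell size in map units (arbitrary for this sketch)

def build_index(path):
    """One full pass over the file: map grid cell -> [(x, y, byte offset)]."""
    index = defaultdict(list)
    with open(path, "rb") as f:
        f.readline()  # skip the header row
        while True:
            offset = f.tell()
            raw = f.readline()
            if not raw:
                break
            x_str, y_str = raw.split(b",", 2)[:2]
            x, y = float(x_str.decode("ascii")), float(y_str.decode("ascii"))
            index[(int(x // CELL), int(y // CELL))].append((x, y, offset))
    return index

def query(path, index, bbox):
    """Fully parse only the rows whose point lies in bbox."""
    minx, miny, maxx, maxy = bbox
    offsets = []
    for cx in range(int(minx // CELL), int(maxx // CELL) + 1):
        for cy in range(int(miny // CELL), int(maxy // CELL) + 1):
            for x, y, off in index.get((cx, cy), ()):
                if minx <= x <= maxx and miny <= y <= maxy:
                    offsets.append(off)
    rows = []
    with open(path, "rb") as f:
        header = next(csv.reader([f.readline().decode("utf-8")]))
        for off in sorted(offsets):
            f.seek(off)  # jump straight to the matching line
            fields = next(csv.reader([f.readline().decode("utf-8")]))
            rows.append(dict(zip(header, fields)))
    return rows

# Hypothetical data: row "b" falls in a different grid cell and is never read.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("x,y,name\n1.0,2.0,a\n55.0,65.0,b\n3.0,4.0,c\n")
index = build_index(tmp.name)
rows = query(tmp.name, index, (0, 0, 10, 10))
os.remove(tmp.name)
```

Only coordinates and byte offsets stay in memory, so attribute values are tokenized on demand and only for features that a query actually touches.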

springmeyer added the post3x label Jun 3, 2015
