lazy parsing mode for csv plugin #919

Open
springmeyer opened this Issue · 2 comments

1 participant

@springmeyer
Mapnik member

The CSV plugin currently parses the entire document into an in-memory featureset when the datasource is initialized.

The reasoning behind this approach is:

  • Be able to give full diagnostic error/warning output (rather than failing during rendering on a given row). This is particularly critical because csv files can be corrupt or invalid at any line.
  • Parsing each line with the boost escaped_list tokenizer is quite slow, so paying this cost up front keeps rendering fast
  • For featuresets in memory, rendering is very fast (faster than reading from a mmapped shapefile) for CSV files in the 1-15 MB range
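To make the per-line cost concrete, here is a simplified, std-only sketch of what escaped-list tokenization does (splitting on commas while honouring double-quoted fields and backslash escapes). The actual plugin uses the Boost tokenizer; this illustration just shows the work being paid up front for every row:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified sketch of escaped-list CSV tokenization: split a line on
// commas, honouring double-quoted fields and backslash escapes. This is
// an illustration only, not the plugin's actual tokenizer.
std::vector<std::string> tokenize_csv_line(std::string const& line)
{
    std::vector<std::string> fields;
    std::string field;
    bool in_quotes = false;
    for (std::size_t i = 0; i < line.size(); ++i)
    {
        char c = line[i];
        if (c == '\\' && i + 1 < line.size()) // escaped character
        {
            field += line[++i];
        }
        else if (c == '"')                    // quote toggles state
        {
            in_quotes = !in_quotes;
        }
        else if (c == ',' && !in_quotes)      // field boundary
        {
            fields.push_back(field);
            field.clear();
        }
        else
        {
            field += c;
        }
    }
    fields.push_back(field);
    return fields;
}
```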

The drawbacks of this approach are:

  • Initial parse times are very slow - roughly 1 second per 5 MB
  • In applications (like TileMill) where layer initialization is done repeatedly (and lazily), startup costs of more than 1-2 seconds very noticeably degrade usability: mapbox/tilemill#798
  • All attributes in the CSV are parsed into memory, even when they may never be needed or referenced in the mapnik::query
  • All features are parsed/filtered even if their geometry may not intersect a given map query (filtering is fast, but the storage is wasteful)

For CSVs over 10 MB the plugin could switch to reading the rows lazily for every request. This would be much slower for rendering, but optimizations could hopefully keep the speed reasonable:

  • the CSV file could be memory mapped
  • boost spirit could be used in place of the boost escaped_list tokenizer to try to speed up line tokenization
  • the geometry columns could perhaps be extracted from each line first (before full tokenization), and the entire line parsed only if the geometry intersects the map query
  • a quadtree spatial index could be used to enable faster skipping over lines that may not intersect a map query
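The core of the lazy approach above can be sketched with nothing but the standard library: one cheap pass records the byte offset of each row, and a row is only re-read (and later tokenized) when a query actually touches it. The class and method names below are hypothetical illustrations, not the plugin's API, and the real version would use a memory-mapped file rather than seek/getline:

```cpp
#include <cassert>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical sketch of lazy row access: an index pass records where
// each line starts; rows are read (and could be tokenized) on demand.
class lazy_csv_reader
{
public:
    explicit lazy_csv_reader(std::string const& path)
        : file_(path)
    {
        // Index pass: record each line's start offset, no tokenization yet.
        std::string line;
        std::streampos pos = file_.tellg();
        while (std::getline(file_, line))
        {
            offsets_.push_back(pos);
            pos = file_.tellg();
        }
        file_.clear(); // reset EOF state so later seeks work
    }

    std::size_t size() const { return offsets_.size(); }

    // Read a single row on demand by seeking to its recorded offset.
    std::string read_row(std::size_t index)
    {
        file_.seekg(offsets_.at(index));
        std::string line;
        std::getline(file_, line);
        return line;
    }

private:
    std::ifstream file_;
    std::vector<std::streampos> offsets_;
};
```

The index pass is far cheaper than full tokenization, so datasource initialization stays fast; the tokenization cost moves to render time, which is exactly the trade-off discussed above.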
@springmeyer springmeyer was assigned
@springmeyer
Mapnik member

This would be much slower for rendering, but hopefully optimizations could be made to keep the speed reasonable

Returning here to report that it's absolutely feasible, and fast enough, to 1) parse the entire file once to generate an rtree, and 2) then lazily read just the features needed. This is now being done in the GeoJSON plugin and landed (mostly) in 1263bc9#diff-8cf4ce9e2a4a1d18c080fde413f90868. We should now do the same in the CSV plugin.
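The parse-once-then-index scheme can be sketched as follows: the single parse pass yields a (bounding box, byte offset) pair per feature, and a query then visits only the rows whose box intersects it. A linear scan over boxes stands in here for the packed boost::geometry rtree the GeoJSON plugin uses; the struct names are illustrative only:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Minimal axis-aligned bounding box with an intersection test.
struct box
{
    double minx, miny, maxx, maxy;
    bool intersects(box const& other) const
    {
        return !(other.minx > maxx || other.maxx < minx ||
                 other.miny > maxy || other.maxy < miny);
    }
};

// Illustrative feature index built during the one-time parse pass:
// each entry maps a feature's bbox to its byte offset in the file.
struct feature_index
{
    std::vector<std::pair<box, std::size_t>> entries;

    // Return offsets of rows whose bbox intersects the query box;
    // only these rows need to be re-read and fully parsed.
    std::vector<std::size_t> query(box const& q) const
    {
        std::vector<std::size_t> hits;
        for (auto const& e : entries)
            if (e.first.intersects(q)) hits.push_back(e.second);
        return hits;
    }
};
```

An rtree replaces the linear scan with a logarithmic-time lookup, which is what keeps this approach fast even for large files.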

@springmeyer springmeyer added the post3x label