Support GeoPandas Polygons and MultiPolygons #1285

ianthomas23 · 2023-10-05T11:50:52Z

This is a WIP to add support for rendering Polygons and MultiPolygons from GeoPandas GeoDataFrames. Fixes #1006.

The approach taken is to use .cx to obtain the subset of geometries within the rendered bounds and then use shapely.io.to_ragged_array to obtain contiguous arrays of coordinates and integer offsets. From here it is very similar to the existing SpatialPandas rendering code which uses numba to quickly traverse the geometries. There is no support for dask yet, nor any tests.

Obligatory pretty picture:

I have included three example notebooks:

natural_earth.ipynb. This uses "naturalearth.land" from geodatasets which is directly read using GeoPandas and converted to SpatialPandas. Outputs are the same for both and rendering times (after the first slower render that compiles the numba code) are about the same on my dev machine (M1 mac) at about 0.26 seconds for 600x1200 pixels.
nyc_buildings.ipynb. Uses the NYC buildings dataset which is used in SpatialPandas examples and has been pre-prepared for such. Download location and code to convert to GeoPandas parquet file is included at the top of the notebook. It demonstrates that the rendered outputs are the same. Time to render the whole dataset at 1000x1000 pixels is 1.6 s for GeoPandas and 0.7 s for SpatialPandas.
nyc_buildings_zoom.ipynb. The same as the previous notebook with a zoom_factor to render a zoomed area centred on the middle of the dataset. For a zoom_factor of 30 GeoPandas is faster at 27 ms compared to SpatialPandas at 39 ms.

Here is a table of the render times as a function of zoom_level for my dev machine.

zoom_level	GeoPandas	SpatialPandas
1	1.6 s	0.7 s
3	0.45 s	0.23 ms
10	65 ms	61 ms
30	27 ms	39 ms
100	16 ms	30 ms

So for this size of dataset we render GeoPandas slower than SpatialPandas at full resolution, but it scales better with zoom factor so that it is faster at high zoom levels.

I think this is explained by the two different approaches (SpatialPandas I understand, GeoPandas I don't yet). SpatialPandas reads the whole dataset from file as ragged arrays as that is how it is stored. This is fast. To render a subset of the data the ragged arrays are kept as they are and a boolean array is used to identify which polygons are within the bounds and need to be rendered. The rendering loop iterates over all of the polygons and uses the boolean flags to not bother rendering polygons that are not needed.

For GeoPandas and the shapely to_ragged_array conversion, the spatial subset is calculated before the ragged arrays are obtained, and the ragged arrays are dynamically generated. At full resolution the to_ragged_array takes over a second for me. When it comes to rendering there is no need for any boolean flags, we only have the polygons within the bounds and we render each and every one of them.

Pinging @jorisvandenbossche and @martinfleis to see if I have taken a poor approach here and to check if my analysis is correct. I did wonder if the format of saving the GeoPandas file is important here as I just used the default to_parquet options, whereas the SpatialPandas file is highly optimised for this use case.

datashader/glyphs/polygon.py

datashader/glyphs/points.py

jorisvandenbossche · 2023-10-05T22:58:55Z

Cool!
I can't say I understand everything exactly, but generally the code that interacts with geopandas looks good. I think your analysis of the differences sounds correct (at least it also matches with what I wrote in the issue, and indeed using cx here ensures we first filter the geometries before converting to ragged arrays).

I did wonder if the format of saving the GeoPandas file is important here as I just used the default to_parquet options, whereas the SpatialPandas file is highly optimised for this use case.

For the timings it shouldn't matter, because once the data is read into memory, the representation in GeoPandas itself is the same independent from the format of the parquet file.
For a full workflow of reading + visualizing, it can of course matter. But right now the defaults are fine, I think (that will store the geometries as WKB). With a project like https://github.com/geoarrow/geoarrow-python we can also read parquet file directly into geoarrow format (i.e. the ragged arrays). But for the initial GeoPandas support here, standard GeoParquet is fine.

martinfleis · 2023-10-06T08:04:46Z

Thanks for looking into this!

I have tried timing the performance of cx versus sindex.query resulting in the same dataframe and it seems that at least with the NY buildings dataset you used, there's a space for improvements.

from spatialpandas import io
import shapely

ny = io.parquet.read_parquet("/Users/martin/Downloads/nyc_buildings.parq").to_geopandas()

x = -8230000
y = 4960000
offset = 10

%%timeit
ny.cx[x-offset:x+offset, y-offset:y+offset]

%%timeit
ny.iloc[ny.sindex.query(shapely.box(x-offset,y-offset, x+offset, y+offset))]

The results:

offset	cx	sindex	n_hits
10	29 ms ± 905 µs	349 µs ± 41.6 µs	3
50	28.1 ms ± 372 µs	312 µs ± 9.32 µs	29
250	28.8 ms ± 561 µs	346 µs ± 2.37 µs	572
1000	29.8 ms ± 714 µs	997 µs ± 41.3 µs	7740
5000	46.3 ms ± 130 µs	11.6 ms ± 238 µs	121858
10000	85.5 ms ± 1 ms	37.4 ms ± 1.07 ms	356570
50000	161 ms ± 3.61 ms	120 ms ± 1.71 ms	1157859 (all of them)

All the way to the extent covering the whole array, sindex solution seems to outperform cx. I am not sure how generalisable this is as there were some experiments around this in geopandas (geopandas/geopandas#1949, geopandas/geopandas#2671) that did not prove this to be always a better solution but development on the GEOS side in the meantime could have changed the situation. It is surely worth exploring (also on the geopandas side, if we should not use it in cx internally after all).

jorisvandenbossche · 2023-10-06T08:14:03Z

Interesting! For datashader, I would say that's something we should decide on the geopandas side, and then datashader can just keep using .cx? (I am not sure whether datashader would be able to make any smarter decision on what to use compared to geopandas)

jorisvandenbossche · 2023-10-06T08:38:12Z

@martinfleis to be fully equivalent, you need a predicate="intersects" in the sindex.query call, and checking my previous benchmarks locally, that's actually making a significant difference for the case when you are selecting a large part of the data. At the same time, that's also a small difference that datashader might not care about, and so potentially a reason for here to use sindex.query instead of cx.

martinfleis · 2023-10-06T08:46:36Z

True, there can be some minor differences as sindex just checks bounding boxes in this case. Though those are irrelevant for the usage in Datashader. And in the benchmark above, all the dfs are of equal shape, apart from 5000 where cx has one less polygons.

ianthomas23 · 2023-10-06T10:14:35Z

Thanks for both of your comments. I'll deal with Joris' suggestions and continue with dask-geopandas support which I suspect should be relatively straightforward.

Overall this seems really promising and the code additions are simple enough to be really low risk. So I'd be inclined to finish this off as soon as possible and merge it, with geopandas and dask-geopandas being optional dependencies. Then people can use it and we can work on improving the performance over time.

Beyond that it makes sense to also provide direct support for other geopandas geometry types (lines, points, etc) as they will be similar (and simpler) than this polygon support. Outside of Datashader there is probably work to be done in other HoloViz projects (HoloViews, GeoViews, hvPlot) but I'll need to speak to other HoloViz maintainers about this.

jbednar · 2023-10-06T18:50:58Z

This all sounds great! Thanks, everyone.

codecov · 2023-10-09T11:49:49Z

Codecov Report

Merging #1285 (0bddc97) into main (1c5df74) will increase coverage by 0.03%.
Report is 4 commits behind head on main.
The diff coverage is 88.39%.

@@            Coverage Diff             @@
##             main    #1285      +/-   ##
==========================================
+ Coverage   85.74%   85.77%   +0.03%     
==========================================
  Files          52       52              
  Lines       10870    10974     +104     
==========================================
+ Hits         9320     9413      +93     
- Misses       1550     1561      +11

Files	Coverage Δ
datashader/glyphs/polygon.py	`96.15% <96.55%> (+1.34%)`	⬆️
datashader/utils.py	`76.72% <66.66%> (+0.08%)`	⬆️
datashader/glyphs/points.py	`87.68% <82.35%> (-0.62%)`	⬇️
datashader/core.py	`87.74% <80.64%> (-0.52%)`	⬇️

... and 1 file with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

ianthomas23 · 2023-10-09T15:50:13Z

This is all working now, giving exactly the same results as SpatialPandas using either GeoPandas or Dask-GeoPandas. I have moved the imports from the top of polygon.py to where it is used (thanks to @maximlt and @Hoxbro for this recommendation).

I have kept the use of .cx for the spatial filtering rather than sindex.query but I am happy to be led by GeoPandas maintainers in this going forward.

There is no change to the project install dependencies, but I have added a new set of optional dependencies so that users will be able to pip install datashader[geopandas] to get both geopandas and dask-geopandas installed at the same time. I have explicitly pinned shapely >= 2.0.0 so that .to_ragged_array is available and also added a runtime check just before we call it, in case an older version is installed.

I haven't yet added a new page to the docs about this. I was thinking that the best approach would be to add the Point and Line support and then write a new docs page about them all together.

There will indeed need to be some changes in HoloViews to support GeoPandas directly, as at the moment any attempt to pass a GeoPandas.GeoDataFrame from HoloViews to Datashader will force a conversion to SpatialPandas. Probably the best timing for this change is when the Point and Line support has been added to Datashader.

jorisvandenbossche · 2023-10-10T17:23:47Z

I have kept the use of .cx for the spatial filtering rather than sindex.query but I am happy to be led by GeoPandas maintainers in this going forward.

The upside of .cx is that it works for both geopandas and dask-geopandas at the same time. But if you want the fastest possible option for geopandas.GeoDataFrame, the .sindex.query approach will typically be a bit faster.

ianthomas23 · 2023-10-12T09:42:41Z

I have switched to using .sindex.query for geopandas and kept .cx for dask-geopandas. It is more complicated, partly because you cannot pass None for unspecified bounds but it seems that -np.inf and +np.inf are OK.

…euse

ianthomas23 · 2023-10-16T08:58:29Z

Merging as is. Support for lines and points to follow shortly.

ianthomas23 added 2 commits October 5, 2023 10:46

Add rendering of GeoPandas polygons and multipolygons via ragged arrays

08c604e

Add example notebooks

ebd0f7e

ianthomas23 added enhancement in progress labels Oct 5, 2023

ianthomas23 mentioned this pull request Oct 5, 2023

Support GeoPandas GeoDataFrames #1006

Closed

jorisvandenbossche reviewed Oct 5, 2023

View reviewed changes

datashader/glyphs/polygon.py Outdated Show resolved Hide resolved

datashader/glyphs/points.py Outdated Show resolved Hide resolved

ianthomas23 added 7 commits October 9, 2023 09:36

Correct scope of shapely function call

415513f

Calculate geopandas bounds once only

126d62e

Support dask-geopandas and use lazy imports

92c53eb

Add optional dependencies so can "pip install datashader[geopandas]"

9c7f1aa

Explicit Shapely version check

a2180ed

Add tests comparing geopandas, dask-geopandas and spatialpandas

a81dbac

Remove demo notebooks

bf8d91a

ianthomas23 removed the in progress label Oct 9, 2023

Use sindex.query instead of cx for GeoPandas.GeoDataFrame

761f007

ianthomas23 added this to the v0.15.3 milestone Oct 13, 2023

ianthomas23 added 2 commits October 13, 2023 13:34

Better error reporting

d9ad50f

Move geopandas check and spatial filtering to separate function for r…

0bddc97

…euse

ianthomas23 merged commit 9e20700 into main Oct 16, 2023
16 checks passed

ianthomas23 deleted the geopandas_2023 branch October 16, 2023 08:58

This was referenced Oct 18, 2023

Add support for GeoPandas lines and points #1293

Merged

Support GeoPandas passthrough to Datashader holoviz/holoviews#5951

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support GeoPandas Polygons and MultiPolygons #1285

Support GeoPandas Polygons and MultiPolygons #1285

ianthomas23 commented Oct 5, 2023

jorisvandenbossche commented Oct 5, 2023

martinfleis commented Oct 6, 2023

jorisvandenbossche commented Oct 6, 2023

jorisvandenbossche commented Oct 6, 2023

martinfleis commented Oct 6, 2023 •

edited

ianthomas23 commented Oct 6, 2023

jbednar commented Oct 6, 2023

codecov bot commented Oct 9, 2023 •

edited

ianthomas23 commented Oct 9, 2023

jorisvandenbossche commented Oct 10, 2023

ianthomas23 commented Oct 12, 2023

ianthomas23 commented Oct 16, 2023

Support GeoPandas Polygons and MultiPolygons #1285

Support GeoPandas Polygons and MultiPolygons #1285

Conversation

ianthomas23 commented Oct 5, 2023

jorisvandenbossche commented Oct 5, 2023

martinfleis commented Oct 6, 2023

jorisvandenbossche commented Oct 6, 2023

jorisvandenbossche commented Oct 6, 2023

martinfleis commented Oct 6, 2023 • edited

ianthomas23 commented Oct 6, 2023

jbednar commented Oct 6, 2023

codecov bot commented Oct 9, 2023 • edited

Codecov Report

ianthomas23 commented Oct 9, 2023

jorisvandenbossche commented Oct 10, 2023

ianthomas23 commented Oct 12, 2023

ianthomas23 commented Oct 16, 2023

martinfleis commented Oct 6, 2023 •

edited

codecov bot commented Oct 9, 2023 •

edited