Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use dicts and sets in to get junctions #76

Merged
merged 1 commit into from
May 6, 2020
Merged

Conversation

martinfleis
Copy link
Contributor

@martinfleis martinfleis commented May 2, 2020

Hi,

I tried to implement similar algorithm libpysal is using, to determine contiguity to get junctions and it seems to be faster than numpy solution in #75. It is based on performant dict/set.

gdf = gpd.read_file(gpd.datasets.get_path('nybb'))

%%timeit
topojson.Topology(gdf, shared_paths='dict')
# 1.77 s ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
topojson.Topology(gdf, shared_paths='shapely')
# 5.94 s ± 376 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
topojson.Topology(gdf, shared_paths='numpy')
# 3.72 s ± 133 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using naturalearth_lowres all versions seems to be equal.

1.52 s ± 92.8 ms per loop,
1.64 s ± 112 ms per loop
1.45 s ± 47.9 ms per loop

I haven't tested the correctness of resulting junctions, but simplified gdf looks alright.

The whole diff is between lines 156-189. It is based on #75 to have a comparison.

edit: it is wrong at the moment :) working on it

@martinfleis
Copy link
Contributor Author

Fixed. Performance for nybb remained similar, for naturaleart_lowres there is a boost to 964 ms ± 107 ms per loop.

@martinfleis martinfleis changed the base branch from master to fast_shared_paths May 2, 2020 18:56
@martinfleis
Copy link
Contributor Author

Another possible speed-up - in Extract you generate shapely.LineStrings from which are then (in my version and numpy version) extracted coordinates only. That is expensive, so I would generate shapely.geometry either once we need it for shared_paths or after Join.

@mattijn
Copy link
Owner

mattijn commented May 2, 2020

It's super great that you found another approach. I will check it out soon. Hopefully tomorrow otherwise early on in next week.

The package started with shapely/geos as base for most of the core functions (shared_paths, split, simplify), but now better variants are appearing in almost all steps, so it's good observation that there are occasions when linestrings are shapely geometries, sometimes numpy arrays or (list of) lists.

There is definitely improvement possible based on the input options for different routes within the package.

Again, much appreciated that you really dived into the code.

@mattijn
Copy link
Owner

mattijn commented May 5, 2020

Regarding

Another possible speed-up - in Extract you generate shapely.LineStrings from which are then (in my version and numpy version) extracted coordinates only. That is expensive, so I would generate shapely.geometry either once we need it for shared_paths or after Join.

You are right about the Extract class. I improved the serialization of geopandas objects in #77. Now I take the shapely objects directly from the geodataframe instead of taking the rountrip geodataframe > __geo_interface__ > featurecollection > geometrycollection > shapely objects.

I think this is fine than as well for the routine you introduced here in Join. I think your approach of geom.coords is the fastest compare to geom.xy or geom.coords.xy, but not completely sure.

Based on your results I will remove the numpy-based junction detection route from this PR, to avoid confusion.
Since this routine only detects junctions if the coordinate exist in both linestrings, can also improve the Cut class, since there is no need to introduce non-existing vertices halfway linestrings.

@martinfleis martinfleis changed the base branch from fast_shared_paths to master May 6, 2020 17:40
@martinfleis martinfleis marked this pull request as ready for review May 6, 2020 17:40
@martinfleis
Copy link
Contributor Author

I have cleaned the PR to include only dict-based method, and not your numpy one.

I think your approach of geom.coords is the fastest compare to geom.xy or geom.coords.xy, but not completely sure.

According to my timings it is approximately 10x faster than .xy.

Since this routine only detects junctions if the coordinate exist in both linestrings, can also improve the Cut class

That would be great! I haven't looked at the logic of Cut.

One more thing to consider is the default options for shared_paths. I would say that this one should be the default as it is way faster and will work in most of the cases, but the decision is up to you. (I am happy to update PR if you want coords to be default).

@mattijn mattijn merged commit b6508ec into mattijn:master May 6, 2020
@mattijn
Copy link
Owner

mattijn commented May 6, 2020

I agree that this proposed route should become default. Maybe have to revisit the naming of the settings once have worked up the other objects.

@mattijn
Copy link
Owner

mattijn commented May 6, 2020

Oh btw, I accepted the PR. Awesome!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants