
Fix: Linestrings that follow the same path but where one contains extra redundant points are not deduplicated #192

Merged
merged 7 commits into from
Sep 8, 2022

Conversation

@theroggy (Contributor) commented Sep 5, 2022

Fixes #191

@theroggy changed the title from "Fix" to "Fix: Linestrings that follow the same path but where one contains extra redundant points are not deduplicated" Sep 5, 2022
@mattijn (Owner) commented Sep 6, 2022

Interesting approach. So, simplify(0) will not change the shape of the arc, but it can remove points that lie in line with the rest of the arc.
Is it cheap in time?
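For illustration, a minimal sketch of this behaviour (assuming Shapely's Douglas-Peucker simplify; this snippet is not part of the PR): with tolerance 0, only vertices whose perpendicular distance to the surrounding chord is exactly 0 are dropped, i.e. collinear points, so the shape is preserved.

```python
from shapely.geometry import LineString

# A linestring with a redundant collinear point at (1, 1).
line = LineString([(0, 0), (1, 1), (2, 2), (3, 1)])

# Tolerance 0: Douglas-Peucker removes only points with zero
# perpendicular distance to the chord, so the arc keeps its shape.
simplified = line.simplify(0, preserve_topology=False)
print(list(simplified.coords))  # (1, 1) is gone, (2, 2) and (3, 1) survive
```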

@theroggy (Contributor, Author) commented Sep 6, 2022

> Interesting approach. So, simplify(0) will not change the shape of the arc, but it can remove points that lie in line with the rest of the arc. Is it cheap in time?

For the test file, processing time increases from 14.5 s to 16.5 s.

  • the simplify(0) takes ~0.8 s; the rest seems to be the extra conversions from numpy arrays to shapely objects and back. The conversions take more time than I anticipated, but nothing else was changed, so I don't see what else could be taking up the time.
  • the simplify now runs with preserve_topology=True. I think this could be preserve_topology=False, because I can't immediately think of a case where a simplify with tolerance 0 renders a line invalid. This lowers the simplify time to 0.3 s.
  • in the dedup step the numpy arrays are converted back to shapely objects to be able to apply the linemerge, I think... so possibly time could be saved by somehow avoiding the conversions back and forth... but I didn't really look into it in detail.

@theroggy (Contributor, Author) commented Sep 7, 2022

I had a quick further look at the third point:

  • conversion of data["linestring"] from np arrays to shapely objects takes 0.3s
  • conversion of data["linestring"] from shapely objects to np arrays takes 0.7s

It doesn't seem possible to avoid the conversions without rewriting quite a lot of code. And then the question is whether it will really be faster... So it doesn't seem like short-term, low-hanging fruit...

@theroggy theroggy marked this pull request as draft September 7, 2022 08:47
@theroggy (Contributor, Author) commented Sep 7, 2022

I implemented a numpy function to remove the collinear points. It has similar performance to simplify(0), but the extra conversions are avoided, so the test file can now be processed in ~15.2 s.

I also had a look at the simplify function in ops, but:

  • performance wasn't as good: 16.2 s
  • tolerance=0 disabled simplifying entirely, so 0.00000...01 had to be used
  • it triggered errors and invalid results
  • it would have meant creating a hard dependency on the simplification package (to avoid the conversions)...
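A minimal sketch of such a numpy collinearity filter (my reconstruction under stated assumptions, not necessarily the PR's actual remove_collinear_points): the 2D cross product of the segment vectors into and out of each interior vertex is zero exactly when the vertex is collinear with its neighbours. Note that this also flags a vertex where the line reverses back onto itself, a degenerate case this sketch ignores.

```python
import numpy as np

def remove_collinear_points(coords):
    """Drop interior vertices lying exactly on the line between
    their neighbours. coords is an (n, 2) float or int array."""
    if len(coords) < 3:
        return coords
    d_in = coords[1:-1] - coords[:-2]   # vector into each interior point
    d_out = coords[2:] - coords[1:-1]   # vector out of each interior point
    # 2D cross product is 0 when the two vectors are parallel,
    # i.e. when the interior point adds no shape information.
    cross = d_in[:, 0] * d_out[:, 1] - d_in[:, 1] * d_out[:, 0]
    keep = np.ones(len(coords), dtype=bool)
    keep[1:-1] = cross != 0
    return coords[keep]

coords = np.array([[0, 0], [1, 1], [2, 2], [3, 1]])
print(remove_collinear_points(coords))  # the collinear (1, 1) is removed
```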

@theroggy theroggy marked this pull request as ready for review September 7, 2022 11:53
@mattijn (Owner) commented Sep 7, 2022

Yes, I think a numpy approach would work as well, but from what I understand it's still being done line by line.
I think it's possible to do it in one go for all linestrings using broadcasting.

@theroggy (Contributor, Author) commented Sep 7, 2022

> Yes, I think a numpy approach would work as well, but from what I understand it's still being done line by line. I think it's possible to do it in one go for all linestrings using broadcasting.

I suppose it should be possible, but my search on the subject didn't turn up many ideas for vectorized solutions...

@mattijn (Owner) commented Sep 7, 2022

One way (as you said before, there are many roads to Rome):

  1. use as starting point the delta-encoded variant of the linestrings, as implemented here:
    def delta_encoding(linestrings):
  2. calculate the angle before and after each point.
  3. if these angles are equal, then the point in between can be categorized as redundant.
  4. index these to be removed or flagged in one go.
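The steps above could be sketched like this for a single linestring (a sketch only; `deltas` here is a hypothetical stand-in for the output of delta_encoding):

```python
import numpy as np

# Hypothetical delta-encoded linestring: per-segment coordinate differences.
deltas = np.array([[1, 1], [1, 1], [1, 0], [2, 0]], dtype=float)

# Step 2: direction angle of each segment.
angles = np.arctan2(deltas[:, 1], deltas[:, 0])

# Step 3: an interior point is redundant when the angles of the
# segments before and after it are equal.
redundant = np.isclose(np.diff(angles), 0)

# Step 4: a boolean mask over interior points, usable to drop them in one go.
print(redundant)
```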

@theroggy (Contributor, Author) commented Sep 7, 2022

Don't quite know why I didn't think of it sooner, but the current remove_collinear_points was actually quite easy to vectorize...

The profiler says remove_collinear_points now takes 0.28 s for the test file, versus 0.382 s with the unvectorized version... The total timing didn't really change, but that's probably because there is some fluctuation in the performance of my development machine (it's on shared infrastructure).

@mattijn (Owner) commented Sep 7, 2022

No problem! Once you get used to thinking in an extra dimension it becomes more fun ;).
This solution should probably scale better as well.

I set up another branch to test integrating continuous benchmarking into GitHub Actions, inspired by your other PR. This would also give some timings for different types of files as part of each PR.

@theroggy (Contributor, Author) commented Sep 7, 2022

Clarification: it is vectorized per linestring, not for all lines in one go. That would be possible as well, but would need quite ugly code with sparse arrays...
I doubt it is worth the trouble... it should scale reasonably well as it is now, I think...

@theroggy (Contributor, Author) commented Sep 7, 2022

> I set up another branch to test integrating continuous benchmarking into GitHub Actions, inspired by your other PR. This would also give some timings for different types of files as part of each PR.

Sounds quite cool... I wonder how stable the results will be performance-wise on the GitHub infrastructure, but if they are reasonably stable it would be great.

Sadly I won't be able to "borrow" the idea for geofileops, because that project is focused on parallelizing spatial operations, so you need a reasonable number of CPUs to properly benchmark it...

@mattijn (Owner) commented Sep 7, 2022

Ragged arrays are indeed not great.
I use this function to create one big array, based on the max length of the arcs, filling the remainder with NaN:

https://github.com/mattijn/topojson/blob/master/topojson/ops.py#L513

Like this I can get the most out of numpy.
But if one arc is super long and the others are very tiny, then it is probably not optimal.

When I find some time I can have a look as well.
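The padding idea might look roughly like this (a sketch of the concept only; the actual implementation lives at the ops.py link above and may differ):

```python
import numpy as np

def pad_linestrings(linestrings):
    """Stack ragged (n_i, 2) coordinate arrays into one
    (num_lines, max_len, 2) array, filling the tail with NaN
    so numpy operations can broadcast over all lines at once."""
    max_len = max(len(ls) for ls in linestrings)
    out = np.full((len(linestrings), max_len, 2), np.nan)
    for i, ls in enumerate(linestrings):
        out[i, : len(ls)] = ls
    return out

lines = [np.array([[0, 0], [1, 1]]),
         np.array([[0, 0], [1, 0], [2, 0]])]
padded = pad_linestrings(lines)
print(padded.shape)  # (2, 3, 2); the short line is NaN-padded
```

As noted above, the memory cost is driven by the longest arc, so one very long arc among many tiny ones makes the padded array mostly NaN.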

@theroggy (Contributor, Author) commented Sep 7, 2022

> Ragged arrays are indeed not great. I use this function to create one big array, based on the max length of the arcs, filling the remainder with NaN:
>
> https://github.com/mattijn/topojson/blob/master/topojson/ops.py#L513
>
> Like this I can get the most out of numpy. But if one arc is super long and the others are very tiny, then it is probably not optimal.

Yeah... certainly memory-footprint-wise this will be quite heavy for larger files... so it doesn't sound very scalable. Performance-wise it also creates quite some overhead... but possibly the looping is so bad that it still compensates...

When I find some time I can have a look as well.

Feel free!

@mattijn (Owner) commented Sep 8, 2022

I'm fine with this as is. Thanks again!

@mattijn mattijn merged commit e9f9fd0 into mattijn:master Sep 8, 2022
@mattijn (Owner) commented Sep 8, 2022

Generated benchmark image:
8103649

@theroggy theroggy deleted the Fix-lines-following-same-path-not-deduplicated branch September 8, 2022 13:47