
Fix: Linestrings that follow the same path but where one contains extra redundant points are not deduplicated #192

Merged
merged 7 commits into from
Sep 8, 2022

Conversation

@theroggy (Contributor) commented Sep 5, 2022

Fixes #191

@theroggy changed the title from "Fix" to "Fix: Linestrings that follow the same path but where one contains extra redundant points are not deduplicated" Sep 5, 2022
@mattijn (Owner) commented Sep 6, 2022

Interesting approach. So, simplify(0) will not change the shape of the arc, but it can remove points that lie in line with the rest of the arc.
Is it cheap in time?
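For illustration, a minimal sketch of this behaviour (assuming Shapely's Douglas-Peucker simplify; this snippet is not part of the PR): with tolerance 0, only vertices whose perpendicular distance to the surrounding chord is exactly 0 are dropped, i.e. collinear points, so the shape is preserved.

```python
from shapely.geometry import LineString

# A linestring with a redundant collinear point at (1, 1).
line = LineString([(0, 0), (1, 1), (2, 2), (3, 1)])

# Tolerance 0: Douglas-Peucker removes only points with zero
# perpendicular distance to the chord, so the arc keeps its shape.
simplified = line.simplify(0, preserve_topology=False)
print(list(simplified.coords))  # (1, 1) is gone, (2, 2) and (3, 1) survive
```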

@theroggy (Contributor, Author) commented Sep 6, 2022

> Interesting approach. So, simplify(0) will not change the shape of the arc, but it can remove points that lie in line with the rest of the arc. Is it cheap in time?

For the test file, processing time increases from 14.5 s to 16.5 s.

  • the simplify(0) takes ~0.8 s; the rest seems to be the extra conversions from numpy arrays to shapely objects and back. The conversions take more time than I anticipated, but nothing else was changed, so I don't see what else could be taking up the time.
  • the simplify now runs with preserve_topology=True. I think this could be preserve_topology=False, because I can't immediately think of a case where a simplify with tolerance 0 renders a line invalid. This lowers the simplify time to 0.3 s.
  • in the dedup step the numpy arrays are converted back to shapely objects to be able to apply the linemerge, I think... so possibly time could be saved by somehow avoiding the conversions back and forth... but I didn't really look into it in detail.

@theroggy (Contributor, Author) commented Sep 7, 2022

I had a quick further look at the third point:

  • conversion of data["linestring"] from np arrays to shapely objects takes 0.3s
  • conversion of data["linestring"] from shapely objects to np arrays takes 0.7s

It doesn't seem possible to avoid the conversions without rewriting quite a lot of code. And then the question is whether it will really be faster... So it doesn't seem like short-term, low-hanging fruit...

@theroggy theroggy marked this pull request as draft September 7, 2022 08:47
@theroggy (Contributor, Author) commented Sep 7, 2022

I implemented a numpy function to remove the collinear points. It has similar performance to simplify(0), but the extra conversions are avoided, so the test file can now be processed in ~15.2 s.

I also had a look at the simplify function in ops, but:

  • performance wasn't as good: 16.2 s
  • tolerance=0 disabled simplifying entirely, so 0.00000...01 had to be used
  • it triggered errors and invalid results
  • it would have meant creating a hard dependency on the simplification package (to avoid the conversions)...
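A minimal sketch of such a numpy collinearity filter (my reconstruction under stated assumptions, not necessarily the PR's actual remove_collinear_points): the 2D cross product of the segment vectors into and out of each interior vertex is zero exactly when the vertex is collinear with its neighbours. Note that this also flags a vertex where the line reverses back onto itself, a degenerate case this sketch ignores.

```python
import numpy as np

def remove_collinear_points(coords):
    """Drop interior vertices lying exactly on the line between
    their neighbours. coords is an (n, 2) float or int array."""
    if len(coords) < 3:
        return coords
    d_in = coords[1:-1] - coords[:-2]   # vector into each interior point
    d_out = coords[2:] - coords[1:-1]   # vector out of each interior point
    # 2D cross product is 0 when the two vectors are parallel,
    # i.e. when the interior point adds no shape information.
    cross = d_in[:, 0] * d_out[:, 1] - d_in[:, 1] * d_out[:, 0]
    keep = np.ones(len(coords), dtype=bool)
    keep[1:-1] = cross != 0
    return coords[keep]

coords = np.array([[0, 0], [1, 1], [2, 2], [3, 1]])
print(remove_collinear_points(coords))  # the collinear (1, 1) is removed
```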

@theroggy theroggy marked this pull request as ready for review September 7, 2022 11:53
@mattijn (Owner) commented Sep 7, 2022

Yes, I think a numpy approach would work as well, but from what I understand it's still being done line by line.
I think it's possible to do it in one go for all linestrings using broadcasting.

@theroggy (Contributor, Author) commented Sep 7, 2022

> Yes, I think a numpy approach would work as well, but from what I understand it's still being done line by line. I think it's possible to do it in one go for all linestrings using broadcasting.

I suppose it should be possible, but my search on the subject didn't turn up many ideas for vectorized solutions...

@mattijn (Owner) commented Sep 7, 2022

One way (as you said before, there are many roads to Rome):

  1. use as starting point the delta-encoded variant of the linestrings, as implemented here:
    def delta_encoding(linestrings):
  2. calculate the angle before and after each point.
  3. if these angles are equal, then the point in between can be categorized as redundant.
  4. index these to be removed or flagged in one go.
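The steps above could be sketched like this for a single linestring (a sketch only; `deltas` here is a hypothetical stand-in for the output of delta_encoding):

```python
import numpy as np

# Hypothetical delta-encoded linestring: per-segment coordinate differences.
deltas = np.array([[1, 1], [1, 1], [1, 0], [2, 0]], dtype=float)

# Step 2: direction angle of each segment.
angles = np.arctan2(deltas[:, 1], deltas[:, 0])

# Step 3: an interior point is redundant when the angles of the
# segments before and after it are equal.
redundant = np.isclose(np.diff(angles), 0)

# Step 4: a boolean mask over interior points, usable to drop them in one go.
print(redundant)
```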

@theroggy (Contributor, Author) commented Sep 7, 2022

Don't quite know why I didn't think of it sooner, but the current remove_collinear_points was actually quite easy to vectorize...

The profiler says remove_collinear_points now takes 0.28 s for the test file, versus 0.382 s with the unvectorized version... The total timing didn't really change, but that's probably because there is some fluctuation in the performance of my development machine (it's on shared infrastructure).

@mattijn (Owner) commented Sep 7, 2022

No problem! Once you get used to thinking in an extra dimension it becomes more fun ;).
This solution should probably scale better as well.

I set up another branch to test integrating continuous benchmarking into GitHub Actions, inspired by your other PR. This would also give some timings for different types of files as part of each PR.

@theroggy (Contributor, Author) commented Sep 7, 2022

Clarification: it is vectorized per linestring, not for all lines in one go. That would be possible as well, but would need quite ugly code with sparse arrays...
I doubt it is worth the trouble... it should scale reasonably well as it is now, I think...

@theroggy (Contributor, Author) commented Sep 7, 2022

> I set up another branch to test integrating continuous benchmarking into GitHub Actions, inspired by your other PR. This would also give some timings for different types of files as part of each PR.

Sounds quite cool... I wonder how stable the results will be performance-wise on the GitHub infrastructure, but if they are reasonably stable it would be great.

Sadly I won't be able to "borrow" the idea for geofileops, because that project is focused on parallelizing spatial operations, so you need a reasonable number of CPUs to properly benchmark it...

@mattijn (Owner) commented Sep 7, 2022

Ragged arrays are indeed not great.
I use this function to create one big array, based on the max length of the arcs, filling the remainder with NaN:

https://github.com/mattijn/topojson/blob/master/topojson/ops.py#L513

Like this I can get the most out of numpy.
But if one arc is super long and the others are very tiny, then it is probably not optimal.

When I find some time I can have a look as well.
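The padding idea might look roughly like this (a sketch of the concept only; the actual implementation lives at the ops.py link above and may differ):

```python
import numpy as np

def pad_linestrings(linestrings):
    """Stack ragged (n_i, 2) coordinate arrays into one
    (num_lines, max_len, 2) array, filling the tail with NaN
    so numpy operations can broadcast over all lines at once."""
    max_len = max(len(ls) for ls in linestrings)
    out = np.full((len(linestrings), max_len, 2), np.nan)
    for i, ls in enumerate(linestrings):
        out[i, : len(ls)] = ls
    return out

lines = [np.array([[0, 0], [1, 1]]),
         np.array([[0, 0], [1, 0], [2, 0]])]
padded = pad_linestrings(lines)
print(padded.shape)  # (2, 3, 2); the short line is NaN-padded
```

As noted above, the memory cost is driven by the longest arc, so one very long arc among many tiny ones makes the padded array mostly NaN.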

@theroggy (Contributor, Author) commented Sep 7, 2022

> Ragged arrays are indeed not great. I use this function to create one big array, based on the max length of the arcs, filling the remainder with NaN:
>
> https://github.com/mattijn/topojson/blob/master/topojson/ops.py#L513
>
> Like this I can get the most out of numpy. But if one arc is super long and the others are very tiny, then it is probably not optimal.

Yeah... certainly memory-footprint-wise this will be quite heavy for larger files... so it doesn't sound very scalable. Performance-wise it also creates quite some overhead... but possibly the looping is so bad that it still compensates...

When I find some time I can have a look as well.

Feel free!

@mattijn (Owner) commented Sep 8, 2022

I'm fine with this as is. Thanks again!

@mattijn mattijn merged commit e9f9fd0 into mattijn:master Sep 8, 2022
@mattijn (Owner) commented Sep 8, 2022

Generated benchmark image:
8103649

@theroggy theroggy deleted the Fix-lines-following-same-path-not-deduplicated branch September 8, 2022 13:47