
Prepare Cut for coords routine #78

Merged
merged 3 commits into master from prepare-cut-for-coords on May 13, 2020

Conversation

mattijn
Owner

@mattijn mattijn commented May 6, 2020

This PR is a start to make the Cut class ready for the coords routine.
Currently it introduces a few small changes that skip the step where coordinates were inserted in LineStrings that contain shared paths without shared junctions.

But while being in the class I realised that the real bottleneck is here: https://github.com/mattijn/topojson/blob/master/topojson/core/cut.py#L130.

The find_duplicates() function becomes really slow when the number of features is high (using e.g. the sample.geojson test file).

@mattijn mattijn merged commit 2c0f636 into master May 13, 2020
@mattijn
Owner Author

mattijn commented May 13, 2020

A pygeos vectorised approach, in combination with the pygeos STRtree abilities, can most likely improve these slower parts. That is out of scope for this PR, so merging.
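
For illustration, a rough and only partly vectorised sketch of that idea (not part of this PR): build a pygeos STRtree over the linestrings, use its bounding-box query to limit the candidate pairs, and confirm duplicates with an exact equality test. The function name find_duplicates_pygeos and the exact wiring are assumptions, not the eventual implementation.

import pygeos

def find_duplicates_pygeos(linestrings):
    # convert shapely linestrings to pygeos geometries and index them
    geoms = pygeos.from_shapely(linestrings)
    tree = pygeos.STRtree(geoms)
    pairs = []
    for i, geom in enumerate(geoms):
        # the tree returns bounding-box candidates only,
        # so refine each candidate with an exact equality test
        for j in tree.query(geom):
            if j > i and pygeos.equals(geoms[int(j)], geom):
                pairs.append([int(j), i])
    return pairs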

@martinfleis
Contributor

Within coords mode, find_duplicates could be done on a numpy / dict-list basis, which would likely be faster than any geometry operation, even when using pygeos.

@mattijn
Owner Author

mattijn commented May 14, 2020

I am very interested and all ears, since I don't think it really matters for this part whether the junctions were detected using the shapely shared_paths function or the dict-list coords approach. In other words, a better method would support both modes.

Let me give a minimal example. How could one adopt a dict-list approach to detect the same duplicates given the following geometric objects?

from topojson.core.cut import Cut

data = [
    {"type": "LineString", "coordinates": [[0, 0], [2, 0], [2, 1], [3, 1]]},
    {"type": "LineString", "coordinates": [[1, 0], [4, 0], [4, 1], [3, 1], [2, 1], [0, 1]]},
    {"type": "Polygon", "coordinates": [[[5, 0], [6, 0], [6, 2], [5, 2], [5, 0]]]},
    {"type": "Polygon", "coordinates": [[[5, 0], [5, 2], [6, 2], [6, 0], [5, 0]]]}    
]

All linestrings can be extracted and cut into the following segments using coords mode:

c = Cut(data, options={"shared_paths": "coords"})
c.to_svg(separate=True, include_junctions=True)

Where the current approach detects the following duplicate pairs (by index):

c.output["bookkeeping_duplicates"]
array([[3, 1],
       [6, 5]])

Based on the following LineStrings:

c.output["linestrings"]
[<shapely.geometry.linestring.LineString at 0x110208e10>,
 <shapely.geometry.linestring.LineString at 0x110208d10>,
 <shapely.geometry.linestring.LineString at 0x11052f1d0>,
 <shapely.geometry.linestring.LineString at 0x11052fb10>,
 <shapely.geometry.linestring.LineString at 0x11052f490>,
 <shapely.geometry.linestring.LineString at 0x11052f590>,
 <shapely.geometry.linestring.LineString at 0x11052f090>]

@martinfleis
Contributor

I am not sure I entirely follow your example. This is my line of thought. I haven't properly studied the code and there is likely something I am missing, but in principle:

# coordinates could be retained from Join as lists to make it faster
coords = []
for ls in c.output['linestrings']:
    coords.append(tuple(sorted(ls.coords)))  # tuple to make it hashable

seen = {}
dupes = []

for x in coords:
    if x not in seen:
        seen[x] = 1
    else:
        if seen[x] == 1:
            dupes.append(x)
        seen[x] += 1

idx_duplicates = []
for dup in dupes:
    idx_duplicates.append([i for i, e in enumerate(coords) if e == dup])
>>> idx_duplicates
[[1, 4], [3, 6], [8, 9]]

This is done in 121 µs ± 13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) compared to 568 µs ± 70.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) for c.find_duplicates(c.output['linestrings']). Results are exactly the same.

(For some reason c.output["linestrings"] gives me 10 linestrings, not your 7.)
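
The per-loop numbers above look like IPython %timeit output; a rough way to reproduce comparable figures with the stdlib timeit module, assuming the Cut instance c from the earlier snippet and the dict-list code wrapped in a hypothetical function find_dupes_dictlist, would be:

import timeit

n = 1000
# find_dupes_dictlist is a hypothetical wrapper around the dict-list snippet above
t_dictlist = timeit.timeit(lambda: find_dupes_dictlist(c.output["linestrings"]), number=n)
t_existing = timeit.timeit(lambda: c.find_duplicates(c.output["linestrings"]), number=n)
print(f"dict-list: {t_dictlist / n * 1e6:.0f} µs per loop")
print(f"existing:  {t_existing / n * 1e6:.0f} µs per loop")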

@martinfleis
Contributor

martinfleis commented May 14, 2020

However, it might not actually be better. For the nybb dataset the situation is the opposite:

50.5 ms ± 10.9 ms per loop with this, versus 11.5 ms ± 385 µs per loop with the existing approach.

It likely depends on the complexity of the geometries. For naturalearth_lowres:

58.7 ms ± 11.4 ms per loop with the new one, versus 93.5 ms ± 15.8 ms per loop with the existing one.
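
For context, nybb and naturalearth_lowres appear to be geopandas' bundled sample datasets; a minimal sketch of how they could be loaded and fed into Cut, assuming Cut accepts a GeoDataFrame the way the Topology class does:

import geopandas as gpd
from topojson.core.cut import Cut

# load the two benchmark datasets from geopandas' bundled samples
nybb = gpd.read_file(gpd.datasets.get_path("nybb"))
world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))

# assumption: Cut accepts a GeoDataFrame directly, as in the earlier example
c_nybb = Cut(nybb, options={"shared_paths": "coords"})
c_world = Cut(world, options={"shared_paths": "coords"})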

@mattijn
Owner Author

mattijn commented May 15, 2020

Very interesting. I was not aware that tuples are hashable objects. The sorting of the shapely coords is also interesting, as it handles reversed but duplicate linestrings.
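
A tiny illustration of that point, with made-up coordinates: sorting the coordinate tuples gives a reversed-but-identical linestring the same hashable key.

from shapely.geometry import LineString

a = LineString([(0, 0), (1, 0), (1, 1)])
b = LineString([(1, 1), (1, 0), (0, 0)])  # same path, traversed in reverse

key_a = tuple(sorted(a.coords))
key_b = tuple(sorted(b.coords))
assert key_a == key_b              # the sorted keys compare equal
assert hash(key_a) == hash(key_b)  # so their hashes are equal as well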

I tried another approach that hashes the sorted linestring tuples and uses numpy, and compared the three variants as follows:

import numpy as np

def dups_numpy(linestrings):
    # get hash of each sorted linestring
    coords = []
    for ls in linestrings:
        coords.append(hash(tuple(sorted(ls.coords))))
    coords = np.array(coords, dtype=np.int64)

    # get split locations of duplicates
    idx_sort = np.argsort(coords)
    sorted_coords = coords[idx_sort]
    vals, idx_start, count = np.unique(
        sorted_coords, return_counts=True, return_index=True
    )

    # split on indices that occur more than once
    idx_dups = np.split(idx_sort, idx_start[1:])
    idx_dups = np.array([dup for dup in idx_dups if dup.size > 1])

    return idx_dups

def dups_dictlist(linestrings):
    # coordinates could be retained from Join as lists to make it faster
    coords = []
    for ls in linestrings:
        coords.append(tuple(sorted(ls.coords)))  # tuple to make it hashable

    seen = {}
    dupes = []

    for x in coords:
        if x not in seen:
            seen[x] = 1
        else:
            if seen[x] == 1:
                dupes.append(x)
            seen[x] += 1

    idx_duplicates = []
    for dup in dupes:
        idx_duplicates.append([i for i, e in enumerate(coords) if e == dup])    
    return idx_duplicates

def dups_shapely(linestrings):
    return c.find_duplicates(linestrings)

| function | naturalearth_lowres | nybb | sample.geojson |
| --- | --- | --- | --- |
| dups_numpy | 12.1 ms ± 308 µs per loop | 20.6 ms ± 540 µs per loop | 435 ms ± 6.71 ms per loop |
| dups_dictlist | 36.3 ms ± 532 µs per loop | 31.1 ms ± 362 µs per loop | interrupted |
| dups_shapely | 53.2 ms ± 529 µs per loop | 8.14 ms ± 112 µs per loop | 8.62 s ± 107 ms per loop |

With the nybb dataset the current shapely approach is still the quickest (2.5x), but on sample.geojson the numpy-hash approach is 19.8x quicker.

@martinfleis
Contributor

Cool. I am happy to see that my general idea is actually useful when put together with numpy. The performance differences are interesting, though.

@mattijn mattijn deleted the prepare-cut-for-coords branch July 6, 2020 08:38