
Prepare Cut for coords routine #78

Merged
merged 3 commits into master from prepare-cut-for-coords on May 13, 2020

Conversation

mattijn
Owner

@mattijn mattijn commented May 6, 2020

This PR is a start to make the Cut class ready for the coords routine.
Currently it introduces a few small changes that skip the step where coordinates were inserted in LineStrings that contain shared paths without shared junctions.

But while being in the class I realised that the real bottleneck is here: https://github.com/mattijn/topojson/blob/master/topojson/core/cut.py#L130.

The find_duplicates() function becomes really slow when the number of features is high (using e.g. the sample.geojson test file).

@mattijn mattijn merged commit 2c0f636 into master May 13, 2020
@mattijn
Owner Author

mattijn commented May 13, 2020

A pygeos vectorised approach, in combination with the pygeos STRtree abilities, can most likely improve these slower parts. That is out of scope for this PR, so merging.
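
For illustration, a rough and only partly vectorised sketch of that idea (not part of this PR): build a pygeos STRtree over the linestrings, use its bounding-box query to limit the candidate pairs, and confirm duplicates with an exact equality test. The function name find_duplicates_pygeos and the exact wiring are assumptions, not the eventual implementation.

import pygeos

def find_duplicates_pygeos(linestrings):
    # convert shapely linestrings to pygeos geometries and index them
    geoms = pygeos.from_shapely(linestrings)
    tree = pygeos.STRtree(geoms)
    pairs = []
    for i, geom in enumerate(geoms):
        # the tree returns bounding-box candidates only,
        # so refine each candidate with an exact equality test
        for j in tree.query(geom):
            if j > i and pygeos.equals(geoms[int(j)], geom):
                pairs.append([int(j), i])
    return pairs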

@martinfleis
Contributor

Within coords mode, find_duplicates could be done on a numpy / dict-list basis, which would likely be faster than any geometry operation, even when using pygeos.

@mattijn
Owner Author

mattijn commented May 14, 2020

I am very interested and all ears, since I don't think it really matters for this part whether the junctions were detected using the shapely shared_paths function or the dict-list coords approach. In other words, a better method would support both modes.

Let me give a minimal example. How could one adopt a dict-list approach to detect the same duplicates given the following geometric objects?

from topojson.core.cut import Cut

data = [
    {"type": "LineString", "coordinates": [[0, 0], [2, 0], [2, 1], [3, 1]]},
    {"type": "LineString", "coordinates": [[1, 0], [4, 0], [4, 1], [3, 1], [2, 1], [0, 1]]},
    {"type": "Polygon", "coordinates": [[[5, 0], [6, 0], [6, 2], [5, 2], [5, 0]]]},
    {"type": "Polygon", "coordinates": [[[5, 0], [5, 2], [6, 2], [6, 0], [5, 0]]]}    
]

All linestrings can be extracted and cut into the following segments using coords mode:

c = Cut(data, options={"shared_paths": "coords"})
c.to_svg(separate=True, include_junctions=True)

Where the current approach detects the following duplicate pairs (by index):

c.output["bookkeeping_duplicates"]
array([[3, 1],
       [6, 5]])

Based on the following LineStrings:

c.output["linestrings"]
[<shapely.geometry.linestring.LineString at 0x110208e10>,
 <shapely.geometry.linestring.LineString at 0x110208d10>,
 <shapely.geometry.linestring.LineString at 0x11052f1d0>,
 <shapely.geometry.linestring.LineString at 0x11052fb10>,
 <shapely.geometry.linestring.LineString at 0x11052f490>,
 <shapely.geometry.linestring.LineString at 0x11052f590>,
 <shapely.geometry.linestring.LineString at 0x11052f090>]

@martinfleis
Contributor

I am not sure I entirely follow your example. This is my line of thought. I haven't properly studied the code and there is likely something I am missing, but in principle:

# coordinates could be retained from Join as lists to make it faster
coords = []
for ls in c.output['linestrings']:
    coords.append(tuple(sorted(ls.coords)))  # tuple to make it hashable

seen = {}
dupes = []

for x in coords:
    if x not in seen:
        seen[x] = 1
    else:
        if seen[x] == 1:
            dupes.append(x)
        seen[x] += 1

idx_duplicates = []
for dup in dupes:
    idx_duplicates.append([i for i, e in enumerate(coords) if e == dup])
>>> idx_duplicates
[[1, 4], [3, 6], [8, 9]]

This is done in 121 µs ± 13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) compared to 568 µs ± 70.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) for c.find_duplicates(c.output['linestrings']). Results are exactly the same.

(For some reason c.output["linestrings"] gives me 10 linestrings, not your 7.)
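
The per-loop numbers above look like IPython %timeit output; a rough way to reproduce comparable figures with the stdlib timeit module, assuming the Cut instance c from the earlier snippet and the dict-list code wrapped in a hypothetical function find_dupes_dictlist, would be:

import timeit

n = 1000
# find_dupes_dictlist is a hypothetical wrapper around the dict-list snippet above
t_dictlist = timeit.timeit(lambda: find_dupes_dictlist(c.output["linestrings"]), number=n)
t_existing = timeit.timeit(lambda: c.find_duplicates(c.output["linestrings"]), number=n)
print(f"dict-list: {t_dictlist / n * 1e6:.0f} µs per loop")
print(f"existing:  {t_existing / n * 1e6:.0f} µs per loop")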

@martinfleis
Contributor

martinfleis commented May 14, 2020

However, it might not actually be better. For the nybb dataset the situation is the opposite:

50.5 ms ± 10.9 ms per loop with this, versus 11.5 ms ± 385 µs per loop with the existing approach.

It likely depends on the complexity of the geometries. For naturalearth_lowres:

58.7 ms ± 11.4 ms per loop with the new one, versus 93.5 ms ± 15.8 ms per loop with the existing one.
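
For context, nybb and naturalearth_lowres appear to be geopandas' bundled sample datasets; a minimal sketch of how they could be loaded and fed into Cut, assuming Cut accepts a GeoDataFrame the way the Topology class does:

import geopandas as gpd
from topojson.core.cut import Cut

# load the two benchmark datasets from geopandas' bundled samples
nybb = gpd.read_file(gpd.datasets.get_path("nybb"))
world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))

# assumption: Cut accepts a GeoDataFrame directly, as in the earlier example
c_nybb = Cut(nybb, options={"shared_paths": "coords"})
c_world = Cut(world, options={"shared_paths": "coords"})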

@mattijn
Owner Author

mattijn commented May 15, 2020

Very interesting. I was not aware that tuples are hashable objects. The sorting of the shapely coords is also interesting, as it handles reversed but duplicate linestrings.
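
A tiny illustration of that point, with made-up coordinates: sorting the coordinate tuples gives a reversed-but-identical linestring the same hashable key.

from shapely.geometry import LineString

a = LineString([(0, 0), (1, 0), (1, 1)])
b = LineString([(1, 1), (1, 0), (0, 0)])  # same path, traversed in reverse

key_a = tuple(sorted(a.coords))
key_b = tuple(sorted(b.coords))
assert key_a == key_b              # the sorted keys compare equal
assert hash(key_a) == hash(key_b)  # so their hashes are equal as well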

I tried another approach that hashes the sorted linestring tuples and uses numpy, and compared the three variants as follows:

import numpy as np

def dups_numpy(linestrings):
    # get hash of each sorted linestring
    coords = []
    for ls in linestrings:
        coords.append(hash(tuple(sorted(ls.coords))))
    coords = np.array(coords, dtype=np.int64)

    # get split locations of duplicates
    idx_sort = np.argsort(coords)
    sorted_coords = coords[idx_sort]
    vals, idx_start, count = np.unique(
        sorted_coords, return_counts=True, return_index=True
    )

    # split on indices that occur more than once
    idx_dups = np.split(idx_sort, idx_start[1:])
    idx_dups = np.array([dup for dup in idx_dups if dup.size > 1])

    return idx_dups

def dups_dictlist(linestrings):
    # coordinates could be retained from Join as lists to make it faster
    coords = []
    for ls in linestrings:
        coords.append(tuple(sorted(ls.coords)))  # tuple to make it hashable

    seen = {}
    dupes = []

    for x in coords:
        if x not in seen:
            seen[x] = 1
        else:
            if seen[x] == 1:
                dupes.append(x)
            seen[x] += 1

    idx_duplicates = []
    for dup in dupes:
        idx_duplicates.append([i for i, e in enumerate(coords) if e == dup])    
    return idx_duplicates

def dups_shapely(linestrings):
    return c.find_duplicates(linestrings)

| function | naturalearth_lowres | nybb | sample.geojson |
| --- | --- | --- | --- |
| dups_numpy | 12.1 ms ± 308 µs per loop | 20.6 ms ± 540 µs per loop | 435 ms ± 6.71 ms per loop |
| dups_dictlist | 36.3 ms ± 532 µs per loop | 31.1 ms ± 362 µs per loop | interrupted |
| dups_shapely | 53.2 ms ± 529 µs per loop | 8.14 ms ± 112 µs per loop | 8.62 s ± 107 ms per loop |

With the nybb dataset the current shapely approach is still the quickest (2.5x), but on sample.geojson the numpy-hash approach is 19.8x quicker.

@martinfleis
Contributor

Cool. I am happy to see that my general idea is actually useful when put together with numpy. The performance differences are interesting, though.

@mattijn mattijn deleted the prepare-cut-for-coords branch July 6, 2020 08:38