Prepare Cut for coords routine #78
Conversation
A pygeos vectorised approach in combination with the pygeos STRtree abilities can likely improve these slower parts. Out of scope of this PR, so merging.
Within
I am very interested and all ears, since I don't think it really matters for this part whether the junctions were detected using shapely. Let me give a minimal example. How could one adopt a dict-list approach to detect the same duplicates given the following geometric objects?

```python
from topojson.core.cut import Cut

data = [
    {"type": "LineString", "coordinates": [[0, 0], [2, 0], [2, 1], [3, 1]]},
    {"type": "LineString", "coordinates": [[1, 0], [4, 0], [4, 1], [3, 1], [2, 1], [0, 1]]},
    {"type": "Polygon", "coordinates": [[[5, 0], [6, 0], [6, 2], [5, 2], [5, 0]]]},
    {"type": "Polygon", "coordinates": [[[5, 0], [5, 2], [6, 2], [6, 0], [5, 0]]]},
]
```

All linestrings can be extracted and cut into the following segments using:

```python
c = Cut(data, options={"shared_paths": "coords"})
c.to_svg(separate=True, include_junctions=True)
```

The current approach detects the following duplicate pairs (by index):

```python
c.output["bookkeeping_duplicates"]
```

Based on the following LineStrings:

```python
c.output["linestrings"]
```
I am not sure I entirely follow your example. This is my line of thought. I haven't properly studied the code and there is likely something I am missing, but in principle:

```python
# coordinates could be retained from Join as lists to make it faster
coords = []
for ls in c.output["linestrings"]:
    coords.append(tuple(sorted(ls.coords)))  # tuple to make it hashable

seen = {}
dupes = []
for x in coords:
    if x not in seen:
        seen[x] = 1
    else:
        if seen[x] == 1:
            dupes.append(x)
        seen[x] += 1

idx_duplicates = []
for dup in dupes:
    idx_duplicates.append([i for i, e in enumerate(coords) if e == dup])
```

This is done in (For some reason
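The dict-list approach can be tried standalone on plain coordinate tuples (no shapely objects needed). A minimal sketch with hypothetical data, showing how sorting the coordinates first makes a reversed duplicate compare equal:

```python
def dups_dictlist(coords):
    # count occurrences; record each value the first time it repeats
    seen = {}
    dupes = []
    for x in coords:
        if x not in seen:
            seen[x] = 1
        else:
            if seen[x] == 1:
                dupes.append(x)
            seen[x] += 1
    # collect the index positions of every duplicated value
    return [[i for i, e in enumerate(coords) if e == dup] for dup in dupes]

lines = [
    [(0, 0), (2, 0), (2, 1)],
    [(1, 0), (4, 0)],
    [(2, 1), (2, 0), (0, 0)],  # reverse of the first linestring
]
keys = [tuple(sorted(ls)) for ls in lines]
print(dups_dictlist(keys))  # → [[0, 2]]
```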
However, it might not be better in fact. For

It likely depends on the complexity of the geometries. For
Very interesting. I was not aware that tuples are hashable objects. The sorting of the shapely coords is also a nice touch, as it handles reversed-but-duplicate linestrings. I tried another approach using the hashes of the linestring tuples with numpy, and compared the three as follows:

```python
import numpy as np

def dups_numpy(linestrings):
    # get hash of each sorted linestring
    coords = []
    for ls in linestrings:
        coords.append(hash(tuple(sorted(ls.coords))))
    coords = np.array(coords, dtype=np.int64)
    # get split locations of duplicates
    idx_sort = np.argsort(coords)
    sorted_coords = coords[idx_sort]
    vals, idx_start, count = np.unique(
        sorted_coords, return_counts=True, return_index=True
    )
    # split on hash values that occur > 1 time
    idx_dups = np.split(idx_sort, idx_start[1:])
    # keep as a list: the groups can have different lengths
    idx_dups = [dup for dup in idx_dups if dup.size > 1]
    return idx_dups

def dups_dictlist(linestrings):
    # coordinates could be retained from Join as lists to make it faster
    coords = []
    for ls in linestrings:
        coords.append(tuple(sorted(ls.coords)))  # tuple to make it hashable
    seen = {}
    dupes = []
    for x in coords:
        if x not in seen:
            seen[x] = 1
        else:
            if seen[x] == 1:
                dupes.append(x)
            seen[x] += 1
    idx_duplicates = []
    for dup in dupes:
        idx_duplicates.append([i for i, e in enumerate(coords) if e == dup])
    return idx_duplicates

def dups_shapely(linestrings):
    c.find_duplicates(linestrings)
```

With this
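The numpy split logic can be exercised standalone on any hashable keys, independent of shapely. A minimal sketch (hypothetical helper name and synthetic keys, not from the PR):

```python
import numpy as np

def dups_hashed_numpy(keys):
    # hash each key so numpy can argsort fixed-width integers
    hashes = np.array([hash(k) for k in keys], dtype=np.int64)
    idx_sort = np.argsort(hashes)
    # idx_start marks where each run of equal hashes begins in the sorted array
    _, idx_start = np.unique(hashes[idx_sort], return_index=True)
    groups = np.split(idx_sort, idx_start[1:])
    # keep only hash values that occur more than once
    return [g for g in groups if g.size > 1]

keys = [("a",), ("b",), ("a",), ("c",), ("b",)]
groups = dups_hashed_numpy(keys)
# sort for a deterministic display order (group order depends on hash values)
print(sorted(sorted(g.tolist()) for g in groups))  # → [[0, 2], [1, 4]]
```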
Cool. I am happy to see that my general idea is actually useful when put together with numpy. The performance difference is interesting, though.
This PR is a start to make the Cut class ready for the `coords` routine. Currently it makes some small changes to skip the step where coordinates were inserted into LineStrings that contain shared paths without shared junctions.

But while being in the class I realised that the real bottleneck is here: https://github.com/mattijn/topojson/blob/master/topojson/core/cut.py#L130.

The `find_duplicates()` function becomes really slow when the number of features is high (e.g. using the sample.geojson test file).
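The slowdown described here is the classic pairwise-comparison pattern. A hypothetical sketch with synthetic keys (not the PR's shapely code) contrasting O(n²) pairwise equality checks against a single O(n) hash-based pass:

```python
import timeit

# 1000 synthetic "linestring" keys: 500 distinct, each duplicated once
keys = [tuple((i, j) for j in range(4)) for i in range(500)] * 2

def pairwise(keys):
    # what a nested comparison loop does: O(n^2) equality checks
    n = len(keys)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if keys[i] == keys[j]]

def hashed(keys):
    # one dict pass: O(n) expected; pairs each repeat with its first occurrence
    seen, out = {}, []
    for i, k in enumerate(keys):
        if k in seen:
            out.append((seen[k], i))
        else:
            seen[k] = i
    return out

# identical results here because every key appears at most twice
assert sorted(pairwise(keys)) == sorted(hashed(keys))
print(timeit.timeit(lambda: pairwise(keys), number=1))
print(timeit.timeit(lambda: hashed(keys), number=1))
```

On this data the hash-based pass is orders of magnitude faster, and the gap widens as the number of features grows.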