Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add polygon extract queries to the Python API #22

Closed
CloudNiner opened this issue Jul 9, 2020 · 12 comments
Closed

Add polygon extract queries to the Python API #22

CloudNiner opened this issue Jul 9, 2020 · 12 comments

Comments

@CloudNiner
Copy link
Contributor

We're exploring an application of OSMExpress that is interested in retrieving objects filtered by specific tags from a polygon bounding area.

In its simplest form, I'm imagining a python method that is something like:

# Polygon could be any number of things, such as a Shapely polygon, 
# a python dict representation of GeoJson, a list of coordinate tuples or a bbox
def extract(region: Polygon): [CellIds]

that reproduces the functionality of https://github.com/protomaps/OSMExpress/blob/master/src/extract.cpp#L130-L134

@bdon
Copy link
Member

bdon commented Jul 10, 2020

+1, should be doable by:

  1. bring in the s2 bindings
  2. use the s2 region coverer on your input geometry
  3. traverse the cell index based on prefix of the coverings, putting node IDs into a set
  4. do a pass over the set of nodes, finding the set of referenced ways, relations... etc

@CloudNiner
Copy link
Contributor Author

I attempted a first pass at this, but got a bit stuck. From your comment, are you suggesting that this can be done entirely in Python using the existing osmx bindings, perhaps pending addressing #23?

I'm using the s2sphere library as it appears to provide an s2geometry installation in an easily pip installable package so that should address getting cell ids.

In attempting to unpack

traverse the cell index based on prefix of the coverings

Can I pass the cell id I get back from RegionCoverer directly to the osmx bindings Index class? Is the name argument to pass to Index constructor the same as the lmdb table names, e.g. cell_node?

@bdon
Copy link
Member

bdon commented Jul 13, 2020

Can I pass the cell id I get back from RegionCoverer directly to the osmx bindings Index class?

Yes, if your coverer has both minimum and maximum cell levels set to 16, you can then query cell_node directly with this cell ID. This can be implemented as a first pass. It's not optimal though, because a huge region might need thousands or even millions of level 16 cells. A more efficient way is to let cells be anywhere from level 0 to 16 for a sparse representation of the covering; because every S2 parent cell shares a prefix with all its children, you can use LMDB cursor ranges to iterate over child cells efficiently like is done in the C++ code https://github.com/protomaps/OSMExpress/blob/master/src/storage.cpp#L171

The s2sphere package looks neat, I haven't tried it. I believe the S2 source tree itself has "official" python bindings but unsure what the workflow is to get that packaged up neatly. If the s2sphere library is easier to install and has everything we need, we should use that.

@bdon
Copy link
Member

bdon commented Jul 13, 2020

Is the name argument to pass to Index constructor the same as the lmdb table names, e.g. cell_node?

I think cell_node should work the same way as the other subclasses of Index, they're all just tables from int64 -> [int64]. Something like

class CellNode(Index):
    def __init__(self,txn):
        super().__init__(txn,cell_node')

@CloudNiner
Copy link
Contributor Author

if your coverer has both minimum and maximum cell levels set to 16

Looks like I wasn't setting minimum as well as maximum. Will play around with this a bit more. The s2sphere region coverer returns a list of CellId which does have the child_begin and child_end methods on it so that more efficient implementation seems feasible at first glance.

I believe the S2 source tree itself has "official" python bindings

It does here but it's not obvious to me either how to include them in a python/pip project easily.

other subclasses of Index

🤦 sometimes reading source on a Monday am is hard?

Thanks for the hints, I'll update with progress, no ETA though. If you find another use case for this or begin working on it let me know.

@CloudNiner
Copy link
Contributor Author

I managed to spend some time on this today, creating a Python CLI that generates a GeoJson file of nodes and ways filtered by the requested tags and bounding box.

Lots of possible future work left here, including but not limited to:

  • Relation support
  • Implement the more efficient s2 cell walk described above
  • Correctly generate LineString or Polygon for ways (currently always a LineString)
  • Support Polygon as input in addition to BBox
  • Add option to allow user to specify AND or OR when providing multiple tag filters
  • Refactor the core operation of returning a list of OSM objects within a polygon to a method that can live in OSMExpress

https://gist.github.com/CloudNiner/8b020f434d0509419032c6b2fe0577c5

@bdon
Copy link
Member

bdon commented Jul 14, 2020

Looks good so far! some random thoughts, some directly related and some more speculative:

Correctly generate LineString or Polygon for ways (currently always a LineString)
Relation support

This should be done outside of osmx, but I think for ways you need to 1) ensure the orientation is correct and 2) ensure it is not self-intersecting, which OSM ways can be, but is not allowed per the OGC Simple Features spec
If you intend to also support MultiPolygons, I think it's a pretty tough nut to crack, and a good implementation of multi polygon builders exists already within Osmium. Something I want to investigate is if a osmium handler can be "fed" directly from an osmx database instead of an .osm.pbf file

Support Polygon as input in addition to BBox
Refactor the core operation of returning a list of OSM objects within a polygon to a method that can live in OSMExpress

the osmx API should only have one way to specify the query region, taking a covering - a list of cell IDs - as the argument. We should have some helper methods to convert from common geometry specs to S2 regions and coverings. In the C++ code I called this "region", so we should create another python module probably called "region" to handle this

Add option to allow user to specify AND or OR when providing multiple tag filters

unrelated to osmx but I implemented something similar in another project which you may want to look at: https://github.com/hotosm/osm-export-tool-python/blob/master/osm_export_tool/sql.py

Node = namedtuple("Node", ["id", "location", "tags", "metadata"])
Way = namedtuple("Way", ["id", "nodes", "tags", "metadata"])

I would avoid having to mirror the layout of OSM objects in your client code because you need to malloc for every object. Ideally the osmx bindings make it easy to take advantage of zero-copy views of OSM objects - this may mean the API provides functions that take sets of object IDs and uses yield and generator expressions to iterate over them

node_ids.extend(cell_node_index.get(cell_id.id()))

node_ids should generally be a python set because there will be lots of duplicates. And ideally iterating over that set should guarantee ascending order

@CloudNiner
Copy link
Contributor Author

I updated my prior example to take an s2.S2Region instead of a bbox and wrote the Python code to convert a geojson.Polygon to s2.S2Region. There are other improvements, like the using generators to yield nodes and ways and the Set() fix described above. Here's the updated complete example: https://gist.github.com/CloudNiner/e5552aaae6c235193b8f196558b92e2c

Unfortunately, the parts of this example that might be useful in the OSMExpress python bindings (polygon -> s2region and node_ids_for_region) now utilize the s2 SWIG bindings because S2Polygon isn't implemented in s2sphere. This prevented a PR because the bindings are not directly pip installable. s2-py is a potential option but installs an out of date (from 2018) version of the library. It's also troublesome (no guidance from s2 docs and a few open issues relating to the SWIG bindings) to install the SWIG bindings to anywhere other than the global default python installation. There is an open PR that adds support for building a pip wheel but its been dormant for awhile now.

Regardless, I hope the example is instructive and provides a useful start to those looking to perform similar tasks or jump start a future effort to add some of these operations to the OSMX Python bindings when there's a clear path forward for inclusion of the s2 bindings as a dependency.

@bdon
Copy link
Member

bdon commented Jul 29, 2020

I agree that the bindings not being pip installable is an issue, thanks for figuring out the workaround.

Were you able to accomplish the goal of your script in an efficient manner using Python? It's useful to know if investing more in Python packaging is worth it for use cases similar to yours - If you encountered performance issues, it may be better to direct developers towards just using the C++ API

@CloudNiner
Copy link
Contributor Author

Were you able to accomplish the goal of your script in an efficient manner using Python?

I'd say yes in this case. There's a few different performance bottlenecks a user might run into, but once I had the efficient implementation for the function s2region -> node id set, building a prototype with the python bindings was really quick. The s2sphere version (is pip installable!) of that function wasn't significantly slower than the s2 swig bindings so perhaps an implementation in osmx bindings for LatLngRect and Cap might be warranted and would serve any query use case well.

The search_by_ways code begins to slow as you increase query area or the number of results returned but as far as I can tell, the issue is just volume of operations being performed. I was still able to write 100s of mb of geojson results for areas up to 1000 sq kms returning thousands of ways, it just took 1-2mins. Whether that's satisfactory or not depends on individual use case, for us it was fine since our use case is focused on smaller areas with few results at a time.

@bdon
Copy link
Member

bdon commented Jul 29, 2020

https://gist.github.com/CloudNiner/e5552aaae6c235193b8f196558b92e2c#file-osmx_utils-py-L23

This can be made even faster :) as is, on line 23, if you are traversing a cell at a low level, I think you are potentially issuing millions of queries to Index.get - consider the case where there is only 1 record in the database for that low-level cell, but the cell is made of 1 million level16 nodes.

Instead, use the child_begin cell to set the lmdb cursor state and then move the cursor forward. You won't be able to use Index.get, you will need to write another method that manipulates the cursor; the logic and LMDB functions used should be the same as in the C++ code. This will limit the number of LMDB operations to the number of records in the actual cell, 1 record in the example above.

Curious to see if this improves on the slowness of big query areas you mentioned.

@bdon
Copy link
Member

bdon commented Nov 5, 2023

I'm closing this issue because of lack of activity.

The main use case for OSMExpress remains getting PBFs out of a minutely-updated planet .osmx. Analytical use cases in Python may be better served by https://docs.geodesk.com/python which looks like an excellent new project.

@bdon bdon closed this as completed Nov 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants