feat: first impl #2
Conversation
kshitijrajsharma left a comment:
Bravo! And impressive work! Amazing.
osm_data_client/api.py
Outdated
```python
        except Exception as ex:
            raise DownloadError(f"Error downloading snapshot: {str(ex)}") from ex

    def _extract_zip_data(self, zip_data: bytes, file_name: str) -> Union[Dict[str, Any], List[Dict[str, Any]], Dict[str, Union[str, bytes]]]:
```
This implementation is nice, however it might fail for large zip files (especially considering you are unzipping in memory). raw-data-api supports multiple different file formats, and unzipping might not be required for all of them; for instance, a user might like to get the data in a zip file for the shapefile format. So keeping this function optional and delivering the zip back to the user might be useful, especially considering this is a Python client and the user will have more flexibility to do what they want with the data!
Totally makes sense.
Thinking about it, the main implementation could always return the zip file bytes in a result wrapper class, without adding much complexity.
Then specific result handling based on size and output format could be re-added as extensions, so that if the user does want the client to handle unpacking the data it can, but it's not forced behaviour.
What do you think?
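To illustrate, a minimal sketch of that wrapper idea (hedged: `RawDataResult` is the name later used in the follow-up commit, but the fields and method here are illustrative, not the actual implementation):

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class RawDataResult:
    """Illustrative wrapper: hands the raw zip bytes back to the caller."""

    file_name: str
    zip_data: bytes  # untouched zip bytes from the API; unpacking stays opt-in

    def save_zip(self, directory: Path) -> Path:
        """Write the archive to disk as-is, letting the user decide what's next."""
        path = directory / self.file_name
        path.write_bytes(self.zip_data)
        return path
```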
For shapefiles it would be larger! I don't like shapefiles either, but unfortunately there are plenty of people who still use them, and it's one of the most popular file formats in raw-data-api.
Problems like these would be handled by targeted, swappable implementations.
**Memory Issue**

- For large file unzipping, I would recommend this nice package: https://github.com/uktrade/stream-unzip.
- It's a very memory-efficient approach to stream-unzip files from zip to disk (see the sketch below).
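For example, a hedged sketch using stream-unzip, assuming the archive has already been streamed to disk (function and path names are illustrative):

```python
from pathlib import Path

from stream_unzip import stream_unzip


def unzip_streaming(zip_path: Path, out_dir: Path, chunk_size: int = 65536) -> None:
    """Extract a zip to disk chunk-by-chunk, never holding the archive in memory."""

    def zipped_chunks():
        with open(zip_path, "rb") as f:
            while chunk := f.read(chunk_size):
                yield chunk

    out_dir.mkdir(parents=True, exist_ok=True)
    # stream_unzip yields (name, size, chunk_iterator) per archive member
    for name, _size, unzipped_chunks in stream_unzip(zipped_chunks()):
        target = out_dir / name.decode("utf-8")
        with open(target, "wb") as out:
            # each member's chunks must be fully consumed before the next member
            for chunk in unzipped_chunks:
                out.write(chunk)
```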
**Default unzipping**

- By default the API uses `bind_zip=True` for all filetypes, right?
- But as this is a Python lib (not a direct download) and we can do neat things like stream-unzip as above, we could override this behaviour.
- I think we should have some sane defaults configured for each filetype, plus an override to `force_unzip` and set `bind_zip=False` (sketched after this list):
  - Shapefile: no (as it comprises many files)
  - KML: yes
  - Mbtiles: yes
  - FlatGeoBuf: yes
  - CSV: yes
  - GeoPackage: yes
  - PGDUMP: no (could be large, and this is a niche format)
  - GeoJSON: yes
- So in reality I think most file types will be unzipped.
- The zip also contains a metadata JSON file if I'm not mistaken, right? (Off the top of my head.)
- So perhaps instead of downloading that, we could simply read the metadata from the zip and print it to the terminal as a DEBUG-level log? (This way the user can see the info if they need it, but most probably don't want it.)
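A hedged sketch of how those per-format defaults might be encoded (the format keys and the `force_unzip` override name are illustrative, following the list above):

```python
# Illustrative mapping of the per-format unzip defaults listed above.
# True means the client unzips by default; False means the zip is handed back.
UNZIP_DEFAULTS: dict[str, bool] = {
    "shp": False,       # shapefile: comprises many files, keep zipped
    "kml": True,
    "mbtiles": True,
    "flatgeobuf": True,
    "csv": True,
    "geopackage": True,
    "pgdump": False,    # could be large, niche format
    "geojson": True,
}


def should_unzip(file_format: str, force_unzip: bool = False) -> bool:
    """Apply the per-format default, with a user-supplied override."""
    return force_unzip or UNZIP_DEFAULTS.get(file_format.lower(), True)
```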
Only thing I didn't really get:
I can send `bind_zip=False` as part of the request to get a streamed response? Is that correct?
Yes, that's true! I neglected that part 😅
Basically what we want is the file on the user's filesystem, without the zip 👍
(unless it's shapefile or pgdump)
osm_data_client/api.py
Outdated
```python
all_files = zip_ref.namelist()

if len(all_files) < 2:
    raise DownloadError(f"Expected at least 2 files in the archive, found {len(all_files)}")
```
For shapefiles it would be larger! I don't like shapefiles either, but unfortunately there are plenty of people who still use them, and it's one of the most popular file formats in raw-data-api.
Yep exactly! Related to the comments above. Shapefile is the only format we provide (I think) that is actually composed of many files to make up the dataset.
```
# info from esri
.shp—The main file that stores the feature geometry; required.
.shx—The index file that stores the index of the feature geometry; required.
.dbf—The dBASE table that stores the attribute information of features; required.
    There is a one-to-one relationship between geometry and attributes, which is
    based on record number. Attribute records in the dBASE file must be in the
    same order as records in the main file.
.sbn and .sbx—The files that store the spatial index of the features.
... etc
```
They are typically all zipped together when distributed online.
osm_data_client/api.py
Outdated
```python
    data=json.dumps(payload),
    headers=self.headers,
) as response:
    response_data = await response.json()
```
You will also get queue information after posting the request. Sometimes the API might take longer, and the user might have to wait for a bit if there are many tasks running. The response will be like this:

```json
{
  "task_id": "e4f7c25e-4904-439b-93f0-470f7807913b",
  "track_link": "/tasks/status/e4f7c25e-4904-439b-93f0-470f7807913b/",
  "queue": 0
}
```
Here if queue says 3, then you are behind 3 tasks! Worth sending this info to the client! Also, during polling you will know whether the extraction has started or not, so you can display that in the logs!
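For example, a small hedged sketch of surfacing that to the user (`response_data` is the parsed JSON from the hunk above; the logger setup is illustrative):

```python
import logging

log = logging.getLogger(__name__)

# After POSTing the request: report the queue position before polling starts.
queue_position = response_data.get("queue", 0)
if queue_position > 0:
    log.info("Request queued behind %d task(s)", queue_position)
log.debug("Track progress at: %s", response_data.get("track_link"))
```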
Oh for sure, totally agree!
You can add logs if the polling status changes! It will change from SUBMITTED to STARTED when the task starts.
Together with this, it can give the user useful feedback on the request status.
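A minimal sketch of that idea (the status values SUBMITTED/STARTED/SUCCESS/FAILED come from this thread; the endpoint shape and polling interval are assumptions):

```python
import asyncio
import logging

import aiohttp

log = logging.getLogger(__name__)


async def poll_task_status(session: aiohttp.ClientSession, status_url: str) -> dict:
    """Poll a task status endpoint, logging only when the status changes."""
    last_status = None
    while True:
        async with session.get(status_url) as response:
            data = await response.json()
        status = data.get("status")
        if status != last_status:
            log.info("Task status: %s", status)
            last_status = status
        if status in ("SUCCESS", "FAILED"):
            return data
        await asyncio.sleep(2)
```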
osm_data_client/api.py
Outdated
```python
            base_api_url: Base URL for the Raw Data API
        """
        self.base_api_url = base_api_url
        self.headers = {
```
Also add an access-token option in the headers and let people supply it if they have it. By default the API is public, but with this token they will be able to bypass the export limit and have some staff privileges in the queues based on their access token! We can add those advanced extraction methods later on; for now, just letting the user pass their access token would be nice if they have it! For example, FMTM might require the priority queue.
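A hedged sketch of that optional token (the `access-token` header name follows this comment; verify the exact header against the raw-data-api docs):

```python
def build_headers(access_token: str | None = None) -> dict[str, str]:
    """Build request headers, optionally including a user-supplied access token."""
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json",
    }
    if access_token:
        # Lets entitled users bypass the export limit / use priority queues.
        headers["access-token"] = access_token
    return headers
```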
Again, totally makes sense.
osm_data_client/api.py
Outdated
```python
try:
    async with session.get(
        url=f"{self.base_api_url}{task_link}", headers=self.headers
    ) as response:
```
You can add logs if the polling status changes! It will change from SUBMITTED to STARTED when the task starts.
Yeah I agree - adding logs throughout the lib would be a really nice change 👍

In each file:

```python
import logging
# other imports

log = logging.getLogger(__name__)
```

Then any upstream project using this can hook into the logs.
Here you could add `log.debug("xxx")` to provide the response each time, with a `log.info("xxx")` after SUCCESS or FAILED.
For the CLI, you could instantiate a logger to the terminal stdout to hook into logs that way too:

```python
import sys
import logging

# set a --verbose / -v flag
if args.verbose:
    logging.basicConfig(
        level="DEBUG",
        format=(
            "%(asctime)s.%(msecs)03d [%(levelname)s] "
            "%(name)s | %(funcName)s:%(lineno)d | %(message)s"
        ),
        datefmt="%y-%m-%d %H:%M:%S",
        stream=sys.stdout,
    )
```
pyproject.toml
Outdated
```toml
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.13"
```
Let's be a little bit flexible with the Python version? Let's support applications which use older Python; perhaps >=3.10 would be okay? What do you think?
Yup, totally. I'm simply not familiar with Python environments; I've left the default values on.
osm_data_client/utils/file.py
Outdated
```python
    return file_path


def split_by_tiles(
```
This is lovely; I require this all the time. I am just curious, what's your motivation behind this implementation? Are you using tiles somewhere already?
Honestly, I saw the function in your implementation and it looked conceptually cool, so I tried understanding how it works.
Might not be needed here, not sure.
Perhaps this belongs in another module!
Especially as it needs geopandas and shapely.
But it's pretty awesome 😄
(good that they're optional imports)
```python
    raise ValidationError("Geometry must have a 'coordinates' field")

valid_types = ["Polygon", "MultiPolygon"]
if geometry_dict["type"] not in valid_types:
```
Worth adding validation to check the CRS of the input geometry!
While a basic amount of geojson parsing/checking is useful here, I wouldn't go too wild on the CRS checking.
Mostly just check the CRS is EPSG:4326 if provided, else throw an error saying the user should convert it first (this is a Python wrapper after all, so users should have some level of technical competency).
The idea of using geopandas for possible conversion is nice, but it should always remain an optional feature. I'm not keen on pulling the large geopandas dependency into a small Python module like this; users upstream can decide if they want such a large library.
For FieldTM, we do some basic checks like this:
```python
# (imports added for context; FieldTM is a FastAPI app)
import logging
from http import HTTPStatus
from typing import Union

import geojson
from fastapi import HTTPException

log = logging.getLogger(__name__)


async def check_crs(input_geojson: Union[dict, geojson.FeatureCollection]):
    """Validate CRS is valid for a geojson."""
    log.debug("validating coordinate reference system")

    def is_valid_crs(crs_name):
        valid_crs_list = [
            "urn:ogc:def:crs:OGC:1.3:CRS84",
            "urn:ogc:def:crs:EPSG::4326",
            "WGS 84",
        ]
        return crs_name in valid_crs_list

    def is_valid_coordinate(coord):
        if coord is None:
            return False
        return -180 <= coord[0] <= 180 and -90 <= coord[1] <= 90

    error_message = (
        "ERROR: Unsupported coordinate system, it is recommended to use a "
        "GeoJSON file in WGS84(EPSG 4326) standard."
    )
    if "crs" in input_geojson:
        crs = input_geojson.get("crs", {}).get("properties", {}).get("name")
        if not is_valid_crs(crs):
            log.error(error_message)
            raise HTTPException(
                status_code=HTTPStatus.BAD_REQUEST,
                detail=error_message,
            )
        return

    if (input_geojson_type := input_geojson.get("type")) == "FeatureCollection":
        features = input_geojson.get("features", [])
        coordinates = (
            features[-1].get("geometry", {}).get("coordinates", []) if features else []
        )
    elif input_geojson_type == "Feature":
        coordinates = input_geojson.get("geometry", {}).get("coordinates", [])
    else:
        coordinates = input_geojson.get("coordinates", {})

    first_coordinate = None
    if coordinates:
        while isinstance(coordinates, list):
            first_coordinate = coordinates
            coordinates = coordinates[0]

    error_message = (
        "ERROR: The coordinates within the GeoJSON file are not valid. "
        "Is the file empty?"
    )
    if not is_valid_coordinate(first_coordinate):
        log.error(error_message)
        raise HTTPException(HTTPStatus.BAD_REQUEST, detail=error_message)
```

- I'm hoping this functionality is moved to https://github.com/hotosm/geojson-aoi-parser in the end.
- The goal is to have all our tools first process input AOIs via https://github.com/hotosm/geojson-aoi-parser, then do further processing, like downloading the data extract with this lib.
- So as much as possible, I think the coordinate system checking should be done elsewhere, with this lib implementing the basics 👍
osm_data_client/utils/transform.py
Outdated
```python
gdf.set_crs(epsg=4326, inplace=True)

# Reproject to the target CRS
gdf = gdf.to_crs(epsg=int(target_crs))
```
This is good; just make sure a list of supported CRSs is there. This function doesn't work for local coordinate systems or UTM-projected CRSs! They require extra transformation parameters which we don't want to go through in our first implementation. Like the above comment, you can add validation of supported CRSs, perhaps 3857 and 4326! Worth checking what geopandas' CRS projection is capable of, else we will mess up the geometry. (A sketch of this validation follows below.)
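A minimal sketch of that allow-list validation (the supported set of 4326 and 3857 follows the suggestion above; names are illustrative):

```python
# Illustrative allow-list: only global CRSs that reproject safely without
# extra transformation parameters.
SUPPORTED_TARGET_CRS = {4326, 3857}


def validate_target_crs(target_crs: int) -> int:
    """Reject local / UTM-projected CRSs before handing off to geopandas."""
    code = int(target_crs)
    if code not in SUPPORTED_TARGET_CRS:
        raise ValueError(
            f"Unsupported target CRS EPSG:{code}; "
            f"supported codes: {sorted(SUPPORTED_TARGET_CRS)}"
        )
    return code
```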
Yup, I got totally lost in the CRS concepts trying to make the tiles implementation work.
I'll definitely make another pass to implement your suggestions while I learn more about it!
As I mentioned in my previous comment, sorry to send you down the CRS rabbit hole 😅
I actually think this sort of thing could be really nicely handled by converting all the geopandas / shapely logic into PostGIS here https://github.com/hotosm/geojson-aoi-parser.
It's good to have separation of concerns / no mission creep for this module
I agree with Sam!
Thank you!!
Fantastic work!!
Really loving this 😄
(added to some of Kshitij's comments + added some of my own)
osm_data_client/api.py
Outdated
```python
import csv
from io import TextIOWrapper

csv_reader = csv.DictReader(TextIOWrapper(file, 'utf-8'))
return list(csv_reader)
```
I guess the idea here is that users can ingest the CSV and do additional processing?
They could load the CSV into Pandas, for example:
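A hypothetical sketch (`rows` stands in for the `list[dict]` returned by the CSV branch above):

```python
import pandas as pd

# `rows` stands in for the list[dict] produced by csv.DictReader above.
rows = [{"osm_id": "1", "building": "yes"}, {"osm_id": "2", "building": "house"}]
df = pd.DataFrame(rows)
print(df.head())
```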
osm_data_client/cli.py
Outdated
```python
    }
}

print(f"Downloading OSM data for {args.feature_type}...")
```
As above, info related to logging: all print statements could be replaced by `log.info` 👍
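For instance, a hypothetical swap for the `print` in the hunk above:

```python
import logging

log = logging.getLogger(__name__)
log.info("Downloading OSM data for %s...", args.feature_type)
```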
osm_data_client/test.py
Outdated
I assume this is here as a sanity-check test of sorts?
As you already have test modules for pytest under /tests.
Or was this committed by accident, as your tester during development?
A bit of both. It was the first testing script I wrote, committed by accident; I'll remove it hehe.
pyproject.toml
Outdated
```toml
readme = "README.md"
requires-python = ">=3.13"
dependencies = [
    "aiohttp>=3.11.16",
```
Love that the primary dep is only a simple HTTP client 👍
Also see: https://docs.hotosm.org/dev-guide/repo-management/dependencies/#locking-for-libraries-packages
When pinning for installable packages, the minimum required versions should ideally be determined.
This means that anyone installing this package isn't forced to upgrade to aiohttp v3.11.16 (April 2025), which might break something, say, if they were using v3.7.0 (Oct 2020).
I can't imagine there is a whole lot of breaking change between 3.x.x versions (there shouldn't be any).
I generally scour the changelog to determine what could be best, while being pragmatic and using something reasonably recent.
In this case, I would say v3.8.0 seems appropriate: https://github.com/aio-libs/aiohttp/blob/master/CHANGES.rst#380-2021-10-31
It's about 3.5 years old now, but I can't see any breaking changes that would affect us.
Makes sense!
ty for the explanation
```toml
[dependency-groups]
test = [
    "pytest>=8.3.5",
```
The above doesn't apply to test deps, as they aren't installed upstream.
Pinning >= the latest is fine!
| "pytest-asyncio>=0.26.0", | ||
| ] | ||
| dev = [ | ||
| "geopandas>=1.0.1", |
I like the approach of making these dev deps, then requiring the user to have them installed upstream to use them.
But note it does mean there may be version incompatibility: the user might have a version installed that doesn't work with the code written, as we have no dependency locking/solving for these deps. They could install using `pip install raw-data-api-py[dev]` and this would avoid it.
I would make a note to just say the optional functionality using geopandas was only tested on v1.0.1.
This commit addresses feedback from the initial code review and implements several key improvements to enhance the OSM Data Client:

Memory Management:
- Add stream-unzip support for memory-efficient extraction of large files
- Implement configurable memory thresholds for extraction decisions
- Add format-specific extraction policies for different output types

Flexibility & Configuration:
- Add access token support with proper authentication headers
- Implement options to control extraction behavior (force_extract, force_zip)
- Support bindZip parameter for controlling zip packaging
- Make RawDataOutputOptions configurable with sensible defaults

Improved User Experience:
- Add proper logging throughout with appropriate log levels
- Display queue position information when requests are queued
- Track and log status changes during task polling

Architecture Improvements:
- Create standalone OutputProcessor for handling file processing
- Add immutable RawDataApiMetadata for request metadata
- Implement RawDataResult class for consistent result handling

Additional Changes:
- Update dependency versions to use minimum required versions
- Make file extraction for shapefiles create a dedicated directory
- Implement proper path handling with Path objects throughout
Helloooo @spwoodcock and @kshitijrajsharma! 👋 I've implemented the suggestions from the previous review. Here's what's been added:
Let me know if there's anything else I should address! Thanks for the feedback so far, really appreciate it!
Wow, you absolutely killed it Emir - this is a very nice lib!
| """ | ||
| try: | ||
| # Prepare geometry | ||
| if args.bbox: |
Minor pedantic note: in the geo world, a bbox would typically be only a box shape. A better name for this might be `--bounds` or even just `--geojson`.
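e.g. a hypothetical rename in the CLI's argparse setup:

```python
import argparse

parser = argparse.ArgumentParser()
# hypothetical rename of the existing --bbox flag
parser.add_argument(
    "--bounds",
    help="GeoJSON Polygon/MultiPolygon defining the area of interest",
)
```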
For sure, I'll change it.
Adds the core functionality for fetching OpenStreetMap data via the HOTOSM Raw Data API.
Includes:
- `RawDataAPI`
- `get_osm_data`
- `GeometryInput`, `RequestParams`