
Conversation

@emirfabio
Collaborator

Adds the core functionality for fetching OpenStreetMap data via the HOTOSM Raw Data API.

Includes:

  • Asynchronous API client (RawDataAPI)
  • Main function get_osm_data
  • Input validation (GeometryInput, RequestParams)
  • Basic CLI interface
  • Utility functions (bbox conversion, saving, tile splitting)
  • Initial integration tests (API & CLI)
  • Example usage script

Member

@kshitijrajsharma kshitijrajsharma left a comment

Bravo! And impressive work! Amazing

except Exception as ex:
    raise DownloadError(f"Error downloading snapshot: {str(ex)}") from ex

def _extract_zip_data(self, zip_data: bytes, file_name: str) -> Union[Dict[str, Any], List[Dict[str, Any]], Dict[str, Union[str, bytes]]]:
Member

This implementation is nice, however it might fail for large zip files (especially considering you are unzipping in memory). raw-data-api supports multiple different file formats, and unzipping might not be required for all of them. For instance, a user might like to get the data as a zip file for the shapefile format, so keeping this function optional and delivering the zip back to the user might be useful, especially considering this is a Python client and users will have more flexibility to do what they want with the data!

Collaborator Author

@emirfabio emirfabio Apr 17, 2025

Totally makes sense.

Thinking about it, the main implementation could always return the zip file bytes in a result wrapper class, without adding much complexity. Then specific result handling based on size and output format could be added back in as extensions.

That way, if the user does want the client to handle unpacking the data, it can, but it's not forced behaviour.

What do you think?
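
One way to picture that wrapper (a rough sketch only; the field names and defaults here are assumptions, not the merged implementation):

from dataclasses import dataclass
from pathlib import Path
from typing import Optional


# Illustrative sketch: always keep the raw zip bytes available, extraction stays optional.
@dataclass(frozen=True)
class RawDataResult:
    zip_bytes: Optional[bytes] = None   # raw archive, if the caller wants to handle it themselves
    file_path: Optional[Path] = None    # extracted file on disk, if unpacking was requested
    file_format: str = "geojson"

    @property
    def is_extracted(self) -> bool:
        return self.file_path is not None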

Collaborator Author

For shapefiles it would be larger! I don't like shapefiles either, but unfortunately there are plenty of people who still use them, and it is one of the more popular file formats in raw-data-api.

Problems like these would be handled by targeted, swappable implementations.

Member

@spwoodcock spwoodcock Apr 17, 2025

Memory Issue

Default unzipping

  • By default the API uses bind_zip=True for all filetypes, right?
  • But as this is a python lib (not a direct download) and we can do neat things like stream unzip as above, we could override this behaviour.
  • I think we should have some sane defaults configured for each filetype, plus an override to force_unzip and set bind_zip=False:
    • Shapefile: no (as it comprises many files)
    • KML: yes
    • MBTiles: yes
    • FlatGeoBuf: yes
    • CSV: yes
    • GeoPackage: yes
    • PGDUMP: no (could be large, and this is a niche format)
    • GeoJSON: yes
  • So in reality I think most file types will be unzipped.
  • The zip also contains a metadata JSON file if I'm not mistaken, right? (off the top of my head)
  • So perhaps instead of downloading that, we could simply read the metadata from the zip and print it to the terminal as a DEBUG level log? (This way the user can see the info if they need it, but most probably don't want it.)
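
A minimal sketch of what those defaults plus an override could look like (the force_unzip name comes from the comment above; the format keys and the mapping itself are illustrative assumptions):

# Sketch only: per-format unzip defaults as proposed above, with a force_unzip override.
UNZIP_BY_DEFAULT = {
    "shp": False,       # shapefile: keep zipped, it is made of many sibling files
    "kml": True,
    "mbtiles": True,
    "flatgeobuf": True,
    "csv": True,
    "geopackage": True,
    "pgdump": False,    # can be large, niche format
    "geojson": True,
}


def should_unzip(file_format: str, force_unzip: bool = False) -> bool:
    """Decide whether to unpack the downloaded archive for a given output format."""
    if force_unzip:
        return True
    return UNZIP_BY_DEFAULT.get(file_format.lower(), False)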

Collaborator Author

@emirfabio emirfabio Apr 18, 2025

Only thing I didn't really get:
I can send bind_zip=False as part of the request to get a streamed response? Is that correct?

Member

@spwoodcock spwoodcock Apr 18, 2025

Yes, that's true! I neglected that part 😅

Basically what we want is the file on the user's filesystem, without the zip 👍
(unless it's shapefile or pgdump)

all_files = zip_ref.namelist()

if len(all_files) < 2:
    raise DownloadError(f"Expected at least 2 files in the archive, found {len(all_files)}")
Member

For shapefiles it would be larger! I don't like shapefiles either, but unfortunately there are plenty of people who still use them, and it is one of the more popular file formats in raw-data-api.

Member

Yep exactly! Related to the comments above. Shapefile is the only format we provide (I think) that is actually composed of many files to make up the dataset.

# info from esri
.shp—The main file that stores the feature geometry; required.
.shx—The index file that stores the index of the feature geometry; required.
.dbf—The dBASE table that stores the attribute information of features; required.

There is a one-to-one relationship between geometry and attributes, which is based on record number. Attribute records in the dBASE file must be in the same order as records in the main file.
.sbn and .sbx—The files that store the spatial index of the features.
... etc

They are typically all zipped together when distributed online.
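
A minimal sketch of handling that case by unpacking the whole archive into a dedicated directory (the function name and signature are illustrative, not the client's actual method):

import zipfile
from io import BytesIO
from pathlib import Path


def extract_shapefile_bundle(zip_data: bytes, out_dir: Path) -> Path:
    """Unpack every member of a shapefile zip (.shp, .shx, .dbf, ...) into one directory."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(BytesIO(zip_data)) as zip_ref:
        zip_ref.extractall(out_dir)
    return out_dir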

    data=json.dumps(payload),
    headers=self.headers,
) as response:
    response_data = await response.json()
Member

You will also get queue information after posting the request. Sometimes the API might take longer and the user might have to wait for a bit if there are many tasks running.

The response will look like this:

{
    "task_id": "e4f7c25e-4904-439b-93f0-470f7807913b",
    "track_link": "/tasks/status/e4f7c25e-4904-439b-93f0-470f7807913b/",
    "queue": 0
}

Here, if queue says 3 then you are behind 3 tasks! Worth sending this info to the client! Also, during polling you will know whether the extraction has started or not, so you can display that in the logs!
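
A rough sketch of that flow (track_link, task_id and queue come from the response shape above; the status field name and the polling interval are assumptions):

import asyncio
import logging

import aiohttp

log = logging.getLogger(__name__)


async def wait_for_task(session: aiohttp.ClientSession, base_url: str, submit_response: dict) -> dict:
    """Poll the task status endpoint, logging queue position and any status change."""
    if submit_response.get("queue", 0) > 0:
        log.info("Request queued behind %s other task(s)", submit_response["queue"])

    last_status = None
    while True:
        async with session.get(f"{base_url}{submit_response['track_link']}") as resp:
            status_data = await resp.json()
        status = status_data.get("status")  # assumed field name in the status payload
        if status != last_status:
            log.info("Task %s status: %s", submit_response["task_id"], status)
            last_status = status
        if status in ("SUCCESS", "FAILED"):
            return status_data
        await asyncio.sleep(2)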

Collaborator Author

Oh for sure, totally agree!

Collaborator Author

You can add logs when the polling status changes! It will change from submitted to started when the task starts.

Together with this, it can give the user useful feedback on the request status.

base_api_url: Base URL for the Raw Data API
"""
self.base_api_url = base_api_url
self.headers = {
Member

Also add an access-token option in the headers and let people supply it if they have one. By default the API is public, but with this token they will be able to bypass the export limit and have some staff privileges in the queues based on their access token! We can add those advanced extraction methods later on; for now, just letting the user pass their access token would be nice if they have it. For example, FMTM might require the priority queue.
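
A minimal sketch of threading the optional token through (the access-token header name follows the comment above; the constructor signature and default headers are assumptions):

from typing import Optional


class RawDataAPI:
    def __init__(self, base_api_url: str, access_token: Optional[str] = None):
        self.base_api_url = base_api_url
        self.headers = {
            "accept": "application/json",
            "Content-Type": "application/json",
        }
        # Optional: authenticated users can bypass export limits / get priority queues.
        if access_token:
            self.headers["access-token"] = access_token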

Collaborator Author

Again, totally makes sense.

try:
    async with session.get(
        url=f"{self.base_api_url}{task_link}", headers=self.headers
    ) as response:
Member

You can add logs when the polling status changes! It will change from submitted to started when the task starts.

Member

@spwoodcock spwoodcock Apr 17, 2025

Yeah I agree - adding logs throughout the lib would be a really nice change 👍

In each file:

import logging
# other imports
log = logging.getLogger(__name__)

Then any upstream project using this can hook into logs.

Here you could add log.debug("xxx") to provide the response each time, with a log.info("xxx") after SUCCESS or FAILED.

For the CLI, you could instantiate a logger to the terminal stdout to hook into logs that way too:

import sys
import logging

# set a --verbose -v flag
if args.verbose:
    logging.basicConfig(
        level="DEBUG",
        format=(
            "%(asctime)s.%(msecs)03d [%(levelname)s] "
            "%(name)s | %(funcName)s:%(lineno)d | %(message)s"
        ),
        datefmt="%y-%m-%d %H:%M:%S",
        stream=sys.stdout,
    )

pyproject.toml Outdated
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.13"
Member

Let's be a little bit flexible with the Python version? Let's support applications which use older Python; perhaps >=3.10 would be okay? What do you think?

Collaborator Author

Yup, totally. I'm simply not familiar with Python environments; I left the default values in.


return file_path

def split_by_tiles(
Member

This is lovely, I require this all the time. I am just curious, what's your motivation behind this implementation? Are you using tiles somewhere already?

Collaborator Author

@emirfabio emirfabio Apr 17, 2025

Honestly, I saw the function in your implementation and it looked conceptually cool, so I tried understanding how it works.
It might not be needed here, not sure.

Member

Perhaps this belongs in another module!

Especially as it needs geopandas and shapely.

But it's pretty awesome 😄
(good that they're optional imports)
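
For context, a very rough sketch of the tile-splitting idea using shapely only (a hypothetical helper, not the function under review):

from shapely.geometry import box


def split_bbox_into_tiles(min_x, min_y, max_x, max_y, nx=2, ny=2):
    """Split a bounding box into an nx-by-ny grid of shapely boxes."""
    width = (max_x - min_x) / nx
    height = (max_y - min_y) / ny
    tiles = []
    for i in range(nx):
        for j in range(ny):
            tiles.append(
                box(
                    min_x + i * width,
                    min_y + j * height,
                    min_x + (i + 1) * width,
                    min_y + (j + 1) * height,
                )
            )
    return tiles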

raise ValidationError("Geometry must have a 'coordinates' field")

valid_types = ["Polygon", "MultiPolygon"]
if geometry_dict["type"] not in valid_types:
Member

Worth adding validation to check the CRS of the input geometry!

Member

While a basic amount of geojson parsing / checking is useful here, I wouldn't go too wild on the CRS checking.

Mostly just check the CRS is EPSG:4326 if provided, else throw an error saying the user should convert it first (this is a python wrapper after all, so users should have some level of technical competency).

The idea of using geopandas for possible conversion is nice, but it should always remain an optional feature. I'm not keen on pulling the large geopandas dependency into a small python module like this - users upstream can decide if they want such a large library.

For FieldTM, we do some basic checks like this:

async def check_crs(input_geojson: Union[dict, geojson.FeatureCollection]):
    """Validate CRS is valid for a geojson."""
    log.debug("validating coordinate reference system")

    def is_valid_crs(crs_name):
        valid_crs_list = [
            "urn:ogc:def:crs:OGC:1.3:CRS84",
            "urn:ogc:def:crs:EPSG::4326",
            "WGS 84",
        ]
        return crs_name in valid_crs_list

    def is_valid_coordinate(coord):
        if coord is None:
            return False
        return -180 <= coord[0] <= 180 and -90 <= coord[1] <= 90

    error_message = (
        "ERROR: Unsupported coordinate system, it is recommended to use a "
        "GeoJSON file in WGS84(EPSG 4326) standard."
    )
    if "crs" in input_geojson:
        crs = input_geojson.get("crs", {}).get("properties", {}).get("name")
        if not is_valid_crs(crs):
            log.error(error_message)
            raise HTTPException(
                status_code=HTTPStatus.BAD_REQUEST,
                detail=error_message,
            )
        return

    if (input_geojson_type := input_geojson.get("type")) == "FeatureCollection":
        features = input_geojson.get("features", [])
        coordinates = (
            features[-1].get("geometry", {}).get("coordinates", []) if features else []
        )
    elif input_geojson_type == "Feature":
        coordinates = input_geojson.get("geometry", {}).get("coordinates", [])
    else:
        coordinates = input_geojson.get("coordinates", {})

    first_coordinate = None
    if coordinates:
        while isinstance(coordinates, list):
            first_coordinate = coordinates
            coordinates = coordinates[0]

    error_message = (
        "ERROR: The coordinates within the GeoJSON file are not valid. "
        "Is the file empty?"
    )
    if not is_valid_coordinate(first_coordinate):
        log.error(error_message)
        raise HTTPException(HTTPStatus.BAD_REQUEST, detail=error_message)

gdf.set_crs(epsg=4326, inplace=True)

# Reproject to the target CRS
gdf = gdf.to_crs(epsg=int(target_crs))
Member

This is good, just make sure the list of supported CRS is there; this function doesn't work for local coordinate systems or UTM-projected CRS! They require extra transformation parameters which we don't want to go through in our first implementation. Like the above comment, you could add validation of supported CRS, perhaps 3857 and 4326! Worth checking what geopandas CRS projection is capable of, else we will mess up the geometry.
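
A small sketch of guarding the reprojection with such a whitelist (the allowed codes come from the comment above; the function name and error type are illustrative):

# Sketch only: restrict the first implementation to well-known CRS codes.
SUPPORTED_EPSG = {4326, 3857}


def reproject(gdf, target_crs: int):
    """Reproject a geopandas GeoDataFrame only if the target EPSG code is supported."""
    if int(target_crs) not in SUPPORTED_EPSG:
        raise ValueError(
            f"EPSG:{target_crs} is not supported; only {sorted(SUPPORTED_EPSG)} are handled"
        )
    return gdf.to_crs(epsg=int(target_crs))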

Collaborator Author

Yup, I got totally lost in the CRS concepts trying to make the tiles implementation work.
I'll definitely make another pass to implement your suggestions while I learn more about it!

Member

As I mentioned in my previous comment, sorry to send you down the CRS rabbit hole 😅

I actually think this sort of thing could be really nicely handled by converting all the geopandas / shapely logic into PostGIS here https://github.com/hotosm/geojson-aoi-parser.

It's good to have separation of concerns / no mission creep for this module

Member

I agree with Sam!

@emirfabio
Collaborator Author

Bravo! And impressive work! Amazing

Thank you!!

Member

@spwoodcock spwoodcock left a comment

Fantastic work!!

Really loving this 😄

(added to some of Kshitij's comments + added some of my own)


import csv
from io import TextIOWrapper
csv_reader = csv.DictReader(TextIOWrapper(file, 'utf-8'))
return list(csv_reader)
Member

I guess the idea here is that users can ingest the CSV and do additional processing?

They could load the CSV into Pandas, for example.
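
For instance (pandas would be the downstream user's own dependency, not this client's; the rows variable is a stand-in for the parsed CSV the client returns):

import pandas as pd

# `rows` stands in for the list of dicts produced by csv.DictReader above.
rows = [{"osm_id": "1", "building": "yes"}, {"osm_id": "2", "building": "residential"}]
df = pd.DataFrame(rows)
print(df.head())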

}
}

print(f"Downloading OSM data for {args.feature_type}...")
Member

As above, info related to logging:
All print statements could be replaced by log.info 👍


Member

I assume this is here as a sanity check test of sorts?

As you already have test modules for pytest under /tests.

Or was this committed by accident, as you tested during development?

Collaborator Author

A bit of both; it was the first testing script I wrote and it was committed by accident. I'll remove it, hehe.




pyproject.toml Outdated
readme = "README.md"
requires-python = ">=3.13"
dependencies = [
"aiohttp>=3.11.16",
Member

Love that the primary dep is only a simple http client 👍

Also see: https://docs.hotosm.org/dev-guide/repo-management/dependencies/#locking-for-libraries-packages

When pinning for installable packages, the minimum required versions should ideally be determined.

This means that anyone installing this package isn't forced to upgrade to aiohttp v3.11.16 (April 2025), which might break something, say, if they were using v3.7.0 (Oct 2020).

I can't imagine there is a whole lot of breaking change between 3.x.x versions (there shouldn't be any).

I generally scour the changelog to determine what could be best, while being pragmatic and using something reasonably recent.

In this case, I would say v3.8.0 seems appropriate: https://github.com/aio-libs/aiohttp/blob/master/CHANGES.rst#380-2021-10-31
It's about 3.5yrs old now, but I can't see any breaking changes that would affect us

Collaborator Author

Makes sense!
ty for the explanation


[dependency-groups]
test = [
"pytest>=8.3.5",
Member

The above doesn't apply to test deps, as they aren't installed upstream.
Pinning >= the latest is fine!

"pytest-asyncio>=0.26.0",
]
dev = [
"geopandas>=1.0.1",
Member

I like the approach of making these dev dependencies, then requiring the user to have them installed upstream to use them.

But note it does mean there may be version incompatibility. The user might have a version installed that doesn't work with the code written, as we have no dependency locking/solving for these deps. They could install using pip install raw-data-api-py[dev] and this would avoid it.

I would add a note to say that the optional functionality using geopandas was only tested on v1.0.1.

This commit addresses feedback from the initial code review and implements
several key improvements to enhance the OSM Data Client:

Memory Management:
- Add stream-unzip support for memory-efficient extraction of large files
- Implement configurable memory thresholds for extraction decisions
- Add format-specific extraction policies for different output types

Flexibility & Configuration:
- Add access token support with proper authentication headers
- Implement options to control extraction behavior (force_extract, force_zip)
- Support bindZip parameter for controlling zip packaging
- Make RawDataOutputOptions configurable with sensible defaults

Improved User Experience:
- Add proper logging throughout with appropriate log levels
- Display queue position information when requests are queued
- Track and log status changes during task polling

Architecture Improvements:
- Create standalone OutputProcessor for handling file processing
- Add immutable RawDataApiMetadata for request metadata
- Implement RawDataResult class for consistent result handling

Additional Changes:
- Update dependency versions to use minimum required versions
- Make file extraction for shapefiles create a dedicated directory
- Implement proper path handling with Path objects throughout
@emirfabio
Collaborator Author

Helloooo @spwoodcock and @kshitijrajsharma! 👋

I've implemented the suggestions from the previous review. Here's what's been added:

  • Memory management improvements with stream-unzip for large files
  • Added proper logging throughout with appropriate levels
  • Implemented queue position tracking and status changes during polling
  • Added access token support in headers
  • Created format-specific extraction policies (zip vs. extract) with defaults
  • Special handling for shapefiles with dedicated directories
  • Added proper error messages throughout

Let me know if there's anything else I should address!

Thanks for the feedback so far, really appreciate it!

@spwoodcock
Member

Wow you absolutely killed it Emir - this is a very nice lib!

"""
try:
# Prepare geometry
if args.bbox:
Member

Minor pedantic note: in the geo world a bbox would typically only be a box shape. A better name for this might be --bounds or even just --geojson.
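
e.g. a hedged sketch of the renamed flag (the argparse setup here is illustrative, not the CLI's actual code):

import argparse

parser = argparse.ArgumentParser(description="Fetch OSM data via the HOTOSM Raw Data API")
# "--bounds" (or "--geojson") instead of "--bbox", since the input may be any polygon geometry
parser.add_argument("--bounds", help="GeoJSON geometry (or path to a GeoJSON file) defining the AOI")
args = parser.parse_args()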

Collaborator Author

For sure, I'll change it.

@spwoodcock spwoodcock merged commit 952d5a0 into hotosm:main Apr 29, 2025