feat: first impl #2
Conversation
kshitijrajsharma left a comment:
Bravo! And impressive work! Amazing.
osm_data_client/api.py
Outdated
```python
        except Exception as ex:
            raise DownloadError(f"Error downloading snapshot: {str(ex)}") from ex

    def _extract_zip_data(self, zip_data: bytes, file_name: str) -> Union[Dict[str, Any], List[Dict[str, Any]], Dict[str, Union[str, bytes]]]:
```
This implementation is nice, however it might fail for large zip files (especially considering you are unzipping in memory). raw-data-api supports multiple different file formats, and unzipping might not be required for all of them; for instance, a user might like to get the data in a zip file for the shapefile format. So keeping this function optional and delivering the zip back to the user might be useful, especially considering this is a Python client and the user will have more flexibility to do what they want with the data!
Totally makes sense.
Thinking about it, the main implementation could always return the zip file bytes in a result wrapper class, without adding much complexity.
Then specific result handling based on size and output format could be re-added as extensions, so that if the user does want the client to handle unpacking the data it can, but it's not forced behaviour.
What do you think?
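To illustrate, a minimal sketch of that wrapper idea (hedged: `RawDataResult` is the name later used in the follow-up commit, but the fields and method here are illustrative, not the actual implementation):

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class RawDataResult:
    """Illustrative wrapper: hands the raw zip bytes back to the caller."""

    file_name: str
    zip_data: bytes  # untouched zip bytes from the API; unpacking stays opt-in

    def save_zip(self, directory: Path) -> Path:
        """Write the archive to disk as-is, letting the user decide what's next."""
        path = directory / self.file_name
        path.write_bytes(self.zip_data)
        return path
```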
For shapefiles it would be larger! I don't like shapefiles either, but unfortunately there are plenty of people who still use them, and it's one of the most popular file formats in raw-data-api.
Problems like these would be handled by targeted, swappable implementations.
**Memory Issue**

- For large file unzipping, I would recommend this nice package: https://github.com/uktrade/stream-unzip.
- It's a very memory-efficient approach to stream-unzip files from zip to disk (see the sketch below).
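For example, a hedged sketch using stream-unzip, assuming the archive has already been streamed to disk (function and path names are illustrative):

```python
from pathlib import Path

from stream_unzip import stream_unzip


def unzip_streaming(zip_path: Path, out_dir: Path, chunk_size: int = 65536) -> None:
    """Extract a zip to disk chunk-by-chunk, never holding the archive in memory."""

    def zipped_chunks():
        with open(zip_path, "rb") as f:
            while chunk := f.read(chunk_size):
                yield chunk

    out_dir.mkdir(parents=True, exist_ok=True)
    # stream_unzip yields (name, size, chunk_iterator) per archive member
    for name, _size, unzipped_chunks in stream_unzip(zipped_chunks()):
        target = out_dir / name.decode("utf-8")
        with open(target, "wb") as out:
            # each member's chunks must be fully consumed before the next member
            for chunk in unzipped_chunks:
                out.write(chunk)
```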
**Default unzipping**

- By default the API uses `bind_zip=True` for all filetypes, right?
- But as this is a Python lib (not a direct download) and we can do neat things like stream-unzip as above, we could override this behaviour.
- I think we should have some sane defaults configured for each filetype, plus an override to `force_unzip` and set `bind_zip=False` (sketched after this list):
  - Shapefile: no (as it comprises many files)
  - KML: yes
  - Mbtiles: yes
  - FlatGeoBuf: yes
  - CSV: yes
  - GeoPackage: yes
  - PGDUMP: no (could be large, and this is a niche format)
  - GeoJSON: yes
- So in reality I think most file types will be unzipped.
- The zip also contains a metadata JSON file if I'm not mistaken, right? (Off the top of my head.)
- So perhaps instead of downloading that, we could simply read the metadata from the zip and print it to the terminal as a DEBUG-level log? (This way the user can see the info if they need it, but most probably don't want it.)
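A hedged sketch of how those per-format defaults might be encoded (the format keys and the `force_unzip` override name are illustrative, following the list above):

```python
# Illustrative mapping of the per-format unzip defaults listed above.
# True means the client unzips by default; False means the zip is handed back.
UNZIP_DEFAULTS: dict[str, bool] = {
    "shp": False,       # shapefile: comprises many files, keep zipped
    "kml": True,
    "mbtiles": True,
    "flatgeobuf": True,
    "csv": True,
    "geopackage": True,
    "pgdump": False,    # could be large, niche format
    "geojson": True,
}


def should_unzip(file_format: str, force_unzip: bool = False) -> bool:
    """Apply the per-format default, with a user-supplied override."""
    return force_unzip or UNZIP_DEFAULTS.get(file_format.lower(), True)
```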
Only thing I didn't really get:
I can send `bind_zip=False` as part of the request to get a streamed response? Is that correct?
Yes, that's true! I neglected that part 😅
Basically what we want is the file on the user's filesystem, without the zip 👍
(unless it's shapefile or pgdump)
osm_data_client/api.py
Outdated
```python
all_files = zip_ref.namelist()

if len(all_files) < 2:
    raise DownloadError(f"Expected at least 2 files in the archive, found {len(all_files)}")
```
For shapefiles it would be larger! I don't like shapefiles either, but unfortunately there are plenty of people who still use them, and it's one of the most popular file formats in raw-data-api.
Yep exactly! Related to the comments above. Shapefile is the only format we provide (I think) that is actually composed of many files to make up the dataset.
```
# info from esri
.shp—The main file that stores the feature geometry; required.
.shx—The index file that stores the index of the feature geometry; required.
.dbf—The dBASE table that stores the attribute information of features; required.
    There is a one-to-one relationship between geometry and attributes, which is
    based on record number. Attribute records in the dBASE file must be in the
    same order as records in the main file.
.sbn and .sbx—The files that store the spatial index of the features.
... etc
```
They are typically all zipped together when distributed online.
osm_data_client/api.py
Outdated
```python
    data=json.dumps(payload),
    headers=self.headers,
) as response:
    response_data = await response.json()
```
You will also get queue information after posting the request. Sometimes the API might take longer, and the user might have to wait for a bit if there are many tasks running. The response will be like this:

```json
{
  "task_id": "e4f7c25e-4904-439b-93f0-470f7807913b",
  "track_link": "/tasks/status/e4f7c25e-4904-439b-93f0-470f7807913b/",
  "queue": 0
}
```
Here if queue says 3, then you are behind 3 tasks! Worth sending this info to the client! Also, during polling you will know whether the extraction has started or not, so you can display that in the logs!
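For example, a small hedged sketch of surfacing that to the user (`response_data` is the parsed JSON from the hunk above; the logger setup is illustrative):

```python
import logging

log = logging.getLogger(__name__)

# After POSTing the request: report the queue position before polling starts.
queue_position = response_data.get("queue", 0)
if queue_position > 0:
    log.info("Request queued behind %d task(s)", queue_position)
log.debug("Track progress at: %s", response_data.get("track_link"))
```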
Oh for sure, totally agree!
You can add logs if the polling status changes! It will change from SUBMITTED to STARTED when the task starts.
Together with this, it can give the user useful feedback on the request status.
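A minimal sketch of that idea (the status values SUBMITTED/STARTED/SUCCESS/FAILED come from this thread; the endpoint shape and polling interval are assumptions):

```python
import asyncio
import logging

import aiohttp

log = logging.getLogger(__name__)


async def poll_task_status(session: aiohttp.ClientSession, status_url: str) -> dict:
    """Poll a task status endpoint, logging only when the status changes."""
    last_status = None
    while True:
        async with session.get(status_url) as response:
            data = await response.json()
        status = data.get("status")
        if status != last_status:
            log.info("Task status: %s", status)
            last_status = status
        if status in ("SUCCESS", "FAILED"):
            return data
        await asyncio.sleep(2)
```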
osm_data_client/api.py
Outdated
```python
            base_api_url: Base URL for the Raw Data API
        """
        self.base_api_url = base_api_url
        self.headers = {
```
Also add an access-token option in the headers and let people supply it if they have it. By default the API is public, but with this token they will be able to bypass the export limit and have some staff privileges in the queues based on their access token! We can add those advanced extraction methods later on; for now, just letting the user pass their access token would be nice if they have it! For example, FMTM might require the priority queue.
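A hedged sketch of that optional token (the `access-token` header name follows this comment; verify the exact header against the raw-data-api docs):

```python
def build_headers(access_token: str | None = None) -> dict[str, str]:
    """Build request headers, optionally including a user-supplied access token."""
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json",
    }
    if access_token:
        # Lets entitled users bypass the export limit / use priority queues.
        headers["access-token"] = access_token
    return headers
```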
Again, totally makes sense.
osm_data_client/api.py
Outdated
```python
try:
    async with session.get(
        url=f"{self.base_api_url}{task_link}", headers=self.headers
    ) as response:
```
You can add logs if the polling status changes! It will change from SUBMITTED to STARTED when the task starts.
Yeah I agree - adding logs throughout the lib would be a really nice change 👍

In each file:

```python
import logging
# other imports

log = logging.getLogger(__name__)
```

Then any upstream project using this can hook into the logs.
Here you could add `log.debug("xxx")` to provide the response each time, with a `log.info("xxx")` after SUCCESS or FAILED.
For the CLI, you could instantiate a logger to the terminal stdout to hook into logs that way too:

```python
import sys
import logging

# set a --verbose / -v flag
if args.verbose:
    logging.basicConfig(
        level="DEBUG",
        format=(
            "%(asctime)s.%(msecs)03d [%(levelname)s] "
            "%(name)s | %(funcName)s:%(lineno)d | %(message)s"
        ),
        datefmt="%y-%m-%d %H:%M:%S",
        stream=sys.stdout,
    )
```
pyproject.toml
Outdated
```toml
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.13"
```
Let's be a little bit flexible with the Python version? Let's support applications which use older Python; perhaps >=3.10 would be okay? What do you think?
Yup, totally. I'm simply not familiar with Python environments; I've left the default values on.
osm_data_client/utils/file.py
Outdated
```python
    return file_path


def split_by_tiles(
```
This is lovely; I require this all the time. I am just curious, what's your motivation behind this implementation? Are you using tiles somewhere already?
Honestly, I saw the function in your implementation and it looked conceptually cool, so I tried understanding how it works.
Might not be needed here, not sure.
Perhaps this belongs in another module!
Especially as it needs geopandas and shapely.
But it's pretty awesome 😄
(good that they're optional imports)
```python
    raise ValidationError("Geometry must have a 'coordinates' field")

valid_types = ["Polygon", "MultiPolygon"]
if geometry_dict["type"] not in valid_types:
```
Worth adding validation to check the CRS of the input geometry!
While a basic amount of geojson parsing/checking is useful here, I wouldn't go too wild on the CRS checking.
Mostly just check the CRS is EPSG:4326 if provided, else throw an error saying the user should convert it first (this is a Python wrapper after all, so users should have some level of technical competency).
The idea of using geopandas for possible conversion is nice, but it should always remain an optional feature. I'm not keen on pulling the large geopandas dependency into a small Python module like this; users upstream can decide if they want such a large library.
For FieldTM, we do some basic checks like this:
```python
# (imports added for context; FieldTM is a FastAPI app)
import logging
from http import HTTPStatus
from typing import Union

import geojson
from fastapi import HTTPException

log = logging.getLogger(__name__)


async def check_crs(input_geojson: Union[dict, geojson.FeatureCollection]):
    """Validate CRS is valid for a geojson."""
    log.debug("validating coordinate reference system")

    def is_valid_crs(crs_name):
        valid_crs_list = [
            "urn:ogc:def:crs:OGC:1.3:CRS84",
            "urn:ogc:def:crs:EPSG::4326",
            "WGS 84",
        ]
        return crs_name in valid_crs_list

    def is_valid_coordinate(coord):
        if coord is None:
            return False
        return -180 <= coord[0] <= 180 and -90 <= coord[1] <= 90

    error_message = (
        "ERROR: Unsupported coordinate system, it is recommended to use a "
        "GeoJSON file in WGS84(EPSG 4326) standard."
    )
    if "crs" in input_geojson:
        crs = input_geojson.get("crs", {}).get("properties", {}).get("name")
        if not is_valid_crs(crs):
            log.error(error_message)
            raise HTTPException(
                status_code=HTTPStatus.BAD_REQUEST,
                detail=error_message,
            )
        return

    if (input_geojson_type := input_geojson.get("type")) == "FeatureCollection":
        features = input_geojson.get("features", [])
        coordinates = (
            features[-1].get("geometry", {}).get("coordinates", []) if features else []
        )
    elif input_geojson_type == "Feature":
        coordinates = input_geojson.get("geometry", {}).get("coordinates", [])
    else:
        coordinates = input_geojson.get("coordinates", {})

    first_coordinate = None
    if coordinates:
        while isinstance(coordinates, list):
            first_coordinate = coordinates
            coordinates = coordinates[0]

    error_message = (
        "ERROR: The coordinates within the GeoJSON file are not valid. "
        "Is the file empty?"
    )
    if not is_valid_coordinate(first_coordinate):
        log.error(error_message)
        raise HTTPException(HTTPStatus.BAD_REQUEST, detail=error_message)
```

- I'm hoping this functionality is moved to https://github.com/hotosm/geojson-aoi-parser in the end.
- The goal is to have all our tools first process input AOIs via https://github.com/hotosm/geojson-aoi-parser, then do further processing, like downloading the data extract with this lib.
- So as much as possible, I think the coordinate system checking should be done elsewhere, with this lib implementing the basics 👍
osm_data_client/utils/transform.py
Outdated
```python
gdf.set_crs(epsg=4326, inplace=True)

# Reproject to the target CRS
gdf = gdf.to_crs(epsg=int(target_crs))
```
This is good; just make sure a list of supported CRSs is there. This function doesn't work for local coordinate systems or UTM-projected CRSs! They require extra transformation parameters which we don't want to go through in our first implementation. Like the above comment, you can add validation of supported CRSs, perhaps 3857 and 4326! Worth checking what geopandas' CRS projection is capable of, else we will mess up the geometry. (A sketch of this validation follows below.)
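A minimal sketch of that allow-list validation (the supported set of 4326 and 3857 follows the suggestion above; names are illustrative):

```python
# Illustrative allow-list: only global CRSs that reproject safely without
# extra transformation parameters.
SUPPORTED_TARGET_CRS = {4326, 3857}


def validate_target_crs(target_crs: int) -> int:
    """Reject local / UTM-projected CRSs before handing off to geopandas."""
    code = int(target_crs)
    if code not in SUPPORTED_TARGET_CRS:
        raise ValueError(
            f"Unsupported target CRS EPSG:{code}; "
            f"supported codes: {sorted(SUPPORTED_TARGET_CRS)}"
        )
    return code
```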
Yup, I got totally lost in the CRS concepts trying to make the tiles implementation work.
I'll definitely make another pass to implement your suggestions while I learn more about it!
As I mentioned in my previous comment, sorry to send you down the CRS rabbit hole 😅
I actually think this sort of thing could be really nicely handled by converting all the geopandas / shapely logic into PostGIS here https://github.com/hotosm/geojson-aoi-parser.
It's good to have separation of concerns / no mission creep for this module
I agree with Sam!
Thank you!!
Fantastic work!!
Really loving this 😄
(added to some of Kshitij's comments + added some of my own)
osm_data_client/api.py
Outdated
```python
import csv
from io import TextIOWrapper

csv_reader = csv.DictReader(TextIOWrapper(file, 'utf-8'))
return list(csv_reader)
```
I guess the idea here is that users can ingest the CSV and do additional processing?
They could load the CSV into Pandas, for example:
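A hypothetical sketch (`rows` stands in for the `list[dict]` returned by the CSV branch above):

```python
import pandas as pd

# `rows` stands in for the list[dict] produced by csv.DictReader above.
rows = [{"osm_id": "1", "building": "yes"}, {"osm_id": "2", "building": "house"}]
df = pd.DataFrame(rows)
print(df.head())
```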
osm_data_client/cli.py
Outdated
```python
    }
}

print(f"Downloading OSM data for {args.feature_type}...")
```
As above, info related to logging: all print statements could be replaced by `log.info` 👍
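For instance, a hypothetical swap for the `print` in the hunk above:

```python
import logging

log = logging.getLogger(__name__)
log.info("Downloading OSM data for %s...", args.feature_type)
```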
osm_data_client/test.py
Outdated
I assume this is here as a sanity-check test of sorts?
As you already have test modules for pytest under /tests.
Or was this committed by accident, as your tester during development?
A bit of both. It was the first testing script I wrote, committed by accident; I'll remove it hehe.
pyproject.toml
Outdated
```toml
readme = "README.md"
requires-python = ">=3.13"
dependencies = [
    "aiohttp>=3.11.16",
```
Love that the primary dep is only a simple HTTP client 👍
Also see: https://docs.hotosm.org/dev-guide/repo-management/dependencies/#locking-for-libraries-packages
When pinning for installable packages, the minimum required versions should ideally be determined.
This means that anyone installing this package isn't forced to upgrade to aiohttp v3.11.16 (April 2025), which might break something, say, if they were using v3.7.0 (Oct 2020).
I can't imagine there is a whole lot of breaking change between 3.x.x versions (there shouldn't be any).
I generally scour the changelog to determine what could be best, while being pragmatic and using something reasonably recent.
In this case, I would say v3.8.0 seems appropriate: https://github.com/aio-libs/aiohttp/blob/master/CHANGES.rst#380-2021-10-31
It's about 3.5 years old now, but I can't see any breaking changes that would affect us.
Makes sense!
ty for the explanation
```toml
[dependency-groups]
test = [
    "pytest>=8.3.5",
```
The above doesn't apply to test deps, as they aren't installed upstream.
Pinning >= the latest is fine!
| "pytest-asyncio>=0.26.0", | ||
| ] | ||
| dev = [ | ||
| "geopandas>=1.0.1", |
I like the approach of making these dev deps, then requiring the user to have them installed upstream to use them.
But note it does mean there may be version incompatibility: the user might have a version installed that doesn't work with the code written, as we have no dependency locking/solving for these deps. They could install using `pip install raw-data-api-py[dev]` and this would avoid it.
I would make a note to just say the optional functionality using geopandas was only tested on v1.0.1.
This commit addresses feedback from the initial code review and implements several key improvements to enhance the OSM Data Client:

Memory Management:
- Add stream-unzip support for memory-efficient extraction of large files
- Implement configurable memory thresholds for extraction decisions
- Add format-specific extraction policies for different output types

Flexibility & Configuration:
- Add access token support with proper authentication headers
- Implement options to control extraction behavior (force_extract, force_zip)
- Support bindZip parameter for controlling zip packaging
- Make RawDataOutputOptions configurable with sensible defaults

Improved User Experience:
- Add proper logging throughout with appropriate log levels
- Display queue position information when requests are queued
- Track and log status changes during task polling

Architecture Improvements:
- Create standalone OutputProcessor for handling file processing
- Add immutable RawDataApiMetadata for request metadata
- Implement RawDataResult class for consistent result handling

Additional Changes:
- Update dependency versions to use minimum required versions
- Make file extraction for shapefiles create a dedicated directory
- Implement proper path handling with Path objects throughout
Helloooo @spwoodcock and @kshitijrajsharma! 👋 I've implemented the suggestions from the previous review. Here's what's been added:
Let me know if there's anything else I should address! Thanks for the feedback so far, really appreciate it!
Wow, you absolutely killed it Emir - this is a very nice lib!
| """ | ||
| try: | ||
| # Prepare geometry | ||
| if args.bbox: |
Minor pedantic note: in the geo world, a bbox would typically be only a box shape. A better name for this might be `--bounds` or even just `--geojson`.
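e.g. a hypothetical rename in the CLI's argparse setup:

```python
import argparse

parser = argparse.ArgumentParser()
# hypothetical rename of the existing --bbox flag
parser.add_argument(
    "--bounds",
    help="GeoJSON Polygon/MultiPolygon defining the area of interest",
)
```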
For sure, I'll change it.
Adds the core functionality for fetching OpenStreetMap data via the HOTOSM Raw Data API.
Includes:
- `RawDataAPI`
- `get_osm_data`
- `GeometryInput`, `RequestParams`