feat: add option to save parquet files with wkt geometry (#56)
* chore: change readme cli examples

* feat: added option to save geometry in the wkt format

* chore: add changelog entry

* refactor: change reading table function

* feat: added total elapsed time summary message

* ci: fix docs building pipeline

* chore: change elapsed time notification logic

* chore: disable time elapsed information on environments without rich installed

* chore: apply refurb suggestion
RaczeQ committed Mar 13, 2024
1 parent 0f26626 commit d08e9a0
Showing 12 changed files with 279 additions and 132 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/ci-dev.yml
@@ -73,7 +73,7 @@ jobs:
- name: Install quackosm
run: |
pdm build -v -d dist
pip install 'quackosm @ file://'"$(pwd)/$(find dist -name '*.whl')"
pip install 'quackosm[cli] @ file://'"$(pwd)/$(find dist -name '*.whl')"
- name: Configure Git user
run: |
git config --local user.email "github-actions[bot]@users.noreply.github.com"
2 changes: 1 addition & 1 deletion .github/workflows/ci-prod.yml
@@ -63,7 +63,7 @@ jobs:
- name: Install quackosm
run: |
pdm build -v -d dist
pip install 'quackosm @ file://'"$(pwd)/$(find dist -name '*.whl')"
pip install 'quackosm[cli] @ file://'"$(pwd)/$(find dist -name '*.whl')"
- name: Configure Git user
run: |
git config --local user.email "github-actions[bot]@users.noreply.github.com"
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -11,10 +11,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

- Option to disable progress reporting with the `--silent` flag and `silent_mode` argument [#14](https://github.com/kraina-ai/quackosm/issues/14)
- New example notebook dedicated to the command line interface
- Option to save parquet files with WKT geometry [#7](https://github.com/kraina-ai/quackosm/issues/7)
- Total elapsed time summary at the end [#15](https://github.com/kraina-ai/quackosm/issues/15)

### Changed

- Simplified and improved ways grouping process
- Renamed `rows_per_bucket` parameter to `rows_per_group`

## [0.4.5] - 2024-03-07

152 changes: 81 additions & 71 deletions README.md
@@ -163,41 +163,45 @@ way/993121275 {'building': 'yes', 'name': ... POLYGON ((7.43214 43.7481...

```console
$ quackosm monaco.osm.pbf
⠏ [ 1/33] Reading nodes • 0:00:00
⠋ [ 2/33] Filtering nodes - intersection • 0:00:00
⠋ [ 3/33] Filtering nodes - tags • 0:00:00
⠋ [ 4/33] Calculating distinct filtered nodes ids • 0:00:00
⠸ [ 5/33] Reading ways • 0:00:00
⠙ [ 6/33] Unnesting ways • 0:00:00
⠋ [ 7/33] Filtering ways - valid refs • 0:00:00
⠋ [ 8/33] Filtering ways - intersection • 0:00:00
⠋ [ 9/33] Filtering ways - tags • 0:00:00
⠋ [ 10/33] Calculating distinct filtered ways ids • 0:00:00
⠋ [ 11/33] Reading relations • 0:00:00
⠋ [ 12/33] Unnesting relations • 0:00:00
⠋ [ 13/33] Filtering relations - valid refs • 0:00:00
⠋ [ 14/33] Filtering relations - intersection • 0:00:00
⠋ [ 15/33] Filtering relations - tags • 0:00:00
⠋ [ 16/33] Calculating distinct filtered relations ids • 0:00:00
⠋ [ 17/33] Loading required ways - by relations • 0:00:00
⠋ [ 18/33] Calculating distinct required ways ids • 0:00:00
⠙ [ 19/33] Saving filtered nodes with geometries • 0:00:00
⠸ [ 20/33] Saving required nodes with structs • 0:00:00
⠼ [ 21/33] Grouping filtered ways • 0:00:00
[ 22/33] Saving filtered ways with linestrings 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 • 0:00:00 < 0:00:00
⠙ [ 23/33] Grouping required ways • 0:00:00
[ 24/33] Saving required ways with linestrings 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 • 0:00:00 < 0:00:00
⠴ [ 25/33] Saving filtered ways with geometries • 0:00:00
⠹ [ 26/33] Saving valid relations parts • 0:00:00
⠋ [27.1/33] Saving relations inner parts - valid geometries • 0:00:00
⠋ [27.2/33] Saving relations inner parts - invalid geometries • 0:00:00
⠋ [28.1/33] Saving relations outer parts - valid geometries • 0:00:00
⠋ [28.2/33] Saving relations outer parts - invalid geometries • 0:00:00
⠋ [ 29/33] Saving relations outer parts with holes • 0:00:00
⠋ [ 30/33] Saving relations outer parts without holes • 0:00:00
⠋ [ 31/33] Saving filtered relations with geometries • 0:00:00
⠸ [32.1/33] Saving valid features • 0:00:00
⠙ [ 33/33] Saving final geoparquet file • 0:00:00
⠙ [ 1/32] Reading nodes • 0:00:00
⠋ [ 2/32] Filtering nodes - intersection • 0:00:00
⠙ [ 3/32] Filtering nodes - tags • 0:00:00
⠋ [ 4/32] Calculating distinct filtered nodes ids • 0:00:00
⠙ [ 5/32] Reading ways • 0:00:00
⠋ [ 6/32] Unnesting ways • 0:00:00
⠹ [ 7/32] Filtering ways - valid refs • 0:00:00
⠋ [ 8/32] Filtering ways - intersection • 0:00:00
⠙ [ 9/32] Filtering ways - tags • 0:00:00
⠋ [ 10/32] Calculating distinct filtered ways ids • 0:00:00
⠋ [ 11/32] Reading relations • 0:00:00
⠋ [ 12/32] Unnesting relations • 0:00:00
⠹ [ 13/32] Filtering relations - valid refs • 0:00:00
⠋ [ 14/32] Filtering relations - intersection • 0:00:00
⠙ [ 15/32] Filtering relations - tags • 0:00:00
⠋ [ 16/32] Calculating distinct filtered relations ids • 0:00:00
⠙ [ 17/32] Loading required ways - by relations • 0:00:00
⠋ [ 18/32] Calculating distinct required ways ids • 0:00:00
⠙ [ 19/32] Saving filtered nodes with geometries • 0:00:00
⠋ [20.1/32] Grouping filtered ways - assigning groups • 0:00:00
⠧ [20.2/32] Grouping filtered ways - joining with nodes • 0:00:00
⠋ [20.3/32] Grouping filtered ways - partitioning by group • 0:00:00
[ 21/32] Saving filtered ways with linestrings 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 • 0:00:00 < 0:00:00
⠋ [22.1/32] Grouping required ways - assigning groups • 0:00:00
⠧ [22.2/32] Grouping required ways - joining with nodes • 0:00:00
⠋ [22.3/32] Grouping required ways - partitioning by group • 0:00:00
[ 23/32] Saving required ways with linestrings 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 • 0:00:00 < 0:00:00
⠹ [ 24/32] Saving filtered ways with geometries • 0:00:00
⠸ [ 25/32] Saving valid relations parts • 0:00:00
⠋ [26.1/32] Saving relations inner parts - valid geometries • 0:00:00
⠋ [26.2/32] Saving relations inner parts - invalid geometries • 0:00:00
⠋ [27.1/32] Saving relations outer parts - valid geometries • 0:00:00
⠋ [27.2/32] Saving relations outer parts - invalid geometries • 0:00:00
⠋ [ 28/32] Saving relations outer parts with holes • 0:00:00
⠋ [ 29/32] Saving relations outer parts without holes • 0:00:00
⠙ [ 30/32] Saving filtered relations with geometries • 0:00:00
⠹ [31.1/32] Saving valid features • 0:00:00
⠋ [ 32/32] Saving final geoparquet file • 0:00:00
Finished operation in 0:00:06
files/monaco_nofilter_noclip_compact.geoparquet
```

@@ -285,41 +289,45 @@ way/998561139 {'barrier': 'bollard', 'bicyc... LINESTRING (12.45828 41.9...
```console
$ quackosm --geom-filter-geocode "Shibuya, Tokyo"
100%|██████████████████████████████████████| 45.7M/45.7M [00:00<00:00, 259GB/s]
⠹ [ 1/33] Reading nodes • 0:00:03
⠇ [ 2/33] Filtering nodes - intersection • 0:00:01
⠇ [ 3/33] Filtering nodes - tags • 0:00:01
⠙ [ 4/33] Calculating distinct filtered nodes ids • 0:00:00
⠋ [ 5/33] Reading ways • 0:00:04
⠧ [ 6/33] Unnesting ways • 0:00:02
⠹ [ 7/33] Filtering ways - valid refs • 0:00:03
⠴ [ 8/33] Filtering ways - intersection • 0:00:02
⠼ [ 9/33] Filtering ways - tags • 0:00:00
⠋ [ 10/33] Calculating distinct filtered ways ids • 0:00:00
⠦ [ 11/33] Reading relations • 0:00:00
⠴ [ 12/33] Unnesting relations • 0:00:00
⠼ [ 13/33] Filtering relations - valid refs • 0:00:00
⠸ [ 14/33] Filtering relations - intersection • 0:00:00
⠙ [ 15/33] Filtering relations - tags • 0:00:00
⠋ [ 16/33] Calculating distinct filtered relations ids • 0:00:00
⠹ [ 17/33] Loading required ways - by relations • 0:00:00
⠋ [ 18/33] Calculating distinct required ways ids • 0:00:00
⠇ [ 19/33] Saving filtered nodes with geometries • 0:00:02
⠋ [ 20/33] Saving required nodes with structs • 0:00:05
⠧ [ 21/33] Grouping filtered ways • 0:00:03
[ 22/33] Saving filtered ways with linestrings 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 • 0:00:00 < 0:00:00
⠴ [ 23/33] Grouping required ways • 0:00:02
[ 24/33] Saving required ways with linestrings 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 • 0:00:00 < 0:00:00
⠹ [ 25/33] Saving filtered ways with geometries • 0:00:01
⠦ [ 26/33] Saving valid relations parts • 0:00:00
⠙ [27.1/33] Saving relations inner parts - valid geometries • 0:00:00
⠙ [27.2/33] Saving relations inner parts - invalid geometries • 0:00:00
⠙ [28.1/33] Saving relations outer parts - valid geometries • 0:00:00
⠙ [28.2/33] Saving relations outer parts - invalid geometries • 0:00:00
⠋ [ 29/33] Saving relations outer parts with holes • 0:00:00
⠙ [ 30/33] Saving relations outer parts without holes • 0:00:00
⠙ [ 31/33] Saving filtered relations with geometries • 0:00:00
⠼ [32.1/33] Saving valid features • 0:00:00
⠋ [ 33/33] Saving final geoparquet file • 0:00:00
⠦ [ 1/32] Reading nodes • 0:00:05
⠋ [ 2/32] Filtering nodes - intersection • 0:00:00
⠦ [ 3/32] Filtering nodes - tags • 0:00:00
⠋ [ 4/32] Calculating distinct filtered nodes ids • 0:00:00
⠏ [ 5/32] Reading ways • 0:00:04
⠦ [ 6/32] Unnesting ways • 0:00:02
⠴ [ 7/32] Filtering ways - valid refs • 0:00:02
⠙ [ 8/32] Filtering ways - intersection • 0:00:01
⠹ [ 9/32] Filtering ways - tags • 0:00:00
⠋ [ 10/32] Calculating distinct filtered ways ids • 0:00:00
⠴ [ 11/32] Reading relations • 0:00:00
⠴ [ 12/32] Unnesting relations • 0:00:00
⠹ [ 13/32] Filtering relations - valid refs • 0:00:00
⠙ [ 14/32] Filtering relations - intersection • 0:00:00
⠙ [ 15/32] Filtering relations - tags • 0:00:00
⠋ [ 16/32] Calculating distinct filtered relations ids • 0:00:00
⠙ [ 17/32] Loading required ways - by relations • 0:00:00
⠋ [ 18/32] Calculating distinct required ways ids • 0:00:00
⠋ [ 19/32] Saving filtered nodes with geometries • 0:00:00
⠋ [20.1/32] Grouping filtered ways - assigning groups • 0:00:00
⠴ [20.2/32] Grouping filtered ways - joining with nodes • 0:00:03
⠹ [20.3/32] Grouping filtered ways - partitioning by group • 0:00:00
[ 21/32] Saving filtered ways with linestrings 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 • 0:00:01 < 0:00:00
⠋ [22.1/32] Grouping required ways - assigning groups • 0:00:00
⠴ [22.2/32] Grouping required ways - joining with nodes • 0:00:03
⠋ [22.3/32] Grouping required ways - partitioning by group • 0:00:00
[ 23/32] Saving required ways with linestrings 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 • 0:00:00 < 0:00:00
⠋ [ 24/32] Saving filtered ways with geometries • 0:00:01
⠸ [ 25/32] Saving valid relations parts • 0:00:00
⠋ [26.1/32] Saving relations inner parts - valid geometries • 0:00:00
⠋ [26.2/32] Saving relations inner parts - invalid geometries • 0:00:00
⠙ [27.1/32] Saving relations outer parts - valid geometries • 0:00:00
⠋ [27.2/32] Saving relations outer parts - invalid geometries • 0:00:00
⠋ [ 28/32] Saving relations outer parts with holes • 0:00:00
⠋ [ 29/32] Saving relations outer parts without holes • 0:00:00
⠙ [ 30/32] Saving filtered relations with geometries • 0:00:00
⠸ [31.1/32] Saving valid features • 0:00:00
⠋ [ 32/32] Saving final geoparquet file • 0:00:00
Finished operation in 0:00:39
files/9ae2b160eb7556991148f5a2693aaf4b38bbb225c3700a6bfe9e5e54f48b987e_nofilter_compact.geoparquet
```

@@ -397,11 +405,13 @@ When the `keep_all_tags` parameter is passed while filtering by OSM tags, and ad
General schema of multiple segments that are concatenated together:
`pbf_file_name`\_(`osm_filter_tags_hash_part`/`nofilter`)(\_`alltags`)\_(`clipping_geometry_hash_part`/`noclip`)\_(`compact`/`exploded`)(\_`filter_osm_ids_hash_part`).geoparquet

> If the WKT mode is turned on, then the result file will be saved with a `parquet` extension and a `_wkt` suffix.
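
The naming schema above, including the WKT variant, can be sketched as a small helper. This is only an illustration: `build_result_file_name` and its parameters are hypothetical names, not part of the QuackOSM API, and the hash arguments stand in for whatever hash parts the library actually computes.

```python
from typing import Optional


def build_result_file_name(
    pbf_file_name: str,
    osm_filter_tags_hash_part: Optional[str] = None,
    keep_all_tags: bool = False,
    clipping_geometry_hash_part: Optional[str] = None,
    explode_tags: bool = False,
    filter_osm_ids_hash_part: Optional[str] = None,
    save_as_wkt: bool = False,
) -> str:
    """Hypothetical helper mirroring the file naming schema described above."""
    segments = [pbf_file_name, osm_filter_tags_hash_part or "nofilter"]
    if keep_all_tags:
        segments.append("alltags")
    segments.append(clipping_geometry_hash_part or "noclip")
    segments.append("exploded" if explode_tags else "compact")
    if filter_osm_ids_hash_part:
        segments.append(filter_osm_ids_hash_part)
    if save_as_wkt:
        # WKT results get a `_wkt` suffix and a plain `parquet` extension.
        segments.append("wkt")
    extension = "parquet" if save_as_wkt else "geoparquet"
    return "_".join(segments) + "." + extension


print(build_result_file_name("monaco"))
print(build_result_file_name("andorra", save_as_wkt=True))
```

With default arguments the helper reproduces the `monaco_nofilter_noclip_compact.geoparquet` name shown in the CLI example, and with `save_as_wkt=True` it yields the `_wkt.parquet` form used in the notebook.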

### Memory usage

DuckDB queries requiring `JOIN`, `GROUP` and `ORDER BY` operations are very memory intensive. Because of that, some steps are divided into chunks (groups) with a set number of rows per chunk.
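
As a rough illustration of the chunking idea, the sketch below assigns consecutive rows to groups of a fixed size. It is a plain-Python assumption about how such grouping could work, not QuackOSM's actual DuckDB queries; `assign_groups` is a hypothetical name.

```python
from collections import defaultdict


def assign_groups(row_ids: list[int], rows_per_group: int) -> dict[int, list[int]]:
    """Split rows into consecutive groups of at most `rows_per_group` rows."""
    if rows_per_group <= 0:
        raise ValueError("rows_per_group must be positive")
    groups: dict[int, list[int]] = defaultdict(list)
    for position, row_id in enumerate(row_ids):
        # Integer division maps each row position to its group number.
        groups[position // rows_per_group].append(row_id)
    return dict(groups)


for group_id, rows in assign_groups(list(range(10)), rows_per_group=4).items():
    print(group_id, rows)
```

Each group can then be processed on its own, so peak memory depends on the group size and not on the total number of rows.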

QuackOSM has been roughly tuned to different workloads. The `rows_per_bucket` variable is set based on an available memory in the system:
QuackOSM has been roughly tuned to different workloads. The `rows_per_group` variable is set based on an available memory in the system:

| Memory | Rows per group |
| ---------: | -------------: |
35 changes: 35 additions & 0 deletions examples/command_line_interface.ipynb
@@ -452,6 +452,41 @@
" \"files/andorra_8a275d4edddd035eb6a5d8120a8b42a320b25cf93577335600faba8c2d69d85a_noclip_compact.geoparquet\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## WKT mode\n",
"\n",
"By default, QuackOSM saves parsed files in the `GeoParquet` format with the geometry in the `WKB` format.\n",
"\n",
"There is also an option to save the file as a `Parquet` file with the geometry in the `WKT` format using `--wkt-result` (or `--wkt`) parameter."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "shellscript"
}
},
"outputs": [],
"source": [
"! QuackOSM andorra.osm.pbf --wkt-result"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"duckdb.read_parquet(\n",
" \"files/andorra_nofilter_noclip_compact_wkt.parquet\"\n",
")"
]
}
],
"metadata": {
10 changes: 10 additions & 0 deletions quackosm/_rich_progress.py
@@ -2,6 +2,8 @@
"""Wrapper over Rich progress bar."""

from collections.abc import Iterable
from contextlib import suppress
from datetime import timedelta

__all__ = ["TaskProgressSpinner", "TaskProgressBar"]

@@ -17,6 +19,14 @@ def log_message(message: str) -> None:
print(message)


def show_total_elapsed_time(elapsed_seconds: float) -> None:
    with suppress(ImportError):  # pragma: no cover
        from rich import print as rprint

        elapsed_time_formatted = str(timedelta(seconds=int(elapsed_seconds)))
        rprint(f"Finished operation in [progress.elapsed]{elapsed_time_formatted}")


class TaskProgressSpinner:
def __init__(self, step_name: str, step_number: str, silent_mode: bool):
self.step_name = step_name
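
The elapsed time formatting used by `show_total_elapsed_time` can be tried in isolation with the standard library only. In this minimal sketch, `format_elapsed_time` is a hypothetical helper name and the `rich` markup is left out.

```python
from datetime import timedelta


def format_elapsed_time(elapsed_seconds: float) -> str:
    """Format seconds as H:MM:SS, dropping the fractional part."""
    return str(timedelta(seconds=int(elapsed_seconds)))


print(format_elapsed_time(6.7))     # 0:00:06
print(format_elapsed_time(39.2))    # 0:00:39
print(format_elapsed_time(3725.9))  # 1:02:05
```

Truncating to whole seconds with `int()` keeps the summary in the same `0:00:06` style as the per-step timers in the README output.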
11 changes: 11 additions & 0 deletions quackosm/cli.py
@@ -500,6 +500,15 @@ def main(
            callback=_filter_osm_ids_callback,
        ),
    ] = None,
    wkt_result: Annotated[
        bool,
        typer.Option(
            "--wkt-result/",
            "--wkt/",
            help="Whether to save the geometry as a WKT string instead of WKB blob.",
            show_default=False,
        ),
    ] = False,
    silent_mode: Annotated[
        bool,
        typer.Option(
@@ -583,6 +592,7 @@ def main(
else None
),
filter_osm_ids=filter_osm_ids, # type: ignore
save_as_wkt=wkt_result,
silent_mode=silent_mode,
)
else:
@@ -601,6 +611,7 @@ def main(
else None
),
filter_osm_ids=filter_osm_ids, # type: ignore
save_as_wkt=wkt_result,
silent_mode=silent_mode,
)
typer.secho(geoparquet_path, fg="green")
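
To make the difference behind the `--wkt-result` option concrete, the sketch below encodes one 2D point both ways using only the standard library. It illustrates the two encodings themselves and is not how QuackOSM serializes geometries; the function names and coordinates are chosen for the example.

```python
import struct


def point_to_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB: byte order, geometry type, X, Y."""
    # 1 = little-endian byte order marker, 1 = geometry type "Point".
    return struct.pack("<BIdd", 1, 1, x, y)


def point_to_wkt(x: float, y: float) -> str:
    """Encode the same point as a human-readable WKT string."""
    return f"POINT ({x} {y})"


wkb = point_to_wkb(7.43214, 43.7481)
print(len(wkb), wkb.hex())
print(point_to_wkt(7.43214, 43.7481))
```

The WKB form is a compact binary blob, while the WKT form is plain text that any Parquet reader can display without a geometry library.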
18 changes: 14 additions & 4 deletions quackosm/functions.py
@@ -28,6 +28,7 @@ def convert_pbf_to_gpq(
filter_osm_ids: Optional[list[str]] = None,
working_directory: Union[str, Path] = "files",
osm_way_polygon_features_config: Optional[Union[OsmWayPolygonConfig, dict[str, Any]]] = None,
save_as_wkt: bool = False,
silent_mode: bool = False,
) -> Path:
"""
@@ -69,7 +70,10 @@ def convert_pbf_to_gpq(
Config used to determine which closed way features are polygons.
Modifications to this config left are left for experienced OSM users.
Defaults to predefined "osm_way_polygon_features.json".
silent_mode (bool): Disable progress bars.
save_as_wkt (bool): Whether to save the file with geometry in the WKT form instead of WKB.
If `True`, it will be saved as a `.parquet` file, because it won't be in the GeoParquet
standard. Defaults to `False`.
silent_mode (bool): Disable progress bars. Defaults to `False`.
Returns:
Path: Path to the generated GeoParquet file.
@@ -229,6 +233,7 @@ def convert_pbf_to_gpq(
explode_tags=explode_tags,
ignore_cache=ignore_cache,
filter_osm_ids=filter_osm_ids,
save_as_wkt=save_as_wkt,
)


@@ -243,6 +248,7 @@ def convert_geometry_to_gpq(
filter_osm_ids: Optional[list[str]] = None,
working_directory: Union[str, Path] = "files",
osm_way_polygon_features_config: Optional[Union[OsmWayPolygonConfig, dict[str, Any]]] = None,
save_as_wkt: bool = False,
silent_mode: bool = False,
) -> Path:
"""
@@ -287,7 +293,10 @@ def convert_geometry_to_gpq(
Config used to determine which closed way features are polygons.
Modifications to this config left are left for experienced OSM users.
Defaults to predefined "osm_way_polygon_features.json".
silent_mode (bool): Disable progress bars.
save_as_wkt (bool): Whether to save the file with geometry in the WKT form instead of WKB.
If `True`, it will be saved as a `.parquet` file, because it won't be in the GeoParquet
standard. Defaults to `False`.
silent_mode (bool): Disable progress bars. Defaults to `False`.
Returns:
Path: Path to the generated GeoParquet file.
@@ -399,6 +408,7 @@ def convert_geometry_to_gpq(
explode_tags=explode_tags,
ignore_cache=ignore_cache,
filter_osm_ids=filter_osm_ids,
save_as_wkt=save_as_wkt,
)


@@ -454,7 +464,7 @@ def get_features_gdf(
Config used to determine which closed way features are polygons.
Modifications to this config left are left for experienced OSM users.
Defaults to predefined "osm_way_polygon_features.json".
silent_mode (bool): Disable progress bars.
silent_mode (bool): Disable progress bars. Defaults to `False`.
Returns:
gpd.GeoDataFrame: GeoDataFrame with OSM features.
@@ -634,7 +644,7 @@ def get_features_gdf_from_geometry(
Config used to determine which closed way features are polygons.
Modifications to this config left are left for experienced OSM users.
Defaults to predefined "osm_way_polygon_features.json".
silent_mode (bool): Disable progress bars.
silent_mode (bool): Disable progress bars. Defaults to `False`.
Returns:
gpd.GeoDataFrame: GeoDataFrame with OSM features.