# Dive into GeoParquet using Python (GeoPandas+PyArrow)

This notebook explores GeoParquet files using GeoPandas and PyArrow, but there are many alternatives, even within Python (DuckDB, geoarrow-rs, Apache Sedona, ...).

## Example: the Overture building data for Delft

Overture is a collaborative open-data initiative focused on open map data (https://overturemaps.org/). It provides datasets on buildings, streets, addresses, boundaries, land use, etc., and makes its data available as GeoParquet files.

https://explore.overturemaps.org/#16.08/52.007122/4.366463

https://overturemaps.org/blog/2025/why-we-chose-geoparquet-breaking-down-data-silos-at-overture-maps/

As an example, let's explore the building data (~250 GB dataset). Reading the data for an area around the center of Delft:

In [None]:
import geopandas

In [None]:
%%time
gdf = geopandas.read_parquet(
    "s3://overturemaps-us-west-2/release/2025-10-22.0/theme=buildings/type=building/",
    bbox=(4.29, 51.98, 4.38, 52.03)
)

In [None]:
gdf[["id", "geometry", "subtype"]]

The subset of data is also provided in the workshop material (downloaded using the [overturemaps CLI](https://github.com/OvertureMaps/overturemaps-py) with `overturemaps download -o delft.parquet -f geoparquet -t building --bbox=4.29,51.98,4.38,52.03`):

```python
gdf = geopandas.read_parquet("delft.parquet")
```

Visualize the downloaded data quickly (we can also use the built-in `gdf.plot()`, but lonboard gives an easy interactive map that works for larger data):

In [None]:
import lonboard

In [None]:
lonboard.viz(gdf)

## What does this GeoParquet dataset look like?

Schema of the Parquet dataset (notice the `bbox` struct column):

In [None]:
import pyarrow.dataset as ds

In [None]:
dataset = ds.dataset("s3://overturemaps-us-west-2/release/2025-10-22.0/theme=buildings/type=building/")

In [None]:
dataset.schema

The partitioned dataset consists of 236 files:

In [None]:
len(dataset.files)

Size of the dataset:

In [None]:
import pyarrow.fs

In [None]:
fs, path = pyarrow.fs.FileSystem.from_uri("s3://overturemaps-us-west-2/release/2025-10-22.0/theme=buildings/type=building/")

In [None]:
files = fs.get_file_info(pyarrow.fs.FileSelector(path, recursive=True))

In [None]:
# Total size in GB; skip directory entries, whose size is None
sum(f.size for f in files if f.is_file) / 1000**3

Number of records:

In [None]:
dataset.count_rows()

A GeoParquet file is a regular Parquet file; what makes it "geo" is some custom metadata stored under the `geo` key of the file metadata:

In [None]:
import pyarrow.parquet as pq

In [None]:
file_metadata = pq.read_metadata(dataset.files[0], filesystem=dataset.filesystem)
file_metadata

In [None]:
file_metadata.metadata

In [None]:
import json

In [None]:
json.loads(file_metadata.metadata[b"geo"])
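To make the structure of that metadata concrete, here is a minimal sketch that parses a hand-written `geo` value. The values are illustrative (they are not the actual Overture output), the keys follow the GeoParquet 1.1 specification, and `summarize_geo_metadata` is a hypothetical helper, not part of any library.

```python
import json

# Hand-written example of the bytes stored under the b"geo" metadata key.
# The values are made up for illustration; the keys follow GeoParquet 1.1.
raw_geo = json.dumps({
    "version": "1.1.0",
    "primary_column": "geometry",
    "columns": {
        "geometry": {
            "encoding": "WKB",
            "geometry_types": ["Polygon", "MultiPolygon"],
            "bbox": [4.29, 51.98, 4.38, 52.03],
            # Points at the struct column holding per-row bounding boxes
            "covering": {
                "bbox": {
                    "xmin": ["bbox", "xmin"],
                    "ymin": ["bbox", "ymin"],
                    "xmax": ["bbox", "xmax"],
                    "ymax": ["bbox", "ymax"],
                }
            },
        }
    },
}).encode("utf-8")


def summarize_geo_metadata(raw: bytes) -> dict:
    """Hypothetical helper: pull the most relevant fields out of the metadata."""
    geo = json.loads(raw)
    primary = geo["primary_column"]
    column = geo["columns"][primary]
    return {
        "version": geo["version"],
        "primary_column": primary,
        "encoding": column["encoding"],
        "bbox": column.get("bbox"),  # optional in the spec
        "has_covering_bbox": "bbox" in column.get("covering", {}),
    }


print(summarize_geo_metadata(raw_geo))
```

With a real file, the same helper could be applied to `file_metadata.metadata[b"geo"]`.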

## Examples of other tools

- QGIS:
  - GeoParquet Downloader plug-in: https://medium.com/radiant-earth-insights/a-deep-dive-into-geoparquet-downloader-qgis-plug-in-017c0b1facb1
  - Reading Parquet files directly works if QGIS is installed with a GDAL build that includes Parquet support
- DuckDB: https://docs.overturemaps.org/getting-data/duckdb/
- GDAL CLI:

In [None]:
!gdal vector info --format=text /vsis3/overturemaps-us-west-2/release/2025-10-22.0/theme=buildings/type=building/ 

## Spatial index versus spatial partitioning

> From https://cloudnativegeo.org/blog/2024/12/interview-with-kyle-barron-on-geoarrow-and-geoparquet-and-the-future-of-geospatial-data-analysis/

Spatial **partitioning** in GeoParquet breaks data into multiple files and "chunks" based on a spatial attribute like a country or quadkey, allowing systems to skip entire files and chunks for faster querying.  

Spatial **indexing**, on the other hand, uses a data structure (like an R-tree) within a file to efficiently locate individual features based on their position, which can improve performance but can make the file itself larger and doesn't scale as well to larger data. 

<img src="img/geoparquet_layout.png" style="width: 600px;"/>

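Note! For GeoParquet 1.x files, filtering at the chunk (row group) level requires the `bbox` column, since it is the min/max statistics of that column that tell a reader which chunks can be skipped.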

By adjusting the chunk size, the writer can choose the tradeoff between index size and indexing efficiency.
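The chunk-skipping idea can be sketched in a few lines of plain Python. This is a simplified illustration, not how any particular reader implements it: the row group bounds and the query box below are made-up values, and `BBox`, `intersects` and `row_groups_to_read` are hypothetical names.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def intersects(a: BBox, b: BBox) -> bool:
    """True if the two boxes overlap or touch."""
    return (
        a.xmin <= b.xmax and a.xmax >= b.xmin
        and a.ymin <= b.ymax and a.ymax >= b.ymin
    )


def row_groups_to_read(bounds: list, query: BBox) -> list:
    """Indices of the row groups whose bounds overlap the query box."""
    return [i for i, b in enumerate(bounds) if intersects(b, query)]


# Made-up bounds, as a reader would derive them from the min/max
# statistics of the bbox column for each row group.
row_group_bounds = [
    BBox(4.29, 51.98, 4.33, 52.00),
    BBox(4.33, 51.98, 4.38, 52.00),
    BBox(4.29, 52.00, 4.33, 52.03),
    BBox(4.33, 52.00, 4.38, 52.03),
]

query = BBox(4.3310, 52.0010, 4.3315, 52.0015)
selected = row_groups_to_read(row_group_bounds, query)
print(selected)  # only these row groups need to be fetched and decoded
```

Rows inside a selected row group still have to be filtered individually, which is why smaller chunks skip more data at the cost of more statistics to store and check.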

Let's illustrate this difference by comparing some queries on the Delft buildings subset, stored as GeoParquet versus FlatGeobuf:

In [None]:
gdf = geopandas.read_parquet("delft.parquet")

In [None]:
gdf.to_file("delft.fgb")

Querying a very small subset based on position:

In [None]:
len(geopandas.read_file("delft.fgb", bbox=(4.3310, 52.0010, 4.3315, 52.0015)))

In [None]:
%timeit geopandas.read_file("delft.fgb", bbox=(4.3310, 52.0010, 4.3315, 52.0015))

In [None]:
%timeit geopandas.read_file("delft.parquet", bbox=(4.3310, 52.0010, 4.3315, 52.0015))

In contrast, reading a single column:

In [None]:
%timeit geopandas.read_file("delft.fgb", columns=["height"], read_geometry=False, use_arrow=True)

In [None]:
%timeit geopandas.read_file("delft.parquet", columns=["height"], read_geometry=False, use_arrow=True)

### Exploring the spatial partitioning of the Overture buildings dataset

First, let's visualize the file-level spatial partitions.

Overture provides an overview Parquet file with information about all files:

In [None]:
import pandas as pd

In [None]:
collections = pd.read_parquet("https://stac.overturemaps.org/2025-10-22.0/collections.parquet")
collections[collections["collection"] == "building"]

More generally, we can get the bounding box of each partition from the file metadata:

In [None]:
import json

import pyarrow.dataset as ds
import shapely

import lonboard

In [None]:
dataset = ds.dataset("s3://overturemaps-us-west-2/release/2025-10-22.0/theme=buildings/type=building/")

In [None]:
file_bounds = [
    shapely.box(*json.loads(fragment.metadata.metadata[b"geo"])["columns"]["geometry"]["bbox"])
    for fragment in dataset.get_fragments()
]
file_bounds = geopandas.GeoSeries(file_bounds, crs="EPSG:4326")

In [None]:
file_bounds

In [None]:
lonboard.viz(file_bounds)

Next, we can visualize the row-group-based spatial partitions:

In [None]:
row_group_bounds = []
for fragment in dataset.get_fragments():
    meta = fragment.metadata
    field_xmin = meta.schema.names.index("xmin")
    field_ymin = meta.schema.names.index("ymin")
    field_xmax = meta.schema.names.index("xmax")
    field_ymax = meta.schema.names.index("ymax")
    for i in range(meta.num_row_groups):
        row_group = meta.row_group(i)
        xmin = row_group.column(field_xmin).statistics.min
        ymin = row_group.column(field_ymin).statistics.min
        xmax = row_group.column(field_xmax).statistics.max
        ymax = row_group.column(field_ymax).statistics.max
        row_group_bounds.append(shapely.box(xmin, ymin, xmax, ymax))

row_group_bounds = geopandas.GeoSeries(row_group_bounds, crs="EPSG:4326")

In [None]:
row_group_bounds

In [None]:
lonboard.viz(row_group_bounds)

## Get started with providing GeoParquet datasets

Best Practices for Distributing GeoParquet: https://github.com/opengeospatial/geoparquet/blob/main/format-specs/distributing-geoparquet.md
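Row group bounding boxes are only selective when the rows in each group are close together in space, so it helps to order rows spatially before writing. The following is a minimal sketch of that effect using a Z-order (Morton) key on a toy integer grid; `morton_key` and `chunk_bbox_area` are hypothetical helpers, and real coordinates would first have to be scaled to integer grid cells.

```python
import random


def morton_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into a single Z-order sort key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


def chunk_bbox_area(points: list, chunk_size: int) -> int:
    """Sum of the bounding box areas of consecutive chunks ("row groups")."""
    total = 0
    for start in range(0, len(points), chunk_size):
        chunk = points[start:start + chunk_size]
        xs = [p[0] for p in chunk]
        ys = [p[1] for p in chunk]
        total += (max(xs) - min(xs)) * (max(ys) - min(ys))
    return total


# Toy data: one point per cell of a 16x16 grid, in random order
points = [(x, y) for x in range(16) for y in range(16)]
random.Random(0).shuffle(points)

shuffled_area = chunk_bbox_area(points, chunk_size=16)
sorted_points = sorted(points, key=lambda p: morton_key(*p))
sorted_area = chunk_bbox_area(sorted_points, chunk_size=16)

print("unsorted:", shuffled_area, "spatially sorted:", sorted_area)
```

After sorting, each chunk of 16 points covers a compact 4x4 block of the grid, so a small query box overlaps far fewer chunks. In GeoPandas, `GeoSeries.hilbert_distance()` can provide a comparable sort key based on a Hilbert curve.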

## [extra] Parquet native Geometry/Geography logical types

Example data (https://geoarrow.org/data.html): https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0/natural-earth/files/natural-earth_cities.parquet

In [None]:
import pyarrow.parquet as pq

In [None]:
f = pq.ParquetFile("../Downloads/natural-earth_cities.parquet")

In [None]:
f.metadata.schema

In [None]:
# Requires the geoarrow-pyarrow package; importing it registers the GeoArrow extension types with PyArrow
import geoarrow.pyarrow

In [None]:
pq.read_table("../Downloads/natural-earth_cities.parquet").schema

In [None]:
f = pq.ParquetFile("../Downloads/example-crs_vermont-utm.parquet")

In [None]:
f.metadata.schema