# PBF File Reader

`PbfFileReader` can quickly parse a full OSM extract in the form of an `*.osm.pbf` file.

It uses `DuckDB` with the `spatial` extension to convert `pbf` files into `geoparquet` files without a GDAL dependency.

The reader can filter objects by geometry and by OSM tags, with the option to split tags into columns or keep them in a single dictionary.

A caching strategy is implemented to reduce computations, but it can be overridden using the `ignore_cache` parameter.
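
The idea behind a result cache with an override flag can be sketched in plain Python. This is a generic illustration only: `convert_with_cache` and `fake_convert` are hypothetical names, and QuackOSM's actual caching logic (for example, how it names cached files) may differ.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def convert_with_cache(
    source: Path,
    result: Path,
    convert: Callable[[Path, Path], None],
    ignore_cache: bool = False,
) -> Path:
    """Run `convert` only if `result` is missing or the cache is ignored."""
    if ignore_cache or not result.exists():
        convert(source, result)
    return result


calls: List[str] = []  # records every real conversion


def fake_convert(source: Path, result: Path) -> None:
    # Hypothetical stand-in for an expensive PBF -> GeoParquet conversion.
    calls.append(source.name)
    result.write_text("converted " + source.name)


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "example.osm.pbf"
    src.write_bytes(b"")
    out = Path(tmp) / "example.parquet"

    convert_with_cache(src, out, fake_convert)  # result missing: converts
    convert_with_cache(src, out, fake_convert)  # result exists: reused
    convert_with_cache(src, out, fake_convert, ignore_cache=True)  # forced

print(len(calls))  # number of real conversions performed
```

Only the first and the forced call do any work; the middle call returns the existing file.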

## Download all buildings in Reykjavík, Iceland

Filtering the data by geometry and by tags, with tags in exploded form.

In [None]:
from quackosm import PbfFileReader
import urllib.request
import osmnx as ox

In [None]:
iceland_pbf_url = "https://download.geofabrik.de/europe/iceland-latest.osm.pbf"
iceland_pbf_file = "iceland.osm.pbf"
urllib.request.urlretrieve(iceland_pbf_url, iceland_pbf_file)

In [None]:
reykjavik_gdf = ox.geocode_to_gdf("Reykjavík, IS")
reykjavik_gdf

To select buildings, we will use the filter format also used in the `osmnx` library: a dictionary whose keys are tag keys and whose values can be a bool, a string, or a list of strings.

By default, `QuackOSM` will return only the tags that are present in the passed filter.

In this example we will select all the buildings using the `{ "building": True }` filter, so only the `building` tag values will be present in the result.
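
To make the filter format concrete, here is a minimal sketch of how such a dictionary could be matched against a feature's tags. It is not QuackOSM's implementation: `matches_tags_filter` is a hypothetical helper, it assumes the filter keys are combined with OR (as in `osmnx`), and it only handles `True`, a string, or a list of strings as values.

```python
from typing import Dict, List, Union

TagsFilter = Dict[str, Union[bool, str, List[str]]]


def matches_tags_filter(tags: Dict[str, str], tags_filter: TagsFilter) -> bool:
    """Return True if at least one filter entry matches the feature's tags."""
    for key, expected in tags_filter.items():
        if key not in tags:
            continue
        value = tags[key]
        if expected is True:
            # True accepts any value of this tag key.
            return True
        if isinstance(expected, str) and value == expected:
            return True
        if isinstance(expected, list) and value in expected:
            return True
    # No entry matched (a False value is not handled in this sketch).
    return False


# Made-up sample features.
house = {"building": "house", "addr:city": "Reykjavík"}
road = {"highway": "primary"}

print(matches_tags_filter(house, {"building": True}))
print(matches_tags_filter(road, {"building": True}))
print(matches_tags_filter(road, {"highway": ["motorway", "primary"]}))
```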

In [None]:
reader = PbfFileReader(
    geometry_filter=reykjavik_gdf.geometry.iloc[0], tags_filter={"building": True}
)

reykjavik_buildings_gpq = reader.convert_pbf_to_parquet("iceland.osm.pbf")
reykjavik_buildings_gpq

### Read those features using DuckDB

In [None]:
import duckdb

connection = duckdb.connect()

connection.load_extension("parquet")
connection.load_extension("spatial")

features_relation = connection.read_parquet(str(reykjavik_buildings_gpq)).project(
    "* REPLACE (ST_GeomFromWKB(geometry) AS geometry)"
)
features_relation

### Count all buildings

In [None]:
features_relation.count("feature_id")

### Keeping all the tags while filtering the data

To keep all of the tags present in the source data, we can use the `keep_all_tags` parameter. That way the reader still returns only buildings, but with all of their tags attached.

By default, all of those tags will be kept in a single column as a `dict`.
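
The difference between the compact form (one `tags` dictionary per feature) and the exploded form (one column per tag key) can be illustrated without the library. The sample rows and the `explode_tags` helper below are made up for illustration and are not part of QuackOSM.

```python
from typing import Dict, List, Optional

# Compact form: every feature carries all of its tags in one dictionary.
features = [
    {"feature_id": "way/1", "tags": {"building": "house", "name": "Example"}},
    {"feature_id": "way/2", "tags": {"building": "yes"}},
]


def explode_tags(
    rows: List[dict], keys: List[str]
) -> List[Dict[str, Optional[str]]]:
    """Turn the `tags` dict of each row into one column per requested key."""
    exploded = []
    for row in rows:
        new_row: Dict[str, Optional[str]] = {"feature_id": row["feature_id"]}
        for key in keys:
            # A missing tag becomes None, like an empty cell in a table.
            new_row[key] = row["tags"].get(key)
        exploded.append(new_row)
    return exploded


for row in explode_tags(features, ["building", "name"]):
    print(row)
```

The exploded form is convenient when only a few known keys matter, while the compact form avoids a very wide and mostly empty table when every tag is kept.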

In [None]:
reader.convert_pbf_to_geodataframe("iceland.osm.pbf", keep_all_tags=True)

## Download main roads for Estonia
Filtering the data only by tags, with tags in exploded form

In [None]:
highways_filter = {
    "highway": [
        "motorway",
        "trunk",
        "primary",
        "secondary",
        "tertiary",
    ]
}

In [None]:
estonia_pbf_url = "https://download.geofabrik.de/europe/estonia-latest.osm.pbf"
estonia_pbf_file = "estonia.osm.pbf"
urllib.request.urlretrieve(estonia_pbf_url, estonia_pbf_file)

reader = PbfFileReader(geometry_filter=None, tags_filter=highways_filter)
estonia_features_gpq = reader.convert_pbf_to_parquet(estonia_pbf_file)
estonia_features_gpq

In [None]:
features_relation = connection.read_parquet(str(estonia_features_gpq)).project(
    "* REPLACE (ST_GeomFromWKB(geometry) AS geometry)"
)
features_relation

### Count loaded roads

In [None]:
features_relation.count("feature_id")

### Calculate total road length
We will transform the geometries to the Estonian CRS, [EPSG:3301](https://epsg.io/3301), so that lengths are measured in meters.

In [None]:
length_in_meters = (
    features_relation.project(
        "ST_Length(ST_Transform(geometry, 'EPSG:4326', 'EPSG:3301')) AS road_length"
    )
    .sum("road_length")
    .fetchone()[0]
)
length_in_km = length_in_meters / 1000
length_in_km
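
As a rough cross-check of lengths measured in a projected CRS, the length of a line given in longitude/latitude can also be approximated with the haversine formula. This is a different technique from the `ST_Transform` approach above: it treats the Earth as a sphere, so small deviations from the projected result are expected. The helper names and the sample coordinates are hypothetical.

```python
import math
from typing import Sequence, Tuple

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; spherical approximation

LonLat = Tuple[float, float]


def haversine_m(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance in meters between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def linestring_length_m(coords: Sequence[LonLat]) -> float:
    """Sum the great-circle lengths of consecutive segments."""
    return sum(haversine_m(*a, *b) for a, b in zip(coords, coords[1:]))


# Made-up line running one degree north along the 25°E meridian.
line = [(25.0, 58.0), (25.0, 58.5), (25.0, 59.0)]
print(linestring_length_m(line) / 1000)  # length in kilometers
```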

### Plot the roads using GeoPandas

The geoparquet file can be loaded quickly using the `geoarrow.pyarrow` library.

In [None]:
import geoarrow.pyarrow as ga
from geoarrow.pyarrow import io

from quackosm._constants import GEOMETRY_COLUMN

parquet_table = io.read_geoparquet_table(estonia_features_gpq)
ga.to_geopandas(parquet_table.column(GEOMETRY_COLUMN)).plot()

## Download all data for Liechtenstein
Without filtering, with tags in a compact form

In [None]:
liechtenstein_pbf_url = "https://download.geofabrik.de/europe/liechtenstein-latest.osm.pbf"
liechtenstein_pbf_file = "liechtenstein.osm.pbf"
urllib.request.urlretrieve(liechtenstein_pbf_url, liechtenstein_pbf_file)

# Here explode_tags is set to False explicitly,
# but it would be set automatically when not filtering the data
reader = PbfFileReader(geometry_filter=None, tags_filter=None)
liechtenstein_features_gpq = reader.convert_pbf_to_parquet(
    liechtenstein_pbf_file, explode_tags=False
)
liechtenstein_features_gpq

In [None]:
features_relation = connection.read_parquet(str(liechtenstein_features_gpq)).project(
    "* REPLACE (ST_GeomFromWKB(geometry) AS geometry)"
)
features_relation

### Return data as GeoDataFrame

`PbfFileReader` can also return the data as a GeoDataFrame.

Here the caching strategy is used: the file won't be transformed again.

In [None]:
features_gdf = reader.convert_pbf_to_geodataframe(liechtenstein_pbf_file)
features_gdf

### Plot the forests using GeoPandas

Keep only the polygon features tagged `landuse`=`forest`.

In [None]:
features_gdf[
    features_gdf.geom_type.isin(("Polygon", "MultiPolygon"))
    & features_gdf.tags.apply(lambda x: "landuse" in x and x["landuse"] == "forest")
].plot(color="green")