# Wide format 

**OvertureMaestro** implements logic for transforming downloaded data into a `wide` format. This format is intended for geospatial machine learning, where selected datasets are pivoted by category into a columnar format.

This notebook explores what this format is and how to work with it.

## New functions

The new module contains the same set of functions as the basic API, with `wide_form` added to each name:

* `convert_geometry_to_parquet` → <code>convert_geometry_to_<strong>wide_form</strong>_parquet</code>
* `convert_geometry_to_geodataframe` → <code>convert_geometry_to_<strong>wide_form</strong>_geodataframe</code>
* other functions ...

Additionally, there are dedicated functions for downloading all available datasets at once:

* `convert_geometry_to_wide_form_parquet_for_all_types`
* `convert_geometry_to_wide_form_geodataframe_for_all_types`
* `convert_bounding_box_to_wide_form_parquet_for_all_types`
* `convert_bounding_box_to_wide_form_geodataframe_for_all_types`

You can import them from the `overturemaestro.advanced_functions` module.

In [None]:
from overturemaestro import convert_geometry_to_geodataframe, geocode_to_geometry
from overturemaestro.advanced_functions import convert_geometry_to_wide_form_geodataframe

## What is the wide format?

In this section we will compare how the original data format differs from the wide format based on water data.

Let's start by looking at the official Overture Maps schema for the base water data type:

In [None]:
import requests
import yaml

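response = requests.get(
    "https://raw.githubusercontent.com/OvertureMaps/schema/refs/tags/v1.4.0/schema/base/water.yaml",
    allow_redirects=True,
    timeout=30,  # avoid hanging indefinitely if the server does not respond
)
response.raise_for_status()  # fail early if the schema file could not be fetched
water_schema = yaml.safe_load(response.content.decode("utf-8"))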
water_schema

The specification defines two required fields, **`subtype`** and **`class`**, and lists the possible values for each.

Both fields describe what the feature is. Together with the theme and type, they form a path:

`theme` (base) → `type` (water) → `subtype` (e.g. reservoir) → `class` (e.g. basin).

Based on this hierarchy, all available values can be determined and mapped to columns.

The result is data in a **wide** format, where boolean flags mark which category each feature belongs to.
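
Before running the real download, the pivot can be pictured with a toy example. This is a minimal sketch in plain Python, not the library's implementation: the feature values are made up, and joining the hierarchy levels with `|` is an assumed naming convention used only for illustration (`column_name` is a hypothetical helper).

```python
# Toy features: each one carries a subtype and a class, as in the water schema.
# The values are illustrative only.
features = [
    {"id": "a", "subtype": "reservoir", "class": "basin"},
    {"id": "b", "subtype": "river", "class": "stream"},
    {"id": "c", "subtype": "reservoir", "class": "basin"},
]


def column_name(theme: str, type_: str, feature: dict) -> str:
    """Hypothetical naming convention: join the hierarchy levels with '|'."""
    return "|".join([theme, type_, feature["subtype"], feature["class"]])


# Collect every category path that occurs; sort for a stable column order.
columns = sorted({column_name("base", "water", f) for f in features})

# One row per feature, one boolean flag per category path.
wide_rows = [
    {"id": f["id"], **{col: col == column_name("base", "water", f) for col in columns}}
    for f in features
]

for row in wide_rows:
    print(row)
```

Each row has exactly one `True` flag here, because a water feature has a single `subtype` and `class`.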

In [None]:
amsterdam = geocode_to_geometry("Amsterdam")

original_data = convert_geometry_to_geodataframe("base", "water", amsterdam)
wide_data = convert_geometry_to_wide_form_geodataframe("base", "water", amsterdam)

In [None]:
original_data

In [None]:
wide_data

Using this format, we can quickly filter the data or calculate the number of features per category.

In [None]:
wide_data.drop(columns="geometry").sum().sort_values(ascending=False)
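
Filtering works the same way as counting, since every category is a boolean column. The sketch below uses a small hand-made pandas frame as a stand-in for `wide_data`; the column names and values are assumptions for illustration, not actual output of the library.

```python
import pandas as pd

# Toy stand-in for a wide-format table; column names are illustrative only.
toy_wide = pd.DataFrame(
    {
        "base|water|lake|lake": [True, False, False, True],
        "base|water|river|river": [False, True, False, False],
        "base|water|canal|canal": [False, False, True, False],
    },
    index=["f1", "f2", "f3", "f4"],
)

# Filter: a boolean column can be used directly as a row mask.
lakes = toy_wide[toy_wide["base|water|lake|lake"]]

# Count: summing boolean columns gives the number of features per category.
counts = toy_wide.sum().sort_values(ascending=False)

print(list(lakes.index))
print(counts.to_dict())
```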

Each theme/type pair has a defined list of hierarchy columns that is used to generate the final list of columns.

Most datasets use two columns (`subtype` and `class`), with three exceptions:
- `base|land_cover` → `subtype` only
- `transportation|segment` → `subtype`, `class` and **`subclass`**
- `places|place` → `categories.primary` and `categories.alternate` (this one is described in detail [below](#places))

In [None]:
from overturemaestro.advanced_functions.wide_form import THEME_TYPE_CLASSIFICATION

for (theme_value, type_value), definition in sorted(THEME_TYPE_CLASSIFICATION.items()):
    print(theme_value, type_value, definition.hierachy_columns)

## Multiple data types

You can also download data for multiple theme/type pairs at once, or even for all of them.

If some datasets were already downloaded during previous executions, only the missing data is downloaded.

Here we will look at the top 10 most common features for both examples.

In [None]:
from overturemaestro.advanced_functions import (
    convert_geometry_to_wide_form_geodataframe_for_all_types,
    convert_geometry_to_wide_form_geodataframe_for_multiple_types,
)

two_datasets_gdf = convert_geometry_to_wide_form_geodataframe_for_multiple_types(
    [("base", "water"), ("base", "land_cover")], amsterdam
)
two_datasets_gdf.drop(columns="geometry").sum().sort_values(ascending=False).head(10)

In [None]:
len(two_datasets_gdf.columns)

In [None]:
all_datasets_gdf = convert_geometry_to_wide_form_geodataframe_for_all_types(amsterdam)
all_datasets_gdf.drop(columns="geometry").sum().sort_values(ascending=False).head(10)

In [None]:
len(all_datasets_gdf.columns)

## Limiting hierarchy depth

If you only need a higher-level aggregation of the data, you can limit the hierarchy depth.

By default, the full hierarchy is used to generate the columns.

<div class="admonition note">
    <p class="admonition-title">Note</p>
    <p>
        If you pass a value that is too high, it will be automatically capped to the highest possible value for a given theme/type pair.
    </p>
</div>
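
The effect of limiting and capping the depth can be pictured with a small, self-contained sketch. It assumes that `hierarchy_depth` counts the category levels below the theme/type pair and that column names are joined with `|`; `wide_column` is a hypothetical helper written for this illustration, not part of OvertureMaestro.

```python
from typing import Optional


def wide_column(
    theme: str, type_: str, levels: list, hierarchy_depth: Optional[int] = None
) -> str:
    """Build a column name from the first `hierarchy_depth` category levels."""
    max_depth = len(levels)
    if hierarchy_depth is None:
        depth = max_depth  # default: use the full hierarchy
    else:
        # Clamp the requested depth to the range this hierarchy supports.
        depth = max(0, min(hierarchy_depth, max_depth))
    return "|".join([theme, type_, *levels[:depth]])


levels = ["reservoir", "basin"]  # subtype, class

full = wide_column("base", "water", levels)
depth_1 = wide_column("base", "water", levels, hierarchy_depth=1)
depth_0 = wide_column("base", "water", levels, hierarchy_depth=0)
too_deep = wide_column("base", "water", levels, hierarchy_depth=99)

print(full, depth_1, depth_0, too_deep, sep="\n")
```

Features that share the same truncated name end up in the same column, which is what produces the coarser aggregation.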

In [None]:
limited_depth_water_gdf = convert_geometry_to_wide_form_geodataframe(
    "base", "water", amsterdam, hierarchy_depth=1
)
limited_depth_water_gdf.drop(columns="geometry").sum()

In [None]:
limited_depth_all_gdf = convert_geometry_to_wide_form_geodataframe_for_all_types(
    amsterdam, hierarchy_depth=0
)
limited_depth_all_gdf.drop(columns="geometry").sum()

## Places

Places data has a different schema than the other datasets: it is the only one where a feature can have several categories at once, a `primary` category and any number of optional `alternate` categories.

This structure is preserved in the `wide` format, so it is the only dataset where a single feature can have multiple `True` values across its columns.

Using a `hierarchy_depth` of 1 keeps only the `primary` category of each feature.
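
As a rough illustration of this multi-flag behaviour, the sketch below turns one toy `categories` value into boolean flags. The category names are made up, and `place_flags` with its `primary_only` switch is a hypothetical helper that only mimics the difference between the full hierarchy and a depth of 1.

```python
def place_flags(categories: dict, all_columns: list, primary_only: bool = False) -> dict:
    """Return one boolean flag per column for a single place (toy helper)."""
    active = {categories["primary"]}
    if not primary_only:
        # `alternate` is optional and may be missing or None.
        active.update(categories.get("alternate") or [])
    return {col: col in active for col in all_columns}


# Toy values, for illustration only.
place = {"primary": "cafe", "alternate": ["coffee_shop", "bakery"]}
all_columns = ["bakery", "cafe", "coffee_shop", "museum"]

full_flags = place_flags(place, all_columns)
primary_flags = place_flags(place, all_columns, primary_only=True)

print(full_flags)
print(primary_flags)
```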

There are two pyarrow filters applied automatically when downloading the data for the `wide` format: `confidence` value >= 0.75 and `categories` cannot be empty.

In [None]:
import pyarrow.compute as pc

category_not_null_filter = pc.invert(pc.field("categories").is_null())
minimal_confidence_filter = pc.field("confidence") >= pc.scalar(0.75)
combined_filter = category_not_null_filter & minimal_confidence_filter

original_places_data = convert_geometry_to_geodataframe(
    "places",
    "place",
    amsterdam,
    pyarrow_filter=combined_filter,
    columns_to_download=["id", "geometry", "categories", "confidence"],
)
original_places_data

In [None]:
first_index = (
    # Find first object with more than one alternate category
    original_places_data[original_places_data.categories.str.get("alternate").str.len() > 1]
    .iloc[0]
    .name
)

first_index, original_places_data.loc[first_index].categories

In [None]:
wide_form_places_data = convert_geometry_to_wide_form_geodataframe("places", "place", amsterdam)
wide_form_places_data

As you can see, only the columns matching the values in the `categories` column are `True`, and the rest are `False`.

In [None]:
wide_form_places_data.loc[first_index].drop("geometry").sort_values(ascending=False)

After limiting the `hierarchy_depth`, only one value is `True`: the `primary` category.

In [None]:
limited_depth_wide_form_places_data = convert_geometry_to_wide_form_geodataframe(
    "places", "place", amsterdam, hierarchy_depth=1
)
limited_depth_wide_form_places_data.loc[first_index].drop("geometry").sort_values(ascending=False)

Below you can see the difference in the counts of `True` values across all columns.

In [None]:
wide_form_places_data.drop(columns="geometry").sum().sort_values(ascending=False)

In [None]:
limited_depth_wide_form_places_data.drop(columns="geometry").sum().sort_values(ascending=False)

## Pruning final list of columns

By default, `OvertureMaestro` includes all possible columns regardless of whether any features of a given category exist.

This is done to keep the overall schema consistent across different geographical regions and to simplify the feature engineering process.

However, there is a dedicated parameter `include_all_possible_columns` that can be set to `False` to keep only columns based on actually existing features.

In [None]:
convert_geometry_to_wide_form_geodataframe(
    "base", "infrastructure", amsterdam, include_all_possible_columns=True  # default value
)

In [None]:
convert_geometry_to_wide_form_geodataframe(
    "base", "infrastructure", amsterdam, include_all_possible_columns=False
)
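
If you already hold a wide-format frame in memory, a similar effect can be approximated after the fact with plain pandas. This is a sketch on a hand-made frame with illustrative column names; it does not describe how OvertureMaestro prunes columns internally.

```python
import pandas as pd

# Toy stand-in for a wide-format table; column names are illustrative only.
toy_wide = pd.DataFrame(
    {
        "base|infrastructure|bridge|bridge": [True, False, True],
        "base|infrastructure|airport|runway": [False, False, False],
        "base|infrastructure|pier|pier": [False, True, False],
    }
)

# Keep only the columns that contain at least one True value.
pruned = toy_wide.loc[:, toy_wide.any(axis=0)]

# Going the other way: align a pruned frame to a known full column list,
# filling the categories that do not occur with False.
restored = pruned.reindex(columns=toy_wide.columns, fill_value=False)

print(list(pruned.columns))
print(list(restored.columns))
```

Reindexing to a shared column list is one way to line up pruned frames from different regions before combining them.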

## Getting a full list of possible column names

You can also preview the final list of columns before downloading the data by using the `get_all_possible_column_names` function.

You can specify the `release`, `theme` and `type`, as well as `hierarchy_depth`.

In [None]:
from overturemaestro.advanced_functions.wide_form import get_all_possible_column_names

get_all_possible_column_names(theme="base", type="water")

With all parameters left empty, the function returns the full list of possible columns at maximum depth.

In [None]:
columns = get_all_possible_column_names()
len(columns)

In [None]:
columns[:10]

You can also specify different `hierarchy_depth` values.

In [None]:
get_all_possible_column_names(theme="transportation", type="segment", hierarchy_depth=1)