# Wide format

**OvertureMaestro** implements logic for transforming downloaded data into a `wide` format. This format is designed for geospatial machine learning, where selected datasets are pivoted by category into a columnar layout.

This notebook explores what this format is and how to work with it.

## New functions

The new module contains the same set of functions as the basic API, with `wide_form` added to each name:

* `convert_geometry_to_parquet` -> <code>convert_geometry_to_<strong>wide_form</strong>_parquet</code>
* `convert_geometry_to_geodataframe` -> <code>convert_geometry_to_<strong>wide_form</strong>_geodataframe</code>
* other functions ...

Additionally, there are special functions for downloading all available datasets at once:

* `convert_geometry_to_wide_form_parquet_for_all_types`
* `convert_geometry_to_wide_form_geodataframe_for_all_types`
* `convert_bounding_box_to_wide_form_parquet_for_all_types`
* `convert_bounding_box_to_wide_form_geodataframe_for_all_types`

You can import them from the `overturemaestro.advanced_functions` module.

In [None]:
from overturemaestro import convert_geometry_to_geodataframe, geocode_to_geometry
from overturemaestro.advanced_functions import convert_geometry_to_wide_form_geodataframe

## What is the wide format?

In this section we will compare the original data format with the wide format, using water data as an example.

Let's start by looking at the official Overture Maps schema for the base water data type:

In [None]:
import requests
import yaml

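response = requests.get(
    "https://raw.githubusercontent.com/OvertureMaps/schema/refs/tags/v1.4.0/schema/base/water.yaml",
    allow_redirects=True,
    timeout=30,
)
# Fail early on an HTTP error instead of parsing an error page as YAML
response.raise_for_status()
water_schema = yaml.safe_load(response.content.decode("utf-8"))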
water_schema

Two required fields are defined in the specification: **`subtype`** and **`class`**. The schema also lists the allowed values for each of them.

Both values describe what the object is. Together with the theme and type, they form a path:

`theme` (base) → `type` (water) → `subtype` (e.g. reservoir) → `class` (e.g. basin).

Based on this hierarchy, all available values can be determined and mapped to columns.
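As a rough illustration of how the allowed values could be collected from a schema, here is a minimal sketch that walks a nested dictionary and gathers the `enum` lists stored under `subtype` and `class`. The `find_enums` helper and the `toy_schema` layout are hypothetical and only loosely modelled on a JSON-Schema-style file; the real `water.yaml` may be structured differently, and this is not necessarily how OvertureMaestro does it internally.

```python
def find_enums(node, wanted, found=None):
    """Recursively collect `enum` lists stored under the keys in `wanted`."""
    if found is None:
        found = {}
    if isinstance(node, dict):
        for key, value in node.items():
            if key in wanted and isinstance(value, dict) and "enum" in value:
                found[key] = list(value["enum"])
            else:
                find_enums(value, wanted, found)
    elif isinstance(node, list):
        for item in node:
            find_enums(item, wanted, found)
    return found


# Hypothetical, heavily shortened stand-in for a parsed schema file.
toy_schema = {
    "properties": {
        "properties": {
            "required": ["subtype", "class"],
            "properties": {
                "subtype": {"type": "string", "enum": ["canal", "lake", "reservoir"]},
                "class": {"type": "string", "enum": ["basin", "canal", "lake"]},
            },
        }
    }
}

allowed = find_enums(toy_schema, {"subtype", "class"})
print(allowed)
```

The same helper could be tried on the parsed schema from the cell above, as long as the enum lists sit directly under the `subtype` and `class` keys.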

The result is data in a **wide** format, where boolean flags in each row indicate what the object is.
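The pivoting idea can be sketched in plain Python. The records, the `column_name` helper and the `|` separator used in the column names are all assumptions made for illustration; the actual column naming used by OvertureMaestro may differ.

```python
# Hypothetical objects in the original (long) form: one subtype/class pair each.
records = [
    {"id": "w1", "subtype": "reservoir", "class": "basin"},
    {"id": "w2", "subtype": "canal", "class": "canal"},
    {"id": "w3", "subtype": "reservoir", "class": "reservoir"},
]

THEME, TYPE = "base", "water"


def column_name(record):
    """Build a column name from the theme -> type -> subtype -> class path."""
    return "|".join([THEME, TYPE, record["subtype"], record["class"]])


# Every distinct path becomes one column of the wide table.
columns = sorted({column_name(r) for r in records})

# Each object gets a boolean flag per column; only its own path is True.
wide_rows = {
    r["id"]: {col: col == column_name(r) for col in columns}
    for r in records
}

for object_id, flags in wide_rows.items():
    print(object_id, flags)
```

Each row ends up with a single `True` flag in this toy example, because every object has exactly one `subtype` and `class` pair.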

In [None]:
# porto = geocode_to_geometry("Porto")
amsterdam = geocode_to_geometry("Amsterdam")

original_data = convert_geometry_to_geodataframe("base", "water", amsterdam)
wide_data = convert_geometry_to_wide_form_geodataframe(
    "base", "water", amsterdam, ignore_cache=True
)

In [None]:
original_data

In [None]:
wide_data

Using this format, we can quickly filter the data or calculate the number of objects per category.

In [None]:
wide_data.drop(columns="geometry").sum().sort_values(ascending=False)
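To show both operations without downloading anything, here is a small sketch on a hand-made table. The column names and values are made up for illustration and do not come from a real Overture Maps release.

```python
import pandas as pd

# Toy stand-in for a wide-form table; column names are hypothetical.
wide = pd.DataFrame(
    {
        "water|canal": [True, False, True, False],
        "water|lake": [False, True, False, False],
        "water|river": [False, False, False, True],
    },
    index=["a", "b", "c", "d"],
)

# Filter: a boolean column can be used directly as a row mask.
canals = wide[wide["water|canal"]]

# Count: summing boolean columns gives the number of objects per category.
counts = wide.sum().sort_values(ascending=False)

print(canals.index.tolist())
print(counts)
```

Because the flags are plain booleans, masks can also be combined with `&` and `|` to select objects that belong to several categories at once.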