# Stationbench Tutorial

This tutorial demonstrates how to use the stationbench repository to:
1. Preprocess weather forecast and ground truth data
2. Calculate verification metrics
3. Compare multiple forecasts and visualize results

This tutorial runs in a notebook environment. The same commands can be run in a terminal or script.

## Setup

First, complete the [setup guide](setup.md), then import the required packages.

In [1]:
import xarray as xr
import pandas as pd
import numpy as np
import os
from datetime import datetime

## 1. Data Preprocessing

Stationbench expects forecast data and ground truth observations in Zarr format. Let's look at both datasets.

### Data format

The forecast data should be a Zarr dataset with the following structure:

```
<xarray.Dataset>
Dimensions:
  - time: Forecast initialization times
  - prediction_timedelta: Forecast lead times
  - latitude: Grid latitudes
  - longitude: Grid longitudes

Coordinates:
  - latitude: (latitude) float32, grid latitudes in degrees North
  - longitude: (longitude) float32, grid longitudes in degrees East  
  - prediction_timedelta: (prediction_timedelta) timedelta64[ns], forecast lead times
  - time: (time) datetime64[ns], initialization times

Data variables:
  - 10m_wind_speed: (time, prediction_timedelta, latitude, longitude) float32
  - 2m_temperature: (time, prediction_timedelta, latitude, longitude) float32
```
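
Each forecast value applies to a valid time: the initialization time plus the lead time. The log output later in this tutorial shows a `valid_time` coordinate with dimensions `(init_time, lead_time)`. A minimal NumPy sketch of how such a coordinate can be derived follows; it illustrates the idea and is not necessarily how stationbench computes it.

```python
import numpy as np

# Two daily initialization times and three hourly lead times (illustrative values)
init_times = np.array(["2023-01-01", "2023-01-02"], dtype="datetime64[ns]")
lead_times = np.arange(3) * np.timedelta64(1, "h")

# Broadcasting gives a 2-D array indexed by (init_time, lead_time)
valid_time = init_times[:, None] + lead_times[None, :]
print(valid_time.shape)  # (2, 3)
```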

The wind speed and temperature data should be in m/s and °C respectively.
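
If your forecast source stores temperature in Kelvin or wind as u/v components (an assumption about your data, not something this tutorial states), the fields need converting before you write the Zarr store. A minimal sketch, using hypothetical helper names:

```python
import numpy as np

def kelvin_to_celsius(temp_k):
    """Convert temperature from Kelvin to degrees Celsius."""
    return np.asarray(temp_k, dtype=float) - 273.15

def wind_speed_from_components(u, v):
    """Wind speed in m/s from eastward (u) and northward (v) components in m/s."""
    return np.hypot(u, v)

temp_c = kelvin_to_celsius([273.15, 293.15])
speed = wind_speed_from_components(np.array([3.0, 0.0]), np.array([4.0, 5.0]))
```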

For this tutorial, let's create a simple forecast dataset.


In [2]:
lats = np.linspace(36, 72, 74)
lons = np.linspace(-15, 45, 124)
times = pd.date_range("2023-01-01", "2023-01-31", freq="D")
prediction_timedeltas = pd.timedelta_range(start='0h', end='23h', freq='1h')

# Generate reasonable temperature range in Celsius (roughly -10°C to 30°C)
temp_data = np.random.uniform(
    low=-10, 
    high=30,
    size=(len(times), len(prediction_timedeltas), len(lats), len(lons))
)

# Generate reasonable wind speeds in m/s (roughly 0-20 m/s)
wind_data = np.random.uniform(
    low=0,
    high=20,
    size=(len(times), len(prediction_timedeltas), len(lats), len(lons))
)

forecast = xr.Dataset(
    data_vars=dict(
        temperature=(["time", "prediction_timedelta", "latitude", "longitude"], temp_data),
        wind_speed=(["time", "prediction_timedelta", "latitude", "longitude"], wind_data),
    ),
    coords=dict(
        latitude=("latitude", lats),
        longitude=("longitude", lons),
        prediction_timedelta=("prediction_timedelta", prediction_timedeltas),
        time=("time", times)
    )
)

forecast.to_zarr("data/forecast.zarr", mode="w")
forecast

Let's also have a look at the METEOSTAT ground truth data.

In [3]:
ground_truth_path = '/mnt/jua-shared-1/jua-bronze-layer/meteostat_benchmark.zarr'
ground_truth = xr.open_zarr(ground_truth_path)
ground_truth

Each data variable is stored as a Dask array with the following layout:

| | Array | Chunk |
|---|---|---|
| Bytes | 3.31 GiB | 33.42 MiB |
| Shape | (61368, 14491) | (4380, 2000) |
| Dask graph | 120 chunks in 2 graph layers | 120 chunks in 2 graph layers |
| Data type | float32 numpy.ndarray | float32 numpy.ndarray |


Note that, unlike the forecast, the ground truth is not gridded: it is unstructured point data, with one series per station. Stationbench automatically aligns the gridded forecast to the station locations using linear interpolation.
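
To make the grid-to-station alignment concrete, here is a minimal NumPy sketch of bilinear interpolation from a regular latitude/longitude grid to point locations. The function name and the station coordinates are illustrative assumptions. Stationbench's own implementation may differ, and xarray's `Dataset.interp` offers comparable functionality.

```python
import numpy as np

def interp_to_stations(field, grid_lats, grid_lons, station_lats, station_lons):
    """Bilinearly interpolate a 2-D (lat, lon) field to station points.

    Assumes ascending 1-D grid coordinates and stations inside the grid.
    """
    station_lats = np.asarray(station_lats, dtype=float)
    station_lons = np.asarray(station_lons, dtype=float)
    # Index of the lower corner of the grid cell that contains each station
    i = np.clip(np.searchsorted(grid_lats, station_lats) - 1, 0, len(grid_lats) - 2)
    j = np.clip(np.searchsorted(grid_lons, station_lons) - 1, 0, len(grid_lons) - 2)
    # Fractional position of each station inside its grid cell
    wy = (station_lats - grid_lats[i]) / (grid_lats[i + 1] - grid_lats[i])
    wx = (station_lons - grid_lons[j]) / (grid_lons[j + 1] - grid_lons[j])
    return (
        field[i, j] * (1 - wy) * (1 - wx)
        + field[i + 1, j] * wy * (1 - wx)
        + field[i, j + 1] * (1 - wy) * wx
        + field[i + 1, j + 1] * wy * wx
    )

grid_lats = np.linspace(36, 72, 74)
grid_lons = np.linspace(-15, 45, 124)
# A field that is linear in latitude and longitude, so bilinear interpolation is exact
field = grid_lats[:, None] + 2 * grid_lons[None, :]
values = interp_to_stations(field, grid_lats, grid_lons, [52.52, 48.86], [13.40, 2.35])
```

Because the test field is linear in both coordinates, the interpolated values equal `lat + 2 * lon` at each station, which gives a quick sanity check.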

## 2. Calculate Verification Metrics

Now we'll calculate RMSE between the forecast and ground truth data.
For this we need to set the following parameters:
- `--forecast_loc`: Location of the forecast data (required)
- `--ground_truth_loc`: Location of the ground truth data (required)
- `--start_date`: Start date for benchmarking (required)
- `--end_date`: End date for benchmarking (required)
- `--output`: Output path for benchmarks (required)
- `--region`: Region to benchmark (see `regions.py` for available regions)
- `--name_10m_wind_speed`: Name of 10m wind speed variable (optional)
- `--name_2m_temperature`: Name of 2m temperature variable (optional)

In [4]:
forecast_loc = "data/forecast.zarr"
ground_truth_loc = ground_truth_path
start_date = "2023-03-18"
end_date = "2023-07-31"
output = "data/tutorial_benchmark.zarr"
region = "europe"
name_10m_wind_speed = "wind_speed"
name_2m_temperature = "temperature"

command = f"poetry run python ../stationbench/calculate_metrics.py --forecast_loc {forecast_loc} --ground_truth_loc {ground_truth_loc} --start_date {start_date} --end_date {end_date} --output {output} --region {region} --name_10m_wind_speed {name_10m_wind_speed} --name_2m_temperature {name_2m_temperature}"

print("Running benchmark command: ", command)
os.system(command)

Running benchmark command:  poetry run python ../stationbench/calculate_metrics.py --forecast_loc data/forecast.zarr --ground_truth_loc /mnt/jua-shared-1/jua-bronze-layer/meteostat_benchmark.zarr --start_date 2023-03-18 --end_date 2023-07-31 --output data/tutorial_benchmark.zarr --region europe --name_10m_wind_speed wind_speed --name_2m_temperature temperature


2025-01-14 09:04:36,250 - root - INFO - Dask dashboard http://127.0.0.1:8787/status
2025-01-14 09:04:36,250 - __main__ - INFO - preprocessing dataset data/forecast.zarr
2025-01-14 09:04:36,525 - __main__ - INFO - creating valid time...
2025-01-14 09:04:36,531 - __main__ - INFO - Selecting region: https://linestrings.com/bbox/#-15,36,45,72
2025-01-14 09:04:36,531 - __main__ - INFO - Finished processing of data/forecast.zarr: <xarray.Dataset> Size: 2kB
Dimensions:         (latitude: 74, longitude: 124, lead_time: 24, init_time: 0)
Coordinates:
  * latitude        (latitude) float64 592B 36.0 36.49 36.99 ... 71.51 72.0
  * longitude       (longitude) float64 992B -15.0 -14.51 -14.02 ... 44.51 45.0
  * lead_time       (lead_time) timedelta64[ns] 192B 00:00:00 ... 23:00:00
  * init_time       (init_time) datetime64[ns] 0B 
    valid_time      (init_time, lead_time) datetime64[ns] 0B 
Data variables:
    2m_temperature  (init_time, lead_time, latitude, longitude) float64 0B dask.array<chunks

256
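
The `256` printed after the log is the return value of `os.system`. On Unix this is an encoded wait status rather than the exit code itself; 256 decodes to exit code 1, which conventionally signals that the command failed. The sketch below decodes the status and shows an alternative that uses `subprocess.run` with an argument list, so that failures and error output are easier to inspect. `run_command` is a hypothetical helper, and the command it runs here is only a stand-in.

```python
import os
import subprocess
import sys

# os.system returns an encoded wait status on Unix; decode it (Python 3.9+)
exit_code = os.waitstatus_to_exitcode(256)
print(exit_code)  # 1

def run_command(args):
    """Run a command given as a list of arguments and return the CompletedProcess."""
    result = subprocess.run(args, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"Command failed with exit code {result.returncode}", file=sys.stderr)
        print(result.stderr, file=sys.stderr)
    return result

# Stand-in command so the sketch runs anywhere; replace with the stationbench arguments
result = run_command([sys.executable, "-c", "print('ok')"])
```

Passing arguments as a list also avoids shell quoting issues with paths that contain spaces.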

## 3. Compare Multiple Forecasts

To compare the forecast against one or more reference forecasts, set the following parameters:
- `--evaluation_benchmarks_loc`: Path to the evaluation benchmarks (required)
- `--reference_benchmark_locs`: Dictionary of reference benchmark locations; the first entry is used for the skill score (required)
- `--run_name`: W&B run name (required)
- `--regions`: Comma-separated list of regions, see `regions.py` for available regions (required)
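
Because `--reference_benchmark_locs` takes a dictionary, it has to reach the script as a single quoted string. The tutorial does not show how stationbench parses that string, so the round trip below is a sketch under the assumption that it is read as a Python literal. It shows one way to quote the value safely and recover the dictionary.

```python
import ast
import shlex

reference_benchmark_locs = {"reference_model": "data/reference_benchmark.zarr"}

# shlex.quote makes the dictionary's string form safe to pass as one shell argument
quoted = shlex.quote(str(reference_benchmark_locs))
# shlex.split mimics how a POSIX shell would hand the argument to the script
(received,) = shlex.split(quoted)
# ast.literal_eval safely turns the string back into a dictionary
parsed = ast.literal_eval(received)
first_name = next(iter(parsed))  # dictionaries keep insertion order
```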

Let's create a reference forecast dataset and also calculate the metrics for this dataset.

In [5]:
lats = np.linspace(36, 72, 74)
lons = np.linspace(-15, 45, 124)
times = pd.date_range("2023-01-01", "2023-01-31", freq="D")
prediction_timedeltas = pd.timedelta_range(start='0h', end='23h', freq='1h')

# Generate reasonable temperature range in Celsius (roughly -10°C to 30°C)
temp_data = np.random.uniform(
    low=-10, 
    high=30,
    size=(len(times), len(prediction_timedeltas), len(lats), len(lons))
)

# Generate reasonable wind speeds in m/s (roughly 0-20 m/s)
wind_data = np.random.uniform(
    low=0,
    high=20,
    size=(len(times), len(prediction_timedeltas), len(lats), len(lons))
)

reference_forecast = xr.Dataset(
    data_vars=dict(
        temperature=(["time", "prediction_timedelta", "latitude", "longitude"], temp_data),
        wind_speed=(["time", "prediction_timedelta", "latitude", "longitude"], wind_data),
    ),
    coords=dict(
        latitude=("latitude", lats),
        longitude=("longitude", lons),
        prediction_timedelta=("prediction_timedelta", prediction_timedeltas),
        time=("time", times)
    )
)

reference_forecast.to_zarr("data/reference_forecast.zarr", mode="w")

forecast_loc = "data/reference_forecast.zarr"
ground_truth_loc = ground_truth_path
start_date = "2023-03-18"
end_date = "2023-07-31"
output = "data/reference_benchmark.zarr"
region = "europe"
name_10m_wind_speed = "wind_speed"
name_2m_temperature = "temperature"

command = f"poetry run python ../stationbench/calculate_metrics.py --forecast_loc {forecast_loc} --ground_truth_loc {ground_truth_loc} --start_date {start_date} --end_date {end_date} --output {output} --region {region} --name_10m_wind_speed {name_10m_wind_speed} --name_2m_temperature {name_2m_temperature}"

print("Running benchmark command: ", command)
os.system(command)

Running benchmark command:  poetry run python ../stationbench/calculate_metrics.py --forecast_loc data/reference_forecast.zarr --ground_truth_loc /mnt/jua-shared-1/jua-bronze-layer/meteostat_benchmark.zarr --start_date 2023-03-18 --end_date 2023-07-31 --output data/reference_benchmark.zarr --region europe --name_10m_wind_speed wind_speed --name_2m_temperature temperature


2025-01-14 09:04:41,514 - root - INFO - Dask dashboard http://127.0.0.1:8787/status
2025-01-14 09:04:41,514 - __main__ - INFO - preprocessing dataset data/reference_forecast.zarr
2025-01-14 09:04:41,807 - __main__ - INFO - creating valid time...
2025-01-14 09:04:41,815 - __main__ - INFO - Selecting region: https://linestrings.com/bbox/#-15,36,45,72
2025-01-14 09:04:41,815 - __main__ - INFO - Finished processing of data/reference_forecast.zarr: <xarray.Dataset> Size: 2kB
Dimensions:         (latitude: 74, longitude: 124, lead_time: 24, init_time: 0)
Coordinates:
  * latitude        (latitude) float64 592B 36.0 36.49 36.99 ... 71.51 72.0
  * longitude       (longitude) float64 992B -15.0 -14.51 -14.02 ... 44.51 45.0
  * lead_time       (lead_time) timedelta64[ns] 192B 00:00:00 ... 23:00:00
  * init_time       (init_time) datetime64[ns] 0B 
    valid_time      (init_time, lead_time) datetime64[ns] 0B 
Data variables:
    2m_temperature  (init_time, lead_time, latitude, longitude) float64 

256
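
Both logs report `init_time: 0`, meaning no initialization times remained after preprocessing. The synthetic forecasts cover January 2023, while `start_date` and `end_date` span 18 March to 31 July 2023, so the likely cause is that the benchmark window does not overlap the forecast period. A minimal sketch for checking this before launching a run follows. `count_inits_in_window` is a hypothetical helper, and treating the window as inclusive on both ends is an assumption.

```python
import numpy as np

def count_inits_in_window(init_times, start_date, end_date):
    """Count initialization times that fall inside [start_date, end_date]."""
    init_times = np.asarray(init_times, dtype="datetime64[ns]")
    start = np.datetime64(start_date, "ns")
    end = np.datetime64(end_date, "ns")
    return int(np.count_nonzero((init_times >= start) & (init_times <= end)))

# Daily initialization times for January 2023, as in the synthetic forecasts above
init_times = np.arange("2023-01-01", "2023-02-01", dtype="datetime64[D]")

n_tutorial = count_inits_in_window(init_times, "2023-03-18", "2023-07-31")
n_january = count_inits_in_window(init_times, "2023-01-01", "2023-01-31")
print(n_tutorial, n_january)  # 0 31
```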

Let's compare our forecast against the reference forecast and visualize the results using Weights & Biases.

In [6]:
evaluation_benchmarks_loc = "data/tutorial_benchmark.zarr"
reference_benchmark_locs = {"reference_model": "data/reference_benchmark.zarr"}
# Include today's date in the run name
run_name = f"example-comparison_{datetime.now().strftime('%Y-%m-%d')}"
regions = "europe"

command = f'poetry run python ../stationbench/compare_forecasts.py --evaluation_benchmarks_loc {evaluation_benchmarks_loc} --reference_benchmark_locs "{reference_benchmark_locs}" --run_name {run_name} --regions {regions}'

print("Running comparison command: ", command)
os.system(command)

Running comparison command:  poetry run python ../stationbench/compare_forecasts.py --evaluation_benchmarks_loc data/tutorial_benchmark.zarr --reference_benchmark_locs "{'reference_model': 'data/reference_benchmark.zarr'}" --run_name example-comparison_2025-01-14 --regions europe


2025-01-14 09:04:44,529 - __main__ - INFO - regions: ['europe']
wandb: Currently logged in as: model-engineering-team (jua). Use `wandb login --relogin` to force relogin
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Tracking run with wandb version 0.19.2
wandb: Run data is saved locally in /home/leonie/stationbench/docs/wandb/run-20250114_090445-example-comparison_2025-01-14
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run example-comparison_2025-01-14
wandb: ⭐️ View project at https://wandb.ai/jua/test
wandb: 🚀 View run at https://wandb.ai/jua/test/runs/example-comparison_2025-01-14
2025-01-14 09:04:46,068 - __main__ - INFO - Artifact example-comparison_2025-01-14-temporal_plots not found, will creating new artifact
2025-01-14 09:04:46,278 - __main__ - INFO - Point based benchmarks computed, generating plots and writing to wandb...


wandb: 
wandb: 🚀 View run example-comparison_2025-01-14 at: https://wandb.ai/jua/test/runs/example-comparison_2025-01-14
wandb: Find logs at: wandb/run-20250114_090445-example-comparison_2025-01-14/logs


0

## Understanding the Results

The comparison generates several visualizations:

1. **Geographical scatter plots**:
   - RMSE values at each station location
   - Skill scores comparing against reference forecasts

2. **Time series plots**:
   - RMSE evolution over forecast lead time
   - Skill score evolution over forecast lead time
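
The two quantities plotted above can be sketched in a few lines of NumPy. RMSE is the square root of the mean squared error. The skill score shown here, `1 - RMSE_eval / RMSE_ref`, is a common definition and an assumption on our part; stationbench's exact formula may differ. The function names and array layout are illustrative.

```python
import numpy as np

def rmse_by_lead_time(forecast, observed):
    """RMSE over init times and stations for arrays shaped (init_time, lead_time, station)."""
    squared_error = (forecast - observed) ** 2
    # nanmean skips missing station observations
    return np.sqrt(np.nanmean(squared_error, axis=(0, 2)))

def skill_score(rmse_eval, rmse_ref):
    """Assumed definition: 1 - RMSE_eval / RMSE_ref (positive means better than reference)."""
    return 1.0 - rmse_eval / rmse_ref

rng = np.random.default_rng(0)
observed = rng.normal(size=(10, 24, 5))
evaluated = observed + 1.0  # constant error of 1
reference = observed + 2.0  # constant error of 2

rmse_eval = rmse_by_lead_time(evaluated, observed)
rmse_ref = rmse_by_lead_time(reference, observed)
scores = skill_score(rmse_eval, rmse_ref)
```

With these synthetic inputs the evaluated forecast has half the error of the reference, so the skill score is about 0.5 at every lead time.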

These plots are automatically uploaded to your W&B project where you can:
- Compare different model versions
- Track performance improvements
- Share results with your team

Visit your W&B project page to explore the interactive visualizations!