# Final Project

The project aims to develop a concept for the spatio-temporal harmonisation of ERA5 reanalysis data and to implement it as a Jupyter notebook framework, enabling comparison with DWD station measurements.

## Project Parts

- Conceptual: definition of the data processing workflow, including
    - target temporal resolution and aggregation strategy
    - definition of matched records as the basic unit of comparison
    - spatial and temporal harmonisation between gridded ERA5 data and point-based DWD observations
    - criteria for joining the two data sources (schema for a joined table)
- Coding: implementation of the defined concept in a suitable data processing workflow
    - collection of ERA5 reanalysis data and DWD station observations
    - preparation and harmonisation of both datasets according to the matched-record concept
    - generation of a combined dataset suitable for comparison and basic evaluation
    - basic evaluation via metrics such as bias and RMSE
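
The matched-record idea and the two metrics can be sketched as follows. This is only an illustration under stated assumptions: the values are made up, and the column names `dwd_value` and `era5_value` as well as the join key `(station_id, time)` are hypothetical choices, not a schema fixed by this project. Bias is taken here as the mean of ERA5 minus DWD.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (made-up values, not real measurements)
dwd = pd.DataFrame({
    "station_id": ["02667"] * 4,
    "time": pd.to_datetime([
        "2026-01-19 00:00", "2026-01-19 01:00",
        "2026-01-19 02:00", "2026-01-19 03:00",
    ]),
    "dwd_value": [1.0, 0.5, 0.0, -0.5],
})
era5 = pd.DataFrame({
    "station_id": ["02667"] * 5,
    "time": pd.to_datetime([
        "2026-01-19 00:00", "2026-01-19 01:00",
        "2026-01-19 02:00", "2026-01-19 03:00",
        "2026-01-19 04:00",  # no DWD counterpart, dropped by the inner join
    ]),
    "era5_value": [1.5, 0.5, -0.5, 0.5, 0.0],
})

# A matched record: one row per (station_id, time) present in both sources
matched = dwd.merge(era5, on=["station_id", "time"], how="inner")

# Bias (mean error) and RMSE of ERA5 relative to the station observation
diff = matched["era5_value"] - matched["dwd_value"]
bias = diff.mean()
rmse = np.sqrt((diff ** 2).mean())

print(f"matched records: {len(matched)}, bias: {bias:.3f}, RMSE: {rmse:.3f}")
```

An inner join keeps only time steps available in both sources, so the metrics are never computed against a missing counterpart.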


## Imports and Prerequisites

In [57]:
import io, zipfile, requests
from datetime import date, datetime, timedelta
import pandas as pd

## Access data from the DWD

In [135]:
# dictionary that stores metadata on different DWD variables

DWD_PRODUCTS = {
    "air_temperature": {
        "code": "TU",
        "column_name": "TT_TU"
    },
    "wind": {
        "code": "FF",
        "column_name": "F"
    },
    "precipitation": {
        "code": "RR",
        "column_name": "R1"
    },
}

BASE_CDC = "https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/hourly"

In [136]:
# Generates the CDC URL for the given parameters

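def dwd_zip_url(variable, station_id, mode="recent"):
    if variable not in DWD_PRODUCTS:
        raise KeyError(f"Unknown variable: {variable}")

    # Only "recent" archives have a fixed file name. Historical archives
    # embed a date range in the file name, so their URL cannot be built
    # from the station ID alone.
    if mode != "recent":
        raise ValueError(f"Unsupported mode: {mode!r} (only 'recent' is supported)")

    code = DWD_PRODUCTS[variable]["code"]

    return f"{BASE_CDC}/{variable}/{mode}/stundenwerte_{code}_{station_id}_akt.zip"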
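
The file name in the URL contains the station ID as a zero-padded string (for example `"02667"`), so an integer such as `2667` would produce a URL that does not match any file. A minimal sketch of a normalising helper, where `format_station_id` is a hypothetical name and the five-digit width is an assumption based on the IDs used in this notebook:

```python
def format_station_id(station_id):
    """Return the station ID as a five-digit, zero-padded string.

    Hypothetical helper: accepts either an int (2667) or a string ("02667").
    """
    return str(int(station_id)).zfill(5)


print(format_station_id(2667))     # integer input
print(format_station_id("02667"))  # already padded string input
```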


In [137]:
# Downloads the DWD CDC archive into memory and returns only the requested day as a DataFrame

def download_DWD_data(variable, station_id, date):
    """Return the hourly values of one variable for one station on a single day."""
    # Note: the `date` parameter shadows `datetime.date` inside this function.
    zip_url = dwd_zip_url(variable, station_id, mode="recent")

    r = requests.get(zip_url, timeout=60)
    r.raise_for_status()

    # Read the archive from memory; the measurements are in the produkt_*.txt file
    with zipfile.ZipFile(io.BytesIO(r.content)) as zf:
        produkt = next(n for n in zf.namelist() if n.lower().startswith("produkt_") and n.lower().endswith(".txt"))
        with zf.open(produkt) as f:
            df = pd.read_csv(f, sep=";")

    # Column names in the raw file are padded with whitespace
    df.columns = df.columns.str.strip().str.replace(r"\s+", "", regex=True)

    column = DWD_PRODUCTS[variable]["column_name"]
    df["time"] = pd.to_datetime(df["MESS_DATUM"], format="%Y%m%d%H", errors="coerce")
    target_date = pd.to_datetime(date).date()
    df_day = df[df["time"].dt.date == target_date][["time", column]]

    return df_day.rename(columns={column: variable})
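
Before any comparison with ERA5, missing values in the station data should not enter the metrics as numbers. The sketch below assumes that missing values are encoded as `-999`, which should be checked against the data set description of the respective product; `mask_missing` and `MISSING_VALUE` are hypothetical names, and the sample values are made up.

```python
import pandas as pd

MISSING_VALUE = -999  # assumed missing-value sentinel, verify per product


def mask_missing(df, column, missing_value=MISSING_VALUE):
    """Return a copy of df in which the sentinel in `column` is replaced by NaN."""
    out = df.copy()
    out[column] = out[column].mask(out[column] == missing_value)
    return out


# Made-up sample with one missing hour
sample = pd.DataFrame({
    "time": pd.to_datetime(["2026-01-19 00:00", "2026-01-19 01:00", "2026-01-19 02:00"]),
    "precipitation": [0.0, -999.0, 1.2],
})
cleaned = mask_missing(sample, "precipitation")
print(cleaned)
```

Working on a copy leaves the downloaded frame unchanged, so the raw values remain available for inspection.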

## Testing

In [130]:
download_DWD_data(
    variable = "precipitation",
    station_id = "02667",
    date = date(2026,1,19),   
).head(24)

| | time | precipitation |
|---|---|---|
| 13056 | 2026-01-19 00:00:00 | 0.0 |
| 13057 | 2026-01-19 01:00:00 | 0.0 |
| 13058 | 2026-01-19 02:00:00 | 0.0 |
| 13059 | 2026-01-19 03:00:00 | 0.0 |
| 13060 | 2026-01-19 04:00:00 | 0.0 |
| 13061 | 2026-01-19 05:00:00 | 0.0 |
| 13062 | 2026-01-19 06:00:00 | 0.0 |
| 13063 | 2026-01-19 07:00:00 | 0.0 |
| 13064 | 2026-01-19 08:00:00 | 0.0 |
| 13065 | 2026-01-19 09:00:00 | 0.0 |
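
Hourly values like these could feed the aggregation step named in the project parts. As one possible strategy, and only as an assumption since the target resolution is not fixed here, hourly precipitation can be summed to daily totals. The sample values below are made up.

```python
import pandas as pd

# Made-up hourly precipitation spanning two days
hourly = pd.DataFrame({
    "time": pd.to_datetime([
        "2026-01-19 00:00", "2026-01-19 01:00",
        "2026-01-19 02:00", "2026-01-20 00:00",
    ]),
    "precipitation": [0.0, 0.4, 0.1, 1.0],
})

# Daily sum; min_count=1 yields NaN instead of 0 for days without any valid value
daily = (
    hourly.set_index("time")["precipitation"]
          .resample("D")
          .sum(min_count=1)
)
print(daily)
```

A sum suits accumulated quantities such as precipitation; for air temperature or wind speed a mean would be the more natural aggregate.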
