# **Jupyter notebook for downloading DWD Hourly Precipitation Data**

### Earth System Data Processing - Homework 1
**Author:** Darya Fomicheva  

**Matriculation number:** 7447755

**Date:** 04 December 2025


## Content of this notebook

This notebook demonstrates how to download and process hourly precipitation data for May 2024 from stations in North Rhine-Westphalia, implementing the following steps:

- describe the data source (DWD Open Data) and the hourly station precipitation dataset used in this homework
- navigate the DWD Open Data directory structure and locate the hourly precipitation files
- define the time period of interest (May 2024) and identify the corresponding zip archives
- download the selected zip files from the DWD Open Data server
- unpack the downloaded archives and load the data into Python (pandas)
- perform basic checks to verify that the download has been successful

## Data source: Deutscher Wetterdienst (DWD)

The data used in this homework are obtained from the Deutscher Wetterdienst (DWD), the national meteorological service of Germany. DWD has a legal mandate to make most of its weather and climate information publicly available and fulfils this mandate by operating an official **Open Data server** at `https://opendata.dwd.de`.  

On this server, DWD offers weather and climate data free of charge within its legal mandate. Access is provided directly via HTTPS and does not require registration. Users can download files anonymously under the terms of use specified by DWD (copyright and usage conditions). This Open Data area includes a large range of climate datasets from the DWD Climate Data Center (CDC). These datasets are available for direct download via HTTP and FTP.

For this homework, I selected the DWD dataset with hourly precipitation observations from meteorological stations across Germany available on this Open Data service.

## Dataset overview
The observations in this dataset come from stations operated by DWD as well as from legally and qualitatively equivalent partner networks. For each station, extensive metadata are provided, including information on station relocations, instrument changes, changes in reference times, processing algorithms and operator details.

According to the official dataset description provided by the DWD Climate Data Center, the hourly precipitation dataset has the following main characteristics:

- **Name:** Hourly station observations of precipitation for Germany (version v24.03)  
- **Provider:** DWD Climate Data Center (CDC)  
- **Parameters:** hourly precipitation height and related precipitation variables, including indicators of whether precipitation fell, the kind and form of precipitation, and associated quality and indicator flags  
- **Unit:** millimetres (mm)  
- **Statistical processing:** time series of hourly sums  
- **Temporal coverage:** from 1995-09-01 onwards  
- **Spatial coverage:** observation stations distributed across Germany  
- **Access URL:** `https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/hourly/precipitation/`  

The dataset is organised into a versioned archive (`historical/`) and a daily updated part covering roughly the last 500 days (`recent/`). The `historical/` directory is updated about once per year and its contents remain stable, whereas data in `recent/` are still undergoing quality control and may change as further checks and corrections are applied.

In this notebook, I use only station files from the `historical/` directory.

## Identifying files covering May 2024

In the homework description, the time period 1–31 May 2024 is specified as the target interval for the download. In the DWD hourly precipitation dataset, each station is stored in a separate zip archive in the `historical/` directory. The file names follow the pattern

`stundenwerte_RR_{station_id}_{begin_date}_{end_date}_hist.zip`

where `RR` is the product code for hourly precipitation, `station_id` is the numeric station identifier, and `begin_date` and `end_date` define the time period covered by the file (format `YYYYMMDD`). To obtain all data for May 2024, the station archives whose coverage interval `[begin_date, end_date]` overlaps the target period `[20240501, 20240531]` have to be selected.

The mapping between `station_id` and station metadata such as station name, location and height is provided in the file `RR_Stundenwerte_Beschreibung_Stationen.txt` in the same directory.
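
The difference between an archive that merely overlaps the target period and one that covers it completely can be sketched as follows. This is a minimal illustration, not the code used in the cells below: the helper names `parse_archive_name`, `overlaps` and `contains` are hypothetical, the 5-digit station ID is an assumption based on the example filenames, and the second filename is made up for demonstration.

```python
import re
from datetime import date, datetime

# Filename pattern described above; the 5-digit station ID is an assumption.
ARCHIVE_RE = re.compile(r"stundenwerte_RR_(\d{5})_(\d{8})_(\d{8})_hist\.zip")

TARGET_START = date(2024, 5, 1)
TARGET_END = date(2024, 5, 31)


def parse_archive_name(filename):
    """Return (station_id, begin_date, end_date), or None if the name does not match."""
    m = ARCHIVE_RE.fullmatch(filename)
    if m is None:
        return None
    station_id, begin_str, end_str = m.groups()
    begin = datetime.strptime(begin_str, "%Y%m%d").date()
    end = datetime.strptime(end_str, "%Y%m%d").date()
    return station_id, begin, end


def overlaps(begin, end, start=TARGET_START, stop=TARGET_END):
    """True if [begin, end] shares at least one day with [start, stop]."""
    return begin <= stop and end >= start


def contains(begin, end, start=TARGET_START, stop=TARGET_END):
    """True if [begin, end] covers every day of [start, stop]."""
    return begin <= start and end >= stop


examples = [
    "stundenwerte_RR_00020_20040814_20241231_hist.zip",
    # Hypothetical archive that ends in the middle of May 2024
    "stundenwerte_RR_99999_20100101_20240515_hist.zip",
]
for name in examples:
    station_id, begin, end = parse_archive_name(name)
    print(station_id, overlaps(begin, end), contains(begin, end))
```

The first archive both overlaps and fully contains May 2024. The second one overlaps the month but does not contain it, so only part of the month would be available from that file.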


## Counting the number of archives

In [12]:
import re
import requests
from datetime import datetime

In [13]:
# 1) Basic setup: where to look, and what time period we want

BASE_URL = (
    "https://opendata.dwd.de/climate_environment/CDC/observations_germany/"
    "climate/hourly/precipitation/historical/"
)

# Target period: May 2024 (inclusive)
MAY_START = datetime(2024, 5, 1)
MAY_END   = datetime(2024, 5, 31)

# 2) Download the directory listing (HTML) from the DWD server

resp = requests.get(BASE_URL, timeout=60)  # timeout avoids hanging indefinitely
resp.raise_for_status()
html = resp.text

# 3) Find all ZIP filenames + extract their begin/end dates

zip_pattern = re.compile(
    r'href="(stundenwerte_RR_\d+_(\d{8})_(\d{8})_hist\.zip)"'
)

matches = zip_pattern.findall(html)
print("Total RR zip files in 'historical/':", len(matches))

Total RR zip files in 'historical/': 1453


In [14]:
# 4) Keep only files whose date range overlaps with May 2024


relevant_files = []

for filename, begin_str, end_str in matches:
    begin_date = datetime.strptime(begin_str, "%Y%m%d")
    end_date   = datetime.strptime(end_str, "%Y%m%d")

    # We keep files whose coverage interval [begin_date, end_date]
    # overlaps with the target period [MAY_START, MAY_END]
    overlaps = (begin_date <= MAY_END) and (end_date >= MAY_START)
    if overlaps:
        relevant_files.append(filename)

print("Number of zip files covering May 2024:", len(relevant_files))


# 5) Print a few examples (sanity check)

print("First 5 matching filenames:")
for fn in relevant_files[:5]:
    print(" -", fn)

Number of zip files covering May 2024: 1335
First 5 matching filenames:
 - stundenwerte_RR_00020_20040814_20241231_hist.zip
 - stundenwerte_RR_00029_20060110_20241231_hist.zip
 - stundenwerte_RR_00044_20070401_20241231_hist.zip
 - stundenwerte_RR_00046_20060101_20241231_hist.zip
 - stundenwerte_RR_00053_20051001_20241231_hist.zip


In total, 1453 station archives are available in the `historical/` directory. Of these, 1335 contain data for May 2024. This is a very large number of stations for just one month and one variable, and it means that I would be downloading almost the entire historical hourly precipitation dataset for Germany. Even though the total download volume (about 0.6 GB) is still manageable on a standard laptop, it is inconvenient when the actual interest is only a single month. To keep the dataset manageable, I decided to restrict the selection to stations from one federal state, namely North Rhine-Westphalia (NRW). This still provides an interesting regional dataset, but with fewer station archives and a smaller data volume.


## Selection of NRW stations

In [51]:
metadata_url = (
    "https://opendata.dwd.de/climate_environment/CDC/observations_germany/"
    "climate/hourly/precipitation/historical/RR_Stundenwerte_Beschreibung_Stationen.txt"
)

# 1) Download the metadata file as plain text

response = requests.get(metadata_url, timeout=60)
response.raise_for_status()
text = response.text

lines = text.splitlines()

# Optional: look at the first few lines
for line in lines[:5]:
    print(line)


Stations_id von_datum bis_datum Stationshoehe geoBreite geoLaenge Stationsname Bundesland Abgabe
----------- --------- --------- ------------- --------- --------- ----------------------------------------- ---------- ------
00003 19950901 20110401            202     50.7827    6.0941 Aachen                                   Nordrhein-Westfalen                      Frei

The station metadata is provided as a fixed-width text file instead of a CSV table, which means that extracting specific columns (e.g. Bundesland) involves an additional parsing step.
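
One way to handle this parsing step is to split each line on whitespace and read the trailing columns from the right, so that station names containing spaces do not shift the `Bundesland` column. The following is a minimal sketch under stated assumptions: every data line ends with the two tokens `Bundesland` and `Abgabe` (as in the header shown above), the federal state name contains no spaces, and the function name `parse_station_line` is hypothetical. The second sample line is made up to illustrate a multi-word station name.

```python
from typing import Optional


def parse_station_line(line: str) -> Optional[dict]:
    """Parse one data line of the station list.

    Returns None for the header, the dashed separator and blank lines.
    """
    parts = line.split()
    # A data line has 6 leading fields, a station name of at least one
    # token, and the two trailing fields Bundesland and Abgabe.
    if len(parts) < 9 or not parts[0].isdigit():
        return None
    return {
        "station_id": parts[0],              # keep zero-padded, e.g. "00003"
        "von_datum": parts[1],
        "bis_datum": parts[2],
        "stationshoehe": float(parts[3]),
        "geo_breite": float(parts[4]),
        "geo_laenge": float(parts[5]),
        "stationsname": " ".join(parts[6:-2]),  # may consist of several words
        "bundesland": parts[-2],
        "abgabe": parts[-1],
    }


sample_lines = [
    "Stations_id von_datum bis_datum Stationshoehe geoBreite geoLaenge Stationsname Bundesland Abgabe",
    "00003 19950901 20110401            202     50.7827    6.0941 Aachen                                   Nordrhein-Westfalen                      Frei",
    # Made-up line with a two-word station name
    "99999 20040101 20241231            100     51.0000    8.0000 Bad Beispielstadt                        Nordrhein-Westfalen                      Frei",
]

records = [rec for rec in map(parse_station_line, sample_lines) if rec is not None]
nrw_station_ids = [r["station_id"] for r in records if r["bundesland"] == "Nordrhein-Westfalen"]
print(nrw_station_ids)
```

Reading from the right keeps the logic independent of the column widths, which is convenient because the dashed separator line does not necessarily align with the data lines.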

In [17]:
metadata_url = (
    "https://opendata.dwd.de/climate_environment/CDC/observations_germany/"
    "climate/hourly/precipitation/historical/"
    "RR_Stundenwerte_Beschreibung_Stationen.txt"
)

# 1) Download the metadata file as plain text

response = requests.get(metadata_url, timeout=60)
response.raise_for_status()
text = response.text

lines = text.splitlines()

# 2) Go through all lines and collect station IDs for NRW
#    Rule: if the 8th column == 'Nordrhein-Westfalen',
#          keep the ID from the 1st column.

nrw_ids = []

for line in lines:
    # Skip empty lines
    if not line.strip():
        continue

    # Skip header line that starts with 'Stations_id'
    if line.lstrip().startswith("Stations_id"):
        continue

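    # Split by any whitespace into "columns".
    # Caveat: this assumes a one-word station name. A name containing
    # spaces shifts the columns, so parts[7] is then not the Bundesland
    # and the station is not recognised as NRW.
    parts = line.split()

    # We need at least 8 columns to access parts[7]
    if len(parts) < 8:
        continue

    station_id_str = parts[0]
    bundesland = parts[7]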

    if bundesland == "Nordrhein-Westfalen":
        # Convert to int (optional, can also keep as string)
        try:
            nrw_ids.append(int(station_id_str))
        except ValueError:
            # If parsing fails, just skip this line
            continue

# Remove duplicates and sort
nrw_ids = sorted(set(nrw_ids))

print(f"Number of stations in North Rhine-Westphalia: {len(nrw_ids)}")

# 3) Build a set of IDs as zero-padded 5-digit strings,
#    matching the station ID format used in the archive filenames

nrw_id_set = {f"{sid:05d}" for sid in nrw_ids}


Number of stations in North Rhine-Westphalia: 199


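In total, this parsing step identifies 199 stations located in North Rhine-Westphalia in the metadata file. This is a count of station entries, not of ZIP archives: only some of these stations have an archive that covers May 2024.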


## Downloading the data

In [52]:
from pathlib import Path


In [54]:
# 0) Prepare NRW station ID set in the same format as filenames

nrw_id_set = set(f"{sid:05d}" for sid in nrw_ids) 


# 1) Base URL and local download folder

BASE_URL = (
    "https://opendata.dwd.de/climate_environment/CDC/observations_germany/"
    "climate/hourly/precipitation/historical/"
)

DATA_DIR = Path("data/dwd_rr_hourly")
DATA_DIR.mkdir(parents=True, exist_ok=True)

# 2) Get the directory listing from DWD and find all RR zip files

resp = requests.get(BASE_URL, timeout=60)
resp.raise_for_status()
html = resp.text

# Example filename:
# stundenwerte_RR_00001_19950901_20241231_hist.zip
pattern = re.compile(
    r'href="(stundenwerte_RR_(\d{5})_(\d{8})_(\d{8})_hist\.zip)"'
)

matches = pattern.findall(html)
print("Total RR zip files in 'historical/':", len(matches))


# 3) Filter files: (a) station in NRW, (b) overlap with May 2024

MAY_START = datetime(2024, 5, 1)
MAY_END   = datetime(2024, 5, 31)

nrw_may_files = []

for filename, station_id, begin_str, end_str in matches:
    # keep only NRW stations
    if station_id not in nrw_id_set:
        continue

    begin_date = datetime.strptime(begin_str, "%Y%m%d")
    end_date   = datetime.strptime(end_str, "%Y%m%d")

    # overlap condition: [begin_date, end_date] ∩ [MAY_START, MAY_END] ≠ ∅
    overlaps_may = (begin_date <= MAY_END) and (end_date >= MAY_START)
    if not overlaps_may:
        continue

    nrw_may_files.append(filename)

print("Number of NRW station archives covering May 2024:", len(nrw_may_files))


# 4) Helper function: download one ZIP file

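def download_zip_file(url: str, local_path: Path) -> None:
    """Download a ZIP file from url and save it to local_path."""
    # Write to a temporary ".part" file first, so that an interrupted
    # download never leaves a truncated archive that a later run would
    # mistake for a complete file and skip.
    tmp_path = local_path.with_name(local_path.name + ".part")
    with requests.get(url, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(tmp_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=1024 * 1024):
                if chunk:
                    f.write(chunk)
    # Rename only after the download has completed successfully
    tmp_path.replace(local_path)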


# 5) Main loop: download only the NRW+May 2024 files

for filename in nrw_may_files:
    url = BASE_URL + filename
    local_path = DATA_DIR / filename

    # Skip if the file is already present locally
    if local_path.exists():
        print(f"[skip] {filename} (already exists)")
        continue

    print(f"[download] {filename} ...", end=" ")
    try:
        download_zip_file(url, local_path)
        print("done")
    except Exception as e:
        print(f"FAILED ({e})")

print("Download step finished.")


Total RR zip files in 'historical/': 1453
Number of NRW station archives covering May 2024: 73
[download] stundenwerte_RR_00216_20041001_20241231_hist.zip ... done
[download] stundenwerte_RR_00555_20080101_20241231_hist.zip ... done
[download] stundenwerte_RR_00603_19990303_20241231_hist.zip ... done
[download] stundenwerte_RR_00613_20041101_20241231_hist.zip ... done
[download] stundenwerte_RR_00644_20050101_20241231_hist.zip ... done
[download] stundenwerte_RR_00796_20041101_20241231_hist.zip ... done
[download] stundenwerte_RR_00871_20050801_20241231_hist.zip ... done
[download] stundenwerte_RR_00902_20061001_20241231_hist.zip ... done
[download] stundenwerte_RR_00934_20041001_20241231_hist.zip ... done
[download] stundenwerte_RR_00989_20050201_20241231_hist.zip ... done
[download] stundenwerte_RR_01024_20060801_20241231_hist.zip ... done
[download] stundenwerte_RR_01046_20041001_20241231_hist.zip ... done
[download] stundenwerte_RR_01078_19950901_20241231_hist.zip ... done
[downloa

In total, 73 station archives were downloaded. On my local machine, this took about one minute and required roughly 36 MB of disk space, which is easily manageable on a standard laptop.
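
A basic check that the download succeeded is to confirm that every file is a readable ZIP archive and contains a product file. The following is a minimal sketch rather than part of the original workflow: the function name `check_archives` is hypothetical, and the `produkt*.txt` naming is taken from the archive contents listed in the next section.

```python
import zipfile
from pathlib import Path


def check_archives(data_dir: Path) -> dict:
    """Map each *.zip filename in data_dir to 'ok' or a short problem description."""
    report = {}
    for zip_path in sorted(data_dir.glob("*.zip")):
        # A truncated or non-ZIP file fails this signature check
        if not zipfile.is_zipfile(zip_path):
            report[zip_path.name] = "not a valid zip file"
            continue

        with zipfile.ZipFile(zip_path) as zf:
            # testzip() returns the first member with a bad CRC, or None
            bad_member = zf.testzip()
            has_product = any(
                n.lower().startswith("produkt") and n.lower().endswith(".txt")
                for n in zf.namelist()
            )

        if bad_member is not None:
            report[zip_path.name] = f"corrupt member: {bad_member}"
        elif not has_product:
            report[zip_path.name] = "no produkt*.txt file"
        else:
            report[zip_path.name] = "ok"
    return report


report = check_archives(Path("data/dwd_rr_hourly"))
problems = {name: status for name, status in report.items() if status != "ok"}
print(f"{len(report)} archives checked, {len(problems)} with problems")
for name, status in problems.items():
    print(" -", name, "->", status)
```

If the folder does not exist yet, the function simply returns an empty report.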

## Opening one file

In [25]:
import zipfile
import pandas as pd

In [55]:
# 1)  Pick one ZIP file to inspect

DATA_DIR = Path("data/dwd_rr_hourly")

zip_path = DATA_DIR / "stundenwerte_RR_00216_20041001_20241231_hist.zip"

In [56]:
# 2) Open the ZIP and list all files inside (just to see what's there)

with zipfile.ZipFile(zip_path, "r") as zf:
    names = zf.namelist()

    print("\nFiles inside the zip:")
    for name in names:
        print(" -", name)



Files inside the zip:
 - Metadaten_Stationsname_Betreibername_00216.html
 - Metadaten_Stationsname_Betreibername_00216.txt
 - Metadaten_Parameter_rr_stunde_00216.html
 - Metadaten_Parameter_rr_stunde_00216.txt
 - Metadaten_Geraete_Niederschlagshoehe_00216.html
 - Metadaten_Geraete_Niederschlagshoehe_00216.txt
 - Metadaten_Geographie_00216.txt
 - Metadaten_Fehldaten_00216_20041001_20241231.html
 - Metadaten_Fehldaten_00216_20041001_20241231.txt
 - Metadaten_Fehlwerte_00216_20041001_20241231.txt
 - produkt_rr_stunde_20041001_20241231_00216.txt


In [57]:

# 3) Open ZIP, find product file, read it into a DataFrame

with zipfile.ZipFile(zip_path, "r") as zf:
    names = zf.namelist()

    print("\nFiles inside the zip:")
    for name in names:
        print(" -", name)

    # Find the produkt*.txt file (usually only one)
    product_name = None
    for n in names:
        lower = n.lower()
        if ("produkt" in lower) and lower.endswith(".txt"):
            product_name = n
            break

    if product_name is None:
        raise ValueError("No produkt*.txt file found inside the zip archive.")

    print("\nProduct file:", product_name)

    # Read that product file directly from the ZIP into a pandas DataFrame
    with zf.open(product_name) as f:
        df_00216 = pd.read_csv(
            f,
            sep=";",                
            na_values=[-999, -999.0] 
        )
df_00216.head()


Files inside the zip:
 - Metadaten_Stationsname_Betreibername_00216.html
 - Metadaten_Stationsname_Betreibername_00216.txt
 - Metadaten_Parameter_rr_stunde_00216.html
 - Metadaten_Parameter_rr_stunde_00216.txt
 - Metadaten_Geraete_Niederschlagshoehe_00216.html
 - Metadaten_Geraete_Niederschlagshoehe_00216.txt
 - Metadaten_Geographie_00216.txt
 - Metadaten_Fehldaten_00216_20041001_20241231.html
 - Metadaten_Fehldaten_00216_20041001_20241231.txt
 - Metadaten_Fehlwerte_00216_20041001_20241231.txt
 - produkt_rr_stunde_20041001_20241231_00216.txt

Product file: produkt_rr_stunde_20041001_20241231_00216.txt


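|   | STATIONS_ID | MESS_DATUM | QN_8 | R1 | RS_IND | WRTR | eor |
|---|---|---|---|---|---|---|---|
| 0 | 216 | 2004100100 | 1 | 0.0 | 0.0 | NaN | eor |
| 1 | 216 | 2004100101 | 1 | 0.0 | 0.0 | 0.0 | eor |
| 2 | 216 | 2004100102 | 1 | 0.0 | 0.0 | 0.0 | eor |
| 3 | 216 | 2004100103 | 1 | 0.0 | 0.0 | NaN | eor |
| 4 | 216 | 2004100104 | 1 | 0.0 | 0.0 | 0.0 | eor |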
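
One detail worth checking after reading a product file is the exact spelling of the column names. The `info()` output near the end of this notebook lists the precipitation column as `  R1` with leading spaces, which suggests that the header in the product file is padded. The following minimal sketch uses a made-up two-row sample (not real DWD data) to show how stripping the column names makes `df["R1"]` accessible.

```python
import io

import pandas as pd

# Made-up sample mimicking the product file layout; note the padded "  R1" header
sample = (
    "STATIONS_ID;MESS_DATUM;QN_8;  R1;RS_IND;WRTR;eor\n"
    "216;2004100100;1;0.0;0;-999;eor\n"
    "216;2004100101;1;0.3;1;6;eor\n"
)

df = pd.read_csv(io.StringIO(sample), sep=";", na_values=[-999, -999.0])
print(list(df.columns))  # the padded name is kept as read from the file

# Remove leading and trailing whitespace from all column names
df.columns = df.columns.str.strip()
print(list(df.columns))
print(df["R1"].sum())
```

Without this step, `df["R1"]` would raise a `KeyError`, because the column would have to be addressed with its padded name.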


## Extracting May 2024 from one station

In the DWD hourly precipitation files, the timestamp is stored in the column `MESS_DATUM`. It is encoded as `YYYYMMDDHH` (year, month, day, hour). In this step I convert `MESS_DATUM` to a proper datetime column and then filter the time series to keep only observations from May 2024.

In [59]:
# 1) Convert MESS_DATUM (YYYYMMDDHH) to pandas datetime

ts = df_00216["MESS_DATUM"].astype(str).str.strip()
ts = ts.str.replace(r"\.0$", "", regex=True) 

df_00216["datetime"] = pd.to_datetime(ts, format="%Y%m%d%H", errors="coerce")

# 2) Filter to May 2024

may_start = pd.Timestamp("2024-05-01")
june_start = pd.Timestamp("2024-06-01")

mask_may = (df_00216["datetime"] >= may_start) & (df_00216["datetime"] < june_start)
df_00216_may2024 = df_00216.loc[mask_may].copy()

# 3) Quick sanity checks

df_00216_may2024 = df_00216_may2024.sort_values("datetime")
df_00216_may2024.head()

|   | STATIONS_ID | MESS_DATUM | QN_8 | R1 | RS_IND | WRTR | eor | datetime |
|---|---|---|---|---|---|---|---|---|
| 171159 | 216 | 2024050100 | 3 | 0.0 | 0.0 | NaN | eor | 2024-05-01 00:00:00 |
| 171160 | 216 | 2024050101 | 3 | 0.0 | 0.0 | NaN | eor | 2024-05-01 01:00:00 |
| 171161 | 216 | 2024050102 | 3 | 0.0 | 0.0 | NaN | eor | 2024-05-01 02:00:00 |
| 171162 | 216 | 2024050103 | 3 | 0.0 | 0.0 | NaN | eor | 2024-05-01 03:00:00 |
| 171163 | 216 | 2024050104 | 3 | 0.0 | 0.0 | NaN | eor | 2024-05-01 04:00:00 |


## Extracting May 2024 and joining all stations


In [50]:
# 1) Basic setup: folder with NRW ZIP files + time window

DATA_DIR = Path("data/dwd_rr_hourly")

may_start = pd.Timestamp("2024-05-01")
june_start = pd.Timestamp("2024-06-01") 

nrw_dfs = []  # here we collect May 2024 data from all NRW stations

# 2) Loop over all ZIP files in the folder

for zip_path in sorted(DATA_DIR.glob("*.zip")):

    # DWD filename pattern:
    # stundenwerte_RR_<STATIONS_ID>_<BEGIN>_<END>_hist.zip
    parts = zip_path.name.split("_")
    if len(parts) < 4:
        continue

    station_id_str = parts[2] 

    
    # 3) Open ZIP and find the produkt*.txt file
    
    try:
        with zipfile.ZipFile(zip_path, "r") as zf:
            names = zf.namelist()

            product_name = None
            for n in names:
                lower = n.lower()
                if ("produkt" in lower) and lower.endswith(".txt"):
                    product_name = n
                    break

            if product_name is None:
                continue

            
            # 4) Read produkt file into a DataFrame
            
            with zf.open(product_name) as f:
                df = pd.read_csv(
                    f,
                    sep=";",                    
                    na_values=[-999, -999.0],    
                    encoding="latin1"
                )
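    except Exception as e:
        # Report unreadable archives instead of skipping them silently
        print(f"[warn] could not read {zip_path.name}: {e}")
        continue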

    
    # 5) Convert MESS_DATUM (YYYYMMDDHH) to pandas datetime
    
    if "MESS_DATUM" not in df.columns:
        continue

    ts = df["MESS_DATUM"].astype(str).str.strip()
    ts = ts.str.replace(r"\.0$", "", regex=True)  # just in case it came as '....0'

    df["datetime"] = pd.to_datetime(
        ts,
        format="%Y%m%d%H",
        errors="coerce"
    )

    
    # 6) Filter rows to May 2024
    
    mask_may = (df["datetime"] >= may_start) & (df["datetime"] < june_start)
    df_may = df.loc[mask_may].copy()

    if df_may.empty:
        # This station has no data in May 2024
        continue

    # Sort by time (nice for checks and later analysis)
    df_may = df_may.sort_values("datetime")

    
    # 7) Add station ID and store the filtered DataFrame
    
    df_may["STATIONS_ID"] = int(station_id_str) 

    nrw_dfs.append(df_may)



# 8) Combine all NRW stations into one DataFrame

if nrw_dfs:
    nrw_may_2024 = pd.concat(nrw_dfs, ignore_index=True)
else:
    nrw_may_2024 = pd.DataFrame()

nrw_may_2024.head()


|   | STATIONS_ID | MESS_DATUM | QN_8 | R1 | RS_IND | WRTR | eor | datetime |
|---|---|---|---|---|---|---|---|---|
| 0 | 216 | 2024050100 | 3 | 0.0 | 0.0 | NaN | eor | 2024-05-01 00:00:00 |
| 1 | 216 | 2024050101 | 3 | 0.0 | 0.0 | NaN | eor | 2024-05-01 01:00:00 |
| 2 | 216 | 2024050102 | 3 | 0.0 | 0.0 | NaN | eor | 2024-05-01 02:00:00 |
| 3 | 216 | 2024050103 | 3 | 0.0 | 0.0 | NaN | eor | 2024-05-01 03:00:00 |
| 4 | 216 | 2024050104 | 3 | 0.0 | 0.0 | NaN | eor | 2024-05-01 04:00:00 |


In [49]:
print(nrw_may_2024)

       STATIONS_ID  MESS_DATUM  QN_8    R1  RS_IND  WRTR  eor  \
0              216  2024050100     3   0.0     0.0   NaN  eor   
1              216  2024050101     3   0.0     0.0   NaN  eor   
2              216  2024050102     3   0.0     0.0   NaN  eor   
3              216  2024050103     3   0.0     0.0   NaN  eor   
4              216  2024050104     3   0.0     0.0   NaN  eor   
...            ...         ...   ...   ...     ...   ...  ...   
54008        15000  2024053119     3   0.0     0.0   0.0  eor   
54009        15000  2024053120     3   0.4     1.0   6.0  eor   
54010        15000  2024053121     3   0.0     0.0   0.0  eor   
54011        15000  2024053122     3   0.0     0.0   0.0  eor   
54012        15000  2024053123     3   1.2     1.0   6.0  eor   

                 datetime  
0     2024-05-01 00:00:00  
1     2024-05-01 01:00:00  
2     2024-05-01 02:00:00  
3     2024-05-01 03:00:00  
4     2024-05-01 04:00:00  
...                   ...  
54008 2024-05-31 19:00:

In [46]:
nrw_may_2024.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54013 entries, 0 to 54012
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   STATIONS_ID  54013 non-null  int64         
 1   MESS_DATUM   54013 non-null  int64         
 2   QN_8         54013 non-null  int64         
 3     R1         53136 non-null  float64       
 4   RS_IND       53136 non-null  float64       
 5   WRTR         8052 non-null   float64       
 6   eor          54013 non-null  object        
 7   datetime     54013 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(3), int64(3), object(1)
memory usage: 3.3+ MB


At this point, the dataset is sufficiently compact and structured to allow further analysis and to generate the visualisations of interest.
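
A simple completeness check is to count the hourly time steps per station. May has 31 × 24 = 744 hours, so 73 stations with full coverage would give 73 × 744 = 54,312 rows, whereas the combined table has 54,013 rows. This suggests that some stations have gaps. The following is a minimal sketch of such a check, assuming the columns `STATIONS_ID` and `datetime` as in the combined table above. The function name `hours_per_station` is hypothetical and the demonstration uses synthetic data, not DWD observations.

```python
import pandas as pd

EXPECTED_HOURS = 31 * 24  # May has 31 days, i.e. 744 hourly time steps


def hours_per_station(df: pd.DataFrame) -> pd.DataFrame:
    """Count distinct hourly time steps per station and the shortfall against 744."""
    counts = df.groupby("STATIONS_ID")["datetime"].nunique()
    summary = counts.rename("n_hours").to_frame()
    summary["missing_hours"] = EXPECTED_HOURS - summary["n_hours"]
    # Stations with the largest gaps come first
    return summary.sort_values("missing_hours", ascending=False)


# Synthetic demonstration data: station 1 is complete, station 2 lacks 4 hours
full_month = pd.date_range(
    "2024-05-01 00:00", periods=EXPECTED_HOURS, freq=pd.Timedelta(hours=1)
)
demo = pd.concat(
    [
        pd.DataFrame({"STATIONS_ID": 1, "datetime": full_month}),
        pd.DataFrame({"STATIONS_ID": 2, "datetime": full_month[:-4]}),
    ],
    ignore_index=True,
)

print(hours_per_station(demo))
```

Applied to the real data, the call would be `hours_per_station(nrw_may_2024)`. Stations with a positive `missing_hours` value have gaps in May 2024.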