# Climatic attributes

Notebook to create the file `CAMELS_DE_climatic_attributes.csv`.  

columns:
- gauge_id
- p_mean [mm/day]
- p_seasonality [-]
- frac_snow [-]
- high_prec_freq [days/yr]
- high_prec_dur [days]
- high_prec_timing [season]
- low_prec_freq [days/yr]
- low_prec_dur [days]
- low_prec_timing [season]


In [1]:
from glob import glob
import pandas as pd
import numpy as np
from scipy.optimize import curve_fit

CAMELS-CH: [Note that observed discharge is the most limiting variable. Hence, if discharge was only available for three hydrologic years for a certain basin to allow for the calculation of hydrologic signatures, the climatic indices were evaluated only for the same 3 years for the sake of consistency.](https://essd.copernicus.org/articles/15/5755/2023/#:~:text=.%20Note%20that%20observed%20discharge%20is%20the%20most%20limiting%20variable.)  

We do the same for CAMELS-DE.


In [2]:
# get camels_ids from hydromet timeseries
camels_ids = [camels_id.split("_")[-1].split(".csv")[0] for camels_id in glob("../output_data/camels_de/timeseries/*.csv")]

# sort camels_ids
camels_ids = sorted(camels_ids)

print(f"Total number of stations in CAMELS-DE v1: {len(camels_ids)}")

Total number of stations in CAMELS-DE v1: 1460


In [3]:
def filter_complete_hydro_years(df, tolerance=0.05):
    """
    Helper function to filter a DataFrame to only include complete hydrological 
    years (October - September). A hydrological year is considered complete if 
    it has less than the specified tolerance of missing values.

    """
    # if date is not in index, set it as index
    if 'date' in df.columns:
        df = df.set_index('date')

    # convert the index to datetime if it is not already
    if not isinstance(df.index, pd.DatetimeIndex): 
        df.index = pd.to_datetime(df.index)

    # make the dataframe start at the beginning of the hydrological year, i.e. 01.10. of the previous year
    min_year = df.index.year.min()
    df = df.reindex(pd.date_range(start=f"{min_year-1}-10-01", end=df.index.max(), freq='D'))

    # make the dataframe end at the end of the hydrological year, i.e. 30.09. of the next year
    max_year = df.index.year.max()
    df = df.reindex(pd.date_range(start=df.index.min(), end=f"{max_year+1}-09-30", freq='D'))

    # Calculate the number of missing values per hydrological year for 'discharge_vol' column
    df['hydro_year'] = df.index.year
    df.loc[df.index.month >= 10, 'hydro_year'] += 1
    missing_values_per_year = df['discharge_vol'].groupby(df['hydro_year']).apply(lambda x: x.isna().mean())

    # Filter the DataFrame to only include years with less than the tolerance of missing values
    df_filtered = df[df['hydro_year'].isin(missing_values_per_year[missing_values_per_year <= tolerance].index)]

    # Drop the 'hydro_year' column
    df_filtered = df_filtered.drop(columns='hydro_year')

    return df_filtered

In [4]:
# dataframe to store results
df_results = pd.DataFrame()

## p_mean

*mean daily preciptiation*

In [5]:
for camels_id in camels_ids:
    # read camels de hydromet timeseries data
    df = pd.read_csv(f"../output_data/camels_de/timeseries/CAMELS_DE_hydromet_timeseries_{camels_id}.csv")

    # filter complete hydrological years
    df = filter_complete_hydro_years(df)

    # calculate p_mean from precipitation_mean
    p_mean = df["precipitation_mean"].mean()

    # add to results
    df_results.loc[camels_id, "p_mean"] = round(p_mean, 2)

df_results.head()

  0%|          | 0/1460 [00:00<?, ?it/s]

100%|██████████| 1460/1460 [01:04<00:00, 22.64it/s]


Unnamed: 0,p_mean
DE110000,2.97
DE110010,2.87
DE110020,2.54
DE110030,2.45
DE110040,2.61


## pet_mean

*mean daily PET*

At the moment, we do not include PET data directly in CAMELS-DE.

## aridity

*aridity, calculated as the ratio of mean daily potential evapotranspiration to mean daily precipitation*

No PET in CAMELS-DE.

## p_seasonality

*seasonality and timing of precipitation (estimated using sine curves to represent the annual temperature and precipitation cycles; positive (negative) values indicate that precipitation peaks in summer (winter) and values close to zero indicate uniform precipitation throughout the year)*


In [6]:
# Define the sine function to fit
def sine_curve(day_of_year, mean_value, amplitude, phase_shift):
    return mean_value * (1 + amplitude * np.sin(2 * np.pi * (day_of_year - phase_shift) / 365.25))

for camels_id in camels_ids:
    # read camels de hydromet timeseries data
    df = pd.read_csv(f"../output_data/camels_de/timeseries/CAMELS_DE_hydromet_timeseries_{camels_id}.csv")

    # filter complete hydrological years
    df = filter_complete_hydro_years(df)

    # reset index
    df = df.reset_index(drop=False, names="date")

    # convert date to datetime
    df["date"] = pd.to_datetime(df["date"])

    # Create a time variable that represents the day of the year
    df["day_of_year"] = df["date"].dt.dayofyear

    # Get the mean precipitation and temperature
    average_precipitation = df["precipitation_mean"].mean()
    average_temperature = df["temperature_mean"].mean()

    # Get the first guess for the phase shift
    initial_phase_shift_guess_prec = 90 - df["precipitation_mean"].idxmax() * 30
    initial_phase_shift_guess_prec = initial_phase_shift_guess_prec % 360

    initial_phase_shift_guess_temp = -90

    # Fit a sine curve to the precipitation and temperature data
    optimized_parameters_prec, parameter_covariances_prec = curve_fit(sine_curve, df["day_of_year"], df["precipitation_mean"], p0=[average_precipitation, 0.4, initial_phase_shift_guess_prec])
    optimized_parameters_temp, parameter_covariances_temp = curve_fit(sine_curve, df["day_of_year"], df["temperature_mean"], p0=[average_temperature, 5, initial_phase_shift_guess_temp])

    # The phase shifts are optimized_parameters[2]
    precipitation_seasonality = optimized_parameters_prec[2]
    temperature_seasonality = optimized_parameters_temp[2]

    # The amplitudes are optimized_parameters[1]
    amplitude_prec = optimized_parameters_prec[1]
    amplitude_temp = optimized_parameters_temp[1]

    # Calculate p_seasonality
    p_seasonality = amplitude_prec * np.sign(amplitude_temp) * np.cos(2 * np.pi * (precipitation_seasonality - temperature_seasonality) / 365.25)

    # Add to results
    df_results.loc[camels_id, "p_seasonality"] = round(p_seasonality, 2)

df_results

100%|██████████| 1460/1460 [01:49<00:00, 13.33it/s]


Unnamed: 0,p_mean,p_seasonality
DE110000,2.97,0.01
DE110010,2.87,0.05
DE110020,2.54,0.19
DE110030,2.45,0.25
DE110040,2.61,0.40
...,...,...
DEG10580,2.29,0.09
DEG10590,2.21,0.06
DEG10600,1.62,0.15
DEG10610,1.80,0.31


## frac_snow

*fraction of precipitation falling as snow (for days colder than 0°C)*

TODO Ralf: nehmen wir hier Tage an denen temperature_mean < 0°C ist? Oder temperature_min < 0°C? Oder temperature_max < 0°C?

In [7]:
for camels_id in camels_ids:
    # read camels de hydromet timeseries data
    df = pd.read_csv(f"../output_data/camels_de/timeseries/CAMELS_DE_hydromet_timeseries_{camels_id}.csv")

    # filter complete hydrological years
    df = filter_complete_hydro_years(df)
    
    # fraction of precipitation falling as snow (for days colder than 0°C)
    sum_precip_snow = df[df["temperature_mean"] < 0]["precipitation_mean"].sum()
    sum_precip_water = df[df["temperature_mean"] >= 0]["precipitation_mean"].sum()
    frac_snow = sum_precip_snow / (sum_precip_snow + sum_precip_water)

    # add to results
    df_results.loc[camels_id, "frac_snow"] = round(frac_snow, 2)

df_results

100%|██████████| 1460/1460 [01:05<00:00, 22.14it/s]


Unnamed: 0,p_mean,p_seasonality,frac_snow
DE110000,2.97,0.01,0.12
DE110010,2.87,0.05,0.12
DE110020,2.54,0.19,0.10
DE110030,2.45,0.25,0.09
DE110040,2.61,0.40,0.07
...,...,...,...
DEG10580,2.29,0.09,0.09
DEG10590,2.21,0.06,0.10
DEG10600,1.62,0.15,0.08
DEG10610,1.80,0.31,0.09


## high_prec_freq

*frequency of high precipitation days (≥ 5 times mean daily precipitation)* [days/yr]

In [8]:
for camels_id in camels_ids:
    # read camels de hydromet timeseries data
    df = pd.read_csv(f"../output_data/camels_de/timeseries/CAMELS_DE_hydromet_timeseries_{camels_id}.csv")

    # filter complete hydrological years
    df = filter_complete_hydro_years(df)
    
    # mean precipitation
    p_mean = df["precipitation_mean"].mean()

    # number of days >= 5 times mean precipitation
    n_days_high_freq = len(df[df["precipitation_mean"] >= 5 * p_mean]) / (df.index[-1] - df.index[0]).days * 365.25

    # add to results
    df_results.loc[camels_id, "high_prec_freq"] = round(n_days_high_freq, 2)

df_results.head()

100%|██████████| 1460/1460 [01:08<00:00, 21.41it/s]


Unnamed: 0,p_mean,p_seasonality,frac_snow,high_prec_freq
DE110000,2.97,0.01,0.12,15.23
DE110010,2.87,0.05,0.12,13.3
DE110020,2.54,0.19,0.1,15.22
DE110030,2.45,0.25,0.09,15.23
DE110040,2.61,0.4,0.07,17.16


## high_prec_dur

*average duration of high precipitation events (number of consecutive days ≥ 5 times mean daily precipitation)* [days]

In [9]:
# initialize variables to keep track of high precipitation event
high_precip_streaks = []
current_streak = 0

for camels_id in camels_ids:
    # read camels de hydromet timeseries data
    df = pd.read_csv(f"../output_data/camels_de/timeseries/CAMELS_DE_hydromet_timeseries_{camels_id}.csv")

    # filter complete hydrological years
    df = filter_complete_hydro_years(df)
    
    # mean precipitation
    p_mean = df["precipitation_mean"].mean()

    # iterate over the DataFrame's rows
    for precip in df["precipitation_mean"]:
        if precip >= 5 * p_mean:
            # if the day's precipitation is higher 5 times mean precipitation, increment the current streak
            current_streak += 1
        elif current_streak > 0:
            # if the day's precipitation is not high and there's a current streak, add it to the list of all streaks and reset it
            high_precip_streaks.append(current_streak)
            current_streak = 0

    # if there's a current streak at the end of the DataFrame, add it to the list of all streaks
    if current_streak > 0:
        high_precip_streaks.append(current_streak)

    # calculate the average streak length for the station
    average_streak_length = sum(high_precip_streaks) / len(high_precip_streaks) if high_precip_streaks else 0

    # add to results
    df_results.loc[camels_id, "high_prec_dur"] = round(average_streak_length, 2)

df_results.head()

100%|██████████| 1460/1460 [01:14<00:00, 19.56it/s]


Unnamed: 0,p_mean,p_seasonality,frac_snow,high_prec_freq,high_prec_dur
DE110000,2.97,0.01,0.12,15.23,1.19
DE110010,2.87,0.05,0.12,13.3,1.19
DE110020,2.54,0.19,0.1,15.22,1.18
DE110030,2.45,0.25,0.09,15.23,1.17
DE110040,2.61,0.4,0.07,17.16,1.18


## high_prec_timing

*season during which most high precipitation days (≥ 5 times mean daily precipitation) occur (djf December, January, February; mam March, April, May; jja June, July, August; son September, October, November). If two seasons register the same number of events, a value of NaN is given.* [season]

In [10]:
# Define a function to get the season from a date
def get_season(date):
    month = date.month
    if month in [12, 1, 2]:
        return 'djf'
    elif month in [3, 4, 5]:
        return 'mam'
    elif month in [6, 7, 8]:
        return 'jja'
    else:
        return 'son'

for camels_id in camels_ids:
    # read camels de hydromet timeseries data
    df = pd.read_csv(f"../output_data/camels_de/timeseries/CAMELS_DE_hydromet_timeseries_{camels_id}.csv")

    # filter complete hydrological years
    df = filter_complete_hydro_years(df)

    # make date column from index
    df["date"] = df.index

    # mean precipitation
    p_mean = df["precipitation_mean"].mean()

    # Initialize a dictionary to keep track of the number of high-precipitation days in each season
    season_counts = {'djf': 0, 'mam': 0, 'jja': 0, 'son': 0}

    # Iterate over the DataFrame's rows
    for index, row in df.iterrows():
        if row['precipitation_mean'] >= 5 * p_mean:
            # If the day's precipitation is high, increment the count for its season
            season_counts[get_season(row['date'])] += 1

    # Find the season with the most high-precipitation days
    max_count = max(season_counts.values())
    max_seasons = [season for season, count in season_counts.items() if count == max_count]

    # If there's a tie, return NaN
    if len(max_seasons) > 1:
        max_season = float('nan')
    else:
        max_season = max_seasons[0]

    # Add to results
    df_results.loc[camels_id, "high_prec_timing"] = max_season

df_results.head()

100%|██████████| 1460/1460 [08:59<00:00,  2.71it/s]


Unnamed: 0,p_mean,p_seasonality,frac_snow,high_prec_freq,high_prec_dur,high_prec_timing
DE110000,2.97,0.01,0.12,15.23,1.19,djf
DE110010,2.87,0.05,0.12,13.3,1.19,jja
DE110020,2.54,0.19,0.1,15.22,1.18,jja
DE110030,2.45,0.25,0.09,15.23,1.17,jja
DE110040,2.61,0.4,0.07,17.16,1.18,jja


## low_prec_freq

*frequency of low precipitation days (< 1 mm/day)* [days/yr]

In [11]:
for camels_id in camels_ids:
    # read camels de hydromet timeseries data
    df = pd.read_csv(f"../output_data/camels_de/timeseries/CAMELS_DE_hydromet_timeseries_{camels_id}.csv")

    # filter complete hydrological years
    df = filter_complete_hydro_years(df)

    # number of days < 1 mm of precipitation
    n_days_low_freq = len(df[df["precipitation_mean"] < 1]) / (df.index[-1] - df.index[0]).days * 365.25

    # add to results
    df_results.loc[camels_id, "low_prec_freq"] = round(n_days_low_freq, 2)

df_results.head()

100%|██████████| 1460/1460 [01:03<00:00, 23.02it/s]


Unnamed: 0,p_mean,p_seasonality,frac_snow,high_prec_freq,high_prec_dur,high_prec_timing,low_prec_freq
DE110000,2.97,0.01,0.12,15.23,1.19,djf,202.67
DE110010,2.87,0.05,0.12,13.3,1.19,jja,178.97
DE110020,2.54,0.19,0.1,15.22,1.18,jja,212.8
DE110030,2.45,0.25,0.09,15.23,1.17,jja,213.89
DE110040,2.61,0.4,0.07,17.16,1.18,jja,223.92


## low_prec_dur

*average duration of dry periods (number of consecutive days < 1 mm/day)* [days]

In [12]:
# initialize variables to keep track of high precipitation event
low_precip_streaks = []
current_streak = 0

for camels_id in camels_ids:
    # read camels de hydromet timeseries data
    df = pd.read_csv(f"../output_data/camels_de/timeseries/CAMELS_DE_hydromet_timeseries_{camels_id}.csv")

    # filter complete hydrological years
    df = filter_complete_hydro_years(df)
    
    # iterate over the DataFrame's rows
    for precip in df["precipitation_mean"]:
        if precip < 1:
            # if the day's precipitation is higher 5 times mean precipitation, increment the current streak
            current_streak += 1
        elif current_streak > 0:
            # if the day's precipitation is not high and there's a current streak, add it to the list of all streaks and reset it
            low_precip_streaks.append(current_streak)
            current_streak = 0

    # if there's a current streak at the end of the DataFrame, add it to the list of all streaks
    if current_streak > 0:
        low_precip_streaks.append(current_streak)

    # calculate the average streak length for the station
    average_streak_length = sum(low_precip_streaks) / len(low_precip_streaks) if low_precip_streaks else 0

    # add to results
    df_results.loc[camels_id, "low_prec_dur"] = round(average_streak_length, 2)

df_results.head()

100%|██████████| 1460/1460 [01:12<00:00, 20.20it/s]


Unnamed: 0,p_mean,p_seasonality,frac_snow,high_prec_freq,high_prec_dur,high_prec_timing,low_prec_freq,low_prec_dur
DE110000,2.97,0.01,0.12,15.23,1.19,djf,202.67,3.71
DE110010,2.87,0.05,0.12,13.3,1.19,jja,178.97,3.72
DE110020,2.54,0.19,0.1,15.22,1.18,jja,212.8,3.74
DE110030,2.45,0.25,0.09,15.23,1.17,jja,213.89,3.74
DE110040,2.61,0.4,0.07,17.16,1.18,jja,223.92,3.75


## low_prec_timing

*season during which most dry days (< 1mm day-1) occur (djf December, January, February; mam March, April, May; jja June, July, August; son September, October, November). If two seasons register the same number of events, a value of NaN is given.* [season]

In [13]:
# Define a function to get the season from a date
def get_season(date):
    month = date.month
    if month in [12, 1, 2]:
        return 'djf'
    elif month in [3, 4, 5]:
        return 'mam'
    elif month in [6, 7, 8]:
        return 'jja'
    else:
        return 'son'

for camels_id in camels_ids:
    # read camels de hydromet timeseries data
    df = pd.read_csv(f"../output_data/camels_de/timeseries/CAMELS_DE_hydromet_timeseries_{camels_id}.csv")

    # filter complete hydrological years
    df = filter_complete_hydro_years(df)

    # make date column from index
    df["date"] = df.index

    # Initialize a dictionary to keep track of the number of low-precipitation days in each season
    season_counts = {'djf': 0, 'mam': 0, 'jja': 0, 'son': 0}

    # Iterate over the DataFrame's rows
    for index, row in df.iterrows():
        if row['precipitation_mean'] < 1:
            # If the day's precipitation is low, increment the count for its season
            season_counts[get_season(row['date'])] += 1

    # Find the season with the most low-precipitation days
    max_count = max(season_counts.values())
    max_seasons = [season for season, count in season_counts.items() if count == max_count]

    # If there's a tie, return NaN
    if len(max_seasons) > 1:
        max_season = float('nan')
    else:
        max_season = max_seasons[0]

    # Add to results
    df_results.loc[camels_id, "low_prec_timing"] = max_season

df_results.head()

100%|██████████| 1460/1460 [09:26<00:00,  2.58it/s]


Unnamed: 0,p_mean,p_seasonality,frac_snow,high_prec_freq,high_prec_dur,high_prec_timing,low_prec_freq,low_prec_dur,low_prec_timing
DE110000,2.97,0.01,0.12,15.23,1.19,djf,202.67,3.71,son
DE110010,2.87,0.05,0.12,13.3,1.19,jja,178.97,3.72,son
DE110020,2.54,0.19,0.1,15.22,1.18,jja,212.8,3.74,son
DE110030,2.45,0.25,0.09,15.23,1.17,jja,213.89,3.74,son
DE110040,2.61,0.4,0.07,17.16,1.18,jja,223.92,3.75,son


## Save results

Create file `CAMELS_DE_climatic_attributes.csv`.

In [14]:
# save results
df_results.to_csv(f"../output_data/camels_de/CAMELS_DE_climatic_attributes.csv", index_label="gauge_id")