The goal of this notebook is to add further features to our dataset.

In [2]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv('/workspaces/Room_7_Bakery_Prediction/0_DataPreparation/0.2 Additional Data/complete_dataset_with_features.csv')
df['date'] = pd.to_datetime(df['date'])
df.dtypes

id                            float64
date                   datetime64[ns]
Warengruppe                   float64
umsatz                        float64
KielerWoche                   float64
Bewoelkung                    float64
Temperatur                    float64
Windgeschwindigkeit           float64
Wettercode                    float64
Is_Holiday                      int64
Day_Before_Holiday              int64
Day_After_Holiday               int64
Is_Vacation                     int64
Vacation_Type                  object
dtype: object

Adding day_of_the_week, month and is_weekend to the dataset.

In [4]:
# Add day_of_the_week (0=Monday, 6=Sunday)
df['day_of_the_week'] = df['date'].dt.dayofweek

# Add month (1-12)
df['month'] = df['date'].dt.month

# Add is_weekend (True for Saturday and Sunday)
df['is_weekend'] = df['day_of_the_week'].isin([5, 6])

# Display the first few rows to verify
df[['date', 'day_of_the_week', 'month', 'is_weekend']].head(10)

Unnamed: 0,date,day_of_the_week,month,is_weekend
0,2013-07-01,0,7,False
1,2013-07-01,0,7,False
2,2013-07-01,0,7,False
3,2013-07-01,0,7,False
4,2013-07-01,0,7,False
5,2013-07-02,1,7,False
6,2013-07-02,1,7,False
7,2013-07-02,1,7,False
8,2013-07-02,1,7,False
9,2013-07-02,1,7,False
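
The first ten rows above only cover a Monday and a Tuesday, so they never exercise the weekend flag. A minimal sketch on a toy frame (the name `toy` and the four dates are made up for illustration and are not taken from the bakery data) checks the Monday=0 convention across a weekend:

```python
import pandas as pd

# Hypothetical dates spanning Friday to Monday (not from the dataset)
toy = pd.DataFrame(
    {"date": pd.to_datetime(["2013-07-05", "2013-07-06", "2013-07-07", "2013-07-08"])}
)

# Same logic as in the cell above: 0=Monday, 6=Sunday
toy["day_of_the_week"] = toy["date"].dt.dayofweek
toy["is_weekend"] = toy["day_of_the_week"].isin([5, 6])

print(toy)
```

Only the Saturday and Sunday rows should come out with `is_weekend` set to True.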


**Adding features for the weather data.**  
Converting the okta values of 'Bewoelkung' (cloud cover) into categorical data.  
- **0** = Clear sky  
- **1-2** = Partly cloudy  
- **3-4** = Cloudy  
- **5-6** = Very cloudy  
- **7-8** = Overcast  

In [24]:
# Bin Okta values of 'Bewölkung' into categories
okta_bins = [-0.1, 0.5, 2.5, 4.5, 6.5, 8.5]  # covers 0, 1-2, 3-4, 5-6, 7-8
okta_labels = ["Clear sky", "Partly cloudy", "Cloudy", "Very cloudy", "Overcast"]

# Ensure numeric, then create categorical column
bew = pd.to_numeric(df['Bewoelkung'], errors='coerce')
df['bewoelkung_category'] = pd.cut(
    bew,
    bins=okta_bins,
    labels=okta_labels,
    include_lowest=True
)

df[['Bewoelkung', 'bewoelkung_category']].head(10)

Unnamed: 0,Bewoelkung,bewoelkung_category
0,6.0,Very cloudy
1,6.0,Very cloudy
2,6.0,Very cloudy
3,6.0,Very cloudy
4,6.0,Very cloudy
5,3.0,Cloudy
6,3.0,Cloudy
7,3.0,Cloudy
8,3.0,Cloudy
9,3.0,Cloudy
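
To see how the bin edges treat every possible okta value, here is a small sketch on a toy series (the series `okta` is made up for illustration). It assumes that values outside 0-8, such as a 9 that may be reported for an obscured sky, should stay uncategorized:

```python
import numpy as np
import pandas as pd

okta_bins = [-0.1, 0.5, 2.5, 4.5, 6.5, 8.5]
okta_labels = ["Clear sky", "Partly cloudy", "Cloudy", "Very cloudy", "Overcast"]

# Toy input: every okta value 0-8, one out-of-range value and one missing value
okta = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan], dtype=float)

cats = pd.cut(okta, bins=okta_bins, labels=okta_labels, include_lowest=True)

# Values outside the bins (9) and missing values both end up as NaN
print(pd.DataFrame({"okta": okta, "category": cats}))
```

Because anything outside the bins becomes NaN rather than raising an error, it can be worth counting the NaN values in the new column after binning.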


In [5]:
print(np.sort(df['Wettercode'].dropna().unique()).tolist())

[0.0, 3.0, 5.0, 10.0, 17.0, 20.0, 21.0, 22.0, 28.0, 45.0, 49.0, 53.0, 55.0, 61.0, 63.0, 65.0, 68.0, 69.0, 71.0, 73.0, 75.0, 77.0, 79.0, 95.0]


Converting the WMO codes in 'Wettercode' into categorical data.  
0.0 = Cloud development not observed or not observable  
3.0 = Clouds generally forming or developing  
5.0 = Haze  
10.0 = Mist  
17.0 = Thunderstorm, but no precipitation at the time of observation  
20.0 = Drizzle (not freezing) or snow grains  
21.0 = Rain (not freezing)  
22.0 = Snow  
28.0 = Fog or ice fog  
45.0 = Fog or ice fog, sky invisible  
49.0 = Fog, depositing rime, sky invisible  
53.0 = Moderate drizzle, not freezing, continuous  
55.0 = Heavy drizzle, not freezing, continuous  
61.0 = Slight rain, not freezing, continuous  
63.0 = Moderate rain, not freezing, continuous  
65.0 = Heavy rain, not freezing, continuous  
68.0 = Rain or drizzle and snow, slight  
69.0 = Rain or drizzle and snow, moderate or heavy  
71.0 = Slight continuous fall of snowflakes  
73.0 = Moderate continuous fall of snowflakes  
75.0 = Heavy continuous fall of snowflakes  
77.0 = Snow grains (with or without fog)  
79.0 = Ice pellets  
95.0 = Thunderstorm, slight or moderate, without hail but with rain and/or snow at time of observation

In [22]:
# Map WMO weather codes to categories
wmo_map = {
    0.0: "Cloud development not observed or not observable",
    3.0: "Clouds generally forming or developing",
    5.0: "Haze",
    10.0: "Mist",
    17.0: "Thunderstorm, no precipitation at observation",
    20.0: "Drizzle (not freezing) or snow grains",
    21.0: "Rain (not freezing)",
    22.0: "Snow",
    28.0: "Fog or ice fog",
    45.0: "Fog or ice fog, sky invisible",
    49.0: "Fog, depositing rime, sky invisible",
    53.0: "Moderate drizzle, not freezing, continuous",
    55.0: "Heavy drizzle, not freezing, continuous",
    61.0: "Slight rain, not freezing, continuous",
    63.0: "Moderate rain, not freezing, continuous",
    65.0: "Heavy rain, not freezing, continuous",
    68.0: "Rain or drizzle and snow, slight",
    69.0: "Rain or drizzle and snow, moderate or heavy",
    71.0: "Slight continuous fall of snowflakes",
    73.0: "Moderate continuous fall of snowflakes",
    75.0: "Heavy continuous fall of snowflakes",
    77.0: "Snow grains (with or without fog)",
    79.0: "Ice pellets",
    95.0: "Thunderstorm, slight/moderate, no hail but rain/snow"
}

# Ensure numeric, then map; missing or unmapped codes are labeled 'Missing'
wmo_numeric = pd.to_numeric(df['Wettercode'], errors='coerce')
df['wettercode_category'] = wmo_numeric.map(wmo_map).fillna('Missing')

df[['Wettercode', 'wettercode_category']].head(10)

Unnamed: 0,Wettercode,wettercode_category
0,20.0,Drizzle (not freezing) or snow grains
1,20.0,Drizzle (not freezing) or snow grains
2,20.0,Drizzle (not freezing) or snow grains
3,20.0,Drizzle (not freezing) or snow grains
4,20.0,Drizzle (not freezing) or snow grains
5,,Missing
6,,Missing
7,,Missing
8,,Missing
9,,Missing
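
Note that `fillna('Missing')` gives the same label to rows with no weather code and to rows whose code is not in the mapping. A minimal sketch of how the two cases could be told apart, using a toy series and a shortened mapping (`codes`, `small_map` and the code 99.0 are hypothetical and only serve the illustration):

```python
import numpy as np
import pandas as pd

# Shortened mapping for illustration only
small_map = {20.0: "Drizzle (not freezing) or snow grains", 22.0: "Snow"}

# Toy input: a known code, a missing value and a code absent from the mapping
codes = pd.Series([20.0, np.nan, 99.0])

mapped = codes.map(small_map)          # keys not in the dict become NaN
category = mapped.fillna("Missing")    # both NaN cases get the same label

# Rows that had a code but found no entry in the mapping
unmapped = codes.notna() & mapped.isna()

print(category.tolist())
print(codes[unmapped].tolist())
```

In this notebook all observed codes are covered by `wmo_map`, so the check matters mainly if new data with other codes is added later.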


Adding a new category for season.  
**Spring** = March - May  
**Summer** = June - August  
**Autumn** = September - November  
**Winter** = December - February

In [21]:
# Add season based on month
def get_season(month):
    if month in [3, 4, 5]:
        return "Spring"
    elif month in [6, 7, 8]:
        return "Summer"
    elif month in [9, 10, 11]:
        return "Autumn"
    else:  # 12, 1, 2
        return "Winter"

df['season'] = df['month'].apply(get_season)

df[['date', 'month', 'season']].head(10)


Unnamed: 0,date,month,season
0,2013-07-01,7,Summer
1,2013-07-01,7,Summer
2,2013-07-01,7,Summer
3,2013-07-01,7,Summer
4,2013-07-01,7,Summer
5,2013-07-02,7,Summer
6,2013-07-02,7,Summer
7,2013-07-02,7,Summer
8,2013-07-02,7,Summer
9,2013-07-02,7,Summer
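
The sample above only shows July. As a quick check of the other seasons, the same month-to-season rule can also be written as a dictionary lookup; this is a sketch of an alternative, not the approach used in the cell above, and `season_by_month` and `months` are names introduced here for illustration:

```python
import pandas as pd

# Month number -> season, following the definition in the text
season_by_month = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
}

# Toy input: one month per season, plus December to check the year wrap-around
months = pd.Series([1, 4, 7, 10, 12])
seasons = months.map(season_by_month)

print(seasons.tolist())
```

Unlike the `else` branch in `get_season`, a dictionary lookup returns NaN for an invalid month instead of silently labeling it "Winter".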


Adding categorical data for temperature.  
Freezing: ≤ 0°C (frost conditions)  
Low: above 0°C up to 10°C (cold, winter-like conditions)  
Medium: above 10°C up to 20°C (mild, spring/autumn conditions)  
High: > 20°C (warm, summer conditions)

In [28]:
# Categorize temperature into Freezing, Low, Medium, and High
# Temperature thresholds (in Celsius); each bin includes its upper edge
temp_bins = [-np.inf, 0, 10, 20, np.inf]
temp_labels = ["Freezing", "Low", "Medium", "High"]

# Ensure numeric then categorize
temp_numeric = pd.to_numeric(df['Temperatur'], errors='coerce')
df['temperature_category'] = pd.cut(
    temp_numeric,
    bins=temp_bins,
    labels=temp_labels,
    include_lowest=True
)

df[['Temperatur', 'temperature_category']].head(10)

Unnamed: 0,Temperatur,temperature_category
0,17.8375,Medium
1,17.8375,Medium
2,17.8375,Medium
3,17.8375,Medium
4,17.8375,Medium
5,17.3125,Medium
6,17.3125,Medium
7,17.3125,Medium
8,17.3125,Medium
9,17.3125,Medium
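
Since `pd.cut` closes each interval on the right by default, a temperature that sits exactly on a threshold falls into the lower category. A small sketch with made-up boundary values (the series `temps` is hypothetical) shows this:

```python
import numpy as np
import pandas as pd

temp_bins = [-np.inf, 0, 10, 20, np.inf]
temp_labels = ["Freezing", "Low", "Medium", "High"]

# Toy input: values on and just above each threshold
temps = pd.Series([-3.0, 0.0, 0.1, 10.0, 10.1, 20.0, 20.1])

cats = pd.cut(temps, bins=temp_bins, labels=temp_labels)

print(pd.DataFrame({"Temperatur": temps, "temperature_category": cats}))
```

Exactly 0°C is therefore "Freezing", exactly 10°C is "Low", and exactly 20°C is "Medium".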
