## Feature Engineering
Feature engineering uses domain knowledge to derive features from raw data that a machine learning model can learn from. Well-chosen features can significantly improve model performance because they expose relevant information to the model directly.


In [None]:
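# import helper functions (assumes the helpers used below live in FeatureEngineeringFunctions)
from FeatureEngineeringFunctions import add_sin_cos_transform, build_lags, clean_data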
import pandas as pd
import numpy as np
import holidays
from sklearn.preprocessing import OneHotEncoder

In [None]:
# import and show unprocessed data
file_path = "/home/renku/work/ckw-pv-data/PVPanelInsights.parquet"
df_edh25 = pd.read_parquet(file_path)

print(df_edh25.shape)
print(df_edh25.head())

#### Data Preparation

**Define Column Types** <br>
We convert the 'day_of_week' column to an integer type for easier analysis and the 'time' column to datetime for time-based features.

**Create Binary Columns**<br>
The binary columns 'is_raining' and 'is_weekend' help the model understand the influence of weather conditions and weekends on energy consumption.

In [None]:
df_edh25['day_of_week'] = df_edh25['day_of_week'].astype(int)
df_edh25['time'] = pd.to_datetime(df_edh25['time'])

# create binary is_raining and is_weekend columns (assumes day_of_week uses Monday=0, so 5 and 6 are Saturday and Sunday)
df_edh25['is_raining'] = (df_edh25['precip_15min:mm'] > 0).astype(int)
df_edh25['is_weekend'] = df_edh25['day_of_week'].isin([5, 6]).astype(int)

# round values
df_edh25['PanelPeakLeistung'] = df_edh25['PanelPeakLeistung'].round(1)

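# clip negative sun elevations to 0 (assign the result; an in-place clip on a selected column may not update the DataFrame)
df_edh25['sun_elevation:d'] = df_edh25['sun_elevation:d'].clip(lower=0)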

# extract time in minutes after midnight
df_edh25['time_in_minutes'] = df_edh25["time"].dt.hour * 60 + df_edh25["time"].dt.minute

#### Sin-Cos Transform
This transformation is useful for cyclical features like time or day of the week, as it allows the model to learn the cyclical nature of these features without introducing discontinuities.
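
The notebook's own `add_sin_cos_transform` helper is not shown here. As a rough sketch of what such a transform typically computes, the hypothetical function `sin_cos_encode` below (its name and the demo data are assumptions for illustration) maps a value with a known period onto the unit circle:

```python
import numpy as np
import pandas as pd


def sin_cos_encode(df: pd.DataFrame, column: str, period: float) -> pd.DataFrame:
    """Return a copy of df with <column>_sin and <column>_cos added (hypothetical helper)."""
    out = df.copy()
    # map one full period onto one turn around the unit circle
    angle = 2 * np.pi * out[column] / period
    out[f"{column}_sin"] = np.sin(angle)
    out[f"{column}_cos"] = np.cos(angle)
    return out


# made-up demo data: four hours of the day
demo = pd.DataFrame({"hour": [0, 6, 12, 23]})
encoded = sin_cos_encode(demo, "hour", 24)
print(encoded.round(3))
```

In the encoded space, hour 23 lies right next to hour 0, whereas the raw values are 23 units apart. Because the values 0 and `period` map to the same point, this encoding suits quantities that genuinely wrap around.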

In [None]:
columns_to_transform = {
    "month": 12,
    "hour": 24,
    "time_in_minutes": 1440,
    "day_of_week": 7,
    "Ausrichtung_Grad": 360,
    "Anstellwinkel": 90,
    "sun_elevation:d": 90
}

# transform columns 
for column, period in columns_to_transform.items():
    df_edh25 = add_sin_cos_transform(df_edh25, column, period)

#### One-Hot Encoding
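One-hot encoding converts each categorical variable into a set of binary indicator columns, one per category, so that ML algorithms expecting numeric input can use it. Dropping the first category removes a redundant column (it is implied when all other indicators are 0) and so avoids perfect multicollinearity.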
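
To illustrate the effect of `drop_first`, here is a small sketch on made-up data. The column name `orientation` and its values are assumptions for the example, not taken from the dataset:

```python
import pandas as pd

# made-up categorical column with three categories
demo = pd.DataFrame({"orientation": ["east", "south", "west", "south"]})

# one indicator column per category
full = pd.get_dummies(demo, columns=["orientation"])
# same encoding, but the first category (in sorted order) is dropped
reduced = pd.get_dummies(demo, columns=["orientation"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

With all three indicators present, every row sums to 1, so any one column can be derived from the other two. After dropping `orientation_east`, a row with zeros in both remaining columns stands for "east".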

In [None]:
df_edh25 = pd.get_dummies(df_edh25, columns=['buildingCategory', 'buildingClass', "periodOfConstruction",
                                             "heatGeneratorHeating", "energySourceHeating", "heatGeneratorHotWater",
                                             "energySourceHotWater", "Ausrichtung"], drop_first=True)

#### Build Lags
Lag features are critical in time series analysis because they capture temporal dependencies and let the model learn from previous values. In addition to the lags, we extract the rolling mean and the autocorrelation for each column.
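
The `build_lags` helper itself is not shown in this notebook. The sketch below is a hypothetical stand-in (`add_lag_features`, with assumed parameters `lags` and `window`) that shows one common way to build per-group lags and a rolling mean with pandas. It leaves out autocorrelation.

```python
import pandas as pd


def add_lag_features(df, column, group_by, time_column, lags=(1, 2), window=3):
    """Add per-group lag and rolling-mean columns (hypothetical stand-in for build_lags)."""
    out = df.sort_values([group_by, time_column]).copy()
    grouped = out.groupby(group_by)[column]
    for lag in lags:
        # value of the same group `lag` steps earlier
        out[f"{column}_lag_{lag}"] = grouped.shift(lag)
    # shift by one step first so the rolling mean only uses past values
    out[f"{column}_rollmean_{window}"] = grouped.transform(
        lambda s: s.shift(1).rolling(window, min_periods=1).mean()
    )
    return out


# made-up demo data: two panels with three daily values each
demo = pd.DataFrame({
    "id": ["a", "a", "a", "b", "b", "b"],
    "datum": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"] * 2),
    "value": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
lagged = add_lag_features(demo, "value", group_by="id", time_column="datum", lags=(1,), window=2)
print(lagged)
```

The first row of each group has no predecessor and stays `NaN`, which is why missing values have to be handled after the lags are built. Grouping by `id` keeps values of one panel from leaking into the lags of another.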

In [None]:
columns_to_prepare = ["Überschuss", "clear_sky_rad_15min:Wh", "dew_point_min_2m_1h:C", "diffuse_rad_15min:Wh", "direct_rad_15min:Wh", "global_rad_15min:Wh", "precip_15min:mm", "relative_humidity_min_2m_1h:p", "snow_depth:mm", "sun_elevation:d", "t_mean_2m_1h:C", "wind_speed_mean_10m_15min:ms"]

for column in columns_to_prepare:
    df_edh25 = build_lags(df_edh25, column=column, group_by='id', time_column='datum')

# the first values of each lag are missing, so we backfill them
# (fillna(method='bfill') is deprecated in recent pandas versions; bfill() is the equivalent)
df_edh25 = df_edh25.bfill()

#### Holidays
Including holidays as a feature helps the model account for variations in energy consumption patterns during these days.

In [None]:
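# fetch public holidays for the canton of Lucerne
swiss_holidays = holidays.Switzerland(subdiv='LU', years=range(2022, 2026))
# compare normalized timestamps so the match does not depend on the dtype of 'datum'
holiday_dates = pd.to_datetime(list(swiss_holidays.keys()))
df_edh25['is_holiday'] = pd.to_datetime(df_edh25['datum']).dt.normalize().isin(holiday_dates).astype(int)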

#### Clean Dataset and Split
We clean the data to remove inconsistencies and split it chronologically into training and validation sets. The validation set contains only the most recent observations, which simulates predicting future values in a real-world setting.

In [None]:
df_edh25 = clean_data(df_edh25)

In [None]:
# sort the data by date
df_sorted = df_edh25.sort_values('datum')

# define the split: 80% training, 20% validation
split_index = int(len(df_sorted) * 0.8)
train_data = df_sorted.iloc[:split_index]
val_data = df_sorted.iloc[split_index:]

# separate the target ('production:kWh') from the features
y_train = train_data['production:kWh']
x_train = train_data.drop(columns=["datum", "production:kWh"])

y_val = val_data['production:kWh']
x_val = val_data.drop(columns=["datum", "production:kWh"])
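
When several panels share the same timestamps, a purely positional split can place rows from one timestamp in both sets. A possible variant, sketched here on made-up data and not part of the original pipeline, is to split at a cutoff timestamp instead:

```python
import pandas as pd

# made-up demo data: two panels, five days each
demo = pd.DataFrame({
    "id": ["a", "b"] * 5,
    "datum": pd.date_range("2024-01-01", periods=5).repeat(2),
    "production:kWh": range(10),
})

# pick the cutoff so that roughly 80% of the distinct timestamps fall before it
timestamps = demo["datum"].drop_duplicates().sort_values()
cutoff = timestamps.iloc[int(len(timestamps) * 0.8)]

train = demo[demo["datum"] < cutoff]
val = demo[demo["datum"] >= cutoff]

print(cutoff, len(train), len(val))
```

Every validation row is then strictly later than every training row. The trade-off is that the row counts only approximate an 80/20 split.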