### Welcome to the CKW Energy Data Hackday Challenge
The data has already been preprocessed and is ready for building machine learning models. This notebook shows how to import the data from Blob Storage and then apply a scaler.

In [None]:
import pandas as pd
import math 
import joblib
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import StandardScaler
import xgboost as xgb
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

In [None]:
# import and show data

The data consists of about 1'180'000 rows and 191 columns. There are this many columns because some values have already been feature engineered. To see which features were expanded, look in the folder FeatureEngineering. To get ready for machine learning, we first need to scale the data.

In [None]:
# scale the data with a standard scaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  
X_val_scaled = scaler.transform(X_val)  
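
As a rough illustration of what the scaling step does, the sketch below standardizes a toy array with plain NumPy. The arrays `X_train_toy` and `X_val_toy` and their values are hypothetical stand-ins for the real data. The point is that the mean and standard deviation are computed on the training split only and then reused for the validation split, which mirrors calling `fit_transform` on `X_train` and `transform` on `X_val`.

```python
import numpy as np

# Hypothetical toy data standing in for X_train / X_val.
X_train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_val_toy = np.array([[2.0, 40.0]])

# Column-wise statistics are taken from the training split only.
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # population standard deviation (ddof=0)
std[std == 0] = 1.0            # guard against constant columns

# The same statistics are applied to both splits.
X_train_toy_scaled = (X_train_toy - mean) / std
X_val_toy_scaled = (X_val_toy - mean) / std

print(X_train_toy_scaled.mean(axis=0))  # approximately 0 per column
print(X_train_toy_scaled.std(axis=0))   # approximately 1 per column
print(X_val_toy_scaled)
```

Reusing the training statistics keeps information from the validation split out of the preprocessing, so the validation metrics stay an honest estimate.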

#### Simple Machine Learning Model with XGBRegressor
Now we build a first machine learning model with XGBoost.

In [None]:
# initialize XGBoost Regressor model
model = xgb.XGBRegressor(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    random_state=42,
    objective='reg:squarederror'
)

# train model on scaled train data
model.fit(X_train_scaled, y_Train)

# predict y with validation data
y_pred = model.predict(X_val_scaled)

# measure performance values
mse = mean_squared_error(y_val, y_pred)
r2 = r2_score(y_val, y_pred)
mae = mean_absolute_error(y_val, y_pred)

print(f"Validation MSE: {mse:.4f}")
print(f"Validation R2: {r2:.4f}")
print(f"Validation MAE: {mae:.4f}")
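
For readers who want to see what the three metrics measure, here is a small hand computation with NumPy. The arrays `y_true_toy` and `y_hat_toy` are made-up values, not results from the dataset. RMSE is added as an assumption of what might be useful, since it is in the same unit as the target.

```python
import numpy as np

# Hypothetical true and predicted values, chosen for easy arithmetic.
y_true_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_hat_toy = np.array([1.5, 2.0, 2.5, 5.0])

errors = y_true_toy - y_hat_toy

# Mean squared error: average of the squared errors.
mse_toy = np.mean(errors ** 2)

# Mean absolute error: average of the absolute errors.
mae_toy = np.mean(np.abs(errors))

# Root mean squared error: same unit as the target.
rmse_toy = np.sqrt(mse_toy)

# R2: 1 minus residual sum of squares over total sum of squares.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2_toy = 1.0 - ss_res / ss_tot

print(f"MSE:  {mse_toy:.4f}")
print(f"MAE:  {mae_toy:.4f}")
print(f"RMSE: {rmse_toy:.4f}")
print(f"R2:   {r2_toy:.4f}")
```

An R2 of 0 corresponds to always predicting the mean of the true values, so R2 can be read as the improvement over that trivial baseline.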

As you can see, the model already performs well. If you are happy with it, save it.

In [None]:
# Optional: save model
# model.save_model("01_Models/xgb_regressor_model_with_feed_in.json")

#### Model Insights
In this step we look at which features have the most influence on the predicted value.

In [None]:
importances = model.feature_importances_
feature_names = X_train.columns if hasattr(X_train, 'columns') else [f'feature_{i}' for i in range(X_train.shape[1])]

# sort feature indices by importance, largest first
sorted_idx = np.argsort(importances)[::-1]
top_n = 15  # number of top features to plot

plt.figure(figsize=(10, 6))
bars = plt.bar(range(top_n), importances[sorted_idx][:top_n], color='skyblue')
plt.xticks(range(top_n), [feature_names[i] for i in sorted_idx][:top_n], rotation=45, ha='right')
plt.title('Top Feature Importances')
plt.ylabel('Importance')
plt.tight_layout()

# show the values above the bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, height, f'{height:.3f}', ha='center', va='bottom')

plt.show()

As you can see, the feed-in feature has the biggest impact on the predicted value. The global radiation and the panel peak power also have a notable influence on the model.
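
The ranking behind the bar chart can be sketched on its own. The feature names and importance values below are hypothetical and only illustrate the mechanics: `np.argsort` returns ascending indices, so they are reversed to get the most important features first. Capping `top_n` at the number of features is an added safeguard, not something the notebook needs with 191 columns.

```python
import numpy as np

# Hypothetical names and importances, not values from the trained model.
feature_names_toy = ["hour", "feed_in", "panel_peak_power", "global_radiation"]
importances_toy = np.array([0.10, 0.55, 0.15, 0.20])

# argsort is ascending, so reverse it for a descending ranking.
sorted_idx_toy = np.argsort(importances_toy)[::-1]

# Never ask for more features than exist.
top_n_toy = min(3, len(feature_names_toy))

ranked = [
    (feature_names_toy[i], float(importances_toy[i]))
    for i in sorted_idx_toy[:top_n_toy]
]

for name, value in ranked:
    print(f"{name}: {value:.3f}")
```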

#### Model without feed-in feature
As a harder test, we try to forecast the target without knowing the true value of the feed-in energy.

In [None]:
# remove feed-in features ('Überschuss' is German for 'surplus')
ueberschuss_cols = [col for col in X_train.columns if 'Überschuss' in col]
cols_to_remove = ueberschuss_cols + ['feed_in:kWh']
cols_to_keep = [col for col in X_train.columns if col not in cols_to_remove]

X_train_scaled_without_feed_in = X_train_scaled[:, [X_train.columns.get_loc(col) for col in cols_to_keep]]
X_val_scaled_without_feed_in = X_val_scaled[:, [X_val.columns.get_loc(col) for col in cols_to_keep]]
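
The cell above selects columns from the arrays that were already scaled instead of fitting a new scaler. A small NumPy sketch suggests why this is safe: standardization works column by column, so dropping columns after scaling gives the same result as scaling after dropping them. The column names (apart from `feed_in:kWh`) and all values are hypothetical.

```python
import numpy as np

# Hypothetical column names and data; only 'feed_in:kWh' appears in the notebook.
columns_toy = ["feed_in:kWh", "global_radiation", "Überschuss_lag1", "panel_peak_power"]
X_toy = np.array([
    [1.0, 5.0, 2.0, 9.0],
    [3.0, 6.0, 4.0, 7.0],
    [5.0, 10.0, 9.0, 8.0],
])


def standardize(X):
    """Column-wise standardization with population standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


# Same selection logic as in the notebook, expressed with positions.
cols_to_remove_toy = [c for c in columns_toy if "Überschuss" in c] + ["feed_in:kWh"]
keep_idx = [i for i, c in enumerate(columns_toy) if c not in cols_to_remove_toy]

scaled_then_selected = standardize(X_toy)[:, keep_idx]
selected_then_scaled = standardize(X_toy[:, keep_idx])

print(keep_idx)
print(np.allclose(scaled_then_selected, selected_then_scaled))
```

This equivalence holds for per-column scalers; a transformation that mixes columns, such as PCA, would need to be refitted on the reduced feature set.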

In [None]:
# initialize XGBoost Regressor model
model = xgb.XGBRegressor(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    random_state=42,
    objective='reg:squarederror'
)

# train model on scaled train data
model.fit(X_train_scaled_without_feed_in, y_Train)

# predict y with validation data
y_pred_wo_feed_in = model.predict(X_val_scaled_without_feed_in)

# measure performance values
mse = mean_squared_error(y_val, y_pred_wo_feed_in)
r2 = r2_score(y_val, y_pred_wo_feed_in)
mae = mean_absolute_error(y_val, y_pred_wo_feed_in)  # use this model's predictions, not y_pred

print(f"Validation MSE: {mse:.4f}")
print(f"Validation R2: {r2:.4f}")
print(f"Validation MAE: {mae:.4f}")

In [None]:
# Optional: save model
# model.save_model("01_Models/xgb_regressor_model_without_feed_in.json")

In [None]:
importances = model.feature_importances_
feature_names = cols_to_keep

# sort feature indices by importance, largest first
sorted_idx = np.argsort(importances)[::-1]
top_n = 15  # number of top features to plot

plt.figure(figsize=(10, 6))
bars = plt.bar(range(top_n), importances[sorted_idx][:top_n], color='skyblue')
plt.xticks(range(top_n), [feature_names[i] for i in sorted_idx][:top_n], rotation=45, ha='right')
plt.title('Top Feature Importances')
plt.ylabel('Importance')
plt.tight_layout()

# show the values above the bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, height, f'{height:.3f}', ha='center', va='bottom')

plt.show()

As you can see, the model performs worse because very important features are now missing. The main challenge is to improve this model.

In [None]:
# scatter plots of true vs. predicted values, one panel per model
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# model without feed-in feature
axes[0].scatter(y_val, y_pred_wo_feed_in, alpha=0.1, color='blue')
axes[0].plot([0, 12], [0, 12], linestyle='--', color='red', label='Perfect model')
axes[0].set_title('Without feed-in measurement')
axes[0].set_xlabel('True values')
axes[0].set_ylabel('Predicted values')
axes[0].legend()
axes[0].grid()

# model with feed-in feature
axes[1].scatter(y_val, y_pred, alpha=0.1, color='green')
axes[1].plot([0, 12], [0, 12], linestyle='--', color='red', label='Perfect model')
axes[1].set_title('With feed-in measurement')
axes[1].set_xlabel('True values')
axes[1].set_ylabel('Predicted values')
axes[1].legend()
axes[1].grid()

plt.tight_layout()
plt.show()
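
The reference diagonal in the plots above is drawn over the fixed range 0 to 12. If the target range differs, one option is to derive the limits from the data. The helper `diagonal_limits` below is a hypothetical name introduced for this sketch, and the sample values are made up.

```python
import numpy as np


def diagonal_limits(y_true, *predictions):
    """Return (lowest, highest) value across the true values and all predictions."""
    arrays = [np.ravel(y_true)] + [np.ravel(p) for p in predictions]
    values = np.concatenate(arrays)
    return float(values.min()), float(values.max())


# Hypothetical true values and predictions from two models.
lo, hi = diagonal_limits(
    [0.0, 3.5, 11.2],
    [0.4, 3.0, 12.6],
    [-0.1, 3.6, 10.9],
)
print(lo, hi)
```

In the notebook this could be used as `lo, hi = diagonal_limits(y_val, y_pred, y_pred_wo_feed_in)` followed by `axes[0].plot([lo, hi], [lo, hi], ...)`, so that both panels share the same diagonal.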