# Project Description

This project focuses on predicting electricity prices in Switzerland using machine learning models. The dataset includes price information from various countries and features related to seasonality. The key challenges addressed in this project include handling missing data and optimizing model performance in the face of low predictive power.

The machine learning pipeline involves data preprocessing, kernel selection, and model tuning. Data preprocessing techniques such as KNN Imputation, One-Hot Encoding, and Standardization are employed to ensure the model receives quality input data. Kernelized regression models, particularly Gaussian processes, are used to capture the complex relationships within the data. The model selection process includes cross-validation and hyperparameter tuning to find the optimal kernel and parameters for accurate predictions.

---

# Code Description

In this notebook, the `scikit-learn` library is used to preprocess the provided data and find an optimal kernel for modeling its characteristics using machine learning. The workflow includes the following steps:

1. **Data Preprocessing**:
   - **KNN Imputation**: Missing values are imputed using K-Nearest Neighbors.
   - **One-Hot Encoding**: Categorical variables are converted to a numeric format.
   - **Standardization**: The features are standardized to have a mean of zero and a standard deviation of one.
   
   This preprocessing pipeline was determined through exploratory data analysis (EDA) and identified as the optimal approach.

2. **Kernel Selection**:
   - The model employs kernelized regression techniques.
   - **Cross-validation with 10 folds** is used to select the optimal kernel. The **Matern kernel** shows the smallest RMSE.

3. **Model Tuning & Final Prediction**:
   - The selected Matern kernel is fine-tuned using **Randomized Search** to optimize model parameters.
   - The tuned model is then used to make the final predictions.

With this structured approach, the final model for predicting electricity prices in Switzerland reached an accuracy of 97.1% on the unseen, public dataset.


# Load Requirements

In [14]:
# data handling
import numpy as np
import pandas as pd
from scipy import stats
# timestamp for the results file name
from datetime import datetime

# plotting
import seaborn as sns
from matplotlib import pyplot as plt

# preprocessing
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector, make_column_transformer

# training
from sklearn.metrics import mean_squared_error
# from sklearn.linear_model import LinearRegression
# from sklearn.tree import DecisionTreeRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct, RBF, Matern, RationalQuadratic
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from scipy.stats import randint, uniform

# model evaluation
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif, VarianceThreshold


# set exploration options for print
pd.set_option('display.max_rows', None)  # Show all rows
pd.set_option('display.max_columns', None)  # Show all columns

# set display to be interactive for pipelines
import sklearn
sklearn.set_config(display="diagram")

# Load Data

In [4]:
"""
This loads the training and test data. Rows with a NaN target are dropped and
the remaining missing values are imputed in the cells below.

Compute
----------
train_df: DataFrame, training data with features and the target price_CHF
test_df: DataFrame: dim = (100, ?), test data with features
"""
# Load training data
train_df = pd.read_csv("train.csv")
    
print("Training data:")
print("Shape:", train_df.shape)
print(train_df.head(2))
print('\n')

# Load test data
test_df = pd.read_csv("test.csv")

print("Test data:")
print(test_df.shape)
print(test_df.head(2))

Training data:
Shape: (900, 11)
   season  price_AUS  price_CHF  price_CZE  price_GER  price_ESP  price_FRA  \
0  spring        NaN   9.644028  -1.686248  -1.748076  -3.666005        NaN   
1  summer        NaN   7.246061  -2.132377  -2.054363  -3.295697  -4.104759   

   price_UK  price_ITA  price_POL  price_SVK  
0 -1.822720  -3.931031        NaN  -3.238197  
1 -1.826021        NaN        NaN  -3.212894  


Test data:
(100, 10)
   season  price_AUS  price_CZE  price_GER  price_ESP  price_FRA  price_UK  \
0  spring        NaN   0.472985   0.707957        NaN  -1.136441 -0.596703   
1  summer  -1.184837   0.358019        NaN  -3.199028  -1.069695       NaN   

   price_ITA  price_POL  price_SVK  
0        NaN   3.298693   1.921886  
1  -1.420091   3.238307        NaN  
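
Both previews already show several `NaN` entries, so it helps to quantify how much is missing before choosing an imputation strategy. The following is a minimal sketch on a small hypothetical frame (`toy_df` and its values are made up for illustration and are not the project data); the same two calls would apply to `train_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the training frame (values are illustrative only)
toy_df = pd.DataFrame({
    "season": ["spring", "summer", "autumn", "winter"],
    "price_AUS": [np.nan, np.nan, -1.2, 0.4],
    "price_CHF": [9.6, 7.2, np.nan, 8.1],
    "price_GER": [-1.7, -2.1, 0.7, np.nan],
})

# Share of missing values per column, largest first
missing_share = toy_df.isna().mean().sort_values(ascending=False)
print(missing_share)

# Number of rows that would be dropped because the target is missing
n_missing_target = toy_df["price_CHF"].isna().sum()
print("rows with missing target:", n_missing_target)
```

A column-wise share like this indicates whether imputation is plausible for a feature or whether too little of it is observed.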


# Split the training set
We decided to drop rows with NAs in the target.

In [5]:
## For now: drop rows with NAs in the ground truth
def ground_truth_NA_handling(full_df, y_col="price_CHF"):
    """Drop rows whose target is NaN and split into features X and target y."""
    yNA_mask = full_df[y_col].isna()
    X = full_df[~yNA_mask].drop([y_col], axis=1)
    y = full_df[~yNA_mask].loc[:, y_col]
    return X, y

X_train, y_train = ground_truth_NA_handling(train_df)
X_test = test_df.copy()

# given assert from task
assert (X_train.shape[1] == X_test.shape[1]) and (X_train.shape[0] == y_train.shape[0]) and (X_test.shape[0] == 100), "Invalid data shape"

# Cleaned preprocessing
The final pipeline below is distilled from the preprocessing variants we explored.

In [6]:
# first we define the pipelines that we found during exploration
num_pipeline = make_pipeline(KNNImputer(n_neighbors=5),
                              StandardScaler())
cat_pipeline = make_pipeline(OneHotEncoder())

# second, we merge them into a column selection transformer
preprocessing = ColumnTransformer(transformers=[("num", num_pipeline, make_column_selector(dtype_include=np.number)),
                                                ("cat", cat_pipeline, make_column_selector(dtype_include=object))], 
                                  remainder="passthrough")

X_train_prepared =  preprocessing.fit_transform(X_train)

X_train_prepared_df = pd.DataFrame(X_train_prepared, 
                                   columns = preprocessing.get_feature_names_out())
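
To make the KNN imputation step concrete, here is a minimal sketch on a hypothetical 4x3 array (not the project data, and with `n_neighbors=2` instead of the 5 used above). With the default settings, each missing entry is replaced by the mean of that feature over the nearest rows that have it observed, with distances computed only on the features both rows share.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy matrix with two missing entries
X_toy = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X_toy)
print(X_imputed)
# [[1.  2.  4. ]
#  [3.  4.  3. ]
#  [5.5 6.  5. ]
#  [8.  8.  7. ]]
```

The missing value in row 0 becomes 4.0, the mean of 3.0 and 5.0 from its two nearest rows; the one in row 2 becomes 5.5, the mean of 3.0 and 8.0.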

# Kernel Selection
## Quick and dirty

In [7]:
# First we define possible regressor pipelines
# lin_reg = make_pipeline(preprocessing, LinearRegression())
# tree_reg = make_pipeline(preprocessing, DecisionTreeRegressor(random_state=42))
kernel_names = ["DotProduct", "RBF", "Matern", "RationalQuadratic"]
kernels = [DotProduct(), RBF(), Matern(length_scale_bounds=(1e-10, 1e10)), RationalQuadratic(length_scale_bounds=(1e-10, 1e10))]
model_pipelines = [Pipeline([("preprocessing", preprocessing), 
                             (f"GPR_{kernel_names[i]}", GaussianProcessRegressor(kernel=kernels[i]))
                             ]) for i in range(len(kernels))]
                            

dirty_mses = np.zeros((len(kernels), 3), dtype=object)
# Second, quick and dirty goodness of fit on the training data itself
for i, model in enumerate(model_pipelines):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_train)
    # mean_squared_error returns the MSE; take np.sqrt of it to get an RMSE
    mse = mean_squared_error(y_train, y_pred)
    model_name = model.steps[-1][0]
    kernel_name = model.steps[-1][1].kernel_
    # print("Model: {} Kernel: {} MSE: {:10.4e}".format(model_name, kernel_name, mse))
    dirty_mses[i, :] = (model_name, kernel_names[i], mse)
# linear regression (dot product) performs the worst on first look

# for higher-precision pd.DataFrame display
pd.options.display.float_format = '{:.2e}'.format

pd.DataFrame(dirty_mses,
             columns = ["Model", "Kernel", "MSE"])
# pd.options.display.float_format = None

|   | Model | Kernel | MSE |
|---|-------|--------|-----|
| 0 | GPR_DotProduct | DotProduct | 0.8 |
| 1 | GPR_RBF | RBF | 3.17e-19 |
| 2 | GPR_Matern | Matern | 4.97e-19 |
| 3 | GPR_RationalQuadratic | RationalQuadratic | 6e-20 |
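
The near-zero training errors of the stationary kernels are expected rather than impressive: with the default tiny noise term (`alpha=1e-10`), a Gaussian process essentially interpolates its training points. The sketch below illustrates this on synthetic data, under the assumption of a fixed Matern kernel (`optimizer=None`); all names and data are hypothetical and unrelated to the project dataset.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.metrics import mean_squared_error

# Synthetic noisy sine curve (illustrative only)
rng = np.random.default_rng(0)
x_all = np.linspace(0, 10, 60).reshape(-1, 1)
y_all = np.sin(x_all).ravel() + rng.normal(scale=0.3, size=60)

# Every second point is used for fitting, the rest is held out
X_fit, y_fit = x_all[::2], y_all[::2]
X_held, y_held = x_all[1::2], y_all[1::2]

# Fixed kernel, default alpha: the GP passes (almost) exactly through the fit points
gpr = GaussianProcessRegressor(kernel=Matern(length_scale=1.0, nu=1.5), optimizer=None)
gpr.fit(X_fit, y_fit)

train_rmse = np.sqrt(mean_squared_error(y_fit, gpr.predict(X_fit)))
held_rmse = np.sqrt(mean_squared_error(y_held, gpr.predict(X_held)))
print(f"train RMSE: {train_rmse:.2e}, held-out RMSE: {held_rmse:.2e}")
```

The training error says little about generalization here, which is why the kernels are compared with cross-validation in the next section.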


## Cross Validation

In [8]:
cross_val_list = []
# perform the cross validations
for model in model_pipelines:
    cv_rmses = pd.Series(-cross_val_score(model, X_train, y_train, scoring="neg_root_mean_squared_error", cv=10)).describe()
    cross_val_list.append(cv_rmses)



In [9]:
pd.DataFrame(cross_val_list, index = kernel_names)
# Matern and RationalQuadratic seem to perform better

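|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|-------|------|-----|-----|-----|-----|-----|-----|
| DotProduct | 10.0 | 1.05 | 0.377 | 0.741 | 0.847 | 0.946 | 1.08 | 2.05 |
| RBF | 10.0 | 5.41 | 1.62 | 3.33 | 4.09 | 5.25 | 6.59 | 8.17 |
| Matern | 10.0 | 0.691 | 0.383 | 0.412 | 0.481 | 0.521 | 0.621 | 1.5 |
| RationalQuadratic | 10.0 | 0.705 | 0.483 | 0.281 | 0.461 | 0.505 | 0.644 | 1.82 |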
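
As a small sketch of how the summary can be read programmatically, the snippet below ranks the kernels by mean cross-validated RMSE. The `mean` and `std` values are copied from the table above; the frame `cv_summary` is a hypothetical stand-in for the notebook's result.

```python
import pandas as pd

# Mean and std of the 10-fold RMSE per kernel, copied from the table above
cv_summary = pd.DataFrame(
    {"mean": [1.05, 5.41, 0.691, 0.705], "std": [0.377, 1.62, 0.383, 0.483]},
    index=["DotProduct", "RBF", "Matern", "RationalQuadratic"],
)

ranked = cv_summary.sort_values("mean")
best_kernel = ranked.index[0]
# Gap between the best and the second-best mean RMSE
gap = ranked["mean"].iloc[1] - ranked["mean"].iloc[0]

print(ranked)
print(f"best kernel: {best_kernel}, gap to runner-up: {gap:.3f}")
```

Matern has the lowest mean RMSE, but its lead over RationalQuadratic (about 0.014) is far smaller than the fold-to-fold standard deviation, so the two kernels are close on this evidence.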


# Hyperparameter tuning

In [11]:
gpr_matern = model_pipelines[2]
gpr_rquadratic = model_pipelines[3]

# First we need to get the access string to the hyperparams for the different pipelines
for key in gpr_matern.get_params().keys():
    print(key)

memory
steps
verbose
preprocessing
GPR_Matern
preprocessing__n_jobs
preprocessing__remainder
preprocessing__sparse_threshold
preprocessing__transformer_weights
preprocessing__transformers
preprocessing__verbose
preprocessing__verbose_feature_names_out
preprocessing__num
preprocessing__cat
preprocessing__num__memory
preprocessing__num__steps
preprocessing__num__verbose
preprocessing__num__knnimputer
preprocessing__num__standardscaler
preprocessing__num__knnimputer__add_indicator
preprocessing__num__knnimputer__copy
preprocessing__num__knnimputer__keep_empty_features
preprocessing__num__knnimputer__metric
preprocessing__num__knnimputer__missing_values
preprocessing__num__knnimputer__n_neighbors
preprocessing__num__knnimputer__weights
preprocessing__num__standardscaler__copy
preprocessing__num__standardscaler__with_mean
preprocessing__num__standardscaler__with_std
preprocessing__cat__memory
preprocessing__cat__steps
preprocessing__cat__verbose
preprocessing__cat__onehotencoder
preprocessi

In [12]:
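# Second, we choose the hyperparameters we want to perform the randomized search on.
# Note: the sampled length_scale is only a starting value, because GaussianProcessRegressor
# re-optimizes it during fit with its default optimizer; nu stays fixed at the sampled value.

# Successive-halving search classes (e.g. HalvingRandomSearchCV) could also be considered here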

param_distribs = {"preprocessing__num__knnimputer__n_neighbors": randint(2, 20), 
                  "preprocessing__num__knnimputer__weights": ["uniform", "distance"], 
                  "GPR_Matern__kernel__length_scale": uniform(0.1, 3), 
                  "GPR_Matern__kernel__nu": [0.5, 1.5, 2.5, np.inf]} 

rnd_search_matern =  RandomizedSearchCV(gpr_matern, 
                                        param_distributions=param_distribs,
                                        n_iter=10,
                                        cv=3, 
                                        scoring='neg_root_mean_squared_error', 
                                        random_state=42)

rnd_search_matern.fit(X_train, y_train)
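
The `scipy.stats` distributions used in `param_distribs` are parameterized in a way that is easy to misread, so here is a quick sanity check of the ranges they sample from. This is a standalone sketch; the variable names are hypothetical.

```python
from scipy.stats import randint, uniform

# uniform(loc, scale) samples from [loc, loc + scale], i.e. [0.1, 3.1] here
length_scale_dist = uniform(0.1, 3)
# randint(low, high) samples integers from low to high - 1, i.e. 2..19 here
n_neighbors_dist = randint(2, 20)

length_scales = length_scale_dist.rvs(size=1000, random_state=0)
n_neighbors = n_neighbors_dist.rvs(size=1000, random_state=0)

print(length_scales.min(), length_scales.max())
print(n_neighbors.min(), n_neighbors.max())
```

So the search above draws the initial length scale from [0.1, 3.1], not [0.1, 3], and the number of neighbors from 2 to 19.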



In [16]:
# extract the matern model with the best params
final_gpr_matern = rnd_search_matern.best_estimator_
# best_score_ is the negated RMSE averaged over the 3 CV folds, so closer to zero is better
print(rnd_search_matern.best_score_)

-0.8237226828719563
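
Beyond `best_score_`, the fitted search object also stores the scores of all sampled candidates. The sketch below shows one way to inspect them, using a small synthetic problem so that it runs on its own; `toy_search`, the step names `scale` and `gpr`, the `alpha` value, and the data are all hypothetical and do not reproduce the notebook's results.

```python
import numpy as np
import pandas as pd
from scipy.stats import uniform
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = X_toy[:, 0] - 0.5 * X_toy[:, 1] + rng.normal(scale=0.1, size=60)

toy_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("gpr", GaussianProcessRegressor(kernel=Matern(), alpha=1e-2)),
])

toy_search = RandomizedSearchCV(
    toy_pipeline,
    param_distributions={
        "gpr__kernel__length_scale": uniform(0.5, 2.0),
        "gpr__kernel__nu": [0.5, 1.5, 2.5],
    },
    n_iter=5,
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=42,
)
toy_search.fit(X_toy, y_toy)

# The scorer is negated, so flip the sign to get back an RMSE
best_rmse = -toy_search.best_score_

# One row per sampled candidate, best candidate first
results = (
    pd.DataFrame(toy_search.cv_results_)
    [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
print(results)
print("best RMSE:", best_rmse, "with", toy_search.best_params_)
```

Applied to `rnd_search_matern`, the same pattern would show how far apart the ten sampled candidates are, which helps judge whether a larger `n_iter` is worthwhile.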


In [18]:
# Predict and store the result
"""
This predicts on the test data with the tuned model (already refit on the
full training data by the randomized search) and stores the predictions

Parameters
----------
X_test: matrix of floats: dim = (100, ?), test input with 10 features

Compute
----------
y_pred: array of floats: dim = (100,), predictions on test set
"""

# predict with final model
y_pred=final_gpr_matern.predict(X_test)

assert y_pred.shape == (100,), "Invalid data shape"

dt = pd.DataFrame(y_pred) 
dt.columns = ['price_CHF']
dt.to_csv('results/results_{time}.csv'.format(time = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")), 
          index=False)
print("\nResults file successfully generated!")


Results file successfully generated!
