# Data Challenge

This notebook details the construction of the analysis for the project.

## Libraries

In [1]:
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import make_scorer
import matplotlib.pyplot as plt

## Loading data

- `X_train` and `X_test` both have $35$ columns that represent the same explanatory variables but over different time periods. 

- `X_train` and `Y_train` share the same column `ID`: each row corresponds to a unique ID associated with a day and a country.

- The target of this challenge `TARGET` in `Y_train` corresponds to the price change for daily futures contracts of 24H electricity baseload. 

- **You will notice some columns have missing values**.

In [2]:
# Load the data as pandas DataFrames.

X = pd.read_csv('X_train.csv')
Y = pd.read_csv('Y_train.csv')
# Hold out 10% of the rows as a test set; random_state=42 makes the split reproducible.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=42)


In [3]:
X_train.head()

Unnamed: 0,ID,DAY_ID,COUNTRY,DE_CONSUMPTION,FR_CONSUMPTION,DE_FR_EXCHANGE,FR_DE_EXCHANGE,DE_NET_EXPORT,FR_NET_EXPORT,DE_NET_IMPORT,...,FR_RESIDUAL_LOAD,DE_RAIN,FR_RAIN,DE_WIND,FR_WIND,DE_TEMP,FR_TEMP,GAS_RET,COAL_RET,CARBON_RET
1009,1779,97,FR,-0.028612,-0.805939,-1.130925,1.130925,-1.737154,1.30368,1.737154,...,-0.746805,-1.065734,0.269952,0.179885,0.453358,-0.073011,-1.686025,1.288241,-0.033589,0.141656
324,1869,672,FR,0.764817,-0.108776,0.584286,-0.584286,0.153608,0.612692,-0.153608,...,-0.631747,-0.632019,0.148246,0.506879,1.584285,0.212696,0.659325,-2.959449,0.093542,-2.713832
1352,1857,1132,FR,0.820839,-0.128189,0.417064,-0.417064,-0.360641,-0.005078,0.360641,...,-0.116523,0.248154,-0.653387,0.5062,0.358748,-1.186698,-1.021822,-0.798838,0.102641,1.757913
1074,618,180,DE,0.031579,-0.693302,-0.577114,0.577114,-0.46734,1.175802,0.46734,...,-0.619215,2.38363,0.798208,0.071497,0.693684,1.100208,0.71357,0.338003,-1.207689,0.644785
1247,1528,503,FR,-0.50127,-0.796449,-0.480015,0.480015,-0.792889,-0.723719,0.792889,...,-0.885554,0.574685,-0.486974,-0.409457,-0.262854,-0.326737,-1.061768,-0.645343,0.464638,0.263928


In [4]:
Y_train.head()

Unnamed: 0,ID,TARGET
1009,1779,-0.011401
324,1869,-0.047611
1352,1857,0.376436
1074,618,0.311772
1247,1528,-0.046973


## Model and train score

The simplest model for this challenge is a plain linear regression applied after a light cleaning of the data: missing (NaN) feature values are filled with the column means, and the `ID`, `DAY_ID`, `COUNTRY` and `DE_FR_EXCHANGE` columns are dropped. Dropping `COUNTRY` means that the same model is used for France and Germany. In the rows shown above, `DE_FR_EXCHANGE` is the exact negative of `FR_DE_EXCHANGE`, so it carries no extra information.

In this project, our goal is to compare several challenger models against a champion model, `LinearRegression`. We do not split the data between FR and DE; instead we keep a single dataset, to which we apply all four models (one champion and three challengers).
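The cleaning cell in the next section drops the non-feature columns and fills gaps with column means. A minimal sketch of that step on a toy frame follows; the values are made up, and `toy`, `features` and `toy_clean` are hypothetical names used only for this illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real features (values are made up).
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "COUNTRY": ["FR", "DE", "FR", "DE"],
    "DE_WIND": [0.5, np.nan, -0.5, 1.0],
    "FR_WIND": [np.nan, 0.2, 0.4, 0.6],
})

# Drop the non-feature columns first, so that only numeric columns remain.
features = toy.drop(columns=["ID", "COUNTRY"])

# Fill each missing value with the mean of its own column.
col_means = features.mean()
toy_clean = features.fillna(col_means)

print(toy_clean)
```

Dropping the string column before computing the means avoids asking pandas to average non-numeric data.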

## Processing data

In this cell we preprocess the data to obtain a clean dataset. To measure the performance of our models, we use the Spearman correlation between the predictions and the true `TARGET` values.

In [5]:
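# Columns excluded from the features: identifiers, the country label and DE_FR_EXCHANGE.
drop_cols = ['ID', 'DAY_ID', 'COUNTRY', 'DE_FR_EXCHANGE']

# Fill missing feature values with the column means
# (numeric_only=True skips the string column COUNTRY when computing the means).
X_train_clean = X_train.drop(drop_cols, axis=1).fillna(X_train.mean(numeric_only=True))
Y_train_clean = Y_train['TARGET'].fillna(0)
# The test set is imputed with its own column means.
X_test_clean = X_test.drop(drop_cols, axis=1).fillna(X_test.mean(numeric_only=True))
Y_test_clean = Y_test['TARGET']

def metric_train(output, Y_train):
    """Spearman rank correlation between two series, in percent."""
    return 100 * spearmanr(output, Y_train).correlation

# Wrap the metric in a scorer object so that it can be used for cross-validation.
spearman_scorer = make_scorer(metric_train, greater_is_better=True)

# Configure 5-fold cross-validation; random_state makes the shuffling reproducible.
kf = KFold(n_splits=5, shuffle=True, random_state=42)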


NB: Electricity price variations can be quite volatile and this is why we have chosen the Spearman rank correlation as a robust metric for our analysis, instead of the more standard Pearson correlation.
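To illustrate the robustness argument above, here is a small sketch on made-up numbers: a perfectly ordered series in which a single value is replaced by an extreme spike. The arrays `x` and `y` are hypothetical and are not taken from the challenge data.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(1, 11, dtype=float)
y = x.copy()
y[-1] = 1000.0  # one extreme spike; the ordering of the values is unchanged

# Pearson works on the raw values, Spearman only on their ranks.
pearson, _ = pearsonr(x, y)
spearman, _ = spearmanr(x, y)

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

Because the spike does not change the ranking, the Spearman correlation stays at 1, while the Pearson correlation drops well below 1.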

## Training and cross-validation of the champion model

In [6]:
lr = LinearRegression()

# Cross-validate with cross_val_score, using the Spearman correlation as the score.
scores = cross_val_score(
    lr, 
    X_train_clean, 
    Y_train_clean, 
    cv=kf, 
    scoring=spearman_scorer
)


# Results
print("Spearman Correlation Scores:", scores)
print("Mean Spearman Correlation:", np.mean(scores))

# Refit on the whole training split and evaluate on the held-out test split.
lr.fit(X_train_clean, Y_train_clean)
output_test = lr.predict(X_test_clean)
print('Spearman correlation for the test set: {:.1f}%'.format(metric_train(output_test, Y_test_clean)))

Spearman Correlation Scores: [13.96410934 13.12014392 22.01708565 27.97379943 21.79806445]
Mean Spearman Correlation: 19.774640558565164
Spearman correlation for the test set: 22.4%


This first approach suggests that the relationship between the features and `TARGET` is not well captured by a linear model: the Spearman correlation stays at around 20%.
The challenger models are more flexible and expose hyperparameters that can be tuned.

## Challenger models

In this part we try models that may suit the problem better, chosen with the help of the scikit-learn machine learning map.
We deepen the analysis with a kernel ridge regression, a Lasso model and a random forest regression.
To tune each model we use a grid search with cross-validation (`GridSearchCV`), which helps us find the best hyperparameters.
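As a minimal sketch of a grid search scored with the Spearman correlation, the example below runs on synthetic data. `Ridge` is used here only as a simple stand-in estimator, and `spearman_pct`, `X_demo` and `y_demo` are hypothetical names that do not appear elsewhere in this notebook.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV, KFold

def spearman_pct(y_true, y_pred):
    """Spearman rank correlation in percent."""
    rho, _ = spearmanr(y_true, y_pred)
    return 100 * rho

scorer = make_scorer(spearman_pct, greater_is_better=True)

# Synthetic data: the target depends on the first two features plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 5))
y_demo = X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(scale=0.5, size=120)

# Try each candidate alpha with 5-fold cross-validation and keep the best one.
search = GridSearchCV(
    Ridge(),
    {"alpha": [0.1, 1.0, 10.0]},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring=scorer,
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 1))
```

After fitting, `search.best_estimator_` holds the model refitted on all of the data passed to `fit`, using the best parameter combination.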


### Kernel Ridge Regression (kernel = 'rbf')

The 'rbf' (radial basis function) kernel is the Gaussian kernel.
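For reference, the Gaussian kernel with parameter `gamma` is $k(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$. The sketch below checks this formula against scikit-learn's `rbf_kernel` on a few random points; the matrix `A` and the value of `gamma` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))
gamma = 0.005

# Squared Euclidean distance between every pair of rows of A.
sq_dists = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=-1)

# Gaussian kernel computed by hand, then with scikit-learn.
K_manual = np.exp(-gamma * sq_dists)
K_sklearn = rbf_kernel(A, A, gamma=gamma)

print(np.allclose(K_manual, K_sklearn))
```

A smaller `gamma` makes the kernel decay more slowly with distance, so each prediction is influenced by more distant training points.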

In [7]:
from sklearn.model_selection import GridSearchCV

# Hyperparameter grid: after many trials we converged on these three candidate values each for alpha and gamma.
param_grid = {
    'alpha': [1, 5, 10],
    'gamma': [0.005,0.01,0.05],
    'kernel': ['rbf']
}



grid_search = GridSearchCV(KernelRidge(), param_grid, cv=5, scoring=spearman_scorer)
grid_search.fit(X_train_clean, Y_train_clean)

print("Best parameters :", grid_search.best_params_)

# Use the best parameters
best_krr = grid_search.best_estimator_
y_pred_best = best_krr.predict(X_test_clean)

scores = cross_val_score(
    best_krr, 
    X_train_clean, 
    Y_train_clean, 
    cv=kf, 
    scoring=spearman_scorer
)


# Results
print("Spearman Correlation Scores:", scores)
print("Mean Spearman Correlation:", np.mean(scores))

print('Spearman correlation for the test set: {:.1f}%'.format(metric_train(y_pred_best, Y_test_clean) ))

Best parameters : {'alpha': 5, 'gamma': 0.005, 'kernel': 'rbf'}
Spearman Correlation Scores: [13.99357363 15.10835043 25.30459278 26.72217616 25.02509286]
Mean Spearman Correlation: 21.230757171744127
Spearman correlation for the test set: 20.2%


This model gives a mixed result. Its mean cross-validated score on the training data (21.2%) is higher than that of the linear regression (19.8%), but its score on the test set (20.2%) is lower than both its own cross-validated score and the test score of the linear regression (22.4%). This may indicate some overfitting, and we conclude that this challenger model does not improve our analysis.

### Lasso Regression

This model is a type of linear regression that includes a regularization term. This regularization term helps to prevent overfitting by shrinking the coefficients of less important variables to zero, effectively performing variable selection. This makes the model simpler and more interpretable. 
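The variable-selection effect can be seen on a small synthetic example in which only the first two of ten features influence the target. The data, the value `alpha=0.5` and the names `X_demo`, `y_demo` and `lasso_demo` are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: only features 0 and 1 drive the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)

# Count the coefficients that are exactly zero in each model.
n_zero_ols = int(np.sum(ols.coef_ == 0))
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))

print("zero coefficients, OLS:", n_zero_ols, "Lasso:", n_zero_lasso)
print("Lasso coefficients:", np.round(lasso_demo.coef_, 2))
```

Ordinary least squares assigns a small non-zero weight to every feature, whereas the L1 penalty sets the weights of the uninformative features exactly to zero and shrinks the remaining ones towards zero.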

In [39]:
# Candidate values for alpha on a regular grid with step 0.001, starting at 0.001.
param_alpha = np.arange(0.001, 0.01, 0.001).tolist()

param_grid = {
    'alpha': param_alpha
}

lasso = Lasso(max_iter=10000)

grid_search = GridSearchCV(lasso, param_grid, cv=5, scoring=spearman_scorer)
grid_search.fit(X_train_clean, Y_train_clean)

print("Best parameters :", grid_search.best_params_)

# We use the best estimator found by the grid search
best_lasso = grid_search.best_estimator_
y_pred_best = best_lasso.predict(X_test_clean)

scores = cross_val_score(
    best_lasso, 
    X_train_clean, 
    Y_train_clean, 
    cv=kf, 
    scoring=spearman_scorer
)


# Results
print("Spearman Correlation Scores:", scores)
print("Mean Spearman Correlation:", np.mean(scores))

print('Spearman correlation for the test set: {:.1f}%'.format(metric_train(y_pred_best, Y_test_clean) ))


Best parameters : {'alpha': 0.005}
Spearman Correlation Scores: [15.47995442 15.64859071 23.88920776 27.83391595 24.13085946]
Mean Spearman Correlation: 21.396505658812842
Spearman correlation for the test set: 22.7%


In the first part of this cell we identify the best value of alpha for this regression. The results show that Lasso brings only a marginal gain over the linear regression (test score of 22.7% versus 22.4%). Lasso tries to select the most relevant variables; the small optimal alpha (0.005) suggests that in this case most variables carry some information.

### Random Forest Regressor

The Random Forest Regressor is a model that uses an ensemble of decision trees to make predictions. Each tree in the forest is built from a random sample of the training data, and the predictions from all the trees are averaged to produce the final prediction. This method helps to reduce the risk of overfitting and improves the robustness of the model.
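The averaging step can be checked directly: in scikit-learn, the prediction of a fitted `RandomForestRegressor` is the mean of the predictions of its individual trees, which are available in `estimators_`. The sketch below uses synthetic data; `X_demo`, `y_demo` and `forest` are hypothetical names.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic non-linear data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = np.sin(X_demo[:, 0]) + 0.1 * rng.normal(size=150)

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# Predict with each tree separately, then average across trees.
per_tree = np.stack([tree.predict(X_demo) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print(np.allclose(manual_average, forest.predict(X_demo)))
```

Setting `random_state` fixes the bootstrap samples and feature choices, so repeated runs give the same forest.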

In [40]:
from sklearn.ensemble import RandomForestRegressor

model1 = RandomForestRegressor()
model1.fit(X_train_clean, Y_train_clean)

RandomForestRegressor()

In [43]:
# Candidate numbers of trees: [95, 100]
estimators_param = np.arange(95, 105, 5).tolist()


param_grid = {
    'n_estimators': estimators_param
}

model1 = RandomForestRegressor()

grid_search = GridSearchCV(model1, param_grid, cv=5, scoring=spearman_scorer)
grid_search.fit(X_train_clean, Y_train_clean)

print("Best parameters :", grid_search.best_params_)

# We use the best estimator found by the grid search
best_rf = grid_search.best_estimator_
y_pred_best = best_rf.predict(X_test_clean)

scores = cross_val_score(
    best_rf, 
    X_train_clean, 
    Y_train_clean, 
    cv=kf, 
    scoring=spearman_scorer
)

# Results
print("Spearman Correlation Scores:", scores)
print("Mean Spearman Correlation:", np.mean(scores))

print('Spearman correlation for the test set: {:.1f}%'.format(metric_train(y_pred_best, Y_test_clean) ))

Best parameters : {'n_estimators': 95}
Spearman Correlation Scores: [18.22269199  7.80857464 16.80779612 14.05548516 20.912333  ]
Mean Spearman Correlation: 15.561376183994303
Spearman correlation for the test set: 24.9%


This method gives the best test result in our analysis: among the four models it has the highest Spearman correlation on the test set (24.9%). Its cross-validated scores on the training data are the lowest (mean of 15.6%), but the test score does not fall below them, so we see no sign of overfitting.
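To compare the four models side by side, the scores printed in the outputs above can be collected in a small table. The numbers below are copied from those outputs and rounded; the last printed line of each output is taken as the test-set score, as the `print` statements in the code indicate. The name `results` and the column labels are hypothetical.

```python
import pandas as pd

# Scores in percent, copied from the printed outputs of this notebook.
results = pd.DataFrame(
    {
        "cv_mean": [19.77, 21.23, 21.40, 15.56],
        "test": [22.4, 20.2, 22.7, 24.9],
    },
    index=["LinearRegression", "KernelRidge", "Lasso", "RandomForestRegressor"],
)

# A positive gap means the model scored better on the test split than in cross-validation.
results["test_minus_cv"] = results["test"] - results["cv_mean"]

print(results.sort_values("test", ascending=False))
```

The test split contains only 10% of the rows, so differences of a few points between models should be read with some caution.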