# Training Models with the Most Important Features
The aim of this notebook is to evaluate the impact of training models with only the most important features. A feature's importance is given by its mutual information (MI) score, computed in the `feature_engineering.ipynb` notebook. For now, only the 4 highest-scoring features are considered. The dataset is the one created in the `feature_engineering.ipynb` notebook.

The best features for high voltage constraints are:
- $\cos(\text{hour})$: Cosine of the hour of day.
- $R$: Forecasted irradiance, measured in $W/m^2$.
- $\text{last hour mean irradiance}$: Mean irradiance over the last hour.
- $T$: Temperature, in kelvin.

After training the Linear Regression and Gradient Boost Regression models, the evaluation will be performed using the metric implemented in `ml_hybrid_metrics.ipynb`.

**Summary of this notebook**
- Load the exogenous data previously treated and explored.
- Train the Linear Regression and Gradient Boost Regression models with
    - all of the dataset's features;
    - only the most relevant features.
- Evaluate the models with the metric implemented in `ml_hybrid_metrics.ipynb`.

### Data

The dataset is the one created in the `feature_engineering.ipynb` notebook. It contains features that are external to the network, such as temperature and $\cos(\text{hour})$.

In [None]:
# Data
import pandas as pd
# Forward slashes avoid invalid escape sequences such as '\d' and work on every OS
output = pd.read_csv('../data/ground_truth/pf_res_bus_vm_pu.csv')
# Keep the timestamps as a datetime Series, then drop them from the targets
timestamps = pd.to_datetime(output['timestamps'])
output.drop(['timestamps'], axis=1, inplace=True)
# Target: how far each bus voltage falls below 0.95 (0 when there is no violation)
output = output.apply(lambda x: (0.95 - x).apply(lambda y: max(0, y)))
# Training data
_exogenous_data = pd.read_csv('../data/processed/production/exogenous_data_extended.csv')
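
The target transformation in the cell above can be easier to follow on a tiny example. The sketch below uses made-up voltage values (not taken from the thesis dataset) and the vectorised `clip` method, which should give the same result as the nested `apply` for data without missing values.

```python
import pandas as pd

# Toy bus voltages (made-up values, not from the thesis dataset)
voltages = pd.DataFrame({
    'bus_1': [1.00, 0.94, 0.95],
    'bus_2': [0.90, 0.97, 0.93],
})

lower_limit = 0.95  # same limit as in the cell above

# Depth of the violation: positive where the voltage is below the limit, 0 elsewhere
violation = (lower_limit - voltages).clip(lower=0)
print(violation.round(2))
```

Every entry at or above the limit becomes 0, so the regression target is non-zero only where a constraint is actually violated.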

The features are:

In [None]:
_exogenous_data.columns

The `feature_engineering.ipynb` notebook identified the most important features by calculating their *MI scores*. This notebook uses the 4 most important features to train the models.

In [None]:
exogenous_data_most_relevant = _exogenous_data.loc[:, ['T', 'R', 'last_hour_mean_irradiance', 'cos_hour_day']]
exogenous_data_complete = _exogenous_data.drop('date', axis=1)
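
The MI scores themselves are computed in `feature_engineering.ipynb` and are not reproduced here. As a rough illustration of how such a ranking could be obtained, the sketch below uses scikit-learn's `mutual_info_regression` on synthetic data. The column names, the synthetic target and `top_k` are assumptions made for the example, and the other notebook may compute the scores differently.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Synthetic data (hypothetical): one informative feature and two noise features
rng = np.random.default_rng(0)
n = 500
features = pd.DataFrame({
    'informative': rng.normal(size=n),
    'noise_a': rng.normal(size=n),
    'noise_b': rng.normal(size=n),
})
# Non-linear dependence on 'informative' only
target = features['informative'] ** 2 + 0.05 * rng.normal(size=n)

# One MI score per feature, sorted from most to least informative
mi_scores = pd.Series(
    mutual_info_regression(features, target, random_state=0),
    index=features.columns,
).sort_values(ascending=False)

# Keep the top_k highest-scoring features
top_k = 1
selected = mi_scores.index[:top_k].tolist()
print(mi_scores)
print(selected)
```

Unlike a linear correlation, the MI score also picks up the quadratic relationship used here, which is one reason it is a common choice for ranking features.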

### Training Models
In this section both models, Linear Regression and Gradient Boost Regression, are trained on each of the two datasets. The full dataset is split without shuffling into a train set (the first 80% of the rows) and a test set (the last 20%). The train set is then shuffled before training.
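
The split described above can be sketched on a toy dataset. The frames below are hypothetical stand-ins for the real features and targets, and `random_state` is fixed only to make the example repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

# Toy stand-ins for the feature and target frames (hypothetical values)
X = pd.DataFrame({'feature': np.arange(10)})
y = pd.DataFrame({'target': np.arange(10) * 2})

# shuffle=False keeps row order: the first 80% is train, the last 20% is test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

# Shuffle only the training rows, keeping X and y aligned
X_train, y_train = shuffle(X_train, y_train, random_state=0)

print(list(X_test.index))
print(list(X_train.index))
```

Splitting before shuffling keeps the test set as one contiguous block at the end of the series, so no test row sits between two training rows in time.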

#### All Features Dataset

In [None]:
import sys
sys.path.append('..')
from thesis_package import aimodels as my_ai, utils
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from sklearn.preprocessing import LabelEncoder
import os
le = LabelEncoder()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(exogenous_data_complete, output, test_size=0.2, shuffle=False)
# Shuffle the training data only
X_train, y_train = shuffle(X_train, y_train)
# Encode 'season' with a single mapping shared by the train and test sets
le.fit(exogenous_data_complete['season'])
X_train['season'] = le.transform(X_train['season'])
X_test['season'] = le.transform(X_test['season'])
# Linear Regression: kept in its own Context so it is not overwritten by the next model
regressor_lr_all_features = my_ai.Context(strategy=my_ai.LinearRegressionStrategy())
regressor_lr_all_features.fit(data={'X_train': X_train, 'y_train': y_train})
prediction_lr_all_features = regressor_lr_all_features.predict(data={'X_test': X_test})
prediction_lr_all_features = pd.DataFrame(prediction_lr_all_features, columns=y_test.columns)
# Gradient Boost Regression: slow to fit, so the fitted model is cached on disk
if 'regressor_all_features.pickle' not in os.listdir():
    hyper_params = {'n_estimators': 1000, 'learning_rate': 0.1, 'loss': 'squared_error'}
    regressor_all_features = my_ai.Context(strategy=my_ai.GradientBoostRegressorStrategy(hyper_params))
    regressor_all_features.fit(data={'X_train': X_train.values, 'y_train': y_train.values})
    utils.serialize_object('regressor_all_features', regressor_all_features)
else:
    regressor_all_features = utils.deserialize_object('regressor_all_features')
# Predict with plain arrays, matching the format used to fit the model
prediction_gb_all_features = regressor_all_features.predict(data={'X_test': X_test.values})
prediction_gb_all_features = pd.DataFrame(prediction_gb_all_features, columns=y_test.columns)
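
A note on the `season` column: integer codes from a `LabelEncoder` are only comparable between the train and test sets when both are transformed with the same fitted encoder. The sketch below uses hypothetical season labels to show how refitting on a subset can silently change the codes.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels: the test period covers only two of the four seasons
train_seasons = ['autumn', 'spring', 'summer', 'winter']
test_seasons = ['autumn', 'winter']

# Fit once, then reuse the same mapping for the test labels
encoder = LabelEncoder().fit(train_seasons)
consistent = encoder.transform(test_seasons)

# Refitting on the test labels alone produces a different mapping
refit = LabelEncoder().fit_transform(test_seasons)

print(list(consistent))  # winter keeps the code it had during training
print(list(refit))       # winter is renumbered
```

With the refitted encoder, `winter` would reach the model under a code that meant `spring` during training.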

In [None]:
regressor_all_features._strategy.__dict__

#### Most Relevant Features Dataset

In [None]:
X_train, X_test, y_train, y_test = train_test_split(exogenous_data_most_relevant, output, test_size=0.2, shuffle=False)
# Shuffle the training data only
X_train, y_train = shuffle(X_train, y_train)
# Linear Regression: kept in its own Context so it is not overwritten by the next model
regressor_lr_most_important_features = my_ai.Context(strategy=my_ai.LinearRegressionStrategy())
regressor_lr_most_important_features.fit(data={'X_train': X_train, 'y_train': y_train})
prediction_lr_most_important_features = regressor_lr_most_important_features.predict(data={'X_test': X_test})
prediction_lr_most_important_features = pd.DataFrame(prediction_lr_most_important_features, columns=y_test.columns)
# Gradient Boost Regression: slow to fit, so the fitted model is cached on disk
if 'regressor_most_important_features.pickle' not in os.listdir():
    hyper_params = {'n_estimators': 1000, 'learning_rate': 0.1, 'loss': 'squared_error'}
    regressor_most_important_features = my_ai.Context(strategy=my_ai.GradientBoostRegressorStrategy(hyper_params))
    regressor_most_important_features.fit(data={'X_train': X_train.values, 'y_train': y_train.values})
    utils.serialize_object('regressor_most_important_features', regressor_most_important_features)
else:
    regressor_most_important_features = utils.deserialize_object('regressor_most_important_features')
# Predict with plain arrays, matching the format used to fit the model
prediction_gb_most_important_features = regressor_most_important_features.predict(data={'X_test': X_test.values})
prediction_gb_most_important_features = pd.DataFrame(prediction_gb_most_important_features, columns=y_test.columns)

### Evaluating Models
In this section the results obtained with the two datasets are compared for both trained models. The TP, TN, FP, FN, Accuracy, Precision, Recall and F1 Score will be plotted and compared.

The evaluation is done using the metric implemented in `ml_hybrid_metrics.ipynb`. The threshold percentage for the metrics is set to 10%, as concluded in the `ml_hybrid_metrics.ipynb` notebook.
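
For reference, Accuracy, Precision, Recall and F1 Score follow directly from the four confusion counts. The helper below is a minimal sketch using the standard definitions. The function name and the example counts are hypothetical, and it is not the implementation found in `ml_hybrid_metrics.ipynb`.

```python
def classification_scores(tp, fp, fn, tn):
    """Standard scores derived from confusion counts (0.0 when undefined)."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# Made-up counts, for illustration only
scores = classification_scores(tp=8, fp=2, fn=4, tn=86)
print(scores)
```

When constraint violations are rare, Accuracy is dominated by the true negatives, so Precision, Recall and F1 Score are usually the more informative figures.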


In [None]:
import numpy as np
from thesis_package import metrics
metric = metrics.Metrics()
# Threshold: 10% of the mean, over buses with at least one violation, of each bus's maximum violation
threshold = output.loc[:, output.max(axis=0) != 0].max(axis=0).mean() * 0.1
# Constant signal with the value of threshold, aligned with the index of y_test
threshold_signal = pd.Series(threshold, index=y_test.index)
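
The `Metrics` class exposes a count and an RMSE for each of TP, FP, FN and TN. How it computes them is defined in `ml_hybrid_metrics.ipynb`, not in this notebook. One plausible reading, offered purely as an assumption, is that a sample counts as positive when its value exceeds the threshold, and that the RMSE is taken over the samples in each category. The function name and toy values below are hypothetical.

```python
import numpy as np

def confusion_with_rmse(y_true, y_pred, threshold):
    """Assumed scheme: threshold both signals, then count and score each category."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    actual = y_true > threshold      # a violation really occurred
    predicted = y_pred > threshold   # a violation was predicted
    masks = {
        'tp': actual & predicted,
        'fp': ~actual & predicted,
        'fn': actual & ~predicted,
        'tn': ~actual & ~predicted,
    }
    report = {}
    for name, mask in masks.items():
        errors = y_pred[mask] - y_true[mask]
        rmse = float(np.sqrt(np.mean(errors ** 2))) if mask.any() else 0.0
        report[name] = {'count': int(mask.sum()), 'rmse': rmse}
    return report

# Made-up violation depths, for illustration only
report = confusion_with_rmse(
    y_true=[0.00, 0.03, 0.04, 0.00],
    y_pred=[0.00, 0.05, 0.00, 0.02],
    threshold=0.01,
)
print(report)
```

Pairing each count with an RMSE shows not only how often the model misses or invents a violation, but also by how much.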

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

#### Linear Regression

In [None]:
# Input: Linear Regression predictions for both feature sets
all_features_prediction = prediction_lr_all_features
best_features_prediction = prediction_lr_most_important_features
count_columns = ['tp', 'fp', 'fn', 'tn']
rmse_columns = ['tp_rmse', 'fp_rmse', 'fn_rmse', 'tn_rmse']
# Collect the counts and RMSE values for each feature set
rows = {}
for name, prediction in [('all_features', all_features_prediction), ('best_features', best_features_prediction)]:
    metric.get_prediction_scores(prediction, y_test, threshold=threshold)
    counts = [metric.true_positives_ctr, metric.false_positives_ctr, metric.false_negatives_ctr, metric.true_negatives_ctr]
    metric.get_report()
    rmses = [metric.true_positives_rmse, metric.false_positives_rmse, metric.false_negatives_rmse, metric.true_negatives_rmse]
    rows[name] = counts + rmses
# Create dataframe with the results (one row per feature set)
results = pd.DataFrame.from_dict(rows, orient='index', columns=count_columns + rmse_columns)
# Long format, so that seaborn can group the bars by feature set
results_long = results.rename_axis('dataset').reset_index().melt(id_vars='dataset', var_name='score', value_name='value')
# Plot the results in bar charts: one subplot with tp, fp, fn, tn and the other with the rmse values
fig, axs = plt.subplots(2, 1, figsize=(10, 10))
sns.barplot(x='score', y='value', hue='dataset', data=results_long[results_long['score'].isin(count_columns)], ax=axs[0])
sns.barplot(x='score', y='value', hue='dataset', data=results_long[results_long['score'].isin(rmse_columns)], ax=axs[1])
plt.show()