# Datasets Benchmark

**Summary of this Article**
- Loading best hyperparameters for each model
- Loading the data
- Model training
- Results discussion


## Loading best hyperparameters for each model

As explained in another notebook, the hyperparameters for each model were tuned using the Optuna library. Each combination of dataset and model has its own set of values, which are shown below.


In [1]:
# Import hyperparameters dataset.
import os 
import pandas as pd

In [2]:
sparse_hyper_params = {}
focused_hyper_params = {}
boolean_hyper_params = {}
for file in os.listdir('hyper_params_results'):
    if file.endswith('.csv') and 'sparse' in file.split('_') and 'classifier' not in file:
        df = pd.read_csv(os.path.join('hyper_params_results', file))
        sparse_hyper_params[file] = df
    elif file.endswith('.csv') and 'focused' in file.split('_') and 'classifier' not in file:
        df = pd.read_csv(os.path.join('hyper_params_results', file))
        focused_hyper_params[file] = df
    elif file.endswith('.csv') and 'classifier' in file:
        df = pd.read_csv(os.path.join('hyper_params_results', file))
        boolean_hyper_params[file] = df
print('Sparse hyper params:\n')
for key in sparse_hyper_params.keys():
    print(key, ':\n ',sparse_hyper_params[key])
print('Focused hyper params:\n')
for key in focused_hyper_params.keys():
    print(key, ':\n',focused_hyper_params[key])
print('Boolean hyper params:\n')
for key in boolean_hyper_params.keys():
    print(key, ':\n',boolean_hyper_params[key])

Sparse hyper params:

params_gradient_boost_regression_sparse_max_u.csv :
            params                  value
0   n_estimators                    544
1  learning_rate    0.36423151911958196
2           loss          squared_error
3          value  0.0064745090355284845
params_gradient_boost_regression_sparse_min_u.csv :
            params               value
0   n_estimators                  22
1  learning_rate  0.8296843568407096
2           loss      absolute_error
3          value                 0.0
params_support_vector_regression_sparse_max_u.csv :
     params                   value
0  kernel                    poly
1       C  0.00018807624871896921
2  degree                       5
3   gamma      0.8511446423066539
4   value      0.4008346176510517
params_support_vector_regression_sparse_min_u.csv :
     params                   value
0  kernel                    poly
1       C  3.4084813417139984e-06
2  degree                       2
3   gamma  7.5582957595600045e-06
4  

In [3]:
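import ast

def get_hyper_params_from_df(df):
    """Convert a params/value table into a dict of hyperparameters."""
    output = {}
    for _, row in df.iterrows():
        # The 'value' row is not a hyperparameter, so it is skipped.
        if row['params'] != 'value':
            try:
                # Numeric strings such as '544' or '0.36' become int/float.
                output[row['params']] = ast.literal_eval(row['value'])
            except (ValueError, SyntaxError, TypeError):
                # Anything that is not a Python literal (e.g. 'squared_error') is kept as-is.
                output[row['params']] = row['value']
    return output
get_hyper_params_from_df(sparse_hyper_params['params_gradient_boost_regression_sparse_max_u.csv'])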

{'n_estimators': 544,
 'learning_rate': 0.36423151911958196,
 'loss': 'squared_error'}
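
The same conversion can be checked in isolation, without pandas. The following is a minimal sketch, assuming the two-column `params`/`value` layout shown in the printed output above; the helper name `parse_hyper_params` and the inline CSV text are illustrative and not part of the project.

```python
import ast
import csv
import io

# Illustrative CSV text mirroring the params/value layout printed above.
CSV_TEXT = """params,value
n_estimators,544
learning_rate,0.36423151911958196
loss,squared_error
value,0.0064745090355284845
"""

def parse_hyper_params(csv_text):
    """Return {param: parsed value}, skipping the 'value' row (hypothetical helper)."""
    params = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["params"] == "value":
            continue  # not a hyperparameter
        try:
            # Numeric literals are converted to int/float.
            params[row["params"]] = ast.literal_eval(row["value"])
        except (ValueError, SyntaxError):
            # Plain words such as 'squared_error' stay as strings.
            params[row["params"]] = row["value"]
    return params

hyper_params = parse_hyper_params(CSV_TEXT)
print(hyper_params)
```

Catching only `ValueError` and `SyntaxError` keeps unrelated failures, such as a missing column, visible instead of silently storing a raw string.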

## Loading the data

In [4]:
import sys
sys.path.append('..')
from thesis_package import aimodels as my_ai, utils, metrics
from copy import deepcopy
import sklearn.metrics
from sklearn.model_selection import train_test_split

# Raw strings keep the Windows-style backslashes from being read as escape sequences.
exogenous_data = pd.read_csv(r'..\data\processed\production\exogenous_data_extended.csv').drop(columns=['date'])

In [5]:
# Regression data sparse
y_max_u_sparse = pd.read_csv(r'..\data\ground_truth\res_bus_vm_pu_max_constr.csv').drop(columns=['timestamps'])
y_min_u_sparse = pd.read_csv(r'..\data\ground_truth\res_bus_vm_pu_min_constr.csv').drop(columns=['timestamps'])

train_x, test_x, train_y, test_y = utils.split_and_suffle(exogenous_data, y_max_u_sparse)
data_max_u_sparse = {'X_train': deepcopy(train_x), 'X_test': deepcopy(test_x), 'y_train': deepcopy(train_y), 'y_test': deepcopy(test_y)}

train_x, test_x, train_y, test_y = utils.split_and_suffle(exogenous_data, y_max_u_sparse, scaling=True)
data_max_u_scaled_sparse = {'X_train': deepcopy(train_x), 'X_test': deepcopy(test_x), 'y_train': deepcopy(train_y), 'y_test': deepcopy(test_y)}

train_x, test_x, train_y, test_y = utils.split_and_suffle(exogenous_data, y_min_u_sparse)
data_min_u_sparse = {'X_train': deepcopy(train_x), 'X_test': deepcopy(test_x), 'y_train': deepcopy(train_y), 'y_test': deepcopy(test_y)}

train_x, test_x, train_y, test_y = utils.split_and_suffle(exogenous_data, y_min_u_sparse, scaling=True)
data_min_u_scaled_sparse = {'X_train': deepcopy(train_x), 'X_test': deepcopy(test_x), 'y_train': deepcopy(train_y), 'y_test': deepcopy(test_y)}


In [6]:
# Regression data focused
y_max_u_focused = pd.read_csv(r'..\data\ground_truth\res_bus_vm_pu_max_bal_constr.csv')
exogenous_data_focused_max_u = pd.read_csv(r'..\data\ground_truth\exogenous_data_vm_pu_max_bal.csv').drop(columns=['date'])
y_min_u_focused = pd.read_csv(r'..\data\ground_truth\res_bus_vm_pu_min_bal_constr.csv')
exogenous_data_focused_min_u = pd.read_csv(r'..\data\ground_truth\exogenous_data_vm_pu_min_bal.csv').drop(columns=['date'])

train_x, test_x, train_y, test_y = utils.split_and_suffle(exogenous_data_focused_max_u, y_max_u_focused)
data_max_u_focused = {'X_train': deepcopy(train_x), 'X_test': deepcopy(test_x), 'y_train': deepcopy(train_y), 'y_test': deepcopy(test_y)}

train_x, test_x, train_y, test_y = utils.split_and_suffle(exogenous_data_focused_max_u, y_max_u_focused, scaling=True)
data_max_u_scaled_focused = {'X_train': deepcopy(train_x), 'X_test': deepcopy(test_x), 'y_train': deepcopy(train_y), 'y_test': deepcopy(test_y)}

train_x, test_x, train_y, test_y = utils.split_and_suffle(exogenous_data_focused_min_u, y_min_u_focused)
data_min_u_focused = {'X_train': deepcopy(train_x), 'X_test': deepcopy(test_x), 'y_train': deepcopy(train_y), 'y_test': deepcopy(test_y)}

train_x, test_x, train_y, test_y = utils.split_and_suffle(exogenous_data_focused_min_u, y_min_u_focused, scaling=True)
data_min_u_scaled_focused = {'X_train': deepcopy(train_x), 'X_test': deepcopy(test_x), 'y_train': deepcopy(train_y), 'y_test': deepcopy(test_y)}


In [7]:
# Classification data
y_max_u = pd.read_csv(r'..\data\ground_truth\res_bus_vm_pu_max_bool_constr.csv').drop(columns=['timestamps'])
y_min_u = pd.read_csv(r'..\data\ground_truth\res_bus_vm_pu_min_bool_constr.csv').drop(columns=['timestamps'])
y_max_u = y_max_u[utils.cols_with_positive_values(y_max_u)]
y_min_u = y_min_u[utils.cols_with_positive_values(y_min_u)]

train_x, test_x, train_y, test_y = utils.split_and_suffle(exogenous_data, y_max_u)
data_max_u_bool = {'X_train': deepcopy(train_x), 'X_test': deepcopy(test_x), 'y_train': deepcopy(train_y), 'y_test': deepcopy(test_y)}

train_x, test_x, train_y, test_y = utils.split_and_suffle(exogenous_data, y_max_u, scaling=True)
data_max_u_bool_scaled = {'X_train': deepcopy(train_x), 'X_test': deepcopy(test_x), 'y_train': deepcopy(train_y), 'y_test': deepcopy(test_y)}

train_x, test_x, train_y, test_y = utils.split_and_suffle(exogenous_data, y_min_u)
data_min_u_bool = {'X_train': deepcopy(train_x), 'X_test': deepcopy(test_x), 'y_train': deepcopy(train_y), 'y_test': deepcopy(test_y)}

train_x, test_x, train_y, test_y = utils.split_and_suffle(exogenous_data, y_min_u, scaling=True)
data_min_u_bool_scaled = {'X_train': deepcopy(train_x), 'X_test': deepcopy(test_x), 'y_train': deepcopy(train_y), 'y_test': deepcopy(test_y)}
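
The split-then-copy pattern repeated in the cells above could be factored into one helper. This is only a sketch: `make_dataset` and `toy_split` are hypothetical names, and `toy_split` is a stand-in with the same return shape as the project's `utils.split_and_suffle`, whose real behaviour is not shown in this notebook.

```python
from copy import deepcopy

def make_dataset(split_fn, features, target, **split_kwargs):
    """Run split_fn and pack deep copies of the four splits into a dict."""
    train_x, test_x, train_y, test_y = split_fn(features, target, **split_kwargs)
    return {
        "X_train": deepcopy(train_x),
        "X_test": deepcopy(test_x),
        "y_train": deepcopy(train_y),
        "y_test": deepcopy(test_y),
    }

def toy_split(features, target, test_size=0.25, scaling=False):
    """Stand-in splitter: ordered split, optional scaling by the training peak."""
    cut = int(len(features) * (1 - test_size))
    train_x, test_x = features[:cut], features[cut:]
    if scaling:
        # Scale with statistics of the training part only.
        peak = max(abs(v) for v in train_x) or 1.0
        train_x = [v / peak for v in train_x]
        test_x = [v / peak for v in test_x]
    return train_x, test_x, target[:cut], target[cut:]

features = [1.0, 2.0, 3.0, 4.0]
target = [0, 0, 1, 1]
data_plain = make_dataset(toy_split, features, target)
data_scaled = make_dataset(toy_split, features, target, scaling=True)
print(data_plain)
print(data_scaled)
```

In the notebook, the call would take the project's splitter as `split_fn`, so each dataset dictionary becomes a single line.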

## Training models
In this section, the models are trained with the hyperparameters loaded above. Within each experiment, all models are stored in the same `Context` object for later evaluation. `Context` is a class defined in the `aimodels.py` file that stores the fitted models together with their hyperparameters.
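
The real `Context` class lives in `aimodels.py` and is not reproduced here. The sketch below only illustrates the strategy pattern as it is used in the following cells, under two assumptions inferred from that usage: assigning `strategy` swaps the active model, and `fit` appends the fitted strategy to `strategies`. The two toy strategies are hypothetical.

```python
class MeanStrategy:
    """Toy strategy: predicts the mean of the training targets."""

    def __init__(self, hyper_params=None):
        self.hyper_params = hyper_params or {}
        self.mean_ = None

    def fit(self, data):
        self.mean_ = sum(data["y_train"]) / len(data["y_train"])

    def predict(self, data):
        return [self.mean_ for _ in data["X_test"]]


class LastValueStrategy:
    """Toy strategy: repeats the last training target."""

    def __init__(self, hyper_params=None):
        self.hyper_params = hyper_params or {}
        self.last_ = None

    def fit(self, data):
        self.last_ = data["y_train"][-1]

    def predict(self, data):
        return [self.last_ for _ in data["X_test"]]


class Context:
    """Holds the active strategy and keeps every fitted strategy in order."""

    def __init__(self, strategy):
        self.strategy = strategy
        self.strategies = []

    def fit(self, data):
        self.strategy.fit(data)
        self.strategies.append(self.strategy)


data = {"X_train": [1, 2, 3], "y_train": [2.0, 4.0, 6.0], "X_test": [4, 5]}
context = Context(strategy=MeanStrategy())
context.fit(data=data)
context.strategy = LastValueStrategy()
context.fit(data=data)

predictions = {
    name: strategy.predict(data=data)
    for name, strategy in zip(["mean", "last"], context.strategies)
}
print(predictions)
```

Because the fitted strategies are kept in insertion order, a list of labels can be zipped against `strategies` at evaluation time, which is how the cells below pair model names with models.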

### Max Voltage

In [8]:
sparse_hyper_params.keys()

dict_keys(['params_gradient_boost_regression_sparse_max_u.csv', 'params_gradient_boost_regression_sparse_min_u.csv', 'params_support_vector_regression_sparse_max_u.csv', 'params_support_vector_regression_sparse_min_u.csv', 'params_xgboost_regression_sparse_max_u.csv', 'params_xgboost_regression_sparse_min_u.csv'])

In [9]:
# max_u regression sparse
if 'max_u_regressor_sparse.pickle' not in os.listdir(r'pickles\dataset_benchmark'):
    print('Training max_u regression sparse')
    # Linear Regression
    regressor_max_u = my_ai.Context(strategy=my_ai.LinearRegressionStrategy())
    regressor_max_u.fit(data=data_max_u_sparse)
    # Gradient Boost Regression
    hyper_params = get_hyper_params_from_df(sparse_hyper_params['params_gradient_boost_regression_sparse_max_u.csv'])
    regressor_max_u.strategy = my_ai.GradientBoostRegressorStrategy(hyper_params)
    regressor_max_u.fit(data=data_max_u_sparse)
    # Extreme GBoost Regression
    hyper_params = get_hyper_params_from_df(sparse_hyper_params['params_xgboost_regression_sparse_max_u.csv']) 
    regressor_max_u.strategy = my_ai.XGBoostRegressorStrategy(hyper_params)
    regressor_max_u.fit(data=data_max_u_sparse)
    # Support Vector Regression
    hyper_params = get_hyper_params_from_df(sparse_hyper_params['params_support_vector_regression_sparse_max_u.csv'])
    regressor_max_u.strategy = my_ai.SupportVectorRegressorStrategy(hyper_params)
    regressor_max_u.fit(data=data_max_u_scaled_sparse)
    utils.serialize_object(r'pickles\dataset_benchmark\max_u_regressor_sparse', regressor_max_u)
else:
    print('Loading max_u regression sparse')
    regressor_max_u = utils.deserialize_object(r'pickles\dataset_benchmark\max_u_regressor_sparse')
models = ['lr', 'gb', 'xgb', 'svr']
testing_data = {'max_u_regressor_sparse': {}}
for model, strategy in zip(models, regressor_max_u.strategies):
    if model != 'svr':
        prediction = strategy.predict(data=data_max_u_sparse)
        prediction = pd.DataFrame(prediction, columns=data_max_u_sparse['y_test'].columns)
        testing_data['max_u_regressor_sparse'][model] = {'real': None, 'predicted': None}
        testing_data['max_u_regressor_sparse'][model]['predicted'] = deepcopy(prediction)
        testing_data['max_u_regressor_sparse'][model]['real'] = deepcopy(data_max_u_sparse['y_test'])
    else:
        prediction = strategy.predict(data=data_max_u_scaled_sparse)
        prediction = pd.DataFrame(prediction, columns=data_max_u_scaled_sparse['y_test'].columns)
        testing_data['max_u_regressor_sparse'][model] = {'real': None, 'predicted': None}
        testing_data['max_u_regressor_sparse'][model]['predicted'] = deepcopy(prediction)
        testing_data['max_u_regressor_sparse'][model]['real'] = deepcopy(data_max_u_scaled_sparse['y_test'])

Loading max_u regression sparse
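
The train-or-load logic of the cell above can be isolated into a small helper. This is a minimal sketch using only the standard library; `load_or_train` and `train_toy_model` are hypothetical names, and the project's own `utils.serialize_object` / `utils.deserialize_object` may behave differently.

```python
import pickle
import tempfile
from pathlib import Path

def load_or_train(pickle_path, train_fn):
    """Load a pickled model if it exists, otherwise train it and cache it."""
    pickle_path = Path(pickle_path)
    if pickle_path.exists():
        with pickle_path.open("rb") as handle:
            return pickle.load(handle), "loaded"
    model = train_fn()
    pickle_path.parent.mkdir(parents=True, exist_ok=True)
    with pickle_path.open("wb") as handle:
        pickle.dump(model, handle)
    return model, "trained"

def train_toy_model():
    """Stand-in for an expensive training routine."""
    return {"weights": [0.1, 0.2]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "dataset_benchmark" / "toy_model.pickle"
    first_model, first_status = load_or_train(path, train_toy_model)
    second_model, second_status = load_or_train(path, train_toy_model)

print(first_status, second_status)
```

Building the path with `pathlib` also avoids hard-coding a path separator, so the same cell runs on Windows and on POSIX systems.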


In [10]:
# max_u regression focused
if 'max_u_regressor_focused.pickle' not in os.listdir(r'pickles\dataset_benchmark'):
    print('Training max_u regression focused')
    # Linear Regression
    regressor_max_u_focused = my_ai.Context(strategy=my_ai.LinearRegressionStrategy())
    regressor_max_u_focused.fit(data=data_max_u_focused)
    # Gradient Boost Regression
    hyper_params = get_hyper_params_from_df(focused_hyper_params['params_gradient_boost_regression_focused_max_u.csv'])
    regressor_max_u_focused.strategy = my_ai.GradientBoostRegressorStrategy(hyper_params)
    regressor_max_u_focused.fit(data=data_max_u_focused)
    # Extreme GBoost Regression
    hyper_params = get_hyper_params_from_df(focused_hyper_params['params_xgboost_regression_focused_max_u.csv']) 
    regressor_max_u_focused.strategy = my_ai.XGBoostRegressorStrategy(hyper_params)
    regressor_max_u_focused.fit(data=data_max_u_focused)
    # Support Vector Regression
    hyper_params = get_hyper_params_from_df(focused_hyper_params['params_support_vector_regression_focused_max_u.csv'])
    regressor_max_u_focused.strategy = my_ai.SupportVectorRegressorStrategy(hyper_params)
    regressor_max_u_focused.fit(data=data_max_u_scaled_focused)
    utils.serialize_object(r'pickles\dataset_benchmark\max_u_regressor_focused', regressor_max_u_focused)
else:
    print('Loading max_u regression focused')
    regressor_max_u_focused = utils.deserialize_object(r'pickles\dataset_benchmark\max_u_regressor_focused')

models = ['lr', 'gb', 'xgb', 'svr']
testing_data['max_u_regressor_focused'] = {}
for model, strategy in zip(models, regressor_max_u_focused.strategies):
    if model != 'svr':
        prediction = strategy.predict(data=data_max_u_focused)
        prediction = pd.DataFrame(prediction, columns=data_max_u_focused['y_test'].columns)
        testing_data['max_u_regressor_focused'][model] = {'real': None, 'predicted': None}
        testing_data['max_u_regressor_focused'][model]['predicted'] = deepcopy(prediction)
        testing_data['max_u_regressor_focused'][model]['real'] = deepcopy(data_max_u_focused['y_test'])
    else:
        prediction = strategy.predict(data=data_max_u_scaled_focused)
        prediction = pd.DataFrame(prediction, columns=data_max_u_scaled_focused['y_test'].columns)
        testing_data['max_u_regressor_focused'][model] = {'real': None, 'predicted': None}
        testing_data['max_u_regressor_focused'][model]['predicted'] = deepcopy(prediction)
        testing_data['max_u_regressor_focused'][model]['real'] = deepcopy(data_max_u_scaled_focused['y_test'])

Loading max_u regression focused


In [11]:
# max_u classification
if 'max_u_classifier.pickle' not in os.listdir(r'pickles\dataset_benchmark'):
    print('Training max_u classification')
    # Gradient Boost Classifier
    hyper_params = get_hyper_params_from_df(boolean_hyper_params['params_gradient_boost_classifier_max_u.csv'])
    classifier_max_u = my_ai.Context(strategy=my_ai.GradientBoostClassifierStrategy(hyper_params))
    classifier_max_u.fit(data=data_max_u_bool)
    # Extreme GBoost Classifier
    hyper_params = get_hyper_params_from_df(boolean_hyper_params['params_xgboost_classifier_max_u.csv'])
    classifier_max_u.strategy = my_ai.XGBoostClassifierStrategy(hyper_params)
    classifier_max_u.fit(data=data_max_u_bool)
    # Support Vector Classifier
    hyper_params = get_hyper_params_from_df(boolean_hyper_params['params_support_vector_classifier_max_u.csv'])
    classifier_max_u.strategy = my_ai.SupportVectorClassifierStrategy(hyper_params)
    classifier_max_u.fit(data=data_max_u_bool_scaled)
    utils.serialize_object(r'pickles\dataset_benchmark\max_u_classifier', classifier_max_u)
else:
    print('Loading max_u classification')
    classifier_max_u = utils.deserialize_object(r'pickles\dataset_benchmark\max_u_classifier')
models = ['gb', 'xgb', 'svr']
testing_data['max_u_classifier'] = {}
for model, strategy in zip(models, classifier_max_u.strategies):
    if model != 'svr':
        prediction = strategy.predict(data=data_max_u_bool)
        prediction = pd.DataFrame(prediction, columns=data_max_u_bool['y_test'].columns)
        testing_data['max_u_classifier'][model] = {'real': None, 'predicted': None}
        testing_data['max_u_classifier'][model]['predicted'] = deepcopy(prediction)
        testing_data['max_u_classifier'][model]['real'] = deepcopy(data_max_u_bool['y_test'])
    else:
        prediction = strategy.predict(data=data_max_u_bool_scaled)
        prediction = pd.DataFrame(prediction, columns=data_max_u_bool_scaled['y_test'].columns)
        testing_data['max_u_classifier'][model] = {'real': None, 'predicted': None}
        testing_data['max_u_classifier'][model]['predicted'] = deepcopy(prediction)
        testing_data['max_u_classifier'][model]['real'] = deepcopy(data_max_u_bool_scaled['y_test'])

Loading max_u classification


### Min Voltage


In [12]:
# min_u regression sparse
if 'min_u_regressor_sparse.pickle' not in os.listdir(r'pickles\dataset_benchmark'):
    print('Training min_u regression sparse')
    # Linear Regression
    regressor_min_u = my_ai.Context(strategy=my_ai.LinearRegressionStrategy())
    regressor_min_u.fit(data=data_min_u_sparse)
    # Gradient Boost Regression
    hyper_params = get_hyper_params_from_df(sparse_hyper_params['params_gradient_boost_regression_sparse_min_u.csv'])
    regressor_min_u.strategy = my_ai.GradientBoostRegressorStrategy(hyper_params)
    regressor_min_u.fit(data=data_min_u_sparse)
    # Extreme GBoost Regression
    hyper_params = get_hyper_params_from_df(sparse_hyper_params['params_xgboost_regression_sparse_min_u.csv'])
    regressor_min_u.strategy = my_ai.XGBoostRegressorStrategy(hyper_params)
    regressor_min_u.fit(data=data_min_u_sparse)
    # Support Vector Regression
    hyper_params = get_hyper_params_from_df(sparse_hyper_params['params_support_vector_regression_sparse_min_u.csv'])
    regressor_min_u.strategy = my_ai.SupportVectorRegressorStrategy(hyper_params)
    regressor_min_u.fit(data=data_min_u_scaled_sparse)
    utils.serialize_object(r'pickles\dataset_benchmark\min_u_regressor_sparse', regressor_min_u)
else:
    print('Loading min_u regression sparse')
    regressor_min_u = utils.deserialize_object(r'pickles\dataset_benchmark\min_u_regressor_sparse')
    
models = ['lr', 'gb', 'xgb', 'svr']
testing_data['min_u_regressor_sparse'] = {}
for model, strategy in zip(models, regressor_min_u.strategies):
    if model != 'svr':
        prediction = strategy.predict(data=data_min_u_sparse)
        prediction = pd.DataFrame(prediction, columns=data_min_u_sparse['y_test'].columns)
        testing_data['min_u_regressor_sparse'][model] = {'real': None, 'predicted': None}
        testing_data['min_u_regressor_sparse'][model]['predicted'] = deepcopy(prediction)
        testing_data['min_u_regressor_sparse'][model]['real'] = deepcopy(data_min_u_sparse['y_test'])
    else:
        prediction = strategy.predict(data=data_min_u_scaled_sparse)
        prediction = pd.DataFrame(prediction, columns=data_min_u_scaled_sparse['y_test'].columns)
        testing_data['min_u_regressor_sparse'][model] = {'real': None, 'predicted': None}
        testing_data['min_u_regressor_sparse'][model]['predicted'] = deepcopy(prediction)
        testing_data['min_u_regressor_sparse'][model]['real'] = deepcopy(data_min_u_scaled_sparse['y_test'])

Loading min_u regression sparse


In [13]:
# min_u regression focused
if 'min_u_regressor_focused.pickle' not in os.listdir(r'pickles\dataset_benchmark'):
    print('Training min_u regression focused')
    # Linear Regression
    regressor_min_u_focused = my_ai.Context(strategy=my_ai.LinearRegressionStrategy())
    regressor_min_u_focused.fit(data=data_min_u_focused)
    # Gradient Boost Regression
    hyper_params = get_hyper_params_from_df(focused_hyper_params['params_gradient_boost_regression_focused_min_u.csv'])
    regressor_min_u_focused.strategy = my_ai.GradientBoostRegressorStrategy(hyper_params)
    regressor_min_u_focused.fit(data=data_min_u_focused)
    # Extreme GBoost Regression
    hyper_params = get_hyper_params_from_df(focused_hyper_params['params_xgboost_regression_focused_min_u.csv'])
    regressor_min_u_focused.strategy = my_ai.XGBoostRegressorStrategy(hyper_params)
    regressor_min_u_focused.fit(data=data_min_u_focused)
    # Support Vector Regression
    hyper_params = get_hyper_params_from_df(focused_hyper_params['params_support_vector_regression_focused_min_u.csv'])
    regressor_min_u_focused.strategy = my_ai.SupportVectorRegressorStrategy(hyper_params)
    regressor_min_u_focused.fit(data=data_min_u_scaled_focused)
    utils.serialize_object(r'pickles\dataset_benchmark\min_u_regressor_focused', regressor_min_u_focused)
else:
    print('Loading min_u regression focused')
    regressor_min_u_focused = utils.deserialize_object(r'pickles\dataset_benchmark\min_u_regressor_focused')
models = ['lr', 'gb', 'xgb', 'svr']
testing_data['min_u_regressor_focused'] = {}
for model, strategy in zip(models, regressor_min_u_focused.strategies):
    if model != 'svr':
        prediction = strategy.predict(data=data_min_u_focused)
        prediction = pd.DataFrame(prediction, columns=data_min_u_focused['y_test'].columns)
        testing_data['min_u_regressor_focused'][model] = {'real': None, 'predicted': None}
        testing_data['min_u_regressor_focused'][model]['predicted'] = deepcopy(prediction)
        testing_data['min_u_regressor_focused'][model]['real'] = deepcopy(data_min_u_focused['y_test'])
    else:
        prediction = strategy.predict(data=data_min_u_scaled_focused)
        prediction = pd.DataFrame(prediction, columns=data_min_u_scaled_focused['y_test'].columns)
        testing_data['min_u_regressor_focused'][model] = {'real': None, 'predicted': None}
        testing_data['min_u_regressor_focused'][model]['predicted'] = deepcopy(prediction)
        testing_data['min_u_regressor_focused'][model]['real'] = deepcopy(data_min_u_scaled_focused['y_test'])

Loading min_u regression focused


In [14]:
# min_u classification
if 'min_u_classifier.pickle' not in os.listdir(r'pickles\dataset_benchmark'):
    print('Training min_u classification')
    # Gradient Boost Classifier (uses the min_u hyperparameters, like the other min_u models)
    hyper_params = get_hyper_params_from_df(boolean_hyper_params['params_gradient_boost_classifier_min_u.csv'])
    classifier_min_u = my_ai.Context(strategy=my_ai.GradientBoostClassifierStrategy(hyper_params))
    classifier_min_u.fit(data=data_min_u_bool)
    # Extreme GBoost Classifier
    hyper_params = get_hyper_params_from_df(boolean_hyper_params['params_xgboost_classifier_min_u.csv'])
    classifier_min_u.strategy = my_ai.XGBoostClassifierStrategy(hyper_params)
    classifier_min_u.fit(data=data_min_u_bool)
    # Support Vector Classifier
    hyper_params = get_hyper_params_from_df(boolean_hyper_params['params_support_vector_classifier_min_u.csv'])
    classifier_min_u.strategy = my_ai.SupportVectorClassifierStrategy(hyper_params)
    classifier_min_u.fit(data=data_min_u_bool_scaled)
    utils.serialize_object(r'pickles\dataset_benchmark\min_u_classifier', classifier_min_u)
else:
    print('Loading min_u classification')
    classifier_min_u = utils.deserialize_object(r'pickles\dataset_benchmark\min_u_classifier')
models = ['gb', 'xgb', 'svr']
testing_data['min_u_classifier'] = {}
for model, strategy in zip(models, classifier_min_u.strategies):
    if model != 'svr':
        prediction = strategy.predict(data=data_min_u_bool)
        prediction = pd.DataFrame(prediction, columns=data_min_u_bool['y_test'].columns)
        testing_data['min_u_classifier'][model] = {'real': None, 'predicted': None}
        testing_data['min_u_classifier'][model]['predicted'] = deepcopy(prediction)
        testing_data['min_u_classifier'][model]['real'] = deepcopy(data_min_u_bool['y_test'])
    else:
        prediction = strategy.predict(data=data_min_u_bool_scaled)
        prediction = pd.DataFrame(prediction, columns=data_min_u_bool_scaled['y_test'].columns)
        testing_data['min_u_classifier'][model] = {'real': None, 'predicted': None}
        testing_data['min_u_classifier'][model]['predicted'] = deepcopy(prediction)
        testing_data['min_u_classifier'][model]['real'] = deepcopy(data_min_u_bool_scaled['y_test'])

Loading min_u classification


## Results Discussion
In this section, the results of training and testing are presented and compared. The main objective of this experiment is to compare the performance of the regression models in terms of the hybrid-metrics confusion matrix and the hybrid-metrics RMSE. The comparisons are the following:
- Compare the confusion matrices of the classification models with those of the regression models evaluated with the hybrid metrics.
- Compare the error results of the regression models trained on the focused dataset with those of the models trained on the sparse dataset.
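
To make the first comparison concrete, the sketch below shows one way a regression output can be turned into confusion-matrix counts. It is an assumption-laden illustration, not the project's `metrics.Metrics` implementation: the threshold mirrors the expression used later in this notebook (10% of the mean per-bus maximum, ignoring buses whose maximum is zero), while the rule "a value is positive when it exceeds the threshold" and the function names are hypothetical.

```python
def hybrid_threshold(real_by_bus, fraction=0.1):
    """Fraction of the mean per-bus maximum, ignoring buses whose maximum is zero."""
    maxima = [max(values) for values in real_by_bus.values() if max(values) != 0]
    return fraction * sum(maxima) / len(maxima)

def hybrid_confusion_counts(predicted_by_bus, real_by_bus, threshold):
    """Count tp/tn/fp/fn, treating values above the threshold as positives (assumed rule)."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for bus, real_values in real_by_bus.items():
        for predicted, real in zip(predicted_by_bus[bus], real_values):
            predicted_positive = predicted > threshold
            real_positive = real > threshold
            if real_positive and predicted_positive:
                counts["tp"] += 1
            elif real_positive:
                counts["fn"] += 1
            elif predicted_positive:
                counts["fp"] += 1
            else:
                counts["tn"] += 1
    return counts

# Toy constraint-violation magnitudes per bus (illustrative values only).
real = {
    "bus_1": [0.0, 0.0, 0.04, 0.0],
    "bus_2": [0.0, 0.0, 0.0, 0.0],
    "bus_3": [0.02, 0.0, 0.0, 0.0],
}
predicted = {
    "bus_1": [0.0, 0.01, 0.05, 0.001],
    "bus_2": [0.0, 0.0, 0.0, 0.0],
    "bus_3": [0.0, 0.0, 0.0, 0.0],
}

threshold = hybrid_threshold(real)
counts = hybrid_confusion_counts(predicted, real, threshold)
print(threshold, counts)
```

With counts of this kind, the regressors can be placed in the same accuracy, precision, recall and F1 table as the classifiers.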

In [21]:
# Accumulate the confusion-matrix counts of the max_u gradient boost classifier over all buses.
from sklearn.metrics import confusion_matrix
tp, tn, fp, fn = 0, 0, 0, 0
for bus in testing_data['max_u_classifier']['gb']['predicted'].columns:
    try:
        # For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
        _tn, _fp, _fn, _tp = confusion_matrix(testing_data['max_u_classifier']['gb']['real'][bus], testing_data['max_u_classifier']['gb']['predicted'][bus]).ravel()
        tp += _tp; tn += _tn; fp += _fp; fn += _fn
    except ValueError:
        # Raised when the matrix is not 2x2, e.g. when a bus contains a single class.
        print('Problem with bus: ', bus)

Problem with bus:  bus_7
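
For binary labels, scikit-learn's `confusion_matrix` lays out its cells as `[[tn, fp], [fn, tp]]`, so `ravel()` yields the order tn, fp, fn, tp. The dependency-free sketch below reproduces that order by hand and derives the scores with guards against division by zero. The function names and the toy data are illustrative; unlike the cell above, a bus that contains a single class is counted rather than skipped.

```python
def binary_counts(real, predicted):
    """Return (tn, fp, fn, tp), the order a 2x2 confusion matrix has when flattened."""
    tn = fp = fn = tp = 0
    for r, p in zip(real, predicted):
        if r and p:
            tp += 1
        elif r:
            fn += 1
        elif p:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

def scores(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1, returning 0.0 where a ratio is undefined."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels per bus; bus_7 has no positive samples at all.
real_by_bus = {"bus_1": [0, 1, 1, 0], "bus_7": [0, 0, 0, 0]}
predicted_by_bus = {"bus_1": [0, 1, 0, 1], "bus_7": [0, 0, 0, 0]}

totals = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
for bus in real_by_bus:
    tn, fp, fn, tp = binary_counts(real_by_bus[bus], predicted_by_bus[bus])
    totals["tn"] += tn
    totals["fp"] += fp
    totals["fn"] += fn
    totals["tp"] += tp

result = scores(**totals)
print(totals, result)
```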


In [29]:
# Build a multi-index dataframe with the results of the metrics. The first index level is the experiment (testing_data.keys()), the second is the model, and the columns are the confusion-matrix counts and the derived scores.
columns = ['tp', 'tn', 'fp', 'fn', 'accuracy', 'precision', 'recall', 'f1']
index = pd.MultiIndex.from_product([testing_data.keys(), ['lr', 'gb', 'xgb', 'svr']], names=['experiment', 'class'])
df = pd.DataFrame(index=index, columns=columns)
classifier_experiments = [experiment for experiment in testing_data.keys() if 'classifier' in experiment.split('_')]
regressor_experiments = [experiment for experiment in testing_data.keys() if 'regressor' in experiment.split('_')]
# Classifier experiments
for experiment in classifier_experiments:
    for model in testing_data[experiment].keys():
        # Reset the accumulators for every model.
        tp, tn, fp, fn = 0, 0, 0, 0
        for bus in testing_data[experiment][model]['predicted'].columns:
            try:
                # For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
                _tn, _fp, _fn, _tp = confusion_matrix(testing_data[experiment][model]['real'][bus], testing_data[experiment][model]['predicted'][bus]).ravel()
                tp += _tp; tn += _tn; fp += _fp; fn += _fn
            except ValueError:
                # Raised when the matrix is not 2x2, e.g. when a bus contains a single class.
                print('In the experiment ', experiment, ' and model ', model, ' there was a problem with bus: ', bus)
                if not testing_data[experiment][model]['real'][bus].any():
                    print('Bus {} has no positive data points, so it is skipped.'.format(bus))
        df.loc[(experiment, model), 'tp'] = tp
        df.loc[(experiment, model), 'tn'] = tn
        df.loc[(experiment, model), 'fp'] = fp
        df.loc[(experiment, model), 'fn'] = fn
        accuracy = (tp + tn ) / (tp + tn + fp + fn)
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * (precision * recall) / (precision + recall)
        df.loc[(experiment, model), 'accuracy'] = accuracy
        df.loc[(experiment, model), 'precision'] = precision
        df.loc[(experiment, model), 'recall'] = recall
        df.loc[(experiment, model), 'f1'] = f1
# Regressor experiments.
for experiment in regressor_experiments:
    for model in testing_data[experiment].keys():
        test_data = testing_data[experiment][model]['real']
        # Threshold: 10% of the mean per-bus maximum, ignoring buses whose maximum is zero.
        threshold = test_data.loc[:, test_data.max(axis=0) != 0].max(axis=0).mean() * 0.1 
        hybrid_metrics = metrics.Metrics()
        hybrid_metrics.get_prediction_scores(testing_data[experiment][model]['predicted'], testing_data[experiment][model]['real'], threshold=threshold)
        df.loc[(experiment, model), 'tp'] = hybrid_metrics.true_positives_ctr
        df.loc[(experiment, model), 'tn'] = hybrid_metrics.true_negatives_ctr
        df.loc[(experiment, model), 'fp'] = hybrid_metrics.false_positives_ctr
        df.loc[(experiment, model), 'fn'] = hybrid_metrics.false_negatives_ctr
        df.loc[(experiment, model), 'accuracy'] = hybrid_metrics.accuracy
        df.loc[(experiment, model), 'precision'] = hybrid_metrics.precision
        df.loc[(experiment, model), 'recall'] = hybrid_metrics.recall
        df.loc[(experiment, model), 'f1'] = hybrid_metrics.f1_score

In the experiment  max_u_classifier  and model  gb  there was a problem with bus:  bus_7
Bus bus_7 has no positive data points, so it is skipped.
In the experiment  max_u_classifier  and model  xgb  there was a problem with bus:  bus_7
Bus bus_7 has no positive data points, so it is skipped.
In the experiment  max_u_classifier  and model  svr  there was a problem with bus:  bus_7
Bus bus_7 has no positive data points, so it is skipped.


In [30]:
df

experiment,class,tp,tn,fp,fn,accuracy,precision,recall,f1
max_u_regressor_sparse,lr,3339.0,295690.0,6770.0,1697.0,0.972465,0.3303,0.663026,0.440938
max_u_regressor_sparse,gb,4038.0,298628.0,3832.0,998.0,0.984292,0.513088,0.801827,0.625755
max_u_regressor_sparse,xgb,2632.0,121757.0,180703.0,2404.0,0.404522,0.014356,0.522637,0.027945
max_u_regressor_sparse,svr,3153.0,300590.0,1992.0,1761.0,0.987795,0.612828,0.641636,0.626901
max_u_regressor_focused,lr,4937.0,19939.0,1569.0,75.0,0.938009,0.758838,0.985036,0.857267
max_u_regressor_focused,gb,4776.0,20369.0,1139.0,236.0,0.948152,0.807439,0.952913,0.874165
max_u_regressor_focused,xgb,5012.0,0.0,21508.0,0.0,0.188989,0.188989,1.0,0.317899
max_u_regressor_focused,svr,4142.0,20212.0,1411.0,755.0,0.918326,0.745903,0.845824,0.792727
max_u_classifier,lr,,,,,,,,
max_u_classifier,gb,3101.0,101661.0,1272.0,2494.0,0.965299,0.709124,0.554245,0.622191
