### Model selection and evaluation

In this notebook, we will explore the use of cross-validation to select the best model and evaluate its performance.

We will use the same dataset as in the previous notebook.



### 1 - First, we load the dataset and split it into a training and a test set.


In [1]:
import pandas as pd

df = pd.read_csv('../src/data/processed/dataset.csv')
df.head()

Unnamed: 0,weekday_friday,weekday_monday,weekday_saturday,weekday_sunday,weekday_thursday,weekday_tuesday,weekday_wednesday,category_bus,category_entertainment,category_lifestyle,...,number_unique_words,number_words_content,degree_of_subjectivity,number_of_keywords,number_videos,average_word_length,number_words_title,number_links,number_no_stopwords,views
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.418692,1089,0.495945,8,0,4.694215,11,20,1.0,0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.459542,682,0.473285,6,0,4.620235,12,10,1.0,0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.624679,397,0.374314,6,0,5.445844,8,11,1.0,1
3,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.618234,356,0.435975,10,1,4.47191,5,3,1.0,0
4,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.69186,174,0.588636,8,0,4.798851,6,0,1.0,0


In [13]:
#load column names from json file
import json

with open('../src/data/processed/features.json') as f:
    # the with block closes the file automatically
    columns_name = json.load(f)


In [16]:
#declare target column name
target_col = 'views'

# X = df.drop(target_col, axis=1)
X = df[columns_name]
y = df[target_col]


As a validation strategy, I combine two of the best-known approaches. The dataset is split into three groups: train, validation and test. The train set is used to fit each model and the validation set to evaluate it. The test set is used only to evaluate the final model, once the best model has been chosen.

In this case, the split below holds out 20% of the data for test and then 20% of the remainder for validation, which gives about 64% for train, 16% for validation and 20% for test.

Validation during training is done with cross-validation, which consists of dividing the train set into k subsets and training the model k times, each time holding out a different subset for evaluation and training on the other k-1. The final result is the mean of the results of the k iterations.

More information on cross-validation at the following link:

https://scikit-learn.org/stable/modules/cross_validation.html
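
As a minimal sketch of k-fold cross-validation, the cell below scores one classifier on 5 stratified folds and averages the fold accuracies. It runs on synthetic data from `make_classification`, so `X_demo`, `y_demo` and the chosen `n_estimators` are illustrative assumptions, not the notebook's dataset or settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the train set (assumption: not the notebook's data)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

# k = 5 folds; each fold is held out once while the other 4 are used for training
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = RandomForestClassifier(n_estimators=50, random_state=42)

scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring='accuracy')

print('Fold accuracies:', scores)
print('Mean accuracy: {:.3f}'.format(scores.mean()))
```

The mean of the fold scores is the number that would be compared across models; the spread of the fold scores gives a rough idea of how stable that estimate is.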

In [17]:
#split dataset into train, test and validation

from sklearn.model_selection import train_test_split

# split data into train and test
train, test = train_test_split(df, test_size=0.2, random_state=42, stratify=df[target_col])

# split train into train and validation
train, val = train_test_split(train, test_size=0.2, random_state=42, stratify=train[target_col])

# check the split
print('Train shape: {}'.format(train.shape))
print('Validation shape: {}'.format(val.shape))
print('Test shape: {}'.format(test.shape))

Train shape: (10924, 43)
Validation shape: (2732, 43)
Test shape: (3414, 43)
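
The two calls above take 20% of the full data for test and then 20% of the remaining 80% for validation, which works out to roughly 64% / 16% / 20% (10924 / 2732 / 3414 rows). If an exact 60% / 20% / 20% split is wanted, one option is to use `test_size=0.25` in the second call, since 0.25 of 80% is 20% of the full data. The sketch below checks the arithmetic on a small made-up frame (`demo` and its columns are hypothetical).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced frame with 1000 rows, standing in for df
demo = pd.DataFrame({'feature': np.arange(1000), 'views': np.tile([0, 1], 500)})

# First split: 80% train+validation, 20% test
train_val, test_demo = train_test_split(
    demo, test_size=0.2, random_state=42, stratify=demo['views'])

# Second split: 0.25 of the remaining 80% equals 20% of the full data
train_demo, val_demo = train_test_split(
    train_val, test_size=0.25, random_state=42, stratify=train_val['views'])

print(len(train_demo), len(val_demo), len(test_demo))
```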


In [18]:
#check the distribution of target variable in train, test and validation
print('Train distribution of target variable')
print(train.views.value_counts(normalize=True))
print('Validation distribution of target variable')
print(val.views.value_counts(normalize=True))
print('Test distribution of target variable')
print(test.views.value_counts(normalize=True))


Train distribution of target variable
0    0.5
1    0.5
Name: views, dtype: float64
Validation distribution of target variable
0    0.5
1    0.5
Name: views, dtype: float64
Test distribution of target variable
0    0.5
1    0.5
Name: views, dtype: float64


In [19]:
#define X and y for train, test and validation

X_train = train.drop(target_col, axis=1)
y_train = train[target_col]

X_val = val.drop(target_col, axis=1)
y_val = val[target_col]

In [20]:
X_test = test.drop(target_col, axis=1)
y_test = test[target_col]

In [25]:
#create a function to calculate the classification metrics of the model

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from imblearn.metrics import specificity_score

def model_evaluation(model, X, y):
    """
    Take a fitted model, X and y and return a list with
    [accuracy, recall, specificity, FP rate, precision, F1].

    The per-class metrics use average='weighted', so recall always equals accuracy.
    """
    y_pred = model.predict(X)
    accuracy = accuracy_score(y, y_pred)
    recall = recall_score(y, y_pred, average='weighted')
    specificity = specificity_score(y, y_pred, average='weighted')
    fp_rate = 1 - specificity  # false positive rate
    precision = precision_score(y, y_pred, average='weighted')
    f1 = f1_score(y, y_pred, average='weighted')
    return [accuracy, recall, specificity, fp_rate, precision, f1]
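
A note on `average='weighted'`: weighted recall is the support-weighted mean of the per-class recalls, which is algebraically the same quantity as accuracy. This is why the Accuracy and Recall columns coincide in the comparison table later in this notebook. The small sketch below, on made-up labels (`y_true` and `y_hat` are hypothetical), shows the equality and how the recall of the positive class alone can be read instead.

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels and predictions, balanced like the notebook's target
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 0, 0]

acc = accuracy_score(y_true, y_hat)
weighted_recall = recall_score(y_true, y_hat, average='weighted')

# Recall per class (class 0 first, then class 1)
per_class_recall = recall_score(y_true, y_hat, average=None)

# Recall of the positive class only (default average='binary', pos_label=1)
recall_pos = recall_score(y_true, y_hat)

print('accuracy:', acc)
print('weighted recall:', weighted_recall)
print('per-class recall:', per_class_recall)
print('recall of class 1:', recall_pos)
```

If the goal is to tell models apart by how well they find class 1, the binary (positive-class) variant is likely to be more informative than the weighted one.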



In [26]:
# define 6 models to compare

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

import warnings
warnings.filterwarnings('ignore')

models = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'SVM': SVC(),
    'XGBoost': XGBClassifier()
}

# fit and evaluate the models
comparison = pd.DataFrame(columns=['Accuracy', 'Recall', 'Specificity', 'FP-Rate' ,'Precision', 'F1'])

for name, model in models.items():
    print("Training model: ",name)
    model.fit(X_train, y_train)
    results = model_evaluation(model, X_val, y_val)
    comparison.loc[name] = results
    print('model trained')

# print the comparison table
comparison.sort_values(by='Accuracy', ascending=False)
    

Training model:  Logistic Regression
model trained
Training model:  Decision Tree
model trained
Training model:  Random Forest
model trained
Training model:  Gradient Boosting
model trained
Training model:  SVM
model trained
Training model:  XGBoost
model trained


Unnamed: 0,Accuracy,Recall,Specificity,FP-Rate,Precision,F1
Random Forest,0.743777,0.743777,0.743777,0.256223,0.755959,0.740692
Gradient Boosting,0.741947,0.741947,0.741947,0.258053,0.763781,0.736495
XGBoost,0.729502,0.729502,0.729502,0.270498,0.735671,0.72772
Decision Tree,0.655198,0.655198,0.655198,0.344802,0.655273,0.655156
Logistic Regression,0.602123,0.602123,0.602123,0.397877,0.602123,0.602123
SVM,0.529649,0.529649,0.529649,0.470351,0.54269,0.490757
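
The features in this dataset sit on very different scales (word counts in the hundreds next to ratios between 0 and 1), which may be part of why SVM and Logistic Regression rank last. One thing that could be tried is wrapping them in a pipeline with `StandardScaler`. The sketch below uses synthetic data with one feature deliberately inflated, so the names and numbers are assumptions for illustration; it does not reproduce the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (assumption), with one feature put on a much larger scale
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)
X_demo[:, 0] *= 1000

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# SVC on raw features
raw_svm = SVC().fit(X_tr, y_tr)

# Same SVC, but the scaler is fitted on the training data only and
# applied automatically before predicting
scaled_svm = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr)

raw_acc = raw_svm.score(X_va, y_va)
scaled_acc = scaled_svm.score(X_va, y_va)
print('raw accuracy: {:.3f}'.format(raw_acc))
print('scaled accuracy: {:.3f}'.format(scaled_acc))
```

Putting the scaler inside the pipeline also keeps validation data out of the scaling statistics when the pipeline is later passed to cross-validation or a grid search.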


In [27]:
#hyperparameter tuning

# define the hyperparameters to tune
param_grid = {
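    'Logistic Regression': {
        # note: 'l1' is not supported by the default 'lbfgs' solver, so those
        # candidates fail to fit unless solver='liblinear' or 'saga' is set on the model
        'penalty': ['l1', 'l2'],
        'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
    },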
    'Decision Tree': {
        'criterion': ['gini', 'entropy'],
        'max_depth': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
        'min_samples_split': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
        'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    },
    'Random Forest': {
        'n_estimators': [10, 50, 100, 200, 300, 400, 500],
        'criterion': ['gini', 'entropy'],
        'max_depth': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
        'min_samples_split': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
        'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    },
    'Gradient Boosting': {
        'n_estimators': [10, 50, 100, 200, 300, 400, 500],
        'learning_rate': [0.001, 0.01, 0.1, 1, 10, 100],
        'max_depth': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
        'min_samples_split': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
        'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    },
    'SVM': {
        'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
        'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
        'degree': [2, 3, 4, 5, 6, 7, 8, 9, 10],
        'gamma': ['scale', 'auto']
    },
    'XGBoost': {
        'n_estimators': [10, 50, 100, 200, 300, 400, 500],
        'learning_rate': [0.001, 0.01, 0.1, 1, 10, 100],
        'max_depth': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
        'min_child_weight': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'gamma': [0, 0.25, 0.5, 1],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        'reg_alpha': [0, 0.001, 0.005, 0.01, 0.05],
        'reg_lambda': [1, 1.25, 1.5, 1.75, 2, 10, 50, 100]
    }
}

# define the function to tune the hyperparameters
from sklearn.model_selection import GridSearchCV

def hyperparameter_tuning(model, param_grid, X, y):
    
    grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
    grid_search.fit(X, y)
    best_params = grid_search.best_params_
    best_score = grid_search.best_score_
    return best_params, best_score

# tune the hyperparameters
for name, model in models.items():
    print("Tuning hyperparameters for: ", name)
    best_params, best_score = hyperparameter_tuning(model, param_grid[name], X_train, y_train)
    print("Best params: ", best_params)
    print("Best score: ", best_score)
    print("")

Tuning hyperparameters for:  Logistic Regression
Fitting 5 folds for each of 14 candidates, totalling 70 fits


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Best params:  {'C': 1, 'penalty': 'l2'}
Best score:  0.6056386367255933

Tuning hyperparameters for:  Decision Tree
Fitting 5 folds for each of 2000 candidates, totalling 10000 fits


KeyboardInterrupt: 
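
The exhaustive grid for the Decision Tree already needs 10000 fits (2 x 10 x 10 x 10 candidates times 5 folds), and the grids for the ensemble models are far larger, which is presumably why the run was interrupted. One possible alternative is `RandomizedSearchCV`, which samples a fixed number of candidates from the same ranges. The sketch below is run on synthetic data; `n_iter=20` and the `randint` ranges are assumptions chosen to mirror the Decision Tree grid above, not tuned values.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train (assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=42)

# Same ranges as the Decision Tree grid, expressed as distributions
# (randint's upper bound is exclusive)
param_distributions = {
    'criterion': ['gini', 'entropy'],
    'max_depth': randint(2, 21),
    'min_samples_split': randint(2, 21),
    'min_samples_leaf': randint(1, 11),
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions,
    n_iter=20,           # 20 sampled candidates x 5 folds = 100 fits
    cv=5,
    scoring='accuracy',
    random_state=42,
)
search.fit(X_demo, y_demo)

print('Best params: ', search.best_params_)
print('Best score: ', search.best_score_)
```

The trade-off is that a random search can miss the single best grid point, but the cost is fixed by `n_iter` rather than by the product of all list lengths.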

In [None]:
#save the best model and its parameters

#get name of best model
best_model_name = comparison.sort_values(by='Accuracy', ascending=False).index[0]

#get the best model
best_model = models[best_model_name]

# fit the best model
best_model.fit(X_train, y_train)

# save the model (the with block closes the file after writing)
import pickle
with open('best_model.pkl', 'wb') as f:
    pickle.dump(best_model, f)



In [None]:
# load the model
import pickle
with open('best_model.pkl', 'rb') as f:
    best_model = pickle.load(f)

# evaluate the final model on the held-out test set
model_evaluation(best_model, X_test, y_test)

