# Model Validation

The fourth model iteration was chosen by using the validation set to gauge how well each candidate generalizes. For the model's final evaluation, it is tested on the holdout set of 17,392 rows, each representing a unique streamer/channel. This holdout dataset was collected after the final model was selected, to avoid data leakage. Based on the validation set, the optimal classification threshold for this model was determined to be 0.74.
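
The text does not say which criterion made 0.74 "optimal". As a minimal sketch, assuming the threshold was chosen to maximize F1 on the validation set, a sweep over candidate thresholds could look like the following. The helper names `f1_at_threshold` and `best_threshold` and the toy labels and probabilities are hypothetical and only illustrate the idea.

```python
def f1_at_threshold(y_true, probs, threshold):
    """F1 for the positive class when probabilities above the threshold are labelled 1."""
    preds = [1 if p > threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, probs, candidates):
    """Return the candidate threshold with the highest F1 (the first one wins ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(y_true, probs, t))


# Toy validation labels and predicted probabilities (made up for illustration).
y_val = [0, 0, 0, 0, 1, 0, 1, 1]
p_val = [0.05, 0.20, 0.40, 0.60, 0.70, 0.80, 0.85, 0.95]

# Candidate thresholds 0.01, 0.02, ..., 0.99.
grid = [i / 100 for i in range(1, 100)]
chosen = best_threshold(y_val, p_val, grid)
chosen_f1 = f1_at_threshold(y_val, p_val, chosen)
print(chosen, chosen_f1)
```

In the notebook itself, the same sweep would be run on the model's validation-set probabilities rather than on toy data, and the selected value would then be frozen before touching the holdout set.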

In [1]:
# imports
import pandas as pd
from keras.models import load_model
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, precision_score, recall_score, confusion_matrix
# load final model
model = load_model('../models/model_4.h5')
# load the pickled scaler that was fit on the training set
from pickle import load
with open('../data/scaler.pkl', 'rb') as f:
    scaler = load(f)

In [2]:
# load the holdout data; it is scaled below with the StandardScaler fit on the training set
df_2 = pd.read_csv('../data/streamer_data_2.csv')
# drop the columns that are not used as model features
df_2 = df_2.drop(columns=['game_name', 'login', 'broadcaster_type', 'language'])
# convert account age from a timedelta string to a whole number of days
df_2['account_age'] = pd.to_timedelta(df_2['account_age']).dt.days
# separate the features from the target
X_test = df_2.drop(columns = 'target')
y_test = df_2.target
# scale the features for the model
X_test = scaler.transform(X_test)
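
The cell above only calls `transform` on the holdout data and never refits the scaler, so the holdout rows are standardized with statistics learned from the training set. A minimal pure-Python sketch of that idea for a single feature follows; the helper names and the numbers are hypothetical, and the population standard deviation is used because that is what `StandardScaler` computes.

```python
from statistics import mean, pstdev


def fit_standardizer(column):
    """Learn the mean and population standard deviation of one feature column."""
    return mean(column), pstdev(column)


def apply_standardizer(column, mu, sigma):
    """Standardize values with statistics that were learned elsewhere."""
    return [(x - mu) / sigma for x in column]


# Made-up account ages in days, for illustration only.
train_age = [100, 200, 300, 400, 500]
holdout_age = [300, 700]

# Statistics come from the training data only...
mu, sigma = fit_standardizer(train_age)
# ...and are reused unchanged on the holdout data.
scaled_holdout = apply_standardizer(holdout_age, mu, sigma)
print(mu, sigma, scaled_holdout)
```

Refitting on the holdout set instead would let information from the evaluation data shape the features, which is the kind of leakage the holdout design is meant to avoid.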

In [3]:
# set the threshold
threshold = 0.74
# generate prediction probabilities; model.predict returns an array of shape (n, 1)
y_preds = model.predict(X_test)
# flatten to 1-D and classify predictions based on threshold
y_preds_thresh = (y_preds.ravel() > threshold).astype(int)
# calculate scores
f1 = f1_score(y_test, y_preds_thresh)
prec = precision_score(y_test, y_preds_thresh)
recall = recall_score(y_test, y_preds_thresh)
print('Test F1:', f1)
print('Test precision:', prec)
print('Test recall:', recall)

Test F1: 0.7030878859857482
Test precision: 0.921161825726141
Test recall: 0.5685019206145967


In [9]:
# create confusion matrix from predictions
tn, fp, fn, tp = confusion_matrix(y_test, y_preds_thresh).ravel()
tn, fp, fn, tp

(16573, 38, 337, 444)

From this confusion matrix, of the `16,611` truly `non-partnered` channels, only `38` were falsely classified as `partner`, giving a specificity of `0.998`. Of the `781` true `partner` channels, `444` were correctly classified as `partner`, giving a recall (sensitivity) of `0.57`. This matches the initial assumption that the model would be selective, with high precision and lower recall, due to the inherent bias in the data.
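
The figures quoted above can be rederived from the four confusion-matrix counts alone. The short check below uses only the standard definitions of each metric and the counts printed by the notebook.

```python
# Counts taken from the confusion matrix output above.
tn, fp, fn, tp = 16573, 38, 337, 444

# Share of truly non-partnered channels that were correctly rejected.
specificity = tn / (tn + fp)
# Share of true partner channels that were found (recall / sensitivity).
recall = tp / (tp + fn)
# Share of predicted partners that really are partners.
precision = tp / (tp + fp)
# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"negatives={tn + fp}, positives={tp + fn}, total={tn + fp + fn + tp}")
print(f"specificity={specificity:.3f}, recall={recall:.3f}, "
      f"precision={precision:.3f}, f1={f1:.3f}")
```

The precision, recall, and F1 values agree with the scores printed by the scikit-learn metric functions earlier in the notebook.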

In [11]:
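# predict the partner probability for a single holdout row (index 602)
# NOTE: X_test was already scaled in cell 2, so this call applies the scaler a second time;
# the output below was produced that way. To score the row on correctly scaled input,
# pass X_test[602].reshape(1, -1) to model.predict directly.
model.predict([scaler.transform(X_test[602].reshape(1, -1))])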

array([[1.2604144e-13]], dtype=float32)