# Model Performance

In this notebook, we load our trained models and compute each model's performance for comparison.

Because this is a multilabel (not multiclass) classification problem, the choice of metric matters. Accuracy is generally fine for binary classification, but for multilabel targets it only credits a prediction when every label matches, which makes it a poor fit here.

For instance, suppose a target label vector is [1, 1, 0, 0, 1], meaning the first two labels and the last label apply. If a model predicts [1, 1, 0, 0, 0], we would argue the prediction is fairly good because 4 of the 5 labels are correct. However, the vectors are not identical, so if accuracy were our primary metric, this prediction would score 0. In short, accuracy is too coarse a metric to characterize performance.

Instead, we chose to use the **Hamming Loss**, a value in $[0, 1]$ representing the proportion of incorrectly predicted labels (lower is better). In the example above, $HL = 0.2$ because 1 of the 5 labels was incorrect.
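
The worked example can be reproduced by hand. The sketch below uses plain NumPy rather than the sklearn helpers, so the two definitions are visible; the variable names are illustrative only.

```python
import numpy as np

# Target and prediction from the example above, as 2D arrays (one row per sample)
y_true = np.array([[1, 1, 0, 0, 1]])
y_pred = np.array([[1, 1, 0, 0, 0]])

# Subset accuracy: a sample counts only if every one of its labels matches
subset_accuracy = np.mean(np.all(y_true == y_pred, axis=1))

# Hamming loss: fraction of individual label slots that are wrong
hamming = np.mean(y_true != y_pred)

print(subset_accuracy, hamming)
```

For multilabel indicator inputs, these are the quantities that `accuracy_score` and `hamming_loss` report.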

During hyperparameter tuning, we still left accuracy as the scoring metric, because a model that gets more label vectors exactly right will usually also have a lower Hamming Loss. This is a tendency rather than a guarantee, so we report both metrics below.
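
How closely the two metrics track each other can be probed with a small made-up example. The arrays below are hypothetical, not project data: model B gets one row exactly right and the other rows badly wrong, so it ends up with the higher subset accuracy but the worse Hamming loss.

```python
import numpy as np

def subset_accuracy(y_true, y_pred):
    # Share of rows where every label matches
    return float(np.mean(np.all(y_true == y_pred, axis=1)))

def hamming(y_true, y_pred):
    # Share of individual label slots that differ
    return float(np.mean(y_true != y_pred))

# Hypothetical ground truth: 4 samples, 4 labels, one label active per sample
y_true = np.eye(4, dtype=int)

# Model A predicts "no labels" everywhere: one wrong slot in every row
pred_a = np.zeros((4, 4), dtype=int)

# Model B gets the first row exactly right and inverts the other three rows
pred_b = 1 - y_true
pred_b[0] = y_true[0]

acc_a, ham_a = subset_accuracy(y_true, pred_a), hamming(y_true, pred_a)
acc_b, ham_b = subset_accuracy(y_true, pred_b), hamming(y_true, pred_b)

print(acc_a, ham_a)  # model A
print(acc_b, ham_b)  # model B
```

Cases this extreme are unlikely in practice, but they are a reason to check the Hamming loss directly rather than infer it from accuracy.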

In [1]:
# Imports
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import pickle
import matplotlib.pyplot as plt
from typing import Callable, Union

from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, hamming_loss
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

### Running KMeans to Label Data 

In [2]:
# Load in the embeddings 
path = "../data/X20_embeddings.csv.zip"
df = pd.read_csv(path)

In [3]:
df.shape

(24676, 768)

In [4]:
# Train the KMeans clustering model with 8 clusters (from Module 2)
kmeans = KMeans(n_clusters = 8, random_state = 1)
kmeans = kmeans.fit(df)

# Labels generated with the GPT API in Module 2, keyed by cluster id
labels = {
    0: ['social-issues', 'personal-development', 'business-and-economics', 'community-building'],
    1: ['india', 'updates', 'testing', 'fatalities', 'recoveries', 'healthcare'],
    2: ['face-masks', 'safety', 'protection', 'public-health', 'prevention'],
    3: ['social-media', 'resilience', 'community-support', 'online-events'],
    4: ['global', 'cases', 'deaths', 'statistics'],
    5: ['politics', 'government-response', 'conspiracy', 'human-rights'],
    6: ['health', 'information', 'vaccine', 'public-awareness'],
    7: ['layoffs', 'misinformation', 'mental-health', 'lockdown', 'access', 'financial-impact', 'political-response', 'education']
}

In [5]:
# Assign each row to its cluster
df['cluster'] = kmeans.predict(df)

# Add a column called labels holding each row's list of label strings (the ground truth)
df['labels'] = df['cluster'].apply(lambda x: labels[x])

In [6]:
# Join each label list into one space-separated string (split again later by str.get_dummies)
df['labels'] = df['labels'].apply(lambda x: " ".join(x))

In [9]:
# Save the fitted KMeans model; the context manager closes the file handle
with open('../trained_models/kmeans.pkl', 'wb') as f:
    pickle.dump(kmeans, f)

### Sample 10k Rows

In [7]:
# Sample 10,000 rows (fixed seed so the sample is reproducible)
df_sampled = df.sample(n = 10_000, random_state = 1)

In [8]:
# Split the data 70/30
train, test = train_test_split(df_sampled, test_size = 0.3)

# Remove non-training cols and split to X,y
drop_cols = ['cluster', 'labels']
X_train, X_test = train.drop(drop_cols, axis = 1), test.drop(drop_cols, axis = 1)

# Multi-hot encode the space-separated labels into one binary column per label
y_train = train['labels'].str.get_dummies(sep =' ')
y_test = test['labels'].str.get_dummies(sep =' ')
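
Because `str.get_dummies` is applied to the train and test splits separately, the two matrices only share columns if every label appears in both splits. With 8 clusters and thousands of rows that is very likely here, but a defensive sketch is shown below on hypothetical toy data (the `*_toy` names are not part of the notebook): the test matrix is reindexed to the training columns.

```python
import pandas as pd

# Hypothetical toy splits; the strings mimic the space-joined label format used above
train_labels = pd.Series(["global cases deaths", "face-masks safety"])
test_labels = pd.Series(["face-masks safety"])

y_train_toy = train_labels.str.get_dummies(sep=" ")
y_test_toy = test_labels.str.get_dummies(sep=" ")

# Align the test matrix to the training columns, filling absent labels with 0
y_test_toy = y_test_toy.reindex(columns=y_train_toy.columns, fill_value=0)

print(list(y_train_toy.columns))
print(y_test_toy)
```

After the reindex, both matrices have the same columns in the same order, which is what the metric functions need for a label-by-label comparison.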

### Compile Model Performance

In [9]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

In [14]:
lr = MultiOutputClassifier(LogisticRegression())
lr.fit(X_train, y_train)

In [10]:
lda = MultiOutputClassifier(LinearDiscriminantAnalysis())
lda.fit(X_train, y_train)

In [15]:
# Save the fitted LDA model; the context manager closes the file handle
with open('../trained_models/lda_v2.pkl', 'wb') as f:
    pickle.dump(lda, f)

In [15]:
train_pred = lr.predict(X_train)
test_pred = lr.predict(X_test)

In [16]:
# sklearn metrics take (y_true, y_pred); Hamming loss is symmetric, so the values are unchanged
hamming_loss(y_train, train_pred), hamming_loss(y_test, test_pred)

(0.03594139194139194, 0.04156410256410256)

In [7]:
def compute_performance(model: Callable, 
                        X_train: pd.DataFrame, 
                        y_train: Union[pd.DataFrame, pd.Series], 
                        X_test: pd.DataFrame,
                        y_test: Union[pd.DataFrame, pd.Series]) -> dict: 
    """
    Computes the performance of a given model on the training and test sets.
    Params: 
        - model (Callable): any fitted sklearn estimator with a predict method
        - X_train (pd.DataFrame): training features
        - y_train (pd.DataFrame or pd.Series): training targets
        - X_test (pd.DataFrame): test features
        - y_test (pd.DataFrame or pd.Series): test targets
    Returns: 
        - results (dict): dict of form {train_metric1: value1, test_metric1: value2, ...} of the performance metrics
    """
    
    # Make predictions 
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)
    
    # Hamming loss (sklearn expects y_true first, then y_pred)
    train_hamming = hamming_loss(y_train, train_pred)
    test_hamming = hamming_loss(y_test, test_pred)
    
    # Accuracy (subset accuracy for multilabel targets: every label must match)
    train_acc = accuracy_score(y_train, train_pred)
    test_acc = accuracy_score(y_test, test_pred)
    
    # Compile results 
    metric_names = ['hamming_loss', 'accuracy']
    train_metrics = [train_hamming, train_acc]
    test_metrics = [test_hamming, test_acc]
    
    results = {}
    for metric, train_value, test_value in zip(metric_names, train_metrics, test_metrics): 
        
        results.update({
            f'train_{metric}': round(train_value, 4),
            f'test_{metric}': round(test_value, 4)
        })
        
    return results

In [8]:
# Compute model performance 
model_names = ['LR', 'LDA', 'MLP', 'RF', 'GBC']
results = {}
for model_name in model_names: 
    
    # Load model (the context manager closes the file handle)
    with open(f'../trained_models/{model_name.lower()}.pkl', 'rb') as f:
        model = pickle.load(f)

    # Compute performance 
    performance = compute_performance(model, X_train, y_train, X_test, y_test)
    
    # Update results object
    results.update({model_name: performance})

FileNotFoundError: [Errno 2] No such file or directory: '../trained_models/mlp.pkl'
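
The error above stops the loop at the first missing pickle. One way to keep going, sketched here under the assumption that skipping absent models is acceptable, is to check for each file first and report the missing ones. `load_available_models` is a hypothetical helper, not part of the notebook, and the demo uses a temporary directory with a stand-in object instead of a real model.

```python
import pickle
import tempfile
from pathlib import Path

def load_available_models(model_dir, model_names):
    """Load each pickled model that exists; collect the names that are missing."""
    loaded, missing = {}, []
    for name in model_names:
        path = Path(model_dir) / f"{name.lower()}.pkl"
        if not path.exists():
            missing.append(name)
            continue
        with open(path, "rb") as f:
            loaded[name] = pickle.load(f)
    return loaded, missing

# Demo: only lr.pkl exists in the temporary directory
with tempfile.TemporaryDirectory() as tmp:
    with open(Path(tmp) / "lr.pkl", "wb") as f:
        pickle.dump({"stand_in": True}, f)
    loaded, missing = load_available_models(tmp, ["LR", "MLP"])

print(sorted(loaded), missing)
```

In the notebook, the returned dict could then be passed model by model to `compute_performance`, with the missing names printed as a reminder to retrain or re-export them.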

In [59]:
# Convert to DataFrame (one row per model)
pd.DataFrame(results).T

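| model | train_hamming_loss | test_hamming_loss | train_accuracy | test_accuracy |
|-------|--------------------|-------------------|----------------|---------------|
| LR    | 0.0426             | 0.0412            | 0.6829         | 0.6913        |
| LDA   | 0.0285             | 0.0294            | 0.8071         | 0.8033        |
| MLP   | 0.2582             | 0.2625            | 0.0069         | 0.0053        |
| RF    | 0.1949             | 0.1966            | 0.0046         | 0.006         |
| GBC   | 0.1064             | 0.1046            | 0.2143         | 0.2247        |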
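
To read the table as a ranking, the test-set columns can be sorted by each metric. The sketch below copies the test values from the table above into a plain dict (the name `test_scores` is illustrative) and orders the models by Hamming loss (ascending) and by accuracy (descending).

```python
# Test-set metrics copied from the results table above
test_scores = {
    "LR":  {"hamming_loss": 0.0412, "accuracy": 0.6913},
    "LDA": {"hamming_loss": 0.0294, "accuracy": 0.8033},
    "MLP": {"hamming_loss": 0.2625, "accuracy": 0.0053},
    "RF":  {"hamming_loss": 0.1966, "accuracy": 0.006},
    "GBC": {"hamming_loss": 0.1046, "accuracy": 0.2247},
}

# Lower Hamming loss is better; higher accuracy is better
by_hamming = sorted(test_scores, key=lambda m: test_scores[m]["hamming_loss"])
by_accuracy = sorted(test_scores, key=lambda m: test_scores[m]["accuracy"], reverse=True)

print(by_hamming)
print(by_accuracy)
```

On these numbers both orderings agree, with LDA first and LR second.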
