# Compare Classifiers

This notebook explores which base classifiers perform best on the dataset.  
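Every experiment uses the same TF-IDF vectorization step (word-level and character-level features combined) and changes only the classifier.  
Classifier parameters are left at their defaults, apart from `dual="auto"` for LinearSVC and `verbose=-1` for LGBMClassifier.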

Summary of experiments:

| | Vectorizer     | Classifier            |
|-|----------------|-----------------------|
|1| TfidfVectorizer| LinearSVC             |
|2| TfidfVectorizer| LogisticRegression    |
|3| TfidfVectorizer| XGBClassifier         |
|4| TfidfVectorizer| LGBMClassifier        |
|5| TfidfVectorizer| RandomForestClassifier|
|6| TfidfVectorizer| MultinomialNB         |
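
To make the shared vectorization step concrete, here is a minimal sketch of a word-level and character-level TF-IDF union on a toy corpus. The three sentences are invented for illustration and are not taken from the notebook's dataset; the vectorizer settings mirror the defaults used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# Toy corpus, invented for illustration only (not the notebook's dataset).
docs = ["the cat sat", "the dog barked", "a cat and a dog"]

# Word-level and character-level TF-IDF, concatenated column-wise.
union = FeatureUnion(
    [
        ("word", TfidfVectorizer()),
        ("char", TfidfVectorizer(analyzer="char")),
    ]
)
features = union.fit_transform(docs)

# After fitting, each vectorizer exposes its learned vocabulary.
fitted = dict(union.transformer_list)
n_word = len(fitted["word"].vocabulary_)
n_char = len(fitted["char"].vocabulary_)

print(f"word features: {n_word}, char features: {n_char}")
print(f"combined matrix shape: {features.shape}")
```

The combined matrix has one row per document and one column per word feature plus one per character feature, so the classifier sees both views of the text at once.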

### Configure the environment

In [2]:
# Load the autoreload extension to automatically reload modules when they are modified.
%load_ext autoreload

# Configure the autoreload extension to automatically reload imported modules.
%autoreload 2

# Add the path '../src' to the module search path.
import sys
sys.path.append('../src')

### Compare basic classifiers

Train all specified classifiers and store all results in a DataFrame.

In [5]:
# Import necessary libraries
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier

# Project helpers from ../src
from helper import load_data, score_model

# Set pandas display options
pd.set_option("display.max_columns", None)

# Function to create a text classification model pipeline
def create_model(classifier) -> Pipeline:
    '''
    Takes a classifier as input and constructs a pipeline. 
    The pipeline consists of a feature union of TF-IDF vectors for words and characters, 
    followed by the specified classifier.
    '''
    
    # Create a feature union of TF-IDF vectors for words and characters
    tfidf = FeatureUnion(
        [
            ("word", TfidfVectorizer()),
            ("char", TfidfVectorizer(analyzer="char")),
        ]
    )

    # Create a pipeline with TF-IDF feature extraction and the classifier
    pipeline = Pipeline([("tfidf", tfidf), ("cls", classifier)])
    return pipeline

# Function to compare the performance of different classifiers
def compare():
    '''
    Compares the performance of various classifiers.
    It loads training data, iterates through each classifier, creates a model, evaluates its performance, 
    and stores the metrics in a DataFrame.
    '''
    # List of classifiers to compare
    classifiers = [
        LinearSVC(dual="auto"),
        LogisticRegression(),
        XGBClassifier(),
        LGBMClassifier(verbose=-1),     
        RandomForestClassifier(),
        MultinomialNB(),
    ]

    # Load training data
    X_train, y_train = load_data("../data/train.parquet")

    # List to store performance metrics for each classifier
    metrics_table = []

    # Iterate through each classifier
    for classifier in classifiers:
        classifier_name = classifier.__class__.__name__
        print(f"Experiment '{classifier_name}' in progress...")

        # Create the model pipeline
        model = create_model(classifier)
        
        # Evaluate and score the model on training data
        scores = score_model(model, X_train, y_train)

        # Store performance metrics in a dictionary
        metrics = {"classifier": classifier_name}
        for name, values in scores.items():
            value = values.mean()
            metrics[name] = value

        # Append metrics to the metrics table
        metrics_table.append(metrics)

    print('Training done.')  

    # Create a DataFrame from the metrics table
    df_metrics = pd.DataFrame.from_records(metrics_table)
    df_metrics = df_metrics.set_index("classifier")

    return df_metrics

# Compare classifiers and display the results
df_metrics = compare()
df_metrics


Experiment 'LinearSVC' in progress...
Experiment 'LogisticRegression' in progress...
Experiment 'XGBClassifier' in progress...
Experiment 'LGBMClassifier' in progress...
Experiment 'RandomForestClassifier' in progress...
Experiment 'MultinomialNB' in progress...
Training done.


| classifier | fit_time | score_time | test_recall | test_precision | test_f1 | test_accuracy | test_roc_auc |
|------------|----------|------------|-------------|----------------|---------|---------------|--------------|
| LinearSVC | 0.104506 | 0.044321 | 0.902587 | 0.933446 | 0.917656 | 0.919041 | 0.967263 |
| LogisticRegression | 0.240167 | 0.061297 | 0.879604 | 0.897726 | 0.888503 | 0.889659 | 0.953335 |
| XGBClassifier | 0.862664 | 0.049009 | 0.885872 | 0.889518 | 0.887547 | 0.887831 | 0.954333 |
| LGBMClassifier | 0.676668 | 0.052217 | 0.890053 | 0.900667 | 0.895307 | 0.895927 | 0.959565 |
| RandomForestClassifier | 2.145529 | 0.102244 | 0.841474 | 0.911841 | 0.875145 | 0.879997 | 0.953507 |
| MultinomialNB | 0.085884 | 0.042294 | 0.881955 | 0.915805 | 0.898451 | 0.900367 | 0.966241 |
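
The `score_model` helper is imported from `../src` and is not shown in this notebook. The result columns (`fit_time`, `score_time`, `test_*`) match the keys returned by scikit-learn's `cross_validate`, so a plausible sketch looks like the one below. The function name `score_model_sketch`, the metric list, the 5-fold setting, and the synthetic data are all assumptions made for illustration; the real helper may differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Assumed metric list, inferred from the column names in the results table.
SCORING = ["recall", "precision", "f1", "accuracy", "roc_auc"]


def score_model_sketch(model, X, y, cv=5):
    """Hypothetical stand-in for helper.score_model.

    Cross-validates the model and returns a dict of per-fold arrays
    (fit_time, score_time, and one test_<metric> entry per metric).
    """
    return cross_validate(model, X, y, scoring=SCORING, cv=cv)


# Synthetic binary data, used only to make the sketch runnable.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
scores = score_model_sketch(LogisticRegression(max_iter=1000), X, y)

# Same aggregation as in compare(): average each entry over the folds.
summary = {name: values.mean() for name, values in scores.items()}
print(sorted(summary))
```

Under this assumption, each number in the table is a mean over cross-validation folds rather than a score on a single held-out split.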


### Analyze model performance metrics

The next cell sorts the metrics DataFrame by `test_f1` in descending order, so the models with the highest F1 scores appear first. It also styles the table for easier reading: the best value in each score column is highlighted, and the two timing columns are drawn as bars.

In [6]:
df_metrics = df_metrics.sort_values("test_f1", ascending=False)
df_style = df_metrics.style

df_style.highlight_max(
    subset=df_metrics.columns[2:],
    props="background-color:lightblue;color:black"
)

df_style.bar(df_metrics.columns[:2], color='LightSalmon', width=50, height=50)
df_style

| classifier | fit_time | score_time | test_recall | test_precision | test_f1 | test_accuracy | test_roc_auc |
|------------|----------|------------|-------------|----------------|---------|---------------|--------------|
| LinearSVC | 0.104506 | 0.044321 | 0.902587 | 0.933446 | 0.917656 | 0.919041 | 0.967263 |
| MultinomialNB | 0.085884 | 0.042294 | 0.881955 | 0.915805 | 0.898451 | 0.900367 | 0.966241 |
| LGBMClassifier | 0.676668 | 0.052217 | 0.890053 | 0.900667 | 0.895307 | 0.895927 | 0.959565 |
| LogisticRegression | 0.240167 | 0.061297 | 0.879604 | 0.897726 | 0.888503 | 0.889659 | 0.953335 |
| XGBClassifier | 0.862664 | 0.049009 | 0.885872 | 0.889518 | 0.887547 | 0.887831 | 0.954333 |
| RandomForestClassifier | 2.145529 | 0.102244 | 0.841474 | 0.911841 | 0.875145 | 0.879997 | 0.953507 |
