### Comparing Models

Now that you have seen a variety of models for regression and classification problems, it is good to step back and weigh the pros and cons of these options.  In the case of classification models, there are at least three things to consider:

1. Is the model good at handling imbalanced classes?
2. Does the model train quickly?
3. Does the model yield interpretable results?

The importance of these considerations will vary with your dataset and goals.  Your task is to review the models covered so far and discuss the pros and cons of each.  Two example datasets are provided to illustrate very different tasks, in which model interpretability may matter to different degrees.

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

In [20]:
def get_models():
    """Return a fresh dict of estimators and their hyperparameter grids."""
    return {
        'LogisticRegression': dict(
            params={
                'C': [0.1, 1, 10],
                'penalty': ['l1', 'l2'],
            },
            # liblinear supports both the l1 and l2 penalties
            model=LogisticRegression(max_iter=2000, solver="liblinear"),
        ),
        'KNN': dict(
            params={
                'n_neighbors': [3, 5, 7, 9],
                'weights': ['uniform', 'distance'],
            },
            model=KNeighborsClassifier(),
        ),
        'SVC': dict(
            params={
                'C': [0.01, 0.1, 1, 10, 50],
                'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
                'gamma': ['scale', 'auto'],
            },
            model=SVC(),
        ),
        'DecisionTree': dict(
            params={
                'criterion': ['gini', 'entropy'],
                'max_depth': [None, 5, 10, 20, 30],
                'min_samples_split': [2, 5, 10, 20],
                'min_samples_leaf': [1, 2, 4, 10],
            },
            model=DecisionTreeClassifier(),
        ),
    }

### Data and Task

Your goal is to discuss the pros and cons of Logistic Regression, Decision Trees, KNN, and SVM for the tasks below.  Consider at least the three questions above and list any additional considerations you believe are important in determining the "best" model for the task.  Share your response with your peers on the class discussion board.

**TASK 1**: Predicting Customer Churn

Suppose you are tasked with producing a model to predict customer churn.  Which of your classification models would you use, and what are the pros and cons of this model for this task?  Be sure to consider interpretability, imbalanced classes, and the speed of training.
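
When classes are imbalanced, plain accuracy can look good even for a model that never predicts churn. The sketch below illustrates this with a majority-class baseline. It uses synthetic data from `make_classification` as a stand-in (the 85/15 class ratio and the names `X_demo`, `y_demo`, and `baseline` are assumptions for illustration, not properties of the telecom data).

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an imbalanced churn problem (about 15% positives).
# This is NOT the telecom data; the class ratio is an assumption.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.85, 0.15], random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# A baseline that always predicts the majority class ("no churn").
baseline = DummyClassifier(strategy="most_frequent").fit(Xd_train, yd_train)
baseline_pred = baseline.predict(Xd_test)

# Accuracy equals the majority-class share; recall on the minority class is zero.
baseline_accuracy = accuracy_score(yd_test, baseline_pred)
baseline_recall = recall_score(yd_test, baseline_pred, zero_division=0)

print(f"positive rate:     {yd_test.mean():.3f}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"baseline recall:   {baseline_recall:.3f}")
```

On the real data, comparing each model's test score to such a baseline shows how much of the accuracy is due to class imbalance alone. If recall on churners matters more than overall accuracy, one option is to pass `scoring="recall"` or `scoring="balanced_accuracy"` to `GridSearchCV`, or to set `class_weight="balanced"` on the estimators that support it (Logistic Regression, SVC, and Decision Trees).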



Both datasets are loaded below.  Note that the handwritten digit data is already split into features and target (`digits`, `labels`), while the churn data is split into `X` and `y` in the following cell.

In [21]:
churn = pd.read_csv('data/telecom_churn.csv')
digits, labels = load_digits(return_X_y=True)

In [22]:
X = churn.drop('Churn', axis=1).copy()
y = churn['Churn'].copy()

X = X.drop(X.select_dtypes(object).columns, axis=1).drop('Area code', axis=1)
y = y.map({True: 1, False: 0})

preprocessor = StandardScaler().set_output(transform="pandas")
X = preprocessor.fit_transform(X)
X.head()

,Account length,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls
0,0.676489,1.234883,1.566767,0.476643,1.567036,-0.07061,-0.05594,-0.070427,0.866743,-0.465494,0.866029,-0.085008,-0.601195,-0.08569,-0.427932
1,0.149065,1.307948,-0.333738,1.124503,-0.334013,-0.10808,0.144867,-0.107549,1.058571,0.147825,1.05939,1.240482,-0.601195,1.241169,-0.427932
2,0.902529,-0.59176,1.168304,0.675985,1.168464,-1.573383,0.496279,-1.5739,-0.756869,0.198935,-0.755571,0.703121,0.211534,0.697156,-1.188218
3,-0.42859,-0.59176,2.196596,-1.466936,2.196759,-2.742865,-0.608159,-2.743268,-0.078551,-0.567714,-0.078806,-1.303026,1.024263,-1.306401,0.332354
4,-0.654629,-0.59176,-0.24009,0.626149,-0.240041,-1.038932,1.098699,-1.037939,-0.276311,1.067803,-0.276562,-0.049184,-0.601195,-0.045885,1.092641


In [23]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y,
)

models_churn = get_models()
for model_name, model_info in models_churn.items():
    models_churn[model_name]["gridcv"] = GridSearchCV(
        model_info["model"],
        param_grid=model_info["params"],
        verbose=3,
        n_jobs=-1,
        return_train_score=True,
    ).fit(X_train, y_train)

Fitting 5 folds for each of 6 candidates, totalling 30 fits


[CV 1/5] END C=0.1, penalty=l1;, score=(train=0.856, test=0.859) total time=   0.0s
[CV 5/5] END C=0.1, penalty=l1;, score=(train=0.857, test=0.854) total time=   0.1s
[CV 1/5] END C=0.1, penalty=l2;, score=(train=0.856, test=0.859) total time=   0.0s
[CV 2/5] END C=0.1, penalty=l2;, score=(train=0.856, test=0.854) total time=   0.0s
[CV 3/5] END C=0.1, penalty=l2;, score=(train=0.861, test=0.857) total time=   0.0s
[CV 4/5] END C=0.1, penalty=l2;, score=(train=0.855, test=0.861) total time=   0.0s
[CV 5/5] END C=0.1, penalty=l2;, score=(train=0.859, test=0.850) total time=   0.0s
[CV 1/5] END C=1, penalty=l1;, score=(train=0.856, test=0.859) total time=   0.1s
[CV 2/5] END C=0.1, penalty=l1;, score=(train=0.857, test=0.859) total time=   0.1s
[CV 4/5] END C=0.1, penalty=l1;, score=(train=0.855, test=0.863) total time=   0.1s
[CV 2/5] END C=1, penalty=l1;, score=(train=0.856, test=0.854) total time=   0.1s
[CV 4/5] END C=1, penalty=l1;, score=(train=0.855, test=0.863) total time=   0.1

In [27]:
for model_name, model_data in models_churn.items():
    print(model_name, model_data["gridcv"].best_params_)

LogisticRegression {'C': 0.1, 'penalty': 'l1'}
KNN {'n_neighbors': 5, 'weights': 'uniform'}
SVC {'C': 1, 'gamma': 'scale', 'kernel': 'rbf'}
DecisionTree {'criterion': 'entropy', 'max_depth': 5, 'min_samples_leaf': 4, 'min_samples_split': 5}
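
To make the interpretability question concrete, the sketch below prints the rules of a shallow decision tree and the coefficients of a logistic regression. It is a minimal illustration on synthetic data: the feature names, the `max_depth=2` setting, and the variable names are assumptions chosen to keep the output short, not the tuned models above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical feature names and synthetic data, for illustration only.
feature_names = [f"feature_{i}" for i in range(5)]
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# A shallow tree can be read as a handful of if/else rules.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)
tree_rules = export_text(tree, feature_names=feature_names)
print(tree_rules)

# Logistic regression exposes one coefficient per feature; on standardized
# features, larger magnitudes indicate a stronger effect on the log-odds.
logreg = LogisticRegression(max_iter=2000, solver="liblinear").fit(X_demo, y_demo)
coefficients = {
    name: round(float(coef), 3) for name, coef in zip(feature_names, logreg.coef_[0])
}
print(coefficients)
```

KNN and an SVC with a non-linear kernel offer no comparable per-feature summary, which is worth weighing when the model's predictions must be explained to stakeholders.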


In [26]:
results_churn = pd.DataFrame.from_records(
    [
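        {
            "model": model_name, 
            # score() uses the refit best estimator; with no `scoring` set, this is accuracy
            "train score": model_data["gridcv"].score(X_train, y_train),
            "test score": model_data["gridcv"].score(X_test, y_test),
            # mean fit time averaged over every candidate in the grid, not only the best one
            "fit time": np.mean(model_data["gridcv"].cv_results_["mean_fit_time"]),
        }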
        for model_name, model_data in models_churn.items()
    ],
    index="model",
)

results_churn

model,train score,test score,fit time
LogisticRegression,0.856408,0.855,0.082082
KNN,0.913416,0.881,0.007447
SVC,0.933133,0.903,3.707461
DecisionTree,0.930562,0.913,0.042486
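
The `fit time` column captures only part of the cost. KNN does very little work during fitting and defers most of its computation to prediction, so its scoring time is worth inspecting too. The sketch below pulls both timings from `cv_results_` for a small KNN search. It runs on the bundled digits data with a reduced grid and `cv=3` (assumptions made to keep it quick), and the names `knn_search` and `timing` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_digits(return_X_y=True)

# Small grid and 3 folds to keep the example fast (illustrative settings).
knn_search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5]},
    cv=3,
).fit(X_demo, y_demo)

# cv_results_ records, per candidate, the mean time spent fitting and scoring.
timing = pd.DataFrame(knn_search.cv_results_)[
    ["param_n_neighbors", "mean_fit_time", "mean_score_time"]
]
print(timing)

# refit_time_ is the time taken to refit the best candidate on all the data.
print(f"refit time: {knn_search.refit_time_:.4f}s")
```

Exact timings depend on the machine, so compare the columns relative to each other rather than against the numbers in the table above.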


**TASK 2**: Recognizing Handwritten Digits

Suppose you are tasked with training a model to recognize handwritten digits.  Which of your classifiers would you use here, and why?  Again, be sure to consider the balance of classes, speed of training, and importance of interpretability.



In [None]:
#example image
plt.imshow(digits[0].reshape(8, 8))
plt.title('This is a handwritten 0.');

In [33]:
digits_X_train, digits_X_test, digits_y_train, digits_y_test = train_test_split(
    digits, labels, test_size=0.3, random_state=42, stratify=labels,
)

models_digits = get_models()
for model_name, model_info in models_digits.items():
    models_digits[model_name]["gridcv"] = GridSearchCV(
        model_info["model"],
        param_grid=model_info["params"],
        verbose=3,
        n_jobs=-1,
        return_train_score=True,
    ).fit(digits_X_train, digits_y_train)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV 2/5] END C=0.1, penalty=l1;, score=(train=0.983, test=0.972) total time=   0.1s
[CV 1/5] END C=0.1, penalty=l1;, score=(train=0.983, test=0.956) total time=   0.1s
[CV 4/5] END C=0.1, penalty=l1;, score=(train=0.981, test=0.964) total time=   0.1s
[CV 3/5] END C=0.1, penalty=l1;, score=(train=0.985, test=0.956) total time=   0.1s
[CV 5/5] END C=0.1, penalty=l1;, score=(train=0.981, test=0.960) total time=   0.1s
[CV 1/5] END C=0.1, penalty=l2;, score=(train=0.997, test=0.964) total time=   0.1s
[CV 2/5] END C=0.1, penalty=l2;, score=(train=0.997, test=0.972) total time=   0.1s
[CV 3/5] END C=0.1, penalty=l2;, score=(train=0.997, test=0.968) total time=   0.1s
[CV 4/5] END C=0.1, penalty=l2;, score=(train=0.995, test=0.948) total time=   0.1s
[CV 5/5] END C=0.1, penalty=l2;, score=(train=0.996, test=0.948) total time=   0.1s
[CV 1/5] END C=1, penalty=l1;, score=(train=0.997, test=0.956) total time=   0.3s
[CV 4/5] END C=1, 

In [34]:
for model_name, model_data in models_digits.items():
    print(model_name, model_data["gridcv"].best_params_)

LogisticRegression {'C': 0.1, 'penalty': 'l1'}
KNN {'n_neighbors': 3, 'weights': 'uniform'}
SVC {'C': 1, 'gamma': 'scale', 'kernel': 'poly'}
DecisionTree {'criterion': 'gini', 'max_depth': 30, 'min_samples_leaf': 1, 'min_samples_split': 5}


In [36]:
results_digits = pd.DataFrame.from_records(
    [
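        {
            "model": model_name, 
            # score() uses the refit best estimator; with no `scoring` set, this is accuracy
            "train score": model_data["gridcv"].score(digits_X_train, digits_y_train),
            "test score": model_data["gridcv"].score(digits_X_test, digits_y_test),
            # mean fit time averaged over every candidate in the grid, not only the best one
            "fit time": np.mean(model_data["gridcv"].cv_results_["mean_fit_time"]),
        }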
        for model_name, model_data in models_digits.items()
    ],
    index="model",
)

results_digits

model,train score,test score,fit time
LogisticRegression,0.980907,0.959259,0.259956
KNN,0.995227,0.987037,0.001146
SVC,0.999204,0.988889,0.067707
DecisionTree,0.976134,0.848148,0.018129
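
For the class-balance question on this task, a quick check is to count how many examples each digit has. The sketch below does this with `np.bincount` on the labels returned by `load_digits`; the names `demo_labels`, `class_counts`, and `balance_ratio` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits

# Reload the labels so the snippet runs on its own.
_, demo_labels = load_digits(return_X_y=True)

# Number of samples for each digit 0-9.
class_counts = np.bincount(demo_labels)
for digit, count in enumerate(class_counts):
    print(f"digit {digit}: {count} samples")

# Ratio of the rarest to the most common class; values near 1 mean balanced.
balance_ratio = class_counts.min() / class_counts.max()
print(f"min/max class ratio: {balance_ratio:.2f}")
```

A ratio close to 1 suggests that accuracy is a reasonable metric here, in contrast to a churn-style problem where one class dominates.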
