### Comparing Models

Now that you have seen a variety of models for regression and classification problems, it is worth stepping back to weigh the pros and cons of each option. For classification models, there are at least three things to consider:

1. Is the model good at handling imbalanced classes?
2. Does the model train quickly?
3. Does the model yield interpretable results?

The importance of each consideration varies with your dataset and goals. Your task is to review the models covered so far and discuss the pros and cons of each. Two example datasets provide two very different tasks in which the interpretability of the model may matter to different degrees.
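To make the question of training speed concrete, here is a minimal sketch that times a single `fit` call for each of the four classifiers. It uses synthetic data from `make_classification` as a stand-in (an assumption, not one of the course datasets), and names such as `X_demo`, `candidates`, and `fit_seconds` are illustrative only.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the churn or digits data)
X_demo, y_demo = make_classification(n_samples=2000, n_features=20, random_state=42)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "SVC": SVC(),
}

# Time one fit per model with default hyperparameters
fit_seconds = {}
for name, model in candidates.items():
    start = time.perf_counter()
    model.fit(X_demo, y_demo)
    fit_seconds[name] = time.perf_counter() - start

for name, seconds in fit_seconds.items():
    print(f"{name}: {seconds:.4f} s")
```

Keep in mind that fit time is only part of the picture. KNN does very little work at fit time and defers most of its cost to prediction, so a fair comparison would also time `predict`.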

### Data and Task

Your goal is to discuss the pros and cons of Logistic Regression, Decision Trees, KNN, and SVM for the tasks below.  Consider at least the three questions above and list any additional considerations you believe are important to determining the "best" model for the task.  Share your response with your peers on the class discussion board.  
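As one way to think about interpretability, the sketch below prints the decision rules of a shallow tree and the largest standardized coefficients of a logistic regression. It uses scikit-learn's built-in breast cancer dataset purely as a stand-in (an assumption, not one of the two tasks), and the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in dataset (assumption: chosen only because it ships with scikit-learn)
data = load_breast_cancer()
X_demo, y_demo = data.data, data.target
feature_names = list(data.feature_names)

# A shallow tree can be read as a handful of if/else rules
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_demo, y_demo)
tree_rules = export_text(tree, feature_names=feature_names)
print(tree_rules)

# Logistic regression coefficients are comparable once features are standardized
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(X_demo, y_demo)
coefs = logreg.named_steps["logisticregression"].coef_[0]

# Show the five features with the largest absolute weight
top = sorted(zip(feature_names, coefs), key=lambda pair: abs(pair[1]), reverse=True)[:5]
for name, weight in top:
    print(f"{name}: {weight:+.3f}")
```

KNN and kernel SVMs do not offer a comparably direct summary of how each feature drives a prediction, which is worth weighing when a stakeholder needs an explanation.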

**TASK 1**: Predicting Customer Churn

Suppose you are tasked with producing a model to predict customer churn. Which of your classification models would you use, and what are the pros and cons of this model for this task? Be sure to consider interpretability, imbalanced classes, and the speed of training.
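When classes are imbalanced, accuracy alone can be misleading. The sketch below compares a majority-class baseline against a class-weighted logistic regression on synthetic data with roughly 85% negatives (an assumption; this is not the churn data), reporting accuracy, balanced accuracy, and recall on the minority class.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: about 85% class 0, 15% class 1 (assumption)
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=10, weights=[0.85, 0.15], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# Baseline that always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
# class_weight="balanced" reweights errors inversely to class frequency
weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

scores = {}
for name, model in [("baseline", baseline), ("weighted logreg", weighted)]:
    pred = model.predict(X_te)
    scores[name] = {
        "accuracy": accuracy_score(y_te, pred),
        "balanced accuracy": balanced_accuracy_score(y_te, pred),
        "recall": recall_score(y_te, pred),
    }
    print(name, scores[name])
```

The baseline reaches high accuracy while never identifying a single positive case, so it is worth checking recall or balanced accuracy alongside accuracy before declaring a churn model "good".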



In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV, train_test_split, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

In [2]:
from sklearn.datasets import load_digits

The data is loaded below.  Note that the handwritten digit data is already split into features and target (`digits`, `labels`). 

In [3]:
churn = pd.read_csv('data/telecom_churn.csv')
digits, labels = load_digits(return_X_y=True)

In [4]:
#churn data
churn.head()

Unnamed: 0,State,Account length,Area code,International plan,Voice mail plan,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls,Churn
0,KS,128,415,No,Yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,No,Yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,No,No,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,Yes,No,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,Yes,No,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


In [5]:
gs_models = {
    'KNN': dict(
        params = {
            'n_neighbors': [3, 5, 7, 9],
            'weights': ['uniform', 'distance'],
        },
        model = KNeighborsClassifier(),  
    ),
    'SVC': dict(
        params = {
            'C': [0.01, 0.1, 1, 10, 50],       
            'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],  
            'gamma': ['scale', 'auto']
        },
        model = SVC(),  
    ),
    'DecisionTree': dict(
        params = {
            'criterion': ['gini', 'entropy'],
            'max_depth': [None, 5, 10, 20],
            'min_samples_split': [2, 5, 10, 20],
            'min_samples_leaf': [1, 2, 4, 10]
        },
        model = DecisionTreeClassifier(),  
    ),
    'LogisticRegression': dict(
        params = {
            'C': [0.1, 1, 10],
            'penalty': ['l1', 'l2'],     
        },
        model = LogisticRegression(max_iter=1500, solver="liblinear"),  
    ),
}

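def run_modeling(X, y):
    """Tune every model in gs_models on a stratified split and summarize the results."""
    # Hold out 30% of the data; stratify to preserve class proportions in both splits
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y,
    )
    for model_name, model_metadata in gs_models.items():
        # Randomized search over each model's grid (5-fold CV by default);
        # the fitted search object is stored under the "gridcv" key
        gs_models[model_name]["gridcv"] = RandomizedSearchCV(
            model_metadata["model"],
            param_distributions=model_metadata["params"],
            verbose=3,
            # n_jobs=-1,  # uncomment to run the fits in parallel
            return_train_score=True,
            random_state=42,
            n_iter=50,  # grids with fewer than 50 combinations are searched in full
        ).fit(X_train, y_train)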
        
    for model_name, model_data in gs_models.items():
        print(model_name, model_data["gridcv"].best_params_)
        
    results = pd.DataFrame.from_records(
        [
            {
                "model": model_name, 
                "train score": model_data["gridcv"].score(X_train, y_train),
                "test score": model_data["gridcv"].score(X_test, y_test),
                "fit time": np.mean(model_data["gridcv"].cv_results_["mean_fit_time"]),
            }
            for model_name, model_data in gs_models.items()
        ],
        index="model",
    )

    return results
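The imports above include `StandardScaler` and `make_pipeline`, but `run_modeling` fits each model on the raw features. KNN and SVC are sensitive to feature scale, so one possible variation is to wrap the estimator in a pipeline and prefix the parameter names with the step name. The sketch below shows this for SVC on synthetic data (an assumption); the reduced grid and the names `scaled_svc` and `search` are illustrative, not part of the notebook's code.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# The scaler is refit inside each CV fold, so no statistics leak from validation folds
scaled_svc = make_pipeline(StandardScaler(), SVC())

# make_pipeline names each step after its lowercased class, hence the "svc__" prefix
param_distributions = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["linear", "rbf"],
}

search = RandomizedSearchCV(
    scaled_svc,
    param_distributions=param_distributions,
    n_iter=6,  # equals the grid size here, so every combination is tried
    random_state=42,
).fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))
```

Scaling may also shorten fit times for linear-kernel SVMs, which can converge slowly on features with very different ranges.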



Task 1


In [None]:
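# Separate the features from the target
X = churn.drop('Churn', axis=1).copy()
y = churn['Churn'].copy()

# Keep numeric features only: drop the text columns and the categorical 'Area code'
X = X.drop(X.select_dtypes(object).columns, axis=1).drop('Area code', axis=1)
# Encode the boolean target as 1 (churned) / 0 (did not churn)
y = y.map({True: 1, False: 0})

results_churn = run_modeling(X, y)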



Fitting 5 folds for each of 8 candidates, totalling 40 fits
[CV 1/5] END n_neighbors=3, weights=uniform;, score=(train=0.909, test=0.844) total time=   0.0s
[CV 2/5] END n_neighbors=3, weights=uniform;, score=(train=0.909, test=0.850) total time=   0.0s
[CV 3/5] END n_neighbors=3, weights=uniform;, score=(train=0.902, test=0.861) total time=   0.0s
[CV 4/5] END n_neighbors=3, weights=uniform;, score=(train=0.902, test=0.861) total time=   0.0s
[CV 5/5] END n_neighbors=3, weights=uniform;, score=(train=0.907, test=0.878) total time=   0.0s
[CV 1/5] END n_neighbors=3, weights=distance;, score=(train=1.000, test=0.844) total time=   0.0s
[CV 2/5] END n_neighbors=3, weights=distance;, score=(train=1.000, test=0.848) total time=   0.0s
[CV 3/5] END n_neighbors=3, weights=distance;, score=(train=1.000, test=0.861) total time=   0.0s
[CV 4/5] END n_neighbors=3, weights=distance;, score=(train=1.000, test=0.861) total time=   0.0s
[CV 5/5] END n_neighbors=3, weights=distance;, score=(train=1.0



[CV 1/5] END C=0.01, gamma=scale, kernel=linear;, score=(train=0.855, test=0.854) total time=   9.4s
[CV 2/5] END C=0.01, gamma=scale, kernel=linear;, score=(train=0.855, test=0.854) total time=   8.7s
[CV 3/5] END C=0.01, gamma=scale, kernel=linear;, score=(train=0.855, test=0.854) total time=   9.7s
[CV 4/5] END C=0.01, gamma=scale, kernel=linear;, score=(train=0.855, test=0.856) total time=   9.0s
[CV 5/5] END C=0.01, gamma=scale, kernel=linear;, score=(train=0.855, test=0.856) total time=   8.5s
[CV 1/5] END C=0.01, gamma=scale, kernel=poly;, score=(train=0.855, test=0.854) total time=   0.0s
[CV 2/5] END C=0.01, gamma=scale, kernel=poly;, score=(train=0.855, test=0.854) total time=   0.0s
[CV 3/5] END C=0.01, gamma=scale, kernel=poly;, score=(train=0.855, test=0.854) total time=   0.0s
[CV 4/5] END C=0.01, gamma=scale, kernel=poly;, score=(train=0.855, test=0.856) total time=   0.0s
[CV 5/5] END C=0.01, gamma=scale, kernel=poly;, score=(train=0.855, test=0.856) total time=   0.0s


In [None]:
results_churn = run_modeling(X, y)

**TASK 2**: Recognizing Handwritten Digits

Suppose you are tasked with training a model to recognize handwritten digits. Which of your classifiers would you use here and why? Again, be sure to consider the balance of classes, speed of training, and importance of interpretability.
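A quick way to check the balance of classes for this task is to count how many examples each digit has. The sketch below loads the data again under separate names (`digits_demo`, `labels_demo`, chosen here for illustration) so it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_digits

# Load features and integer labels (0 through 9)
digits_demo, labels_demo = load_digits(return_X_y=True)

# Count the examples per digit
counts = np.bincount(labels_demo)
for digit, count in enumerate(counts):
    print(f"digit {digit}: {count} examples")

# A ratio close to 1 indicates the classes are roughly balanced
balance_ratio = counts.min() / counts.max()
print(f"smallest / largest class: {balance_ratio:.2f}")
```

If the ratio is close to 1, plain accuracy is a more reasonable metric here than it is for a heavily imbalanced problem.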



In [None]:
#example image
plt.imshow(digits[0].reshape(8, 8))
plt.title('This is a handwritten 0.');

In [None]:
digits_results = run_modeling(digits, labels)