### Evaluating Classification Models

**OBJECTIVES**
- Use the confusion matrix to evaluate classification models
- Explore precision and recall as evaluation metrics
- Determine the cost of targeting the highest-probability predictions

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.compose import make_column_transformer
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
from sklearn.datasets import load_breast_cancer, load_digits, fetch_openml

### Evaluating Classifiers

Today, we want to think a bit more about which classification metrics are appropriate in different situations.  Please use this [form](https://forms.gle/nU785s3MaQL33xG97) to summarize your work.

### Problem

Below, a dataset with measurements of cancerous and non-cancerous breast tumors is loaded and displayed.  Use `LogisticRegression` and `KNeighborsClassifier` to build predictive models on train/test splits.  Generate a confusion matrix for each model and explore the classifiers' mistakes.

- Which model do you prefer and why?
- Do you care about predicting each of these classes equally?
- Is there a ratio other than accuracy you think is more important based on the confusion matrix?  

In [None]:
cancer = load_breast_cancer(as_frame=True).frame

In [None]:
cancer.head()

In [None]:
# changing target label
#cancer['target'] = np.where(cancer['target'] == 0, 1, 0)

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score

In [None]:

X = cancer.iloc[:, :-1]
y = cancer['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 11)

In [None]:

lgr = LogisticRegression()
knn = KNeighborsClassifier(n_neighbors=30)

In [None]:

scaler = StandardScaler()

In [None]:
from sklearn.pipeline import Pipeline

In [None]:

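# give each pipeline its own scaler so the two models do not share fitted state
lgr_pipe = Pipeline([('scale', StandardScaler()), ('model', lgr)])
knn_pipe = Pipeline([('scale', StandardScaler()), ('model', knn)])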

In [None]:

lgr_pipe.fit(X_train, y_train)
knn_pipe.fit(X_train, y_train)

In [None]:
#plot confusion matrices
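
One possible way to fill in the plotting cell above, offered as a sketch rather than the intended solution: refit both pipelines on the breast cancer split and draw their confusion matrices side by side. The `models`, `matrices`, and `malignant_recall` names are illustrative and not part of the notebook. Because `load_breast_cancer` encodes malignant as 0 and benign as 1, recall for the malignant class is computed with `pos_label=0`.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer(as_frame=True).frame
X = cancer.drop(columns='target')
y = cancer['target']  # 0 = malignant, 1 = benign in this dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=11)

# Illustrative names; each pipeline gets its own scaler
models = {
    'LogisticRegression': Pipeline([('scale', StandardScaler()),
                                    ('model', LogisticRegression())]),
    'KNN (k=30)': Pipeline([('scale', StandardScaler()),
                            ('model', KNeighborsClassifier(n_neighbors=30))]),
}

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
matrices, malignant_recall = {}, {}
for ax, (name, pipe) in zip(axes, models.items()):
    pipe.fit(X_train, y_train)
    preds = pipe.predict(X_test)
    # rows are true labels, columns are predicted labels
    matrices[name] = confusion_matrix(y_test, preds)
    # share of malignant tumors the model actually flags as malignant
    malignant_recall[name] = recall_score(y_test, preds, pos_label=0)
    ConfusionMatrixDisplay(matrices[name],
                           display_labels=['malignant', 'benign']).plot(ax=ax, colorbar=False)
    ax.set_title(name)
plt.tight_layout()
print(malignant_recall)
```

Comparing the malignant recall of the two models is one way to approach the question of which ratio matters more than accuracy here, since a missed malignant tumor is usually the costlier mistake.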

### Problem

Below, a dataset on customer churn is loaded and displayed.  Build classification models on the data and visualize the confusion matrix for each.

- Suppose you want to offer an incentive to customers you think are likely to churn.  What is an appropriate evaluation metric?
- Suppose you only have a budget to target 100 individuals you expect to churn.  By targeting the most likely predictions to churn, what percent of churned customers did you capture?

In [None]:
churn = fetch_openml(data_id = 43390).frame

In [None]:
churn.head()

In [None]:

# drop identifier columns without mutating a slice of `churn`
X = churn.iloc[:, :-1].drop(columns=['Surname', 'RowNumber', 'CustomerId'])
y = churn['Exited']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 11)

In [None]:

encoder = make_column_transformer((OneHotEncoder(drop = 'first'), ['Geography', 'Gender']),
                                  remainder = StandardScaler())

In [None]:

knn_pipe = Pipeline([('transform', encoder), ('model', KNeighborsClassifier())])
lgr_pipe = Pipeline([('transform', encoder), ('model', LogisticRegression())])

In [None]:

knn_pipe.fit(X_train, y_train)
lgr_pipe.fit(X_train, y_train)

In [None]:
#plot confusion matrices
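
The budget question above can be explored with a sketch like the following. To keep it runnable without downloading anything, it uses a synthetic, imbalanced dataset from `make_classification` as a stand-in for the churn data (an assumption, not the notebook's data); with the real data, the fitted `lgr_pipe`, `X_test`, and `y_test` could be substituted. The names `budget`, `precision_at_k`, and `recall_at_k` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (assumption): roughly 20% positives
X, y = make_classification(n_samples=4000, n_features=10, n_informative=2,
                           n_clusters_per_class=1, weights=[0.8, 0.2],
                           random_state=11)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=11)

model = LogisticRegression().fit(X_train, y_train)
churn_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class

budget = 100  # number of customers we can afford to target
top_idx = np.argsort(churn_prob)[::-1][:budget]  # indices of the highest probabilities

captured = y_test[top_idx].sum()         # true churners among those targeted
precision_at_k = captured / budget       # share of targeted customers who churned
recall_at_k = captured / y_test.sum()    # share of all churners that were targeted

print(f"precision in top {budget}: {precision_at_k:.2f}")
print(f"churners captured: {recall_at_k:.2%}")
```

Precision describes how much of the incentive budget reaches actual churners, while recall describes how many churners are reached at all; which one matters more depends on the cost of the incentive relative to the cost of losing a customer.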


### Predicting Positives

Return to the churn example and fit a Logistic Regression model on the data.

1. If you were to make predictions on a random 30% of the data, what percent of the true positives would you expect to capture?

2. Use the estimator's `predict_proba` method to create a `DataFrame` with the following columns:

| probability of prediction = 1 | true label | 
| -----------  | -------------- |
| .8 | 1 |
| .7 | 1 |
| .4 | 0 |

3. Sort the probabilities from largest to smallest.  What percentage of the positives are in the first 3000 rows?
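
The three steps above can be sketched as follows. This again assumes a synthetic 10,000-row stand-in built with `make_classification` rather than the churn data, and it scores the same rows the model was fit on, so the result is optimistic. Targeting a random 30% of rows is expected to capture about 30% of the positives, which serves as the baseline; `results`, `ranked`, and `share_captured` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the churn data (assumption): 10,000 rows, ~20% positives
X, y = make_classification(n_samples=10_000, n_features=10, n_informative=2,
                           n_clusters_per_class=1, weights=[0.8, 0.2],
                           random_state=11)
model = LogisticRegression().fit(X, y)

# Step 2: one row per observation, predicted probability next to the true label
results = pd.DataFrame({
    'probability of prediction = 1': model.predict_proba(X)[:, 1],
    'true label': y,
})

# Step 3: sort from most to least likely and look at the first 3000 rows
ranked = results.sort_values('probability of prediction = 1', ascending=False)
top = ranked.head(3000)
share_captured = top['true label'].sum() / results['true label'].sum()

# Step 1: a random 30% sample is expected to hold about 30% of the positives
random_baseline = 3000 / len(results)

print(f"random baseline: {random_baseline:.0%}")
print(f"captured by ranking: {share_captured:.0%}")
```

The gap between `share_captured` and `random_baseline` is the quantity a cumulative gain plot shows across every possible cutoff.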

### `scikit-learn` visualizers

- `PrecisionRecallDisplay`
- `RocCurveDisplay`

From `scikit-plot` ([docs](https://scikit-plot.readthedocs.io/en/stable/metrics.html)):

- `plot_cumulative_gain`

In [None]:
from sklearn.metrics import PrecisionRecallDisplay, RocCurveDisplay

In [None]:
import scikitplot as skplot