# Classification Metrics Review

Below we import a dataset containing information about customers who currently pay for Health Insurance. The `Response` variable indicates whether the customer is interested in paying for Vehicle Insurance, where `1` means "interested" and `0` means "not interested".

For more about this dataset, you can review the documentation [here](https://www.kaggle.com/arashnic/imbalanced-data-practice?select=aug_train.csv).

**Below we import the insurance dataset.**

In [None]:
import pandas as pd
df = pd.read_csv('data/aug_train.csv')
df.head()

## Task 1 
### Separate the features from the target.
* Assign the features to `X`.
* Assign the target to `y`.

In [None]:
# Your code here

In [None]:
X = df.drop('Response', axis = 1)
y = df.Response

## Task 2 
### Drop the `id` column from `X`.

In [None]:
# Your code here

In [None]:
X = X.drop('id', axis = 1)

## Task 3
### Why did we drop the `id` column before fitting a model?


The `id` column holds an identifier that is unique to every row. It labels a customer rather than describing one, so it carries no information that generalizes to new customers. Generally speaking, `id` columns are not useful for most standard data science algorithms, which assume that the outcomes for each row are *independent* of each other.
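
As a rough illustration of the point above, the sketch below builds a small made-up DataFrame (the values are invented and are not taken from the insurance dataset) and checks that an `id` column takes a distinct value on every row, while an ordinary feature such as `Age` repeats across rows.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration only.
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'Age': [44, 23, 44, 31],
    'Response': [1, 0, 0, 1],
})

# An identifier has as many distinct values as there are rows.
id_is_unique = toy['id'].is_unique
# A real feature usually repeats values across rows.
age_is_unique = toy['Age'].is_unique

print(id_is_unique, age_is_unique)
```

Running the same `is_unique` check on `df['id']` is one way to confirm that the column behaves like a row label in the real data.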

## Task 4 
### Create a train-test split
> Set the random state to `2021`.

In [None]:
# Your code here

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2021)

## Task 5 
### Select the numeric features

In [None]:
# Your code here

In [None]:
X_train_numeric = X_train.select_dtypes('number')
X_test_numeric = X_test[X_train_numeric.columns]

## Task 6 
### Import a scaler and scale the data

In [None]:
# Your code here

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train_numeric)
X_train_scaled = scaler.transform(X_train_numeric)
X_test_scaled = scaler.transform(X_test_numeric)
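
To make the scaling step concrete, here is a minimal sketch on a made-up array (the numbers and the `toy_` names are hypothetical). It suggests how `StandardScaler` learns each column's mean and standard deviation from the training data only and then reuses those statistics on the test data, so the scaled test values are generally not centered at 0.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for X_train_numeric / X_test_numeric.
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_test = np.array([[4.0, 40.0]])

toy_scaler = StandardScaler()
toy_scaler.fit(toy_train)                           # learn mean and std from train only
toy_train_scaled = toy_scaler.transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)    # reuse the train statistics

print(toy_train_scaled.mean(axis=0))  # approximately 0 for each column
print(toy_train_scaled.std(axis=0))   # approximately 1 for each column
print(toy_test_scaled)                # not centered: test rows use train statistics
```

Fitting on the training split alone keeps information from the test split out of the preprocessing step.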

## Task 7 
### Initialize a logistic regression model
* Set the random state to `2021`

In [None]:
# Your code here

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(random_state=2021)

## Task 8 
### Fit the model to the scaled data

In [None]:
# Your code here

In [None]:
model.fit(X_train_scaled, y_train)

## Task 9 
### Plot a confusion matrix

In [None]:
# Your code here

In [None]:
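# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator (scikit-learn >= 1.0) replaces it.
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(model, X_train_scaled, y_train);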
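
The plot is easier to read once the four cells have names. The sketch below uses made-up labels (the `toy_true` and `toy_pred` lists are hypothetical, not model output) to show how scikit-learn lays out a binary confusion matrix and how the counts can be unpacked.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, invented for illustration only.
toy_true = [0, 0, 0, 0, 1, 1, 1, 0]
toy_pred = [0, 0, 1, 0, 1, 0, 1, 0]

cm = confusion_matrix(toy_true, toy_pred)

# For binary labels 0/1, rows are true classes and columns are predicted
# classes, so ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = cm.ravel()

print(cm)
print(tn, fp, fn, tp)
```

These four counts are the inputs to the accuracy, precision, recall, and F1 calculations in the tasks that follow.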

## Task 10 
### Please calculate the accuracy score for the above model.

In [None]:
from sklearn.metrics import accuracy_score
score = accuracy_score(y_train, model.predict(X_train_scaled))
print('{:.2}'.format(score))

In [None]:
# Run this cell unchanged
from src.questions import question_9
question_9.display()

## Task 11 
### Please calculate the precision score for the above model.

In [None]:
from sklearn.metrics import precision_score
score = precision_score(y_train, model.predict(X_train_scaled))
print('{:.2}'.format(score))

In [None]:
# Run this cell unchanged
from src.questions import question_10
question_10.display()

## Task 12 
### Please calculate the recall score for the above model.

In [None]:
from sklearn.metrics import recall_score
score = recall_score(y_train, model.predict(X_train_scaled))
print('{:.2}'.format(score))

In [None]:
# Run this cell unchanged
from src.questions import question_11
question_11.display()

## Task 13 
### Please calculate the F1 score for the above model.

In [None]:
from sklearn.metrics import f1_score
score = f1_score(y_train, model.predict(X_train_scaled))
print('{:.2}'.format(score))

In [None]:
# Run this cell unchanged
from src.questions import question_12
question_12.display()
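
The four metrics from Tasks 10 through 13 can all be written in terms of the confusion matrix counts. The following is a minimal sketch on made-up labels (the `toy_true` and `toy_pred` lists are hypothetical) that computes each metric by hand and cross-checks it against the corresponding scikit-learn function.

```python
import math
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels, invented for illustration only.
toy_true = [0, 0, 0, 0, 1, 1, 1, 0]
toy_pred = [0, 0, 1, 0, 1, 0, 1, 0]

pairs = list(zip(toy_true, toy_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)                   # share of all predictions that are correct
precision = tp / (tp + fp)                          # share of predicted positives that are correct
recall = tp / (tp + fn)                             # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

# Cross-check the hand calculations against scikit-learn.
assert math.isclose(accuracy, accuracy_score(toy_true, toy_pred))
assert math.isclose(precision, precision_score(toy_true, toy_pred))
assert math.isclose(recall, recall_score(toy_true, toy_pred))
assert math.isclose(f1, f1_score(toy_true, toy_pred))

print('{:.2f} {:.2f} {:.2f} {:.2f}'.format(accuracy, precision, recall, f1))
```

Precision is the only one of these with false positives in its denominator and recall the only one with false negatives, which is worth keeping in mind when choosing a metric for a project.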

## Task 14 
### Multiple Choice

We are working on a modeling project, and have determined that **false positives are the most costly outcome**. An ideal metric for this project is:

In [None]:
# Run this cell unchanged
from src.questions import question_13
question_13.display()

## Task 15 
### Multiple Choice

We are working on a modeling project, and have determined that **false negatives are the most costly outcome**. An ideal metric for this project is:

In [None]:
# Run this cell unchanged
from src.questions import question_14
question_14.display()

## Task 16 
### Multiple Choice

We are working on a modeling project with **imbalanced data**, and **we do not have a strong preference between false positives and false negatives**. An ideal metric for this project is:

In [None]:
# Run this cell unchanged
from src.questions import question_15
question_15.display()

## Task 17 
### Multiple Choice

We are working on a modeling project with **balanced data**, and **we do not have a strong preference between false positives and false negatives**. An ideal metric for this project is:

In [None]:
# Run this cell unchanged
from src.questions import question_16
question_16.display()