# Classification Metrics & Grid Search Review

Joseph Hopkins, ATL


## Imports

In [14]:
# Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler

In [15]:
# Data
df = pd.read_csv('train.csv', index_col='PassengerId')


## Explore & Clean Data

Look at the first few rows.

In [16]:
df.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


Clean up column names (lowercase, no spaces, etc.)

In [17]:
df.columns = [col.lower() for col in df.columns]


Look at the description of all the columns.

In [18]:
df.describe(include='all').T

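| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| survived | 891 | | | | 0.383838 | 0.486592 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| pclass | 891 | | | | 2.30864 | 0.836071 | 1.0 | 2.0 | 3.0 | 3.0 | 3.0 |
| name | 891 | 891.0 | Moor, Master. Meier | 1.0 | | | | | | | |
| sex | 891 | 2.0 | male | 577.0 | | | | | | | |
| age | 714 | | | | 29.6991 | 14.5265 | 0.42 | 20.125 | 28.0 | 38.0 | 80.0 |
| sibsp | 891 | | | | 0.523008 | 1.10274 | 0.0 | 0.0 | 0.0 | 1.0 | 8.0 |
| parch | 891 | | | | 0.381594 | 0.806057 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 |
| ticket | 891 | 681.0 | 1601 | 7.0 | | | | | | | |
| fare | 891 | | | | 32.2042 | 49.6934 | 0.0 | 7.9104 | 14.4542 | 31.0 | 512.329 |
| cabin | 204 | 147.0 | B96 B98 | 4.0 | | | | | | | |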


The `cabin` column has too many unique values (147) to handle directly. It turns out that the first letter of the cabin name is the deck the cabin was on. Get value counts on that.

Set null cabins to `'U'`, and check your work.
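A minimal sketch of this step, run on a small made-up frame rather than `train.csv`. The `toy` name and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the `cabin` column (illustrative values, not the real data).
toy = pd.DataFrame({'cabin': ['C85', None, 'C123', None]})

# Replace missing cabins with the placeholder 'U' (unknown).
toy['cabin'] = toy['cabin'].fillna('U')

# Check your work: no nulls should remain.
remaining_nulls = toy['cabin'].isnull().sum()
print(remaining_nulls)
```

On the real data the same two lines would apply to `df['cabin']`.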

Create a new column called `deck`, and populate it with the first letter of the cabin name.
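One way to build `deck`, sketched on a made-up frame whose cabins have already had nulls set to `'U'`. The `toy` frame is an assumption, not the notebook's `df`.

```python
import pandas as pd

# Toy cabins with nulls already replaced by 'U' (illustrative values).
toy = pd.DataFrame({'cabin': ['C85', 'U', 'C123', 'B96 B98']})

# The .str accessor indexes into each string; [0] takes the first character.
toy['deck'] = toy['cabin'].str[0]

# Value counts show how many passengers fall on each deck.
deck_counts = toy['deck'].value_counts()
print(deck_counts)
```

Because unknown cabins are `'U'`, they end up as their own deck category instead of staying null.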

Create a new column called `is_male`, and populate it with `1` for male, `0` for female.

In [9]:
df['is_male'] = (df['sex'] == 'male').astype(int)

Keep only the following columns in this order:
- `survived`
- `age`
- `fare`
- `is_male`
- `pclass`
- `deck`
- `embarked`

In [11]:
cols_to_keep = [
    'survived',
    'age',
    'fare',
    'is_male',
    'pclass',
    'deck',
    'embarked'
]

df = df[cols_to_keep]


Check for null values.
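A quick null check might look like the sketch below. It uses a made-up `toy` frame (an assumption) so it runs on its own.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing values (illustrative only).
toy = pd.DataFrame({
    'age': [22.0, np.nan, 26.0, np.nan],
    'fare': [7.25, 71.28, 7.93, 53.10],
    'embarked': ['S', 'C', None, 'S'],
})

# isnull() marks each missing cell; sum() counts them per column.
null_counts = toy.isnull().sum()
print(null_counts)
```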

For any column with fewer than 5 null values, drop the rows that contain those nulls.

For any column with 5 or more null values, impute the column mean in place of the nulls.

**NEVER DO THIS IN REAL LIFE.**

In [1]:
# Never do mean imputation in real life...
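A possible sketch of the drop-then-impute rule, assuming the intent is to drop rows for sparsely null columns and mean-impute the rest. The `toy` frame and the `threshold` name are made up for illustration, and the sketch assumes the columns being imputed are numeric.

```python
import numpy as np
import pandas as pd

# Toy frame: `age` has 5 nulls, `embarked` has 1 (illustrative values).
toy = pd.DataFrame({
    'age': [22.0, np.nan, 26.0, np.nan, 35.0, np.nan, np.nan, 54.0, np.nan, 2.0],
    'embarked': ['S', 'C', None, 'S', 'S', 'Q', 'S', 'S', 'C', 'S'],
})

threshold = 5  # hypothetical name for the cutoff described above
null_counts = toy.isnull().sum()

# Columns with 1 to 4 nulls: drop the affected rows.
few_nulls = list(null_counts[(null_counts > 0) & (null_counts < threshold)].index)
toy = toy.dropna(subset=few_nulls)

# Columns with 5 or more nulls: fill with the column mean
# (computed after the row drop; never do this in real life).
many_nulls = list(null_counts[null_counts >= threshold].index)
for col in many_nulls:
    toy[col] = toy[col].fillna(toy[col].mean())

print(toy.isnull().sum())
```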


Check your work.

## Preprocess Data

Look at the DataFrame description.

In [10]:
df.describe(include='all').T

One-hot encode any categorical columns.
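A minimal sketch using `pd.get_dummies`, on a made-up frame with the two categorical columns kept earlier (`deck` and `embarked`). The `toy` frame and the choice of `drop_first=True` are assumptions, not something the notebook specifies.

```python
import pandas as pd

# Toy frame with one numeric and two categorical columns (illustrative).
toy = pd.DataFrame({
    'age': [22.0, 38.0, 26.0, 35.0],
    'deck': ['C', 'U', 'C', 'B'],
    'embarked': ['S', 'C', 'S', 'Q'],
})

# drop_first=True drops one level per column to avoid redundant dummies;
# dtype=int gives 0/1 columns instead of booleans.
encoded = pd.get_dummies(toy, columns=['deck', 'embarked'],
                         drop_first=True, dtype=int)
print(encoded.columns.tolist())
```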

Set up `X` and `y`.
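With `survived` as the target, the split into features and target could be sketched like this. The `toy` frame is a made-up stand-in for the cleaned `df`.

```python
import pandas as pd

# Toy stand-in for the cleaned DataFrame (illustrative values).
toy = pd.DataFrame({
    'survived': [0, 1, 1, 0],
    'age': [22.0, 38.0, 26.0, 35.0],
    'fare': [7.25, 71.28, 7.93, 8.05],
})

# Everything except the target goes into X; the target column is y.
X = toy.drop(columns='survived')
y = toy['survived']
print(X.shape, y.shape)
```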

Train/test split. Do we need to stratify?
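The description table above puts the mean of `survived` at about 0.38, so the classes are somewhat unbalanced and stratifying on `y` is a reasonable default. A sketch on made-up arrays (the data, `test_size`, and `random_state` values are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 8 positives (40%), loosely mimicking the class balance.
X = np.arange(40).reshape(20, 2)
y = np.array([1] * 8 + [0] * 12)

# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(y_train.mean(), y_test.mean())
```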

Scale.
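A sketch of scaling with `StandardScaler`, fitting on the training data only and reusing those statistics on the test data. The arrays are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy train/test features (illustrative values).
X_train = np.array([[22.0, 7.25], [38.0, 71.28], [26.0, 7.93], [35.0, 53.10]])
X_test = np.array([[35.0, 8.05]])

ss = StandardScaler()
# Learn the mean and standard deviation from the training set only...
X_train_sc = ss.fit_transform(X_train)
# ...then apply the same transformation to the test set.
X_test_sc = ss.transform(X_test)
print(X_train_sc.mean(axis=0), X_train_sc.std(axis=0))
```

Fitting on the training set alone keeps information from the test set out of the preprocessing step.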

## Fit a model

What type of question are we answering (regression or classification)?

In [19]:
# Instantiate
# LogisticRegressionCV is not in the imports cell above, so import it here.
from sklearn.linear_model import LogisticRegressionCV

# A very large C (1e9) makes the regularization penalty negligible.
lr = LogisticRegressionCV(Cs=[1e9], cv=5, n_jobs=-1)

In [20]:
# Fit
lr.fit(X_train, y_train)

In [21]:
# Score train
lr.score(X_train, y_train)

In [22]:
# Score test
lr.score(X_test, y_test)

### Grid search

- Set up params dictionary.
- Instantiate a `GridSearchCV`
- Fit the model

In [23]:
lr_params = {'C': [100, 10, 1, 0.1, 0.01, 0.001]}

In [24]:
gs = GridSearchCV(LogisticRegression(), lr_params, cv=5, n_jobs=-1, verbose=1)

In [25]:
gs.fit(X_train, y_train)

## Evaluate & Iterate Model
- Best parameters?
- Best score?
- Test score?
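The three questions above map onto `best_params_`, `best_score_`, and `score` on a fitted `GridSearchCV`. The sketch below is self-contained and uses synthetic data from `make_classification` as a stand-in. The data, the `search` name, and `max_iter=1000` are assumptions, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption; not the Titanic features).
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

params = {'C': [100, 10, 1, 0.1, 0.01, 0.001]}
search = GridSearchCV(LogisticRegression(max_iter=1000), params, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)            # best C found by cross-validation
print(search.best_score_)             # mean cross-validated accuracy for that C
print(search.score(X_test, y_test))   # accuracy of the refit model on held-out data
```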

In [26]:
gs.best_params_


Get predictions, and build a confusion matrix
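A sketch of this step on synthetic data, so that it runs without the earlier cells. The data and the `model` name are assumptions; in the notebook the predictions would come from the fitted grid search.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the Titanic features).
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
preds = model.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, preds)

# For binary labels 0/1, ravel() unpacks in this order.
tn, fp, fn, tp = cm.ravel()
print(cm)
```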

Calculate and print the following metrics:
- Sensitivity (Recall)
- Specificity
- Positive Predictive Value (Precision)
- Negative Predictive Value
- Accuracy

See [the Wikipedia entry for Confusion Matrix](https://en.wikipedia.org/wiki/Confusion_matrix) for reference.
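The five metrics follow directly from the four cells of the confusion matrix. A sketch on hand-made labels (the `y_true` and `y_pred` values are made up for illustration):

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 4 actual positives, 6 actual negatives (illustrative only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # recall: share of actual positives caught
specificity = tn / (tn + fp)   # share of actual negatives caught
precision = tp / (tp + fp)     # positive predictive value
npv = tn / (tn + fn)           # negative predictive value
accuracy = (tp + tn) / (tp + tn + fp + fn)

for name, value in [('Sensitivity', sensitivity), ('Specificity', specificity),
                    ('Precision', precision), ('NPV', npv),
                    ('Accuracy', accuracy)]:
    print(f'{name}: {value:.3f}')
```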

In [None]:
gs.best_score_


---