In [None]:
# Don't run this cell

import sys
sys.path.append('../..')

from new_caller.student_caller_july.student_caller import CohortCaller
from new_caller.student_caller_july.student_list import avocoder_toasters

toaster_caller = CohortCaller(avocoder_toasters)

We will start this code along where we left off with yesterday's standdown activity. Below, the Titanic dataset has been loaded into memory.

In [None]:
import seaborn
import pandas as pd
import numpy as np

titanic = seaborn.load_dataset('titanic')
titanic.sample()

Yesterday afternoon, you split the target off from the independent variables, as the cell below does:

In [None]:
# Split the target 'survived' off from the predictors
X = titanic.drop('survived', axis=1)
y = titanic['survived']

You then converted the `sex` column to a binary.

In [None]:
def is_female(sex_record):
    '''
    Transform a column indicating sex by a string
    'male'/'female' to a binary where
    0 = Male
    1 = Female
    '''

    if sex_record == 'male':
        return 0
    else:
        return 1
    
X['sex'] = X['sex'].apply(is_female)


You then split the data into train and test. 

In [None]:
from sklearn.model_selection import train_test_split

# Implement a train-test split using all default arguments and random_state=42
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42)

Now we will use cross validation functions to select our best model.

For logistic regression, it has been stressed that we need to scale our data, and that the scaler should be fit only on the training portion of each fold that `cross_validate` creates.  In previous exercises, we have looked the other way at this, or implemented some gnarly KFold code.  KFold is good to know, in case you do need to implement some bespoke cross validation techniques.  However, pipelines will do a lot of the same work, with a lot less code.

Our first pipeline tools come from `sklearn.pipeline`

In [None]:
from sklearn.pipeline import Pipeline, make_pipeline

What is the difference between `Pipeline` and `make_pipeline`?

#Your answer here

# Take 5 minutes with a partner to fill in the code below

To create a pipeline, we pass in different transformers and estimators. For our first pipeline, we will StandardScale our inputs, and then pass them into a LogisticRegression model.  

In [None]:
# pass StandardScaler and LogisticRegression objects (i.e. instances: Think parens) into make_pipeline
pipeline = None

Below you have been provided with a function that will print out train and test scores given a pipeline and a feature_list.  

In [None]:

from sklearn.model_selection import cross_validate

def print_cv_scores(pipe, feature_list):
    
    # we pass in pipe to cross validate along with a feature list.
    results = cross_validate(pipe, X_train[feature_list], 
                                   y_train, 
                                   return_train_score=True)
    
    print(results['train_score'])
    print(results['train_score'].mean())
    print('##############')
    print(results['test_score'])
    print(results['test_score'].mean())
    


In [None]:
# Pass in the pipeline above along with a list of two features: 'fare' and 'sex'

The above code ensures that the Standard Scaler and Logistic Regression model are fit only on the training sets of each fold.  There is no leakage.
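
To see what "no leakage" means in practice, here is a minimal sketch that compares a hand-written KFold loop with the pipeline version. It runs on synthetic data, not the Titanic set, and the names `X_demo`, `y_demo`, and `demo_pipe` are made up for illustration. An explicit `KFold` object is passed as `cv` so that both versions see the same splits.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the Titanic set)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50, scale=20, size=(200, 2))
y_demo = (X_demo[:, 0] + rng.normal(scale=10, size=200) > 50).astype(int)

# One splitter shared by both versions so the folds are identical
folds = KFold(n_splits=5, shuffle=True, random_state=42)

# Manual version: the scaler and the model are fit on the training part of each fold only
manual_scores = []
for train_idx, val_idx in folds.split(X_demo):
    scaler = StandardScaler().fit(X_demo[train_idx])
    model = LogisticRegression().fit(scaler.transform(X_demo[train_idx]), y_demo[train_idx])
    manual_scores.append(model.score(scaler.transform(X_demo[val_idx]), y_demo[val_idx]))

# Pipeline version: cross_validate clones the whole pipeline and fits it per fold
demo_pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe_scores = cross_validate(demo_pipe, X_demo, y_demo, cv=folds)['test_score']

# The two sets of validation scores should agree fold by fold
print(np.allclose(manual_scores, pipe_scores))
```

The loop and the pipeline do the same work; the pipeline just keeps the fit-on-train, transform-on-validation bookkeeping out of our hands.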

# End pair task here

Next, we want to add some complexity to the model by adding the feature `class`.  First, let's look at the values it takes.

In [None]:
X_train['class'].value_counts()

To include `class` in the model, use a OneHotEncoder to transform the column. Like our other objects, our OneHotEncoder must be fit only on the training set of each fold.  The categories it learns may then, by chance, not include every value that appears in the validation set of that fold.  An unseen value like that could break our model unless the encoder is told how to handle it.
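
As a small illustration of the unseen-value problem, the sketch below fits an encoder on a made-up training fold that happens to contain no `'Second'` passengers, then transforms a validation fold that does. The frames `train_fold` and `val_fold` are hypothetical; only `OneHotEncoder` and its `handle_unknown` argument come from sklearn.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical fold: the training part happens to contain no 'Second' passengers
train_fold = pd.DataFrame({'class': ['First', 'Third', 'Third', 'First']})
val_fold = pd.DataFrame({'class': ['Second', 'Third']})

# With handle_unknown='ignore', an unseen category becomes a row of all zeros
encoder = OneHotEncoder(categories='auto', handle_unknown='ignore')
encoder.fit(train_fold)
print(encoder.categories_)
encoded_val = encoder.transform(val_fold).toarray()
print(encoded_val)

# With the default handle_unknown='error', the same transform raises a ValueError
strict_encoder = OneHotEncoder().fit(train_fold)
strict_raised = False
try:
    strict_encoder.transform(val_fold)
except ValueError:
    strict_raised = True
print('default encoder raised an error:', strict_raised)
```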

`sklearn.compose` has a handy class called `ColumnTransformer`. `ColumnTransformer` will allow us to pair a transformer with the columns that it should act on.

In [None]:
from sklearn.compose import ColumnTransformer


Create a ColumnTransformer object which takes as an argument `transformers` which equals a list of transformers.  For this instance, the list is composed of one transformer, the OneHotEncoder.  We specify the details of the transformer with a tuple that includes 
 - 1: a string we create that names the transformer 
 - 2: the transformer object, i.e. the OneHotEncoder object with arguments `categories="auto", handle_unknown="ignore"` 
 - 3: a list of the feature names it acts on, in this case just `class`. 
 
The second argument, which comes after the list of tuples, is `remainder`. It describes what to do with the features not acted on by the transformer.  We want the unused features to `passthrough` to the next transformer in the pipeline.
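
The tuple layout described above can be sketched on a toy frame before trying it on the Titanic data. The column names `deck` and `price` are made up for illustration, and `sparse_threshold=0` is added here only so the result prints as a plain array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame; 'deck' and 'price' are made-up column names
toy = pd.DataFrame({'deck': ['A', 'B', 'A', 'C'],
                    'price': [10.0, 20.0, 15.0, 30.0]})

toy_transformer = ColumnTransformer(
    transformers=[
        ('ohe',                                                      # 1: a name we choose
         OneHotEncoder(categories='auto', handle_unknown='ignore'),  # 2: the transformer object
         ['deck']),                                                  # 3: the columns it acts on
    ],
    remainder='passthrough',   # untouched columns are appended after the encoded ones
    sparse_threshold=0,        # assumption for readability: always return a dense array
)

toy_out = toy_transformer.fit_transform(toy)
print(toy_out)
```

The encoded `deck` columns come first, and the passed-through `price` column is appended at the end.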

# Take 5 minutes with a partner to fill in the code below

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Create the column transformer in this cell. 
col_transformer = None

Create a new pipeline with three arguments: col_transformer, StandardScaler(), and LogisticRegression(random_state=42)

In [None]:
# Create new pipeline here
new_pipe = None

In [None]:
print_cv_scores(new_pipe, ['sex', 'fare', 'class'])

# End pair here

Run the cell below, and calculate accuracy, precision, recall, and f1 score

In [None]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

y_hat_train = cross_val_predict(new_pipe, X_train[['sex', 'fare', 'class']], y_train)

confusion_matrix(y_train, y_hat_train)

In [None]:
accuracy = None
recall = None
precision = None
f1 = None

In [None]:
toaster_caller.call_n_students()

One great thing about pipelines is that they can be used with GridSearchCV.  That way, we can try out different combinations of hyperparameters for the different objects in the pipeline.  

We create a parameter grid which is a dictionary with keys designating the object and the hyperparameter, and values equal to a list of potential values.

The tough thing to remember is the key should have the form of `objectname` + `__` + `hyperparametername`

We can print out the new_pipe, like below, to see the names. 

In [None]:
new_pipe
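
Another way to find valid keys is to ask the pipeline for its parameters. The sketch below does this for a separate, hypothetical pipeline built around `KNeighborsClassifier` (not the pipeline from this notebook), so the exact key names shown apply only to that example.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical pipeline, used only to show how step names turn into grid keys
demo_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# make_pipeline names each step after its class, in lowercase
step_names = list(demo_pipe.named_steps)
print(step_names)

# Every key containing '__' has the form objectname__hyperparametername
grid_keys = [key for key in demo_pipe.get_params() if '__' in key]
print(grid_keys)

# The same key format works with set_params
demo_pipe.set_params(kneighborsclassifier__n_neighbors=3)
print(demo_pipe.named_steps['kneighborsclassifier'].n_neighbors)
```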

In the cell below, create a parameter grid that tries out the values `.0001, .5, 1, 50, 100` for the `C` hyperparameter.  Remember `C` equals inverse regularization strength.

# Take 3 minutes by yourself to try this

In [None]:
from sklearn.model_selection import GridSearchCV

parameter_grid = None

In [None]:
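# Run this cell to fit the grid search.
# Note: 'age' contains missing values in the Titanic data, so new_pipe needs an
# imputation step (or 'age' must be dropped) for this fit to succeed.
gs = GridSearchCV(new_pipe, parameter_grid)
gs.fit(X_train[['sex', 'fare', 'class', 'age']], y_train)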

After fitting, we can access the `best_score_` as well as the `best_params_`, as shown below

In [None]:
print(gs.best_score_)
print(gs.best_params_)

We also have the best_estimator_, which has been refit to the entire training set.  We can use that object to make predictions.

In [None]:
# Run cell without changes
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay

y_hat_train = gs.best_estimator_.predict(X_train[['sex', 'fare', 'class', 'age']])

ConfusionMatrixDisplay.from_estimator(gs.best_estimator_, X_train[['sex', 'fare', 'class', 'age']], y_train)

In [None]:
# Calculate metrics
accuracy = None
precision = None
recall = None
f1 = None

In [None]:
toaster_caller.call_n_students()

Although there is not a huge class imbalance, let's see what effect altering the balance may have.  We could use SMOTE (caution: SMOTE only works with imblearn.pipeline objects), but here let's try altering the `class_weight`.  Create a new parameter grid that tests the same `C` values as above, as well as the `class_weight` values `None` and `'balanced'`.
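
To get a feel for what `'balanced'` does, the sketch below computes the weights by hand using the formula given in the sklearn documentation, `n_samples / (n_classes * np.bincount(y))`, and compares them with sklearn's own helper. The labels in `y_demo` are made up, with a 3:1 imbalance.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels with a 3:1 imbalance (six 0s, two 1s)
y_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1])
classes = np.unique(y_demo)

# 'balanced' weights each class inversely to its frequency
by_hand = len(y_demo) / (len(classes) * np.bincount(y_demo))

# sklearn's helper applies the same rule
from_sklearn = compute_class_weight(class_weight='balanced', classes=classes, y=y_demo)

print(by_hand)
print(from_sklearn)
```

The rarer class receives the larger weight, so mistakes on it cost more during fitting.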

# Take 5 minutes with a partner to fill in code below

In [None]:
new_parameter_grid = None

We can also change what metric we pass into the GridSearch via the `scoring` argument.  Pass in the string `'f1'` as `scoring`, after the `new_parameter_grid` argument.

In [None]:
gs = GridSearchCV(new_pipe, new_parameter_grid, scoring=None)
gs.fit(X_train[['sex', 'fare', 'class', 'age']], y_train)

In [None]:
print(gs.best_score_)
print(gs.best_estimator_)

In [None]:
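# Run to plot the confusion matrix from the training set.
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(gs.best_estimator_, X_train[['sex', 'fare', 'class', 'age']], y_train)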

# End pair

In [None]:
# Calculate metrics
accuracy = None
recall = None
precision = None
f1 = None

In [None]:
toaster_caller.call_n_students()

Describe why the confusion matrix looks the way it does.  Why are there now fewer of one error and more of another?

# Stretch Goal

Choose another classifier (KNN, Naive Bayes, or DecisionTreeClassifier) and create a pipeline with it.  Use GridSearchCV to search out optimal hyperparameters for that classifier.

In [None]:
# Your code here