In [None]:
---
title: Using Pipelines
tags: [jupyter]
keywords: course, dataSchool, sklearn
summary: "Using pipeline to automate work and also perform crossvalidation across features and not just model building process."
sidebar: dataSchool_sidebar
permalink: __AutoGenThis__
notebookfilename:  __AutoGenThis__
---

# Encoding categorical features

[video #10](https://www.youtube.com/watch?v=irHhDMbw3xo&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=10)

Created by [Data School](https://www.dataschool.io). Watch all 10 videos on [YouTube](https://www.youtube.com/playlist?list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A). Download the notebooks from [GitHub](https://github.com/justmarkham/scikit-learn-videos).

**Note:** This notebook uses scikit-learn 0.20. Some of the code below will not work if you are using an earlier version of scikit-learn.

## Agenda

- Why should you use a **Pipeline**?
    - Chain steps together sequentially
    - Properly cross-validate a process rather than just a model
        - without a Pipeline, pre-processing is done outside the cross-validation
    - You can run a grid search or a randomized search over the Pipeline rather than just the model
        - this lets you tune the parameters of both the model and the preprocessing steps
- How do you encode categorical features with **OneHotEncoder**?
- How do you apply OneHotEncoder to selected columns with **ColumnTransformer**?
- How do you build and cross-validate a Pipeline?
- How do you make predictions on new data using a Pipeline?
- Why should you use scikit-learn (rather than pandas) for preprocessing?

## Step 1: Load the dataset

**Recall:** In this dataset, you're predicting whether or not a given passenger survived, based on the available data.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('http://bit.ly/kaggletrain')

In [3]:
df.shape

(891, 12)

## Step 2: Select features

### Col Names

In [4]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

### Col Null

Identify the columns with null values

In [5]:
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

### Feature Selections

We will only look at 4 columns in this notebook: the response (Survived) and 3 features (Pclass, Sex, Embarked).

One of these columns (Embarked) has 2 null values, so let's exclude the observations that contain them by filtering with ```df.Embarked.notna()```.

In [6]:
df = df.loc[df.Embarked.notna(), ['Survived', 'Pclass', 'Sex', 'Embarked']]

In [7]:
df.shape

(889, 4)

In [8]:
df.isna().sum()

Survived    0
Pclass      0
Sex         0
Embarked    0
dtype: int64
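
As a side note, the same row filtering can also be written with ```dropna```. The sketch below uses a small made-up DataFrame (not the Titanic data) to check that the two approaches select the same rows; the values and the name ```toy``` are illustrative assumptions.

```python
import pandas as pd

# Made-up stand-in for the Titanic data (values are illustrative only)
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Pclass': [3, 1, 3, 2],
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', None, 'C', 'Q'],
    'Age': [22.0, None, 26.0, 35.0],
})
cols = ['Survived', 'Pclass', 'Sex', 'Embarked']

# Approach used in this notebook: boolean mask plus column selection
via_mask = toy.loc[toy.Embarked.notna(), cols]

# Alternative: drop rows where Embarked is null, then select the columns
via_dropna = toy.dropna(subset=['Embarked'])[cols]

print(via_mask.shape)
print(via_mask.equals(via_dropna))
```

Either form works; the ```.loc``` version does the row and column selection in a single step.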

In [9]:
df.head()

|   | Survived | Pclass | Sex | Embarked |
|---|---|---|---|---|
| 0 | 0 | 3 | male | S |
| 1 | 1 | 1 | female | C |
| 2 | 1 | 3 | female | S |
| 3 | 1 | 1 | female | S |
| 4 | 0 | 3 | male | S |


Notice that Sex and Embarked are strings. We can treat:

- Sex as a binary variable (it has only two values, so it can be coded numerically)
- Embarked as an unordered categorical variable

## Step 3: Cross-validate a model with one feature

### Feature Matrix

In [12]:
# creating the feature matrix
X = df.loc[:, ['Pclass']]
print(X.shape)

(889, 1)


Notice that the feature matrix should always be 2D, even when it contains only one feature.
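
To see the difference between a 2D feature matrix and a 1D column, here is a minimal sketch on a made-up three-row DataFrame (the name ```toy``` and its values are assumptions for illustration).

```python
import pandas as pd

# Made-up data, just to compare shapes
toy = pd.DataFrame({'Pclass': [3, 1, 3], 'Survived': [0, 1, 1]})

X_2d = toy[['Pclass']]  # list of column names -> DataFrame (2D)
x_1d = toy['Pclass']    # single column name -> Series (1D)
print(X_2d.shape, x_1d.shape)

# A Series can be turned back into a one-column DataFrame
X_from_series = x_1d.to_frame()
print(X_from_series.shape)
```

Selecting with a list of column names, as in ```df.loc[:, ['Pclass']]```, is what keeps the result 2D.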

### Response vector

In [13]:
# creating the response vector
y = df.Survived
print(y.shape)

(889,)


### Model Generation

#### Instantiate

In [14]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='lbfgs')

#### Cross-validate

In [15]:
from sklearn.model_selection import cross_val_score
cross_val_score(logreg, X, y, cv=5, scoring='accuracy').mean()

0.6783406335301212

What did we do?

- cross-validated a **logistic regression** with:
    - **feature matrix** X
    - **response vector** y
- using **5-fold** cross-validation
- evaluating each fold with the **accuracy score**
- and taking the **average accuracy** across the 5 folds
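
The steps above can be made visible by looking at the per-fold scores before averaging. This is a minimal sketch on randomly generated data (the names ```X_toy``` and ```y_toy``` and the rule used to create the response are assumptions, not the Titanic data).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Made-up data: one numeric feature and a noisy binary response
rng = np.random.RandomState(0)
X_toy = pd.DataFrame({'Pclass': rng.randint(1, 4, size=100)})
y_toy = ((X_toy['Pclass'] + rng.normal(0, 1, size=100)) < 2).astype(int)

# cross_val_score returns one accuracy value per fold
scores = cross_val_score(LogisticRegression(solver='lbfgs'),
                         X_toy, y_toy, cv=5, scoring='accuracy')
print(scores)         # array of 5 fold accuracies
print(scores.mean())  # the single number reported in this notebook
```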


### Comparing to the null accuracy

Recall that null accuracy is the accuracy you would achieve by always predicting the most frequent class.

In [16]:
y.value_counts(normalize=True)

0    0.617548
1    0.382452
Name: Survived, dtype: float64

Generally, you want to beat the null accuracy, which here is about 0.618 (always predicting class 0).
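
As a sketch of how null accuracy can be computed directly, the example below uses a made-up response vector and compares the ```value_counts``` approach with scikit-learn's ```DummyClassifier```. The data and the names ```y_toy``` and ```X_dummy``` are assumptions for illustration.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Made-up response vector: 3 of 5 observations are class 0
y_toy = pd.Series([0, 0, 0, 1, 1], name='Survived')

# Null accuracy is the proportion of the most frequent class
null_accuracy = y_toy.value_counts(normalize=True).max()

# DummyClassifier ignores the features and always predicts the most frequent class
X_dummy = pd.DataFrame({'unused': [0] * len(y_toy)})
dummy = DummyClassifier(strategy='most_frequent').fit(X_dummy, y_toy)
dummy_accuracy = dummy.score(X_dummy, y_toy)

print(null_accuracy, dummy_accuracy)
```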

## Step 4: Encode categorical features

For unordered categorical features, usually the best approach is dummy encoding, which is also known as one-hot encoding.

- scikit-learn calls it one-hot encoding
- pandas calls it dummy encoding
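
To illustrate that the two names refer to the same encoding, here is a minimal sketch comparing ```pd.get_dummies``` with ```OneHotEncoder``` on a made-up column. The encoder is left at its default (sparse) output and converted with ```.toarray()``` so the example does not depend on how the dense-output argument is named in a given scikit-learn version.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up column, just for comparison
toy = pd.DataFrame({'Embarked': ['S', 'C', 'S', 'Q']})

# pandas: dummy encoding
dummies = pd.get_dummies(toy['Embarked'])
print(list(dummies.columns))

# scikit-learn: one-hot encoding (sparse by default, so convert to dense)
ohe_toy = OneHotEncoder()
encoded = ohe_toy.fit_transform(toy[['Embarked']]).toarray()
print(ohe_toy.categories_[0].tolist())
print(encoded)
```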



In [17]:
df.head()

|   | Survived | Pclass | Sex | Embarked |
|---|---|---|---|---|
| 0 | 0 | 3 | male | S |
| 1 | 1 | 1 | female | C |
| 2 | 1 | 3 | female | S |
| 3 | 1 | 1 | female | S |
| 4 | 0 | 3 | male | S |


### Instantiate

In [18]:
# dummy encoding of categorical features
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)

### Fit and Transform

In [19]:
ohe.fit_transform(df[['Sex']])

array([[0., 1.],
       [1., 0.],
       [1., 0.],
       ...,
       [1., 0.],
       [0., 1.],
       [0., 1.]])

In [20]:
ohe.categories_

[array(['female', 'male'], dtype=object)]

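What is this saying?

- the encoder found two categories, female and male, and created one column for each, in the order shown in ```categories_```
    - a female passenger has a 1 in the first column and a 0 in the second
    - a male passenger has a 0 in the first column and a 1 in the second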
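
One way to check which output column belongs to which category is to loop over ```categories_```, or to map the encoded rows back with ```inverse_transform```. The sketch below does both on a made-up three-row column (the data and the name ```ohe_toy``` are illustrative assumptions).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up column, just to inspect the column order
toy = pd.DataFrame({'Sex': ['male', 'female', 'female']})

ohe_toy = OneHotEncoder()
encoded = ohe_toy.fit_transform(toy[['Sex']]).toarray()

# Column i of the output corresponds to categories_[0][i]
for i, category in enumerate(ohe_toy.categories_[0]):
    print(category, encoded[:, i])

# inverse_transform maps the encoded rows back to the original labels
recovered = ohe_toy.inverse_transform(encoded)
print(recovered.ravel().tolist())
```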

In [21]:
ohe.fit_transform(df[['Embarked']])

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       ...,
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.]])

In [22]:
ohe.categories_

[array(['C', 'Q', 'S'], dtype=object)]

In the above case there are 3 columns because the Embarked feature has 3 different values:

- C
- Q
- S

In [23]:
ohe.fit_transform(df[['Sex', 'Embarked']])

array([[0., 1., 0., 0., 1.],
       [1., 0., 1., 0., 0.],
       [1., 0., 0., 0., 1.],
       ...,
       [1., 0., 0., 0., 1.],
       [0., 1., 1., 0., 0.],
       [0., 1., 0., 1., 0.]])

In [25]:
ohe.categories_

[array(['female', 'male'], dtype=object), array(['C', 'Q', 'S'], dtype=object)]

## Step 5: Cross-validate a Pipeline with all features

The feature matrix should contain only the features of interest, so drop the response column.

In [26]:
X = df.drop('Survived', axis='columns')

In [27]:
X.head()

|   | Pclass | Sex | Embarked |
|---|---|---|---|
| 0 | 3 | male | S |
| 1 | 1 | female | C |
| 2 | 3 | female | S |
| 3 | 1 | female | S |
| 4 | 3 | male | S |


In [28]:
# use when different features need different preprocessing
from sklearn.compose import make_column_transformer

### Instantiate

In [31]:
column_trans = make_column_transformer(
    (OneHotEncoder(), ['Sex', 'Embarked']), # apply OHE on 'sex' and 'embarked' only
    remainder='passthrough') # for the remainder pass it through

You need a column transformer such as ```column_trans``` any time the features in your DataFrame need different preprocessing. In this case, one-hot encoding is applied to Sex and Embarked but not to Pclass, which is numeric and is passed through unchanged.

###  Fit column transformer

In [32]:
column_trans.fit_transform(X)

array([[0., 1., 0., 0., 1., 3.],
       [1., 0., 1., 0., 0., 1.],
       [1., 0., 0., 0., 1., 3.],
       ...,
       [1., 0., 0., 0., 1., 3.],
       [0., 1., 1., 0., 0., 1.],
       [0., 1., 0., 1., 0., 3.]])

In [34]:
# chain sequential steps together
from sklearn.pipeline import make_pipeline

```Pipeline``` and ```make_pipeline``` are functionally equivalent, but ```make_pipeline``` is more convenient because it names the steps automatically. Check the documentation for details.
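
The difference in step naming can be seen in a short sketch. The explicit step names ```'encode'``` and ```'model'``` below are made up for illustration; with ```make_pipeline``` the names are derived from the lowercased class names.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder

# make_pipeline names each step after its lowercased class name
auto_named = make_pipeline(OneHotEncoder(), LogisticRegression(solver='lbfgs'))
auto_names = [name for name, _ in auto_named.steps]
print(auto_names)

# Pipeline needs explicit (name, estimator) pairs; these names are hypothetical
explicit = Pipeline([('encode', OneHotEncoder()),
                     ('model', LogisticRegression(solver='lbfgs'))])
explicit_names = [name for name, _ in explicit.steps]
print(explicit_names)
```

These step names matter later, because grid search parameters are addressed as ```stepname__parameter```.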

So what has been done so far?

- we created ```column_trans``` above to preprocess the features
- we created ```logreg```, a ```LogisticRegression``` instance, to do the classification

In [35]:
pipe = make_pipeline(column_trans, logreg)

### Applying cross-validation to pipeline

In [37]:
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()

0.7727924839713071

By cross-validating the entire process, preprocessing occurs within each fold of cross-validation.

- it splits the feature matrix and response vector into 5 folds
- for each fold, it fits the pipeline on the training portion
    - the column transformer one-hot encodes 'Sex' and 'Embarked' and passes the remainder through
    - the logistic regression is fit on the transformed features
- it applies the fitted pipeline to the held-out portion and calculates the accuracy
- it moves on to the next fold
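
The per-fold process can be written out by hand. The following is a minimal sketch on randomly generated data with the same column names as this notebook (the data, the rule for the response, and the names ```X_toy```, ```y_toy``` and ```pipe_toy``` are assumptions). It fits a fresh copy of the pipeline on each training portion and compares the result with ```cross_val_score```.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up data with the same column names as the Titanic subset
rng = np.random.RandomState(0)
n = 200
X_toy = pd.DataFrame({
    'Pclass': rng.randint(1, 4, size=n),
    'Sex': rng.choice(['female', 'male'], size=n),
    'Embarked': rng.choice(['C', 'Q', 'S'], size=n),
})
# Hypothetical rule: mostly 'female' -> 1, with 20% of labels flipped
y_toy = ((X_toy['Sex'] == 'female') ^ (rng.rand(n) < 0.2)).astype(int)

pipe_toy = make_pipeline(
    make_column_transformer((OneHotEncoder(), ['Sex', 'Embarked']),
                            remainder='passthrough'),
    LogisticRegression(solver='lbfgs'))

cv = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, test_idx in cv.split(X_toy, y_toy):
    fold_pipe = clone(pipe_toy)  # fresh, unfitted copy for this fold
    # Encoder and model are both fit on the training portion only
    fold_pipe.fit(X_toy.iloc[train_idx], y_toy.iloc[train_idx])
    # The held-out portion is transformed with the already fitted encoder
    manual_scores.append(fold_pipe.score(X_toy.iloc[test_idx], y_toy.iloc[test_idx]))

auto_scores = cross_val_score(pipe_toy, X_toy, y_toy, cv=cv, scoring='accuracy')
print(np.allclose(manual_scores, auto_scores))
```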

## Step 6: Make predictions on "new" data

It is not good practice to draw test samples from the training data, but to show how to make predictions for new samples, let's just take 5 random samples from X.

In [38]:
X_new = X.sample(5, random_state=99)
X_new

|   | Pclass | Sex | Embarked |
|---|---|---|---|
| 599 | 1 | male | C |
| 512 | 1 | male | S |
| 273 | 1 | male | C |
| 215 | 1 | female | C |
| 790 | 3 | male | Q |


### Fitting the pipeline

What does this mean? Fitting the pipeline runs the preprocessing as well as fitting the model itself.

The pipeline contains the column transformer as well as the model. All you have to do is fit it on the entire dataset and then predict on new, unseen data.

In [39]:
pipe.fit(X, y)

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('onehotencoder',
                                                  OneHotEncoder(categories='auto',
                                                                drop=None,
                                                                dtype=<class 'numpy.float64'>,
                                                                handle_unknown='error',
                                                                sparse=True),
                                                  ['Sex', 'Embarked'])],
                                   verbose=False)),
                ('logisticregression',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                         

### predicting

In [40]:
pipe.predict(X_new)

array([1, 0, 1, 1, 0], dtype=int64)
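
Because the last step of the pipeline is a logistic regression, the fitted pipeline can also return class probabilities. This is a minimal sketch on made-up data (the rows and the names ```toy```, ```pipe_toy``` and ```X_new_toy``` are assumptions for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up training data with the same columns as this notebook
toy = pd.DataFrame({
    'Pclass':   [1, 3, 2, 3, 1, 2, 3, 1],
    'Sex':      ['female', 'male', 'female', 'male', 'male', 'female', 'male', 'female'],
    'Embarked': ['C', 'S', 'Q', 'S', 'C', 'S', 'Q', 'C'],
    'Survived': [1, 0, 1, 0, 0, 1, 0, 1],
})
X_toy = toy.drop('Survived', axis='columns')
y_toy = toy.Survived

pipe_toy = make_pipeline(
    make_column_transformer((OneHotEncoder(), ['Sex', 'Embarked']),
                            remainder='passthrough'),
    LogisticRegression(solver='lbfgs'))
pipe_toy.fit(X_toy, y_toy)

# New rows pass through the fitted encoder before reaching the model
X_new_toy = pd.DataFrame({'Pclass': [1, 3],
                          'Sex': ['female', 'male'],
                          'Embarked': ['C', 'S']})
labels = pipe_toy.predict(X_new_toy)
probs = pipe_toy.predict_proba(X_new_toy)  # columns follow pipe_toy.classes_
print(labels)
print(probs)
```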

## Recap

For more on this, check out the documentation [here](https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html).

### Imports

In [74]:
import pandas as pd
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

### Reading and feature selection

In [75]:
df = pd.read_csv('http://bit.ly/kaggletrain')
df = df.loc[df.Embarked.notna(), ['Survived', 'Pclass', 'Sex', 'Embarked']]
X = df.drop('Survived', axis='columns')
y = df.Survived

### Instantiate Column Transformer and Model

In [76]:
column_trans = make_column_transformer(
    (OneHotEncoder(), ['Sex', 'Embarked']),
    remainder='passthrough')
logreg = LogisticRegression(max_iter=10000, tol=0.1, solver='lbfgs')

### Instantiate pipeline

In [77]:
pipe = make_pipeline(column_trans, logreg)

### Cross Validate Model OR Grid Search

#### Cross Validation only

In [59]:
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()

0.7727924839713071

#### Grid Search

In [81]:
param_grid = {    
    'logisticregression__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
}

In [82]:
grid = GridSearchCV(pipe, param_grid, cv=10, scoring='accuracy', return_train_score=False)

In [83]:
grid.fit(X, y)

GridSearchCV(cv=10, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('columntransformer',
                                        ColumnTransformer(n_jobs=None,
                                                          remainder='passthrough',
                                                          sparse_threshold=0.3,
                                                          transformer_weights=None,
                                                          transformers=[('onehotencoder',
                                                                         OneHotEncoder(categories='auto',
                                                                                       drop=None,
                                                                                       dtype=<class 'numpy.float64'>,
                                                                                       handle_unknown='error',
                

In [85]:
grid.cv_results_

{'mean_fit_time': array([0.01140068, 0.01020057, 0.01200066, 0.01320083, 0.01150069,
        0.01070063, 0.0177011 ]),
 'std_fit_time': array([0.00326199, 0.00060004, 0.00077469, 0.00060008, 0.00136028,
        0.00064033, 0.00954048]),
 'mean_score_time': array([0.00500028, 0.00360029, 0.00360022, 0.00400019, 0.00380018,
        0.00350022, 0.00530026]),
 'std_score_time': array([0.00402513, 0.00066325, 0.00048991, 0.00063249, 0.00060003,
        0.00049996, 0.001735  ]),
 'param_logisticregression__C': masked_array(data=[0.001, 0.01, 0.1, 1, 10, 100, 1000],
              mask=[False, False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'logisticregression__C': 0.001},
  {'logisticregression__C': 0.01},
  {'logisticregression__C': 0.1},
  {'logisticregression__C': 1},
  {'logisticregression__C': 10},
  {'logisticregression__C': 100},
  {'logisticregression__C': 1000}],
 'split0_test_score': array([0.61797753, 0.74157303, 0.7640449

In [84]:
pd.DataFrame(grid.cv_results_)[['mean_test_score', 'std_test_score', 'params']]

|   | mean_test_score | std_test_score | params |
|---|---|---|---|
| 0 | 0.617543 | 0.001302 | {'logisticregression__C': 0.001} |
| 1 | 0.791918 | 0.038261 | {'logisticregression__C': 0.01} |
| 2 | 0.773889 | 0.022297 | {'logisticregression__C': 0.1} |
| 3 | 0.770518 | 0.019699 | {'logisticregression__C': 1} |
| 4 | 0.770518 | 0.019699 | {'logisticregression__C': 10} |
| 5 | 0.770518 | 0.019699 | {'logisticregression__C': 100} |
| 6 | 0.770518 | 0.019699 | {'logisticregression__C': 1000} |


In [88]:
grid.best_score_

0.791917773237998
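
A grid search over a pipeline can tune preprocessing parameters alongside model parameters. The sketch below uses randomly generated data (an assumption, not the Titanic data) and adds the encoder's ```drop``` parameter to the grid. ```drop``` is only available in scikit-learn versions whose ```OneHotEncoder``` has that parameter, and whether it helps on real data is not something this example shows.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up data with the same column names as this notebook
rng = np.random.RandomState(0)
n = 200
X_toy = pd.DataFrame({
    'Pclass': rng.randint(1, 4, size=n),
    'Sex': rng.choice(['female', 'male'], size=n),
    'Embarked': rng.choice(['C', 'Q', 'S'], size=n),
})
# Hypothetical rule: mostly 'female' -> 1, with 20% of labels flipped
y_toy = ((X_toy['Sex'] == 'female') ^ (rng.rand(n) < 0.2)).astype(int)

pipe_toy = make_pipeline(
    make_column_transformer((OneHotEncoder(), ['Sex', 'Embarked']),
                            remainder='passthrough'),
    LogisticRegression(solver='lbfgs'))

# Parameters are addressed as step__parameter, nested for the column transformer
param_grid_toy = {
    'columntransformer__onehotencoder__drop': [None, 'first'],
    'logisticregression__C': [0.1, 1, 10],
}
grid_toy = GridSearchCV(pipe_toy, param_grid_toy, cv=5, scoring='accuracy')
grid_toy.fit(X_toy, y_toy)

print(grid_toy.best_params_)
print(grid_toy.best_score_)

# After fitting, the grid object predicts with the best pipeline refit on all the data
print(grid_toy.predict(X_toy.head()))
```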

### Fit and predict on out-of-sample data

In [None]:
X_new = X.sample(5, random_state=99)

In [None]:
pipe.fit(X, y)

In [None]:
pipe.predict(X_new)