# sklearn-pandas: don't be `pd.get_dummies()`

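Today we're talking about [`sklearn-pandas`](https://github.com/scikit-learn-contrib/sklearn-pandas#sklearn-pandas), a library that maps pandas DataFrame columns to scikit-learn transformations. Compared with `pd.get_dummies()`, this approach: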

1. Prevents data leakage
2. Works with new data!

Pair Programmed by Miles Erickson, Brian McGarry, and Cristian Nuno
Date: May 16, 2019
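
To make the two points above concrete, here is a minimal sketch (not part of the original session) of how `pd.get_dummies()` can go wrong on new data. The tiny `train` and `new` frames are made up for illustration; `OneHotEncoder` is the scikit-learn encoder used later in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: the training rows cover all three classes,
# while the new batch happens to contain only third-class passengers.
train = pd.DataFrame({"Pclass": [1, 2, 3, 3]})
new = pd.DataFrame({"Pclass": [3, 3]})

# get_dummies builds columns from whatever values it sees at call time,
# so the two frames end up with different columns.
train_dummies = pd.get_dummies(train["Pclass"], prefix="Pclass")
new_dummies = pd.get_dummies(new["Pclass"], prefix="Pclass")
print(list(train_dummies.columns))  # ['Pclass_1', 'Pclass_2', 'Pclass_3']
print(list(new_dummies.columns))    # ['Pclass_3']

# A fitted encoder remembers the categories seen during fit and
# produces the same three columns for any later batch.
encoder = OneHotEncoder(categories="auto")
encoder.fit(train[["Pclass"]])
new_encoded = encoder.transform(new[["Pclass"]]).toarray()
print(new_encoded.shape)  # (2, 3)
```

A model trained on three `Pclass` columns cannot score a frame that has only one, which is why the encoding has to be learned once and then reused.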

## Download necessary data

Today we're using a smaller version of the famous [`titanic` data set](https://gist.github.com/michhar/2dfd2de0d4f8727f873422c5d959fff5).

In [1]:
!wget https://gist.github.com/michhar/2dfd2de0d4f8727f873422c5d959fff5/raw/ff414a1bcfcba32481e4d4e8db578e55872a2ca1/titanic.csv

--2019-05-16 17:10:29--  https://gist.github.com/michhar/2dfd2de0d4f8727f873422c5d959fff5/raw/ff414a1bcfcba32481e4d4e8db578e55872a2ca1/titanic.csv
Resolving gist.github.com (gist.github.com)... 192.30.255.119
Connecting to gist.github.com (gist.github.com)|192.30.255.119|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://gist.githubusercontent.com/michhar/2dfd2de0d4f8727f873422c5d959fff5/raw/ff414a1bcfcba32481e4d4e8db578e55872a2ca1/titanic.csv [following]
--2019-05-16 17:10:29--  https://gist.githubusercontent.com/michhar/2dfd2de0d4f8727f873422c5d959fff5/raw/ff414a1bcfcba32481e4d4e8db578e55872a2ca1/titanic.csv
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 151.101.128.133, 151.101.192.133, 151.101.0.133, ...
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|151.101.128.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10305 (10K) [text/plain]
Saving to: ‘titanic.csv.1’


## Install `sklearn-pandas`

In [2]:
!pip install sklearn-pandas



## Load necessary modules

In [3]:
from sklearn_pandas import DataFrameMapper, FunctionTransformer
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

## Load necessary data

In [6]:
titanic = pd.read_csv("titanic.csv", delimiter="\t") # the file is tab-separated despite its .csv extension
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [7]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156 entries, 0 to 155
Data columns (total 12 columns):
PassengerId    156 non-null int64
Survived       156 non-null int64
Pclass         156 non-null int64
Name           156 non-null object
Sex            156 non-null object
Age            126 non-null float64
SibSp          156 non-null int64
Parch          156 non-null int64
Ticket         156 non-null object
Fare           156 non-null float64
Cabin          31 non-null object
Embarked       155 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 14.7+ KB


## Identify Data Cleaning Steps

* `Age` has missing values, so it needs cleaning first. Create a flag that identifies records with a missing `Age` by using a custom function;
* [Impute missing values](https://scikit-learn.org/stable/modules/impute.html#impute) by replacing each missing `Age` with the median `Age`;
* Convert `Sex` from a string to 1 for female and 0 otherwise by using a custom function;
* Pass `Fare` and `SibSp` through unchanged;
* [One hot encode](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) the `Pclass` column.

In [8]:
from sklearn.preprocessing import OneHotEncoder

# Replace missing values with the median learned during fit
imp_median = SimpleImputer(missing_values=np.nan, strategy='median')

def is_female(x):
    """Assigns 1 if female; 0 otherwise"""
    if x == "female":
        return 1
    else:
        return 0

def is_missing(x):
    """Assigns 1 if the value is missing; 0 otherwise"""
    if pd.isna(x):
        return 1
    else:
        return 0

# Each tuple is (column selector, transformer, optional options)
mapper = DataFrameMapper([
    # flag rows where Age is missing (computed from the raw column)
    (["Age"], FunctionTransformer(is_missing), {'alias': 'age_missing'}),
    (["Age"], imp_median),
    ("Sex", FunctionTransformer(is_female)),
    # None keeps the selected columns unchanged
    (["Fare", "SibSp"], None),
    (["Pclass"], OneHotEncoder(categories='auto')),
], df_out=True)
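
As a sanity check on the two custom functions, the following sketch computes the same flags directly with pandas on a made-up frame (the `toy` data is hypothetical). It does not use `sklearn-pandas`; it only shows which values `is_female` and `is_missing` are expected to produce for each row, next to their vectorized pandas equivalents.

```python
import numpy as np
import pandas as pd

def is_female(x):
    """Assigns 1 if female; 0 otherwise"""
    return 1 if x == "female" else 0

def is_missing(x):
    """Assigns 1 if the value is missing; 0 otherwise"""
    return 1 if pd.isna(x) else 0

# Hypothetical rows standing in for the Titanic columns
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, np.nan, 35.0],
})

# Element-wise application of the custom functions
female_flag = toy["Sex"].map(is_female)
age_missing = toy["Age"].map(is_missing)

# Vectorized pandas expressions that give the same flags
female_vec = (toy["Sex"] == "female").astype(int)
missing_vec = toy["Age"].isna().astype(int)

print(female_flag.tolist())  # [0, 1, 1]
print(age_missing.tolist())  # [0, 1, 0]
```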

## Train, Test, Split the `titanic` data set

Here, we set aside 30% of our records for the testing set. We also set `random_state` so that the split is reproducible.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(titanic.drop("Survived", axis=1),
                                                    titanic["Survived"],
                                                    test_size=0.3,
                                                    random_state=2019)
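
To illustrate what `random_state` buys us, here is a small sketch on made-up arrays (`X` and `y` below are hypothetical stand-ins, not the Titanic data). Calling `train_test_split` twice with the same seed should return identical partitions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two independent calls with the same seed
first = train_test_split(X, y, test_size=0.3, random_state=2019)
second = train_test_split(X, y, test_size=0.3, random_state=2019)

# Compare X_train, X_test, y_train, y_test pairwise
same = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same)  # True
```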

In [10]:
type(mapper)

sklearn_pandas.dataframe_mapper.DataFrameMapper

Now let's fit `mapper` on `X_train` and then transform, in two separate steps.

In [11]:
mapper.fit(X_train)

DataFrameMapper(default=False, df_out=True,
        features=[(['Age'], FunctionTransformer(func=None), {'alias': 'age_missing'}), (['Age'], SimpleImputer(copy=True, fill_value=None, missing_values=nan,
       strategy='median', verbose=0)), ('Sex', FunctionTransformer(func=None)), (['Fare', 'SibSp'], None), (['Pclass'], OneHotEncoder(categorical_features=None, categories='auto',
       dtype=<class 'numpy.float64'>, handle_unknown='error',
       n_values=None, sparse=True))],
        input_df=False, sparse=False)

In [12]:
train_output = mapper.transform(X_train)
train_output.head()

Unnamed: 0,age_missing,Age,Sex,Fare_SibSp_0,Fare_SibSp_1,Pclass_x0_1,Pclass_x0_2,Pclass_x0_3
34,0,28.0,0,82.1708,1.0,1.0,0.0,0.0
61,0,38.0,1,80.0,0.0,1.0,0.0,0.0
143,0,19.0,0,6.75,0.0,0.0,0.0,1.0
39,0,14.0,1,11.2417,1.0,0.0,0.0,1.0
13,0,39.0,0,31.275,1.0,0.0,0.0,1.0


In [13]:
test_output = mapper.transform(X_test)
test_output.head()

Unnamed: 0,age_missing,Age,Sex,Fare_SibSp_0,Fare_SibSp_1,Pclass_x0_1,Pclass_x0_2,Pclass_x0_3
38,0,18.0,1,18.0,2.0,0.0,0.0,1.0
132,0,47.0,1,14.5,1.0,0.0,0.0,1.0
107,1,26.0,0,7.775,0.0,0.0,0.0,1.0
66,0,29.0,1,10.5,0.0,0.0,1.0,0.0
18,0,31.0,1,18.0,1.0,0.0,0.0,1.0
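
In the test output above, row 107 has `age_missing` equal to 1 and an `Age` of 26.0, which is presumably the median age learned from the training set. The sketch below shows that behavior in isolation with `SimpleImputer` on made-up ages (both arrays are hypothetical): the imputer is fit on the training values only, so the test set never influences the fill value.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical ages; the test ages are deliberately much higher
train_age = np.array([[20.0], [30.0], [40.0], [np.nan]])
test_age = np.array([[np.nan], [80.0], [90.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="median")
imputer.fit(train_age)          # learns the median of 20, 30, 40
print(imputer.statistics_)      # [30.]

# The missing test value is filled with the training median (30.0),
# not with the median of the test values (85.0).
filled = imputer.transform(test_age)
print(filled[0, 0])             # 30.0
```

Computing the median on the full data set before splitting would let test-set information leak into training, which is the leakage this workflow avoids.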


## Create a Pipeline

We can apply our data preprocessing steps and fit a model in one [pipeline](https://scikit-learn.org/stable/modules/compose.html#pipeline). For this case, we are choosing a Decision Tree Classifier as the model.

In [14]:
model = DecisionTreeClassifier(max_depth=4)

In [15]:
pipe = Pipeline(steps=[
    # mapper can live in a .py file, or you can export it as a pickle
    ("dataprep", mapper),
    # make sure you do cross-validation
    ("model", model)
])

Fit the pipeline on `X_train` and `y_train`.

In [16]:
pipe.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('dataprep', DataFrameMapper(default=False, df_out=True,
        features=[(['Age'], FunctionTransformer(func=None), {'alias': 'age_missing'}), (['Age'], SimpleImputer(copy=True, fill_value=None, missing_values=nan,
       strategy='median', verbose=0)), ('Sex', FunctionTransformer(func=None)...      min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'))])

Store the predicted class probabilities for the test set.

In [17]:
y_pred = pipe.predict_proba(X_test)

In [18]:
y_pred[:5]

array([[1.        , 0.        ],
       [1.        , 0.        ],
       [0.66666667, 0.33333333],
       [0.11111111, 0.88888889],
       [0.        , 1.        ]])

Now let's calculate the [log loss](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html).

In [19]:
log_loss(y_test, y_pred)

4.592536845124868
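
A log loss above 4 is large for a binary problem. One plausible explanation, given the hard 0 and 1 probabilities in the predictions above, is that a few confident but wrong predictions dominate the average. The sketch below computes binary log loss by hand on made-up labels and probabilities; the helper `binary_log_loss` and the clipping constant `eps` are assumptions for illustration, not necessarily what scikit-learn does internally.

```python
import math

def binary_log_loss(y_true, p_positive, eps=1e-15):
    """Mean negative log-likelihood for binary labels (illustrative helper)."""
    total = 0.0
    for y, p in zip(y_true, p_positive):
        # Clip so that log(0) never occurs for probabilities of exactly 0 or 1
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical labels and predicted probabilities of the positive class
y_true = [1, 0, 1, 0]
hedged = [0.8, 0.2, 0.7, 0.3]          # all correct, moderately confident
overconfident = [1.0, 0.0, 1.0, 1.0]   # last prediction is certain and wrong

loss_hedged = binary_log_loss(y_true, hedged)
loss_overconfident = binary_log_loss(y_true, overconfident)
print(loss_hedged)
print(loss_overconfident)
```

In this toy case a single certain-but-wrong prediction makes the second loss far larger than the first, even though three of its four predictions are perfect.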

## Now let's update our pipeline

This time let's use a Logistic Regression model.

In [22]:
model = LogisticRegression(solver="lbfgs", max_iter=1000)
pipe = Pipeline(steps=[
    ("dataprep", mapper),
    ("model", model)
])

In [23]:
pipe.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('dataprep', DataFrameMapper(default=False, df_out=True,
        features=[(['Age'], FunctionTransformer(func=None), {'alias': 'age_missing'}), (['Age'], SimpleImputer(copy=True, fill_value=None, missing_values=nan,
       strategy='median', verbose=0)), ('Sex', FunctionTransformer(func=None)...enalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False))])

In [24]:
y_pred = pipe.predict_proba(X_test)

In [25]:
log_loss(y_test, y_pred)

0.40080929251529374
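
The comment in the first pipeline cell recommends cross-validation, while the scores above come from a single train/test split. Here is a minimal sketch of cross-validating a whole pipeline with `cross_val_score`. It is written under loud assumptions: synthetic data from `make_classification` stands in for the Titanic columns, and a plain `SimpleImputer` stands in for `mapper`, so the numbers are not comparable to those above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data with roughly 10% of the values knocked out
X, y = make_classification(n_samples=150, n_features=5, random_state=2019)
rng = np.random.RandomState(2019)
X[rng.rand(*X.shape) < 0.1] = np.nan

pipe = Pipeline(steps=[
    ("dataprep", SimpleImputer(strategy="median")),  # stand-in for mapper
    ("model", LogisticRegression(solver="lbfgs", max_iter=1000)),
])

# The preprocessing step is refit inside every fold, so each validation
# fold is imputed with statistics learned from its own training folds.
scores = cross_val_score(pipe, X, y, cv=5, scoring="neg_log_loss")
print(-scores.mean())  # average log loss across the 5 folds
```

Because the preprocessing lives inside the pipeline, cross-validation keeps the same leakage protection as the single split.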