# Data Preprocessing

See [Feature Engineering and Data Preprocessing](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.04-Feature-Engineering.ipynb) for more in-depth coverage. 

In this section we will take a look at basic preprocessing steps.

First let's make some sample data:

In [1]:
import pandas as pd

X = pd.DataFrame(dict(
    a = range(5),
    b = [-100, -50, 0, 200, 1000]
))

In [2]:
X

Unnamed: 0,a,b
0,0,-100
1,1,-50
2,2,0
3,3,200
4,4,1000


## Standardize

Some algorithms, such as SVM, perform better when the data is *standardized*: each column should have a mean of 0 and a standard deviation of 1.

Scikit-learn provides a `.fit_transform` method that combines `.fit` and `.transform`:

In [3]:
from sklearn import preprocessing
std = preprocessing.StandardScaler()

In [4]:
std.fit_transform(X)

array([[-1.41421356, -0.75995002],
       [-0.70710678, -0.63737744],
       [ 0.        , -0.51480485],
       [ 0.70710678, -0.02451452],
       [ 1.41421356,  1.93664683]])

After fitting, there are various attributes we can inspect:

In [5]:
std.scale_

array([  1.41421356, 407.92156109])

In [6]:
std.mean_

array([  2., 210.])

In [7]:
std.var_

array([2.000e+00, 1.664e+05])

In [8]:
import numpy as np

In [10]:
np.mean(std.fit_transform(X))

0.0
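
As a minimal sketch of what the scaler computes, the transform can be reproduced by hand from the fitted `mean_` and `scale_` attributes. The `X_new` frame is a hypothetical new sample, added only to illustrate reusing the fitted statistics instead of refitting:

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

X = pd.DataFrame(dict(a=range(5), b=[-100, -50, 0, 200, 1000]))
std = preprocessing.StandardScaler().fit(X)

# Standardize by hand: subtract the column mean and divide by the
# population standard deviation (ddof=0), which is what scale_ holds.
manual = (X.to_numpy() - std.mean_) / std.scale_
matches = np.allclose(manual, std.transform(X))

# Reuse the fitted statistics on new rows instead of refitting.
X_new = pd.DataFrame(dict(a=[10], b=[210]))  # hypothetical new sample
X_new_scaled = std.transform(X_new)
```

Because `b=210` equals the training mean of column `b`, it maps to 0 after scaling.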

## Scale to Range

Scaling to range translates the data so that it lies between 0 and 1, inclusive. Having the data bounded may be useful. However, if you have outliers, be careful with this approach, since a single extreme value compresses the remaining values into a narrow band:

In [7]:
mms = preprocessing.MinMaxScaler()

In [8]:
mms.fit_transform(X)

array([[0.        , 0.        ],
       [0.25      , 0.04545455],
       [0.5       , 0.09090909],
       [0.75      , 0.27272727],
       [1.        , 1.        ]])
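
The min-max formula is `(x - min) / (max - min)` per column. The sketch below checks that against the scaler and then rescales column `b` without its largest value, to illustrate how much the outlier squeezes the other rows. The `b_no_outlier` frame is an assumption made for this illustration:

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

X = pd.DataFrame(dict(a=range(5), b=[-100, -50, 0, 200, 1000]))
mms = preprocessing.MinMaxScaler().fit(X)

# Min-max scaling by hand, column by column.
manual = (X - X.min()) / (X.max() - X.min())
matches = np.allclose(manual.to_numpy(), mms.transform(X))

# Same column without the 1000 outlier (illustrative only):
# the remaining values now spread over the whole [0, 1] range.
b_no_outlier = pd.DataFrame(dict(b=[-100, -50, 0, 200]))
rescaled = preprocessing.MinMaxScaler().fit_transform(b_no_outlier)
```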

## Dummy Variables

We can use pandas to create dummy variables from categorical data. This is also referred to as *one-hot encoding* or *indicator encoding*. Dummy variables are especially useful if the data is nominal (unordered). The `get_dummies` function in pandas creates multiple columns for a categorical column, each with a 1 or 0 depending on whether the original column had that value:

In [15]:
X_cat = pd.DataFrame(dict(
    name = ["George","Paul"],
    inst = ["Bass","Guitar"]
))

In [16]:
X_cat

Unnamed: 0,name,inst
0,George,Bass
1,Paul,Guitar


Note the use of the `drop_first` option to eliminate a column. *One of the dummy columns is a linear combination of the other columns*, so it carries no extra information.

In [17]:
pd.get_dummies(X_cat, drop_first=True)

Unnamed: 0,name_Paul,inst_Guitar
0,0,0
1,1,1
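
To see the redundancy, here is a small sketch that builds the full set of dummy columns and checks that the two `name_*` columns always sum to 1. The second half shows one possible way to align future data to the training columns; the `X_new` row is hypothetical:

```python
import pandas as pd

X_cat = pd.DataFrame(dict(name=["George", "Paul"], inst=["Bass", "Guitar"]))

# Without drop_first we get one column per category value.
full = pd.get_dummies(X_cat, dtype=int)
# name_George is fully determined by name_Paul (they sum to 1).
redundant = (full["name_George"] + full["name_Paul"] == 1).all()

# Align new data to the columns seen during training; categories
# that were not seen in training (here "Drums") are dropped.
train_cols = pd.get_dummies(X_cat, drop_first=True, dtype=int).columns
X_new = pd.DataFrame(dict(name=["Paul"], inst=["Drums"]))  # hypothetical
new_dummies = pd.get_dummies(X_new, dtype=int).reindex(
    columns=train_cols, fill_value=0
)
```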


pyjanitor can split a delimited column into indicator columns with the `expand_column` function:

In [12]:
X_cat_2 = pd.DataFrame(dict(
    A = [1,None,3],
    names = ['Fred,George','George','John,Paul']
))

In [14]:
X_cat_2

Unnamed: 0,A,names
0,1.0,"Fred,George"
1,,George
2,3.0,"John,Paul"


In [None]:
#!pip install pyjanitor # Use this to install pyjanitor

In [20]:
import janitor as jn

In [21]:
jn.expand_column(X_cat_2, "names", sep=",").drop(columns="Fred")

Unnamed: 0,A,names,George,John,Paul
0,1.0,"Fred,George",1,0,0
1,,George,1,0,0
2,3.0,"John,Paul",0,1,1
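
If installing pyjanitor is not an option, a similar result can be sketched with pandas alone using `Series.str.get_dummies`. This is an alternative approach, not what `expand_column` does internally:

```python
import pandas as pd

X_cat_2 = pd.DataFrame(dict(
    A=[1, None, 3],
    names=["Fred,George", "George", "John,Paul"],
))

# One indicator column per distinct value found between the separators.
indicators = X_cat_2["names"].str.get_dummies(sep=",")

# Attach the indicators and drop one of them, as in the example above.
expanded = X_cat_2.join(indicators).drop(columns="Fred")
```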


## Label Encoder

If we have high cardinality nominal data, we can use *label encoding*. This takes categorical data and assigns each value a number. The encoder imposes ordinality, which may or may not be desired. It can take up less space than one-hot encoding, and some (tree) algorithms can deal with this encoding.

The label encoder can only deal with one column at a time.

In [22]:
lab = preprocessing.LabelEncoder()

In [23]:
lab.fit_transform(X_cat["inst"])

array([0, 1])

If you have encoded values, apply the `.inverse_transform` method to decode them:

In [24]:
lab.inverse_transform([1,1,0])

array(['Guitar', 'Guitar', 'Bass'], dtype=object)
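
A short sketch of two details worth knowing: the fitted `classes_` attribute holds the sorted labels (a label's position is its code), and transforming a label that was not seen during fitting raises a `ValueError`. The `"Drums"` label is a hypothetical unseen value:

```python
from sklearn import preprocessing

lab = preprocessing.LabelEncoder()
codes = lab.fit_transform(["Bass", "Guitar", "Bass"])

# Sorted unique labels; the index of each label is its integer code.
classes = list(lab.classes_)

# Labels that were not present during fit cannot be encoded.
try:
    lab.transform(["Drums"])  # hypothetical unseen label
    unseen_ok = True
except ValueError:
    unseen_ok = False
```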

## Frequency Encoding

Another option for handling high cardinality categorical data is to *frequency encode* it. This means replacing the name of the category with the count it had in the training data. We will use pandas to do this.

First, use the `value_counts` method to make a mapping. With the mapping we can use the `.map` method to do the encoding:

In [27]:
mapping = X_cat["name"].value_counts()

In [28]:
mapping

Paul      1
George    1
Name: name, dtype: int64

In [30]:
X_cat['name'].map(mapping)

0    1
1    1
Name: name, dtype: int64

Make sure to store the training mapping so you can use it to encode future data with the same mapping.
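
Here is a minimal sketch of reusing a stored mapping on new data. The `train` and `new` series are hypothetical, and filling unseen categories with 0 is one possible choice, not the only one:

```python
import pandas as pd

# Hypothetical training column with repeated categories.
train = pd.Series(["Paul", "George", "Paul", "John"], name="name")
mapping = train.value_counts()  # category -> count in training data

# New data may contain categories that never appeared in training.
new = pd.Series(["Paul", "Ringo"], name="name")

# Unseen categories map to NaN; here we treat them as count 0.
encoded = new.map(mapping).fillna(0).astype(int)
```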

## Pulling Categories from Strings

One way to increase the accuracy of the Titanic model is to pull out titles from the names.

Here we use some basic regular expressions.

In [16]:

url = (
    "http://biostat.mc.vanderbilt.edu/"
    "wiki/pub/Main/DataSets/titanic3.xls"
)
df = pd.read_excel(url)

In [20]:
df['name'].str.extract(r"(\w+)\.", expand=False).value_counts()

Mr          757
Miss        260
Mrs         197
Master       61
Dr            8
Rev           8
Col           4
Ms            2
Major         2
Mlle          2
Mme           1
Don           1
Countess      1
Sir           1
Lady          1
Capt          1
Dona          1
Jonkheer      1
Name: name, dtype: int64

Use pandas to create dummy variables from these titles, or combine titles with low counts into other categories (or drop them).
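
As a sketch of combining low-count titles, the example below works on a handful of names (the last two are hypothetical) so that it runs without downloading the dataset. The threshold of 2 and the `"Other"` label are assumptions chosen for illustration:

```python
import pandas as pd

names = pd.Series([
    "Allen, Miss. Elisabeth Walton",
    "Allison, Master. Hudson Trevor",
    "Allison, Mr. Hudson Joshua Creighton",
    "Brown, Mr. John",   # hypothetical passenger
    "Smith, Dr. Jane",   # hypothetical passenger
])

# The word right before the first period is the title.
titles = names.str.extract(r"(\w+)\.", expand=False)

# Titles seen fewer than 2 times are lumped into "Other".
counts = titles.value_counts()
rare = counts[counts < 2].index
titles_grouped = titles.where(~titles.isin(rare), "Other")

# The grouped titles can then be one-hot encoded.
title_dummies = pd.get_dummies(titles_grouped, prefix="title", dtype=int)
```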

## Other Categorical Encoding

The `category_encoders` library is a set of scikit-learn transformers used to convert categorical data into numeric data.

One algorithm implemented in the library is a hash encoder. This is useful if you don't know how many categories you have ahead of time or if you are using a bag of words to represent text.
This will hash the categorical columns into `n_components` columns.

If you are using online learning (models that can be updated), this is very handy.

In [51]:
#!pip install category_encoders

In [52]:
import category_encoders as ce

In [53]:
he = ce.HashingEncoder(verbose=1)

In [54]:
he.fit_transform(X_cat)

Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7
0,0,0,0,1,0,1,0,0
1,0,2,0,0,0,0,0,0
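
To illustrate the idea behind hashing, here is a standard-library sketch of the hashing trick. It is not the `HashingEncoder` implementation: the helper names `hash_bucket` and `hash_encode` are hypothetical, and the choice of MD5 is an assumption made for this example:

```python
import hashlib


def hash_bucket(value, n_components=8):
    """Map any value to a stable bucket index in [0, n_components)."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_components


def hash_encode(values, n_components=8):
    """Encode one row of categorical values as bucket counts."""
    row = [0] * n_components
    for value in values:
        row[hash_bucket(value, n_components)] += 1
    return row


# Each row becomes a fixed-width vector, regardless of how many
# distinct categories exist; different values may collide.
row_0 = hash_encode(["George", "Bass"])
row_1 = hash_encode(["Paul", "Guitar"])
```

Because the bucket depends only on the value itself, no category list has to be stored, which is why the approach suits online learning.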


The ordinal encoder can convert categorical columns that have order to a single column of numbers. Here we convert the size column to ordinal numbers. If a value is missing from the mapping dictionary, the default value of `-1` is used:

In [34]:
size_df = pd.DataFrame(dict(
    name = ["Eli","Xu","Markum"],
    size = ["small","med","xxl"]
))

In [57]:
size_map = dict(small = 1, med = 2, lg = 3)
ore = ce.OrdinalEncoder(
    mapping = [
        dict(col = "size",
            mapping = size_map)
    ]
)

In [58]:
ore.fit_transform(size_df)

Unnamed: 0,name,size
0,Eli,1.0
1,Xu,2.0
2,Markum,-1.0
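
The same result can be sketched with pandas alone by mapping the column through the dictionary and filling unmapped values with `-1`. This is an alternative approach, not the encoder's internal logic:

```python
import pandas as pd

size_df = pd.DataFrame(dict(
    name=["Eli", "Xu", "Markum"],
    size=["small", "med", "xxl"],
))
size_map = dict(small=1, med=2, lg=3)

# Values missing from the mapping ("xxl") become NaN, then -1.
encoded = size_df["size"].map(size_map).fillna(-1)
```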


If you have high cardinality data (a large number of unique values) consider using one of the Bayesian encoders that output a single column per categorical column. There are `TargetEncoder`, `LeaveOneOutEncoder`, `WOEEncoder`, `JamesSteinEncoder`, and `MEstimateEncoder`.

For example, to encode the Titanic title (categorical) information as a blend of the posterior probability of survival given the title and the prior probability of survival, use the following:

In [59]:
def get_title(df):
    return df['name'].str.extract(r"(\w+)\.", expand=False)

In [60]:
te = ce.TargetEncoder(cols="Title")

In [62]:
te.fit_transform(df.assign(Title=get_title), df['survived'])['Title'].head()

0    0.676923
1    0.508197
2    0.676923
3    0.162483
4    0.786802
Name: Title, dtype: float64
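
To make the "blend" concrete, here is a simplified sketch of smoothed target encoding on a tiny hypothetical frame. The weighting formula (an m-estimate with `m = 2`) is an assumption for illustration and is not necessarily the formula `TargetEncoder` uses:

```python
import pandas as pd

# Hypothetical data: a title and whether the passenger survived.
toy = pd.DataFrame(dict(
    Title=["Mr", "Mr", "Mr", "Miss", "Miss", "Dr"],
    survived=[0, 0, 1, 1, 1, 0],
))

prior = toy["survived"].mean()  # overall survival rate
stats = toy.groupby("Title")["survived"].agg(["mean", "count"])

# Blend the per-title mean with the prior; titles with few rows
# are pulled more strongly toward the prior.
m = 2.0  # smoothing strength (assumed)
smoothed = (stats["count"] * stats["mean"] + m * prior) / (stats["count"] + m)

encoded = toy["Title"].map(smoothed)
```

In practice the statistics should come from training data only, since they are derived from the target.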

## Date Feature Engineering

The `fastai` library has an `add_datepart` function that will generate date attribute columns based on a datetime column. This is useful as most ML algorithms would not be able to infer this type of signal from a numeric representation of a date:

In [64]:
#!pip install fastai

In [65]:
from fastai.tabular.transform import add_datepart

In [66]:
dates = pd.DataFrame(dict(
    A = pd.to_datetime(["9/17/2001","Jan 1, 2002"])
))

In [67]:
dates

Unnamed: 0,A
0,2001-09-17
1,2002-01-01


In [68]:
add_datepart(dates, "A")

Unnamed: 0,AYear,AMonth,AWeek,ADay,ADayofweek,ADayofyear,AIs_month_end,AIs_month_start,AIs_quarter_end,AIs_quarter_start,AIs_year_end,AIs_year_start,AElapsed
0,2001,9,38,17,0,260,False,False,False,False,False,False,1000684800
1,2002,1,1,1,1,1,False,True,False,True,False,True,1009843200


In [69]:
dates.T # .T == .transpose()

Unnamed: 0,0,1
AYear,2001,2002
AMonth,9,1
AWeek,38,1
ADay,17,1
ADayofweek,0,1
ADayofyear,260,1
AIs_month_end,False,False
AIs_month_start,False,True
AIs_quarter_end,False,False
AIs_quarter_start,False,True
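
Several of these columns can also be derived with the pandas `.dt` accessor alone. This sketch is an alternative to `add_datepart`, not its implementation, and the column names simply mirror the output above:

```python
import pandas as pd

dates = pd.DataFrame(dict(
    A=pd.to_datetime(["2001-09-17", "2002-01-01"])
))

col = dates["A"]
features = pd.DataFrame(dict(
    AYear=col.dt.year,
    AMonth=col.dt.month,
    ADay=col.dt.day,
    ADayofweek=col.dt.dayofweek,  # Monday is 0
    ADayofyear=col.dt.dayofyear,
    AIs_month_start=col.dt.is_month_start,
    AIs_year_start=col.dt.is_year_start,
    # Seconds since the Unix epoch.
    AElapsed=(col - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s"),
))
```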


## Manual Feature Engineering

We can use pandas to generate new features.

For the Titanic dataset, we can add aggregate cabin data. To get aggregate data per cabin and merge it back in, we use the `.groupby` method to create the data, then align it with the original data using the `.merge` method.

In [70]:
agg = df.groupby("cabin").agg("min,max,mean,sum".split(",")).reset_index()

In [71]:
agg.columns = ["_".join(c).strip("_") for c in agg.columns.values]

In [72]:
agg_df = df.merge(agg, on="cabin")

In [74]:
agg_df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,...,parch_mean,parch_sum,fare_min,fare_max,fare_mean,fare_sum,body_min,body_max,body_mean,body_sum
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,...,0.5,1,211.3375,211.3375,211.3375,422.675,,,,0.0
1,1,1,"Madill, Miss. Georgette Alexandra",female,15.0,0,1,24160,211.3375,B5,...,0.5,1,211.3375,211.3375,211.3375,422.675,,,,0.0
2,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,...,2.0,8,151.55,151.55,151.55,606.2,135.0,135.0,135.0,135.0
3,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,...,2.0,8,151.55,151.55,151.55,606.2,135.0,135.0,135.0,135.0
4,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,...,2.0,8,151.55,151.55,151.55,606.2,135.0,135.0,135.0,135.0
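
The same pattern can be sketched on a small self-contained frame. The rows below are a hypothetical sample modeled on the output above, the aggregation is restricted to two numeric columns, and `how="left"` is an assumption that keeps rows whose cabin is missing (the default inner merge would drop them):

```python
import pandas as pd

toy = pd.DataFrame(dict(
    cabin=["B5", "B5", "C22 C26", "C22 C26", None],
    fare=[211.3375, 211.3375, 151.55, 151.55, 7.25],
    parch=[0, 1, 2, 2, 0],
))

# Aggregate the numeric columns per cabin.
agg = (
    toy.groupby("cabin")[["fare", "parch"]]
    .agg(["min", "max", "mean", "sum"])
    .reset_index()
)

# Flatten the two-level column names: ("fare", "min") -> "fare_min".
agg.columns = ["_".join(c).strip("_") for c in agg.columns.values]

# Attach the per-cabin aggregates to every original row.
agg_df = toy.merge(agg, on="cabin", how="left")
```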


If you wanted to sum up "good" and "bad" columns, you could create a new column that is the sum of the aggregated columns. This is somewhat of an art and also requires a deep understanding of the data.