```
From: https://github.com/ksatola
Version: 0.0.1
```

# Preprocess Data

In [1]:
# Connect with underlying Python code
%load_ext autoreload
%autoreload 2
import sys
sys.path.insert(0, '../src')

In [2]:
from datasets import (
    get_dataset
)

In [3]:
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

In [5]:
df = pd.DataFrame(
    {
        "a": range(5),
        "b": [-100, -50, 0, 200, 1000],
    }
)
df.head()

|   | a | b |
|---|---|---|
| 0 | 0 | -100 |
| 1 | 1 | -50 |
| 2 | 2 | 0 |
| 3 | 3 | 200 |
| 4 | 4 | 1000 |


## Standardize
`Standardization` means rescaling each feature so that it is centered at 0 and has a standard deviation of 1. This is important when we compare measurements that have different units. Variables that are measured at different scales do not contribute equally to the analysis and might end up creating a bias.

Some algorithms, such as `SVM`, perform better when the data is standardized. After standardization, each column has a mean of 0 and a standard deviation of 1. Scikit-learn provides a `.fit_transform` method that combines both `.fit` and `.transform`.

The same transformation can be written in pandas, as shown after the scikit-learn example. Remember that you will need to track the original mean and standard deviation if you use this for preprocessing. Any sample that you later use for prediction will need to be standardized with those same values.

In [10]:
from sklearn import preprocessing

std = preprocessing.StandardScaler()
std.fit_transform(df)

array([[-1.41421356, -0.75995002],
       [-0.70710678, -0.63737744],
       [ 0.        , -0.51480485],
       [ 0.70710678, -0.02451452],
       [ 1.41421356,  1.93664683]])

In [11]:
std.scale_

array([  1.41421356, 407.92156109])

In [12]:
std.mean_

array([  2., 210.])

In [14]:
std.var_

array([2.000e+00, 1.664e+05])
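
As a minimal sketch of reusing the fitted statistics, the scaler below is fit once on the toy frame from above and then applied to a new row. The `new_sample` values are made up for illustration; they happen to equal the column means, so the standardized result should be zero in both columns.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Same toy data as above, treated here as the training set
train = pd.DataFrame({"a": range(5), "b": [-100, -50, 0, 200, 1000]})

scaler = StandardScaler()
scaler.fit(train)  # learns mean_ and scale_ from the training data only

# Hypothetical new observation (values chosen for illustration)
new_sample = pd.DataFrame({"a": [2], "b": [210]})

# transform() reuses the stored training mean and standard deviation
scaled = scaler.transform(new_sample)
print(scaled)

# inverse_transform() maps standardized values back to the original units
restored = scaler.inverse_transform(scaled)
print(restored)
```

Calling `.fit` again on the new sample would overwrite the training statistics, which is why only `.transform` is used at prediction time.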

In [16]:
# Pandas version
std2 = (df - df.mean()) / df.std()
std2

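|   | a | b |
|---|---|---|
| 0 | -1.264911 | -0.679720 |
| 1 | -0.632456 | -0.570088 |
| 2 | 0.000000 | -0.460455 |
| 3 | 0.632456 | -0.021926 |
| 4 | 1.264911 | 1.732190 |

These values differ slightly from the `StandardScaler` output because the pandas `.std` method defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` uses the population standard deviation (`ddof=0`).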

In [17]:
std2.mean()

0.0

In [18]:
std2.std()

1.0
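
The bookkeeping described above can be sketched as follows. The helper `standardize` and the `new_sample` values are hypothetical names introduced for illustration, and `ddof=0` is passed so that the result matches `StandardScaler`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({"a": range(5), "b": [-100, -50, 0, 200, 1000]})

# Keep the training statistics; ddof=0 is the population standard deviation
train_mean = train.mean()
train_std = train.std(ddof=0)


def standardize(frame):
    """Standardize a frame using the stored training statistics."""
    return (frame - train_mean) / train_std


# Hypothetical new observation, scaled with the training mean and std
new_sample = pd.DataFrame({"a": [3], "b": [200]})
print(standardize(new_sample))

# Sanity check: the pandas result agrees with scikit-learn on the training data
same = np.allclose(standardize(train), StandardScaler().fit_transform(train))
print(same)
```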

## Normalization (Scale to Range)
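Scaling to range is translating data so it is between 0 and 1, inclusive. Having the data bounded may be useful. However, be careful if you have `outliers`: a single extreme value defines the range and squeezes the remaining values into a narrow band. In column `b` of the example, the value 1000 pushes the other four values below 0.28.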

In [25]:
from sklearn import preprocessing

mms = preprocessing.MinMaxScaler()
mms.fit(df)

MinMaxScaler(copy=True, feature_range=(0, 1))

In [26]:
mms.transform(df)

array([[0.        , 0.        ],
       [0.25      , 0.04545455],
       [0.5       , 0.09090909],
       [0.75      , 0.27272727],
       [1.        , 1.        ]])

In [28]:
# Pandas version
norm = (df - df.min()) / (df.max() - df.min())
norm

|   | a | b |
|---|---|---|
| 0 | 0.00 | 0.000000 |
| 1 | 0.25 | 0.045455 |
| 2 | 0.50 | 0.090909 |
| 3 | 0.75 | 0.272727 |
| 4 | 1.00 | 1.000000 |
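
One consequence worth seeing in code: the 0 to 1 bound only holds for the data the scaler was fit on. The sketch below applies a fitted `MinMaxScaler` to made-up rows (`new_data` is a hypothetical frame), one of which lies outside the training range.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

train = pd.DataFrame({"a": range(5), "b": [-100, -50, 0, 200, 1000]})
mms = MinMaxScaler().fit(train)

# The learned per-column minimum and maximum
print(mms.data_min_, mms.data_max_)

# Hypothetical new rows; the second one lies outside the training range
new_data = pd.DataFrame({"a": [2, 8], "b": [450, -200]})
scaled = mms.transform(new_data)
print(scaled)  # second row falls outside [0, 1]
```

Whether out-of-range values should be clipped, kept, or flagged depends on the downstream model, so the sketch leaves them as they are.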

## Dummy Variables (One-hot Encoding)
We can use pandas to create dummy variables from categorical data. This is also referred to as one-hot encoding, or indicator encoding. Dummy variables are especially useful if the data is nominal (unordered). The `get_dummies` function in pandas creates one column per distinct value of a categorical column, holding 1 if the row had that value and 0 otherwise. The `drop_first` option drops the first dummy column of each variable, because that column is redundant: it is a linear combination of the other columns.

In [40]:
df_cat = pd.DataFrame(
    {
        "name": ["George", "Paul"],
        "inst": ["Bass", "Guitar"],
    }
)
df_cat.head()

|   | name | inst |
|---|---|---|
| 0 | George | Bass |
| 1 | Paul | Guitar |


In [41]:
pd.get_dummies(df_cat)

|   | name_George | name_Paul | inst_Bass | inst_Guitar |
|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 | 1 |


In [42]:
pd.get_dummies(df_cat, drop_first=True)

|   | name_Paul | inst_Guitar |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 1 |
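
Because `get_dummies` derives its columns from whatever values are present, new data can produce a different set of columns than the training data. One possible way to handle this, sketched below, is to reindex the new dummies to the training columns. The `new` frame and its "Ringo" and "Drums" values are made up for illustration, and `dtype=int` is passed so the output is 1/0 regardless of the pandas version.

```python
import pandas as pd

train = pd.DataFrame({"name": ["George", "Paul"], "inst": ["Bass", "Guitar"]})
train_dummies = pd.get_dummies(train, dtype=int)

# Hypothetical new data: "Ringo" and "Drums" were not seen during training
new = pd.DataFrame({"name": ["Paul", "Ringo"], "inst": ["Guitar", "Drums"]})

# Reindex to the training columns: columns missing from the new data are
# filled with 0, and columns for unseen categories are dropped
new_dummies = pd.get_dummies(new, dtype=int).reindex(
    columns=train_dummies.columns, fill_value=0
)
print(new_dummies)
```

The second row ends up all zeros, which is how an unseen category shows up under this approach.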


In [43]:
df_cat2 = pd.DataFrame(
    {
        "A": [1, None, 3],
        "names": [
            "Fred,George",
            "George",
            "John,Paul",
        ],
    }
)
df_cat2.head()

|   | A | names |
|---|---|---|
| 0 | 1.0 | Fred,George |
| 1 | NaN | George |
| 2 | 3.0 | John,Paul |


In [44]:
pd.get_dummies(df_cat2)

|   | A | names_Fred,George | names_George | names_John,Paul |
|---|---|---|---|---|
| 0 | 1.0 | 1 | 0 | 0 |
| 1 | NaN | 0 | 1 | 0 |
| 2 | 3.0 | 0 | 0 | 1 |


In [45]:
pd.get_dummies(df_cat2, drop_first=True)

|   | A | names_George | names_John,Paul |
|---|---|---|---|
| 0 | 1.0 | 0 | 0 |
| 1 | NaN | 1 | 0 |
| 2 | 3.0 | 0 | 1 |
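
In the outputs above, `get_dummies` treats each comma-separated string such as "Fred,George" as a single category. If one indicator column per individual name is wanted instead, the pandas string accessor offers `.str.get_dummies` with a separator. A minimal sketch on the same frame:

```python
import pandas as pd

df_cat2 = pd.DataFrame(
    {
        "A": [1, None, 3],
        "names": ["Fred,George", "George", "John,Paul"],
    }
)

# Split each string on commas and create one indicator column per name
name_dummies = df_cat2["names"].str.get_dummies(sep=",")
print(name_dummies)
```

This yields the columns `Fred`, `George`, `John`, and `Paul`, with row 0 marked for both Fred and George.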


## Label Encoder
If we have high cardinality nominal data, we can use label encoding. This takes categorical data and assigns each distinct value a number. The encoder imposes ordinality, which may or may not be desired. It can take up less space than one-hot encoding, and some (tree) algorithms can deal with this encoding. The label encoder can only deal with one column at a time.

In [48]:
from sklearn import preprocessing

lab = preprocessing.LabelEncoder()
lab.fit_transform(df_cat.name)

array([0, 1])

If you have encoded values, applying the `.inverse_transform` method decodes them.

In [49]:
lab.inverse_transform([1, 1, 0])

array(['Paul', 'Paul', 'George'], dtype=object)
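
The mapping between labels and codes lives in the encoder's `classes_` attribute. The sketch below inspects it and also shows what happens with a label that was not present during fitting ("Ringo" is a made-up value for illustration).

```python
from sklearn.preprocessing import LabelEncoder

lab = LabelEncoder()
lab.fit(["George", "Paul"])

# classes_ holds the sorted labels; each label's position is its code
print(list(lab.classes_))

# Labels not seen during fit cannot be encoded and raise a ValueError
try:
    lab.transform(["Ringo"])
    unseen_ok = True
except ValueError as err:
    unseen_ok = False
    print("unseen label:", err)
```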

You can also use pandas to label encode. First, you convert the column to a categorical column type, and then pull out the numeric code from it. This code will create a new series of numeric data from a pandas series. We use the `.as_ordered` method to ensure that the category is ordered. Adding 1 makes the codes start at 1 instead of 0.

In [50]:
df_cat.name.astype("category").cat.as_ordered().cat.codes + 1

0    1
1    2
dtype: int8
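
To encode later data with the same codes, the categories learned from the training column can be stored and reused. This is a sketch under the assumption that unseen values should be flagged rather than rejected; `new_names` is a hypothetical series, and pandas assigns the code -1 to values outside the given categories.

```python
import pandas as pd

train_names = pd.Series(["George", "Paul"])

# Store the categories found in the training data
categories = train_names.astype("category").cat.categories

# Hypothetical new data containing an unseen value
new_names = pd.Series(["Paul", "Ringo", "George"])

# Reuse the stored categories; unseen values receive the code -1
codes = pd.Categorical(new_names, categories=categories, ordered=True).codes
print(codes)  # [ 1 -1  0]
```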

## Frequency Encoding
Another option for handling high cardinality categorical data is to frequency encode it. This means replacing the name of the category with the count it had in the training data. We will use pandas to do this. First, we will use the pandas `.value_counts` method to make a mapping (a pandas series that maps strings to counts). With the mapping we can use the `.map` method to do the encoding. Make sure you store the training mapping so you can encode future data with the same counts.

In [53]:
df_cat3 = pd.DataFrame(
    {
        "name": ["George", "Paul", "George"],
        "inst": ["Bass", "Guitar", "Bass"],
    }
)
df_cat3.head()

|   | name | inst |
|---|---|---|
| 0 | George | Bass |
| 1 | Paul | Guitar |
| 2 | George | Bass |


In [56]:
mapping = df_cat3.name.value_counts()
mapping

George    2
Paul      1
Name: name, dtype: int64

In [57]:
df_cat3.name.map(mapping)

0    2
1    1
2    2
Name: name, dtype: int64
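
A short sketch of reusing the stored mapping on later data follows. The `new_names` series is made up for illustration; categories missing from the mapping come back as `NaN` from `.map`, and filling them with 0 is one possible choice, not the only one.

```python
import pandas as pd

train = pd.DataFrame({"name": ["George", "Paul", "George"]})

# Mapping learned from the training data: category -> count
mapping = train["name"].value_counts()

# Hypothetical new data with an unseen category
new_names = pd.Series(["Paul", "Ringo", "George"])

# Unseen categories map to NaN; here they are treated as a count of 0
encoded = new_names.map(mapping).fillna(0).astype(int)
print(encoded.tolist())  # [1, 0, 2]
```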