### Feature engineering methods
* imputation: fill missing values
* one-hot encoding: convert categories to binary columns
* binning: convert scalars to categories
* interaction terms
* normalization (group of techniques: scaling)

In [113]:
import pandas as pd

In [114]:
df = pd.read_csv('train.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


**impute missing values in the age column:**

options:
* insert a predefined value: `df.fillna(1.0, inplace=True)`
* insert the mean: `df.fillna(df['col'].mean(), inplace=True)`
* interpolation
    * `s = pd.Series([1.0, 2.0, np.nan, np.nan, 5.0, np.nan, 7.0])`
    * `s.interpolate('linear')`
* insert the median
* insert the last known value in a time series (forward fill)
* insert the next known value in a time series (backward fill)
* options in the `sklearn.impute` module
* check advanced strategies in the **fancyimpute** package
* insert from the most similar data point with **k-nearest neighbors**
* reconstruct from latent features with Matrix factorization (NMF)
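
A few of the pandas options above, sketched side by side on the small series from the interpolation bullet. The variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# toy series with gaps (same values as in the interpolation bullet above)
s = pd.Series([1.0, 2.0, np.nan, np.nan, 5.0, np.nan, 7.0])

filled_median = s.fillna(s.median())      # median of the known values 1, 2, 5, 7
filled_ffill = s.ffill()                  # carry the last known value forward
filled_bfill = s.bfill()                  # pull the next known value backward
filled_linear = s.interpolate('linear')   # straight line between known neighbours

# compare the strategies column by column
print(pd.DataFrame({
    'original': s,
    'median': filled_median,
    'ffill': filled_ffill,
    'bfill': filled_bfill,
    'linear': filled_linear,
}))
```

Forward and backward fill only make sense when the row order carries meaning, as in a time series.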

In [112]:
# check where missing values are
df.isnull().sum().sort_values(ascending = False)
# embarked - where they got on the ship 

Cabin          687
Age            177
Embarked         2
Fare             0
Ticket           0
Parch            0
SibSp            0
Sex              0
Name             0
Pclass           0
Survived         0
PassengerId      0
dtype: int64

In [80]:
df['Age'].isna().sum()

177

In [81]:
df['Age'].isnull().sort_values(ascending = False).head(15)

643    True
517    True
502    True
507    True
140    True
718    True
306    True
511    True
304    True
303    True
711    True
709    True
301    True
300    True
732    True
Name: Age, dtype: bool

In [82]:
# easy impute: mean or median of age column
df['Age'] = df['Age'].fillna(df['Age'].mean())  # assign back; inplace on a selected column is deprecated in recent pandas
# check
df.isnull().sum().sort_values(ascending = False)



Cabin          687
Embarked         2
Fare             0
Ticket           0
Parch            0
SibSp            0
Age              0
Sex              0
Name             0
Pclass           0
Survived         0
PassengerId      0
dtype: int64
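
The same kind of imputation can also be done with the `sklearn.impute` module mentioned in the options list. The sketch below uses a small stand-in frame rather than `train.csv`, and `n_neighbors=2` is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# toy stand-in for two Titanic columns (illustrative values, not the full data set)
toy = pd.DataFrame({
    'Age':  [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
    'Fare': [7.25, 71.28, 7.93, 53.10, 8.05, 51.86],
})

# median imputation, column by column
median_imputer = SimpleImputer(strategy='median')
toy_median = pd.DataFrame(median_imputer.fit_transform(toy), columns=toy.columns)

# k-nearest-neighbours imputation: a missing Age becomes the mean Age of the
# 2 most similar rows that have an Age (similarity here is based on Fare)
knn_imputer = KNNImputer(n_neighbors=2)
toy_knn = pd.DataFrame(knn_imputer.fit_transform(toy), columns=toy.columns)

print(toy_median)
print(toy_knn)
```

Because the imputers are fitted objects, the statistics learned on the training data can be reused on new data with `transform`.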

In [128]:
# 1) fill age NaN with a random int between 20 and 40 if survived, else a random int below 20 or above 40
import numpy as np
rng = np.random.default_rng()
missing = df['Age'].isna()
survived = df['Survived'] == 1
# one candidate value per row; only rows with a missing age use it
mid_age = rng.integers(20, 41, size=len(df))              # 20..40 inclusive
outer_age = rng.choice(np.r_[0:20, 41:81], size=len(df))  # 0..19 or 41..80
df['Age'] = np.where(missing, np.where(survived, mid_age, outer_age), df['Age'])

# check
df.isnull().sum().sort_values(ascending = False)

Cabin          687
Embarked         2
Fare             0
Ticket           0
Parch            0
SibSp            0
Age              0
Sex              0
Name             0
Pclass           0
Survived         0
PassengerId      0
dtype: int64

In [75]:
# more sophisticated method to fill NaNs:
# fitting a sample distribution with GMMs (an unsupervised model for the distribution of your data)
# GMM = Gaussian mixture model

import pandas as pd
from sklearn.mixture import GaussianMixture
# separate name, so the Titanic df is not overwritten
fruits = pd.DataFrame({
     'fruit': list('AAAABBBB'),
     'price': [2.0, 2.1, 2.2, 2.1, 1.0, 1.3, 1.1, 1.7]})
X = pd.get_dummies(fruits)
gmm = GaussianMixture(n_components=5)
gmm.fit(X)
gmm.sample(8)[0].round(1)

array([[ 2.1,  1. , -0. ],
       [ 2.1,  1. ,  0. ],
       [ 2.2,  1. ,  0. ],
       [ 2.2,  1. , -0. ],
       [ 1. ,  0. ,  1. ],
       [ 1.1, -0. ,  1. ],
       [ 1.1, -0. ,  1. ],
       [ 2. ,  1. ,  0. ]])

In [55]:
# passengers between 20 and 40 are more likely to survive
# for rows where passenger survived, fill age with random integer between 20 and 40
# for rows where passenger died, fill age with random integer <20 or >40


### Normalizing
Normalizing is an overarching term for many different transformations of the data. Usually, by normalizing you want to adjust the underlying distribution of the data.

* Calculating logarithms
    * `import numpy as np`
    * `normalized = np.log(df['col'])`
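
A minimal sketch of the log transform on a right-skewed column. The values are illustrative fares, and `np.log1p` is swapped in for `np.log` here as an assumption, because a plain logarithm is undefined at zero.

```python
import numpy as np
import pandas as pd

# illustrative, right-skewed fares including a zero
fare = pd.Series([0.0, 7.25, 8.05, 53.1, 71.2833, 512.3292])

# log(1 + x) compresses large values and is still defined at x = 0
log_fare = np.log1p(fare)

print(pd.DataFrame({'fare': fare, 'log1p_fare': log_fare.round(3)}))
```

`np.expm1` reverses the transformation if the original scale is needed again.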


* **Scaling**: adjust the range of the data, or its mean and standard deviation, to defined values. This is usually done because a model expects features on a certain scale
* **Min-Max Scaler**: scales data to values in the range between 0 and 1
    * `from sklearn.preprocessing import MinMaxScaler`
    * `scaler = MinMaxScaler()`
    * `scaler.fit(X)` -> fit learns the minimum and maximum of each column
    * `Xtrans = scaler.transform(X)` -> transform does the actual scaling
    * `print("data after:\n", Xtrans[:5])`
* **Standard Scaler**: shifts and scales each column to mean 0 and standard deviation 1 (the shape of the distribution stays the same, so the result is not necessarily normal)
    * `from sklearn.preprocessing import StandardScaler`
    * `m = StandardScaler()`
    * `Xt = m.fit_transform(X)`
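
Both scalers from the list above in one runnable sketch. The arrays are small made-up stand-ins for an Age and a Fare column, and the train/test split is an assumption added to show why `fit` and `transform` are separate steps.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# made-up stand-ins: columns are Age and Fare
X_train = np.array([[22.0, 7.25], [38.0, 71.28], [26.0, 7.93], [35.0, 53.10]])
X_test = np.array([[30.0, 10.0]])

minmax = MinMaxScaler()
X_train_mm = minmax.fit_transform(X_train)  # learn min/max on the training data
X_test_mm = minmax.transform(X_test)        # reuse the same min/max on new data

standard = StandardScaler()
X_train_std = standard.fit_transform(X_train)

print("min-max range:", X_train_mm.min(axis=0), X_train_mm.max(axis=0))
print("standardized mean:", X_train_std.mean(axis=0).round(6))
print("standardized std:", X_train_std.std(axis=0).round(6))
```

Fitting only on the training data keeps information about the test data out of the scaler.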
    
other normalization methods:
* dividing by a total sum
* **z-scores**
* propensities - deviations from a reference value
* Box-Cox transformation - a power transformation that makes strictly positive data more normal-like, e.g. to stabilize the variance of a time series
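
Three of these methods as a short sketch. The input uses the five `Fare` values shown in `df.head()`, and `scipy.stats.boxcox` is one possible implementation of Box-Cox, chosen here as an assumption.

```python
import numpy as np
import pandas as pd
from scipy import stats

# the five Fare values from df.head()
x = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05])

share = x / x.sum()                  # dividing by the total: shares sum to 1
z = (x - x.mean()) / x.std(ddof=0)   # z-scores: mean 0, standard deviation 1

# Box-Cox requires strictly positive input; lambda is estimated from the data
x_boxcox, lam = stats.boxcox(x.to_numpy())

print(share.round(3).tolist())
print(z.round(3).tolist())
print(x_boxcox.round(3), "lambda:", round(lam, 3))
```

With `ddof=0` the z-scores match what `StandardScaler` produces for the same column.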

### Transform features

Encoding categories
* many ML algorithms require numerical feature values
* **converting a categorical column into numbers**
    * `from sklearn.preprocessing import LabelEncoder`
    * `m = LabelEncoder()`
    * `m.fit_transform(['A', 'B', 'A', 'C'])`

* One-Hot encoding (dummy encoding)
    * transforms a category into binary values
    * `pd.get_dummies(['A', 'B', 'A', 'C'])`
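
The two encodings above applied to the same toy list, so the difference in output shape is visible. Passing `dtype=int` to `pd.get_dummies` is an assumption made here to get 0/1 columns instead of booleans.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

categories = ['A', 'B', 'A', 'C']

# label encoding: one integer per category, in sorted order of the labels
encoder = LabelEncoder()
codes = encoder.fit_transform(categories)
print(codes, encoder.classes_)

# one-hot encoding: one binary column per category
dummies = pd.get_dummies(pd.Series(categories), dtype=int)
print(dummies)
```

Label encoding yields a single column of integers, which imposes an order (A < B < C) that some models will treat as meaningful. One-hot encoding avoids that at the cost of more columns.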
     

In [66]:
# apply one-hot encoding to Passenger class column
binary_pclass = pd.get_dummies(df['Pclass'], prefix='Pclass')
#print(type(binary_pclass))
binary_pclass.head()

Unnamed: 0,Pclass_1,Pclass_2,Pclass_3
0,0,0,1
1,1,0,0
2,0,0,1
3,1,0,0
4,0,0,1


In [57]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [67]:
# join the one-hot columns to the dataframe (the indexes match, so no rows are lost)
df = df.join(binary_pclass, how='left')
df.head()
# optionally drop one dummy column, since it is implied by the others:
# df = df.join(binary_pclass.iloc[:, :-1])

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,1,2,3,Pclass_1,Pclass_2,Pclass_3
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0,0,1,0,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1,0,0,1,0,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0,0,1,0,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1,0,0,1,0,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0,0,1,0,0,1


### Binning

In [74]:
df['Age'] = pd.cut(df['Age'], bins=5, labels=[1,2,3,4,5])
df.head()

# workflow strategy: first binning of a variable and then one hot encoding 

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,1,2,3,Pclass_1,Pclass_2,Pclass_3
0,1,0,3,"Braund, Mr. Owen Harris",male,2,1,0,A/5 21171,7.25,,S,0,0,1,0,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,3,1,0,PC 17599,71.2833,C85,C,1,0,0,1,0,0
2,3,1,3,"Heikkinen, Miss. Laina",female,2,0,0,STON/O2. 3101282,7.925,,S,0,0,1,0,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,3,1,0,113803,53.1,C123,S,1,0,0,1,0,0
4,5,0,3,"Allen, Mr. William Henry",male,3,0,0,373450,8.05,,S,0,0,1,0,0,1
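
The workflow noted in the cell above (bin first, then one-hot encode) could look roughly like this. The ages are a small made-up sample, and the `Age_bin` prefix is a hypothetical column name.

```python
import pandas as pd

# made-up sample of ages spanning 2 to 80
ages = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0, 2.0, 80.0], name='Age')

# step 1: 5 equal-width bins over the observed range, labelled 1 to 5
age_bin = pd.cut(ages, bins=5, labels=[1, 2, 3, 4, 5])

# step 2: one binary column per bin (empty bins still get a column)
age_dummies = pd.get_dummies(age_bin, prefix='Age_bin', dtype=int)

print(age_bin.tolist())
print(age_dummies)
```

`pd.cut` with an integer `bins` argument creates equal-width bins. `pd.qcut` is the alternative when each bin should hold roughly the same number of rows.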
