Lecture: AI I - Basics 

Previous:
[**Chapter 3.6: Additional Libraries and Tools**](../03_data/06_additionals.ipynb)

---

# Chapter 4.1: Data Preparation with scikit-learn

- [Imputation](#imputation)
- [Scaling](#scaling)
- [Dimensionality Reduction](#dimensionality-reduction)
- [Pipelines](#pipelines)
- [Feature Union](#feature-union)
- [Column Transformations](#column-transformations)

__Scikit-learn__ (also known as __sklearn__) is an open-source software library for machine learning in Python. It is very popular and actively maintained. The library offers various classification, regression, and clustering algorithms. In addition, sklearn also includes algorithms for model selection, dimensionality reduction, and data preprocessing.  

In this notebook, we return to data preprocessing (data preparation) and cover the following topics:
- Imputation  
- Scaling  
- Dimensionality Reduction  
- Pipelines  
- Feature Union  
- Column Transformations  

The documentation for scikit-learn can be found [here](https://scikit-learn.org/stable/index.html).  


In [1]:
%matplotlib inline

In [2]:
import numpy as np
import pandas as pd
from sklearn import datasets
import matplotlib.pyplot as plt
import seaborn as sns

## Imputation

Imputation refers to filling in missing values (NaNs). As in pandas, sklearn offers several methods for replacing missing values. More information can be found [here](https://scikit-learn.org/stable/modules/impute.html).  


In [3]:
nan_data = np.array([[1, 2], [np.nan, 3], [7, 6]])

### One-dimensional Imputation

In one-dimensional imputation, the values are replaced column by column. The class [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer) provides basic strategies for this purpose. Missing values can be replaced with a given constant value or with statistical values (mean, median, or most frequent value) of each column containing the missing values. This class also allows for different encodings of missing values.  


In [4]:
from sklearn.impute import SimpleImputer

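# Define two imputers: the default strategy replaces NaNs with the column mean,
# the second one replaces them with the constant 0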
mean_imputer = SimpleImputer()
zero_imputer = SimpleImputer(strategy='constant', fill_value=0)

In [5]:
# An imputer must be fitted before it can transform data:
# either call fit and then transform, or call fit_transform to do both in one step
mean_imputer.fit(nan_data)
mean_imputer.transform(nan_data)

array([[1., 2.],
       [4., 3.],
       [7., 6.]])

In [6]:
zero_imputer.fit_transform(nan_data)

array([[1., 2.],
       [0., 3.],
       [7., 6.]])
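
The remaining statistical strategies mentioned above (median and most frequent value) can be used in the same way. The following is a minimal sketch; `demo_data` is a hypothetical array made up for this illustration and is not part of the lecture data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical example data: the first column contains one missing value.
# Observed values in the first column are 1, 7, 1 and 10.
demo_data = np.array([[1, 2], [np.nan, 3], [7, 6], [1, 9], [10, 4]])

median_imputer = SimpleImputer(strategy='median')
most_frequent_imputer = SimpleImputer(strategy='most_frequent')

# The median of 1, 1, 7, 10 is 4; the most frequent value is 1
median_result = median_imputer.fit_transform(demo_data)
most_frequent_result = most_frequent_imputer.fit_transform(demo_data)

print(median_result[1, 0], most_frequent_result[1, 0])
```

Which strategy fits best depends on the data: the median is less sensitive to outliers than the mean, and the most frequent value also works for categorical columns.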

In [7]:
different_nan_data = np.array([[np.nan, 5], [8, 2], [6, 6]])
mean_imputer.transform(different_nan_data)

array([[4., 5.],
       [8., 2.],
       [6., 6.]])

__Brainstorming:__  
<details>
<summary>Why was np.nan replaced with 4?</summary>
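Because the mean_imputer was previously fitted ("trained") on nan_data, whose first column has the mean (1 + 7) / 2 = 4. When transform is called, this stored value is reused instead of being recomputed from the new data.  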
</details>

<details>
<summary>When can this behavior be an advantage?</summary>
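The replacement values learned during training are reused unchanged at test time or in live operation. New data is therefore imputed in exactly the same way as the training data, and no information from the test data flows into the preprocessing.  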
</details>
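
The values a fitted `SimpleImputer` has learned can be inspected through its `statistics_` attribute. The sketch below repeats the two small arrays from above so that it runs on its own; the variable names `learned_means` and `imputed` are only chosen for this illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

nan_data = np.array([[1, 2], [np.nan, 3], [7, 6]])
different_nan_data = np.array([[np.nan, 5], [8, 2], [6, 6]])

# Fit on the first array only
mean_imputer = SimpleImputer().fit(nan_data)

# One learned value per column: the column means of nan_data (NaNs are ignored)
learned_means = mean_imputer.statistics_

# transform reuses the learned means, so the NaN is replaced with 4 and not 7
imputed = mean_imputer.transform(different_nan_data)

print(learned_means)
print(imputed)
```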


In [8]:
zero_imputer.transform(different_nan_data)

array([[0., 5.],
       [8., 2.],
       [6., 6.]])

It is also possible to replace values other than `np.nan` by passing the placeholder to the `missing_values` parameter.  


In [9]:
fischers_fritz = [
    ['Fischers', '', 'fischt', 'frische', 'Fische'],
    ['Frische', 'Fische', 'fischt', 'Fischers', '']
]

# Treat empty strings as missing and fill them with the constant 'Fritz'
string_imputer = SimpleImputer(missing_values='', strategy='constant', fill_value='Fritz')
string_imputer.fit_transform(fischers_fritz)

array([['Fischers', 'Fritz', 'fischt', 'frische', 'Fische'],
       ['Frische', 'Fische', 'fischt', 'Fischers', 'Fritz']], dtype=object)
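
The same idea applies to numeric placeholders. As a minimal sketch, assume a dataset in which missing measurements were encoded as `-1`; the array `sentinel_data` is hypothetical and only serves as an illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data in which -1 marks a missing measurement
sentinel_data = np.array([[1., 2.], [-1., 3.], [7., 6.]])

# Treat -1 as missing and replace it with the mean of the remaining column values
sentinel_imputer = SimpleImputer(missing_values=-1, strategy='mean')
sentinel_result = sentinel_imputer.fit_transform(sentinel_data)

print(sentinel_result)
```

Note that this only makes sense if the placeholder cannot occur as a legitimate value in the data.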

### Multidimensional Variant
The multidimensional variant (called multivariate imputation in the scikit-learn documentation) is significantly more complex. Roughly summarized, each feature with missing values is modeled as a function of the other features, and the resulting estimate is used to fill in the missing values. This is done for every feature in turn and repeated for several rounds before the final replacements are returned. This behavior is implemented by the [IterativeImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer).  

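> __Note:__ This class is still experimental, which is why `enable_iterative_imputer` has to be imported explicitly before the class can be used.  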


In [10]:
# Explicitly enable the experimental IterativeImputer; this import must come first
from sklearn.experimental import enable_iterative_imputer

from sklearn.impute import IterativeImputer

First, we create a small dummy dataset where the values of the individual columns have a clear relationship to each other.  


In [11]:
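x = np.arange(1, 11, dtype="float")
y = x * x  # y depends quadratically on x

# Stack x and y as rows, then remove two values from each of them
data = np.array([x, y])
data[0, 1] = np.nan
data[0, 6] = np.nan
data[1, 3] = np.nan
data[1, 9] = np.nan
data = data.T  # transpose so that x and y become columns
data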

array([[ 1.,  1.],
       [nan,  4.],
       [ 3.,  9.],
       [ 4., nan],
       [ 5., 25.],
       [ 6., 36.],
       [nan, 49.],
       [ 8., 64.],
       [ 9., 81.],
       [10., nan]])

Afterwards, we can again use `fit_transform` to replace the missing values.  


In [12]:
IterativeImputer().fit_transform(data)

array([[ 1.        ,  1.        ],
       [ 2.3175117 ,  4.        ],
       [ 3.        ,  9.        ],
       [ 4.        , 22.35578423],
       [ 5.        , 25.        ],
       [ 6.        , 36.        ],
       [ 6.58808359, 49.        ],
       [ 8.        , 64.        ],
       [ 9.        , 81.        ],
       [10.        , 83.09539114]])

For comparison, the `SimpleImputer` replaces every missing value with the mean of its column and ignores the relationship between the columns.  


In [13]:
SimpleImputer().fit_transform(data)

array([[ 1.   ,  1.   ],
       [ 5.75 ,  4.   ],
       [ 3.   ,  9.   ],
       [ 4.   , 33.625],
       [ 5.   , 25.   ],
       [ 6.   , 36.   ],
       [ 5.75 , 49.   ],
       [ 8.   , 64.   ],
       [ 9.   , 81.   ],
       [10.   , 33.625]])

__Brainstorming:__  
<details>
    <summary>What are the advantages of the multidimensional variant?</summary>
    A major advantage is that each missing value is estimated from the other features of the same row. This often gives better replacements, because features are frequently related to each other. Example: height and weight of individuals.
</details>
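
The procedure described at the start of this section can be approximated by hand. The following is a simplified sketch under stated assumptions, not the actual implementation of `IterativeImputer`: it starts from the column means, uses `BayesianRidge` (the default estimator of `IterativeImputer`) as the regression model, and runs a fixed number of rounds (`n_rounds` is a hypothetical choice). It is not expected to reproduce the numbers above exactly.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

# Rebuild the dummy dataset from above: y = x^2 with two gaps per column
x = np.arange(1, 11, dtype="float")
data = np.array([x, x * x])
data[0, [1, 6]] = np.nan
data[1, [3, 9]] = np.nan
data = data.T

missing_mask = np.isnan(data)

# Step 1: start from a simple guess (the column mean of the observed values)
filled = np.where(missing_mask, np.nanmean(data, axis=0), data)

# Step 2: repeatedly re-estimate the gaps of each column from the other column(s)
n_rounds = 3  # hypothetical choice for this sketch
for _ in range(n_rounds):
    for col in range(filled.shape[1]):
        rows_missing = missing_mask[:, col]
        other_columns = np.delete(filled, col, axis=1)

        # Fit on rows where the current column was observed ...
        model = BayesianRidge()
        model.fit(other_columns[~rows_missing], filled[~rows_missing, col])

        # ... and overwrite only the originally missing entries
        filled[rows_missing, col] = model.predict(other_columns[rows_missing])

print(filled)
```

Only the originally missing entries are ever overwritten; the observed values stay untouched in every round.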


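The following three cells make the iterative nature of the algorithm visible. The parameter `max_iter` limits the number of imputation rounds: the estimates change from round to round, and after three rounds the result already matches the result of the default configuration shown above.  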


In [14]:
IterativeImputer(max_iter=1).fit_transform(data)

array([[ 1.        ,  1.        ],
       [ 3.10315829,  4.        ],
       [ 3.        ,  9.        ],
       [ 4.        , 20.73213454],
       [ 5.        , 25.        ],
       [ 6.        , 36.        ],
       [ 6.8956479 , 49.        ],
       [ 8.        , 64.        ],
       [ 9.        , 81.        ],
       [10.        , 82.62527756]])

In [15]:
IterativeImputer(max_iter=2).fit_transform(data)

array([[ 1.        ,  1.        ],
       [ 2.34645245,  4.        ],
       [ 3.        ,  9.        ],
       [ 4.        , 22.28201783],
       [ 5.        , 25.        ],
       [ 6.        , 36.        ],
       [ 6.61040064, 49.        ],
       [ 8.        , 64.        ],
       [ 9.        , 81.        ],
       [10.        , 83.06934333]])

In [16]:
IterativeImputer(max_iter=3).fit_transform(data)

array([[ 1.        ,  1.        ],
       [ 2.3175117 ,  4.        ],
       [ 3.        ,  9.        ],
       [ 4.        , 22.35578423],
       [ 5.        , 25.        ],
       [ 6.        , 36.        ],
       [ 6.58808359, 49.        ],
       [ 8.        , 64.        ],
       [ 9.        , 81.        ],
       [10.        , 83.09539114]])
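
How many rounds were actually carried out can be read from the fitted imputer's `n_iter_` attribute. By default, `IterativeImputer` uses `max_iter=10` and stops earlier once the change between two rounds falls below the tolerance `tol`. The sketch below rebuilds the dummy dataset so that it runs on its own; `rounds_used` is just a name chosen for this illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Rebuild the dummy dataset from above
x = np.arange(1, 11, dtype="float")
data = np.array([x, x * x])
data[0, [1, 6]] = np.nan
data[1, [3, 9]] = np.nan
data = data.T

imputer = IterativeImputer()  # defaults: max_iter=10, tol=1e-3
imputed = imputer.fit_transform(data)

# Number of imputation rounds that were actually performed
rounds_used = imputer.n_iter_
print(rounds_used)
```

A value below `max_iter` indicates that the early stopping criterion was reached, which would be consistent with the identical outputs for `max_iter=3` and the default configuration.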

---

Exercise: [**Exercise 4.1: Data Preparation**](../04_ml/exercises/01_data_preparation.ipynb)

Next: [**Chapter 4.2: Machine Learning with scikit-learn**](../04_ml/02_machine_learning.ipynb)