# Summary 

This is a summary of the crash course taken at [BCAM](http://www.bcamath.org/en/) during week 40 of 2019.


**Index**  
* [Introduction](#Introduction)  
* [Preprocessing](#Preprocessing)
    * [Outliers](#Section-A%3A-Outliers)
    * [NaN values](#Section-B%3A-NaN-values)

## Introduction

A machine learning (ML) project involves three steps:
 * Preprocessing: takes about 80% of the time. We prepare and clean the data we are about to work with, and we also select the features we'll use.
 * Processing: where the learning takes place.
 * Evaluation: measuring the quality of our predictions.
 
We can come across two kinds of data: data that includes both the input variables and their outputs (categorical or numerical), and data that only includes the inputs, so any structure must be inferred. Learning from the first kind is called **supervised learning** ([wiki](https://en.wikipedia.org/wiki/Supervised_learning)), whereas learning from the second kind is called **unsupervised learning** ([mathworks](https://www.mathworks.com/discovery/unsupervised-learning.html)).

Supervised learning mainly covers classification (the outputs are categories) and regression (the outputs are numbers).

Unsupervised learning deals with clustering, that is, grouping similar elements together.
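As a rough illustration of the difference, the sketch below fits one supervised and one unsupervised model on scikit-learn's built-in iris data. The choice of `KNeighborsClassifier` and `KMeans`, and their parameters, are assumptions made for this example and were not part of the course material.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Inputs X (4 measurements per flower) and outputs y (species encoded as 0, 1, 2)
X, y = datasets.load_iris(return_X_y=True)

# Supervised: the known outputs y are used during training
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)
predicted_labels = clf.predict(X[:5])

# Unsupervised: only X is used; the algorithm groups similar samples together
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_
```

Note that the cluster ids are arbitrary group numbers; they are not guaranteed to match the species encoding in `y`.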

We also experimented with the iris dataset from scikit-learn.
The `pairplot` function in seaborn is particularly useful for exploring it.

## Getting familiar with the data
The following cells are a sample to get familiar with the data we'll work with. Also remember that expert knowledge is quite relevant, so always try to stay in contact with the data providers.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import pyplot
import seaborn as sns
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, KBinsDiscretizer

In [None]:
# Load the dataset (a set of numpy arrays)
iris = datasets.load_iris()

# Convert to a DF
iris = pd.DataFrame(data= np.c_[iris['data'], iris['target']],columns= iris['feature_names'] + ['target'])
iris['target'] = iris['target'].astype(int)
iris.head()

In [None]:
# Get a quick statistical summary of the data
iris.describe()

In [None]:
# Create pair plots for each column in the dataset to find out relationships
sns.pairplot(iris, vars=iris.columns[:-1], hue='target')

We notice several things here:
* Target 0 (Setosa) is clearly different from the others: its petal dimensions are significantly smaller than those of the other two species.

In [None]:
# The violin plot in seaborn is a mix between a box plot and a kernel density estimate
fig, ax = pyplot.subplots(figsize =(9, 7)) 
sns.violinplot( ax = ax, y = iris["petal width (cm)"], x = iris.target)

## Preprocessing

**Preprocessing** is the stage in a machine learning pipeline where we clean and make sense of the data we have. It takes about **80%** of the time spent in an ML project. The bigger the data, the more important its preprocessing.

**Expert knowledge** is always relevant. It identifies the most important features and the valid ranges for variables, helps to spot redundancies, and sets bounds on the modeling choices.

### Section A: Outliers
Outliers are often bad data points [1](https://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm).
They can be detected collectively (using the Mahalanobis distance) or individually, that is, feature by feature.
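The collective approach is not shown in the cells of this section, so here is a minimal sketch of it with NumPy. The synthetic data, the injected outlier, and the cut-off of 3 are all assumptions chosen for illustration; they do not come from the course.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 correlated 2-D points plus one injected outlier (last row)
X = rng.multivariate_normal(mean=[0, 0], cov=[[1.0, 0.8], [0.8, 1.0]], size=200)
X = np.vstack([X, [4.0, -4.0]])

# Squared Mahalanobis distance of every row to the sample mean:
# d2_i = (x_i - mean)^T * inverse_covariance * (x_i - mean)
diff = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)

# Assumed rule of thumb: flag points whose distance exceeds 3
threshold = 3.0
outlier_idx = np.where(np.sqrt(d2) > threshold)[0]
```

The injected point is unremarkable in each feature taken separately (a value of 4 or -4 is only moderately far from the mean), but it contradicts the positive correlation between the two features, which is what the Mahalanobis distance picks up.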

In [None]:
# Read the dataset
df=pd.read_csv('iris_data.csv')

# Get a copy to work with
df_A = df.copy()

# Let's look for outliers across all the features
ax = sns.boxplot(data=df_A[df_A.columns[:-1]], orient="h", palette="Set2", linewidth=2.5)
plt.show()

In [None]:
# Since we see that sepal_width has some of them, let's plot it
ax2 = sns.boxplot(data=df_A['sepal_width'], orient="h", color=sns.color_palette("Set2")[1], linewidth=2.5)
plt.show()

In [None]:
# Now we're going to get rid of them
# First, define interquartile range
df_A_stats=df_A.describe()
IQR = df_A_stats["sepal_width"]["75%"] - df_A_stats["sepal_width"]["25%"]

# And the whiskers
whiskers = [df_A_stats["sepal_width"]["25%"]-(1.5*IQR),
            df_A_stats["sepal_width"]["75%"]+(1.5*IQR)]

# Get the outliers
outliers=df_A[
    (df_A['sepal_width'] > whiskers[1]) |
    (df_A['sepal_width'] < whiskers[0])
]

# Now, drop'em all
data1_outliers=df_A.drop(index=outliers.index)
assert(data1_outliers.shape[0] == 146)  # Four indices were dropped as expected

# Finally plot the graph to ensure everything went fine (x= matches the horizontal orientation)
ax2 = sns.boxplot(x="sepal_width", data=data1_outliers, orient="h", color=sns.color_palette("Set2")[1], linewidth=2.5)
plt.show()

## Section B: NaN values

When we come across NaNs, two approaches can be taken:
* If they are a small fraction of a large dataset, we can remove the affected rows.
* Otherwise, they can be estimated (imputed) by specific algorithms.
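As one example of an algorithm that estimates missing values, the sketch below uses scikit-learn's `KNNImputer`, which fills each NaN with the average of that feature over the most similar rows. The toy matrix and the choice of two neighbours are assumptions made for illustration; the course cells use mean imputation instead.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy matrix with one missing entry (row 1, column 1)
X = np.array([
    [1.0, 2.0],
    [1.1, np.nan],
    [0.9, 1.8],
    [5.0, 9.0],
])

# Each NaN is replaced by the mean of that feature over the 2 nearest rows,
# where distances are computed on the features both rows have in common
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```

Here the two rows closest to the incomplete one are rows 0 and 2, so the gap is filled from their values rather than from the distant fourth row.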

### Getting rid of NaNs

In [None]:
# First get a copy of the original dataframe
df_nan = df.copy()

# Create some nans
df_nan.iloc[1, 0] = np.nan
df_nan.iloc[5:7, 1] = np.nan
df_nan.iloc[7, 3] = np.nan

# Get rid of NaN values and reindex the df (drop=True discards the old index)
df_not_nan = df_nan.dropna().reset_index(drop=True)

# There should be no NaNs
assert not df_not_nan.isna().any().any()

### Filling in the NaNs with the mean value

In [None]:
# Get an array of the values
X_nan = df_nan.iloc[:, :-1].values

# Now instantiate the imputer from sklearn
imp = SimpleImputer(missing_values=np.nan, strategy='mean')

# Calculate for X_nan and transform the original values
X_imputed = imp.fit(X_nan).transform(X_nan)

# Check that the first imputed value equals the column mean of the dataset with NaNs
assert np.isclose(df_nan.iloc[:, 0].mean(), X_imputed[1][0])

## Section C: Standardization & normalization
*From the slides*
* When computing distances between pairs of samples, the scales of the different features are very relevant.

* Moreover, we cannot ignore the curse of dimensionality, i.e. in high-dimensional spaces all data is sparse. In simple words, all distances become huge.

* Therefore, if the algorithm we plan to apply after preprocessing relies on distances and/or we are in a high-dimensional problem, we should transform all the features to a similar scale.
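To make the first point concrete, here is a small sketch with made-up numbers: three hypothetical people described by height in metres and weight in grams. The data and the use of z-scores as the rescaling are assumptions for this example only.

```python
import numpy as np

# Hypothetical samples: [height in metres, weight in grams]
a = np.array([1.60, 70000.0])
b = np.array([1.95, 70100.0])  # very different height, similar weight
c = np.array([1.61, 71000.0])  # similar height, different weight

# In raw units the weight column dominates the Euclidean distance
d_ab = np.linalg.norm(a - b)
d_ac = np.linalg.norm(a - c)

# After z-scoring each column, both features contribute on a similar scale
X = np.vstack([a, b, c])
Z = (X - X.mean(axis=0)) / X.std(axis=0)
dz_ab = np.linalg.norm(Z[0] - Z[1])
dz_ac = np.linalg.norm(Z[0] - Z[2])

print(f"raw:    d(a,b)={d_ab:.1f}  d(a,c)={d_ac:.1f}")
print(f"scaled: d(a,b)={dz_ab:.2f}  d(a,c)={dz_ac:.2f}")
```

In raw units the height difference between `a` and `b` is practically invisible, because a difference of a few hundred grams outweighs any difference in metres. After rescaling, the two distances are of comparable size.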

***
**Standardization:** standardization (or Z-score normalization) rescales each feature so that it has mean μ=0 and standard deviation σ=1. Only the location and scale change; a feature that was not Gaussian before does not become Gaussian.

**Normalization:** in this notebook, normalization means min-max scaling, that is, rescaling each feature to the range [0, 1]. Note that the scikit-learn documentation uses the term for a different operation: scaling individual samples to have unit norm. That operation can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples. [s](https://scikit-learn.org/stable/modules/preprocessing.html#normalization)
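The formulas behind these transformations are short enough to write by hand. The sketch below computes standardization, min-max scaling, and unit-norm scaling with NumPy on a hypothetical toy matrix, and checks each against the corresponding scikit-learn transformer.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# Hypothetical toy data: 3 samples (rows), 2 features (columns)
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [4.0, 60.0]])

# Standardization, per feature (column): z = (x - mean) / std
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max scaling, per feature (column): x' = (x - min) / (max - min)
M = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Unit-norm scaling, per sample (row): each row is divided by its Euclidean norm
U = X / np.linalg.norm(X, axis=1, keepdims=True)

# The hand-written versions agree with the scikit-learn transformers
assert np.allclose(Z, StandardScaler().fit_transform(X))
assert np.allclose(M, MinMaxScaler().fit_transform(X))
assert np.allclose(U, Normalizer(norm='l2').fit_transform(X))
```

The first two operate column by column, while the third operates row by row, which is why it is a different operation even though it shares the name "normalization".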

### Standardization

In [None]:
# Create a new copy of the data
data2=df.copy()
 
# Split into input space and output space
X2 = data2[['sepal_length','sepal_width', 'petal_length', 'petal_width']]
y2 = data2['species']

# Instantiate the standard scaler (standardization)
sc = StandardScaler()
X2_standarized = sc.fit_transform(X2)

# Ensure that the mean and standard deviation are (almost) 0 and 1 respectively
assert np.allclose(X2_standarized.mean(axis=0), 0)
assert np.allclose(X2_standarized.std(axis=0), 1)

### Normalization

In [None]:
# Instantiate the MinMaxScaler & apply
sc = MinMaxScaler()
X2_normalized = sc.fit_transform(X2)

# Ensure max=1 & min=0 (allowing for floating-point rounding)
assert np.allclose(X2_normalized.min(axis=0), 0)
assert np.allclose(X2_normalized.max(axis=0), 1)

## Section D: Discretization

*From the slides*
* The goal of discretization is reducing the number of values of a continuous attribute by grouping them into intervals (bins). The new values are the bins.

* Some methods require discrete attributes (some naïve Bayes and Bayesian networks methods, etc.).

* Sometimes, the results are better after discretization. Some methods do it implicitly (e.g. decision trees).

* In general, the computational cost of algorithms with discrete attributes is lower than their continuous versions.
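Two common ways of choosing the bins are equal width and equal frequency. The sketch below contrasts them on a hypothetical skewed sample; it uses pandas' `cut` and `qcut` rather than the `KBinsDiscretizer` used in the course cells, simply because they make the difference easy to see on a single column.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed sample: most values are small, one is very large
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

# Equal width: 3 bins that each span the same range of values
equal_width = pd.cut(values, bins=3, labels=False)

# Equal frequency: 3 bins that each hold about the same number of samples
equal_freq = pd.qcut(values, q=3, labels=False)

# Number of samples that fall in each of the 3 bins
width_counts = np.bincount(equal_width, minlength=3).tolist()
freq_counts = np.bincount(equal_freq, minlength=3).tolist()

print("equal width:    ", width_counts)
print("equal frequency:", freq_counts)
```

With equal-width bins the single large value stretches the range, so almost all samples end up in the first bin and the middle bin stays empty. Equal-frequency bins adapt their edges to the data and split the samples evenly.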

In [None]:
# Create a new copy of the data to play with
data3=df.copy()

# Select the features we'll work with
X3 = data3[['sepal_length','sepal_width', 'petal_length', 'petal_width']]

# Instantiate the discretizers: equal width & equal frequency
enc1 = KBinsDiscretizer(n_bins=[3, 4, 2, 2], encode='ordinal', strategy='uniform')
enc2 = KBinsDiscretizer(n_bins=[3, 4, 2, 2], encode='ordinal', strategy='quantile')

# Calculate the bins and transform the data
X_binned_EW = enc1.fit(X3).transform(X3)
X_binned_EF = enc2.fit(X3).transform(X3)

# Now plot them to see the differences: one row per feature,
# equal-width bins on the left and equal-frequency bins on the right
fig, axes = plt.subplots(4, 2, figsize=(20, 20))

for row, feature in enumerate(X3.columns):
    axes[row, 0].hist(X_binned_EW[:, row])
    axes[row, 0].set_title(f'Equal-Width -> {feature}', fontsize=20)

    axes[row, 1].hist(X_binned_EF[:, row])
    axes[row, 1].set_title(f'Equal-Frequency -> {feature}', fontsize=20)

plt.show()