# ML Workflow Intro

![Image](./img/scikit_learn.png)

## Installing scikit-learn

```$ conda create -n sklearn-env -c conda-forge scikit-learn```

```$ conda activate sklearn-env```

```$ conda install ipykernel```

```$ conda install pandas```

## [API Reference](https://scikit-learn.org/stable/modules/classes.html#)

- Datasets
- Impute
- Preprocessing and Normalization
- Model Selection
- Metrics
- Linear Models
- Ensemble Methods
- Clustering

---

# ML Data Preparation

- Missing values

- Encoding

- Scaling

- Data imbalance

In [None]:
# imports

import numpy as np
import pandas as pd

from sklearn.impute import SimpleImputer

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

---

## Missing values

__scikit-learn__ estimators generally assume that every value in an array is numerical and meaningful: no missing entries and no raw strings.
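As a quick illustration of this assumption, the sketch below fits a `LogisticRegression` on a tiny made-up array that contains one `np.nan`. The array values and the variable names (`X_toy`, `y_toy`, `fit_failed`) are invented for the example; the point is only that `fit()` refuses the data with a `ValueError`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy feature matrix (made-up values) with one missing entry
X_toy = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, 7.0]])
y_toy = np.array([0, 1, 0, 1])

try:
    LogisticRegression().fit(X_toy, y_toy)
    fit_failed = False
except ValueError as err:
    # Input validation rejects NaN before any training happens
    fit_failed = True
    print(f"fit() rejected the data: {err}")
```

This is why the missing values have to be deleted or imputed before modelling.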

In [None]:
# loading a classic!!!

titanic = pd.read_csv('./datasets/titanic.csv')
titanic

__Dataset Features:__

- PassengerId - Numerical PK ([1:891])

- Survived - Survival (0 = No; 1 = Yes)

- Pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)

- Name - Name of the passenger

- Sex - Sex of the passenger

- Age - Age of the passenger

- SibSp - Number of Siblings/Spouses Aboard

- Parch - Number of Parents/Children Aboard

- Ticket - Ticket Number

- Fare - Passenger Fare

- Cabin - Cabin Number

- Embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

In [None]:
titanic.info()

In [None]:
# numeric features

titanic.describe()

In [None]:
# categorical features

cols = ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']

cat_list = []
for col in cols:
    cat = titanic[col].unique()
    cat_num = len(cat)
    cat_dict = {"categorical_variable":col,
                "number_of_possible_values":cat_num,
                "values":cat}
    cat_list.append(cat_dict)
    
categories = pd.DataFrame(cat_list).sort_values(by="number_of_possible_values",
                                                ascending=False).reset_index(drop=True)
categories

In [None]:
# missing values

titanic.isnull().sum()

In [None]:
# missing values percentage function

def missing_percentage(df):
    percent_missing = df.isnull().sum() * 100 / len(df)
    missing_values_df = pd.DataFrame({'column_name': df.columns,'percent_missing': percent_missing})
    return missing_values_df

In [None]:
# missing values percentage

missing_percentage(titanic)

---

### Delete missing values

In [None]:
# drop columns
no_nan_col = ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'SibSp', 'Parch', 'Ticket', 'Fare']
titanic_no_nan_col = titanic[no_nan_col]
titanic_no_nan_col

In [None]:
missing_percentage(titanic_no_nan_col)

In [None]:
# drop rows

titanic_no_nan_rows = titanic.dropna()
titanic_no_nan_rows

In [None]:
missing_percentage(titanic_no_nan_rows)
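Instead of listing the complete columns by hand, the same deletions can be driven by `dropna` and a missing-value ratio. The following is a minimal sketch on a made-up DataFrame; the column names `a`, `b`, `c` and the 50% threshold (`max_missing`) are assumptions chosen for illustration, not values taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): "b" is 25% missing, "c" is 75% missing
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, 2.0, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

# Drop every column that contains at least one missing value
cols_dropped = toy.dropna(axis=1)

# Drop only the columns whose share of missing values exceeds a threshold
max_missing = 0.5  # hypothetical cut-off
keep = toy.columns[toy.isnull().mean() <= max_missing]
mostly_complete = toy[keep]

# Drop rows, but only when the value is missing in selected columns
rows_dropped = mostly_complete.dropna(subset=["b"])

print(list(cols_dropped.columns), list(mostly_complete.columns), len(rows_dropped))
```

Dropping by threshold keeps columns such as `b`, which lose only a few rows, while discarding columns that are mostly empty.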

---

### Imputation of missing values

In [None]:
# we make a copy

titanic_input = titanic.copy()
titanic_input

In [None]:
# Using pandas -> Numeric continuous values

titanic_input['Age'] = titanic_input['Age'].fillna(titanic_input['Age'].mean())
#titanic_input['Age'] = titanic_input['Age'].replace(np.nan, titanic_input['Age'].mean())
missing_percentage(titanic_input)

In [None]:
# Using pandas -> Categorical values

titanic_input['Embarked'] = titanic_input['Embarked'].fillna(titanic_input['Embarked'].value_counts().index[0])
missing_percentage(titanic_input)

In [None]:
# we make another copy

titanic_input = titanic.copy()
titanic_input

In [None]:
# Using sklearn univariate feature imputation -> Numeric continuous values

imputer = SimpleImputer(strategy='mean', missing_values=np.nan)
type(imputer)

#### [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)

![Image](./img/imputer_methods.JPG)

In [None]:
imputer = imputer.fit(titanic_input[['Age']])
imputer.get_params(deep=True)

In [None]:
titanic_input['Age'] = imputer.transform(titanic_input[['Age']])
missing_percentage(titanic_input)

In [None]:
# Using sklearn univariate feature imputation -> Categorical values

imputer = SimpleImputer(strategy='most_frequent', missing_values=np.nan)
imputer = imputer.fit(titanic_input[['Embarked']])
titanic_input['Embarked'] = imputer.transform(titanic_input[['Embarked']])
missing_percentage(titanic_input)

#### Other options:

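- Last observation carried forward: `.ffill()` (older pandas code uses `.fillna(method='ffill')`, which is deprecated)
- Interpolation of the variable between the observations before and after a timestamp: `.interpolate(method='linear', limit_direction='forward', axis=0)`
- Using algorithms that support missing values natively (most scikit-learn estimators do not; `HistGradientBoostingClassifier` and `HistGradientBoostingRegressor` are exceptions)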
- Missing values prediction (Machine Learning for Machine Learning)
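The first two options in this list can be sketched on a small time-indexed series. The readings and dates below are made up for illustration; `.ffill()` is the current pandas spelling of `.fillna(method='ffill')`.

```python
import numpy as np
import pandas as pd

# Toy daily readings (made-up values) with gaps
readings = pd.Series(
    [10.0, np.nan, np.nan, 16.0, np.nan],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

# Last observation carried forward: each gap takes the previous known value
carried_forward = readings.ffill()

# Linear interpolation: gaps between two known values are filled on a straight line
interpolated = readings.interpolate(method="linear", limit_direction="forward")

print(pd.DataFrame({"raw": readings, "ffill": carried_forward, "linear": interpolated}))
```

Both methods only make sense when the row order carries meaning (for example a time series); for unordered data such as the Titanic passengers, mean or most-frequent imputation is the more natural choice.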

---

## Encoding Categorical Data

Again, __scikit-learn__ estimators generally assume that every value in an array is numerical and meaningful.

__Very Important:__ Ordinal Data vs. Nominal Data
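The distinction matters because it decides which encoding is appropriate. The sketch below uses a made-up four-row frame and treats `Pclass` as ordinal (assuming 3rd < 2nd < 1st is a meaningful order) and `Embarked` as nominal (ports have no order). The names `class_order`, `Pclass_rank` and `port_dummies` are invented for the example.

```python
import pandas as pd

# Toy rows (made-up values)
toy = pd.DataFrame({
    "Pclass": [3, 1, 2, 3],            # ordinal: classes can be ranked
    "Embarked": ["S", "C", "Q", "S"],  # nominal: ports cannot be ranked
})

# Ordinal data: a single integer column whose order mirrors the category order
class_order = pd.CategoricalDtype(categories=[3, 2, 1], ordered=True)
toy["Pclass_rank"] = toy["Pclass"].astype(class_order).cat.codes

# Nominal data: one indicator column per category, so no order is implied
port_dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked")

print(toy)
print(port_dummies)
```

Encoding a nominal feature with integers would tell a model that, say, one port is "greater" than another, which is an artifact of the encoding rather than a property of the data.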


In [None]:
# first we get the categorical data
cat_cols = ['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
titanic_enconded = titanic[cat_cols].copy()  # copy so new columns can be added without a SettingWithCopyWarning
titanic_enconded

In [None]:
cat_list = []
for col in cat_cols:
    cat = titanic[col].unique()
    cat_num = len(cat)
    cat_dict = {"categorical_variable":col,
                "number_of_possible_values":cat_num,
                "values":cat}
    cat_list.append(cat_dict)
    
cat_df = pd.DataFrame(cat_list).sort_values(by="number_of_possible_values",
                                                ascending=False).reset_index(drop=True)
cat_df

---

### Label encoding

In [None]:
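# Map each port to an arbitrary integer code. Embarked is nominal, so the
# order implied by these numbers (S < C < Q) carries no real meaning.
encoding = {'S':1, 'C':2, 'Q':3}
def ordinal_encoding(x):
    # Returns None for values that are not in the mapping (e.g. NaN)
    return encoding.get(x)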

In [None]:
titanic_enconded['Embarked_num'] = titanic_enconded['Embarked'].apply(ordinal_encoding)
titanic_enconded

---

### One-hot encoding

In [None]:
# One-hot encoding https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html

cat_cols = ['Name', 'Pclass', 'Sex', 'Embarked']
titanic_one_hot_encoding = pd.get_dummies(titanic[cat_cols], 
                                          columns=['Sex', 'Pclass'], 
                                          drop_first=True)
titanic_one_hot_encoding

__You can also use the scikit-learn [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) for it__

---

## Feature Scaling

Scaling of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance.

__Why is it important?__ If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

![Image](./img/scaling.jpg)

[Here](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html) you may find a comparison between different approaches.

In [None]:
# Sample data

sample_data = titanic[['Age', 'Pclass']]
sample_data.describe()
#sample_data = sample_data.to_numpy()


---

### [Standardization](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)

The result of standardization (or Z-score normalization) is that each feature is rescaled to have mean μ=0 and standard deviation σ=1, the same mean and standard deviation as a standard normal distribution (the shape of the feature's distribution is not changed). Here μ is the mean (average) and σ is the standard deviation from the mean; the standard scores (also called z-scores) of the samples are calculated as follows:

![Image](./img/standarization.JPG)
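To make the calculation concrete, the z-score z = (x - μ) / σ can be computed by hand and compared with `StandardScaler`. This is a small sketch on five made-up ages; note that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy column of ages (made-up values), shaped (n_samples, 1)
ages = np.array([[22.0], [38.0], [26.0], [35.0], [54.0]])

# Manual z-scores: subtract the mean, divide by the standard deviation
mu = ages.mean(axis=0)
sigma = ages.std(axis=0)  # population std (ddof=0)
z_manual = (ages - mu) / sigma

# Same computation through scikit-learn
z_sklearn = StandardScaler().fit_transform(ages)

print(np.allclose(z_manual, z_sklearn))
```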

In [None]:
# Using scikit-learn .StandardScaler()

scaler = StandardScaler()
scaled_data = scaler.fit_transform(sample_data)
scaled_data

In [None]:
scaled_df = pd.DataFrame(scaled_data, columns=['Age', 'Pclass'])
scaled_df.describe()

---

### [MinMax Scaling or Normalization](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)

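In this approach, the data is scaled to a fixed range, usually 0 to 1. The cost of having this bounded range, in contrast to standardization, is that we will end up with smaller standard deviations, which can suppress the effect of outliers. Keep in mind that the minimum and maximum are themselves set by the most extreme values, so a single outlier can squeeze the remaining data into a narrow band. Min-Max scaling is typically done via the following equation: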

![Image](./img/normalization.JPG)
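The min-max formula, x_scaled = (x - x_min) / (x_max - x_min), can likewise be checked by hand against `MinMaxScaler`. The fares below are made-up values, with one deliberately large entry to show how the extremes define the range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy column of fares (made-up values); the last one is an outlier
fares = np.array([[7.25], [71.28], [8.05], [53.10], [512.33]])

# Manual min-max scaling to the [0, 1] range
x_min = fares.min(axis=0)
x_max = fares.max(axis=0)
scaled_manual = (fares - x_min) / (x_max - x_min)

# Same computation through scikit-learn
scaled_sklearn = MinMaxScaler().fit_transform(fares)

print(scaled_manual.ravel().round(3))
```

Because the outlier becomes the value 1, the other four fares all land in the lower part of the range.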

In [None]:
# Using scikit-learn .MinMaxScaler()

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(sample_data)
scaled_data

In [None]:
scaled_df = pd.DataFrame(scaled_data, columns=['Age', 'Pclass'])
scaled_df.describe()

---

### [Robust Scaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html)

This scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Median and interquartile range are then stored to be used on later data using the transform method.
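The fit-then-transform behaviour described above can be sketched as follows. The training values and the "later" data are made up; `center_` and `scale_` are the attributes where the fitted `RobustScaler` stores the median and the interquartile range.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy training column (made-up values); 100 is an outlier
train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
# Later data, scaled with the statistics learned from the training set
new = np.array([[3.0], [5.0]])

scaler = RobustScaler().fit(train)
median = scaler.center_  # per-feature median of the training data
iqr = scaler.scale_      # per-feature 75th minus 25th percentile

# transform() applies (x - median) / IQR using the stored statistics
new_scaled = scaler.transform(new)

print(median, iqr, new_scaled.ravel())
```

The outlier barely affects the median and the IQR, which is what makes this scaler robust compared with mean/variance or min/max based scaling.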

In [None]:
# Using scikit-learn .RobustScaler()

scaler = RobustScaler()
scaled_data = scaler.fit_transform(sample_data)
scaled_data

In [None]:
scaled_df = pd.DataFrame(scaled_data, columns=['Age', 'Pclass'])
scaled_df.describe()

[Here](https://sebastianraschka.com/Articles/2014_about_feature_scaling.html) you may find more info about scaling.



---

## Data imbalance (mainly classification)

- Get more data

- Resampling: Under-sampling and Over-sampling

[Imbalanced-learn](https://imbalanced-learn.org/stable/references/index.html) is an open source, MIT-licensed library relying on scikit-learn that provides tools for dealing with classification problems with imbalanced classes.

`conda install -c conda-forge imbalanced-learn`
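Before the over-sampling example, here is what random under-sampling amounts to. This is a plain NumPy sketch on made-up data, not imbalanced-learn's own API: it keeps every minority sample and draws an equally sized random subset of the majority class. All variable names are invented for the example.

```python
from collections import Counter

import numpy as np

rng = np.random.default_rng(42)

# Toy imbalanced data (made-up): 10 samples of class 0, 90 of class 1
y_toy = np.array([0] * 10 + [1] * 90)
X_toy = rng.normal(size=(100, 3))

minority_idx = np.flatnonzero(y_toy == 0)
majority_idx = np.flatnonzero(y_toy == 1)

# Draw, without replacement, as many majority rows as there are minority rows
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
keep_idx = np.concatenate([minority_idx, kept_majority])

X_under, y_under = X_toy[keep_idx], y_toy[keep_idx]

print(X_under.shape, Counter(y_under.tolist()))
```

Under-sampling balances the classes by discarding data, so it is mostly an option when the majority class is large enough to spare samples; over-sampling, shown next, creates additional minority samples instead.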

In [None]:
# Over-sampling example: original data

X, y = make_classification(n_classes=2, 
                           class_sep=2, 
                           weights=[0.1, 0.9], 
                           n_informative=3, 
                           n_redundant=1, 
                           flip_y=0, 
                           n_features=20, 
                           n_clusters_per_class=1, 
                           n_samples=1000, 
                           random_state=42)

print(X.shape, y.shape, Counter(y))

In [None]:
# Using SMOTE (Synthetic Minority Over-sampling Technique)

sm = SMOTE(random_state=42)

X_res, y_res = sm.fit_resample(X, y)

print(X_res.shape, y_res.shape, Counter(y_res))

---