## <span style="color:rebeccapurple">ML workflow steps:</span>

1. State the problem
2. Gather the data
3. Split train-test sets
4. Pre-process the data
5. Establish a baseline
6. Choose a model
7. Train the model
8. Optimize the model
9. Validate the model
10. Predict unknown data points using the model
11. Interpret and evaluate the model

# <span style="color:rebeccapurple">Setup</span>

**Scikit-learn (sklearn)** is one of the most widely used machine learning libraries in Python. It provides efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction, through a consistent interface. The library is largely written in Python and is built on NumPy, SciPy, and Matplotlib.

In [None]:
# imports 
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

import sklearn

In [None]:
# setting some figure display parameters
sns.set_context('notebook')
sns.set_style('white', {'axes.linewidth': 0.5})
matplotlib.rcParams.update(matplotlib.rcParamsDefault)

plt.rcParams['figure.dpi'] = 150
plt.rcParams['xtick.major.size'] = 3
plt.rcParams['xtick.major.width'] = 1
plt.rcParams['xtick.bottom'] = True
plt.rcParams['ytick.left'] = True
plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['font.sans-serif'] = 'Arial'
plt.rcParams['pdf.fonttype'] = 42
plt.rcParams['legend.edgecolor'] = 'w'

## <span style="color:rebeccapurple">1. State the problem</span>

**Task:** Predict the body mass of penguins.

**Input:** Table of penguin features.

**Output:** A value for body mass in grams.

## <span style="color:rebeccapurple">2. Gather and inspect the data</span>

**Data:** Size measurements for 344 adult foraging Adélie, Chinstrap, and Gentoo penguins observed on islands in the Palmer Archipelago near Palmer Station, Antarctica.

In [None]:
# Let's look at the data
df = pd.read_csv('data/penguins.csv')

<h4><span style="color:blue">Google Colab users only -- un-comment the code lines below and run them to download the dataset and read it</span></h4>

In [None]:
# !wget https://raw.githubusercontent.com/nuitrcs/scikit-learn-workshop/main/data/penguins.csv
# df = pd.read_csv('penguins.csv')

In [None]:
df.head()

In [None]:
df.shape

In [None]:
# Let's see what the distribution of the output variable looks like
plt.figure(figsize=(4, 2))
sns.histplot(df['body_mass_g'], bins=20)
plt.title('Target distribution')
plt.show()

### <span style="color:green">-------- CONCEPT: Feature vs Target split --------</span>

Once the problem statement is defined, the data can be split into target labels and input features as below.

<span style="color:green"><font size="+1">What is the input data for prediction?</font></span>

In [None]:
# X is the input to the model: the features that will be used to predict y
X = df.drop(columns="body_mass_g")
df.shape, X.shape

<span style="color:green"><font size="+1">What is the target to be predicted?</font></span>

In [None]:
# select the target labels
y = df.body_mass_g
y.shape

## <span style="color:rebeccapurple">3. Split the data into train and test</span>
Ideally you want to separate the train and test sets very early on. I prefer to split them before pre-processing.

### <span style="color:green">-------- CONCEPT: Train-test split --------</span>

An important aspect that sets machine learning apart from other fields is that, in addition to the training error, we want to minimize the generalization error. In other words, we want to make sure that the trained model generalizes well to unobserved/future inputs. That is why we split our dataset into a train set and a test set.

These two sets should be disjoint, i.e., no instance should appear in both sets. (When the dataset is too small, there are special measures, but we will not cover them here.)

A typical split is 80%-20% for train vs. test. In sklearn, the default is 75%-25% (more precisely, `test_size` defaults to 0.25 when neither `test_size` nor `train_size` is given).

The train dataset is used to learn the model. The test dataset is used to estimate generalization error.

In [None]:
# import the function train_test_split from sklearn library
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

In [None]:
X_train.shape , X_test.shape , y_train.shape , y_test.shape

Learn more about the train-test split function - https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
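As a small check of the points above, here is a sketch on synthetic data (the `toy_*` arrays and the other variable names are made up for illustration and are not part of the penguins dataset). It shows the default 75-25 split, that the two sets do not overlap, and that passing `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 20 rows, 2 feature columns, one target value per row
toy_features = np.arange(40).reshape(20, 2)
toy_target = np.arange(20)

# No test_size given, so scikit-learn holds out 25% of the rows
f_train, f_test, t_train, t_test = train_test_split(
    toy_features, toy_target, random_state=0
)
print(f_train.shape, f_test.shape)  # (15, 2) (5, 2)

# The train and test sets are disjoint: no target value appears in both
overlap = set(t_train) & set(t_test)
print(overlap)  # set()

# Re-using the same random_state reproduces the same split
_, _, _, t_test_again = train_test_split(toy_features, toy_target, random_state=0)
print(np.array_equal(t_test, t_test_again))  # True
```

Without `random_state`, each run shuffles the rows differently, so the shapes stay the same but the rows in each set change.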

<span style="color:#DC537D"><font size="+1">DIY -- Write code to split the dataset 80-20 train vs test</font></span>

In [None]:
# your code here


# what is the shape of the new split sets?


## <span style="color:rebeccapurple">4. Pre-process the training data</span>

### <span style="color:green">-------- CONCEPT: Data Processing (Fit-Transform) --------</span>

Raw data can take on any range of values. By preprocessing data, we make it easier to interpret and use.
There are many ways to preprocess data for ML, depending on the modeling purpose and data characteristics.

We're going to discuss methods to deal with these three types of data:

* Numerical
* Categorical
* Missing

<span style="color:#DC537D"><font size="+1">DIY -- What kind of variables does our dataset have? (numeric, categorical, others?)</font></span>

In [None]:
df.columns

In [None]:
# list the numerical features
numeric_cols = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm']

In [None]:
# list the categorical features
categorical_cols = ['species', 'island', 'sex', 'year']

### <span style="color:#1409FA">Scaling Numerical data</span>
Several machine learning algorithms rely on calculating distances. Therefore, it is important to have all the input features on the same scale - so that the distances computed for different features are comparable.

#### <span style="color:teal">Standardization</span>
This is the process of converting data into the standard format where each feature has zero mean and unit variance (i.e., std=1).
$$ x' = \frac{x - \mu}{\sigma}$$

In [1]:
from sklearn.preprocessing import StandardScaler

The StandardScaler object will find the fit parameters $\mu$ and $\sigma$ for each feature $x$ in the data.
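To make the fit parameters concrete, here is a minimal sketch on a tiny made-up array (the names `toy_values` and `toy_standard` are illustrative only). After fitting, the learned $\mu$ and $\sigma$ are stored per column in the `mean_` and `scale_` attributes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales
toy_values = np.array([[1.0, 100.0],
                       [2.0, 200.0],
                       [3.0, 300.0]])

toy_standard = StandardScaler().fit(toy_values)

# One mu and one sigma per column
print(toy_standard.mean_)   # column means: 2 and 200
print(toy_standard.scale_)  # column standard deviations
```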

#### <span style="color:teal">Scaling features to a range</span>
Apart from standardization, we can scale features to lie between a given minimum and maximum value, often between zero and one. Scaling to a fixed range is robust to features with very small standard deviations, and it preserves zero entries when a feature's minimum is zero (or when the data is scaled by its maximum absolute value instead).
$$ x' = \frac{x - min}{max - min} $$

In [None]:
from sklearn.preprocessing import MinMaxScaler

The MinMaxScaler object will find the fit parameters $min$ and $max$ for each feature $x$ in the data.
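A minimal sketch of range scaling on a made-up column (the names `toy_column` and `toy_minmax` are illustrative only). The fitted minimum and maximum are stored in `data_min_` and `data_max_`, and the transformed values land in the interval from 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One made-up feature with values 10, 20, 30
toy_column = np.array([[10.0], [20.0], [30.0]])

toy_minmax = MinMaxScaler().fit(toy_column)
print(toy_minmax.data_min_, toy_minmax.data_max_)  # fitted min (10) and max (30)

# (x - min) / (max - min) maps 10 -> 0, 20 -> 0.5, 30 -> 1
toy_column_scaled = toy_minmax.transform(toy_column)
print(toy_column_scaled.ravel())
```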

#### <span style="color:teal">Normalization</span>
This is the process of scaling individual samples to have unit norm. So far, we have scaled data by feature, i.e., the calculations are applied to individual columns. Sometimes, we need to scale data across rows instead. For example, some clustering techniques require normalization to calculate cosine similarity scores.

In [None]:
from sklearn.preprocessing import Normalizer
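To see the difference between row-wise and column-wise scaling, here is a short sketch on made-up data (the names `toy_rows` and `toy_unit` are illustrative only). Each row is divided by its own Euclidean (L2) norm, so every row ends up with length 1.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two made-up samples (rows) with two features each
toy_rows = np.array([[3.0, 4.0],
                     [1.0, 0.0]])

# Normalizer works per row; fit() learns nothing, it only validates the input
toy_unit = Normalizer(norm="l2").fit_transform(toy_rows)
print(toy_unit)  # first row becomes [0.6, 0.8]; second row is unchanged

# Every row now has unit L2 norm
row_norms = np.linalg.norm(toy_unit, axis=1)
print(row_norms)
```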

### <span style="color:rebeccapurple">4.1 Fit-transform numerical features with Standardization</span>

Let's suppose we choose to do Standardization on the numerical features. This is how we would go about it.

### I. Find the fit parameters for standardization

In [None]:
# import the class "StandardScaler" from the scikit learn library
from sklearn.preprocessing import StandardScaler

In [None]:
# scaler is an object or instance of class StandardScaler
scaler = StandardScaler()
scaler

Hover on the "i" icon - what does it say?

In [None]:
# now feed your data for pre-processing into this object
scaler.fit(X_train)
scaler

What happened here? Why?

In [None]:
# select only the numeric columns
scaler.fit(X_train[numeric_cols])
scaler

Hover on the "i" icon - what does it say now? <br>
Once the scaler is fit, it has calculated the scaling parameters, i.e., the mean and the standard deviation, for each of the numeric columns. However, we still need to calculate the scaled (also called transformed) values $x'$:

$$ x' = \frac{x - \mu}{\sigma}$$
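As a sanity check of this formula, the sketch below applies it by hand using the parameters stored on a fitted scaler and compares the result with the scaler's own `transform` output. The data and the names `toy_flipper` and `demo_scaler` are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up flipper lengths in mm
toy_flipper = np.array([[180.0], [190.0], [200.0], [210.0]])

demo_scaler = StandardScaler().fit(toy_flipper)

# Apply x' = (x - mu) / sigma by hand with the fitted parameters
by_hand = (toy_flipper - demo_scaler.mean_) / demo_scaler.scale_

# Let the scaler do the same calculation
by_transform = demo_scaler.transform(toy_flipper)

print(np.allclose(by_hand, by_transform))  # True
```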

### II. Transform the train AND test data with the fit parameters you found

Important note: the preprocessor should be fit to the train dataset only, not to the whole dataset, to avoid leakage of information. Data leakage means that the training process has information about the test set and is therefore biased toward it, which can lead to a deceptively good generalization result.

After the preprocessor is fit to the train dataset, it is used to transform both the train and test sets. More specifically, in the case of standardization, the preprocessor obtains the mean and standard deviation of the train set and uses those statistics to transform both sets. As a result, after the transformation, the train set will have zero mean and unit standard deviation, but the test set may not. Why do you think that is?
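The fit-on-train, transform-both pattern can be tried out on synthetic data first. This is only a sketch: the random data and every `toy_*` name are assumptions made for illustration, chosen so that nothing in the notebook's own variables is overwritten.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic single feature: 100 made-up measurements
rng = np.random.default_rng(42)
toy_X = rng.normal(loc=200.0, scale=15.0, size=(100, 1))

toy_train, toy_test = train_test_split(toy_X, test_size=0.2, random_state=0)

# Fit on the train set ONLY, then transform both sets with the same parameters
toy_scaler = StandardScaler().fit(toy_train)
toy_train_scaled = toy_scaler.transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)

# Compare the summary statistics of the two transformed sets
print("train:", toy_train_scaled.mean(), toy_train_scaled.std())
print("test: ", toy_test_scaled.mean(), toy_test_scaled.std())
```

Compare the two printed lines and relate them to which set the scaler's mean and standard deviation were computed from.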

In [None]:
# get the transformed X_train and X_test data
numeric_X_train = scaler.transform(X_train[numeric_cols])
numeric_X_test  = scaler.transform(X_test[numeric_cols])

Let's see what the transformed data looks like

In [None]:
numeric_X_train_df = pd.DataFrame(data = numeric_X_train, columns= numeric_cols)
display(numeric_X_train_df.head(2))
display(X_train[numeric_cols].head(2))

In [None]:
numeric_X_train_df.describe().loc["mean"]

What do you notice about the mean of the transformed values?

In [None]:
numeric_X_train_df.describe().loc["std"]

What do you notice about the standard deviation of the transformed values?
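One detail worth knowing when reading the `std` row: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `describe()` and `.std()` report the sample standard deviation (`ddof=1`) by default. The sketch below uses synthetic data and made-up names (`toy_frame`, `toy_scaled`) to show the small gap this creates.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# 20 made-up values in a single column
rng = np.random.default_rng(0)
toy_frame = pd.DataFrame({"feature": rng.normal(loc=50, scale=5, size=20)})

toy_scaled = pd.DataFrame(
    StandardScaler().fit_transform(toy_frame), columns=toy_frame.columns
)

# Population std (what StandardScaler targets) vs. sample std (pandas default)
population_std = toy_scaled["feature"].std(ddof=0)
sample_std = toy_scaled["feature"].std(ddof=1)
print(population_std)  # 1 up to floating-point error
print(sample_std)      # slightly above 1: sqrt(n / (n - 1)) with n = 20
```

The gap shrinks as the number of rows grows, so on larger datasets the two values are nearly identical.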

### <span style="color:#1409FA">Encoding Categorical data</span>
The predictors and/or the response sometimes contain non-numeric values.

In [None]:
df[categorical_cols].head()

#### <span style="color:teal">Ordinal encoding</span>
This is the process of assigning each unique category an integer value. By doing this, we impose an ordered relationship between the categories.

For example, age is naturally ordered, so we can map the different ranges to integer values. More specifically, 30-39 => 0, 40-49 => 1, 50-59 => 2, etc.

In [None]:
from sklearn.preprocessing import OrdinalEncoder
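The age-range mapping above can be sketched as follows. The data frame `toy_ages` and the name `age_encoder` are made up for illustration. Passing `categories` explicitly fixes the order of the integer codes rather than relying on the default sorted order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up age ranges, deliberately not in order
toy_ages = pd.DataFrame({"age_range": ["40-49", "30-39", "50-59", "30-39"]})

# One list of categories per column, in the order we want them numbered
age_encoder = OrdinalEncoder(categories=[["30-39", "40-49", "50-59"]])
encoded_ages = age_encoder.fit_transform(toy_ages)

# 30-39 -> 0, 40-49 -> 1, 50-59 -> 2 (returned as floats)
print(encoded_ages.ravel())
```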

#### <span style="color:teal">One-hot encoding</span>
When there is no natural ordinal relationship among different categories, OrdinalEncoder is not an appropriate approach.

In addition, when the response variable has no ordinal relationship, encoding its labels as ordered integer values can result in poor performance. For example, suppose we encode the response's labels as 0, 1, 2. The algorithm can return a prediction of 1.5.

One-hot encoding is the process of transforming each label of the original categorical variable into a new binary variable. This means the total number of features will increase after preprocessing.

In [None]:
from sklearn.preprocessing import OneHotEncoder
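Here is a minimal sketch of one-hot encoding on a made-up species column (the names `toy_species` and `species_encoder` are illustrative only). The encoder returns a sparse matrix by default, so `.toarray()` is used here to make the result easy to read.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up single categorical column with three distinct labels
toy_species = pd.DataFrame({"species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"]})

species_encoder = OneHotEncoder()
one_hot = species_encoder.fit_transform(toy_species).toarray()

# The learned categories, in the column order used by the output
print(species_encoder.categories_)

# One column became three binary columns, with exactly one 1 per row
print(one_hot)
```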

#### <span style="color:teal">Label encoding</span>
sklearn has a separate class, `LabelEncoder`, to encode the target variable itself. It is used when the target is categorical. For example, if we were to predict the sex of the penguin based on the other features, then the labels in the `sex` column would be encoded this way.

In [None]:
from sklearn.preprocessing import LabelEncoder
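A short sketch of encoding a categorical target, using a made-up list of labels (the names `toy_sex` and `sex_encoder` are illustrative only). `LabelEncoder` takes a 1-D sequence of labels, unlike the feature encoders above, which expect 2-D input.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up target labels
toy_sex = ["male", "female", "female", "male"]

sex_encoder = LabelEncoder()
encoded_sex = sex_encoder.fit_transform(toy_sex)

# Classes are stored in sorted order, so female -> 0 and male -> 1
print(sex_encoder.classes_)
print(encoded_sex)

# The integer codes can be mapped back to the original labels
decoded_sex = sex_encoder.inverse_transform(encoded_sex)
print(decoded_sex)
```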

We will learn how to encode the categorical features of our dataset in the next lesson when we perform a linear regression!