### **D2APR: Aprendizado de Máquina e Reconhecimento de Padrões** (IFSP, Campinas) <br/>
**Prof**: Samuel Martins (Samuka) <br/>

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>. <br/><br/>

### Custom CSS style

In [None]:
%%html
<style>
.dashed-box {
    border: 1px dashed black !important;
}
.dashed-box tr {
  background-color: white !important;  
}
.alt-tab {
    background-color: black;
    color: #ffc351;
    padding: 4px;
    font-size: 1em;
    font-weight: bold;
    font-family: monospace;
}
/* add your CSS styling here */
</style>

<span style='font-size: 2.5em'><b>California Housing 🏡</b></span><br/>
<span style='font-size: 1.5em'>Predict the median housing price in California districts</span>

<span style="background-color: #ffc351; padding: 4px; font-size: 1em;"><b>Sprint 4</b></span>

<img src="./imgs/california-flag.png" width=300/>

---



## Before starting this notebook
This jupyter notebook is designed for **experimental and teaching purposes**. <br/>
Although it is (relatively) well organized, it aims at solving the _target problem_ by evaluating (and documenting) _different solutions_ for some steps of the **machine learning pipeline** (see the ***Machine Learning Project Checklist by xavecoding***). <br/>
We tried to keep this notebook literally a _notebook_. Thus, it contains notes, drafts, comments, etc.<br/>

For teaching purposes, some parts of the notebook may be _overcommented_. Moreover, to simulate a real development scenario, we will divide our solution and experiments into **"sprints"** in which each sprint has some goals (e.g., perform _feature selection_, train more ML models, ...). <br/>
The **sprint goal** will be stated at the beginning of the notebook.

A ***final notebook*** (or any other kind of presentation) that compiles and summarizes all sprints — the target problem, solutions, and findings — should be created later.

#### Conventions

<ul>
    <li>💡 indicates a tip. </li>
    <li> ⚠️ indicates a warning message. </li>
    <li><span class='alt-tab'>alt tab</span> indicates extra content (<i>e.g.</i>, slides) that explains a given concept.</li>
</ul>

---

## 🎯 Sprint Goals
- Refactor our code using sklearn Pipelines
- Evaluate the models on the Test Set
- Compare the models with the baseline
---

### 0. Imports and default settings for plotting

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid")

params = {'legend.fontsize': 'x-large',
          'figure.figsize': (15, 5),
         'axes.labelsize': 'x-large',
         'axes.titlesize':'x-large',
         'xtick.labelsize':'x-large',
         'ytick.labelsize':'x-large'}
plt.rcParams.update(params)

## 🛠️ 5. Prepare the Data

#### **Preprocessing tasks**
- Fill in missing values (imputation)
- Add new features
- Feature Scaling
- One-Hot Encoding

<table align="left" class="dashed-box">
<tr>
    <td><span class='alt-tab'>alt tab</span></td>
    <td><b>Slides:</b> Scikit-Learn Design Principles - Hyperparameters vs Parameters<br/>
        <b>Slides:</b> Scikit-Learn Design Principles - Main APIs</td>
</tr>
</table><br/><br/>

### 5.1. Load the cleaned training set

Let's consider the training and testing sets already cleaned (sprint #2):
- Drop duplicated instances (none found)
- Drop instances with `housing_median_age` capped at 52
- Drop instances with `median_house_value` capped at 500001.0

In [None]:
# load the cleaned training set
housing_train = pd.read_csv('./datasets/housing_train_sprint-2.csv')

In [None]:
housing_train.head()

In [None]:
housing_train.shape

### 5.2. Separate the _features_ and the _target outcome_

In [None]:
housing_train.columns

In [None]:
# store the target outcome into a numpy array
y_train = housing_train['median_house_value'].values

In [None]:
y_train

In [None]:
y_train.shape

In [None]:
# overwrite the dataframe with only the features
housing_train = housing_train.drop(columns=['median_house_value'])

In [None]:
housing_train.head()

In [None]:
housing_train.shape

### 5.3. Separate the _numerical_ and _categorical_ features
Since we apply different preprocessing tasks (transformations) to _numerical_ and _categorical_ features, let's split them into two different dataframes.
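
💡 One possible way to do this split is sketched below, on a toy dataframe. The column values and the `ocean_proximity` column are illustrative assumptions, and `select_dtypes` is just one option (listing the column names by hand works equally well).

```python
import pandas as pd

# toy stand-in for `housing_train` (hypothetical values)
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "households": [126.0, 1138.0, 177.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"],
})

# numerical attributes: every column with a numeric dtype
num_attributes = toy.select_dtypes(include="number").columns.tolist()
# categorical attributes: here, all the remaining columns
cat_attributes = toy.select_dtypes(exclude="number").columns.tolist()

# separating the features into two dataframes
toy_num = toy[num_attributes]
toy_cat = toy[cat_attributes]
```

Keeping the two lists of names around is convenient because they can be reused later to tell a `ColumnTransformer` which columns each pipeline should receive.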

In [None]:
housing_train.columns

In [None]:
# numerical attributes


In [None]:
# categorical attributes


In [None]:
# separating the features


In [None]:
housing_train_num.head()

In [None]:
housing_train_cat.head()

### 5.4. Filling in missing values

`sklearn.impute.SimpleImputer` <br/>
https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
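
A minimal sketch of median imputation with `SimpleImputer`, using a toy dataframe with made-up values (not the real training set). It assumes the `median` strategy, which is consistent with the `imputer.statistics_` check in the next cells.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# toy numerical dataframe with one missing value (hypothetical values)
toy_num = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

imputer = SimpleImputer(strategy="median")
imputer.fit(toy_num)  # learns one median per column, ignoring NaNs

# fills in the missing values for ALL columns; the result is a NumPy array
toy_num_filled = imputer.transform(toy_num)
```

The learned medians are stored in `imputer.statistics_` and should match `toy_num.median()`.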

In [None]:
imputer.statistics_  # computed medians

In [None]:
housing_train_num.median()

<table align="left" class="dashed-box">
<tr>
    <td>💡</td>
    <td>The <code>SimpleImputer</code> finds out the <i>statistic for imputation</i> <b>for ALL features</b>.</td>
</tr>
<tr>
    <td></td>
    <td>We can save this <i>transformer</i> to disk for future transformations.</td>
</tr>
</table><br/><br/>

In [None]:
# filling in the missing values FOR ALL attributes
# it generates a numpy array


### 5.5. Adding new features
To _automate data preprocessing_ via sklearn, we will need _to create_ our **own transformer** to add the new features considered.

In [None]:
# template for creating a custom transformer
from sklearn.base import BaseEstimator, TransformerMixin


class NameOfYourTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self  # nothing else to do
    
    def transform(self, X):
        return None  # return the transformed data instead of None


Since our custom transformer may run after other transformers (e.g., the imputer), which return NumPy arrays, we will assume that its input is a **numpy 2D array**, not a _dataframe_. <br/>

This transformer will create 3 new features from the following existing ones:
- `total_rooms`
- `total_bedrooms`
- `population`
- `households`


Thus, we need to find their column indices first because our input will be a **numpy 2D array**.
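
A hedged sketch of what such a transformer could look like. The class name `ToyFeatEngineering`, the column positions, and the three ratio features (`rooms_per_household`, `population_per_household`, `bedrooms_per_room`) are assumptions made for illustration; the notebook does not state which features its own transformer creates.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# column positions in the input array (assumed order; adjust to your data)
ROOMS_IX, BEDROOMS_IX, POPULATION_IX, HOUSEHOLDS_IX = 0, 1, 2, 3


class ToyFeatEngineering(BaseEstimator, TransformerMixin):
    """Appends three ratio features to a 2D NumPy array (illustrative only)."""

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        rooms_per_household = X[:, ROOMS_IX] / X[:, HOUSEHOLDS_IX]
        population_per_household = X[:, POPULATION_IX] / X[:, HOUSEHOLDS_IX]
        bedrooms_per_room = X[:, BEDROOMS_IX] / X[:, ROOMS_IX]
        # append the new columns to the right of the original ones
        return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]


# toy input: total_rooms, total_bedrooms, population, households
X_toy = np.array([[800.0, 200.0, 400.0, 100.0],
                  [600.0, 150.0, 900.0, 300.0]])
X_new = ToyFeatEngineering().transform(X_toy)
```

With a dataframe at hand, the integer position of a column can be obtained with `df.columns.get_loc("total_rooms")` instead of hard-coding it.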

In [None]:
# get the integer index of each attribute/column:


In [None]:
feat_engineer = HousingFeatEngineering()

housing_train_num_new_feats = feat_engineer.transform(housing_train_num.values)  # we need to convert it to numpy first
housing_train_num_new_feats

In [None]:
housing_train_num_new_feats.shape

In [None]:
# show the new feats
housing_train_num_new_feats[:, -3:]

### 5.6. Feature Scaling
Exactly as performed in the previous sprint: **RobustScaler**. <br/>
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html
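
💡 As a reminder of what `RobustScaler` computes with its default settings, the sketch below compares it with the manual formula `(x - median) / IQR` on a toy column containing an outlier (made-up values).

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier

scaler = RobustScaler().fit(x)
x_scaled = scaler.transform(x)

# the same computation by hand: center by the median, scale by the IQR
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
x_manual = (x - median) / (q3 - q1)
```

The median and the interquartile range are barely affected by the outlier, which is why this scaler is called robust.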

In [None]:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
scaler.fit(housing_train_num)

In [None]:
housing_train_num_scaled = scaler.transform(housing_train_num)
housing_train_num_scaled

### 5.7. Categorical Variable Encoding
Instead of using the function `get_dummies()` from _pandas_, let's use a transformer from _sklearn_.

`sklearn.preprocessing.OneHotEncoder` <br/>
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
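
A minimal sketch of one-hot encoding with `OneHotEncoder` on a toy categorical dataframe. The column name `ocean_proximity` and its categories are assumptions used only for illustration.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

toy_cat = pd.DataFrame(
    {"ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "ISLAND"]}
)

encoder = OneHotEncoder()
toy_cat_1hot = encoder.fit_transform(toy_cat)  # SciPy sparse matrix by default

is_sparse = sparse.issparse(toy_cat_1hot)
dense = toy_cat_1hot.toarray()               # dense NumPy array, one column per category
categories = list(encoder.categories_[0])    # categories found for the first column
```

The columns of the encoded matrix follow the order of `encoder.categories_`, which is sorted alphabetically for string categories.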

In [None]:
housing_train_cat

In [None]:
housing_train_cat_1hot

<table align="left" class="dashed-box">
<tr>
    <td>💡</td>
    <td>Notice that the output is a <i>SciPy sparse matrix</i>, instead of a <i>NumPy array</i>. This is very useful when you have categorical attributes with <b>thousands of categories</b>.</td>
</tr>
<tr>
    <td></td>
    <td>After one-hot encoding, we get a matrix with thousands of columns, and the matrix is <i>full of 0s</i> except for <i>a single <b>1</b> per row</i>.</td>
</tr>
<tr>
    <td></td>
    <td>Using up tons of memory mostly to store zeros would be very wasteful, so instead a sparse matrix only stores the location of the nonzero elements.</td>
</tr>
</table><br/><br/>


In [None]:
# converting to NumPy array
housing_train_cat_1hot.toarray()

In [None]:
# getting the list of categories
encoder.categories_

### 5.8. Creating Preprocessing `Pipelines`
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

<table align="left" class="dashed-box">
<tr>
    <td><span class='alt-tab'>alt tab</span></td>
    <td><b>Slides:</b> Scikit-Learn Design Principles - Pipelines<br/></td>
</tr>
</table><br/><br/>

Let's create a **Preprocessing `Pipeline`**.
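
As a rough sketch of the idea, the toy pipeline below chains an imputer and a scaler on made-up data. The step names are arbitrary, and this is not necessarily the exact pipeline of this notebook: a custom feature-engineering transformer could sit between the two steps.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# toy numerical data with one missing value (hypothetical values)
X_toy = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [3.0, 30.0],
                  [4.0, 40.0],
                  [5.0, 50.0]])

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", RobustScaler()),
])

# fits each step in order, feeding the output of one step into the next
X_pre = num_pipeline.fit_transform(X_toy)

# the same result, computed step by step
X_manual = RobustScaler().fit_transform(
    SimpleImputer(strategy="median").fit_transform(X_toy)
)
```

Every step except the last one must be a transformer (i.e., implement `fit` and `transform`); the last step can be any estimator.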

In [None]:
from sklearn.pipeline import Pipeline

#### Pipeline for numerical data

In [None]:
housing_train_num_preprocessed

In [None]:
housing_train_num_preprocessed.shape

#### Pipeline for categorical data

In [None]:
housing_train_cat_preprocessed.toarray()

In [None]:
np.all(housing_train_cat_preprocessed.toarray() == housing_train_cat_1hot.toarray())

### 5.9. Putting it all together with `ColumnTransformer`
https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html

Applies _transformers_ to **columns** of an array or pandas DataFrame. <br/>
This **estimator** allows _different columns_ or _column subsets_ of the input to be **transformed *separately*** and the _features generated_ by each transformer will be _concatenated_ to form a **single feature space**. <br/>

This is useful for _heterogeneous or columnar data_, to combine several feature extraction mechanisms or transformations into a single transformer.
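
A hedged sketch of how the pieces could be combined, on a toy dataframe with made-up values. The transformer names (`"numerical"`, `"categorical"`), the toy columns, and the variable `toy_pipeline` are illustrative; `sparse_threshold=0` is set only to force a dense output here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "ISLAND"],
})
num_attributes = ["total_rooms", "total_bedrooms"]
cat_attributes = ["ocean_proximity"]

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", RobustScaler()),
])

# (name, transformer, columns): each transformer only sees its own columns
toy_pipeline = ColumnTransformer(
    [
        ("numerical", num_pipeline, num_attributes),
        ("categorical", OneHotEncoder(), cat_attributes),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

# 2 numerical columns followed by 3 one-hot columns
toy_pre = toy_pipeline.fit_transform(toy)
```

The output columns are concatenated in the order in which the transformers are listed, so the numerical features come first in this sketch.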

In [None]:
num_attributes

In [None]:
cat_attributes

In [None]:
housing_train.head()

<table align="left" class="dashed-box">
<tr>
    <td>⚠️</td>
    <td><b>BE CAREFUL</b>.</td>
</tr>
<tr>
    <td></td>
    <td>When running the <i>"numerical"</i> pipeline, <code>ColumnTransformer</code> first <i>selects/filters</i> the columns passed in the list <code>num_attributes</code>. We then have a <i>new dataframe</i> with <b>new column indices</b> that will be processed.</td>
</tr>
<tr>
    <td></td>
    <td>When generating new features, our custom transformer <code>HousingFeatEngineering()</code> assumes given values for the indices of <code>total_rooms</code>, <code>total_bedrooms</code>, etc.</td>
</tr>
<tr>
    <td></td>
    <td>These considered indices <b>MUST MATCH EXACTLY</b> with the <i>corresponding columns</i> of the numpy array or dataframe passed as input. For our case, this matching is true.</td>
</tr>
<tr>
    <td></td>
    <td>But, <b>BE CAREFUL!!!</b></td>
</tr>
</table><br/><br/>

In [None]:
housing_train_pre_npy = preprocessed_pipeline.fit_transform(housing_train)

In [None]:
housing_train_pre_npy

In [None]:
preprocessed_pipeline.named_transformers_

In [None]:
preprocessed_pipeline.transformers_

### 5.10. Saving the Preprocessed Pipeline

In [None]:
import joblib

joblib.dump(preprocessed_pipeline, './models/preprocessed_pipeline.pkl')

In [None]:
# to load the pipeline
loaded_preprocessed_pipeline = joblib.load('./models/preprocessed_pipeline.pkl')

In [None]:
# use transform (not fit_transform) so that the check relies on the fitted state restored from disk
housing_train_pre_npy_2 = loaded_preprocessed_pipeline.transform(housing_train)
housing_train_pre_npy_2

In [None]:
np.all(housing_train_pre_npy == housing_train_pre_npy_2)

### 5.11. Saving the Preprocessed Training Set

In [None]:
np.save('./datasets/housing_train_pre_numpy_sprint-4.npy', housing_train_pre_npy)

## 🏋️‍♀️ 6. Train ML Algorithms

### 6.1. Getting the independent (features) and dependent variables (outcome)

In [None]:
X_train = housing_train_pre_npy
# we already have y_train

In [None]:
X_train.shape

In [None]:
y_train.shape

### 6.2. Training the Models

<h3 style="color: #ff5757 !important"><b>Cross-validation</b></h3>

#### **→ Linear Regression**

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()  # default parameters
lin_scores = cross_val_score(lin_reg, X_train, y_train, scoring="neg_mean_squared_error", cv=10)

lin_rmse_scores = np.sqrt(-lin_scores)

In [None]:
# printing function
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

In [None]:
display_scores(lin_rmse_scores)

<br/>

We get exactly the same results as in Sprint #3.
- **Linear Regression:** \\$58,371.04 ± \$1,757.91

#### Training the final model
After cross-validation, we can train our models by using the **entire** _training set_.
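
A minimal sketch of this final fit, using synthetic data (a noiseless linear target) instead of the real `X_train` and `y_train`; the variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic training data: y is an exact linear function of X
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + 3.0

final_model = LinearRegression()
final_model.fit(X_toy, y_toy)  # uses ALL training rows, no held-out fold
```

Cross-validation only estimates how well the algorithm generalizes; the models trained on each fold are discarded. The fitted final model could then be saved with `joblib.dump`, as done for the preprocessing pipeline.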

## 🔬🧪 7. Evaluation on the Test Set

<table align="left" class="dashed-box">
<tr>
    <td>⚠️</td>
    <td>We should evaluate <b>many other</b> <i>quick-and-dirty models</i> before any evaluation on the test set.</td>
</tr>
<tr>
    <td></td>
    <td>The strategy is to select the <i>most promising models</i> and <i>fine-tune them</i> (e.g., perform grid search to find the best hyperparameters and/or try ensemble methods). The selected models could then be evaluated on the test set.</td>
</tr>
<tr>
    <td></td>
    <td>We opted for evaluating our single linear regression model just to complete the end-to-end pipeline in these early sprints. We will perform the above strategy in the next sprints.</td>
</tr>
</table><br/><br/>

### 7.1. Prepare the Data

In [None]:
### Load the testing set
housing_test = pd.read_csv('./datasets/housing_test_sprint-2.csv')

In [None]:
### Separate the _features_ and the _target outcome_


In [None]:
### Preprocess the Test Set


# preprocess the test set
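
⚠️ The test set must go through the transformers **already fitted** on the training set, i.e., call `transform`, not `fit_transform`. The sketch below illustrates the difference with a single `RobustScaler` on made-up data; the same reasoning applies to a whole preprocessing pipeline.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X_train_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
X_test_toy = np.array([[3.0], [7.0]])

scaler = RobustScaler()
scaler.fit(X_train_toy)                    # statistics come from the training data only
X_test_pre = scaler.transform(X_test_toy)  # transform, NOT fit_transform

# refitting on the test set would use the test median/IQR instead (wrong)
X_test_leaky = RobustScaler().fit_transform(X_test_toy)
```

The two results differ, so a model trained on features scaled with the training statistics would receive inconsistent inputs if the scaler were refitted on the test set.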


### 7.2. Prediction

In [None]:
### loading the trained model


### evaluation


### 7.3. Evaluation

#### RMSE

In [None]:
### computing the final score
from sklearn.metrics import mean_squared_error

# RMSE = sqrt(MSE); the `squared=False` argument is deprecated since scikit-learn 1.4
lin_rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))
print(f'RMSE Lin. Reg. in the Test Set: {lin_rmse_test}')

By using our linear regression, the **RMSE** for the Test Set -- which has never been seen/used before -- is **\\$ 59,439.63**. <br/>
This error is _slightly higher_ than the _cross-validation error score_ **\\$58,371.04 ± \$1,757.91**, which is common, especially after fine-tuning the hyperparameters.

We now need to compare our solution with the _current baseline_.

#### Confidence Interval for Squared Errors
In some cases, such a _point estimate_ of the **generalization error** will not be quite enough to convince you to launch: what if it is just _0.1%_ better than the model currently in production? <br/>
You might want to have an idea of how precise this estimate is.

For this, you can compute a ***95% confidence interval*** for the generalization error.

https://github.com/xavecoding/IFSP-CMP-D1AED-2021.1/blob/main/data_distributions/data_distributions.ipynb

<img src='./imgs/confidence_interval.png' />
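
A hedged sketch of this computation on synthetic predictions (the data below is randomly generated, not the notebook's test set). It uses the normal approximation via `scipy.stats.norm.interval`; a Student's t interval would be a reasonable alternative for small samples.

```python
import numpy as np
from scipy import stats

# synthetic targets and predictions (illustrative only)
rng = np.random.default_rng(42)
y_true = rng.uniform(100_000, 400_000, size=500)
y_pred = y_true + rng.normal(0, 50_000, size=500)

squared_errors = (y_pred - y_true) ** 2
mean_se = squared_errors.mean()       # sample mean of the squared errors (MSE)
std_err = stats.sem(squared_errors)   # standard error: sample std / sqrt(n)

# 95% interval for the mean squared error;
# the confidence level is passed positionally
ci_squared = stats.norm.interval(0.95, loc=mean_se, scale=std_err)

# square root to express the interval in the units of the target (RMSE)
ci_rmse = np.sqrt(ci_squared)
```

The interval is built around the mean of the _squared_ errors, so the square root is applied only at the end.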

In [None]:
## alternatively


In [None]:
from scipy.stats import norm

# first argument ==> confidence level (named `confidence`; `alpha` in older SciPy releases)
# loc ==> sample mean
# scale ==> standard error



In [None]:
# using the sqrt to keep the errors in the same units
np.sqrt(confidence_interval_squared_errors)

Therefore, we are 95% confident that the interval \[\\$56,281.32, \\$62,438.39\] contains the true generalization error (RMSE).

### 7.4. Comparing our model with the Baseline
Let's first recover the description of the **baseline** from Sprint #1.

#### **Baseline:**
Currently, the **district housing prices** are estimated ***manually by experts***: a team gathers up-to-date information about a district and finds out the _median housing price_. 
This is _costly_ and _time-consuming_, and their **estimates are not great**; they often realize that **their estimates were off by more than 20%**.

Note that this description is a bit vague. We only have an approximation: their estimates were off by more than 20%. <br/>
We do not have a concrete **error** for the baseline. <br/>

To overcome this, we will assume that the baseline overestimates the housing prices by **20% to 25%**. 

In [None]:
np.random.seed(42)

In [None]:
y_test_pred_baseline = []

for true_housing_price in y_test:
    error_rate = 1 + np.random.randint(20, 26) / 100
    y_test_pred_baseline.append(true_housing_price * error_rate)

#### RMSE

In [None]:
# RMSE = sqrt(MSE); the `squared=False` argument is deprecated since scikit-learn 1.4
baseline_rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred_baseline))
print(f'RMSE Baseline in the Test Set: {baseline_rmse_test}')

### Discussion

The final performance of our linear regression model (**\\$ 59,439.63**) is not better than the experts’ price estimates (**\\$47,911.00**), which were often off by about 20%. Therefore, it is not ready to be launched in production. We need to find a better model.

We may follow some strategies to find a better model than our current one:
- Evaluate many other models/algorithms (_e.g.,_ polynomial regression, KNN regression, SVM regression, ...)
- Apply some feature selection method
- Fine-tune the models to find the best hyperparameters
- Try ensemble methods

After all, a model with a score similar to the baseline might be enough. Even if it is not more accurate than the baseline (i.e., its error is not lower), the fact that the model is automatic frees up some time for the experts, so they can work on more interesting and productive tasks.