### **D2APR: Aprendizado de Máquina e Reconhecimento de Padrões** (IFSP, Campinas) <br/>
**Prof**: Samuel Martins (Samuka) <br/>

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>. <br/><br/>

#### Custom CSS style

In [None]:
%%html
<style>
.dashed-box {
    border: 1px dashed black !important;
    /* font-size: var(--jp-content-font-size1) !important; */
}

.dashed-box table {

}

.dashed-box tr {
    background-color: white !important;
}
        
.alt-tab {
    background-color: black;
    color: #ffc351;
    padding: 4px;
    font-size: 1em;
    font-weight: bold;
    font-family: monospace;
}
/* add your CSS styling here */
</style>

<span style='font-size: 2.5em'><b>California Housing 🏡</b></span><br/>
<span style='font-size: 1.5em'>Predict the median housing price in California districts</span>

<span style="background-color: #ffc351; padding: 4px; font-size: 1em;"><b>Sprint #3</b></span>

<img src="./imgs/california-flag.png" width=300/>

---



## Before starting this notebook
This Jupyter notebook is designed for **experimental and teaching purposes**. <br/>
Although it is (relatively) well organized, it aims to solve the _target problem_ by evaluating (and documenting) _different solutions_ for some steps of the **machine learning pipeline** (see the ***Machine Learning Project Checklist by xavecoding***). <br/>
We tried to make this notebook literally a _notebook_, so it contains notes, drafts, comments, etc.<br/>

For teaching purposes, some parts of the notebook may be _overcommented_. Moreover, to simulate a real development scenario, we will divide our solution and experiments into **"sprints"** in which each sprint has some goals (e.g., perform _feature selection_, train more ML models, ...). <br/>
The **sprint goal** will be stated at the beginning of the notebook.

A ***final notebook*** (or any other kind of presentation) that compiles and summarizes all sprints — the target problem, solutions, and findings — should be created later.

#### Conventions

<ul>
    <li>💡 indicates a tip. </li>
    <li> ⚠️ indicates a warning message. </li>
    <li><span class='alt-tab'>alt tab</span> indicates extra content (<i>e.g.</i>, slides) that explains a given concept.</li>
</ul>

---

## 🎯 Sprint Goals
- Add new features
- Normalize the data
- Add a new model: Decision Tree Regression
---

### 0. Imports and default settings for plotting

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid")

params = {'legend.fontsize': 'x-large',
          'figure.figsize': (15, 5),
         'axes.labelsize': 'x-large',
         'axes.titlesize':'x-large',
         'xtick.labelsize':'x-large',
         'ytick.labelsize':'x-large'}
plt.rcParams.update(params)

## 💽 2. Get the Data
In the previous sprint, we removed outliers from the entire dataset, split it into training and testing sets, and preprocessed the training set (by filling in missing values for `total_bedrooms`). <br/>
Both the preprocessed training set and the (raw) testing set were _saved to disk_. Let's use them here.

### 2.2. Load the Data

In [None]:
import pandas as pd

housing_train_pre = pd.read_csv('./datasets/housing_train_pre_sprint-2.csv')  # preprocessed train set
housing_test = pd.read_csv('./datasets/housing_test_sprint-2.csv')

In [None]:
housing_train_pre.head()

In [None]:
housing_test.head()

## 🛠️ 5. Prepare the Data

### 5.1. Adding new features (_independent variables_)

The _total number of rooms_ in a district _is not very useful_ if you don’t know how many households there are. What you really want is **the number of rooms per household**. <br/>
Similarly, the _total number of bedrooms_ by itself _is not very useful_: you probably want to compare it to the number of rooms. <br/>
And the **population per household** also seems like an interesting attribute combination to look at.

Let’s create these new attributes:
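The cell that builds these attributes is not shown here, so the following is only a minimal sketch of how the three ratios could be computed. It assumes the column names used elsewhere in this notebook (`total_rooms`, `total_bedrooms`, `population`, `households`). The helper `add_ratio_features` and the toy DataFrame are hypothetical and exist only for illustration.

```python
import pandas as pd


def add_ratio_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `df` with three ratio features appended (hypothetical helper)."""
    out = df.copy()
    out['rooms_per_household'] = out['total_rooms'] / out['households']
    out['bedrooms_per_room'] = out['total_bedrooms'] / out['total_rooms']
    out['population_per_household'] = out['population'] / out['households']
    return out


# Toy data standing in for the real training set (made-up values).
toy = pd.DataFrame({
    'total_rooms': [880.0, 7099.0],
    'total_bedrooms': [129.0, 1106.0],
    'population': [322.0, 2401.0],
    'households': [126.0, 1138.0],
})

toy_with_ratios = add_ratio_features(toy)
print(toy_with_ratios[['rooms_per_household', 'bedrooms_per_room', 'population_per_household']])
```

In the notebook, the same helper would be applied to both `housing_train_pre` and `housing_test`. Since the test set was saved raw, rows with a missing `total_bedrooms` would yield a missing `bedrooms_per_room` until they are imputed.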

In [None]:
housing_train_pre.head()

In [None]:
housing_test.head()

<br/>

We could perform the EDA on the training set again but now considering these new features. <br/>
For now, let's just check the **correlation** between these _new features_ and the _target outcome_.

#### **Correlation**
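As a rough sketch of this check, the Pearson correlation of every column with the target can be read from one column of the correlation matrix. The toy DataFrame below uses made-up values and only a few columns; in the notebook, the same call would be made on `housing_train_pre`.

```python
import pandas as pd

# Toy frame with made-up values; in the notebook this would be housing_train_pre.
toy = pd.DataFrame({
    'median_income': [8.3, 7.2, 5.6, 3.8, 2.1],
    'bedrooms_per_room': [0.15, 0.16, 0.18, 0.21, 0.26],
    'median_house_value': [452600.0, 352100.0, 341300.0, 226700.0, 140000.0],
})

# Pearson correlation of each column with the target, strongest positive first.
corr_with_target = toy.corr()['median_house_value'].sort_values(ascending=False)
print(corr_with_target)
```

Sorting the values puts strongly negative correlations at the bottom of the list, so both ends of the output are worth reading.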

Hey, not bad! The new `bedrooms_per_room` attribute is much more correlated with the `median_house_value` than the total number of rooms or bedrooms. <br/>
Apparently, houses with a <i>lower bedroom/room ratio</i> tend to be <b>more expensive</b>. <br/>
The number of rooms per household (`rooms_per_household`) is also <b>more informative</b> than the total number of rooms (`total_rooms`) in a district: obviously, the larger the houses, the more expensive they are.

Another interesting point is the correlation between the <i>dummy variables</i> and the `median_house_value` (not done in previous sprints). <br/>
The dummy variable `ocean_proximity_INLAND` has a <i>strong negative correlation</i> with the `median_house_value`, whereas `ocean_proximity_<1H OCEAN` has a <i>strong positive one</i>.

<table align="left" class="dashed-box">
<tr>
    <td>💡</td>
    <td>This round of exploration does not have to be absolutely thorough; the point is to quickly gain insights that help you improve your models.</td>
</tr>
<tr>
    <td></td>
    <td>But this is an <i>iterative process</i>: once you get a prototype up and running, you can analyze its output to gain more insights and come back to this exploration step.</td>
</tr>
</table><br/><br/>

### 5.2. Separating the independent variables (features) and the _dependent variable_ (target outcome)

In [None]:
housing_train_target = housing_train_pre['median_house_value'].copy()
housing_train_pre = housing_train_pre.drop(columns=['median_house_value'])

### 5.3. Feature Scaling

With few exceptions, ML algorithms **don’t perform well** when the _input numerical attributes_ have **very different scales**. <br/>
For example, compare the scale of the attributes: `median_income` and `median_house_value`.

Although **feature scaling** _is not_ necessary for Linear Regression, we intend to soon evaluate other regression methods that may need it, so we will perform it. <br/><br/>

There are two common ways to get all attributes to have the same scale: _min-max scaling_ and _standardization_.

<img src='./imgs/normalization-vs-standardization.png' width=600/>
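To make the two options concrete, here is a small sketch that applies both formulas by hand to a single made-up feature. Min-max scaling computes `(x - min) / (max - min)`, and standardization computes `(x - mean) / std`. The array `x` is purely illustrative.

```python
import numpy as np

# Made-up values for a single feature, including one large value.
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

# Min-max scaling (normalization): rescales the values to the [0, 1] range.
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: subtract the mean and divide by the standard deviation,
# giving zero mean and unit variance (but no fixed range).
x_std = (x - x.mean()) / x.std()

print(x_minmax)
print(x_std)
```

Note that `x.std()` uses the population standard deviation (`ddof=0`), which is also what scikit-learn's `StandardScaler` uses.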

<table align="left" class="dashed-box">
<tr>
    <td>⚠️</td>
    <td>Note that scaling the <i>target outcome</i> is generally <b>not required</b>.</td>
</tr>
</table><br/><br/>

<table align="left" class="dashed-box">
<tr>
    <td>⚠️</td>
    <td>We <b>do not</b> need to scale the <i>binary dummy variables</i>.</td>
</tr>
</table><br/><br/>

<table align="left" class="dashed-box">
<tr>
    <td>⚠️</td>
    <td>As with all the transformations, it is important <i>to fit the scalers</i> to the <b>training data <i>only</i></b>, <b>not</b> to the <i>full dataset</i> (including the <i>test set</i>).</td>
</tr>
<tr>
    <td></td>
    <td>Only then can you use them to transform the training set and the test set (and new data).</td>
</tr>
</table><br/><br/>

Let's use **Standardization**.

In [None]:
housing_train_pre.head()

In [None]:
housing_train_pre.columns

In [None]:
numeric_variables = ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income', 'rooms_per_household', 'bedrooms_per_room', 'population_per_household']
dummy_variables = ['ocean_proximity_<1H OCEAN', 'ocean_proximity_INLAND', 'ocean_proximity_ISLAND', 'ocean_proximity_NEAR BAY', 'ocean_proximity_NEAR OCEAN']
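The scaling cell itself is not shown here, so the following is only a sketch of one way to standardize the numeric columns while leaving the dummy variables untouched. It uses scikit-learn's `StandardScaler`, fitted on the training data only. The toy DataFrames, the reduced column lists, and the names `X_train_demo` / `X_test_demo` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Reduced, made-up stand-ins for the notebook's column lists and datasets.
numeric_cols = ['median_income', 'rooms_per_household']
dummy_cols = ['ocean_proximity_INLAND']

train = pd.DataFrame({
    'median_income': [2.0, 4.0, 6.0, 8.0],
    'rooms_per_household': [4.0, 5.0, 6.0, 7.0],
    'ocean_proximity_INLAND': [1, 0, 0, 1],
})
test = pd.DataFrame({
    'median_income': [5.0],
    'rooms_per_household': [5.5],
    'ocean_proximity_INLAND': [0],
})

scaler = StandardScaler()
# Fit on the training data only, then reuse the same statistics for the test set.
train_num_scaled = scaler.fit_transform(train[numeric_cols])
test_num_scaled = scaler.transform(test[numeric_cols])

# Dummy variables are appended unscaled.
X_train_demo = np.hstack([train_num_scaled, train[dummy_cols].to_numpy()])
X_test_demo = np.hstack([test_num_scaled, test[dummy_cols].to_numpy()])

print(X_train_demo)
print(X_test_demo)
```

In the notebook, the same pattern would use `numeric_variables` and `dummy_variables` on `housing_train_pre` to build the feature matrix used for training.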

## 🏋️‍♀️ 6. Train ML Algorithms

### 6.1. Getting the independent (features) and dependent variables (outcome)

In [None]:
# we already have X_train
y_train = housing_train_target.values

### 6.2. Training the Models

#### Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

linear_regressor = LinearRegression()  # default parameters
linear_regressor.fit(X_train, y_train)

#### Decision Tree Regression
This is a powerful model, capable of finding complex nonlinear relationships in the data.
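The training cell for this model is not shown here. As a hedged sketch, a decision tree regressor can be trained with scikit-learn's `DecisionTreeRegressor` in the same way as the linear model above. The synthetic data and the names `X_demo`, `y_demo`, and `tree_regressor_demo` are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic nonlinear data (made up for illustration): y = sin(x) plus small noise.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 6, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

# random_state fixes tie-breaking between splits; other parameters are defaults.
tree_regressor_demo = DecisionTreeRegressor(random_state=42)
tree_regressor_demo.fit(X_demo, y_demo)

# With no depth limit, the tree keeps splitting until it fits the training data.
print(tree_regressor_demo.score(X_demo, y_demo))  # R² on the training data
```

In the notebook, the equivalent would presumably be a `tree_regressor` fitted on `X_train` and `y_train`, which is the name the evaluation cells below expect.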

### 6.3. Evaluating on the Training Set

#### **→ Linear Regression**

##### **Prediction**

In [None]:
y_train_pred_lin_reg = linear_regressor.predict(X_train)

##### **Evaluation**

In [None]:
from sklearn.metrics import r2_score

lin_reg_r2 = r2_score(y_train, y_train_pred_lin_reg)
print(f'R² linear regression = {lin_reg_r2}')

In [None]:
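from sklearn.metrics import mean_squared_error

# `squared=False` was deprecated in scikit-learn 1.4 and removed in later releases,
# so take the square root of the MSE explicitly
lin_reg_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred_lin_reg))
print(f'RMSE = {lin_reg_rmse}')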

The RMSE (\\$58,146) has slightly decreased compared to Sprint #2 (\\$58,689).

##### **Visual Analysis**

In [None]:
sns.scatterplot(x=y_train_pred_lin_reg, y=y_train)
plt.xlabel('Prediction')
plt.ylabel('Real')
plt.title('Median housing value - Prediction vs Real - Linear Regression')

In [None]:
residual_lin_reg = y_train - y_train_pred_lin_reg

sns.scatterplot(x=y_train_pred_lin_reg, y=residual_lin_reg)
plt.xlabel('Prediction')
plt.ylabel('Residual')
plt.title('Median housing value - Prediction vs Residual - Linear Regression')

<br/><br/>

#### **→ Decision Tree**

##### **Prediction**

In [None]:
y_train_pred_tree_reg = tree_regressor.predict(X_train)

##### **Evaluation**

In [None]:
from sklearn.metrics import r2_score

tree_reg_r2 = r2_score(y_train, y_train_pred_tree_reg)
print(f'R² decision tree regression = {tree_reg_r2}')

In [None]:
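from sklearn.metrics import mean_squared_error

# `squared=False` was deprecated in scikit-learn 1.4 and removed in later releases,
# so take the square root of the MSE explicitly
tree_reg_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred_tree_reg))
print(f'RMSE = {tree_reg_rmse}')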

Wait, what!? No error at all? Could this model really be absolutely perfect? <br/>
Of course, it is much more likely that the model has badly <b>overfit</b> the data.

We'd better evaluate it by using **Cross-Validation**.
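As a sketch of what such an evaluation could look like, the snippet below compares the training RMSE of an unconstrained tree with its 10-fold cross-validated RMSE using scikit-learn's `cross_val_score`. The synthetic data, the number of folds, and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (made up for illustration).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 6, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=300)

tree = DecisionTreeRegressor(random_state=0)

# Training RMSE: the unconstrained tree memorizes the training data.
tree.fit(X_demo, y_demo)
train_rmse = np.sqrt(np.mean((y_demo - tree.predict(X_demo)) ** 2))

# 10-fold cross-validation: scikit-learn returns the negated MSE, so flip the sign.
neg_mse_scores = cross_val_score(tree, X_demo, y_demo,
                                 scoring='neg_mean_squared_error', cv=10)
cv_rmse = np.sqrt(-neg_mse_scores)

print(f'train RMSE = {train_rmse:.3f}')
print(f'CV RMSE    = {cv_rmse.mean():.3f} (+/- {cv_rmse.std():.3f})')
```

A large gap between the two numbers is the signature of overfitting: the error on held-out folds reflects how the model behaves on data it has not seen.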

##### **Visual Analysis**

In [None]:
sns.scatterplot(x=y_train_pred_tree_reg, y=y_train)
plt.xlabel('Prediction')
plt.ylabel('Real')
plt.title('Median housing value - Prediction vs Real - Decision Tree Regression')

In [None]:
residual_tree_reg = y_train - y_train_pred_tree_reg

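# plot the tree's own predictions (not the linear model's) against its residuals
sns.scatterplot(x=y_train_pred_tree_reg, y=residual_tree_reg)
plt.xlabel('Prediction')
plt.ylabel('Residual')
plt.title('Median housing value - Prediction vs Residual - Decision Tree Regression')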