### **D2APR: Aprendizado de Máquina e Reconhecimento de Padrões** (IFSP, Campinas) <br/>
**Prof**: Samuel Martins (Samuka) <br/>

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>. <br/><br/>

### Custom CSS style

In [1]:
%%html
<style>
.dashed-box {
    border: 1px dashed black !important;
}
.dashed-box tr {
  background-color: white !important;  
}
.alt-tab {
    background-color: black;
    color: #ffc351;
    padding: 4px;
    font-size: 1em;
    font-weight: bold;
    font-family: monospace;
}
/* add your CSS styling here */
</style>

<span style='font-size: 2.5em'><b>California Housing 🏡</b></span><br/>
<span style='font-size: 1.5em'>Predict the median housing price in California districts</span>

<span style="background-color: #ffc351; padding: 4px; font-size: 1em;"><b>Test Sprint - pipelines</b></span>

<img src="./imgs/california-flag.png" width=300/>

---



## Before starting this notebook
This Jupyter notebook is designed for **experimental and teaching purposes**. <br/>
Although it is (relatively) well organized, it aims to solve the _target problem_ by evaluating (and documenting) _different solutions_ for some steps of the **machine learning pipeline** (see the ***Machine Learning Project Checklist by xavecoding***). <br/>
We tried to keep this notebook literally a _notebook_, so it contains notes, drafts, comments, etc.<br/>

For teaching purposes, some parts of the notebook may be _overcommented_. Moreover, to simulate a real development scenario, we will divide our solution and experiments into **"sprints"** in which each sprint has some goals (e.g., perform _feature selection_, train more ML models, ...). <br/>
The **sprint goal** will be stated at the beginning of the notebook.

A ***final notebook*** (or any other kind of presentation) that compiles and summarizes all sprints — the target problem, solutions, and findings — should be created later.

#### Conventions

<ul>
    <li>💡 indicates a tip. </li>
    <li> ⚠️ indicates a warning message. </li>
    <li><span class='alt-tab'>alt tab</span> indicates extra content (<i>e.g.</i>, slides) that explains a given concept.</li>
</ul>

---

## 🎯 Sprint Goals
- Refactor Sprint 1 with sklearn Pipelines
---

## 🛠️ 5. Prepare the Data

In [3]:
import pandas as pd

In [6]:
# load training set
housing_train = pd.read_csv('./datasets/housing_train.csv')

In [7]:
housing_train.columns

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity'],
      dtype='object')

In [8]:
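# ⚠️ note: this list includes the target `median_house_value`, so the pipeline below
# imputes and scales it together with the features; it should normally be separated
# from the features before training a model
num_attributes = [
    'longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms',
    'population', 'households', 'median_income', 'median_house_value',
]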
cat_attributes = ['ocean_proximity']

In [9]:
housing_train_num = housing_train[num_attributes]
housing_train_cat = housing_train[cat_attributes]
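
The hard-coded attribute lists above could also be derived from the column dtypes. The following is a minimal sketch, assuming a small stand-in DataFrame (`toy`, `num_cols`, and `cat_cols` are illustrative names, not part of the notebook):

```python
import pandas as pd

# toy frame standing in for housing_train (a few values copied from the head() output)
toy = pd.DataFrame({
    "total_rooms": [1568.0, 679.0, 1952.0],
    "households": [339.0, 113.0, 462.0],
    "ocean_proximity": ["<1H OCEAN", "<1H OCEAN", "NEAR OCEAN"],
})

# numerical attributes: every column with a numeric dtype
num_cols = toy.select_dtypes(include="number").columns.tolist()
# categorical attributes: every remaining column
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(num_cols)
print(cat_cols)
```

Explicit lists, as used in this notebook, remain the safer choice when a numeric column (for example, an ID or the target) must be excluded by hand.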

In [10]:
housing_train_num.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -121.89 | 37.29 | 38.0 | 1568.0 | 351.0 | 710.0 | 339.0 | 2.7042 | 286600.0 |
| 1 | -121.93 | 37.05 | 14.0 | 679.0 | 108.0 | 306.0 | 113.0 | 6.4214 | 340600.0 |
| 2 | -117.2 | 32.77 | 31.0 | 1952.0 | 471.0 | 936.0 | 462.0 | 2.8621 | 196900.0 |
| 3 | -119.61 | 36.31 | 25.0 | 1847.0 | 371.0 | 1460.0 | 353.0 | 1.8839 | 46300.0 |
| 4 | -118.59 | 34.23 | 17.0 | 6592.0 | 1525.0 | 4459.0 | 1463.0 | 3.0347 | 254500.0 |


In [24]:
housing_train_cat.head()

| | ocean_proximity |
|---|---|
| 0 | <1H OCEAN |
| 1 | <1H OCEAN |
| 2 | NEAR OCEAN |
| 3 | INLAND |
| 4 | <1H OCEAN |


In [28]:
X = housing_train_num.values
X.shape

(16512, 9)

In [71]:
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

# column indices of the attributes used below (positions in `num_attributes`)
rooms_col_idx, bedrooms_col_idx, population_col_idx, households_col_idx = 3, 4, 5, 6

class HousingFeatEngineering(BaseEstimator, TransformerMixin):
    """Append three ratio features to a 2D NumPy array of numerical attributes."""

    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self  # stateless transformer: nothing to learn
    
    def transform(self, X):
        n_rows = X.shape[0]
        
        # average number of rooms per household
        rooms_per_household = X[:, rooms_col_idx] / X[:, households_col_idx]
        rooms_per_household = rooms_per_household.reshape((n_rows, 1))
        
        # average number of people per household
        population_per_household = X[:, population_col_idx] / X[:, households_col_idx]
        population_per_household = population_per_household.reshape((n_rows, 1))
        
        # fraction of rooms that are bedrooms
        bedrooms_per_room = X[:, bedrooms_col_idx] / X[:, rooms_col_idx]
        bedrooms_per_room = bedrooms_per_room.reshape((n_rows, 1))
        
        # stack the three new columns to the right of the original ones
        return np.hstack((X, rooms_per_household, population_per_household, bedrooms_per_room))
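
As a quick sanity check of the ratios computed in `transform`, the sketch below recomputes them with plain NumPy on two rows copied from `housing_train_num.head()`. The array `X_toy` and the short index names are illustrative assumptions; the sketch does not call the class itself.

```python
import numpy as np

# two rows copied from housing_train_num.head(); column order follows num_attributes
X_toy = np.array([
    [-121.89, 37.29, 38.0, 1568.0, 351.0, 710.0, 339.0, 2.7042, 286600.0],
    [-121.93, 37.05, 14.0, 679.0, 108.0, 306.0, 113.0, 6.4214, 340600.0],
])

rooms, bedrooms, population, households = 3, 4, 5, 6  # same column indices as above

# indexing with a list ([rooms]) keeps each column 2D, so no reshape is needed
rooms_per_household = X_toy[:, [rooms]] / X_toy[:, [households]]
population_per_household = X_toy[:, [population]] / X_toy[:, [households]]
bedrooms_per_room = X_toy[:, [bedrooms]] / X_toy[:, [rooms]]

X_extended = np.hstack((X_toy, rooms_per_household, population_per_household, bedrooms_per_room))
print(X_extended.shape)  # (2, 12): 9 original columns + 3 engineered ones
```

Because the transformer relies on fixed column positions, it only gives meaningful results when the input columns keep the order of `num_attributes`.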

In [81]:
# preprocessing pipelines for numerical and categorical attributes
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('feat_engineering', HousingFeatEngineering()),
    ('std_scaler', StandardScaler())
])

cat_pipeline = Pipeline([
    ('one-hot-enconding', OneHotEncoder(handle_unknown='ignore'))
])

full_pipeline = ColumnTransformer([
    ("numerical", num_pipeline, num_attributes),
    ("categorical", cat_pipeline, cat_attributes),
])
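
To illustrate how a `ColumnTransformer` routes columns to different sub-pipelines, here is a minimal sketch on a toy DataFrame. The frames `toy_train` and `toy_new`, the helper `to_dense`, and the reduced pipeline (without the feature-engineering step) are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy_train = pd.DataFrame({
    "total_rooms": [1568.0, 679.0, np.nan, 1847.0],  # one missing value to impute
    "ocean_proximity": ["<1H OCEAN", "<1H OCEAN", "NEAR OCEAN", "INLAND"],
})
toy_new = pd.DataFrame({
    "total_rooms": [6592.0],
    "ocean_proximity": ["ISLAND"],  # category not seen during fit
})

toy_pipeline = ColumnTransformer([
    ("numerical", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("std_scaler", StandardScaler()),
    ]), ["total_rooms"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["ocean_proximity"]),
])


def to_dense(a):
    """Return a dense array whether the transformer produced sparse or dense output."""
    return a.toarray() if hasattr(a, "toarray") else np.asarray(a)


train_pre = to_dense(toy_pipeline.fit_transform(toy_train))  # learn statistics and categories
new_pre = to_dense(toy_pipeline.transform(toy_new))          # reuse them on new data

print(train_pre.shape)  # 1 scaled numeric column + 3 one-hot columns
print(new_pre[0, 1:])   # the unseen category is encoded as all zeros
```

With `handle_unknown='ignore'`, a category that was absent during `fit` does not raise an error; its one-hot columns are simply all zero.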


In [86]:
housing_train_pre_npy = full_pipeline.fit_transform(housing_train)
housing_train_pre_npy

array([[-1.15604281,  0.77194962,  0.74333089, ...,  0.        ,
         0.        ,  0.        ],
       [-1.17602483,  0.6596948 , -1.1653172 , ...,  0.        ,
         0.        ,  0.        ],
       [ 1.18684903, -1.34218285,  0.18664186, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [ 1.58648943, -0.72478134, -1.56295222, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.78221312, -0.85106801,  0.18664186, ...,  0.        ,
         0.        ,  0.        ],
       [-1.43579109,  0.99645926,  1.85670895, ...,  0.        ,
         1.        ,  0.        ]])

In [92]:
import joblib
joblib.dump(full_pipeline, 'full_pipeline.pkl')

['full_pipeline.pkl']

In [93]:
loaded_full_pipeline = joblib.load('full_pipeline.pkl')

In [94]:
# the loaded pipeline is already fitted, so `transform` is enough (no need to refit)
housing_train_pre_npy_2 = loaded_full_pipeline.transform(housing_train)
housing_train_pre_npy_2

array([[-1.15604281,  0.77194962,  0.74333089, ...,  0.        ,
         0.        ,  0.        ],
       [-1.17602483,  0.6596948 , -1.1653172 , ...,  0.        ,
         0.        ,  0.        ],
       [ 1.18684903, -1.34218285,  0.18664186, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [ 1.58648943, -0.72478134, -1.56295222, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.78221312, -0.85106801,  0.18664186, ...,  0.        ,
         0.        ,  0.        ],
       [-1.43579109,  0.99645926,  1.85670895, ...,  0.        ,
         1.        ,  0.        ]])

In [96]:
np.all(housing_train_pre_npy == housing_train_pre_npy_2)

True
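
A persisted pipeline is typically reloaded to transform data it has not seen during fitting. The sketch below shows that round trip under simplifying assumptions: a one-step pipeline, toy arrays (`X_train`, `X_new`), and a temporary directory instead of the notebook's `full_pipeline.pkl`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_new = np.array([[4.0]])

pipe = Pipeline([("std_scaler", StandardScaler())])
pipe.fit(X_train)  # learns mean and standard deviation from the training data

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.pkl")
    joblib.dump(pipe, path)
    loaded = joblib.load(path)

# the loaded pipeline keeps the statistics learned from X_train,
# so transform() (not fit_transform()) is what new data needs
same = np.allclose(pipe.transform(X_new), loaded.transform(X_new))
print(same)
```

Note that a pickle stores only a reference to custom classes such as `HousingFeatEngineering`, so the class must be defined or importable in the session that loads the file.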


In [77]:
housing_train_num_pre = num_pipeline.fit_transform(housing_train_num)

In [78]:
housing_train_num_pre.shape

(16512, 9)