# Preprocessing Workflow


🎯 This exercise will guide you through the preprocessing workflow. Step by step, feature by feature, you will investigate the dataset and make preprocessing decisions accordingly.

🌤 We stored the `ML_Houses_dataset.csv` [here](https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Houses_dataset.csv) in the cloud.

👇 Run the code below to load the dataset and the features you will be working with.

In [56]:
import pandas as pd

# Loading the dataset
url = "https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Houses_dataset.csv"
data = pd.read_csv(url)

# Selecting some columns of interest
selected_features = ['GrLivArea',
                     'BedroomAbvGr',
                     'KitchenAbvGr', 
                     'OverallCond',
                     'RoofSurface',
                     'GarageFinish',
                     'CentralAir',
                     'ChimneyStyle',
                     'MoSold',
                     'SalePrice']

# Overwriting the "data" variable to keep only the columns of interest
# Notice the .copy() to copy the values 
data = data[selected_features].copy()

# Showing the first five rows
data.head() 

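| | GrLivArea | BedroomAbvGr | KitchenAbvGr | OverallCond | RoofSurface | GarageFinish | CentralAir | ChimneyStyle | MoSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1710 | 3 | 1 | 5 | 1995.0 | RFn | Y | bricks | 2 | 208500 |
| 1 | 1262 | 3 | 1 | 8 | 874.0 | RFn | Y | bricks | 5 | 181500 |
| 2 | 1786 | 3 | 1 | 5 | 1593.0 | RFn | Y | castiron | 9 | 223500 |
| 3 | 1717 | 3 | 1 | 5 | 2566.0 | Unf | Y | castiron | 2 | 140000 |
| 4 | 2198 | 4 | 1 | 5 | 3130.0 | RFn | Y | bricks | 12 | 250000 |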


📚 Take the time to do a ***preliminary investigation*** of the features by reading the ***dataset description*** available [here](https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Houses_dataset_description.txt). Make sure to refer to it throughout the day.

## (1) Duplicates

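ℹ️ ***Duplicates in datasets can cause data leakage:*** if the same row ends up in both the training set and the test set, the model is evaluated on data it has already seen.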

👉 It is important to locate and remove duplicates.
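
As a minimal sketch of what locating and removing duplicates looks like, the toy DataFrame below (hypothetical values, not taken from the houses dataset) contains one repeated row. `DataFrame.duplicated()` flags every repeat after its first occurrence, and `DataFrame.drop_duplicates()` keeps only the first occurrence.

```python
import pandas as pd

# Toy data (hypothetical): the third row repeats the first one
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1710, 1786],
    "SalePrice": [208500, 181500, 208500, 223500],
})

# duplicated() marks a row as True when an identical row appeared earlier
toy_duplicate_count = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each distinct row
toy_clean = toy.drop_duplicates()

print(toy_duplicate_count)  # number of repeated rows
print(len(toy_clean))       # rows left after removal
```

The toy names (`toy`, `toy_clean`) are chosen so that running this sketch does not overwrite the `data` variable used in the exercise.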

❓ How many duplicated rows are there in the dataset ❓

<i>Save your answer under variable name `duplicate_count`.</i>

In [57]:
duplicate_count = data.duplicated().sum()
duplicate_count

300

❓ Remove the duplicates from the dataset. Overwrite the dataframe `data` ❓

In [58]:
data = data.drop_duplicates()

🧪 **Test your code**

In [59]:
from nbresult import ChallengeResult

result = ChallengeResult('duplicates',
                         duplicates = duplicate_count,
                         dataset = data
)

result.write()
print(result.check())


platform darwin -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /Users/bingobango/.pyenv/versions/tom/bin/python3
cachedir: .pytest_cache
rootdir: /Users/bingobango/code/lewagon/data-preprocessing-workflow/tests
plugins: anyio-3.6.1, asyncio-0.19.0, typeguard-2.13.3
asyncio: mode=strict
collecting ... collected 2 items

test_duplicates.py::TestDuplicates::test_dataset_length PASSED           [ 50%]
test_duplicates.py::TestDuplicates::test_duplicate_count PASSED          [100%]



💯 You can commit your code:

git add tests/duplicates.pickle

git commit -m 'Completed duplicates step'

git push origin master



## (2) Missing data

❓ Print the percentage of missing values for every column of the dataframe. ❓

In [60]:
round(data.isnull().sum().sort_values(ascending=False) / len(data),2)

# len(data)

GarageFinish    0.06
RoofSurface     0.01
GrLivArea       0.00
BedroomAbvGr    0.00
KitchenAbvGr    0.00
OverallCond     0.00
CentralAir      0.00
ChimneyStyle    0.00
MoSold          0.00
SalePrice       0.00
dtype: float64

### `GarageFinish`

❓ **Questions** about `GarageFinish` ❓

Investigate the missing values in `GarageFinish`. Then, choose one of the following solutions:

1. Drop the column entirely
2. Impute the column median using `SimpleImputer` from Scikit-Learn
3. Preserve the NaNs and replace them with meaningful values

Make changes effective in the dataframe `data`.


<details>
    <summary>💡 Hint</summary>
    
ℹ️ According to the dataset description, the missing values in `GarageFinish` represent a house having no garage. They need to be encoded as such.
</details>

In [61]:
data['GarageFinish'].unique()

array(['RFn', 'Unf', 'Fin', nan], dtype=object)

In [62]:
import numpy as np

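# Assign the result back: inplace=True on a selected column may not update `data` in recent pandas versions
data['GarageFinish'] = data['GarageFinish'].fillna('NoGarage')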

In [63]:
data.GarageFinish.value_counts()

Unf         605
RFn         422
Fin         352
NoGarage     81
Name: GarageFinish, dtype: int64

### `RoofSurface`

❓ **Questions** about `RoofSurface` ❓

Investigate the missing values in `RoofSurface`. Then, choose one of the following solutions:

1. Drop the column entirely
2. Impute the column median using sklearn's `SimpleImputer`
3. Preserve the NaNs and replace them with meaningful values

Make changes effective in the dataframe `data`.


<details>
    <summary>💡 Hint</summary>
    
ℹ️ `RoofSurface` has a few missing values that can be imputed by the median value.
</details>
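
To make the median option concrete, here is a small sketch on a hypothetical column (the values are invented for illustration). `SimpleImputer(strategy="median")` learns the median of the non-missing values during `fit` and writes it into every NaN during `transform`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column (hypothetical values) with one missing entry
toy = pd.DataFrame({"RoofSurface": [800.0, 1200.0, np.nan, 2000.0]})

# SimpleImputer expects 2D input, hence the double brackets
toy_imputer = SimpleImputer(strategy="median")
toy_imputed = np.asarray(toy_imputer.fit_transform(toy[["RoofSurface"]]))

# statistics_ holds the value learned for each column (here, the median)
print(toy_imputer.statistics_)
print(toy_imputed.ravel())
```

The median of the three known values (800, 1200, 2000) is 1200, so that is the value written into the missing slot.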

In [64]:
data['RoofSurface'].isnull().sum()

9

In [65]:
from sklearn.impute import SimpleImputer

In [66]:
imputer = SimpleImputer(strategy="median")

In [67]:
imputer.fit(data[['RoofSurface']])

In [68]:
# Imputer object expects a 2D array
data['RoofSurface'] = imputer.transform(data[['RoofSurface']])

In [69]:
data['RoofSurface'].isnull().sum()

0

### `ChimneyStyle`

❓ **Questions** about `ChimneyStyle` ❓

Investigate the missing values in `ChimneyStyle`. Then, choose one of the following solutions:

1. Drop the column entirely
2. Impute the column median
3. Preserve the NaNs and replace them with meaningful values

Make changes effective in the dataframe `data`.


<details>
    <summary>💡 Hint</summary>
    
* ⚠️ Be careful: not all missing values are represented as `np.nan`, and pandas' `isnull()` only detects real missing values such as `np.nan`, not placeholder strings.
    
* ℹ️ `ChimneyStyle` has a lot of missing values. The description does not touch on what they represent. As such, it is better not to make any assumptions and to drop the column entirely.
    

</details>
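
The first point of the hint can be illustrated with a short sketch on hypothetical data: a placeholder string such as `"?"` is an ordinary value as far as pandas is concerned, so it only shows up in `isnull()` after it has been converted to `np.nan`.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) using "?" as a missing-value placeholder
toy = pd.Series(["bricks", "?", "?", "castiron", "?"], name="ChimneyStyle")

# isnull() sees no missing values because "?" is a regular string
missing_before = int(toy.isnull().sum())

# Convert the placeholder to a real NaN, then count again
toy_nan = toy.replace("?", np.nan)
missing_after = int(toy_nan.isnull().sum())

print(missing_before, missing_after)
print(missing_after / len(toy_nan))  # share of missing values
```

Checking `unique()` or `value_counts(dropna=False)` first is a simple way to spot such placeholders before counting NaNs.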

In [70]:
data['ChimneyStyle'].unique()

array(['bricks', 'castiron', '?'], dtype=object)

In [71]:
data.ChimneyStyle.value_counts(dropna=False)

?           1455
bricks         3
castiron       2
Name: ChimneyStyle, dtype: int64

In [72]:
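# Assign the result back: inplace=True on a selected column may not update `data` in recent pandas versions
data['ChimneyStyle'] = data['ChimneyStyle'].replace("?", np.nan)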

In [73]:
data.ChimneyStyle.isnull().sum() / len(data)

0.9965753424657534

In [74]:
data.drop(columns='ChimneyStyle', inplace=True)

🧪 **Test your code**

In [75]:
from nbresult import ChallengeResult

result = ChallengeResult('missing_values',
                         dataset = data
)

result.write()
print(result.check())


platform darwin -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /Users/bingobango/.pyenv/versions/tom/bin/python3
cachedir: .pytest_cache
rootdir: /Users/bingobango/code/lewagon/data-preprocessing-workflow/tests
plugins: anyio-3.6.1, asyncio-0.19.0, typeguard-2.13.3
asyncio: mode=strict
collecting ... collected 2 items

test_missing_values.py::TestMissing_values::test_nans PASSED             [ 50%]
test_missing_values.py::TestMissing_values::test_number_of_columns PASSED [100%]



💯 You can commit your code:

git add tests/missing_values.pickle

git commit -m 'Completed missing_values step'

git push origin master



❓ When you are done handling missing values, print out the percentage of missing values for the entire dataframe ❓

You should no longer have any missing values!

In [76]:
data.isnull().sum()/len(data)

GrLivArea       0.0
BedroomAbvGr    0.0
KitchenAbvGr    0.0
OverallCond     0.0
RoofSurface     0.0
GarageFinish    0.0
CentralAir      0.0
MoSold          0.0
SalePrice       0.0
dtype: float64

## (3) Scaling

**But first...**

To understand the effects of scaling and encoding on model performance, let's get a **base score without any data transformation**.

❓ Cross-validate a linear regression model that predicts `SalePrice` using the other features

⚠️ Note that a linear regression model can only handle numeric features. [DataFrame.select_dtypes](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html) can help.
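
As a quick illustration of `select_dtypes`, the sketch below builds a hypothetical three-column frame and keeps only its numeric columns. Passing `include="number"` is one way to do it; `exclude="object"` is another when the non-numeric columns are stored as objects.

```python
import pandas as pd

# Toy frame (hypothetical values) mixing numeric and text columns
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262],
    "CentralAir": ["Y", "N"],
    "SalePrice": [208500, 181500],
})

# Keep numeric columns only, then separate the features from the target
toy_numeric = toy.select_dtypes(include="number")
toy_X = toy_numeric.drop(columns="SalePrice")
toy_y = toy_numeric["SalePrice"]

print(list(toy_X.columns))
```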

In [77]:
data.dtypes

GrLivArea         int64
BedroomAbvGr      int64
KitchenAbvGr      int64
OverallCond       int64
RoofSurface     float64
GarageFinish     object
CentralAir       object
MoSold            int64
SalePrice         int64
dtype: object

In [78]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X = data.select_dtypes(exclude='object').drop('SalePrice', axis=1)
y = data.SalePrice

baseline_score = cross_val_score(LinearRegression(), X, y, cv=10).mean()
baseline_score

0.5776433985192075

Keep this score in mind! You will train a new model after data preprocessing in Challenge #2 - see if it improves your average score 😉

🚀 Now, back to **feature scaling**!

###  `RoofSurface` 

❓ **Question** about `RoofSurface` ❓

👇 Investigate `RoofSurface` for distribution and outliers. Then, choose the most appropriate scaling technique. Either:

1. Standard Scaler
2. Robust Scaler
3. MinMax Scaler

Replace the original columns with the transformed values.

In [79]:
data[['RoofSurface']].boxplot()
data[['RoofSurface']].hist(bins=20)

array([[<AxesSubplot:title={'center':'RoofSurface'}>]], dtype=object)

<details>
    <summary>💡 Hint</summary>
    
ℹ️ Since `RoofSurface` does not seem to have a normal distribution, it is better to MinMax scale.
</details>
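
For reference, MinMax scaling maps each value to $(x - x_{min}) / (x_{max} - x_{min})$, so the result always lies in $[0, 1]$. The sketch below checks this on hypothetical values by comparing a manual computation with `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy values (hypothetical), shaped as a single 2D column
toy_values = np.array([[500.0], [1000.0], [1500.0], [2500.0]])

# Manual MinMax scaling: (x - min) / (max - min)
toy_manual = (toy_values - toy_values.min()) / (toy_values.max() - toy_values.min())

# Same transformation with scikit-learn
toy_scaled = np.asarray(MinMaxScaler().fit_transform(toy_values))

print(toy_scaled.ravel())
```

Because the minimum and maximum define the output range, a single extreme value would compress all other points, which is worth keeping in mind when a feature has strong outliers.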

In [80]:
from sklearn.preprocessing import MinMaxScaler

In [81]:
scaler = MinMaxScaler()

data['RoofSurface'] = scaler.fit_transform(data[['RoofSurface']])
data.head()

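| | GrLivArea | BedroomAbvGr | KitchenAbvGr | OverallCond | RoofSurface | GarageFinish | CentralAir | MoSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1710 | 3 | 1 | 5 | 0.316729 | RFn | Y | 2 | 208500 |
| 1 | 1262 | 3 | 1 | 8 | 0.06965 | RFn | Y | 5 | 181500 |
| 2 | 1786 | 3 | 1 | 5 | 0.228124 | RFn | Y | 9 | 223500 |
| 3 | 1717 | 3 | 1 | 5 | 0.442583 | Unf | Y | 2 | 140000 |
| 4 | 2198 | 4 | 1 | 5 | 0.566894 | RFn | Y | 12 | 250000 |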


### `GrLivArea`

❓ **Question** about `GrLivArea` ❓

👇 Investigate `GrLivArea` for distribution and outliers. Then, choose the most appropriate scaling technique. Either:

1. Standard Scale
2. Robust Scale
3. MinMax Scale

Replace the original columns with the transformed values.

In [82]:
data[['GrLivArea']].boxplot()
data[['GrLivArea']].hist(bins=30)

array([[<AxesSubplot:title={'center':'GrLivArea'}>]], dtype=object)

<details>
    <summary>💡 Hint</summary>
    
ℹ️ `GrLivArea` has a normal distribution, and some outliers. It needs to be Robust scaled.
</details>
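
Robust scaling centers each value on the median and divides by the interquartile range (IQR), two statistics that outliers barely move. The sketch below, on hypothetical values with one extreme point, compares a manual computation with `RobustScaler` using its default quantile range.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy values (hypothetical) with one extreme outlier
toy_values = np.array([[1000.0], [1200.0], [1400.0], [1600.0], [9000.0]])

# Manual robust scaling: (x - median) / IQR
toy_median = np.median(toy_values)
toy_q1, toy_q3 = np.percentile(toy_values, [25, 75])
toy_manual = (toy_values - toy_median) / (toy_q3 - toy_q1)

# Same transformation with scikit-learn's default quantile range (25, 75)
toy_scaled = np.asarray(RobustScaler().fit_transform(toy_values))

print(toy_scaled.ravel())
```

The outlier still ends up far from the rest after scaling, but it does not change how the typical values are spread around zero.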

In [83]:
from sklearn.preprocessing import RobustScaler

In [84]:
Rscaler = RobustScaler()

In [85]:
data['GrLivArea'] = Rscaler.fit_transform(data[['GrLivArea']])

In [86]:
data.head()

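| | GrLivArea | BedroomAbvGr | KitchenAbvGr | OverallCond | RoofSurface | GarageFinish | CentralAir | MoSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38007 | 3 | 1 | 5 | 0.316729 | RFn | Y | 2 | 208500 |
| 1 | -0.31209 | 3 | 1 | 8 | 0.06965 | RFn | Y | 5 | 181500 |
| 2 | 0.497489 | 3 | 1 | 5 | 0.228124 | RFn | Y | 9 | 223500 |
| 3 | 0.390885 | 3 | 1 | 5 | 0.442583 | Unf | Y | 2 | 140000 |
| 4 | 1.134029 | 4 | 1 | 5 | 0.566894 | RFn | Y | 12 | 250000 |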


### `BedroomAbvGr`, `OverallCond` & `KitchenAbvGr`

❓ **Questions** about `BedroomAbvGr`, `OverallCond` & `KitchenAbvGr` ❓

👇 Investigate `BedroomAbvGr`, `OverallCond` & `KitchenAbvGr`. Then, choose one of the following scaling techniques:

1. MinMax Scaler
2. Standard Scaler
3. Robust Scaler

Replace the original columns with the transformed values.

<details>
    <summary>💡 Hint</summary>
    
ℹ️ `BedroomAbvGr`, `OverallCond` & `KitchenAbvGr` are ordinal features that should be MinMax scaled.
</details>

In [87]:
from sklearn.preprocessing import MinMaxScaler

In [88]:
scaler2 = MinMaxScaler()

In [89]:
ordinal_cols = ['BedroomAbvGr', 'OverallCond', 'KitchenAbvGr']
data[ordinal_cols] = scaler2.fit_transform(data[ordinal_cols])



In [36]:
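# Equivalent alternative: fit and transform one column at a time.
# Re-running it after the cell above leaves the values unchanged, since they already span [0, 1].
data['BedroomAbvGr'] = scaler2.fit_transform(data[['BedroomAbvGr']])
data['OverallCond'] = scaler2.fit_transform(data[['OverallCond']])
data['KitchenAbvGr'] = scaler2.fit_transform(data[['KitchenAbvGr']])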

In [90]:
data.head()

| | GrLivArea | BedroomAbvGr | KitchenAbvGr | OverallCond | RoofSurface | GarageFinish | CentralAir | MoSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38007 | 0.375 | 0.333333 | 0.5 | 0.316729 | RFn | Y | 2 | 208500 |
| 1 | -0.31209 | 0.375 | 0.333333 | 0.875 | 0.06965 | RFn | Y | 5 | 181500 |
| 2 | 0.497489 | 0.375 | 0.333333 | 0.5 | 0.228124 | RFn | Y | 9 | 223500 |
| 3 | 0.390885 | 0.375 | 0.333333 | 0.5 | 0.442583 | Unf | Y | 2 | 140000 |
| 4 | 1.134029 | 0.5 | 0.333333 | 0.5 | 0.566894 | RFn | Y | 12 | 250000 |


🧪 **Test your code**

In [38]:
from nbresult import ChallengeResult

result = ChallengeResult('scaling',
                         dataset = data
)

result.write()
print(result.check())


platform darwin -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /Users/bingobango/.pyenv/versions/tom/bin/python3
cachedir: .pytest_cache
rootdir: /Users/bingobango/code/lewagon/data-preprocessing-workflow/tests
plugins: anyio-3.6.1, asyncio-0.19.0, typeguard-2.13.3
asyncio: mode=strict
collecting ... collected 3 items

test_scaling.py::TestScaling::test_bedroom_kitchen_condition PASSED      [ 33%]
test_scaling.py::TestScaling::test_gr_liv_area PASSED                    [ 66%]
test_scaling.py::TestScaling::test_roof_surface PASSED                   [100%]



💯 You can commit your code:

git add tests/scaling.pickle

git commit -m 'Completed scaling step'

git push origin master



## (4) Feature Encoding

### `GarageFinish`

❓ **Question** about `GarageFinish`❓

👇 Investigate `GarageFinish` and choose one of the following encoding techniques accordingly:
- Ordinal encoding
- One-Hot encoding

Add the encoding to the dataframe as new column(s), and remove the original column.


<details>
    <summary>💡 Hint</summary>
        
ℹ️ `GarageFinish` is a multi-category feature that must be One-Hot encoded.
</details>
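
To show what One-Hot encoding produces, the sketch below uses `pd.get_dummies` on a hypothetical column. This is a swapped-in shortcut for illustration only, not the scikit-learn `OneHotEncoder` used in the exercise, but the resulting columns have the same meaning: one binary column per category.

```python
import pandas as pd

# Toy column (hypothetical values) with four garage categories
toy = pd.DataFrame({"GarageFinish": ["RFn", "Unf", "Fin", "NoGarage", "Unf"]})

# One binary column per category; dtype=float gives 0.0 / 1.0 values
toy_encoded = pd.get_dummies(toy["GarageFinish"], dtype=float)

# Each row has exactly one 1.0, in the column of its category
print(toy_encoded)
```

An encoder object remains preferable when the same mapping has to be reapplied to new data, because it remembers the categories seen during `fit`.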

In [39]:
data.GarageFinish.unique()

array(['RFn', 'Unf', 'Fin', 'NoGarage'], dtype=object)

In [40]:
from sklearn.preprocessing import OneHotEncoder

In [41]:
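# sparse=False returns a dense NumPy array instead of a SciPy sparse matrix.
# Sparse output saves memory when most entries are zero, but a dense array is easier to assign to columns.
# Note: scikit-learn 1.2 renamed this parameter to `sparse_output`; `sparse` was removed in 1.4.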
GarageFinish_ohe = OneHotEncoder(sparse=False)
GarageFinish_ohe.fit(data[['GarageFinish']])

In [42]:
GarageFinish_ohe.categories_

[array(['Fin', 'NoGarage', 'RFn', 'Unf'], dtype=object)]

In [92]:
data['Fin'], data['NoGarage'], data['RFn'], data['Unf'] = GarageFinish_ohe.fit_transform(data[['GarageFinish']]).T

In [44]:
data.drop(columns='GarageFinish', inplace=True)

In [45]:
data.head()

| | GrLivArea | BedroomAbvGr | KitchenAbvGr | OverallCond | RoofSurface | CentralAir | MoSold | SalePrice | Fin | NoGarage | RFn | Unf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38007 | 0.375 | 0.333333 | 0.5 | 0.316729 | Y | 2 | 208500 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | -0.31209 | 0.375 | 0.333333 | 0.875 | 0.06965 | Y | 5 | 181500 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0.497489 | 0.375 | 0.333333 | 0.5 | 0.228124 | Y | 9 | 223500 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.390885 | 0.375 | 0.333333 | 0.5 | 0.442583 | Y | 2 | 140000 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 1.134029 | 0.5 | 0.333333 | 0.5 | 0.566894 | Y | 12 | 250000 | 0.0 | 0.0 | 1.0 | 0.0 |


### `CentralAir`

❓ **Question** about `CentralAir`❓

Investigate `CentralAir` and choose one of the following encoding techniques accordingly:
- Ordinal encoding
- One-Hot encoding

Replace the original column with the newly generated encoded columns.


<details>
    <summary>💡 Hint</summary>
    
ℹ️ `CentralAir` is a binary categorical feature.
</details>
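
Because a binary feature needs only one column, a single 0/1 column is enough to encode it. As a hedged alternative to an encoder object, the sketch below applies a plain `Series.map` with an explicit dictionary to hypothetical values; this is a swapped-in technique for illustration, not the encoder-based approach the exercise tests.

```python
import pandas as pd

# Toy column (hypothetical values)
toy = pd.Series(["Y", "N", "Y", "Y"], name="CentralAir")

# Explicit mapping: N -> 0.0, Y -> 1.0
toy_binary = toy.map({"N": 0.0, "Y": 1.0})

print(toy_binary.tolist())
```

Writing the mapping out makes the direction of the encoding explicit, so there is no doubt about which category becomes 1.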

In [46]:
data.CentralAir.unique()

array(['Y', 'N'], dtype=object)

In [47]:
CentralAir_ohe = OneHotEncoder(sparse=False, drop='if_binary', categories=[['N', 'Y']])

In [48]:
data['CentralAir'] = CentralAir_ohe.fit_transform(data[['CentralAir']])

In [49]:
data.head()

| | GrLivArea | BedroomAbvGr | KitchenAbvGr | OverallCond | RoofSurface | CentralAir | MoSold | SalePrice | Fin | NoGarage | RFn | Unf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38007 | 0.375 | 0.333333 | 0.5 | 0.316729 | 1.0 | 2 | 208500 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | -0.31209 | 0.375 | 0.333333 | 0.875 | 0.06965 | 1.0 | 5 | 181500 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0.497489 | 0.375 | 0.333333 | 0.5 | 0.228124 | 1.0 | 9 | 223500 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.390885 | 0.375 | 0.333333 | 0.5 | 0.442583 | 1.0 | 2 | 140000 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 1.134029 | 0.5 | 0.333333 | 0.5 | 0.566894 | 1.0 | 12 | 250000 | 0.0 | 0.0 | 1.0 | 0.0 |


In [50]:
data['CentralAir'].unique()

array([1., 0.])

## (5) Feature Engineering

### `MoSold` - Cyclical engineering 

👨🏻‍🏫 A feature can be numerical (continuous or discrete), categorical or ordinal. But a feature can also be temporal (e.g. quarters, months, days, minutes, ...). 

Cyclical features like time need some specific preprocessing. Indeed, if you want any Machine Learning algorithm to capture this cyclicity, your cyclical features must be preprocessed in a certain way.

👉 Consider the feature `MoSold`, the month in which the house was sold.

In [51]:
data["MoSold"].unique()

array([ 2,  5,  9, 12, 10,  8, 11,  4,  1,  7,  3,  6])

In [52]:
data['MoSold'].value_counts()

6     253
7     234
5     204
4     141
8     122
3     106
10     89
11     79
9      63
12     59
1      58
2      52
Name: MoSold, dtype: int64

* Many houses were sold in June (6), July (7) and May (5) (Spring/Summer)
* Only a few houses were sold in December (12), January (1) and February (2) (~ Fall/Winter)
    * But for any Machine Learning model, there is no reason why December (12) and January (1) would be "close"...

👩🏻‍🏫 ***How to deal with cyclical features?***

1.  Look at the following illustration and read the explanations to distinguish two different months.

<img src="https://wagon-public-datasets.s3.amazonaws.com/data-science-images/05-ML/cyclical_features.png" alt="Cyclical features" width="1000" height="800">


2. Read this [article](https://ianlondon.github.io/blog/encoding-cyclical-features-24hour-time/) for more details.




❓ **Question** about `MoSold` ❓ 
- Create two new features `sin_MoSold` and `cos_MoSold` which correspond respectively to the sine and cosine of MoSold.
- Drop the original column `MoSold`

<details>
    <summary>💡 Hint</summary>
    
ℹ️ The circumference of a circle is $C = 2 \pi r$, so $C = 2 \pi$ when $r = 1$.
</details>
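
The idea behind the sine/cosine encoding can be checked numerically. The sketch below places the twelve months on the unit circle, with January at angle 0 (an assumption made here for illustration), and measures distances between months in `(sin, cos)` space. The helper `month_distance` is hypothetical and exists only for this example.

```python
import numpy as np

months_in_year = 12
months = np.arange(1, months_in_year + 1)  # 1 = January, ..., 12 = December

# Angle of each month on the unit circle (January at angle 0)
angles = 2 * np.pi * (months - 1) / months_in_year
points = np.column_stack([np.sin(angles), np.cos(angles)])

def month_distance(a, b):
    """Euclidean distance between months a and b in (sin, cos) space."""
    return float(np.linalg.norm(points[a - 1] - points[b - 1]))

# December-January is as close as January-February
print(month_distance(12, 1), month_distance(1, 2))
# Months half a year apart sit on opposite sides of the circle
print(month_distance(1, 7))
```

With the raw month numbers, December and January would be 11 units apart; on the circle they are neighbours, which is the cyclicity the two new features are meant to capture.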

In [55]:
test = (data.MoSold-1)
test.unique()

array([ 1,  4,  8, 11,  9,  7, 10,  3,  0,  6,  2,  5])

In [75]:
months_in_year = 12

data['sin_MoSold'] = np.sin(2*np.pi*(data.MoSold-1) / months_in_year)
data['cos_MoSold'] = np.cos(2*np.pi*(data.MoSold-1) / months_in_year)


In [76]:
data.head()

| | GrLivArea | BedroomAbvGr | KitchenAbvGr | OverallCond | RoofSurface | CentralAir | MoSold | SalePrice | Fin | NoGarage | RFn | Unf | sin_MoSold | cos_MoSold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38007 | 0.375 | 0.333333 | 0.5 | 0.316729 | 1.0 | 2 | 208500 | 0.0 | 0.0 | 1.0 | 0.0 | 0.5 | 0.866025 |
| 1 | -0.31209 | 0.375 | 0.333333 | 0.875 | 0.06965 | 1.0 | 5 | 181500 | 0.0 | 0.0 | 1.0 | 0.0 | 0.866025 | -0.5 |
| 2 | 0.497489 | 0.375 | 0.333333 | 0.5 | 0.228124 | 1.0 | 9 | 223500 | 0.0 | 0.0 | 1.0 | 0.0 | -0.866025 | -0.5 |
| 3 | 0.390885 | 0.375 | 0.333333 | 0.5 | 0.442583 | 1.0 | 2 | 140000 | 0.0 | 0.0 | 0.0 | 1.0 | 0.5 | 0.866025 |
| 4 | 1.134029 | 0.5 | 0.333333 | 0.5 | 0.566894 | 1.0 | 12 | 250000 | 0.0 | 0.0 | 1.0 | 0.0 | -0.5 | 0.866025 |


In [79]:
data.sample(50).plot.scatter('sin_MoSold','cos_MoSold').set_aspect('equal')

In [80]:
data.drop(columns='MoSold', inplace=True)

🧪 **Test your code**

In [81]:
from nbresult import ChallengeResult

result = ChallengeResult('encoding', dataset = data, new_features = ['sin_MoSold', 'cos_MoSold'])

result.write()
print(result.check())


platform darwin -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /Users/bingobango/.pyenv/versions/tom/bin/python3
cachedir: .pytest_cache
rootdir: /Users/bingobango/code/lewagon/data-preprocessing-workflow/tests
plugins: anyio-3.6.1, asyncio-0.19.0
asyncio: mode=strict
collecting ... collected 4 items

test_encoding.py::TestEncoding::test_central_air PASSED                  [ 25%]
test_encoding.py::TestEncoding::test_columns PASSED                      [ 50%]
test_encoding.py::TestEncoding::test_month_sold_features PASSED          [ 75%]
test_encoding.py::TestEncoding::test_month_sold_features_number PASSED   [100%]



💯 You can commit your code:

git add tests/encoding.pickle

git commit -m 'Completed encoding step'

git push origin master



## (6) Export the preprocessed dataset

👇 Now that the dataset has been preprocessed, execute the code below to export it. You will keep working on it in the next exercise.

In [55]:
data.to_csv("data/clean_dataset.csv", index=False)

🏁 Congratulations! Now you know how to ***preprocess a dataset***!

💾 Don't forget to git add/commit/push your notebook...

🚀 ... and move on to the next challenge!