# Preprocessing Workflow


🎯 This exercise will guide you through the preprocessing workflow. Step by step, feature by feature, you will investigate the dataset and make preprocessing decisions accordingly.

🌤 We stored the `ML_Houses_dataset.csv` [here](https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Houses_dataset.csv) in the cloud.

👇 Run the code down below to load the dataset and features you will be working with.

In [10]:
import pandas as pd

# Loading the dataset
url = "https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Houses_dataset.csv"
data = pd.read_csv(url)

# Selecting some columns of interest
selected_features = ['GrLivArea',
                     'BedroomAbvGr',
                     'KitchenAbvGr', 
                     'OverallCond',
                     'RoofSurface',
                     'GarageFinish',
                     'CentralAir',
                     'ChimneyStyle',
                     'MoSold',
                     'SalePrice']

# Overwriting the "data" variable to keep only the columns of interest
# Notice the .copy() to copy the values 
data = data[selected_features].copy()

# Showing the first five rows
data.head()

Unnamed: 0,GrLivArea,BedroomAbvGr,KitchenAbvGr,OverallCond,RoofSurface,GarageFinish,CentralAir,ChimneyStyle,MoSold,SalePrice
0,1710,3,1,5,1995.0,RFn,Y,bricks,2,208500
1,1262,3,1,8,874.0,RFn,Y,bricks,5,181500
2,1786,3,1,5,1593.0,RFn,Y,castiron,9,223500
3,1717,3,1,5,2566.0,Unf,Y,castiron,2,140000
4,2198,4,1,5,3130.0,RFn,Y,bricks,12,250000


📚 Take the time to do a ***preliminary investigation*** of the features by reading the ***dataset description*** available [here](https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Houses_dataset_description.txt). Make sure to refer to it throughout the day.

## (1) Duplicates

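ℹ️ ***Duplicates in datasets cause data leakage:*** the same row can end up in both the training set and the test set, which inflates the evaluation score.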

👉 It is important to locate and remove duplicates.

❓ How many duplicated rows are there in the dataset ❓

<i>Save your answer under variable name `duplicate_count`.</i>

In [11]:
len(data) # Check number of rows before removing duplicates

1760

In [12]:
data.duplicated().head() # Check whether a row is a duplicated version of a previous row

0    False
1    False
2    False
3    False
4    False
dtype: bool

In [13]:
duplicate_count = data.duplicated().sum() # Compute the number of duplicated rows
duplicate_count

300

❓ Remove the duplicates from the dataset. Overwrite the dataframe `data` ❓

In [14]:
data = data.drop_duplicates() # Remove duplicates
len(data) # Check the new number of rows

1460
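
To see why duplicates leak information, here is a minimal sketch on a made-up toy dataframe. The `toy` data and the `count_leaked_rows` helper are illustrative assumptions, not part of the exercise. Every row is repeated four times, so any 3-row test set necessarily contains rows that also sit in the training set.

```python
import pandas as pd

# Hypothetical toy data: 4 distinct houses, each repeated 4 times
toy = pd.DataFrame({
    "area": [50, 60, 70, 80] * 4,
    "price": [100, 120, 140, 160] * 4,
})

def count_leaked_rows(df, n_test=3, seed=0):
    """Count test rows that also appear (as exact copies) in the train set."""
    test = df.sample(n=n_test, random_state=seed)  # hold out n_test rows
    train = df.drop(test.index)                    # everything else is the train set
    # An inner merge on all columns keeps only test rows that have a twin in train
    return len(test.merge(train.drop_duplicates(), how="inner"))

leaks_with_duplicates = count_leaked_rows(toy)
leaks_without_duplicates = count_leaked_rows(toy.drop_duplicates())

print(leaks_with_duplicates, leaks_without_duplicates)  # prints: 3 0
```

With duplicates, all three test rows have already been "seen" during training. Once the duplicates are dropped, none of them has.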

🧪 **Test your code**

In [15]:
from nbresult import ChallengeResult

result = ChallengeResult('duplicates',
                         duplicates = duplicate_count,
                         dataset = data
)

result.write()
print(result.check())


platform darwin -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /Users/gulecs/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/gulecs/code/gulecsec/data-preprocessing-workflow/tests
plugins: anyio-3.6.1, dash-2.7.0, asyncio-0.19.0
asyncio: mode=strict
collecting ... collected 2 items

test_duplicates.py::TestDuplicates::test_dataset_length PASSED           [ 50%]
test_duplicates.py::TestDuplicates::test_duplicate_count PASSED          [100%]



💯 You can commit your code:

git add tests/duplicates.pickle

git commit -m 'Completed duplicates step'

git push origin master



## (2) Missing data

❓ Print the percentage of missing values for every column of the dataframe. ❓

In [16]:
data.isnull().sum().sort_values(ascending=False) #NaN count for each column

GarageFinish    81
RoofSurface      9
GrLivArea        0
BedroomAbvGr     0
KitchenAbvGr     0
OverallCond      0
CentralAir       0
ChimneyStyle     0
MoSold           0
SalePrice        0
dtype: int64

In [17]:
data.isnull().sum().sort_values(ascending=False)/len(data) # Fraction of NaN for each column (multiply by 100 to get a percentage)

GarageFinish    0.055479
RoofSurface     0.006164
GrLivArea       0.000000
BedroomAbvGr    0.000000
KitchenAbvGr    0.000000
OverallCond     0.000000
CentralAir      0.000000
ChimneyStyle    0.000000
MoSold          0.000000
SalePrice       0.000000
dtype: float64
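
As a side note, the same figures can be obtained more compactly, because the mean of a boolean column is the share of `True` values. The sketch below uses a made-up toy dataframe (`toy` and its columns are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe with a few missing values
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": ["x", "y", None, None],
    "c": [1, 2, 3, 4],
})

# isnull() returns booleans; their mean is the fraction of missing values per column
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Here `b` is 50% missing, `a` is 25% missing and `c` has no missing values.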

### `GarageFinish`

❓ **Questions** about `GarageFinish` ❓

Investigate the missing values in `GarageFinish`. Then, choose one of the following solutions:

1. Drop the column entirely
2. Impute the column median using `SimpleImputer` from Scikit-Learn
3. Preserve the NaNs and replace them with meaningful values

Make changes effective in the dataframe `data`.


<details>
    <summary>💡 <i>Hint</i></summary>
    
ℹ️ According to the dataset description, the missing values in `GarageFinish` represent a house having no garage. They need to be encoded as such.
</details>

In [18]:
(data.GarageFinish.isnull().sum()/len(data))*100 # Percentage of missing values in GarageFinish

5.5479452054794525

In [19]:
data["GarageFinish"].value_counts()

Unf    605
RFn    422
Fin    352
Name: GarageFinish, dtype: int64

In [20]:
data["GarageFinish"].isnull().sum()

81

In [21]:
data["GarageFinish"].isnull().value_counts()

False    1379
True       81
Name: GarageFinish, dtype: int64

In [22]:
import numpy as np

data["GarageFinish"] = data["GarageFinish"].fillna("NG") # Replace NaN with "NG" (No Garage)

data["GarageFinish"].value_counts() # Check the count of each category


Unf    605
RFn    422
Fin    352
NG      81
Name: GarageFinish, dtype: int64

In [23]:
data.head()

Unnamed: 0,GrLivArea,BedroomAbvGr,KitchenAbvGr,OverallCond,RoofSurface,GarageFinish,CentralAir,ChimneyStyle,MoSold,SalePrice
0,1710,3,1,5,1995.0,RFn,Y,bricks,2,208500
1,1262,3,1,8,874.0,RFn,Y,bricks,5,181500
2,1786,3,1,5,1593.0,RFn,Y,castiron,9,223500
3,1717,3,1,5,2566.0,Unf,Y,castiron,2,140000
4,2198,4,1,5,3130.0,RFn,Y,bricks,12,250000


### `RoofSurface`

❓ **Questions** about `RoofSurface` ❓

Investigate the missing values in `RoofSurface`. Then, choose one of the following solutions:

1. Drop the column entirely
2. Impute the column median using sklearn's `SimpleImputer`
3. Preserve the NaNs and replace them with meaningful values

Make changes effective in the dataframe `data`.


<details>
    <summary>💡 <i>Hint</i></summary>
    
ℹ️ `RoofSurface` has a few missing values that can be imputed by the median value.
</details>

In [24]:
(data.RoofSurface.isnull().sum()/len(data))*100 # Percentage of missing values in RoofSurface

0.6164383561643836

In [25]:
data["RoofSurface"].value_counts().head()

3817.0    5
2420.0    3
2814.0    3
3349.0    3
5016.0    3
Name: RoofSurface, dtype: int64

In [26]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median") # Instantiate a SimpleImputer object with your strategy of choice

imputer.fit(data[['RoofSurface']]) # Call the "fit" method on the object

data['RoofSurface'] = imputer.transform(data[['RoofSurface']]) # Call the "transform" method on the object

imputer.statistics_ # The median is stored in the transformer's memory

array([2906.])
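
To illustrate why the median is often a safer choice than the mean for imputation, here is a small sketch on made-up values (the `toy` dataframe is an assumption, not data from the exercise). One extreme value pulls the mean far away from the bulk of the data, while the median barely moves.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical surfaces: one missing value and one extreme value (1000)
toy = pd.DataFrame({"surface": [10.0, 20.0, np.nan, 30.0, 1000.0]})

imputer = SimpleImputer(strategy="median")
# fit_transform learns the median and fills the NaN in a single call
toy["surface_imputed"] = imputer.fit_transform(toy[["surface"]]).ravel()

learned_median = imputer.statistics_[0]   # median of the non-missing values
naive_mean = toy["surface"].mean()        # mean of the non-missing values, for comparison
print(learned_median, naive_mean)
```

The missing entry is filled with 25.0 (the median), whereas mean imputation would have used 265.0.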

### `ChimneyStyle`

❓ **Questions** about `ChimneyStyle` ❓

Investigate the missing values in `ChimneyStyle`. Then, choose one of the following solutions:

1. Drop the column entirely
2. Impute the column median
3. Preserve the NaNs and replace them with meaningful values

Make changes effective in the dataframe `data`.


<details>
    <summary>💡 <i>Hint</i></summary>
    
* ⚠️ Be careful: not all missing values are represented as `np.nan`, and pandas' `isnull()` only detects NaN-like values (`np.nan`, `None`)...
    
* ℹ️ `ChimneyStyle` has a lot of missing values. The description does not touch on what they represent. As such, it is better not to make any assumptions and to drop the column entirely.
    

</details>
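
The first point of the hint can be sketched as follows. The toy column below is made up, and it assumes that `"?"` is the only placeholder used for missing values. Converting the placeholder to `np.nan` makes the missing values visible to `isnull()`.

```python
import numpy as np
import pandas as pd

# Hypothetical column where missing values are written as "?"
toy = pd.DataFrame({"chimney": ["?", "bricks", "?", "?", "castiron"]})

pct_before = toy["chimney"].isnull().mean() * 100   # "?" is not detected as missing

# Assumption: "?" is the only missing-value placeholder in this column
cleaned = toy["chimney"].replace("?", np.nan)
pct_after = cleaned.isnull().mean() * 100

print(pct_before, pct_after)
```

Before the replacement the column looks complete, afterwards 3 of the 5 values show up as missing.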

In [27]:
(data.ChimneyStyle.isnull().sum()/len(data))*100 # Percentage of missing values in ChimneyStyle

0.0

In [28]:
data["ChimneyStyle"].value_counts()

?           1455
bricks         3
castiron       2
Name: ChimneyStyle, dtype: int64

In [29]:
data.drop(columns='ChimneyStyle', inplace=True) # Drop ChimneyStyle column 

data.head()

Unnamed: 0,GrLivArea,BedroomAbvGr,KitchenAbvGr,OverallCond,RoofSurface,GarageFinish,CentralAir,MoSold,SalePrice
0,1710,3,1,5,1995.0,RFn,Y,2,208500
1,1262,3,1,8,874.0,RFn,Y,5,181500
2,1786,3,1,5,1593.0,RFn,Y,9,223500
3,1717,3,1,5,2566.0,Unf,Y,2,140000
4,2198,4,1,5,3130.0,RFn,Y,12,250000


🧪 **Test your code**

In [30]:
from nbresult import ChallengeResult

result = ChallengeResult('missing_values',
                         dataset = data
)

result.write()
print(result.check())


platform darwin -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /Users/gulecs/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/gulecs/code/gulecsec/data-preprocessing-workflow/tests
plugins: anyio-3.6.1, dash-2.7.0, asyncio-0.19.0
asyncio: mode=strict
collecting ... collected 2 items

test_missing_values.py::TestMissing_values::test_nans PASSED             [ 50%]
test_missing_values.py::TestMissing_values::test_number_of_columns PASSED [100%]



💯 You can commit your code:

git add tests/missing_values.pickle

git commit -m 'Completed missing_values step'

git push origin master



❓ When you are done with handling missing values, print out the percentage of missing values for the entire dataframe ❓

You should no longer have missing values!

In [31]:
(data.isnull().sum()/len(data))*100 # Percentage of missing values in entire dataframe

GrLivArea       0.0
BedroomAbvGr    0.0
KitchenAbvGr    0.0
OverallCond     0.0
RoofSurface     0.0
GarageFinish    0.0
CentralAir      0.0
MoSold          0.0
SalePrice       0.0
dtype: float64

## (3) Scaling

**First of all, before scaling...**

To understand the effects of scaling and encoding on model performance, let's get a **base score without any data transformation**.

❓ Cross-validate a linear regression model that predicts `SalePrice` using the other features ❓

⚠️ Note that a linear regression model can only handle numeric features. [DataFrame.select_dtypes](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html) can help.

In [32]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X = data[["GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "OverallCond", "RoofSurface","MoSold"]]
y = data["SalePrice"]

# Instantiate model
model = LinearRegression()

# 5-Fold Cross validate model
cv_results = cross_validate(model, X, y, cv=5)

# Scores
cv_results['test_score']

# Mean of scores
cv_results['test_score'].mean()

0.5726603017210621

Keep this score in mind! You will train a new model after data preprocessing in Challenge #2 - see if it improves your average score 😉
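
The `select_dtypes` tip mentioned above can be sketched like this, on a made-up three-row dataframe (the values, including the `"N"` entry, are illustrative assumptions). It avoids typing the list of numeric columns by hand.

```python
import pandas as pd

# Hypothetical miniature version of the dataset
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786],
    "CentralAir": ["Y", "Y", "N"],
    "SalePrice": [208500, 181500, 223500],
})

# Keep the numeric columns only, then separate the target from the features
numeric = toy.select_dtypes(include="number")
X = numeric.drop(columns="SalePrice")
y = numeric["SalePrice"]

print(list(X.columns))
```

The text column `CentralAir` is left out automatically, so `X` can be passed to a linear regression as is.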

🚀 Now, back to **feature scaling**!

###  `RoofSurface` 

❓ **Question** about `RoofSurface` ❓

👇 Investigate `RoofSurface` for distribution and outliers. Then, choose the most appropriate scaling technique. Either:

1. Standard Scaler
2. Robust Scaler
3. MinMax Scaler

Replace the original columns with the transformed values.

In [33]:
data["RoofSurface"]

0       1995.0
1        874.0
2       1593.0
3       2566.0
4       3130.0
         ...  
1455    1698.0
1456    2645.0
1457     722.0
1458    3501.0
1459    3082.0
Name: RoofSurface, Length: 1460, dtype: float64

In [34]:
from sklearn.preprocessing import MinMaxScaler

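# Define the min-max scaler
scaler = MinMaxScaler()

# Fit the scaler (the name `model` is not reused, since it already holds the LinearRegression)
scaler.fit(data[["RoofSurface"]])

# Transform the data
data["RoofSurface"] = scaler.transform(data[["RoofSurface"]])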
data["RoofSurface"].head()

0    0.316729
1    0.069650
2    0.228124
3    0.442583
4    0.566894
Name: RoofSurface, dtype: float64

<details>
    <summary>💡 <i>Hint</i></summary>
    
ℹ️ Since `RoofSurface` has neither a Gaussian distribution, nor outliers $\rightarrow$ MinMaxScaler.
</details>
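
One possible way to check for outliers before picking a scaler is sketched below. It assumes the common boxplot convention that a value is an outlier when it lies more than 1.5 × IQR beyond the quartiles. The `outlier_share` helper and the synthetic series are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

def outlier_share(series):
    """Share of values outside the 1.5 * IQR whiskers of a boxplot."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
    return outside.mean()

# Synthetic examples: a flat distribution, then the same one with two extreme values
uniform_like = pd.Series(np.linspace(0, 100, 101))
with_extremes = pd.concat(
    [uniform_like, pd.Series([1000.0, 2000.0])], ignore_index=True
)

print(outlier_share(uniform_like), outlier_share(with_extremes))
```

A share of zero on a non-Gaussian feature points towards `MinMaxScaler`, while a noticeable share of outliers points towards `RobustScaler`. A histogram or boxplot remains the most direct way to confirm the shape of the distribution.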

### `GrLivArea`

❓ **Question** about `GrLivArea` ❓

👇 Investigate `GrLivArea` for distribution and outliers. Then, choose the most appropriate scaling technique. Either:

1. Standard Scaler
2. Robust Scaler
3. MinMax Scaler

Replace the original columns with the transformed values.

In [35]:
from sklearn.preprocessing import RobustScaler

r_scaler = RobustScaler() # Instantiate Robust Scaler

r_scaler.fit(data[['GrLivArea']]) # Fit scaler to feature

data['GrLivArea'] = r_scaler.transform(data[['GrLivArea']]) # Scale

data.head()

Unnamed: 0,GrLivArea,BedroomAbvGr,KitchenAbvGr,OverallCond,RoofSurface,GarageFinish,CentralAir,MoSold,SalePrice
0,0.38007,3,1,5,0.316729,RFn,Y,2,208500
1,-0.31209,3,1,8,0.06965,RFn,Y,5,181500
2,0.497489,3,1,5,0.228124,RFn,Y,9,223500
3,0.390885,3,1,5,0.442583,Unf,Y,2,140000
4,1.134029,4,1,5,0.566894,RFn,Y,12,250000


<details>
    <summary>💡 <i>Hint</i></summary>
    
ℹ️ `GrLivArea` has many outliers $\rightarrow$ RobustScaler()
</details>
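
For intuition about what `RobustScaler` computes with its default settings, the sketch below reproduces it by hand on made-up numbers: each value has the median subtracted and is then divided by the interquartile range (IQR). The array `x` is an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical feature with one clear outlier (100)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Manual robust scaling: (x - median) / IQR
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
manual = (x - median) / (q3 - q1)

# Same transformation with scikit-learn's default RobustScaler
scaled = RobustScaler().fit_transform(x)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

Because the median and the IQR ignore the tails, the outlier does not squash the other values together, which is what would happen with a min-max scaling of the same data.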

### `BedroomAbvGr` ,  `OverallCond` & `KitchenAbvGr`

❓ **Questions** about `BedroomAbvGr`, `OverallCond` & `KitchenAbvGr` ❓

👇 Investigate `BedroomAbvGr`, `OverallCond` & `KitchenAbvGr`. Then, choose one of the following scaling techniques:

1. MinMax Scaler
2. Standard Scaler
3. Robust Scaler

Replace the original columns with the transformed values.

<details>
    <summary>💡 <i>Hint</i></summary>
    
ℹ️ `BedroomAbvGr`, `OverallCond` & `KitchenAbvGr` are ordinal features. Fewer than 0.1% of their values are outliers, so there is no need to use _RobustScaler()_. Their distributions are not Gaussian, which rules out _StandardScaler()_. By elimination, you can confidently choose _MinMaxScaler()_.
</details>

In [36]:
data.describe()

Unnamed: 0,GrLivArea,BedroomAbvGr,KitchenAbvGr,OverallCond,RoofSurface,MoSold,SalePrice
count,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,0.07841,2.866438,1.046575,5.575342,0.508148,6.321918,180921.19589
std,0.813952,0.815778,0.220338,1.112799,0.291583,2.703626,79442.502883
min,-2.263422,0.0,0.0,1.0,0.0,1.0,34900.0
25%,-0.516802,2.0,1.0,5.0,0.246143,5.0,129975.0
50%,0.0,3.0,1.0,5.0,0.517523,6.0,163000.0
75%,0.483198,3.0,1.0,6.0,0.761406,8.0,214000.0
max,6.455002,8.0,3.0,9.0,1.0,12.0,755000.0


In [37]:
from sklearn.preprocessing import MinMaxScaler

minmaxscaler_2 = MinMaxScaler()

ordinal_features = ['BedroomAbvGr', 'OverallCond', 'KitchenAbvGr']

data[ordinal_features] = minmaxscaler_2.fit_transform(data[ordinal_features]) # Scale the three columns at once

data.head()


Unnamed: 0,GrLivArea,BedroomAbvGr,KitchenAbvGr,OverallCond,RoofSurface,GarageFinish,CentralAir,MoSold,SalePrice
0,0.38007,0.375,0.333333,0.5,0.316729,RFn,Y,2,208500
1,-0.31209,0.375,0.333333,0.875,0.06965,RFn,Y,5,181500
2,0.497489,0.375,0.333333,0.5,0.228124,RFn,Y,9,223500
3,0.390885,0.375,0.333333,0.5,0.442583,Unf,Y,2,140000
4,1.134029,0.5,0.333333,0.5,0.566894,RFn,Y,12,250000


🧪 **Test your code**

In [38]:
from nbresult import ChallengeResult

result = ChallengeResult('scaling',
                         dataset = data
)

result.write()
print(result.check())


platform darwin -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /Users/gulecs/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/gulecs/code/gulecsec/data-preprocessing-workflow/tests
plugins: anyio-3.6.1, dash-2.7.0, asyncio-0.19.0
asyncio: mode=strict
collecting ... collected 3 items

test_scaling.py::TestScaling::test_bedroom_kitchen_condition PASSED      [ 33%]
test_scaling.py::TestScaling::test_gr_liv_area PASSED                    [ 66%]
test_scaling.py::TestScaling::test_roof_surface PASSED                   [100%]



💯 You can commit your code:

git add tests/scaling.pickle

git commit -m 'Completed scaling step'

git push origin master



## (4) Feature Encoding

### `GarageFinish`

❓ **Question** about `GarageFinish`❓

👇 Investigate `GarageFinish` and choose one of the following encoding techniques accordingly:
- Ordinal encoding
- One-Hot encoding

Add the encoding to the dataframe as new column(s), and remove the original column.


<details>
    <summary>💡 <i>Hint</i></summary>
        
ℹ️ `GarageFinish` is a multicategorical feature that must be One-hot-encoded.
</details>

In [39]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

data.GarageFinish.unique()  # Check unique values for GarageFinish

ohe = OneHotEncoder(sparse = False) # Instantiate encoder (scikit-learn >= 1.2 renames this argument to `sparse_output`)

ohe.fit(data[['GarageFinish']]) # Fit encoder

garage_encoded = ohe.transform(data[['GarageFinish']]) # Encode GarageFinish

data[ohe.categories_[0]] = garage_encoded # categories_ stores the order of encoded column names

data.head()


Unnamed: 0,GrLivArea,BedroomAbvGr,KitchenAbvGr,OverallCond,RoofSurface,GarageFinish,CentralAir,MoSold,SalePrice,Fin,NG,RFn,Unf
0,0.38007,0.375,0.333333,0.5,0.316729,RFn,Y,2,208500,0.0,0.0,1.0,0.0
1,-0.31209,0.375,0.333333,0.875,0.06965,RFn,Y,5,181500,0.0,0.0,1.0,0.0
2,0.497489,0.375,0.333333,0.5,0.228124,RFn,Y,9,223500,0.0,0.0,1.0,0.0
3,0.390885,0.375,0.333333,0.5,0.442583,Unf,Y,2,140000,0.0,0.0,0.0,1.0
4,1.134029,0.5,0.333333,0.5,0.566894,RFn,Y,12,250000,0.0,0.0,1.0,0.0


### Encoding  `CentralAir`

❓ **Question** about `CentralAir`❓

Investigate `CentralAir` and choose one of the following encoding techniques accordingly:
- Ordinal encoding
- One-Hot encoding

Replace the original column with the newly generated encoded columns.


<details>
    <summary>💡 <i>Hint</i></summary>
    
ℹ️ `CentralAir` is a binary categorical feature.
</details>

In [40]:
from sklearn.preprocessing import OneHotEncoder

data.CentralAir.unique() # Check unique values for CentralAir

ohe = OneHotEncoder(drop='if_binary', sparse = False) # Instantiate encoder for binary feature (scikit-learn >= 1.2 renames `sparse` to `sparse_output`)

ohe.fit(data[['CentralAir']]) # Fit encoder

data['CentralAir'] = ohe.transform(data[['CentralAir']]) # Encode CentralAir

data.head()


Unnamed: 0,GrLivArea,BedroomAbvGr,KitchenAbvGr,OverallCond,RoofSurface,GarageFinish,CentralAir,MoSold,SalePrice,Fin,NG,RFn,Unf
0,0.38007,0.375,0.333333,0.5,0.316729,RFn,1.0,2,208500,0.0,0.0,1.0,0.0
1,-0.31209,0.375,0.333333,0.875,0.06965,RFn,1.0,5,181500,0.0,0.0,1.0,0.0
2,0.497489,0.375,0.333333,0.5,0.228124,RFn,1.0,9,223500,0.0,0.0,1.0,0.0
3,0.390885,0.375,0.333333,0.5,0.442583,Unf,1.0,2,140000,0.0,0.0,0.0,1.0
4,1.134029,0.5,0.333333,0.5,0.566894,RFn,1.0,12,250000,0.0,0.0,1.0,0.0


## (5) Feature Engineering

### `MoSold` - Cyclical engineering 

👨🏻‍🏫 A feature can be numerical (continuous or discrete), categorical or ordinal. But a feature can also be temporal (e.g. quarters, months, days, minutes, ...). 

Cyclical features like time need some specific preprocessing. Indeed, if you want any Machine Learning algorithm to capture this cyclicity, your cyclical features must be preprocessed in a certain way.

👉 Consider the feature `MoSold`, the month in which the house was sold.

In [41]:
data["MoSold"].value_counts()

6     253
7     234
5     204
4     141
8     122
3     106
10     89
11     79
9      63
12     59
1      58
2      52
Name: MoSold, dtype: int64

* Many houses were sold in June (6), July (7) and May (5) (Spring/Summer)
* Only a few houses were sold in December (12), January (1) and February (2) (Winter)
    * But as plain integers, December (12) and January (1) are 11 units apart, so a Machine Learning model has no way to know that they are neighbouring months...

👩🏻‍🏫 ***How to deal with cyclical features?***

1.  Look at the following illustration and read the explanations to distinguish two different months.

<img src="https://wagon-public-datasets.s3.amazonaws.com/data-science-images/05-ML/cyclical_features.png" alt="Cyclical features" width="1000" height="800">


2. Read this [article](https://ianlondon.github.io/blog/encoding-cyclical-features-24hour-time/) for more details.




❓ **Question** about `MoSold` ❓ 
- Create two new features `sin_MoSold` and `cos_MoSold` which correspond respectively to the sine and cosine of `MoSold` once the month is mapped to an angle on the unit circle ($2 \pi \times \text{MoSold} / 12$).
- Drop the original column `MoSold`

<details>
    <summary>💡 <i>Hint</i></summary>
    
ℹ️ The perimeter of a circle is $C = 2 \pi r = 2 \pi$ (assuming that $r = 1$ for the sake of simplicity).
</details>
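
The idea behind the hint can be sketched as follows: twelve months cover one full turn of the unit circle, so month $m$ sits at the angle $2 \pi m / 12$. The `month_distance` helper is a hypothetical name introduced for this illustration only.

```python
import numpy as np

months = np.arange(1, 13)
angle = 2 * np.pi * months / 12   # position of each month on the unit circle
sin_month, cos_month = np.sin(angle), np.cos(angle)

def month_distance(m1, m2):
    """Euclidean distance between two months in (sin, cos) space."""
    i, j = m1 - 1, m2 - 1
    return np.hypot(sin_month[i] - sin_month[j], cos_month[i] - cos_month[j])

# December-January, January-February, January-July
print(month_distance(12, 1), month_distance(1, 2), month_distance(1, 7))
```

With this encoding, December is exactly as close to January as January is to February, and months half a year apart (January and July) are the farthest from each other.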

In [48]:
# Map each month to an angle on the unit circle (12 months = one full turn of 2π)
month_angle = 2 * np.pi * data["MoSold"] / 12

data["sin_MoSold"] = np.sin(month_angle).round(6) # Rounding to 6 decimals only trims floating-point noise

In [49]:
data["cos_MoSold"] = np.cos(month_angle).round(6)

In [54]:
data.drop(columns=['MoSold', 'GarageFinish'], inplace=True) # Drop MoSold (now sin/cos encoded) and GarageFinish (now one-hot encoded)

In [55]:
data.head()

Unnamed: 0,GrLivArea,BedroomAbvGr,KitchenAbvGr,OverallCond,RoofSurface,CentralAir,SalePrice,Fin,NG,RFn,Unf,sin_MoSold,cos_MoSold
0,0.38007,0.375,0.333333,0.5,0.316729,1.0,208500,0.0,0.0,1.0,0.0,0.866025,0.5
1,-0.31209,0.375,0.333333,0.875,0.06965,1.0,181500,0.0,0.0,1.0,0.0,0.5,-0.866025
2,0.497489,0.375,0.333333,0.5,0.228124,1.0,223500,0.0,0.0,1.0,0.0,-1.0,-0.0
3,0.390885,0.375,0.333333,0.5,0.442583,1.0,140000,0.0,0.0,0.0,1.0,0.866025,0.5
4,1.134029,0.5,0.333333,0.5,0.566894,1.0,250000,0.0,0.0,1.0,0.0,-0.0,1.0


🧪 **Test your code**

In [56]:
from nbresult import ChallengeResult

result = ChallengeResult('encoding', dataset = data, new_features = ['sin_MoSold', 'cos_MoSold'])

result.write()
print(result.check())


platform darwin -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /Users/gulecs/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/gulecs/code/gulecsec/data-preprocessing-workflow/tests
plugins: anyio-3.6.1, dash-2.7.0, asyncio-0.19.0
asyncio: mode=strict
collecting ... collected 4 items

test_encoding.py::TestEncoding::test_central_air PASSED                  [ 25%]
test_encoding.py::TestEncoding::test_columns PASSED                      [ 50%]
test_encoding.py::TestEncoding::test_month_sold_features PASSED          [ 75%]
test_encoding.py::TestEncoding::test_month_sold_features_number PASSED   [100%]



💯 You can commit your code:

git add tests/encoding.pickle

git commit -m 'Completed encoding step'

git push origin master



## (6) Export the preprocessed dataset

👇 Now that the dataset has been preprocessed, execute the code below to export it. You will keep working on it in the next exercise.

In [57]:
data.to_csv("data/clean_dataset.csv", index=False)

🏁 Congratulations! Now, you know how to ***preprocess a dataset***!

💾 Don't forget to git add/commit/push your notebook...

🚀 ... and move on to the next challenge!