# Project

### Deadline: 22th of October

---

## Scope & Ground Rules


### Part 1 - 50% | Scope

It's composed for `5` small assignments with guidelines. The assignment scores **evenly distributed across all questions** (each question accounts for 10% of the final score). 

### Part 2 - 50% | Scope

Part 2 entails a project using the same dataset. The goal is to prove your data preprocessing skills. As output from this project, you should delivery the **notebook with the code you have done**.

Please apply, at least, **6 transformations** to the feature set you have in hands (or to the features which makes sense to apply the transformation). Each transformation should be accompanied by an explanation . Last but not least, compare the benefits of such transformation with the baseline score or the last best score. 

Regarding the variables you have to use throughout the Part 2, there are 6 in the total, 3 of them you are free to choose while the remaining 3 I have picked for you:
* YearBuilt
* LotFrontage
* MasVnrType

Make your baseline progressive, i.e. please consider the score from the previous transformation as the new baseline if it shows improvements. Example:

    Baseline - subset of transformations [None]  = 60% accuracy
    Iteration #1 - subset of transformations [A]     = 64% accuracy -> new baseline
    Iteration #2 - subset of transformations [A,B]   = 63% accuracy
    Iteration #3 - subset of transformations [A,B,C] = 68% accuracy -> new baseline
    ...
    Iteration #N - subset of transformations [A,B,C,..., N]
    (Being A, B, C a transformation that uses 1 or N features.)

Transformation example: encoding `color` & `country` with `One-Hot-Encoding`.

The `target` variable should be used to compute the accuracy (please use the `Target` you have created on the exercise 2.1, part1).


**Any question please contact to me via Slack or Email.**


---

**IMPORTANT NOTES to have in mind** 

a) Code Readability is taken into account for the evaluation, so please make it simple, readable and explain your operations when necessary.

b) Make sure that the evaluater can re-run the notebook from the begining, i.e. before you delivery the assignment please go to the bar on top of your notebook -> `Kernel` -> `Restart & Run all`. Validate that all outputs are as you expect.

----


## How can I delivery the Project?



**Email contact**

Please email me via `jpedronn@gmail.com` with the following subject:

`[MPPD Project] John Doe` - including the brackets !!!


**Deliverable**

1) Notebook with the code used for both parts, 1 and 2.

2) The notebook **NAME** should follow the notation:

```
 <MyFirstNameAndLastName>_MPPD_project.ipynb
```

E.g. `JoaoNadkarni_MPPD_project.ipynb`

---

---

## Setup

Feel free to add any Python package as you please

In [1]:
import os
import pandas as pd
import numpy as np

from pgds_mpp_utils import split_dataset, score_approach
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import (
    MinMaxScaler,
    StandardScaler,
)

---

# Part 1

## 1- Load Data

1.1- Load **house_prices_final_project.csv** to a Pandas DataFrame. You can see in `data_description.txt` file the description of each column

In [2]:
houses_df = pd.read_csv('house_prices_final_project.csv')

In [3]:
houses_df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


1.2- Print dataset total number of `observations` and `variables`

In [4]:
#Total observations and variables
houses_df.shape

(1460, 81)

---

### Please find below the subset of columns we are going to consider for the rest of the assignment

In [5]:
columns_list = ['FullBath',
                'TotRmsAbvGrd',
                'Fireplaces',
                'GarageYrBlt',
                'GarageCars',
                'GarageArea',
                'LotFrontage',
                'WoodDeckSF',
                'OpenPorchSF',
                'SaleType',
                'SaleCondition',
                'SalePrice']

1.3- Create a new dataframe which is a subset of the origin dataframe based on the columns listed above.

In [6]:
#New DataFrame with columns listed above
df = houses_df[columns_list]

In [7]:
df.head()

Unnamed: 0,FullBath,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,LotFrontage,WoodDeckSF,OpenPorchSF,SaleType,SaleCondition,SalePrice
0,2,8,0,2003.0,2,548,65.0,0,61,WD,Normal,208500
1,2,6,1,1976.0,2,460,80.0,298,0,WD,Normal,181500
2,2,6,1,2001.0,2,608,68.0,0,42,WD,Normal,223500
3,1,7,1,1998.0,3,642,60.0,0,35,WD,Abnorml,140000
4,2,9,1,2000.0,3,836,84.0,192,84,WD,Normal,250000


## 2- Creating Labels

2.1- Create the `target` column based on `SalePrice`. The split should be done using the median value to create 2 new buckets. `Min->Median` bucket should have assigned the value `0` while the other bucket (`Median->Max`) value should be `1`.

Note: you are free to decide the buckets boundaries

In [8]:
# Calculate the median of the SalePrice Variable
median_sale_price = df['SalePrice'].median()
median_sale_price

163000.0

In [9]:
#Create a new column 'target'
# 0 if 'SalePrice' <= median and 1 is 'SalePrice' > median
df.loc[df['SalePrice'] <= median_sale_price, 'target'] = 0
df.loc[df['SalePrice'] > median_sale_price, 'target'] = 1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc[df['SalePrice'] <= median_sale_price, 'target'] = 0


In [10]:
df.head()

Unnamed: 0,FullBath,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,LotFrontage,WoodDeckSF,OpenPorchSF,SaleType,SaleCondition,SalePrice,target
0,2,8,0,2003.0,2,548,65.0,0,61,WD,Normal,208500,1.0
1,2,6,1,1976.0,2,460,80.0,298,0,WD,Normal,181500,1.0
2,2,6,1,2001.0,2,608,68.0,0,42,WD,Normal,223500,1.0
3,1,7,1,1998.0,3,642,60.0,0,35,WD,Abnorml,140000,0.0
4,2,9,1,2000.0,3,836,84.0,192,84,WD,Normal,250000,1.0


## 3- Handling Missing Values

3.1- List the amount of missing values per column

In [11]:
# Missing values
df.isnull().sum()

FullBath           0
TotRmsAbvGrd       0
Fireplaces         0
GarageYrBlt       81
GarageCars         0
GarageArea         0
LotFrontage      259
WoodDeckSF         0
OpenPorchSF        0
SaleType           0
SaleCondition      0
SalePrice          0
target             0
dtype: int64

In [12]:
# Percentage of missing values
df.isnull().mean() * 100

FullBath          0.000000
TotRmsAbvGrd      0.000000
Fireplaces        0.000000
GarageYrBlt       5.547945
GarageCars        0.000000
GarageArea        0.000000
LotFrontage      17.739726
WoodDeckSF        0.000000
OpenPorchSF       0.000000
SaleType          0.000000
SaleCondition     0.000000
SalePrice         0.000000
target            0.000000
dtype: float64

3.2- Take care of the missing values in the column `LotFrontage`

In [13]:
# Split the dataset into train and test
train_df, test_df = split_dataset(df, "target")

In [14]:
print(f"Train dataset shape: {train_df.shape}")
print(f"Test dataset shape: {test_df.shape}")

Train dataset shape: (978, 13)
Test dataset shape: (482, 13)


In [15]:
# Imputation with median using Scikit-learn
imputer_median = SimpleImputer(strategy ='median')

train_df_median = train_df.copy()
test_df_median = test_df.copy()

train_df_median["LotFrontage"] = imputer_median.fit_transform(train_df_median[["LotFrontage"]])
test_df_median["LotFrontage"] = imputer_median.transform(test_df_median[["LotFrontage"]])

In [16]:
train_df_median.head()

Unnamed: 0,FullBath,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,LotFrontage,WoodDeckSF,OpenPorchSF,SaleType,SaleCondition,SalePrice,target
163,1,4,0,,0,0,55.0,0,0,WD,Normal,103200,0.0
1414,1,8,1,1922.0,2,370,64.0,0,0,WD,Normal,207000,1.0
227,1,5,0,1987.0,1,280,21.0,0,0,WD,Normal,106000,0.0
694,1,5,0,1995.0,2,576,51.0,112,0,WD,Normal,141500,0.0
1264,2,5,0,1998.0,2,511,34.0,144,68,COD,Abnorml,181000,1.0


## 4- Handling Categorical Data

4.1- Split categorical feature into a `df_categorical` dataframe

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   FullBath       1460 non-null   int64  
 1   TotRmsAbvGrd   1460 non-null   int64  
 2   Fireplaces     1460 non-null   int64  
 3   GarageYrBlt    1379 non-null   float64
 4   GarageCars     1460 non-null   int64  
 5   GarageArea     1460 non-null   int64  
 6   LotFrontage    1201 non-null   float64
 7   WoodDeckSF     1460 non-null   int64  
 8   OpenPorchSF    1460 non-null   int64  
 9   SaleType       1460 non-null   object 
 10  SaleCondition  1460 non-null   object 
 11  SalePrice      1460 non-null   int64  
 12  target         1460 non-null   float64
dtypes: float64(3), int64(8), object(2)
memory usage: 148.4+ KB


In [18]:
categorical_columns = ['SaleType',
                       'SaleCondition']

In [19]:
df_categorical = df[categorical_columns]

In [20]:
df_categorical.head()

Unnamed: 0,SaleType,SaleCondition
0,WD,Normal
1,WD,Normal
2,WD,Normal
3,WD,Abnorml
4,WD,Normal


4.2- Apply OHE to `SaleType`

In [21]:
df_categorical['SaleType'].unique()

array(['WD', 'New', 'COD', 'ConLD', 'ConLI', 'CWD', 'ConLw', 'Con', 'Oth'],
      dtype=object)

In [22]:
# One-Hot encoding with Scikit-Learn
from category_encoders import OneHotEncoder as OHE
ohe_enc = OHE(use_cat_names = True)

# Split the dataset into train and test
train_df, test_df = split_dataset(df, "target")

# Create copies of the test and train dataframes for one-hot encoding
train_df_categorical = test_df.copy()
test_df_categorical = test_df.copy()

# Perform one-hot encoding on the training and test data
train_df_OHE_sale = ohe_enc.fit_transform(train_df_categorical)
test_df_OHE_sale = ohe_enc.transform(test_df_categorical)

In [23]:
train_df_OHE_sale.head()

Unnamed: 0,FullBath,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,LotFrontage,WoodDeckSF,OpenPorchSF,SaleType_WD,...,SaleType_ConLI,SaleType_Con,SaleCondition_Normal,SaleCondition_Family,SaleCondition_Partial,SaleCondition_Abnorml,SaleCondition_Alloca,SaleCondition_AdjLand,SalePrice,target
353,1,5,0,2005.0,2,484,60.0,106,0,1,...,0,0,1,0,0,0,0,0,105900,0.0
92,1,5,0,1921.0,2,432,80.0,0,0,1,...,0,0,1,0,0,0,0,0,163500,1.0
1010,2,7,1,1948.0,1,312,115.0,0,0,1,...,0,0,1,0,0,0,0,0,135000,0.0
96,2,6,0,1999.0,2,472,78.0,158,29,1,...,0,0,1,0,0,0,0,0,214000,1.0
1177,1,5,0,1926.0,1,210,,0,0,1,...,0,0,1,0,0,0,0,0,115000,0.0


## 5- Feature Scaling

5.1- Apply feature scaling to the variable `GarageArea`. Make sure that the new range fall between `-1/3` and `3`.

In [24]:
# Make a copy of the original dataframe
df_copy = df.copy()

In [25]:
# Split the dataset into training and testing subsets based on the 'target' column
train_df, test_df = split_dataset(df_copy, 'target')

In [26]:
# Normalize the 'GarageArea' feature using Min-Max scaling
mmscaler = MinMaxScaler()

In [27]:
# Fit and transform the training data
train_values_mmscaler  = mmscaler.fit_transform(train_df[['GarageArea']])
# Transform the test data 
test_values_mmscaler = mmscaler.transform(test_df[['GarageArea']])

In [28]:
# Create dataframes for the scaled 'GarageArea' values in both training and testing sets
train_df_mmscaler = pd.DataFrame(train_values_mmscaler, columns = ['GarageArea'])
test_df_mmscaler =pd.DataFrame(test_values_mmscaler, columns = ['GarageArea'])

In [29]:
train_df_mmscaler.head()

Unnamed: 0,GarageArea
0,0.0
1,0.260931
2,0.197461
3,0.406206
4,0.360367


In [30]:
# Calculate the minimum and maximum values of the scaled 'GarageArea' across both training and testing datasets
min_value = min(train_df_mmscaler['GarageArea'].min(), test_df_mmscaler['GarageArea'].min())
max_value = max(train_df_mmscaler['GarageArea'].max(), test_df_mmscaler['GarageArea'].max())

# Check if the scaled values fall within the interval (-1/3, 3)
if min_value >= -1/3 and max_value <= 3:
    print("Values are within the range (-1/3, 3).")
else:
    print("Values are outside the range (-1/3, 3).")

Values are within the range (-1/3, 3).


---

## End of Part 1 - Thank you very much!

---

---

# Part 2

Remark:
* you shall use 6 variable for the assessment
* 3 out of 6 features are designated in the section on the top of the notebook
* the 3 remaining variables are up to you to choose
* you can consider any variable from the original dataset during this assessment

Above all, take this opportunity to practice :)

**Good luck!**

In [31]:
### Chosen Variables
#'GarageCars', 'SaleType' and 'KitchenQual'

### Provided Variables 
#'YearBuilt', 'LotFrontage' and 'MasVnrType'

In [32]:
columns_list = ['GarageCars',
                'SaleType',
                'KitchenQual',
                'LotFrontage',
                'YearBuilt',
                'MasVnrType', 
                'SalePrice']

In [33]:
#Create a new DataFrame with 6 variables
houses = houses_df[columns_list]

In [34]:
#Create the 'target' variable
median_sale_price = houses['SalePrice'].median()
houses.loc[houses['SalePrice'] <= median_sale_price, 'target'] = 0
houses.loc[houses['SalePrice'] > median_sale_price, 'target'] = 1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  houses.loc[houses['SalePrice'] <= median_sale_price, 'target'] = 0


In [35]:
houses.head()

Unnamed: 0,GarageCars,SaleType,KitchenQual,LotFrontage,YearBuilt,MasVnrType,SalePrice,target
0,2,WD,Gd,65.0,2003,BrkFace,208500,1.0
1,2,WD,TA,80.0,1976,,181500,1.0
2,2,WD,Gd,68.0,2001,BrkFace,223500,1.0
3,3,WD,Gd,60.0,1915,,140000,0.0
4,3,WD,Gd,84.0,2000,BrkFace,250000,1.0


In [36]:
# Remove the 'SalePrice' variable since we have created the target variable based on it
houses = houses.drop(columns = 'SalePrice')

In [37]:
houses.head()

Unnamed: 0,GarageCars,SaleType,KitchenQual,LotFrontage,YearBuilt,MasVnrType,target
0,2,WD,Gd,65.0,2003,BrkFace,1.0
1,2,WD,TA,80.0,1976,,1.0
2,2,WD,Gd,68.0,2001,BrkFace,1.0
3,3,WD,Gd,60.0,1915,,0.0
4,3,WD,Gd,84.0,2000,BrkFace,1.0


In [38]:
# Display information about the 'houses' dataframe
houses.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   GarageCars   1460 non-null   int64  
 1   SaleType     1460 non-null   object 
 2   KitchenQual  1460 non-null   object 
 3   LotFrontage  1201 non-null   float64
 4   YearBuilt    1460 non-null   int64  
 5   MasVnrType   1452 non-null   object 
 6   target       1460 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 80.0+ KB


In [39]:
# Split into training and test
train_houses, test_houses = split_dataset(houses, "target")

In [40]:
# Check the size of each dataset
print(f"Train dataset shape: {train_houses.shape}")
print(f"Test dataset shape: {test_houses.shape}")

Train dataset shape: (978, 7)
Test dataset shape: (482, 7)


In [41]:
train_houses.head()

Unnamed: 0,GarageCars,SaleType,KitchenQual,LotFrontage,YearBuilt,MasVnrType,target
163,0,WD,TA,55.0,1956,,0.0
1414,2,WD,Gd,64.0,1923,,1.0
227,1,WD,TA,21.0,1970,BrkFace,0.0
694,2,WD,TA,51.0,1936,,0.0
1264,2,COD,Gd,34.0,1998,,1.0


In [42]:
# Create a new DataFrame only with numerical variables to apply some transformations
# Select numerical columns from the previous DataFrame
numerical = houses.select_dtypes(include = 'number')
numerical.head()

Unnamed: 0,GarageCars,LotFrontage,YearBuilt,target
0,2,65.0,2003,1.0
1,2,80.0,1976,1.0
2,2,68.0,2001,1.0
3,3,60.0,1915,0.0
4,3,84.0,2000,1.0


In [43]:
# Check if we have missing values in the numerical dataset
numerical.isnull().mean() * 100

GarageCars      0.000000
LotFrontage    17.739726
YearBuilt       0.000000
target          0.000000
dtype: float64

In [44]:
#First, let's calculate the score_approach with variables that have no missing values; we will consider the following variables: 'GarageCars', 'YearBuilt' and 'target', since this variables do not present missing values
filtered_data = numerical[['GarageCars', 'YearBuilt', 'target']]
filtered_data.head()

Unnamed: 0,GarageCars,YearBuilt,target
0,2,2003,1.0
1,2,1976,1.0
2,2,2001,1.0
3,3,1915,0.0
4,3,2000,1.0


In [45]:
# Split into training and test
train_filtered_data, test_filtered_data = split_dataset(filtered_data, "target")

In [46]:
#Calculate the score_approach that will serve as your baseline
score_approach(train_filtered_data, test_filtered_data, 'target')

0.7406639004149378

#### Now, let's apply some transformations to the 'LotFrontage' variable to determine the best score_approach.

In [47]:
# Split the dataset with all numerical variables into training and testing data
train_numerical, test_numerical = split_dataset(numerical, "target")

In [48]:
# Check the size of each dataset
print(f"Train dataset shape: {train_numerical.shape}")
print(f"Test dataset shape: {test_numerical.shape}")

Train dataset shape: (978, 4)
Test dataset shape: (482, 4)


In [49]:
#Transformation A: Imputation of 'LotFrontage' with mean using Scikit-learn
mean_simple_imputer = SimpleImputer(strategy="mean")

In [50]:
# Impute missing values using the mean and transform the training and testing data
train_values = mean_simple_imputer.fit_transform(train_numerical)
test_values = mean_simple_imputer.transform(test_numerical)

In [51]:
train_values

array([[0.000e+00, 5.500e+01, 1.956e+03, 0.000e+00],
       [2.000e+00, 6.400e+01, 1.923e+03, 1.000e+00],
       [1.000e+00, 2.100e+01, 1.970e+03, 0.000e+00],
       ...,
       [3.000e+00, 8.200e+01, 1.995e+03, 1.000e+00],
       [0.000e+00, 2.100e+01, 1.972e+03, 0.000e+00],
       [2.000e+00, 5.300e+01, 1.910e+03, 0.000e+00]])

In [52]:
# Transform it back into a DataFrame since the transformation resulted in an array
train_numerical_lot_mean = pd.DataFrame(train_values, columns = train_numerical.columns)
test_numerical_lot_mean = pd.DataFrame(test_values, columns = test_numerical.columns)

In [53]:
# Verify if there are still missing values
train_numerical_lot_mean.isnull().mean() * 100

GarageCars     0.0
LotFrontage    0.0
YearBuilt      0.0
target         0.0
dtype: float64

In [54]:
#Calculate the score_approach with imputation of the 'mean' for the variable 'LotFrontage'
score_approach(train_numerical_lot_mean, test_numerical_lot_mean, 'target')

0.7406639004149378

In [55]:
#Transformation B: Imputation of 'LotFrontage' with median using Scikit-learn
median_simple_imputer =  SimpleImputer(strategy="median")
train_values_median = median_simple_imputer.fit_transform(train_numerical)
test_values_median = median_simple_imputer.transform(test_numerical)

In [56]:
# Transform it back into a DataFrame since the transformation resulted in an array
train_numerical_lot_median = pd.DataFrame(train_values_median, columns = train_numerical.columns)
test_numerical_lot_median = pd.DataFrame(test_values_median, columns = test_numerical.columns)

In [57]:
#Calculate the score_approach with imputation of the 'median' for the variable 'LotFrontage'
score_approach(train_numerical_lot_median, test_numerical_lot_median, 'target')

0.7406639004149378

In [58]:
# #Transformation C: Encoding missing values with a constant value
train_numerical_encoding_lot = train_numerical.copy()
test_numerical_encoding_lot = test_numerical.copy()

value_to_encode = -1
train_numerical_encoding_lot['LotFrontage'] = train_numerical_encoding_lot['LotFrontage'].fillna(value_to_encode)
test_numerical_encoding_lot['LotFrontage'] = test_numerical_encoding_lot['LotFrontage'].fillna(value_to_encode)

In [59]:
#Calculate the score_approach by encoding the missing values for the variable 'LotFrontage'
score_approach(train_numerical_encoding_lot, test_numerical_encoding_lot, 'target')

0.7406639004149378

#### No differences were observed between the score approach values obtained for each of the transformations (imputation of the mean/median and encoding missing values with a constant value).
#### Let's try imputation using K-nearest neighbors

In [60]:
# #Transformation D: Imputation for completing missing values using K-nearest neighbors
from sklearn.impute import KNNImputer
#knn_imputer with 2 n_neighbors
knn_imputer_2 = KNNImputer(n_neighbors = 2)

In [61]:
knn_imputer_2

In [62]:
train_values_knn = knn_imputer_2.fit_transform(train_numerical)
test_values_knn = knn_imputer_2.transform(test_numerical)

In [63]:
# Transform it back into a DataFrame since the transformation resulted in an array
train_numerical_knn_imputer_2 = pd.DataFrame(train_values_knn, columns = train_numerical.columns)
test_numerical_knn_imputer_2 = pd.DataFrame(test_values_knn, columns = test_numerical.columns)

In [64]:
# Verify if there are still missing values
train_numerical_knn_imputer_2.isnull().mean() * 100

GarageCars     0.0
LotFrontage    0.0
YearBuilt      0.0
target         0.0
dtype: float64

In [65]:
#Calculate the score_approach with imputation for completing missing values using K-nearest neighbors (n=2)
score_approach(train_numerical_knn_imputer_2, test_numerical_knn_imputer_2, 'target')

0.7510373443983402

In [66]:
#Let's try knn_imputer with 3 n_neighbors
knn_imputer_3 = KNNImputer(n_neighbors = 3)

In [67]:
knn_imputer_3

In [68]:
train_values_knn = knn_imputer_3.fit_transform(train_numerical)
test_values_knn = knn_imputer_3.transform(test_numerical)

In [69]:
# Transform it back into a DataFrame since the transformation resulted in an array
train_numerical_knn_imputer_3 = pd.DataFrame(train_values_knn, columns = train_numerical.columns)
test_numerical_knn_imputer_3 = pd.DataFrame(test_values_knn, columns = test_numerical.columns)

In [70]:
#Calculate the score_approach with imputation for completing missing values using K-nearest neighbors (n=3)
score_approach(train_numerical_knn_imputer_3, test_numerical_knn_imputer_3, 'target')

0.7531120331950207

#### Numerical Variables Notes:

**1. Baseline** - No transformations --> 0.7406639004149378

**2. Transformation A** (imputation with mean) of variable 'LotFrontage' --> 0.7406639004149378

**3. Transformation B** (imputation with median) of variable 'LotFrontage' --> 0.7406639004149378

**4. Transformation C** (constant value) of variable 'LotFrontage' --> 0.7406639004149378

**5. Transformation D** (KNN with n_neighbors = 2) of variable 'LotFrontage' --> 0.7510373443983402 (new baseline)

**6. Transformation E** (KNN with n_neighbors = 3) of variable 'LotFrontage' --> 0.7531120331950207 (new baseline)

The best score_approach was achieved with the imputation of missing values using the k-nearest neighbors technique.

The best score was obtained with n_neighbors set to 3.

**Next step: create a new dataset based on the last one, but including the categorical variable 'SaleType'**

In [71]:
# Add 'GarageCars', 'YearBuilt', 'target' and 'SaleType' to perform some tranformations
# Create a new dataset 'df' to simplify the dataset names
columns_list = ['GarageCars',
                'SaleType',
                'YearBuilt', 
                'target']
df = houses[columns_list]
df.head()

Unnamed: 0,GarageCars,SaleType,YearBuilt,target
0,2,WD,2003,1.0
1,2,WD,1976,1.0
2,2,WD,2001,1.0
3,3,WD,1915,0.0
4,3,WD,2000,1.0


In [72]:
# Split the dataset into training and test
train_df, test_df = split_dataset(df, "target")

In [73]:
print(f"Train dataset shape: {train_df.shape}")
print(f"Test dataset shape: {test_df.shape}")

Train dataset shape: (978, 4)
Test dataset shape: (482, 4)


In [74]:
#Reset the index of the training and test datasets to add the 'LotFrontage' column already filled with missing values imputed using KNN - best baseline 
train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)
train_numerical_knn_imputer = train_numerical_knn_imputer_3.reset_index(drop=True)
test_numerical_knn_imputer = test_numerical_knn_imputer_3.reset_index(drop=True)
train_df['LotFrontage'] = train_numerical_knn_imputer['LotFrontage']
test_df['LotFrontage'] = test_numerical_knn_imputer['LotFrontage']
train_df.head()

Unnamed: 0,GarageCars,SaleType,YearBuilt,target,LotFrontage
0,0,WD,1956,0.0,55.0
1,2,WD,1923,1.0,64.0
2,1,WD,1970,0.0,21.0
3,2,WD,1936,0.0,51.0
4,2,COD,1998,1.0,34.0


In [75]:
# Check for the presence of missing values
train_df.isnull().mean()

GarageCars     0.0
SaleType       0.0
YearBuilt      0.0
target         0.0
LotFrontage    0.0
dtype: float64

In [76]:
# Let's see how many categories the 'SaleType' variable has
df['SaleType'].unique()

array(['WD', 'New', 'COD', 'ConLD', 'ConLI', 'CWD', 'ConLw', 'Con', 'Oth'],
      dtype=object)

In [77]:
# Applying Ordinal Encoding to the 'SaleType' variable 
from sklearn.preprocessing import OrdinalEncoder

ord_encoder = OrdinalEncoder(categories= [["COD", "Con", "ConLD", "ConLI", "ConLw", "CWD", "WD", "New", "Oth"]])

In [78]:
# Perform a copy of training and test dataset 
train_df_ord_enc_sale = train_df.copy()
test_df_ord_enc_sale = test_df.copy()

In [79]:
# Fit and transform the training and test data
train_df_ord_enc_sale['SaleType'] = ord_encoder.fit_transform(train_df_ord_enc_sale[['SaleType']])
test_df_ord_enc_sale['SaleType'] = ord_encoder.transform(test_df_ord_enc_sale[['SaleType']])

In [80]:
ord_encoder.categories_

[array(['COD', 'Con', 'ConLD', 'ConLI', 'ConLw', 'CWD', 'WD', 'New', 'Oth'],
       dtype=object)]

In [81]:
train_df_ord_enc_sale.head()

Unnamed: 0,GarageCars,SaleType,YearBuilt,target,LotFrontage
0,0,6.0,1956,0.0,55.0
1,2,6.0,1923,1.0,64.0
2,1,6.0,1970,0.0,21.0
3,2,6.0,1936,0.0,51.0
4,2,0.0,1998,1.0,34.0


In [82]:
#Calculate the score_approach using ordinal encoding in 'SaleType' variable
score_approach(train_df_ord_enc_sale, test_df_ord_enc_sale, 'target')

0.7655601659751037

In [83]:
# Applying One-Hot Encoding with Scikit-Learn - 'SaleType' Variable
train_df_sale = train_df.copy()
test_df_sale = test_df.copy()

In [84]:
train_df_sale.head()

Unnamed: 0,GarageCars,SaleType,YearBuilt,target,LotFrontage
0,0,WD,1956,0.0,55.0
1,2,WD,1923,1.0,64.0
2,1,WD,1970,0.0,21.0
3,2,WD,1936,0.0,51.0
4,2,COD,1998,1.0,34.0


In [85]:
# Initialize a One-Hot Encoder for feature transformation
ohe_enc = OHE(use_cat_names = True)

In [86]:
# Fit and transform the training and test data
train_df_OHE_sale = ohe_enc.fit_transform(train_df_sale)
test_df_OHE_sale = ohe_enc.transform(test_df_sale)

In [87]:
train_df_OHE_sale.head()

Unnamed: 0,GarageCars,SaleType_WD,SaleType_COD,SaleType_New,SaleType_ConLD,SaleType_ConLw,SaleType_Con,SaleType_Oth,SaleType_CWD,SaleType_ConLI,YearBuilt,target,LotFrontage
0,0,1,0,0,0,0,0,0,0,0,1956,0.0,55.0
1,2,1,0,0,0,0,0,0,0,0,1923,1.0,64.0
2,1,1,0,0,0,0,0,0,0,0,1970,0.0,21.0
3,2,1,0,0,0,0,0,0,0,0,1936,0.0,51.0
4,2,0,1,0,0,0,0,0,0,0,1998,1.0,34.0


In [88]:
#Calculate the score_approach using One Hot encoding in 'SaleType' variable
score_approach(train_df_OHE_sale, test_df_OHE_sale, 'target')

0.7572614107883817

In [89]:
# Applying Binary Encoding - 'SaleType' Variable
from category_encoders import BinaryEncoder

In [90]:
# Initialize a Binary Encoder for feature encoding
binary_encoder = BinaryEncoder()

In [91]:
# Fit and transform the training and test data
binary_train_df_sale = binary_encoder.fit_transform(train_df_sale)
binary_test_df_sale = binary_encoder.transform(test_df_sale)

In [92]:
binary_train_df_sale.head()

Unnamed: 0,GarageCars,SaleType_0,SaleType_1,SaleType_2,SaleType_3,YearBuilt,target,LotFrontage
0,0,0,0,0,1,1956,0.0,55.0
1,2,0,0,0,1,1923,1.0,64.0
2,1,0,0,0,1,1970,0.0,21.0
3,2,0,0,0,1,1936,0.0,51.0
4,2,0,0,1,0,1998,1.0,34.0


In [93]:
#Calculate the score_approach using binary encoder in 'SaleType' variable
score_approach(binary_train_df_sale, binary_test_df_sale, 'target')

0.7468879668049793

#### 'SaleType' variable transformations

**7. Transformation F** (KNN with n_neighbors = 3) of variable 'LotFrontage' and Ordinal Encoding of variable 'SaleType' --> 0.7655601659751037 (new baseline)

**8. Transformation G** (KNN with n_neighbors = 3) of variable 'LotFrontage' and One Hot Encoding of variable 'SaleType' --> 0.7572614107883817

**9. Transformation H** (KNN with n_neighbors = 3) of variable 'LotFrontage' and Binary Encoding of variable 'SaleType' --> 0.7468879668049793

The best approach until now was achieved by imputing missing values in the 'LotFrontage' variable using the k-nearest neighbors technique and by applying Ordinal Encoding to the 'SaleType' variable with a specific order.


**Next step: create a new dataset based on the last one, but including the categorical variable 'MasVnrType'**

In [94]:
#Create a copy of training and test dataset with the best transformations in order to simplify
train_df_2 = train_df_ord_enc_sale.copy()
test_df_2 = test_df_ord_enc_sale.copy()

In [95]:
## Now let's add the column 'MasVnrType'
#Reset the index of the training and test datasets to add the 'MassVnrType' column
train_df_2 = train_df_2.reset_index(drop=True)
test_df_2 = test_df_2.reset_index(drop=True)
train_houses = train_houses.reset_index(drop=True)
test_houses = test_houses.reset_index(drop=True)
train_df_2['MasVnrType'] = train_houses['MasVnrType']
test_df_2['MasVnrType'] = test_houses['MasVnrType']
train_df_2.head()


Unnamed: 0,GarageCars,SaleType,YearBuilt,target,LotFrontage,MasVnrType
0,0,6.0,1956,0.0,55.0,
1,2,6.0,1923,1.0,64.0,
2,1,6.0,1970,0.0,21.0,BrkFace
3,2,6.0,1936,0.0,51.0,
4,2,0.0,1998,1.0,34.0,


In [96]:
# Check the presence of missing values
train_df_2.isnull().mean()

GarageCars     0.000000
SaleType       0.000000
YearBuilt      0.000000
target         0.000000
LotFrontage    0.000000
MasVnrType     0.005112
dtype: float64

In [97]:
# We have missing values in the 'MasVnrType' column. 
# Since this variable is categorical, let's fill these missing values by replacing them with the mode.
train_df_2_mode = train_df_2.copy()
test_df_2_mode = test_df_2.copy()

train_df_2_mode['MasVnrType'].fillna(train_df_2_mode['MasVnrType'].mode()[0], inplace=True)
test_df_2_mode['MasVnrType'].fillna(test_df_2_mode['MasVnrType'].mode()[0], inplace=True)

In [98]:
# Check if we still have missing values 
train_df_2_mode.isnull().mean()

GarageCars     0.0
SaleType       0.0
YearBuilt      0.0
target         0.0
LotFrontage    0.0
MasVnrType     0.0
dtype: float64

In [99]:
# Now let's apply ordinal encoding to 'MasVnrType' variable - mode was used to fill the missing values
train_df_2_mode['MasVnrType'].unique()

array(['None', 'BrkFace', 'Stone', 'BrkCmn'], dtype=object)

In [100]:
# Initialize an Ordinal Encoder with specific category order for feature encoding
ord_encoder = OrdinalEncoder(categories= [["BrkCmn", "BrkFace", "Stone", "None"]])

In [101]:
# Perform a copy of training and test dataset and apply the Ordinal Encoder
train_df_2_ord_enc_mas = train_df_2_mode.copy()
test_df_2_ord_enc_mas = test_df_2_mode.copy()
train_df_2_ord_enc_mas['MasVnrType'] = ord_encoder.fit_transform(train_df_2_ord_enc_mas[['MasVnrType']])
test_df_2_ord_enc_mas['MasVnrType'] = ord_encoder.transform(test_df_2_ord_enc_mas[['MasVnrType']])

In [102]:
#Calculate the score_approach using ordinal encoding in 'MasVnrType' variable - mode was used to fill the missing values
score_approach(train_df_2_ord_enc_mas, test_df_2_ord_enc_mas, 'target')

0.7468879668049793

In [103]:
# Applying One-Hot Encoding with Scikit-Learn - 'MasVnrType' Variable
# Create copies of training and test datasets for one-hot encoding
train_df_2_OHE_mas = train_df_2_mode.copy()
test_df_2_OHE_mas = test_df_2_mode.copy()

# Initialize a One-Hot Encoder 
ohe_encoder = OHE(use_cat_names = True)

# Perform one-hot encoding on the training and test data for the 'MasVnrType' variable
train_df_2_OHE_mas = ohe_encoder.fit_transform(train_df_2_OHE_mas)
test_df_2_OHE_mas = ohe_encoder.transform(test_df_2_OHE_mas)

In [104]:
#Calculate the score_approach using One Hot encoding in 'MasVnrType' variable - mode was used to fill the missing values
score_approach(train_df_2_OHE_mas, test_df_2_OHE_mas, 'target')

0.7468879668049793

In [105]:
# Applying Binary Encoding
# Create copies of the training and test datasets for binary encoding
train_df_2_binary_mas = train_df_2_mode.copy()
test_df_2_binary_mas = test_df_2_mode.copy()

# Initialize a Binary Encoder
binary_encoder = BinaryEncoder()

# Perform binary encoding on the training and test data
train_df_2_binary_mas = binary_encoder.fit_transform(train_df_2_binary_mas)
test_df_2_binary_mas = binary_encoder.transform(test_df_2_binary_mas)

In [106]:
#Calculate the score_approach using Binary encoding in 'MasVnrType' variable - mode was used to fill the missing values
score_approach(train_df_2_binary_mas, test_df_2_binary_mas, 'target')

0.7427385892116183

**10. Transformation I**: (KNN with n_neighbors = 3) of variable 'LotFrontage' + Ordinal Encoding of variable 'SaleType' + filling missing values with mode and perform ordinal encoding in variable 'MasVnrType' --> 0.7468879668049793 

**11. Transformation J**: (KNN with n_neighbors = 3) of variable 'LotFrontage' + Ordinal Encoding of variable 'SaleType' + filling missing values with mode and perform One Hot encoding in variable 'MasVnrType' --> 0.7468879668049793 

**12. Transformation K** (KNN with n_neighbors = 3) of variable 'LotFrontage' + Ordinal Encoding of variable 'SaleType' + filling missing values with mode and perform binary encoding in variable 'MasVnrType' --> 0.7427385892116183 

With the variable 'MasVnrType,' the score did not improve. Therefore, up to this point the best transformation remains Transformation F (KNN with n_neighbors = 3) of variable 'LotFrontage' and Ordinal Encoding of variable 'SaleType' --> 0.7655601659751037 (best baseline)

**Next step: create a new dataset based on the last one, but including the categorical variable 'KitchenQual'**

In [107]:
#Create a copy of training and test dataset with the best transformations in order to simplify dataset names
train_df_3 = train_df_ord_enc_sale.copy()
test_df_3 = test_df_ord_enc_sale.copy()

In [108]:
## Now let's add the column 'KitchenQual'
#Reset the index of the training and test datasets to add the 'KitchenQual' column
train_df_3 = train_df_3.reset_index(drop=True)
test_df_3 = test_df_3.reset_index(drop=True)

# Reset the index for the original training and test DataFrames to maintain consistency
train_houses = train_houses.reset_index(drop=True)
test_houses = test_houses.reset_index(drop=True)

# Copy the 'KitchenQual' column from the original training and test data to the best-score DataFrame
train_df_3['KitchenQual'] = train_houses['KitchenQual']
test_df_3['KitchenQual'] = test_houses['KitchenQual']
train_df_3.head()

Unnamed: 0,GarageCars,SaleType,YearBuilt,target,LotFrontage,KitchenQual
0,0,6.0,1956,0.0,55.0,TA
1,2,6.0,1923,1.0,64.0,Gd
2,1,6.0,1970,0.0,21.0,TA
3,2,6.0,1936,0.0,51.0,TA
4,2,0.0,1998,1.0,34.0,Gd


In [109]:
# Check the presence of missing values
train_df_3.isnull().mean()

GarageCars     0.0
SaleType       0.0
YearBuilt      0.0
target         0.0
LotFrontage    0.0
KitchenQual    0.0
dtype: float64

In [110]:
train_df_3['KitchenQual'].unique()

array(['TA', 'Gd', 'Fa', 'Ex'], dtype=object)

In [111]:
# Let's apply ordinal encoding to the 'KitchenQual' variable
ord_encoder = OrdinalEncoder(categories= [["Ex", "Fa", "Gd", "TA"]])

# Perform a copy of training and test dataset and apply the ordinal encoder
train_df_3_ord_enc = train_df_3.copy()
test_df_3_ord_enc = test_df_3.copy()
train_df_3_ord_enc['KitchenQual'] = ord_encoder.fit_transform(train_df_3_ord_enc[['KitchenQual']])
test_df_3_ord_enc['KitchenQual'] = ord_encoder.transform(test_df_3_ord_enc[['KitchenQual']])

In [112]:
#Calculate the score_approach using Ordinal encoding in 'KitchenQual' variable
score_approach(train_df_3_ord_enc, test_df_3_ord_enc, 'target')

0.7883817427385892

In [113]:
# Applying One-Hot Encoding with Scikit-Learn - 'KitchenQual' Variable
# Perform a copy of training and test dataset and apply the OHE
train_df_3_OHE = train_df_3.copy()
test_df_3_OHE = test_df_3.copy()

ohe_encoder = OHE(use_cat_names = True)

train_df_3_OHE = ohe_encoder.fit_transform(train_df_3_OHE)
test_df_3_OHE = ohe_encoder.transform(test_df_3_OHE)

In [114]:
#Calculate the score_approach using OHE in 'KitchenQual' variable
score_approach(train_df_3_OHE, test_df_3_OHE, 'target')

0.7883817427385892

In [115]:
# Applying Binary Encoding
train_df_3_binary = train_df_3.copy()
test_df_3_binary = test_df_3.copy()

binary_encoder = BinaryEncoder()

train_df_3_binary = ohe_encoder.fit_transform(train_df_3_binary)
test_df_3_binary = ohe_encoder.transform(test_df_3_binary)

In [116]:
#Calculate the score_approach using Binary encoding in 'KitchenQual' variable
score_approach(train_df_3_binary, test_df_3_binary, 'target')

0.7883817427385892

**13. Transformation L**: (KNN with n_neighbors = 3) of variable 'LotFrontage' + Ordinal Encoding of variable 'SaleType' + Ordinal Encoding of variable 'KitchenQual' --> 0.7883817427385892 (new_baseline)

**14. Transformation M**: (KNN with n_neighbors = 3) of variable 'LotFrontage' + Ordinal Encoding of variable 'SaleType' + One Hot Encoding of variable 'KitchenQual' --> 0.7883817427385892  

**15. Transformation N** (KNN with n_neighbors = 3) of variable 'LotFrontage' + Ordinal Encoding of variable 'SaleType' + Binary Encoding of variable 'KitchenQual' --> 0.7883817427385892 

The new best score was obtained with Transformation L: 0.7883817427385892

In [117]:
# Let's perform Standardization and Normalization in Numerical Columns, namely: 'YearBuilt', 'LotFrontage' and 'GarageCars'

#Let's simplify the name of training and test dataset 
train_df_4 = train_df_3_ord_enc.copy()
test_df_4 = test_df_3_ord_enc.copy()

# Columns to be standardized or normalized
selected_columns = ['YearBuilt', 'LotFrontage', 'GarageCars']

# Select the columns to be standardized or normalized
input_columns = [col for col in train_df_4.columns if col in selected_columns]
input_columns

# Justification for standardization and normalization:
# Different Scales: 'YearBuilt' has multiple decades, while 'LotFrontage' and 'GarageCars' have smaller scales.

['GarageCars', 'YearBuilt', 'LotFrontage']

In [118]:
# Initialize the StandardScaler object for standardization
zscore = StandardScaler()

In [119]:
train_df_4.head()

Unnamed: 0,GarageCars,SaleType,YearBuilt,target,LotFrontage,KitchenQual
0,0,6.0,1956,0.0,55.0,3.0
1,2,6.0,1923,1.0,64.0,2.0
2,1,6.0,1970,0.0,21.0,3.0
3,2,6.0,1936,0.0,51.0,3.0
4,2,0.0,1998,1.0,34.0,2.0


In [120]:
# Fit the scaler to the training data and apply standardization
train_values_zscore = zscore.fit_transform(train_df_4[input_columns])

# Apply the same standardization to the test data
test_values_zscore = zscore.fit_transform(test_df_4[input_columns])

In [121]:
# Create DataFrames for Z-score scaled training and test data
train_df_4_zscore = pd.DataFrame(train_values_zscore, columns = input_columns)
test_df_4_zscore =pd.DataFrame(test_values_zscore, columns = input_columns)

In [122]:
# Assign the 'target' column from the original training data to the Z-score scaled training data
train_df_4_zscore['target'] =train_df_4['target']
# Assign the 'target' column from the original test data to the Z-score scaled test data
test_df_4_zscore['target'] =test_df_4['target']

In [123]:
#Calculate the score_approach using standardization
score_approach(train_df_4_zscore, test_df_4_zscore, 'target')

0.7987551867219918

In [124]:
train_df_4_zscore.head()

Unnamed: 0,GarageCars,YearBuilt,LotFrontage,target
0,-2.351419,-0.514374,-0.661029,0.0
1,0.288865,-1.606179,-0.258791,1.0
2,-1.031277,-0.051184,-2.180596,0.0
3,0.288865,-1.176074,-0.839802,0.0
4,0.288865,0.875197,-1.599585,1.0


In [125]:
# Initialize MinMaxScaler for feature scaling
mmscaler = MinMaxScaler()

In [126]:
# Perform Min-Max scaling on the selected input columns for both training and test data
train_values_mmscaler = mmscaler.fit_transform(train_df_4[input_columns])
test_values_mmscaler = mmscaler.fit_transform(test_df_4[input_columns])

In [127]:
# Create DataFrames for the Min-Max scaled training and test data
train_df_4_mmscaler = pd.DataFrame(train_values_mmscaler, columns = input_columns)
test_df_4_mmscaler =pd.DataFrame(test_values_mmscaler, columns = input_columns)

In [128]:
# Assign the 'target' column from the original training and test datasets to the scaled DataFrames
train_df_4_mmscaler['target'] =train_df_4['target']
test_df_4_mmscaler['target'] =test_df_4['target']

In [129]:
#Calculate the score_approach using normalization
score_approach(train_df_4_mmscaler, test_df_4_mmscaler, 'target')

0.8029045643153527

In [130]:
train_df_4_mmscaler.head()

Unnamed: 0,GarageCars,YearBuilt,LotFrontage,target
0,0.0,0.608696,0.116438,0.0
1,0.5,0.369565,0.14726,1.0
2,0.25,0.710145,0.0,0.0
3,0.5,0.463768,0.10274,0.0
4,0.5,0.913043,0.044521,1.0


**16. Transformation O**: Standardization of numerical features using 'StandardScaler'--> 0.7987551867219918 (new_baseline)

**17. Transformation P**: Normalization of numerical features using 'MinMaxScaler' --> 0.8029045643153527 (new_best_baseline)

The best transformation so far involves applying normalization to the numerical variables (the new baseline is 0.8029045643153527).

**_In the next step, we will add the categorical variables 'Lot Frontage' and 'SaleType' after applying transformations that yielded the best score values to see if they improve the score_approach values._**


In [131]:
# Reset the index for the best-score DataFrames to ensure consistency and clear index labels
# Resetting the index ensures that the best-score DataFrames have a clear index label, improving data handling.
train_df_best_score = train_df_4_mmscaler.reset_index(drop=True)
test_df_best_score = test_df_4_mmscaler.reset_index(drop=True)
# Reset the index for the original training and test DataFrames to maintain consistency
train_df_4 = train_df_4.reset_index(drop=True)
test_df_4 = test_df_4.reset_index(drop=True)
# Copy the 'KitchenQual' and 'SaleType' columns from the original training and test data to the best-score DataFrames
train_df_best_score[['KitchenQual', 'SaleType']] = train_df_4[['KitchenQual', 'SaleType']]
test_df_best_score[['KitchenQual', 'SaleType']] = test_df_4[['KitchenQual', 'SaleType']]
train_df_best_score.head()

Unnamed: 0,GarageCars,YearBuilt,LotFrontage,target,KitchenQual,SaleType
0,0.0,0.608696,0.116438,0.0,3.0,6.0
1,0.5,0.369565,0.14726,1.0,2.0,6.0
2,0.25,0.710145,0.0,0.0,3.0,6.0
3,0.5,0.463768,0.10274,0.0,3.0,6.0
4,0.5,0.913043,0.044521,1.0,2.0,0.0


In [132]:
# Calculate the score_approach while including 'KitchenQual' and 'SaleType' variables
score_approach(train_df_best_score, test_df_best_score, 'target')

0.8112033195020747

Finally, the best score_approach (0.8112033195020747) was achieved by applying the following transformations to the variables:
1. 'Lot Frontage' - applying KNN to fill in missing values, with the best score obtained at n = 3.
2. 'SaleType' - applying ordinal encoding.
3. 'MasVnrType' - dropping the column.
4. 'KitchenQual' - applying ordinal encoding.
5. Applying normalization to the numeric variables - 'YearBuilt', 'LotFrontage' and 'GarageCars'.