## Encoding 

In [None]:
import pandas as pd
import numpy as np
import sklearn as sk
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.preprocessing import StandardScaler

In [None]:
# Load the data in which missing values have already been filled
train = pd.read_csv("data/filled/train.csv")
test = pd.read_csv("data/filled/test.csv")

In [None]:
print(train.shape)
print(test.shape)

## Features' dtypes

First, we need to check that each feature's dtype matches what the feature actually means and that no feature has been given the wrong dtype.

To verify this, we compare the dtypes against the data_description.txt file.

Now let's separate the object features from the numerical ones.

In [None]:
object_columns = train.select_dtypes(include='object')
continuous_columns = train.select_dtypes(exclude="object").drop(["SalePrice", "Id"], axis=1)

In [None]:
object_columns.columns

The categorical features cause no problems: they really are categorical. The only interesting case is features with a natural ordering, such as PoolQC or GarageQual. These can be label encoded (mapped to integers) instead of one-hot encoded. In this first version of the encoding, however, we one-hot encode almost all categorical features, except those that have only two categories.
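
As a minimal sketch of what an order-preserving encoding of a quality feature might look like, the example below maps quality codes to integers with an explicit dictionary. The toy DataFrame and the worst-to-best order `Po < Fa < TA < Gd < Ex` are assumptions for illustration; the filled data may contain additional labels (for example a placeholder for a missing garage or pool). An explicit mapping is used here because `LabelEncoder` assigns integers in sorted label order, which does not follow the quality scale.

```python
import pandas as pd

# Toy column; the values mimic quality codes and are illustrative only.
toy = pd.DataFrame({"GarageQual": ["TA", "Gd", "Po", "Ex", "Fa", "TA"]})

# Assumed worst-to-best order; extend it if the real data has extra labels.
quality_order = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# Map each label to its rank so that a larger number means better quality.
toy["GarageQual_ordinal"] = toy["GarageQual"].map(quality_order)
print(toy)
```

Any label missing from the dictionary would become NaN after `map`, which makes unexpected categories easy to spot.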

In [None]:
continuous_columns.columns

Among the continuous features, several are actually categorical but are stored as numbers, so pandas treats them as continuous.

* **MSSubClass** &ndash; a purely categorical feature
* **OverallQual** &ndash; also categorical, but it can be interpreted as already label encoded, so we will not encode it again
* **OverallCond** &ndash; the same situation as with OverallQual

So we need to add the MSSubClass feature to our object columns and remove it from the continuous ones.

In [None]:
object_columns["MSSubClass"] = continuous_columns["MSSubClass"]
continuous_columns = continuous_columns.drop(["MSSubClass"], axis=1)

## Category features encoding

Let's check how many categorical features we have.

In [None]:
object_columns.columns

The train and test sets may not contain the same set of categories in a given column (for example, the test set can contain categories that the train set lacks). For this reason we stack the train and test sets vertically, encode them together, and then split them back apart.

In [None]:
# ignore_index=True already gives the stacked frame a fresh 0..n-1 index
data = pd.concat([train.drop("SalePrice", axis=1), test], axis=0, ignore_index=True)

category_counts = [data[column].value_counts().shape[0] for column in object_columns.columns]
print(sum(category_counts))
print(len(category_counts))

We see that one-hot encoding, for example, would add 281 new columns to our data, while the 44 original categorical columns are dropped, for a net increase of 237 columns.
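
The column arithmetic can be checked on a small example. The sketch below uses a made-up three-column DataFrame (not the real data) and `pd.get_dummies` instead of `OneHotEncoder`, purely to keep the illustration short: the number of new columns equals the total number of categories, and the net change subtracts the original categorical columns.

```python
import pandas as pd

# Toy data, illustrative only.
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],    # 2 categories
    "LotShape": ["Reg", "IR1", "IR2"],     # 3 categories
    "LotArea": [8450, 9600, 11250],        # numeric, left untouched
})
categorical = ["Street", "LotShape"]

# One new column per category: 2 + 3 = 5
new_columns = sum(toy[column].nunique() for column in categorical)
# The 2 original categorical columns disappear: 5 - 2 = 3
net_change = new_columns - len(categorical)

encoded_toy = pd.get_dummies(toy, columns=categorical)
print(toy.shape[1], "->", encoded_toy.shape[1], "net change:", net_change)
```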

In [None]:
encoded = data.copy()
# sparse_output=False returns a dense array that fits directly into a DataFrame
encoder = OneHotEncoder(sparse_output=False)
one_hot_encoded = encoder.fit_transform(encoded[object_columns.columns])
one_hot_df = pd.DataFrame(one_hot_encoded, columns=encoder.get_feature_names_out(object_columns.columns))
# Append the one-hot columns, then drop the original categorical columns
encoded = pd.concat([encoded, one_hot_df], axis=1)
encoded = encoded.drop(object_columns.columns, axis=1)

encoded.shape

In [None]:
# Split by the number of train rows instead of a hard-coded row count
n_train = len(train)
encoded_train = encoded.iloc[:n_train].copy()
encoded_train["SalePrice"] = train["SalePrice"]

encoded_test = encoded.iloc[n_train:].copy()

In [None]:
print(encoded_train.shape)
print(encoded_test.shape)

## Casting continuous features to float

To work with all features uniformly, we convert them to float. Since all categorical features are now encoded, every feature is numerical, but some of them may still be integers rather than floats.

In [None]:
encoded_train = encoded_train.astype("float64")
encoded_test = encoded_test.astype("float64")

encoded_train = encoded_train.drop(["Id"], axis=1)
encoded_test = encoded_test.drop(["Id"], axis=1)

In [None]:
encoded_train.to_csv("data/encoded/train.csv", index=False)
encoded_test.to_csv("data/encoded/test.csv", index=False)

## Scaling

In addition, this notebook writes out files with the continuous features already scaled, so that later notebooks do not have to repeat this step.

We standardize all of the continuous columns and apply a logarithmic transformation to the target.

In [None]:
scaled_train = encoded_train.copy() 
scaled_test = encoded_test.copy()

scaler = StandardScaler()
columns2scale = continuous_columns.columns

# Fit the scaler on train only, then reuse its mean and variance for test
scaled_train[columns2scale] = scaler.fit_transform(scaled_train[columns2scale])
scaled_test[columns2scale] = scaler.transform(scaled_test[columns2scale])
# Log-transform the target
scaled_train["SalePrice"] = np.log(scaled_train["SalePrice"])

scaled_train.to_csv("data/scaled_train.csv", index=False)
scaled_test.to_csv("data/scaled_test.csv", index=False)
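
Because the target is stored on a log scale, a model trained on it would predict log prices, which presumably need to be mapped back with `np.exp` before being reported as prices. The sketch below shows the round trip on a few made-up values (not taken from the data).

```python
import numpy as np

# Illustrative prices only.
prices = np.array([100000.0, 208500.0, 755000.0])

log_prices = np.log(prices)     # same transform the notebook applies to SalePrice
restored = np.exp(log_prices)   # inverse transform back to the original scale

print(np.allclose(restored, prices))
```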

When we train models later, we should use **data/encoded/train.csv** rather than **data/scaled_train.csv**. During training we split the WHOLE train data into train/test parts, fit the scaler on the train part only, and then scale the test part with the train part's parameters to avoid data leakage.

The scaled train and test files from this notebook are intended only for the final submission to Kaggle.
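
The leakage-free procedure for model training can be sketched roughly as follows. The random toy data, the two column names, and the 80/20 split are assumptions for illustration; in a later notebook the frame would presumably be loaded from the encoded train file instead.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random toy data standing in for the encoded train set.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "LotArea": rng.normal(10000, 3000, size=100),
    "GrLivArea": rng.normal(1500, 400, size=100),
    "SalePrice": rng.normal(180000, 50000, size=100),
})
toy_columns2scale = ["LotArea", "GrLivArea"]

# Split first, so the validation rows never influence the scaler.
X_train, X_valid, y_train, y_valid = train_test_split(
    toy.drop("SalePrice", axis=1), toy["SalePrice"], test_size=0.2, random_state=0
)
X_train, X_valid = X_train.copy(), X_valid.copy()

# Fit on the train part only, then apply the same parameters to the validation part.
split_scaler = StandardScaler()
X_train[toy_columns2scale] = split_scaler.fit_transform(X_train[toy_columns2scale])
X_valid[toy_columns2scale] = split_scaler.transform(X_valid[toy_columns2scale])
```

After this, the scaled train columns have zero mean, while the validation columns generally do not, which is expected because their statistics were never used.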