## Encoding 

In [1]:
import pandas as pd
import numpy as np
import sklearn as sk
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.preprocessing import StandardScaler

In [2]:
train_without_nan = pd.read_csv("data/train_without_nans.csv")
test_without_nan = pd.read_csv("data/test_without_nans.csv")

In [3]:
print(train_without_nan.shape)
print(test_without_nan.shape)

(1459, 81)
(1459, 80)


## Features' dtypes

First of all, we need to check whether each feature's dtype really corresponds to the meaning of that feature, and that no feature has been given the wrong dtype.

To make sure everything is OK, we compare the dtypes against the feature descriptions in the data_description.txt file.
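
One way to find numeric columns worth checking against data_description.txt is to list the numeric columns that have only a few distinct values. The sketch below runs on a made-up toy frame; the helper name `low_cardinality_numeric` and the threshold are illustrative assumptions, not part of this notebook.

```python
import pandas as pd


def low_cardinality_numeric(df: pd.DataFrame, max_unique: int = 20) -> pd.Series:
    """Return distinct-value counts for numeric columns with at most max_unique values."""
    numeric = df.select_dtypes(include="number")
    counts = numeric.nunique()
    return counts[counts <= max_unique].sort_values()


# Toy data standing in for the real training set (values are made up).
toy = pd.DataFrame({
    "MSSubClass": [20, 60, 20, 120, 60, 20],
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115],
    "Street": ["Pave", "Pave", "Grvl", "Pave", "Pave", "Pave"],
})

# Only MSSubClass has 3 or fewer distinct numeric values in the toy frame.
candidates = low_cardinality_numeric(toy, max_unique=3)
print(candidates)
```

A column that shows up here is only a candidate: whether it is really categorical still has to be decided from the feature description.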

Now let's separate the object (categorical) features from the numerical ones.

In [4]:
# .copy() gives an independent frame, so adding a column to it later is safe
object_columns = train_without_nan.select_dtypes(include="object").copy()
continuous_columns = train_without_nan.select_dtypes(exclude="object").drop(["SalePrice", "Id"], axis=1)

In [5]:
print(object_columns.columns)
print(object_columns.shape)

Index(['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
       'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual',
       'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature',
       'SaleType', 'SaleCondition'],
      dtype='object')
(1459, 43)


All of the categorical features are genuinely categorical, so there is no problem with them. The only interesting case is features that carry an ordering, such as PoolQC or GarageQual. These features could be encoded as ordered integers (for example with LabelEncoder) instead of one-hot encoding. As a first version, though, we will one-hot encode almost all categorical features, except those that have only two categories.
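
If the ordered features are encoded as integers later, it is worth noting that LabelEncoder numbers the labels in sorted (alphabetical) order, which does not match the quality ranking. A minimal sketch of the difference, using scikit-learn's OrdinalEncoder with an explicit category list; the ranking `Po < Fa < TA < Gd < Ex` is an assumption that should be verified against data_description.txt, and the sample values are made up.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Toy column with made-up values.
quality = pd.DataFrame({"GarageQual": ["TA", "Gd", "Po", "Ex", "Fa", "TA"]})

# LabelEncoder assigns integers in sorted order of the labels: Ex, Fa, Gd, Po, TA.
label_codes = LabelEncoder().fit_transform(quality["GarageQual"])

# OrdinalEncoder with an explicit category list keeps the intended ranking.
# The order below is an assumption to check against data_description.txt.
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]
ordinal = OrdinalEncoder(categories=[quality_order])
ordinal_codes = ordinal.fit_transform(quality[["GarageQual"]]).ravel()

print(label_codes)    # alphabetical codes
print(ordinal_codes)  # codes that follow quality_order
```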

In [6]:
print(continuous_columns.columns)

Index(['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond',
       'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2',
       'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces',
       'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal',
       'MoSold', 'YrSold'],
      dtype='object')


Among the continuous features, several are actually categorical but are stored as numbers in the data, so pandas treats them as continuous.

* **MSSubClass** &ndash; a purely categorical feature
* **OverallQual** &ndash; also a categorical feature, BUT it can be interpreted as already label-encoded, so we will not encode it
* **OverallCond** &ndash; the same situation as with OverallQual

So we need to add the MSSubClass feature to our object columns and remove it from the continuous ones.

In [7]:
object_columns["MSSubClass"] = train_without_nan["MSSubClass"]
continuous_columns = continuous_columns.drop(["MSSubClass"], axis=1)

## Category features encoding

Let's check how many categorical features we have.

In [8]:
object_columns.columns

Index(['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
       'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual',
       'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature',
       'SaleType', 'SaleCondition', 'MSSubClass'],
      dtype='object')

The train and test sets may not have the same set of categories in a given column (for example, the test set can contain categories that the train set doesn't have). For this reason we will stack the train and test sets vertically, encode them together, and then split them back.
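
To see which columns are affected, one can compare the sets of categories column by column. A minimal sketch on two made-up toy frames (`toy_train`, `toy_test`, and their values are illustrative, not the real data):

```python
import pandas as pd

# Toy frames standing in for the real train and test sets (values are made up).
toy_train = pd.DataFrame({"MSSubClass": [20, 60, 120], "Street": ["Pave", "Grvl", "Pave"]})
toy_test = pd.DataFrame({"MSSubClass": [20, 150, 60], "Street": ["Pave", "Pave", "Pave"]})

# For every column, collect the categories that occur in test but never in train.
unseen = {}
for column in toy_train.columns:
    extra = set(toy_test[column].tolist()) - set(toy_train[column].tolist())
    if extra:
        unseen[column] = sorted(extra)

print(unseen)
```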

In [9]:
# ignore_index=True already produces a fresh 0..n-1 index for the stacked frame
whole_without_nan = pd.concat([train_without_nan.drop("SalePrice", axis=1), test_without_nan], axis=0, ignore_index=True)

In [10]:
category_counts = [whole_without_nan[column].value_counts().shape[0] for column in object_columns.columns]
print(sum(category_counts))
print(category_counts)

281
[5, 2, 3, 4, 4, 2, 5, 3, 25, 9, 8, 5, 8, 6, 8, 15, 16, 4, 4, 5, 6, 5, 5, 4, 7, 7, 6, 5, 2, 5, 4, 7, 6, 7, 4, 6, 6, 3, 4, 5, 5, 9, 6, 16]


We see that if we used one-hot encoding for every feature, we would add 281 columns to our data (a net increase of 237, because the 44 original columns are dropped).

Still, we have no better choice than to one-hot encode the features that have many categories. To make the dataset at least a little smaller, we will use label encoding for features that have only 2 categories.
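
The resulting number of columns can be worked out in advance from the category counts. The sketch below copies the counts printed above and assumes the stacked frame has 80 columns (the 81 train columns minus SalePrice); the variable names are illustrative.

```python
# Category counts copied from the notebook output above.
category_counts = [5, 2, 3, 4, 4, 2, 5, 3, 25, 9, 8, 5, 8, 6, 8, 15, 16, 4, 4, 5, 6, 5,
                   5, 4, 7, 7, 6, 5, 2, 5, 4, 7, 6, 7, 4, 6, 6, 3, 4, 5, 5, 9, 6, 16]
n_original_columns = 80  # assumption: stacked frame without SalePrice

# Features with exactly two categories are label encoded in place (no new columns).
one_hot_counts = [count for count in category_counts if count != 2]
n_binary = len(category_counts) - len(one_hot_counts)

# Each one-hot encoded feature is dropped and replaced by one column per category.
new_columns = sum(one_hot_counts)
expected_columns = n_original_columns - len(one_hot_counts) + new_columns

print(n_binary, new_columns, expected_columns)
```

Under these assumptions 3 features are label encoded, 41 features are replaced by 275 indicator columns, and the encoded frame should have 80 - 41 + 275 = 314 columns.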

There is one additional problem that can appear. Our last categorical feature, MSSubClass, has more classes in the test set than in the train set. If an encoder were fitted on the train set only, transforming this feature in the test set would raise an error on the unseen class. Fitting on the stacked data already avoids this; as an extra safeguard we also set the *handle_unknown* parameter of OneHotEncoder to *ignore*, so that an unknown category is encoded as all zeros instead of raising an error.

In [11]:
encoded = whole_without_nan.copy()

for column in object_columns.columns:
    category_cnt = encoded[column].nunique()
    if category_cnt == 2:
        # Binary feature: map its two categories to 0 and 1 in place
        encoder = LabelEncoder()
        encoded[column] = encoder.fit_transform(encoded[column])
    else:
        # Multi-category feature: replace the column with one indicator column per category
        encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
        one_hot_encoded = encoder.fit_transform(encoded[[column]])
        one_hot_df = pd.DataFrame(
            one_hot_encoded,
            columns=encoder.get_feature_names_out([column]),
            index=encoded.index,
        )
        encoded = pd.concat([encoded.drop([column], axis=1), one_hot_df], axis=1)

In [12]:
encoded.shape

(2918, 314)

In [13]:
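# The first len(train_without_nan) rows of the stacked frame are the train rows
n_train = len(train_without_nan)

encoded_train = encoded.iloc[:n_train].copy()
encoded_train["SalePrice"] = train_without_nan["SalePrice"]

encoded_test = encoded.iloc[n_train:].copy()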

In [14]:
print(encoded_train.shape)
print(encoded_test.shape)

(1459, 315)
(1459, 314)


## Casting continuous features

To work with all features we need to convert them to float. Since we have encoded all categorical features, all of our features are now numerical.

In [15]:
encoded_train = encoded_train.astype("float64")
encoded_test = encoded_test.astype("float64")

encoded_train = encoded_train.drop(["Id"], axis=1)
encoded_test = encoded_test.drop(["Id"], axis=1)

In [16]:
encoded_train.to_csv("data/encoded_train.csv", index=False)
encoded_test.to_csv("data/encoded_test.csv", index=False)

## Scaling

In addition, this notebook creates files with already scaled continuous features, so that we don't have to do it in future notebooks.

We scale all of the continuous columns and apply a logarithmic transformation to the target feature.

In [17]:
scaled_train = encoded_train.copy() 
scaled_test = encoded_test.copy()

columns2scale = continuous_columns.columns

for column in columns2scale:
    scaler = StandardScaler()

    scaled_train[column] = scaler.fit_transform(scaled_train[[column]])
    scaled_test[column] = scaler.transform(scaled_test[[column]])

scaled_train["SalePrice"] = np.log(scaled_train["SalePrice"])

scaled_train.to_csv("data/scaled_train.csv", index=False)
scaled_test.to_csv("data/scaled_test.csv", index=False)

When we train models in the future, we will need to use the **encoded_train.csv** file and not **scaled_train.csv**. During training we split our WHOLE train data into train/test parts, so to avoid data leakage we must fit the scaler on the train part first and then scale the test part using the train part's scaling parameters.
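
A minimal sketch of that leakage-free order of operations, on synthetic data rather than the real encoded_train.csv. The column names, the 80/20 split, and the random seed are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded training data (values are random).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 500, size=100),
    "LotArea": rng.normal(10000, 3000, size=100),
    "SalePrice": rng.uniform(50_000, 500_000, size=100),
})

X = data.drop("SalePrice", axis=1)
y = np.log(data["SalePrice"])  # same log transform of the target as above

# 1. Split first, so the held-out part never influences the scaler.
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

# 2. Fit the scaler on the train part only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. Reuse the train part's mean and standard deviation for the held-out part.
X_valid_scaled = scaler.transform(X_valid)
```

The held-out part is scaled with statistics it did not contribute to, which is the same situation the model will face on truly unseen data.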