## Encoding 

In [1]:
import pandas as pd
import numpy as np
import sklearn as sk
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [2]:
train_without_nan = pd.read_csv("data/train_without_nans.csv")
test_without_nan = pd.read_csv("data/test_without_nans.csv")

In [3]:
print(train_without_nan.shape)
train_without_nan.info()

(1459, 81)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1459 non-null   int64  
 1   MSSubClass     1459 non-null   int64  
 2   MSZoning       1459 non-null   object 
 3   LotFrontage    1459 non-null   float64
 4   LotArea        1459 non-null   int64  
 5   Street         1459 non-null   object 
 6   Alley          1459 non-null   object 
 7   LotShape       1459 non-null   object 
 8   LandContour    1459 non-null   object 
 9   Utilities      1459 non-null   object 
 10  LotConfig      1459 non-null   object 
 11  LandSlope      1459 non-null   object 
 12  Neighborhood   1459 non-null   object 
 13  Condition1     1459 non-null   object 
 14  Condition2     1459 non-null   object 
 15  BldgType       1459 non-null   object 
 16  HouseStyle     1459 non-null   object 
 17  OverallQual    1459 non-null   int64  
 1

## Features' dtypes

First, we need to check that each feature's dtype matches what the feature actually means, so that no feature is stored with the wrong dtype.

To do that, we compare the dtypes above against the feature descriptions in the data_description.txt file.
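
One case where the stored dtype and the meaning can disagree: data_description.txt describes MSSubClass as a code for the type of dwelling, yet it is stored as int64. The sketch below is a rough heuristic for surfacing such columns, assuming that an integer column with few distinct values may be a coded category. The toy frame, the helper `low_cardinality_int_columns` and the threshold are illustrative and not part of this notebook.

```python
import numpy as np
import pandas as pd


def low_cardinality_int_columns(df, max_unique=20):
    """Return integer columns that have at most `max_unique` distinct values."""
    int_columns = df.select_dtypes(include=np.integer).columns
    return [col for col in int_columns if df[col].nunique() <= max_unique]


# Toy data standing in for train_without_nan (values are made up).
toy = pd.DataFrame({
    "MSSubClass": [20, 60, 20, 120, 60, 20],
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115],
    "Street": ["Pave", "Pave", "Grvl", "Pave", "Pave", "Pave"],
})

# A tiny threshold is used here only because the toy frame has six rows.
suspects = low_cardinality_int_columns(toy, max_unique=3)
print(suspects)
```

Columns flagged this way are only candidates; the description file decides whether a column is really categorical.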

Now let's separate the object features from the numerical ones.

In [4]:
object_columns = train_without_nan.select_dtypes(include='object')
numerical_columns = train_without_nan.select_dtypes(exclude="object").drop(["SalePrice"], axis=1)

## Category features encoding

Let's check how many categorical features we have.

In [5]:
object_columns.shape

(1459, 43)

Next, let's sum the number of categories across all of these features.

In [6]:
category_counts_train = [train_without_nan[column].value_counts().shape[0] for column in object_columns.columns]
print(sum(category_counts_train))
print(category_counts_train)

265
[5, 2, 3, 4, 4, 2, 5, 3, 25, 9, 8, 5, 8, 6, 8, 15, 16, 4, 4, 5, 6, 5, 5, 4, 7, 7, 6, 5, 2, 5, 4, 7, 6, 7, 4, 6, 6, 3, 4, 5, 5, 9, 6]


In [7]:
category_counts_test = [test_without_nan[column].value_counts().shape[0] for column in object_columns.columns]
print(sum(category_counts_test))
print(category_counts_test)

247
[5, 2, 3, 4, 4, 1, 5, 3, 25, 9, 5, 5, 7, 6, 4, 13, 15, 4, 4, 5, 6, 5, 5, 4, 7, 7, 4, 5, 2, 4, 4, 7, 6, 7, 4, 5, 6, 3, 3, 5, 4, 9, 6]
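
The totals differ between the two sets (265 vs. 247), so some categories that occur in train never occur in test. A sketch of how to list the differences per feature, using toy frames in place of the real data; `category_differences` is a hypothetical helper and the values are made up.

```python
import pandas as pd


def category_differences(train, test, columns):
    """Map each column to (categories only in train, categories only in test)."""
    differences = {}
    for column in columns:
        train_values = set(train[column].unique())
        test_values = set(test[column].unique())
        only_train = train_values - test_values
        only_test = test_values - train_values
        # Keep only the columns where the two sets disagree.
        if only_train or only_test:
            differences[column] = (sorted(only_train), sorted(only_test))
    return differences


toy_train = pd.DataFrame({"Utilities": ["AllPub", "NoSeWa", "AllPub"],
                          "Street": ["Pave", "Grvl", "Pave"]})
toy_test = pd.DataFrame({"Utilities": ["AllPub", "AllPub", "AllPub"],
                         "Street": ["Grvl", "Pave", "Pave"]})

diffs = category_differences(toy_train, toy_test, ["Utilities", "Street"])
print(diffs)
```

A non-empty second list would matter for any encoder fitted on train only, because `LabelEncoder.transform` and a default `OneHotEncoder` both raise on categories they have not seen.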


If we used one-hot encoding for every categorical feature, it would add 265 columns to the data while removing the 43 original ones, a net increase of 222 columns.

We still have no better option than one-hot encoding for the features with many categories. To keep the dataset at least a little smaller, we will use label encoding for the features that have exactly 2 categories.
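
The resulting width can be worked out before running any encoder. The sketch below redoes the arithmetic from the numbers printed above, assuming each 2-category feature stays a single label-encoded column and every other feature becomes one column per category.

```python
# Category counts per object feature, copied from the train output above.
category_counts_train = [5, 2, 3, 4, 4, 2, 5, 3, 25, 9, 8, 5, 8, 6, 8, 15, 16,
                         4, 4, 5, 6, 5, 5, 4, 7, 7, 6, 5, 2, 5, 4, 7, 6, 7, 4,
                         6, 6, 3, 4, 5, 5, 9, 6]

total_columns = 81                              # from train_without_nan.shape
object_features = len(category_counts_train)    # 43 object columns

# Features with exactly 2 categories keep one column each.
binary_features = sum(1 for count in category_counts_train if count == 2)
# All other features expand to one column per category.
one_hot_columns = sum(count for count in category_counts_train if count != 2)

non_object_columns = total_columns - object_features
expected_columns = non_object_columns + binary_features + one_hot_columns

print(binary_features, one_hot_columns, expected_columns)  # 3 259 300
```

Encoding everything with one-hot instead would give 81 - 43 + 265 = 303 columns, so label encoding the binary features saves three.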

In [8]:
encoded_train = train_without_nan.copy()
encoded_test = test_without_nan.copy()

for column in object_columns.columns:
    # Number of distinct categories this feature has in the train set
    category_cnt = encoded_train[column].nunique()

    if category_cnt == 2:
        # Binary feature: map its two categories to 0/1 in a single column
        encoder = LabelEncoder()
        encoded_train[column] = encoder.fit_transform(encoded_train[column])
        encoded_test[column] = encoder.transform(encoded_test[column])
    else:
        # Fit on train only, so train and test get identical one-hot columns
        encoder = OneHotEncoder(sparse_output=False)

        one_hot_encoded_train = encoder.fit_transform(encoded_train[[column]])
        one_hot_encoded_test = encoder.transform(encoded_test[[column]])

        # Reuse each frame's index so that concat aligns rows correctly
        feature_names = encoder.get_feature_names_out([column])
        one_hot_df_train = pd.DataFrame(one_hot_encoded_train,
                                        columns=feature_names,
                                        index=encoded_train.index)
        one_hot_df_test = pd.DataFrame(one_hot_encoded_test,
                                       columns=feature_names,
                                       index=encoded_test.index)

        # Append the one-hot columns and drop the original feature
        encoded_train = pd.concat([encoded_train, one_hot_df_train], axis=1)
        encoded_train = encoded_train.drop([column], axis=1)

        encoded_test = pd.concat([encoded_test, one_hot_df_test], axis=1)
        encoded_test = encoded_test.drop([column], axis=1)

In [9]:
encoded_train.shape

(1459, 300)

In [10]:
encoded_test.shape

(1459, 299)
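
The two shapes differ by one column. A likely explanation, given that `SalePrice` is dropped from the numerical train features earlier, is that only the train set carries the target. A sketch of how to confirm which columns differ, on toy frames standing in for `encoded_train` and `encoded_test` (all values are made up):

```python
import pandas as pd

toy_encoded_train = pd.DataFrame({
    "Id": [1, 2],
    "Street": [1, 0],
    "MSZoning_RL": [1.0, 0.0],
    "SalePrice": [200000, 150000],
})
toy_encoded_test = pd.DataFrame({
    "Id": [3, 4],
    "Street": [1, 1],
    "MSZoning_RL": [0.0, 1.0],
})

# Column names present in one frame but not in the other.
only_in_train = toy_encoded_train.columns.difference(toy_encoded_test.columns)
only_in_test = toy_encoded_test.columns.difference(toy_encoded_train.columns)

print(list(only_in_train), list(only_in_test))
```

On the real frames, the expectation would be that `only_in_train` holds just the target and `only_in_test` is empty; anything else would point to an encoding mismatch.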

## Casting continuous features

To work with all features uniformly, we convert them to float. Since every categorical feature has now been encoded, all columns are numeric and can be cast directly.

In [11]:
encoded_train = encoded_train.astype("float64")
encoded_test = encoded_test.astype("float64")
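
`astype("float64")` raises if a column still holds non-numeric strings, so it can help to confirm first that no object columns remain. A small sketch on a toy frame (the frame and its values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "Id": [1, 2],
    "Street": [1, 0],
    "LotFrontage": [65.0, 80.0],
})

# Any column listed here was missed by the encoding step.
leftover = toy.select_dtypes(include="object").columns.tolist()
if leftover:
    raise ValueError(f"unencoded columns: {leftover}")

toy_float = toy.astype("float64")
print(toy_float.dtypes)
```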

In [12]:
encoded_train.to_csv("data/encoded_train.csv", index=False)
encoded_test.to_csv("data/encoded_test.csv", index=False)