# Predicting Apartment Renting Prices in Santiago MR - Feature Engineering

In [1]:
import pandas as pd
import numpy as np
import re
import math
from matplotlib import pyplot as plt
%matplotlib inline
import matplotlib 
import seaborn as sns
import joblib

from sklearn import compose, preprocessing, pipeline
from sklearn.base import BaseEstimator,TransformerMixin
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, KFold, StratifiedKFold
import sklearn.metrics as metrics
import sklearn.utils

matplotlib.rcParams["figure.figsize"] = (8,5)

## Data Load: load cleaned data into a dataframe

In [2]:
df = pd.read_csv("RENT_APARTMENT_MR_eda.csv")
print(df.shape)
df

(4762, 5)


,Surface,Bedrooms,Bathrooms,Location,Price
0,21.0,1.0,1.0,Las Condes,190000.0
1,40.0,1.0,1.0,Las Condes,507938.0
2,220.0,4.0,4.0,Las Condes,1500000.0
3,140.0,4.0,4.0,Las Condes,1500000.0
4,140.0,4.0,3.0,Las Condes,1500000.0
...,...,...,...,...,...
4757,140.0,3.0,3.0,Vitacura,1550000.0
4758,144.0,4.0,4.0,Vitacura,1550000.0
4759,140.0,4.0,4.0,Vitacura,1550000.0
4760,185.0,3.0,3.0,Vitacura,2000000.0


## Training, Test data split

Before applying any transformation to the features that we will use as predictors, we split the data into training and test sets.
Each feature transformation is then fitted on the training set only and afterwards applied to the test set. This way we avoid data leakage from the test set into the training set.
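
To make the leakage point concrete, here is a minimal pure-Python sketch (not the notebook's actual pipeline) in which the scaling statistics are learned from the training values only and then reused on the test values. The helper names `fit_min_max` and `apply_min_max` and the toy numbers are hypothetical, chosen only for illustration.

```python
def fit_min_max(train_values):
    """Learn the scaling statistics from the training values only."""
    return min(train_values), max(train_values)


def apply_min_max(values, stats):
    """Scale values with previously learned statistics."""
    lo, hi = stats
    return [(v - lo) / (hi - lo) for v in values]


# toy surfaces in square metres (illustrative, not taken from the dataset)
train_surface = [30.0, 50.0, 80.0, 130.0]
test_surface = [40.0, 150.0]

stats = fit_min_max(train_surface)             # the test set is never seen here
train_scaled = apply_min_max(train_surface, stats)
test_scaled = apply_min_max(test_surface, stats)

print(train_scaled)
print(test_scaled)
```

Because the statistics come from the training set alone, scaled test values may fall slightly outside the [0, 1] range. That is expected and is not a sign of an error.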

In [3]:
train_ratio = 0.85
test_ratio = 0.15

# separate the predictors from the target
X = df.drop(['Price'], axis=1)
y = df['Price']

# split the data into train (used later for training + validation) and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=test_ratio,
    random_state=22
)

# actual fraction of samples that ended up in each set
frac_train = len(X_train) / len(X)
frac_test = len(X_test) / len(X)

print(f"Number of samples for training ({frac_train}): {len(X_train)}")
print(f"Number of samples for test ({frac_test}): {len(X_test)}")
 

Number of samples for training (0.8498530029399411): 4047
Number of samples for test (0.1501469970600588): 715


In [4]:
X_train
# X_test

,Surface,Bedrooms,Bathrooms,Location
4088,80.0,3.0,2.0,Santiago
2519,70.0,3.0,2.0,Maipu
3561,30.0,1.0,1.0,Santiago
862,500.0,3.0,4.0,Las Condes
932,304.0,4.0,4.0,Las Condes
...,...,...,...,...
2527,90.0,3.0,2.0,Macul
2952,50.0,1.0,1.0,Providencia
4587,113.0,2.0,2.0,Vitacura
356,68.0,2.0,2.0,Las Condes


In [5]:
X_train.dtypes
# X_test.dtypes

Surface      float64
Bedrooms     float64
Bathrooms    float64
Location      object
dtype: object

## Feature transformations

Next, we will apply transformations to some of the features in the dataset. For this, we will treat each feature as the type of variable described below.

- Surface: continuous variable. It will be normalized to the [0, 1] range with min-max scaling.
- Bedrooms, Bathrooms: discrete variables. They can be treated as ordinal (only certain values are allowed, with certain ordering) and hence will be encoded as such.
- Location: categorical variable. It will be one-hot-encoded.

## Dimensionality Reduction on Rooms count

Number of rooms (feature columns *Bedrooms* and *Bathrooms*) can be treated as ordinal variables, since they take only certain discrete values with a specific ordering (and equal intervals between consecutive levels). We will implement a custom transformation that reduces the dimensionality of these features: values below a threshold of 4 are mapped to their integer value, while values equal to or above the threshold are mapped to the threshold itself.

To preserve the ordering of these ordinal features, we will not apply one-hot-encoding on them.
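
As a rough illustration of the capping rule described above, the sketch below applies it to a few toy room counts in plain Python. The function name `cap_room_count` and the sample values are hypothetical.

```python
def cap_room_count(value, threshold=4):
    """Return the integer room count, capped at the threshold."""
    return min(int(value), int(threshold))


# toy bedroom counts (illustrative, not taken from the dataset)
bedrooms = [1.0, 3.0, 4.0, 6.0]
encoded = [cap_room_count(b) for b in bedrooms]
print(encoded)  # [1, 3, 4, 4]
```

With pandas, the same capping can also be expressed as `Series.clip(upper=threshold)` followed by `astype(int)`.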

In [6]:
class CustomOrdinalEncoder(TransformerMixin, BaseEstimator):
    """
    Custom transformer that applies a capped ordinal encoding.
    Feature values higher than the threshold are replaced by the threshold,
    while values lower than or equal to the threshold are kept (cast to int).

    Inherits fit_transform() from TransformerMixin and
    set_params() / get_params() from BaseEstimator.

    Parameters
    ----------
    threshold: int or float
        Upper cap applied to the feature values.

    Returns
    -------
        pandas.DataFrame
        Dataframe containing transformed data (returned by transform()).
    """
    def __init__(self, threshold=None):
        self.threshold = threshold

    def fit(self, X, y=None):
        # stateless transformer: nothing is learned from the data
        return self

    def transform(self, X, y=None):
        """Apply ordinal encoding transformation to input dataset."""
        if isinstance(X, pd.DataFrame):
            # convert single-column dataframe to series
            X = X.squeeze('columns').copy()
        else:
            X = pd.Series(X).copy()
        # keep the integer value below the threshold, cap everything else
        X = X.apply(
            lambda x: int(x) if x < self.threshold else int(self.threshold)
        )
        return X.to_frame()
    

Let's see the effect of this transformation when we apply it to our train set.

In [7]:
X_train_transformed = X_train.copy()

# fit() learns nothing for this encoder, so fit_transform() can be assigned directly
ordinal_encoder = CustomOrdinalEncoder(threshold=4.)
X_train_transformed['Bedrooms_encoded'] = ordinal_encoder.fit_transform(X_train_transformed['Bedrooms'])

ordinal_encoder = CustomOrdinalEncoder(threshold=4.)
X_train_transformed['Bathrooms_encoded'] = ordinal_encoder.fit_transform(X_train_transformed['Bathrooms'])

X_train_transformed

,Surface,Bedrooms,Bathrooms,Location,Bedrooms_encoded,Bathrooms_encoded
4088,80.0,3.0,2.0,Santiago,3,2
2519,70.0,3.0,2.0,Maipu,3,2
3561,30.0,1.0,1.0,Santiago,1,1
862,500.0,3.0,4.0,Las Condes,3,4
932,304.0,4.0,4.0,Las Condes,4,4
...,...,...,...,...,...,...
2527,90.0,3.0,2.0,Macul,3,2
2952,50.0,1.0,1.0,Providencia,1,1
4587,113.0,2.0,2.0,Vitacura,2,2
356,68.0,2.0,2.0,Las Condes,2,2


In [8]:
X_train_transformed.Bedrooms_encoded.value_counts(ascending=False)

2    1353
1    1177
3    1025
4     492
Name: Bedrooms_encoded, dtype: int64

In [9]:
X_train_transformed.Bathrooms_encoded.value_counts(ascending=False)

1    1573
2    1405
3     651
4     418
Name: Bathrooms_encoded, dtype: int64

## Dimensionality Reduction on Locations

We will treat feature column *Location* as a categorical variable, since it takes only discrete nominal values without a specific ordering.

In order to reduce its dimensionality, we will define a custom transformation that relabels any *Location* with fewer than 50 data points as *Other* (encoded with the value *AAA*). This significantly reduces the number of categories of this variable. Consequently, we will end up with fewer dummy columns later, once we apply one-hot encoding.
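
The grouping idea can be sketched in a few lines of plain Python before looking at the transformer itself. This is only an illustration under assumed names (`fit_rare_categories`, `group_rare`) and toy data with a threshold of 2 rather than 50.

```python
from collections import Counter


def fit_rare_categories(train_values, threshold):
    """Return the set of categories seen fewer than `threshold` times."""
    counts = Counter(train_values)
    return {cat for cat, n in counts.items() if n < threshold}


def group_rare(values, rare, other="AAA"):
    """Replace rare categories with the placeholder label."""
    return [other if v in rare else v for v in values]


# toy training data (illustrative): 'Lampa' appears only once
train_locations = ["Santiago", "Santiago", "Las Condes", "Las Condes", "Lampa"]
rare = fit_rare_categories(train_locations, threshold=2)

grouped = group_rare(["Santiago", "Lampa", "Colina"], rare)
print(grouped)  # ['Santiago', 'AAA', 'Colina']
```

Note that a category never seen during fitting (here 'Colina') is left unchanged by this rule, since only categories counted below the threshold are relabelled.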

In [10]:
class DimensionReducer(BaseEstimator,TransformerMixin):
    """
    Custom transformer that reduces the number of categories of a feature.
    Categories whose count in the fitted data is lower than 'threshold' are
    replaced by 'value_lower', while categories with a count higher than or
    equal to 'threshold' are kept unchanged.

    Inherits fit_transform() from TransformerMixin and
    set_params() / get_params() from BaseEstimator.

    Parameters
    ----------
    threshold: int
        Cutoff value of feature count.
    value_lower: int or str
        Value used to replace transformed feature values.

    Returns
    -------
        pandas.DataFrame
        Dataframe containing transformed data (returned by transform()).
    """
    def __init__(self, threshold=None, value_lower=None):
        self.threshold = threshold
        self.value_lower = value_lower

    def feature_selector(self,X,y=None):
        """Count the categories and split them by the threshold."""
        if isinstance(X, pd.DataFrame):
            # convert single-column dataframe to series
            X = X.squeeze('columns').copy()
        else:
            X = pd.Series(X).copy()
        # number of data points per category, most frequent first
        series_feature = X.value_counts(ascending=False)
        self.series_feature_above = series_feature[series_feature >= self.threshold]
        self.series_feature_below = series_feature[series_feature < self.threshold]
        print(f"Total categories in feature: {len(series_feature)}")
        print(f"Total categories in feature (above threshold = {self.threshold}): {len(self.series_feature_above)}")
        print(f"Total categories in feature (below threshold = {self.threshold}): {len(self.series_feature_below)}")

    def fit(self,X,y=None):
        """Fit dimensionality reduction transformation to the input dataset."""
        self.feature_selector(X)
        return self

    def transform(self,X,y=None):
        """Apply dimensionality reduction transformation to input dataset."""
        if isinstance(X, pd.DataFrame):
            # convert single-column dataframe to series
            X = X.squeeze('columns').copy()
        else:
            X = pd.Series(X).copy()
        # 'in' on a Series checks its index, i.e. the category labels
        X = X.apply(
            lambda x: self.value_lower if x in self.series_feature_below else x
        )
        return X.to_frame()

Let's see the effect of this transformation when we apply it to our train set.

In [11]:
X_train_transformed = X_train.copy()

dim_red = DimensionReducer(threshold=50, value_lower='AAA')
dim_red.fit(X_train_transformed['Location'])

Total categories in feature: 32
Total categories in feature (above threshold = 50): 12
Total categories in feature (below threshold = 50): 20


DimensionReducer(threshold=50, value_lower='AAA')

In [12]:
dim_red.series_feature_below

Recoleta         45
Quinta Normal    38
La Reina         28
Maipu            21
San Joaquin      19
Huechuraba       17
Penalolen        12
Puente Alto       9
Quilicura         8
La Granja         6
Pudahuel          4
Cerrillos         3
Conchali          3
San Bernardo      2
Lo Prado          2
La Pintana        2
El Bosque         2
Colina            1
Lampa             1
Cerro Navia       1
Name: Location, dtype: int64

In [13]:
dim_red.series_feature_above

Las Condes          995
Santiago            852
Vitacura            395
Providencia         304
Lo Barnechea        243
Nunoa               241
San Miguel          225
Estacion Central    196
La Florida          128
La Cisterna          90
Macul                80
Independencia        74
Name: Location, dtype: int64

In [14]:
X_train_transformed['Location_encoded'] = dim_red.transform(X_train_transformed['Location'])
X_train_transformed.head(20)

,Surface,Bedrooms,Bathrooms,Location,Location_encoded
4088,80.0,3.0,2.0,Santiago,Santiago
2519,70.0,3.0,2.0,Maipu,AAA
3561,30.0,1.0,1.0,Santiago,Santiago
862,500.0,3.0,4.0,Las Condes,Las Condes
932,304.0,4.0,4.0,Las Condes,Las Condes
2003,65.0,2.0,2.0,La Florida,La Florida
232,71.0,2.0,2.0,Las Condes,Las Condes
2410,40.0,2.0,1.0,Estacion Central,Estacion Central
2638,36.0,1.0,1.0,Estacion Central,Estacion Central
1985,77.0,2.0,2.0,Macul,Macul


In [15]:
list_location_encoded = list(set(X_train_transformed['Location_encoded'].tolist()))

joblib.dump(list_location_encoded, "list_location_encoded.pkl")

['list_location_encoded.pkl']

In [16]:
sorted(list_location_encoded)

['AAA',
 'Estacion Central',
 'Independencia',
 'La Cisterna',
 'La Florida',
 'Las Condes',
 'Lo Barnechea',
 'Macul',
 'Nunoa',
 'Providencia',
 'San Miguel',
 'Santiago',
 'Vitacura']

In [17]:
X_train_transformed['Location_encoded'].value_counts(ascending=False)

Las Condes          995
Santiago            852
Vitacura            395
Providencia         304
Lo Barnechea        243
Nunoa               241
San Miguel          225
AAA                 224
Estacion Central    196
La Florida          128
La Cisterna          90
Macul                80
Independencia        74
Name: Location_encoded, dtype: int64
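
As a quick consistency check, the counts printed above can be added up by hand: the categories below the threshold should sum to the *AAA* count, and all categories together should sum to the size of the training set. The two lists below are copied from the `series_feature_below` and `series_feature_above` outputs.

```python
# counts copied from the `series_feature_below` output
counts_below = [45, 38, 28, 21, 19, 17, 12, 9, 8, 6, 4, 3, 3, 2, 2, 2, 2, 1, 1, 1]
# counts copied from the `series_feature_above` output
counts_above = [995, 852, 395, 304, 243, 241, 225, 196, 128, 90, 80, 74]

total_below = sum(counts_below)
total_all = sum(counts_above) + total_below

print(total_below)  # 224, the 'AAA' count
print(total_all)    # 4047, the number of training samples
```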

## Define pipelines for feature transformations

We will combine the custom transformations implemented above with the standard transformations we need into pipelines that can be applied directly to both the training and test sets.

In [18]:
# pipeline for numerical feature 'Surface'
pipe_num = pipeline.Pipeline(steps=[
    ("scaler", preprocessing.MinMaxScaler()),
])

# pipeline for ordinal feature 'Bedrooms'
pipe_cat_1 = pipeline.Pipeline(steps=[
    ("custom_ordinal", CustomOrdinalEncoder(threshold=4)),
])

# pipeline for ordinal feature 'Bathrooms'
pipe_cat_2 = pipeline.Pipeline(steps=[
    ("custom_ordinal", CustomOrdinalEncoder(threshold=4)),
])

# pipeline for categorical feature 'Location'
# - OneHotEncoder sorts the categories, so drop="first" removes 'AAA' (the category created by
#   DimensionReducer), which becomes the reference level encoded as all zeros.
#   This way we avoid the dummy variable trap, which occurs when the dummy variables created by
#   one-hot encoding are perfectly collinear (each one can be derived from the others).
# - With handle_unknown="ignore", unknown categories in the test set will be encoded as all zeros.
#   In practice, this means that unknown categories will be treated as if they were the 'AAA' category,
#   which will most likely be the case for new data from a location not originally included in this analysis.
pipe_cat_3 = pipeline.Pipeline(steps=[
    ("dim_reducer", DimensionReducer(threshold=50, value_lower='AAA')),
    ("onehot", preprocessing.OneHotEncoder(handle_unknown="ignore", drop="first")),
])

pipe_all = compose.ColumnTransformer(
    transformers=[
        ("numerical", pipe_num, ['Surface']),
        ("categorical_1", pipe_cat_1, ['Bedrooms']),        
        ("categorical_2", pipe_cat_2, ['Bathrooms']),
        ("categorical_3", pipe_cat_3, ['Location']),                
#         ("passthrough", "passthrough", ["Price_total"])
    ]
)
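
The behaviour described in the comments for the *Location* pipeline can be mimicked in plain Python. The sketch below is not how `OneHotEncoder` is implemented; it only reproduces the two effects discussed above (dropping the first sorted category and encoding unknown categories as all zeros) on a toy category list, with hypothetical helper names `fit_one_hot` and `one_hot`.

```python
def fit_one_hot(train_values):
    """Sort the categories and drop the first one as the reference level."""
    categories = sorted(set(train_values))
    return categories[1:]


def one_hot(value, kept_categories):
    """Encode one value; the dropped or an unknown category gives all zeros."""
    return [1.0 if value == cat else 0.0 for cat in kept_categories]


# toy category list (illustrative subset of the real categories)
kept = fit_one_hot(["AAA", "Las Condes", "Santiago", "Vitacura"])

print(kept)                       # ['Las Condes', 'Santiago', 'Vitacura']
print(one_hot("Santiago", kept))  # [0.0, 1.0, 0.0]
print(one_hot("AAA", kept))       # [0.0, 0.0, 0.0]
print(one_hot("Colina", kept))    # [0.0, 0.0, 0.0]
```

The last two lines show why an unknown location ends up indistinguishable from the *AAA* category after encoding.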

In [19]:
pipe_all

ColumnTransformer(transformers=[('numerical',
                                 Pipeline(steps=[('scaler', MinMaxScaler())]),
                                 ['Surface']),
                                ('categorical_1',
                                 Pipeline(steps=[('custom_ordinal',
                                                  CustomOrdinalEncoder(threshold=4))]),
                                 ['Bedrooms']),
                                ('categorical_2',
                                 Pipeline(steps=[('custom_ordinal',
                                                  CustomOrdinalEncoder(threshold=4))]),
                                 ['Bathrooms']),
                                ('categorical_3',
                                 Pipeline(steps=[('dim_reducer',
                                                  DimensionReducer(threshold=50,
                                                                   value_lower='AAA')),
                                   

### First, we apply the defined preprocessing pipeline to the training set

In [20]:
X_train_transformed = pipe_all.fit_transform(X_train).toarray()
print(X_train_transformed.shape)
print(X_train_transformed[:2])

# save fitted pipeline
joblib.dump(pipe_all, "pipe_all.pkl")

Total categories in feature: 32
Total categories in feature (above threshold = 50): 12
Total categories in feature (below threshold = 50): 20
(4047, 15)
[[0.13188648 3.         2.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         1.         0.        ]
 [0.11519199 3.         2.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.        ]]


['pipe_all.pkl']
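
The first column of the printed array can be checked by hand. The scaled values are consistent with a training minimum of 1 and a maximum of 600 for *Surface*. These two bounds are not stated in the notebook; they are inferred here from the outputs (row 4088 has a surface of 80 and row 2519 a surface of 70), so treat them as an assumption.

```python
# assumed training-set bounds for Surface, inferred from the printed output
surface_min, surface_max = 1.0, 600.0


def min_max(x):
    """Min-max scaling: (x - min) / (max - min)."""
    return (x - surface_min) / (surface_max - surface_min)


print(round(min_max(80.0), 8))  # 0.13188648
print(round(min_max(70.0), 8))  # 0.11519199
```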

In [21]:
df_train = pd.DataFrame(X_train_transformed, index=X_train.index)
df_train['Price'] = y_train
df_train

,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,Price
4088,0.131886,3.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,620000.0
2519,0.115192,3.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,730000.0
3561,0.048414,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,300000.0
862,0.833055,3.0,4.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3900000.0
932,0.505843,4.0,4.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2950000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2527,0.148581,3.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,720000.0
2952,0.081803,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,571431.0
4587,0.186978,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1000000.0
356,0.111853,2.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1000000.0


In [22]:
df_train.describe()

,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,Price
count,4047.0,4047.0,4047.0,4047.0,4047.0,4047.0,4047.0,4047.0,4047.0,4047.0,4047.0,4047.0,4047.0,4047.0,4047.0,4047.0
mean,0.150191,2.205584,1.97875,0.048431,0.018285,0.022239,0.031628,0.245861,0.060044,0.019768,0.05955,0.075117,0.055597,0.210526,0.097603,1010103.0
std,0.123256,0.994169,0.981058,0.214702,0.133997,0.147477,0.17503,0.43065,0.237599,0.139218,0.236681,0.263613,0.22917,0.407733,0.296814,914317.5
min,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,190000.0
25%,0.067613,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,400000.0
50%,0.105175,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,650000.0
75%,0.198664,3.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1350000.0
max,1.0,4.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,10095270.0


### Next, we use the already defined pipeline, fitted on the training set, to transform the test set

In [23]:
X_test_transformed = pipe_all.transform(X_test).toarray()
print(X_test_transformed.shape)
print(X_test_transformed[:2])

(715, 15)
[[0.07011686 1.         1.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         1.        ]
 [0.15525876 3.         2.         0.         0.         0.
  0.         1.         0.         0.         0.         0.
  0.         0.         0.        ]]




In [24]:
df_test = pd.DataFrame(X_test_transformed, index=X_test.index)
df_test['Price'] = y_test
df_test

,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,Price
4548,0.070117,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,698415.0
5,0.155259,3.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,870000.0
4341,0.213689,3.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2900000.0
1975,0.105175,2.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,530000.0
86,0.075125,1.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,890162.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4665,0.282137,2.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1800000.0
1463,0.118531,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,650000.0
922,0.627713,4.0,4.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4285729.0
2967,0.081803,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,530000.0


In [25]:
df_test.describe()

,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,Price
count,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0
mean,0.158995,2.236364,1.986014,0.05035,0.023776,0.023776,0.025175,0.234965,0.06993,0.011189,0.048951,0.068531,0.06993,0.197203,0.11049,1059030.0
std,0.138163,1.033781,1.017948,0.218818,0.152458,0.152458,0.156766,0.424274,0.255208,0.105257,0.215917,0.252833,0.255208,0.398165,0.313718,1008999.0
min,0.003339,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,190000.0
25%,0.065109,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,390000.0
50%,0.098497,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,630000.0
75%,0.217028,3.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1400000.0
max,0.958264,4.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,9523842.0


In the resulting training and test sets, the ordering of the transformed columns follows the order of the transformations in the defined pipeline. The table below outlines the relationship between the original features and the transformed columns for each of the transformations carried out.

Next, we will use this information to train several models defined by different algorithms and different sets of predictor features. Based on their predictive power, we will select the models that perform best to further analyse their predictions on the training and test sets. 

| Feature Column | Transformation | Transformed columns | 
| --- | --- | --- |
| Surface | MinMaxScaler | 0 | 
| Bedrooms | CustomOrdinalEncoder | 1 | 
| Bathrooms | CustomOrdinalEncoder | 2 | 
| Location | DimensionReducer + OneHotEncoder | 3 - 14 | 
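
Since the one-hot encoder sorts the categories and drops the first one, readable names for the 15 transformed columns can be reconstructed from the sorted category list shown earlier. The sketch below does this in plain Python; the `Location_` prefix is a hypothetical naming choice, not something produced by the pipeline.

```python
# categories kept after grouping, as listed in the sorted output shown earlier
location_categories = [
    'AAA', 'Estacion Central', 'Independencia', 'La Cisterna', 'La Florida',
    'Las Condes', 'Lo Barnechea', 'Macul', 'Nunoa', 'Providencia',
    'San Miguel', 'Santiago', 'Vitacura',
]

# drop="first" removes the first sorted category ('AAA')
column_names = ['Surface', 'Bedrooms', 'Bathrooms'] + [
    f"Location_{name}" for name in sorted(location_categories)[1:]
]

for position, name in enumerate(column_names):
    print(position, name)
```

For instance, position 13 corresponds to Santiago and position 7 to Las Condes, which matches the rows printed for the training set. A list like this could be passed as the `columns` argument of `pd.DataFrame` when building `df_train` and `df_test`.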

### Save transformed training and test sets

We save the transformed training and test sets, which will be used in the next stage of model training.

In [26]:
df_train.to_csv("RENT_APARTMENT_MR_training_set.csv", index=False)
df_test.to_csv("RENT_APARTMENT_MR_test_set.csv", index=False)