# Canadian Rental Prices Feature Engineering and Data Preprocessing

## 1. Introduction

This notebook prepares the data for a regression model that will predict the price of a rental based on multiple factors.

We work with the data that was explored and cleaned in [this notebook](EDA.ipynb). The dataset pairs rental prices with factors such as location, type of rental, and number of rooms.

The model selection to predict the price will be made in [this notebook](Regression.ipynb).

In [237]:
import numpy as np
import pandas as pd
import pickle
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

In [238]:
# Load the data
df = pd.read_pickle('../Data/cleaned_data.pkl')
df.shape

(19045, 18)

In [239]:
df.sample(5)

Unnamed: 0,rentfaster_id,city,province,address,latitude,longitude,lease_term,type,price,beds,baths,sq_feet,link,furnishing,availability_date,smoking,cats,dogs
13195,566190,Ottawa,Ontario,"1350 Hemlock Road, Ottawa, Ontario K1K 5C2",45.45282,-75.63291,Long Term,Apartment,1990.0,1,1.0,730.0,/on/ottawa/rentals/apartment/1-bedroom/carson-...,Unfurnished,Immediate,Non-Smoking,True,True
656,477774,Calgary,Alberta,912 6th Avenue SW,51.048157,-114.081637,Long Term,Apartment,1715.0,1,1.0,530.0,/ab/calgary/rentals/apartment/1-bedroom/downto...,Unfurnished,Immediate,Non-Smoking,True,True
9872,568528,Winnipeg,Manitoba,384 Arnold Avenue,49.864198,-97.139262,Long Term,Duplex,2200.0,3,2.5,2200.0,/mb/winnipeg/rentals/duplex/3-bedrooms/lord-ro...,Unfurnished,Immediate,Non-Smoking,False,False
8605,143797,Calgary,Alberta,901 10 Avenue SW,51.04356,-114.081911,Long Term,Apartment,1850.0,1,1.0,570.0,/ab/calgary/rentals/apartment/1-bedroom/beltli...,Unfurnished,Immediate,Non-Smoking,False,False
16700,527987,Montréal,Quebec,360 President Kennedy,45.507216,-73.569636,Long Term,Apartment,1900.0,0,1.0,457.0,/qc/montreal/rentals/apartment/1-bedroom/pet-f...,Unfurnished,Immediate,Non-Smoking,True,True


## 2. Simplify model by removing some columns

**rentfaster_id**: An ID column is not correlated in any way with the price.

**province, address, latitude & longitude**: Location is an important factor in the price. However, including every province and exact location would make the model overly complex. The city should be enough to account for location in our model.

**link**: The link has no correlation with the price.

**availability_date**: I don't think the availability date has anything to do with the price.

In [242]:
df.drop(columns=['rentfaster_id', 'province', 'address', 'latitude', 'longitude', 'link', 'availability_date'], inplace=True)
df.sample(5)

Unnamed: 0,city,lease_term,type,price,beds,baths,sq_feet,furnishing,smoking,cats,dogs
2062,Calgary,Long Term,Main Floor,2285.0,3,2.5,1485.0,Unfurnished,Non-Smoking,False,False
8471,Nanaimo,Long Term,Apartment,2965.0,0,1.0,1348.0,Unfurnished,Non-Smoking,False,False
5866,Edmonton,Long Term,House,2595.0,3,2.5,1752.0,Unfurnished,Non-Smoking,False,False
5124,Edmonton,Long Term,Apartment,1434.0,2,1.0,880.0,Unfurnished,Unknown,True,True
5421,Edmonton,Long Term,Condo Unit,2250.0,2,2.0,795.0,Furnished,Non-Smoking,False,False


## 3. Encode categorical columns

### 3.1 Boolean

In [245]:
# Convert the boolean pet columns to 0/1 integers
df = df.assign(
    dogs=np.where(df['dogs'] == True, 1, 0),
    cats=np.where(df['cats'] == True, 1, 0),
)
df.sample(5)

Unnamed: 0,city,lease_term,type,price,beds,baths,sq_feet,furnishing,smoking,cats,dogs
5042,Edmonton,Long Term,Apartment,1420.0,0,1.0,0.0,Unfurnished,Non-Smoking,1,1
17896,Montréal,Long Term,Apartment,1395.0,1,1.0,1371.060606,Unfurnished,Non-Smoking,0,0
3616,Calgary,Long Term,Apartment,2000.0,2,1.0,800.0,Unfurnished,Non-Smoking,1,1
15255,Toronto,Long Term,Apartment,2600.0,2,1.0,1371.060606,Unfurnished,Unknown,0,0
18400,Lloydminster,Long Term,Apartment,1568.0,3,1.0,971.0,Unfurnished,Non-Smoking,1,1


### 3.2 Non-ordinal

In [247]:
# Check the number of unique values
df[['city', 'lease_term', 'type', 'furnishing', 'smoking']].nunique()

city          269
lease_term      7
type           15
furnishing      3
smoking         5
dtype: int64

We will use one-hot encoding for all the features listed above except city: with 269 unique values, one-hot encoding it would add far too many input columns. For city we will use frequency encoding instead, on the assumption that the cities with the most ads are big cities, which most likely have higher prices.

In [249]:
# Map each city to its number of listings
city_frequency_map = df['city'].value_counts()
city_frequency_map

city
Calgary             4721
Edmonton            2570
Toronto             2395
Montréal            1516
Ottawa              1084
                    ... 
Christopher Lake       1
Priddis                1
Innisfail              1
Langdon                1
Crowsnest Pass         1
Name: count, Length: 269, dtype: int64

In [250]:
# Apply frequency encoding
df['city'] = df['city'].map(city_frequency_map)
df.head()

Unnamed: 0,city,lease_term,type,price,beds,baths,sq_feet,furnishing,smoking,cats,dogs
0,128,Long Term,Townhouse,2495.0,2,2.5,1403.0,Unfurnished,Non-Smoking,1,1
1,128,Long Term,Townhouse,2695.0,3,2.5,1496.0,Unfurnished,Non-Smoking,1,1
2,128,Long Term,Townhouse,2295.0,2,2.5,1180.0,Unfurnished,Non-Smoking,1,1
3,128,Long Term,Townhouse,2095.0,2,2.5,1403.0,Unfurnished,Non-Smoking,1,1
4,128,Long Term,Townhouse,2495.0,2,2.5,1351.0,Unfurnished,Non-Smoking,1,1
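
Since `city_frequency_map` is saved for later use, it may help to see how the same mapping could be applied to listings that were not part of this dataset. The following is a minimal sketch on made-up data. The names `train_cities`, `toy_frequency_map` and `new_listings` are hypothetical, and treating an unseen city as frequency 0 is an assumption, not something this notebook decides.

```python
import pandas as pd

# Toy stand-in for the city column of the training data (values are made up).
train_cities = pd.Series(['Calgary', 'Calgary', 'Toronto', 'Calgary', 'Toronto', 'Ottawa'])
toy_frequency_map = train_cities.value_counts()

# New listings, including a city that never appeared in the training data.
new_listings = pd.DataFrame({'city': ['Toronto', 'Ottawa', 'Whitehorse']})

# map() returns NaN for cities missing from the map.
# Assumption: an unseen city is treated as having zero listings.
new_listings['city'] = new_listings['city'].map(toy_frequency_map).fillna(0).astype(int)
print(new_listings)
```

Without the `fillna` step, unseen cities would reach the model as missing values.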


In [251]:
# Apply OneHotEncoder to the other categorical features
categorical_cols = ['lease_term', 'type', 'furnishing', 'smoking']
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded_cols = encoder.fit_transform(df[categorical_cols])
# Reuse df's index so that join() aligns the rows even if the index is not 0..n-1
new_df = pd.DataFrame(encoded_cols, columns=encoder.get_feature_names_out(categorical_cols), index=df.index)
df = df.drop(columns=categorical_cols).join(new_df)
df.sample(5)

Unnamed: 0,city,price,beds,baths,sq_feet,cats,dogs,lease_term_6 months,lease_term_Long Term,lease_term_Negotiable,...,type_Room For Rent,type_Storage,type_Townhouse,type_Vacation Home,furnishing_Negotiable,furnishing_Unfurnished,smoking_Non-Smoking,smoking_Smoke Free Building,smoking_Smoking Allowed,smoking_Unknown
13855,2395,2725.0,1,1.0,494.0,1,1,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
9014,57,2350.0,2,1.0,950.0,1,1,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
10554,118,2995.0,3,2.5,1800.0,1,1,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
44,128,1737.0,2,1.0,792.0,1,1,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
17378,1516,1790.0,1,1.0,1371.060606,1,1,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
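
To make the effect of `drop='first'` easier to see, here is a small sketch on a toy `furnishing` column. The frame `toy` and the `toy_*` names are hypothetical and the values are made up. The first category in sorted order is dropped, so a feature with three categories yields two columns, and a row with all zeros stands for the dropped category.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy column with three categories (illustrative values only).
toy = pd.DataFrame({'furnishing': ['Furnished', 'Unfurnished', 'Negotiable', 'Unfurnished']})

toy_encoder = OneHotEncoder(drop='first', sparse_output=False)
toy_encoded = toy_encoder.fit_transform(toy[['furnishing']])
toy_columns = list(toy_encoder.get_feature_names_out())

# 'Furnished' sorts first, so it is the dropped (baseline) category.
print(pd.DataFrame(toy_encoded, columns=toy_columns, index=toy.index))
```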


In [252]:
# Make sure there are no more string values
assert df.select_dtypes(include='object').empty

## 4. Feature Selection

In [254]:
# Train-test split
X = df.drop('price', axis=1)
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
X.shape

(19045, 32)

In [266]:
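# X_train_scaled is not defined earlier in this notebook. Assumption: it is
# X_train standardized with the imported StandardScaler, fitted on the training set only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Select features using mutual information regression, since there might be non-linear relationships between the features and the price.
selector = SelectKBest(score_func=mutual_info_regression, k=10)
X_train_selected = selector.fit_transform(X_train_scaled, y_train)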
selected_features = selector.get_feature_names_out(X.columns)
selected_columns = list(selected_features)
selected_columns.append('price')
df = df[selected_columns]
df.sample(5)

Unnamed: 0,city,beds,baths,sq_feet,cats,dogs,type_Apartment,type_Basement,type_House,type_Room For Rent,price
8292,97,1,1.0,720.0,1,1,1.0,0.0,0.0,0.0,2205.0
12833,1084,1,1.0,360.0,0,0,1.0,0.0,0.0,0.0,2200.0
4308,4721,2,2.0,902.0,1,1,1.0,0.0,0.0,0.0,2100.0
12996,1084,4,1.0,1449.0,1,1,1.0,0.0,0.0,0.0,2725.0
1600,4721,3,3.5,1895.0,0,0,0.0,0.0,0.0,0.0,3350.0
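
The choice of mutual information is motivated above by possible non-linear relationships. The sketch below illustrates that motivation on synthetic data only: it is not run on the rental dataset, and all names (`x_nonlinear`, `x_noise`, `target`) are hypothetical. A feature related to the target through a square has a Pearson correlation close to zero, yet it should still score a clearly higher mutual information than an unrelated feature.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)

# One feature drives the target non-linearly, the other is pure noise.
x_nonlinear = rng.uniform(-1, 1, size=2000)
x_noise = rng.uniform(-1, 1, size=2000)
target = x_nonlinear ** 2 + rng.normal(scale=0.01, size=2000)

features = np.column_stack([x_nonlinear, x_noise])

# Absolute Pearson correlation of each feature with the target.
pearson = [abs(np.corrcoef(features[:, i], target)[0, 1]) for i in range(2)]

# Mutual information of each feature with the target.
mi = mutual_info_regression(features, target, random_state=0)

print('pearson:', pearson)
print('mutual information:', mi)
```

A similar check on the real data could look at `selector.scores_` after fitting to see how the retained features rank.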


In [268]:
# Dump preprocessed data into a pickle file
data = {
    'city_frequency_map': city_frequency_map,
    'df': df,
}

with open('../Data/preprocessed_data.pkl', 'wb') as handle:
    pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)
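
For reference, a downstream notebook could read the file back roughly as sketched below. This example writes a made-up payload with the same two keys to a temporary directory instead of `../Data/`, so the values and the `toy_payload` and `restored_*` names are hypothetical.

```python
import os
import pickle
import tempfile

import pandas as pd

# Stand-ins for the real objects (values are made up for illustration).
toy_payload = {
    'city_frequency_map': pd.Series({'Calgary': 3, 'Toronto': 2}),
    'df': pd.DataFrame({'city': [3, 2], 'price': [2100.0, 2600.0]}),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'preprocessed_data.pkl')

    # Write the dictionary the same way as in the cell above.
    with open(path, 'wb') as handle:
        pickle.dump(toy_payload, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # Read it back and unpack both entries.
    with open(path, 'rb') as handle:
        restored = pickle.load(handle)

restored_df = restored['df']
restored_map = restored['city_frequency_map']
print(restored_df.shape, restored_map.to_dict())
```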

## End