# Canadian Rental Prices Feature Engineering and Data Preprocessing

## Introduction

This notebook prepares the data for a regression model that will predict the price of a rental based on multiple factors.

We work with the data that was explored and cleaned in [this notebook](DataAnalysis.ipynb). Each row is a rental listing with its price and attributes such as location, rental type, and number of bedrooms and bathrooms.

Model selection for the price prediction is done in [this notebook](DataModeling.ipynb).

In [4]:
import numpy as np
import pandas as pd
import pickle
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

In [5]:
# Load the data
df = pd.read_csv('Data/cleaned_data.csv')
df.shape

(18832, 19)

In [6]:
df.sample(5)

Unnamed: 0,rentfaster_id,city,province,address,latitude,longitude,lease_term,type,price,beds,baths,sq_feet,link,furnishing,availability_date,smoking,cats,dogs,postal_code
17668,555499,Montréal,Quebec,1350 du Fort,45.491281,-73.580864,Long Term,Apartment,1525.0,1,1.0,1371.060606,/qc/montreal/rentals/apartment/1-bedroom/pet-f...,Unfurnished,Immediate,Non-Smoking,True,False,H3H
13409,562018,Tillsonburg,Ontario,93 Keba Crescent,42.880897,-80.736283,Long Term,House,2395.0,3,3.5,1900.0,/on/tillsonburg/rentals/house/3-bedrooms/pet-f...,Unfurnished,Immediate,Non-Smoking,True,True,N4G
5868,504939,Edmonton,Alberta,10823 Jasper Ave,53.540628,-113.507537,Short Term,Apartment,3270.0,1,1.0,620.0,/ab/edmonton/rentals/apartment/1-bedroom/downt...,Furnished,Immediate,Non-Smoking,True,True,T5J
1703,570759,Calgary,Alberta,307 Ascot Cir Sw,51.047735,-114.219721,Long Term,Townhouse,2900.0,3,2.5,1434.0,/ab/calgary/rentals/townhouse/3-bedrooms/aspen...,Unfurnished,Immediate,Non-Smoking,False,False,T3H
2550,571986,Calgary,Alberta,681 Savanna Boulevard Northeast,51.134773,-113.949688,Long Term,Condo Unit,2200.0,2,2.0,950.0,/ab/calgary/rentals/condo/2-bedrooms/savanna/p...,Unfurnished,June 18,Non-Smoking,True,True,T3J


## 1. Simplify model by removing some columns

**rentfaster_id**: An arbitrary listing identifier carries no information about the price.

**province, city, address, latitude & longitude**: The three-character postal code prefix already identifies the province, the city and the approximate neighbourhood, so these columns are largely redundant.

**link**: The listing URL is not a predictor of the price.

**availability_date**: I assume the availability date has no meaningful effect on the price.
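The claim that the postal code subsumes the coarser location columns can be checked before dropping them. A minimal sketch on a toy frame (the rows are made up for illustration, not taken from the dataset) counts how many distinct cities and provinces each postal code prefix maps to:

```python
import pandas as pd

# Toy data, invented for illustration; on the real data this would run on `df`.
toy = pd.DataFrame({
    'postal_code': ['T3H', 'T3H', 'T5J', 'H3H', 'H3H'],
    'city': ['Calgary', 'Calgary', 'Edmonton', 'Montréal', 'Montréal'],
    'province': ['Alberta', 'Alberta', 'Alberta', 'Quebec', 'Quebec'],
})

# Number of distinct cities / provinces seen for each postal code prefix.
per_code = toy.groupby('postal_code')[['city', 'province']].nunique()

# Prefixes that map to more than one city or province would lose
# information if the coarser columns were dropped.
ambiguous = per_code[(per_code > 1).any(axis=1)]
print(per_code)
print(f"ambiguous prefixes: {len(ambiguous)}")
```

If the same check on the real data reported many ambiguous prefixes, keeping `city` or `province` might be worth reconsidering.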

In [9]:
df = df.drop(columns=['rentfaster_id', 'province', 'city', 'address', 'latitude', 'longitude', 'link', 'availability_date'])
df.sample(5)

Unnamed: 0,lease_term,type,price,beds,baths,sq_feet,furnishing,smoking,cats,dogs,postal_code
3361,Long Term,Main Floor,1695.0,2,1.0,1000.0,Unfurnished,Non-Smoking,False,False,T2G
9273,Long Term,Apartment,2500.0,2,1.0,706.0,Unfurnished,Non-Smoking,True,True,V9B
16079,Long Term,Apartment,1690.0,1,1.0,749.0,Unfurnished,Non-Smoking,False,False,J4X
2785,Long Term,House,2850.0,4,2.5,2000.0,Negotiable,Non-Smoking,False,True,T3A
2589,Short Term,House,5000.0,5,3.5,3500.0,Furnished,Non-Smoking,False,False,T3K


## 2. Encode categorical columns

### 2.1 Boolean

In [12]:
# Convert the boolean pet flags to 0/1 integers
df = df.assign(
    dogs=np.where(df['dogs'] == True, 1, 0),
    cats=np.where(df['cats'] == True, 1, 0),
)
df.sample(5)

Unnamed: 0,lease_term,type,price,beds,baths,sq_feet,furnishing,smoking,cats,dogs,postal_code
1983,Long Term,Duplex,2400.0,3,2.5,1371.060606,Unfurnished,Non-Smoking,0,0,T2Z
7673,Long Term,Main Floor,2650.0,3,2.0,1300.0,Unfurnished,Non-Smoking,0,0,T4R
13226,Long Term,Apartment,2140.0,2,1.0,861.0,Unfurnished,Non-Smoking,0,0,N7T
2298,Long Term,Basement,1800.0,2,1.0,1000.0,Unfurnished,Non-Smoking,0,0,T3P
5345,Long Term,Townhouse,1100.0,1,1.0,533.0,Unfurnished,Non-Smoking,1,1,T6X


### 2.2 Non-ordinal

In [14]:
# Check the number of unique values
df[['postal_code', 'lease_term', 'type', 'furnishing', 'smoking']].nunique()

postal_code    698
lease_term       7
type            15
furnishing       3
smoking          5
dtype: int64

We will use one-hot encoding for all the features listed above except the postal code: with 698 distinct values, one-hot encoding it would add far too many input columns. We will use frequency encoding for the postal code instead.
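As a rough illustration of the trade-off, the sketch below applies both encodings to a toy column (values invented for illustration). One-hot encoding adds one column per category, while frequency encoding keeps a single numeric column. The `fillna(0)` fallback for categories absent from the map is an assumption about how unseen postal codes could be handled; the notebook itself does not need it because the map is built from the same data it encodes.

```python
import pandas as pd

toy = pd.DataFrame({'postal_code': ['K1N', 'K1N', 'T2R', 'K1N', 'T5K', 'T2R']})

# One-hot: one indicator column per distinct value.
one_hot = pd.get_dummies(toy['postal_code'])

# Frequency encoding: replace each value by its number of occurrences.
freq_map = toy['postal_code'].value_counts()
encoded = toy['postal_code'].map(freq_map)

# Hypothetical new rows: 'X0X' never appeared when the map was built,
# so map() yields NaN and we fall back to 0 (an assumption, not the notebook's choice).
new_codes = pd.Series(['T2R', 'X0X'])
new_encoded = new_codes.map(freq_map).fillna(0).astype(int)

print(one_hot.shape[1], "one-hot columns vs 1 frequency column")
print(encoded.tolist())
print(new_encoded.tolist())
```

Note that frequency encoding maps two postal codes with the same count to the same value, so it trades some location detail for a compact representation.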

In [16]:
# Map each postal code to its frequency
pc_frequency_map = df['postal_code'].value_counts()
pc_frequency_map

postal_code
K1N    369
T2R    329
T5K    290
T3M    289
T2T    277
      ... 
L1T      2
K7N      2
L9G      2
L2H      2
P1A      2
Name: count, Length: 698, dtype: int64

In [17]:
# Apply frequency encoding
df['postal_code'] = df['postal_code'].map(pc_frequency_map)
df.head()

Unnamed: 0,lease_term,type,price,beds,baths,sq_feet,furnishing,smoking,cats,dogs,postal_code
0,Long Term,Townhouse,2495.0,2,2.5,1403.0,Unfurnished,Non-Smoking,1,1,156
1,Long Term,Townhouse,2695.0,3,2.5,1496.0,Unfurnished,Non-Smoking,1,1,156
2,Long Term,Townhouse,2295.0,2,2.5,1180.0,Unfurnished,Non-Smoking,1,1,156
3,Long Term,Townhouse,2095.0,2,2.5,1403.0,Unfurnished,Non-Smoking,1,1,156
4,Long Term,Townhouse,2495.0,2,2.5,1351.0,Unfurnished,Non-Smoking,1,1,156


In [18]:
# Apply OneHotEncoder for the other features
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded_cols = encoder.fit_transform(df[['lease_term', 'type', 'furnishing', 'smoking']])
new_df = pd.DataFrame(encoded_cols, columns=encoder.get_feature_names_out(['lease_term', 'type', 'furnishing', 'smoking']))
df = df.drop(columns=['lease_term', 'type', 'furnishing', 'smoking']).join(new_df)
df.sample(5)

Unnamed: 0,price,beds,baths,sq_feet,cats,dogs,postal_code,lease_term_6 months,lease_term_Long Term,lease_term_Negotiable,...,type_Room For Rent,type_Storage,type_Townhouse,type_Vacation Home,furnishing_Negotiable,furnishing_Unfurnished,smoking_Non-Smoking,smoking_Smoke Free Building,smoking_Smoking Allowed,smoking_Unknown
8979,2115.0,1,1.0,600.0,0,0,2,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
14447,3876.0,2,2.0,863.0,1,1,74,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
6411,1425.0,1,1.5,1371.060606,1,0,68,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
2045,750.0,0,0.0,1371.060606,1,1,329,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
7083,1324.0,2,1.0,930.0,1,1,146,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0


In [19]:
# Make sure there are no more string (object-dtype) columns
assert not (df.dtypes == 'object').any()

## 3. Feature Selection

In [21]:
# Train-test split
X = df.drop('price', axis=1)
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
X.shape

(18832, 32)

In [22]:
# Select features using mutual information regression, since the relationships between the features and the price might be non-linear.
selector = SelectKBest(score_func=mutual_info_regression, k=10)
X_train_selected = selector.fit_transform(X_train, y_train)
selected_features = selector.get_feature_names_out(X.columns)
selected_columns = list(selected_features)
selected_columns.append('price')
df = df[selected_columns]
df.sample(5)

Unnamed: 0,beds,baths,sq_feet,cats,dogs,postal_code,type_Apartment,type_Basement,type_Room For Rent,furnishing_Unfurnished,price
409,2,2.0,1024.0,1,1,123,1.0,0.0,0.0,1.0,2760.0
15571,1,1.0,462.0,0,0,57,1.0,0.0,0.0,1.0,2100.0
11671,1,1.0,714.0,1,1,8,1.0,0.0,0.0,1.0,2275.0
18501,2,1.0,740.0,1,1,49,1.0,0.0,0.0,1.0,1724.0
7059,3,2.0,1371.060606,1,1,225,0.0,0.0,0.0,1.0,2200.0
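The motivation for preferring mutual information over a linear score can be illustrated on synthetic data (not the rental dataset). In the sketch below, `y` depends on `x` only through `x**2`, so the Pearson correlation is essentially zero while the mutual information estimate is clearly positive.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Synthetic data: y is a deterministic, symmetric function of x.
x = np.linspace(-1.0, 1.0, 1001)
y = x ** 2

# Pearson correlation is ~0 because the relationship is not linear.
pearson = np.corrcoef(x, y)[0, 1]

# Mutual information still detects the dependence.
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(f"Pearson r = {pearson:.3f}, mutual information = {mi:.3f}")
```

A correlation-based selector would discard a feature like `x` here, which is the situation the mutual information score is meant to avoid.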


In [23]:
# Dump preprocessed data into a pickle file
data = {
    'pc_frequency_map': pc_frequency_map,
    'df': df,
}

with open('Data/preprocessed_data.pkl', 'wb') as handle:
    pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)
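Saving `pc_frequency_map` alongside the frame lets later steps encode postal codes the same way. A minimal round-trip sketch, assuming a downstream consumer that reloads the pickle (the toy map and the temporary path are stand-ins for illustration, not the real file):

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy stand-in for the saved frequency map (two entries only, for illustration).
pc_frequency_map = pd.Series({'K1N': 369, 'T2R': 329})
payload = {'pc_frequency_map': pc_frequency_map}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'preprocessed_data.pkl')
    with open(path, 'wb') as handle:
        pickle.dump(payload, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # Reload, as a downstream notebook might do.
    with open(path, 'rb') as handle:
        restored = pickle.load(handle)

# Encode a hypothetical new listing with the restored map.
new_listing = pd.DataFrame({'postal_code': ['T2R']})
new_listing['postal_code'] = new_listing['postal_code'].map(restored['pc_frequency_map'])
print(new_listing)
```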

## End