# Canadian Rental Prices Feature Engineering and Data Preprocessing

## Introduction

This notebook prepares the data for a regression model that will predict the price of a rental based on multiple factors.

We work with the data explored and cleaned in [this notebook](EDA.ipynb). The dataset pairs each rental's price with attributes such as location, rental type, and number of rooms.

Model selection for the price prediction is done in [this notebook](Regression.ipynb).

In [3]:
import numpy as np
import pandas as pd
import pickle
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

In [5]:
# Load the data
df = pd.read_pickle('../Data/cleaned_data.pkl')
df.shape

(18947, 18)

In [7]:
df.sample(5)

Unnamed: 0,rentfaster_id,city,province,address,latitude,longitude,lease_term,type,price,beds,baths,sq_feet,link,furnishing,availability_date,smoking,cats,dogs
11323,566474,Kitchener,Ontario,595 Strasburg Road,43.417214,-80.480752,Long Term,Apartment,2300.0,2,1.0,540.0,/on/kitchener/rentals/apartment/2-bedrooms/pet...,Unfurnished,Immediate,Non-Smoking,True,True
14115,568787,Toronto,Ontario,25 Dalhousie Street and 30 Mutual Street,43.654262,-79.37489,Long Term,Apartment,3090.0,1,1.0,540.0,/on/toronto/rentals/apartment/1-bedroom/pet-fr...,Unfurnished,August 01,Unknown,True,True
12452,559539,Ottawa,Ontario,265 Rideau Street,45.428846,-75.686676,Long Term,Apartment,2025.0,1,1.0,552.0,/on/ottawa/rentals/apartment/1-bedroom/byward-...,Unfurnished,July 01,Non-Smoking,True,True
8862,531030,Calgary,Alberta,15 Aspenmont Heights SW,51.039445,-114.214688,Long Term,Apartment,2200.0,1,1.0,592.0,/ab/calgary/rentals/apartment/1-bedroom/aspen-...,Unfurnished,Call for Availability,Non-Smoking,True,True
15717,514535,Toronto,Ontario,39 Niagara St.,43.641654,-79.400819,Long Term,Apartment,4550.0,2,1.0,854.0,/on/toronto/rentals/apartment/1-bedroom/non-sm...,Unfurnished,Immediate,Non-Smoking,False,False


## 1. Simplify the model by removing some columns

**rentfaster_id**: An ID column is not correlated in any way with the price.

**province, address, latitude & longitude**: Location is an important factor in the price. However, including every province and exact coordinates would make the model overly complex. The city should be enough to account for location in our model.

**link**: The link has no correlation with the price.

**availability_date**: The availability date is assumed to be unrelated to the price.

In [11]:
df = df.drop(columns=['rentfaster_id', 'province', 'address', 'latitude', 'longitude', 'link', 'availability_date'])
df.sample(5)

Unnamed: 0,city,lease_term,type,price,beds,baths,sq_feet,furnishing,smoking,cats,dogs
12279,Ottawa,Long Term,Apartment,3450.0,2,1.0,774.0,Unfurnished,Non-Smoking,True,True
2842,Calgary,Long Term,Apartment,1580.0,1,1.0,1371.060606,Unfurnished,Non-Smoking,False,False
7997,Abbotsford,Long Term,Apartment,1225.0,0,1.0,450.0,Unfurnished,Non-Smoking,True,True
15482,Toronto,Long Term,Apartment,6500.0,3,3.0,1543.0,Unfurnished,Non-Smoking,True,True
18880,Swift Current,Long Term,Apartment,1430.0,3,1.0,829.0,Unfurnished,Non-Smoking,True,True


## 2. Encode categorical columns

### 2.1 Boolean

In [14]:
# Convert the boolean pet columns to 0/1 integers
df = df.assign(
    dogs=np.where(df['dogs'] == True, 1, 0),
    cats=np.where(df['cats'] == True, 1, 0),
)
df.sample(5)

Unnamed: 0,city,lease_term,type,price,beds,baths,sq_feet,furnishing,smoking,cats,dogs
8701,Calgary,Long Term,Townhouse,2700.0,3,2.5,1492.0,Unfurnished,Non-Smoking,1,0
13895,Toronto,Long Term,Apartment,2595.0,1,1.0,456.0,Unfurnished,Unknown,0,0
5844,Edmonton,Long Term,Room For Rent,690.0,6,4.5,2400.0,Furnished,Non-Smoking,0,0
6566,Edmonton,Long Term,Basement,1250.0,2,1.0,1371.060606,Unfurnished,Non-Smoking,0,1
296,Calgary,Long Term,Apartment,1880.0,1,1.0,578.0,Unfurnished,Non-Smoking,1,1


### 2.2 Non-ordinal

In [16]:
# Check the number of unique values
df[['city', 'lease_term', 'type', 'furnishing', 'smoking']].nunique()

city          269
lease_term      7
type           15
furnishing      3
smoking         5
dtype: int64

We will use one-hot encoding for all of the features listed above except city: with 269 unique values, one-hot encoding it would add far too many input columns. For city we use frequency encoding instead, on the assumption that the cities with the most ads are large cities, which are likely associated with higher prices.

In [18]:
# Map each city to its number of listings
city_frequency_map = df['city'].value_counts()
city_frequency_map

city
Calgary             4717
Edmonton            2569
Toronto             2379
Montréal            1514
Ottawa              1082
                    ... 
Christopher Lake       1
Priddis                1
Innisfail              1
Langdon                1
Crowsnest Pass         1
Name: count, Length: 269, dtype: int64

In [20]:
# Apply frequency encoding
df['city'] = df['city'].map(city_frequency_map)
df.head()

Unnamed: 0,city,lease_term,type,price,beds,baths,sq_feet,furnishing,smoking,cats,dogs
0,128,Long Term,Townhouse,2495.0,2,2.5,1403.0,Unfurnished,Non-Smoking,1,1
1,128,Long Term,Townhouse,2695.0,3,2.5,1496.0,Unfurnished,Non-Smoking,1,1
2,128,Long Term,Townhouse,2295.0,2,2.5,1180.0,Unfurnished,Non-Smoking,1,1
3,128,Long Term,Townhouse,2095.0,2,2.5,1403.0,Unfurnished,Non-Smoking,1,1
4,128,Long Term,Townhouse,2495.0,2,2.5,1351.0,Unfurnished,Non-Smoking,1,1
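
Because the frequency map is built only from cities present in this dataset, a city that never appeared maps to `NaN`. Below is a minimal sketch of one way to handle that, using toy data. The fallback value of 0 and the names `toy`, `freq_map`, and `new_listings` are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy listings standing in for the real data (hypothetical values)
toy = pd.DataFrame({'city': ['Calgary', 'Calgary', 'Toronto', 'Ottawa']})

# Frequency map: number of listings per city
freq_map = toy['city'].value_counts()

# New listings, one of them from a city absent from the map
new_listings = pd.DataFrame({'city': ['Calgary', 'Priddis']})

# map() yields NaN for unknown cities; fall back to 0 (an assumed choice)
encoded = new_listings['city'].map(freq_map).fillna(0).astype(int)
print(encoded.tolist())  # [2, 0]
```

Whether 0, 1, or some other fallback is appropriate depends on how the downstream model treats small cities.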


In [22]:
# Apply OneHotEncoder to the other categorical features
categorical_cols = ['lease_term', 'type', 'furnishing', 'smoking']
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded_cols = encoder.fit_transform(df[categorical_cols])
# Reuse df's index so the join aligns rows even if the index is not 0..n-1
new_df = pd.DataFrame(encoded_cols, columns=encoder.get_feature_names_out(categorical_cols), index=df.index)
df = df.drop(columns=categorical_cols).join(new_df)
df.sample(5)

Unnamed: 0,city,price,beds,baths,sq_feet,cats,dogs,lease_term_6 months,lease_term_Long Term,lease_term_Negotiable,...,type_Room For Rent,type_Storage,type_Townhouse,type_Vacation Home,furnishing_Negotiable,furnishing_Unfurnished,smoking_Non-Smoking,smoking_Smoke Free Building,smoking_Smoking Allowed,smoking_Unknown
16040,84,1795.0,2,1.0,1100.0,0,0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1448,4717,1600.0,2,1.0,1371.060606,0,0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
10216,204,639.0,1,1.0,1371.060606,0,0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
14017,2379,2889.0,1,1.0,647.0,1,1,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
9475,6,975.0,1,1.0,1371.060606,0,0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0


In [24]:
# Make sure no string (object) columns remain
object_cols = df.select_dtypes(include='object').columns
assert len(object_cols) == 0, f'Unencoded columns: {list(object_cols)}'

## 3. Feature Selection

In [33]:
# Train-test split
X = df.drop('price', axis=1)
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
X.shape

(18947, 5)

In [35]:
# Select features using mutual information regression, since the relationships between the features and the price may be non-linear.
selector = SelectKBest(score_func=mutual_info_regression, k=5)
X_train_selected = selector.fit_transform(X_train, y_train)
selected_features = selector.get_feature_names_out(X.columns)
selected_columns = list(selected_features)
selected_columns.append('price')
df = df[selected_columns]
df.sample(5)

Unnamed: 0,city,beds,baths,sq_feet,type_Apartment,price
11792,195,1,1.0,614.0,1.0,2489.0
16901,1514,0,1.0,1371.060606,1.0,1900.0
5439,2569,3,2.5,1600.0,0.0,2300.0
1907,4717,2,2.0,1371.060606,0.0,725.0
13472,7,2,1.0,835.0,1.0,2035.0


In [37]:
# Dump preprocessed data into a pickle file
data = {
    'city_frequency_map': city_frequency_map,
    'df': df,
}

with open('../Data/preprocessed_data.pkl', 'wb') as handle:
    pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)
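
For reference, a downstream notebook could read the file back as sketched below. The example writes a stand-in dictionary with the same two keys to a temporary directory so that it runs on its own. The toy values and the names `restored_df` and `restored_map` are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in objects with the same keys as the dictionary saved above
data = {
    'city_frequency_map': pd.Series({'Calgary': 3, 'Toronto': 2}),
    'df': pd.DataFrame({'city': [3, 2], 'price': [1500.0, 2600.0]}),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / 'preprocessed_data.pkl'
    # Write the dictionary the same way as in the cell above
    with open(path, 'wb') as handle:
        pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)
    # Read it back and unpack both entries
    with open(path, 'rb') as handle:
        loaded = pickle.load(handle)

restored_df = loaded['df']
restored_map = loaded['city_frequency_map']
```

Keeping `city_frequency_map` next to the DataFrame lets later code encode the city of a new listing the same way as the training data.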

## End