# Canadian Rental Prices Feature Engineering and Data Preprocessing

## Introduction

This notebook prepares the data for a regression model that will predict the price of a rental based on multiple factors.

We're going to work with the data that was explored and cleaned in [this notebook](DataAnalysis.ipynb). The dataset contains rental price data along with several associated factors, such as the location, the type of rental and the number of rooms.

Model selection for the price prediction is done in [this notebook](DataModeling.ipynb).

In [4]:
import numpy as np
import pandas as pd
import pickle
from sklearn.feature_selection import mutual_info_regression, SelectKBest
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

In [5]:
# Load the data
df = pd.read_csv('Data/cleaned_data.csv')
df.shape

(18832, 19)

In [6]:
df.sample(5)

Unnamed: 0,rentfaster_id,city,province,address,latitude,longitude,lease_term,type,price,beds,baths,sq_feet,link,furnishing,availability_date,smoking,cats,dogs,postal_code
3487,72819,Calgary,Alberta,"241081 range road 33, rockyview No. 44",51.030629,-114.350466,Long Term,Acreage,4400.0,3,3.0,2300.0,/ab/calgary/rentals/acreage/3-bedrooms/springb...,Unfurnished,July 01,Non-Smoking,True,True,T3Z
6328,397953,Edmonton,Alberta,11015 83 St NW,53.558613,-113.468462,Long Term,Apartment,900.0,0,1.0,425.0,/ab/edmonton/rentals/apartment/1-bedroom/cromd...,Unfurnished,Call for Availability,Non-Smoking,True,False,T5H
11646,560185,Lucan,Ontario,280 Main St,43.191809,-81.41174,Long Term,Apartment,2200.0,2,1.0,701.0,/on/lucan/rentals/apartment/1-bedroom/pet-frie...,Unfurnished,Immediate,Non-Smoking,True,True,N0M
14890,394643,Toronto,Ontario,3967 Lawrence Avenue East,43.763957,-79.203175,Long Term,Apartment,2645.14,2,1.0,930.0,/on/toronto/rentals/apartment/1-bedroom/non-sm...,Unfurnished,Immediate,Non-Smoking,False,False,M1E
3947,458369,Calgary,Alberta,41 Yorkville Avenue Southwest,50.874666,-114.072466,Long Term,House,3500.0,4,3.5,2350.0,/ab/calgary/rentals/house/4-bedrooms/yorkville...,Unfurnished,August 01,Non-Smoking,True,True,T2X


---

## 1. Simplify model by removing some columns

**rentfaster_id**: An ID column is not correlated in any way with the price.

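**province, city, address, latitude & longitude**: The `postal_code` column holds the first three characters of the postal code (the forward sortation area), which already identifies the province and the local area, so the other location columns are largely redundant.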

**link**: The link has no correlation with the price.

**availability_date**: I don't think the availability date has anything to do with the price.

In [10]:
df = df.drop(columns=['rentfaster_id', 'province', 'city', 'address', 'latitude', 'longitude', 'link', 'availability_date'])
df.sample(5)

Unnamed: 0,lease_term,type,price,beds,baths,sq_feet,furnishing,smoking,cats,dogs,postal_code
15559,Long Term,Condo Unit,1900.0,0,1.0,1371.060606,Unfurnished,Non-Smoking,False,False,M5A
15066,Long Term,Apartment,2375.0,2,1.0,844.0,Unfurnished,Non-Smoking,True,True,M2N
6612,Long Term,House,3800.0,4,3.5,1760.0,Furnished,Non-Smoking,False,False,T6X
17552,Long Term,Apartment,2150.0,2,1.0,1371.060606,Unfurnished,Non-Smoking,False,False,H3H
13906,Long Term,Apartment,5097.0,2,2.0,816.0,Unfurnished,Unknown,True,True,M5V


---

## 2. Encode categorical columns

### 2.1 Boolean

In [14]:
# Convert the boolean pet columns to 0/1 integers
df = df.assign(
    dogs=np.where(df['dogs'] == True, 1, 0),
    cats=np.where(df['cats'] == True, 1, 0),
)
df.sample(5)

Unnamed: 0,lease_term,type,price,beds,baths,sq_feet,furnishing,smoking,cats,dogs,postal_code
2063,Long Term,Main Floor,2495.0,4,2.5,2200.0,Unfurnished,Non-Smoking,1,1,T3P
1223,Long Term,Condo Unit,2200.0,2,2.0,725.0,Furnished,Non-Smoking,0,0,T3R
14582,Long Term,Apartment,3950.0,2,2.0,963.0,Unfurnished,Non-Smoking,1,1,M4S
15016,Long Term,Apartment,1729.0,1,1.0,476.0,Unfurnished,Non-Smoking,0,0,M4A
1230,Long Term,Apartment,1900.0,2,1.0,811.0,Unfurnished,Non-Smoking,1,1,T2E


### 2.2 Non-ordinal

In [16]:
# Check the number of unique values
df[['postal_code', 'lease_term', 'type', 'furnishing', 'smoking']].nunique()

postal_code    698
lease_term       7
type            15
furnishing       3
smoking          5
dtype: int64

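We will use one-hot encoding for all of the features listed above except the postal code. With 698 distinct values, one-hot encoding it would add close to 700 columns to the input. We will use frequency encoding for it instead, replacing each postal code with the number of listings that share it.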
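
To make the width argument concrete, the short sketch below computes how many columns one-hot encoding would add for each feature. It uses the unique-value counts printed above and assumes one category per feature is dropped as a baseline. The helper name `one_hot_width` is hypothetical, introduced only for illustration.

```python
# Unique-value counts copied from the nunique() output above.
unique_counts = {
    "postal_code": 698,
    "lease_term": 7,
    "type": 15,
    "furnishing": 3,
    "smoking": 5,
}


def one_hot_width(n_categories: int, drop_first: bool = True) -> int:
    """Columns produced by one-hot encoding a feature (hypothetical helper)."""
    return n_categories - 1 if drop_first else n_categories


widths = {name: one_hot_width(n) for name, n in unique_counts.items()}
low_cardinality_total = sum(w for name, w in widths.items() if name != "postal_code")

print(widths["postal_code"])    # columns added by postal_code alone
print(low_cardinality_total)    # columns added by the four other features combined
```

Under that assumption the postal code alone would contribute 697 columns, against 26 for the other four features combined.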

In [18]:
# Map each postal code to its frequency (number of listings)
pc_frequency_map = df['postal_code'].value_counts()
pc_frequency_map

postal_code
K1N    369
T2R    329
T5K    290
T3M    289
T2T    277
      ... 
L1T      2
K7N      2
L9G      2
L2H      2
P1A      2
Name: count, Length: 698, dtype: int64

In [19]:
# Apply frequency encoding
df['postal_code'] = df['postal_code'].map(pc_frequency_map)
df.head()

Unnamed: 0,lease_term,type,price,beds,baths,sq_feet,furnishing,smoking,cats,dogs,postal_code
0,Long Term,Townhouse,2495.0,2,2.5,1403.0,Unfurnished,Non-Smoking,1,1,156
1,Long Term,Townhouse,2695.0,3,2.5,1496.0,Unfurnished,Non-Smoking,1,1,156
2,Long Term,Townhouse,2295.0,2,2.5,1180.0,Unfurnished,Non-Smoking,1,1,156
3,Long Term,Townhouse,2095.0,2,2.5,1403.0,Unfurnished,Non-Smoking,1,1,156
4,Long Term,Townhouse,2495.0,2,2.5,1351.0,Unfurnished,Non-Smoking,1,1,156
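
One thing to keep in mind when a frequency map like this is reused on new listings: a postal code that never appeared in the original data has no entry in the map, so `map` returns `NaN` for it. The sketch below uses made-up toy values and falls back to a frequency of 0 for unseen codes. That fallback is an assumption for illustration, not a choice made in this notebook.

```python
import pandas as pd

# Toy "training" postal codes (made-up values, not the rental dataset).
train_codes = pd.Series(["T2R", "T2R", "K1N"], name="postal_code")
toy_frequency_map = train_codes.value_counts()

# New listings, including a code ("X0X") that is absent from the map.
new_codes = pd.Series(["T2R", "K1N", "X0X"], name="postal_code")

# Unseen codes map to NaN; here they are replaced with 0 (assumed fallback).
encoded = new_codes.map(toy_frequency_map).fillna(0).astype(int)
print(encoded.tolist())
```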


In [20]:
# Apply OneHotEncoder to the other categorical features
cat_cols = ['lease_term', 'type', 'furnishing', 'smoking']
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded_cols = encoder.fit_transform(df[cat_cols])
# Reuse df's index so the join aligns rows even if the index is not 0..n-1
new_df = pd.DataFrame(encoded_cols, columns=encoder.get_feature_names_out(cat_cols), index=df.index)
df = df.drop(columns=cat_cols).join(new_df)
df.sample(5)

Unnamed: 0,price,beds,baths,sq_feet,cats,dogs,postal_code,lease_term_6 months,lease_term_Long Term,lease_term_Negotiable,...,type_Room For Rent,type_Storage,type_Townhouse,type_Vacation Home,furnishing_Negotiable,furnishing_Unfurnished,smoking_Non-Smoking,smoking_Smoke Free Building,smoking_Smoking Allowed,smoking_Unknown
9788,1350.0,2,1.0,1371.060606,0,0,28,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
15597,2100.0,1,1.0,723.0,1,1,46,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
7574,1489.0,2,1.0,840.0,1,1,48,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
15369,1990.0,5,3.0,1371.060606,0,0,30,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
15714,2965.59,2,1.0,958.0,1,1,98,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0


In [21]:
# Make sure there are no more string values
assert not (df.dtypes == 'object').any()

---

## 3. Feature Selection

To keep the model simple and easier to interpret, we will keep only the 10 most informative features.

In [25]:
# Train-test split
X = df.drop('price', axis=1)
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=54)
X.shape

(18832, 32)

In [26]:
# Select features using mutual information regression, since there might be non-linear relationships between the features and the price.
selector = SelectKBest(score_func=mutual_info_regression, k=10)
X_train_selected = selector.fit_transform(X_train, y_train)
selected_features = selector.get_feature_names_out(X.columns)
selected_columns = list(selected_features)
selected_columns.append('price')
df = df[selected_columns]
df.sample(5)

Unnamed: 0,beds,baths,sq_feet,cats,dogs,postal_code,type_Apartment,type_Basement,type_Room For Rent,furnishing_Unfurnished,price
10576,1,1.0,1371.060606,1,1,3,1.0,0.0,0.0,1.0,1785.0
3570,2,2.0,884.0,0,0,162,0.0,0.0,0.0,1.0,2050.0
12890,1,1.0,773.0,0,0,369,1.0,0.0,0.0,1.0,1920.94
6650,3,2.5,1371.060606,1,1,215,0.0,0.0,0.0,1.0,1745.0
17666,0,1.0,400.0,0,0,112,1.0,0.0,0.0,1.0,1580.0


In [27]:
# Dump data frame and frequency map
df.to_csv('Data/preprocessed_data.csv', index=False)

with open('Data/postcode_frequencies.pkl', 'wb') as handle:
    pickle.dump(pc_frequency_map, handle, protocol=pickle.HIGHEST_PROTOCOL)
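
For reference, the pickled map can be read back with `pickle.load` when it is needed again, for example to encode postal codes at prediction time. The sketch below round-trips a two-entry toy map through a temporary file rather than `Data/postcode_frequencies.pkl`, so it can run anywhere.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

# Two entries taken from the value_counts() output above, as a toy map.
toy_map = pd.Series({"K1N": 369, "T2R": 329}, name="count")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "postcode_frequencies.pkl"

    # Write the map the same way the notebook does.
    with open(path, "wb") as handle:
        pickle.dump(toy_map, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # Read it back as a pandas Series.
    with open(path, "rb") as handle:
        restored_map = pickle.load(handle)

print(restored_map["K1N"])
```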

## End