# Canadian Rental Prices Feature Engineering and Data Preprocessing

## Introduction

This notebook prepares the data for a regression model that will predict the price of a rental based on multiple factors.

We work with the data that was explored and cleaned in [this notebook](DataAnalysis.ipynb). The dataset pairs each rental price with attributes such as the location, the type of rental and the number of rooms.

Model selection for the price prediction is done in [this notebook](DataModeling.ipynb).

In [6]:
import numpy as np
import pandas as pd
import pickle
from sklearn.feature_selection import mutual_info_regression, SelectKBest
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

In [7]:
# Load the data
df = pd.read_csv('Data/cleaned_data.csv')
df.shape

(18832, 19)

In [8]:
df.sample(5)

Unnamed: 0,rentfaster_id,city,province,address,latitude,longitude,lease_term,type,price,beds,baths,sq_feet,link,furnishing,availability_date,smoking,cats,dogs,postal_code
2905,124935,Calgary,Alberta,58 Cranleigh Terrace,50.886247,-113.992509,Negotiable,Basement,1800.0,1,1.0,1300.0,/ab/calgary/rentals/basement/1-bedroom/cransto...,Unfurnished,July 01,Non-Smoking,True,False,T3M
6712,569124,Edmonton,Alberta,8918 88 Avenue Northwest,53.523639,-113.464965,Long Term,Main Floor,2600.0,3,1.0,1300.0,/ab/edmonton/rentals/main-floor/3-bedrooms/bon...,Unfurnished,July 01,Non-Smoking,False,False,T6C
13773,523076,Toronto,Ontario,2376 Dundas Street West,43.657961,-79.451842,Long Term,Apartment,3110.0,2,1.0,783.0,/on/toronto/rentals/apartment/1-bedroom/pet-fr...,Unfurnished,Immediate,Smoke Free Building,True,True,M6P
1127,547253,Calgary,Alberta,Arbour Lake,51.133625,-114.19332,Long Term,House,3500.0,4,3.0,1322.0,/ab/calgary/rentals/house/4-bedrooms/arbour-la...,Unfurnished,Immediate,Non-Smoking,True,True,T3G
365,361795,Calgary,Alberta,620 10 Avenue SW,51.044004,-114.075438,Long Term,Apartment,2319.0,1,1.0,612.0,/ab/calgary/rentals/apartment/1-bedroom/beltli...,Unfurnished,August 07,Non-Smoking,True,True,T2R


---

## 1. Simplify model by removing some columns

**rentfaster_id**: A listing identifier carries no information about the price.

**province, city, address, latitude & longitude**: The postal code already identifies the province, city and neighbourhood, so these columns are redundant.

**link**: The listing URL has no bearing on the price.

**availability_date**: The availability date is assumed to have no effect on the price.
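
The claim that the postal code already covers the province and city can be checked before the columns are dropped. The sketch below runs on a small toy frame whose rows are made up for illustration; on the real data the same `groupby` would run on `df` before the drop. The names `toy`, `distinct_per_code` and `ambiguous_codes` are hypothetical.

```python
import pandas as pd

# Toy rows, made up for illustration; not taken from the dataset.
toy = pd.DataFrame({
    'postal_code': ['T3M', 'T3M', 'T6C', 'M6P', 'M6P'],
    'city': ['Calgary', 'Calgary', 'Edmonton', 'Toronto', 'Toronto'],
    'province': ['Alberta', 'Alberta', 'Alberta', 'Ontario', 'Ontario'],
})

# Count the distinct cities and provinces seen for each postal code.
distinct_per_code = toy.groupby('postal_code')[['city', 'province']].nunique()

# Postal codes mapped to more than one city or province would contradict the claim.
ambiguous_codes = distinct_per_code[(distinct_per_code > 1).any(axis=1)]
print(ambiguous_codes)
```

An empty result means every postal code maps to a single city and province, so dropping those columns loses no information that the postal code does not already carry.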

In [12]:
# `columns=` already targets the column axis, so `axis=1` is not needed
df = df.drop(columns=['rentfaster_id', 'province', 'city', 'address', 'latitude', 'longitude', 'link', 'availability_date'])
df.sample(5)

Unnamed: 0,lease_term,type,price,beds,baths,sq_feet,furnishing,smoking,cats,dogs,postal_code
13785,Long Term,Apartment,3120.0,2,1.0,783.0,Unfurnished,Smoke Free Building,True,True,M6P
13219,Long Term,Apartment,1585.0,1,1.0,574.0,Unfurnished,Non-Smoking,True,True,N5R
16638,Long Term,Apartment,2875.0,2,1.0,585.0,Unfurnished,Non-Smoking,True,True,H3A
3510,Long Term,Apartment,2085.0,2,2.0,900.0,Unfurnished,Non-Smoking,True,True,T2Y
14690,Long Term,Apartment,2001.0,0,1.0,0.0,Unfurnished,Non-Smoking,True,True,M6P


---

## 2. Encode categorical columns

### 2.1 Boolean

In [16]:
# Convert the boolean pet columns to 0/1 integers
df = df.assign(
    dogs=np.where(df['dogs'] == True, 1, 0),
    cats=np.where(df['cats'] == True, 1, 0),
)
df.sample(5)

Unnamed: 0,lease_term,type,price,beds,baths,sq_feet,furnishing,smoking,cats,dogs,postal_code
15581,Long Term,Apartment,2150.0,1,1.0,500.0,Unfurnished,Non-Smoking,1,1,M9B
16390,Long Term,Apartment,1770.0,1,1.0,829.0,Unfurnished,Non-Smoking,1,1,H7P
14145,Long Term,Apartment,3934.0,3,2.0,943.0,Unfurnished,Non-Smoking,1,1,M6G
9350,Long Term,Apartment,1805.0,2,1.0,845.0,Unfurnished,Non-Smoking,1,1,V4T
8217,Long Term,Townhouse,3368.0,3,2.5,1585.0,Unfurnished,Unknown,1,1,V1W


### 2.2 Non-ordinal

In [18]:
# Check the number of unique values
df[['postal_code', 'lease_term', 'type', 'furnishing', 'smoking']].nunique()

postal_code    698
lease_term       7
type            15
furnishing       3
smoking          5
dtype: int64

We will use one-hot encoding for all the features mentioned above except the postal code. With 698 distinct values, one-hot encoding the postal code would add hundreds of mostly empty columns to the input. We will use frequency encoding for it instead.
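
To make the trade-off concrete, here is a minimal sketch on a made-up column. It uses `pd.get_dummies` as a stand-in for the `OneHotEncoder` used in this notebook, and the values in `codes` are hypothetical.

```python
import pandas as pd

# Hypothetical high-cardinality column, for illustration only.
codes = pd.Series(['K1N', 'T2R', 'K1N', 'T5K', 'K1N', 'T2R'], name='postal_code')

# One-hot encoding adds one column per distinct value (minus one when the first is dropped).
one_hot = pd.get_dummies(codes, drop_first=True)

# Frequency encoding keeps a single numeric column: each value becomes its count.
frequency = codes.map(codes.value_counts())

print(one_hot.shape[1], 'one-hot columns vs 1 frequency column')
print(frequency.tolist())
```

The cost of frequency encoding is that two postal codes with the same count become indistinguishable to the model.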

In [20]:
# Map each postal code to its frequency (number of listings)
pc_frequency_map = df['postal_code'].value_counts()
pc_frequency_map

postal_code
K1N    369
T2R    329
T5K    290
T3M    289
T2T    277
      ... 
L1T      2
K7N      2
L9G      2
L2H      2
P1A      2
Name: count, Length: 698, dtype: int64

In [21]:
# Apply frequency encoding
df['postal_code'] = df['postal_code'].map(pc_frequency_map)
df.head()

Unnamed: 0,lease_term,type,price,beds,baths,sq_feet,furnishing,smoking,cats,dogs,postal_code
0,Long Term,Townhouse,2495.0,2,2.5,1403.0,Unfurnished,Non-Smoking,1,1,156
1,Long Term,Townhouse,2695.0,3,2.5,1496.0,Unfurnished,Non-Smoking,1,1,156
2,Long Term,Townhouse,2295.0,2,2.5,1180.0,Unfurnished,Non-Smoking,1,1,156
3,Long Term,Townhouse,2095.0,2,2.5,1403.0,Unfurnished,Non-Smoking,1,1,156
4,Long Term,Townhouse,2495.0,2,2.5,1351.0,Unfurnished,Non-Smoking,1,1,156
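
The frequency map has to be reused on any new listing at prediction time, and a new listing may carry a postal code that is absent from the map. The sketch below shows one possible way to handle that. The map is a three-entry stand-in built from the counts printed above, `new_listings` is hypothetical, and falling back to 0 for unseen codes is an assumption, not something this notebook prescribes.

```python
import pandas as pd

# Stand-in for pc_frequency_map, using three of the counts printed above.
pc_frequency_map = pd.Series({'K1N': 369, 'T2R': 329, 'T5K': 290})

# Hypothetical new listings; 'X0X' does not appear in the map.
new_listings = pd.DataFrame({'postal_code': ['T2R', 'X0X', 'K1N']})

# map() yields NaN for unseen codes, so fill them with 0 (an assumed fallback).
encoded = new_listings['postal_code'].map(pc_frequency_map).fillna(0).astype(int)
print(encoded.tolist())
```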


In [22]:
# One-hot encode the remaining categorical features
categorical_cols = ['lease_term', 'type', 'furnishing', 'smoking']
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded_cols = encoder.fit_transform(df[categorical_cols])
# Reuse df's index so that join() aligns the encoded rows with the original ones
new_df = pd.DataFrame(encoded_cols, columns=encoder.get_feature_names_out(categorical_cols), index=df.index)
df = df.drop(columns=categorical_cols).join(new_df)
df.sample(5)

Unnamed: 0,price,beds,baths,sq_feet,cats,dogs,postal_code,lease_term_6 months,lease_term_Long Term,lease_term_Negotiable,...,type_Room For Rent,type_Storage,type_Townhouse,type_Vacation Home,furnishing_Negotiable,furnishing_Unfurnished,smoking_Non-Smoking,smoking_Smoke Free Building,smoking_Smoking Allowed,smoking_Unknown
10994,1560.0,0,1.0,406.0,1,1,145,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
18454,1559.0,2,1.0,877.0,1,1,36,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
11794,2443.11,1,1.0,660.0,0,0,60,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
2544,2300.0,2,1.0,870.0,0,0,329,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
11374,1810.0,2,1.0,807.0,1,1,16,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0


In [23]:
# Make sure there are no more string values
s = df.dtypes
s = s[s != 'object']
assert len(s) == len(df.columns)

---

## 3. Feature Selection

In [26]:
# Train-test split
X = df.drop('price', axis=1)
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
X.shape

(18832, 32)

In [27]:
# Select features using mutual information regression, since the relationships
# between the features and the price might be non-linear.
# The selector is fitted on the training split only.
selector = SelectKBest(score_func=mutual_info_regression, k=10)
X_train_selected = selector.fit_transform(X_train, y_train)
selected_features = selector.get_feature_names_out(X.columns)
selected_columns = list(selected_features)
selected_columns.append('price')
df = df[selected_columns]
df.sample(5)

Unnamed: 0,beds,baths,sq_feet,cats,dogs,postal_code,type_Apartment,type_Basement,type_House,type_Room For Rent,price
15561,1,1.0,1371.060606,1,1,18,1.0,0.0,0.0,0.0,2000.0
2526,2,1.0,1371.060606,0,0,152,0.0,0.0,0.0,0.0,1775.0
16224,1,1.0,0.0,1,1,4,1.0,0.0,0.0,0.0,1254.0
10877,1,1.0,550.0,1,1,5,1.0,0.0,0.0,0.0,1840.0
9661,3,1.5,0.0,0,0,69,0.0,0.0,0.0,0.0,1700.0


In [28]:
# Dump preprocessed data into a pickle file
data = {
    'pc_frequency_map': pc_frequency_map,
    'df': df,
}

with open('Data/preprocessed_data.pkl', 'wb') as handle:
    pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)
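
For reference, the saved dictionary can be read back with `pickle.load`. The sketch below performs a round trip with small stand-in objects and a temporary file instead of `Data/preprocessed_data.pkl`, so it runs on its own. The stand-in values are made up for illustration.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-ins for the objects saved above, for illustration only.
data = {
    'pc_frequency_map': pd.Series({'K1N': 369, 'T2R': 329}),
    'df': pd.DataFrame({'beds': [1, 2], 'price': [1800.0, 2600.0]}),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / 'preprocessed_data.pkl'  # temporary path, not Data/

    with open(path, 'wb') as handle:
        pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # Read the dictionary back and unpack both objects.
    with open(path, 'rb') as handle:
        loaded = pickle.load(handle)

loaded_df = loaded['df']
loaded_map = loaded['pc_frequency_map']
print(loaded_df.equals(data['df']), loaded_map.equals(data['pc_frequency_map']))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.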

## End