# Canadian Rental Prices Feature Engineering and Data Preprocessing

## Introduction

This notebook prepares the data for a regression model that predicts the price of a rental from several factors. The data has already been explored and cleaned in [this notebook](DataAnalysis.ipynb). The dataset pairs rental prices with variables such as the location, type of rental, number of rooms and furnishing.

In [58]:
import numpy as np
import pandas as pd
import pickle
from sklearn.feature_selection import mutual_info_regression, SelectKBest
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

In [59]:
# Load the data
df = pd.read_csv('Data/cleaned_data.csv')
df.shape

(18832, 19)

In [60]:
df.sample(5)

Unnamed: 0,rentfaster_id,city,province,address,latitude,longitude,lease_term,type,price,beds,baths,sq_feet,link,furnishing,availability_date,smoking,cats,dogs,postal_code
13878,560434,Toronto,Ontario,55 Smooth Rose Court,43.7744,-79.333442,Long Term,Apartment,2116.0,1,1.0,496.0,/on/toronto/rentals/apartment/1-bedroom/pet-fr...,Unfurnished,October 10,Unknown,True,True,M2J
7759,476414,Sherwood Park,Alberta,5001 Eton Boulevard,53.563429,-113.276715,Long Term,Condo Unit,2200.0,2,2.0,1055.0,/ab/sherwood-park/rentals/condo/2-bedrooms/pet...,Unfurnished,Immediate,Non-Smoking,False,True,T8H
1093,290642,Calgary,Alberta,Panorama Hills,51.16265,-114.100404,Long Term,Basement,1500.0,1,1.0,850.0,/ab/calgary/rentals/basement/1-bedroom/panoram...,Furnished,Immediate,Non-Smoking,False,False,T3K
7595,334281,Medicine Hat,Alberta,2398 Southview Dr SE,50.0163,-110.643755,Long Term,Apartment,1440.0,2,1.0,0.0,/ab/medicine-hat/rentals/apartment/2-bedrooms/...,Unfurnished,Immediate,Non-Smoking,True,True,T1B
2437,569918,Calgary,Alberta,Coventry Hills,51.171903,-114.038052,Long Term,Basement,1450.0,1,1.0,1000.0,/ab/calgary/rentals/basement/1-bedroom/coventr...,Unfurnished,June 15,Non-Smoking,True,False,T3K


---

## 1. Simplify dataset by removing columns

**rentfaster_id**: An ID column is not correlated in any way with the price.

**province, city, address, latitude & longitude**: These are redundant because the postal code already captures the province, city and neighbourhood.

**link**: The link has no correlation with the price.

**availability_date**: It has nothing to do with the price.

In [64]:
df = df.drop(columns=['rentfaster_id', 'province', 'city', 'address', 'latitude', 'longitude', 'link', 'availability_date'])
df.sample(5)

Unnamed: 0,lease_term,type,price,beds,baths,sq_feet,furnishing,smoking,cats,dogs,postal_code
17376,Long Term,Apartment,2200.0,2,1.0,900.0,Unfurnished,Non-Smoking,True,True,H2X
5757,Negotiable,Apartment,1150.0,0,1.0,600.0,Unfurnished,Non-Smoking,True,True,T5N
13852,Long Term,Apartment,3750.0,2,2.0,767.0,Unfurnished,Non-Smoking,True,True,M9A
13792,Long Term,Apartment,2231.0,1,1.0,1371.060606,Unfurnished,Non-Smoking,True,True,M6P
2541,Long Term,Main Floor,2200.0,3,1.5,1235.7,Unfurnished,Non-Smoking,False,False,T2K


---

## 2. Encode categorical columns

In [67]:
df.nunique()

lease_term        7
type             15
price          2745
beds             10
baths            16
sq_feet        1659
furnishing        3
smoking           5
cats              2
dogs              2
postal_code     698
dtype: int64

In [68]:
# Create dummy variables for cats and dogs
df = df.assign(
    dogs=np.where(df['dogs'] == True, 1, 0),
    cats=np.where(df['cats'] == True, 1, 0),
)
df.sample(5)

Unnamed: 0,lease_term,type,price,beds,baths,sq_feet,furnishing,smoking,cats,dogs,postal_code
4497,Long Term,Apartment,1950.0,1,1.0,498.0,Unfurnished,Non-Smoking,1,1,T2R
14345,Long Term,Apartment,3800.0,4,2.0,1077.0,Unfurnished,Non-Smoking,1,1,M4C
1792,Long Term,Apartment,2038.0,2,1.0,923.0,Unfurnished,Non-Smoking,1,1,T2R
7056,Long Term,Apartment,1125.0,1,1.0,1371.060606,Unfurnished,Non-Smoking,1,0,T5K
18567,Long Term,Apartment,1415.0,1,1.0,447.0,Unfurnished,Non-Smoking,1,1,S7H


In [69]:
# Get values of lease term
df['lease_term'].value_counts(normalize=True)

lease_term
Long Term     0.935960
Negotiable    0.047525
Short Term    0.011576
12 months     0.003239
Unknown       0.001540
6 months      0.000106
months        0.000053
Name: proportion, dtype: float64

In [70]:
# To reduce cardinality, convert the lease_term column into a dummy variable
df['long_term'] = np.where(df['lease_term'] == 'Long Term', 1, 0)
df = df.drop(columns='lease_term')
df.sample(5)

Unnamed: 0,type,price,beds,baths,sq_feet,furnishing,smoking,cats,dogs,postal_code,long_term
8802,Apartment,1475.0,1,1.0,820.0,Unfurnished,Non-Smoking,1,0,T2L,1
128,Acreage,3050.0,3,2.0,1520.0,Unfurnished,Non-Smoking,0,1,T1S,1
15726,Apartment,3750.0,2,2.0,1070.0,Unfurnished,Unknown,0,0,M6S,1
3903,Basement,1650.0,2,1.0,1100.0,Unfurnished,Non-Smoking,0,0,T2L,1
17936,Apartment,2450.0,2,2.0,940.0,Unfurnished,Non-Smoking,1,1,H9R,1


In [71]:
# Get values of furnishing
df['furnishing'].value_counts(normalize=True)

furnishing
Unfurnished    0.914667
Furnished      0.072377
Negotiable     0.012957
Name: proportion, dtype: float64

In [72]:
# Group Negotiable and Unfurnished and create dummy variable
df['furnished'] = np.where(df['furnishing'] == 'Furnished', 1, 0)
df = df.drop(columns='furnishing')
df.sample(5)

Unnamed: 0,type,price,beds,baths,sq_feet,smoking,cats,dogs,postal_code,long_term,furnished
9407,Apartment,1648.0,3,2.0,1030.0,Non-Smoking,1,1,R1A,1,0
10989,Apartment,2235.0,2,2.0,1333.0,Non-Smoking,1,1,K7K,1,0
10677,Apartment,1760.0,1,1.0,616.0,Non-Smoking,1,1,N1M,1,0
12533,Apartment,1975.0,1,1.0,622.0,Non-Smoking,1,1,K1P,1,0
18723,Apartment,1749.0,2,1.0,1128.0,Non-Smoking,1,0,S7K,1,0


In [73]:
# Get values of smoking
df['smoking'].value_counts(normalize=True)

smoking
Non-Smoking            0.852220
Unknown                0.122610
Smoke Free Building    0.014762
Smoking Allowed        0.007540
Negotiable             0.002867
Name: proportion, dtype: float64

In [74]:
# Reduce cardinality of smoking and create dummy variable
df['smoking_allowed'] = np.where(df['smoking'] == 'Smoking Allowed', 1, 0)
df = df.drop(columns='smoking')
df.sample(5)

Unnamed: 0,type,price,beds,baths,sq_feet,cats,dogs,postal_code,long_term,furnished,smoking_allowed
18739,Apartment,2085.0,3,1.5,609.0,1,1,S7V,1,0,0
16124,Apartment,3295.0,2,2.0,1102.0,0,0,H3X,1,0,0
7417,Apartment,1600.0,2,1.0,780.0,0,0,T1V,1,0,0
14873,Apartment,2378.48,1,1.0,732.0,1,1,M1K,1,0,0
8153,Condo Unit,3000.0,2,2.0,935.0,0,0,V3B,1,0,0


In [75]:
# Get values of type
df['type'].value_counts(normalize=True)

type
Apartment        0.699713
Condo Unit       0.069138
Townhouse        0.055013
House            0.053473
Basement         0.053260
Main Floor       0.024798
Room For Rent    0.021665
Duplex           0.016408
Office Space     0.003027
Storage          0.001115
Parking Spot     0.001115
Loft             0.000797
Acreage          0.000319
Vacation Home    0.000106
Mobile           0.000053
Name: proportion, dtype: float64

In [76]:
# Keep 5 types and group the rest
df['type'] = np.where(df['type'].isin(['Apartment', 'Condo Unit', 'Townhouse', 'House', 'Basement']), df['type'], 'Other')
df.sample(5)

Unnamed: 0,type,price,beds,baths,sq_feet,cats,dogs,postal_code,long_term,furnished,smoking_allowed
605,Apartment,2302.0,2,2.0,872.0,1,1,T2P,1,0,0
16086,Apartment,2495.0,1,1.0,992.0,1,1,H3X,1,1,0
4732,Apartment,2009.0,2,1.0,949.0,1,1,T5J,1,0,0
10858,Other,1995.0,1,2.0,1371.060606,1,1,L8L,1,0,0
14256,Apartment,2771.0,1,1.0,722.0,1,1,M2J,1,0,0
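
The grouping above names the five types to keep by hand. As a rough alternative, the same idea can be expressed with a share cutoff. The sketch below uses toy data, and `group_rare`, `min_share` and `other_label` are hypothetical names introduced for illustration, not part of the notebook. Going by the proportions printed earlier, a cutoff of 5% would keep the same five types.

```python
import pandas as pd


def group_rare(series, min_share=0.05, other_label="Other"):
    """Replace categories whose share is below min_share with other_label.

    Hypothetical helper for illustration only.
    """
    shares = series.value_counts(normalize=True)
    keep = shares[shares >= min_share].index
    # Series.where keeps values where the condition holds, else substitutes.
    return series.where(series.isin(keep), other_label)


# Toy data (not the rental dataset): two common types and two rare ones.
types = pd.Series(["Apartment"] * 14 + ["House"] * 4 + ["Loft", "Storage"])
grouped = group_rare(types, min_share=0.10)
print(grouped.value_counts())
```

A cutoff adapts if the data is refreshed, whereas the explicit list used in the notebook is easier to read and keeps the categories stable between runs.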


One-hot encoding will be used for the remaining categorical column, **type**. Frequency encoding will be used for the postal code: with 698 distinct values, one-hot encoding it would add far too many input features.
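
To make the difference concrete, here is a minimal sketch on toy data (the rows below are made up and are not taken from the dataset). It contrasts the number of columns each encoding produces.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({"postal_code": ["T2R", "T2R", "K1N", "T2R", "M2J", "K1N"]})

# One-hot encoding: one new column per distinct postal code.
one_hot = pd.get_dummies(toy["postal_code"], prefix="pc")

# Frequency encoding: a single numeric column holding each code's count.
freq_map = toy["postal_code"].value_counts()
freq_encoded = toy["postal_code"].map(freq_map)

print(one_hot.shape[1])       # number of one-hot columns
print(freq_encoded.tolist())  # one count per row
```

One trade-off to keep in mind: postal codes that happen to share the same count become indistinguishable after frequency encoding.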

In [78]:
# Apply OneHotEncoder for type, dropping 'Other' as the reference category
encoder = OneHotEncoder(drop=['Other'], sparse_output=False)
encoded_cols = encoder.fit_transform(df[['type']])
# Reuse df's index so the join aligns rows even if the index is not 0..n-1
new_df = pd.DataFrame(encoded_cols, columns=encoder.get_feature_names_out(['type']), index=df.index)
df = df.drop(columns=['type']).join(new_df)
df.sample(5)

Unnamed: 0,price,beds,baths,sq_feet,cats,dogs,postal_code,long_term,furnished,smoking_allowed,type_Apartment,type_Basement,type_Condo Unit,type_House,type_Townhouse
5014,1300.0,1,1.0,700.0,1,0,T6H,1,0,0,1.0,0.0,0.0,0.0,0.0
15409,2430.0,1,1.0,478.0,1,1,M6K,1,0,0,1.0,0.0,0.0,0.0,0.0
11764,2670.31,1,1.0,732.0,0,0,L5B,1,0,0,1.0,0.0,0.0,0.0,0.0
12874,1930.81,1,1.0,773.0,0,0,K1N,1,0,0,1.0,0.0,0.0,0.0,0.0
8527,3337.0,2,2.0,932.0,1,1,T3M,1,0,0,1.0,0.0,0.0,0.0,0.0


In [79]:
# Map each postal code to its frequency
pc_frequency_map = df['postal_code'].value_counts()
pc_frequency_map

postal_code
K1N    369
T2R    329
T5K    290
T3M    289
T2T    277
      ... 
L1T      2
K7N      2
L9G      2
L2H      2
P1A      2
Name: count, Length: 698, dtype: int64

In [80]:
# Apply frequency encoding
df['postal_code'] = df['postal_code'].map(pc_frequency_map)
df.sample(5)

Unnamed: 0,price,beds,baths,sq_feet,cats,dogs,postal_code,long_term,furnished,smoking_allowed,type_Apartment,type_Basement,type_Condo Unit,type_House,type_Townhouse
2621,850.0,2,1.5,1850.0,0,0,144,0,1,0,0.0,0.0,0.0,0.0,0.0
18677,850.0,1,1.0,400.0,1,1,95,1,0,0,1.0,0.0,0.0,0.0,0.0
6616,3200.0,2,2.0,1500.0,1,1,168,1,0,0,0.0,0.0,1.0,0.0,0.0
11398,2650.0,2,2.0,1691.0,1,1,33,1,0,0,1.0,0.0,0.0,0.0,0.0
11105,2168.47,2,1.0,699.0,0,0,145,1,0,0,1.0,0.0,0.0,0.0,0.0


In [81]:
# Make sure there are no more string values
s = df.dtypes
s = s[s != 'object']
assert len(s) == len(df.columns)
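
An alternative way to run the same check, sketched on a toy frame, is to list the non-numeric columns directly with `select_dtypes`. This may be more robust if strings are stored with a dedicated string dtype rather than `object`, which is an assumption about the pandas version in use. Note that boolean columns would also be reported as non-numeric.

```python
import pandas as pd

# Toy frame for illustration: two numeric columns and one string column.
toy = pd.DataFrame({
    "beds": [1, 2],
    "sq_feet": [700.0, 900.0],
    "type": ["Apartment", "House"],
})

# Columns that are not integer or float typed.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

In the notebook, the equivalent assertion would be that this list is empty for `df`.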

---

## 3. Feature Selection

In [84]:
# Train-test split
X = df.drop('price', axis=1)
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=54)
X.shape

(18832, 14)

In [85]:
# Select features using Mutual Info Regression, since there might be non-linear relationships between the features and the price.
selector = SelectKBest(score_func=mutual_info_regression, k=10)
X_train_selected = selector.fit_transform(X_train, y_train)
selected_features = selector.get_feature_names_out(X.columns)
selected_columns = list(selected_features)
selected_columns.append('price')
df = df[selected_columns]
df.sample(5)

Unnamed: 0,beds,baths,sq_feet,cats,dogs,postal_code,furnished,type_Apartment,type_Basement,type_House,price
8884,1,1.0,721.0,1,1,10,0,1.0,0.0,0.0,1750.0
8392,1,1.0,767.0,1,1,33,0,1.0,0.0,0.0,1967.0
6230,1,1.0,700.0,0,0,290,1,0.0,0.0,0.0,1400.0
4948,1,1.0,653.0,1,1,215,0,1.0,0.0,0.0,1515.0
14411,2,1.0,1371.060606,1,1,11,0,1.0,0.0,0.0,2824.0
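
To see why mutual information suits non-linear relationships, the sketch below runs `SelectKBest` on synthetic data (not the rental dataset) where the target is the square of one feature. It also fixes `random_state` through `functools.partial`, which is an addition not present in the notebook, so that the scores are repeatable. The names `X_toy`, `y_toy` and `toy_selector` are illustrative.

```python
from functools import partial

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Synthetic data: the target depends on `signal` through a square,
# so the linear relationship is weak even though the dependence is strong.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "signal": rng.uniform(-1, 1, 500),
    "noise_a": rng.uniform(-1, 1, 500),
    "noise_b": rng.uniform(-1, 1, 500),
})
y_toy = X_toy["signal"] ** 2 + rng.normal(0, 0.01, 500)

# Fixing random_state makes the mutual information estimate repeatable.
score_func = partial(mutual_info_regression, random_state=0)
toy_selector = SelectKBest(score_func=score_func, k=1).fit(X_toy, y_toy)

# Pair each feature with its score and list the features that were kept.
mi_scores = pd.Series(toy_selector.scores_, index=X_toy.columns)
mi_scores = mi_scores.sort_values(ascending=False)
kept = X_toy.columns[toy_selector.get_support()].tolist()
print(mi_scores)
print(kept)
```

Printing the scores alongside the selected names makes it easier to judge how close the dropped features were to the cutoff.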


In [86]:
# Dump data frame and frequency map
df.to_csv('Data/preprocessed_data.csv', index=False)

with open('Data/postcode_frequencies.pkl', 'wb') as handle:
    pickle.dump(pc_frequency_map, handle, protocol=pickle.HIGHEST_PROTOCOL)
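
The saved frequency map is presumably meant to encode postal codes at prediction time. A minimal sketch of that round trip follows, using a temporary file and a two-entry map so it runs on its own. The two counts are taken from the output above, while the code `X0X` and the choice to encode unseen postal codes as 0 are assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in for pc_frequency_map (counts copied from the output above).
freq_map = pd.Series({"K1N": 369, "T2R": 329}, name="count")

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "postcode_frequencies.pkl"
    with open(path, "wb") as handle:
        pickle.dump(freq_map, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as handle:
        loaded_map = pickle.load(handle)

# New listings: one known postal code and one that is not in the map.
new_listings = pd.DataFrame({"postal_code": ["T2R", "X0X"]})

# Unseen codes map to NaN; here they are encoded as 0 (an assumption).
encoded = new_listings["postal_code"].map(loaded_map).fillna(0).astype(int)
print(encoded.tolist())
```

Whatever fallback is chosen for unseen postal codes, it should be applied consistently wherever the model is used.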

## End