# Canadian Rental Prices Feature Engineering and Data Preprocessing

## Introduction

This notebook prepares the data for a regression model that predicts the price of a rental from multiple factors. The data has already been explored and cleaned in [this notebook](DataAnalysis.ipynb). The dataset pairs rental prices with variables such as location, type of rental, number of rooms and furnishing.

In [4]:
import numpy as np
import pandas as pd
import pickle
from sklearn.feature_selection import mutual_info_regression, SelectKBest
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

In [5]:
# Load the data
df = pd.read_csv('Data/cleaned_data.csv')
df.shape

(18832, 19)

In [6]:
df.sample(5)

Unnamed: 0,rentfaster_id,city,province,address,latitude,longitude,lease_term,type,price,beds,baths,sq_feet,link,furnishing,availability_date,smoking,cats,dogs,postal_code
12694,564823,Ottawa,Ontario,245 Laurier E,45.426332,-75.68105,Long Term,Apartment,1595.0,0,1.0,1371.060606,/on/ottawa/rentals/apartment/studio/centretown...,Furnished,Immediate,Non-Smoking,True,True,K1N
7513,567675,Lethbridge,Alberta,Indian Battle Heights,49.676164,-112.898252,Long Term,Room For Rent,885.0,1,1.0,1371.060606,/ab/lethbridge/rentals/shared/1-bedroom/indian...,Unfurnished,July 01,Non-Smoking,False,False,T1K
8283,571834,Kelowna,British Columbia,,49.824022,-119.490133,Long Term,Condo Unit,2800.0,2,2.0,1250.0,/bc/kelowna/rentals/condo/2-bedrooms/furnished...,Furnished,July 01,Non-Smoking,False,False,V1W
6908,566851,Edmonton,Alberta,Windermere,53.427274,-113.628369,Negotiable,House,4100.0,5,3.5,2670.0,/ab/edmonton/rentals/house/5-bedrooms/winderme...,Furnished,July 01,Non-Smoking,False,False,T6W
14003,568787,Toronto,Ontario,25 Dalhousie Street and 30 Mutual Street,43.654262,-79.37489,Long Term,Apartment,4050.0,2,2.0,892.0,/on/toronto/rentals/apartment/1-bedroom/pet-fr...,Unfurnished,August 01,Unknown,True,True,M5B


---

## 1. Simplify dataset by removing columns

**rentfaster_id**: An ID column is not correlated in any way with the price.

**province, city, address, latitude & longitude**: The postal code already encodes the province, city and neighbourhood, so these columns are redundant.

**link**: The link has no correlation with the price.

**availability_date**: It has nothing to do with the price.
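
The redundancy claim for the location columns can be sanity-checked before dropping them. The snippet below is a minimal sketch, not part of the original analysis: it rebuilds a tiny frame from the five sample rows shown above and counts how many distinct provinces and cities appear per postal code. The names `sample`, `per_code` and `is_redundant` are illustrative. On the full dataset the same check may well flag postal codes that span more than one city.

```python
import pandas as pd

# Five rows copied from the df.sample(5) output above (illustrative subset only).
sample = pd.DataFrame({
    "city": ["Ottawa", "Lethbridge", "Kelowna", "Edmonton", "Toronto"],
    "province": ["Ontario", "Alberta", "British Columbia", "Alberta", "Ontario"],
    "postal_code": ["K1N", "T1K", "V1W", "T6W", "M5B"],
})

# Count the distinct provinces and cities observed for each postal code.
per_code = sample.groupby("postal_code")[["province", "city"]].nunique()

# If every count is 1, the postal code determines both columns in this sample.
is_redundant = bool((per_code == 1).all().all())
print(per_code)
print(is_redundant)
```

Running the same `groupby` on the real `df` (before the drop) would show whether the assumption holds for all 698 postal codes.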

In [10]:
df = df.drop(columns=['rentfaster_id', 'province', 'city', 'address', 'latitude', 'longitude', 'link', 'availability_date'])
df.sample(5)

Unnamed: 0,lease_term,type,price,beds,baths,sq_feet,furnishing,smoking,cats,dogs,postal_code
6384,Long Term,Apartment,1691.0,2,2.0,1371.060606,Unfurnished,Non-Smoking,False,True,T6W
5671,Long Term,Basement,1500.0,2,1.0,925.0,Unfurnished,Non-Smoking,True,True,T5P
2183,Long Term,Loft,1150.0,0,1.0,1371.060606,Unfurnished,Non-Smoking,False,False,T2Z
8028,Long Term,Apartment,3220.0,2,2.0,833.0,Unfurnished,Non-Smoking,True,True,V3J
3728,Negotiable,Room For Rent,1250.0,1,1.0,1500.0,Unfurnished,Non-Smoking,False,False,T3M


---

## 2. Encode categorical columns

In [13]:
df.nunique()

lease_term        7
type             15
price          2745
beds             10
baths            16
sq_feet        1659
furnishing        3
smoking           5
cats              2
dogs              2
postal_code     698
dtype: int64

In [14]:
# Create dummy variables for cats and dogs
df = df.assign(
    dogs=np.where(df['dogs'] == True, 1, 0),
    cats=np.where(df['cats'] == True, 1, 0),
)
df.sample(5)

Unnamed: 0,lease_term,type,price,beds,baths,sq_feet,furnishing,smoking,cats,dogs,postal_code
5296,Negotiable,Basement,1499.0,2,1.0,1890.0,Unfurnished,Non-Smoking,0,0,T6M
1737,Negotiable,Room For Rent,700.0,1,1.0,1372.0,Furnished,Non-Smoking,0,0,T3L
11032,Long Term,Apartment,1425.0,1,1.0,320.0,Unfurnished,Non-Smoking,0,0,K7L
18347,Long Term,Apartment,1150.0,1,1.0,0.0,Unfurnished,Non-Smoking,1,1,S4T
2560,Long Term,Apartment,1900.0,1,1.0,704.0,Unfurnished,Non-Smoking,1,1,T2S


In [15]:
# Get values of lease term
df['lease_term'].value_counts(normalize=True)

lease_term
Long Term     0.935960
Negotiable    0.047525
Short Term    0.011576
12 months     0.003239
Unknown       0.001540
6 months      0.000106
months        0.000053
Name: proportion, dtype: float64

In [16]:
# To reduce cardinality, convert the lease_term column into a dummy variable
df['long_term'] = np.where(df['lease_term'] == 'Long Term', 1, 0)
df = df.drop(columns='lease_term')
df.sample(5)

Unnamed: 0,type,price,beds,baths,sq_feet,furnishing,smoking,cats,dogs,postal_code,long_term
5529,House,2300.0,3,2.5,1350.0,Unfurnished,Non-Smoking,0,0,T6M,1
8934,Apartment,2700.0,3,1.0,1039.0,Unfurnished,Non-Smoking,1,1,V3R,1
3213,Duplex,3000.0,3,2.5,1450.0,Furnished,Non-Smoking,0,0,T3J,0
8805,Basement,1600.0,1,1.0,805.0,Unfurnished,Non-Smoking,1,1,T3L,1
7516,Apartment,1550.0,2,1.0,808.0,Unfurnished,Non-Smoking,1,1,T1K,1


In [17]:
# Get values of furnishing
df['furnishing'].value_counts(normalize=True)

furnishing
Unfurnished    0.914667
Furnished      0.072377
Negotiable     0.012957
Name: proportion, dtype: float64

In [18]:
# Group Negotiable and Unfurnished and create dummy variable
df['furnished'] = np.where(df['furnishing'] == 'Furnished', 1, 0)
df = df.drop(columns='furnishing')
df.sample(5)

Unnamed: 0,type,price,beds,baths,sq_feet,smoking,cats,dogs,postal_code,long_term,furnished
18313,Apartment,1345.0,1,1.0,640.0,Non-Smoking,1,1,S4P,1,0
7390,Apartment,1085.0,1,1.0,612.0,Non-Smoking,1,1,T8V,1,0
1740,Basement,1600.0,1,1.0,490.0,Non-Smoking,1,1,T2N,1,0
7298,Apartment,1185.0,1,1.0,0.0,Non-Smoking,1,1,T9H,1,0
7367,Condo Unit,1950.0,2,1.0,1150.0,Non-Smoking,1,1,T8L,0,1


In [19]:
# Get values of smoking
df['smoking'].value_counts(normalize=True)

smoking
Non-Smoking            0.852220
Unknown                0.122610
Smoke Free Building    0.014762
Smoking Allowed        0.007540
Negotiable             0.002867
Name: proportion, dtype: float64

In [20]:
# Reduce cardinality of smoking and create dummy variable
df['smoking_allowed'] = np.where(df['smoking'] == 'Smoking Allowed', 1, 0)
df = df.drop(columns='smoking')
df.sample(5)

Unnamed: 0,type,price,beds,baths,sq_feet,cats,dogs,postal_code,long_term,furnished,smoking_allowed
7478,Apartment,1545.0,2,1.0,861.0,1,1,T1J,1,0,0
3703,Condo Unit,2750.0,2,2.0,950.0,0,0,T2Z,0,1,0
5006,Apartment,1538.0,2,2.0,937.0,1,1,T5K,1,0,0
9739,Apartment,1246.0,1,1.0,563.0,1,1,R2M,1,0,1
9089,Apartment,3401.0,1,1.0,619.0,1,1,V6G,1,0,0


In [21]:
# Get values of type
df['type'].value_counts(normalize=True)

type
Apartment        0.699713
Condo Unit       0.069138
Townhouse        0.055013
House            0.053473
Basement         0.053260
Main Floor       0.024798
Room For Rent    0.021665
Duplex           0.016408
Office Space     0.003027
Storage          0.001115
Parking Spot     0.001115
Loft             0.000797
Acreage          0.000319
Vacation Home    0.000106
Mobile           0.000053
Name: proportion, dtype: float64

In [22]:
# Keep the 5 most frequent types and group the rest as 'Other'
df['type'] = np.where(df['type'].isin(['Apartment', 'Condo Unit', 'Townhouse', 'House', 'Basement']), df['type'], 'Other')
df.sample(5)

Unnamed: 0,type,price,beds,baths,sq_feet,cats,dogs,postal_code,long_term,furnished,smoking_allowed
17804,Apartment,1595.0,1,1.0,700.0,1,1,H3X,1,0,0
16590,Apartment,1960.0,1,1.0,610.0,1,0,H3G,1,0,0
15345,Apartment,4800.0,3,2.0,1200.0,1,1,M5N,1,0,0
16220,Apartment,1324.0,1,1.0,0.0,1,1,J8Y,1,0,0
14613,Apartment,2106.75,0,1.0,338.0,0,0,M5R,1,0,0


The remaining categorical column, `type`, will be one-hot encoded. The postal code will be frequency encoded instead, because one-hot encoding its 698 distinct values would add far too many input features.
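
To make the trade-off concrete, here is a minimal sketch on a toy column. The frame `toy` and its six values are made up for illustration (the codes themselves appear in this notebook's outputs). It shows that one-hot encoding adds one column per distinct value, while frequency encoding keeps a single numeric column.

```python
import pandas as pd

# Toy postal codes, chosen only for illustration.
toy = pd.DataFrame({"postal_code": ["K1N", "T2R", "K1N", "T5K", "K1N", "T2R"]})

# One-hot encoding: one new column per distinct value.
one_hot = pd.get_dummies(toy["postal_code"], prefix="pc")

# Frequency encoding: replace each value with the number of times it occurs.
counts = toy["postal_code"].value_counts()
freq_encoded = toy["postal_code"].map(counts)

print(one_hot.shape[1])       # 3 columns for 3 distinct codes
print(freq_encoded.tolist())  # [3, 2, 3, 1, 3, 2]
```

With 698 distinct postal codes, the one-hot version would add 698 columns (or 697 with one dropped), whereas the frequency version adds none.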

In [24]:
# Apply OneHotEncoder for type
encoder = OneHotEncoder(drop=['Other'], sparse_output=False)
encoded_cols = encoder.fit_transform(df[['type']])
# Reuse df's index so the join aligns the encoded rows with the original rows
new_df = pd.DataFrame(encoded_cols, columns=encoder.get_feature_names_out(['type']), index=df.index)
df = df.drop(columns=['type']).join(new_df)
df.sample(5)

Unnamed: 0,price,beds,baths,sq_feet,cats,dogs,postal_code,long_term,furnished,smoking_allowed,type_Apartment,type_Basement,type_Condo Unit,type_House,type_Townhouse
4570,850.0,0,1.0,350.0,1,1,T2R,1,0,0,1.0,0.0,0.0,0.0,0.0
9158,2525.0,1,1.0,639.0,0,0,V6E,1,0,0,1.0,0.0,0.0,0.0,0.0
2492,1450.0,2,1.0,1371.060606,1,0,T2A,1,0,0,0.0,1.0,0.0,0.0,0.0
16777,1295.0,0,1.0,305.0,0,0,H2X,1,0,0,1.0,0.0,0.0,0.0,0.0
4262,1100.0,1,1.0,900.0,0,1,T3E,1,0,0,0.0,0.0,0.0,0.0,0.0


In [25]:
# Map each postal code to its frequency
pc_frequency_map = df['postal_code'].value_counts()
pc_frequency_map

postal_code
K1N    369
T2R    329
T5K    290
T3M    289
T2T    277
      ... 
L1T      2
K7N      2
L9G      2
L2H      2
P1A      2
Name: count, Length: 698, dtype: int64

In [26]:
# Apply frequency encoding
df['postal_code'] = df['postal_code'].map(pc_frequency_map)
df.sample(5)

Unnamed: 0,price,beds,baths,sq_feet,cats,dogs,postal_code,long_term,furnished,smoking_allowed,type_Apartment,type_Basement,type_Condo Unit,type_House,type_Townhouse
11398,2650.0,2,2.0,1691.0,1,1,33,1,0,0,1.0,0.0,0.0,0.0,0.0
11847,2751.0,2,1.0,965.0,1,1,14,1,0,0,1.0,0.0,0.0,0.0,0.0
4247,2000.0,3,1.0,1371.060606,0,0,247,1,0,0,1.0,0.0,0.0,0.0,0.0
16541,2625.0,1,1.0,721.0,1,1,166,1,0,0,1.0,0.0,0.0,0.0,0.0
15754,4600.0,3,2.5,1655.0,1,1,18,1,0,0,0.0,0.0,0.0,0.0,1.0


In [27]:
# Make sure there are no more string values
s = df.dtypes
s = s[s != 'object']
assert len(s) == len(df.columns)

---

## 3. Feature Selection

In [30]:
# Train-test split
X = df.drop('price', axis=1)
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=54)
X.shape

(18832, 14)

In [31]:
# Select features using Mutual Info Regression, since there might be non-linear relationships between the features and the price.
selector = SelectKBest(score_func=mutual_info_regression, k=10)
X_train_selected = selector.fit_transform(X_train, y_train)
selected_features = selector.get_feature_names_out(X.columns)
selected_columns = list(selected_features)
selected_columns.append('price')
df = df[selected_columns]
df.sample(5)

Unnamed: 0,beds,baths,sq_feet,cats,dogs,postal_code,type_Apartment,type_Basement,type_Condo Unit,type_House,price
15853,0,1.0,556.0,0,0,20,1.0,0.0,0.0,0.0,2000.0
17635,0,1.0,450.0,1,1,22,1.0,0.0,0.0,0.0,1750.0
2288,2,2.0,1082.0,0,0,144,0.0,0.0,0.0,0.0,2095.0
8721,2,1.0,1371.060606,1,1,145,1.0,0.0,0.0,0.0,2325.0
2261,2,1.0,762.0,0,0,98,0.0,0.0,0.0,0.0,2100.0


In [32]:
# Save the preprocessed data frame and the postal code frequency map
df.to_csv('Data/preprocessed_data.csv', index=False)

with open('Data/postcode_frequencies.pkl', 'wb') as handle:
    pickle.dump(pc_frequency_map, handle, protocol=pickle.HIGHEST_PROTOCOL)
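
The saved frequency map is presumably meant to encode postal codes for new listings at prediction time. The following is a minimal sketch of that round trip under stated assumptions: `toy_map` is a three-entry stand-in for `pc_frequency_map` (counts copied from the output above), the file is written to a temporary directory instead of `Data/`, and filling unseen postal codes with 0 is an illustrative choice rather than something this notebook specifies.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for pc_frequency_map, with counts copied from the notebook output.
toy_map = pd.Series({"K1N": 369, "T2R": 329, "T5K": 290}, name="count")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "postcode_frequencies.pkl"
    # Save the map the same way the notebook does.
    with open(path, "wb") as handle:
        pickle.dump(toy_map, handle, protocol=pickle.HIGHEST_PROTOCOL)
    # Load it back, as a prediction script might.
    with open(path, "rb") as handle:
        loaded_map = pickle.load(handle)

# Hypothetical new listings; "X0A" is absent from the map.
new_listings = pd.DataFrame({"postal_code": ["T2R", "X0A"]})

# Codes missing from the map become NaN after .map(); 0 is an assumed fallback.
new_listings["postal_code"] = (
    new_listings["postal_code"].map(loaded_map).fillna(0).astype(int)
)
print(new_listings["postal_code"].tolist())  # [329, 0]
```

Whatever fallback is chosen for unseen codes should be applied consistently wherever the map is used.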

## End