# Canadian Rental Prices Data Preprocessing

## 1. Introduction

This notebook prepares the data for a regression model that will predict the price of a rental based on multiple factors. Our target variable is the price.

**About the dataset**

We will work with the data cleaned in [this notebook](EDA.ipynb).

The regression model will be built in [this notebook](Regression.ipynb).

In [6]:
import numpy as np
import pandas as pd
import pickle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

In [7]:
# Load the data
df = pd.read_pickle('../Data/cleaned_data.pkl')
df.shape

(19045, 18)

In [8]:
df.sample(5)

Unnamed: 0,rentfaster_id,city,province,address,latitude,longitude,lease_term,type,price,beds,baths,sq_feet,link,furnishing,availability_date,smoking,cats,dogs
18405,334153,Moose Jaw,Saskatchewan,602-660 Laurier Street West,50.406825,-105.552401,Long Term,Townhouse,1745.0,3 Beds,1.0,930.0,/sk/moose-jaw/rentals/townhouse/3-bedrooms/lyn...,Unfurnished,Immediate,Non-Smoking,True,True
19028,334289,Yorkton,Saskatchewan,320 Gladstone Ave S,51.197961,-102.474704,Long Term,Apartment,1060.0,1 Bed,1.0,618.0,/sk/yorkton/rentals/apartment/1-bedroom/pet-fr...,Unfurnished,Immediate,Non-Smoking,True,True
8189,474801,Courtenay,British Columbia,1248 9th Street,49.682029,-125.009075,Long Term,Apartment,1595.0,1 Bed,1.0,795.0,/bc/courtenay/rentals/apartment/1-bedroom/pet-...,Unfurnished,August 01,Non-Smoking,True,False
14498,555639,Toronto,Ontario,120 Raglan Avenue,43.686654,-79.421291,Long Term,Apartment,1944.0,Studio,1.0,398.0,/on/toronto/rentals/apartment/1-bedroom/555639,Unfurnished,Immediate,Unknown,False,False
655,535808,Calgary,Alberta,2232 33 Avenue Southwest,51.0242,-114.113721,Long Term,Apartment,2000.0,1 Bed,1.5,700.0,/ab/calgary/rentals/apartment/1-bedroom/south-...,Unfurnished,Immediate,Non-Smoking,True,True


## 2. Remove unnecessary columns

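**rentfaster_id**: An identifier carries no information about the price.

**address**: It is redundant because we already have the latitude and longitude.

**link**: Judging by the sample above, the listing URL mostly repeats the province, city, type, and bedroom count.

**availability_date**: It has too many unique values to be treated as a category.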

In [11]:
df.drop(columns=['rentfaster_id', 'address', 'link', 'availability_date'], inplace=True)

## 3. Encode categorical columns

### 3.1 Boolean

In [14]:
# Convert the boolean columns to 0.0/1.0 floats
df = df.assign(
    dogs=np.where(df['dogs'] == True, 1.0, 0.0),
    cats=np.where(df['cats'] == True, 1.0, 0.0),
)
df.sample(5)

Unnamed: 0,city,province,latitude,longitude,lease_term,type,price,beds,baths,sq_feet,furnishing,smoking,cats,dogs
4673,Cochrane,Alberta,51.198296,-114.511496,Long Term,Basement,1650.0,1 Bed,1.0,700.0,Unfurnished,Non-Smoking,0.0,0.0
15721,Toronto,Ontario,43.757855,-79.295431,Long Term,Duplex,2900.0,3 Beds,1.0,1371.060606,Unfurnished,Non-Smoking,0.0,0.0
3649,Calgary,Alberta,51.03953,-114.088654,Long Term,Condo Unit,2450.0,2 Beds,1.5,894.0,Furnished,Non-Smoking,0.0,0.0
16968,Montréal,Quebec,45.491856,-73.583386,Long Term,Apartment,1850.0,1 Bed,1.0,625.0,Unfurnished,Non-Smoking,1.0,0.0
14082,Toronto,Ontario,43.643465,-79.3961,Long Term,Apartment,4135.0,2 Beds,2.0,1091.0,Unfurnished,Unknown,1.0,1.0


### 3.2 Non-ordinal

In [16]:
# Check the number of unique values
df[['city', 'province', 'lease_term', 'type', 'furnishing', 'smoking']].nunique()

city          269
province       10
lease_term      7
type           15
furnishing      4
smoking         5
dtype: int64

We will use one-hot encoding for all of the features listed above except city: with 269 unique values, it would add far too many columns to the input. For city we will use frequency encoding instead, which replaces each city with its share of the listings. The cities with the most ads tend to be the big cities, which are most likely associated with higher prices.

In [18]:
# Map each city to its share of the listings
city_frequency_map = df['city'].value_counts(normalize=True)
city_frequency_map

city
Calgary             0.247887
Edmonton            0.134944
Toronto             0.125755
Montréal            0.079601
Ottawa              0.056918
                      ...   
Christopher Lake    0.000053
Priddis             0.000053
Innisfail           0.000053
Langdon             0.000053
Crowsnest Pass      0.000053
Name: proportion, Length: 269, dtype: float64

In [19]:
# Apply frequency encoding
df['city'] = df['city'].map(city_frequency_map)
df.head()

Unnamed: 0,city,province,latitude,longitude,lease_term,type,price,beds,baths,sq_feet,furnishing,smoking,cats,dogs
0,0.006721,Alberta,51.305962,-114.012515,Long Term,Townhouse,2495.0,2 Beds,2.5,1403.0,Unfurnished,Non-Smoking,1.0,1.0
1,0.006721,Alberta,51.305962,-114.012515,Long Term,Townhouse,2695.0,3 Beds,2.5,1496.0,Unfurnished,Non-Smoking,1.0,1.0
2,0.006721,Alberta,51.305962,-114.012515,Long Term,Townhouse,2295.0,2 Beds,2.5,1180.0,Unfurnished,Non-Smoking,1.0,1.0
3,0.006721,Alberta,51.305962,-114.012515,Long Term,Townhouse,2095.0,2 Beds,2.5,1403.0,Unfurnished,Non-Smoking,1.0,1.0
4,0.006721,Alberta,51.305962,-114.012515,Long Term,Townhouse,2495.0,2 Beds,2.5,1351.0,Unfurnished,Non-Smoking,1.0,1.0
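
The frequency map is saved at the end of this notebook, so it may later be applied to listings that were not in this dataset. With `Series.map`, any city missing from the map becomes `NaN`. Here is a minimal sketch of one way to handle that; the toy city names and the fallback value of 0.0 are assumptions for illustration, not choices made in this notebook.

```python
import pandas as pd

# Toy data; the city names and counts are illustrative only
cities = pd.Series(['Calgary', 'Calgary', 'Calgary', 'Toronto'])

# Share of rows per city, computed the same way as above
frequency_map = cities.value_counts(normalize=True)

# New listings, including a city the map has never seen
new_cities = pd.Series(['Toronto', 'Calgary', 'Whitehorse'])

# Unseen cities map to NaN; fall back to 0.0 (an assumed default)
encoded = new_cities.map(frequency_map).fillna(0.0)
print(encoded.tolist())
```

Another reasonable fallback would be the smallest frequency seen during training; which one works better depends on the downstream model.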


In [20]:
# Apply OneHotEncoder to the other features
onehot_cols = ['province', 'lease_term', 'type', 'furnishing', 'smoking']
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded_cols = encoder.fit_transform(df[onehot_cols])
# Keep df's index so the join below aligns row by row
new_df = pd.DataFrame(encoded_cols, columns=encoder.get_feature_names_out(onehot_cols), index=df.index)
df = df.drop(columns=onehot_cols).join(new_df)
df.head()

Unnamed: 0,city,latitude,longitude,price,beds,baths,sq_feet,cats,dogs,province_British Columbia,...,type_Storage,type_Townhouse,type_Vacation Home,furnishing_Negotiable,furnishing_Unfurnished,"furnishing_Unfurnished, Negotiable",smoking_Non-Smoking,smoking_Smoke Free Building,smoking_Smoking Allowed,smoking_Unknown
0,0.006721,51.305962,-114.012515,2495.0,2 Beds,2.5,1403.0,1.0,1.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
1,0.006721,51.305962,-114.012515,2695.0,3 Beds,2.5,1496.0,1.0,1.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
2,0.006721,51.305962,-114.012515,2295.0,2 Beds,2.5,1180.0,1.0,1.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
3,0.006721,51.305962,-114.012515,2095.0,2 Beds,2.5,1403.0,1.0,1.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
4,0.006721,51.305962,-114.012515,2495.0,2 Beds,2.5,1351.0,1.0,1.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0


### 3.3 Ordinal

In [22]:
df[['beds', 'baths']].nunique()

beds     12
baths    17
dtype: int64

In [23]:
# Use OrdinalEncoder for the 2 features; by default the codes follow the sorted order of the values
ordinal_cols = ['beds', 'baths']
encoder = OrdinalEncoder()
encoded_cols = encoder.fit_transform(df[ordinal_cols])
# Keep df's index so the join below aligns row by row
new_df = pd.DataFrame(encoded_cols, columns=encoder.get_feature_names_out(ordinal_cols), index=df.index)
df = df.drop(columns=ordinal_cols).join(new_df)
df.head()

Unnamed: 0,city,latitude,longitude,price,sq_feet,cats,dogs,province_British Columbia,province_Manitoba,province_New Brunswick,...,type_Vacation Home,furnishing_Negotiable,furnishing_Unfurnished,"furnishing_Unfurnished, Negotiable",smoking_Non-Smoking,smoking_Smoke Free Building,smoking_Smoking Allowed,smoking_Unknown,beds,baths
0,0.006721,51.305962,-114.012515,2495.0,1403.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,4.0
1,0.006721,51.305962,-114.012515,2695.0,1496.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,2.0,4.0
2,0.006721,51.305962,-114.012515,2295.0,1180.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,4.0
3,0.006721,51.305962,-114.012515,2095.0,1403.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,4.0
4,0.006721,51.305962,-114.012515,2495.0,1351.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,4.0
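
One thing worth checking with `OrdinalEncoder` is the category order. By default the categories are sorted, and for string labels that means alphabetical order, which may not match the intended ranking. The sketch below uses hypothetical bed labels (the real `beds` column may contain different ones) to show the default order and how an explicit order can be passed through the `categories` parameter.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical labels; the real 'beds' column may contain others
toy = pd.DataFrame({'beds': ['Studio', '1 Bed', '2 Beds', '10 Beds']})

# Default: categories are sorted as strings
default_encoder = OrdinalEncoder().fit(toy)
default_order = default_encoder.categories_[0].tolist()

# Explicit: pass the intended order, smallest to largest
bed_order = ['Studio', '1 Bed', '2 Beds', '10 Beds']
explicit_encoder = OrdinalEncoder(categories=[bed_order])
codes = explicit_encoder.fit_transform(toy).ravel().tolist()

print(default_order)
print(codes)
```

In this toy case the default order places '10 Beds' before '2 Beds' and 'Studio' last, whereas the explicit list gives codes that grow with the number of bedrooms.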


In [24]:
# Make sure every column is now float64 (no string values left)
assert (df.dtypes == 'float64').all()

## 4. Train-test split

Since our dataset is fairly large (19,045 rows), we will hold out 15% of the data for testing.

In [27]:
# Separate features and target variable
X = df.drop('price', axis=1)
y = df['price']

In [28]:
# Split data into train-test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
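
As a quick sanity check on what a 15% hold-out means for 19,045 rows, the sketch below runs the same call on placeholder arrays of that length (the zero-filled arrays are stand-ins, not the real features) and prints the resulting set sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the dataset (19045)
X_demo = np.zeros((19045, 1))
y_demo = np.zeros(19045)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=42
)

# The two parts always add up to the full dataset
print(len(X_tr), len(X_te))
```

Fixing `random_state` makes the split reproducible, so a later notebook run produces the same train and test rows.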

## 5. Feature scaling

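Since our features span very different ranges, we standardize them. The scaler is fitted on the training set only and then applied to the test set, so that no test-set statistics leak into training.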

In [31]:
scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)
print('Train data', scaled_X_train[:3])
print('Test data', scaled_X_test[:3])

Train data [[-1.1062918   0.41405998 -0.71184004 -0.76529458  0.68471325  0.73033565
  -0.2485693  -0.17797249 -0.02723674 -0.02223595 -0.03145416 -0.11574442
  -0.67189883 -0.35568746 -0.19086642 -0.01111592  0.26296386 -0.22118124
  -0.1129724  -0.04786314 -0.0078599   0.65538554 -0.23628572 -0.27027367
  -0.13072766 -0.23947471 -0.02942086 -0.158354   -0.01111592 -0.05510103
  -0.03427956 -0.14603055 -0.03336424 -0.2441969  -0.01111592 -0.11184535
   0.30660323 -0.01361458  0.41547397 -0.12163216 -0.08785854 -0.37233969
  -0.62837688 -0.53719371]
 [-1.06317863 -1.35400481  0.94097211  0.86958043 -1.46046539 -1.36923344
  -0.2485693  -0.17797249 -0.02723674 -0.02223595 -0.03145416 -0.11574442
   1.4883193  -0.35568746 -0.19086642 -0.01111592  0.26296386 -0.22118124
  -0.1129724  -0.04786314 -0.0078599  -1.52581945 -0.23628572 -0.27027367
  -0.13072766 -0.23947471 -0.02942086 -0.158354   -0.01111592 -0.05510103
  -0.03427956  6.84788224 -0.03336424 -0.2441969  -0.01111592 -0.11184535
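
To illustrate what standardization does, here is a minimal sketch on toy matrices (the numbers are made up). After `fit_transform`, each training column has mean 0 and standard deviation 1, and the test rows are transformed with the training statistics, so they can fall outside that range.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrices; the values are illustrative
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
scaled_train = scaler.fit_transform(train)  # learns mean and std from train
scaled_test = scaler.transform(test)        # reuses the train statistics

print(scaled_train.mean(axis=0))  # approximately 0 for every column
print(scaled_train.std(axis=0))   # approximately 1 for every column
print(scaled_test)
```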


In [32]:
# Dump preprocessed data into a pickle file
data = {
    'city_frequency_map': city_frequency_map,
    'scaled_X_train': scaled_X_train,
    'scaled_X_test': scaled_X_test,
    'y_train': y_train,
    'y_test': y_test
}

with open('../Data/preprocessed_data.pkl', 'wb') as handle:
    pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)
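
For reference, here is a sketch of how a downstream notebook might read the file back. It writes a stand-in dictionary with the same keys to a temporary file, so it runs without `../Data/preprocessed_data.pkl`; the placeholder values and the temporary path are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np

# Stand-in payload with the same keys as above; the values are placeholders
payload = {
    'city_frequency_map': {'Calgary': 0.75, 'Toronto': 0.25},
    'scaled_X_train': np.zeros((3, 2)),
    'scaled_X_test': np.zeros((1, 2)),
    'y_train': np.array([1000.0, 1500.0, 2000.0]),
    'y_test': np.array([1200.0]),
}

# Write to a temporary file rather than ../Data/preprocessed_data.pkl
path = os.path.join(tempfile.mkdtemp(), 'preprocessed_data.pkl')
with open(path, 'wb') as handle:
    pickle.dump(payload, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Read it back the way a downstream notebook might
with open(path, 'rb') as handle:
    loaded = pickle.load(handle)

X_train_loaded = loaded['scaled_X_train']
print(sorted(loaded.keys()), X_train_loaded.shape)
```

Note that the dictionary saved above holds the transformed arrays and the city map, but not the fitted encoders or the scaler; if new listings need to be preprocessed later, those objects would have to be saved as well.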

## End