### Goal
We are going to build a model for predicting a price for apartments and condominiums in Singapore.

##### Task for the current workbook:
- Build some models for predicting the price, based on the data prepared in the workbook 1 and the insights into the data in the workbook 2

#### Strategy
- Based on the analysis in the Workbook 2, we concluded that linear models are not suitable for this dataset because of significant non-linearities in the data.
- Therefore, we will be using models which are more suitable for non-linear data, such as Decision Tree regression, Random forest regression, and artificial neural networks.
- Using some variables such as project_name, street and postal_code will lead to high-dimensional binary vector spaces. We can use two strategies to deal with this:
    - Try models without these variables. Latitude and longitude may partly [replace them]
    - Use dimensionality reduction, e.g. PCA, and run the regressions on that data
- Our data includes several years of observations. From the analysis we have seen that the prices have changed over the years. Therefore, we may test how robust the model is by choosing y1 as a train set and y2 as a test set

#### Plan:
- Use data cleaned from outliers
- Numeric and geographical data only:
    - Decision tree regression
    - Random forest regression
- All data:
    - PCA:
        - Generate principal components. Try different number of PC, make sure that explained_variance_ratio_ cummulative is sufficient (?)
        - Decision tree regression
        - Random forest regression
    - ANN:
        - Run ANN on all data

### Importing data

In [4]:
import pandas as pd
import numpy as np

data_path = "data/"
# file_leases_no_outliers = "2_lease_no_outliers.csv" 
file_freehold_no_outliers = "2_freehold_no_outliers.csv" 
TRAIN_SIZE = 0.7

In [6]:
df = pd.read_csv(data_path+file_freehold_no_outliers, parse_dates=['sale_date'])

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28302 entries, 0 to 28301
Data columns (total 20 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   project_name     28302 non-null  object        
 1   street           28302 non-null  object        
 2   postal_district  28302 non-null  int64         
 3   sale_type        28302 non-null  object        
 4   area_type        28302 non-null  object        
 5   property_type    28302 non-null  object        
 6   tenure_type      28302 non-null  object        
 7   mkt_segment      28302 non-null  object        
 8   sale_month       28302 non-null  int64         
 9   sale_year        28302 non-null  int64         
 10  sale_date        28302 non-null  datetime64[ns]
 11  floor_level      28302 non-null  float64       
 12  max_floor        28302 non-null  float64       
 13  area_sqft        28302 non-null  float64       
 14  lat              28302 non-null  float

##### Regression on variables excluding project_name and street

In [68]:
df_clean = df.drop(columns=['project_name', 'street', 'area_type', 'tenure_type', 
                            'sale_date', 'lat', 'lon', 'price_total' ])
dataset_1 = df_clean

In [69]:
dataset_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28302 entries, 0 to 28301
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   postal_district  28302 non-null  int64  
 1   sale_type        28302 non-null  object 
 2   property_type    28302 non-null  object 
 3   mkt_segment      28302 non-null  object 
 4   sale_month       28302 non-null  int64  
 5   sale_year        28302 non-null  int64  
 6   floor_level      28302 non-null  float64
 7   max_floor        28302 non-null  float64
 8   area_sqft        28302 non-null  float64
 9   lat_adj          28302 non-null  float64
 10  lon_adj          28302 non-null  float64
 11  price_sqft       28302 non-null  float64
dtypes: float64(6), int64(3), object(3)
memory usage: 2.6+ MB


In [70]:
dataset_1 = dataset_1.values

In [71]:
X = dataset_1[:, :-1]
y = dataset_1[:, -1]

##### Categorical columns

In [58]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

In [72]:
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0, 1, 2, 3 , 4, 5])], 
                       remainder='passthrough')

In [73]:
X = ct.fit_transform(X).toarray()

### Entire dataset approach

##### Split training and test set

In [74]:
from sklearn.model_selection import train_test_split

In [75]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = TRAIN_SIZE)

##### Feature scaling

In [76]:
from sklearn.preprocessing import StandardScaler

In [77]:
sc = StandardScaler()

In [78]:
X_train[:, -5:] = sc.fit_transform(X_train[:, -5:])

In [79]:
X_test[:, -5:] = sc.transform(X_test[:, -5:])

In [80]:
#save backup
X_train_copy = X_train.copy()
X_test_copy = X_test.copy()
y_train_copy = y_train.copy()
y_test_copy = y_test.copy()

#### Regression models

In [81]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

##### Decision tree

In [82]:
tree_1 = DecisionTreeRegressor()
tree_1.fit(X_train, y_train)
y_pred_tree_1 = tree_1.predict(X_test)

In [83]:
r2_score(y_test, y_pred_tree_1)

0.9351191281777256

##### Random forest

In [84]:
forest_1 = RandomForestRegressor()
forest_1.fit(X_train, y_train)
y_pred_forest_1 = forest_1.predict(X_test)

In [85]:
r2_score(y_test, y_pred_forest_1)

0.9614501180802233

### Time-split approach

In [86]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28302 entries, 0 to 28301
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   postal_district  28302 non-null  int64  
 1   sale_type        28302 non-null  object 
 2   property_type    28302 non-null  object 
 3   mkt_segment      28302 non-null  object 
 4   sale_month       28302 non-null  int64  
 5   sale_year        28302 non-null  int64  
 6   floor_level      28302 non-null  float64
 7   max_floor        28302 non-null  float64
 8   area_sqft        28302 non-null  float64
 9   lat_adj          28302 non-null  float64
 10  lon_adj          28302 non-null  float64
 11  price_sqft       28302 non-null  float64
dtypes: float64(6), int64(3), object(3)
memory usage: 2.6+ MB


In [87]:
dataset_2_train = df_clean[df_clean.sale_year == 2022].drop(columns=['sale_year']).values
dataset_2_test = df_clean[df_clean.sale_year == 2023].drop(columns=['sale_year']).values
X_train_2 = dataset_2_train[:, :-1]
y_train_2 = dataset_2_train[:, -1]
X_test_2 = dataset_2_test[:, :-1]
y_test_2 = dataset_2_test[:, -1]

#### Feature scaling and column transforming

In [88]:
# one column less, as sale_year column is removed
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0, 1, 2, 3 , 4, 5])], 
                       remainder='passthrough')
X_train_2 = ct.fit_transform(X_train_2).toarray()
X_test_2 = ct.transform(X_test_2).toarray()

In [89]:
sc_2 = StandardScaler()
X_train_2[:, -5:] = sc_2.fit_transform(X_train_2[:, -5:])
X_test_2[:, -5:] = sc_2.transform(X_test_2[:, -5:])

#### Regression models

In [90]:
tree_2 = DecisionTreeRegressor()
tree_2.fit(X_train_2, y_train_2)
y_pred_2 = tree_2.predict(X_test_2)

In [91]:
r2_score(y_test_2, y_pred_2)

0.7224667595875356

In [92]:
forest_2 = RandomForestRegressor()
forest_2.fit(X_train_2, y_train_2)
y_pred_forest_2 = forest_2.predict(X_test_2)

In [93]:
r2_score(y_test_2, y_pred_forest_2)

0.8144938980431243

### Conclusion
- As expected, using year_1 data for training and year_2 data for testing is less efficient, because we are effectively loosing the information about the year. 
- However, the predictions are still quite high, meaning that overall the model is robust and we can use the entire dataset model

### PCA approach

In [102]:
from sklearn.decomposition import PCA

In [103]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28302 entries, 0 to 28301
Data columns (total 20 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   project_name     28302 non-null  object        
 1   street           28302 non-null  object        
 2   postal_district  28302 non-null  int64         
 3   sale_type        28302 non-null  object        
 4   area_type        28302 non-null  object        
 5   property_type    28302 non-null  object        
 6   tenure_type      28302 non-null  object        
 7   mkt_segment      28302 non-null  object        
 8   sale_month       28302 non-null  int64         
 9   sale_year        28302 non-null  int64         
 10  sale_date        28302 non-null  datetime64[ns]
 11  floor_level      28302 non-null  float64       
 12  max_floor        28302 non-null  float64       
 13  area_sqft        28302 non-null  float64       
 14  lat              28302 non-null  float

In [104]:
df_pca = df.drop(columns=['sale_date', 'area_type', 'tenure_type', 'sale_date', 'lat', 'lon', 'price_total'])

In [105]:
df_pca.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28302 entries, 0 to 28301
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   project_name     28302 non-null  object 
 1   street           28302 non-null  object 
 2   postal_district  28302 non-null  int64  
 3   sale_type        28302 non-null  object 
 4   property_type    28302 non-null  object 
 5   mkt_segment      28302 non-null  object 
 6   sale_month       28302 non-null  int64  
 7   sale_year        28302 non-null  int64  
 8   floor_level      28302 non-null  float64
 9   max_floor        28302 non-null  float64
 10  area_sqft        28302 non-null  float64
 11  lat_adj          28302 non-null  float64
 12  lon_adj          28302 non-null  float64
 13  price_sqft       28302 non-null  float64
dtypes: float64(6), int64(3), object(5)
memory usage: 3.0+ MB


#### Column encoding

In [106]:
dataset_pca = df_pca.values
X_pca = dataset_pca[:, :-1]
y_pca = dataset_pca[:, -1]

In [107]:
columns_to_transform = [0, 1, 2, 3 , 4, 5, 6, 7]
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), columns_to_transform)], 
                       remainder='passthrough')
X_pca_ct = ct.fit_transform(X_pca).toarray()

#### Train_test split

In [108]:
X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(X_pca_ct, y_pca, train_size=TRAIN_SIZE)

#### Feature scaling

In [109]:
#train
sc_pca = StandardScaler()
X_train_pca[:, -5:] = sc_pca.fit_transform(X_train_pca[:, -5:])

#test
X_test_pca[:, -5:] = sc_pca.transform(X_test_pca[:, -5:])

#### Running PCA

In [133]:
#train
pca = PCA(n_components=40)
X_train_pca_transformed = pca.fit_transform(X_train_pca)

In [134]:
#increasing the number of components until explained variance is 85%. After that the increase becomes really low
pca.explained_variance_ratio_.cumsum()

array([0.19167108, 0.30173745, 0.39729581, 0.47525749, 0.5168095 ,
       0.55256919, 0.58656329, 0.61026421, 0.6331939 , 0.65069086,
       0.66559063, 0.67914876, 0.69104127, 0.70210756, 0.7120521 ,
       0.72056726, 0.72893937, 0.73713217, 0.74507588, 0.75285695,
       0.76058627, 0.76810406, 0.77544301, 0.78258261, 0.78909475,
       0.79554238, 0.80163935, 0.8075566 , 0.81283842, 0.81711436,
       0.8211643 , 0.82489139, 0.82822321, 0.83127071, 0.83394199,
       0.83652181, 0.83889989, 0.84104544, 0.84307908, 0.84506564])

In [135]:
#test
X_test_pca_transformed = pca.transform(X_test_pca)

#### Regression models on PCA

##### Linear regression

In [136]:
from sklearn.linear_model import LinearRegression

In [137]:
lr_pca = LinearRegression()

In [138]:
lr_pca.fit(X_train_pca_transformed, y_train_pca)

LinearRegression()

In [139]:
y_pred_lr_pca = lr_pca.predict(X_test_pca_transformed)

In [140]:
r2_score(y_test_pca, y_pred_lr_pca)

0.8007135227111963

Comment:
- Although the original data contained significant non-linearity, Linear model can be used with PCA

##### Decision tree

In [141]:
tree_pca = DecisionTreeRegressor()
tree_pca.fit(X_train_pca_transformed, y_train_pca)
y_pred_tree_pca = tree_pca.predict(X_test_pca_transformed)
r2_score(y_test_pca, y_pred_tree_pca)

0.8101643688979822

##### Random forest

In [142]:
forest_pca = RandomForestRegressor()
forest_pca.fit(X_train_pca_transformed, y_train_pca)
y_pred_forest_pca = forest_pca.predict(X_test_pca_transformed)
r2_score(y_test_pca, y_pred_forest_pca)

0.9083579771420061

### Artificial Neural Network

##### Preparing the dataset

In [143]:
df_ann = df_pca.copy()

In [144]:
df_ann.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28302 entries, 0 to 28301
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   project_name     28302 non-null  object 
 1   street           28302 non-null  object 
 2   postal_district  28302 non-null  int64  
 3   sale_type        28302 non-null  object 
 4   property_type    28302 non-null  object 
 5   mkt_segment      28302 non-null  object 
 6   sale_month       28302 non-null  int64  
 7   sale_year        28302 non-null  int64  
 8   floor_level      28302 non-null  float64
 9   max_floor        28302 non-null  float64
 10  area_sqft        28302 non-null  float64
 11  lat_adj          28302 non-null  float64
 12  lon_adj          28302 non-null  float64
 13  price_sqft       28302 non-null  float64
dtypes: float64(6), int64(3), object(5)
memory usage: 3.0+ MB


In [145]:
dataset_ann = df_ann.values
X_ann = dataset_ann[:, :-1]
y_ann = dataset_ann[:, -1].astype('float32')

In [146]:
y_ann.dtype

dtype('float32')

##### Encoding categorical columns

In [147]:
columns_to_transform = [0, 1, 2, 3 , 4, 5, 6, 7]
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), columns_to_transform)], 
                       remainder='passthrough')
X_ann = ct.fit_transform(X_ann).toarray().astype('float32')

In [148]:
X_train_ann, X_test_ann, y_train_ann, y_test_ann = train_test_split(X_ann, y_ann, train_size=TRAIN_SIZE)

In [149]:
sc_ann = StandardScaler()
X_train_ann = sc_ann.fit_transform(X_train_ann)
X_test_ann = sc_ann.transform(X_test_ann)

In [150]:
y_train_ann.dtype

dtype('float32')

##### ANN

In [151]:
import tensorflow as tf
from tensorflow.keras.metrics import MeanSquaredError

In [152]:
ann = tf.keras.models.Sequential()

In [153]:
#Adding layers
ann.add(tf.keras.layers.Input(shape=(X_train_ann.shape[1])))
ann.add(tf.keras.layers.Dense(512, activation='relu'))  
ann.add(tf.keras.layers.Dense(256, activation='relu'))
ann.add(tf.keras.layers.Dense(128, activation='relu'))
ann.add(tf.keras.layers.Dense(1))

In [154]:
ann.compile(optimizer = 'adam', loss = 'mean_squared_error', metrics = [MeanSquaredError()])

In [155]:
#Training
ann.fit(X_train_ann, y_train_ann, batch_size = 32, epochs = 100)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100


Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78/100
Epoch 79/100
Epoch 80/100
Epoch 81/100
Epoch 82/100
Epoch 83/100
Epoch 84/100
Epoch 85/100
Epoch 86/100
Epoch 87/100
Epoch 88/100
Epoch 89/100
Epoch 90/100
Epoch 91/100
Epoch 92/100
Epoch 93/100
Epoch 94/100
Epoch 95/100
Epoch 96/100
Epoch 97/100
Epoch 98/100
Epoch 99/100
Epoch 100/100


<keras.callbacks.History at 0x1697f5828>

In [156]:
#Predicting

In [157]:
y_pred_ann = ann.predict(X_test_ann)

In [158]:
r2_score(y_test_ann, y_pred_ann)

0.9431106839699473