CS4001/4042 Assignment 1, Part B, Q1
---

Real-world datasets often have a mix of numeric and categorical features – this dataset is one example. To build models on such data, categorical features have to be encoded or embedded.
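As a rough illustration of the two options, here is a minimal sketch on a made-up toy frame (the values are invented; only the column names mirror the assignment). It shows one-hot encoding next to the integer codes that an embedding layer would typically take as input.

```python
import pandas as pd

# Toy frame; the column names mirror the assignment, the values are made up.
toy = pd.DataFrame({
    "town": ["BISHAN", "TAMPINES", "BISHAN", "YISHUN"],
    "floor_area_sqm": [90.0, 104.0, 67.0, 121.0],
})

# Option 1: one-hot encoding, one indicator column per category.
one_hot = pd.get_dummies(toy["town"], prefix="town")

# Option 2: integer codes, the form an embedding layer looks up by index.
town_codes = toy["town"].astype("category").cat.codes

print(one_hot.columns.tolist())
print(town_codes.tolist())
```

One-hot columns grow with the number of categories, whereas an embedding maps each integer code to a small learned vector.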

PyTorch Tabular is a library that makes it very convenient to build neural networks for tabular data. It is built on top of PyTorch Lightning, which abstracts away boilerplate model training code and makes it easy to integrate other tools, e.g. TensorBoard for experiment tracking.

For questions B1 and B2, the following features should be used:   
- **Numeric / Continuous** features: dist_to_nearest_stn, dist_to_dhoby, degree_centrality, eigenvector_centrality, remaining_lease_years, floor_area_sqm
- **Categorical** features: month, town, flat_model_type, storey_range



---



In [1]:
!pip install pytorch_tabular[extra]

Collecting pytorch_tabular[extra]
  Downloading pytorch_tabular-1.0.2-py2.py3-none-any.whl (122 kB)
Collecting category-encoders<2.7.0,>=2.6.0 (from pytorch_tabular[extra])
  Downloading category_encoders-2.6.2-py2.py3-none-any.whl (81 kB)
Collecting pytorch-lightning<2.0.0,>=1.8.0 (from pytorch_tabular[extra])
  Downloading pytorch_lightning-1.9.5-py3-none-any.whl (829 kB)
Collecting omegaconf>=2.1.0 (from pytorch_tabular[extra])
  Downloading omegaconf-2.3.0-py3-none-any.whl (79 kB)
Collecting torch

In [17]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from pytorch_tabular import TabularModel
from pytorch_tabular.models import CategoryEmbeddingModelConfig
from pytorch_tabular.config import (
    DataConfig,
    OptimizerConfig,
    TrainerConfig,
)

> Divide the dataset (‘hdb_price_prediction.csv’) into train, validation and test sets by using entries from year 2019 and before as training data, year 2020 as validation data and year 2021 as test data.
**Do not** use data from year 2022 and year 2023.



In [18]:
df = pd.read_csv('hdb_price_prediction.csv')

# TODO: Enter your code here
# year 2019 and before as training data
df_train = df[df['year'] <= 2019]
# year 2020 as validation data
df_validation = df[df['year'] == 2020]
# year 2021 as test data
df_test = df[df['year'] == 2021]
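A quick sanity check can confirm that a year-based split follows the rule in the task. The sketch below runs on a toy frame; the helper name `check_year_split` is hypothetical and not part of the assignment template.

```python
import pandas as pd

def check_year_split(train, validation, test):
    """Return True if the three frames follow the year rule from the task."""
    return bool(
        (train["year"] <= 2019).all()
        and (validation["year"] == 2020).all()
        and (test["year"] == 2021).all()
    )

# Toy data standing in for the real CSV.
toy = pd.DataFrame({"year": [2017, 2019, 2020, 2021, 2022, 2023]})
toy_train = toy[toy["year"] <= 2019]
toy_validation = toy[toy["year"] == 2020]
toy_test = toy[toy["year"] == 2021]

split_ok = check_year_split(toy_train, toy_validation, toy_test)
# Rows from 2022 and 2023 should end up in none of the three sets.
n_dropped = len(toy) - len(toy_train) - len(toy_validation) - len(toy_test)
print(split_ok, n_dropped)
```

The same helper could be called on `df_train`, `df_validation` and `df_test` after the split.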

> Refer to the documentation of **PyTorch Tabular** and perform the following tasks: https://pytorch-tabular.readthedocs.io/en/latest/#usage
- Use **[DataConfig](https://pytorch-tabular.readthedocs.io/en/latest/data/)** to define the target variable, as well as the names of the continuous and categorical variables.
- Use **[TrainerConfig](https://pytorch-tabular.readthedocs.io/en/latest/training/)** to automatically tune the learning rate. Set batch_size to be 1024 and set max_epoch as 50.
- Use **[CategoryEmbeddingModelConfig](https://pytorch-tabular.readthedocs.io/en/latest/models/#category-embedding-model)** to create a feedforward neural network with 1 hidden layer containing 50 neurons.
- Use **[OptimizerConfig](https://pytorch-tabular.readthedocs.io/en/latest/optimizer/)** to choose Adam optimiser. There is no need to set the learning rate (since it will be tuned automatically) nor scheduler.
- Use **[TabularModel](https://pytorch-tabular.readthedocs.io/en/latest/tabular_model/)** to initialise the model and put all the configs together.
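To make the requested architecture concrete, here is a conceptual NumPy sketch of a forward pass through a feedforward network with one embedded categorical feature, six numeric features and a single hidden layer of 50 ReLU units. It is not PyTorch Tabular's implementation; the cardinality, embedding size and random weights are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_towns, emb_dim = 26, 4      # hypothetical cardinality and embedding size
n_continuous, hidden = 6, 50  # 6 numeric features, 1 hidden layer of 50 units

# Parameters are randomly initialised here; training would learn them.
town_embedding = rng.normal(size=(n_towns, emb_dim))
W1 = rng.normal(size=(emb_dim + n_continuous, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(size=(hidden, 1))
b2 = np.zeros(1)

def forward(town_idx, x_cont):
    emb = town_embedding[town_idx]             # look up one vector per row
    x = np.concatenate([emb, x_cont], axis=1)  # join with the numeric features
    h = np.maximum(x @ W1 + b1, 0.0)           # hidden layer with ReLU
    return h @ W2 + b2                         # single regression output

batch_towns = np.array([0, 5, 25])
batch_cont = rng.normal(size=(3, n_continuous))
out = forward(batch_towns, batch_cont)
print(out.shape)
```

With several categorical columns, each would get its own embedding table and all the looked-up vectors would be concatenated before the hidden layer.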

In [19]:
# TODO: Enter your code here

# target: List[str]: A list of strings with the names of the target column(s)
# The aim is to predict public housing prices in Singapore from related features, hence I use resale_price
target = ['resale_price']

# continuous_columns: List[str]: Column names of the numeric fields. Defaults to []
# Numeric / Continuous features given: dist_to_nearest_stn, dist_to_dhoby, degree_centrality, eigenvector_centrality, remaining_lease_years, floor_area_sqm
continuous_columns = ['dist_to_nearest_stn', 'dist_to_dhoby', 'degree_centrality', 'eigenvector_centrality', 'remaining_lease_years', 'floor_area_sqm']

# categorical_columns: List[str]: Column names of the categorical fields to treat differently
# Categorical features given: month, town, flat_model_type, storey_range
categorical_columns = ['month', 'town', 'flat_model_type', 'storey_range']

# define the target variable, names of the continuous and categorical variables
data_config = DataConfig(
    target=target,
    continuous_cols=continuous_columns,
    categorical_cols=categorical_columns,
)

trainer_config = TrainerConfig(
    # batch_size to be 1024
    batch_size=1024,
    # max_epoch as 50.
    max_epochs=50,
    # automatically tune the learning rate.
    auto_lr_find=True
)

# create a feedforward neural network with 1 hidden layer containing 50 neurons.
# use default values of CategoryEmbeddingModelConfig, like activation function defaults to ReLU
model_config = CategoryEmbeddingModelConfig(
    # task is set to regression
    task="regression",
    # Hyphen-separated number of layers and units in the classification head. eg. 32-64-32.
    # since only 1 hidden layer with 50 neurons is needed, value set to "50"
    layers="50"
)

# use default optimizer - Adam
optimizer_config = OptimizerConfig()

# plug in all the values above together: data_config, model_config, optimizer_config, trainer_config
tabular_model = TabularModel(
    data_config=data_config,
    trainer_config=trainer_config,
    model_config=model_config,
    optimizer_config=optimizer_config
)

# Usage Example: https://pytorch-tabular.readthedocs.io/en/latest/
tabular_model.fit(train=df_train, validation=df_validation)

result = tabular_model.evaluate(df_test)
pred_df = tabular_model.predict(df_test)

tabular_model.save_model("nailah_models/b1_model")

2023-10-12 16:21:53,734 - {pytorch_tabular.tabular_model:105} - INFO - Experiment Tracking is turned off
INFO:lightning_fabric.utilities.seed:Global seed set to 42
2023-10-12 16:21:53,778 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-10-12 16:21:53,786 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for regression task
2023-10-12 16:21:54,108 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
2023-10-12 16:21:54,177 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_steps=100` reached.
INFO:pytorch_lightning.tuner.lr_finder:Learning rate set to 0.5754399373371567
INFO:pytorch_lightning.utilities.rank_zero:Restoring states from the checkpoint path at /content/.lr_find_582a66c6-bcc7-478b-bb4d-034d9f08d009.ckpt
INFO:pytorch_lightning.utilities.rank_zero:Restored all states from the checkpoint file at /content/.lr_find_582a66c6-bcc7-478b-bb4d-034d9f08d009.ckpt
2023-10-12 16:22:01,111 - {pytorch_tabular.tabular_model:575} - INFO - Suggested LR: 0.5754399373371567. For plot and detailed analysis, use `find_learning_rate` method.
2023-10-12 16:22:01,115 - {pytorch_tabular.tabular_model:582} - INFO - Training Started
2023-10-12 16:22:44,734 - {pytorch_tabular.tabular_model:584} - INFO - Training the model completed
2023-10-12 16:22:44,741 - {pytorch_tabular.tabular_model:1258} - INFO - Loading the best model


> Report the test RMSE error and the test R2 value that you obtained.



Test RMSE ≈ 76,696.92 and test R2 ≈ 0.7776, as printed by the cell below.

In [20]:
import math
import sklearn.metrics

# use separate names so the `target` list defined for DataConfig is not overwritten
y_true = pred_df['resale_price']
y_pred = pred_df['resale_price_prediction']

print("Values Obtained for RMSE error and R2")
print("-----------------------------------------------")
# https://www.javatpoint.com/rsme-root-mean-square-error-in-python
print(f"Test RMSE: {math.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))}")
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
print(f"R2: {sklearn.metrics.r2_score(y_true, y_pred)}")

Values Obtained for RMSE error and R2
-----------------------------------------------
Test RMSE: 76696.91659962252
R2: 0.7776188068029297
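For reference, both metrics can be written out directly from their definitions: RMSE is the square root of the mean squared residual, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses NumPy on three made-up values; the function names `rmse` and `r2` are illustrative.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up values, not taken from the assignment's data.
y_true_toy = [3.0, 5.0, 7.0]
y_pred_toy = [2.0, 5.0, 8.0]
print(rmse(y_true_toy, y_pred_toy), r2(y_true_toy, y_pred_toy))
```

RMSE is in the same unit as the target (here, the resale price), which makes it easier to interpret than the raw MSE.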


> Print out the corresponding rows in the dataframe for the top 25 test samples with the largest errors. Identify a trend in these poor predictions and suggest a way to reduce these errors.



In [21]:
# TODO: Enter your code here
# use the squared error of each sample to rank test samples by error size
pred_df['error'] = (pred_df['resale_price'] - pred_df['resale_price_prediction'])**2
# sort values to get the largest errors on top and show the top 25
poor_predictions = pred_df.sort_values(by="error", ascending=False).head(25)
print(poor_predictions)

        month  year          town                full_address    nearest_stn  \
92405      11  2021   BUKIT MERAH            46 SENG POH ROAD    Tiong Bahru   
90957       6  2021   BUKIT BATOK  288A BUKIT BATOK STREET 25    Bukit Batok   
112128     12  2021      TAMPINES      156 TAMPINES STREET 12       Tampines   
90608      12  2021        BISHAN       273B BISHAN STREET 24         Bishan   
106192     12  2021    QUEENSTOWN              89 DAWSON ROAD     Queenstown   
91871       6  2021   BUKIT MERAH         17 TIONG BAHRU ROAD    Tiong Bahru   
93825       8  2021  CENTRAL AREA       4 TANJONG PAGAR PLAZA  Tanjong Pagar   
92504      12  2021   BUKIT MERAH            49 KIM PONG ROAD    Tiong Bahru   
105695      6  2021    QUEENSTOWN              91 DAWSON ROAD     Queenstown   
90432       8  2021        BISHAN       275A BISHAN STREET 24         Bishan   
92299      10  2021   BUKIT MERAH         36 MOH GUAN TERRACE    Tiong Bahru   
92442      11  2021   BUKIT MERAH       
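Squaring the error discards the sign of each miss. The minimal sketch below, on made-up prices rather than the assignment's data, keeps a signed residual so that under- and over-predictions among the worst rows can be counted separately. The column names `residual` and `abs_error` are illustrative choices.

```python
import pandas as pd

# Made-up prices; only the two column names match the prediction dataframe.
toy_pred = pd.DataFrame({
    "resale_price": [900_000, 400_000, 650_000, 1_200_000],
    "resale_price_prediction": [700_000, 420_000, 640_000, 850_000],
})

# Positive residual: the model predicted too low. Negative: too high.
toy_pred["residual"] = toy_pred["resale_price"] - toy_pred["resale_price_prediction"]
toy_pred["abs_error"] = toy_pred["residual"].abs()

# Ranking by absolute error gives the same order as ranking by squared error.
worst = toy_pred.sort_values("abs_error", ascending=False).head(2)
n_under = int((worst["residual"] > 0).sum())
n_over = int((worst["residual"] < 0).sum())
print(worst)
print(n_under, n_over)
```

Applied to `pred_df`, the two counts would show directly whether the largest errors are mostly under- or over-estimates.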

In [22]:
print("Using categorical columns to see the trend (in poor_predictions)")
print("----------------------------------------------------------------\n")
for column in categorical_columns:
    print(f"Column: {column}")
    counts = poor_predictions[column].value_counts().reset_index()
    counts.columns = [column, 'Count']
    print(counts)
    print("\n")

Using categorical columns to see the trend (in poor_predictions)
----------------------------------------------------------------

Column: month
   month  Count
0     12      6
1      6      4
2      8      4
3     10      4
4     11      3
5      9      2
6      4      1
7      3      1


Column: town
           town  Count
0   BUKIT MERAH     10
1    QUEENSTOWN      6
2        BISHAN      3
3  CENTRAL AREA      2
4   BUKIT BATOK      1
5      TAMPINES      1
6    ANG MO KIO      1
7       HOUGANG      1


Column: flat_model_type
                  flat_model_type  Count
0                3 ROOM, Standard      6
1  4 ROOM, Premium Apartment Loft      6
2                5 ROOM, Improved      6
3            EXECUTIVE, Apartment      2
4                    5 ROOM, DBSS      2
5           5 ROOM, Adjoined flat      2
6           EXECUTIVE, Maisonette      1


Column: storey_range
   storey_range  Count
0      01 TO 03      7
1      07 TO 09      3
2      28 TO 30      3
3      10 TO 12     

In [23]:
print("Using categorical columns to see the trend (full list)")
print("--------------------------------------------------------\n")
for column in categorical_columns:
    print(f"Column: {column}")
    counts = df[column].value_counts().reset_index()
    counts.columns = [column, 'Count']
    print(counts)
    print("\n")

Using categorical columns to see the trend (full list)
--------------------------------------------------------

Column: month
    month  Count
0       7  15932
1       8  14798
2       3  14334
3       6  14318
4       1  13189
5       9  13044
6      10  12840
7       4  12764
8      11  12755
9       5  12486
10     12  11975
11      2  11118


Column: town
               town  Count
0          SENGKANG  13391
1           PUNGGOL  11838
2         WOODLANDS  11242
3            YISHUN  10969
4          TAMPINES  10671
5       JURONG WEST  10479
6             BEDOK   8566
7           HOUGANG   7854
8     CHOA CHU KANG   7289
9        ANG MO KIO   6611
10      BUKIT MERAH   6115
11    BUKIT PANJANG   6015
12      BUKIT BATOK   5624
13        TOA PAYOH   5079
14        PASIR RIS   4866
15  KALLANG/WHAMPOA   4799
16       QUEENSTOWN   4479
17        SEMBAWANG   4129
18          GEYLANG   3956
19         CLEMENTI   3632
20      JURONG EAST   3284
21           BISHAN   2988
22        SERANG

A trend I identified is that the model consistently overestimates resale prices, especially for values that are underrepresented in the dataset. For example, `flat_model_type = 4 ROOM, Premium Apartment Loft` has one of the highest occurrence counts (6) in the poor-predictions dataframe. A likely contributing factor is that it has very little representation (76 rows) in the full dataset, compared with `flat_model_type = 4 ROOM, Model A`, which has a count of 40309.
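One way to quantify under-representation is to compare each category's share among the worst predictions with its share in a reference set. The sketch below uses toy data, and the helper name `share_comparison` is hypothetical. A ratio well above 1 means the category appears among the worst predictions more often than its frequency alone would suggest.

```python
import pandas as pd

def share_comparison(worst, reference, column):
    """Compare category shares in the worst rows against a reference frame."""
    worst_share = worst[column].value_counts(normalize=True)
    ref_share = reference[column].value_counts(normalize=True)
    out = pd.DataFrame({"worst_share": worst_share, "reference_share": ref_share})
    # Keep only categories that actually occur among the worst rows.
    out = out.dropna(subset=["worst_share"])
    out["ratio"] = out["worst_share"] / out["reference_share"]
    return out.sort_values("ratio", ascending=False)

# Toy data: "Loft" is rare overall but common among the worst rows.
reference = pd.DataFrame({"flat_model_type": ["Model A"] * 8 + ["Loft"] * 2})
worst_rows = pd.DataFrame({"flat_model_type": ["Loft", "Loft", "Model A"]})

table = share_comparison(worst_rows, reference, "flat_model_type")
print(table)
```

In the notebook, `worst_rows` would correspond to `poor_predictions`, and the reference frame could be either `df` or `df_train`, depending on which distribution is of interest.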

Hence, one way to reduce the error in resale price predictions is to train on a more diverse (and, if possible, larger) dataset, which would help the model learn the underlying patterns for rare categories and ultimately make more accurate predictions.