# Question B4 (10 marks)

Model degradation is a common issue when deploying machine learning models (including neural networks) in the real world. New data points may follow a different pattern from older data points because of factors such as changes in government policy or market sentiment. For instance, housing prices in Singapore have been increasing, and the Singapore government has introduced three rounds of cooling measures in recent years (16 December 2021, 30 September 2022, 27 April 2023).

In such situations, the distribution of the new data points could differ from the original data distribution on which the models were trained. Recall that machine learning models often rely on the assumption that the test distribution is similar to the training distribution. When this assumption is violated, model performance is adversely affected. In the last part of this assignment, we will investigate to what extent model degradation has occurred.







Your co-investigators used a linear regression model to rapidly test several combinations of train/test splits and shared their findings with you in a brief report, attached in Appendix A below. You wish to investigate whether your deep learning model corroborates their findings.

In [6]:
!pip install alibi-detect --user









In [2]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from alibi_detect.cd import TabularDrift

1. Evaluate your model from B1 on data from year 2022 and report the test R2.

In [7]:
# Load the dataset
df = pd.read_csv('hdb_price_prediction.csv')

# Import required libraries
from pytorch_tabular import TabularModel
from sklearn.metrics import r2_score, mean_squared_error
import math

# Load the pre-trained model (B1)
B1_model = TabularModel.load_model("saved_models/B1_Model")

# Filter dataset for the year 2022
df_test = df[df["year"] == 2022]  # Using only data from year 2022 for evaluation

# Predict resale prices using the loaded model
pred = B1_model.predict(df_test)

# Print evaluation metrics
print("\n")
print("2022 Resale Price Evaluation:")
print("Coefficient of Determination (R2):", r2_score(df_test['resale_price'], pred['resale_price_prediction']))
print("Root Mean Square Error (RMSE):", math.sqrt(mean_squared_error(df_test['resale_price'], pred['resale_price_prediction'])))


Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.rich_model_summary.RichModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs



2022 Resale Price Evaluation:
Coefficient of Determination (R2): 0.4178136829653838
Root Mean Square Error (RMSE): 129910.63113894255


2. Evaluate your model from B1 on data from year 2023 and report the test R2.

In [18]:
# TODO: Enter your code here

df_test = df[df["year"] == 2023] # Just year 2023
pred = B1_model.predict(df_test)


print("\n")
print("Evaluation of 2023 data:  R2: ", r2_score(df_test['resale_price'], pred['resale_price_prediction']))
print("RMSE: ", math.sqrt(mean_squared_error(df_test['resale_price'], pred['resale_price_prediction'])))




Evaluation of 2023 data:  R2:  0.13473511711169517
RMSE:  159714.97263754322


3. Did model degradation occur for the deep learning model?


### Model Performance Comparison Between 2022 and 2023

#### R2 Score Comparison
- **Definition**: The R2 score, or coefficient of determination, assesses the goodness of fit of the model to the data by measuring the proportion of variance in the dependent variable predictable from the independent variables.
- **Observation**: 
  - In 2022, the R2 score was 0.418, so the model explained roughly 42% of the variance in resale prices.
  - In 2023, the R2 score fell to 0.135, so the model explained only about 13% of the variance, a clear drop in predictive accuracy.
  
#### RMSE Comparison
- **Definition**: Root Mean Square Error (RMSE) is the square root of the mean squared difference between predicted and actual values. It is expressed in the same units as the target, and higher values indicate larger prediction errors.
- **Observation**: 
  - The RMSE increased from 129,910.63 in 2022 to 159,714.97 in 2023, highlighting that the model's predictions in 2023 are less accurate compared to 2022.
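
To make the two definitions above concrete, here is a minimal sketch that computes R2 and RMSE by hand. The helper `r2_and_rmse` and the toy prices are illustrative assumptions, not values from the HDB dataset; the notebook itself uses `r2_score` and `mean_squared_error` from scikit-learn.

```python
import math


def r2_and_rmse(y_true, y_pred):
    """Return (R2, RMSE) for two equal-length sequences of numbers."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: squared error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot, math.sqrt(ss_res / n)


# Toy prices (made-up numbers for illustration only).
y_true = [400_000, 500_000, 600_000, 700_000]
close_preds = [410_000, 490_000, 610_000, 690_000]
# Predictions that rank the flats perfectly but are all 100,000 too low.
low_preds = [t - 100_000 for t in y_true]

r2_close, rmse_close = r2_and_rmse(y_true, close_preds)
r2_low, rmse_low = r2_and_rmse(y_true, low_preds)
print(r2_close, rmse_close)
print(r2_low, rmse_low)
```

In the second case the predictions are consistently too low, which is enough to pull R2 down sharply even though the ordering of the flats is preserved. A model trained on older prices in a rising market could plausibly behave in a similar way.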

#### Model Performance Degradation
- The decrease in R2 and increase in RMSE indicate a decline in model performance from 2022 to 2023. This degradation suggests that the model is less effective in predicting outcomes accurately in 2023.
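- **Possible Causes**:
  - Changes in the distribution of the input features (covariate shift).
  - Changes in the distribution of resale prices (label shift).
  - Changes in the relationship between features and resale price (concept drift), for example following the cooling measures.
  - The model itself was not modified between the two evaluations, so the decline cannot be attributed to a change in its structure or configuration.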

#### Further Checks for Diagnosing Model Degradation
1. **Prediction Errors**:
   - Check for patterns such as heteroscedasticity or systematic biases in prediction errors.
2. **Cross-Validation Scores**:
   - Monitor cross-validation scores for changes in performance on unseen data.
3. **Feature Importance**:
   - Observe shifts in feature importance rankings to understand changes in model behavior.
4. **Model Complexity**:
   - Ensure that model complexity aligns with the data's complexity to avoid overfitting or underfitting.
5. **Generalization Performance**:
   - Evaluate the model's ability to generalize to new data beyond the training set.
6. **Monitoring Metrics**:
   - Continuously track performance metrics for any deviations from expected behavior.
7. **External Validation**:
   - Compare predictions against external benchmarks or domain knowledge for accuracy.
8. **Model Drift Detection**:
   - Implement drift detection techniques to identify changes in data distribution affecting model performance.
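
As an illustration of the first check in the list above (systematic bias in prediction errors), the sketch below summarises residuals. The helper `residual_summary` and the numbers are hypothetical; in the notebook the inputs would be `df_test['resale_price']` and `pred['resale_price_prediction']`.

```python
def residual_summary(y_true, y_pred):
    """Return (mean residual, share of under-predictions).

    A residual is actual minus predicted, so a positive mean residual
    means the model under-predicts on average.
    """
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    n = len(residuals)
    mean_residual = sum(residuals) / n
    share_under = sum(r > 0 for r in residuals) / n
    return mean_residual, share_under


# Made-up values (in thousands) for illustration only.
actual = [500, 600, 700, 800]
predicted = [450, 560, 640, 810]

mean_residual, share_under = residual_summary(actual, predicted)
print(mean_residual, share_under)
```

A mean residual far from zero, or a share of under-predictions far from one half, would point to systematic bias rather than random noise.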


4. Model degradation could be caused by [various data distribution shifts](https://huyenchip.com/2022/02/07/data-distribution-shifts-and-monitoring.html#data-shift-types): covariate shift (features), label shift and/or concept drift (altered relationship between features and labels).
There are various conflicting terminologies in the [literature](https://www.sciencedirect.com/science/article/pii/S0950705122002854#tbl1). Let’s stick to this reference for this assignment.

> Using the **Alibi Detect** library, apply the **TabularDrift** function with the training data (year 2019 and before) used as the reference and **detect which features have drifted** in the 2023 test dataset. Before running the statistical tests, ensure you **sample 1000 data points** each from the train and test data. Do not use the whole train/test data. (Hint: use this example as a guide https://docs.seldon.io/projects/alibi-detect/en/stable/examples/cd_chi2ks_adult.html)
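
For intuition about the statistic reported for continuous features, the following is a minimal pure-Python sketch of the two-sample Kolmogorov-Smirnov statistic, i.e. the largest gap between the two empirical CDFs. The function `ks_statistic` is a hypothetical helper written for illustration; it is not how Alibi Detect is implemented internally, and it does not compute a p-value.

```python
from bisect import bisect_right


def ks_statistic(ref, test):
    """Largest absolute difference between the empirical CDFs of two samples."""
    ref_sorted = sorted(ref)
    test_sorted = sorted(test)
    n_ref, n_test = len(ref_sorted), len(test_sorted)
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so it is enough
    # to compare them at every observed point.
    for x in ref_sorted + test_sorted:
        cdf_ref = bisect_right(ref_sorted, x) / n_ref
        cdf_test = bisect_right(test_sorted, x) / n_test
        largest_gap = max(largest_gap, abs(cdf_ref - cdf_test))
    return largest_gap


same = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])
shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
disjoint = ks_statistic([1, 2], [3, 4])
print(same, shifted, disjoint)
```

A value near 0 means the two samples have similar distributions, while a value near 1 means they barely overlap.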


In [20]:
# YOUR CODE HERE

target = ['resale_price']
categorical_cols = ['month', 'town', 'flat_model_type', 'storey_range']
continuous_cols = ['dist_to_nearest_stn', 'dist_to_dhoby', 'degree_centrality', 'eigenvector_centrality', 'remaining_lease_years', 'floor_area_sqm']

n_ref = 1000
n_test = 1000

# Extract unique categories for each categorical column
category_map = {}
for i, col in enumerate(categorical_cols):
    category_map[i] = df[col].unique().tolist()


# Splitting the dataset into train (reference) and test based on the 'year' column
X_train = df[df['year'] <= 2019]
X_test = df[df['year'] == 2023]

# Take the first n_ref / n_test rows of each split (a slice, not a random sample)
X_train = X_train[:n_ref]
X_test = X_test[:n_test]

X_ref = X_train[categorical_cols + continuous_cols].values
X_test = X_test[categorical_cols + continuous_cols].values

y_ref = X_train[target].values
y_test = df[df['year'] == 2023][target].values[:n_test]

categories_per_feature = {f: None for f in list(category_map.keys())}
cd = TabularDrift(X_ref, p_val=.05, categories_per_feature=categories_per_feature)

predictions = cd.predict(X_test)
labels = ['No','Yes']
print('Drift -> {}'.format(labels[predictions['data']['is_drift']]))

feature_preds = cd.predict(X_test, drift_type='feature')
feature_names = categorical_cols + continuous_cols

data = []
for f in range(cd.n_features):
    stat = 'Chi2' if f in list(categories_per_feature.keys()) else 'K-S'
    feature_name = feature_names[f]
    is_drift = feature_preds['data']['is_drift'][f]
    stat_val, p_val = feature_preds['data']['distance'][f], feature_preds['data']['p_val'][f]
    data.append([feature_name, labels[is_drift], stat, stat_val, p_val])

headers = ["Feature Name", "Drift", "Statistic", "Statistic Value", "p-value"]
view = pd.DataFrame(data, columns=headers)
display(view)


Drift -> Yes


|   | Feature Name | Drift | Statistic | Statistic Value | p-value |
|---|--------------|-------|-----------|-----------------|---------|
| 0 | month | No | Chi2 | 0.0 | 1.0 |
| 1 | town | Yes | Chi2 | 667.473877 | 0.0 |
| 2 | flat_model_type | Yes | Chi2 | 77.586258 | 1.877453e-05 |
| 3 | storey_range | Yes | Chi2 | 38.800251 | 0.0006863867 |
| 4 | dist_to_nearest_stn | No | K-S | 0.055 | 0.09354284 |
| 5 | dist_to_dhoby | Yes | K-S | 0.218 | 2.417212e-21 |
| 6 | degree_centrality | No | K-S | 0.029 | 0.7830867 |
| 7 | eigenvector_centrality | Yes | K-S | 0.195 | 3.9347e-17 |
| 8 | remaining_lease_years | Yes | K-S | 0.271 | 6.288003e-33 |
| 9 | floor_area_sqm | Yes | K-S | 0.134 | 2.728432e-08 |
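
The results above were produced from the first 1000 rows of each split. If the CSV is ordered chronologically (an assumption, not verified here), those rows would cover only the earliest months of each period, which might explain why `month` shows a Chi2 statistic of exactly 0. Since the task asks for a sample of 1000 points, a random draw may be closer to what is intended. A minimal sketch using `DataFrame.sample`, with a hypothetical helper `sample_rows` and a tiny synthetic frame standing in for the real data:

```python
import pandas as pd

SEED = 42  # same seed value as the notebook


def sample_rows(frame, n, seed=SEED):
    """Draw n rows uniformly at random without replacement (hypothetical helper)."""
    return frame.sample(n=n, random_state=seed)


# Synthetic stand-in: 24 rows ordered by month, two rows per month.
toy = pd.DataFrame({"month": [m for m in range(1, 13) for _ in range(2)]})

head_rows = toy[:6]                # slice: only the earliest months
random_rows = sample_rows(toy, 6)  # random draw across all months

print(sorted(head_rows["month"].unique().tolist()))  # [1, 2, 3]
print(sorted(random_rows["month"].unique().tolist()))
```

In the notebook this would correspond to something like `X_train.sample(n=n_ref, random_state=SEED)` in place of `X_train[:n_ref]`. The drift table would need to be regenerated afterwards, and its values may differ from those shown above.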


5. Assuming that the flurry of housing measures has affected the relationship between all the features and resale_price (i.e. P(Y|X) changes), which type of data distribution shift possibly led to model degradation?


In [21]:
# YOUR ANSWER HERE



**Answer:** This is known as **Concept Drift**.

- Concept Drift occurs when the relationship between features (X) and the target (Y) changes over time.
- In this case, external factors like housing market changes may alter how features impact resale prices, leading to potential model degradation.
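
The toy sketch below illustrates the idea under simplified assumptions: the inputs stay exactly the same, only the relationship between X and Y changes, and a model fitted to the old relationship loses accuracy. The fixed rule in `old_model` and all numbers are made up for illustration.

```python
def r2(y_true, y_pred):
    """Coefficient of determination for two equal-length sequences."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def old_model(x):
    """Stand-in for a model fitted when the relationship was y = 2x."""
    return 2.0 * x


x = [1.0, 2.0, 3.0, 4.0, 5.0]        # identical inputs in both periods
y_before = [2.0 * v for v in x]      # old relationship P(Y|X)
y_after = [3.0 * v for v in x]       # new relationship, same X

preds = [old_model(v) for v in x]
r2_before = r2(y_before, preds)
r2_after = r2(y_after, preds)
print(r2_before, r2_after)
```

Because the inputs are identical in both periods, a test on the feature distributions alone would report no change here, even though the model's R2 falls.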


6. From your analysis via TabularDrift, which features contribute to this shift?


In [22]:
# YOUR ANSWER HERE

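**Answer:** TabularDrift flagged the following features as drifted (p-value < 0.05) between the reference data (2019 and before) and the 2023 data:

- **Town**
- **Flat Model Type**
- **Storey Range**
- **Distance to Dhoby Ghaut**
- **Eigenvector Centrality**
- **Remaining Lease Years**
- **Floor Area (sqm)**

`month`, `dist_to_nearest_stn` and `degree_centrality` were not flagged. Note that TabularDrift tests each feature's own distribution, so it detects changes in P(X) rather than measuring changes in P(Y|X) directly. The flagged features are therefore the most likely candidates to be involved in the shift and in the resulting loss of accuracy.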
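
The list of drifted features can also be derived programmatically. The sketch below copies the p-values from the TabularDrift output table and applies the same 0.05 threshold that was passed as `p_val`; the variable names are illustrative.

```python
P_VAL_THRESHOLD = 0.05  # same value as p_val in the TabularDrift call

# p-values copied from the TabularDrift output table.
p_values = {
    "month": 1.0,
    "town": 0.0,
    "flat_model_type": 1.877453e-05,
    "storey_range": 0.0006863867,
    "dist_to_nearest_stn": 0.09354284,
    "dist_to_dhoby": 2.417212e-21,
    "degree_centrality": 0.7830867,
    "eigenvector_centrality": 3.9347e-17,
    "remaining_lease_years": 6.288003e-33,
    "floor_area_sqm": 2.728432e-08,
}

drifted = [name for name, p in p_values.items() if p < P_VAL_THRESHOLD]
not_drifted = [name for name, p in p_values.items() if p >= P_VAL_THRESHOLD]

print("Drifted:", drifted)
print("Not drifted:", not_drifted)
```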


7. Suggest 1 way to address model degradation and implement it, showing improved test R2 for year 2023.


**Solution to Address Model Degradation:**

The B1 model was initially trained on older data, specifically from 2017 to 2019. To improve the model's performance and address degradation due to changes in the housing market, one effective approach is to retrain the model on more recent data.

**Updated Approach:**
- **Training Data:** Include all available data up to and including 2022 to capture recent trends and shifts.
  - Updated Training Data: [2017, 2018, 2019, 2020, 2021, 2022]
- **Testing Data:** Reserve 2023 data for testing to evaluate performance improvements.

By retraining the model on this expanded dataset, the model can better generalize to recent patterns, potentially improving its R2 score for predictions in 2023.
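
The split described above can be wrapped in a small helper so that the same logic is reused whenever the training window is moved forward. This is a sketch under the assumption that the data has an integer `year` column, as in the notebook; `temporal_split` is a hypothetical name and the toy frame is made up.

```python
import pandas as pd


def temporal_split(frame, last_train_year, test_year, year_col="year"):
    """Train on all rows up to last_train_year, test on test_year only."""
    if test_year <= last_train_year:
        raise ValueError("test_year must be later than last_train_year")
    train = frame[frame[year_col] <= last_train_year]
    test = frame[frame[year_col] == test_year]
    return train, test


# Tiny synthetic frame for illustration only.
toy = pd.DataFrame({
    "year": [2017, 2019, 2021, 2022, 2023, 2023],
    "resale_price": [300, 320, 400, 450, 500, 520],
})

train_part, test_part = temporal_split(toy, last_train_year=2022, test_year=2023)
print(train_part["year"].unique(), test_part["year"].unique())
```

Keeping the test year strictly later than the training years avoids evaluating the model on data it has already seen.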


In [23]:
# YOUR CODE HERE

df_train = df[df["year"] <= 2022] # year 2022 and before 
df_test = df[df["year"] == 2023] # Just year 2023

print("Training data : ",df_train["year"].unique())
print("Testing data : ",df_test["year"].unique())

B1_model.fit(train=df_train)  # no separate validation set is passed

Seed set to 42


Training data :  [2017 2018 2019 2020 2021 2022]
Testing data :  [2023]



GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


C:\Users\65976\AppData\Roaming\Python\Python312\site-packages\pytorch_lightning\callbacks\model_checkpoint.py:639: Checkpoint directory C:\Users\65976\Desktop\SC4001 Assignment\saved_models exists and is not empty.
C:\Users\65976\AppData\Roaming\Python\Python312\site-packages\pytorch_lightning\trainer\connectors\data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=15` in the `DataLoader` to improve performance.
C:\Users\65976\AppData\Roaming\Python\Python312\site-packages\pytorch_lightning\trainer\connectors\data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=15` in the `DataLoader` to improve performance.


Finding best initial lr:   0%|          | 0/100 [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_steps=100` reached.
Learning rate set to 0.5754399373371567
Restoring states from the checkpoint path at c:\Users\65976\Desktop\SC4001 Assignment\.lr_find_ec7efeb1-9cfe-4e71-b425-81172619f8aa.ckpt
Restored all states from the checkpoint at c:\Users\65976\Desktop\SC4001 Assignment\.lr_find_ec7efeb1-9cfe-4e71-b425-81172619f8aa.ckpt


Output()

<pytorch_lightning.trainer.trainer.Trainer at 0x22c8507fda0>

In [24]:
pred = B1_model.predict(df_test)

print("\n")
print("Evaluation of 2023 data after including 2022 :  \nR2: ", r2_score(df_test['resale_price'], pred['resale_price_prediction']))
print("RMSE: ", math.sqrt(mean_squared_error(df_test['resale_price'], pred['resale_price_prediction'])))




Evaluation of 2023 data after including 2022 :  
R2:  0.6217505638285561
RMSE:  105599.10491991042


### Model Evaluation Summary

**Before Retraining:**
- Training Data: Years [2017, 2018, 2019]
- Testing Data:
  - 2022: R² = 0.418, RMSE = 129,910.63
  - 2023: R² = 0.135, RMSE = 159,714.97

**After Retraining with Updated Data:**
- Training Data: Years [2017, 2018, 2019, 2020, 2021, 2022]
- Testing Data:
  - 2023: R² = 0.622, RMSE = 105,599.10

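After retraining, the test R² for 2023 rose from 0.135 to 0.622 and the RMSE fell from 159,714.97 to 105,599.10, so the retrained model fits the 2023 data considerably better.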


In [25]:
# Summary tables built from the metrics printed in the cells above
pre = {
    "Training set": ['Year <= 2019', ''],
    "Test set": [2022, 2023],
    "Test R2": [0.418, 0.135],
    "Test RMSE": [129910.63, 159714.97]
}

post = {
    "Training set": ['Year <= 2022'],
    "Test set": [2023],
    "Test R2": [0.622],
    "Test RMSE": [105599.10]
}

pre_df = pd.DataFrame(pre)
post_df = pd.DataFrame(post)

print("Training data :  [2017 2018 2019] Testing data :  [2022, 2023]")
display(pre_df)
print("Training data :  [2017 2018 2019 2020 2021 2022] Testing data :  [2023]")
display(post_df)

Training data :  [2017 2018 2019] Testing data :  [2022, 2023]


|   | Training set | Test set | Test R2 | Test RMSE |
|---|--------------|----------|---------|-----------|
| 0 | Year <= 2019 | 2022     | 0.418   | 129910.63 |
| 1 |              | 2023     | 0.135   | 159714.97 |


Training data :  [2017 2018 2019 2020 2021 2022] Testing data :  [2023]


|   | Training set | Test set | Test R2 | Test RMSE |
|---|--------------|----------|---------|-----------|
| 0 | Year <= 2022 | 2023     | 0.622   | 105599.10 |


### Appendix A



Here are our results from a linear regression model. We used StandardScaler for continuous variables and OneHotEncoder for categorical variables.

While 2021 data can be predicted well, test R2 dropped rapidly for 2022 and 2023.

| Training set | Test set | Test R2 |
|--------------|----------|---------|
| Year <= 2020 | 2021     | 0.76    |
| Year <= 2020 | **2022**     | 0.41    |
| Year <= 2020 | **2023**     | **0.10**   |



Similarly, a model trained on 2017 data can predict 2018-2021 well (with slight degradation in performance for 2021), but drops drastically in 2022 and 2023.

| Training set | Test set | Test R2 |
|--------------|----------|---------|
| 2017         | 2018     | 0.90    |
|              | 2019     | 0.89    |
|              | 2020     | 0.87    |
|              | 2021     | 0.72    |
|              | **2022**     | **0.37**    |
|              | **2023**     | **0.09**    |

With the test set fixed at year 2021, training on data from 2017-2020 still works well on the test data, with minimal degradation. Training sets closer to year 2021 generally do better.

| Training set | Test set | Test R2 |
|--------------|----------|---------|
| 2020         | 2021     | 0.81    |
| 2019         | 2021     | 0.75    |
| 2018         | 2021     | 0.73    |
| 2017         | 2021     | 0.72    |