# Question B4 (10 marks)

Model degradation is a common issue faced when deploying machine learning models (including neural networks) in the real world. New data points could exhibit a different pattern from older data points due to factors such as changes in government policy or market sentiments. For instance, housing prices in Singapore have been increasing and the Singapore government has introduced 3 rounds of cooling measures over the past years (16 December 2021, 30 September 2022, 27 April 2023).

In such situations, the distribution of the new data points could differ from the original data distribution which the models were trained on. Recall that machine learning models often work with the assumption that the test distribution is similar to the training distribution. When this assumption is violated, model performance is adversely impacted. In the last part of this assignment, we will investigate to what extent model degradation has occurred.




---



Your co-investigators used a linear regression model to rapidly test out several combinations of train/test splits and shared with you their findings in a brief report attached in Appendix A below. You wish to investigate whether your deep learning model corroborates their findings.

In [1]:
!pip install alibi-detect





In [2]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from alibi_detect.cd import TabularDrift

1. Evaluate your model from B1 on data from year 2022 and report the test R2.

In [3]:
from sklearn.metrics import r2_score
from pytorch_tabular import TabularModel

df = pd.read_csv('hdb_price_prediction.csv')

model = TabularModel.load_model('saved_model/model_B1')

test_df_22 = df[df['year'] == 2022]
test_predictions_22 = model.predict(test_df_22)

y_true = test_df_22['resale_price']
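# predict() returns a DataFrame; select the prediction column explicitly
y_pred = test_predictions_22['resale_price_prediction']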

r2_22 = r2_score(y_true, y_pred)
print(f'Test R2 for year 2022: {r2_22}')

Test R2 for year 2022: 0.4178136950647571


2. Evaluate your model from B1 on data from year 2023 and report the test R2.

In [4]:
model = TabularModel.load_model('saved_model/model_B1')


test_df_23 = df[df['year'] == 2023]
test_predictions_23 = model.predict(test_df_23)

y_true = test_df_23['resale_price']
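# predict() returns a DataFrame; select the prediction column explicitly
y_pred = test_predictions_23['resale_price_prediction']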

r2_23 = r2_score(y_true, y_pred)
print(f'Test R2 for year 2023: {r2_23}')

Test R2 for year 2023: 0.13473514277102083


3. Did model degradation occur for the deep learning model?


# Model Degradation in Machine Learning

Yes, model degradation occurred for the deep learning model. **Model degradation** refers to the decline in a model's performance when it encounters new data that does not align with the distribution of the original training data. This decline becomes more evident when the new data significantly diverges from the patterns observed during training.

### Drop in R² Score

- 2022 R² score: **0.418**
- 2023 R² score: **0.135**

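The **R² score**, which reflects the proportion of variance in the dependent variable that can be explained by the independent variables, fell from 0.418 on 2022 data to 0.135 on 2023 data. This closely tracks the linear regression results in Appendix A (0.41 and 0.10), and indicates: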

- A **significant reduction** in the model's predictive accuracy
- A **reduction in the ability to generalize** to new data over time
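
To make the R² comparison concrete, here is a minimal sketch of how the score is computed as 1 - SS_res / SS_tot. The helper name `r2_manual` is hypothetical and the prices are made-up numbers, not values from the HDB dataset. The second prediction vector imitates a model whose outputs lag behind a market that has moved up by a constant amount.

```python
import numpy as np


def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (hypothetical helper for illustration)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot


# Made-up resale prices, not taken from the dataset
y_true = np.array([300_000, 400_000, 500_000, 600_000])

# Small, unbiased errors
good_pred = y_true + np.array([10_000, -10_000, 10_000, -10_000])

# Every price under-predicted by the same amount (a "stale" model)
stale_pred = y_true - 100_000

r2_good = r2_manual(y_true, good_pred)
r2_stale = r2_manual(y_true, stale_pred)
print(f"R2 with small errors:     {r2_good:.3f}")   # 0.992
print(f"R2 with constant offset:  {r2_stale:.3f}")  # 0.200
```

Even though the stale predictions rank the flats perfectly, the constant offset alone is enough to pull R² down sharply, which is one way a price level change can show up as degradation.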




---



4. Model degradation could be caused by [various data distribution shifts](https://huyenchip.com/2022/02/07/data-distribution-shifts-and-monitoring.html#data-shift-types): covariate shift (features), label shift and/or concept drift (altered relationship between features and labels).
There are various conflicting terminologies in the [literature](https://www.sciencedirect.com/science/article/pii/S0950705122002854#tbl1). Let’s stick to this reference for this assignment.

> Using the **Alibi Detect** library, apply the **TabularDrift** function with the training data (year 2019 and before) used as the reference and **detect which features have drifted** in the 2023 test dataset. Before running the statistical tests, ensure you **sample 1000 data points** each from the train and test data. Do not use the whole train/test data. (Hint: use this example as a guide https://docs.seldon.io/projects/alibi-detect/en/stable/examples/cd_chi2ks_adult.html)


In [6]:
# TODO: cross-check results with others

categorical_cols = ['month', 'town', 'flat_model_type', 'storey_range']  
continuous_cols = ['dist_to_nearest_stn', 'dist_to_dhoby', 'degree_centrality', 'eigenvector_centrality', 'remaining_lease_years', 'floor_area_sqm'] 

# Randomly sample 1000 data points each from the train and test datasets before running the statistical tests
df_train = df[df['year'] <= 2019]
df_test  = df[df['year'] == 2023]

X_train_sample = df_train.sample(1000, random_state=SEED)
X_test_sample = df_test.sample(1000, random_state=SEED)

# Define the reference + test variables
X_ref = X_train_sample[categorical_cols + continuous_cols].values
y_ref = X_train_sample['resale_price'].values

X_test = X_test_sample[categorical_cols + continuous_cols].values
y_test = X_test_sample['resale_price'].values

# Create category_map
X = df[categorical_cols + continuous_cols]

cat_map = {}
for i in range(len(X.columns)):
    if X.columns[i] in categorical_cols:
        cat_map[i] = df[X.columns[i]].unique().tolist()
categories_per_feature = {f: None for f in list(cat_map.keys())}

# Initialize the TabularDrift detector with a p-value threshold
cd = TabularDrift(X_ref, p_val=.05, categories_per_feature=categories_per_feature)

# Predict drift on the test dataset
pred = cd.predict(X_test)
labels = ['No!', 'Yes!']
print('Drift? {}'.format(labels[pred['data']['is_drift']]))
print("Threshold:", pred['data']['threshold'])

# Detect and print drifted features
feature_predict = cd.predict(X_test, drift_type='feature')

for feature in range(cd.n_features):
    stat = 'Chi2' if feature in list(categories_per_feature.keys()) else 'K-S'
    fname = X.columns.values[feature]
    is_drift = feature_predict['data']['is_drift'][feature]
    stat_val, p_val = feature_predict['data']['distance'][feature], feature_predict['data']['p_val'][feature]
    print(f'{fname:<20} \t Drift? {labels[is_drift]} \t {stat:<8} {stat_val:.3f} \t p-value {p_val:.3f}')

Drift? Yes!
Threshold: 0.005
month                	 Drift? Yes! 	 Chi2     430.336 	 p-value 0.000
town                 	 Drift? No! 	 Chi2     33.178 	 p-value 0.127
flat_model_type      	 Drift? Yes! 	 Chi2     62.122 	 p-value 0.001
storey_range         	 Drift? Yes! 	 Chi2     27.842 	 p-value 0.010
dist_to_nearest_stn  	 Drift? No! 	 K-S      0.035 	 p-value 0.561
dist_to_dhoby        	 Drift? No! 	 K-S      0.059 	 p-value 0.059
degree_centrality    	 Drift? No! 	 K-S      0.038 	 p-value 0.455
eigenvector_centrality 	 Drift? No! 	 K-S      0.056 	 p-value 0.084
remaining_lease_years 	 Drift? Yes! 	 K-S      0.163 	 p-value 0.000
floor_area_sqm       	 Drift? Yes! 	 K-S      0.062 	 p-value 0.041
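
The output above labels categorical features with a Chi-squared test and continuous features with a Kolmogorov-Smirnov (K-S) test, and the batch-level threshold of 0.005 is consistent with a Bonferroni correction of `p_val=.05` across the 10 features. The following is a rough sketch of those two tests using `scipy.stats` on synthetic data. It is an illustration under those assumptions, not alibi-detect's internal code, and the feature values and category names are invented.

```python
import numpy as np
from scipy.stats import chi2_contingency, ks_2samp

rng = np.random.default_rng(42)

# Continuous feature: synthetic reference sample vs a shifted test sample
ref_cont = rng.normal(loc=70.0, scale=10.0, size=1000)
test_cont = rng.normal(loc=60.0, scale=10.0, size=1000)
ks_stat, ks_p = ks_2samp(ref_cont, test_cont)

# Categorical feature: compare category counts through a contingency table
categories = np.array(["low", "mid", "high"])  # invented category names
ref_cat = rng.choice(categories, size=1000, p=[0.5, 0.3, 0.2])
test_cat = rng.choice(categories, size=1000, p=[0.2, 0.3, 0.5])
table = np.array([
    [(ref_cat == c).sum() for c in categories],   # reference counts
    [(test_cat == c).sum() for c in categories],  # test counts
])
chi2_stat, chi2_p, dof, _ = chi2_contingency(table)

# Bonferroni correction: divide the significance level by the number of tests
p_val, n_features = 0.05, 10
threshold = p_val / n_features

print(f"K-S   statistic {ks_stat:.3f}  p-value {ks_p:.3g}")
print(f"Chi2  statistic {chi2_stat:.3f}  p-value {chi2_p:.3g}  (dof={dof})")
print(f"Bonferroni threshold: {threshold:.3f}")
```

Note that the per-feature "Drift?" flags in the output above appear to use the uncorrected 0.05 level (for example, `floor_area_sqm` is flagged at p = 0.041), while the batch-level decision uses the corrected threshold.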


5. Assuming that the flurry of housing measures has made an impact on the relationship between all the features and resale_price (i.e. P(Y|X) changes), which type of data distribution shift possibly led to model degradation?



**Concept drift** occurs when the relationship between the features (e.g., location and flat type) and the target, **P(Y|X)**, changes. This can happen even when the **feature distribution (P(X))** remains relatively unchanged.

- Several **input features** (e.g., `town` and the distance-based location features) show no significant drift in the TabularDrift results.
- We are likely unable to predict resale prices accurately due to **external factors**, such as:
  - New housing policies
  - Changes in market dynamics
  - Other unobserved influences

### Distinguishing Concept Drift from Covariate Shift:
- **Covariate shift** (sometimes called data drift) involves changes in the feature distribution (**P(X)**).
- In contrast, **concept drift** refers to shifts in the underlying relationship between the features and the target variable (**P(Y|X)**).

In this instance, **concept drift** is the probable cause of the **model degradation** we're observing, as external factors have likely altered the dynamics between input features and resale prices.
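
The distinction can be illustrated with a small synthetic experiment. In this sketch the feature is drawn from the same distribution in both periods, so P(X) is unchanged, and only the price per square metre changes, so P(Y|X) shifts. All numbers are invented for illustration and the local `r2` helper is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Same feature distribution P(X) in both periods (synthetic floor areas)
x_old = rng.uniform(60, 120, size=2000)
x_new = rng.uniform(60, 120, size=2000)

# Only P(Y|X) changes: price per sqm rises from 4000 to 6000 (made-up values)
y_old = 4000 * x_old + rng.normal(0, 20_000, size=2000)
y_new = 6000 * x_new + rng.normal(0, 20_000, size=2000)

# Fit a straight line on the old period only
slope, intercept = np.polyfit(x_old, y_old, deg=1)

r2_old = r2(y_old, slope * x_old + intercept)
r2_new = r2(y_new, slope * x_new + intercept)
print(f"R2 on old period: {r2_old:.3f}")
print(f"R2 on new period: {r2_new:.3f}")
```

A drift detector that only looks at the features would find nothing here, yet the old model's R² on the new period collapses. This is why feature-level tests such as TabularDrift cannot confirm or rule out concept drift on their own.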


6. From your analysis via TabularDrift, which features contribute to this shift?


Based on our results, the following features demonstrate **significant drift** (p-value < 0.05):

- Month
- Flat Model Type
- Storey Range
- Remaining Lease Years
- Floor Area (sqm)
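
As a cross-check, the list above can be reproduced from the p-values printed in the TabularDrift output. The sketch below copies those p-values by hand (rounded to three decimals, as printed) and also shows, as an additional assumption-labelled comparison, which features would remain if the stricter Bonferroni-corrected threshold of 0.005 were applied per feature.

```python
# p-values copied from the TabularDrift output (rounded to 3 decimals as printed)
p_values = {
    "month": 0.000,
    "town": 0.127,
    "flat_model_type": 0.001,
    "storey_range": 0.010,
    "dist_to_nearest_stn": 0.561,
    "dist_to_dhoby": 0.059,
    "degree_centrality": 0.455,
    "eigenvector_centrality": 0.084,
    "remaining_lease_years": 0.000,
    "floor_area_sqm": 0.041,
}

feature_level = 0.05                          # threshold used in the answer above
bonferroni = feature_level / len(p_values)    # 0.005, matches the printed threshold

drifted = [f for f, p in p_values.items() if p < feature_level]
drifted_strict = [f for f, p in p_values.items() if p < bonferroni]

print("Drifted at 0.05: ", drifted)
print("Drifted at 0.005:", drifted_strict)
```

Under the stricter threshold only `month`, `flat_model_type` and `remaining_lease_years` remain, so `storey_range` and `floor_area_sqm` are the more borderline cases.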

7. Suggest 1 way to address model degradation and implement it, showing improved test R2 for year 2023.


In [9]:
# YOUR CODE HERE

In [10]:
# Train the model on the new training data

import warnings
from pytorch_tabular import TabularModel
from pytorch_tabular.models import CategoryEmbeddingModelConfig
from pytorch_tabular.config import (
    DataConfig,
    OptimizerConfig,
    TrainerConfig,
)

# Define the DataConfig
data_config = DataConfig(
    target=['resale_price'], 
    continuous_cols=['dist_to_nearest_stn', 'dist_to_dhoby', 'degree_centrality', 'eigenvector_centrality', 'remaining_lease_years', 'floor_area_sqm'],
    categorical_cols=['month', 'town', 'flat_model_type', 'storey_range']
)

# Define the TrainerConfig
trainer_config = TrainerConfig(
    auto_lr_find=True,  
    batch_size=1024,  
    max_epochs=50  
)

# Define the CategoryEmbeddingModelConfig
model_config = CategoryEmbeddingModelConfig(
    task="regression", 
    layers="50",  
)

# Define the OptimizerConfig
optimizer_config = OptimizerConfig(
    optimizer="Adam"  
)

# Define the TabularModel
model_new = TabularModel(
    data_config=data_config,  
    model_config=model_config,  
    optimizer_config=optimizer_config, 
    trainer_config=trainer_config
)


warnings.filterwarnings("ignore")

df_2019 = df[df['year'] <= 2019]
df_2022 = df[df['year'] == 2022]
df_test  = df[df['year'] == 2023]

# Split the 2022 data in half for train and validation
df_2022_first_half = df_2022.iloc[:len(df_2022)//2]
df_2022_second_half = df_2022.iloc[len(df_2022)//2:]

df_train = pd.concat([df_2019, df_2022_first_half])
df_val = df_2022_second_half

model_new.fit(train=df_train, validation=df_val, seed=SEED)
pred_new = model_new.predict(df_test)
res_new = model_new.evaluate(df_test)



Learning rate set to 0.5754399373371567

In [11]:
# TODO: compare with others' results

from sklearn.metrics import mean_squared_error, r2_score
import math

y_true = df_test['resale_price']  
y_pred = pred_new['resale_price_prediction']  

# Take the square root explicitly; this avoids the deprecated `squared=False` argument
rmse = math.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

print(f'Test RMSE: {rmse}')
print(f'Test R2: {r2}')

Test RMSE: 130797.2002051138
Test R2: 0.41969714032253713


### Addressing Concept Drift in Model Degradation

As we have identified **concept drift** as the main cause of model degradation, we can address it by retraining the model on more recent data: the first half of the 2022 data is added to the training set (year 2019 and before), and the second half of 2022 is used for validation. This helps the model learn the new relationships between the features and the target variable.

By doing so, we improved the test R2 for 2023 from 0.135 to 0.420.



### Appendix A



Here are our results from a linear regression model. We used StandardScaler for continuous variables and OneHotEncoder for categorical variables.

While 2021 data can be predicted well, test R2 dropped rapidly for 2022 and 2023.

| Training set | Test set | Test R2 |
|--------------|----------|---------|
| Year <= 2020 | 2021     | 0.76    |
| Year <= 2020 | **2022**     | 0.41    |
| Year <= 2020 | **2023**     | **0.10**   |



Similarly, a model trained on 2017 data can predict 2018-2021 well (with slight degradation in performance for 2021), but its test R2 drops drastically in 2022 and 2023.

| Training set | Test set | Test R2 |
|--------------|----------|---------|
| 2017         | 2018     | 0.90    |
|              | 2019     | 0.89    |
|              | 2020     | 0.87    |
|              | 2021     | 0.72    |
|              | **2022**     | **0.37**    |
|              | **2023**     | **0.09**    |

With the test set fixed at year 2021, training on data from 2017-2020 still works well on the test data, with minimal degradation. Training sets closer to year 2021 generally do better.

| Training set | Test set | Test R2 |
|--------------|----------|---------|
| 2020         | 2021     | 0.81    |
| 2019         | 2021     | 0.75    |
| 2018         | 2021     | 0.73    |
| 2017         | 2021     | 0.72    |
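
For readers who want to reproduce splits like the ones in this appendix, the following is a minimal sketch of a linear regression baseline with StandardScaler for continuous variables and OneHotEncoder for categorical variables. The helper `evaluate_split` and the synthetic `toy` frame are hypothetical stand-ins; the co-investigators' actual code is not shown in this report, so details such as `handle_unknown="ignore"` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def evaluate_split(df, train_mask, test_mask, continuous_cols, categorical_cols, target):
    """Fit on df[train_mask] and return R^2 on df[test_mask] (hypothetical helper)."""
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), continuous_cols),
        # Assumption: categories unseen during training are encoded as all zeros
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])
    model = make_pipeline(preprocess, LinearRegression())
    features = continuous_cols + categorical_cols
    model.fit(df.loc[train_mask, features], df.loc[train_mask, target])
    preds = model.predict(df.loc[test_mask, features])
    return r2_score(df.loc[test_mask, target], preds)


# Tiny synthetic frame standing in for the real dataset (no drift built in)
rng = np.random.default_rng(0)
n = 600
toy = pd.DataFrame({
    "year": rng.choice([2019, 2020, 2021], size=n),
    "floor_area_sqm": rng.uniform(60, 120, size=n),
    "town": rng.choice(["A", "B", "C"], size=n),
})
town_premium = toy["town"].map({"A": 0, "B": 50_000, "C": 100_000})
toy["resale_price"] = (
    4000 * toy["floor_area_sqm"] + town_premium + rng.normal(0, 10_000, size=n)
)

# Train on year <= 2020, test on 2021, mirroring the first table's layout
r2_toy = evaluate_split(
    toy, toy["year"] <= 2020, toy["year"] == 2021,
    continuous_cols=["floor_area_sqm"], categorical_cols=["town"],
    target="resale_price",
)
print(f"Toy test R2: {r2_toy:.3f}")
```

With the real data, the same helper could be called with different year masks to fill in each row of the tables above. Because the toy data has no drift, its test R2 stays high, unlike the 2022 and 2023 rows.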