### Predicting HDB Prices with Ridge & Lasso Regression

In this notebook, we use Ridge and Lasso regression to improve on our base linear regression model. If you have not been through the base linear regression notebook, you will find it here <to fill at a later stage .com>. Much of the original thinking is shared there and is built upon here to create a better predictive model, so it is worth looking through that notebook before exploring this one.

To briefly reiterate: having collected the data, wrangled it together, and carried out exploratory data analysis, we now model the data to predict HDB prices.

#### Brief Refresher on Differences Between Ridge & Lasso Regression
Ridge and Lasso regression are both methods used in linear regression to prevent overfitting.

| **Ridge Regression** | **Lasso Regression** |
| -------- | ------- |
|Minimizes the sum of squared residuals (like ordinary least squares) plus a penalty proportional to the sum of the squared magnitudes of the coefficients. | Also minimizes the sum of squared residuals, but with a penalty proportional to the sum of the absolute values of the coefficients. |
|This is known as L2 regularization. | This is known as L1 regularization. |
|Adds “squared magnitude” of coefficient as penalty term to the loss function.| Adds “absolute value of magnitude” of coefficient as penalty term to the loss function |
|Shrinks coefficients but rarely makes them exactly zero. Thus, it does not inherently perform feature selection. | Can shrink coefficients to zero, effectively performing feature selection and removing some variables entirely.|
|Not suited for variable selection in cases where we have a large number of features. | More useful when you want to reduce the number of features. |
|Introduces bias into the estimates but reduces the variance, particularly useful when multicollinearity is present. | Also introduces bias but can eliminate the variance of some coefficients entirely (by setting them to zero). |
|Better suited when the data includes many large parameters of about equal importance. | More efficient when the solution is believed to have only a few significant parameters and the goal is to identify them. |
|Generally has unique solutions and is computationally efficient. | Can be computationally more challenging due to the absolute value in the penalty term, especially when the number of variables is very large. |

To summarize, Ridge regression is good for reducing model complexity and preventing over-fitting but cannot zero out coefficients. Lasso regression is capable of reducing the number of features in your model by setting some coefficient values to zero, which can be particularly useful in model selection and interpretation.
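
The difference between the two penalties is easiest to see in the simplified one-coefficient case, where the unpenalized estimate is `b`. The sketch below assumes the objectives `0.5*(b - beta)**2 + 0.5*lam*beta**2` for Ridge and `0.5*(b - beta)**2 + lam*abs(beta)` for Lasso; the function names and example values are illustrative and not part of the notebook's pipeline.

```python
import numpy as np

def ridge_shrink(beta_ols, lam):
    # Minimizer of 0.5*(b - beta)^2 + 0.5*lam*beta^2: proportional shrinkage
    return beta_ols / (1.0 + lam)

def lasso_shrink(beta_ols, lam):
    # Minimizer of 0.5*(b - beta)^2 + lam*|beta|: soft-thresholding
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

# Illustrative unpenalized coefficients (assumed values)
beta_ols = np.array([3.0, -1.5, 0.4, -0.2])
lam = 0.5

ridge_coefs = ridge_shrink(beta_ols, lam)
lasso_coefs = lasso_shrink(beta_ols, lam)

print("ridge:", ridge_coefs)  # every coefficient shrinks but stays non-zero
print("lasso:", lasso_coefs)  # coefficients with |b| <= lam become zero
```

Ridge scales every coefficient by the same factor, while Lasso subtracts a fixed amount and clips at zero, which is why only Lasso removes features outright.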

Code from the linear regression notebook will now be copied over up to the point of modelling to allow repeatability.

#### Load Libraries

In [1]:
# Data Manipulation
import pandas as pd
import numpy as np

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Train Test Split
from sklearn.model_selection import train_test_split, cross_validate

# Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, RobustScaler
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer

# Modelling
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from scipy.stats import uniform, loguniform

# Model Evaluation
from sklearn.metrics import mean_absolute_error, r2_score
import scipy.stats as stats

#### Load Data into DataFrame, Prepare Data for Pipeline, and Train Test Split

In [4]:
# Make file path variable so that all we need is to change this if we move notebook location
file_path = '../data/processed/final_HDB_for_model.parquet.gzip'

# Read parquet file into a DataFrame
df = pd.read_parquet(file_path)

# Rename columns for easier readability
df.columns = df.columns.str.replace('Lime, Cement, & Fabricated Construction Materials Excl Glass & Clay Materials', 'key construction materials')
df.columns = df.columns.str.replace('Clay Construction Materials & Refractory Construction Materials', 'other construction materials')

# Put all columns to be deleted into a list
drop_cols = ['block', 'street_name','address','sold_year_month'] + \
            ['GDP per capita','Personal Income m','GDPm (Current Prices)',
             'Median Household Inc','Core inflation','ResidentPopulation',
             'other construction materials']

# Drop columns
df = df.drop(columns=drop_cols)


In [3]:
# Create lists of the categorical and numerical columns allowing them to be treated differently
cat_cols = df.select_dtypes(include=['object']).columns
num_cols = df.select_dtypes(include=['int64', 'float64']).columns

# Create new list of numeric columns, removing resale_price from columns to scale
num_cols_scale = list(num_cols)
num_cols_scale.remove('resale_price')


In [4]:
# Select target column
target_col = 'resale_price'

# Ready X and y
X = df.loc[:, ~df.columns.isin([target_col])]
y = df[target_col]

# Split the data, 80-20 split with a random state included for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 54)


#### Create Preprocessing Pipeline

In [5]:
# Create instances of OneHotEncoder
cat_transformer = OneHotEncoder(drop='first', handle_unknown='ignore')

# Create pipeline of two scalers for numeric data
num_transformer = make_pipeline(RobustScaler(), MinMaxScaler())

# Create a column transformer to apply transformations to subsets of columns
prepoc = make_column_transformer(
    (cat_transformer, cat_cols),
    (num_transformer, num_cols_scale),
    remainder = 'passthrough'
)

# View Pipeline
prepoc
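
A side note on the numeric branch: both scalers are affine maps, and min-max scaling undoes any earlier shift and positive rescale. It therefore appears that `RobustScaler()` followed by `MinMaxScaler()` produces the same values as `MinMaxScaler()` alone. The sketch below checks this with plain NumPy re-implementations; `robust_scale` and `min_max_scale` are illustrative helpers that mirror the scalers' default behaviour (median/IQR and a [0, 1] range), not scikit-learn functions, and the sample values are made up.

```python
import numpy as np

def robust_scale(x):
    # Center on the median and divide by the interquartile range
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    return (x - median) / (q3 - q1)

def min_max_scale(x):
    # Map the smallest value to 0 and the largest to 1
    return (x - x.min()) / (x.max() - x.min())

# Hypothetical floor areas in sqm, including one outlier
x = np.array([35.0, 67.0, 90.0, 104.0, 121.0, 280.0])

both = min_max_scale(robust_scale(x))
only_min_max = min_max_scale(x)

print(np.allclose(both, only_min_max))  # prints True
```

If robustness to outliers is the goal, one option is to keep only `RobustScaler()`; if a [0, 1] range is the goal, `MinMaxScaler()` on its own gives the same result as the two-step pipeline.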

In [6]:
# Fit the preprocessor on X_train, then transform X_train and X_test
X_train_processed = prepoc.fit_transform(X_train)
X_test_processed = prepoc.transform(X_test)

# Check to see if it worked
print("Number of columns originally:", X.shape[1])
print("Number of columns after preprocessing:",X_train_processed.shape[1])

Number of columns originally: 15
Number of columns after preprocessing: 183


### Creating a Ridge Regression Model & Evaluation 

In [7]:
# Instantiate the model
base_ridge_model = Ridge()

# Define multiple scoring metrics
scoring = ['r2', 'neg_mean_absolute_error']

# Get the cross validation scores
scores = cross_validate(base_ridge_model, X_train_processed, y_train, cv=5, scoring=scoring, return_train_score=False)

# View scores dictionary
scores

{'fit_time': array([2.56644011, 2.74671292, 2.63409185, 2.71825194, 2.64139485]),
 'score_time': array([0.0036509 , 0.00365281, 0.00300694, 0.00271106, 0.00328016]),
 'test_r2': array([0.8738275 , 0.87276868, 0.87345216, 0.87276727, 0.87286833]),
 'test_neg_mean_absolute_error': array([-44789.76482433, -45089.68931652, -44867.45881518, -44962.34895703,
        -44976.10572063])}

In [8]:
# Get rounded scores stored in variables
train_base_r2_mean = round(scores['test_r2'].mean(), 2)
train_base_mae_mean = round(-(scores['test_neg_mean_absolute_error'].mean()),2)

# Print scores to assess
print("Training r2 score =", train_base_r2_mean)
print("Training Mean Absolute Error =", train_base_mae_mean)

Training r2 score = 0.87
Training Mean Absolute Error = 44937.07


**Training R-squared (R²) Score = 0.87:**
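 - R-squared is a statistical measure that represents the proportion of the variance for the dependent variable that's explained by the independent variables in a regression model.
 - The value reported here is the mean over the five cross-validation folds of the training set, so each fold is scored on rows that were held out from fitting.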
 - In general, a higher R² indicates a better fit of the model to the data.
 - An R² score of 0.87 suggests that 87% of the variability in the HDB price can be explained by the model. 
 - An R² score of 0.87 is typically considered high, indicating that the model explains a large proportion of the variance in the outcome variable.
 - However, it's important to note that a high R² does not necessarily mean the model is good. It doesn't indicate whether the model is appropriate, nor does it imply that the predictions are accurate.

**Training Mean Absolute Error (MAE) = 44937.07:**

 - Mean Absolute Error (MAE) gives an average of the absolute errors between the predicted values and the actual values without considering the direction (i.e., over or under-predicting).
 - It's a common measure of forecast error in regression analysis.
 - A MAE of $44937.07 means that, on average, the predictions of the model are off by $44937.07.
 - The magnitude of the MAE needs to be considered in the context of the scale of the dependent variable, HDB prices. For some datasets an MAE of 44937.07 might be very small, while for others it might be considered large. Since HDB flats cost about $500K on average, an MAE of ~$45K means the model's price predictions are off by roughly 9% on average.
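
To make the two metrics concrete, here is a small worked example that computes MAE and R² by hand. The four prices and predictions are hypothetical values chosen for easy arithmetic, not rows from the dataset; the last line relates the reported MAE to the roughly $500K average price mentioned above.

```python
import numpy as np

# Hypothetical actual and predicted prices for four flats (illustrative only)
y_true = np.array([400_000.0, 520_000.0, 610_000.0, 450_000.0])
y_hat = np.array([430_000.0, 500_000.0, 580_000.0, 470_000.0])

# MAE: average size of the errors, ignoring their direction
mae = np.mean(np.abs(y_true - y_hat))

# R^2: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Reported cross-validated MAE relative to an average price of about $500K
relative_mae = 44937.07 / 500_000

print("MAE:", mae)
print("R2:", round(r2, 4))
print("Relative MAE:", round(relative_mae, 3))
```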

In [46]:
# Fitting model on data.
base_ridge_model.fit(X_train_processed, y_train)


In [10]:
# Predict y with fitted model
y_pred = base_ridge_model.predict(X_test_processed)

# results
test_base_mae_mean = round(mean_absolute_error(y_test, y_pred),2)
test_base_r2_mean = round(r2_score(y_test, y_pred),2)

print("Testing r2 score =", test_base_r2_mean)
print("Testing Mean Absolute Error =", test_base_mae_mean)

Testing r2 score = 0.87
Testing Mean Absolute Error = 44806.75


**Testing R-squared (R²) Score = 0.87:**
- An R² of 0.87 for the testing set is also high, suggesting that the model has generalized well to new data. 
- It's particularly notable when the testing R² is close to the training R², as it indicates consistency in performance. 

**Testing Mean Absolute Error (MAE) = 44806.75:**
 - The testing MAE being slightly lower than the training MAE (44806.75 vs. 44937.07) is a positive sign. 
 - It suggests that the model is not overfitting and is performing slightly better or at least as well on unseen data compared to the training data.

**Comments on Scores:**

 Overall, these metrics indicate a model that performs consistently on both training and testing data, with a high R² suggesting good explanatory power and a MAE providing insight into the average prediction error. The similarity between training and testing scores is a good sign, indicating that the model has generalized well and is not just fitting to the peculiarities of the training data.
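
One caveat on the cross-validated scores: `prepoc` was fitted on all of `X_train` before `cross_validate` ran, so the scaler ranges and encoder categories were learned from rows that later act as validation folds. The effect is probably small here, but a common way to avoid it is to put the preprocessor and the model in one pipeline so that each fold refits both. The sketch below shows the pattern on a small synthetic frame; the `toy` data, its column values, and the price formula are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic stand-in for the HDB data (assumed columns and values)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "town": rng.choice(["A", "B", "C"], size=n),
    "floor_area_sqm": rng.uniform(40, 150, size=n),
})
town_premium = toy["town"].map({"A": 0, "B": 50_000, "C": 100_000})
price = 3_000 * toy["floor_area_sqm"] + town_premium + rng.normal(0, 10_000, size=n)

# Preprocessing and model live in one estimator
pre = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["town"]),
    (MinMaxScaler(), ["floor_area_sqm"]),
)
model = make_pipeline(pre, Ridge(alpha=1.0))

# Each of the 5 folds now fits the encoder and scaler on its own training part
cv_scores = cross_validate(model, toy, price, cv=5,
                           scoring=["r2", "neg_mean_absolute_error"])
print(cv_scores["test_r2"].mean())
```

With this layout the raw `X_train` is passed to `cross_validate` or a search object, and hyperparameters are addressed with the step prefix (for example `ridge__alpha`).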

#### Taking a look at the Feature Coefficients

In [11]:
# To Get importance of features in a DF:
# Get feature names
feature_names = prepoc.get_feature_names_out()

# Get coefficients
coefficients = base_ridge_model.coef_

# Create empty Dictionary
feature_coefficients = {}

# For loop to print coefficient and put them into dict
for feature, coef in zip(feature_names, coefficients):
    #print(f"{feature}: {coef}")
    feature_coefficients[feature] = coef

# Include intercept in the dict
#print(f"intercept: {base_ridge_model.intercept_}")
feature_coefficients["intercept"] = base_ridge_model.intercept_

# Converting to DataFrame
feature_coefficients_df = pd.DataFrame(list(feature_coefficients.items()), columns=['Feature', 'Coefficient'])

# Sorting the DataFrame by the absolute values of the 'Coefficient' column
feature_coefficients_df = feature_coefficients_df.sort_values(by='Coefficient', key=abs, ascending=False)

# Show top 15
feature_coefficients_df.head(15)

| | Feature | Coefficient |
| -------- | ------- | ------- |
| 173 | `pipeline__floor_area_sqm` | 584270.151926 |
| 96 | `onehotencoder__most_closest_mrt_CHANGI AIRPORT` | 362780.291468 |
| 111 | `onehotencoder__most_closest_mrt_HAVELOCK` | 321914.519565 |
| 75 | `onehotencoder__most_closest_mrt_BEAUTY WORLD` | 305992.464216 |
| 175 | `pipeline__sold_year` | 305281.652975 |
| 86 | `onehotencoder__most_closest_mrt_BRAS BASAH` | 281789.592973 |
| 181 | `pipeline__walking_time_mrt` | -264765.170382 |
| 71 | `onehotencoder__flat_model_Type S2` | 248601.241096 |
| 123 | `onehotencoder__most_closest_mrt_LABRADOR PARK` | 239887.71528 |
| 69 | `onehotencoder__flat_model_Terrace` | 234835.342512 |


**Comments**

The model seems to make sense: floor_area_sqm is the single strongest predictor of price, followed by the closest MRT station to the HDB, the year it was sold, and the walking time to the closest MRT.
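
When reading these coefficients, note that the numeric features pass through `MinMaxScaler`, so a numeric coefficient is the price change across the feature's full observed range rather than per unit. Dividing by that range converts it back to original units. The sketch below does this for `floor_area_sqm`; the minimum and maximum areas are hypothetical placeholders and should be replaced with the actual values from `X_train`.

```python
# Ridge coefficient for pipeline__floor_area_sqm, taken from the output above
coef_scaled = 584270.151926

# HYPOTHETICAL training range in sqm (assumption; use X_train's real min and max)
area_min, area_max = 31.0, 280.0

# Min-max scaling divides by (max - min), so the per-unit effect is coef / range
coef_per_sqm = coef_scaled / (area_max - area_min)

print("Approximate price change per extra sqm:", round(coef_per_sqm, 2))
```

The one-hot coefficients need no such conversion: each is the price difference relative to the dropped reference category.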

#### Improving Model by Including Model Hyperparameter Tuning

In [12]:
# Define the model
base_ridge_model = Ridge()

# Define a distribution to sample the alpha parameter
params = {
    'alpha': uniform(0.1, 1), # scipy's uniform(loc, scale) samples alpha uniformly from [0.1, 1.1]
    'solver': ['lsqr']
}

# Setup the random search with 5-fold cross-validation
ridge_random_search = RandomizedSearchCV(estimator=base_ridge_model,
                                         param_distributions=params,
                                         n_iter=100, # Number of parameter settings sampled
                                         scoring='r2', # or another relevant scoring method
                                         cv=5, # Number of folds in cross-validation
                                         random_state=108, # Seed for reproducibility
                                         n_jobs=-1) # Use all available cores

# Search and fit model
ridge_random_search.fit(X_train_processed, y_train)
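
A note on the search space: `scipy.stats.uniform` is parameterized by `loc` and `scale`, not by lower and upper bounds, so `uniform(0.1, 1)` covers the interval [0.1, 1.1]. The sketch below shows that range next to two alternatives that would reach 1000; the names `narrow`, `wide`, and `log_wide` are illustrative, and whether a wider range helps on this data is not something this notebook has tested.

```python
from scipy.stats import loguniform, uniform

# uniform(loc, scale) spans [loc, loc + scale]
narrow = uniform(0.1, 1)          # [0.1, 1.1], as used in the search above
wide = uniform(0.1, 999.9)        # [0.1, 1000] on a linear scale
log_wide = loguniform(0.1, 1000)  # [0.1, 1000], uniform on a log scale

# ppf(0) and ppf(1) give the lower and upper ends of each distribution
print("narrow:", narrow.ppf([0, 1]))
print("wide:", wide.ppf([0, 1]))
print("log_wide:", log_wide.ppf([0, 0.5, 1]))
```

For regularization strengths, a log-uniform distribution is often preferred because it spends as many samples between 0.1 and 1 as between 100 and 1000.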

In [13]:
best_parameters = ridge_random_search.best_params_
print(best_parameters)
best_random_cv_model = ridge_random_search.best_estimator_
best_random_cv_model

{'alpha': 0.20660549017777122, 'solver': 'lsqr'}


In [14]:
# Predict y with fitted model
y_pred = best_random_cv_model.predict(X_test_processed)

# results
random_cv_mae_mean = round(mean_absolute_error(y_test, y_pred),2)
random_cv_r2_mean = round(r2_score(y_test, y_pred),2)

print("Testing r2 score =", random_cv_r2_mean)
print("Testing Mean Absolute Error =", random_cv_mae_mean)

Testing r2 score = 0.87
Testing Mean Absolute Error = 44842.81


**Comments on Scores:**

 Overall, these metrics indicate a model that performs almost identically to the baseline model. R² is the same, but the MAE (44842.81) is slightly worse than the baseline (44806.75).

In [15]:
# Define the model
base_ridge_model = Ridge()

# Define the parameter grid
param_grid = {
    'alpha': np.linspace(0, 2, 50)
}

# Setup the grid search with 5-fold cross-validation
ridge_grid_search = GridSearchCV(estimator=base_ridge_model,
                                 param_grid=param_grid,
                                 scoring='r2', # or another relevant scoring method
                                 cv=5, # Number of folds in cross-validation
                                 verbose=1, # To see the progress
                                 n_jobs=-1) # Use all available cores


ridge_grid_search.fit(X_train_processed, y_train)

Fitting 5 folds for each of 50 candidates, totalling 250 fits


In [16]:
best_parameters = ridge_grid_search.best_params_
best_grid_model = ridge_grid_search.best_estimator_
best_grid_model

In [17]:
# Predict y with fitted model
y_pred = best_grid_model.predict(X_test_processed)

# results
grid_search_mae_mean = round(mean_absolute_error(y_test, y_pred),2)
grid_search_r2_mean = round(r2_score(y_test, y_pred),2)

print("Testing r2 score =", grid_search_r2_mean)
print("Testing Mean Absolute Error =", grid_search_mae_mean)

Testing r2 score = 0.87
Testing Mean Absolute Error = 44804.05


**Comments on Scores:**

 Overall, these metrics indicate a model that performs almost identically to the baseline model. R² is the same (0.87), but the MAE (44804.05) is very slightly better than the baseline (44806.75).
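
The grid searched above is linear on [0, 2] and includes `alpha = 0`, which reduces Ridge to ordinary least squares. Because the effect of `alpha` tends to show up across orders of magnitude, a log-spaced grid is a common alternative. The sketch below builds one; the bounds of 1e-3 and 1e3 are an assumption for illustration, not values tuned for this dataset.

```python
import numpy as np

# Grid used above: 50 evenly spaced values, starting at 0 (plain least squares)
linear_grid = np.linspace(0, 2, 50)

# Alternative: 13 values from 0.001 to 1000, two per order of magnitude
log_grid = np.logspace(-3, 3, 13)

print("linear grid range:", linear_grid.min(), "to", linear_grid.max())
print("log grid:", log_grid)

# Could be passed to GridSearchCV in place of the linear grid
param_grid = {"alpha": log_grid}
```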


### Creating a Lasso Regression Model & Evaluation 

In [19]:
# Instantiate the model
base_lasso_model = Lasso()

# Define multiple scoring metrics
scoring = ['r2', 'neg_mean_absolute_error']

# Get the cross validation scores
scores = cross_validate(base_lasso_model, X_train_processed, y_train, cv=5, scoring=scoring, return_train_score=False)

# View scores dictionary
scores


{'fit_time': array([158.04950929, 155.28038621, 156.28184772, 156.10373902,
        155.05353475]),
 'score_time': array([0.00600076, 0.00276566, 0.00283122, 0.00414681, 0.00351715]),
 'test_r2': array([0.8735344 , 0.87250976, 0.87322993, 0.87252731, 0.872574  ]),
 'test_neg_mean_absolute_error': array([-44831.3348501 , -45124.14874169, -44898.58772542, -44999.11473888,
        -45015.39346511])}

In [20]:
# Get rounded scores stored in variables
lasso_train_base_r2_mean = round(scores['test_r2'].mean(), 2)
lasso_train_base_mae_mean = round(-(scores['test_neg_mean_absolute_error'].mean()),2)

# Print scores to assess
print("Training r2 score =", lasso_train_base_r2_mean)
print("Training Mean Absolute Error =", lasso_train_base_mae_mean)

Training r2 score = 0.87
Training Mean Absolute Error = 44973.72
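
Each Lasso fold took about 155 seconds, against under 3 seconds for Ridge. One plausible cause, which this notebook does not confirm, is that coordinate descent runs up to its iteration cap without meeting the tolerance. `max_iter` and `tol` are the `Lasso` parameters that control this, and the fitted attribute `n_iter_` reports how many iterations were used. The sketch below tries a few settings on synthetic data; the data and the settings are illustrative assumptions.

```python
import warnings

import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data on a price-like scale (assumed, not the HDB data)
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 20))
true_coef = np.zeros(20)
true_coef[:3] = [5e5, 2e5, -1e5]
y = X @ true_coef + rng.normal(0, 1e4, size=500)

results = []
with warnings.catch_warnings():
    # Silence possible convergence warnings so the comparison stays readable
    warnings.simplefilter("ignore")
    for max_iter, tol in [(1000, 1e-4), (10000, 1e-4), (1000, 1e-2)]:
        model = Lasso(alpha=1.0, max_iter=max_iter, tol=tol)
        model.fit(X, y)
        # n_iter_ equal to max_iter suggests the solver stopped at the cap
        results.append((max_iter, tol, model.n_iter_))

for max_iter, tol, n_iter in results:
    print(f"max_iter={max_iter}, tol={tol}: used {n_iter} iterations")
```

If `n_iter_` equals `max_iter` on the real data, raising `max_iter` or loosening `tol` are the usual first things to try.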


In [21]:
# Fitting model on data.
base_lasso_model.fit(X_train_processed, y_train)


In [22]:
# Predict y with fitted model
y_pred = base_lasso_model.predict(X_test_processed)

# results
lasso_test_base_mae_mean = round(mean_absolute_error(y_test, y_pred),2)
lasso_test_base_r2_mean = round(r2_score(y_test, y_pred),2)

print("Testing r2 score =", lasso_test_base_r2_mean)
print("Testing Mean Absolute Error =", lasso_test_base_mae_mean)

Testing r2 score = 0.87
Testing Mean Absolute Error = 44843.36


In [33]:
# To Get importance of features in a DF:
# Get feature names
feature_names = prepoc.get_feature_names_out()

# Get coefficients
coefficients = base_lasso_model.coef_

# Create empty Dictionary
feature_coefficients = {}

# For loop to print coefficient and put them into dict
for feature, coef in zip(feature_names, coefficients):
    #print(f"{feature}: {coef}")
    feature_coefficients[feature] = coef

# Include intercept in the dict
#print(f"intercept: {base_lasso_model.intercept_}")
feature_coefficients["intercept"] = base_lasso_model.intercept_

# Converting to DataFrame
feature_coefficients_df = pd.DataFrame(list(feature_coefficients.items()), columns=['Feature', 'Coefficient'])

# Sorting the DataFrame by the absolute values of the 'Coefficient' column
feature_coefficients_df = feature_coefficients_df.sort_values(by='Coefficient', key=abs, ascending=False)

# Show top 15
feature_coefficients_df.head(15)

| | Feature | Coefficient |
| -------- | ------- | ------- |
| 173 | `pipeline__floor_area_sqm` | 592130.552088 |
| 174 | `pipeline__lease_commence_date` | 289089.979247 |
| 181 | `pipeline__walking_time_mrt` | -249657.872353 |
| 96 | `onehotencoder__most_closest_mrt_CHANGI AIRPORT` | 247158.93645 |
| 175 | `pipeline__sold_year` | 242870.054373 |
| 71 | `onehotencoder__flat_model_Type S2` | 229152.704362 |
| 75 | `onehotencoder__most_closest_mrt_BEAUTY WORLD` | 218136.400559 |
| 111 | `onehotencoder__most_closest_mrt_HAVELOCK` | 216273.895735 |
| 69 | `onehotencoder__flat_model_Terrace` | 214587.452059 |
| 24 | `onehotencoder__town_YISHUN` | -207169.856198 |


#### Improving Model by Including Model Hyperparameter Tuning

In [37]:
# Instantiate base model
base_lasso_model = Lasso()

# Define space to search
params = {'alpha': loguniform(0.0001, 1)}

# Create the RandomizedSearchCV object
lasso_random_search = RandomizedSearchCV(estimator=base_lasso_model,
                                         param_distributions=params,
                                         n_iter=20,
                                         scoring='r2',
                                         cv=3,
                                         random_state=108,
                                         n_jobs=-1,
                                         verbose=5)

# Fit the training data
lasso_random_search.fit(X_train_processed, y_train)


Fitting 3 folds for each of 20 candidates, totalling 60 fits


[CV 3/3] END .......alpha=0.0001372620524176436;, score=0.873 total time= 6.1min
[CV 1/3] END .......alpha=0.0001372620524176436;, score=0.873 total time= 6.1min
[CV 1/3] END .........alpha=0.02091008322651608;, score=0.873 total time= 6.1min
[CV 1/3] END .......alpha=0.0008598932581217309;, score=0.873 total time= 6.2min
[CV 2/3] END .........alpha=0.02091008322651608;, score=0.873 total time= 6.2min
[CV 3/3] END .......alpha=0.0008598932581217309;, score=0.873 total time= 6.2min
[CV 2/3] END .......alpha=0.0008598932581217309;, score=0.873 total time= 6.2min
[CV 2/3] END .......alpha=0.0001372620524176436;, score=0.873 total time= 6.3min
[CV 2/3] END ......alpha=0.00032844083405203235;, score=0.873 total time= 5.9min
[CV 1/3] END ......alpha=0.00032844083405203235;, score=0.873 total time= 6.0min
[CV 3/3] END ......alpha=0.00032844083405203235;, score=0.873 total time= 6.0min
[CV 1/3] END .........alpha=0.05558605222328068;, score=0.873 total time= 6.0min
[CV 2/3] END .........alpha=0.05558605222328068;, score=0.873 total time= 6.0min
[CV 1/3] END ........alpha=0.054287490185850155;, score=0.873 total time= 6.0min
[CV 3/3] END .........alpha=0.02091008322651608;, score=0.873 total time= 6.2min
[CV 3/3] END .........alpha=0.05558605222328068;, score=0.873 total time= 6.1min
[CV 2/3] END ........alpha=0.054287490185850155;, score=0.873 total time= 6.2min
[CV 3/3] END ........alpha=0.054287490185850155;, score=0.873 total time= 6.2min
[CV 1/3] END .........alpha=0.31101681904367934;, score=0.873 total time= 6.3min
[CV 2/3] END .........alpha=0.31101681904367934;, score=0.873 total time= 6.3min
[CV 1/3] END ........alpha=0.000322002773092474;, score=0.873 total time= 6.3min
[CV 2/3] END ........alpha=0.000322002773092474;, score=0.873 total time= 6.3min
[CV 3/3] END ........alpha=0.000322002773092474;, score=0.873 total time= 6.3min
[CV 3/3] END .........alpha=0.31101681904367934;, score=0.873 total time= 6.4min
[CV 1/3] END ........alpha=0.022838857445567264;, score=0.873 total time= 6.2min
[CV 2/3] END ........alpha=0.022838857445567264;, score=0.873 total time= 6.3min
[CV 1/3] END ......alpha=0.00016225903940748885;, score=0.873 total time= 6.2min
[CV 3/3] END ........alpha=0.022838857445567264;, score=0.873 total time= 6.4min
[CV 2/3] END .......alpha=0.0019460118562019395;, score=0.873 total time= 6.1min
[CV 2/3] END ......alpha=0.00016225903940748885;, score=0.873 total time= 6.2min
[CV 3/3] END ......alpha=0.00016225903940748885;, score=0.873 total time= 6.2min
[CV 1/3] END .......alpha=0.0019460118562019395;, score=0.873 total time= 6.2min
[CV 3/3] END .......alpha=0.0019460118562019395;, score=0.873 total time= 6.2min
[CV 1/3] END .........alpha=0.00665186636943989;, score=0.873 total time= 6.1min
[CV 1/3] END ........alpha=0.014632928978691073;, score=0.873 total time= 6.2min
[CV 3/3] END .........alpha=0.00665186636943989;, score=0.873 total time= 6.3min
[CV 2/3] END .........alpha=0.00665186636943989;, score=0.873 total time= 6.4min
[CV 3/3] END ..........alpha=0.4167804486953717;, score=0.873 total time= 6.4min
[CV 2/3] END ..........alpha=0.4167804486953717;, score=0.873 total time= 6.4min
[CV 1/3] END ..........alpha=0.4167804486953717;, score=0.873 total time= 6.4min
[CV 2/3] END ........alpha=0.014632928978691073;, score=0.873 total time= 6.2min
[CV 3/3] END ........alpha=0.014632928978691073;, score=0.873 total time= 6.3min
[CV 1/3] END .......alpha=0.0001405395184708484;, score=0.873 total time= 6.2min
[CV 3/3] END .........alpha=0.06151766685538422;, score=0.873 total time= 6.2min
[CV 3/3] END .......alpha=0.0001405395184708484;, score=0.873 total time= 6.3min
[CV 2/3] END .........alpha=0.06151766685538422;, score=0.873 total time= 6.3min
[CV 2/3] END .......alpha=0.0001405395184708484;, score=0.873 total time= 6.3min
[CV 1/3] END .........alpha=0.06151766685538422;, score=0.873 total time= 6.3min
[CV 1/3] END ..........alpha=0.9455869126421402;, score=0.873 total time= 6.4min
[CV 2/3] END ..........alpha=0.9455869126421402;, score=0.873 total time= 6.5min
[CV 1/3] END .......alpha=0.0006305451995133826;, score=0.873 total time= 6.2min
[CV 2/3] END .......alpha=0.0006305451995133826;, score=0.873 total time= 6.1min
[CV 3/3] END .......alpha=0.0006305451995133826;, score=0.873 total time= 6.2min
[CV 2/3] END .........alpha=0.03683503719043096;, score=0.873 total time= 6.2min
[CV 1/3] END .........alpha=0.03683503719043096;, score=0.873 total time= 6.2min
[CV 3/3] END ..........alpha=0.9455869126421402;, score=0.873 total time= 6.4min
[CV 3/3] END .........alpha=0.03683503719043096;, score=0.873 total time= 3.1min
[CV 1/3] END .........alpha=0.09231744974910122;, score=0.873 total time= 3.0min
[CV 2/3] END .........alpha=0.09231744974910122;, score=0.873 total time= 2.9min
[CV 3/3] END .........alpha=0.09231744974910122;, score=0.873 total time= 2.9min
Best parameters found:  {'alpha': 0.0001372620524176436}
Best R2 score:  0.8731261452649233


In [41]:
# Get best parameters and model
best_parameters = lasso_random_search.best_params_
best_model = lasso_random_search.best_estimator_

print("Best parameters found: ", best_parameters)
print("Best R2 score: ", lasso_random_search.best_score_)

Best parameters found:  {'alpha': 0.0001372620524176436}
Best R2 score:  0.8731261452649233


In [38]:
# Predict y with fitted model
y_pred = lasso_random_search.best_estimator_.predict(X_test_processed)

# results
lasso_random_cv_base_mae_mean = round(mean_absolute_error(y_test, y_pred),2)
lasso_random_cv_base_r2_mean = round(r2_score(y_test, y_pred),2)

print("Testing r2 score =", lasso_random_cv_base_r2_mean)
print("Testing Mean Absolute Error =", lasso_random_cv_base_mae_mean)

Testing r2 score = 0.87
Testing Mean Absolute Error = 44813.37


##### Using Grid Search 

In [39]:
# Instantiate base model
base_lasso_model = Lasso()

# Define grid to search
params = {'alpha': np.linspace(0.0001, 1, 5)}  # 5 values from 0.0001 to 1

# Create the GridSearchCV object
lasso_grid_search = GridSearchCV(estimator=base_lasso_model,
                                 param_grid=params,
                                 scoring='r2',
                                 cv=3,
                                 n_jobs=-2,
                                 verbose=4)

# Fit the training data
lasso_grid_search.fit(X_train_processed, y_train)


Fitting 3 folds for each of 5 candidates, totalling 15 fits


[CV 3/3] END ......................alpha=0.0001;, score=0.873 total time= 5.7min
[CV 2/3] END ......................alpha=0.0001;, score=0.873 total time= 5.7min
[CV 1/3] END ....................alpha=0.250075;, score=0.873 total time= 5.7min
[CV 2/3] END ....................alpha=0.250075;, score=0.873 total time= 5.8min
[CV 1/3] END ......................alpha=0.0001;, score=0.873 total time= 5.8min
[CV 3/3] END ....................alpha=0.250075;, score=0.873 total time= 5.8min
[CV 1/3] END .....................alpha=0.50005;, score=0.873 total time= 5.9min
[CV 3/3] END .....................alpha=0.50005;, score=0.873 total time= 5.7min
[CV 2/3] END .....................alpha=0.50005;, score=0.873 total time= 5.8min
[CV 1/3] END ..........alpha=0.7500249999999999;, score=0.873 total time= 5.7min
[CV 3/3] END ..........alpha=0.7500249999999999;, score=0.873 total time= 5.7min
[CV 2/3] END ..........alpha=0.7500249999999999;, score=0.873 total time= 5.7min
[CV 2/3] END .........................alpha=1.0;, score=0.873 total time= 5.7min
[CV 1/3] END .........................alpha=1.0;, score=0.873 total time= 5.7min
[CV 3/3] END .........................alpha=1.0;, score=0.873 total time= 2.2min
Best parameters found:  {'alpha': 0.0001}
Best R2 score:  0.8731261520990775


In [42]:
# Get best parameters and model
best_parameters = lasso_grid_search.best_params_
best_model = lasso_grid_search.best_estimator_

print("Best parameters found: ", best_parameters)
print("Best R2 score: ", lasso_grid_search.best_score_)


Best parameters found:  {'alpha': 0.0001}
Best R2 score:  0.8731261520990775


In [40]:
# Predict y with fitted model
y_pred = lasso_grid_search.best_estimator_.predict(X_test_processed)

# results
lasso_grid_base_mae_mean = round(mean_absolute_error(y_test, y_pred),2)
lasso_grid_base_r2_mean = round(r2_score(y_test, y_pred),2)

print("Testing r2 score =", lasso_grid_base_r2_mean)
print("Testing Mean Absolute Error =", lasso_grid_base_mae_mean)

Testing r2 score = 0.87
Testing Mean Absolute Error = 44813.37


#### Summary of Results

In [45]:
# Base Ridge Model Test Scores
print("base ridge model mae score:", test_base_mae_mean)
print("base ridge model r2 score:", test_base_r2_mean)

# RandomCV Ridge Model Test
print("RandomizedSearchCV ridge mae score:", random_cv_mae_mean)
print("RandomizedSearchCV ridge r2 score:", random_cv_r2_mean)

# GridSearch Ridge Model Test
print("gridSearchCV ridge mae score:", grid_search_mae_mean)
print("gridSearchCV ridge r2 score:", grid_search_r2_mean)

# Base Lasso Model Test
print("base lasso mae score:", lasso_test_base_mae_mean)
print("base lasso r2 score:", lasso_test_base_r2_mean)

# RandomCV Lasso Model
print("RandomizedSearchCV lasso mae score:", lasso_random_cv_base_mae_mean)
print("RandomizedSearchCV lasso r2 score:", lasso_random_cv_base_r2_mean)

# Gridsearch Lasso Model
print("gridSearchCV lasso mae score:", lasso_grid_base_mae_mean)
print("gridSearchCV lasso r2 score:", lasso_grid_base_r2_mean)

base ridge model mae score: 44806.75
base ridge model r2 score: 0.87
RandomizedSearchCV ridge mae score: 44842.81
RandomizedSearchCV ridge r2 score: 0.87
gridSearchCV ridge mae score: 44804.05
gridSearchCV ridge r2 score: 0.87
base lasso mae score: 44843.36
base lasso r2 score: 0.87
RandomizedSearchCV lasso mae score: 44813.37
RandomizedSearchCV lasso r2 score: 0.87
gridSearchCV lasso mae score: 44813.37
gridSearchCV lasso r2 score: 0.87


**Comments on Results**
- Overall, all six models reach the same test R² score of 0.87, so they explain a comparable proportion of the variance in resale price.
- The MAE scores are also very close. The lowest MAE comes from the Ridge model optimized with GridSearchCV (44804.05) and the highest from the base Lasso model (44843.36), a gap of about 39, or less than 0.1% of the MAE.
- Because the differences are this small, hyperparameter tuning added little over the base models, which suggests that the strength of regularization is not what limits performance here.
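
To make the comparison above concrete, here is a minimal sketch that ranks the six models by test MAE and computes the gap between the best and the worst. The scores are copied by hand from the printed summary; the dictionary keys and variable names are only illustrative and are not part of the original notebook.

```python
# Test-set scores copied from the printed summary above: (MAE, R2).
results = {
    "base ridge": (44806.75, 0.87),
    "RandomizedSearchCV ridge": (44842.81, 0.87),
    "GridSearchCV ridge": (44804.05, 0.87),
    "base lasso": (44843.36, 0.87),
    "RandomizedSearchCV lasso": (44813.37, 0.87),
    "GridSearchCV lasso": (44813.37, 0.87),
}

# Rank the models from lowest to highest MAE.
ranked = sorted(results.items(), key=lambda item: item[1][0])
best_name, (best_mae, _) = ranked[0]
worst_name, (worst_mae, _) = ranked[-1]

# Absolute and relative gap between the best and the worst model.
mae_spread = worst_mae - best_mae
mae_spread_pct = 100 * mae_spread / best_mae

for name, (mae, r2) in ranked:
    print(f"{name:<26} MAE={mae:>10,.2f}  R2={r2:.2f}")
print(f"Best: {best_name}; spread = {mae_spread:.2f} ({mae_spread_pct:.2f}% of best MAE)")
```

A spread this small relative to the MAE itself is well within what a different train/test split could produce, so the ranking should be read as a tie rather than a clear winner.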

### Putting it Together in a Pipeline to Save

In [3]:
# Path to the processed data; update this if the notebook moves
file_path = '../data/processed/final_HDB_for_model.parquet.gzip'

# Read the parquet file into a DataFrame
df = pd.read_parquet(file_path)

# Put all columns to be deleted into a list
drop_cols = ['block', 'street_name','address','sold_year_month']

# Drop columns
df = df.drop(columns=drop_cols)

# Create lists of the categorical and numerical columns allowing them to be treated differently
cat_cols = df.select_dtypes(include=['object']).columns
num_cols_scale = ['floor_area_sqm',
 'lease_commence_date',
 'sold_year',
 'sold_remaining_lease',
 'max_floor_lvl',
 '5 year bond yields',
 'Unemployment Rate',
 'Lime, Cement, & Fabricated Construction Materials Excl Glass & Clay Materials',
 'walking_time_mrt',
 'ResidentPopulation_Growth_Rate']

# Create instances of OneHotEncoder
cat_transformer = OneHotEncoder(drop='first', handle_unknown='ignore')

# Create pipeline of two scalers for numeric data
num_transformer = make_pipeline(RobustScaler(), MinMaxScaler())

# Instantiate the model
ridge_model = Ridge(alpha=0.04081632653061224)

# Apply transformations to subsets of columns
prepoc = make_column_transformer(
    (cat_transformer, cat_cols),
    (num_transformer, num_cols_scale)
)

# Create final pipeline
ridge_pipe = make_pipeline(prepoc, ridge_model)

ridge_pipe


In [4]:
# Instantiate the model with the best alpha found by the Lasso grid search
lasso_model = Lasso(alpha=0.0001)

# Create final pipeline
lasso_pipe = make_pipeline(prepoc, lasso_model)

lasso_pipe

In [5]:
# Select target column
target_col = 'resale_price'

# Ready X and y
X = df.loc[:, ~df.columns.isin([target_col])]
y = df[target_col]

# Split the data, 80-20 split with a random state included for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 54)

# Fit model
ridge_pipe.fit(X_train, y_train)

In [8]:
# R2 score on testing set
r2_test = ridge_pipe.score(X_test, y_test)

# MAE for testing
y_pred = ridge_pipe.predict(X_test)
mae_test = mean_absolute_error(y_test, y_pred)

# Print scores to assess
print("Test r2 score =", round(r2_test, 4))
print("Test MAE score =", round(mae_test, 2))

Test r2 score = 0.8735
Test MAE score = 44804.05


In [6]:
lasso_pipe.fit(X_train, y_train)


In [9]:
# R2 score on testing set
r2_test = lasso_pipe.score(X_test, y_test)

# MAE for testing
y_pred = lasso_pipe.predict(X_test)
mae_test = mean_absolute_error(y_test, y_pred)

# Print scores to assess
print("Test r2 score =", round(r2_test, 4))
print("Test MAE score =", round(mae_test, 2))

Test r2 score = 0.8734
Test MAE score = 44813.37


In [10]:
# Save the fitted Ridge and Lasso pipelines
import pickle

with open('../models/ridge_240130.pkl', 'wb') as file:
    pickle.dump(ridge_pipe, file)

with open('../models/lasso_240130.pkl', 'wb') as file:
    pickle.dump(lasso_pipe, file)
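
Once a pipeline has been pickled, a downstream script can load it and call `predict` directly, since the preprocessing steps are stored with the model. The sketch below shows that round trip under stated assumptions: it uses a small stand-in pipeline fitted on synthetic data and a temporary file, so `demo_pipe`, `X_demo`, `y_demo` and `demo_ridge.pkl` are hypothetical names rather than the notebook's own objects or paths.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (hypothetical): 50 rows, 3 numeric features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# A small stand-in pipeline; the notebook's ridge_pipe would take its place.
demo_pipe = make_pipeline(MinMaxScaler(), Ridge(alpha=0.04))
demo_pipe.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_ridge.pkl"  # hypothetical file name

    # Save the fitted pipeline.
    with open(model_path, "wb") as file:
        pickle.dump(demo_pipe, file)

    # Load it back, as a downstream script would.
    with open(model_path, "rb") as file:
        loaded_pipe = pickle.load(file)

# The reloaded pipeline should reproduce the original predictions.
same_predictions = np.allclose(
    demo_pipe.predict(X_demo), loaded_pipe.predict(X_demo)
)
print("Predictions match after reload:", same_predictions)
```

Pickled scikit-learn objects are best loaded with the same scikit-learn version that saved them, and pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.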