### Predicting HDB Prices with Support Vector Regression
In this notebook, we use Support Vector Regression (SVR) to predict HDB resale prices. Having collected and wrangled the data and completed exploratory data analysis, we now turn to modelling.

#### Brief Recap on Support Vector Regression
Support Vector Regression (SVR) is a type of Support Vector Machine (SVM) used for regression tasks. It extends the concept of SVMs, typically used for classification, to handle continuous outputs.

SVR aims to find a function that approximates the relationship between input features and continuous target values. The goal is to have as many data points as possible within a specified margin (ε-tube) around the regression function while minimizing the model complexity.
- ε-Tube (Epsilon-Tube):
  - SVR introduces the concept of an ε-tube around the regression line. 
  - This tube defines a margin of tolerance where no penalty is given to errors (the distance between the predicted and actual values) that are within this margin.
  - The ε parameter defines the width of this tube and controls the sensitivity of the model to prediction errors. 
  - Smaller ε values give a narrower tube, so more points fall outside it and are penalized; larger values give a wider tube that tolerates bigger errors.
- Regularization (C Parameter):
  - The C parameter acts as a regularization parameter in SVR. It balances the trade-off between having a flat (simple) model and ensuring that as many data points as possible fall within the ε-tube.
  - A higher value of C puts more emphasis on minimizing the training error, while a lower value of C increases the regularization strength, leading to a simpler model that may generalize better.
- Kernels:
  - SVR can use different kernel functions (linear, polynomial, radial basis function, sigmoid). The non-linear kernels implicitly map the input data into a higher-dimensional space.
  - This transformation allows SVR to capture complex, non-linear relationships.
  - The choice of kernel and its parameters significantly affects the model's performance.
- Support Vectors:
  - Similar to SVMs for classification, SVR uses support vectors, which are the critical data points that lie on or outside the ε-tube. 
  - These points essentially define the regression function.
  - Points that fall strictly inside the tube incur no loss, so the fitted function does not depend on them.
- Advantages:
  - SVR can model complex, non-linear relationships, and with well-tuned C and ε it tends to resist overfitting, even in high-dimensional spaces.
  - It's effective in cases where the number of dimensions exceeds the number of samples.
- Challenges:
  - Choosing the right kernel and tuning the hyperparameters (C, ε, and kernel parameters) can be challenging and requires cross-validation.
  - SVR can be computationally intensive, especially for large datasets and with certain kernel types (like RBF).

SVR is a powerful tool for regression analysis, especially when the relationship between variables is complex or non-linear. Its ability to fit a model that is not overly sensitive to small fluctuations in the data (thanks to the ε-tube) makes it a robust choice for many real-world regression tasks.
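
The ε-tube idea from the recap can be made concrete with a minimal sketch of the ε-insensitive loss. The prices, predictions, and ε value below are made up for illustration and are not taken from the HDB data.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-sample SVR loss: zero inside the tube, linear outside it."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return np.maximum(0.0, residual - epsilon)

# Hypothetical resale prices and predictions (illustration only)
y_true = np.array([300_000.0, 420_000.0, 515_000.0])
y_pred = np.array([304_000.0, 400_000.0, 515_000.0])

# With a tube half-width of 5,000, only the second prediction is penalized
losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=5_000.0)
print(losses)
```

Only the prediction that misses by more than ε contributes to the loss, which is why points inside the tube do not become support vectors.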



#### Load Libraries

In [1]:
# Data Manipulation
import pandas as pd
import numpy as np

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Train Test Split
from sklearn.model_selection import train_test_split, cross_validate

# Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, RobustScaler
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer

# Modelling
from sklearn.svm import SVR
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from scipy.stats import uniform, loguniform

# Model Evaluation
from sklearn.metrics import mean_absolute_error, r2_score
import scipy.stats as stats

#### Load Data into DataFrame, Prepare Data for Pipeline, and Train Test Split
Since SVMs handle multicollinearity better than some other algorithms, such as linear regression, we will build a base model using all columns except those we explicitly exclude.

SVMs focus on finding the optimal hyperplane to fit the data points (for regression tasks), making them less sensitive to the presence of correlated features.

It's often worth experimenting with both the full feature set and a reduced set to see which performs better. We will first use our full set. That said, other preprocessing steps like feature scaling are typically more crucial for SVMs, given their reliance on distance calculations.
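
To illustrate why scaling matters for distance-based methods, here is a small sketch with two hypothetical features on very different scales. The feature names and values are assumptions for illustration, not columns confirmed to be in this dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical features: [floor_area_sqm, distance_to_mrt_m]
X_toy = np.array([
    [60.0, 200.0],
    [120.0, 250.0],
    [90.0, 1500.0],
])

def pairwise_distance(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Unscaled: the feature measured in metres dominates the distance
raw_d01 = pairwise_distance(X_toy[0], X_toy[1])  # differ mostly in floor area
raw_d02 = pairwise_distance(X_toy[0], X_toy[2])  # differ mostly in MRT distance

# Scaled to [0, 1]: both features contribute on a comparable footing
X_toy_scaled = MinMaxScaler().fit_transform(X_toy)
scaled_d01 = pairwise_distance(X_toy_scaled[0], X_toy_scaled[1])
scaled_d02 = pairwise_distance(X_toy_scaled[0], X_toy_scaled[2])

print(round(raw_d01, 1), round(raw_d02, 1))
print(round(scaled_d01, 3), round(scaled_d02, 3))
```

Before scaling, the second pair looks more than ten times farther apart purely because of units. After scaling, the two distances are of similar size.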

In [2]:
# Make file path variable so that all we need is to change this if we move notebook location
file_path = '../data/processed/final_HDB_for_model.parquet.gzip'

# Read the parquet file into a DataFrame
df = pd.read_parquet(file_path)
#df = df.sample(50000)

# Put all columns to be deleted into a list
drop_cols = ['block', 'street_name','address','sold_year_month']

# Drop columns
df = df.drop(columns=drop_cols)

In [3]:
# Create lists of the categorical and numerical columns allowing them to be treated differently
cat_cols = df.select_dtypes(include=['object']).columns
num_cols = df.select_dtypes(include=['int64', 'float64']).columns

# Create new list of numeric columns, removing resale_price from columns to scale
num_cols_scale = list(num_cols)
num_cols_scale.remove('resale_price')

In [4]:
# Select target column
target_col = 'resale_price'

# Ready X and y
X = df.loc[:, ~df.columns.isin([target_col])]
y = df[target_col]

# Split the data, 80-20 split with a random state included for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 54)

#### Create Preprocessing Pipeline

In [5]:
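# Create an instance of OneHotEncoder
# Note: combining drop='first' with handle_unknown='ignore' needs scikit-learn >= 1.0;
# unseen categories are then encoded as all zeros, the same as the dropped first category
cat_transformer = OneHotEncoder(drop='first', handle_unknown='ignore')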

# Create pipeline of two scalers for numeric data
num_transformer = make_pipeline(RobustScaler(), MinMaxScaler())

# Create a column transformer that applies each transformation to its subset of columns
prepoc = make_column_transformer(
    (cat_transformer, cat_cols),
    (num_transformer, num_cols_scale),
    remainder = 'passthrough'
)

# View Pipeline
prepoc
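
Because scalers and encoders learn their parameters from whatever data they are fit on, one common arrangement is to chain the preprocessor and the estimator into a single pipeline, so that cross-validation refits the preprocessing on each training fold. The sketch below shows that alternative on a made-up toy DataFrame. It is not what this notebook does, and the column names, price scale (thousands), and `C` value are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.svm import SVR

# Toy data with hypothetical columns; prices are in thousands for illustration
rng = np.random.default_rng(0)
n = 120
toy = pd.DataFrame({
    "town": rng.choice(["A", "B", "C"], size=n),
    "floor_area_sqm": rng.uniform(40, 140, size=n),
})
toy_y = 3.0 * toy["floor_area_sqm"] + rng.normal(0, 5, size=n)

# Preprocessing and model chained together
toy_preproc = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["town"]),
    (MinMaxScaler(), ["floor_area_sqm"]),
)
toy_model = make_pipeline(toy_preproc, SVR(kernel="linear", C=100))

# Each fold fits the encoder and scaler on its own training portion only
toy_scores = cross_validate(toy_model, toy, toy_y, cv=3,
                            scoring=["r2", "neg_mean_absolute_error"])
print(sorted(toy_scores))
```

The trade-off is extra preprocessing work per fold in exchange for validation scores that are not influenced by the held-out rows.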

In [6]:
# Fit the preprocessor on the training features, then apply it to the test features (y is left unscaled)
X_train_processed = prepoc.fit_transform(X_train)
X_test_processed = prepoc.transform(X_test)

# Check to see if it worked
print("Number of columns originally:", X.shape[1])
print("Number of columns after preprocessing:",X_train_processed.shape[1])

Number of columns originally: 22
Number of columns after preprocessing: 190


### Creating a Basic SVR Model & Evaluation 

In [7]:
# Instantiate the model with a linear kernel
base_svr_model = SVR(kernel='linear', C=10000)

# Define multiple scoring metrics
scoring = ['r2', 'neg_mean_absolute_error']

# Get the cross validation scores
scores = cross_validate(base_svr_model, X_train_processed, y_train, cv=5, n_jobs= -1,
                        scoring=scoring, return_train_score=False, verbose = 1)

# View dictionary
scores

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


KeyboardInterrupt: 
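
The interrupted run above is consistent with how kernel-based SVR scales: the scikit-learn documentation notes that its fit time grows more than quadratically with the number of samples and points to `LinearSVR` or `SGDRegressor` for large datasets. The sketch below swaps in `LinearSVR` on synthetic data as one possible alternative. It is not the model used in this notebook, and it is not identical to `SVR(kernel='linear')` (different solver, the intercept is regularized, and `epsilon` defaults to 0). The data and parameter values are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_validate
from sklearn.svm import LinearSVR

# Synthetic stand-in for the processed training data (illustration only)
X_demo, y_demo = make_regression(n_samples=2_000, n_features=20,
                                 noise=10.0, random_state=0)

# Linear SVR solved with liblinear, which scales better with sample count
linear_svr = LinearSVR(C=10.0, epsilon=0.0, max_iter=10_000, random_state=0)

demo_scores = cross_validate(linear_svr, X_demo, y_demo, cv=3,
                             scoring=["r2", "neg_mean_absolute_error"])

# Flip the sign of the negated MAE so it reads as a positive error
demo_r2 = demo_scores["test_r2"].mean()
demo_mae = -demo_scores["test_neg_mean_absolute_error"].mean()
print(round(demo_r2, 2), round(demo_mae, 2))
```

Another option, already hinted at by the commented-out `df.sample(50000)` line earlier, is to tune on a random subsample and refit on more data once the hyperparameters are settled.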

In [None]:
# Mean of the 5 validation-fold scores from cross-validation on the training set, rounded
train_base_r2_mean = round(scores['test_r2'].mean(), 2)
train_base_mae_mean = round(-(scores['test_neg_mean_absolute_error'].mean()),2)

# Print scores to assess
print("Cross-validated r2 score (training set) =", train_base_r2_mean)
print("Cross-validated Mean Absolute Error (training set) =", train_base_mae_mean)

In [None]:
# # Instantiate the model with a linear kernel
# base_svr_model = SVR(kernel='linear')

# # Define the hyperparameter grid to search
# # Previously tried 'C': [0.1, 1, 10, 100] and 'epsilon': [0.1, 0.2];
# # best results were with C = 100, epsilon = 0.1
# param_grid = {
#     'C': [2000, 3000, 4000]
# }

# # Create the GridSearchCV object
# grid_search = GridSearchCV(estimator=base_svr_model, param_grid=param_grid,
#                            scoring='neg_mean_absolute_error', cv=3, n_jobs=-1, verbose=5)

# # Fit GridSearchCV
# grid_search.fit(X_train_processed, y_train)

# # Print the best parameters and the corresponding score
# print("Best parameters found: ", grid_search.best_params_)
# print("Best score (negative mean absolute error): ", grid_search.best_score_)

# # R2 of the refit best model, scored on the full training set (not a cross-validated score)
# best_r2_score = grid_search.best_estimator_.score(X_train_processed, y_train)
# print("R2 score of the best model: ", best_r2_score)

Fitting 3 folds for each of 3 candidates, totalling 9 fits
[CV 3/3] END .......................C=3000;, score=-41804.971 total time= 1.3min
[CV 2/3] END .......................C=2000;, score=-42888.274 total time= 1.3min
[CV 3/3] END .......................C=2000;, score=-42652.575 total time= 1.3min
[CV 1/3] END .......................C=2000;, score=-42569.678 total time= 1.3min
[CV 1/3] END .......................C=3000;, score=-41715.728 total time= 1.3min
[CV 1/3] END .......................C=4000;, score=-41149.834 total time= 1.3min
[CV 2/3] END .......................C=3000;, score=-42001.597 total time= 1.4min
[CV 2/3] END .......................C=4000;, score=-41393.683 total time= 1.4min
[CV 3/3] END .......................C=4000;, score=-41211.930 total time=  48.9s
Best parameters found:  {'C': 4000}
Best score (negative mean absolute error):  -41251.81566256274
R2 score of the best model:  0.8843923535734847
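
In the log above, the best C sits at the upper edge of the grid, which suggests the search range may be worth widening. One way to explore a wide range is to sample C from a log-uniform distribution and then read the per-candidate summary from `cv_results_`. The sketch below does this on synthetic data. The range, `n_iter`, and data are assumptions chosen to keep the example fast, not recommended settings for the HDB data.

```python
import pandas as pd
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVR

# Synthetic stand-in data (illustration only)
X_search, y_search = make_regression(n_samples=200, n_features=5,
                                     noise=5.0, random_state=0)

# C is drawn from a continuous log-uniform distribution over [0.1, 100]
search = RandomizedSearchCV(
    estimator=SVR(kernel="linear"),
    param_distributions={"C": loguniform(1e-1, 1e2)},
    n_iter=5,
    scoring="neg_mean_absolute_error",
    cv=3,
    random_state=0,
)
search.fit(X_search, y_search)

# One row per sampled candidate, best (rank 1) first
results = (
    pd.DataFrame(search.cv_results_)
    [["param_C", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
print(results)
```

The `mean_test_score` column gives the fold-averaged score per candidate, which is easier to compare than the interleaved per-fold log lines.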


In [None]:
# # Instantiate the model with a linear kernel
# base_svr_model = SVR(kernel='linear')

# # Define the hyperparameter distribution to search
# param_distributions = {
#     'C': np.linspace(1000, 10000, 10)  # 10 evenly spaced values from 1000 to 10000 (a discrete list, not a continuous distribution)
# }

# # Create the RandomizedSearchCV object
# random_search = RandomizedSearchCV(estimator=base_svr_model,
#                                    param_distributions=param_distributions,
#                                    n_iter=10,  # Number of parameter settings sampled
#                                    scoring='neg_mean_absolute_error',
#                                    cv=3, n_jobs=-1, verbose=5)

# # Fit RandomizedSearchCV
# random_search.fit(X_train_processed, y_train)

# # Print the best parameters and the corresponding score
# print("Best parameters found: ", random_search.best_params_)
# print("Best score (negative mean absolute error): ", random_search.best_score_)

# # R2 of the refit best model, scored on the full training set (not a cross-validated score)
# best_r2_score = random_search.best_estimator_.score(X_train_processed, y_train)
# print("R2 score of the best model: ", best_r2_score)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV 2/3] END .....................C=1000.0;, score=-44273.962 total time= 1.4min
[CV 3/3] END .....................C=2000.0;, score=-42652.575 total time= 1.4min
[CV 3/3] END .....................C=1000.0;, score=-44104.912 total time= 1.4min
[CV 1/3] END .....................C=2000.0;, score=-42569.678 total time= 1.4min
[CV 1/3] END .....................C=1000.0;, score=-44033.626 total time= 1.4min
[CV 2/3] END .....................C=3000.0;, score=-42001.597 total time= 1.4min
[CV 2/3] END .....................C=2000.0;, score=-42888.274 total time= 1.4min
[CV 1/3] END .....................C=3000.0;, score=-41715.728 total time= 1.4min
[CV 3/3] END .....................C=3000.0;, score=-41804.971 total time= 1.4min
[CV 2/3] END .....................C=5000.0;, score=-40967.187 total time= 1.4min
[CV 3/3] END .....................C=4000.0;, score=-41211.930 total time= 1.4min
[CV 1/3] END .....................C=4000.0;, sco