# Assignment 3: Support Vector Machines
Authors: Naomi Buell and Richie Rivera

## *Instructions:*

Perform an analysis of the dataset(s) used in Homework #2 using the SVM algorithm. Compare the results with the results from previous homework.

- Read the following articles:
    - https://www.hindawi.com/journals/complexity/2021/5550344/
    - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8137961/
- Search for academic content (at least 3 articles) that compare the use of decision trees vs SVMs in your current area of expertise.
- Perform an analysis of the dataset used in Homework #2 using the SVM algorithm.
- Compare the results with the results from previous homework.
- Answer questions, such as:
    - Which algorithm is recommended to get more accurate results?
    - Is it better for classification or regression scenarios?
    - Do you agree with the recommendations?
    - Why?

## Literature Review

As background for this assignment, we reviewed the Ahmad et al. (2021) and Guhathakurata et al. (2021) articles on predicting COVID-19 cases using machine learning (ML) algorithms. These studies use different data sources and features to make these predictions: Ahmad et al. use COVID-19 polymerase chain reaction (PCR) lab test data from a hospital in Brazil, while Guhathakurata et al. created a dataset of individuals' COVID-19 symptoms and comorbidities to predict cases. Ahmad et al. (2021) test a single decision tree, random forest, bagging, XGBoost, AdaBoost, balanced random forest (RUS), and combinations of bagging, SMOTE oversampling, and RUS with these algorithms. Notably, SVMs were not tested in this article. Guhathakurata et al. (2021) use the supervised ML algorithms K-Nearest Neighbor (KNN), Naive Bayes, random forest, AdaBoost, binary tree, and SVM. Ahmad et al. (2021) compare algorithms in terms of accuracy, precision, recall, F1-measure, AUROC, and AUPRC. Guhathakurata et al. (2021) use fewer metrics for comparison: just precision, recall, F1-score, and support. Guhathakurata et al. (2021) find that SVMs outperform the other algorithms tested, while Ahmad et al. (2021) find that decision tree ensembles developed for imbalanced datasets perform best. We will borrow the use of precision, recall, and F1 from both articles to compare the SVM results with the decision tree, random forest, and AdaBoost results from Homework #2.

We found the following 3 articles that compare the use of decision trees and SVM in our current areas of expertise: public health and data science.

1. [Machine Learning Techniques in Chronic Kidney Diseases: A Comparative Study of Classification Model Performance (Phuong, 2025)](https://pubmed.ncbi.nlm.nih.gov/40735336/). As a public health researcher, Naomi has worked on a simulation model of chronic kidney disease (CKD) progression in a population. This article addresses the same problem, predicting CKD, but does so using several ML techniques: random forest, SVM, Naive Bayes, logistic regression, KNN, and XGBoost. The data in this study comes from the UC Irvine ML Repository, while the model Naomi worked on uses population data from NHANES (National Health and Nutrition Examination Survey). Similar to Naomi's work, the model predicts CKD based on potassium levels, urea, blood pressure, and other features related to kidney function. This article finds that random forest, XGBoost, SVM, and logistic regression performed best (with an accuracy of 100%!), followed by Naive Bayes (97%) and KNN (93%).

2. [Machine learning prediction of dropping out of outpatients with alcohol use disorders (Park, 2021)](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0255626). This study developed an ML algorithm to predict the risk of patients dropping out of outpatient treatment for alcohol use disorder (AUD). Naomi has experience working with substance use disorder and mental health data in her role as a public health researcher, so this study was relevant to her expertise. Data on patients with AUD were obtained from three hospitals within a medical center in South Korea, measuring whether patients continued to follow up or dropped out of treatment. The researchers implemented six ML models: logistic regression, SVM, KNN, random forest, neural network, and AdaBoost. They compared AUROC, accuracy, sensitivity, and specificity across all models, and the AdaBoost model was determined to be the best.

3.

>Richie to add a third article from his field of expertise!

In [91]:
# Load libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn import tree
from sklearn.metrics import accuracy_score, recall_score, classification_report
from imblearn.under_sampling import RandomUnderSampler
from sklearn.tree import plot_tree
from sklearn.svm import SVC
from sklearn.preprocessing import RobustScaler
from collections import Counter
import numpy as np
import zipfile
import io
import requests
import seaborn as sns

## Performing SVM

### Data Preparation

First, we import libraries and load the Portuguese banking dataset.

In [92]:
# Download the zip file from the internet
url = "https://archive.ics.uci.edu/static/public/222/bank+marketing.zip"
response = requests.get(url)

# Extract bank-additional.zip from the downloaded zip
with zipfile.ZipFile(io.BytesIO(response.content)) as z:
    with z.open('bank-additional.zip') as additional_zip_file:
        with zipfile.ZipFile(additional_zip_file) as additional_zip:
            # Extract bank-additional-full.csv from bank-additional.zip
            with additional_zip.open('bank-additional/bank-additional-full.csv') as csvfile:
                df = pd.read_csv(csvfile, sep=';')

df.head()

| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |


#### Handle Missing Values

There are some observations set as "unknown" in several categorical columns. Additionally, the `pdays` column has a value of 999 which means the client was not previously contacted. We will treat these as missing values.

In [93]:
# Replace 'unknown' with np.nan in all object (categorical) columns
df_nas = df.copy()
df_nas = df_nas.replace('unknown', np.nan)
# Replace the 999 sentinel with np.nan in 'pdays' only; a frame-wide replace
# would also blank legitimate 999 values in other numeric columns such as 'duration'
df_nas['pdays'] = df_nas['pdays'].replace(999, np.nan)

# Check for columns that now contain NA values
na_counts = df_nas.isna().sum()
na_counts = na_counts[na_counts > 0].sort_values(ascending=False)
na_perc = (na_counts / len(df_nas)) * 100
print("Columns with NA values and their % missing:")
print(na_perc.round(2))

Columns with NA values and their % missing:
pdays        96.32
default      20.87
education     4.20
housing       2.40
loan          2.40
job           0.80
marital       0.19
dtype: float64


`pdays` is now 96% missing, reflecting the clients who were not previously contacted. `default` (whether or not the client has credit in default) has significant missingness. `education` (education level), `housing` (whether the client has a housing loan), `loan` (whether the client has a personal loan), `job` (type of job), and `marital` (marital status) have some minor missingness. We will address this missingness next.

We drop `default` since it is 21% missing and there is little risk of losing valuable information that our classifier needs to discriminate between classes: about 79% of clients are not in default, <1% are, and 21% are unknown. We also drop `pdays` (number of days since the client was last contacted in a previous campaign) since it is 96% missing and was determined to be not useful for prediction according to the IV. For the other columns with minor missingness (`education`, `housing`, `loan`, `job`, and `marital`), we drop rows with missing data. Since the missingness is low, we will not lose much information by dropping these rows. We also drop the `duration` column, the last contact duration in seconds, because this attribute highly affects the output target (e.g., if duration=0 then y="no"). Per the bank data notes, this input should be discarded for our predictive model.

In [94]:
# Save a new dataframe for preprocessing
df_dropped = df_nas.copy()

# Remove `default` and `pdays` columns due to high missingness. Drop duration per data notes.
df_dropped = df_dropped.drop(columns=['default', 'pdays', 'duration'])

# Drop rows with missing values in other columns
df_dropped = df_dropped.dropna()

# Display ns before and after dropping missing values
print(f"Original dataset shape: {df.shape}")
print(f"Cleaned dataset shape: {df_dropped.shape}")

Original dataset shape: (41188, 21)
Cleaned dataset shape: (38245, 18)


#### Handling Duplicate rows

In [95]:
df_duplicates = df_dropped[df_dropped.duplicated()]

print(f"There are {df_duplicates.shape[0]:,} duplicate entries\nThis is {df_duplicates.shape[0] / df_dropped.shape[0]:.1%} of the total")

There are 1,946 duplicate entries
This is 5.1% of the total


We can handle these duplicates by dropping them.

In [96]:
df_deduped = df_dropped.drop_duplicates()

print(f"Number of remaining entries: {df_deduped.shape[0]:,}")

Number of remaining entries: 36,299


After removing rows with missing values and dropping the duplicates, we still have sufficient data (36,299 rows) to train the SVM as well as the decision tree, random forest, and AdaBoost models from the previous assignment.
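
As a quick arithmetic check on the cleaning steps, the sketch below recomputes how much data each step removes. It uses only the row counts printed in the outputs above; the variable names are illustrative and not part of the notebook.

```python
# Row counts copied from the printed outputs above
n_original = 41_188    # raw dataset
n_after_na = 38_245    # after dropping rows with missing values
n_duplicates = 1_946   # rows flagged by DataFrame.duplicated()

n_after_dedup = n_after_na - n_duplicates
pct_lost_na = 100 * (n_original - n_after_na) / n_original
pct_lost_dup = 100 * n_duplicates / n_after_na
pct_retained = 100 * n_after_dedup / n_original

print(f"Rows lost to missing values: {n_original - n_after_na:,} ({pct_lost_na:.1f}%)")
print(f"Rows lost to duplicates:     {n_duplicates:,} ({pct_lost_dup:.1f}%)")
print(f"Rows retained overall:       {n_after_dedup:,} ({pct_retained:.1f}%)")
```

Roughly 88% of the original rows survive both steps, which supports the claim that little information is lost.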

#### Handle Categorical Features

Next, we map ordinal categorical variables to numeric values. In this case, we map the education levels to estimated number of years of schooling. We also map the days of the week and months to their respective orderings.

In [97]:
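# NOTE: this continues from df_dropped rather than df_deduped, so the duplicate
# rows identified above are still present downstream (the train/test split below
# totals 26,771 + 11,474 = 38,245 rows). Use df_deduped.copy() here to apply the
# de-duplication; doing so will change the reported metrics.
df_fe = df_dropped.copy()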

# Map education levels to numeric values
education_mapping = {
    'illiterate': 0,                # 0 years
    'basic.4y': 4,                  # 4 years
    'basic.6y': 6,                  # 6 years
    'basic.9y': 9,                  # 9 years
    'high.school': 12,              # 12 years (typical for high school)
    'professional.course': 14,      # 14 years (post-secondary/professional)
    'university.degree': 16         # 16 years (bachelor's degree)
}
df_fe['education'] = df_fe['education'].map(education_mapping)

# Encode month as integer
month_map = {'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
             'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}
df_fe['month'] = df_fe['month'].map(month_map)

# Encode day as integer
day_map = {'sun': 1, 'mon': 2, 'tue': 3, 'wed': 4, 'thu': 5, 'fri': 6, 'sat': 7}
df_fe['day_of_week'] = df_fe['day_of_week'].map(day_map)

We also one-hot encode the remaining nominal categorical variables (dropping the first level of each) before we feed the data to the machine learning algorithms.

In [98]:
# One-hot encode nominal categorical variables
df_dummies = df_fe.copy()
df_dummies = pd.get_dummies(df_dummies, columns=['job', 'marital', 'housing', 'loan', 'contact', 'poutcome', 'y'], drop_first=True)
df_dummies = df_dummies.rename(columns={'y_yes': 'y'})

df_dummies.head()

| | age | education | month | day_of_week | campaign | previous | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | ... | job_technician | job_unemployed | marital_married | marital_single | housing_yes | loan_yes | contact_telephone | poutcome_nonexistent | poutcome_success | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | 4 | 5 | 2 | 1 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | ... | False | False | True | False | False | False | True | True | False | False |
| 1 | 57 | 12 | 5 | 2 | 1 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | ... | False | False | True | False | False | False | True | True | False | False |
| 2 | 37 | 12 | 5 | 2 | 1 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | ... | False | False | True | False | True | False | True | True | False | False |
| 3 | 40 | 6 | 5 | 2 | 1 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | ... | False | False | True | False | False | False | True | True | False | False |
| 4 | 56 | 12 | 5 | 2 | 1 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | ... | False | False | True | False | False | True | True | True | False | False |


#### Standardize Features

Finally, we scale the numeric features. Although the tree-based models from the previous assignment do not require scaling, this is important for SVMs since they are sensitive to the scale of the input features. We use `RobustScaler()`, which centers each feature on its median and scales it by its interquartile range rather than standardizing to mean 0 and standard deviation 1, to reduce the influence of outliers.

In [99]:
# Identify numerical columns
num_cols = df_dummies.select_dtypes(include=[np.number]).columns
df_standardized = df_dummies.copy()
scaler = RobustScaler()

# Fit and transform the numerical columns
df_standardized[num_cols] = scaler.fit_transform(df_standardized[num_cols])

df_standardized.head()

| | age | education | month | day_of_week | campaign | previous | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | ... | job_technician | job_unemployed | marital_married | marital_single | housing_yes | loan_yes | contact_telephone | poutcome_nonexistent | poutcome_success | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.2 | -1.142857 | -0.333333 | -1.0 | -0.5 | 0.0 | 0.0 | 0.598477 | 0.857143 | 0.0 | ... | False | False | True | False | False | False | True | True | False | False |
| 1 | 1.266667 | 0.0 | -0.333333 | -1.0 | -0.5 | 0.0 | 0.0 | 0.598477 | 0.857143 | 0.0 | ... | False | False | True | False | False | False | True | True | False | False |
| 2 | -0.066667 | 0.0 | -0.333333 | -1.0 | -0.5 | 0.0 | 0.0 | 0.598477 | 0.857143 | 0.0 | ... | False | False | True | False | True | False | True | True | False | False |
| 3 | 0.133333 | -0.857143 | -0.333333 | -1.0 | -0.5 | 0.0 | 0.0 | 0.598477 | 0.857143 | 0.0 | ... | False | False | True | False | False | False | True | True | False | False |
| 4 | 1.2 | 0.0 | -0.333333 | -1.0 | -0.5 | 0.0 | 0.0 | 0.598477 | 0.857143 | 0.0 | ... | False | False | True | False | False | True | True | True | False | False |


Our data is ready for us to train our models.
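
To make the scaling step concrete, the sketch below reproduces the scaled `age` values shown in the table above using the formula `RobustScaler` applies by default, (x - median) / IQR. The median (38) and interquartile range (15) are assumptions inferred from the displayed raw and scaled values; the notebook does not print them. The helper `robust_scale` is hypothetical.

```python
def robust_scale(value, center, iqr):
    """Center on the median and divide by the interquartile range."""
    return (value - center) / iqr

# Assumed statistics for `age`, inferred from the outputs above (not printed by the notebook)
AGE_MEDIAN = 38.0
AGE_IQR = 15.0

raw_ages = [56, 57, 37, 40, 56]  # first five rows of the unscaled data
scaled_ages = [round(robust_scale(a, AGE_MEDIAN, AGE_IQR), 6) for a in raw_ages]
print(scaled_ages)
```

The results match the first five scaled `age` values in the table, which is consistent with median/IQR scaling rather than mean/standard-deviation standardization.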

### Create SVM Model

In this section, we train and validate the SVM model and tune hyper-parameters. For all models, we split the data into 70% for training and 30% for testing. We will start with a linear kernel and the regularization parameter C=1.0 (default).

In [100]:
# Separate features and target
X = df_standardized.drop(columns=['y'])
y = df_standardized['y']

# Randomly split the X and y arrays into 30 percent test data and 70 percent training data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Create and train the SVM model
svm_linear = SVC(kernel='linear', C=1.0, random_state=1)
svm_linear.fit(X_train, y_train)

# Display a detailed classification report
y_pred_svm = svm_linear.predict(X_test) # Predictions on test set
print("\nClassification Report on Test Set:")
print(classification_report(y_test, y_pred_svm))

# Accuracy on training set
y_train_pred = svm_linear.predict(X_train) # Predictions on training set
train_acc = accuracy_score(y_train, y_train_pred)
print('Training Accuracy: %.2f%%' % (train_acc * 100))

# Accuracy on test set
test_acc = accuracy_score(y_test, y_pred_svm)
print('Test Accuracy: %.2f%%' % (test_acc * 100))

# Recall on the test set
test_recall_svm = recall_score(y_test, y_pred_svm)
print('Final Model Recall on Test Set:  %.2f%%' % (test_recall_svm * 100))


Classification Report on Test Set:
              precision    recall  f1-score   support

       False       0.91      0.99      0.94     10197
        True       0.63      0.19      0.30      1277

    accuracy                           0.90     11474
   macro avg       0.77      0.59      0.62     11474
weighted avg       0.88      0.90      0.87     11474

Training Accuracy: 89.83%
Test Accuracy: 89.78%
Final Model Recall on Test Set:  19.26%


The linear SVM predicts bank subscriptions with 89.78% accuracy on the test set. Training (89.83%) and test accuracies are similar, which does not indicate overfitting. The F1 score for the positive class (subscribed) is 0.30, held down by a recall of only 19.26%.

We will also explore using the RBF kernel (default).

In [101]:
# Create and train the SVM model
svm_rbf = SVC(kernel='rbf', C=1.0, random_state=1)
svm_rbf.fit(X_train, y_train)

# Display a more detailed classification report
y_train_pred = svm_rbf.predict(X_train) # Predictions on training set
y_pred_svm = svm_rbf.predict(X_test) # Predictions on test set
print("\nClassification Report on Test Set:")
print(classification_report(y_test, y_pred_svm))

# Accuracy on training set
train_acc = accuracy_score(y_train, y_train_pred)
print('Training Accuracy: %.2f%%' % (train_acc * 100))

# Accuracy on test set
test_acc = accuracy_score(y_test, y_pred_svm)
print('Test Accuracy: %.2f%%' % (test_acc * 100))

# Calculate final recall on the test set
test_recall_svm = recall_score(y_test, y_pred_svm)
print('Final Model Recall on Test Set:  %.2f%%' % (test_recall_svm * 100))


Classification Report on Test Set:
              precision    recall  f1-score   support

       False       0.91      0.98      0.94     10197
        True       0.63      0.21      0.32      1277

    accuracy                           0.90     11474
   macro avg       0.77      0.60      0.63     11474
weighted avg       0.88      0.90      0.88     11474

Training Accuracy: 90.51%
Test Accuracy: 89.82%
Final Model Recall on Test Set:  21.30%


The test accuracy is slightly improved with the RBF kernel (from 89.78% to 89.82%) and the F1 score is also improved (from 0.30 to 0.32). 
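
For reference, the F1 score reported for each model is the harmonic mean of precision and recall. As a worked check, plugging the rounded positive-class precision (0.63) and recall (0.21) from the RBF report into the formula gives a value close to the reported 0.32; the small gap comes from using rounded inputs.

```latex
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
    = \frac{2 \cdot 0.63 \cdot 0.21}{0.63 + 0.21}
    = \frac{0.2646}{0.84}
    \approx 0.315
```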

Given these improved results, we will proceed with hyper-parameter tuning to see if we can improve performance further.

In [102]:
# Tune hyper-parameters 
param_grid = {
    'C': [0.1, 1, 10],
    'gamma': ['scale', 'auto', 0.01, 0.001],
}
grid_search_svm = GridSearchCV(
    estimator=svm_rbf,
    param_grid=param_grid,
    scoring='recall',
    cv=5,
    n_jobs=-1
)

# Train the model
grid_search_svm.fit(X_train, y_train)

In [103]:
# Display best parameters
print("Best Hyper-parameters:", grid_search_svm.best_params_)

# Display a more detailed classification report
y_pred_svm = grid_search_svm.predict(X_test) # Predictions on test set
print("\nClassification Report on Test Set:")
print(classification_report(y_test, y_pred_svm))

# Accuracy on training set
y_train_pred = grid_search_svm.predict(X_train) # Predictions on training set
train_acc = accuracy_score(y_train, y_train_pred)
print('Training Accuracy: %.2f%%' % (train_acc * 100))

# Accuracy on test set
test_acc = accuracy_score(y_test, y_pred_svm)
print('Test Accuracy: %.2f%%' % (test_acc * 100))

# Calculate final recall on the test set
test_recall_svm = recall_score(y_test, y_pred_svm)
print('Final Model Recall on Test Set:  %.2f%%' % (test_recall_svm * 100))
print('Best Cross-Validation Recall Score:  %.2f%%' % (grid_search_svm.best_score_ * 100))

Best Hyper-parameters: {'C': 10, 'gamma': 'scale'}

Classification Report on Test Set:
              precision    recall  f1-score   support

       False       0.91      0.98      0.94     10197
        True       0.60      0.24      0.34      1277

    accuracy                           0.90     11474
   macro avg       0.75      0.61      0.64     11474
weighted avg       0.88      0.90      0.88     11474

Training Accuracy: 91.76%
Test Accuracy: 89.73%
Final Model Recall on Test Set:  24.12%
Best Cross-Validation Recall Score:  24.52%


The hyper-parameter tuning yielded an accuracy of 89.73%, which is a slight decrease from the previous RBF model. However, since the grid search was optimized for recall, we see an improvement in recall from 21.3% to 24.12%, meaning the model can identify more of the positive class (subscribed) instances. The F1 score also improved slightly from 0.32 to 0.34.
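
In the notebook, the scaler is fit on the full dataset before the train/test split. One alternative, sketched below, is to place the scaler inside a scikit-learn `Pipeline` so that each cross-validation fold fits it on its own training portion only. This is an illustrative sketch on synthetic data from `make_classification`, not the bank dataset, and the names `pipe`, `param_grid_demo`, and `search` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (an assumption for illustration only)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.85, 0.15], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

# Scaling happens inside the pipeline, so it is re-fit within every CV fold
pipe = Pipeline([
    ("scale", RobustScaler()),
    ("svm", SVC(kernel="rbf", random_state=1)),
])

# Pipeline parameters are addressed as <step name>__<parameter>
param_grid_demo = {
    "svm__C": [0.1, 1, 10],
    "svm__gamma": ["scale", 0.01],
}
search = GridSearchCV(pipe, param_grid_demo, scoring="recall", cv=5)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Best CV recall: %.3f" % search.best_score_)
```

With this layout, the held-out test set never influences the scaling statistics, at the cost of re-fitting the scaler for every fold.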

Lastly, we try balancing the training data.

In [115]:
# Balance training data
print(f"Unbalanced training set size: {X_train.shape[0]:,}")
print(f"Unbalanced training set class distribution: {Counter(y_train)}")
undersampler = RandomUnderSampler(random_state=1)
X_train_resampled, y_train_resampled = undersampler.fit_resample(X_train, y_train)
print(f"Resampled training set size: {X_train_resampled.shape[0]:,}")
print(f"Resampled training set class distribution: {Counter(y_train_resampled)}")

# Re-run the grid search (re-tuning C and gamma) on the balanced data
grid_search_svm.fit(X_train_resampled, y_train_resampled)

# Display a more detailed classification report
y_pred_svm = grid_search_svm.predict(X_test) # Predictions on test set
print("\nClassification Report on Test Set:")
print(classification_report(y_test, y_pred_svm))

# Accuracy on training set
y_train_pred = grid_search_svm.predict(X_train_resampled) # Predictions on (balanced) training set
train_acc = accuracy_score(y_train_resampled, y_train_pred)
print('Training Accuracy: %.2f%%' % (train_acc * 100))
print('Best Cross-Validation Recall Score on Training Set:  %.2f%%' % (grid_search_svm.best_score_ * 100))

# Accuracy on test set
test_acc = accuracy_score(y_test, y_pred_svm)
print('Test Accuracy: %.2f%%' % (test_acc * 100))

# Calculate final recall on the test set
test_recall_svm = recall_score(y_test, y_pred_svm)
print('Final Model Recall on Test Set:  %.2f%%' % (test_recall_svm * 100))

Unbalanced training set size: 26,771
Unbalanced training set class distribution: Counter({False: 23790, True: 2981})
Resampled training set size: 5,962
Resampled training set class distribution: Counter({False: 2981, True: 2981})

Classification Report on Test Set:
              precision    recall  f1-score   support

       False       0.95      0.72      0.82     10197
        True       0.24      0.70      0.36      1277

    accuracy                           0.72     11474
   macro avg       0.60      0.71      0.59     11474
weighted avg       0.87      0.72      0.77     11474

Training Accuracy: 71.15%
Best Cross-Validation Recall Score on Training Set:  70.41%
Test Accuracy: 72.01%
Final Model Recall on Test Set:  70.48%


Balancing the data made the accuracy drop significantly (nearly 18 percentage points, from 89.73% to 72.01%), but we do see a substantial improvement in recall (from 24.12% to 70.48%). This means the model is able to identify a much larger proportion of the positive class (subscribed) instances, but at the cost of many more false positives: precision for the positive class falls from 0.60 to 0.24. That is, many more clients are incorrectly predicted to subscribe when they do not, but we are able to identify many more clients who do subscribe. If the bank has the resources to cast a wider net with more potential clients, this trade-off may be acceptable.
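
To put the trade-off in terms of clients, the sketch below reconstructs approximate confusion-matrix counts for the balanced model from the figures printed above (class supports, test accuracy, and positive-class recall). Because those figures are rounded, the counts are approximations rather than values taken from the notebook.

```python
# Figures copied from the balanced-model output above
n_pos, n_neg = 1277, 10197   # test-set support: subscribed / not subscribed
accuracy = 0.7201
recall_pos = 0.7048

tp = round(recall_pos * n_pos)                  # subscribers correctly flagged
fn = n_pos - tp                                 # subscribers missed
n_correct = round(accuracy * (n_pos + n_neg))   # all correct predictions
tn = n_correct - tp                             # non-subscribers correctly rejected
fp = n_neg - tn                                 # non-subscribers flagged as subscribers

precision_pos = tp / (tp + fp)

print(f"TP={tp}, FN={fn}, TN={tn}, FP={fp}")
print(f"Predicted subscribers: {tp + fp:,}, precision: {precision_pos:.2f}")
print(f"False positives per true positive: {fp / tp:.1f}")
```

Under these approximations, the bank would contact about 3,700 clients to reach about 900 actual subscribers, or roughly three false positives for every true positive.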

## Comparison with Models from Previous Homework

We compare the results of the SVM model above with the decision tree, random forest, and AdaBoost models from the previous homework. We detail the model parameters and results of evaluation metrics for each experiment below and compare them with the SVM result.
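
As one way to see the comparison at a glance, the sketch below collects the test-set metrics printed in this notebook into a single table sorted by recall. The values are copied from the classification reports in this notebook; only models whose reports are reproduced here are included, and the short model labels are ours.

```python
# (model, test accuracy, precision, recall, F1); precision/recall/F1 are for
# the positive class (subscribed), copied from the reports in this notebook
results = [
    ("SVM linear",               0.8978, 0.63, 0.1926, 0.30),
    ("SVM RBF",                  0.8982, 0.63, 0.2130, 0.32),
    ("SVM RBF tuned",            0.8973, 0.60, 0.2412, 0.34),
    ("SVM RBF tuned + balanced", 0.7201, 0.24, 0.7048, 0.36),
    ("Decision tree (depth 4)",  0.8995, 0.69, 0.1770, 0.28),
    ("Decision tree (depth 2)",  0.8996, 0.70, 0.1731, 0.28),
    ("Random forest 1",          0.8932, 0.54, 0.2968, 0.38),
    ("Random forest 2",          0.8932, 0.54, 0.2968, 0.38),
    ("AdaBoost 1",               0.8997, 0.70, 0.1738, 0.28),
]

# Sort by recall, the metric the grid searches optimize
by_recall = sorted(results, key=lambda row: row[3], reverse=True)

print(f"{'Model':<28}{'Acc':>8}{'Prec':>8}{'Recall':>8}{'F1':>8}")
for name, acc, prec, rec, f1 in by_recall:
    print(f"{name:<28}{acc:>8.2%}{prec:>8.2f}{rec:>8.2%}{f1:>8.2f}")

best_recall_model = by_recall[0][0]
best_accuracy_model = max(results, key=lambda row: row[1])[0]
best_f1_model = max(results, key=lambda row: row[4])[0]
```

The highest accuracy and the highest recall belong to different models, which is why the comparison below looks at recall and F1 alongside accuracy.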

>From Grading Rubric:
>2. Comparison was backed-up with facts & figures from results obtained (10) -- Richie, would you be able to add a figure?

#### Decision Tree 1

Decision tree with max depth of 4.

In [105]:
# Create and fit decision tree 1
tree_model_1 = DecisionTreeClassifier(criterion='gini',
                                     max_depth=4,
                                     random_state=1)
tree_model_1.fit(X_train, y_train)

# Display a detailed classification report
y_pred_tree1 = tree_model_1.predict(X_test) # Predictions on test set
print("\nClassification Report on Test Set:")
print(classification_report(y_test, y_pred_tree1))

# Accuracy on training set
y_train_pred_tree1 = tree_model_1.predict(X_train) # Predictions on training set
train_acc = accuracy_score(y_train, y_train_pred_tree1)
print('Training Accuracy: %.2f%%' % (train_acc * 100))

# Accuracy on test set
test_acc_tree1 = accuracy_score(y_test, y_pred_tree1)
print('Test Accuracy: %.2f%%' % (test_acc_tree1 * 100))

# Recall on the test set
test_recall_tree1 = recall_score(y_test, y_pred_tree1)
print('Recall on Test Set:  %.2f%%' % (test_recall_tree1 * 100))


Classification Report on Test Set:
              precision    recall  f1-score   support

       False       0.91      0.99      0.95     10197
        True       0.69      0.18      0.28      1277

    accuracy                           0.90     11474
   macro avg       0.80      0.58      0.61     11474
weighted avg       0.88      0.90      0.87     11474

Training Accuracy: 90.06%
Test Accuracy: 89.95%
Recall on Test Set:  17.70%


The decision tree model with a max depth of 4 was 89.95% accurate on the test data, much better than the final SVM model (72.01%), but had lower recall (17.70% vs. 70.48%) and F1 score (0.28 vs. 0.36) for the positive class (subscribed).

#### Decision Tree 2

Decision tree pruned to have a max depth of 2.

In [106]:
# Create and fit a decision tree model
tree_model_2 = DecisionTreeClassifier(criterion='gini',
                                     max_depth=2, # Decrease max depth to 2
                                     random_state=1)
tree_model_2.fit(X_train, y_train)

# Display a detailed classification report
y_pred_tree2 = tree_model_2.predict(X_test) # Predictions on test set
print("\nClassification Report on Test Set:")
print(classification_report(y_test, y_pred_tree2))

# Accuracy on training set
y_train_pred_tree2 = tree_model_2.predict(X_train) # Predictions on training set
train_acc = accuracy_score(y_train, y_train_pred_tree2)
print('Training Accuracy: %.2f%%' % (train_acc * 100))

# Accuracy on test set
test_acc_tree2 = accuracy_score(y_test, y_pred_tree2)
print('Test Accuracy: %.2f%%' % (test_acc_tree2 * 100))

# Recall on the test set
test_recall_tree2 = recall_score(y_test, y_pred_tree2)
print('Recall on Test Set:  %.2f%%' % (test_recall_tree2 * 100))



Classification Report on Test Set:
              precision    recall  f1-score   support

       False       0.91      0.99      0.95     10197
        True       0.70      0.17      0.28      1277

    accuracy                           0.90     11474
   macro avg       0.80      0.58      0.61     11474
weighted avg       0.88      0.90      0.87     11474

Training Accuracy: 90.02%
Test Accuracy: 89.96%
Recall on Test Set:  17.31%


The decision tree model with a max depth of 2 was 89.96% accurate, which is still better than the final SVM model (72.01%). However, it had even lower recall than the previous decision tree (17.31% vs. 17.70%), so the SVM model is preferable when identifying subscribers is the priority.

#### Random Forest 1

Random forest with 25 trees.

In [107]:
# Generate RF model
forest_1 = RandomForestClassifier(n_estimators=25,
                                random_state=1,
                                n_jobs=2)
forest_1.fit(X_train, y_train)

# Display a detailed classification report
y_pred_forest1 = forest_1.predict(X_test) # Predictions on test set
print("\nClassification Report on Test Set:")
print(classification_report(y_test, y_pred_forest1))

# Accuracy on training set
y_train_pred_forest1 = forest_1.predict(X_train) # Predictions on training set
train_acc = accuracy_score(y_train, y_train_pred_forest1)
print('Training Accuracy: %.2f%%' % (train_acc * 100))

# Accuracy on test set
test_acc_forest1 = accuracy_score(y_test, y_pred_forest1)
print('Test Accuracy: %.2f%%' % (test_acc_forest1 * 100))

# Recall on the test set
test_recall_forest1 = recall_score(y_test, y_pred_forest1)
print('Recall on Test Set:  %.2f%%' % (test_recall_forest1 * 100))


Classification Report on Test Set:
              precision    recall  f1-score   support

       False       0.92      0.97      0.94     10197
        True       0.54      0.30      0.38      1277

    accuracy                           0.89     11474
   macro avg       0.73      0.63      0.66     11474
weighted avg       0.87      0.89      0.88     11474

Training Accuracy: 99.15%
Test Accuracy: 89.32%
Recall on Test Set:  29.68%


This random forest model is slightly less accurate than the previous decision tree models, at 89.32% test accuracy, but it is still more accurate than the final SVM model (72.01%). However, since the training accuracy is very high at 99.15% (nearly a perfect fit to the training data!), nearly 10 percentage points above the test accuracy, the model appears to be overfitting. Recall is better than for the decision trees but still poor: the model correctly identifies only 29.68% of true positives (subscribers), compared to 70.48% for the SVM model.

#### Random Forest 2

Random forest with grid search over number of trees (10-50, in increments of 5) and max depth (2-6 and None, in increments of 1) using 5 fold cross validation.

In [108]:
# Set up our parameters for grid search
param_grid = {
    'n_estimators': [10, 15, 20, 25, 30, 35, 40, 45, 50],
    'max_depth': [2, 3, 4, 5, 6, None]
}

# Initialize our random forest classifier
forest_2 = RandomForestClassifier(random_state=1)

# Set up the grid search with recall as the scoring metric
grid_search = GridSearchCV(
    estimator=forest_2,
    param_grid=param_grid,
    scoring='recall',
    cv=5,
    n_jobs=-1 # This allows us to use all available cores to speed up the training
)

# Fit the grid search to the training data
grid_search.fit(X_train, y_train)

# Print the best parameters and the best score
print(f"Best Hyperparameters: {grid_search.best_params_}")

# Get the best model from the grid search
best_forest = grid_search.best_estimator_

Best Hyperparameters: {'max_depth': None, 'n_estimators': 25}


In [114]:
# Display a more detailed classification report
y_pred_best_forest = best_forest.predict(X_test)
print("\nClassification Report on Test Set:")
print(classification_report(y_test, y_pred_best_forest))

# Accuracy on training set
y_train_pred = best_forest.predict(X_train)
train_acc = accuracy_score(y_train, y_train_pred)
print('Training Accuracy: %.2f%%' % (train_acc * 100))

# Accuracy on test set
test_acc = accuracy_score(y_test, y_pred_best_forest)
print('Test Accuracy: %.2f%%' % (test_acc * 100))

# Calculate final recall on the test set
test_recall_rf2 = recall_score(y_test, y_pred_best_forest)
print('Final Model Recall on Test Set:  %.2f%%' % (test_recall_rf2 * 100))
print('Best Cross-Validation Recall Score:  %.2f%%' % (grid_search.best_score_ * 100))


Classification Report on Test Set:
              precision    recall  f1-score   support

       False       0.92      0.97      0.94     10197
        True       0.54      0.30      0.38      1277

    accuracy                           0.89     11474
   macro avg       0.73      0.63      0.66     11474
weighted avg       0.87      0.89      0.88     11474

Training Accuracy: 99.15%
Test Accuracy: 89.32%
Final Model Recall on Test Set:  29.68%
Best Cross-Validation Recall Score:  28.14%


The grid search selected the same configuration as Random Forest 1 (25 trees, no maximum depth), so the test results are identical: 89.32% accuracy, which is still better than the final SVM (72.01%). Since the training accuracy is again very high at 99.15% (nearly a perfect fit to the training data!), this model also shows signs of overfitting. Additionally, the recall of the model (29.68%) is very poor compared to the SVM (70.48%).

#### AdaBoost 1

Adaboost with 50 estimators and learning rate of 1.

In [110]:
# train the adaboost algorithm
adaboost_1 = AdaBoostClassifier(
    n_estimators=50,
    learning_rate=1,
    random_state=1
)

# Train our model
adaboost_1.fit(X_train, y_train)

# Display a more detailed classification report
y_pred_ada_1 = adaboost_1.predict(X_test)
print("\nClassification Report on Test Set:")
print(classification_report(y_test, y_pred_ada_1))

# Accuracy on training set
y_train_pred = adaboost_1.predict(X_train)
train_acc = accuracy_score(y_train, y_train_pred)
print('Training Accuracy: %.2f%%' % (train_acc * 100))

# Accuracy on test set
test_acc = accuracy_score(y_test, y_pred_ada_1)
print('Test Accuracy: %.2f%%' % (test_acc * 100))

# Calculate final recall on the test set
test_recall_ada_1 = recall_score(y_test, y_pred_ada_1)
print('Final Model Recall on Test Set:  %.2f%%' % (test_recall_ada_1 * 100))


Classification Report on Test Set:
              precision    recall  f1-score   support

       False       0.91      0.99      0.95     10197
        True       0.70      0.17      0.28      1277

    accuracy                           0.90     11474
   macro avg       0.80      0.58      0.61     11474
weighted avg       0.88      0.90      0.87     11474

Training Accuracy: 90.03%
Test Accuracy: 89.97%
Final Model Recall on Test Set:  17.38%


The AdaBoost model, configured with 50 estimators and a learning rate of 1, achieved a final recall of 17.38% on the test set. This very low score means that the model fails to identify about 83% of actual subscribers, so its recall is far below that of the final SVM model (70.48%). Despite this, the model showed high precision (70%), meaning that a customer predicted to subscribe is likely to actually do so. The high accuracy of 90% is misleading because the dataset is imbalanced, even though it is higher than the SVM model's accuracy (72.01%).
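To make the point about misleading accuracy concrete, the sketch below computes the accuracy of a hypothetical baseline that always predicts `False`, using only the class supports from the classification report above (10,197 non-subscribers and 1,277 subscribers). The variable names are illustrative and are not part of the original notebook.

```python
# Class supports taken from the classification report above
n_negative = 10197  # customers who did not subscribe (False)
n_positive = 1277   # customers who subscribed (True)

# A hypothetical baseline that always predicts False is right on every
# negative and wrong on every positive.
baseline_accuracy = n_negative / (n_negative + n_positive)
baseline_recall = 0 / n_positive  # it never finds a subscriber

print(f"Majority-class baseline accuracy: {baseline_accuracy:.2%}")
print(f"Majority-class baseline recall:   {baseline_recall:.2%}")
```

The baseline lands at roughly 89% accuracy, so the 89.97% test accuracy of AdaBoost 1 is only about one percentage point better than never predicting a subscriber at all.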

#### AdaBoost 2

We balance the training data by undersampling the majority class, then tune an AdaBoost model with a grid search over the number of estimators (20 to 295 in increments of 25) and the learning rate (0.1 to 1 in increments of 0.05), using 5-fold cross-validation.

In [111]:
# Firstly, we will balance the dataset by undersampling the majority class
undersampler = RandomUnderSampler(random_state=1)
X_train_resampled, y_train_resampled = undersampler.fit_resample(X_train, y_train)

print(f"Resample training set size: {X_train_resampled.shape[0]:,}")

Resample training set size: 5,962
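
As a rough illustration of what random undersampling does, the sketch below keeps every minority-class row and draws an equal-sized random sample from the larger class. The helper `random_undersample` and the toy labels are hypothetical; this is a simplified stand-in for the idea, not the `RandomUnderSampler` implementation.

```python
import random
from collections import Counter

def random_undersample(labels, seed=1):
    """Return sorted row indices that keep an equal number of rows per class.

    Illustrative helper (not imblearn's code): every class is sampled down
    to the size of the smallest class.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    minority_size = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        # Sampling minority_size rows keeps the whole minority class
        kept.extend(rng.sample(indices, minority_size))
    return sorted(kept)

# Toy labels with a similar kind of imbalance: 90 negatives, 10 positives
toy_labels = [False] * 90 + [True] * 10
kept_indices = random_undersample(toy_labels)
balanced_counts = Counter(toy_labels[i] for i in kept_indices)
print(balanced_counts)
```

Assuming the default sampling strategy, the resampled training set of 5,962 rows reported above would correspond to 2,981 rows in each class.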


In [112]:
# Now we'll set up our grid search
param_grid = {
    'n_estimators': list(range(20, 301, 25)),
    'learning_rate': [round(x, 2) for x in np.arange(0.1, 1.01, 0.05)],
}

# Creating our AdaBoost classifier
ada_boost_2 = AdaBoostClassifier(random_state=1)

grid_search_ada = GridSearchCV(
    estimator=ada_boost_2,
    param_grid=param_grid,
    scoring='recall',
    cv=5,
    n_jobs=-1 # With our wide range of hyperparameters, we're going to need to use all available cores to speed up the training
)

# Run the grid search on the resampled training data
grid_search_ada.fit(X_train_resampled, y_train_resampled)
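
To give a sense of how much work this grid search involves, the sketch below counts the hyperparameter combinations and cross-validation fits. It rebuilds the grid values in plain Python under the assumption that they match the `param_grid` defined above; the variable names are illustrative.

```python
from itertools import product

# Values assumed to match the param_grid defined above
n_estimators_values = list(range(20, 301, 25))                         # 20, 45, ..., 295
learning_rate_values = [round(0.1 + 0.05 * i, 2) for i in range(19)]   # 0.1, 0.15, ..., 1.0
cv_folds = 5

# Every pairing of the two hyperparameters is one candidate model
combinations = list(product(n_estimators_values, learning_rate_values))
total_fits = len(combinations) * cv_folds

print(f"{len(n_estimators_values)} x {len(learning_rate_values)} = {len(combinations)} combinations")
print(f"{total_fits} cross-validation fits (plus one final refit on the full resampled set)")
```

With this many fits, training on the smaller undersampled set and setting `n_jobs=-1` both help keep the search time manageable.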

In [116]:
print(f"Best Hyperparameters Found: {grid_search_ada.best_params_}")
best_ada = grid_search_ada.best_estimator_

# Display a more detailed classification report
y_pred_ada2 = best_ada.predict(X_test)
print("\nClassification Report on Test Set:")
print(classification_report(y_test, y_pred_ada2))

# Accuracy on (balanced) training set
y_train_pred = best_ada.predict(X_train_resampled)
train_acc = accuracy_score(y_train_resampled, y_train_pred)
print('Training Accuracy: %.2f%%' % (train_acc * 100))
print('Best Cross-Validation Recall Score on Training Set:  %.2f%%' % (grid_search_ada.best_score_ * 100))

# Accuracy on test set
test_acc = accuracy_score(y_test, y_pred_ada2)
print('Test Accuracy: %.2f%%' % (test_acc * 100))

# Calculate final recall on the test set
test_recall_ada2 = recall_score(y_test, y_pred_ada2)
print('Final Model Recall on Test Set:  %.2f%%' % (test_recall_ada2 * 100))

Best Hyperparameters Found: {'learning_rate': np.float64(0.85), 'n_estimators': 20}

Classification Report on Test Set:
              precision    recall  f1-score   support

       False       0.95      0.74      0.83     10197
        True       0.25      0.70      0.37      1277

    accuracy                           0.73     11474
   macro avg       0.60      0.72      0.60     11474
weighted avg       0.87      0.73      0.78     11474

Training Accuracy: 71.50%
Best Cross-Validation Recall Score on Training Set:  70.24%
Test Accuracy: 73.26%
Final Model Recall on Test Set:  69.77%


From our grid search, the optimal AdaBoost model has a learning rate of 0.85 and 20 estimators. This approach is effective for our goal: the model correctly identifies 69.77% of actual subscribers in the test set, only slightly below the final SVM at 70.48%. However, the precision is quite low at 25%, meaning that about three in four predicted subscribers are false positives, which is comparable to the 24% precision of the SVM. Despite this, the AdaBoost model still has a fairly good accuracy of 73.26%, about 1 percentage point better than the SVM model at 72.01%.
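The precision and recall tradeoff is easier to see as raw counts. The sketch below reconstructs an approximate confusion matrix for AdaBoost 2 from the reported test accuracy, recall, and class supports. Because the inputs are rounded percentages, the counts are estimates rather than the notebook's actual confusion matrix, and the variable names are illustrative.

```python
# Figures copied from the AdaBoost 2 output above
n_negative, n_positive = 10197, 1277   # class supports
test_accuracy = 0.7326
test_recall = 0.6977

n_total = n_negative + n_positive
true_pos = round(test_recall * n_positive)    # subscribers correctly flagged
false_neg = n_positive - true_pos             # subscribers missed
correct = round(test_accuracy * n_total)      # all correct predictions
true_neg = correct - true_pos                 # non-subscribers correctly passed over
false_pos = n_negative - true_neg             # non-subscribers flagged by mistake

precision = true_pos / (true_pos + false_pos)
print(f"TP={true_pos}, FN={false_neg}, FP={false_pos}, TN={true_neg}")
print(f"Estimated precision: {precision:.2%}")
```

Under these assumptions, the model flags a few thousand non-subscribers in order to catch most of the actual subscribers, which is the practical meaning of 25% precision.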

## Conclusion

With the priority of increasing recall, we recommend using an SVM with hyperparameter tuning and data balancing. This model provided the best recall relative to the tree-based models and correctly identified over 70% of actual subscribers. However, this came at the cost of low precision and many false positives. If the bank is okay with this tradeoff, then this model can be used in production. However, if the bank wants to improve precision, then further hyperparameter tuning and data balancing techniques should be explored.

> Need to Answer questions, such as:
>   - Which algorithm is recommended to get more accurate results?
>   - Is it better for classification or regression scenarios?
>   - Do you agree with the recommendations?
>   - Why?

## References

1. Ahmad, A., Safi, O., Malebary, S., Alesawi, S., & Alkayal, E. (2021). Decision tree ensembles to predict coronavirus disease 2019 infection: A comparative study. Complexity, 2021, 5550344. https://doi.org/10.1155/2021/5550344

2. Guhathakurata, S., Kundu, S., Chakraborty, A., & Banerjee, J. S. (2021). A novel approach to predict COVID-19 using support vector machine. Data Science for COVID-19, 351–364. https://doi.org/10.1016/B978-0-12-824536-1.00014-9

3. Phuong, N. D., Tuyen, N. T., Linh, V. T. T., Nguyen, N. N., & Nguyen, T. Q. (2025). Machine learning techniques in chronic kidney diseases: A comparative study of classification model performance. Bioinformatics and Biology Insights, 19, 11779322251356563. https://doi.org/10.1177/11779322251356563

4. Park, S. J., Lee, S. J., Kim, H., Kim, J. K., Chun, J. W., et al. (2021). Machine learning prediction of dropping out of outpatients with alcohol use disorders. PLOS ONE, 16(8), e0255626. https://doi.org/10.1371/journal.pone.0255626