# DATA622 Assignment 3 


### Load data

For this assignment, I will use the same bank dataset as in the first two assignments.


In [2]:
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler

# Load data 
url = "https://raw.githubusercontent.com/j-song-npc/Data622/refs/heads/main/bank-full.csv"
bank_data = pd.read_csv(url, sep=";")

### Data prep

I will use the same data prep as in assignment 2. In summary, I removed the 'duration' and 'poutcome' variables and converted all remaining variables into numerical values. I also checked for missing values and found none.

In [3]:
print(bank_data.head())

# Remove duration variable
bank_data = bank_data.drop('duration', axis=1)

# Check the percentage of "unknown" values in each column
unknown_counts = (bank_data == "unknown").sum()
unknown_percent = (unknown_counts / len(bank_data)) * 100
print(unknown_percent)

# Remove poutcome variable  
bank_data = bank_data.drop('poutcome', axis=1)

# Convert Y variable as 0/1 
bank_data['y'] = bank_data['y'].map({'no': 0, 'yes': 1})

# Convert all other variables into numerical values 
bank_data = pd.get_dummies(bank_data, drop_first=True)


   age           job  marital  education default  balance housing loan  \
0   58    management  married   tertiary      no     2143     yes   no   
1   44    technician   single  secondary      no       29     yes   no   
2   33  entrepreneur  married  secondary      no        2     yes  yes   
3   47   blue-collar  married    unknown      no     1506     yes   no   
4   33       unknown   single    unknown      no        1      no   no   

   contact  day month  duration  campaign  pdays  previous poutcome   y  
0  unknown    5   may       261         1     -1         0  unknown  no  
1  unknown    5   may       151         1     -1         0  unknown  no  
2  unknown    5   may        76         1     -1         0  unknown  no  
3  unknown    5   may        92         1     -1         0  unknown  no  
4  unknown    5   may       198         1     -1         0  unknown  no  
age           0.000000
job           0.637013
marital       0.000000
education     4.107407
default       0.000

### SVM analysis
I will be testing SVM on this dataset using the radial and linear kernels.

In [None]:
# Separate features and target
x = bank_data.drop('y', axis=1)
y = bank_data['y']

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=1)

# Standard scaling
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)
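
The scaler above is fit on the training split only and then reused on the test split, so no test-set statistics leak into training. A minimal standard-library sketch of that idea follows. The toy numbers and the helper names `fit_standardizer` and `standardize` are hypothetical and not part of the assignment code; `pstdev` is used on the assumption that `StandardScaler` divides by the population standard deviation.

```python
from statistics import mean, pstdev

def fit_standardizer(train_values):
    """Learn the mean and population standard deviation from training values only."""
    return mean(train_values), pstdev(train_values)

def standardize(values, mu, sigma):
    """Apply previously learned statistics to any split."""
    return [(v - mu) / sigma for v in values]

# Hypothetical toy feature, not taken from the bank dataset
train_feature = [1.0, 2.0, 3.0]
test_feature = [4.0]

mu, sigma = fit_standardizer(train_feature)
train_scaled = standardize(train_feature, mu, sigma)
test_scaled = standardize(test_feature, mu, sigma)  # reuses the training mu and sigma
print(train_scaled, test_scaled)
```

The scaled training values are centered on zero, while the test value is expressed relative to the training distribution rather than its own.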


### SVM using linear kernel  



In [None]:
svm1 = SVC(kernel='linear', class_weight='balanced', C=0.5, random_state=1)
svm1.fit(x_train_scaled, y_train)
y_pred2 = svm1.predict(x_test_scaled)

print(classification_report(y_test, y_pred2))

              precision    recall  f1-score   support

           0       0.94      0.72      0.82     12013
           1       0.24      0.66      0.35      1551

    accuracy                           0.72     13564
   macro avg       0.59      0.69      0.58     13564
weighted avg       0.86      0.72      0.77     13564
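
The model above passes `class_weight='balanced'`, which scikit-learn documents as weighting each class by `n_samples / (n_classes * count)`. The sketch below applies that formula by hand. As an assumption for illustration only, it uses the test-set support counts from the report above as a stand-in; the actual weights are derived from `y_train`, whose class counts are not shown in this notebook. The helper name `balanced_class_weights` is hypothetical.

```python
def balanced_class_weights(class_counts):
    """Compute n_samples / (n_classes * count) for each class label."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

# Assumption: test-set support counts used as a stand-in for the training counts
counts = {0: 12013, 1: 1551}
weights = balanced_class_weights(counts)
print(weights)
```

With counts this skewed, the minority class receives a weight several times larger than the majority class, which is why the balanced models trade accuracy for class 1 recall.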



### SVM using radial basis function

The radial SVM performed better overall than the linear kernel: the class 1 F1 score rose from 0.35 to 0.43 and accuracy rose from 0.72 to 0.82, although class 1 recall dipped slightly from 0.66 to 0.61.

In [None]:
svm_2 = SVC(kernel='rbf', class_weight='balanced', random_state=1)
svm_2.fit(x_train_scaled, y_train)
y_pred = svm_2.predict(x_test_scaled)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.94      0.84      0.89     12013
           1       0.33      0.61      0.43      1551

    accuracy                           0.82     13564
   macro avg       0.64      0.72      0.66     13564
weighted avg       0.87      0.82      0.84     13564
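
The F1 values in the two reports above can be checked by hand, since F1 is the harmonic mean of precision and recall. A small sketch using the rounded class 1 precision and recall printed in each report follows. The helper name `f1_from` is hypothetical, and because the inputs are already rounded, results could in general differ from a report in the last digit.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Class 1 precision and recall as printed in the reports above
linear_f1 = f1_from(0.24, 0.66)
rbf_f1 = f1_from(0.33, 0.61)
print(round(linear_f1, 2), round(rbf_f1, 2))
```

These reproduce the reported class 1 F1 scores of 0.35 for the linear kernel and 0.43 for the RBF kernel.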



### SVM using RBF (unbalanced)

I wanted to see what the results of an unbalanced radial kernel SVM would look like, since I tested that in assignment 2 for my decision trees. The recall for class 1 is very low (0.11), but accuracy is high (0.89), likely because of the good results for class 0, which is overrepresented in this dataset.

In [7]:
# SVM model (RBF kernel unbalanced)
svm3 = SVC(kernel='rbf', random_state=1)
svm3.fit(x_train_scaled, y_train)
y_pred = svm3.predict(x_test_scaled)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.90      0.99      0.94     12013
           1       0.53      0.11      0.18      1551

    accuracy                           0.89     13564
   macro avg       0.71      0.55      0.56     13564
weighted avg       0.85      0.89      0.85     13564
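
One way to put the 0.89 accuracy in context is to compare it with a baseline that always predicts class 0. A short sketch using the support counts from the report above follows; the variable names are illustrative.

```python
# Support counts as printed in the classification report above
support_class_0 = 12013
support_class_1 = 1551
total = support_class_0 + support_class_1

# Accuracy of a model that predicts class 0 for every test row
majority_baseline_accuracy = support_class_0 / total
print(f"{majority_baseline_accuracy:.3f}")
```

The baseline comes out at roughly 0.886, so the unbalanced model's 0.89 accuracy is barely above always predicting "no".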

