## StatQuest - Support Vector Machine (SVM)

SVMs are among the best machine learning methods when getting the correct answer is a higher priority than understanding why you get it. They work well with small datasets and tend to work well "out of the box", with little need for optimization.

### 1. Importing the required modules needed for SVM

#### 1.1 SVM Modules

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as colors
from sklearn.utils import resample # Downsample the dataset
from sklearn.model_selection import train_test_split, GridSearchCV # split data into train/test sets; grid search with cross validation
from sklearn.preprocessing import scale # scales and centers data
from sklearn.svm import SVC # SVM model
from sklearn.metrics import confusion_matrix # creates confusion matrix
from sklearn.decomposition import PCA # perform PCA to plot the data

#### 1.2 Module for datasets

Currently using <a href='https://archive.ics.uci.edu/'>UC Irvine Machine Learning Repository </a>

In [2]:
from ucimlrepo import fetch_ucirepo

### 2. Fetching/Importing Data

#### 2.1 Loading and Verifying Data

In [3]:
# fetch dataset 
default_of_credit_card_clients = fetch_ucirepo(id=350) 
  
# data (as pandas dataframes) 
X = default_of_credit_card_clients.data.features 
y = default_of_credit_card_clients.data.targets 

In [4]:
default_of_credit_card_clients.variables # Viewing the feature names and descriptions

Unnamed: 0,name,role,type,demographic,description,units,missing_values
0,ID,ID,Integer,,,,no
1,X1,Feature,Integer,,LIMIT_BAL,,no
2,X2,Feature,Integer,Sex,SEX,,no
3,X3,Feature,Integer,Education Level,EDUCATION,,no
4,X4,Feature,Integer,Marital Status,MARRIAGE,,no
5,X5,Feature,Integer,Age,AGE,,no
6,X6,Feature,Integer,,PAY_0,,no
7,X7,Feature,Integer,,PAY_2,,no
8,X8,Feature,Integer,,PAY_3,,no
9,X9,Feature,Integer,,PAY_4,,no


In [5]:
default_of_credit_card_clients.metadata

{'uci_id': 350,
 'name': 'Default of Credit Card Clients',
 'repository_url': 'https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients',
 'data_url': 'https://archive.ics.uci.edu/static/public/350/data.csv',
 'abstract': "This research aimed at the case of customers' default payments in Taiwan and compares the predictive accuracy of probability of default among six data mining methods.",
 'area': 'Business',
 'tasks': ['Classification'],
 'characteristics': ['Multivariate'],
 'num_instances': 30000,
 'num_features': 23,
 'feature_types': ['Integer', 'Real'],
 'demographics': ['Sex', 'Education Level', 'Marital Status', 'Age'],
 'target_col': ['Y'],
 'index_col': ['ID'],
 'has_missing_values': 'no',
 'missing_values_symbol': None,
 'year_of_dataset_creation': 2009,
 'last_updated': 'Fri Mar 29 2024',
 'dataset_doi': '10.24432/C55S3H',
 'creators': ['I-Cheng Yeh'],
 'intro_paper': {'ID': 365,
  'type': 'NATIVE',
  'title': 'The comparisons of data mining techniques for the

In [6]:
X.head() # Original imported data from uciML Repo

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23
0,20000,2,2,1,24,2,2,-1,-1,-2,...,689,0,0,0,0,689,0,0,0,0
1,120000,2,2,2,26,-1,2,0,0,0,...,2682,3272,3455,3261,0,1000,1000,1000,0,2000
2,90000,2,2,2,34,0,0,0,0,0,...,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000
3,50000,2,2,1,37,0,0,0,0,0,...,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000
4,50000,1,2,1,57,-1,0,-1,0,0,...,35835,20940,19146,19131,2000,36681,10000,9000,689,679


In [7]:
y.head()

Unnamed: 0,Y
0,1
1,1
2,0
3,0
4,0


#### 2.2 Fixing Column and index names

In [8]:
x_df = X.copy()
y_df = y.copy()

# fixing features column names
correctColNames = default_of_credit_card_clients.variables.description[1:-1]
colMapping = dict(zip(x_df.columns, correctColNames))
x_df = x_df.rename(columns=colMapping)
x_df = x_df.rename_axis('ID')

# fixing label column names
y_df.columns = ['DEFAULT']

In [9]:
x_df.head()

ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6
0,20000,2,2,1,24,2,2,-1,-1,-2,...,689,0,0,0,0,689,0,0,0,0
1,120000,2,2,2,26,-1,2,0,0,0,...,2682,3272,3455,3261,0,1000,1000,1000,0,2000
2,90000,2,2,2,34,0,0,0,0,0,...,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000
3,50000,2,2,1,37,0,0,0,0,0,...,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000
4,50000,1,2,1,57,-1,0,-1,0,0,...,35835,20940,19146,19131,2000,36681,10000,9000,689,679


In [10]:
y_df.head()

Unnamed: 0,DEFAULT
0,1
1,1
2,0
3,0
4,0


### 3. Cleaning and Processing data

#### 3.1 Identifying and dealing with missing data

__Missing Data__ is simply a blank space, or a surrogate value like __NA__, that indicates that we failed to collect data for one of the features.

There are two main ways to deal with missing data:
1. __Directly removing rows with missing data.__ This is relatively easy to do but potentially loses other useful data.
1. __Imputing missing values.__ Basically, make an educated guess about what the value should be.
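
The two options above can be sketched as follows. This is a minimal illustration on a small made-up DataFrame; the values and the choice of the median as the "educated guess" are assumptions for demonstration, not part of the credit card dataset or this notebook's workflow.

```python
import numpy as np
import pandas as pd

# Toy data with one missing AGE value (hypothetical, not from the credit card dataset)
toy = pd.DataFrame({
    "AGE": [24.0, np.nan, 34.0, 37.0],
    "LIMIT_BAL": [20000, 120000, 90000, 50000],
})

# Count the missing values in each column
missing_per_column = toy.isna().sum()

# Option 1: drop every row that contains a missing value
dropped = toy.dropna()

# Option 2: impute, here using the column median as the "educated guess"
imputed = toy.fillna({"AGE": toy["AGE"].median()})
```

Dropping loses the `LIMIT_BAL` value of the affected row, while imputing keeps it at the cost of inventing an `AGE`.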

Next, check whether there are any missing values in the data we are working with.

In [11]:
x_df.dtypes # Viewing the types of data in each column

LIMIT_BAL    int64
SEX          int64
EDUCATION    int64
MARRIAGE     int64
AGE          int64
PAY_0        int64
PAY_2        int64
PAY_3        int64
PAY_4        int64
PAY_5        int64
PAY_6        int64
BILL_AMT1    int64
BILL_AMT2    int64
BILL_AMT3    int64
BILL_AMT4    int64
BILL_AMT5    int64
BILL_AMT6    int64
PAY_AMT1     int64
PAY_AMT2     int64
PAY_AMT3     int64
PAY_AMT4     int64
PAY_AMT5     int64
PAY_AMT6     int64
dtype: object

In [12]:
x_df.info() # using .info() to see the overall data within the dataframe

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 23 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   LIMIT_BAL  30000 non-null  int64
 1   SEX        30000 non-null  int64
 2   EDUCATION  30000 non-null  int64
 3   MARRIAGE   30000 non-null  int64
 4   AGE        30000 non-null  int64
 5   PAY_0      30000 non-null  int64
 6   PAY_2      30000 non-null  int64
 7   PAY_3      30000 non-null  int64
 8   PAY_4      30000 non-null  int64
 9   PAY_5      30000 non-null  int64
 10  PAY_6      30000 non-null  int64
 11  BILL_AMT1  30000 non-null  int64
 12  BILL_AMT2  30000 non-null  int64
 13  BILL_AMT3  30000 non-null  int64
 14  BILL_AMT4  30000 non-null  int64
 15  BILL_AMT5  30000 non-null  int64
 16  BILL_AMT6  30000 non-null  int64
 17  PAY_AMT1   30000 non-null  int64
 18  PAY_AMT2   30000 non-null  int64
 19  PAY_AMT3   30000 non-null  int64
 20  PAY_AMT4   30000 non-null  int64
 21  PAY_AMT5   3

#### 3.2 Ensuring the values in each feature are within acceptable ranges
Sometimes, even when there are no NA or None values, the actual values themselves may be stored incorrectly. It is best practice to double-check the data we are working with and see whether it aligns with the expected values.

__From UCI Machine Learning Repo...:__
- __LIMIT_BAL__: The amount of available credit __Integer__
- __SEX, Category__
    - 1 = male
    - 2 = female
- __EDUCATION, Category__
    - 1 = graduate school
    - 2 = university
    - 3 = high school
    - 4 = others
- __MARRIAGE, Category__
    - 1 = Married
    - 2 = Single
    - 3 = Other
- __AGE, Integer__
- __PAY__, When the last 6 bills were paid __Category__
    - -1 = Paid on time
    - 1 = Payment delayed by 1 month
    - 2 = Payment delayed by 2 months
    - ...
    - 8 = Payment delayed by 8 months
    - 9 = Payment delayed by 9 months or more
- __BILL_AMT__, What the last 6 bills were __Integer__
- __PAY_AMT__, How much the last payments were __Integer__
- __DEFAULT__, Whether or not a person defaulted on the next payment __Category__
    - 0 = Did not default
    - 1 = Defaulted
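
One way to turn the documented categories above into an automatic check is sketched below. The sample rows are hypothetical and the `expected` dictionary is an assumption built from the list above; it is not something provided by the UCI repository.

```python
import pandas as pd

# Hypothetical sample containing codes outside the documented categories
toy = pd.DataFrame({
    "SEX": [1, 2, 2, 1],
    "EDUCATION": [1, 2, 5, 0],
    "MARRIAGE": [1, 2, 3, 0],
})

# Documented categories for each column, copied from the list above
expected = {
    "SEX": {1, 2},
    "EDUCATION": {1, 2, 3, 4},
    "MARRIAGE": {1, 2, 3},
}

# For each column, collect the observed values that are not documented
unexpected = {
    col: sorted(set(toy[col].tolist()) - allowed)
    for col, allowed in expected.items()
}
print(unexpected)
```

An empty list means every observed value in that column is documented.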


In [13]:
x_df.head()

ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6
0,20000,2,2,1,24,2,2,-1,-1,-2,...,689,0,0,0,0,689,0,0,0,0
1,120000,2,2,2,26,-1,2,0,0,0,...,2682,3272,3455,3261,0,1000,1000,1000,0,2000
2,90000,2,2,2,34,0,0,0,0,0,...,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000
3,50000,2,2,1,37,0,0,0,0,0,...,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000
4,50000,1,2,1,57,-1,0,-1,0,0,...,35835,20940,19146,19131,2000,36681,10000,9000,689,679


In [14]:
print('Unique values found in SEX Feature:', end='')
print(x_df.SEX.unique()) # Finding unique values in SEX Column. Expected values [1, 2]
print('Unique values found in EDUCATION Feature:', end='')
print(x_df.EDUCATION.unique()) # Finding unique values in EDUCATION Column. Expected values [1, 2, 3, 4]
print('Unique values found in MARRIAGE Feature:', end='')
print(x_df.MARRIAGE.unique()) # Finding unique values in MARRIAGE Column. Expected values [1, 2, 3]
print('Unique values found in PAY_0 Feature:', end='')
print(x_df['PAY_0'].unique()) # Finding unique values in PAY_0 Column. Expected values [-1, 1, 2, ..., 9]
print('Unique values found in DEFAULT Feature:', end='')
print(y_df.DEFAULT.unique()) # Finding unique values in DEFAULT Column. Expected values [0, 1]

Unique values found in SEX Feature:[2 1]
Unique values found in EDUCATION Feature:[2 1 3 5 4 6 0]
Unique values found in MARRIAGE Feature:[1 2 3 0]
Unique values found in PAY_0 Feature:[ 2 -1  0 -2  1  3  4  8  7  5  6]
Unique values found in DEFAULT Feature:[1 0]


From the output above, we can see unexpected values in the __EDUCATION__ column (0, 5, and 6), where the expected values should only be between 1 and 4. Similarly, __MARRIAGE__ contains an unexpected value of 0.

#### 3.3 Dealing with unexpected/missing data

In this case, first find the number of rows with missing data we are working with, treating an __EDUCATION__ or __MARRIAGE__ value of 0 as missing.

In [15]:
print(f'{len(x_df[(x_df.EDUCATION == 0) | (x_df.MARRIAGE == 0)])} rows of data are missing')

68 rows of data are missing


Of __30000__ rows of data, only __68__ have missing values, which is less than __1%__ of the total dataset. In this case, we can safely remove those rows rather than imputing the missing data.
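
The arithmetic behind that decision, and the removal itself, can be sketched as below. The counts come from the output above; the toy features and labels are hypothetical. Applying the same filter to the labels is an extra step assumed here so that features and labels stay row-aligned; it is not shown in the notebook cell that follows.

```python
import pandas as pd

# Share of the dataset affected, using the counts reported above
n_total, n_missing = 30000, 68
share_missing = n_missing / n_total  # well under the 1% mentioned in the text

# Toy features and labels sharing the same index (hypothetical values)
features = pd.DataFrame({"EDUCATION": [2, 0, 1, 3], "MARRIAGE": [1, 2, 0, 2]})
labels = pd.DataFrame({"DEFAULT": [1, 0, 0, 1]})

# Build one boolean mask and apply it to both frames so the rows stay aligned
keep = (features["EDUCATION"] != 0) & (features["MARRIAGE"] != 0)
features_clean = features.loc[keep]
labels_clean = labels.loc[keep]
```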

In [16]:
removedMissing_x_df = x_df.loc[(x_df.EDUCATION != 0) & (x_df.MARRIAGE != 0)]
removedMissing_x_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 29932 entries, 0 to 29999
Data columns (total 23 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   LIMIT_BAL  29932 non-null  int64
 1   SEX        29932 non-null  int64
 2   EDUCATION  29932 non-null  int64
 3   MARRIAGE   29932 non-null  int64
 4   AGE        29932 non-null  int64
 5   PAY_0      29932 non-null  int64
 6   PAY_2      29932 non-null  int64
 7   PAY_3      29932 non-null  int64
 8   PAY_4      29932 non-null  int64
 9   PAY_5      29932 non-null  int64
 10  PAY_6      29932 non-null  int64
 11  BILL_AMT1  29932 non-null  int64
 12  BILL_AMT2  29932 non-null  int64
 13  BILL_AMT3  29932 non-null  int64
 14  BILL_AMT4  29932 non-null  int64
 15  BILL_AMT5  29932 non-null  int64
 16  BILL_AMT6  29932 non-null  int64
 17  PAY_AMT1   29932 non-null  int64
 18  PAY_AMT2   29932 non-null  int64
 19  PAY_AMT3   29932 non-null  int64
 20  PAY_AMT4   29932 non-null  int64
 21  PAY_AMT5   29932 

### 4. Formatting the data for training

#### 4.1 One-Hot Encoding - for categorical features

One-Hot Encoding is mainly used to convert categorical data into a numeric form that a model can work with. Using the current dataset as an example, __SEX, EDUCATION, MARRIAGE,__ and __PAY__ are supposed to be categorical, thus they need to be modified using __One-Hot Encoding__.

This is because __scikit-learn Support Vector Machines (SVM)__ natively support continuous data, like __LIMIT_BAL__ and __AGE__, but they do not natively support categorical data like __MARRIAGE__, which contains 3 different categories.

```One-Hot Encoding essentially converts a column of categorical data into multiple columns of binary values.```

<hr>

__"What's wrong with treating categorical data like continuous data?"__

Looking at the __MARRIAGE__ column as an example...we have 3 options (categories):
1. 1 = Married
1. 2 = Single
1. 3 = Other

If we treat these values 1, 2, and 3 like continuous data, we would assume that 3, which means "Other", is more similar to 2, which means "Single", than it is to 1, which means "Married". This means the SVM would be more likely to cluster the people with 3s and 2s together than the people with 3s and 1s together.

__In contrast__, if we treat these numbers as categorical data, each with its own column, then each category (Married, Single, Other) is treated as equally different from the others.
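
A small sketch of this point, assuming Euclidean distance as the measure of similarity (an assumption made for illustration; the helper `euclidean` is hypothetical and not part of the notebook):

```python
import numpy as np
import pandas as pd

# The three MARRIAGE codes: 1 = Married, 2 = Single, 3 = Other
marriage = pd.DataFrame({"MARRIAGE": [1, 2, 3]})

# Treated as numbers, "Other" (3) is twice as far from "Married" (1) as from "Single" (2)
dist_other_single = abs(3 - 2)
dist_other_married = abs(3 - 1)

# One-hot encoded, each category becomes its own 0/1 column
one_hot = pd.get_dummies(marriage, columns=["MARRIAGE"], dtype=int).to_numpy()

def euclidean(a, b):
    """Straight-line distance between two rows."""
    return float(np.linalg.norm(a - b))

# Every pair of categories is now the same distance apart
d_married_single = euclidean(one_hot[0], one_hot[1])
d_married_other = euclidean(one_hot[0], one_hot[2])
d_single_other = euclidean(one_hot[1], one_hot[2])
```

After encoding, no pair of categories is closer than any other, which is the "treated equally" behaviour described above.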

<hr>

__NOTE:__ There are many different ways to do __One-Hot Encoding__ in Python. Two of the more popular ways, each with their own pros and cons, are:

1. ```ColumnTransformer()``` from __scikit-learn__
    - Able to create a persistent function that can validate future data. _For example: suppose an SVM model is built using a categorical variable **colour (red, blue, green)**. ```ColumnTransformer()``` will remember those options, and if a new colour category (**orange, for example**) later appears in a production system, it can handle it by throwing an error._
    - The downside is that it turns your data into an array and loses all of the column names, making it harder to verify that the usage of ```ColumnTransformer()``` worked as intended.
1. ```get_dummies()``` from __pandas__
    - leaves the data in the original dataframe and retains the column names, making it much easier to verify that it worked as intended.
    - Unfortunately, it doesn't have the persistent behaviour found in ```ColumnTransformer()```



In [18]:
x_encoded = pd.get_dummies(removedMissing_x_df, columns=['SEX', 'EDUCATION', 'MARRIAGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4','PAY_5','PAY_6'])
x_encoded.head()

ID,LIMIT_BAL,AGE,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,...,PAY_6_-2,PAY_6_-1,PAY_6_0,PAY_6_2,PAY_6_3,PAY_6_4,PAY_6_5,PAY_6_6,PAY_6_7,PAY_6_8
0,20000,24,3913,3102,689,0,0,0,0,689,...,True,False,False,False,False,False,False,False,False,False
1,120000,26,2682,1725,2682,3272,3455,3261,0,1000,...,False,False,False,True,False,False,False,False,False,False
2,90000,34,29239,14027,13559,14331,14948,15549,1518,1500,...,False,False,True,False,False,False,False,False,False,False
3,50000,37,46990,48233,49291,28314,28959,29547,2000,2019,...,False,False,True,False,False,False,False,False,False,False
4,50000,57,8617,5670,35835,20940,19146,19131,2000,36681,...,False,False,True,False,False,False,False,False,False,False


#### 4.2 Centering and Scaling the data

Depending on the kernel used for training the SVM model (polynomial, linear, rbf, etc.) we may need to center and scale the data before training the model. Generally, scaling and centering data should be done where possible in order to bring all the features to a similar range, preventing features with larger magnitudes from dominating the learning process.

Scaling can also improve the convergence of the optimization algorithm used in SVM training.

- The __Radial Basis Function (RBF)__ kernel assumes that the data used to train the model are centered and scaled, _meaning each column should have ```mean = 0``` and ```standard deviation = 1```._

__Methods for Centering and Scaling:__
1. ```MinMaxScaler``` - scales data to a specific range (i.e. 0 to 1) based on the minimum and maximum values within the data's range
1. ```StandardScaler``` - Standardizes data to have a mean of 0 and a standard deviation of 1. __Recommended for RBF Kernels__
1. ```RobustScaler``` - Similar to StandardScaler in operation but is less sensitive to outliers than StandardScaler

In general, it is good practice to center and scale data before training SVM Models, even if the kernel doesn't explicitly require it. This can improve model performance, interpretability, and convergence.
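
The three scalers listed above share the same `fit`/`transform` interface, as the sketch below shows on made-up `[LIMIT_BAL, AGE]` rows. Fitting the scaler on the training rows only and reusing it on the test rows is a common practice assumed here; the notebook itself imports the simpler `scale()` function.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical training and test features: [LIMIT_BAL, AGE]
X_train = np.array([[20000.0, 24.0], [120000.0, 26.0], [90000.0, 34.0], [50000.0, 57.0]])
X_test = np.array([[50000.0, 37.0]])

# Learn the mean and standard deviation from the training data only...
standard = StandardScaler().fit(X_train)
X_train_std = standard.transform(X_train)
# ...then reuse those statistics on the test data
X_test_std = standard.transform(X_test)

# The other two scalers are used the same way
X_train_minmax = MinMaxScaler().fit_transform(X_train)  # each column mapped to [0, 1]
X_train_robust = RobustScaler().fit_transform(X_train)  # centered on the median, scaled by the IQR
```

After `StandardScaler`, each training column has mean 0 and standard deviation 1, which matches what the RBF kernel expects.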

In [19]:
from sklearn.svm import SVR # regression counterpart of SVC; SVC is the classifier used for this task

In [None]:
mySVM = SVR()

In [None]:
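y_clean = y_df.loc[x_encoded.index, 'DEFAULT'] # keep only the labels for rows that survived cleaning
X_train, X_test, y_train, y_test = train_test_split(x_encoded, y_clean, random_state=42)
mySVM = SVC(random_state=42)
mySVM.fit(scale(X_train), y_train) # fit() needs the training features and their labels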