<a href="https://colab.research.google.com/github/ishwor2048/Machine-Learning/blob/main/Support_Vector_Machine_%26_PCA_with_Credit_Card_Default_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Support Vector Machine on Credit Card Default Data**

dataset: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

In [None]:
# Importing all necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as colors
from sklearn.utils import resample # Downsampling the dataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale # Scale and center the data
from sklearn.svm import SVC # This will bring support vector machine for classification
from sklearn.model_selection import GridSearchCV # This one will do the cross validation part
from sklearn.metrics import confusion_matrix # This create a confusion matrix
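from sklearn.metrics import ConfusionMatrixDisplay # Draws / plots a confusion matrix (requires scikit-learn >= 1.0)
# plot_confusion_matrix was removed in scikit-learn 1.2; this alias keeps the calls below working with the same arguments
plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator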
from sklearn.decomposition import PCA # to perform PCA to plot the data

%matplotlib inline

In [None]:
# Import the credit card default data, downloaded from the UCI Machine Learning Repository (header=1 skips the extra label row above the column names)
df = pd.read_csv("default_of_credit_card_clients.csv", header=1)

In [None]:
# Let's look at the first 5 rows to see what the data looks like before analyzing it
df.head()

In [None]:
# Renaming the target variable
df.rename({'default payment next month': 'DEFAULT'}, axis='columns', inplace=True)
df.head()

In [None]:
# Dropping the unnecessary items
df.drop('ID', axis=1, inplace=True)
df.head()

In [None]:
# Let's check which datatypes we have in the dataset before we start handling missing values
df.dtypes

In [None]:
# Here is what the UCI repository says about the data:
"""
Data Set Information:

This research aimed at the case of customers' default payments in Taiwan and compares the predictive accuracy of probability of default among six data mining methods. From the perspective of risk management, the result of predictive accuracy of the estimated probability of default will be more valuable than the binary result of classification - credible or not credible clients. Because the real probability of default is unknown, this study presented the novel “Sorting Smoothing Method” to estimate the real probability of default. With the real probability of default as the response variable (Y), and the predictive probability of default as the independent variable (X), the simple linear regression result (Y = A + BX) shows that the forecasting model produced by artificial neural network has the highest coefficient of determination; its regression intercept (A) is close to zero, and regression coefficient (B) to one. Therefore, among the six data mining techniques, artificial neural network is the only one that can accurately estimate the real probability of default.


Attribute Information:

This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables:
X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
X2: Gender (1 = male; 2 = female).
X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
X4: Marital status (1 = married; 2 = single; 3 = others).
X5: Age (year).
X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .;X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005.
X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .;X23 = amount paid in April, 2005.
"""

In [None]:
# Let's have a sanity check if SEX has only 1 & 2 (1 for male and 2 for female)
df["SEX"].unique()

In [None]:
# Checking whether EDUCATION only contains the documented codes (1 = graduate school; 2 = university; 3 = high school; 4 = others)
df["EDUCATION"].unique()

In [None]:
# Let's check the marriage column
df['MARRIAGE'].unique()

In [None]:
# Dealing with missing values: the data description does not define a code 0 for EDUCATION or MARRIAGE,
# so for the sake of learning we treat 0 in these columns as a missing value. In the future we could
# investigate what 0 actually represents. Here we count the rows where either column is 0.
len(df.loc[(df['EDUCATION'] == 0) | (df['MARRIAGE'] == 0)])
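
A quick way to see which codes a categorical column really contains, and how often each occurs, is `value_counts()`. The sketch below uses a small made-up table (`toy`, with invented values) rather than the real dataset, so the counts it prints are illustrative only.

```python
import pandas as pd

# Toy stand-in for the real table (values are made up for illustration)
toy = pd.DataFrame({
    "EDUCATION": [1, 2, 0, 3, 2, 0],
    "MARRIAGE":  [1, 0, 2, 2, 1, 0],
})

# Count how often each code appears, so undocumented codes such as 0 stand out
education_counts = toy["EDUCATION"].value_counts().sort_index()
marriage_counts = toy["MARRIAGE"].value_counts().sort_index()

# Rows where either column holds the code 0 that we treat as missing
missing_mask = (toy["EDUCATION"] == 0) | (toy["MARRIAGE"] == 0)
n_missing = int(missing_mask.sum())

print(education_counts)
print(marriage_counts)
print(n_missing)
```

The same two calls can be run on `df["EDUCATION"]` and `df["MARRIAGE"]` to see every code in the real data, not just 0.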

In [None]:
# Let's check the total length of the dataset
df.shape

In [None]:
# Removing 0 valued rows (missing values) from Education and Marriage columns
df_no_missing = df.loc[(df['EDUCATION'] != 0) & (df['MARRIAGE'] != 0)]

In [None]:
df_no_missing.shape

In [None]:
len(df_no_missing)

In [None]:
# Now verifying if Education and Marriage columns still have zero values (missing values)
df_no_missing['EDUCATION'].unique()

In [None]:
df_no_missing['MARRIAGE'].unique()

In [None]:
# Downsampling the dataset
# We will take 1000 rows each of defaulted and not defaulted data, since Support Vector Machines work well on datasets of this size and train slowly on much larger ones

In [None]:
len(df_no_missing)

In [None]:
# Separating the dataset into defaulted and not defaulted data
df_no_default = df_no_missing[df_no_missing['DEFAULT'] == 0]
df_default = df_no_missing[df_no_missing['DEFAULT'] == 1]

print(len(df_default))
print(len(df_no_default))

In [None]:
# Time to downsample to 1000 rows each for the defaulted and not defaulted data (sampling without replacement)
df_no_default_downsampled = resample(df_no_default, replace=False, n_samples = 1000, random_state = 42)
len(df_no_default_downsampled)

In [None]:
df_default_downsampled = resample(df_default, replace=False, n_samples = 1000, random_state = 42)
len(df_default_downsampled)

In [None]:
# Time to merge the two 1000-row samples (not defaulted and defaulted) into one dataset
df_downsample = pd.concat([df_no_default_downsampled, df_default_downsampled])
len(df_downsample)
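
The split, resample, and concatenate steps above can also be wrapped in one helper. This is a minimal sketch under stated assumptions: `downsample_balanced` is a hypothetical function name, and `toy` is made-up data standing in for `df_no_missing`.

```python
import pandas as pd
from sklearn.utils import resample

def downsample_balanced(frame, target, n_per_class, seed=42):
    """Draw n_per_class rows without replacement from every class of `target`."""
    parts = [
        resample(group, replace=False, n_samples=n_per_class, random_state=seed)
        for _, group in frame.groupby(target)
    ]
    return pd.concat(parts)

# Made-up imbalanced data: 14 rows of class 0 and 6 rows of class 1
toy = pd.DataFrame({"LIMIT_BAL": range(20), "DEFAULT": [0] * 14 + [1] * 6})

balanced = downsample_balanced(toy, "DEFAULT", n_per_class=5)
class_counts = balanced["DEFAULT"].value_counts().to_dict()
print(class_counts)
```

Because `replace=False`, no row is drawn twice, so `n_per_class` must not exceed the size of the smallest class.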

In [None]:
# Time to format the data for support vector machine

# Setting up X as independent Variables (Predictors)
X = df_downsample.drop("DEFAULT", axis=1).copy()
X.head()

In [None]:
# Setting up the target variable (y), the value we want to predict from X
y = df_downsample['DEFAULT'].copy()
y.head()

In [None]:
# One-hot encoding for the categorical variables
# Categorical variables: SEX, EDUCATION, MARRIAGE and the PAY_* repayment status columns
X_encoded = pd.get_dummies(X, columns=['SEX', 'EDUCATION', 'MARRIAGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6'])
X_encoded.head(10)
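
To make the effect of `pd.get_dummies` concrete, here is a small sketch on made-up data (`toy` is not part of the dataset). Each code in the encoded column becomes its own indicator column, and columns that are not listed are passed through unchanged.

```python
import pandas as pd

# Made-up rows: MARRIAGE is categorical, AGE is numeric
toy = pd.DataFrame({"MARRIAGE": [1, 2, 3, 2], "AGE": [24, 35, 41, 29]})

# Only the listed column is expanded into one indicator column per code
encoded = pd.get_dummies(toy, columns=["MARRIAGE"])
encoded_columns = encoded.columns.tolist()
print(encoded_columns)

# Every row has exactly one active indicator among the MARRIAGE_* columns
indicators_per_row = encoded[["MARRIAGE_1", "MARRIAGE_2", "MARRIAGE_3"]].sum(axis=1)
print(indicators_per_row.tolist())
```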

In [None]:
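# Split the data, then center and scale it, since the features are on very different scales in the original dataset
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, random_state = 42, test_size = 0.2)
# Fit the scaler on the training set only and reuse it for the test set, so both are scaled with the same mean and standard deviation
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)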

In [None]:
# Building the preliminary Support Vector Machine Model
clf_svm = SVC(random_state=42)
clf_svm.fit(X_train_scaled, y_train)

In [None]:
# Time to build confusion matrix with the actual and predicted results
plot_confusion_matrix(clf_svm, X_test_scaled, y_test, values_format='d', display_labels=["Did Not Default", "Defaulted"])

In [None]:
# Since the preliminary model did not do a great job, let's use cross validation to optimize the parameters
# GridSearchCV() tries every combination of C and gamma with 5-fold cross validation
param_grid = [
              {
                'C': [0.5, 1, 10, 100], # C is the regularization parameter, which must be greater than zero
               'gamma': ['scale', 1, 0.1, 0.01, 0.001, 0.0001],
               'kernel': ['rbf']
              }
]

optimal_params = GridSearchCV(
    SVC(),
    param_grid,
    cv = 5,
    scoring = 'accuracy',
    verbose = 0
)

optimal_params.fit(X_train_scaled, y_train)
print(optimal_params.best_params_)
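
One possible refinement, sketched here on synthetic data from `make_classification` rather than the credit card data, is to put the scaler and the SVM into a single pipeline. The scaler is then refit on the training part of every cross-validation fold. The names `X_demo`, `pipe`, and `search` are illustrative only, and the grid is a shortened version of the one above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the encoded feature matrix and target
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# make_pipeline names each step after its class in lowercase, so SVC becomes "svc"
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = {"svc__C": [0.5, 1, 10], "svc__gamma": ["scale", 0.1, 0.01]}

search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

best = search.best_params_
test_accuracy = search.score(X_te, y_te)  # scored with the refitted best pipeline
print(best, test_accuracy)
```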

In [None]:
# Building support vector machine based on the optimal parameters
clf_svm = SVC(random_state = 42, C = 1, gamma = 0.01, kernel='rbf')
clf_svm.fit(X_train_scaled, y_train)
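
Instead of typing the selected values by hand, the classifier can be built directly from the search result. The sketch below again uses synthetic data, and `search` and `clf_from_search` are illustrative names, so the parameters it picks say nothing about the credit card data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=150, n_features=5, random_state=0)

search = GridSearchCV(
    SVC(),
    {"C": [1, 10], "gamma": ["scale", 0.1], "kernel": ["rbf"]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

# Unpack whatever the search selected into a fresh classifier
clf_from_search = SVC(random_state=42, **search.best_params_)
clf_from_search.fit(X_demo, y_demo)

# With refit=True (the default), best_estimator_ is already refitted on all the data passed to fit()
refitted = search.best_estimator_
print(search.best_params_)
```

Building the model from `best_params_` keeps the notebook consistent if the grid or the data changes and the search picks different values.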

In [None]:
# Let's plot the confusion matrix with the new results
plot_confusion_matrix(clf_svm, X_test_scaled, y_test, values_format='d', display_labels = ['Did Not Default', "Defaulted"])
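
The counts in a confusion matrix can be turned into per-class rates, which makes two models easier to compare than the plots alone. A minimal sketch with made-up labels (`y_true` and `y_pred` below are invented, not model output):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: 0 = did not default, 1 = defaulted
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 1, 0, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

specificity = tn / (tn + fp)   # share of non-defaulters classified correctly
sensitivity = tp / (tp + fn)   # share of defaulters classified correctly
print(tn, fp, fn, tp, specificity, sensitivity)
```

On the real model, the same calculation would use `confusion_matrix(y_test, clf_svm.predict(X_test_scaled))`.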

In [None]:
# Before plotting the decision boundary, let's check how many columns the downsampled dataset has
len(df_downsample.columns)

In [None]:
# Since we have too many columns to plot directly, we use PCA to collapse them into just two columns (the first two principal components)

In [None]:
pca = PCA()
X_train_pca = pca.fit_transform(X_train_scaled)

per_var = np.round(pca.explained_variance_ratio_*100, decimals = 1)
labels = [str(x) for x in range(1, len(per_var)+1)]

plt.bar(x=range(1, len(per_var) + 1), height = per_var)
plt.tick_params(
    axis = 'x',
    which = 'both',
    bottom = False,
    top = False,
    labelbottom = False
)
plt.ylabel('Percentage of Explained Variance')
plt.xlabel('Principal Components')
plt.title('Scree Plot')
plt.show()
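
A scree plot is easier to judge alongside the cumulative explained variance, which tells how much of the total variance the first two components keep. The sketch below runs on synthetic data, so its numbers do not describe the credit card data; `pca_demo` and `cumulative` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled training features
X_demo, _ = make_classification(n_samples=200, n_features=8, random_state=42)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

pca_demo = PCA().fit(X_demo_scaled)

# Running total of the variance share explained by PC1, PC1+PC2, ...
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
first_two = cumulative[1]  # share kept when only PC1 and PC2 are used
print(np.round(cumulative * 100, 1))
```

If the first two components keep only a small share of the variance, a 2D decision surface is a rough picture of the full model rather than a faithful one.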

In [None]:
# Re-tuning the SVM hyperparameters on the first two principal components, since the model for the plot only sees those two columns
train_pc1_coords = X_train_pca[:, 0]
train_pc2_coords = X_train_pca[:, 1]

# pc1 contains x-axis coordinates and pc2 contains y-axis coordinates after performing the PCA on the original data

pca_train_scaled = scale(np.column_stack((train_pc1_coords, train_pc2_coords)))

# time to optimize the svm parameters to fit the x and y-axis coordinates
param_grid = [
              {'C': [1, 10, 100, 1000],
               'gamma': ['scale', 1, 0.1, 0.01, 0.001, 0.0001],
               'kernel': ['rbf']}
]

optimal_params = GridSearchCV(
    SVC(),
    param_grid,
    cv = 5,
    scoring = 'accuracy',
    verbose = 0
)

optimal_params.fit(pca_train_scaled, y_train)
print(optimal_params.best_params_)

In [None]:
clf_svm = SVC(random_state = 42, C = 100, gamma = 0.01, kernel='rbf')
clf_svm.fit(pca_train_scaled, y_train)

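# Project the scaled training data onto the principal components (despite the variable names, the plot below shows the training points, colored by y_train)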
X_test_pca = pca.transform(X_train_scaled)

test_pc1_coords = X_test_pca[:, 0]
test_pc2_coords = X_test_pca[:, 1]

x_min = test_pc1_coords.min() - 1
x_max = test_pc1_coords.max() + 1

y_min = test_pc2_coords.min() - 1
y_max = test_pc2_coords.max() + 1

xx, yy = np.meshgrid(np.arange(start = x_min, stop = x_max, step = 0.1),
                     np.arange(start = y_min, stop = y_max, step = 0.1))

Z = clf_svm.predict(np.column_stack((xx.ravel(), yy.ravel())))
Z = Z.reshape(xx.shape)

fig, ax = plt.subplots(figsize = (10, 10))
ax.contourf(xx, yy, Z, alpha = 0.1) # filled contours shade the two predicted regions

cmap = colors.ListedColormap(['#e41a1c', '#4daf4a'])

scatter = ax.scatter(test_pc1_coords, test_pc2_coords, c = y_train, cmap = cmap, s = 100, edgecolors = 'k', alpha = 0.8)
legend = ax.legend(scatter.legend_elements()[0], 
                   scatter.legend_elements()[1],
                   loc = 'upper right')
legend.get_texts()[0].set_text("No Default")
legend.get_texts()[1].set_text("Yes Default")

ax.set_ylabel("PC2")
ax.set_xlabel("PC1")
ax.set_title("Decision Surface using the PCA transformed / Projected features")
plt.show()
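
For a decision surface to line up with the plotted points, the training points, the held-out points, and the grid all have to pass through the same transformations. One way to guarantee that, sketched on synthetic data with illustrative names (`projector`, `clf_2d`), is to fit the scaler and PCA once on the training set and reuse them everywhere. The `C` and `gamma` values here are placeholders, not tuned results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Scaler and PCA are fitted on the training set only, then reused for the test set
projector = make_pipeline(StandardScaler(), PCA(n_components=2))
train_2d = projector.fit_transform(X_tr)
test_2d = projector.transform(X_te)

# The classifier sees exactly the coordinates that will be plotted
clf_2d = SVC(kernel="rbf", C=1, gamma="scale").fit(train_2d, y_tr)

# Grid covering the held-out points, in the same PC1/PC2 coordinates
xx, yy = np.meshgrid(
    np.arange(test_2d[:, 0].min() - 1, test_2d[:, 0].max() + 1, 0.1),
    np.arange(test_2d[:, 1].min() - 1, test_2d[:, 1].max() + 1, 0.1),
)
Z = clf_2d.predict(np.column_stack((xx.ravel(), yy.ravel()))).reshape(xx.shape)

test_accuracy = clf_2d.score(test_2d, y_te)
print(Z.shape, test_accuracy)
```

`xx`, `yy`, and `Z` can be passed to `ax.contourf`, and `test_2d` with `y_te` to `ax.scatter`, to draw held-out points on the surface.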

