## Support vector machine: Classification example
1) SVM is a supervised learning algorithm
2) It can be used for both classification and regression
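The two uses listed above can be sketched with scikit-learn's `SVC` (classification) and `SVR` (regression). The synthetic data from `make_classification` and `make_regression` is an assumption for illustration only and is unrelated to the loan dataset used below.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.svm import SVC, SVR

# Synthetic stand-in data (illustrative assumption, not the loan dataset)
X_cls, y_cls = make_classification(n_samples=200, n_features=5, random_state=0)
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

# Same family of models, two tasks
clf = SVC(kernel="linear").fit(X_cls, y_cls)  # predicts class labels
reg = SVR(kernel="linear").fit(X_reg, y_reg)  # predicts continuous values

cls_pred = clf.predict(X_cls[:3])
reg_pred = reg.predict(X_reg[:3])
print(cls_pred)
print(reg_pred)
```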
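### When is SVM preferred?
1) The data is not regularly distributed, since SVM makes no assumption about the underlying distribution
2) Overfitting is a concern: margin maximization and the regularization parameter `C` make SVM relatively resistant to overfitting, although a poorly chosen kernel or `C` can still overfit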
### Learn more about SVM
- https://towardsdatascience.com/svm-and-kernel-svm-fed02bef1200
- https://scikit-learn.org/stable/modules/svm.html

### Dataset link (download from Kaggle): https://www.kaggle.com/datasets/ninzaami/loan-predication

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv("8_svm_loan_status.csv")
df.head()

|   | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |


---
### Handling missing values

In [3]:
df.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [4]:
# drop the missing values
df = df.dropna()
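Dropping rows is the simplest option, but it discards every row with at least one missing value. As an alternative, a minimal imputation sketch is shown here on a small hypothetical frame (`toy` and its values are made up for illustration): categorical gaps are filled with the most frequent value and numeric gaps with the median.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the loan data, with one missing value per column
toy = pd.DataFrame(
    {
        "Gender": ["Male", np.nan, "Female", "Male"],
        "LoanAmount": [128.0, np.nan, 66.0, 120.0],
    }
)

filled = toy.copy()
# Categorical column: fill with the most frequent value (mode)
filled["Gender"] = filled["Gender"].fillna(filled["Gender"].mode()[0])
# Numeric column: fill with the median, which is robust to outliers
filled["LoanAmount"] = filled["LoanAmount"].fillna(filled["LoanAmount"].median())
print(filled)
```

Whether imputing helps or hurts depends on the data; this notebook continues with the `dropna()` approach.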

---
### Categorical encoding

In [5]:
df["Dependents"].value_counts()

0     274
2      85
1      80
3+     41
Name: Dependents, dtype: int64

In [6]:
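# Treat the "3+" category as 4, then make the column numeric
df = df.replace(to_replace="3+", value=4)
df["Dependents"] = df["Dependents"].astype(int)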
# df['Married'].value_counts()
# df.info()

In [7]:
# Categorical encoding
df.replace(
    {
        "Married": {"No": 0, "Yes": 1},
        "Gender": {"Male": 1, "Female": 0},
        "Self_Employed": {"No": 0, "Yes": 1},
        "Property_Area": {"Rural": 0, "Semiurban": 1, "Urban": 2},
        "Education": {"Graduate": 1, "Not Graduate": 0},
    },
    inplace=True,
)

In [8]:
df.head()

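|   | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | LP001003 | 1 | 1 | 1 | 1 | 0 | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | 0 | N |
| 2 | LP001005 | 1 | 1 | 0 | 1 | 1 | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | 2 | Y |
| 3 | LP001006 | 1 | 1 | 0 | 0 | 0 | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | 2 | Y |
| 4 | LP001008 | 1 | 0 | 0 | 1 | 0 | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | 2 | Y |
| 5 | LP001011 | 1 | 1 | 2 | 1 | 1 | 5417 | 4196.0 | 267.0 | 360.0 | 1.0 | 2 | Y |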
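Mapping `Property_Area` to 0, 1, 2 implies an ordering (Rural < Semiurban < Urban) that the model will treat as meaningful. If that ordering is not intended, one-hot encoding is a common alternative. The sketch below uses `pd.get_dummies` on a hypothetical toy frame rather than the notebook's `df`.

```python
import pandas as pd

# Hypothetical toy column with the same three categories
toy = pd.DataFrame({"Property_Area": ["Rural", "Semiurban", "Urban", "Urban"]})

# One indicator column per category, with no implied order between them
one_hot = pd.get_dummies(toy, columns=["Property_Area"], dtype=int)
print(one_hot)
```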


---
### Splitting train and test data

In [9]:
# Creating feature set and target
X = df.drop(columns=["Loan_ID", "Loan_Status"])
y = df["Loan_Status"]

In [10]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=2
)
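When the target classes are imbalanced, passing `stratify` to `train_test_split` keeps the class proportions the same in both splits. The following sketch uses made-up labels (70 `Y`, 30 `N`) purely to show the effect; it is not the notebook's data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 70% "Y", 30% "N"
toy_X = pd.DataFrame({"feature": range(100)})
toy_y = pd.Series(["Y"] * 70 + ["N"] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.20, random_state=2, stratify=toy_y
)

# Both splits keep the 70/30 ratio
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```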

---
# SVM Model
## Linear kernel

In [11]:
from sklearn import svm

# Defining model
classifier = svm.SVC(kernel="linear")

# Training
classifier.fit(X_train, y_train)
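SVMs are not scale invariant, so features measured in thousands (such as incomes) can dominate 0/1 indicator features. A minimal sketch of standardizing inside a pipeline follows; the synthetic data and the names `scaled_svm` and `demo_score` are assumptions for illustration, not part of this notebook's workflow.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data; one feature is inflated to mimic an income-like column
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_demo[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2
)

# The scaler is fitted on the training split only, then applied to the test split
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="linear"))
scaled_svm.fit(X_tr, y_tr)
demo_score = scaled_svm.score(X_te, y_te)
print(demo_score)
```

Putting the scaler in the pipeline avoids leaking test-set statistics into training.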

In [12]:
classifier.predict(X_test)

array(['Y', 'N', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y',
       'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y',
       'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y',
       'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'N', 'Y', 'Y', 'Y', 'Y',
       'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'N',
       'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'N', 'Y', 'Y',
       'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'Y', 'Y',
       'Y', 'Y', 'Y', 'N', 'N'], dtype=object)

In [13]:
from sklearn.metrics import accuracy_score

# Train accuracy
train_accuracy = accuracy_score(y_train, classifier.predict(X_train))
train_accuracy

0.8072916666666666

In [14]:
# Test accuracy
test_accuracy = accuracy_score(y_test, classifier.predict(X_test))
test_accuracy

0.6875
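Accuracy alone can hide a model that mostly predicts the majority class, as the predictions above are largely `'Y'`. A confusion matrix and per-class report give a fuller picture. The labels below are hypothetical and chosen only to illustrate the calls.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true labels and predictions skewed toward "Y"
y_true = ["Y", "Y", "N", "N", "Y", "N"]
y_pred = ["Y", "Y", "Y", "Y", "Y", "N"]

# Rows are true classes, columns are predicted classes, in the order of `labels`
cm = confusion_matrix(y_true, y_pred, labels=["N", "Y"])
print(cm)

# Precision, recall, and F1 for each class
print(classification_report(y_true, y_pred, labels=["N", "Y"]))
```

In the notebook, the same calls would take `y_test` and `classifier.predict(X_test)`.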

---
## Polynomial Kernel

In [15]:
poly = svm.SVC(kernel="poly")
poly.fit(X_train, y_train)

In [16]:
# Train accuracy
train_accuracy = accuracy_score(y_train, poly.predict(X_train))
print("Train Score: ", train_accuracy)
# Test accuracy
test_accuracy = accuracy_score(y_test, poly.predict(X_test))
print("Test Score: ", test_accuracy)

Train Score:  0.71875
Test Score:  0.6041666666666666
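`SVC(kernel="poly")` uses `degree=3` and `coef0=0.0` by default, and both can change the fit noticeably. A rough sketch of comparing degrees with cross-validation, on synthetic data assumed for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (illustrative assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

degree_scores = {}
for degree in (2, 3, 4):
    # coef0 shifts the kernel so lower-order terms also contribute
    model = SVC(kernel="poly", degree=degree, coef0=1.0)
    # Mean accuracy over 5 cross-validation folds
    degree_scores[degree] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

print(degree_scores)
```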


---
## RBF Kernel

In [17]:
rbf = svm.SVC(kernel="rbf")
rbf.fit(X_train, y_train)

In [18]:
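# Train accuracy
train_accuracy = accuracy_score(y_train, rbf.predict(X_train))
print("Train Score: ", train_accuracy)
# Test accuracy (evaluate the RBF model, not the polynomial one)
test_accuracy = accuracy_score(y_test, rbf.predict(X_test))
print("Test Score: ", test_accuracy)

Train Score:  0.7239583333333334

The test score recorded for this cell, 0.6041666666666666, was computed with `poly.predict(X_test)` and so repeats the polynomial-kernel result; rerun the cell to obtain the RBF test accuracy.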
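Rather than fitting each kernel by hand, the kernels and the regularization strength `C` can be compared in one cross-validated search. This is a sketch under assumptions: the data is synthetic, and the step names `scale` and `svc` and the grid values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (illustrative assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Scale first, then fit the SVM; step names are used as parameter prefixes
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])
param_grid = {
    "svc__kernel": ["linear", "poly", "rbf"],
    "svc__C": [0.1, 1, 10],
}

# 5-fold cross-validation over every kernel/C combination
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

The selected model is then available as `search.best_estimator_` for evaluation on a held-out test set.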
