## SVM - Theory: Important Points
SVM (Support Vector Machine) is a supervised machine learning algorithm used for both classification and regression, though it is mostly preferred for classification.

Given a dataset, the algorithm tries to separate the data using a hyperplane and then makes predictions based on which side of it a point falls. In its basic form, SVM is a non-probabilistic linear classifier: while many other classifiers predict the probability that a data point belongs to one group or another, SVM directly states which group the data point belongs to, without any probability calculation.
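As a small illustration of the point above, the sketch below fits scikit-learn's `SVC` on a made-up two-cluster toy set (the data and the names `X_toy`, `y_toy`, `new_points` are assumptions, not part of this notebook). It shows that `predict` returns hard labels, while `decision_function` returns a signed score rather than a probability.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (hypothetical, for illustration only)
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X_toy, y_toy)

new_points = np.array([[0.5, 0.5], [3.5, 3.5]])
labels = clf.predict(new_points)            # hard class labels
scores = clf.decision_function(new_points)  # signed scores, not probabilities

print(labels)
print(scores)
```

The sign of each score says which side of the hyperplane the point falls on. scikit-learn can also produce probability estimates with `SVC(probability=True)`, but that is an extra calibration step fitted on top of the SVM rather than part of the classifier itself.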

How does it work?

SVM constructs the best line or decision boundary, called a hyperplane, which can be used for classification, regression, or outlier detection. The dimension of the hyperplane depends on the number of features: with 2 input features the hyperplane is just a line, and with 3 input features it becomes a two-dimensional plane.

Two margin lines run parallel to this hyperplane, one on each side, at some distance from it so that the data points can be classified distinctly. The distance between the 2 margin lines is called the marginal distance.

These 2 margin lines pass through the nearest +ve points and the nearest -ve points. The points through which the margin lines pass are called support vectors. Support vectors are important because they determine the position of the margin lines, and therefore the maximum marginal distance.
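The sketch below makes the margin and support vectors concrete on a tiny, made-up separable dataset (the data and variable names are assumptions, not from this notebook). For a linear SVM with weight vector `w`, the marginal distance is `2 / ||w||`, and the fitted model exposes the support vectors directly.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (hypothetical, for illustration only)
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard margin on separable data
clf = SVC(kernel="linear", C=1000.0).fit(X_toy, y_toy)

w = clf.coef_[0]                              # normal vector of the hyperplane
marginal_distance = 2.0 / np.linalg.norm(w)   # distance between the 2 margin lines

print(clf.support_vectors_)  # the points the margin lines pass through
print(marginal_distance)
```

Only a subset of the training points ends up as support vectors; removing any of the other points would leave the fitted boundary unchanged.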

### What sets SVM apart?
What sets SVM apart is its ability to soft classify (in addition to hard classifying) and the availability of kernels that handle non-linear classification for nested (or otherwise linearly inseparable) data. Its guiding principle is to achieve a clearer separation of the classes (a wider-margin decision boundary), even at the cost of a few misclassifications.

##### SVM separates data points that belong to different classes with a decision boundary. When determining the decision boundary, a soft margin SVM (soft margin means allowing some data points to be misclassified) tries to solve an optimization problem with the following goals:

##### 1) Increase the distance of the decision boundary to the classes (or support vectors)
##### 2) Maximize the number of points that are correctly classified in the training set
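The two goals above are usually written as a single optimization problem. The formulation below is the standard textbook soft-margin objective, not something taken from this notebook: the symbols $w$, $b$ and the slack variables $\xi_i$ are introduced here for illustration, labels are assumed to be coded as $y_i \in \{-1, +1\}$, and $C$ is the penalty hyperparameter.

```latex
\min_{w,\,b,\,\xi}\quad \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\qquad \text{subject to}\qquad
y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,n
```

Minimizing $\lVert w\rVert^{2}$ widens the margin (goal 1), since the marginal distance is $2/\lVert w\rVert$, while the term $C\sum_i \xi_i$ penalizes points that fall inside the margin or on the wrong side of it (goal 2).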

## Hyperparameters
- Hyperparameters are the settings we choose ourselves rather than ones learned from the data; finding good values for them is up to us.
- The 2 hyperparameters tuned here for SVM:
    - C: Penalty parameter. Controls how strongly missed or misclassified training points are penalised.
    - Gamma: Additional parameter for non-linear SVM (for example the RBF and polynomial kernels).
- If C is large, misclassifications are penalised heavily, so the margin around the decision boundary is small and fewer training points are misclassified (with a higher risk of overfitting). If C is small, the margin is large and more misclassifications are tolerated. There will usually be some points that are misclassified either way.
- Gamma is used for non-linear models. If gamma is high, each training point has only a short range of influence, so only points that are close together are grouped together. If gamma is low, the influence reaches farther, so points that are farther apart can also be grouped together.
- Since a linear SVM is all about drawing two boundary lines separating the classes, with no such grouping involved, gamma isn't used for a linear SVM.
- Grid search CV and random search CV are mechanisms for tuning any model that has hyperparameters: they help us find good values for them.
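The effect of C and gamma described above can be checked on synthetic data. The sketch below uses two overlapping Gaussian clusters generated for illustration (the data, the helper names `margin_width` and `train_accuracy`, and the specific C and gamma values are all assumptions, not from this notebook).

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping clusters (synthetic, for illustration only)
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
                    rng.normal(2.0, 1.0, size=(100, 2))])
y_demo = np.array([0] * 100 + [1] * 100)

def margin_width(C):
    """Marginal distance 2 / ||w|| of a linear SVM fitted with penalty C."""
    clf = SVC(kernel="linear", C=C).fit(X_demo, y_demo)
    return 2.0 / np.linalg.norm(clf.coef_[0])

def train_accuracy(gamma):
    """Training accuracy of an RBF SVM fitted with the given gamma."""
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X_demo, y_demo)
    return clf.score(X_demo, y_demo)

margin_small_C = margin_width(0.01)    # weak penalty: wide margin expected
margin_large_C = margin_width(100.0)   # strong penalty: narrow margin expected
acc_low_gamma = train_accuracy(0.01)   # smooth, almost linear boundary
acc_high_gamma = train_accuracy(100.0) # very local boundary that hugs the training points

print(margin_small_C, margin_large_C)
print(acc_low_gamma, acc_high_gamma)
```

A high training accuracy at a high gamma is not necessarily good news: it usually signals overfitting, which is why these values are chosen by cross-validation rather than by training score.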

In [3]:
from processing_data import *

In [4]:
import pickle
with open('SVM_CT.pkl','rb') as f:
    pre = pickle.load(f)

In [5]:
with open('Processed_data.pkl','rb') as f:
    df = pickle.load(f)

In [6]:
pre

In [8]:
df

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status (Approved)
0,LP001002,Male,No,0,Graduate,No,5849,0.0,146.412162,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.000000,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.000000,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.000000,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.000000,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,LP002978,Female,No,0,Graduate,No,2900,0.0,71.000000,360.0,1.0,Rural,Y
610,LP002979,Male,Yes,3+,Graduate,No,4106,0.0,40.000000,180.0,1.0,Rural,Y
611,LP002983,Male,Yes,1,Graduate,No,8072,240.0,253.000000,360.0,1.0,Urban,Y
612,LP002984,Male,Yes,2,Graduate,No,7583,0.0,187.000000,360.0,1.0,Urban,Y


In [9]:
#Feature Target Separation
x = df.drop('Loan_Status (Approved)',axis=1)
y = df['Loan_Status (Approved)'].map({"N":0,"Y":1})

In [10]:
# Splitting the data into train and test
from sklearn.model_selection import train_test_split
x_train_raw,x_test_raw,y_train,y_test = train_test_split(x,y,test_size = 0.2, random_state = 42)

In [18]:
# Fit the preprocessor (pipeline) on the training data only, then apply the same fitted transform to the test data
x_train=pre.fit_transform(x_train_raw)
x_test=pre.transform(x_test_raw)
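Why `fit_transform` on the training set but only `transform` on the test set? The sketch below uses `StandardScaler` as a stand-in for the `pre` object (whose contents are not shown in this notebook, so this is an assumption) on made-up numbers: the statistics are learned from the training data alone and then reused unchanged for the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data (hypothetical, for illustration only)
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[10.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the train statistics

print(scaler.mean_)   # mean learned from the training data
print(test_scaled)    # test value expressed in training-set units
```

Fitting on the test set as well would let information from the test data leak into the preprocessing step and make the evaluation look better than it should.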

In [19]:
x_train

array([[1., 0., 0., ..., 1., 0., 1.],
       [1., 0., 0., ..., 1., 0., 1.],
       [1., 0., 0., ..., 1., 2., 1.],
       ...,
       [1., 0., 0., ..., 1., 0., 1.],
       [1., 0., 0., ..., 1., 0., 1.],
       [1., 0., 0., ..., 1., 0., 1.]])

In [20]:
x_train.shape

(491, 15)

In [21]:
y_train

83     0
90     1
227    1
482    1
464    0
      ..
71     1
106    1
270    1
435    1
102    1
Name: Loan_Status (Approved), Length: 491, dtype: int64

In [22]:
y_train.value_counts()

Loan_Status (Approved)
1    342
0    149
Name: count, dtype: int64

In [89]:
#!pip install imbalanced-learn  # the package that provides the imblearn module

In [25]:
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter

# Oversampling using RandomOverSampler
oversample = RandomOverSampler(sampling_strategy='minority')
x_over, y_over = oversample.fit_resample(x_train, y_train)
print("Oversampled class distribution:", Counter(y_over))
 
 
# Undersampling using RandomUnderSampler
undersample = RandomUnderSampler(sampling_strategy='majority')
x_under, y_under = undersample.fit_resample(x_train, y_train)
print("Undersampled class distribution:", Counter(y_under))

Oversampled class distribution: Counter({0: 342, 1: 342})
Undersampled class distribution: Counter({0: 149, 1: 149})
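Conceptually, random oversampling just duplicates randomly chosen minority rows until the classes are balanced. The NumPy sketch below reproduces that idea by hand; it is not imbalanced-learn's implementation. The class counts come from `y_train.value_counts()` above, while the feature values and the names `X_demo`, `y_demo` are placeholders.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(42)
y_demo = np.array([1] * 342 + [0] * 149)     # class counts from y_train.value_counts()
X_demo = rng.normal(size=(len(y_demo), 3))   # placeholder features (hypothetical)

# Draw minority rows with replacement until both classes have the same count
minority_idx = np.flatnonzero(y_demo == 0)
n_extra = int((y_demo == 1).sum()) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

X_over = np.vstack([X_demo, X_demo[extra_idx]])
y_over = np.concatenate([y_demo, y_demo[extra_idx]])

print(Counter(y_over.tolist()))
```

Random undersampling is the mirror image: it drops randomly chosen majority rows instead, which balances the classes at the cost of discarding data.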


In [26]:
from imblearn.over_sampling import SMOTE
x_smote,y_smote = SMOTE().fit_resample(x_train,y_train)
print("SMOTE class distribution:", Counter(y_smote))

SMOTE class distribution: Counter({0: 342, 1: 342})


In [27]:
from sklearn.svm import SVC  # Support Vector Classifier
svclassifier = SVC()  # base model with default parameters
svclassifier

In [28]:
svclassifier.fit(x_smote,y_smote)  ## Fit the SVC to the resampled training data

In [29]:
import sklearn
sklearn.set_config(print_changed_only=False)
svclassifier

In [30]:
# Getting predictions from model
y_pred=svclassifier.predict(x_test)

In [31]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
acc = accuracy_score(y_test,y_pred)
pr = precision_score(y_test,y_pred)
re = recall_score(y_test,y_pred)
f1 = f1_score(y_test,y_pred)
cm = confusion_matrix(y_test,y_pred)
print('Accuracy Score: ',acc,'\nPrecision:', pr,'\nRecall Score:', re,'\nF1 Score', f1, '\nConfusion Matrix: \n', cm)
print(classification_report(y_test,y_pred))

Accuracy Score:  0.7642276422764228 
Precision: 0.7524752475247525 
Recall Score: 0.95 
F1 Score 0.8397790055248618 
Confusion Matrix: 
 [[18 25]
 [ 4 76]]
              precision    recall  f1-score   support

           0       0.82      0.42      0.55        43
           1       0.75      0.95      0.84        80

    accuracy                           0.76       123
   macro avg       0.79      0.68      0.70       123
weighted avg       0.78      0.76      0.74       123
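The scores above can be recomputed by hand from the confusion matrix, which helps when reading it. scikit-learn lays the matrix out with actual classes as rows and predicted classes as columns, so for labels 0 and 1 it reads `[[TN, FP], [FN, TP]]`. A short check using the numbers printed above:

```python
# Confusion matrix from the output above: rows = actual, columns = predicted
tn, fp = 18, 25
fn, tp = 4, 76

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted approvals, how many were right
recall = tp / (tp + fn)      # of the actual approvals, how many were found
f1 = 2 * precision * recall / (precision + recall)
recall_class_0 = tn / (tn + fp)  # recall for the rejected class

print(accuracy, precision, recall, f1, recall_class_0)
```

The last value, about 0.42, is the class 0 recall in the classification report: the model finds most approvals but misses more than half of the rejections.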



In [34]:
from sklearn.svm import SVC  # Support Vector Classifier
svclassifier = SVC(random_state=42)  # base model with default parameters
svclassifier.fit(x_train,y_train)  # fit on the original (not resampled) training data
# Getting predictions from model
y_pred=svclassifier.predict(x_test)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
acc = accuracy_score(y_test,y_pred)
pr = precision_score(y_test,y_pred)
re = recall_score(y_test,y_pred)
f1 = f1_score(y_test,y_pred)
cm = confusion_matrix(y_test,y_pred)
print('Accuracy Score: ',acc,'\nPrecision:', pr,'\nRecall Score:', re,'\nF1 Score', f1, '\nConfusion Matrix: \n', cm)
print(classification_report(y_test,y_pred))

Accuracy Score:  0.6504065040650406 
Precision: 0.6504065040650406 
Recall Score: 1.0 
F1 Score 0.7881773399014779 
Confusion Matrix: 
 [[ 0 43]
 [ 0 80]]
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        43
           1       0.65      1.00      0.79        80

    accuracy                           0.65       123
   macro avg       0.33      0.50      0.39       123
weighted avg       0.42      0.65      0.51       123



In [109]:
df.columns

Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area',
       'Loan_Status (Approved)'],
      dtype='object')

In [36]:
#Cross validation

from sklearn.model_selection import cross_val_score
scores=cross_val_score(SVC(),x_train,y_train,cv=10,scoring='f1',n_jobs=-1)
print(scores)
print('Cross_validation_score:', scores.mean())
print('STD:', scores.std())
# Rule of thumb used here: a standard deviation below 0.05 across folds suggests stable scores

[0.82352941 0.81927711 0.81927711 0.82926829 0.81927711 0.82926829
 0.82926829 0.82926829 0.81927711 0.83333333]
Cross_validation_score: 0.8251044349564687
STD: 0.005247639934545074
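A useful sanity check for an F1 score on imbalanced data is to compare it with a trivial baseline. The arithmetic below computes the F1 that a classifier would get on the training set by always predicting class 1, using the class counts from `y_train.value_counts()`; the variable names are introduced here for illustration.

```python
n_pos, n_neg = 342, 149  # class counts from y_train.value_counts()

# Always predicting class 1: every positive is found, every negative is a false positive
tp, fp, fn = n_pos, n_neg, 0
baseline_precision = tp / (tp + fp)
baseline_recall = tp / (tp + fn)
baseline_f1 = (2 * baseline_precision * baseline_recall
               / (baseline_precision + baseline_recall))

print(round(baseline_f1, 4))
```

The baseline comes out at roughly 0.82, very close to the cross-validated mean above. That suggests the low standard deviation may reflect a model that predicts the majority class almost everywhere rather than one that has learned a stable boundary, so the fold scores are worth reading together with a confusion matrix.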


In [37]:
from sklearn.model_selection import GridSearchCV
#Define a parameter grid (note: gamma is ignored by the 'linear' kernel)
param_grid = {'C':[0.1,0.5,1,10,200],'gamma':[1,0.1,0.31,0.001],'kernel':['poly','linear','rbf']}
grid = GridSearchCV(SVC(random_state=42),param_grid,scoring='f1',cv=5)
grid.fit(x_smote,y_smote)

In [27]:
5 * 4 * 3 * 5  # C values * gamma values * kernels * cv folds = total model fits

300
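The count above can be reproduced with scikit-learn's `ParameterGrid`, which enumerates the same combinations that `GridSearchCV` tries. The sketch also shows one possible refinement, offered as a suggestion rather than what this notebook does: because gamma has no effect on the linear kernel, passing a list of grids avoids fitting redundant linear models. The name `split_grid` is hypothetical.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'C': [0.1, 0.5, 1, 10, 200],
              'gamma': [1, 0.1, 0.31, 0.001],
              'kernel': ['poly', 'linear', 'rbf']}

n_candidates = len(ParameterGrid(param_grid))  # 5 * 4 * 3 combinations
n_fits = n_candidates * 5                      # each combination is fitted once per fold (cv=5)

# Alternative: search gamma only for the kernels that actually use it
split_grid = [
    {'kernel': ['linear'], 'C': [0.1, 0.5, 1, 10, 200]},
    {'kernel': ['poly', 'rbf'], 'C': [0.1, 0.5, 1, 10, 200],
     'gamma': [1, 0.1, 0.31, 0.001]},
]
n_candidates_split = len(ParameterGrid(split_grid))

print(n_candidates, n_fits, n_candidates_split)
```

`GridSearchCV` accepts a list of dictionaries like `split_grid` directly as its `param_grid` argument.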

In [118]:
grid.best_params_

{'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
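To avoid copying the tuned values by hand, the dictionary returned by `grid.best_params_` can be unpacked straight into a new `SVC`. The sketch below hard-codes the dictionary printed above so that it runs on its own; `tuned_model` is a name introduced here for illustration.

```python
from sklearn.svm import SVC

# Copied from the grid.best_params_ output above
best_params = {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}

# Unpacking the dictionary keeps the model in sync with the search result
tuned_model = SVC(**best_params, random_state=42)

print(tuned_model.get_params()['C'], tuned_model.get_params()['gamma'])
```

With the default `refit=True`, the fitted search object also exposes `grid.best_estimator_`, which is already trained on the full data passed to `grid.fit` using these parameters.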

In [119]:
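# Note: this cell uses gamma=1, while grid.best_params_ above reports gamma=0.1,
# so the metrics below are for gamma=1 rather than for the tuned setting.
model=SVC(C=10,gamma=1,random_state=42)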
model.fit(x_smote,y_smote)
y_pred = model.predict(x_test)

acc = accuracy_score(y_test,y_pred)
pr = precision_score(y_test,y_pred)
re = recall_score(y_test,y_pred)
f1 = f1_score(y_test,y_pred)
cm = confusion_matrix(y_test,y_pred)
print('Accuracy Score: ',acc,'\nPrecision:', pr,'\nRecall Score:', re,'\nF1 Score', f1, '\nConfusion Matrix: \n', cm)
print(classification_report(y_test,y_pred))


Accuracy Score:  0.6747967479674797 
Precision: 0.717391304347826 
Recall Score: 0.825 
F1 Score 0.7674418604651163 
Confusion Matrix: 
 [[17 26]
 [14 66]]
              precision    recall  f1-score   support

           0       0.55      0.40      0.46        43
           1       0.72      0.82      0.77        80

    accuracy                           0.67       123
   macro avg       0.63      0.61      0.61       123
weighted avg       0.66      0.67      0.66       123

