<h2 style="text-align:center;">Support Vector Machine with Cross Validation</h2>

---

## 🔹 Introduction

A Support Vector Machine (SVM) is a classification algorithm that handles linearly separable data and, through kernels such as RBF, data that is not linearly separable.  
However, if we split the data into training and test sets only once, the measured performance can depend heavily on that particular split.  

**Cross Validation (CV)** addresses this problem by evaluating the model on multiple train/validation splits and averaging the results.  
Averaging reduces the variance of the performance estimate, so the reported score is less tied to one lucky or unlucky split.
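
To make the idea concrete, here is a minimal sketch of k-fold cross validation written as an explicit loop. It uses synthetic data from `make_classification` as a stand-in (an assumption for illustration, not the notebook's dataset), and the names `X_demo`, `y_demo`, `fold_scores`, and `auto_scores` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: NOT the Social_Network_Ads dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

# Split the data into 5 folds; each fold serves once as the validation set
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kf.split(X_demo):
    model = SVC(kernel="rbf")
    model.fit(X_demo[train_idx], y_demo[train_idx])                      # train on k-1 folds
    fold_scores.append(model.score(X_demo[val_idx], y_demo[val_idx]))   # score on the held-out fold
fold_scores = np.array(fold_scores)

# cross_val_score runs the same loop when given the same splitter
auto_scores = cross_val_score(SVC(kernel="rbf"), X_demo, y_demo, cv=kf)

print("Manual fold accuracies:", fold_scores)
print("Mean of manual folds:  ", fold_scores.mean())
```

Note that when `cv` is passed as an integer for a classifier, scikit-learn uses stratified folds without shuffling, so the fold boundaries differ from the shuffled `KFold` used in this sketch.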

In this notebook:
1. Train a baseline SVM on the `Social_Network_Ads` dataset.
2. Apply k-Fold Cross Validation.
3. Compare results (single split vs. CV).
4. Conclude why CV is essential.


In [1]:
# 📌 Importing required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score


In [2]:
# 📌 Load dataset
dataset = pd.read_csv("../data/Social_Network_Ads.csv")

# Features (Age, EstimatedSalary) and Target (Purchased)
X = dataset.iloc[:, 2:4].values
y = dataset.iloc[:, -1].values

print("Dataset shape:", dataset.shape)   # ➤ (400, 5)
dataset.head()                           # ➤ First 5 rows


Dataset shape: (400, 5)


|   | User ID | Gender | Age | EstimatedSalary | Purchased |
|---|---|---|---|---|---|
| 0 | 15624510 | Male | 19 | 19000 | 0 |
| 1 | 15810944 | Male | 35 | 20000 | 0 |
| 2 | 15668575 | Female | 26 | 43000 | 0 |
| 3 | 15603246 | Female | 27 | 57000 | 0 |
| 4 | 15804002 | Male | 19 | 76000 | 0 |


In [3]:
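# 📌 Feature scaling (important for SVM: the RBF kernel depends on distances between points)
# Note: fitting the scaler on all rows before splitting lets test-set statistics leak into training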
scaler = StandardScaler()
X = scaler.fit_transform(X)
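
One way to keep test-set statistics out of the scaling step is to wrap the scaler and the classifier in a scikit-learn pipeline, so the scaler is re-fit on the training portion of every split. The sketch below uses synthetic data as a stand-in (an assumption, not the notebook's dataset); `X_raw`, `y_raw`, `pipe`, and the other names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the unscaled feature matrix (assumption)
X_raw, y_raw = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.25, random_state=0
)

# The pipeline fits StandardScaler only on the data passed to fit()
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Inside cross_val_score the scaler is re-fit on each training fold
cv_scores = cross_val_score(pipe, X_tr, y_tr, cv=10)

# Final fit on the full training set, then a single evaluation on the test set
pipe.fit(X_tr, y_tr)
test_acc = pipe.score(X_te, y_te)

print("CV mean accuracy:", cv_scores.mean())
print("Test accuracy:   ", test_acc)
```

With this arrangement the numbers may differ slightly from those obtained by scaling the full dataset up front, because the mean and standard deviation are computed from training rows only.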


In [4]:
# 📌 Split dataset into Training and Test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print("Training set size:", X_train.shape)   # ➤ (300, 2)
print("Test set size:", X_test.shape)        # ➤ (100, 2)


Training set size: (300, 2)
Test set size: (100, 2)


In [5]:
# 📌 Train a baseline SVM model
classifier = SVC(kernel="rbf", random_state=0)
classifier.fit(X_train, y_train)

# Predictions on Test set
y_pred = classifier.predict(X_test)

# Confusion Matrix and Accuracy
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)

print("Confusion Matrix:\n", cm)
print("Accuracy (Single Train/Test split):", acc)   # ➤ Accuracy example: 0.93


Confusion Matrix:
 [[64  4]
 [ 3 29]]
Accuracy (Single Train/Test split): 0.93
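
As a quick check, the reported accuracy can be recomputed from the confusion matrix above. The sketch assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and also derives precision and recall for the positive class, which the notebook does not report; `cm_reported` is a hypothetical name.

```python
import numpy as np

# Confusion matrix copied from the cell output above
cm_reported = np.array([[64, 4],
                        [3, 29]])

# scikit-learn layout for binary labels: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = cm_reported.ravel()

accuracy = (tp + tn) / cm_reported.sum()   # (29 + 64) / 100 = 0.93
precision = tp / (tp + fp)                 # 29 / 33
recall = tp / (tp + fn)                    # 29 / 32

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
```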


In [6]:
# 📌 Apply k-Fold Cross Validation
accuracies = cross_val_score(
    estimator=classifier, X=X_train, y=y_train, cv=10
)

print("Cross-Validation Accuracies:", accuracies)           # ➤ List of accuracies
print("Mean Accuracy: {:.2f} %".format(accuracies.mean()*100))   # ➤ Avg CV accuracy
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))   # ➤ Spread of accuracy across folds


Cross-Validation Accuracies: [0.8        0.96666667 0.8        0.96666667 0.86666667 0.83333333
 0.9        0.93333333 1.         0.93333333]
Mean Accuracy: 90.00 %
Standard Deviation: 6.83 %
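
The summary statistics can be reproduced directly from the ten fold accuracies printed above. This small sketch also shows the sample standard deviation (`ddof=1`) for comparison, since NumPy's `.std()` defaults to the population form (`ddof=0`); `fold_acc` is a hypothetical name.

```python
import numpy as np

# Fold accuracies copied from the output above
fold_acc = np.array([0.8, 0.96666667, 0.8, 0.96666667, 0.86666667,
                     0.83333333, 0.9, 0.93333333, 1.0, 0.93333333])

mean_acc = fold_acc.mean()
std_pop = fold_acc.std()            # ddof=0, what the notebook prints
std_sample = fold_acc.std(ddof=1)   # ddof=1, slightly larger for 10 folds

print("Mean accuracy: {:.2f} %".format(mean_acc * 100))
print("Std (ddof=0):  {:.2f} %".format(std_pop * 100))
print("Std (ddof=1):  {:.2f} %".format(std_sample * 100))
```

The individual folds range from 80% to 100%, which illustrates how much a single split of this size can move the score.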


## 🔹 Summary

- A single train/test split gave accuracy = **93%**.
- Using **10-Fold Cross Validation**, we got:
  - Mean accuracy = **90.00%**
  - Standard deviation = **6.83%**

✅ Cross Validation provides a **more stable estimate** of model performance.  
✅ It reduces dependency on one random split.  
✅ It is standard practice before hyperparameter tuning (GridSearchCV / RandomizedSearchCV).  
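
As a pointer to that next step, here is a minimal sketch of tuning an SVM with `GridSearchCV`, which runs cross validation for every parameter combination. The data is synthetic and the grid values are illustrative assumptions, not recommendations for the Social_Network_Ads dataset; `X_demo`, `y_demo`, `pipe`, and `search` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

# Scaling lives inside the pipeline so it is re-fit on each training fold
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# Illustrative grid; step-name prefixes route parameters to the SVC step
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.1, 1.0],
}

# 5-fold CV is run for each of the 9 parameter combinations
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", search.best_score_)
```

Because the best score is selected from many candidates, it tends to be optimistic; a held-out test set is still needed for the final estimate.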
