# SVM & Naive Bayes


1.  What is a Support Vector Machine (SVM), and how does it work?

   -> A Support Vector Machine (SVM) is a supervised machine learning algorithm used mainly for classification, and also for regression, where it is known as Support Vector Regression (SVR).

It works by finding the optimal decision boundary (called a hyperplane) that separates data points of different classes with the maximum margin. The training points that lie closest to this boundary are the support vectors, and they alone determine where the hyperplane sits.
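As a minimal sketch of the idea, the snippet below evaluates a fixed hyperplane on a few 2-D points and computes the margin width. The weight vector `w`, the bias `b`, and the points are made-up values for illustration, not the result of training.

```python
import math

# Hypothetical hyperplane w·x + b = 0 (illustrative values, not a trained model).
w = [1.0, 1.0]
b = -3.0

def decision_value(x):
    """Signed score w·x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(x):
    return 1 if decision_value(x) >= 0 else -1

# For a canonical hyperplane (|w·x + b| = 1 on the support vectors),
# the margin width is 2 / ||w||.
margin_width = 2.0 / math.sqrt(sum(wi * wi for wi in w))

points = [[1.0, 1.0], [2.0, 2.0], [0.5, 1.5]]
labels = [predict(p) for p in points]
```

Maximizing the margin is equivalent to minimizing the norm of `w`, which is why a smaller `w` corresponds to a wider margin in this formulation.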


2.  Explain the difference between Hard Margin and Soft Margin SVM.

    
   -> Hard Margin SVM

Definition:
The classifier tries to find a hyperplane that perfectly separates the data, with no misclassifications allowed.

Conditions:

Works only when the dataset is linearly separable (classes can be perfectly divided by a straight line or hyperplane).

Assumes no noise or outliers.

Soft Margin SVM

Definition:
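Allows some margin violations and misclassifications by introducing slack variables, with a penalty parameter C that controls how heavily those violations are penalized.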

Idea:

Tries to maximize the margin while allowing some violations.

This makes SVM more flexible and better for noisy, real-world data.
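The trade-off can be sketched numerically. The snippet below computes the slack of each point and the usual soft-margin objective for one fixed hyperplane. The values of `w`, `b`, `C`, and the data are assumptions chosen for illustration only.

```python
# Illustrative values only: w, b, C, and the data are assumptions for this sketch.
w = [1.0, 1.0]
b = -3.0
C = 1.0

X = [[1.0, 1.0], [2.0, 2.0], [2.0, 1.5], [1.0, 1.5]]
y = [-1, 1, -1, 1]  # the last two points are misclassified by this hyperplane

def score(x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Slack xi_i = max(0, 1 - y_i * (w·x_i + b)); it is zero when the point
# lies on the correct side of the margin.
slacks = [max(0.0, 1.0 - yi * score(xi)) for xi, yi in zip(X, y)]

# Soft-margin objective: 0.5 * ||w||^2 + C * sum(slacks).
objective = 0.5 * sum(wi * wi for wi in w) + C * sum(slacks)
```

A hard-margin SVM would require every slack to be zero, so it has no feasible solution for data like this. A larger `C` pushes the soft-margin solution toward that behaviour, while a smaller `C` tolerates more violations in exchange for a wider margin.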


3. What is the Kernel Trick in SVM? Give one example of a kernel and
explain its use case.

  -> Kernel Trick in SVM

Problem:
Many datasets are not linearly separable in their original space (you can’t just draw a straight line/hyperplane to separate the classes).

Solution (Kernel Trick):
Instead of manually mapping data into a higher-dimensional space (which can be computationally expensive), SVM uses a kernel function.

A kernel implicitly computes the similarity (dot product) between data points in some higher-dimensional feature space.

This allows SVM to find a linear separator in that higher-dimensional space without ever computing the transformation explicitly.
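To make the "implicit dot product" concrete, here is a small check using a degree-2 polynomial kernel. This kernel is swapped in for illustration because its feature map is small enough to write out by hand; the function names and sample points are hypothetical.

```python
import math

def poly_kernel(x, z):
    """Degree-2 homogeneous polynomial kernel: K(x, z) = (x·z)^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit feature map for that kernel in 2-D: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return [x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2]

x, z = [1.0, 2.0], [3.0, 0.5]

# Kernel evaluated in the original 2-D space.
implicit = poly_kernel(x, z)

# Same quantity computed by mapping to 3-D first and taking a dot product.
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
```

Both values agree, but the kernel never builds the 3-D vectors. For kernels whose feature space is very large or infinite-dimensional, the explicit route is not available at all, which is the practical point of the trick.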

Example: Radial Basis Function (RBF) Kernel

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right)$$

Intuition:

Measures similarity between points based on distance.

If two points are very close, the kernel value ≈ 1 (high similarity).

If far apart, kernel value → 0 (low similarity).
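The behaviour described above can be checked directly from the formula. This is a minimal sketch; `gamma=0.5` and the sample points are arbitrary choices.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2); gamma=0.5 is an arbitrary choice here."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

same = rbf_kernel([1.0, 2.0], [1.0, 2.0])  # identical points
near = rbf_kernel([1.0, 2.0], [1.1, 2.0])  # small distance
far = rbf_kernel([1.0, 2.0], [6.0, 9.0])   # large distance
```

Here `same` is exactly 1, `near` is just below 1, and `far` is vanishingly small. A larger `gamma` makes the similarity fall off faster with distance, so each training point influences a smaller neighbourhood.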

Use Case:

Useful for datasets where decision boundaries are non-linear and highly curved.

Example: Handwritten digit recognition (like MNIST). Digits such as “8” and “3” overlap in raw pixel space, but with an RBF kernel the SVM implicitly works in a feature space where they are easier to separate.


4. What is a Naïve Bayes Classifier, and why is it called “naïve”?

  -> A Naïve Bayes Classifier is a probabilistic supervised learning algorithm based on Bayes’ Theorem.
It is commonly used for classification tasks (e.g., spam detection, sentiment analysis, medical diagnosis).

Bayes’ Theorem:

$$P(Y \mid X) = \frac{P(X \mid Y) \cdot P(Y)}{P(X)}$$

$P(Y \mid X)$: Probability of class $Y$ given features $X$ (posterior).

$P(X \mid Y)$: Probability of features given the class (likelihood).

$P(Y)$: Prior probability of the class.

$P(X)$: Evidence (normalizing factor).

The classifier predicts the class with the highest posterior probability.

It assumes that all features are conditionally independent of each other given the class label, so the likelihood factorizes as $P(X \mid Y) = \prod_i P(x_i \mid Y)$.

In real life, features are often correlated (e.g., in text classification, words often co-occur).

But the algorithm ignores these dependencies, hence the term “naïve.”
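A worked example may help here. The sketch below applies Bayes' theorem with the independence assumption to a toy spam problem. All probabilities are made-up numbers, not estimates from real data.

```python
# Made-up probabilities for a toy spam example (assumptions, not estimates from data).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.8, "win": 0.6},
    "ham": {"free": 0.1, "win": 0.05},
}
observed_words = ["free", "win"]

# Naive assumption: P(X | Y) is the product of the per-feature likelihoods.
unnormalized = {}
for label, prior in priors.items():
    score = prior
    for word in observed_words:
        score *= likelihoods[label][word]
    unnormalized[label] = score

evidence = sum(unnormalized.values())  # P(X), the normalizing factor
posteriors = {label: s / evidence for label, s in unnormalized.items()}
prediction = max(posteriors, key=posteriors.get)
```

Because the evidence is the same for every class, the predicted class can be read off the unnormalized scores alone. Dividing by `evidence` only matters when calibrated probabilities are needed.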


5. Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants.
When would you use each one?

  -> Gaussian Naïve Bayes

Assumption:
The features are continuous and follow a Gaussian (normal) distribution within each class.

Likelihood formula:
For feature $x$ given class $y$:

$$P(x \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x - \mu_y)^2}{2\sigma_y^2}\right)$$

where $\mu_y$ and $\sigma_y^2$ are the mean and variance of feature values for class $y$.

Use Cases:

When data has continuous values.

Examples:

Medical data (blood pressure, cholesterol levels).

Iris dataset (flower petal and sepal measurements).
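The Gaussian likelihood can be evaluated by hand for a single feature. In this sketch the measurements are hypothetical petal lengths for one class, and the variance is the plain maximum-likelihood estimate.

```python
import math

def gaussian_likelihood(x, mean, var):
    """Density of x under a normal distribution with the class mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Hypothetical petal lengths (cm) for one class, used to estimate mu_y and sigma_y^2.
values = [1.4, 1.3, 1.5, 1.4, 1.6]
mu = sum(values) / len(values)
var = sum((v - mu) ** 2 for v in values) / len(values)

at_mean = gaussian_likelihood(mu, mu, var)    # density at the class mean
far_away = gaussian_likelihood(4.5, mu, var)  # density far from the class mean
```

These values are densities rather than probabilities, so they can exceed 1 when the variance is small. Only their relative size across classes affects the prediction.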

Multinomial Naïve Bayes

Assumption:
Features are discrete counts (e.g., number of times a word appears in a document).

Likelihood formula:
Probability of a document $d$ given class $c$:

$$P(d \mid c) = \frac{\left(\sum_i x_i\right)!}{\prod_i (x_i!)} \prod_i \theta_{ci}^{x_i}$$

where $x_i$ = count of feature $i$ (e.g., word occurrences), and $\theta_{ci}$ = probability of feature $i$ in class $c$.

Use Cases:

Text classification (spam filtering, sentiment analysis).

Document categorization (news classification).

Bag-of-words or term frequency features.
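The multinomial likelihood can be computed directly for a short document. The three-word vocabulary and the per-class word probabilities below are hypothetical values chosen for illustration.

```python
import math

def multinomial_likelihood(counts, theta):
    """P(d | c) = (sum x_i)! / prod(x_i!) * prod(theta_ci ** x_i)."""
    coefficient = math.factorial(sum(counts))
    for x in counts:
        coefficient //= math.factorial(x)
    prob = 1.0
    for x, t in zip(counts, theta):
        prob *= t ** x
    return coefficient * prob

# Hypothetical vocabulary ["free", "meeting", "report"] and per-class word probabilities.
theta_spam = [0.7, 0.1, 0.2]
theta_ham = [0.1, 0.5, 0.4]
doc_counts = [2, 0, 1]  # "free" twice, "report" once

p_spam = multinomial_likelihood(doc_counts, theta_spam)
p_ham = multinomial_likelihood(doc_counts, theta_ham)
```

The factorial coefficient is identical for every class, so it cancels when classes are compared. Practical implementations typically drop it and work with sums of log probabilities to avoid underflow on long documents.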

Bernoulli Naïve Bayes

Assumption:
Features are binary/boolean (present or absent, 0 or 1).

Likelihood formula:
For feature $x_i$ in document $d$:

$$P(x_i \mid y) = \theta_{yi}^{x_i} (1 - \theta_{yi})^{1 - x_i}$$

where $\theta_{yi}$ = probability of feature $i$ being present in class $y$.

Use Cases:

Binary text classification (word present or not).

Example: Spam classification based on presence/absence of words like “free,” “win,” “urgent.”

Useful when only presence/absence matters, not frequency.
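A short sketch of the Bernoulli likelihood follows, multiplied across features under the naive independence assumption. The presence probabilities are hypothetical.

```python
def bernoulli_likelihood(x, theta):
    """Product over i of theta_yi ** x_i * (1 - theta_yi) ** (1 - x_i)."""
    prob = 1.0
    for xi, ti in zip(x, theta):
        prob *= ti if xi == 1 else (1.0 - ti)
    return prob

# Hypothetical presence probabilities for the words ["free", "win", "urgent"].
theta_spam = [0.8, 0.6, 0.5]
theta_ham = [0.1, 0.05, 0.2]
email = [1, 0, 1]  # contains "free" and "urgent", but not "win"

p_spam = bernoulli_likelihood(email, theta_spam)
p_ham = bernoulli_likelihood(email, theta_ham)
```

Unlike the multinomial variant, an absent word contributes a factor of `1 - theta`, so the model explicitly penalizes a class for words it expects but does not see.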

6. Write a Python program to:

● Load the Iris dataset

● Train an SVM Classifier with a linear kernel

● Print the model's accuracy and support vectors.

In [1]:
# Import required libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = datasets.load_iris()
X = iris.data   # Features
y = iris.target # Target labels

# 2. Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train an SVM classifier with a linear kernel
svm_clf = SVC(kernel='linear')
svm_clf.fit(X_train, y_train)

# 4. Make predictions
y_pred = svm_clf.predict(X_test)

# 5. Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# 6. Print support vectors
print("\nSupport Vectors:")
print(svm_clf.support_vectors_)


Model Accuracy: 1.0

Support Vectors:
[[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.5 5.  1.9]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]


7. Write a Python program to:

● Load the Breast Cancer dataset

● Train a Gaussian Naïve Bayes model

● Print its classification report including precision, recall, and F1-score.


In [2]:
# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# 1. Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data   # Features
y = data.target # Labels

# 2. Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train a Gaussian Naïve Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# 4. Make predictions
y_pred = gnb.predict(X_test)

# 5. Print classification report
print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=data.target_names))


Classification Report:

              precision    recall  f1-score   support

   malignant       1.00      0.93      0.96        43
      benign       0.96      1.00      0.98        71

    accuracy                           0.97       114
   macro avg       0.98      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



8. Write a Python program to:

● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best C and gamma.

● Print the best hyperparameters and accuracy.

In [3]:
# Import required libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 1. Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# 2. Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Define parameter grid for C and gamma
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf']  # Using RBF kernel
}

# 4. Perform GridSearchCV
grid_search = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# 5. Best model
best_model = grid_search.best_estimator_

# 6. Make predictions and evaluate accuracy
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# 7. Print results
print("Best Hyperparameters:", grid_search.best_params_)
print("Test Set Accuracy:", accuracy)


Best Hyperparameters: {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
Test Set Accuracy: 0.8333333333333334
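The grid search above feeds unscaled features to an RBF SVM. The RBF kernel depends on distances, and the Wine features span very different ranges, so large-valued features such as proline can dominate. One possible variant, sketched below, puts a `StandardScaler` in a `Pipeline` so that scaling is re-fitted inside every cross-validation fold. The step names `scaler` and `svc` are arbitrary, and no particular accuracy is claimed for this variant.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling lives inside the pipeline, so each CV fold is scaled
# using statistics from its own training part only.
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# Pipeline parameters are addressed as <step name>__<parameter>.
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

best_params = grid.best_params_
test_accuracy = grid.score(X_test, y_test)
print("Best Hyperparameters:", best_params)
print("Test Set Accuracy:", test_accuracy)
```

Comparing the printed accuracy with the unscaled result above shows how much of the gap is due to feature scale rather than the choice of C and gamma.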


9. Write a Python program to:

● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using sklearn.datasets.fetch_20newsgroups).

● Print the model's ROC-AUC score for its predictions.

In [4]:
# Import required libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

# 1. Load a subset of the 20 Newsgroups dataset (binary classification for ROC-AUC)
categories = ['sci.space', 'rec.sport.baseball']  # two categories for binary task
newsgroups = fetch_20newsgroups(subset='all', categories=categories)

X = newsgroups.data   # text documents
y = newsgroups.target # labels (0/1)

# 2. Convert text to numerical features using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_tfidf = vectorizer.fit_transform(X)

# 3. Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, y, test_size=0.2, random_state=42
)

# 4. Train a Multinomial Naïve Bayes classifier
nb = MultinomialNB()
nb.fit(X_train, y_train)

# 5. Predict probabilities for ROC-AUC
y_proba = nb.predict_proba(X_test)[:, 1]

# 6. Compute ROC-AUC score
roc_auc = roc_auc_score(y_test, y_proba)
print("ROC-AUC Score:", roc_auc)


ROC-AUC Score: 1.0
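The cell above fits the TF-IDF vectorizer on all documents before splitting, so vocabulary and IDF statistics from the test documents leak into training. A minimal sketch of a leakage-free setup is shown below, using a pipeline that fits the vectorizer on training text only. The tiny corpus is made up so the example runs offline; it stands in for the two newsgroup categories.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus standing in for the two newsgroups (0 = baseball, 1 = space).
train_docs = [
    "the pitcher threw a fastball in the ninth inning",
    "home run wins the baseball game",
    "the shuttle launch reached orbit",
    "nasa plans a new lunar orbit mission",
]
train_labels = [0, 0, 1, 1]
test_docs = ["the batter hit a home run", "the rocket reached lunar orbit"]
test_labels = [0, 1]

# The vectorizer is fitted inside the pipeline on the training documents only.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_docs, train_labels)

# Column 1 holds the predicted probability of class 1.
y_proba = model.predict_proba(test_docs)[:, 1]
roc_auc = roc_auc_score(test_labels, y_proba)
print("ROC-AUC Score:", roc_auc)
```

With the real dataset, the same pattern applies: split the raw documents first, then call `fit` on the training texts and `predict_proba` on the test texts.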


10. Imagine you’re working as a data scientist for a company that handles
email communications.
Your task is to automatically classify emails as Spam or Not Spam. The emails may
contain:

● Text with diverse vocabulary

● Potential class imbalance (far more legitimate emails than spam)

● Some incomplete or missing data

Explain the approach you would take to:

● Preprocess the data (e.g. text vectorization, handling missing data)

● Choose and justify an appropriate model (SVM vs. Naïve Bayes)

● Address class imbalance

● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.


   -> 1. Preprocessing the Data

Since emails are text-heavy and may have missing values, preprocessing is critical.

Steps:

Handling Missing Data:

If the body/subject is missing → treat as empty string.

If metadata (sender, timestamp) is missing → impute with placeholders or drop if not critical.

Text Cleaning:

Lowercasing, removing HTML tags, punctuation, and stopwords.

Tokenization and stemming/lemmatization (optional; mainly reduces vocabulary size).

Text Vectorization:

Use TF-IDF Vectorizer → captures word importance relative to documents.

Optionally include n-grams (e.g., bi-grams) to capture common spam phrases (“free money,” “click here”).

Limit vocabulary size to manage dimensionality.
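The missing-data and cleaning steps above can be sketched with the standard library alone. The function name `clean_email` and the sample inputs are hypothetical. Stopword removal, n-grams, and the vocabulary limit are left to the vectorizer.

```python
import html
import re

def clean_email(subject, body):
    """Combine subject and body into one lowercase, tag-free string.

    Missing fields (None) are treated as empty strings.
    """
    text = f"{subject or ''} {body or ''}"
    text = html.unescape(text)                # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)      # drop HTML tags
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # drop punctuation and symbols
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

cleaned = clean_email("FREE Money!!!", "<p>Click <b>here</b> &amp; win</p>")
missing = clean_email(None, None)
```

The cleaned strings could then be passed to a vectorizer such as `TfidfVectorizer(stop_words="english", ngram_range=(1, 2), max_features=...)`, with the vocabulary limit chosen to suit the data.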

2. Choosing the Model: SVM vs. Naïve Bayes

Naïve Bayes (Multinomial/Bernoulli)

Pros: Very fast, works well with text (word counts, presence/absence).

Cons: Assumes feature independence (not always true), performance can plateau with complex patterns.

SVM (with linear or RBF kernel)

Pros: Handles high-dimensional sparse text features well, often achieves higher accuracy.

Cons: Slower to train on very large datasets compared to Naïve Bayes.

Choice:

Start with Multinomial Naïve Bayes (baseline, fast).

Then move to a Linear SVM for production if resources allow, since it often provides a better precision-recall balance on spam tasks. Confirm this on a held-out set before switching.

3. Addressing Class Imbalance

In email datasets, legitimate emails (ham) usually far outnumber spam.

Strategies:

Class Weights: In SVM, set class_weight="balanced" so that errors on the minority class (spam) are penalized more heavily, in inverse proportion to class frequency.

Resampling:

Oversample spam (e.g., SMOTE).

Undersample ham (carefully, to avoid losing valuable data).

Threshold Tuning: Adjust decision threshold to increase recall for spam.
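Two of these strategies are easy to illustrate without a model. The sketch below reproduces the `n_samples / (n_classes * count)` weighting heuristic used by scikit-learn's `class_weight="balanced"` and shows the effect of lowering a decision threshold. The label counts and probabilities are made-up values.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights n_samples / (n_classes * count) for each class."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical imbalanced label set: 90 ham (0) and 10 spam (1).
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)

# Threshold tuning: lowering the cutoff on predicted spam probability
# flags more emails as spam.
spam_probabilities = [0.05, 0.35, 0.45, 0.80]  # made-up model outputs
flagged_default = [p >= 0.5 for p in spam_probabilities]
flagged_lowered = [p >= 0.3 for p in spam_probabilities]
```

Lowering the threshold raises recall on spam at the cost of more false positives, so the cutoff is best chosen on a validation set against the precision the business can tolerate.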

4. Evaluating Performance

Accuracy alone is misleading for imbalanced data.

Metrics to use:

Precision: Of all predicted spam, how many are truly spam? (Avoids mislabeling legitimate emails).

Recall (Sensitivity): Of all true spam, how many did we catch? (Avoids missing spam).

F1-Score: Balance between precision and recall.

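ROC-AUC / PR-AUC: Threshold-independent summaries. PR-AUC is usually the more informative of the two when spam is rare, because it focuses on the minority class.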

Confusion Matrix: To see false positives (bad for user trust) vs. false negatives (spam leakage).
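A small example shows why accuracy alone misleads. The sketch below computes precision, recall, and F1 for the spam class from raw labels. The labels are made up, and the "predict ham for everything" baseline is a deliberately bad model.

```python
def spam_metrics(y_true, y_pred):
    """Precision, recall, and F1 for the spam class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: 8 ham (0) and 2 spam (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# Baseline that predicts "ham" for every email.
all_ham = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, all_ham)) / len(y_true)
precision, recall, f1 = spam_metrics(y_true, all_ham)
```

The baseline reaches 80% accuracy while catching no spam at all, which is exactly the failure that recall and F1 expose.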

5. Business Impact

Reduced Risk: Prevents phishing/malicious spam from reaching users → increased security.

Improved Productivity: Less time wasted on junk emails → higher efficiency.

Better Customer Trust: Users rely on the system to protect them from scams.

Cost Savings: Automating spam detection reduces need for manual filtering.