Question 1: What is a Support Vector Machine (SVM), and how does it work?
- A Support Vector Machine (SVM) is a supervised learning algorithm, used mainly for classification (and also for regression), that works by finding the optimal boundary to separate different classes of data.

How it works:

- It tries to find a decision boundary (a line or hyperplane) that has the maximum possible margin from the nearest data points of any class.

- These closest data points are called support vectors—they are the only ones that define the position of the boundary.

- For data that isn't linearly separable, it uses the "kernel trick" to project the data into a higher dimension where finding a clear separating boundary becomes possible.
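The points above can be illustrated with a minimal sketch. It assumes a hypothetical six-point toy dataset (not from the assignment) and uses a very large C to approximate a hard margin. The margin width is computed as 2 / ||w||, where w is the learned weight vector.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two linearly separable clusters in 2-D.
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
              [4.0, 4.0], [4.0, 5.0], [5.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard margin (violations are heavily penalized).
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]            # normal vector of the separating hyperplane
b = clf.intercept_[0]       # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)

print("Support vectors:\n", clf.support_vectors_)
print("Margin width:", margin_width)
```

Only the points nearest the boundary appear in support_vectors_; moving any of the other points (without crossing the margin) would leave the boundary unchanged.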

Question 2: Explain the difference between Hard Margin and Soft Margin SVM.

Hard Margin SVM:

- Aims to find a hyperplane that perfectly separates two classes with no misclassifications.
- Assumes data is linearly separable without noise or outliers.
- Maximizes the margin (distance between hyperplane and nearest data points, called support vectors).
- Inflexible: fails if data is not perfectly separable.
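- Sensitive to outliers: even when the data is separable, a single noisy point near the boundary can shrink the margin and lead to overfitting.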

Soft Margin SVM:

- Allows some misclassifications to achieve better generalization, especially for noisy or non-linearly separable data.
- Introduces slack variables (one per data point) that permit points to lie inside the margin or on the wrong side of the hyperplane.
- Balances margin maximization with a penalty for misclassifications (controlled by parameter C).
- More robust: works with real-world, noisy, or overlapping data.
- Lower C emphasizes larger margin; higher C emphasizes fewer misclassifications.
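The effect of C can be sketched as follows. This is an illustration only: the overlapping data is generated with make_blobs using hypothetical centers and spread, and the two C values are arbitrary choices for contrast.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical two-class data; cluster_std is chosen so the classes overlap.
X, y = make_blobs(n_samples=200, centers=[[0, 0], [2, 2]],
                  cluster_std=1.5, random_state=0)

results = {}
for C in (0.01, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin_width = 2.0 / np.linalg.norm(clf.coef_[0])
    n_sv = int(clf.n_support_.sum())
    results[C] = (margin_width, n_sv)
    print(f"C={C}: margin width={margin_width:.3f}, support vectors={n_sv}")
```

The run with the smaller C should report the wider margin, since it tolerates more margin violations in exchange for a smaller ||w||.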

Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and
explain its use case.
- The Kernel Trick is a mathematical technique that allows an SVM to find a nonlinear decision boundary without explicitly transforming the data into a higher-dimensional space, which would be computationally expensive.

It works by using a kernel function to compute, directly from pairs of points in the original feature space, the dot product (similarity) those points would have in the higher-dimensional space. This lets the SVM behave as if it were operating in that high-dimensional space while all calculations remain efficient in the original one.

Example Kernel: Radial Basis Function (RBF) Kernel

- What it does: The RBF kernel scores the similarity of two points as exp(-gamma * ||x - x'||^2), so nearby points score close to 1 and distant points close to 0. It can create complex, non-linear boundaries.

- Use Case: It is the most common general-purpose kernel. Use it when there is no obvious linear separation between classes, like for complex, overlapping datasets (e.g., medical image classification or handwritten digit recognition). It's a good default choice when you're unsure which kernel to use.
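A small numeric check can make the kernel trick concrete. The sketch below uses two hypothetical points and a degree-2 polynomial kernel (chosen because its feature map is small enough to write out, unlike the RBF kernel's infinite-dimensional one). It then evaluates the RBF formula by hand and compares it with scikit-learn's rbf_kernel.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two hypothetical 2-D points.
x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

def phi(v):
    """Explicit feature map for the degree-2 polynomial kernel K(x, z) = (x . z)^2."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

explicit = phi(x) @ phi(z)      # dot product computed in the 3-D feature space
kernel = (x @ z) ** 2           # same value computed in the original 2-D space
print(explicit, kernel)

# RBF kernel: exp(-gamma * ||x - z||^2), compared with scikit-learn's version.
gamma = 0.5
manual_rbf = np.exp(-gamma * np.sum((x - z) ** 2))
library_rbf = rbf_kernel(x.reshape(1, -1), z.reshape(1, -1), gamma=gamma)[0, 0]
print(manual_rbf, library_rbf)
```

Both polynomial values agree, which is the point of the trick: the kernel gives the high-dimensional dot product without ever building phi(x).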

Question 4: What is a Naïve Bayes Classifier, and why is it called “naïve”?
- A Naïve Bayes classifier is a simple probabilistic algorithm based on applying Bayes’ Theorem with a strong assumption: it assumes that every feature used to predict the class is independent of all the others, given the class.

- It's called "naïve" because this assumption of feature independence is very rarely true in real-world data. For example, in classifying an email as spam, the presence of the word "win" and the word "prize" are not independent—they often appear together. The algorithm "naively" assumes they are unrelated to simplify the calculation.
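To show what the independence assumption buys, here is a minimal hand-rolled sketch of the calculation. All probabilities are hypothetical numbers made up for illustration; the function name naive_bayes_posterior is likewise an assumption, not part of any library.

```python
# Hypothetical class priors and per-word likelihoods (illustrative values only).
prior = {"spam": 0.3, "ham": 0.7}
likelihood = {
    "spam": {"win": 0.40, "prize": 0.30},
    "ham": {"win": 0.02, "prize": 0.01},
}

def naive_bayes_posterior(words):
    """Return P(class | words), treating words as independent given the class."""
    scores = {}
    for label in prior:
        score = prior[label]
        for word in words:
            # The "naive" step: multiply per-word likelihoods as if unrelated.
            score *= likelihood[label][word]
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

posterior = naive_bayes_posterior(["win", "prize"])
print(posterior)
```

Without the assumption, the model would need the joint probability of every word combination per class, which is impractical to estimate for a realistic vocabulary.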

Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants.
When would you use each one?
1. Gaussian Naïve Bayes

Description: Assumes that continuous numerical features follow a normal (Gaussian) distribution.

Use Case: Use this when your features are continuous values (e.g., height, weight, temperature, income). It's common for general numerical data.

2. Multinomial Naïve Bayes

Description: Designed for discrete counts, especially frequency counts. It uses the frequency of events (like word counts) for classification.

Use Case: Ideal for text classification where features are word counts (e.g., document categorization, spam filtering). It is designed for integer counts, although fractional features such as TF-IDF scores often work well in practice.

3. Bernoulli Naïve Bayes

Description: Designed for binary/boolean features. It models features that are either 1 (present) or 0 (absent).

Use Case: Use for text classification where features indicate only the presence or absence of a word (ignoring frequency), or for any dataset with on/off features.
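The three variants can be compared side by side in a short sketch. The data below is hypothetical (four made-up height/weight rows and four made-up documents), chosen only so that each variant receives the kind of feature it expects.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Gaussian NB: continuous features (hypothetical height in cm, weight in kg).
X_cont = np.array([[150.0, 50.0], [155.0, 55.0], [180.0, 80.0], [185.0, 90.0]])
y_cont = np.array([0, 0, 1, 1])
gnb = GaussianNB().fit(X_cont, y_cont)

# Hypothetical mini corpus: 1 = spam, 0 = ham.
docs = ["win prize win money", "free prize now",
        "meeting at noon", "project meeting notes"]
labels = np.array([1, 1, 0, 0])

# Multinomial NB: word counts (how often each word occurs).
count_vec = CountVectorizer()
mnb = MultinomialNB().fit(count_vec.fit_transform(docs), labels)

# Bernoulli NB: binary indicators (whether each word occurs at all).
binary_vec = CountVectorizer(binary=True)
bnb = BernoulliNB().fit(binary_vec.fit_transform(docs), labels)

gnb_pred = gnb.predict([[152.0, 52.0]])
mnb_pred = mnb.predict(count_vec.transform(["win a free prize"]))
bnb_pred = bnb.predict(binary_vec.transform(["meeting notes"]))
print(gnb_pred, mnb_pred, bnb_pred)
```

The only difference between the Multinomial and Bernoulli inputs here is binary=True, which discards how many times a word appears.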



In [2]:
# Question 6: Write a Python program to:
# ● Load the Iris dataset
# ● Train an SVM Classifier with a linear kernel
# ● Print the model's accuracy and support vectors.

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train, y_train)

y_pred = svm_classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)
print("Number of support vectors for each class:", svm_classifier.n_support_)
print("Total number of support vectors:", len(svm_classifier.support_vectors_))

Model Accuracy: 1.0
Number of support vectors for each class: [ 3 11 10]
Total number of support vectors: 24


In [3]:
# Question 7: Write a Python program to:
# ● Load the Breast Cancer dataset
# ● Train a Gaussian Naïve Bayes model
# ● Print its classification report including precision, recall, and F1-score.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

data = load_breast_cancer()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = GaussianNB()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred, target_names=data.target_names))



              precision    recall  f1-score   support

   malignant       0.93      0.90      0.92        63
      benign       0.95      0.96      0.95       108

    accuracy                           0.94       171
   macro avg       0.94      0.93      0.94       171
weighted avg       0.94      0.94      0.94       171



In [4]:
# Question 8: Write a Python program to:
# ● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best
# C and gamma.
# ● Print the best hyperparameters and accuracy.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

data = load_wine()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['rbf']
}

grid = GridSearchCV(SVC(), param_grid, refit=True, cv=5)
grid.fit(X_train, y_train)

best_params = grid.best_params_
best_accuracy = grid.best_score_

y_pred = grid.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)

print("Best Hyperparameters:", best_params)
print("Best Cross-validation Accuracy:", best_accuracy)
print("Test Accuracy:", test_accuracy)

Best Hyperparameters: {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
Best Cross-validation Accuracy: 0.6946666666666667
Test Accuracy: 0.7777777777777778
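
The cross-validation accuracy of about 0.69 above is likely held back by unscaled features: the Wine features span very different ranges, and the RBF kernel is distance-based. The sketch below is one possible variation, not the original solution. It assumes a StandardScaler placed in a pipeline ahead of the SVC, with the same C and gamma grid (the svc__ prefix is the step name that make_pipeline generates).

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scaling lives inside the pipeline so each CV fold is scaled using
# statistics from its own training portion only (no leakage).
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [1, 0.1, 0.01, 0.001],
}

scaled_grid = GridSearchCV(pipeline, param_grid, cv=5)
scaled_grid.fit(X_train, y_train)

scaled_test_accuracy = scaled_grid.score(X_test, y_test)
print("Best Hyperparameters:", scaled_grid.best_params_)
print("Best Cross-validation Accuracy:", scaled_grid.best_score_)
print("Test Accuracy:", scaled_test_accuracy)
```

Comparing these numbers with the unscaled run above shows how much of the gap is due to feature scaling rather than the choice of C and gamma.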


In [5]:
# Question 9: Write a Python program to:
# ● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
# sklearn.datasets.fetch_20newsgroups).
# ● Print the model's ROC-AUC score for its predictions.

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelBinarizer

categories = ['sci.space', 'comp.graphics']
newsgroups = fetch_20newsgroups(subset='all', categories=categories, shuffle=True, random_state=42)

X = newsgroups.data
y = newsgroups.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

model = MultinomialNB()
model.fit(X_train_tfidf, y_train)

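# Predicted probability of each class; column 1 corresponds to label 1
y_pred_proba = model.predict_proba(X_test_tfidf)

# Binarize the labels (for two classes this yields a single 0/1 column)
lb = LabelBinarizer()
y_test_bin = lb.fit_transform(y_test)

# ROC-AUC is computed from the predicted probabilities, not hard predictions
roc_auc = roc_auc_score(y_test_bin, y_pred_proba[:, 1])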

print("ROC-AUC Score:", roc_auc)

ROC-AUC Score: 0.9961588300629396


In [8]:
# Question 10: Imagine you’re working as a data scientist for a company that handles
# email communications.
# Your task is to automatically classify emails as Spam or Not Spam. The emails may
# contain:
# ● Text with diverse vocabulary
# ● Potential class imbalance (far more legitimate emails than spam)
# ● Some incomplete or missing data
# Explain the approach you would take to:
# ● Preprocess the data (e.g. text vectorization, handling missing data)
# ● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
# ● Address class imbalance
# ● Evaluate the performance of your solution with suitable metrics
# And explain the business impact of your solution.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, roc_auc_score

data = {
    'text': [
        'Win a free iPhone', 'Meeting at 10am', 'Claim your prize now',
        'Lunch at office', 'Get free lottery ticket', 'Project deadline tomorrow',
        'Congratulations! You won', 'Dinner at 8pm'
    ],
    'label': ['spam', 'ham', 'spam', 'ham', 'spam', 'ham', 'spam', 'ham']
}
df = pd.DataFrame(data)

# Handle missing values
df['text'] = df['text'].fillna('')

# Split data (50% so both classes appear in test)
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.5, random_state=42, stratify=df['label'])

# TF-IDF Vectorization
tfidf = TfidfVectorizer()
X_train_vec = tfidf.fit_transform(X_train)
X_test_vec = tfidf.transform(X_test)

# Train Naive Bayes
model = MultinomialNB()
model.fit(X_train_vec, y_train)

# Predict
y_pred = model.predict(X_test_vec)

# Evaluation
print("Classification Report:")
print(classification_report(y_test, y_pred, zero_division=0))

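# Note: ROC-AUC here is computed from hard spam/ham predictions. It is normally
# computed from scores such as model.predict_proba(X_test_vec)[:, 1], which
# rank the emails instead of only labeling them.
print("ROC-AUC:", roc_auc_score((y_test=='spam').astype(int), (y_pred=='spam').astype(int)))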




Classification Report:
              precision    recall  f1-score   support

         ham       0.50      1.00      0.67         2
        spam       0.00      0.00      0.00         2

    accuracy                           0.50         4
   macro avg       0.25      0.50      0.33         4
weighted avg       0.25      0.50      0.33         4

ROC-AUC: 0.5


Question 10: Answer

1. Preprocessing the Data

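Text Cleaning: Lowercase the text, remove punctuation and stopwords, and apply stemming or lemmatization.

Vectorization: Use TF-IDF. With a diverse vocabulary it is usually preferable to plain Bag of Words because it down-weights words that appear in most emails and emphasizes the distinctive ones.

Handle Missing Data: Fill missing email text with an empty string, and drop rows whose labels are missing.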

2. Choosing Model: SVM vs Naïve Bayes

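Naïve Bayes: Fast to train and works well on text. Although the word-independence assumption rarely holds, it is usually accurate enough for classifying sparse, high-dimensional word features.

SVM: Handles high-dimensional data well, but is slower to train and needs hyperparameter tuning.
✔ Here, Multinomial Naïve Bayes is a good choice for text classification: it is fast, simple, and effective on word-frequency features.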

3. Handling Class Imbalance

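Use SMOTE (Synthetic Minority Over-sampling Technique) or class weights (for example, class_weight in scikit-learn's SVM classifiers, or sample weights for Naïve Bayes).

Alternatively, undersample the majority class or oversample the minority class.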
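
A minimal sketch of the class-weight option, assuming a hypothetical eight-email corpus with six legitimate emails and two spam. SMOTE is not shown because it lives in the separate imbalanced-learn package. MultinomialNB has no class_weight parameter, so balanced sample weights are passed to fit instead.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.utils.class_weight import compute_sample_weight

# Hypothetical imbalanced corpus: 6 ham, 2 spam.
texts = [
    "Meeting at 10am", "Lunch at office", "Project deadline tomorrow",
    "Dinner at 8pm", "Quarterly report attached", "Call me after the meeting",
    "Win a free iPhone", "Claim your prize now",
]
labels = np.array(["ham"] * 6 + ["spam"] * 2)

X = TfidfVectorizer().fit_transform(texts)

# "balanced" weights each sample by n_samples / (n_classes * class_count),
# so both classes contribute the same total weight.
weights = compute_sample_weight(class_weight="balanced", y=labels)
nb = MultinomialNB().fit(X, labels, sample_weight=weights)

# SVM classifiers accept the same idea directly through class_weight.
svm = LinearSVC(class_weight="balanced").fit(X, labels)

print("Sample weights:", weights)
print("NB class priors:", np.exp(nb.class_log_prior_))
```

Each spam email here receives three times the weight of a legitimate one, which offsets the 6-to-2 imbalance.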

4. Evaluation Metrics

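Accuracy is misleading on imbalanced data (a model that labels every email as Not Spam would still score high) → use:

Precision & Recall (spam detection needs high recall to catch spam, and high precision so that legitimate emails are not flagged)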

F1-score

ROC-AUC
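
The metrics listed above can be computed as in the following sketch. The six labels and spam scores are hypothetical, and the 0.5 threshold is an arbitrary choice. The last two lines contrast ROC-AUC computed from scores with ROC-AUC computed from thresholded labels.

```python
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

# Hypothetical ground truth (1 = spam) and predicted spam probabilities.
y_true = [0, 0, 0, 0, 1, 1]
spam_scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.9]

# Hard predictions at a 0.5 threshold: one true positive, one false
# positive, and one false negative.
y_pred = [int(score >= 0.5) for score in spam_scores]

precision = precision_score(y_true, y_pred)   # 0.5
recall = recall_score(y_true, y_pred)         # 0.5
f1 = f1_score(y_true, y_pred)                 # 0.5

# ROC-AUC should be computed from scores: it measures how well spam is
# ranked above ham across all thresholds.
auc_from_scores = roc_auc_score(y_true, spam_scores)   # 0.875
# Passing hard labels instead collapses the curve to a single threshold.
auc_from_labels = roc_auc_score(y_true, y_pred)        # 0.625

print(precision, recall, f1, auc_from_scores, auc_from_labels)
```

The gap between the two AUC values shows why probabilities (for example from predict_proba) are preferred over predicted labels when reporting ROC-AUC.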

5. Business Impact

Reduce spam → better user experience, security

Improved productivity (less manual filtering)

Maintain brand trust (avoid phishing/spam reaching users)