#### 1. What is a Support Vector Machine (SVM), and how does it work?
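A linear SVM scores a point with `w · x + b` and places the boundary so that the margin `2 / ||w||` between the classes is as wide as possible. The sketch below is one way to inspect this in scikit-learn; the choice of the Breast Cancer dataset, the scaling step, and `C=1.0` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Binary problem, standardized so the solver converges quickly
X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# For a linear kernel the learned hyperplane is exposed as coef_ and intercept_
w = clf.coef_[0]
b = clf.intercept_[0]

# Recompute the decision scores by hand: w . x + b
manual_scores = X @ w + b

# Distance between the two margin hyperplanes
margin_width = 2.0 / np.linalg.norm(w)

print("Margin width:", margin_width)
print("Number of support vectors:", len(clf.support_vectors_))
```

Only the support vectors determine `w` and `b`; removing any other training point would leave the boundary unchanged.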

#### 2. Explain the difference between Hard Margin and Soft Margin SVM.
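scikit-learn's `SVC` has no separate hard-margin mode; the penalty `C` controls how strictly margin violations are punished. A very small `C` gives a soft, wide margin, while a large `C` approaches hard-margin behaviour. The following is a minimal sketch on synthetic blobs; the dataset and the two `C` values are assumptions chosen only to make the contrast visible.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, so a perfect separation is not possible
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)

# Small C: violations are cheap, the margin is wide, many points become support vectors
soft = SVC(kernel="linear", C=0.001).fit(X, y)

# Large C: violations are expensive, the margin is narrow
hard_like = SVC(kernel="linear", C=100).fit(X, y)

n_sv_soft = int(soft.n_support_.sum())
n_sv_hard = int(hard_like.n_support_.sum())

print("Support vectors with C=0.001:", n_sv_soft)
print("Support vectors with C=100:  ", n_sv_hard)
```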

#### 3. What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.
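The RBF kernel is a common example: it lets the SVM separate classes that no straight line can split. The sketch below compares a linear and an RBF kernel on concentric circles; the dataset (`make_circles`) and its parameters are assumptions picked for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# One class forms an inner circle, the other an outer ring
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A linear kernel can only draw a straight boundary
linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)

# The RBF kernel implicitly maps points to a space where the rings are separable
rbf_acc = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train).score(X_test, y_test)

print("Linear kernel accuracy:", linear_acc)
print("RBF kernel accuracy:   ", rbf_acc)
```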


#### 4. What is a Naïve Bayes Classifier, and why is it called “naïve”?

#### 5. Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?
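As a rough guide, Gaussian NB suits continuous features, Multinomial NB suits counts such as word frequencies, and Bernoulli NB suits binary present/absent features. The sketch below pairs each variant with a matching feature type; the four-document corpus and its labels are hypothetical and exist only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Gaussian NB: continuous measurements (here, flower dimensions in cm)
X_iris, y_iris = load_iris(return_X_y=True)
gauss_acc = GaussianNB().fit(X_iris, y_iris).score(X_iris, y_iris)

# Hypothetical toy corpus: 1 = spam, 0 = not spam
docs = [
    "win cash prize now",
    "cash prize cash prize",
    "meeting agenda attached",
    "project meeting notes",
]
labels = [1, 1, 0, 0]

# Multinomial NB: how often each word occurs
counts = CountVectorizer().fit_transform(docs)
multi = MultinomialNB().fit(counts, labels)

# Bernoulli NB: whether each word occurs at all
binary = CountVectorizer(binary=True).fit_transform(docs)
bern = BernoulliNB().fit(binary, labels)

print("GaussianNB training accuracy on Iris:", gauss_acc)
print("MultinomialNB predictions:", multi.predict(counts))
print("BernoulliNB predictions:  ", bern.predict(binary))
```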

#### 6. Write a Python program to:
● Load the Iris dataset
● Train an SVM Classifier with a linear kernel
● Print the model's accuracy and support vectors.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train an SVM classifier with a linear kernel
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

# Make predictions
y_pred = svm_model.predict(X_test)

# Print model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# Print support vectors
print("Support Vectors:\n", svm_model.support_vectors_)
```


```text
Model Accuracy: 1.0
Support Vectors:
 [[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]
```
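The raw array of support vectors is hard to read on its own. A possible follow-up, sketched below with the same split and model settings as above, is to look at how many support vectors each class contributes (`n_support_`) and which training rows they are (`support_`).

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)
svm_model = SVC(kernel='linear').fit(X_train, y_train)

# n_support_ holds one count per class, in the order of svm_model.classes_
for name, count in zip(iris.target_names, svm_model.n_support_):
    print(f"{name}: {count} support vectors")

# support_ holds the row indices of the support vectors within X_train
print("Training-set row indices:", svm_model.support_)
```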


#### 7. Write a Python program to:
● Load the Breast Cancer dataset
● Train a Gaussian Naïve Bayes model
● Print its classification report including precision, recall, and F1-score.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train Gaussian Naïve Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predictions
y_pred = gnb.predict(X_test)

# Print classification report
print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=data.target_names))
```


```text
Classification Report:

              precision    recall  f1-score   support

   malignant       0.93      0.90      0.92        63
      benign       0.95      0.96      0.95       108

    accuracy                           0.94       171
   macro avg       0.94      0.93      0.94       171
weighted avg       0.94      0.94      0.94       171
```



#### 8. Write a Python program to:
● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best C and gamma.
● Print the best hyperparameters and accuracy

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Define parameter grid for C and gamma
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf']
}

# GridSearchCV with 5-fold cross-validation
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

# Best model
best_model = grid.best_estimator_

# Predictions
y_pred = best_model.predict(X_test)

# Print best hyperparameters and accuracy
print("Best Hyperparameters:", grid.best_params_)
print("Test Accuracy:", accuracy_score(y_test, y_pred))
```


```text
Best Hyperparameters: {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
Test Accuracy: 0.7777777777777778
```
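The program above feeds unscaled features to an RBF kernel. The Wine features live on very different scales (proline is in the hundreds, hue is around 1), and RBF distances are dominated by the largest ones, which may explain the modest accuracy. A possible variant, sketched below, standardizes the features inside a `Pipeline` so the scaler is refit on each cross-validation fold. The step names `scaler` and `svc` are arbitrary choices, and the split and grid values are reused from above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scale first, then classify; GridSearchCV refits both steps on every fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf")),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

test_acc = grid.score(X_test, y_test)
print("Best Hyperparameters:", grid.best_params_)
print("Test Accuracy:", test_acc)
```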


#### 9. Write a Python program to:
● Train a Naïve Bayes Classifier on a text dataset (e.g. using sklearn.datasets.fetch_20newsgroups).
● Print the model's ROC-AUC score for its predictions.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Load a subset of 20 Newsgroups (binary classification for ROC-AUC)
categories = ['sci.space', 'rec.sport.baseball']
newsgroups = fetch_20newsgroups(subset='all', categories=categories)

X = newsgroups.data
y = newsgroups.target  # Binary labels

# Convert text to TF-IDF features
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_tfidf = vectorizer.fit_transform(X)

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, y, test_size=0.3, random_state=42
)

# Train Naïve Bayes classifier
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# Predict probabilities
y_prob = nb_model.predict_proba(X_test)[:, 1]

# Compute ROC-AUC score
roc_auc = roc_auc_score(y_test, y_prob)
print("ROC-AUC Score:", roc_auc)
```


```text
ROC-AUC Score: 0.9999773689094078
```
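In the program above the vectorizer is fit on every document before the split, so vocabulary and IDF statistics from the test set leak into training. One way to avoid this is to split the raw text first and put the vectorizer inside a `Pipeline`, so it is fit on the training documents only. The sketch below shows the pattern on a tiny hypothetical corpus (the sentences and labels are made up, so it runs without downloading anything); its score says nothing about the 20 Newsgroups result.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = space, 0 = baseball
space = [
    "the rocket reached orbit",
    "nasa launched a new satellite",
    "the shuttle orbit was stable",
    "astronauts repaired the satellite",
    "the rocket launch was delayed",
    "the satellite reached mars orbit",
]
baseball = [
    "the pitcher threw a strike",
    "a home run won the game",
    "the batter hit a home run",
    "the pitcher won the game",
    "the team lost the game",
    "the batter swung at a strike",
]
texts = space + baseball
labels = [1] * len(space) + [0] * len(baseball)

# Split the raw text first, keeping both classes in the test set
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=0, stratify=labels
)

# The pipeline fits TF-IDF on the training documents only
model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", MultinomialNB()),
])
model.fit(texts_train, y_train)

# Probability of the positive class (label 1) for each test document
y_prob = model.predict_proba(texts_test)[:, 1]
auc = roc_auc_score(y_test, y_prob)
print("ROC-AUC Score:", auc)
```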


#### 10. Imagine you’re working as a data scientist for a company that handles email communications.
Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:
● Text with diverse vocabulary
● Potential class imbalance (far more legitimate emails than spam)
● Some incomplete or missing data
Explain the approach you would take to:
● Preprocess the data (e.g. text vectorization, handling missing data)
● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
● Address class imbalance
● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# 1. Load dataset (two newsgroups stand in for "spam" vs. "not spam")
categories = ['rec.sport.baseball', 'sci.space']
data = fetch_20newsgroups(subset='all', categories=categories,
                          remove=('headers', 'footers', 'quotes'))

X = data.data
y = data.target  # 0/1 labels

# Split dataset, preserving the class proportions in both halves
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 2. TF-IDF settings; each pipeline builds its own vectorizer
#    so the two models do not share fitted state
tfidf_params = dict(stop_words='english', max_features=5000, ngram_range=(1, 2))

# 3A. Naïve Bayes pipeline
nb_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(**tfidf_params)),
    ('clf', MultinomialNB())
])

# 3B. Linear SVM pipeline with class weighting
# class_weight='balanced' weights each class inversely to its frequency in y_train
svm_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(**tfidf_params)),
    ('clf', LinearSVC(class_weight='balanced'))
])

# 4. Train and evaluate Naïve Bayes
print("=== Multinomial Naïve Bayes ===")
nb_pipeline.fit(X_train, y_train)
y_pred_nb = nb_pipeline.predict(X_test)
print(classification_report(y_test, y_pred_nb, target_names=data.target_names))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_nb))

# 5. Train and evaluate SVM
print("\n=== Linear SVM (with class weights) ===")
svm_pipeline.fit(X_train, y_train)
y_pred_svm = svm_pipeline.predict(X_test)
print(classification_report(y_test, y_pred_svm, target_names=data.target_names))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_svm))

# 6. ROC-AUC for Naïve Bayes (since it provides probabilities)
y_prob_nb = nb_pipeline.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_prob_nb)
print("\nROC-AUC (Naïve Bayes):", roc_auc)
```


```text
=== Multinomial Naïve Bayes ===
                    precision    recall  f1-score   support

rec.sport.baseball       0.92      0.98      0.95       299
         sci.space       0.97      0.92      0.95       296

          accuracy                           0.95       595
         macro avg       0.95      0.95      0.95       595
      weighted avg       0.95      0.95      0.95       595

Confusion Matrix:
 [[292   7]
 [ 24 272]]

=== Linear SVM (with class weights) ===
                    precision    recall  f1-score   support

rec.sport.baseball       0.91      0.97      0.94       299
         sci.space       0.97      0.91      0.94       296

          accuracy                           0.94       595
         macro avg       0.94      0.94      0.94       595
      weighted avg       0.94      0.94      0.94       595

Confusion Matrix:
 [[291   8]
 [ 28 268]]

ROC-AUC (Naïve Bayes): 0.9939437765524721
```
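The newsgroup data used above is nearly balanced and has no missing entries, so two parts of the question are not exercised. The sketch below shows one possible way to handle them: missing email bodies are replaced with an empty string before vectorizing, the minority class is up-weighted with `class_weight='balanced'`, and the model is scored with average precision, which is more informative than accuracy when spam is rare. The emails and labels are hypothetical, and scoring on the training data is done only to show the calls, not to estimate real performance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import average_precision_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical emails: 1 = spam (rare), 0 = not spam; None marks a missing body
raw_emails = [
    "meeting moved to friday",
    "please review the attached report",
    None,
    "lunch tomorrow at noon",
    "quarterly budget update",
    "notes from the project call",
    "win a free prize now",
    "claim your free cash prize",
]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

# Handle missing data: treat a missing body as an empty document
emails = ["" if text is None else text for text in raw_emails]

model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    # 'balanced' weights each class inversely to its frequency
    ("clf", LinearSVC(class_weight="balanced")),
])
model.fit(emails, labels)

# LinearSVC has no predict_proba; decision_function scores work for ranking metrics
scores = model.decision_function(emails)
ap = average_precision_score(labels, scores)
print("Average precision (training data, illustration only):", ap)
```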
