Question 1: What is a Support Vector Machine (SVM), and how does it work?

Answer:
A Support Vector Machine (SVM) is a supervised learning algorithm used for classification and regression. For classification, it finds the separating boundary (hyperplane) in feature space that maximizes the margin, that is, the distance between the hyperplane and the nearest training points from each class. Those nearest points are called support vectors.
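
To make the margin idea concrete, here is a minimal sketch on a made-up, linearly separable toy set. The data points and the very large C (used to approximate a hard margin) are assumptions chosen for illustration. For a linear SVM the margin width equals 2 / ||w||, where w is the learned weight vector.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up 2-D points: class 0 near the origin, class 1 further out.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C leaves almost no room for margin violations,
# which approximates a hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                        # normal vector of the hyperplane
b = clf.intercept_[0]                   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)  # distance between the two margin lines

print("w:", w, "b:", b)
print("Margin width:", margin_width)
print("Support vectors:\n", clf.support_vectors_)
```

Only the points lying on the margin end up in `support_vectors_`; the other training points could be removed without changing the boundary.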

Question 2: Explain the difference between Hard Margin and Soft Margin SVM.

Answer:
Hard Margin: Requires perfectly linearly separable data, finds a boundary with zero misclassifications, and is sensitive to outliers.

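Soft Margin: Allows some margin violations and misclassifications by introducing slack variables, trading margin width against training errors. This makes it more robust to noisy data and to classes that overlap. The trade-off is controlled by the regularization parameter C: a small C tolerates more violations (wider margin), while a large C penalizes them heavily and approaches hard-margin behavior.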
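
The effect of the regularization parameter (called C in scikit-learn) can be sketched by counting support vectors at different settings. The synthetic overlapping clusters and the three C values below are assumptions for illustration only.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping synthetic clusters, so no hyperplane separates them perfectly.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=0)

n_support = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # n_support_ holds the number of support vectors per class.
    n_support[C] = int(clf.n_support_.sum())
    print(f"C={C:<6} support vectors: {n_support[C]:3d}  "
          f"train accuracy: {clf.score(X, y):.3f}")
```

A smaller C widens the margin, so more training points fall inside it and become support vectors; a larger C narrows the margin and fits the training data more tightly.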

Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.

Answer:
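The kernel trick allows SVMs to handle non-linear data by implicitly mapping it into a higher-dimensional space, where the classes can be separated by a hyperplane. The mapping is never computed explicitly: the kernel function returns the inner product of two points in that space directly.

For example, the RBF (Radial Basis Function) kernel, K(x, x') = exp(-gamma * ||x - x'||^2), can capture complex, curved decision boundaries, which makes it a common choice when the classes cannot be separated by a straight line.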
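
As a small illustration, the sketch below compares a linear kernel with an RBF kernel on two concentric rings, a shape no straight line can separate. The synthetic dataset and its parameters are assumptions chosen for the demonstration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Two concentric rings: the inner ring is one class, the outer ring the other.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# Mean 5-fold cross-validated accuracy for each kernel.
linear_acc = cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean()
rbf_acc = cross_val_score(SVC(kernel="rbf", gamma="scale"), X, y, cv=5).mean()

print(f"Linear kernel accuracy: {linear_acc:.3f}")
print(f"RBF kernel accuracy:    {rbf_acc:.3f}")
```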

Question 4: What is a Naïve Bayes Classifier, and why is it called “naïve”?

Answer:
Naïve Bayes is a probabilistic classifier based on Bayes’ theorem. It is called “naïve” because it assumes that all features are conditionally independent of one another given the class. This assumption rarely holds in real data, yet the classifier often works surprisingly well in practice.
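
The independence assumption means the score for each class is just the class prior multiplied by one likelihood term per feature. A minimal hand calculation, using made-up probabilities for a hypothetical two-word spam example:

```python
# Hypothetical numbers, invented purely for illustration.
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"free": 0.8, "meeting": 0.1},  # P(word present | spam)
    "ham": {"free": 0.1, "meeting": 0.5},   # P(word present | ham)
}
observed = ["free", "meeting"]  # words present in the message

# Unnormalized score: P(class) * product of P(word | class).
scores = {}
for label in prior:
    score = prior[label]
    for word in observed:
        score *= likelihood[label][word]
    scores[label] = score

# Normalize so the posteriors sum to 1.
total = sum(scores.values())
posterior = {label: score / total for label, score in scores.items()}

for label, p in posterior.items():
    print(f"P({label} | message) = {p:.3f}")
```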

Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?

Answer:
Gaussian: Assumes features are continuous and follow a normal (Gaussian) distribution. Use when input features are real values.

Multinomial: Used for discrete count data like word frequencies in text classification.

Bernoulli: Designed for binary features (0/1). Use for data with yes/no, presence/absence attributes.
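
A short sketch that pairs each variant with the kind of input it expects. The tiny arrays are made-up toy data, and predictions are made on the training rows only to show the API, not to measure quality.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Continuous measurements -> GaussianNB
X_cont = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.2], [4.1, 3.9]])
# Word counts per document -> MultinomialNB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
# Word presence/absence (0/1) -> BernoulliNB
X_bin = (X_counts > 0).astype(int)

variants = {
    "GaussianNB": (GaussianNB(), X_cont),
    "MultinomialNB": (MultinomialNB(), X_counts),
    "BernoulliNB": (BernoulliNB(), X_bin),
}

predictions = {}
for name, (model, X) in variants.items():
    model.fit(X, y)
    predictions[name] = model.predict(X)
    print(f"{name:14s} predictions: {predictions[name]}")
```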






In [1]:
#Question 6: Write a Python program to:
#● Load the Iris dataset
#● Train an SVM Classifier with a linear kernel
#● Print the model's accuracy and support vectors.
#Answer

from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

svm = SVC(kernel="linear")
svm.fit(X, y)

# Note: this is accuracy on the training data, since no test split is used here.
print("Accuracy:", svm.score(X, y))
print("Support Vectors:\n", svm.support_vectors_)


Accuracy: 0.9933333333333333
Support Vectors:
 [[5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [4.5 2.3 1.3 0.3]
 [6.9 3.1 4.9 1.5]
 [6.3 3.3 4.7 1.6]
 [6.1 2.9 4.7 1.4]
 [5.6 3.  4.5 1.5]
 [6.2 2.2 4.5 1.5]
 [5.9 3.2 4.8 1.8]
 [6.3 2.5 4.9 1.5]
 [6.8 2.8 4.8 1.4]
 [6.7 3.  5.  1.7]
 [6.  2.7 5.1 1.6]
 [5.4 3.  4.5 1.5]
 [5.1 2.5 3.  1.1]
 [4.9 2.5 4.5 1.7]
 [6.5 3.2 5.1 2. ]
 [6.  2.2 5.  1.5]
 [6.3 2.7 4.9 1.8]
 [6.2 2.8 4.8 1.8]
 [6.1 3.  4.9 1.8]
 [7.2 3.  5.8 1.6]
 [6.3 2.8 5.1 1.5]
 [6.  3.  4.8 1.8]
 [6.3 2.5 5.  1.9]
 [6.5 3.  5.2 2. ]
 [5.9 3.  5.1 1.8]]


In [2]:
#Question 7: Write a Python program to:
#● Load the Breast Cancer dataset
#● Train a Gaussian Naïve Bayes model
#● Print its classification report including precision, recall, and F1-score.
#Answer

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report


cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)

print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       1.00      0.93      0.96        43
           1       0.96      1.00      0.98        71

    accuracy                           0.97       114
   macro avg       0.98      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



In [3]:
#Question 8: Write a Python program to:
#● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best
#C and gamma.
#● Print the best hyperparameters and accuracy.

#Answer:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

wine = load_wine()
X, y = wine.data, wine.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

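# GridSearchCV over C and gamma (SVC uses the RBF kernel by default).
# Note: the features are not scaled here, and SVMs are sensitive to feature scale.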
param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)

print('Best Params:', grid.best_params_)
print('Accuracy:', grid.score(X_test, y_test))


Best Params: {'C': 10, 'gamma': 0.01}
Accuracy: 0.6666666666666666
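
The test accuracy printed above is low for this dataset, most likely because the Wine features span very different ranges and an RBF SVM is sensitive to feature scale. A possible variant, sketched under that assumption, standardizes the features inside a pipeline so that the scaler is fit only on the training folds. The `svc__` prefix is the step name that `make_pipeline` generates automatically.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Scale, then classify; the grid search refits the scaler on each CV fold.
pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

scaled_accuracy = grid.score(X_test, y_test)
print("Best Params:", grid.best_params_)
print("Accuracy with scaling:", scaled_accuracy)
```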


In [4]:
#Question 9: Write a Python program to:
#● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
#sklearn.datasets.fetch_20newsgroups).
#● Print the model's ROC-AUC score for its predictions.
#Answer:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score


data = fetch_20newsgroups(subset='all', categories=['sci.space', 'rec.autos'])
X, y = data.data, data.target

vectorizer = CountVectorizer(binary=False)
X_vec = vectorizer.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_vec, y, test_size=0.2, random_state=42)

nb = MultinomialNB()
nb.fit(X_train, y_train)

# ROC-AUC
y_prob = nb.predict_proba(X_test)[:,1]
print('ROC-AUC:', roc_auc_score(y_test, y_prob))


ROC-AUC: 0.9971941638608306


Question 10: Imagine you’re working as a data scientist for a company that handles
email communications.
Your task is to automatically classify emails as Spam or Not Spam. The emails may
contain:
● Text with diverse vocabulary
● Potential class imbalance (far more legitimate emails than spam)
● Some incomplete or missing data
Explain the approach you would take to:
● Preprocess the data (e.g. text vectorization, handling missing data)
● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
● Address class imbalance
● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution

Answer:
Preprocessing: Vectorize text using methods like TF-IDF or CountVectorizer. Handle missing values by imputing with empty strings or dropping incomplete samples if needed.

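Model Choice: Start with Naïve Bayes as a baseline, since it is fast and works well on high-dimensional, sparse text features. Then try a linear SVM, which often achieves higher accuracy on text but costs more to train and tune. Standard Multinomial Naïve Bayes can be skewed by class imbalance, so consider adjusting the class priors or using Complement Naïve Bayes.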

Class Imbalance: Use resampling (e.g., SMOTE, oversample spam, or undersample non-spam), or adjust class weights ('class_weight' in SVM).

Evaluation: Use Precision, Recall, F1-score, and ROC-AUC rather than accuracy alone. When legitimate emails far outnumber spam, a model that labels every email as not spam can still score high accuracy.

Business Value: Accurate spam detection saves user time, reduces risk of fraud/phishing, and improves overall email system reliability. Automated classification enables scalable monitoring and protects company reputation.
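
The Naïve Bayes baseline described above could look roughly like the sketch below. The six messages are made-up toy data (one text is deliberately missing), and Complement Naïve Bayes is used here as one possible choice for imbalanced classes, not as the only option.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

# Made-up toy messages; label 1 = spam, 0 = not spam.
df = pd.DataFrame({
    "text": ["Win a free prize now", "Meeting moved to 3pm", None,
             "Free shipping, claim your prize", "Lunch tomorrow?",
             "Quarterly report attached"],
    "label": [1, 0, 0, 1, 0, 0],
})

# Impute missing text with an empty string so the vectorizer can process it.
df["text"] = df["text"].fillna("")

# TF-IDF features followed by Complement Naive Bayes.
model = make_pipeline(TfidfVectorizer(), ComplementNB())
model.fit(df["text"], df["label"])

new_messages = ["Claim your free prize", "Report for the meeting"]
predictions = model.predict(new_messages)
print(dict(zip(new_messages, predictions)))
```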

In [6]:
import pandas as pd

# Create a dummy emails.csv for demonstration
data = {'text': ['This is a legitimate email.', 'Buy now and get free shipping!', 'Another important message.', 'Claim your prize today!'],
        'label': ['not spam', 'spam', 'not spam', 'spam']}
df_dummy = pd.DataFrame(data)
df_dummy.to_csv('emails.csv', index=False)

print("Dummy 'emails.csv' created.")

Dummy 'emails.csv' created.


In [10]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

# 1. Data loading (assume a CSV with columns 'text', 'label')
df = pd.read_csv('emails.csv')

# 2. Preprocessing: Handle missing values
df['text'] = df['text'].fillna('')
df['label'] = df['label'].fillna('not spam')

# 3. Encode labels: spam=1, not spam=0
df['label_num'] = (df['label'] == 'spam').astype(int)

# 4. Split data
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['label_num'], test_size=0.2, random_state=42)

# 5. Text vectorization using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english', max_features=3000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# 6. Handle class imbalance using class weights
class_weights = compute_class_weight('balanced', classes=np.array([0,1]), y=y_train)
class_weight_dict = {0: class_weights[0], 1: class_weights[1]}

# 7. Model training with SVM (linear kernel)
clf = SVC(kernel='linear', class_weight=class_weight_dict, probability=True)
clf.fit(X_train_vec, y_train)

# 8. Predictions
y_pred = clf.predict(X_test_vec)

# 9. Evaluation
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=['Not Spam', 'Spam']))

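# Optional: probability outputs via clf.predict_proba(X_test_vec), ROC curve, etc.
# Note: with the 4-row dummy file, the test set holds a single email,
# so the metrics below are not meaningful.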

Confusion Matrix:
 [[0 0]
 [1 0]]

Classification Report:
               precision    recall  f1-score   support

    Not Spam       0.00      0.00      0.00       0.0
        Spam       0.00      0.00      0.00       1.0

    accuracy                           0.00       1.0
   macro avg       0.00      0.00      0.00       1.0
weighted avg       0.00      0.00      0.00       1.0
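
With only one email in the test set, every metric above is degenerate. A sketch of a more informative check, assuming a slightly larger made-up corpus and a stratified split so that both classes appear in the test set:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up toy corpus: 12 legitimate emails and 8 spam emails.
ham = [
    "Meeting moved to 3pm", "Quarterly report attached", "Lunch tomorrow?",
    "Please review the contract", "Notes from today's call",
    "Your invoice for March", "Team offsite agenda", "Can you send the slides",
    "Project deadline reminder", "Welcome to the team",
    "Minutes of the board meeting", "Travel itinerary for next week",
]
spam = [
    "Win a free prize now", "Claim your prize today", "Buy now, free shipping",
    "Cheap loans, apply now", "You won the lottery", "Free gift card, click now",
    "Limited offer, buy today", "Earn money fast, click here",
]
texts = ham + spam
labels = [0] * len(ham) + [1] * len(spam)

# stratify keeps the spam / not-spam ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42)

model = make_pipeline(
    TfidfVectorizer(),
    SVC(kernel="linear", class_weight="balanced"),
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print("Confusion Matrix:\n", cm)
print(classification_report(y_test, y_pred, labels=[0, 1],
                            target_names=["Not Spam", "Spam"],
                            zero_division=0))
```

A corpus this small still gives noisy numbers; the point is only that both rows of the confusion matrix now have support.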



