**1. What is a Support Vector Machine (SVM), and how does it work?**

>A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. For classification, it finds the hyperplane that separates data points of different classes with the maximum margin, where the margin is the distance from the hyperplane to the nearest training points. Those nearest points are called support vectors, and they alone determine the position and orientation of the boundary. SVM can handle both linear and non-linear data using kernel functions. It is effective in high-dimensional spaces and tends to generalize well because a wide margin limits overfitting. It is widely used in text classification, image recognition, and bioinformatics.
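The idea of a maximum-margin hyperplane can be checked on a tiny example. The sketch below uses a hypothetical four-point dataset (not part of the assignment) and a large `C` to approximate a hard margin; it reads the learned weights, computes the margin width as 2 / ‖w‖, and shows that the support vectors sit on the margin.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data: two points per class on a diagonal
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([0, 0, 1, 1])

# A large C approximates a hard margin on separable data
clf = SVC(kernel="linear", C=1000.0)
clf.fit(X, y)

w = clf.coef_[0]            # normal vector of the separating hyperplane
b = clf.intercept_[0]       # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)

# Support vectors lie on the margin, where |w.x + b| is numerically close to 1
sv_decision = clf.decision_function(clf.support_vectors_)

print("Support vectors:\n", clf.support_vectors_)
print("w =", w, "b =", b)
print("Margin width:", margin_width)
print("Decision values at support vectors:", sv_decision)
```

Only the two inner points end up as support vectors; moving the outer points farther away would not change the boundary.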

**2. Explain the difference between Hard Margin and Soft Margin SVM.**

>In a Hard Margin SVM, the goal is to find a hyperplane that separates all training points perfectly, with no misclassifications and no points inside the margin. It assumes that the data is completely linearly separable and noise-free. As a result, it is very sensitive to outliers: a single outlier near the other class can shift the boundary drastically, and if the classes overlap at all, no valid solution exists.

>In a Soft Margin SVM, the model allows some errors or margin violations to improve flexibility and generalization. It introduces a penalty parameter (C) that balances the trade-off between achieving a larger margin and minimizing classification errors: a large C penalizes violations heavily and yields a narrower margin, while a small C tolerates more violations in exchange for a wider margin. This makes Soft Margin SVM suitable for real-world, noisy, or overlapping data, which is why it is the form used in practice.
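The effect of C can be seen by counting support vectors. The following is a minimal sketch, assuming the two overlapping Iris classes (versicolor and virginica) as test data and three arbitrarily chosen values of C; a smaller C widens the margin, so more points fall on or inside it and become support vectors.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Keep only versicolor (1) and virginica (2), two classes that overlap slightly
X, y = load_iris(return_X_y=True)
mask = y != 0
X, y = X[mask], y[mask]

# Fit a linear soft-margin SVM for several penalty values (illustrative choices)
n_support = {}
for C in [0.01, 1, 100]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    n_support[C] = int(clf.n_support_.sum())
    print(f"C={C}: {n_support[C]} support vectors")
```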

**3. What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.**

>The Kernel Trick in SVM is a mathematical technique that lets the model behave as if non-linearly separable data had been mapped into a higher-dimensional space where it becomes linearly separable. Instead of explicitly computing the transformation, the kernel function calculates the inner product between pairs of points in the new feature space directly from the original features, saving computation time and memory.

>One common example is the Radial Basis Function (RBF) Kernel, K(x, z) = exp(-γ‖x − z‖²), which measures the similarity between data points based on their distance: nearby points score close to 1 and distant points close to 0. It is especially useful for problems where the decision boundary is curved, closed, or otherwise complex, such as image classification or pattern recognition. The Kernel Trick allows SVM to handle complex, non-linear relationships efficiently.
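A small numeric check makes the trick concrete. The sketch below is illustrative only: `phi` is a hypothetical explicit feature map for the degree-2 polynomial kernel (x·z)², and the vectors and `gamma` value are arbitrary. It shows that the kernel gives the same number as the inner product in the mapped space, and then evaluates the RBF kernel both by hand and with scikit-learn's `rbf_kernel`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (illustrative helper)."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

explicit = phi(x) @ phi(z)   # inner product computed in the 3-D feature space
via_kernel = (x @ z) ** 2    # same value computed from the original 2-D vectors
print("Explicit map:", explicit, "| Kernel:", via_kernel)

# RBF kernel: exp(-gamma * squared Euclidean distance)
gamma = 0.5
manual_rbf = np.exp(-gamma * np.sum((x - z) ** 2))
sklearn_rbf = rbf_kernel(x.reshape(1, -1), z.reshape(1, -1), gamma=gamma)[0, 0]
print("Manual RBF:", manual_rbf, "| sklearn RBF:", sklearn_rbf)
```

For the RBF kernel the implicit feature space is infinite-dimensional, so the explicit route is not available at all; only the kernel value can be computed.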

**4. What is a Naïve Bayes Classifier, and why is it called “naïve”?**

>A Naïve Bayes Classifier is a probabilistic machine learning algorithm based on Bayes’ Theorem, mainly used for classification tasks. It works by calculating the probability of each class given a set of input features and then assigning the class with the highest probability to the data point. The classifier assumes that all features are independent of each other given the class label, which simplifies the computation and makes the algorithm fast and efficient. It is called “naïve” because this independence assumption is often unrealistic in real-world scenarios, where features can be correlated. However, despite this simplification, Naïve Bayes often performs surprisingly well, especially in applications like spam detection, sentiment analysis, document categorization, and medical diagnosis, due to its simplicity, scalability, and good performance on high-dimensional data.
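The calculation described above can be worked through by hand. This is a minimal sketch with made-up priors and word likelihoods (all numbers are hypothetical): each class score is the prior multiplied by the per-feature likelihoods, which is exactly where the independence assumption is used, and the scores are then normalized into posteriors.

```python
# Hypothetical class priors and per-word likelihoods P(word | class)
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.5, "meeting": 0.1},
    "ham": {"free": 0.05, "meeting": 0.4},
}
observed = ["free", "meeting"]  # words present in the email to classify

# Naive assumption: multiply the likelihoods as if the words were independent
scores = {}
for label in priors:
    score = priors[label]
    for word in observed:
        score *= likelihoods[label][word]
    scores[label] = score

# Normalize the scores so they sum to 1 (Bayes' theorem denominator)
total = sum(scores.values())
posteriors = {label: s / total for label, s in scores.items()}
prediction = max(posteriors, key=posteriors.get)

print("Posteriors:", posteriors)
print("Predicted class:", prediction)
```

With these numbers the spam score is 0.4 × 0.5 × 0.1 = 0.02 and the ham score is 0.6 × 0.05 × 0.4 = 0.012, so the posterior for spam is 0.02 / 0.032 = 0.625.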

**5. Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?**

>There are three main variants of the Naïve Bayes Classifier: Gaussian, Multinomial, and Bernoulli. They differ in the distribution they assume for each feature given the class. The Gaussian Naïve Bayes is used when the features are continuous and can be assumed to follow a roughly normal (Gaussian) distribution within each class, making it suitable for datasets like medical measurements or sensor readings. The Multinomial Naïve Bayes is used for discrete features such as word counts or frequencies, commonly applied in text classification, document categorization, and spam filtering. The Bernoulli Naïve Bayes works with binary or boolean features, where data is represented as 0s and 1s to indicate the presence or absence of a feature. It is often used for sentiment analysis or document classification when only the occurrence of a word matters, not its frequency.
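The pairing of variant and feature type can be sketched as follows. The arrays are tiny hypothetical datasets invented for illustration: continuous measurements for `GaussianNB`, word counts for `MultinomialNB`, and the same counts reduced to presence/absence for `BernoulliNB`.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Continuous measurements -> GaussianNB
X_cont = np.array([[5.1, 3.5], [4.9, 3.0], [6.7, 3.1], [6.3, 2.5]])
# Word counts per document -> MultinomialNB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
# Word presence (1) or absence (0) -> BernoulliNB
X_binary = (X_counts > 0).astype(int)

models = {
    "GaussianNB": (GaussianNB(), X_cont),
    "MultinomialNB": (MultinomialNB(), X_counts),
    "BernoulliNB": (BernoulliNB(), X_binary),
}

# Fit each variant on the feature type it is designed for
predictions = {}
for name, (model, X) in models.items():
    predictions[name] = model.fit(X, y).predict(X)
    print(name, "predictions on training data:", predictions[name])
```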






In [1]:
#  Dataset Info:
# ● You can use any suitable datasets like Iris, Breast Cancer, or Wine from
# sklearn.datasets or a CSV file you have.

# Question 6: Write a Python program to:
# ● Load the Iris dataset
# ● Train an SVM Classifier with a linear kernel
# ● Print the model's accuracy and support vectors.

# Import necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the SVM model with a linear kernel
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

# Predict on the test data
y_pred = svm_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Model Accuracy:", accuracy)
print("Support Vectors:\n", svm_model.support_vectors_)


Model Accuracy: 1.0
Support Vectors:
 [[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]


In [2]:
# Question (7)
# Write a Python program to:
# ● Load the Breast Cancer dataset
# ● Train a Gaussian Naïve Bayes model
# ● Print its classification report including precision, recall, and F1-score.

# Import necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the Gaussian Naïve Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predict on the test data
y_pred = gnb.predict(X_test)

# Print classification report
print("Classification Report:\n")
print(classification_report(y_test, y_pred))


Classification Report:

              precision    recall  f1-score   support

           0       0.93      0.90      0.92        63
           1       0.95      0.96      0.95       108

    accuracy                           0.94       171
   macro avg       0.94      0.93      0.94       171
weighted avg       0.94      0.94      0.94       171



In [3]:
# Question (8)
# Write a Python program to:
# ● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best
# C and gamma.
# ● Print the best hyperparameters and accuracy.

# Import necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = datasets.load_wine()
X = wine.data
y = wine.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the parameter grid for C and gamma
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['rbf']
}

# Create the SVM model and apply GridSearchCV
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=1)
grid.fit(X_train, y_train)

# Make predictions on the test data
y_pred = grid.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print the best hyperparameters and accuracy
print("Best Hyperparameters:", grid.best_params_)
print("Model Accuracy:", accuracy)



Fitting 5 folds for each of 16 candidates, totalling 80 fits
Best Hyperparameters: {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
Model Accuracy: 0.7777777777777778
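
The grid search above feeds raw features to an RBF kernel, which is distance-based. In the Wine data, proline values run into the hundreds or more while hue stays near 1, so a few features can dominate the distances. A possible variant, sketched here as an assumption and not as part of the original answer, standardizes the features inside a `Pipeline` so that scaling is re-fit on each cross-validation fold. The `svc__` prefixes follow `make_pipeline`'s automatic step naming.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scale features to zero mean and unit variance before the RBF SVM
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Same C and gamma grid as above, addressed through the pipeline step name
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [1, 0.1, 0.01, 0.001],
}
grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X_train, y_train)

scaled_accuracy = grid.score(X_test, y_test)
print("Best Hyperparameters:", grid.best_params_)
print("Model Accuracy:", scaled_accuracy)
```

Comparing this accuracy with the unscaled result shows how much of the gap is due to feature scale and not to the choice of C and gamma.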


In [9]:
# Question (9)
# Write a Python program to:
# ● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
# sklearn.datasets.fetch_20newsgroups).
# ● Print the model's ROC-AUC score for its predictions.

# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_auc_score

# Load four categories of the 20 Newsgroups dataset (real newsgroup posts, used here as the text dataset)
categories = ['sci.space', 'rec.autos', 'comp.graphics', 'talk.politics.mideast']
data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3, random_state=42)

# Convert text to TF-IDF features
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train a Naïve Bayes Classifier
nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)

# Predict probabilities on test data
y_proba = nb_model.predict_proba(X_test_tfidf)

# Convert true labels to binary format for ROC-AUC computation
y_test_bin = label_binarize(y_test, classes=range(len(categories)))

# Compute ROC-AUC score (macro-average for multi-class)
roc_auc = roc_auc_score(y_test_bin, y_proba, average='macro')

# Display the ROC-AUC Score
print("Naïve Bayes Classifier ROC-AUC Score:", roc_auc)





Question 10: Imagine you’re working as a data scientist for a company that handles
email communications.
Your task is to automatically classify emails as Spam or Not Spam. The emails may
contain:
>● Text with diverse vocabulary

>● Potential class imbalance (far more legitimate emails than spam)

>● Some incomplete or missing data

Explain the approach you would take to:

>● Preprocess the data (e.g. text vectorization, handling missing data)

>● Choose and justify an appropriate model (SVM vs. Naïve Bayes)

>● Address class imbalance

>● Evaluate the performance of your solution with suitable metrics

And explain the business impact of your solution.
(Include your Python code and output in the code box below.)

**Answer:**

Approach Explanation

1. Preprocessing the Data

Handling missing data: Replace missing email text with an empty string ("") or remove those rows.

Text vectorization: Use TF-IDF vectorization to convert text into numerical form while giving more importance to distinctive words.

Text cleaning: Lowercase the text and remove stopwords before vectorizing.

2. Choosing a Model

| Model                            | Pros                                                                      | Cons                                                              | Suitable for                                         |
| -------------------------------- | ------------------------------------------------------------------------- | ----------------------------------------------------------------- | ---------------------------------------------------- |
| **Naïve Bayes (MultinomialNB)**  | Fast to train, efficient on sparse text features, few parameters to tune  | Assumes feature independence, can underperform if words correlate | Large text datasets, quick baseline                  |
| **SVM (Support Vector Machine)** | Handles high-dimensional data, supports class weights for imbalanced data | Slower on large data, needs careful tuning of C and the kernel    | Small to medium datasets where tuning time is available |

Here, we'll use Multinomial Naïve Bayes because it trains quickly on TF-IDF features and gives a strong baseline for spam filtering.




3. Handling Class Imbalance

Use SMOTE (Synthetic Minority Oversampling Technique) or class weights.

Alternatively, adjust thresholds or use stratified sampling.

Here, we’ll use SMOTE to balance classes.
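
As an alternative to resampling, class weights make errors on the rare class cost more. The sketch below uses hypothetical labels (8 legitimate emails, 2 spam) and scikit-learn's `compute_class_weight` to show what the "balanced" heuristic produces: each weight is n_samples / (n_classes × class_count).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 8 legitimate emails (0) and 2 spam emails (1)
y = np.array([0] * 8 + [1] * 2)

classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weights = dict(zip(classes.tolist(), weights.tolist()))

print(class_weights)  # {0: 0.625, 1: 2.5}
```

An SVM can take such weights through `SVC(class_weight="balanced")` or an explicit dictionary. `MultinomialNB` has no `class_weight` parameter, which is one reason to use resampling when Naïve Bayes is the chosen model.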
4. Evaluation Metrics

Because of imbalance, accuracy alone isn’t enough.
We use:

Precision & Recall

F1-score

ROC-AUC score
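
A short worked example shows why accuracy misleads on imbalanced data. The labels below are hypothetical: 8 legitimate emails and 2 spam. A classifier that never flags spam reaches the same accuracy as a model that actually finds some spam, and only recall, precision, and F1 reveal the difference.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical ground truth: 8 legitimate emails (0) and 2 spam emails (1)
y_true = [0] * 8 + [1] * 2

# Baseline that labels every email as legitimate
y_always_ham = [0] * 10
# Model with 1 false positive, 1 true positive, and 1 missed spam
y_model = [0] * 7 + [1] + [1, 0]

baseline_accuracy = accuracy_score(y_true, y_always_ham)
baseline_recall = recall_score(y_true, y_always_ham)

model_accuracy = accuracy_score(y_true, y_model)
model_precision = precision_score(y_true, y_model)
model_recall = recall_score(y_true, y_model)
model_f1 = f1_score(y_true, y_model)

print("Baseline: accuracy =", baseline_accuracy, "recall =", baseline_recall)
print("Model:    accuracy =", model_accuracy, "precision =", model_precision,
      "recall =", model_recall, "F1 =", model_f1)
```

Both classifiers score 0.8 on accuracy, but the baseline has a spam recall of 0 while the model has precision, recall, and F1 of 0.5.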

5. Business Impact

Reduces manual filtering effort.

Improves productivity by reducing spam exposure.

Prevents phishing and malicious email clicks.

Builds trust in internal email systems.




In [13]:
# Python implementation

# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, roc_auc_score
from imblearn.over_sampling import SMOTE

# ---- Step 1: Create a sample dataset (simulate email data) ----
data = {
    'email_text': [
        "Win a free iPhone now!", "Meeting at 10 AM", "Limited offer just for you",
        "Project update attached", "Earn money from home", "Lunch tomorrow?",
        "Congratulations, you've been selected!", "Please review the attached invoice",
        "Exclusive deal, buy now!", "Schedule team call"
    ],
    'label': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1=Spam, 0=Not Spam
}

df = pd.DataFrame(data)

# ---- Step 2: Handle missing data ----
df['email_text'] = df['email_text'].fillna('')

# ---- Step 3: Split the data ----
X_train, X_test, y_train, y_test = train_test_split(df['email_text'], df['label'], test_size=0.3, random_state=42)

# ---- Step 4: Vectorize text ----
vectorizer = TfidfVectorizer(stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

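# ---- Step 5: Handle class imbalance using SMOTE ----
# The 7-sample training split leaves at most 3 minority-class samples, so the
# default k_neighbors=5 would raise a ValueError; use k_neighbors=2 here.
smote = SMOTE(random_state=42, k_neighbors=2)
X_train_bal, y_train_bal = smote.fit_resample(X_train_tfidf, y_train)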

# ---- Step 6: Train Naïve Bayes Model ----
model = MultinomialNB()
model.fit(X_train_bal, y_train_bal)

# ---- Step 7: Predict and Evaluate ----
y_pred = model.predict(X_test_tfidf)
y_proba = model.predict_proba(X_test_tfidf)[:, 1]

print("Classification Report:\n", classification_report(y_test, y_pred))
print("ROC-AUC Score:", roc_auc_score(y_test, y_proba))

# Output :

# Classification Report:
#               precision    recall  f1-score   support

#            0       1.00      1.00      1.00         2
#            1       1.00      1.00      1.00         1

#     accuracy                           1.00         3
#    macro avg       1.00      1.00      1.00         3
# weighted avg       1.00      1.00      1.00         3

# ROC-AUC Score: 1.0
