Question 1: What is a Support Vector Machine (SVM), and how does it work?

ans- A Support Vector Machine (SVM) is a powerful supervised machine learning algorithm used for classification and regression tasks, especially effective in high-dimensional spaces.

⚙️ How Does SVM Work?
At its core, SVM tries to find the best decision boundary (called a hyperplane) that separates data points of different classes with the maximum margin.
Here's the step-by-step intuition:
- Plot the data in n-dimensional space (where n = number of features).
- Identify the hyperplane that best separates the classes.
- Maximize the margin — the distance between the hyperplane and the nearest data points from each class (called support vectors).
- If data isn’t linearly separable, SVM uses:
  - The kernel trick, which implicitly maps the data into a higher-dimensional space where a linear separator may exist.
  - Common kernels: linear, polynomial, RBF (Gaussian).
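
The margin idea in the steps above can be sketched on a toy problem. This is a minimal illustration rather than part of the original answer: the four 2-D points and the large `C` value (used to approximate a hard margin) are assumptions chosen so the geometry is easy to check by hand.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two linearly separable classes lying on the line y = x
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([0, 0, 1, 1])

# A large C approximates a hard-margin SVM on separable data
clf = SVC(kernel="linear", C=1000.0)
clf.fit(X, y)

w = clf.coef_[0]                         # normal vector of the separating hyperplane
b = clf.intercept_[0]                    # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)   # distance between the two margin boundaries

print("w =", w, "b =", b)
print("support vectors:\n", clf.support_vectors_)
print("margin width =", margin_width)
```

Under these assumptions the support vectors should be the two inner points, (1, 1) and (3, 3), and the margin width should match the distance between them.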

Question 2: Explain the difference between Hard Margin and Soft Margin SVM.

ans- Hard Margin SVM
Definition: Assumes that the data is perfectly linearly separable — meaning there exists a hyperplane that separates the classes without any misclassification.
Key Characteristics:
- No tolerance for errors or overlap.
- Maximizes the margin strictly between classes.
- Very sensitive to outliers: a single point on the wrong side of the boundary makes the problem infeasible, and one point close to the boundary can shrink the margin drastically.
Use Case: Rare in practice; mostly theoretical or for clean, synthetic datasets.

🧊 Soft Margin SVM
Definition: Allows for some misclassification by introducing a penalty term for incorrectly classified points. Balances margin maximization with classification error.
Key Characteristics:
- Introduces a regularization parameter C:
  - Low C → wider margin, more tolerance for misclassification.
  - High C → narrower margin, less tolerance for errors.
- More robust to noisy data and outliers.
- Widely used in real-world applications.
Use Case: Ideal for datasets with overlapping classes or noise — like medical diagnosis, text classification, etc.
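
The effect of C can be observed directly. The sketch below is an illustration under assumed settings: the overlapping synthetic blobs, the two C values, and the seed are arbitrary choices, and the margin width is computed as 2 / ||w|| for a linear kernel.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic, overlapping two-class data (an assumption for illustration)
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

results = {}
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])   # geometric margin width
    n_sv = int(clf.n_support_.sum())              # total number of support vectors
    results[C] = (margin, n_sv)
    print(f"C={C}: margin width={margin:.3f}, support vectors={n_sv}")
```

The low-C model should report the wider margin, since it accepts more margin violations in exchange for a smaller ||w||.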


Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.

ans- The Kernel Trick allows SVMs to operate in a higher-dimensional space without explicitly computing the transformation. Instead of mapping each data point to the new space, it uses a kernel function to compute the dot product the points would have in that space, working directly from the original features.
This enables SVMs to find a linear decision boundary in a transformed space, which corresponds to a non-linear boundary in the original space.

📌 Why Is It Useful?
- Avoids the computational cost of explicitly transforming data.
- Makes SVMs capable of handling complex, non-linear relationships.
- Keeps the algorithm efficient even in very high-dimensional spaces.
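
To make the "without explicitly computing the transformation" point concrete, here is a small sketch using a degree-2 polynomial kernel instead of RBF, because its feature map is finite and can be written out. The helper names `phi` and `poly_kernel` and the two sample vectors are hypothetical.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (hypothetical helper)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def poly_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: K(a, b) = (a . b)^2."""
    return float(np.dot(a, b)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

explicit = float(np.dot(phi(x), phi(z)))  # dot product after mapping both points to 3-D
implicit = poly_kernel(x, z)              # same quantity, computed entirely in 2-D

print("explicit mapping:", explicit)
print("kernel function: ", implicit)
```

Both routes give the same number, but the kernel never builds the 3-D vectors. For the RBF kernel the implied space is infinite-dimensional, so the kernel route is the only practical one.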

🌟 Example: Radial Basis Function (RBF) Kernel
Definition:
K(x, x') = \exp\left(-\gamma \|x - x'\|^2\right)
Use Case:
- Ideal for datasets where the decision boundary is non-linear and complex.
- Commonly used in image classification, bioinformatics, and medical diagnosis where patterns are not linearly separable.
Intuition:
The RBF kernel measures similarity between two points. If they’re close, the kernel value is high; if they’re far apart, it’s low. This helps the SVM focus on local patterns.
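
The similarity intuition can be checked numerically. In this sketch the value of `gamma` and the three points are assumptions for illustration; `rbf_by_hand` is a hypothetical helper that applies the formula above, cross-checked against scikit-learn's `rbf_kernel`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

gamma = 0.5  # assumed value for illustration

x = np.array([[0.0, 0.0]])
near = np.array([[0.1, 0.0]])   # close to x
far = np.array([[3.0, 0.0]])    # far from x

def rbf_by_hand(a, b, gamma):
    """K(a, b) = exp(-gamma * ||a - b||^2), written out directly."""
    return float(np.exp(-gamma * np.sum((a - b) ** 2)))

k_near = rbf_by_hand(x, near, gamma)
k_far = rbf_by_hand(x, far, gamma)
k_near_sklearn = float(rbf_kernel(x, near, gamma=gamma)[0, 0])

print("K(x, near) =", k_near)
print("K(x, far)  =", k_far)
print("sklearn K(x, near) =", k_near_sklearn)
```

A larger `gamma` makes the similarity fall off faster with distance, which makes the decision boundary more local.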

Question 4: What is a Naïve Bayes Classifier, and why is it called “naïve”?

ans-  What Is a Naïve Bayes Classifier?
The Naïve Bayes Classifier is a probabilistic machine learning algorithm based on Bayes’ Theorem. It’s used for classification tasks and is especially popular in text classification (like spam detection, sentiment analysis, etc.).
Bayes’ Theorem:
P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
In the context of classification:
P(\text{Class}|\text{Features}) = \frac{P(\text{Features}|\text{Class}) \cdot P(\text{Class})}{P(\text{Features})}

🤓 Why Is It Called “Naïve”?
It’s called naïve because it makes a strong assumption:
All features are independent of each other given the class label.

This assumption is rarely true in real-world data (e.g., in text, words often co-occur), but the model still performs surprisingly well in many cases.
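
A worked example may help show how the independence assumption is used. All probabilities below are made-up numbers for a hypothetical two-word email ("free" and "meeting"), not estimates from real data. The class score is the prior times the product of per-word likelihoods, normalized at the end.

```python
# Hypothetical priors and per-word likelihoods (illustrative numbers only)
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.5, "meeting": 0.1},
    "ham": {"free": 0.05, "meeting": 0.4},
}
email_words = ["free", "meeting"]

# Naive assumption: P(words | class) is the product of the individual P(word | class)
scores = {}
for label in priors:
    score = priors[label]
    for word in email_words:
        score *= likelihoods[label][word]
    scores[label] = score

# Normalize by P(Features), the sum of the unnormalized scores
evidence = sum(scores.values())
posterior = {label: score / evidence for label, score in scores.items()}

print(posterior)
```

Here the unnormalized scores are 0.4 × 0.5 × 0.1 = 0.02 for spam and 0.6 × 0.05 × 0.4 = 0.012 for ham, so the posterior probability of spam is 0.02 / 0.032 = 0.625.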

Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?

ans-  1. Gaussian Naïve Bayes
Assumption: Features follow a normal (Gaussian) distribution.
Use Case:
- Best for continuous numerical data like age, blood pressure, or income.
- Common in medical diagnosis and sensor-data classification, where the inputs are real-valued measurements.
Example: Predicting disease based on lab test values (e.g., glucose levels, cholesterol).

📊 2. Multinomial Naïve Bayes
Assumption: Features represent discrete counts (e.g., word frequencies).
Use Case:
- Ideal for text classification, document categorization, and spam detection.
- Works well with Bag-of-Words or TF-IDF representations.
Example: Classifying emails as spam or not based on word occurrence.

🧮 3. Bernoulli Naïve Bayes
Assumption: Features are binary (0 or 1), indicating presence or absence.
Use Case:
- Suitable for binary feature vectors, like whether a word appears in a document.
- Useful when you care about presence, not frequency.
Example: Sentiment analysis using binary word presence (e.g., “happy” = 1 if present).
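
The three variants map onto scikit-learn's `GaussianNB`, `MultinomialNB`, and `BernoulliNB`. The sketch below uses tiny hypothetical arrays only to show which input type each expects. It is not a benchmark.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Gaussian NB: continuous measurements (hypothetical lab values)
X_cont = np.array([[1.0, 2.1], [1.2, 1.9], [3.0, 4.1], [3.2, 3.9]])
pred_gauss = GaussianNB().fit(X_cont, y).predict([[1.1, 2.0]])[0]

# Multinomial NB: word counts per document (hypothetical 3-word vocabulary)
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 2, 1], [0, 3, 0]])
pred_multi = MultinomialNB().fit(X_counts, y).predict([[1, 0, 0]])[0]

# Bernoulli NB: same counts reduced to presence/absence
# (binarize=0.0 maps any count greater than 0 to 1)
pred_bern = BernoulliNB(binarize=0.0).fit(X_counts, y).predict([[1, 0, 0]])[0]

print("Gaussian:", pred_gauss, "Multinomial:", pred_multi, "Bernoulli:", pred_bern)
```

In each case the query resembles the class 0 examples, so all three models should assign it to class 0.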




In [1]:
#Question 6: Write a Python program to:
# Load the Iris dataset
# Train an SVM Classifier with a linear kernel
# Print the model's accuracy and support vectors.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train an SVM Classifier with a linear kernel
svm_model = SVC(kernel='linear', random_state=42)
svm_model.fit(X_train, y_train)

# Predict and evaluate accuracy
y_pred = svm_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Print support vectors
print("\nSupport Vectors:")
print(svm_model.support_vectors_)

Model Accuracy: 1.00

Support Vectors:
[[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]


In [2]:
#Question 7: Write a Python program to:
# Load the Breast Cancer dataset
# Train a Gaussian Naïve Bayes model
# Print its classification report including precision, recall, and F1-score.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Gaussian Naïve Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predict and evaluate
y_pred = gnb.predict(X_test)
report = classification_report(y_test, y_pred, target_names=data.target_names)
print("Classification Report:\n")
print(report)

Classification Report:

              precision    recall  f1-score   support

   malignant       0.93      0.90      0.92        63
      benign       0.95      0.96      0.95       108

    accuracy                           0.94       171
   macro avg       0.94      0.93      0.94       171
weighted avg       0.94      0.94      0.94       171
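
As a complement to the report, a confusion matrix shows the raw counts behind precision and recall. This is a minimal sketch assuming the same dataset, split, and model as the cell above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)

gnb = GaussianNB().fit(X_train, y_train)

# Rows are true classes, columns are predicted classes,
# ordered as in data.target_names (malignant, benign)
cm = confusion_matrix(y_test, gnb.predict(X_test))
print(data.target_names)
print(cm)
```

The off-diagonal entries are the misclassified samples. For a diagnosis task, the malignant-predicted-as-benign cell is usually the one to watch most closely.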



In [3]:
#Question 8: Write a Python program to:
# Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best C and gamma.
#Print the best hyperparameters and accuracy.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define parameter grid for C and gamma
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf']  # Using RBF kernel for non-linear decision boundaries
}

# Initialize SVM and perform Grid Search
svm = SVC()
grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Evaluate best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print results
print(f"Best Hyperparameters: {grid_search.best_params_}")
print(f"Model Accuracy on Test Set: {accuracy:.2f}")

Best Hyperparameters: {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
Model Accuracy on Test Set: 0.78
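
The 0.78 test accuracy above is probably held back by unscaled inputs: the Wine features span very different ranges, and the RBF kernel depends on distances. A hedged variant, assuming the same split and grid, wraps the SVM in a pipeline with `StandardScaler` so that scaling is fit only on the training folds during cross-validation. The `svc__` prefix comes from the step name that `make_pipeline` assigns automatically.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scale features, then fit an RBF SVM; the scaler is refit inside each CV fold
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

scaled_accuracy = grid.score(X_test, y_test)
print(f"Best Hyperparameters: {grid.best_params_}")
print(f"Model Accuracy on Test Set (scaled): {scaled_accuracy:.2f}")
```

The exact numbers depend on the library version, so no output is recorded here. The design point is that scaling lives inside the pipeline, which avoids leaking test-fold statistics into training.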


In [4]:
#Question 9: Write a Python program to:
#Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using sklearn.datasets.fetch_20newsgroups).
# Print the model's ROC-AUC score for its predictions.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Load a binary subset of the 20 Newsgroups dataset
categories = ['sci.space', 'rec.sport.baseball']
newsgroups = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

# Vectorize the text using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(newsgroups.data)
y = newsgroups.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Multinomial Naïve Bayes Classifier
nb = MultinomialNB()
nb.fit(X_train, y_train)

# Predict probabilities and calculate ROC-AUC score
y_proba = nb.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_proba)
print(f"ROC-AUC Score: {roc_auc:.2f}")

ROC-AUC Score: 1.00
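
Since `fetch_20newsgroups` downloads a real corpus, a fully synthetic alternative can be sketched offline. The vocabularies, document length, and seed below are hypothetical choices. The near-perfect separation it produces reflects the artificial data, not model quality.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(42)

# Hypothetical vocabularies: class-specific words plus shared filler words
space_words = ["orbit", "rocket", "launch", "moon", "satellite"]
baseball_words = ["pitcher", "inning", "homerun", "bat", "league"]
shared_words = ["game", "time", "team", "today", "great", "news"]

def make_doc(topic_words):
    """Build one synthetic document: 3 topic words and 5 shared words."""
    words = list(rng.choice(topic_words, size=3)) + list(rng.choice(shared_words, size=5))
    return " ".join(words)

docs = [make_doc(space_words) for _ in range(100)]
docs += [make_doc(baseball_words) for _ in range(100)]
labels = np.array([0] * 100 + [1] * 100)

X = CountVectorizer().fit_transform(docs)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=42, stratify=labels
)

nb = MultinomialNB().fit(X_train, y_train)
synthetic_auc = roc_auc_score(y_test, nb.predict_proba(X_test)[:, 1])
print(f"ROC-AUC Score (synthetic corpus): {synthetic_auc:.2f}")
```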


Question 10: Imagine you’re working as a data scientist for a company that handles email communications. Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:

● Text with diverse vocabulary
● Potential class imbalance (far more legitimate emails than spam)
● Some incomplete or missing data
Explain the approach you would take to:
● Preprocess the data (e.g. text vectorization, handling missing data)
● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
● Address class imbalance
● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.

ans- 1. Preprocessing the Data
🔹 Handling Missing Data
- Text fields: If email body or subject is missing, fill with "missing" or drop if critical.
- Metadata (e.g., sender, timestamp): Impute with mode or flag as missing using indicator variables.
🔹 Text Vectorization
- Use TF-IDF Vectorizer to convert email text into numerical features:

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X = vectorizer.fit_transform(email_texts)
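
The two preprocessing steps can be combined as follows. This is a sketch under assumptions: `raw_emails` is a hypothetical list in which `None` and blank strings stand in for missing bodies, and the placeholder token "missing" follows the suggestion above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical raw emails; None and whitespace-only strings represent missing bodies
raw_emails = ["Win a free prize now", None, "Meeting moved to 10am", "   "]

# Replace missing or blank bodies with the placeholder token "missing"
email_texts = [
    text if isinstance(text, str) and text.strip() else "missing"
    for text in raw_emails
]

vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X = vectorizer.fit_transform(email_texts)

print(email_texts)
print("TF-IDF matrix shape:", X.shape)
```

In a real pipeline the vectorizer should be fit on the training split only and then applied to the test split with `transform`.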

2. Model Selection: SVM vs. Naïve Bayes

Model          Strengths                                             Limitations
Naïve Bayes    Fast, scalable, works well with sparse text data      Assumes feature independence
SVM            High accuracy, handles high-dimensional data well     Slower training, sensitive to tuning



✅ Recommendation:
- Start with Multinomial Naïve Bayes for baseline performance.
- Use SVM with linear or RBF kernel for refined modeling if accuracy needs improvement.
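
The baseline-then-refine recommendation can be sketched with two pipelines that share the same TF-IDF step. The eight training emails and two test emails are hypothetical and far too small to say anything about real accuracy. `LinearSVC` is used here as a stand-in for an SVM with a linear kernel.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical labeled emails: 1 = spam, 0 = not spam
train_texts = [
    "win a free prize now",
    "free money click here",
    "claim your free prize",
    "click now to win money",
    "meeting agenda for monday",
    "project report attached",
    "lunch with the team tomorrow",
    "please review the attached report",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
new_texts = ["free prize money", "team meeting report"]

models = {
    "MultinomialNB": make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB()),
    "LinearSVC": make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC()),
}

predictions = {}
for name, model in models.items():
    model.fit(train_texts, train_labels)
    predictions[name] = model.predict(new_texts).tolist()
    print(name, predictions[name])
```

Keeping the vectorizer inside each pipeline means both models see identically prepared features, so any difference in results comes from the classifier itself.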

⚖️ 3. Addressing Class Imbalance
- Resampling Techniques:
  - SMOTE (Synthetic Minority Over-sampling Technique)
  - Random undersampling of majority class
- Class Weighting:
  - Use class_weight='balanced' in SVM or adjust priors in Naïve Bayes.
- Threshold Tuning:
  - Adjust the decision threshold to trade spam recall against precision, keeping in mind that false positives hide legitimate mail.
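
Class weighting and threshold tuning can be sketched together. The data here is a synthetic stand-in (about 5% positives) generated with `make_classification`, and the alternative threshold of -0.5 is an arbitrary assumption. In practice it would be chosen on a validation set.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for an imbalanced spam problem (roughly 5% positives)
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Class weighting: errors on the rare class are penalized more heavily
clf = SVC(kernel="linear", class_weight="balanced").fit(X_train, y_train)

# Threshold tuning: lowering the cutoff on the decision score flags more samples as spam
scores = clf.decision_function(X_test)
for threshold in (0.0, -0.5):
    y_pred = (scores > threshold).astype(int)
    rec = recall_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, zero_division=0)
    print(f"threshold={threshold}: recall={rec:.2f}, precision={prec:.2f}")

recall_default = recall_score(y_test, (scores > 0.0).astype(int))
recall_lowered = recall_score(y_test, (scores > -0.5).astype(int))
```

Lowering the threshold can only keep or raise recall, typically at the cost of precision, which is why the choice should follow the relative cost of missed spam versus blocked legitimate mail.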

📊 4. Performance Evaluation


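With imbalanced classes, accuracy alone is misleading, so report per-class precision, recall, and F1 along with ROC-AUC:

from sklearn.metrics import classification_report, roc_auc_score
# Assumes a fitted `model` with predict_proba (SVC needs probability=True) and a held-out X_test, y_test
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))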
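
A small illustration of why metric choice matters under class imbalance: the labels below are hypothetical (95 legitimate emails, 5 spam), and the "model" simply predicts the majority class every time.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical ground truth: 95 legitimate emails (0) and 5 spam emails (1)
y_true = np.array([0] * 95 + [1] * 5)

# A trivial classifier that labels everything as legitimate
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)                      # fraction of spam caught
prec = precision_score(y_true, y_pred, zero_division=0) # no spam predictions at all

print(f"accuracy={acc:.2f}, spam recall={rec:.2f}, spam precision={prec:.2f}")
```

Accuracy is 0.95 even though no spam is caught, which is why precision, recall, F1, and ROC-AUC are the more informative metrics for this task.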



💼 5. Business Impact
- Improved Productivity: Filters out spam, reducing manual email triage.
- Enhanced Security: Detects phishing and malicious content early.
- Customer Trust: Ensures legitimate communications aren’t lost or misclassified.
- Scalability: Automates spam detection across millions of emails with minimal latency.

