### **SVM & Naive Bayes - Assignment Questions & Answers**

**Q.1. What is a Support Vector Machine (SVM), and how does it work?**
  - A **Support Vector Machine (SVM)** is a powerful supervised machine learning algorithm used for both classification and regression tasks, though it's most commonly applied to classification. The core idea behind SVMs is to find the optimal hyperplane that best separates different classes in the feature space.
  
  Here's a breakdown of how it works:
  1.  **Finding the Optimal Hyperplane:** In a 2D space, a hyperplane is simply a line; in 3D it is a plane; in an $n$-dimensional space it is a flat $(n-1)$-dimensional surface that divides the space into two regions. The goal of an SVM is to find the hyperplane that maximizes the margin between the classes. The margin is the distance from the hyperplane to the nearest data points of each class.
  2.  **Support Vectors:** The data points that are closest to the hyperplane and influence its position are called **support vectors**. These are the critical points that determine the optimal hyperplane and the margin.

  3.  **Maximizing the Margin:** By maximizing the margin, the SVM aims to create a more robust and generalized model. A larger margin generally leads to better performance on unseen data because it provides a buffer zone between the classes.

  4.  **Handling Non-linearly Separable Data:** SVMs can handle cases where the data is not linearly separable in the original feature space. This is done using the **kernel trick**. The kernel trick allows SVMs to implicitly map the data into a higher-dimensional space where it might become linearly separable, without explicitly calculating the coordinates in that higher dimension. Common kernel functions include:
    *   **Linear Kernel:** Suitable when the classes are (at least approximately) linearly separable in the original feature space.
    *   **Polynomial Kernel:** Maps data into a polynomial feature space.
    *   **Radial Basis Function (RBF) Kernel:** A popular choice that can handle complex non-linear relationships.

  5.  **Soft Margin SVM:** In real-world scenarios, data often contains noise or outliers, making it impossible to achieve perfect linear separation. Soft margin SVM introduces a regularization parameter (often denoted by 'C') that allows for some misclassifications or violations of the margin. A smaller 'C' allows more violations but creates a wider margin, while a larger 'C' penalizes violations more heavily, resulting in a narrower margin. This parameter helps balance the trade-off between maximizing the margin and minimizing classification errors.

  In summary, SVMs work by finding the optimal hyperplane that maximizes the margin between classes, using support vectors to define this hyperplane. They can handle non-linearly separable data through the kernel trick and can tolerate some misclassifications with the soft margin approach.
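The margin idea can be checked numerically. The following is a minimal sketch on a hypothetical four-point toy dataset (not part of the assignment data). A very large `C` is used to approximate a hard margin, and the margin width is computed as $2 / \|\mathbf{w}\|$ from the fitted `coef_`.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two points per class, separated along the x-axis.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])

# A very large C approximates a hard-margin SVM on separable data.
toy_svm = SVC(kernel="linear", C=1e6)
toy_svm.fit(X_toy, y_toy)

w = toy_svm.coef_[0]                      # normal vector of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)    # distance between the two margin boundaries

print("Hyperplane normal w:", w)
print("Intercept b:", toy_svm.intercept_[0])
print(f"Margin width: {margin_width:.3f}")
print("Support vectors:\n", toy_svm.support_vectors_)
```

For this toy data the separating line should sit near $x = 1$, so the margin width is expected to be close to 2.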

**Q.2. Explain the difference between Hard Margin and Soft Margin SVM.**
  - **Hard Margin SVM vs. Soft Margin SVM**

  The key difference between Hard Margin and Soft Margin SVM lies in how they handle data points that are not perfectly separable or contain noise:

*   **Hard Margin SVM:**
    *   Assumes the data is **linearly separable** without any errors.
    *   Finds a hyperplane that perfectly separates the classes with the largest possible margin.
    *   Does not allow any data points to fall within the margin or on the wrong side of the hyperplane.
    *   Can be very sensitive to outliers and noise: a single outlier near the boundary can drastically shrink the margin, and a single point lying on the wrong side of the other class makes it impossible to find a separating hyperplane at all.
    *   Requires the data to be perfectly clean and separable, which is rarely the case in real-world scenarios.

*   **Soft Margin SVM:**
    *   Allows for some **misclassifications** or violations of the margin.
    *   Introduces a **regularization parameter (C)** that controls the trade-off between maximizing the margin and minimizing classification errors.
    *   A smaller 'C' allows more violations but creates a wider margin, leading to a more generalized model but potentially more training errors.
    *   A larger 'C' penalizes violations more heavily, resulting in a narrower margin but fewer training errors. This can lead to overfitting if 'C' is too large.
    *   Is more robust to noise and outliers, making it suitable for real-world datasets where perfect separation is not possible.
    *   Is the more commonly used approach in practice.

  In essence, Hard Margin SVM is a stricter version that demands perfect separation, while Soft Margin SVM is more flexible and allows for some errors to achieve better generalization on noisy or non-linearly separable data. The 'C' parameter in Soft Margin SVM provides a way to tune this flexibility.
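The effect of `C` can be observed directly. This is a rough sketch on hypothetical overlapping blobs generated with `make_blobs` (the dataset, the helper name `fit_and_describe`, and the chosen `C` values are illustrative assumptions, not part of the assignment).

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical overlapping two-class data, so perfect separation is unlikely.
X_blobs, y_blobs = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

def fit_and_describe(C):
    """Fit a linear soft-margin SVM and return (margin width, number of support vectors)."""
    model = SVC(kernel="linear", C=C).fit(X_blobs, y_blobs)
    width = 2.0 / np.linalg.norm(model.coef_[0])
    return width, int(model.n_support_.sum())

results = {C: fit_and_describe(C) for C in (0.01, 1.0, 100.0)}
for C, (width, n_sv) in results.items():
    print(f"C={C:>6}: margin width={width:.3f}, support vectors={n_sv}")
```

The margin width should shrink as `C` grows, which matches the trade-off described above: small `C` favors a wide margin, large `C` favors fewer training violations.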

**Q.3. What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.**
  - The **Kernel Trick** is a powerful technique used in Support Vector Machines (SVMs) to handle non-linearly separable data. It allows SVMs to implicitly map data into a higher-dimensional feature space where it may become linearly separable, without actually computing the coordinates of the data in that higher dimension. This is computationally less expensive than explicitly transforming the data.

  Essentially, the kernel trick replaces the dot product of the transformed data points in the higher dimension with a kernel function applied to the original data points. The kernel function calculates the similarity between two data points as if they were in the higher-dimensional space.
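To make the "dot product in a higher dimension" idea concrete, here is a small sketch using the homogeneous degree-2 polynomial kernel, $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x} \cdot \mathbf{z})^2$, for 2D inputs. The helper names `phi` and `poly2_kernel` are hypothetical and exist only for this illustration.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2D point (maps 2D -> 3D)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2, computed in the original 2D space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))   # dot product after mapping both points to 3D
implicit = poly2_kernel(x, z)       # same value, without ever building phi

print(explicit, implicit)
```

Both values agree, which is the point of the trick: the kernel gives the higher-dimensional dot product while only touching the original coordinates.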

  **One example of a kernel is the Radial Basis Function (RBF) kernel.**

*   **RBF Kernel:** The RBF kernel is one of the most commonly used kernels in SVMs. It is defined as:

  $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2)$

   where:
  *   $\mathbf{x}_i$ and $\mathbf{x}_j$ are two data points.
  *   $\|\mathbf{x}_i - \mathbf{x}_j\|^2$ is the squared Euclidean distance between the two data points.
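  *   $\gamma$ is a parameter that controls how far the influence of a single training example reaches. A larger $\gamma$ makes the kernel value fall off quickly with distance, so each example only influences points close to it (a more flexible boundary, with a higher risk of overfitting). A smaller $\gamma$ gives each example a farther reach and produces a smoother boundary.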

  *   **Use Case:** The RBF kernel is particularly useful when the relationship between the data points is non-linear and complex. It can create complex decision boundaries that are not possible with linear kernels. It is often used in image classification, handwriting recognition, and other tasks where the data is not linearly separable in the original feature space. The RBF kernel is a good default choice when you don't have prior knowledge about the data's structure.
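The role of $\gamma$ in the RBF formula can be seen with a quick calculation. This sketch uses two hypothetical points and compares a manual evaluation of the formula with scikit-learn's `rbf_kernel`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two hypothetical points at Euclidean distance 2 from each other.
a = np.array([[0.0, 0.0]])
b = np.array([[2.0, 0.0]])

for gamma in (0.01, 0.1, 1.0, 10.0):
    manual = np.exp(-gamma * np.sum((a - b) ** 2))   # formula from the text
    library = rbf_kernel(a, b, gamma=gamma)[0, 0]    # scikit-learn's implementation
    print(f"gamma={gamma:>5}: K(a, b) = {manual:.6f} (sklearn: {library:.6f})")
```

As $\gamma$ increases, the similarity between the same two points drops toward zero, which is what "closer reach" means in practice.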

**Q.4. What is a Naïve Bayes Classifier, and why is it called “naïve”?**
  - A **Naïve Bayes Classifier** is a probabilistic machine learning algorithm based on Bayes' Theorem. It is commonly used for classification tasks, particularly in text classification and spam filtering. Despite its simplicity, it can perform surprisingly well in many real-world applications.

  Here's how it works:

  1.  **Bayes' Theorem:** The classifier is based on Bayes' Theorem, which describes the probability of a hypothesis given evidence. In the context of classification, it calculates the probability that a given data point belongs to a particular class, given its features. The theorem is stated as:

  $P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$

   Where:

    *   $P(A|B)$ is the posterior probability of class A given feature B.
    *   $P(B|A)$ is the likelihood of feature B given class A.
    *   $P(A)$ is the prior probability of class A.
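    *   $P(B)$ is the marginal probability of observing feature B (the evidence). It is the same for every class, so it only acts as a normalizing constant.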

  2.  **Classification:** To classify a new data point, the Naïve Bayes classifier calculates the probability of the data point belonging to each class, using Bayes' Theorem. The data point is then assigned to the class with the highest probability.

  The algorithm is called **"naïve"** because it makes a simplifying assumption that the features are **conditionally independent** of each other, given the class. In other words, it assumes that the presence or absence of a particular feature does not affect the presence or absence of any other feature, given that we know the class.

  For example, in a text classification task, a Naïve Bayes classifier assumes that, among emails of the "spam" class, the word "free" appearing in an email tells us nothing about whether the word "discount" also appears in it.

  This independence assumption is often not true in reality, as features can be correlated. However, despite this "naïve" assumption, the classifier often performs well in practice, especially when the independence assumption holds approximately or when the dataset is large. The simplicity of the algorithm and its computational efficiency make it a popular choice for many applications.
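The calculation can be traced by hand. The sketch below uses made-up priors and word likelihoods (all numbers and the helper `naive_bayes_posterior` are hypothetical, and only words present in the email are considered) to show how the independence assumption turns the posterior into a product of per-word terms.

```python
# Hypothetical probabilities, chosen only for illustration.
priors = {"spam": 0.3, "not spam": 0.7}
likelihoods = {
    "spam":     {"free": 0.60, "meeting": 0.05},
    "not spam": {"free": 0.05, "meeting": 0.40},
}

def naive_bayes_posterior(words):
    """Return P(class | words) using the conditional-independence assumption."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for word in words:
            score *= likelihoods[label][word]   # independence: multiply per-word likelihoods
        scores[label] = score
    evidence = sum(scores.values())             # P(words), the normalizing constant
    return {label: score / evidence for label, score in scores.items()}

posterior = naive_bayes_posterior(["free", "meeting"])
prediction = max(posterior, key=posterior.get)
print(posterior, "->", prediction)
```

With these numbers the unnormalized scores are $0.3 \times 0.6 \times 0.05 = 0.009$ for spam and $0.7 \times 0.05 \times 0.4 = 0.014$ for not spam, so the email is assigned to "not spam".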

**Q.5. Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?**
  - There are several variants of the Naïve Bayes classifier, distinguished by the probability distribution assumed for the features. The most common ones are:

  1.  **Gaussian Naïve Bayes:**
    *   **Description:** Assumes that the features follow a **Gaussian (normal) distribution**. This means the likelihood of a feature value given a class is calculated using the probability density function of the normal distribution.
    *   **When to use:** It is typically used for **continuous numerical features** that are assumed to be normally distributed. For example, in classifying emails, if you use the length of the email as a feature, and you assume the lengths of emails in each class follow a normal distribution, you would use Gaussian Naïve Bayes.

  2.  **Multinomial Naïve Bayes:**
    *   **Description:** Assumes that the features represent **counts or frequencies** of events and models each sample's feature vector as a draw from a class-specific multinomial distribution. It is most commonly used for **text classification**, where features are typically word counts or term frequencies.
    *   **When to use:** It is suitable for data where features are discrete counts, such as in document classification (e.g., spam filtering), where the features are the counts of words in a document.

  3.  **Bernoulli Naïve Bayes:**
    *   **Description:** Assumes that the features are **binary (Boolean) variables**, meaning they can only take two values, typically 0 or 1. It models the presence or absence of a particular feature.
    *   **When to use:** It is useful for data where features are indicators of whether a particular event occurred or not. In text classification, for example, it can be used to indicate whether a specific word is present in a document (1) or not (0), regardless of its frequency.

  In summary:

  *   Use **Gaussian Naïve Bayes** for continuous, normally distributed data.
  *   Use **Multinomial Naïve Bayes** for discrete count data, like word counts in text.
  *   Use **Bernoulli Naïve Bayes** for binary data, indicating the presence or absence of a feature.

  The choice of which Naïve Bayes variant to use depends on the nature of your data and the type of features you have.
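A short sketch of matching each variant to its feature type follows. The data is synthetic and hypothetical (generated with NumPy for this illustration only), and training accuracy is printed just to show that each model fits its intended kind of input.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
y_demo = np.array([0] * 50 + [1] * 50)

# Continuous features: class means differ (Gaussian NB).
X_cont = np.vstack([rng.normal(0.0, 1.0, (50, 3)), rng.normal(3.0, 1.0, (50, 3))])
# Count features: each class favors a different "word" (Multinomial NB).
X_counts = np.vstack([rng.poisson([5, 1, 1], (50, 3)), rng.poisson([1, 1, 5], (50, 3))])
# Binary features: does each count exceed a threshold? (Bernoulli NB).
X_binary = (X_counts > 2).astype(int)

variants = {
    "GaussianNB": (GaussianNB(), X_cont),
    "MultinomialNB": (MultinomialNB(), X_counts),
    "BernoulliNB": (BernoulliNB(), X_binary),
}
train_accuracy = {}
for name, (model, X_demo) in variants.items():
    train_accuracy[name] = model.fit(X_demo, y_demo).score(X_demo, y_demo)
    print(f"{name}: training accuracy = {train_accuracy[name]:.2f}")
```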

**Dataset Info:**

**● You can use any suitable datasets like Iris, Breast Cancer, or Wine from sklearn.datasets or a CSV file you have.**

**Q.6. Write a Python program to:**

**● Load the Iris dataset**

**● Train an SVM Classifier with a linear kernel**

**● Print the model's accuracy and support vectors.**

**(Include your Python code and output in the code box below.)**


In [None]:
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train an SVM Classifier with a linear kernel
svm_linear = SVC(kernel='linear')
svm_linear.fit(X_train, y_train)

# Make predictions on the test set
y_pred = svm_linear.predict(X_test)

# Print the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the linear SVM classifier: {accuracy:.2f}")

# Print the support vectors
print("\nSupport Vectors:")
print(svm_linear.support_vectors_)

Accuracy of the linear SVM classifier: 1.00

Support Vectors:
[[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.5 5.  1.9]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]


**Q.7. Write a Python program to:**

**● Load the Breast Cancer dataset**

**● Train a Gaussian Naïve Bayes model**

**● Print its classification report including precision, recall, and F1-score.**

**(Include your Python code and output in the code box below.)**

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load the Breast Cancer dataset
breast_cancer = load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Gaussian Naïve Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Make predictions on the test set
y_pred = gnb.predict(X_test)

# Print the classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.93      0.96        43
           1       0.96      1.00      0.98        71

    accuracy                           0.97       114
   macro avg       0.98      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



**Q.8. Write a Python program to:**

**● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best C and gamma.**

**● Print the best hyperparameters and accuracy.**

**(Include your Python code and output in the code box below.)**

In [None]:
from sklearn.datasets import load_wine
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the parameter grid for GridSearchCV
param_grid = {'C': [0.1, 1, 10, 100],
              'gamma': [1, 0.1, 0.01, 0.001],
              'kernel': ['rbf']}  # gamma only affects non-linear kernels such as RBF

# Create a GridSearchCV object
grid_search = GridSearchCV(SVC(), param_grid, refit=True, verbose=2)

# Fit the GridSearchCV object to the training data
grid_search.fit(X_train, y_train)

# Print the best hyperparameters
print("Best Hyperparameters found by GridSearchCV:")
print(grid_search.best_params_)

# Make predictions on the test set using the best estimator
y_pred = grid_search.predict(X_test)

# Print the accuracy of the best model
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy of the best SVM model: {accuracy:.2f}")

Fitting 5 folds for each of 16 candidates, totalling 80 fits
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.0s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.0s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.0s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.0s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.0s
[CV] END .......................C=0.1, gamma=0.1, kernel=rbf; total time=   0.0s
[CV] END .......................C=0.1, gamma=0.1, kernel=rbf; total time=   0.0s
[CV] END .......................C=0.1, gamma=0.1, kernel=rbf; total time=   0.0s
[CV] END .......................C=0.1, gamma=0.1, kernel=rbf; total time=   0.0s
[CV] END .......................C=0.1, gamma=0.1, kernel=rbf; total time=   0.0s
[CV] END ......................C=0.1, gamma=0.01, kernel=rbf; total time=   0.0s
[CV] END ......................C=0.1, gamma=0.01
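
RBF SVMs are sensitive to feature scale, and the Wine features span very different ranges. As a possible variant of the search above (a sketch, not the approach used in the original cell), the features can be standardized inside a pipeline so that each cross-validation fold is scaled using only its own training portion. The variable names, `stratify`, and `cv=5` are assumptions made for this example.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_wine, y_wine = load_wine(return_X_y=True)
Xw_train, Xw_test, yw_train, yw_test = train_test_split(
    X_wine, y_wine, test_size=0.2, random_state=42, stratify=y_wine
)

# make_pipeline names each step after its lowercased class, hence the "svc__" prefix.
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scaled_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [1, 0.1, 0.01, 0.001],
}
scaled_search = GridSearchCV(scaled_svm, scaled_grid, cv=5)
scaled_search.fit(Xw_train, yw_train)

print("Best hyperparameters:", scaled_search.best_params_)
print(f"Best cross-validation accuracy: {scaled_search.best_score_:.2f}")
print(f"Test accuracy: {scaled_search.score(Xw_test, yw_test):.2f}")
```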

**Q.9. Write a Python program to:**

**● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using sklearn.datasets.fetch_20newsgroups).**

**● Print the model's ROC-AUC score for its predictions.**

**(Include your Python code and output in the code box below.)**

In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Load a subset of the 20 newsgroups dataset for binary classification
categories = ['alt.atheism', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=('headers', 'footers', 'quotes'))

X_train, y_train = newsgroups_train.data, newsgroups_train.target
X_test, y_test = newsgroups_test.data, newsgroups_test.target

# Convert text data to TF-IDF features
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train a Multinomial Naïve Bayes model (suitable for text data)
mnb = MultinomialNB()
mnb.fit(X_train_tfidf, y_train)

# Get the predicted probabilities for the positive class
y_pred_proba = mnb.predict_proba(X_test_tfidf)[:, 1]

# Calculate the ROC-AUC score
roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f"ROC-AUC Score: {roc_auc:.2f}")

# Optional: Plot the ROC curve
# fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
# plt.figure()
# plt.plot(fpr, tpr, label=f'ROC curve (area = {roc_auc:.2f})')
# plt.plot([0, 1], [0, 1], 'k--')
# plt.xlim([0.0, 1.0])
# plt.ylim([0.0, 1.05])
# plt.xlabel('False Positive Rate')
# plt.ylabel('True Positive Rate')
# plt.title('Receiver Operating Characteristic (ROC) Curve')
# plt.legend(loc="lower right")
# plt.show()

ROC-AUC Score: 0.96


**Q.10. Imagine you're working as a data scientist for a company that handles email communications.**

**Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:**

**● Text with diverse vocabulary**

**● Potential class imbalance (far more legitimate emails than spam)**

**● Some incomplete or missing data**

**Explain the approach you would take to:**

**● Preprocess the data (e.g. text vectorization, handling missing data)**

**● Choose and justify an appropriate model (SVM vs. Naïve Bayes)**

**● Address class imbalance**

**● Evaluate the performance of your solution with suitable metrics, and explain the business impact of your solution.**

**(Include your Python code and output in the code box below.)**

  - Here's an approach to automatically classify emails as Spam or Not Spam, considering the challenges of diverse vocabulary, class imbalance, and missing data:

**1. Data Preprocessing:**

*   **Handling Missing Data:** Email data can have missing values in various fields (e.g., sender information, subject, body). Depending on the nature of the missing data, you could:
    *   **Impute:** Replace missing values with a placeholder (e.g., "unknown") for categorical features, or with the mean/median for numerical features (though less common in email text classification).
    *   **Remove:** If missing data is pervasive in certain features and those features are not critical, you might consider removing those features.
    *   **Indicator:** Create a binary indicator variable to denote the presence of a missing value in a particular field.
*   **Text Vectorization:** To use machine learning models with text data, you need to convert the text into numerical features. Common techniques include:
    *   **Bag-of-Words (BoW):** Represents each email as a vector where each dimension corresponds to a word in the vocabulary, and the value is the count of that word in the email.
    *   **TF-IDF (Term Frequency-Inverse Document Frequency):** Similar to BoW, but it weights words based on their importance in a document relative to the entire corpus. This helps to downweight common words that appear in many emails.
    *   **Word Embeddings (e.g., Word2Vec, GloVe):** These techniques learn dense vector representations of words that capture semantic relationships. While more complex, they can improve performance on tasks requiring understanding of context.
    *   **Preprocessing Steps for Text:** Before vectorization, apply standard text preprocessing steps:
        *   **Lowercasing:** Convert all text to lowercase.
        *   **Punctuation Removal:** Remove punctuation marks.
        *   **Stop Word Removal:** Remove common words (e.g., "the", "a", "is") that don't carry much meaning.
        *   **Stemming or Lemmatization:** Reduce words to their root form (e.g., "running" -> "run").
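
The preprocessing steps above might be sketched as follows. The three emails are hypothetical, `None` stands in for a missing body, and the column names are assumptions made for this example. Stemming and lemmatization are left out because they need an extra library.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical raw emails; None stands in for a missing body.
emails = pd.DataFrame({
    "body": ["FREE money!!! Click now", None, "Meeting moved to 3 PM, see agenda"],
})

# Record which bodies were missing, then fill them with an empty string.
emails["body_missing"] = emails["body"].isna().astype(int)
emails["body"] = emails["body"].fillna("")

# The default tokenizer lowercases text and ignores punctuation;
# stop_words="english" drops common English function words.
tfidf = TfidfVectorizer(lowercase=True, stop_words="english")
X_text = tfidf.fit_transform(emails["body"])

print("Vocabulary:", sorted(tfidf.vocabulary_))
print("Matrix shape:", X_text.shape)
print("Missing-body indicator:", emails["body_missing"].tolist())
```

The email with the missing body becomes an all-zero TF-IDF row, and the separate indicator column keeps the fact that it was missing available to the model.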

**2. Model Selection and Justification (SVM vs. Naïve Bayes):**

Both SVM and Naïve Bayes are suitable for text classification, but they have different strengths:

*   **Naïve Bayes (Specifically Multinomial Naïve Bayes):**
    *   **Pros:** Simple, fast to train and predict, works well with high-dimensional sparse data (like text), and is a good baseline model. It's particularly effective when the independence assumption (features are conditionally independent given the class) holds reasonably well.
    *   **Cons:** The independence assumption is often violated in real-world text.
    *   **Justification:** Given the diverse vocabulary and the nature of text data (word counts/frequencies), Multinomial Naïve Bayes is a strong initial choice due to its simplicity and efficiency. It can provide a solid baseline performance.

*   **Support Vector Machine (SVM):**
    *   **Pros:** Powerful for high-dimensional data, can find complex decision boundaries using the kernel trick, and often performs well in practice.
    *   **Cons:** Can be computationally more expensive to train than Naïve Bayes, especially with large datasets and complex kernels. Can be sensitive to the choice of kernel and hyperparameters.
    *   **Justification:** If Naïve Bayes performance is not satisfactory, an SVM is a natural next step. A linear kernel is usually a sensible first choice for high-dimensional sparse text features, which are often close to linearly separable. A non-linear kernel such as RBF can be tried if a linear boundary proves insufficient, at a higher training cost.

**Recommended Approach:** Start with a Multinomial Naïve Bayes model as a baseline. If needed, then experiment with an SVM model, possibly with GridSearchCV to tune hyperparameters.

**3. Addressing Class Imbalance:**

Class imbalance (many more legitimate emails than spam) can lead to a model that is biased towards the majority class (not spam). Techniques to address this include:

*   **Resampling Techniques:**
    *   **Oversampling:** Duplicate instances of the minority class (spam) to balance the dataset. Techniques like SMOTE (Synthetic Minority Over-sampling Technique) can generate synthetic minority class samples.
    *   **Undersampling:** Randomly remove instances from the majority class (not spam) to match the number of minority class instances. This can lead to loss of information.
*   **Class Weighting:** Many machine learning algorithms (including SVM and some Naïve Bayes implementations) allow you to assign higher weights to the minority class during training. This penalizes misclassifications of the minority class more heavily.
*   **Using Appropriate Evaluation Metrics:** Avoid relying solely on accuracy when dealing with class imbalance (see below).
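
Class weighting can be illustrated with scikit-learn's `"balanced"` heuristic. The label counts below are hypothetical (90 legitimate emails, 10 spam), and `LinearSVC` is used here only as one example of a classifier that accepts `class_weight`.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 90 legitimate (0) and 10 spam (1).
y_imb = np.array([0] * 90 + [1] * 10)

# "balanced" weights are n_samples / (n_classes * count_per_class).
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_imb)
print(dict(zip([0, 1], weights)))

# The same weighting can be requested directly when constructing the classifier.
weighted_svm = LinearSVC(class_weight="balanced")
```

Here the spam class receives a weight of 5.0 versus roughly 0.56 for legitimate email. `MultinomialNB` has no `class_weight` parameter, but its `fit` method accepts `sample_weight`, and class priors can be set through `class_prior`.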

**4. Evaluating Performance with Suitable Metrics:**

Accuracy can be misleading with class imbalance. Instead, use metrics that provide a more nuanced view of performance:

*   **Precision:** Of all the emails classified as spam, what proportion were actually spam? High precision means few false positives, i.e., few legitimate emails wrongly sent to the spam folder.
*   **Recall (Sensitivity):** Of all the actual spam emails, what proportion were correctly classified as spam? High recall means few false negatives, i.e., little spam reaching the inbox.
*   **F1-Score:** The harmonic mean of precision and recall, providing a single metric that balances both.
*   **ROC-AUC (Receiver Operating Characteristic - Area Under Curve):** Measures the ability of the model to distinguish between the classes. A higher AUC indicates better performance.
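
To show why accuracy alone can mislead, the following sketch computes the metrics by hand on ten hypothetical predictions (1 = spam, 0 = not spam) and checks them against scikit-learn.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 3 spam emails out of 10.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_hat))  # spam caught
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_hat))  # legitimate flagged as spam
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_hat))  # spam missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = sum(t == p for t, p in zip(y_true, y_hat)) / len(y_true)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print("sklearn:", precision_score(y_true, y_hat), recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```

Accuracy is 0.80 even though a third of the spam is missed (recall of about 0.67), which is the kind of gap these metrics are meant to expose.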

**5. Business Impact of the Solution:**

Implementing an effective spam classification solution can have significant business impact:

*   **Increased User Productivity:** Users spend less time dealing with unwanted spam, allowing them to focus on important emails.
*   **Improved Security:** Reduces the risk of users clicking on malicious links or opening infected attachments in spam emails.
*   **Reduced Infrastructure Costs:** Less storage and processing power are needed to handle spam emails.
*   **Enhanced User Experience:** Users have a cleaner and more relevant inbox.
*   **Protection of Brand Reputation:** Prevents the company's email system from being used to send spam, which could damage its reputation.

By implementing a robust spam classification system, the company can improve efficiency, security, and user satisfaction, leading to a positive business impact.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, roc_auc_score

# Create a synthetic dataset (replace with your actual email data)
data = {'email_text': ["This is a legitimate email about a meeting.",
                       "Buy now and get a free gift!",
                       "Meeting rescheduled to 3 PM.",
                       "Spammy subject line: Urgent financial matter",
                       "Hello, let's discuss the project.",
                       " nigeria lottery winner click here",
                       "Project update attached.",
                       "Free money no strings attached"],
        'label': ['not spam', 'spam', 'not spam', 'spam', 'not spam', 'spam', 'not spam', 'spam']}

df = pd.DataFrame(data)

X = df['email_text']
y = df['label']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42) # Using a larger test size for this small example

# Text Vectorization
vectorizer = TfidfVectorizer(stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train a Multinomial Naïve Bayes model
mnb = MultinomialNB()
mnb.fit(X_train_tfidf, y_train)

# Make predictions
y_pred = mnb.predict(X_test_tfidf)
# Look up the 'spam' column explicitly rather than assuming it is column 1
spam_index = list(mnb.classes_).index('spam')
y_pred_proba = mnb.predict_proba(X_test_tfidf)[:, spam_index]  # Probability for the 'spam' class

# Evaluate the model
print("Classification Report:")
print(classification_report(y_test, y_pred))

# ROC-AUC needs binary labels: 1 for 'spam', 0 for 'not spam'
y_test_binary = [1 if label == 'spam' else 0 for label in y_test]

try:
    roc_auc = roc_auc_score(y_test_binary, y_pred_proba)
    print(f"\nROC-AUC Score: {roc_auc:.2f}")
except ValueError as e:
    print(f"\nCould not calculate ROC-AUC: {e}")
    print("ROC-AUC requires at least one instance of each class in the test set.")

Classification Report:
              precision    recall  f1-score   support

    not spam       0.25      1.00      0.40         1
        spam       0.00      0.00      0.00         3

    accuracy                           0.25         4
   macro avg       0.12      0.50      0.20         4
weighted avg       0.06      0.25      0.10         4


ROC-AUC Score: 0.67


