# SVM & Naive Bayes

**Question 1: What is a Support Vector Machine (SVM), and how does it work?**

->

- A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks.
- SVM tries to find the best boundary (hyperplane) that separates data points of different classes with the maximum margin.

### **How does it work?**
---

#### **1. Plot the Data**

- Represent each data point as a vector in feature space (e.g., $(x_1, x_2)$).
- Each point has a class label (e.g., +1 or −1).

---

#### **2. Identify Support Vectors**

- These are the **closest data points to the separating hyperplane**.
- They are **critical** because the hyperplane depends only on them.
- SVM focuses only on these points to build the model.

---

#### **3. Draw the Margin Lines**

- Draw two parallel lines passing through the support vectors (one for each class).
- These lines form the **margin**.
- **Margin width = distance between these two lines**.

---

#### **4. Construct the Optimal Hyperplane**

- The **decision boundary** (hyperplane) lies **exactly in the middle** of the margin lines.
- It is the boundary that separates the two classes.

**Equation of the hyperplane:**

$$
\mathbf{w}^T \mathbf{x} + b = 0
$$

---

#### **5. Apply Classification Condition**

For each data point $(x_i, y_i)$, the condition must be satisfied:

$$
y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1
$$

This ensures all points are correctly classified and lie outside or on the margin.

---

#### **6. Maximize the Margin**

SVM chooses the hyperplane that **maximizes the margin**:

$$
\text{Margin} = \dfrac{2}{\|\mathbf{w}\|}
$$

**Optimization objective:**

$$
\min \frac{1}{2} \|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1
$$

---

#### **7. Predict New Data**

Use the learned $\mathbf{w}$ and $b$ to classify new points:

- If $\mathbf{w}^T \mathbf{x} + b > 0$: predict **+1**  
- If $\mathbf{w}^T \mathbf{x} + b < 0$: predict **−1**
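
The steps above can be sketched with scikit-learn on a small toy dataset. The data points and the very large `C` (used here only to approximate a hard margin) are illustrative assumptions, not part of the original answer.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data with labels -1 / +1.
X = np.array([[-2.0, 0.0], [-1.0, 0.0], [-1.0, 1.0],
              [1.0, 0.0], [1.0, -1.0], [2.0, 0.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C leaves almost no room for margin violations,
# which approximates the hard-margin formulation.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]        # learned weight vector w
b = clf.intercept_[0]   # learned bias b
margin = 2.0 / np.linalg.norm(w)

print("w =", w, "b =", b)
print("margin width =", margin)
print("support vectors:\n", clf.support_vectors_)

# Each training point should satisfy y_i (w^T x_i + b) >= 1,
# up to the solver's numerical tolerance.
print("constraint values:", y * (X @ w + b))

# Prediction uses the sign of w^T x + b.
x_new = np.array([[0.5, 2.0], [-3.0, 1.0]])
print("predictions:", np.sign(x_new @ w + b))
```

For this particular dataset the closest points of the two classes sit at $x_1 = -1$ and $x_1 = 1$, so the margin width should come out close to 2.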

**Question 2: Explain the difference between Hard Margin and Soft Margin SVM.**

->

### 1. Hard Margin SVM

- **Assumes** the data is **linearly separable**.
- Finds the **maximum margin hyperplane** that **perfectly separates** the two classes.
- **No misclassifications** are allowed — all data points must lie **outside or on the correct side** of the margin.
- **Highly sensitive to outliers**.

**Mathematical constraint:**

$$
y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1 \quad \forall i
$$

**Used when:**

- Data is **clean**, with **no overlap** between classes.
- **No noise or outliers** present.

---

### 2. Soft Margin SVM

- Allows **some violations** of the margin (misclassified or margin-crossing points).
- Introduces **slack variables** ($\xi_i$) to tolerate errors.
- Balances **margin maximization** and **classification accuracy** using a parameter **$C$**.

**Optimization objective:**

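$$
\min \left( \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^n \xi_i \right) \quad \text{subject to} \quad y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 \quad \forall i
$$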

- **$C$** = regularization parameter:
  - Large $C$: **less tolerance** for misclassification
  - Small $C$: **more tolerance** for margin violations

**Used when:**

- Data is **not linearly separable**
- Dataset contains **noise** or **outliers**
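
The effect of $C$ can be seen in a short sketch. The overlapping blobs and the two values of `C` below are assumptions chosen for illustration; the slack of each point is computed as $\xi_i = \max(0, 1 - y_i f(\mathbf{x}_i))$, where $f$ is the fitted decision function.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical overlapping two-class data; labels mapped to -1 / +1.
X, y01 = make_blobs(n_samples=200, centers=[[0, 0], [2, 2]],
                    cluster_std=1.2, random_state=0)
y = 2 * y01 - 1

results = {}
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    # Slack xi_i = max(0, 1 - y_i * f(x_i)) measures each margin violation.
    slack = np.maximum(0.0, 1.0 - y * clf.decision_function(X))
    results[C] = {
        "margin": 2.0 / np.linalg.norm(w),
        "total_slack": float(slack.sum()),
        "n_support": int(clf.n_support_.sum()),
    }
    print(f"C={C}: {results[C]}")
```

A small `C` should give a wider margin with more total slack, while a large `C` narrows the margin to reduce violations.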

---


| Feature               | **Hard Margin SVM**                        | **Soft Margin SVM**                         |
|----------------------|--------------------------------------------|---------------------------------------------|
| Data Requirement      | Linearly separable                         | May not be linearly separable               |
| Misclassifications    | Not allowed                                | Allowed (with penalty)                      |
| Slack Variables       | Not used                                   | Used: $\xi_i$                               |
| Handles Outliers      | Poorly                                     | Better (can tolerate some errors)           |
| Regularization Param  | Not required                               | Uses $C$ to balance margin vs. error        |


**Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.**

->
### What is the Kernel Trick?

- The **kernel trick** is used when data is **not linearly separable** in the current (low-dimensional) space.
- Instead of applying a linear Support Vector Classifier (SVC), the kernel trick **implicitly maps data into a higher-dimensional space** where a linear separator may exist.
- This allows SVM to draw a **linear hyperplane in the transformed space**, which corresponds to a **non-linear boundary** in the original space.

> **Example Transformation**:  
> From 2D → 3D (or higher) using a mathematical mapping function.

---

### Example: Polynomial Kernel

**Formula:**

$$
K(x, x') = (x^T x' + c)^d
$$

- $c$: a constant (usually $c \geq 0$)  
- $d$: degree of the polynomial  
- Allows learning **curved decision boundaries** by introducing interaction features like $x_1 x_2$, $x_1^2$, $x_2^2$, etc.
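
To make the "implicit mapping" concrete, the sketch below checks numerically that the degree-2 polynomial kernel on 2D inputs equals an ordinary dot product in a 6-dimensional feature space. The helper `phi` and the sample vectors are illustrative assumptions.

```python
import numpy as np

def poly_kernel(x, z, c=1.0, d=2):
    """Polynomial kernel K(x, z) = (x^T z + c)^d."""
    return (x @ z + c) ** d

def phi(x, c=1.0):
    """Explicit degree-2 feature map for a 2D input (illustrative helper)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_implicit = poly_kernel(x, z)   # computed directly in the original 2D space
k_explicit = phi(x) @ phi(z)     # dot product in the 6D feature space
print(k_implicit, k_explicit)    # both equal 4 for these vectors
```

The kernel gives the same value without ever building the 6D vectors, which is what makes the trick cheap for high degrees and many features.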

---

#### Example Scenario:

Imagine a 2D dataset:

- **Class 1**: Points lie **inside a circle** (e.g., near the origin)  
- **Class 2**: Points lie **outside the circle**

In this case:

- The data is **not linearly separable** in 2D
- A **linear SVM would fail**
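- A **Polynomial Kernel** with degree $d = 2$ can transform the space so that the **circular boundary** becomes a **linear separator** in the new (higher-dimensional) space, because a circle $x_1^2 + x_2^2 = r^2$ is a linear equation in the squared features $x_1^2$ and $x_2^2$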
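
This scenario can be reproduced as a rough sketch with `make_circles`. The sample size, noise level, and split are assumptions. In scikit-learn, `coef0` plays the role of $c$ and `degree` the role of $d$, and the kernel also scales the dot product by `gamma`.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Inner circle = one class, outer circle = the other (synthetic data).
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# A linear kernel can only draw a straight line through the plane.
linear_svm = SVC(kernel="linear").fit(X_train, y_train)

# scikit-learn's polynomial kernel is (gamma * x^T x' + coef0)^degree.
poly_svm = SVC(kernel="poly", degree=2, coef0=1).fit(X_train, y_train)

linear_acc = linear_svm.score(X_test, y_test)
poly_acc = poly_svm.score(X_test, y_test)
print("Linear kernel accuracy:", linear_acc)
print("Polynomial (d=2) accuracy:", poly_acc)
```

On data like this the linear kernel should stay well below the polynomial kernel, which can represent the circular boundary.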

---

### Use Case of Polynomial Kernel

- **Pattern recognition problems** where classes are separated by **curved boundaries**
- **Financial classification tasks** where **feature interactions** (like $x_1 \cdot x_2$) are important
- Works well when:
  - Feature combinations are informative
  - You want to **control model complexity** using the polynomial degree $d$



**Question 4: What is a Naïve Bayes Classifier, and why is it called “naïve”?**

->

### What is Naïve Bayes?

- **Naïve Bayes** is a **probabilistic classifier** based on **Bayes’ Theorem**.
- It is used for **classification tasks**, especially in **text classification**, **spam detection**, **sentiment analysis**, etc.
- It assumes that the features (input variables) are **independent** of each other given the class label — this is the **naïve assumption**.

---

### Bayes’ Theorem:

$$
P(C \mid X) = \frac{P(X \mid C) \cdot P(C)}{P(X)}
$$

Where:

- $P(C \mid X)$: Posterior probability of class $C$ given data $X$  
- $P(X \mid C)$: Likelihood of data $X$ given class $C$  
- $P(C)$: Prior probability of class $C$  
- $P(X)$: Evidence or probability of data $X$

---

### Why is it called "Naïve"?

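- It is called "naïve" because it assumes that every feature is **conditionally independent** of the others given the class label, so the likelihood factorizes as $P(X \mid C) = \prod_j P(x_j \mid C)$, where $x_j$ is the $j$-th feature of $X$.
- Real features (for example, words in an email) are usually correlated, so the assumption rarely holds exactly, yet the classifier often works well in practice.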
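
A small worked example shows how the independence assumption turns Bayes' Theorem into a product of per-feature probabilities. All probabilities below are made-up numbers for illustration only.

```python
# Hypothetical probabilities for a two-word spam example (illustrative only).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.8, "win": 0.5},   # P(word | spam)
    "ham": {"free": 0.1, "win": 0.2},    # P(word | ham)
}
observed = ["free", "win"]  # words seen in the new email

# Naive assumption: P(X | C) is the product of the per-word likelihoods.
joint = {}
for label, prior in priors.items():
    p = prior
    for word in observed:
        p *= likelihoods[label][word]
    joint[label] = p  # P(X | C) * P(C)

# The evidence P(X) is the sum of the joint terms over all classes.
evidence = sum(joint.values())
posterior = {label: p / evidence for label, p in joint.items()}
print(posterior)
```

Here the numerator for spam is $0.4 \times 0.8 \times 0.5 = 0.16$ and for ham $0.6 \times 0.1 \times 0.2 = 0.012$, so the posterior for spam is $0.16 / 0.172 \approx 0.93$.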


**Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?**

->

### 1. Gaussian Naïve Bayes

- Assumes **features are continuous** and **normally (Gaussian) distributed**.
- Common when feature values are **real numbers** (e.g., height, weight, temperature).

**Use when:**
- Features are **continuous** and roughly **normally distributed within each class**
- Examples: **Iris dataset**, **medical data**, **sensor data**

---

### 2. Multinomial Naïve Bayes

- Assumes **features are discrete count data** (non-negative integers).
- Suitable for **word frequency** or **term counts** in documents.

**Use when:**
- Features represent **counts** (e.g., how many times a word appears)
- Common in **text classification**, e.g.:
  - **Spam detection**
  - **News categorization**
  - **Sentiment analysis**

**Example input:**
- `x1 = 3` (word "free" appears 3 times)
- `x2 = 1` (word "win" appears once)

---

### 3. Bernoulli Naïve Bayes

- Assumes **binary features** — whether a feature is **present or absent**.
- Suitable for **binary word presence** in documents.

**Use when:**
- Data is **binary (0 or 1)**, e.g.:
  - Whether a word appears in an email (yes/no)
  - Pixel is on/off in an image

**Example input:**
- `x1 = 1` (word "buy" appears)
- `x2 = 0` (word "cheap" does not appear)
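
The three variants map directly onto scikit-learn classes. The tiny datasets below are hypothetical and only meant to show which kind of input each variant expects.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Gaussian NB: continuous features (hypothetical heights in cm).
X_cont = np.array([[150.0], [155.0], [180.0], [185.0]])
y_cont = np.array([0, 0, 1, 1])
gnb = GaussianNB().fit(X_cont, y_cont)
print("GaussianNB:", gnb.predict([[182.0]]))

# Multinomial NB: word counts per email, columns = ["free", "win"].
X_counts = np.array([[3, 1], [2, 2], [0, 0], [0, 1]])
y_spam = np.array([1, 1, 0, 0])  # 1 = spam, 0 = not spam
mnb = MultinomialNB().fit(X_counts, y_spam)
print("MultinomialNB:", mnb.predict([[3, 1]]))

# Bernoulli NB: only the presence or absence of each word matters.
X_binary = (X_counts > 0).astype(int)
bnb = BernoulliNB().fit(X_binary, y_spam)
print("BernoulliNB:", bnb.predict([[1, 0]]))
```

The same count matrix feeds both the Multinomial and Bernoulli models; the only difference is that the Bernoulli version discards how often a word occurs.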

**Question 6: Write a Python program to:**
- Load the Iris dataset
- Train an SVM Classifier with a linear kernel
- Print the model's accuracy and support vectors.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

In [2]:
data = load_iris()

In [3]:
data.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [4]:
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

In [5]:
df.head()

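|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|-------------------|------------------|-------------------|------------------|--------|
| 0 | 5.1               | 3.5              | 1.4               | 0.2              | 0      |
| 1 | 4.9               | 3.0              | 1.4               | 0.2              | 0      |
| 2 | 4.7               | 3.2              | 1.3               | 0.2              | 0      |
| 3 | 4.6               | 3.1              | 1.5               | 0.2              | 0      |
| 4 | 5.0               | 3.6              | 1.4               | 0.2              | 0      |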


In [6]:
df.isna().sum()

| Column            | Missing values |
|-------------------|----------------|
| sepal length (cm) | 0              |
| sepal width (cm)  | 0              |
| petal length (cm) | 0              |
| petal width (cm)  | 0              |
| target            | 0              |


In [7]:
X = df.drop('target', axis=1)
y = df['target']

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [9]:
model = SVC(kernel='linear')
model.fit(X_train, y_train)

In [10]:
y_pred = model.predict(X_test)

In [11]:
accuracy_score(y_test, y_pred)

1.0

In [12]:
print(model.support_vectors_)

[[5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [4.8 3.4 1.9 0.2]
 [6.1 3.  4.6 1.4]
 [6.5 2.8 4.6 1.5]
 [6.  3.4 4.5 1.6]
 [5.7 2.8 4.5 1.3]
 [6.  2.7 5.1 1.6]
 [6.9 3.1 4.9 1.5]
 [5.9 3.2 4.8 1.8]
 [4.9 2.4 3.3 1. ]
 [6.1 2.9 4.7 1.4]
 [6.2 2.2 4.5 1.5]
 [6.3 2.5 4.9 1.5]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.3 2.7 4.9 1.8]
 [6.1 3.  4.9 1.8]
 [6.5 3.2 5.1 2. ]
 [6.  3.  4.8 1.8]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]
 [7.2 3.  5.8 1.6]
 [6.3 2.8 5.1 1.5]]
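
Beyond the raw vectors, a fitted `SVC` also exposes how many support vectors each class contributes and where they sit in the training set. The sketch below repeats the same split and model as a standalone script (using `return_X_y=True` instead of a DataFrame, which is a simplification).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

model = SVC(kernel="linear").fit(X_train, y_train)

# n_support_ holds the number of support vectors per class;
# support_ holds their row positions in the training set.
print("Support vectors per class:", model.n_support_)
print("Positions in X_train:", model.support_)
print("Total support vectors:", len(model.support_vectors_))
```

Classes that overlap tend to contribute more support vectors than a class that is cleanly separated from the rest.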


**Question 7: Write a Python program to:**
- Load the Breast Cancer dataset
- Train a Gaussian Naïve Bayes model
- Print its classification report including precision, recall, and F1-score.

In [13]:
from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

In [14]:
data = load_breast_cancer()

In [15]:
print(data.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

:Number of Instances: 569

:Number of Attributes: 30 numeric, predictive attributes and the class

:Attribute Information:
    - radius (mean of distances from center to points on the perimeter)
    - texture (standard deviation of gray-scale values)
    - perimeter
    - area
    - smoothness (local variation in radius lengths)
    - compactness (perimeter^2 / area - 1.0)
    - concavity (severity of concave portions of the contour)
    - concave points (number of concave portions of the contour)
    - symmetry
    - fractal dimension ("coastline approximation" - 1)

    The mean, standard error, and "worst" or largest (mean of the three
    worst/largest values) of these features were computed for each image,
    resulting in 30 features.  For instance, field 0 is Mean Radius, field
    10 is Radius SE, field 20 is Worst Radius.

    - 

In [16]:
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

In [17]:
df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


In [18]:
df.isna().sum()

Unnamed: 0,0
mean radius,0
mean texture,0
mean perimeter,0
mean area,0
mean smoothness,0
mean compactness,0
mean concavity,0
mean concave points,0
mean symmetry,0
mean fractal dimension,0


In [19]:
X = df.drop('target', axis=1)
y = df['target']

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [21]:
model = GaussianNB()
model.fit(X_train, y_train)

In [22]:
y_pred = model.predict(X_test)

In [23]:
accuracy_score(y_test, y_pred)

0.9473684210526315

In [24]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.95      0.90      0.93        42
           1       0.95      0.97      0.96        72

    accuracy                           0.95       114
   macro avg       0.95      0.94      0.94       114
weighted avg       0.95      0.95      0.95       114



**Question 8: Write a Python program to:**
- Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best C and gamma.
- Print the best hyperparameters and accuracy.


In [25]:
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV

In [26]:
data = load_wine()

In [27]:
print(data.DESCR)

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

:Number of Instances: 178
:Number of Attributes: 13 numeric, predictive attributes and the class
:Attribute Information:
    - Alcohol
    - Malic acid
    - Ash
    - Alcalinity of ash
    - Magnesium
    - Total phenols
    - Flavanoids
    - Nonflavanoid phenols
    - Proanthocyanins
    - Color intensity
    - Hue
    - OD280/OD315 of diluted wines
    - Proline
    - class:
        - class_0
        - class_1
        - class_2

:Summary Statistics:

                                Min   Max   Mean     SD
Alcohol:                      11.0  14.8    13.0   0.8
Malic Acid:                   0.74  5.80    2.34  1.12
Ash:                          1.36  3.23    2.36  0.27
Alcalinity of Ash:            10.6  30.0    19.5   3.3
Magnesium:                    70.0 162.0    99.7  14.3
Total Phenols:                0.98  3.88    2.29  0.63
Flavanoids:                   0.34  5.08    2.03  1.00

In [28]:
data.data

array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,
        1.065e+03],
       [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,
        1.050e+03],
       [1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,
        1.185e+03],
       ...,
       [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,
        8.350e+02],
       [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,
        8.400e+02],
       [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,
        5.600e+02]])

In [29]:
data.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2])

In [30]:
data.feature_names

['alcohol',
 'malic_acid',
 'ash',
 'alcalinity_of_ash',
 'magnesium',
 'total_phenols',
 'flavanoids',
 'nonflavanoid_phenols',
 'proanthocyanins',
 'color_intensity',
 'hue',
 'od280/od315_of_diluted_wines',
 'proline']

In [31]:
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

In [32]:
df.head()

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline,target
0,14.23,1.71,2.43,15.6,127.0,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065.0,0
1,13.2,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050.0,0
2,13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0,0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0,0
4,13.24,2.59,2.87,21.0,118.0,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735.0,0


In [33]:
X = df.drop('target', axis=1)
y = df['target']

In [34]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=1)

In [35]:
svm_model = SVC(kernel='rbf')

In [36]:
param_grid = {
    'C': [0.1, 0.2, 1, 2, 3, 10, 50, 100],
    'gamma': [1, 0.1, 0.2, 0.001, 0.003]
}

In [37]:
# Reuse the RBF model defined above (SVC() also defaults to kernel='rbf').
grid = GridSearchCV(svm_model, param_grid, cv=5, verbose=3)

In [38]:
grid.fit(X_train, y_train)

Fitting 5 folds for each of 40 candidates, totalling 200 fits
[CV 1/5] END ....................C=0.1, gamma=1;, score=0.435 total time=   0.0s
[CV 2/5] END ....................C=0.1, gamma=1;, score=0.435 total time=   0.0s
[CV 3/5] END ....................C=0.1, gamma=1;, score=0.391 total time=   0.0s
[CV 4/5] END ....................C=0.1, gamma=1;, score=0.391 total time=   0.0s
[CV 5/5] END ....................C=0.1, gamma=1;, score=0.391 total time=   0.0s
[CV 1/5] END ..................C=0.1, gamma=0.1;, score=0.435 total time=   0.0s
[CV 2/5] END ..................C=0.1, gamma=0.1;, score=0.435 total time=   0.0s
[CV 3/5] END ..................C=0.1, gamma=0.1;, score=0.391 total time=   0.0s
[CV 4/5] END ..................C=0.1, gamma=0.1;, score=0.391 total time=   0.0s
[CV 5/5] END ..................C=0.1, gamma=0.1;, score=0.391 total time=   0.0s
[CV 1/5] END ..................C=0.1, gamma=0.2;, score=0.435 total time=   0.0s
[CV 2/5] END ..................C=0.1, gamma=0.2

In [39]:
grid.best_params_

{'C': 50, 'gamma': 0.001}

In [40]:
grid.best_estimator_

In [41]:
grid.best_score_

np.float64(0.7478260869565216)

In [42]:
y_pred = grid.best_estimator_.predict(X_test)

In [43]:
print("Accuracy:",accuracy_score(y_test, y_pred))

Accuracy: 0.6825396825396826


In [44]:
print("Best Parameters:",grid.best_params_)

Best Parameters: {'C': 50, 'gamma': 0.001}
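
The RBF kernel is distance-based, and the Wine features span very different ranges (for example, `proline` is in the hundreds to thousands while `hue` is around 1). A variant worth trying is to standardize the features inside a `Pipeline`; the parameter grid below is an assumption, and whether it beats the accuracy above should be checked by running it.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.35, random_state=1)

# Scaling inside the pipeline is refit on each CV training fold,
# so the validation folds never influence the scaler.
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# Parameters of a pipeline step are addressed as "<step>__<param>".
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [1, 0.1, 0.01, 0.001],
}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

scaled_acc = accuracy_score(y_test, grid.predict(X_test))
print("Best Parameters:", grid.best_params_)
print("Accuracy:", scaled_acc)
```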


**Question 9: Write a Python program to:**
- Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using `sklearn.datasets.fetch_20newsgroups`).
- Print the model's ROC-AUC score for its predictions.

In [45]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

In [46]:
categories = ['sci.space', 'rec.sport.hockey']
newsgroups = fetch_20newsgroups(subset='all', categories=categories)

In [47]:
print(newsgroups.DESCR)

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`~sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

Classes                     20
Samples total            18846
Dimensionality               1
Features                  text

In [48]:
newsgroups.data

['From: mccall@mksol.dseg.ti.com (fred j mccall 575-3539)\nSubject: Re: Vandalizing the sky.\nArticle-I.D.: mksol.1993Apr22.204742.10671\nOrganization: Texas Instruments Inc\nLines: 62\n\nIn <C5tvL2.1In@hermes.hrz.uni-bielefeld.de> hoover@mathematik.uni-bielefeld.de (Uwe Schuerkamp) writes:\n\n>In article <C5t05K.DB6@research.canon.oz.au> enzo@research.canon.oz.au \n>(Enzo Liguori) writes:\n\n>> hideous vision of the future.  Observers were\n>>startled this spring when a NASA launch vehicle arrived at the\n>>pad with "SCHWARZENEGGER" painted in huge block letters on the\n\n>This is ok in my opinion as long as the stuff *returns to earth*.\n\n>>What do you think of this revolting and hideous attempt to vandalize\n>>the night sky? It is not even April 1 anymore.\n\n>If this turns out to be true, it\'s time to get seriously active in\n>terrorism. This is unbelievable! Who do those people think they are,\n>selling every bit that promises to make money? \n\nWell, I guess I\'m left wondering

In [49]:
newsgroups.target

array([1, 0, 1, ..., 0, 0, 1])

In [50]:
df = pd.DataFrame(newsgroups.data, columns=['text'])
df['target'] = newsgroups.target

In [51]:
df.head()

Unnamed: 0,text,target
0,From: mccall@mksol.dseg.ti.com (fred j mccall ...,1
1,From: epritcha@s.psych.uiuc.edu ( Evan Pritcha...,0
2,From: baalke@kelvin.jpl.nasa.gov (Ron Baalke)\...,1
3,From: mse@cc.bellcore.com (25836-michael evenc...,0
4,From: apanjabi@guvax.acc.georgetown.edu\nSubje...,0


In [52]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['text'])
y = df['target']

In [53]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [54]:
model = MultinomialNB()
model.fit(X_train, y_train)

In [55]:
y_probs = model.predict_proba(X_test)[:, 1]
y_probs

array([9.94402854e-01, 3.80843194e-02, 6.69785671e-01, 8.66628537e-03,
       9.06746531e-01, 9.30759525e-01, 9.78533142e-01, 7.37122111e-03,
       8.94321691e-03, 9.71351044e-01, 9.36298089e-03, 9.70436307e-01,
       2.76766032e-02, 9.66948834e-01, 8.14199452e-01, 6.81893149e-02,
       9.69581670e-03, 6.45086002e-03, 9.25364518e-01, 9.52321672e-01,
       4.41920521e-02, 2.49537410e-02, 9.86553579e-01, 9.92467853e-01,
       3.69485551e-03, 9.25830563e-01, 2.89805011e-03, 9.85780366e-01,
       9.08274366e-01, 9.90955187e-01, 9.88136046e-01, 9.14774099e-01,
       1.09390094e-02, 9.40497964e-01, 5.97464357e-02, 7.08205050e-01,
       9.87697102e-01, 9.39945694e-01, 9.74475310e-01, 7.52494164e-01,
       6.23110821e-05, 2.23380438e-02, 2.84644357e-02, 8.24240141e-01,
       9.90526424e-01, 9.73443588e-01, 5.82186845e-03, 8.82337692e-01,
       1.17639062e-03, 8.10017081e-03, 1.42046772e-02, 9.10727558e-01,
       4.86284782e-02, 6.71871605e-02, 1.31264526e-02, 9.61575078e-01,
      

In [56]:
auc = roc_auc_score(y_test, y_probs)
print("ROC-AUC Score:", auc)

ROC-AUC Score: 1.0
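
A ROC-AUC of 1.0 means every positive test post received a higher score than every negative one. The posts shown above still contain headers such as `From:` and `Subject:`, which may make these two groups unusually easy to separate. The toy labels and scores below (assumed values) illustrate the ranking interpretation of the metric.

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted scores for four samples.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)

# Equivalent view: fraction of (positive, negative) pairs ranked correctly.
pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]
pairwise = sum(p > n for p, n in product(pos, neg)) / (len(pos) * len(neg))

print(auc, pairwise)  # both 0.75: 3 of the 4 pairs are ordered correctly
```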


**Question 10: Imagine you’re working as a data scientist for a company that handles email communications.**

Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:
- Text with diverse vocabulary
- Potential class imbalance (far more legitimate emails than spam)
- Some incomplete or missing data

Explain the approach you would take to:
- Preprocess the data (e.g. text vectorization, handling missing data)
- Choose and justify an appropriate model (SVM vs. Naïve Bayes)
- Address class imbalance
- Evaluate the performance of your solution with suitable metrics

And explain the business impact of your solution.


In [57]:
from datasets import load_dataset
data = load_dataset("TrainingDataPro/email-spam-classification")

Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


In [58]:
df = data["train"].to_pandas()
df.head()

Unnamed: 0,title,text,type
0,?? the secrets to SUCCESS,"Hi James,\n\nHave you claim your complimentary...",spam
1,?? You Earned 500 GCLoot Points,"\nalt_text\nCongratulations, you just earned\n...",not spam
2,?? Your GitHub launch code,"Here's your GitHub launch code, @Mortyj420!\n ...",not spam
3,[The Virtual Reward Center] Re: ** Clarifications,"Hello,\n \nThank you for contacting the Virtua...",not spam
4,"10-1 MLB Expert Inside, Plus Everything You Ne...","Hey Prachanda Rawal,\n\nToday's newsletter is ...",spam


In [59]:
df.shape

(84, 3)

In [60]:
df.sample()

Unnamed: 0,title,text,type
60,The Last Call!,"Dear Customer,\n\nAs the tax season comes to a...",spam


In [61]:
df.isna().sum()

Unnamed: 0,0
title,0
text,0
type,0


In [62]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84 entries, 0 to 83
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   title   84 non-null     object
 1   text    84 non-null     object
 2   type    84 non-null     object
dtypes: object(3)
memory usage: 2.1+ KB


In [63]:
df['type'].value_counts()

| type     | count |
|----------|-------|
| not spam | 58    |
| spam     | 26    |


In [64]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.utils import resample

In [65]:
majority = df[df.type == 'not spam']
minority = df[df.type == 'spam']

In [66]:
minority_upsampled = resample(minority,
                               replace=True,
                               n_samples=len(majority),
                               random_state=1)

In [67]:
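# Caution: upsampling before train_test_split can place copies of the same
# spam email in both the train and test sets, so test scores may be optimistic.
df_balanced = pd.concat([majority, minority_upsampled])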

In [68]:
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df_balanced['text'])
y = df_balanced['type']

In [69]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [70]:
model = MultinomialNB()
model.fit(X_train, y_train)

In [71]:
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

print("Classification Report:\n", classification_report(y_test, y_pred))
print("ROC-AUC Score:", roc_auc_score(y_test, y_proba))

Classification Report:
               precision    recall  f1-score   support

    not spam       1.00      0.64      0.78        14
        spam       0.67      1.00      0.80        10

    accuracy                           0.79        24
   macro avg       0.83      0.82      0.79        24
weighted avg       0.86      0.79      0.79        24

ROC-AUC Score: 0.9785714285714286
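
An alternative sketch splits the data first, fits TF-IDF on the training text only, and handles imbalance with class weights instead of duplicating rows. The mini-corpus, the choice of `LinearSVC`, and the split settings are all assumptions for illustration; on the real dataset, Naïve Bayes could be dropped into the same pipeline for comparison.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical mini-corpus: 12 legitimate emails (one with a missing body), 6 spam.
ham = ["meeting moved to monday", "please review the attached report",
       "lunch tomorrow?", "quarterly budget update", None,
       "your invoice is attached", "project status and next steps",
       "notes from the team call", "can you review my pull request",
       "schedule for next week", "reminder: submit your timesheet",
       "agenda for the planning meeting"]
spam = ["win a free prize now", "claim your free reward today",
        "free money, click now", "you won! claim your prize",
        "limited offer: win cash now", "click to claim free cash"]
df = pd.DataFrame({"text": ham + spam,
                   "type": ["not spam"] * len(ham) + ["spam"] * len(spam)})

# Missing bodies become empty strings so the vectorizer can process them.
df["text"] = df["text"].fillna("")

# Split first (stratified), so no email appears in both train and test.
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["type"], test_size=0.25,
    stratify=df["type"], random_state=0)

# TF-IDF is fit on the training text only; class_weight="balanced"
# reweights the rarer spam class instead of duplicating rows.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svm", LinearSVC(class_weight="balanced")),
])
pipe.fit(X_train, y_train)

y_pred = pipe.predict(X_test)
scores = pipe.decision_function(X_test)  # larger value = more spam-like
y_test_binary = (y_test == "spam").astype(int)

print(classification_report(y_test, y_pred, zero_division=0))
print("ROC-AUC:", roc_auc_score(y_test_binary, scores))
```

With so few emails the numbers themselves are not meaningful; the point is the order of operations. For spam filtering, precision and recall on the spam class matter more than plain accuracy, since a legitimate email sent to the spam folder and a spam email reaching the inbox carry different costs.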
