# Assignment 1 - Fundamentals of Embedding and BERT

**University of Chicago**
**MS in Applied Data Science**

Course: Generative AI Principles 

Date: 04/05/2025

Author: Hyunji Amy Kim

In [46]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score, f1_score

In [2]:
df = pd.read_csv('Phishing_Email.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Email Text,Email Type
0,0,"re : 6 . 1100 , disc : uniformitarianism , re ...",Safe Email
1,1,the other side of * galicismos * * galicismo *...,Safe Email
2,2,re : equistar deal tickets are you still avail...,Safe Email
3,3,\nHello I am your hot lil horny toy.\n I am...,Phishing Email
4,4,software at incredibly low prices ( 86 % lower...,Phishing Email


### 1. Email Classification using KNN (With first 1000 rows)

In [48]:
# Use the first 1000 rows
df_subset = df.iloc[:1000].copy()

# Drop rows with missing 'Email Text'
df_subset.dropna(subset=['Email Text'], inplace=True)

# Vectorize the email text using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
X = vectorizer.fit_transform(df_subset['Email Text'])

# Encode the labels ('Phishing Email' and 'Safe Email' to numeric labels)
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df_subset['Email Type'])

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Make predictions
y_pred = knn.predict(X_test)

# Evaluate the classifier
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=label_encoder.classes_)
f1 = f1_score(y_test, knn.predict(X_test), average="weighted")

In [49]:
print("Accuracy: ", accuracy)
print("F1 Score: ", f1)
print("Classification Report: \n", report)

Accuracy:  0.57
F1 Score:  0.5240067911714771
Classification Report: 
                 precision    recall  f1-score   support

Phishing Email       0.49      0.99      0.65        82
    Safe Email       0.97      0.28      0.43       118

      accuracy                           0.57       200
     macro avg       0.73      0.63      0.54       200
  weighted avg       0.77      0.57      0.52       200



**Summary:** 

Using K-Nearest Neighbors (KNN) without BERT resulted in an overall accuracy of 57% and an F1 score of 0.52, suggesting that the model is only slightly better than random guessing and struggles to balance false positives and false negatives. 

While it achieved high recall (0.99) for phishing emails, it performed poorly in identifying safe emails (recall = 0.28), indicating a strong bias toward predicting emails as phishing. 

As a result, the KNN model produced a high number of false positives for phishing.

### 2. Comparing [CLS] Token and Average Token Embeddings

BERT provides embeddings for each token of the input text, including: 

**[CLS] token embedding:** Represent the entire sequence for classification tasks. 

**Average token embedding:** Average the embeddings of all tokens. 

This section explores the differences between using the [CLS] token embedding and the average of all token embeddings for classification.

#### (1) Calculating the average embedding for all tokens in a sequence

In [16]:
import torch
from transformers import BertTokenizer, BertModel

import matplotlib.pyplot as plt

  from .autonotebook import tqdm as notebook_tqdm


In [25]:
# Extract the text and labels
texts = df_subset["Email Text"].tolist()
labels = df_subset["Email Type"].tolist()

In [18]:
# Convert 'Phishing Email' and 'Safe Email' to numeric labels (e.g., 0 and 1)
label_encoder = LabelEncoder()
labels_encoded = label_encoder.fit_transform(labels)

In [19]:
# Load pre-trained BERT tokenizer and model (base uncased version)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()  # Set model to inference mode (no training)

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

In [None]:
# Get BERT embeddings ([CLS] and Average)
def get_bert_embeddings(texts):
    cls_embeddings = []
    avg_embeddings = []

    with torch.no_grad():  # No gradients needed
        for text in texts:
            # Tokenize and encode the text (This line automatically marks the text before Tokenizing)
            inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
            outputs = model(**inputs)

            # Last hidden states of all tokens (shape: 1, seq_len, hidden_dim)
            last_hidden_state = outputs.last_hidden_state.squeeze(0)  # Remove batch dim

            # [CLS] token embedding is always the first token (index 0)
            cls_emb = last_hidden_state[0].numpy()

            # Average of all token embeddings
            avg_emb = last_hidden_state.mean(dim=0).numpy()

            cls_embeddings.append(cls_emb)
            avg_embeddings.append(avg_emb)

    return np.array(cls_embeddings), np.array(avg_embeddings)

# Get embeddings
cls_embs, avg_embs = get_bert_embeddings(texts)


In [None]:
# Confirm the shape
cls_embs.shape, avg_embs.shape

((998, 768), (998, 768))

#### (2) Comparing the performance using F1 scores of a KNN classifier using these two embedding (CLS embedding and average embedding) methods

In [34]:
# Train-test split (80% training, 20% testing) for KNN
X_train_cls, X_test_cls, y_train, y_test = train_test_split(cls_embs, labels_encoded, test_size=0.2, random_state=42)
X_train_avg, X_test_avg, _, _ = train_test_split(avg_embs, labels_encoded, test_size=0.2, random_state=42)

In [51]:
# KNN with [CLS] embedding 
knn_cls = KNeighborsClassifier(n_neighbors=5)
knn_cls.fit(X_train_cls, y_train)
y_pred_cls = knn_cls.predict(X_test_cls)

# KNN with average embedding 
knn_avg = KNeighborsClassifier(n_neighbors=5)
knn_avg.fit(X_train_avg, y_train)
y_pred_avg = knn_avg.predict(X_test_avg)

In [53]:
# F1 Scores
f1_cls = f1_score(y_test, y_pred_cls, average="weighted")
f1_avg = f1_score(y_test, y_pred_avg, average="weighted")

print("F1 Score using [CLS] embedding:", f1_cls)
print("F1 Score using average embedding:", f1_avg)

F1 Score using [CLS] embedding: 0.9301225414478427
F1 Score using average embedding: 0.9350586718246293


In [52]:
# Classification Reports
report_cls = classification_report(y_test, y_pred_cls, target_names=label_encoder.classes_)
report_avg = classification_report(y_test, y_pred_avg, target_names=label_encoder.classes_)

print("Classification Report using [CLS] embedding: \n", report_cls)
print("Classification Report using average embedding: \n", report_avg)

Classification Report using [CLS] embedding: 
                 precision    recall  f1-score   support

Phishing Email       0.90      0.93      0.92        82
    Safe Email       0.95      0.93      0.94       118

      accuracy                           0.93       200
     macro avg       0.93      0.93      0.93       200
  weighted avg       0.93      0.93      0.93       200

Classification Report using average embedding: 
                 precision    recall  f1-score   support

Phishing Email       0.92      0.93      0.92        82
    Safe Email       0.95      0.94      0.94       118

      accuracy                           0.94       200
     macro avg       0.93      0.93      0.93       200
  weighted avg       0.94      0.94      0.94       200



**Summary:**

Using BERT embeddings significantly improved the KNN model’s performance. 

With the [CLS] embedding, the model achieved an accuracy of 93% and an F1 score of 0.93. 

When using the average embedding, the performance improved slightly, reaching an accuracy of 94% and an F1 score of 0.94. 

Both methods achieved high precision and recall across classes, but the average embedding showed a slight edge, particularly in capturing more safe emails correctly.

### 3. Compare 768 dimensional embedding vs 2 dimensional embedding knn results using UMAP and PCA dimensional reduction techniques

This section explores how dimensionality reduction techniques, specifically UMAP and PCA, affect the performance of a KNN classifier when using high-dimensional BERT embeddings by comparing the classifier's performance using original 768-dimensional embeddings against 2-dimensional embeddings obtained through dimensionality reduction.

#### (1) Apply PCA and UMAP to reduce BERT embeddings ([CLS]) to 2 dimension


In [38]:
pip install umap-learn

Collecting umap-learn
  Using cached umap_learn-0.5.7-py3-none-any.whl.metadata (21 kB)
Collecting numba>=0.51.2 (from umap-learn)
  Downloading numba-0.61.0-cp313-cp313-macosx_11_0_arm64.whl.metadata (2.7 kB)
Collecting pynndescent>=0.5 (from umap-learn)
  Using cached pynndescent-0.5.13-py3-none-any.whl.metadata (6.8 kB)
Collecting llvmlite<0.45,>=0.44.0dev0 (from numba>=0.51.2->umap-learn)
  Downloading llvmlite-0.44.0-cp313-cp313-macosx_11_0_arm64.whl.metadata (4.8 kB)
Using cached umap_learn-0.5.7-py3-none-any.whl (88 kB)
Downloading numba-0.61.0-cp313-cp313-macosx_11_0_arm64.whl (2.8 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.8/2.8 MB[0m [31m17.9 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hUsing cached pynndescent-0.5.13-py3-none-any.whl (56 kB)
Downloading llvmlite-0.44.0-cp313-cp313-macosx_11_0_arm64.whl (26.2 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m26.2/26.2 MB[0m [31m42.6 MB/s[0m eta [36m0:00:00[0ma [36m0:00

In [39]:
from sklearn.decomposition import PCA
import umap.umap_ as umap

# Dimensionality Reduction
pca_model = PCA(n_components=2) # PCA
umap_model = umap.UMAP(n_components=2, random_state=42) # UMAP

cls_pca = pca_model.fit_transform(cls_embs)
cls_umap = umap_model.fit_transform(cls_embs)


  warn(
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.


In [42]:
# Check if dimension is correctly reduced
cls_pca.shape, cls_umap.shape

((998, 2), (998, 2))

#### (2) Train a KNN classifier using the original embeddings ([CLS]), PCA-reduced embeddings, and UMAP-reduced embeddings .

In [55]:
# Train/test split on reduced
X_train_pca, X_test_pca, y_train, y_test = train_test_split(cls_pca, labels, test_size=0.2, random_state=42)
X_train_umap, X_test_umap, _, _ = train_test_split(cls_umap, labels, test_size=0.2, random_state=42)

# KNN on reduced embeddings
knn.fit(X_train_pca, y_train)
knn.fit(X_train_umap, y_train)

y_pred_pca = knn.predict(X_test_pca)
y_pred_umap = knn.predict(X_test_umap)

#### (3) Compare the classifiers' performances using F1 scores to understand the impact of dimensionality reduction.

In [56]:
# Results
f1_pca = f1_score(y_test, y_pred_pca, average="weighted")
f1_umap = f1_score(y_test, y_pred_umap, average="weighted")

print("F1 Score - CLS:", f1_cls)
print("F1 Score - AVG:", f1_avg)
print("F1 Score - PCA (CLS→2D):", f1_pca)
print("F1 Score - UMAP (CLS→2D):", f1_umap)

F1 Score - CLS: 0.9301225414478427
F1 Score - AVG: 0.9350586718246293
F1 Score - PCA (CLS→2D): 0.2598095238095238
F1 Score - UMAP (CLS→2D): 0.9196666666666666


In [57]:
# Classification Reports
report_pca = classification_report(y_test, y_pred_pca, target_names=label_encoder.classes_)
report_umap = classification_report(y_test, y_pred_umap, target_names=label_encoder.classes_)

print("Classification Report using [CLS] embedding: \n", report_pca)
print("Classification Report using average embedding: \n", report_umap)

Classification Report using [CLS] embedding: 
                 precision    recall  f1-score   support

Phishing Email       0.41      1.00      0.59        82
    Safe Email       1.00      0.02      0.03       118

      accuracy                           0.42       200
     macro avg       0.71      0.51      0.31       200
  weighted avg       0.76      0.42      0.26       200

Classification Report using average embedding: 
                 precision    recall  f1-score   support

Phishing Email       0.92      0.88      0.90        82
    Safe Email       0.92      0.95      0.93       118

      accuracy                           0.92       200
     macro avg       0.92      0.91      0.92       200
  weighted avg       0.92      0.92      0.92       200



**Summary:**

Using PCA, the model's performance dropped drastically with only 42% accuracy and an F1 score of 0.26. In addition, the classifier predicts nearly all emails as phishing. 

UMAP preserved most of the original embedding's structure, achieving an accuracy of 92% and an F1 score of 0.92, nearly matching the full-dimensional average embedding performance. 

This showed that UMAP is far more effective than PCA for preserving semantic information in low-dimensional space when using KNN with BERT embeddings.