<a href="https://colab.research.google.com/github/leahmdmartins10/Phishing-Email-Detection-System-BERT/blob/main/Phishing_Email_Detection_System_Using_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📧 Phishing Email Detection System Using BERT

In this project, we aim to build a phishing email detection model using deep learning techniques, with a focus on the BERT (Bidirectional Encoder Representations from Transformers) architecture.

Phishing emails are deceptive messages designed to trick users into revealing sensitive information. As attackers increasingly use AI to craft convincing emails, traditional rule-based filters fall short. This motivates the need for a more intelligent, language-aware detection system.

We begin by loading and preprocessing real-world phishing and legitimate email datasets. After tokenizing the data, we will train and evaluate a fine-tuned BERT model, and compare its performance to a logistic regression baseline. Our objective is to build a model that accurately classifies emails as "phishing" or "safe" using language patterns and contextual understanding.





In [None]:
from google.colab import userdata
#KaggleAPIKey = userdata.get('KaggleAPIKey')

---
# Mounting Google Drive
The dataset files are stored on Google Drive, so we mount it before loading them.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


---
# Cleaning Data Across Datasets
Make all datasets consistent in labeling, data type, and format:

1. "body": Holds the body text of each email.
2. "urls": Holds a binary flag for whether a URL is present (1: URL, 0: no URL).
3. "label": Holds a binary flag for whether an email is phishing (1: phishing, 0: not phishing).


- Rows containing unparsable or illegal data are removed.
- You can view all data at "APS360_Final_Cleaned_Data" in the shared folder.
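
As a rough illustration of the schema above, the sketch below maps one raw email to the three fields. The helper names `has_url` and `to_record` are hypothetical and not part of this notebook. The regex also treats `www.` as a URL, which is an assumption that goes slightly beyond the plain `'http' in body` check used in the cleaning cell.

```python
import re

# Assumption: a URL is anything starting with http://, https://, or www.
URL_PATTERN = re.compile(r"(https?://|www\.)", re.IGNORECASE)

def has_url(body: str) -> int:
    """Return 1 if the body appears to contain a URL, else 0."""
    return 1 if URL_PATTERN.search(body) else 0

def to_record(body: str, is_phishing: bool) -> dict:
    """Build one row in the shared body/urls/label schema."""
    return {"body": body, "urls": has_url(body), "label": int(is_phishing)}

print(to_record("Verify your account at http://example.test/login", True))
print(to_record("Lunch at noon tomorrow?", False))
```

Storing both flags as integers keeps the columns easy to convert to tensors later on.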

In [None]:
!pip install xlsxwriter
!pip install pandas openpyxl



In [None]:
import os
import pandas as pd
import re

#Folder with your CSVs
source_folder = '/content/drive/MyDrive/APS360 Notes/Datasets'
output_excel_path = os.path.join(source_folder, 'APS360_Final_Cleaned_Data.xlsx')

#Patterns to detect illegal Excel characters and ANSI sequences
ansi_pattern = re.compile(r'[\x1B\x1b]\[[0-9;]*[A-Za-z]|[0-9]+;[0-9]+[Hf]')
illegal_excel_chars = re.compile(r"[\x00-\x08\x0B-\x1F]")

#Function to check if a row contains illegal characters
def row_has_illegal_data(row):
    return any(
        ansi_pattern.search(str(cell)) or illegal_excel_chars.search(str(cell))
        for cell in row
    )

#Create ExcelWriter object
with pd.ExcelWriter(output_excel_path, engine='openpyxl') as writer:
    for filename in os.listdir(source_folder):
        if filename.endswith('.csv'):
            filepath = os.path.join(source_folder, filename)

            try:
                df = pd.read_csv(filepath, on_bad_lines='skip', encoding='utf-8', engine='python')
            except Exception as e:
                print(f"Skipping {filename} due to read error: {e}")
                continue

            #Drop rows with illegal characters
            df = df[~df.apply(row_has_illegal_data, axis=1)]

            #Clean and rename columns
            df.columns = [col.strip() for col in df.columns]
            col_map = {}
            for col in df.columns:
                if col.lower() in ['email text', 'text']:
                    col_map[col] = 'body'
                elif col.lower() == 'email type':
                    col_map[col] = 'label'
            df = df.rename(columns=col_map)

            #Add 'urls' column if missing
            if 'urls' not in df.columns and 'body' in df.columns:
                df['urls'] = df['body'].astype(str).apply(lambda x: 1 if 'http' in x else 0)

            #Keep only ['body', 'urls', 'label']
            keep_cols = [col for col in ['body', 'urls', 'label'] if col in df.columns]
            df = df[keep_cols]

            #Write sheet to Excel
            sheet_name = os.path.splitext(filename)[0][:31]
            try:
                df.to_excel(writer, sheet_name=sheet_name, index=False)
            except Exception as e:
                print(f"Failed to write sheet for {filename}: {e}")

print(f"Done! Cleaned Excel file saved at:\n{output_excel_path}")

Done! Cleaned Excel file saved at:
/content/drive/MyDrive/APS360 Notes/Datasets/APS360_Final_Cleaned_Data.xlsx


---
# Combine Data into One Large Dataset

- Takes the cleaned sheets (one per CSV file) and merges them into one large dataset.
- Removes empty and null rows.
- Randomly shuffles the rows.
- Makes sure that the "label" and "urls" values are numerical for later processing.
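
One way to sanity-check the last step is to count values that do not match a known label instead of mapping them silently. The sketch below is only an illustration: `coerce_binary` and the toy `raw_labels` list are hypothetical, and the accepted strings mirror the ones this notebook uses ('phishing email', 'safe email'), plus float-style strings as an added assumption.

```python
from collections import Counter

# Strings treated as positive/negative. '1.0' and '0.0' are included on the
# assumption that some values may be read back from Excel as floats.
POSITIVE = {"1", "1.0", "phishing email"}
NEGATIVE = {"0", "0.0", "safe email"}

def coerce_binary(value):
    """Return 1, 0, or None when the value is not a recognized label."""
    text = str(value).strip().lower()
    if text in POSITIVE:
        return 1
    if text in NEGATIVE:
        return 0
    return None

# Toy inputs covering ints, floats, strings, and unrecognized values
raw_labels = [1, "Phishing Email", 0.0, " safe email ", "spam?", None]
coerced = [coerce_binary(v) for v in raw_labels]
counts = Counter("unknown" if c is None else c for c in coerced)

print(coerced)
print(counts)
```

A non-zero "unknown" count would suggest that a dataset uses a label spelling the mapping does not cover yet.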

In [None]:
#Force string/int labels to integer 0 or 1
#Used below when combining the sheets (for cleaning purposes)

def clean_numerics(x):
    x_str = str(x).strip().lower()
    #'1.0'/'0.0' cover values that were read back from Excel as floats
    if x_str in ['1', '1.0', 'phishing email']:
        return 1
    elif x_str in ['0', '0.0', 'safe email']:
        return 0
    else:
        #Any unrecognized value defaults to 0
        return 0

In [None]:
#Load all sheets
all_sheets = pd.read_excel(output_excel_path, sheet_name=None)

#Concatenate all sheets into one DataFrame
phishing_df = pd.concat(all_sheets.values(), ignore_index=True)

#Drop rows with missing values (if any)
phishing_df = phishing_df.dropna()

#Shuffle dataset
phishing_df = phishing_df.sample(frac=1, random_state=42).reset_index(drop=True)

#Coerce the label and urls columns to integer 0/1
phishing_df['label'] = phishing_df['label'].apply(clean_numerics)
phishing_df['urls'] = phishing_df['urls'].apply(clean_numerics)


---
# Convert Text Data to Tensor Values

We use the Hugging Face Transformers `BertTokenizer` to convert the email text into tensors.
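
To make the tokenizer's `padding` and `truncation` options concrete, here is a toy sketch in plain Python. It is not BERT's WordPiece tokenizer: the whitespace splitting, the `toy_encode` helper, and the small vocabulary with its ids are all assumptions made for illustration. It only shows how sequences are cut to a maximum length, padded to a common length, and paired with an attention mask.

```python
PAD, CLS, SEP = 0, 1, 2  # toy special-token ids, not BERT's real ids

def toy_encode(texts, max_length):
    """Whitespace-tokenize, truncate, pad to the longest sequence, build masks."""
    vocab = {}
    sequences = []
    for text in texts:
        ids = []
        for word in text.lower().split():
            # Give each new word the next free id after the special tokens
            ids.append(vocab.setdefault(word, len(vocab) + 3))
        # Truncation: leave room for the start and end markers
        ids = [CLS] + ids[: max_length - 2] + [SEP]
        sequences.append(ids)

    longest = max(len(s) for s in sequences)
    # Padding: extend every sequence to the length of the longest one
    input_ids = [s + [PAD] * (longest - len(s)) for s in sequences]
    # Attention mask: 1 for real tokens, 0 for padding
    attention_mask = [[1] * len(s) + [0] * (longest - len(s)) for s in sequences]
    return input_ids, attention_mask

ids, mask = toy_encode(["Click here now to claim your prize", "Hi team"], max_length=6)
print(ids)
print(mask)
```

The first email is cut to six positions, and the second is padded with zeros that the attention mask marks as ignorable.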

In [None]:
from transformers import BertTokenizer
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

#Tokenize email bodies
encoded = tokenizer(
    phishing_df['body'].tolist(),
    padding=True,
    truncation=True,
    return_tensors='pt'
)

#Convert labels and urls to tensors
label = torch.tensor(phishing_df['label'].values)
urls = torch.tensor(phishing_df['urls'].values)


---
# Split Tensor Data into Training, Validation, and Testing Datasets

- Randomly split the encoded email data into 70% training, 15% validation, and 15% test sets.
- Each split contains the input tensors from tokenization (input_ids, attention_mask) along with the corresponding labels and URL indicators (from phishing_df).
- This prepares the data for training and evaluating a classification model.
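
The split sizes can be worked out ahead of time. The sketch below assumes that scikit-learn rounds the held-out count up when `test_size` is a float and gives the remainder to the other part. `split_sizes` is a hypothetical helper, and 67126 is the sample count reported by the size check in this notebook.

```python
import math

def split_sizes(n_samples, holdout_frac=0.3, test_frac_of_holdout=0.5):
    """Return (train, val, test) counts for a two-stage split."""
    # Stage 1: hold out validation + test, rounding the held-out count up
    holdout = math.ceil(holdout_frac * n_samples)
    train = n_samples - holdout
    # Stage 2: split the held-out part into validation and test
    test = math.ceil(test_frac_of_holdout * holdout)
    val = holdout - test
    return train, val, test

print(split_sizes(67126))  # (46988, 10069, 10069)
```

Because the second split takes half of the 30% that was held out, validation and test each end up with about 15% of the full dataset.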




In [None]:
from sklearn.model_selection import train_test_split

#First split into training data (70%) and temp data (validation + testing, 30%)
train_idx, temp_idx = train_test_split(range(len(label)), test_size=0.3, random_state=42)

#Then split temp into validation and testing 15% each
val_idx, test_idx = train_test_split(temp_idx, test_size=0.5, random_state=42)


In [None]:
#Helper function to select the rows of each tensor for a given split
def select_tensors(indices):
    input_ids = encoded['input_ids'][indices]
    attention_mask = encoded['attention_mask'][indices]
    label_heading = label[indices]
    url_heading = urls[indices]
    #Return the sliced labels and urls, not the full tensors
    return input_ids, attention_mask, label_heading, url_heading

train_data = select_tensors(train_idx)
val_data   = select_tensors(val_idx)
test_data  = select_tensors(test_idx)


In [None]:
#Check sizes
print("Train size:", train_data[0].shape[0])
print("Val size:", val_data[0].shape[0])
print("Test size:", test_data[0].shape[0])
print("\n")

#Check that total matches original
total = train_data[0].shape[0] + val_data[0].shape[0] + test_data[0].shape[0]
print("Total samples:", total, "| Original:", len(label))
print("\n")

#Check tensor shapes for debugging
print("Train input_ids shape:", train_data[0].shape)
print("Train attention_mask shape:", train_data[1].shape)
print("Train labels shape:", train_data[2].shape)
print("Train urls shape:", train_data[3].shape)


Train size: 46988
Val size: 10069
Test size: 10069


Total samples: 67126 | Original: 67126


Train input_ids shape: torch.Size([46988, 512])
Train attention_mask shape: torch.Size([46988, 512])
Train labels shape: torch.Size([46988])
Train urls shape: torch.Size([46988])




---

# LEE PREVIOUS TIME: Splitting the Data Into Training Data, Validation Data, and Test Data

In [None]:
from torch.utils.data import random_split

#NOTE: `df` and `Phishing_Email` are assumed to be datasets loaded in an earlier
#version of this notebook; `Phishing_Email` is not defined in the cells above.

#70% / 15% / 15% split sizes; the test set takes the remainder
trainSizedf = int(0.7 * len(df))
valSizedf = int(0.15 * len(df))
testSizedf = len(df) - trainSizedf - valSizedf

dfTrain_Data, dfVal_Data, dfTest_Data = random_split(df, [trainSizedf, valSizedf, testSizedf])

#Same 70% / 15% / 15% split for the Phishing_Email dataset
trainSizePhishing_Email = int(0.7 * len(Phishing_Email))
valSizePhishing_Email = int(0.15 * len(Phishing_Email))
testSizePhishing_Email = len(Phishing_Email) - trainSizePhishing_Email - valSizePhishing_Email

Phishing_EmailTrain_Data, Phishing_EmailVal_Data, Phishing_EmailTest_Data = random_split(Phishing_Email, [trainSizePhishing_Email, valSizePhishing_Email, testSizePhishing_Email])