<a href="https://colab.research.google.com/github/leahmdmartins10/Phishing-Email-Detection-System-BERT/blob/main/ver_2_June_26_Phishing_Email_Detection_System_Using_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📧 Phishing Email Detection System Using BERT

In this project, we aim to build a phishing email detection model using deep learning techniques, with a focus on the BERT (Bidirectional Encoder Representations from Transformers) architecture.

Phishing emails are deceptive messages designed to trick users into revealing sensitive information. As attackers increasingly use AI to craft convincing emails, traditional rule-based filters fall short. This motivates the need for a more intelligent, language-aware detection system.

We begin by loading and preprocessing real-world phishing and legitimate email datasets. After tokenizing the data, we will train and evaluate a fine-tuned BERT model, and compare its performance to a logistic regression baseline. Our objective is to build a model that accurately classifies emails as "phishing" or "safe" using language patterns and contextual understanding.





In [None]:
from google.colab import userdata
#KaggleAPIKey = userdata.get('KaggleAPIKey')

---
# Mounting Google Drive
We mount Google Drive because the dataset files are stored there.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


---
# Cleaning Data across Datasets
Make all datasets consistent in labeling, data type, and format. Each cleaned dataset keeps three columns:

1. "body": Holds the body text of each email.
2. "urls": Holds a boolean flag for whether a URL is present (1: URL, 0: no URL).
3. "label": Holds a boolean flag for whether an email is phishing or safe (1: phishing, 0: not phishing).


- Rows containing unparsable or illegal data (ANSI escape sequences, control characters that Excel cannot store) are removed.
- You can view all cleaned data in "APS360_Final_Cleaned_Data" in the shared folder.
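
As a rough illustration of the cleaning rules listed above, the sketch below applies the same control-character pattern and the simple `'http'` substring check to a few made-up strings. The helper names `has_url` and `is_clean` and the sample texts are hypothetical, introduced only for this example; the notebook itself applies these checks row by row on pandas DataFrames.

```python
import re

# Same control-character class the cleaning cell uses for Excel-illegal characters.
illegal_excel_chars = re.compile(r"[\x00-\x08\x0B-\x1F]")

def has_url(body):
    """Hypothetical helper: 1 if the text contains 'http', else 0."""
    return 1 if 'http' in str(body) else 0

def is_clean(body):
    """Hypothetical helper: True when the pattern finds no control characters."""
    return illegal_excel_chars.search(str(body)) is None

# Made-up sample bodies; the third one contains a bell character (\x07).
samples = [
    "Verify your account at http://example.com/login",
    "Lunch at noon?",
    "Corrupted\x07payload",
]

# Keep only clean bodies and attach the URL flag, mirroring the 'body'/'urls' columns.
rows = [{"body": s, "urls": has_url(s)} for s in samples if is_clean(s)]
print(rows)
```

Note that the range `\x0B-\x1F` also covers the carriage return (`\x0D`), so a body containing a carriage return would be dropped by this check as well. The `'http'` substring test is also a coarse heuristic: it would miss links written without a scheme, such as `www.example.com`.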

In [None]:
!pip install xlsxwriter
!pip install pandas openpyxl



In [None]:
import os
import pandas as pd
import re

#Folder with your CSVs
source_folder = '/content/drive/MyDrive/APS360 Notes/Datasets'
output_excel_path = os.path.join(source_folder, 'APS360_Final_Cleaned_Data.xlsx')

#Patterns to detect illegal Excel characters and ANSI sequences
ansi_pattern = re.compile(r'[\x1B\x1b]\[[0-9;]*[A-Za-z]|[0-9]+;[0-9]+[Hf]')
illegal_excel_chars = re.compile(r"[\x00-\x08\x0B-\x1F]")

#Function to check if a row contains illegal characters
def row_has_illegal_data(row):
    return any(
        ansi_pattern.search(str(cell)) or illegal_excel_chars.search(str(cell))
        for cell in row
    )

#Create ExcelWriter object
with pd.ExcelWriter(output_excel_path, engine='openpyxl') as writer:
    for filename in os.listdir(source_folder):
        if filename.endswith('.csv'):
            filepath = os.path.join(source_folder, filename)

            try:
                df = pd.read_csv(filepath, on_bad_lines='skip', encoding='utf-8', engine='python')
            except Exception as e:
                print(f"Skipping {filename} due to read error: {e}")
                continue

            #Drop rows with illegal characters
            df = df[~df.apply(row_has_illegal_data, axis=1)]

            #Clean and rename columns
            df.columns = [col.strip() for col in df.columns]
            col_map = {}
            for col in df.columns:
                if col.lower() in ['email text', 'text']:
                    col_map[col] = 'body'
                elif col.lower() == 'email type':
                    col_map[col] = 'label'
            df = df.rename(columns=col_map)

            #Add 'urls' column if missing
            if 'urls' not in df.columns and 'body' in df.columns:
                df['urls'] = df['body'].astype(str).apply(lambda x: 1 if 'http' in x else 0)

            #Keep only ['body', 'urls', 'label']
            keep_cols = [col for col in ['body', 'urls', 'label'] if col in df.columns]
            df = df[keep_cols]

            #Write sheet to Excel
            sheet_name = os.path.splitext(filename)[0][:31]
            try:
                df.to_excel(writer, sheet_name=sheet_name, index=False)
            except Exception as e:
                print(f"Failed to write sheet for {filename}: {e}")

print(f"Done! Cleaned Excel file saved at:\n{output_excel_path}")

Done! Cleaned Excel file saved at:
/content/drive/MyDrive/APS360 Notes/Datasets/APS360_Final_Cleaned_Data.xlsx


---
# Combine Data into One Large Dataset

- Takes every cleaned sheet (one per CSV file) and merges them into one large dataset.
- Removes rows with empty or null values.
- Randomly shuffles the rows.
- Makes sure that the "label" and "urls" data is numerical (0 or 1) for later processing.

In [None]:
#Force string/int labels to integer 0 or 1
#Used later when combining the sheets (for cleaning purposes)

def clean_numerics(x):
    x_str = str(x).strip().lower()
    #'1.0' covers columns that pandas read back as floats
    if x_str in ['1', '1.0', 'phishing email']:
        return 1
    #'0', 'safe email', and any unrecognized value all map to 0
    return 0
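
Because the function above maps every unrecognized value to 0, a malformed label would silently be treated as a safe email. One possible safeguard, shown here only as a sketch, is a stricter variant that returns `None` for anything it does not recognize so those rows can be inspected or dropped. The name `strict_label` and the sample values are hypothetical and not part of the original pipeline.

```python
def strict_label(x):
    """Hypothetical stricter variant: returns None for unrecognized values."""
    x_str = str(x).strip().lower()
    if x_str in ('1', '1.0', 'phishing email'):
        return 1
    if x_str in ('0', '0.0', 'safe email'):
        return 0
    return None

# Made-up raw values covering ints, floats, strings with stray whitespace, and junk.
raw = [1, '0', ' Phishing Email ', 'Safe Email', 1.0, 'spam?', None]

mapped = [strict_label(v) for v in raw]
# Collect the inputs that could not be mapped so they can be reviewed.
unknown = [v for v, m in zip(raw, mapped) if m is None]

print(mapped)
print(unknown)
```

Counting the entries in `unknown` before training gives a quick sense of how many rows would otherwise have been defaulted to the safe class.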

In [None]:
#Load all sheets
all_sheets = pd.read_excel(output_excel_path, sheet_name=None)

#Concatenate all sheets into one DataFrame
phishing_df = pd.concat(all_sheets.values(), ignore_index=True)

#Drop rows with missing values (if any)
phishing_df = phishing_df.dropna()

#Shuffle dataset
phishing_df = phishing_df.sample(frac=1, random_state=42).reset_index(drop=True)

#Coerce "label" and "urls" to integer 0 or 1
phishing_df['label'] = phishing_df['label'].apply(clean_numerics)
phishing_df['urls'] = phishing_df['urls'].apply(clean_numerics)


---
# Split Tensor Data into Training, Validation, and Testing Datasets

- Randomly split the email data into 70% training, 15% validation, and 15% test sets.
- The split is made on row indices of phishing_df, so each email body stays aligned with its label and URL indicator. The bodies in each split are tokenized afterward to produce the input tensors (input_ids, attention_mask).
- This prepares the data for use in training and evaluating an AI classification model.
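
The arithmetic behind the two-stage split described above can be checked with a short sketch. It assumes, as far as we understand scikit-learn's behavior, that `train_test_split` rounds the held-out count up and assigns the remaining rows to the other side. The helper `split_sizes` is hypothetical, and the total is taken from the dataset sizes printed at the end of this notebook.

```python
import math

def split_sizes(n_samples, test_size):
    """Hypothetical helper: (kept, held_out) counts for a fractional test_size.

    Assumes the held-out count is rounded up and the rest is kept.
    """
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out

# Total number of emails, summed from the sizes reported at the end of the notebook.
total = 46988 + 10069 + 10069

# Stage 1: 70% training, 30% temporary pool.
n_train, n_temp = split_sizes(total, 0.3)

# Stage 2: split the temporary pool in half for validation and testing.
n_val, n_test = split_sizes(n_temp, 0.5)

print(n_train, n_val, n_test)
```

Taking half of the 30% pool is what yields the 15% validation and 15% test shares of the full dataset.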




In [None]:
from sklearn.model_selection import train_test_split
import torch

#Convert labels and urls to tensors
label = torch.tensor(phishing_df['label'].values)
urls = torch.tensor(phishing_df['urls'].values)

#First split into training data (70%) and temp data (validation + testing, 30%)
train_idx, temp_idx = train_test_split(range(len(label)), test_size=0.3, random_state=42)

#Then split temp into validation and testing 15% each
val_idx, test_idx = train_test_split(temp_idx, test_size=0.5, random_state=42)


In [None]:
# splitting the bodies for train, val, and test data
train_bodies = phishing_df['body'][train_idx].tolist()
val_bodies = phishing_df['body'][val_idx].tolist()
test_bodies = phishing_df['body'][test_idx].tolist()

# converting the training, val, and test urls and labels to tensors
train_urls = torch.tensor(phishing_df['urls'][train_idx].tolist())
train_labels = torch.tensor(phishing_df['label'][train_idx].tolist())


val_urls = torch.tensor(phishing_df['urls'][val_idx].tolist())
val_labels = torch.tensor(phishing_df['label'][val_idx].tolist())


test_urls = torch.tensor(phishing_df['urls'][test_idx].tolist())
test_labels = torch.tensor(phishing_df['label'][test_idx].tolist())

# **Tokenize the training, validation, and testing bodies**
We now tokenize the email bodies from the splits created above. This tokenizing code has been repurposed from Asmita's code.

In [None]:
from transformers import BertTokenizerFast
import torch

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

#Tokenize email bodies
tokenizedTraining = tokenizer(
    train_bodies,
    padding=True,
    truncation=True,
    return_attention_mask=True,
    return_tensors='pt'
)

tokenizedValidation = tokenizer(
    val_bodies,
    padding=True,
    truncation=True,
    return_attention_mask=True,
    return_tensors='pt'
)

tokenizedTest = tokenizer(
    test_bodies,
    padding=True,
    truncation=True,
    return_attention_mask=True,
    return_tensors='pt'
)
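
To make the effect of `padding=True` and `truncation=True` more concrete, here is a toy sketch in plain Python. It is not the BERT tokenizer: the function `pad_and_mask`, the token ids, and the `max_length` of 5 are all made up for illustration, and the real tokenizer handles special tokens more carefully when truncating. The sketch only shows the general idea that long sequences are cut to a maximum length, short ones are padded to a common width, and the attention mask marks real tokens with 1 and padding with 0.

```python
def pad_and_mask(sequences, max_length, pad_id=0):
    """Toy illustration: truncate, pad to the longest sequence, build masks."""
    # Truncation: cut every sequence to at most max_length tokens.
    truncated = [seq[:max_length] for seq in sequences]
    # Padding: extend every sequence to the length of the longest one.
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    # Attention mask: 1 for real tokens, 0 for padding positions.
    attention_mask = [
        [1] * len(seq) + [0] * (width - len(seq)) for seq in truncated
    ]
    return input_ids, attention_mask

# Two made-up token-id sequences of different lengths.
ids, mask = pad_and_mask(
    [[101, 11, 102], [101, 21, 22, 23, 24, 102]],
    max_length=5,
)
print(ids)
print(mask)
```

In the notebook, the resulting width of 512 in the printed shapes suggests that at least one training email reached the model's maximum length, so every other email in that split is padded up to 512 tokens.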




In [None]:
#Print the tensor shapes, then the decoded tokens and attention mask for the first email
print("Training input_ids shape:", tokenizedTraining['input_ids'].shape)
print("Training attention_mask shape:", tokenizedTraining['attention_mask'].shape)
decoded_input_ids = tokenizer.decode(tokenizedTraining['input_ids'][0], skip_special_tokens=True)
print(decoded_input_ids)
print(tokenizedTraining['attention_mask'][0])

Training input_ids shape: torch.Size([46988, 512])
Training attention_mask shape: torch.Size([46988, 512])
greetings from dubai, this letter must come to you as a big surprise, but i believe it is only a day that people meet and become great friends and business partners. i am mr. arif shaikh, currently chief credit & risk officer with a reputable bank here in u. a. e. i write you this proposal in good faith, believing that i can trust you with the information i am about to reveal to you. i have an urgent and very confidential business proposition for you. on november 6, 2000, an iraqi foreign oil consultant / contractor with the chevron petroleum corporation, mr mohammad al nasser made a ( fixed deposit ) for 36 calendar months, valued at us $ 17, 500, 000. 00 ( seventeen million five hundred thousand dollars only ) in my bank and i happen to be his account officer before i was moved to my present position recently. upon maturity in 2003, as his account officer and as well the bank ma

# **Converting the Tokenized Data to Complete Tensors**


In [None]:
from torch.utils.data import TensorDataset
completeTraining_dataset = TensorDataset(
    tokenizedTraining['input_ids'],
    tokenizedTraining['attention_mask'],
    train_labels,
    train_urls
)

completeValidation_dataset = TensorDataset(
    tokenizedValidation['input_ids'],
    tokenizedValidation['attention_mask'],
    val_labels,
    val_urls
)

completeTesting_dataset = TensorDataset(
    tokenizedTest['input_ids'],
    tokenizedTest['attention_mask'],
    test_labels,
    test_urls
)

print("Training dataset size:", len(completeTraining_dataset))
print("Validation dataset size:", len(completeValidation_dataset))
print("Testing dataset size:", len(completeTesting_dataset))

Training dataset size: 46988
Validation dataset size: 10069
Testing dataset size: 10069
