## Project Overview

### Dataset:

https://www.kaggle.com/datasets/naserabdullahalam/phishing-email-dataset/data?select=phishing_email.csv

 

### Project Objective:

The objective of the project is to classify emails using supervised learning methods. Each email is assigned to one of four classes:

- Spam
- Normal emails
- Phishing
- Fraud

The dataset is composed of several separate sources:

- Enron and Ling datasets: primarily focused on the core content of emails.
- CEAS, Nazario, Nigerian Fraud, and SpamAssassin datasets: provide context about the message, such as the sender, recipient, and date.

The data requires preprocessing to create a unified dataset that includes all necessary information. The combined dataset contains approximately 85,000 emails.

## Preprocessing

The first part of preprocessing involves **importing** the necessary libraries, downloading the required **NLTK** resources (NLTK is a library for text processing), and **creating** a set of **stop words**, i.e. words that carry low informational value.


In [None]:
import pandas as pd
import nltk
import re
from nltk.stem import WordNetLemmatizer

In [None]:
def prepare_nltk() -> None:
    nltk.download('stopwords')
    nltk.download('wordnet')
    nltk.download('omw-1.4')


prepare_nltk()
stop_words_set = set(nltk.corpus.stopwords.words('english'))
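
To illustrate how a stop-word set like the one above is used later, here is a minimal sketch in plain Python. The tiny `SAMPLE_STOP_WORDS` set and the `remove_stop_words` helper are assumptions made for this example only; the notebook itself relies on NLTK's full English list.

```python
# Hypothetical miniature stop-word set; the notebook uses NLTK's English list.
SAMPLE_STOP_WORDS = {"the", "is", "a", "to", "of", "your"}


def remove_stop_words(text: str, stop_words: set[str]) -> str:
    """Lowercase the text and drop every token found in stop_words."""
    return " ".join(word for word in text.lower().split() if word not in stop_words)


print(remove_stop_words("The invoice is attached to your account", SAMPLE_STOP_WORDS))
# prints: invoice attached account
```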

### Division into Different Categories

The **final dataset** will consist of data from **6 different datasets**. Each of these datasets includes a specific category (label) that classifies a message as either **legit (label 0) or not (label 1)**. However, the datasets provide different types of information about the messages, and the **labels have different meanings depending on the dataset**. For this reason, each dataset has been assigned to a specific category:

| Dataset name      | Category  | Label for non-legit messages |
|-------------------|-----------|-------------------------|
| **Enron**         | Spam      | **1**                  |
| **Ling**          | Spam      | **1**                  |
| **SpamAssasin**   | Spam      | **1**                  |
| **CEAS_08**       | Phishing  | **2**                  |
| **Nazario**       | Phishing  | **2**                  |
| **Nigerian_Fraud**| Fraud     | **3**                  |
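
The mapping in the table can be expressed as a small lookup. The sketch below is plain Python and uses hypothetical names (`CATEGORY_LABEL`, `to_multiclass`) that do not appear in the notebook; it only shows how a per-dataset binary label turns into the unified label.

```python
# Label assigned to non-legit messages, per category (taken from the table above).
CATEGORY_LABEL = {"spam": 1, "phishing": 2, "fraud": 3}


def to_multiclass(binary_label: int, category: str) -> int:
    """Map a per-dataset binary label (0 or 1) to the unified multi-class label."""
    if binary_label == 0:
        return 0  # legit messages keep label 0 in every dataset
    if binary_label == 1:
        return CATEGORY_LABEL[category]
    raise ValueError(f"unexpected binary label: {binary_label!r}")


print(to_multiclass(1, "phishing"))  # prints: 2
```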

In [None]:
list_of_spam_dataset = [
    "dataset/non-processed/SpamAssasin.csv",
    "dataset/non-processed/Enron.csv",
    "dataset/non-processed/Ling.csv"
]
list_of_fraud_dataset = [
    "dataset/non-processed/Nigerian_Fraud.csv"
]
list_of_phishing_dataset = [
    "dataset/non-processed/CEAS_08.csv",
    "dataset/non-processed/Nazario.csv"
]


# NOTE: manage_dataset_preprocess is defined in a later cell; run that cell first.
# Pair each group of source files with the label its non-legit messages receive.
dataset_groups = [
    (list_of_spam_dataset, 1),
    (list_of_phishing_dataset, 2),
    (list_of_fraud_dataset, 3),
]

frames = []
for dataset_paths, positive_label in dataset_groups:
    for dataset_path in dataset_paths:
        df = manage_dataset_preprocess(dataset_path)
        # Legit messages keep label 0; the rest get the category label.
        df['label'] = df['label'].map({0: 0, 1: positive_label})
        frames.append(df)

# Concatenate once instead of growing the frame inside the loop.
final_data_frame = pd.concat(frames, ignore_index=True)

### General Data Preprocessing

Each dataset contains different columns. All of them include the **subject** and **body** columns, but some do not include the **receiver** and **sender** columns. If a dataset is missing one of these columns, the corresponding values are set to `None`. The final dataset will include the following five columns:

- **body**
- **subject**
- **receiver**
- **sender**
- **label**

Each of these columns will be processed by a dedicated function.

In [None]:
def manage_dataset_preprocess(dataset_path: str) -> pd.DataFrame:
    df = pd.read_csv(dataset_path)

    if 'sender' not in df.columns:
        df['sender'] = None
    if 'receiver' not in df.columns:
        df['receiver'] = None
    
    # Copy the selection so the column assignments below do not trigger SettingWithCopyWarning.
    df = df[['subject', 'body', 'sender', 'receiver', 'label']].copy()
    df['body'] = df['body'].apply(preprocess_dataset_text)
    df['subject'] = df['subject'].apply(preprocess_dataset_subject)
    df['receiver'] = df['receiver'].apply(preprocess_dataset_sender_receiver)
    df['sender'] = df['sender'].apply(preprocess_dataset_sender_receiver)

    return df


### Column Processing Methodology

Each string in the columns (excluding 'label') is **processed using regex** to reduce the amount of data and eliminate unnecessary noise. The table below summarizes the processing steps for each column:

| Column            | Delete E-mail | Delete Links | Only Chars | Delete Garbage | Lowercase | Lemmatizer |
|-------------------|---------------|--------------|------------|----------------|-----------|------------|
| **Body**          | ✅             | ✅            | ✅          | ✅              | ✅         | ✅          |
| **Subject**       | ❌             | ❌            | ✅          | ❌              | ✅         | ✅          |
| **Sender**, **Receiver** | ❌       | ❌            | ❌          | ❌              | ✅         | ❌          |
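
As a rough illustration of the **Body** row, the sketch below applies the e-mail, link, "only chars", and lowercase steps to a made-up sample string. It reuses the same regular expressions as the notebook but leaves out garbage removal, stop words, and lemmatization so that it runs without NLTK; the function name `clean_body_sketch` is an assumption for this example.

```python
import re


def clean_body_sketch(text: str) -> str:
    """Apply the first body-cleaning steps: e-mails, links, letters only, lowercase."""
    text = re.sub(r'\S+@\S+', '', text)         # delete e-mail addresses
    text = re.sub(r'http\S+|www\S+', '', text)  # delete links
    text = re.sub(r'[^a-zA-Z\s]', '', text)     # keep letters and whitespace only
    return " ".join(text.lower().split())       # lowercase and collapse whitespace


sample = "Dear user, verify at http://example.com/login or mail admin@example.com NOW!!! 100% free"
print(clean_body_sketch(sample))
# prints: dear user verify at or mail now free
```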

#### Definitions:
- **Garbage word**: A word made up of a single letter repeated five or more times (e.g. `aaaaa`), or a word of 15 or more characters. These are removed only from the **'body'** column, where reducing data size is most critical.
- **Stop words**: Frequently used words with low informational value are removed from the **'body'** and **'subject'** columns (discussed in a later section).
- **Sender** and **Receiver**: Besides lowercase letters and digits, the allowed characters are `'.'`, `'-'`, and `'_'`. In these columns, the **username** of the email address is stripped, leaving only the **domain name** and **top-level domain**.

If the input **is not a string** or the processed string **is empty**, the function returns `None`.
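
To make the garbage-word rule concrete, here is a small sketch that applies the two regular expressions used for the body column to a made-up sentence. The helper name `strip_garbage` and the sample text are assumptions for illustration.

```python
import re

# Whole words consisting of one letter repeated five or more times, e.g. "aaaaa".
REPEATED_LETTER = re.compile(r'\b([a-zA-Z])\1{4,}\b')
# Whole words of 15 or more word characters.
LONG_WORD = re.compile(r'\b\w{15,}\b')


def strip_garbage(text: str) -> str:
    """Remove garbage words and collapse the leftover whitespace."""
    text = REPEATED_LETTER.sub('', text)
    text = LONG_WORD.sub('', text)
    return " ".join(text.split())


sample = "hello aaaaa world aaaa internationalization coool"
print(strip_garbage(sample))
# prints: hello world aaaa coool
```

Note that `aaaa` (four letters) and `coool` (a repeated run inside a longer word) survive, because the first pattern only matches whole words made of a single repeated letter.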

#### Code for Processing Columns

In [None]:
def preprocess_dataset_text(text: str) -> str | None:

    if not isinstance(text, str):
        return None
    
    text = re.sub(r'\S+@\S+', '', text)
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Remove garbage
    text = re.sub(r'\b([a-zA-Z])\1{4,}\b', '', text)  # Words made of one letter repeated 5+ times
    text = re.sub(r'\b\w{15,}\b', '', text)  # Words of 15 or more characters

    text = text.lower()
    text = " ".join(word for word in text.split() if word not in stop_words_set)

    lemmatizer = WordNetLemmatizer()
    text = " ".join([lemmatizer.lemmatize(word) for word in text.split()])

    if text == '':
        return None

    return text


def preprocess_dataset_subject(subject: str) -> str | None:
    
    if not isinstance(subject, str) or subject == '':
        return None
    
    lemmatizer = nltk.stem.WordNetLemmatizer()
    
    subject = subject.lower()
    subject = re.sub(r'[^a-zA-Z\s]', '', subject)

    # Remove stop words, then lemmatize the remaining words
    subject = " ".join(word for word in subject.split() if word not in stop_words_set)
    subject = " ".join([lemmatizer.lemmatize(word) for word in subject.split()])

    if subject == '':
        return None

    return subject


def preprocess_dataset_sender_receiver(data: str) -> str | None:
    
    if not isinstance(data, str):
        return None

    data = data.strip().lower()
    data = data.split('@')[-1]        # Keep only the part after the last '@' (the domain)

    match = re.match(r'^[a-z0-9._-]*', data)

    if match:
        data = match.group(0)

    if data == '':
        return None

    return data

### Handling Empty or 'None' Values

Some datasets initially lacked certain required data, as shown below:

| Dataset            | Body | Subject | Sender | Receiver |
|--------------------|------|---------|--------|----------|
| **Enron**          | ✅    | ✅       | ❌      | ❌        |
| **Ling**           | ✅    | ✅       | ❌      | ❌        |
| **SpamAssasin**    | ✅    | ✅       | ✅      | ✅        |
| **CEAS_08**        | ✅    | ✅       | ✅      | ✅        |
| **Nazario**        | ✅    | ✅       | ✅      | ✅        |
| **Nigerian_Fraud** | ✅    | ✅       | ✅      | ✅        |

Additionally, some data was lost during filtering and column processing (all empty values were replaced with `None`). To address this, missing values will be replaced by the **most frequently occurring value** (mode) in each column.

#### Code for Imputing Missing Values

In [None]:
def impute_missing_values(df: pd.DataFrame, column: str) -> None:
    mode_value = df[column].mode()[0]
    df.fillna({column: mode_value}, inplace=True)


impute_missing_values(final_data_frame, "subject")
impute_missing_values(final_data_frame, "body")
impute_missing_values(final_data_frame, "sender")
impute_missing_values(final_data_frame, "receiver")
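
For readers who want to see mode imputation without pandas, here is a minimal sketch on a plain list. The helper `impute_with_mode` is hypothetical, and its tie-breaking may differ from `Series.mode()[0]`, which takes the first of the sorted modes; `Counter.most_common` returns the value that was seen first among equally frequent ones.

```python
from collections import Counter
from typing import Optional


def impute_with_mode(values: list[Optional[str]]) -> list[Optional[str]]:
    """Replace None entries with the most frequent non-missing value."""
    present = [value for value in values if value is not None]
    if not present:
        # Nothing to learn the mode from; return the values unchanged.
        return list(values)
    mode_value = Counter(present).most_common(1)[0][0]
    return [mode_value if value is None else value for value in values]


senders = ["a.com", None, "b.com", "a.com", None]
print(impute_with_mode(senders))
# prints: ['a.com', 'a.com', 'b.com', 'a.com', 'a.com']
```

The explicit guard for a column with no values at all matters in the pandas version too: `df[column].mode()[0]` raises an error when the column contains only missing values.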

### Final Datasets

The processed data will be saved as `.csv` files. Since the project focuses on testing various models with different characteristics, four distinct datasets were created:

- **final.csv**
- **final-domain-only.csv**
- **final-with-stop-words.csv**
- **final-with-stop-words-domain-only.csv**

#### Transformations Applied:

| Final Dataset Name               | Transform 'sender' and 'receiver' to domain format | Include Stop Words |
|----------------------------------|----------------------------------------------------|--------------------|
| **final.csv**                    | ❌                                                  | ❌                 |
| **final-domain-only.csv**        | ✅                                                  | ❌                 |
| **final-with-stop-words.csv**    | ❌                                                  | ✅                 |
| **final-with-stop-words-domain-only.csv** | ✅                                          | ✅                 |

Every model will be trained and evaluated on each of these datasets so that the results can be compared.

In [None]:
final_data_frame.to_csv("dataset/processed/final-domain-only.csv", index=False)
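
The cell above writes only one of the four variants. One possible way to derive all four file names from the two options in the table is sketched below; the function `variant_filename` and its flags are assumptions for illustration, not part of the notebook.

```python
from itertools import product


def variant_filename(domain_only: bool, keep_stop_words: bool) -> str:
    """Build the output file name for one combination of preprocessing options."""
    parts = ["final"]
    if keep_stop_words:
        parts.append("with-stop-words")
    if domain_only:
        parts.append("domain-only")
    return "-".join(parts) + ".csv"


# Enumerate every (domain_only, keep_stop_words) combination.
for domain_only, keep_stop_words in product([False, True], repeat=2):
    print(variant_filename(domain_only, keep_stop_words))
```

Each generated name could then be joined with the `dataset/processed/` directory used in the cell above.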

## Selected Models

Three models from distinct families were chosen for training, to evaluate which approach performs best at categorizing the emails. The selected models are:

- **XGBoost**: A gradient boosting model utilizing an ensemble of decision trees. It is highly efficient and excels in structured data tasks.
- **LSTM**: A recurrent neural network model designed to handle sequential data. It is particularly well-suited for tasks involving temporal dependencies, such as text processing.
- **BERT**: A transformer-based model pre-trained on a large corpus of text. BERT excels in understanding context and meaning in text, making it ideal for natural language processing tasks.

### Purpose of Model Selection
The diversity of these models allows for a comprehensive comparison, highlighting strengths and weaknesses based on the type of data and categorization task.