## Project Overview

### Dataset:

https://www.kaggle.com/datasets/naserabdullahalam/phishing-email-dataset/data?select=phishing_email.csv

 

### Project Objective:

The objective of the project is to classify emails using supervised learning methods. Each email is assigned to one of four classes:

- Spam
- Normal emails
- Phishing
- Fraud

The dataset is composed of several separate sources:

- Enron and Ling datasets: primarily focused on the core content of emails.
- CEAS, Nazario, Nigerian Fraud, and SpamAssassin datasets: provide context about the message, such as the sender, recipient, date, etc.

The data requires preprocessing to create a unified dataset that includes all necessary information. The combined dataset contains approximately 85,000 emails.
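
As a rough illustration of what unifying the sources could look like, the sketch below aligns two toy frames to one shared set of columns. The column names in `UNIFIED_COLUMNS` and the helper `to_unified_schema` are assumptions made for this example, not names taken from the source files.

```python
import pandas as pd

# Hypothetical unified schema: these column names are assumptions,
# not taken from the source files.
UNIFIED_COLUMNS = ["sender", "receiver", "date", "subject", "body", "label"]


def to_unified_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with exactly UNIFIED_COLUMNS, in that order.

    Columns missing from df are added and filled with NaN;
    columns not listed in UNIFIED_COLUMNS are dropped.
    """
    return df.reindex(columns=UNIFIED_COLUMNS)


# Toy stand-ins for a content-only source and a source with message metadata.
content_only = pd.DataFrame(
    {"subject": ["hello"], "body": ["see you at noon"], "label": [0]}
)
with_metadata = pd.DataFrame(
    {
        "sender": ["a@example.com"],
        "receiver": ["b@example.com"],
        "date": ["2008-08-05"],
        "subject": ["urgent"],
        "body": ["verify your account"],
        "label": [1],
    }
)

unified = pd.concat(
    [to_unified_schema(content_only), to_unified_schema(with_metadata)],
    ignore_index=True,
)
```

Rows from the content-only source simply carry missing values in the metadata columns, so later steps can decide how to handle them.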

## Preprocessing

The first part of preprocessing **imports** the required libraries and the **nltk packages** (nltk is a text-processing library) and **creates** the **stop_words** set: words that carry little informational content.

In [None]:
import pandas as pd
import nltk
import re
from nltk.stem import WordNetLemmatizer

In [None]:
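def prepare_nltk() -> None:
    """Download the NLTK resources used for stop-word removal and lemmatization."""
    nltk.download('stopwords')
    nltk.download('wordnet')
    nltk.download('omw-1.4')


prepare_nltk()
# English stop words: common words that carry little informational content.
stop_words_set = set(nltk.corpus.stopwords.words('english'))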
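
To make the role of the stop-word set concrete, here is a minimal sketch of how such a set might be applied to a message. The function `clean_tokens` and the tiny `EXAMPLE_STOP_WORDS` set are hypothetical and exist only for this illustration; the notebook builds its real set from `nltk.corpus.stopwords`.

```python
import re
from typing import Callable, Iterable, List, Optional

# Small illustrative subset (an assumption for this example); the notebook
# builds the real set from nltk.corpus.stopwords.words('english').
EXAMPLE_STOP_WORDS = {"the", "is", "a", "to", "of", "and", "your"}


def clean_tokens(
    text: str,
    stop_words: Iterable[str],
    lemmatize: Optional[Callable[[str], str]] = None,
) -> List[str]:
    """Lowercase text, keep alphabetic tokens, drop stop words, optionally lemmatize."""
    stop_words = set(stop_words)
    # Keep only runs of ASCII letters; digits and punctuation act as separators.
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [token for token in tokens if token not in stop_words]
    if lemmatize is not None:
        kept = [lemmatize(token) for token in kept]
    return kept


tokens = clean_tokens(
    "The offer is a LIMITED chance to claim your prizes!", EXAMPLE_STOP_WORDS
)
```

Once the NLTK resources are downloaded, `WordNetLemmatizer().lemmatize` could be passed as the `lemmatize` argument, and `stop_words_set` as `stop_words`.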

### Division into categories

The **final dataset** will consist of data from **6 different datasets**. Each of these datasets contains a label that marks a message as **legitimate (label 0) or not (label 1)**. However, the datasets contain different information about the messages, and **the labels have a different meaning depending on the dataset**. For this reason, each dataset has been assigned to a specific category:

| Dataset name | Category | Label of non-legitimate messages |
| --- | --- | --- |
| **Enron** | Spam | **1** |
| **Ling** | Spam | **1** |
| **SpamAssasin** | Spam | **1** |
| **CEAS_08** | Phishing | **2** |
| **Nazario** | Phishing | **2** |
| **Nigerian_Fraud** | Fraud | **3** |
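
The relabeling implied by the table can be tried on a toy series first. This is only a sketch: `CATEGORY_TO_LABEL` and `relabel` are hypothetical names introduced for the example.

```python
import pandas as pd

# Label that non-legitimate messages receive in the unified dataset,
# per category in the table above; 0 always means a legitimate message.
CATEGORY_TO_LABEL = {"Spam": 1, "Phishing": 2, "Fraud": 3}


def relabel(binary_labels: pd.Series, category: str) -> pd.Series:
    """Map a source's binary labels (0/1) to the unified multi-class labels."""
    target = CATEGORY_TO_LABEL[category]
    # Series.map with a dict turns any value missing from the dict into NaN,
    # so unexpected labels surface as missing values instead of passing through.
    return binary_labels.map({0: 0, 1: target})


phishing_labels = relabel(pd.Series([0, 1, 1, 0]), "Phishing")
```

Here the two positive entries of `phishing_labels` become 2 while the legitimate entries stay 0.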




In [None]:
list_of_spam_dataset = [
    "dataset/non-processed/SpamAssasin.csv",
    "dataset/non-processed/Enron.csv",
    "dataset/non-processed/Ling.csv"
]
list_of_fraud_dataset = [
    "dataset/non-processed/Nigerian_Fraud.csv"
]
list_of_phishing_dataset = [
    "dataset/non-processed/CEAS_08.csv",
    "dataset/non-processed/Nazario.csv"
]


# Label assigned to non-legitimate messages of each group in the final dataset
# (label 0 always stays "legitimate").
datasets_by_label = {
    1: list_of_spam_dataset,      # spam
    2: list_of_phishing_dataset,  # phishing
    3: list_of_fraud_dataset,     # fraud
}

frames = []
for target_label, dataset_paths in datasets_by_label.items():
    for dataset_path in dataset_paths:
        # manage_dataset_preprocess is not defined in this section;
        # it is assumed to be provided by another notebook cell.
        df = manage_dataset_preprocess(dataset_path)
        df['label'] = df['label'].map({0: 0, 1: target_label})
        frames.append(df)

# Concatenate once at the end instead of growing the frame inside the loop.
final_data_frame = pd.concat(frames, ignore_index=True)
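
After merging, a quick sanity check on the label column can catch rows whose label was lost during mapping. The helper below is a hypothetical addition, shown on a toy frame rather than on `final_data_frame`; the class names follow the category table.

```python
import pandas as pd

# Class names follow the category table; the helper itself is hypothetical.
LABEL_NAMES = {0: "Legitimate", 1: "Spam", 2: "Phishing", 3: "Fraud"}


def summarize_labels(df: pd.DataFrame, label_column: str = "label") -> pd.Series:
    """Count messages per class and fail loudly on missing or unknown labels."""
    labels = df[label_column]
    if labels.isna().any():
        raise ValueError("some rows have no label after mapping")
    unknown = set(labels.unique()) - set(LABEL_NAMES)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return labels.map(LABEL_NAMES).value_counts()


# Toy frame standing in for the merged dataset.
toy = pd.DataFrame({"label": [0, 0, 1, 2, 3, 3, 3]})
counts = summarize_labels(toy)
```

Calling `summarize_labels(final_data_frame)` in the notebook would give the per-class message counts, assuming the merged frame keeps its `label` column.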

