## Final Discussion of Project Results

### Brief Summary of the Best Model Configurations

| Model Type | Final Accuracy | Final F1-Score |
| --- | --- | --- |
| **XGBoost** | **~94.41%*** | **~94.34%*** |
| **LSTM** | **~96.37%** | **~96.35%** |
| **BERT** | -\** | -\** |

\* - model tested on a smaller data sample <br>
\*\* - model did not achieve satisfactory results

<br>

#### XGBoost
The XGBoost model proved **highly accurate**, but it requires **intensive preprocessing of the input data** and a **long training time**. Because of the long training times, hardware limitations, and the cost of building the TF-IDF matrix, **this model was trained on a small fragment of the dataset** (between 4,000 and 32,000 records, depending on the sub-model). It still achieved relatively good results, particularly when several specialized models were combined.

#### LSTM
The LSTM model was by far the **best choice** for classification, in terms of both **final accuracy** and **fast input data processing and training**. It merged the relevant columns into a single input containing all the necessary data, which removed the need to combine several models. Thanks to its **low hardware requirements**, it was the only model successfully trained on the **full dataset** (64,000 records).

#### BERT
As a transformer, this model had the **highest hardware requirements** of all the models, which made **training** **complex and time-consuming**. Unfortunately, it **did not achieve satisfactory results** (the model consistently predicted a single class) and responded to neither oversampling nor class weights.


## Discussion of Individual Models

### XGBoost

For this model, several independent models were created, each classifying emails based on a different column ('body', 'subject', 'sender', and 'receiver').

#### Body Model

The body model was trained on the content of the email. Because this column holds the most data per record, the model could be trained on **only a small fragment of the dataset (4,000 records)**. The best model achieved an accuracy of 89.67% (as shown in the next graph).

#### Stop Words vs Without Stop Words

When comparing two body models (with and without stop words), **the model that removed stop words performed slightly better**, although the difference was within the margin of error. The confusion matrix on the left is for the model with stop words; the one on the right is for the model without them.

<div style="display: flex; justify-content: space-around;">
    <img src="src/saved-results/xgb/plots/stop-vs-non-stop-words/only-body-body-model-with-stop-words-conf-matrix.png" alt="Confusion matrix of the body model with stop words" width=375>
    <img src="notebook-images/xgboost-stop-words-plot.png" alt="XGBoost stop words comparison plot" width=375>
    <img src="src/saved-results/xgb/plots/stop-vs-non-stop-words/only-body-without-stop-words-body-model-conf-matrix.png" alt="Confusion matrix of the body model without stop words" width=375>
 </div> 

Due to the slightly better result and the potential reduction in input data size for further tests, the approach of removing stop words was applied for the XGBoost model.
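As a rough illustration of this preprocessing step, the sketch below drops stop words before the text would be vectorized. The stop-word list, the tokenization rule, and the function name are assumptions made for the example, not the project's actual code.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real run would use a full list.
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "your"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(t for t in tokens if t not in stop_words)


cleaned = remove_stop_words("Your account is locked, click the link to verify")
print(cleaned)
```

Shorter token sequences also mean fewer distinct terms, which is one way the TF-IDF input can shrink.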

#### Subject Model

Since the subject usually contains much less data than the email body, the model could be trained on a significantly larger sample (16,000 records). However, accuracy based on the subject alone remained lower, at approximately 78.67%.

<div style="text-align: center;"> 
    <img src="src/saved-results/xgb/plots/body-subject/body-subject-subject-model-accuracy.png" alt="Description of image"> 
    <img src="src/saved-results/xgb/plots/body-subject/body-subject-subject-model-conf-matrix.png" alt="Description of image 2" width=500>
</div>

#### Domain Model
Unfortunately, after numerous attempts, satisfactory results were not achieved for the domain model, despite using a large sample (32,000 records). This applied both to searching for frequent domains used in creating fraudulent emails and checking whether the domain of the receiving and sending parties matched.

<div style="text-align: center;"> 
    <img src="src/saved-results/xgb/plots/body-domain/body-domain-domain-model-conf-matrix.png" alt="Description of image" width=500> 
</div>

#### Multi Evaluation
Since each model classifies data based on a single column, the next test checked whether combining stronger and weaker models would yield better results. Three combinations were tested:

- body-subject
- body-domain
- body-subject-domain (full)

**Classification involves gathering probability tables for specific classes, then multiplying them by the model's accuracy weights and summing them up.**
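A minimal sketch of this weighted combination is shown below. It assumes each model returns one probability per class and that the weight is simply the model's accuracy; the function name and the example probabilities are hypothetical, and the project may normalize or weight differently.

```python
def combine_predictions(probabilities, accuracies):
    """Weighted soft voting over per-model class probabilities.

    probabilities: dict of model name -> list of class probabilities
    accuracies:    dict of model name -> accuracy used as that model's weight
    Returns (index of the winning class, list of weighted scores).
    """
    n_classes = len(next(iter(probabilities.values())))
    scores = [0.0] * n_classes
    for name, probs in probabilities.items():
        weight = accuracies[name]
        for cls, p in enumerate(probs):
            scores[cls] += weight * p  # multiply by the weight, then sum
    best = max(range(n_classes), key=lambda c: scores[c])
    return best, scores


# Hypothetical probabilities for one email; the weights are the accuracies
# reported for the body and subject models in this document.
probs = {"body": [0.40, 0.60], "subject": [0.70, 0.30]}
weights = {"body": 0.8967, "subject": 0.7867}
label, scores = combine_predictions(probs, weights)
print(label, scores)
```

In this made-up case the subject model is confident enough to outvote the body model even though its weight is lower.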

#### Body Subject
| Model Type | Final Accuracy | Final F1-Score |
| --- | --- | --- |
| **body** | **~89.67%** | **~89.49%** |
| **subject** | **~78.67%** | **~77.67%** |
| **body-subject** | **~94.41%** | **~94.34%** |

Combining the subject and body models yielded the expected improvement: final accuracy rose by nearly 5 percentage points, from 89.67% (body alone) to 94.41%.

<div style="text-align: center;"> 
    <img src="src/saved-results/xgb/plots/body-subject/final-body-subject-conf-matrix.png" alt="Description of image" width=500> 
</div>

#### Body Domain and Full Model
Unfortunately, because the domain model was not adequately trained, adding it to a combination left accuracy almost unchanged compared with the same combination without it. Because of its low accuracy, this model had a smaller impact on the final result.

| Model Type | Final Accuracy | Final F1-Score |
| --- | --- | --- |
| **body** | **~89.67%** | **~89.49%** |
| **domain** | **~48.67%**\* | **~31.86%**\* |
| **body-domain** | **~89.93%** | **~89.76%** |
| **full** | **~93.98%** | **~93.82%** |

\* - Model guesses only one class.
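The effect of always predicting one class can be checked with a short calculation. The sketch below assumes a support-weighted F1-score (the averaging method is not stated in this document) and uses a hypothetical test split chosen for illustration; with these counts the scores come out close to the domain row above.

```python
def constant_predictor_scores(class_counts, predicted_class):
    """Accuracy and support-weighted F1 of a model that always predicts one class."""
    total = sum(class_counts.values())
    support = class_counts[predicted_class]
    if support == 0:
        return 0.0, 0.0
    accuracy = support / total
    # For the predicted class: precision = support / total, recall = 1.
    # Every other class has recall 0, so its F1 contributes nothing.
    precision, recall = support / total, 1.0
    f1_predicted = 2 * precision * recall / (precision + recall)
    weighted_f1 = (support / total) * f1_predicted
    return accuracy, weighted_f1


# Hypothetical split: 300 emails, 146 in the class the model always predicts.
acc, f1 = constant_predictor_scores({"legit": 154, "fraud": 146}, "fraud")
print(f"accuracy={acc:.4f}, weighted F1={f1:.4f}")
```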

<div style="text-align: center;">
    <b>Full Model</b>
    <br>
    <img src="src/saved-results/xgb/plots/full-model/final-full-conf-matrix.png" alt="Description of image" width=500> 
</div>

### LSTM
Training the LSTM models differed from training the XGBoost models. The LSTM models were not split into smaller parts: a single model classified emails based on all selected columns, with the additional data appended to the input. **Training** was significantly faster than for the other models, and as a result, **the models were trained on the full dataset** (64,000 records).
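One way to picture the appended input is a helper that joins the selected columns of a record into one string before tokenization. This is only a sketch: the function name, the column order, and the plain-space separator are assumptions, not the project's implementation.

```python
def build_lstm_input(record, columns=("subject", "body")):
    """Join the selected text columns of one record into a single string."""
    parts = [str(record.get(col) or "").strip() for col in columns]
    # Skip empty or missing columns so no stray separators are produced.
    return " ".join(p for p in parts if p)


email = {"subject": "Invoice overdue", "body": "Please pay today.", "sender": "a@x.com"}
text = build_lstm_input(email)
print(text)
```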

#### Same Sample as XGBoost

Using the same sample as XGBoost (4,000 records), the model was unable to achieve satisfactory results, regardless of the number of epochs (tested from 30 to 100). The model was tested with stop words.

<div style="text-align: center;"> 
    <img src="src/saved-results/lstm/small-dataset-lstm.png" alt="Description of image">
</div>

#### Stop Words vs Without Stop Words
**Stop words** allow the **LSTM model** to **understand context faster**, leading to **faster training**. The model using stop words reached satisfactory results much sooner than the one without them: the comparison was run for 35 epochs, while the model without stop words only started to show results around epochs 42-44.

<div style="text-align: center;"> 
    <b>Model without stop words</b>
    <br>
    <img src="src/saved-results/lstm/v2/plots/only-body-accuracy.png" alt="Description of image">
    <b>Model with stop words</b>
    <br>
    <img src="src/saved-results/lstm/v2/plots/stop-words-body-accuracy.png" alt="Description of image 2">
</div>

Ultimately, the model with stop words achieved an **excellent result of 96.09%** accuracy and a 96.19% F1-score after 35 epochs, while the model without stop words did not exceed 28% accuracy over the same number of epochs. For this reason, the LSTM model with stop words was used for further tests.

#### Subject Model
Each subsequent model extends the input data by appending new information (such as the subject or domain). Unfortunately, no model proved significantly better than the base model. With the subject column added, the model started learning noticeably later (around the 18th epoch instead of the 10th in the base model).

<div style="text-align: center;"> 
    <b>Model with stop words</b>
    <br>
    <img src="src/saved-results/lstm/v2/plots/body-subject-stop-accuracy.png" alt="Description of image 2">
</div>

#### Domain Model
This model used a simple binary feature: if the 'sender' and 'receiver' domains matched, a value of 1 was returned, otherwise 0. Unfortunately, this approach proved ineffective for training, and the domain model failed to categorize emails properly.
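The domain-match flag can be sketched as follows. The helper names and the handling of addresses written as `Name <user@domain>` are assumptions for the example; the actual parsing used in the project is not shown in this document.

```python
def domain_of(address):
    """Return the lowercased domain part of an email address, or '' if absent."""
    _, sep, domain = address.strip().rpartition("@")
    # Tolerate the "Name <user@domain>" form by dropping a trailing '>'.
    return domain.rstrip(">").lower() if sep else ""


def domains_match(sender, receiver):
    """Return 1 if both addresses share a non-empty domain, otherwise 0."""
    s, r = domain_of(sender), domain_of(receiver)
    return 1 if s and s == r else 0


print(domains_match("Bob <bob@Example.com>", "alice@example.com"))
print(domains_match("bob@evil.io", "alice@example.com"))
```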

#### Full Model
This model achieved results very similar to the body model (slightly higher, but the difference is likely within statistical noise). Model comparison results:

| Model Type | Final Accuracy | Final F1-Score |
| --- | --- | --- |
| **body** | **~96.09%** | **~96.19%** |
| **subject** | **~95.87%** | **~95.95%** |
| **domain** | **~28.97%** | **~14.20%** |
| **full** | **~96.37%** | **~96.35%** |

<div style="text-align: center;"> 
    <b>Full Model</b>
    <br>
    <img src="src/saved-results/lstm/v2/plots/full-data-stop-accuracy.png" alt="Description of image 2">
    <img src="src/saved-results/lstm/v2/plots/full-data-stop-conf-matrix.png" alt="Description of image 2" width=500>
</div>

### BERT
For the BERT model, two variants were tested (with and without stop words), but neither achieved satisfactory results. Various configurations were tried, including oversampling, adjusting class weights, expanding the dataset fragment (from 1,000 to 16,000 samples), different padding options, and varying maximum text lengths (from 64 to 512 tokens). Due to hardware limitations, a relatively small batch size (8) was used, which might also have affected the final result. No difference was observed between the variants with and without stop words, as both consistently predicted a single class.
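For reference, one common way to derive class weights is the "balanced" heuristic, where each class is weighted by `n_samples / (n_classes * class_count)`. The sketch below illustrates that heuristic on hypothetical label counts; it is not necessarily the weighting scheme that was tried here.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical imbalanced labels: 750 of class 0 and 250 of class 1.
class_weights = balanced_class_weights([0] * 750 + [1] * 250)
print(class_weights)
```

The rarer class receives the larger weight, so its errors cost more during training.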

### Final Conclusions

- Models respond well to task division and creating clusters of specialized models, where each model evaluates a record based on different data. This approach leads to noticeable improvements in accuracy and other metrics.
<br> <br>
- When a single model evaluates data from several columns at once, the extra columns do not improve results and in some cases even worsen them (the best approach is one data type, one model).
<br> <br>
- The LSTM model performs best regarding both training time and final metrics.
<br> <br>
- The experiments neither confirmed nor ruled out a correlation between the credibility of a message and whether the sender and receiver domains match.
<br> <br>
- The decision to remove stop words or keep them depends on the needs of the specific model. For instance, for the LSTM model, which tries to understand context, this can be crucial, while for other models, such as XGBoost, removing stop words may yield better results.
<br> <br>
- Despite its long training process due to the need to convert data into a TF-IDF matrix and the size of the matrix, the XGBoost model can achieve relatively good results on a much smaller data sample, making it a potentially good
