## Feature Extraction for Training

The `process_emails.py` script processes `.mbox` email files, using the `feature_finders.py` module to extract features from each email. The features are written to a `.csv` file, which leaves the data set ready for training machine learning models or for making predictions.
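The extraction step described above can be sketched with the standard library alone. This is a minimal illustration, not the actual `process_emails.py` or `feature_finders.py` code: the function names, the URL pattern, and the three features are assumptions, with column names borrowed from the sample output elsewhere in this document.

```python
import csv
import mailbox
import re
import tempfile
from email.message import EmailMessage
from pathlib import Path

# Hypothetical URL matcher; the real feature_finders.py may use different rules.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)
FIELDNAMES = ["html_form", "html_iframe", "urls", "is_phishy"]


def body_text(message, encoding="iso-8859-1"):
    """Join the decoded text parts of a message into one string."""
    chunks = []
    for part in message.walk():
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)  # bytes, or None for containers
        if payload is not None:
            chunks.append(payload.decode(encoding, errors="replace"))
    return "\n".join(chunks)


def extract_features(message):
    """Illustrative feature set (assumed, not the project's full list)."""
    text = body_text(message)
    lowered = text.lower()
    return {
        "html_form": "<form" in lowered,
        "html_iframe": "<iframe" in lowered,
        "urls": len(URL_PATTERN.findall(text)),
    }


def mbox_to_csv(mbox_path, csv_path, limit, is_phishy):
    """Extract features from at most `limit` emails and write them to a CSV."""
    rows = []
    box = mailbox.mbox(str(mbox_path))
    try:
        for index, message in enumerate(box):
            if index >= limit:
                break  # stop once the processing limit is reached
            row = extract_features(message)
            row["is_phishy"] = is_phishy
            rows.append(row)
    finally:
        box.close()
    with open(csv_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


def demo():
    """Build a three-message mbox, process two of them, and read the CSV back."""
    bodies = [
        "Log in: <form action='http://example.com/x'></form> https://example.org/a",
        "Meeting moved to noon.",
        "This third message is beyond the limit.",
    ]
    with tempfile.TemporaryDirectory() as tmp:
        mbox_path = Path(tmp) / "demo.mbox"
        csv_path = Path(tmp) / "demo.mbox-export.csv"
        box = mailbox.mbox(str(mbox_path))
        for body in bodies:
            message = EmailMessage()
            message["From"] = "sender@example.com"
            message["Subject"] = "demo"
            message.set_content(body)
            box.add(message)
        box.close()
        mbox_to_csv(mbox_path, csv_path, limit=2, is_phishy=True)
        with open(csv_path, newline="") as handle:
            return list(csv.DictReader(handle))
```

Calling `demo()` returns the exported rows as dictionaries of strings, which makes it easy to check that the limit and the label column behave as intended before pointing the function at a real mailbox.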

### Usage

Run the script to process the configured `.mbox` files. It extracts features from each email, up to the processing limit set for each file, and compiles the extracted features into a `.csv` file.

In [1]:
%run process_emails.py # or python3 process_emails.py, with example output below

Processing res/emails-phishing-pot.mbox with encoding iso-8859-1, limit 2279, is_phishy True...
Processing file: res/emails-phishing-pot.mbox
Encoding: iso-8859-1, Is Phishy: True, Limit: 2279
Reached processing limit of 2279 emails.
Data exported to res/emails-phishing-pot.mbox-export.csv
Email index exported to res/emails-phishing-pot.mbox-export-index.csv
Processing res/emails-enron.mbox  limit 2257, is_phishy False..
Processing file: res/emails-enron.mbox
Encoding: ascii, Is Phishy: False, Limit: 2257
Reached processing limit of 2257 emails.
Data exported to res/emails-enron.mbox-export.csv
Email index exported to res/emails-enron.mbox-export-index.csv
Finished processing all files.


## Model Training and Evaluation Process

The `model_train.py` script then runs the full lifecycle of a machine learning model, from data preparation through training to evaluation and saving. It uses separate modules for data handling (`utils_data_preparation`) and model operations (`utils_model`) to train and evaluate a model that distinguishes phishing emails from legitimate ones.


Example output:
```
Starting data preparation...
...
Confusion Matrix:
[[438  25]
 [ 29 414]]
ROC AUC: 0.9811051684713982
Precision: 0.9430523917995444
Recall: 0.9345372460496614
F1 Score: 0.9387755102040816
Saving the model...
Model saved successfully.
```
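As a cross-check, the precision, recall, and F1 values in the example output can be recomputed from the confusion matrix. The sketch below assumes the matrix uses the `[[TN, FP], [FN, TP]]` layout (the scikit-learn convention), which is consistent with the reported figures; the function name is hypothetical. ROC AUC is not included because it depends on the predicted scores, not on the thresholded counts.

```python
def binary_metrics(confusion):
    """Derive precision, recall, F1, and accuracy from a 2x2 confusion matrix.

    Assumes the layout [[TN, FP], [FN, TP]] with phishing as the positive class.
    """
    (tn, fp), (fn, tp) = confusion
    precision = tp / (tp + fp)  # share of flagged emails that are phishing
    recall = tp / (tp + fn)     # share of phishing emails that were flagged
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


metrics = binary_metrics([[438, 25], [29, 414]])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, precision is 414 / 439, recall is 414 / 443, and F1 is 828 / 882, matching the reported values to the printed precision.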


In [5]:
import warnings
warnings.filterwarnings('ignore')

import tensorflow as tf
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

%run model_train.py

Starting data preparation...
Sample data point from training set: {'html_form': <tf.Tensor: shape=(32,), dtype=float32, numpy=
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
      dtype=float32)>, 'attachments': <tf.Tensor: shape=(32,), dtype=float32, numpy=
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
      dtype=float32)>, 'flash_content': <tf.Tensor: shape=(32,), dtype=float32, numpy=
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
      dtype=float32)>, 'html_iframe': <tf.Tensor: shape=(32,), dtype=float32, numpy=
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
      dtype=float32)>, 'html_content': <tf.T

## EML to MBOX Conversion for Prediction

The `email_converter.py` script converts `.eml` email files into a single `.mbox` file for further processing and analysis. The conversion aggregates email samples stored as individual `.eml` files into the `.mbox` format, which is widely supported and more convenient for bulk email processing.
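A conversion of this kind can be sketched with Python's standard `mailbox` and `email` modules. The code below is an assumption about how such a converter might look, not the contents of `email_converter.py`; `convert_eml_folder` and `demo` are hypothetical names, and removing any existing output file first is a design choice so that repeated runs do not append duplicates.

```python
import mailbox
import tempfile
from email import message_from_bytes
from pathlib import Path


def convert_eml_folder(source_dir, mbox_path):
    """Append every .eml file in source_dir to a fresh mbox at mbox_path."""
    mbox_path = Path(mbox_path)
    mbox_path.parent.mkdir(parents=True, exist_ok=True)
    if mbox_path.exists():
        mbox_path.unlink()  # start from an empty mailbox on every run

    eml_files = sorted(Path(source_dir).glob("*.eml"))
    box = mailbox.mbox(str(mbox_path))
    try:
        for position, eml_file in enumerate(eml_files, start=1):
            print(f"Processing file {position} of {len(eml_files)}: {eml_file.name}")
            # Parse the raw bytes so malformed encodings do not abort the run.
            box.add(message_from_bytes(eml_file.read_bytes()))
    finally:
        box.close()  # flushes pending writes to disk
    return len(eml_files)


def demo():
    """Create two tiny .eml files, convert them, and read the subjects back."""
    with tempfile.TemporaryDirectory() as tmp:
        samples = Path(tmp) / "samples"
        samples.mkdir()
        for name, subject in (("a.eml", "first"), ("b.eml", "second")):
            raw = f"From: sender@example.com\nSubject: {subject}\n\nHello\n"
            (samples / name).write_bytes(raw.encode("ascii"))
        target = Path(tmp) / "samples_res" / "emails-samples.mbox"
        count = convert_eml_folder(samples, target)
        box = mailbox.mbox(str(target))
        try:
            subjects = [message["Subject"] for message in box]
        finally:
            box.close()
        return count, subjects
```

Files are added in sorted filename order, so the position of each message in the resulting `.mbox` is predictable from the folder listing.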

### Usage

Run the script to convert all `.eml` files in the `samples` folder into a single `emails-samples.mbox` file in the `samples_res` directory.

In [3]:
%run email_converter.py 

Cleaned up samples_res.
Found 13 .eml files in samples to process.
Processing file 1 of 13: 13 - Phishing.eml
Processing file 2 of 13: 4.eml
Processing file 3 of 13: 7 - Safe.eml
Processing file 4 of 13: I SET THIS MESELF.eml
Processing file 5 of 13: sample-2968.eml
Processing file 6 of 13: sample-2969.eml
Processing file 7 of 13: sample-2970.eml
Processing file 8 of 13: sample-2971.eml
Processing file 9 of 13: sample-2972.eml
Processing file 10 of 13: sample-2973.eml
Processing file 11 of 13: sample-2974.eml
Processing file 12 of 13: sss.eml
Processing file 13 of 13: studio.eml
All .eml files from samples have been added to samples_res/emails-samples.mbox.
Conversion completed successfully.


## Email Classification Prediction

The trained model is applied to the `.eml` email samples in the `/samples` folder. The `model_predict.py` script performs the following steps:

1. **Preprocessing**: Converts the sample `.mbox` file into `.csv` format, extracting relevant features using the `utils_feature_extraction` (ufe) and `utils_data_preparation` (udp) modules.
2. **Loading Data**: Reads the preprocessed data from the `.csv` file, displaying the count and a sample of the processed emails.
3. **Feature Preparation**: Prepares the features for prediction by adjusting data types and dropping unnecessary columns.
4. **Prediction**: Utilizes the trained TensorFlow model to predict the class (phishing or legitimate) of each email.
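The loading and feature-preparation steps (2 and 3) might look roughly like the sketch below. It uses only the standard `csv` module so that it runs anywhere; the actual script may well rely on a different library. The helper names are hypothetical, the sample text contains just three of the feature columns, and the unnamed leading index column is assumed to be the "unnecessary" one. The prediction step itself needs the saved TensorFlow model and is not sketched here.

```python
import csv
import io

# Illustrative excerpt only: an unnamed index column plus three feature columns.
SAMPLE_CSV = """,html_form,attachments,urls
0,False,0,0
1,False,0,7
"""


def to_float(value):
    """Map 'True'/'False' and numeric strings to floats for model input."""
    text = value.strip()
    if text == "True":
        return 1.0
    if text == "False":
        return 0.0
    return float(text)


def prepare_features(csv_text, drop_columns=("", "Unnamed: 0")):
    """Read CSV text, drop index-like columns, and cast every value to float."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for record in reader:
        rows.append(
            {
                name: to_float(value)
                for name, value in record.items()
                if name not in drop_columns
            }
        )
    return rows


features = prepare_features(SAMPLE_CSV)
print(f"Number of emails processed: {len(features)}")
print(features[0])
```

Casting booleans to `1.0`/`0.0` keeps every column in a single numeric type, which matches the `float32` tensors shown in the training output.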

In [4]:
import warnings
warnings.filterwarnings('ignore')

%run model_predict.py

Processing file: samples_res/emails-samples.mbox
Encoding: iso-8859-1, Is Phishy: None, Limit: 200
Data exported to samples_res/emails-samples.mbox-export.csv
Email index exported to samples_res/emails-samples.mbox-export-index.csv
Number of emails processed: 13
Sample of loaded data:
   Unnamed: 0  html_form  attachments  flash_content  html_iframe  \
0           0      False            0          False        False   
1           1      False            0          False        False   
2           2      False            0          False        False   
3           3      False            0          False        False   
4           4      False            0          False        False   

   html_content  urls  external_resources  javascript  css  ips_in_urls  \
0         False     0                   0           0    0        False   
1         False     7                   0           0    0        False   
2         False     3                   0           0    0        False   