In [1]:
import raw_utils as util
import os

import pandas as pd
import numpy as np

import random
random.seed(1746)

## Phishing

### Nazario Phishing Corpus

We start by reading the parts of the Nazario Phishing Corpus that we need.

In [2]:
# Paths
cwd = os.getcwd()
nazario_path = os.path.join(cwd, 'data/phishing/nazario/')
enron_path = os.path.join(cwd, 'data/legitimate/enron/')

csv_path = os.path.join(cwd, 'data/csv/')

In [3]:
# Files that read_dataset() should skip
files_ignored = ['README.txt']
# Additionally skip the older collections, so that only the phishing-2015 to phishing-2021 files are read
files_ignored_recent = ['README.txt', '20051114.mbox', 'phishing0.mbox', 'phishing1.mbox', 'phishing2.mbox', 'phishing3.mbox', 'private-phishing4.mbox']

First, we read and convert the whole dataset. This is straightforward because the corpus is a collection of .mbox files.

In [4]:
phishing = util.read_dataset(nazario_path, files_ignored, text_only=True)

Now reading file: phishing1.mbox
Now reading file: phishing3.mbox
Now reading file: phishing-2016
Now reading file: phishing0.mbox
Now reading file: 20051114.mbox
Now reading file: phishing-2017
Now reading file: phishing-2020
Now reading file: private-phishing4.mbox
1 emails skipped: Headers contain non-ascii characters, or otherwise corrupted email data.
Now reading file: phishing-2019
Now reading file: phishing-2018
Now reading file: phishing-2015
Now reading file: phishing-2021
Now reading file: phishing2.mbox
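
As a rough illustration of what reading an .mbox file into a one-column DataFrame can look like, here is a minimal sketch built on the standard-library `mailbox` module. It is not the code inside `raw_utils`: the helpers `extract_text_body` and `read_mbox_bodies` are hypothetical names, and keeping only the `text/plain` parts is an assumption about what `text_only=True` means.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage

import pandas as pd


def extract_text_body(message):
    """Return the text/plain parts of a message joined by newlines (hypothetical helper)."""
    parts = []
    for part in message.walk():
        # Multipart containers and non-text parts (HTML, attachments) are skipped.
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            parts.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset name in the headers: fall back to UTF-8.
            parts.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(parts)


def read_mbox_bodies(path):
    """Read one .mbox file into a DataFrame with a single 'body' column (hypothetical helper)."""
    rows = []
    box = mailbox.mbox(path)
    try:
        for message in box:
            body = extract_text_body(message)
            if body.strip():
                rows.append({"body": body})
    finally:
        box.close()
    return pd.DataFrame(rows, columns=["body"])


# Small self-contained demo: write a one-message mbox, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.mbox")
    demo_box = mailbox.mbox(demo_path)
    demo_msg = EmailMessage()
    demo_msg["From"] = "a@example.com"
    demo_msg["Subject"] = "demo"
    demo_msg.set_content("Please verify your account.")
    demo_box.add(demo_msg)
    demo_box.close()
    demo_df = read_mbox_bodies(demo_path)
```

Decoding with `errors="replace"` keeps messages with odd encodings instead of dropping them, which is a design choice of this sketch and may differ from how the utility handles corrupted emails.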


In [5]:
phishing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10459 entries, 0 to 10458
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   body    10459 non-null  object
dtypes: object(1)
memory usage: 81.8+ KB


In [6]:
util.save_to_csv(phishing, csv_path, 'nazario_full.csv')

Saving to /home/ichanis/projects/phishing_public/data/csv/nazario_full.csv
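
Email bodies usually contain newlines, commas and quotes, so it can be worth checking that they survive the trip through CSV. The sketch below uses plain pandas calls rather than `util.save_to_csv`, whose internals are not shown here; the file name `roundtrip_demo.csv` and writing without the index are assumptions made for the example.

```python
import os
import tempfile

import pandas as pd

# Two made-up bodies with a newline, a comma and embedded quotes.
original = pd.DataFrame(
    {"body": ["Dear user,\nplease verify your account.", 'He said "click here" today']}
)

with tempfile.TemporaryDirectory() as tmp:
    demo_csv = os.path.join(tmp, "roundtrip_demo.csv")
    # pandas quotes fields that contain delimiters, quotes or newlines by default.
    original.to_csv(demo_csv, index=False)
    restored = pd.read_csv(demo_csv)

bodies_match = restored["body"].tolist() == original["body"].tolist()
```

One caveat when reading such a file back: an empty body is parsed as `NaN` by `pd.read_csv` under its default settings, so empty rows may need to be dropped or filled afterwards.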


Next, we read only the recent emails (the phishing-2015 through phishing-2021 files) by also ignoring the older collections.

In [7]:
phishing_recent = util.read_dataset(nazario_path, files_ignored_recent, text_only=True)

Now reading file: phishing-2016
Now reading file: phishing-2017
Now reading file: phishing-2020
Now reading file: phishing-2019
Now reading file: phishing-2018
Now reading file: phishing-2015
Now reading file: phishing-2021
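
The effect of the ignore list can be sketched as a simple directory filter. The function `list_dataset_files` below is a hypothetical stand-in, not part of `raw_utils`, and sorting the names is an addition of this sketch that makes the reading order reproducible.

```python
import os
import tempfile


def list_dataset_files(directory, ignored):
    """Return the regular files in `directory` whose names are not in `ignored` (hypothetical helper)."""
    ignored = set(ignored)
    return sorted(
        name
        for name in os.listdir(directory)
        if name not in ignored and os.path.isfile(os.path.join(directory, name))
    )


# Demo on a throwaway directory that mimics a few corpus file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("README.txt", "phishing0.mbox", "phishing-2015", "phishing-2016"):
        with open(os.path.join(tmp, name), "w") as fh:
            fh.write("")
    kept = list_dataset_files(tmp, ["README.txt", "phishing0.mbox"])
```

In this demo `kept` holds only `phishing-2015` and `phishing-2016`, mirroring how the longer ignore list leaves just the recent files.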


In [8]:
phishing_recent.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1916 entries, 0 to 1915
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   body    1916 non-null   object
dtypes: object(1)
memory usage: 15.1+ KB


In [9]:
util.save_to_csv(phishing_recent, csv_path, 'nazario_recent.csv')

Saving to /home/ichanis/projects/phishing_public/data/csv/nazario_recent.csv


## Legitimate

### Enron Email Dataset

This dataset is very large, so instead of reading all of it we draw random samples of different sizes from it.

In [10]:
filename = util.sample_enron_to_mbox(enron_path, 2000)
enron_2000 = util.mbox_to_df(filename, os.path.join(enron_path, 'mbox'), text_only=True)
util.save_to_csv(enron_2000, csv_path, 'enron_text_2000.csv')

3028 folders will be checked.
300452 emails found.
Extracting 2000 random emails.
Creating output file /home/ichanis/projects/phishing_public/data/legitimate/enron/mbox/enron_2000.mbox
7 emails skipped: Headers contain non-ascii characters, or otherwise corrupted email data.
/home/ichanis/projects/phishing_public/data/legitimate/enron/mbox/enron_2000.mbox was created successfully.
Saving to /home/ichanis/projects/phishing_public/data/csv/enron_text_2000.csv
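
The sampling step can be sketched as follows, assuming the Enron data is laid out as a directory tree with one file per email. The function `sample_email_paths` is a hypothetical helper, not the implementation of `util.sample_enron_to_mbox`; it only picks the files and does not write an .mbox.

```python
import os
import random
import tempfile


def sample_email_paths(root, k, seed=None):
    """Pick up to `k` distinct file paths under `root` uniformly at random (hypothetical helper)."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # fix the traversal order so a seed gives the same sample
        for name in sorted(filenames):
            paths.append(os.path.join(dirpath, name))
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    return rng.sample(paths, min(k, len(paths)))


# Demo on a tiny fake tree: two folders with five files each.
with tempfile.TemporaryDirectory() as root:
    for folder in ("inbox", "sent"):
        os.makedirs(os.path.join(root, folder))
        for i in range(5):
            with open(os.path.join(root, folder, f"{i}.txt"), "w") as fh:
                fh.write("demo email\n")
    first = sample_email_paths(root, 4, seed=1746)
    second = sample_email_paths(root, 4, seed=1746)
    everything = sample_email_paths(root, 100, seed=1746)
```

Sorting the directory and file names before sampling matters because `os.walk` does not guarantee an order, so without it the same seed could select different emails on different machines.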


In [11]:
filename = util.sample_enron_to_mbox(enron_path, 20000)
enron_20000 = util.mbox_to_df(filename, os.path.join(enron_path, 'mbox'), text_only=True)
util.save_to_csv(enron_20000, csv_path, 'enron_text_20000.csv')

3028 folders will be checked.
300452 emails found.
Extracting 20000 random emails.
Creating output file /home/ichanis/projects/phishing_public/data/legitimate/enron/mbox/enron_20000.mbox
82 emails skipped: Headers contain non-ascii characters, or otherwise corrupted email data.
/home/ichanis/projects/phishing_public/data/legitimate/enron/mbox/enron_20000.mbox was created successfully.
Saving to /home/ichanis/projects/phishing_public/data/csv/enron_text_20000.csv
