In [1]:
import os
import pandas as pd

from sklearn.model_selection import train_test_split

import preprocessing as util
from raw_utils import save_to_csv

import random
random.seed(1746)

import warnings
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

[nltk_data] Downloading package punkt to /home/ichanis/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ichanis/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/ichanis/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /home/ichanis/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/ichanis/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [2]:
# Path
cwd = os.getcwd()
csv_path = os.path.join(cwd, 'data/csv/')

data_files = ['dataset_1.csv', 'dataset_2.csv']

In [3]:
dataset_1 = pd.read_csv(os.path.join(csv_path, data_files[0]), index_col=0, encoding='latin-1', dtype={'body': 'object', 'class': 'bool'})
dataset_1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3440 entries, 0 to 3439
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   body    3440 non-null   object
 1   class   3440 non-null   bool  
dtypes: bool(1), object(1)
memory usage: 57.1+ KB


In [4]:
dataset_2 = pd.read_csv(os.path.join(csv_path, data_files[1]), index_col=0, encoding='latin-1', dtype={'body': 'object', 'class': 'bool'})
dataset_2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18920 entries, 0 to 18919
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   body    18920 non-null  object
 1   class   18920 non-null  bool  
dtypes: bool(1), object(1)
memory usage: 314.1+ KB


## Preprocessing

We need to convert the text data into a format more suitable for use with machine learning algorithms.<br>
The process will follow the steps below:

### 1. Clean up HTML and whitespace

Some of the extracted emails are in HTML format, and others carry the same message in both plaintext and HTML as parts of a `multipart/alternative` message.<br>
To extract the text, the HTML formatting has to be removed, along with unnecessary whitespace and any duplicated text produced by the multipart emails.
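
The implementation of `util.strip_characters` and `util.deduplicate_text` is not shown in this notebook. As a rough illustration of the idea only, the sketch below uses the standard library's `html.parser`; the names `strip_html_sketch` and `drop_repeated_half_sketch` are hypothetical, and the deduplication rule (drop the second half when the text is the same word sequence twice) is an assumption about how a plaintext/HTML duplicate might be detected.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags, scripts, and styles."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        return " ".join(self._chunks)


def strip_html_sketch(body):
    """Hypothetical stand-in for util.strip_characters."""
    parser = _TextExtractor()
    parser.feed(body)
    parser.close()
    # Collapse every run of whitespace into a single space.
    return re.sub(r"\s+", " ", parser.text()).strip()


def drop_repeated_half_sketch(text):
    """Hypothetical stand-in for util.deduplicate_text.

    Keeps one copy when the text is exactly the same word sequence twice.
    """
    words = text.split()
    half = len(words) // 2
    if half and len(words) % 2 == 0 and words[:half] == words[half:]:
        return " ".join(words[:half])
    return text


# A plaintext part followed by the same message as HTML.
raw = "Hello  world\n<html><body><p>Hello <b>world</b></p></body></html>"
clean = drop_repeated_half_sketch(strip_html_sketch(raw))
print(clean)
```

Real multipart emails rarely duplicate their text this exactly, so a production version would likely need a fuzzier comparison.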

In [5]:
dataset_1['body'] = dataset_1['body'].apply(util.strip_characters)
dataset_1['body'] = dataset_1['body'].apply(util.deduplicate_text)

In [6]:
dataset_2['body'] = dataset_2['body'].apply(util.strip_characters)
dataset_2['body'] = dataset_2['body'].apply(util.deduplicate_text)

### 2. Replacing addresses

Many of the emails contain **web addresses** (URLs) or **email addresses**. These are removed so that the frequency of specific domains does not influence the results.<br>
So that this information is not lost completely, each address is replaced by the string `<urladdress>` or `<emailaddress>` respectively. These strings were chosen because they do not normally occur in the emails.
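
The replacement could look roughly like the sketch below. The regular expressions are deliberately simplified assumptions and are not the patterns used by `util.replace_email` and `util.replace_url`; the function names are hypothetical. Emails are replaced before URLs, in the same order as the cells that follow.

```python
import re

# Simplified patterns for illustration; real-world addresses are messier.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def replace_email_sketch(text):
    """Replace every email address with the <emailaddress> placeholder."""
    return EMAIL_RE.sub("<emailaddress>", text)


def replace_url_sketch(text):
    """Replace every URL with the <urladdress> placeholder."""
    return URL_RE.sub("<urladdress>", text)


sample = "Contact admin@example.com or visit http://example.com/login now"
masked = replace_url_sketch(replace_email_sketch(sample))
print(masked)
```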

In [7]:
dataset_1['body'] = dataset_1['body'].apply(util.replace_email)
dataset_1['body'] = dataset_1['body'].apply(util.replace_url)

In [8]:
dataset_2['body'] = dataset_2['body'].apply(util.replace_email)
dataset_2['body'] = dataset_2['body'].apply(util.replace_url)

### 3. Tokenization and stopword removal

Tokenization is the process of splitting text into individual words. This is useful because much of a text's meaning can be recovered from the words it contains.<br>
As part of this step, letters are converted to lowercase, and punctuation and other special characters are removed.<br>
Some words (called **stopwords**), such as pronouns or simple verbs, contribute little meaning, so they are removed to reduce noise.
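
A minimal sketch of this step, using only the standard library, is shown below. The stopword list is a tiny illustrative subset, and keeping the `<urladdress>` and `<emailaddress>` placeholders as single tokens is an assumption; the notebook does not show how `util.tokenize` treats them. The function names are hypothetical.

```python
import re

# Tiny illustrative stopword list; an assumption, not the list used by util.
STOPWORDS = {"the", "a", "an", "is", "to", "at", "your", "you", "and", "of"}

# Keep the address placeholders whole; otherwise keep runs of letters only,
# which drops digits, punctuation, and other special characters.
TOKEN_RE = re.compile(r"<urladdress>|<emailaddress>|[a-z]+")


def tokenize_sketch(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


def remove_stopwords_sketch(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


tokens = tokenize_sketch("Verify your account at <urladdress> TODAY!")
filtered = remove_stopwords_sketch(tokens)
print(filtered)
```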

In [9]:
dataset_1['body'] = dataset_1['body'].apply(util.tokenize)
dataset_1['body'] = dataset_1['body'].apply(util.remove_stopwords)

In [10]:
dataset_2['body'] = dataset_2['body'].apply(util.tokenize)
dataset_2['body'] = dataset_2['body'].apply(util.remove_stopwords)

### 4. Lemmatization with POS tagging

Lemmatization reduces the inflected forms of a word to a single base form (its lemma). Because all inflections of a word collapse into one entry, the resulting vocabulary is smaller, which reduces dimensionality while losing little information.<br>
To improve the lemmatization, **part-of-speech (POS) tagging** is used: the POS tag of each word (noun, verb, adjective, or adverb) is passed to the lemmatizer so that it can choose the correct lemma.
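
One common way to connect the two steps, assuming NLTK's `pos_tag` (which returns Penn Treebank tags) and `WordNetLemmatizer` (which expects one of the WordNet codes `n`, `v`, `a`, `r`), is a small mapping function like the one sketched below. Whether `util.lemmatize` works this way is not shown in the notebook, and `penn_to_wordnet` is a hypothetical name.

```python
# WordNet part-of-speech codes accepted by NLTK's WordNetLemmatizer.
NOUN, VERB, ADJ, ADV = "n", "v", "a", "r"


def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech code."""
    if tag.startswith("J"):
        return ADJ
    if tag.startswith("V"):
        return VERB
    if tag.startswith("RB"):
        return ADV
    return NOUN  # fall back to noun, the lemmatizer's own default


# Intended use with NLTK (not run here, since it needs the downloaded corpora):
#   lemmatizer = nltk.stem.WordNetLemmatizer()
#   [lemmatizer.lemmatize(w, penn_to_wordnet(t)) for w, t in nltk.pos_tag(tokens)]
tagged = [("accounts", "NNS"), ("suspended", "VBN"), ("quickly", "RB")]
mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(mapped)
```

Without the tag, the lemmatizer treats every word as a noun, so verb forms such as "suspended" would be left unchanged.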

In [11]:
dataset_1['body'] = dataset_1['body'].apply(util.lemmatize)

In [12]:
dataset_2['body'] = dataset_2['body'].apply(util.lemmatize)

### Deleting Empty Rows

After preprocessing, some emails may be empty because they contained no useful words to begin with.<br>
These rows are removed to keep the data clean.

In [13]:
dataset_1 = dataset_1[dataset_1['body'].astype(bool)]

In [14]:
dataset_2 = dataset_2[dataset_2['body'].astype(bool)]

## Train-Test Split

To evaluate the classifiers, we train the models on 80% of the data and test them on the remaining 20%, which the models do not see during training.
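
The idea behind a hold-out split can be sketched in plain Python as follows. This is not scikit-learn's implementation (for example, it rounds the test size where `train_test_split` rounds up), and `holdout_split_sketch` is a hypothetical name; it only illustrates shuffling with a dedicated, seeded random generator and cutting off the test share.

```python
import random


def holdout_split_sketch(rows, test_size=0.2, seed=1746):
    """Shuffle a copy of rows with a dedicated RNG and cut off the test share."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    # The first n_test shuffled rows form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))
train, test = holdout_split_sketch(rows)
print(len(train), len(test))
```

Using a dedicated generator with a fixed seed means the same rows land in the test set on every run, which keeps later evaluation results comparable.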

In [15]:
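# random.seed() does not seed scikit-learn, so pass the seed explicitly for a reproducible split
train_1, test_1 = train_test_split(dataset_1, test_size=0.2, random_state=1746)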

In [16]:
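# random.seed() does not seed scikit-learn, so pass the seed explicitly for a reproducible split
train_2, test_2 = train_test_split(dataset_2, test_size=0.2, random_state=1746)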

### Saving the Results

In [17]:
save_to_csv(train_1, csv_path, 'train_1.csv')
save_to_csv(test_1, csv_path, 'test_1.csv')

Saving to /home/ichanis/projects/phishing_public/data/csv/train_1.csv
Saving to /home/ichanis/projects/phishing_public/data/csv/test_1.csv


In [18]:
save_to_csv(train_2, csv_path, 'train_2.csv')
save_to_csv(test_2, csv_path, 'test_2.csv')

Saving to /home/ichanis/projects/phishing_public/data/csv/train_2.csv
Saving to /home/ichanis/projects/phishing_public/data/csv/test_2.csv
