In [1]:
import os
import pandas as pd
pd.options.display.max_colwidth = 160

import preprocessing as util
from raw_utils import save_to_csv

In [2]:
# Path
cwd = os.getcwd()
csv_path = os.path.join(cwd, 'data/csv/')

data_files = ['balanced.csv', 'imbalanced.csv']

In [3]:
balanced = pd.read_csv(os.path.join(csv_path, data_files[0]), index_col=0, dtype={'body': 'object', 'class': 'bool', 'id': 'int16'})
balanced.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3100 entries, 0 to 3099
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      3100 non-null   int16 
 1   body    3100 non-null   object
 2   class   3100 non-null   bool  
dtypes: bool(1), int16(1), object(1)
memory usage: 57.5+ KB


In [4]:
imbalanced = pd.read_csv(os.path.join(csv_path, data_files[1]), index_col=0, dtype={'body': 'object', 'class': 'bool', 'id': 'int16'})
imbalanced.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17050 entries, 0 to 17049
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      17050 non-null  int16 
 1   body    17050 non-null  object
 2   class   17050 non-null  bool  
dtypes: bool(1), int16(1), object(1)
memory usage: 316.4+ KB


### Initial Data

This is the initial state of the data, along with some representative examples of phishing and legitimate emails.

In [5]:
balanced.head(10)

Unnamed: 0,id,body,class
0,0,"Your message\n\n To: OMeara, Dina; Edmiston, John; Gas Daily (E-mail); Iferc1\n(E-mail); Iferc2 (E-mail); NGI (E-mail)\n Sent: Fri, 19 Oct 2001 15...",False
1,1,"\n Note: This is a service message with information related to your Chase account(s). It may include specific details about transactions, products or o...",True
2,2,\n\n \t \n \n \n \n Server Message \n \n \n\n \n Dear jose@monkey.org Our record indicates that you recently made a request to deactivate email A...,True
3,3,Absa’s ChatBanking on WhatsApp is now fully available to SA customers\nTo\nWhom\nIt\nMay\nConcern:\nPlease find attached a copy of your payment notification...,True
4,4,Your Microsoft Outlook Password will expire today. You are to click on this link http://site9423341.92.webydo.com/?v=1 immediately and fill the form correc...,True
5,5,\n\nx\n\neWebmailAlert#\n\n\n Take note of this important update that our new web mail has been improved with a new messaging system for outlook which also ...,True
6,6,Hi Jose\n\nWe discovered unusual activities on your account.\n\nTo be better protected against virus and spywares.\n\nYou need enroll for our new security f...,True
7,7,"Dear folks,\n\nApril 17 sounds OK to me. Just wondering so I can make arrangements: How long should I plan to be in town? Is the class for just one day? A...",False
8,8,"Dear\njose\n,\nDomain Admin provider\n(SQL)\nhave sent you several notice.\nsetup our latest security or\nYou will be deactivated from using this service.\n...",True
9,9,Account\nVerification\nRequired\nDear jose\nOn 9th December monkey.org record indicates\njose@monkey.org\nis not yet verified in our system and this will ab...,True


#### Legitimate emails:

In [6]:
print(balanced['body'].iloc[12])

Shubh,

How are you doing through all of the uncertainty associated with DPC these days? Are we still waiting for the lenders to decide whether or not they will continue funding the project? What is the MSEB's current position?

I'll give you a call Wednesday morning (Mumbai time).

Best Regards,

Paul



In [7]:
print(balanced['body'].iloc[18])


Are you Ready2Go ?
Have you ever experienced problems accessing the network remotely when traveling on business or when trying to work from home on your laptop?  Would you like someone to test and customize the dial-up access on your laptop before you head out of town on your next trip? 

We have identified a need and are now supplying a new service for Enron Wholesale Services and EBS employee's...

	It's just for you and it's called Ready2Go !

This new service has been established to test, modify, and update the dial-up software on your laptop.

How it works:
A calendar has been created on ITCentral at <http://itcentral.enron.com/Data/Services/Ready2Go>
You can go to the site and schedule a time that is convenient for you. Then, you will need to bring your laptop to our testing site, (either at 3AC1631D or EB2249B).  We will examine your system, update your software (if needed) and even customize it for the next location you will be traveling to.  We'll also provide a brief trainin

#### Phishing Emails:

In [8]:
print(balanced['body'].iloc[2])

 

 	  
 
  
 
  Server Message  
 
  

 
   Dear jose@monkey.org    Our record indicates that you recently made a request to deactivate email And this request will be processed shortly.  If this request was made accidentally or you have no knowledge of it, you are advised to cancel the request now  
 
 
  
    Cancel De-activation    
 
 
  
  However, if you do not cancel this request, the deactivation will be effected 
 and all email data be lost permanently.  Regards.
 Email Administrator  
 
 
  
 
 
 
  
  This message is auto-generated from E-mail security server, and replies sent to this email can not be delivered. 
 This email is meant for: jose@monkey.org  
 
 
  
 	 	 
 


In [9]:
print(balanced['body'].iloc[19])

Your e-mailbox password will expire in 2 days. to keep your password. CLICK=HERE<http://owaogr.tripod.com/> to update immediately

IT-Service Help Desk.

Your e-mailbox password will expire in 2 days. to keep your password.
CLICK=HERE
to update immediately
IT-Service
Help Desk
.


Some observations:
- There is a lot of extra whitespace, which can be sanitized later.
- Some emails still contain duplicated text (in this example, because the source URL from the HTML was inside the `<a>` tag and not part of the text that was extracted).
- Email addresses and URLs can give away the class of a message (domain enron.com vs. domain monkey.org), so removing them should make the model more general.

# Preprocessing

We need to convert the text data into a format more suitable for use with machine learning algorithms.<br>
Since we aim for two different feature sets, the process will be split.

## Basic Preprocessing

These processes are applied to all the feature sets.

### Replacing addresses

As the examples show, many of the emails contain **web addresses** (URLs) or **email addresses**. These need to be removed so that the frequency of certain domains does not influence the results.<br>
So that this information is not lost completely, the addresses are replaced by the strings `'<urladdress>'` and `'<emailaddress>'` respectively. These strings were chosen because they do not normally occur in the emails.
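
The helpers `util.replace_email` and `util.replace_url` are not shown in this notebook, so the following is only a minimal sketch of how such a replacement could work. The regular expressions and the `*_sketch` function names are assumptions made for illustration; they are deliberately simple and may miss unusual addresses.

```python
import re

# Hypothetical patterns: deliberately simple, not a full address grammar.
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')
URL_RE = re.compile(r'(?:https?://|www\.)[^\s<>"\']+', re.IGNORECASE)


def replace_email_sketch(text):
    """Replace every email address with the '<emailaddress>' placeholder."""
    return EMAIL_RE.sub('<emailaddress>', text)


def replace_url_sketch(text):
    """Replace every URL with the '<urladdress>' placeholder."""
    return URL_RE.sub('<urladdress>', text)


sample = 'Dear jose@monkey.org CLICK=HERE<http://owaogr.tripod.com/> to update'
# Same order as in the notebook cells: email addresses first, then URLs.
cleaned = replace_url_sketch(replace_email_sketch(sample))
print(cleaned)
```

Because the URL pattern stops at `<` and `>`, any angle brackets that surrounded a link in the original text are kept around the placeholder.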

In [10]:
balanced['body'] = balanced['body'].apply(util.replace_email)
balanced['body'] = balanced['body'].apply(util.replace_url)

In [11]:
imbalanced['body'] = imbalanced['body'].apply(util.replace_email)
imbalanced['body'] = imbalanced['body'].apply(util.replace_url)

In [12]:
print(balanced['body'].iloc[2])

 

 	  
 
  
 
  Server Message  
 
  

 
   Dear <emailaddress>    Our record indicates that you recently made a request to deactivate email And this request will be processed shortly.  If this request was made accidentally or you have no knowledge of it, you are advised to cancel the request now  
 
 
  
    Cancel De-activation    
 
 
  
  However, if you do not cancel this request, the deactivation will be effected 
 and all email data be lost permanently.  Regards.
 Email Administrator  
 
 
  
 
 
 
  
  This message is auto-generated from E-mail security server, and replies sent to this email can not be delivered. 
 This email is meant for: <emailaddress>  
 
 
  
 	 	 
 


In [13]:
print(balanced['body'].iloc[19])

Your e-mailbox password will expire in 2 days. to keep your password. CLICK=HERE<<urladdress>> to update immediately

IT-Service Help Desk.

Your e-mailbox password will expire in 2 days. to keep your password.
CLICK=HERE
to update immediately
IT-Service
Help Desk
.


The examples show that the URLs and email addresses have indeed been anonymized now.

## Preprocessing for content features

This preprocessing converts the text strings into lists of words, which will then be vectorized for use by machine learning algorithms.

### Tokenization and stopword removal

Tokenization is the process of splitting text into individual words. This is useful because generally speaking, the meaning of the text can easily be interpreted by analyzing the words present in the text.<br>
Along with this process, letters are also converted to lowercase and punctuation or other special characters are removed.<br>
Since there are some words (called **stopwords**) that do not contribute very much in meaning (like pronouns or simple verbs), they can be removed to reduce the noise.
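
The implementations of `util.tokenize` and `util.remove_stopwords` are not shown here. As a rough sketch of the idea, assuming a regex-based tokenizer and a tiny hand-written stopword list (both hypothetical, a real pipeline would likely use a fuller list from an NLP library):

```python
import re

# Tiny illustrative stopword list (an assumption, not the list used by util).
STOPWORDS = {'your', 'will', 'in', 'to', 'the', 'a', 'is', 'you', 'are'}

# A token is a run of letters/digits, optionally joined by internal hyphens.
TOKEN_RE = re.compile(r'[a-z0-9]+(?:-[a-z0-9]+)*')


def tokenize_sketch(text):
    """Lowercase the text and return its tokens, dropping punctuation."""
    return TOKEN_RE.findall(text.lower())


def remove_stopwords_sketch(tokens):
    """Keep only the tokens that are not in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


text = 'Your e-mailbox password will expire in 2 days.'
tokens = remove_stopwords_sketch(tokenize_sketch(text))
print(tokens)  # ['e-mailbox', 'password', 'expire', '2', 'days']
```

With this pattern a placeholder such as `<urladdress>` loses its angle brackets and survives as the plain token `urladdress`.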

In [14]:
balanced['body'] = balanced['body'].apply(util.tokenize)
balanced['body'] = balanced['body'].apply(util.remove_stopwords)

In [15]:
imbalanced['body'] = imbalanced['body'].apply(util.tokenize)
imbalanced['body'] = imbalanced['body'].apply(util.remove_stopwords)

In [16]:
print(balanced['body'].iloc[18])

['ready2go', 'ever', 'experienced', 'problems', 'accessing', 'network', 'remotely', 'traveling', 'business', 'trying', 'work', 'home', 'laptop', 'would', 'like', 'someone', 'test', 'customize', 'dial-up', 'access', 'laptop', 'head', 'town', 'next', 'trip', 'identified', 'need', 'supplying', 'new', 'service', 'enron', 'wholesale', 'services', 'ebs', 'employee', 'called', 'ready2go', 'new', 'service', 'established', 'test', 'modify', 'update', 'dial-up', 'software', 'laptop', 'works', 'calendar', 'created', 'itcentral', 'urladdress', 'go', 'site', 'schedule', 'time', 'convenient', 'need', 'bring', 'laptop', 'testing', 'site', 'either', '3ac1631d', 'eb2249b', 'examine', 'system', 'update', 'software', 'needed', 'even', 'customize', 'next', 'location', 'traveling', 'also', 'provide', 'brief', 'training', 'session', 'accessing', 'network', 'remotely', 'laptop', 'interested', 'participating', 'simply', 'access', 'web', 'site', 'schedule', 'appointment', 'thank', 'time']


In [17]:
print(balanced['body'].iloc[19])

['e-mailbox', 'password', 'expire', '2', 'days', 'keep', 'password', 'urladdress', 'update', 'immediately', 'it-service', 'help', 'desk', 'e-mailbox', 'password', 'expire', '2', 'days', 'keep', 'password', 'update', 'immediately', 'it-service', 'help', 'desk']


The example shows how a fairly large chunk of text was reduced to a smaller list that contains the more meaningful words. The addresses still exist as tokens ('urladdress').<br>
Also, emails with duplicated text will have duplicated tokens. This, however, is not a big issue with most vectorizers.

### Lemmatization with POS tagging

Lemmatization is the process of reducing the inflected forms of a word to its base (dictionary) form, the lemma. This is useful because all the inflections of a word are mapped to a single token, which makes the vocabulary smaller and reduces the dimensionality while losing little information.<br>
To improve the lemmatization, **part-of-speech (POS) tagging** is used. The POS of a word (which indicates whether it is a noun, a verb, an adjective, or an adverb) is passed to the lemmatizer as part of the process.
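
How `util.lemmatize` combines the two steps is not shown, so the sketch below only illustrates the general pattern: a tagger labels each token, the tag is mapped to one of the four coarse classes, and the lemmatizer receives both the word and its class. The function names and the toy lookup table are assumptions. With NLTK, `nltk.pos_tag(tokens)` could supply the tagged pairs and `WordNetLemmatizer().lemmatize` could be passed as `lemmatize_fn` (both require the corresponding NLTK data to be downloaded).

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (default: noun)."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # noun and everything else


def lemmatize_tagged(tagged_tokens, lemmatize_fn):
    """Lemmatize (word, tag) pairs using a lemmatize_fn(word, pos) callable."""
    return [lemmatize_fn(word, penn_to_wordnet(tag))
            for word, tag in tagged_tokens]


# Stand-in for a real lemmatizer: a toy lookup keyed by (word, pos).
TOY_LEMMAS = {('trying', 'v'): 'try', ('problems', 'n'): 'problem'}


def toy_lemmatize(word, pos):
    return TOY_LEMMAS.get((word, pos), word)


tagged = [('trying', 'VBG'), ('problems', 'NNS'), ('training', 'NN')]
lemmas = lemmatize_tagged(tagged, toy_lemmatize)
print(lemmas)  # ['try', 'problem', 'training']
```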

In [18]:
balanced['body'] = balanced['body'].apply(util.lemmatize)

In [19]:
imbalanced['body'] = imbalanced['body'].apply(util.lemmatize)

In [20]:
print(balanced['body'].iloc[18])

['ready2go', 'ever', 'experience', 'problem', 'access', 'network', 'remotely', 'travel', 'business', 'try', 'work', 'home', 'laptop', 'would', 'like', 'someone', 'test', 'customize', 'dial-up', 'access', 'laptop', 'head', 'town', 'next', 'trip', 'identify', 'need', 'supply', 'new', 'service', 'enron', 'wholesale', 'service', 'ebs', 'employee', 'call', 'ready2go', 'new', 'service', 'establish', 'test', 'modify', 'update', 'dial-up', 'software', 'laptop', 'work', 'calendar', 'create', 'itcentral', 'urladdress', 'go', 'site', 'schedule', 'time', 'convenient', 'need', 'bring', 'laptop', 'test', 'site', 'either', '3ac1631d', 'eb2249b', 'examine', 'system', 'update', 'software', 'need', 'even', 'customize', 'next', 'location', 'travel', 'also', 'provide', 'brief', 'training', 'session', 'access', 'network', 'remotely', 'laptop', 'interested', 'participate', 'simply', 'access', 'web', 'site', 'schedule', 'appointment', 'thank', 'time']


The example shows how the lemmatization process has worked: words like 'trying' have been converted to their root form 'try'.<br>
It also shows the effect of POS tagging: the 'training' near the end is unchanged because it is used there as a noun and not a verb. As a verb it would have been reduced to 'train', whereas for the noun only an inflected form such as 'trainings' would be reduced to 'training'.

### Deleting Empty Rows

After all the preprocessing, it is possible that some of the emails are now empty (because they did not contain any useful words from the beginning).<br>
So, these have to be removed to keep the data clean.

In [21]:
balanced = balanced[balanced['body'].astype(bool)]

In [22]:
imbalanced = imbalanced[imbalanced['body'].astype(bool)]
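
The filter above relies on the fact that an empty list is falsy in Python. A small sketch on made-up data, assuming the `body` column holds Python lists as in the token output shown earlier, and using `.map(bool)` as an explicit form of the same truthiness test:

```python
import pandas as pd

# Made-up rows for illustration; the middle one has no tokens left.
df = pd.DataFrame({
    'id': [0, 1, 2],
    'body': [['password', 'expire'], [], ['laptop']],
})

# An empty list is falsy, so mapping bool over the column gives a keep-mask.
mask = df['body'].map(bool)
non_empty = df[mask]
print(non_empty['id'].tolist())  # [0, 2]
```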

## Train-Test Split

In order to evaluate the classification process, only 80% of the data will be used to train the models. The remaining 20%, which the algorithms never see during training, will be used to test the performance of the classifiers on unseen data.
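
The helper `util.dataset_split` is not shown in this notebook. A minimal sketch of an 80/20 random hold-out, assuming a plain unstratified split with a fixed seed (the function name, the `seed` parameter, and the toy data are hypothetical, and the real helper may shuffle, stratify, or reindex differently):

```python
import pandas as pd


def dataset_split_sketch(df, percent=20, seed=0):
    """Hold out `percent`% of the rows at random; return (train, test)."""
    test = df.sample(frac=percent / 100, random_state=seed)
    train = df.drop(test.index)  # everything not sampled stays in train
    return train, test


df = pd.DataFrame({'id': range(10), 'class': [True, False] * 5})
train, test = dataset_split_sketch(df, percent=20)
print(len(train), len(test))  # 8 2
```

Fixing the seed keeps the split reproducible between runs, which matters when the saved train and test files are reused by later notebooks.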

### Text Features

In [23]:
train_balanced_tokens, test_balanced_tokens = util.dataset_split(balanced, percent=20)

In [24]:
train_imbalanced_tokens, test_imbalanced_tokens = util.dataset_split(imbalanced, percent=20)

In [25]:
train_balanced_tokens[train_balanced_tokens['id'] == 19]

Unnamed: 0,id,body,class
1442,19,"[e-mailbox, password, expire, 2, day, keep, password, urladdress, update, immediately, it-service, help, desk, e-mailbox, password, expire, 2, day, keep, pa...",True


In [26]:
test_balanced_tokens[test_balanced_tokens['id'] == 18]

Unnamed: 0,id,body,class
540,18,"[ready2go, ever, experience, problem, access, network, remotely, travel, business, try, work, home, laptop, would, like, someone, test, customize, dial-up, ...",False


One of the examples is in the training set while the other is in the test set.

### Saving the Results

#### Text Features

In [27]:
save_to_csv(train_balanced_tokens, csv_path, 'train_balanced_tokens.csv')
save_to_csv(test_balanced_tokens, csv_path, 'test_balanced_tokens.csv')

Saving to /home/ichanis/projects/phishing_public/data/csv/train_balanced_tokens.csv
Saving to /home/ichanis/projects/phishing_public/data/csv/test_balanced_tokens.csv


In [28]:
save_to_csv(train_imbalanced_tokens, csv_path, 'train_imbalanced_tokens.csv')
save_to_csv(test_imbalanced_tokens, csv_path, 'test_imbalanced_tokens.csv')

Saving to /home/ichanis/projects/phishing_public/data/csv/train_imbalanced_tokens.csv
Saving to /home/ichanis/projects/phishing_public/data/csv/test_imbalanced_tokens.csv
