> <h1>IMPORT AND TIDYING</h1>

This notebook is focused on cleaning each email to extract only the most important information from all of the emails and then creating a comprehensive dataset. For each email, a series of "cleaner" functions are called to extract information, which is then used to create a dictionary for each email. This continues until all emails have been cleaned, and associated information is stored in a list of dictionaries. 

An important part of this cleaning involves using natural language processing to extract unique keywords from all of the emails. An extensive file of stop words is used to determine whether any particular keyword should be ignored or not. Once the keywords are extracted, I use them to build up a database of spam scores associated with each email. These values are then assigned to each email.

With these steps complete, I then began testing whether there were any encoding errors when processing the emails. I then handled these by filling in empty values with appropriate information and dropping certain rows of data. This developed dataset is then used in the remainder of my project.

NOTE: All testing of cleaning functions was done through sampling emails and comparing the output of the function calls to the desired values to be extracted from the emails. Test cases at the end ensure that the dataset has been properly created.

In [1]:
import email.parser
from datetime import date
from datetime import datetime
import spacy 
from spacy.lang.en.stop_words import STOP_WORDS  # These are a list of common stop words.
import pandas as pd
import numpy as np

> <h3> Extracting Spam and Ham (Not Spam) values </h3>

To begin cleaning the emails, I imported the test file which classifies all of the emails as either Spam or Ham. I used this to create an easy to use dictionary to classify any given email as spam or not.

In [2]:
""" Returns a dictionary of email classifications.

This function takes in a text file which classifies all of the labeled emails as being spam or not. It then outputs a dictionary that is used to show the classification for each email.
"""
def classify(txt_file):
    cl = {}
    with open(txt_file) as txt:
        lines = txt.readlines()
        for line in lines: 
            temp = line.replace('\n', '').split(' ')
            cl[temp[1]] = int(temp[0])
    return cl 

In [3]:
# cl is the dictionary that contains the spam classification for all of the emails.

cl = classify('/data/thenateandre/CSDMC2010_SPAM/CSDMC2010_SPAM/SPAMTrain.txt')

In [4]:
# This short snippet of cl shows each email and whether it is spam or not (0 is spam and 1 is not spam).

keys = list(cl.keys())
for i in range(1, 5):
    print(keys[i], cl[keys[i]])

TRAIN_00001.eml 0
TRAIN_00002.eml 1
TRAIN_00003.eml 0
TRAIN_00004.eml 0


> <h2> Cleaning and Tidying </h2>

The functions in this next section are used to clean and extract all of the important attributes from each email. This includes attributes such as day and hour the email was sent, sender, whether there is a subscribe option, keywords in the emails, whether the sender used mailer software, and much more. 

In [5]:
""" Extracts all the relevant information from the email and returns a list of dictionaries (each dictionary representing a unique email).

This function takes in a dictionary of all of the files with labels for whether an email is spam and extracts the important contents from each of the emails. This function then returns a list of dictionaries, with each dictionary representing one email.
"""
def email_extract(cl):
    global count
    lis_of_emails = []
    for email in cl.keys():
        # Getting the values for one particular email:
        temp = extract_values(email, cl[email])
        lis_of_emails.append(temp)
    return lis_of_emails

In [6]:
""" Returns the dictionary of email attributes for each individual email.

This function calls an assortment of other "cleaning" functions that extract and clean all of the relevant data from the emails and stores it in a dictionary for each of the emails.
"""
def extract_values(file_name, label):
    return_dic = {}
    # Opening each file and calling cleaning functions on the contents of the emails.
    with open('/data/thenateandre/CSDMC2010_SPAM/CSDMC2010_SPAM/TRAINING/'+file_name, 'rb') as fp:
        msg = email.message_from_binary_file(fp)
        payload = msg.get_payload()
        # Building the dictionary:
        return_dic['email'] = file_name
        return_dic['label'] = label
        return_dic['receive_count'] = msg.keys().count('Received')  # Counts how many servers the email went through.
        return_dic = clean_date_time(msg['Date'], return_dic)
        return_dic = clean_main_attributes(msg['From'], msg['Content-Type'], return_dic)
        return_dic = clean_other_attributes(msg['MIME-Version'], msg['List-Subscribe'], msg['List-Unsubscribe'], return_dic)
        return_dic = clean_x_attributes(msg['X-mailer'], msg['X-priority'], return_dic)
        return_dic = subject_message_handler(msg['Subject'], return_dic)  # Subject line.
        # Main message body handler:
        if type(payload) == type(list()):
            try:  # Handles improper email encoding errors.
                return_dic = main_message_handler_alternative(str(payload[0]), str(payload[1]), return_dic)  # payload[1] includes any html attributes.
            except:
                return_dic['message'] = None
        else:
            return_dic = main_message_handler(str(payload), return_dic)
    return return_dic

> <h3> Natural Language Processing </h3>

This section of cleaning and tidying functions are focused on the main message and subject lines for the emails. The functions are primarily focused on using NLP and an extensive list of stop words to extract relevant keywords from the subject line and main email message. 

In [7]:
# Using Spacy as the natural language processor.

nlp = spacy.load('en')

In [8]:
""" Takes in a dictionary and an email's subject-line and details whether an email has 're:' in the subject-line, along with adding the subject-line keywords.

This function adds one if the email subject-line has 're:' in it and also adds the subject-line keywords to the dictionary after they have been cleaned. 
"""
def subject_message_handler(subject, return_dic):
    lowered = str(subject).lower()
    if 're:' in lowered:
        return_dic['re'] = 1
    else:
        return_dic['re'] = 0
    subject = tokenizer(list(nlp(lowered)))  # Calling the rokenizer helper function to clean keyword list.
    return_dic['subject_line'] = list(subject)
    return return_dic

In [9]:
""" Takes in a dictionary and an email's main message and adds the important keywords to the dictionary along with whether html was being used and returns the dictionary.

This function is the main message handler, meaning it handles "type A" emails, that are less likely to have encoding errors. These emails also still contain html attributes.
"""
def main_message_handler(message, return_dic):
    # HTML tag handler to determine whether html tags are being used:
    message = message.lower()  # Making sure all words are lower case before analysis.
    if "<html>" in message or "<head>" in message or "<body>" in message:
        return_dic['html_use'] = 1
    else:
        return_dic['html_use'] = 0
    # NLP used to determine content of message:
    content = tokenizer(list(nlp(message)))
    return_dic['message'] = list(content)
    return return_dic

In [10]:
""" Takes in a dictionary and two forms of the email (one with html attributes and one without), and adds the important keywords to the dictionary along with whether html was being used, and returns the dictionary.

This is used for separate encodings of the main message for "type B" emails, meaning the emails have potentially more encoding errors.
"""
def main_message_handler_alternative(message, html_message, return_dic):
    # HTML tag handler to determine whether html tags are being used:
    html_message = html_message.lower()  # Making sure all words are lower case before analysis.
    if "<html>" in html_message or "<head>" in html_message or "<body>" in html_message:
        return_dic['html_use'] = 1
    else:
        return_dic['html_use'] = 0
    # NLP to determine content of message:
    content = tokenizer(list(nlp(message.lower())))
    return_dic['message'] = list(content)
    return return_dic

In [11]:
""" Takes in a list of keyword tokens and returns only relevent keywords.

This helper function determines whether a keyword is a stop word and whether it is significant. 
"""
def tokenizer(text):
    return_lis = []
    for token in text:
        # Determining if keyword is a stop word:
        if str(token) not in stop_words and str(token).isalpha():
            return_lis.append(token)
    return return_lis

In [12]:
# This creates a comprehensive list of stop words from Spacy suggestions to words pulled from the internet.

stop_words = []
with open('/data/thenateandre/CSDMC2010_SPAM/CSDMC2010_SPAM/stop_word.txt', 'r') as f:
    lines = f.readlines()
    for line in lines:
        line = line.replace("\n", '')
        if line not in stop_words:
            stop_words.append(line)
    # Stop words from spacy.
    for word in STOP_WORDS:
        if word not in STOP_WORDS:
            stop_words.append(word)

In [13]:
# These are examples of some of stop words:

print(stop_words[20:30])

['about', 'above', 'abroad', 'according', 'accordingly', 'across', 'actually', 'adj', 'after', 'afterwards']


The contents of these functions are used to gather and clean data from the other attributes of each email





> <h3> Cleaning the Rest of the Email Attributes </h3>

After processing the keywords and attributes of the subject-line and main message portions of all of the emails, I moved on to cleaning the rest of the important attributes of the emails. These include whether mailer software was used to send the email, the priority of the email set, and other attributes. 

In [14]:
""" Returns a dictionary that includes the important x-values of the emails.

Determines whether the sender of the email was using mailer software, and if they set a priority for their email.
"""
def clean_x_attributes(mailer, priority, return_dic):
    # Cleaning X-priority values:
    if priority is not None:
        return_dic['X-Priority'] = int(float(priority.split(' ')[0].replace("'", "")))
    else:
        return_dic['X-Priority'] = 3  # 3 is the standard value of the priority attribute if not specified.
    # Cleaning X-mailer string values:
    if mailer is not None:
        return_dic['X-Mailer'] = 1
    else:
        return_dic['X-Mailer'] = 0
    return return_dic

In [15]:
""" Returns a dictionary that includes message id, mime, subscribe, and unsubscribe attributes of the emails. 

This function takes three other attributes from the emails. It cleans the values and ensures that there will not be a cardinality problem.
"""
def clean_other_attributes(mime, subscribe, unsubscribe, return_dic):
    # Cleaning MIME values:
    if mime is not None:
        return_dic['MIME-Version'] = 1  # Represents True.
    else:
        return_dic['MIME-Version'] = 0
    # Cleaning Subscribe values:
    if subscribe is not None:
        return_dic['Subscribe'] = 1
    else:
        return_dic['Subscribe'] = 0
    # Cleaning unsubscribe values:
    if unsubscribe is not None:
        return_dic['Unsubscribe'] = 1
    else:
        return_dic['Unsubscribe'] = 0
    return return_dic

In [16]:
""" Returns a dictionary with the cleaned values of the sender and content_type attributes of an email.

This function determines whether there is a content_type and who the sender is. It cleans the sender attribute of the email such that there is a section focused on the trailing section of the email handle such as 'com' or 'net'.
"""
def clean_main_attributes(sender, content_type, return_dic):
    # Cleaning content_type:
    if content_type is not None:
        return_dic['content-type'] = str(content_type).split(';')[0]
    else:
        return_dic['content-type'] = 'None'
    # Cleaning the sender:
    if '<' in str(sender) or '>' in str(sender):
        sender = str(sender).replace('>', '').replace('<', '').replace('"', '').split(' ')
    else:
        sender = str(sender).replace('"', '').split(' ')
    var = 0  # We are ignoring the first word (name of sender).
    for i in sender:
        if ('@' in i and var != 0) or ('@' in i and len(sender) == 1):
            return_dic['from_full'] = i
            # Cleaning the sender attribute for low cardinality:
            fro = i.split(".")
            if len(fro) > 1:  # If there is a sender:
                return_dic['from'] = fro[-1]
            else:  # Handles no sender.
                return_dic['from'] = None
            break
        var += 1
    return return_dic

In [17]:
""" Returns a dictionary with the cleaned time and day values.

This function focuses on getting the day and time values of when the email was sent and then ensuring that these values won't have a cardinality problem.
"""
def clean_date_time(date_val, return_dic):
    try:
        if date_val is not None:  # If there is actually a date value.
            date_temp = str(date_val).replace(',', '').split(' ')
            # Conditional logic dependent on whether the day is included in the Date email attribute.
            if ord(date_temp[0][0]) > 64:  # There is a day of the week.
                day = date_temp[0].lower()
                return_dic = day_sep(day, return_dic)
                hour = date_temp[4].split(':')[0]
                if (date_temp[5]):  # If there is a time zone adjustment.
                    if '-' in date_temp[5]:
                        fin_hour = int(hour) + int(date_temp[5].split('-')[1]) / 100
                    else: 
                        fin_hour = int(hour) - int(date_temp[5].split('+')[1]) / 100
                    return_dic = time_of_day_sep(fin_hour, return_dic)
                else:
                    return_dic = time_of_day_sep(hour, return_dic)
            else:
                day = datetime(int(date_temp[2]), datetime.strptime((date_temp[1]), '%b').month, int(date_temp[0])).strftime('%a').lower()
                return_dic = day_sep(day, return_dic)
                hour = date_temp[3].split(':')[0]
                if (date_temp[4]):
                    if '-' in date_temp[4]:
                        fin_hour = int(hour) + int(date_temp[4].split('-')[1]) / 100
                    else: 
                        fin_hour = int(hour) - int(date_temp[4].split('+')[1]) / 100
                    return_dic = time_of_day_sep(fin_hour, return_dic)
                else:
                    return_dic = time_of_day_sep(hour, return_dic)
            return return_dic
        else:
            return_dic['hour'] = None
            return_dic['day'] = None
            return return_dic
    except:
        return_dic['hour'] = None
        return_dic['day'] = None
        return return_dic

In [18]:
""" Returns a dictionary with the value of an hour in integer form.

This function is focused on decreasing the cardinality problem for hour values.
"""
def time_of_day_sep(hour, return_dic):
    if hour > 24:
        hour = hour - 24
    if hour >= 0 and hour < 4:
        return_dic['hour'] = 0
    elif hour >= 4 and hour < 8:
        return_dic['hour'] = 1
    elif hour >= 8 and hour < 12:
        return_dic['hour'] = 2
    elif hour >= 12 and hour < 16:
        return_dic['hour'] = 3
    elif hour >= 16 and hour < 20:
        return_dic['hour'] = 4
    elif hour >= 20 and hour <= 24:
        return_dic['hour'] = 5
    else:
        return_dic['hour'] = None
    return return_dic

In [19]:
""" Returns a dictionary with the value of a day in integer form.

This function is focused on decreasing the cardinality problem for day values.
"""
def day_sep(day, return_dic):
    if day == 'sun':
        return_dic['day'] = 0
    elif day == 'mon':
        return_dic['day'] = 1
    elif day == 'tue':
        return_dic['day'] = 2
    elif day == 'wed':
        return_dic['day'] = 3
    elif day == 'thu':
        return_dic['day'] = 4
    elif day == 'fri':
        return_dic['day'] = 5
    elif day == 'sat':
        return_dic['day'] = 6
    else:
        return_dic['day'] = None
    return return_dic

After this point, the remaining effort is put into improving the dataset and adding other important predictive variables to the dataset. 

In [20]:
# Do not re-run this. This takes more than a few minutes.
# This is the list of dictionaries that represents all emails and their attributes (with everything cleaned).

email_content = email_extract(cl)

> <h2> Spam Keyword Database and Bayesian Spam Classifier </h2>

After cleaning the keywords from all of the emails, I began to build up my database with all of the mined keywords. The database is a python dictionary that contains the following attributes for any given keyword:

1. Count of how many spam emails the keyword shows up in.
2. Count of how many ham emails the keyword shows up in.
3. Total count of how many times the keyword showed up.

These counts are then used to assign a value of "spaminess" to any given keyword. This is done through an equation I outline later in this notebook.

> <h3> Spam Keyword Database </h3>

As specified above, these next two functions are being used to create the spam keyword database. Only the first occurrence of a keyword for any given email will be included in the database (thus no double counting, which would decrease the accuracy of any analysis using the database).

In [23]:
# This block of code creates the keyword database, which includes all of the keywords from the emails.

keyword_data = {}

for email in email_content:
    temp_list = []
    # Handler for missing message body or subject_line.
    if email['subject_line'] is not None and email['message'] is not None:
        email_keywords = email['subject_line'] + email['message']
    elif email['subject_line'] is not None:
        email_keywords = email['subject_line']
    else:
        email_keywords = email['message']
    email_keywords_str = []
    # This changes the dictionary from spacy tokens to strings.
    for i in email_keywords:
        email_keywords_str.append(str(i))
    # Making sure that only unique values are present:
    for keyw in email_keywords_str:
        if keyw not in temp_list:
            temp_list.append(keyw)
    # Actual creation of the keyword_data dictionary:
    keyword_data = keyword_parser(email, temp_list, keyword_data)

In [22]:
""" Takes in and returns a dictionary representing the spam keyword database, with additional keyword entries.

This helper function helps create the spam keyword database through handling actually adding values to the database.
"""
def keyword_parser(email, keywords, keyword_data):
    for word in keywords:
        if word in keyword_data.keys():  # If the keyword is already in the dictionary.                
            if email['label'] == 1:  # Not spam.
                keyword_data[word]['ham'] += 1 
            else:  # Is spam.
                keyword_data[word]['spam'] += 1
            keyword_data[word]['total_count'] += 1
        else:  # The word is not in the dictionary.
            keyword_data[word] = {'ham':0, 'spam':0, 'total_count':0}
            if email['label'] == 1:
                keyword_data[word]['ham'] += 1
            else:
                keyword_data[word]['spam'] += 1
            keyword_data[word]['total_count'] += 1
    return keyword_data

In [24]:
# This is the total count of keywords:

len(keyword_data)

51723

In [25]:
# These are some of the keywords in the database:

keys = keyword_data.keys()
count = 0
for i in keys:
    if count <= 10:
        print(type(i), i)
    count += 1

<class 'str'> kind
<class 'str'> money
<class 'str'> maker
<class 'str'> free
<class 'str'> link
<class 'str'> webcam
<class 'str'> wanted
<class 'str'> wanna
<class 'str'> sexually
<class 'str'> curious
<class 'str'> teens


> <h3> Bayesian Spam Classifier </h3>

With the spam keyword database created, I then began to assign a "spamminess" score to each of the keywords, indicating whether it has a high or low probability of being associated with spam emails. The equation used to compute this score is showcased below:

$$P_{spam}(token) = \frac{\frac{n_{spam}(token)}{n_{spam}}}{\frac{n_{spam}(token)}{n_{spam}}+\frac{n_{ham}(token)}{n_{ham}}} $$

In [26]:
# Total number of spam and nonspam emails used for later analysis.

total_ham = 0
total_spam = 0

spam_counts = list(cl.values())
for val in spam_counts:
    if val == 0:
        total_spam += 1
    elif val == 1:
        total_ham += 1
total_ham, total_spam

(2949, 1378)

In [27]:
# Creating a spam score for every entry in the spam dictionary. High value indicates a "spammy" keyword.

for word in keyword_data.keys():
    keyword_data[word]['spam_score'] = ((keyword_data[word]['spam'])/total_spam) / (((keyword_data[word]['spam'])/total_spam) + (((keyword_data[word]['ham'])/total_ham)))

In [28]:
# This shows some of the counts for the values in the dictionary:

counter = 0
for word in keyword_data.keys():
    if keyword_data[word]['total_count'] > 1:
        print(keyword_data[word]['total_count'], word)
        print(keyword_data[word])
    if counter >= 5:
        break
    counter +=1

199 kind
{'ham': 141, 'spam': 58, 'total_count': 199, 'spam_score': 0.46817211364756117}
277 money
{'ham': 107, 'spam': 170, 'total_count': 277, 'spam_score': 0.7727320369434134}
23 maker
{'ham': 17, 'spam': 6, 'total_count': 23, 'spam_score': 0.4303015564202335}
721 free
{'ham': 350, 'spam': 371, 'total_count': 721, 'spam_score': 0.6940456578018357}
375 link
{'ham': 136, 'spam': 239, 'total_count': 375, 'spam_score': 0.7899529151475142}
22 webcam
{'ham': 15, 'spam': 7, 'total_count': 22, 'spam_score': 0.49967322634521816}


> <h3> Using Weighted Average of Keyword Scores to Estimate Email Spamminess </h3>

To best leverage the spam scores, I computed with my spam keyword database, I used two distinct averages to assign a "spamminess" score to each email. These are:

1. Averaging based solely on keyword appearance in the subject-line and main message body.

2. Averaging based on keyword appearance in the subject-line and main message body along with using additional weighting based on how many occurances of the keyword there are in the spam keyword database. The though here is that the "spamminess" score given to a keyword with more occurances will most likely be more accurate.

These are showcased below:

> Equation 1.

$$P_{spam}(email) = \big[W_{keyword_1} * P_{spam}(keyword_1)\big] + \big[W_{keyword_2} * P_{spam}(keyword_2)\big] +  \hspace{.25cm}... \hspace{.1cm}+ \big[W_{keyword_i} * P_{spam}(keyword_i)\big]$$

> Equation 2. 

$$P_{spam}(email) = \bigg[\big[W_{keyword_1} * P_{spam}(keyword_1) * Count_{keyword_1} \big] + \big[W_{keyword_2} * P_{spam}(keyword_2) * Count_{keyword_2} \big] +  \hspace{.25cm}... \hspace{.1cm}+ \big[W_{keyword_i} * P_{spam}(keyword_i) * Count_{keyword_i} \big]\bigg] \bigg/ Total Count$$

In [33]:
# Calculating the spam score for every email.
# Subject-line and main message body have different scores.
# nwas score corresponds with equation 1.
# extra score corresponds with equation 2.

for email in email_content:
    if email['subject_line'] is not None and email['message'] is not None:
        if email['subject_line'] != [] and email['message'] != []:
            email['subject_line_nwas_score'] = keyword_occurance(email['subject_line'])
            email['main_message_nwas_score'] = keyword_occurance(email['message'])
            email['subject_line_extra_score'] = keyword_occurance_database(email['subject_line'])
            email['main_message_extra_score'] = keyword_occurance_database(email['message']) 
        elif email['subject_line'] != []:
            email['subject_line_nwas_score'] = keyword_occurance(email['subject_line'])
            email['main_message_nwas_score'] = None
            email['subject_line_extra_score'] = keyword_occurance_database(email['subject_line'])
            email['main_message_extra_score'] = None
        else:  # Subject line == [].
            email['subject_line_nwas_score'] = None
            email['main_message_nwas_score'] = keyword_occurance(email['message'])
            email['subject_line_extra_score'] = None
            email['main_message_extra_score'] = keyword_occurance_database(email['message']) 
        
    elif email['subject_line'] is not None:
        if email['subject_line'] != []:
            email['subject_line_nwas_score'] = keyword_occurance(email['subject_line'])
            email['subject_line_extra_score'] = keyword_occurance_database(email['subject_line'])
        else:
            email['subject_line_extra_score'] = None
            email['subject_line_nwas_score'] = None
    else:
        if email['message'] != []:
            email['main_message_nwas_score'] = keyword_occurance(email['message'])
            email['main_message_extra_score'] = keyword_occurance_database(email['message']) 
        else:
            email['main_message_nwas_score'] = None
            email['main_message_extra_score'] = None

In [30]:
""" Takes in a list of keywords and returns the "nwas" spam score for a given email.

Weight for each score based on weight in email.
"""
def keyword_occurance(keywords):
    if (len(keywords) != 0):
        running_total = 0.0
        for keyw in keywords:
            score = keyword_data[str(keyw)]['spam_score']
            running_total += score
        return running_total / len(keywords)
    else:
        return None

In [31]:
""" Takes in a list of keywords and returns the "extra" spam score for a given email.

Weight for each score based on weight in database and weight in email.
"""
def keyword_occurance_database(keywords):
    if (len(keywords) != 0):
        running_denom = 0.0
        running_total = 0.0
        for keyw in keywords:
            score = keyword_data[str(keyw)]['spam_score'] * keyword_data[str(keyw)]['total_count']
            running_denom += keyword_data[str(keyw)]['total_count']
            running_total += score
        return running_total / running_denom # len(keywords)
    else:
        return None

In [32]:
# Number of emails with unindexable message bodies. 
# This is relevant because these NaN values in the dataset will need to be accounted for.

count = 0
for i in email_content:
    if i['message'] == None:
        count += 1
count

47

> <h3> Creating and Cleaning up the Pandas Dataframe </h3>

After adding the spam scores for all of the emails, my attention is now on creating a pandas dataframe with my data. Because of the way I processed my emails, this is a relatively easy step. However, I then check to see if there are any values that should not be there, and I handle all NaN values.

In [34]:
# Creating the dataframe.

email_data_frame = pd.DataFrame(email_content)

In [35]:
# Droping values that I don't plan on using in my analysis and renaming certain columns.

email_data_frame = email_data_frame.drop(['message', 'subject_line', 'from_full'], axis=1)
email_data_frame.rename(columns={'X-Mailer':'Mailer', 'X-Priority':'Priority', 'MIME-Version': 'MIME', 'content-type':'content'}, inplace=True)

# Making sure that the 'content' and 'from' columns are encoded correctly with lower case strings:

email_data_frame['content'] = email_data_frame['content'].str.lower()
email_data_frame['from'] = email_data_frame['from'].str.lower()

len(email_data_frame)

4327

In [36]:
email_data_frame.head()

Unnamed: 0,MIME,Subscribe,Unsubscribe,Mailer,Priority,content,day,email,from,hour,html_use,label,main_message_extra_score,main_message_nwas_score,re,receive_count,subject_line_extra_score,subject_line_nwas_score
0,1,0,0,0,3,multipart/mixed,5.0,TRAIN_00000.eml,com,1.0,,0,,,0,2,0.670096,0.591313
1,1,0,0,1,1,text/plain,1.0,TRAIN_00001.eml,com,0.0,0.0,0,0.789154,0.674667,0,8,0.687633,0.567793
2,1,1,1,0,3,multipart/signed,6.0,TRAIN_00002.eml,org,3.0,0.0,1,0.463395,0.377856,1,11,0.572255,0.465104
3,1,0,0,1,3,text/plain,2.0,TRAIN_00003.eml,com,0.0,0.0,0,0.842464,0.821,0,7,0.817109,0.811058
4,0,0,0,0,3,none,2.0,TRAIN_00004.eml,,5.0,0.0,0,0.7539,0.619219,0,4,0.786136,0.786136


In [37]:
# Checking to make sure that all columns have correct values.
# All of these columns should not have any possibility of including NaN values. 

assert not email_data_frame['MIME'].isnull().values.any()
assert not email_data_frame['Subscribe'].isnull().values.any()
assert not email_data_frame['Unsubscribe'].isnull().values.any()
assert not email_data_frame['Mailer'].isnull().values.any()
assert not email_data_frame['Priority'].isnull().values.any()
assert not email_data_frame['content'].isnull().values.any()
assert not email_data_frame['re'].isnull().values.any()
assert not email_data_frame['receive_count'].isnull().values.any()

In [38]:
# For emails that have either a missing main message or subject line, I am replacing those spam scores with the scores of whichever is not missing.
# This is sensible because the scores are coming from the same spam keyword database.

email_data_frame['main_message_extra_score'] = email_data_frame['main_message_extra_score'].fillna(email_data_frame['subject_line_extra_score'])
email_data_frame['main_message_nwas_score'] = email_data_frame['main_message_nwas_score'].fillna(email_data_frame['subject_line_nwas_score'])
email_data_frame['subject_line_extra_score'] = email_data_frame['subject_line_extra_score'].fillna(email_data_frame['main_message_extra_score'])
email_data_frame['subject_line_nwas_score'] = email_data_frame['subject_line_nwas_score'].fillna(email_data_frame['main_message_nwas_score'])

# For the 'html_use', 'from', 'day', and 'hour' columns I am replacing any missing values with the most common occurance for each column.
# This is sensible because averaging would not make sense in this case, and this would not impact the results. 

email_data_frame['html_use'] = email_data_frame['html_use'].fillna(email_data_frame['html_use'].value_counts().index[0])
email_data_frame['from'] = email_data_frame['from'].fillna(email_data_frame['from'].value_counts().index[0])
email_data_frame['day'] = email_data_frame['day'].fillna(email_data_frame['day'].value_counts().index[0])
email_data_frame['hour'] = email_data_frame['hour'].fillna(email_data_frame['hour'].value_counts().index[0])

In [39]:
email_data_frame.head()

Unnamed: 0,MIME,Subscribe,Unsubscribe,Mailer,Priority,content,day,email,from,hour,html_use,label,main_message_extra_score,main_message_nwas_score,re,receive_count,subject_line_extra_score,subject_line_nwas_score
0,1,0,0,0,3,multipart/mixed,5.0,TRAIN_00000.eml,com,1.0,0.0,0,0.670096,0.591313,0,2,0.670096,0.591313
1,1,0,0,1,1,text/plain,1.0,TRAIN_00001.eml,com,0.0,0.0,0,0.789154,0.674667,0,8,0.687633,0.567793
2,1,1,1,0,3,multipart/signed,6.0,TRAIN_00002.eml,org,3.0,0.0,1,0.463395,0.377856,1,11,0.572255,0.465104
3,1,0,0,1,3,text/plain,2.0,TRAIN_00003.eml,com,0.0,0.0,0,0.842464,0.821,0,7,0.817109,0.811058
4,0,0,0,0,3,none,2.0,TRAIN_00004.eml,com,5.0,0.0,0,0.7539,0.619219,0,4,0.786136,0.786136


In [40]:
# None of these columns should now have NaN values:

assert not email_data_frame['day'].isnull().values.any()
assert not email_data_frame['from'].isnull().values.any()
assert not email_data_frame['hour'].isnull().values.any()
assert not email_data_frame['html_use'].isnull().values.any()

In [41]:
# Dropping rows that have the following attributes: unindexible main messages and empty subject-lines.
# This ends up being 8 rows of data.

email_data_frame = email_data_frame.dropna()
len(email_data_frame) 

4319

In [42]:
# The remaining columns should not have NaN values now:

assert not email_data_frame['main_message_extra_score'].isnull().values.any()
assert not email_data_frame['main_message_nwas_score'].isnull().values.any()
assert not email_data_frame['subject_line_extra_score'].isnull().values.any()
assert not email_data_frame['subject_line_nwas_score'].isnull().values.any()

In [43]:
# Changing datatype of 'day', 'hour', and 'html_use' columns.

email_data_frame['day'] = email_data_frame['day'].astype(int)
email_data_frame['hour'] = email_data_frame['hour'].astype(int)
email_data_frame['html_use'] = email_data_frame['html_use'].astype(int)

In [44]:
email_data_frame.head()

Unnamed: 0,MIME,Subscribe,Unsubscribe,Mailer,Priority,content,day,email,from,hour,html_use,label,main_message_extra_score,main_message_nwas_score,re,receive_count,subject_line_extra_score,subject_line_nwas_score
0,1,0,0,0,3,multipart/mixed,5,TRAIN_00000.eml,com,1,0,0,0.670096,0.591313,0,2,0.670096,0.591313
1,1,0,0,1,1,text/plain,1,TRAIN_00001.eml,com,0,0,0,0.789154,0.674667,0,8,0.687633,0.567793
2,1,1,1,0,3,multipart/signed,6,TRAIN_00002.eml,org,3,0,1,0.463395,0.377856,1,11,0.572255,0.465104
3,1,0,0,1,3,text/plain,2,TRAIN_00003.eml,com,0,0,0,0.842464,0.821,0,7,0.817109,0.811058
4,0,0,0,0,3,none,2,TRAIN_00004.eml,com,5,0,0,0.7539,0.619219,0,4,0.786136,0.786136


In [45]:
# Ensuring the datatypes are all correct for all observations in the dataset:

assert email_data_frame['MIME'].apply(type).all() == int
assert email_data_frame['Subscribe'].apply(type).all() == int
assert email_data_frame['Unsubscribe'].apply(type).all() == int
assert email_data_frame['Mailer'].apply(type).all() == int
assert email_data_frame['Priority'].apply(type).all() == int
assert email_data_frame['content'].apply(type).all() == str
assert email_data_frame['day'].apply(type).all() == int
assert email_data_frame['email'].apply(type).all() == str
assert email_data_frame['from'].apply(type).all() == str
assert email_data_frame['hour'].apply(type).all() == int
assert email_data_frame['html_use'].apply(type).all() == int
assert email_data_frame['label'].apply(type).all() == int
assert email_data_frame['main_message_extra_score'].apply(type).all() == float
assert email_data_frame['main_message_nwas_score'].apply(type).all() == float
assert email_data_frame['re'].apply(type).all() == int
assert email_data_frame['receive_count'].apply(type).all() == int
assert email_data_frame['subject_line_extra_score'].apply(type).all() == float
assert email_data_frame['subject_line_nwas_score'].apply(type).all() == float

In [46]:
# There is a problem with the ecoding of the content column values.

email_data_frame.content.unique()

array(['multipart/mixed', 'text/plain', 'multipart/signed', 'none',
       'text/html', 'multipart/related', 'multipart/alternative',
       '\n text/html', 'multipart/mixed ', 'multipart/report'],
      dtype=object)

In [47]:
# Fixing the encoding problem:

email_data_frame.content = email_data_frame.content.replace('\n text/html', 'text/html')
email_data_frame.content = email_data_frame.content.replace('multipart/mixed ', 'multipart/mixed')
email_data_frame.content.unique()

array(['multipart/mixed', 'text/plain', 'multipart/signed', 'none',
       'text/html', 'multipart/related', 'multipart/alternative',
       'multipart/report'], dtype=object)

> Exporting the dataset:

In [48]:
email_data_frame.to_csv("/data/thenateandre/email_data_frame.csv")