Our datasets are unstructured text files containing email headers, subject lines, and message bodies. In this part, we parse the text files, convert them into a structured pandas dataframe, and save the result as "structured.xlsx".

Execute the following cells only if running on Google Colab; they install the required packages and download the datasets from Google Drive.

In [1]:
!pip install PyDrive
!pip install XlsxWriter

Collecting PyDrive
  Downloading https://files.pythonhosted.org/packages/52/e0/0e64788e5dd58ce2d6934549676243dc69d982f198524be9b99e9c2a4fd5/PyDrive-1.3.1.tar.gz (987kB)
    100% |████████████████████████████████| 993kB 9.1MB/s 
Building wheels for collected packages: PyDrive
  Running setup.py bdist_wheel for PyDrive ... done
  Stored in directory: /content/.cache/pip/wheels/fa/d2/9a/d3b6b506c2da98289e5d417215ce34b696db856643bad779f4
Successfully built PyDrive
Installing collected packages: PyDrive
Successfully installed PyDrive-1.3.1
Collecting XlsxWriter
  Downloading https://files.pythonhosted.org/packages/33/50/136b801d106fcebb2428a764e5c599e020d8227a3623db078e05eb4793a5/XlsxWriter-1.0.5-py2.py3-none-any.whl (142kB)
    100% |████████████████████████████████| 143kB 5.5MB/s 
Installing collected packages: XlsxWriter
Successfully installed XlsxWriter-1.0.5


In [0]:
import os
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

Authenticate the PyDrive API to access Google Drive.

In [0]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

Now, we download and untar the two datasets, 1998 and trec07p, that we are going to work with.

In [0]:
download = drive.CreateFile({'id': '1QtoxpJmd1lys7c7LaYXiOjbzMdMOpeVX'})
download.GetContentFile('1998.tar')

In [0]:
download2 = drive.CreateFile({'id': '1xaJL1eoccrCyS45xgF23dVY_KCER-oAD'})
download2.GetContentFile('trec07p.tar')

In [0]:
!tar xf 1998.tar
!tar xf trec07p.tar

In [0]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

In [0]:
notHeaders = ['A',
 'A.normal2{color',
 'A.normal{color',
 'A.over{color',
 'ATT',
 'Action',
 'Added',
 'Address',
 'An',
 'Asunto',
 'Author',
 'Auto-Submitted',
 'BATCH',
 'Betreff',
 'Beurteilung',
 'Bookmark',
 'Brma',
 'BrmaSmtpAuthUser',
 'C',
 'Call',
 'Cancel-Lock',
 'Carlson',
 'Caveats',
 'Changeset',
 'Cia1iis',
 'Cited',
 'Classification',
 'Comment',
 'Company',
 'Complaints',
 'Congratulations',
 'Contact',
 'Copyright',
 'Corp.<http',
 'Corporation<http',
 'Credibility',
 'Date-warning',
 'Datum',
 'De',
 'Details',
 'De\xa0',
 'Diagnostic-Code',
 'Dinah',
 'Envoyé\xa0',
 'FATAL',
 'FDA',
 'FONT-SIZE',
 'Featuring',
 'Fixes',
 'Gesendet',
 'Hash',
 'Hinweis',
 'ISIN',
 'Importance',
 'Inc.<http',
 'Index',
 'Jabber-ID',
 'Jim',
 'John',
 'Kopia',
 'LINE-HEIGHT',
 'List-Help',
 'Log',
 'London',
 'Lookup',
 'MOTD',
 'Mail-Followup-To',
 'Managed-by',
 'Market',
 'Metze',
 'Modified',
 'NOTE',
 'Name',
 'Napster<http',
 'Newshawk',
 'Notice',
 'Number',
 'OTC',
 'Objet\xa0',
 'Old-Return-Path',
 'OpenPGP',
 'Organisation',
 'Organization',
 'PADDING-BOTTOM',
 'PHONE',
 'Package',
 'Page',
 'Para',
 'Phone',
 'Posted',
 'Precedence',
 'President',
 'Priority',
 'Products',
 'Pubdate',
 'Publicitate',
 'REF',
 'RT-Ticket',
 'Rangel',
 'Received-SPF',
 'References',
 'Reminder',
 'Removed',
 'Reporting-MTA',
 'S.umbol',
 'Sent',
 'Severity',
 'Site',
 'Skickat',
 'Source',
 'Spoken',
 'Sym8oL',
 'Symbol',
 'Sys.putenv("http_proxy"="http',
 'TELEPHONE',
 'Talk',
 'Tel',
 'Teste',
 'Ticker',
 'Till',
 'Timing',
 'Tname',
 'Try',
 'Type',
 'UEI',
 'URL',
 'Visit',
 'Von',
 'WASHINGTON',
 'WKN',
 'Webpage',
 'Website',
'A.RVTS2',
 'ASTIG',
 'AXIS',
 'Ambieen',
 'Anova',
 'Below',
 'CARGO',
 'CNN',
 'CNNMoney',
 'City',
 'Clinton',
 'Collins',
 'Given',
 'HEADLINES',
 'IRAN',
 'Iterations',
 'Juego!<o',
 'Kernel',
 'L<Parrot',
 'Parrot',
 'Price']

Since many email bodies also contain HTML markup, we define a function that extracts the visible text from HTML content.

In [0]:
def parseTextFromHTML(body):
    # Lowercase the whole body and parse it as HTML; plain-text bodies pass through as text
    body = body.lower()
    soup = BeautifulSoup(body, 'html.parser')
    # Remove <style> and <script> elements, whose contents are not visible text
    for s in soup(['style', 'script']):
        s.extract()
    return soup.get_text().strip()
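
To make the idea of "visible text" concrete, here is a minimal sketch that does the same kind of filtering with only the standard library's html.parser instead of BeautifulSoup. It is not the function used in this notebook; the names VisibleTextParser and visible_text and the sample string are hypothetical, chosen for illustration.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping anything inside <style> or <script>."""

    SKIPPED = {'style', 'script'}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element
        if self.skip_depth == 0:
            self.parts.append(data)


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts).strip()


# Hypothetical sample input
sample = ('<html><head><style>p {color: red}</style></head>'
          '<body><p>Buy now!</p><script>track()</script></body></html>')
print(visible_text(sample))
```

BeautifulSoup is more forgiving of the malformed markup that is common in spam, which is a reasonable argument for keeping it in the actual pipeline.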

In [10]:
"""# parse the email file(spam or ham) to create a pandas dataframe
def getDFFromEmail(path, spam):
    l = []
    precurrkv = []
    sufcurrkv = []
    count = 0
    body = False
    try:
        stream = open(path, errors='strict', encoding='UTF-8')
        stream.readlines()
        stream.seek(0)
        print("Using UTF-8 encoding")
    except (Exception):
        print("Using ANSI encoding")
        stream = open(path, errors='strict', encoding='ANSI')
        stream.readlines()
        stream.seek(0)
        
    for line in stream:
        if(line.startswith('<DOCTYPE')):
            continue
            
        colonIndex = line.find(":")
        
        if(count > 15 and not body):
            body = True
            l.append(precurrkv)
            
        if(line[0].isupper() and line.find(' ',0,colonIndex)<0 and colonIndex >= 0 and colonIndex < 30 and not body 
           and line[:colonIndex] not in notHeaders):
            count = 0
            if(len(sufcurrkv) > 0 and len(precurrkv) > 0):
                precurrkv[1] = "".join((precurrkv[1], "".join(sufcurrkv)))
                l.append(precurrkv)
            elif(len(precurrkv) > 0):
                l.append(precurrkv)
            precurrkv = line.split(sep=":", maxsplit=1)
            sufcurrkv.clear()

        elif(str(line[0:2]).isspace() and not body):
            count += 1
            sufcurrkv.append(line)
        elif(str(line).startswith('>') and not body):
            body = True
            sufcurrkv = [''.join((':'.join(precurrkv), "".join(sufcurrkv)))]
        else:
            #count += 1
            sufcurrkv.append(line)

    l.append(["Body", parseTextFromHTML("".join(sufcurrkv))])
    l.append(["Spam", 'Spam' if spam else 'Ham'])
    l.append(["Tname", path])

    d = pd.DataFrame(np.array(l)).drop_duplicates(subset=0)
    return d    """
""""""

''

We define a method that parses an email text file represented by the path parameter and returns a pandas dataframe.

In [0]:
# parse the email file(spam or ham) to create a pandas dataframe
from collections import deque
from codecs import open
def getDFFromEmail(path, spam):
    l = dict()
    precurrkv = []
    sufcurrkv = deque()
    count = 0
    body = False
        
    try:
        stream = open(path, errors='strict')
        stream.readlines()
        stream.seek(0)
        print("Using UTF-8 encoding")
    except (Exception):
        try:
            print("Using ISO-8859-1 encoding")
            stream = open(path, errors='strict', encoding='iso-8859-1')
            stream.readlines()
            stream.seek(0)
        except (Exception):
            return ""
        
    for line in stream:
        if(line.startswith('<DOCTYPE')):
            continue
            
        colonIndex = line.find(":")
        
        if(count > 15 and not body):
            body = True

            
        if(line[0].isupper() and line.find(' ',0,colonIndex)<0 and colonIndex >= 0 and colonIndex < 30 and not body 
           and line[:colonIndex] not in notHeaders):
            count = 0
            if(sufcurrkv and len(precurrkv) > 0):
                fieldname = sufcurrkv.popleft()
                l[fieldname] = [''.join(sufcurrkv)]
            
            precurrkv = line.split(sep=":", maxsplit=1)
            sufcurrkv.clear()
            sufcurrkv.extend(precurrkv)

        elif(str(line[0:2]).isspace() and not body):
            count += 1
            sufcurrkv.append(line)
        elif(str(line).startswith('>') and not body):
            body = True
            sufcurrkv.append(line)
        else:
            #count += 1
            sufcurrkv.append(line)

    fieldname = sufcurrkv.popleft()
    l[fieldname] = sufcurrkv.popleft()    
    l["Body"] = [parseTextFromHTML("".join(sufcurrkv))]
    l["Spam"] = ['Spam'] if spam else ['Ham']
    l["Tname"] = [path]

    d = pd.DataFrame(l)
    return d.T.reset_index()    
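
For comparison, the standard library's email package can also split a message into headers and body. The sketch below is an alternative to the hand-written parser above, not a description of it: read_text, email_to_record, and the sample message are hypothetical, and the encoding fallback assumes the same UTF-8 then ISO-8859-1 order used in this notebook.

```python
from email import message_from_string


def read_text(raw_bytes):
    """Decode as UTF-8, falling back to ISO-8859-1 (which accepts any byte)."""
    try:
        return raw_bytes.decode('utf-8')
    except UnicodeDecodeError:
        return raw_bytes.decode('iso-8859-1')


def email_to_record(raw_bytes, spam):
    """Return a dict with one entry per header plus Body and Spam."""
    msg = message_from_string(read_text(raw_bytes))
    record = {}
    for name, value in msg.items():
        # Keep the first occurrence when a header (e.g. Received) repeats
        record.setdefault(name, value)
    payload = msg.get_payload()
    # Multipart messages return a list of parts; this sketch ignores them
    record['Body'] = payload if isinstance(payload, str) else ''
    record['Spam'] = 'Spam' if spam else 'Ham'
    return record


# Hypothetical message; the \xe9 byte is not valid UTF-8, so the fallback is used
raw = (b"From: alice@example.com\n"
       b"Subject: Caf\xe9 offer\n"
       b"Received: from a.example.com\n"
       b"\tby b.example.com\n"
       b"\n"
       b"Hello there\n")
rec = email_to_record(raw, spam=True)
print(rec['Subject'], '|', rec['Spam'])
```

A record like this can be turned into a one-row dataframe, and continuation lines of folded headers are handled by the library rather than by counting indented lines.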

We define a generator for the 1998 dataset that iterates over all email text files in a directory and yields resulting dataframes.

In [0]:
# iterate over all files in the directory to create the dataframes
def getAllDFFromDirectory(directorypath):
    for filename in os.listdir(directorypath):
        filepath = os.path.join(directorypath, filename)
        print("Getting DF for " + filepath)
        # Files whose name contains "msg" (and does not start with "spm") are ham;
        # everything else is treated as spam
        isSpam = filename.startswith("spm") or filename.count("msg") == 0
        yield getDFFromEmail(filepath, isSpam)

In [0]:
from functools import reduce

# create dataframes from all directories that contain spam or ham email text files

In [14]:
directorypath = "1998/1998/03/"
dfs = [i for i in getAllDFFromDirectory(directorypath) if(isinstance(i,pd.DataFrame))]

Getting DF for 1998/1998/03/891020025.3222.txt
Using UTF-8 encoding
Getting DF for 1998/1998/03/891272355.532.txt
Using UTF-8 encoding
Getting DF for 1998/1998/03/890929482.24868.txt
Using UTF-8 encoding
Getting DF for 1998/1998/03/891184729.10913.txt
Using UTF-8 encoding
Getting DF for 1998/1998/03/891020032.3224.txt
Using UTF-8 encoding
Getting DF for 1998/1998/03/890929562.24883.txt
Using UTF-8 encoding
Getting DF for 1998/1998/03/890929468.24864.txt
Using ISO-8859-1 encoding
Getting DF for 1998/1998/03/890929499.24873.txt
Using UTF-8 encoding
Getting DF for 1998/1998/03/890929485.24869.txt
Using ISO-8859-1 encoding
Getting DF for 1998/1998/03/891285049.11748.txt
Using UTF-8 encoding
Getting DF for 1998/1998/03/891219144.5405.txt
Using UTF-8 encoding
Getting DF for 1998/1998/03/891020028.3223.txt
Using UTF-8 encoding
Getting DF for 1998/1998/03/891219152.5407.txt
Using UTF-8 encoding
Getting DF for 1998/1998/03/891285044.11747.txt
Using UTF-8 encoding
Getting DF for 1998/1998/03/891

We define a generator for the trec07p dataset that iterates over all email text files in a directory and yields dataframes.

In [0]:
# iterate over the first 5000 files in the directory to create the dataframes
def getAllDFFromDirectory2(directorypath, target):
    for filename in os.listdir(directorypath)[:5000]:
        filepath = os.path.join(directorypath, filename)
        print("Getting DF for " + filepath)

        # os.listdir returns files in arbitrary order, so look the label up by the
        # message number in the filename ("inmail.N") rather than by list position.
        # Assumes the N-th line of the index file labels inmail.N.
        label = target[int(filename.rsplit('.', 1)[1]) - 1]
        yield getDFFromEmail(filepath, label == 'spam')
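
Because os.listdir does not guarantee any particular order, one way to pair each file with its label is to key the labels by filename. The following is a minimal sketch under an assumption: each line of the index file has the form "<label> <relative path>". The helper read_labels and the sample index contents are hypothetical.

```python
import io
import os


def read_labels(index_lines):
    """Map each message filename (e.g. 'inmail.2') to its label."""
    labels = {}
    for line in index_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        label, relpath = parts
        labels[os.path.basename(relpath)] = label
    return labels


# Hypothetical index contents in the assumed "<label> <path>" format
sample_index = io.StringIO(
    "spam ../data/inmail.1\n"
    "ham ../data/inmail.2\n"
    "spam ../data/inmail.3\n"
)
labels = read_labels(sample_index)

# The lookup works regardless of the order in which files are listed
for filename in ['inmail.3', 'inmail.1', 'inmail.2']:
    print(filename, labels[filename] == 'spam')
```

With a dictionary, the label lookup no longer depends on the position of a file in the directory listing.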

In [16]:
def getTrec07Target():
    for line in open("trec07p/trec07p/full/index"):
        yield line.split(" ")[0]

target = [i for i in getTrec07Target()]
print(len(target))

directorypath = "trec07p/trec07p/data/"
dfs2 = [i for i in getAllDFFromDirectory2(directorypath, target) if(isinstance(i,pd.DataFrame))]

75419
Getting DF for trec07p/trec07p/data/inmail.71535
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.33277
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.10451
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.50259
Using ISO-8859-1 encoding
Getting DF for trec07p/trec07p/data/inmail.69434
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.44359
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.70925
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.53784
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.1906
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.13040
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.70343
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.69929
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.36598
Using UTF-8 encoding
Getting DF for trec07p/trec07p/data/inmail.22420
Using UTF-8 encoding
Getting DF

In [0]:
# Test for number of columns in dataframes
#dfs2.map(lambda df: df.shape[0])
#[i for i in zip(map(lambda df: df.shape[0],dfs2), range(len(dfs2))) if(i[0]>40)]


Use functools.reduce to merge all resulting dataframes into a single dataframe representing all emails.

In [0]:
df_final = reduce(lambda left,right: pd.merge(left,right,how='outer',on='index'), dfs+dfs2)
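
To show how reduce folds a list of per-email tables into one, here is a pandas-free sketch of the same outer-join idea on plain dictionaries. It is an illustration only, not the merge used above: add_record and the sample records are hypothetical, and each record is assumed to contain at least one field.

```python
from functools import reduce


def add_record(table, record):
    """Fold one email record into the table, padding missing fields with None."""
    n_rows = len(next(iter(table.values()), []))
    for field in record:
        # A field seen for the first time is backfilled for earlier emails
        table.setdefault(field, [None] * n_rows)
    for field, values in table.items():
        # Emails lacking a field get None, as in an outer join
        values.append(record.get(field))
    return table


# Hypothetical records: one dict of header fields per email
records = [
    {'From': 'a@example.com', 'Subject': 'hi'},
    {'From': 'b@example.com', 'X-UIDL': '123'},
]
table = reduce(add_record, records, {})
print(table)
```

Every field ends up with one value per email, which is the shape the final dataframe has after the transpose in the next cell.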

In [0]:
df_final = df_final.set_index('index').T

In [0]:
df_final['index'] = [i for i in range(df_final.shape[0])]
df_final.set_index('index', inplace=True)

In [19]:
df_final.head()

index,Body,Comments,Date,Delivered-To,From,Message-Id,Received,Return-Path,Spam,Subject,...,X-Centeq-MailScanner,X-Centeq-MailScanner-From,X-TM-AS-Product-Ver,X-TM-AS-Result,Username,SOURCE,X-Env-Sender,X-Msg-Ref,X-StarScan-Version,X-VirusChecked
0,increase your sales \n30% to 100% \nacce...,Authenticated Sender is <user122@whynot.net>\n,"Sat, 21 Mar 1998 23:20:27 -0600 (CST)\n",bguenter@hal.qcc.sk.ca\n,islander622@yahoo.com\n,<31867701_67397293>\n,from login_0246.whynot.net (mx.whynot.net[206...,<islander622@yahoo.com>\n,Spam,Re: Your Merchant Account\n,...,,,,,,,,,,
1,"hi,\nwould you like to earn an extra $700 a we...",Authenticated Sender is <cm_offer@earthling.n...,"Mon, 30 Mar 1998 00:43:57 -0500 (EST)\n",bruceg@qcc.sk.ca\n,cm_offer@earthling.net\n,<36424144_99983662>\n,from login_0122.ybecker.net (mail.ybecker.net...,<cm_offer@earthling.net>\n,Spam,I thought you might be interested\n,...,,,,,,,,,,
2,do you love cars?\n\nwant your own business?\n...,,"Thu, 19 Feb 98 17:31:31 EST\n",bait@mikhail.qcc.sk.ca\n,carlover@goplay.com\n,,from mail.anet-chi.com (1Cust245.tnt13.atl2.d...,<carlover@goplay.com>\n,Spam,AUTOMOBILE OPPORTUNITY\n,...,,,,,,,,,,
3,"hi,\n\nwhat's in it for you? \n\n*************...",,"\tSat, 28 Mar 98 17:07:45 PST8PDT\n",bruceg@qcc.sk.ca\n,\t<info@clubcaddie.com>\n,<659888169528.7523837031@clubcaddie.com>\n,from 207.212.50.128 [151.196.111.133] by mail...,<owner-linux-kernel-outgoing@vger.rutgers.edu>\n,Spam,What's in it for me?\n,...,,,,,,,,,,
4,,,"Thu, 26 Mar 1998 21:47:54 -0500\n",bguenter@hal.qcc.sk.ca\n,list@ListMe.com\n,<199803270247.VAA16763@rover.listme.com>\n,(from list@localhost)\n\tby rover.listme.com ...,<list@listme.com>\n,Spam,Your Search Engine Listing\n\n,...,,,,,,,,,,


Notice that some of the columns are duplicates that differ only in capitalization, e.g. 'Message-Id', 'Message-ID' and 'Message-id', so we define a function to combine these duplicate columns.

In [0]:
def combineDuplicateColumns(df_final, colNames):
    # The first name is the column to keep; the rest are duplicates merged into it
    originalCol = colNames[0]
    for duplicateCol in colNames[1:]:
        # Fill missing values in the kept column from the duplicate column
        df_final[originalCol] = df_final[originalCol].fillna(df_final[duplicateCol])

    # Drop the duplicates now that their values have been merged (modifies df_final in place)
    df_final.drop(labels=colNames[1:], axis=1, inplace=True)

In [0]:
# Combine duplicate columns
combineDuplicateColumns(df_final, ['Message-Id', 'Message-ID', 'Message-id'])
combineDuplicateColumns(df_final, ['Reply-To', 'Reply-to'])
combineDuplicateColumns(df_final, ['Mime-Version', 'MIME-version', 'MIME-Version'])
combineDuplicateColumns(df_final, ['Content-Type', 'Content-type'])
combineDuplicateColumns(df_final, ['Content-Transfer-Encoding', 'Content-transfer-encoding'])
combineDuplicateColumns(df_final, ['Error-To', 'Errors-To', 'Errors-to'])
combineDuplicateColumns(df_final, ['Content-Length','Content-length'])
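
The groups above were listed by hand. As a hedged sketch, candidate groups of columns that differ only in capitalization or surrounding whitespace could be found automatically. The helper find_case_variants and the sample column list are hypothetical.

```python
from collections import defaultdict


def find_case_variants(column_names):
    """Group names that are equal after stripping whitespace and lowercasing."""
    groups = defaultdict(list)
    for name in column_names:
        # str.strip() also removes non-breaking spaces such as '\xa0'
        groups[name.strip().lower()].append(name)
    return [names for names in groups.values() if len(names) > 1]


# Hypothetical column names
columns = ['Message-Id', 'Subject', 'Message-ID', 'Reply-To', 'Message-id', 'Reply-to']
for group in find_case_variants(columns):
    print(group)
```

Each resulting group has the same shape as the colNames argument of combineDuplicateColumns, with the first-seen spelling as the column to keep.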

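Now that the most common duplicate columns are combined, let's display all columns of our final dataframe. The output shows that some case variants, such as 'Content-Class' and 'Content-class', still remain.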

In [22]:
# Check for more duplicate columns
np.sort(df_final.columns.values)

array(['ADDRESS', 'AGE', 'ALERT', 'AT', 'Accreditor', 'Addressee', 'Age',
       'Amount', 'Approved-By', 'Architecture', 'Arrival-Date', 'Article',
       'Attention', 'Attn', 'Automatic-Legal-Notices', 'Availability',
       'BIRTH', 'Bcc', 'Bericht', 'BioC2007', 'Bloomberg', 'Body',
       'Bounce-To', 'Breakdown', 'BroadcastJobID', 'BroadcastRecipientID',
       'CC', 'CITY', 'CODE', 'COMMENTS', 'COMPANY', 'Category', 'Cc',
       'Cc\xa0', 'Cell', 'Cellular', 'Cialis', 'Cierre', 'Close',
       'Coefficients', 'Comments', 'Complaints-To', 'Consider',
       'Content-Base', 'Content-Class', 'Content-Description',
       'Content-Disposition', 'Content-ID', 'Content-Length',
       'Content-MD5', 'Content-Transfer-Encoding',
       'Content-Transfer-encoding', 'Content-Type', 'Content-class',
       'Content-description', 'Content-disposition', 'Content-language',
       'Content-return', 'Coordinator', 'Correction', 'D.i.a.l', 'DBD',
       'DBI->connect("dbi', 'DESIGN', 'DEVELOPME

Let's keep the columns that we need and eliminate the rest.

In [23]:
# Columns to keep: From, To, Body, Subject, Message-Id, X-UIDL, Sender, Spam
keep = pd.Index(['Spam', 'Body', 'Subject', 'From', 'To', 'Message-Id', 'X-UIDL', 'Sender'])
# Every other column is dropped
dropCols = df_final.columns.difference(keep)

df_new = df_final.drop(columns=dropCols)
df_new.columns

Index(['Body', 'From', 'Message-Id', 'Spam', 'Subject', 'To', 'X-UIDL',
       'Sender'],
      dtype='object', name='index')

Here is our final dataframe. The data is now in a structured format, so we save it as "structured.xlsx".

In [None]:
df_new.head()

In [24]:
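# The context manager saves and closes the workbook on exit
# (ExcelWriter.save() is no longer available in recent pandas versions)
with pd.ExcelWriter('structured.xlsx', engine='xlsxwriter') as excelwriter:
    df_new.to_excel(excelwriter, index=False)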



In [0]:
from google.colab import files
files.download('structured.xlsx')