In this part, we randomly partition our dataset into test-train splits. We then transform our train dataset to extract features and train a Multinomial Naive Bayes model. Then, we use the test dataset to calculate the accuracy of our prediction.

### If on Google Colab

Execute the below cells only if running on Google Colab. They install the needed packages and download the "structured.xlsx" file from Google drive. 

In [1]:
!pip install PyDrive
!pip install xlrd

Collecting xlrd
[?25l  Downloading https://files.pythonhosted.org/packages/07/e6/e95c4eec6221bfd8528bcc4ea252a850bffcc4be88ebc367e23a1a84b0bb/xlrd-1.1.0-py2.py3-none-any.whl (108kB)
[K    100% |████████████████████████████████| 112kB 4.2MB/s 
[?25hInstalling collected packages: xlrd
Successfully installed xlrd-1.1.0


In [0]:
import os
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

In [0]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [0]:
!rm -rf structured.xlsx
download = drive.CreateFile({'id': '1oh_fic0-1N1xh4OlTvGMQ5BTKjOmUYKi'})
download.GetContentFile('structured.xlsx')

In [0]:
download3 = drive.CreateFile({'id': '1iFVW0RxqL1VNrIOrfJXrDkPHF8ewXH49'})
download3.GetContentFile('spark-2.3.1-bin-hadoop2.7.tgz')

In [0]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!tar xf spark-2.3.1-bin-hadoop2.7.tgz
!pip install -q findspark

Set environment variables for Java and Spark

In [0]:
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.3.1-bin-hadoop2.7"

### If not on Google Colab

If running locally, run from here.

In [1]:
import pandas as pd
import numpy as np
import os
from time import time
from bs4 import BeautifulSoup
import math

We read the "structured.xlsx" file into a pandas dataframe.

In [2]:
def conv(content):
    '''
    convert datatype to string or np.NaN
    '''
    # if content is NaN
    if(content != content):
        return np.NaN
    # else
    return str(content)
    

In [3]:
converters = {'Spam':conv, 'Body':conv, 'Subject':conv, 'From':conv, 'To':conv, 'X-UIDL':conv, 'Message-Id':conv, 'Sender':conv}

In [4]:
df_final = pd.read_excel('structured.xlsx', sheet_name='Sheet1', index_col=None, converters=converters )

## Engineering features from Email Headers

Email headers show the route an email has taken before arriving at its recipient. It contains important information like sender, recipient, message-id, date and time, subject etc. 

There are two reasons why spammers try to malform email headers.<br/>
    1.They try to conceil their identity and real source of the email.<br/>
    2.They try to conceil the fact that the email was part of a mass mailing effort. 

FEATURE1: Spammers will sometimes enter all recipients in the Bcc field, and the address in the 'From' field is used in the 'To' field. Creating a new column 'Feature1' where 1 indicates 'From' field is the same as 'To' field, 0 indicates otherwise.

In [5]:
# Feature1 -> From same as To. If yes -> 1, else 0
df_final['Feature1'] = 0
df_final.loc[df_final['From'] == df_final['To'], 'Feature1'] = 1
df_final.head()

Unnamed: 0,From,To,Message-Id,Subject,Body,Spam,X-UIDL,Sender,Feature1
0,aj881c <aj881c@ix.netcom.com>\n,<bagpipes@acadia.net>\n,<19943672.886214@relay.comanche.denmark.eu> M...,2-1\n,email marketing works!!\n\nbull's eye gold is ...,Spam,,,0
1,iwbp@mailcity.com\n,members@your.net\n,<>\n,"Exclusive Internet Business, 1st Time Offered...",>>>this is the most exciting breakthrough ever...,Spam,,,0
2,am74rt <am74rt@worldnet.att.net>\n,<badams@eastky.com>\n,<19943672.886214@relay.comanche.denmark.eu> T...,2-17\n,email marketing works!!\n\nbull's eye gold is ...,Spam,,,0
3,"""D.Reynolds"" <subwiz1@friendlyserver.com>\n",,<199802161222.EAA24869@net1.aoci.com>\n,ADV: FREE DOWNLOAD:Register your web site to ...,free download.register your web site to over 7...,Spam,,,0
4,carlover@goplay.com\n,carlovers@america.com\n,<>\n,AUTOMOBILE OPPORTUNITY\n,do you love cars?\n\nwant your own business?\n...,Spam,,,0


FEATURE2: Again, because spammers send out emails by filling the 'Bcc' field, they sometimes leave the 'To' field empty or with an invalid string. Creating a new column 'Feature2' where 1 indicates invalid or NaN 'To' field, 0 indicates otherwise.

The method defined below splits the passed string around the ',' character, to get individual email addresses, 
which are stripped off of the new line chars. Method can handle email address strings like 
`"Tomas Jacobs" <RickyAmes@aol.com>` also. Regex is then used to check format correctness

In [7]:
import re
def isValidEmailFormat(emails):
    """The method splits the passed string around the ',' character, to get individual 
    email addresses, which are stripped off of the new line chars. Method can handle email address 
    strings like  "Tomas Jacobs" <RickyAmes@aol.com> also. Regex is then used to check format 
    correctness"""
    for email in str(emails).split(','):
        if(email.isspace() or len(email) == 0):
            continue
        
        # strip new line chars
        email = re.sub(r'(\n+)', r' ', str(email)).strip()
        print(email)
        
        # handle both "Tomas Jacobs" <RickyAmes@aol.com> or <RickyAmes@aol.com>
        if(re.match(r"(.+)<(.+)>|<(.+)>", email)):
            email = email[email.find("<")+1:-1]
            print(0, email)
            
        if(len(email) > 7):
            if(re.match("^.+@([?)[a-zA-Z0-9-.]+.([a-zA-Z]{2,3}|[0-9]{1,3})(]?))$", email) != None):
                print('continue')
                continue
            else:
                print('1')
                return 1
        else:
            return 1
        
    return 0

In [8]:
# Feature2 -> is the To column na or invalid ? 1->invalid, 0->valid
df_final['Feature2'] = 0
df_final['Feature2'] = df_final['To'].map(isValidEmailFormat)
df_final.loc[df_final['To'].isna(),'Feature2'] = 1
df_final[['Feature2','To']].head()

<bagpipes@acadia.net>
0 bagpipes@acadia.net
continue
members@your.net
continue
<badams@eastky.com>
0 badams@eastky.com
continue
nan
carlovers@america.com
continue
bait@mikhail.qcc.sk.ca
continue
<badmin@forum-de-beyrouth.com.lb>
0 badmin@forum-de-beyrouth.com.lb
continue
UDog244@aol.com
continue
baileyl@cream.cambridge.scr.slb.com
continue
baileyl@delphi.com
continue
baileyl@erols.com
continue
user@aol.com
continue
bguenter@gemprint.com
continue
buratuss_ef@bigfoot.com
continue
guluimai67@msn.com
continue
<vworlds@vworlds.com>
0 vworlds@vworlds.com
continue
p40508@presence4u.com
continue
bruceg@qcc.sk.ca
continue
usr999@aol.com
continue
BeSeen@At.Our.Site.com
continue
bguenter@gemprint.com       Is your site listed with the top search engines?  ListMe will       list you with 50 search engines and indexes for $90.       Satisfaction guaranteed!  Search engines are the only way most people have to find internet sites. But if your site is not listed
1
bruce.guenter@gemprint.com       Is 

0 mhln@mhln.com
continue
<smiles@speedy.uwaterloo.ca> Sent: Thursday
0 smiles@speedy.uwaterloo.ca> Sent: Thursda
1
perl6-internals@perl.org
continue
theorize@plg.uwaterloo.ca
continue
adf   <producttestpanel@speedy.uwaterloo.ca>
0 producttestpanel@speedy.uwaterloo.ca
continue
manager@speedy.uwaterloo.ca
continue
<smiles@speedy.uwaterloo.ca>
0 smiles@speedy.uwaterloo.ca
continue
<producttestpanel@speedy.uwaterloo.ca> Precedence: normal
0 producttestpanel@speedy.uwaterloo.ca> Precedence: norma
1
R-help@stat.math.ethz.ch
continue
"'slomascolo'" <slomascolo@zoo.ufl.edu>
0 slomascolo@zoo.ufl.edu
continue
<r-help@stat.math.ethz.ch> References: <9907582.post@talk.nabble.com>
0 r-help@stat.math.ethz.ch> References: <9907582.post@talk.nabble.com
continue
ktwarwic@speedy.uwaterloo.ca
continue
adf   <producttestpanel@speedy.uwaterloo.ca>
0 producttestpanel@speedy.uwaterloo.ca
continue
<manager@speedy.uwaterloo.ca>
0 manager@speedy.uwaterloo.ca
continue
ktwarwic@speedy.uwaterloo.ca
continue
mhln@e

0 theorize@plg.uwaterloo.ca
continue
<chortled@plg.uwaterloo.ca>
0 chortled@plg.uwaterloo.ca
continue
"gvreugde" <gvreugde@plg.uwaterloo.ca>
0 gvreugde@plg.uwaterloo.ca
continue
<manager@speedy.uwaterloo.ca>
0 manager@speedy.uwaterloo.ca
continue
ktwarwic@speedy.uwaterloo.ca
continue
"Milton Cezar Ribeiro" <milton_ruser@yahoo.com.br>
0 milton_ruser@yahoo.com.br
continue
"R-help" <r-help@stat.math.ethz.ch>
0 r-help@stat.math.ethz.ch
continue
producttestpanel@speedy.uwaterloo.ca
continue
<sktwarwic@speedy.uwaterloo.ca>
0 sktwarwic@speedy.uwaterloo.ca
continue
<producttestpanel@speedy.uwaterloo.ca> Precedence: normal
0 producttestpanel@speedy.uwaterloo.ca> Precedence: norma
1
BREAKINGNEWS Subscribers<BREAKINGNEWS-Subscribers@foxnews.com>
0 BREAKINGNEWS-Subscribers@foxnews.com
continue
tridge@samba.org
continue
ktwarwic@speedy.uwaterloo.ca
continue
r-help@stat.math.ethz.ch
continue
<theorize@plg.uwaterloo.ca>
0 theorize@plg.uwaterloo.ca
continue
"=?iso-8859-1?Q?ktwarwic@speedy=2Euwaterloo=

0 warwick@speedy.uwaterloo.ca
continue
nan
<theorize@speedy.uwaterloo.ca>
0 theorize@speedy.uwaterloo.ca
continue
<theorize@plg.uwaterloo.ca>
0 theorize@plg.uwaterloo.ca
continue
manager@speedy.uwaterloo.ca
continue
ktwarwic@speedy.uwaterloo.ca
continue
producttestpanel@speedy.uwaterloo.ca
continue
TEXTBREAKINGNEWS@CNNIMAIL12.CNN.COM
continue
<gnitpick@speedy.uwaterloo.ca>
0 gnitpick@speedy.uwaterloo.ca
continue
<deficient@speedy.uwaterloo.ca>
0 deficient@speedy.uwaterloo.ca
continue
cruiseca@speedy.uwaterloo.ca
continue
samba-cvs@samba.org
continue
"Tyree Simpson" <ktwarwic@speedy.uwaterloo.ca>
0 ktwarwic@speedy.uwaterloo.ca
continue
<theorize@plg.uwaterloo.ca>
0 theorize@plg.uwaterloo.ca
continue
samba-cvs@samba.org
continue
samba-cvs@samba.org
continue
the00@plg2.math.uwaterloo.ca
continue
"Tamra" <the00@plg.uwaterloo.ca>
0 the00@plg.uwaterloo.ca
continue
samba-cvs@samba.org
continue
manager@speedy.uwaterloo.ca
continue
<the00@plg.uwaterloo.ca>
0 the00@plg.uwaterloo.ca
continue
ktwa

0 the00@plg.uwaterloo.ca
continue
"the00" <the00@plg.uwaterloo.ca>
0 the00@plg.uwaterloo.ca
continue
"the00" <the00@plg2.math.uwaterloo.ca>
0 the00@plg2.math.uwaterloo.ca
continue
samba-cvs@samba.org
continue
ktwarwic@SPEEDY.UWATERLOO.CA
continue
<warwickktwarwic@speedy.uwaterloo.ca>
0 warwickktwarwic@speedy.uwaterloo.ca
continue
<soundtrackdeficient@speedy.uwaterloo.ca>
0 soundtrackdeficient@speedy.uwaterloo.ca
continue
manager@speedy.uwaterloo.ca
continue
gnitpick@speedy.uwaterloo.ca
continue
Gabor Grothendieck <ggrothendieck@gmail.com>
0 ggrothendieck@gmail.com
continue
<catchall@speedy.uwaterloo.ca>
0 catchall@speedy.uwaterloo.ca
continue
R packages list <r-packages@stat.math.ethz.ch>
0 r-packages@stat.math.ethz.ch
continue
"Jarod" <ktwarwic@speedy.uwaterloo.ca>
0 ktwarwic@speedy.uwaterloo.ca
continue
"gnitpick" <gnitpick@speedy.uwaterloo.ca>
0 gnitpick@speedy.uwaterloo.ca
continue
<theorize@plg.uwaterloo.ca>
0 theorize@plg.uwaterloo.ca
continue
ktwarwic@speedy.uwaterloo.ca
continu

0 theorize@plg.uwaterloo.ca
continue
samba-cvs@samba.org
continue
ktwarwic@speedy.uwaterloo.ca
continue
producttestpanel@speedy.uwaterloo.ca
continue
"theorize" <theorize@plg.uwaterloo.ca>
0 theorize@plg.uwaterloo.ca
continue
<theorize@plg.uwaterloo.ca>
0 theorize@plg.uwaterloo.ca
continue
samba-cvs@samba.org
continue
<tabvrttypmcbgnitpick@speedy.uwaterloo.ca>
0 tabvrttypmcbgnitpick@speedy.uwaterloo.ca
continue
"theorize" <theorize@plg.uwaterloo.ca>
0 theorize@plg.uwaterloo.ca
continue
theorize@plg.uwaterloo.ca
continue
<ktwarwic@speedy.uwaterloo.ca>
0 ktwarwic@speedy.uwaterloo.ca
continue
samba-technical@samba.org
continue
samba-cvs@samba.org
continue
"gnitpick" <gnitpick@speedy.uwaterloo.ca>
0 gnitpick@speedy.uwaterloo.ca
continue
"Babara Cox" <manager@speedy.uwaterloo.ca>
0 manager@speedy.uwaterloo.ca
continue
"the00" <the00@plg.uwaterloo.ca>
0 the00@plg.uwaterloo.ca
continue
"gnitpick" <gnitpick@speedy.uwaterloo.ca>
0 gnitpick@speedy.uwaterloo.ca
continue
samba-cvs@samba.org
contin

continue
<faisalabad@speedy.uwaterloo.ca>
0 faisalabad@speedy.uwaterloo.ca
continue
"ktwarwic" <ktwarwic@speedy.uwaterloo.ca>
0 ktwarwic@speedy.uwaterloo.ca
continue
"mhln" <mhln@mhln.com>
0 mhln@mhln.com
continue
the00@plg.uwaterloo.ca
continue
gnitpick@speedy.uwaterloo.ca
continue
<mack@speedy.uwaterloo.ca>
0 mack@speedy.uwaterloo.ca
continue
Theorize <theorize@plg.uwaterloo.ca>
0 theorize@plg.uwaterloo.ca
continue
Acm <the00@plg.uwaterloo.ca>
0 the00@plg.uwaterloo.ca
continue
theorize@plg.uwaterloo.ca
continue
samba-cvs@samba.org
continue
theorize@plg.uwaterloo.ca
continue
the00@speedy.uwaterloo.ca
continue
<ktwarwic@speedy.uwaterloo.ca>
0 ktwarwic@speedy.uwaterloo.ca
continue
<elaastic@speedy.uwaterloo.ca>
0 elaastic@speedy.uwaterloo.ca
continue
"the00" <the00@plg.uwaterloo.ca>
0 the00@plg.uwaterloo.ca
continue
ktwarwic@speedy.uwaterloo.ca
continue
<ktwarwic@speedy.uwaterloo.ca> Sent: Thursday
0 ktwarwic@speedy.uwaterloo.ca> Sent: Thursda
1
<gnitpick@speedy.uwaterloo.ca> Sent: Thur

continue
<ktwarwic@speedy.uwaterloo.ca>
0 ktwarwic@speedy.uwaterloo.ca
continue
<ktwarwic@speedy.uwaterloo.ca>
0 ktwarwic@speedy.uwaterloo.ca
continue
samba-cvs@samba.org
continue
ktwarwic@speedy.uwaterloo.ca
continue
<the00@plg.uwaterloo.ca>
0 the00@plg.uwaterloo.ca
continue
"Theorize" <theorize@plg.uwaterloo.ca>
0 theorize@plg.uwaterloo.ca
continue
"theorize" <theorize@plg.uwaterloo.ca>
0 theorize@plg.uwaterloo.ca
continue
samba-technical@lists.samba.org
continue
<the00@plg.uwaterloo.ca>
0 the00@plg.uwaterloo.ca
continue
catchall@speedy.uwaterloo.ca
continue
"mhln" <mhln@mhln.com>
0 mhln@mhln.com
continue
<the00@plg2.math.uwaterloo.ca>
0 the00@plg2.math.uwaterloo.ca
continue
<gnitpick@speedy.uwaterloo.ca>
0 gnitpick@speedy.uwaterloo.ca
continue
<gnitpick@speedy.uwaterloo.ca>
0 gnitpick@speedy.uwaterloo.ca
continue
samba-cvs@samba.org
continue
samba-cvs@samba.org
continue
<smiles@speedy.uwaterloo.ca>
0 smiles@speedy.uwaterloo.ca
continue
<mail@speedy.uwaterloo.ca>
0 mail@speedy.uwater

0 manager@speedy.uwaterloo.ca
continue
<wheat@speedy.uwaterloo.ca>
0 wheat@speedy.uwaterloo.ca
continue
<mail@speedy.uwaterloo.ca>
0 mail@speedy.uwaterloo.ca
continue
<ktwarwic@speedy.uwaterloo.ca>
0 ktwarwic@speedy.uwaterloo.ca
continue
avcooper@speedy.uwaterloo.ca
continue
"Valerie Lane" <mail@speedy.uwaterloo.ca>
0 mail@speedy.uwaterloo.ca
continue
<r-help@stat.math.ethz.ch>
0 r-help@stat.math.ethz.ch
continue
<theorize@plg.uwaterloo.ca>
0 theorize@plg.uwaterloo.ca
continue
<the00@plg.uwaterloo.ca>
0 the00@plg.uwaterloo.ca
continue
ktwarwic@speedy.uwaterloo.ca
continue
r-help@stat.math.ethz.ch
continue
"'Jim Lemon'" <jim@bitwrit.com.au>
0 jim@bitwrit.com.au
continue
"'Pedro A Reche'" <reche@research.dfci.harvard.edu> References: <66C18F5B-FEAD-440A-AC9D-DFAD53B145EA@research.dfci.harvard.edu> 	<4624A333.2020109@bitwrit.com.au>
0 reche@research.dfci.harvard.edu> References: <66C18F5B-FEAD-440A-AC9D-DFAD53B145EA@research.dfci.harvard.edu> 	<4624A333.2020109@bitwrit.com.au
continue
the

0 elaastic@speedy.uwaterloo.ca
continue
<producttestpanel@speedy.uwaterloo.ca> Precedence: normal
0 producttestpanel@speedy.uwaterloo.ca> Precedence: norma
1
<the00@plg.uwaterloo.ca>
0 the00@plg.uwaterloo.ca
continue
<mail@speedy.uwaterloo.ca>
0 mail@speedy.uwaterloo.ca
continue
<the00@plg.uwaterloo.ca>
0 the00@plg.uwaterloo.ca
continue
producttestpanel@speedy.uwaterloo.ca
continue
"Subscriber" <producttestpanel@speedy.uwaterloo.ca>
0 producttestpanel@speedy.uwaterloo.ca
continue
mail@speedy.uwaterloo.ca
continue
samba-cvs@samba.org
continue
"gnitpick" <gnitpick@speedy.uwaterloo.ca>
0 gnitpick@speedy.uwaterloo.ca
continue
Carl Worth <cworth@redhat.com>
0 cworth@redhat.com
continue
<producttestpanel@speedy.uwaterloo.ca> Precedence: normal
0 producttestpanel@speedy.uwaterloo.ca> Precedence: norma
1
"Subscriber" <producttestpanel@speedy.uwaterloo.ca>
0 producttestpanel@speedy.uwaterloo.ca
continue
adf   <producttestpanel@speedy.uwaterloo.ca>
0 producttestpanel@speedy.uwaterloo.ca
continue

0 ktwarwic@speedy.uwaterloo.ca
continue
R-help <r-help@stat.math.ethz.ch>
0 r-help@stat.math.ethz.ch
continue
bugs-bitbucket@netlabs.develooper.com
continue
"Subscriber" <producttestpanel@speedy.uwaterloo.ca>
0 producttestpanel@speedy.uwaterloo.ca
continue
<fantasy@speedy.uwaterloo.ca>
0 fantasy@speedy.uwaterloo.ca
continue
"gnitpick" <gnitpick@speedy.uwaterloo.ca>
0 gnitpick@speedy.uwaterloo.ca
continue
"catchall" <catchall@speedy.uwaterloo.ca>
0 catchall@speedy.uwaterloo.ca
continue
producttestpanel@speedy.uwaterloo.ca
continue
<the00@plg.uwaterloo.ca>
0 the00@plg.uwaterloo.ca
continue
"the00" <the00@plg.uwaterloo.ca>
0 the00@plg.uwaterloo.ca
continue
samba-cvs@samba.org
continue
"gnitpick" <gnitpick@speedy.uwaterloo.ca>
0 gnitpick@speedy.uwaterloo.ca
continue
<furmanfw@speedy.uwaterloo.ca>
0 furmanfw@speedy.uwaterloo.ca
continue
ktwarwic@speedy.uwaterloo.ca
continue
"R-help" <r-help@stat.math.ethz.ch>
0 r-help@stat.math.ethz.ch
continue
the00@plg.uwaterloo.ca
continue
the00@plg2.mat

samba-cvs@samba.org
continue
gnitpick@speedy.uwaterloo.ca
continue
"ktwarwic" <ktwarwic@speedy.uwaterloo.ca>
0 ktwarwic@speedy.uwaterloo.ca
continue
gjditchf@plg.uwaterloo.ca
continue
the00@plg.uwaterloo.ca
continue
gnitpick@speedy.uwaterloo.ca
continue
gnitpick@speedy.uwaterloo.ca
continue
<smiles@speedy.uwaterloo.ca>
0 smiles@speedy.uwaterloo.ca
continue
AbouEl-Makarim Aboueissa <aaboueissa@usm.maine.edu> References: <4624C9F2.A437.00A6.0@usm.maine.edu>
0 aaboueissa@usm.maine.edu> References: <4624C9F2.A437.00A6.0@usm.maine.edu
continue
Ip-health <ip-health@lists.essential.org>
0 ip-health@lists.essential.org
continue
ktwarwic@speedy.uwaterloo.ca
continue
catchall@speedy.uwaterloo.ca
continue
r-help@stat.math.ethz.ch
continue
r-help@stat.math.ethz.ch
continue
<gnitpick@speedy.uwaterloo.ca>
0 gnitpick@speedy.uwaterloo.ca
continue
"Patrick Burns" <pburns@pburns.seanet.com>
0 pburns@pburns.seanet.com
continue
1@bellsouth.net
continue
"Sanora" <dmason@plg2.math.uwaterloo.ca>
0 dmason@plg

0 smilesnn@speedy.uwaterloo.ca
continue
<smilenn@speedy.uwaterloo.ca>
0 smilenn@speedy.uwaterloo.ca
continue
ip-health@lists.essential.org
continue
ipsociety@yahoogroups.com
continue
<ktwarwic@speedy.uwaterloo.ca>
0 ktwarwic@speedy.uwaterloo.ca
continue
<the00@plg2.math.uwaterloo.ca>
0 the00@plg2.math.uwaterloo.ca
continue
Acm <the00@plg.uwaterloo.ca>
0 the00@plg.uwaterloo.ca
continue
catchall@speedy.uwaterloo.ca
continue
<theorize@plg.uwaterloo.ca>
0 theorize@plg.uwaterloo.ca
continue
<ktwarwic@speedy.uwaterloo.ca>
0 ktwarwic@speedy.uwaterloo.ca
continue
memolink@speedy.uwaterloo.ca
continue
John Jorgensen <jorgensen.john@gmail.com> References: <3641a2b10704162303h3c99a449o48f8abdc8e950f18@mail.gmail.com>
0 jorgensen.john@gmail.com> References: <3641a2b10704162303h3c99a449o48f8abdc8e950f18@mail.gmail.com
continue
debian-vote@lists.debian.org
continue
debian-legal@lists.debian.org
continue
cruiseca@speedy.uwaterloo.ca
continue
mail@speedy.uwaterloo.ca
continue
gnitpick@speedy.uwaterloo

continue
James Peach <jpeach@samba.org> References: <20070417221248.E5BA3162C2E@lists.samba.org> 	<2D62765F-890A-4287-8330-37764DCA8B0D@samba.org>
0 jpeach@samba.org> References: <20070417221248.E5BA3162C2E@lists.samba.org> 	<2D62765F-890A-4287-8330-37764DCA8B0D@samba.org
continue
"Subscriber" <producttestpanel@speedy.uwaterloo.ca>
0 producttestpanel@speedy.uwaterloo.ca
continue
Ip-health <ip-health@lists.essential.org>
0 ip-health@lists.essential.org
continue
<catchall@speedy.uwaterloo.ca>
0 catchall@speedy.uwaterloo.ca
continue
<elaastic@speedy.uwaterloo.ca>
0 elaastic@speedy.uwaterloo.ca
continue
"Theorize" <theorize@plg.uwaterloo.ca>
0 theorize@plg.uwaterloo.ca
continue
"Marco Pesenti Gritti" <mpg@redhat.com>
0 mpg@redhat.com
continue
parrot-porters@perl.org
continue
"Perl 6 announce list" <perl6-announce@perl.org>
0 perl6-announce@perl.org
continue
perl6-language@perl.org
continue
perl5-porters@perl.org
continue
parrot-porters@perl.org
continue
"Perl 6 announce list" <perl6-announ

0 avcooper@speedy.uwaterloo.ca
continue
deficient@speedy.uwaterloo.ca
continue
<ktwarwic@speedy.uwaterloo.ca>
0 ktwarwic@speedy.uwaterloo.ca
continue
<elaastic@speedy.uwaterloo.ca>
0 elaastic@speedy.uwaterloo.ca
continue
<the00@plg.uwaterloo.ca>
0 the00@plg.uwaterloo.ca
continue
<the00@plg2.math.uwaterloo.ca>
0 the00@plg2.math.uwaterloo.ca
continue
debian-laptop@lists.debian.org
continue
simpsons@speedy.uwaterloo.ca
continue
antelopehn@speedy.uwaterloo.ca
continue
bugs-bitbucket@netlabs.develooper.com
continue
gnitpick@speedy.uwaterloo.ca
continue
ed@speedy.uwaterloo.ca
continue
"Andrew Coopers" <avcooper@speedy.uwaterloo.ca>
0 avcooper@speedy.uwaterloo.ca
continue
debian-laptop@lists.debian.org
continue
djakwwnKVYWTQBgnitpick@speedy.uwaterloo.ca
continue
samba-cvs@samba.org
continue
<the00@plg.uwaterloo.ca>
0 the00@plg.uwaterloo.ca
continue
<the00@plg.uwaterloo.ca>
0 the00@plg.uwaterloo.ca
continue
Easytrade.Service@speedy.uwaterloo.ca
continue
<Speakup@braille.uwo.ca>
0 Speakup@brail

gnitpick@speedy.uwaterloo.ca
continue
<jjosh@speedy.uwaterloo.ca>
0 jjosh@speedy.uwaterloo.ca
continue
"Csourelis" <csourelis@plg.uwaterloo.ca>
0 csourelis@plg.uwaterloo.ca
continue
<catchall@speedy.uwaterloo.ca>
0 catchall@speedy.uwaterloo.ca
continue
"Warwick" <warwick@speedy.uwaterloo.ca>
0 warwick@speedy.uwaterloo.ca
continue
"Warwick" <warwick@speedy.uwaterloo.ca>
0 warwick@speedy.uwaterloo.ca
continue
"theorize" <theorize@plg.uwaterloo.ca>
0 theorize@plg.uwaterloo.ca
continue
"manager" <manager@speedy.uwaterloo.ca>
0 manager@speedy.uwaterloo.ca
continue
<the00@plg.uwaterloo.ca>
0 the00@plg.uwaterloo.ca
continue
<the00@plg2.math.uwaterloo.ca>
0 the00@plg2.math.uwaterloo.ca
continue
gnitpick@speedy.uwaterloo.ca
continue
"manager" <manager@speedy.uwaterloo.ca>
0 manager@speedy.uwaterloo.ca
continue
"Warwick" <warwick@speedy.uwaterloo.ca>
0 warwick@speedy.uwaterloo.ca
continue
"the00" <the00@plg.uwaterloo.ca>
0 the00@plg.uwaterloo.ca
continue
"the00" <the00@plg2.math.uwaterloo.ca>
0 

0 the00@plg.uwaterloo.ca
continue
oats@speedy.uwaterloo.ca
continue
obniafp@speedy.uwaterloo.ca
continue
"gnitpick" <gnitpick@speedy.uwaterloo.ca>
0 gnitpick@speedy.uwaterloo.ca
continue
Smiles <smiles@speedy.uwaterloo.ca>
0 smiles@speedy.uwaterloo.ca
continue
Smile <smile@speedy.uwaterloo.ca>
0 smile@speedy.uwaterloo.ca
continue
gnitpick@speedy.uwaterloo.ca
continue
"the00" <the00@plg.uwaterloo.ca>
0 the00@plg.uwaterloo.ca
continue
tjiyp@speedy.uwaterloo.ca
continue
ovq@speedy.uwaterloo.ca
continue
ktwarwic@speedy.uwaterloo.ca
continue
abwtierzyypgnitpick@speedy.uwaterloo.ca
continue
theplg@speedy.uwaterloo.ca
continue
the00@plg.uwaterloo.ca
continue
<smiles@speedy.uwaterloo.ca>
0 smiles@speedy.uwaterloo.ca
continue
"gnitpick" <gnitpick@speedy.uwaterloo.ca>
0 gnitpick@speedy.uwaterloo.ca
continue
fejbrsyairqegnitpick@speedy.uwaterloo.ca
continue
nan
afe0c5d8@speedy.uwaterloo.ca
continue
"Edra Armstrong" <theorize@plg.uwaterloo.ca>
0 theorize@plg.uwaterloo.ca
continue
<manager@speedy.u

the00@plg2.math.uwaterloo.ca
continue
syvztfpuxebgnitpick@speedy.uwaterloo.ca
continue
Jon Phillips <jon@creativecommons.org> References: <1176837101.25892.22.camel@localhost>
0 jon@creativecommons.org> References: <1176837101.25892.22.camel@localhost
continue
tabvrttypmcbgnitpick@speedy.uwaterloo.ca
continue
"catchall" <catchall@speedy.uwaterloo.ca>
0 catchall@speedy.uwaterloo.ca
continue
"Andrew Coopers" <avcoopers@speedy.uwaterloo.ca>
0 avcoopers@speedy.uwaterloo.ca
continue
sugar@laptop.org
continue
producttestpanel@speedy.uwaterloo.ca
continue
samba-cvs@samba.org
continue
sugar@laptop.org
continue
"the00" <the00@plg.uwaterloo.ca>
0 the00@plg.uwaterloo.ca
continue
"the00" <the00@plg2.math.uwaterloo.ca>
0 the00@plg2.math.uwaterloo.ca
continue
ktwarwic@speedy.uwaterloo.ca
continue
<the00@plg.uwaterloo.ca>
0 the00@plg.uwaterloo.ca
continue
Multiple recipients of list SAMBA-TECHNICAL <samba-technical@samba.org>
0 samba-technical@samba.org
continue
<mailnn@speedy.uwaterloo.ca>
0 mailnn@

mail@speedy.uwaterloo.ca
continue
<producttestpanel@speedy.uwaterloo.ca> Precedence: normal
0 producttestpanel@speedy.uwaterloo.ca> Precedence: norma
1
the00@plg.uwaterloo.ca
continue
the00@plg2.math.uwaterloo.ca
continue
shoekbhorror@speedy.uwaterloo.ca
continue
opt17@speedy.uwaterloo.ca
continue
ktwarwic-plg@speedy.uwaterloo.ca
continue
opt2@speedy.uwaterloo.ca
continue
the00@speedy.uwaterloo.ca
continue
refinance@speedy.uwaterloo.ca
continue
opt4@speedy.uwaterloo.ca
continue
<producttestpanel@speedy.uwaterloo.ca> Precedence: normal
0 producttestpanel@speedy.uwaterloo.ca> Precedence: norma
1
<the00@plg.uwaterloo.ca>
0 the00@plg.uwaterloo.ca
continue
John Fox <jfox@mcmaster.ca>
0 jfox@mcmaster.ca
continue
r-help@stat.math.ethz.ch
continue
producttestpanel@speedy.uwaterloo.ca
continue
"Schmitt
1
<producttestpanel@speedy.uwaterloo.ca> Precedence: normal
0 producttestpanel@speedy.uwaterloo.ca> Precedence: norma
1
"Allison Randal" <allison@perl.org>
0 allison@perl.org
continue
catchall@sp

continue
"Catrice Lopez" <manager@speedy.uwaterloo.ca>
0 manager@speedy.uwaterloo.ca
continue
"Schmitt
1
lafjk@speedy.uwaterloo.ca
continue
laksdfj@speedy.uwaterloo.ca
continue
"theorize" <theorize@plg.uwaterloo.ca>
0 theorize@plg.uwaterloo.ca
continue
<theorize@plg.uwaterloo.ca>
0 theorize@plg.uwaterloo.ca
continue
r-help@stat.math.ethz.ch
continue
Alberto Monteiro <albmont@centroin.com.br>
0 albmont@centroin.com.br
continue
<theorize@plg.uwaterloo.ca>
0 theorize@plg.uwaterloo.ca
continue
<the00@plg.uwaterloo.ca>
0 the00@plg.uwaterloo.ca
continue
debian-mirrors@lists.debian.org
continue
"smiles" <smiles@speedy.uwaterloo.ca>
0 smiles@speedy.uwaterloo.ca
continue
"Speakup is a screen review system for Linux." <speakup@braille.uwo.ca>
0 speakup@braille.uwo.ca
continue
m@speedy.uwaterloo.ca
continue
producttestpanel@speedy.uwaterloo.ca
continue
cruiseca@speedy.uwaterloo.ca
continue
mack@speedy.uwaterloo.ca
continue
mackd@speedy.uwaterloo.ca
continue
mackdd@speedy.uwaterloo.ca
continue
adf

continue
elaastic@speedy.uwaterloo.ca
continue
"the00" <the00@plg2.math.uwaterloo.ca>
0 the00@plg2.math.uwaterloo.ca
continue
"the00" <the00@plg.uwaterloo.ca>
0 the00@plg.uwaterloo.ca
continue
"the00" <the00@plg2.math.uwaterloo.ca>
0 the00@plg2.math.uwaterloo.ca
continue
"the00" <the00@plg.uwaterloo.ca>
0 the00@plg.uwaterloo.ca
continue
nan
the00@plg.uwaterloo.ca
continue
parrotbug-followup@parrotcode.org
continue
"ktwarwic" <ktwarwic@speedy.uwaterloo.ca>
0 ktwarwic@speedy.uwaterloo.ca
continue
"Hong Su An" <anhong@msu.edu>
0 anhong@msu.edu
continue
nan
debian-legal@lists.debian.org
continue
cc-community@lists.ibiblio.org References: <20070413150250.qlj606zm39b4kso4@webmail.robmyers.org>	<20070415160753.GM11509@yukidoke.org> 	<FA6061C2-04DF-4877-872B-2008B66E2BE1@pobox.com>
0 20070413150250.qlj606zm39b4kso4@webmail.robmyers.org>	<20070415160753.GM11509@yukidoke.org> 	<FA6061C2-04DF-4877-872B-2008B66E2BE1@pobox.com
continue
ktwarwic@SPEEDY.UWATERLOO.CA
continue
<antelopehn@speedy.uwater

0 theorize@plg.uwaterloo.ca
continue
"gnitpick" <gnitpick@speedy.uwaterloo.ca>
0 gnitpick@speedy.uwaterloo.ca
continue
<ktwarwic@speedy.uwaterloo.ca>
0 ktwarwic@speedy.uwaterloo.ca
continue
"=?iso-8859-1?Q?ktwarwic@speedy=2Euwaterloo=2Eca=20?=" <ktwarwic@speedy.uwaterloo.ca>
0 ktwarwic@speedy.uwaterloo.ca
continue
ktwarwic@speedy.uwaterloo.ca
continue
producttestpanel@speedy.uwaterloo.ca
continue
samba-cvs@samba.org
continue
producttestpanel@speedy.uwaterloo.ca
continue
cruiseca@speedy.uwaterloo.ca
continue
Alek Storm <alek.storm@gmail.com>
0 alek.storm@gmail.com
continue
the00@plg.uwaterloo.ca
continue
r-help@stat.math.ethz.ch
continue
<gnitpick@speedy.uwaterloo.ca>
0 gnitpick@speedy.uwaterloo.ca
continue
<ktwarwic@speedy.uwaterloo.ca>
0 ktwarwic@speedy.uwaterloo.ca
continue
<elaastic@speedy.uwaterloo.ca>
0 elaastic@speedy.uwaterloo.ca
continue
debian-legal@lists.debian.org
continue
"ktwarwic" <ktwarwic@speedy.uwaterloo.ca>
0 ktwarwic@speedy.uwaterloo.ca
continue
gnitpick@speedy.uwate

"Allison Randal" <allison@perl.org>
0 allison@perl.org
continue
ktwarwic@speedy.uwaterloo.ca
continue
<gnitpick@speedy.uwaterloo.ca>
0 gnitpick@speedy.uwaterloo.ca
continue
producttestpanel@speedy.uwaterloo.ca
continue
samba-cvs@samba.org
continue
"Lukas Biewald" <lukeb@powerset.com>
0 lukeb@powerset.com
continue
SMILE@speedy.uwaterloo.ca
continue
Jonathan Worthington <jonathan@jnthn.net>
0 jonathan@jnthn.net
continue
smiles@speedy.uwaterloo.ca
continue
"theorize" <theorize@plg.uwaterloo.ca>
0 theorize@plg.uwaterloo.ca
continue
sktwarwic@speedy.uwaterloo.ca
continue
soundtrackdeficient@speedy.uwaterloo.ca
continue
"theorize" <theorize@plg.uwaterloo.ca>
0 theorize@plg.uwaterloo.ca
continue
"Clyde" <ktwarwic@speedy.uwaterloo.ca>
0 ktwarwic@speedy.uwaterloo.ca
continue
ktwarwic@speedy.uwaterloo.ca
continue
ktwarwic@speedy.uwaterloo.ca
continue
ktwarwic@speedy.uwaterloo.ca
continue
"theorize" <theorize@plg.uwaterloo.ca>
0 theorize@plg.uwaterloo.ca
continue
<theorize@plg.uwaterloo.ca>
0 the

continue
"gnitpick" <gnitpick@speedy.uwaterloo.ca>
0 gnitpick@speedy.uwaterloo.ca
continue
"theorize" <theorize@plg.uwaterloo.ca>
0 theorize@plg.uwaterloo.ca
continue
<gnitpick@speedy.uwaterloo.ca>
0 gnitpick@speedy.uwaterloo.ca
continue
<gnitpick@speedy.uwaterloo.ca>
0 gnitpick@speedy.uwaterloo.ca
continue
<ktwarwic@speedy.uwaterloo.ca>
0 ktwarwic@speedy.uwaterloo.ca
continue
<ktwarwic@speedy.uwaterloo.ca>
0 ktwarwic@speedy.uwaterloo.ca
continue
"Smiles" <smiles@speedy.uwaterloo.ca>
0 smiles@speedy.uwaterloo.ca
continue
<gnitpick@speedy.uwaterloo.ca>
0 gnitpick@speedy.uwaterloo.ca
continue
<gnitpick@speedy.uwaterloo.ca>
0 gnitpick@speedy.uwaterloo.ca
continue
<catchall@speedy.uwaterloo.ca>
0 catchall@speedy.uwaterloo.ca
continue
"gnitpick" <gnitpick@speedy.uwaterloo.ca>
0 gnitpick@speedy.uwaterloo.ca
continue
<smiles@speedy.uwaterloo.ca>
0 smiles@speedy.uwaterloo.ca
continue
<theorize@plg.uwaterloo.ca>
0 theorize@plg.uwaterloo.ca
continue
samba-cvs@samba.org
continue
"smiles" <smiles@

0 producttestpanel@speedy.uwaterloo.ca
continue
mail@speedy.uwaterloo.ca
continue
r-help@stat.math.ethz.ch
continue
mail@speedy.uwaterloo.ca
continue
r-help@stat.math.ethz.ch
continue
<producttestpanel@speedy.uwaterloo.ca> Precedence: normal
0 producttestpanel@speedy.uwaterloo.ca> Precedence: norma
1
"Simon Pickett" <S.Pickett@exeter.ac.uk>
0 S.Pickett@exeter.ac.uk
continue
r-help@stat.math.ethz.ch
continue
ktwarwic@speedy.uwaterloo.ca
continue
Perl 6 announce list <perl6-announce@perl.org>
0 perl6-announce@perl.org
continue
manager@speedy.uwaterloo.ca
continue
"9ec06fdb" <9ec06fdb@speedy.uwaterloo.ca>
0 9ec06fdb@speedy.uwaterloo.ca
continue
<ktwarwic@speedy.uwaterloo.ca>
0 ktwarwic@speedy.uwaterloo.ca
continue
r-help@stat.math.ethz.ch
continue
"the00" <the00@plg.uwaterloo.ca>
0 the00@plg.uwaterloo.ca
continue
"the00" <the00@plg2.math.uwaterloo.ca>
0 the00@plg2.math.uwaterloo.ca
continue
"ktwarwic" <ktwarwic@speedy.uwaterloo.ca>
0 ktwarwic@speedy.uwaterloo.ca
continue
<gnitpick@speedy.

0 ktwarwic@speedy.uwaterloo.ca
continue
"Alena Jacobs" <manager@speedy.uwaterloo.ca>
0 manager@speedy.uwaterloo.ca
continue
ktwarwic@speedy.uwaterloo.ca
continue
mail@speedy.uwaterloo.ca
continue
<r-help@stat.math.ethz.ch>
0 r-help@stat.math.ethz.ch
continue
manager@speedy.uwaterloo.ca
continue
<mail@speedy.uwaterloo.ca>
0 mail@speedy.uwaterloo.ca
continue
<manager@speedy.uwaterloo.ca>
0 manager@speedy.uwaterloo.ca
continue
samba-cvs@samba.org
continue
the00@plg.uwaterloo.ca
continue
<the00@plg2.math.uwaterloo.ca>
0 the00@plg2.math.uwaterloo.ca
continue
"Ron Michael" <ron_michael70@yahoo.com>
0 ron_michael70@yahoo.com
continue
r-help@stat.math.ethz.ch
continue
"mtb954@gmail.com" <mtb954@gmail.com>
0 mtb954@gmail.com
continue
<mhln@mhln.com>
0 mhln@mhln.com
continue
"Gerald (Jerry) Carter" <jerry@samba.org>
0 jerry@samba.org
continue
sugar@laptop.org
continue
<theorize@plg.uwaterloo.ca>
0 theorize@plg.uwaterloo.ca
continue
samba-technical@lists.samba.org
continue
ktwarwic@speedy.uwaterl

ktwarwic@SPEEDY.UWATERLOO.CA
continue
Peter Dalgaard <p.dalgaard@biostat.ku.dk>
0 p.dalgaard@biostat.ku.dk
continue
"'Bob Green'" <bgreen@dyson.brisnet.org.au>
0 bgreen@dyson.brisnet.org.au
continue
<r-help@stat.math.ethz.ch>
0 r-help@stat.math.ethz.ch
continue
the00@plg.uwaterloo.ca
continue
<mhln@mhln.com>
0 mhln@mhln.com
continue
theorize@plg.uwaterloo.ca
continue
<elaastic@speedy.uwaterloo.ca>
0 elaastic@speedy.uwaterloo.ca
continue
<the00@plg2.math.uwaterloo.ca>
0 the00@plg2.math.uwaterloo.ca
continue
producttestpanel@speedy.uwaterloo.ca
continue
<gnitpick@speedy.uwaterloo.ca>
0 gnitpick@speedy.uwaterloo.ca
continue
<the00@plg.uwaterloo.ca>
0 the00@plg.uwaterloo.ca
continue
<ktwarwic@speedy.uwaterloo.ca>
0 ktwarwic@speedy.uwaterloo.ca
continue
Prof Brian Ripley <ripley@stats.ox.ac.uk>
0 ripley@stats.ox.ac.uk
continue
<catchall@speedy.uwaterloo.ca>
0 catchall@speedy.uwaterloo.ca
continue
debian-mirrors@lists.debian.org
continue
<elaastic@speedy.uwaterloo.ca>
0 elaastic@speedy.uwate

gnitpick@speedy.uwaterloo.ca
continue
smiles@speedy.uwaterloo.ca
continue
"mail" <mail@speedy.uwaterloo.ca>
0 mail@speedy.uwaterloo.ca
continue
r-help@stat.math.ethz.ch
continue
ktwarwic@speedy.uwaterloo.ca
continue
gnitpick@speedy.uwaterloo.ca
continue
"Theorize" <theorize@plg.uwaterloo.ca>
0 theorize@plg.uwaterloo.ca
continue
SMILE@speedy.uwaterloo.ca
continue
R-Help <r-help@stat.math.ethz.ch>
0 r-help@stat.math.ethz.ch
continue
SMILES@speedy.uwaterloo.ca
continue
"manager" <manager@speedy.uwaterloo.ca>
0 manager@speedy.uwaterloo.ca
continue
theorize@plg.uwaterloo.ca
continue
myl@cis.upenn.edu
continue
<the00@plg2.math.uwaterloo.ca>
0 the00@plg2.math.uwaterloo.ca
continue
myl@cis.upenn.edu
continue
r-help@stat.math.ethz.ch
continue
mail@speedy.uwaterloo.ca
continue
perl6-internals@perl.org
continue
gnitpick@speedy.uwaterloo.ca
continue
"Speakup is a screen review system for Linux." <speakup@braille.uwo.ca>
0 speakup@braille.uwo.ca
continue
<netsearch@canola1.uwaterloo.ca>
0 netsearch

Unnamed: 0,Feature2,To
0,0,<bagpipes@acadia.net>\n
1,0,members@your.net\n
2,0,<badams@eastky.com>\n
3,1,
4,0,carlovers@america.com\n


In [9]:
#df_final[df_final['To'].notna() & df_final['Feature2'] == 1][['To','Feature2']]

In [10]:
import re
def isValidMessageID(mid):
    '''
    checks for valid domain in message id where 0 is valid and 1 is invalid
    '''
    for email in str(mid).split('\n'):
        if(email.isspace() or len(email) == 0 or email.find('@') < 0):
            continue
            
        email = email.strip()
        print(email)
        
        if(re.match(r"(.+)?<<(.+)@(.+)>>(.+)?", email)):
            email = email[email.find("<<")+1:email.rfind(">>")]
            print(0, email)
        
        if(re.match(r"(.+)?<(.+)@(.+)>(.+)?", email)):
            email = email[email.find("<")+1:email.rfind(">")]
            print(0, email)
            
        if(len(email) > 7):
            if(re.match("^.+@([?)[a-zA-Z0-9-.]+.([a-zA-Z]{2,3}|[0-9]{1,3})(]?))$", email) != None):
                return 0
            else:
                continue
        else:
            continue
        
    return 1

FEATURE3: Since the 'Message-Id' contains information about where the email is coming from, in spam mails, it is typically missing or malformed. Message-Ids are of the form xxx@domain.com. The method below checks the correctness of the Message-Id format. Creating a new column 'Feature3' where 1 indicates Message-ID malformed or missing, 0 indicates otherwise. 

In [11]:
# Feature3 -> is the Message-Id column na or invalid ? 1->invalid, 0->valid
df_final['Feature3'] = 0
df_final['Feature3'] = df_final['Message-Id'].map(isValidMessageID)
df_final.loc[df_final['Message-Id'].isna(),'Feature3'] = 1
df_final[['Feature3','Message-Id']].head()

<19943672.886214@relay.comanche.denmark.eu> Monday, February 2nd, 1998
0 19943672.886214@relay.comanche.denmark.eu
<19943672.886214@relay.comanche.denmark.eu> Tuesday, February 17th, 1998
0 19943672.886214@relay.comanche.denmark.eu
<199802161222.EAA24869@net1.aoci.com>
0 199802161222.EAA24869@net1.aoci.com
<19943672.886214@relay.comanche.denmark.eu> Friday, March 6th, 1998
0 19943672.886214@relay.comanche.denmark.eu
<367a6bc2.35086056@aol.com>
0 367a6bc2.35086056@aol.com
<a4193ab9.35086277@aol.com>
0 a4193ab9.35086277@aol.com
<<34B589AA.83376E4D@hotmail.com>>
0 <34B589AA.83376E4D@hotmail.com
<199803140024.GAA02943@1arbiscad.com>
0 199803140024.GAA02943@1arbiscad.com
<199803251503EAA37015@post.ideasign.com>
0 199803251503EAA37015@post.ideasign.com
<0EQG00DSE4M7MP@PM04SM.PMM.MCI.NET>
0 0EQG00DSE4M7MP@PM04SM.PMM.MCI.NET
<199803270303.WAA02236@ns.owlseye.com>
0 199803270303.WAA02236@ns.owlseye.com
<199803270121.TAA11252@linus.vsource.com>
0 199803270121.TAA11252@linus.vsource.com
<19980327

0 000301c78039$c3b811e0$c83814aa@cbs.ad.cbs.net
<462391F6.7010300@ktl.fi>
0 462391F6.7010300@ktl.fi
<6288910E4C67941.66B7007C8D@t-dialin.net>
0 6288910E4C67941.66B7007C8D@t-dialin.net
<911b01c7803b$ee2dbe18$5acb8f8b@seductive.com>
0 911b01c7803b$ee2dbe18$5acb8f8b@seductive.com
<db2401c7803b$1d90a151$2d560179@rome.com>
0 db2401c7803b$1d90a151$2d560179@rome.com
<000001c7803c$9f28e900$0100007f@localhost>
0 000001c7803c$9f28e900$0100007f@localhost
<235401c77adc$4b20aea7$2b2cbcf3@fishhoo.com>
0 235401c77adc$4b20aea7$2b2cbcf3@fishhoo.com
<01c76a35$320ad490$6c822ecf@interruptedoutcasts>
0 01c76a35$320ad490$6c822ecf@interruptedoutcasts
<46239691-kfdyd-1c@toilskirting.com>
0 46239691-kfdyd-1c@toilskirting.com
<01c7803c$d1a31ef0$6c822ecf@dwcodanm>
0 01c7803c$d1a31ef0$6c822ecf@dwcodanm
<01c7803c$d5c35590$6c822ecf@dwmaderthanerm>
0 01c7803c$d5c35590$6c822ecf@dwmaderthanerm
<Pine.LNX.4.64.0704161632020.6268@gannet.stats.ox.ac.uk>
0 Pine.LNX.4.64.0704161632020.6268@gannet.stats.ox.ac.uk
<20070416103

0 000f01c7804f$73088b60$00177524@mario
<01c78068$e3fecff0$6c822ecf@outlaidsavage>
0 01c78068$e3fecff0$6c822ecf@outlaidsavage
<001501c78079$d59a3600$068d1efc@xpsp215d524446>
0 001501c78079$d59a3600$068d1efc@xpsp215d524446
<001b01c78079$d65b6320$0692a95c@xpsp215d524446>
0 001b01c78079$d65b6320$0692a95c@xpsp215d524446
<001c01c78047$5954ee00$0140e35c@particullhl8bl>
0 001c01c78047$5954ee00$0140e35c@particullhl8bl
<000e01c78047$59a86530$0667f704@particullhl8bl>
0 000e01c78047$59a86530$0667f704@particullhl8bl
<1073913632.251839499.qmail@rangesender.com>
0 1073913632.251839499.qmail@rangesender.com
<000c01c78068$e136af40$7537a043@istechws2>
0 000c01c78068$e136af40$7537a043@istechws2
<001801c78079$cc6381e0$0094081c@pausab>
0 001801c78079$cc6381e0$0094081c@pausab
<001c01c78079$c4bedf70$0140ff7c@Sondre>
0 001c01c78079$c4bedf70$0140ff7c@Sondre
<001601c78069$229e2760$06fb3b2c@imadca9e9a4409>
0 001601c78069$229e2760$06fb3b2c@imadca9e9a4409
<001101c77fe1$fe47b570$0e4e9844@serverint>
0 001101c77fe1$f

<521a259m.3341794@borland.com>
0 521a259m.3341794@borland.com
<20070417015406.1292A162AC4@lists.samba.org>
0 20070417015406.1292A162AC4@lists.samba.org
<1176762965.9756@lookedleftlast.com>
0 1176762965.9756@lookedleftlast.com
<552g841w.6892634@dialupnet.com>
0 552g841w.6892634@dialupnet.com
<B3F5F728A0E8129.3707FCE0CD@bushtec.com>
0 B3F5F728A0E8129.3707FCE0CD@bushtec.com
<000701c77a61$7e99e620$439f2c36@industridata.no>
0 000701c77a61$7e99e620$439f2c36@industridata.no
<000e01c76c14$f0cc8730$001abb94@mrtsoft4a78845>
0 000e01c76c14$f0cc8730$001abb94@mrtsoft4a78845
<785301c780a6$01c780a6$506cacbd@plg.uwaterloo.ca>
0 785301c780a6$01c780a6$506cacbd@plg.uwaterloo.ca
<424781323.72142991067454@thhebat.net>
0 424781323.72142991067454@thhebat.net
<20070417020756.GA19189@samba1>
0 20070417020756.GA19189@samba1
<19917426.75527274@rabble.com>
0 19917426.75527274@rabble.com
<002201c78095$c3459c30$e75069d4@ghax>
0 002201c78095$c3459c30$e75069d4@ghax
<20070417021430.8A6AA162C4D@lists.samba.org>
0 20070

<4291692454.164kogou@dyn-htl-14300.dyn.columbia.edu>
0 4291692454.164kogou@dyn-htl-14300.dyn.columbia.edu
<000601c780cc$1cc33f70$a47e9f99@aixo>
0 000601c780cc$1cc33f70$a47e9f99@aixo
<37288365469122.DF424E141C@MOTPR7>
0 37288365469122.DF424E141C@MOTPR7
<000f01c7809a$222d46e0$06208f6c@jessica>
0 000f01c7809a$222d46e0$06208f6c@jessica
<000f01c7810f$bbfea9b0$067cdabc@F877BBDE677945A>
0 000f01c7810f$bbfea9b0$067cdabc@F877BBDE677945A
<bf3701c780dc$c42136e0$7ed5a12d@eddie-bauerfxato>
0 bf3701c780dc$c42136e0$7ed5a12d@eddie-bauerfxato
<200704092050.l39KoP0I019565@speedy.uwaterloo.ca>
0 200704092050.l39KoP0I019565@speedy.uwaterloo.ca
<01c780cc$dd74ede0$6c822ecf@dwlyfrabucm>
0 01c780cc$dd74ede0$6c822ecf@dwlyfrabucm
<c36001c780cc$dacf246b$a9e7758a@core.lv>
0 c36001c780cc$dacf246b$a9e7758a@core.lv
<811c01c780cc$d7ec8aee$e2a59084@bboy.com>
0 811c01c780cc$d7ec8aee$e2a59084@bboy.com
<001601c780cd$c4af6af0$2cc6df52@qupu>
0 001601c780cd$c4af6af0$2cc6df52@qupu
<VGv0hsS6WNiCTm7vtqGaEw@allfreebiestoyouonli

0 6ade6f6c0704170726o6ff90644x51c671ce33031767@mail.gmail.com
<01c780fd$37cbc1d0$6c822ecf@jromano>
0 01c780fd$37cbc1d0$6c822ecf@jromano
<019671291.83381123320469@thhebat.net>
0 019671291.83381123320469@thhebat.net
<404336785.86165250800331@thhebat.net>
0 404336785.86165250800331@thhebat.net
<629273762.12922261492889@thhebat.net>
0 629273762.12922261492889@thhebat.net
<20070417103529.11C0C168C1F01@cachecontrol.net>
0 20070417103529.11C0C168C1F01@cachecontrol.net
<547390.99238.qm@web58008.mail.re3.yahoo.com>
0 547390.99238.qm@web58008.mail.re3.yahoo.com
<26e801c78135$5d107d40$357f30e1@paulmakoeieem>
0 26e801c78135$5d107d40$357f30e1@paulmakoeieem
<418401c780fe$5f36605e$b697e2e2@hagmann.ch>
0 418401c780fe$5f36605e$b697e2e2@hagmann.ch
<20070417053643.13159.qmail@cpe-65-24-160-254.columbus.res.rr.com>
0 20070417053643.13159.qmail@cpe-65-24-160-254.columbus.res.rr.com
<001a01c7810e$1db95940$00b7ff2c@ty66>
0 001a01c7810e$1db95940$00b7ff2c@ty66
<F360AADB419AF5429DCFA34E9732C94ED4C9F4@nydcdx11.c

0 200704171911.l3HJBJ0I031982@speedy.uwaterloo.ca
<1-910342-1ZzmDhFZLJhoFF904mNLbFzYrr4ZL@mx15.rewardgalaxy.com>
0 1-910342-1ZzmDhFZLJhoFF904mNLbFzYrr4ZL@mx15.rewardgalaxy.com
<46251931.6040708@biostat.ku.dk>
0 46251931.6040708@biostat.ku.dk
<EDD5890AD6D21C4FAD91028DD0F010736F07D8@nydcdx11.cbs.ad.cbs.net>
0 EDD5890AD6D21C4FAD91028DD0F010736F07D8@nydcdx11.cbs.ad.cbs.net
<20070417-21607.31395.qmail@adsl-19-16-224.asm.bellsouth.net>
0 20070417-21607.31395.qmail@adsl-19-16-224.asm.bellsouth.net
<20070417194502.1913.9463@rif.myfreedombox.com>
0 20070417194502.1913.9463@rif.myfreedombox.com
<01c77af3$3aca0c30$6c822ecf@boylove>
0 01c77af3$3aca0c30$6c822ecf@boylove
<200704171209.45590.chromatic@wgz.org>
0 200704171209.45590.chromatic@wgz.org
<01c78124$a2c28010$6c822ecf@ttmh>
0 01c78124$a2c28010$6c822ecf@ttmh
<200704171358.05192.tanner@real-time.com>
0 200704171358.05192.tanner@real-time.com
<200704171358.03168.tanner@real-time.com>
0 200704171358.03168.tanner@real-time.com
<20070417191847.GA51

0 971536df0704171751u557e9a25j7cf170f01dfe94d4@mail.gmail.com
<01c78153$af4b6110$6c822ecf@james>
0 01c78153$af4b6110$6c822ecf@james
<001101b4728a$9c416230$00d47f6c@AEAD321059704F2>
0 001101b4728a$9c416230$00d47f6c@AEAD321059704F2
<01c78155$6a81e750$6c822ecf@silencersadvantages>
0 01c78155$6a81e750$6c822ecf@silencersadvantages
<01c78156$0a36c2c0$6c822ecf@mundanelyplacards>
0 01c78156$0a36c2c0$6c822ecf@mundanelyplacards
<200704092256.l39Mu2QF018522@ema8adm1.turner.com>
0 200704092256.l39Mu2QF018522@ema8adm1.turner.com
<473319.1176858558031.JavaMail.Administrator@ap6>
0 473319.1176858558031.JavaMail.Administrator@ap6
<001101c78156$d10802cf$b54b514d@cinemexicano.com>
0 001101c78156$d10802cf$b54b514d@cinemexicano.com
<21E6A3FD-33EB-49B0-A5C0-8E913F6A11C3@hanover.edu>
0 21E6A3FD-33EB-49B0-A5C0-8E913F6A11C3@hanover.edu
<CFE0.AA79.9Agisvix-003062hknlqy@mac.com>
0 CFE0.AA79.9Agisvix-003062hknlqy@mac.com
<6652E31BF992F48.78D4B19F59@buchantiquariat.com>
0 6652E31BF992F48.78D4B19F59@buchantiquaria

0 01c78191$82413120$6c822ecf@dwmaddogsm
<01c78219$285bb300$6c822ecf@fmmssupport>
0 01c78219$285bb300$6c822ecf@fmmssupport
<200704180819.l3I8JN0J007385@speedy.uwaterloo.ca>
0 200704180819.l3I8JN0J007385@speedy.uwaterloo.ca
<000b01c78192$0c4204d0$0301a8c0@25x038.nts.nnov.ru>
0 000b01c78192$0c4204d0$0301a8c0@25x038.nts.nnov.ru
<001a01c781de$2d64c4d0$000c9874@nahyun>
0 001a01c781de$2d64c4d0$000c9874@nahyun
<000d01c78192$be182450$67fedc47@Terrin>
0 000d01c78192$be182450$67fedc47@Terrin
<20070406-13937.4800.qmail@cpe-24-33-252-22.indy.res.rr.com>
0 20070406-13937.4800.qmail@cpe-24-33-252-22.indy.res.rr.com
<1176884729.2742.3.camel@localhost.localdomain>
0 1176884729.2742.3.camel@localhost.localdomain
<4625e3c8-q1sde.2lp@qualityturtle.com>
0 4625e3c8-q1sde.2lp@qualityturtle.com
<200704180824.l3I8Op7d019403@plg2.math.uwaterloo.ca>
0 200704180824.l3I8Op7d019403@plg2.math.uwaterloo.ca
<4625B5EA.9080106@comcast.net>
0 4625B5EA.9080106@comcast.net
<001501c781d6$aeaee370$001abe54@mychatbbcd70b5>
0 

0 20070418130454.GA10035@kirk.peters.homeunix.org
<01c781c3$3c158e80$6c822ecf@dwcnentrpm>
0 01c781c3$3c158e80$6c822ecf@dwcnentrpm
<001601c781ba$6127ffe0$e437933b@vyvii>
0 001601c781ba$6127ffe0$e437933b@vyvii
<846028718.35874299141424@thhebat.net>
0 846028718.35874299141424@thhebat.net
<000001c781ba$e0df9e00$0100007f@localhost>
0 000001c781ba$e0df9e00$0100007f@localhost
<200704180916.l3I9GoiD003616@matchedace.net>
0 200704180916.l3I9GoiD003616@matchedace.net
<01c77b06$3c0e2a50$6c822ecf@qpecxejxy>
0 01c77b06$3c0e2a50$6c822ecf@qpecxejxy
<2176FBE23B8DBF8.238304D7D8@auna.net>
0 2176FBE23B8DBF8.238304D7D8@auna.net
<200704181230.l3ICU53i019916@live1.bc.cbsig.net>
0 200704181230.l3ICU53i019916@live1.bc.cbsig.net
<000b01c781bc$6983ab60$1821ec88@duhv>
0 000b01c781bc$6983ab60$1821ec88@duhv
<20070418142035.mz5qw.288450528@ebs.bbc.co.uk>
0 20070418142035.mz5qw.288450528@ebs.bbc.co.uk
<BD61989A-CEE7-4D31-A23B-2F1694C18DA6@hanover.edu>
0 BD61989A-CEE7-4D31-A23B-2F1694C18DA6@hanover.edu
<40e66e0b07041

0 20070418134003.186535d2.celejar@gmail.com
<000001c781e0$8d804680$0100007f@localhost>
0 000001c781e0$8d804680$0100007f@localhost
<001001c781b7$4d79e1a0$0271fd24@erbyofcoolness>
0 001001c781b7$4d79e1a0$0271fd24@erbyofcoolness
<fad888a10704181039n5d9c2047wec057ccbc3cd8f0d@mail.gmail.com>
0 fad888a10704181039n5d9c2047wec057ccbc3cd8f0d@mail.gmail.com
<f40401c781fa$c99ada80$7d3681f0@mailmij.nl>
0 f40401c781fa$c99ada80$7d3681f0@mailmij.nl
<87FF62B7CF67642.80EE1407D3@ne.jp>
0 87FF62B7CF67642.80EE1407D3@ne.jp
<01c77b0a$171833e0$6c822ecf@purveyhastiness>
0 01c77b0a$171833e0$6c822ecf@purveyhastiness
<001801c781af$64e6dda0$05cbaf04@NOMBRE36E71PX4>
0 001801c781af$64e6dda0$05cbaf04@NOMBRE36E71PX4
<001801c781f2$985478d0$002a7f9c@hjqcw85rys8xbx>
0 001801c781f2$985478d0$002a7f9c@hjqcw85rys8xbx
<MAILSENDERNG3GKeD021272c0e6@129.97.78.23>
0 MAILSENDERNG3GKeD021272c0e6@129.97.78.23
<001a01c781c9$3303dfd0$05e305bc@pccleiton>
0 001a01c781c9$3303dfd0$05e305bc@pccleiton
<68103.63370.qm@web39710.mail.mud.yaho

0 46269D04.9000206@perl.org
<000901c7820a$128aab30$6a7dddd9@MazzaliA>
0 000901c7820a$128aab30$6a7dddd9@MazzaliA
<1176935917.13823.6.camel@localhost>
0 1176935917.13823.6.camel@localhost
<200704182243.l3IMhT0I016939@speedy.uwaterloo.ca>
0 200704182243.l3IMhT0I016939@speedy.uwaterloo.ca
<000701c78250$05c85c80$439f2c35@uq.edu.au>
0 000701c78250$05c85c80$439f2c35@uq.edu.au
<20070419004641.67e2f284.frx@firenze.linux.it>
0 20070419004641.67e2f284.frx@firenze.linux.it
<001901c7821c$7e2ad560$00647c9c@simon4f3f0194e>
0 001901c7821c$7e2ad560$00647c9c@simon4f3f0194e
<242531102.79737165527399@thhebat.net>
0 242531102.79737165527399@thhebat.net
<c540fe260704181549j3afad489j12564cdb0503f057@mail.gmail.com>
0 c540fe260704181549j3afad489j12564cdb0503f057@mail.gmail.com
<000901c7820b$e2403ba0$c14baa47@r4t1o7>
0 000901c7820b$e2403ba0$c14baa47@r4t1o7
<699485505.85949973184651@thhebat.net>
0 699485505.85949973184651@thhebat.net
<4626A0BC.2080006@fhcrc.org>
0 4626A0BC.2080006@fhcrc.org
<001101c7820c$33f98f

<01c7824c$97444e70$6c822ecf@denigrationfinking>
0 01c7824c$97444e70$6c822ecf@denigrationfinking
<01c7824c$99a8f260$6c822ecf@a_rief>
0 01c7824c$99a8f260$6c822ecf@a_rief
<200704190634.l3J6Yt0I020992@speedy.uwaterloo.ca>
0 200704190634.l3J6Yt0I020992@speedy.uwaterloo.ca
<636201c7824c$0a558e86$0d110d07@dwp.net>
0 636201c7824c$0a558e86$0d110d07@dwp.net
<200704190636.l3J6a00I021009@speedy.uwaterloo.ca>
0 200704190636.l3J6a00I021009@speedy.uwaterloo.ca
<728864747.04426490762430@thhebat.net>
0 728864747.04426490762430@thhebat.net
<F4C04CB4B565DE4.B84435CAFA@hinet.net>
0 F4C04CB4B565DE4.B84435CAFA@hinet.net
<001c01c7826f$bfc5f1f0$06f9ddcc@server>
0 001c01c7826f$bfc5f1f0$06f9ddcc@server
<001301c77b18$b79782e0$81ba92d4@vkbfw>
0 001301c77b18$b79782e0$81ba92d4@vkbfw
<733406237.19592882819220@thhebat.net>
0 733406237.19592882819220@thhebat.net
<C826A9E348564EF.CFCE081086@net.my>
0 C826A9E348564EF.CFCE081086@net.my
<01c78256$dc8ee8a0$6c822ecf@dissidentscrowed>
0 01c78256$dc8ee8a0$6c822ecf@dissidentsc

0 47fce0650704190559qa9d85adqfffc4467d554dad7@mail.gmail.com
<801e01c78227$fa8823f0$d421efaf@aagwjfpu>
0 801e01c78227$fa8823f0$d421efaf@aagwjfpu
<744834621.23441903245645@thhebat.net>
0 744834621.23441903245645@thhebat.net
<353381250.20276325063851@thhebat.net>
0 353381250.20276325063851@thhebat.net
<01c78288$24303250$6c822ecf@southerlieslikable>
0 01c78288$24303250$6c822ecf@southerlieslikable
<10077840.post@talk.nabble.com>
0 10077840.post@talk.nabble.com
<001701c7826f$4ab93160$00cd843c@terminal10>
0 001701c7826f$4ab93160$00cd843c@terminal10
<001501c7826f$4af4dad0$017e1cf4@terminal10>
0 001501c7826f$4af4dad0$017e1cf4@terminal10
<1115a2b00704092032y323ed00i2bc67c9192545572@mail.gmail.com>
0 1115a2b00704092032y323ed00i2bc67c9192545572@mail.gmail.com
<20070419134144.9CE3844073@ws5-1.us4.outblaze.com>
0 20070419134144.9CE3844073@ws5-1.us4.outblaze.com
<1176990286.22497.1.camel@zoidberg>
0 1176990286.22497.1.camel@zoidberg
<20070419131341.GA24353@gsf.de>
0 20070419131341.GA24353@gsf.de
<20

0 1-116393-4IP6HUuqqSCqI4BSMDCISSHiKUV4qSPM66Ku4@mx199.dollardbox.com
<01c782b3$3dcf4860$6c822ecf@humility'sYangtze's>
0 01c782b3$3dcf4860$6c822ecf@humility'sYangtze's
<8A680BB45D12D08.3588F31433@setel.com>
0 8A680BB45D12D08.3588F31433@setel.com
<000001c782b3$5169e600$0100007f@STLKFLEONARD2>
0 000001c782b3$5169e600$0100007f@STLKFLEONARD2
<000e01c6795c$79c825f0$06729ef4@YX009>
0 000e01c6795c$79c825f0$06729ef4@YX009
<Pine.LNX.4.62.0704191445270.25075@fractal.phys.lafayette.edu>
0 Pine.LNX.4.62.0704191445270.25075@fractal.phys.lafayette.edu
<475886403.24269146209871@thhebat.net>
0 475886403.24269146209871@thhebat.net
<1CD38A63614DDA49A88751D67F49528E0158D91B@nydcdx11.cbs.ad.cbs.net>
0 1CD38A63614DDA49A88751D67F49528E0158D91B@nydcdx11.cbs.ad.cbs.net
<000001c782b3$bae9f480$0100007f@collins>
0 000001c782b3$bae9f480$0100007f@collins
<01c77b40$ff621590$6c822ecf@advertisement>
0 01c77b40$ff621590$6c822ecf@advertisement
<01c782b3$b983fbe0$6c822ecf@excruciatinglygas>
0 01c782b3$b983fbe0$6c822ecf@

<971536df0704191653q6b78e3b6q60fac7c1dabdce52@mail.gmail.com>
0 971536df0704191653q6b78e3b6q60fac7c1dabdce52@mail.gmail.com
<001201c782bc$ec2e0550$0773cb9c@D1W6BY61>
0 001201c782bc$ec2e0550$0773cb9c@D1W6BY61
<rt-3.6.HEAD-30201-1177008475-907.42620-72-0@perl.org>
0 rt-3.6.HEAD-30201-1177008475-907.42620-72-0@perl.org
<20070420000020.70740162AE3@lists.samba.org>
0 20070420000020.70740162AE3@lists.samba.org
<59d7961d0704191659r199b318fs5f715af1dcb23094@mail.gmail.com>
0 59d7961d0704191659r199b318fs5f715af1dcb23094@mail.gmail.com
<87ED78B1E5AAAF5.F3861A7EB3@shawcable.net>
0 87ED78B1E5AAAF5.F3861A7EB3@shawcable.net
<20070619011114.14027.qmail@pool-71-164-205-206.dllstx.fios.verizon.net>
0 20070619011114.14027.qmail@pool-71-164-205-206.dllstx.fios.verizon.net
<1176179927.270691-3064-slash-slashdot-nfs-1.osdn.net@slashdot.org>
0 1176179927.270691-3064-slash-slashdot-nfs-1.osdn.net@slashdot.org
<CADE903D0064DD1F4D7098BB@[172.23.155.54]>
0 CADE903D0064DD1F4D7098BB@[172.23.155.54]
<000001c782e0$

Unnamed: 0,Feature3,Message-Id
0,0,<19943672.886214@relay.comanche.denmark.eu> M...
1,1,<>\n
2,0,<19943672.886214@relay.comanche.denmark.eu> T...
3,0,<199802161222.EAA24869@net1.aoci.com>\n
4,1,<>\n


Displaying Message-Id values that are malformed.

In [12]:
df_final.loc[df_final['Feature3'] == 1]['Message-Id']

1                                                    <>\n
4                                                    <>\n
5                                Mach10 1.1 fxpromo.com\n
10                              <199803250408.UAA03361>\n
13                                                    NaN
16                                  <31867701_67397293>\n
24                                <tryitbeforeyoubuyit>\n
29                                199803272113.OOB5051@\n
31                                                    NaN
34                                  <36424144_99983662>\n
134      <01c78010$80cbf230$6c822ecf@sunbeamHastings's>\n
136        <01c78011$3f3bc6f0$6c822ecf@month'ssuffrage>\n
161      <01c78014$b65fc760$6c822ecf@faction'sCybele's>\n
166      <000f01c78060$9da7e3f0$00796a3c@your1cfa2f9d2...
167      <001601c78060$9d677f90$00795f8c@your1cfa2f9d2...
170                <001701c7802e$ef888120$0f72a424@oem>\n
171                <001101c7802e$ef888120$0f712db4@oem>\n
179      <01c7

FEATURE4: The 'X-UIDL' header is intended to stop the recepient's mail server from downloading multiple copies of the mail once the mail is received. Normally, X-UIDL is stripped once the mail is received. Spammers' intentionally add the X-UIDL, so that mail servers download multiple copies of the mail, increasing chances of it being read. Creating a new column 'Feature4', where 1 indicates X-UIDL not empty, 0 indicates otherwise.

In [13]:
# Feature4, is the X-UIDL header not empty or na ? 1->not empty, 0->empty 
df_final['Feature4'] = 0
df_final.loc[df_final['X-UIDL'].notna(), 'Feature4'] = 1
df_final.loc[df_final['X-UIDL'].isna(), 'Feature4'] = 0
df_final[['Feature4', 'X-UIDL']].head()

Unnamed: 0,Feature4,X-UIDL
0,0,
1,0,
2,0,
3,0,
4,0,


Displaying rows where X-UIDL is not empty.

In [14]:
df_final.loc[df_final['Feature4']==1]['X-UIDL']

9      c89dd4e061ba173523703cf25c3133a2\n
11     763cf6e5123c1287a83f12d7e99c60c9\n
16                  10293287_192832.222\n
22     f2c3e4bf7654f32bfd17a6c54dc32f1d\n
24     11111111111111111111111111111111\n
26     2610431056a78aeb1b128fda426c9a5e\n
27                      123456789012376\n
28     33587715159195856343749765328732\n
29                        870483442.265\n
34                  20720340_201230.501\n
37     2610431056a78aeb1b128fda426c9a5e\n
Name: X-UIDL, dtype: object

FEATURE5: Now, we process the body to extract features that distinguish spam from ham. Spammers will typically use certain words (eg. free, limited offer, click here) to catch the attention of their recipients. Overuse of capitals and punctuation marks are also a marked characteristic of spam. Also, spammers will intentionally mis-spell words (eg. w4rning for warning), to bypass spam filters. So, a quantity like 'percent mis-spelt email' may make a good feature for detecting spam. 

We will try to leverage the fact that spammers use certain words often in their emails. First, we create a pyspark dataframe from our pandas dataframe. 

In [15]:
import findspark
findspark.init()
import pyspark
from pyspark.ml.feature import HashingTF, IDF, IDFModel, Tokenizer, RegexTokenizer, StopWordsRemover
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

In [16]:
# Now, we create a Spark, Pandas dataframe of features from df_final
# Features include TFIDF vector, vector of misspellings count per email, punctuation count per email 

In [17]:
import re
# clean a string off punctuations and additional whitespace, new line characters.
def processString(body):
    body = body.replace("'", "")
    body = re.sub(r"[^\w\s]|_", " ", body)
    body = re.sub("[\s+]", " ", body)
    return body

Right now, the email body is in it's raw form. We process it to remove punctuation, redundant white space and trim it.

In [18]:
# Feature 5 - TFIDF
body = df_final['Body']
body = body.map(processString, na_action='ignore')

Convert to pandas dataframe

In [19]:
pddf = pd.DataFrame(body, columns=['Body'])
pddf['RawFeature1'] = df_final['Feature1']
pddf['RawFeature2'] = df_final['Feature2']
pddf['RawFeature3'] = df_final['Feature3']
pddf['RawFeature4'] = df_final['Feature4']
pddf['Spam'] = df_final['Spam']
pddf.head()

Unnamed: 0,Body,RawFeature1,RawFeature2,RawFeature3,RawFeature4,Spam
0,email marketing works bulls eye gold is the...,0,0,0,0,Spam
1,this is the most exciting breakthrough ever...,0,0,1,0,Spam
2,email marketing works bulls eye gold is the...,0,0,0,0,Spam
3,free download register your web site to over 7...,0,1,0,0,Spam
4,do you love cars want your own business th...,0,0,1,0,Spam


Before creating the spark dataframe, we replace all NaNs with empty string. Our spark dataframe now contains body, individual features 1-4 and the label(spam or ham). 

In [20]:
pddf.fillna("", inplace=True)
df = spark.createDataFrame(pddf)
df.show(3)

+--------------------+-----------+-----------+-----------+-----------+----+
|                Body|RawFeature1|RawFeature2|RawFeature3|RawFeature4|Spam|
+--------------------+-----------+-----------+-----------+-----------+----+
|email marketing w...|          0|          0|          0|          0|Spam|
|   this is the mo...|          0|          0|          1|          0|Spam|
|email marketing w...|          0|          0|          0|          0|Spam|
+--------------------+-----------+-----------+-----------+-----------+----+
only showing top 3 rows



Using spark StringIndexer to index the label column. 0.0 indicates spam and 1.0 indicates ham.

In [21]:
from pyspark.ml.feature import StringIndexer
stringIndexer = StringIndexer(inputCol='Spam', outputCol='label')
df = stringIndexer.fit(df).transform(df)
df.show(3)

+--------------------+-----------+-----------+-----------+-----------+----+-----+
|                Body|RawFeature1|RawFeature2|RawFeature3|RawFeature4|Spam|label|
+--------------------+-----------+-----------+-----------+-----------+----+-----+
|email marketing w...|          0|          0|          0|          0|Spam|  0.0|
|   this is the mo...|          0|          0|          1|          0|Spam|  0.0|
|email marketing w...|          0|          0|          0|          0|Spam|  0.0|
+--------------------+-----------+-----------+-----------+-----------+----+-----+
only showing top 3 rows



Using RegexTokenizer to tokenize the Body. This breaks the body into chunks around non-word delimiters (\\W).

In [22]:
regexTokenizer = RegexTokenizer(inputCol='Body', outputCol='Body_Tokens', pattern='\\W')

In [23]:
df_tokenized = regexTokenizer.transform(df)
df_tokenized['Body_Tokens']

Column<b'Body_Tokens'>

In [24]:
df_tokenized.show(3)

+--------------------+-----------+-----------+-----------+-----------+----+-----+--------------------+
|                Body|RawFeature1|RawFeature2|RawFeature3|RawFeature4|Spam|label|         Body_Tokens|
+--------------------+-----------+-----------+-----------+-----------+----+-----+--------------------+
|email marketing w...|          0|          0|          0|          0|Spam|  0.0|[email, marketing...|
|   this is the mo...|          0|          0|          1|          0|Spam|  0.0|[this, is, the, m...|
|email marketing w...|          0|          0|          0|          0|Spam|  0.0|[email, marketing...|
+--------------------+-----------+-----------+-----------+-----------+----+-----+--------------------+
only showing top 3 rows



Using StopWordsRemover to filter out meaningless stop words.

In [25]:
stopWordsRemover = StopWordsRemover(inputCol="Body_Tokens", outputCol="Body_Tokens2")
df_tokenized = stopWordsRemover.transform(df_tokenized)
df_tokenized.show(5)

+--------------------+-----------+-----------+-----------+-----------+----+-----+--------------------+--------------------+
|                Body|RawFeature1|RawFeature2|RawFeature3|RawFeature4|Spam|label|         Body_Tokens|        Body_Tokens2|
+--------------------+-----------+-----------+-----------+-----------+----+-----+--------------------+--------------------+
|email marketing w...|          0|          0|          0|          0|Spam|  0.0|[email, marketing...|[email, marketing...|
|   this is the mo...|          0|          0|          1|          0|Spam|  0.0|[this, is, the, m...|[exciting, breakt...|
|email marketing w...|          0|          0|          0|          0|Spam|  0.0|[email, marketing...|[email, marketing...|
|free download reg...|          0|          1|          0|          0|Spam|  0.0|[free, download, ...|[free, download, ...|
|do you love cars ...|          0|          0|          1|          0|Spam|  0.0|[do, you, love, c...|[love, cars, want...|
+-------

TF-IDF in spark is divided into HashingTF and then IDF. Applying HashingTF to create term frequencies from "Body_Tokens2" column into "TermFreqs" column.

In [26]:
hashingTF = HashingTF(inputCol="Body_Tokens2", outputCol="TermFreqs", numFeatures=20)
df3 = hashingTF.transform(df_tokenized)

In [27]:
df3.columns

['Body',
 'RawFeature1',
 'RawFeature2',
 'RawFeature3',
 'RawFeature4',
 'Spam',
 'label',
 'Body_Tokens',
 'Body_Tokens2',
 'TermFreqs']

Here, we create an IDF model and fit it over "TermFreqs" column. A new column "RawFeature5" is created that contains the TF-IDF vector for every email row. 

In [28]:
idfModel = IDF(inputCol="TermFreqs", outputCol="RawFeature5").fit(df3)
df4 = idfModel.transform(df3)
df4.show(5)

+--------------------+-----------+-----------+-----------+-----------+----+-----+--------------------+--------------------+--------------------+--------------------+
|                Body|RawFeature1|RawFeature2|RawFeature3|RawFeature4|Spam|label|         Body_Tokens|        Body_Tokens2|           TermFreqs|         RawFeature5|
+--------------------+-----------+-----------+-----------+-----------+----+-----+--------------------+--------------------+--------------------+--------------------+
|email marketing w...|          0|          0|          0|          0|Spam|  0.0|[email, marketing...|[email, marketing...|(20,[0,1,2,3,4,5,...|(20,[0,1,2,3,4,5,...|
|   this is the mo...|          0|          0|          1|          0|Spam|  0.0|[this, is, the, m...|[exciting, breakt...|(20,[0,1,2,3,4,5,...|(20,[0,1,2,3,4,5,...|
|email marketing w...|          0|          0|          0|          0|Spam|  0.0|[email, marketing...|[email, marketing...|(20,[0,1,2,3,4,5,...|(20,[0,1,2,3,4,5,...|
|fre

Spark VectorAssembler is a transformer that combines a list of raw features into a single feature vector.

In [29]:
from pyspark.ml.feature import VectorAssembler
vectorAssembler = VectorAssembler(inputCols=['RawFeature1', 'RawFeature2', 'RawFeature3', 'RawFeature4','RawFeature5'], outputCol='Features')

In [30]:
df5 = vectorAssembler.transform(df4)
df5.show(2)

+--------------------+-----------+-----------+-----------+-----------+----+-----+--------------------+--------------------+--------------------+--------------------+--------------------+
|                Body|RawFeature1|RawFeature2|RawFeature3|RawFeature4|Spam|label|         Body_Tokens|        Body_Tokens2|           TermFreqs|         RawFeature5|            Features|
+--------------------+-----------+-----------+-----------+-----------+----+-----+--------------------+--------------------+--------------------+--------------------+--------------------+
|email marketing w...|          0|          0|          0|          0|Spam|  0.0|[email, marketing...|[email, marketing...|(20,[0,1,2,3,4,5,...|(20,[0,1,2,3,4,5,...|[0.0,0.0,0.0,0.0,...|
|   this is the mo...|          0|          0|          1|          0|Spam|  0.0|[this, is, the, m...|[exciting, breakt...|(20,[0,1,2,3,4,5,...|(20,[0,1,2,3,4,5,...|[0.0,0.0,1.0,0.0,...|
+--------------------+-----------+-----------+-----------+-------

In [31]:
# Keeping only columns that we need
df6 = df5.drop('Body', 'RawFeature2', 'RawFeature2', 'RawFeature3','RawFeature4','Body_Tokens','Body_Tokens2','TermFreqs','RawFeature5')
df6.columns

['RawFeature1', 'Spam', 'label', 'Features']

In [32]:
df6.columns

['RawFeature1', 'Spam', 'label', 'Features']

Splitting the spark dataframe randomly into 75% train and 25% test dataframes.

In [33]:
df6_train, df6_test = df6.randomSplit([3.0, 1.0], 24)

Using Multinomial Naive Bayes model and training it over train dataset.

In [34]:
from pyspark.ml.classification import NaiveBayes

In [35]:
nb = NaiveBayes(smoothing=1.0, modelType='multinomial', featuresCol='Features')

In [36]:
NBModel = nb.fit(df6_train)

Testing the trained model over test dataset.

In [37]:
predictions = NBModel.transform(df6_test)

Calculating the accuracy of our model.

In [38]:
correct = predictions[predictions['label'] == predictions['prediction']]
incorrect = predictions[predictions['label'] != predictions['prediction']]

In [39]:
print('Correct predictions: ', correct.count())
print('Incorrect predictions: ', incorrect.count())

Correct predictions:  898
Incorrect predictions:  328


In [40]:
print('Accuracy in %: ', (correct.count() * 100.) / (correct.count() + incorrect.count()))

Accuracy in %:  73.2463295269168


Another feature that is WIP is 'percent mis-spelt words'. We can use the pattern.en package's suggest method to check if a word is mis-spelt or no. This could be applied over all tokenized words of an email. Divided by length of the email, it results in 'percent mis-spelt words'. 

## Feature 6 Engineering -> Percent Misspellings -> Work In Progress

In [41]:
"""
df_tokenized = df_tokenized.sample(0.005)
df_tokenized.show()"""

'\ndf_tokenized = df_tokenized.sample(0.005)\ndf_tokenized.show()'

In [42]:
"""df_tokenized.count()"""

'df_tokenized.count()'

In [43]:
""""# Feature5 -> Percent of misspellings in mail body
from pattern.en import spelling
spelling.suggest('wrng')""""

SyntaxError: EOL while scanning string literal (<ipython-input-43-28a9e9562e67>, line 3)

In [None]:
"""from functools import reduce
from pattern.en import spelling
percentSpelling = []
def getPercentMisspelled(wordList):
    print(wordList)
    f = lambda x,y: int(x)+1 if spelling.suggest(y)[0][0] != y else int(x)
    e = float(reduce(f, wordList, 0)/len(wordList))
    percentSpelling.append(e)"""

In [None]:
#r1 = df_tokenized.first()['Body2']

In [None]:
#getPercentMisspelled(r1)

In [None]:
"""df_tokenized"""

In [None]:
"""from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
#myudf = udf(getPercentMisspelled, FloatType())"""

In [None]:
#c = myudf(df_tokenized.Body2)

In [None]:
#df_tokenized = df_tokenized.withColumn('Feature5', c)