# Introduction
The spamassassin corpus contains emails classified as either `spam` or `ham`, the latter being ordinary non-spam emails. The data, as well as a description, can be found here:
https://spamassassin.apache.org/old/publiccorpus/

I intend to build a classification model to distinguish spam from ham. To start, I will get familiar with the data.

# The data

The data is split into 5 parts. The descriptions below are copied from https://spamassassin.apache.org/old/publiccorpus/readme.html
* `spam`: 500 spam messages, all received from non-spam-trap sources.

* `easy_ham`: 2500 non-spam messages.  These are typically quite easy to
    differentiate from spam, since they frequently do not contain any spammish
    signatures (like HTML etc).

* `hard_ham`: 250 non-spam messages which are closer in many respects to
    typical spam: use of HTML, unusual HTML markup, coloured text,
    "spammish-sounding" phrases etc.

* `easy_ham_2`: 1400 non-spam messages.  A more recent addition to the set.

* `spam_2`: 1397 spam messages.  Again, more recent.

Every email is its own file, so I will need to combine all files. The emails also include all headers, so I will probably need to clean them up, as I am mainly interested in working with the text content of the emails.
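To make the header/body distinction concrete, here is a minimal sketch that uses Python's standard `email` module on a made-up message. This is not the approach taken in the rest of this notebook, which uses hand-written regular expressions. The message text and the addresses in it are invented for illustration.

```python
from email import message_from_string

# Made-up message for illustration; it is not taken from the corpus.
raw = (
    "Return-Path: <alice@example.com>\n"
    "Received: from mail.example.com\n"
    "\tby mx.example.org with SMTP\n"
    "Subject: Hello there\n"
    "\n"
    "This is the body.\n"
)

msg = message_from_string(raw)
subject = msg["Subject"]   # None when the header is missing
body = msg.get_payload()   # a str for non-multipart messages

print(subject)
print(body)
```

The blank line is what separates the headers from the body. Multipart messages would need extra handling, since `get_payload()` then returns a list of parts rather than a string.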

## Example email
Let's try to extract the subject and body from just one email.

In [2]:
with open("spamassassin_public_corpus/hard_ham/00002.ca96f74042d05c1a1d29ca30467cfcd5") as f:
    mail = f.read()

In [3]:
print(mail)

Return-Path: <malcolm-sweeps@mrichi.com>
Delivered-To: rod@arsecandle.org
Received: (qmail 16821 invoked by uid 505); 7 May 2002 14:37:01 -0000
Received: from malcolm-sweeps@mrichi.com by blazing.arsecandle.org
	 by uid 500 with qmail-scanner-1.10 (F-PROT: 3.12. Clear:0. Processed in 0.260914 secs); 07 May 2002 14:37:01 -0000
Delivered-To: rod-3ds@arsecandle.org
Received: (qmail 16811 invoked by uid 505); 7 May 2002 14:37:00 -0000
Received: from malcolm-sweeps@mrichi.com by blazing.arsecandle.org
	 by uid 502 with qmail-scanner-1.10 (F-PROT: 3.12. Clear:0. Processed in 0.250416 secs); 07 May 2002 14:37:00 -0000
Received: from bocelli.siteprotect.com (64.41.120.21)
  by h0090272a42db.ne.client2.attbi.com with SMTP; 7 May 2002 14:36:59 -0000
Received: from mail.mrichi.com ([208.33.95.187])
	by bocelli.siteprotect.com (8.9.3/8.9.3) with SMTP id JAA14328;
	Tue, 7 May 2002 09:37:01 -0500
From: malcolm-sweeps@mrichi.com
Message-Id: <200205071437.JAA14328@bocelli.siteprotect.com>
To: <rod-3ds

## Extract title and body

In [20]:
import pandas as pd
import re

In [5]:
# Takes an unprocessed email as input and returns a list with each line as an element.
# The idea of this function is to help distinguish headers from the email body.
def get_lines(mail):
    # Find all occurrences of multiple spaces or tabs and condense them to a single space.
    mail = re.sub(r"[ \t]+", " ", mail)
    # A line break followed by whitespace typically indicates a continued line. Replace these line breaks with a single space.
    mail = mail.replace("\n ", " ")
    lines = mail.split("\n")
    return lines

In [6]:
lines = get_lines(mail)
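To see what this line-joining step does, here is a small self-contained sketch that applies the same two substitutions to a made-up email. The name `unfold` and the sample text are assumptions for illustration only.

```python
import re

def unfold(raw):
    """Collapse runs of spaces/tabs, then join continuation lines."""
    raw = re.sub(r"[ \t]+", " ", raw)  # tabs and repeated spaces -> one space
    raw = raw.replace("\n ", " ")      # a line starting with a space continues the previous one
    return raw.split("\n")

# Made-up sample: the Received header is folded over two lines.
sample = "Received: from a\n\tby b\nSubject: Hi\n\nBody text"
unfolded = unfold(sample)
print(unfolded)
```

The folded `Received` header ends up on a single line, while the empty line between headers and body is preserved. Note that indented lines inside the body are joined to the previous line as well, which is harmless here because the body is later joined into one string anyway.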

In [7]:
subject_regex = re.compile("Subject: (.*)")
subject = subject_regex.search("Subject: ")

In [8]:
def get_subject_and_body(lines):
    # Raw strings keep the backslash in \S from being treated as a string escape.
    header_regex = re.compile(r"^\S+:")
    subject_regex = re.compile(r"Subject: (.*)")
    subject = None
    reading_headers = True
    cutoff = 0
    for i, line in enumerate(lines):
        if header_regex.match(line):
            # Don't search for a subject if we already found it
            if not subject:
                subject = subject_regex.search(line)
        elif subject:
            cutoff = i
            break
    if subject:
        subject = subject.group(1)
        # If the subject line was empty, replace it with the tag <EMPTY>
        if not subject:
            subject = "<EMPTY>"
    body = " ".join(lines[cutoff:])
    return subject, body

In [9]:
subject, body = get_subject_and_body(lines)

In [10]:
subject

'Malcolm in the Middle Sweepstakes Prize Notification'

## All emails
Now let's try extracting the subject and body of all emails.

In [11]:
import os

In [12]:
spam_dirs = [r'spamassassin_public_corpus\spam', r'spamassassin_public_corpus\spam_2']
ham_dirs = ['spamassassin_public_corpus\\easy_ham\\', 'spamassassin_public_corpus\\easy_ham_2\\', 'spamassassin_public_corpus\\hard_ham\\']

In [13]:
directory = r"C:\Users\Gustav\Data Science\Spam Classification"

In [14]:
files = {}
for mail_dir in spam_dirs + ham_dirs:
    files[mail_dir] = [f for f in os.listdir(directory+"\\"+mail_dir)]
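The paths above use Windows-style backslashes, so the cells only run unchanged on Windows. As a hedged alternative, the listing step could be written with `pathlib`, which picks the right separator for the platform. The helper name `list_mail_files` and the throwaway directory tree are assumptions made so that the sketch runs anywhere.

```python
import tempfile
from pathlib import Path

def list_mail_files(base_dir, subdirs):
    """Map each sub-directory name to the sorted list of file paths inside it."""
    base = Path(base_dir)
    return {
        name: sorted(p for p in (base / name).iterdir() if p.is_file())
        for name in subdirs
    }

# Build a tiny temporary tree that stands in for the corpus directories.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("spam", "easy_ham"):
        (Path(tmp) / name).mkdir()
        (Path(tmp) / name / "00001.example").write_text("Subject: hi\n\nbody\n")
    listing = list_mail_files(tmp, ["spam", "easy_ham"])
    file_counts = {name: len(paths) for name, paths in listing.items()}

print(file_counts)
```

Because the dictionary keys are the plain directory names, no string splitting on `\\` is needed later to recover the data set name.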

In [15]:
with open('spamassassin_public_corpus/spam/00116.29e39a0064e2714681726ac28ff3fdef', errors='ignore') as f:
    mail = f.read()

In [16]:
mails = {}
for mail_dir, filenames in files.items():
    dir_mails = []
    for file in filenames:
        with open(mail_dir+"\\"+file, errors='ignore') as f:
            try:
                dir_mails.append(f.read())
            except Exception:
                # Report which file could not be read, then re-raise the error.
                print(mail_dir+"\\"+file)
                raise
    mails[mail_dir.split("\\")[1]] = dir_mails
        

In [17]:
all_mails = []
for mail_list in mails.items():
    all_mails += [(mail_list[0], x) for x in mail_list[1]]

In [18]:
len(all_mails)

6051

In [21]:
rows = []
for i, mail in enumerate(all_mails):
    data_set, mail = mail
    subject, body = get_subject_and_body(get_lines(mail))
    rows.append((data_set, subject, body, mail))
df = pd.DataFrame(data=rows, columns=('dataset', 'subject', 'body', 'mail'))

Let's extract all mails where we failed to find a subject or body.

In [22]:
problematic_df = df[(df['subject'].isnull()) | (df['body'].isnull())]

In [23]:
for i, mail in problematic_df.iterrows():
    print("Index: {}".format(i))
    print("Data set {}".format(mail['dataset']))
    print("Subject: {}".format(mail['subject']))
    print("Body: {}".format(mail['body'])[:100])
    print()

Index: 500
Data set spam
Subject: None
Body: mv 00001.7848dde101aa985090474a91ec93fcf0 00001.7848dde101aa985090474a91ec93fcf0 mv 00002.d94f

Index: 1897
Data set spam_2
Subject: None
Body: mv 00001.317e78fa8ee2f54cd4890fdc09ba8176 00001.317e78fa8ee2f54cd4890fdc09ba8176 mv 00002.9438

Index: 4398
Data set easy_ham
Subject: None
Body: mv 00001.7c53336b37003a9286aba55d2945844c 00001.7c53336b37003a9286aba55d2945844c mv 00002.9c40

Index: 5676
Data set easy_ham_2
Subject: None
Body: From mail@dogma.slashnull.org Mon Jul 22 17:22:56 2002 Return-Path: <mail@dogma.slashnull.org>

Index: 5677
Data set easy_ham_2
Subject: None
Body: From mail@dogma.slashnull.org Mon Jul 22 17:23:26 2002 Return-Path: <mail@dogma.slashnull.org>

Index: 5678
Data set easy_ham_2
Subject: None
Body: From mail@dogma.slashnull.org Mon Jul 22 17:23:56 2002 Return-Path: <mail@dogma.slashnull.org>

Index: 5730
Data set easy_ham_2
Subject: None
Body: From nobody@sonic.spamtraps.taint.org Fri Aug 16 11:08:45 2002 Return-Pat

Many of the emails lacking a subject are just filled with lines like 
`mv 00001.d4365609129eef855bd5da583c90552b 00001.1a31cc283af0060967a233d26548a6ce mv 00002.5a58`
which looks suspicious to me as the long strings match the filenames in the spamassassin corpus.
All other emails except one look something like this:

`Problem with spamtrap
/home/yyyy/lib/spamtrap.sh: /home/yyyy/ftp/spamassassin/spamassassin: No such file or directory`

This also seems related to the spamassassin corpus. Without speculating too much about the origin of these files, I do not think they will be useful to me. Therefore, of the mails for which I failed to find a subject, I will keep only one, which simply did not have a subject header.

In the end I throw away 10 suspicious emails: 1 each from spam and spam_2, 1 each from hard_ham and easy_ham, and 6 from easy_ham_2.
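The files filled with `mv` lines could also be recognised by their content rather than by their row index. The following is a rough sketch under the assumption that such a file consists only of lines starting with `mv `. The function name and both sample strings are made up for illustration.

```python
def looks_like_rename_script(text):
    """True if every non-empty line starts with 'mv ' (assumed signature of the odd files)."""
    lines = [line for line in text.splitlines() if line.strip()]
    return bool(lines) and all(line.startswith("mv ") for line in lines)

# Made-up samples: one resembles the suspicious files, one resembles a real email.
script_like = "mv 00001.aaa 00001.bbb\nmv 00002.ccc 00002.ddd\n"
real_mail = "Subject: hello\n\nmv is a shell command\n"

print(looks_like_rename_script(script_like))
print(looks_like_rename_script(real_mail))
```

A check like this would have to run on the raw text in the `mail` column, since the `body` column has already had its line breaks replaced by spaces.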

In [24]:
problematic_df = problematic_df.drop(5974)

In [25]:
df = df.drop(problematic_df.index)

In [26]:
len(df)

6041

Let's do a sanity check of our subject logic. Subjects should not be very long. Let's examine the longest one.

In [27]:
df['subject length'] = df['subject'].apply(lambda x: len(x) if x else 0)

In [28]:
mail = df.loc[df['subject length'].idxmax()]
print(mail['subject'])
print()
print(mail['mail'])

Sitescooper: scoop websites onto your PalmPilot - Sitescooper automatically retrieves the stories from several news websites, trims off extraneous HTML, and converts them into iSilo or Palm DOC format for later reading on-the-move. It maintains a cache, and will avoid stories you've already read. It can handle 1-page sites, 1-page with diffing, 2- and 3-level sites as well as My-Netscape-style RSS sites. It's also very easy to add a new site to its list. Even if you don't have a PalmPilot, it's still handy for simple website-to-text conversion. Site files are included for many popular news sites including Slashdot, LWN, Freshmeat and Linux Today. http://jmason.org/software/sitescooper/

From postmaster@topsitez.us  Fri Nov 29 11:17:25 2002
Return-Path: <postmaster@topsitez.us>
Delivered-To: yyyy@localhost.spamassassin.taint.org
Received: from localhost (jalapeno [127.0.0.1])
	by jmason.org (Postfix) with ESMTP id 1BF5C16F16
	for <jm@localhost>; Fri, 29 Nov 2002 11:16:16 +0000 (GMT)
Rec

Looks like a very long subject, but it was correctly extracted. 

Let's also check if we have any mails where the subject is longer than the body.

In [29]:
df['body length'] = df['body'].apply(lambda x: len(x) if x else 0)

In [30]:
df[df['subject length'] > df['body length']]

| | dataset | subject | body | mail | subject length | body length |
|---|---|---|---|---|---|---|
| 2540 | easy_ham | Re: Hanson's Sept 11 message in the National R... | Chuck Murcko wrote: >[...stuff...] Yawn. R | From fork-admin@xent.com Thu Sep 19 13:26:37 ... | 51 | 48 |
| 3913 | easy_ham | Japanese kids spend a day in school jumping in... | URL: http://www.newsisfree.com/click/-5,83059... | From rssfeeds@jmason.org Fri Sep 27 10:40:56 ... | 144 | 108 |
| 3992 | easy_ham | Bank robber gets the loot, makes it out of ban... | URL: http://www.newsisfree.com/click/-1,84120... | From rssfeeds@jmason.org Tue Oct 1 10:36:35 ... | 121 | 96 |
| 4029 | easy_ham | Family refuses to cancel expensive wedding jus... | URL: http://www.newsisfree.com/click/-1,84067... | From rssfeeds@jmason.org Tue Oct 1 10:37:11 ... | 119 | 101 |
| 4040 | easy_ham | Man rappels off of city bridge trying to hitch... | URL: http://www.newsisfree.com/click/-2,84151... | From rssfeeds@jmason.org Tue Oct 1 10:37:24 ... | 115 | 103 |
| 4061 | easy_ham | If you're going to turn on an electric pump to... | URL: http://www.newsisfree.com/click/-1,84102... | From rssfeeds@jmason.org Tue Oct 1 10:37:59 ... | 122 | 110 |
| 4065 | easy_ham | Cops sabotaging their own in car video cameras... | URL: http://www.newsisfree.com/click/-2,84231... | From rssfeeds@jmason.org Tue Oct 1 10:38:04 ... | 108 | 97 |


Looks like these are mostly mails with a descriptive subject and a URL as the body.

# What's next?
We have excluded some mails that don't look like real emails, and processed all others to extract their subject and body. All mail headers were thrown away.

Now that the data should mostly be the actual contents of the emails, I would like to do some further processing such as tokenization and stemming. Eventually I plan to use a bag-of-words model to represent the data when doing spam classification. I will most likely represent the word contents of emails with one-hot encoding.

I believe stemmed words will be more useful than the original words in a bag-of-words context, mainly because stemming produces a more uniform dictionary: words like `bag` and `bags` will be recognised as the same word instead of two entirely different words.
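As a rough illustration of this idea, the sketch below builds a binary bag-of-words representation on top of a deliberately naive stemmer. The stemmer only strips a trailing `s` and is a toy stand-in, not a real stemming algorithm. All function names and the two sample texts are assumptions for illustration.

```python
import re

def naive_stem(word):
    """Toy stemmer: strip one trailing 's' from words longer than three letters."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def tokenize(text):
    """Lowercase the text and keep runs of letters only."""
    return re.findall(r"[a-z]+", text.lower())

def binary_bag_of_words(texts):
    """Return (vocabulary, rows); each row marks which vocabulary words occur in a text."""
    stemmed = [{naive_stem(tok) for tok in tokenize(t)} for t in texts]
    vocabulary = sorted(set().union(*stemmed))
    rows = [[1 if word in doc else 0 for word in vocabulary] for doc in stemmed]
    return vocabulary, rows

vocabulary, rows = binary_bag_of_words(["One bag", "Two bags"])
print(vocabulary)
print(rows)
```

Both texts share the column for `bag`, whereas without stemming `bag` and `bags` would each get their own column.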

Let's move on to notebook `2. Text processing`.

In [32]:
df.to_csv('data.csv')