Exercise: _Build a spam classifier (a more challenging exercise):_

* _Download examples of spam and ham from [Apache SpamAssassin's public datasets](https://homl.info/spamassassin)._
* _Unzip the datasets and familiarize yourself with the data format._
* _Split the datasets into a training set and a test set._
* _Write a data preparation pipeline to convert each email into a feature vector. Your preparation pipeline should transform an email into a (sparse) vector that indicates the presence or absence of each possible word. For example, if all emails only ever contain four words, "Hello," "how," "are," "you," then the email "Hello you Hello Hello you" would be converted into a vector [1, 0, 0, 1] (meaning ["Hello" is present, "how" is absent, "are" is absent, "you" is present]), or [3, 0, 0, 2] if you prefer to count the number of occurrences of each word._
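
The four-word example in the last bullet can be reproduced with a few lines of Python. This is only a minimal sketch: `email_to_vector` and its `binary` flag are hypothetical names chosen for illustration, the text is split on whitespace only, and it returns a dense list rather than the sparse vector the exercise asks for.

```python
from collections import Counter


def email_to_vector(text, vocabulary, binary=True):
    """Map the whitespace-separated words of `text` onto `vocabulary` order."""
    counts = Counter(text.split())
    if binary:
        # 1 if the word occurs at least once, 0 otherwise.
        return [1 if counts[word] > 0 else 0 for word in vocabulary]
    # Otherwise report how many times each vocabulary word occurs.
    return [counts[word] for word in vocabulary]


vocabulary = ["Hello", "how", "are", "you"]
text = "Hello you Hello Hello you"

presence = email_to_vector(text, vocabulary)
occurrences = email_to_vector(text, vocabulary, binary=False)
print(presence)     # [1, 0, 0, 1]
print(occurrences)  # [3, 0, 0, 2]
```

With a realistic vocabulary of thousands of words, most entries would be zero, which is why a sparse representation is the better fit in practice.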

_You may want to add hyperparameters to your preparation pipeline to control whether or not to strip off email headers, convert each email to lowercase, remove punctuation, replace all URLs with "URL," replace all numbers with "NUMBER," or even perform _stemming_ (i.e., trim off word endings; there are Python libraries available to do this)._

_Finally, try out several classifiers and see if you can build a great spam classifier, with both high recall and high precision._
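
Some of the preprocessing options mentioned in the exercise (lowercasing, URL and number replacement, punctuation removal) could be exposed as keyword arguments. The sketch below is one possible arrangement, not a reference solution: `normalize_text`, its parameter names, and both regular expressions are assumptions made for illustration, and stemming and header stripping are left out.

```python
import re
import string

# Deliberately simple patterns (assumptions): real-world URLs and numbers
# are messier than this.
URL_RE = re.compile(r"https?://\S+")
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def normalize_text(text, lowercase=True, replace_urls=True,
                   replace_numbers=True, remove_punctuation=True):
    """Return the list of tokens in `text` after the enabled steps."""
    if lowercase:
        # Lowercase first so the URL / NUMBER placeholders stay uppercase.
        text = text.lower()
    if replace_urls:
        text = URL_RE.sub(" URL ", text)
    if replace_numbers:
        text = NUMBER_RE.sub(" NUMBER ", text)
    if remove_punctuation:
        # string.punctuation only covers ASCII punctuation.
        text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()


tokens = normalize_text("Visit http://example.com NOW, only 25 left!")
print(tokens)  # ['visit', 'URL', 'now', 'only', 'NUMBER', 'left']
```

Keeping each step behind its own flag makes it possible to treat them as hyperparameters and compare classifier performance with and without each one.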

In [36]:
import tarfile
import os
import shutil
import email
import email.parser  # load_email() uses email.parser.BytesParser
import email.policy


def load_email(filepath):
    with open(filepath, "rb") as f:
        return email.parser.BytesParser(policy=email.policy.default).parse(f)


def load_spam():
    """Extract both archives and return (spam_emails, non_spam_emails)."""
    # Remove directories left over from a previous run so extraction starts clean.
    for dirname in ("non_spam", "easy_ham", "spam", "spam_2"):
        if os.path.exists(dirname):
            shutil.rmtree(dirname)

    # Extract the spam and non-spam archives into the current directory.
    for archive in ("./spam.tar.bz2", "./non_spam.tar.bz2"):
        with tarfile.open(archive, "r:bz2") as tar:
            tar.extractall()

    # Normalize the directory names if the archives unpack as easy_ham / spam_2.
    if os.path.exists("easy_ham"):
        os.rename("easy_ham", "non_spam")
    if os.path.exists("spam_2"):
        os.rename("spam_2", "spam")

    spam_dir = os.path.join(os.getcwd(), "spam")
    non_spam_dir = os.path.join(os.getcwd(), "non_spam")

    spam_files = os.listdir(spam_dir)
    non_spam_files = os.listdir(non_spam_dir)

    # Fail early if either archive produced no files.
    assert len(spam_files) != 0
    assert len(non_spam_files) != 0

    spam_emails = [load_email(os.path.join(spam_dir, f)) for f in spam_files]
    non_spam_emails = [load_email(os.path.join(non_spam_dir, f))
                       for f in non_spam_files]

    return spam_emails, non_spam_emails

In [37]:
spam_emails, non_spam_emails = load_spam()

In [40]:
spam_emails[0]["Subject"]

'Dialogue et Rencontre ? Rejoins nous !'
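
Besides the subject, a parsed message exposes its content type and body, which matter when deciding what text to feed into the feature pipeline. The sketch below parses a small hand-written message (the addresses and text are made up, not taken from the dataset) with the same `BytesParser` and policy used in `load_email`; the `describe` helper is a hypothetical name.

```python
import email.parser
import email.policy

# A made-up message used only for illustration.
raw = (
    b"From: sender@example.com\n"
    b"Subject: Hello there\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"\n"
    b"This is the body.\n"
)

msg = email.parser.BytesParser(policy=email.policy.default).parsebytes(raw)


def describe(message):
    """Summarize the parts of a message that are useful for feature extraction."""
    return {
        "subject": str(message["Subject"]),
        "content_type": message.get_content_type(),
        "is_multipart": message.is_multipart(),
    }


summary = describe(msg)
body = msg.get_content()  # decoded text for a non-multipart text/plain message
print(summary)
print(body.strip())
```

Real emails in the dataset may be multipart or HTML, in which case `get_content()` on the top-level message is not enough and each part needs to be handled separately.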

In [45]:
from collections import Counter

count = Counter(spam_emails[0]["Subject"].split())
count.most_common(10)


[('Dialogue', 1),
 ('et', 1),
 ('Rencontre', 1),
 ('?', 1),
 ('Rejoins', 1),
 ('nous', 1),
 ('!', 1)]
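
Because `split()` only separates on whitespace, the punctuation marks `?` and `!` are counted as words in the output above, and capitalized words are kept distinct from their lowercase forms. Going from one subject line to a vocabulary for the whole dataset could look roughly like the sketch below, which accumulates counts across several texts. The `build_vocabulary` helper and the sample subjects are made up for illustration.

```python
from collections import Counter


def build_vocabulary(texts, size):
    """Return the `size` most frequent lowercase words across all `texts`."""
    totals = Counter()
    for text in texts:
        # update() adds this text's word counts to the running totals.
        totals.update(text.lower().split())
    return [word for word, _ in totals.most_common(size)]


# Made-up subject lines, not taken from the dataset.
subjects = ["Win money now", "Meeting notes", "win a prize now"]
vocab = build_vocabulary(subjects, size=2)
print(vocab)
```

The resulting word list fixes the order of the feature vector, so each email can then be mapped to one entry per vocabulary word.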