# Assignment: Document Classification


It can be useful to be able to classify new "test" documents using already classified "training" documents.  A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.  Here is one example of such data: http://archive.ics.uci.edu/ml/datasets/Spambase

For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).

**Import Packages**

In [57]:
import nltk
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk import word_tokenize, WordNetLemmatizer
from nltk.corpus import stopwords
from nltk import NaiveBayesClassifier, classify
from collections import Counter
import random

**About the Corpus**

For this project I am using the spam and ham data available here: http://spamassassin.apache.org/old/publiccorpus/

I use the following corpora:

    To train:
        
        20021010_easy_ham.tar.bz2
        20021010_spam.tar.bz2
        
    To test:
    
        20030228_easy_ham.tar.bz2
        20030228_spam.tar.bz2
        
For the following process I have unzipped the above data.
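
The archives can also be unpacked from Python rather than by hand. The following is a minimal sketch using the standard `tarfile` module; the helper name `extract_archive` and the destination folder in the usage note are assumptions, not part of the original workflow.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.bz2 archive into dest_dir and return the extracted paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:bz2" opens a bzip2-compressed tar archive for reading.
    with tarfile.open(archive_path, "r:bz2") as archive:
        names = archive.getnames()
        archive.extractall(dest)
    return [dest / name for name in names]
```

For example, `extract_archive("20021010_easy_ham.tar.bz2", "./20021010_easy_ham.tar")` would produce the `./20021010_easy_ham.tar/easy_ham/` layout used below, assuming the archive holds a top-level `easy_ham` folder.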

**Load Corpora**

The following function processes the files in a directory and returns three outputs: a PlaintextCorpusReader, the corpus sentences in tokenized form, and those sentences paired with a spam or ham tag.

In [58]:
def process_corpus(dir, tag):
    
    plain_corpus = PlaintextCorpusReader(dir, '.*', encoding='latin-1') 
    
    sen_corpus  = nltk.Text(plain_corpus.sents())
    
    email_tag_corpus = [(e, tag) for e in sen_corpus]
    
    return plain_corpus, sen_corpus, email_tag_corpus
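
Note that `process_corpus` tags each sentence returned by `plain_corpus.sents()`, so every training example is a sentence rather than a whole e-mail. If one example per e-mail file were wanted instead, a sketch along these lines might work. It uses only the standard library; the name `load_tagged_documents` and the crude regex tokenizer are assumptions for illustration and are not what the notebook runs.

```python
import re
from pathlib import Path


def load_tagged_documents(directory, tag):
    """Return one (tokens, tag) pair per file in directory."""
    tagged = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        # latin-1 maps every byte to a character, so decoding never fails.
        text = path.read_text(encoding="latin-1")
        # Crude tokenizer: runs of letters, digits or apostrophes.
        tokens = re.findall(r"[A-Za-z0-9']+", text)
        tagged.append((tokens, tag))
    return tagged
```

The result has the same `(tokens, tag)` shape as `email_tag_corpus`, so it could be passed through the same feature extraction step.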

The ham training dataset is loaded from the local directory.

The function above is called to get the plaintext corpus, the tokenized sentences, and the tagged sentences of the ham corpus.

In [59]:
#Ham
ham_dir = './20021010_easy_ham.tar/easy_ham/' 

plain_ham, sen_ham, email_tag_ham = process_corpus(ham_dir, 'ham')

The spam training dataset is loaded from the local directory.

The function above is called to get the plaintext corpus, the tokenized sentences, and the tagged sentences of the spam corpus.

In [60]:
#Spam
spam_dir = './20021010_spam.tar/spam/'

plain_spam, sen_spam, email_tag_spam = process_corpus(spam_dir, 'spam')

The function below displays the file count, word count, and sentence count of a corpus.

In [61]:
def display_details(corpus):

    fileid = len(corpus.fileids())
    wordcount = len(corpus.words())
    sencount = len(corpus.sents())

    print("No. Of Files = {}".format(fileid))    
    print("No. Of Words = {}".format(wordcount))
    print("No. Of Sentences = {}".format(sencount))

SPAM DATA SUMMARY

In [62]:
display_details(plain_spam)

No. Of Files = 501
No. Of Words = 769207
No. Of Sentences = 12107


HAM DATA SUMMARY

In [63]:
display_details(plain_ham)

No. Of Files = 2552
No. Of Words = 2449010
No. Of Sentences = 39959


**Combining both corpora and shuffling**

The tagged spam and ham sentences obtained above are combined into one list and shuffled with a fixed seed.

In [64]:
random.seed(123)

email_tag = email_tag_spam + email_tag_ham

random.shuffle(email_tag)
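
Seeding the generator before shuffling makes the order, and therefore every later train/test split, reproducible between runs. A small sketch of that property, using a private `random.Random` instance; the helper `shuffled_copy` and the toy data are hypothetical.

```python
import random


def shuffled_copy(items, seed):
    """Return a shuffled copy of items using a private, seeded generator."""
    rng = random.Random(seed)  # independent of the global random state
    result = list(items)
    rng.shuffle(result)
    return result


tagged = [("s1", "spam"), ("s2", "spam"), ("h1", "ham"), ("h2", "ham")]
first = shuffled_copy(tagged, 123)
second = shuffled_copy(tagged, 123)
print(first == second)  # True: the same seed gives the same order
```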

**Feature**

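The features are the lowercased, lemmatized (WordNetLemmatizer) tokens of each sentence, with English stopwords removed.

The resulting (features, label) pairs are stored in *ext_feature*.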

In [65]:
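# Lowercase and lemmatize every token with WordNetLemmatizer
def feature1(s):
    w = WordNetLemmatizer()
    return [w.lemmatize(wrd.lower()) for wrd in s]

# A set gives constant-time stopword lookups
stopwrd = set(stopwords.words("english"))

# Map each non-stopword token returned by f to True
def get_features(e, f):
    return {wrd: True for wrd in f(e) if wrd not in stopwrd}

# One (features, label) pair per tagged sentence
ext_feature = [(get_features(e, feature1), l) for (e, l) in email_tag]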
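
To make the shape of these features concrete, here is a minimal sketch that builds the same kind of word-presence dictionary without NLTK. It skips lemmatization, and `DEMO_STOPWORDS` is a tiny stand-in for NLTK's English stopword list; both are simplifications for illustration.

```python
# Tiny stand-in for NLTK's English stopword list (assumption for illustration).
DEMO_STOPWORDS = {"the", "is", "a", "to", "of"}


def presence_features(tokens, stopwords=DEMO_STOPWORDS):
    """Map each lowercased non-stopword token to True."""
    features = {}
    for token in tokens:
        word = token.lower()
        if word not in stopwords:
            features[word] = True
    return features


print(presence_features(["The", "offer", "is", "FREE", "free"]))
# {'offer': True, 'free': True}
```

Each word appears at most once, so the classifier sees only whether a word occurs in a sentence, not how often.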

**Naive Bayes Classifier**

The following function trains a Naive Bayes classifier on the first portion *p* of the feature list passed as an argument and keeps the remainder as the test set.

In [66]:
def Naive(f, p):
    l = int(len(f) * p)
    train, test = f[:l], f[l:]
    
    NB = NaiveBayesClassifier.train(train)
    return train, test, NB
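
For intuition about what a Naive Bayes classifier does with word-presence features, the sketch below scores each label as a log prior plus one smoothed log likelihood per word. It is a simplified illustration with add-one smoothing, not NLTK's implementation; `train_nb`, `classify_nb`, and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(examples):
    """Count labels and, per label, how many examples contain each word."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    for features, label in examples:
        label_counts[label] += 1
        for word in features:
            word_counts[label][word] += 1
    return label_counts, word_counts


def classify_nb(model, features):
    """Return the label with the highest log prior + log likelihood."""
    label_counts, word_counts = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        for word in features:
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += math.log((word_counts[label][word] + 1) / (n + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


toy = [
    ({"free": True, "offer": True}, "spam"),
    ({"free": True, "win": True}, "spam"),
    ({"meeting": True, "agenda": True}, "ham"),
    ({"lunch": True, "meeting": True}, "ham"),
]
model = train_nb(toy)
print(classify_nb(model, {"free": True}))     # spam
print(classify_nb(model, {"meeting": True}))  # ham
```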

I train three classifiers on three train/test splits of the shuffled data: 50/50, 70/30, and 90/10.

In [67]:
train1, test1, NB1 = Naive(ext_feature, 0.5)

train2, test2, NB2 = Naive(ext_feature, 0.7)

train3, test3, NB3 = Naive(ext_feature, 0.9)

The function below prints the classifier's accuracy on the train and test sets.

In [68]:
def accuracyfn(train, test, classifier, f):
    # f is accepted to match the calls below but is not used
    x = nltk.classify.accuracy(classifier, train)
    print("train accuracy", x)

    y = nltk.classify.accuracy(classifier, test)
    print("Performance on the test set for feature 1:", y)
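
Accuracy here is the fraction of examples whose predicted label matches the gold label. A minimal standalone sketch of that calculation, where the helper name `accuracy` and the toy labels are illustrative:

```python
def accuracy(predicted, gold):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


print(accuracy(["spam", "ham", "ham", "spam"], ["spam", "ham", "spam", "spam"]))
# 0.75
```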

In [69]:
accuracyfn(train1, test1, NB1,feature1)

train accuracy 0.9333154073675719
Performance on the test set for feature 1: 0.9119578995889832


In [70]:
accuracyfn(train2, test2, NB2,feature1)

train accuracy 0.9333260165724634
Performance on the test set for feature 1: 0.9185019206145967


In [71]:
accuracyfn(train3, test2, NB3,feature1)

train accuracy 0.9357220597964105
Performance on the test set for feature 1: 0.9330985915492958


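Set 3 (the 90/10 split) reports the highest accuracy, so I use NB3 to classify the test corpora. Note that the third call above scores NB3 on test2 rather than test3, so its test figure includes sentences that NB3 was trained on.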
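
Because `Naive` slices the same shuffled list at different cut points, a test slice from one split can share items with the training slice of another. The sketch below checks this on a toy list; `split` mirrors the slicing in `Naive`, and the ten-item list is an illustrative stand-in for `ext_feature`.

```python
def split(items, fraction):
    """Same slicing as the Naive function: first part trains, the rest tests."""
    cut = int(len(items) * fraction)
    return items[:cut], items[cut:]


items = list(range(10))
train2, test2 = split(items, 0.7)
train3, test3 = split(items, 0.9)

print(sorted(set(test2) & set(train3)))  # [7, 8]
print(sorted(set(test3) & set(train3)))  # []
```

Scoring a classifier on a test slice that overlaps its own training slice tends to inflate the reported accuracy, so each classifier is best paired with the test slice from the same split.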

In [72]:
#Ham 20030228
ham2003_dir = './20030228_easy_ham.tar/easy_ham/'
plain_ham2003, sen_ham2003, email_tag_ham2003 = process_corpus(ham2003_dir, 'ham')

predict_ham = [NB3.classify(get_features(e, feature1)) for (e, l) in email_tag_ham2003]

Counter(predict_ham)

Counter({'ham': 36597, 'spam': 1979})

In [73]:
len(predict_ham)

38576

In [74]:
#Spam 20030228
spam2003_dir = './20030228_spam.tar/spam/'
plain_spam2003, sen_spam2003, email_tag_spam2003 = process_corpus(spam2003_dir, 'spam')


predict_spam = [NB3.classify(get_features(e, feature1)) for (e, l) in email_tag_spam2003]

Counter(predict_spam)

Counter({'spam': 10508, 'ham': 1518})

In [75]:
len(predict_spam)

12026

**Conclusion**

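From the above predictions, almost **95%** of ham sentences (36,597 of 38,576) and around **87%** of spam sentences (10,508 of 12,026) were correctly identified. These counts are per sentence, not per e-mail, because the classifier was trained and applied on sentences.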
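
The percentages follow directly from the two `Counter` outputs above. As a quick check, this sketch recomputes them and also derives the combined accuracy over both test corpora; the helper `rate` is hypothetical.

```python
from collections import Counter

# Prediction counts copied from the outputs above.
predict_ham = Counter({"ham": 36597, "spam": 1979})
predict_spam = Counter({"spam": 10508, "ham": 1518})


def rate(counts, label):
    """Share of predictions in counts that equal label."""
    return counts[label] / sum(counts.values())


ham_rate = rate(predict_ham, "ham")
spam_rate = rate(predict_spam, "spam")
overall = (predict_ham["ham"] + predict_spam["spam"]) / (
    sum(predict_ham.values()) + sum(predict_spam.values())
)
print(f"ham: {ham_rate:.1%}, spam: {spam_rate:.1%}, overall: {overall:.1%}")
# ham: 94.9%, spam: 87.4%, overall: 93.1%
```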

**Video**



https://youtu.be/qgtDZnZJcT0 