# Document Classification
## Mohammed Rahman

### Overview

It is often useful to classify new "test" documents using already classified "training" documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam. One example of such data is the UCI Machine Learning Repository's Spambase Data Set.

For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or taken from another source such as your own spam folder).

More adventurous students are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.


### Choosing Documents for Classification

Let's look at the texts available in the Gutenberg corpus.


In [None]:
import nltk
import random
random.seed(250)
import pandas as pd
pd.set_option('display.max_rows', 500)
nltk.download('gutenberg')

nltk.corpus.gutenberg.fileids()

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

We have three books by Jane Austen, the Bible, one book of poems by Blake, and so on. Each author writes in his or her own style. Can we use samples of their work to predict who wrote a specific passage?


### Austen vs Blake
### Create texts

First, we need to combine all three of Austen's works into one text. We will also remove punctuation and convert everything to lowercase so that the same word in different cases is not counted as two different words. Then we can split the result into a list of text segments, each 1000 words long.


In [None]:
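# Concatenate the word lists of Austen's three novels
austen = nltk.corpus.gutenberg.words('austen-emma.txt')+nltk.corpus.gutenberg.words('austen-persuasion.txt')+nltk.corpus.gutenberg.words('austen-sense.txt')
# Keep alphabetic tokens only (drops punctuation and numbers) and lowercase them
austen = [word.lower() for word in austen if word.isalpha()]
# Build 366 full 1000-word segments labeled 'au'; the trailing 454 words are dropped
austen1=[]
for i in range(366):
    austen1.append([austen[i*1000:(i+1)*1000],'au'])
len(austen)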

366454

In [None]:
len(austen1)

366

We now have a list of 366 1000-word segments of text written by Jane Austen.
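The segment count does not have to be typed in by hand; it can be derived from the length of the word list. Here is a minimal sketch, assuming a hypothetical helper `make_segments` (it is not part of NLTK or of the cells above) that keeps only full-length segments and uses toy data in place of a real text:

```python
def make_segments(words, label, size=1000):
    """Split `words` into full segments of `size` words, each paired with `label`.

    Trailing words that do not fill a complete segment are dropped.
    """
    n_segments = len(words) // size
    return [[words[i * size:(i + 1) * size], label] for i in range(n_segments)]


# Toy data standing in for a tokenized text: 2,500 placeholder "words"
toy_words = ['w{}'.format(i) for i in range(2500)]
toy_segments = make_segments(toy_words, 'au')
print(len(toy_segments))        # number of full 1000-word segments
print(len(toy_segments[0][0]))  # words in the first segment
```

Applied to the `austen` list, `make_segments(austen, 'au')` should give the same 366 segments, since 366454 // 1000 = 366.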

I will skip the Bible since it was written by many different authors using many different styles, but let's take the next text in the Gutenberg corpus, poems by Blake, and do the same thing we did with Austen.


In [None]:
blake = nltk.corpus.gutenberg.words('blake-poems.txt')
blake = [word.lower() for word in blake if word.isalpha()]
blake1=[]
for i in range(7):
    blake1.append([blake[i*990:(i+1)*990],'bl'])
len(blake)

6934

Since there are just shy of 7000 words total in the Blake text, I will make each segment 990 words in order to get 7 equal segments for Blake.

In [None]:
len(blake1)

7

Now, we have a list of seven 990-word segments of text written by William Blake.
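The choice of 990 words per segment follows from simple integer arithmetic. As a small sketch (the helper name `equal_segment_size` is made up for illustration):

```python
def equal_segment_size(n_words, n_segments):
    """Return (segment_size, leftover) when splitting n_words into n_segments equal parts."""
    size = n_words // n_segments
    leftover = n_words - size * n_segments
    return size, leftover


size, leftover = equal_segment_size(6934, 7)
print(size, leftover)  # 990 words per segment, 4 words left over
```

The four leftover words at the end of the Blake text are simply not used.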

### Create Feature Extractor

Now let's take the two original lists of words and combine them to create one longer list and find the 2000 most frequent words, which we will later use to create a feature list for our classifier.

In [None]:
ab=austen+blake
all_words = nltk.FreqDist(w.lower() for w in ab)
word_features = list(all_words)[:2000]

wlist = []
for i in range(0, 2000, 200):
    df = pd.DataFrame(word_features[i:(i+200)])
    df.columns=['200 words']
    wlist.append(df)

pd.concat(wlist, axis=1)

|   | 200 words | 200 words | 200 words | 200 words | 200 words | 200 words | 200 words | 200 words | 200 words | 200 words |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | the | feelings | true | clay | purpose | smiled | estate | companions | sensibility | morton |
| 1 | to | found | agreeable | benwick | assured | thoroughly | run | suit | heartily | chiefly |
| 2 | and | few | taken | temper | extraordinary | enscombe | totally | fast | cruel | selfishness |
| 3 | of | heart | state | isabella | write | desirable | shewed | pressed | relation | turns |
| 4 | a | does | conversation | curiosity | ease | seat | line | dark | buildings | animated |
| 5 | i | going | dare | delighted | also | thrown | settle | highest | existence | faults |
| 6 | her | perhaps | husband | sight | agitation | concerned | venture | altered | third | shook |
| 7 | in | believe | door | robert | distance | close | advice | convince | black | wherever |
| 8 | was | fairfax | walked | admiral | welcome | encouragement | possibility | observing | christmas | triumph |
| 9 | it | family | louisa | excellent | difficulties | opinions | piece | niece | belong | comfortably |


Here, I will use the function in the Natural Language Processing with Python textbook on page 228 to create a feature generator that uses the 2000 most frequent words list and indicates whether or not each word is present in the text as a feature.

In [None]:
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

features = document_features(blake)
list(features.items())[:20]

[('contains(the)', True),
 ('contains(to)', True),
 ('contains(and)', True),
 ('contains(of)', True),
 ('contains(a)', True),
 ('contains(i)', True),
 ('contains(her)', True),
 ('contains(in)', True),
 ('contains(was)', True),
 ('contains(it)', True),
 ('contains(she)', True),
 ('contains(not)', True),
 ('contains(be)', True),
 ('contains(that)', True),
 ('contains(he)', True),
 ('contains(had)', True),
 ('contains(you)', True),
 ('contains(as)', True),
 ('contains(for)', True),
 ('contains(but)', True)]
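To make the two steps (pick the most frequent words, then test each document for their presence) easier to follow, here is a small standalone sketch on toy data. It uses `collections.Counter` from the standard library rather than `nltk.FreqDist`, and the helper names `top_words` and `presence_features` are illustrative, not taken from the textbook:

```python
from collections import Counter


def top_words(words, n):
    """Return the n most frequent words, most frequent first."""
    return [word for word, _ in Counter(words).most_common(n)]


def presence_features(document, vocabulary):
    """Map each vocabulary word to True/False depending on whether it occurs in the document."""
    document_words = set(document)
    return {'contains({})'.format(word): (word in document_words) for word in vocabulary}


toy_corpus = 'the cat sat on the mat and the dog sat too'.split()
vocabulary = top_words(toy_corpus, 2)
print(vocabulary)
print(presence_features(['the', 'dog', 'barked'], vocabulary))
```

In the toy corpus "the" occurs three times and "sat" twice, so they form the vocabulary; the sample document contains "the" but not "sat".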

### Create Test Train Dataset

Now we need to create a list of all text segments from both Austen and Blake and shuffle them to create the text corpus that we will use to train and test our classifier model.


In [None]:
documents=austen1+blake1

import random
random.shuffle(documents)
featuresets = [(document_features(d), c) for (d,c) in documents]
len(featuresets)

373

Next, we split the dataset into training and test sections (the first 100 segments for training, the rest for testing), train the classifier on the training set, and check the accuracy of the model on the test set.

In [None]:
train_set, test_set = featuresets[:100], featuresets[100:]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(nltk.classify.accuracy(classifier, test_set))

0.9926739926739927


It is very easy for the NLTK classifier to distinguish between Austen and Blake. Let's try more authors.
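The split used here takes a fixed 100 segments for training. As more authors are added, a fixed count becomes a smaller share of the data, so one alternative is to split by fraction. This is only a sketch: `split_by_fraction` is a hypothetical helper, and the 0.25 fraction is an arbitrary value chosen for illustration.

```python
def split_by_fraction(items, train_fraction=0.25):
    """Split an already-shuffled list into (train, test), using train_fraction for training."""
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


# Toy stand-in for the 373 shuffled feature sets
toy_items = list(range(373))
train, test = split_by_fraction(toy_items, 0.25)
print(len(train), len(test))
```

With the real data, the call would be `train_set, test_set = split_by_fraction(featuresets)` after shuffling.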

#### Adding Bryant

In [None]:
bryant = nltk.corpus.gutenberg.words('bryant-stories.txt')
bryant = [word.lower() for word in bryant if word.isalpha()]
bryant1=[]
for i in range(46):
    bryant1.append([bryant[i*1000:(i+1)*1000],'br'])
len(bryant)

46611

In [None]:
abb=austen+blake+bryant
all_words = nltk.FreqDist(w.lower() for w in abb)
word_features = list(all_words)[:2000]

documents=austen1+blake1+bryant1

In [None]:
random.shuffle(documents)
featuresets = [(document_features(d), c) for (d,c) in documents]
len(featuresets)

419

In [None]:
train_set, test_set = featuresets[:100], featuresets[100:]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(nltk.classify.accuracy(classifier, test_set))

0.9937304075235109


The results are marginally better when distinguishing between Austen, Blake, and Bryant.

#### Adding Burgess

In [None]:
burgess = nltk.corpus.gutenberg.words('burgess-busterbrown.txt')
burgess = [word.lower() for word in burgess if word.isalpha()]
burgess1=[]
for i in range(16):
    burgess1.append([burgess[i*1000:(i+1)*1000],'bu'])
len(burgess)

16327

In [None]:
abbb=austen+blake+bryant+burgess
all_words = nltk.FreqDist(w.lower() for w in abbb)
word_features = list(all_words)[:2000]

documents=austen1+blake1+bryant1+burgess1

In [None]:
random.shuffle(documents)
featuresets = [(document_features(d), c) for (d,c) in documents]
len(featuresets)

435

In [None]:
train_set, test_set = featuresets[:100], featuresets[100:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.9522388059701492


Accuracy has declined a bit, but it is still in the mid-90s.

Let's see what features are most important in training our model.


In [None]:
classifier.show_most_informative_features(25)

Most Informative Features
          contains(much) = False              bl : au     =     52.8 : 1.0
        contains(breath) = True               bu : au     =     44.0 : 1.0
        contains(chosen) = True               bu : au     =     44.0 : 1.0
           contains(eat) = True               bu : au     =     44.0 : 1.0
          contains(feet) = True               bu : au     =     44.0 : 1.0
        contains(helped) = True               bu : au     =     44.0 : 1.0
        contains(jumped) = True               bu : au     =     44.0 : 1.0
            contains(ll) = True               bu : au     =     44.0 : 1.0
      contains(terrible) = True               bu : au     =     44.0 : 1.0
          contains(tree) = True               bu : au     =     44.0 : 1.0
           contains(sun) = True               br : au     =     42.4 : 1.0
        contains(bright) = True               bl : au     =     41.1 : 1.0
           contains(cry) = True               bl : au     =     41.1 : 1.0

It appears that a text that does not contain the word "much" is about 53 times more likely to be by Blake than by Austen, while a text that contains "breath", "chosen", "eat", "feet", "helped", "jumped", "terrible" or "tree" is 44 times more likely to be by Burgess than by Austen. A text that contains "bright" or "cry" is about 41 times more likely to be by Blake than by Austen.
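Each ratio in this table compares how likely a feature value is under one label versus another. The sketch below shows roughly how such a ratio can arise from counts. It assumes add-0.5 (expected likelihood) smoothing with two possible feature values, which I believe is the default estimator for `NaiveBayesClassifier.train`, but treat that as an assumption. The helper names and the counts are hypothetical.

```python
def smoothed_prob(count, total, gamma=0.5, bins=2):
    """Add-gamma smoothed estimate of P(feature value | label)."""
    return (count + gamma) / (total + gamma * bins)


def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio P(value | label a) / P(value | label b) using the smoothed estimates."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)


# Hypothetical counts: the word occurs in 3 of 3 training segments for one author
# and in 0 of 80 training segments for another
ratio = likelihood_ratio(3, 3, 0, 80)
print(round(ratio, 2))
```

Because the ratio depends only on these counts, features with identical counts get identical ratios, which may be why so many rows share the same value.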

#### Adding Carroll

In [None]:
carroll = nltk.corpus.gutenberg.words('carroll-alice.txt')
carroll = [word.lower() for word in carroll if word.isalpha()]
carroll1=[]
for i in range(27):
    carroll1.append([carroll[i*1000:(i+1)*1000],'ca'])
len(carroll)

27333

In [None]:
abbbc=austen+blake+bryant+burgess+carroll
all_words = nltk.FreqDist(w.lower() for w in abbbc)
word_features = list(all_words)[:2000]

documents=austen1+blake1+bryant1+burgess1+carroll1

In [None]:
random.shuffle(documents)
featuresets = [(document_features(d), c) for (d,c) in documents]
len(featuresets)

462

In [None]:
train_set, test_set = featuresets[:100], featuresets[100:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.9475138121546961


Adding Carroll lowered accuracy only slightly (from about 95.2% to 94.7%). The gradual decline since the first models is probably because I am using a smaller percentage (100/462) of the corpus for training and I am adding more complexity by adding more categories to classify into. Let's keep adding more authors and see what happens.
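The shrinking training share can be checked directly from the corpus sizes reported in the runs so far. A quick sketch (the dictionary layout is just one convenient way to hold the numbers):

```python
# (corpus size, training size) pairs taken from the runs in this notebook
runs = {
    'Austen + Blake': (373, 100),
    '+ Bryant': (419, 100),
    '+ Burgess': (435, 100),
    '+ Carroll': (462, 100),
}

# Fraction of each corpus that was used for training
shares = {name: train / total for name, (total, train) in runs.items()}
for name, share in shares.items():
    print('{:<15} {:.1%}'.format(name, share))
```

The share drops from roughly 27% with two authors to under 22% with five.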

In [None]:
classifier.show_most_informative_features(25)

Most Informative Features
         contains(brook) = True               bu : au     =     47.8 : 1.0
          contains(crow) = True               bu : au     =     47.8 : 1.0
        contains(beside) = True               bl : au     =     41.0 : 1.0
        contains(breath) = True               bl : au     =     41.0 : 1.0
          contains(dead) = True               bl : au     =     41.0 : 1.0
        contains(forgot) = True               bl : au     =     41.0 : 1.0
           contains(had) = False              bl : au     =     41.0 : 1.0
            contains(ll) = True               ca : au     =     41.0 : 1.0
         contains(mouth) = True               bl : au     =     41.0 : 1.0
         contains(queen) = True               bl : au     =     41.0 : 1.0
         contains(river) = True               bl : au     =     41.0 : 1.0
          contains(sing) = True               bl : au     =     41.0 : 1.0
        contains(spring) = True               bl : au     =     41.0 : 1.0

Words that indicate a text is more likely to have been written by Blake include "beside", "breath", "dead", "forgot", "mouth", "queen", "river", "sing" and "spring"; each makes a text 41 times more likely to be by Blake than by Austen. For Burgess, the indicator words are "brook" and "crow" (about 48 times more likely than Austen), and for Carroll, "ll" (41 times). Austen is marked by the word "had": a text that does not contain it is 41 times more likely to be by Blake.

#### Adding Chesterton

In [None]:
chesterson = nltk.corpus.gutenberg.words('chesterton-ball.txt')+nltk.corpus.gutenberg.words('chesterton-brown.txt')+nltk.corpus.gutenberg.words('chesterton-thursday.txt')
chesterson = [word.lower() for word in chesterson if word.isalpha()]
chesterson1=[]
for i in range(214):
    chesterson1.append([chesterson[i*1000:(i+1)*1000],'ch'])
len(chesterson)

214692

In [None]:
abbbcc=austen+blake+bryant+burgess+carroll+chesterson
all_words = nltk.FreqDist(w.lower() for w in abbbcc)
word_features = list(all_words)[:2000]

documents=austen1+blake1+bryant1+burgess1+carroll1+chesterson1

In [None]:
random.shuffle(documents)
featuresets = [(document_features(d), c) for (d,c) in documents]
len(featuresets)

676

In [None]:
train_set, test_set = featuresets[:170], featuresets[170:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.958498023715415


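The accuracy is about 96%, slightly higher than in the previous run. Note that the training set was also enlarged from 100 to 170 segments.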

In [None]:
classifier.show_most_informative_features(25)

Most Informative Features
         contains(brown) = True               bu : au     =     53.1 : 1.0
            contains(ll) = True               ca : au     =     49.3 : 1.0
        contains(bright) = True               bl : au     =     45.5 : 1.0
         contains(cloud) = True               bl : au     =     45.5 : 1.0
          contains(coat) = True               bl : au     =     45.5 : 1.0
         contains(crime) = True               bl : au     =     45.5 : 1.0
           contains(cry) = True               bl : au     =     45.5 : 1.0
     contains(dangerous) = True               bl : au     =     45.5 : 1.0
         contains(devil) = True               bl : au     =     45.5 : 1.0
         contains(drink) = True               bl : au     =     45.5 : 1.0
       contains(flowers) = True               bl : au     =     45.5 : 1.0
           contains(fly) = True               bl : au     =     45.5 : 1.0
          contains(gold) = True               bl : au     =     45.5 : 1.0

#### Adding the rest of the authors

In [None]:
edgeworth = nltk.corpus.gutenberg.words('edgeworth-parents.txt')
edgeworth = [word.lower() for word in edgeworth if word.isalpha()]
edgeworth1=[]
for i in range(170):
    edgeworth1.append([edgeworth[i*1000:(i+1)*1000],'ed'])
len(edgeworth)

170737

In [None]:
melville = nltk.corpus.gutenberg.words('melville-moby_dick.txt')
melville = [word.lower() for word in melville if word.isalpha()]
melville1=[]
for i in range(218):
    melville1.append([melville[i*1000:(i+1)*1000],'me'])
len(melville)

218361

In [None]:
shakespeare = nltk.corpus.gutenberg.words('shakespeare-caesar.txt')+nltk.corpus.gutenberg.words('shakespeare-hamlet.txt')+nltk.corpus.gutenberg.words('shakespeare-macbeth.txt')
shakespeare = [word.lower() for word in shakespeare if word.isalpha()]
shakespeare1=[]
for i in range(69):
    shakespeare1.append([shakespeare[i*1000:(i+1)*1000],'sh'])
len(shakespeare)

69340

In [None]:
whitman = nltk.corpus.gutenberg.words('whitman-leaves.txt')
whitman = [word.lower() for word in whitman if word.isalpha()]
whitman1=[]
for i in range(126):
    whitman1.append([whitman[i*1000:(i+1)*1000],'wh'])
len(whitman)

126276

In [None]:
abbbccemsw=austen+blake+bryant+burgess+carroll+chesterson+edgeworth+melville+shakespeare+whitman
all_words = nltk.FreqDist(w.lower() for w in abbbccemsw)
word_features = list(all_words)[:2000]

documents=austen1+blake1+bryant1+burgess1+carroll1+chesterson1+edgeworth1+melville1+shakespeare1+whitman1

In [None]:
random.shuffle(documents)
featuresets = [(document_features(d), c) for (d,c) in documents]
len(featuresets)

1259

In [None]:
train_set, test_set = featuresets[:320], featuresets[320:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.9478168264110756


With all ten authors, the accuracy is about 95%, down only slightly from the previous 96%.
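A single overall accuracy figure can hide weak performance on authors with very few segments, such as Blake with only seven. One way to look closer is a per-author breakdown. The sketch below is illustrative only: `per_label_accuracy` is a hypothetical helper and the label pairs are toy data, not results from the classifier.

```python
from collections import Counter


def per_label_accuracy(pairs):
    """Given (gold_label, predicted_label) pairs, return {label: accuracy for that label}."""
    totals, correct = Counter(), Counter()
    for gold, predicted in pairs:
        totals[gold] += 1
        if gold == predicted:
            correct[gold] += 1
    return {label: correct[label] / totals[label] for label in totals}


# Toy (gold, predicted) pairs standing in for real classifier output
toy_pairs = [('au', 'au'), ('au', 'au'), ('bl', 'au'), ('bl', 'bl'), ('sh', 'sh')]
print(per_label_accuracy(toy_pairs))
```

With the trained model, the pairs could be built as `[(label, classifier.classify(feats)) for (feats, label) in test_set]`.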

In [None]:
classifier.show_most_informative_features(40)

Most Informative Features
          contains(thou) = True               sh : au     =     61.7 : 1.0
        contains(farmer) = True               bu : au     =     58.9 : 1.0
           contains(her) = False              bu : au     =     58.9 : 1.0
        contains(beside) = True               bl : au     =     56.1 : 1.0
         contains(earth) = True               bl : au     =     56.1 : 1.0
        contains(fields) = True               bl : au     =     56.1 : 1.0
          contains(very) = False              bl : au     =     56.1 : 1.0
         contains(mouth) = True               bu : au     =     42.1 : 1.0
          contains(wide) = True               bu : au     =     42.1 : 1.0
          contains(have) = False              sh : au     =     39.3 : 1.0
         contains(river) = True               br : au     =     37.0 : 1.0
       contains(herself) = True               ca : ch     =     36.7 : 1.0
           contains(mrs) = True               au : ch     =     36.2 : 1.0