
# Week 10 - Document Classification

**Trishita Nath**

## Overview

It can be useful to be able to classify new "test" documents using already classified "training" documents.  A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.  Here is one example of such data:  [UCI Machine Learning Repository: Spambase Data Set](http://archive.ics.uci.edu/ml/datasets/Spambase)

For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).

For more adventurous students, you are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.

This assignment is due at the end of the day on Sunday 4/17.



In [2]:
import nltk
import random
random.seed(250)
import pandas as pd
pd.set_option('display.max_rows', 100)

nltk.download('gutenberg')
nltk.corpus.gutenberg.fileids() # Available texts in the Gutenberg corpus

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\Happy\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\gutenberg.zip.


['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

### Comparison of the texts for Blake and Austen

#### Creating Texts



Jane Austen has three books in the corpus while Blake has one. I will combine all three of Austen's works into one text. Each author has their own style. I will drop non-alphabetic tokens (punctuation and numbers) and convert the remaining words to lowercase; duplicate words are eliminated later, when each segment is turned into a set during feature extraction. I will then create a list of text segments, each with 1000 words.


In [3]:
austen_text = nltk.corpus.gutenberg.words('austen-emma.txt')+nltk.corpus.gutenberg.words('austen-persuasion.txt')+nltk.corpus.gutenberg.words('austen-sense.txt')
austen_text = [word.lower() for word in austen_text if word.isalpha()]
austen_segmented=[]
for i in range(366):
    austen_segmented.append([austen_text[i*1000:(i+1)*1000],'au'])
len(austen_segmented)

366

We have a list of 366 segments of 1000 words each by Jane Austen.
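The segment count of 366 is hardcoded in the loop above. A minimal sketch of a helper that derives the number of full segments from the text length follows, assuming the intent is to keep every full segment. `segment_words` is a hypothetical name that does not appear in the notebook, and the toy word list stands in for `austen_text`.

```python
def segment_words(words, size, label):
    """Split `words` into consecutive full segments of `size` words.

    Each segment is paired with `label`, in the same [words, label] shape
    used by the notebook. Trailing words that do not fill a whole segment
    are dropped.
    """
    n_segments = len(words) // size
    return [[words[i * size:(i + 1) * size], label] for i in range(n_segments)]


# Toy stand-in for austen_text: 10 words cut into segments of 3.
toy_words = ["w{}".format(i) for i in range(10)]
toy_segments = segment_words(toy_words, 3, "au")
print(len(toy_segments))  # 3 full segments; the 10th word is dropped
```

The same helper would cover the 990-word Blake segments by passing a different `size` and label.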

I will do the same for the text by Blake, but this time I will create text segments of 990 words each.

In [4]:
blake_text = nltk.corpus.gutenberg.words('blake-poems.txt')
blake_text = [word.lower() for word in blake_text if word.isalpha()]
blake_segmented=[]
for i in range(7):
    blake_segmented.append([blake_text[i*990:(i+1)*990],'bl'])
len(blake_segmented)

7



We have a list of seven 990-word segments of text by William Blake.

### Feature extraction

I will combine the two original word lists into one and take the 2000 most frequent words as the feature list for my classifier.

In [5]:


austen_blake_combined = austen_text + blake_text
all_words = nltk.FreqDist(w.lower() for w in austen_blake_combined)
word_features = list(all_words)[:2000] 

word_list = []
for i in range(0, 2000, 200):
    df = pd.DataFrame(word_features[i:(i+200)])
    df.columns=['200 words']
    word_list.append(df)

pd.concat(word_list, axis=1)



| | 200 words | 200 words.1 | 200 words.2 | 200 words.3 | 200 words.4 | 200 words.5 | 200 words.6 | 200 words.7 | 200 words.8 | 200 words.9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | the | feelings | true | clay | purpose | smiled | estate | companions | sensibility | morton |
| 1 | to | found | agreeable | benwick | assured | thoroughly | run | suit | heartily | chiefly |
| 2 | and | few | taken | temper | extraordinary | enscombe | totally | fast | cruel | selfishness |
| 3 | of | heart | state | isabella | write | desirable | shewed | pressed | relation | turns |
| 4 | a | does | conversation | curiosity | ease | seat | line | dark | buildings | animated |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 195 | till | themselves | letters | staying | suffer | eight | held | performance | distressed | endeavouring |
| 196 | something | within | spite | arrived | human | possibly | probable | furniture | explain | idle |
| 197 | dashwood | walk | sweet | suffered | hall | secure | add | earlier | filled | pursuits |
| 198 | yet | already | favour | consciousness | wallis | hearted | arise | musical | cloud | anger |
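The slice `list(all_words)[:2000]` depends on `FreqDist` yielding words from most to least frequent, which the output above reflects. A minimal sketch that makes the ordering explicit with `most_common` is below. It uses the standard library's `collections.Counter` (which `nltk.FreqDist` subclasses in NLTK 3) and a toy word list rather than the real corpus, so `toy_words`, `top_n`, and `word_features_toy` are illustrative names only.

```python
from collections import Counter

# Toy stand-in for the combined Austen + Blake word list.
toy_words = ["the", "lamb", "the", "green", "the", "lamb"]
counts = Counter(w.lower() for w in toy_words)

# Keep the top_n most frequent words as features (2000 in the notebook).
top_n = 2
word_features_toy = [word for word, _ in counts.most_common(top_n)]
print(word_features_toy)  # ['the', 'lamb']
```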


In [6]:
# Feature generator function that indicates whether or not each word is present in the text as a feature
def document_features(document): 
    document_words = set(document) 
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

In [7]:
# Testing on Austen's list
features = document_features(austen_text)
list(features.items())[:10]



[('contains(the)', True),
 ('contains(to)', True),
 ('contains(and)', True),
 ('contains(of)', True),
 ('contains(a)', True),
 ('contains(i)', True),
 ('contains(her)', True),
 ('contains(in)', True),
 ('contains(was)', True),
 ('contains(it)', True)]
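Because the first ten features are all extremely common words, the check above returns `True` throughout. A small sketch on toy data shows both outcomes. This variant takes the feature list as a parameter instead of reading the global `word_features`, and the word lists are made up for illustration.

```python
def document_features_with(document, word_features):
    """Map each feature word to True/False depending on its presence in `document`."""
    document_words = set(document)  # set lookup also removes duplicate words
    return {"contains({})".format(w): (w in document_words) for w in word_features}


toy_features = ["the", "thou", "her"]
toy_segment = ["the", "lamb", "thou", "the"]
print(document_features_with(toy_segment, toy_features))
# {'contains(the)': True, 'contains(thou)': True, 'contains(her)': False}
```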

### Train and Test Dataset

I will create a list of all text segments from both Austen and Blake, then shuffle them to create the corpus that I will use to train and test the classifier model.

In [8]:
documents=austen_segmented + blake_segmented

random.shuffle(documents)
feature_set = [(document_features(d), c) for (d,c) in documents]
len(feature_set)



373

#### Splitting the dataset into test and train datasets


In [9]:

train_dataset, test_dataset = feature_set[:100], feature_set[100:]
classifier = nltk.NaiveBayesClassifier.train(train_dataset)

# Accuracy
print(nltk.classify.accuracy(classifier, test_dataset)) 

0.9926739926739927
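With 366 Austen segments and only 7 Blake segments, the classes are heavily imbalanced, so the accuracy is easier to interpret next to a majority-class baseline. The sketch below computes that baseline from the segment counts reported above. The comparison is an addition for context, not part of the original analysis.

```python
# Segment counts taken from the outputs of In [3] and In [4].
segment_counts = {"au": 366, "bl": 7}

total = sum(segment_counts.values())
majority_label = max(segment_counts, key=segment_counts.get)

# Accuracy of a classifier that always predicts the majority author.
baseline = segment_counts[majority_label] / total
print(majority_label, round(baseline, 4))  # au 0.9812
```

Always predicting Austen would already score about 98% over the full document set; the exact baseline on the test split depends on how the shuffle distributes the Blake segments.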


In [10]:
classifier.show_most_informative_features(10)

Most Informative Features
        contains(bright) = True               bl : au     =     55.0 : 1.0
         contains(green) = True               bl : au     =     55.0 : 1.0
          contains(thou) = True               bl : au     =     55.0 : 1.0
         contains(angel) = True               bl : au     =     33.0 : 1.0
          contains(song) = True               bl : au     =     33.0 : 1.0
         contains(which) = False              bl : au     =     33.0 : 1.0
           contains(had) = False              bl : au     =     33.0 : 1.0
           contains(her) = False              bl : au     =     33.0 : 1.0
        contains(infant) = True               bl : au     =     33.0 : 1.0
          contains(bore) = True               bl : au     =     23.6 : 1.0
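Each ratio compares how likely a feature value is under one author versus the other. As a hedged worked example, the sketch below reproduces 55.0, 33.0, and 23.6 using expected-likelihood (add-0.5) smoothing, which is the default estimator of `nltk.NaiveBayesClassifier.train`. The training counts (2 Blake and 98 Austen segments, with the feature value observed in both Blake segments) are assumptions inferred from the printed ratios, not values reported by the notebook.

```python
def ele_prob(count, total, bins=2, gamma=0.5):
    """Expected-likelihood estimate: (count + 0.5) / (total + bins * 0.5)."""
    return (count + gamma) / (total + bins * gamma)


# Assumed composition of the 100 training segments (inferred, not reported).
n_blake, n_austen = 2, 98

# Feature value seen in both Blake training segments.
p_blake = ele_prob(2, n_blake)

# Same feature value seen in 1, 2, or 3 of the Austen training segments.
for austen_hits in (1, 2, 3):
    ratio = p_blake / ele_prob(austen_hits, n_austen)
    print(austen_hits, round(ratio, 1))
```

Under these assumptions the three printed ratios are 55.0, 33.0, and 23.6, which is why the table contains only a handful of distinct values: with two Blake training segments, few count combinations are possible.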


I will add two more authors to see the effect on the accuracy of the model.

#### Burgess Dataset

In [11]:
burgess_text = nltk.corpus.gutenberg.words('burgess-busterbrown.txt')
burgess_text = [word.lower() for word in burgess_text if word.isalpha()]
burgess_segmented=[]
for i in range(16):
    burgess_segmented.append([burgess_text[i*1000:(i+1)*1000],'bu'])
len(burgess_text)

16327

In [12]:
austen_blake_burgess_combined= austen_text + blake_text + burgess_text
all_words = nltk.FreqDist(w.lower() for w in austen_blake_burgess_combined)
word_features = list(all_words)[:2000] 

documents=austen_segmented + blake_segmented + burgess_segmented

random.shuffle(documents)
feature_sets = [(document_features(d), c) for (d,c) in documents]
len(feature_sets)

389

In [13]:
train_dataset, test_dataset = feature_sets[:100], feature_sets[100:]
classifier = nltk.NaiveBayesClassifier.train(train_dataset)
print(nltk.classify.accuracy(classifier, test_dataset))

0.9965397923875432


In [14]:
# Top 10 most informative features of the trained model

classifier.show_most_informative_features(10)



Most Informative Features
         contains(angel) = True               bl : au     =     50.0 : 1.0
          contains(grey) = True               bl : au     =     50.0 : 1.0
           contains(sun) = True               bl : au     =     50.0 : 1.0
          contains(thou) = True               bl : au     =     50.0 : 1.0
         contains(brown) = True               bu : au     =     45.0 : 1.0
        contains(farmer) = True               bu : au     =     45.0 : 1.0
         contains(brook) = True               bu : au     =     39.0 : 1.0
             contains(m) = True               bu : au     =     33.0 : 1.0
        contains(abroad) = True               bl : au     =     30.0 : 1.0
        contains(breath) = True               bl : au     =     30.0 : 1.0


#### Chesterton Dataset

In [15]:
chesterson_text = nltk.corpus.gutenberg.words('chesterton-ball.txt')+nltk.corpus.gutenberg.words('chesterton-brown.txt')+nltk.corpus.gutenberg.words('chesterton-thursday.txt')
chesterson_text = [word.lower() for word in chesterson_text if word.isalpha()]
chesterson_segmented=[]
for i in range(214):
    chesterson_segmented.append([chesterson_text[i*1000:(i+1)*1000],'ch'])
len(chesterson_text)



214692

In [16]:
austen_blake_burgess_chesterson_combined = austen_text + blake_text + burgess_text + chesterson_text
all_words = nltk.FreqDist(w.lower() for w in austen_blake_burgess_chesterson_combined)
word_features = list(all_words)[:2000] 

documents=austen_segmented + blake_segmented + burgess_segmented + chesterson_segmented

random.shuffle(documents)
feature_sets = [(document_features(d), c) for (d,c) in documents]
len(feature_sets)

603

In [17]:
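# Accuracy
# Note: `classifier` and `test_dataset` are still the three-author objects from In [13];
# the four-author `feature_sets` from In [16] are not used here, so this repeats the earlier score.
print(nltk.classify.accuracy(classifier, test_dataset))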

0.9965397923875432


In [18]:
classifier.show_most_informative_features(10)

Most Informative Features
         contains(angel) = True               bl : au     =     50.0 : 1.0
          contains(grey) = True               bl : au     =     50.0 : 1.0
           contains(sun) = True               bl : au     =     50.0 : 1.0
          contains(thou) = True               bl : au     =     50.0 : 1.0
         contains(brown) = True               bu : au     =     45.0 : 1.0
        contains(farmer) = True               bu : au     =     45.0 : 1.0
         contains(brook) = True               bu : au     =     39.0 : 1.0
             contains(m) = True               bu : au     =     33.0 : 1.0
        contains(abroad) = True               bl : au     =     30.0 : 1.0
        contains(breath) = True               bl : au     =     30.0 : 1.0
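The accuracy and feature list above are identical to the three-author run, which suggests that In [17] evaluated the existing `classifier` on the existing `test_dataset` rather than a model trained on the four-author `feature_sets`. A minimal sketch of the missing step is below, assuming the same first-100 training split as before. `split_feature_sets` and `retrain_and_score` are hypothetical helper names, and no four-author results are shown because the notebook does not report them.

```python
def split_feature_sets(feature_sets, n_train=100):
    """Use the first n_train items for training and the rest for testing."""
    return feature_sets[:n_train], feature_sets[n_train:]


def retrain_and_score(feature_sets, n_train=100):
    """Train a fresh Naive Bayes classifier and return it with its test accuracy."""
    import nltk  # imported here so the split helper can be used without NLTK

    train_dataset, test_dataset = split_feature_sets(feature_sets, n_train)
    classifier = nltk.NaiveBayesClassifier.train(train_dataset)
    accuracy = nltk.classify.accuracy(classifier, test_dataset)
    return classifier, accuracy


# Toy check of the split only; real use passes the notebook's feature_sets.
toy_train, toy_test = split_feature_sets(list(range(10)), n_train=3)
print(len(toy_train), len(toy_test))  # 3 7
```

In the notebook this would be called as `classifier, accuracy = retrain_and_score(feature_sets)` right after In [16], followed by `classifier.show_most_informative_features(10)`.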
