<a href="https://colab.research.google.com/github/nhwhite212/DealingwithDataSpring2021/blob/master/7-TextMining_NLP/D-Document_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Classifying Text: Supervised Classification

In this lesson, we build our first automatic classifiers. The topic is huge, so we will only scratch the surface, with the goal of getting something working. For those interested in learning more, the Data Mining course next semester is the natural next step.

Classification is the task of choosing the correct class label for a given input. In basic classification tasks, each input is considered in isolation from all other inputs, and the set of labels is defined in advance. Some examples of classification tasks are:

* Deciding whether an email is spam or not.
* Deciding what the topic of a news article is, from a fixed list of topic areas such as "sports," "technology," and "politics."
* Deciding whether a given occurrence of the word "bank" refers to a river bank, a financial institution, the act of tilting to the side, or the act of depositing something in a financial institution.

A classifier is called supervised if it is built based on **training data** containing the correct label for each input. 

<img src="http://www.nltk.org/images/supervised-classification.png" width="50%">

### Document Classification

A common classification task is to classify documents into categories. For this, let's use the Movie Reviews corpus from NLTK:

In [4]:
import nltk
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


True

In [5]:
from nltk.corpus import movie_reviews

categories = movie_reviews.categories()
categories

['neg', 'pos']

Now let's generate the list of files, each with its corresponding category.

In [6]:
labeled_files = []
for category in categories:
    labeled_files += [(fileid, category) for fileid in movie_reviews.fileids(category)]

In [7]:
labeled_files[:5]

[('neg/cv000_29416.txt', 'neg'),
 ('neg/cv001_19502.txt', 'neg'),
 ('neg/cv002_17424.txt', 'neg'),
 ('neg/cv003_12683.txt', 'neg'),
 ('neg/cv004_12641.txt', 'neg')]

In [8]:
labeled_files[-5:]

[('pos/cv995_21821.txt', 'pos'),
 ('pos/cv996_11592.txt', 'pos'),
 ('pos/cv997_5046.txt', 'pos'),
 ('pos/cv998_14111.txt', 'pos'),
 ('pos/cv999_13106.txt', 'pos')]

In [9]:
len(labeled_files)

2000

In [10]:
len([l for l in labeled_files if l[1]=='pos'])

1000

In [11]:
len([l for l in labeled_files if l[1]=='neg'])

1000

### Featurizing a Document

Now let's create the features. We will create **one feature per word**, with a **binary value** indicating whether the document contains the word or not. To limit the number of features that the classifier needs to process, we begin by constructing a list of the 2000 most frequent words in the overall corpus, after dropping stopwords and non-alphabetic tokens.
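Before running this on the full corpus, the idea can be sketched on a toy example. This is only an illustration under assumptions: the three tiny "documents", the vocabulary size of 3, and the helper name `binary_features` are all made up here and are not part of the lesson's code.

```python
from collections import Counter

# Toy corpus (hypothetical data): each document is a list of lowercase tokens.
toy_docs = {
    "doc1": ["great", "film", "great", "cast"],
    "doc2": ["dull", "film", "weak", "plot"],
    "doc3": ["great", "plot", "film"],
}

# Count every token across the corpus and keep the 3 most frequent as the vocabulary.
counts = Counter(token for tokens in toy_docs.values() for token in tokens)
vocabulary = [word for word, _ in counts.most_common(3)]

def binary_features(tokens, vocabulary):
    """Map each vocabulary word to True/False depending on whether the document contains it."""
    present = set(tokens)
    return {word: word in present for word in vocabulary}

print(vocabulary)
print(binary_features(toy_docs["doc2"], vocabulary))
```

Note that "great" appears twice in `doc1` but would still produce a single `True`: binary features record presence, not counts.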

In [14]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import random

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [15]:
# Extract all the words from all the reviews
all_words = []
for (fileid, category) in labeled_files:
    # Get the words of the document
    all_words.extend(movie_reviews.words(fileid))

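# Keep only words that are not stopwords (a set makes membership tests fast,
# and a new name avoids shadowing the imported `stopwords` module)
stop_words = set(nltk.corpus.stopwords.words('english'))
# Lowercase before the stopword test so that capitalized tokens are filtered too
filtered = [w.lower() for w in all_words if w.isalpha() and w.lower() not in stop_words]

# Compute the word frequency after removing stopwords
fdist = nltk.FreqDist(filtered)
# Keep the 2000 most frequent words as our feature vocabulary
features = {w for (w, f) in fdist.most_common(2000)}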

In [16]:
def document_features(fileid):
    # Get the distinct words of the document
    document_words = set(movie_reviews.words(fileid))
    # One boolean feature per vocabulary word: True when the document contains the word
    return {word: (word in document_words) for word in features}

Let's see how long it takes to featurize a single document:

In [17]:
%timeit  document_features("pos/cv995_21821.txt")

1000 loops, best of 5: 1.18 ms per loop


And to visualize what a "featurized" document looks like:

In [18]:
import pandas as pd
testdf = pd.DataFrame(
    [document_features("pos/cv995_21821.txt"),
     document_features("neg/cv003_12683.txt")]
)
testdf

Unnamed: 0,range,woody,example,job,roll,travolta,nights,prince,bar,particularly,totally,jane,stars,theme,brothers,comedy,wars,image,realistic,effective,suspense,visuals,armageddon,al,amy,existence,stock,big,compelling,length,interest,study,lawrence,seem,brain,sign,capable,harry,provided,complete,...,sorry,bob,filmmaking,morning,behind,feature,intense,flat,twist,etc,themes,members,shame,radio,people,based,solid,pace,cut,find,casting,hit,inspired,rose,waste,dude,exist,technical,miss,pull,dollars,door,constantly,special,right,film,written,plenty,leave,shoot
0,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False,True,True,True,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False,False,False,False,False,False,False,False,False,True,...,False,False,False,True,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False


In [19]:
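# Featurize every review, keeping its label
labeled_documents = [(document_features(fileid), category) for (fileid, category) in labeled_files]
# Shuffle so that the split below mixes positive and negative reviews
random.shuffle(labeled_documents)
# Hold out the first 100 documents for testing and train on the remaining 1900
train_set, test_set = labeled_documents[100:], labeled_documents[:100]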
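Because the shuffle is random, every run of the notebook produces a different split and therefore a slightly different accuracy. If you want repeatable numbers, one option is to shuffle with a fixed seed. The sketch below is a suggestion, not what the lesson's code does: the helper `split_train_test`, the seed value, and the stand-in data are assumptions.

```python
import random

def split_train_test(items, test_size, seed=0):
    """Shuffle a copy of `items` with a fixed seed and hold out `test_size` items for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]

# Hypothetical stand-in for labeled_documents: 2000 (features, label) pairs.
examples = [({"id": i}, "pos" if i % 2 else "neg") for i in range(2000)]
train, test = split_train_test(examples, test_size=100, seed=42)
print(len(train), len(test))  # 1900 100
```

Calling the function again with the same seed returns the same split, which makes it easier to compare two versions of the features fairly.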

In [20]:
classifier = nltk.NaiveBayesClassifier.train(train_set)
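To get a feel for what a Naive Bayes classifier does during training, here is a minimal from-scratch sketch on toy data. It is not NLTK's implementation: the function names `train_naive_bayes` and `classify`, the toy reviews, and the smoothing constant of 0.5 are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_featuresets, smoothing=0.5):
    """Estimate P(label) and smoothed P(feature=True | label) from (features, label) pairs."""
    label_counts = Counter(label for _, label in labeled_featuresets)
    true_counts = defaultdict(Counter)  # true_counts[label][feature] = docs where feature is True
    feature_names = set()
    for features, label in labeled_featuresets:
        feature_names.update(features)
        for name, value in features.items():
            if value:
                true_counts[label][name] += 1
    total = sum(label_counts.values())
    model = {}
    for label, n_docs in label_counts.items():
        prior = n_docs / total
        # Each feature is binary, so smoothing is added for both possible values.
        p_true = {
            name: (true_counts[label][name] + smoothing) / (n_docs + 2 * smoothing)
            for name in feature_names
        }
        model[label] = (prior, p_true)
    return model

def classify(model, features):
    """Return the label with the highest log-probability score."""
    scores = {}
    for label, (prior, p_true) in model.items():
        score = math.log(prior)
        for name, p in p_true.items():
            score += math.log(p if features.get(name, False) else 1.0 - p)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training data in the same (features, label) shape as train_set.
toy_train = [
    ({"great": True, "awful": False}, "pos"),
    ({"great": True, "awful": False}, "pos"),
    ({"great": False, "awful": True}, "neg"),
    ({"great": False, "awful": True}, "neg"),
]
model = train_naive_bayes(toy_train)
print(classify(model, {"great": True, "awful": False}))
print(classify(model, {"great": False, "awful": True}))
```

The "naive" part is the loop in `classify`: each feature contributes to the score independently of all the others.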

In [21]:
print(nltk.classify.accuracy(classifier, test_set))

0.75


In [22]:
classifier.show_most_informative_features(50)

Most Informative Features
             outstanding = True              pos : neg    =     13.2 : 1.0
                   mulan = True              pos : neg    =      8.5 : 1.0
                  seagal = True              neg : pos    =      8.1 : 1.0
             wonderfully = True              pos : neg    =      7.3 : 1.0
              ridiculous = True              neg : pos    =      6.3 : 1.0
                   damon = True              pos : neg    =      6.2 : 1.0
                   flynt = True              pos : neg    =      5.8 : 1.0
                     era = True              pos : neg    =      5.6 : 1.0
                  wasted = True              neg : pos    =      5.6 : 1.0
                    lame = True              neg : pos    =      5.2 : 1.0
                   awful = True              neg : pos    =      5.2 : 1.0
                   waste = True              neg : pos    =      4.9 : 1.0
                  poorly = True              neg : pos    =      4.9 : 1.0
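Each row above can be read as a ratio of conditional probabilities: how much more likely it is to see that feature value in reviews with one label than in reviews with the other. A rough sketch of the calculation, assuming additive smoothing with a constant of 0.5 and made-up counts (the helper `likelihood_ratio` is hypothetical, and NLTK's exact estimator may differ):

```python
def likelihood_ratio(true_in_pos, n_pos, true_in_neg, n_neg, smoothing=0.5):
    """Ratio P(word present | pos) / P(word present | neg) with additive smoothing."""
    p_pos = (true_in_pos + smoothing) / (n_pos + 2 * smoothing)
    p_neg = (true_in_neg + smoothing) / (n_neg + 2 * smoothing)
    return p_pos / p_neg

# Hypothetical counts: a word found in 60 of 950 positive and 5 of 950 negative training reviews.
ratio = likelihood_ratio(60, 950, 5, 950)
print(f"pos : neg = {ratio:.1f} : 1.0")  # pos : neg = 11.0 : 1.0
```

Smoothing matters for rare words: without it, a word that never appears in negative reviews would produce a division by zero.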

### Exercise

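Try to come up with ways to improve the classifier that we discussed above. One option, shown in the cells that follow, is to treat a few named entities (such as "mulan" or "seagal") as stopwords, since they identify specific movies and actors rather than expressing sentiment.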

In [23]:
# Extract all the words from all the reviews
all_words = []
for (fileid, category) in labeled_files:
    # Get the words of the document
    all_words.extend(movie_reviews.words(fileid))

# Keep only words that are not stopwords (a set makes membership tests fast)
stop_words = set(nltk.corpus.stopwords.words('english'))
# Extend with a few keywords that are named entities
stop_words.update(["mulan", "seagal", "damon", "ripley", "jedi", "hanks"])
# Lowercase before the stopword test so that capitalized tokens are filtered too
filtered = [w.lower() for w in all_words if w.isalpha() and w.lower() not in stop_words]

# Compute the word frequency after removing stopwords
fdist = nltk.FreqDist(filtered)
# Keep the 2000 most frequent words as our feature vocabulary
features = {w for (w, f) in fdist.most_common(2000)}

In [24]:
def document_features(fileid):
    # Get the distinct words of the document
    document_words = set(movie_reviews.words(fileid))
    # One boolean feature per vocabulary word: True when the document contains the word
    return {word: (word in document_words) for word in features}

In [25]:
labeled_documents = [(document_features(fileid), category) for (fileid, category) in labeled_files]
random.shuffle(labeled_documents)
train_set, test_set = labeled_documents[100:], labeled_documents[:100]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print("Accuracy:", nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(50)

Accuracy: 0.81
Most Informative Features
             outstanding = True              pos : neg    =     11.2 : 1.0
             wonderfully = True              pos : neg    =      7.4 : 1.0
                  poorly = True              neg : pos    =      6.1 : 1.0
                   flynt = True              pos : neg    =      5.7 : 1.0
                    lame = True              neg : pos    =      5.5 : 1.0
                   awful = True              neg : pos    =      5.1 : 1.0
                   waste = True              neg : pos    =      5.0 : 1.0
              ridiculous = True              neg : pos    =      5.0 : 1.0
                  wasted = True              neg : pos    =      4.9 : 1.0
                   worst = True              neg : pos    =      4.4 : 1.0
                     era = True              pos : neg    =      4.3 : 1.0
                   bland = True              neg : pos    =      4.2 : 1.0
               laughable = True              neg : pos    =
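With only 100 test documents, and a fresh random shuffle before each run, a jump from 0.75 to 0.81 is suggestive but not conclusive. As a rough check, assuming the test documents behave like independent draws, we can put an approximate 95% interval around each accuracy using the normal approximation to the binomial (the helper `accuracy_interval` is hypothetical):

```python
import math

def accuracy_interval(accuracy, n_test, z=1.96):
    """Approximate 95% interval for an accuracy measured on n_test documents."""
    stderr = math.sqrt(accuracy * (1.0 - accuracy) / n_test)
    return accuracy - z * stderr, accuracy + z * stderr

for acc in (0.75, 0.81):
    low, high = accuracy_interval(acc, n_test=100)
    print(f"accuracy {acc:.2f}: roughly {low:.2f} to {high:.2f}")
```

The two intervals overlap, so part of the difference may simply be the luck of the split. Evaluating both versions on the same, larger test set would give a more reliable comparison.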

### You should be able to apply similar logic to many different types of text...