# Learning to Classify Text

# HW4: Classification 
+ **Groups**: This one has a lot of exposition. Depending on how long you have to meet with your group this week, you might decide to do parts of it independently and compare results rather than doing it all together. It's up to you! 

Modified by Mia Jacobsen

## Part 1

In [None]:
%pip install nltk
# pprint and re are part of the Python standard library, so they do not need to be installed with pip

In [None]:
import nltk, re, pprint, random # nltk, regular expression, pretty print, random module
from nltk import word_tokenize


Detecting patterns is a central part of Natural Language Processing. Words ending in -ed tend to be past tense verbs. Frequent use of will is indicative of news text. These observable patterns — word structure and word frequency — happen to correlate with particular aspects of meaning, such as tense and topic. But how did we know where to start looking, which aspects of form to associate with which aspects of meaning?

The goal of this chapter is to answer the following questions:

1. How can we identify particular features of language data that are salient for classifying it?
2. How can we construct models of language that can be used to perform language processing tasks automatically?
3. What can we learn about language from these models?

Along the way we will study some important machine learning techniques, including decision trees, naive Bayes classifiers, and maximum entropy classifiers. We will gloss over the mathematical and statistical underpinnings of these techniques, focusing instead on how and when to use them. Before looking at these methods, we first need to appreciate the broad scope of this topic.

## 6.1 Supervised Classification

Classification is the task of choosing the correct class label for a given input. In basic classification tasks, each input is considered in isolation from all other inputs, and the set of labels is defined in advance. Some examples of classification tasks are:

- Deciding whether an email is spam or not.
- Deciding what the topic of a news article is, from a fixed list of topic areas such as "sports," "technology," and "politics."
- Deciding whether a given occurrence of the word bank is used to refer to a river bank, a financial institution, the act of tilting to the side, or the act of depositing something in a financial institution.

The basic classification task has a number of interesting variants. For example, in multi-label classification, each instance may be assigned multiple labels; in open-class classification, the set of labels is not defined in advance; and in sequence classification, a list of inputs is jointly classified.

A classifier is called supervised if it is built based on training corpora containing the correct label for each input. The framework used by supervised classification is shown in Figure 6.1.

Figure 6.1: Supervised Classification. (a) During training, a feature extractor is used to convert each input value to a feature set. These feature sets, which capture the basic information about each input that should be used to classify it, are discussed in the next section. Pairs of feature sets and labels are fed into the machine learning algorithm to generate a model. (b) During prediction, the same feature extractor is used to convert unseen inputs to feature sets. These feature sets are then fed into the model, which generates predicted labels.
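The two phases in Figure 6.1 can be sketched in plain Python, without NLTK. This is only a toy illustration: `extract_features`, `train`, `predict` and the five training names are made up for this sketch, and the "model" is just a lookup table of label counts per final letter, not a real learning algorithm.

```python
from collections import Counter, defaultdict

def extract_features(word):
    # Hypothetical feature extractor: a single feature, the final letter.
    return {"last_letter": word[-1].lower()}

def train(labeled_inputs):
    # (a) Training: convert each input to a feature set, then count which
    # labels occur with each feature value.
    counts = defaultdict(Counter)
    for text, label in labeled_inputs:
        features = extract_features(text)
        counts[features["last_letter"]][label] += 1
    overall = Counter(label for _, label in labeled_inputs)
    return counts, overall

def predict(model, text):
    # (b) Prediction: the same extractor is applied to the unseen input.
    counts, overall = model
    features = extract_features(text)
    seen = counts.get(features["last_letter"])
    # Fall back to the most common training label for unseen feature values.
    return (seen or overall).most_common(1)[0][0]

toy_training_data = [("Anna", "female"), ("Maria", "female"),
                     ("Mark", "male"), ("Erik", "male"), ("Julia", "female")]
model = train(toy_training_data)
print(predict(model, "Sofia"))     # final letter 'a' -> female
print(predict(model, "Frederik"))  # final letter 'k' -> male
```

The important design point is that training and prediction share one feature extractor; if they used different extractors, the model would be asked about features it has never seen.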

In the rest of this section, we will look at how classifiers can be employed to solve a wide variety of tasks. Our discussion is not intended to be comprehensive, but to give a representative sample of tasks that can be performed with the help of text classifiers.

**<font color=red>Do NOT just blindly run all this code!!! Try your best to explore and understand what's going on. Also: don't shy away from isolating objects in a new cell to probe them and see what they are. You're building lots of novel objects, and without poking them you won't understand what's going on.</font>**

### Gendered name Identification

Stereotypical male and female names have some distinctive characteristics. Names ending in a, e and i are likely to be female, while names ending in k, o, r, s and t are likely to be male. Let's build a classifier to model these differences more precisely.

The first step in creating a classifier is deciding what features of the input are relevant, and how to encode those features. For this example, we'll start by just looking at the final letter of a given name. The following feature extractor function builds a dictionary containing relevant information about a given name:

In [None]:
def gender_features(word):
    return {'last_letter': word[-1]}
# Our only feature is the last letter of the name

In [None]:
gender_features('Shrek')

In [None]:
gender_features('Bob')

The returned dictionary, known as a feature set, maps from features' names to their values. Feature names are case-sensitive strings that typically provide a short human-readable description of the feature. Feature values are values with simple types, such as booleans, numbers, and strings.

### Note

Most classification methods require that features be encoded using simple value types, such as booleans, numbers, and strings. But note that just because a feature has a simple type, does not necessarily mean that the feature's value is simple to express or compute; indeed, it is even possible to use very complex and informative values, such as the output of a second supervised classifier, as features.

Now that we've defined a feature extractor, we need to prepare a list of examples and corresponding class labels.

In [None]:
nltk.download('names') #download a corpus containing a bunch of names


In [None]:
names = ([(name, 'male') for name in nltk.corpus.names.words('male.txt')] +
         [(name, 'female') for name in nltk.corpus.names.words('female.txt')])
random.shuffle(names)
len(names) # 7944 names

<font color=red>Q: What might be the purpose of shuffling the dataset?</font>

Next, we use the feature extractor to process the names data, and divide the resulting list of feature sets into a training set and a test set. The training set is used to train a new "naive Bayes" classifier.

Remember that it can be a good idea to inspect the variables and other objects we create; it will make it easier to understand what's going on! Play around with it :) 

In [None]:
featuresets = [(gender_features(n), g) for (n,g) in names]
train_set, test_set = featuresets[500:], featuresets[:500] # test set is the first 500 records
# train set is everything after the first 500 records: 7,944 - 500 = 7,444 names. We train on far
# more data than we test
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [None]:
# a chunk for you to play around with 

We will learn more about the naive Bayes classifier later. For now, let's just test it out on some names that did not appear in its training data:

In [None]:
classifier.classify(gender_features('Neo'))

In [None]:
classifier.classify(gender_features('Trinity'))

In [None]:
classifier.classify(gender_features('Bob'))

In [None]:
classifier.classify(gender_features('Deandrea'))

<font color=red> Now try your own names. Does it get it right? </font>

In [None]:
classifier.classify(gender_features('YOUR NAME'))

Observe that these character names from The Matrix are correctly classified. Although this science fiction movie is set in 2199, it still conforms with our expectations about names and genders. We can systematically evaluate the classifier on a much larger quantity of unseen data:

In [None]:
nltk.classify.accuracy(classifier, test_set)
# accuracy is the fraction of the test set that the classifier labels correctly:
# accuracy = (number of correct predictions) / (number of items in the test set)
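To make the accuracy calculation concrete, here is a minimal plain-Python sketch. The helper `manual_accuracy`, the stand-in `toy_classify` function and the four-item test set are all hypothetical; with the real objects, passing `classifier.classify` and `test_set` should give the same number as `nltk.classify.accuracy`.

```python
def manual_accuracy(classify, labeled_featuresets):
    # Fraction of items whose predicted label equals the gold label.
    correct = sum(1 for features, gold in labeled_featuresets
                  if classify(features) == gold)
    return correct / len(labeled_featuresets)

def toy_classify(features):
    # Hypothetical stand-in for classifier.classify
    return "female" if features["last_letter"] in "aei" else "male"

toy_test_set = [({"last_letter": "a"}, "female"),
                ({"last_letter": "k"}, "male"),
                ({"last_letter": "e"}, "male"),
                ({"last_letter": "n"}, "female")]

print(manual_accuracy(toy_classify, toy_test_set))  # 2 of 4 correct -> 0.5
```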

<font color=red>Q: What do you think the purpose of having a separate testing dataset is? </font>

Finally, we can examine the classifier to determine which features it found most effective for distinguishing the names' genders:

In [None]:
classifier.show_most_informative_features(5)

This listing shows that the names in the training set that end in "a" are female 34 times more often than they are male, but names that end in "k" are male 44 times more often than they are female (the exact numbers depend on the random shuffle, so your output may differ). These ratios are known as likelihood ratios, and can be useful for comparing different feature-outcome relationships.
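A likelihood ratio compares how probable a feature value is under one label with how probable it is under another. The sketch below computes one by hand from counts. The counts are invented for illustration (they are not taken from the Names Corpus), `likelihood_ratio` is a hypothetical helper, and the estimate is unsmoothed, whereas NLTK smooths its probability estimates, so the numbers it prints will differ somewhat.

```python
def likelihood_ratio(counts, label_totals, label_a, label_b):
    # P(feature value | label_a) / P(feature value | label_b), unsmoothed.
    p_a = counts[label_a] / label_totals[label_a]
    p_b = counts[label_b] / label_totals[label_b]
    return p_a / p_b

# Hypothetical counts: how many training names of each gender end in "a",
# and how many training names there are of each gender overall.
ends_in_a = {"female": 1700, "male": 30}
label_totals = {"female": 5000, "male": 3000}

ratio = likelihood_ratio(ends_in_a, label_totals, "female", "male")
print(f"female : male = {ratio:.1f} : 1.0")
```

Note that the ratio is between conditional probabilities, not raw counts, so it already accounts for there being more female than male names in the data.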

<font color=red> Your Turn: Modify the gender_features() function to provide the classifier with features encoding the length of the name, its first letter, and any other features that seem like they might be informative. Retrain the classifier with these new features, and test its accuracy.

In [None]:
def gender_features2(word):
    return {}

In [None]:
gender_features2('Bob')

In [None]:
gender_features2('Deandrea')

When working with large corpora, constructing a single list that contains the features of every instance can use up a large amount of memory. In these cases, use the function nltk.classify.apply_features, which returns an object that acts like a list but does not store all the feature sets in memory: </font>

In [None]:
from nltk.classify import apply_features

random.shuffle(names)

train_set = apply_features(gender_features2, names[500:])
test_set = apply_features(gender_features2, names[:500])

classifier = nltk.NaiveBayesClassifier.train(train_set)
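To get a feel for what "acts like a list but does not store all the feature sets in memory" can mean, here is a small plain-Python sketch. The class `LazyFeatureList` and the helper `last_letter` are hypothetical and written only for illustration; this is not NLTK's implementation.

```python
class LazyFeatureList:
    # Hypothetical illustration only; not how NLTK implements apply_features.
    def __init__(self, feature_func, labeled_tokens):
        self._func = feature_func
        self._items = labeled_tokens

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        # Features are computed on demand, each time an item is requested,
        # instead of being stored up front.
        token, label = self._items[index]
        return (self._func(token), label)

def last_letter(word):
    return {"last_letter": word[-1]}

lazy = LazyFeatureList(last_letter, [("Neo", "male"), ("Trinity", "female")])
print(len(lazy))   # 2
print(lazy[1])     # ({'last_letter': 'y'}, 'female')
for features, label in lazy:   # iteration works through __getitem__
    print(features, label)
```

The trade-off is time for memory: each feature set is recomputed whenever it is accessed, but only the original names are kept in memory.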

In [None]:
classifier.classify(gender_features2('Neo'))

In [None]:
classifier.classify(gender_features2('Trinity'))

<font color=red> Q: Is this new classifier better or worse than our previous classifier? Find the accuracy and the most informative features and compare them to the old classifier. You may want to look at more than just 5 features. </font>

### Choosing the Right Features

Selecting relevant features and deciding how to encode them for a learning method can have an enormous impact on the learning method's ability to extract a good model. Much of the interesting work in building a classifier is deciding what features might be relevant, and how we can represent them. Although it's often possible to get decent performance by using a fairly simple and obvious set of features, there are usually significant gains to be had by using carefully constructed features based on a thorough understanding of the task at hand.

Typically, feature extractors are built through a process of trial-and-error, guided by intuitions about what information is relevant to the problem. It's common to start with a "kitchen sink" approach, including all the features that you can think of, and then checking to see which features actually are helpful. We take this approach for name gender features.

In [None]:
def gender_features3(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features[f"count{letter}"] = name.lower().count(letter)
        features[f"has{letter}"] = (letter in name.lower())
    return features


In [None]:
gender_features3('John')

In [None]:
len(gender_features3('John')) # 54 features 

The featuresets returned by this feature extractor contain a large number of specific features. There are usually limits to the number of features that you should use with a given learning algorithm — if you provide too many features, then the algorithm will have a higher chance of relying on idiosyncrasies of your training data that don't generalize well to new examples.

This problem is known as overfitting, and can be especially problematic when working with small training sets such as the Names Corpus. For example, if we train a naive Bayes classifier using the above feature extractor, it will overfit our relatively small training set, resulting in an accuracy that is lower than the accuracy of a classifier which only pays attention to the final letter of each name:

In [None]:
from nltk.classify import apply_features

random.shuffle(names)

train_set = apply_features(gender_features3, names[500:])
test_set = apply_features(gender_features3, names[:500])

classifier = nltk.NaiveBayesClassifier.train(train_set)

nltk.classify.accuracy(classifier, test_set)

Once an initial set of features has been chosen, a very productive method for refining the feature set is error analysis. To do this, we create a development set which is separate from the test and training sets. 

In [None]:

test_names = names[:500] # test set is first 500 names

devtest_names = names[500:1500] # development set is the next 1000 names 

train_names = names[1500:] # train set is all the other names


The training set is used to train the model, and the dev-test set is used to perform error analysis. The test set serves in our final evaluation of the system. For reasons discussed below, it is important that we employ a separate dev-test set for error analysis, rather than just using the test set. 
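The three-way split can also be wrapped in a small helper, which makes it easy to draw a fresh split later. This is a sketch under assumptions: `three_way_split` and its `seed` parameter are hypothetical names, and the toy data stands in for the real `names` list.

```python
import random

def three_way_split(items, n_test, n_devtest, seed=None):
    # Shuffle a copy, then carve off test, dev-test, and training portions.
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return train, devtest, test

# Toy stand-in for the (name, gender) pairs in names
toy_names = [(f"name{i}", "male" if i % 2 else "female") for i in range(20)]
train, devtest, test = three_way_split(toy_names, n_test=4, n_devtest=6, seed=0)
print(len(train), len(devtest), len(test))  # 10 6 4
```

Because the three slices do not overlap, no name is used both for fitting the model and for evaluating it.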


Having divided the corpus into appropriate datasets, we train a model using the training set, and then run it on the devtest set.

In [None]:
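# For this part we go back to our simplest gender features
# (names is not reshuffled here: train_names, devtest_names and test_names
# were already sliced out above, so shuffling names now would not change them)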

train_set = apply_features(gender_features, train_names)
devtest_set = apply_features(gender_features, devtest_names)

classifier = nltk.NaiveBayesClassifier.train(train_set)

nltk.classify.accuracy(classifier, devtest_set)
# We assess the accuracy on the devtest set

Using the dev-test set, we can generate a list of the errors that the classifier makes when predicting name genders:

In [None]:
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append( (tag, guess, name) )

len(errors)

We can then examine individual error cases where the model predicted the wrong label, and try to determine what additional pieces of information would allow it to make the right decision (or which existing pieces of information are tricking it into making the wrong decision). The feature set can then be adjusted accordingly. The names classifier that we have built generates about 200 errors on the dev-test corpus:

In [None]:
for (tag, guess, name) in sorted(errors):
    print(tag, guess, name)
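A long error list can be hard to scan by eye. One option is to tally the errors by some property of the name, such as its last two letters, and look at the most frequent groups. The sketch below uses a handful of made-up error tuples in the same `(tag, guess, name)` shape as `errors`; with the real data you would pass `errors` instead of `toy_errors`.

```python
from collections import Counter

# Hypothetical error tuples in the same (tag, guess, name) shape as errors.
toy_errors = [("female", "male", "Carolyn"), ("female", "male", "Kathryn"),
              ("female", "male", "Jocelyn"), ("male", "female", "Rich"),
              ("male", "female", "Mitch"), ("female", "male", "Ardis")]

# Count (correct tag, two-letter suffix) pairs among the misclassified names.
suffix_counts = Counter((tag, name[-2:].lower()) for tag, guess, name in toy_errors)
for (tag, suffix), count in suffix_counts.most_common(3):
    print(tag, suffix, count)
```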

Looking through this list of errors makes it clear that some suffixes that are more than one letter can be indicative of name genders. For example, names ending in yn appear to be predominantly female, despite the fact that names ending in n tend to be male; and names ending in ch are usually male, even though names that end in h tend to be female. We therefore adjust our feature extractor to include features for two-letter suffixes:

In [None]:
def gender_features(word):
    return {'suffix1': word[-1:],
            'suffix2': word[-2:]}

Rebuilding the classifier with the new feature extractor, we see that the performance on the dev-test dataset improves by about 2 percentage points (from 76% to 78%):

In [None]:
train_set = apply_features(gender_features, train_names)
devtest_set = apply_features(gender_features, devtest_names)

classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, devtest_set)

This error analysis procedure can then be repeated, checking for patterns in the errors that are made by the newly improved classifier. Each time the error analysis procedure is repeated, we should select a different dev-test/training split, to ensure that the classifier does not start to reflect idiosyncrasies in the dev-test set.

But once we've used the dev-test set to help us develop the model, we can no longer trust that it will give us an accurate idea of how well the model would perform on new data. It is therefore important to keep the test set separate, and unused, until our model development is complete. At that point, we can use the test set to evaluate how well our model will perform on new input values.

In [None]:
# use this chunk to further develop your classifier. 
# You can e.g., do some more error analysis 


In [None]:
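# end by testing your classifier on the test set, rebuilt with the current feature
# extractor (the earlier test_set used gender_features3 and would not match)
test_set = apply_features(gender_features, test_names)
nltk.classify.accuracy(classifier, test_set)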

<font color=red> How does your new classifier do compared to the simpler ones? Did you end up overfitting on the development set? </font>

## Part 2

### Document Classification


We've previously seen several examples of corpora where documents have been labeled with categories. Using these corpora, we can build classifiers that will automatically tag new documents with appropriate category labels. First, we construct a list of documents, labeled with the appropriate categories. For this example, we've chosen the Movie Reviews Corpus, which categorizes each review as positive or negative.

In [None]:
nltk.download('movie_reviews')

documents = [(list(nltk.corpus.movie_reviews.words(fileid)), category)
             for category in nltk.corpus.movie_reviews.categories()
             for fileid in nltk.corpus.movie_reviews.fileids(category)]

random.shuffle(documents)

<font color=red> Consider the nested list comprehension for building up documents. </font>

    documents = [(list(nltk.corpus.movie_reviews.words(fileid)), category)
                 for category in nltk.corpus.movie_reviews.categories()
                 for fileid in nltk.corpus.movie_reviews.fileids(category)]
             
<font color=red> Explain in your own words what this code is doing. Try running the object documents[0] below to get started. What is the output? </font>

In [None]:
documents[0]

### Exercise: Exploring the code
<font color=red>Start out by exploring the movie reviews corpus and familiarizing yourself with it: figuring out how big the corpus is, how many reviews there are, and how many of them are positive/negative would be a bare minimum. You can use these corpus methods: 
    
    .fileids(), .words(), .raw(). 
    
This particular corpus comes with categories, so you should also try: 

    .categories(). 
    
You can list file IDs based on categories: 
         
    nltk.corpus.movie_reviews.fileids('pos')

</font>


In [None]:
# Your code here 

In [None]:
# Your answer here 

Next, we define a feature extractor for documents, so the classifier will know which aspects of the data it should pay attention to. For document topic identification, we can define a feature for each word, indicating whether the document contains that word. To limit the number of features that the classifier needs to process, we begin by constructing a list of the 2000 most frequent words in the overall corpus. We can then define a feature extractor that simply checks whether each of these words is present in a given document.

In [None]:
# first generate the frequency distribution for all words
all_words = nltk.FreqDist(w.lower() for w in nltk.corpus.movie_reviews.words())

# select the top 2000 most frequent words
word_features = [w for (w,f) in all_words.most_common(2000)] 

def document_features(document): 
    document_words = set(document) # do you remember what set() does?
    features = {}
    for word in word_features:
        features[f'contains({word})'] = (word in document_words)
    return features

In [None]:
document_features(nltk.corpus.movie_reviews.words('pos/cv957_8737.txt'))

#### Note

The reason that we compute the set of all words in a document, rather than just checking if word in document, is that checking whether a word occurs in a set is much faster than checking whether it occurs in a list.
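If you want to see the difference for yourself, the following sketch times a membership test against a list and against a set built from the same words. The word list is synthetic and the sizes are arbitrary choices for this illustration; the actual timings depend on your machine, so no particular numbers are promised.

```python
import timeit

# Synthetic vocabulary; in the notebook this role is played by a document's word list.
words = [f"word{i}" for i in range(20000)]
word_set = set(words)
target = "word19999"  # worst case for the list: the very last element

# A list is searched element by element; a set uses a hash lookup.
list_time = timeit.timeit(lambda: target in words, number=200)
set_time = timeit.timeit(lambda: target in word_set, number=200)
print(f"list: {list_time:.5f}s  set: {set_time:.5f}s")
```

Since document_features performs 2,000 membership tests per document, this difference adds up quickly across the whole corpus.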

Now that we've defined our feature extractor, we can use it to train a classifier to label new movie reviews. To check how reliable the resulting classifier is, we compute its accuracy on the test set. And once again, we can use show_most_informative_features() to find out which features the classifier found to be most informative.

In [None]:
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [None]:
nltk.classify.accuracy(classifier, test_set)

In [None]:
classifier.show_most_informative_features(5)

<font color=red> Q: Do any of the most informative features surprise you? Try with 10 and 20 features as well </font>

### Exercise
Consider the fact that, because the documents are randomly shuffled in the first cell (under the section heading Document Classification), 

    from nltk.corpus import movie_reviews
    documents = [(list(movie_reviews.words(fileid)), category)
                 for category in movie_reviews.categories()
                 for fileid in movie_reviews.fileids(category)]
    random.shuffle(documents)

the features may change if you run all the code again starting with that cell. Try it a few times, and see which features remain "most informative" and which ones change. 


In [None]:
# What did you notice? What were some of the most informative features? 

### Exercise

<font color=red> Below I've written a fake movie review. First we'll classify this version, then a shorter one, and finally you're going to replace it with your own fake movie review. </font>


In [None]:
myreview = "Mr. Matt Damon was outstanding, fantastic, excellent, wonderfully subtle, superb, terrific, and memorable in his portrayal of Mulan"

In [None]:
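# word_tokenize needs NLTK's Punkt tokenizer data; if it is missing, run nltk.download('punkt')
# (recent NLTK releases may ask for 'punkt_tab' instead)
myreview_toks = nltk.word_tokenize(myreview.lower())  # lowercase, and then tokenize
myreview_toks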

In [None]:
myreview_feats = document_features(myreview_toks)     # generate word feature dictionary
classifier.classify(myreview_feats)    # classify

In [None]:
classifier.prob_classify(myreview_feats).prob('pos')  # probability of 'pos' label


In [None]:
classifier.prob_classify(myreview_feats).prob('neg')  # probability of 'neg' label


<font color=red> Now go back and change myreview to the following shorter review and see how that changes your classifier results. 

        myreview = "Mr. Matt Damon was outstanding, fantastic."   

Are you surprised by the result (hint: you should be)?</font>

Explanation: This surprising result comes from the fact that under this particular classifier model, all reviews, long or short, are represented by exactly the same set of 2,000 presence/absence word features. Even though this short review has only 8 tokens, there are (at least) 1,992 other features also simultaneously voting for the 'pos' and 'neg' labels. In this case, these "absent" word features voted heavily towards 'neg' (e.g., enjoyed was not found, which pushes the prediction towards 'neg'); the presence of Matt, Damon, outstanding, fantastic -- all strong features towards 'pos' -- didn't have enough collective sway.
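The "voting" can be made concrete with a small hand calculation. In the sketch below, every presence/absence feature contributes a log-odds term in favour of 'pos' or 'neg'. All probabilities are invented for illustration (they are not learned from the Movie Reviews Corpus), the helper `log_odds` and the `common_word_i` features are hypothetical, and equal priors for the two labels are assumed.

```python
import math

def log_odds(present, p_pos, p_neg):
    # Contribution of one presence/absence feature to log(P(pos) / P(neg)).
    # p_pos and p_neg are P(word is present | label).
    if present:
        return math.log(p_pos / p_neg)
    return math.log((1 - p_pos) / (1 - p_neg))

# Hypothetical probabilities, chosen by hand.
strong_words = {"outstanding": (0.10, 0.02), "fantastic": (0.08, 0.02)}
# Ten words that are somewhat more common in positive reviews.
common_words = {f"common_word_{i}": (0.50, 0.30) for i in range(10)}

review = {"outstanding", "fantastic"}  # tokens in the short review

present_vote = sum(log_odds(w in review, *p) for w, p in strong_words.items())
absent_vote = sum(log_odds(w in review, *p) for w, p in common_words.items())
total = present_vote + absent_vote  # equal priors assumed, so no prior term

print(f"present words: {present_vote:+.2f}")
print(f"absent words:  {absent_vote:+.2f}")
print("predicted:", "pos" if total > 0 else "neg")
```

With these made-up numbers, two strongly positive words are outvoted by just ten missing words that each lean only mildly negative when absent; the real classifier has close to 2,000 such absent features.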


<font color=red> Now write your own fake review in 
    
        myreview = "  "
        
Knowing what you do about how the classifier works, try to deceive it! Can you write a negative review that will be incorrectly predicted to be 'pos' by the classifier? 
</font>


In [None]:
# Tell me about your fake review(s) and what you found here. 