<h1><center>6. Learning to Classify Text</center></h1>
 
Detecting patterns is a central part of Natural Language Processing. Words ending in -ed tend to be past tense verbs. Frequent use of will is indicative of news text. These observable patterns — word structure and word frequency — happen to correlate with particular aspects of meaning, such as tense and topic.

This chapter aims to teach 

* How to identify particular features of language data that are salient for classifying it,
* How to construct models of language that can be used to perform language processing tasks automatically,
* What can be learned about language from these models.

**To be able to run the codes below, download nltk.**

In [1]:
import nltk
import random #shuffle
from nltk.corpus import names
from nltk.corpus import brown
from nltk.corpus import movie_reviews


nltk.download('punkt') #for word_tokenize
nltk.download('averaged_perceptron_tagger')#for pos tagger
nltk.download('tagsets') #for pos_tag help
nltk.download('universal_tagset') #universal tags for pos

####Corpora###
nltk.download('brown') #brown
nltk.download('nps_chat') #nps chat 
nltk.download('conll2000') #conll 
nltk.download('treebank') #penn 
nltk.download('sinica_treebank') #sinica treebank
nltk.download('indian') #indian corpus
nltk.download('mac_morpho') #mac morpho
nltk.download('rte') #recognizing text entailment
nltk.download('names') #names
nltk.download('movie_reviews') #movie reviews


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package nps_chat to /root/nltk_data...
[nltk_data]   Unzipping corpora/nps_chat.zip.
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.
[nltk_data] Downloading package sinica_treebank to /root/nltk_data...
[n

True

<h2>1. Supervised Classification</h2>

**Classification** is the task of choosing the correct **class label** for a given input. In basic classification tasks, each input is considered in isolation from all other inputs, and the set of labels is defined in advance. Some examples of classification tasks are:

* Deciding whether an email is spam or not.
* Deciding what the topic of a news article is, from a fixed list of topic areas such as "sports," "technology," and "politics."
* Deciding whether a given occurrence of the word *bank* is used to refer to a river bank, a financial institution, the act of tilting to the side, or the act of depositing something in a financial institution.

The basic classification task has a number of interesting variants. For example,

* In multi-class classification, each instance may be assigned multiple labels;
* In open-class classification, the set of labels is not defined in advance;
* In sequence classification, a list of inputs are jointly classified

A classifier is called **supervised** if it is built based on training corpora containing the correct label for each input. The framework used by supervised classification is shown:

![](http://www.nltk.org/images/supervised-classification.png)

In supervised classification, input data are already correctly labeled. For instance, a classification algorithm will learn to identify animals after being trained on a dataset of images that are properly labeled with the species of the animal and some identifying characteristics.

<h3>Gender Identification</h3>

In chapter 2 we saw that male and female names have some distinctive characteristics. Names ending in a, e and i are likely to be female, while names ending in k, o, r, s and t are likely to be male. Let's build a classifier to model these differences more precisely.

The first step in creating a classifier is deciding what **features** of the input are relevant, and how to **encode** those features. For this example, we'll start by just looking at the final letter of a given name. The following **feature extracto**r function builds a dictionary containing relevant information about a given name:



In [4]:
def gender_features(word):
    return {'last_letter': word[-1]}
gender_features('Shrek')

{'last_letter': 'k'}

The returned dictionary, known as a **feature set**, maps from feature names to their values. Feature names are case-sensitive strings that typically provide a short human-readable description of the feature, as in the example '`last_letter`'. Feature values are values with simple types, such as booleans, numbers, and strings.

Now that we've defined a feature extractor, we need to prepare a list of examples and corresponding class labels.

In [0]:
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
[(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)

Next, we use the feature extractor to process the `names` data, and divide the resulting list of feature sets into a **training set** and a **test set**. The training set is used to train a new "naive Bayes" classifier. We will learn more about the naive Bayes classifier later in the chapter.

In [6]:
#feature set consist of the tuples of the last characters and genders
featuresets = [(gender_features(n), gender) 
               for (n, gender) in labeled_names]
#training and test sets
train_set, test_set = featuresets[500:], featuresets[:500]
#train classifier
classifier = nltk.NaiveBayesClassifier.train(train_set)
#some test samples
print("Neo:", classifier.classify(gender_features('Neo')))
print("Trinity:", classifier.classify(gender_features('Trinity')))

Neo: male
Trinity: female


We can systematically evaluate the classifier on a much larger quantity of unseen data, and we can examine the classifier to determine which features it found most effective for distinguishing the names' genders:

In [7]:
print("Accuracy:", nltk.classify.accuracy(classifier, test_set))
print(classifier.show_most_informative_features(5))

Accuracy: 0.754
Most Informative Features
             last_letter = 'k'              male : female =     44.0 : 1.0
             last_letter = 'a'            female : male   =     34.5 : 1.0
             last_letter = 'f'              male : female =     26.7 : 1.0
             last_letter = 'p'              male : female =     12.6 : 1.0
             last_letter = 'v'              male : female =     11.2 : 1.0
None


**Your Turn:** Modify the `gender_features() ` function to provide the classifier with features encoding the length of the name, its first letter, and any other features that seem like they might be informative. Retrain the classifier with these new features, and test its accuracy.

In [0]:
# Try here

When working with large corpora, constructing a single list that contains the features of every instance can use up a large amount of memory. In these cases, use the function `nltk.classify.apply_features`, which returns an object that acts like a list but does not store all the feature sets in memory:

In [0]:
from nltk.classify import apply_features
train_set = apply_features(gender_features, labeled_names[500:])
test_set = apply_features(gender_features, labeled_names[:500])

NameError: ignored

<h3>Choosing The Right Features</h3>

Selecting relevant features and deciding how to encode them for a learning method can have an enormous impact on the learning method's ability to extract a good model. Much of the interesting work in building a classifier is deciding what features might be relevant, and how we can represent them.

Typically, feature extractors are built through a process of trial-and-error, guided by intuitions about what information is relevant to the problem. It's common to start with a "kitchen sink" approach, including all the features that you can think of, and then checking to see which features actually are helpful. We take this approach for name gender features:

In [0]:
def gender_features2(name):
    features = {} #object to be filled with features
    features["first_letter"] = name[0].lower() #first letter
    features["last_letter"] = name[-1].lower() #last letter
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        #all letters existance and frequency
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features
print("All features (swipe to see all):")
print(gender_features2('Atalay'))

All features (swipe to see all):
{'first_letter': 'a', 'last_letter': 'y', 'count(a)': 3, 'has(a)': True, 'count(b)': 0, 'has(b)': False, 'count(c)': 0, 'has(c)': False, 'count(d)': 0, 'has(d)': False, 'count(e)': 0, 'has(e)': False, 'count(f)': 0, 'has(f)': False, 'count(g)': 0, 'has(g)': False, 'count(h)': 0, 'has(h)': False, 'count(i)': 0, 'has(i)': False, 'count(j)': 0, 'has(j)': False, 'count(k)': 0, 'has(k)': False, 'count(l)': 1, 'has(l)': True, 'count(m)': 0, 'has(m)': False, 'count(n)': 0, 'has(n)': False, 'count(o)': 0, 'has(o)': False, 'count(p)': 0, 'has(p)': False, 'count(q)': 0, 'has(q)': False, 'count(r)': 0, 'has(r)': False, 'count(s)': 0, 'has(s)': False, 'count(t)': 1, 'has(t)': True, 'count(u)': 0, 'has(u)': False, 'count(v)': 0, 'has(v)': False, 'count(w)': 0, 'has(w)': False, 'count(x)': 0, 'has(x)': False, 'count(y)': 1, 'has(y)': True, 'count(z)': 0, 'has(z)': False}


However, there are usually limits to the number of features that you should use with a given learning algorithm — if you provide too many features, then the algorithm will have a higher chance of relying on idiosyncrasies of your training data that don't generalize well to new examples. This problem is known as **overfitting**, and can be especially problematic when working with small training sets.

For example, using gender_features2() feature extractor will overfit the relatively small training set if we train a naive Bayes classifier.


In [0]:
featuresets = [(gender_features2(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print("Accuracy:", nltk.classify.accuracy(classifier, test_set))

Accuracy: 0.776


Once an initial set of features has been chosen, a very productive method for refining the feature set is **error analysis**. First, we select a **development set**, containing the corpus data for creating the model. This development set is then subdivided into the **training set** and the **dev-test** set.

In [0]:
train_names = labeled_names[1500:]
devtest_names = labeled_names[500:1500]
test_names = labeled_names[:500]

The training set is used to train the model, and the dev-test set is used to perform error analysis. The test set serves in our final evaluation of the system. The division of the corpus data into different subsets is shown:

![](http://www.nltk.org/images/corpus-org.png)

Having divided the corpus into appropriate datasets, we train a model using the training set [1], and then run it on the dev-test set [2].

In [0]:
train_set = [(gender_features(n), gender) 
             for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) 
               for (n, gender) in devtest_names]
test_set = [(gender_features(n), gender) 
            for (n, gender) in test_names]
# train a model using the training set
classifier = nltk.NaiveBayesClassifier.train(train_set) #[1]
print("Accuracy:", nltk.classify.accuracy(classifier, devtest_set))  #[2]

**Finding Errors**

Using the dev-test set, we can generate a list of the errors that the classifier makes when predicting name genders:

In [0]:
errors = [] #to be filled with errors
for (name, tag) in devtest_names:
    #guesses of the classifiers
    guess = classifier.classify(gender_features(name))
    #if the guess is not correct
    if guess != tag:
        #push values to the errors list
        errors.append( (tag, guess, name) )

Using the dev-test set, we can generate a list of the errors that the classifier makes when predicting name genders. Only the first 20 errors are presented to give an example.

In [0]:
#tabulate the errors
print("Error count:", len(errors))
print("Errors (only first 15):")
for (tag, guess, name) in sorted(errors[:15]):
    print('correct={:<8} guess={:<8s} name={:<30}'.format(tag, guess, name))

If you analyze all the errors, names ending in *yn* appear to be predominantly female, despite the fact that names ending in *n* tend to be male; and names ending in *ch* are usually male, even though names that end in *h* tend to be female.

We therefore adjust our feature extractor to include features for two-letter suffixes and rebuild the classifier. The new feature extractor improved the accuracy of the classifier:

In [0]:
def gender_features(word):
    return {'suffix1': word[-1:],
            'suffix2': word[-2:],
            'first_letter': word[0]}

train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print("Accuracy:", nltk.classify.accuracy(classifier, devtest_set))

This error analysis procedure can then be repeated, checking for patterns in the errors that are made by the newly improved classifier. Each time the error analysis procedure is repeated, we should select a different dev-test/training split, to ensure that the classifier does not start to reflect problematically in the dev-test set.

But once we've used the dev-test set to help us develop the model, we can no longer trust that it will give us an accurate idea of how well the model would perform on new data. It is therefore important to **keep the test set separate, and unused, until our model development is complete**. At that point, we can use the test set to evaluate how well our model will perform on new input values.

<h3>Document Classification</h3>

In the first part, we saw several examples of corpora where documents have been labeled with categories. Using these corpora, we can build classifiers that will automatically tag new documents with appropriate category labels. For this example, we've chosen the Movie Reviews Corpus, which categorizes each review as positive or negative.

In [0]:
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

Next, we define a feature extractor for documents, so the classifier will know which aspects of the data it should pay attention to. For document topic identification, we can define a feature for each word, indicating whether the document contains that word. To limit the number of features that the classifier needs to process, we begin by constructing a list of the 2000 most frequent words in the overall corpus. We can then define a feature extractor that simply checks whether each of these words is present in a given document.

In [0]:
#finding all words in corpus
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
#most frequent 2000 words
word_features = list(all_words)[:2000]
#feature extractor
def document_features(document):
    '''making a set of the document speeds up searching for a word 
    compared to a list'''
    document_words = set(document)
    features = {} #to be filled with features
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

print(document_features(movie_reviews.words('pos/cv957_8737.txt')))

{'contains(plot)': True, 'contains(:)': True, 'contains(two)': True, 'contains(teen)': False, 'contains(couples)': False, 'contains(go)': False, 'contains(to)': True, 'contains(a)': True, 'contains(church)': False, 'contains(party)': False, 'contains(,)': True, 'contains(drink)': False, 'contains(and)': True, 'contains(then)': True, 'contains(drive)': False, 'contains(.)': True, 'contains(they)': True, 'contains(get)': True, 'contains(into)': True, 'contains(an)': True, 'contains(accident)': False, 'contains(one)': True, 'contains(of)': True, 'contains(the)': True, 'contains(guys)': False, 'contains(dies)': False, 'contains(but)': True, 'contains(his)': True, 'contains(girlfriend)': True, 'contains(continues)': False, 'contains(see)': False, 'contains(him)': True, 'contains(in)': True, 'contains(her)': False, 'contains(life)': False, 'contains(has)': True, 'contains(nightmares)': False, 'contains(what)': True, "contains(')": True, 'contains(s)': True, 'contains(deal)': False, 'contains

Now that we've defined our feature extractor, we can use it to train a classifier to label new movie reviews. To check how reliable the resulting classifier is, we compute its accuracy on the test set. And once again, we can use `show_most_informative_features()` to find out which features the classifier found to be most informative.

In [0]:
#feature sets
featuresets = [(document_features(d), c) for (d,c) in documents]
#training and test sets
train_set, test_set = featuresets[100:], featuresets[:100]
#train classifier
classifier = nltk.NaiveBayesClassifier.train(train_set)

print("Accuracy:", nltk.classify.accuracy(classifier, test_set))

print(classifier.show_most_informative_features(5))

<h3>Part-of-Speech Tagging</h3>

In chapter 5, we built a regular expression tagger that chooses a part-of-speech tag for a word by looking at the internal make-up of the word. However, this regular expression tagger had to be hand-crafted. Instead, we can train a classifier to work out which suffixes are most informative. Let's begin by finding out what the most common suffixes are:

In [0]:
suffix_fdist = nltk.FreqDist()
for word in brown.words():
    word = word.lower()
    suffix_fdist[word[-1:]] += 1
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1
    
common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]
print(common_suffixes)

Next, we'll define a feature extractor function which checks a given word for these suffixes:



In [0]:
def pos_features(word):
    features = {}
    for suffix in common_suffixes:
        features['endswith({})'.format(suffix)] = word.lower().endswith(suffix)
    return features

In this case, the classifier will make its decisions based only on information about which of the common suffixes (if any) a given word has.

Now that we've defined our feature extractor, we can use it to train a new "decision tree" classifier (to be discussed in part 4):


**Warning: The code below runs too slow, it may not produce an output even in 10 minutes.**

In [0]:
#pre-tagged words
tagged_words = brown.tagged_words(categories='news')
#feature set
featuresets = [(pos_features(n), g) for (n,g) in tagged_words]
#dividing data as 90% training sets and 10% test set
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
#train classifier
classifier = nltk.DecisionTreeClassifier.train(train_set)
print("Accuracy:", nltk.classify.accuracy(classifier, test_set))

KeyboardInterrupt: ignored

<h3>Exploiting Context</h3>

Our feature extraction function can be augmented with new features such as the length of the word, the number of syllables it contains, or its prefix. However, as long as the feature extractor just looks at the target word, we have no way to add features that depend on the *context* that the word appears in. But contextual features often provide powerful clues about the correct tag — for example, when tagging the word "fly," knowing that the previous word is "a" will allow us to determine that it is functioning as a noun, not a verb.

In order to do that, we must revise the pattern that we used to define our feature extractor. Instead of just passing in the word to be tagged, we will pass in a complete (untagged) sentence, along with the index of the target word.

In [0]:
def pos_features(sentence, i):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
    return features
sentence = brown.sents()[0]
print("Word:", sentence[8])
print("Sentence:", ' '.join(sentence))
print("Features:",pos_features(brown.sents()[0], 8))

Word: investigation
Sentence: The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .
Features: {'suffix(1)': 'n', 'suffix(2)': 'on', 'suffix(3)': 'ion', 'prev-word': 'an'}


Let's train our classifier with our new approach:

In [0]:
#pre-tagged sentences
tagged_sents = brown.tagged_sents(categories='news')
featuresets = [] #to be filled
#untag sentences and send to feature extractor
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)
    for i, (word, tag) in enumerate(tagged_sent):
        featuresets.append( (pos_features(untagged_sent, i), tag) )
#dividing data as 90% training sets and 10% test set
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
#train classifier
classifier = nltk.NaiveBayesClassifier.train(train_set)

print("Accuracy:",nltk.classify.accuracy(classifier, test_set))

Accuracy: 0.7891596220785678


It is clear that exploiting contextual features improves the performance of our part-of-speech tagger. For example, the classifier learns that a word is likely to be a noun if it comes immediately after the word "large". However, it is unable to learn the generalization that a word is probably a noun if it follows an adjective, because it doesn't have access to the previous word's part-of-speech tag. In general, simple classifiers always treat each input as independent from all other inputs. In many contexts, this makes perfect sense. For example, decisions about whether names tend to be male or female can be made on a case-by-case basis. However, there are often cases, such as part-of-speech tagging, where we are interested in solving classification problems that are closely related to one another.

<h3>Sequence Classification</h3>

In order to capture the dependencies between related classification tasks, we can use **joint classifier** models, which choose an appropriate labeling for a collection of related inputs. In the case of part-of-speech tagging, a variety of different **sequence classifier** models can be used to jointly choose part-of-speech tags for all the words in a given sentence.

One sequence classification strategy, known as consecutive classification or greedy sequence classification, is to find the most likely class label for the first input, then to use that answer to help find the best label for the next input. The process can then be repeated until all of the inputs have been labeled.

In [0]:
# more comments needed

#feature extractor
def pos_features(sentence, i, history):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
        features["prev-tag"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
        #previous tags are added to feature extractor
        features["prev-tag"] = history[i-1]
    return features

class ConsecutivePosTagger(nltk.TaggerI):

    def __init__(self, train_sents):
        train_set = [] #to be filled
        for tagged_sent in train_sents:
            #untag sentences
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = [] #to be filled with previous words' tags
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = pos_features(untagged_sent, i, history)
                train_set.append( (featureset, tag) )
                #push the previous words' tags to history
                history.append(tag)
        #train classifier
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)

    def tag(self, sentence):
        history = [] #to be filled with previous words' tags
        for i, word in enumerate(sentence):
            featureset = pos_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)

tagged_sents = brown.tagged_sents(categories='news')
#dividing data as 90% training sets and 10% test set
size = int(len(tagged_sents) * 0.1)
train_sents, test_sents = tagged_sents[size:], tagged_sents[:size]
tagger = ConsecutivePosTagger(train_sents)
print("Accuracy:", tagger.evaluate(test_sents))

Accuracy: 0.7980528511821975


<h3>Other Methods for Sequence Classification</h3>

One shortcoming of this approach is that we commit to every decision that we make. For example, if we decide to label a word as a noun, but later find evidence that it should have been a verb, there's no way to go back and fix our mistake. One solution to this problem is to adopt a transformational strategy instead. Transformational joint classifiers work by creating an initial assignment of labels for the inputs, and then iteratively refining that assignment in an attempt to repair inconsistencies between related inputs. The Brill tagger, is a good example of this strategy.

<h2>2. Further Examples of Supervised Classification</h2>

<h3>Sentence Segmentation</h3>

Sentence segmentation can be viewed as a classification task for punctuation: whenever we encounter a symbol that could possibly end a sentence, such as a period or a question mark, we have to decide whether it terminates the preceding sentence.

The first step is to obtain some data that has already been segmented into sentences and convert it into a form that is suitable for extracting features:

In [0]:
sents = nltk.corpus.treebank_raw.sents()
#list of tokens from individual sentences
tokens = []
#set of indexes of sentence boundaries
boundaries = set()
offset = 0
for sent in sents:
    tokens.extend(sent)
    offset += len(sent)
    boundaries.add(offset-1)

Here, `tokens` is a merged list of tokens from the individual sentences, and `boundaries` is a set containing the indexes of all sentence-boundary tokens. Next, we need to specify the features of the data that will be used in order to decide whether punctuation indicates a sentence-boundary:

In [0]:
#feature extractor
def punct_features(tokens, i):
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
            'prev-word': tokens[i-1].lower(),
            'punct': tokens[i],
            'prev-word-is-one-char': len(tokens[i-1]) == 1}

Based on this feature extractor, we can create a list of labeled featuresets by selecting all the punctuation tokens, and tagging whether they are boundary tokens or not

In [0]:
featuresets = [(punct_features(tokens, i), (i in boundaries))
               for i in range(1, len(tokens)-1)
               if tokens[i] in '.?!']

Using these featuresets, we can train and evaluate a punctuation classifier:



In [0]:
#dividing data as 90% training sets and 10% test set
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)

0.936026936026936

To use this classifier to perform sentence segmentation, we simply check each punctuation mark to see whether it's labeled as a boundary; and divide the list of words at the boundary marks. The following piece of code shows how this can be done.

***
##Comments needed
***

In [0]:
def segment_sentences(words):
    start = 0
    sents = []
    for i, word in enumerate(words):
        if word in '.?!' and \
        classifier.classify(punct_features(words, i)) == True:
            sents.append(words[start:i+1])
            start = i+1
    if start < len(words):
        sents.append(words[start:])
    return sents

<h3> Identifying Dialogue Act Types</h3>

When processing dialogue, it can be useful to think of utterances as a type of *action* performed by the speaker. This interpretation is most straightforward for performative statements such as "I forgive you" or "I bet you can't climb that hill." But greetings, questions, answers, assertions, and clarifications can all be thought of as types of speech-based actions. Recognizing the **dialogue acts** underlying the utterances in a dialogue can be an important first step in understanding the conversation.

The NPS Chat Corpus consists of over 10,000 posts from instant messaging sessions. These posts have all been labeled with one of 15 dialogue act types, such as "Statement," "Emotion," "ynQuestion", and "Continuer." We can therefore use this data to build a classifier that can identify the dialogue act types for new instant messaging posts. 

The first step is to extract the basic messaging data. We will call  xml_posts() to get a data structure representing the XML annotation for each post and next, we'll define a simple feature extractor that checks what words the post contains:



In [0]:
posts = nltk.corpus.nps_chat.xml_posts()[:10000]

def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

Finally, we construct the training and testing data by applying the feature extractor to each post (using ` post.get('class')` to get a post's dialogue act type), and create a new classifier:

In [0]:
#feature sets
featuresets = [(dialogue_act_features(post.text), post.get('class'))
               for post in posts]
#dividing data as 90% training sets and 10% test set
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
#train classifier
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.668


<h3>Recognizing Textual Entailment</h3>

Recognizing textual entailment (RTE) is the task of determining whether a given piece of text T entails another text called the "hypothesis" (as already discussed in chapter 1 part 5). Here are a couple of examples of text/hypothesis pairs:

Challenge 3, Pair 34 (True)

>**T**: Parviz Davudi was representing Iran at a meeting of the Shanghai Co-operation Organisation (SCO), the fledgling association that binds Russia, China and four former Soviet republics of central Asia together to fight terrorism.

>**H**: China is a member of SCO.

Challenge 3, Pair 81 (False)

>**T**: According to NC Articles of Organization, the members of LLC company are H. Nelson Beavers, III, H. Chester Beavers and Jennie Beavers Stewart.

>**H**: Jennie Beavers Stewart is a share-holder of Carolina Analytical Laboratory.

We can treat RTE as a classification task, in which we try to predict the True/False label for each pair. Although it seems likely that successful approaches to this task will involve a combination of parsing, semantics and real world knowledge, many early attempts at RTE achieved reasonably good results with shallow analysis, based on similarity between the text and hypothesis at the word level.

In our RTE feature detector (2.2), we let words (i.e., word types) serve as proxies for information, and our features count

* the degree of word overlap
* the degree to which there are words in the hypothesis but not in the text (captured by the method `hyp_extra()`)

Not all words are equally important — Named Entity mentions such as the names of people, organizations and places are likely to be more significant, which motivates us to extract distinct information for words and nes (Named Entities). In addition, some high frequency function words are filtered out as "stopwords".def rte_features(rtepair):
    extractor = nltk.RTEFeatureExtractor(rtepair)
    features = {}
    features['word_overlap'] = len(extractor.overlap('word'))
    features['word_hyp_extra'] = len(extractor.hyp_extra('word'))
    features['ne_overlap'] = len(extractor.overlap('ne'))
    features['ne_hyp_extra'] = len(extractor.hyp_extra('ne'))
    return features

In [0]:
#feature extractor
def rte_features(rtepair):
    extractor = nltk.RTEFeatureExtractor(rtepair)
    features = {} #to be filled
    #word overlap
    features['word_overlap'] = len(extractor.overlap('word'))
    #word extra
    features['word_hyp_extra'] = len(extractor.hyp_extra('word'))
    #named entities overlap
    features['ne_overlap'] = len(extractor.overlap('ne'))
    #named entities extra
    features['ne_hyp_extra'] = len(extractor.hyp_extra('ne'))
    return features



To illustrate the content of these features, we examine some attributes of the text/hypothesis Pair 34 shown earlier:

In [0]:
rtepair = nltk.corpus.rte.pairs(['rte3_dev.xml'])[33]
extractor = nltk.RTEFeatureExtractor(rtepair)
print("Text words:")
print(extractor.text_words)
print("Hypothesis words:")
print(extractor.hyp_words)
print("Overlapping named entities:")
print(extractor.overlap('ne'))
print("Hypothesis extra words:")
print(extractor.hyp_extra('word'))

Text words:
{'Shanghai', 'was', 'Co', 'fight', 'Asia', 'fledgling', 'China', 'binds', 'Davudi', 'together', 'Organisation', 'Russia', 'representing', 'former', 'central', 'Iran', 'association', 'at', 'Soviet', 'republics', 'terrorism.', 'meeting', 'Parviz', 'operation', 'four', 'SCO', 'that'}
Hypothesis words:
{'SCO.', 'member', 'China'}
Overlapping named entities:
{'China'}
Hypothesis extra words:
{'member'}


<h2>3. Evaluation</h2>

*   shows how accurate a model is capturing pattern in data
*   guides us for further improvements to model
*   how reliable current model is for language predictions





<h3>Test Set</h3>

Generally, evaluation techniques calculate score by generating labels for test set using model, and compare it with correct labels. 

Test set has same format as training set. 

**Be careful** to **use a test set that differs from training** corpus. Reusing the training set as the test set, then a model that simply **memorized its input**,
without learning how to generalize to new examples. High evaluation score will be generated but it will be **misleading**. 

<h4>Size of Test Set</h4>

* For classification tasks that have
a **small number of well-balanced labels and a diverse test set**, evaluation
can be performed with as few as **100 evaluation instances.** 

* If a classification task
has a **large number of labels or very infrequent labels**, then the size of the test
set should be chosen to **ensure that the least frequent label occurs at least 50 times**.

* Additionally, if the test set contains **closely related instances**, the **size of the test set should be increased** to
ensure that this lack of diversity does not skew the evaluation results. 

* When **large amounts of annotated data** are available, it is common to use **10% of the overall data** for evaluation.
<br></br>

Do not forget to consider degree of similarity between test set & development set. Higher similarity could indicate that model does not generalize well.
<br></br>
We could use sentences for POS tagging using data from single genre as folllows, 

In [0]:
import random
from nltk.corpus import brown
tagged_sents = list(brown.tagged_sents(categories='news'))
random.shuffle(tagged_sents)
size = int(len(tagged_sents) * 0.1)
train_set_pos, test_set_pos = tagged_sents[size:], tagged_sents[:size]

This will generate test set very similar to training set. Both sets are taken from news genre, it may fail to generalize to other classes. What’s worse, because of the call to
`random.shuffle()`, the test set contains sentences that are taken from the same documents that were used for training. If a pattern is repeated in the same document *-it will be most likely-* it will affect both development and test set. 

Ensure that they are taken from different documents as follows,

In [0]:
file_ids = brown.fileids(categories='news')
size = int(len(file_ids) * 0.1)
train_set_example = brown.tagged_sents(file_ids[size:])
test_set_example = brown.tagged_sents(file_ids[:size])

#to be more certain of sets coming from different distributions
train_set_example = brown.tagged_sents(categories='news')
#select from different documents
#classifier built on these sets will generalize well beyond data it was trained.
test_set_example = brown.tagged_sents(categories='fiction')

<h3>Accuracy</h3>

Accuracy is one of the most simplest and common metrics used to evaluate a classifer. Accuracy measures **what percentage of inputs were correctly classified**. 

The function `nltk.classify.accuracy()` will calculate the accuracy of a classifier model on a given test set:

In [0]:
classifier = nltk.NaiveBayesClassifier.train(train_set)
print('Accuracy: %4.2f' % nltk.classify.accuracy(classifier, test_set))

Accuracy: 0.67


Do not forget to consider frequencies of the individual class labels in the test set. 

<h3>Precision and Recall</h3>

Accuracy may be decieving when search tasks are done. Accurately labeling irrelevant data will not mean much, what we aim to increase is finding relevant data. A new metric must be employed to measure model performance w.r.t search task. 

To simplify problem, first we can divide our items in four categories:

* **True Positives** are relevant items labeled correctly.
* **True Negatives** are irrelevant items labeled incorrectly.
* **False Positives (Type I errors)**  are irrelevant items labeled relevant.
* **False Negatives (Type II errors)** are revelant items labeled irrelevant.

Given these four numbers, we can define the following metrics:

*   **Precision** indicates how many items identified as relevant was actually relevant
<center>$TP/(TP+FP)$</center>

*  **Recall** indicates how many of the relevant items we have identified
<center>$TP/(TP+FN)$</center>

* **F-Measure (or F-Score),** combines precision and recall into a single score by using harmonic mean
<center>$(2*P*R)/(P+R)$</center>


<h3>Confusion Matrices</h3>

A confusion matrix is a table where each $cell [i,j]$ indicates how often label $j$ was predicted when the correct label was $i$. 

Diagonal entries indicate labels predicted correctly. Off diagonal entries indicate errors. 



In [0]:
from nltk.corpus import brown
import matplotlib.pyplot as plt

brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)

def tag_list(tagged_sents):
  return [tag for sent in tagged_sents for (word, tag) in sent]
def apply_tagger(tagger, corpus):
  return [tagger.tag(nltk.tag.untag(sent)) for sent in corpus]

gold = tag_list(brown.tagged_sents(categories='editorial'))
test = tag_list(apply_tagger(t2, brown.tagged_sents(categories='editorial')))

cm = nltk.ConfusionMatrix(gold, test)

print(cm)

           |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            

<h3>Cross Validation</h3>

Having a big train set and creating a satisfactionary test set may turn to a duality if there is a small set of annotated data available. 

One solution to this problem is to perform multiple evaluations on different test sets,
then to combine the scores from those evaluations, a technique known as crossvalidation. In particular, we subdivide the original corpus into N subsets called
folds. For each of these folds, we train a model using all of the data except the data in
that fold, and then test that model on the fold. 

<h2>4. Decision Trees</h2>

One solution to this problem is to perform multiple evaluations on different test sets,
then to combine the scores from those evaluations, a technique known as crossvalidation. In particular, we subdivide the original corpus into N subsets called
folds. For each of these folds, we train a model using all of the data except the data in
that fold, and then test that model on the fold. 

<h3>Entropy and Information Gain</h3>


Information gain measures how much more organized the input values become when we divide them up
using a given feature. To measure how disorganized the original set of input values are,
we calculate entropy of their labels.

In particular, entropy
is defined as the sum of the probability of each label times the log probability of that
same label:

<center>$H = \sum\nolimits_{l\in labels}P(l)*log_2P(l)$</center>

Calculating the entropy of a list of labels:

In [0]:
import math
def entropy(labels):
  freqdist = nltk.FreqDist(labels)
  probs = [freqdist.freq(l) for l in nltk.FreqDist(labels)]
  return -sum([p * math.log(p,2) for p in probs])
print("Entropy of group m-m-m-m:",entropy(['male', 'male', 'male', 'male']))
print("Entropy of group m-f-m-m:",entropy(['male', 'female', 'male', 'male']))
print("Entropy of group f-m-f-m:",entropy(['female', 'male', 'female', 'male']))
print("Entropy of group f-f-m-f:",entropy(['female', 'female', 'male', 'female']))
print("Entropy of group f-f-f-f:",entropy(['female', 'female', 'female', 'female']))


Entropy of group m-m-m-m: -0.0
Entropy of group m-f-m-m: 0.8112781244591328
Entropy of group f-m-f-m: 1.0
Entropy of group f-f-m-f: 0.8112781244591328
Entropy of group f-f-f-f: -0.0


We see that entropy increases with respect to diversity. 

**This notebook has shown some code implementations of chapter 6 with small introduction to classification, however it is highly advised to read chapter 6 from NLTK book. Many concepts from chapter 6 are skipped here. Click here to read:**
[Chapter 6](https://www.nltk.org/book/ch06.html)

<h2>Exercises</h2>

###Question 1

Select one of the classification tasks described in this chapter, such as name gender detection, document classification, part-of-speech tagging, or dialog act classification. Using the same training and test data, and the same feature extractor, build three classifiers for the task: a decision tree, a naive Bayes classifier, and a Maximum Entropy classifier. Compare the performance of the three classifiers on your selected task. How do you think that your results might be different if you used a different feature extractor?



In [0]:
#Try here

###Question 2

The Senseval 2 Corpus contains data intended to train word-sense disambiguation classifiers. It contains data for four words: hard, interest, line, and serve. Choose one of these four words, and load the corresponding data:
  	

```
from nltk.corpus import senseval
instances = senseval.instances('hard.pos')
size = int(len(instances) * 0.1)
train_set, test_set = instances[size:], instances[:size]
```



Using this dataset, build a classifier that predicts the correct sense tag for a given instance. See the corpus HOWTO at http://nltk.org/howto for information on using the instance objects returned by the Senseval 2 Corpus.

In [0]:
#Try here

##Solutions

###Question 1

In [0]:
# gender classification selected

# all the functions and training sets are retrived from part 1 
def gender_features(word):
    return {'suffix1': word[-1:],
            'suffix2': word[-2:],
            'first_letter': word[0]}

labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
[(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)

featuresets = [(gender_features(n), gender) 
               for (n, gender) in labeled_names]

train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]

#training naive Bayes classifier
classifier = nltk.NaiveBayesClassifier.train(train_set)
print("Naive Bayes accuracy:", nltk.classify.accuracy(classifier, devtest_set))

#training Decision Tree classifier
classifier = nltk.DecisionTreeClassifier.train(train_set)
print("Decision Tree accuracy:", nltk.classify.accuracy(classifier, devtest_set))

#training Maximum Entropy classifier
print("(Training Maximum Entropy classifier takes 1-2 minutes)")
classifier = nltk.MaxentClassifier.train(train_set)
print("Maximum Entropy accuracy:", nltk.classify.accuracy(classifier, devtest_set))

###Question 2

In [0]:
from nltk.corpus import senseval

instances = senseval.instances('interest.pos')
size = int(len(instances) * 0.1)
train_set, test_set = instances[size:], instances[:size]

# Hint: Here is how to look at the individual contexts
# Printing this table is learned from the link given in the question
print("This is how 'interest' used in context:")
for inst in train_set[:5]:
    p = inst.position
    left = ' '.join(w for (w,t) in inst.context[p-2:p])
    word = ' '.join(w for (w,t) in inst.context[p:p+1])
    right = ' '.join(w for (w,t) in inst.context[p+1:p+3])
    senses = ' '.join(inst.senses)
    print ('%20s |%10s | %-15s -> %s' % (left, word, right, senses))


def features(instance):
    features = dict()
    p = instance.position
       ## previous word and tag
    if p: ## > 0
        features['prev_word'] = instance.context[p-1][0]
        features['prev_tag'] = instance.context[p-1][1]
       ## use START if it is the first word
    else: 
        features['prev_word'] = (p, 'START')
        features['prev_tag'] = (p, 'START')
       ## following word and tag       
        features['following_word'] = instance.context[p+1][0]
        features['following_tag'] = instance.context[p+1][1]
    return features


featureset =[(features(i), i.senses[0]) for i in 
             instances if len(i.senses)==1]

### shuffle them randomly
random.shuffle(featureset)
print("\nFeature set:")
print (featureset[:2])

### try on a small sample
train, dev, test = featureset[500:], featureset[:250], featureset[250:500]
classifier = nltk.NaiveBayesClassifier.train(train)
print("\nAccuracy:")
print ("Accuracy on Dev:", nltk.classify.accuracy(classifier, dev))
print ("Accuracy on Test:", nltk.classify.accuracy(classifier, train))


This is how 'interest' used in context:
because municipal-bond |  interest | is exempt       -> interest_6
       at prevailing |  interest | rates .         -> interest_6
            bet that |  interest | rates will      -> interest_6
           losses if |  interest | rates rise      -> interest_6
                     |  interest | rates do        -> interest_6

Feature set:
[({'prev_word': 'falling', 'prev_tag': 'VBG'}, 'interest_6'), ({'prev_word': 'buying', 'prev_tag': 'NN'}, 'interest_1')]

Accuracy:
Accuracy on Dev: 0.776
Accuracy on Test: 0.8265524625267666
