# Using Taggers with NLTK

Kevin Nolasco

Cabrini University

MCIS565 - Natural Language Processing

04/10/2022


## Prompt
- Create a regular expression tagger and various unigram and n-gram taggers, incorporating backoff, and train them on part of the Brown corpus.
- Create three different combinations of the taggers. Test the accuracy of each combined tagger. Which combination works best?
- Try varying the size of the training corpus. How does it affect your results?

## Regexp Tagger, Various Unigram and N-gram Taggers and Backoff

*Create a regular expression tagger and various unigram and n-gram taggers, incorporating backoff, and train them on part of the Brown corpus.*

Below we implement part 1 of the prompt.

In [38]:
import nltk
from nltk.corpus import brown

### Get Corpus

In [39]:
brown_tagged_sents = brown.tagged_sents(categories = 'news')
brown_sents = brown.sents(categories = 'news')

In [40]:
# split to train and test
def split_sents(tagged_sents, sents, train_size = 0.6):
    """
    return train_tagged_sents, test_tagged_sents, train_sents, test_sents
    """
    train_n = int(len(tagged_sents)*train_size)
    return tagged_sents[:train_n], tagged_sents[train_n:], sents[:train_n], sents[train_n:]

In [41]:
train_tagged_sents, test_tagged_sents, train_sents, test_sents = split_sents(brown_tagged_sents, brown_sents)
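
The split above is positional: the first `train_size` share of the sentences becomes the training set and the rest becomes the test set, with no shuffling. The sketch below shows the same slicing on a toy list. `split_sequence` is a hypothetical name used only for illustration and is not part of NLTK.

```python
def split_sequence(items, train_size=0.6):
    """Slice items into (train, test) at int(len(items) * train_size)."""
    train_n = int(len(items) * train_size)
    return items[:train_n], items[train_n:]


toy = list(range(10))
train, test = split_sequence(toy)
# The two parts never overlap and together cover the whole input.
print(len(train), len(test))
print(train + test == toy)
```

Because nothing is shuffled, the test set is always the tail of the corpus, so the tagger is scored on sentences that come after everything it was trained on.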

### Regex Tagger

In [42]:
# make patterns for regex tagging
patterns = [
    (r'.*ing$', 'VBG'),                # gerunds
    (r'.*ed$', 'VBD'),                 # simple past
    (r'.*es$', 'VBZ'),                 # 3rd singular present
    (r'.*ould$', 'MD'),                # modals
    (r'.*\'s$', 'NN$'),                # possessive nouns
    (r'.*s$', 'NNS'),                  # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*', 'NN')                      # nouns (default)
]

regex_tagger = nltk.RegexpTagger(patterns)
regex_tagger.tag(train_sents[0])
# .evaluate() is deprecated; use .accuracy() instead
# note: this cell scores the tagger on train_tagged_sents, not the test split
print('Accuracy on training set: {:.2%}'.format(regex_tagger.accuracy(train_tagged_sents)))

Accuracy on training set: 20.62%
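
The patterns are tried in order and the first match wins, so their ordering matters. The sketch below mimics that lookup with the standard `re` module. It is an illustration only, not NLTK's implementation, and `first_match_tag` is a hypothetical helper name.

```python
import re

patterns = [
    (r'.*ing$', 'VBG'),                # gerunds
    (r'.*ed$', 'VBD'),                 # simple past
    (r'.*es$', 'VBZ'),                 # 3rd singular present
    (r'.*ould$', 'MD'),                # modals
    (r'.*\'s$', 'NN$'),                # possessive nouns
    (r'.*s$', 'NNS'),                  # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*', 'NN')                      # nouns (default)
]


def first_match_tag(word, patterns):
    """Return the tag of the first pattern that matches word, else None."""
    for regexp, tag in patterns:
        if re.match(regexp, word):
            return tag
    return None


for word in ['running', 'churches', 'boys', "Mary's", '3.14', 'table']:
    print(word, first_match_tag(word, patterns))
```

Here `churches` is tagged `VBZ` rather than `NNS` because `.*es$` is listed before `.*s$`. Suffix rules this coarse mislabel many words, which helps explain the low accuracy of the regexp tagger on its own.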


### Unigram Taggers

In [43]:
# quick train and test on tagged sents
unigram_tagger = nltk.UnigramTagger(train_tagged_sents)
print('Accuracy on test set: {:.2%}'.format(unigram_tagger.accuracy(test_tagged_sents)))

Accuracy on test set: 78.14%


### Bigram Taggers

In [44]:
bigram_tagger = nltk.BigramTagger(train_tagged_sents)
print('Accuracy on test set: {:.2%}'.format(bigram_tagger.accuracy(test_tagged_sents)))

Accuracy on test set: 8.37%


### Trigram Taggers

In [45]:
trigram_tagger = nltk.TrigramTagger(train_tagged_sents)
print('Accuracy on test set: {:.2%}'.format(trigram_tagger.accuracy(test_tagged_sents)))

Accuracy on test set: 5.35%


### Simple Backoff

In [47]:
with_backoff = nltk.UnigramTagger(train_tagged_sents, backoff = nltk.DefaultTagger('NN'))
print('Accuracy on test set: {:.2%}'.format(with_backoff.accuracy(test_tagged_sents)))

Accuracy on test set: 81.43%


## Different Combinations of Taggers

*Create three different combinations of the taggers. Test the accuracy of each combined tagger. Which combination works best?*

**Combo 1** will be a unigram tagger with the regexp tagger as backoff.

In [48]:
# combo 1 
regex_backoff = nltk.UnigramTagger(train_tagged_sents, backoff = nltk.RegexpTagger(patterns))
print('Accuracy on test set: {:.2%}'.format(regex_backoff.accuracy(test_tagged_sents)))

Accuracy on test set: 83.75%


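**Combo 2** will be a unigram tagger with the regexp tagger as backoff and a default tagger as backoff for the regexp tagger. Because the last pattern, `.*`, matches every token, the regexp tagger always returns a tag, so the default tagger is never consulted and the accuracy should match Combo 1.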

In [49]:
# combo 2
combo2 = nltk.UnigramTagger(train_tagged_sents, backoff = nltk.RegexpTagger(patterns, backoff = nltk.DefaultTagger('NN')))
print('Accuracy on test set: {:.2%}'.format(combo2.accuracy(test_tagged_sents)))

Accuracy on test set: 83.75%


**Combo 3** will be a bigram tagger with a unigram tagger as backoff and the regexp tagger as backoff for the unigram tagger.

In [50]:
# combo 3
combo3 = nltk.BigramTagger(train_tagged_sents, backoff = nltk.UnigramTagger(train_tagged_sents, backoff = nltk.RegexpTagger(patterns)))
print('Accuracy on test set: {:.2%}'.format(combo3.accuracy(test_tagged_sents)))

Accuracy on test set: 84.50%


**Combo 4** will be a trigram tagger with Combo 3 as the backoff.

In [51]:
# combo 4
combo4 = nltk.TrigramTagger(train_tagged_sents, backoff = nltk.BigramTagger(train_tagged_sents, backoff = nltk.UnigramTagger(train_tagged_sents, backoff = nltk.RegexpTagger(patterns))))
print('Accuracy on test set: {:.2%}'.format(combo4.accuracy(test_tagged_sents)))

Accuracy on test set: 84.44%
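
The nested constructor calls above all follow one pattern: each tagger wraps the previous one as its backoff. A small helper can build such a chain from a list. This is a sketch under the assumption that every class accepts `(train, backoff=...)`, as `nltk.UnigramTagger`, `nltk.BigramTagger` and `nltk.TrigramTagger` do. `build_backoff_chain` is a hypothetical name, not an NLTK function.

```python
def build_backoff_chain(train, base_tagger, tagger_classes):
    """Wrap base_tagger in each class in turn, innermost first.

    The last class in tagger_classes becomes the outermost tagger,
    i.e. the one that is consulted first when tagging.
    """
    tagger = base_tagger
    for cls in tagger_classes:
        tagger = cls(train, backoff=tagger)
    return tagger


# Assumed usage with the names from the cells above (needs NLTK):
# combo4 = build_backoff_chain(
#     train_tagged_sents,
#     nltk.RegexpTagger(patterns),
#     [nltk.UnigramTagger, nltk.BigramTagger, nltk.TrigramTagger],
# )
```

Listing the classes from least to most specific keeps the order of the chain easy to read and makes it simple to try other combinations.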


Combo 3 was the most accurate tagger at 84.50%, narrowly ahead of Combo 4 at 84.44%. This is consistent with the textbook's observation: "[As n gets larger, the specificity of the contexts increases, as does the chance that the data we wish to tag contains contexts that were not present in the training data. This is known as the sparse data problem, and is quite pervasive in NLP. As a consequence, there is a trade-off between the accuracy and the coverage of our results (and this is related to the precision/recall trade-off in information retrieval).](https://www.nltk.org/book/ch05.html#:~:text=As%20n%20gets,in%20information%20retrieval)". Adding a trigram layer on top of Combo 3 did not help here, which suggests that stacking ever higher-order taggers as backoffs will not keep improving performance on a training set of this size.
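
The sparse data problem also explains why the bigram and trigram taggers score so poorly without backoff: once a context is unseen, the tagger outputs `None`, and that `None` becomes part of the next context, which was never seen in training either. The toy lookup tagger below reproduces this effect with the standard library only. It is a simplified stand-in for illustration, not NLTK's implementation, and `ContextTagger` is a hypothetical name.

```python
from collections import Counter, defaultdict


class ContextTagger:
    """Toy n-gram tagger: maps (previous n-1 tags, word) to its most frequent tag."""

    def __init__(self, train, n, backoff=None):
        self.n = n
        self.backoff = backoff
        counts = defaultdict(Counter)
        for sent in train:
            tags = [tag for _, tag in sent]
            for i, (word, tag) in enumerate(sent):
                counts[self._context(word, tags, i)][tag] += 1
        self.table = {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

    def _context(self, word, history, i):
        # Context is the previous n-1 tags plus the current word.
        return (tuple(history[max(0, i - self.n + 1):i]), word)

    def tag_one(self, word, history, i):
        tag = self.table.get(self._context(word, history, i))
        if tag is None and self.backoff is not None:
            tag = self.backoff.tag_one(word, history, i)
        return tag

    def tag(self, words):
        history = []
        for i, word in enumerate(words):
            history.append(self.tag_one(word, history, i))
        return list(zip(words, history))


train = [
    [('the', 'AT'), ('dog', 'NN'), ('barks', 'VBZ')],
    [('the', 'AT'), ('cat', 'NN'), ('sleeps', 'VBZ')],
]
words = ['the', 'bird', 'barks']  # 'bird' never occurs in training

bigram_only = ContextTagger(train, n=2)
print(bigram_only.tag(words))

with_backoff = ContextTagger(train, n=2, backoff=ContextTagger(train, n=1))
print(with_backoff.tag(words))
```

Without backoff, the unseen word `bird` also costs the tagger the following word `barks`, even though `barks` was in the training data. With a unigram backoff, `barks` is recovered and only `bird` stays untagged.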

## Varying Size 

*Try varying the size of the training corpus. How does it affect your results?*

In [58]:
import numpy as np
def train_size_increase():
    for train_size in np.arange(0.65,1, 0.05):
        train_tagged_sents, test_tagged_sents, _, _ = split_sents(brown_tagged_sents, brown_sents, train_size = train_size)
        tagger = nltk.BigramTagger(train_tagged_sents, backoff = nltk.UnigramTagger(train_tagged_sents, backoff = nltk.RegexpTagger(patterns)))
        print('=============================================================================\n')
        print('For train size = {:.0%}'.format(train_size))
        print('Accuracy on test set: {:.2%}'.format(tagger.accuracy(test_tagged_sents)))
        print('\n=============================================================================\n')

In [59]:
train_size_increase()


For train size = 65%
Accuracy on test set: 84.81%

For train size = 70%
Accuracy on test set: 85.51%

For train size = 75%
Accuracy on test set: 86.05%

For train size = 80%
Accuracy on test set: 85.81%

For train size = 85%
Accuracy on test set: 86.56%

For train size = 90%
Accuracy on test set: 86.50%

For train size = 95%
Accuracy on test set: 88.47%




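The best test accuracy (88.47%) comes from training on 95% of the data, and accuracy generally rises with training size, although not monotonically: it dips slightly at 80% and 90%. This makes sense because a larger training set covers more words and tag contexts, so the n-gram taggers need to fall back to the regexp tagger less often. Note that the test set shrinks as the training share grows, so the scores at the largest training sizes are computed on fewer sentences and are noisier.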
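
Since `split_sents` gives every training size a different test set, the scores above are not measured on the same sentences. One alternative, sketched below, is to hold out a fixed tail of the corpus for testing and vary only how much of the remainder is used for training. `fixed_test_splits` and its parameters are hypothetical names for illustration, and this variant was not run for the results above.

```python
def fixed_test_splits(sents, test_fraction=0.1, train_fractions=(0.25, 0.5, 0.75, 1.0)):
    """Yield (fraction, train, test) with the same test set every time.

    The last test_fraction of sents is held out. Each train set is a
    leading slice of the remaining pool, so larger train sets contain
    the smaller ones.
    """
    test_n = int(len(sents) * test_fraction)
    split_at = len(sents) - test_n
    pool, test = sents[:split_at], sents[split_at:]
    for frac in train_fractions:
        train_n = int(len(pool) * frac)
        yield frac, pool[:train_n], test


# Toy run on integers standing in for tagged sentences.
for frac, train, test in fixed_test_splits(list(range(100))):
    print(frac, len(train), len(test))

# Assumed usage with the names from the cells above (needs NLTK):
# for frac, train, test in fixed_test_splits(brown_tagged_sents):
#     tagger = nltk.BigramTagger(train, backoff=nltk.UnigramTagger(
#         train, backoff=nltk.RegexpTagger(patterns)))
#     print(frac, tagger.accuracy(test))
```

With a constant test set, any change in accuracy can be attributed to the amount of training data rather than to which sentences happened to land in the test split.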

## Conclusion

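Above we implemented several taggers on the news category of the Brown corpus. N-gram taggers are a powerful tool, but only when combined with backoff: on their own, the bigram and trigram taggers scored 8.37% and 5.35%, while a bigram tagger backing off to a unigram tagger and then a regexp tagger was the most effective combination at 84.50%. Training size also affects accuracy, which rose from 84.81% to 88.47% as the training share grew from 65% to 95%, so giving the taggers plenty of good data is important.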