# Assignment 4

John Flanigan

October 7, 2019

In [1]:
import nltk

## Exercise 5.18

Generate some statistics for tagged data to answer the following questions:

a. What proportion of word types are always assigned the same part-of-speech tag?

In [2]:
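tag_dict = {}

for word, tag in nltk.corpus.brown.tagged_words(tagset='universal'):
    if word in tag_dict:
        tag_dict[word].add(tag)
    else:
        # {tag} stores the whole tag; set(tag) would split 'NOUN' into 'N', 'O', 'U'
        tag_dict[word] = {tag}

same_part_of_speech_count = len([word for word, tags in tag_dict.items() if len(tags) == 1])
same_part_of_speech_count / len(tag_dict)

0.012451611752323528

__Note: The outputs recorded for parts a, b and c come from a run in which new entries were created with `set(tag)`. That call splits the tag string into its characters, so nearly every word appeared to carry several tags. With `{tag}` the proportion in part a is much higher and the figures in parts b and c are lower; the cells have to be rerun to obtain the actual values. The figure recorded for part b is also a proportion of word types rather than a count.__

b. How many words are ambiguous, in the sense that they appear with at least two tags?

In [3]:
# A set keeps the membership test in part c fast
ambiguous_words = {word for word, tags in tag_dict.items() if len(tags) > 1}

# The question asks for a count of word types
len(ambiguous_words)

0.9875483882476764

c. What percentage of word tokens in the Brown Corpus involve these ambiguous words?

In [4]:
brown_words = nltk.corpus.brown.words()
count = 0
for word in brown_words:
    if word in ambiguous_words:
        count += 1
# Fraction of tokens; multiply by 100 for a percentage
count / len(brown_words)

0.9350675857222578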
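
The bookkeeping for parts a to c can be checked on a small hand-made sample before running it on the full corpus. The sketch below is a minimal illustration, assuming a hypothetical `toy_tagged` list in place of `nltk.corpus.brown.tagged_words(tagset='universal')`; the variable names are illustrative and not part of NLTK.

```python
from collections import defaultdict

# Hypothetical sample standing in for the Brown tagged words
toy_tagged = [
    ('the', 'DET'), ('run', 'NOUN'), ('was', 'VERB'), ('long', 'ADJ'),
    ('they', 'PRON'), ('run', 'VERB'), ('the', 'DET'), ('long', 'ADJ'),
]

# Map each word type to the set of whole tags it occurs with
tags_by_word = defaultdict(set)
for word, tag in toy_tagged:
    tags_by_word[word].add(tag)

unambiguous = {w for w, tags in tags_by_word.items() if len(tags) == 1}
ambiguous = {w for w, tags in tags_by_word.items() if len(tags) > 1}

# a. proportion of word types with a single tag
proportion_unambiguous = len(unambiguous) / len(tags_by_word)
# b. number of ambiguous word types
ambiguous_type_count = len(ambiguous)
# c. share of tokens that are occurrences of an ambiguous word
ambiguous_token_share = sum(1 for w, _ in toy_tagged if w in ambiguous) / len(toy_tagged)

print(proportion_unambiguous, ambiguous_type_count, ambiguous_token_share)
```

Only 'run' occurs with two tags in this sample, so four of the five word types are unambiguous and two of the eight tokens belong to an ambiguous word.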

## Exercise 5.19

The evaluate() method works out how accurately the tagger performs on this text. For example, if the supplied tagged text was [('the', 'DT'), ('dog', 'NN')] and the tagger produced the output [('the', 'NN'), ('dog', 'NN')], then the score would be 0.5. Let's try to figure out how the evaluation method works:

a. A tagger t takes a list of words as input, and produces a list of tagged words as output. However, t.evaluate() is given correctly tagged text as its only parameter. What must it do with this input before performing the tagging?

__Answer: It must first retrieve the words from the correctly tagged text and then re-tag these words.__

b. Once the tagger has created newly tagged text, how might the evaluate() method go about comparing it with the original tagged text and computing the accuracy score?

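__Answer: Because the tagger assigns exactly one tag to every token, the two sequences have the same length and can be compared position by position. Evaluate would most likely count the positions where the new tag equals the original tag and divide that count by the total number of tokens. In the example above, one of the two tags matches, which gives 0.5. Separate counts of false positives and false negatives are not needed for this score.__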
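
The two steps described in parts a and b can be sketched in plain Python. This is a minimal illustration of one plausible approach, not the NLTK implementation: `untag` and `tagging_accuracy` are hypothetical helpers defined here, and the sample data is the example from the exercise text.

```python
def untag(tagged_sentence):
    """Drop the tags and keep the words (hypothetical helper for this sketch)."""
    return [word for word, _tag in tagged_sentence]


def tagging_accuracy(gold, predicted):
    """Fraction of positions where the predicted (word, tag) pair equals the gold pair."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    matches = sum(1 for g, p in zip(gold, predicted) if g == p)
    return matches / len(gold)


gold = [('the', 'DT'), ('dog', 'NN')]
words = untag(gold)                          # step a: recover the plain words
predicted = [('the', 'NN'), ('dog', 'NN')]   # step b: output of some tagger on `words`

score = tagging_accuracy(gold, predicted)
print(words, score)
```

Only the tag for 'dog' agrees, so the score is 1 match out of 2 tokens, the 0.5 quoted in the exercise.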

c. Now examine the source code to see how the method is implemented. Inspect nltk.tag.api.__file__ to discover the location of the source code, and open this file using an editor (be sure to use the api.py file and not the compiled api.pyc binary file).

In [5]:
nltk.tag.api.__file__

'//anaconda3/lib/python3.7/site-packages/nltk/tag/api.py'

In [6]:
with open(nltk.tag.api.__file__, 'r') as f:  # use the discovered path instead of a machine-specific one
    print(f.read())

# Natural Language Toolkit: Tagger Interface
#
# Copyright (C) 2001-2019 NLTK Project
# Author: Edward Loper <edloper@gmail.com>
#         Steven Bird <stevenbird1@gmail.com> (minor additions)
# URL: <http://nltk.org/>
# For license information, see LICENSE.TXT

"""
Interface for tagging each token in a sentence with supplementary
information, such as its part of speech.
"""
from abc import ABCMeta, abstractmethod
from itertools import chain

from six import add_metaclass

from nltk.internals import overridden
from nltk.metrics import accuracy
from nltk.tag.util import untag


@add_metaclass(ABCMeta)
class TaggerI(object):
    """
    A processing interface for assigning a tag to each token in a list.
    Tags are case sensitive strings that identify some property of each
    token, such as its part of speech or its sense.

    Some taggers require specific types for their tokens.  This is
    generally indicated by the use of a sub-interface to ``TaggerI``.
    For example, featureset

## Exercise 5.20

Write code to search the Brown Corpus for particular words and phrases according to tags, to answer the following questions:

a. Produce an alphabetically sorted list of the distinct words tagged as MD.

In [7]:
md_words = set([word.lower() for word, tag in nltk.corpus.brown.tagged_words() if tag == 'MD'])
sorted(md_words)

["c'n",
 'can',
 'colde',
 'could',
 'dare',
 'kin',
 'maht',
 'mai',
 'may',
 'maye',
 'mayst',
 'might',
 'must',
 'need',
 'ought',
 'shall',
 'should',
 'shuld',
 'shulde',
 'wil',
 'will',
 'wilt',
 'wod',
 'wold',
 'wolde',
 'would']

b. Identify words that can be plural nouns or third person singular verbs (e.g. deals, flies).

In [8]:
plural_nouns = set([word.lower() for word, tag in nltk.corpus.brown.tagged_words() if tag == 'NNS'])
third_person_singular_verbs = set([word.lower() for word, tag in nltk.corpus.brown.tagged_words() if tag == 'VBZ'])

combined = [word for word in plural_nouns if word in third_person_singular_verbs]
sorted(combined)

['accounts',
 'acts',
 'addresses',
 'advances',
 'affects',
 'aids',
 'aims',
 'amounts',
 'answers',
 'appeals',
 'approaches',
 'arches',
 'assaults',
 'associates',
 'attacks',
 'attempts',
 'attributes',
 'backs',
 'bangs',
 'banks',
 'bargains',
 'bars',
 'bases',
 'bats',
 'beats',
 'bellows',
 'belts',
 'bends',
 'benefits',
 'bites',
 'blankets',
 'blots',
 'blows',
 'blueprints',
 'boards',
 'bodies',
 'boils',
 'borders',
 'bores',
 'bottles',
 'bows',
 'breaks',
 'bridges',
 'bristles',
 'bubbles',
 'bugs',
 'bulletins',
 'bullies',
 'burns',
 'butts',
 'calls',
 'caps',
 'captures',
 'cares',
 'casts',
 'catches',
 'causes',
 'censors',
 'centers',
 'challenges',
 'champions',
 'changes',
 'charges',
 'checks',
 'claims',
 'claps',
 'clicks',
 'clouds',
 'clucks',
 'clutches',
 'colors',
 'commands',
 'comments',
 'compounds',
 'compresses',
 'concentrates',
 'concerns',
 'conducts',
 'conflicts',
 'contacts',
 'contracts',
 'contrasts',
 'controls',
 'coordinates',
 'cost
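
The membership test in the cell above is a set intersection. A minimal sketch on a hypothetical hand-tagged sample (the `toy_tagged` list is made up for illustration and stands in for `nltk.corpus.brown.tagged_words()`):

```python
# Hypothetical sample using the Brown tags NNS (plural noun) and VBZ (3rd person singular verb)
toy_tagged = [
    ('deals', 'NNS'), ('deals', 'VBZ'), ('flies', 'NNS'),
    ('Flies', 'VBZ'), ('dogs', 'NNS'), ('runs', 'VBZ'),
]

plural_nouns = {word.lower() for word, tag in toy_tagged if tag == 'NNS'}
third_person_verbs = {word.lower() for word, tag in toy_tagged if tag == 'VBZ'}

# Words seen with both tags
both = sorted(plural_nouns & third_person_verbs)
print(both)
```

Intersecting two sets avoids scanning one collection once per word of the other.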

c. Identify three-word prepositional phrases of the form IN + DET + NN (e.g. in the lab).

In [9]:
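tagged_words = nltk.corpus.brown.tagged_words()
phrases = []

for (w1,t1), (w2,t2), (w3,t3) in nltk.trigrams(tagged_words):
    # The default Brown tagset marks articles as 'AT'; 'DET' exists only in the universal tagset
    if t1 == 'IN' and t2 == 'AT' and t3 == 'NN':
        phrases.append([w1, w2, w3])

phrases

__Note: Testing the second tag against 'DET' returned an empty list, because the default Brown tagset has no DET tag. Articles such as "the" are tagged AT, and other determiners such as "this" are tagged DT. Testing for 'AT' matches phrases of the form given in the example.__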
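
Matching a fixed tag sequence can be tried on a short hand-tagged sentence first, which makes it easy to see which tag names match. The sketch below assumes a hypothetical helper `tag_pattern_matches` and a made-up sample tagged in the Brown style; neither comes from NLTK.

```python
def tag_pattern_matches(tagged, pattern):
    """Return the word tuples whose consecutive tags equal `pattern`."""
    n = len(pattern)
    # zip over n shifted copies of the list yields every window of n consecutive items
    windows = zip(*(tagged[i:] for i in range(n)))
    return [
        tuple(word for word, _ in window)
        for window in windows
        if tuple(tag for _, tag in window) == tuple(pattern)
    ]


# Hypothetical sample; 'AT' is the Brown tag for articles
toy_tagged = [
    ('She', 'PPS'), ('worked', 'VBD'), ('in', 'IN'), ('the', 'AT'), ('lab', 'NN'),
    ('on', 'IN'), ('a', 'AT'), ('bench', 'NN'), ('.', '.'),
]

with_at = tag_pattern_matches(toy_tagged, ('IN', 'AT', 'NN'))
with_det = tag_pattern_matches(toy_tagged, ('IN', 'DET', 'NN'))
print(with_at)
print(with_det)
```

The 'AT' pattern finds both phrases in the sample, while the 'DET' pattern finds none because no token in it carries that tag.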

d. What is the ratio of masculine to feminine pronouns?

In [10]:
[word.lower() for word, tag in nltk.corpus.brown.tagged_words(tagset='universal') if tag == 'PRON']

['it',
 'it',
 'them',
 'them',
 'it',
 'it',
 'it',
 'it',
 'they',
 'we',
 'it',
 'that',
 'it',
 'that',
 'they',
 'it',
 'he',
 'who',
 'he',
 'it',
 'who',
 'it',
 'it',
 'he',
 'who',
 'it',
 'they',
 'it',
 'itself',
 'it',
 'who',
 'who',
 'it',
 'he',
 'him',
 'it',
 'it',
 'i',
 'who',
 'i',
 'i',
 'we',
 'it',
 'himself',
 'he',
 'he',
 'he',
 'himself',
 'he',
 'it',
 'they',
 'it',
 'it',
 'he',
 'it',
 'he',
 'it',
 'he',
 'it',
 'you',
 'you',
 'he',
 'you',
 'it',
 'he',
 'it',
 'he',
 'it',
 'them',
 'he',
 'he',
 'i',
 'them',
 'it',
 'it',
 'it',
 'who',
 'it',
 'who',
 'he',
 'it',
 'who',
 'it',
 'who',
 'they',
 'who',
 'who',
 'he',
 'he',
 'we',
 'he',
 'he',
 'they',
 'they',
 'it',
 'themselves',
 'you',
 'he',
 'whom',
 'he',
 'i',
 'who',
 'he',
 'he',
 'he',
 'them',
 'it',
 'he',
 'who',
 'who',
 'he',
 'them',
 'who',
 'it',
 'it',
 'who',
 'who',
 'he',
 'who',
 'who',
 'he',
 'it',
 'they',
 'he',
 'they',
 'he',
 'he',
 'he',
 'he',
 'he',
 'he',
 'he'

__Note: Because I could not find a way to categorize pronouns automatically, I printed all pronouns from the Brown Corpus and categorized the masculine and feminine ones manually.__

In [11]:
masculine_pronouns = ['he', 'him', 'himself', 'himselfe', 'his', 'hiself', 'hym', 'hymselfe']
feminine_pronouns = ['her', 'hers', 'herself', 'she']

In [12]:
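universal_tagged = nltk.corpus.brown.tagged_words(tagset='universal')
# Lowercase each token so that sentence-initial forms such as "He" and "She" are counted
masculine_count = len([word for word, tag in universal_tagged if word.lower() in masculine_pronouns])
feminine_count = len([word for word, tag in universal_tagged if word.lower() in feminine_pronouns])
masculine_count / feminine_count

3.258492462311558

__Note: The ratio above was recorded with a case-sensitive comparison, which skips capitalized forms such as "He" and "She". The cell has to be rerun with the lowercased comparison to update it.__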
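
The counting step can be sketched with `collections.Counter` on a short made-up token list. The pronoun sets below keep only the modern spellings from the lists above, and `gender_of` is a hypothetical helper introduced for this illustration.

```python
from collections import Counter

MASCULINE = {'he', 'him', 'himself', 'his'}
FEMININE = {'she', 'her', 'hers', 'herself'}

# Hypothetical tokens standing in for the Brown Corpus words
toy_words = ['He', 'said', 'she', 'saw', 'him', 'and', 'his', 'dog', '.', 'She', 'left', '.']


def gender_of(word):
    """Classify a token as 'masculine', 'feminine', or None, ignoring case."""
    lowered = word.lower()
    if lowered in MASCULINE:
        return 'masculine'
    if lowered in FEMININE:
        return 'feminine'
    return None


counts = Counter(g for g in map(gender_of, toy_words) if g is not None)
ratio = counts['masculine'] / counts['feminine']
print(counts, ratio)
```

Lowercasing inside `gender_of` means 'He' and 'She' are counted along with their lowercase forms, giving three masculine and two feminine tokens in this sample.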