Credit for organizing this Jupyter Notebook goes to Brian Reese (University of Minnesota Professor)

Credit for the last few code chunks at the bottom of the Jupyter Notebook goes to me

In [47]:
import nltk
from nltk.corpus import brown
from pprint import pprint

nltk.download('brown')
# String-formatting function:
def pairs2str(sent):
    strings = []
    for (word, tag) in sent:
        if tag is None:
            strings.append(word + "/" + "UNK")
        else:
            strings.append(word + "/" + tag)
    return " ".join(strings)

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


# NLTK's tagged corpora

Many of the corpora included with the NLTK library can be accessed with part of speech information for word tokens.  For example, the methods below (note that I have imported the `brown` corpus from NLTK's `corpus` module) access the Brown corpus as a list of words and a list of sentences, respectively.  A sentence is represented as a list of individual tokens.  Each token is represented as a **tuple** consisting of the **word form** and its **part of speech label**.  

- `brown.tagged_words()`
- `brown.tagged_sents()`

The code block below saves the part-of-speech tagged words of the Brown corpus to the list `words` and prints the first five tokens. The pair `('The', 'AT')`, for example, indicates that the first instance of the word `'The'` is assigned the part-of-speech label `AT`.

In [48]:
words = brown.tagged_words()
pprint(words[:5])

[('The', 'AT'),
 ('Fulton', 'NP-TL'),
 ('County', 'NN-TL'),
 ('Grand', 'JJ-TL'),
 ('Jury', 'NN-TL')]


The next code block saves each sentence of the Brown corpus to the list `sents` and prints the first item in the list, itself a list of word-POS tuples.  

In [49]:
sents = brown.tagged_sents()
pprint(sents[0])

[('The', 'AT'),
 ('Fulton', 'NP-TL'),
 ('County', 'NN-TL'),
 ('Grand', 'JJ-TL'),
 ('Jury', 'NN-TL'),
 ('said', 'VBD'),
 ('Friday', 'NR'),
 ('an', 'AT'),
 ('investigation', 'NN'),
 ('of', 'IN'),
 ("Atlanta's", 'NP$'),
 ('recent', 'JJ'),
 ('primary', 'NN'),
 ('election', 'NN'),
 ('produced', 'VBD'),
 ('``', '``'),
 ('no', 'AT'),
 ('evidence', 'NN'),
 ("''", "''"),
 ('that', 'CS'),
 ('any', 'DTI'),
 ('irregularities', 'NNS'),
 ('took', 'VBD'),
 ('place', 'NN'),
 ('.', '.')]


It is common to represent a POS-tagged word as a string `word/POS` and to represent a sentence as a sequence of such strings.  The function `pairs2str`, defined above, takes a list of word-POS tuples and returns them as a single string in this format (a token whose tag is `None` is written with the placeholder tag `UNK`).

In [50]:
print(pairs2str(sents[5]))

It/PPS recommended/VBD that/CS Fulton/NP legislators/NNS act/VB ``/`` to/TO have/HV these/DTS laws/NNS studied/VBN and/CC revised/VBN to/IN the/AT end/NN of/IN modernizing/VBG and/CC improving/VBG them/PPO ''/'' ./.
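
The `word/POS` format is also easy to read back in. As a minimal sketch (the helper name `str2pairs` is hypothetical and not part of the original notebook), the string can be split on whitespace and each token split at its last slash, so that a word that itself contains a slash stays intact:

```python
def str2pairs(tagged_string):
    """Convert a 'word/POS word/POS ...' string back into (word, tag) tuples."""
    pairs = []
    for token in tagged_string.split():
        # Split at the LAST slash so a word such as '1/2' keeps its own slash.
        word, tag = token.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs


print(str2pairs("It/PPS recommended/VBD that/CS Fulton/NP legislators/NNS"))
```

Note that the round trip is not exact for tokens that `pairs2str` wrote with the `UNK` placeholder: they come back with the string `'UNK'` rather than `None`.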


# Help

NLTK provides a convenient `help` module with information on the two most commonly used tagsets, the Brown tagset and the UPenn Treebank tagset.  If you do not provide an argument to one of its functions, it prints the full list of tags in the corresponding tagset.  Note that the argument is not just a literal string but a **regular expression**, so a single call can display several related tags.

In [None]:
nltk.help.brown_tagset('RB')

RB: adverb
    only often generally also nevertheless upon together back newly no
    likely meanwhile near then heavily there apparently yet outright fully
    aside consistently specifically formally ever just ...


In [None]:
nltk.help.upenn_tagset('JJ*')

JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...
JJR: adjective, comparative
    bleaker braver breezier briefer brighter brisker broader bumper busier
    calmer cheaper choosier cleaner clearer closer colder commoner costlier
    cozier creamier crunchier cuter ...
JJS: adjective, superlative
    calmest cheapest choicest classiest cleanest clearest closest commonest
    corniest costliest crassest creepiest crudest cutest darkest deadliest
    dearest deepest densest dinkiest ...
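
Because the argument is read as a regular expression, `'JJ*'` does not mean "JJ followed by anything", as a shell wildcard would; it means a `J` followed by zero or more further `J`s. The sketch below illustrates this on a hand-made tag list. It assumes that a pattern is matched against the beginning of each tag name, which is an assumption about NLTK's behaviour rather than something stated in this notebook. The names `toy_tagset` and `matching_tags` are hypothetical.

```python
import re

# Hand-made stand-in for a tagset: tag name -> short description.
toy_tagset = {
    "JJ": "adjective or numeral, ordinal",
    "JJR": "adjective, comparative",
    "JJS": "adjective, superlative",
    "RB": "adverb",
}


def matching_tags(pattern, tagset):
    """Return the tag names whose beginning matches the regular expression."""
    regex = re.compile(pattern)
    return sorted(tag for tag in tagset if regex.match(tag))


print(matching_tags("JJ*", toy_tagset))  # 'J' followed by zero or more 'J'
print(matching_tags("JJ$", toy_tagset))  # exactly 'JJ'
print(matching_tags("JJ.", toy_tagset))  # 'JJ' plus one more character
```

Under that assumption, `'JJ*'` selects every tag that starts with `J`, which is why the call above lists `JJ`, `JJR`, and `JJS` together; anchoring with `$` narrows the match to a single tag.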


How hard is automatic part of speech tagging?  A recurring theme of the course is that one of the main challenges of NLP is dealing effectively with the ambiguity inherent to natural languages.  This is true for part-of-speech tagging, as well.  Many word forms occur with more than one part of speech label in a given corpus. For example, the word "bank" can be either a noun or a verb.

We can use a part-of-speech tagged corpus to get a sense of the extent of this ambiguity by answering a few questions; for example:

1. How many word types are unambiguous in the sense that they occur with only one part of speech tag in the corpus?
2. How many word types are ambiguous, that is, occur with two or more distinct part of speech tags in the corpus?
3. How many of the individual tokens in the text are instances of (un)ambiguous word types?
4. How ambiguous are ambiguous word types? That is, for each integer $n$, how many word types appear with $n$ distinct part of speech labels? What is the highest $n$?

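Before answering these questions, let's clean up the corpus by normalizing the text (we perform **case folding** here) and simplifying the tags a little: we keep only the part of each tag that precedes the first hyphen, so that `NN-TL`, for example, becomes `NN`.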

In [51]:
words = brown.tagged_words()
words = [(word.lower(), tag.split('-')[0]) for word, tag in words]
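
To see what this normalization does to individual tokens, here is a small sketch that applies the same two operations to a few pairs copied from the output above. The function name `normalize` and the list `sample` are introduced only for illustration.

```python
def normalize(word, tag):
    """Lowercase the word and keep only the tag text before the first hyphen."""
    return word.lower(), tag.split("-")[0]


sample = [("The", "AT"), ("Fulton", "NP-TL"), ("Jury", "NN-TL"), ("Atlanta's", "NP$")]
normalized = [normalize(word, tag) for word, tag in sample]
print(normalized)

# Edge case: a tag that begins with a hyphen is reduced to the empty string.
print(normalize("--", "--"))
```

The last line shows a side effect worth keeping in mind: if the corpus contains a tag that starts with a hyphen, splitting on `'-'` leaves an empty tag for it. Such tokens are still counted, only under the label `''`.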

What kind of data structure would be useful in answering the questions above?  

For each vocabulary item that occurs in the corpus, we need to know how many distinct part of speech labels it occurs with, so some sort of dictionary-like structure seems appropriate.  For example, we could build a dictionary that maps word forms to lists of the POS labels they occur with.  However, in order to answer the question about ambiguous *tokens*, we also need to know how often a word occurs with each of its associated part of speech labels.  Is there a single data structure we could use to answer all of the questions above?  

Yes, a conditional frequency distribution in which the conditions are the word forms that occur in the corpus!  Recall that a CFD associates an ordinary frequency distribution with each condition.  For our purposes, the object associated with a word form will be a frequency distribution over the part-of-speech labels that occur with that form.  If we pass NLTK's `ConditionalFreqDist` class a list of tuples, it returns a conditional frequency distribution in which the first element of each tuple is the condition and the second element is the observation for that condition.  Thus the data structure we need is obtained with the following block of code.

In [52]:
cfd = nltk.ConditionalFreqDist(words)
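
To make the structure concrete, here is a plain-Python analogue built with `collections` on a few hand-written pairs. It is only a sketch of the idea (conditions mapping to per-condition counts), not a description of how NLTK implements `ConditionalFreqDist`; the names `toy_pairs` and `toy_cfd` are made up for illustration.

```python
from collections import Counter, defaultdict

# Hand-written (word, tag) pairs standing in for the tagged corpus.
toy_pairs = [
    ("back", "RB"), ("back", "NN"), ("back", "RB"),
    ("the", "AT"),
    ("bank", "NN"), ("bank", "VB"),
]

# Each word form (the condition) maps to a counter over its tags (the observations).
toy_cfd = defaultdict(Counter)
for word, tag in toy_pairs:
    toy_cfd[word][tag] += 1

print(dict(toy_cfd["back"]))  # tag counts for one word form
print(len(toy_cfd))           # number of distinct word forms
```

A single pass over the pairs records both which tags a word occurs with and how often, which is exactly the information the type-level and token-level questions require.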

The following features of the object `cfd` should be useful in answering the questions below:

 - For each word form `w`, `cfd[w]` returns a frequency distribution over POS labels.  Try typing `cfd['back']` in the code block below.  By inspecting the resulting object you can see that the word form `back` occurs with four distinct parts of speech in our corpus: adverb (`RB`), singular common noun (`NN`), adjective (`JJ`), and base verb (`VB`).
 - The number of "bins" associated with a frequency distribution tells us how many POS labels a word form occurs with. The `.B()` method returns the number of bins associated with a frequency distribution.  Type `cfd['back'].B()` into the work space below.
 - The `.N()` method returns the total number of observations for a given condition.  Type `cfd['back'].N()` into the work space below.  This tells us how many times `back` occurred in the corpus.
 - `len(cfd)` tells you how many conditions are associated with the conditional frequency distribution, i.e. the number of distinct word types in the corpus.  Try it out in the space below.
 - `cfd.N()` returns the total number of word tokens in the corpus.  Again, try it out.
 

In [53]:
# Workspace code block:
cfd['back']

FreqDist({'JJ': 29, 'NN': 178, 'RB': 734, 'VB': 25})

In [54]:
# Workspace code block:
cfd['back'].B()

4

In [55]:
# Workspace code block:
cfd['back'].N()

966

In [56]:
# Workspace code block:
len(cfd)

49815

In [57]:
# Workspace code block:
cfd.N()

1161192
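
As a quick sanity check on how `.B()` and `.N()` relate to the counts printed for `cfd['back']`, the figures can be copied into a plain `Counter`. This is a stand-in used here for illustration, not the NLTK object itself.

```python
from collections import Counter

# Counts copied from the output of cfd['back'] above.
back = Counter({"JJ": 29, "NN": 178, "RB": 734, "VB": 25})

num_labels = len(back)           # plays the role of .B(): distinct tags
num_tokens = sum(back.values())  # plays the role of .N(): total occurrences
top_tag, top_count = back.most_common(1)[0]

print(num_labels)  # 4
print(num_tokens)  # 966
print(f"{top_tag} accounts for {top_count / num_tokens:.1%} of the occurrences")
```

The four tag counts add up to the 966 reported by `.N()`, and the adverb reading makes up roughly three quarters of them, so even a highly ambiguous word can have a clearly dominant tag.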

This is all you need to answer the questions above.  Fill in the code blocks below with the necessary code to answer each question.

## How many (un)ambiguous word types are there in the corpus?

How many of the distinct **word types** in the corpus are associated with a single part-of-speech label?  How many of them are associated with two or more part-of-speech labels?

In [58]:
# single part-of-speech label
distinct = 0

for cond in cfd:
  if cfd[cond].B() == 1:
    distinct += 1

print("There are", distinct, "word types with a single part-of-speech label")

There are 44196 word types with a single part-of-speech label


In [59]:
# two or more part-of-speech labels
distinct = 0

for cond in cfd:
  if cfd[cond].B() > 1:
    distinct += 1

print("There are", distinct, "word types with two or more part-of-speech labels")

There are 5619 word types with two or more part-of-speech labels


## How many of the word *tokens* in the corpus are instances of ambiguous word types?

In [60]:
ambiguous = 0
unambiguous = 0

for cond in cfd:
  if cfd[cond].B() > 1:
    ambiguous += cfd[cond].N()
  else:
    unambiguous += cfd[cond].N()

print("There are", ambiguous, "ambiguous word tokens.\nThere are", unambiguous, "unambiguous word tokens.")

There are 719031 ambiguous word tokens.
There are 442161 unambiguous word tokens.
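
The raw counts are easier to interpret as proportions. The short calculation below uses only the numbers printed above and confirms that the type and token counts add up to `len(cfd)` and `cfd.N()`; the variable names are introduced here for illustration.

```python
# Counts copied from the outputs above.
ambiguous_types, unambiguous_types = 5619, 44196
ambiguous_tokens, unambiguous_tokens = 719031, 442161

total_types = ambiguous_types + unambiguous_types     # should equal len(cfd)
total_tokens = ambiguous_tokens + unambiguous_tokens  # should equal cfd.N()

type_share = ambiguous_types / total_types
token_share = ambiguous_tokens / total_tokens

print(f"types:  {total_types} in total, {type_share:.1%} ambiguous")
print(f"tokens: {total_tokens} in total, {token_share:.1%} ambiguous")
```

Only about 11% of the word types are ambiguous, yet they account for about 62% of the tokens: the ambiguous types tend to be the frequent ones, which is what makes tagging non-trivial in running text.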


## How frequently do we see a word form with 1, 2, 3, ..., part of speech labels?
Hint: You may want to build a frequency distribution over integers $n$, where we are counting words that appear with $n$ distinct part-of-speech labels.  For example, `fd[3]` would return the number of word forms that appear with 3 distinct POS labels.  I have initialized an empty `FreqDist` object for you in the block below.  To increment the count for a given observation `obs`, use `fd[obs] += 1`.

Use the `.tabulate()` method of frequency distributions to display the results.

In [61]:
# Count how many word types occur with n distinct part-of-speech labels
fd = nltk.FreqDist()

for cond in cfd:
  # .B() is the number of distinct labels seen with this word form
  fd[cfd[cond].B()] += 1

fd.tabulate()


    1     2     3     4     5     6     7 
44196  5066   434    90    23     4     2 
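
The table can be cross-checked against the earlier answers. The sketch below copies the tabulated counts into a plain dictionary (the name `tag_count_distribution` is made up for illustration) and recovers the total number of word types, the number of ambiguous types, and the highest $n$.

```python
# n distinct labels -> number of word types, copied from the table above.
tag_count_distribution = {1: 44196, 2: 5066, 3: 434, 4: 90, 5: 23, 6: 4, 7: 2}

total_types = sum(tag_count_distribution.values())
ambiguous_types = sum(count for n, count in tag_count_distribution.items() if n > 1)
highest_n = max(tag_count_distribution)
mean_labels = sum(n * count for n, count in tag_count_distribution.items()) / total_types

print(total_types, ambiguous_types, highest_n)
print(f"mean labels per word type: {mean_labels:.3f}")
```

The totals agree with `len(cfd)` (49815) and with the 5619 ambiguous types counted earlier, and the highest number of distinct labels for any word type is 7.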
