## Part 1. Part-of-speech tagging

In this lab, we will start with part-of-speech (POS) tagging and chunking, which are often referred to as _shallow parsing_ techniques. You will see some ways of analyzing the contents of corpora using these methods.

First of all, you need to import the NLTK library and some of its resources. Run the code by pressing Ctrl-Enter inside the cell, as usual:

In [None]:
import sys
!{sys.executable} -m pip install nltk
import nltk
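# Newer NLTK releases look for 'punkt_tab' and 'averaged_perceptron_tagger_eng',
# older releases for the names without those suffixes, so we request both.
nltk.download(['punkt', 'punkt_tab', 'averaged_perceptron_tagger', 'averaged_perceptron_tagger_eng',
               'tagsets', 'gutenberg', 'brown'])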

### 1.1 Tokenization

In order to use a POS tagger, you need to work with tokenized text. That is, the input to the POS tagger must be a list of strings, where each string is a single word token. NLTK has a built-in word tokenizer, which can be used as follows:

In [None]:
sentence = "In order to use a POS tagger, you need to work with tokenized text."
tokenized = nltk.word_tokenize(sentence)
print(tokenized)

As you can see when you run the code above, punctuation marks are separated out as tokens of their own. Now it is your turn to **modify the sentence** and test the tokenization on some more challenging data:
* What happens to apostrophes and hyphens inside words, such as *part-of-speech* or *don't*?
* What happens to other types of punctuation, such as an ellipsis (`...`) or double quotes?
* What happens to abbreviations and titles in front of names, such as *Mr.* or *Prof.*, that end in a dot?
* Can you find examples in which the tokenizer fails to tokenize correctly?
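For comparison while you experiment, here is a deliberately naive tokenizer built from a single regular expression. It is only a sketch and is not how NLTK tokenizes; the names `NAIVE_TOKEN` and `naive_tokenize` are made up for this illustration. It keeps word-internal hyphens and apostrophes together and splits off every other punctuation character individually.

```python
import re

# Naive pattern (an illustration, not NLTK's algorithm):
#   - a run of word characters, optionally continued by a hyphen or
#     apostrophe followed by more word characters, or
#   - any single character that is neither a word character nor whitespace.
NAIVE_TOKEN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def naive_tokenize(text):
    """Split text into word-like tokens and single punctuation marks."""
    return NAIVE_TOKEN.findall(text)

sample = "Prof. Smith doesn't like part-of-speech tagging..."
print(naive_tokenize(sample))
```

Running `nltk.word_tokenize` on the same sentence and comparing the two outputs is a quick way to spot where the NLTK tokenizer makes more informed decisions, for instance with titles and ellipses.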

### 1.2 POS tagging

Next, apply the POS tagger that is available in NLTK on your tokenized text:

In [None]:
print(nltk.pos_tag(tokenized))

You should see a list of tuples, each pairing a word with its predicted POS tag. Did the tagger tag your sentence correctly?
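Because the result is an ordinary Python list of `(word, tag)` tuples, you can process it with plain list comprehensions. The sketch below uses a small hand-written list in place of real tagger output, so the tags shown are an assumption about what a tagger might plausibly return, not something NLTK produced.

```python
# Hand-written example in the same shape as the tagger's output:
# a list of (word, tag) tuples. These tags were typed in by hand.
tagged = [
    ("The", "DT"), ("cat", "NN"), ("sat", "VBD"),
    ("on", "IN"), ("the", "DT"), ("mat", "NN"), (".", "."),
]

# Unpack each tuple to get the words and the tags as separate lists.
words = [word for word, tag in tagged]
tags = [tag for word, tag in tagged]

# Keep only the words whose tag starts with "NN" (the noun tags).
nouns = [word for word, tag in tagged if tag.startswith("NN")]

print(words)
print(tags)
print(nouns)
```

The same comprehensions work unchanged on the list returned by `nltk.pos_tag(tokenized)`.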

If you prefer to see the tagging in a more readable form, you can use the following command: 

In [None]:
print([nltk.tuple2str(tup) for tup in nltk.pos_tag(tokenized)])

Or you can make it even cleaner, if you produce one single string as output:

In [None]:
print(" ".join(nltk.tuple2str(tup) for tup in nltk.pos_tag(tokenized)))

### 1.3 What do the POS tags mean?

In order to know whether the POS tagging is correct, you need to understand what the tags mean in the first place. Conveniently, you can ask NLTK for help on the meaning of specific tags:

In [None]:
nltk.help.upenn_tagset("JJ")

You can use regular expressions to find information about more tags at the same time:

In [None]:
nltk.help.upenn_tagset("NN.*")

Now, if you want information on all possible POS tags, find out what regular expression you should use.
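To build some intuition for how such patterns select tags, the sketch below applies regular expressions to a small hand-picked subset of Penn Treebank tag names. It assumes that a pattern is matched from the start of each tag name, as Python's `re.match` does; the list `PENN_SUBSET` and the function `matching_tags` are made up for this illustration and are not part of NLTK.

```python
import re

# A small, hand-picked subset of Penn Treebank tag names (not the full set).
PENN_SUBSET = ["JJ", "JJR", "JJS", "NN", "NNS", "NNP", "NNPS", "VB", "VBD", "VBG"]

def matching_tags(pattern, tags=PENN_SUBSET):
    """Return the tags whose beginning matches the regular expression."""
    regex = re.compile(pattern)
    # re.match anchors at the start of the string but not at the end.
    return [tag for tag in tags if regex.match(tag)]

print(matching_tags("NN.*"))    # all tags that start with NN
print(matching_tags("JJ[RS]"))  # JJ followed by R or S
```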

### 1.4 Disambiguation of POS tags in context

Next, take a moment to study how well the NLTK POS tagger handles words whose part of speech is ambiguous, in sentences such as _"They **refuse** to **permit** us to obtain the **refuse permit**."_ or _"**Can** I have a **can** of milk, please?"_

All the necessary code is collected in the cell below, so that you don't have to click through many cells each time you test a new sentence:

In [None]:
sentence = "Enter your sentence here!"
tokenized = nltk.word_tokenize(sentence)
pos_tagged = nltk.pos_tag(tokenized)
print(" ".join(nltk.tuple2str(tup) for tup in pos_tagged))
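When a sentence gets longer, it can help to list automatically which words received more than one tag. The following is a minimal sketch; the helper `ambiguous_words` is a made-up name, and the tagged sentence is written by hand to show the intended reading, so it is not actual tagger output.

```python
from collections import defaultdict

# Hand-written tags showing the intended reading of the example sentence.
tagged = [
    ("They", "PRP"), ("refuse", "VBP"), ("to", "TO"), ("permit", "VB"),
    ("us", "PRP"), ("to", "TO"), ("obtain", "VB"), ("the", "DT"),
    ("refuse", "NN"), ("permit", "NN"), (".", "."),
]

def ambiguous_words(tagged_tokens):
    """Map each word (lower-cased) that has several tags to its sorted tags."""
    tags_by_word = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_by_word[word.lower()].add(tag)
    return {word: sorted(tags) for word, tags in tags_by_word.items() if len(tags) > 1}

print(ambiguous_words(tagged))
```

Passing your own `pos_tagged` list to `ambiguous_words` shows whether the tagger actually distinguished the two uses of a word or gave both occurrences the same tag.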

### 1.5 Tokenized and tagged corpora in NLTK

NLTK contains some corpora that are already tokenized, such as the _Gutenberg corpus_. NLTK also contains some corpora that are both tokenized and tagged, such as the _Brown corpus_. When the POS tags are already available, it means that they have (in principle) been verified by linguists to be correct.

Let us first look at the Gutenberg corpus. It contains the following texts:

In [None]:
print(nltk.corpus.gutenberg.fileids())

To obtain a specific tokenized text from the Gutenberg corpus, proceed as shown below. Here, we print only the first 99 words of the text:

In [None]:
tokenized = nltk.corpus.gutenberg.words('austen-emma.txt')
print(tokenized[0:99])

How do you produce POS tags for the first 99 words of Jane Austen's _Emma_?

In [None]:
pos_tagged = # what goes here?

print(" ".join(nltk.tuple2str(tup) for tup in pos_tagged))

What do you think about the quality of the POS tagging?

In contrast to the Gutenberg corpus, the Brown corpus comes both tokenized and tagged. The code snippet below demonstrates how to obtain the word tokens, as well as the correctly POS-tagged word tokens, for the first 100 words of the Brown corpus:

In [None]:
tokenized_100 = nltk.corpus.brown.words()[0:100]
print("TOKENIZED:", list(tokenized_100))
print()

pos_tagged_100 = nltk.corpus.brown.tagged_words()[0:100]
print("POS TAGGED:", " ".join(nltk.tuple2str(tup) for tup in pos_tagged_100))

As you can see, the tags in the Brown corpus sometimes end in `-TL`, which indicates that the word occurs in a title. There are other suffixes as well: `-HL` marks a word in a headline and `-NC` marks a cited word. The tag set also differs in other respects from the one used by the NLTK POS tagger. An explanation of the POS tags in the pre-tagged Brown corpus can be found on [Wikipedia](https://en.wikipedia.org/wiki/Brown_Corpus).
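If the context suffixes get in the way when you inspect the tags, you can strip them to recover the base tag. The helper below is a small sketch with a made-up name, `strip_brown_suffixes`; it removes only the three suffixes mentioned above and handles the case where several of them are stacked, should that occur.

```python
import re

# One or more of the suffixes -TL, -HL, -NC at the very end of a tag.
_CONTEXT_SUFFIXES = re.compile(r"(?:-(?:TL|HL|NC))+$")

def strip_brown_suffixes(tag):
    """Return the tag without trailing -TL / -HL / -NC suffixes."""
    return _CONTEXT_SUFFIXES.sub("", tag)

for tag in ["NN-TL", "AT-HL", "NN-TL-HL", "NN", "--"]:
    print(tag, "->", strip_brown_suffixes(tag))
```

Tags that do not carry any of these suffixes, including ones that consist of hyphens themselves, are returned unchanged.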

If you now run the NLTK POS tagger on the first 100 words of the tokenized Brown corpus, how does that tagging compare to the correct one?

In [None]:
pos_tagged_100_v2 = # Add your command here
print("POS TAGGED 2:", " ".join(nltk.tuple2str(tup) for tup in pos_tagged_100_v2))
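To put a number on the comparison, you could count how often two taggings of the same tokens agree. The function below is a minimal sketch; `tag_agreement` is a made-up name and the two short lists are hand-written stand-ins for the corpus tags and the tagger output. Keep in mind that the two tag sets follow different conventions, so a mismatch does not necessarily mean that the tagger made a mistake.

```python
def tag_agreement(gold, predicted):
    """Return the fraction of positions at which the two taggings share a tag."""
    if len(gold) != len(predicted):
        raise ValueError("both taggings must cover the same number of tokens")
    if not gold:
        return 0.0
    matches = sum(1 for (_, g), (_, p) in zip(gold, predicted) if g == p)
    return matches / len(gold)

# Hand-written stand-ins: the first list uses Brown-style tags, the second
# uses Penn Treebank-style tags (articles are AT in one and DT in the other).
gold = [("The", "AT"), ("jury", "NN"), ("said", "VBD")]
predicted = [("The", "DT"), ("jury", "NN"), ("said", "VBD")]

print(f"Agreement: {tag_agreement(gold, predicted):.2f}")

# List the positions where the tags differ, for manual inspection.
for (word, g), (_, p) in zip(gold, predicted):
    if g != p:
        print(f"{word}: {g} vs. {p}")
```

With your own data, you would pass `pos_tagged_100` and `pos_tagged_100_v2` as the two arguments.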

Now you are done with Part 1 and can continue to Part 2.