In [None]:
import nltk
from nltk import word_tokenize, sent_tokenize

Let's load the text of *Emma* by Jane Austen.

In [None]:
# Use a context manager so the file is closed once it has been read.
with open('emma.txt', 'r') as file:
    emma = file.read()

In [None]:
emma

### Text tokenization

We see that the whole text is a single string. It also still contains the newline characters (`\n`).

We can split the text into lines with `.splitlines()`, or into words with `.split()`, which splits on whitespace by default and on a given separator character if one is passed.

In [None]:
emma.splitlines()

In [None]:
emma.split()

After splitting, punctuation marks are still attached to the words.
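As a small illustration of the difference between the two methods, here is a sketch on a made-up two-line string. The `sample` text is an assumption for demonstration and is not read from `emma.txt`.

```python
# Hypothetical two-line sample standing in for the novel's text.
sample = "Emma Woodhouse, handsome, clever, and rich,\nwith a comfortable home."

lines = sample.splitlines()   # split on line boundaries
tokens = sample.split()       # split on any run of whitespace, including '\n'
by_comma = sample.split(",")  # split on an explicit separator character

print(lines)
print(tokens)    # note 'Woodhouse,' and 'home.' keep their punctuation
print(by_comma)
```

Note that `.split(",")` leaves the leading spaces and the newline inside the resulting pieces, since only the comma is treated as a separator.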

Let's see what happens now when we tokenize the text.

In [None]:
words = word_tokenize(emma)

In [None]:
words

Punctuation marks are now separate tokens, yet the tokenizer still handles abbreviations correctly: 'Mr.', for example, is kept as a single token.
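To see why the 'Mr.' case is worth noticing, compare it with a naive rule-based tokenizer. The sketch below uses only the standard `re` module; the pattern and the `sample` sentence are illustrative assumptions, not what `word_tokenize` does internally.

```python
import re

# Hypothetical sample sentence containing an abbreviation.
sample = "Mr. Woodhouse smiled."

# Naive rule: a token is either a run of word characters
# or a single non-space, non-word character (punctuation).
naive_tokens = re.findall(r"\w+|[^\w\s]", sample)

print(naive_tokens)  # 'Mr' and '.' come out as two separate tokens
```

A simple pattern like this cannot tell an abbreviation's period from a sentence-final one, which is the kind of case a dedicated tokenizer is designed to handle.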

In [None]:
sentences = sent_tokenize(emma)

In [None]:
sentences[10]

Sentence tokenization does not remove newline characters, so we need to do that by hand.

In [None]:
emma = emma.replace('\n', ' ')

In [None]:
sentences = sent_tokenize(emma)

In [None]:
sentences[10]
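Replacing `'\n'` with a space can leave runs of several spaces where the text had blank lines or indentation. One possible alternative, sketched here on a made-up string (the `sample` value is an assumption), is to split on whitespace and re-join with single spaces.

```python
# Hypothetical sample with mixed line endings and extra indentation.
sample = "She was the youngest\nof the two daughters\r\n  of a most affectionate father."

# str.split() with no arguments splits on any run of whitespace,
# so joining the pieces with ' ' collapses newlines and repeated spaces.
flat = " ".join(sample.split())

print(flat)
```

This changes all whitespace, not only newlines, so it is only appropriate when the original spacing carries no meaning.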

### Tagging

We can use NLTK's built-in functionality to tag the parts of speech in a sentence.

In [None]:
nltk.pos_tag(word_tokenize(sentences[10]))

The default tagset, which comes from the Penn Treebank project, has a large number of tags (30+). You can look up what each tag means like this:

In [None]:
nltk.help.upenn_tagset('VBD')

In [None]:
nltk.help.upenn_tagset('VBN')

In [None]:
nltk.help.upenn_tagset('WDT')

For simplicity, let's switch to the universal tagset, which has only 12 coarse tags.

In [None]:
nltk.pos_tag(word_tokenize(sentences[10]), tagset = 'universal')
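Conceptually, switching tagsets collapses several fine-grained Penn Treebank tags into one coarse tag. The sketch below shows the idea with a small hand-written dictionary. `PTB_TO_UNIVERSAL` is an illustrative subset written for this example, not NLTK's full mapping table, and the tagged sentence is hand-made rather than real `nltk.pos_tag` output.

```python
# Illustrative subset of a Penn Treebank -> universal mapping (not the full table).
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBN": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "RB": "ADV", "IN": "ADP",
    "DT": "DET", "WDT": "DET", "PRP": "PRON",
}

# Hand-tagged toy sentence (hypothetical, not produced by a tagger).
tagged = [("She", "PRP"), ("had", "VBD"), ("been", "VBN"), ("a", "DT"), ("friend", "NN")]

# Fall back to 'X' (the universal "other" tag) for tags missing from the subset.
coarse = [(word, PTB_TO_UNIVERSAL.get(tag, "X")) for word, tag in tagged]

print(coarse)
```

Here 'VBD' (past tense) and 'VBN' (past participle) both become 'VERB', which is the kind of detail the universal tagset gives up in exchange for simplicity.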

Interestingly, the tagger does pretty well on seemingly ambiguous sentences:

In [None]:
text = word_tokenize("They refuse to permit us to obtain the refuse permit")

In [None]:
nltk.pos_tag(text, tagset = 'universal')

The Brown Corpus, which we mentioned earlier, is also tagged, and we can use it to explore interesting things about language.

In [None]:
from nltk.corpus import brown

In [None]:
brown.tagged_words(tagset = 'universal')

For example, we can look at which words appear after 'never' in the editorial category of the Brown Corpus.

In [None]:
brown_text = brown.words(categories = 'editorial')
sorted(set(b for (a, b) in nltk.bigrams(brown_text) if a == 'never'))

And now we can also see what part of speech words appearing after 'never' usually are.

In [None]:
brown_tags = brown.tagged_words(categories = 'editorial', tagset = 'universal')

In [None]:
tags = [b[1] for (a, b) in nltk.bigrams(brown_tags) if a[0] == 'never']

In [None]:
fd = nltk.FreqDist(tags)
fd.tabulate()
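The same pattern can be tried without downloading the corpus. The sketch below is a standard-library stand-in: `zip(seq, seq[1:])` produces the same consecutive pairs as `nltk.bigrams`, and `collections.Counter` plays the role of `nltk.FreqDist` (which is itself a `Counter` subclass). The `tagged` list is hand-made for illustration and is not Brown Corpus data.

```python
from collections import Counter

# Hypothetical hand-tagged tokens standing in for brown.tagged_words(...).
tagged = [
    ("He", "PRON"), ("never", "ADV"), ("saw", "VERB"), ("it", "PRON"), (".", "."),
    ("They", "PRON"), ("never", "ADV"), ("really", "ADV"), ("tried", "VERB"), (".", "."),
    ("We", "PRON"), ("never", "ADV"), ("left", "VERB"), (".", "."),
]

# Pair each (word, tag) with the one that follows it.
pairs = zip(tagged, tagged[1:])

# Keep the tag of the following token whenever the current word is 'never'.
tags_after_never = [nxt[1] for (cur, nxt) in pairs if cur[0] == "never"]

counts = Counter(tags_after_never)
print(counts.most_common())
```

The comparison `cur[0] == "never"` is case-sensitive, so a sentence-initial 'Never' would not be counted unless the word is lowercased first.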