# Exercise 1: Working with text

**Version from the interactive session**

The purpose of this notebook is to introduce you to some basic techniques for handling text data.

## Dataset

The data for this lab is taken from [WikiText](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/), a collection of more than 100&nbsp;million tokens extracted from the set of &lsquo;Good&rsquo; and &lsquo;Featured&rsquo; articles on the English Wikipedia. More specifically, we will be using the training portion of the WikiText-2 dataset, which contains approximately 2&nbsp;million tokens.

In [None]:
with open('wiki.train.tokens', encoding='utf8') as fp:
    wikitext = fp.readlines()

Show a line from the file:

In [None]:
wikitext[3]

As we can see, each line consists of a sequence of space-separated tokens. (In real applications, we would typically have to tokenise the raw text ourselves.) Here is a helper function that yields the tokens of a list of lines, one at a time, and adds an `<eos>` marker at the end of each line:

In [None]:
def tokens(lines):
    for line in lines:
        for token in line.rstrip().split():
            yield token
        yield '<eos>'
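
As a quick check of the helper, the sketch below runs it on two made-up lines. The sample strings are assumptions for illustration and are not taken from WikiText; the function is repeated so that the example is self-contained.

```python
def tokens(lines):
    # Same helper as above: yield every token, then one '<eos>' marker per line
    for line in lines:
        for token in line.rstrip().split():
            yield token
        yield '<eos>'

sample = ['the cat sat \n', ' on the mat \n']  # made-up lines, not from WikiText
sample_tokens = list(tokens(sample))
print(sample_tokens)
# ['the', 'cat', 'sat', '<eos>', 'on', 'the', 'mat', '<eos>']
```

Because `tokens` is a generator, it produces tokens lazily; wrapping it in `list` is only needed when we want to see or reuse all of them at once.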

## Encode words into integers

One common task when processing text with neural networks is to encode words and other strings into contiguous ranges of integers so that we can map them to the components of a vector. The standard pattern for doing this looks as follows:

In [None]:
stoi = {}
for token in tokens(wikitext):
    if token not in stoi:
        stoi[token] = len(stoi)

What index did the word *movie* get?

In [None]:
stoi['movie']

Sometimes we also need to invert the string-to-integer mapping:

In [None]:
itos = {i: s for s, i in stoi.items()}

Decoding should be the inverse of encoding:

In [None]:
assert itos[stoi['movie']] == 'movie'
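
To see the encoding pattern end to end, here is a minimal sketch on a toy sentence. The function name `build_vocab` and the toy sentence are assumptions introduced for illustration; the loop inside is the same pattern as above.

```python
def build_vocab(token_iter):
    # Assign each unseen token the next free index, in order of first occurrence
    stoi = {}
    for token in token_iter:
        if token not in stoi:
            stoi[token] = len(stoi)
    return stoi

toy_tokens = 'a rose is a rose'.split()
toy_stoi = build_vocab(toy_tokens)
toy_itos = {i: s for s, i in toy_stoi.items()}

encoded = [toy_stoi[t] for t in toy_tokens]   # strings -> integers
decoded = [toy_itos[i] for i in encoded]      # integers -> strings
print(encoded)
# [0, 1, 2, 0, 1]
```

Since indices are handed out as `len(stoi)`, they always form the contiguous range from 0 to the vocabulary size minus one, which is what we need to index into a vector or an embedding matrix.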

## Plot word frequencies

We want to see whether the word frequencies in the WikiText data follow the expected Zipfian pattern. We start by collecting the frequencies:

In [None]:
from collections import Counter

counter = Counter(tokens(wikitext))

#counter = Counter(t.lower() for t in tokens(wikitext) if t.isalpha())    # for Problem 1

How often does the word *movie* occur in the data?

In [None]:
print(counter['movie'])

Print the 10 most common words.

In [None]:
counter.most_common(10)

Plot the word frequencies of the 100 most common words:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

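labels, values = zip(*counter.most_common(100))
# Ranks start at 1; a rank of 0 cannot be shown on a logarithmic axis
ranks = range(1, len(values) + 1)
plt.loglog(ranks, values)
plt.xlabel('Rank')
plt.ylabel('Frequency')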
plt.show()

In [None]:
list(zip(*counter.most_common(10)))
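
Under Zipf's law, frequency is roughly proportional to 1/rank, so the frequencies should fall on a straight line with a slope close to &minus;1 in a log-log plot. One rough way to put a number on this is a least-squares fit in log-log space. The sketch below uses only the standard library; the function name `loglog_slope` and the synthetic frequencies are assumptions for illustration, not part of the lab.

```python
import math

def loglog_slope(frequencies):
    # frequencies: counts sorted in decreasing order; ranks start at 1
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Ordinary least-squares slope of log(frequency) against log(rank)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Synthetic, perfectly Zipfian frequencies: f(rank) = 1000 / rank
ideal = [1000 / rank for rank in range(1, 51)]
slope = loglog_slope(ideal)
print(round(slope, 3))
```

On the real data, the same function could be applied to `[count for _, count in counter.most_common(100)]`; how close the result is to &minus;1 depends on the tokens that are included and on how many ranks are used.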

**Here are some problems for you to work on:**

* Normalise your data by lowercasing and excluding non-alphabetical tokens. How does that affect the results?
* Compute the size of the vocabulary and the total number of tokens. Compare your values to the official statistics on the WikiText website.
* Check whether Heaps&rsquo; law seems to hold for the WikiText data by plotting the vocabulary size in steps of 1,000 tokens.
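
As a hint for the first problem, the following sketch shows the effect of normalisation on a handful of made-up tokens (the list `raw` is an assumption for illustration). It keeps only tokens for which `str.isalpha` is true and lowercases them before counting.

```python
from collections import Counter

# Made-up tokens, including punctuation, a number and two special tokens
raw = ['The', 'film', ',', 'the', 'Film', '1999', '<eos>', '@-@']

# Keep purely alphabetic tokens only, and merge case variants
normalised = Counter(t.lower() for t in raw if t.isalpha())
print(normalised.most_common())
# [('the', 2), ('film', 2)]
```

Note that this filter also removes the `<eos>` markers and tokens such as `@-@`, so both the vocabulary size and the total token count shrink.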

In [None]:
print('Vocabulary size:', len(stoi))

print('Number of tokens:', sum(counter.values()))

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

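steps = []     # number of tokens seen so far
values = []    # vocabulary size at that point
vocab = set()
for i, token in enumerate(tokens(wikitext), start=1):
    vocab.add(token)
    if i % 1000 == 0:
        steps.append(i)
        values.append(len(vocab))

plt.loglog(steps, values)
plt.xlabel('Number of tokens')
plt.ylabel('Vocabulary size')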
plt.show()
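
Heaps&rsquo; law states that the vocabulary size grows roughly as V = K &middot; n^&beta; in the number of tokens n, which again gives a straight line in a log-log plot. The measurement step can be packaged as a small reusable function; the sketch below is one way to do it, with the function name `vocabulary_growth` and the toy token sequence being assumptions for illustration.

```python
def vocabulary_growth(token_iter, step):
    # Return (tokens seen, distinct tokens seen) after every `step` tokens
    seen = set()
    points = []
    for n, token in enumerate(token_iter, start=1):
        seen.add(token)
        if n % step == 0:
            points.append((n, len(seen)))
    return points

toy = 'a b a c b d a e'.split()   # made-up token sequence
growth = vocabulary_growth(toy, step=2)
print(growth)
# [(2, 2), (4, 3), (6, 4), (8, 5)]
```

Counting from 1 rather than 0 means that each point pairs the vocabulary size with the exact number of tokens read so far, so the pairs can be plotted directly on logarithmic axes.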

## Extracting linguistic features

When building neural network systems for text, we often need to preprocess the data in some way (for example, tokenisation or stop word removal) and extract linguistic features from the text, such as part-of-speech tags or lemmas. For this we can use libraries such as [spaCy](https://spacy.io):

In [None]:
import spacy

spaCy comes with models for different languages. Here we load an English language model. (There are actually several English language models; to save some time, we load the smallest one.)

In [None]:
nlp = spacy.load('en_core_web_sm')

Define a short text:

In [None]:
text = 'Apple Corp. buys Alphabet Inc. for $1 billion'

Process the text using the default NLP pipeline:

In [None]:
doc = nlp(text)

Show the named entities (names of people, places, organisations, etc.):

In [None]:
from spacy import displacy

displacy.render(doc, style='ent', jupyter=True)

Show the dependency parse:

In [None]:
from spacy import displacy

displacy.render(doc, style='dep', options={'distance': 110}, jupyter=True)

**Here are some problems for you to work on:**

Read the [Linguistic Features](https://spacy.io/usage/linguistic-features) section from the spaCy documentation and solve the following problems on the WikiText data.

* What is the percentage of stop words in the data? (Stop words are frequent words that we may wish to ignore.)
* Compile a small dictionary with all verbs in the data. Represent verbs by their lemmas.
* Extract the names of all people mentioned in the data. Do you notice any obvious errors made by the named entity recogniser?
* Extract all two-word compounds in the data. How many compounds do you find?

### Sample solutions to the problems

In [None]:
# Process the WikiText data and store the processed "documents" in a list

docs = list(nlp.pipe(wikitext))

In [None]:
n = 0
k = 0
for doc in docs:
    for token in doc:
        n += 1
        k += token.is_stop
# The '%' format type multiplies the fraction by 100 and appends a percent sign
print('Percentage of stop words: {:.2%}'.format(k / n))

In [None]:
verbs = set()
for doc in docs:
    for token in doc:
        if token.text.islower() and token.is_alpha and token.pos_ == 'VERB':
            verbs.add(token.lemma_)
print('Number of distinct verb lemmas:', len(verbs))

In [None]:
ents = set()
for doc in docs:
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            ents.add(ent.text)
print(sorted(ents))

In [None]:
compounds = set()
for doc in docs:
    for token in doc:
        if token.dep_ == 'compound':
            if token.head.i < token.i:
                compounds.add(token.head.text + ' ' + token.text)
            else:
                compounds.add(token.text + ' ' + token.head.text)
print(sorted(compounds))

That&rsquo;s all folks!