# 5. Lemmatization and POS tagging

In this class, we will start working with *language processing pipelines* which enrich our text with useful linguistic annotations. The library that we're going to use is **spaCy** (https://spacy.io/).

Let's start by reading the text of "Anna Karenina". Instead of storing it as a single huge string, we will read it as a list of paragraphs, each paragraph being a string. The paragraphs are separated in the file by blank lines, which appear as `\n\n` in the string representation.

In [1]:
from collections import defaultdict
from operator import itemgetter

In [2]:
def load_paragraphs(filename):
    '''Reads a text divided into paragraphs.'''
    pars = []
    with open(filename) as fp:
        text = fp.read()
        pars = [p.replace('\n', ' ') for p in text.split('\n\n') if p.strip()]
    return pars
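
To see what the splitting step does without opening a file, here is a minimal sketch on an in-memory string. The sample text and the helper name `split_paragraphs` are made up for illustration; the logic mirrors `load_paragraphs` above.

```python
def split_paragraphs(text):
    """Split raw text on blank lines and join each paragraph's lines with spaces."""
    return [p.replace('\n', ' ') for p in text.split('\n\n') if p.strip()]

# Hypothetical sample: two paragraphs, the first one wrapped over two lines.
sample = "Happy families\nare all alike.\n\nEvery unhappy family is unhappy in its own way.\n"
paragraphs = split_paragraphs(sample)
print(paragraphs)
```

Note that a newline at the very end of the text turns into a trailing space in the last paragraph, because every `\n` inside a paragraph is replaced by a space.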

In [3]:
text = load_paragraphs('anna_karenina.txt')

Now let's load a spaCy model. The model is typically stored in a variable called `nlp`.

In [4]:
import spacy
nlp = spacy.load('en_core_web_sm')

In order to process a chunk of text, we simply call `nlp` as a function on it:

In [5]:
docs = [nlp(par) for par in text]  # Run the NLP pipeline paragraph by paragraph

In [6]:
print(docs[100][0].lemma_, docs[100][0].lemma)
# Prints the lemma of the first token of the paragraph at index 100 (i.e. the 101st paragraph)
# .lemma_ is the lemma as a string, .lemma is its integer hash ID

see 11925638236994514241


Now `docs` is a list of `Doc` ("document") objects, each document containing a list of tokens. A token contains many useful annotations in addition to its string form. Today we are going to look into the following annotations:
* `orth_` - the original string form of the word,
* `norm_` - the normalized string form (e.g. lowercased),
* `lemma_` - the "dictionary form", e.g. `"did"` is lemmatized to `"do"`
* `pos_` - part-of-speech tag (e.g. `VERB`),
* `morph` - the details of the word's morphological form.

Here's how we can see those annotations for a short snippet, the book's 27th paragraph:

In [7]:
for tok in docs[26]:
    print(tok.orth_, tok.norm_, tok.lemma_, tok.pos_, tok.morph, sep='\t')

“	"	"	PUNCT	PunctSide=Ini|PunctType=Quot
What	what	what	PRON	
’s	's	’	VERB	Number=Sing|Person=Three|Tense=Pres|VerbForm=Fin
this	this	this	DET	Number=Sing|PronType=Dem
?	?	?	PUNCT	PunctType=Peri
this	this	this	DET	Number=Sing|PronType=Dem
?	?	?	PUNCT	PunctType=Peri
”	"	"	PUNCT	PunctSide=Fin|PunctType=Quot
she	she	she	PRON	Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs
asked	asked	ask	VERB	Tense=Past|VerbForm=Fin
,	,	,	PUNCT	PunctType=Comm
pointing	pointing	point	VERB	Aspect=Prog|Tense=Pres|VerbForm=Part
to	to	to	ADP	
the	the	the	DET	Definite=Def|PronType=Art
letter	letter	letter	NOUN	Number=Sing
.	.	.	PUNCT	PunctType=Peri
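
The last column of the output above is printed as `Feature=Value` pairs separated by `|`. The following is a minimal sketch of reading such a string into a dictionary; the helper name `parse_morph` is hypothetical and only illustrates the string format (spaCy's own `morph` object has its own accessors, which are not used here).

```python
def parse_morph(feats):
    """Turn a 'Feature=Value|Feature=Value' string into a dict."""
    if not feats:
        # Tokens without morphological features print as an empty string
        return {}
    return dict(pair.split('=', 1) for pair in feats.split('|'))

print(parse_morph('Number=Sing|PronType=Dem'))
print(parse_morph(''))
```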


If any of the labels sound cryptic, there's a very useful function called `spacy.explain()`:

In [8]:
spacy.explain('ADP')

'adposition'

# In-class exercises

## Ex 1

Count the number of occurrences of each POS-tag in a dictionary called `tag_freqs`.

In [9]:
tag_freqs = {}
for paragraph in docs:
    for tok in paragraph:
        if tok.pos_ not in tag_freqs:
            tag_freqs[tok.pos_] = 1
        else:
            tag_freqs[tok.pos_] += 1
print(tag_freqs)

{'NOUN': 56203, 'DET': 34508, 'PROPN': 14015, 'ADP': 43417, 'PUNCT': 72260, 'AUX': 20999, 'PRON': 51952, 'ADV': 23033, 'CCONJ': 16941, 'SPACE': 292, 'VERB': 56767, 'NUM': 2215, 'X': 116, 'ADJ': 21505, 'SCONJ': 6317, 'PART': 12034, 'INTJ': 2129, 'SYM': 9}


In [10]:
# Does the same, but with a defaultdict: missing keys start at 0 (supplied by the lambda)
tag_freqs = defaultdict(lambda: 0)
for d in docs:
    for tok in d:
        tag_freqs[tok.pos_] += 1
print(tag_freqs)

defaultdict(<function <lambda> at 0x7fce0c2f2040>, {'NOUN': 56203, 'DET': 34508, 'PROPN': 14015, 'ADP': 43417, 'PUNCT': 72260, 'AUX': 20999, 'PRON': 51952, 'ADV': 23033, 'CCONJ': 16941, 'SPACE': 292, 'VERB': 56767, 'NUM': 2215, 'X': 116, 'ADJ': 21505, 'SCONJ': 6317, 'PART': 12034, 'INTJ': 2129, 'SYM': 9})


In [11]:
tag_freqs['VERB']

56767

Once we've done this, we can list the part-of-speech tags in order of decreasing frequency:

In [12]:
sorted(tag_freqs.items(), reverse=True, key=itemgetter(1))

# itemgetter(1) picks the second element of each (tag, count) pair, so we sort by frequency

[('PUNCT', 72260),
 ('VERB', 56767),
 ('NOUN', 56203),
 ('PRON', 51952),
 ('ADP', 43417),
 ('DET', 34508),
 ('ADV', 23033),
 ('ADJ', 21505),
 ('AUX', 20999),
 ('CCONJ', 16941),
 ('PROPN', 14015),
 ('PART', 12034),
 ('SCONJ', 6317),
 ('NUM', 2215),
 ('INTJ', 2129),
 ('SPACE', 292),
 ('X', 116),
 ('SYM', 9)]
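
The counting and sorting steps above can also be sketched with `collections.Counter` from the standard library, which combines both. The tag list below is a hypothetical stand-in for the `tok.pos_` values of a processed text, so the snippet runs without a spaCy model.

```python
from collections import Counter

# Hypothetical stand-in for the tok.pos_ values of a processed text.
tags = ['DET', 'NOUN', 'VERB', 'DET', 'NOUN', 'PUNCT', 'NOUN']

tag_counts = Counter(tags)          # counts each distinct tag
ranked = tag_counts.most_common()   # (tag, count) pairs, most frequent first
print(ranked)
```

With the real data, the same counter could be built from a generator such as `Counter(tok.pos_ for d in docs for tok in d)`.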

## Ex 2

Find the most frequently occurring nouns, verbs and adjectives (lemmas). For this, you will need to build a dictionary `lemmas_by_tag`, in which each entry is a list of the lemmas for a given tag, ordered by frequency.

In [13]:
# Nested dictionary: the outer dictionary maps each tag to an inner dictionary,
# which returns 0 for any unseen lemma; the counts are then incremented by 1

lemmas_by_tag = defaultdict(lambda: defaultdict(lambda: 0))

for d in docs:
    for tok in d:
        lemmas_by_tag[tok.pos_][tok.lemma_] += 1
        

In [14]:
for tag in lemmas_by_tag:
    lemmas_by_tag[tag] = sorted(lemmas_by_tag[tag].items(), reverse=True, key=itemgetter(1))

In [15]:
lemmas_by_tag['NOUN'][:10]

[('man', 763),
 ('time', 651),
 ('hand', 603),
 ('eye', 568),
 ('face', 556),
 ('_', 470),
 ('life', 467),
 ('room', 455),
 ('day', 433),
 ('wife', 425)]

In [16]:
lemmas_by_tag['VERB'][:10]

[('be', 3762),
 ('say', 3437),
 ('go', 1908),
 ('’', 1415),
 ('see', 1368),
 ('have', 1332),
 ('come', 1311),
 ('know', 1259),
 ('do', 1002),
 ('look', 973)]

In [17]:
lemmas_by_tag['ADJ'][:10]

[('good', 465),
 ('little', 456),
 ('same', 448),
 ('old', 421),
 ('other', 379),
 ('own', 348),
 ('new', 316),
 ('more', 287),
 ('great', 258),
 ('first', 255)]
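
As an alternative to the nested `defaultdict` used above, here is a small sketch that keeps one `Counter` per tag. The `(pos, lemma)` pairs are hypothetical stand-ins for `(tok.pos_, tok.lemma_)`, so the snippet is self-contained.

```python
from collections import Counter, defaultdict

# Hypothetical (pos, lemma) pairs standing in for (tok.pos_, tok.lemma_).
pairs = [('NOUN', 'man'), ('VERB', 'say'), ('NOUN', 'hand'),
         ('NOUN', 'man'), ('VERB', 'say'), ('VERB', 'go')]

lemma_counts = defaultdict(Counter)  # tag -> Counter of lemmas
for pos, lemma in pairs:
    lemma_counts[pos][lemma] += 1

top_nouns = lemma_counts['NOUN'].most_common(1)
print(top_nouns)
```

One design difference: the counters are not overwritten by sorted lists, so individual counts such as `lemma_counts['VERB']['go']` remain available after ranking.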

## Ex 3

Find the 10 word forms that are most ambiguous with respect to the POS tag, i.e. those that occur with the largest number of different tags.
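
One possible approach, sketched under the assumption that "most ambiguous" means "seen with the most distinct tags": collect a set of tags per word form and sort by the size of that set. The `(form, pos)` pairs are hypothetical stand-ins for `(tok.norm_, tok.pos_)`.

```python
from collections import defaultdict

# Hypothetical (form, pos) pairs standing in for (tok.norm_, tok.pos_).
pairs = [('sleep', 'NOUN'), ('sleep', 'VERB'), ('that', 'SCONJ'),
         ('that', 'PRON'), ('that', 'DET'), ('letter', 'NOUN'),
         ('sleep', 'VERB')]

tags_by_form = defaultdict(set)  # word form -> set of tags seen for it
for form, pos in pairs:
    tags_by_form[form].add(pos)

# Sort forms by the number of distinct tags, most ambiguous first.
most_ambiguous = sorted(tags_by_form.items(),
                        key=lambda item: len(item[1]), reverse=True)[:10]
for form, tag_set in most_ambiguous:
    print(form, sorted(tag_set))
```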

## Ex 4

Test Zipf's law (plot rank against frequency on a log-log scale) for word forms and lemmas.
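
Zipf's law says that frequency is roughly inversely proportional to rank, so on a log-log scale the points should lie close to a straight line with a slope near -1. Below is a minimal sketch that computes the rank-frequency pairs and estimates that slope by least squares instead of plotting. The function names and the toy counts are hypothetical; with real data, `counts` could be a counter of word forms or lemmas, and the points could be passed to a plotting library with logarithmic axes.

```python
import math
from collections import Counter

def rank_frequency(counts):
    """Return (rank, frequency) pairs, rank 1 being the most frequent item."""
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))

def loglog_slope(points):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank, _ in points]
    ys = [math.log(freq) for _, freq in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Hypothetical counts that follow frequency = 1200 / rank exactly.
toy_counts = Counter({'the': 1200, 'and': 600, 'to': 400, 'of': 300})
points = rank_frequency(toy_counts)
slope = loglog_slope(points)
print(points, round(slope, 3))
```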

# Homework

## Ex 6 (2p.)

In English, the lemma of a word is often identical to the word form. But how often? Write the function `lemma_equals_word(docs, tag)` that measures the proportion of tokens tagged with `tag`, for which the lemma is identical to the normalized word.

In [18]:
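def lemma_equals_word(docs, tag):
    '''Returns the proportion of tokens tagged `tag` whose lemma equals the normalized form.'''
    count = 0
    count_same = 0

    for d in docs:
        for tok in d:
            if tok.pos_ == tag:
                count += 1
                if tok.norm_ == tok.lemma_:
                    count_same += 1

    # Avoid dividing by zero when no token carries the tag
    return count_same / count if count else 0.0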

In [19]:
lemma_equals_word(docs, 'ADJ')

0.9610788188793304


In [20]:
lemma_equals_word(docs, 'NOUN')

0.8085155596676334


In [21]:
lemma_equals_word(docs, 'VERB')

0.299117444994451


## Ex 7 (3p.)

Write a function `find_examples(docs, word, tags)`, which finds instances of `word` being tagged with different `tags`. Use it to show example sentences of the different taggings.

In [59]:
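def find_examples(docs, word, tags):
    '''Collects, for each tag in `tags`, the sentences in which `word` carries that tag.'''
    results = defaultdict(list)

    for d in docs:
        for tok in d:
            # Compare the surface form and skip taggings that were not asked for
            if tok.text == word and tok.pos_ in tags:
                results[tok.pos_].append(tok.sent)

    return results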

In [60]:
ex = find_examples(docs, 'sleep', ['NOUN', 'VERB'])

In [62]:
ex['NOUN'][0].sent

He turned over his stout, well-cared-for person on the springy sofa, as though he would sink into a long sleep again; he vigorously embraced the pillow on the other side and buried his face in it; but all at once he jumped up, sat up on the sofa, and opened his eyes.

In [63]:
ex['VERB'][0].sent

But after she had gone to bed, for a long while she could not sleep.
