# Sequence Labeling with Markov Models
- Evgeny A. Stepanov
- stepanov.evgeny.a@gmail.com

*Recommended Reading*:
- Dan Jurafsky and James H. Martin. [__Speech and Language Processing__ (SLP)](https://web.stanford.edu/~jurafsky/slp3/) (3rd ed. draft)
- Steven Bird, Ewan Klein, and Edward Loper. [__Natural Language Processing with Python__ (NLTK)](https://www.nltk.org/book/)

*Notebook Covers Material of*:
- [SLP](https://web.stanford.edu/~jurafsky/slp3/8.pdf) Chapter 8: Part-of-Speech Tagging (HMMs)
- [NLTK](https://www.nltk.org/book/ch05.html) 
    - Chapter 5: Part of Speech Tagging 
    - Chapter 7: Extracting Information from Text

__Requirements__

- spaCy
- [NLTK](https://www.nltk.org/)
- [`conll.py`](https://github.com/esrel/LUS/) (in `src` folder)

## Sequence Labeling (Tagging)
[Classification](https://en.wikipedia.org/wiki/Statistical_classification) is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.

[Sequence Labeling](https://en.wikipedia.org/wiki/Sequence_labeling) is a type of pattern recognition task that involves the algorithmic assignment of a categorical label to each member of a sequence of observed values. It is a sub-class of [structured (output) learning](https://en.wikipedia.org/wiki/Structured_prediction), since we are predicting a *sequence* object rather than a discrete or real value predicted in classification problems.

- Can be treated as a set of independent classification tasks, one per member of the sequence;
- Performance is generally improved by making the optimal label for a given element dependent on the choices of nearby elements;

Due to the complexity of the model and the interrelations of the predicted variables, exact prediction with a trained model, and training itself, are often computationally infeasible; [approximate inference](https://en.wikipedia.org/wiki/Approximate_inference) and learning methods are used instead. 

A [Markov Chain](https://en.wikipedia.org/wiki/Markov_chain) is a stochastic model used to describe sequences. It is the simplest [Markov Model](https://en.wikipedia.org/wiki/Markov_model). In order to make inference tractable, the process that generated the sequence is assumed to have the [Markov Property](https://en.wikipedia.org/wiki/Markov_property), i.e. future states depend only on the current state, not on the events that occurred before it. (An [ngram](https://en.wikipedia.org/wiki/N-gram) [language model](https://en.wikipedia.org/wiki/Language_model) is an $(n-1)$-order Markov Model.) 

In Statistical Language Modeling, we are modeling *observed sequences* represented as Markov Chains. Since the states of the process are *observable*, we only need to compute __transition probabilities__. 

In Sequence Labeling, we assume that *observed sequences* (__sentences__) have been generated by a Markov Process with *unobservable* (i.e. hidden) states (__labels__), i.e. [Hidden Markov Model](https://en.wikipedia.org/wiki/Hidden_Markov_model) (__HMM__). 
Since the states of the process are hidden and the output is observable, each state has a probability distribution over the possible output tokens, i.e. __emission probabilities__. 

Using these two probability distributions (__transition__ and __emission__), in sequence labeling, we are *inferring* the sequence of state transitions, given a sequence of observations.

### Natural Language Processing (NLP) Tasks

Below are some examples of NLP tasks that Sequence Labeling is applied to as one of the methods.

The scenario in which members of a sequence are mapped to higher-order units (i.e. grouped together, e.g. `[['a'],['b','c']]`, and assigned a category) is known as __shallow parsing__.

- [Part-of-Speech Tagging](https://en.wikipedia.org/wiki/Part-of-speech_tagging)
- [Shallow Parsing](https://en.wikipedia.org/wiki/Shallow_parsing) (Chunking)
    - [Phrase Chunking](https://en.wikipedia.org/wiki/Phrase_chunking)
    - [Named-Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition) 
    - [Semantic Role Labeling](https://en.wikipedia.org/wiki/Semantic_role_labeling)
    - Dependency [Parsing](https://en.wikipedia.org/wiki/Parsing) 
    - Discourse Parsing
    - (Natural/Spoken) __Language Understanding__: Concept Tagging/Entity Extraction

### The General Setting for Sequence Labeling

- Create __training__ and __testing__ sets by tagging a certain amount of text by hand
    - i.e. map each word in corpus to a tag
- Train tagging model to extract generalizations from the annotated __training__ set
- Evaluate the trained tagging model on the annotated __testing__ set
- Use the trained tagging model to annotate new texts

### Markov Model Tagging
Tagging is one of the tasks [Hidden Markov Models](https://en.wikipedia.org/wiki/Hidden_Markov_model) are used for.

Given a word sequence $w_{1}^{n}$ the goal is to find the most probable tag sequence $t_{1}^{n}$. 

$$t_{1}^{n} = \arg\max\limits_{t_{1}^{n}} p(t_{1}^{n} | w_{1}^{n})$$

We assume that a tag sequence has generated the given sequence of words. 

Using __Bayes's Rule__ 

$$P(A|B)=\frac{P(B|A)P(A)}{P(B)}$$

Consequently, we compute:

$$t_{1}^{n} = \arg\max\limits_{t_{1}^{n}}\frac{p(w_{1}^{n} | t_{1}^{n}) p(t_{1}^{n})}{p(w_{1}^{n})}$$

The probability of the word sequence $p(w_{1}^{n})$ is the same for all candidate tag sequences, thus:

$$t_{1}^{n} = \arg\max\limits_{t_{1}^{n}} p(w_{1}^{n} | t_{1}^{n})p(t_{1}^{n})$$


#### Parameter Learning
The parameter learning task in HMMs is to find, given an output sequence or a set of such sequences, the best set of *state transition* and *emission probabilities*. The task is usually to derive the [*maximum likelihood estimate*](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation) of the parameters of the HMM given the set of output sequences. 

##### Simplifying Assumptions

- Probability of a word only depends on its own tag, not tags of other words in a sentence, thus:

$$p(w_{1}^{n}|t_{1}^{n}) \approx p(w_1|t_1)p(w_2|t_2) ... p(w_n|t_n)$$

- Probability of a tag depends only on the previous $N-1$ tags; i.e. Markov assumption (ngram of order $N$), thus:

$$p(t_{1}^{n}) \approx \prod_{i=1}^{n}{p(t_i | t_{i-N+1}^{i-1})}$$

- The (first-order) Markov assumption (bigram):

$$p(t_{1}^{n}) \approx p(t_1|t_0) p(t_2|t_1) ... p(t_n|t_{n-1})$$
- or:
$$p(t_{1}^{n}) \approx \prod_{i=1}^{n}{p(t_i | t_{i-1})}$$
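
The bigram factorization above can be sketched in a few lines of Python. This is only an illustration: the transition table, its values, the `<s>` start symbol standing in for $t_0$, and the function name `tag_sequence_logprob` are all assumptions made for the example, not part of any library.

```python
import math

# Hypothetical bigram transition probabilities p(tag | previous tag).
# "<s>" plays the role of t_0 (sentence start); the values are illustrative only.
transitions = {
    ("<s>", "DET"): 0.6, ("<s>", "NOUN"): 0.4,
    ("DET", "NOUN"): 0.9, ("DET", "ADJ"): 0.1,
    ("NOUN", "VERB"): 0.7, ("NOUN", "NOUN"): 0.3,
}

def tag_sequence_logprob(tags, transitions, start="<s>"):
    """Log of p(t_1..t_n) under the first-order Markov (bigram) assumption."""
    logp = 0.0
    prev = start
    for tag in tags:
        p = transitions.get((prev, tag), 0.0)
        if p == 0.0:
            # unseen transition: zero probability when no smoothing is applied
            return float("-inf")
        logp += math.log(p)   # sum of logs instead of product of probabilities
        prev = tag
    return logp

prob = math.exp(tag_sequence_logprob(["DET", "NOUN", "VERB"], transitions))
print(round(prob, 3))
```

Working in log space is a common design choice here, since products of many small probabilities underflow quickly on longer sentences.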

##### Estimating Transition Probabilities from Data

- *Transition probabilities* $p(t_i | t_{i-N+1}^{i-1})$ form an ngram model, estimated using the same recipe we use for ngram language modeling, but with tag ngrams instead of word ngrams. 
- It is assumed that the set of states is *finite* and known (i.e. there is no unknown (or OOV) state).
- The same principles of *smoothing* apply for ngrams of state transitions


*Calculating Probability from Frequencies*

Probabilities of ngrams can be computed *normalizing* frequency counts (*Maximum Likelihood Estimation*): dividing the frequency of an ngram sequence by the frequency of its prefix (*relative frequency*).

N-gram   | Equation                      
:--------|:------------------------------
Unigram  | $$p(t_i) = \frac{c(t_i)}{T}$$ 
Bigram   | $$p(t_i|t_{i-1}) = \frac{c(t_{i-1},t_i)}{c(t_{i-1})}$$ 
Ngram    | $$p(t_i|t_{i-N+1}^{i-1}) = \frac{c(t_{i-N+1}^{i-1}, t_i)}{c(t_{i-N+1}^{i-1})}$$ 

where:
- $T$ is the total number of tag tokens in a corpus
- $c(x)$ is the count of occurrences of $x$ in a corpus ($x$ could be unigram, bigram, etc.)
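
As a minimal sketch of the bigram row of the table, the relative-frequency estimate can be computed with `collections.Counter`. The toy tag sequences, the `<s>` start symbol and the function name `estimate_transitions` are assumptions for illustration. Note one deliberate choice: the context count only includes occurrences that are followed by another tag, so that each conditional distribution sums to 1 without an explicit end-of-sentence symbol.

```python
from collections import Counter

# Toy tag sequences (hypothetical data, one list per sentence).
tag_sequences = [
    ["DET", "NOUN", "VERB"],
    ["DET", "ADJ", "NOUN", "VERB"],
    ["NOUN", "VERB"],
]

def estimate_transitions(tag_sequences, start="<s>"):
    """MLE bigram transitions: c(t_{i-1}, t_i) / c(t_{i-1})."""
    bigram_counts = Counter()
    context_counts = Counter()
    for tags in tag_sequences:
        padded = [start] + list(tags)
        for prev, cur in zip(padded, padded[1:]):
            bigram_counts[(prev, cur)] += 1
            context_counts[prev] += 1   # counted only when 'prev' has a successor
    return {bg: c / context_counts[bg[0]] for bg, c in bigram_counts.items()}

transition_probs = estimate_transitions(tag_sequences)
for (prev, cur), p in sorted(transition_probs.items()):
    print("p({} | {}) = {:.3f}".format(cur, prev, p))
```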

##### Estimating Emission Probabilities from Data
Similar to *transition probabilities*, *emission probabilities* can be estimated from annotated data counting relative frequencies of observations. Since we assume that probability of a word depends only on its tag, the equation is the following.

$$p(w_i|t_i) = \frac{c(t_i,w_i)}{c(t_i)}$$
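
The emission estimate follows the same counting pattern. The sketch below assumes training data in the form of lists of `(word, tag)` pairs (the shape NLTK's `tagged_sents()` returns); the toy sentences and the function name `estimate_emissions` are made up for the example.

```python
from collections import Counter

# Toy tagged sentences (hypothetical data): lists of (word, tag) pairs.
tagged_sentences = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]

def estimate_emissions(tagged_sentences):
    """MLE emissions: p(w | t) = c(t, w) / c(t)."""
    pair_counts = Counter()
    tag_counts = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            pair_counts[(tag, word)] += 1
            tag_counts[tag] += 1
    return {(tag, word): c / tag_counts[tag]
            for (tag, word), c in pair_counts.items()}

emission_probs = estimate_emissions(tagged_sentences)
print(emission_probs[("NOUN", "dog")])
```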

*Unknown Words* & *Unknown Word Models* 

Emission probabilities are subject to data sparseness; thus require handling unknown words. 
Consequently, we need to estimate probabilities for $p($ `<unk>` $|t_i)$. 

- We can assume that all tags ($t_i$) have equal probability of emitting `<unk>`; and estimate it as $\frac{1}{V}$, where $V$ is the vocabulary size.
    - i.e. use Additive Smoothing
- We can estimate them from data replacing OOV with `<unk>` and computing the probabilities
- We can build __Unknown Word Model__ (like in Part-of-Speech Tagging), for instance using:
    - word shape (capitalization)
    - word class (word, punctuation, number)
    - part-of-speech tags (generalize)
    - word suffixes (last characters): e.g. suffixes of lengths (1 to 5) (e.g. [Samuelsson (1993)](https://www.aclweb.org/anthology/W93-0420.pdf))
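
The second option (estimating `<unk>` probabilities from data) is often implemented by replacing low-frequency training words with `<unk>` before counting. The sketch below assumes a frequency cut-off (`min_count`) as the criterion for what counts as OOV; the cut-off, the toy data and both function names are illustrative assumptions rather than a prescribed method.

```python
from collections import Counter

UNK = "<unk>"

# Toy tagged sentences (hypothetical data): lists of (word, tag) pairs.
tagged_sentences = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]

def replace_rare_words(tagged_sentences, min_count=2):
    """Replace words seen fewer than min_count times with <unk>."""
    word_counts = Counter(w for s in tagged_sentences for w, _ in s)
    vocab = {w for w, c in word_counts.items() if c >= min_count}
    replaced = [[(w if w in vocab else UNK, t) for w, t in s]
                for s in tagged_sentences]
    return replaced, vocab

def map_to_vocab(words, vocab):
    """Apply the same mapping to new input at tagging time."""
    return [w if w in vocab else UNK for w in words]

train_with_unk, vocab = replace_rare_words(tagged_sentences)
print(train_with_unk[0])
print(map_to_vocab(["the", "bird", "sleeps"], vocab))
```

Emission counts collected from `train_with_unk` then contain `(tag, "<unk>")` pairs, so $p($ `<unk>` $|t_i)$ is estimated like any other emission probability.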


#### Decoding
$$t_{1}^{n} \approx \arg\max\limits_{t_{1}^{n}} \prod^n_{i=1} p(w_i|t_i)p(t_i|t_{i-N+1}^{i-1})$$

| __Model__ | __Equation__                                                                                 |
|:----------|:--------------------------------------------------------------------------------------------:|
| *unigram* | $$t_{1}^{n} = \arg\max\limits_{t_{1}^{n}} \prod^n_{i=1} p(w_i|t_i)p(t_i)$$                   |
| *bigram*  | $$t_{1}^{n} = \arg\max\limits_{t_{1}^{n}} \prod^n_{i=1} p(w_i|t_i)p(t_i|t_{i-1})$$           |
| *trigram* | $$t_{1}^{n} = \arg\max\limits_{t_{1}^{n}} \prod^n_{i=1} p(w_i|t_i)p(t_i|t_{i-2}, t_{i-1})$$  |

where:
- $p(w_i|t_i)$ -- *emission probability*, i.e. of seeing current word given the current tag
- $p(t_i|t_{i-N+1}^{i-1})$ -- *transition probability*, i.e. of seeing the current tag given the tags we just saw 

##### Viterbi Algorithm
The decoding algorithm for HMMs is the [Viterbi algorithm](https://en.wikipedia.org/wiki/Viterbi_algorithm) -- an instance of dynamic programming. The bigram version of the algorithm is not difficult to implement (see pseudo-code in [SLP 8.4.5](https://web.stanford.edu/~jurafsky/slp3/8.pdf)); the trigram version, however, is more complex, and practical taggers incorporate other advanced features. 

There are numerous implementations available.
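
To make the dynamic programming concrete, here is a minimal sketch of bigram Viterbi decoding in log space. It is not the NLTK implementation; the function name `viterbi`, the dictionary-based probability tables keyed by `(previous_tag, tag)` and `(tag, word)`, the `<s>` start symbol and all probability values are assumptions for the example. No smoothing is applied, so unseen events get a log-probability of minus infinity.

```python
import math

def viterbi(words, tags, trans, emit, start="<s>"):
    """Most probable tag sequence under a bigram HMM (log space, no smoothing)."""
    if not words:
        return []

    def logp(table, key):
        p = table.get(key, 0.0)
        return math.log(p) if p > 0.0 else float("-inf")

    # best[i][t]: best log-score of any tag path ending in tag t at position i
    best = [{t: logp(trans, (start, t)) + logp(emit, (t, words[0])) for t in tags}]
    back = [{}]  # back[i][t]: previous tag on that best path
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] + logp(trans, (p, t)))
            best[i][t] = (best[i - 1][prev] + logp(trans, (prev, t))
                          + logp(emit, (t, words[i])))
            back[i][t] = prev
    # pick the best final tag and follow the back-pointers
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Hypothetical toy model.
tags = ["DET", "NOUN", "VERB"]
trans = {("<s>", "DET"): 0.8, ("<s>", "NOUN"): 0.2,
         ("DET", "NOUN"): 1.0,
         ("NOUN", "VERB"): 0.6, ("NOUN", "NOUN"): 0.4,
         ("VERB", "DET"): 0.7, ("VERB", "NOUN"): 0.3}
emit = {("DET", "the"): 1.0,
        ("NOUN", "dog"): 0.5, ("NOUN", "cat"): 0.4, ("NOUN", "barks"): 0.1,
        ("VERB", "barks"): 1.0}

decoded = viterbi(["the", "dog", "barks"], tags, trans, emit)
print(decoded)
```

The table has one cell per (position, tag) pair and each cell looks at every previous tag, so the cost grows linearly with sentence length and quadratically with the number of tags.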

### Maximum Likelihood Estimation (__MLE__)

Let's compare how *emission probabilities* and *bigram probabilities* are estimated:
- both use Maximum Likelihood Estimation (__MLE__) from frequency counts
- both suffer from data sparseness, which is addressed with:
    - smoothing (__+1S__ - add-one smoothing, for simplicity)
    - uniform probability estimation for out-of-vocabulary (__OOV__, `<unk>`) words

|         | __bigram *p*__ | __emission *p*__ |
|:--------|:-----------------------|:-------------------------|
| __MLE__ | $$p(w_i | w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$$ | $$p(w_i|t_i) = \frac{c(t_i,w_i)}{c(t_i)}$$
| __+1S__ | $$p(w_i | w_{i-1}) = \frac{c(w_{i-1},w_i)+1}{c(w_{i-1})+V}$$ | $$p(w_i|t_i)=\frac{c(t_i,w_i)+1}{c(t_i)+V}$$
| __OOV__ | $$\frac{1}{V}$$ | $$\frac{1}{V}$$ 

In practice this means that we can estimate emission probabilities as ngram probabilities, i.e. using the same functions for counting and smoothing, treating $c(t_i,w_i)$ as $c(w_{i-1},w_i)$, i.e. as `[t_i, w_i]` ngram.
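
A small sketch of this reuse: one add-one smoothing function, applied first to word bigram counts and then to `(tag, word)` counts. The function name `add_one_prob`, the counts and the vocabulary size are hypothetical values chosen for the example.

```python
from collections import Counter

def add_one_prob(pair_counts, context_counts, context, item, vocab_size):
    """Add-one estimate: (c(context, item) + 1) / (c(context) + V)."""
    return (pair_counts[(context, item)] + 1) / (context_counts[context] + vocab_size)

vocab_size = 4  # V: hypothetical vocabulary size

# Word bigrams: context is the previous word.
word_bigrams = Counter({("the", "dog"): 2, ("the", "cat"): 1})
word_contexts = Counter({"the": 3})
p_bigram = add_one_prob(word_bigrams, word_contexts, "the", "dog", vocab_size)

# Emissions: context is the tag, i.e. the pair is treated as a [t_i, w_i] ngram.
tag_word_pairs = Counter({("NOUN", "dog"): 2, ("NOUN", "cat"): 1})
tag_counts = Counter({"NOUN": 3})
p_emission = add_one_prob(tag_word_pairs, tag_counts, "NOUN", "dog", vocab_size)

# An unseen context falls back to the uniform 1/V of the OOV row.
p_unseen = add_one_prob(tag_word_pairs, tag_counts, "ADJ", "dog", vocab_size)
print(p_bigram, p_emission, p_unseen)
```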


## Part-of-Speech Tagging

Part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging, is the process of marking up a word in a text as corresponding to a particular part of speech, based on both its definition and its context.

Tag Sets vary from corpus to corpus.

### Universal Part of Speech Tags

Universal POS-Tag Set represents a simplified and unified set of part-of-speech tags, that was proposed for the standardization across corpora and languages. 
The number of defined tags varies from 12 ([Petrov et al/Google/NLTK](https://github.com/slavpetrov/universal-pos-tags)) to 17 ([Universal Dependencies/spaCy](https://universaldependencies.org/u/pos/index.html), in *Italics*).



| Tag  | Meaning | English Examples |
|:-----|:--------|:-----------------|
| __Open Class__ |||
| NOUN | noun (common and proper) | year, home, costs, time, Africa
| VERB | verb (all tenses and modes) | is, say, told, given, playing, would
| ADJ  | adjective           | new, good, high, special, big, local
| ADV  | adverb              | really, already, still, early, now
| *PROPN* | proper noun (split from NOUN) | Africa
| *INTJ*  | interjection (split from X) | oh, ouch
| __Closed Class__ |||
| DET  | determiner, article | the, a, some, most, every, no, which
| PRON | pronoun             | he, their, her, its, my, I, us
| ADP  | adposition (prepositions and postpositions) | on, of, at, with, by, into, under
| NUM  | numeral             | twenty-four, fourth, 1991, 14:24
| PRT (*PART*) | particles or other function words | at, on, out, over, per, that, up, with
| CONJ | conjunction         | and, or, but, if, while, although
| *AUX* | auxiliary (split from VERB) | have, is, should
| *CCONJ*  | coordinating conjunction (splits CONJ) | or, and
| *SCONJ*  | subordinating conjunction (splits CONJ) | if, while
| __Other__ |||
| .    | punctuation marks   | . , ; !
| X    | other               | foreign words, typos, abbreviations: ersatz, esprit, dunno, gr8, univeristy
| *SYM* | symbols (split from X) | $, :) 




### Part-of-Speech Tagging with NLTK

In [9]:
import nltk
nltk.download('universal_tagset')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package universal_tagset to
[nltk_data]     /Users/mdestro/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!
[nltk_data] Downloading package punkt to /Users/mdestro/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/mdestro/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [25]:
import nltk

text = "Oh. I have seen a man with a telescope in Antarctica."

# tokenization
tokens = nltk.word_tokenize(text)
print('\nTokens:', tokens)

# POS-tagging (with WSJ Tags)
print('\nPOS (WSJ):', nltk.pos_tag(tokens))

# POS-tagging with Universal Tags
print('\nPOS (universal):', nltk.pos_tag(tokens, tagset='universal'))



Tokens: ['Oh', '.', 'I', 'have', 'seen', 'a', 'man', 'with', 'a', 'telescope', 'in', 'Antarctica', '.']

POS (WSJ): [('Oh', 'UH'), ('.', '.'), ('I', 'PRP'), ('have', 'VBP'), ('seen', 'VBN'), ('a', 'DT'), ('man', 'NN'), ('with', 'IN'), ('a', 'DT'), ('telescope', 'NN'), ('in', 'IN'), ('Antarctica', 'NNP'), ('.', '.')]

POS (universal): [('Oh', 'X'), ('.', '.'), ('I', 'PRON'), ('have', 'VERB'), ('seen', 'VERB'), ('a', 'DET'), ('man', 'NOUN'), ('with', 'ADP'), ('a', 'DET'), ('telescope', 'NOUN'), ('in', 'ADP'), ('Antarctica', 'NOUN'), ('.', '.')]


### Part-of-Speech Tagging with spaCy

In [26]:
import spacy

# 'en' is a spaCy 2.x shortcut; spaCy 3+ requires a full model name, e.g. 'en_core_web_sm'
nlp = spacy.load('en')

# let's print spaCy pipeline
print([key for key, model in nlp.pipeline])


['tagger', 'parser', 'ner']


In [27]:
doc = nlp(text)

# tokens
print([t.text for t in doc])

# Fine grained POS-tags (not universal)
print([t.tag_ for t in doc])

# Coarse POS-tags (from Universal POS Tag set)
print([t.pos_ for t in doc])

['Oh', '.', 'I', 'have', 'seen', 'a', 'man', 'with', 'a', 'telescope', 'in', 'Antarctica', '.']
['UH', '.', 'PRP', 'VBP', 'VBN', 'DT', 'NN', 'IN', 'DT', 'NN', 'IN', 'NNP', '.']
['INTJ', 'PUNCT', 'PRON', 'AUX', 'VERB', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'ADP', 'PROPN', 'PUNCT']


### Training POS-Tagger with NLTK

- Manually POS-tagged corpus
- Sequence Labeling (Tagging) Algorithm

#### Corpora for POS-Tagging
NLTK provides several corpora, most of which are POS-tagged. We will use WSJ with the universal tag set (automatically converted using an internal mapping).

In [13]:
from nltk.corpus import treebank

# WSJ POS-Tags
print(treebank.tagged_sents()[:1])

# Universal POS-Tags
print(treebank.tagged_sents(tagset='universal')[:1])

[[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]]
[[('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ('61', 'NUM'), ('years', 'NOUN'), ('old', 'ADJ'), (',', '.'), ('will', 'VERB'), ('join', 'VERB'), ('the', 'DET'), ('board', 'NOUN'), ('as', 'ADP'), ('a', 'DET'), ('nonexecutive', 'ADJ'), ('director', 'NOUN'), ('Nov.', 'NOUN'), ('29', 'NUM'), ('.', '.')]]


#### NLTK Taggers (Some have also NER)

NLTK provides several tagging algorithms, including 

- rule-based taggers
    - Regular Expression Tagger: assigns tags to tokens by comparing their word strings to a series of regular expressions.

- [Pre-Trained Taggers](http://www.nltk.org/api/nltk.tag.html)
    - HunPoS
    - Senna
    - Stanford Tagger
    
- trainable taggers
    - `Brill Tagger`: Brill's transformational rule-based tagger assigns an initial tag sequence to a text, and then applies an ordered list of transformational rules to correct the tags of individual tokens. Learns rules from corpus.
    - [Greedy Averaged Perceptron](https://explosion.ai/blog/part-of-speech-pos-tagger-in-python)
    - [TnT](http://acl.ldc.upenn.edu/A/A00/A00-1031.pdf)
    - Hidden Markov Models
    - Conditional Random Fields (*later, another lab session*)
    - Sequential:
        - Affix Tagger: A tagger that chooses a token's tag based on a leading or trailing substring of its word string.
        - Ngram Tagger: A tagger that chooses a token's tag based on its word string and on the preceding _n_ words' tags.
            - Unigram Tagger
            - Bigram Tagger
            - Trigram Tagger

        - Classifier-based POS Tagger: A sequential tagger that uses a classifier to choose the tag for each token in a sentence.
    


#### Testing a POS Tagger

In [14]:
# Prepare Training & Test Splits as 80%/20%
import math

total_size = len(treebank.tagged_sents())
train_indx = math.ceil(total_size * 0.8)
trn_data = treebank.tagged_sents(tagset='universal')[:train_indx]
tst_data = treebank.tagged_sents(tagset='universal')[train_indx:]
 
print("Total: {}; Train: {}; Test: {}".format(total_size, len(trn_data), len(tst_data)))


Total: 3914; Train: 3132; Test: 782


In [15]:
# rule-based tagging
from nltk.tag import RegexpTagger

# rule from NLTK adapted to Universal Tag Set & extended
rules = [
    (r'^-?[0-9]+(.[0-9]+)?$', 'NUM'),   # cardinal numbers
    (r'(The|the|A|a|An|an)$', 'DET'),   # articles
    (r'.*able$', 'ADJ'),                # adjectives
    (r'.*ness$', 'NOUN'),               # nouns formed from adjectives
    (r'.*ly$', 'ADV'),                  # adverbs
    (r'.*s$', 'NOUN'),                  # plural nouns
    (r'.*ing$', 'VERB'),                # gerunds
    (r'.*ed$', 'VERB'),                 # past tense verbs
    (r'[\.,!\?:;\'"]', '.'),            # punctuation (extension) 
    (r'.*', 'NOUN')                     # nouns (default)
]

re_tagger = RegexpTagger(rules)

# tagging a sample sentence (note: the slice below is the training portion, not the test set)
for s in treebank.sents()[:train_indx]:
    print("INPUT: {}".format(s))
    print("TAG  : {}".format(re_tagger.tag(s)))
    break
    
# evaluation (recent NLTK releases deprecate .evaluate() in favour of the equivalent .accuracy())
accuracy = re_tagger.evaluate(tst_data)

print("Accuracy: {:6.4f}".format(accuracy))

INPUT: ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
TAG  : [('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ('61', 'NUM'), ('years', 'NOUN'), ('old', 'NOUN'), (',', '.'), ('will', 'NOUN'), ('join', 'NOUN'), ('the', 'DET'), ('board', 'NOUN'), ('as', 'NOUN'), ('a', 'DET'), ('nonexecutive', 'NOUN'), ('director', 'NOUN'), ('Nov.', 'NOUN'), ('29', 'NUM'), ('.', '.')]
Accuracy: 0.5360


##### Exercise (Optional)

- Extend the rule-set of RegexpTagger to handle closed-class words (similar to punctuation & DET):

    - prepositions (ADP)
    - particles (PRT)
    - pronouns (PRON)
    - conjunctions (CONJ)

- Evaluate 

#### Training HMM POS Tagger

In [16]:
# training hmm on treebank
import nltk.tag.hmm as hmm

hmm_model = hmm.HiddenMarkovModelTrainer()
hmm_tagger = hmm_model.train(trn_data)

# tagging a sample sentence (note: the slice below is the training portion, not the test set)
for s in treebank.sents()[:train_indx]:
    print("INPUT: {}".format(s))
    print("TAG  : {}".format(hmm_tagger.tag(s)))
    print("PATH : {}".format(hmm_tagger.best_path(s)))
    break
    
# evaluation (recent NLTK releases deprecate .evaluate() in favour of the equivalent .accuracy())
accuracy = hmm_tagger.evaluate(tst_data)

print("Accuracy: {:6.4f}".format(accuracy))

INPUT: ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
TAG  : [('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ('61', 'NUM'), ('years', 'NOUN'), ('old', 'ADJ'), (',', '.'), ('will', 'VERB'), ('join', 'VERB'), ('the', 'DET'), ('board', 'NOUN'), ('as', 'ADP'), ('a', 'DET'), ('nonexecutive', 'ADJ'), ('director', 'NOUN'), ('Nov.', 'NOUN'), ('29', 'NUM'), ('.', '.')]
PATH : ['NOUN', 'NOUN', '.', 'NUM', 'NOUN', 'ADJ', '.', 'VERB', 'VERB', 'DET', 'NOUN', 'ADP', 'DET', 'ADJ', 'NOUN', 'NOUN', 'NUM', '.']
Accuracy: 0.5135


#### Exercise: Tagging with NLTK
Experiment with different taggers provided in NLTK (e.g. NgramTagger)
- Explore and experiment with different tagger parameters
    - some of them have *cut-off*
- For each report evaluation accuracy

## Shallow Parsing

As we have already mentioned, [Shallow Parsing](https://en.wikipedia.org/wiki/Shallow_parsing) is a kind of Sequence Labeling. The main difference from Sequence Labeling tasks such as Part-of-Speech tagging, where there is one output label (tag) per token, is that Shallow Parsing additionally performs __chunking__ -- segmentation of the input sequence into constituents. Chunking is required to identify categories (or types) of *multi-word expressions*.
In other words, we want to capture the information that expressions like `"New York"`, which consist of 2 tokens, constitute a single unit.

What this means in practice is that Shallow Parsing performs *jointly* (or not) 2 tasks:
- __Segmentation__ of input into constituents (__spans__)
- __Classification__ (Categorization, Labeling) of these constituents into predefined set of labels (__types__)


### Revisiting Joint Probability Factorization
In [*generative approach*](https://en.wikipedia.org/wiki/Generative_model) to Sequence Labeling we are modeling [joint probability distribution](https://en.wikipedia.org/wiki/Joint_probability_distribution).

$$p(w_{1}^{n},t_{1}^{n}) = p(w_1, w_2, ..., w_n, t_1, t_2, ..., t_n)$$ 

To make the inference tractable, we factor the joint distribution using Chain Rule and apply [conditional independence assumption](https://en.wikipedia.org/wiki/Independence_(probability_theory)).

$$P(A,B) = P(B|A)P(A) = P(A|B)P(B)$$

It is common to mistakenly assume that $P(A|B) = P(B|A)$, known as [Confusion of the Inverse](https://en.wikipedia.org/wiki/Confusion_of_the_inverse).

The relation between $P(A|B)$ and $P(B|A)$ is given by the Bayes Rule:

$$P(A|B)=\frac{P(B|A)P(A)}{P(B)}$$

Consequently: 

$$p(t_{1}^{n}|w_{1}^{n}) = \frac{p(w_{1}^{n},t_{1}^{n})}{p(w_{1}^{n})} = \frac{p(w_{1}^{n} | t_{1}^{n}) p(t_{1}^{n})}{p(w_{1}^{n})}$$

If events $A$ and $B$ are independent, we have: 

$$P(A,B) = P(A)P(B) \rightarrow P(A) = P(A|B); P(B) = P(B|A)$$

Applying the Markov assumption to $p(t_{1}^{n})$ and the conditional independence assumption to $p(w_{1}^{n} | t_{1}^{n})$, we end up with our ngram sequence labeling model.


$$p(t_{1}^{n}|w_{1}^{n}) \propto p(w_{1}^{n}|t_{1}^{n})p(t_{1}^{n}) \approx \prod^n_{i=1} p(w_i|t_i)p(t_i|t_{i-N+1}^{i-1})$$

If we did not apply the conditional independence assumption to $p(w_{1}^{n}|t_{1}^{n})$, we would be modeling $p(w_{1}^{n},t_{1}^{n})$ __jointly__.

Applying just Markov assumption, i.e. modeling it as an ngram (Markov Chain), we will be solving the following equation:

$$p(w_{1}^n,t_{1}^{n}) \approx \prod_{i=1}^{n}{p(w_{i},t_{i}|w_{i-N+1}^{i-1},t_{i-N+1}^{i-1})}$$

Because:

$$p(t_{1}^{n}|w_{1}^{n}) = \frac{p(w_{1}^{n},t_{1}^{n})}{p(w_{1}^{n})} = \frac{p(w_{1}^{n} | t_{1}^{n}) p(t_{1}^{n})}{p(w_{1}^{n})}$$

$$t_{1}^{n} = \arg\max \limits_{t_{1}^{n}} p(t_{1}^{n}|w_{1}^{n}) = \arg\max \limits_{t_{1}^{n}} p(w_{1}^{n},t_{1}^{n}) = \arg\max \limits_{t_{1}^{n}} p(w_{1}^{n} | t_{1}^{n}) p(t_{1}^{n})$$

Most probable sequence can be obtained either way.

$$t_{1}^{n} \approx \arg\max\limits_{t_{1}^{n}} \prod^n_{i=1} p(w_i|t_i)p(t_i|t_{i-N+1}^{i-1})$$ 
$$t_{1}^{n} \approx \arg\max\limits_{t_{1}^{n}} \prod_{i=1}^{n}{p(w_{i},t_{i}|w_{i-N+1}^{i-1},t_{i-N+1}^{i-1})}$$

Factorization and conditional independence assumptions reduce computational complexity and, consequently, the *amount of observations* needed to estimate model probabilities.

Both models are applied to Shallow Parsing and Sequence Labeling in general:
e.g. Hidden Markov Model Tagger and Stochastic Conceptual Language Models for Spoken Language Understanding in [Raymond & Riccardi (2007)](https://disi.unitn.it/~riccardi/papers2/IS07-GenerDiscrSLU.pdf).

### Joint Segmentation and Classification
In Shallow Parsing, the segmentation and label information is generally modeled *jointly*. 
In practice, it means that our output labels ($t_i$) can be decomposed into ($c_i,s_i$), where $c_i$ is classification label for token $i$, and $s_i$ segmentation label for token $i$.

Consequently, in shallow parsing, we are modeling:

$$p(w_{1}^{n},t_{1}^{n}) \rightarrow p(w_{1}^{n},c_{1}^{n},s_{1}^{n}) \approx \prod^n_{i=1} p(w_i|c_i,s_i)p(c_i,s_i|c_{i-N+1}^{i-1},s_{i-N+1}^{i-1})$$

The joint modeling implies that we do not make a conditional independence assumption between segmentation and classification labels. If we assume that the probability of a word depends on segmentation and classification labels independently, while both labels depend on their previous N labels, we can factorize the equation as:

$$p(w_{1}^{n},c_{1}^{n},s_{1}^{n}) \approx \prod^n_{i=1} p(w_i|c_i)p(c_i|c_{i-N+1}^{i-1})p(w_i|s_i)p(s_i|s_{i-N+1}^{i-1})$$

The *events* could be modeled independently as well: i.e. we can predict either classification labels only, or segmentation labels only.

*Segmentation*:
$$p(w_{1}^{n},s_{1}^{n}) \approx \prod^n_{i=1} p(w_i|s_i)p(s_i|s_{i-N+1}^{i-1})$$
*Classification*
$$p(w_{1}^{n},c_{1}^{n}) \approx \prod^n_{i=1} p(w_i|c_i)p(c_i|c_{i-N+1}^{i-1})$$
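
The decomposition of a joint label $t_i$ into $(c_i, s_i)$ can be illustrated with the prefix notation (`B-`, `I-`, `O`) described in the IOB section of this notebook. The helper name `split_label` and the convention of using `O` for both parts of an outside token are assumptions made for this sketch.

```python
def split_label(label):
    """Split a joint label such as 'B-movie.name' into (segmentation, class)."""
    if label == "O":
        return "O", "O"   # assumption: outside tokens get 'O' in both sequences
    seg, _, cls = label.partition("-")
    return seg, cls

# Joint labels t_i for "who plays luke on star wars"
labels = ["O", "O", "B-character.name", "O", "B-movie.name", "I-movie.name"]

seg_labels = [split_label(label)[0] for label in labels]   # s_i: segmentation only
cls_labels = [split_label(label)[1] for label in labels]   # c_i: classification only
print(seg_labels)
print(cls_labels)
```

Training one tagger on `labels` corresponds to the joint model, while training separate taggers on `seg_labels` and `cls_labels` corresponds to the independent segmentation and classification models.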

#### Joint Modeling for Features
In Shallow Parsing we jointly model *output labels*.
The principles of joint modeling could be applied to introduce additional features for *input tokens* as well. 
For instance, we could model jointly words and part-of-speech tags ($x_i$) for shallow parsing as:

$$p(w_{1}^{n},x_{1}^{n},c_{1}^{n},s_{1}^{n}) \approx \prod^n_{i=1} p(w_i,x_i|c_i,s_i)p(c_i,s_i|c_{i-N+1}^{i-1},s_{i-N+1}^{i-1})$$

or predict them jointly with segmentation and classification labels as:

$$p(w_{1}^{n},x_{1}^{n},c_{1}^{n},s_{1}^{n}) \approx \prod^n_{i=1} p(w_i|c_i,s_i,x_i)p(c_i,s_i,x_i|c_{i-N+1}^{i-1},s_{i-N+1}^{i-1},x_{i-N+1}^{i-1})$$

In the first case our input consists of *word-pos* pairs; since we make no independence assumption between them, each pair is treated as a single unit (i.e. for tagging new text you need to generate *pos* per word some other way). The same applies to *segmentation-classification* (or *segmentation-classification-pos*) output labels.

- In joint modeling our observations for tokens and ngrams are more sparse: a *word-pos* pair usually appears in data less often than the *word* and the *pos* separately (the same applies to their ngrams). 
- In joint modeling of output labels, we have to estimate more of them, and thus have fewer observations for each.


#### [Bayesian Categorization](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)

Assuming conditional independence between *word* and *pos* given the label leads to Bayesian Categorization.

$$p(C_k|x) = \frac{p(x|C_k)p(C_k)}{p(x)} \propto p(C_k)p(x|C_k) \approx p(C_k) \prod^n_{i=1} p(x_i|C_k)$$

$$p(t_{i}|w_{i},x_{i}) \propto p(t_{i})p(w_i,x_i|t_i) \approx p(t_{i})p(w_i|t_i)p(x_i|t_i)$$
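
As a rough illustration of this categorization, the sketch below scores each candidate label by the product of its prior and the two per-feature likelihoods, then normalizes the scores so they sum to 1. All probability values, the table layout and the function name `naive_bayes_posterior` are hypothetical.

```python
def naive_bayes_posterior(word, pos, prior, p_word, p_pos):
    """Posterior over labels from p(t) * p(w|t) * p(x|t), normalized over labels."""
    scores = {t: prior[t] * p_word.get((t, word), 0.0) * p_pos.get((t, pos), 0.0)
              for t in prior}
    total = sum(scores.values())
    if total == 0.0:
        return scores            # every label has zero score: nothing to normalize
    return {t: s / total for t, s in scores.items()}

# Hypothetical probability tables.
prior = {"B-movie.name": 0.2, "O": 0.8}
p_word = {("B-movie.name", "star"): 0.5, ("O", "star"): 0.01}
p_pos = {("B-movie.name", "NOUN"): 0.6, ("O", "NOUN"): 0.1}

posterior = naive_bayes_posterior("star", "NOUN", prior, p_word, p_pos)
best_label = max(posterior, key=posterior.get)
print(best_label, round(posterior[best_label], 3))
```

The normalization step plays the role of dividing by $p(x)$; it does not change which label wins, so it can be skipped when only the argmax is needed.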

## Encoding Segmentation Information: CoNLL Corpus Format

A corpus in CoNLL format consists of a series of sentences, separated by blank lines. Each sentence is encoded using a table (or "grid") of values, where each line corresponds to a single word, and each column corresponds to an annotation type. 

The set of columns used by CoNLL-style files can vary from corpus to corpus.

```
who    O
plays  O
luke   B-character.name
on     O
star   B-movie.name
wars   I-movie.name
new    I-movie.name
hope   I-movie.name
```
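
A format like this is straightforward to parse. The function below is a minimal sketch that assumes whitespace-separated columns and blank lines between sentences; it is not the `conll.py` module listed in the requirements, and the name `read_conll` is chosen only for this example.

```python
def read_conll(text):
    """Parse CoNLL-style text into sentences; each token is a tuple of its columns."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:                  # a blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(tuple(line.split()))
    if current:                       # the last sentence may lack a trailing blank line
        sentences.append(current)
    return sentences

sample = """who    O
plays  O
luke   B-character.name

star   B-movie.name
wars   I-movie.name
"""

corpus = read_conll(sample)
print(len(corpus))
print(corpus[0])
```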

### [IOB Scheme](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging))

- The notation scheme is used to label *multi-word* spans in token-per-line format.
    - *star wars new hope* is a *movie.name* concept that has 4 tokens
- Both prefix and suffix notations are common: 
    - prefix: __B-movie.name__
    - suffix: __movie.name-B__

- Meaning of Prefixes
    - __B__ for (__B__)eginning of span
    - __I__ for (__I__)nside of span
    - __O__ for (__O__)utside of span (no prefix or suffix, just `O`)

#### Alternative Schemes:
- No prefix or suffix (useful when there are no *multi-word* concepts)
```
        who    O
        plays  O
        luke   character.name
        on     O
        star   movie.name
        wars   movie.name
        new    movie.name
        hope   movie.name
```
- __IOB/IOB2/BIO__

- __IOBE__
    - IOB + 
    - __E__ for (__E__)nd of span (or __L__ for (__L__)ast)
```
        who    O
        plays  O
        luke   B-character.name
        on     O
        star   B-movie.name
        wars   I-movie.name
        new    I-movie.name
        hope   E-movie.name
```
    
- __BILOU/BIOES__
    - IOB + 
    - __L__ for (__L__)ast word of span
    - __U__ for (__U__)nit word (or __S__ for (__S__)ingleton)
```
        who    O
        plays  O
        luke   U-character.name
        on     O
        star   B-movie.name
        wars   I-movie.name
        new    I-movie.name
        hope   L-movie.name
```

#### Choice of Scheme
- It is possible to convert IOB, IOBE, & BILOU formats to each other
- Each prefix is applied to every concept label, consequently we increase the number of transitions whose probabilities we need to estimate; 
    - increasing data sparseness, as for each label we will have fewer observations
- The choice of scheme depends on the amount of available data:
    - __IOB__ when the least data is available
    - __BILOU__ when the most data is available
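
Conversion between schemes only requires looking at the neighbouring label. As a sketch, the function below converts prefix-style IOB labels (with `B-` on every span start) to BILOU; the name `iob_to_bilou` is hypothetical and well-formed input is assumed.

```python
def iob_to_bilou(labels):
    """Convert prefix-style IOB labels (B- starts every span) to BILOU."""
    converted = []
    for i, label in enumerate(labels):
        if label == "O":
            converted.append(label)
            continue
        prefix, _, kind = label.partition("-")
        nxt = labels[i + 1] if i + 1 < len(labels) else "O"
        continues = nxt == "I-" + kind      # does the span go on after this token?
        if prefix == "B":
            converted.append(("B-" if continues else "U-") + kind)
        else:                               # prefix == "I"
            converted.append(("I-" if continues else "L-") + kind)
    return converted

iob = ["O", "O", "B-character.name", "O",
       "B-movie.name", "I-movie.name", "I-movie.name", "I-movie.name"]
bilou = iob_to_bilou(iob)
print(bilou)
```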

#### Terminology
There is no strict naming convention regarding schemes (see alternatives) or how each constituent is termed. 
Below is the terminology used in this notebook. 

```
    who    O
    plays  O
    luke   B-character.name
    on     O
    star   B-movie.name
    wars   I-movie.name
    new    I-movie.name
    hope   I-movie.name
```

##### Interpretation
Segmentation and Labeling data formats encode the following information:
- in string (sentence) `"who plays luke on star wars new hope"`
- there are 2 __entities__ (a.k.a. chunks, concepts or slots, depending on NLP task and perspective), that have __types__ (labels)
    - `character.name`
    - `movie.name`
    
- entity of __type__ `movie.name`: 
    - has __span__:
        - as tokens from `0` for *CoNLL*: `[4:7]`
    - has __value__: `"star wars new hope"`
        - string *covered by* (*or included in*) the __span__
 
*CoNLL* format encodes __tokenization__ information, in other words, how the string `"star wars new hope"` is split into tokens. Since most Sequence Labeling algorithms operate on token level, internally the strings are split into tokens, applying *IOB*-like schemes.
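
Recovering entities (type, span and value) from token-level labels can be sketched as follows. The function name `extract_entities` and the output tuple layout are assumptions for the example; token indices are counted from 0 with an inclusive end, and prefix-style IOB labels with `B-` on every span start are assumed.

```python
def extract_entities(tokens, labels):
    """Return (type, start, end, value) per span; indices from 0, end inclusive."""
    entities, start, kind = [], None, None
    # the extra "O" acts as a sentinel that closes a span ending on the last token
    for i, label in enumerate(list(labels) + ["O"]):
        prefix, _, cur = label.partition("-")
        inside = prefix == "I" and cur == kind
        if start is not None and not inside:     # the open span ended at token i - 1
            entities.append((kind, start, i - 1, " ".join(tokens[start:i])))
            start, kind = None, None
        if prefix == "B":                        # a new span starts here
            start, kind = i, cur
    return entities

tokens = "who plays luke on star wars new hope".split()
labels = ["O", "O", "B-character.name", "O",
          "B-movie.name", "I-movie.name", "I-movie.name", "I-movie.name"]

entities = extract_entities(tokens, labels)
for entity in entities:
    print(entity)
```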

## Named Entity Recognition with NLTK
[NLTK](https://www.nltk.org/api/nltk.tag.html) provides implementations of popular sequence labeling algorithms for Part-of-Speech Tagging (including [HMM](https://www.nltk.org/api/nltk.tag.html#module-nltk.tag.hmm)), that can be used for Sequence Labeling in general. 

- Loading & working with CoNLL format corpora in NLTK
- Tagger training & testing (running)

To have a custom tagger that labels input text with our __custom label set__, we need to __train__ it on a corpus annotated with this __custom label set__.

Additionally, NLTK provides [Chunking](http://www.nltk.org/api/nltk.chunk.html). 

### NLTK Pre-trained NE Chunker

NLTK provides a classifier that has already been trained to recognize named entities, accessed with the function `nltk.ne_chunk()`. If we set the parameter `binary=True`, then named entities are just tagged as `NE`; otherwise, the classifier adds category labels such as `PERSON`, `ORGANIZATION`, and `GPE`.

In [17]:
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/mdestro/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /Users/mdestro/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [18]:
for s in treebank.tagged_sents():
    print(s)
    print(nltk.ne_chunk(s))
    print(nltk.ne_chunk(s, binary=True))
    break

[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
(S
  (PERSON Pierre/NNP)
  (ORGANIZATION Vinken/NNP)
  ,/,
  61/CD
  years/NNS
  old/JJ
  ,/,
  will/MD
  join/VB
  the/DT
  board/NN
  as/IN
  a/DT
  nonexecutive/JJ
  director/NN
  Nov./NNP
  29/CD
  ./.)
(S
  (NE Pierre/NNP Vinken/NNP)
  ,/,
  61/CD
  years/NNS
  old/JJ
  ,/,
  will/MD
  join/VB
  the/DT
  board/NN
  as/IN
  a/DT
  nonexecutive/JJ
  director/NN
  Nov./NNP
  29/CD
  ./.)
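
Chunk trees like the ones printed above can be flattened into `(word, pos, iob)` triples with `nltk.chunk.tree2conlltags`. The sketch below builds a small fragment of the binary tree by hand, so it is an illustration rather than actual `ne_chunk` output, and it needs no model download.

```python
from nltk import Tree
from nltk.chunk import tree2conlltags

# hand-built fragment of the binary (NE-only) chunk tree shown above
tree = Tree('S', [
    Tree('NE', [('Pierre', 'NNP'), ('Vinken', 'NNP')]),
    (',', ','),
    ('61', 'CD'),
])

# each leaf becomes a (word, pos, iob) triple; chunk labels become B-/I- tags
iob_triples = tree2conlltags(tree)
for word, pos, iob in iob_triples:
    print(word, pos, iob)
# Pierre NNP B-NE
# Vinken NNP I-NE
# , , O
# 61 CD O
```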


### Training NLTK Taggers

In [19]:

nltk.download('conll2002')

[nltk_data] Downloading package conll2002 to
[nltk_data]     /Users/mdestro/nltk_data...
[nltk_data]   Unzipping corpora/conll2002.zip.


True

In [20]:
from nltk.corpus import conll2002

print(len(conll2002.tagged_sents()))
print(conll2002._chunk_types)
print(conll2002.sents('esp.train')[0])
print(conll2002.tagged_sents('esp.train')[0])
print(conll2002.chunked_sents('esp.train')[0])
print(conll2002.iob_sents('esp.train')[0])

35651
('LOC', 'PER', 'ORG', 'MISC')
['Melbourne', '(', 'Australia', ')', ',', '25', 'may', '(', 'EFE', ')', '.']
[('Melbourne', 'NP'), ('(', 'Fpa'), ('Australia', 'NP'), (')', 'Fpt'), (',', 'Fc'), ('25', 'Z'), ('may', 'NC'), ('(', 'Fpa'), ('EFE', 'NC'), (')', 'Fpt'), ('.', 'Fp')]
(S
  (LOC Melbourne/NP)
  (/Fpa
  (LOC Australia/NP)
  )/Fpt
  ,/Fc
  25/Z
  may/NC
  (/Fpa
  (ORG EFE/NC)
  )/Fpt
  ./Fp)
[('Melbourne', 'NP', 'B-LOC'), ('(', 'Fpa', 'O'), ('Australia', 'NP', 'B-LOC'), (')', 'Fpt', 'O'), (',', 'Fc', 'O'), ('25', 'Z', 'O'), ('may', 'NC', 'O'), ('(', 'Fpa', 'O'), ('EFE', 'NC', 'B-ORG'), (')', 'Fpt', 'O'), ('.', 'Fp', 'O')]


In [21]:
# training hmm on training data: exactly as above
import nltk.tag.hmm as hmm

hmm_model = hmm.HiddenMarkovModelTrainer()

print(conll2002.iob_sents('esp.train')[0])

# let's get only word and iob-tag
trn_sents = [[(text, iob) for text, pos, iob in sent] for sent in conll2002.iob_sents('esp.train')]
print(trn_sents[0])

tst_sents = [[(text, iob) for text, pos, iob in sent] for sent in conll2002.iob_sents('esp.testa')]

hmm_ner = hmm_model.train(trn_sents)
    
# evaluation: token-level tagging accuracy
# (recent NLTK releases deprecate `evaluate` in favor of `accuracy`)
accuracy = hmm_ner.evaluate(tst_sents)

print("Accuracy: {:6.4f}".format(accuracy))

[('Melbourne', 'NP', 'B-LOC'), ('(', 'Fpa', 'O'), ('Australia', 'NP', 'B-LOC'), (')', 'Fpt', 'O'), (',', 'Fc', 'O'), ('25', 'Z', 'O'), ('may', 'NC', 'O'), ('(', 'Fpa', 'O'), ('EFE', 'NC', 'B-ORG'), (')', 'Fpt', 'O'), ('.', 'Fp', 'O')]
[('Melbourne', 'B-LOC'), ('(', 'O'), ('Australia', 'B-LOC'), (')', 'O'), (',', 'O'), ('25', 'O'), ('may', 'O'), ('(', 'O'), ('EFE', 'B-ORG'), (')', 'O'), ('.', 'O')]
Accuracy: 0.3760


### NLTK Chunk Tagger Note
The HMM tagger uses only words as input. NLTK also provides a trainable MaxEnt chunk tagger, which unfortunately requires [`megam`](http://www.umiacs.umd.edu/~hal/megam/index.html), an external package that is very convoluted to install.

### Exercise

#### Segmentation 
Train a tagger to perform *segmentation* of input sentences into constituents
- Strip concept information from output labels (i.e. keep only IOB-prefix)
- Train tagger to predict segmentation labels
- Evaluate segmentation performance
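
As a starting point for the first step, label stripping might look like the sketch below. `strip_labels` is a hypothetical helper name, and the input is assumed to be a list of sentences of `(word, tag)` pairs like `trn_sents` above.

```python
def strip_labels(sents):
    """Keep only the segmentation prefix of each tag: 'B-LOC' -> 'B', 'O' -> 'O'."""
    return [[(word, tag.split('-', 1)[0]) for word, tag in sent] for sent in sents]


sents = [[('Melbourne', 'B-LOC'), ('(', 'O'), ('Australia', 'B-LOC'), (')', 'O')]]
print(strip_labels(sents))
# [[('Melbourne', 'B'), ('(', 'O'), ('Australia', 'B'), (')', 'O')]]
```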

#### CoNLL Eval: Exercise
The CoNLL community developed a Perl script to evaluate *segmentation* and *labeling* performance jointly using IOB information. Such evaluation assesses shallow parsing performance more accurately than token-level metrics (e.g. NLTK accuracy).

- import `evaluate` function from `conll.py` (example shown)
- evaluate tagger predictions
- compare performances to token-level accuracies
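
To illustrate why chunk-level and token-level scores can diverge, here is a simplified sketch of both computations on a single sentence. It is not the `conll.py` implementation: the helper names are hypothetical, tags are assumed to be well-formed IOB2, and a chunk counts as correct only if both its boundaries and its label match.

```python
def chunks(tags):
    """Set of (label, start, end) chunks in an IOB2 tag sequence (end-exclusive)."""
    found, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ['O']):  # sentinel closes a final chunk
        if start is not None and not (tag.startswith('I-') and tag[2:] == label):
            found.add((label, start, i))
            start, label = None, None
        if tag.startswith('B-'):
            start, label = i, tag[2:]
    return found


def token_accuracy(ref, hyp):
    """Fraction of tokens whose predicted tag equals the reference tag."""
    return sum(r == h for r, h in zip(ref, hyp)) / len(ref)


def chunk_prf(ref, hyp):
    """Precision, recall and F1 over exactly matching chunks."""
    ref_chunks, hyp_chunks = chunks(ref), chunks(hyp)
    correct = len(ref_chunks & hyp_chunks)
    p = correct / len(hyp_chunks) if hyp_chunks else 0.0
    r = correct / len(ref_chunks) if ref_chunks else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


ref = ['B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'B-ORG', 'O', 'O']
hyp = ['B-LOC', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'O', 'O']
print(round(token_accuracy(ref, hyp), 3))  # 0.889
print(chunk_prf(ref, hyp))                 # (0.5, 0.5, 0.5)
```

Here a single boundary error costs one token out of nine at the token level, but one of the two entities at the chunk level.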

In [114]:
# to import conll
import os
import sys
sys.path.insert(0, os.path.abspath('../src/'))

from conll import evaluate
# for nice tables
import pandas as pd

# getting references (note that it is testb this time)
refs = [[(text, iob) for text, pos, iob in sent] for sent in conll2002.iob_sents('esp.testb')]
print(refs[0])
# getting hypotheses
hyps = [hmm_ner.tag(s) for s in conll2002.sents('esp.testb')]
print(hyps[0])

results = evaluate(refs, hyps)

pd_tbl = pd.DataFrame.from_dict(results, orient='index')
pd_tbl.round(decimals=3)

[('La', 'B-LOC'), ('Coruña', 'I-LOC'), (',', 'O'), ('23', 'O'), ('may', 'O'), ('(', 'O'), ('EFECOM', 'B-ORG'), (')', 'O'), ('.', 'O')]
[('La', 'B-LOC'), ('Coruña', 'I-LOC'), (',', 'O'), ('23', 'O'), ('may', 'O'), ('(', 'O'), ('EFECOM', 'B-ORG'), (')', 'O'), ('.', 'O')]
{'PER': {'p': 0.6912751677852349, 'r': 0.2802721088435374, 'f': 0.398838334946757, 's': 735}, 'LOC': {'p': 0.0297753899928798, 'r': 0.8487084870848709, 'f': 0.0575323619535989, 's': 1084}, 'ORG': {'p': 0.7948350071736011, 'r': 0.39571428571428574, 'f': 0.5283738674296614, 's': 1400}, 'MISC': {'p': 0.5702479338842975, 'r': 0.20294117647058824, 'f': 0.299349240780911, 's': 340}, 'total': {'p': 0.05463234834759793, 'r': 0.4914301770160157, 'f': 0.09833300536924071, 's': 3559}}


| | p | r | f | s |
|---|---|---|---|---|
| PER | 0.691 | 0.280 | 0.399 | 735 |
| LOC | 0.030 | 0.849 | 0.058 | 1084 |
| ORG | 0.795 | 0.396 | 0.528 | 1400 |
| MISC | 0.570 | 0.203 | 0.299 | 340 |
| total | 0.055 | 0.491 | 0.098 | 3559 |


### Named Entity Recognition with spaCy

In [23]:
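import spacy
# note: spaCy v3+ no longer supports the 'en' shortcut; if this line fails,
# load an installed model by its full name, e.g. spacy.load('en_core_web_sm')
nlp = spacy.load('en')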
txt = 'Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.'
doc = nlp(txt)

print([ent.text for ent in doc.ents])
print([(t.ent_type_, t.ent_iob_) for t in doc])

['Pierre Vinken', '61 years old', 'Nov. 29']
[('PERSON', 'B'), ('PERSON', 'I'), ('', 'O'), ('DATE', 'B'), ('DATE', 'I'), ('DATE', 'I'), ('', 'O'), ('', 'O'), ('', 'O'), ('', 'O'), ('', 'O'), ('', 'O'), ('', 'O'), ('', 'O'), ('', 'O'), ('DATE', 'B'), ('DATE', 'I'), ('', 'O')]
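
To compare spaCy output with CoNLL-style references, the per-token `ent_type_` and `ent_iob_` values have to be combined into tags such as `B-PERSON`. A minimal sketch, with a hypothetical helper `to_iob` operating on pairs like the ones printed above:

```python
def to_iob(pairs):
    """Map (ent_type, ent_iob) pairs to tags such as 'B-PERSON', 'I-DATE' or 'O'."""
    # tokens outside any entity have an empty ent_type and are tagged 'O'
    return [iob + '-' + ent_type if ent_type else 'O' for ent_type, iob in pairs]


pairs = [('PERSON', 'B'), ('PERSON', 'I'), ('', 'O'),
         ('DATE', 'B'), ('DATE', 'I'), ('DATE', 'I')]
print(to_iob(pairs))
# ['B-PERSON', 'I-PERSON', 'O', 'B-DATE', 'I-DATE', 'I-DATE']
```

Note that spaCy's label set (e.g. `PERSON`, `DATE`) differs from the CoNLL 2002 one (`PER`, `LOC`, `ORG`, `MISC`), so a label mapping is also needed before scoring.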


### Exercise:
- Evaluate spaCy NER model using CoNLL evaluation script on CoNLL 2002 Test B data
- Data is Spanish

## Assignment

The assignment is at the intersection of Named Entity Recognition and Dependency Parsing.

0. Evaluate spaCy NER on CoNLL 2003 data (provided)
    - report token-level performance (per class and total)
        - accuracy of correctly recognizing all tokens that belong to named entities (i.e. tag-level accuracy) 
    - report CoNLL chunk-level performance (per class and total);
        - precision, recall, f-measure of correctly recognizing all the named entities in a chunk per class and total  

1. Grouping of Entities.
Write a function to group recognized named entities using the `noun_chunks` attribute of a [spaCy](https://spacy.io/usage/linguistic-features#noun-chunks) document. Analyze the groups in terms of the most frequent combinations (i.e. NER types that go together). 

2. One of the possible post-processing steps is to fix segmentation errors.
Write a function that extends the entity span to cover the full noun compound. Make use of the `compound` dependency relation.