# Part 3: Weakly supervised part-of-speech tagging

In this part, we turn to a different type of task called **sequence labeling**. Instead of assigning one label to an entire text, we assign a label to each token in the text.
Specifically, we choose **part-of-speech (POS) tagging**, where the goal is to assign each word a POS tag that indicates its grammatical category, based on the word's definition and context.
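
To make the difference concrete, the following minimal sketch contrasts the two output shapes. The sentence, its tags and the class name `statement` are illustrative assumptions and are not taken from the corpus.

```python
# Text classification: one label for the whole text.
text = "The two dogs bark"
text_label = "statement"  # hypothetical class name, for illustration only

# Sequence labeling: one label per token.
tokens = text.split()
pos_tags = ["DET", "NUM", "NOUN", "VERB"]

# Every token is paired with exactly one tag.
tagged = list(zip(tokens, pos_tags))
print(tagged)
```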


<img src="../img/pos_tagging.png" width="800" style="display: block; margin: 0 auto" />



In order to perform weakly supervised POS tagging, we will employ the [skweak toolkit](https://github.com/NorskRegnesentral/skweak).
We will create labeling functions to assign POS tags based on _syntactic analysis_ and _grammatical rules_.


In [1]:
# Imports
%load_ext autoreload
%autoreload 2

import re
import os

import pandas as pd

import nltk
import spacy

from textblob import TextBlob
from textblob.taggers import PatternTagger

import skweak

from scripts.skweak_ner_eval import evaluate
from scripts.utils import load_data_split, get_frequent_words, tag_all, penntreebank2universal, compute_recall, compute_num_conflicts

pd.set_option('display.max_rows', 500)

In [2]:
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')
nltk.download('stopwords')
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_md
!python -m textblob.download_corpora

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/andst/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     /Users/andst/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/andst/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Collecting en-core-web-md==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.6.0/en_core_web_md-3.6.0-py3-none-any.whl (42.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
[nltk_data] Downloading package brown to /Users/andst/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] 

## POS tags

For this tutorial, we will use the following subset of the [universal POS tags](https://universaldependencies.org/u/pos/index.html):
1. **DET**: determiner, a word that modifies a noun or noun phrase and expresses its reference in context.
2. **NUM**: numeral, a word that expresses a number and a relation to it, such as quantity, sequence, frequency or fraction.
3. **PROPN**: proper noun, a noun that is the name of a specific individual, place, or object.
4. **ADJ**: adjective, a word that typically modifies a noun and specifies its properties or attributes.
5. **NOUN**: noun, a part of speech typically denoting a person, place, thing, animal or idea.

In [3]:
all_labels = ["DET", "NUM", "PROPN", "ADJ", "NOUN"]

## Load data

We will use the [English corpus](https://universaldependencies.org/treebanks/en_ewt/index.html) from Universal Dependencies, a framework that contains consistent grammatical annotations across many different languages.
The texts in the corpus come from five types of web media (weblogs, newsgroups, emails, reviews, and Yahoo! Answers) and comprise 254,825 words and 16,621 sentences.

Skweak operates on spaCy ``doc`` objects, so the dataset is loaded in this format.

In [4]:
# Load training data
train_docs = load_data_split("train", all_labels)

In [5]:
for doc in train_docs[:3]:
    skweak.utils.display_entities(doc)

## 3.1 Labeling functions

In the first step, we find the 200 most frequent words in our training corpus and use a lexicon to label them. In the second step, we manually annotate the 50 most frequent words.
Finally, we create the following labeling functions for each POS tag:

*   DET --> Lexicon with determiners.
*   NUM --> If the token is a number or a word indicating a number from 1 to 10.
*   PROPN --> A word that is capitalized.
*   ADJ --> List of prefixes and suffixes. Syntactic rules that check: 1. if the previous word is a form of "be" and 2. if the previous word is a determiner or numeral.
*   NOUN --> List of prefixes and suffixes. Syntactic rule checking if the previous word is a determiner, numeral or adjective.
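
All labeling functions in this part share the same shape: a generator that walks over the tokens and yields `(start, end, label)` triples for the spans it wants to label. The sketch below illustrates that contract without any dependencies. The `Token` dataclass is a hypothetical stand-in for a spaCy token, and `article_detector` is an invented example, not one of the LFs used later.

```python
from dataclasses import dataclass


@dataclass
class Token:
    # Minimal stand-in for a spaCy token: position and surface text.
    i: int
    text: str


def article_detector(doc):
    # Yield (start, end, label) for every token that is an English article.
    for token in doc:
        if token.text.lower() in {"a", "an", "the"}:
            yield token.i, token.i + 1, "DET"


doc = [Token(i, w) for i, w in enumerate("The cat saw an owl".split())]
spans = list(article_detector(doc))
print(spans)  # [(0, 1, 'DET'), (3, 4, 'DET')]
```

In the notebook, a function of this shape is wrapped with `skweak.heuristics.FunctionAnnotator(name, function)` so that its spans are stored under `name`.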

#### Lexicon LF

In [6]:
# Get the 200 most frequent words in the training set
frequent_words = get_frequent_words(train_docs, 200)
print(frequent_words[:5])

["'s", "n't", 'would', 'one', 'like']


In [7]:
# Load the lexicon
with open("noun_vb_adj_list.txt") as f:
    lines = f.readlines()

# Create a dictionary with the words and their pos tags
lexicon = {}
for l in lines:
    values = l.replace("\n", "").split("\t")
    lexicon[values[0]] = values[1]

In [8]:
print("There are", len(lexicon), "words in the lexicon.")
print(list(lexicon.items())[:5])

There are 3387 words in the lexicon.
[('people', 'NOUN'), ('history', 'NOUN'), ('way', 'NOUN'), ('art', 'NOUN'), ('world', 'NOUN')]


In [9]:
# How many of the frequent words also appear in the lexicon?
len(set(frequent_words) & set(lexicon))

121

In [10]:
# Lexicon LF
def frequent_word_detector(doc):
    for token in doc:
        word = token.text.lower()
        # If the frequent word exists in the lexicon, use its assigned POS tag
        if word in frequent_words and word in lexicon:
            yield token.i, token.i + 1, lexicon[word]


lexicon_lf = skweak.heuristics.FunctionAnnotator("frequent_words", frequent_word_detector)


#### Manual annotation LF

In [11]:
# Manual annotation
top50_words = get_frequent_words(train_docs, 50)
print(top50_words)

["'s", "n't", 'would', 'one', 'like', 'time', 'get', 'know', 'also', 'us', 'good', 'could', 'new', 'go', 'please', '$', 'people', 'may', 'back', 'said', 'even', 'work', 'bush', 'well', 'want', 'great', 'way', 'see', 'best', 'place', 'take', "'m", 'going', 'service', 'need', 'thanks', 'make', 'many', 'year', 'number', 'day', 'two', 'think', 'much', 'food', 'let', 'first', 'call', '2', 'help']


In [12]:
# Manually annotate the frequent words that are clearly a numeral, noun, adjective or verb
manual_tags = {
    "one": "NUM",
    "like": "VERB",
    "time": "NOUN",
    "get": "VERB",
    "know": "VERB",
    "good": "ADJ",
    "could": "VERB",
    "new": "ADJ",
    "go": "VERB",
    "please": "VERB",
    "people": "NOUN",
    "said": "VERB",
    "work": "VERB",
    "bush": "NOUN",
    "want": "VERB",
    "great": "ADJ",
    "way": "NOUN",
    "see": "VERB",
    "best": "ADJ",
    "place": "NOUN",
    "take": "VERB",
    "going": "VERB",
    "service": "NOUN",
    "need": "VERB",
    "make": "VERB",
    "year": "NOUN",
    "number": "NOUN",
    "day": "NOUN",
    "two": "NUM",
    "think": "VERB",
    "food": "NOUN",
    "let": "VERB",
    "first": "ADJ",
    "call": "VERB",
    "2": "NUM",
    "help": "VERB"
}

In [13]:
# Manual POS tags LF
def manual_pos_tagger(doc):
    for token in doc:
        if token.text.lower() in manual_tags:
            yield token.i, token.i + 1, manual_tags[token.text.lower()]


manual_pos_lf = skweak.heuristics.FunctionAnnotator("manual_pos", manual_pos_tagger)


#### DET LF

In [14]:
# Use a lexicon of determiners
tries = skweak.gazetteers.extract_json_data("det.json")
det_lf = skweak.gazetteers.GazetteerAnnotator("determiners", tries, case_sensitive=False)


Extracting data from det.json
Populating trie for class DET (number: 47)


#### NUM LF

In [15]:
# Use a regular expression pattern to look for digits
def num_detector(doc):
    for token in doc:
        if re.search(r"\d+", token.text):
            yield token.i, token.i + 1, "NUM"

# Check if the token is the word of a number from 1 to 10
def num_word_detector(doc):
    for token in doc:
        if token.text.lower() in ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"]:
            yield token.i, token.i + 1, "NUM"

num_lf1 = skweak.heuristics.FunctionAnnotator("numerals1", num_detector)
num_lf2 = skweak.heuristics.FunctionAnnotator("numerals2", num_word_detector)
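
Note that `re.search` succeeds if a digit occurs anywhere in the token, so tokens such as "2nd" or "B2B" are also labeled NUM. The sketch below compares this loose check with a stricter `fullmatch` pattern on a few hand-picked strings. The stricter pattern is an assumption added for illustration and is not used in this tutorial.

```python
import re

tokens = ["2", "2005", "3.5", "1,000", "2nd", "B2B", "two"]

# Loose check used in this section: a digit anywhere in the token.
loose = [t for t in tokens if re.search(r"\d+", t)]

# Stricter variant: digits, optionally grouped by "," or "." separators.
strict_pattern = re.compile(r"\d+(?:[.,]\d+)*")
strict = [t for t in tokens if strict_pattern.fullmatch(t)]

print(loose)   # ['2', '2005', '3.5', '1,000', '2nd', 'B2B']
print(strict)  # ['2', '2005', '3.5', '1,000']
```

Which variant is preferable depends on whether precision or recall matters more for the NUM tag.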


#### PROPN LF

In [16]:
# Check if the first letter of a word or the whole word is capitalized
def propn_detector(doc):
    for token in doc:
        if token.i == 0:
            # For the first word of a sentence, check if all letters are capitalized
            if token.text.isupper():
                yield token.i, token.i + 1, "PROPN"
        else:
            if token.text.isupper() or token.text[0].isupper():
                yield token.i, token.i + 1, "PROPN"


propn_lf = skweak.heuristics.FunctionAnnotator("proper_nouns", propn_detector)


#### ADJ LFs

In [17]:
# Look for common suffixes
def adj_detector_suffixes(doc):
    suffixes = ("able", "al", "ful", "ic", "ive", "less", "ous", "y", "ish", "ible", "ent", "est")
    for token in doc:
        if len(token.text) > 3 and token.text.endswith(suffixes):
            yield token.i, token.i + 1, "ADJ"


# Look for common prefixes
def adj_detector_prefixes(doc):
    prefixes = ("un", "im", "in", "ir", "il", "non", "dis")
    for token in doc:
        if len(token.text) > 3 and token.text.lower().startswith(prefixes):
            yield token.i, token.i + 1, "ADJ"


# If the previous word is a form of "be" and the current word does not end with "ing" and was not labeled as DET, then it's an adjective
def adj_detector_synt1(doc):
    weak_labels = ["O"] * len(doc)
    for span in doc.spans["determiners"]:
        weak_labels[span.start] = span.label_

    for token in doc[1:]:
        if not token.is_punct:
            prev = doc[token.i - 1].text.lower()
            if prev in ["be", "been", "being", "am", "is", "are", "was", "were"] and (
                    not token.text.endswith("ing")) and weak_labels[token.i] == "O":
                yield token.i, token.i + 1, "ADJ"


# If the previous word is labeled as DET or NUM, then the current word is an adjective
def adj_detector_synt2(doc):
    weak_labels = ["O"] * len(doc)

    for span in doc.spans["determiners"]:
        weak_labels[span.start] = span.label_

    for span in doc.spans["numerals1"]:
        weak_labels[span.start] = span.label_

    for span in doc.spans["numerals2"]:
        weak_labels[span.start] = span.label_

    for token in doc[1:]:
        if not token.is_punct:
            if weak_labels[token.i - 1] != "O":
                yield token.i, token.i + 1, "ADJ"


adj_lf1 = skweak.heuristics.FunctionAnnotator("adjectives1", adj_detector_suffixes)
adj_lf2 = skweak.heuristics.FunctionAnnotator("adjectives2", adj_detector_prefixes)
adj_lf3 = skweak.heuristics.FunctionAnnotator("adjectives3", adj_detector_synt1)
adj_lf4 = skweak.heuristics.FunctionAnnotator("adjectives4", adj_detector_synt2)
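
The affix LFs rely on `str.endswith` and `str.startswith` accepting a tuple of candidates. These checks are case-sensitive, and the suffix LF above tests the raw token text while the prefix LF lowercases it first. The following sketch reproduces the suffix test on plain strings to show the effect. The helper `has_adj_suffix` and its `lowercase` flag are hypothetical and only serve the illustration.

```python
suffixes = ("able", "al", "ful", "ic", "ive", "less", "ous", "y", "ish", "ible", "ent", "est")


def has_adj_suffix(word, lowercase=False):
    # Mirror the length filter and suffix test of the suffix LF on a plain string.
    text = word.lower() if lowercase else word
    return len(text) > 3 and text.endswith(suffixes)


print(has_adj_suffix("useful"))                  # True
print(has_adj_suffix("USEFUL"))                  # False: the suffixes are lowercase
print(has_adj_suffix("USEFUL", lowercase=True))  # True
print(has_adj_suffix("city"))                    # True: a noun ending in "y"
print(has_adj_suffix("shy"))                     # False: shorter than four characters
```

The "city" case shows why such rules are noisy: short suffixes like "y" match many nouns as well.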


#### NOUN LF

Let's create a labeling function that looks for common noun suffixes. Can you think of some?

In [18]:
# ***********************************
def noun_detector_suffixes(doc):
    suffixes = ("ment", "tion", "sion", "xion", "ant", "ent", "ee", "er", "or",
                "ism", "ist", "ness", "ship", "ity", "ance", "ence",
                "ar", "y", "acy", "age")
    for token in doc:
        if len(token.text) > 3 and token.text.lower().endswith(suffixes):
            yield token.i, token.i + 1, "NOUN"

# ***********************************

In [19]:
# Look for common prefixes
def noun_detector_prefixes(doc):
    prefixes = (
        "anti", "auto", "bi", "co", "counter", "dis", "ex", "hyper", "in", "inter", "kilo", "mal", "mega", "mis",
        "mini", "mono", "neo", "out", "poly", "pseudo", "re", "semi", "sub", "super", "sur", "tele", "tri", "ultra",
        "under", "vice")
    for token in doc:
        if len(token.text) > 3 and token.text.lower().startswith(prefixes):
            yield token.i, token.i + 1, "NOUN"


# If the previous word is labeled as DET, NUM or ADJ, then the current word is a noun
def noun_detector_synt(doc):
    weak_labels = ["O"] * len(doc)

    for span in doc.spans["determiners"]:
        weak_labels[span.start] = span.label_

    for span in doc.spans["numerals1"]:
        weak_labels[span.start] = span.label_

    for span in doc.spans["numerals2"]:
        weak_labels[span.start] = span.label_

    for span in doc.spans["adjectives1"]:
        weak_labels[span.start] = span.label_

    for span in doc.spans["adjectives2"]:
        weak_labels[span.start] = span.label_

    for span in doc.spans["adjectives3"]:
        weak_labels[span.start] = span.label_

    for span in doc.spans["adjectives4"]:
        weak_labels[span.start] = span.label_

    for token in doc[1:]:
        if not token.is_punct:
            if weak_labels[token.i - 1] != "O":
                yield token.i, token.i + 1, "NOUN"


noun_lf1 = skweak.heuristics.FunctionAnnotator("nouns1", noun_detector_suffixes)
noun_lf2 = skweak.heuristics.FunctionAnnotator("nouns2", noun_detector_prefixes)
noun_lf3 = skweak.heuristics.FunctionAnnotator("nouns3", noun_detector_synt)


## Apply LFs

In [20]:
# Put all LFs in a list
lfs = [
    lexicon_lf, manual_pos_lf, det_lf, 
    num_lf1, num_lf2, propn_lf,
    adj_lf1, adj_lf2, adj_lf3, adj_lf4,
    noun_lf1, noun_lf2, noun_lf3
]

train_docs = tag_all(train_docs, lfs)

In [21]:
# Print some of the assigned weak labels
for doc in train_docs[:3]:
    skweak.utils.display_entities(doc, ["determiners", "nouns1"])

In [22]:
# Train HMM
hmm = skweak.aggregation.HMM("hmm", all_labels)
hmm = hmm.fit(train_docs)

Starting iteration 1
Number of processed documents: 1000
Number of processed documents: 2000
Number of processed documents: 3000
Number of processed documents: 4000
Number of processed documents: 5000
Number of processed documents: 6000
Number of processed documents: 7000
Number of processed documents: 8000
Number of processed documents: 9000
Number of processed documents: 10000
Number of processed documents: 11000
Number of processed documents: 12000
Finished E-step with 12543 documents
Starting iteration 2


         1     -518827.3307             +nan


Number of processed documents: 1000
Number of processed documents: 2000
Number of processed documents: 3000
Number of processed documents: 4000
Number of processed documents: 5000
Number of processed documents: 6000
Number of processed documents: 7000
Number of processed documents: 8000
Number of processed documents: 9000
Number of processed documents: 10000
Number of processed documents: 11000
Number of processed documents: 12000
Finished E-step with 12543 documents
Starting iteration 3


         2     -490363.7763      +28463.5544


Number of processed documents: 1000
Number of processed documents: 2000
Number of processed documents: 3000
Number of processed documents: 4000
Number of processed documents: 5000
Number of processed documents: 6000
Number of processed documents: 7000
Number of processed documents: 8000
Number of processed documents: 9000
Number of processed documents: 10000
Number of processed documents: 11000
Number of processed documents: 12000
Finished E-step with 12543 documents
Starting iteration 4


         3     -477294.1340      +13069.6423


Number of processed documents: 1000
Number of processed documents: 2000
Number of processed documents: 3000
Number of processed documents: 4000
Number of processed documents: 5000
Number of processed documents: 6000
Number of processed documents: 7000
Number of processed documents: 8000
Number of processed documents: 9000
Number of processed documents: 10000
Number of processed documents: 11000
Number of processed documents: 12000
Finished E-step with 12543 documents
Starting iteration 5


         4     -470431.7984       +6862.3357


Number of processed documents: 1000
Number of processed documents: 2000
Number of processed documents: 3000
Number of processed documents: 4000
Number of processed documents: 5000
Number of processed documents: 6000
Number of processed documents: 7000
Number of processed documents: 8000
Number of processed documents: 9000
Number of processed documents: 10000
Number of processed documents: 11000
Number of processed documents: 12000
Finished E-step with 12543 documents
Starting iteration 6


         5     -466066.8149       +4364.9835


Number of processed documents: 1000
Number of processed documents: 2000
Number of processed documents: 3000
Number of processed documents: 4000
Number of processed documents: 5000
Number of processed documents: 6000
Number of processed documents: 7000
Number of processed documents: 8000
Number of processed documents: 9000
Number of processed documents: 10000
Number of processed documents: 11000
Number of processed documents: 12000
Finished E-step with 12543 documents
Starting iteration 7


         6     -462378.4838       +3688.3311


Number of processed documents: 1000
Number of processed documents: 2000
Number of processed documents: 3000
Number of processed documents: 4000
Number of processed documents: 5000
Number of processed documents: 6000
Number of processed documents: 7000
Number of processed documents: 8000
Number of processed documents: 9000
Number of processed documents: 10000
Number of processed documents: 11000
Number of processed documents: 12000
Finished E-step with 12543 documents
Starting iteration 8


         7     -457974.1574       +4404.3264


Number of processed documents: 1000
Number of processed documents: 2000
Number of processed documents: 3000
Number of processed documents: 4000
Number of processed documents: 5000
Number of processed documents: 6000
Number of processed documents: 7000
Number of processed documents: 8000
Number of processed documents: 9000
Number of processed documents: 10000
Number of processed documents: 11000
Number of processed documents: 12000
Finished E-step with 12543 documents


         8     -451644.7337       +6329.4238


In [23]:
# Majority voting
mv = skweak.aggregation.MajorityVoter("mv", all_labels)
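
As a rough intuition for what majority voting does, the sketch below picks the most frequent label per token from the votes of several labeling functions. It is an unweighted simplification written for this tutorial. `skweak.aggregation.MajorityVoter` may weigh or filter votes differently, and the function name `majority_vote` is hypothetical.

```python
from collections import Counter


def majority_vote(votes_per_token):
    # votes_per_token: one list of labels per token, one vote per labeling function.
    result = []
    for votes in votes_per_token:
        if not votes:
            result.append("O")  # no labeling function fired on this token
            continue
        # Take the label with the highest count; ties go to the label seen first.
        result.append(Counter(votes).most_common(1)[0][0])
    return result


votes = [["DET"], ["ADJ", "NOUN", "NOUN"], [], ["NUM", "NUM", "ADJ"]]
print(majority_vote(votes))  # ['DET', 'NOUN', 'O', 'NUM']
```

Note that the voter can only return a label that at least one labeling function proposed.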

In [24]:
# Apply LFs, HMM and MV to the test docs
test_docs = load_data_split("test", all_labels)
test_docs = tag_all(test_docs, lfs + [mv, hmm])

## Evaluate

#### Which POS tags are easier to detect?

* We see that POS tags like determiners and numerals are easier to detect and we can achieve a good F1 score with just one or two simple LFs.


In [25]:
df = evaluate(test_docs, all_labels, [
    "determiners", "numerals1", "numerals2", "proper_nouns"
])

In [26]:
df.loc[["DET", "NUM", "PROPN"]]

| label | proportion | model | tok_precision | tok_recall | tok_f1 |
|---|---|---|---|---|---|
| DET | 18.4 % | determiners | 0.678 | 0.998 | 0.808 |
| DET | 18.4 % | numerals1 | 0.0 | 0.0 | 0.0 |
| DET | 18.4 % | numerals2 | 0.0 | 0.0 | 0.0 |
| DET | 18.4 % | proper_nouns | 0.0 | 0.0 | 0.0 |
| NUM | 5.2 % | determiners | 0.0 | 0.0 | 0.0 |
| NUM | 5.2 % | numerals1 | 0.813 | 0.802 | 0.808 |
| NUM | 5.2 % | numerals2 | 0.786 | 0.144 | 0.244 |
| NUM | 5.2 % | proper_nouns | 0.0 | 0.0 | 0.0 |
| PROPN | 20.1 % | determiners | 0.0 | 0.0 | 0.0 |
| PROPN | 20.1 % | numerals1 | 0.0 | 0.0 | 0.0 |
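
The `tok_f1` column is consistent with the usual definition of F1 as the harmonic mean of precision and recall. The sketch below recomputes it for the `determiners` LF from the rounded values in the table above, assuming that `evaluate` uses this standard formula. The small gap to the reported 0.808 comes from rounding the inputs.

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall; defined as 0 when both are 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall of the "determiners" LF on the DET tag.
det_f1 = f1_score(0.678, 0.998)
print(round(det_f1, 3))

# An LF that never predicts a tag gets 0 for that tag.
print(f1_score(0.0, 0.0))
```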


* Other POS tags like adjectives and nouns, which rely more on context, are harder to detect and require more complicated rules.

In [27]:
df = evaluate(test_docs, all_labels, [
    "adjectives1", "adjectives2", "adjectives3", "adjectives4",
    "nouns1", "nouns2", "nouns3"
])

In [28]:
df.loc[["ADJ", "NOUN"]]

| label | proportion | model | tok_precision | tok_recall | tok_f1 |
|---|---|---|---|---|---|
| ADJ | 16.4 % | adjectives1 | 0.336 | 0.339 | 0.338 |
| ADJ | 16.4 % | adjectives2 | 0.254 | 0.049 | 0.082 |
| ADJ | 16.4 % | adjectives3 | 0.297 | 0.091 | 0.14 |
| ADJ | 16.4 % | adjectives4 | 0.176 | 0.319 | 0.226 |
| ADJ | 16.4 % | nouns1 | 0.0 | 0.0 | 0.0 |
| ADJ | 16.4 % | nouns2 | 0.0 | 0.0 | 0.0 |
| ADJ | 16.4 % | nouns3 | 0.0 | 0.0 | 0.0 |
| NOUN | 40.0 % | adjectives1 | 0.0 | 0.0 | 0.0 |
| NOUN | 40.0 % | adjectives2 | 0.0 | 0.0 | 0.0 |
| NOUN | 40.0 % | adjectives3 | 0.0 | 0.0 | 0.0 |


#### Which type of LF works the best?

* For adjectives, the LF that uses suffixes works best, while the syntactic rules are less accurate. In contrast, for nouns the LF based on syntactic analysis gives the best results. For both POS tags, the LFs that use prefixes do not yield good results.

#### Which aggregator works best?

* Despite its simplicity, majority voting outperforms HMM on almost all of the POS tags and overall achieves a higher macro F1 score.

In [29]:
df = evaluate(test_docs, all_labels, ["mv", "hmm"])

In [30]:
df

| label | proportion | model | tok_precision | tok_recall | tok_f1 |
|---|---|---|---|---|---|
| ADJ | 16.4 % | hmm | 0.189 | 0.335 | 0.242 |
| ADJ | 16.4 % | mv | 0.425 | 0.455 | 0.44 |
| DET | 18.4 % | hmm | 0.688 | 0.935 | 0.792 |
| DET | 18.4 % | mv | 0.71 | 0.835 | 0.768 |
| NOUN | 40.0 % | hmm | 0.333 | 0.195 | 0.246 |
| NOUN | 40.0 % | mv | 0.387 | 0.44 | 0.412 |
| NUM | 5.2 % | hmm | 0.622 | 0.157 | 0.25 |
| NUM | 5.2 % | mv | 0.866 | 0.724 | 0.788 |
| PROPN | 20.1 % | hmm | 0.61 | 0.258 | 0.362 |
| PROPN | 20.1 % | mv | 0.579 | 0.462 | 0.514 |
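
To check the macro F1 comparison, the sketch below averages the per-tag F1 scores from the table above. It assumes that macro F1 means the unweighted mean over the five tags. The values are copied from this run and will differ if the notebook is re-executed.

```python
# Per-tag token-level F1 scores copied from the table above.
f1 = {
    "mv": {"ADJ": 0.44, "DET": 0.768, "NOUN": 0.412, "NUM": 0.788, "PROPN": 0.514},
    "hmm": {"ADJ": 0.242, "DET": 0.792, "NOUN": 0.246, "NUM": 0.25, "PROPN": 0.362},
}

# Macro F1: unweighted mean of the per-tag scores.
macro_f1 = {model: sum(scores.values()) / len(scores) for model, scores in f1.items()}

for model, score in macro_f1.items():
    print(f"{model}: {score:.3f}")
```

Under this assumption, majority voting reaches roughly 0.58 and the HMM roughly 0.38, with DET being the only tag where the HMM is ahead.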


## Using Libraries as Labeling Functions


In this part, we use popular NLP libraries to create labeling functions: spaCy, NLTK, and TextBlob.
We use the majority voter and the HMM as aggregation functions.
Optionally, you can train your own model on the data.

Learning goals:
- Understand how to use external libraries as labeling functions
- Understand the spaCy object and how to use it for annotation

First, read and understand the two functions below.

In [41]:

# Sometimes data formats (here, POS tag sets) differ, so we convert the library output to the format we need.
# Some information is inevitably lost in the conversion.
def nltk_tagger(doc):
    for token in doc:
        if not token.is_punct:
            # Tag the token with NLTK, which returns Penn Treebank tags
            nltk_pos = nltk.pos_tag([token.text])[0][1]
            # Map the NLTK POS tags to ours
            if nltk_pos == "DT":
                yield token.i, token.i + 1, "DET"
            elif nltk_pos == "CD":
                yield token.i, token.i + 1, "NUM"
            elif nltk_pos in ("NNP", "NNPS"):
                yield token.i, token.i + 1, "PROPN"
            elif nltk_pos in ("JJ", "JJR", "JJS"):
                yield token.i, token.i + 1, "ADJ"
            elif nltk_pos in ("NN", "NNS"):
                yield token.i, token.i + 1, "NOUN"
            elif nltk_pos in ("VB", "VBD", "VBG", "VBN", "VBP", "VBZ"):
                yield token.i, token.i + 1, "VERB"


# We can also use the TextBlob library to get POS tags.
# Under the hood, it uses the Pattern library. Once again, the tag labels need to be converted.
def textblob_tagger(doc):
    for token in doc:
        if not token.is_punct:
            textblob_pos = TextBlob(token.text, pos_tagger=PatternTagger()).tags
            if len(textblob_pos) > 0:
                yield token.i, token.i + 1, penntreebank2universal(textblob_pos[0][1])
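
Both taggers above boil down to a lookup from Penn Treebank tags to our universal tags. The sketch below writes that lookup as a dictionary. It is a hypothetical subset for illustration and not the implementation of `penntreebank2universal` from `scripts.utils`, which is not shown here.

```python
# Hypothetical subset of a Penn Treebank -> universal tag mapping.
PTB_TO_UNIVERSAL = {
    "DT": "DET",
    "CD": "NUM",
    "NNP": "PROPN", "NNPS": "PROPN",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "NN": "NOUN", "NNS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
}


def to_universal(ptb_tag, default=None):
    # Return the universal tag, or `default` for tags outside the mapping.
    return PTB_TO_UNIVERSAL.get(ptb_tag, default)


print(to_universal("NNS"))  # NOUN
print(to_universal("JJR"))  # ADJ
print(to_universal("RB"))   # None: adverbs are not covered by this subset
```

The mapping is many-to-one, which is where the loss of information comes from: singular and plural nouns, or the different verb forms, collapse into a single tag.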


## Write the spaCy Labeling Functions

Use the two English spaCy models, "en_core_web_sm" and "en_core_web_md", to create labeling functions.
The challenge is that they use different tokens, i.e. the atomic units of a sentence. Our simple tokenization just splits the text on whitespace.
Your task is to design an algorithm that maps the tokens of the simple tokenization to the tokens of the spaCy tokenization and uses the information available there to create labeling functions.

Hints:
1) Access token i by `token=doc[i]` or obtain its position by `i=token.i`
2) Access a token's spaCy POS tag (its ground truth) by `pos=token.pos_`
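
One possible way to think about the mapping is to align the two tokenizations by character offsets. The sketch below does this on plain strings, with a hand-written spaCy-style split, so it runs without loading a model. The helpers `char_spans` and `align` are hypothetical and this is not the reference solution of the exercise.

```python
def char_spans(text, tokens):
    # Locate each token in the text and return its (start, end) character offsets.
    spans, cursor = [], 0
    for tok in tokens:
        start = text.index(tok, cursor)
        cursor = start + len(tok)
        spans.append((start, cursor))
    return spans


def align(text, coarse_tokens, fine_tokens):
    # For each coarse token, collect the indices of the fine tokens it overlaps.
    coarse = char_spans(text, coarse_tokens)
    fine = char_spans(text, fine_tokens)
    return [
        [j for j, (fs, fe) in enumerate(fine) if fs < ce and fe > cs]
        for cs, ce in coarse
    ]


text = "I don't like Mondays."
coarse_tokens = text.split()
fine_tokens = ["I", "do", "n't", "like", "Mondays", "."]  # written by hand

print(align(text, coarse_tokens, fine_tokens))  # [[0], [1, 2], [3], [4, 5]]
```

With real spaCy documents, the start offset of a token is available as `token.idx`, so the search in `char_spans` would not be needed. When a whitespace token maps to several spaCy tokens, one still has to decide which tag to keep, for example the tag of the first overlapping token.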

In [42]:
eng_nlp_sm = spacy.load("en_core_web_sm")
eng_nlp_md = spacy.load("en_core_web_md")

# ***********************************

def eng_spacy_tagger_sm(doc):
    # Re-tokenize and tag the raw text with the small spaCy model
    other_doc = eng_nlp_sm(doc.text)
    for token in doc:
        for other_token in other_doc:
            # Skip spaCy tokens whose remaining text does not occur in the rest of our document
            if other_doc[other_token.i:].text not in doc[token.i:].text:
                continue
            # Label our token with the POS tag of the first matching spaCy token
            if token.text in other_token.text:
                yield token.i, token.i + 1, other_token.pos_.split("-")[-1]
                break


def eng_spacy_tagger_md(doc):
    # Same alignment as above, using the medium spaCy model
    other_doc = eng_nlp_md(doc.text)
    for token in doc:
        for other_token in other_doc:
            if other_doc[other_token.i:].text not in doc[token.i:].text:
                continue
            if token.text in other_token.text:
                yield token.i, token.i + 1, other_token.pos_.split("-")[-1]
                break

# ***********************************

In [43]:

nltk_lf = skweak.heuristics.FunctionAnnotator("nltk", nltk_tagger)
textblob_lf = skweak.heuristics.FunctionAnnotator("textblob", textblob_tagger)
eng_spacy_sm_lf = skweak.heuristics.FunctionAnnotator("eng_spacy_sm", eng_spacy_tagger_sm)
eng_spacy_md_lf = skweak.heuristics.FunctionAnnotator("eng_spacy_md", eng_spacy_tagger_md)

### Load Data and apply Labeling functions

We compute the recall and the number of conflicts after applying the labeling functions, and recompute the conflicts after applying the aggregation functions. To save time, we use only a subset of the data this time.

In [44]:

# define the labeling functions and load the training data
lfs = [nltk_lf, eng_spacy_sm_lf, textblob_lf, eng_spacy_md_lf]
all_labels = ["DET", "NUM", "PROPN", "NOUN", "ADJ"]

# small amount of data for the sake of time
train_docs = load_data_split("train", all_labels, 3000)

# tag the training documents
train_docs = tag_all(train_docs, lfs)



In [45]:
recall = compute_recall(train_docs)
num_conflicts = compute_num_conflicts(train_docs)
print("Train recall", recall)
print("Train conflicts", num_conflicts)

Train recall 1.0
Train conflicts 0.4051


We observe that the recall is very high, because the libraries we use perform well.
Further, we observe that the labeling functions conflict on 40.5% of the tokens.
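
The helpers `compute_recall` and `compute_num_conflicts` come from `scripts.utils`, and their implementation is not shown here. The sketch below illustrates one plausible reading of the two numbers: the share of tokens that receive at least one label, and the share of tokens on which the labeling functions disagree. Both definitions, the function name and the toy votes are assumptions.

```python
def coverage_and_conflicts(votes_per_token):
    # votes_per_token: one list of labels per token, one entry per LF that fired.
    n = len(votes_per_token)
    covered = sum(1 for votes in votes_per_token if votes)
    conflicting = sum(1 for votes in votes_per_token if len(set(votes)) > 1)
    return covered / n, conflicting / n


votes = [
    ["DET", "DET", "DET", "DET"],        # all four LFs agree
    ["NOUN", "NOUN", "ADJ", "NOUN"],     # one LF disagrees: conflict
    ["NUM", "NUM"],                      # only two LFs fired, but they agree
    ["PROPN", "NOUN", "NOUN", "PROPN"],  # split vote: conflict
]
coverage, conflict_rate = coverage_and_conflicts(votes)
print(coverage, conflict_rate)  # 1.0 0.5
```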

In [46]:
# train the HMM
hmm = skweak.aggregation.HMM("hmm", all_labels)
hmm = hmm.fit(train_docs)

Starting iteration 1
Number of processed documents: 1000
Number of processed documents: 2000
Finished E-step with 3000 documents
Starting iteration 2


         1     -105487.6510             +nan


Number of processed documents: 1000
Number of processed documents: 2000
Finished E-step with 3000 documents
Starting iteration 3


         2     -100521.0518       +4966.5992


Number of processed documents: 1000
Number of processed documents: 2000
Finished E-step with 3000 documents
Starting iteration 4


         3      -99870.4155        +650.6362


Number of processed documents: 1000
Number of processed documents: 2000
Finished E-step with 3000 documents
Starting iteration 5


         4      -99584.3244        +286.0912


Number of processed documents: 1000
Number of processed documents: 2000
Finished E-step with 3000 documents


         5      -99350.4815        +233.8429


Now we compare how majority voting and the HMM change the number of conflicts.
Remember that it is important to apply the majority voter before the HMM; otherwise, the majority voter takes the HMM predictions into account.

In [47]:
mv = skweak.aggregation.MajorityVoter("mv", all_labels)
train_docs = tag_all(train_docs, [mv, hmm])

num_conflicts = compute_num_conflicts(train_docs)
print("Conflicts with MV on train set: ", num_conflicts)

Conflicts with MV on train set:  0.4051


We observe that the number of token conflicts does not change. The reason is that neither method can choose a class that none of the labeling functions proposed.

## Evaluation

Look at the Precision, Recall and F1-Score of the different aggregation functions. What do you observe?

In [48]:
# tag the test documents
# Once again, apply the majority voter before the HMM; otherwise, the majority voter takes the HMM predictions into account
test_docs = load_data_split("test", all_labels, 1000)
test_docs = tag_all(test_docs, lfs + [mv, hmm])

num_conflicts = compute_num_conflicts(test_docs)
print("Conflicts on test set", num_conflicts)

Conflicts on test set 0.4215


In [49]:
df = evaluate(test_docs, all_labels, ["mv", "hmm"])

In [50]:
df

| label | proportion | model | tok_precision | tok_recall | tok_f1 |
|---|---|---|---|---|---|
| ADJ | 12.8 % | hmm | 0.912 | 0.866 | 0.888 |
| ADJ | 12.8 % | mv | 0.893 | 0.852 | 0.872 |
| DET | 17.3 % | hmm | 0.95 | 0.996 | 0.972 |
| DET | 17.3 % | mv | 0.9 | 0.999 | 0.946 |
| NOUN | 37.4 % | hmm | 0.936 | 0.859 | 0.896 |
| NOUN | 37.4 % | mv | 0.671 | 0.879 | 0.762 |
| NUM | 7.3 % | hmm | 0.964 | 0.927 | 0.946 |
| NUM | 7.3 % | mv | 0.969 | 0.848 | 0.904 |
| PROPN | 25.2 % | hmm | 0.806 | 0.95 | 0.872 |
| PROPN | 25.2 % | mv | 0.808 | 0.947 | 0.872 |


In contrast to the first part, the HMM now matches or outperforms majority voting on every POS tag in terms of F1.