# Part 3: Weakly supervised part-of-speech tagging

In this part, we work on a different type of task: sequence labeling. Instead of assigning one label to an entire text, sequence labeling assigns a label to each token in the text.
Specifically, we chose part-of-speech (POS) tagging: the task of assigning each word a POS tag that indicates its grammatical category, based on its definition and context.

We will create labeling functions to assign POS tags based on syntactic analysis and grammatical rules.
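
To make the idea concrete, here is a minimal sketch of what a labeling function produces. It works on a plain list of strings rather than a spaCy `Doc`, and the names `toy_determiner_lf` and `spans_to_tags` are hypothetical helpers introduced only for illustration. The actual labeling functions are defined later in this part.

```python
from typing import Iterable, Iterator, List, Tuple

Span = Tuple[int, int, str]  # (start index, end index, label)


def toy_determiner_lf(tokens: List[str]) -> Iterator[Span]:
    """Yield a one-token span for every token found in a tiny determiner list."""
    determiners = {"the", "a", "an", "this", "that"}  # illustrative subset only
    for i, tok in enumerate(tokens):
        if tok.lower() in determiners:
            yield i, i + 1, "DET"


def spans_to_tags(n_tokens: int, spans: Iterable[Span]) -> List[str]:
    """Turn spans into one tag per token; unlabeled tokens get "O"."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        for i in range(start, end):
            tags[i] = label
    return tags


tokens = ["The", "preacher", "at", "the", "mosque"]
tags = spans_to_tags(len(tokens), toy_determiner_lf(tokens))
print(list(zip(tokens, tags)))
```

A labeling function may abstain on most tokens, which is why the default tag is "O".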


In [1]:
# Imports
%load_ext autoreload
%autoreload 2

import re
import os
import sys
import nltk
import spacy
import joblib
import skweak
import numpy as np
import pandas as pd

from spacy.tokens import Span
from spacy.tokens import DocBin
from collections import Counter
from spacy.training import Corpus
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from utils.skweak_ner_eval_utils import evaluate
# from skweak import heuristics, gazetteers, aggregation, utils

sys.path.append('../')

In [2]:
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/vasiliki/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/vasiliki/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
!python --version

Python 3.8.17


In [None]:
# Weakly Supervised Named Entity Tagging with Learnable Logical Rules:
# https://aclanthology.org/2021.acl-long.352.pdf
# CoNLL-U format: https://universaldependencies.org/format.html

# Get data from https://github.com/explosion/projects/tree/v3/benchmarks/ud_benchmark
# by using assets command, or downloading https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3105/ud-treebanks-v2.5.tgz

# For each split (the example below uses train), run the following commands with vars.ud_treebank set to the treebank you want to use, e.g. "UD_English-EWT"
# python scripts/copy_files.py train conllu assets/ud-treebanks-v2.5/${vars.ud_treebank}/ corpus/${vars.ud_treebank}/train/
# python -m spacy convert corpus/${vars.ud_treebank}/train/ corpus/${vars.ud_treebank}/ --converter conllu -n 1 -T -C



## Load data

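We use the English Web Treebank (`UD_English-EWT`) from the Universal Dependencies v2.5 release, converted to spaCy's binary format as described above. It contains English web text such as weblogs, newsgroups, emails, reviews and question-answer posts. In this part we load only the training split, which contains 12,543 documents.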

In [4]:
# conll u -> skweak -> wrench

# Dataset folder
part3_path = "part_3_pos_tags"


# Path to the dataset file
data_path = os.path.join(part3_path, "corpus", "UD_English-EWT")

# Create a blank spacy pipeline
nlp = spacy.blank("xx")
reader = Corpus(os.path.join(data_path, "train.spacy"))
train_data = list(reader(nlp))

In [6]:
# Get the doc objects
docs = [doc.reference.copy() for doc in train_data]
print("There are", len(docs), "documents in the training set")

There are 12543 documents in the training set


## Part-of-speech (POS) tagging

The goal is to assign a POS tag to each token.

For this tutorial, we will use the following subset of the [universal POS tags](https://universaldependencies.org/u/pos/index.html):
1. **DET**: determiner, which is a word that modifies nouns or noun phrases and expresses the reference of the noun phrase in context.
2. **NUM**: numeral, which is a word that expresses a number and a relation to the number, such as quantity, sequence, frequency or fraction.
3. **PROPN**: proper noun, which is a noun that names a specific individual, place, or object.
4. **ADJ**: adjective, which is a word that typically modifies nouns and specifies their properties or attributes.
5. **NOUN**: noun, which is a part of speech typically denoting a person, place, thing, animal or idea.
6. **VERB**: verb. Verbs typically signal events and actions, can constitute a minimal predicate in a clause, and govern the number and types of other constituents which may occur in the clause.

In [7]:
all_labels = ["DET", "NUM", "PROPN", "ADJ", "VERB", "NOUN"]

In [8]:
# Set the gold labels, keeping only the POS tags in our subset (all other tokens become "O")
for doc in docs:
    print([s.text for s in doc.sents])
    ents = []
    tok_pos = []
    for tok in doc:
        if tok.pos_ in all_labels:
            # print(tok.pos_)
            tok_pos.append(tok.pos_)
            ents.append(Span(doc, tok.i, tok.i + 1, tok.pos_))
        else:
            tok_pos.append("O")
    doc.set_ents(ents)
    print(tok_pos)

['Al-Zaman : American forces killed Shaikh Abdullah al-Ani, the preacher at the mosque in the town of Qaim, near the Syrian border.']
['PROPN', 'O', 'PROPN', 'O', 'ADJ', 'NOUN', 'VERB', 'PROPN', 'PROPN', 'PROPN', 'O', 'PROPN', 'O', 'DET', 'NOUN', 'O', 'DET', 'NOUN', 'O', 'DET', 'NOUN', 'O', 'PROPN', 'O', 'O', 'DET', 'ADJ', 'NOUN', 'O']
['[This killing of a respected cleric will be causing us trouble for years to come.]']
['O', 'DET', 'NOUN', 'O', 'DET', 'ADJ', 'NOUN', 'O', 'O', 'VERB', 'O', 'NOUN', 'O', 'NOUN', 'O', 'VERB', 'O', 'O']
['DPA: Iraqi authorities announced that they had busted up 3 terrorist cells operating in Baghdad.']
['PROPN', 'O', 'ADJ', 'NOUN', 'VERB', 'O', 'O', 'O', 'VERB', 'O', 'NUM', 'ADJ', 'NOUN', 'VERB', 'O', 'PROPN', 'O']
['Two of them were being run by 2 officials of the Ministry of the Interior!']
['NUM', 'O', 'O', 'O', 'O', 'VERB', 'O', 'NUM', 'NOUN', 'O', 'DET', 'PROPN', 'O', 'DET', 'PROPN', 'O']
['The MoI in Iraq is equivalent to the US FBI, so this would b

In [23]:
# Create a subset to test our LFs on a smaller dataset
subset_docs = [doc.copy() for doc in docs[0:500]]

In [None]:
## 3.1 Labeling functions

First, we find the 200 most frequent words in our corpus and use a lexicon to label them. Second, we manually annotate the 50 most frequent words.
Finally, for each POS tag we create the following labeling functions:

*   DET --> Lexicon with determiners.
*   NUM --> The token contains a digit.
*   PROPN --> The word is capitalized (a sentence-initial word must be fully uppercase).
*   ADJ --> 1. Suffixes: "able", "al", "ful", "ic", "ive", "less", "ous", "y", "ish", "ible", "ent", "est", 2. Prefixes: "un", "im", "in", "ir", "il", "non", "dis", 3. Linguistic rule: if the previous word is a form of "be" and the current word does not end in "ing" and is not a determiner, then it is an ADJ, 4. Linguistic rule: if the previous word is a DET or a NUM, then the current one is an ADJ.
*   NOUN --> 1. Suffixes: "ment", "tion", "sion", "xion", "ant", "ent", "ee", "er", "or", "ism", "ist", "ness", "ship", "ity", "ance", "ence", "ar", "y", "acy", "age", 2. Prefixes such as "anti", "auto", "inter", "sub", "super", 3. Linguistic rule: if the previous word is a DET, a NUM or an ADJ, then the current one is a NOUN.
*   VERB --> 1. Suffixes: "ing", "ate", "en", "ed", "ify", "ise", "ize", 2. Prefixes such as "re", "dis", "over", "un", "mis", 3. Linguistic rule: if the previous word is a NOUN, then the current one is a VERB.
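
The two kinds of rules in the list above (affix matching and looking at the label of the previous token) can be sketched on plain token lists as follows. This is an illustrative simplification, not the notebook's implementation: `suffix_rule` and `previous_label_rule` are hypothetical names, the suffix tuple is a shortened subset, and punctuation handling is left out.

```python
from typing import Iterator, List, Tuple

NOUN_SUFFIXES = ("ment", "tion", "ness", "ity")  # shortened, illustrative subset


def suffix_rule(tokens: List[str], suffixes: Tuple[str, ...], label: str,
                min_len: int = 4) -> Iterator[Tuple[int, int, str]]:
    """Label every sufficiently long token that ends with one of the suffixes."""
    for i, tok in enumerate(tokens):
        if len(tok) >= min_len and tok.lower().endswith(suffixes):
            yield i, i + 1, label


def previous_label_rule(previous_tags: List[str],
                        label: str) -> Iterator[Tuple[int, int, str]]:
    """Label token i whenever token i-1 already carries a weak label."""
    for i in range(1, len(previous_tags)):
        if previous_tags[i - 1] != "O":
            yield i, i + 1, label


tokens = ["the", "government", "made", "an", "announcement"]
det_tags = ["DET", "O", "O", "DET", "O"]  # pretend output of a determiner LF

by_suffix = list(suffix_rule(tokens, NOUN_SUFFIXES, "NOUN"))
by_context = list(previous_label_rule(det_tags, "NOUN"))
print(by_suffix)
print(by_context)
```

Both rules are noisy on their own, which is why their outputs are later combined by an aggregation model rather than used directly.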

In [24]:
# Get all the words in the dataset
words = [token.text.lower() for doc in docs for token in doc if not token.is_punct]

# Remove stopwords (build the set once so that each lookup is fast)
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]

# Find the most frequent words
word_freq = Counter(words)
common_words = [w[0] for w in word_freq.most_common(200)]

In [25]:
common_words[:5]

["'s", "n't", 'would', 'one', 'like']

In [26]:
# Load the lexicon
with open("noun_vb_adj_list.txt") as f:
    lines = f.readlines()

In [27]:
# Create a dictionary with the words and their pos tags
lexicon = {}
for l in lines:
    values = l.replace("\n", "").split("\t")
    lexicon[values[0]] = values[1]

In [28]:
len(lexicon)

3387

In [29]:
list(lexicon.items())[:5]

[('people', 'NOUN'),
 ('history', 'NOUN'),
 ('way', 'NOUN'),
 ('art', 'NOUN'),
 ('world', 'NOUN')]

In [30]:
# How many common words match
len(set(common_words) & set(lexicon))

121

In [31]:
# Lexicon LF
def common_word_detector(doc):
    for token in doc:
        text = token.text.lower()
        # Label only frequent words that the lexicon covers
        if text in common_words and text in lexicon:
            yield token.i, token.i+1, lexicon[text]

word_lf = skweak.heuristics.FunctionAnnotator("common_words", common_word_detector)

# for doc in docs:
#     doc = word_lf(doc)
#     skweak.utils.display_entities(doc, "common_words")


In [18]:
# NLTK LF

# def nltk_tagger(doc):
#     for token in doc:
#         if not token.is_punct:
#             # Tag token with nltk
#             nltk_pos = nltk.pos_tag([token.text])[0][1]
#             # Map nltk pos tags to ours
#             if nltk_pos == "DT":
#                 yield token.i, token.i+1, "DET"
#             elif nltk_pos == "CD":
#                 yield token.i, token.i+1, "NUM"
#             elif nltk_pos == "NNP" or nltk_pos == "NNPS":
#                 yield token.i, token.i+1,"PROPN"
#             elif nltk_pos == "JJ" or nltk_pos == "JJR" or nltk_pos == "JJS":
#                 yield token.i, token.i+1, "ADJ"
#             elif nltk_pos == "NN" or nltk_pos == "NNS":
#                 yield token.i, token.i+1, "NOUN"
#             elif nltk_pos == "VB" or nltk_pos == "VBD" or nltk_pos == "VBG" or nltk_pos == "VBN" or nltk_pos == "VBP" or nltk_pos == "VBZ":
#                 yield token.i, token.i+1, "VERB"
            

# nltk_lf = skweak.heuristics.FunctionAnnotator("nltk_tags", nltk_tagger)


# for doc in docs:
#     doc = nltk_lf(doc)
#     skweak.utils.display_entities(doc, "nltk_tags")


In [33]:
# Manual annotation
top50_words = [w[0] for w in word_freq.most_common(50)]
print(top50_words)

["'s", "n't", 'would', 'one', 'like', 'time', 'get', 'know', 'also', 'us', 'good', 'could', 'new', 'go', 'please', '$', 'people', 'may', 'back', 'said', 'even', 'work', 'bush', 'well', 'want', 'great', 'way', 'see', 'best', 'place', 'take', "'m", 'going', 'service', 'need', 'thanks', 'make', 'many', 'year', 'number', 'day', 'two', 'think', 'much', 'food', 'let', 'first', 'call', '2', 'help']


In [34]:
manual_tags = {
    "one": "NUM",
    "like": "VERB",
    "time": "NOUN",
    "get": "VERB",
    "know": "VERB",
    "good": "ADJ",
    "could": "VERB",
    "new": "ADJ",
    "go": "VERB",
    "please": "VERB",
    "people": "NOUN",
    "said": "VERB",
    "work": "VERB",
    "bush": "NOUN",
    "want": "VERB",
    "great": "ADJ",
    "way": "NOUN",
    "see": "VERB",
    "best": "ADJ",
    "place": "NOUN",
    "take": "VERB",
    "going": "VERB",
    "service": "NOUN",
    "need": "VERB",
    "make": "VERB",
    "year": "NOUN",
    "number": "NOUN",
    "day": "NOUN",
    "two": "NUM",
    "think": "VERB",
    "food": "NOUN",
    "let": "VERB",
    "first": "ADJ",
    "call": "VERB",
    "2": "NUM",
    "help": "VERB"
}

In [35]:
# Manual POS tags LF
def manual_pos_tagger(doc):
    for token in doc:
        if token.text.lower() in manual_tags:
            yield token.i, token.i+1, manual_tags[token.text.lower()]

manual_pos_lf = skweak.heuristics.FunctionAnnotator("manual_pos", manual_pos_tagger)

# for doc in docs:
#     doc = manual_pos_lf(doc)
#     skweak.utils.display_entities(doc, "manual_pos")

In [36]:
# DET LF
tries = skweak.gazetteers.extract_json_data("det.json")
det_lf = skweak.gazetteers.GazetteerAnnotator("determiners", tries, case_sensitive=False)

# for doc in docs:
#     doc = det_lf(doc)
#     skweak.utils.display_entities(doc, "determiners")

Extracting data from det.json
Populating trie for class DET (number: 47)


In [23]:
# # Or DET LF without json
# det_list = ["the", "a", "an", "this", "that", "these", "those", "my", "your", "his", "her", "its", "our", "their", "a few", "few",
#             "fewer", "fewest", "a little", "little", "much", "many", "more", "a lot of", "most", "some", "any", "enough", "all",
#             "both", "half", "either",  "neither", "no", "each", "every", "other", "another", "several", "such", "what", "rather",
#             "quite", "least", "less", "which", "whose"]

# tries = skweak.gazetteers.Trie(det_list)
# det_lf = skweak.gazetteers.GazetteerAnnotator("determiners", {"DET":tries}, case_sensitive=False)

# for doc in docs:
#     det_lf(doc)
#     skweak.utils.display_entities(doc, "determiners")

In [37]:
# NUM LF

def num_detector(doc):
    for token in doc:
        # Raw string avoids an invalid escape sequence warning for \d
        if re.search(r"\d+", token.text):
            yield token.i, token.i+1, "NUM"

num_lf = skweak.heuristics.FunctionAnnotator("numerals", num_detector)


# for doc in docs:
#     doc = num_lf(doc)
#     skweak.utils.display_entities(doc, "numerals")

In [38]:
# PROPN LF

def propn_detector(doc):
    for token in doc:
        if token.i == 0:
            if token.text.isupper():
                yield token.i, token.i+1, "PROPN"
        else:
            if token.text.isupper() or token.text[0].isupper():
                yield token.i, token.i+1, "PROPN"

propn_lf = skweak.heuristics.FunctionAnnotator("proper_nouns", propn_detector)

# for doc in docs:
#     doc = propn_lf(doc)
#     skweak.utils.display_entities(doc, "proper_nouns")

In [39]:
# ADJ LF

def adj_detector_suffixes(doc):
    suffixes = ("able", "al", "ful", "ic", "ive", "less", "ous", "y", "ish", "ible", "ent", "est")
    for token in doc:
        if len(token.text)>3 and token.text.strip(".").endswith(suffixes):
            yield token.i, token.i+1, "ADJ"

def adj_detector_prefixes(doc):
    prefixes = ("un", "im", "in", "ir", "il", "non", "dis")
    for token in doc:
        if len(token.text)>3 and token.text.lower().strip(".").startswith(prefixes):
            yield token.i, token.i+1, "ADJ"

def adj_detector(doc):
    weak_labels = ["O"]*len(doc)
    for span in doc.spans["determiners"]:
        weak_labels[span.start] = span.label_
    
    for token in doc[1:]:
        if not token.is_punct:
            prev = doc[token.i-1].text.lower()
            # After a form of "be", a word that is neither an -ing form nor a determiner is tagged ADJ
            be_forms = ("be", "been", "being", "am", "is", "are", "was", "were")
            if prev in be_forms and not token.text.endswith("ing") and weak_labels[token.i] == "O":
                yield token.i, token.i+1, "ADJ"


def adj_detector_ling(doc):
    weak_labels = ["O"]*len(doc)

    for span in doc.spans["determiners"]:
        weak_labels[span.start] = span.label_

    for span in doc.spans["numerals"]:
        weak_labels[span.start] = span.label_

    for token in doc[1:]:
        if not token.is_punct:
            if weak_labels[token.i-1] != "O":
                yield token.i, token.i+1, "ADJ"
    
                    
adj_lf1 = skweak.heuristics.FunctionAnnotator("adjectives1", adj_detector_suffixes)
adj_lf2 = skweak.heuristics.FunctionAnnotator("adjectives2", adj_detector_prefixes)
adj_lf3 = skweak.heuristics.FunctionAnnotator("adjectives3", adj_detector)
adj_lf4 = skweak.heuristics.FunctionAnnotator("adjectives4", adj_detector_ling)

# for doc in docs:
#     doc = adj_lf4(adj_lf3(adj_lf2(adj_lf1(doc))))
#     skweak.utils.display_entities(doc, ["adjectives1", "adjectives2", "adjectives3", "adjectives4"])


In [40]:
# NOUN LF

def noun_detector_suffixes(doc):
    suffixes = ("ment", "tion", "sion", "xion", "ant", "ent", "ee", "er", "or", 
                "ism", "ist", "ness", "ship", "ity", "ance", "ence", 
                "ar", "y", "acy", "age")
    for token in doc:
        if len(token.text)>3 and token.text.lower().strip(".").endswith(suffixes):
            yield token.i, token.i+1, "NOUN"

def noun_detector_prefixes(doc):
    prefixes = ("anti", "auto", "bi", "co", "counter", "dis", "ex", "hyper", "in", "inter", "kilo", "mal", "mega", "mis",
               "mini", "mono", "neo", "out", "poly", "pseudo", "re", "semi", "sub", "super", "sur", "tele", "tri", "ultra", "under", "vice")
    for token in doc:
        if len(token.text)>3 and token.text.lower().strip(".").startswith(prefixes):
            yield token.i, token.i+1, "NOUN"

def noun_detector_ling(doc):
    weak_labels = ["O"]*len(doc)

    for span in doc.spans["determiners"]:
        weak_labels[span.start] = span.label_

    for span in doc.spans["numerals"]:
        weak_labels[span.start] = span.label_

    for span in doc.spans["adjectives1"]:
        weak_labels[span.start] = span.label_

    for span in doc.spans["adjectives2"]:
        weak_labels[span.start] = span.label_

    for span in doc.spans["adjectives3"]:
        weak_labels[span.start] = span.label_
    
    for token in doc[1:]:
        if not token.is_punct:
            if weak_labels[token.i-1] != "O":
                yield token.i, token.i+1, "NOUN"
        
noun_lf1 = skweak.heuristics.FunctionAnnotator("nouns1", noun_detector_suffixes)
noun_lf2 = skweak.heuristics.FunctionAnnotator("nouns2", noun_detector_prefixes)
noun_lf3 = skweak.heuristics.FunctionAnnotator("nouns3", noun_detector_ling)

# for doc in docs:
#     doc = noun_lf3(noun_lf2(noun_lf1(doc)))
#     skweak.utils.display_entities(doc, ["nouns1", "nouns2", "nouns3"])


In [41]:
# VERB LF

def verb_detector_suffixes(doc):
    suffixes = ("ing", "ate", "en", "ed", "ify", "ise", "ize")
    for token in doc:
        if len(token.text)>2 and token.text.lower().strip(".").endswith(suffixes):
            yield token.i, token.i+1, "VERB"

def verb_detector_prefixes(doc):
    prefixes = ("re", "dis", "over", "un", "mis", "out", "be", "co", "de", "fore", "inter", "pre", "sub", "trans", "under")
    for token in doc:
        if len(token.text)>2 and token.text.lower().strip(".").startswith(prefixes):
            yield token.i, token.i+1, "VERB"

def verb_detector_ling(doc):
    weak_labels = ["O"]*len(doc)

    for span in doc.spans["nouns1"]:
        weak_labels[span.start] = span.label_

    for span in doc.spans["nouns2"]:
        weak_labels[span.start] = span.label_

    for span in doc.spans["nouns3"]:
        weak_labels[span.start] = span.label_
    
    for token in doc[1:]:
        if not token.is_punct:
            if weak_labels[token.i-1] != "O":
                yield token.i, token.i+1, "VERB"

verb_lf1 = skweak.heuristics.FunctionAnnotator("verbs1", verb_detector_suffixes)
verb_lf2 = skweak.heuristics.FunctionAnnotator("verbs2", verb_detector_prefixes)
verb_lf3 = skweak.heuristics.FunctionAnnotator("verbs3", verb_detector_ling)


# for doc in docs:
#     doc = verb_lf3(verb_lf2(verb_lf1(doc)))
#     skweak.utils.display_entities(doc, ["verbs3"]) # , "verbs2", "verbs3"])


In [42]:
# Show up to 500 rows when displaying DataFrames (pandas is already imported as pd)
pd.set_option('display.max_rows', 500)

In [44]:
# Put all LFs in a list
lfs = [word_lf, manual_pos_lf, det_lf, propn_lf, num_lf, adj_lf1, adj_lf2, adj_lf3, adj_lf4, 
       noun_lf1, noun_lf2, noun_lf3, verb_lf1, verb_lf2, verb_lf3]

# nlp = spacy.blank("xx")

# hmm = aggregation.HMM("hmm", ["DET"])
# print(doc.spans)
#evaluate(docs, all_labels, ["proper_nouns"])

In [45]:
# Apply LFs to the docs
for doc in docs:
    for lf in lfs:
        doc = lf(doc)

In [46]:
# Use HMM
hmm = skweak.aggregation.HMM("hmm", all_labels)
hmm.fit(docs)

for doc in docs:
    doc = hmm(doc)

Starting iteration 1
Number of processed documents: 1000
Number of processed documents: 2000
Number of processed documents: 3000
Number of processed documents: 4000
Number of processed documents: 5000
Number of processed documents: 6000
Number of processed documents: 7000
Number of processed documents: 8000
Number of processed documents: 9000
Number of processed documents: 10000
Number of processed documents: 11000
Number of processed documents: 12000
Finished E-step with 12543 documents
Starting iteration 2


         1     -722920.7405             +nan


Number of processed documents: 1000
Number of processed documents: 2000
Number of processed documents: 3000
Number of processed documents: 4000
Number of processed documents: 5000
Number of processed documents: 6000
Number of processed documents: 7000
Number of processed documents: 8000
Number of processed documents: 9000
Number of processed documents: 10000
Number of processed documents: 11000
Number of processed documents: 12000
Finished E-step with 12543 documents
Starting iteration 3


         2     -687030.1765      +35890.5640


Number of processed documents: 1000
Number of processed documents: 2000
Number of processed documents: 3000
Number of processed documents: 4000
Number of processed documents: 5000
Number of processed documents: 6000
Number of processed documents: 7000
Number of processed documents: 8000
Number of processed documents: 9000
Number of processed documents: 10000
Number of processed documents: 11000
Number of processed documents: 12000
Finished E-step with 12543 documents
Starting iteration 4


         3     -674776.8510      +12253.3255


Number of processed documents: 1000
Number of processed documents: 2000
Number of processed documents: 3000
Number of processed documents: 4000
Number of processed documents: 5000
Number of processed documents: 6000
Number of processed documents: 7000
Number of processed documents: 8000
Number of processed documents: 9000
Number of processed documents: 10000
Number of processed documents: 11000
Number of processed documents: 12000
Finished E-step with 12543 documents
Starting iteration 5


         4     -665489.1607       +9287.6903


Number of processed documents: 1000
Number of processed documents: 2000
Number of processed documents: 3000
Number of processed documents: 4000
Number of processed documents: 5000
Number of processed documents: 6000
Number of processed documents: 7000
Number of processed documents: 8000
Number of processed documents: 9000
Number of processed documents: 10000
Number of processed documents: 11000
Number of processed documents: 12000
Finished E-step with 12543 documents


         5     -652511.9939      +12977.1668
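
The three columns printed after each iteration appear to be the iteration number, the log-likelihood of the data under the current model, and the improvement over the previous iteration. Assuming that reading, the improvements can be recomputed from the logged log-likelihoods:

```python
# Log-likelihood values copied from the training output above
log_likelihoods = [-722920.7405, -687030.1765, -674776.8510,
                   -665489.1607, -652511.9939]

# Improvement of each iteration over the previous one
gains = [round(after - before, 4)
         for before, after in zip(log_likelihoods, log_likelihoods[1:])]
print(gains)

# Every gain is positive: the log-likelihood increases at each logged iteration
print(all(g > 0 for g in gains))
```

The gains are still large after five iterations, which suggests that the model has not converged yet and that more iterations might change the result.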


In [47]:
# Use majority voting
mv = skweak.aggregation.MajorityVoter("mv", all_labels)

for doc in docs:
    doc = mv(doc)

In [48]:
# Evaluate
evaluate(docs, all_labels, ["common_words", "manual_pos", "determiners", "proper_nouns", "numerals", 
                            "adjectives1", "adjectives2", "adjectives3", "adjectives4", 
                            "nouns1", "nouns2", "nouns3", "verbs1", "verbs2", "verbs3", "hmm", "mv"])

label,proportion,model,tok_precision,tok_recall,tok_f1,tok_cee,tok_acc,coverage,ent_precision,ent_recall,ent_f1
ADJ,12.1 %,adjectives1,0.315,0.352,0.332,,,,0.315,0.352,0.332
ADJ,12.1 %,adjectives2,0.211,0.058,0.09,,,,0.211,0.058,0.09
ADJ,12.1 %,adjectives3,0.252,0.092,0.134,,,,0.252,0.092,0.134
ADJ,12.1 %,adjectives4,0.177,0.367,0.238,,,,0.177,0.367,0.238
ADJ,12.1 %,common_words,0.572,0.165,0.256,,,,0.572,0.165,0.256
ADJ,12.1 %,determiners,0.0,0.0,0.0,,,,0.0,0.0,0.0
ADJ,12.1 %,hmm,0.295,0.211,0.246,,,,0.295,0.211,0.246
ADJ,12.1 %,manual_pos,0.845,0.078,0.142,,,,0.845,0.078,0.142
ADJ,12.1 %,mv,0.463,0.312,0.372,,,,0.463,0.312,0.372
ADJ,12.1 %,nouns1,0.0,0.0,0.0,,,,0.0,0.0,0.0
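
For reference, token-level precision, recall and F1 for a single label can be computed as in the sketch below. The helper `token_prf` and the toy tag sequences are assumptions for illustration; the `evaluate` utility used above may compute its metrics differently. The `tok_f1` column is at least consistent with the harmonic mean of precision and recall: for `adjectives1`, 0.315 and 0.352 give roughly 0.332.

```python
from typing import List, Tuple


def token_prf(gold: List[str], pred: List[str], label: str) -> Tuple[float, float, float]:
    """Token-level precision, recall and F1 for one label."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: one correct ADJ, one spurious ADJ, one missed ADJ
gold = ["DET", "ADJ", "NOUN", "VERB", "ADJ"]
pred = ["DET", "ADJ", "ADJ", "O", "O"]
adj_scores = token_prf(gold, pred, "ADJ")
print(adj_scores)
```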


### Run LFs for the subset we created

#### Compute common words for the lexicon LF

In [49]:
# Get all the words in the dataset
words = [token.text.lower() for doc in subset_docs for token in doc if not token.is_punct]

# Remove stopwords (build the set once so that each lookup is fast)
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]

# Find the most frequent words
word_freq = Counter(words)
common_words = [w[0] for w in word_freq.most_common(200)]

In [50]:
common_words[:5]

["'s", 'bush', 'al', 'india', 'would']

#### Annotate the 50 most common words

In [51]:
# Manual annotation
top50_words = [w[0] for w in word_freq.most_common(50)]
print(top50_words)

["'s", 'bush', 'al', 'india', 'would', 'iraq', 'us', 'iraqi', "n't", 'one', 'many', 'even', 'indian', 'said', 'new', 'war', 'musharraf', 'peace', 'years', 'country', 'military', 'israel', 'two', 'also', 'national', 'time', 'chernobyl', 'pakistan', 'government', 'kashmir', 'sri', 'elections', 'know', 'qaeda', 'may', 'president', 'power', 'last', 'another', 'lanka', 'posada', 'back', 'could', 'state', 'general', 'made', 'much', 'party', 'united', 'people']


In [52]:
manual_tags = {
    "bush": "NOUN",
    "al": "PROPN",
    "india": "PROPN",
    "iraq": "PROPN",
    "iraqi": "ADJ",
    "indian": "ADJ",
    "said": "VERB",
    "new": "ADJ",
    "war": "NOUN",
    "musharraf": "PROPN",
    "peace": "NOUN",
    "years": "NOUN",
    "country": "NOUN",
    "military": "NOUN",
    "israel": "PROPN",
    "two": "NUM",
    "national": "ADJ",
    "time": "NOUN",
    "chernobyl": "PROPN",
    "pakistan": "PROPN",
    "government": "NOUN",
    "kashmir": "PROPN",
    "sri": "PROPN",
    "elections": "NOUN",
    "know": "VERB",
    "qaeda": "PROPN",
    "president": "NOUN",
    "power": "NOUN",
    "last": "NOUN",
    "another": "ADJ",
    "lanka": "PROPN",
    "posada": "PROPN",
    "could": "VERB",
    "general": "ADJ",
    "made": "VERB",
    "party": "NOUN",
    "united": "VERB",
    "people": "NOUN",
}

In [53]:
# Apply LFs to the subset docs
for doc in subset_docs:
    for lf in lfs:
        doc = lf(doc)

In [54]:
# Use HMM
hmm = skweak.aggregation.HMM("hmm", all_labels)
hmm.fit(subset_docs)

for doc in subset_docs:
    doc = hmm(doc)

Starting iteration 1
Finished E-step with 500 documents
Starting iteration 2


         1      -45625.6316             +nan


Finished E-step with 500 documents
Starting iteration 3


         2      -43027.2096       +2598.4220


Finished E-step with 500 documents
Starting iteration 4


         3      -41728.6433       +1298.5664


Finished E-step with 500 documents
Starting iteration 5


         4      -40424.3924       +1304.2509


Finished E-step with 500 documents


         5      -39530.1871        +894.2052


In [55]:
# Use majority voting
mv = skweak.aggregation.MajorityVoter("mv", all_labels)

for doc in subset_docs:
    doc = mv(doc)

In [56]:
# Evaluate
evaluate(subset_docs, all_labels, ["common_words", "manual_pos", "determiners", "proper_nouns", "numerals", 
                            "adjectives1", "adjectives2", "adjectives3", "adjectives4", 
                            "nouns1", "nouns2", "nouns3", "verbs1", "verbs2", "verbs3", "hmm", "mv"])

label,proportion,model,tok_precision,tok_recall,tok_f1,tok_cee,tok_acc,coverage,ent_precision,ent_recall,ent_f1
ADJ,12.6 %,adjectives1,0.33,0.371,0.35,,,,0.33,0.371,0.35
ADJ,12.6 %,adjectives2,0.231,0.104,0.144,,,,0.231,0.104,0.144
ADJ,12.6 %,adjectives3,0.194,0.054,0.084,,,,0.194,0.054,0.084
ADJ,12.6 %,adjectives4,0.219,0.421,0.288,,,,0.219,0.421,0.288
ADJ,12.6 %,common_words,0.492,0.112,0.182,,,,0.492,0.112,0.182
ADJ,12.6 %,determiners,0.0,0.0,0.0,,,,0.0,0.0,0.0
ADJ,12.6 %,hmm,0.222,0.397,0.284,,,,0.222,0.397,0.284
ADJ,12.6 %,manual_pos,0.524,0.067,0.118,,,,0.524,0.067,0.118
ADJ,12.6 %,mv,0.508,0.244,0.33,,,,0.508,0.244,0.33
ADJ,12.6 %,nouns1,0.0,0.0,0.0,,,,0.0,0.0,0.0
