# COLX 581 Lab 4: Active learning for morphological segmentation

In this lab, we will investigate different active learning strategies for the task of supervised morphological segmentation for English. We won't actually annotate any data in this assignment. Instead we will simulate active learning in the following way: We will start with a small annotated training set of 50 segmented words. We will then explore different criteria for incorporating additional training examples from a large annotated training set.

In the first assignment, we will implement a segmentation system using the `pycrfsuite` toolkit. We will also train and evaluate a tagger on a small set of 50 training examples.

In the second assignment, we'll explore the effect of adding 50 additional randomly sampled training examples to the training set. This can be seen as a baseline for our active learning experiment.

In the third assignment, you will implement two active learning strategies belonging to the query-by-uncertainty family.

In the fourth assignment, you will compare the active learning strategy query-by-uncertainty to query-by-committee.

Let's start by reading three segmentation datasets:

* A small training set of 50 segmented word forms,
* A larger training set containing 1350 segmented word forms, and
* A development set which we will use for evaluation

We use a function `read_data` from a module `data_handling` which is provided with the starter code. You are welcome to look at the module!

In [20]:
########################
#### data_handling  ####
########################

BEGIN = "BEGIN"
INSIDE = "INSIDE"
END = "END" 
SINGLE = "SINGLE"
BOUNDARY="<BD>"

def read_data(file):
    data = []
    for line in file:
        line = line.strip()
        if line == "":
            continue
        input, output = line.split("\t")
        data.append((input,output.split(" ")))
    return data

def get_bies_notation(data):
    result = []
    for _, ex in data:
        ex = [[[c,INSIDE] for c in morph] for morph in ex]
        for morph in ex:
            if len(morph) == 1:
                morph[0][1] = SINGLE
            else:
                morph[0][1] = BEGIN
                morph[-1][1] = END
        result.append([c for morph in ex for c in morph])
    return result

def unbies(data):
    result = []
    for ex in data:
        segmented = ['']
        for char, tag in ex:
            segmented[-1] += char
            if tag == "SINGLE" or tag == "END":
                segmented.append('')
        segmented = [seg for seg in segmented if seg != '']
        result.append([''.join(segmented), segmented])
    return result

def char2features(example, i):
    char = example[i][0]
    char_minus_1 = example[i-1][0] if i > 0 else BOUNDARY
    char_plus_1 = example[i+1][0] if i + 1 < len(example) else BOUNDARY 
    char_minus_2 = example[i-2][0] if i - 1 > 0 else BOUNDARY
    char_plus_2 = example[i+2][0] if i + 2 < len(example) else BOUNDARY 
    tag = example[i][1]
    distance_from_start = i
    distance_to_end = len(example) - i
    word_prefix = "".join([c for c,tag in example[:i+1]])
    
    features=["CHAR=%s" % char,
              "CHAR-1=%s" % char_minus_1,
              "CHAR-2=%s" % char_minus_2,
              "CHAR+1=%s" % char_plus_1,
              "CHAR+2=%s" % char_plus_2,
              "STR--=%s" % (char_minus_2 + char_minus_1 + char),
              "STR++=%s" % (char + char_plus_1 + char_plus_2),
              "STR-+=%s" % (char_minus_1 + char + char_plus_1), 
              "DIST_FROM_START=%u" % distance_from_start]
 
    return features

def data2features(data):
    """ Extract features for a data set in BIES format. """
    return [[char2features(example,i) for i in range(len(example))] for example in data]

def data2labels(data):
    """ Extract the tags from a data set in BIES format. """
    return [[tok[1] for tok in example] for example in data]

import numpy as np 

def evaluate(sys_segmented_data,gold_segmented_data):
    def boundaries(segmented):
        token_lengths = [len(tok) for tok in segmented]
        return set(np.cumsum(token_lengths)[:-1])

    tp = 0.0
    fp = 0.0
    fn = 0.0
    for sys_ex, (_, gold_ex) in zip(sys_segmented_data,
                                    gold_segmented_data):
        sys_bound = boundaries(sys_ex)
        gold_bound = boundaries(gold_ex)
        tp += len(sys_bound.intersection(gold_bound))
        fp += len(sys_bound.difference(gold_bound))
        fn += len(gold_bound.difference(sys_bound))
    recall = 100 * tp / (tp + fn)
    precision = 100 * tp / (tp + fp)
    fscore = 2 * recall * precision / (recall + precision)

    return precision, recall, fscore

In [21]:
from os import path
import data_handling

small_traindata = data_handling.read_data(open(path.join("data", "small_traindata")))
devdata = data_handling.read_data(open(path.join("data", "devdata")))
all_traindata = data_handling.read_data(open(path.join("data", "all_traindata")))

print(small_traindata[0])

("acupuncture's", ['acupuncture', "'s"])


Each segmented example is a pair `(word_form, segments)`, where `word_form` is a string and `segments` a list of morphemes as above. 

## Assignment 1: Implementing a segmentation system using `pycrfsuite`

### Assignment 1.1: Training a tagger

rubric={"accuracy":5}

Your first task is to implement a training function `train_tagger` which trains a `pycrfsuite` tagger. Your function should take three parameters:

* `traindata`, a list of training examples `(word_form, segments)`,
* `epochs`, the number of epochs for the training algorithm, and
* `fn` a file name where the trained tagger is stored

Read the following [tutorial](https://github.com/scrapinghub/python-crfsuite/blob/master/examples/CoNLL%202002.ipynb) to see how a `pycrfsuite` tagger is trained. You can set the regularization parameter `c1` to `1.0` and `c2` to `1e-3`. You can set `feature.possible_transitions` to `False`. 

You can't pass `traindata` directly to the `pycrfsuite.Trainer` class. You will first need to convert it into BIES (Begin-Inside-End-Singleton) format and then perform feature extraction. This can be accomplished using three functions from the `data_handling.py` module which is provided with this notebook:

* The first function `data_handling.get_bies_notation` takes a dataset as input and returns the same dataset in BIES format.
* The second function `data_handling.data2features` takes a dataset in BIES format and extracts features for each example (using a few simple feature functions).
* The last function `data_handling.data2labels` takes a dataset in BIES format and returns the BIES labels for each example.

Here is a small demonstration of using these functions:

In [31]:
print("First training example:\t",small_traindata[0])
bies = data_handling.get_bies_notation([small_traindata[0]])
# list of list
print("get_bies_notation:\t",bies)
print("data2features of bies:\t",data_handling.data2features(bies))
print("data2labels of bies:\t",data_handling.data2labels(bies))
print("unbies of bies:\t\t", data_handling.unbies(bies))

First training example:	 ("acupuncture's", ['acupuncture', "'s"])
get_bies_notation:	 [[['a', 'BEGIN'], ['c', 'INSIDE'], ['u', 'INSIDE'], ['p', 'INSIDE'], ['u', 'INSIDE'], ['n', 'INSIDE'], ['c', 'INSIDE'], ['t', 'INSIDE'], ['u', 'INSIDE'], ['r', 'INSIDE'], ['e', 'END'], ["'", 'BEGIN'], ['s', 'END']]]
data2features of bies:	 [[['CHAR=a', 'CHAR-1=<BD>', 'CHAR-2=<BD>', 'CHAR+1=c', 'CHAR+2=u', 'STR--=<BD><BD>a', 'STR++=acu', 'STR-+=<BD>ac', 'DIST_FROM_START=0'], ['CHAR=c', 'CHAR-1=a', 'CHAR-2=<BD>', 'CHAR+1=u', 'CHAR+2=p', 'STR--=<BD>ac', 'STR++=cup', 'STR-+=acu', 'DIST_FROM_START=1'], ['CHAR=u', 'CHAR-1=c', 'CHAR-2=a', 'CHAR+1=p', 'CHAR+2=u', 'STR--=acu', 'STR++=upu', 'STR-+=cup', 'DIST_FROM_START=2'], ['CHAR=p', 'CHAR-1=u', 'CHAR-2=c', 'CHAR+1=u', 'CHAR+2=n', 'STR--=cup', 'STR++=pun', 'STR-+=upu', 'DIST_FROM_START=3'], ['CHAR=u', 'CHAR-1=p', 'CHAR-2=u', 'CHAR+1=n', 'CHAR+2=c', 'STR--=upu', 'STR++=unc', 'STR-+=pun', 'DIST_FROM_START=4'], ['CHAR=n', 'CHAR-1=u', 'CHAR-2=p', 'CHAR+1=c', 'CHA

In [32]:
import pycrfsuite
import data_handling

def train_tagger(traindata,epochs,fn):
    # # your code here
    # https://github.com/scrapinghub/python-crfsuite/blob/master/examples/CoNLL%202002.ipynb    
    # 1. `data_handling.get_bies_notation`  --> bies
    # 2. `data_handling.data2features`      --> X_train
    # 3. `data_handling.data2labels(bies)`  --> y_train


    # from `train model`: https://github.com/scrapinghub/python-crfsuite/blob/master/examples/CoNLL%202002.ipynb
    # trainer = pycrfsuite.Trainer(verbose=False)

    # for xseq, yseq in zip(X_train, y_train):
    #     trainer.append(xseq, yseq)

    # trainer.set_params({
    #     'c1': 1.0,   # coefficient for L1 penalty
    #     'c2': 1e-3,  # coefficient for L2 penalty
    #     'max_iterations': epochs,  # use the `epochs` argument of `train_tagger`

    #     # as instructed above, do not include unobserved transitions
    #     'feature.possible_transitions': False
    # })


    # use `trainer.train(file_name)` for training

    # your code here

Let's now train a segmentation model on the small training set of 50 examples:

In [4]:
train_tagger(small_traindata,20,"small_segmentation.model") 

### Assignment 1.2: Segmenting data

rubric={"accuracy":5}

You will now implement a function `segment` which segments a dataset. The function takes two arguments:

* `data` a dataset of examples in the same format which is returned by `data_handling.read_data`. The existing segmentations in `data` are ignored.
* `fn` the file name of a stored segmentation model (trained using `train_tagger`)

The `segment` function should return a list of pairs `(probability,segmentation)`, where `segmentation` is a list of morphemes like `["dog","s"]` and `probability` is the probability which the tagger gives to this segmentation. For example:

```[(0.156, ["dog","s"]), (0.223, ["cat","s"]), ...]```

You should first load the segmentation model from `fn`. Check the [tutorial](https://github.com/scrapinghub/python-crfsuite/blob/master/examples/CoNLL%202002.ipynb) to see how this is done. You should then convert `data` into BIES format (using `data_handling.get_bies_notation`) and extract features for each example (using `data_handling.data2features`). This gives you a list `data_features`.

You should then iterate through the examples in `data_features` and segment each of them using the function [`pycrfsuite.Tagger.tag`](https://python-crfsuite.readthedocs.io/en/latest/pycrfsuite.html#pycrfsuite.Tagger.tag), which is a member function of your segmentation model. This gives you the most likely BIES tags for each input example, for example: `["BEGIN","INSIDE","END","SINGLE"]`.

You can use the function `pycrfsuite.Tagger.probability` to get the probability of the segmentation.

You then need to transform each example into a list of pairs, for example:

```[("d","BEGIN"),("o","INSIDE"),("g","END"),("s","SINGLE")]```

You can then feed these to the function `data_handling.unbies`, which will return segmented word forms: `("dogs",["dog","s"])`.

Combine each of the segmentations with the appropriate probability and you're done.
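
The pairing step can be tried out without a trained model. The snippet below is a minimal sketch, not the reference solution: `pair_and_merge` is a hypothetical helper that zips the characters of a word with predicted tags and merges them the same way `data_handling.unbies` does. The tags are hard-coded here, whereas in the lab they would come from `pycrfsuite.Tagger.tag`.

```python
def pair_and_merge(word, tags):
    """Pair each character with its predicted tag, then rebuild the morphemes.

    Hypothetical helper for illustration only; in the lab the merging is
    done by `data_handling.unbies`.
    """
    pairs = list(zip(word, tags))        # [("d", "BEGIN"), ("o", "INSIDE"), ...]
    segments = [""]
    for char, tag in pairs:
        segments[-1] += char
        if tag in ("END", "SINGLE"):     # these two tags close the current morpheme
            segments.append("")
    return pairs, [seg for seg in segments if seg]


pairs, segments = pair_and_merge("dogs", ["BEGIN", "INSIDE", "END", "SINGLE"])
print(pairs)
print(segments)   # ['dog', 's']
```

Each segmentation produced this way would then be combined with the probability reported by the tagger for the same tag sequence.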

In [5]:
def segment(data,fn):
    # your code here
    tagged_data = []
    probs = []    

    # from `make predictions`: https://github.com/scrapinghub/python-crfsuite/blob/master/examples/CoNLL%202002.ipynb
    # tagger = pycrfsuite.Tagger()
    # tagger.open(fn)

    # your data go through `get_bies_notation`  --> bies

    # iterate `bies` by enumerating `data2features(bies)` which gives i, example
    #   `tagger.tag`            --> append to `tagged_data` BUT...
    #       [['s', 'BEGIN'], ['a', 'INSIDE'], ['l', 'INSIDE'], ['e', 'END'], ['s', 'SINGLE']]     <-- bies[i]
    #       ['BEGIN', 'INSIDE', 'INSIDE', 'END', 'SINGLE']                                        <-- tags
    

    #    [('s', 'BEGIN'), ('a', 'INSIDE'), ('l', 'INSIDE'), ('e', 'END'), ('s', 'SINGLE')]         --> append to `tagged_data`
    #   `tagger.probability`                                                                       --> append to `probs`

    # Then, return following:
    # [ 
    #   (   0.156,                  <--- probs
    #       ["dog","s"] ),          <--- by `unbies`` of `tagged_data`
    #   (   0.223, ["cat","s"]), 
    #   ... 
    # ]
    return ...
    # your code here

When we want to evaluate a segmentation system, we don't need the probability for each segmentation. The following function simply throws away all the probabilities from the return value of `segment`:

In [6]:
def get_segmentations(data,fn):
    tokenized_data = [segmented for _,segmented in segment(data,fn)]
    return tokenized_data

We can now evaluate the initial model trained on 50 examples using the function `data_handling.evaluate`. The f-score should be 56%:

In [7]:
tokenized_dev = get_segmentations(devdata,"small_segmentation.model")

print("Results:")
print("Development set precision: %.2f, recall: %.2f, f-score: %.2f" % data_handling.evaluate(tokenized_dev,devdata))

Results:
Development set precision: 73.05, recall: 45.58, f-score: 56.13
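
The scores reported by `data_handling.evaluate` are computed over morpheme boundary positions, not over whole words. The following sketch mirrors the `boundaries` helper inside `evaluate` using only the standard library; `boundary_counts` is a hypothetical name introduced here, and the example words are toy inputs.

```python
from itertools import accumulate


def boundaries(segments):
    """Character offsets of the word-internal morpheme boundaries."""
    return set(list(accumulate(len(seg) for seg in segments))[:-1])


def boundary_counts(system, gold):
    """True positives, false positives and false negatives for one word."""
    sys_b, gold_b = boundaries(system), boundaries(gold)
    return len(sys_b & gold_b), len(sys_b - gold_b), len(gold_b - sys_b)


print(boundaries(["acupuncture", "'s"]))                           # {11}
print(boundary_counts(["un", "lock", "able"], ["un", "lockable"]))  # (1, 1, 0)
```

Under this scheme, a system that proposes few boundaries can reach high precision while missing many gold boundaries, which is one way to read a precision that is much higher than recall.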


## Assignment 2: Annotating additional examples randomly

rubric={"accuracy":5}

You will now investigate the average improvement which can be gained by adding a number of additional training examples to `small_traindata`. This will be a baseline system where we randomly sample additional training examples from the large `all_traindata` dataset.

Start by implementing the function `augment_data_randomly`. It takes three arguments:

* `small_data` a small dataset of training examples to which we will add new examples,
* `big_data` a large dataset from which we will sample the examples that are added to `small_data`, and
* `n` the number of examples which are added to `small_data`.

The function should return `augmented_data` which contains all examples in `small_data` together with the randomly sampled examples from `big_data`.

You should sample `n` examples from `big_data` **without replacement**. This means that you should not add the same training example twice. You can use functions from the Python module [`random`](https://docs.python.org/3/library/random.html). 
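
As a small illustration of sampling without replacement, the sketch below uses `random.sample` on invented toy data. The variable names and the seeded `random.Random` instance are assumptions made for reproducibility of the sketch, not part of the starter code.

```python
import random

# Toy datasets in the `(word_form, segments)` format.
small = [("dogs", ["dog", "s"]), ("cats", ["cat", "s"])]
big = [("walked", ["walk", "ed"]), ("unlock", ["un", "lock"]),
       ("trees", ["tree", "s"]), ("redo", ["re", "do"])]

rng = random.Random(0)          # seeded only to make the sketch reproducible
sampled = rng.sample(big, 2)    # two distinct items from `big`, no repeats
augmented = small + sampled     # builds a new list; `small` is left unchanged

print(len(augmented))           # 4
```

Concatenating with `+` instead of calling `extend` keeps the original small dataset intact, which matters when the same starting set is reused across several trials.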

In [8]:
import random

def augment_data_randomly(small_data, big_data, n):
    # your code here
    augmented_data = ...  # `small_data` plus a `random.sample` of `n` examples from `big_data`
    return augmented_data
    # your code here

To compute the average improvement which can be accomplished by adding random examples to `small_traindata`, you should train and evaluate `TRIALS` (= 20) segmentation systems which are trained on `small_traindata` augmented by `ADDED_EXAMPLES` (= 50) randomly sampled training examples from `all_traindata`. You can train each segmentation system for 20 epochs.

Evaluate each of the taggers on `devdata` and store its f-score in the list `fscores`. You can evaluate the segmentation performance using the function `data_handling.evaluate`. See the previous assignment for an example of how to call `data_handling.evaluate`.

In [9]:
TRIALS=20
ADDED_EXAMPLES=50

fscores = []

for i in range(TRIALS):
    # your code here
    # 1. `augment_data_randomly`
    # 2. `train_tagger`
    # 3. `get_segmentations` with `devdata`
    # 4. `evaluate`
    # 5. `append` fscore to `fscores`
    ...
    # your code here

You can now compute the mean and confidence interval (at the 99% level) for the performance of the segmentation systems. Your mean should be roughly 65% to 67%.

In [10]:
import numpy as np
import scipy.stats

def mean_and_confidence_interval(data, confidence):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n-1)
    return m, m-h, m+h

mean, lower_boundary, upper_boundary = mean_and_confidence_interval(fscores,0.99)
print("Mean: %.2f Confidence interval at 99%% level: [%.2f, %.2f]" % (mean, lower_boundary, upper_boundary))

Mean: 65.02 Confidence interval at 99% level: [62.99, 67.06]


## Assignment 3: Query-by-uncertainty

You will now implement active learning. You'll investigate three strategies for choosing additional training examples from the large training set `all_traindata`. In each case, we choose an example to add to `small_traindata` using our existing segmentation system, then retrain our segmentation system, choose another example, retrain again, and so on.

### Assignment 3.1: Least confidence strategy

rubric={"accuracy":5}

Our first active learning strategy will segment all words in `all_traindata` and find the word which results in the least confident segmentation. Confidence is measured by the segmentation probability $p(y|x)$, where $y$ is the segmentation and $x$ is the input word. The function has two modes. If the parameter `length_normalize` is true, it returns the example which minimizes the normalized probability $p(y|x)^{1/N}$, where $N$ is the length of the word form $x$. Otherwise, it returns the example which minimizes $p(y|x)$.

You should start by implementing the function `least_confident_example`. It finds the word form in `data` for which our segmentation system is maximally unsure about the segmentation. The function takes four arguments:

* `data` a dataset of examples in the format returned by `data_handling.read_data`,
* `skip_words` a set of words which should be filtered out when finding the output word of the function (these are words which either belong to `small_traindata` or have already been added),
* `fn` the file name for a segmentation model,
* `length_normalize` a boolean which tells us whether to normalize probabilities by length or not.

You should start by calling `segment` on the input dataset `data`. Use the segmentation system stored in `fn`. This will give you the segmentation returned by our segmentation system and its probability for each example in `data`. If `length_normalize` is `True`, you should normalize those probabilities by applying the transformation $p \mapsto p^{1/N}$, where $N$ is the length of the input form.

Return a pair `(probability, example)`, where `example` is the example in `data` which gets the lowest segmentation probability, for example `("dogs",["dog","s"])`, and `probability` is the segmentation probability (normalized by length if `length_normalize` is `True`). Note that the segmentation `["dog","s"]` here is **the gold standard segmentation in `data`**. Not the segmentation given by our model!

**Note that you should make sure that you don't return a word occurring in `skip_words`**.
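
To see why length normalization can change which word is selected, consider the following sketch. The probabilities are invented for illustration and `normalized` is a hypothetical helper; only the transformation $p \mapsto p^{1/N}$ is taken from the description above.

```python
def normalized(prob, word):
    """Length-normalized confidence p ** (1 / N), with N the word length."""
    return prob ** (1.0 / len(word))


# Toy (probability, word) pairs; the probabilities are made up.
candidates = [(0.05, "acupuncture's"), (0.20, "dogs")]

raw_pick = min(candidates)[1]
norm_pick = min((normalized(p, w), w) for p, w in candidates)[1]

print(raw_pick)    # acupuncture's
print(norm_pick)   # dogs
```

Without normalization the 13-character word has the lowest probability simply because a long tag sequence multiplies many factors. After normalization its score is about 0.79, while the 4-character word drops to about 0.67 and is selected instead.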

In [34]:
def least_confident_example(
        data,
        skip_words,                 # set of words
        fn,
        length_normalize):
    
    # # your code here
    # 1. get  `confidences` using `segment`
    confidences = ...
    # 2. create a tuple of `(prob, ex)`
    # 3. if normalize, then normalize ...
    # 4. `sort()` confidences 
    # 5. then, return the first one (the least confident example)

    return confidences[0]
    # your code here

We'll now create a new training set `active_traindata` which is initialized to the examples in `small_traindata`. You will then alternate between training a segmentation model stored in a file `active_segmentation.model` and adding training examples to `active_traindata`. Each time you will call `least_confident_example` to find the next training example to add to `active_traindata`. 

You should make sure that you never add the same word form twice to `active_traindata` by setting the `skip_words` parameter for `least_confident_example`. 
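
The alternation between retraining and selection can be sketched generically. In the snippet below, `active_learning_loop`, `train`, `select` and `toy_select` are hypothetical names: `train` stands in for a call to `train_tagger`, and `select` for a selection function such as `least_confident_example` that returns a `(score, example)` pair. The toy selection rule (prefer the longest unseen word) is only there to make the sketch runnable.

```python
def active_learning_loop(seed_data, pool, n_added, train, select):
    """Alternate between retraining and picking one new example."""
    train_data = list(seed_data)
    skip_words = {word for word, _ in train_data}
    for _ in range(n_added):
        train(train_data)                       # retrain on the current set
        _, example = select(pool, skip_words)   # returns (score, example)
        train_data.append(example)
        skip_words.add(example[0])              # never pick this word again
    return train_data


def toy_select(pool, skip_words):
    """Toy stand-in: score each unseen example by its word length."""
    unseen = [ex for ex in pool if ex[0] not in skip_words]
    return max((len(ex[0]), ex) for ex in unseen)


seed = [("dogs", ["dog", "s"])]
pool = [("dogs", ["dog", "s"]), ("walked", ["walk", "ed"]), ("redo", ["re", "do"])]
result = active_learning_loop(seed, pool, 2, lambda data: None, toy_select)
print([word for word, _ in result])   # ['dogs', 'walked', 'redo']
```

Keeping `skip_words` as a set that grows with every added example is what prevents the same word form from being added twice.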

First you will run `least_confident_example` with `length_normalize = False`. This should give you an improvement over training on `small_traindata` alone but you will probably not beat `augment_data_randomly`:

In [12]:
from copy import deepcopy

ADDED_EXAMPLES=50

print("Query-by-uncertainty using least_confident_example without length normalization:")
active_traindata = deepcopy(small_traindata) 

# print(active_traindata)

for i in range(ADDED_EXAMPLES):
    # # your code here

    # 1. `train_tagger` using `active_traindata`
    # 2. `least_confident_example`
    # 3. `append` example to `active_traindata`
    # your code here


train_tagger(active_traindata,20,"active_segmentation.model")    
tokenized_dev = get_segmentations(devdata,"active_segmentation.model")
print("Results:")
print("Development set precision: %.2f, recall: %.2f, f-score: %.2f" % data_handling.evaluate(tokenized_dev,devdata))

Query-by-uncertainty using least_confident_example without length normalization:
Results:
Development set precision: 88.71, recall: 48.67, f-score: 62.86


You will then run `least_confident_example` with `length_normalize = True`. This should give you a clear improvement over `augment_data_randomly`:

In [13]:
print("Query-by-uncertainty using least_confident_example with length normalization:")
active_traindata = deepcopy(small_traindata)

for i in range(ADDED_EXAMPLES):
    # your code here
    
    # SAME as before except for `length_normalize` = True

    # your code here

train_tagger(active_traindata,20,"active_segmentation.model")    
tokenized_dev = get_segmentations(devdata,"active_segmentation.model")
print("Results for supervised segmentation:")
print("Development set precision: %.2f, recall: %.2f, f-score: %.2f" % data_handling.evaluate(tokenized_dev,devdata))

Query-by-uncertainty using least_confident_example with length normalization:
Results for supervised segmentation:
Development set precision: 90.85, recall: 57.08, f-score: 70.11


### Assignment 3.2: Token entropy (optional)

rubric={"accuracy":5}

Our second active learning strategy will segment all words in `all_traindata` and find the word which results in the segmentation with the highest *token entropy*. This is computed using the marginal probabilities $p(y_i = tag|x)$ of the BIES tag $tag$ at position $i$ in the input example $x$. For an example of length $N$, the token entropy is computed as

$$\sum_{i = 0}^{N-1} \sum_{tag \in \{B,I,E,S\}} p(y_i = tag | x) \cdot -\log p(y_i=tag | x)$$

You should implement a function `entropy_maximizing_example` which takes three arguments:

* `data` a dataset of examples in the format returned by `data_handling.read_data`,
* `skip_words` a set of words which should be filtered out when finding the output word of the function (these are words which either belong to `small_traindata` or have already been added), and
* `fn` the file name for a segmentation model.

You should start by extracting features for the examples in `data` using `data_handling.get_bies_notation` and `data_handling.data2features`. 

The function [`pycrfsuite.Tagger.marginal`](https://python-crfsuite.readthedocs.io/en/latest/pycrfsuite.html#pycrfsuite.Tagger.marginal) gives you the marginal probability $p(y_i=tag | x)$ at position `i` in an input example $x$. You will need to call either `pycrfsuite.Tagger.tag` or `pycrfsuite.Tagger.set` before `marginal` in order to initialize the tagger to the input example `x`. 

You should sum $p(y_i = tag | x) \cdot -\log p(y_i=tag | x)$ for each tag and position in the example and normalize by dividing by the length of the example $x$. Looping over each example in `data` will give you a list `entropies` which contains pairs `(entropy, example)` where `example` is an example like `("dogs",["dog","s"])` and `entropy` is the token entropy for this example. You should return the pair where the token entropy is maximal. Note that the segmentation `["dog","s"]` here is **the gold standard segmentation in `data`**. Not the segmentation given by our model!

**Note that you should make sure that you don't return a word occurring in `skip_words`**.
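
The entropy computation itself can be checked on hand-made marginals before involving a tagger. The sketch below assumes the marginals are already available as one dictionary per character (in the lab they would be read from `pycrfsuite.Tagger.marginal`); `token_entropy` is a hypothetical helper name.

```python
from math import log

TAGS = ["BEGIN", "INSIDE", "END", "SINGLE"]


def token_entropy(marginals):
    """Length-normalized token entropy.

    `marginals` holds one dict per character, mapping each tag to
    p(y_i = tag | x). Terms with p == 0 are skipped, since p * log(p)
    tends to 0 as p goes to 0.
    """
    total = 0.0
    for position in marginals:
        for tag in TAGS:
            p = position[tag]
            if p > 0.0:
                total -= p * log(p)
    return total / len(marginals)


certain = [{"BEGIN": 1.0, "INSIDE": 0.0, "END": 0.0, "SINGLE": 0.0}] * 3
unsure = [{tag: 0.25 for tag in TAGS}] * 3
print(token_entropy(certain))   # 0.0
print(token_entropy(unsure))    # log(4), the maximum for four tags
```

A fully confident tagger yields an entropy of zero, and a uniform distribution over the four tags yields the maximum, so higher values mark words the model is less sure about.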

In [14]:
import numpy as np

from data_handling import BEGIN, INSIDE, END, SINGLE

TAGS = [BEGIN, INSIDE, END, SINGLE]

def entropy_maximizing_example(data,skip_words,fn):
    # your code here
    # build `entropies`, a list of `(entropy, example)` pairs sorted from highest to lowest entropy

    return entropies[0]

You will then run `entropy_maximizing_example`, which always normalizes by length. You should get a slight improvement over `augment_data_randomly`:

In [15]:
active_traindata = deepcopy(small_traindata)

print("Query-by-uncertainty using entropy_maximizing_example with length normalization:")
for i in range(ADDED_EXAMPLES):
    # your code here
    ...
    # your code here

train_tagger(active_traindata,20,"active_segmentation.model")    
tokenized_dev = get_segmentations(devdata,"active_segmentation.model")
print("Results for supervised segmentation:")
print("Development set precision: %.2f, recall: %.2f, f-score: %.2f" % data_handling.evaluate(tokenized_dev,devdata))

# Query-by-uncertainty using entropy_maximizing_example with length normalization:
# Results for supervised segmentation:
# Development set precision: 81.51, recall: 42.92, f-score: 56.23


Query-by-uncertainty using entropy_maximizing_example with length normalization:
Results for supervised segmentation:
Development set precision: 89.21, recall: 54.87, f-score: 67.95


## Assignment 4: Query-by-committee

rubric={"accuracy":7}

Our final active learning strategy is a version of the query-by-committee strategy. Here we train a number of models on subsets of our original training set. We will then add examples to the training set based on how strongly the models agree on the segmentation for the examples. We want to add examples where there is maximal disagreement between the models because these are likely to be examples which our training data doesn't sufficiently cover.

You should start by implementing a function `train_committee` which trains `COMMITTEE_SIZE` models (defined below). Each model is trained on a 70% subset of the input dataset `traindata`. Every model is trained for `epochs` epochs and is stored in a file `fn_prefix.n`, where `fn_prefix` is given as an argument to `train_committee` and `n` is an index number between `0` and `COMMITTEE_SIZE - 1`. You should sample examples from `traindata` **without replacement**, which means that the same training example should not occur twice in the sampled set. You can use functions from the Python module `random` to sample training examples.
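
The sampling part of the committee can be sketched independently of training. In the snippet below, `committee_subsets`, the seeded generator and the file name prefix `committee.model` are assumptions chosen for illustration, and the data is a toy list of 50 unsegmented dummy words.

```python
import random


def committee_subsets(traindata, committee_size, seed=0):
    """Yield (index, subset) pairs, each subset a 70% sample without replacement."""
    rng = random.Random(seed)
    subset_size = len(traindata) * 7 // 10    # 70%, rounded down
    for n in range(committee_size):
        yield n, rng.sample(traindata, subset_size)


toy_data = [("word%d" % i, ["word%d" % i]) for i in range(50)]
for n, subset in committee_subsets(toy_data, committee_size=3):
    # Each subset would be passed to `train_tagger` and stored under this name.
    print("committee.model.%d" % n, len(subset))
```

Every committee member sees a different 35-example subset, so the models differ from each other even though they share the same features and hyperparameters.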

In [16]:
import random

random.seed(0)

COMMITTEE_SIZE=20

def train_committee(traindata,epochs,fn_prefix):
    # your code here
    for i in range(COMMITTEE_SIZE):
        # 1. get 70% of sample data from `traindata`
        # 2. `train_tagger` and save it as model_i
        
    # your code here

You should now implement a function `most_controversial_example`. It takes three arguments:

* `data` a dataset of examples in the format returned by `data_handling.read_data`,
* `skip_words` a set of words which should be filtered out when finding the output word of the function (these are words which either belong to `small_traindata` or have already been added), and
* `fn_prefix` the file name prefix for the segmentation models belonging to a committee.

The function will segment the dataset `data` using each of the segmentation models belonging to a committee. It will then return the example in `data` which results in the largest disagreement among the models in the committee. To measure disagreement, we will use *vote entropy* which is defined in the following way: Let $V(tag,i)$ be the number of models which predict BIES tag $tag$ at position $i$ in the input example, then the vote entropy is given by

$$ \frac{1}{N}\cdot \sum_{i=0}^{N-1} \sum_{tag \in \{B,I,E,S\}} \frac{V(tag,i) + \alpha}{C + 4 \cdot \alpha} \cdot -\log \frac{V(tag,i) + \alpha}{C + 4\cdot \alpha},$$

where $N$ is the length of the input example, $C$ is the number of models which belong to the committee and $\alpha$ is a small smoothing constant which we set to `0.1`.

You should start by segmenting `data` using each of the segmentation models in the committee stored in the model files `fn_prefix.0`,  ..., `fn_prefix.COMMITTEE_SIZE-1`. This will give you a list `committee_segmented` with `COMMITTEE_SIZE` elements.

You should then use this list to compute the vote entropy for each example in `data`. Looping over each example in `data` will give you a list `vote_entropies` which contains pairs `(entropy, example)` where `example` is an example like `("dogs",["dog","s"])` and `entropy` is the vote entropy for this example. You should return the pair where the vote entropy is maximal. Note that the segmentation `["dog","s"]` here is **the gold standard segmentation in `data`**. Not the segmentation given by our model!   

**Note that you should make sure that you don't return a word occurring in `skip_words`**.
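
Vote entropy can also be tried out on hand-made committee predictions. The sketch below assumes that the normalizer in the formula is the length of the word and that each committee member contributes one BIES tag sequence; `vote_entropy` is a hypothetical helper name and the tag sequences are toy inputs.

```python
from math import log

TAGS = ["BEGIN", "INSIDE", "END", "SINGLE"]
ALPHA = 0.1   # smoothing constant from the assignment text


def vote_entropy(committee_tags):
    """Length-normalized, smoothed vote entropy for one word.

    `committee_tags` holds one tag sequence per committee member.
    """
    committee_size = len(committee_tags)
    length = len(committee_tags[0])
    total = 0.0
    for i in range(length):
        for tag in TAGS:
            votes = sum(1 for tags in committee_tags if tags[i] == tag)
            q = (votes + ALPHA) / (committee_size + len(TAGS) * ALPHA)
            total -= q * log(q)
    return total / length


agree = [["BEGIN", "END", "SINGLE"]] * 4
split = [["BEGIN", "END", "SINGLE"]] * 2 + [["BEGIN", "INSIDE", "END"]] * 2
print(vote_entropy(agree) < vote_entropy(split))   # True
```

Because of the smoothing term, even a unanimous committee gets a small positive entropy, but a committee that splits its votes scores clearly higher and is therefore preferred by the selection step.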

In [17]:
from copy import deepcopy
from data_handling import BEGIN, INSIDE, END, SINGLE
from math import log

TAGS = [BEGIN, INSIDE, END, SINGLE]

def most_controversial_example(data,skip_words,fn_prefix):
    # your code here
    committee_segmented = []
    for i in range(COMMITTEE_SIZE):
        # 1. `get_segmentations` through models
        segmentations =  ...                            # --> ['neg', 'ative']
        # 2. get `segmented_data` with `data`           
        segmented_data = ...                            # --> ('negative', ['neg', 'ative'])
        # 3. `get_bies_notation` of `segmented_data`
        bies_segmentations = ...
        # 4. `append` it to `committee_segmented` 
        ...

        # therefore, `committee_segmented` is a list of the list of segmentations

    vote_entropies = []
    for i, ex in enumerate(data):
        if not ex[0] in skip_words:
            # 1. initialize tag_counts with 0.1
            tag_counts = ...        
            # for a word `n e g a t i v e`
            # tag_counts should be as follows:
            # [{'BEGIN': 0.1, 'INSIDE': 0.1, 'END': 0.1, 'SINGLE': 0.1},    <- n
            #  {'BEGIN': 0.1, 'INSIDE': 0.1, 'END': 0.1, 'SINGLE': 0.1},    <- e
            #  {'BEGIN': 0.1, 'INSIDE': 0.1, 'END': 0.1, 'SINGLE': 0.1},    <- g
            #  {'BEGIN': 0.1, 'INSIDE': 0.1, 'END': 0.1, 'SINGLE': 0.1},    <- a
            #  {'BEGIN': 0.1, 'INSIDE': 0.1, 'END': 0.1, 'SINGLE': 0.1},    <- t
            #  {'BEGIN': 0.1, 'INSIDE': 0.1, 'END': 0.1, 'SINGLE': 0.1},    <- i
            #  {'BEGIN': 0.1, 'INSIDE': 0.1, 'END': 0.1, 'SINGLE': 0.1},    <- v
            #  {'BEGIN': 0.1, 'INSIDE': 0.1, 'END': 0.1, 'SINGLE': 0.1}]    <- e

            # 2. then update (+1) `tag_counts` by iterating  `COMMITTEE_SIZE`
            for j in range(COMMITTEE_SIZE):
                for k in range(len(ex[0])):
                    tag_counts[k][...] += 1
            
            # 3. calculate probs (entropy)
            probs = ...
            # [(
            #   0.44075178913300456,            <- probs
            #   ('negative', ['negat', 'ive'])  <- ex   
            # )]                                                <-- vote_entropies 
            vote_entropies.append(...)

    vote_entropies.sort(reverse=True)
    return vote_entropies[0]
    # your code here

You will then run `most_controversial_example`. You should get a clear improvement over `augment_data_randomly`:

In [18]:
active_traindata = deepcopy(small_traindata)

print("Query-by-committee:")
for i in range(ADDED_EXAMPLES):
    # your code here
    # 1. train_committee
    # 2. get `most_controversial_example` 
    # 3. append example 

    # your code here

train_tagger(active_traindata,20,"active_segmentation.model")    
tokenized_dev = get_segmentations(devdata,"active_segmentation.model")
print("Results for supervised segmentation:")
print("Development set precision: %.2f, recall: %.2f, f-score: %.2f" % data_handling.evaluate(tokenized_dev,devdata))

Query-by-committee:
Results for supervised segmentation:
Development set precision: 90.07, recall: 60.18, f-score: 72.15
