# Colx 525 Lab Assignment 2: Morphological Segmentation (Cheat sheet)

## Assignment objectives

In this assignment, you will develop a supervised morphological segmentation system using the `pycrfsuite` tagger toolkit. You will:

1. Read in training, development and test data.
1. Convert datasets into BIES format.
1. Train a baseline BPE segmentation system using the Python toolkit `bpe`.
1. Implement an evaluation function for morphological segmentation.
1. Implement a feature extraction function for your supervised segmentation system.
1. Train your segmentation system and apply it to test data.  

All parts of this assignment are mandatory.

## Getting started

You will need to install the Python modules `bpe`, `numpy` and `pycrfsuite`. The easiest way to do this is using `pip`:

```
pip install python-crfsuite bpe numpy
```

## Tidy Submission

rubric={mechanics:1}

To get the marks for tidy submission:

* Submit the assignment by filling in this Jupyter notebook with your answers embedded
* Be sure to follow the general lab instructions

### Assignment 1: Reading in data

Start by running the following code. It will create a directory `.data` and download English segmentation data into the directory. 

In [1]:
import os, urllib.request, tarfile

URL = "http://mpsilfve.github.io/assets/segmentation_data.tgz"
DATADIR = ".data"
FN = "segmentation_data"
TGZ = os.path.join(DATADIR,"%s.tgz" % FN)

try:
    print("Creating",DATADIR)
    os.mkdir(DATADIR)
except FileExistsError:
    pass

print("Downloading %s.tgz into .data" % FN)
urllib.request.urlretrieve(URL,TGZ)

print("Extracting %s" % TGZ)
tf = tarfile.open(TGZ)
tf.extractall(DATADIR)

Creating .data
Downloading segmentation_data.tgz into .data
Extracting .data/segmentation_data.tgz


#### Assignment 1.1: Reading data files into lists

rubric={accuracy:5}

In the sub-directory `segmentation_data` of the directory `.data`, there are three files: `traindata`, `devdata` and `testdata`. Each of these files consists of two tab-separated columns:

**1.**
```
vouchsafed      vouch saf ed
negative        negat ive
annotations     an not at ion s
torpedos        torpedo s
coxswain        coxswain
lofted          loft ed
...
```

The first column contains a word form and the second column contains the same word form but segmented into morphemes. Please read these files into three lists: `traindata`, `devdata` and `testdata`. Each list element should be a pair like:

```
["vouchsafed",["vouch","saf","ed"]]
```

**2.** **traindata | devdata | testdata**:
```
[   ['vouchsafed', ['vouch', 'saf', 'ed']], 
    ['negative', ['negat', 'ive']], 
    ['annotations', ['an', 'not', 'at', 'ion', 's']], 
    ['torpedos', ['torpedo', 's']], 
    ['coxswain', ['coxswain']], 
    ['lofted', ['loft', 'ed']], ... ]
```
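One way to parse files in this format is sketched below. The helper name `read_segmentation_file` is hypothetical, and the sketch reads from any iterable of lines so that it can be tried without the downloaded files. It assumes that every non-empty line holds a word form, a tab, and the space-separated morphemes.

```python
import io

def read_segmentation_file(lines):
    """Parse tab-separated lines into [word, [morpheme, ...]] pairs."""
    data = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, segmented = line.split("\t")
        data.append([word, segmented.split()])
    return data

# Stand-in for an opened file. With the real data one would pass something like
# open(os.path.join(".data", "segmentation_data", "traindata"), encoding="utf-8").
sample = io.StringIO("vouchsafed\tvouch saf ed\nnegative\tnegat ive\ncoxswain\tcoxswain\n")
pairs = read_segmentation_file(sample)
print(pairs[0])
```

Calling the same helper once per file would give `traindata`, `devdata` and `testdata`.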

In [2]:
traindata = []
devdata = []
testdata = []

# your code here

# your code here

# A few assertions to make sure that your code is working properly.
assert(traindata[0] == ["vouchsafed",["vouch", "saf", "ed"]])
assert(devdata[0] == ["sales",["sale", "s"]])
assert(testdata[0] == ["bunkhouses",["bunk","house","s"]])

Finally, we'll define `trainwords`, `devwords` and `testwords` which simply contain the unsegmented word forms in the data sets: 

**3.** **trainwords | devwords | testwords**:
```
['vouchsafed', 'negative', 'annotations', 'torpedos', 'coxswain', 'lofted', ... ]
```

In [25]:
trainwords = [wf for wf, _ in traindata]
devwords = [wf for wf, _ in devdata]
testwords = [wf for wf, _ in testdata]

print("trainwords\t", trainwords[:2])
print("devwords\t", devwords[:2])
print("testwords\t", testwords[:2])

trainwords	 ['vouchsafed', 'negative']
devwords	 ['sales', 'oppressor']
testwords	 ['bunkhouses', 'beheld']


#### Assignment 1.2: BIES notation

rubric={accuracy:10}

We will use a sequential tagger for morphological segmentation, which means that we need to convert our segmented word forms into the so-called BIES (Begin-Inside-End-Singleton) format. For the word form "bunkhouses" segmented as "bunk" "house" "s", it looks like this:

```
BEGIN   INSIDE   INSIDE   END   BEGIN   INSIDE   INSIDE   INSIDE   END   SINGLE

b       u        n        k     h       o        u        s        e     s
```

The first character of each morpheme receives a `BEGIN` tag, the last one an `END` tag and the remaining characters receive an `INSIDE` tag. As a special case, morphemes consisting of a single character (like the plural "s" ending above) receive the tag `SINGLE`.

It is your task to convert `traindata`, `devdata` and `testdata` into BIES format. You should store the result in the lists `trainbies`, `devbies` and `testbies`. Each example in these lists should have the following format:

```
[["b","BEGIN"],
 ["u","INSIDE"],
 ["n","INSIDE"],
 ["k","END"],
 ["h","BEGIN"],
 ["o","INSIDE"],
 ["u","INSIDE"],
 ["s","INSIDE"],
 ["e","END"],
 ["s","SINGLE"]]
```

### IO/BIO/IOE/BIOES

|       | IO | BIO | IOE | BIOES |
|-------|-------|-------|-------|-------|
| U.N.  | I-NP  | B-NP  | I-NP  | B-NP  |
| official  |    I-NP  | I-NP  | I-NP  | I-NP  | 
| Ekeus  | I-NP  | I-NP  | E-NP  | E-NP  | 
| heads  |  O  |  O  |  O  |  O  | 
| for  |  O  |  O  |  O  |  O  | 
| Baghdad  |  I-NP  | B-NP  |  E-NP  |  S-NP  | 
| .  |  O  |  O  |  O  |  O  | 

**4.** **trainbies | devbies | testbies**:
```
[[  ['v', 'BEGIN'], ['o', 'INSIDE'], ['u', 'INSIDE'], ['c', 'INSIDE'], ['h', 'END'], 
    ['s', 'BEGIN'], ['a', 'INSIDE'], ['f', 'END'], 
    ['e', 'BEGIN'], ['d', 'END']    ], 
 [  ['n', 'BEGIN'], ['e', 'INSIDE'], ['g', 'INSIDE'], ['a', 'INSIDE'], ['t', 'END'], 
    ['i', 'BEGIN'], ['v', 'INSIDE'], ['e', 'END']   ], 
 [  ['a', 'BEGIN'], ['n', 'END'], 
    ['n', 'BEGIN'], ['o', 'INSIDE'], ['t', 'END'], 
    ['a', 'BEGIN'], ['t', 'END'], 
    ['i', 'BEGIN'], ['o', 'INSIDE'], ['n', 'END'], 
    ['s', 'SINGLE'] ], 
    ...
]
```
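The conversion described above can be sketched as follows. The function names `morphemes_to_bies` and `get_bies_notation_sketch` are illustrative only; this is one possible approach, not the reference solution.

```python
BEGIN, INSIDE, END, SINGLE = "BEGIN", "INSIDE", "END", "SINGLE"

def morphemes_to_bies(morphemes):
    """Tag every character of a segmented word with BEGIN/INSIDE/END/SINGLE."""
    tagged = []
    for morpheme in morphemes:
        if len(morpheme) == 1:
            # A one-character morpheme is both first and last: SINGLE.
            tagged.append([morpheme, SINGLE])
            continue
        for pos, char in enumerate(morpheme):
            if pos == 0:
                tag = BEGIN
            elif pos == len(morpheme) - 1:
                tag = END
            else:
                tag = INSIDE
            tagged.append([char, tag])
    return tagged

def get_bies_notation_sketch(data):
    """Convert [word, morphemes] pairs into lists of [char, tag] pairs."""
    return [morphemes_to_bies(morphemes) for _, morphemes in data]

example = get_bies_notation_sketch([["sales", ["sale", "s"]]])
print(example[0])
```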

In [4]:
BEGIN = "BEGIN"
INSIDE = "INSIDE"
END = "END" 
SINGLE = "SINGLE"

# your code here
def get_bies_notation(data):
    result = []

    return result
# your code here

trainbies = get_bies_notation(traindata)
devbies = get_bies_notation(devdata)
testbies = get_bies_notation(testdata)

# A few assertions to make sure that your code is working properly.
assert(trainbies[0] == [['v', 'BEGIN'], ['o', 'INSIDE'], ['u', 'INSIDE'], ['c', 'INSIDE'], ['h', 'END'], ['s', 'BEGIN'], ['a', 'INSIDE'], ['f', 'END'], ['e', 'BEGIN'], ['d', 'END']])
assert(devbies[0] == [['s', 'BEGIN'], ['a', 'INSIDE'], ['l', 'INSIDE'], ['e', 'END'], ['s', 'SINGLE']])
assert(testbies[0] == [['b', 'BEGIN'], ['u', 'INSIDE'], ['n', 'INSIDE'], ['k', 'END'], ['h', 'BEGIN'], ['o', 'INSIDE'], ['u', 'INSIDE'], ['s', 'INSIDE'], ['e', 'END'], ['s', 'SINGLE']])

#### Assignment 1.3: From BIES notation to morphemes

rubric={accuracy:5}

For evaluation purposes, we will need to transform BIES notation produced by our segmentation model back into segmented word forms, i.e. take the following as input:

```
[['s', 'BEGIN'], ['a', 'INSIDE'], ['l', 'INSIDE'], ['e', 'END'], ['s', 'SINGLE']]
```

And generate the following as output:

```
["sale", "s"]
```

Implement a function `unbies(data)` that takes a list of examples in BIES format as input and returns a list of pairs in the format:

```
["sales", ["sale","s"]]
```

The first element in the pair is the unsegmented word form and the second one is the segmented word form.

---

`unbies()` converts data from **trainbies** to **traindata**:
```
print(unbies([[["d","BEGIN"],["o","INSIDE"],["g","END"],["s","SINGLE"]]]))    <-- trainbies
[["dogs", ["dog","s"]]]                                                       <-- traindata
```
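A minimal sketch of the reverse conversion is given below under the name `unbies_sketch` (a hypothetical name, to keep it apart from your own `unbies`). It also tolerates ill-formed tag sequences, which a trained tagger may produce. That tolerance is a design choice of this sketch, not a requirement stated in the assignment.

```python
def unbies_sketch(data):
    """Turn lists of (char, tag) pairs back into [word, [morpheme, ...]] pairs."""
    result = []
    for example in data:
        morphemes = []
        current = ""
        for char, tag in example:
            # BEGIN or SINGLE starts a new morpheme, so close any open one first.
            if tag in ("BEGIN", "SINGLE") and current:
                morphemes.append(current)
                current = ""
            current += char
            # END and SINGLE both close the current morpheme.
            if tag in ("END", "SINGLE"):
                morphemes.append(current)
                current = ""
        if current:
            # The sequence ended without END: keep the leftover characters.
            morphemes.append(current)
        word = "".join(char for char, _ in example)
        result.append([word, morphemes])
    return result

print(unbies_sketch([[["d", "BEGIN"], ["o", "INSIDE"], ["g", "END"], ["s", "SINGLE"]]]))
```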

In [5]:
# your code here
def unbies(data):
    result = []

    return result
#your code here

# An assertion to make sure that your code is working properly.
assert(unbies([[["d","BEGIN"],["o","INSIDE"],["g","END"],["s","SINGLE"]]]) == [["dogs", ["dog","s"]]])

### Assignment 2: BPE baseline

You are now going to use the Python library `bpe` to segment the data using byte-pair encoding. This will serve as a baseline for our supervised morpheme segmentation system.

Start by installing `bpe` using `pip` (follow instructions [here](https://github.com/soaxelbrooke/python-bpe)). 

Study the example for using a `bpe.Encoder` model [here](https://github.com/soaxelbrooke/python-bpe). Four arguments of the `__init__` function for `Encoder` matter for this assignment:

1. `vocab_size`, the size of the BPE vocabulary,
1. `pct_bpe`, a real-valued parameter which controls how `Encoder` handles frequent tokens (`pct_bpe==1` will apply BPE to all tokens in the training data and `pct_bpe==0` won't segment anything),
1. `ngram_max`, which controls the maximal length (in characters) of BPE tokens, and
1. `ngram_min`, which controls the minimal length (in characters) of BPE tokens.

You should always set `pct_bpe` to `1` in order to apply BPE to all tokens. You should set `ngram_max` to a large number like `100000` and `ngram_min` to `1`. These settings will ensure that `bpe.Encoder` implements the basic BPE algorithm.
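For intuition, the basic BPE algorithm can be sketched in a few lines of plain Python. This is not the `bpe` library's implementation: `merge_pair`, `learn_bpe_merges` and `apply_merges` are hypothetical helper names, and the toy corpus is made up purely for illustration.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merges by repeatedly joining the most frequent adjacent pair."""
    vocab = Counter(tuple(word) for word in words)  # symbol tuple -> frequency
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, count in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += count
        vocab = new_vocab
    return merges

def apply_merges(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

toy_corpus = ["low", "low", "lower", "lowest", "newest", "widest"]
merges = learn_bpe_merges(toy_corpus, num_merges=3)
print(merges)
print(apply_merges("lowest", merges))
```

The number of merges plays the role of the vocabulary size: more merges produce longer, more word-like units.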

#### Assignment 2.1: Training a BPE model

rubric={accuracy:5}

You should initialize a `bpe.Encoder` model `encoder` with vocabulary size 64,000. You should then read training data for BPE from the file `en-ud-train.txt` in the sub-directory `segmentation_data` of the `.data` directory. Split the file into lines and call `encoder.fit` with the list of lines as its argument.

Usage example from the `python-bpe` README (https://github.com/soaxelbrooke/python-bpe):

```

from bpe import Encoder

# Generated with http://pythonpsum.com
test_corpus = '''
    Object raspberrypi functools dict kwargs. Gevent raspberrypi functools. Dunder raspberrypi decorator dict didn't lambda zip import pyramid, she lambda iterate?
    Kwargs raspberrypi diversity unit object gevent. Import fall integration decorator unit django yield functools twisted. Dunder integration decorator he she future. Python raspberrypi community pypy. Kwargs integration beautiful test reduce gil python closure. Gevent he integration generator fall test kwargs raise didn't visor he itertools...
    Reduce integration coroutine bdfl he python. Cython didn't integration while beautiful list python didn't nit!
    Object fall diversity 2to3 dunder script. Python fall for: integration exception dict kwargs dunder pycon. Import raspberrypi beautiful test import six web. Future integration mercurial self script web. Return raspberrypi community test she stable.
    Django raspberrypi mercurial unit import yield raspberrypi visual rocksdahouse. Dunder raspberrypi mercurial list reduce class test scipy helmet zip?
'''

encoder = Encoder(200, pct_bpe=0.88)  # params chosen for demonstration purposes
encoder.fit(test_corpus.split('\n'))
```

Signature of `Encoder` with its default values:
```
Encoder(
    vocab_size=8192,
    pct_bpe=0.2,
    word_tokenizer=None,
    silent=True,
    ngram_min=2,
    ngram_max=2,
    required_tokens=None,
    strict=False,
    EOW='__eow',
    SOW='__sow',
    UNK='__unk',
    PAD='__pad',
)
```

In [8]:
import os
from bpe.encoder import Encoder

VOCAB_SIZE=64000
PCTBPE=1
NGRAM_MAX=1000000
NGRAM_MIN=1

# your code here

# your code here

#### Assignment 2.2: Segmenting development and test data using BPE

rubric={accuracy:5}

You should now segment the development and test data using your BPE model `encoder`. Perform segmentation on `devwords` and `testwords`, which only contain unsegmented word forms. You can use `encoder.tokenize`. Store the result in two lists `bpe_tokenized_dev` and `bpe_tokenized_test`.

`bpe.Encoder` uses a few special tokens to indicate the start and end of words (among other things). The special tokens always start with two underscores, for example `"__sow"`. Filter out all special tokens from `bpe_tokenized_dev` and `bpe_tokenized_test`, and store the filtered results in `bpe_segmented_dev` and `bpe_segmented_test` (you can assume that it is safe to filter out tokens starting with a double underscore `__`).

Examples in `bpe_segmented_dev` and `bpe_segmented_test` should have the following format:

```
['bunk', 'houses']
```

(note that this example is actually incorrectly segmented)

**`encoder.tokenize`**:
```
bpe_tokenized_dev:
[   ['__sow', 'sales', '__eow'], ['__sow', 'oppress', 'or', '__eow'], 
    ['__sow', 'wi', 'pes', '__eow'], ['__sow', 'bash', 'fully', '__eow'], 
    ['__sow', 'feli', 'cit', 'ous', '__eow']    ]
```

**filter out tokens starting with a double underscore `__`**:
```
bpe_segmented_dev:
[   ['sales'], ['oppress', 'or'], ['wi', 'pes'], ['bash', 'fully'], 
    ['feli', 'cit', 'ous']  ]
```
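The filtering step can be sketched as a nested list comprehension. The helper name `strip_special_tokens` is hypothetical, and the input below is copied from the tokenized example above, so the sketch runs without a trained encoder.

```python
def strip_special_tokens(tokenized_words):
    """Drop tokens that start with a double underscore, such as '__sow' and '__eow'."""
    return [[tok for tok in word if not tok.startswith("__")]
            for word in tokenized_words]

bpe_tokenized_example = [["__sow", "sales", "__eow"],
                         ["__sow", "oppress", "or", "__eow"]]
bpe_segmented_example = strip_special_tokens(bpe_tokenized_example)
print(bpe_segmented_example)
```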

In [27]:
#your code here

#your code here

print('bpe_tokenized_dev\t', bpe_tokenized_dev[:2])
print('bpe_segmented_dev\t', bpe_segmented_dev[:2])

bpe_tokenized_dev	 [['__sow', 'sales', '__eow'], ['__sow', 'oppress', 'or', '__eow']]
bpe_segmented_dev	 [['sales'], ['oppress', 'or']]


### Assignment 3: Evaluation

rubric={accuracy:10}

We will evaluate segmentation algorithms using precision, recall and f-score on segment boundaries. As an example, let's say we evaluate against the following gold standard:

```
[["hot", "dog","s"],["king","s"]]
```

and our segmentation system returned the following segmentation:

```
[["hot","dog","s"], ["k","ing","s"]]
```

The gold standard segment boundaries are:

```
[[3,6],[4]]
```

and our system gives the following segment boundaries:

```
[[3,6],[1,4]]
```

Three of our four predicted segment boundaries are found in the gold standard (`3` and `6` for the first example and `4` for the second one). This gives us a precision of $P = 3/4 = 75\%$ (three of the four boundaries we predicted are in the gold standard) and a recall of $R = 3/3 = 100\%$ (we found all three boundaries given by the gold standard). Finally, we get the f-score as $2 \cdot P \cdot R /(P+R) = 1.5/1.75 \approx 85.7\%$.

Implement a function `evaluate` which takes two arguments:

1. `sys_segmented_data` which is a list of segmented examples like `["k","ing","s"]` produced by a segmentation system.
1. `gold_segmented_data` which is a list of gold standard segmentation examples like `["kings",["king","s"]]`.

Your function should return the precision, recall and f-score for segment boundaries in `sys_segmented_data`.

**Note!** You should get an f-score above 45% for the BPE segmented test set.

If you want, you can now tune the vocabulary size of `bpe.Encoder` (which we set to 64k) so that you get maximal performance on the development set.

$P = \frac{|\textrm{retrieved ones}~\cap~\textrm{relevant ones}|}{|\textrm{retrieved ones}|}$

$R = \frac{|\textrm{retrieved ones}~\cap~\textrm{relevant ones}|}{|\textrm{relevant ones}|}$

```
GOLD:
[["hot", "dog","s"],["king","s"]] => [[3,6],[4]]

SYSTEM:
[["hot","dog","s"], ["k","ing","s"]] => [[3,6],[1,4]]
```

- $\textrm{retrieved ones}~\cap~\textrm{relevant ones}$:  `tp` = **3** where `tp` = GOLD $\cap$ SYSTEM
- $\textrm{retrieved ones}$ (SYSTEM): `tp + fp` = 3 + 1 =  **4** where `fp` = SYSTEM - GOLD
- $\textrm{relevant ones}$ (GOLD): `tp + fn` = 3 + 0 = **3** where  `fn` = GOLD - SYSTEM
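The counts above translate into code roughly as follows. This is a sketch under two assumptions: the names `boundaries` and `evaluate_sketch` are hypothetical, and scores are returned as percentages to match the printed results in this notebook.

```python
from itertools import accumulate

def boundaries(morphemes):
    """Return the set of internal boundary offsets, e.g. ['hot', 'dog', 's'] -> {3, 6}."""
    offsets = list(accumulate(len(m) for m in morphemes))
    return set(offsets[:-1])  # the final offset is the word end, not a boundary

def evaluate_sketch(sys_segmented_data, gold_segmented_data):
    """Boundary precision, recall and f-score (all in percent)."""
    tp = fp = fn = 0
    for sys_morphemes, (_, gold_morphemes) in zip(sys_segmented_data, gold_segmented_data):
        sys_b, gold_b = boundaries(sys_morphemes), boundaries(gold_morphemes)
        tp += len(sys_b & gold_b)   # boundaries found in both
        fp += len(sys_b - gold_b)   # predicted but not in gold
        fn += len(gold_b - sys_b)   # in gold but missed
    precision = 100 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100 * tp / (tp + fn) if tp + fn else 0.0
    fscore = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, fscore

gold = [["hotdogs", ["hot", "dog", "s"]], ["kings", ["king", "s"]]]
system = [["hot", "dog", "s"], ["k", "ing", "s"]]
print("P=%.2f R=%.2f F=%.2f" % evaluate_sketch(system, gold))
# prints: P=75.00 R=100.00 F=85.71
```

Counts are pooled over the whole data set before dividing (micro-averaging), which matches the worked example.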

In [15]:
import numpy as np

def evaluate(sys_segmented_data,gold_segmented_data):
    # your code here

    # your code here

    return precision, recall, fscore
    

precision, recall, fscore = evaluate(bpe_segmented_test,testdata)
print("Results for BPE segmentation:")
print("Test set precision: %.2f, recall: %.2f, f-score: %.2f" % (precision, recall, fscore))

Results for BPE segmentation:
Test set precision: 46.19, recall: 48.02, f-score: 47.08


### Assignment 4: Supervised morphological segmentation

You will now use the Python toolkit `pycrfsuite` to train a supervised morphological segmentation system. You can install `pycrfsuite` using pip (see [Installation](https://python-crfsuite.readthedocs.io/en/latest/)).

To get a better understanding of the `pycrfsuite` toolkit, you can browse through the following [tutorial](https://github.com/scrapinghub/python-crfsuite/blob/master/examples/CoNLL%202002.ipynb) for building a NER system using `pycrfsuite`. 

Famous CRF implementations:
- **crf++**  https://taku910.github.io/crfpp/
- **wapiti** https://wapiti.limsi.fr
- **crfsuite** https://www.chokkan.org/software/crfsuite/ $\Rightarrow$ `pycrfsuite`

In [14]:
import pycrfsuite

#### Assignment 4.1: Feature engineering

rubric={accuracy:10}

Our segmentation system is a feature-based structured CRF tagger, and your task now is to implement a feature extraction function called `char2features`. It takes two arguments:

1. an example in BIES format (for example, `[["d","BEGIN"],["o","INSIDE"],["g","END"],["s","SINGLE"]]`) and
1. a position in the example (for example, 2).

As a very simple example consider:

```
def char2features(example, i):
    char_at_i = example[i][0]
    features = ["CHARACTER_AT_i=%s" % char_at_i]
    return features
```

Given the arguments `example = [["d","BEGIN"],["o","INSIDE"],["g","END"],["s","SINGLE"]]` and `i = 2`, this function will return `["CHARACTER_AT_i=g"]`. Although this is a possible feature function, it is probably too simplistic to ensure good segmentation accuracy. You should expand it to include additional features like:

1. Surrounding characters around position i.
1. Substrings around position i of varying lengths.
1. Distance to the end of the example and beginning of the example.
1. Anything else you can think of.

You are allowed to use external datasets for feature engineering. If you do, please make sure to hand in your datasets together with the assignment.

**NOTE!** `char2features` is not allowed to look at the labels (for example, `"INSIDE"`). You may only extract features from the characters in `example`. 

```
oppress or
    *

o BEGIN
p INSIDE
p INSIDE
r INSIDE    
e INSIDE *  <-- current position
s INSIDE 
s END 
o BEGIN
r END
```

**Feature engineering**:
- current character at i: $w_i$ = *e*
- surrounding characters around position i: $w_{i-1}$ = *r*, $w_{i-2}$ = *p*, $w_{i+1}$ = *s*, $w_{i+2}$ = *s*
- substrings around position i of varying lengths: $w_{i-2,i-1,i}$ = *pre*, $w_{i,i+1,i+2}$ = *ess*, $w_{i-1,i,i+1}$ = *res*
- distance to the end of the example and beginning of the example: $d = 4$ 
- anything else you can come to think of ....


```
[   [
        ...     ], 
    [   ['CHAR=o', ...], 
        ['CHAR=p', ...], 
        ['CHAR=p', ...], 
        ['CHAR=r', ...], 
        ['CHAR=e', 'CHAR-1=r', 'CHAR-2=p', 'CHAR+1=s', 'CHAR+2=s', 'STR--=pre', 'STR++=ess',
            'STR-+=res', 'DIST_FROM_START=4'], 
        ['CHAR=s', ...], 
        ...     ],
    ...
]
```
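The feature list above could be produced by a function along these lines. It is a sketch, not the reference solution: `char2features_sketch` and `char_at` are hypothetical names, the feature names follow the listing above, and `DIST_FROM_END` is an extra feature added here for illustration. Positions outside the word are padded with the `BOUNDARY` symbol.

```python
BOUNDARY = "<BD>"

def char_at(chars, j):
    """Return the character at position j, or BOUNDARY outside the word."""
    return chars[j] if 0 <= j < len(chars) else BOUNDARY

def char2features_sketch(example, i):
    # Only the characters are used; the tags in example[k][1] are never read.
    chars = [pair[0] for pair in example]

    def window(lo, hi):
        """Substring covering positions lo..hi (inclusive), padded at the edges."""
        return "".join(char_at(chars, j) for j in range(lo, hi + 1))

    features = ["CHAR=%s" % chars[i]]
    for offset in (-2, -1, 1, 2):
        # "%+d" renders the offset with an explicit sign, e.g. CHAR-1 or CHAR+2.
        features.append("CHAR%+d=%s" % (offset, char_at(chars, i + offset)))
    features.append("STR--=%s" % window(i - 2, i))
    features.append("STR++=%s" % window(i, i + 2))
    features.append("STR-+=%s" % window(i - 1, i + 1))
    features.append("DIST_FROM_START=%d" % i)
    features.append("DIST_FROM_END=%d" % (len(chars) - 1 - i))
    return features

# Tags are set to None to show that the function does not depend on them.
example = [[char, None] for char in "oppressor"]
print(char2features_sketch(example, 4))
```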

In [17]:
BOUNDARY="<BD>"

def char2features(example, i):
    # your code here

    # your code here
    return features

def data2features(data):
    """ Extract features for a data set in BIES format. """
    return [[char2features(example,i) for i in range(len(example))] for example in data]

def data2labels(data):
    """ Extract the tags from a data set in BIES format. """
    return [[tok[1] for tok in example] for example in data]

# Initialize the training, development and test sets for pycrfsuite.
X_train = data2features(trainbies)
y_train = data2labels(trainbies)

X_dev = data2features(devbies)
y_dev = data2labels(devbies)

X_test = data2features(testbies)
y_test = data2labels(testbies)

#### Assignment 4.2: Training the segmentation system

rubric={accuracy:5}

Find out how to train a `pycrfsuite` model in the [tutorial](https://github.com/scrapinghub/python-crfsuite/blob/master/examples/CoNLL%202002.ipynb) (for now, just copy the hyper-parameter values from the tutorial). You should train your model using `X_train` and `y_train` which we just created. Save your model in the file `segmentation.model`.

**Training the segmentation system**  See https://github.com/scrapinghub/python-crfsuite/blob/master/examples/CoNLL%202002.ipynb


`trainer.train('segmentation.model')`
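Following the tutorial, training could be wrapped as sketched below. The helper names `crf_params` and `train_crf` are hypothetical. The default values for `c1`, `c2` and `max_iterations` are the ones visible in the training log in this notebook, and `pycrfsuite` is imported inside the function so that the parameter helper can be used on its own.

```python
def crf_params(c1=1.0, c2=1e-3, max_iterations=50):
    """Build a parameter dict for pycrfsuite.Trainer.set_params."""
    return {
        "c1": c1,                          # coefficient for the L1 penalty
        "c2": c2,                          # coefficient for the L2 penalty
        "max_iterations": max_iterations,  # stop L-BFGS after this many iterations
    }

def train_crf(X_train, y_train, model_path="segmentation.model", **kwargs):
    """Train a CRF on feature/label sequences and save it to model_path."""
    import pycrfsuite  # requires the python-crfsuite package

    trainer = pycrfsuite.Trainer(verbose=False)
    # Each training item is one word: a list of feature lists plus a list of tags.
    for xseq, yseq in zip(X_train, y_train):
        trainer.append(xseq, yseq)
    trainer.set_params(crf_params(**kwargs))
    trainer.train(model_path)
    return model_path

print(crf_params(c1=0.1))
```

With the variables defined earlier, a call would look like `train_crf(X_train, y_train, c1=0.1, c2=0.01)`.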

In [18]:
trainer = pycrfsuite.Trainer(verbose=True)

# your code here


trainer.train('segmentation.model')
# your code here

Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 11041
Seconds required: 0.033

L-BFGS optimization
c1: 1.000000
c2: 0.001000
num_memories: 6
max_iterations: 50
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

***** Iteration #1 *****
Loss: 11501.826562
Feature norm: 1.000000
Error norm: 3365.016869
Active features: 5599
Line search trials: 1
Line search step: 0.000187
Seconds required for this iteration: 0.008

***** Iteration #2 *****
Loss: 7643.037593
Feature norm: 3.135730
Error norm: 2641.393094
Active features: 5285
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.005

***** Iteration #3 *****
Loss: 5919.984383
Feature norm: 4.274837
Error norm: 1174.072773
Active features: 4904
Line search trials: 1
Line search step: 1.000000
Seconds required for this ite

#### Assignment 4.3: Fine-tuning and segmentation

rubric={accuracy:5}

You can now segment the development data `devbies` using the function `segment` below. You should get an f-score above 84% on the development data. Please tune the L1 and L2 penalties for `pycrfsuite.Trainer` using the segmentation f-score on the development data as the criterion. Remember that you will need to retrain the model and reopen it in `tagger` after adjusting the hyperparameter values.

When you are done with tuning, you can segment the test set.

**To prevent overfitting**:
- L1 regularization (the penalty used in Lasso regression: Least Absolute Shrinkage and Selection Operator) adds the absolute value of each coefficient to the loss function as a penalty term, which tends to drive many weights to exactly zero.
- L2 regularization (the penalty used in Ridge regression) adds the squared magnitude of each coefficient as a penalty term, which shrinks weights without zeroing them.


`pycrfsuite.Trainer`:
```
    'c1': float,  # coefficient for L1 penalty
    'c2': float,  # coefficient for L2 penalty
```
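Tuning `c1` and `c2` can be organised as a small grid search. The sketch below is independent of `pycrfsuite`: `grid_search` is a hypothetical helper that expects a callback which retrains the model with the given penalties and returns the development f-score. The `toy_score` function is a made-up stand-in so that the sketch runs on its own; its values are not real results.

```python
from itertools import product

def grid_search(train_and_score, c1_values, c2_values):
    """Return (best_fscore, best_c1, best_c2) over all (c1, c2) combinations."""
    best = None
    for c1, c2 in product(c1_values, c2_values):
        fscore = train_and_score(c1, c2)
        if best is None or fscore > best[0]:
            best = (fscore, c1, c2)
    return best

# Toy stand-in for "retrain with (c1, c2), segment devbies, return the dev f-score".
# It peaks at c1=0.1, c2=0.01 purely for demonstration.
def toy_score(c1, c2):
    return 100 - 10 * abs(c1 - 0.1) - 10 * abs(c2 - 0.01)

best = grid_search(toy_score, [0.01, 0.1, 1.0], [0.001, 0.01, 0.1])
print(best)
```

Select the penalties on the development set only, and report test-set scores once with the chosen values.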

In [19]:
def segment(data,tagger):
    tagged_data = []
    for i,example in enumerate(data2features(data)):
        tags = tagger.tag(example)
        tagged_example = [(char, tag) for (char, _), tag in zip(data[i],tags)]
        tagged_data.append(tagged_example)
    return [segmented for _,segmented in unbies(tagged_data)]
    
tagger = pycrfsuite.Tagger()
tagger.open('segmentation.model')

supervised_tokenized_dev = segment(devbies,tagger)
print("Results for supervised segmentation:")
print("Development set precision: %.2f, recall: %.2f, f-score: %.2f" % evaluate(supervised_tokenized_dev[:10],devdata[:10]))
supervised_tokenized_dev[:10]

# When you're done tuning the hyper-parameters, run the lines below to segment the test data.

supervised_tokenized_test = segment(testbies,tagger)
print("Results for supervised segmentation:")
print("Test set precision: %.2f, recall: %.2f, f-score: %.2f" % evaluate(supervised_tokenized_test,testdata))

Results for supervised segmentation:
Development set precision: 81.82, recall: 90.00, f-score: 85.71
Results for supervised segmentation:
Test set precision: 87.02, recall: 79.74, f-score: 83.22


**Note that these f-scores are only examples; your results may differ.**