# Colx 581 Lab 1: Crosslingual Transfer for Danish NER (Cheat sheet)

In this lab, we will build a named entity recognizer (NER) for Danish using a small dataset of named entity (NE) annotated text for Danish and a larger dataset of NE annotated text for English. We will use the [spaCy](https://v2.spacy.io/) toolkit, which you should install first. You can use either spaCy version 2 or 3 for this assignment. Please see practical work 1 on Canvas for installation instructions.

In the first assignment, you need to build a function which **reads data** and another function which **converts the data into spaCy format**. You'll also create tests to check that your data handling works properly. Finally, you will create an **evaluation function** for NER.

In the second assignment, you'll **train a spaCy NER model** on a small dataset of NER annotations and evaluate your NER model on Danish development data.

In the third assignment, you'll first **train a NER model on English** NE annotated data and aligned bilingual word embeddings for Danish and English. You will then **fine-tune your model on the Danish NE** train set and finally evaluate your recognizer on the Danish test set.

## Assignment 1. Data handling

### Assignment 1.1 Reading data from disk

rubric={accuracy:5}

The directory `Lab1/data` contains Danish training, development and test data for NER in a simple two-column format (where a tab separates the columns). All tokens are marked with BIO tags and the data contains four entity types: `LOC` (locations), `MISC` (miscellaneous), `ORG` (organizations) and `PER` (person names):

```
Berlingske      B-ORG
Tidendes        O
afslag          O
kom             O
først           O
seks            O
uger            O
senere          O
og              O
lignede         O
til             O
forveksling     O
afslaget        O
fra             O
Jyllands-Posten B-ORG
.               O
```

Sentences are separated by blank lines.

Define a function `read_data` which takes a file as input and returns a list of sentences. Each sentence is a list of pairs `(token, bio_tag)`, for example `("Jyllands-Posten", "B-ORG")`.  

**You should also write 3 tests** in addition to the existing test below for `read_data`. You can imitate the test below `read_data`. These tests use the class [`io.StringIO`](https://docs.python.org/3/library/io.html#io.StringIO) which provides a file-like interface to strings.

You should test that `read_data` does something sensible when given an empty file and a file with more than one sentence. You should also check that your implementation can handle a blank line at the end of the file correctly (we don't want empty sentences in the returned list). 
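
The cases described above can be exercised with `io.StringIO`. The following is a minimal sketch, not the reference solution: `read_conll_sketch` is a hypothetical name, and returning an empty list for an empty file is an assumption about what counts as "sensible" here.

```python
import io

def read_conll_sketch(file):
    """Read tab-separated token/tag lines; blank lines separate sentences."""
    sentences, current = [], []
    for line in file:
        line = line.strip()
        if not line:
            # Blank line: close the current sentence, but never store an empty one.
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.split("\t")
        current.append((token, tag))
    if current:
        # The last sentence is not necessarily followed by a blank line.
        sentences.append(current)
    return sentences

# Empty file: no sentences at all (assumed behaviour).
assert read_conll_sketch(io.StringIO("")) == []

# Two sentences plus a trailing blank line: no empty sentence at the end.
two_sentences = "Sue\tB-PER\nslept\tO\n\nOslo\tB-LOC\n\n"
assert read_conll_sketch(io.StringIO(two_sentences)) == [
    [("Sue", "B-PER"), ("slept", "O")],
    [("Oslo", "B-LOC")],
]
```

Closing a sentence only when it is non-empty is what keeps consecutive or trailing blank lines from producing empty sentences.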

In [10]:
import os.path as path

def read_data(file):
    data = [[]]
    # your code here

    # your code here
    return data

import io

# A test to make sure that your read_data function can handle a file 
# which consists of a single sentence. 
test_string = "Berlingske\tB-ORG"
assert(read_data(io.StringIO(test_string)) == [[("Berlingske","B-ORG")]])

# You should write three additional tests here:
# your code here

# your code here

You should now read the Danish training, development and test data:

In [2]:
danish_train = read_data(open(path.join("data","danish-train.conll")))
danish_dev = read_data(open(path.join("data","danish-dev.conll")))
danish_test = read_data(open(path.join("data","danish-test.conll")))

print(danish_train[0])

[('På', 'O'), ('fredag', 'O'), ('har', 'O'), ('SID', 'B-ORG'), ('inviteret', 'O'), ('til', 'O'), ('reception', 'O'), ('i', 'O'), ('SID-huset', 'O'), ('i', 'O'), ('anledning', 'O'), ('af', 'O'), ('at', 'O'), ('formanden', 'O'), ('Kjeld', 'B-PER'), ('Christensen', 'I-PER'), ('går', 'O'), ('ind', 'O'), ('i', 'O'), ('de', 'O'), ('glade', 'O'), ('tressere', 'O'), ('.', 'O')]


### Assignment 1.2 Converting data into spaCy format

rubric={accuracy:10}

The NER models in spaCy take training data in a specific format which differs from the BIO annotation in our files:

```
('På fredag har SID inviteret til reception i SID-huset i anledning af at formanden Kjeld Christensen går ind i de glade tressere .', {'entities': [[14, 17, 'ORG'], [82, 99, 'PER']]})
```

Each example is a pair where the first member is a sentence string like `'På fredag har SID inviteret til reception i SID-huset i anledning af at formanden Kjeld Christensen går ind i de glade tressere .'`. Note that the tokens in the sentence including punctuation are separated by spaces. The second member is a dictionary which specifies all named entities in a list `{'entities': [[14, 17, 'ORG'], [82, 99, 'PER']]}`.

Each entity like `[14, 17, 'ORG']` gives the start index of the entity (`14` in this case), its end index (`17`) and the entity type (`ORG`). The end index is exclusive, as in Python slicing. The organization here is:

```
print('På fredag har SID inviteret til reception i SID-huset i anledning af at formanden Kjeld Christensen går ind i de glade tressere .'[14:17])

SID
```
and the person is:
```
print('På fredag har SID inviteret til reception i SID-huset i anledning af at formanden Kjeld Christensen går ind i de glade tressere .'[82:99])

Kjeld Christensen
```

It is your task to define a function `get_spacy_ner_data` which takes a dataset in BIO-format and returns it in spaCy format.

**You should also write 5 tests** in addition to the existing test below for `get_spacy_ner_data`. You can model your tests on the example that is given. Your tests should cover at least the following cases: 

1. A sentence with a single entity consisting of one token.
1. A sentence with multiple entities.
1. A sentence with an entity containing several tokens.
1. A sentence with two entities next to each other like *Anne showed Sue Mengqiu Huang 's new painting* (note the space after *Huang*), where *Anne*, *Sue* and *Mengqiu Huang* are all separate person names.
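
The offset bookkeeping behind these cases can be sketched as follows. This is one possible approach under stated assumptions, not the reference solution: `bio_to_spacy_sketch` is a hypothetical name, it handles a single sentence, and it treats a stray `I-` tag without a matching open entity like `O`.

```python
def bio_to_spacy_sketch(sentence):
    """Convert one [(token, bio_tag), ...] sentence into (text, {"entities": [...]})."""
    entities = []
    current = None  # entity being built: [start, end, label]
    offset = 0      # character position where the next token starts
    for token, tag in sentence:
        end = offset + len(token)
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, even directly after another one.
            if current is not None:
                entities.append(current)
            current = [offset, end, tag[2:]]
        elif tag.startswith("I-") and current is not None and current[2] == tag[2:]:
            # Continuation: extend the open entity to the end of this token.
            current[1] = end
        else:
            # "O" (or a stray I- tag, treated like "O" in this sketch).
            if current is not None:
                entities.append(current)
                current = None
        offset = end + 1  # +1 for the space that joins tokens
    if current is not None:
        entities.append(current)
    text = " ".join(token for token, _ in sentence)
    return text, {"entities": entities}

adjacent = [("Anne", "B-PER"), ("showed", "O"), ("Sue", "B-PER"),
            ("Mengqiu", "B-PER"), ("Huang", "I-PER"), ("'s", "O"),
            ("new", "O"), ("painting", "O")]
text, annotations = bio_to_spacy_sketch(adjacent)
assert text[16:29] == "Mengqiu Huang"
assert annotations == {"entities": [[0, 4, "PER"], [12, 15, "PER"], [16, 29, "PER"]]}
```

Because a `B-` tag closes any open entity before starting a new one, *Sue* and *Mengqiu Huang* stay separate even though no `O` token sits between them.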

In [8]:
sentence = 'På fredag har SID inviteret til reception i SID-huset i anledning af at formanden Kjeld Christensen går ind i de glade tressere .'

# Slice the sentence with the character offsets of each entity.
print(sentence[14:17], "\tORG")
print(sentence[82:99], "\tPER")

SID 	ORG
Kjeld Christensen 	PER


In [3]:
def get_entity(tag):
    return tag[2:]

def get_spacy_ner_data(data):
    result = []
    # your code here

    # your code here
    return result

# A test to make sure that your get_spacy_ner_data function can handle a single 
# example with no entities
test_data = [[("The","O"),
              ("dog","O"),
              ("slept","O"),
              (".","O")]]
assert(get_spacy_ner_data(test_data) == [('The dog slept .',{'entities':[]})])

# You should write five additional tests here:
# your code here


# your code here

You should now create the Danish spaCy data:

In [4]:
danish_spacy_train = get_spacy_ner_data(danish_train)
danish_spacy_dev = get_spacy_ner_data(danish_dev)
danish_spacy_test = get_spacy_ner_data(danish_test)

print(danish_spacy_train[0])

('På fredag har SID inviteret til reception i SID-huset i anledning af at formanden Kjeld Christensen går ind i de glade tressere .', {'entities': [[14, 17, 'ORG'], [82, 99, 'PER']]})


### Assignment 1.3: Evaluation for NER

rubric={"accuracy":5}

You'll now implement a function `evaluate` for computing the precision, recall and f-score of a NER model. The function takes two arguments: a dataset with NE annotations from a NER model and another dataset with gold standard NE annotations. Both datasets are given in the return format of `get_spacy_ner_data`, i.e. each example in the dataset is a pair like:

```
('På fredag har SID inviteret til reception i SID-huset i anledning af at formanden Kjeld Christensen går ind i de glade tressere .', {'entities': [[14, 17, 'ORG'], [82, 99, 'PER']]})
```

You should compute precision, recall and fscore over the entire dataset (i.e. you should **not** compute these separately for each example and then average them). Your function should return percentages, for example `67.5`.

Before you start implementing the function, make sure that you understand the computation of precision, recall and fscore for the test case under the function definition.
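
In the test case under the function definition, the system proposes three entities, the gold standard contains two, and only `(12, 17, "LOC")` matches exactly (`(0, 5, "PER")` differs from `(0, 6, "PER")` in its end offset). Precision is therefore 1/3 and recall is 1/2. The sketch below shows one way to pool the counts over the whole dataset. `micro_prf_sketch` is a hypothetical name, and two choices are assumptions: an entity counts as correct only if start, end and label all match, and a zero denominator yields 0.

```python
import math

def micro_prf_sketch(sys_data, gold_data):
    """Precision, recall and f-score (as percentages) pooled over all examples."""
    true_pos = n_sys = n_gold = 0
    for (_, sys_ann), (_, gold_ann) in zip(sys_data, gold_data):
        # Convert to tuples so that list- and tuple-style entities compare equal.
        sys_ents = {tuple(e) for e in sys_ann["entities"]}
        gold_ents = {tuple(e) for e in gold_ann["entities"]}
        true_pos += len(sys_ents & gold_ents)
        n_sys += len(sys_ents)
        n_gold += len(gold_ents)
    precision = true_pos / n_sys if n_sys else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision * 100, recall * 100, fscore * 100

sys_example = [("word1 word2 word3 word4", {"entities": [(0, 5, "PER"), (12, 17, "LOC")]}),
               ("word1 word2 word3 word4", {"entities": [(6, 11, "ORG")]})]
gold_example = [("word1 word2 word3 word4", {"entities": [(0, 6, "PER"), (12, 17, "LOC")]}),
                ("word1 word2 word3 word4", {"entities": []})]

p, r, f = micro_prf_sketch(sys_example, gold_example)
assert math.isclose(p, 100 / 3) and math.isclose(r, 50.0) and math.isclose(f, 40.0)
```

Summing the counts before dividing is what distinguishes this from averaging per-sentence scores, which the task explicitly rules out.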

In [5]:
def evaluate(sys_spacy_data,gold_spacy_data):
    precision, recall, fscore = 0, 0, 0

    # your code here

    # your code here
    
    return precision, recall, fscore

# This is a partial check that your evaluation works as it should. Please implement more tests
# if you want to assure yourself that the function works correctly.
sys_data = [("word1 word2 word3 word4",{"entities":[(0,5,"PER"),(12,17,"LOC")]}),
            ("word1 word2 word3 word4",{"entities":[(6,11,"ORG")]})]

gold_data = [("word1 word2 word3 word4",{"entities":[(0,6,"PER"),(12,17,"LOC")]}),
             ("word1 word2 word3 word4",{"entities":[]})]

precision, recall, fscore = evaluate(sys_data,gold_data)
assert(precision == 1.0/3 * 100)
assert(recall    == 1.0/2 * 100)
assert(fscore    == 2*1.0/3*1.0/2 / (1.0/3 + 1.0/2) * 100)

## Assignment 2: Training a NER model on the Danish data

Study the following [tutorial](https://spacy.io/usage/training#training-data) on training a NER model in spaCy. You will now initialize a NER model and train it on the Danish training data in spaCy format. 

### Assignment 2.1: Initializing the NER model

rubric={accuracy:5}

The first step in training the NER model is to initialize it. Implement a function `init_model`. It takes two arguments: `spacy_train_data`, an NE annotated dataset in spaCy format (as given by `get_spacy_ner_data`), and `language`, a language code (either `da` or `en`). 

Check the [example code for the `main` function](https://github.com/explosion/spaCy/blob/v2.x/examples/training/train_ner.py) in the spaCy NER tutorial for more details on initializing the model. **For SpaCy version 2**, these instructions should work out of the box. **For SpaCy version 3**, you will need to make a small change:

* Instead of calling `create_pipe()` and `get_pipe()` (SpaCy 2), simply call `add_pipe()` (SpaCy 3), which creates and returns the NER component.

You should first create a blank `spacy.Language` text processing pipeline called `model` using `spacy.blank` and then create and add a `"ner"` submodel to the model. You can use `create_pipe` and `add_pipe` here. Finally, you should add each entity type (for example, `ORG` and `PER`) to the NER model using `add_label`. Please implement this in a way which ensures that all entity types mentioned in `spacy_train_data` will be included in the model (even if there are more types than the ones included in our Danish NER data).
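
Gathering the entity types directly from the data, rather than hard-coding them, could look like the following plain-Python sketch. `collect_labels_sketch` is a hypothetical helper name; each returned label would then be passed to the NER component's `add_label`.

```python
def collect_labels_sketch(spacy_data):
    """Return every entity type that occurs in a dataset in spaCy format."""
    labels = set()
    for _text, annotations in spacy_data:
        for _start, _end, label in annotations["entities"]:
            labels.add(label)
    # Sorting gives a stable order across runs (sets are unordered).
    return sorted(labels)

toy_data = [("word1 word2 word3 word4", {"entities": [(0, 5, "PER"), (12, 17, "LOC")]}),
            ("word1 word2 word3 word4", {"entities": [(6, 11, "ORG")]}),
            ("word1 word2 word3 word4", {"entities": []})]
assert collect_labels_sketch(toy_data) == ["LOC", "ORG", "PER"]
```

Deriving the labels from the data means the same `init_model` works unchanged for a dataset with a different tag set.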

In [6]:
import spacy 
from spacy.util import minibatch, compounding
from random import shuffle, seed
import numpy as np
import torch

def init_model(spacy_train_data, language):
    model = spacy.blank(language)

    seed(0)
    np.random.seed(0)
    spacy.util.fix_random_seed(0)
    torch.manual_seed(0)
    
    # your code here
    # For spaCy v2:
    # ner = model.create_pipe("ner")
    # model.add_pipe(ner, last=True)

    # For spaCy v3, `add_pipe` creates and returns the component,
    # see https://spacy.io/api/language#add_pipe
    # ner = model.add_pipe("ner", last=True)

    # Iterate over spacy_train_data and register every entity type:
    # {"entities":
    #   [   (0,5,"PER"),                <--- `ner.add_label("PER")`
    #       (12,17,"LOC") ]}            <--- `ner.add_label("LOC")`


    # your code here
    # Make sure we're only training the NER component of the pipeline
    pipe_exceptions = ["ner"]
    other_pipes = [pipe for pipe in model.pipe_names if pipe not in pipe_exceptions]

    # Start training so that we can use the model to annotate data

    # For SpaCy v2:
    # model.disable_pipes(*other_pipes)
    # optimizer = model.begin_training()
    
    model.select_pipes(disable=other_pipes)
    optimizer = model.initialize()
    
    return model, optimizer

danish_untrained_model, _ = init_model(danish_spacy_train,"da")

### Assignment 2.2: Annotating the development set

rubric={"accuracy":5}

You should now implement a function `annotate` for annotating a dataset using your NER model. The function takes a dataset in spaCy format and a NER model as input and returns the annotated dataset in the same format.

You can apply a model to an input sentence in the following way:

```
annotated = danish_model("Sue saw a dog .")
```

`annotated` is a [spacy.tokens.Doc](https://v2.spacy.io/api/doc) object and `annotated.ents` is a tuple of entities, each of which is a [`spacy.tokens.span.Span`](https://v2.spacy.io/api/span) object. (**HINT:** `Span.start_char`, `Span.end_char` and `Span.label_` might prove useful). For a dataset containing the example above, the `annotate` function might return the list:

```
[("Sue saw a dog .", {"entities":[(0,3,"PER")]})]
```
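
The conversion from entity spans back to the spaCy data format can be sketched without loading a model. The code below is only an illustration: `ents_to_annotation_sketch` is a hypothetical helper, and `SimpleNamespace` objects stand in for real `Span` objects, exposing just the three attributes named in the hint.

```python
from types import SimpleNamespace

def ents_to_annotation_sketch(text, ents):
    """Build one (text, {"entities": [...]}) pair from span-like objects."""
    entities = [(ent.start_char, ent.end_char, ent.label_) for ent in ents]
    return (text, {"entities": entities})

# Stand-in for the `ents` tuple of an annotated Doc (not a real spaCy object).
fake_ents = (SimpleNamespace(start_char=0, end_char=3, label_="PER"),)

assert ents_to_annotation_sketch("Sue saw a dog .", fake_ents) == \
    ("Sue saw a dog .", {"entities": [(0, 3, "PER")]})
```

In the lab itself, `ents` would come from applying the model to each sentence string in the dataset.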

In [7]:
def annotate(spacy_data, model):
    result = []
    
    # your code here

    # your code here
    
    return result

# We're using an empty model which means that all annotations should be empty.
# Note! Passing this assertion does not guarantee that your function works properly.
danish_spacy_dev_sys = annotate(danish_spacy_dev, danish_untrained_model)
for sentence, annotations in danish_spacy_dev_sys:
    assert(annotations == {"entities":[]})

### Assignment 2.3 Training the NER model

rubric={"accuracy":5}

You should now implement the training function for a spaCy NER model. Check the [example code for the `main` function](https://github.com/explosion/spaCy/blob/v2.x/examples/training/train_ner.py) in the spaCy NER tutorial. **For SpaCy version 2**, these instructions should work out of the box. **For SpaCy version 3**, you will need to implement a small change:

* Instead of giving a list of sentences and a list of annotations as parameters to `model.update()`, you should first compile each sentence and annotation into an `Example` object. `Example.from_dict` expects a `Doc` as its first argument, so convert the sentence string first: `Example.from_dict(model.make_doc(sentence), annotation)`. You can then form a list of examples and give the list as input to `model.update()`. 

The skeleton code for `train` already initializes the NER model and an optimizer. It is your task to implement the training procedure for each epoch. At the start of the epoch, you should shuffle the training data `spacy_train_data`. Then you should split the data into batches using [`spacy.util.minibatch`](https://v2.spacy.io/api/top-level#util.minibatch). You can either use batches of fixed size (5) or use `spacy.util.compounding` here to generate batches of varying size. You should then call `model.update` on each batch following the example in the [main function in the tutorial](https://github.com/explosion/spaCy/blob/v2.x/examples/training/train_ner.py) using dropout `0.1`. You should also pass `losses` as a parameter to `model.update`.
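
What shuffling followed by fixed-size batching amounts to can be shown in plain Python. This is a stand-in for illustration only: `shuffled_batches_sketch` is a hypothetical helper, and in the lab `spacy.util.minibatch` would do the slicing on the already shuffled data.

```python
from random import Random

def shuffled_batches_sketch(data, size=5, rng=None):
    """Shuffle a copy of `data` and yield consecutive batches of at most `size` items."""
    rng = rng or Random(0)
    items = list(data)  # copy, so the caller's list keeps its order
    rng.shuffle(items)
    for start in range(0, len(items), size):
        # Each batch would be turned into model input and passed to model.update.
        yield items[start:start + size]

toy_examples = [("sentence %d" % i, {"entities": []}) for i in range(12)]
batches = list(shuffled_batches_sketch(toy_examples, size=5))

assert [len(batch) for batch in batches] == [5, 5, 2]
assert sorted(ex for batch in batches for ex in batch) == sorted(toy_examples)
```

The last batch is simply shorter when the dataset size is not a multiple of the batch size, so every example is still seen once per epoch.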

In [8]:
from copy import deepcopy
from spacy.training import Example

def train(spacy_train_data, spacy_dev_data, epochs,language):
    # Initialize model and get optimizer
    model, optimizer = init_model(spacy_train_data,language)
    
    # Make sure we don't permute the original training data.
    spacy_train_data = deepcopy(spacy_train_data)
    
    for itn in range(epochs):
        losses = {}
        
        # your code here
        # `shuffle` your training data:

        # see https://stackoverflow.com/questions/56595714/training-ner-model-with-spacy-only-uses-one-core
        # see https://stackoverflow.com/questions/66675261/how-can-i-work-with-example-for-nlp-update-problem-with-spacy3-0

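        # batch up the examples using spaCy's minibatch
        # batches = minibatch(spacy_train_data, size=5)
        # for batch in batches:
        #     For spaCy v3, build a list of Example objects for the batch:
        #     examples = [Example.from_dict(model.make_doc(text), annotations)
        #                 for text, annotations in batch]
        #     model.update(examples, sgd=optimizer, drop=0.1, losses=losses)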

        # your code here
           
        # Evaluate model
        print("Loss for epoch %u: %.4f" % (itn+1, losses["ner"]))
        spacy_dev_sys = annotate(spacy_dev_data, model)
        p, r, f = evaluate(spacy_dev_sys,spacy_dev_data)
        print("  PRECISION: %.2f%%, RECALL: %.2f%%, F-SCORE: %.2f%%" % (p,r,f))
    return model

Train a NER model on the Danish training data for 20 epochs. Take note of the best f-score on the development data during training. You should achieve an f-score above 45%.

In [9]:
danish_model = train(danish_spacy_train,danish_spacy_dev,20,"da")
print()
print("Evaluating model on development set:")

danish_spacy_dev_sys = annotate(danish_spacy_dev, danish_model)

p, r, f = evaluate(danish_spacy_dev_sys,danish_spacy_dev)
print("  PRECISION: %.2f%%, RECALL: %.2f%%, F-SCORE: %.2f%%" % (p,r,f))

Loss for epoch 1: 832.5588
  PRECISION: 46.15%, RECALL: 5.19%, F-SCORE: 9.33%
Loss for epoch 2: 174.1499
  PRECISION: 43.46%, RECALL: 32.56%, F-SCORE: 37.23%
Loss for epoch 3: 103.0850
  PRECISION: 49.48%, RECALL: 41.50%, F-SCORE: 45.14%
Loss for epoch 4: 79.4326
  PRECISION: 43.01%, RECALL: 35.45%, F-SCORE: 38.86%
Loss for epoch 5: 58.6696
  PRECISION: 52.43%, RECALL: 43.52%, F-SCORE: 47.56%
Loss for epoch 6: 34.0745
  PRECISION: 46.85%, RECALL: 44.96%, F-SCORE: 45.88%
Loss for epoch 7: 26.1795
  PRECISION: 44.55%, RECALL: 40.06%, F-SCORE: 42.19%
Loss for epoch 8: 23.6894
  PRECISION: 46.45%, RECALL: 45.24%, F-SCORE: 45.84%
Loss for epoch 9: 19.0971
  PRECISION: 43.14%, RECALL: 44.38%, F-SCORE: 43.75%
Loss for epoch 10: 16.9120
  PRECISION: 49.01%, RECALL: 35.73%, F-SCORE: 41.33%
Loss for epoch 11: 13.8621
  PRECISION: 46.63%, RECALL: 43.80%, F-SCORE: 45.17%
Loss for epoch 12: 7.7922
  PRECISION: 47.48%, RECALL: 43.52%, F-SCORE: 45.41%
Loss for epoch 13: 4.8169
  PRECISION: 46.85%, RE

### Assignment 2.4 Pocket learning  (optional)

rubric={accuracy:3}

During training, the performance of the model on the development set may fluctuate. If you just train for a fixed number of epochs like above, there is no guarantee that your final model will be your best one. You should re-write your training algorithm to ["pocket"](https://en.wikipedia.org/wiki/Perceptron#Variants) the best model so far. 

Your new `train` function should evaluate f-score on the development set after each epoch and return the model which attained the highest development f-score. One way to accomplish this might be to write the model to disk whenever a new high score is attained and then read the best model from disk and return it.  
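
The write-to-disk variant of pocketing can be illustrated with a toy stand-in for the model. In this sketch everything is hypothetical: `pocket_on_disk_sketch`, the `update` and `score` callbacks, and the use of `pickle` in place of a model's own save and load routines. It also assumes at least one epoch is run.

```python
import os
import pickle
import tempfile

def pocket_on_disk_sketch(state, update, score, epochs, path):
    """Train for `epochs` rounds, saving `state` whenever the score improves."""
    best_score = float("-inf")
    for _ in range(epochs):
        update(state)            # one epoch of training, mutating `state` in place
        current = score(state)   # e.g. f-score on the development set
        if current > best_score:
            best_score = current
            with open(path, "wb") as handle:
                pickle.dump(state, handle)
    # Read the best snapshot back, which is not necessarily the final state.
    with open(path, "rb") as handle:
        return pickle.load(handle), best_score

dev_scores = [10.0, 40.0, 30.0, 45.0, 20.0]  # made-up scores, one per epoch

def toy_update(state):
    state["epoch"] += 1

def toy_score(state):
    return dev_scores[state["epoch"] - 1]

with tempfile.TemporaryDirectory() as tmp:
    best_state, best_f = pocket_on_disk_sketch(
        {"epoch": 0}, toy_update, toy_score, 5, os.path.join(tmp, "best.pkl"))

assert best_state == {"epoch": 4} and best_f == 45.0
```

The returned state comes from epoch 4 even though training ran for 5 epochs, which is the whole point of pocketing.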

In [10]:
def train(spacy_train_data, spacy_dev_data, epochs,language):
    # Initialize model and get optimizer
    model, optimizer = init_model(spacy_train_data,language)
    
    # Make sure we don't permute the original training data.
    spacy_train_data = deepcopy(spacy_train_data)
    best_f = 0
    best_model = None
    
    for itn in range(epochs):
        losses = {}
        
        # your code here
        # training should be same as before (see 2.3)
        # your code here
           
        # Evaluate model
        print("Loss for epoch %u: %.4f" % (itn+1, losses["ner"]))
        spacy_dev_sys = annotate(spacy_dev_data, model)
        p, r, f = evaluate(spacy_dev_sys,spacy_dev_data)
        print("  PRECISION: %.2f%%, RECALL: %.2f%%, F-SCORE: %.2f%%" % (p,r,f))
        if f > best_f:
            best_f = f
            best_model = deepcopy(model)
    return best_model

Please re-train the Danish model for 20 epochs. This should improve your f-score.

In [11]:
danish_model = train(danish_spacy_train,danish_spacy_dev,20,"da")
print()
print("Evaluating model on development set:")
danish_spacy_dev_sys = annotate(danish_spacy_dev, danish_model)
p, r, f = evaluate(danish_spacy_dev_sys,danish_spacy_dev)
print("  PRECISION: %.2f%%, RECALL: %.2f%%, F-SCORE: %.2f%%" % (p,r,f))


# PRECISION: 51.63%, RECALL: 45.53%, F-SCORE: 48.39%

### Assignment 2.5 Adding pretrained bilingual embeddings

rubric={"accuracy":3}

We'll now initialize the model using pretrained bilingual Danish and English word embeddings. You should implement a new version of `init_model` which is identical to your previous implementation except that you load a pretrained embedding. You can do this using the member function `from_disk` of `model.vocab`. Check the [spaCy documentation](https://v2.spacy.io/api/vocab#from_disk) for further information. Load the embedding from `data/vocab`.

**For spaCy version 3**, it is not sufficient to simply call `from_disk()`. You will additionally need to pass the argument `config={"model":{"tok2vec":{"pretrained_vectors":True}}}` when initializing the NER component of the model using `add_pipe()`. Otherwise, spaCy will load the embeddings but won't actually use them (which results in a lower f-score). 

In [18]:
from spacy.vocab import Vocab

def init_model(spacy_train_data, language):
    model = spacy.blank(language)#config={"paths":{"vectors":"data/vocab"}})

    seed(0)
    np.random.seed(0)
    spacy.util.fix_random_seed(0)
    torch.manual_seed(0)
    
    # your code here

    # `add_pipe` for "ner", https://spacy.io/api/language#add_pipe
    # model.add_pipe("ner", config={"model":{"tok2vec":{"pretrained_vectors":True}}}, last=True)
    # ...
    # your code here

    # Make sure we're only training the NER component of the pipeline
    pipe_exceptions = ["ner"]
    other_pipes = [pipe for pipe in model.pipe_names if pipe not in pipe_exceptions]

    # Start training so that we can use the model to annotate data

    # For spaCy v2:
    # model.disable_pipes(*other_pipes)
    # optimizer = model.begin_training()

    model.select_pipes(disable=other_pipes)
    optimizer = model.initialize()

    return model, optimizer

Train a new model with pretrained embeddings on the Danish NER data. Take note of the best f-score on the development data during training. You should achieve an f-score of around 55%.

spaCy may print a warning about renaming the embedding vectors. This is not a cause for concern.

In [13]:
danish_model = train(danish_spacy_train,danish_spacy_dev,20,"da")
print()
print("Evaluating model on development set:")
danish_spacy_dev_sys = annotate(danish_spacy_dev, danish_model)
p, r, f = evaluate(danish_spacy_dev_sys,danish_spacy_dev)
print("  PRECISION: %.2f%%, RECALL: %.2f%%, F-SCORE: %.2f%%" % (p,r,f))

Loss for epoch 1: 838.1308
  PRECISION: 45.92%, RECALL: 12.97%, F-SCORE: 20.22%
Loss for epoch 2: 155.7281
  PRECISION: 50.16%, RECALL: 44.38%, F-SCORE: 47.09%
Loss for epoch 3: 93.2945
  PRECISION: 52.98%, RECALL: 46.11%, F-SCORE: 49.31%
Loss for epoch 4: 113.7654
  PRECISION: 50.26%, RECALL: 56.20%, F-SCORE: 53.06%
Loss for epoch 5: 45.5698
  PRECISION: 54.99%, RECALL: 58.79%, F-SCORE: 56.82%
Loss for epoch 6: 35.9901
  PRECISION: 58.14%, RECALL: 50.43%, F-SCORE: 54.01%
Loss for epoch 7: 26.9974
  PRECISION: 53.01%, RECALL: 53.31%, F-SCORE: 53.16%
Loss for epoch 8: 16.4708
  PRECISION: 56.62%, RECALL: 53.03%, F-SCORE: 54.76%
Loss for epoch 9: 15.3483
  PRECISION: 53.13%, RECALL: 51.30%, F-SCORE: 52.20%
Loss for epoch 10: 11.9133
  PRECISION: 55.46%, RECALL: 57.06%, F-SCORE: 56.25%
Loss for epoch 11: 13.5871
  PRECISION: 55.45%, RECALL: 51.30%, F-SCORE: 53.29%
Loss for epoch 12: 8.1043
  PRECISION: 57.72%, RECALL: 53.89%, F-SCORE: 55.74%
Loss for epoch 13: 13.7993
  PRECISION: 60.66%,

## Assignment 3: Training an English NER model and fine-tuning on Danish data

Run the following code to train an English NER system on the [CoNLL 2003 dataset](https://www.aclweb.org/anthology/W03-0419.pdf). This can take a while because the CoNLL dataset is far bigger than our Danish training data. That's why we only train for 5 epochs.

Note that F-score will be very high here because we're evaluating on English data. 

In [14]:
english_train = read_data(open(path.join("data","english-train.conll")))
english_dev = read_data(open(path.join("data","english-dev.conll")))

english_spacy_train = get_spacy_ner_data(english_train)
english_spacy_dev = get_spacy_ner_data(english_dev)

english_model = train(english_spacy_train,english_spacy_dev,5,"en")

Loss for epoch 1: 11191.4381
  PRECISION: 87.00%, RECALL: 86.03%, F-SCORE: 86.51%
Loss for epoch 2: 5555.7599
  PRECISION: 87.34%, RECALL: 87.65%, F-SCORE: 87.49%
Loss for epoch 3: 4278.0694
  PRECISION: 88.54%, RECALL: 87.80%, F-SCORE: 88.17%
Loss for epoch 4: 3421.0851
  PRECISION: 89.35%, RECALL: 88.49%, F-SCORE: 88.92%
Loss for epoch 5: 3072.2541
  PRECISION: 88.40%, RECALL: 88.46%, F-SCORE: 88.43%


### Assignment 3.1: Fine-tuning the model on Danish data

rubric={accuracy:3}

You should now fine-tune the English NER model on the Danish training data. The function `retrain` below is very similar to the `train` function which you implemented earlier, but `retrain` does not initialize the NER model. Instead, it takes the model as a parameter and continues training it. 

You only need to copy your code from `train` and then you are ready to fine-tune the model. After fine-tuning, the model f-score should be around 60%. 

In [15]:
from copy import deepcopy

def retrain(spacy_train_data, spacy_dev_data, epochs, model, pretrained_fn=None):
    # Make sure we don't permute the original training data.
    spacy_train_data = deepcopy(spacy_train_data)
    
    model = deepcopy(model)
    
    for itn in range(epochs):
        losses = {}
        
        # your code here

        # your code here
        
        
        # Evaluate model
        print("Loss for epoch %u: %.4f" % (itn+1, losses["ner"]))
        spacy_dev_sys = annotate(spacy_dev_data, model)
        p, r, f = evaluate(spacy_dev_sys,spacy_dev_data)
        print("  PRECISION: %.2f%%, RECALL: %.2f%%, F-SCORE: %.2f%%" % (p,r,f))
    return model

transfer_model = retrain(danish_spacy_train, danish_spacy_dev, 20,english_model)

Loss for epoch 1: 199.5989
  PRECISION: 56.16%, RECALL: 53.89%, F-SCORE: 55.00%
Loss for epoch 2: 83.7119
  PRECISION: 59.48%, RECALL: 59.65%, F-SCORE: 59.57%
Loss for epoch 3: 25.3469
  PRECISION: 64.74%, RECALL: 58.21%, F-SCORE: 61.31%
Loss for epoch 4: 7.9118
  PRECISION: 63.20%, RECALL: 61.38%, F-SCORE: 62.28%
Loss for epoch 5: 3.0076
  PRECISION: 64.48%, RECALL: 62.25%, F-SCORE: 63.34%
Loss for epoch 6: 0.6200
  PRECISION: 62.29%, RECALL: 62.82%, F-SCORE: 62.55%
Loss for epoch 7: 1.3127
  PRECISION: 62.72%, RECALL: 61.10%, F-SCORE: 61.90%
Loss for epoch 8: 0.0002
  PRECISION: 62.80%, RECALL: 60.81%, F-SCORE: 61.79%
Loss for epoch 9: 0.0001
  PRECISION: 63.10%, RECALL: 61.10%, F-SCORE: 62.08%
Loss for epoch 10: 0.0003
  PRECISION: 62.24%, RECALL: 60.81%, F-SCORE: 61.52%
Loss for epoch 11: 0.0000
  PRECISION: 62.43%, RECALL: 60.81%, F-SCORE: 61.61%
Loss for epoch 12: 0.0000
  PRECISION: 62.83%, RECALL: 61.38%, F-SCORE: 62.10%
Loss for epoch 13: 0.0010
  PRECISION: 63.13%, RECALL: 61

After tuning, apply both the basic NER model and the transfer model to the Danish test data. Your f-score should be around 55% for the basic model and around 60% for the transfer model.

In [16]:
print("Evaluating basic Danish model on test set:")
danish_spacy_test_sys_basic = annotate(danish_spacy_test, danish_model)
p, r, f = evaluate(danish_spacy_test_sys_basic,danish_spacy_test)
print("  PRECISION: %.2f%%, RECALL: %.2f%%, F-SCORE: %.2f%%" % (p,r,f))
print()

print("Evaluating basic transfer model on test set:")
danish_spacy_test_sys_transfer = annotate(danish_spacy_test, transfer_model)
p, r, f = evaluate(danish_spacy_test_sys_transfer,danish_spacy_test)
print("  PRECISION: %.2f%%, RECALL: %.2f%%, F-SCORE: %.2f%%" % (p,r,f))

Evaluating basic Danish model on test set:
  PRECISION: 60.23%, RECALL: 52.82%, F-SCORE: 56.28%

Evaluating basic transfer model on test set:
  PRECISION: 62.83%, RECALL: 61.54%, F-SCORE: 62.18%


### Assignment 3.2 Analyzing the results (optional)

rubric={"reasoning":3}

Investigate `danish_spacy_test_sys_basic` and `danish_spacy_test_sys_transfer` to figure out why the transfer model delivers better results than the basic NER model. You can approach the question in different ways. You might look at things like:

* Does the transfer model get better at identifying names which are similar in the Danish and English NER data?
* Does the transfer model identify more purely Danish names which are not found by the basic model? 
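
One possible starting point for such an investigation is to list the gold entities that only one of the two systems recovers. The sketch below is a suggestion, not part of the assignment skeleton: `compare_systems_sketch` is a hypothetical helper and the toy data is invented for illustration.

```python
def compare_systems_sketch(basic_sys, transfer_sys, gold):
    """Return correct entities found only by the transfer model and only by the basic model."""
    only_transfer, only_basic = [], []
    for (text, basic), (_, transfer), (_, gold_ann) in zip(basic_sys, transfer_sys, gold):
        gold_ents = {tuple(e) for e in gold_ann["entities"]}
        basic_ok = {tuple(e) for e in basic["entities"]} & gold_ents
        transfer_ok = {tuple(e) for e in transfer["entities"]} & gold_ents
        # Keep the surface string so the names themselves can be inspected.
        only_transfer += [(text[s:e], label) for s, e, label in sorted(transfer_ok - basic_ok)]
        only_basic += [(text[s:e], label) for s, e, label in sorted(basic_ok - transfer_ok)]
    return only_transfer, only_basic

toy_gold = [("Sue visited Aarhus .", {"entities": [(0, 3, "PER"), (12, 18, "LOC")]})]
toy_basic = [("Sue visited Aarhus .", {"entities": [(12, 18, "LOC")]})]
toy_transfer = [("Sue visited Aarhus .", {"entities": [(0, 3, "PER"), (12, 18, "LOC")]})]

gained, lost = compare_systems_sketch(toy_basic, toy_transfer, toy_gold)
assert gained == [("Sue", "PER")] and lost == []
```

Running this on the real test outputs would give two lists of names, which can then be sorted into names that also occur in the English data and names that are specific to Danish.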

YOUR ANSWER HERE