
---

# WSD by fine-tuning a transformer-based pre-trained model

**Copy this notebook (File>Save a copy in Drive)**

**Add your name in notebook's name**

**Deadlines**
- send me a shared link by email with subject "ML3 finetuning for WSD + last NAME + NAME", before **Dec 27**
- don't forget to give me **edit rights**
- the execution traces should be visible
- **Strong advice**:
  - do TODO1 and TODO2 by the next lab session (Dec 8)
  - freeze FlauBERT's parameters in your preliminary experiments AND check that you did it right
  - don't forget to enable the use of a GPU on Colab:
    - Runtime > Change runtime type > Hardware accelerator => select "T4 GPU", which is enough
    - in the French interface: Exécution > Modifier le type d'exécution > Accélérateur matériel > T4 GPU



We will use the French FrameNet "[ASFALDA](http://asfalda.linguist.univ-paris-diderot.fr/frameIndex.xml)" dataset to experiment with the Word Sense Disambiguation (WSD) task.

In this dataset, some words have been manually associated with a semantic frame:
- these words are called the **"targets"**
- finding the correct frame for a given word token is a word sense disambiguation task **(WSD)**
- note though that a single frame pertains to several lexical units (e.g. FR_Commerce_buy => acheter.v, achat.n, acquérir.v, etc...)
- for this lab session, sentences containing several targets have been duplicated: each line corresponds to a (sentence, target) pair.

FrameNet data also contains annotations for the semantic roles of semantic arguments (Buyer, Seller, Goods ...), which will be ignored for this lab session.

So, the objective of this lab is to build a classifier:
- input = a (sentence, target) pair
- output = a probability distribution over the various "senses" (namely frames)
  - in the basic version, we do not require that a given target be associated only with its possible senses (namely those seen in the training data for this target lemma).

A central trait of our classifier will be to use the contextual representation of the target, as output by a transformer-based pre-trained language model.

Note that *BERT*-like models provide vectors for tokens, each token being potentially a subword.
**In your base version, you will use the FlauBERT vector of the FIRST token of the target word.**

Example: for the target *comprenions* in:

*Nous comprenions bien le cours .*

tokenized as :

'\<s>', 'Nous\</w>', 'compren', 'ions\</w>', 'bien\</w>', 'le\</w>', 'cours\</w>', '.\</w>', '\</s>'

you will use the last hidden vector of "compren".

The base classifier will be a neural network comprising
- the pre-trained language model
- which provides the hidden vector of the 1st token of the target word
- plus a simple linear layer + softmax into the set of frames seen in the training set.

We have a single classifier for all lemmas. In the base version, we put no constraint on which frames can be associated with a given target-lemma.



In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
#from tqdm import tqdm
from tqdm.notebook import tqdm # for progress bars in notebooks
from random import shuffle
import os


## Naming conventions

- sentences are already segmented into words (with a rule-based tokenizer)
- but are not segmented into subwords yet
- we use "word" or "w" for the tokens obtained after pre-segmentation
- and "token" for units obtained after *BERT*-like tokenization (BPE or WordPiece etc...)

- in variable names, we distinguish
 - integer identifiers for symbols
   (for the token vocabulary, the frame vocabulary ...)
 - versus the rank of a unit (either word or token) within a sequence
- tid => token identifier
- wrk => rank of a word in the pre-tokenized sequence
- trk => rank of a token in a bert*-tokenized sequence
- tg => "target", so
 - tg_wrk = rank of the target word
 - tg_trk = rank of the first token of the target
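
To make the difference between `tg_wrk` and `tg_trk` concrete, here is a minimal sketch. The `toy_subword_split` function is a hypothetical stand-in for a subword tokenizer (it simply cuts words into chunks of at most 4 characters); it is not FlauBERT's tokenization, and no special token is added at the start of the sequence.

```python
def toy_subword_split(word, chunk=4):
    """Hypothetical stand-in for a subword tokenizer: fixed-size chunks."""
    return [word[i:i + chunk] for i in range(0, len(word), chunk)]

words = ["Conséquemment", ",", "nous", "comprendrions", "."]
tg_wrk = 3  # rank of the target word in the word sequence

tokens = []
first_trk_of_word = []  # for each word, the rank of its first token
for w in words:
    first_trk_of_word.append(len(tokens))
    tokens.extend(toy_subword_split(w))

tg_trk = first_trk_of_word[tg_wrk]
print(tokens)
print("tg_wrk = %d -> tg_trk = %d (token %r)" % (tg_wrk, tg_trk, tokens[tg_trk]))
```

The two ranks differ as soon as a word before the target is split into several tokens, and a leading special token would shift `tg_trk` by one more.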


## Get a variable for the device (CPU or GPU)



In [None]:
# in order to use a GPU
# modify notebook settings:
# Runtime > Change runtime type > Hardware accelerator => select "T4 GPU", which is enough

# if a GPU is available, we will use it
if torch.cuda.is_available():
    # object torch.device
    DEVICE = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    device_id = torch.cuda.current_device()
    gpu_properties = torch.cuda.get_device_properties(device_id)
    print("We will use GPU %d (%s) of compute capability %d.%d with "
          "%.2fGb total memory.\n" %
          (device_id,
          gpu_properties.name,
          gpu_properties.major,
          gpu_properties.minor,
          gpu_properties.total_memory / 1e9))

else:
    print('No GPU available, using the CPU instead.')
    DEVICE = torch.device("cpu")



## "ASFALDA" dataset

A French FrameNet comprising about 16000 annotated targets, labeled with about 100 distinct frames, along with their semantic role annotations.


### Fetching the data

In [None]:
if not os.path.exists('./asfalda_data_for_wsd/'):
  # shell commands can be run using !
  !pip install wget
  import wget

  # The URL of the dataset archive (.tgz).
  url = 'http://www.linguist.univ-paris-diderot.fr/~mcandito/divers/asfalda_data_for_wsd.tgz'


  if not os.path.exists('./asfalda_data_for_wsd.tgz'):
    print('Downloading dataset')
    wget.download(url, './asfalda_data_for_wsd.tgz')
  # extract the archive (also when it was already downloaded by an earlier run)
  !tar zxf asfalda_data_for_wsd.tgz



### Data loading method

In [None]:
def load_asfalda_data(gold_data_file, split_info_file):
    """
        Inputs: - asfalda gold data file
                - file indicating the corpus type for each sentence id

        Returns 4 dictionaries (whose keys are corpus types (train/dev/test))
        - sentences = list of sentences, each sent is a list of words
        - list of rank of target word in each sentence
        - list of target lemmas
        - gold labels

        Example:
        sentences['train'] = [[ ]]
         # the targets are the 3rd and first words
        tg_wrks['train'] = [2, 0]
        tg_lemmas['train'] = ['comprendre', 'comprendre']
        labels['train'] = ['frame1', 'frame2']

    """
    # load the usual split into train / dev / test
    with open(split_info_file) as s:
        lines = [ l.rstrip('\n').split('\t') for l in s ]
    split_info_dic = { line[0]:line[1] for line in lines }

    # dev / train / test sentences
    sentences = {'dev':[], 'train':[], 'test':[]}
    # the word ranks (wrk) for the target words
    tg_wrks = {'dev':[], 'train':[], 'test':[]}
    # target lemmas
    tg_lemmas = {'dev':[], 'train':[], 'test':[]}
    # the labels of targets (= frames)
    labels = {'dev':[], 'train':[], 'test':[]}

    max_sent_len = {'dev':0, 'train':0, 'test':0}
    max_tg_wrk = {'dev':0, 'train':0, 'test':0}

    with open(gold_data_file) as stream:
        gold_lines = stream.readlines()
    for line in gold_lines:
        if line.startswith('#'):
            continue
        line = line.strip()
        (sentid, tg_wrk, frame_name, tg_lemma, tg_pos, rest) = line.split('\t',5)
        # role annotation is ignored
        # sentences are pre-segmented into space-separated words
        # => we split on space, and will use the is_split_into_words=True mode of the FlauBERT tokenizer
        sentence = rest.split("\t")[-1].split(' ')
        part = split_info_dic[sentid]
        tg_wrk = int(tg_wrk)

        l = len(sentence)
        sentences[part].append(sentence)
        labels[part].append(frame_name)
        tg_wrks[part].append(tg_wrk)
        tg_lemmas[part].append(tg_lemma)
        if max_sent_len[part] < l:
            max_sent_len[part] = l
        if max_tg_wrk[part] < tg_wrk:
            max_tg_wrk[part] = tg_wrk
    print("Max sentence length:", max_sent_len)
    print("Max target rank (in words):", max_tg_wrk)

    return sentences, tg_wrks, tg_lemmas, labels


### Data loading and defining ids for labels

In [None]:
gold_data_file = './asfalda_data_for_wsd/sequoiaftb.asfalda_1_3.gold.uniq.nofullant.txt'

# usual split train / dev / test for this corpus
split_info_file = './asfalda_data_for_wsd/sequoiaftb_split_info'

sentences, tg_wrks, tg_lemmas, label_strs = load_asfalda_data(gold_data_file,
                                                              split_info_file)

for p in sentences.keys():
    avgl = sum([len(s) for s in sentences[p]])/len(sentences[p])
    print("%s : %d sentences, average length=%3.2f"
          %(p, len(sentences[p]), avgl))

# creating label ids for frames seen in training set
i2label = list(set(label_strs['train']))
# id for unknown frame (for dev and test)
i2label.append('*UNK*')

label2i = {x:i for i,x in enumerate(i2label)}
# id of special frame "Other_sense"
i_OTHER_SENSE = label2i['Other_sense']

# sequence of gold labels
# for each sub-corpus (key = dev/train/test)
labels = {}
for p in label_strs.keys():
    labels[p] = [label2i[x] if x in label2i else label2i['*UNK*'] for x in label_strs[p]]




Max sentence length: {'dev': 115, 'train': 271, 'test': 140}
Max target rank (in words): {'dev': 96, 'train': 267, 'test': 115}
SPLIT 18657 items into 16792 and 1865
dev : 2688 sentences, average length=38.03
train : 16792 sentences, average length=38.99
test : 3447 sentences, average length=38.45
val : 1865 sentences, average length=38.97


### TODO1 : MFS Baseline ("most frequent sense")

In WSD, a very strong baseline is to always assign the most frequent sense of a word, independently of its context.

Note this is a supervised baseline, since we need a sense-annotated corpus to compute the most frequent sense of each word.

- Compute the most frequent sense of each **target-lemma**
  (using counts found in **train**)

- and compute the MFS baseline, namely the accuracy obtained when choosing the most frequent sense of each target
  - MFS in train
  - MFS in dev (always using frequencies in train to get the most frequent senses)
    - **NB**: in case of unknown target lemma, fall back on the most frequent frame in full training data

- Study the items in dev that are unknown in train:
  - unknown target lemmas
  - unknown frame / target-lemma associations
  - unknown frames
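
The counting behind an MFS baseline can be sketched on toy data as follows. The lemmas and frame names below are made up for illustration, and the function names are hypothetical; the real baseline has to be computed on the train and dev parts loaded above.

```python
from collections import Counter, defaultdict

# toy training data (hypothetical lemmas and frame names)
train_lemmas = ["acheter", "acheter", "acheter", "acheter",
                "comprendre", "comprendre", "comprendre"]
train_frames = ["Commerce_buy", "Commerce_buy", "Commerce_buy", "Other_sense",
                "Grasp", "Grasp", "Inclusion"]

# count the frames of each target lemma, and the frames overall
frame_counts = defaultdict(Counter)
for lemma, frame in zip(train_lemmas, train_frames):
    frame_counts[lemma][frame] += 1
overall_mfs = Counter(train_frames).most_common(1)[0][0]

# most frequent sense of each target lemma
mfs = {lemma: c.most_common(1)[0][0] for lemma, c in frame_counts.items()}

def mfs_predict(lemma):
    # unknown target lemma => fall back on the overall most frequent frame
    return mfs.get(lemma, overall_mfs)

def mfs_accuracy(lemmas, gold_frames):
    nb_correct = sum(mfs_predict(l) == g for l, g in zip(lemmas, gold_frames))
    return nb_correct / len(gold_frames)

dev_lemmas = ["acheter", "comprendre", "vendre"]
dev_frames = ["Commerce_buy", "Inclusion", "Commerce_buy"]
print("MFS accuracy on toy dev: %.2f" % mfs_accuracy(dev_lemmas, dev_frames))
```

Here "vendre" is an unknown target lemma, so the fallback frame is used for it.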
  



## Data encoding

### FlauBERT tokenization

We use the FlauBERT model, using the Huggingface "transformers" module.

In [None]:
try:
  import transformers
except ImportError:
  !pip install transformers

from transformers import AutoModel, AutoTokenizer, AutoConfig

# flaubert's tokenization uses as first step a tokenization into words by moses
try:
  import sacremoses
except ImportError:
  !pip install sacremoses
  import sacremoses



In [None]:
# We choose the FlauBERT model

# we load tokenizer and config for now
flaubert_tokenizer = AutoTokenizer.from_pretrained("flaubert/flaubert_base_cased")
flaubert_config = AutoConfig.from_pretrained("flaubert/flaubert_base_cased")


### TODO2: Encoding method

The objective is to apply FlauBERT's tokenization, **keeping track of the position of the tokens of the targets**.

Follow the instructions below to fill in the encode method.
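
As an illustration of the bookkeeping involved, here is a sketch for a single sentence. Everything in it is an assumption made for the example: the special-token ids, the `toy_word_to_tids` stand-in for per-word subword tokenization, and the choice to return `None` when the target falls beyond the truncation point. The real ids and tokenization must come from the FlauBERT tokenizer and config.

```python
# hypothetical special-token ids (not FlauBERT's)
BOS_ID, EOS_ID, PAD_ID = 0, 1, 2

def toy_word_to_tids(word, chunk=4):
    """Stand-in for per-word subword tokenization: one fake id per 4-char chunk."""
    pieces = [word[i:i + chunk] for i in range(0, len(word), chunk)]
    return [10 + (sum(map(ord, p)) % 1000) for p in pieces]

def encode_one(words, tg_wrk, max_length):
    tids, tg_trk = [], None
    for wrk, w in enumerate(words):
        if wrk == tg_wrk:
            tg_trk = len(tids)   # rank before special tokens are added
        tids.extend(toy_word_to_tids(w))
    # leave room for the two special tokens, then truncate
    tids = [BOS_ID] + tids[:max_length - 2] + [EOS_ID]
    tg_trk += 1                  # shift caused by the leading special token
    # pad on the right up to max_length
    tids += [PAD_ID] * (max_length - len(tids))
    # a target beyond the truncation point is lost: signal it (design choice)
    if tg_trk >= max_length - 1:
        tg_trk = None
    return tids, tg_trk

words = "Conséquemment , nous comprendrions .".split(' ')
tids, tg_trk = encode_one(words, tg_wrk=3, max_length=12)
print(len(tids), tg_trk, tids)
```

How to treat targets lost by truncation (skip the example, raise an error, or increase `max_length`) is left open here.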

In [None]:

class WSDEncoder:
    def __init__(self, tokenizer, config):
        self.tokenizer = tokenizer
        self.config = config # to get indices of special tokens


    def encode(self, sentences, tg_wrks, max_length=100, verbose=False, is_split_into_words=True):
        # - tokenization into subwords ("tokens"):
        # - substitution by token ids
        # - truncation and padding
        # - addition of special tokens
        # - track of the rank of the first token of target
      """
      Input:
        - sentences : list of sentences
           -- if is_split_into_words:
              sentences are already split into words
              (hence sentences = list of word strings [[w1, w2, w3], [w1, w2]...])
           -- otherwise, sentences are to be split on spaces to get words

        - tg_wrks : list of the ranks of target words
          (one rank per sentence, starting at 0 in a sentence)
        - max_length : maximum length in number of tokens

      Returns:
        - tid_seqs : the sentences padded/truncated so that each contains max_length token ids
        - first_trk_of_targets : for each sentence,
                                 the rank in corresponding tid_seq
                                 of the first token of the target word

      Example with is_split_into_words=True: a batch with one sent
      sentences = [ ['Conséquemment', ',', 'nous', 'comprendrions', '.'] ]
      tg_wrks = [3]

      if the sentence is tokenized into
        '<s>', 'Con', 'séqu', 'emment</w>', ',</w>', 'nous</w>', 'compr', 'end', 'rions</w>', '.</w>' ....
      the first token rank of the target "comprendrions" is 6 ('compr')

      """
      # TODO HERE : encoding method

      # Indications:
      # 1. apply flaubert tokenization *word per word*, and build
      #    tid_seqs first without padding / truncation nor special tokens,
      #    and keep track of token rank of first token of target word
      # 2. then truncate or pad, and add special symbols
      # (write several methods for easier reading)

      #  return tid_seqs, first_trk_of_targets

#### Encoding test

In [None]:
encoder = WSDEncoder(flaubert_tokenizer, flaubert_config)

# test encoder
test_sents = ["Conséquemment , nous comprendrions .",
              "Le code comprend des erreurs .",
            "J' essaie de comprendre les transformers .",
            "Il n' a pas bien compris le code !"]

# we split into words (cf. asfalda dataset sentences are already split)
test_sents = [ x.split(' ') for x in test_sents ]
# target words are the occurrences of "comprendre"
test_tg_wrks = [3, 2, 3, 5]
max_length=10

# TODO: uncomment to test your encode method
#tid_seqs, first_trk_of_targets = encoder.encode(test_sents, test_tg_wrks, max_length=10, verbose=True, is_split_into_words=True)

#for tid_seq, ft in zip(tid_seqs, first_trk_of_targets):
#    readable = flaubert_tokenizer.convert_ids_to_tokens(tid_seq)
#    print("Len = %d target token rank = %d tid_seq = %s (%s)" % (len(tid_seq), ft, str(tid_seq), str(readable)))


### TODO3: WSDData class: full encoding and batch production
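
The batching logic can be sketched with plain Python lists. This is only an illustration under simplifying assumptions: the function is standalone rather than a method, and it yields lists, whereas the method below is expected to yield torch tensors sent to the device.

```python
import random

def make_batches(tid_seqs, tg_trks, labels, batch_size, shuffle_data=False):
    """Yield (tid_seqs, tg_trks, labels) slices of at most batch_size items."""
    order = list(range(len(tid_seqs)))
    if shuffle_data:
        random.shuffle(order)  # a single permutation keeps the three lists aligned
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield ([tid_seqs[i] for i in idx],
               [tg_trks[i] for i in idx],
               [labels[i] for i in idx])

# toy data: 5 examples of 3 token ids each
toy_tid_seqs = [[i, i, i] for i in range(5)]
toy_tg_trks = [1, 2, 1, 0, 2]
toy_labels = [10, 11, 12, 13, 14]
for b_tids, b_trks, b_labels in make_batches(toy_tid_seqs, toy_tg_trks,
                                             toy_labels, 2, shuffle_data=True):
    print(b_tids, b_trks, b_labels)
```

Shuffling a list of indices, rather than each list separately, is what keeps sentences, target ranks and labels in correspondence.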

In [None]:

class WSDData:
    def __init__(self, corpus_type, sentences, tg_wrks, tg_lemmas, labels, encoder, max_length=100):
        """
        Inputs:
        - corpus type string (train/dev/test)
        - list of sentences (each sentence = list of word strings)
        - list of target word ranks : one per sentence
        - list of target lemmas : one per sentence
        - list of gold label ids : one per sentence
        - encoder = instance of WSDEncoder

        - max_length = size of encoded sequences, in nb of bert tokens
                      (padded / truncated via encoder.encode)

        Encodes all the data using the relevant identifiers
        """

        self.corpus_type = corpus_type # train / dev / test / val
        self.size = len(sentences)
        self.encoder = encoder

        self.labels = labels       # gold label ids
        self.sentences = sentences # list of list of word strings
        self.tg_lemmas = tg_lemmas

        tid_seqs, tg_trks = encoder.encode(sentences, tg_wrks, max_length)

        self.tid_seqs = tid_seqs  # sequences of token ids
        self.tg_trks = tg_trks    # target token ranks


    def shuffle(self):
      """
      Rearranges all the data in a new random order
      (sentences, tg_lemmas, tg_trks, tid_seqs, labels)

      NB: ** original order might be lost **
      """
      # TODO

    # production of a batch
    def make_batches(self, batch_size, device, shuffle_data=False):
        """
        Returns an iterator over batches; each batch is a triple of torch tensors:
        - batch of token id sequences
        - corresponding batch of target token ranks
        - corresponding batch of labels for these targets
        """
        # TODO
        # (Tip: use the "yield" statement to get a generator, i.e. an iterator)

        # **NB** : the torch tensors can be directly sent to the right device
        #          using .to(device)
        # for ...
        #    ...
        #    yield(b_tid_seqs, b_tg_trks, b_labels)


In [None]:
# Encoding of the three sets train/dev/test
MAX_LENGTH = 100
wsd_data = {}
# key = part of the split corpus (train/test/dev)
for p in sentences.keys():
    print("Encoding part %s ..." % p)
    wsd_data[p] = WSDData(p, sentences[p], tg_wrks[p], tg_lemmas[p], labels[p],
                          encoder, max_length=MAX_LENGTH)
    # we check that encoding provides the right lengths
    for i, s in enumerate(wsd_data[p].tid_seqs):
        if len(s) != MAX_LENGTH:
            print("Size bug:", i, s)


## WSDClassifier class: the network for WSD

Base architecture =
- the FlauBERT model
- plus linear layer + softmax

### TODO4: Digression: matrix operation

Matrix operation to fetch the bert hidden vector of the first
token of the targets.

Input is
1. x = a tensor for a batch of (truncated/padded) sentences
   containing the bert vectors for all tokens of each sentence

2. r = a tensor for the token ranks of the first token of the targets
  
=> we want to keep only the bert vectors of these first tokens

TODO: write down the shapes of tensors x and r and of the output tensor
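
Before looking for a vectorized operation, it can help to spell out the expected result with plain nested lists and an explicit loop over the batch. This sketch uses the same numbers as the tensor example in this section; it is a reference for checking the result, not the matrix operation itself.

```python
# x[b][t] is the vector of token t in sentence b
x = [[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
     [[13, 14, 15, 16], [17, 18, 19, 20], [21, 22, 23, 24]]]
r = [1, 2]  # rank of the first target token in each sentence

# one vector per sentence: the one found at rank r[b]
o = [x[b][r[b]] for b in range(len(x))]
print(o)
```

The tensor version should produce the same values without a Python loop.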


In [None]:
x = torch.tensor([[[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12]],
                  [[13, 14, 15, 16],
                   [17, 18, 19, 20],
                   [21, 22, 23, 24]]])
print(x.shape)
# if token ranks for the two sentences of the batch are (1,2)
r = torch.tensor([1, 2])
# => we want to get the [5, 6, 7, 8] and [21, 22, 23, 24] vectors

# write down the matrix operation
# see this source : https://discuss.pytorch.org/t/how-to-select-specific-vector-in-3d-tensor-beautifully/37724
# o = ...


### TODO5: The network : architecture, forward propagation, evaluation
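
Freezing a sub-module, and checking that it is actually frozen, can be sketched on a tiny model. The two-layer `encoder` below is a hypothetical stand-in for the pre-trained model; only the `requires_grad` mechanism carries over to the real classifier.

```python
import torch.nn as nn

# hypothetical stand-in for a pre-trained encoder followed by a new linear layer
encoder = nn.Sequential(nn.Linear(8, 8), nn.Tanh())
head = nn.Linear(8, 3)
model = nn.Sequential(encoder, head)

# freeze: no gradient will be computed for the encoder's parameters
for param in encoder.parameters():
    param.requires_grad = False

# check: list what is still trainable, and count what is frozen
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
nb_frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print("trainable:", trainable)
print("frozen parameter count:", nb_frozen)
```

A further sanity check is to compare a frozen weight before and after one optimizer step: it should be unchanged.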

In [None]:
flaubert_model = AutoModel.from_pretrained("flaubert/flaubert_base_cased", return_dict=True)


In [None]:
class WSDClassifier(nn.Module):

    def __init__(self, num_labels, device, bert_model, bert_config, freeze_bert = True):
        super(WSDClassifier, self).__init__()

        self.device = device

        # the full *BERT*-like model
        # the .to(device) triggers the copy towards the relevant device
        # (possibly a GPU)
        self.bert_layer = bert_model.to(device)
        # config will allow to get the hidden vectors' size
        self.bert_config = bert_config

        # TODO HERE : rest of the network

        # TODO: implement option to either freeze or fine-tune the BERT model


    def forward(self, b_tid_seq, b_tg_trk):
        """
        Inputs: (all are tensors, on the relevant device)
            - a batch of sentences = a batch of token id sequences
              (as output in 'input_ids' member of tokenizer output)
            - a batch of target token rank = for each of the sentences,
              the rank of first token of the target word to disambiguate

        Output: log_softmax scores for the whole batch (batch_size x num_labels)
        """
        # TODO HERE
        #  - get the *bert last hidden vectors for all the tokens of all the batch sentences
        #    [ batch_size * seq_len * bert_emb_size ]
        #
        #  - isolate the vector of the (first) token of the target for all the batch sentences
        #    [ batch_size * bert_emb_size ]
        #
        #    Tips to do this:
        #    https://discuss.pytorch.org/t/how-to-select-specific-vector-in-3d-tensor-beautifully/37724
        #
        #  - and apply linear layer

    def run_on_dataset(self, wsd_data, batch_size=32):
        """
        Run classifier on wsd_data and return its predictions
        (use evaluate to compute the accuracy)
        Inputs =
         - wsd_data (WSDData instance)
         - batch_size
        Returns:
         - list of predicted label ids
        """
        pred_labels = []

        # VERY IMPORTANT : toggle evaluation mode of the model (no dropout)
        self.eval()

        # TODO

        return pred_labels

    def evaluate(self, gold_labels, pred_labels):
        """ returns accuracy, nb_correct, nb_total """
        # TODO





In [None]:
# an instance of WSDClassifier
num_labels = len(i2label)
classifier = WSDClassifier(num_labels, DEVICE, flaubert_model, flaubert_config)

# uncomment to see the huge nb of parameters ...
#for name, param in classifier.named_parameters():
#    print("PARAM named %s, of shape %s" % (name, str(param.shape)))
#    print(param)

#### Test of forward propagation

In [None]:
# useless to compute gradients when testing
with torch.no_grad():
    # toggle train mode off
    classifier.eval()
    for b_tid_seqs, b_tg_trks, b_labels in wsd_data['dev'].make_batches(32, classifier.device, shuffle_data=True):
        # make_batches already yields torch tensors on the right device

        log_probs = classifier(b_tid_seqs, b_tg_trks)
        gold = b_labels[0].item()
        print("GOLD LABEL of first ex %d ( = %s)" % (gold, i2label[gold]))
        print("LOG_PROBS before training: %s\n\n" % str(log_probs[0]))
        break


### TODO6: Training : fine-tuning for the WSD task

**NB** Full training on the train data is **LONG**, so when developing your code, first try on a small part of the data.

**NB** In general, when using a *BERT*-like model in fine-tuning mode (not frozen), the appropriate learning rate tends to be lower than in frozen mode.
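
The training cell lists early stopping as a bonus. Its control flow can be sketched independently of any model. In this sketch, `run_epoch` and `evaluate` are hypothetical callables, the criterion is the validation loss rather than dev accuracy, and the `patience` parameter is an addition of the sketch.

```python
def train_with_early_stopping(run_epoch, evaluate, max_epochs, patience=1):
    """
    run_epoch() trains for one epoch; evaluate() returns a validation loss.
    Stops when the validation loss has not improved for `patience` epochs.
    Returns the list of validation losses and the best epoch (0-based).
    """
    val_losses, best_loss, best_epoch = [], None, -1
    for epoch in range(max_epochs):
        run_epoch()
        loss = evaluate()
        val_losses.append(loss)
        if best_loss is None or loss < best_loss:
            best_loss, best_epoch = loss, epoch
            # (a real loop would save the model here)
        elif epoch - best_epoch >= patience:
            break
    return val_losses, best_epoch

# toy run: fake validation losses instead of a real model
fake_losses = iter([0.9, 0.7, 0.6, 0.65, 0.5])
val_losses, best_epoch = train_with_early_stopping(
    run_epoch=lambda: None, evaluate=lambda: next(fake_losses), max_epochs=5)
print(val_losses, best_epoch)
```

With `patience=1` the loop stops at the first epoch whose validation loss is not an improvement, so the last value of the fake series is never reached.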

In [None]:
# training

BATCH_SIZE = 32
LR = 0.0005

loss_function = nn.NLLLoss(reduction='mean')
# SGD is quicker (more convenient for debug phase)
#optimizer = optim.SGD(classifier.parameters(), lr=LR)
optimizer = optim.Adam(classifier.parameters(), lr=LR)

config_name = 'sequoiaftb.asfalda_1_3.wsd.lr' + 'Adam' + str(LR) + '_bs' + str(BATCH_SIZE)
out_model_file = './' + config_name + '.model'
out_log_file = './' + config_name + '.log'


# losses at each epoch (on train / on validation set)
train_losses = []
val_losses = []
min_val_loss = None

# to speed up during debug: train on dev
#train_data = wsd_data['dev'] # instead of wsd_data['train']
train_data = wsd_data['train']
val_data = wsd_data['dev']

# TODO HERE
# training
# - basic : train for NB_EPOCHS
# - BONUS : early stopping: stop epoch loop as soon as accuracy on dev decreases

# don't forget to toggle
# - classifier.train() when training on train (it will use dropout)
# - classifier.eval() when evaluating on val corpus

# to speed up: don't compute the gradients when evaluating on dev / test


print("train losses: %s" % ' / '.join([ "%.4f" % x for x in train_losses]))
print("val   losses: %s" % ' / '.join([ "%.4f" % x for x in val_losses]))



### TODO7: Evaluation
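
The accuracy computation itself only needs two aligned lists of label ids. A minimal sketch, written as a standalone function with toy label ids (in the notebook it is meant to be a method of the classifier):

```python
def evaluate(gold_labels, pred_labels):
    """Return (accuracy, nb_correct, nb_total) for two aligned lists of label ids."""
    assert len(gold_labels) == len(pred_labels)
    nb_total = len(gold_labels)
    nb_correct = sum(g == p for g, p in zip(gold_labels, pred_labels))
    accuracy = nb_correct / nb_total if nb_total else 0.0
    return accuracy, nb_correct, nb_total

acc, nb_correct, nb_total = evaluate([3, 1, 2, 2], [3, 0, 2, 1])
print("accuracy = %.2f (%d / %d)" % (acc, nb_correct, nb_total))
```
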

In [None]:
# TODO HERE : run on dev and evaluate



## Grading scale

- the basic system is worth 12 points

- quality of code / comments = 2 points


- additional points for extras of your choice

  - generalization analysis:
    do you think it would be better to predict seen-in-train lemma/frame associations only?
    (to answer that question, propose and implement a simple analysis of the predictions made without any control of the frame/lemma association)

  - implement an option to only predict seen-in-train lemma/frame associations

  - nice hyperparameter search

  - high results thanks to nice hyperparameter search

  - implement early stopping

  - does it help to fine-tune with an MLP instead of a single linear layer?

  - does it help to use a concatenation of FlauBERT's embeddings at different layers instead of the last layer only (e.g. the 4 last layers, cf. table 7 of Devlin et al. 2019)? (do this in frozen mode only)

  - does it help to add a lemma embedding of the target (concatenate it to the bert output, before final linear layer)?

  - ... other ideas are welcome ...
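
For the option of predicting only seen-in-train lemma/frame associations, one possible approach is to restrict the argmax to the label ids seen with the target lemma. The sketch below uses plain lists and made-up ids; the `seen_frames` dictionary and the fallback on all labels for unknown lemmas are assumptions of the example.

```python
import math

def constrained_argmax(log_probs, allowed_ids):
    """Best label among allowed_ids; falls back on all labels if the set is empty."""
    candidates = allowed_ids if allowed_ids else range(len(log_probs))
    return max(candidates, key=lambda i: log_probs[i])

# hypothetical label ids seen in train for each target lemma
seen_frames = {"acheter": {0, 2}, "comprendre": {1, 3}}

log_probs = [math.log(p) for p in (0.1, 0.5, 0.3, 0.1)]
# known lemma: only labels 0 and 2 compete
print(constrained_argmax(log_probs, seen_frames.get("acheter", set())))
# unknown lemma: all labels compete
print(constrained_argmax(log_probs, seen_frames.get("vendre", set())))
```
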

Approximate expected accuracy:
 - In frozen mode, the basic system can reach 83-84% on the dev set when well trained

 - In fine-tuning mode, results seem unstable:
   - take care to search for an appropriate learning rate, which tends to be lower than in frozen mode
   - some runs get stuck at 37% accuracy, which corresponds to assigning the most frequent frame ("Other_sense") to all the instances
   - when learning goes well, accuracy can reach 88%, or even 90% for some runs


NB: write below what you chose to investigate / implement.
Summarize your results / hyper-parameter search.
For an extra feature to count, you need to write down an analysis of the results.

Your notebook should show traces of a complete training and evaluation phase.