<a href="https://colab.research.google.com/github/mirjampaales/cool-ml-project/blob/main/named_entity_recognition/train_ner.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Downloading data 
This step can be skipped when running the notebook from within its Git repository, because the repositories below are included there as submodules:

In [1]:
! git clone https://github.com/mukhal/xlm-roberta-ner.git 
! git clone https://github.com/TurkuNLP/turku-ner-corpus
! git clone https://github.com/ksirts/EstNER

Cloning into 'xlm-roberta-ner'...
remote: Enumerating objects: 312, done.
remote: Counting objects: 100% (312/312), done.
remote: Compressing objects: 100% (187/187), done.
remote: Total 312 (delta 165), reused 245 (delta 118), pack-reused 0
Receiving objects: 100% (312/312), 2.89 MiB | 10.43 MiB/s, done.
Resolving deltas: 100% (165/165), done.
Cloning into 'turku-ner-corpus'...
remote: Enumerating objects: 1611, done.
remote: Counting objects: 100% (1611/1611), done.
remote: Compressing objects: 100% (1515/1515), done.
remote: Total 1611 (delta 67), reused 1574 (delta 46), pack-reused 0
Receiving objects: 100% (1611/1611), 6.77 MiB | 13.13 MiB/s, done.
Resolving deltas: 100% (67/67), done.
Cloning into 'EstNER'...
remote: Enumerating objects: 8, done.
remote: Counting objects: 100% (8/8), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 8 (delta 2), reused 4 (delta 1), pack-reused 0
Unpacking objects: 100% (8/8), done.


## Data preparation

This step converts the data to a uniform format across languages. Because we use the XLM-R finetuning code, which ships with the English dataset, we don't have to worry about the format of the English data.

The expected dataset format is a space-separated file in which only the first and last columns are read (the word and its label); any other columns are ignored. Sentences are separated by an empty line.
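
To make the format concrete, here is a minimal sketch of a reader for it. `read_sentences` is a hypothetical helper written for illustration, not part of xlm-roberta-ner, and it assumes (as the conversion cells in this notebook do) that any line with fewer than two columns marks a sentence boundary.

```python
def read_sentences(lines):
    """Parse lines in the word/label format into a list of sentences.

    Each sentence is a list of (word, label) pairs. Only the first and
    last whitespace-separated columns of a line are used.
    """
    sentences, current = [], []
    for line in lines:
        columns = line.split()
        if len(columns) < 2:      # blank (or malformed) line ends the sentence
            if current:
                sentences.append(current)
                current = []
            continue
        current.append((columns[0], columns[-1]))
    if current:                   # input may not end with a blank line
        sentences.append(current)
    return sentences


# Illustrative sample; the middle column on the fourth line is ignored.
sample = [
    "Tallinn B-LOC\n",
    "is O\n",
    "\n",
    "Helsinki NNP B-LOC\n",
    "\n",
]
print(read_sentences(sample))
```

The same function can be pointed at a prepared file with `read_sentences(open(path))` to inspect what the finetuning code will see.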

Finnish is easy, as it is a .tsv file with those two columns. Estonian uses a completely different, hierarchical JSON format and needs the most preparation.


Additionally, we'll limit the named entity labels to just persons (PER), organizations (ORG) and locations (LOC), as those are common in all datasets.
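
One way to check that this restriction held in a prepared file is to count the labels it contains. The sketch below is a hypothetical sanity check, not part of the original pipeline; `count_labels` and `unexpected_labels` are names introduced here, and the sample lines are made up for illustration.

```python
from collections import Counter

ALLOWED_LABELS = {"O", "B-LOC", "I-LOC", "B-ORG", "I-ORG", "B-PER", "I-PER"}


def count_labels(lines):
    """Count how often each label (last column) occurs."""
    counts = Counter()
    for line in lines:
        columns = line.split()
        if len(columns) >= 2:     # skip sentence separators
            counts[columns[-1]] += 1
    return counts


def unexpected_labels(lines, allowed=ALLOWED_LABELS):
    """Return the sorted labels that are present but not allowed."""
    return sorted(set(count_labels(lines)) - set(allowed))


# Illustrative sample: B-MISC should have been mapped to O.
sample = ["Anna B-PER\n", "visited O\n", "Tartu B-LOC\n", "Estonian B-MISC\n", "\n"]
print(count_labels(sample))
print(unexpected_labels(sample))
```

An empty list from `unexpected_labels` means every label in the file is one of the seven allowed ones.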

In [31]:
import json

In [54]:
! mkdir -p data/et data/en data/fi data/merged

In [45]:
allowed_labels = ['O', 'B-LOC', 'I-LOC', 'B-ORG','I-ORG','B-PER','I-PER']

Estonian:

In [55]:
for split in ["dev","test","train"]:
  with open(f"EstNER/EstNER_v1_{split}.json", 'r') as f_in:
    data = json.loads(f_in.read())
  
  split = 'valid' if split=='dev' else split

  with open(f"data/et/{split}.txt", 'w') as f_out:
    for document in data:
      for sentence in document:
        for token in sentence:
          # Estonian has multi-part names sometimes marked as one token (e.g. New York). Those must be split to multiple rows to conform with the file format.
          words = token['word'].split()
          label = token['ner_1']

          label = label if label in allowed_labels else 'O'

          f_out.write(f"{words.pop(0)} {label}\n")

          if words: # name was multipart
            # if the first word is a named entity start (label B-*), others must be continuations
            if label.split('-')[0]=='B':
              label=f"I-{label.split('-')[1]}"
            
            while words:
              f_out.write(f"{words.pop(0)} {label}\n")
        f_out.write('\n')
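
The multi-part name handling in the cell above can also be read in isolation. The following is a sketch of the same rule as a pure function; `split_multiword_token` is a hypothetical name, and unlike the cell above it returns an empty list for an empty word instead of raising an error.

```python
def split_multiword_token(word, label):
    """Return one (word, label) row per whitespace-separated part of `word`.

    If the token starts an entity (B-*), every part after the first is
    relabelled as a continuation (I-*). Other labels are repeated as-is.
    """
    rows = []
    for i, part in enumerate(word.split()):
        if i > 0 and label.startswith("B-"):
            rows.append((part, "I-" + label[2:]))
        else:
            rows.append((part, label))
    return rows


print(split_multiword_token("New York", "B-LOC"))   # [('New', 'B-LOC'), ('York', 'I-LOC')]
print(split_multiword_token("York City", "I-LOC"))  # [('York', 'I-LOC'), ('City', 'I-LOC')]
print(split_multiword_token("linn", "O"))           # [('linn', 'O')]
```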

Finnish:

In [58]:
for split in  ["dev","test","train"]:
  with open(f"turku-ner-corpus/data/conll/{split}.tsv", 'r') as f_in:
    data = f_in.readlines()
  
  split = 'valid' if split=='dev' else split

  with open(f"data/fi/{split}.txt", 'w') as f_out:
    for line in data:
      columns = line.strip().split('\t')
      if len(columns) >= 2:
        if columns[-1] in allowed_labels:
          f_out.write(f"{columns[0]} {columns[-1]}\n")
        else:
          f_out.write(f"{columns[0]} O\n")
      else:
        f_out.write(f"\n")
    f_out.write('\n')

English:

In [60]:
for split in  ["valid","test","train"]:
  with open(f"xlm-roberta-ner/data/coNLL-2003/{split}.txt", 'r') as f_in:
    data = f_in.readlines()

  with open(f"data/en/{split}.txt", 'w') as f_out:
    for line in data:
      columns = line.strip().split()
      if len(columns) >= 2:
        if columns[-1] in allowed_labels:
          f_out.write(f"{columns[0]} {columns[-1]}\n")
        else:
          f_out.write(f"{columns[0]} O\n")
      else:
        f_out.write(f"\n")
    f_out.write('\n')

Create merged datasets:

In [62]:
! for f in {valid,test,train}; do cat data/{et,en,fi}/$f.txt > data/merged/$f.txt; done
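
Where a shell with brace expansion is not available, the merge could be done in Python instead. This is a sketch under assumptions: `merge_splits` is a hypothetical helper, and it differs from plain `cat` in one respect, namely that it makes sure each language's file ends with a blank line so that sentences from two languages are never joined.

```python
from pathlib import Path


def merge_splits(data_dir, languages=("et", "en", "fi"),
                 splits=("valid", "test", "train")):
    """Concatenate data/<lang>/<split>.txt into data/merged/<split>.txt."""
    data_dir = Path(data_dir)
    merged_dir = data_dir / "merged"
    merged_dir.mkdir(parents=True, exist_ok=True)
    for split in splits:
        with open(merged_dir / f"{split}.txt", "w", encoding="utf-8") as f_out:
            for lang in languages:
                text = (data_dir / lang / f"{split}.txt").read_text(encoding="utf-8")
                f_out.write(text)
                # Guarantee a sentence boundary between languages.
                if not text.endswith("\n\n"):
                    f_out.write("\n" if text.endswith("\n") else "\n\n")


# Usage, once the per-language files exist:
# merge_splits("data")
```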

## Finetuning XLM-R

In [63]:
! pip install -r xlm-roberta-ner/requirements.txt
! pip install wandb

Checking which GPU resources we have:

In [None]:
! nvidia-smi

Downloading the pretrained XLM-R model:

In [None]:
! mkdir model_dir
! mkdir pretrained_models finetuned_models
! wget -P pretrained_models https://dl.fbaipublicfiles.com/fairseq/models/xlmr.base.tar.gz
! tar xzvf pretrained_models/xlmr.base.tar.gz  --directory pretrained_models/
! rm -r pretrained_models/xlmr.base.tar.gz

Setting up Weights & Biases monitoring to keep an eye on GPU performance metrics (utilization, memory consumption, etc.)

In [None]:
import wandb

wandb.init()

[34m[1mwandb[0m: Currently logged in as: [33mliisaratsep[0m (use `wandb login --relogin` to force relogin)


Finetuning with parameters as close as possible to those in the original [XLM-R paper](https://arxiv.org/pdf/1911.02116.pdf).

PS: the actual finetuning was run on the UT HPC Rocket cluster because it was faster, but the examples below are fully functional.

In [None]:
# multilingual finetuning

! python xlm-roberta-ner/main.py \
    --data_dir=./data/merged/  \
    --task_name=ner   \
    --output_dir=finetuned_models/merged/   \
    --max_seq_length=128   \
    --num_train_epochs 10  \
    --do_eval \
    --warmup_proportion=0.0 \
    --pretrained_path pretrained_models/xlmr.base/ \
    --learning_rate 6e-5 \
    --do_train \
    --eval_on dev \
    --dropout 0.2 \
    --train_batch_size 32

In [None]:
# ET finetuning

! python xlm-roberta-ner/main.py \
    --data_dir=./data/et/  \
    --task_name=ner   \
    --output_dir=finetuned_models/et/   \
    --max_seq_length=128   \
    --num_train_epochs 10  \
    --do_eval \
    --warmup_proportion=0.0 \
    --pretrained_path pretrained_models/xlmr.base/ \
    --learning_rate 6e-5 \
    --do_train \
    --eval_on dev \
    --dropout 0.2 \
    --train_batch_size 32

In [None]:
# EN finetuning

! python xlm-roberta-ner/main.py \
    --data_dir=./data/en/  \
    --task_name=ner   \
    --output_dir=finetuned_models/en/   \
    --max_seq_length=128   \
    --num_train_epochs 10  \
    --do_eval \
    --warmup_proportion=0.0 \
    --pretrained_path pretrained_models/xlmr.base/ \
    --learning_rate 6e-5 \
    --do_train \
    --eval_on dev \
    --dropout 0.2 \
    --train_batch_size 32

In [None]:
# FI finetuning

! python xlm-roberta-ner/main.py \
    --data_dir=./data/fi/  \
    --task_name=ner   \
    --output_dir=finetuned_models/fi/   \
    --max_seq_length=128   \
    --num_train_epochs 10  \
    --do_eval \
    --warmup_proportion=0.0 \
    --pretrained_path pretrained_models/xlmr.base/ \
    --learning_rate 6e-5 \
    --do_train \
    --eval_on dev \
    --dropout 0.2 \
    --train_batch_size 32
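
The four finetuning cells differ only in the data directory and the output directory, so the commands could also be generated from a single template. The sketch below only builds and prints the commands; `build_finetune_command` is a hypothetical helper, and every flag in it is copied from the cells above rather than checked against the script's argument parser.

```python
import shlex


def build_finetune_command(name, pretrained_path="pretrained_models/xlmr.base/"):
    """Build the main.py command for one dataset (merged, et, en or fi)."""
    return [
        "python", "xlm-roberta-ner/main.py",
        f"--data_dir=./data/{name}/",
        "--task_name=ner",
        f"--output_dir=finetuned_models/{name}/",
        "--max_seq_length=128",
        "--num_train_epochs", "10",
        "--do_eval",
        "--warmup_proportion=0.0",
        "--pretrained_path", pretrained_path,
        "--learning_rate", "6e-5",
        "--do_train",
        "--eval_on", "dev",
        "--dropout", "0.2",
        "--train_batch_size", "32",
    ]


for name in ("merged", "et", "en", "fi"):
    # Print only; to launch, pass the list to subprocess.run(..., check=True).
    print(shlex.join(build_finetune_command(name)))
```

Deriving both directories from one `name` keeps each model's data and output paths in step with each other.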