<a href="https://colab.research.google.com/github/mirjampaales/cool-ml-project/blob/main/named_entity_recognition/train_ner.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Downloading data 
This step can be skipped when running the notebook from within its Git repository, where the repositories below are already included as submodules:

In [1]:
! git clone https://github.com/mukhal/xlm-roberta-ner.git 
! git clone https://github.com/TurkuNLP/turku-ner-corpus
! git clone https://github.com/ksirts/EstNER

Cloning into 'xlm-roberta-ner'...
remote: Enumerating objects: 312, done.
remote: Counting objects: 100% (312/312), done.
remote: Compressing objects: 100% (187/187), done.
remote: Total 312 (delta 165), reused 245 (delta 118), pack-reused 0
Receiving objects: 100% (312/312), 2.89 MiB | 10.43 MiB/s, done.
Resolving deltas: 100% (165/165), done.
Cloning into 'turku-ner-corpus'...
remote: Enumerating objects: 1611, done.
remote: Counting objects: 100% (1611/1611), done.
remote: Compressing objects: 100% (1515/1515), done.
remote: Total 1611 (delta 67), reused 1574 (delta 46), pack-reused 0
Receiving objects: 100% (1611/1611), 6.77 MiB | 13.13 MiB/s, done.
Resolving deltas: 100% (67/67), done.
Cloning into 'EstNER'...
remote: Enumerating objects: 8, done.
remote: Counting objects: 100% (8/8), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 8 (delta 2), reused 4 (delta 1), pack-reused 0
Unpacking objects: 100% (8/8), done.


## Data preparation

The data is first converted to a uniform format across languages. Since we use the XLM-R finetuning code, which ships with the English dataset, we don't have to worry about the format of the English data.

The expected dataset format is a space-separated file in which only the first and last columns are read (the word and its label); all other columns are ignored. Sentences are separated by an empty line.
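To make the format concrete, here is a minimal sketch of a reader for it. The function name `read_conll_like` and the sample lines are hypothetical and only illustrate the "first column is the word, last column is the label, blank line ends a sentence" convention described above; this is not the loader used by the finetuning code.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]

def read_conll_like(lines: Iterable[str]) -> List[Sentence]:
    """Parse lines where the first column is the word and the last is the label."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in lines:
        columns = line.split()
        if not columns:              # a blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        # middle columns (if any) are ignored
        current.append((columns[0], columns[-1]))
    if current:                      # the input may not end with a blank line
        sentences.append(current)
    return sentences

# Hypothetical sample: the first sentence carries an extra middle column.
sample = [
    "Tallinn NNP B-LOC\n",
    "is VBZ O\n",
    "\n",
    "Mari B-PER\n",
]
parsed = read_conll_like(sample)
print(parsed)
```

The same function could be pointed at any of the generated files, for example by passing an open file handle for `et-data/train.txt`.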

Finnish is easy, as it is a .tsv file with those two columns. Estonian uses a completely different, hierarchical JSON format and needs the most preparation.


Additionally, we'll limit the named entity labels to just persons (PER), organizations (ORG) and locations (LOC), as those are common to all datasets.

In [31]:
import pandas as pd
import json

In [54]:
! mkdir et-data en-data fi-data merged-data

In [45]:
allowed_labels = ['O', 'B-LOC', 'I-LOC', 'B-ORG','I-ORG','B-PER','I-PER']

Estonian:

In [55]:
for split in ["dev","test","train"]:
  with open(f"EstNER/EstNER_v1_{split}.json", 'r') as f_in:
    data = json.loads(f_in.read())
  
  split = 'valid' if split=='dev' else split

  with open(f"et-data/{split}.txt", 'w') as f_out:
    for document in data:
      for sentence in document:
        for token in sentence:
          # Estonian has multi-part names sometimes marked as one token (e.g. New York). Those must be split to multiple rows to conform with the file format.
          words = token['word'].split()
          label = token['ner_1']

          label = label if label in allowed_labels else 'O'

          f_out.write(f"{words.pop(0)} {label}\n")

          if words: # name was multipart
            # if the first word is a named entity start (label B-*), others must be continuations
            if label.split('-')[0]=='B':
              label=f"I-{label.split('-')[1]}"
            
            while words:
              f_out.write(f"{words.pop(0)} {label}\n")
        f_out.write('\n')
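The token-splitting rule from the loop above can also be written as a small standalone function, which makes it easier to check by hand. This is a sketch under the same assumptions as the cell above; the names `ALLOWED_LABELS` and `split_token` are hypothetical. Unlike the loop, it returns an empty list for an empty `word` instead of raising an error.

```python
ALLOWED_LABELS = {'O', 'B-LOC', 'I-LOC', 'B-ORG', 'I-ORG', 'B-PER', 'I-PER'}

def split_token(word: str, label: str):
    """Split a possibly multi-word token into (word, label) rows.

    Labels outside ALLOWED_LABELS become 'O'. If the token starts an
    entity (B-*), every part after the first is a continuation (I-*).
    """
    label = label if label in ALLOWED_LABELS else 'O'
    rows = []
    for i, part in enumerate(word.split()):
        if i > 0 and label.startswith('B-'):
            label = 'I-' + label[2:]
        rows.append((part, label))
    return rows

print(split_token("New York", "B-LOC"))
print(split_token("2021", "B-DATE"))
```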

Finnish:

In [58]:
for split in  ["dev","test","train"]:
  with open(f"turku-ner-corpus/data/conll/{split}.tsv", 'r') as f_in:
    data = f_in.readlines()
  
  split = 'valid' if split=='dev' else split

  with open(f"fi-data/{split}.txt", 'w') as f_out:
    for line in data:
      columns = line.strip().split('\t')
      if len(columns) >= 2:
        if columns[-1] in allowed_labels:
          f_out.write(f"{columns[0]} {columns[-1]}\n")
        else:
          f_out.write(f"{columns[0]} O\n")
      else:
        f_out.write(f"\n")
    f_out.write('\n')

English:

In [60]:
for split in  ["valid","test","train"]:
  with open(f"xlm-roberta-ner/data/coNLL-2003/{split}.txt", 'r') as f_in:
    data = f_in.readlines()

  with open(f"en-data/{split}.txt", 'w') as f_out:
    for line in data:
      columns = line.strip().split()
      if len(columns) >= 2:
        if columns[-1] in allowed_labels:
          f_out.write(f"{columns[0]} {columns[-1]}\n")
        else:
          f_out.write(f"{columns[0]} O\n")
      else:
        f_out.write(f"\n")
    f_out.write('\n')

Create merged datasets:

In [62]:
! for f in {valid,test,train}; do cat {et,en,fi}-data/$f.txt > merged-data/$f.txt; done

Distribution of tags across languages: we keep only PER, ORG, and LOC, since these are annotated in all three datasets.
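One way to inspect the tag distribution is to count entity starts (`B-*` labels) per entity type. The sketch below is an assumption about how such a count could be done, not the code originally used for this notebook; `count_entity_types` and the sample lines are hypothetical.

```python
from collections import Counter

def count_entity_types(lines):
    """Count entities per type by counting B-* labels in the last column."""
    counts = Counter()
    for line in lines:
        columns = line.split()
        if len(columns) >= 2 and columns[-1].startswith('B-'):
            counts[columns[-1][2:]] += 1
    return counts

# Hypothetical sample in the prepared two-column format.
sample = [
    "Mari B-PER\n",
    "Maasikas I-PER\n",
    "visited O\n",
    "Helsinki B-LOC\n",
    "\n",
    "Nokia B-ORG\n",
    "Espoo B-LOC\n",
]
counts = count_entity_types(sample)
print(dict(counts))
```

To compare languages, the function can be called once per prepared file, e.g. on an open handle for `et-data/train.txt`, `en-data/train.txt` and `fi-data/train.txt`.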

## Finetuning XLM-R

In [63]:
! pip install -r xlm-roberta-ner/requirements.txt
! pip install wandb

Collecting fairseq==0.9.0
  Downloading fairseq-0.9.0.tar.gz (306 kB)
     |████████████████████████████████| 306 kB 4.2 MB/s 
Collecting pytorch-transformers==1.2.0
  Downloading pytorch_transformers-1.2.0-py3-none-any.whl (176 kB)
     |████████████████████████████████| 176 kB 52.1 MB/s 
Collecting seqeval==0.0.12
  Downloading seqeval-0.0.12.tar.gz (21 kB)
Collecting sacrebleu
  Downloading sacrebleu-2.0.0-py3-none-any.whl (90 kB)
     |████████████████████████████████| 90 kB 7.1 MB/s 
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
     |████████████████████████████████| 1.2 MB 54.7 MB/s 
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
     |████████████████████████████████| 895 kB 35.7 MB/s 
Collecting boto3
  Downloading boto3-1.20.23-py3-none-any.whl (131 kB)
     |████████████████████████████████| 131 kB 72.9 MB/s 
Collecting s3trans

Collecting wandb
  Downloading wandb-0.12.7-py2.py3-none-any.whl (1.7 MB)
     |████████████████████████████████| 1.7 MB 4.3 MB/s 
Collecting docker-pycreds>=0.4.0
  Downloading docker_pycreds-0.4.0-py2.py3-none-any.whl (9.0 kB)
Collecting shortuuid>=0.5.0
  Downloading shortuuid-1.0.8-py3-none-any.whl (9.5 kB)
Collecting subprocess32>=3.5.3
  Downloading subprocess32-3.5.4.tar.gz (97 kB)
     |████████████████████████████████| 97 kB 5.2 MB/s 
Collecting pathtools
  Downloading pathtools-0.1.2.tar.gz (11 kB)
Collecting yaspin>=1.0.0
  Downloading yaspin-2.1.0-py3-none-any.whl (18 kB)
Collecting configparser>=3.8.1
  Downloading configparser-5.2.0-py3-none-any.whl (19 kB)
Collecting sentry-sdk>=1.0.0
  Downloading sentry_sdk-1.5.0-py2.py3-none-any.whl (140 kB)
     |████████████████████████████████| 140 kB 41.6 MB/s 
Collecting GitPython>=1.0.0
  Downloading GitPython-3.1.24-py3-none-any.whl (180 kB)
     |████████████████████████████████| 180 kB 49.4 MB/s 
Collecting 

Checking which GPU resources we have:

In [None]:
! nvidia-smi

Sat Dec 11 13:37:56 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P8    28W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Downloading the pretrained XLM-R model

In [None]:
! mkdir model_dir
! mkdir pretrained_models 
! wget -P pretrained_models https://dl.fbaipublicfiles.com/fairseq/models/xlmr.base.tar.gz
! tar xzvf pretrained_models/xlmr.base.tar.gz  --directory pretrained_models/
! rm -r pretrained_models/xlmr.base.tar.gz

mkdir: cannot create directory ‘pretrained_models’: File exists
--2021-12-11 13:36:10--  https://dl.fbaipublicfiles.com/fairseq/models/xlmr.base.tar.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.75.142, 104.22.74.142, 172.67.9.4, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.75.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 512274718 (489M) [application/gzip]
Saving to: ‘pretrained_models/xlmr.base.tar.gz’


2021-12-11 13:36:26 (31.2 MB/s) - ‘pretrained_models/xlmr.base.tar.gz’ saved [512274718/512274718]

xlmr.base/
xlmr.base/dict.txt
xlmr.base/sentencepiece.bpe.model
xlmr.base/model.pt


Setting up Weights & Biases monitoring to keep an eye on GPU performance metrics (utilization, memory consumption, etc.)

In [None]:
import wandb  # installed above; must be imported before use

wandb.init()

wandb: Currently logged in as: liisaratsep (use `wandb login --relogin` to force relogin)


Finetuning uses parameters as close as possible to those of the original [XLM-R paper](https://arxiv.org/pdf/1911.02116.pdf). The command below finetunes the multilingual model; changing the `data_dir` path finetunes on a single language instead.

PS: The actual finetuning was run on the UT HPC Rocket cluster because it was faster, but the example below is fully functional.
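As an illustration of switching `data_dir`, the sketch below builds one command per language using only the flags that appear in the finetuning cell of this notebook. The helper `build_finetune_command` and the per-language output directory names (`et-model/`, `en-model/`, `fi-model/`) are assumptions made for this example, not part of the original setup. The commands are only printed, not executed.

```python
import shlex

def build_finetune_command(data_dir: str, output_dir: str):
    """Return the finetuning command as an argument list (hypothetical helper)."""
    return [
        "python", "xlm-roberta-ner/main.py",
        f"--data_dir={data_dir}",
        "--task_name=ner",
        f"--output_dir={output_dir}",
        "--max_seq_length=128",
        "--num_train_epochs", "10",
        "--do_train",
        "--do_eval",
        "--eval_on", "dev",
        "--warmup_proportion=0.0",
        "--pretrained_path", "pretrained_models/xlmr.base/",
        "--learning_rate", "6e-5",
        "--train_batch_size", "32",
        "--dropout", "0.2",
    ]

# Output directory names are assumed; pick any that suit your setup.
for lang in ("et", "en", "fi"):
    cmd = build_finetune_command(f"./{lang}-data/", f"{lang}-model/")
    print(shlex.join(cmd))
```

Each printed line can be pasted into a `!` cell, or the argument list can be passed to `subprocess.run` directly.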

In [None]:
! python xlm-roberta-ner/main.py \
    --data_dir=./merged-data/  \
    --task_name=ner   \
    --output_dir=merged-model/   \
    --max_seq_length=128   \
    --num_train_epochs 10  \
    --do_eval \
    --warmup_proportion=0.0 \
    --pretrained_path pretrained_models/xlmr.base/ \
    --learning_rate 6e-5 \
    --do_train \
    --eval_on dev \
    --train_batch_size 32 \
    --dropout 0.2

loading archive file pretrained_models/xlmr.base/
| dictionary: 250001 types
12/11/2021 14:10:54 - INFO - root -   *** Example ***
12/11/2021 14:10:54 - INFO - root -   guid: train-0
12/11/2021 14:10:54 - INFO - root -   tokens: 0 32232 107272 433 11532 22869 24317 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
12/11/2021 14:10:54 - INFO - root -   input_ids: 0 32232 107272 433 11532 22869 24317 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
12/11/2021 14:10:54 - INFO - root -   input_mask: 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 