This is a Colab tutorial on how to train and evaluate a custom NER model on the [GermEval14 dataset](https://sites.google.com/site/germeval2014ner/data), and how to load and use a custom pipeline with the trained NER model. The training data format required by Trankit is described [here](https://trankit.readthedocs.io/en/latest/training.html#training-a-named-entity-recognizer).

This tutorial was made by [@mrshu](https://github.com/mrshu). Thank you for your contribution to Trankit.

## Install dependencies

In [None]:
!pip install trankit



In [None]:
!pip install git+https://github.com/giuliano-oliveira/gdown_folder.git

Collecting git+https://github.com/giuliano-oliveira/gdown_folder.git
  Cloning https://github.com/giuliano-oliveira/gdown_folder.git to /tmp/pip-req-build-nmqmpsm8
  Running command git clone -q https://github.com/giuliano-oliveira/gdown_folder.git /tmp/pip-req-build-nmqmpsm8
  Running command git submodule update --init --recursive -q
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: gdown
  Building wheel for gdown (PEP 517) ... done
  Created wheel for gdown: filename=gdown-3.12.2-cp36-none-any.whl size=12112 sha256=d14c2f5cf1d978cb35c52f75be2ddc94a0d7277fbd5bac05ae618ac4be4aa2b2
  Stored in directory: /tmp/pip-ephem-wheel-cache-j7zov6s3/wheels/fc/37/89/910840c8a847ce7bee7481d166339b71c5af5a7814ff5d0ad4
Successfully built gdown


## Prepare the data

In [None]:
!mkdir -p "/content/drive/MyDrive/german-ner-data/Data and Task Setup"

In [None]:
!gdown https://drive.google.com/drive/folders/1kC0I2UGl2ltrluI9NqDjaQJGw5iliw_J --folder -O /content/drive/MyDrive/german-ner-data

Retrieving folder list
Processing file 13mk1icZHs3AGmxNDRFvTDAq_5o_ZdFe1 BenikovaBiemannReznicek_LREC2014_GermanNER.pdf
Processing file 11cbt0Bj5DU6baoayxzLNMZxyMMshRzpn Clarin_NoSta-D_NER-AnnotationGuidelines.pdf
Processing file 1ZfRcQThdtAR5PPRjIDtrVP7BtXSCUBbm NER-de-dev.tsv
Processing file 187cTeQpuxRvWnu6uGJFobVODcjXSxR5c NER-de-test-unlabeled.tsv
Processing file 1u9mb7kNJHWQCWyweMDRMuTFoOHOfeBTH NER-de-test.tsv
Processing file 1Jjhbal535VVz2ap4v4r_rN1UEHTdLK5P NER-de-train.tsv
Retrieving folder list completed
Building directory structure
Building directory structure completed
Downloading...
From: https://drive.google.com/uc?id=13mk1icZHs3AGmxNDRFvTDAq_5o_ZdFe1
To: /content/drive/MyDrive/german-ner-data/Data and Task Setup/BenikovaBiemannReznicek_LREC2014_GermanNER.pdf
100% 328k/328k [00:00<00:00, 12.4MB/s]
Downloading...
From: https://drive.google.com/uc?id=11cbt0Bj5DU6baoayxzLNMZxyMMshRzpn
To: /content/drive/MyDrive/german-ner-data/Data and Task Setup/Clarin_NoSta-D_NER-Annotati

In [None]:
%cd "/content/drive/MyDrive/german-ner-data/Data and Task Setup"

/content/drive/MyDrive/german-ner-data/Data and Task Setup


In [None]:
!ls

BenikovaBiemannReznicek_LREC2014_GermanNER.pdf	german.vocabs.json
Clarin_NoSta-D_NER-AnnotationGuidelines.pdf	german.zip
german.downloaded				NER-de-dev.tsv
german_lemmatizer.pt				NER-de-dev.tsv.filtered
german_mwt_expander.pt				NER-de-test.tsv
german.ner.germeval14.mdl			NER-de-test.tsv.filtered
german.ner.mdl					NER-de-test-unlabeled.tsv
german.ner-vocab.germeval14.json		NER-de-train.tsv
german.ner-vocab.json				NER-de-train.tsv.filtered
german.tagger.mdl				save_dir_filtered
german.tokenizer.mdl


In [None]:
!cat NER-de-dev.tsv | grep -v '^\#' | cut -f 2,3 | sed 's/[a-z]*$//' > NER-de-dev.tsv.filtered

In [None]:
!cat NER-de-train.tsv | grep -v '^\#' | cut -f 2,3 | sed 's/[a-z]*$//' > NER-de-train.tsv.filtered
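
The two shell pipelines above do three things: they drop the `#` comment lines, keep only columns 2 and 3 (the token and its tag), and strip any trailing lowercase suffix so that a tag such as `B-LOCderiv` collapses to `B-LOC`. A rough Python equivalent is sketched below for readers who want to run the step without shell tools. The function name `filter_germeval_lines` is hypothetical, and the four-column layout of the sample rows (index, token, outer tag, inner tag) is an assumption about the GermEval 2014 files, not something shown in this notebook.

```python
import re

def filter_germeval_lines(lines):
    """Approximate: grep -v '^#' | cut -f 2,3 | sed 's/[a-z]*$//'."""
    filtered = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#"):
            # grep -v '^#': drop comment lines
            continue
        if "\t" in line:
            # cut -f 2,3: keep the second and third tab-separated fields
            line = "\t".join(line.split("\t")[1:3])
        # Like cut, lines without a tab (e.g. blank sentence separators)
        # pass through unchanged.
        # sed 's/[a-z]*$//': strip trailing lowercase letters from the tag
        filtered.append(re.sub(r"[a-z]*$", "", line))
    return filtered

# Made-up rows in the assumed layout: index, token, outer tag, inner tag.
sample = [
    "#\thttp://example.org\t[2014-01-01]\n",
    "1\tBerlin\tB-LOC\tO\n",
    "2\tdeutsche\tB-LOCderiv\tO\n",
    "3\tsagte\tO\tO\n",
    "\n",
]

for row in filter_germeval_lines(sample):
    print(repr(row))
```

To process a real file, the same function could be applied to `open('NER-de-train.tsv', encoding='utf-8')` and the result written out one row per line.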

## Run the training loop

In [None]:
import trankit

# initialize a trainer for the task
trainer = trankit.TPipeline(
    training_config={
      'max_epoch': 7,
      'category': 'customized-ner',  # pipeline category
      'task': 'ner', # task name
      'save_dir': './save_dir_filtered', # directory to save the trained model
      'train_bio_fpath': './NER-de-train.tsv.filtered', # training data in BIO format
      'dev_bio_fpath': './NER-de-dev.tsv.filtered' # development data in BIO format
    }
)

# start training
trainer.train()

Setting up training config...


Skipped 0 over-length examples
Loaded 24000 examples
Skipped 0 over-length examples
Loaded 2200 examples


Train 0:   0%|                            | 1/1500 [00:00<04:32,  5.51it/s]

******************************
NER: Epoch: 0


Train 0: 100%|█████████████████████████| 1500/1500 [03:23<00:00,  7.38it/s]
dev 0: 100%|█████████████████████████████| 138/138 [00:09<00:00, 14.00it/s]
Train 1:   0%|                            | 2/1500 [00:00<01:41, 14.80it/s]

Saving adapter weights to ... ./save_dir_filtered/customized-ner/customized-ner.ner.mdl (9.96 MB)
------------------------------
Best dev F1 score: epoch 0, F1: 72.59
******************************
NER: Epoch: 1


Train 1: 100%|█████████████████████████| 1500/1500 [03:29<00:00,  7.17it/s]
dev 1: 100%|█████████████████████████████| 138/138 [00:09<00:00, 13.85it/s]
Train 2:   0%|                            | 2/1500 [00:00<01:34, 15.92it/s]

Saving adapter weights to ... ./save_dir_filtered/customized-ner/customized-ner.ner.mdl (9.96 MB)
------------------------------
Best dev F1 score: epoch 1, F1: 81.34
******************************
NER: Epoch: 2


Train 2: 100%|█████████████████████████| 1500/1500 [03:31<00:00,  7.10it/s]
dev 2: 100%|█████████████████████████████| 138/138 [00:10<00:00, 13.68it/s]
Train 3:   0%|                            | 2/1500 [00:00<01:51, 13.44it/s]

Saving adapter weights to ... ./save_dir_filtered/customized-ner/customized-ner.ner.mdl (9.96 MB)
------------------------------
Best dev F1 score: epoch 2, F1: 83.12
******************************
NER: Epoch: 3


Train 3: 100%|█████████████████████████| 1500/1500 [03:31<00:00,  7.10it/s]
dev 3: 100%|█████████████████████████████| 138/138 [00:10<00:00, 13.61it/s]
Train 4:   0%|                            | 2/1500 [00:00<01:19, 18.85it/s]

Saving adapter weights to ... ./save_dir_filtered/customized-ner/customized-ner.ner.mdl (9.96 MB)
------------------------------
Best dev F1 score: epoch 3, F1: 84.61
******************************
NER: Epoch: 4


Train 4: 100%|█████████████████████████| 1500/1500 [03:30<00:00,  7.12it/s]
dev 4: 100%|█████████████████████████████| 138/138 [00:10<00:00, 13.59it/s]
Train 5:   0%|                            | 2/1500 [00:00<01:54, 13.03it/s]

Saving adapter weights to ... ./save_dir_filtered/customized-ner/customized-ner.ner.mdl (9.96 MB)
------------------------------
Best dev F1 score: epoch 4, F1: 85.17
******************************
NER: Epoch: 5


Train 5: 100%|█████████████████████████| 1500/1500 [03:32<00:00,  7.07it/s]
dev 5: 100%|█████████████████████████████| 138/138 [00:10<00:00, 13.59it/s]
Train 6:   0%|                                    | 0/1500 [00:00<?, ?it/s]

Saving adapter weights to ... ./save_dir_filtered/customized-ner/customized-ner.ner.mdl (9.96 MB)
------------------------------
Best dev F1 score: epoch 5, F1: 86.34
******************************
NER: Epoch: 6


Train 6: 100%|█████████████████████████| 1500/1500 [03:32<00:00,  7.04it/s]
dev 6: 100%|█████████████████████████████| 138/138 [00:10<00:00, 13.52it/s]


Saving adapter weights to ... ./save_dir_filtered/customized-ner/customized-ner.ner.mdl (9.96 MB)
------------------------------
Best dev F1 score: epoch 6, F1: 86.77
Training done!
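
The log above reports a best dev F1 after every epoch, rising from 72.59 to 86.77. If the training output is captured as plain text, the per-epoch scores can be pulled out with a small regular expression, as in the sketch below. The helper name `best_f1_by_epoch` is hypothetical, and the log excerpt is copied from the output above rather than read from a file.

```python
import re

# Matches lines such as: "Best dev F1 score: epoch 3, F1: 84.61"
BEST_F1 = re.compile(r"Best dev F1 score: epoch (\d+), F1: ([\d.]+)")

def best_f1_by_epoch(log_text):
    """Return a list of (epoch, f1) pairs in the order they appear."""
    return [(int(epoch), float(f1)) for epoch, f1 in BEST_F1.findall(log_text)]

log_text = """\
Best dev F1 score: epoch 0, F1: 72.59
Best dev F1 score: epoch 1, F1: 81.34
Best dev F1 score: epoch 2, F1: 83.12
Best dev F1 score: epoch 3, F1: 84.61
Best dev F1 score: epoch 4, F1: 85.17
Best dev F1 score: epoch 5, F1: 86.34
Best dev F1 score: epoch 6, F1: 86.77
"""

scores = best_f1_by_epoch(log_text)
# Print each epoch's score and its gain over the previous epoch.
for (epoch, f1), (_, prev) in zip(scores[1:], scores[:-1]):
    print(f"epoch {epoch}: F1 {f1:.2f} (gain {f1 - prev:+.2f})")
```

In this run the gains shrink with each epoch, which gives a rough sense of how much more the 7-epoch budget could be expected to add.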


In [None]:
!cat NER-de-test.tsv | grep -v '^\#' | cut -f 2,3 | sed 's/[a-z]*$//' > NER-de-test.tsv.filtered

In [None]:
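from trankit.iterators.ner_iterators import NERDataset

# Build the test set with the trainer's own config.
# Note: `_config` and `_eval_ner` are private members of the trainer.
test_set = NERDataset(
    config=trainer._config,
    bio_fpath='./NER-de-test.tsv.filtered',
    evaluate=True
)
test_set.numberize()
# Number of batches: integer division, plus one if there is a remainder
test_batch_num = len(test_set) // trainer._config.batch_size + (len(test_set) % trainer._config.batch_size != 0)
# Compute precision, recall, and F1 on the test set
result = trainer._eval_ner(data_set=test_set, batch_num=test_batch_num,
                           name='test', epoch=-1)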

test -1:   1%|▏                            | 2/319 [00:00<00:15, 19.96it/s]

Skipped 0 over-length examples
Loaded 5100 examples


test -1: 100%|███████████████████████████| 319/319 [00:24<00:00, 13.10it/s]
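
The `test_batch_num` expression in the cell above is a ceiling division written with integer arithmetic: the remainder test contributes one extra batch when the final batch is only partly filled. The sketch below checks that arithmetic against the example and step counts printed in the logs. The batch size of 16 is an assumption inferred from those counts (24000 training examples in 1500 steps); it is not read from the Trankit config here, and `num_batches` is a hypothetical helper name.

```python
def num_batches(num_examples, batch_size):
    """Ceiling of num_examples / batch_size using integer arithmetic."""
    # The boolean is True (1) when a partial last batch is needed, else False (0).
    return num_examples // batch_size + (num_examples % batch_size != 0)

BATCH_SIZE = 16  # assumption inferred from the logs, not read from Trankit

# Example counts taken from the "Loaded N examples" lines in the output.
for name, n in [("train", 24000), ("dev", 2200), ("test", 5100)]:
    print(name, num_batches(n, BATCH_SIZE))
```

Under that assumption the three results match the progress-bar totals in the logs (1500, 138, and 319).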


In [None]:
print(result)

{'p': 88.16627556151525, 'r': 85.14082227258012, 'f1': 86.62714097496706}
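
The `f1` value is consistent with the harmonic mean of precision `p` and recall `r`, which is the usual definition of F1. The short check below recomputes it from the printed numbers. That Trankit computes the score exactly this way internally is an assumption; the output only shows that the numbers agree.

```python
# Values copied from the printed result above.
result = {'p': 88.16627556151525, 'r': 85.14082227258012, 'f1': 86.62714097496706}

def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

recomputed = f1_score(result['p'], result['r'])
print(f"reported F1:   {result['f1']:.4f}")
print(f"recomputed F1: {recomputed:.4f}")
```

The test-set F1 of about 86.63 is close to the best dev F1 of 86.77 from training, so the dev score was a reasonable guide to held-out performance in this run.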


## Running the trained model on the `test` set

In [None]:
!wget -q http://nlp.uoregon.edu/download/trankit/german.zip

In [None]:
%%shell
unzip -o german.zip

Archive:  german.zip
 extracting: german.downloaded       
  inflating: german_lemmatizer.pt    
  inflating: german_mwt_expander.pt  
  inflating: german.ner.germeval14.mdl  
  inflating: german.ner.mdl          
  inflating: german.ner-vocab.germeval14.json  
  inflating: german.ner-vocab.json   
  inflating: german.tagger.mdl       
  inflating: german.tokenizer.mdl    
  inflating: german.vocabs.json      




In [None]:
!cp -a german.tagger.mdl  ./save_dir_filtered/customized-ner/customized-ner.tagger.mdl
!cp -a german.vocabs.json  ./save_dir_filtered/customized-ner/customized-ner.vocabs.json
!cp -a german_lemmatizer.pt  ./save_dir_filtered/customized-ner/customized-ner_lemmatizer.pt
!cp -a german.tokenizer.mdl  ./save_dir_filtered/customized-ner/customized-ner.tokenizer.mdl
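
The four `cp` commands follow a single naming rule: each pretrained German component is copied into the category directory with the `german` prefix replaced by the category name `customized-ner`. A minimal Python sketch of the same rule is shown below. The function name `copy_components` and its parameters are hypothetical, and the demo runs on empty placeholder files in a temporary directory, not on the real models.

```python
import shutil
import tempfile
from pathlib import Path

# File-name suffixes of the four components copied in the cell above.
COMPONENT_SUFFIXES = [".tagger.mdl", ".vocabs.json", "_lemmatizer.pt", ".tokenizer.mdl"]

def copy_components(src_dir, dst_dir, src_prefix="german", category="customized-ner"):
    """Copy <src_prefix><suffix> to <dst_dir>/<category><suffix> for each component."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for suffix in COMPONENT_SUFFIXES:
        src = src_dir / f"{src_prefix}{suffix}"
        if not src.is_file():
            raise FileNotFoundError(src)
        dst = dst_dir / f"{category}{suffix}"
        shutil.copy2(src, dst)  # copy2 also preserves timestamps and permissions
        copied.append(dst)
    return copied

# Demo on placeholder files so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    for suffix in COMPONENT_SUFFIXES:
        (tmp / f"german{suffix}").touch()
    copied_names = [p.name for p in copy_components(tmp, tmp / "save_dir_filtered" / "customized-ner")]
    print(copied_names)
```

In the notebook itself, the call would be `copy_components('.', './save_dir_filtered/customized-ner')`, run from the directory that holds the unzipped German models.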

In [None]:
trankit.verify_customized_pipeline(
    category='customized-ner', # pipeline category
    save_dir='./save_dir_filtered' # directory used for saving models in previous steps
)

Customized pipeline is ready to use!
It can be initialized as follows:
-----------------------------------
from trankit import Pipeline
p = Pipeline(lang='customized-ner', cache_dir='./save_dir_filtered')


## Using the trained model in a Pipeline

In [None]:
from trankit import Pipeline

In [None]:
p = Pipeline(lang='customized-ner', cache_dir='./save_dir_filtered')

Loading pretrained XLM-Roberta, this may take a while...
Loading tokenizer for customized-ner
Loading tagger for customized-ner
Loading lemmatizer for customized-ner
Loading NER tagger for customized-ner
Active language: customized-ner


In [None]:
p.ner('''Die 75 Kampfjets werden im englischen Hatfield hergestellt und 1949 in die Schweiz geflogen.''', is_sent=True)

{'text': 'Die 75 Kampfjets werden im englischen Hatfield hergestellt und 1949 in die Schweiz geflogen.',
 'tokens': [{'id': 1, 'ner': 'O', 'span': (0, 3), 'text': 'Die'},
  {'id': 2, 'ner': 'O', 'span': (4, 6), 'text': '75'},
  {'id': 3, 'ner': 'O', 'span': (7, 16), 'text': 'Kampfjets'},
  {'id': 4, 'ner': 'O', 'span': (17, 23), 'text': 'werden'},
  {'MWT': 'Yes',
   'id': 5,
   'misc': 'MWT=Yes',
   'ner': 'O',
   'span': (24, 26),
   'text': 'im'},
  {'id': 6, 'ner': 'S-LOC', 'span': (27, 37), 'text': 'englischen'},
  {'id': 7, 'ner': 'S-LOC', 'span': (38, 46), 'text': 'Hatfield'},
  {'id': 8, 'ner': 'O', 'span': (47, 58), 'text': 'hergestellt'},
  {'id': 9, 'ner': 'O', 'span': (59, 62), 'text': 'und'},
  {'id': 10, 'ner': 'O', 'span': (63, 67), 'text': '1949'},
  {'id': 11, 'ner': 'O', 'span': (68, 70), 'text': 'in'},
  {'id': 12, 'ner': 'O', 'span': (71, 74), 'text': 'die'},
  {'id': 13, 'ner': 'S-LOC', 'span': (75, 82), 'text': 'Schweiz'},
  {'id': 14, 'ner': 'O', 'span': (83, 91)