# Training a CRF-BiLSTM for POS Tagging in Hausa
This code trains the deep learning models for Hausa POS tagging. It requires the Python package `flair`. By default, it trains for 10 epochs with an initial learning rate of 0.1.


In [None]:
%pip install flair

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting flair
  Downloading flair-0.11.3-py3-none-any.whl (401 kB)
Collecting bpemb>=0.3.2
  Downloading bpemb-0.3.3-py3-none-any.whl (19 kB)
Collecting sentencepiece==0.1.95
  Downloading sentencepiece-0.1.95-cp37-cp37m-manylinux2014_x86_64.whl (1.2 MB)
Collecting konoha<5.0.0,>=4.0.0
  Downloading konoha-4.6.5-py3-none-any.whl (20 kB)
Collecting huggingface-hub
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Collecting mpld3==0.3
  Downloading mpld3-0.3.tar.gz (788 kB)
Collecting janome
  Downloading Janome-0.4.2-py2.py3-none-any.whl (19.7 MB)
Collecting wiki

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os

from flair.data import Corpus, Sentence
from flair.datasets import ColumnCorpus
from flair.embeddings import TransformerWordEmbeddings, StackedEmbeddings, FastTextEmbeddings, BytePairEmbeddings
from flair.trainers import ModelTrainer
from flair.models import SequenceTagger
from flair import set_seed

We set a seed to make the results reproducible.

In [None]:
seeds = [1103, 1704]
set_seed(seeds[1])
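
To illustrate what seeding buys us, the sketch below uses only Python's standard `random` module: re-seeding with the same value reproduces the same draws. `draw_numbers` is a hypothetical helper introduced for this illustration. flair's `set_seed` is meant to apply the same idea to the random number generators used during training, although bit-identical results across different hardware are not guaranteed.

```python
import random


def draw_numbers(seed, n=5):
    """Return n pseudo-random integers drawn from a generator seeded with `seed`."""
    rng = random.Random(seed)  # private generator, does not touch global state
    return [rng.randint(0, 999) for _ in range(n)]


seeds = [1103, 1704]  # same seed values as in the cell above

first_run = draw_numbers(seeds[1])
second_run = draw_numbers(seeds[1])

# Same seed, same sequence of draws.
print(first_run == second_run)
```

Running the whole experiment once per entry in `seeds` and reporting the spread of the scores is a common way to check how sensitive the results are to initialization.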

Next, we give the file path to the training data and the type of embeddings we want to use. The supported types are `bert-bytepair`, which stacks multilingual BERT embeddings fine-tuned on Hausa with byte-pair embeddings; `afribert-bytepair`, which stacks AfriBERTa embeddings with byte-pair embeddings; and `bert`, which uses only the fine-tuned multilingual BERT embeddings.<br>
Results will be stored in `OUT_PATH`. The test file is given by `test`.

In [None]:
# This is the folder in which the train, test and dev files reside.
DATA_PATH = 'PATH/TO/TRAIN-TEST-DEV'

# One of: bert-bytepair, bert, afribert-bytepair
TYPE = "bert-bytepair"

# Training file name, without the ".conll" extension (it is appended below)
EXP_FILE = "Tanzil-en-fast_align_sym"

# Test file
test = "test.conll"

# This is where the results (including the final model) will be stored
OUT_PATH = 'OUTPUT/PATH'
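
When several training files and embedding types are compared, it helps to give each run its own output folder. The following is a minimal sketch, assuming a layout of `<base>/<training file>/<embedding type>`, which resembles the path that appears in the training log further down. `build_out_path` and `out_path` are hypothetical names, not part of flair or of the original code.

```python
import os


def build_out_path(base_dir, exp_file, emb_type):
    """Compose one output folder per (training file, embedding type) pair."""
    return os.path.join(base_dir, exp_file, emb_type)


# Placeholder base directory; replace it with a real location.
out_path = build_out_path("OUTPUT/PATH", "Tanzil-en-fast_align_sym", "bert-bytepair")
print(out_path)
```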

In [None]:
# Define the columns: the first one holds the tokens, the second one the corresponding POS tags.
columns = {0: 'text', 1: 'pos'}

corpus: Corpus = ColumnCorpus(DATA_PATH, columns,
                              train_file=EXP_FILE+".conll",
                              test_file=test)

2022-08-11 14:54:47,934 Reading data from drive/MyDrive/Data/experiments
2022-08-11 14:54:47,938 Train: drive/MyDrive/Data/experiments/Tanzil-en-fast_align_sym.conll
2022-08-11 14:54:47,943 Dev: None
2022-08-11 14:54:47,945 Test: drive/MyDrive/Data/experiments/test.conll
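
For reference, the column format expected here has one token per line, followed by its tag, with a blank line between sentences. The sketch below writes a tiny made-up file in that layout and reads it back with plain Python. The sentences and their tags are illustrative only and are not taken from the actual data; `read_column_file` is a hypothetical helper, not flair's reader.

```python
import os
import tempfile

# Two toy sentences: token, whitespace, tag; a blank line ends a sentence.
SAMPLE = "Sannu INTJ\n, PUNCT\nMusa PROPN\n. PUNCT\n\nNa PRON\ngode VERB\n. PUNCT\n"


def read_column_file(path):
    """Parse a two-column file into a list of sentences of (token, tag) pairs."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:  # blank line marks a sentence boundary
                if current:
                    sentences.append(current)
                    current = []
                continue
            token, tag = line.split()[:2]
            current.append((token, tag))
    if current:  # last sentence if the file has no trailing blank line
        sentences.append(current)
    return sentences


with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.conll")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write(SAMPLE)
    sentences = read_column_file(sample_path)

print(len(sentences))
print(sentences[0])
```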


In [None]:
label_dict = corpus.make_label_dictionary(label_type="pos")
print(label_dict)

2022-08-11 14:54:54,595 Computing label dictionary. Progress:


7637it [00:00, 8259.41it/s]

2022-08-11 14:54:55,580 Dictionary created for label 'pos' with 16 values: VERB (seen 32089 times), PRON (seen 25321 times), PUNCT (seen 23826 times), NOUN (seen 19413 times), ADP (seen 17849 times), ADV (seen 9501 times), CCONJ (seen 9487 times), DET (seen 9303 times), PROPN (seen 6349 times), ADJ (seen 4575 times), AUX (seen 1881 times), PART (seen 397 times), NUM (seen 385 times), INTJ (seen 240 times), X (seen 6 times)
Dictionary with 16 tags: <unk>, VERB, PRON, PUNCT, NOUN, ADP, ADV, CCONJ, DET, PROPN, ADJ, AUX, PART, NUM, INTJ, X
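
The printed dictionary lists `<unk>` first, followed by the tags in order of frequency. The idea can be sketched with `collections.Counter` on toy data. This is a simplified illustration, not flair's implementation; `build_label_dictionary` and the toy sentences are made up for the example.

```python
from collections import Counter

# Toy tagged sentences (illustrative, not from the real corpus).
tagged_sentences = [
    [("Sannu", "INTJ"), (",", "PUNCT"), ("Musa", "PROPN"), (".", "PUNCT")],
    [("Na", "PRON"), ("gode", "VERB"), (".", "PUNCT")],
]


def build_label_dictionary(sentences):
    """Count tags and return (labels, counts), with '<unk>' as the first label."""
    counts = Counter(tag for sentence in sentences for _, tag in sentence)
    labels = ["<unk>"] + [tag for tag, _ in counts.most_common()]
    return labels, counts


labels, counts = build_label_dictionary(tagged_sentences)
print(labels)
print(counts["PUNCT"])
```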





Next, we choose the embeddings according to the specified type in the variable `TYPE`.

In [None]:
if TYPE == "bert-bytepair":
    embedding_types = [
        TransformerWordEmbeddings("Davlan/bert-base-multilingual-cased-finetuned-hausa"),
        BytePairEmbeddings("ha"),
    ]
elif TYPE == "afribert-bytepair":
    embedding_types = [
        TransformerWordEmbeddings("castorini/afriberta_large"),
        BytePairEmbeddings("ha"),
    ]
elif TYPE == "bert":
    embedding_types = [
        TransformerWordEmbeddings("Davlan/bert-base-multilingual-cased-finetuned-hausa"),
    ]
else:
    raise ValueError("Unknown embedding type: {}".format(TYPE))

Downloading tokenizer_config.json:   0%|          | 0.00/297 [00:00<?, ?B/s]

Downloading config.json:   0%|          | 0.00/806 [00:00<?, ?B/s]

Downloading vocab.txt:   0%|          | 0.00/972k [00:00<?, ?B/s]

Downloading special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/679M [00:00<?, ?B/s]

BPEmb fallback: ha from vocab size 100000 to 5000
downloading https://nlp.h-its.org/bpemb/ha/ha.wiki.bpe.vs5000.model


100%|██████████| 315468/315468 [00:00<00:00, 1192339.27B/s]


downloading https://nlp.h-its.org/bpemb/ha/ha.wiki.bpe.vs5000.d50.w2v.bin.tar.gz


100%|██████████| 957581/957581 [00:00<00:00, 2138970.64B/s]
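
If more embedding types are added later, the `if`/`elif` chain can be replaced by a lookup table. The sketch below shows only the selection logic in plain Python, without loading any models. `EMBEDDING_CONFIGS` and `resolve_embedding_config` are hypothetical names; the model identifiers are the ones used in the cell above.

```python
# Maps each embedding type to (transformer model name, stack byte-pair embeddings?).
EMBEDDING_CONFIGS = {
    "bert-bytepair": ("Davlan/bert-base-multilingual-cased-finetuned-hausa", True),
    "afribert-bytepair": ("castorini/afriberta_large", True),
    "bert": ("Davlan/bert-base-multilingual-cased-finetuned-hausa", False),
}


def resolve_embedding_config(emb_type):
    """Return the configuration for emb_type or raise ValueError if unknown."""
    try:
        return EMBEDDING_CONFIGS[emb_type]
    except KeyError:
        raise ValueError("Unknown embedding type: {}".format(emb_type)) from None


model_name, use_bytepair = resolve_embedding_config("bert-bytepair")
print(model_name, use_bytepair)
```

The returned pair could then be used to build the list passed to `StackedEmbeddings`, keeping the model names in one place.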


In [None]:
embeddings = StackedEmbeddings(embeddings=embedding_types)

In [None]:
tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=label_dict,
                        tag_type="pos",
                        use_crf=True)

2022-08-11 14:55:20,615 SequenceTagger predicts: Dictionary with 16 tags: <unk>, VERB, PRON, PUNCT, NOUN, ADP, ADV, CCONJ, DET, PROPN, ADJ, AUX, PART, NUM, INTJ, X


Finally, training is started:

In [None]:
trainer = ModelTrainer(tagger, corpus)
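
As a quick sanity check on the training log, the number of iterations per epoch follows from the training set size and the batch size. The sketch assumes that the 7637 items counted while building the label dictionary are the training sentences, and uses the `mini_batch_size=32` passed to `train`.

```python
import math

num_train_sentences = 7637  # assumption: count shown by the label dictionary progress bar
mini_batch_size = 32        # value passed to trainer.train

# The last batch may be smaller than the others, hence the ceiling.
iterations_per_epoch = math.ceil(num_train_sentences / mini_batch_size)
print(iterations_per_epoch)  # 239
```

This agrees with the `iter .../239` counter printed during training.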

In [None]:
trainer.train(OUT_PATH,
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=10,
              write_weights=True,
              patience=2)

2022-08-11 14:55:24,172 ----------------------------------------------------------------------------------------------------
2022-08-11 14:55:24,178 Model: "SequenceTagger(
  (embeddings): StackedEmbeddings(
    (list_embedding_0): TransformerWordEmbeddings(
      (model): BertModel(
        (embeddings): BertEmbeddings(
          (word_embeddings): Embedding(119547, 768, padding_idx=0)
          (position_embeddings): Embedding(512, 768)
          (token_type_embeddings): Embedding(2, 768)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (encoder): BertEncoder(
          (layer): ModuleList(
            (0): BertLayer(
              (attention): BertAttention(
                (self): BertSelfAttention(
                  (query): Linear(in_features=768, out_features=768, bias=True)
                  (key): Linear(in_features=768, out_features=768, bias=True)
                  (value): Linea

100%|██████████| 27/27 [00:24<00:00,  1.11it/s]

2022-08-11 15:02:01,436 Evaluating as a multi-label problem: False
2022-08-11 15:02:01,567 DEV : loss 1.1910310983657837 - f1-score (micro avg)  0.6486





2022-08-11 15:02:01,673 BAD EPOCHS (no improvement): 0
2022-08-11 15:02:01,680 saving best model
2022-08-11 15:02:04,894 ----------------------------------------------------------------------------------------------------
2022-08-11 15:02:23,723 epoch 2 - iter 23/239 - loss 1.23383781 - samples/sec: 39.11 - lr: 0.100000
2022-08-11 15:02:45,049 epoch 2 - iter 46/239 - loss 1.23372104 - samples/sec: 45.22 - lr: 0.100000
2022-08-11 15:03:06,716 epoch 2 - iter 69/239 - loss 1.23402671 - samples/sec: 44.21 - lr: 0.100000
2022-08-11 15:03:28,278 epoch 2 - iter 92/239 - loss 1.22508404 - samples/sec: 44.64 - lr: 0.100000
2022-08-11 15:03:50,247 epoch 2 - iter 115/239 - loss 1.22037242 - samples/sec: 43.71 - lr: 0.100000
2022-08-11 15:04:10,702 epoch 2 - iter 138/239 - loss 1.22190495 - samples/sec: 48.10 - lr: 0.100000
2022-08-11 15:04:32,174 epoch 2 - iter 161/239 - loss 1.21490918 - samples/sec: 44.79 - lr: 0.100000
2022-08-11 15:04:52,103 epoch 2 - iter 184/239 - loss 1.21382190 - samples/

100%|██████████| 27/27 [00:13<00:00,  2.01it/s]

2022-08-11 15:05:57,601 Evaluating as a multi-label problem: False
2022-08-11 15:05:57,740 DEV : loss 1.0660001039505005 - f1-score (micro avg)  0.6793





2022-08-11 15:05:57,838 BAD EPOCHS (no improvement): 0
2022-08-11 15:05:57,845 saving best model
2022-08-11 15:06:01,005 ----------------------------------------------------------------------------------------------------
2022-08-11 15:06:29,094 epoch 3 - iter 23/239 - loss 1.11176083 - samples/sec: 26.24 - lr: 0.100000
2022-08-11 15:07:01,534 epoch 3 - iter 46/239 - loss 1.12182005 - samples/sec: 26.83 - lr: 0.100000
2022-08-11 15:07:32,576 epoch 3 - iter 69/239 - loss 1.12347225 - samples/sec: 28.30 - lr: 0.100000
2022-08-11 15:08:04,877 epoch 3 - iter 92/239 - loss 1.11867885 - samples/sec: 27.00 - lr: 0.100000
2022-08-11 15:08:37,392 epoch 3 - iter 115/239 - loss 1.12055223 - samples/sec: 26.77 - lr: 0.100000
2022-08-11 15:09:10,746 epoch 3 - iter 138/239 - loss 1.11838608 - samples/sec: 26.11 - lr: 0.100000
2022-08-11 15:09:42,905 epoch 3 - iter 161/239 - loss 1.11753954 - samples/sec: 27.20 - lr: 0.100000
2022-08-11 15:10:15,111 epoch 3 - iter 184/239 - loss 1.11762787 - samples/

100%|██████████| 27/27 [00:21<00:00,  1.24it/s]

2022-08-11 15:11:56,947 Evaluating as a multi-label problem: False
2022-08-11 15:11:57,086 DEV : loss 1.0288790464401245 - f1-score (micro avg)  0.6877





2022-08-11 15:11:57,186 BAD EPOCHS (no improvement): 0
2022-08-11 15:11:57,192 saving best model
2022-08-11 15:12:00,235 ----------------------------------------------------------------------------------------------------
2022-08-11 15:12:28,911 epoch 4 - iter 23/239 - loss 1.07356593 - samples/sec: 25.68 - lr: 0.100000
2022-08-11 15:13:00,375 epoch 4 - iter 46/239 - loss 1.06430979 - samples/sec: 27.88 - lr: 0.100000
2022-08-11 15:13:32,491 epoch 4 - iter 69/239 - loss 1.06480749 - samples/sec: 27.24 - lr: 0.100000
2022-08-11 15:14:04,071 epoch 4 - iter 92/239 - loss 1.05681226 - samples/sec: 27.79 - lr: 0.100000
2022-08-11 15:14:35,481 epoch 4 - iter 115/239 - loss 1.05772477 - samples/sec: 27.89 - lr: 0.100000
2022-08-11 15:15:07,420 epoch 4 - iter 138/239 - loss 1.05415355 - samples/sec: 27.38 - lr: 0.100000
2022-08-11 15:15:39,910 epoch 4 - iter 161/239 - loss 1.05841195 - samples/sec: 26.86 - lr: 0.100000
2022-08-11 15:16:11,152 epoch 4 - iter 184/239 - loss 1.05476329 - samples/

100%|██████████| 27/27 [00:21<00:00,  1.26it/s]

2022-08-11 15:17:53,185 Evaluating as a multi-label problem: False
2022-08-11 15:17:53,317 DEV : loss 0.9974625110626221 - f1-score (micro avg)  0.6934





2022-08-11 15:17:53,431 BAD EPOCHS (no improvement): 0
2022-08-11 15:17:53,439 saving best model
2022-08-11 15:17:56,523 ----------------------------------------------------------------------------------------------------
2022-08-11 15:18:24,639 epoch 5 - iter 23/239 - loss 0.98321060 - samples/sec: 26.19 - lr: 0.100000
2022-08-11 15:18:56,849 epoch 5 - iter 46/239 - loss 1.00391830 - samples/sec: 27.11 - lr: 0.100000
2022-08-11 15:19:28,875 epoch 5 - iter 69/239 - loss 1.01344186 - samples/sec: 27.41 - lr: 0.100000
2022-08-11 15:19:59,753 epoch 5 - iter 92/239 - loss 1.01559550 - samples/sec: 28.75 - lr: 0.100000
2022-08-11 15:20:32,062 epoch 5 - iter 115/239 - loss 1.01254296 - samples/sec: 27.04 - lr: 0.100000
2022-08-11 15:21:03,506 epoch 5 - iter 138/239 - loss 1.01149080 - samples/sec: 27.92 - lr: 0.100000
2022-08-11 15:21:34,452 epoch 5 - iter 161/239 - loss 1.01328993 - samples/sec: 28.46 - lr: 0.100000
2022-08-11 15:22:06,460 epoch 5 - iter 184/239 - loss 1.01046391 - samples/

100%|██████████| 27/27 [00:21<00:00,  1.26it/s]

2022-08-11 15:23:47,303 Evaluating as a multi-label problem: False
2022-08-11 15:23:47,439 DEV : loss 0.9877561330795288 - f1-score (micro avg)  0.6939





2022-08-11 15:23:47,549 BAD EPOCHS (no improvement): 0
2022-08-11 15:23:47,556 saving best model
2022-08-11 15:23:50,567 ----------------------------------------------------------------------------------------------------
2022-08-11 15:24:09,500 epoch 6 - iter 23/239 - loss 1.68200927 - samples/sec: 38.90 - lr: 0.100000
2022-08-11 15:24:32,696 epoch 6 - iter 46/239 - loss 1.82462989 - samples/sec: 40.81 - lr: 0.100000
2022-08-11 15:24:55,338 epoch 6 - iter 69/239 - loss 1.80941592 - samples/sec: 41.92 - lr: 0.100000
2022-08-11 15:25:18,019 epoch 6 - iter 92/239 - loss 1.77567229 - samples/sec: 41.85 - lr: 0.100000
2022-08-11 15:25:41,545 epoch 6 - iter 115/239 - loss 1.74620606 - samples/sec: 39.87 - lr: 0.100000
2022-08-11 15:26:04,991 epoch 6 - iter 138/239 - loss 1.72357995 - samples/sec: 40.11 - lr: 0.100000
2022-08-11 15:26:28,333 epoch 6 - iter 161/239 - loss 1.70134101 - samples/sec: 40.40 - lr: 0.100000
2022-08-11 15:26:51,576 epoch 6 - iter 184/239 - loss 1.68111734 - samples/

100%|██████████| 27/27 [00:12<00:00,  2.18it/s]


2022-08-11 15:28:00,110 Evaluating as a multi-label problem: False
2022-08-11 15:28:00,235 DEV : loss 1.406103253364563 - f1-score (micro avg)  0.4988
2022-08-11 15:28:00,338 BAD EPOCHS (no improvement): 1
2022-08-11 15:28:00,344 ----------------------------------------------------------------------------------------------------
2022-08-11 15:28:17,884 epoch 7 - iter 23/239 - loss 1.50901917 - samples/sec: 41.98 - lr: 0.100000
2022-08-11 15:28:39,706 epoch 7 - iter 46/239 - loss 1.50155799 - samples/sec: 43.91 - lr: 0.100000
2022-08-11 15:29:02,952 epoch 7 - iter 69/239 - loss 1.48395463 - samples/sec: 40.54 - lr: 0.100000
2022-08-11 15:29:27,655 epoch 7 - iter 92/239 - loss 1.47559036 - samples/sec: 37.58 - lr: 0.100000
2022-08-11 15:29:51,589 epoch 7 - iter 115/239 - loss 1.46704033 - samples/sec: 38.99 - lr: 0.100000
2022-08-11 15:30:13,630 epoch 7 - iter 138/239 - loss 1.46326104 - samples/sec: 43.29 - lr: 0.100000
2022-08-11 15:30:36,585 epoch 7 - iter 161/239 - loss 1.45403120 - 

100%|██████████| 27/27 [00:11<00:00,  2.38it/s]


2022-08-11 15:32:07,764 Evaluating as a multi-label problem: False
2022-08-11 15:32:07,891 DEV : loss 1.2589526176452637 - f1-score (micro avg)  0.5911
2022-08-11 15:32:07,992 BAD EPOCHS (no improvement): 2
2022-08-11 15:32:07,999 ----------------------------------------------------------------------------------------------------
2022-08-11 15:32:24,559 epoch 8 - iter 23/239 - loss 1.38591451 - samples/sec: 44.46 - lr: 0.100000
2022-08-11 15:32:47,589 epoch 8 - iter 46/239 - loss 1.38760550 - samples/sec: 40.99 - lr: 0.100000
2022-08-11 15:33:10,294 epoch 8 - iter 69/239 - loss 1.38296202 - samples/sec: 41.70 - lr: 0.100000
2022-08-11 15:33:35,435 epoch 8 - iter 92/239 - loss 1.37711463 - samples/sec: 36.78 - lr: 0.100000
2022-08-11 15:33:58,966 epoch 8 - iter 115/239 - loss 1.37928749 - samples/sec: 39.86 - lr: 0.100000
2022-08-11 15:34:20,620 epoch 8 - iter 138/239 - loss 1.37656159 - samples/sec: 44.47 - lr: 0.100000
2022-08-11 15:34:42,832 epoch 8 - iter 161/239 - loss 1.36947288 -

100%|██████████| 27/27 [00:11<00:00,  2.37it/s]


2022-08-11 15:36:14,851 Evaluating as a multi-label problem: False
2022-08-11 15:36:14,978 DEV : loss 1.2056884765625 - f1-score (micro avg)  0.6083
2022-08-11 15:36:15,075 Epoch     8: reducing learning rate of group 0 to 5.0000e-02.
2022-08-11 15:36:15,079 BAD EPOCHS (no improvement): 3
2022-08-11 15:36:15,083 ----------------------------------------------------------------------------------------------------
2022-08-11 15:36:31,917 epoch 9 - iter 23/239 - loss 1.30596202 - samples/sec: 43.74 - lr: 0.050000
2022-08-11 15:36:55,371 epoch 9 - iter 46/239 - loss 1.31507865 - samples/sec: 40.03 - lr: 0.050000
2022-08-11 15:37:19,360 epoch 9 - iter 69/239 - loss 1.30526652 - samples/sec: 38.87 - lr: 0.050000
2022-08-11 15:37:41,957 epoch 9 - iter 92/239 - loss 1.31055194 - samples/sec: 41.99 - lr: 0.050000
2022-08-11 15:38:03,833 epoch 9 - iter 115/239 - loss 1.30949463 - samples/sec: 43.91 - lr: 0.050000
2022-08-11 15:38:27,425 epoch 9 - iter 138/239 - loss 1.30928329 - samples/sec: 39.7

100%|██████████| 27/27 [00:12<00:00,  2.19it/s]


2022-08-11 15:40:24,492 Evaluating as a multi-label problem: False
2022-08-11 15:40:24,626 DEV : loss 1.1722631454467773 - f1-score (micro avg)  0.6211
2022-08-11 15:40:24,725 BAD EPOCHS (no improvement): 1
2022-08-11 15:40:24,732 ----------------------------------------------------------------------------------------------------
2022-08-11 15:40:42,959 epoch 10 - iter 23/239 - loss 1.29182901 - samples/sec: 40.40 - lr: 0.050000
2022-08-11 15:41:05,382 epoch 10 - iter 46/239 - loss 1.29347435 - samples/sec: 42.39 - lr: 0.050000
2022-08-11 15:41:28,452 epoch 10 - iter 69/239 - loss 1.29364565 - samples/sec: 41.10 - lr: 0.050000
2022-08-11 15:41:51,174 epoch 10 - iter 92/239 - loss 1.29509299 - samples/sec: 41.66 - lr: 0.050000
2022-08-11 15:42:13,242 epoch 10 - iter 115/239 - loss 1.29400576 - samples/sec: 43.33 - lr: 0.050000
2022-08-11 15:42:37,021 epoch 10 - iter 138/239 - loss 1.28822207 - samples/sec: 39.27 - lr: 0.050000
2022-08-11 15:43:00,658 epoch 10 - iter 161/239 - loss 1.287

100%|██████████| 27/27 [00:11<00:00,  2.35it/s]


2022-08-11 15:44:32,105 Evaluating as a multi-label problem: False
2022-08-11 15:44:32,235 DEV : loss 1.1549733877182007 - f1-score (micro avg)  0.6259
2022-08-11 15:44:32,330 BAD EPOCHS (no improvement): 2
2022-08-11 15:44:35,110 ----------------------------------------------------------------------------------------------------
2022-08-11 15:44:35,118 loading file drive/MyDrive/Data/resources_run2/Tanzil-en-fast_align_sym/bert-bytepair/best-model.pt
2022-08-11 15:44:39,701 SequenceTagger predicts: Dictionary with 18 tags: <unk>, VERB, PRON, PUNCT, NOUN, ADP, ADV, CCONJ, DET, PROPN, ADJ, AUX, PART, NUM, INTJ, X, <START>, <STOP>


100%|██████████| 8/8 [00:01<00:00,  4.58it/s]

2022-08-11 15:44:41,787 Evaluating as a multi-label problem: False
2022-08-11 15:44:41,813 0.4881	0.4881	0.4881	0.4881
2022-08-11 15:44:41,815 
Results:
- F-score (micro) 0.4881
- F-score (macro) 0.2857
- Accuracy 0.4881

By class:
              precision    recall  f1-score   support

        VERB     0.4467    0.9019    0.5975       418
        NOUN     0.8092    0.5491    0.6542       479
        PRON     0.3307    0.6522    0.4388       253
       PUNCT     0.7698    0.9180    0.8374       317
         AUX     0.7843    0.0864    0.1556       463
         ADP     0.4467    0.5906    0.5087       149
        PART     0.0000    0.0000    0.0000       286
         ADV     0.1636    0.3506    0.2231        77
         DET     0.4118    0.5957    0.4870        47
        CONJ     0.0000    0.0000    0.0000        89
        INTJ     0.6000    0.1463    0.2353        41
       CCONJ     0.0000    0.0000    0.0000         0
       PROPN     0.0000    0.0000    0.0000         0
         AD




{'dev_loss_history': [1.1910310983657837,
  1.0660001039505005,
  1.0288790464401245,
  0.9974625110626221,
  0.9877561330795288,
  1.406103253364563,
  1.2589526176452637,
  1.2056884765625,
  1.1722631454467773,
  1.1549733877182007],
 'dev_score_history': [0.6485683259987426,
  0.6793164542492999,
  0.6877178944961994,
  0.6934331599702807,
  0.6938903812082071,
  0.49877121792307255,
  0.591072755329485,
  0.6083328570612105,
  0.6211350517231525,
  0.62587872206664],
 'test_score': 0.4880726997349489,
 'train_loss_history': [1.6035722777026817,
  1.2002931522103817,
  1.1123701024498707,
  1.0538615171302324,
  1.010418737119166,
  1.6436094155406404,
  1.4370607127037296,
  1.360074660441129,
  1.308443386965496,
  1.284898236608602]}
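
The returned history can be used to retrace the log. The sketch below picks the best epoch by dev score and replays a simplified anneal-on-plateau rule with `patience=2` and a halving factor. It is an approximation written for illustration, not flair's scheduler; `simulate_anneal` is a hypothetical helper, and the factor of 0.5 is inferred from the drop from 0.1 to 0.05 in the log.

```python
# Dev scores copied from the 'dev_score_history' entry above.
dev_score_history = [
    0.6485683259987426, 0.6793164542492999, 0.6877178944961994,
    0.6934331599702807, 0.6938903812082071, 0.49877121792307255,
    0.591072755329485, 0.6083328570612105, 0.6211350517231525,
    0.62587872206664,
]


def simulate_anneal(scores, initial_lr=0.1, patience=2, factor=0.5):
    """Return (epoch, bad_epochs, lr_after_epoch) for a simple plateau rule."""
    best = float("-inf")
    bad_epochs = 0
    lr = initial_lr
    trace = []
    for epoch, score in enumerate(scores, start=1):
        if score > best:
            best = score
            bad_epochs = 0
            reported = 0
        else:
            bad_epochs += 1
            reported = bad_epochs
            if bad_epochs > patience:  # too many epochs without improvement
                lr *= factor
                bad_epochs = 0         # counter restarts after annealing
        trace.append((epoch, reported, lr))
    return trace


best_epoch = max(range(len(dev_score_history)), key=dev_score_history.__getitem__) + 1
trace = simulate_anneal(dev_score_history)

print(best_epoch)  # 5
for epoch, bad, lr in trace:
    print(epoch, bad, lr)
```

Under these assumptions the bad-epoch counts come out as 0, 0, 0, 0, 0, 1, 2, 3, 1, 2, with the learning rate halved after epoch 8, which matches the `BAD EPOCHS` lines in the log. The best dev score is reached in epoch 5, the last epoch for which the log reports `saving best model`.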