#### **Trankit:**

**Introduction:**
Trankit is a light-weight, Transformer-based Python toolkit for multilingual natural language processing.


*   [GitHub Repo](https://github.com/nlp-uoregon/trankit)

*   [Documentation](https://trankit.readthedocs.io/en/latest/#)

*   [Demo Website](http://nlp.uoregon.edu/trankit)

**Usage:**
Trankit can process untokenized (raw) or pretokenized input strings, at both the sentence and the document level. It currently supports the following tasks:

1. Sentence segmentation
2. Tokenization
3. Multi-word token expansion
4. Part-of-speech tagging
5. Morphological feature tagging
6. **Dependency parsing**
7. Named entity recognition

**Customized pipeline:**
Customized pipelines are trained with Trankit's `TPipeline` class. See the [documentation](https://trankit.readthedocs.io/en/latest/training.html#training-a-joint-model-for-part-of-speech-tagging-morphologicial-feature-tagging-and-dependency-parsing) on training a joint model for part-of-speech tagging, morphological feature tagging, and dependency parsing.

**CoNLL-U Format:**
The full specification is available at [CoNLL-U Format](https://universaldependencies.org/format.html). Each sentence consists of one or more word lines, and each word line contains the following ten tab-separated fields:

1. **ID:** Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0).
2. **FORM:** Word form or punctuation symbol.
3. **LEMMA:** Lemma or stem of word form.
4. **UPOS:** Universal part-of-speech tag.
5. **XPOS:** Language-specific part-of-speech tag; underscore if not available.
6. **FEATS:** List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
7. **HEAD:** Head of the current word, which is either a value of ID or zero (0).
8. **DEPREL:** Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
9. **DEPS:** Enhanced dependency graph in the form of a list of head-deprel pairs.
10. **MISC:** Any other annotation.
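The field list above can be turned into a small parser. The following is a minimal sketch, not part of Trankit: the names `CONLLU_FIELDS` and `parse_word_line` are hypothetical, and the example line is an illustrative word line with most fields left as underscores.

```python
from typing import Dict, Optional

# Column names in the order given by the CoNLL-U specification.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_word_line(line: str) -> Optional[Dict[str, str]]:
    """Split one CoNLL-U line into its ten fields.

    Returns None for blank lines (sentence boundaries) and for
    comment lines starting with '#'.
    """
    line = line.rstrip("\n")
    if not line.strip() or line.startswith("#"):
        return None
    columns = line.split("\t")
    if len(columns) != len(CONLLU_FIELDS):
        raise ValueError(f"expected 10 tab-separated fields, got {len(columns)}")
    return dict(zip(CONLLU_FIELDS, columns))


# Illustrative word line (hypothetical); unavailable fields are underscores.
example = "7\tहैं\tहै\tv\tVM\t_\t0\tmain\t_\t_"
word = parse_word_line(example)
print(word["form"], word["head"], word["deprel"])
```

All values are kept as strings here, because ID may be a range or a decimal and therefore cannot always be converted to an integer.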

#### **Installation:**

In [None]:
%%capture
!pip install transformers
!pip install trankit

* **Working Directory:**

In [None]:
%cd '/content/drive/MyDrive/btp_trankit'
!ls

/content/drive/MyDrive/btp_trankit
btp_trankit		 news_articles_and_heritage  test-conllu.dat.filtered
dev-conllu.dat		 save_dir_filtered	     train-conllu.dat
dev-conllu.dat.filtered  test-conllu.dat	     train-conllu.dat.filtered


* **Merge dev, train, test files:**

In [None]:
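import glob

# NOTE: the output name 'merged-conllu.dat' is an assumption. The original cell wrote into
# 'dev-conllu.dat', which '*.dat' also matches, so that file was truncated before being read.
out_path = 'merged-conllu.dat'
read_files = sorted(f for f in glob.glob('*.dat') if f != out_path)
with open(out_path, 'wb') as outfile:
  for f in read_files:
    with open(f, 'rb') as infile:
      outfile.write(infile.read())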

* **Filtering CoNLL-U files:**

In [None]:
!grep -v '^#' train-conllu.dat | sed 's/[a-z]*$//' > train-conllu.dat.filtered
!grep -v '^#' dev-conllu.dat | sed 's/[a-z]*$//' > dev-conllu.dat.filtered
!grep -v '^#' test-conllu.dat | sed 's/[a-z]*$//' > test-conllu.dat.filtered
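For readers who prefer to stay in Python, the shell filter above can be approximated as follows. This is a sketch under the assumption that `[a-z]` means ASCII lowercase letters only (in `sed` this can depend on the locale); the function names are hypothetical.

```python
import re
from typing import Iterable, Iterator

# Matches a run of ASCII lowercase letters at the very end of a line.
TRAILING_LOWERCASE = re.compile(r"[a-z]+$")


def filter_conllu_lines(lines: Iterable[str]) -> Iterator[str]:
    """Drop comment lines and strip trailing lowercase letters from the rest."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):      # same effect as: grep -v '^#'
            continue
        yield TRAILING_LOWERCASE.sub("", line)  # same effect as: sed 's/[a-z]*$//'


def filter_conllu_file(src_path: str, dst_path: str) -> None:
    """Apply the filter to src_path and write the result to dst_path."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in filter_conllu_lines(src):
            dst.write(line + "\n")


# Small in-memory demonstration with made-up lines.
sample = ["# text = x\n", "1\ta\tb\tfoo\n", "\n"]
filtered = list(filter_conllu_lines(sample))
print(filtered)
```

Blank lines are kept, so sentence boundaries survive the filtering.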

#### **Custom Pipeline:**

* **Setup:**

In [None]:
import trankit

# initialize a trainer for the task
trainer = trankit.TPipeline(
    training_config={
        'language': 'hindi',  # language of the data
        'max_epoch': 5,  # number of training epochs
        'category': 'customized',  # pipeline category
        'task': 'posdep',  # task name
        'save_dir': './save_dir_filtered',  # directory for saving the trained model
        'train_conllu_fpath': './train-conllu.dat.filtered',  # CoNLL-U annotations for training
        'dev_conllu_fpath': './dev-conllu.dat.filtered'  # CoNLL-U annotations for development
    }
)

Setting up training config...
Loaded 15081 entries from ./train-conllu.dat.filtered
Loaded 1864 entries from ./dev-conllu.dat.filtered



* **Training:**

In [None]:
trainer.train()

******************************
Posdep tagger: Epoch: 0


Train 0: 943it [01:14, 12.74it/s]                                          
dev 0: 100%|█████████████████████████████| 117/117 [00:04<00:00, 24.57it/s]


Saving adapter weights to ... ./save_dir_filtered/xlm-roberta-base/customized/customized.tagger.mdl (44.97 MB)
------------------------------ Best dev CoNLLu score: epoch 0------------------------------
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |    100.00 |    100.00 |    100.00 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |    100.00 |    100.00 |    100.00 |
UPOS       |     96.30 |     96.30 |     96.30 |     96.30
XPOS       |     91.76 |     91.76 |     91.76 |     91.76
UFeats     |    100.00 |    100.00 |    100.00 |    100.00
AllTags    |     91.62 |     91.62 |     91.62 |     91.62
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     74.20 |     74.20 |     74.20 |     74.20
LAS        |     60.32 |     60.32 |     60.32 |     60.32
CLAS       |     42.00 |     18.29 |     25.48 |     18.29
MLAS       |     39.67 |     17.27 |     24.06 |     17.27


Train 1: 943it [01:13, 12.87it/s]                                          
dev 1: 100%|█████████████████████████████| 117/117 [00:04<00:00, 24.90it/s]


Saving adapter weights to ... ./save_dir_filtered/xlm-roberta-base/customized/customized.tagger.mdl (44.97 MB)
------------------------------ Best dev CoNLLu score: epoch 1------------------------------
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |    100.00 |    100.00 |    100.00 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |    100.00 |    100.00 |    100.00 |
UPOS       |     98.39 |     98.39 |     98.39 |     98.39
XPOS       |     94.71 |     94.71 |     94.71 |     94.71
UFeats     |    100.00 |    100.00 |    100.00 |    100.00
AllTags    |     94.60 |     94.60 |     94.60 |     94.60
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     85.79 |     85.79 |     85.79 |     85.79
LAS        |     73.31 |     73.31 |     73.31 |     73.31
CLAS       |     68.75 |     44.70 |     54.18 |     44.70
MLAS       |     68.53 |     44.56 |     54.00 |     44.56


Train 2: 943it [01:12, 13.07it/s]
dev 2: 100%|█████████████████████████████| 117/117 [00:04<00:00, 25.32it/s]


Saving adapter weights to ... ./save_dir_filtered/xlm-roberta-base/customized/customized.tagger.mdl (44.97 MB)
------------------------------ Best dev CoNLLu score: epoch 2------------------------------
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |    100.00 |    100.00 |    100.00 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |    100.00 |    100.00 |    100.00 |
UPOS       |     98.80 |     98.80 |     98.80 |     98.80
XPOS       |     95.60 |     95.60 |     95.60 |     95.60
UFeats     |    100.00 |    100.00 |    100.00 |    100.00
AllTags    |     95.52 |     95.52 |     95.52 |     95.52
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     87.92 |     87.92 |     87.92 |     87.92
LAS        |     75.85 |     75.85 |     75.85 |     75.85
CLAS       |     71.40 |     52.54 |     60.54 |     52.54
MLAS       |     71.40 |     52.54 |     60.54 |     52.54


Train 3: 943it [01:14, 12.58it/s]
dev 3: 100%|█████████████████████████████| 117/117 [00:04<00:00, 24.74it/s]


Saving adapter weights to ... ./save_dir_filtered/xlm-roberta-base/customized/customized.tagger.mdl (44.97 MB)
------------------------------ Best dev CoNLLu score: epoch 3------------------------------
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |    100.00 |    100.00 |    100.00 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |    100.00 |    100.00 |    100.00 |
UPOS       |     98.85 |     98.85 |     98.85 |     98.85
XPOS       |     95.79 |     95.79 |     95.79 |     95.79
UFeats     |    100.00 |    100.00 |    100.00 |    100.00
AllTags    |     95.65 |     95.65 |     95.65 |     95.65
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     88.19 |     88.19 |     88.19 |     88.19
LAS        |     76.28 |     76.28 |     76.28 |     76.28
CLAS       |     71.37 |     50.65 |     59.25 |     50.65
MLAS       |     70.96 |     50.36 |     58.91 |     50.36


Train 4: 943it [01:14, 12.74it/s]                                          
dev 4: 100%|█████████████████████████████| 117/117 [00:04<00:00, 25.47it/s]


Saving adapter weights to ... ./save_dir_filtered/xlm-roberta-base/customized/customized.tagger.mdl (44.97 MB)
------------------------------ Best dev CoNLLu score: epoch 4------------------------------
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |    100.00 |    100.00 |    100.00 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |    100.00 |    100.00 |    100.00 |
UPOS       |     98.99 |     98.99 |     98.99 |     98.99
XPOS       |     96.06 |     96.06 |     96.06 |     96.06
UFeats     |    100.00 |    100.00 |    100.00 |    100.00
AllTags    |     95.96 |     95.96 |     95.96 |     95.96
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     88.44 |     88.44 |     88.44 |     88.44
LAS        |     76.96 |     76.96 |     76.96 |     76.96
CLAS       |     68.88 |     55.59 |     61.53 |     55.59
MLAS       |     68.17 |     55.01 |     60.88 |     55.01


* **Creating custom Pipeline:**

In [None]:
trankit.download_missing_files(
    category='customized',
    save_dir='./save_dir_filtered',
    embedding_name='xlm-roberta-base',
    language='hindi'
)

http://nlp.uoregon.edu/download/trankit/v1.0.0/xlm-roberta-base/hindi.zip


Downloading: 100%|██████████| 27.1M/27.1M [00:02<00:00, 9.45MiB/s]


* **Loading Pipeline:**

In [None]:
import trankit
trankit.verify_customized_pipeline(
    category='customized',  # pipeline category
    save_dir='./save_dir_filtered',  # directory used for saving models in the previous steps
    embedding_name='xlm-roberta-base'  # embedding used to train the customized pipeline (default: `xlm-roberta-base`)
)

Customized pipeline is ready to use!
It can be initialized as follows:
-----------------------------------
from trankit import Pipeline
p = Pipeline(lang='customized', cache_dir='./save_dir_filtered')


In [None]:
from trankit import Pipeline
p = Pipeline(lang='customized', cache_dir='./save_dir_filtered')

Loading pretrained XLM-Roberta, this may take a while...



Loading tokenizer for customized
Loading tagger for customized
Loading lemmatizer for customized
Active language: customized


#### **Testing Pipeline:**

In [None]:
sent_text = 'इसके कुंड गुफा तथा भीमशिला स्थल हैं ।'
tokens = p.posdep(sent_text, is_sent=True)

* **Actual Tokens:**

* {'id': 1,
   'text': 'इसके',
   'upos': 'pn',
   'xpos': 'PRP',
   'feats': 'cat-pn|gen-any|num-sg|pers-3|case-o|vib-0_अतिरिक्त|tam-ke|chunkId-NP|stype-|voicetype-',
   'head': 7,
   'deprel': 'vmod',
   'lemma': 'यह'}

* {'id': 2,
   'text': 'कुंड',
   'upos': 'n',
   'xpos': 'NNP',
   'feats': 'cat-n|gen-m|num-sg|pers-3|case-d|vib-0|tam-0|chunkId-NP2|stype-|voicetype-',
   'head': 4,
   'deprel': 'ccof',
   'lemma': 'कुंड'}

* {'id': 3,
   'text': 'गुफा',
   'upos': 'n',
   'xpos': 'NNP',
   'feats': 'cat-n|gen-f|num-sg|pers-3|case-d|vib-0|tam-0|chunkId-NP3|stype-|voicetype-',
   'head': 4,
   'deprel': 'ccof',
   'lemma': 'गुफा'}

* {'id': 4,
   'text': 'तथा',
   'upos': 'avy',
   'xpos': 'CC',
   'feats': 'cat-avy|gen-|num-|pers-|case-|vib-|tam-|chunkId-CCP|stype-|voicetype-',
   'head': 7,
   'deprel': 'k1',
   'lemma': 'तथा'}

* {'id': 5,
   'text': 'भीमशिला',
   'upos': 'n',
   'xpos': 'NNP',
   'feats': 'cat-n|gen-f|num-sg|pers-3|case-d|vib-0|tam-0|chunkId-NP4|stype-|voicetype-',
   'head': 4,
   'deprel': 'ccof',
   'lemma': 'भीमशिला'}

* {'id': 6,
   'text': 'स्थल',
   'upos': 'n',
   'xpos': 'NN',
   'feats': 'cat-n|gen-m|num-pl|pers-3|case-d|vib-0|tam-0|chunkId-NP5|stype-|voicetype-',
   'head': 7,
   'deprel': 'k1s',
   'lemma': 'स्थल'}

* {'id': 7,
   'text': 'हैं',
   'upos': 'v',
   'xpos': 'VM',
   'feats': 'cat-v|gen-any|num-pl|pers-3|case-|vib-है|tam-hE|chunkId-VGF|stype-declarative|voicetype-active',
   'head': 0,
   'deprel': 'main',
   'lemma': 'है'}

* {'id': 8,
   'text': '।',
   'upos': 'punc',
   'xpos': 'SYM',
   'feats': 'cat-punc|gen-|num-|pers-|case-|vib-|tam-|chunkId-BLK|stype-|voicetype-',
   'head': 7,
   'deprel': 'rsym',
   'lemma': '।'}

* **Predicted Tokens:**

In [None]:
tokens

{'text': 'इसके कुंड गुफा तथा भीमशिला स्थल हैं ।',
 'tokens': [{'id': 1,
   'text': 'इसके',
   'upos': 'pn',
   'xpos': 'PRP',
   'feats': 'cat-pn|gen-any|num-sg|pers-3|case-o|vib-0_अलावा|tam-ke|chunkId-NP|stype-|voicetype-',
   'head': 7,
   'deprel': 'vmod',
   'span': (0, 4)},
  {'id': 2,
   'text': 'कुंड',
   'upos': 'n',
   'xpos': 'NN',
   'feats': 'cat-n|gen-m|num-sg|pers-3|case-d|vib-0|tam-0|chunkId-NP2|stype-|voicetype-',
   'head': 4,
   'deprel': 'ccof',
   'span': (5, 9)},
  {'id': 3,
   'text': 'गुफा',
   'upos': 'n',
   'xpos': 'NN',
   'feats': 'cat-n|gen-f|num-sg|pers-3|case-o|vib-0|tam-0|chunkId-NP3|stype-|voicetype-',
   'head': 4,
   'deprel': 'ccof',
   'span': (10, 14)},
  {'id': 4,
   'text': 'तथा',
   'upos': 'avy',
   'xpos': 'CC',
   'feats': 'cat-avy|gen-|num-|pers-|case-|vib-|tam-|chunkId-CCP|stype-|voicetype-',
   'head': 7,
   'deprel': 'k1',
   'span': (15, 18)},
  {'id': 5,
   'text': 'भीमशिला',
   'upos': 'n',
   'xpos': 'NNP',
   'feats': 'cat-n|gen-f|nu
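
The actual and predicted tokens above can also be compared programmatically. The sketch below is a simplified per-sentence comparison, not the official CoNLL evaluation script and not a Trankit function. The helper `field_accuracy` is hypothetical, and only the first four tokens are used because the predicted output is truncated after that point.

```python
from typing import Dict, List, Sequence

Token = Dict[str, object]

# First four tokens, copied from the actual and predicted listings above.
actual: List[Token] = [
    {"id": 1, "xpos": "PRP", "head": 7, "deprel": "vmod"},
    {"id": 2, "xpos": "NNP", "head": 4, "deprel": "ccof"},
    {"id": 3, "xpos": "NNP", "head": 4, "deprel": "ccof"},
    {"id": 4, "xpos": "CC", "head": 7, "deprel": "k1"},
]
predicted: List[Token] = [
    {"id": 1, "xpos": "PRP", "head": 7, "deprel": "vmod"},
    {"id": 2, "xpos": "NN", "head": 4, "deprel": "ccof"},
    {"id": 3, "xpos": "NN", "head": 4, "deprel": "ccof"},
    {"id": 4, "xpos": "CC", "head": 7, "deprel": "k1"},
]


def field_accuracy(gold: Sequence[Token], pred: Sequence[Token],
                   fields: Sequence[str]) -> float:
    """Fraction of tokens whose values agree on every field in `fields`."""
    if not gold or len(gold) != len(pred):
        raise ValueError("token lists must be non-empty and of equal length")
    matches = sum(all(g[f] == p[f] for f in fields) for g, p in zip(gold, pred))
    return matches / len(gold)


uas = field_accuracy(actual, predicted, ["head"])            # unlabeled attachment
las = field_accuracy(actual, predicted, ["head", "deprel"])  # labeled attachment
xpos_acc = field_accuracy(actual, predicted, ["xpos"])
print(uas, las, xpos_acc)
```

For these four tokens the heads and relations all agree, while two of the four XPOS tags differ (`NNP` in the actual tokens, `NN` in the prediction).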

* **Test Set:**

In [None]:
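from trankit.iterators.tagger_iterators import TaggerDataset

# Build the test set from the filtered file; the same file serves as input and as gold reference
test_set = TaggerDataset(
    config=trainer._config,
    input_conllu='./test-conllu.dat.filtered',
    gold_conllu='./test-conllu.dat.filtered',
    evaluate=True
)
test_set.numberize()
# Number of batches, rounded up so that a final partial batch is included
test_batch_num = len(test_set) // trainer._config.batch_size + (len(test_set) % trainer._config.batch_size != 0)
# Note: _config and _eval_posdep are private Trankit members and may change between versions
result = trainer._eval_posdep(data_set=test_set, batch_num=test_batch_num, name='test', epoch=-1)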

test -1:   0%|                                     | 0/120 [00:47<?, ?it/s]
test -1:   1%|▏                          | 1/120 [00:41<1:21:45, 41.22s/it]


Loaded 1910 entries from ./test-conllu.dat.filtered


test -1:   0%|                                     | 0/120 [00:00<?, ?it/s]

KeyError: ignored
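
The batch-count expression in the test-set cell is an integer ceiling division. The sketch below isolates it in a hypothetical helper, `num_batches`. The batch size of 16 is an assumption: it is not stated in this notebook, but it is consistent with the logged entry counts and progress-bar totals (15081 entries and 943 steps, 1864 and 117, 1910 and 120).

```python
import math


def num_batches(n_examples: int, batch_size: int) -> int:
    """Ceiling division: full batches plus one extra if there is a remainder."""
    return n_examples // batch_size + (n_examples % batch_size != 0)


ASSUMED_BATCH_SIZE = 16  # assumption inferred from the logs, not read from the config

for n_examples in (15081, 1864, 1910):
    batches = num_batches(n_examples, ASSUMED_BATCH_SIZE)
    # The boolean remainder test adds 1 exactly when math.ceil would round up.
    assert batches == math.ceil(n_examples / ASSUMED_BATCH_SIZE)
    print(n_examples, batches)
```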

In [None]:
print(result)