
Finetune RoBERTa for NER

Finetuning RoBERTa for multilingual Named-entity recognition.

Setup

First, create a Python environment. We will use pyenv, but other tools will likely work too.

Use the following commands to (1) install a specific Python version, (2) create a new virtual environment, (3) activate that environment, and (4) install the Python dependencies.

pyenv install -v 3.10.8
pyenv virtualenv 3.10.8 finetune-transformer
pyenv activate finetune-transformer
pip install -r requirements.txt

Run

Run a notebook headlessly:

pyenv activate finetune-transformer
jupyter nbconvert --ExecutePreprocessor.timeout=-1 --to notebook --inplace --execute original.ipynb

Execute a Python script in the background:

pyenv activate finetune-transformer
nohup python 05_Compile_Dataset.py &

General Information about RoBERTa:

Liu et al. presented an improved BERT variant named RoBERTa. To improve BERT, the authors conducted a series of experiments investigating the impact of training data and training parameters on downstream performance. They found that pretraining with larger batch sizes and longer input sequences increases downstream performance. Furthermore, during pretraining, they dropped the next-sentence prediction task and masked input tokens dynamically during the masked language modeling phase; in the original work, Devlin et al. masked tokens statically before training the model.

Before an input sequence can be fed into the model, it has to be tokenized. The original BERT uses WordPiece, a subword tokenization scheme closely related to Byte-Pair Encoding (BPE): the input sequence is split into pieces representing whole words, word fragments, or single characters. Compared to a word-level-only approach, this allows a more diverse set of texts to be represented, which is especially beneficial when training on a multilingual dataset. However, with BPE the vocabulary grows quickly. Radford et al. proposed an even more universal and efficient approach: by splitting the input sequence at the byte level instead of treating Unicode characters as the smallest unit, a universal vocabulary of a modest 50k units can represent any input text. Unicode-based approaches, in contrast, typically result in vocabularies of 10k to 100k subword units.
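
To illustrate how such a byte-level BPE tokenizer splits text into subword units, here is a minimal sketch using the Hugging Face transformers library and the roberta-base checkpoint (both are assumptions; the notebooks in this repository may use a different checkpoint):

from transformers import AutoTokenizer

# roberta-base ships with the byte-level BPE tokenizer described above
# (a vocabulary of roughly 50k units).
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

text = "Finetuning RoBERTa for multilingual NER."
tokens = tokenizer.tokenize(text)

print(tokens)                                   # subword pieces of the input sentence
print(tokenizer.convert_tokens_to_ids(tokens))  # corresponding vocabulary indices
print(len(tokenizer))                           # vocabulary size (about 50k for roberta-base)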

The improved performance makes RoBERTa, in many cases, the obvious choice over the original BERT models. In addition, the universal input encoding makes RoBERTa more convenient to use in a multilingual setting.
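
For fine-tuning, the pretrained encoder is combined with a token classification head that predicts one NER label per token. A minimal sketch, assuming xlm-roberta-base as the checkpoint and the label set of the dataset described below (the actual configuration in the notebooks may differ):

from transformers import AutoModelForTokenClassification, AutoTokenizer

# Assumed label set: IOB2 tags for PER, ORG, and LOC entities
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Adds a randomly initialized classification layer on top of the encoder;
# its weights are learned during fine-tuning.
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)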

Dataset:

The complete WikiANN dataset includes training examples for 282 languages and was constructed from Wikipedia. Training examples are extracted in an automated manner by exploiting entity mentions in Wikipedia articles, which are often formatted as hyperlinks to the corresponding article. The provided NER tags are in the IOB2 format. Named entities are classified as location (LOC), person (PER), or organization (ORG).
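
A minimal sketch of loading one language split and inspecting its IOB2 labels, assuming the wikiann dataset on the Hugging Face Hub (the language code "de" is only an example; the notebooks in this repository may combine several languages):

from datasets import load_dataset

# Load the German portion of WikiANN; other language codes are available as well.
dataset = load_dataset("wikiann", "de")

example = dataset["train"][0]
print(example["tokens"])    # the tokenized sentence
print(example["ner_tags"])  # integer labels in the IOB2 scheme

# Map the integer labels back to their IOB2 tag names
label_names = dataset["train"].features["ner_tags"].feature.names
print(label_names)  # ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
print([label_names[i] for i in example["ner_tags"]])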


Related Papers

  • Pan, X., Zhang, B., May, J., Nothman, J., Knight, K., & Ji, H. (2017). Cross-lingual Name Tagging and Linking for 282 Languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1946–1958). Association for Computational Linguistics.
  • Rahimi, A., Li, Y., & Cohn, T. (2019). Massively Multilingual Transfer for NER. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 151–164). Association for Computational Linguistics.
  • Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.
