NER Transfer - Interpretation and Analysis

About

This project investigates transfer learning for low-resourced languages, specifically on a named entity recognition (NER) task. This repository contains the code; the trained models can be found here, and the paper (accepted to IJCNLP-AACL 2023) is available at http://arxiv.org/abs/2309.05311.

Further, many visualisations are available in ./analysis/, and the raw predictions and results can be found in ./src/runs/v10/models/.

Contents

Code Structure

We mainly did the following in this project:

  1. Fine-tuned many different pre-trained models on different languages from the MasakhaNER dataset.
  2. Evaluated these fine-tuned models on all of the languages, to gauge their zero-shot transfer potential.
  3. Performed analysis, which is the bulk of our contributions and of the code here. The specific analyses can be found in ./src/analysis, with the following subfolders:
    1. v10: Analyse the effect of the pre-trained model on final performance.
    2. v20: Look at zero-shot transfer.
    3. v40: Investigate word embeddings, among other things performing PCA on them and plotting the results.
    4. v50: Statistically analyse the data overlap between the different languages, and investigate the correlation between this overlap and performance (a minimal sketch of this idea follows this list).
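
The idea behind the v50 overlap analysis can be illustrated with a small, self-contained sketch. The entity sets, overlap values and F1 scores below are hypothetical placeholders, not outputs of the actual script in ./src/analysis/v50:

# Minimal sketch of the v50-style idea: measure how much the entity vocabularies
# of two languages overlap, then check whether overlap correlates with zero-shot
# transfer performance. All values below are made-up placeholders.
from scipy.stats import pearsonr

def jaccard_overlap(entities_a: set, entities_b: set) -> float:
    """Jaccard similarity between two sets of entity strings."""
    if not entities_a and not entities_b:
        return 0.0
    return len(entities_a & entities_b) / len(entities_a | entities_b)

swa_entities = {"Nairobi", "Kenya", "Uhuru Kenyatta"}   # hypothetical Swahili entities
lug_entities = {"Kampala", "Uganda", "Kenya"}           # hypothetical Luganda entities
print(jaccard_overlap(swa_entities, lug_entities))      # 0.2

overlaps = [0.12, 0.05, 0.30, 0.22]      # placeholder overlaps for four language pairs
zero_shot_f1 = [55.0, 40.0, 70.0, 63.0]  # placeholder zero-shot F1 for the same pairs
corr, p_value = pearsonr(overlaps, zero_shot_f1)
print(f"Pearson r = {corr:.2f}, p = {p_value:.3f}")
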
├── analysis                            -> This contains the bulk of our results, including plots, csv files and Latex tables.
│   ├── v10
│   ├── v20
│   ├── v40
│   └── v50
├── data
│   └── masakhane-ner
├── doc
│   └── report.pdf                      -> Written Report
├── env.yml                             -> Environment File
├── logs                                -> Log files are written to here
├── run.sh                              -> script to run any code in this repository
└── src                                 -> All of the source code
    ├── analysis                        -> Analyses the results we obtained
    │   ├── utils.py                    -> General utilities
    │   ├── v10                         -> Effect of pre-trained models
    │   ├── v20                         -> Effect of fine-tuning and doing zero shot transfer
    │   ├── v40                         -> Interpretability, specifically word embedding analysis
    │   └── v50                         -> Statistical analysis including data overlap
    ├── results
    │   └── v40                         -> This contains some pickle files with precomputed embeddings.
    ├── run.sh                          -> Another run.sh
    ├── runs                            -> Code that ran all the fine-tuning. Also contains the results on the test/dev set.
    │   ├── v10                         -> Fine-tuning from different pre-trained models.
    │   └── v20                         -> Evaluating zero shot transfer.
    ├── slurms                          -> Job files are saved here.
    │   └── v10
    └── utils                           -> Overall utility code, much of it from the MasakhaNER codebase
        ├── create_slurms.py            -> Creates Slurm job files for the fine-tuning
        ├── evaluate_ner.py             -> Evaluates a specific model on any language
        ├── train_ner.py                -> Trains the models
        └── utils_ner.py                -> NER Utilities

Get Started

To get started, you can use our conda environment by running the following command:

conda env create -f env.yml

Then run conda activate mner, and you should be ready to go. If that fails, you can instead try env_all.yml, which pins versions for all packages.

To run any Python file in this project, use the ./run.sh script provided in the root of the project rather than calling python directly, as run.sh sets up the requisite paths, e.g. ./run.sh src/utils/create_slurms.py

Train Models

If you want to train the models, you can create the Slurm scripts by running ./run.sh src/utils/create_slurms.py. Note that you will need to change the paths at the top of that file to suit your cluster. The generated Slurm scripts will appear in src/slurms/v10, and you can then submit them. Be aware that there are very many of them, and your cluster scheduler may block you if you try to run all of them at once.

Evaluate

Next up is evaluation, i.e. taking the models from before and evaluating them on all of the other languages. To do so, navigate to src/runs/v20 and run ./run_eval_all.sh. Note that this runs all of the inference (4 jobs in parallel) in one step, so it might take a long time.

In practice, we do not recommend running this training and evaluation from scratch, as it requires significant compute and storage. We do provide downloadable models (albeit for just one seed instead of all 5), so we recommend using those instead.
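
If you only want to reproduce a single zero-shot number with one of those downloaded models, a minimal sketch could look like the following. This is not the src/runs/v20 evaluation script; it assumes the masakhaner dataset on the Hugging Face Hub and a Hub model name like those listed under Model Cards below:

# Minimal zero-shot evaluation sketch (not the src/runs/v20 script): a model
# fine-tuned on Swahili is evaluated on the Luganda test set with seqeval.
import torch
from datasets import load_dataset
from seqeval.metrics import f1_score
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_name = "mbeukman/xlm-roberta-base-finetuned-swahili-finetuned-ner-swahili"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name).eval()

test = load_dataset("masakhaner", "lug", split="test")
label_names = test.features["ner_tags"].feature.names

all_true, all_pred = [], []
for example in test:
    enc = tokenizer(example["tokens"], is_split_into_words=True,
                    truncation=True, max_length=200, return_tensors="pt")
    with torch.no_grad():
        pred_ids = model(**enc).logits[0].argmax(-1).tolist()
    # Keep one prediction per word: the label of its first sub-word token.
    preds, seen = [], set()
    for idx, word_id in enumerate(enc.word_ids()):
        if word_id is not None and word_id not in seen:
            seen.add(word_id)
            preds.append(model.config.id2label[pred_ids[idx]])
    gold = [label_names[t] for t in example["ner_tags"]]
    all_true.append(gold[:len(preds)])  # truncate gold if the sentence was cut at 200 tokens
    all_pred.append(preds)

print("Zero-shot entity-level F1 on Luganda:", f1_score(all_true, all_pred))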

Analysis

The analysis is mostly self-contained, and the vast majority of it requires only the result .txt files, not the models themselves.

TL;DR: You can probably just run the following commands

./run.sh src/analysis/v10/basic_analyse_pretrained_model.py
./run.sh src/analysis/v10/analyse_per_category.py

./run.sh src/analysis/v20/base.py

./run.sh src/analysis/v50/statistical_overlap.py

and most of the analysis should run.

The embeddings code is a bit more involved, as it requires the actual word embeddings, which are quite large and are stored in a separate repository. To download them, do the following:

mkdir -p src/results/v40
cd src/results/v40
git clone https://github.com/Michael-Beukman/NERTransfer-Embeddings .
cd -

Once those have been downloaded, you can run the code by simply running ./run.sh src/analysis/v40/embeddings.py.

The code to create the word embeddings in the first place is also included there, although that does require all of the fine-tuned models.
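
As a rough illustration of what these word embeddings are (the actual extraction code in src/analysis/v40 may differ in detail), one contextual vector per word can be obtained by averaging the sub-word hidden states of a fine-tuned model; the model name and sentence below are only illustrative:

# Rough sketch of extracting contextual word embeddings from a fine-tuned model
# (the actual code lives in src/analysis/v40 and may differ in detail).
import torch
from sklearn.decomposition import PCA
from transformers import AutoModel, AutoTokenizer

model_name = "mbeukman/xlm-roberta-base-finetuned-swahili-finetuned-ner-swahili"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()  # encoder only, no classification head

tokens = ["Rais", "wa", "Kenya", "alizungumza", "Nairobi"]  # illustrative Swahili tokens
enc = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]  # (num_subwords, hidden_size)

# Average the sub-word vectors of each word to get one embedding per word.
word_embeddings = []
for word_idx in range(len(tokens)):
    sub_idx = [i for i, w in enumerate(enc.word_ids()) if w == word_idx]
    word_embeddings.append(hidden[sub_idx].mean(dim=0))

# Project to 2D for plotting, as done in the v40 analysis.
coords = PCA(n_components=2).fit_transform(torch.stack(word_embeddings).numpy())
print(coords.shape)  # (5, 2): one 2D point per word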

The repository already contains many images of embeddings from different combinations of models and languages, specifically in analysis/v40, so the code does not need to be rerun just to view them.

Exploration

In src/runs/v10/models there are many different folders, containing the results (metrics and predictions) from different fine-tuned models. src/runs/v10/models/v20 contains the results from the same models when evaluated on different languages. There is a lot of data here, so feel free to explore it.

Additionally, there are very many plots, images and tables in analysis/* that might be interesting.

The folder names might be slightly confusing. For example, src/runs/v10/models/kin_v10_start_swa__kin__2_50/ means the model was fine-tuned on Kinyarwanda NER data, starting from the Swahili pre-trained model, using seed = 2 and 50 epochs. Similarly, src/runs/v10/models/ibo_v10_start_base__ibo__0_50 is xlm-roberta-base fine-tuned on Igbo data.
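
A small, hypothetical helper (not part of the repository) that unpacks such a v10 folder name might look like this:

# Hypothetical helper (not part of the repo) that unpacks a v10 folder name
# like "kin_v10_start_swa__kin__2_50" into its components.
import re

def parse_v10_name(name: str) -> dict:
    match = re.fullmatch(r"(\w+)_v10_start_(\w+?)__(\w+)__(\d+)_(\d+)", name)
    finetune_lang, start, _lang_again, seed, epochs = match.groups()
    return {
        "finetune_language": finetune_lang,   # language of the NER fine-tuning data
        "starting_point": start,              # pre-trained model (e.g. 'swa', 'base')
        "seed": int(seed),
        "epochs": int(epochs),
    }

print(parse_v10_name("kin_v10_start_swa__kin__2_50"))
# {'finetune_language': 'kin', 'starting_point': 'swa', 'seed': 2, 'epochs': 50}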

For the results in v20, e.g. src/runs/v10/models/v20/lug_zero_shot_pcm_v10_start_swa__pcm__0_50_lug_50 is fundamentally the pcm_v10_start_swa__pcm__0_50 model (i.e. Swahili pre-trained and fine-tuned on Nigerian Pidgin) evaluated on Luganda. These files follow the format src/runs/v10/models/v20/<eval>_zero_shot_<model>_<eval>_50.

References / Sources / Libraries

  • MasakhaNER (Github, paper). Their code and data can be found in data/, and are licensed under Masakhane-LICENSE
  • Interspersed throughout the code are links to other sources that were used to solve a particular problem.

License

This code is licensed under the Apache License, Version 2.0.

Citation

If you find this useful, then please cite this work as follows:

@inproceedings{NerTransfer,
  title={Analysing the effects of transfer learning on low-resourced named entity recognition performance},
  author={Michael Beukman},
  booktitle={3rd Workshop on African Natural Language Processing},
  year={2021},
  url={https://openreview.net/forum?id=HKWMFqfN8b5}
}

Model Cards

We release some models that were fine-tuned on the MasakhaNER dataset, using specific languages. We start with 3 different pre-trained models and then fine-tune on the 10 languages in the dataset; the exact starting point of each released model is listed in the Model Structure table below.

Here we describe a joint model card for all of the models, as they were quite similar.

About

These models are transformer-based and were all fine-tuned on the MasakhaNER dataset, a named entity recognition dataset consisting mostly of news articles in 10 different African languages. The models were all fine-tuned for 50 epochs, with a maximum sequence length of 200, a batch size of 32 and a learning rate of 5e-5. Each model was fine-tuned 5 times (with different random seeds), and the ones we upload are those that performed best on the language they were evaluated on.
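
As a rough sketch, those hyperparameters correspond to the following Hugging Face TrainingArguments; the actual training script is src/utils/train_ner.py (adapted from the MasakhaNER codebase), so treat this as an approximation rather than the exact configuration used:

# Rough equivalent of the fine-tuning hyperparameters described above, expressed
# as Hugging Face TrainingArguments (not the actual src/utils/train_ner.py script).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs/ner",          # illustrative path
    num_train_epochs=50,
    per_device_train_batch_size=32,
    learning_rate=5e-5,
    seed=0,                            # 5 seeds in total were used per model
)
# The maximum sequence length of 200 is applied at tokenisation time, e.g.
# tokenizer(tokens, is_split_into_words=True, truncation=True, max_length=200)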

These models were fine-tuned by me, Michael Beukman, while doing a project at the University of the Witwatersrand, Johannesburg. This is version 1, as of 20 November 2021. These models are licensed under the Apache License, Version 2.0.

Contact & More information

For more information about the models, including training scripts, detailed results and further resources, you can visit the main Github repository. You can contact me by filing an issue on this repository.

Training Resources

In the interest of openness, and to report the resources used, we list here how long the training process took, as well as the minimum resources needed to reproduce it. Fine-tuning each model on the NER dataset took between 10 and 30 minutes, and was performed on an NVIDIA RTX 3090 GPU. To use a batch size of 32, at least 14GB of GPU memory was required, although it was just possible to fit these models into around 6.5GB of VRAM with a batch size of 1.

Data

The train, evaluation and test datasets were taken directly from the MasakhaNER Github repository, with minimal to no preprocessing, as the original dataset is already of high quality. The motivation for using this data is that it is the "first large, publicly available, high-quality dataset for named entity recognition (NER) in ten African languages" (source). The high-quality data, as well as the groundwork laid by the paper introducing it, are further reasons why this dataset was used. For evaluation, the dedicated test split was used, which is from the same distribution as the training data, so the models may not generalise to other distributions, and further testing would be needed to investigate this. The exact distribution of the data is covered in detail here.
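
For a quick look at the splits and format, the dataset can also be loaded through the Hugging Face datasets library; this assumes the masakhaner dataset on the Hub, which mirrors the MasakhaNER GitHub data, whereas this repository reads the files in data/masakhane-ner directly:

# Quick way to inspect the data (assuming the "masakhaner" dataset on the
# Hugging Face Hub, which mirrors the MasakhaNER GitHub data).
from datasets import load_dataset

swa = load_dataset("masakhaner", "swa")
print({split: len(swa[split]) for split in swa})  # train / validation / test sizes
sample = swa["test"][0]
print(sample["tokens"])    # list of words
print(sample["ner_tags"])  # integer labels following the scheme listed under Model Structure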

Intended Use

These models are intended to be used for NLP research into e.g. interpretability or transfer learning. Using these models in production is not supported, as both their generalisability and their absolute performance are limited. In particular, these models are not designed to be used in any important downstream task that could affect people, as harm could be caused by the limitations of the models, described next.

Limitations

These models were trained on only one (relatively small) dataset, covering one task (NER) in one domain (news articles) and over a limited span of time. The results may not generalise, and the models may perform badly, or in an unfair or biased way, if used on other tasks. Although the purpose of this project was to investigate transfer learning, performance on languages that the models were not fine-tuned on does suffer.

Because all of the models covered here used xlm-roberta-base as their starting point (potentially with domain-adaptive fine-tuning on specific languages), that model's limitations can also apply here. These include being biased towards the hegemonic viewpoint of most of its training data, being ungrounded, and having subpar results on other languages (possibly due to unbalanced training data).

The embeddings we provide are meant purely for exploration and analysis; they might not be useful, and might even be harmful, if deployed in a high-stakes environment with tangible consequences.

As Adelani et al. (2021) showed, models in general struggle with entities that are either longer than 3 words or not contained in the training data. This could bias the models towards not finding, for example, names of people consisting of many words, possibly leading to a misrepresentation in the results. Similarly, names that are uncommon, and may not have occurred in the training data (due to e.g. different languages), would also be predicted less often.

Additionally, these models have not been verified in practice, and other, more subtle problems may become prevalent if they are used without any verification that they do what they are supposed to.

Privacy & Ethical Considerations

The data comes only from publicly available news sources, so it should cover only public figures and those who agreed to be reported on. See the original MasakhaNER paper for more details.

No explicit ethical considerations or adjustments were made during fine-tuning of these models.

Metrics

As shown in our report, and in the MasakhaNER paper, the language-adaptive models achieve (mostly) superior performance compared to starting with xlm-roberta-base. We also demonstrated improved transfer capabilities for some of these models. Our main metric was the aggregate F1 score across all NER categories.
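
Concretely, this is the standard entity-level (span) F1, where an entity only counts as correct if its full span and type are predicted exactly; a toy example with the seqeval library (the labels below are made up):

# Toy example of entity-level F1: an entity only counts as correct if its
# full span and type are predicted exactly (labels here are made up).
from seqeval.metrics import f1_score

y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "O"]]
print(f1_score(y_true, y_pred))  # ~0.67: precision 1.0, recall 0.5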

These metrics are computed on the MasakhaNER test set, whose distribution is similar to that of the training set, so they do not directly indicate how well these models generalise to other distributions. We do find large variation in results across different seeds (5 were tested), indicating that the fine-tuning process for transfer might be unstable.

The metrics used were chosen to be consistent with previous work, and to facilitate research. Other metrics may be more appropriate for other purposes.

Caveats and Recommendations

In general, these models performed worse on the 'date' category than on the others, so if dates are critical for your application, that might need to be taken into account and addressed, for example by collecting and annotating more data.

Model Structure

Here are some details about the specific models: where to find them, what their starting points were, and what language they were evaluated on. The evaluation language is also the language the model was fine-tuned on. All of these metrics were calculated on the test set, using the seed that gave the best overall F1 score. The first three result columns are aggregated over all categories, and the last 4 give performance broken down by category.

These models can predict the following labels for a token (source):

Abbreviation Description
O Outside of a named entity
B-DATE Beginning of a DATE entity right after another DATE entity
I-DATE DATE entity
B-PER Beginning of a person’s name right after another person’s name
I-PER Person’s name
B-ORG Beginning of an organisation right after another organisation
I-ORG Organisation
B-LOC Beginning of a location right after another location
I-LOC Location
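
These labels are also stored in each released model's configuration, so they can be inspected programmatically using any model name from the table below:

# Print the label set a released model predicts, taken from its Hub config.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("mbeukman/xlm-roberta-base-finetuned-hausa-finetuned-ner-hausa")
print(config.id2label)  # expected to map ids to the 9 labels listed above

The table below lists each released model together with its starting point, the language it was fine-tuned and evaluated on, and its overall and per-category results.
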
Model Name Starting point Evaluation / Fine-tune Language F1 Precision Recall F1 (DATE) F1 (LOC) F1 (ORG) F1 (PER)
xlm-roberta-base-finetuned-amharic-finetuned-ner-amharic amh amh 79.55 76.71 82.62 70.00 84.00 62.00 91.00
xlm-roberta-base-finetuned-swahili-finetuned-ner-amharic swa amh 70.34 69.72 70.97 72.00 75.00 51.00 73.00
xlm-roberta-base-finetuned-ner-amharic base amh 72.63 70.49 74.91 76.00 75.00 52.00 78.00
xlm-roberta-base-finetuned-hausa-finetuned-ner-hausa hau hau 92.27 90.46 94.16 85.00 95.00 80.00 97.00
xlm-roberta-base-finetuned-swahili-finetuned-ner-hausa swa hau 89.14 87.18 91.20 82.00 93.00 76.00 93.00
xlm-roberta-base-finetuned-ner-hausa base hau 89.94 87.74 92.25 84.00 94.00 74.00 93.00
xlm-roberta-base-finetuned-igbo-finetuned-ner-igbo ibo ibo 88.39 87.08 89.74 74.00 91.00 90.00 91.00
xlm-roberta-base-finetuned-swahili-finetuned-ner-igbo swa ibo 84.93 83.63 86.26 70.00 88.00 89.00 84.00
xlm-roberta-base-finetuned-ner-igbo base ibo 86.06 85.20 86.94 76.00 86.00 90.00 87.00
xlm-roberta-base-finetuned-kinyarwanda-finetuned-ner-kinyarwanda kin kin 79.55 75.56 83.99 69.00 79.00 77.00 90.00
xlm-roberta-base-finetuned-swahili-finetuned-ner-kinyarwanda swa kin 76.31 72.64 80.37 70.00 76.00 75.00 84.00
xlm-roberta-base-finetuned-ner-kinyarwanda base kin 74.59 72.17 77.17 70.00 75.00 70.00 82.00
xlm-roberta-base-finetuned-luganda-finetuned-ner-luganda lug lug 85.37 82.75 88.17 78.00 82.00 80.00 92.00
xlm-roberta-base-finetuned-swahili-finetuned-ner-luganda swa lug 82.57 80.38 84.89 75.00 80.00 82.00 87.00
xlm-roberta-base-finetuned-ner-luganda base lug 80.91 78.59 83.37 73.00 78.00 77.00 86.00
xlm-roberta-base-finetuned-luo-finetuned-ner-luo luo luo 78.71 78.91 78.52 72.00 84.00 59.00 87.00
xlm-roberta-base-finetuned-swahili-finetuned-ner-luo swa luo 78.13 77.75 78.52 65.00 82.00 61.00 89.00
xlm-roberta-base-finetuned-ner-luo base luo 75.99 76.18 75.80 71.00 76.00 62.00 85.00
xlm-roberta-base-finetuned-naija-finetuned-ner-naija pcm pcm 88.06 87.04 89.12 90.00 88.00 81.00 92.00
xlm-roberta-base-finetuned-swahili-finetuned-ner-naija swa pcm 89.12 87.84 90.42 90.00 89.00 82.00 94.00
xlm-roberta-base-finetuned-ner-naija base pcm 88.89 88.13 89.66 92.00 87.00 82.00 94.00
xlm-roberta-base-finetuned-amharic-finetuned-ner-swahili amh swa 86.66 85.23 88.13 84.00 90.00 74.00 92.00
xlm-roberta-base-finetuned-hausa-finetuned-ner-swahili hau swa 88.36 86.95 89.82 86.00 91.00 77.00 94.00
xlm-roberta-base-finetuned-igbo-finetuned-ner-swahili ibo swa 87.75 86.55 88.97 85.00 92.00 77.00 91.00
xlm-roberta-base-finetuned-kinyarwanda-finetuned-ner-swahili kin swa 87.26 85.15 89.48 83.00 91.00 75.00 93.00
xlm-roberta-base-finetuned-luganda-finetuned-ner-swahili lug swa 88.93 87.64 90.25 83.00 92.00 79.00 95.00
xlm-roberta-base-finetuned-luo-finetuned-ner-swahili luo swa 87.93 86.91 88.97 83.00 91.00 76.00 94.00
xlm-roberta-base-finetuned-naija-finetuned-ner-swahili pcm swa 87.26 85.15 89.48 83.00 91.00 75.00 93.00
xlm-roberta-base-finetuned-swahili-finetuned-ner-swahili swa swa 90.36 88.59 92.20 86.00 93.00 79.00 96.00
xlm-roberta-base-finetuned-wolof-finetuned-ner-swahili wol swa 87.80 86.50 89.14 86.00 90.00 78.00 93.00
xlm-roberta-base-finetuned-yoruba-finetuned-ner-swahili yor swa 87.73 86.67 88.80 85.00 91.00 75.00 93.00
xlm-roberta-base-finetuned-ner-swahili base swa 88.71 86.84 90.67 83.00 91.00 79.00 95.00
xlm-roberta-base-finetuned-swahili-finetuned-ner-wolof swa wol 69.01 73.25 65.23 27.00 85.00 52.00 67.00
xlm-roberta-base-finetuned-wolof-finetuned-ner-wolof wol wol 69.02 67.60 70.51 30.00 84.00 44.00 71.00
xlm-roberta-base-finetuned-ner-wolof base wol 66.12 69.46 63.09 30.00 84.00 54.00 59.00
xlm-roberta-base-finetuned-swahili-finetuned-ner-yoruba swa yor 80.29 78.34 82.35 77.00 82.00 73.00 86.00
xlm-roberta-base-finetuned-yoruba-finetuned-ner-yoruba yor yor 83.68 79.92 87.82 78.00 86.00 74.00 92.00
xlm-roberta-base-finetuned-ner-yoruba base yor 78.22 77.21 79.26 77.00 80.00 71.00 82.00

Usage

To use these models, you can do the following, changing only the model name as needed (source):

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Any of the model names from the table above can be used here.
model_name = 'mbeukman/xlm-roberta-base-finetuned-hausa-finetuned-ner-hausa'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Build a token-classification (NER) pipeline and run it on a Hausa sentence.
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "Donald Trump Ya Rufe Gwamnatin Amurka"

ner_results = nlp(example)
print(ner_results)
# [{'entity': 'B-PER', 'score': 0.9998982, 'index': 1, 'word': '▁Donald', 'start': 0, 'end': 6}, {'entity': 'I-PER', 'score': 0.9999061, 'index': 2, 'word': '▁Trump', 'start': 6, 'end': 12}, {'entity': 'B-LOC', 'score': 0.9999413, 'index': 7, 'word': '▁Amurka', 'start': 30, 'end': 37}]
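
The output above is one prediction per sub-word token. Continuing from the same example, the standard transformers aggregation option (not something specific to these models) groups sub-words into whole entity spans:

# Group sub-word tokens into whole entity spans instead of per-token predictions.
nlp_grouped = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
print(nlp_grouped(example))
# expected: one entry for "Donald Trump" (PER) and one for "Amurka" (LOC)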
