This is the repository accompanying our paper AraT5: Text-to-Text Transformers for Arabic Language Understanding and Generation. In this repository, we:
- Introduce AraT5MSA, AraT5Tweet, and AraT5: three powerful Arabic-specific text-to-text Transformer-based models;
- Introduce ARGEN: a new benchmark for Arabic language generation and evaluation covering the following Arabic NLP tasks: machine translation, summarization, news title generation, question generation, paraphrasing, transliteration, and code-switched translation;
- Evaluate AraT5 models on ARGEN and compare against available language models.
Our models establish new state-of-the-art (SOTA) results on several publicly available datasets. Our language models are publicly available for research (see below).
The rest of this repository provides more information about our new language models, benchmark, and experiments.
- 1. Our Language Models
- 2. ARGEN Benchmark and AraT5 Evaluation
- 3. How to use AraT5 model
- 4. Ethics
- 5. AraT5 Models Checkpoints
- 6. Citation
- 7. Acknowledgments
- MSA Training Data: We use 70GB of MSA text (7.1B tokens) from the following sources: AraNews, El-Khair, Gigaword, OSCAR, OSIAN, Wikipedia Arabic, and Hindawi Books.
- Twitter Training Data: We randomly sample 1.5B Arabic tweets from a large in-house dataset of about 10B tweets. We use string matching to include only tweets with at least 3 Arabic words, regardless of whether the tweet also contains non-Arabic strings. The dataset makes up 178GB of text (21B tokens).
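The tweet filter described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the definition of an "Arabic word" (a whitespace-delimited token containing at least one character from the basic Arabic Unicode block) and the function names are assumptions.

```python
import re

# Assumption: an "Arabic word" is a whitespace-delimited token containing
# at least one character from the basic Arabic Unicode block (U+0600-U+06FF).
ARABIC_CHAR = re.compile(r"[\u0600-\u06FF]")


def count_arabic_words(text: str) -> int:
    """Count whitespace-separated tokens that contain an Arabic character."""
    return sum(1 for token in text.split() if ARABIC_CHAR.search(token))


def keep_tweet(text: str, min_arabic_words: int = 3) -> bool:
    """Keep a tweet if it has enough Arabic words, whatever else it contains."""
    return count_arabic_words(text) >= min_arabic_words


tweets = [
    "صباح الخير يا أصدقاء good morning",  # four Arabic words plus English
    "hello world مرحبا",  # only one Arabic word
]
kept = [t for t in tweets if keep_tweet(t)]
```

Under these assumptions, a tweet that mixes Arabic and English is kept as long as it reaches the three-word threshold.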
To train our AraT5 models, we use the same architectures as T5-base and T5-small (Raffel et al., 2019). In the base configuration, both the encoder and the decoder have 12 layers, each with 12 attention heads and 768 hidden units.
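To make these figures concrete, here is a small sketch that records them and derives the per-head dimension. The class name is hypothetical, and the even split of hidden units across heads is an assumption based on standard multi-head attention; it is not taken from the AraT5 configuration files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderDecoderShape:
    """Shape hyperparameters quoted in the text (hypothetical helper class)."""

    num_layers: int = 12  # layers in the encoder, and again in the decoder
    num_heads: int = 12  # attention heads per layer
    hidden_size: int = 768  # hidden units

    @property
    def head_dim(self) -> int:
        # Assumption: hidden units are divided evenly across heads.
        if self.hidden_size % self.num_heads:
            raise ValueError("hidden_size must be divisible by num_heads")
        return self.hidden_size // self.num_heads

    @property
    def total_layers(self) -> int:
        # Encoder and decoder stacks have the same depth.
        return 2 * self.num_layers


shape = EncoderDecoderShape()
print(shape.head_dim, shape.total_layers)
```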
We pre-train three powerful variants of the text-to-text transformer (T5) model dedicated to Modern Standard Arabic (MSA) and Arabic dialects. AraT5 comes in three flavors:
- AraT5MSA: trained on MSA data exclusively;
- AraT5Tweet: trained on Twitter data (a mix of MSA and dialectal Arabic);
- AraT5: trained on both Twitter and MSA data.
To evaluate our models, we also introduce ARGEN, a new benchmark for Arabic language generation and evaluation. ARGEN is composed of four tasks, namely machine translation, summarization, news title generation, and question generation. ARGEN is collected from a total of ten datasets, including two new large datasets proposed in this work.
Dataset | Test Split | mT5 | AraT5Tweet | AraT5MSA | AraT5 |
---|---|---|---|---|---|
Bible II Sajjad et al. (2020) | Test 1 | 15.58 | 13.04 | 16.38 | 15.71 |
Bible II Sajjad et al. (2020) | Test 2 | 12.1 | 9.2 | 12.53 | 11.64 |
MADAR Bouamor et al. (2018) | MSA-EN | 11.84 | 11.11 | 11.42 | 10.57 |
IWSLT Cettolo et al. (2016) | TED15 | 29.39 | 28.2 | 30.37 | 30.45 |
IWSLT Cettolo et al. (2016) | TED16 | 28.39 | 27.03 | 29.37 | 29.18 |
IWSLT Cettolo et al. (2016) | QED16 | 21.09 | 18.55 | 20.98 | 19.11 |
UN Ziemski et al. (2016) | AR-EN | 52.38 | 51.48 | 53.29 | 52.96 |
Metric is BLEU. Results for MADAR Bouamor et al. (2018) (25 datasets) are shown in Table 6 of the paper.
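For readers unfamiliar with BLEU, the sketch below shows how a corpus-level score on the 0-100 scale used in these tables can be computed. It is a simplified illustration and not necessarily the implementation behind the reported numbers: it assumes whitespace tokenization, one reference per hypothesis, and no smoothing.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU (0-100), one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_ng, ref_ng = ngrams(hyp_tok, n), ngrams(ref_tok, n)
            # Clipped counts: an n-gram is credited at most as often
            # as it occurs in the reference.
            matches[n - 1] += sum((hyp_ng & ref_ng).values())
            totals[n - 1] += sum(hyp_ng.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing in this sketch
    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


score = corpus_bleu(["the cat sat on the mat today"], ["the cat sat on the mat"])
```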
Dataset | Test Split | mT5 | AraT5Tweet | AraT5MSA | AraT5 |
---|---|---|---|---|---|
ADPT Zbib et al. (2012) | Lev | 8.33 | 8.32 | 8.52 | 8.42 |
ADPT Zbib et al. (2012) | Egy | 12.57 | 11.25 | 12.38 | 12.92 |
Bible I Sajjad et al. (2020) | Tun | 8.08 | 5.86 | 8.52 | 7.94 |
Bible I Sajjad et al. (2020) | Mor | 7.21 | 4.69 | 7.83 | 6.82 |
QAraCy Sajjad et al. (2020) | Qat | 11.84 | 11.11 | 11.42 | 10.57 |
Metric is BLEU.
Split | mT5 | AraT5MSA |
---|---|---|
EN → MSA | 17.80 | 18.58 |
DE → MSA | 11.92 | 12.80 |
FR → MSA | 18.61 | 18.99 |
RU → MSA | 26.63 | 28.01 |
Metric is BLEU. All the splits are from UN corpus Ziemski et al. (2016)
Dataset | Metric | mT5 | AraT5Tweet | AraT5MSA | AraT5 |
---|---|---|---|---|---|
EASC El-Haj et al. (2010) | Rouge1 | 62.98 | 60.74 | 59.54 | 54.61 |
EASC El-Haj et al. (2010) | Rouge2 | 51.93 | 48.89 | 47.37 | 43.58 |
EASC El-Haj et al. (2010) | RougeL | 62.98 | 60.73 | 59.55 | 54.55 |
WikiLin Alami et al. (2021) | Rouge1 | 71.63 | 74.61 | 72.64 | 73.48 |
WikiLin Alami et al. (2021) | Rouge2 | 63.60 | 67.00 | 64.21 | 65.09 |
WikiLin Alami et al. (2021) | RougeL | 71.56 | 74.52 | 72.57 | 73.37 |
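As a reminder of what the ROUGE metrics measure, the sketch below computes simplified ROUGE-1 and ROUGE-L F1 scores for a single candidate and reference. It assumes whitespace tokenization and returns values in the range 0-1 (the table reports percentages); it is an illustration, not the scoring script used for the results above.

```python
from collections import Counter


def f1(overlap: int, cand_len: int, ref_len: int) -> float:
    """Harmonic mean of precision and recall for a given overlap count."""
    if overlap == 0:
        return 0.0
    precision, recall = overlap / cand_len, overlap / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge_1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1: F1 over clipped unigram overlap."""
    cand, ref = candidate.split(), reference.split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    return f1(overlap, len(cand), len(ref))


def lcs_length(a, b) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L: F1 over the longest common subsequence of tokens."""
    cand, ref = candidate.split(), reference.split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))
```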
Dataset | Metric | mT5 | AraT5Tweet | AraT5MSA | AraT5 |
---|---|---|---|---|---|
ARGENNTG Nagoudi et al. (2020) | BLEU | 19.49 | 20.00 | 20.61 | 20.51 |
ARGENQG Nagoudi et al. (2021) | BLEU | 15.29 | 12.06 | 14.18 | 16.99 |
Dataset | Metric | mT5 | AraT5Tweet | AraT5MSA | AraT5 |
---|---|---|---|---|---|
ARGENPPH I Cer et al. (2017) | BLEU | 19.32 | 18.17 | 19.38 | 19.03 |
ARGENPPH II Alian et al. (2021) | BLEU | 19.25 | 17.34 | 19.43 | 18.42 |
ARGENTR Song et al. (2014) | BLEU | 60.81 | 59.55 | 65.88 | 62.51 |
Dataset | Type | mT5 | AraT5Tweet | AraT5MSA | AraT5 |
---|---|---|---|---|---|
ALG-FR → FR | Natural | 23.83 | 28.19 | 26.27 | 26.17 |
JOR-EN → EN | Natural | 23.06 | 21.60 | 21.58 | 20.45 |
MSA-FR → FR | Synthetic | 11.06 | 8.99 | 11.53 | 11.42 |
MSA-EN → EN | Synthetic | 19.25 | 17.34 | 19.43 | 18.42 |
MSA-FR → MSA | Synthetic | 12.93 | 12.14 | 14.39 | 13.92 |
MSA-EN → MSA | Synthetic | 19.82 | 18.43 | 23.89 | 24.37 |
Metric is BLEU. All the ARGENCS datasets are from: Nagoudi et al. (2021)
Below is an example of fine-tuning AraT5-base for News Title Generation on the AraNews dataset:
!python run_trainier_seq2seq_huggingface.py \
--learning_rate 5e-5 \
--max_target_length 128 --max_source_length 128 \
--per_device_train_batch_size 8 --per_device_eval_batch_size 8 \
--model_name_or_path "UBC-NLP/AraT5-base" \
--output_dir "/content/AraT5_FT_title_generation" --overwrite_output_dir \
--num_train_epochs 3 \
--train_file "/content/ARGEn_title_genration_sample_train.tsv" \
--validation_file "/content/ARGEn_title_genration_sample_valid.tsv" \
--task "title_generation" --text_column "document" --summary_column "title" \
--load_best_model_at_end --metric_for_best_model "eval_bleu" --greater_is_better True \
--evaluation_strategy epoch --logging_strategy epoch --predict_with_generate \
--do_train --do_eval
For more details about the fine-tuning example, please see this notebook.
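The command above points at TSV files and names a `document` column and a `title` column. The sketch below shows one way to produce such a file with the standard library. It assumes the training script expects a tab-separated file with a header row containing those two column names; the helper names, file name, and sample rows are placeholders.

```python
import csv
import os
import tempfile


def write_tsv(path, rows, text_column="document", summary_column="title"):
    """Write (document, title) pairs to a tab-separated file with a header row."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow([text_column, summary_column])
        for document, title in rows:
            # Collapse tabs and newlines so each example stays on one line.
            writer.writerow([" ".join(document.split()), " ".join(title.split())])


def read_tsv(path):
    """Read the file back as a list of dicts keyed by the header names."""
    with open(path, encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Placeholder data; real rows would hold Arabic news articles and their titles.
sample = [("first article body\nwith a second line", "first title")]
path = os.path.join(tempfile.mkdtemp(), "train_sample.tsv")
write_tsv(path, sample)
rows = read_tsv(path)
```

The resulting path could then be passed to `--train_file` or `--validation_file`, provided the script reads TSV files in this layout.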
In addition, we release the fine-tuned News Title Generation (NTG) checkpoint described in the paper. The model is available on Hugging Face (UBC-NLP/AraT5-base-title-generation).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/AraT5-base-title-generation")
model = AutoModelForSeq2SeqLM.from_pretrained("UBC-NLP/AraT5-base-title-generation")
Document = "تحت رعاية صاحب السمو الملكي الأمير سعود بن نايف بن عبدالعزيز أمير المنطقة الشرقية اختتمت غرفة الشرقية مؤخرا، الثاني من مبادرتها لتأهيل وتدريب أبناء وبنات المملكة ضمن مبادرتها المجانية للعام 2019 حيث قدمت 6 برامج تدريبية نوعية. وثمن رئيس مجلس إدارة الغرفة، عبدالحكيم العمار الخالدي، رعاية سمو أمير المنطقة الشرقية للمبادرة، مؤكدا أن دعم سموه لجميع أنشطة ."
# Tokenize the document and return PyTorch tensors.
encoding = tokenizer(Document, padding=True, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"], encoding["attention_mask"]

# Sample five candidate titles with top-k and top-p (nucleus) sampling.
outputs = model.generate(
    input_ids=input_ids,
    attention_mask=attention_masks,
    max_length=256,
    do_sample=True,
    top_k=120,
    top_p=0.95,
    early_stopping=True,
    num_return_sequences=5,
)

# Decode each generated sequence back into text.
for idx, output in enumerate(outputs):
    title = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print("title#" + str(idx), title)
The generated titles for the input news document above (sampling is stochastic, so your output will differ):
title#0 غرفة الشرقية تختتم المرحلة الثانية من مبادرتها لتأهيل وتدريب أبناء وبنات المملكة
title#1 غرفة الشرقية تختتم الثاني من مبادرة تأهيل وتأهيل أبناء وبناتنا
title#2 سعود بن نايف يختتم ثانى مبادراتها لتأهيل وتدريب أبناء وبنات المملكة
title#3 أمير الشرقية يرعى اختتام برنامج برنامج تدريب أبناء وبنات المملكة
title#4 سعود بن نايف يرعى اختتام مبادرة تأهيل وتدريب أبناء وبنات المملكة
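The titles differ from one another because the `generate` call samples with `top_k=120` and `top_p=0.95`. The sketch below illustrates the idea behind these two parameters on a toy next-token distribution. It is a conceptual illustration in plain Python, not the library's internal implementation, and the function names and toy vocabulary are assumptions.

```python
import random


def top_k_top_p_filter(probs, top_k=120, top_p=0.95):
    """Return a renormalized {token: prob} dict restricted by top-k, then top-p."""
    # Rank tokens from most to least likely and keep at most top_k of them.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p / total))
        cumulative += p / total
        # Nucleus cut-off: stop once the kept tokens cover top_p of the mass.
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {token: p / norm for token, p in kept}


def sample_token(probs, rng, top_k=120, top_p=0.95):
    """Draw one token from the filtered distribution."""
    filtered = top_k_top_p_filter(probs, top_k, top_p)
    tokens, weights = zip(*filtered.items())
    return rng.choices(tokens, weights=weights, k=1)[0]


toy_probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
filtered = top_k_top_p_filter(toy_probs, top_k=3, top_p=0.8)
token = sample_token(toy_probs, random.Random(0), top_k=3, top_p=0.8)
```

Lower values of `top_k` or `top_p` restrict sampling to fewer high-probability tokens, which tends to make the generated titles less varied.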
Our models are developed using data from the public domain. We provide access to our models to accelerate scientific research with no liability on our part. Please use our models and benchmark ethically. This includes, for example, respecting and protecting people's privacy. We encourage all researchers who decide to use our models to adhere to the highest standards. For example, if you apply our models to Twitter data, we encourage you to review the Twitter policy, which gives the following guidance on the use of sensitive information:
You should be careful about using Twitter data to derive or infer potentially sensitive characteristics about Twitter users. Never derive or infer, or store derived or inferred, information about a Twitter user’s:
- Health (including pregnancy)
- Negative financial status or condition
- Political affiliation or beliefs
- Racial or ethnic origin
- Religious or philosophical affiliation or beliefs
- Sex life or sexual orientation
- Trade union membership
- Alleged or actual commission of a crime
Aggregate analysis of Twitter content that does not store any personal data (for example, user IDs, usernames, and other identifiers) is permitted, provided that the analysis also complies with applicable laws and all parts of the Developer Agreement and Policy.
AraT5 PyTorch and TensorFlow checkpoints are available on the Hugging Face website for direct download and use exclusively for research. For commercial use, please contact the authors via email (*muhammad.mageed[at]ubc[dot]ca*).
Model | Link |
---|---|
AraT5-base | https://huggingface.co/UBC-NLP/AraT5-base |
AraT5-msa-base | https://huggingface.co/UBC-NLP/AraT5-msa-base |
AraT5-tweet-base | https://huggingface.co/UBC-NLP/AraT5-tweet-base |
AraT5-msa-small | https://huggingface.co/UBC-NLP/AraT5-msa-small |
AraT5-tweet-small | https://huggingface.co/UBC-NLP/AraT5-tweet-small |
Title generation model | https://huggingface.co/UBC-NLP/AraT5-base-title-generation |
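Any checkpoint in the table can be loaded by passing the part of its URL after `https://huggingface.co/` to `from_pretrained`. The sketch below wraps this in two helpers. The helper names are hypothetical, and `load_checkpoint` assumes the `transformers` library is installed and the Hugging Face Hub is reachable.

```python
HUB_PREFIX = "https://huggingface.co/"

# Names and links copied from the table above.
CHECKPOINT_URLS = {
    "AraT5-base": "https://huggingface.co/UBC-NLP/AraT5-base",
    "AraT5-msa-base": "https://huggingface.co/UBC-NLP/AraT5-msa-base",
    "AraT5-tweet-base": "https://huggingface.co/UBC-NLP/AraT5-tweet-base",
    "AraT5-msa-small": "https://huggingface.co/UBC-NLP/AraT5-msa-small",
    "AraT5-tweet-small": "https://huggingface.co/UBC-NLP/AraT5-tweet-small",
    "Title generation model": "https://huggingface.co/UBC-NLP/AraT5-base-title-generation",
}


def model_id(name: str) -> str:
    """Turn a table entry into the identifier expected by from_pretrained."""
    return CHECKPOINT_URLS[name][len(HUB_PREFIX):]


def load_checkpoint(name: str):
    """Load tokenizer and model; needs `transformers` and network access."""
    # Imported lazily so the helpers above work without the library installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    repo = model_id(name)
    return AutoTokenizer.from_pretrained(repo), AutoModelForSeq2SeqLM.from_pretrained(repo)
```

For example, `load_checkpoint("AraT5-msa-small")` would download and return the tokenizer and model for `UBC-NLP/AraT5-msa-small`.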
If you use our AraT5 models for your scientific publication, or if you find the resources in this repository useful, please cite our paper as follows (to be updated):
@inproceedings{nagoudi2022_arat5,
  title = {AraT5: Text-to-Text Transformers for Arabic Language Generation},
  author = {Nagoudi, El Moatez Billah and Elmadany, AbdelRahim and Abdul-Mageed, Muhammad},
  booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics},
  month = {May},
  address = {Online},
  year = {2022},
  publisher = {Association for Computational Linguistics}
}
We gratefully acknowledge support from the Natural Sciences and Engineering Research Council of Canada, the Social Sciences and Humanities Research Council of Canada, Canadian Foundation for Innovation, ComputeCanada and UBC ARC-Sockeye. We also thank the Google TensorFlow Research Cloud (TFRC) program for providing us with free TPU access.