
GreekBART

The First Pretrained Greek Sequence-to-Sequence Model

Introduction

GreekBART is the first pretrained sequence-to-sequence model for the Greek language, based on the BART base model (https://github.com/facebookresearch/fairseq/tree/main/examples/bart).

GreekBART is pretrained from scratch on 76.9GB of raw Greek text, learning to reconstruct corrupted input sentences.
Since GreekBART is built on the BART architecture, it is well suited to generative tasks.

Model       Architecture   #layers   #params   Link
GreekBART   BASE           12        181M      Link
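
As a quick sanity check, the pretrained checkpoint can be loaded through fairseq's BART hub interface. The sketch below is only illustrative: the checkpoint directory, file names and SentencePiece model path are placeholders, and depending on your fairseq version the SentencePiece argument may be named sentencepiece_model or sentencepiece_vocab.

```python
from fairseq.models.bart import BARTModel

# Load the pretrained GreekBART checkpoint (all paths are placeholders).
greekbart = BARTModel.from_pretrained(
    "checkpoints/greekbart.base",               # hypothetical checkpoint directory
    checkpoint_file="checkpoint_best.pt",       # hypothetical checkpoint file
    bpe="sentencepiece",
    sentencepiece_model="greekbart_spm.model",  # hypothetical SentencePiece model
)
greekbart.eval()

# Encode a Greek sentence and extract encoder-decoder features.
tokens = greekbart.encode("Αυτή είναι μια πρόταση στα ελληνικά.")
features = greekbart.extract_features(tokens)
print(features.shape)
```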

Structure of Code

We used fairseq (https://github.com/facebookresearch/fairseq) for our implementation.

The src folder contains all the code and implementations used in this work.

Structure of the src folder and its subfolders:

  • src
    • corpus
    • examples
      • classification-GreekSUM
      • classification-Macedonia
      • NLI
      • Sentimental Analysis
    • preprocess
      • data
      • Crawler
      • EU Parliament
      • OSCAR
      • Wikipedia
    • pretrain
    • summarization
      • abstract
      • title

Corpus Download & Preprocess

The pretraining corpus is built from the following datasets:

  • the Greek part of Wikipedia [1]
  • the Greek part of the European Parliament Proceedings Parallel Corpus (EuroParl) [2]
  • the Greek part of OSCAR [3], a cleaned version of CommonCrawl
  • a large web corpus crawled from about 20 million Greek-language URLs [4]

The "src/preprocess/read_dataset.py" script can be used to directly download the Wikipedia dataset, while the other datasets need to be downloaded by the user. However, for obtaining the latest version of the Greek part of OSCAR, one needs to contact the OSCAR team. After the download of the aforementioned datasets, you run the "src/preprocess/read_dataset.py" in order to preprocess and clean the downloaded datasets.

Then, running "src/preprocess/create_deduplication_script.py" and the bash script it generates deduplicates the downloaded datasets and concatenates them into a single file, our raw corpus. The deduplication step requires the runiq package (https://github.com/whitfin/runiq).
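
For illustration, the following is a conceptual Python equivalent of what the generated script does; the repository itself relies on the runiq CLI, and the input file names below are placeholders.

```python
# Exact-line deduplication and concatenation of the cleaned datasets into one
# raw corpus file. This only sketches the runiq-based step; file names are
# placeholders.
inputs = ["wikipedia.txt", "europarl.txt", "oscar.txt", "web_crawl.txt"]

seen = set()
with open("greek_corpus.txt", "w", encoding="utf-8") as out:
    for path in inputs:
        with open(path, encoding="utf-8") as f:
            for line in f:
                key = hash(line)  # hash each line to keep memory usage bounded
                if key not in seen:
                    seen.add(key)
                    out.write(line)
```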

Pretraining phase

After the corpus is formed, we execute "src/pretrain/all_in_one_script.sh", which performs several tasks: it splits the corpus into training and validation sets, tokenizes and binarizes the data, and starts the pretraining of the model.

For more details about the tokenizer we used, see LINK
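
The sentencepiece_*.sh scripts rely on a SentencePiece subword model. Below is a minimal sketch of training and applying such a model with the sentencepiece Python package; the vocabulary size and file names are placeholders, not the authors' settings.

```python
import sentencepiece as spm

# Train a subword model on the deduplicated corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="greek_corpus.txt",
    model_prefix="greekbart_spm",
    vocab_size=32000,               # placeholder value
    input_sentence_size=10_000_000,
    shuffle_input_sentence=True,
)

# Tokenize a sentence with the trained model; the resulting token files are
# then binarized with fairseq-preprocess.
sp = spm.SentencePieceProcessor(model_file="greekbart_spm.model")
print(" ".join(sp.encode("Αυτή είναι μια πρόταση στα ελληνικά.", out_type=str)))
```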

Discriminative Tasks

The subfolder "examples" contains the four discriminative tasks on which our model was evaluated.

classification-GreekSUM

Get GreekSUM classification dataset

Please follow the steps here to get GreekSUM.

We run the following scripts:

  • "src/examples/classification-GreekSUM/sentencepiece_nli.sh" (tokenize datasets)
  • "src/examples/classification-GreekSUM/binarization_nli.sh" (binarize datasets)
  • "src/examples/classification-GreekSUM/train_NLI.sh" (fine-tune them model to this task)

NLI task

We run the following scripts:

  • "src/examples/NLI/get-xnli.sh" (To download the dataset)
  • "src/examples/NLI/process_nli.py" (To preprocess dataset and split it to training/validation/test sets)
  • "src/examples/NLI/sentencepiece_nli.sh" (tokenize datasets)
  • "src/examples/NLI/binarization_nli.sh" (binarize datasets)
  • "src/examples/NLI/train_NLI.sh" (fine-tune them model to this task)

Sentimental Analysis task

Get dataset

Download the dataset from LINK. We run the following scripts:

  • "src/examples/Sentimental Analysis/process_sentimental.py" (To preprocess dataset and split it to training/validation/test sets)
  • "src/examples/Sentimental Analysis/sentencepiece_sentimental.sh" (tokenize datasets)
  • "src/examples/Sentimental Analysis/binarization_sentimental.sh" (binarize datasets)
  • "src/examples/Sentimental Analysis/train_sentimental.sh" (fine-tune them model to this task)

classification-Macedonia task

We run the following scripts:

  • "src/examples/classification-Macedonia/get-classification.sh" (To download the dataset)
  • "src/examples/classification-Macedonia/process_classification.py" (To preprocess dataset and split it to training/validation/test sets)
  • "src/examples/classification-Macedonia/sentencepiece_classification.sh" (tokenize datasets)
  • "src/examples/classification-Macedonia/binarization_classification.sh" (binarize datasets)
  • "src/examples/classification-Macedonia/train_classification.sh" (fine-tune them model to this task)

We can use the scripts "src/examples/inference.py" and "src/examples/calculate_score.py" to evaluate the model's performance on the test set.
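
As a rough illustration of the inference step, a fine-tuned checkpoint can be loaded through fairseq's hub interface and queried with its classification head. The paths, head name, test-file format and SentencePiece arguments below are assumptions.

```python
import torch
from fairseq.models.bart import BARTModel

# Load a fine-tuned classification checkpoint (paths are placeholders).
model = BARTModel.from_pretrained(
    "checkpoints/nli",                          # hypothetical fine-tuned checkpoint dir
    checkpoint_file="checkpoint_best.pt",
    bpe="sentencepiece",
    sentencepiece_model="greekbart_spm.model",  # hypothetical SentencePiece model
)
model.eval()

correct = total = 0
with open("test.tsv", encoding="utf-8") as f:   # hypothetical format: text <TAB> label_id
    for line in f:
        text, gold = line.rstrip("\n").split("\t")
        with torch.no_grad():
            tokens = model.encode(text)
            # "sentence_classification_head" is fairseq's default head name; adjust if needed.
            pred = model.predict("sentence_classification_head", tokens).argmax().item()
        correct += int(pred == int(gold))
        total += 1
print(f"accuracy = {correct / total:.4f}")
```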

If you trained the model with multiple seeds, you can use the "src/examples/compute_mean_std.py" script to calculate the mean, median, and standard deviation of the scores. For each run, the validation score is the best validation score across epochs, and the test score is the test score at the epoch with that best validation score.
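
For illustration, the aggregation can be sketched as follows; the per-epoch numbers are made-up placeholders, and the real script may read them from training logs instead.

```python
import statistics

# Per-seed, per-epoch scores (placeholder numbers, not real results).
runs = {
    1: {"valid": [0.71, 0.74, 0.73], "test": [0.70, 0.72, 0.71]},
    2: {"valid": [0.72, 0.73, 0.75], "test": [0.71, 0.72, 0.74]},
    3: {"valid": [0.70, 0.75, 0.74], "test": [0.69, 0.73, 0.72]},
}

# For each seed, keep the test score at the epoch with the best validation score.
test_scores = []
for scores in runs.values():
    best_epoch = max(range(len(scores["valid"])), key=scores["valid"].__getitem__)
    test_scores.append(scores["test"][best_epoch])

print("mean  :", statistics.mean(test_scores))
print("median:", statistics.median(test_scores))
print("stdev :", statistics.stdev(test_scores))
```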

Summarization

Thanks to its encoder-decoder structure, GreekBART can perform generative tasks such as summarization.
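
Once a summarization checkpoint has been fine-tuned (see the steps below), summaries can be generated with fairseq's sample() helper. The paths and generation hyperparameters in this sketch are assumptions, not the authors' settings.

```python
from fairseq.models.bart import BARTModel

# Load a fine-tuned summarization checkpoint (paths are placeholders).
model = BARTModel.from_pretrained(
    "checkpoints/summarization_abstract",       # hypothetical checkpoint directory
    checkpoint_file="checkpoint_best.pt",
    bpe="sentencepiece",
    sentencepiece_model="greekbart_spm.model",  # hypothetical SentencePiece model
)
model.eval()

article = "..."  # a Greek news article
summary = model.sample(
    [article],
    beam=4,                   # placeholder decoding settings
    lenpen=1.0,
    max_len_b=120,
    min_len=10,
    no_repeat_ngram_size=3,
)[0]
print(summary)
```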

Get the summarization dataset

Please follow the steps here to get GreekSUM.

Abstract

We run the following scripts:

  • "src/summarization/abstract/sentencepiece_summarization.sh" (tokenize datasets)
  • "src/summarization/abstract/binarization_summarization.sh" (binarize datasets)
  • "src/summarization/abstract/train_summarization.sh" (fine-tune them model to this task)

Title

We run the following scripts:

  • "src/summarization/title/sentencepiece_summarization.sh" (tokenize datasets)
  • "src/summarization/title/binarization_summarization.sh" (binarize datasets)
  • "src/summarization/title/train_summarization.sh" (fine-tune them model to this task)

Use "src/summarization/generate_summary.py" and "src/summarization/calculate_score.py" to generate the summaries and to compute their ROUGE and BERTScore scores, respectively. No stemming is applied before evaluation.

Finally, we can execute the "src/summarization/models_statistics.r" script to compute statistics on the generated summaries, such as their length and the percentage of repetitions relative to their reference summaries.

You can find our demo HERE

If you use the code or any of the models, please cite the following paper:

@misc{evdaimon2023greekbart,
      title={GreekBART: The First Pretrained Greek Sequence-to-Sequence Model}, 
      author={Iakovos Evdaimon and Hadi Abdine and Christos Xypolopoulos and Stamatis Outsios and Michalis Vazirgiannis and Giorgos Stamou},
      year={2023},
      eprint={2304.00869},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Footnotes

  1. https://dumps.wikimedia.org/elwiki/

  2. https://www.statmt.org/europarl/

  3. https://oscar-corpus.com/

  4. http://nlp.polytechnique.fr/resources-greek
