
GreekBART

The First Pretrained Greek Sequence-to-Sequence Model

Introduction

GreekBART is the first pretrained sequence-to-sequence model for the Greek language, based on the BART base model (https://github.com/facebookresearch/fairseq/tree/main/examples/bart).

GreekBART is pretrained from scratch on 76.9GB of raw Greek text, learning to reconstruct corrupted input sentences.
Since GreekBART is built on the BART architecture, it is well suited to generative tasks.

Model       Architecture   #layers   #params   Link
GreekBART   BASE           12        181M      Link
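
As a quick sanity check, the pretrained checkpoint can be loaded through fairseq's BART hub interface. The sketch below is only illustrative: the checkpoint directory, file names and SentencePiece model path are placeholders, and depending on your fairseq version the SentencePiece argument may be named sentencepiece_model or sentencepiece_vocab.

```python
from fairseq.models.bart import BARTModel

# Load the pretrained GreekBART checkpoint (all paths are placeholders).
greekbart = BARTModel.from_pretrained(
    "checkpoints/greekbart.base",               # hypothetical checkpoint directory
    checkpoint_file="checkpoint_best.pt",       # hypothetical checkpoint file
    bpe="sentencepiece",
    sentencepiece_model="greekbart_spm.model",  # hypothetical SentencePiece model
)
greekbart.eval()

# Encode a Greek sentence and extract encoder-decoder features.
tokens = greekbart.encode("Αυτή είναι μια πρόταση στα ελληνικά.")
features = greekbart.extract_features(tokens)
print(features.shape)
```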

Structure of Code

We used fairseq (https://github.com/facebookresearch/fairseq) for our implementation.

The src folder contains all the code and implementations used in this work.

Structure of the src folder and its subfolders:

  • src
    • corpus
    • examples
      • classification-GreekSUM
      • classification-Macedonia
      • NLI
      • Sentimental Analysis
    • preprocess
      • data
      • Crawler
      • EU Parliament
      • OSCAR
      • Wikipedia
    • pretrain
    • summarization
      • abstract
      • title

Corpus Download & Preprocess

The pretraining corpus is built from the following datasets:

  • the Greek part of Wikipedia [1]
  • the Greek part of the European Parliament Proceedings Parallel Corpus (EuroParl) [2]
  • the Greek part of OSCAR [3], a cleaned version of CommonCrawl
  • a large web corpus crawled from about 20 million Greek-language URLs [4]

The "src/preprocess/read_dataset.py" script can be used to directly download the Wikipedia dataset, while the other datasets need to be downloaded by the user. However, for obtaining the latest version of the Greek part of OSCAR, one needs to contact the OSCAR team. After the download of the aforementioned datasets, you run the "src/preprocess/read_dataset.py" in order to preprocess and clean the downloaded datasets.

Then, running "src/preprocess/create_deduplication_script.py" and the bash script it generates deduplicates the downloaded datasets and concatenates them into a single file, our raw corpus. The deduplication step requires the runiq package (https://github.com/whitfin/runiq).
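
For illustration, the following is a conceptual Python equivalent of what the generated script does; the repository itself relies on the runiq CLI, and the input file names below are placeholders.

```python
# Exact-line deduplication and concatenation of the cleaned datasets into one
# raw corpus file. This only sketches the runiq-based step; file names are
# placeholders.
inputs = ["wikipedia.txt", "europarl.txt", "oscar.txt", "web_crawl.txt"]

seen = set()
with open("greek_corpus.txt", "w", encoding="utf-8") as out:
    for path in inputs:
        with open(path, encoding="utf-8") as f:
            for line in f:
                key = hash(line)  # hash each line to keep memory usage bounded
                if key not in seen:
                    seen.add(key)
                    out.write(line)
```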

Pretraining phase

After the corpus is formed, we execute "src/pretrain/all_in_one_script.sh", which performs several tasks: it splits the corpus into training and validation sets, tokenizes and binarizes the data, and starts the pretraining of the model.

For more details about the tokenizer we used, see LINK
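
The sentencepiece_*.sh scripts rely on a SentencePiece subword model. Below is a minimal sketch of training and applying such a model with the sentencepiece Python package; the vocabulary size and file names are placeholders, not the authors' settings.

```python
import sentencepiece as spm

# Train a subword model on the deduplicated corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="greek_corpus.txt",
    model_prefix="greekbart_spm",
    vocab_size=32000,               # placeholder value
    input_sentence_size=10_000_000,
    shuffle_input_sentence=True,
)

# Tokenize a sentence with the trained model; the resulting token files are
# then binarized with fairseq-preprocess.
sp = spm.SentencePieceProcessor(model_file="greekbart_spm.model")
print(" ".join(sp.encode("Αυτή είναι μια πρόταση στα ελληνικά.", out_type=str)))
```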

Discriminative Tasks

The subfolder "examples" contains the four discriminative tasks on which our model was evaluated.

classification-GreekSUM

Get GreekSUM classification dataset

Please follow the steps here to get GreekSUM.

We run the following scripts:

  • "src/examples/classification-GreekSUM/sentencepiece_nli.sh" (tokenize datasets)
  • "src/examples/classification-GreekSUM/binarization_nli.sh" (binarize datasets)
  • "src/examples/classification-GreekSUM/train_NLI.sh" (fine-tune them model to this task)

NLI task

We run the following scripts:

  • "src/examples/NLI/get-xnli.sh" (To download the dataset)
  • "src/examples/NLI/process_nli.py" (To preprocess dataset and split it to training/validation/test sets)
  • "src/examples/NLI/sentencepiece_nli.sh" (tokenize datasets)
  • "src/examples/NLI/binarization_nli.sh" (binarize datasets)
  • "src/examples/NLI/train_NLI.sh" (fine-tune them model to this task)

Sentimental Analysis task

Get dataset

Download the dataset from LINK. We run the following scripts:

  • "src/examples/Sentimental Analysis/process_sentimental.py" (To preprocess dataset and split it to training/validation/test sets)
  • "src/examples/Sentimental Analysis/sentencepiece_sentimental.sh" (tokenize datasets)
  • "src/examples/Sentimental Analysis/binarization_sentimental.sh" (binarize datasets)
  • "src/examples/Sentimental Analysis/train_sentimental.sh" (fine-tune them model to this task)

classification-Macedonia task

We run the following scripts:

  • "src/examples/classification-Macedonia/get-classification.sh" (To download the dataset)
  • "src/examples/classification-Macedonia/process_classification.py" (To preprocess dataset and split it to training/validation/test sets)
  • "src/examples/classification-Macedonia/sentencepiece_classification.sh" (tokenize datasets)
  • "src/examples/classification-Macedonia/binarization_classification.sh" (binarize datasets)
  • "src/examples/classification-Macedonia/train_classification.sh" (fine-tune them model to this task)

We can use the scripts "src/examples/inference.py" and "src/examples/calculate_score.py" to evaluate the model's performance on the test set.
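
As a rough illustration of the inference step, a fine-tuned checkpoint can be loaded through fairseq's hub interface and queried with its classification head. The paths, head name, test-file format and SentencePiece arguments below are assumptions.

```python
import torch
from fairseq.models.bart import BARTModel

# Load a fine-tuned classification checkpoint (paths are placeholders).
model = BARTModel.from_pretrained(
    "checkpoints/nli",                          # hypothetical fine-tuned checkpoint dir
    checkpoint_file="checkpoint_best.pt",
    bpe="sentencepiece",
    sentencepiece_model="greekbart_spm.model",  # hypothetical SentencePiece model
)
model.eval()

correct = total = 0
with open("test.tsv", encoding="utf-8") as f:   # hypothetical format: text <TAB> label_id
    for line in f:
        text, gold = line.rstrip("\n").split("\t")
        with torch.no_grad():
            tokens = model.encode(text)
            # "sentence_classification_head" is fairseq's default head name; adjust if needed.
            pred = model.predict("sentence_classification_head", tokens).argmax().item()
        correct += int(pred == int(gold))
        total += 1
print(f"accuracy = {correct / total:.4f}")
```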

If you trained the model with multiple seeds, you can use the "src/examples/compute_mean_std.py" script to calculate the mean, median, and standard deviation of the scores. For each run, the validation score is the best validation score across epochs, and the test score is the test score at the epoch with that best validation score.
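
For illustration, the aggregation can be sketched as follows; the per-epoch numbers are made-up placeholders, and the real script may read them from training logs instead.

```python
import statistics

# Per-seed, per-epoch scores (placeholder numbers, not real results).
runs = {
    1: {"valid": [0.71, 0.74, 0.73], "test": [0.70, 0.72, 0.71]},
    2: {"valid": [0.72, 0.73, 0.75], "test": [0.71, 0.72, 0.74]},
    3: {"valid": [0.70, 0.75, 0.74], "test": [0.69, 0.73, 0.72]},
}

# For each seed, keep the test score at the epoch with the best validation score.
test_scores = []
for scores in runs.values():
    best_epoch = max(range(len(scores["valid"])), key=scores["valid"].__getitem__)
    test_scores.append(scores["test"][best_epoch])

print("mean  :", statistics.mean(test_scores))
print("median:", statistics.median(test_scores))
print("stdev :", statistics.stdev(test_scores))
```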

Summarization

Thanks to its encoder-decoder structure, GreekBART can perform generative tasks such as summarization.
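
Once a summarization checkpoint has been fine-tuned (see the steps below), summaries can be generated with fairseq's sample() helper. The paths and generation hyperparameters in this sketch are assumptions, not the authors' settings.

```python
from fairseq.models.bart import BARTModel

# Load a fine-tuned summarization checkpoint (paths are placeholders).
model = BARTModel.from_pretrained(
    "checkpoints/summarization_abstract",       # hypothetical checkpoint directory
    checkpoint_file="checkpoint_best.pt",
    bpe="sentencepiece",
    sentencepiece_model="greekbart_spm.model",  # hypothetical SentencePiece model
)
model.eval()

article = "..."  # a Greek news article
summary = model.sample(
    [article],
    beam=4,                   # placeholder decoding settings
    lenpen=1.0,
    max_len_b=120,
    min_len=10,
    no_repeat_ngram_size=3,
)[0]
print(summary)
```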

Get the summarization dataset

Please follow the steps here to get GreekSUM.

Abstract

We run the following scripts:

  • "src/summarization/abstract/sentencepiece_summarization.sh" (tokenize datasets)
  • "src/summarization/abstract/binarization_summarization.sh" (binarize datasets)
  • "src/summarization/abstract/train_summarization.sh" (fine-tune them model to this task)

Title

We run the following scripts:

  • "src/summarization/title/sentencepiece_summarization.sh" (tokenize datasets)
  • "src/summarization/title/binarization_summarization.sh" (binarize datasets)
  • "src/summarization/title/train_summarization.sh" (fine-tune them model to this task)

Use "src/summarization/generate_summary.py" and "src/summarization/calculate_score.py" to generate the summaries and to compute their ROUGE and BERTScore scores, respectively. No stemming is applied before evaluation.

Finally, we can execute the "src/summarization/models_statistics.r" script to compute statistics on the generated summaries, such as their length and the percentage of repetitions relative to their reference summaries.

You can find our demo HERE

If you use the code or any of the models, please cite the following paper:

@misc{evdaimon2023greekbart,
      title={GreekBART: The First Pretrained Greek Sequence-to-Sequence Model}, 
      author={Iakovos Evdaimon and Hadi Abdine and Christos Xypolopoulos and Stamatis Outsios and Michalis Vazirgiannis and Giorgos Stamou},
      year={2023},
      eprint={2304.00869},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Footnotes

  1. https://dumps.wikimedia.org/elwiki/

  2. https://www.statmt.org/europarl/

  3. https://oscar-corpus.com/

  4. http://nlp.polytechnique.fr/resources-greek
