Wikipedia-Vikidia Corpus (WiViCo) - A general-purpose parallel sentence simplification dataset for French

lormaechea/wivico

WiViCo | Wikipedia Vikidia Corpus

A general-purpose parallel sentence simplification dataset for French


     


General Presentation & Repo Structure:

This repository provides WiViCo (Wikipedia-Vikidia Corpus), a general-purpose parallel dataset of complex-simpler sentence pairs for French sentence simplification. It was built with a two-step automatic filtering method that mines register-diversified comparable corpora to extract complex-simpler pairs. The method addresses, in sequence, the two primary conditions that a simplified sentence must satisfy to be considered valid:

  • preservation of the original meaning, which we assess with n:m-aware SBERT-based cosine similarities; and
  • simplicity gain with respect to the source text, which we assess with a text simplicity classification model.
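
The two conditions above can be read as two successive filters over candidate pairs. The following is a minimal sketch of that idea, not the code used to build WiViCo: the names `two_step_filter`, `embed` and `is_simpler`, as well as the 0.7 similarity threshold, are assumptions made for illustration. In practice `embed` would wrap an SBERT encoder and `is_simpler` a trained simplicity classifier; here both are passed in as plain callables so the sketch runs without any model. An n:m candidate (several sentences on either side) is simply passed as one string per side, which is a simplification.

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[str, str]  # (complex text, candidate simpler text)


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def two_step_filter(
    candidates: Iterable[Pair],
    embed: Callable[[str], Vector],          # hypothetical: e.g. an SBERT encoder
    is_simpler: Callable[[str, str], bool],  # hypothetical: simplicity classifier
    sim_threshold: float = 0.7,              # assumed value, not from the papers
) -> List[Pair]:
    """Keep pairs that pass the meaning filter first, then the simplicity filter."""
    kept: List[Pair] = []
    for complex_text, simple_text in candidates:
        # Step 1: meaning preservation, approximated by embedding similarity.
        similarity = cosine_similarity(embed(complex_text), embed(simple_text))
        if similarity < sim_threshold:
            continue
        # Step 2: simplicity gain of the candidate with respect to the source.
        if is_simpler(complex_text, simple_text):
            kept.append((complex_text, simple_text))
    return kept


if __name__ == "__main__":
    # Toy stand-ins: made-up 2-d vectors and a word-count heuristic.
    toy_vectors = {
        "Le chat domestique est un mammifère carnivore.": [0.9, 0.1],
        "Le chat est un animal.": [0.8, 0.2],
        "Paris est la capitale de la France.": [0.0, 1.0],
    }
    pairs = [
        ("Le chat domestique est un mammifère carnivore.", "Le chat est un animal."),
        ("Le chat domestique est un mammifère carnivore.", "Paris est la capitale de la France."),
    ]
    shorter = lambda c, s: len(s.split()) < len(c.split())
    print(two_step_filter(pairs, toy_vectors.__getitem__, shorter))
```

Running the filters in sequence means the simplicity check is only applied to pairs that already pass the meaning check, which mirrors the order in which the two conditions are addressed above.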

This repository currently contains two versions:

  • The wivico_v.1 subfolder, which contains the initial version of the dataset. Here, the two conditions above were operationalized with n:m-aware SBERT-based cosine similarities (as a proxy for meaning retention) and an FFNN-based simplicity gain classifier. It results from the experiments conducted in the following article:

    @inproceedings{ormaechea-2023-extracting-simplification-pairs,
        title = {Extracting sentence simplification pairs from French comparable corpora using a two-step filtering method},
        author = {Lucía Ormaechea and Nikos Tsourakis},
        booktitle = {Proceedings of the Swiss Text Analytics Conference 2023},
        month = {6},
        year = {2023},
        location = {Neuchâtel, Switzerland},
        publisher = {ACL},
        url = {https://archive-ouverte.unige.ch/unige:169798}
    }
  • The wivico_v.2 subfolder, which contains the newest version of WiViCo. The data still relies on SBERT-based cosine similarities to assess meaning preservation, but uses a finer-grained method than the first version to capture complex-simpler sentence pairs. It results from the experiments performed in the following paper:

    @inproceedings{ormaechea-2023-simple-simpler-beyond,
        title = {Simple, Simpler and Beyond: A Fine-Tuning BERT-Based Approach to Enhance Sentence Complexity Assessment for Text Simplification},
        author = {Lucía Ormaechea and Nikos Tsourakis and Didier Schwab and Pierrette Bouillon and Benjamin Lecouteux},
        booktitle = {Proceedings of the 6th International Conference on Natural Language and Speech Processing (ICNLSP)},
        month = {12},
        year = {2023},
        location = {Trento, Italy},
        publisher = {ACL},
        url = {To appear},
    }

Authors

Contact person: Lucía Ormaechea, lucia.ormaecheagrijalba@unige.ch

If you have further questions, don't hesitate to send us an email.
