LanguageTechnologyLab/DSL-TL

Discriminating Between Similar Languages - True Labels (DSL-TL)

Discriminating between similar languages (e.g., Croatian and Serbian) and language varieties (e.g., Brazilian and European Portuguese) has been a popular topic at VarDial since its first edition. The DSL shared tasks organized in 2014, 2015, 2016, and 2017 addressed this issue by providing participants with the DSL Corpus Collection (DSLCC), a collection of journalistic texts written in multiple similar languages and language varieties. The DSLCC was compiled under the assumption that each instance's gold label is determined by the source from which the text was retrieved. While this is a straightforward (and mostly accurate) practical assumption, previous research has shown the limitations of this problem formulation: some texts may contain no linguistic marker that allows systems or native speakers to discriminate between two very similar languages or language varieties.

We tackle this important limitation by introducing the DSL True Labels (DSL-TL) task. DSL-TL will provide participants with a human-annotated DSL dataset. A subset of nearly 13,000 sentences was retrieved from the DSLCC and annotated by multiple native speakers of the included languages and varieties, namely English (American and British), Portuguese (Brazilian and European), and Spanish (Argentinian and Peninsular). To the best of our knowledge, this is the first dataset of its kind, opening exciting new avenues for language identification research.

  • Track 1 - Three-way Classification: In this track, systems will be evaluated with respect to the prediction of all three labels for each language, namely the variety-specific labels (e.g., PT-PT or PT-BR) and the common label (e.g., PT).
  • Track 2 - Binary Classification: In this track, systems will be scored only on the variety-specific labels (e.g., EN-GB, EN-US).
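The difference between the two tracks can be illustrated with a minimal sketch. It assumes that variety-specific labels are the ones containing a hyphen (as in PT-PT or EN-GB), that Track 2 is scored on instances whose gold label is variety-specific, and that accuracy is an acceptable stand-in metric. None of this is taken from the official scorer; the function names are hypothetical.

```python
from typing import List, Tuple


def is_variety_specific(label: str) -> bool:
    # Assumption: variety labels look like "PT-BR"; common labels look like "PT".
    return "-" in label


def select_for_track(gold: List[str], pred: List[str], track: int) -> List[Tuple[str, str]]:
    """Return the (gold, predicted) pairs that a given track would score."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    if track == 1:
        # Track 1: all three labels per language are scored.
        return pairs
    if track == 2:
        # Track 2: only variety-specific gold labels are scored (assumed filter).
        return [(g, p) for g, p in pairs if is_variety_specific(g)]
    raise ValueError("track must be 1 or 2")


def accuracy(pairs: List[Tuple[str, str]]) -> float:
    # Placeholder metric for illustration only.
    if not pairs:
        return 0.0
    return sum(g == p for g, p in pairs) / len(pairs)


gold = ["PT-PT", "PT", "PT-BR", "PT"]
pred = ["PT-PT", "PT-BR", "PT-BR", "PT"]
track1_score = accuracy(select_for_track(gold, pred, track=1))
track2_score = accuracy(select_for_track(gold, pred, track=2))
```

In this toy example, the one error falls on an instance with the common gold label, so it lowers the Track 1 score but does not affect Track 2.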

News

Update 3/08/2023

  • Added the features collected during annotation for each language. Features can be found at: ../DSL-TL/DSL-TL-Corpus/Features-DSL-TL/..

Update 2/15/2023

  • Updated the submission instructions, in particular the file naming convention. The instructions should now be clearer. The file is located at: ../Test-DSL-TL/Submission_Instructions.md

Update 2/13/2023

  • Converted four labels within the ES_train.tsv file to uppercase: es-ES to ES-ES. Updated file can be found at: ../ES-DSL-TL/ES_Train.tsv
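Anyone working from an older copy of the data can apply the same kind of fix locally. The sketch below simply uppercases labels; it assumes labels are plain strings such as "es-ES", and the function name is hypothetical rather than part of this repository.

```python
from typing import List


def normalize_label(label: str) -> str:
    # Strip stray whitespace and uppercase, e.g. "es-ES" -> "ES-ES".
    return label.strip().upper()


def normalize_labels(labels: List[str]) -> List[str]:
    # Apply the same normalization to a whole column of labels.
    return [normalize_label(label) for label in labels]


fixed = normalize_labels(["es-ES", "ES-ES", " es "])
```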

Update 2/03/2023

  • Added the test file for DSL-TL. It can be found at: ../Test-DSL-TL/DSL-TL-test.tsv
  • Added submission instructions at: ../Test-DSL-TL/Submission_Instructions.md
  • Changed the directory structure. Directories previously named DSLCC-TL have been shortened to DSL-TL to match the dataset name.

Update 1/30/2023

  • Labels have been standardized across the three datasets.
  • Several instances in ES_train.tsv that were previously missing ids have now been assigned ids.
  • Four duplicates have been removed across the EN_train.tsv and EN_dev.tsv.
  • Example Annotation Prompts added.

Contact

For more details please contact knorth8@gmu.edu

Last updated Feb 3 2023

About

The official repository for the DSLCC-TL shared-task
