Discriminating Between Similar Languages - True Labels (DSL-TL)

Discriminating between similar languages (e.g., Croatian and Serbian) and language varieties (e.g., Brazilian and European Portuguese) has been a popular topic at VarDial since its first edition. The DSL shared tasks organized in 2014, 2015, 2016, and 2017 addressed this issue by providing participants with the DSL Corpus Collection (DSLCC), a collection of journalistic texts written in multiple similar languages and language varieties. The DSLCC was compiled under the assumption that each instance's gold label is determined by the source from which the text was retrieved. While this is a straightforward (and mostly accurate) practical assumption, previous research has shown the limitations of this problem formulation: some texts contain no linguistic marker that allows systems or native speakers to discriminate between two very similar languages or language varieties.

We tackle this important limitation by introducing the DSL True Labels (DSL-TL) task. DSL-TL will provide participants with a human-annotated DSL dataset. A subset of nearly 13,000 sentences was retrieved from the DSLCC and annotated by multiple native speakers of the included languages and varieties, namely English (American and British), Portuguese (Brazilian and European), and Spanish (Argentinian and Peninsular). To the best of our knowledge, this is the first dataset of its kind, opening exciting new avenues for language identification research.

Track 1 - Three-way Classification: In this track, systems will be evaluated with respect to the prediction of all three labels for each language, namely the variety-specific labels (e.g., PT-PT or PT-BR) and the common label (e.g., PT).
Track 2 - Binary Classification: In this track, systems will be scored only on the variety-specific labels (e.g., EN-GB, EN-US).
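
As a rough illustration of how the two tracks differ, the sketch below scores the same toy predictions under both settings. The metric (macro-averaged F1), the function names, the toy labels, and the choice to drop instances whose gold label is the common label in Track 2 are all assumptions made for illustration; they are not the official scoring procedure.

```python
def macro_f1(gold, pred, labels):
    """Macro-averaged F1 over `labels` (assumed metric, not the official one)."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


def score_track1(gold, pred, variety_labels, common_label):
    """Three-way setting: score the variety-specific labels and the common label."""
    return macro_f1(gold, pred, list(variety_labels) + [common_label])


def score_track2(gold, pred, variety_labels):
    """Binary setting (assumed reading): keep only instances whose gold label
    is variety-specific, then score the variety-specific labels."""
    kept = [(g, p) for g, p in zip(gold, pred) if g in variety_labels]
    kept_gold = [g for g, _ in kept]
    kept_pred = [p for _, p in kept]
    return macro_f1(kept_gold, kept_pred, list(variety_labels))


# Hypothetical gold labels and system predictions for four Portuguese sentences.
gold = ["PT-PT", "PT-BR", "PT", "PT-BR"]
pred = ["PT-PT", "PT", "PT", "PT-BR"]

track1 = score_track1(gold, pred, ["PT-PT", "PT-BR"], "PT")
track2 = score_track2(gold, pred, ["PT-PT", "PT-BR"])
print(f"Track 1: {track1:.3f}  Track 2: {track2:.3f}")
```

Under this reading, a system that predicts the common label for a variety-specific sentence is still penalized in Track 2, because the prediction counts as a miss for the gold variety.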

News

Update 3/08/2023

Included features for each language collected during annotation. Features can be found at: ../DSL-TL/DSL-TL-Corpus/Features-DSL-TL/..

Update 2/15/2023

Updated the submission instructions, in particular the file naming convention. The submission instructions should now be clearer. The file is located at: ../Test-DSL-TL/Submission_Instructions.md

Update 2/13/2023

Converted four labels within the ES_train.tsv file to uppercase: es-ES to ES-ES. The updated file can be found at: ../ES-DSL-TL/ES_Train.tsv

Update 2/03/2023

Included the test.tsv file for DSL-TL. It can be found at: ../Test-DSL-TL/DSL-TL-test.tsv
Added submission instructions at: ../Test-DSL-TL/Submission_Instructions.md
Changed the directory structure. Directories previously named DSLCC-TL have been shortened to DSL-TL to match the dataset name.

Update 1/30/2023

Labels have been standardized across the three datasets.
Several instances within the ES_train.tsv have been assigned ids that were previously missing.
Four duplicates have been removed across the EN_train.tsv and EN_dev.tsv.
Example Annotation Prompts added.

Contact

For more details, please contact knorth8@gmu.edu

Last updated Feb 3 2023
