TuGeBiC -- A Turkish German Bilingual Code-Switching Corpus

This repository presents the TuGeBiC corpus, which contains annotated transcriptions of spontaneous speech samples from Turkish-German bilinguals.

The recordings and their transcriptions date back to the first half of the 1990s. In 2022 we revisited the transcriptions to make them available to the research community in a computer-readable, standardised format. The main improvements are manual tokenisation and normalisation, and the replacement of all proper names (names of participants and places mentioned in the conversations) with pseudonyms. We also performed automatic token-level language identification, which made it possible to derive basic statistics on the corpus.

| Statistic | Count |
| --- | --- |
| # of tokens | 116688 |
| # of monolingual sentences | 10141 |
| # of bilingual sentences | 4510 |
| # of CS points in bilingual sentences | 8180 |
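
As a rough illustration of how statistics like those above can be derived from token-level language IDs, the following sketch classifies a sentence as monolingual or bilingual and counts code-switching (CS) points. The label values (`TR`, `DE`) and the rule that a CS point is any change of language between consecutive language-bearing tokens are assumptions made for this example; the definition used to produce the corpus statistics may differ.

```python
from typing import Iterable, List

# Hypothetical label set; the actual tags used in the corpus may differ.
LANGUAGE_LABELS = {"TR", "DE"}


def language_sequence(labels: Iterable[str]) -> List[str]:
    """Keep only labels that name a language (drops e.g. punctuation tags)."""
    return [label for label in labels if label in LANGUAGE_LABELS]


def count_switch_points(labels: Iterable[str]) -> int:
    """Count positions where the language changes between adjacent tokens."""
    seq = language_sequence(labels)
    return sum(1 for prev, cur in zip(seq, seq[1:]) if prev != cur)


def is_bilingual(labels: Iterable[str]) -> bool:
    """A sentence is treated as bilingual if it contains both languages."""
    return len(set(language_sequence(labels))) > 1
```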

There are 25 files in total in the corpus, presented in two different versions:

- `plain/tugebic_XX.txt` files in plain text.
- `langID/tugebic_XX.conllu` files in CoNLL-U format. The MISC column contains predicted language IDs.

where XX is the file ID between 00 and 25.
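
A minimal sketch of reading one of the CoNLL-U files might look as follows. It relies only on the standard CoNLL-U layout (ten tab-separated columns, with FORM second and MISC last) and returns the MISC value unparsed, since the exact encoding of the language ID inside that column is not described here. The function name `read_conllu` is hypothetical.

```python
from typing import Iterator, List, Tuple


def read_conllu(text: str) -> Iterator[List[Tuple[str, str]]]:
    """Yield each sentence as a list of (FORM, MISC) pairs."""
    sentence: List[Tuple[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comment lines carry no tokens.
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            # Not a well-formed token line; skip it.
            continue
        if "-" in cols[0] or "." in cols[0]:
            # Skip multiword token ranges and empty nodes.
            continue
        sentence.append((cols[1], cols[9]))
    if sentence:
        yield sentence
```

For example, `read_conllu(Path("langID/tugebic_00.conllu").read_text(encoding="utf-8"))` (with `Path` imported from `pathlib`) would iterate over the sentences of the first file, assuming the repository has been cloned into the current directory.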

ozlemcek/TuGeBiC
