The NURC Project, started in 1969 to study the cultured linguistic urban norm spoken in five Brazilian capitals, compiled a large corpus for each capital. The digitized NURC/SP comprises 375 inquiries totaling 334 hours of recordings made in the city of São Paulo.
The purpose of this repository is to provide resources for NURC/SP analysis and research projects.
We evaluated a Minimum Corpus (MC) of 21 NURC/SP inquiries in the paper "Bringing NURC/SP to Digital Life: the Role of Open-source Automatic Speech Recognition Model". For further details, see the cm_analysis directory.
CORAA NURC-SP Minimal Corpus is a manually annotated corpus of Brazilian Portuguese spontaneous speech (São Paulo variety). The corpus is a subset of the NURC (‘Cultured Linguistic Urban Norm’) project collection, one of the most influential in Brazilian Linguistics. The corpus was brought to digital life by TaRSiLa, a project aiming to build large multi-purpose datasets for speech processing (ASR, TTS, and Sentiment Analysis). It comprises 21 audio files with multilevel transcripts that are aligned to the audio according to linguistically motivated intonation units. For further details, see the dataset website.