
A YouTube speech corpus to study Asian North American English.


lspcheng/GUAva


GUAva

GUAva is a YouTube speech corpus originally inspired by the Growing Up Asian American video tag. It will be used to study language variation in Asian North American individuals.

The GUAva corpus is created and processed via the LingTube suite of tools for linguistic analysis of YouTube data. It is being developed alongside LingTube and the YouSpeak pipeline, a branch of LingTube specifically for doing phonetic speech research on YouTube-sourced audio. This repo therefore also serves as an example of how one could use the LingTube and/or YouSpeak tools.

Note: This repository contains only the transcribed data from the speech corpus. For access to the audio portion of the corpus, please contact Lauretta Cheng.


Corpus Details

Corpus creation and processing roughly follow these steps:

Scrape YouTube

  1. Identify specific video URLs, listed in the lists directory (currently grouped in "sets" per ethnic background), and run yt-tools/scrape-channels.py to get channel info along with URLs in screened_urls
  2. For each video, download the audio and English captions using yt-tools/scrape-videos.py into raw_audio and raw_subtitles

Process text

  1. Convert captions (tx-tools/clean-captions.py) to a neater format and partially clean text
  2. (optional) Correct raw captions files with the help of tx-tools/correct-captions.py

Process audio

  1. Run the conversion script (youspeak/convert-audio.py) to prepare audio in a format amenable to processing (i.e., mono WAV files)
  2. Chunk the long audio files into short (<10 sec) segments based on pauses/breath breaks using youspeak/chunk-audio.py; this is a necessary and/or useful step for transcription and forced alignment

Process text + audio

  1. Classify each clip as usable or not (i.e., clear speech without noise, music, etc.) and confirm transcriptions for each segment of speech using youspeak/validate-chunks.py, which opens a GUI
  2. Match transcriptions to audio in TextGrid format with youspeak/create-textgrids.py
  3. Conduct forced alignment using the Montreal Forced Aligner, then manually correct alignment boundaries (with the help of adjust-textgrids.py)
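
The steps above can be summarized as an ordered checklist. This is only a sketch: the `Step` tuple, the `PIPELINE` list, and the `hand_correction` flag are illustrative names that are not part of LingTube, and the `youspeak/` location of adjust-textgrids.py is taken from the usage examples later in this README.

```python
from typing import List, NamedTuple, Optional


class Step(NamedTuple):
    stage: str
    script: Optional[str]   # None when the step uses an external tool
    hand_correction: bool   # True when a person works through the files


# Order and script paths follow the list above (hypothetical structure).
PIPELINE: List[Step] = [
    Step("Scrape YouTube", "yt-tools/scrape-channels.py", False),
    Step("Scrape YouTube", "yt-tools/scrape-videos.py", False),
    Step("Process text", "tx-tools/clean-captions.py", False),
    Step("Process text", "tx-tools/correct-captions.py", True),
    Step("Process audio", "youspeak/convert-audio.py", False),
    Step("Process audio", "youspeak/chunk-audio.py", False),
    Step("Process text + audio", "youspeak/validate-chunks.py", True),
    Step("Process text + audio", "youspeak/create-textgrids.py", False),
    Step("Process text + audio", None, False),  # Montreal Forced Aligner
    Step("Process text + audio", "youspeak/adjust-textgrids.py", True),
]


def hand_correction_scripts() -> List[str]:
    """Return the scripts that involve manual work, in pipeline order."""
    return [s.script for s in PIPELINE if s.hand_correction and s.script]
```

The three scripts returned by `hand_correction_scripts()` are the ones covered in the usage section of this README.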

Corpus Processing Guidelines


How to Use Scripts

For the purposes of the GUAva corpus, this is how the LingTube scripts will be used to do hand-correction during corpus processing. In most cases, the only parameter that must be specified is the group (i.e., ethnicity grouping code), though a specific channel can be specified for some.

Correcting Captions

To run the caption correction script that opens the YouTube video in a web browser and a copy of the raw transcript file in a text editor, the command is path/to/correct-captions.py -g $GROUP [-ch $CHANNEL]

  • e.g. To go through all the captions of the Korean (kor) group in order, run ../LingTube/tx-tools/correct-captions.py -g kor

  • e.g. To go through all the files for a particular channel (e.g., AMYLEE), run ../LingTube/tx-tools/correct-captions.py -g kor -ch AMYLEE

Validating Audio Chunks

To run the audio chunk validation script for (i) classifying whether a chunk is usable or not, and (ii) matching the transcription to the audio, the command is path/to/validate-chunks.py -g $GROUP

  • e.g. To validate a file in the Korean group, run ../LingTube/youspeak/validate-chunks.py -g kor

A pop-up file window will then ask you to select a chunking log file (for a particular video) to begin the process.

Adjusting Alignment Boundaries

To run the TextGrid alignment adjustment script, which opens Praat with the appropriate directories in place, the command is path/to/adjust-textgrids.py -g $GROUP [-ch $CHANNEL]

  • e.g. To go through all the channels in the Korean (kor) group in order, run ../LingTube/youspeak/adjust-textgrids.py -g kor

  • e.g. To go through all the files for a particular channel (e.g., AMYLEE), run ../LingTube/youspeak/adjust-textgrids.py -g kor -ch AMYLEE
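
The three commands above share one pattern: a script path, a required `-g` group, and an optional `-ch` channel. The helper below is a minimal sketch of that pattern. `build_command` is a hypothetical name rather than part of LingTube, and it assumes the scripts can be invoked directly by path as in the examples. Only the `-g` and `-ch` flags shown above are used.

```python
import posixpath
from typing import List, Optional

# Script locations inside a LingTube checkout, as given in the examples above.
SCRIPTS = {
    "correct-captions": "tx-tools/correct-captions.py",
    "validate-chunks": "youspeak/validate-chunks.py",
    "adjust-textgrids": "youspeak/adjust-textgrids.py",
}

# Only these two are documented above as accepting -ch.
ACCEPTS_CHANNEL = {"correct-captions", "adjust-textgrids"}


def build_command(lingtube_dir: str, script: str, group: str,
                  channel: Optional[str] = None) -> List[str]:
    """Build the argument list for one hand-correction script."""
    if script not in SCRIPTS:
        raise ValueError(f"unknown script: {script}")
    if channel is not None and script not in ACCEPTS_CHANNEL:
        raise ValueError(f"{script} is not documented as taking -ch")
    cmd = [posixpath.join(lingtube_dir, SCRIPTS[script]), "-g", group]
    if channel is not None:
        cmd += ["-ch", channel]
    return cmd
```

For example, `build_command("../LingTube", "adjust-textgrids", "kor", "AMYLEE")` reproduces the last command above as a list that could be passed to `subprocess.run`.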


Transcript Correction Guidelines

Listen once through—don’t spend too much time on this stage.

Basics:

  • Remove anything not actually said (e.g., joke subs, sound effects, commentary).
  • Fix incorrectly transcribed words.
  • If there is a mispronunciation (with the intended meaning clear based on context), transcribe as the intended word and add an asterisk (*). If in doubt, transcribe as it sounds.
    • e.g., when I say* this... where "say" is pronounced like "see"
  • If false start or single incomplete word, add a hyphen (-).
    • e.g., "I- I don't even remember..."
    • e.g., "I mean li- like I don't know"

Details:

  • Add or keep filler words (e.g., like, um, uh, you know).
  • Remove hyphens (-) from hyphenated words (e.g., self conscious not "self-conscious")
  • For colloquial pronunciations, replace standard/full forms with phonetically-accurate versions (i.e., represent how things are actually pronounced!).
    • e.g., 'cause for "because" or 'til for "until"
    • e.g., just feels a little... for "it just feels a little..."
    • e.g., wanna for "want to", dunno for "don't know", y'know for "you know"
  • For acronyms, capitalize all letters; don’t add periods or spaces
    • e.g., AM and PM for "a.m." and "p.m."
    • e.g., LA, FIDM, UCLA, US
  • Other than that, don’t worry about capitalization (i.e., don’t change whatever’s there).
  • For abbreviations, write out full forms where possible.
    • e.g., et cetera for "etc"
  • For numerals (including times and years), write out pronunciation in words.
    • e.g., a hundred or one hundred for 100
    • e.g., five AM for 5:00 a.m.
    • e.g., twenty four seven for 24/7
    • e.g., twenty ten for "2010"

Other:

  • For unidentifiable words, replace with <unk> (for unknown).
  • For words/utterances in another language, if you can’t identify the words, replace with <cs> (for code switch).
    • Can transcribe non-English words (e.g., in that language or romanization) but not necessary and don’t spend extra time on this.
      • e.g., kare rice for 'kare rice' pronounced with Korean phonology
  • For laughs not overlapping with speech, add in <lgh> (for laugh).
  • For any sound effects not overlapping with speech (e.g., transition ring, bell, noise, etc.), add in <sfx> (for sound effect).
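
Several of the guidelines above are mechanical enough to spot-check automatically. The sketch below flags leftover numerals, dotted acronyms, hyphenated words, and angle-bracket markers other than the four listed above. `lint_line` and its patterns are illustrative assumptions, not part of LingTube. It cannot judge mispronunciations, fillers, or colloquial forms, which still need a listener.

```python
import re
from typing import List

# The four markers defined in the guidelines above.
ALLOWED_MARKERS = {"<unk>", "<cs>", "<lgh>", "<sfx>"}

CHECKS = [
    # Digits, including times and ratios such as 5:00 or 24/7.
    ("numeral", re.compile(r"\d+(?:[:/.,]\d+)*")),
    # Letter-period sequences such as a.m. or L.A.
    ("dotted acronym", re.compile(r"\b(?:[A-Za-z]\.){2,}")),
    # A hyphen between letters; a trailing hyphen (cut-off word) is allowed.
    ("hyphenated word", re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)+")),
]
MARKER = re.compile(r"<[^<>\s]*>")


def lint_line(line: str) -> List[str]:
    """Return a list of guideline problems found in one transcript line."""
    problems = []
    for label, pattern in CHECKS:
        for match in pattern.finditer(line):
            problems.append(f"{label}: {match.group(0)}")
    for match in MARKER.finditer(line):
        if match.group(0) not in ALLOWED_MARKERS:
            problems.append(f"unknown marker: {match.group(0)}")
    return problems
```

A line such as `I- I don't even remember <lgh>` passes cleanly, while `at 5:00 a.m. I felt self-conscious` is flagged three times.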

Chunk Classification Guidelines

Usable?

  1. Check the box for "Yes" if audio clip is usable, meaning it contains clear, "natural" speech in English, with no (or minimal) background sounds or noise.

Here is a list of potential issues that would render a section of speech unusable for the purposes of this project:

  • background music
  • background noise (e.g., fan, traffic, city noises)
  • multiple speakers (e.g., overlapping speech with other people)
  • another speaker's voice (e.g., a media clip or meme, a friend in the video)
  • altered voice (e.g., sped up, slowed down, higher pitch)
  • non-English speech (code-switching)
  • "unnatural" or "atypical" speech (e.g., performance/skit, imitation, "putting on a voice")
  • environmental noises (e.g., shuffling, placing something on a table, clapping)
  • sound effects (e.g., cheer, clap, pop, swish)
  2. If the clip includes very quiet ambient sounds, like very low music or some small degree of noticeable noise, you can still check "Yes" if it seems usable (i.e., loud and clear, natural, etc.), but additionally check off any relevant Main Issues box(es).
  3. If only a portion of the clip contains any of the above issues (e.g., a specific word or the first half of the clip), check off "Yes" along with any relevant Main Issues box(es) and see Phonetic Coding Guidelines below for how to mark these issues in the transcript.
  4. If the vast majority or the entire clip contains any combination of the above issues, only check off the relevant Main Issues box(es) and do not check "Yes".

Main Issues: With Speech

  • Speech + music: Speech with any form of background music
  • Speech + noise: Speech with any form of background noise
  • Other / altered voice: Speech with a voice or multiple voices other than the speaker, the altered voice of the speaker, or the speaker "putting on a voice"

Main Issues: Not Speech

  • Music only: A period of music but no speech overlapping (often at beginning and ends of videos, as well as some transition periods)
  • Noise: A period of non-speech ("silence") with audible noisiness (e.g., a loud breath, background traffic noise)
  • Other sounds: Any other case of a non-speech sound, including sound effects and environmental noises

If an issue that affects the whole clip doesn't fit into any of these categories (e.g., code-switching), simply don't check any box.
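
The checkbox rules in this section can be read as a small decision function. The sketch below is one interpretation, not the logic of validate-chunks.py. The `Extent` levels are an assumed simplification of "quiet ambient sound", "only a portion", and "vast majority or entire clip", and `boxes_to_check` is a hypothetical name.

```python
from enum import Enum
from typing import FrozenSet, Iterable, Tuple


class Extent(Enum):
    NONE = "none"        # no issue anywhere in the clip
    FAINT = "faint"      # very quiet ambient sound; speech still loud and clear
    PARTIAL = "partial"  # only part of the clip is affected
    MOST = "most"        # the vast majority or all of the clip is affected


# The six Main Issues boxes listed above.
MAIN_ISSUES = frozenset({
    "speech + music", "speech + noise", "other / altered voice",
    "music only", "noise", "other sounds",
})


def boxes_to_check(extent: Extent,
                   issues: Iterable[str]) -> Tuple[bool, FrozenSet[str]]:
    """Return (check "Yes"?, Main Issues boxes to check) for one clip."""
    if extent is Extent.NONE:
        return True, frozenset()
    # Issues with no matching box (e.g., code-switching) are simply dropped.
    known = frozenset(issues) & MAIN_ISSUES
    usable = extent is not Extent.MOST
    return usable, known
```

Under this reading, a clip that is code-switched throughout yields `(False, frozenset())`, which matches the instruction to leave every box unchecked.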


Phonetic Coding Guidelines

Basics

  • Fix any of the following:
    • missing filler words should be added (um, uh, like)
    • numerals should be written in number words (e.g., a hundred or one hundred for "100")
    • acronyms or abbreviations should be all caps without periods (e.g., LA not "l.a.", "la", or "L.A.", US not "u.s.")
    • colloquial pronunciations should be as pronounced (e.g., 'cause, 'til, wanna, dunno, y'know)
    • mispronounced words should have a following asterisk (e.g., dad* pronounced like "daah", place* pronounced like "splace")
    • hyphenated words should have NO hyphens (e.g., twenty nine not "twenty-nine", self conscious not "self-conscious", cooccur not "co-occur")
    • cut-off words, either as produced by the speaker or artificially via the chunking process, SHOULD have a following hyphen (e.g., li- for cut-off "like"), even when the initial part is cut off (e.g., tion- for cut-off "action"). Alternatively, a preceding apostrophe-hyphen notation can be used (e.g., '-tion), but NOT a preceding hyphen alone.
    • sound effects (e.g., swish, ding) not overlapping with speech, should be marked with <sfx>
    • laughs not overlapping with speech should be marked with <lgh>

Code-switching

  • If there is code-switching completely to a different language and words are not identifiable, should be marked as <cs>.
  • If a word is identifiable but a code-switch (e.g., pronounced using non-English phonology), mark it with a _cs tag (e.g., kare_cs rice_cs)
  • If the word is a non-English word (e.g. loanword) but clearly pronounced with English phonology, don't tag it. If unsure, leave as is for later decision.

Unclear, Unnatural or Other Speech

  • If you can't make out a word, should be marked as <unk>.
  • If, out of an otherwise good audio chunk, there is an individual word or two that cannot be used, mark it with a _unc tag (for unclear).
    • This could be if a word is masked by a noise, overlapping with a sound effect or has some other issue that prevents it from being clear.
      • e.g., kare_unc for 'kare rice' overlapped with a 'pop' sound effect
      • e.g., Amy_unc and_unc for 'Amy and' with cheering noises
  • If some speech is altered (e.g., pitch raising, sped up) or includes other voices (e.g., someone else speaking), mark it with _unn (for unnatural).
    • e.g., what_unn do_unn you_unn mean_unn with sped up speech
  • If a phrase or word is otherwise not produced in the speaker's "natural" voice, such as imitating somebody else or doing some sort of skit, also mark it with a _unn (for unnatural).
    • e.g., Jennifer_unn packed_unn... during an imitation
  • If there are multiple issues that require tags, tag them all.
    • e.g., kare_cs_unc is a code-switch overlapped with a 'pop' sound
  • If unsure what words are affected (e.g., if you hear clicks but don't know whether and where it overlaps with speech), leave as is for later decision.
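
The tag conventions above are regular enough to parse when filtering tokens for later analysis. The following is a minimal sketch; `Token`, `parse_token`, and `parse_line` are hypothetical names, not LingTube functions. It assumes tokens are separated by whitespace, carry no trailing punctuation, and place the mispronunciation asterisk before any `_cs`, `_unc`, or `_unn` tag.

```python
import re
from typing import List, NamedTuple, Tuple

MARKERS = {"<unk>", "<cs>", "<lgh>", "<sfx>"}

# A word followed by zero or more of the three tags described above.
TOKEN = re.compile(r"^(?P<word>.+?)(?P<tags>(?:_(?:cs|unc|unn))*)$")


class Token(NamedTuple):
    word: str
    tags: Tuple[str, ...]
    mispronounced: bool  # word carried a following asterisk
    cut_off: bool        # li-, tion-, or '-tion style
    marker: bool         # standalone <unk>, <cs>, <lgh>, or <sfx>


def parse_token(raw: str) -> Token:
    """Split one non-empty transcript token into its word and annotations."""
    if raw in MARKERS:
        return Token(raw, (), False, False, True)
    match = TOKEN.match(raw)
    word = match.group("word")
    tags = tuple(t for t in match.group("tags").split("_") if t)
    mispronounced = word.endswith("*")
    if mispronounced:
        word = word[:-1]
    cut_off = word.endswith("-") or word.startswith("'-")
    return Token(word, tags, mispronounced, cut_off, False)


def parse_line(line: str) -> List[Token]:
    """Parse every whitespace-separated token in a transcript line."""
    return [parse_token(raw) for raw in line.split()]
```

With this in place, a downstream script could keep only tokens with no tags and no marker, for example `[t.word for t in parse_line(line) if not t.tags and not t.marker]`.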
