
This project takes an audio recording and its transcript, and outputs audio files of the minimal pairs they contain, in m4a format.

It currently works with audio and transcripts from YouTube videos.

background - what are minimal pairs?

TL;DR - a minimal pair is two words that differ in only one meaningful sound (which differences count as meaningful varies between languages). This becomes very obvious once the usual written form of a language is converted into IPA, which represents each distinct sound with its own symbol. For language learners whose native language has fewer vowel phonemes than English (e.g., Spanish), some English vowels sound the same and are very difficult to distinguish in conversation.

examples of minimal pairs (the weird characters are the IPA):

  • peach / pitch

    pit͡ʃ / pɪt͡ʃ

  • breath / bread

    bɹɛθ / bɹɛd

  • peer / fear

    pɪɹ / fɪɹ
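As a rough illustration of the definition above, here is a minimal sketch of how two IPA strings could be checked for a single-sound difference. The function names `ipa_segments` and `is_minimal_pair` are hypothetical and not taken from this repo, and the check only covers substitutions between transcriptions of equal length (it treats tie-barred symbols like t͡ʃ as one sound, but does not handle length marks or added/dropped sounds).

```python
import unicodedata

TIE_BAR = "\u0361"  # joins two symbols into one sound, e.g. t͡ʃ


def ipa_segments(ipa):
    """Split an IPA string into sounds, keeping diacritics and tied symbols together."""
    segments = []
    join_next = False
    for ch in ipa:
        if segments and (join_next or unicodedata.combining(ch)):
            # Combining marks (and the symbol right after a tie bar)
            # belong to the previous segment.
            segments[-1] += ch
            join_next = ch == TIE_BAR
        else:
            segments.append(ch)
            join_next = False
    return segments


def is_minimal_pair(ipa_a, ipa_b):
    """True if the two transcriptions differ in exactly one segment."""
    a, b = ipa_segments(ipa_a), ipa_segments(ipa_b)
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) == 1


print(is_minimal_pair("pit͡ʃ", "pɪt͡ʃ"))
```
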

For more detail, this paper outlines why practicing minimal pairs can be beneficial in the context of second language acquisition, and specifically describes a method in which learners listen to minimal pairs spoken by multiple different speakers.

The above paper also suggests an interesting method of doubling the vowel length. I still need to investigate how feasible this would be using only ffmpeg before I add another dependency (Praat).

why automate it?

The only current minimal pairs training applications (that I'm aware of) rely on actual humans to record themselves saying these minimal pairs and submit the recordings to the system. This scales poorly, since even the kind souls dedicated enough to submit their own voice recordings probably don't have much time to devote to the task.

Fortunately, automatic speech recognition (while still not great) is good enough for us to use in force alignment (more commonly called forced alignment), which outputs the time stamp of every word in an audio stream. It can actually output time stamps for every phoneme, but we don't need that level of detail here. We send it the transcript of the audio and the audio itself, and we get back the start and end time of each word in the audio.
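To make the "we get back the time of each word" part concrete, here is a small sketch that reads word timings out of an alignment result. It assumes a gentle-style JSON layout with a top-level `words` list whose entries carry `word`, `case`, `start`, and `end` fields; that layout is an assumption here, so check it against the actual aligner output. The function name `word_timings` is hypothetical.

```python
import json


def word_timings(alignment_json):
    """Return (word, start_seconds, end_seconds) for each word the aligner located.

    Assumes gentle-style output: {"words": [{"word": ..., "case": "success",
    "start": ..., "end": ...}, ...]}. Words the aligner could not find are skipped.
    """
    data = json.loads(alignment_json)
    timings = []
    for entry in data.get("words", []):
        if entry.get("case") != "success":
            continue  # no reliable time stamps for this word
        timings.append((entry["word"].lower(), entry["start"], entry["end"]))
    return timings
```

Lower-casing the word is a design choice for this sketch so that the timings can be matched against a lower-cased list of minimal pair words.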

Based on the list of minimal pairs present in the video (which we calculate from the transcript), we can then use the force alignment time-stamps to pull out just the audio that has each half of each minimal pair and save those to separate audio files.
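The clip extraction described above could look roughly like the sketch below, which only builds the ffmpeg argument list for one word. The function name `clip_command` and the optional `padding` parameter are hypothetical; `-ss`, `-to`, `-vn`, and `-y` are standard ffmpeg options, but this is not necessarily the exact command the pipeline uses.

```python
def clip_command(source_audio, start, end, output_path, padding=0.0):
    """Build an ffmpeg command that copies [start, end] seconds into its own file.

    `padding` (hypothetical) widens the clip slightly on both sides so the
    word is not cut off abruptly; the start is clamped at zero.
    """
    clip_start = max(0.0, start - padding)
    clip_end = end + padding
    return [
        "ffmpeg",
        "-y",                       # overwrite the output file if it exists
        "-i", source_audio,
        "-ss", f"{clip_start:.3f}",  # clip start, in seconds
        "-to", f"{clip_end:.3f}",    # clip end, in seconds
        "-vn",                      # audio only
        output_path,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once ffmpeg is installed; building the list separately keeps the timing logic testable without touching any audio.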

At the end of the process, we concatenate all the audio for all the speakers for one particular minimal pair, and with enough input data (I'm aiming for 800 audio/transcript pairs), we should have a fair amount of variation in speakers/accents for a large number of minimal pairs.

eventual directory structure:

    |   |__subdirs(based off of minimal pair distinctions, eg a-e)
    |       |__bat-bet
    |            |__audio files for bat-bet across videos
    |       |__sat-set
    |            |__audio files for sat-set across videos
    |       |__etc...
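As a sketch of how a clip's location in this layout might be computed: the helper below builds `distinction/pair/file` paths. The name `clip_path`, the root folder argument, the alphabetical ordering of the pair, and the `<video>_<word>.m4a` file name pattern are all assumptions for illustration, not the repo's actual naming scheme.

```python
from pathlib import PurePosixPath


def clip_path(root, distinction, word_a, word_b, video_name, word):
    """Return the path for one word clip, e.g. root/a-e/bat-bet/<video>_<word>.m4a."""
    # Sort the pair so "bet"/"bat" and "bat"/"bet" land in the same folder.
    pair_dir = "-".join(sorted([word_a, word_b]))
    # Include the video name so clips from different videos stay distinct.
    file_name = f"{video_name}_{word}.m4a"
    return PurePosixPath(root) / distinction / pair_dir / file_name


print(clip_path("out", "a-e", "bet", "bat", "my_video", "bat"))
```
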

to run


example steps not included in this particular repo:

  • find youtube video (or youtube channel)

  • grab audio from that video

  • grab transcript from that video

below this line is where the pipeline is decoupled from the data source. This is what runs, in this sequence:

  1. clean up file names (replace everything with underscores, etc)

  1. convert the webm files to mp3 (useful if the source files are from youtube)

  1. clean the transcripts by removing parentheses, titles, and bad words, and reformat them to one word per line
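The file name and transcript cleanup steps above could be sketched as follows. The function names, the exact character rules, and the `blocked_words` parameter are assumptions for illustration; title removal is left out because it depends on how titles appear in the transcripts.

```python
import re


def clean_file_name(name):
    """Replace runs of non-alphanumeric characters in the stem with underscores."""
    stem, dot, ext = name.rpartition(".")
    if not dot:  # no extension present
        stem, ext = name, ""
    stem = re.sub(r"[^A-Za-z0-9]+", "_", stem).strip("_")
    return f"{stem}.{ext}" if ext else stem


def clean_transcript(text, blocked_words=frozenset()):
    """Drop parenthesized notes and blocked words; return one lower-case word per line."""
    text = re.sub(r"\([^)]*\)", " ", text)  # e.g. "(applause)"
    words = (w.strip("'") for w in re.findall(r"[A-Za-z']+", text.lower()))
    return "\n".join(w for w in words if w and w not in blocked_words)


print(clean_file_name("My Video (2019).webm"))
print(clean_transcript("Hello (applause) world, it's DARN fine.", {"darn"}))
```
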


...I will eventually refactor this to accept command-line arguments for whatever input folder you want, but for now I'm just focused on getting it all working

  1. generate a list of minimal pairs per transcript (this step is where metadata about which phonemes differ can be generated and then used as folder names. In each of these folders, the eventual minimal pair audio files will be labelled with the video name, so that they remain distinct even with multiple minimal pair audio files in a folder)

  1. spin up a docker container with the "gentle" force-alignment server

  1. get force-aligned json file specifying the time-stamp for each word in the audio (this is the step that takes the most time)

  1. use force-alignment data to grab each word as a separate audio file from the original full audio (this also attempts to slow down all audio files by 50%)

  1. generate a json file reflecting the directory structure and vowel/consonant distinctions, and create the audio sprites (the json file is what will be sent to the front end to describe the location of each of the compiled audio files)
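For the slow-down mentioned in the clip extraction step, one option is ffmpeg's `atempo` audio filter. The sketch below assumes "slow down by 50%" means playing at half tempo (`atempo=0.5`), which is an interpretation rather than something confirmed by the pipeline code; the function name `slow_down_command` is hypothetical.

```python
def slow_down_command(source_audio, output_path, tempo=0.5):
    """Build an ffmpeg command that changes tempo without changing pitch.

    A tempo of 0.5 plays the clip at half speed. The 0.5 to 2.0 bounds are a
    conservative range that a single atempo filter accepts.
    """
    if not 0.5 <= tempo <= 2.0:
        raise ValueError("tempo must be between 0.5 and 2.0")
    return [
        "ffmpeg",
        "-y",                              # overwrite the output file if it exists
        "-i", source_audio,
        "-filter:a", f"atempo={tempo}",    # audio tempo filter
        output_path,
    ]
```

As with any ffmpeg call built this way, the list can be handed to `subprocess.run(cmd, check=True)`; whether half tempo is actually helpful for learners is a separate question from the vowel-lengthening idea mentioned earlier.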

external dependencies

beyond python, the current pipeline relies on three external programs:

-ffmpeg (add installation instructions)

-waudsprite --- this technically just uses nodejs to manipulate ffmpeg, so I'm hoping to eventually rewrite this processing step in pure python to remove the dependency (add installation instructions with preference for nvm)

-docker (add installation instructions)

