# You do not need to do any of this
This file walks you through the preprocessing steps for:
1. Cleaning the data,
2. Aligning transcripts to utterances at the phoneme level, and
3. Packaging the data for data science and machine learning uses.

**This is a slow, painful, and iterative process. If you're just interested in data science / machine learning, I recommend skipping this and downloading the end results.**

If you want to help out with programmatic data cleaning, then this is for you. There's a lot of work to be done here.

# Convert clipper-formatted data to mfa-formatted data
The goal here is to run Montreal Forced Aligner (MFA) over Clipper's clips. Clipper's files are FLAC files and word-level transcripts. MFA takes in 16 kHz WAV files and word-level transcripts, and it outputs phoneme-level transcripts. The `datapipes` module in `src/` can convert Clipper's files into MFA-compatible input.

First step: do a dry run to check for any errors in the Clipper files we have. Sometimes there's a filename mismatch, a missing character name, a missing transcript file, or similar. While the cell is running, you'll see the `In [ ]` on the left-hand side change to `In [*]`. When it's complete, you'll see it change to `In [1]`. The number in brackets tells you the order in which commands on this page were executed.

In [None]:
!(cd ../src; python -m datapipes --mfa-inputs \
    --input /home/celestia/data/clipper-samples `# clipper-formatted directory` \
    --output /home/celestia/data/mfa-inputs `# mfa-formatted directory` \
    --delta `# ignore files already processed` \
    --dry-run `# don't create any output files`)
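
To make the kinds of errors concrete, here is a minimal Python sketch of one such check: finding clips that have audio but no transcript, or the reverse. This is not what `datapipes` does internally. The `find_unpaired` helper, the `.txt` transcript extension, and the idea that a clip and its transcript share a file stem are all assumptions for illustration.

```python
from pathlib import Path


def find_unpaired(clip_dir, audio_ext=".flac", transcript_ext=".txt"):
    """Return (audio files without a transcript, transcripts without audio).

    Assumes (hypothetically) that a clip and its transcript share a stem,
    e.g. `clip.flac` and `clip.txt`, and sit in the same directory.
    """
    root = Path(clip_dir)
    # with_suffix("") strips only the final extension, so dots inside
    # timestamps such as `155.811699-158.691000` are left alone.
    audio = {p.with_suffix("") for p in root.rglob(f"*{audio_ext}")}
    transcripts = {p.with_suffix("") for p in root.rglob(f"*{transcript_ext}")}
    missing_transcript = sorted(str(p) + audio_ext for p in audio - transcripts)
    missing_audio = sorted(str(p) + transcript_ext for p in transcripts - audio)
    return missing_transcript, missing_audio
```

Calling `find_unpaired("/home/celestia/data/clipper-samples")` would return two sorted lists of paths to look at by hand, if the naming assumption holds for your copy of the data.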

If there are any errors, make sure to fix them and re-run the above command. Repeat until there are no errors, then run the next command to generate the mfa-formatted data. If you're running this on all of Clipper's data, this might take an hour to complete.

In [None]:
!(cd ../src; python -m datapipes --mfa-inputs \
    --input /home/celestia/data/clipper-samples \
    --output /home/celestia/data/mfa-inputs \
    --delta)

Finally, run montreal-forced-aligner with the following command to generate phoneme-level transcripts. Note that, due to quirks with IPython, this command won't produce intermediate output, so you won't be able to monitor progress here. If you're running this on all of Clipper's data, this command might take a few hours to complete. You can monitor progress by watching the `data/mfa-alignments` directory.

In [8]:
%%bash

function mfa() {
    # -p: create the output directory, and don't fail if it already exists
    mkdir -p "/home/celestia/data/mfa-alignments/$1"

    # `yes n` answers "no" whenever MFA asks whether to abort
    yes n | mfa_align -v `# continue even with an incomplete dictionary` \
        "/home/celestia/data/mfa-inputs/$1" `# input directory` \
        /opt/mfa/pronunciations_dicts/english.dict.txt \
        /opt/mfa/pretrained_models/english.zip \
        "/home/celestia/data/mfa-alignments/$1" `# output directory` \
        || true
}

export -f mfa

# Run up to 16 alignments in parallel, one character directory per job
ls /home/celestia/data/mfa-inputs | xargs -L1 -P16 bash -c 'mfa "$@"' _

Setting up corpus information...
Number of speakers in corpus: 1, average number of utterances per speaker: 19.0
Creating dictionary information...
Setting up training data...
There were words not found in the dictionary. Would you like to abort to fix them? (Y/N)Calculating MFCCs...
Setting up corpus information...
Number of speakers in corpus: 1, average number of utterances per speaker: 1.0
Creating dictionary information...
Setting up training data...
Calculating MFCCs...
Setting up corpus information...
Number of speakers in corpus: 1, average number of utterances per speaker: 2.0
Creating dictionary information...
Setting up training data...
Calculating MFCCs...
Setting up corpus information...
Number of speakers in corpus: 1, average number of utterances per speaker: 7.0
Creating dictionary information...
Setting up training data...
There were words not found in the dictionary. Would you like to abort to fix them? (Y/N)Calculating MFCCs...
Setting up corpus information...
Number


It's extremely likely that MFA failed on some inputs. There are three ways in which it can fail:
1. MFA found a word it didn't recognize and logged both the missing word and corresponding utterance.
2. MFA failed in some unexpected way while doing preprocessing for a character, and it borked its own configuration files.
3. MFA couldn't figure out how to align the transcript to an utterance.

For the first kind of failure, you can find an `oovs_found.txt` file in each of the directories within `mfa-alignments`. This file contains a list of words that could not be processed because they don't exist in the pronunciation dictionary. You can find the current pronunciation dictionary in `/opt/mfa/pronunciations_dicts/english.dict.txt`. If you end up adding the pronunciations of any missing words, make sure to post them to the thread. I can update the Docker image so everyone can benefit from it.

In [10]:
! head /home/celestia/data/mfa-alignments/Twilight-Sparkle/oovs_found.txt

'bout
'til
28.1
aaaaaaah
aaand
accelerative
actaully
adoreable
ahuizotl
aj
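
If you want to see which missing words matter most before editing the dictionary, one option is to merge the per-character lists. The sketch below is a hedged example, not part of `datapipes`: `collect_oovs` is a hypothetical helper, and it assumes each `oovs_found.txt` holds one word per line in UTF-8, as in the output above.

```python
from collections import Counter
from pathlib import Path


def collect_oovs(alignments_dir):
    """Count, for each out-of-vocabulary word, how many characters hit it."""
    counts = Counter()
    for oov_file in sorted(Path(alignments_dir).glob("*/oovs_found.txt")):
        text = oov_file.read_text(encoding="utf-8")
        # A set, so a word counts at most once per character directory.
        words = {line.strip() for line in text.splitlines() if line.strip()}
        counts.update(words)
    return counts
```

For example, `collect_oovs("/home/celestia/data/mfa-alignments").most_common(20)` lists the 20 words that are missing for the largest number of characters, which are good candidates to add first.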


For the second and third kinds of failure, you can find out which characters MFA failed to process by searching for empty directories in `mfa-alignments`.

In [9]:
! find /home/celestia/data/mfa-alignments -type d -empty

If the above command produces any output, it's very likely that MFA stochastically borked something during its own preprocessing stage. The easiest way to handle this is to remove its character-specific cache directory and try again.

The following script does exactly that for the case where MFA fails on Applejack's files. The most recent time, I needed to run this for Apple-Bloom, Applejack, Cadance, and Rainbow-Dash, but MFA's failures are pretty stochastic, so your list will likely differ.

In [None]:
%%bash

retry_character="Applejack"

rm -r "/home/celestia/Documents/MFA/$retry_character"

yes n | mfa_align -v \
        /home/celestia/data/mfa-inputs/$retry_character \
        /opt/mfa/pronunciations_dicts/english.dict.txt \
        /opt/mfa/pretrained_models/english.zip \
        /home/celestia/data/mfa-alignments/$retry_character
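
When several characters fail at once, editing `retry_character` by hand gets tedious. Below is a rough Python sketch that retries every character whose alignment directory is empty, reusing the paths and `mfa_align` arguments from the cell above. The helper names are made up for illustration, it only looks at top-level character directories (unlike `find -empty`), and it assumes MFA prompts at most once per run.

```python
import shutil
import subprocess
from pathlib import Path

DATA = Path("/home/celestia/data")
MFA_CACHE = Path("/home/celestia/Documents/MFA")  # MFA's per-character cache


def empty_character_dirs(alignments_dir):
    """Names of character directories that contain nothing at all."""
    return sorted(
        d.name for d in Path(alignments_dir).iterdir()
        if d.is_dir() and not any(d.iterdir())
    )


def retry_command(character):
    """Build the same mfa_align invocation as the bash cell above."""
    return [
        "mfa_align", "-v",
        str(DATA / "mfa-inputs" / character),
        "/opt/mfa/pronunciations_dicts/english.dict.txt",
        "/opt/mfa/pretrained_models/english.zip",
        str(DATA / "mfa-alignments" / character),
    ]


def retry_all(dry_run=True):
    """Clear the cache and re-run MFA for every empty character directory."""
    for character in empty_character_dirs(DATA / "mfa-alignments"):
        cmd = retry_command(character)
        if dry_run:
            print("would retry:", " ".join(cmd))
            continue
        shutil.rmtree(MFA_CACHE / character, ignore_errors=True)
        # Sends a single "n" to the abort prompt. `yes n` in the bash cell
        # answers every prompt; one prompt per run is an assumption here.
        subprocess.run(cmd, input="n\n", text=True, check=False)
```

`retry_all()` only prints what it would do. Pass `dry_run=False` once the list looks right.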

The last type of MFA failure is the only one that's complicated to handle. If you run the following command, you can see a list of transcripts that MFA failed to align.

In [13]:
%%bash

function get_textgrids() {
    (cd "$1"
    find -iname '*.textgrid' |
        sed 's/\.textgrid$//gI' |
        sort)
}

diff <(get_textgrids /home/celestia/data/mfa-inputs) <(get_textgrids /home/celestia/data/mfa-alignments)

32d31
< ./AK-Yearling/AK-Yearling-s4e4-546.878969-552.430138
847d845
< ./Apple-Bloom/Apple-Bloom-s3e4-155.811699-158.691000
883d880
< ./Apple-Bloom/Apple-Bloom-s3e4-54.025283-57.379283
1186d1182
< ./Apple-Bloom/Apple-Bloom-s4e9-117.432683-121.522084
1191d1186
< ./Apple-Bloom/Apple-Bloom-s4e9-132.837474-135.650000
1263d1257
< ./Apple-Bloom/Apple-Bloom-s5e17-262.677720-268.582433
1416,1417d1409
< ./Apple-Bloom/Apple-Bloom-s5e4-237.844548-239.244548
< ./Apple-Bloom/Apple-Bloom-s5e4-239.800274-242.380419
2278d2269
< ./Applejack/Applejack-EQG-LoE-993.158000-997.359141
2399d2389
< ./Applejack/Applejack-MLP-Movie-1111.150440-1119.251441
7165d7154
< ./Big-Macintosh/Big-Macintosh-s6e23-465.757000-468.551000
7611d7599
< ./Cadance/Cadance-s4e11-1021.438863-1023.570000
7877d7864
< ./Capper/Capper-MLP-Movie-4182.755498-4185.892560
8050d8036
< ./Celestia/Celestia-EQG-RR-2419.197799-2426.514495
8052d8037
< ./Celestia/Celestia-EQG-RR-2826.078665-2830.600109
8851d8835
< ./Cheerilee/Cheerilee-s1e18-1199

CalledProcessError: Command 'b'\nfunction get_textgrids() {\n    (cd "$1"\n    find -iname \'*.textgrid\' |\n        sed \'s/\\.textgrid$//gI\' |\n        sort)\n}\n\ndiff <(get_textgrids /home/celestia/data/mfa-inputs) <(get_textgrids /home/celestia/data/mfa-alignments)\n'' returned non-zero exit status 1.

(The `CalledProcessError` above is expected. `diff` exits with status 1 whenever the two listings differ, and `%%bash` reports any non-zero exit status as an error.)

If you didn't complete the above steps for handling whole-character issues, you'll notice that some characters have a huge number of utterances listed. If you did complete the above steps, none of the characters should have _that_ many failures. For me, the worst offender is Pinkie Pie with 77 failures, followed by Fluttershy and Twilight Sparkle both with around 35. If a character has a huge number of utterances listed, it's likely that MFA crashed at some point. You can read through its logs in `/home/celestia/Documents/MFA/` to try to figure out why. 
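
To get per-character failure counts without reading through the `diff` output, something like the following Python sketch may help. It mirrors the bash cell above: collect TextGrid paths under both directories, strip the extension, and count what is present in the inputs but missing from the alignments. The function names are hypothetical, and it assumes the first path component is the character name, as in the listing above.

```python
from collections import Counter
from pathlib import Path


def textgrid_stems(root):
    """Relative paths of all TextGrid files under root, minus the extension."""
    root = Path(root)
    return {
        p.relative_to(root).with_suffix("").as_posix()
        for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".textgrid"
    }


def failures_per_character(inputs_dir, alignments_dir):
    """Count input utterances that have no aligned TextGrid, per character."""
    missing = textgrid_stems(inputs_dir) - textgrid_stems(alignments_dir)
    # The first path component is the character directory.
    return Counter(stem.split("/", 1)[0] for stem in missing)
```

`failures_per_character("/home/celestia/data/mfa-inputs", "/home/celestia/data/mfa-alignments").most_common()` then lists characters from most to fewest failures.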

You can try playing some of the listed files to figure out why MFA might be failing on them. I've found that it's often because either (1) the character is speaking in a very excited or abnormal way, (2) the clip is noisy or muffled, or (3) the utterance contains a lot of out-of-dictionary words.

Eventually, we'll want to find a way to make use of these utterances to generate more realistic speech in niche cases, but for now we can ignore them.

# Packaging data for TensorFlow


In [None]:
!(cd ../src; python -m datapipes --audio-tgz \
    --input-audio /home/celestia/data/clipper-samples `# clipper-formatted directory` \
    --input-alignments /home/celestia/data/mfa-alignments `# MFA output (alignments) directory` \
    --output /home/celestia/data/audio-tgz `# output per-character tar.gz files here` \
    --audio-format 'wav' \
    --sampling-rate 48000 \
    --dry-run `# don't create any output files` \
    --verbose)

In [14]:
!(cd ../src; python -m datapipes --audio-tgz \
    --input-audio /home/celestia/data/clipper-samples \
    --input-alignments /home/celestia/data/mfa-alignments/Twilight-Sparkle \
    --output /home/celestia/data/audio-tgz \
    --audio-format 'wav' \
    --sampling-rate 48000 \
    --verbose)

reading utterances from  /home/celestia/data/clipper-samples
reading transcipts and labels from labels.text files
writing output files to /home/celestia/data/audio-tgz
Done


In [None]:
# test the archives
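
One possible starting point for this cell is a basic integrity check: open every archive in the output directory and count the files inside. This is only a sketch under assumptions. `check_archives` is a hypothetical helper, the `.tar.gz`/`.tgz` file names are guessed from the `--audio-tgz` option, and nothing here validates the audio or labels inside the archives.

```python
import tarfile
import zlib
from pathlib import Path


def check_archives(archive_dir):
    """Map each archive name to its file count, or to the error it raised."""
    root = Path(archive_dir)
    archives = sorted(root.glob("*.tar.gz")) + sorted(root.glob("*.tgz"))
    report = {}
    for path in archives:
        try:
            with tarfile.open(path, "r:gz") as tar:
                # Walking the members reads through the gzip stream, which
                # tends to surface truncated or corrupted archives.
                report[path.name] = sum(1 for m in tar if m.isfile())
        except (tarfile.TarError, OSError, EOFError, zlib.error) as err:
            report[path.name] = err
    return report
```

Running `check_archives("/home/celestia/data/audio-tgz")` and looking for entries that are exceptions, or that have a suspiciously low file count, gives a quick first pass before any deeper checks.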