# VoxCommunis data processing pipeline

This is a script of running MFA on recordings from Common Voice corpus.

0. [Step 0: Setups](#step-0-setups)
1. [Step 1: Remap speakers](#step-1-remap-the-validated-speakers)
2. [Step 2: Create TextGrid and .wav files](#step-2-create-textgrid-files-and-wav-files-based-on-the-mp3-recordings-from-common-voice)
3. [Step 3: Prepare the lexicon](#step-3-prepare-the-lexicon)
4. [Step 4: G2P grapheme-to-phoneme](#step-4-g2p-grapheme-to-phoneme-epitran-or-xpf)
5. [Step 5: Validation](#step-5-train-the-acoustic-model)
6. [Step 6: Run MFA](#step-6-train-the-acoustic-model-and-forced-align)

This script was created by Miao Zhang (miao.zhang@uzh.ch), 22.12.2023

This script was modified by Miao Zhang, 07.02.2024 (Revalidation added)

## Step 0. Setups
Import packages and setup file directories (for both the scripts and data)

In [1]:
# Packages
# Import os and subprocess to run terminal commands, pandas and re to process the lexicon, and shutil to delete folders
import os, subprocess, shutil, re
import pandas as pd

Set the paths and directories of data and scripts to use.

In [2]:
# Set it to 0 if you use a Windows machine.
if_mac = 1 

# Specify the G2P engine. If 0, then epitran
if_xpf = 0

###################################### Directories #########################################

# This is the directory where your data downloaded from Common Voice should be saved. This is the root directory where data from each language is saved in individual folders.
# NO (BACK)SLASH at the end!!!
commonVoice_dir = '/Users/miaozhang/Research/CorpusPhon/CorpusData/CommonVoice' 

# To use XPF as the G2P engine to process lexicon, you will need to download the XPF data from: https://github.com/CohenPr-XPF/XPF/tree/master/Data and save them on your computer.
# Specify the directory where your XPF data is saved.
# NO (BACK)SLASH at the end!!!
xpf_dir = '/Users/miaozhang/Research/CorpusPhon/CorpusData/G2P/XPF' 

######################### Language name/code and Common Voice version ##############################

# Language-related variable names
lang_code = 'yo' # the Common Voice code of the language (unfortunately, Common Voice mixes the use of iso 639-3 and iso 639-1 codes (they use bcp47 code). This code is also used in XPF).

# If you are using XPF, ...
# Please refer to VoxCommunics_info.csv to get the name of the language in XPF
# You can ignore this if you are not going to use XPF.
lang_name = 'Yoruba'

# If you are using epitran, ...
# Please refer to VoxCommunics_info.csv to get the processing code of the language in epitran
epi_code = 'yor-Latn' 

# The version of the data in Common Voice
# Only numbers!!!
cv_version = '16' 

# Specify if the subversion of a corpus is used. The default is 0
if_subversion = 0 
# If if_subversion == 1, what suffix you would use?:
subversion = '_' + 'sub3'

##################################################################

# This is where the acoustic model will be saved after MFA training is done:
mfa_mod_path = '/Users/miaozhang/Documents/MFA/pretrained_models/acoustic'

##################################################################

# Get the naming schema. (Don't change this part)
naming_schema = pd.read_csv('vxc_naming_schema.csv')
acs_mod_name = naming_schema['Python_code'][0]
spkr_file_name = naming_schema['Python_code'][4]
word_file_name = naming_schema['Python_code'][2]
dict_file_name = naming_schema['Python_code'][3]

Set up the paths and directories.

In [3]:
if if_mac == 1:
    path_sep = '/'
    # this is the default directory where Praat is installed on a Mac.
    praat_path = '/Applications/Praat.app/Contents/MacOS/Praat' 
else:
    path_sep = '\\'
    # the directory of Praat installed on Windows.
    praat_path = 'C:\Program Files\Praat.exe' 

if if_xpf == 1:
    g2p = 'xpf'
else:
    g2p = 'epi'

language_dir = lang_code + '_v' + cv_version

# For step 1: speaker remapping
# Get the full paths
remap_spkr_path = []
remap_spkr_path.append('vxc_remap_spkrs.py') # where the scipt of speaker remapping is
remap_spkr_path.append(commonVoice_dir + path_sep + language_dir + path_sep + 'validated.tsv') # where the validated utterance log of common voice is
remap_spkr_path.append(commonVoice_dir + path_sep + language_dir + path_sep + eval(spkr_file_name)) # where the validated speaker log will be saved


# For step 2: Create .wav and .TextGrid files
# Get the path of the praat script
create_txgdwav_script = 'createTextGridsWav.praat' # this is where the praat script was saved.

# Set the arguments for the praat script
praat_args = []
praat_args.append(commonVoice_dir + path_sep + language_dir) #this is the directory of the language. NO path_sep!!!
praat_args.append(commonVoice_dir + path_sep + language_dir + path_sep + 'validated') #this is the folder name of validated files. NO path_sep at the end!!!
praat_args.append(commonVoice_dir + path_sep + language_dir + path_sep + eval(spkr_file_name)) #this is remapped speaker file

# For step 3: prepare the lexicon and pronunciation dictionary
# Remember the file is saved in this variable:
validated_log = remap_spkr_path[1]
wordlist_path = commonVoice_dir + path_sep + language_dir + path_sep + eval(word_file_name)


# For step 4: G2P
if if_xpf == 1:
    xpf_translater_path = 'xpf_translate04.py'
    rule_file_path = xpf_dir + path_sep + lang_code + '_' + lang_name + path_sep + lang_code + '.rules'
    verify_file_path = xpf_dir + path_sep + lang_code + '_' + lang_name + path_sep + lang_code + '.verify.csv'
else:
    epitran_translater_path = 'epi_run.py'

dict_file_path = commonVoice_dir + path_sep + language_dir + path_sep + eval(dict_file_name)


# For step 5: running MFA
# Validate the corpus
validated_recs_path = praat_args[1] 
if if_subversion == 0:
    acs_mod_path = mfa_mod_path + path_sep + eval(acs_mod_name)
else:
    acs_mod_name = re.sub('.zip', subversion + '.zip', acs_mod_name)
    acs_mod_path = mfa_mod_path + path_sep + eval(acs_mod_name)
output_path = commonVoice_dir + path_sep + language_dir + path_sep + 'output/'

Print out the paths and directories.

In [4]:
# Print all the file and folder paths
print('Step 0:\t')
print(f'The name of the language:\t"{lang_name}" (as in the XPF corpus)')
print(f'The Common Voice code of the language:\t"{lang_code}"')
print(f'The version of the Common Voice data:\t"{cv_version}"')
print('\n')

print('Step 1:\tRemapping speakers')
print('The script of remapping speakers:\t' + remap_spkr_path[0])
print('Validated log:\t' + remap_spkr_path[1])
print('Validated log with speakers remapped (to be created):\t' + remap_spkr_path[2])
print('\n')

print('Step 2:\tCreating validated .wav and .TextGrid')
print('Praat:\t' + praat_path)
print('The scipt of creating .wav and .TextGrid files for validated recs:\t' + create_txgdwav_script)
print('\n')

print('Step 3:\tPreparing the lexicon')
print('The directory of Common Voice recordings of the language:\t' + praat_args[0])
print('The folder where validated .wav/.TextGrid files will be saved:\t' + praat_args[1])
print('Validated log with speakers remapped:\t' + praat_args[2])
print('Wordlist file:\t' + wordlist_path)
print('\n')

print('Step 4:\tG2P')
if if_xpf == 1:
    print('The XPF G2P script:\t' + xpf_translater_path)
    print('The XPF rule file:\t' + rule_file_path)
    print('The XPF verification file:\t' + verify_file_path)
else:
    print('The script to run epitran:\t' + epitran_translater_path)

print('The lexicon file (to be created):\t' + dict_file_path)
print('\n')

print('Step 5:\tMFA')
print('(Again) The validated recordings:\t' + validated_recs_path) # the validated recordings
print('The pronunciation dictionary:\t' + dict_file_path) # the lexicon
print('Where the acoustic model will be saved:\t' + acs_mod_path) # where the acoustic model will be saved
print('Where to put the output of forced alignment:\t' + output_path) # where the outputs will be saved

Step 0:	
The name of the language:	"Yoruba" (as in the XPF corpus)
The Common Voice code of the language:	"yo"
The version of the Common Voice data:	"16"


Step 1:	Remapping speakers
The script of remapping speakers:	vxc_remap_spkrs.py
Validated log:	/Users/miaozhang/Research/CorpusPhon/CorpusData/CommonVoice/yo_v16/validated.tsv
Validated log with speakers remapped (to be created):	/Users/miaozhang/Research/CorpusPhon/CorpusData/CommonVoice/yo_v16/yo_epi_spkr16.tsv


Step 2:	Creating validated .wav and .TextGrid
Praat:	/Applications/Praat.app/Contents/MacOS/Praat
The scipt of creating .wav and .TextGrid files for validated recs:	createTextGridsWav.praat


Step 3:	Preparing the lexicon
The directory of Common Voice recordings of the language:	/Users/miaozhang/Research/CorpusPhon/CorpusData/CommonVoice/yo_v16
The folder where validated .wav/.TextGrid files will be saved:	/Users/miaozhang/Research/CorpusPhon/CorpusData/CommonVoice/yo_v16/validated
Validated log with speakers remapped:	/U

## Step 1. Remap the validated speakers
Get speaker IDs to put on TextGrids for speaker adaptation

In [5]:
# Run the next line to remap speakers
if os.path.exists(commonVoice_dir + path_sep + language_dir + path_sep + 'validated_spkr.tsv'):
    os.remove(commonVoice_dir + path_sep + language_dir + path_sep + 'validated_spkr.tsv')
os.system(f'python {remap_spkr_path[0]} {remap_spkr_path[1]} {remap_spkr_path[2]}')

0

## Step 2. Create TextGrid files and .wav files based on the .mp3 recordings from Common Voice

Now we can create TextGrid files and .wav files

In [6]:
# The path of the 'validated' folder to contain validated recordings from Common Voice. If there is already a folder with the same name, delete it
if os.path.exists(commonVoice_dir + path_sep + language_dir + path_sep + 'validated'):
    shutil.rmtree(commonVoice_dir + path_sep + language_dir + path_sep + 'validated')
# Make the folder:
os.makedirs(commonVoice_dir + path_sep + language_dir + path_sep + 'validated')
# Run the praat script:
subprocess.run([praat_path, '--run', create_txgdwav_script, praat_args[0], praat_args[1], praat_args[2]])

CompletedProcess(args=['/Applications/Praat.app/Contents/MacOS/Praat', '--run', 'createTextGridsWav.praat', '/Users/miaozhang/Research/CorpusPhon/CorpusData/CommonVoice/yo_v16', '/Users/miaozhang/Research/CorpusPhon/CorpusData/CommonVoice/yo_v16/validated', '/Users/miaozhang/Research/CorpusPhon/CorpusData/CommonVoice/yo_v16/yo_epi_spkr16.tsv'], returncode=0)

## Step 3. Prepare the lexicon
Extract transcripts from validated.tsv and get each word on its own line

In [7]:
# Read in the validated.tsv file and get the orthographical transcriptions of the utterances
words_col = pd.read_csv(validated_log, sep='\t')['sentence'] # get the transcribed sentences
sentences = words_col.astype('string').tolist() # turn the transcription into a list of sentences

sentences_processed = []
for sentence in sentences:
    sentence = re.sub("[,|.|!|?|\"|“|„|–|-|’|‘|-]+", " ", sentence) # remove a lot of non-word symbols
    sentence = re.sub("[[:punct:]]", " ", sentence) # replace non-word characters with spaces
    sentence = re.sub("[ ]+", " ", sentence) # replace multiple continuous white spaces with a single space
    sentence = re.sub(" ", "\n", sentence) # replace space with new line
    sentence = sentence.lower()
    sentences_processed.append(sentence)


words = "".join(sentences_processed).split("\n") # make a string of word tokens
words = sorted(set(words)) # sort and get word types
words = list(filter(None, words)) # remove empty strings
print(words)

# Save the word list as a .txt file
if os.path.exists(wordlist_path):
    os.remove(wordlist_path)
    
with open(wordlist_path,'w') as word_file:
	for word in words:
		word_file.write(word + "\n")


["''olóde''", "'eré", "'sọtítóbirẹ̀'", "'tó", "'èjìogbè'", "'ṣé", 'a', 'aago', 'abàmì', 'abánikẹ́dùn', 'abániwáṣẹ́', 'abeokuta', 'abilékọ', 'abiyamọ', 'abílékọ', 'abímbọ́lá', 'abísádé', 'abíọ́dún', 'abíọ́lá', 'abíọ́láìlú', 'abo', 'aburu', 'aburú', 'abúlé', 'abínibí', 'abíọ́lá', 'abúle', 'abúlé', 'abẹ', 'abẹnugan', 'abẹtẹlẹ', 'abẹ́lé', 'abẹ́lé', 'abẹ́lẹ́', 'abẹ́òkuta', 'abẹ́òkutailé', 'abẹ́òkutaìjàmbá', 'abẹ́òkutaọpọ̀lọpọ̀', 'abẹ́òkúta', 'abẹ́rẹ́', 'abẹ́òkúta', 'abẹ́òkútaọ̀jọ̀gbọ́n', 'abọ́ìdùnnú', 'adarí', 'adarí', 'adájọ', 'adájọ́', 'adámọlẹ́kun', 'adelé', 'ademúrewá', 'aderẹ̀mi', 'adesina', 'adé', 'adéagbo', 'adébánjo', 'adébárá', 'adébáyọ̀', 'adébóyè', 'adébọ̀wáléowó', 'adébọ́lá', 'adédibú', 'adédigba', 'adédoyin', 'adédọja', 'adégbàyí', 'adégbìtẹ̀', 'adégbìtẹ́', 'adégborúwà', 'adégbuyìṣèyí', 'adégẹyè', 'adéjọkẹ́', 'adékùróyè', 'adékúnlé', 'adékọ́lá', 'adélabú', 'adélékè', 'adéloyè', 'adémúrew

  sentence = re.sub("[[:punct:]]", " ", sentence) # replace non-word characters with spaces


## Step 4. G2P grapheme-to-phoneme (epitran or XPF)
There three files you need to proceed.
1. A G2P rule file
2. A veryfication file
3. The translater python script
Make sure you have downloaded the G2P rule files and the translate.py file from XPF corpus and know where they are saved!

In [8]:
# Get the dictionary using XPF
# -l specifies the rule file
# -c specifies the verification file
# -r specifies the file to be translated
if g2p == 'xpf':
    g2p_cmd = ["python", xpf_translater_path, "-l", rule_file_path, "-c", verify_file_path, "-r", wordlist_path] # XPF translating command that will be sent to subprocess.run() to execute.

    if os.path.exists(dict_file_path):
        os.remove(dict_file_path)

    with open(dict_file_path,'w') as dict_file:# this creates the file
        subprocess.run(g2p_cmd, stdout = dict_file) # stdout = ... means to send the output to the file (so you have to open this file first as above)

    # Note: you have to run the command above through subprocess. os.system() cannot save the output to a Python variable.

    # This is to get rid of all the '@' in the lexicon (if there is any). @ means that the G2P rule for some letters is not documented in the rule file. So the G2P for that letter fails.
    with open(dict_file_path, "r") as dict_file:
        dict = dict_file.read().split("\n")

    with open(dict_file_path, 'w') as dict_file:
        for i in dict:
            if '@' not in i: # Get rid of all words the contained at least one '@' in it
                dict_file.write(i + "\n")
# Or using Epitran
else:
    g2p_cmd = ["python", epitran_translater_path, wordlist_path, dict_file_path, epi_code]
    subprocess.run(g2p_cmd)

["'", "'", 'o', 'l', 'o', '́', 'd', 'e', "'", "'"] 
 ['o', 'l', 'o', 'd', 'e']
["'", 'e', 'r', 'e', '́'] 
 ['e', 'r', 'e']
["'", 's', 'ɔ', 't', 'i', '́', 't', 'o', '́', 'b', 'i', 'r', 'ɛ', '̀', "'"] 
 ['s', 'ɔ', 't', 'i', 't', 'o', 'b', 'i', 'r', 'ɛ']
["'", 't', 'o', '́'] 
 ['t', 'o']
["'", 'e', '̀', 'd͡ʒ', 'i', '̀', 'o', 'ɡ͡b', 'e', '̀', "'"] 
 ['e', 'd͡ʒ', 'i', 'o', 'ɡ͡b', 'e']
["'", 'ʃ', 'e', '́'] 
 ['ʃ', 'e']
['a', 'b', 'a', '̀', 'm', 'i', '̀'] 
 ['a', 'b', 'a', 'm', 'i']
['a', 'b', 'a', '́', 'n', 'i', 'k', 'ɛ', '́', 'd', 'u', '̀', 'n'] 
 ['a', 'b', 'a', 'n', 'i', 'k', 'ɛ', 'd', 'u', 'n']
['a', 'b', 'a', '́', 'n', 'i', 'w', 'a', '́', 'ʃ', 'ɛ', '́'] 
 ['a', 'b', 'a', 'n', 'i', 'w', 'a', 'ʃ', 'ɛ']
['a', 'b', 'i', 'l', 'e', '́', 'k', 'ɔ'] 
 ['a', 'b', 'i', 'l', 'e', 'k', 'ɔ']
['a', 'b', 'i', '́', 'l', 'e', '́', 'k', 'ɔ'] 
 ['a', 'b', 'i', 'l', 'e', 'k', 'ɔ']
['a', 'b', 'i', '́', 'm', 'b', 'ɔ', '́', 'l', 'a', '́'] 
 ['a', 'b', 'i', 'm', 'b', 'ɔ', 'l', 'a']
['a', 'b', 'i', '́', 's', 'a'

## Step 5. Validate the corpus

First, you need to activate the MFA environment in the terminal.
1. Press ctrl+` to open Terminal in VS Code.
2. Run 'conda activate aligner' until you see '(aligner)' at the beginning of the line in Terminal.
3. When you finished using MFA (both training and aligning), run 'conda deactivate' to shut down the MFA environment.

In [None]:
# Create a folder of MFA in document
# You DON'T need to run this if you already have an MFA folder in your Documents folder (What would this be like on Windows?)
!mfa model download acostic english.zip

### Step 5. Validate

To validate the corpus, run this line in terminal: 

        mfa validate {wherever your validated recordings are} {wherever your lexicon file is} --ignore_acoustics --clean

You can copy the command lines from below.
Notebook can't handle ```mfa``` commands. The mfa commands above can only run in Terminal.

In [9]:
cmd_validate = f'mfa validate {validated_recs_path} {dict_file_path} --ignore_acoustics --clean'
cmd_train = f'mfa train --clean {validated_recs_path} {dict_file_path} {acs_mod_path}'
cmd_align = f'mfa align --clean {validated_recs_path} {dict_file_path} {acs_mod_path} {output_path}'
print('To validate, copy:\t' + cmd_validate)

To validate, copy:	mfa validate /Users/miaozhang/Research/CorpusPhon/CorpusData/CommonVoice/yo_v16/validated /Users/miaozhang/Research/CorpusPhon/CorpusData/CommonVoice/yo_v16/yo_epi_lexicon16.txt --ignore_acoustics --clean


Put the oov words back into the word list and rerun G2P.

In [10]:
# The folder of the OOV word files:
mfa_oov_path = '/Users/miaozhang/Documents/MFA/validated'
# The oov file:
oov_file = 'oovs_found_' + eval(dict_file_name)

oov_path = mfa_oov_path + path_sep + oov_file
with open(oov_path, 'r') as oov_file:
        with open(wordlist_path, 'a') as wordlist:
            shutil.copyfileobj(oov_file, wordlist)

# And then rerun Step 4. G2P to process the oov words.
if g2p == 'xpf':
    g2p_cmd = ["python", xpf_translater_path, "-l", rule_file_path, "-c", verify_file_path, "-r", wordlist_path]

    if os.path.exists(dict_file_path):
        os.remove(dict_file_path)

    with open(dict_file_path,'w') as dict_file:
        subprocess.run(g2p_cmd, stdout = dict_file) 

    with open(dict_file_path, "r") as dict_file:
        dict = dict_file.read().split("\n")

    with open(dict_file_path, 'w') as dict_file:
        for i in dict:
            if '@' not in i: 
                dict_file.write(i + "\n")

else:
    g2p_cmd = ["python", epitran_translater_path, wordlist_path, dict_file_path, epi_code]
    subprocess.run(g2p_cmd)

["'", "'", 'o', 'l', 'o', '́', 'd', 'e', "'", "'"] 
 ['o', 'l', 'o', 'd', 'e']
["'", 'e', 'r', 'e', '́'] 
 ['e', 'r', 'e']
["'", 's', 'ɔ', 't', 'i', '́', 't', 'o', '́', 'b', 'i', 'r', 'ɛ', '̀', "'"] 
 ['s', 'ɔ', 't', 'i', 't', 'o', 'b', 'i', 'r', 'ɛ']
["'", 't', 'o', '́'] 
 ['t', 'o']
["'", 'e', '̀', 'd͡ʒ', 'i', '̀', 'o', 'ɡ͡b', 'e', '̀', "'"] 
 ['e', 'd͡ʒ', 'i', 'o', 'ɡ͡b', 'e']
["'", 'ʃ', 'e', '́'] 
 ['ʃ', 'e']
['a', 'b', 'a', '̀', 'm', 'i', '̀'] 
 ['a', 'b', 'a', 'm', 'i']
['a', 'b', 'a', '́', 'n', 'i', 'k', 'ɛ', '́', 'd', 'u', '̀', 'n'] 
 ['a', 'b', 'a', 'n', 'i', 'k', 'ɛ', 'd', 'u', 'n']
['a', 'b', 'a', '́', 'n', 'i', 'w', 'a', '́', 'ʃ', 'ɛ', '́'] 
 ['a', 'b', 'a', 'n', 'i', 'w', 'a', 'ʃ', 'ɛ']
['a', 'b', 'i', 'l', 'e', '́', 'k', 'ɔ'] 
 ['a', 'b', 'i', 'l', 'e', 'k', 'ɔ']
['a', 'b', 'i', '́', 'l', 'e', '́', 'k', 'ɔ'] 
 ['a', 'b', 'i', 'l', 'e', 'k', 'ɔ']
['a', 'b', 'i', '́', 'm', 'b', 'ɔ', '́', 'l', 'a', '́'] 
 ['a', 'b', 'i', 'm', 'b', 'ɔ', 'l', 'a']
['a', 'b', 'i', '́', 's', 'a'

In [173]:
# To revalidate the corpus, copy and paste the command below.
print('To validate, copy:\t' + cmd_validate)

To validate, copy:	mfa validate /Users/miaozhang/Research/CorpusPhon/CorpusData/CommonVoice/yo_v16/validated /Users/miaozhang/Research/CorpusPhon/CorpusData/CommonVoice/yo_v16/yo_epi_lexicon16.txt --ignore_acoustics --clean


### Step 6. Train the acoustic model and forced align.

### Step 6.1. Then to train the acoustic model, run the next line:

        mfa train --clean {where your validated recordings are} {where your lexicon file is} {where your model will be saved}

### Step 6.2. The final step: forced align the recordings:

        mfa train --clean {where your validated recordings are} {where your lexicon file is} {where your output will be saved}

You can copy the command lines from below.
Notebook can't handle ```mfa``` commands. The mfa commands above can only run in Terminal.

In [None]:
print('To train, copy: \t' + cmd_train)
print("\n")
print('To align, copy: \t' + cmd_align)