# VoxCommunis data processing pipeline

This script runs the Montreal Forced Aligner (MFA) on recordings from the Common Voice corpus.

To run this pipeline, you need:

1. ffmpeg (a command-line tool for converting audio and video files)
2. Python modules: epitran, pydub, praatio
3. The XPF corpus data

The pipeline takes these steps to process data:

1. [Step 0: Setups](#step-0-setups)
2. [Step 1: Remap speakers](#step-1-remap-the-validated-speakers)
3. [Step 2: Create TextGrid and .wav files](#step-2-create-textgrid-files-and-wav-files-based-on-the-mp3-recordings-from-common-voice)
4. [Step 3: Prepare the lexicon](#step-3-prepare-the-lexicon)
5. [Step 4: G2P grapheme-to-phoneme](#step-4-g2p-grapheme-to-phoneme-epitran-or-xpf)
6. [Step 5: Validation](#step-5-validate-the-corpus)
7. [Step 6: Run MFA](#step-6-train-the-acoustic-model-and-forced-align)
8. [Finale](#finale)

This script was created by Miao Zhang (miao.zhang@uzh.ch), 22.12.2023

This script was modified by Miao Zhang, 07.02.2024 (Revalidation added)

Modified on 16.02.2024: added automatic log.

## Step 0. Setups
Import packages and set up the file directories (for both the scripts and the data).

In [None]:
# Import modules
import os, subprocess, shutil, re, csv
import pandas as pd
import numpy as np

Set the paths and directories of data and scripts to use.

_IMPORTANT_: the folder of corpus data you downloaded from Common Voice should be named {lang_code}_v{version_number}.
- For example: the folder for the 16th version of Dhivehi should be named: dv_v16.
- Another example: the folder for the 15th version of Upper Sorbian should be: hsb_v15.
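The naming convention can also be checked programmatically. The sketch below is an illustration only: the helper names `corpus_folder_name` and `parse_corpus_folder` are hypothetical and are not part of the pipeline.

```python
import re

def corpus_folder_name(lang_code, cv_version):
    """Build the expected folder name, e.g. ('dv', '16') -> 'dv_v16'."""
    return f"{lang_code}_v{cv_version}"

def parse_corpus_folder(folder_name):
    """Split a folder name into (lang_code, cv_version); raise ValueError if it does not fit."""
    match = re.fullmatch(r"(?P<code>[A-Za-z-]+)_v(?P<version>\d+)", folder_name)
    if match is None:
        raise ValueError(f"Unexpected folder name: {folder_name!r}")
    return match["code"], match["version"]

print(corpus_folder_name("hsb", "15"))
print(parse_corpus_folder("dv_v16"))
```

A check like this could be run before Step 0 to catch a misnamed folder early.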

In [138]:
###################################### Directories ################################################

# This is the directory where your data downloaded from Common Voice should be saved. This is the root directory where data from each language should be saved in individual folders.
# NO (BACK)SLASH at the end!!!
commonVoice_dir = '/Users/miaozhang/Research/CorpusPhon/CorpusData/CommonVoice' 

# To use XPF as the G2P engine to process lexicon, you will need to download the XPF data from: https://github.com/CohenPr-XPF/XPF/tree/master/Data and save them on your computer.
# Specify the directory where your XPF data is saved.
# NO (BACK)SLASH at the end!!!
xpf_dir = '/Users/miaozhang/Research/CorpusPhon/CorpusData/G2P/XPF' 

######################### Language name/code and Common Voice version ##############################

# Language-related variable names
# The Common Voice code of the language. Unfortunately, Common Voice mixes ISO 639-1 and ISO 639-3 codes (it uses BCP 47 codes). This code is also used in XPF.
# The code should match the code used in the name of the folder you downloaded from Common Voice.
lang_code = 'mn' 

# The version of the data in Common Voice
# Only numbers!!!
cv_version = '16' 

# Set it to 0 if you use a Windows machine.
if_mac = 1 

# Specify the G2P engine: 1 for XPF, 0 for Epitran
if_xpf = 0

# Get the processing code for the language
if if_xpf == 1:
    g2p = 'xpf'
    # If you are using XPF, get the name of the language in XPF corpus from the VoxCommunis_Info.csv 
    # (You don't need to change this part)
    with open("VoxCommunis_Info.csv", 'r') as f:
        reader = csv.DictReader(f)
        lang_row = [row for row in reader if row['code_cv'] == lang_code][0]
    lang_name = lang_row['name_xpf'].replace(' ', '')
else:
    g2p = 'epi'
    # If you are using Epitran,
    # please refer to VoxCommunis_Info.csv to get the processing code of the language in Epitran.
    # !!!Do this manually, since the Epitran code can differ depending on the orthography type!!!
    epi_code = 'mon-Cyrl'

# Specify if the subversion of a corpus is used. The default is 0
if_subversion = 0 
# If if_subversion == 1, which suffix should be used?
# Ignore this part, if you don't have a subversion of the corpus you are using.
subversion = '_' + 'sub3'

###################################################################################################

# The folder of the OOV word files (NO (BACK)SLASH at the end!!!):
mfa_oov_path = '/Users/miaozhang/Documents/MFA/validated'

# This is where the acoustic model will be saved after MFA training is done (NO (BACK)SLASH at the end!!!):
mfa_mod_path = '/Users/miaozhang/Documents/MFA/pretrained_models/acoustic'

# This is where files that will be uploaded to the OSF repo will be saved after the processing is finished (NO (BACK)SLASH at the end!!!):
osf_path = '/Users/miaozhang/Documents/VoxCommunis_OSF'

# This is where VoxCommunis_Info.csv (the tracking file) is:
cv_tracking_file = 'VoxCommunis_Info.csv'


####################################################################################################################################
####################################################################################################################################

###################################### DO NOT CHANGE ANYTHING IN THIS BLOCK FROM BELOW #############################################

####################################################################################################################################
####################################################################################################################################


if if_mac == 1:
    path_sep = '/'
    # this is the default directory where Praat is installed on a Mac.
    praat_path = '/Applications/Praat.app/Contents/MacOS/Praat' 
else:
    path_sep = '\\'
    # the directory of Praat installed on Windows.
    praat_path = r'C:\Program Files\Praat.exe'  # raw string, so the backslashes are not read as escape sequences
    
language_dir = lang_code + '_v' + cv_version

# Get the naming schema. (Don't change this part)
naming_schema = pd.read_csv('vxc_naming_schema.csv')
acs_mod_name = naming_schema['Python_code'][0]
textgrid_folder_name = naming_schema['Python_code'][1]
spkr_file_name = naming_schema['Python_code'][4]
word_file_name = naming_schema['Python_code'][2]
dict_file_name = naming_schema['Python_code'][3]


# For step 1 and 2: speaker remapping and creating a textgrid for each clip
# Get the full paths
remap_spkr_path = []
remap_spkr_path.append(commonVoice_dir + path_sep + language_dir + path_sep + 'invalidated.tsv') # where the invalidated utterance log of common voice is
remap_spkr_path.append(commonVoice_dir + path_sep + language_dir + path_sep + 'validated.tsv') # where the validated utterance log of common voice is
remap_spkr_path.append(commonVoice_dir + path_sep + language_dir + path_sep + eval(spkr_file_name)) # where the validated speaker log will be saved

# For step 3: prepare the lexicon and pronunciation dictionary
# Remember the file is saved in this variable:
validated_log = remap_spkr_path[1]
wordlist_path = commonVoice_dir + path_sep + language_dir + path_sep + eval(word_file_name)
 
# For step 4: G2P
if if_xpf == 1:
    xpf_translater_path = 'xpf_translate04.py'
    rule_file_path = xpf_dir + path_sep + lang_code + '_' + lang_name + path_sep + lang_code + '.rules'
    verify_file_path = xpf_dir + path_sep + lang_code + '_' + lang_name + path_sep + lang_code + '.verify.csv'
else:
    epitran_translater_path = 'epi_run.py'

dict_file_path = commonVoice_dir + path_sep + language_dir + path_sep + eval(dict_file_name)

# For step 6: running MFA
# Validate the corpus
validated_recs_path = commonVoice_dir + path_sep + language_dir + path_sep + 'validated'
if if_subversion == 0:
    acs_mod_path = mfa_mod_path + path_sep + eval(acs_mod_name)
else:
    acs_mod_name = re.sub(r'\.zip', subversion + '.zip', acs_mod_name)  # escape the dot so that only a literal '.zip' matches
    acs_mod_path = mfa_mod_path + path_sep + eval(acs_mod_name)
output_path = commonVoice_dir + path_sep + language_dir + path_sep + 'output'

# Finale:
txtgrds_path = osf_path + path_sep + 'textgrids' + path_sep + eval(textgrid_folder_name)[:-4]

## Step 1. Remap the validated speakers
Get speaker IDs to put on TextGrids for speaker adaptation.

In [139]:
# Load the invalidated data
invalidated = pd.read_csv(remap_spkr_path[0], sep = '\t', quoting=csv.QUOTE_NONE, low_memory = False,
                          dtype = {
                              'client_id': 'str',
                              'path': 'str',
                              'sentence': 'str',
                              'up_votes': 'int16',
                              'down_votes': 'int16',
                              'age': 'str',
                              'gender': 'str',
                              'accents': 'str',
                              'variant': 'str',
                              'locale': 'str',
                              'segment': 'str'
                          })
validated = pd.read_csv(remap_spkr_path[1], sep = '\t', quoting=csv.QUOTE_NONE, low_memory = False,
                        dtype = {
                              'client_id': 'str',
                              'path': 'str',
                              'sentence': 'str',
                              'up_votes': 'int16',
                              'down_votes': 'int16',
                              'age': 'str',
                              'gender': 'str',
                              'accents': 'str',
                              'variant': 'str',
                              'locale': 'str',
                              'segment': 'str'
                          })
invalidated['validation'] = 'invalidated'
validated['validation'] = 'validated'
whole = pd.concat([validated, invalidated], axis=0)

# Get the clip durations
clip_dur = pd.read_csv(commonVoice_dir + path_sep + language_dir + path_sep + 'clip_durations.tsv', sep = '\t',
                       dtype = {'clip': 'str', 'duration[ms]': 'float64'})
clip_dur.rename(columns = {'clip':'path', 'duration[ms]':'dur'}, inplace=True)
clip_dur.set_index('path', inplace = True)

# Append duration info to validated speaker file
whole.set_index('path', inplace = True)
whole = pd.concat([whole, clip_dur], axis = 1, join = 'inner')
whole['dur'] = whole['dur']/1000
whole.reset_index(inplace=True)

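# Create a column that shows if the clip is validated or not based on the votes
# Note: np.select() uses the first condition that matches, so the order below matters
conditions = [
    ((whole['validation'] == 'validated') & (whole['dur'] > 1)),
    (whole['validation'] == 'invalidated'),
    (
        ((whole['validation'] == 'validated') & (whole['dur'] <= 1)) |
        (whole['sentence'] == '') |
        (whole['sentence'].isna()) # isinstance() on a whole column is always False; isna() checks each row
        ),
]
choices = ["validated", "invalidated", "other"]
# A string default keeps the result dtype consistent with the string choices
whole["validation"] = np.select(conditions, choices, default="other")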

# remap the speakers:
whole['speaker_id'] = pd.factorize(whole['client_id'])[0] + 1
whole['speaker_id'] = whole.speaker_id.astype('str')
#speaker_lab = whole['speaker_id'].str.zfill(5)
#whole['new_utt'] = speaker_lab + '_' + whole['path']

# save the speaker file
if os.path.exists(remap_spkr_path[2]):
    os.remove(remap_spkr_path[2])
whole.to_csv(remap_spkr_path[2], sep='\t', index=False)

# The file paths
whole['src_path'] = commonVoice_dir + path_sep + language_dir + path_sep + 'clips' + path_sep + whole['path']

cond_snd_path = [
    (whole['validation'] == 'validated'),
    (whole['validation'] == 'invalidated'),
    (whole['validation'] == 'other'),
]
choice_snd_path = [commonVoice_dir + path_sep + language_dir + path_sep + 'validated' + path_sep + whole['path'],  
                   commonVoice_dir + path_sep + language_dir + path_sep + 'clips' + path_sep + whole['path'],
                   commonVoice_dir + path_sep + language_dir + path_sep + 'other' + path_sep + whole['path'],]
whole["new_path"] = np.select(cond_snd_path, choice_snd_path)


# Get total file number
n_clips = whole.shape[0]


del invalidated, validated
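To make the labelling and speaker remapping in this step easier to follow, here is a small sketch on made-up data. The toy `client_id` values and durations are invented for illustration; only the column names follow the cell above.

```python
import numpy as np
import pandas as pd

# Toy data: two speakers, four clips (all values are invented).
toy = pd.DataFrame({
    'client_id': ['abc', 'xyz', 'abc', 'xyz'],
    'validation': ['validated', 'validated', 'invalidated', 'validated'],
    'dur': [3.2, 0.8, 2.5, 4.0],
})

# Validated clips longer than 1 s stay 'validated'; invalidated clips stay
# 'invalidated'; everything else (here: the 0.8 s clip) becomes 'other'.
conditions = [
    (toy['validation'] == 'validated') & (toy['dur'] > 1),
    toy['validation'] == 'invalidated',
]
choices = ['validated', 'invalidated']
toy['validation'] = np.select(conditions, choices, default='other')

# pd.factorize() numbers the distinct client IDs 0, 1, 2, ... in order of
# first appearance; adding 1 gives speaker IDs that start at 1.
toy['speaker_id'] = (pd.factorize(toy['client_id'])[0] + 1).astype(str)

print(toy)
```

The long hashed `client_id` strings from Common Voice are thus replaced by short numeric labels that can serve as TextGrid tier names.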

## Step 2. Create TextGrid files and .wav files based on the .mp3 recordings from Common Voice
Now we can create TextGrid files and .wav files

In [140]:
# The path of the 'validated' folder to contain validated recordings. If there is already a folder with the same name, delete it
if os.path.exists(commonVoice_dir + path_sep + language_dir + path_sep + 'validated'):
    shutil.rmtree(commonVoice_dir + path_sep + language_dir + path_sep + 'validated')
# Make the folder:
os.makedirs(commonVoice_dir + path_sep + language_dir + path_sep + 'validated')

# The path of the 'other' folder to contain unprocessable recordings. If there is already a folder with the same name, delete it
if os.path.exists(commonVoice_dir + path_sep + language_dir + path_sep + 'other'):
    shutil.rmtree(commonVoice_dir + path_sep + language_dir + path_sep + 'other')
# Make the folder:
os.makedirs(commonVoice_dir + path_sep + language_dir + path_sep + 'other')

# The function to create the textgrid files
from praatio import textgrid
def create_textgrid(snd_file, dur, speaker_id, transcript):
    # Create the textgrid
    tg = textgrid.Textgrid()
    
    # Add a new tier to the TextGrid
    speaker_tier = textgrid.IntervalTier(speaker_id, # tier name
                                        [(0.05, dur-0.05, transcript)], # interval start time, end time, and the transcript
                                        0, # start time
                                        dur) # end time
    tg.addTier(speaker_tier)

    # Save the TextGrid to a file
    tg_filename = snd_file.replace('.mp3', '.TextGrid')
    tg.save(tg_filename, format='short_textgrid', includeBlankSpaces=True)

for src_mp3_path, new_path, speaker, dur, transcript, validation in zip(whole.src_path, whole.new_path, whole.speaker_id, whole.dur, whole.sentence, whole.validation):
    # Move the sound file and create the TextGrid file
    tg_file = new_path.replace('.mp3', '.TextGrid')
    if validation != 'invalidated':
        if not os.path.exists(new_path):
            shutil.move(src_mp3_path, new_path)  # portable replacement for the Unix-only 'mv' command
        if not os.path.exists(tg_file) and validation != 'other':
            if isinstance(transcript, float):
                os.remove(new_path)
            else:
                create_textgrid(new_path, dur, speaker, transcript)

os.rename(commonVoice_dir + path_sep + language_dir + path_sep + 'clips', commonVoice_dir + path_sep + language_dir + path_sep + 'invalidated')

del clip_dur, src_mp3_path, new_path, speaker, dur, transcript, n_clips
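The cell above moves the .mp3 clips and writes the TextGrids; it does not itself convert any audio. If .wav files are wanted, one possible conversion with ffmpeg (listed in the requirements) is sketched below. The helper name `build_ffmpeg_cmd`, the clip name, the 16 kHz sampling rate, and the mono setting are all assumptions made for illustration, not settings taken from this pipeline.

```python
import os
import shutil
import subprocess

def build_ffmpeg_cmd(mp3_path, sample_rate=16000):
    """Return an ffmpeg command that converts an .mp3 clip to a mono .wav file."""
    wav_path = os.path.splitext(mp3_path)[0] + '.wav'
    return ['ffmpeg', '-y',            # overwrite the output file if it exists
            '-i', mp3_path,            # input file
            '-ar', str(sample_rate),   # output sampling rate in Hz (assumed value)
            '-ac', '1',                # one audio channel (assumed)
            wav_path]

cmd = build_ffmpeg_cmd('common_voice_mn_0001.mp3')  # hypothetical clip name
print(' '.join(cmd))

# Run the conversion only if ffmpeg is installed and the clip actually exists.
if shutil.which('ffmpeg') and os.path.exists(cmd[3]):
    subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell-quoting problems when a path contains spaces.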


## Step 3. Prepare the lexicon
Extract transcripts from validated.tsv and get each word on its own line

In [134]:
# Read in the validated.tsv file and get the orthographical transcriptions of the utterances
words = pd.read_csv(validated_log, sep='\t', low_memory = False, usecols = ['sentence'], dtype = {'sentence':'str'}) # get the transcribed sentences
words = words[words['sentence'].notnull()]['sentence']

words = words.str.replace("[›|‹|\(|\)|\[|\]|,|‚|.|!|?|+|\"|″|″|×|°|¡|“|⟨|⟩|„|→|‑|–|-|-|−|-|—|‒|\$|ʻ|ʿ|ʾ|`|´|’|‘|«|»|;|:|”|؟|&|\%|…| \' ]+", " ", regex=True)
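# Python's re module does not support POSIX classes such as [[:punct:]], so strip any remaining ASCII punctuation explicitly
import string
words = words.str.replace("[" + re.escape(string.punctuation) + "]+", " ", regex=True)
words = words.str.replace("[ ]+", " ", regex=True)
words = words.str.lower()
words = words.tolist()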
words = " ".join(words)
words = words.split(" ")
words = sorted(set(words))
words = list(filter(None, words))

# Save the word list as a .txt file
if os.path.exists(wordlist_path):
    os.remove(wordlist_path)
    
with open(wordlist_path, 'w', encoding='utf-8') as word_file: # explicit UTF-8, so non-Latin scripts are written correctly on any platform
    for word in words:
        word_file.write(word + "\n")

del words, word
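The idea of this step, shown on a few made-up sentences: strip punctuation, lowercase, split on whitespace, and keep each distinct word once. This is a simplified sketch with a much smaller punctuation set than the cell above, and the function name `build_wordlist` is hypothetical.

```python
import re

def build_wordlist(sentences):
    """Return the sorted list of distinct lowercase words in the sentences."""
    words = set()
    for sentence in sentences:
        if not isinstance(sentence, str):
            continue  # skip missing transcripts
        # Replace a small, illustrative set of punctuation marks with spaces.
        cleaned = re.sub(r"[,.!?;:\"()\[\]]+", " ", sentence).lower()
        words.update(cleaned.split())
    return sorted(words)

sentences = ["Hello, world!", "The world is big.", None]
print(build_wordlist(sentences))  # ['big', 'hello', 'is', 'the', 'world']
```

Each word of the result would then be written on its own line of the word list file.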

## Step 4. G2P grapheme-to-phoneme (Epitran or XPF)
There are three files you need if you use XPF:
1. A G2P rule file
2. A verification file
3. The translator Python script
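When XPF is used, the cell below also post-processes the generated lexicon: it re-attaches length marks that were separated from their vowel and drops entries containing `@`, which marks a G2P failure. A minimal sketch of that clean-up on made-up lines follows. The line format (a word followed by space-separated phones) and the function name `clean_xpf_lines` are assumptions, and unlike the cell below this version also drops empty lines.

```python
import re

def clean_xpf_lines(lines):
    """Join detached length marks and drop failed ('@') or empty entries."""
    cleaned = []
    for line in lines:
        line = re.sub(" ː", "ː", line)  # 'a ː' -> 'aː'
        if line and '@' not in line:
            cleaned.append(line)
    return cleaned

raw = ["aba a b a", "taal t a ː l", "x@y @", ""]
print(clean_xpf_lines(raw))  # ['aba a b a', 'taal t aː l']
```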

In [135]:
if os.path.exists(dict_file_path):
    os.remove(dict_file_path)
    
# Get the dictionary using XPF
# -l specifies the rule file
# -c specifies the verification file
# -r specifies the file to be translated
if g2p == 'xpf':
    g2p_cmd = ["python", xpf_translater_path, "-l", rule_file_path, "-c", verify_file_path, "-r", wordlist_path] # XPF translating command that will be sent to subprocess.run() to execute.

    if os.path.exists(dict_file_path):
        os.remove(dict_file_path)

    with open(dict_file_path,'w') as dict_file:
        subprocess.run(g2p_cmd, stdout = dict_file) # stdout = ... means to send the output to the file (so you have to open this file first as above)

    # This removes every entry containing '@' from the lexicon (if there are any). '@' marks an XPF G2P failure
    with open(dict_file_path, "r") as dict_file:
        dict_lines = dict_file.read().split("\n") # named dict_lines to avoid shadowing the built-in dict

    with open(dict_file_path, 'w') as dict_file:
        for line in dict_lines:
            line = re.sub(" ː", "ː", line) # re-attach the length mark to the preceding phone
            if '@' not in line:
                dict_file.write(line + "\n")
# Or using Epitran
else:
    g2p_cmd = ["python", epitran_translater_path, wordlist_path, dict_file_path, epi_code]
    subprocess.run(g2p_cmd)


['aː', 'i', 'u', '́', 'n'] 
 ['aː', 'i', 'u', 'n']
['a', 'b', 'u', '̄'] 
 ['a', 'b', 'u']
['a', 'k', 'k', 'i', 'o', '́', 'n'] 
 ['a', 'k', 'k', 'i', 'o', 'n']
['a', 'eː', 'ʀ', 'o', '́', 't', 'ʁ', 'oː', 'm', 'o'] 
 ['a', 'eː', 'ʀ', 'o', 't', 'ʁ', 'oː', 'm', 'o']
['a', 'f', 'k', 'h', 'a', '̄', 'n', 'i', '̄'] 
 ['a', 'f', 'k', 'h', 'a', 'n', 'i']
['a', 'k', 'ʁ', 'o', '̀'] 
 ['a', 'k', 'ʁ', 'o']
['a', 'ɡ', 'u', 's', 't', 'i', '́', 'n'] 
 ['a', 'ɡ', 'u', 's', 't', 'i', 'n']
['a', 'i', 'e', 'k', 'i', 'd', 'o', '̄'] 
 ['a', 'i', 'e', 'k', 'i', 'd', 'o']
['a', 'l', 'a', 'ɐ', 'k', 'o', '́', 'n'] 
 ['a', 'l', 'a', 'ɐ', 'k', 'o', 'n']
['a', 'l', 'b', 'a', '́', 'n'] 
 ['a', 'l', 'b', 'a', 'n']
['a', 'l', 'e', 'k', 'ʁ', 'i', '́', 'aː'] 
 ['a', 'l', 'e', 'k', 'ʁ', 'i', 'aː']
['a', 'l', 'i', 'ŋ', 'k', 's', 'a', '̊', 's'] 
 ['a', 'l', 'i', 'ŋ', 'k', 's', 'a', 's']
['a', 'l', 'm', 'eː', 'ʀ', 'i', '́', 'aː'] 
 ['a', 'l', 'm', 'eː', 'ʀ', 'i', 'aː']

## Step 5. Validate the corpus

First, you need to activate the MFA environment in the terminal.
1. Press ctrl+` to open Terminal in VS Code.
2. Run 'conda activate aligner'. You should then see '(aligner)' at the beginning of the line in Terminal.
3. When you have finished using MFA (both training and aligning), run 'conda deactivate' to shut down the MFA environment.

In [None]:
# Create an MFA folder in Documents
# You DON'T need to run this if you already have an MFA folder in your Documents folder (What would this be like on Windows?)
# Uncomment the command below to run:
#!mfa model download acoustic english.zip

To validate the corpus, run this line in terminal: 

        mfa validate {wherever your validated recordings are} {wherever your lexicon file is} --ignore_acoustics --clean

You can copy the command line printed below.
The notebook cannot run ```mfa``` commands; they have to be run in Terminal.

In [133]:
cmd_validate = f'mfa validate {validated_recs_path} {dict_file_path} --ignore_acoustics --clean'
print('To validate, copy:\t' + cmd_validate)

To validate, copy:	mfa validate /Users/miaozhang/Research/CorpusPhon/CorpusData/CommonVoice/de_v16/validated /Users/miaozhang/Research/CorpusPhon/CorpusData/CommonVoice/de_v16/de_epi_lexicon16.txt --ignore_acoustics --clean
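If any of the paths contains spaces, the printed command will break when pasted into Terminal. One way to guard against that is to quote each argument with `shlex`, as in the sketch below. The function name `build_validate_cmd` and the example paths are hypothetical, and `shlex` follows POSIX shell quoting rules, so this applies to macOS and Linux terminals rather than the Windows command prompt.

```python
import shlex

def build_validate_cmd(corpus_dir, lexicon_path):
    """Return an 'mfa validate' command line with shell-safe quoting."""
    return shlex.join(['mfa', 'validate', corpus_dir, lexicon_path,
                       '--ignore_acoustics', '--clean'])

# Hypothetical paths containing a space.
print(build_validate_cmd('/data/Common Voice/mn_v16/validated',
                         '/data/Common Voice/mn_v16/lexicon.txt'))
```

Arguments without special characters are left unquoted, so commands for space-free paths look the same as before.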


Put the oov words back into the word list and rerun G2P.

In [None]:
# The oov file:
oov_file = 'oovs_found_' + eval(dict_file_name)

oov_path = mfa_oov_path + path_sep + oov_file
with open(oov_path, 'r') as oov_file:
    with open(wordlist_path, 'a') as wordlist:
        shutil.copyfileobj(oov_file, wordlist)

# And then rerun Step 4. G2P to process the oov words.
if g2p == 'xpf':
    g2p_cmd = ["python", xpf_translater_path, "-l", rule_file_path, "-c", verify_file_path, "-r", wordlist_path]

    if os.path.exists(dict_file_path):
        os.remove(dict_file_path)

    with open(dict_file_path,'w') as dict_file:
        subprocess.run(g2p_cmd, stdout = dict_file) 

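    with open(dict_file_path, "r") as dict_file:
        dict_lines = dict_file.read().split("\n") # named dict_lines to avoid shadowing the built-in dict

    with open(dict_file_path, 'w') as dict_file:
        for line in dict_lines:
            line = re.sub(" ː", "ː", line) # re-attach the length mark to the preceding phone
            if '@' not in line:
                dict_file.write(line + "\n")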

else:
    g2p_cmd = ["python", epitran_translater_path, wordlist_path, dict_file_path, epi_code]
    subprocess.run(g2p_cmd)
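The cell above appends the OOV file to the word list as it is, so rerunning it would add the same words again. A sketch of merging the two lists without duplicates is shown below. It assumes that both files contain one word per line, and the function name `merge_oovs` is hypothetical.

```python
def merge_oovs(wordlist_lines, oov_lines):
    """Merge two lists of words, dropping duplicates and empty lines."""
    merged = {w.strip() for w in wordlist_lines} | {w.strip() for w in oov_lines}
    merged.discard('')  # ignore blank lines
    return sorted(merged)

wordlist_lines = ["apple\n", "berry\n"]
oov_lines = ["cherry\n", "apple\n", "\n"]
print(merge_oovs(wordlist_lines, oov_lines))  # ['apple', 'berry', 'cherry']
```

The merged list could then be written back to the word list file before G2P is rerun.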

In [None]:
# To revalidate the corpus, copy and paste the command below.
print('To validate, copy:\t' + cmd_validate)

## Step 6. Train the acoustic model and forced align.

### Step 6.1. Then to train the acoustic model, run the next line:

        mfa train --clean {where your validated recordings are} {where your lexicon file is} {where your model will be saved}

### Step 6.2. The final step: forced align the recordings:

        mfa align --clean {where your validated recordings are} {where your lexicon file is} {where your acoustic model is} {where your output will be saved}

You can copy the command lines printed below.
The notebook cannot run ```mfa``` commands; the commands above have to be run in Terminal.

In [136]:
cmd_train = f'mfa train --clean {validated_recs_path} {dict_file_path} {acs_mod_path}'
cmd_align = f'mfa align --clean {validated_recs_path} {dict_file_path} {acs_mod_path} {output_path}'

print('To train, copy: \t' + cmd_train)
print("\n")
print('To align, copy: \t' + cmd_align)

To train, copy: 	mfa train --clean /Users/miaozhang/Research/CorpusPhon/CorpusData/CommonVoice/de_v16/validated /Users/miaozhang/Research/CorpusPhon/CorpusData/CommonVoice/de_v16/de_epi_lexicon16.txt /Users/miaozhang/Documents/MFA/pretrained_models/acoustic/de_epi_acoustic16.zip


To align, copy: 	mfa align --clean /Users/miaozhang/Research/CorpusPhon/CorpusData/CommonVoice/de_v16/validated /Users/miaozhang/Research/CorpusPhon/CorpusData/CommonVoice/de_v16/de_epi_lexicon16.txt /Users/miaozhang/Documents/MFA/pretrained_models/acoustic/de_epi_acoustic16.zip /Users/miaozhang/Research/CorpusPhon/CorpusData/CommonVoice/de_v16/output


## Finale
Move the output files (the speaker file, the lexicon, the acoustic model, and the aligned textgrids) to the OSF folder to be ready to upload.

In [None]:
# Make a zip file of the aligned textgrids
shutil.make_archive(txtgrds_path, 'zip', output_path)

# Move the acoustic model
shutil.copy(acs_mod_path, osf_path + path_sep + 'acoustic_models' + path_sep)

# Move the lexicon
shutil.copy(dict_file_path, osf_path + path_sep + 'lexicons' + path_sep)

# Move the speaker file
shutil.copy(remap_spkr_path[2], osf_path + path_sep + 'spkr_files' + path_sep)
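`shutil.make_archive()` takes the archive name without its extension and appends `.zip` itself, which is why `txtgrds_path` is built with the last four characters (`.zip`) cut off. The sketch below demonstrates this in a temporary folder; the file and folder names are invented for the demonstration.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # A stand-in for the MFA output folder, holding one dummy TextGrid.
    out_dir = os.path.join(tmp, 'output')
    os.makedirs(out_dir)
    with open(os.path.join(out_dir, 'clip_0001.TextGrid'), 'w') as f:
        f.write('dummy')

    # The base name has no extension; make_archive() adds '.zip'.
    archive = shutil.make_archive(os.path.join(tmp, 'textgrids_demo'), 'zip', out_dir)
    archive_name = os.path.basename(archive)
    archive_exists = os.path.exists(archive)

print(archive_name, archive_exists)  # textgrids_demo.zip True
```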

Update the tracking info in `VoxCommunis_Info.csv`. 

Make sure it is not in the lang_code_processing folder. Once updated, push the updated .csv to GitHub.

In [132]:
# If you have trained the model, set this to 1
model_train = 0

# Paste the name of the outputs into the tracking file
cv_track = pd.read_csv(cv_tracking_file)
cv_track = cv_track.astype('string')
cv_track.loc[cv_track['code_cv'] == lang_code, 'spkr_file'] = eval(spkr_file_name)
cv_track.loc[cv_track['code_cv'] == lang_code, 'lexicon'] = eval(dict_file_name)
if model_train == 1:
    cv_track.loc[cv_track['code_cv'] == lang_code, 'acoustic_model'] = eval(acs_mod_name)
else:
    cv_track.loc[cv_track['code_cv'] == lang_code, 'acoustic_model'] = ''

# Update the tracking file
cv_track.to_csv(cv_tracking_file, index = False)
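The update relies on boolean indexing with `.loc`: only the row whose `code_cv` matches the language code is changed. A small sketch on an invented two-row table (the values are made up; only the column names follow the cell above):

```python
import pandas as pd

# Invented stand-in for the tracking file.
track = pd.DataFrame({
    'code_cv': ['mn', 'de'],
    'lexicon': ['', ''],
    'acoustic_model': ['', ''],
}).astype('string')

# Fill in the lexicon name for Mongolian only; the 'de' row is left untouched.
track.loc[track['code_cv'] == 'mn', 'lexicon'] = 'mn_epi_lexicon16.txt'
print(track)
```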