# VoxCommunis data processing pipeline

This script runs MFA on recordings from the Common Voice corpus.

To run this pipeline, you need:

1. Third-party Python modules: epitran, praatio, pandas, numpy (re, subprocess, shutil, and os ship with Python's standard library and need no installation)
2. The XPF corpus data

The pipeline takes these steps to process data:

1. [Step 0: Setups](#step-0-setups)
2. [Step 1: Remap speakers](#step-1-remap-the-validated-speakers)
3. [Step 2: Create TextGrid and .wav files](#step-2-create-textgrid-files-and-wav-files-based-on-the-mp3-recordings-from-common-voice)
4. [Step 3: Prepare the lexicon](#step-3-prepare-the-lexicon)
5. [Step 4: G2P grapheme-to-phoneme](#step-4-g2p-grapheme-to-phoneme-epitran-or-xpf)
6. [Step 5: Validation](#step-5-validate-the-corpus)
7. [Step 6: Run MFA](#step-6-train-the-acoustic-model-and-forced-align)
8. [Finale](#finale)

This script was created by Miao Zhang (miao.zhang@uzh.ch), 22.12.2023

This script was modified by Miao Zhang, 07.02.2024 (Revalidation added)

Modified on 16.02.2024: added automatic log.

## Step 0. Setups
Import packages and setup file directories (for both the scripts and data).

In [17]:
pip install lingpy



In [18]:
# Import modules
import os, subprocess, shutil, re, csv, sys, importlib
import pandas as pd
# Turn Copy-On-Write on
pd.options.mode.copy_on_write = True
import numpy as np

# For creating textgrids
from praatio import textgrid

# For moving files concurrently
from concurrent.futures import ThreadPoolExecutor

# Import functions from vxc_processing.py
import vxc_processing as vxcproc


In [19]:
importlib.reload(vxcproc)

<module 'vxc_processing' from '/Users/eleanorchodroff/Documents/GitHub/voxcommunis/vxc_pipeline/vxc_processing.py'>

Set the paths and directories of data and scripts to use.

_IMPORTANT_: the folder of corpus data you downloaded from Common Voice must be named {lang_code}_v{version_number}.
- For example: the folder for the 16th version of Dhivehi should be named dv_v16.
- Another example: the folder for the 15th version of Upper Sorbian should be named hsb_v15.
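
The naming convention above can be sketched as a small helper. This is only an illustration: `language_folder` is a hypothetical function that is not part of `vxc_processing.py`, and the root folder name is a placeholder.

```python
from pathlib import Path

def language_folder(root, lang_code, cv_version):
    """Return the expected corpus folder: <root>/<lang_code>_v<cv_version>."""
    version = str(cv_version)
    # The version must consist of digits only, e.g. '16', not 'v16'
    if not version.isdigit():
        raise ValueError(f"cv_version must contain digits only, got {version!r}")
    return Path(root) / f"{lang_code}_v{version}"

print(language_folder("CommonVoice_processing", "dv", 16).name)     # dv_v16
print(language_folder("CommonVoice_processing", "hsb", "15").name)  # hsb_v15
```

Building the path with `pathlib` also avoids having to choose a path separator by hand.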

In [None]:
###################################### Directories ################################################

# This is the directory where your data downloaded from Common Voice should be saved. This is the root directory where data from each language should be saved in individual folders.
# NO (BACK)SLASH at the end!!!
commonVoice_dir = '/Users/eleanorchodroff/Documents/CommonVoice_processing' 

# To use XPF as the G2P engine to process lexicon, you will need to download the XPF data from: https://github.com/CohenPr-XPF/XPF/tree/master/Data and save them on your computer.
# Specify the directory where your XPF data is saved.
# NO (BACK)SLASH at the end!!!
xpf_dir = '/Users/eleanorchodroff/Documents/GitHub/XPF/Data/mk_Macedonian'

######################### Language name/code and Common Voice version ##############################

# Language-related variable names
# The Common Voice code of the language. Common Voice uses BCP 47 codes, which unfortunately mix ISO 639-1 and ISO 639-3 codes. This code is also used in XPF.
# The code should match the code used in the name of the folder you downloaded from Common Voice.
lang_code = 'bas' 

# The version of the data in Common Voice
# Only numbers!!!
cv_version = '20' 

######################### G2P ######################################################################

# Specify the G2P engine. Only these keywords are acceptable: 
# 'xpf' for XPF
# 'epi' for Epitran
# 'chr' for Charsiu
# 'mfa' for MFA
# 'vxc' for self-defined lexicon
g2p = 'vxc'

######################### The delimiter ############################################################

# Set it to 0 if you use a Windows machine.
if_mac = 1 

######################### What writing system is the language using? ###############################

# Specify if the language is Chinese/Japanese/Korean
if_cjk = 0

######################### Using existing model? ###############################

if_self_mod = 0

######################### Using existing lexicon? ###############################

if_self_lex = 0

###################### G2P settings (XPF or Epitran) ################################################

# This is the path of the tracking file VoxCommunis_Info.csv:
cv_tracking_file = 'VoxCommunis_Info.csv'

# Get the G2P processing code for the language
if g2p in ('xpf', 'chr'):
    with open(cv_tracking_file, 'r') as f:
        reader = csv.DictReader(f)
        lang_row = [row for row in reader if row['code_cv'] == lang_code][0]
    # If you are using XPF, get the name of the language in XPF corpus
    if g2p == 'xpf':
        lang_name = lang_row['name_xpf'].replace(' ', '')
    # If you are using Charsiu, get the processing code for the language in Charsiu.
    elif g2p == 'chr':
        code_chr = lang_row['code_chr']

if g2p == 'epi':
    # If you are using epitran, ...
    # Please refer to VoxCommunis_Info.csv to get the processing code of the language in epitran
    # !!!Do this manually, since depending on the type of the orthography, the epitran code can differ!!!
    epi_code = 'rus-Cyrl'


# Specify if the subversion of a corpus is used. The default is 0
if_subversion = 0 
# If if_subversion == 1, what suffix you would use?:
# Ignore this part, if you don't have a subversion of the corpus you are using.
subversion = '_' + 'sub3'

###################################################################################################

if if_mac == 1:
    path_sep = '/'
    # this is the default directory where Praat is installed on a Mac.
    #praat_path = '/Applications/Praat.app/Contents/MacOS/Praat' 
else:
    path_sep = '\\'
    # the directory of Praat installed on Windows.
    #praat_path = 'C:\Program Files\Praat.exe' 

# The folder for the language
language_dir = commonVoice_dir + path_sep + lang_code + '_v' + cv_version

# The file that contains the duration of each clip:
clip_info_path = language_dir + path_sep + 'clip_durations.tsv'

# MFA paths
# The folder of the OOV word files (NO (BACK)SLASH at the end!!!):
mfa_oov_path = '/Users/eleanorchodroff/Documents/MFA/validated'
# This is where the acoustic model will be saved after MFA training is done (NO (BACK)SLASH at the end!!!):
mfa_mod_folder = '/Users/eleanorchodroff/Documents/MFA/pretrained_models/acoustic'


############################################################################################################################


# This is where files that will be uploaded to the OSF repo will be saved after the processing is finished (NO (BACK)SLASH at the end!!!):
osf_path = '/Users/eleanorchodroff/Documents/CommonVoice/VoxCommunis_OSF'


####################################################################################################################################
####################################################################################################################################
cv_mod_version = "20"
cv_align_version = "20"
# Get the naming schema.
naming_schema = pd.read_csv('vxc_naming_schema.csv', usecols = ['Python_code'])['Python_code'].tolist()
naming_schema = [eval(name) for name in naming_schema]
acs_mod_name = naming_schema[0]
textgrid_folder_name = naming_schema[1]
word_file_name = naming_schema[2]
dict_file_name = naming_schema[3]
spkr_file_name = naming_schema[4]
textgrid_folder_path = language_dir + path_sep + textgrid_folder_name
word_file_path = language_dir + path_sep + word_file_name
dict_file_path = language_dir + path_sep + dict_file_name
spkr_file_path = language_dir + path_sep + spkr_file_name
del naming_schema

# For step 3: prepare the lexicon and pronunciation dictionary
validated_log = language_dir + path_sep + 'validated.tsv'

# For step 4: G2P
if g2p == 'xpf':
    xpf_translater_path = 'xpf_translate04.py'
    rule_file_path = xpf_dir + path_sep + lang_code + '_' + lang_name + path_sep + lang_code + '.rules'
    verify_file_path = xpf_dir + path_sep + lang_code + '_' + lang_name + path_sep + lang_code + '.verify.csv'
elif g2p == 'epi':
    epitran_translater_path = 'epi_run.py'
elif g2p == 'chr':
    from transformers import T5ForConditionalGeneration, AutoTokenizer
    chr_model = T5ForConditionalGeneration.from_pretrained('charsiu/g2p_multilingual_byT5_tiny_16_layers_100')
    chr_tok = AutoTokenizer.from_pretrained('google/byt5-small')


# For step 6: running MFA
# Default model path; the subversion suffix is added to the model name if a subversion is used
if if_subversion == 1:
    acs_mod_name = re.sub(r'\.zip$', subversion + '.zip', acs_mod_name)
acs_mod_path = mfa_mod_folder + path_sep + acs_mod_name
if if_self_mod == 1:
    # Specify the path of the existing model (this overrides the default path above)
    acs_mod_path = '/Users/eleanorchodroff/Documents/MFA/pretrained_models/acoustic/bas20_cvu.zip'
if if_self_lex == 1:
    # Specify the path of the existing lexicon
    dict_file_path = language_dir + path_sep + 'ca_lexicon-IPA.txt'

# Validate the corpus
validated_recs_path = language_dir + path_sep + 'validated'
output_path = language_dir + path_sep + 'output'

# Finale:
txtgrds_path = osf_path + path_sep + 'textgrids' + path_sep + textgrid_folder_name[:-4]

## Step 1. Remap the validated speakers
Get speaker IDs to put on TextGrids for speaker adaptation.
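
As a rough illustration of what remapping means here, the sketch below replaces long speaker identifiers with short integer IDs. It is not the implementation of `vxcproc.remap_spkr`; `remap_speakers` is a hypothetical helper, and the example IDs are made up.

```python
def remap_speakers(client_ids):
    """Map each distinct client ID to a short integer ID, in order of first appearance."""
    mapping = {}
    for cid in client_ids:
        if cid not in mapping:
            mapping[cid] = len(mapping) + 1
    return mapping

# Made-up identifiers standing in for the long client IDs of a corpus
ids = ["a3f9", "77bc", "a3f9", "0d21"]
mapping = remap_speakers(ids)
print([mapping[c] for c in ids])  # [1, 2, 1, 3]
```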

In [33]:
print(language_dir, path_sep)
print(spkr_file_path, lang_code)
whole = vxcproc.remap_spkr(language_dir, path_sep, spkr_file_path, lang_code)

/Users/eleanorchodroff/Documents/CommonVoice_processing/bas_v20 /
/Users/eleanorchodroff/Documents/CommonVoice_processing/bas_v20/bas_vxc_spkr20.tsv bas


IsADirectoryError: [Errno 21] Is a directory: '/'

## Step 2. Create TextGrid files for the validated recordings and save them in a separate folder.

All validated clips that are longer than 1 s will be moved to a subfolder called 'validated'.

Validated clips that are shorter than 1 s will be moved to the 'other' folder.

The invalidated clips will stay in the 'clips' folder. When the moving is done, the 'clips' folder will be renamed to 'invalidated'.
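
The sorting rule described above can be written as a small function. This is a sketch under assumptions: `destination_folder` is a hypothetical helper (the actual moving is done by `vxcproc.move_and_create_tg`), durations are taken to be in milliseconds, and a clip of exactly 1 s is sent to 'other', which the description above leaves open.

```python
def destination_folder(is_validated, duration_ms, min_ms=1000):
    """Return the folder a clip ends up in under the rule described above."""
    if not is_validated:
        # Invalidated clips stay where they are; 'clips' is renamed afterwards
        return "clips"
    # Assumption: a clip of exactly min_ms counts as too short
    return "validated" if duration_ms > min_ms else "other"

print(destination_folder(True, 3250))   # validated
print(destination_folder(True, 800))    # other
print(destination_folder(False, 3250))  # clips
```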

In [None]:
# Make the folders for validated clips and other clips (existing copies are removed first):
other_folder = language_dir + path_sep + 'other'
for folder in (validated_recs_path, other_folder):
    if os.path.exists(folder):
        shutil.rmtree(folder)
    os.makedirs(folder)

# Split the data into chunks to batch-process clip moving and TextGrid creation
n_clips = len(whole.index)
n_workers = 10
chunksize = max(1, round(n_clips / n_workers))

# Move the clips and create TextGrid files.
# .iloc slices by position and excludes the end point, so the chunks do not overlap.
with ThreadPoolExecutor(n_workers) as exe:
    futures = [exe.submit(vxcproc.move_and_create_tg, whole.iloc[i:(i + chunksize)])
               for i in range(0, n_clips, chunksize)]
# Re-raise any exception that occurred in a worker thread
for future in futures:
    future.result()

# Rename the clip folder to invalidated
os.rename(language_dir + path_sep + 'clips', language_dir + path_sep + 'invalidated')

## Step 3. Prepare the lexicon
Generate the wordlist from Common Voice transcripts.
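
To give an idea of what generating a wordlist involves, here is a minimal sketch that lower-cases sentences, strips punctuation, and keeps each word once. It is not the logic of `vxcproc.process_words`; `build_wordlist` and its word pattern are assumptions made for illustration.

```python
import re

def build_wordlist(sentences):
    """Return the sorted set of lower-cased words found in the sentences."""
    # A word is a run of letters, optionally joined by apostrophes or hyphens
    word_pattern = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")
    words = set()
    for sentence in sentences:
        words.update(word_pattern.findall(sentence.lower()))
    return sorted(words)

print(build_wordlist(["Hello, world!", "Hello again."]))  # ['again', 'hello', 'world']
```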

In [None]:
# Remove punctuation
if lang_code == 'ja':
    words = vxcproc.process_words(spkr_file_path, lang_code)
else:
    words = vxcproc.process_words(validated_log, lang_code)

# Filter out other unwanted words
words = vxcproc.remove_unwanted_words(words, lang_code, if_cjk)

# Save the word list as a .txt file
if os.path.exists(word_file_path):
    os.remove(word_file_path)
    
with open(word_file_path,'w') as word_file:
    for word in words:
        word_file.write(word + "\n")

## Step 4. G2P grapheme-to-phoneme (Epitran or XPF)
You need three files to proceed if you use XPF:
1. A G2P rule file
2. A verification file
3. The translator Python script
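
Before running the translator, it can help to confirm that all three files exist. The sketch below is one way to do that; `missing_files` is a hypothetical helper and the file names in the example are placeholders, not the actual XPF file names.

```python
import os

def missing_files(paths):
    """Return the paths that do not point to an existing file."""
    return [p for p in paths if not os.path.isfile(p)]

# Placeholder names; in the pipeline these would be rule_file_path,
# verify_file_path and xpf_translater_path from Step 0
required = ["example.rules", "example.verify.csv", "translate_example.py"]
for path in missing_files(required):
    print("Missing file:", path)
```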

In [None]:
if os.path.exists(dict_file_path):
    os.remove(dict_file_path)
   
# Get the dictionary using XPF
# -l specifies the rule file
# -c specifies the verification file
# -r specifies the file to be translated
if g2p == 'xpf':
    g2p_cmd = ["python", xpf_translater_path, "-l", rule_file_path, "-c", verify_file_path, "-r", word_file_path] # XPF translating command that will be sent to subprocess.run() to execute.

    with open(dict_file_path,'w') as dict_file:
        subprocess.run(g2p_cmd, stdout = dict_file) # stdout = ... means to send the output to the file (so you have to open this file first as above)

    # Remove every entry that contains '@' from the lexicon (if there are any). '@' marks a sound that XPF failed to convert.
    with open(dict_file_path, "r") as dict_file:
        dict_lines = dict_file.read().splitlines()

    with open(dict_file_path, 'w') as dict_file:
        for line in dict_lines:
            # Attach the length mark to the preceding sound
            line = re.sub(" ː", "ː", line)
            # Skip empty lines and words that contain sounds XPF can't figure out
            if line and '@' not in line:
                dict_file.write(line + "\n")

# Or using Epitran
elif g2p == 'epi':
    g2p_cmd = ["python", epitran_translater_path, word_file_path, dict_file_path, epi_code]
    subprocess.run(g2p_cmd)

elif g2p == 'chr':
    from transformers import T5ForConditionalGeneration, AutoTokenizer
    from ipatok import tokenise

    model = T5ForConditionalGeneration.from_pretrained('charsiu/g2p_multilingual_byT5_tiny_16_layers_100')
    tokenizer = AutoTokenizer.from_pretrained('google/byt5-small')

    # Charsiu expects each word to be prefixed with its language code
    chr_words = [f'<{code_chr}>: ' + w for w in words]

    out = tokenizer(chr_words, padding = True, add_special_tokens = False, return_tensors = 'pt')

    preds = model.generate(**out, num_beams = 1, max_length = 50) # We do not find beam search helpful. Greedy decoding is enough. 
    phones = tokenizer.batch_decode(preds.tolist(), skip_special_tokens = True)

    # Split each transcription into space-separated IPA segments
    phones = [' '.join(tokenise(phone)) for phone in phones]

    # Write the lexicon: one 'word<TAB>phones' entry per line
    with open(dict_file_path, 'w') as dict_file:
        for word, phone in zip(words, phones):
            dict_file.write(word + '\t' + phone + '\n')

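elif g2p == 'mfa':
    # mfa_g2p_path is not set anywhere above: define it as the name or path of your MFA G2P model before running this branch.
    cmd_mfa_g2p = f'mfa g2p {word_file_path} {mfa_g2p_path} {dict_file_path}'  # If using a word list
    print('To g2p, copy and run:\t', cmd_mfa_g2p)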

For some languages, however, you may prefer to use an existing lexicon and model from MFA, or ones of your own.

## Step 5. Validate the corpus

First, activate the MFA environment in the terminal.
1. Press ctrl+` to open Terminal in VS Code.
2. Run 'conda activate aligner' and wait until you see '(aligner)' at the beginning of the line in Terminal.
3. When you have finished using MFA (both training and aligning), run 'conda deactivate' to shut down the MFA environment.

In [None]:
# Create the MFA folder in your Documents folder
# You DON'T need to run this if you already have an MFA folder in your Documents folder (What would this be like on Windows?)
# Uncomment the command below to run:
#!mfa model download acoustic english.zip

To validate the corpus, run this line in terminal: 

        mfa validate {wherever your validated recordings are} {wherever your lexicon file is} --ignore_acoustics --clean

You can copy the command lines from below.
Notebook can't handle ```mfa``` commands. MFA commands can only run in Terminal.
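
Because the commands are copied into a terminal by hand, paths that contain spaces need quoting. The sketch below shows one way to build a copy-ready command; `build_mfa_command` is a hypothetical helper, the paths are placeholders, and `shlex.join` produces POSIX-style quoting, so this applies to macOS and Linux shells rather than the Windows command prompt.

```python
import shlex
import shutil

def build_mfa_command(subcommand, *args):
    """Return an 'mfa <subcommand> ...' string with shell-safe quoting."""
    return shlex.join(["mfa", subcommand, *args])

# Warn if the mfa executable is not visible from the current environment
if shutil.which("mfa") is None:
    print("mfa was not found on PATH; activate the MFA environment first.")

print(build_mfa_command("validate", "/data/my corpus/validated",
                        "/data/lexicon.txt", "--ignore_acoustics", "--clean"))
```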

In [None]:
cmd_validate = f'mfa validate {validated_recs_path} {dict_file_path} --ignore_acoustics --clean'
print('To validate, copy:\t' + cmd_validate)

## Step 6. Train the acoustic model and forced align.

### Step 6.1. Then to train the acoustic model, run the next line:

        mfa train --clean {where your validated recordings are} {where your lexicon file is} {where your model will be saved}

You can copy the command lines from below.
Notebook can't handle ```mfa``` commands. The mfa commands above can only run in Terminal.

In [None]:
# Train your own model
cmd_train = f'mfa train --clean {validated_recs_path} {dict_file_path} {acs_mod_path}'
print('To train, copy: \t' + cmd_train)

### Step 6.2. The final step: forced align the recordings:

        mfa align --clean {where your validated recordings are} {where your lexicon file is} {where your acoustic model is} {where your output will be saved}
        
Once the model is trained, align the corpus.

However, since the MFA alignment somehow stops after generating 32609 TextGrid files, we will split the corpus into n subfolders, each containing at most 32000 files.
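
The splitting itself amounts to cutting a list into consecutive groups of a fixed maximum size. A minimal sketch with a small group size for readability (`split_into_groups` is a hypothetical helper, not part of `vxc_processing.py`):

```python
def split_into_groups(items, size):
    """Cut a list into consecutive groups of at most `size` items."""
    return [items[start:start + size] for start in range(0, len(items), size)]

clips = [f"clip_{n}.mp3" for n in range(7)]
groups = split_into_groups(clips, 3)
print([len(g) for g in groups])  # [3, 3, 1]
```

Note that the slice has to start at the offset produced by `range`, not at the group index; otherwise the groups overlap.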

In [None]:
# Get all the mp3 files in the validated folder
all_file = os.listdir(validated_recs_path)
all_mp3 = [file for file in all_file if file.endswith('.mp3')]

# If there are more than 32000 mp3s in the validated folder, split them into several subfolders, each containing no more than 32000 clips
n_mp3 = len(all_mp3)
if n_mp3 > 32000:
    # Get the source path
    all_root = [os.path.join(validated_recs_path, rec) for rec in all_mp3]

    # Group the files into consecutive groups of at most 32000 files, each paired with its group index
    all_grouped = [(index, all_mp3[start:start+32000]) for index, start in enumerate(range(0, len(all_mp3), 32000))]
    # Get the destination path
    all_recs_sub = [f"{validated_recs_path}{path_sep}subfolder_{index}{path_sep}{i}" for index, group in all_grouped for i in group]

    # Create subfolders
    for index, sublist in enumerate(all_grouped):
        subfolder_path = os.path.join(validated_recs_path, f'subfolder_{index}')
        if not os.path.exists(subfolder_path):
            os.makedirs(subfolder_path)
        
    # Move the files to the subfolders with multithreading
    n_workers = 10
    chunksize = round(n_mp3 / n_workers)
    with ThreadPoolExecutor(n_workers) as exe:
        for i in range(0, len(all_root), chunksize):
            src_names = all_root[i:(i+chunksize)]
            dest_names = all_recs_sub[i:(i+chunksize)]
            _ = exe.submit(vxcproc.move_recs, src_names, dest_names)

    # Check if all mp3 and TextGrid files are moved into subfolders
    all_items = os.listdir(validated_recs_path)
    has_loose_files = any(
        os.path.isfile(os.path.join(validated_recs_path, item)) and 
        (item.lower().endswith('.mp3') or item.lower().endswith('.textgrid')) 
        for item in all_items
        )
    if has_loose_files:
        print("The validated folder still contains mp3 or TextGrid files.")
        print('')
    else:
        print("All mp3 and TextGrid files are moved to subfolders.")
        print('')

    # Print the MFA aligning commands
    for index, sublist in enumerate(all_grouped):
        subfolder_path = os.path.join(validated_recs_path, f'subfolder_{index}')
        cmd_align = f'mfa align --clean {subfolder_path} {dict_file_path} {acs_mod_path} {output_path}'
        print(f'To align split {index}, copy: \t' + cmd_align)
        print('')
else:
    cmd_align = f'mfa align --clean {validated_recs_path} {dict_file_path} {acs_mod_path} {output_path}'
    print('To align, copy: \t' + cmd_align)

## Finale

First, if splits were created for aligning the data, put the recordings back into one folder.

In [None]:
# After finishing the forced-alignment, move the files in the subfolders out into validated folder
if n_mp3 > 32000:
    n_workers = 10
    chunksize = round(n_mp3 / n_workers)
    with ThreadPoolExecutor(n_workers) as exe:
        for i in range(0, len(all_root), chunksize):
            src_names = all_recs_sub[i:(i+chunksize)]
            dest_names = all_root[i:(i+chunksize)]
            _ = exe.submit(vxcproc.move_recs, src_names, dest_names)
    
    # Delete the empty subfolders
    for index, sublist in enumerate(all_grouped):
        subfolder_path = os.path.join(validated_recs_path, f'subfolder_{index}')
        if os.path.exists(subfolder_path):
            shutil.rmtree(subfolder_path)

    all_items = os.listdir(validated_recs_path)
    contains_subdir = any(os.path.isdir(os.path.join(validated_recs_path, item)) for item in all_items)
    if contains_subdir:
        print("The validated folder still contains subfolders.")
    else:
        print("The validated folder does not contain any subfolders now.")

Then, move the output files (the speaker file, the lexicon, the acoustic model, and the aligned textgrids) to the OSF folder to be ready to upload.

In [None]:
# Make a zip file of the aligned textgrids
shutil.make_archive(txtgrds_path, 'zip', output_path)

# Copy the acoustic model
shutil.copy(acs_mod_path, osf_path + path_sep + 'acoustic_models' + path_sep)

# Copy the lexicon
shutil.copy(dict_file_path, osf_path + path_sep + 'lexicons' + path_sep)

# Copy the speaker file
shutil.copy(spkr_file_path, osf_path + path_sep + 'spkr_files' + path_sep)

Finally, update the tracking info in `VoxCommunis_Info.csv`. 

Make sure it is not in the lang_code_processing folder. Once updated, push the updated .csv to GitHub.

In [None]:
# If you have trained the model, set model_trained to 1
model_trained = 0
# If you have aligned the corpus, set aligned to 1
aligned = 0

# Paste the name of the outputs into the tracking file
cv_track = pd.read_csv(cv_tracking_file)
cv_track = cv_track.astype('string')
if model_trained == 1:
    cv_track.loc[cv_track['code_cv'] == lang_code, 'acoustic_model'] = acs_mod_name
else:
    cv_track.loc[cv_track['code_cv'] == lang_code, 'acoustic_model'] = ''
if aligned == 1:
    cv_track.loc[cv_track['code_cv'] == lang_code, 'textgrids'] = textgrid_folder_name
else:
    cv_track.loc[cv_track['code_cv'] == lang_code, 'textgrids'] = ''
cv_track.loc[cv_track['code_cv'] == lang_code, 'spkr_file'] = spkr_file_name
cv_track.loc[cv_track['code_cv'] == lang_code, 'lexicon'] = dict_file_name


# Update the tracking file
cv_track.to_csv(cv_tracking_file, index = False)