# VoxCommunis data processing pipeline

This notebook runs the Montreal Forced Aligner (MFA) on recordings from the Common Voice corpus.

To run this pipeline, you need:

1. Python modules: epitran, praatio, pandas, and numpy (to be installed), plus re, subprocess, shutil, and os (part of the Python standard library)
2. The data of the XPF corpus

The pipeline takes these steps to process data:

1. [Step 0: Setups](#step-0-setups)
2. [Step 1: Remap speakers](#step-1-remap-the-validated-speakers)
3. [Step 2: Create TextGrid and .wav files](#step-2-create-textgrid-files-and-wav-files-based-on-the-mp3-recordings-from-common-voice)
4. [Step 3: Prepare the lexicon](#step-3-prepare-the-lexicon)
5. [Step 4: G2P grapheme-to-phoneme](#step-4-g2p-grapheme-to-phoneme-epitran-or-xpf)
6. [Step 5: Validation](#step-5-train-the-acoustic-model)
7. [Step 6: Run MFA](#step-6-train-the-acoustic-model-and-forced-align)
8. [Finale](#finale)

This script was created by Miao Zhang (miao.zhang@uzh.ch), 22.12.2023

This script was modified by Miao Zhang, 07.02.2024 (Revalidation added)

Modified on 16.02.2024: added automatic log.

## Step 0. Setups
Import packages and set up file directories (for both the scripts and data).

In [None]:
# Import modules
import os, subprocess, shutil, re, csv, sys, importlib, multiprocessing, zipfile
import pandas as pd
# Turn Copy-On-Write on
pd.options.mode.copy_on_write = True
import numpy as np

# For creating textgrids
from praatio import textgrid

# For moving files concurrently
from concurrent.futures import ThreadPoolExecutor

# Import Path
from pathlib import Path

# Import Lock to zip output textgrids
from threading import Lock

# Import functions from vxc_processing.py
import vxc_processing as vxcproc


In [187]:
# Reload vxcproc in case there are any changes to the code
importlib.reload(vxcproc)

<module 'vxc_processing' from '/Users/miaozhang/Research/CorpusPhon/Scripts/vxc_pipeline/vxc_processing.py'>

## Step 0.1 Path setup

Set the paths and directories of data and scripts to use.

_IMPORTANT_: the folder of the corpus data you downloaded from Common Voice should be named {lang_code}_v{version_number}.
- For example: the folder for the 16th version of Dhivehi should be named dv_v16.
- Another example: the folder for the 15th version of Upper Sorbian should be named hsb_v15.
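
The naming convention above can be checked before anything else runs. The following is a minimal sketch; `corpus_folder_name` and `parse_corpus_folder` are hypothetical helpers that are not part of `vxc_processing.py`, and the regular expression assumes that language codes consist of letters and hyphens only.

```python
import re
from pathlib import Path

# Hypothetical pattern for {lang_code}_v{version_number}; assumes codes use letters and hyphens only.
FOLDER_PATTERN = re.compile(r"^(?P<lang_code>[A-Za-z-]+)_v(?P<version>\d+)$")


def corpus_folder_name(lang_code: str, version: str) -> str:
    """Build the folder name the pipeline expects, e.g. 'dv_v16'."""
    if not str(version).isdigit():
        raise ValueError(f"version must contain digits only, got {version!r}")
    return f"{lang_code}_v{version}"


def parse_corpus_folder(path: str) -> tuple[str, str]:
    """Split a folder such as '.../hsb_v15' into ('hsb', '15')."""
    match = FOLDER_PATTERN.match(Path(path).name)
    if match is None:
        raise ValueError(f"{path!r} does not follow {{lang_code}}_v{{version_number}}")
    return match["lang_code"], match["version"]


print(corpus_folder_name("dv", "16"))
print(parse_corpus_folder("/data/CommonVoice/hsb_v15"))
```

Failing early on a misnamed folder is cheaper than discovering the problem after the clips have been moved.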

In [206]:
###################################### Directories ################################################

# This is the directory where your data downloaded from Common Voice should be saved. This is the root directory where data from each language should be saved in individual folders.
# NO (BACK)SLASH at the end!!!
commonVoice_dir = '/Users/miaozhang/Research/CorpusPhon/CorpusData/CommonVoice' 
# For Eleanor
#commonVoice_dir = '/Users/eleanor/Documents/CommonVoice' 

# To use XPF as the G2P engine to process lexicon, you will need to download the XPF data from: https://github.com/CohenPr-XPF/XPF/tree/master/Data and save them on your computer.
# Specify the directory where your XPF data is saved.
# NO (BACK)SLASH at the end!!!
xpf_dir = '/Users/miaozhang/Research/CorpusPhon/CorpusData/G2P/XPF' 
# For Eleanor
#xpf_dir = '/Users/eleanorchodroff/Documents/CorpusData/G2P/XPF'

######################### Language name/code and Common Voice version ##############################

# Language-related variable names
# The Common Voice code of the language. Unfortunately, Common Voice mixes ISO 639-3 and ISO 639-1 codes (it uses BCP 47 codes). This code is also used in XPF.
# The code should match the code used in the name of the folder you downloaded from Common Voice.
lang_code = 'hu' 

# The version of the data in Common Voice
# Only numbers!!!
cv_mod_version = '17' # Which version of the Common Voice corpus is the model trained on?
cv_align_version = '17' # Which version of the Common Voice corpus is forced-aligned?

######################### G2P ######################################################################

# Specify the G2P engine. Only these keywords are acceptable: 
# 'xpf' for XPF
# 'epi' for Epitran
# 'chr' for Charsiu
# 'mfa' for MFA
# 'vxc' for self-defined lexicon
g2p = 'xpf'

######################### What writing system is the language using? ###############################

# Specify if the language is Chinese/Japanese/Korean: 1 if it is, 0 if not
if_cjk = 0

######################### Using existing model? ###############################

# Are you using an existing (pre-trained) model or training your own model?
# If using an existing model, set it to 1 and specify its path further below. If training your own model, set it to 0
if_self_mod = 0

######################### Using existing lexicon? ###############################

# Do you have your own prepared lexicon?
# If no, then set the value to 0
if_self_lex = 0

######################### G2P settings ################################################

# This is the tracking file, VoxCommunis_Info.csv (the relative path assumes it sits in the working directory):
cv_tracking_file = 'VoxCommunis_Info.csv'
with open(cv_tracking_file, 'r') as f:
    reader = csv.DictReader(f)
    lang_row = [row for row in reader if row['code_cv'] == lang_code][0]
    lang_name = lang_row['name_xpf'].replace(' ', '')

# Get the G2P processing code for the language
# (the original test `g2p == 'xpf' or 'chr'` was always true; test membership instead)
if g2p in ('xpf', 'chr'):
    # If you are using XPF, get the name of the language in XPF corpus
    if g2p == 'xpf':
        code_xpf = lang_row['code_xpf']
    # If you are using Charsiu, get the processing code for the language in Charsiu.
    elif g2p == 'chr':
        code_chr = lang_row['code_chr']

if g2p == 'epi':
    # If you are using epitran, ...
    # Please refer to VoxCommunis_Info.csv to get the processing code of the language in epitran
    # !!!Do this manually, since depending on the type of the orthography, the epitran code can differ!!!
    epi_code = 'ron-Latn'


# Specify whether a subversion of the corpus is used. The default is 0
if_subversion = 0
# If if_subversion == 1, specify the suffix to use.
# Ignore this part if you don't have a subversion of the corpus you are using.
subversion = '_' + 'sub3'

################################################################################################### 

# The folder for the language
language_dir = os.path.join(commonVoice_dir, lang_code + '_v' + cv_align_version)

# The file that contains the duration of each clip:
clip_info_path = os.path.join(language_dir, 'clip_durations.tsv')

# MFA paths
# This is where the acoustic model will be saved after MFA training is done (NO (BACK)SLASH at the end!!!):
mfa_mod_folder = '/Users/miaozhang/Documents/MFA/pretrained_models/acoustic'


############################################################################################################################


# This is where files that will be uploaded to the OSF repo will be saved after the processing is finished (NO (BACK)SLASH at the end!!!):
osf_path = '/Users/miaozhang/Research/CorpusPhon/CorpusData/VoxCommunis_OSF'


####################################################################################################################################
####################################################################################################################################

# Get the naming schema.
naming_schema = pd.read_csv('vxc_naming_schema.csv', usecols = ['Python_code'])['Python_code'].tolist()
naming_schema = [eval(name) for name in naming_schema]

# Get the names
acs_mod_name = naming_schema[0]
textgrid_folder_name = naming_schema[1]
word_file_name = naming_schema[2]
dict_file_name = naming_schema[3]
spkr_file_name = naming_schema[4]

# Get the paths
textgrid_folder_path = os.path.join(language_dir, textgrid_folder_name)
word_file_path = os.path.join(language_dir, word_file_name)
dict_file_path = os.path.join(language_dir, dict_file_name)
spkr_file_path = os.path.join(language_dir, spkr_file_name)
del naming_schema

###################################################################################################################
###################################################################################################################

# For step 3: prepare the lexicon and pronunciation dictionary
validated_log = os.path.join(language_dir, 'validated.tsv')

###################################################################################################################
###################################################################################################################

# For step 4: G2P
if g2p == 'xpf':
    xpf_translater_path = 'xpf_translate04.py'
    rule_file_path = os.path.join(xpf_dir, code_xpf + '_' + lang_name, code_xpf + '.rules')
    verify_file_path = os.path.join(xpf_dir, code_xpf + '_' + lang_name, code_xpf + '.verify.csv')
elif g2p == 'epi':
    epitran_translater_path = 'epi_run.py'
elif g2p == 'chr':
    from transformers import T5ForConditionalGeneration, AutoTokenizer
    chr_model = T5ForConditionalGeneration.from_pretrained('charsiu/g2p_multilingual_byT5_tiny_16_layers_100')
    chr_tok = AutoTokenizer.from_pretrained('google/byt5-small')

###################################################################################################################
###################################################################################################################

# For step 6: running MFA
# Validate the corpus
validated_recs_path = os.path.join(language_dir, 'validated')
if if_subversion == 0:
    acs_mod_path = os.path.join(mfa_mod_folder, acs_mod_name)
else:
    # Escape the dot and anchor at the end so that only the '.zip' extension is replaced
    acs_mod_name = re.sub(r'\.zip$', subversion + '.zip', acs_mod_name)
    acs_mod_path = os.path.join(mfa_mod_folder, acs_mod_name)
output_path = os.path.join(language_dir, 'output')

if if_self_mod == 1:
    # Specify the path of the model
    acs_mod_path = '/Users/miaozhang/Documents/MFA/pretrained_models/acoustic/ca_vxc_acoustic16.zip'
if if_self_lex == 1:
    # Specify the path of the lexicon
    dict_file_path = os.path.join(language_dir, 'ca_lexicon-IPA.txt')   

mfa_align_script_path = '/Users/miaozhang/Research/CorpusPhon/Scripts/vxc_pipeline/mfa_align.sh'

###################################################################################################################
###################################################################################################################

# Finale:
txtgrds_path = os.path.join(osf_path, 'textgrids', textgrid_folder_name)

###################################################################################################################
###################################################################################################################

print("Processing the folder:\t", language_dir)
print("The acoustic model to be trained/used:\t", acs_mod_path)
print("The lexicon to be generated/used:\t", dict_file_path)
print("The speaker file to be generated:\t", spkr_file_path)
print("The textgrid files to be generated:\t", txtgrds_path)

Processing the folder:	 /Users/miaozhang/Research/CorpusPhon/CorpusData/CommonVoice/hu_v17
The acoustic model to be trained/used:	 /Users/miaozhang/Documents/MFA/pretrained_models/acoustic/hu_xpf_acoustic17.zip
The lexicon to be generated/used:	 /Users/miaozhang/Research/CorpusPhon/CorpusData/CommonVoice/hu_v17/hu_xpf_lexicon17.txt
The speaker file to be generated:	 /Users/miaozhang/Research/CorpusPhon/CorpusData/CommonVoice/hu_v17/hu_xpf_spkr17.tsv
The textgrid files to be generated:	 /Users/miaozhang/Research/CorpusPhon/CorpusData/VoxCommunis_OSF/textgrids/hu_xpf_textgrids17_acoustic17.zip


## Step 1. Speaker remapping
Get speaker IDs to put on TextGrids for speaker adaptation.
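
The idea behind remapping can be sketched as follows: each distinct speaker hash receives a short integer ID in order of first appearance. This is only an illustration under stated assumptions; `remap_speakers` is a hypothetical function, the actual logic lives in `vxcproc.remap_spkr` and may differ, and the example hashes are made up.

```python
def remap_speakers(client_ids):
    """Map each distinct speaker hash to a small integer ID (1, 2, 3, ...)."""
    mapping = {}
    for client_id in client_ids:
        # A new speaker gets the next free number; known speakers keep theirs.
        mapping.setdefault(client_id, len(mapping) + 1)
    return [mapping[client_id] for client_id in client_ids], mapping


# Made-up speaker hashes for three clips, two of them from the same speaker
speaker_ids, speaker_map = remap_speakers(["a3f9", "77bc", "a3f9"])
print(speaker_ids)
print(speaker_map)
```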

In [197]:
# Remap the speakers and save the validated recordings with their remapped speaker IDs to the speaker file
valid = vxcproc.remap_spkr(language_dir, spkr_file_path, lang_code, output=True)
print(f'There are {len(valid)} validated recordings in total for {lang_name}.')

There are 1524 validated recordings in total for Albanian.


## Step 2. TextGrid files

All validated clips that are longer than 1s will be moved to a subfolder called 'validated'.

Validated clips that are shorter than 1s will be moved to the 'other' folder.

The invalidated clips will stay in the 'clips' folder. When the moving is done, the 'clips' folder will be renamed to 'invalidated'.
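
The sorting rule above can be written as a small decision function. This is a sketch, not the code in `vxcproc.move_and_create_tg`: `destination_folder` is a hypothetical name, durations are assumed to be given in milliseconds, and a clip of exactly 1s is treated as too short (the text does not say which way that case goes).

```python
MIN_DURATION_MS = 1000  # the 1s threshold, assuming durations are in milliseconds


def destination_folder(clip_name, duration_ms, validated_names):
    """Return the folder a clip should end up in under the rule described above."""
    if clip_name not in validated_names:
        return "clips"  # invalidated clips stay where they are
    # Assumption: a clip of exactly 1s counts as too short
    return "validated" if duration_ms > MIN_DURATION_MS else "other"


validated_demo = {"a.mp3", "b.mp3"}
print(destination_folder("a.mp3", 2500, validated_demo))
print(destination_folder("b.mp3", 800, validated_demo))
print(destination_folder("c.mp3", 3000, validated_demo))
```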

In [185]:
# Make the folder for validated clips
os.makedirs(validated_recs_path, exist_ok=True)

# Split the clips into chunks so that clip moving and textgrid creation run in batches
n_clips = len(valid)
n_workers = 10
chunksize = max(1, round(n_clips / n_workers))  # never 0, so range() below always gets a valid step

# Move the clips and create textgrid files:
with ThreadPoolExecutor(n_workers) as exe:
    for i in range(0, n_clips, chunksize):
        # iloc slices by position and excludes the end, so neighbouring chunks do not share a row
        chunk_data = valid.iloc[i:(i + chunksize)]
        _ = exe.submit(vxcproc.move_and_create_tg, chunk_data)

## Step 3. Word list
Generate the word list from Common Voice transcripts.
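
As an illustration of what building a word list involves, the sketch below lowercases each transcript, replaces punctuation with spaces, and collects the unique words. `extract_words` is a hypothetical stand-in; `vxcproc.process_words` and `vxcproc.remove_unwanted_words` apply their own language-specific rules that are not reproduced here. Keeping apostrophes inside words is an assumption.

```python
import unicodedata


def extract_words(sentences):
    """Collect the unique lowercase words of the given sentences."""
    words = set()
    for sentence in sentences:
        # Keep letters (L), combining marks (M), digits (N) and apostrophes;
        # everything else (punctuation, symbols, whitespace) becomes a space.
        cleaned = "".join(
            ch if (unicodedata.category(ch)[0] in "LMN" or ch == "'") else " "
            for ch in sentence.lower()
        )
        words.update(cleaned.split())
    return sorted(words)


print(extract_words(["Jó napot, világ!", "Jó estét."]))
```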

In [188]:
# Remove punctuations
if lang_code == 'ja':
    words = vxcproc.process_words(spkr_file_path, lang_code)
else:
    words = vxcproc.process_words(validated_log, lang_code)

# Filter out other unwanted words
words = vxcproc.remove_unwanted_words(words, lang_code, if_cjk)

# Save the word list as a .txt file
if os.path.exists(word_file_path):
    os.remove(word_file_path)
    
with open(word_file_path,'w') as word_file:
    for word in words:
        word_file.write(word + "\n")

## Step 4. G2P
If you use XPF, you need three files to proceed:
1. A G2P rule file
2. A verification file
3. The translator Python script
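
Before calling the translator, it can help to confirm that all of these files are in place. The sketch below is a hypothetical helper (`build_xpf_command` is not part of the pipeline); the `-l`, `-c`, and `-r` flags are the ones this notebook passes to the XPF script, and the file names in the demo are placeholders.

```python
import tempfile
from pathlib import Path


def build_xpf_command(translator, rules, verify, word_list):
    """Check that every input exists, then return the argument list for subprocess.run()."""
    required = {
        "translator script": translator,
        "rule file": rules,
        "verification file": verify,
        "word list": word_list,
    }
    missing = [f"{label}: {path}" for label, path in required.items() if not Path(path).is_file()]
    if missing:
        raise FileNotFoundError("Missing XPF inputs:\n" + "\n".join(missing))
    return ["python", str(translator), "-l", str(rules), "-c", str(verify), "-r", str(word_list)]


# Demo with placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = []
    for name in ("translate.py", "hu.rules", "hu.verify.csv", "words.txt"):
        path = Path(tmp, name)
        path.touch()
        demo_paths.append(path)
    demo_cmd = build_xpf_command(*demo_paths)
    print(demo_cmd)
```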

In [189]:
if os.path.exists(dict_file_path):
    os.remove(dict_file_path)
   
# Get the dictionary using XPF
# -l specifies the rule file
# -c specifies the verification file
# -r specifies the file to be translated
if g2p == 'xpf':
    g2p_cmd = ["python", xpf_translater_path, "-l", rule_file_path, "-c", verify_file_path, "-r", word_file_path] # XPF translating command that will be sent to subprocess.run() to execute.

    with open(dict_file_path,'w') as dict_file:
        subprocess.run(g2p_cmd, stdout = dict_file) # stdout = ... means to send the output to the file (so you have to open this file first as above)

    # Remove every entry that contains '@' from the lexicon (if there is any). '@' marks an XPF G2P failure
    with open(dict_file_path, "r") as dict_file:
        lexicon_lines = dict_file.read().split("\n")  # avoid shadowing the built-in dict

    with open(dict_file_path, 'w') as dict_file:
        for line in lexicon_lines:
            # Attach the length mark to the preceding phone
            line = re.sub(" ː", "ː", line)
            # Get rid of words that contain sounds XPF can't figure out
            if '@' not in line:
                dict_file.write(line + "\n")

# Or using Epitran
elif g2p == 'epi':
    g2p_cmd = ["python", epitran_translater_path, word_file_path, dict_file_path, epi_code]
    subprocess.run(g2p_cmd)

# Or use Charsiu
elif g2p == 'chr':
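    # Generate the pronunciation
    # Prefix each word with the Charsiu language code
    chr_words = [f'<{code_chr}>: '+i for i in words]

    # Tokenize the prefixed words (not the bare words), so that the language code reaches the model
    out = chr_tok(chr_words, padding = True, add_special_tokens = False, return_tensors = 'pt')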

    preds = chr_model.generate(**out, num_beams = 1, max_length = 50) # We do not find beam search helpful. Greedy decoding is enough. 
    phones = chr_tok.batch_decode(preds.tolist(), skip_special_tokens = True)

    # Separate the IPA symbols with white space
    from ipatok import tokenise
    phones = [tokenise(phone) for phone in phones]
    phones = [' '.join(phone) for phone in phones]

    # Save the output
    lexicon_lines = []  # avoid shadowing the built-in dict
    for word, w in zip(words, phones):
        lexicon_lines.append(word + '\t' + w)

    with open(dict_file_path, 'w') as dict_file:
        for line in lexicon_lines:
            dict_file.write(line + "\n")

# Or use the pretrained MFA G2P model
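elif g2p == 'mfa':
    # NOTE: mfa_g2p_path (the MFA G2P model to use) is not defined in Step 0; set it before running this branch
    cmd_mfa_g2p = f'mfa g2p {word_file_path} {mfa_g2p_path} {dict_file_path}'  # If using a word list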
    print('To g2p, copy and run:\t', cmd_mfa_g2p)

However, for some languages, you probably want to use the lexicon and the model from MFA or something of your own.
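
For reference, the post-processing applied to the XPF output in this step can be isolated into a small function. This is a sketch mirroring the notebook logic, with one difference that is an assumption of this example: blank lines are dropped as well. `clean_xpf_lexicon` is a hypothetical name and the sample entries are made up.

```python
def clean_xpf_lexicon(lines):
    """Reattach stray length marks and drop entries that XPF could not transcribe."""
    cleaned = []
    for line in lines:
        # XPF may separate the length mark from its phone: 'u ː' becomes 'uː'
        line = line.replace(" ː", "ː")
        # '@' marks a symbol XPF could not map; blank lines are dropped too
        if line.strip() and "@" not in line:
            cleaned.append(line)
    return cleaned


raw_demo = ["ház\th aː z", "út\tu ː t", "xyz\t@ y z", ""]
cleaned_demo = clean_xpf_lexicon(raw_demo)
print(cleaned_demo)
```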

## Step 5. Validate

First, you need to activate the MFA environment in the terminal.
1. Press ctrl+` to open Terminal in VS Code.
2. Run 'conda activate aligner' and check that '(aligner)' appears at the beginning of the line in Terminal.
3. When you have finished using MFA (both training and aligning), run 'conda deactivate' to shut down the MFA environment.

In [None]:
# Create the MFA folder in your Documents folder
# You DON'T need to run this if you already have an MFA folder in your Documents folder (What would this be like on Windows?)
# Uncomment the command below to run:
#!mfa model download acoustic english.zip

To validate the corpus, run this line in terminal: 

        mfa validate {wherever your validated recordings are} {wherever your lexicon file is} --ignore_acoustics --clean

You can copy the command from the output below.
The notebook can't handle ```mfa``` commands. MFA commands can only be run in Terminal.
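
Since the command is pasted into a shell, paths that contain spaces need quoting. The sketch below builds the same `mfa validate` command with `shlex.join` from the standard library (Python 3.8+); `build_validate_command` is a hypothetical helper and the paths are placeholders.

```python
import shlex


def build_validate_command(corpus_dir, lexicon_path):
    """Return a shell-safe 'mfa validate' command string to copy into Terminal."""
    args = ["mfa", "validate", corpus_dir, lexicon_path, "--ignore_acoustics", "--clean"]
    # shlex.join quotes every argument that contains spaces or other special characters
    return shlex.join(args)


print(build_validate_command("/data/hu_v17/validated", "/data/hu_v17/hu_xpf_lexicon17.txt"))
print(build_validate_command("/data/My Corpus/validated", "lex.txt"))
```

The same approach applies to the `mfa train` and `mfa align` commands printed later in this notebook.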

In [207]:
cmd_validate = f'mfa validate {validated_recs_path} {dict_file_path} --ignore_acoustics --clean'
print('To validate, copy:\t' + cmd_validate)

To validate, copy:	mfa validate /Users/miaozhang/Research/CorpusPhon/CorpusData/CommonVoice/hu_v17/validated /Users/miaozhang/Research/CorpusPhon/CorpusData/CommonVoice/hu_v17/hu_xpf_lexicon17.txt --ignore_acoustics --clean


## Step 6. MFA.

### Step 6.1. Train the model

        mfa train --clean {where your validated recordings are} {where your lexicon file is} {where your model will be saved}

You can copy the command from the output below.
The notebook can't handle ```mfa``` commands. The mfa command above can only be run in Terminal.

In [208]:
# Train your own model
cmd_train = f'mfa train --clean {validated_recs_path} {dict_file_path} {acs_mod_path}'
print('To train, copy: \t' + cmd_train)

To train, copy: 	mfa train --clean /Users/miaozhang/Research/CorpusPhon/CorpusData/CommonVoice/hu_v17/validated /Users/miaozhang/Research/CorpusPhon/CorpusData/CommonVoice/hu_v17/hu_xpf_lexicon17.txt /Users/miaozhang/Documents/MFA/pretrained_models/acoustic/hu_xpf_acoustic17.zip


### Step 6.2. Forced-alignment

        mfa align --clean {where your validated recordings are} {where your lexicon file is} {where your acoustic model is} {where your output will be saved}
        
When the model is trained, align the corpus.

However, since MFA alignment stops after generating 32609 TextGrid files (for reasons not yet identified), we split the corpus into n subfolders, each containing at most 32000 files.
If the corpus has more than 32000 recordings, move the mp3 and TextGrid files into subfolders.
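
One way to group recordings so that no subfolder exceeds the limit is integer division on the clip position. This is a sketch only: the notebook reads the grouping from a `subfolder` column in `valid`, and how that column is produced is not shown here. `assign_subfolders` and the `split_01`-style names are hypothetical.

```python
MAX_FILES_PER_FOLDER = 32000


def assign_subfolders(clip_names, max_files=MAX_FILES_PER_FOLDER):
    """Map each clip to a subfolder so that no subfolder holds more than max_files clips."""
    return {
        name: f"split_{index // max_files + 1:02d}"  # positions 0..max_files-1 go to split_01, etc.
        for index, name in enumerate(clip_names)
    }


# Demo with a tiny limit so that the grouping is visible
demo_groups = assign_subfolders(["a.mp3", "b.mp3", "c.mp3", "d.mp3", "e.mp3"], max_files=2)
print(demo_groups)
```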

In [None]:
# Get all mp3 files in the validated folder
all_mp3 = [item for item in os.listdir(validated_recs_path) if os.path.splitext(item)[1] == '.mp3']
n_clips = len(all_mp3)
print(f"There are {n_clips} clips in the validated folder.")
n_valid = len(valid)

if n_clips > 32000:
    # Create subfolders
    subfolders = valid['subfolder'].unique()
    for subfolder in subfolders:
        subfolder_path = os.path.join(validated_recs_path, subfolder)
        if not os.path.exists(subfolder_path):
            os.makedirs(subfolder_path)

    # Create the paths in the subfolders for each recording according to their grouping
    splits = valid[valid['path'].isin(all_mp3)]
    splits.to_csv(os.path.join(language_dir, 'all_splits.csv'), index = False)

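    # Move the files into subfolders using multiple threads
    n_workers = 10
    chunksize = max(1, round(len(splits) / n_workers))  # never 0, so range() below always gets a valid step
    with ThreadPoolExecutor(n_workers) as exe:
        for i in range(0, len(splits), chunksize):
            # splits keeps the row labels of valid, so slice by position (iloc) rather than by label (loc)
            chunk_data = splits.iloc[i:(i + chunksize)]
            _ = exe.submit(vxcproc.split_recs, chunk_data)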

Check if all mp3 and textgrid files are moved to subfolders, and check if there are any overlapping file names across the subfolders.

In [None]:
# Check if all mp3 and textgrid files are moved to the subfolders
if n_valid > 32000:
    # If there are still files left in the root directory, move them into their subfolders
    rest_mp3 = [item for item in os.listdir(validated_recs_path) if os.path.splitext(item)[1] == '.mp3']
    rest_move = valid[valid['path'].isin(rest_mp3)]
    vxcproc.split_recs(rest_move)
    del rest_move, rest_mp3
    
    # Check if there are still mp3 or textgrid files in the root directory
    contains_subdir = any(
        os.path.isfile(os.path.join(validated_recs_path, item)) and 
        (item.lower().endswith('.mp3') or item.lower().endswith('.textgrid')) 
        for item in os.listdir(validated_recs_path)
        )
    if contains_subdir:
        print("The validated folder still contains mp3 or TextGrid files.")
        print('')
        
    else:
        print("All mp3 and TextGrid files have been moved to subfolders.")
        print('')

    # Check if there are overlapping file names across the subfolders
    overlap_dict = vxcproc.check_file_overlaps(validated_recs_path)
    if len(overlap_dict) == 0:
        print("There are no overlapping file names across the subfolders.") 
    else:
        print(overlap_dict)

Print out the MFA commands to align the data in (each subfolder of) the validated folder.

In [202]:
# Print the MFA commands for alignment
all_items = os.listdir(validated_recs_path)
all_items = [file for file in all_items if '.DS_Store' not in file]
all_items.sort()
any_file = any(os.path.isfile(os.path.join(validated_recs_path, item)) for item in all_items)
if not any_file:
    # Use a bash script to automatically align the data in all subfolders. Remember to activate the MFA virtual environment: conda activate aligner
    print(f'Copy and run this in the terminal to grant execution permission to the script:\tchmod +x {mfa_align_script_path}', '\n')
    print(f'Copy and run this in the terminal to align the data in all subfolders:\tbash {mfa_align_script_path} {validated_recs_path} {dict_file_path} {acs_mod_path} {output_path}', '\n')
else:  
    cmd_align = f'mfa align --clean {validated_recs_path} {dict_file_path} {acs_mod_path} {output_path}'
    print('To align, copy: \t' + cmd_align)

To align, copy: 	mfa align --clean /Users/miaozhang/Research/CorpusPhon/CorpusData/CommonVoice/ro_v17/validated /Users/miaozhang/Research/CorpusPhon/CorpusData/CommonVoice/ro_v17/ro_xpf_lexicon17.txt /Users/miaozhang/Documents/MFA/pretrained_models/acoustic/ro_xpf_acoustic17.zip /Users/miaozhang/Research/CorpusPhon/CorpusData/CommonVoice/ro_v17/output


### Step 6.3: (optional) Put the recordings back into the validated folder

When the alignment is done, and if the data were split into subfolders for alignment, put the recordings back into one single folder.

In [None]:
if n_valid > 32000:
    n_workers = 10
    chunksize = max(1, round(len(valid) / n_workers))
    with ThreadPoolExecutor(n_workers) as exe:
        for i in range(0, len(valid), chunksize):
            # iloc slices by position and excludes the end, so neighbouring chunks do not share a row
            chunk_data = valid.iloc[i:(i + chunksize)]
            _ = exe.submit(vxcproc.merge_recs, chunk_data)

Check if all files have been moved back to the validated folder's root directory.

In [None]:
# !!!If it reports that some subfolders are still undeleted, run this block a second time! It should move any remaining files from the subfolders back to the validated folder.

if n_valid > 32000:
    # Use os.scandir() for better performance
    with os.scandir(validated_recs_path) as entries:
        subfolders = [entry.name for entry in entries if entry.is_dir()]
        subfolders.sort()

    # Lists to store undeleted subfolders and files
    undeleted_subfolders = []

    # Batch deletion of empty subfolders
    for subfolder in subfolders:
        subfolder_path = os.path.join(validated_recs_path, subfolder)
        with os.scandir(subfolder_path) as sub_entries:
            if not any(entry.is_file() for entry in sub_entries):
                # If the subfolder does not contain any files, delete it
                shutil.rmtree(subfolder_path)
                print(f"Subfolder '{subfolder}' deleted because it contains no files.")
            else:
                undeleted_subfolders.append(subfolder)

    print("Subfolders checked and processed.")

    # List undeleted subfolders
    if len(undeleted_subfolders) > 0:
        print("Undeleted subfolders:")
        for subfolder in undeleted_subfolders:
            print(subfolder)
            # Move the files in the uncleared subfolder back to the validated folder if there is any
            vxcproc.move_files_to_root(validated_recs_path, os.path.join(validated_recs_path, subfolder))
        
    # Check if all the subfolders are deleted
    with os.scandir(validated_recs_path) as entries:
        contains_subdir = any(entry.is_dir() for entry in entries)
        if contains_subdir:
            print("\nThe validated folder still contains subfolders.")
        else:
            print("\nThe validated folder does not contain any subfolders now.")

Check if the output and input files match.
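
The check amounts to comparing file names without their extensions. The sketch below is a hypothetical illustration (`compare_stems` is not `vxcproc.compare_inout`, whose implementation is not shown in this notebook); it assumes recordings end in `.mp3` and alignments in `.TextGrid`.

```python
import tempfile
from pathlib import Path


def compare_stems(recordings_dir, output_dir):
    """Return (recordings without a TextGrid, TextGrids without a recording)."""
    recordings = {p.stem for p in Path(recordings_dir).glob("*.mp3")}
    # rglob also finds TextGrids that sit in subfolders of the output folder
    textgrids = {p.stem for p in Path(output_dir).rglob("*.TextGrid")}
    return sorted(recordings - textgrids), sorted(textgrids - recordings)


# Demo with placeholder files: clip_2 has no TextGrid
with tempfile.TemporaryDirectory() as tmp:
    rec_dir, out_dir = Path(tmp, "validated"), Path(tmp, "output")
    rec_dir.mkdir()
    out_dir.mkdir()
    for name in ("clip_1.mp3", "clip_2.mp3"):
        (rec_dir / name).touch()
    (out_dir / "clip_1.TextGrid").touch()
    missing_textgrids, orphan_textgrids = compare_stems(rec_dir, out_dir)

print(missing_textgrids, orphan_textgrids)
```

Listing the unmatched names, rather than only comparing counts, shows directly which clips need to be realigned.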

In [203]:
with multiprocessing.Pool() as pool:
    result = pool.apply(vxcproc.compare_inout, args=(output_path, validated_recs_path))
print(result)

There are 17737 mp3 files in the validated folder.
There are 17737 textgrid files in the output folder.
(True, 'The recordings in the validated folder and the textgrids in the output folder match.')


## Finale

Then, move the output files (the speaker file, the lexicon, the acoustic model, and the aligned textgrids) to the OSF folder to be ready to upload.
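
The archive is written from several threads, so writes to the shared zip handle have to be serialized with a lock. The following sketch shows the general pattern; `add_file_locked` is a hypothetical function and not necessarily how `vxcproc.add_file` is implemented.

```python
import os
import tempfile
import zipfile
from concurrent.futures import ThreadPoolExecutor
from threading import Lock


def add_file_locked(lock, handle, file_path, root_dir):
    """Read a file without holding the lock, then write it to the archive under the lock."""
    arcname = os.path.relpath(file_path, root_dir)
    with open(file_path, "rb") as f:
        data = f.read()  # reading can happen in parallel
    with lock:
        handle.writestr(arcname, data)  # only one thread writes to the zip at a time


# Demo with placeholder TextGrid files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "output")
    os.makedirs(src)
    for i in range(5):
        with open(os.path.join(src, f"clip_{i}.TextGrid"), "w") as f:
            f.write("demo")
    files = [os.path.join(src, name) for name in os.listdir(src)]
    archive = os.path.join(tmp, "textgrids.zip")
    lock = Lock()
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as handle:
        with ThreadPoolExecutor(4) as exe:
            futures = [exe.submit(add_file_locked, lock, handle, path, src) for path in files]
            for future in futures:
                future.result()  # re-raises any exception from a worker thread
    with zipfile.ZipFile(archive) as handle:
        archived = sorted(handle.namelist())

print(archived)
```

Calling `result()` on each future matters: without it, an exception raised inside a worker thread is silently discarded and the archive may be incomplete.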

In [204]:
# Make a .zip archive of the aligned textgrids
# List all files to add to the zip
tgfiles = [os.path.join(output_path, filename) for filename in os.listdir(output_path)]
# create lock for adding files to the zip
lock = Lock()
# open the zip file
with zipfile.ZipFile(txtgrds_path, 'w', compression=zipfile.ZIP_DEFLATED) as handle:
    # create the thread pool
    with ThreadPoolExecutor(10) as exe:
        # add all files to the zip archive
        _ = [exe.submit(vxcproc.add_file, lock, handle, tg, output_path) for tg in tgfiles]

# Copy the acoustic model
shutil.copy(acs_mod_path, os.path.join(osf_path, 'acoustic_models'))

# Copy the lexicon
shutil.copy(dict_file_path, os.path.join(osf_path, 'lexicons'))

# Copy the speaker file
shutil.copy(spkr_file_path, os.path.join(osf_path, 'spkr_files'))

'/Users/miaozhang/Research/CorpusPhon/CorpusData/VoxCommunis_OSF/spkr_files/ro_xpf_spkr17.tsv'

Finally, update the tracking info in `VoxCommunis_Info.csv`.

Make sure it is not in the lang_code_processing folder. Once updated, push the updated .csv to GitHub.

In [205]:
# If you have trained the model, set this to 1
model_trained = 1
# If you have aligned the corpus, set this to 1
aligned = 1

# Paste the name of the outputs into the tracking file
cv_track = pd.read_csv(cv_tracking_file)
cv_track = cv_track.astype('string')
if model_trained == 1:
    cv_track.loc[cv_track['code_cv'] == lang_code, 'acoustic_model'] = acs_mod_name
else:
    cv_track.loc[cv_track['code_cv'] == lang_code, 'acoustic_model'] = ''
if aligned == 1:
    cv_track.loc[cv_track['code_cv'] == lang_code, 'textgrids'] = textgrid_folder_name
else:
    cv_track.loc[cv_track['code_cv'] == lang_code, 'textgrids'] = ''
cv_track.loc[cv_track['code_cv'] == lang_code, 'spkr_file'] = spkr_file_name
cv_track.loc[cv_track['code_cv'] == lang_code, 'lexicon'] = dict_file_name


# Update the tracking file
cv_track.to_csv(cv_tracking_file, index = False)

# Sample ten files to check the alignment afterwards
post_check = valid.sample(10, random_state=42)
post_check.to_csv(os.path.join(language_dir, "post_check.csv"), index = False)