# Step 2.1: Parsing CORAAL Audio

This code splits the CORAAL audio files into utterance-level clips, using the utterance start and end times from the CSVs produced in Step 1.

## Required Packages

The following packages are necessary to run this code:
[pandas](https://pypi.org/project/pandas/), [parselmouth](https://parselmouth.readthedocs.io/en/stable/)

**Note:** When installing parselmouth, use *pip install praat-parselmouth* rather than *pip install parselmouth*. The latter installs a different, unrelated package.
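
As a quick sanity check before running the cells below, the following minimal sketch reports which of the required packages cannot be imported. The helper name `missing_packages` is hypothetical and not part of the original notebook; it relies only on the standard library.

```python
import importlib.util


def missing_packages(import_names):
    """Return the names in import_names that cannot be found for import."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]


# pandas and parselmouth are the import names used later in this notebook
missing = missing_packages(["pandas", "parselmouth"])
if missing:
    # parselmouth is distributed on PyPI under the name praat-parselmouth
    print(f"Missing packages: {missing}")
else:
    print("All required packages are importable.")
```

Note that this only checks that a module named `parselmouth` can be found; it does not tell the correct package apart from the unrelated one with the same import name.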

## Convert Audio Sampling Rates

Because CORAAL audio files are sampled at 44.1 kHz, they need to be converted to 16 kHz before they can be processed by the automatic speech recognition (ASR) services. The following instructions were written for macOS.
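
To see whether a given file still needs converting, its sampling rate can be read from the WAV header. The sketch below uses Python's standard `wave` module and assumes the files are uncompressed PCM WAV; the function names `wav_sample_rate` and `needs_conversion` are hypothetical helpers, not part of the original notebook.

```python
import wave


def wav_sample_rate(path):
    """Return the sampling rate (in Hz) stored in a WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_conversion(path, target_rate=16000):
    """Return True if the file's sampling rate differs from target_rate."""
    return wav_sample_rate(path) != target_rate
```

For example, `needs_conversion("some_file.wav")` would return `True` for an original 44.1 kHz recording and `False` for a file already resampled to 16 kHz.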

<ol>
    <li> Use <a href="https://brew.sh/">Homebrew</a> to install <a href="http://sox.sourceforge.net/">sox</a> from the command line (terminal).
        <ul>
            <li>brew install sox</li>
        </ul>
    </li>
    <li> Change the current working directory to the folder containing the audio files to be converted. </li>
    <li> Run this command in the command line: for file in *.wav; do sox "\$file" "16khz_$file" rate 16k; done  (taken from <a href="https://madskjeldgaard.dk/posts/sox-tutorial-batch-processing/">here</a>)</li>
</ol>

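The same batch conversion can also be driven from Python, which may be convenient when working inside this notebook. The following is a sketch only, assuming `sox` is installed and on the system path; `sox_command` and `convert_folder` are hypothetical helper names. Unlike the shell loop above, it skips files that already carry the `16khz_` prefix so that rerunning it does not convert the converted files again.

```python
import subprocess
from pathlib import Path


def sox_command(source, prefix="16khz_", rate="16k"):
    """Build the sox argument list that resamples one file next to the original."""
    source = Path(source)
    target = source.with_name(f"{prefix}{source.name}")
    return ["sox", str(source), str(target), "rate", rate]


def convert_folder(folder):
    """Resample every .wav file in folder, writing 16khz_<name>.wav alongside it."""
    for wav_path in sorted(Path(folder).glob("*.wav")):
        if wav_path.name.startswith("16khz_"):
            continue  # already converted (an addition to the original shell loop)
        subprocess.run(sox_command(wav_path), check=True)


# Example call (replace "path" with the folder containing the audio files):
# convert_folder("path")
```
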
## Parsing the Audio

In [None]:
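# Designate the input path where the gold standard CSVs are stored
# (end it with a slash, since filenames are appended directly to this string)
csv_input_path = "path"

# Designate the input path where the full audio files are stored
# (end it with a slash, since filenames are appended directly to this string)
audio_input_path = "path"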
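
The cells below build file locations by concatenating a path string and a filename, so a path without a trailing slash would produce a wrong location such as `dataclip.wav` instead of `data/clip.wav`. A small guard like the following sketch can normalize the paths first; `with_trailing_sep` is a hypothetical helper, not part of the original notebook.

```python
import os


def with_trailing_sep(path):
    """Return path unchanged if it is empty or already ends in a separator; otherwise append '/'."""
    if not path or path.endswith(("/", os.sep)):
        return path
    return path + "/"


# Example with an illustrative folder name: both spellings yield the same file location
print(with_trailing_sep("data/csvs") + "example.csv")
print(with_trailing_sep("data/csvs/") + "example.csv")
```
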

In [None]:
import parselmouth
import pandas as pd

### Feature: Ain't

In [None]:
# Designate the filename of the gold standard corpus CSV
csv_filename = "aint_variations_coraal_instances_GoldStandard.csv"

# Designate the output path where the parsed audio files will be stored
# (end it with a slash, since the clip filename is appended directly)
audio_output_path = "path"

# Read the CSV file into a pandas DataFrame
feature_instances_df = pd.read_csv(f"{csv_input_path}{csv_filename}")

# Loop through the DataFrame rows and save one sound clip per utterance
for row in feature_instances_df.itertuples():

    # Filename of the 16 kHz audio file for this row (string type)
    sound_filename = f"16khz_{row.File}.wav"

    # Line number of the utterance
    sound_line = row.Line

    # Number of feature instances in the utterance
    feature_count = row.FeatureCountPerLine

    # Utterance start time in seconds (float type)
    utt_start_time = row.UttStartTime

    # Utterance end time in seconds (float type)
    utt_end_time = row.UttEndTime

    # Create an instance of the Parselmouth Sound class from the audio file
    sound = parselmouth.Sound(f"{audio_input_path}{sound_filename}")

    # Extract the utterance from the full recording
    subsound = sound.extract_part(from_time=utt_start_time, to_time=utt_end_time, preserve_times=False)

    # Save the clip as a WAV file
    # The first argument is the output path plus the new filename, which has the form
    #   16khz_<File>_Line<Line>_FeatCount<FeatureCountPerLine>.wav
    subsound.save(f"{audio_output_path}{sound_filename[:-4]}_Line{sound_line}_FeatCount{feature_count}.wav", "WAV")

### Feature: Be

In [None]:
# Designate the filename of the gold standard corpus CSV
csv_filename = "be_coraal_instances_GoldStandard.csv"

# Designate the output path where the parsed audio files will be stored
# (end it with a slash, since the clip filename is appended directly)
audio_output_path = "path"

# Read the CSV file into a pandas DataFrame
feature_instances_df = pd.read_csv(f"{csv_input_path}{csv_filename}")

# Loop through the DataFrame rows and save one sound clip per utterance
for row in feature_instances_df.itertuples():

    # Filename of the 16 kHz audio file for this row (string type)
    sound_filename = f"16khz_{row.File}.wav"

    # Line number of the utterance
    sound_line = row.Line

    # Number of feature instances in the utterance
    feature_count = row.FeatureCountPerLine

    # Utterance start time in seconds (float type)
    utt_start_time = row.UttStartTime

    # Utterance end time in seconds (float type)
    utt_end_time = row.UttEndTime

    # Create an instance of the Parselmouth Sound class from the audio file
    sound = parselmouth.Sound(f"{audio_input_path}{sound_filename}")

    # Extract the utterance from the full recording
    subsound = sound.extract_part(from_time=utt_start_time, to_time=utt_end_time, preserve_times=False)

    # Save the clip as a WAV file
    # The first argument is the output path plus the new filename, which has the form
    #   16khz_<File>_Line<Line>_FeatCount<FeatureCountPerLine>.wav
    subsound.save(f"{audio_output_path}{sound_filename[:-4]}_Line{sound_line}_FeatCount{feature_count}.wav", "WAV")

### Feature: Done

In [None]:
# Designate the filename of the gold standard corpus CSV
csv_filename = "done_coraal_instances_GoldStandard.csv"

# Designate the output path where the parsed audio files will be stored
# (end it with a slash, since the clip filename is appended directly)
audio_output_path = "path"

# Read the CSV file into a pandas DataFrame
feature_instances_df = pd.read_csv(f"{csv_input_path}{csv_filename}")

# Loop through the DataFrame rows and save one sound clip per utterance
for row in feature_instances_df.itertuples():

    # Filename of the 16 kHz audio file for this row (string type)
    sound_filename = f"16khz_{row.File}.wav"

    # Line number of the utterance
    sound_line = row.Line

    # Number of feature instances in the utterance
    feature_count = row.FeatureCountPerLine

    # Utterance start time in seconds (float type)
    utt_start_time = row.UttStartTime

    # Utterance end time in seconds (float type)
    utt_end_time = row.UttEndTime

    # Create an instance of the Parselmouth Sound class from the audio file
    sound = parselmouth.Sound(f"{audio_input_path}{sound_filename}")

    # Extract the utterance from the full recording
    subsound = sound.extract_part(from_time=utt_start_time, to_time=utt_end_time, preserve_times=False)

    # Save the clip as a WAV file
    # The first argument is the output path plus the new filename, which has the form
    #   16khz_<File>_Line<Line>_FeatCount<FeatureCountPerLine>.wav
    subsound.save(f"{audio_output_path}{sound_filename[:-4]}_Line{sound_line}_FeatCount{feature_count}.wav", "WAV")
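
The three feature cells above differ only in the CSV filename and the output path, so the shared logic could be collected into one function. The sketch below is one possible refactoring, not the notebook's original method: `clip_filename` and `extract_clips` are hypothetical names, the column names are those used in the cells above, and rows are grouped by `File` so that each full recording is loaded once rather than once per utterance. It uses `os.path.join`, so trailing slashes on the paths are not required.

```python
import os


def clip_filename(file_id, line, feature_count):
    """Build the clip filename in the same form the cells above produce."""
    return f"16khz_{file_id}_Line{line}_FeatCount{feature_count}.wav"


def extract_clips(csv_path, audio_dir, output_dir):
    """Save one WAV clip per row of the gold standard CSV at csv_path."""
    # Imported here so that clip_filename can be used without these packages installed
    import pandas as pd
    import parselmouth

    feature_instances_df = pd.read_csv(csv_path)
    os.makedirs(output_dir, exist_ok=True)

    # Group rows by source recording so each audio file is read only once
    for file_id, rows in feature_instances_df.groupby("File"):
        sound = parselmouth.Sound(os.path.join(audio_dir, f"16khz_{file_id}.wav"))
        for row in rows.itertuples():
            clip = sound.extract_part(
                from_time=row.UttStartTime,
                to_time=row.UttEndTime,
                preserve_times=False,
            )
            out_name = clip_filename(file_id, row.Line, row.FeatureCountPerLine)
            clip.save(os.path.join(output_dir, out_name), "WAV")


# Example call (replace each "path" with the real folders):
# extract_clips("path/done_coraal_instances_GoldStandard.csv", "path", "path")
```
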