# Step 2.3: Getting Speech Rate

This code computes the speech rate of each utterance. It counts the syllables in the utterance and divides that count by the utterance length in seconds, which gives a rate in syllables per second.
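
As a quick illustration of the calculation, here is a minimal sketch with made-up numbers. The variable names and values are hypothetical and are not taken from the gold standard CSVs.

```python
# Hypothetical values, for illustration only.
syllable_count = 12       # syllables counted in one utterance
utterance_length = 3.0    # utterance duration in seconds

# Speech rate in syllables per second, matching the division used
# in the dataframe loops later in this notebook.
speech_rate = syllable_count / utterance_length

print(speech_rate)  # 4.0
```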

## Required Packages

The following packages are necessary to run this code:
re, string, [pandas](https://pypi.org/project/pandas/), [numpy](https://pypi.org/project/numpy/)

## References

The pronunciation dictionary used in this code is adapted from the CMU Pronouncing Dictionary, which transcribes words with ARPABET symbols: http://www.speech.cs.cmu.edu/cgi-bin/cmudict. The modified dictionary file used in this code can be found [here](https://www.dropbox.com/s/fz0bxjpsrdghipn/pronunciation_dictionary.txt?dl=0).

## Initial Set-Up

In [None]:
#import the necessary packages
import re
import pandas as pd
import numpy as np
from string import punctuation

In [None]:
# path to the pronunciation dictionary
dictionary_path = "path"


# these are the paths where the CSVs created in Step 2-2 are stored
aint_gold_standard_csv_path = "path"

be_gold_standard_csv_path = "path"

done_gold_standard_csv_path = "path"



#This will create the dataframe from the csv
aint_gs_df = pd.read_csv(aint_gold_standard_csv_path)

be_gs_df = pd.read_csv(be_gold_standard_csv_path)

done_gs_df = pd.read_csv(done_gold_standard_csv_path)

## Creating the Pronunciation Dictionary

This section reads the pronunciation dictionary text file and builds a Python dictionary from it. The original dictionary uses the phonemes of the ARPABET system (which the pronunciation dictionary of the [Montreal Forced Aligner](https://montreal-forced-aligner.readthedocs.io/en/latest/) is based on) and can be found [here](http://www.speech.cs.cmu.edu/cgi-bin/cmudict). Each vowel phoneme is followed by a digit that marks its stress in the word: 0 for no stress, 1 for primary stress, and 2 for secondary stress.

The modified dictionary file used in this code can be found [here](https://www.dropbox.com/s/fz0bxjpsrdghipn/pronunciation_dictionary.txt?dl=0).
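
The parsing step can be sketched on a few made-up lines. This is a simplified illustration, not the notebook's exact code: the sample lines, the `pronunciation_dict` name, and the use of `split(maxsplit=1)` with `setdefault` are assumptions, and the real dictionary file may be laid out differently.

```python
# Made-up lines in a "WORD<whitespace>PHONEMES" layout (hypothetical data).
sample_lines = [
    "HELLO\tHH AH0 L OW1\n",
    "HELLO\tHH EH0 L OW1\n",   # second pronunciation of the same word
    "WORLD\tW ER1 L D\n",
]

pronunciation_dict = {}

for line in sorted(sample_lines):
    # Split once on the first run of whitespace: word, then phonemes.
    parts = line.split(maxsplit=1)
    if len(parts) < 2:
        continue  # skip lines without a phoneme string
    word, phonemes = parts
    # setdefault keeps the first pronunciation seen for each word.
    pronunciation_dict.setdefault(word, phonemes.split())

print(pronunciation_dict["HELLO"])  # ['HH', 'AH0', 'L', 'OW1']
```

Because only the first entry for a word is kept, a later pronunciation with a different number of vowels would be ignored.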

In [None]:
# reads in the dictionary text file as a list
with open(dictionary_path, "r") as i:
    
    dict_lines = i.readlines()

    
# removes tabs in the list lines
dict_lines = [re.sub(r"\t", " ", x) for x in dict_lines]


#creates an empty list to append to
new_lines = []


# loops through the sorted lines
for line in sorted(dict_lines):
    
    # finds the index of the first space, which separates the word from
    # the phonemes associated with it
    space_index = line.find(" ")
    
    # makes a line with the @@ symbol to be used as a splitter 
    line = f"{line[:space_index]}@@{line[space_index:].strip()}"
    
    # appends the new line to the list
    new_lines.append(line)

    
#creates a list of lists for each line
dict_lists = [line.split("@@") for line in new_lines]

# creates a dictionary from dict_lists where the
#  word is the key and the list of phonemes is the value
#### NOTE: the original dictionary provides
####  multiple entries for multiple pronunciations.
####  as currently written, this code only keeps the first pronunciation

# creates an empty dictionary
pronunciationDict = {}


# loops through dict_lists, which is a list of lists
# composed of 2-item lists with (1) the word string
# and (2) a string of phonemes
for item in dict_lists:
    
    # checks the dictionary to see if the word is already present
    #  if it is there, then it skips
    # this ensures only the first pronunciation will be taken
    #  for a word which has multiple entries
    if item[0] in pronunciationDict:
        
        continue
    
    # creates a key from the word and makes the split phoneme list the value
    else:
        pronunciationDict[item[0]] = item[1].split()

        
# the entry for 'BE' is malformed in the dictionary file, so this overrides it
# (B IY1 is the standard ARPABET transcription; only the vowel count matters here)
pronunciationDict['BE'] = ['B', 'IY1']

## Define the Syllable Counting Function

This section defines the function that counts the syllables in each utterance. Each syllable contains one vowel, so the function counts the vowel phonemes of every word it finds in the pronunciation dictionary.

This function takes the following arguments:

<ol>
<li>The utterance content as a string</li>
<li>The pronunciation dictionary created in the previous step</li>
</ol>

In [None]:
def getSyllCount(utterance_content_string, pronunciationDictionary):
    
    """
    Gets the syllable count of an utterance.
    utterance_content_string should be a string; if it is not, this returns NaN
       (this avoids errors with NaNs in the dataframe)
    pronunciationDictionary input must be a python dictionary
    words that are missing from the dictionary add 0 syllables
    """
    
    # this takes the vowel phonemes from the ARPABET system (which the pronunciation dict of MFA is based on)
    # see here: http://www.speech.cs.cmu.edu/cgi-bin/cmudict
    # and makes a list of the vowels with the different stress numbers added
    # these give the syllable count because each syllable has one vowel: number of vowels = number of syllables
    vowels_list = ['AA', 'AE', 'AH', 'AO', 'AW', 'AY', 'EH', 'ER', 'EY', 'IH', 'IY', 'OW', 'OY', 'UH', 'UW']
    numbers = [0,1,2]

    vowelList = [f"{v}{n}" for v in vowels_list for n in numbers]

    # or just use this

    #vowel_phonemes = ['AA0', 'AA1', 'AA2', 'AE0', 'AE1', 'AE2', 'AH0', 'AH1', 'AH2', 'AO0', 'AO1', 'AO2', 'AW0', 'AW1', 'AW2', 'AY0', 'AY1', 'AY2', 'EH0', 'EH1', 'EH2', 'ER0', 'ER1', 'ER2', 'EY0', 'EY1', 'EY2', 'IH0', 'IH1', 'IH2', 'IY0', 'IY1', 'IY2', 'OW0', 'OW1', 'OW2', 'OY0', 'OY1', 'OY2', 'UH0', 'UH1', 'UH2', 'UW0', 'UW1', 'UW2']
    # these come from here: http://www.speech.cs.cmu.edu/cgi-bin/cmudict
    # the numbers are for stress
    
    # creates an empty list
    total_vowels = []
    
    # checks the input type. if it is not a string,
    # returns a NaN to be written to the dataframe
    if not isinstance(utterance_content_string, str):
        
        return float("NaN")
    
    # if the input is a string, continues
    else:
        
        #creates a list of split words from the utterance content string
        string_list = [word.strip(punctuation) for word in utterance_content_string.split()]
        
        #loops through this new list
        for word in string_list:
            
            # gets the list of phonemes from the word's entry in the pronunciation dictionary
            phonemes = pronunciationDictionary.get(word.upper()) 
            
            # if there are no phonemes, passes
            if phonemes is None:
                
                pass
            
            # else, continues
            else:
                
                #loops through the phonemes
                for phon in phonemes:
                    
                    # if the phoneme is in the vowelList, appends the phoneme to the total_vowels list
                    if phon in vowelList:
                        total_vowels.append(phon)
                        
                    # if the phoneme is not in the vowel list, passes
                    else:
                        pass
                    
        #returns the vowel count in the utterance content line
        return(len(total_vowels))
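
To see the counting behave on a small input, here is a compact, self-contained sketch. It is not the function above: `count_syllables` and `toy_dict` are hypothetical names, and it swaps the explicit vowel list for a shortcut, treating any phoneme that ends in a stress digit as a vowel (in CMUdict-style ARPABET only vowels carry a stress digit).

```python
import math
from string import punctuation


def count_syllables(utterance, pron_dict):
    """Count vowel phonemes (one per syllable) in an utterance string."""
    if not isinstance(utterance, str):
        return math.nan  # mirrors the NaN return for non-string input
    count = 0
    for word in utterance.split():
        # Words missing from the dictionary contribute no phonemes.
        phonemes = pron_dict.get(word.strip(punctuation).upper(), [])
        # Assumption: a trailing digit marks a stressed/unstressed vowel.
        count += sum(1 for p in phonemes if p[-1].isdigit())
    return count


# Toy dictionary with two entries (hypothetical data).
toy_dict = {"HELLO": ["HH", "AH0", "L", "OW1"], "WORLD": ["W", "ER1", "L", "D"]}

print(count_syllables("Hello, world!", toy_dict))  # 3
print(count_syllables("Hello zzz", toy_dict))      # 2
```

The second call shows a limitation shared with the function above: a word that is not in the dictionary adds no syllables, so the speech rate of utterances with out-of-dictionary words is underestimated.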

## Adding the Syllable Count and Speech Rate Columns to the Dataframe

### Feature: Ain't

In [None]:
#creates a new column for Speech Rate filled with 
# nan values which will be replaced
aint_gs_df["SyllableCount"] = np.nan

aint_gs_df["SpeechRate"] = np.nan


#loops through the rows of the gold standard dataframe
for file_row in aint_gs_df.itertuples():
    
    # gets the syllable count for the utterance content in the line
    content_syllable_count = getSyllCount(file_row.Content, pronunciationDict)
    
    # if the syllable count is 0, writes a 0
    if content_syllable_count == 0:
        
        aint_gs_df.loc[file_row.Index, "SyllableCount"] = 0
        
        aint_gs_df.loc[file_row.Index, "SpeechRate"] = 0
    
    # else, divides the syllable count by the utterance length (in seconds),
    #  and writes the result
    else:
        
        aint_gs_df.loc[file_row.Index, "SyllableCount"] = float(content_syllable_count)
        
        aint_gs_df.loc[file_row.Index, "SpeechRate"] = float(content_syllable_count/file_row.UttLength)

### Feature: Be

In [None]:
#creates a new column for Speech Rate filled with 
# nan values which will be replaced
be_gs_df["SyllableCount"] = np.nan

be_gs_df["SpeechRate"] = np.nan

#loops through the rows of the gold standard dataframe
for file_row in be_gs_df.itertuples():
    
    # gets the syllable count for the utterance content in the line
    content_syllable_count = getSyllCount(file_row.Content, pronunciationDict)
    
    # if the syllable count is 0, writes a 0
    if content_syllable_count == 0:
        
        be_gs_df.loc[file_row.Index, "SyllableCount"] = 0
        
        be_gs_df.loc[file_row.Index, "SpeechRate"] = 0
    
    # else, divides the syllable count by the utterance length (in seconds),
    #  and writes the result
    else:
        
        be_gs_df.loc[file_row.Index, "SyllableCount"] = float(content_syllable_count)
        
        be_gs_df.loc[file_row.Index, "SpeechRate"] = float(content_syllable_count/file_row.UttLength)

### Feature: Done

In [None]:
#creates a new column for Speech Rate filled with 
# nan values which will be replaced
done_gs_df["SyllableCount"] = np.nan

done_gs_df["SpeechRate"] = np.nan

#loops through the rows of the gold standard dataframe
for file_row in done_gs_df.itertuples():
    
    # gets the syllable count for the utterance content in the line
    content_syllable_count = getSyllCount(file_row.Content, pronunciationDict)
    
    # if the syllable count is 0, writes a 0
    if content_syllable_count == 0:
        
        done_gs_df.loc[file_row.Index, "SyllableCount"] = 0
        
        done_gs_df.loc[file_row.Index, "SpeechRate"] = 0
    
    # else, divides the syllable count by the utterance length (in seconds),
    #  and writes the result
    else:
        
        done_gs_df.loc[file_row.Index, "SyllableCount"] = float(content_syllable_count)
        
        done_gs_df.loc[file_row.Index, "SpeechRate"] = float(content_syllable_count/file_row.UttLength)
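
The three loops above differ only in the dataframe they update, so the same step could be written once as a helper. The following is a sketch under assumptions, not the notebook's method: `add_speech_rate`, `toy_counter`, and `toy_df` are hypothetical, and the column names `Content` and `UttLength` are taken from the loops above.

```python
import numpy as np
import pandas as pd


def add_speech_rate(df, count_fn):
    """Return a copy of df with SyllableCount and SpeechRate columns added."""
    out = df.copy()
    # Apply the syllable counter to every utterance; NaN stays NaN.
    out["SyllableCount"] = out["Content"].map(count_fn).astype(float)
    # Syllables per second; a count of 0 gives a rate of 0.
    out["SpeechRate"] = out["SyllableCount"] / out["UttLength"]
    return out


def toy_counter(text):
    # Stand-in counter for this demo only: one "syllable" per word.
    return len(text.split()) if isinstance(text, str) else np.nan


toy_df = pd.DataFrame(
    {"Content": ["a b c d", "", np.nan], "UttLength": [2.0, 1.0, 1.5]}
)
result = add_speech_rate(toy_df, toy_counter)
print(result[["SyllableCount", "SpeechRate"]])
```

One difference from the loops is worth noting: if an utterance length were 0, this version would produce NaN or infinity rather than the 0 written by the explicit zero-syllable branch.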

## Sorting the Dataframes by File and Line

This will sort the dataframes first by filename and then by line number. Doing this at every step keeps the row order consistent across the files produced by each step.

### Feature: Ain't

In [None]:
aint_gs_df = aint_gs_df.sort_values(by=['File', 'Line'])

### Feature: Be

In [None]:
be_gs_df = be_gs_df.sort_values(by=['File', 'Line'])

### Feature: Done

In [None]:
done_gs_df = done_gs_df.sort_values(by=['File', 'Line'])

## Exporting Dataframes to CSV Files

This will export the dataframes to CSV files.

In [None]:
# Designate the output path where the CSVs will be stored
# (end the path with a separator such as "/", because the file name is appended directly to it)
csv_output_path = "path"

### Feature: Ain't

In [None]:
aint_gs_df.to_csv(f"{csv_output_path}aint_variations_speechRate.csv", index=False)

### Feature: Be

In [None]:
be_gs_df.to_csv(f"{csv_output_path}be_speechRate.csv", index=False)

### Feature: Done

In [None]:
done_gs_df.to_csv(f"{csv_output_path}done_speechRate.csv", index=False)
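
The export cells above build each file name by appending it directly to `csv_output_path`, which only works when that path ends with a separator. A possible alternative is `os.path.join`, sketched below with a temporary directory and a made-up dataframe; `toy_df` and `toy_speechRate.csv` are hypothetical and not part of the notebook's outputs.

```python
import os
import tempfile

import pandas as pd

# Made-up dataframe standing in for one of the gold standard dataframes.
toy_df = pd.DataFrame({"File": ["a"], "Line": [1], "SpeechRate": [4.0]})

with tempfile.TemporaryDirectory() as csv_output_dir:
    # os.path.join inserts the separator, so the directory path
    # does not need to end with a slash.
    out_file = os.path.join(csv_output_dir, "toy_speechRate.csv")
    toy_df.to_csv(out_file, index=False)

    # Read the file back to confirm the export worked.
    round_trip = pd.read_csv(out_file)

print(list(round_trip.columns))
```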