# Experimental Project
## Measuring Error for ASR Systems Between Certain Dialects or Accents in English (specifically Indian dialects vs Southern California dialects)

- ##### What tool or tools did you experiment on? 
I used the Speech-to-Text APIs to measure the performance of 2 ASR systems in particular: the
Whisper API and the Microsoft Azure AI REST API.

- ##### How did you interact with the tool? 
I installed these 2 APIs locally and called them in Python. All of my code is in Python, formatted
in Jupyter Notebooks with Markdown interspersed to explain the code chunks. For the most part, I used
libraries in Python and created functions for specific string comparisons. For this project, I focused
on individual strings as sentences for the input, since iterating through a csv made it more complicated
and the results were harder to keep track of. I have written a Python file that uses these APIs and
returns the transcripts from the audio files into individual text files as well as a CSV.

My code for this transcription process can be found on my __[GitHub](https://github.com/oishani-b/Linguistics_Project.git)__. This transcription code should work on any computer or device as long as the required
packages and libraries have been locally installed. Currently, it works specifically for '.ogg' and '.wav'
files. You can try out this code with your own directory of audio files by changing the file paths
to your local directories! Although transcription is a very important aspect of this project, my goal with the LIGN 168 
Final Project is to refine my error measurement methods and
compare word-wise error measurements with phonemic error measurements.

Since I collected an expansive array of recordings, and I specifically wanted to refine my method of capturing error,
I used a specific subset of the transcripts to make comparisons with the target sentence.


- ##### What hypotheses did you test? 
My general hypothesis is that these ASR systems will have significantly higher error rates
with Indian English dialects than California English dialects. Also, I think that the
most effective error measurement will be at the word level, since syllable level might
be so granular that it causes universally high error rates, whereas phrase or
sentence level may produce similar error rates that are intuitively wrong.
  
- ##### How did you test them?
I used a multi-step process, both at the word-level and the phoneme-level, to test this hypothesis. 

To measure error at the word level, I turned each sentence or string into a list of words, and then compared the word lists from the target sentences (the prompts shown to participants) to the word lists produced by each of the transcription services. By eliminating all of the words that the transcriptions got right, I was left with a list of the words that they missed, substituted, or added. 

Next, I used the word lists from above to compare the general pronunciations of these words using CMUdict. By using the arpabet transcriptions of each of the erroneous parts of the sentences, I could gain some idea of which phonemes had been missed, substituted, or added. This was more challenging than expected, due to the wide variety of possible errors made by these systems. However, I found that a similar process of adding all the phonemes to a list and then eliminating the ones shared by the target and transcript lists worked quite well for my example cases. This is something I might refine further depending on the different complexities of errors I see in these systems. Since I'm working with a relatively limited subset of the transcriptions to make sure I have good base cases, there is always the possibility of edge cases that I have not successfully addressed.


- ##### How do you know you’ve answered this question satisfactorily? (Or, if you’ve found that you can’t answer the question, explain why and engage in a nuanced way with what would need to be different to answer it in the future?)

I think I answered this question satisfactorily because I could observe the differences in each of my error measurements clearly, and trace them back using my code. I made sure to add several print statements and Markdown explanations throughout my code to make it explainable at every step of the process. 

Perhaps the most interesting results were Microsoft Azure's relatively great transcriptions compared to Whisper with the Indian dialects! Also, I found some noticeable differences across dialects in terms of numbers of errors, both word-wise and phoneme-wise. This gives me an idea of the possible statistically significant differences across speakers of Southern California vs Indian dialects. I also found some interesting results depending on style of speaking and background noise, explained further under the next question. 

Generally, I found that the Southern California dialect had higher accuracy than the 2 Indian dialects I tested, but this was especially noticeable with Whisper. The Azure system did either marginally worse for one of the Indian dialects or equally well for all 3 dialects. The ASR systems did surprisingly well with some of the prompts I thought would be heavily influenced by dialect, and universally poorly on some of the prompts I thought would be perceived more accurately in a Southern California dialect. 

While I thought word-error measurements seemed more intuitively correct in terms of omissions, substitutions and additions of words, I was surprised to find that the phonemic-error measurements were actually very intuitive and insightful. The fine-grained knowledge of exactly where the ASR systems were transcribing different vowels, for example, was very useful. While the word-level errors do make some sense in terms of human intuitions for what you would consider an error in recognizing speech, the phoneme-level errors are generally more intuitive and accurate in terms of measuring errors, especially for substitutions with similar pronunciations that are not well-captured by the broad strokes of word-level measurement.

- ##### What challenges did you face?
An unexpected factor that threw me off was the difference made by variance in speech styles. I used recordings from one participant, who spoke in a more casual style with an Indian dialect (specifically Tamil), and added my own recordings in a more formal speech style, and it made a noticeable difference in the accuracy of the transcriptions. 

I was surprised to find relatively high levels of accuracy when speakers used formal speech styles in general, regardless of dialect. I will definitely keep this factor in mind when recording participants in the future, because a casual speech style made the transcripts far more erroneous. 

I re-recorded a few prompts with participants and compared data across a variety of factors, such as speech style, closeness to the microphone, audio quality (based on the device they were using), and background noise. I found that any loud background noise simply did not work, since the ASR systems would begin to transcribe voices from the background. 

To keep my focus on error measurement, I selected transcripts that had some variation, but not much. For example, my transcripts are generally in a more formal style of speech and with minimal background noise. Some of the Southern California dialect transcripts have some background noise or casual speech style, as do some of the Indian dialect transcripts. Generally, though, I chose some of the relatively simple errors to investigate, so that I could successfully build base cases and some useful edge cases to test my error measurement methods.

- ##### If your study involved human participants, how were they consented?
I asked my participants if they would agree to a UCSD Informed
Consent form (the template was taken from the IRB at UCSD). 

# Word Level Analysis

In [1]:
#filter warnings so that they only appear once
import warnings 
warnings.filterwarnings(action='once')

### Defining Functions for Word Error Measurement

The first step in the process of error measurement was to map out differences between the target string and the transcript string at the word level. To do this, I followed a multi-step process to ensure I was preserving as much of the original data as possible, and meaningfully drawing out steps to extract the transcriptions.

The steps I used were:
1) Convert the sentences into lists of words
2) Remove punctuation and capitalization for easier comparisons
3) Make the prompt and transcript word lists equal in length for iteration
4) By iterating through the prompt list, eliminate all words the transcript got right, leaving only the problem words
5) Count the number of errors using the problem words

In [2]:
#turn the input sentence into a list of words
def to_list(sentence):
    word_list = sentence.split()
    return word_list

Next, I decided to strip the punctuation from the split words, so that the target sentence's words could be compared as accurately as possible to the transcript's words, regardless of surrounding punctuation. I also change all the words to lowercase, again to stay consistent throughout.

In [3]:
#remove punctuation from each of the words in the word list
def depunctuate(word_list):
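    #raw string, so the backslash is kept literally and '\,' is not read as an invalid escape sequence
    punc = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''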
    depunct_word_list = []
    for i in word_list: 
        #iterate through each character in the word
        for character in i:
            #checking if the character is a punctuation mark and replacing it with an empty string
            if character in punc:     
                i = i.replace(character, "")
        #adding the new word without punctuation in lowercase to the result list
        depunct_word_list.append(i.lower())
    return depunct_word_list
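
As a point of comparison, the same cleanup can be sketched with `str.translate`. This is only an illustration and not the code used for the results in this notebook: the name `depunctuate_translate` is hypothetical, and `string.punctuation` is an assumption that covers a few more characters (such as `+`, `=`, `|` and the backtick) than the `punc` string above.

```python
import string

# Translation table that deletes every ASCII punctuation character.
# Assumption: string.punctuation is an acceptable stand-in for the custom punc string.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def depunctuate_translate(word_list):
    """Return lowercase copies of the words with ASCII punctuation removed."""
    return [word.translate(_STRIP_PUNCT).lower() for word in word_list]

print(depunctuate_translate(["In-n-Out", "haven't", "day."]))
```

Because the table is built once, each word is cleaned in a single pass instead of one `replace` call per punctuation mark.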



The `depunctuate` function gave me each of the cleaned word lists, so I was ready to start comparing the lists.

Here, I ran into an issue with indexing. Either one of the lists being longer meant it was hard to iterate through values in them. So, I decided to make both lists the same length by adding empty strings to whichever word list was shorter.

In [4]:
#make the shorter list equal in length to the longer one
def equalize_list_lengths(targ_list, err_list):
    
    #check if the transcript word list has more words
    if len(err_list) > len(targ_list):
        extra_errors = len(err_list) - len(targ_list)
        #add empty strings as dummy words now, depending on how many extra words the transcript has
        for i in range(0, extra_errors):
            targ_list.append("")

    #check if the prompt word list has more words
    elif len(targ_list) > len(err_list):
        targ_missed = len(targ_list) - len(err_list)
        #add empty strings as dummy words now, depending on how many missed words the prompt has
        for i in range(0, targ_missed):
            err_list.append("")
            
    #return a list containing 2 lists of equal length: the prompt word list and transcript word list, padded with empty strings
    return [targ_list, err_list]
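
One thing to keep in mind is that `equalize_list_lengths` appends to the lists it receives, so the caller's lists are changed as well. A minimal sketch of a non-mutating alternative is below; the name `pad_to_equal` and the `filler` parameter are hypothetical and not part of the notebook's pipeline.

```python
def pad_to_equal(targ_words, err_words, filler=""):
    """Return padded copies of both lists without modifying the inputs."""
    longest = max(len(targ_words), len(err_words))
    # List concatenation builds new lists, so the originals stay untouched.
    padded_targ = targ_words + [filler] * (longest - len(targ_words))
    padded_err = err_words + [filler] * (longest - len(err_words))
    return padded_targ, padded_err

prompt_words = ["she", "was", "leaving"]
transcript_words = ["she"]
print(pad_to_equal(prompt_words, transcript_words))
print(transcript_words)  # still the original, unpadded list
```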

After making the lists equal in length, I started capturing the missing, substituted, and added words in the transcript. I did this by checking off each of the words in the prompt word list with the transcript word list.

I tried a few different ways to write this function, since it was important to get the 'last words standing' correct. Directly changing the prompt word list was leading to issues, so I initiated a new list for the words in the prompt that were missing in the transcript. I only removed the overlapping words from the transcript word list. I also used this function to remove the extra empty string 'dummy words' we had added in the previous step.

In [5]:
def get_error_words(ready_targ_list, ready_err_list):

    #this list will store all the words present in the prompt, but missing from the transcript
    targ_leftovers = []
    
    for i in ready_targ_list:
        
        #if the word is found in both the prompt and the transcript, it is correctly captured        
        if (i in ready_err_list and i in ready_targ_list):
            # the correctly captured word is removed from the transcript word list
            ready_err_list.remove(i)
            
        #if the word is not found in both the prompt and the transcript, something must be wrong
        else:
            #the word from the prompt that is missing from the transcript is added to this list
            targ_leftovers += [i]

    #this removes the empty strings, leaving a list of only the words present in the prompt, but 
    #missing from the transcript
    ready_targ_list = []
    for i in targ_leftovers:
        if len(i) != 0:
            ready_targ_list.append(i)

    #this list will store all the non-empty words remaining in the transcript list
    #these words were not in the prompt, so they are either substitutions or additions
    temp_err_list = []
    for i in ready_err_list:
        if len(i) != 0:
            temp_err_list.append(i)
    ready_err_list = temp_err_list

    return [ready_targ_list, ready_err_list]
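
The checking-off logic above behaves like a multiset (bag) difference between the two word lists. A short sketch of that idea using `collections.Counter` is below. The function name `multiset_differences` is hypothetical, and the words may come out in a different order than in `get_error_words` when a word is repeated, so this is a cross-check rather than a drop-in replacement.

```python
from collections import Counter

def multiset_differences(targ_words, err_words):
    """Return (prompt words missing from the transcript, transcript words not in the prompt)."""
    targ_counts = Counter(targ_words)
    err_counts = Counter(err_words)
    # Counter subtraction keeps only positive counts, so each matched word
    # cancels exactly one occurrence on the other side.
    missing = list((targ_counts - err_counts).elements())
    extra = list((err_counts - targ_counts).elements())
    return missing, extra

prompt = "she was leaving for bangalore that day".split()
transcript = "she was leaving for bong load that day".split()
print(multiset_differences(prompt, transcript))
```

Padding with empty strings is not needed in this version, since the counts are compared directly.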

At this point, I came to a tricky question: how do I actually count the errors? I have a list of words that were missed, and a list of words that were not in the prompt, but either of those could be substitutions.

I think there may be other ways to do this calculation, but I generally found that the longer list of words was a good metric. It typically accounted for substitutions without double-counting, and also captured omissions from the smaller list. This is an area I would like to explore and refine further, but I think it worked well for my example transcripts.

In [6]:
#count the number of errors between the word lists
def counter(final_targ_list, final_err_list):
    if len(final_targ_list) == len(final_err_list):
        return len(final_targ_list)
    elif len(final_targ_list) > len(final_err_list):
        return len(final_targ_list)
    elif len(final_targ_list) < len(final_err_list):
        return len(final_err_list)
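
One of the other ways to do this calculation is the word-level edit distance that standard word error rate (WER) is built on. The sketch below is an alternative for comparison, not the method used for the results in this notebook; the names `word_edit_distance` and `word_error_rate` are hypothetical, and the inputs are assumed to be word lists that are already lowercased and free of punctuation.

```python
def word_edit_distance(ref_words, hyp_words):
    """Minimum number of word substitutions, deletions and insertions."""
    # previous[j] holds the distance between the prompt words processed so far
    # and the first j transcript words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]

def word_error_rate(ref_words, hyp_words):
    """Edit distance divided by the number of prompt words."""
    if not ref_words:
        raise ValueError("the prompt must contain at least one word")
    return word_edit_distance(ref_words, hyp_words) / len(ref_words)

prompt = "she was leaving for bangalore that day".split()
transcript = "she was leaving for bong load that day".split()
print(word_edit_distance(prompt, transcript))
print(word_error_rate(prompt, transcript))
```

Unlike the longer-list count, this respects word order, and dividing by the prompt length gives a rate that can be compared across prompts of different lengths.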

After completing these steps, I had a good metric for calculating word-level error rates. I compiled all the functions above into one function so that it's easier to run and follows the right sequence.

The input for this function is 2 strings: the target sentence and the transcription sentence.

The output is a list with 3 values: 1) the prompt words that were missing, 2) any substitutions or additions, and 3) the number of errors.

In [7]:
def error_finder(target_sentence, error_sentence):
    #step 1
    targ_list = to_list(target_sentence)
    err_list = to_list(error_sentence)

    #step 2
    depunct_targ_list = depunctuate(targ_list)
    depunct_err_list = depunctuate(err_list)

    #step 3
    eq_length_lists = equalize_list_lengths(depunct_targ_list, depunct_err_list)
    ready_targ_list = eq_length_lists[0]
    ready_err_list = eq_length_lists[1]

    #step 4
    final_error_lists = get_error_words(ready_targ_list, ready_err_list)
    final_targ_list = final_error_lists[0]
    print("List of words missing from prompt in transcription: ", final_targ_list)
    final_err_list = final_error_lists[1]
    print("List of words in transcription not present in prompt: ", final_err_list)

    #step 5
    final_err_count = counter(final_targ_list, final_err_list)

    return [final_targ_list, final_err_list, final_err_count]

### Testing Word Error Measurements

Next, I tested this function using some of the results from the transcriptions I had. The first example I used is from an older set of transcriptions: I added some possible transcriptions as examples of different cases. The next few examples are from the results of the transcriptions I completed for this class project specifically. I used a limited number of examples, but had some variation throughout them to check if my general implementation was working as expected. The tests are below:

In [8]:
target_sentence_1 = "She was leaving for Bangalore that day."
error_sentence_1 = "She was leaving for bong load that day."
error_sentence_2 = "She was leaving that day."
error_sentence_3 = "She was leafing for Bangalore that day then."

print("Transcript 1")
print(error_finder(target_sentence_1, error_sentence_1))
print()

print("Transcript 2")
print(error_finder(target_sentence_1, error_sentence_2))
print()

print("Transcript 3")
print(error_finder(target_sentence_1, error_sentence_3))
print()

Transcript 1
List of words missing from prompt in transcription:  ['bangalore']
List of words in transcription not present in prompt:  ['bong', 'load']
[['bangalore'], ['bong', 'load'], 2]

Transcript 2
List of words missing from prompt in transcription:  ['for', 'bangalore']
List of words in transcription not present in prompt:  []
[['for', 'bangalore'], [], 2]

Transcript 3
List of words missing from prompt in transcription:  ['leaving']
List of words in transcription not present in prompt:  ['leafing', 'then']
[['leaving'], ['leafing', 'then'], 2]



In [9]:
target_sentence_2 = "You haven't even been to the In-n-Out in the Outback Steakhouse neighborhood?"
print(target_sentence_2)
print()

socal_1_whisper = "You haven't even been to the end and out in the Outback Steakhouse neighborhood."
indian_1_whisper = "You haven't even been to the inn and out in the outback stack or sniper herd."
indian_2_whisper = "You'll haven't even been to the inn and out in the Outback Steakhouse neighborhood."

socal_1_azure = "you haven't even been to the in and out in the outback steakhouse neighborhood"
indian_1_azure = "you haven't even been to the in and out in the outback steakhouse neighborhood"
indian_2_azure = "you haven't even been to the in and out in the outback steakhouse neighborhood"

print("socal_1_whisper")
print(error_finder(target_sentence_2, socal_1_whisper))
print()

print("indian_1_whisper")
print(error_finder(target_sentence_2, indian_1_whisper))
print()

print("indian_2_whisper")
print(error_finder(target_sentence_2, indian_2_whisper))
print()

print("socal_1_azure")
print(error_finder(target_sentence_2, socal_1_azure))
print()

print("indian_1_azure")
print(error_finder(target_sentence_2, indian_1_azure))
print()

print("indian_2_azure")
print(error_finder(target_sentence_2, indian_2_azure))
print()


You haven't even been to the In-n-Out in the Outback Steakhouse neighborhood?

socal_1_whisper
List of words missing from prompt in transcription:  ['innout']
List of words in transcription not present in prompt:  ['end', 'and', 'out']
[['innout'], ['end', 'and', 'out'], 3]

indian_1_whisper
List of words missing from prompt in transcription:  ['innout', 'steakhouse', 'neighborhood']
List of words in transcription not present in prompt:  ['inn', 'and', 'out', 'stack', 'or', 'sniper', 'herd']
[['innout', 'steakhouse', 'neighborhood'], ['inn', 'and', 'out', 'stack', 'or', 'sniper', 'herd'], 7]

indian_2_whisper
List of words missing from prompt in transcription:  ['you', 'innout']
List of words in transcription not present in prompt:  ['youll', 'inn', 'and', 'out']
[['you', 'innout'], ['youll', 'inn', 'and', 'out'], 4]

socal_1_azure
List of words missing from prompt in transcription:  ['innout']
List of words in transcription not present in prompt:  ['and', 'out', 'in']
[['innout'], ['a

These results are pretty interesting! Azure did a much better job than Whisper overall. Whisper produced more errors for the 2 Indian dialects than for the single Southern California dialect. 

Also, these results give us an intuitive understanding of the errors fairly easily.

As an example, the 7 errors captured for the 'indian_1_whisper' transcript show steakhouse being replaced by stack, or being added, neighborhood being replaced by sniper and herd, and In-n-Out being replaced by inn and out. These errors make sense as intuitive measurements, but there are some gray areas here.

For example, In-n-Out (which looks like innout after the punctuation removal, but looked like In-n-Out in the prompt) seems to get transcribed as 'in and out' the most often, even though it has a distinct pronunciation. We could count this as a single error, but it also makes sense to count it as 3 separate errors because of how In-n-Out is typically pronounced. In this case, a phoneme-level analysis would be very useful!

Here's another set of results that I analyzed:

In [10]:
target_sentence_3 = "Bro sure with all my work I've made it to the brochure."
print(target_sentence_3)
print()

indian_2_whisper = "Bro, share with all my work. I've made it to the brochure."
socal_1_whisper = "brochure with all my work I've made it to the brochure."
indian_1_whisper = "I'm going to show you all my work I've made to the brush."

indian_2_azure = "bro sure with all my work i've made it to the brochure"
socal_1_azure = "brochure with all my work i've made it to the brochure"
indian_1_azure = "share with all my work i've made it to the brochure"

print("socal_1_whisper")
print(error_finder(target_sentence_3, socal_1_whisper))
print()

print("indian_1_whisper")
print(error_finder(target_sentence_3, indian_1_whisper))
print()

print("indian_2_whisper")
print(error_finder(target_sentence_3, indian_2_whisper))
print()

print("socal_1_azure")
print(error_finder(target_sentence_3, socal_1_azure))
print()

print("indian_1_azure")
print(error_finder(target_sentence_3, indian_1_azure))
print()

print("indian_2_azure")
print(error_finder(target_sentence_3, indian_2_azure))
print()

Bro sure with all my work I've made it to the brochure.

socal_1_whisper
List of words missing from prompt in transcription:  ['bro', 'sure']
List of words in transcription not present in prompt:  ['brochure']
[['bro', 'sure'], ['brochure'], 2]

indian_1_whisper
List of words missing from prompt in transcription:  ['bro', 'sure', 'with', 'it', 'brochure']
List of words in transcription not present in prompt:  ['im', 'going', 'show', 'you', 'to', 'brush']
[['bro', 'sure', 'with', 'it', 'brochure'], ['im', 'going', 'show', 'you', 'to', 'brush'], 6]

indian_2_whisper
List of words missing from prompt in transcription:  ['sure']
List of words in transcription not present in prompt:  ['share']
[['sure'], ['share'], 1]

socal_1_azure
List of words missing from prompt in transcription:  ['bro', 'sure']
List of words in transcription not present in prompt:  ['brochure']
[['bro', 'sure'], ['brochure'], 2]

indian_1_azure
List of words missing from prompt in transcription:  ['bro', 'sure']
List 

Again, Azure does quite a bit better than Whisper, with one of the sentences exactly right! In this example, we can see where the measurement might not exactly reflect my intuitions: brochure as a substitution for bro sure seems to make a lot more sense than share as a substitution for bro sure. Here, again, phonemic analysis would be helpful!

Generally, though, it seems like Azure is doing fairly well with the Indian dialects, whereas Whisper is struggling quite noticeably.

This was also one of the cases where speech style played a role. In my transcription, I spoke formally and deliberately, unlike the other 2 participants. This is reflected in the lowest error counts for both Whisper and Azure, which gives more insight into how much of a difference speech style can make in ASR interactions!

Next, we tried testing an Indian pronunciation of an Indian place name in a slightly different example:

In [11]:
target_sentence_4 = "He is leaving for Bengaluru tomorrow."
print(target_sentence_4)
print()

indian_1_whisper = "He's leaving for Bangalore tomorrow"
indian_2_whisper = "He is leaving for Bangalore tomorrow."
socal_1_whisper = "He is leaving for Bengaluru tomorrow."

indian_1_azure = "he is leaving for bengaluru tomorrow"
indian_2_azure = "he is leaving for bengaluru tomorrow"
socal_1_azure = "he is leaving for bengaluru tomorrow"

print("socal_1_whisper")
print(error_finder(target_sentence_4, socal_1_whisper))
print()

print("indian_1_whisper")
print(error_finder(target_sentence_4, indian_1_whisper))
print()

print("indian_2_whisper")
print(error_finder(target_sentence_4, indian_2_whisper))
print()

print("socal_1_azure")
print(error_finder(target_sentence_4, socal_1_azure))
print()

print("indian_1_azure")
print(error_finder(target_sentence_4, indian_1_azure))
print()

print("indian_2_azure")
print(error_finder(target_sentence_4, indian_2_azure))
print()

He is leaving for Bengaluru tomorrow.

socal_1_whisper
List of words missing from prompt in transcription:  []
List of words in transcription not present in prompt:  []
[[], [], 0]

indian_1_whisper
List of words missing from prompt in transcription:  ['he', 'is', 'bengaluru']
List of words in transcription not present in prompt:  ['hes', 'bangalore']
[['he', 'is', 'bengaluru'], ['hes', 'bangalore'], 3]

indian_2_whisper
List of words missing from prompt in transcription:  ['bengaluru']
List of words in transcription not present in prompt:  ['bangalore']
[['bengaluru'], ['bangalore'], 1]

socal_1_azure
List of words missing from prompt in transcription:  []
List of words in transcription not present in prompt:  []
[[], [], 0]

indian_1_azure
List of words missing from prompt in transcription:  []
List of words in transcription not present in prompt:  []
[[], [], 0]

indian_2_azure
List of words missing from prompt in transcription:  []
List of words in transcription not present in prom

Azure absolutely nailed it this time! Honestly, I'm very impressed at Microsoft Azure's performance on this overall, and I think it's very interesting to learn about how each of these APIs has different error sensitivities to dialect. Obviously, these are not nearly enough examples to say that Azure is miles ahead or does not have any dialect/accent issues, but it's done a great job so far!

Whisper, on the other hand, struggles with the Indian dialects again, this time in an even more interesting manner. The Southern California dialect saying 'Bengaluru' gets picked up, but the Indian dialects get picked up as Bangalore. 
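
To put the Whisper pattern in one place, here is a small tally of the word-level error counts printed above for the In-n-Out, brochure, and Bengaluru prompts. The dictionary layout and variable names are hypothetical. Azure is left out because some of its printed results are cut off, and three prompts are far too few to support a statistical claim.

```python
# Word-level error counts copied from the printed Whisper results above,
# in prompt order: In-n-Out, brochure, Bengaluru.
whisper_error_counts = {
    "socal_1": [3, 2, 0],
    "indian_1": [7, 6, 3],
    "indian_2": [4, 1, 1],
}

# Total errors per speaker across the three prompts.
totals = {speaker: sum(counts) for speaker, counts in whisper_error_counts.items()}

for speaker, total in totals.items():
    print(speaker, total)
```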

### Overall Word Error Measurement: Not Bad, but Missing Some Important Stuff

From these tests, it's clear that Word-Level Error Measurements do give us meaningful insights into how the ASR systems are failing. I was especially surprised at the stark difference in performance across Azure and Whisper, with Azure outperforming Whisper in most of the example test cases I used. 

One of the caveats of word-level measurement is the lack of nuance seen in some of the transcription errors above. Approaching error measurement with phonemic analysis would help to handle these situations. 

# Phoneme Level Analysis Using CMUdict

### Defining Functions for Phoneme Error Measurement

I used the word error lists created already to calculate the phoneme-level errors. I followed a step-by-step process similar to the word-level error measurement process, but incorporated CMUdict. 

The steps I used were:
1) 

In [12]:
#import the required libraries for cmudict
import nltk
from nltk.corpus import cmudict

cmu_dict = cmudict.dict()

After loading CMUdict and looking through it, I could not find Bengaluru or In-n-Out. So, I updated CMUdict with custom entries before using it for error measurement.

In [13]:
#add custom entries
custom_entries = {
    "bengaluru": [['B', 'EH', 'NG', 'G', 'UH', 'L', 'UW', 'R', 'UW']],
    "innout": [['IH2', 'N', 'AH0', 'N', 'AW2', 'T']]
}
cmu_dict.update(custom_entries)

Then, I thought it would be useful to create a simple function that gives a word's arpabet transcription using CMUdict. That way, I can just use this function for all my transcriptions.

Something to note here: sometimes, CMUdict will have multiple transcriptions of a single word. For simplicity in these examples and tests, I'm defaulting to the first transcription in CMUdict. It would be nice to make this function a bit more complex, as well as the following functions, and potentially account for all the variants of a certain word's pronunciation, in case that matches better across words, especially for substitutions. 

Since we are looking at the results of ASR processing and not actual speech here, I did not think it was necessary to include the variants. So, as of now, CMUdict will default to the first variant of its transcriptions for both the prompt's words and the transcript's words.

In [14]:
def word_to_arpabet(word):
    #make the word lowercase because CMUdict only uses lowercase
    word_lower = word.lower()  
    if word_lower in cmu_dict:
        return (cmu_dict[word_lower])[0]
    else:
        return "NOT FOUND"
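
One edge case to watch for: because `word_to_arpabet` returns the string "NOT FOUND" for unknown words, any later loop over the result would iterate over the individual characters of that string. The sketch below shows one possible way to handle this, along with an optional helper that drops stress digits so that, for example, 'AE1' and 'AE0' compare as the same vowel. The names `lookup_phonemes` and `strip_stress` and the tiny `toy_dict` are hypothetical; a real run would pass `cmudict.dict()` instead.

```python
def lookup_phonemes(word, pron_dict):
    """Return the first pronunciation listed for word, or None if it is missing."""
    variants = pron_dict.get(word.lower())
    return variants[0] if variants else None

def strip_stress(phonemes):
    """Drop trailing ARPAbet stress digits, e.g. 'AE1' -> 'AE'."""
    return [phoneme.rstrip("012") for phoneme in phonemes]

# Tiny stand-in dictionary for illustration only (assumption, not CMUdict itself).
toy_dict = {"bong": [["B", "AA1", "NG"]]}

for word in ["Bong", "leafing"]:
    phonemes = lookup_phonemes(word, toy_dict)
    if phonemes is None:
        print(word, "is not in the dictionary, so it is skipped")
    else:
        print(word, strip_stress(phonemes))
```

Whether stress should be ignored is a judgment call, since stress differences may themselves be relevant when comparing dialects.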

Next, we will follow a similar process to the word error measurement. First of all, the input for this function will only include the problem words from the word error measurement, instead of iterating through the arpabet of the entire transcription again. Specifically, the input includes 2 lists: the first is the list of words in the prompt missing from the transcript, and the second is the list of words in the transcript that are not in the prompt.

This function uses the function defined above to give each word's arpabet transcription in a list. Since our goal is just to compare all the remaining phonemes, I store all the arpabet characters into one list each for the prompt and the transcript. This can also potentially be refined, depending on the patterns of errors I get with edge cases. For my current examples, this method seems to work well with my intuitions.

In [15]:
def arpabet_list_compare(target_list, error_list):
    target_arpabet = []
    for i in target_list:
        arpabet_transcription = word_to_arpabet(i)
        for j in arpabet_transcription:
            target_arpabet.append(j)
        
    error_arpabet = []
    for i in error_list:
        arpabet_transcription = word_to_arpabet(i)
        for j in arpabet_transcription:
            error_arpabet.append(j)
    print("arpabet versions of words: ", [target_arpabet, error_arpabet])
    return [target_arpabet, error_arpabet]

Again, in a similar process to the word level errors, I add extra empty strings to the smaller list for easier iteration.

In [16]:
def equalize_arpabet_lengths(targ_arpabet, err_arpabet):
    if len(err_arpabet) > len(targ_arpabet):
        extra_errors = len(err_arpabet) - len(targ_arpabet)
        for i in range(0, extra_errors):
            targ_arpabet.append("")
            
    elif len(targ_arpabet) > len(err_arpabet):
        targ_missed = len(targ_arpabet) - len(err_arpabet)
        for i in range(0, targ_missed):
            err_arpabet.append("")

    return [targ_arpabet, err_arpabet]

This returns 2 lists with the same number of elements. Now, I use the same skeleton of the function to find errors for words, but with the arpabet lists instead. This function also returns 2 lists: the first with characters in the prompt that are missing from the transcript, and the second with characters in the transcript that are not in the prompt.

In [17]:
def get_error_words(ready_targ_arpa, ready_err_arpa):

    #this list stores all unclaimed characters from the prompt
    targ_arpa_leftovers = []
    for i in ready_targ_arpa:
        if (i in ready_err_arpa and i in ready_targ_arpa):
            ready_err_arpa.remove(i)
        else:
            targ_arpa_leftovers += [i]

    #this removes empty strings and gives us the final characters from the prompt that did not appear in the transcript
    ready_targ_arpa = []
    for i in targ_arpa_leftovers:
        if len(i) != 0:
            ready_targ_arpa.append(i)

    #this gives us the characters in the transcription that were not found in the prompt
    temp_err_arpa_list = []
    for i in ready_err_arpa:
        if len(i) != 0:
            temp_err_arpa_list.append(i)
    ready_err_arpa = temp_err_arpa_list

    return [ready_targ_arpa, ready_err_arpa]

To count errors, I can use the same counter function defined for the word error measurements. So, my next step now is to compile all the functions together and get the arpabet error differences for 2 lists of error words.

The output for this final function is a list of 3 elements, the first being the characters in the prompt that are missing from the transcript, the second being the characters from the transcript that are not in the prompt, and the third being the number of errors. The number of errors is measured, once again, using whichever list has a greater number of characters, since that tends to account for substitutions and additions without double-counting.

In [18]:
def arpa_error_finder(target_list, error_list):
    #step 1
    arpa_compare_result = arpabet_list_compare(target_list, error_list)
    targ_arpabet = arpa_compare_result[0]
    err_arpabet = arpa_compare_result[1]

    #step 2
    eq_arpabet_lists = equalize_arpabet_lengths(targ_arpabet, err_arpabet)
    ready_targ_arpa = eq_arpabet_lists[0]
    ready_err_arpa = eq_arpabet_lists[1]

    #step 3
    final_error_lists = get_error_words(ready_targ_arpa, ready_err_arpa)
    final_targ_list = final_error_lists[0]
    final_err_list = final_error_lists[1]

    #step 4
    final_err_count = counter(final_targ_list, final_err_list)

    return [final_targ_list, final_err_list, final_err_count]

### Testing Phoneme Error Measurements

Now, I will use the results from the word-level error measurement to test the phoneme-level errors.

In [19]:
#using the first set of examples
target_sentence_1 = "She was leaving for Bangalore that day."
error_sentence_1 = "She was leaving for bong load that day."
error_sentence_2 = "She was leaving that day."
error_sentence_3 = "She was leafing for Bangalore that day then."

print("Arpabet 1")
err1_results = error_finder(target_sentence_1, error_sentence_1)
target_list = err1_results[0]
error_list = err1_results[1]
print(arpa_error_finder(target_list, error_list))
print()

print("Arpabet 2")
err2_results = error_finder(target_sentence_1, error_sentence_2)
target_list = err2_results[0]
error_list = err2_results[1]
print(arpa_error_finder(target_list, error_list))
print()

print("Arpabet 3")
err3_results = error_finder(target_sentence_1, error_sentence_3)
target_list = err3_results[0]
error_list = err3_results[1]
print(arpa_error_finder(target_list, error_list))
print()

Arpabet 1
List of words missing from prompt in transcription:  ['bangalore']
List of words in transcription not present in prompt:  ['bong', 'load']
arpabet versions of words:  [['B', 'AE1', 'NG', 'G', 'AH0', 'L', 'AO2', 'R'], ['B', 'AA1', 'NG', 'L', 'OW1', 'D']]
[['AE1', 'G', 'AH0', 'AO2', 'R'], ['AA1', 'OW1', 'D'], 5]

Arpabet 2
List of words missing from prompt in transcription:  ['for', 'bangalore']
List of words in transcription not present in prompt:  []
arpabet versions of words:  [['F', 'AO1', 'R', 'B', 'AE1', 'NG', 'G', 'AH0', 'L', 'AO2', 'R'], []]
[['F', 'AO1', 'R', 'B', 'AE1', 'NG', 'G', 'AH0', 'L', 'AO2', 'R'], [], 11]

Arpabet 3
List of words missing from prompt in transcription:  ['leaving']
List of words in transcription not present in prompt:  ['leafing', 'then']
arpabet versions of words:  [['L', 'IY1', 'V', 'IH0', 'NG'], ['L', 'IY1', 'F', 'IH0', 'NG', 'DH', 'EH1', 'N']]
[['V'], ['F', 'DH', 'EH1', 'N'], 4]



As seen from these results, the phonemic transcriptions give us much more nuanced insight into where exactly the transcriptions are missing out or substituting phonemes. The error counts are also surprisingly intuitive and easy to understand. Using these phonemic error measurements gives me a better understanding of the level of differences between each of the transcripts. 
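
Each comparison above repeats the same four lines, so a small helper that loops over labeled transcripts might keep the test cells shorter. This is only a sketch: `compare_all` is a hypothetical name, and the two `toy_` functions are simplified stand-ins (assumptions) for `error_finder` and `arpa_error_finder` so that the block runs on its own.

```python
def compare_all(target, transcripts, word_level, phoneme_level):
    """Run the word-level, then the phoneme-level comparison for each labeled transcript."""
    results = {}
    for label, transcript in transcripts.items():
        word_results = word_level(target, transcript)
        # the first two elements are the missing-word and extra-word lists
        results[label] = phoneme_level(word_results[0], word_results[1])
    return results


# Toy stand-ins (assumptions) so the sketch is self-contained
def toy_word_level(target, transcript):
    t, e = target.lower().split(), transcript.lower().split()
    return [[w for w in t if w not in e], [w for w in e if w not in t]]


def toy_phoneme_level(missing, extra):
    return [missing, extra, max(len(missing), len(extra))]


demo = compare_all(
    "she was leaving that day",
    {"speaker_a": "she was leaving that day", "speaker_b": "she was leafing that day"},
    toy_word_level,
    toy_phoneme_level,
)
for label, result in demo.items():
    print(label, result)
```

In the notebook itself, the last two arguments would presumably be `error_finder` and `arpa_error_finder`, with the dictionary holding labels such as `"socal_1_whisper"` mapped to their transcripts.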

To continue testing, I'll bring down the previous examples again.

In [20]:
target_sentence_2 = "You haven't even been to the In-n-Out in the Outback Steakhouse neighborhood?"
print(target_sentence_2)
print()

socal_1_whisper = "You haven't even been to the end and out in the Outback Steakhouse neighborhood."
indian_1_whisper = "You haven't even been to the inn and out in the outback stack or sniper herd."
indian_2_whisper = "You'll haven't even been to the inn and out in the Outback Steakhouse neighborhood."

socal_1_azure = "you haven't even been to the in and out in the outback steakhouse neighborhood"
indian_1_azure = "you haven't even been to the in and out in the outback steakhouse neighborhood"
indian_2_azure = "you haven't even been to the in and out in the outback steakhouse neighborhood"

print("socal_1_whisper")
for_arpa = error_finder(target_sentence_2, socal_1_whisper)
target_list = for_arpa[0]
error_list = for_arpa[1]
print(arpa_error_finder(target_list, error_list))
print()

print("indian_1_whisper")
for_arpa = error_finder(target_sentence_2, indian_1_whisper)
target_list = for_arpa[0]
error_list = for_arpa[1]
print(arpa_error_finder(target_list, error_list))
print()

print("indian_2_whisper")
for_arpa = error_finder(target_sentence_2, indian_2_whisper)
target_list = for_arpa[0]
error_list = for_arpa[1]
print(arpa_error_finder(target_list, error_list))
print()

print("socal_1_azure")
for_arpa = error_finder(target_sentence_2, socal_1_azure)
target_list = for_arpa[0]
error_list = for_arpa[1]
print(arpa_error_finder(target_list, error_list))
print()

print("indian_1_azure")
for_arpa = error_finder(target_sentence_2, indian_1_azure)
target_list = for_arpa[0]
error_list = for_arpa[1]
print(arpa_error_finder(target_list, error_list))
print()

print("indian_2_azure")
for_arpa = error_finder(target_sentence_2, indian_2_azure)
target_list = for_arpa[0]
error_list = for_arpa[1]
print(arpa_error_finder(target_list, error_list))
print()

You haven't even been to the In-n-Out in the Outback Steakhouse neighborhood?

socal_1_whisper
List of words missing from prompt in transcription:  ['innout']
List of words in transcription not present in prompt:  ['end', 'and', 'out']
arpabet versions of words:  [['IH2', 'N', 'AH0', 'N', 'AW2', 'T'], ['EH1', 'N', 'D', 'AH0', 'N', 'D', 'AW1', 'T']]
[['IH2', 'AW2'], ['EH1', 'D', 'D', 'AW1'], 4]

indian_1_whisper
List of words missing from prompt in transcription:  ['innout', 'steakhouse', 'neighborhood']
List of words in transcription not present in prompt:  ['inn', 'and', 'out', 'stack', 'or', 'sniper', 'herd']
arpabet versions of words:  [['IH2', 'N', 'AH0', 'N', 'AW2', 'T', 'S', 'T', 'EY1', 'K', 'HH', 'AW2', 'S', 'N', 'EY1', 'B', 'ER0', 'HH', 'UH2', 'D'], ['IH1', 'N', 'AH0', 'N', 'D', 'AW1', 'T', 'S', 'T', 'AE1', 'K', 'AO1', 'R', 'S', 'N', 'AY1', 'P', 'ER0', 'HH', 'ER1', 'D']]
[['IH2', 'AW2', 'EY1', 'AW2', 'EY1', 'B', 'HH', 'UH2'], ['IH1', 'AW1', 'AE1', 'AO1', 'R', 'AY1', 'P', 'ER1',

This is really great! It gives me a more nuanced understanding of where exactly Whisper is messing up, and the difference between the Southern California dialect and the Indian dialects is pretty big in this case.

Microsoft Azure did pretty great with this one, so the next example will help me see the effects of phoneme-level errors.

In [21]:
target_sentence_3 = "Bro sure with all my work I've made it to the brochure."
print(target_sentence_3)
print()

indian_2_whisper = "Bro, share with all my work. I've made it to the brochure."
socal_1_whisper = "brochure with all my work I've made it to the brochure."
indian_1_whisper = "I'm going to show you all my work I've made to the brush."

indian_2_azure = "bro sure with all my work i've made it to the brochure"
socal_1_azure = "brochure with all my work i've made it to the brochure"
indian_1_azure = "share with all my work i've made it to the brochure"

print("socal_1_whisper")
for_arpa = error_finder(target_sentence_3, socal_1_whisper)
target_list = for_arpa[0]
error_list = for_arpa[1]
print(arpa_error_finder(target_list, error_list))
print()

print("indian_1_whisper")
for_arpa = error_finder(target_sentence_3, indian_1_whisper)
target_list = for_arpa[0]
error_list = for_arpa[1]
print(arpa_error_finder(target_list, error_list))
print()

print("indian_2_whisper")
for_arpa = error_finder(target_sentence_3, indian_2_whisper)
target_list = for_arpa[0]
error_list = for_arpa[1]
print(arpa_error_finder(target_list, error_list))
print()

print("socal_1_azure")
for_arpa = error_finder(target_sentence_3, socal_1_azure)
target_list = for_arpa[0]
error_list = for_arpa[1]
print(arpa_error_finder(target_list, error_list))
print()

print("indian_1_azure")
for_arpa = error_finder(target_sentence_3, indian_1_azure)
target_list = for_arpa[0]
error_list = for_arpa[1]
print(arpa_error_finder(target_list, error_list))
print()

print("indian_2_azure")
for_arpa = error_finder(target_sentence_3, indian_2_azure)
target_list = for_arpa[0]
error_list = for_arpa[1]
print(arpa_error_finder(target_list, error_list))
print()

Bro sure with all my work I've made it to the brochure.

socal_1_whisper
List of words missing from prompt in transcription:  ['bro', 'sure']
List of words in transcription not present in prompt:  ['brochure']
arpabet versions of words:  [['B', 'R', 'OW1', 'SH', 'UH1', 'R'], ['B', 'R', 'OW0', 'SH', 'UH1', 'R']]
[['OW1'], ['OW0'], 1]

indian_1_whisper
List of words missing from prompt in transcription:  ['bro', 'sure', 'with', 'it', 'brochure']
List of words in transcription not present in prompt:  ['im', 'going', 'show', 'you', 'to', 'brush']
arpabet versions of words:  [['B', 'R', 'OW1', 'SH', 'UH1', 'R', 'W', 'IH1', 'DH', 'IH1', 'T', 'B', 'R', 'OW0', 'SH', 'UH1', 'R'], ['IH1', 'M', 'G', 'OW1', 'IH0', 'NG', 'SH', 'OW1', 'Y', 'UW1', 'T', 'UW1', 'B', 'R', 'AH1', 'SH']]
[['UH1', 'R', 'W', 'DH', 'IH1', 'B', 'R', 'OW0', 'UH1', 'R'], ['M', 'G', 'IH0', 'NG', 'OW1', 'Y', 'UW1', 'UW1', 'AH1'], 10]

indian_2_whisper
List of words missing from prompt in transcription:  ['sure']
List of words in 

Whisper's struggle with the Indian dialects becomes starkly visible using this transcription. Interestingly, though, this is a good way to see that Azure is also doing pretty poorly with one of the Indian dialects compared to the other one and the Southern California dialect. This was not a clear distinction at the word-level, but becomes visible and easily quantifiable at the phoneme-level.
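
One detail in the `socal_1_whisper` output above: 'bro sure' and 'brochure' differ only in `OW1` vs `OW0`, that is, in the stress digit that CMUdict attaches to vowels. If stress is not meant to count as an error, one option is to strip those digits before comparing. A minimal sketch, where `strip_stress` is a hypothetical helper name and not part of this notebook:

```python
def strip_stress(phonemes):
    """Drop the trailing stress digit (0, 1, or 2) from ARPAbet vowel symbols."""
    return [p.rstrip("012") for p in phonemes]


bro_sure = ['B', 'R', 'OW1', 'SH', 'UH1', 'R']
brochure = ['B', 'R', 'OW0', 'SH', 'UH1', 'R']

print(strip_stress(bro_sure))
print(strip_stress(bro_sure) == strip_stress(brochure))
```

Whether to keep or drop stress depends on the question being asked: keeping it records a difference between 'bro sure' and 'brochure', while dropping it treats them as the same phoneme sequence.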

There seem to be some great advantages to computing the phoneme-level errors for measurement here. I will use the last set of examples to see if Whisper's error counts continue to be markedly worse for Indian dialects. We know that Azure got all of this right, so we're going to skip over those transcriptions and focus on Whisper specifically.

In [22]:
target_sentence_4 = "He is leaving for Bengaluru tomorrow."
print(target_sentence_4)
print()

indian_1_whisper = "He's leaving for Bangalore tomorrow"
indian_2_whisper = "He is leaving for Bangalore tomorrow."
socal_1_whisper = "He is leaving for Bengaluru tomorrow."

print("socal_1_whisper")
for_arpa = error_finder(target_sentence_4, socal_1_whisper)
target_list = for_arpa[0]
error_list = for_arpa[1]
print(arpa_error_finder(target_list, error_list))
print()

print("indian_1_whisper")
for_arpa = error_finder(target_sentence_4, indian_1_whisper)
target_list = for_arpa[0]
error_list = for_arpa[1]
print(arpa_error_finder(target_list, error_list))
print()

print("indian_2_whisper")
for_arpa = error_finder(target_sentence_4, indian_2_whisper)
target_list = for_arpa[0]
error_list = for_arpa[1]
print(arpa_error_finder(target_list, error_list))
print()

He is leaving for Bengaluru tomorrow.

socal_1_whisper
List of words missing from prompt in transcription:  []
List of words in transcription not present in prompt:  []
arpabet versions of words:  [[], []]
[[], [], 0]

indian_1_whisper
List of words missing from prompt in transcription:  ['he', 'is', 'bengaluru']
List of words in transcription not present in prompt:  ['hes', 'bangalore']
arpabet versions of words:  [['HH', 'IY1', 'IH1', 'Z', 'B', 'EH', 'NG', 'G', 'UH', 'L', 'UW', 'R', 'UW'], ['N', 'O', 'T', ' ', 'F', 'O', 'U', 'N', 'D', 'B', 'AE1', 'NG', 'G', 'AH0', 'L', 'AO2', 'R']]
[['HH', 'IY1', 'IH1', 'Z', 'EH', 'UH', 'UW', 'UW'], ['N', 'O', 'T', ' ', 'F', 'O', 'U', 'N', 'D', 'AE1', 'AH0', 'AO2'], 12]

indian_2_whisper
List of words missing from prompt in transcription:  ['bengaluru']
List of words in transcription not present in prompt:  ['bangalore']
arpabet versions of words:  [['B', 'EH', 'NG', 'G', 'UH', 'L', 'UW', 'R', 'UW'], ['B', 'AE1', 'NG', 'G', 'AH0', 'L', 'AO2', 'R']]
[

While Whisper does perfectly with the Southern California dialect and not-so-bad with one of the Indian dialects, how much worse it does with the other one is a lot clearer using phonemic analysis rather than word-level analysis.

The magnitude of differences is fascinating and very insightful!
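
One caveat about the `indian_1_whisper` output above: the transcript side begins with the letters of `NOT FOUND`. It looks as though 'hes' was not found in the pronunciation dictionary and the placeholder string was then split into characters and counted as phonemes, which inflates the count of 12. One way to avoid this is to collect unknown words separately instead of converting them. In the sketch below, `words_to_phonemes` is a hypothetical name and `toy_dict` is a tiny stand-in for CMUdict (both are assumptions, not the notebook's actual lookup code).

```python
def words_to_phonemes(words, pron_dict):
    """Return (phonemes, unknown_words) without turning unknown words into phonemes."""
    phonemes, unknown = [], []
    for word in words:
        pron = pron_dict.get(word.lower())
        if pron is None:
            # keep the word for manual review instead of emitting a placeholder
            unknown.append(word)
        else:
            phonemes.extend(pron)
    return phonemes, unknown


# Tiny stand-in dictionary (assumption); the real lookup would use CMUdict
toy_dict = {"bangalore": ['B', 'AE1', 'NG', 'G', 'AH0', 'L', 'AO2', 'R']}

phonemes, unknown = words_to_phonemes(['hes', 'bangalore'], toy_dict)
print(phonemes)
print(unknown)
```

The unknown words could then be reported next to the error count, so that a missing dictionary entry is not mistaken for an ASR error.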

### Overall Phoneme Error Measurement: Surprisingly Intuitive, Very Insightful!

Although I had initially hypothesized that word-level error measurement would be most intuitive and useful, I found that phoneme-level analysis was a lot more informative and helpful to my analysis. 

Calculating errors using CMUdict is a great way to approximate the magnitudes of differences across different ASR services and dialects. The granularity from phoneme-level error calculations turned out to be extremely interesting!
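
The comparison in this notebook treats each group of words as an unordered collection of phonemes, so phonemes that appear in a different order are not counted as errors. A standard order-aware alternative, which is not what the code above does, is Levenshtein edit distance computed over the phoneme lists. A minimal sketch, with `phoneme_edit_distance` as a hypothetical name:

```python
def phoneme_edit_distance(target, transcript):
    """Minimum number of substitutions, insertions, and deletions between two phoneme lists."""
    # prev[j] holds the distance between the target prefix so far and transcript[:j]
    prev = list(range(len(transcript) + 1))
    for i, t in enumerate(target, start=1):
        curr = [i]
        for j, e in enumerate(transcript, start=1):
            cost = 0 if t == e else 1
            curr.append(min(prev[j] + 1,          # delete t
                            curr[j - 1] + 1,      # insert e
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


leaving = ['L', 'IY1', 'V', 'IH0', 'NG']
leafing = ['L', 'IY1', 'F', 'IH0', 'NG']
print(phoneme_edit_distance(leaving, leafing))
```

Dividing this distance by the number of phonemes in the target would give a rate that can be compared across sentences of different lengths.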

# Conclusion

Overall, this was a really interesting project! I was able to use my knowledge from class and gain some meaningful insights into how ASR works, and why it's making the kinds of errors it is.

My big learnings from this project were:
1) Several factors matter when recording the audio itself. The speaker's speech style can be a really important factor in how ASR processes the speech. From discussions in class, this may be due to the training data for these systems using formal speech on average, biases in the training data, or something else entirely. It was really interesting to try using different levels of background noise and varying speech styles to see how results compared across them. The homeworks from class got me thinking about this - I will keep an eye out for it now when I'm recording participants or hearing their audio files!
2) Differences in accuracy across dialects vary a lot by service: I was not expecting Microsoft Azure to do nearly as well as it did! I was pleasantly surprised by Azure's results across dialects being fairly consistent, whereas Whisper was visibly worse (in both error measurements) with Indian dialects. I had expected the variation to be more consistent across different ASR services, but the training data being regularly updated and getting unique dialects might've given Azure the edge here.
3) Phoneme-Level Error Measurements are great! I discussed this already, but it's really interesting how much of a difference it can make to look closely at these transcripts. It's also surprisingly challenging to find a good way to do this - I still have several kinks I want to work out in calculating the phoneme-level errors that I mentioned throughout.
4) Overall, Southern California dialects are still transcribed better: In spite of Azure's impressive performance, the phoneme-level analysis showed that generally, Southern California dialects are still transcribed better. This is probably due to this dialect being more fully represented in the training datasets for these systems.

This project has led to many ethical questions for those working on and creating these ASR systems. I think there's a lot of room to grow and explore here, and I'm glad I could use this project as an opportunity to figure some part of it out. Maybe, just maybe, more equitable training data is the solution!