# Dialect, Race, Gender, and the Pressure to Conform to Standards

## Introduction and Motivations
Dialects are considered variations of a language that are generally mutually intelligible. Because English is spoken in many countries among diverse groups of speakers, English speakers exhibit diverse accents, localized words, and grammatical structures based on their countries and regions. Major English dialects are often classified into British Isles, North American, and Australasian categories. Dialects are linked not only to geographical regions but also to specific social groups. Each English-speaking country has its own version of Standard English, often associated with education and formal communication.

In this study, I examined and analyzed how Microsoft's speech-to-text API performs on different dialects, specifically Standard American English, Standard British English, and African American Vernacular English, to see whether the API performs better on any of these groups than on the others. I also wanted to see whether differences in dialect, race, or gender would yield the most drastic performance gap.

## Dataset and Categories

For the dataset, I collected SNL monologues from YouTube performed by different men and women, all of whom were native English speakers. I downloaded the audio as .wav files using https://y2down.cc/en/youtube-wav.html. After getting the .wav files, I collected the transcripts from SNL Transcripts (for example, https://snltranscripts.jt.org/2022/megan-thee-stallion-monologue.phtml), then corrected each transcript by listening to the recording and making changes wherever needed. I also edited the .wav files so that they include only the speaker's speech, and cut out parts that had a lot of background noise (for example, clapping) to ensure that the comparison between speakers is as fair as possible.
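The text does not say which tool was used to trim the recordings. As one possible approach, the sketch below cuts a time range out of a .wav file with Python's standard `wave` module; the function name `trim_wav` and the file names in the usage note are illustrative assumptions, not part of the original project.

```python
import wave


def trim_wav(src_path, dst_path, start_sec, end_sec):
    """Copy the audio between start_sec and end_sec into a new .wav file."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        start_frame = int(start_sec * rate)
        n_frames = int((end_sec - start_sec) * rate)
        src.setpos(start_frame)            # jump to the first frame to keep
        frames = src.readframes(n_frames)  # read only the wanted range
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)  # the frame count in the header is corrected on close
        dst.writeframes(frames)
```

A call such as `trim_wav("raw_monologue.wav", "keke_palmer.wav", 12.0, 95.5)` (hypothetical file names and timestamps) would keep only the audio between 12.0 and 95.5 seconds. Removing applause in the middle of a recording would need several such ranges joined together.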

I had three different ways of analyzing performance as mentioned before:

### Dialects
First and foremost, I analyzed the performance across dialects of English, with each speaker categorized into one of three groups:
1. African American Vernacular English:
   - Keke Palmer (Black)
   - Tiffany Haddish (Black)
   - Eddie Murphy (Black)
   - Megan Thee Stallion (Black)

2. Standard American English
   - Billie Eilish (white)
   - Kim Kardashian (white)
   - Don Cheadle (Black)
   - Ayo Edebiri (Black)
   - Jack Harlow (white)
   - Pete Davidson (white)

3. Standard British English
   - Daniel Kaluuya (Black)
   - Idris Elba (Black)
   - Benedict Cumberbatch (white)
   - Phoebe Waller-Bridge (white)

### Race
The second analysis measured how race affects the accuracy of speech-to-text transcription, with two categories in this case:
1. Black Speakers:
   - Daniel Kaluuya 
   - Idris Elba 
   - Don Cheadle
   - Keke Palmer 
   - Tiffany Haddish 
   - Eddie Murphy 
   - Megan Thee Stallion 
   - Ayo Edebiri 
  
2. White Speakers:
   - Benedict Cumberbatch 
   - Jack Harlow
   - Pete Davidson
   - Phoebe Waller-Bridge 
   - Billie Eilish 
   - Kim Kardashian 

### Gender
I also wanted to see if gender would have a noticeable impact on the performance of the speech-to-text algorithm, so I categorized the speakers into one of two categories:
1. Women
   - Keke Palmer 
   - Tiffany Haddish 
   - Megan Thee Stallion
   - Ayo Edebiri
   - Billie Eilish 
   - Kim Kardashian  
   - Phoebe Waller-Bridge
2. Men
   - Benedict Cumberbatch
   - Pete Davidson
   - Jack Harlow
   - Eddie Murphy 
   - Daniel Kaluuya 
   - Idris Elba
   - Don Cheadle
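The groupings above can be captured in a single lookup table. The sketch below is an assumption about how `name_info` in `helpers.py` might be laid out, inferred from how the notebook later indexes it (`name_info[speaker][group]` with the groups `"gender"`, `"dialect"`, and `"race"`). Only three of the 14 speakers are shown, and the helper `speakers_in` is hypothetical.

```python
# Hypothetical excerpt of the speaker metadata; the real helpers.name_info
# covers all 14 speakers and may be structured differently.
name_info_example = {
    "keke_palmer": {"gender": "female", "dialect": "aave", "race": "black"},
    "don_cheadle": {"gender": "male", "dialect": "sae", "race": "black"},
    "benedict_cumberbatch": {"gender": "male", "dialect": "sbe", "race": "white"},
}


def speakers_in(info, group, criteria):
    """Return the speakers whose value for `group` equals `criteria`."""
    return [name for name, attrs in info.items() if attrs[group] == criteria]


print(speakers_in(name_info_example, "dialect", "sae"))
print(speakers_in(name_info_example, "gender", "male"))
```

Keeping all three attributes on each speaker means the same transcript scores can be regrouped by dialect, race, or gender without re-running the transcription.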

## Hypotheses and Setup 
My hypothesis was that dialect would have the biggest impact, gender the second biggest, and that race would result in the smallest difference between groups.


Now onto testing the hypothesis! You will need to make sure that everything runs correctly. The current API key and setup should run from your computer, but if they do not, please follow the steps detailed below. These steps are taken entirely from Microsoft's Learning page, from the module called "Create your first Azure AI speech to text application" (link: https://learn.microsoft.com/en-us/training/modules/create-your-first-speech-to-text-app/2-create-azure-cognitive-services-account).


### Creating an Azure AI services account using the Azure portal
The multi-service resource is listed under Azure AI services > Azure AI services multi-service account in the portal. To create a multi-service resource follow these instructions:

1. Create an Azure account through the [Azure portal](https://portal.azure.com/) if you don't already have one. You can use your Cornell email to create an account. If you already have an account, sign in at the same link.

2. Create a multi-service resource [here](https://portal.azure.com/#create/Microsoft.CognitiveServicesAllInOne)

3. This should take you to a page titled "Create Azure AI services". Provide the information it asks for. 

4. After you are done with the previous step, you should see a green checkmark and a confirmation saying "Your deployment is complete". If you see this, click on "Deployment details".
   
5. Now, click on the link under "Resource" that should be named whatever you named it in step 3.
   
6. You should now see a subpage called "Essentials" in the middle of the screen. Click on the link at "Manage keys". This should take you to a page called "Keys and Endpoint".
   
7. Once you are at the page "Keys and Endpoint"  copy "KEY 1". 
   
8. Now open Program.cs. Locate line 25: `string azureKey = "c6330815003e4d7d94e03f17aa36a880"`. Replace its value with the value you copied from "KEY 1".
  
9. Now go back to the page "Keys and Endpoint" and copy the value at "Location/Region". 
    
10. Now open Program.cs. Locate line 25: `string azureLocation = "eastus";`. Replace its value with the value you copied from "Location/Region".
    
11. Save your changes!
    
12. Now go back to [here](https://learn.microsoft.com/en-us/training/modules/create-your-first-speech-to-text-app/2-create-azure-cognitive-services-account). Go to unit 3 out of 8 by clicking the arrow.
    
13. Follow the directions in Unit 3 in the terminal of **THE JUPYTER NOTEBOOK**
    
14. After finishing Unit 3, you should be all set to continue to the next step, which is installing dependencies. If you run into any troubles, just continue onto unit 4 and follow links wherever needed.






### Dependencies [DO THIS BEFORE RUNNING ANY CODE]

To be able to run the algorithm, you will need to install several dependencies and set up a virtual environment. Follow these steps to ensure the algorithm runs smoothly:

1. Open your terminal in VS code by going to Terminal >> New Terminal and type `dotnet add package Microsoft.CognitiveServices.Speech`
   
2. Now, still in the terminal, type `code Program.cs` to ensure that the system is set up correctly. 

3. To be able to run C# code from a Jupyter Python kernel, ensure that you have .NET installed
   To download do the following steps:
   1. Go to https://dotnet.microsoft.com/en-us/download
   2. Download the appropriate package for your operating system 
   3. To find the path of where .NET was installed open your terminal and do the following:
      - type "dotnet" into your terminal to ensure that it downloaded correctly
         it should output something like this:

         `(base) yourname@dhcp-vl2041-25861 ~ % dotnet`

         `Usage: dotnet [options]`

         `Usage: dotnet [path-to-application]`

         `Options:`

         ` -h|--help         Display help.`

         ` --info            Display .NET information.`

         ` --list-sdks       Display the installed SDKs.`

         ` --list-runtimes   Display the installed runtimes.`

         `path-to-application:`

         `The path to an application .dll file to execute.`

      - If it's successful, type "where dotnet" into your terminal
         That should output something like this:

         
         `(base) yourname@dhcp-vl2041-25861 ~ % where dotnet`

        ` /usr/local/share/dotnet/dotnet`

         Copy the line "`/usr/local/share/dotnet/dotnet`". This may be slightly different for you, and that's okay. In the code cell below (3 cells below), replace the line `dotnet_path = "/usr/local/share/dotnet/dotnet"` with your actual path.
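As an alternative to copying the path by hand, Python can usually locate the executable itself. This is a minimal sketch, assuming `dotnet` is on your `PATH`; the helper name `find_dotnet` and its fallback argument are illustrative and not part of the notebook.

```python
import shutil


def find_dotnet(fallback="/usr/local/share/dotnet/dotnet"):
    """Return the path to the dotnet executable, or `fallback` if it is not on PATH."""
    return shutil.which("dotnet") or fallback


dotnet_path = find_dotnet()
print(dotnet_path)
```

If the printed path differs from the one reported by `where dotnet`, trust the terminal output and set `dotnet_path` manually.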




PS: Running the following cell takes about 20-30 minutes depending on your computer. I prepopulated the files in case you (my TA or whoever is grading this :D) don't want to wait that long. If you want to double-check that this cell still runs as expected, you can comment out the for loop and instead run the part currently commented out a few cells below. You can also replace the name with any name from the following list:

In [1]:
from helpers import name_info

print(list(name_info.keys()))

['ayo_edebiri', 'benedict_cumberbatch', 'billie_eilish', 'daniel_kaluuya', 'don_cheadle', 'eddie_murphy', 'idris_elba', 'jack_harlow', 'keke_palmer', 'kim_kardashian', 'megan_thee_stallion', 'pete_davidson', 'phoebe_waller_bridge', 'tiffany_haddish']


In [2]:
! pip install textdistance
! pip install scikit-learn



import helpers 
speaker_list = list(helpers.name_info.keys())

In [3]:
import subprocess
import os
import textdistance

def run_speech_to_text(name):
    current_directory = os.getcwd()
    dotnet_path = "/usr/local/share/dotnet/dotnet"  # Replace with your actual path
    subprocess.run([dotnet_path, "run", name], cwd=current_directory)

# Example usage
speaker_list = list(name_info.keys())

for speaker in speaker_list:
  run_speech_to_text(speaker)

# This part can be un-commented if you don't want to wait the 22 minutes to rerun and instead just want to check one or a few people's transcripts. 
# run_speech_to_text("INSERT_YOUR_SPEAKER")


Speech recognition started for ayo_edebiri.
Speech recognition stopped for ayo_edebiri.
Speech recognition started for benedict_cumberbatch.
Speech recognition stopped for benedict_cumberbatch.
Speech recognition started for billie_eilish.
Speech recognition stopped for billie_eilish.
Speech recognition started for daniel_kaluuya.
Speech recognition stopped for daniel_kaluuya.
Speech recognition started for don_cheadle.
Speech recognition stopped for don_cheadle.
Speech recognition started for eddie_murphy.
Speech recognition stopped for eddie_murphy.
Speech recognition started for idris_elba.
Speech recognition stopped for idris_elba.
Speech recognition started for jack_harlow.
Speech recognition stopped for jack_harlow.
Speech recognition started for keke_palmer.
Speech recognition stopped for keke_palmer.
Speech recognition started for kim_kardashian.
Speech recognition stopped for kim_kardashian.
Speech recognition started for megan_thee_stallion.
Speech recognition stopped for meg

To analyze how Microsoft's API performed, I wanted to measure the accuracy based on several factors.
I did some minor pre-processing: turning the text documents into strings, taking out punctuation, and converting all characters to lowercase, so that differences below the word level (for example, a different punctuation mark or different capitalization) are not counted as transcription mistakes.
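The pre-processing described above can be sketched as follows. The regular expression is the one used later in the notebook; the function name `preprocess` and the two example sentences are made up for illustration.

```python
import re


def preprocess(text):
    """Lowercase the text and replace every punctuation character with a space."""
    return re.sub(r"[^\w\s]", " ", text.lower())


# Made-up reference transcript and API output for illustration.
reference = preprocess("Thank you, thank you! It's great to be here.")
hypothesis = preprocess("thank you thank you its great to be here")

print(reference.split())
print(hypothesis.split())
```

Note that this regex also turns apostrophes into spaces, so a contraction such as `it's` becomes the two tokens `it` and `s`, which will not match `its` in the other text.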

I used some predefined functions in Python's "textdistance" library (taken from https://pypi.org/project/textdistance/), as well as some code I wrote. To get an overall view of how the API performed on different demographics, I used 4 types of measurements to calculate how it performed on each of them.

These 4 types were:
1. Edit based:
   This measures how much post-transcription editing a person would have to do to get to the accurate transcript. The functions for this type were
   - Accuracy: how many words the API got correct --> closer to 1 is better, 0 is worse
   - Word-error-rate: how many insertions/deletions/substitutions are needed to get to the correct transcript, divided by the number of words in the correct transcript --> closer to 0 is better, higher is worse
2. Token based
   - Cosine-similarity: calculates the similarity of two vectors by taking their dot product and dividing it by the product of their magnitudes (1 is better)
   - Jaccard_distance: calculates the similarity of two text documents by comparing the number of unique terms used in both documents (1 is better)
3. Compression based
   This is based on the idea that similar strings can be compressed more effectively than less similar ones.
   - Square root: compares the size of compressed data (which is the sum of square roots of counts of every element) between the 2 texts (0 is better)
4. Phonetic based
   - Match Rating Approach (MRA): "indexing of words by their pronunciation developed by Western Airlines in 1977 for the indexation and comparison of homophonous names" (Moore, G B.; Kuhns, J L.; Treffzs, J L.; Montgomery, C A. (Feb 1, 1977))
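To make two of these measurements concrete, here is a small pure-Python sketch of a word error rate and a word-level Jaccard similarity on a made-up sentence pair. The function names are illustrative. `jaccard_words` compares sets of whole words, which is a simplification and not necessarily what `textdistance.jaccard` computes when it is given raw strings. The sketch assumes a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j - 1] + cost,  # match or substitution
                            curr[j - 1] + 1,     # insertion
                            prev[j] + 1))        # deletion
        prev = curr
    return prev[-1] / len(ref)


def jaccard_words(reference, hypothesis):
    """Jaccard similarity of the two sets of unique words."""
    ref, hyp = set(reference.split()), set(hypothesis.split())
    return len(ref & hyp) / len(ref | hyp)


ref = "the cat sat on the mat"
hyp = "the cat sat on mat"
print(round(word_error_rate(ref, hyp), 3))  # one deleted word out of six
print(round(jaccard_words(ref, hyp), 3))    # both texts use the same unique words
```

The pair shows why several metrics are useful: the dropped word raises the word error rate, while a set-based token metric does not notice it because both texts contain the same unique words.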


In [4]:
! pip install textdistance
! pip install numpy



In [6]:
import textdistance
import numpy as np
import re

import string
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer


def generate_scores(name):
    # Read the reference transcript and the API output as single strings.
    with open("transcript_files/" + name + ".txt") as f:
        reference_str = f.read()
    with open("output_files/" + name + "_output.txt") as f:
        hypothesis_str = f.read()

    # Lowercase and replace punctuation with spaces so that only
    # word-level differences are scored.
    hypothesis = re.sub(r'[^\w\s]', " ", hypothesis_str.lower())
    reference = re.sub(r'[^\w\s]', " ", reference_str.lower())

    # edit based
    accuracy = get_accuracy(reference, hypothesis)
    wer = get_wer(reference, hypothesis)
    # token based
    cos_sim = cosine_sim(reference, hypothesis)
    jaccard_distance = textdistance.jaccard(reference, hypothesis)
    # compression based
    sqrt_dist = textdistance.sqrt_ncd(reference, hypothesis)
    # phonetic based
    mra_distance = textdistance.mra(reference, hypothesis)

    return (accuracy, wer, cos_sim, jaccard_distance, sqrt_dist, mra_distance)


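def get_wer(reference, hypothesis):
    # Word error rate: word-level edit distance (substitutions, insertions,
    # deletions) divided by the number of words in the reference.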
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    d = np.zeros((len(ref_words) + 1, len(hyp_words) + 1))
    for i in range(len(ref_words) + 1):
        d[i, 0] = i
    for j in range(len(hyp_words) + 1):
        d[0, j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            if ref_words[i - 1] == hyp_words[j - 1]:
                d[i, j] = d[i - 1, j - 1]
            else:
                substitution = d[i - 1, j - 1] + 1
                insertion = d[i, j - 1] + 1
                deletion = d[i - 1, j] + 1
                d[i, j] = min(substitution, insertion, deletion)

    wer = d[len(ref_words), len(hyp_words)] / len(ref_words)
    return wer


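def normalize(text):
    # Lowercase, replace punctuation with spaces, and split on single spaces.
    text = re.sub(r'[^\w\s]', " ", text.lower())
    text = text.split(" ")
    return text

def cosine_sim(text1, text2):
    vectorizer = TfidfVectorizer(tokenizer=normalize)
    tfidf = vectorizer.fit_transform([text1, text2])
    # TF-IDF rows are L2-normalized by default, so the product of the two
    # rows is their cosine similarity.
    return (tfidf * tfidf.T).toarray()[0, 1]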


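def get_accuracy(reference, hypothesis):
    # Compares the two word lists pair by pair along the diagonal of the table
    # (no insertions or deletions), adding 1 for every mismatching pair, and
    # divides the total by the number of reference words, so the returned
    # value grows with the number of mismatching pairs.
    ref_words = reference.split()
    hyp_words = hypothesis.split()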
    
    d = np.zeros((len(ref_words) + 1, len(hyp_words) + 1))
    for i in range(len(ref_words) + 1):
        d[i, 0] = i
    for j in range(len(hyp_words) + 1):
        d[0, j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            if ref_words[i - 1] == hyp_words[j - 1]:
                d[i, j] = d[i - 1, j - 1]
            else:
                d[i, j] = d[i - 1, j - 1] + 1

    accuracy = d[len(ref_words), len(hyp_words)] / len(ref_words)
    return accuracy



In [7]:
from helpers import name_info
# from analysis import generate_scores 

speaker_scores = {}
for name in name_info.keys():
  speaker_scores[name] = generate_scores(name)




In [9]:
# sae_speakers, sbe_speakers, aave_speakers,female_speakers, male_speakers, black_speakers, white_speakers
from helpers import name_info

def extract_info_of_criteria(criteria, group):
  # Running sums for: accuracy, wer, cos_sim, jaccard_distance, sqrt_dist, mra_distance
  acc, wer, cos_sim, jacc_dist, sqrt_dist, mra_distance = 0, 0, 0, 0, 0, 0
  num_in_group = 0

  for speaker in speaker_scores:
    if name_info[speaker][group] == criteria:
      acc += speaker_scores[speaker][0]
      wer += speaker_scores[speaker][1]
      cos_sim += speaker_scores[speaker][2]
      jacc_dist += speaker_scores[speaker][3]
      sqrt_dist+=  speaker_scores[speaker][4]
      mra_distance +=  speaker_scores[speaker][5]
      num_in_group += 1
    
  return (round(acc/num_in_group, 3), 
          round(wer/num_in_group, 3), 
          round(cos_sim/num_in_group, 3),  
          round(jacc_dist/num_in_group, 3),  
          round(sqrt_dist/num_in_group, 3), 
          round(mra_distance/num_in_group, 3))

score_of_men = extract_info_of_criteria("male", "gender")
score_of_women = extract_info_of_criteria("female", "gender")

score_of_sae = extract_info_of_criteria("sae", "dialect")
score_of_sbe = extract_info_of_criteria("sbe", "dialect")
score_of_aave = extract_info_of_criteria("aave", "dialect")

score_of_black_speakers = extract_info_of_criteria("black", "race")
score_of_white_speakers = extract_info_of_criteria("white", "race")


print("Women: " + str(score_of_women))
print("Men: " + str(score_of_men))
print()

print("SAE Speakers: " + str(score_of_sae))
print("SBE Speakers: " + str(score_of_sbe))
print("AAVE Speakers: " + str(score_of_aave))
print()


print("Black Speakers: " + str(score_of_black_speakers))
print("White Speakers: " + str(score_of_white_speakers))

Women: (0.989, 0.124, 0.98, 0.935, 0.422, 2.0)
Men: (0.851, 0.105, 0.971, 0.961, 0.423, 1.714)

SAE Speakers: (0.896, 0.077, 0.979, 0.959, 0.42, 1.8)
SBE Speakers: (0.957, 0.128, 0.979, 0.959, 0.422, 1.833)
AAVE Speakers: (0.887, 0.153, 0.962, 0.906, 0.43, 2.0)

Black Speakers: (0.937, 0.147, 0.97, 0.938, 0.425, 1.714)
White Speakers: (0.903, 0.083, 0.981, 0.958, 0.421, 2.0)


Since some of the measurements indicate high similarity when the score is 0, and some of them 1, I decided to flip the scores of the measurements that indicate high similarity when the score is 0 for the numbers to be easier to interpret. For the MRA, I decided to get the sum of all MRA scores for all demographics, then give the score of (1 - x/sum) for each demographic where x is their MRA score in order to normalize them. Therefore, the higher the modified MRA score the better the algorithm performed for that demographic.
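Written out, the two transformations described above are as follows, where $s$ is a score for which 0 is best and $m_d$ is the MRA score of demographic $d$ (these symbols are introduced here only for illustration):

$$
s' = 1 - s, \qquad m'_d = 1 - \frac{m_d}{\sum_{k} m_k}
$$

For example, a word-error-rate of 0.124 becomes $1 - 0.124 = 0.876$. The seven MRA scores printed above sum to 13.061, so an MRA score of 2.0 becomes $1 - 2.0 / 13.061 \approx 0.847$.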

In [10]:
# Closer to 0 is a better score: wer [1], square root distance [4]
# Closer to 1 is a better score: accuracy [0], cosine similarity [2], jaccard distance [3]
# The smaller the score the better: mra distance [5]
reverse_index = {0: "Women", 1: "Men", 2: "Standard American English Speakers", 3:"Standard British English Speakers", 4:"African American Vernacular English Speakers", 5:"Black Speakers",  6:"White Speakers"}
# We want to flip index 1 and index 4
list_of_measurements = [score_of_women, score_of_men, score_of_sae, score_of_sbe, score_of_aave, score_of_black_speakers, score_of_white_speakers]
list_of_measurements_modified = []

mra_sum = sum(tup[5] for tup in list_of_measurements)

for measurment in list_of_measurements:
  measurment_mod = list(measurment)
  measurment_mod[1] = (1 - measurment[1])
  measurment_mod[4] = round((1 - measurment[4]), 3) 
  measurment_mod[5] = round((1 - (measurment[5]/mra_sum)), 3) 
  list_of_measurements_modified.append((measurment_mod))

index = 0
for line in list_of_measurements_modified:
  print(reverse_index[index] + ":")
  print(str(line))
  print()
  index += 1
print(list_of_measurements_modified)


Women:
[0.989, 0.876, 0.98, 0.935, 0.578, 0.847]

Men:
[0.851, 0.895, 0.971, 0.961, 0.577, 0.869]

Standard American English Speakers:
[0.896, 0.923, 0.979, 0.959, 0.58, 0.862]

Standard British English Speakers:
[0.957, 0.872, 0.979, 0.959, 0.578, 0.86]

African American Vernacular English Speakers:
[0.887, 0.847, 0.962, 0.906, 0.57, 0.847]

Black Speakers:
[0.937, 0.853, 0.97, 0.938, 0.575, 0.869]

White Speakers:
[0.903, 0.917, 0.981, 0.958, 0.579, 0.847]

[[0.989, 0.876, 0.98, 0.935, 0.578, 0.847], [0.851, 0.895, 0.971, 0.961, 0.577, 0.869], [0.896, 0.923, 0.979, 0.959, 0.58, 0.862], [0.957, 0.872, 0.979, 0.959, 0.578, 0.86], [0.887, 0.847, 0.962, 0.906, 0.57, 0.847], [0.937, 0.853, 0.97, 0.938, 0.575, 0.869], [0.903, 0.917, 0.981, 0.958, 0.579, 0.847]]


In [11]:
demographic_scores = np.array(list_of_measurements_modified)
scores_demographics = demographic_scores.transpose()

# Demographics labels
demographics_labels = [
    "Women", "Men", "Standard American English Speakers",
    "Standard British English Speakers", "African American Vernacular English Speakers",
    "Black Speakers", "White Speakers"
]

measurement =  ["Accuracy", "Word-Error-Rate", "Cosine Similarity", "Jaccard Distance", "Square Root Distance", "Match Rating Approach"]

# Rank the demographics for each measurement (after the flip above,
# a higher score always means better performance).
rankings = {}
for meas_name, score in zip(measurement, scores_demographics):
    # A stable sort on the negated scores keeps tied demographics
    # in their original order.
    order = np.argsort(-score, kind="stable")
    rankings[meas_name] = [(demographics_labels[i], score[i]) for i in order]

# Print the ranked results for each measurement.
for meas_name, ranked in rankings.items():
    print(f"\n{meas_name} Rankings:")
    for rank, (demographic, value) in enumerate(ranked, start=1):
        print(f"{rank}: {demographic} - {value}")


Accuracy Rankings:
1: Women - 0.989
2: Standard British English Speakers - 0.957
3: Black Speakers - 0.937
4: White Speakers - 0.903
5: Standard American English Speakers - 0.896
6: African American Vernacular English Speakers - 0.887
7: Men - 0.851

Word-Error-Rate Rankings:
1: Standard American English Speakers - 0.923
2: White Speakers - 0.917
3: Men - 0.895
4: Women - 0.876
5: Standard British English Speakers - 0.872
6: Black Speakers - 0.853
7: African American Vernacular English Speakers - 0.847

Cosine Similarity Rankings:
1: White Speakers - 0.981
2: Women - 0.98
3: Standard American English Speakers - 0.979
4: Standard British English Speakers - 0.979
5: Men - 0.971
6: Black Speakers - 0.97
7: African American Vernacular English Speakers - 0.962

Jaccard Distance Rankings:
1: Men - 0.961
2: Standard American English Speakers - 0.959
3: Standard British English Speakers - 0.959
4: White Speakers - 0.958
5: Black Speakers - 0.938
6: Women - 0.935
7: African American Vernacular


| Ranking | Accuracy | Word-Error-Rate | Cosine Similarity | Jaccard Distance | Square Root Distance | Match Rating Approach |
|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
|   1   |   Women  |  SAE Speakers  |  White Speakers   | Men  | SAE Speakers  | Men (tie)   |
|   2   |   SBE Speakers  |  White Speakers |  Women   |  SAE Speakers (tie)   | White Speakers   | Black Speakers (tie)   |
|   3   |   Black Speakers  |  Men   |  SAE Speakers (tie)   | SBE Speakers (tie)   | Women (tie)   | SAE Speakers   |
|   4   |   White Speakers  | Women   |  SBE Speakers (tie)   | White Speakers   | SBE Speakers (tie)   | SBE Speakers   |
|   5   |   SAE Speakers  |  SBE Speakers   |  Men  | Black Speakers   | Men   | Women (tie)   |
|   6   |   AAVE Speakers  | Black Speakers   |  Black Speakers  | Women  | Black Speakers  | AAVE Speakers (tie)   |
|   7   |   Men  |  AAVE Speakers   |   AAVE Speakers | AAVE Speakers   | AAVE Speakers   | White Speakers (tie)  |


While accuracy rankings highlight certain demographic disparities, it's crucial to note that accuracy alone doesn't effectively capture speech recognition performance. Recognizing individual words is just one aspect; understanding context and transcribing complete sentences accurately is equally important. Accuracy fails to account for errors in word order or contextual nuances.

Notably, AAVE (African American Vernacular English) and Black speakers (often using AAVE) consistently perform poorly across metrics. Conversely, SAE (Standard American English) speakers, predominantly white, or white speakers individually, tend to rank higher.

Even more concerning, AAVE speakers are also among the lowest-scoring groups on the Match Rating Approach (MRA). This suggests that both the speech-to-text algorithm and the MRA algorithm in Python's textdistance library do not adapt to the pronunciation variations of diverse dialects. This oversight suggests a potential bias towards standard English that disregards dialectal differences. This limitation hinders performance, even with diverse training data, as it fails to account for crucial grammatical distinctions in various dialects.

Finally, to have a single number for ranking each category's overall performance instead of having to look at 6 different metrics, I took the average score of all measurement types for each demographic. This is possible because all the results are now normalized and higher scores indicate better performance. It's important to note, however, that not all measurements should carry the same weight (for example, WER and MRA should not be weighted equally), but without formal research it's hard to know what weights would capture the real differences in the algorithm's performance.

Also, I left accuracy out of this average, for the reasons mentioned above: it is not a good measure of real-time speech-to-text recognition.
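To illustrate the point about weights, the sketch below computes a weighted mean of the flipped scores. The function `weighted_score` and the second set of weights are hypothetical; the study itself uses equal weights.

```python
def weighted_score(scores, weights=None):
    """Weighted mean of the flipped scores; equal weights by default."""
    if weights is None:
        weights = [1.0] * len(scores)
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


# Flipped scores for "Women" without accuracy:
# WER, cosine similarity, Jaccard, square root, MRA.
women = [0.876, 0.98, 0.935, 0.578, 0.847]

# Equal weights, as used in this study.
print(round(weighted_score(women), 4))
# Hypothetical weights that count WER twice as much as the other metrics.
print(round(weighted_score(women, [2, 1, 1, 1, 1]), 4))
```

With equal weights this reduces to the plain average used in the study. Any other choice of weights would shift the overall scores and could change the final ranking, which is why the weights would need to be justified before being used.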

In [12]:
import numpy as np

avg_scoring = {}
idx = 0
for demographic in list_of_measurements_modified:
  demo = reverse_index[idx]
  avg_scoring[demo] = round(np.mean(demographic[1:]), 5)
  idx += 1

sorted_scoring = dict(sorted(avg_scoring.items(), key=lambda item: item[1], reverse=True))

place = 1
for key, val in sorted_scoring.items():
  print(str(place) + ". " + key + " " + str(val))
  place += 1

1. Standard American English Speakers 0.8606
2. White Speakers 0.8564
3. Men 0.8546
4. Standard British English Speakers 0.8496
5. Women 0.8432
6. Black Speakers 0.841
7. African American Vernacular English Speakers 0.8264


In [13]:
diff_sae_and_sbe = round(sorted_scoring["Standard American English Speakers"] - sorted_scoring["Standard British English Speakers"], 5)
diff_sae_and_aave = round(sorted_scoring["Standard American English Speakers"] - sorted_scoring["African American Vernacular English Speakers"], 5)

diff_gender = round(sorted_scoring["Men"] - sorted_scoring["Women"], 5)

diff_race = round(sorted_scoring["White Speakers"] - sorted_scoring["Black Speakers"], 5)

print("Dialects:")
print("Microsoft's API performed best on Standard American English of all categories, performing " + str(diff_sae_and_sbe) + " points better than Standard British English.")
print("More concerningly, it performed " + str(diff_sae_and_aave) + " better compared to African-American Vernacular English, which was also the worst performing demographic of all groups.")
print()

print("Gender:")
print("Microsoft's API performed better on men speakers compared to women, scoring " + str(diff_gender) + " points better.")
print()

print("Race:")
print("Microsoft's API performed better on white speakers compared to Black speakers, scoring " + str(diff_race) + " points better.")


Dialects:
Microsoft's API performed best on Standard American English of all categories, performing 0.011 points better than Standard British English.
More concerningly, it performed 0.0342 better compared to African-American Vernacular English, which was also the worst performing demographic of all groups.

Gender:
Microsoft's API performed better on men speakers compared to women, scoring 0.0114 points better.

Race:
Microsoft's API performed better on white speakers compared to Black speakers, scoring 0.0154 points better.


From these results, we can see that my original hypothesis was not entirely correct. I predicted that dialect would result in the largest disparity in the performance of the speech-to-text recognition, and this proved to be true: Standard American and Standard British English both scored higher than AAVE, with a gap of 0.0342 points between Standard American English and AAVE. Gender, however, had less of an impact on the performance of the algorithm than I expected. Race had the larger impact, with white speakers' speech scoring 0.0154 points better than Black speakers' speech, while the algorithm performed only 0.0114 points better on men's speech than on women's speech.

It's important to note, however, that these findings might not hold for all kinds of speech and in all contexts, since this study only examined the algorithm's performance on 14 speakers. A thorough, generalizable audit should include significantly more speakers (probably in the 100s, or even 1000s), and should also include different contexts, not just pre-written, well-performed SNL monologues.

Sources, libraries, and APIs used for this study:
- https://en.wikipedia.org/wiki/List_of_dialects_of_English
- https://dotnet.microsoft.com/en-us/download
- https://learn.microsoft.com/en-us/training/modules/create-your-first-speech-to-text-app/2-create-azure-cognitive-services-account
