# Step 2.7: Getting Word Error Rates (WER)

This code computes the Word Error Rate (WER) for each ASR system's output, measured against the original utterance.

## Required Packages

The following packages are necessary to run this code: re, os, [pandas](https://pypi.org/project/pandas/), [numpy](https://pypi.org/project/numpy/), [wagnerfischerpp](https://gist.github.com/kylebgorman/8034009)

## Initial Setup

In [None]:
# Import required packages
import pandas as pd
import numpy as np
import os

In [None]:
# File paths for the CSVs produced in Step 2.6
aint_file_path = "path"

be_file_path = "path"

done_file_path = "path"

# Read in the gold standard dataframes
aint_gs_df = pd.read_csv(aint_file_path)

be_gs_df = pd.read_csv(be_file_path)

done_gs_df = pd.read_csv(done_file_path)

## Defining the Word Error Rate Getting Function

This function depends on the wagnerfischerpp Python script. To make it importable, follow these steps:
1. Go to https://gist.github.com/kylebgorman/8034009.
2. Download the wagnerfischerpp.py script.
3. Move the script into the current working directory used for this code.
4. As a test, run *from wagnerfischerpp import \** to make sure the import works.
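
As a complement to step 4, the availability of the script can be checked programmatically before anything is imported. The following is a minimal sketch using the standard library's `importlib.util.find_spec`; the helper name `module_available` is hypothetical and not part of the original workflow.

```python
import importlib.util


def module_available(module_name):
    """Return True if `module_name` can be found on the import path."""
    # find_spec returns None when no top-level module of that name is found
    return importlib.util.find_spec(module_name) is not None


if not module_available("wagnerfischerpp"):
    print("wagnerfischerpp.py was not found; place it in the working directory.")
```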

This function takes two arguments:
1. The original utterance content as a string
2. The ASR output content as a string

## References

The *getWER* function is directly adapted from the work of [Kyle Gorman](https://gist.github.com/kylebgorman) (see code [here](https://gist.github.com/kylebgorman/8034009)).

In [None]:
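def getWER(original_utterance, ASR_output):
    """
    Gets the word error rate (WER) of an ASR output against the original utterance.

    Depends on wagnerfischerpp: keep the wagnerfischerpp.py file in the
    same directory as this code so that the import below succeeds.
    original_utterance and ASR_output should be strings;
    this function splits them into lists of words.
    Returns np.nan if either argument is not a string or if the
    original utterance contains no words.
    """
    from wagnerfischerpp import WagnerFischer
    import numpy as np

    # Empty cells are read in by pandas as NaN (a float), not as str
    if not isinstance(original_utterance, str) or not isinstance(ASR_output, str):
        return np.nan

    original_list = original_utterance.split()
    ASR_list = ASR_output.split()

    # Avoid dividing by zero when the original utterance has no words
    if len(original_list) == 0:
        return np.nan

    # Minimum word-level edit cost between the two word lists
    getter = WagnerFischer(original_list, ASR_list)
    cost = getter.cost

    # Normalize the cost by the number of words in the original utterance
    wer = cost / len(original_list)
    return wer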
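
To illustrate what the returned value represents, here is a minimal, self-contained sketch of the same kind of calculation that does not use wagnerfischerpp. It assumes a unit cost for every insertion, deletion, and substitution, which may differ from the costs used by the `WagnerFischer` class; the function names `word_edit_distance` and `sketch_wer` are hypothetical.

```python
def word_edit_distance(reference_words, hypothesis_words):
    """Word-level Levenshtein distance with unit costs (an assumption)."""
    # previous_row[j] holds the distance between the reference words
    # processed so far and the first j hypothesis words
    previous_row = list(range(len(hypothesis_words) + 1))
    for i, ref_word in enumerate(reference_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hypothesis_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1]


def sketch_wer(original_utterance, asr_output):
    """Edit distance divided by the number of words in the original."""
    reference_words = original_utterance.split()
    hypothesis_words = asr_output.split()
    return word_edit_distance(reference_words, hypothesis_words) / len(reference_words)


# One substitution ("ain't" -> "isn't") out of four original words
print(sketch_wer("he ain't going home", "he isn't going home"))  # 0.25
```

Because the divisor is the length of the original utterance, an output with many inserted words can yield a WER above 1.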

## Executing the Code

In [None]:
# Column names after which the new WER columns will be inserted
column_names = ["amazon_transcription_cleaned_WordCount", 
                "deepspeech_transcription_cleaned_WordCount", "google_transcription_cleaned_WordCount", 
                "IBMWatson_transcription_cleaned_WordCount", "microsoft_transcription_cleaned_WordCount"]

### Feature: Ain't

In [None]:
# adds new columns to the dataframe
for column_name in column_names:
    
    col_index = aint_gs_df.columns.get_loc(column_name)

    aint_gs_df.insert(col_index+1, f"{column_name.split('_W')[0]}_TotalWER", np.nan)
    

# gets the WER for each line
for file_row in aint_gs_df.itertuples():
    
    aint_gs_df.loc[file_row.Index, "amazon_transcription_cleaned_TotalWER"] = getWER(file_row.Content_cleaned, file_row.amazon_transcription_cleaned)
    
    aint_gs_df.loc[file_row.Index, "deepspeech_transcription_cleaned_TotalWER"] = getWER(file_row.Content_cleaned, file_row.deepspeech_transcription_cleaned)
    
    aint_gs_df.loc[file_row.Index, "google_transcription_cleaned_TotalWER"] = getWER(file_row.Content_cleaned, file_row.google_transcription_cleaned)

    aint_gs_df.loc[file_row.Index, "IBMWatson_transcription_cleaned_TotalWER"] = getWER(file_row.Content_cleaned, file_row.IBMWatson_transcription_cleaned)
    
    aint_gs_df.loc[file_row.Index, "microsoft_transcription_cleaned_TotalWER"] = getWER(file_row.Content_cleaned, file_row.microsoft_transcription_cleaned)

### Feature: Be

In [None]:
# adds new columns to the dataframe
for column_name in column_names:
    
    col_index = be_gs_df.columns.get_loc(column_name)

    be_gs_df.insert(col_index+1, f"{column_name.split('_W')[0]}_TotalWER", np.nan)
    

# gets the WER for each line
for file_row in be_gs_df.itertuples():
    
    be_gs_df.loc[file_row.Index, "amazon_transcription_cleaned_TotalWER"] = getWER(file_row.Content_cleaned, file_row.amazon_transcription_cleaned)
    
    be_gs_df.loc[file_row.Index, "deepspeech_transcription_cleaned_TotalWER"] = getWER(file_row.Content_cleaned, file_row.deepspeech_transcription_cleaned)
    
    be_gs_df.loc[file_row.Index, "google_transcription_cleaned_TotalWER"] = getWER(file_row.Content_cleaned, file_row.google_transcription_cleaned)

    be_gs_df.loc[file_row.Index, "IBMWatson_transcription_cleaned_TotalWER"] = getWER(file_row.Content_cleaned, file_row.IBMWatson_transcription_cleaned)
    
    be_gs_df.loc[file_row.Index, "microsoft_transcription_cleaned_TotalWER"] = getWER(file_row.Content_cleaned, file_row.microsoft_transcription_cleaned)

### Feature: Done

In [None]:
# adds new columns to the dataframe
for column_name in column_names:
    
    col_index = done_gs_df.columns.get_loc(column_name)

    done_gs_df.insert(col_index+1, f"{column_name.split('_W')[0]}_TotalWER", np.nan)
    

# gets the WER for each line
for file_row in done_gs_df.itertuples():
    
    done_gs_df.loc[file_row.Index, "amazon_transcription_cleaned_TotalWER"] = getWER(file_row.Content_cleaned, file_row.amazon_transcription_cleaned)
    
    done_gs_df.loc[file_row.Index, "deepspeech_transcription_cleaned_TotalWER"] = getWER(file_row.Content_cleaned, file_row.deepspeech_transcription_cleaned)
    
    done_gs_df.loc[file_row.Index, "google_transcription_cleaned_TotalWER"] = getWER(file_row.Content_cleaned, file_row.google_transcription_cleaned)

    done_gs_df.loc[file_row.Index, "IBMWatson_transcription_cleaned_TotalWER"] = getWER(file_row.Content_cleaned, file_row.IBMWatson_transcription_cleaned)
    
    done_gs_df.loc[file_row.Index, "microsoft_transcription_cleaned_TotalWER"] = getWER(file_row.Content_cleaned, file_row.microsoft_transcription_cleaned)
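
The three per-feature cells above apply the same steps to different dataframes. As a sketch of how that repetition could be factored out, the hypothetical helper `add_wer_columns` below inserts a `_TotalWER` column after each `_WordCount` column and fills it row by row. The `wer_function` parameter stands in for `getWER`, and the toy dataframe and the `toy_wer` stand-in are invented purely for illustration.

```python
import numpy as np
import pandas as pd


def add_wer_columns(df, word_count_columns, wer_function):
    """Insert a *_TotalWER column after each *_WordCount column and fill it."""
    for column_name in word_count_columns:
        # e.g. "google_transcription_cleaned_WordCount" -> "google_transcription_cleaned"
        transcription_column = column_name.split("_W")[0]
        wer_column = f"{transcription_column}_TotalWER"
        col_index = df.columns.get_loc(column_name)
        df.insert(col_index + 1, wer_column, np.nan)
        # Compare each original utterance with that system's transcription
        df[wer_column] = [
            wer_function(original, asr_output)
            for original, asr_output in zip(df["Content_cleaned"], df[transcription_column])
        ]
    return df


def toy_wer(original, asr_output):
    """Illustrative stand-in for getWER: 0.0 on an exact match, else 1.0."""
    return 0.0 if original == asr_output else 1.0


toy_df = pd.DataFrame({
    "Content_cleaned": ["he ain't home", "she done left"],
    "google_transcription_cleaned": ["he ain't home", "she don't left"],
    "google_transcription_cleaned_WordCount": [3, 3],
})
toy_df = add_wer_columns(toy_df, ["google_transcription_cleaned_WordCount"], toy_wer)
print(list(toy_df.columns))
```

With the notebook's own objects, the call would look like `add_wer_columns(aint_gs_df, column_names, getWER)`, under the same assumption that each dataframe contains the `Content_cleaned` and transcription columns.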

## Sorting the Dataframes by File and Line

This will sort the dataframes first by filename and then by line number. Sorting at every step keeps the row order consistent across the outputs of each step.

### Feature: Ain't

In [None]:
aint_gs_df = aint_gs_df.sort_values(by=['File', 'Line'])

### Feature: Be

In [None]:
be_gs_df = be_gs_df.sort_values(by=['File', 'Line'])

### Feature: Done

In [None]:
done_gs_df = done_gs_df.sort_values(by=['File', 'Line'])

## Exporting Dataframes to CSV Files

This will export the dataframes to CSV files.

In [None]:
# Designate the output path where the CSVs will be stored
# (include the trailing path separator, since the filenames are appended directly to it)
csv_output_path = "path"

### Feature: Ain't

In [None]:
aint_gs_df.to_csv(f"{csv_output_path}aint_variations_totalWER.csv", index=False)

### Feature: Be

In [None]:
be_gs_df.to_csv(f"{csv_output_path}be_totalWER.csv", index=False)

### Feature: Done

In [None]:
done_gs_df.to_csv(f"{csv_output_path}done_totalWER.csv", index=False)
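
The f-strings above concatenate `csv_output_path` and the filename directly, so the path needs to end with a separator. As an alternative sketch, `os.path.join` inserts the separator only when it is missing; the helper name `build_output_path` and the example directory `results` are hypothetical.

```python
import os


def build_output_path(output_directory, filename):
    """Join a directory and filename, adding a separator only if needed."""
    return os.path.join(output_directory, filename)


# Both forms of the directory give the same result
print(build_output_path("results", "be_totalWER.csv"))
print(build_output_path("results" + os.sep, "be_totalWER.csv"))
```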