# Step 2.12: Creating the Correctness Column

This step adds a correctness column for each ASR system, recording whether that system's output is correct (1) or incorrect (0). The value is assigned automatically where possible, with the rules checked in this order. First, if the *ContainsFeature* column is 0 (meaning the feature word does not appear in the ASR output), a 0 is written to indicate incorrectness. Next, if the *AdjacentTokens* column is 1 or greater, a 1 is written to indicate correctness. If the *AdjacentTokens* column is -1 or -2, a 0 is written to indicate incorrectness (see the previous step for the meaning of each code). In any other case, a NaN is written so that the cell is left blank. These blank cells are manually annotated for correctness in the next step.

## Required Packages

The following packages are necessary to run this code: os (part of the Python standard library), [pandas](https://pypi.org/project/pandas/), and [numpy](https://pypi.org/project/numpy/).

## Initial Setup

In [1]:
# Import required packages
import pandas as pd
import numpy as np
import os

In [2]:
# File paths for the CSVs produced in Step 2.11 (one per feature)
aint_file_path = "path"

be_file_path = "path"

done_file_path = "path"

# Read in the gold standard dataframes
aint_gs_df = pd.read_csv(aint_file_path)

be_gs_df = pd.read_csv(be_file_path)

done_gs_df = pd.read_csv(done_file_path)
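
Before going further, it may help to confirm that each dataframe contains the columns this step reads. The sketch below is only an illustration: `missing_input_columns` is a hypothetical helper, the toy dataframe stands in for a real gold standard file, and the column names are assumed to follow the `_containsFeature` / `_adjacentTokens` pattern used later in this step.

```python
import pandas as pd

# Transcription column stems used later in this step.
ASR_COLUMNS = [
    "amazon_transcription_cleaned",
    "deepspeech_transcription_cleaned",
    "google_transcription_cleaned",
    "IBMWatson_transcription_cleaned",
    "microsoft_transcription_cleaned",
]
# Suffixes of the helper columns assumed to come from the previous step.
SUFFIXES = ["_containsFeature", "_adjacentTokens"]


def missing_input_columns(df, stems=ASR_COLUMNS, suffixes=SUFFIXES):
    """Return the expected helper columns that are absent from df (hypothetical helper)."""
    expected = [f"{stem}{suffix}" for stem in stems for suffix in suffixes]
    return [name for name in expected if name not in df.columns]


# Toy dataframe standing in for a real gold standard file.
toy_df = pd.DataFrame(columns=[
    "File",
    "Line",
    "amazon_transcription_cleaned",
    "amazon_transcription_cleaned_containsFeature",
])

missing = missing_input_columns(toy_df)
print(missing)
```

An empty list means every expected column is present; anything listed would raise an error in the later cells.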

## Defining the Copying Feature Columns Function

This function returns 0 (incorrect), 1 (correct), or NaN (left for manual annotation). It takes the following arguments:
1. The content of the *ContainsFeature* column
2. The content of the *AdjacentTokens* column

In [3]:
def copyFeatureColumns(feature_present, adjacent_tokens):
    """
    Derives a correctness value from the "..._containsFeature" and
    "..._adjacentTokens" values of one row.

    Returns 0 (incorrect), 1 (correct), or NaN (needs manual annotation).
    """

    # if the feature is not present (0), the output is incorrect
    if feature_present == 0:
        return 0

    # if the adjacent tokens match between the original utterance and the
    # ASR output, the output is correct
    elif adjacent_tokens >= 1:
        return 1

    # -1 and -2 from the previous step are converted to 0 (incorrect)
    elif adjacent_tokens in [-1, -2]:
        return 0

    # otherwise, return NaN so the cell is left for manual annotation
    else:
        return np.nan
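
To see the rules applied to concrete values, the sketch below restates them in vectorized form with `np.select`, which picks the first matching condition just as the `if`/`elif` chain does. The name `auto_correctness`, the short column names, and the sample values are assumptions made for illustration; in particular, the `0` in the last row of *adjacentTokens* is just a value that none of the rules cover.

```python
import numpy as np
import pandas as pd


def auto_correctness(contains_feature, adjacent_tokens):
    """Vectorized restatement of the correctness rules (hypothetical helper)."""
    conditions = [
        (contains_feature == 0).to_numpy(),          # feature absent -> incorrect
        (adjacent_tokens >= 1).to_numpy(),           # adjacent tokens match -> correct
        adjacent_tokens.isin([-1, -2]).to_numpy(),   # mismatch codes -> incorrect
    ]
    # The first condition that is True wins; rows matching none get NaN.
    values = np.select(conditions, [0.0, 1.0, 0.0], default=np.nan)
    return pd.Series(values, index=contains_feature.index)


# Illustrative values only.
sample = pd.DataFrame({
    "containsFeature": [0, 1, 1, 1, 1],
    "adjacentTokens": [2, 2, -1, -2, 0],
})
sample["correctness"] = auto_correctness(
    sample["containsFeature"], sample["adjacentTokens"]
)
print(sample)
```

The first row shows that a missing feature takes priority over the adjacent-token value, and the last row is left as NaN for manual annotation.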

## Executing the Code

In [4]:
# Transcription columns for which a correctness column will be added
column_names = ["amazon_transcription_cleaned", 
                "deepspeech_transcription_cleaned", "google_transcription_cleaned", 
                "IBMWatson_transcription_cleaned", "microsoft_transcription_cleaned"]

### Feature: Ain't

In [5]:
# Appends new columns
for column_name in column_names:
    
    col_index = aint_gs_df.columns.get_loc(column_name)
    
    aint_gs_df.insert(col_index+3, f"{column_name}_correctness", np.nan)
            
# Loops through rows and executes the function
for file_row in aint_gs_df.itertuples():
    
    aint_gs_df.loc[file_row.Index, "amazon_transcription_cleaned_correctness"] = copyFeatureColumns(file_row.amazon_transcription_cleaned_containsFeature, file_row.amazon_transcription_cleaned_adjacentTokens)
    
    aint_gs_df.loc[file_row.Index, "deepspeech_transcription_cleaned_correctness"] = copyFeatureColumns(file_row.deepspeech_transcription_cleaned_containsFeature, file_row.deepspeech_transcription_cleaned_adjacentTokens)
    
    aint_gs_df.loc[file_row.Index, "google_transcription_cleaned_correctness"] = copyFeatureColumns(file_row.google_transcription_cleaned_containsFeature, file_row.google_transcription_cleaned_adjacentTokens)
    
    aint_gs_df.loc[file_row.Index, "IBMWatson_transcription_cleaned_correctness"] = copyFeatureColumns(file_row.IBMWatson_transcription_cleaned_containsFeature, file_row.IBMWatson_transcription_cleaned_adjacentTokens)
    
    aint_gs_df.loc[file_row.Index, "microsoft_transcription_cleaned_correctness"] = copyFeatureColumns(file_row.microsoft_transcription_cleaned_containsFeature, file_row.microsoft_transcription_cleaned_adjacentTokens)
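
The `col_index+3` offset places each new column directly after the transcription column and its two helper columns. The toy example below illustrates this under the assumption that the *containsFeature* and *adjacentTokens* columns immediately follow their transcription column; the data values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame with the assumed layout: transcription, then its two helper columns.
toy_df = pd.DataFrame({
    "File": ["a.wav"],
    "Line": [1],
    "amazon_transcription_cleaned": ["he ain't here"],
    "amazon_transcription_cleaned_containsFeature": [1],
    "amazon_transcription_cleaned_adjacentTokens": [2],
    "google_transcription_cleaned": ["he is not here"],
    "google_transcription_cleaned_containsFeature": [0],
    "google_transcription_cleaned_adjacentTokens": [-1],
})

for stem in ["amazon_transcription_cleaned", "google_transcription_cleaned"]:
    col_index = toy_df.columns.get_loc(stem)
    # +3 skips the transcription column and its two helper columns.
    toy_df.insert(col_index + 3, f"{stem}_correctness", np.nan)

print(list(toy_df.columns))
```

If the helper columns were laid out differently, the offset would need to change accordingly.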

### Feature: Be

In [6]:
# Appends new columns
for column_name in column_names:
    
    col_index = be_gs_df.columns.get_loc(column_name)
    
    be_gs_df.insert(col_index+3, f"{column_name}_correctness", np.nan)
            
# Loops through rows and executes the function
for file_row in be_gs_df.itertuples():
    
    be_gs_df.loc[file_row.Index, "amazon_transcription_cleaned_correctness"] = copyFeatureColumns(file_row.amazon_transcription_cleaned_containsFeature, file_row.amazon_transcription_cleaned_adjacentTokens)
    
    be_gs_df.loc[file_row.Index, "deepspeech_transcription_cleaned_correctness"] = copyFeatureColumns(file_row.deepspeech_transcription_cleaned_containsFeature, file_row.deepspeech_transcription_cleaned_adjacentTokens)
    
    be_gs_df.loc[file_row.Index, "google_transcription_cleaned_correctness"] = copyFeatureColumns(file_row.google_transcription_cleaned_containsFeature, file_row.google_transcription_cleaned_adjacentTokens)
    
    be_gs_df.loc[file_row.Index, "IBMWatson_transcription_cleaned_correctness"] = copyFeatureColumns(file_row.IBMWatson_transcription_cleaned_containsFeature, file_row.IBMWatson_transcription_cleaned_adjacentTokens)
    
    be_gs_df.loc[file_row.Index, "microsoft_transcription_cleaned_correctness"] = copyFeatureColumns(file_row.microsoft_transcription_cleaned_containsFeature, file_row.microsoft_transcription_cleaned_adjacentTokens)

### Feature: Done

In [7]:
# Appends new columns
for column_name in column_names:
    
    col_index = done_gs_df.columns.get_loc(column_name)
    
    done_gs_df.insert(col_index+3, f"{column_name}_correctness", np.nan)
            
# Loops through rows and executes the function
for file_row in done_gs_df.itertuples():
    
    done_gs_df.loc[file_row.Index, "amazon_transcription_cleaned_correctness"] = copyFeatureColumns(file_row.amazon_transcription_cleaned_containsFeature, file_row.amazon_transcription_cleaned_adjacentTokens)
    
    done_gs_df.loc[file_row.Index, "deepspeech_transcription_cleaned_correctness"] = copyFeatureColumns(file_row.deepspeech_transcription_cleaned_containsFeature, file_row.deepspeech_transcription_cleaned_adjacentTokens)
    
    done_gs_df.loc[file_row.Index, "google_transcription_cleaned_correctness"] = copyFeatureColumns(file_row.google_transcription_cleaned_containsFeature, file_row.google_transcription_cleaned_adjacentTokens)
    
    done_gs_df.loc[file_row.Index, "IBMWatson_transcription_cleaned_correctness"] = copyFeatureColumns(file_row.IBMWatson_transcription_cleaned_containsFeature, file_row.IBMWatson_transcription_cleaned_adjacentTokens)
    
    done_gs_df.loc[file_row.Index, "microsoft_transcription_cleaned_correctness"] = copyFeatureColumns(file_row.microsoft_transcription_cleaned_containsFeature, file_row.microsoft_transcription_cleaned_adjacentTokens)
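
Once the correctness columns are filled, it can be useful to check how many cells remain blank and will need manual annotation. This is a minimal sketch on a made-up dataframe; it assumes the new columns all end in `_correctness`, as they do above.

```python
import numpy as np
import pandas as pd

# Toy correctness columns; NaN marks a cell left for manual annotation.
toy_df = pd.DataFrame({
    "amazon_transcription_cleaned_correctness": [1.0, 0.0, np.nan],
    "google_transcription_cleaned_correctness": [np.nan, 0.0, np.nan],
})

# Select the correctness columns by suffix and count blanks per column.
pending = toy_df.filter(regex="_correctness$").isna().sum()
print(pending)
```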

## Sorting the Dataframes by File and Line

This will sort the dataframes first by filename and then by line number. Sorting at every step keeps the row order consistent across the files produced in each step.

### Feature: Ain't

In [8]:
aint_gs_df = aint_gs_df.sort_values(by=['File', 'Line'])

### Feature: Be

In [9]:
be_gs_df = be_gs_df.sort_values(by=['File', 'Line'])

### Feature: Done

In [10]:
done_gs_df = done_gs_df.sort_values(by=['File', 'Line'])
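
The two-key sort can be illustrated on a small made-up dataframe. The file names and line numbers here are hypothetical, and the example assumes *Line* holds integers; if it held strings, `"10"` would sort before `"2"`.

```python
import pandas as pd

# Toy rows in arbitrary order.
toy_df = pd.DataFrame({
    "File": ["b.wav", "a.wav", "a.wav"],
    "Line": [1, 10, 2],
})

# Sort by File first, then by Line within each file.
sorted_df = toy_df.sort_values(by=["File", "Line"])
print(sorted_df)
```

Note that `sort_values` keeps the original index labels; this does not affect the exported CSVs, which are written with `index=False`.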

## Exporting Dataframes to CSV Files

This will export the dataframes to CSV files.

In [11]:
# Designate the output path where the CSVs will be stored
csv_output_path = "path"

### Feature: Ain't

In [12]:
aint_gs_df.to_csv(os.path.join(csv_output_path, "aint_variations_autoCorrectness.csv"), index=False)

### Feature: Be

In [13]:
be_gs_df.to_csv(os.path.join(csv_output_path, "be_autoCorrectness.csv"), index=False)

### Feature: Done

In [14]:
done_gs_df.to_csv(os.path.join(csv_output_path, "done_autoCorrectness.csv"), index=False)
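
As a quick check that blank correctness cells survive the export, the sketch below writes a toy dataframe to a temporary directory and reads it back. The file name and data are made up for illustration, and `os.path.join` is used so the result does not depend on whether the directory path ends in a separator.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe with one cell left blank for manual annotation.
toy_df = pd.DataFrame({
    "File": ["a.wav"],
    "Line": [1],
    "amazon_transcription_cleaned_correctness": [float("nan")],
})

with tempfile.TemporaryDirectory() as output_dir:
    out_file = os.path.join(output_dir, "toy_autoCorrectness.csv")
    toy_df.to_csv(out_file, index=False)   # NaN is written as an empty cell
    reloaded = pd.read_csv(out_file)       # empty cells are read back as NaN

print(reloaded)
```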