# Overview

This section calculates the Word Error Rate (WER) on an expert-annotated, 2-hour data set from 1994 provided by the FAA. The audio is in SPHERE format, and the data set includes expert ground-truth transcriptions for ASR applications.

**Data set link:** https://catalog.ldc.upenn.edu/LDC94S14A

**For more details about the data set:**

"The audio files are 8 KHz, 16-bit linear sampled data, representing continuous monitoring, without squelch or silence elimination, of a single FAA frequency for one to two hours. There are also files which indicate the amplitude of the received AM carrier signal at 10 msec. intervals.

Full transcripts, including the start and end times of each transmission, are provided for each audio file. Each flight is identified by its flight number.

ATC0 consists of three subcorpora, one for each airport in which the transmissions were collected -- Dallas Fort Worth (DFW), Logan International (BOS) and Washington National (DCA). The complete set contains approximately 70 hours of controller and pilot transmissions collected via antennas and radio receivers which were located in the vicinity of the respective airports.

Detailed information regarding the collection process and the equipment used can be found in the files, "atc.doc" in the "doc" directories.

The ATC0 Corpus was collected by Texas Instruments under contract to DARPA. It was produced by the National Institute of Standards and Technology for distribution by the Linguistic Data Consortium."

**For more details on WER calculation**:

WER = (substitutions + deletions + insertions) / (number of ground truth words)
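
To make the formula concrete, here is a minimal sketch that computes WER with a word-level edit distance. The function name `word_error_rate` is hypothetical and exists only for this illustration; the notebook itself relies on `jiwer`, whose internals may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = minimum edits to turn the reference words seen so far
    # into the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("world" -> "duck") out of four reference words
print(word_error_rate("hello world hi hello", "hello duck hi hello"))  # 0.25
```

Because the denominator is the number of ground truth words, a hypothesis with many insertions can produce a WER above 1.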

In [1]:
from jiwer import wer
import pandas as pd
import re




# Loading different data sets

TODO: load the AWS and Google transcriptions

### Loading Appareo

In [317]:
df_appareo = pd.read_json('../sample_transcriptions/LDC94214A_appareo.json')

In [318]:
tran_appareo = ' '.join(df_appareo['utterance'])

In [319]:
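# Join digits separated by single spaces. One pass cannot reuse a digit it has
# just matched ("1 2 3" becomes "12 3"), so the substitution is run twice.
tran_appareo = re.sub(r'(\d) (\d)', r'\1\2', tran_appareo)
tran_appareo = re.sub(r'(\d) (\d)', r'\1\2', tran_appareo)
# Insert a space between a digit and a following non-digit character
tran_appareo = re.sub(r'(\d)(\D)', r'\1 \2', tran_appareo)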

### Load Ground Truth

In [304]:
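# Read the ground truth transcript; the with block closes the file afterwards
with open("../sample_transcriptions/LDC94214A_groundtruth.txt", "r") as f:
    string_gt = f.read()

string_gt = string_gt.replace("\n", " ") # Replaces every newline with a space
# Keeps each span from a TEXT marker to the next TIMES marker (markers included) and joins them
string_gt = ' '.join(re.findall(r'TEXT.*?TIMES', string_gt))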

Note: we must apply a series of transformations so that the two transcripts can be compared on roughly equal terms:
* Collapse extra spaces
* Replace contraction markers such as (QUOTE RE) and (QUOTE LL) with the joined word, without an apostrophe. Note: Appareo transcribes "you're" as "your"
* Remove TEXT, TIMES, and parentheses
* Convert phonetic alphabet words to letters
* Join letters that are spelled out one at a time
* Convert spelled-out numbers to digits
* Join digits that are spelled out one at a time

In [305]:
string_gt = re.sub(' +', ' ', string_gt)



In [306]:
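# Attach each contraction suffix to the preceding word, e.g. "YOU (QUOTE RE)" becomes "YOURE".
# Raw strings keep the escaped parentheses from triggering invalid-escape warnings.
string_gt = re.sub(r' \(QUOTE LL\)', 'LL', string_gt)
string_gt = re.sub(r' \(QUOTE RE\)', 'RE', string_gt)
string_gt = re.sub(r'O \(QUOTE CLOCK\)', 'OCLOCK', string_gt)
string_gt = re.sub(r' \(QUOTE S\)', 'S', string_gt)
string_gt = re.sub(r' \(QUOTE M\)', 'M', string_gt)
string_gt = re.sub(r' \(QUOTE T\)', 'T', string_gt)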
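
The six substitutions above share one pattern, so they could be collapsed into a single call with a capture group. This is only a sketch: the helper name `join_quote_markers` and the sample string are hypothetical, and it assumes every marker has the form ` (QUOTE X)` with an upper-case token, as in the cell above.

```python
import re


def join_quote_markers(text: str) -> str:
    """Attach the token inside ' (QUOTE X)' to the preceding word."""
    # The leading space is consumed so that "YOU (QUOTE RE)" becomes "YOURE"
    return re.sub(r" \(QUOTE ([A-Z]+)\)", r"\1", text)


# Hypothetical sample in the style of the ground truth markup
sample = "YOU (QUOTE RE) CLEARED WE (QUOTE LL) HOLD TRAFFIC 1 O (QUOTE CLOCK)"
print(join_quote_markers(sample))
# YOURE CLEARED WELL HOLD TRAFFIC 1 OCLOCK
```

The trade-off is that a generic pattern also rewrites any marker the explicit list does not cover, which may or may not be wanted.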

In [336]:
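# Strip the corpus markup that surrounds each transmission
string_gt = re.sub(r' \(TIMES', '', string_gt)
string_gt = re.sub(r' TEXT', '', string_gt)
string_gt = re.sub(r'\)', '', string_gt)
string_gt = re.sub(r'\(UNINTELLIGIBLE', '', string_gt)
# Spoken decimal point, e.g. "NINER POINT ONE" becomes "NINER. ONE"
string_gt = re.sub(r' POINT', '.', string_gt)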

In [337]:
def parse_int(textnum, numwords):
    """Sum a run of number words (e.g. "two hundred five") into one integer."""
    textnum = textnum.lower()
    current = result = 0
    # Treat hyphens as spaces, then process one word at a time
    for word in textnum.replace("-", " ").split():
        if word not in numwords:
            raise ValueError("Illegal word: " + word)

        # Each entry is (scale, increment, is_complex, is_number)
        scale, increment, is_complex, is_number = numwords[word]
        current = current * scale + increment

        # Scales above one hundred (thousand, million, ...) close the current group
        if scale > 100:
            result += current
            current = 0

    # Debug output: the closed groups plus the group still open
    print('parse int: ' + str(result + current))
    return result + current

In [338]:
def replace_numbers(text):
    # Maps each number word to (scale, unit, is_complex, is_number)
    numwords = {}
    units = [
      "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
      "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
      "sixteen", "seventeen", "eighteen", "nineteen",
    ]
    tens = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]
    scales = ["hundred", "thousand", "million", "billion", "trillion"]

    # Connectors: allowed inside a number but not numbers themselves
    numwords["and"] = (1, 0, 0, 0)
    numwords["a"] = (1, 1, 0, 0)

    for idx, word in enumerate(units):
        numwords[word] = (1, idx, 0, 1)
    for idx, word in enumerate(tens):
        if word:  # skip the two empty placeholders
            numwords[word] = (1, idx * 10, 1, 1)
    for idx, word in enumerate(scales):
        numwords[word] = (10 ** (idx * 3 or 2), 0, 1, 1)

    words = text.split()  # split once instead of on every loop iteration
    final_string = ''
    skip_index = 0

    for idx, word in enumerate(words):
        # Skip words already consumed as part of an earlier number run
        if skip_index > 0:
            skip_index -= 1
            continue

        if word not in numwords:
            final_string = final_string + " " + word # NOTE: leaves a leading space at the start
            continue

        # Collect the run of consecutive number words starting at idx
        curr_idx = idx
        total_word = ''
        complex_sum = 0
        number_sum = 0
        while curr_idx < len(words) and words[curr_idx] in numwords:
            curr_word = words[curr_idx]
            total_word = total_word + " " + curr_word
            complex_sum += numwords[curr_word][2]
            number_sum += numwords[curr_word][3]
            curr_idx += 1
            print(total_word)

        if number_sum > 0 and complex_sum > 0:
            # The run contains tens or scales: convert it as a whole
            final_string = final_string + " " + str(parse_int(total_word, numwords))
            # Skip the remaining words of this run (a word count, not a character count)
            skip_index = curr_idx - idx - 1
        elif complex_sum == 0 and numwords[word][3] == 1: # accounts for callsigns
            final_string = final_string + " " + str(parse_int(word, numwords))
        else: # accounts for A and AND; later words in the run are handled on their own
            final_string = final_string + " " + word

    return final_string


In [339]:
4000 % 1000

0

NOTE: a known bug remains: a run of number words is summed rather than joined, so "one two twenty four" does not become 1224. This is acceptable for the current text.

In [311]:
replace_numbers('thirty seven eighty eight')

 thirty
 thirty seven
 thirty seven eighty
 thirty seven eighty eight
parse int: 125


' 125'
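
The result above sums the run (30 + 7 + 80 + 8 = 125), whereas a callsign such as "thirty seven eighty eight" is meant to read 3788. One possible alternative, sketched here under stated assumptions, concatenates digit groups instead of adding them. The function `join_spoken_digits` and its tables are hypothetical, it lower-cases its input, and it deliberately ignores "hundred" and "thousand".

```python
UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9, "niner": 9}
TEENS = {"ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
         "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}


def join_spoken_digits(text: str) -> str:
    """Concatenate runs of spoken number words into digit strings."""
    words = text.lower().split()
    out = []     # finished output tokens
    digits = ""  # digit string for the current run of number words
    i = 0
    while i < len(words):
        word = words[i]
        if word in TENS:
            value = TENS[word]
            # "thirty seven" -> 37: absorb a following unit from one to nine
            if i + 1 < len(words) and UNITS.get(words[i + 1], 0) > 0:
                value += UNITS[words[i + 1]]
                i += 1
            digits += str(value)
        elif word in TEENS:
            digits += str(TEENS[word])
        elif word in UNITS:
            digits += str(UNITS[word])
        else:
            # A non-number word ends the current run
            if digits:
                out.append(digits)
                digits = ""
            out.append(word)
        i += 1
    if digits:
        out.append(digits)
    return " ".join(out)


print(join_spoken_digits("thirty seven eighty eight"))  # 3788
print(join_spoken_digits("one two twenty four"))        # 1224
```

Concatenation suits callsigns, headings, and frequencies read digit by digit; altitudes spoken with "thousand" would still need separate handling.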

In [340]:
# Other way of doing it
rep_thousands = {"one thousand": 1000, "two thousand": 2000, "three thousand": 3000, 
                "four thousand": 4000, "five thousand": 5000, "six thousand": 6000, 
                "seven thousand": 7000, "eight thousand": 8000, "nine thousand": 9000, 
                "ten thousand": 10000, "a thousand": 1000, "thousand": 1000} 
                # NOTE: doesn't go any higher than 10000; "thousand" is replaced last in the loop so that it accounts for one edge case

rep_hundreds = {"one hundred": 1, "two hundred": 2, "three hundred": 3, "four hundred": 4, 
                "five hundred": 5, "six hundred": 6, "seven hundred": 7, "eight hundred": 8,
                "nine hundred": 9, "a hundred": 1, "hundred": 1}
                # NOTE: this is a clever way to avoid conflict with modulo later down the line

rep_decades = {"twenty one": 21, "twenty two": 22, "twenty three": 23, "twenty four": 24, "twenty five": 25, "twenty six": 26, "twenty seven": 27, "twenty eight": 28, "twenty nine": 29,
               "thirty one": 31, "thirty two": 32, "thirty three": 33, "thirty four": 34, "thirty five": 35, "thirty six": 36, "thirty seven": 37, "thirty eight": 38, "thirty nine": 39,
               "forty one": 41, "forty two": 42, "forty three": 43, "forty four": 44, "forty five": 45, "forty six": 46, "forty seven": 47, "forty eight": 48, "forty nine": 49,
               "fifty one": 51, "fifty two": 52, "fifty three": 53, "fifty four": 54, "fifty five": 55, "fifty six": 56, "fifty seven": 57, "fifty eight": 58, "fifty nine": 59,
               "sixty one": 61, "sixty two": 62, "sixty three": 63, "sixty four": 64, "sixty five": 65, "sixty six": 66, "sixty seven": 67, "sixty eight": 68, "sixty nine": 69,
               "seventy one": 71, "seventy two": 72, "seventy three": 73, "seventy four": 74, "seventy five": 75, "seventy six": 76, "seventy seven": 77, "seventy eight": 78, "seventy nine": 79,
               "eighty one": 81, "eighty two": 82, "eighty three": 83, "eighty four": 84, "eighty five": 85, "eighty six": 86, "eighty seven": 87, "eighty eight": 88, "eighty nine": 89,
               "ninety one": 91, "ninety two": 92, "ninety three": 93, "ninety four": 94, "ninety five": 95, "ninety six": 96, "ninety seven": 97, "ninety eight": 98, "ninety nine": 99,
               "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50, "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}

rep_teens = {"eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14, "fifteen": 15, "sixteen": 16, "seventeen": 17,
                    "eighteen": 18, "nineteen": 19, "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                    "six": 6, "seven": 7, "eight": 8, "niner": 9, "nine": 9, "ten": 10} 

In [341]:
test_string = string_gt

In [342]:
# Apply the replacements from the longest phrases to the shortest so that,
# for example, "TWENTY ONE" becomes 21 before "ONE" is replaced on its own.
# The ground truth is upper case, hence word.upper().
for rep in (rep_thousands, rep_hundreds, rep_decades, rep_teens):
    for word, replacement in rep.items():
        test_string = test_string.replace(word.upper(), str(replacement))
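
One caveat with plain `str.replace` is that it also matches inside longer words: `"LISTEN".replace("TEN", "10")` yields `"LIS10"`. A possible safeguard, sketched here with a hypothetical helper `replace_whole_words` and a made-up sample dictionary, is to compile the phrases into one regular expression with word boundaries.

```python
import re


def replace_whole_words(text: str, replacements: dict) -> str:
    """Replace only whole-word occurrences of each key in replacements."""
    # Longest phrases first so "TWENTY ONE" wins over "TWENTY" and "ONE"
    phrases = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b"
    )
    return pattern.sub(lambda m: str(replacements[m.group(0)]), text)


# Hypothetical subset of the replacement tables, already upper-cased
sample_rep = {"TWENTY ONE": 21, "TWENTY": 20, "ONE": 1, "TEN": 10}
print(replace_whole_words("LISTEN TWENTY ONE PHONE TEN", sample_rep))
# LISTEN 21 PHONE 10
```

Sorting by length inside the helper also removes the dependence on dictionary insertion order.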

In [278]:
#test_string = re.sub('(?<=\d) (?=\d)', '', test_string)

Potentially ignore

In [279]:
test_string = re.sub(r' (\d) (\d) (\d)', r' \1\2\3', test_string)

In [280]:
test_string = re.sub(r' (\d) (\d)(\d) ', r' \1\2\3 ', test_string)

In [281]:
test_string = re.sub(r' (\d) and (\d)(\d) ', r' \1\2\3 ', test_string)

In [282]:
test_string = re.sub(r' (\d)(\d) (\d)(\d) ', r' \1\2\3\4 ', test_string)

In [283]:
test_string = re.sub(r' (\d) (\d) ', r' \1\2 ', test_string)
test_string = re.sub(r' (\d)(\d)(\d) (\d)', r' \1\2\3\4 ', test_string)

Other small replacements

In [343]:
test_string = re.sub(r' I L S ', r' ILS ', test_string)
test_string = re.sub(r' D M E ', r' DME ', test_string)
test_string = re.sub(r' V F R ', r' VFR ', test_string)
test_string = re.sub(r' I F R ', r' IFR ', test_string)
test_string = re.sub(r' O K ', r' OK ', test_string)
test_string = re.sub(r' U S ', r' US ', test_string)

In [344]:
rep_phonetics = {"Alpha": "A", "Bravo": "B", "Charlie": "C", "Delta": "D", 
                "Echo": "E", "Foxtrot": "F", "Golf": "G", "Hotel": "H",
                "India": "I", "Juliet": "J", "Kilo": "K", "Lima": "L",
                "Mike": "M", "November": "N", "Oscar": "O", "Papa": "P",
                "Quebec": "Q", "Romeo": "R", "Sierra": "S", "Tango": "T",
                "Uniform": "U", "Victor": "V", "Whiskey": "W", "X-ray": "X",
                "Yankee": "Y", "Zulu": "Z"}

In [345]:
for word, replacement in rep_phonetics.items():
    test_string = test_string.replace(word.upper(), replacement)

In [346]:
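# Join digits separated by single spaces; the second pass catches digits
# that the first pass could not reuse ("1 2 3" becomes "12 3", then "123")
test_string = re.sub(r'(\d) (\d)', r'\1\2', test_string)
test_string = re.sub(r'(\d) (\d)', r'\1\2', test_string)

# Insert a space between a digit and a following non-digit character
test_string = re.sub(r'(\d)(\D)', r'\1 \2', test_string)

# Join pairs of single non-digit characters, e.g. " L R " becomes "LR"
test_string = re.sub(r' (\D) (\D) ', r'\1\2', test_string)
test_string = re.sub(r' (\D) (\D) ', r'\1\2', test_string)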



In [287]:
test_string = re.sub(r'(\d)(\d) (\D) ', r'\1\2 ', test_string)
test_string = re.sub(r' (\D) (\D)', r' \1\2', test_string)

In [347]:
test_string

'TEXT 1000190  WELL GIVE YOU THAT ON THE SPEED AND WERE CLEARED FOR THE APPROACH AH NERA 3788  WELL HOLD SHORT OF 27  THANKS BIZEX 329  TURN LEFT HEADING 1  CORRECTION 090090329  ROGER THAT SIR US AIR 268  YOURE OVER L1 R CLEARED ILS DME APPROACH TO RUNWAY 27  TRAFFIC LANDING 22  LEFT WILL HOLD SHORT OF YOUR RUNWAY CLEARED THE ILS 27  US AIR 268  YES SIR 268  CONTACT THE TOWER 119  . 1  HAVE A GOOD DAY CESSNA 01  C VFR DESCENT MAINTAIN 3000300001  C BIZEX 329  TRAFFIC 1235  MILES LEAVING 30005  IS A MERLIN WERE LOOKING FOR THAT TRAFFIC 329  BIZEX 329  TURN RIGHT HEADING 130130  FOR 329  BIZEX 329  AH DESCEND VFR MAINTAIN 3000  OUT OF TURN TRAFFICS 1  OCLOCK 3  MILES AHEAD LEAVING 3000  AGAIN HES THE AH MERLIN AH WE GOT HIM IN SIGHT NOW SIR BIZEX 329  GOING DOWN TO 3000  EX 329  DESCENT YOUR DISCRETION AFTER LANDING 22  LEFT HOLD SHORT OF RUNWAY 27  FOLLOW THAT TRAFFIC AH ROGER THAT SIR AH BIZEX 329  NERA 3788  REDUCE AND MAINTAIN SPEED 170  TILL A LINDY AND CONTACT THE TOWER 119  . 1  

In [289]:
Alpha, Bravo, Charlie, Delta, Echo, Foxtrot, Golf, Hotel, India, Juliet, Kilo, Lima, Mike, November, Oscar, Papa, Quebec, Romeo, Sierra, Tango, Uniform, Victor, Whiskey, X-ray, Yankee, Zulu

NameError: name 'Alpha' is not defined

In [173]:

from word2number import w2n
  
# initializing string
test_str = "one fifty three"
# printing original string
print("The original string is : " + test_str)
  
# Convert numeric words to numbers
# Using word2number
res = w2n.word_to_num(test_str)
  
# printing result 
print("The string after performing replace : " + str(res)) 

The original string is : one fifty three
The string after performing replace : 53


In [76]:
string_test = 'hi my name is michael'.split()

In [77]:
string_test[0]

'hi'

In [73]:
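# Output from an earlier one-argument version of parse_int; the version defined above also requires numwords
parse_int('one two three')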

6

# Cleaning different data sets

# Calculating the WER

In [348]:
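# Toy example (not used below): one substitution out of four reference words gives a WER of 0.25
ground_truth = "hello world hi hello"
hypothesis = "hello duck hi hello"

# wer takes the reference (cleaned ground truth) first, then the hypothesis (Appareo)
error = wer(test_string, tran_appareo)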

In [349]:
error

0.21629368316613384