# CS304 Project 5 Problem 1
Author: Tianle Zhu

In project 5 problem 1, we are required to recognize continuous recording of 4-digit or 7-digit telephone numbers. 

### Our group's idea:
* Connect the pre-trained digit model to form a graph
* Perform DTW on the graph using similar strategy in Project 4
* Need to implement non-emitting state
* Need to allow skip of area code

For the complete code of data structures and different levels of abstraction, please refer to *CDR.PY*.

For the complete code of building specific graph for problem 1 and performing DTW, please refer to *problem1.py*.

Blocks below show the testing results and accuracies.

In [4]:
import os
import utils
import CDR
import mfcc
import problem1

Read pre-recorded sentences. 

In [5]:
folderPath = "./p1_sentence"
digit4 = []
digit7 = []
for file in os.listdir(folderPath):
    name = file.split(".")[0]
    if len(name) == 4:
        digit4.append(file)
    elif len(name) == 7:
        digit7.append(file)
    else:
        print("Unexpected sentence length!")
all = digit4 + digit7

### Testing all sentences and calculate accuracy

Below block defines the function to recognize single sentence. 

In calculating accuracy, we are also using DTW to align results and true answers to calculate the minimum edit distance and word error rate.

For this DTW implementation, please refer to *utils.py*. 

In [13]:
def test(filename,verbose, force7digit):
    """Recognize a singel sentence and calculate accuracies
    :param filename: target sentence name
    :param verbose: if to print results
    :param force7digit: if to force the model to have 7 digits
    :return: correct digit number, total digit number, minimum edit distance, sentence correct rate
    """
    # get correct answer from file name
    answer = utils.parseSName(filename)
    p1_folder = "./p1_sentence"
    filepath = os.path.join(p1_folder, filename)
    # calculate mfcc features
    sentence = mfcc.mfcc_features(filepath, 40)
    # build graph
    startNull, branchNull = problem1.build47()
    # flatten graph
    node_ls, nodeNum = CDR.flatten(startNull)
    # recognize sentence
    result, seq = problem1.RSS(sentence, node_ls, nodeNum, branchNull, force7digit)
    total = len(answer) # get total digit number
    count = 0
    # results and answers may have different length
    for i in range(total):
        try:
            if answer[i] == result[i]:
                count += 1
        except:
            break
    minEditDis = utils.dtw(answer, result)
    wre = minEditDis / total
    if verbose:
        print("Correct answer: ", answer)
        print("Result: ", result)
        print("Correct rate: {:.2f}".format(count / total))
        print("Minimum edit distance: {}\nWord error rate: {:.2f}".format(minEditDis, wre))
        print("*"*50)
    if minEditDis == 0:
        sentenceCorrect = 1
    else:
        sentenceCorrect = 0
    return count, total, minEditDis, sentenceCorrect

Below block define the function to test a set of sentences. 

In [11]:
def testMany(testSet, verbose, force7digit=False):
    """Recognize a set of sentences one by one and print the results
    :param testSet: sentence name list, digit4 or digit7 or all
    :param force7digit: if to force the model to have 7 digits
    :param verbose: if to print the results for each sentence during testing
    """
    totalNum = 0
    correctNum = 0
    medSum = 0
    s_correctNum = 0
    for file in testSet:
        count, total, minEditDis, sentenceCorrect= test(file, verbose, force7digit)
        totalNum += total
        correctNum += count
        medSum += minEditDis
        s_correctNum += sentenceCorrect
    print("Sentence correct rate: {:.2f}".format(s_correctNum/len(all)) )
    print("Digit correct rate: {:.2f}".format(correctNum/totalNum))
    print("Word error rate: {:.2f}".format(medSum/totalNum))

First, we test on all recordings. 

In [12]:
testMany(all, True)

Correct answer:  ['two', 'one', 'two', 'three']
Result:  ['eight', 'one', 'six', 'six']
Correct rate: 0.25
Minimum edit distance: 3
Word error rate: 0.75
**************************************************
Correct answer:  ['two', 'seven', 'nine', 'five']
Result:  ['six', 'seven', 'nine', 'five']
Correct rate: 0.75
Minimum edit distance: 1
Word error rate: 0.25
**************************************************
Correct answer:  ['two', 'nine', 'nine', 'three']
Result:  ['two', 'nine', 'nine', 'six']
Correct rate: 0.75
Minimum edit distance: 1
Word error rate: 0.25
**************************************************
Correct answer:  ['three', 'three', 'three', 'three']
Result:  ['three', 'three', 'three', 'three']
Correct rate: 1.00
Minimum edit distance: 0
Word error rate: 0.00
**************************************************
Correct answer:  ['four', 'four', 'four', 'four']
Result:  ['four', 'four', 'four', 'six']
Correct rate: 0.75
Minimum edit distance: 1
Word error rate: 0.25
*****

The sentence accuracy is ***0.19***.

The digit correct rate is ***0.37***.

The word error rate is ***0.54***. 

The results is not so good. We can also notice that most 7-digit sentence are recognized as 4-digit sentence. Thus, the digit accuracy is not that meaningful and word error rate should be a better reference. 

Then, we test separately on 4-digit and 7-digit sentences. 

First, we test solely on 4-digit sentences. 

In [10]:
testMany(digit4,False)

Sentence correct rate: 0.19
Digit correct rate: 0.67
Word error rate: 0.33


We can see that the results improves and word error rate drops by around ***0.2***. 

Then we test solely on 7-digit sentences. 

In [11]:
testMany(digit7,False)

Sentence correct rate: 0.00
Digit correct rate: 0.20
Word error rate: 0.66


The results is bad. But this is what we can forsee from our first testing results. 

Then we test on 7-digit sentences again, but this time, we force the model to output 7-digit sentence. 

In [12]:
testMany(digit7, False, True)

Sentence correct rate: 0.00
Digit correct rate: 0.40
Word error rate: 0.54


The accuracy improves, but still not very good. 

To account for the poor accuracy our model yield, we think it is mostly likely attributed to the pre-trained model of separate digits. 

And since we use models of separate digit to recognize continuous speech, the problem of pre-trained models get amplified. 