# Assignment 3 [15% of your grade, 70 points in total]

Hi! Welcome to assignment 3. Here, we are going to build a simple automatic speech recognition (ASR) system using the SpeechBrain framework, and check your understanding of some important concepts related to ASR. This assignment constitutes 15% of your final grade.

You are required to:
- Finish this notebook. Successfully run all the code cells and answer all the questions.
- When you need to embed a screenshot in the notebook, put the image in './resources'.

**Submission**
After finishing, **zip the whole assignment directory (but please exclude the "datasets" directory)**, then submit it to Canvas. **Naming: "eXXXXXXX_Name_Assignment3.zip"**.

**Late Policy**
Please submit before **Wednesday, Recess Week, 27 September 2023, 23:59**. For each late day, you will incur a 25% mark penalty.

**Honor Code**
Note that plagiarism will not be condoned. You may discuss the questions with your classmates or search the internet for references, but you MUST NOT submit code or answers copied directly from other sources. If you referred to code or a tutorial from elsewhere, please explicitly attribute the source in your code, e.g., in a comment.

**Note** You might need to restart the Jupyter kernel to clear the imported .py files before running some code cells.

**Useful Resources**
- (Paper) [Recent Advances in End-to-End Automatic Speech Recognition](https://arxiv.org/abs/2111.01690)
- (Code) [SpeechBrain ASR from Scratch](https://colab.research.google.com/drive/1aFgzrUv3udM_gNJNUoLaHIm78QHtxdIz?usp=sharing#scrollTo=IVCCe6cXPzJ0)
- (Video) [End-to-End Models for Speech Processing](https://www.youtube.com/watch?v=3MjIkWxXigM)

## Getting Started

We will continue using the same conda environment as in Assignment 2, but some additional packages are needed.
1. Enter the conda environment by:

        conda activate 4347
2. Install packages

        # Install SpeechBrain and other libraries
        pip install -r requirement.txt

        # Install the CMU Dictionary (the lines after `python` are typed in the Python interpreter)
        python
        import nltk
        nltk.download('cmudict')
        exit()

3. When you run this notebook in your IDE, switch the interpreter to the 4347 conda environment.
4. You may be prompted to install the jupyter package. Click "confirm" in this case.

## Section 1 - Automatic Speech Recognition (ASR) [28 mark(s)]
An automatic speech recognition (ASR) system recognizes spoken words from audio. If we build it using singing data, it becomes a lyric transcription system. As you have learned in the lecture, the performance of ASR systems has advanced significantly in recent decades thanks to end-to-end (E2E) ASR models and large-scale open-source datasets.

We are not going to build a high-performing E2E ASR system in this assignment because that would demand too much computation and data. Instead, we will
- Use phonemes as the recognition unit. In English, phonemes correspond more closely to pronunciation than graphemes do, so they are less data-demanding.
- Use a simple model with a toy dataset.
- Train the model from scratch.
- Decode the output without a language model.

These choices keep things simple and give you a general idea of an ASR system and the SpeechBrain framework; they are not what we would do to solve real-world problems. Current state-of-the-art ASR systems tend to
- Use graphemes (e.g., characters, words, sub-words) as the recognition unit. This makes the recognition workflow simpler.
- Use huge models with huge datasets.
- Adopt transfer learning: systems are first trained on large-scale corpora from various domains, or even on unlabeled data (audio only, no text annotation), and then fine-tuned with some domain-specific labeled data.
- Use language models in the decoding process, which makes the output more fluent.

Since we use phonemes as the target, our goal is to recognize a sequence of spoken phonemes from audio. However, many speech datasets, including the one used in this assignment, do not provide phoneme annotations, so we need to obtain the phoneme sequences from the sentences ourselves.

### Task 1: Prepare phoneme annotation  [4 mark(s)]
1. Please finish the code of the PhonemeUtil class in utils.py so that it passes the tests below. Use the CMU Dictionary in nltk to obtain the pronunciations. If a word has multiple pronunciations, use the first one. Remove the stress digits from the phonemes (e.g., 'IH1' becomes 'IH'), as in the expected output of the tests. If a word is not in the dictionary, mark its phoneme as "\<UNK\>".  **[2 mark(s)]**


In [None]:
from utils import *
phoneme_util = PhonemeUtil()
sentences = [
    "This is a test asdfdsaf",
    "For you phoneme tool",
    "thhat ensure you can get",
    "Correct labels",
]
out = [phoneme_util.word_to_phoneme_sequence(s) for s in sentences]
ans = [['DH', 'IH', 'S', 'IH', 'Z', 'AH', 'T', 'EH', 'S', 'T', '<UNK>'], ['F', 'AO', 'R', 'Y', 'UW', 'F', 'OW', 'N', 'IY', 'M', 'T', 'UW', 'L'], ['<UNK>', 'EH', 'N', 'SH', 'UH', 'R', 'Y', 'UW', 'K', 'AE', 'N', 'G', 'EH', 'T'], ['K', 'ER', 'EH', 'K', 'T', 'L', 'EY', 'B', 'AH', 'L', 'Z']]
for i,j in zip(out, ans):
    assert i == j
print('Congratulations!')

2. Run the code below to obtain phoneme annotations for the tiny LibriSpeech dataset. Afterwards, the phoneme annotation of each audio file is stored under the 'phn' property in the annotation files.  **[2 mark(s)]**

In [None]:
phoneme_util = PhonemeUtil()
dataset_dir = './datasets/tiny_librispeech'
annot_dir_src = jpath(dataset_dir, 'annotation_word')  # original word-level annotations
annot_dir_dst = jpath(dataset_dir, 'annotation')       # annotations with the added 'phn' field
os.makedirs(annot_dir_dst, exist_ok=True)
splits = ['train', 'valid', 'test']
for split in splits:
    annot_fp_old = jpath(annot_dir_src, split+'.json')
    annot_fp_new = jpath(annot_dir_dst, split+'.json')
    data = read_json(annot_fp_old)
    for utt_id in data:  # utt_id avoids shadowing the built-in id()
        sentence = data[utt_id]['words']
        phonemes = phoneme_util.word_to_phoneme_sequence(sentence)
        data[utt_id]['phn'] = ' '.join(phonemes)
    save_json(data, annot_fp_new)
data = read_json(jpath(dataset_dir, 'annotation', 'test.json'))

t = 'R AA B AH N <UNK> S AO DH AE T HH IH Z D AW T S AH V W AA R AH N T AH N HH AE D B IH N AH N F EH R AH N D HH IY B IH K EY M AH SH EY M D AH V HH IH M S EH L F F AO R HH AA R B ER IH NG DH EH M'
assert data['61-70970-0036']['phn'] == t
print('Congrats!')

### Task 2: Prepare tokenizer [3 mark(s)]
In both training and inference, a tokenizer helps convert labels (in our case, phoneme annotations) from text to integers so that the model can handle them easily.

1. Please finish the code of the PhonemeTokenizer class in utils.py so that it passes the cell below. **[3 mark(s)]**

In [None]:
from utils import PhonemeTokenizer
tokenizer = PhonemeTokenizer()
assert len(tokenizer.vocab) == 41
assert tokenizer.token_to_id['<UNK>'] == 40
assert tokenizer.id_to_token[0] == '<blank>'

phn_seqs = [
    ['CH', 'AO', 'B', 'T', 'S', 'OY'],
    ['B', 'AE', 'AA', 'AH', 'ER', 'TH'],
    ['<UNK>', 'D', 'B', '<UNK>', 'HH', 'TH']
]
ans = [
    [8, 4, 7, 31, 29, 26],
    [7, 2, 1, 3, 12, 32],
    [40, 9, 7, 40, 16, 32],
]

assert tokenizer.encode_seq(phn_seqs[0]) == ans[0]
assert tokenizer.encode_seq(phn_seqs[1]) == ans[1]
assert tokenizer.encode_seq(phn_seqs[2]) == ans[2]
assert tokenizer.decode_seq(ans[0]) == phn_seqs[0]
assert tokenizer.decode_seq(ans[1]) == phn_seqs[1]
assert tokenizer.decode_seq(ans[2]) == phn_seqs[2]
assert tokenizer.decode_seq_batch(ans) == phn_seqs

print('Congrats!')

### Task 3: ASR Baseline [8 mark(s)]

We are now ready to build the first ASR system. Please finish the tasks below:

1. The current code uses the validation set as the test set because the code for preparing the test data is missing. Please complete it. **[1 mark(s)]**
2. Use the Checkpointer class of SpeechBrain to save the model with the lowest Phoneme Error Rate (PER) during training. Save the checkpoint under the directory "results/baseline/best_ckpt". **[1 mark(s)]**
3. Load the best model (lowest PER) for evaluation, instead of using the model from the last epoch. **[1 mark(s)]**
4. Use speechbrain.utils.metric_stats.ErrorRateStats.write_stats to save your model's output on the whole test set, so that you can inspect its performance in detail. In the output file, use phoneme tokens instead of token ids (numbers). Save the file to "results/baseline/results.txt". **[1 mark(s)]**
5. Log your training, validation, and evaluation statistics to the result folder, in whatever way you like. **[1 mark(s)]**

Run the training and testing by

    python train.py hparam_baseline.yaml
Expected PER: 90%.

**NOTE**: Please keep the (1) training log, (2) model checkpoint, and (3) corresponding result files when submitting your assignment. **[3 mark(s)]**

### Task 4: Modifying the Model [13 mark(s)]

You may have spotted some issues during training, such as slow convergence and overfitting. Please make the following changes to your model by modifying the yaml file.
1. Create a new .yaml file from hparam_baseline.yaml and name it hparam_modified.yaml. **[1 mark(s)]**
2. Increase the N_epoch to 20. **[1 mark(s)]**
3. Increase the learning rate to 5e-3 **[1 mark(s)]**
4. Add weight decay = 0.1 to the optimizer **[1 mark(s)]**
5. Add a variable named "drop_p", with value 0.2. **[1 mark(s)]**
6. Add 3 dropout layers to the model, after act1, act2, and RNN. All with the same dropout rate of "drop_p" (you need to use a variable reference here). **[1 mark(s)]**
7. Change the output_dir from "results/baseline" to "results/drop0.2x2_lr0.005_wd0.1". **[1 mark(s)]**

There are some other changes you need to make in the train.py file:
1. Use speechbrain.nnet.schedulers.NewBobScheduler to schedule the learning rate according to the loss on the validation set. If the validation loss does not decrease after an epoch of training, use the scheduler to adjust the learning rate. **[2 mark(s)]**
2. Before the training of each epoch, print out and log the current learning rate. **[1 mark(s)]**
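
The scheduling idea described above can be sketched in plain Python. This is a simplified illustration only, not SpeechBrain's implementation: the class name `NewBobSketch`, its `step` method, the default values, and the validation losses are all hypothetical, and the real `NewBobScheduler` has its own constructor arguments and call convention that you should look up in the SpeechBrain documentation.

```python
class NewBobSketch:
    """Simplified illustration of new-bob style learning-rate annealing.

    Hypothetical class for illustration; not the SpeechBrain API.
    """

    def __init__(self, initial_lr, annealing_factor=0.5, improvement_threshold=0.0025):
        self.lr = initial_lr
        self.annealing_factor = annealing_factor
        self.improvement_threshold = improvement_threshold
        self.previous_loss = None

    def step(self, valid_loss):
        """Return (old_lr, new_lr) after observing this epoch's validation loss."""
        old_lr = self.lr
        if self.previous_loss is not None:
            # Relative improvement over the previous epoch (positive means better).
            # Assumes the previous loss is non-zero.
            improvement = (self.previous_loss - valid_loss) / self.previous_loss
            if improvement < self.improvement_threshold:
                # Not enough improvement: shrink the learning rate.
                self.lr = old_lr * self.annealing_factor
        self.previous_loss = valid_loss
        return old_lr, self.lr


scheduler = NewBobSketch(initial_lr=5e-3)
history = []
# Made-up validation losses; the loss goes up at epoch 3.
for epoch, valid_loss in enumerate([2.0, 1.5, 1.6, 1.2], start=1):
    old_lr, new_lr = scheduler.step(valid_loss)
    history.append((epoch, old_lr, new_lr))
    print(f"epoch {epoch}: valid_loss={valid_loss:.2f}, lr {old_lr:.4g} -> {new_lr:.4g}")
```

In this made-up run the learning rate is halved only after epoch 3, where the validation loss increased, and stays at the reduced value afterwards.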

Run the training and testing by

    python train.py hparam_modified.yaml
Expected PER: 65%.

**NOTE**: Please keep the (1) training log, (2) model checkpoint, and (3) corresponding result files when submitting your assignment. **[3 mark(s)]**

## Section 2 - Questions [42 mark(s)]

### - Result Analysis [2 mark(s)]
1. How does your system perform? Briefly introduce your system's performance with objective metric scores and the result file for the test set. **[2 mark(s)]**

(Your Answer)

### - Tokenization [8 mark(s)]
1. Do you think detecting a phoneme sequence from a speech recording is more difficult than detecting a character or word sequence? Why? **[2 mark(s)]**
2. For the task of speech recognition, what are the drawbacks of using phonemes as the recognition unit? **[2 mark(s)]**
3. What is the advantage of a sub-word tokenizer compared to a word-level tokenizer? **[2 mark(s)]**
4. If we change our tokenizer to a grapheme-based one, which level do you think is best among {character, word, sub-word}? Please state your reason. **[2 mark(s)]**

(Your Answer)

### - Modeling [7 mark(s)]
Connectionist Temporal Classification (CTC) is a type of loss function that is commonly used in ASR, especially when we do not know the precise alignment between the annotation and the audio.

1. Explain how CTC deals with the misalignment between audio and annotation, i.e., the number of frames in the audio is much higher than the number of phonemes/characters/sub-words/words in the annotation, and we do not know their correspondence. **[1 mark(s)]**
2. Why does CTC need an additional blank token in the prediction? **[1 mark(s)]**
3. Here are several decoded outputs from a CTC model. Write out their final recognition results. ("-" is the CTC blank token, and "_" represents a space) **[2 mark(s)]**

    (1) heeel-ll-l_lllooo--wooooorld

    (2) hhhhee-llow_wo--rr-rllll--dd
    
4. Recall the formula of the CTC loss:
   $$L_{CTC} = -\log\left(\sum_{\pi \in B^{-1}(W)} \prod_{t=1}^{T} p(\pi_t|\mathbf{x}_t)\right)$$
   Does the summation mean that we have to list all possible alignments between frames and text, compute the probability of each alignment, and add them together? Is there a more efficient way to compute the CTC loss? If you think so, please briefly explain a more efficient algorithm. **[3 mark(s)]**
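
To make the notation concrete, the sketch below evaluates the formula literally on a tiny made-up example. It assumes the usual reading of $B^{-1}(W)$ as the set of frame-level paths that the collapse mapping $B$ (merge repeated symbols, then remove blanks) turns into $W$. The function names and the per-frame probabilities are hypothetical, and this direct enumeration is only feasible for very short inputs because the number of paths grows exponentially with the number of frames.

```python
import itertools
import math

BLANK = "-"


def collapse(path):
    """The mapping B: merge consecutive repeated symbols, then drop blanks."""
    merged = [symbol for symbol, _ in itertools.groupby(path)]
    return "".join(symbol for symbol in merged if symbol != BLANK)


def ctc_loss_brute_force(frame_probs, target):
    """Literal evaluation of -log(sum over paths in B^{-1}(target) of prod_t p(pi_t | x_t))."""
    symbols = list(frame_probs[0].keys())
    total = 0.0
    # Enumerate every possible path of length T over the symbol set.
    for path in itertools.product(symbols, repeat=len(frame_probs)):
        if collapse(path) == target:
            total += math.prod(frame_probs[t][symbol] for t, symbol in enumerate(path))
    return -math.log(total)


# Hypothetical per-frame output distributions over {blank, a, b} for T = 3 frames.
frame_probs = [
    {"-": 0.1, "a": 0.8, "b": 0.1},
    {"-": 0.6, "a": 0.2, "b": 0.2},
    {"-": 0.2, "a": 0.1, "b": 0.7},
]
loss = ctc_loss_brute_force(frame_probs, "ab")
print(f"CTC loss for target 'ab': {loss:.4f}")
```

Here only five of the 27 possible paths ("aab", "abb", "-ab", "a-b", "ab-") collapse to "ab", and the loss is the negative log of their summed probability.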

(Your Answer)

### - Language Model [7 mark(s)]
1. Consider the two sequences below:
    - A: I like Singapore's weather.
    - B: I Singapore like ? weathers.

    For a well-trained language model, which sentence will have lower perplexity? Why? **[1 mark(s)]**
<br>

2. Given the corpus below:

            <s> I love to play football </s>
            <s> He loves to watch football </s>
            <s> I love to watch movies </s>
            <s> She loves to play tennis </s>
    (1) Assume we are using a word-level tokenizer, where \<s\> and \</s\> are the start-of-sentence and end-of-sentence tokens. Calculate the bigram probabilities $P(B|A)$ below using
    $$
    P(B|A) = \frac{\text{Count}(A\ B)}{\text{Count}(A)}
    $$
    **[3 mark(s)]**
    
    a. P(love | I)

    b. P(to | love)

    c. P(football | play)
    
    d. P(movies | watch)
   <br>

    (2) Using bigram probabilities estimated from the corpus above, calculate the probability of the sentences below. **[2 mark(s)]**

    a. I love to watch football

    b. She loves to play football
   <br>

    (3) Why is it not a good idea to use a large value of n for n-gram language models? **[1 mark(s)]**
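
As an illustration of the counting formula, here is a minimal sketch on a different, made-up corpus. The function name `bigram_probability` and the toy sentences are hypothetical; the estimate is the plain ratio Count(A B) / Count(A) with no smoothing, so it assumes the conditioning word A appears in the corpus.

```python
from collections import Counter


def bigram_probability(corpus, a, b):
    """Maximum-likelihood estimate P(b | a) = Count(a b) / Count(a)."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        unigrams.update(tokens)
        # Adjacent token pairs within the same sentence.
        bigrams.update(zip(tokens, tokens[1:]))
    return bigrams[(a, b)] / unigrams[a]


# Made-up corpus, deliberately different from the one in the question.
toy_corpus = [
    "<s> the cat sleeps </s>",
    "<s> the dog sleeps </s>",
    "<s> the cat eats </s>",
]
p_cat_given_the = bigram_probability(toy_corpus, "the", "cat")
p_sleeps_given_cat = bigram_probability(toy_corpus, "cat", "sleeps")
print(p_cat_given_the, p_sleeps_given_cat)
```

In this toy corpus "the" occurs three times and is followed by "cat" twice, so the first estimate is 2/3; "cat" occurs twice and is followed by "sleeps" once, so the second is 1/2.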


(Your Answer)


### - Beam Search [4 mark(s)]
Assume we have a simplified language model that predicts the probability of the next word. We have already generated the beginning of a sentence, "I want to", and now use beam search to predict the rest of the sentence. Let "G" denote the generated part, and use a beam size of 2 for this question.

        Probabilities calculated by the language model:
        p(eat | G): 0.4
        p(play | G): 0.3
        p(go | G): 0.2
        p(watch | G): 0.1
        p(a sandwich | G eat): 0.5
        p(dinner | G eat): 0.4
        p(an apple | G eat): 0.1
        p(football | G play): 0.6
        p(games | G play): 0.4
1. Let's continue the generation from G="I want to". After the first step of beam search, which tokens will be selected, and what are the resulting candidate sequences? **[1 mark(s)]**
2. In the 2nd step of beam search, what are the two beams starting with "G eat"? What are their respective probabilities? **[1 mark(s)]**
3. In the 2nd step of beam search, what are the two beams starting with "G play"? What are their respective probabilities? **[1 mark(s)]**
4. What are the resulting candidate sequences from the 2nd step of beam search? **[1 mark(s)]**
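
For reference, the general beam search procedure can be sketched as follows on a separate, made-up probability table. The function `beam_search`, the table layout (a dict keyed by the prefix tuple), and all probabilities are hypothetical choices for illustration; a real decoder would typically work with log-probabilities and an actual model.

```python
import heapq


def beam_search(next_token_probs, prefix, beam_size, steps):
    """Keep the beam_size most probable sequences after each expansion step."""
    beams = [(1.0, prefix)]
    for _ in range(steps):
        candidates = []
        for prob, seq in beams:
            options = next_token_probs.get(seq, {})
            if not options:
                # Nothing to expand for this prefix: carry it over unchanged.
                candidates.append((prob, seq))
                continue
            for token, p in options.items():
                candidates.append((prob * p, seq + (token,)))
        # Prune to the beam_size candidates with the highest probability.
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return beams


# Made-up next-word probabilities, keyed by the words generated so far.
table = {
    ("the",): {"cat": 0.5, "dog": 0.3, "bird": 0.2},
    ("the", "cat"): {"sleeps": 0.3, "eats": 0.3, "runs": 0.4},
    ("the", "dog"): {"barks": 0.9, "sleeps": 0.1},
}
result = beam_search(table, ("the",), beam_size=2, steps=2)
for prob, seq in result:
    print(f"{' '.join(seq)}: {prob:.2f}")
```

In this toy table the best final sequence extends "the dog", which was only the second-best candidate after the first step; a greedy search (beam size 1) would have discarded it.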

(Your Answer)

### - Word Error Rate [3 mark(s)]

Consider an automatic speech recognition system that transcribes a spoken segment into text. We compare the transcription of the system with a human-annotated reference transcript to calculate the system's Word Error Rate.

Reference Transcript:
"I am excited to learn about speech recognition."

System's Transcription (Hypothesis):
"I am excited learn about speech recognise."

1. Calculate the number of insertions, deletions, and substitutions. **[1 mark(s)]**
2. Compute the Word Error Rate (WER) using the formula: **[1 mark(s)]**
$$WER=\frac{\text{Insertions}+\text{Deletions}+\text{Substitutions}}{\text{Number of words in Reference}}$$
3. Why might WER be a more reasonable metric for ASR than a simple accuracy rate (correct words divided by total words)? **[1 mark(s)]**
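
The formula above can be illustrated with a small edit-distance sketch on a different, made-up sentence pair. The function names are hypothetical, and the alignment is a standard minimum-edit-distance (Levenshtein) alignment over words; when several alignments share the minimum cost, the split into substitutions, deletions, and insertions may depend on tie-breaking, although the WER itself does not.

```python
def edit_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) of a minimum-edit alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # table[i][j] = (total_edits, subs, dels, ins) for ref[:i] versus hyp[:j]
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, i, 0)  # delete all reference words
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, 0, 0, j)  # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            t = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = t  # words match, no cost
            else:
                diagonal = (t[0] + 1, t[1] + 1, t[2], t[3])  # substitution
            t = table[i - 1][j]
            deletion = (t[0] + 1, t[1], t[2] + 1, t[3])
            t = table[i][j - 1]
            insertion = (t[0] + 1, t[1], t[2], t[3] + 1)
            table[i][j] = min(diagonal, deletion, insertion, key=lambda c: c[0])
    _, subs, dels, ins = table[-1][-1]
    return subs, dels, ins


def word_error_rate(reference, hypothesis):
    """WER = (insertions + deletions + substitutions) / number of reference words."""
    subs, dels, ins = edit_counts(reference, hypothesis)
    return (subs + dels + ins) / len(reference.split())


# Made-up sentence pair, deliberately different from the one in the question.
reference = "the cat sat on the mat"
hypothesis = "the cat sat on on the hat"
counts = edit_counts(reference, hypothesis)
wer = word_error_rate(reference, hypothesis)
print(counts, round(wer, 4))
```

For this pair the alignment contains one insertion (the extra "on") and one substitution ("mat" recognized as "hat"), giving a WER of 2/6.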

(Your Answer)

### - Possible Improvement [3 mark(s)]
1. The performance of the recognition system in Section 1 might still have room for improvement. What are possible reasons for the limited performance, and what are the corresponding directions for improvement? Please list 3 reason/improvement pairs. **[3 mark(s)]**

(Your Answer)

### - Speech vs Singing [6 mark(s)]
1. What properties differ between the audio of speech recordings and that of singing recordings? What similar or identical properties do they share? **[2 mark(s)]**
2. What properties differ between spoken text and lyrics? What similar or identical properties do they share? **[1 mark(s)]**
3. Given that paired singing datasets of audio and lyrics are limited, how can we build a lyric transcription system with better performance? Please answer from 3 perspectives. **[3 mark(s)]**

(Your Answer)

### - Timing Survey [2 mark(s)]

- What do you think is the most difficult part? Which part did you spend the most time on? **[1 mark(s)]**
[Your answer]
<br>

- How much time did you spend on the assignment? Please give an estimate if you did not time yourself. **[1 mark(s)]**
[Your answer]
<br>