# Assignment 3 [15% of your grade, 70 points in total]

Hi! Welcome to assignment 3. Here, we are going to build a simple automatic speech recognition (ASR) system using the SpeechBrain framework, and check your understanding of some important concepts related to ASR. This assignment constitutes 15% of your final grade.

You are required to:
- Finish this notebook. Successfully run all the code cells and answer all the questions.
- When you need to embed a screenshot in the notebook, put the picture in './resources'.

**Submission**
After finishing, **zip the whole assignment directory (but please exclude "datasets" directory)**, then submit to Canvas. **Naming: "eXXXXXXX_Name_Assignment3.zip"**.

**Late Policy**
Please submit before **Wednesday, Recess Week, 27 September 2023, 23:59**. For each late day, you will lose 25% of the marks.

**Honor Code**
Note that plagiarism will not be condoned. You may discuss the questions with your classmates or search the internet for references, but you MUST NOT submit code or answers copied directly from other sources. If you referred to code or a tutorial somewhere, please explicitly attribute the source in your code, e.g., in a comment.

**Note** You might need to restart the jupyter kernel to clear the imported py files before running some code cells.

**Useful Resources**
- (Paper) [Recent Advances in End-to-End Automatic Speech Recognition](https://arxiv.org/abs/2111.01690)
- (Code) [SpeechBrain ASR from Scratch](https://colab.research.google.com/drive/1aFgzrUv3udM_gNJNUoLaHIm78QHtxdIz?usp=sharing#scrollTo=IVCCe6cXPzJ0)
- (Video) [End-to-End Models for Speech Processing](https://www.youtube.com/watch?v=3MjIkWxXigM)

## Getting Started

We will continue using the same conda environment as in assignment 2, but some additional packages are needed.
1. Enter the conda environment by:

        conda activate 4347
2. Install packages

        # Install SpeechBrain and other libraries
        pip install -r requirements.txt

        # Install CMU Dictionary (run the lines below inside the Python interpreter)
        python
        import nltk
        nltk.download('cmudict')
        exit()

3. When you run this notebook in your IDE, switch the interpreter to the 4347 conda environment.
4. You may be prompted to install the jupyter package. Click "confirm" in this case.

## Section 1 - Automatic Speech Recognition (ASR) [28 mark(s)]
An automatic speech recognition (ASR) system recognizes spoken words from audio. If we build it using singing data, it becomes a lyric transcription system. As you have learned in the lecture, in recent decades, the performance of ASR systems has advanced significantly thanks to end-to-end (E2E) ASR models and large-scale open-source datasets.

We are not going to build a well-performing E2E ASR system in this assignment because it is too demanding in both computational resources and scale of data. Instead, we will
- Use phonemes as the recognition unit. In English, they have a tighter relationship with pronunciation, and hence are less data-demanding.
- Use a simple model with a toy dataset.
- Train the model from scratch.
- Decode the output without a language model.

This is just for simplicity, to give you a general idea of an ASR system and the SpeechBrain framework; it is not what we do to solve real-world problems. Current state-of-the-art ASR systems tend to
- Use graphemes (e.g., character, word, sub-word) as the recognition unit. This makes the recognition workflow simpler.
- Use huge models with huge datasets.
- Adopt transfer learning: systems are first trained on large-scale corpora from various domains, or even on unlabeled data (audio-only, no text annotation), and then fine-tuned with domain-specific labeled data.
- Use language models in the decoding process, making the output more fluent.

Since we will be using phonemes as the target for the dataset, our goal is to recognize a sequence of spoken phonemes from audio. But many speech datasets do not provide phoneme annotations (as is the case in this assignment), so we need to obtain the phoneme sequence from the sentences ourselves.

### Task 1: Prepare phoneme annotation  [4 mark(s)]
1. Please finish the code of the PhonemeUtil class in utils.py so that you can pass the tests below. Please use the CMU Dictionary in nltk to obtain the pronunciation. Use the first pronunciation if multiple ones exist for a word. If a word is not in the dictionary, mark its phoneme as "\<UNK\>".  **[2 mark(s)]**


In [1]:
#NOTE TO TA: Ensure nltk.download('punkt') is also done in your machine along with nltk.download('cmudict')
from utils import *
phoneme_util = PhonemeUtil()
sentences = [
    "This is a test asdfdsaf",
    "For you phoneme tool",
    "thhat ensure you can get",
    "Correct labels",
]
out = [phoneme_util.word_to_phoneme_sequence(s) for s in sentences]
ans = [['DH', 'IH', 'S', 'IH', 'Z', 'AH', 'T', 'EH', 'S', 'T', '<UNK>'], ['F', 'AO', 'R', 'Y', 'UW', 'F', 'OW', 'N', 'IY', 'M', 'T', 'UW', 'L'], ['<UNK>', 'EH', 'N', 'SH', 'UH', 'R', 'Y', 'UW', 'K', 'AE', 'N', 'G', 'EH', 'T'], ['K', 'ER', 'EH', 'K', 'T', 'L', 'EY', 'B', 'AH', 'L', 'Z']]
for i,j in zip(out, ans):
    assert i == j
print('Congratulations!')

Congratulations!


2. Run the code below to obtain phoneme annotations for the tiny LibriSpeech dataset. After this, the phoneme annotations will be stored in the 'phn' property of each audio's entry in the annotation files.  **[2 mark(s)]**

In [2]:
#NOTE: After this cell runs, ./datasets/tiny_librispeech/annotation/<split>.json also contains the phonemes.
phoneme_util = PhonemeUtil()
dataset_dir = './datasets/tiny_librispeech'
annot_dir_complete = jpath(dataset_dir, 'annotation_word')
annot_dir_word = jpath(dataset_dir, 'annotation')
if not os.path.exists(annot_dir_word):
    os.mkdir(annot_dir_word)
splits = ['train', 'valid', 'test']
for split in splits:
    annot_fp_old = jpath(annot_dir_complete, split+'.json')
    annot_fp_new = jpath(annot_dir_word, split+'.json')
    data = read_json(annot_fp_old)
    for id in data:
        entry = data[id]
        sentence = entry['words']
        phonemes = phoneme_util.word_to_phoneme_sequence(sentence)
        data[id]['phn'] = ' '.join(phonemes)
    save_json(data, annot_fp_new)
data = read_json(jpath(dataset_dir, 'annotation', 'test.json'))

t = 'R AA B AH N <UNK> S AO DH AE T HH IH Z D AW T S AH V W AA R AH N T AH N HH AE D B IH N AH N F EH R AH N D HH IY B IH K EY M AH SH EY M D AH V HH IH M S EH L F F AO R HH AA R B ER IH NG DH EH M'
assert data['61-70970-0036']['phn'] == t
print('Congrats!')

Congrats!
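
The lookup rules in Task 1 (first pronunciation wins, unknown words become `<UNK>`) can be sketched without nltk. This is only an illustration, not the PhonemeUtil implementation: `TOY_DICT` is a hypothetical miniature table shaped like the mapping returned by `nltk.corpus.cmudict.dict()` (word to list of pronunciations), `words_to_phonemes` is a made-up helper name, and stripping the stress digits is an assumption inferred from the digit-free phonemes in the expected test output. Words are split on whitespace here for simplicity.

```python
import re

# Hypothetical miniature pronunciation table (word -> list of pronunciations).
# The entries are illustrative, not copied from the real CMU dictionary.
TOY_DICT = {
    "this": [["DH", "IH1", "S"], ["DH", "IH0", "S"]],
    "is": [["IH1", "Z"], ["IH0", "Z"]],
    "a": [["AH0"], ["EY1"]],
    "test": [["T", "EH1", "S", "T"]],
}


def words_to_phonemes(sentence, pron_dict):
    """Map a sentence to a flat phoneme list using the first pronunciation of each word."""
    phonemes = []
    for word in sentence.lower().split():
        if word in pron_dict:
            first = pron_dict[word][0]
            # Assumption: stress markers (trailing digits) are removed.
            phonemes.extend(re.sub(r"\d", "", p) for p in first)
        else:
            # Out-of-dictionary word: one <UNK> token for the whole word.
            phonemes.append("<UNK>")
    return phonemes


print(words_to_phonemes("This is a test asdfdsaf", TOY_DICT))
```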


### Task 2: Prepare tokenizer [3 mark(s)]
In both training and inference, a tokenizer helps to convert labels (in our case, phoneme annotations) from text to integers so that the model can handle them easily.

1. Please finish the code of the PhonemeTokenizer class in utils.py so that it can pass the cell below. **[3 mark(s)]**

In [1]:
from utils import PhonemeTokenizer
tokenizer = PhonemeTokenizer()
assert len(tokenizer.vocab) == 41
assert tokenizer.token_to_id['<UNK>'] == 40
assert tokenizer.id_to_token[0] == '<blank>'

phn_seqs = [
    ['CH', 'AO', 'B', 'T', 'S', 'OY'],
    ['B', 'AE', 'AA', 'AH', 'ER', 'TH'],
    ['<UNK>', 'D', 'B', '<UNK>', 'HH', 'TH']
]
ans = [
    [8, 4, 7, 31, 29, 26],
    [7, 2, 1, 3, 12, 32],
    [40, 9, 7, 40, 16, 32],
]

assert tokenizer.encode_seq(phn_seqs[0]) == ans[0]
assert tokenizer.encode_seq(phn_seqs[1]) == ans[1]
assert tokenizer.encode_seq(phn_seqs[2]) == ans[2]
assert tokenizer.decode_seq(ans[0]) == phn_seqs[0]
assert tokenizer.decode_seq(ans[1]) == phn_seqs[1]
assert tokenizer.decode_seq(ans[2]) == phn_seqs[2]
assert tokenizer.decode_seq_batch(ans) == phn_seqs

print('Congrats!')

Congrats!
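
One vocabulary layout that is consistent with the IDs asserted in the cell above is: `<blank>` at index 0, the 39 ARPAbet phonemes in alphabetical order at indices 1 to 39, and `<UNK>` at index 40. The class below is a minimal sketch under that assumption; `PhonemeTokenizerSketch` is a hypothetical name and is not the class in utils.py.

```python
# The 39 stress-free ARPAbet phonemes used by the CMU dictionary.
ARPABET = (
    "AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG "
    "OW OY P R S SH T TH UH UW V W Y Z ZH"
).split()


class PhonemeTokenizerSketch:
    """Toy tokenizer: <blank> first, sorted phonemes next, <UNK> last (assumed layout)."""

    def __init__(self):
        self.vocab = ["<blank>"] + sorted(ARPABET) + ["<UNK>"]
        self.token_to_id = {tok: i for i, tok in enumerate(self.vocab)}
        self.id_to_token = dict(enumerate(self.vocab))

    def encode_seq(self, seq):
        # Tokens outside the vocabulary fall back to the <UNK> id.
        unk = self.token_to_id["<UNK>"]
        return [self.token_to_id.get(tok, unk) for tok in seq]

    def decode_seq(self, ids):
        return [self.id_to_token[i] for i in ids]

    def decode_seq_batch(self, batch):
        return [self.decode_seq(ids) for ids in batch]


sketch = PhonemeTokenizerSketch()
print(sketch.encode_seq(["CH", "AO", "B", "T", "S", "OY"]))
```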


### Task 3: ASR Baseline [8 mark(s)]

We are now ready to build the first ASR system. Please finish the tasks below:

1. The current code uses the validation set as the testing set, while the code for preparing the test data is missing. Please complete it. **[1 mark(s)]**
2. Please use the Checkpointer class of SpeechBrain to save the model with the lowest Phoneme Error Rate (PER) during training. Save the checkpoint under the directory "results/baseline/best_ckpt". **[1 mark(s)]**
3. Load the best model (lowest PER) for evaluation, instead of using the model from the last epoch. **[1 mark(s)]**
4. Please use speechbrain.utils.metric_stats.ErrorRateStats.write_stats to save the output of your model on the whole test set, so that you can inspect your model's performance in more detail. In the output file, please use phoneme tokens instead of token ids (numbers). Save the file to "results/baseline/results.txt". **[1 mark(s)]**
5. Please log your training, validation, and evaluation statistics to the result folder, in whatever way you like. **[1 mark(s)]**

Run the training and testing by

    python train.py hparam_baseline.yaml
Expected PER: 90%.

**NOTE**: Please keep the (1) training log, (2) model checkpoint, and (3) corresponding result files when submitting your assignment. **[3 mark(s)]**

**MY NOTE: The files are present in the ./results/baseline folder**

### Task 4: Modifying the Model [13 mark(s)]

You may have spotted some issues during training, such as slow convergence, overfitting, etc. Please make the following changes to your model by modifying the yaml file.
1. (Please create a new .yaml file from the hparam_baseline.yaml, naming it hparam_modified.yaml) **[1 mark(s)]**
2. Increase the N_epoch to 20. **[1 mark(s)]**
3. Increase the learning rate to 5e-3 **[1 mark(s)]**
4. Add weight decay = 0.1 to the optimizer **[1 mark(s)]**
5. Add a variable named "drop_p", with value 0.2. **[1 mark(s)]**
6. Add 3 dropout layers to the model, after act1, act2, and RNN. All with the same dropout rate of "drop_p" (you need to use a variable reference here). **[1 mark(s)]**
7. Change the output_dir from "results/baseline" to "results/drop0.2x2_lr0.0005_wd0.1". **[1 mark(s)]**

There are some other changes you need to make in the train.py file:
1. Use the speechbrain.nnet.schedulers.NewBobScheduler to schedule the learning rate or training according to loss on validation set. If the validation loss did not decrease after an epoch of training, use that scheduler to adjust the learning rate. **[2 mark(s)]**
2. Before the training of each epoch, print out and log the current learning rate. **[1 mark(s)]**
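
The idea behind the two train.py changes above can be illustrated with a small, dependency-free simulation. This is a simplified stand-in for relative-improvement annealing, not SpeechBrain's NewBobScheduler (which should be used in the actual train.py and has further options). The function names, the factor 0.5, the threshold 0.0025, and the validation losses are all illustrative assumptions.

```python
def next_learning_rate(lr, prev_loss, curr_loss, factor=0.5, threshold=0.0025):
    """Shrink lr when the validation loss did not improve enough relative to the last epoch."""
    if not prev_loss:
        # First epoch (or degenerate zero loss): nothing to compare against.
        return lr
    relative_improvement = (prev_loss - curr_loss) / prev_loss
    if relative_improvement < threshold:
        return lr * factor
    return lr


def simulate(valid_losses, lr=5e-3):
    """Log the learning rate before each epoch, then update it from the validation loss."""
    lrs_used, prev = [], None
    for epoch, loss in enumerate(valid_losses, start=1):
        print(f"epoch {epoch}: learning rate = {lr:g}")
        lrs_used.append(lr)
        # (training and validation for this epoch would happen here)
        lr = next_learning_rate(lr, prev, loss)
        prev = loss
    return lrs_used


# Made-up validation losses: the loss rises in epoch 3, so epoch 4 uses a halved rate.
lrs = simulate([2.0, 1.5, 1.6, 1.2])
```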

Run the training and testing by

    python train.py hparam_modified.yaml
Expected PER: 65%.

**NOTE**: Please keep the (1) training log, (2) model checkpoint, and (3) corresponding result files when submitting your assignment. **[3 mark(s)]**

**NOTE from TA: In Task 4, there is an inconsistency between the folder name and the learning rate. Please change the output_dir of the modified model to "results/drop0.2x2_lr0.005_wd0.1" instead.**

**MY NOTE: The files are present in ./results/drop0.2x2_lr0.005_wd0.1**

## Section 2 - Questions [42 marks]

### - Result Analysis [2 mark(s)]

1. How does your system perform? Briefly introduce your system's performance with objective metric scores and the result file for the test set. **[2 mark(s)]**

**Answer:**

Two systems were trained: the baseline model and the modified model.

- Here is a brief explanation of the required files, which are saved in ./results/baseline and ./results/drop0.2x2_lr0.005_wd0.1 (please check all the files present in these directories):
    + <u>training log.txt</u>: Provides the training loss at each epoch along with the validation metric summary computed at the end of each epoch. At the end of the file, the summary metrics for the best model at test time are also shown.
    + <u>train_valid_stats.txt</u>: Provides the training loss, validation loss, and validation metrics at the end of each epoch. Contains the same information as 'training log.txt'.
    + <u>results.txt</u>: Shows the alignments at inference time on a series of sentences. It gives a phoneme-level comparison marking any insertions, deletions, or substitutions, along with the final test evaluation metrics.

- Here are brief explanations of the results of each model.
    + **a. Baseline model**
        - As the files mentioned above show, the validation "Word Error Rate" at the final training epoch (which in this case corresponds to the Phoneme Error Rate) is around 73.8%. When evaluating the best model on the test data, the WER is around 80.33% and the SER is 100%.
        - This means that the baseline model did not predict a single test sentence entirely correctly, while at the phoneme level its error rate on the test data was about 80%.
        - This model was trained for only 10 epochs, and the checkpoint from the end of the 10th epoch was the best model, which was used for evaluation on the test dataset.
        - The model makes far more deletions (399) and substitutions (71) than insertions (7). In other words, it predicts far fewer phonemes than the reference contains (deletions) or predicts the wrong phonemes (substitutions), which shows that this baseline still has a long way to go.

    + **b. Modified model**
        - This model differs from the baseline: it has 3 additional dropout layers, weight decay in the Adam optimizer, a higher base learning rate of 0.005, and a NewBobScheduler that decreases the learning rate whenever the validation loss fails to decrease relative to the previous epoch.
        - As the files mentioned above show, the validation "Word Error Rate" at the final training epoch (which in this case corresponds to the Phoneme Error Rate) is around 57%. When evaluating the best model on the test data, the WER is around 62.033% and the SER is 100%. These results are clearly better than the baseline's but are still not good enough.
        - An SER of 100% means that even the modified model did not predict a single test sentence entirely correctly, while at the phoneme level its error rate on the test data was about 62%.
        - This model was trained for 20 epochs. The best checkpoint was used for evaluation on the test dataset.
        - For this model, the error counts are insertions (22), deletions (184), and substitutions (160). Compared with the baseline, deletions dropped sharply (399 -> 184), while insertions (7 -> 22) and substitutions (71 -> 160) increased: the model now drops far fewer phonemes, but more of the phonemes it does predict are wrong.
        - That said, this model is still not performant enough for any practical application; there is a lot of room for improvement.

### - Tokenization [8 mark(s)]
1. Do you think detecting phoneme sequence from speech recording is more difficult than detecting character or word sequence? Why? **[2 mark(s)]**
2. For the task of speech recognition, what are the drawbacks of using phoneme as the detecting unit? **[2 mark(s)]**
3. What is the advantage of sub-word tokenizer compared to word-level tokenizer? **[2 mark(s)]**
4. If we are changing our tokenizer to the type of grapheme, which level do you think is the best, among {character, word, sub-word}? Please state your reason. **[2 mark(s)]**

**Answer:**

1. Detecting an entire word is a lot harder than detecting individual characters or phonemes. That said, the current phoneme formulation assumes that English speakers of all accents share the same pronunciation. Because of this simplifying assumption, the number of phonemes to deal with is far smaller than it would be if we considered the phonemes of all the varied accents in the world. Even so, detecting phonemes is not an easy task: different speakers may pronounce the same phoneme differently, and there can be substantial variation in how individuals articulate it. Additionally, unlike written text, where spaces indicate word boundaries and characters are well-defined, spoken language lacks clear boundaries between phonemes. This makes it challenging to segment speech into individual phonemes without contextual information.


2. Here are some of the drawbacks of using phoneme as the detecting unit:
    - It's more complicated because it deals with many tiny units.
    - Sometimes, sounds can mean different things in different situations, making it hard to understand.
    - People say the same sounds differently, making it tricky for one-size-fits-all systems.
    - How sounds are spoken can change depending on nearby sounds, making it harder to separate them.
    - It takes a lot of computing power and data.
    - It often needs tweaking for different languages, speakers and accents.

3. Sub-word tokenization, compared to word-level tokenization, has several benefits:
    - It results in a smaller vocabulary, saving memory and making models easier to train.
    - Focusing on common linguistic structures helps models generalize better.
    - Sub-word tokenization retains character-level details, valuable for fine-grained text analysis.
    - Sub-word tokenization can recognize and process rare/ infrequent words not found in its vocabulary, making it more adaptable to a wider range of terms.
    - It effectively captures variations in word forms, such as tenses and plurals, by breaking them into common sub-word units which can help the model generalize better.
    - A model built on this tokenization could potentially serve as a base model for ASR systems in other languages that share similar sub-words.

4. A grapheme is the smallest unit of written language, like the building blocks of letters and symbols. Each letter in the alphabet and every symbol, like punctuation marks, is a grapheme. They come together to form words and sentences. When choosing the best level of grapheme tokenization (character, word, or sub-word), it depends on the specific task.
    - <u>Character-Level Grapheme Tokenization</u>: This treats each character as a separate unit, making it versatile and suitable for languages with complex scripts. It's best when there is a need to analyze individual letters or work with languages where words don't have clear boundaries. It's also useful for tasks like text generation or text classification that focus on character-level patterns.
    - <u>Word-Level Grapheme Tokenization</u>: This divides text into whole words, which are more interpretable and contextually meaningful. It's ideal when we need to understand the word meanings themselves, like in translation or sentiment analysis. It's also a good fit for languages with well-defined word boundaries.
    - <u>Sub-Word-Level Grapheme Tokenization</u>: This balances characters and words by segmenting text into meaningful sub-word units. It's a commonly chosen option for many language processing tasks. It handles tricky out-of-vocabulary words, word variations, and multiple languages effectively. It's often used in machine translation, speech recognition, and text generation, where adaptability to different languages matters.
    - Overall, sub-word-level grapheme tokenization is often a practical choice because it combines the strengths of both characters and words while addressing challenges like handling new words and complex word forms.
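
The out-of-vocabulary behaviour described above can be illustrated with a toy greedy longest-match segmenter. This is only a sketch: `TOY_VOCAB` and `greedy_subword_split` are hypothetical, and real sub-word tokenizers (e.g., BPE or unigram models) learn their vocabulary from data instead of using a hand-written set.

```python
# Hypothetical hand-written sub-word vocabulary, for illustration only.
TOY_VOCAB = {"play", "watch", "ing", "ed", "er", "s", "un"}


def greedy_subword_split(word, vocab):
    """Split a word into the longest matching vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No piece matches: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


# "players" never appears in the vocabulary as a whole word,
# yet it is still representable as known pieces.
print(greedy_subword_split("players", TOY_VOCAB))
print(greedy_subword_split("watching", TOY_VOCAB))
```

A word-level tokenizer with the same four whole words would map both inputs to an unknown token, which is the contrast the answers above describe.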



### - Modeling [7 mark(s)]
Connectionist Temporal Classification (CTC) is a type of loss function that is commonly used in ASR, especially when we do not know the precise alignment between the annotation and the audio.

1. Explain how CTC deals with the misalignment issue between audio and annotation, i.e., the number of frames in the audio is much higher than the number of phonemes/characters/sub-words/words in the annotation, and we do not know their correspondence. **[1 mark(s)]**
2. Why does CTC need an additional blank token in the prediction? **[1 mark(s)]**
3. Here are several decoded outputs from a CTC model. Write out their final recognition results. ("-" is the CTC blank token, and "_" represents a space) **[2 mark(s)]**

    (1) heeel-ll-l_lllooo--wooooorld

    (2) hhhhee-llow_wo--rr-rllll--dd
    
4. Recall the formula of CTC loss:
   $$L_{CTC} = -\log\left(\sum_{\pi \in B^{-1}(W)} \prod_{t=1}^{T} p(\pi_t|\mathbf{x}_t)\right)$$
   Does this summation mean that we have to list all possible alignments between frames and text, compute the probability of each, and add them together? Is there a more efficient way to compute the CTC loss? If you think so, please briefly explain a more efficient algorithm. **[3 mark(s)]**

Answer:
1. CTC deals with misalignment between audio and annotation by letting the model emit one symbol per audio frame, where the symbol is either an output unit (e.g., a phoneme) or a special blank token ("-"). A single unit may be repeated over several consecutive frames, and blanks may fill frames that carry no new unit. To turn a frame-level output into the final sequence, consecutive repeated symbols are merged and then the blank tokens are removed. Many different frame-level paths therefore map to the same annotation, so the model can be trained without knowing the precise alignment, even though the number of audio frames is much larger than the number of phonemes in the annotation. During training, the model learns to find good alignments by adjusting the probabilities of emitting blank tokens and the tokens corresponding to the desired output units.
<br>

2. The blank token in the CTC framework plays a crucial role as a separator between repeated units such as phonemes, characters, sub-words, or words. Its primary function is to help the model handle cases where the same unit occurs multiple times consecutively. Without the blank token, consecutive identical units could not be told apart from a single unit held over several frames. For example, in the word "hello", a blank token between the two "l" characters prevents them from being merged into a single "l". This is essential for generating the correct output sequence, as it controls which repetitions are merged.
<br>

3.  (1) helll loworld
    (2) helow worrld
<br>

4.
    - Yes, the summation means that we add up the probabilities of all possible alignments, since multiple alignments can give rise to the same output sequence due to the way blank tokens and repeated tokens work.
    - Computing this summation naively is computationally very expensive, because the number of alignments grows exponentially with the number of frames. However, we can use dynamic programming to compute the CTC loss efficiently.
    - The original implementation of CTC loss leverages dynamic programming techniques inspired by Hidden Markov Models (HMMs). Here is a high level overview of how it works:
        - CTC adopts some concepts from HMMs, particularly the use of dynamic programming algorithms. For example, in HMMs we use the forward and backward passes to efficiently compute probabilities while considering multiple possible alignments between audio frames and the target sequence.
        - Here we cache the intermediate results, such as probabilities of reaching specific states or alignments, as we calculate them during the forward and backward passes.
        - Since these probabilities are cached during computation of forward and backward algorithm, we can retrieve the cached value instead of recomputing it. This significantly speeds up the CTC loss computation and makes it feasible for training large ASR models on extensive datasets.
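
The collapsing rule and the dynamic-programming idea from the answers above can be sketched in a few lines. This is a toy illustration under stated assumptions, not SpeechBrain's or PyTorch's CTC implementation: `VOCAB` and `PROBS` are made-up values, all function names are hypothetical, and the forward pass works in plain probabilities (real implementations use log-probabilities for numerical stability). The brute-force function enumerates every path so the two results can be compared.

```python
import itertools
import math

BLANK = "-"


def ctc_collapse(path):
    """Apply B: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)


def ctc_prob_forward(probs, vocab, target):
    """Sum over all alignments of `target` with the forward algorithm.

    probs[t][k] is p(vocab[k] | x_t).
    """
    idx = {s: k for k, s in enumerate(vocab)}
    # Extended target with blanks around every label: -, y1, -, y2, ..., -
    ext = [BLANK]
    for ch in target:
        ext += [ch, BLANK]
    S = len(ext)
    alpha = [0.0] * S
    alpha[0] = probs[0][idx[BLANK]]
    if S > 1:
        alpha[1] = probs[0][idx[ext[1]]]
    for t in range(1, len(probs)):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping the blank is only allowed between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][idx[ext[s]]]
        alpha = new
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)


def ctc_prob_brute_force(probs, vocab, target):
    """Enumerate every path (exponential in the number of frames)."""
    total = 0.0
    for path in itertools.product(range(len(vocab)), repeat=len(probs)):
        if ctc_collapse(vocab[k] for k in path) == target:
            total += math.prod(probs[t][k] for t, k in enumerate(path))
    return total


# Made-up per-frame distributions over a 3-symbol vocabulary, 4 frames.
VOCAB = ["-", "a", "b"]
PROBS = [[0.1, 0.6, 0.3], [0.5, 0.2, 0.3], [0.2, 0.3, 0.5], [0.6, 0.1, 0.3]]

print(ctc_collapse("heeel-ll-l_lllooo--wooooorld"))
print(-math.log(ctc_prob_forward(PROBS, VOCAB, "ab")))
```

The forward pass touches each (frame, extended-label) pair once, whereas the brute-force sum visits every one of the `len(VOCAB) ** len(PROBS)` paths.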

### - Language Model [7 mark(s)]
1. Consider the two sequences below:
    - A: I like Singapore's weather.
    - B: I Singapore like ? weathers.

    For a well-trained language model, which sentence will have lower perplexity from this model? Why? **[1 mark(s)]**
</br>

2. Given the corpus below:

            <s> I love to play football </s>
            <s> He loves to watch football </s>
            <s> I love to watch movies </s>
            <s> She loves to play tennis </s>
    (1) Assuming we are using a word-level tokenizer. <s> and </s> represent start and end of sentence token. Calculate the below bigram probability $P(B|A)$ by
    $$
    P(B|A) = \frac{Count(A B)}{Count(A)}
    $$
    **[3 mark(s)]**
    
    a. P(love | I)

    b. P(to | love)

    c. P(football | play)
    
    d. P(movies | watch)
   </br>
   
    (2) Using the probabilities you obtained above, calculate the probability of the sentences below **[2 mark(s)]**
    a. I love to watch football
    b. She loves to play football
   </br>
   
    (3) Why is it not a good idea to use a large n value for n-gram language models? **[1 mark(s)]**


**Answer:**
1. Sentence A, "I like Singapore's weather.", is expected to have lower perplexity under a well-trained language model because it follows a regular sentence structure and makes logical and grammatical sense. Sentence B, "I Singapore like ? weathers.", is likely to have higher perplexity because it contains jumbled words and lacks proper grammar, making it harder for the language model to predict each next word accurately. In general, lower perplexity means the model finds the sentence more plausible: perplexity is the inverse of the Nth root of the joint probability the model assigns to the sequence of words, so a higher probability corresponds to the model being less "perplexed" or "confused". Here is the formula:

$$Perplexity = \sqrt[n]{\frac{1}{P(w_1, w_2, \ldots, w_n)}}$$

2. **NOTE: This uses the corrected formula provided by Longshen. Even though this part is not graded, here are my answers.**
    (1).
        a. P(love | I) = Count(I, love) / Count(I) = 2/2 = 1
        b. P(to | love) = Count(love, to) / Count(love) = 2/2 = 1
        c. P(football | play) = Count(play, football) / Count(play) = 1/2 = 0.5
        d. P(movies | watch) = Count(watch, movies)/Count(watch) = 1/2 = 0.5

    <br>

    (2).a. P("I love to watch football") = P(I | \<s\>) * P(love | I) * P(to | love) * P(watch | to) * P(football | watch)
                                    = 2/4 * 1 * 1 * 2/4 * 1/2
                                    = 1/8
    b. P("She loves to play football") = P(She | \<s\>) * P(loves | She) * P(to | loves) * P(play | to) * P(football | play)
                                    = 1/4 * 1 * 2/2 * 2/4* 1/2
                                    = 1/16

    <br>

    (3). Although a large n value for n-grams might provide more context, it is generally not a good idea because:
        * As n increases, the n-gram model considers longer sequences of words. This can lead to a significant increase in the number of unique n-grams in the training data.
        * Many of these n-grams may occur very rarely or even be unique, resulting in sparse data.
        * Sparse data can lead to poor model generalization because the model may not have enough information to make accurate predictions for these rare or unseen n-grams.
        * In essence, a large n value can lead to data sparsity problems, making it challenging for the model to estimate probabilities accurately and potentially causing overfitting on the training data.
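
The bigram answers above can be checked with a short script that counts directly from the four-sentence corpus. It is a sketch: the function names are hypothetical, the sentence probability omits the end-of-sentence transition (as the hand calculation above does), and the perplexity follows the formula given in answer 1 with n taken as the number of words.

```python
import math
from collections import Counter
from fractions import Fraction

CORPUS = [
    "<s> I love to play football </s>",
    "<s> He loves to watch football </s>",
    "<s> I love to watch movies </s>",
    "<s> She loves to play tennis </s>",
]

sentences = [line.split() for line in CORPUS]
unigrams = Counter(w for sent in sentences for w in sent)
bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))


def bigram_prob(b, a):
    """P(b | a) = Count(a b) / Count(a), kept exact with fractions."""
    return Fraction(bigrams[(a, b)], unigrams[a])


def sentence_prob(sentence):
    """Product of bigram probabilities, starting from <s>."""
    words = ["<s>"] + sentence.split()
    prob = Fraction(1)
    for a, b in zip(words, words[1:]):
        prob *= bigram_prob(b, a)
    return prob


def perplexity(sentence):
    """n-th root of 1 / P(sentence); infinite when the probability is zero."""
    prob = sentence_prob(sentence)
    if prob == 0:
        return math.inf
    return float(prob) ** (-1 / len(sentence.split()))


print(bigram_prob("love", "I"), bigram_prob("football", "play"))
print(sentence_prob("I love to watch football"))
print(sentence_prob("She loves to play football"))
# "I loves" never occurs in the corpus, so the whole sentence gets probability 0.
print(sentence_prob("I loves to play football"))
```

The last line shows the sparsity problem from part (3) already at n = 2: a single unseen bigram zeroes out the sentence, and unseen n-grams become more common as n grows.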

### - Beam Search [4 mark(s)]
Assume we have a simplified language model that can predict the probability of the next word. We have generated the start of the sentence, "I want to". Now we are using beam search to predict the rest of the sentence. Let the letter "G" denote the generated part. Let's use a beam size of 2 for this question.

        Probability calculated by language model:
        p(eat | G): 0.4
        p(play | G): 0.3
        p(go | G): 0.2
        p(watch | G): 0.1
        p(a sandwich | G eat): 0.5
        p(dinner | G eat): 0.4
        p(an apple | G eat): 0.1
        p(football | G play): 0.6
        p(games | G play): 0.4
1. Let's continue the generation from G="I want to". After the first step of beam search, what tokens will be selected, and what are the resulting candidate sequences? **[1 mark(s)]**
2. In the 2nd step of beam search, what are the two beams starting with "G eat"? What are their respective probabilities? **[1 mark(s)]**
3. In the 2nd step of beam search, what are the two beams starting with "G play"? What are their respective probabilities? **[1 mark(s)]**
4. What are the resulting candidate sequences from the 2nd step of beam search? **[1 mark(s)]**

**Answer**

1. Since G = "I want to", the tokens selected with a beam size of 2 are "eat" and "play", because P(eat|G) [0.4] and P(play|G) [0.3] are the two highest probabilities. The resulting candidate sequences are "I want to eat" (0.4) and "I want to play" (0.3).
2. In the second step, one branch is "G eat" (i.e., I want to eat). Its two most likely next tokens are "a sandwich" and "dinner", as p(a sandwich | G eat) [0.5] and p(dinner | G eat) [0.4] are the highest conditional probabilities given "G eat". The sequence probabilities are therefore 0.4 × 0.5 = 0.20 for "G eat a sandwich" and 0.4 × 0.4 = 0.16 for "G eat dinner".
3. In the second step, the other branch is "G play" (i.e., I want to play). Its two most likely next tokens are "football" and "games", as p(football | G play) [0.6] and p(games | G play) [0.4] are the highest conditional probabilities given "G play". The sequence probabilities are therefore 0.3 × 0.6 = 0.18 for "G play football" and 0.3 × 0.4 = 0.12 for "G play games".
4. The 2nd step produces 4 expanded sequences:
    - From "G eat":
        - I want to eat a sandwich (0.20)
        - I want to eat dinner (0.16)
    - From "G play":
        - I want to play football (0.18)
        - I want to play games (0.12)
    - With a beam size of 2, only the two most probable sequences are kept: "I want to eat a sandwich" (0.20) and "I want to play football" (0.18).
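
The two beam search steps can be reproduced with a short script built on the probability table from the question. It is a minimal sketch: `NEXT_TOKEN_PROBS` simply transcribes that table, `beam_step` is a hypothetical helper, and scores are multiplied as plain probabilities (practical decoders usually add log-probabilities instead). In standard beam search, only the `beam_size` highest-scoring sequences overall survive each step.

```python
import heapq

# Conditional probabilities from the question, keyed by the sequence so far.
NEXT_TOKEN_PROBS = {
    ("G",): {"eat": 0.4, "play": 0.3, "go": 0.2, "watch": 0.1},
    ("G", "eat"): {"a sandwich": 0.5, "dinner": 0.4, "an apple": 0.1},
    ("G", "play"): {"football": 0.6, "games": 0.4},
}


def beam_step(beams, beam_size):
    """Expand every beam by one token and keep the beam_size best sequences."""
    candidates = []
    for seq, prob in beams:
        for token, p in NEXT_TOKEN_PROBS.get(seq, {}).items():
            candidates.append((seq + (token,), prob * p))
    return heapq.nlargest(beam_size, candidates, key=lambda c: c[1])


beams = [(("G",), 1.0)]
step1 = beam_step(beams, beam_size=2)
step2 = beam_step(step1, beam_size=2)

for seq, prob in step1:
    print("step 1:", " ".join(seq), round(prob, 2))
for seq, prob in step2:
    print("step 2:", " ".join(seq), round(prob, 2))
```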

### - Word Error Rate [3 mark(s)]

Consider an automatic speech recognition system that transcribes a spoken segment into text. We compare the transcription of the system with a human-annotated reference transcript to calculate the system's Word Error Rate.

Reference Transcript:
"I am excited to learn about speech recognition."

System's Transcription (Hypothesis):
"I am excited learn about speech recognise."

1. Calculate the number of insertions, deletions, and substitutions. **[1 mark(s)]**
2. Compute the Word Error Rate (WER) using the formula: **[1 mark(s)]**
$$WER=\frac{\text{Insertions}+\text{Deletions}+\text{Substitutions}}{\text{Number of words in Reference}}$$
3. Why might WER be a more reasonable metric for ASR compared to a simple accuracy rate (correct words divided by total words)? **[1 mark(s)]**

**Answer:**
1. There is 1 deletion (the word "to" is missing from the output), 1 substitution ("recognition" was transcribed as "recognise"), and zero insertions (i.e., I = 0, S = 1, D = 1).
2. Here is the calculation
    $$WER=\frac{\text{Insertions (0)}+\text{Deletions (1)}+\text{Substitutions (1)}}{\text{Number of words in Reference (8)}}$$
    $$WER=\frac{2}{8}$$
    $$WER=0.25$$
3. WER is preferred over simple accuracy for the following reasons:
    - WER accounts for insertions, deletions, and substitutions, so it captures the different types of errors an ASR system makes. A basic accuracy rate only counts correct words, which may not accurately reflect transcription quality.
    - WER is computed on the best alignment (minimum edit distance) between the reference and the output, so a single missing or extra word does not cause every following word to be counted as wrong.
    - A simple accuracy rate ignores insertions: an output that contains every reference word plus many spurious extra words could still score 100% accuracy, whereas WER penalizes each inserted word.
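
The counts in answers 1 and 2 can be verified with a small edit-distance routine that tracks substitutions, deletions, and insertions along an optimal alignment. This is a sketch with a hypothetical function name, not SpeechBrain's ErrorRateStats; words are split on whitespace, so punctuation stays attached to its word, and when several alignments are optimal the reported split between error types follows the tuple ordering used for tie-breaking.

```python
def wer_counts(reference, hypothesis):
    """Return substitution/deletion/insertion counts and WER for one sentence pair."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = (cost, subs, dels, ins) for aligning ref[:i] with hyp[:j]
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)  # delete every reference word
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # match, no cost
                continue
            c, s, d, n = dp[i - 1][j - 1]
            substitution = (c + 1, s + 1, d, n)
            c, s, d, n = dp[i - 1][j]
            deletion = (c + 1, s, d + 1, n)
            c, s, d, n = dp[i][j - 1]
            insertion = (c + 1, s, d, n + 1)
            # Tuples compare by total cost first.
            dp[i][j] = min(substitution, deletion, insertion)
    cost, subs, dels, ins = dp[-1][-1]
    return {"S": subs, "D": dels, "I": ins, "WER": cost / len(ref)}


REF = "I am excited to learn about speech recognition."
HYP = "I am excited learn about speech recognise."
print(wer_counts(REF, HYP))

# Every reference word is present, but two extra words were inserted.
print(wer_counts(REF, "I am very very excited to learn about speech recognition."))
```

The second call shows the case that a plain "correct words divided by total words" accuracy would miss: all 8 reference words are recognized, yet the WER is not zero.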

### - Possible Improvement [3 mark(s)]
1. The performance of the recognition system in Section 1 might still have room to improve. What are possible reasons for the not-so-good performance, and directions of improvement? Please list 3 pairs of them. **[3 mark(s)]**

**Answer:**

There are several possible reasons for the below-par performance. They are listed below together with directions to explore for improving the model:
- The model needs more training data, which can be obtained with data augmentation, for example by cutting each audio/text training pair into several different time-sliced windows.
- The dataset may not be general enough to cover different accents. Different accents realize phonemes differently, so if a person speaks with an unseen accent at inference time, the model might transcribe it incorrectly.
- We could train the model to predict characters directly instead of phonemes. Phonemes are a human-designed unit, and this biased way of tokenizing might hinder the model from learning a better underlying representation. A deeper model that is not restricted to phonemes might learn its own representation and give better results.
- We could use a more state-of-the-art model architecture that incorporates concepts like attention in order to learn the dependencies in the sequence better.
- We could also build on top of an existing pretrained model to get better performance.
- Since the CTC loss formula is known, we can add it as an extra term to the existing loss between the encoder and decoder, so that training minimizes the error from a more holistic, end-to-end perspective.
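
The last point, combining a CTC term with the existing encoder-decoder loss, is commonly done as a weighted sum of the two losses. The sketch below is only an illustration under assumptions: the function name `joint_ctc_attention_loss` and the default `ctc_weight=0.3` are chosen for this example and are not taken from the assignment code. The two inputs stand for per-batch loss values that the training loop has already computed.

```python
def joint_ctc_attention_loss(ctc_loss, attention_loss, ctc_weight=0.3):
    """Interpolate two losses: ctc_weight * CTC + (1 - ctc_weight) * attention.

    Works on plain floats; the same arithmetic also applies to scalar tensors.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss


# Hypothetical loss values for one batch.
loss = joint_ctc_attention_loss(ctc_loss=2.0, attention_loss=1.0, ctc_weight=0.5)
print(loss)  # 1.5
```

Setting `ctc_weight=0.0` recovers the original encoder-decoder loss, so the weight can be tuned as an ordinary hyperparameter.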



### - Speech vs Singing [6 mark(s)]
1. What are the properties that differ between the audio of a speech recording and that of a singing recording? What are the similar/same properties that are shared between them? **[2 mark(s)]**
2. What are the properties that are different between spoken texts and lyrics? What are the similar/same properties that are shared between them? **[1 mark(s)]**
3. Given a limited paired dataset of singing audio and lyrics, how can we build a lyric transcription system with better performance? Please answer from 3 perspectives. **[3 mark(s)]**

**Answer:**
1. The properties that are shared and that differ between speech and singing recordings are listed below:
- **Similar Properties:**
    + Both speech and singing lie within the human hearing range (20 Hz to 20 kHz).
    + Both speech and singing vary over time, for example in their rhythmic patterns and pauses (duration, timing, and tempo).
    + Both can vary in loudness/amplitude, from low to high decibel levels.

- **Different Properties:**
    + Singing uses deliberate pitch manipulation to follow a melody, while speech has a smaller pitch range that depends on the speaker.
    + Even though both contain phonetic features, in singing a phoneme often has to be held on a particular note for a longer time and sung with a particular emotion, whereas in speech the phonemes occur at the rate at which the person normally speaks.
    + Speech has some prosody and tempo, but they are not consistent and can vary from person to person, whereas singing follows the tempo of the melody and typically has a more consistent rhythm.
    + Sung lyrics are generally more poetic and fit a particular meter, whereas regular speech need not be poetic in nature.
    + Both speech and singing can convey emotion and expression, but singing focuses more explicitly on emotional delivery through melody and musical techniques, whereas speech relies on intonation and prosody for emotional nuance.

2. The only similar property is that both share linguistic features: both consist of phonemes, words, phrases, and sentences that are used to convey some information or story.
However, there are quite a few differences, which are listed below:
- In terms of rhythm and tempo, sung lyrics have a fixed melodic and rhythmic structure, whereas spoken text may not.
- In terms of pronunciation, some words in sung lyrics might be pronounced differently to convey emotion or to emphasise the mood of the song, whereas in spoken text the pronunciation varies with accent but otherwise stays more or less the same.
- In terms of melody, each word in sung lyrics might be associated with one or more musical notes, whereas a word in spoken text may only carry minor inflections in pitch or accent to emphasise a particular point and is generally not associated with any musical note.

3. To build a better lyric transcription model, here are some of the things we can do:
- We can apply a transformation/preprocessing step to the singing data that limits it to only a few pitches. This would be analogous to turning a lyric transcription system into a regular ASR system that only has to detect spoken words: if we already had a performant ASR system for regular speech, we would essentially be reducing singing recognition to speech recognition.
- Another route would be to use ML models that predict tokens directly at the character level rather than at the phoneme level. This way a more complex model is not biased by our limited understanding of what phonemes are and might discover the hidden underlying patterns by itself.
- Another key step would be data augmentation. Given a pair of lyrics and audio, we can cut many time-sliced windows of varying lengths, generating a lot more data for the model to learn from, which should help a supervised learning algorithm perform better.
- Apart from this, we can try fine-tuning an existing pretrained model (say, a regular ASR model trained only on speech) with a more state-of-the-art architecture that uses concepts like attention/self-attention, so that the new model benefits from the weights and insights already learned by the pretrained model.
- Another approach would be to incorporate additional context from other modalities. For example, if we have video of the singer's face or lip movement, it might help the model learn the underlying concepts better.
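
The time-sliced augmentation idea above needs to know which words fall inside each audio window, so the sketch below assumes that word-level timestamps are available (for example from a forced aligner). That is an assumption of this example, not something the assignment provides. The function name `sliced_examples` and the timings are hypothetical.

```python
def sliced_examples(aligned_words, min_words=2, max_words=4):
    """Enumerate contiguous word windows as (start_sec, end_sec, text) triples.

    aligned_words: list of (word, start_sec, end_sec) tuples sorted by time.
    """
    examples = []
    n = len(aligned_words)
    for size in range(min_words, min(max_words, n) + 1):
        for first in range(n - size + 1):
            window = aligned_words[first:first + size]
            start = window[0][1]   # start time of the first word
            end = window[-1][2]    # end time of the last word
            text = " ".join(word for word, _, _ in window)
            examples.append((start, end, text))
    return examples


# Hypothetical alignment for one sung line (times in seconds).
line = [
    ("twinkle", 0.0, 0.8),
    ("twinkle", 0.8, 1.6),
    ("little", 1.6, 2.4),
    ("star", 2.4, 3.6),
]

for start, end, text in sliced_examples(line, min_words=2, max_words=3):
    print(f"{start:.1f}-{end:.1f}s: {text}")
```

Each triple identifies an audio span to crop and the lyric text that goes with it. The new examples overlap heavily with the original line, so they add variety in length and context rather than genuinely new singing data.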


### - Timing Survey [2 mark(s)]

- What do you think is the most difficult part? Which part did you spend the most time on? **[1 mark(s)]**

Answer:
The entirety of section 2 was the hardest part as it required a deep understanding and revision of certain theoretical aspects of implementing ASR.

<br>

- How much time did you spend on the assignment? Please fill in an estimated time here if you did not time yourself. **[1 mark(s)]**

Answer: ~4 days

<br>