# **Instructions**

This document is a template, and you are not required to follow it exactly. However, the kinds of questions we ask here are the kinds of questions we want you to focus on. While you might have answered similar questions to these in your project presentations, we want you to go into a lot more detail in this write-up; you can refer to the Lab homeworks for ideas on how to present your data or results. 

You don't have to answer every question in this template, but you should answer roughly this many questions. Your answers to such questions should be paragraph-length, not just a bullet point. You likely still have questions of your own -- that's okay! We want you to convey what you've learned, how you've learned it, and demonstrate that the content from the course has influenced how you've thought about this project.

# BENoiT: Better English Noisy Audio Transcriptions
Track: Applications | Sub-track: Text + Audio 

Project mentor: Xuan Zhang

Vijay Murari Tiyyala <vtiyyal1@jh.edu>, Pulkit Madaan <pmadaan2@jh.edu>, Ehong Sun <esun7@jh.edu>, Carolyne Holmes <cholme26@jh.edu>

https://github.com/madaanpulkit/benoit

# Outline and Deliverables

List the deliverables from your project proposal. For each uncompleted deliverable, please include a sentence or two on why you weren't able to complete it (e.g. "decided to use an existing implementation instead" or "ran out of time"). For each completed deliverable, indicate which section of this notebook covers what you did.

If you spent substantial time on any aspects that weren't deliverables in your proposal, please list those under "Additional Work" and indicate where in the notebook you discuss them.

### Uncompleted Deliverables
1. "Would like to accomplish": Not started; the model needs to be corrected first.
2. "Expect to complete #2": Since the model didn't learn anything meaningful, we have not yet reached the stage where we can identify failure modes.



### Completed Deliverables
1. "Must complete #1": A corpus of synthetically generated noisy speech audio with grammatically correct English sentences. [In "Dataset" below](#scrollTo=zFq-_D0khnhh&line=10&uniqifier=1).
2. "Must complete #2": Working implementation of the Denoising Model. [In "Methods" below](#scrollTo=PqB48IF9kMBf&line=9&uniqifier=1).
3. "Must complete #3": We have a function capable of taking raw audio and returning faithful and corrected text. [In "Methods" below](#scrollTo=PqB48IF9kMBf&line=9&uniqifier=1).
4. "Expect to complete #1": Analysed the source of error that degrades model performance. [In "Results" below](#scrollTo=_Zdp4_H-kx8H).
5. "Expect to complete #3": Generated noisy data via human-crafted rules (random word deletion, word order permutation, etc.) [In "Discussion" below](#scrollTo=ugJXhZKNlUT4&line=20&uniqifier=1).



### Additional Deliverables
1. ...
2. ...

# Preliminaries

## What problem were you trying to solve or understand?


***Transcribing audio is a difficult task, and it becomes even harder when the audio is noisy or the speaker's first language is not English. Most approaches to transcription focus on being faithful to the original audio, but our aim is to use auto-correction to produce grammatically correct transcriptions.***

What are the real-world implications of this data and task?

***Automatic speech recognition (ASR) approaches are typically faithful to the original audio and work well when the audio is of high quality. However, audio is often degraded by physical noise such as static, by words cutting out over a poor connection, or by speakers with thick accents or limited English proficiency, since many users of technology don't have English as their first language. All of these factors can impact a listener's ability to understand the audio, and people who are deaf or hard of hearing are affected most, as the ASR system they rely on may not transcribe the audio correctly. In many situations, this inability to understand the audio could be seriously detrimental, such as a student who cannot understand what is being said in class and fails the class because of it. Just as auto-correct for typing has made it easier for people who are not professionally fluent in English to communicate without constantly looking up the correct spelling of words, our goal with BENoiT is to improve audio communication through auto-corrected, grammatically correct transcriptions that help all users fully understand the audio content.***

---


How is this problem similar to others we’ve seen in lectures, breakouts, and homeworks?

***We use deep learning to tackle this problem. We model it as a supervised classification task, where the input is a speech audio signal that is temporal in nature. For each time step, the model then classifies the audio as belonging to a particular word from the vocabulary.***

***In our model, we use a denoising autoencoder paradigm to denoise text, as covered in lectures. Additionally, we use seq2seq models (as covered with and without attention in lectures and homework) to transform text to text.***

---


What makes this problem unique?

***Most automatic speech recognition (ASR) approaches do not focus on producing grammatically correct output when the input is noisy. Instead, they use language models to improve their predictions of what the faithful audio was. In contrast, our approach explicitly transforms ASR outputs into grammatically correct sentences in cases where the faithful audio itself may contain grammatical errors (such as when the speaker does not speak perfect English).***


---


What ethical implications does this problem have?

***The major ethical concern of our model is the use of voices during training. When generating audio files for the training data, each voice was either an artificial middle-aged American male voice or an artificial middle-aged American female voice. This introduces bias, as we do not train on accented voices or voices with varied tones (such as the voices of children or the elderly), and our model may not predict as well on them. In this case, our model could discriminate against people whose voices we did not train on, since they would not be understood as well. Additionally, though we used artificial voices for this training, other voices could be used. If real people's voices were used, we would have to consider consent and whether we are ethically allowed to use the voices of people who cannot consent, such as the dead. Overall, there are also ethical concerns if our model is inaccurate. If it mis-transcribes the audio, it could cause misunderstandings and disproportionately affect those who need transcriptions more (as in the previous example, if a deaf person relies on our model to learn their recorded class material and our model is inaccurate, their class grade and mental wellbeing could be harmed).***


---



## Dataset(s)

Describe the dataset(s) you used.

***We used Reuters-21578, Distribution 1.0, a dataset of American-English news articles spread over 21 files. The raw files also contain noise such as HTML tags, other unwanted formatting, and non-sentence text.***


---
How were they collected?

***We wrote a script to clean those files and keep only the article content, producing 21 cleaned files, each with several thousand lines (one sentence per line). We then back-translated those sentences (English ➡ Telugu ➡ English) to generate grammatical noise within the text. These 21 text files were then put through a text-to-speech step that used the Python package pyttsx3. For each sentence, a Windows American male or female voice was selected at random, with the default speaking rate of 200 words per minute (slightly fast for English speakers but within the normal range). Overall, 49.95% of the audio files used the male voice and 50.05% used the female voice. (Note: for future use we also wrote code that creates noise through random permutations and deletions in the text, which can be found in the GitHub repository.)***
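
The round trip and voice selection described above can be sketched roughly as follows. This is a minimal sketch, not the project's script: `back_translate`, `pick_voice_index`, and `synthesize` are hypothetical names, the `translate` callable stands in for whatever wrapper is placed around googletrans, and picking uniformly among the installed voices is an assumption about how the random choice was made.

```python
import random

VOICE_RATE_WPM = 200  # speaking rate stated in the write-up


def back_translate(sentence, translate, pivot="te"):
    """English -> pivot -> English round trip.

    `translate(text, src, dest)` is a hypothetical callable returning a string;
    in the project this role was played by googletrans.
    """
    pivot_text = translate(sentence, "en", pivot)
    return translate(pivot_text, pivot, "en")


def pick_voice_index(num_voices, rng):
    """Pick one of the installed voices uniformly at random (assumption)."""
    return rng.randrange(num_voices)


def synthesize(sentence, out_path, rng):
    """Write one sentence to an audio file with a randomly chosen voice."""
    import pyttsx3  # imported lazily so the helpers above work without it

    engine = pyttsx3.init()
    voices = engine.getProperty("voices")
    index = pick_voice_index(len(voices), rng)
    engine.setProperty("voice", voices[index].id)
    engine.setProperty("rate", VOICE_RATE_WPM)
    engine.save_to_file(sentence, out_path)
    engine.runAndWait()
    return index
```

Passing a seeded `random.Random` makes the voice assignment reproducible, which helps when checking the male/female balance afterwards.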


---
Why did you choose them?

***We selected this dataset because it is in our target language, English, and includes a large number of samples (95,368 sentences over the 21 files). These files also include more obscure or technical words (such as 'treasury', 'manufacturing', 'joint venture', and 'Beryllium') that we wanted our model to train on, rather than only common and easily transcribed words.
For the text backtranslation, Telugu was selected because it has complex grammar rules that automatic translators such as the googletrans package we used often do not translate correctly, thereby introducing poor-translation grammatical noise.
We used both male and female voices to try to reduce bias by increasing the variety of voice types, though not all bias was removed, as this is still a limited voice sample.***

---
How many examples in each?

***As stated above, we had 21 files overall. Breaking these files into sentences and performing the TTS step gave us 95,368 audio examples. Each sentence, on average, had # words. Each word then became a vector of size 16.
These examples were then split into training, validation, and test sets as follows: 4,974 training files, 4,689 validation files, and 4,751 test files.
Within these, the average number of words ended up being # in the training set, # in the validation set, and # in the test set. These plots can be seen below.***


---






In [None]:
# Load your data and print 2-3 examples

## Pre-processing

What features did you use or choose not to use? Why?


---
If you have categorical labels, were your datasets class-balanced?


---
How did you deal with missing data? What about outliers?


---
What approach(es) did you use to pre-process your data? Why?

***For the first step of our model, we did not perform any pre-processing, since we used a pre-trained ASR model that takes in raw audio. For the second step, which fixes the grammar, we used T5Transform pre-processing from torchtext, since the T5 model does not work with raw text. Our pre-processor converts the words into numbers (byte-pair encoding with Google's SentencePiece model, which handles the text tokenization), truncates the sequences to a length of 256 (since some sentences may be extremely long and hard to learn from), and pads the sequences with end-of-sequence and padding tokens into a batch tensor. Since this pre-processing happens between the two steps of our model and leaves the data distribution unchanged in aspects such as sentence length, no additional distribution plot is given.***
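
The truncate, append end-of-sequence, then pad steps can be illustrated on plain lists of token ids. This is a simplified sketch rather than the torchtext implementation: `pad_batch` is a hypothetical helper, the ids `eos_idx=1` and `padding_idx=0` are assumed defaults, and whether the end-of-sequence token counts toward the 256 limit is also an assumption.

```python
def pad_batch(token_ids, max_seq_len=256, eos_idx=1, padding_idx=0):
    """Truncate each sequence, append EOS, and pad to the longest in the batch."""
    processed = []
    for ids in token_ids:
        # Keep at most max_seq_len tokens, then mark the end of the sequence.
        processed.append(list(ids[:max_seq_len]) + [eos_idx])
    width = max(len(ids) for ids in processed)
    # Right-pad shorter sequences so the batch is rectangular.
    return [ids + [padding_idx] * (width - len(ids)) for ids in processed]


batch = pad_batch([[5, 6, 7], [8]], max_seq_len=2)
print(batch)  # [[5, 6, 1], [8, 1, 0]]
```

The first sequence loses its third token to truncation, and the second is padded so that both rows have the same length.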

---
Are your features continuous or categorical? How do you treat these features differently?


---

In [None]:
# For those same examples above, what do they look like after being pre-processed?

In [None]:
# Visualize the distribution of your data before and after pre-processing.
#   You may borrow from how we visualized data in the Lab homeworks.

# Models and Evaluation

## Experimental Setup

How did you evaluate your methods? Why is that a reasonable evaluation metric for the task?

***We selected WordErrorRate (WER) as our metric, tracking the best value from an initial best_wer = 1e4, since WER is the evaluation metric used in most existing ASR systems. WER is reasonable for our task because we generally know the target text we are comparing our prediction to (the audio gives us a target text) rather than generating a wholly new text sample. This makes WER better suited to our task than the other frequently used metric, BLEU, which assumes there may be many valid target texts and compares the prediction to all of them. However, to be thorough, and since our predicted transcript may have a couple of possible correct results (sometimes two different grammar fixes are both correct), we have included both WER and BLEU, the metrics common to these types of tasks.***
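
For reference, WER is the word-level edit distance between the prediction and the target, divided by the number of target words. The standalone sketch below shows that definition; `word_error_rate` is a hypothetical helper written for illustration, not the library metric used in the project.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edits needed to turn the first i-1 reference words
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("i go to work", "i go work"))  # 0.25
```

Because insertions count as errors, WER can exceed 1 when the prediction is much longer than the target.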


---


What did you use for your loss function to train your models? Did you try multiple loss functions? Why or why not?

***We used cross entropy loss as our loss function and Adam optimizer. The code is available in the notebook linked.***
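
As a reminder of what this loss measures, the sketch below computes token-level cross entropy from raw scores in plain Python. It is an illustration of the formula only; `cross_entropy` and `mean_sequence_loss` are hypothetical names, and the project itself relied on the framework's built-in loss.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]


def mean_sequence_loss(step_logits, targets):
    """Average the per-step loss over one output sequence."""
    losses = [cross_entropy(l, t) for l, t in zip(step_logits, targets)]
    return sum(losses) / len(losses)
```

With uniform scores over a vocabulary of size V the loss is log(V), which is a useful sanity check for an untrained model.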


---


How did you split your data into train and test sets? Why?

***We assigned different files out of the 21 to the train, validation, and test sets, so the split is made at the file level.***
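
A file-level split of this kind might look like the sketch below. The file names, the number of files per split, and the seed are all hypothetical; the write-up does not say which files went where.

```python
import random


def split_files(files, n_val, n_test, seed=0):
    """Shuffle the file list once and carve out disjoint val/test/train sets."""
    shuffled = sorted(files)  # sort first so the result depends only on the seed
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Hypothetical names for the 21 cleaned files.
files = [f"part_{i:02d}.txt" for i in range(21)]
train, val, test = split_files(files, n_val=1, n_test=1)
```

Splitting by file keeps every sentence of an article in a single set, so no article contributes to both training and evaluation.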


---



In [None]:
# Code for loss functions, evaluation metrics or link to Git repo
# We used cross entropy loss as our loss function and Adam optimizer. The code is available in the notebook linked. 


## Baselines 

What baselines did you compare against? Why are these reasonable?


***Our baseline was the pre-trained Wav2Vec2 ASR model from torchaudio.
We then add our denoising paradigm on top of it. The ASR model doesn't need any pre-processing of its input.*** 


---



Did you look at related work to contextualize how others methods or baselines have performed on this dataset/task? If so, how did those methods do?


***Yes, we looked at related work on ASR models and their baselines.***


---







## Methods

What methods did you choose? Why did you choose them?

***As mentioned, our method has two stages. The first stage aims to produce text that is faithful to the audio, without any corrections. For this, we used the pre-trained Wav2Vec 2.0 ASR model from torchaudio. This model was selected because it is readily available, widely used in literature related to our task, and covered by many tutorials, which made it easier to learn and implement. Using a pre-trained model such as this one, which was trained on a greater variety of audio samples, allows our own model to generalize better with limited data. Additionally, it works directly with raw audio, eliminating the need for non-trivial audio pre-processing that could add computational complexity and time.***
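
Wav2Vec 2.0 ASR models emit one label per audio frame, and a decoding step turns those frames into text. The write-up does not say which decoder was used; the sketch below assumes greedy CTC decoding (collapse repeated labels, then drop blanks), with a toy label set in which index 0 is the blank and `|` separates words. `greedy_ctc_decode` is a hypothetical helper.

```python
def greedy_ctc_decode(frame_label_ids, labels, blank=0, word_sep="|"):
    """Collapse repeated frame labels, remove blanks, and join into text."""
    collapsed = []
    previous = None
    for idx in frame_label_ids:
        if idx != previous:  # keep only the first frame of each run
            collapsed.append(idx)
        previous = idx
    chars = [labels[i] for i in collapsed if i != blank]
    return "".join(chars).replace(word_sep, " ").strip()


toy_labels = ["-", "|", "I", "G", "O"]  # toy label set, assumed for illustration
frames = [2, 2, 0, 1, 3, 3, 4, 4, 0]     # best label index for each frame
print(greedy_ctc_decode(frames, toy_labels))  # I GO
```

A blank between two identical labels is what lets the decoder emit a doubled letter, since repeats without a blank are merged.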

***Our second stage removes the noise and fixes the grammar. The pipeline is: Denoising ➡ GRU encoder-decoder Seq2Seq model, with and without attention. We chose this because it is standard practice in literature related to our task. A GRU was used because our data consists of long sequences (sentences up to length 256 after pre-processing), and GRUs capture long-range dependencies better than plain RNNs and are less affected by the vanishing gradient problem that arises with long sequences. Additionally, GRUs are simpler to implement than LSTMs. Using the Seq2Seq GRU allows us to output text from text as desired, and to be thorough, we ran it with and without attention. Attention can help the model find which parts of the sequence matter most for the transcript, since not all words in a sentence are equally important for understandability. For example, 'I go work' is still understandable without 'to', whereas 'I go to' is not, because the word 'work' is necessary.***
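
To show what attention adds on top of a plain encoder-decoder, the sketch below computes attention weights and a context vector for one decoder step. The write-up does not state which scoring function was used; dot-product scoring is assumed here, and `attention_context` is a hypothetical helper operating on plain lists instead of tensors.

```python
import math


def attention_context(decoder_state, encoder_states):
    """Return softmax attention weights and the weighted sum of encoder states."""
    # Score each encoder state by its dot product with the decoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    dim = len(encoder_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states)) for k in range(dim)]
    return weights, context


weights, context = attention_context([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]])
```

Here the first encoder state aligns with the decoder state and receives the larger weight, so the context vector leans toward it. Without attention, the decoder only sees the final encoder state.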


---


How did you train these methods, and how did you evaluate them? Why?


***Only the denoising model is trained: 100 epochs with the Adam optimizer, evaluated on Word Error Rate. We chose Word Error Rate because it is the standard metric used in ASR tasks.***


---


Which methods were easy/difficult to implement and train? Why?

***Both denoising models (with and without attention) were hard to implement and train.***

***Challenges:***

***1. Adapting the model to our task***


***1.1 Tensor shapes are a challenge in sequence tasks***

***1.2 Batch-first layout is intuitive; sequence-first is not***

***1.3 The temporal aspect forces the code to run step by step (sequentially)***


***2. The models learn the degenerate solution of predicting only `<eos>` at every step***

***2.1 The sequences are padded with `<eos>` during pre-processing***

***2.2 With random batch sampling, the majority of each sequence is just `<eos>`***
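
Challenge 2 can be made concrete by counting how much of a padded batch is end-of-sequence tokens. The sketch below assumes every sequence is padded out to a fixed length with the EOS id; the sentence lengths, the padded length, and the ids are hypothetical, and `eos_fraction` is a helper written for illustration.

```python
def eos_fraction(batch, eos_idx=1):
    """Fraction of all target positions in the batch that hold the EOS id."""
    total = sum(len(seq) for seq in batch)
    eos_count = sum(tok == eos_idx for seq in batch for tok in seq)
    return eos_count / total


PADDED_LEN = 256  # assumed fixed length after padding
# Two hypothetical sentences of 10 and 20 real tokens (id 5 stands in for any word).
batch = [
    [5] * 10 + [1] * (PADDED_LEN - 10),
    [5] * 20 + [1] * (PADDED_LEN - 20),
]
print(eos_fraction(batch))  # 0.94140625
```

Under these assumptions a model that always predicts EOS is already right on about 94% of positions, which is why the loss can look low while the output is useless.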



---



For each method, what hyperparameters did you evaluate? How sensitive was your model's performance to different hyperparameter settings?

***Denoising model: batch size 2, initial learning rate 0.0001, 100 epochs.***


---







In [None]:
# Code for training models, or link to your Git repository

In [None]:
# Show plots of how these models performed during training.
#  For example, plot train loss and train accuracy (or other evaluation metric) on the y-axis,
#  with number of iterations or number of examples on the x-axis.

## Results

Show tables comparing your methods to the baselines.

What about these results surprised you? Why?

Did your models over- or under-fit? How can you tell? What did you do to address these issues?

What does the evaluation of your trained models tell you about your data? How do you expect these models might behave differently on different data?  

In [None]:
# Show plots or visualizations of your evaluation metric(s) on the train and test sets.
#   What do these plots show about over- or under-fitting?
#   You may borrow from how we visualized results in the Lab homeworks.
#   Are there aspects of your results that are difficult to visualize? Why?

# Discussion

## What you've learned

What concepts from lecture/breakout were most relevant to your project? How so?


---
What aspects of your project did you find most surprising?

***We found it surprising how much is involved in coding a full model like this. To begin with, knowing in general how you want to approach the problem and which techniques you will use doesn't translate directly into an implementation, as a lot goes into making sure the details of each method are correct. Especially with a deep learning method like ours, the model is very hard to train, and it is incredibly easy to misspecify some part or make a small error that greatly affects the final outcome. For example, our choice to use a method (T5Transform) that required padding of the data could affect our end result, since it kept adding end-of-sequence tokens that produce a misleadingly low loss. An additional factor we had to consider at every step was the ethical implications, and it was surprising how a small aspect like which voices we used can have such a big impact on how useful or harmful this model is. Along with this, we were also surprised by our model's high hyperparameter sensitivity, since most models we have worked with in class have not been as sensitive.***


---
What lessons did you take from this project that you want to remember for the next ML project you work on? Do you think those lessons would transfer to other datasets and/or models? Why or why not?

***A few of the lessons we learned generalize well to other models. Specifically, we learned that a project as complex as a deep learning one should be started a long time in advance, as many decisions beyond the general methods must be made (such as how to pre-process the data and which dataset to use), and things often end up not working, so you have to be prepared to keep trying various methods (machine learning is somewhat of a "keep trying until you find what works" area). Similarly, it would have been useful to begin this project by fine-tuning the available pre-trained models just to show that our idea was possible, so that we wouldn't get stuck with an idea that may not end up working very well. More specifically to this project, we also learned about the importance of the dataset and of making sure it has the qualities you need. For example, we needed a dataset with enough noise to properly reflect real-world noise levels and real-world noise types (such as static, dropped words, and grammar mistakes), and it would be useful to have a more diverse dataset to prevent bias (even though it would make the model more complicated to train and would likely require a larger dataset).***

---
What was the most helpful feedback you received during your presentation? Why?

***Unfortunately, because we went last and other groups ran over the time limit, our group was cut off from presenting fully, and it seems that this limited the other groups' ability to provide us with feedback (since they missed some of our explanations). However, Xuan's feedback was helpful, as was the general feedback we got from other groups about needing to explain our choices more thoroughly and use more visualization. We implemented the feedback of looking at metrics other than WER and loss, since one metric often does not give the full picture or can be misleading (as seen with how accuracy can make a model look great when classes are imbalanced). Additionally, Xuan's feedback about looking at each step and reporting on it was very helpful, and given more time this process could help us understand the issues with our model better. It was also good to note that we needed more charts and visualizations of our data and results at each step, since pictures are typically easier to understand and more concise.***

---
If you had two more weeks to work on this project, what would you do next? Why?

***As mentioned, the two big issues with our project are that our model has failure modes that make it inaccurate and that it has some issues with bias. Firstly, it would be useful to have the time to identify exactly where these errors come from, to test some additional methods (like an LSTM instead of a GRU), and to use a pre-trained model that does not require padding (in order to prevent the issue with the end-of-sequence tokens mentioned earlier) to see if they improve the accuracy of our model. Additionally, we would take the time to work more with our data before training. Specifically, we could incorporate our noise method that performs random permutations and deletions on the text before TTS (the code for which is in the GitHub repository) and allow for more voices in the audio to reduce bias. In general, we would use the extra time to take a more detailed look at each step of our process, examining exactly what happens and creating more charts to help us gain a deeper understanding.***
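
Rule-based noise of the kind mentioned above (random deletions and word-order permutations) could be sketched as follows. This is an independent illustration, not the code in the repository: `add_rule_noise`, the two probabilities, and the choice to swap only adjacent words are assumptions.

```python
import random


def add_rule_noise(sentence, rng, p_delete=0.1, p_swap=0.1):
    """Randomly drop words, then randomly swap neighbouring words."""
    words = sentence.split()
    kept = [w for w in words if rng.random() >= p_delete]
    if not kept and words:
        kept = [rng.choice(words)]  # never return an empty sentence
    for i in range(len(kept) - 1):
        if rng.random() < p_swap:
            kept[i], kept[i + 1] = kept[i + 1], kept[i]
    return " ".join(kept)


rng = random.Random(0)  # seeded so the corruption is reproducible
noisy = add_rule_noise("the treasury announced a joint venture", rng)
```

Applying this to the clean text before TTS would give (noisy audio, clean sentence) pairs whose noise level is controlled by the two probabilities.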

---