# **Instructions**

This document is a template, and you are not required to follow it exactly. However, the kinds of questions we ask here are the kinds of questions we want you to focus on. While you might have answered similar questions to these in your project presentations, we want you to go into a lot more detail in this write-up; you can refer to the Lab homeworks for ideas on how to present your data or results. 

You don't have to answer every question in this template, but you should answer roughly this many questions. Your answers to such questions should be paragraph-length, not just a bullet point. You likely still have questions of your own -- that's okay! We want you to convey what you've learned, how you've learned it, and demonstrate that the content from the course has influenced how you've thought about this project.

# BENoiT: Better English Noisy Audio Transcriptions
 Track: Applications | Sub-track: Text + Audio 
Project mentor: Xuan Zhang

Vijay Murari Tiyyala <vtiyyal1@jh.edu>, Pulkit Madaan <pmadaan2@jh.edu>, Ehong sun <esun7@jh.edu>, Carolyne Holmes <cholme26@jh.edu>

https://github.com/madaanpulkit/benoit

# Outline and Deliverables

List the deliverables from your project proposal. For each uncompleted deliverable, please include a sentence or two on why you weren't able to complete it (e.g. "decided to use an existing implementation instead" or "ran out of time"). For each completed deliverable, indicate which section of this notebook covers what you did.

If you spent substantial time on any aspects that weren't deliverables in your proposal, please list those under "Additional Work" and indicate where in the notebook you discuss them.

### Uncompleted Deliverables
1. "Would like to accomplish": Not completed, because the model needs to be corrected first.
2. "Expect to complete #2": Because the model did not learn anything meaningful, we have not yet reached the stage where we can identify failure modes.



### Completed Deliverables
1. "Must complete #1": A corpus of synthetically generated noisy speech audio with grammatically correct English sentences. [in "Dataset" below](#scrollTo=zFq-_D0khnhh&line=10&uniqifier=1).
2. "Must complete #2": Working implementation of the Denoising Model. [in "Baselines" below](#scrollTo=oMyqHUa0jUw7&line=5&uniqifier=1).
3. "Must complete #3": We have a function that takes raw audio and returns faithful and corrected text.
4. "Expect to complete #1": Analyzed the sources of error that degrade model performance.
5. "Expect to complete #3": Generated noisy data via human-crafted rules (random word deletion, word order permutation, etc.).
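
The rule-based noising named in the last deliverable can be illustrated with a minimal sketch. The function names, the deletion probability `p_delete`, and the local shuffle window `max_shift` are assumptions made for illustration; they are not taken from the project code.

```python
import random


def delete_random_words(words, p_delete, rng):
    """Drop each word independently with probability p_delete, keeping at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p_delete]
    return kept if kept else [rng.choice(words)]


def permute_local_order(words, max_shift, rng):
    """Shuffle word order locally by sorting on a jittered position key."""
    keys = [i + rng.uniform(0, max_shift) for i in range(len(words))]
    ordered = sorted(zip(keys, words), key=lambda pair: pair[0])
    return [w for _, w in ordered]


def add_rule_based_noise(sentence, p_delete=0.1, max_shift=2.0, seed=None):
    """Apply random word deletion followed by a local word-order permutation."""
    rng = random.Random(seed)
    words = sentence.split()
    if not words:
        return sentence
    words = delete_random_words(words, p_delete, rng)
    words = permute_local_order(words, max_shift, rng)
    return " ".join(words)


if __name__ == "__main__":
    print(add_rule_based_noise("the company said profits rose last quarter", seed=0))
```

Jittering positions rather than fully shuffling keeps the noisy sentence close to the original, which is one plausible way to keep the correction task learnable.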



### Additional Deliverables
1. We decided to add a second baseline using the published model from this paper. We discuss this [in "Baselines" below](#scrollTo=oMyqHUa0jUw7&line=5&uniqifier=1).
2. ...

# Preliminaries

## What problem were you trying to solve or understand?


***Transcribing audio is a difficult task, and it becomes even harder when the audio is noisy or the speaker's first language is not English. Most approaches to transcription focus on being faithful to the original audio, but our aim is to use auto-correction to produce grammatically correct transcriptions.***

What are the real-world implications of this data and task?

***Automatic speech recognition (ASR) approaches are typically faithful to the original audio and work well when the content is of high quality. However, for many users of technology, English is not their first language. Just as auto-correct functions for typing have made it easier for people who are not professionally fluent in English to communicate without constantly looking up the correct spelling of words, our goal with BENoiT is to improve audio communication through the use of auto-corrected, grammatically correct transcriptions.***

---


How is this problem similar to others we’ve seen in lectures, breakouts, and homeworks?

***We use deep learning to tackle this problem. We model it as a supervised classification task, where the input is a speech audio signal that is temporal in nature. For each time step, the model then classifies the audio as belonging to a particular word from the vocabulary.***

***The lectures introduced the denoising autoencoder paradigm as a way to denoise text and model this problem. The lectures and homework also covered the use of seq2seq models with and without attention for transforming text to text.***

---


What makes this problem unique?

***Most automatic speech recognition (ASR) approaches do not focus on producing grammatically correct output. Instead, they use language models to improve their predictions. In contrast, our approach explicitly transforms ASR outputs into grammatically correct sentences.***


---


What ethical implications does this problem have?

***The use of artificial voice samples raises ethical concerns, and the current data only includes recordings from two native American-English speakers (one male and one female), which may bias the model towards these two particular speakers. This can result in biases creeping into the automatic speech recognition (ASR) model.***


---



## Dataset(s)

Describe the dataset(s) you used.

***We used the Reuters-21578, Distribution 1.0 dataset, which contains news articles spread across 21 files. The raw files also contain noise such as HTML tags and other non-article content.***


---


How were they collected?


***We wrote a script to clean those files and extract only the article content, free of noise, into 21 separate files of English sentences. We then back-translated those sentences by translating them from English ➡ Telugu ➡ English. For each back-translated sentence, we used a TTS model to generate an audio sample with a randomly selected Windows American male or female voice. Overall, 49.95% of the files were male and 50.05% were female.***
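
As a rough illustration of the cleaning step, the sketch below strips markup tags, decodes HTML entities, and splits the remaining text into sentences. The regular expressions and the function names `clean_article` and `split_sentences` are assumptions for this example; the actual cleaning script may differ.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")                   # any markup tag such as <BODY> or </TITLE>
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")    # whitespace that follows ., ! or ?


def clean_article(raw):
    """Remove markup tags, decode HTML entities, and collapse whitespace."""
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in SENTENCE_END_RE.split(text) if s]


if __name__ == "__main__":
    raw = "<BODY>AT&amp;T shares rose today.\n  Analysts were surprised.</BODY>"
    for sentence in split_sentences(clean_article(raw)):
        print(sentence)
```

A splitter this simple will break on abbreviations such as "U.S.", so a real pipeline would likely need a more careful sentence tokenizer.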


---


Why did you choose them?


---






How many examples in each?

***• Number of files: 21***

***• Total sentences/audios: 95,368***

***• Every word/piece becomes a vector of size 16***

***• Train files: [1] • Train sentences/audios: 4,974***

***• Val files: [11] • Val sentences/audios: 4,689***

***• Test files: [12] • Test sentences/audios: 4,571***


---






In [None]:
# Load your data and print 2-3 examples

## Pre-processing

What features did you use or choose not to use? Why?

If you have categorical labels, were your datasets class-balanced?

How did you deal with missing data? What about outliers?

What approach(es) did you use to pre-process your data? Why?

Are your features continuous or categorical? How do you treat these features differently?

In [None]:
# For those same examples above, what do they look like after being pre-processed?

In [None]:
# Visualize the distribution of your data before and after pre-processing.
#   You may borrow from how we visualized data in the Lab homeworks.

# Models and Evaluation

## Experimental Setup

How did you evaluate your methods? Why is that a reasonable evaluation metric for the task?

***We used word error rate (WER) as our metric. In the code, the best value seen so far is tracked in `best_wer`, which is initialized to 1e4.***
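
For reference, word error rate is the word-level edit distance (substitutions, deletions, and insertions) between the hypothesis and the reference, divided by the number of reference words. The sketch below is a plain-Python version written for illustration; it is not the implementation used in the project, and the function name is an assumption.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.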


---


What did you use for your loss function to train your models? Did you try multiple loss functions? Why or why not?

***We used cross-entropy loss as our loss function, together with the Adam optimizer. The code is available in the linked notebook.***
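
To make the loss concrete, here is a dependency-free sketch of token-level cross-entropy averaged over the steps of one output sequence. It assumes the decoder emits one vector of unnormalized scores (logits) per step. The function names are illustrative and this is not the project's training code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def sequence_cross_entropy(step_logits, target_ids):
    """Mean negative log-likelihood of the target token at each decoding step."""
    if len(step_logits) != len(target_ids):
        raise ValueError("need exactly one target id per decoding step")
    losses = [-log_softmax(logits)[t] for logits, t in zip(step_logits, target_ids)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Two decoding steps over a vocabulary of 4 tokens.
    logits = [[2.0, 0.1, 0.1, 0.1], [0.1, 0.1, 3.0, 0.1]]
    print(sequence_cross_entropy(logits, [0, 2]))
```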


---


How did you split your data into train and test sets? Why?

***We split the data at the file level: out of the 21 files, file 1 was used for training, file 11 for validation, and file 12 for testing.***


---



In [2]:
# Code for loss functions, evaluation metrics or link to Git repo
# We used cross entropy loss as our loss function and Adam optimizer. The code is available in the notebook linked. 


## Baselines 

What baselines did you compare against? Why are these reasonable?


***Our baseline was the pre-trained Wav2Vec2 ASR model from torchaudio. We then add our denoising model on top of it. The ASR model works directly on raw audio and needs no additional preprocessing.***
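
The pre-trained Wav2Vec2 ASR models in torchaudio emit one score vector per audio frame and are commonly decoded with greedy CTC decoding: take the best label per frame, collapse repeats, and drop blanks. The sketch below shows that decoding step in plain Python. The small label set in the usage example and the function name are hypothetical, and index 0 as the blank with `|` as the word separator is an assumption about the label layout.

```python
def greedy_ctc_decode(frame_scores, labels, blank=0, word_sep="|"):
    """Greedy CTC decoding: argmax per frame, collapse repeats, remove blanks."""
    # Index of the highest-scoring label in each frame.
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    # Collapse runs of the same label into a single occurrence.
    collapsed = [idx for i, idx in enumerate(best) if i == 0 or idx != best[i - 1]]
    # Drop blank symbols and map the remaining indices to characters.
    chars = [labels[idx] for idx in collapsed if idx != blank]
    return "".join(chars).replace(word_sep, " ").strip()


if __name__ == "__main__":
    labels = ["-", "|", "H", "I", "O"]      # hypothetical label set
    path = [2, 2, 0, 3, 1, 1, 4, 0]         # best label index per frame
    frames = [[1.0 if k == idx else 0.0 for k in range(len(labels))] for idx in path]
    print(greedy_ctc_decode(frames, labels))
```

The resulting transcript is what the denoising stage would receive as input text.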


---



Did you look at related work to contextualize how others methods or baselines have performed on this dataset/task? If so, how did those methods do?


***Yes, we looked at related work on ASR models and their baselines.***


---







## Methods

What methods did you choose? Why did you choose them?

***Our method has two stages. The first stage is ASR ➡ pre-trained Wav2Vec 2.0 from torchaudio. We chose it because it is readily available, widely used in the literature, and covered by many tutorials. It also works directly with raw audio, eliminating the need for non-trivial audio preprocessing.***

***Our second stage is Denoising ➡ a GRU encoder-decoder seq2seq model, with and without attention. We chose this because it is standard practice in the literature and GRUs are able to capture long-range dependencies.***
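
The attention variant lets the decoder compute a weighted average of the encoder states at every step. The sketch below shows one common form, dot-product attention, in plain Python. The write-up does not say which scoring function was used, so treating it as dot-product attention is an assumption, as are the function names.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_attention(decoder_state, encoder_states):
    """Return (context, weights) for one decoder step.

    decoder_state: list of floats with length d
    encoder_states: list of T lists, each with length d
    """
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(decoder_state)
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states)) for k in range(dim)]
    return context, weights


if __name__ == "__main__":
    context, weights = dot_product_attention([1.0, 0.0], [[10.0, 0.0], [0.0, 10.0]])
    print(weights, context)
```

In a GRU decoder, the context vector would typically be concatenated with the decoder input or hidden state before predicting the next token.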


---


How did you train these methods, and how did you evaluate them? Why?


***Only the denoising model is trained, for 100 epochs with the Adam optimizer, and it is evaluated on word error rate. We chose word error rate because it is the standard metric used in ASR tasks.***


---


Which methods were easy/difficult to implement and train? Why?

***Both denoising models, with and without attention, were hard to implement and train.***

***Challenges:***

***1. Adapting the models to our task was a challenge***


***1.1 Tensor shapes are a challenge in sequence tasks***

***1.2 A batch-first layout is intuitive, a sequence-first layout is not***

***1.3 The temporal aspect forces the code to process sequences step by step***


***2. The models learn the degenerate solution of predicting only <eos> at every step***

***2.1 Sequences are padded with <_eos_> during pre-processing***

***2.2 Random batch sampling leads to most of each padded sequence being just <_eos_>***
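
The second challenge can be made concrete with a small sketch that pads a batch of two sentences to a common length with the end-of-sequence token and measures how much of the target is padding. The token string, the helper names, and the example sentences are assumptions for illustration only.

```python
EOS = "<eos>"


def pad_batch(sentences, eos=EOS):
    """Tokenize on whitespace and pad every sequence with eos to the same length."""
    tokenized = [s.split() for s in sentences]
    max_len = max(len(t) for t in tokenized) + 1    # every sequence ends with at least one eos
    return [t + [eos] * (max_len - len(t)) for t in tokenized]


def eos_fraction(batch, eos=EOS):
    """Fraction of all target positions in the batch that hold the eos token."""
    total = sum(len(seq) for seq in batch)
    return sum(tok == eos for seq in batch for tok in seq) / total


if __name__ == "__main__":
    batch = pad_batch([
        "prices rose",
        "the company said its quarterly profit fell sharply after a weak year",
    ])
    print(f"eos fraction: {eos_fraction(batch):.2f}")
```

When a short sentence is batched with a long one, a large share of the target positions are padding, so a model that always predicts the end-of-sequence token can already achieve a low loss on those positions.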



---



For each method, what hyperparameters did you evaluate? How sensitive was your model's performance to different hyperparameter settings?

***For the denoising model: batch size = 2, initial learning rate = 0.0001, epochs = 100.***


---







In [None]:
# Code for training models, or link to your Git repository

In [None]:
# Show plots of how these models performed during training.
#  For example, plot train loss and train accuracy (or other evaluation metric) on the y-axis,
#  with number of iterations or number of examples on the x-axis.

## Results

Show tables comparing your methods to the baselines.

What about these results surprised you? Why?

Did your models over- or under-fit? How can you tell? What did you do to address these issues?

What does the evaluation of your trained models tell you about your data? How do you expect these models might behave differently on different data?  

In [None]:
# Show plots or visualizations of your evaluation metric(s) on the train and test sets.
#   What do these plots show about over- or under-fitting?
#   You may borrow from how we visualized results in the Lab homeworks.
#   Are there aspects of your results that are difficult to visualize? Why?

# Discussion

## What you've learned

*Note: you don't have to answer all of these, and you can answer other questions if you'd like. We just want you to demonstrate what you've learned from the project.*

What concepts from lecture/breakout were most relevant to your project? How so?

What aspects of your project did you find most surprising?

What lessons did you take from this project that you want to remember for the next ML project you work on? Do you think those lessons would transfer to other datasets and/or models? Why or why not?

What was the most helpful feedback you received during your presentation? Why?

If you had two more weeks to work on this project, what would you do next? Why?