# Automatic Speech Recognition (ASR)

## Challenges in ASR

1. Background noise
2. Variability of the speaker (pitch, volume)
3. Same word, different speeds
4. Word boundaries
5. Spoken language vs written language


## Pipeline for ASR

<img src="assets/asr-pipeline.png" />

So far, MFCCs let us extract features from speech, which addresses challenges 1 and 2. These features can be turned into a phonetic representation (phonemes) using an acoustic model. Phonemes can then be translated into words through lexical decoding, using a lexicon. However, some systems map the acoustic model output directly to words. This is a **design choice**, and it depends on the dimensionality of the problem.


## The acoustic model and trouble with time: same word, different speeds --> HMMs

The problem we address here is that the same word is usually pronounced with different durations. For instance, besides the different pronunciations of the word "Hello", saying it does not always take the same amount of time. Furthermore, some words sound the same, such as HEAR and HERE, and are really hard to distinguish from the acoustic model alone, without any information about the context.

Hidden Markov Models (HMMs) are especially well suited to the problem of variability in length, since they are good at finding patterns through time. The training set could consist of labelled words, phonemes, sounds or groups of words, in order to determine the likelihood of a single word or phoneme. However, training becomes much more complicated when the dataset consists of utterances (phrases or whole sentences). How can this series of data be separated for training? Since a particular word may be connected to the previous and the next words, HMM nodes are tied together as pairs in order to train on continuous utterances, which increases the dimensionality. When training on words, the number of combinations is unmanageable. Training on phonemes is a more tractable problem: since there are only about 40 phonemes in English, only 40 × 40 = 1600 pair combinations are possible. Once the model is trained, it can be used to score new utterances.
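As a rough illustration of how a trained HMM can score observation sequences of different lengths, here is a minimal sketch of the forward algorithm for a discrete-observation HMM. The two states, the symbols `"a"` and `"b"`, and every probability are made-up assumptions for illustration, not values from a real acoustic model, and the function name `forward_likelihood` is hypothetical.

```python
def forward_likelihood(observations, start_prob, trans_prob, emit_prob):
    """Return P(observations | model) for a non-empty observation sequence."""
    states = list(start_prob)
    # alpha[s] = probability of the observations seen so far, ending in state s
    alpha = {
        s: start_prob[s] * emit_prob[s].get(observations[0], 0.0) for s in states
    }
    for obs in observations[1:]:
        alpha = {
            s: emit_prob[s].get(obs, 0.0)
            * sum(alpha[prev] * trans_prob[prev].get(s, 0.0) for prev in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical 2-state left-to-right model for a toy "word".
# The self-loops (s1 -> s1, s2 -> s2) let a state absorb several frames,
# so fast and slow pronunciations are scored by the same model.
start_prob = {"s1": 1.0, "s2": 0.0}
trans_prob = {"s1": {"s1": 0.6, "s2": 0.4}, "s2": {"s2": 1.0}}
emit_prob = {"s1": {"a": 0.9, "b": 0.1}, "s2": {"a": 0.2, "b": 0.8}}

fast = forward_likelihood(["a", "b"], start_prob, trans_prob, emit_prob)
slow = forward_likelihood(["a", "a", "a", "b", "b"], start_prob, trans_prob, emit_prob)
print(fast, slow)
```

Both sequences receive a non-zero score even though one is more than twice as long as the other, which is the property that makes HMMs attractive for speech of varying speed. A real acoustic model would use continuous feature vectors (such as MFCCs) rather than discrete symbols.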

## Adding knowledge: Language model

Which combinations of words are more reasonable? Even if we can get words from phonemes, we still can't resolve ambiguities in spelling and context, since we haven't taught the model which combinations are more likely, or at least provided knowledge about the context for the model to learn from. The language model does exactly this job: it injects knowledge.

Each word in the output can be thought of as a probability distribution over many different words. Each candidate sequence can be scored by the likelihood that the audio signal was produced by that particular sequence of words:

$$P(\text{signal} \mid w_1, w_2)$$

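A statistical language model supplies the other half of the score: a probability distribution over sequences of words, $P(w_1, w_2, \ldots)$. Combined with the acoustic likelihood, the recognizer picks the most probable sequence: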

$$\text{word}_1, \text{word}_2, \ldots = \arg\max_{w_1, w_2, \ldots} P(\text{signal} \mid w_1, w_2, \ldots) \, P(w_1, w_2, \ldots)$$
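To make the argmax concrete, here is a small sketch that combines acoustic and language model scores for two candidate transcriptions. All numbers are invented for illustration, and the names `acoustic`, `language` and `decode` are hypothetical. A real recognizer searches over a very large hypothesis space instead of a two-entry dictionary.

```python
import math

# Hypothetical acoustic scores P(signal | words): the two candidates
# sound alike, so the acoustic model barely separates them.
acoustic = {
    ("i", "hear", "you"): 0.30,
    ("i", "here", "you"): 0.32,
}

# Hypothetical language model scores P(words).
language = {
    ("i", "hear", "you"): 1e-4,
    ("i", "here", "you"): 1e-7,
}


def decode(candidates):
    """Pick the candidate maximizing P(signal | words) * P(words)."""
    # Summing logs is equivalent to multiplying probabilities and avoids
    # underflow when many small values are combined.
    return max(
        candidates,
        key=lambda words: math.log(acoustic[words]) + math.log(language[words]),
    )


best = decode(acoustic.keys())
print(best)
```

With these assumed numbers the acoustic model alone slightly prefers "here", but the language model shifts the decision to "hear".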

Even so, the dimensionality of a full statistical model is extremely large, so some heuristics or approximations can be employed here: **It turns out that in practice, the words we speak at any time are primarily dependent upon the three to four previous words.**

### N-grams

N-grams are probabilities of single words such as "I" (unigrams), ordered pairs such as "I love" (bigrams), triples such as "I love Science" (trigrams), etc. With n-grams we can approximate the sequence probability using the chain rule.

$$ P(\text{"I"}, \text{"love"}, \text{"Science"}) = P(\text{"I"}) \cdot P(\text{"love"} \mid \text{"I"}) \cdot P(\text{"Science"} \mid \text{"I"}, \text{"love"}) $$
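The chain rule itself is exact. The approximation comes from truncating the history each word is conditioned on. As a sketch, assuming a bigram model (each word depends only on the previous one), a sequence of $n$ words would be scored as:

$$ P(w_1, \ldots, w_n) \approx P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1}) $$

so the example above becomes:

$$ P(\text{"I"}, \text{"love"}, \text{"Science"}) \approx P(\text{"I"}) \cdot P(\text{"love"} \mid \text{"I"}) \cdot P(\text{"Science"} \mid \text{"love"}) $$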

Then we can combine these probabilities with the probabilities coming from the acoustic model to remove language ambiguities from the sequence options and **provide a better estimate of the text given an utterance**.

#### Quiz

In the following series of quizzes, you will work with 2-grams, or bigrams, as they are more commonly called. The objective is to create a function that calculates the probability that a particular sentence could occur in a corpus of text, based on the probabilities of its component bigrams. We'll do this in stages though:

* Quiz 1 - Extract tokens and bigrams from a sentence
* Quiz 2 - Calculate probabilities for bigrams
* Quiz 3 - Calculate the log probability of a given sentence based on a corpus of text using bigrams
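The three stages could be sketched roughly as follows. This is one possible approach under stated assumptions, not the quizzes' reference solution: the function names, the `<s>` and `</s>` boundary markers, whitespace tokenization, and the `floor` value used for unseen bigrams are all choices made here for illustration, and the quizzes may expect different conventions.

```python
import math
from collections import Counter


def tokens_and_bigrams(sentence):
    """Stage 1: lowercase, split on whitespace, pad with boundary markers."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return tokens, list(zip(tokens, tokens[1:]))


def bigram_probabilities(corpus):
    """Stage 2: maximum-likelihood estimate P(w2 | w1) = count(w1, w2) / count(w1)."""
    unigram_counts, bigram_counts = Counter(), Counter()
    for sentence in corpus:
        tokens, bigrams = tokens_and_bigrams(sentence)
        unigram_counts.update(tokens[:-1])  # every token that starts a bigram
        bigram_counts.update(bigrams)
    return {bg: n / unigram_counts[bg[0]] for bg, n in bigram_counts.items()}


def sentence_log_probability(sentence, probs, floor=1e-8):
    """Stage 3: sum log P(w2 | w1); unseen bigrams fall back to a small floor."""
    _, bigrams = tokens_and_bigrams(sentence)
    return sum(math.log(probs.get(bg, floor)) for bg in bigrams)


# Tiny made-up corpus, for illustration only.
corpus = ["I love science", "I love music", "you love science"]
probs = bigram_probabilities(corpus)
seen = sentence_log_probability("I love science", probs)
unseen = sentence_log_probability("science love I", probs)
print(seen, unseen)
```

Log probabilities are summed instead of multiplying raw probabilities so that long sentences do not underflow to zero. The fixed floor is a crude stand-in for proper smoothing.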