# Automatic Speech Recognition (ASR)

## Challenges in ASR

1. Background noise
2. Variability of the speaker (pitch, volume)
3. Same word, different speeds
4. Word Boundaries
5. Spoken language vs written language 


## Pipeline for ASR

<img src="assets/asr-pipeline.png" />

So far, MFCCs let us extract features from speech and address challenges 1 and 2. These features can be turned into a phonetic representation (phonemes) using an acoustic model. Phonemes can then be translated into words using lexical decoding, that is, a lexicon. However, some systems map the acoustic model output directly to words. This is a **design choice**, and it depends on the dimensionality of the problem.


## The acoustic model and trouble with time: same word, different speeds --> HMMs

The problem we address here is that the same word is usually pronounced over different lengths of time. For instance, besides the different pronunciations of the word "Hello", saying it rarely takes the same amount of time twice. Further, some words sound the same and, from the acoustic model alone, without any information about the context, are really hard to distinguish, like HEAR and HERE.

Hidden Markov Models (HMMs) are especially good at handling variability in length, since they are well suited to finding patterns through time. The training set could consist of labelled words, phonemes, sounds or groups of words, in order to determine the likelihood of a single word or phoneme. However, training becomes much more complicated when the dataset consists of utterances (phrases or whole sentences). How could this series of data be separated for training? Since a particular word may be connected to the previous and the next words, to train on continuous utterances the HMM nodes are tied together as pairs, which leads to an increase in dimensionality. When training on words, the number of combinations is unmanageable. Training on phonemes is a more accessible problem: since there are only about 40 phonemes in English, only about 1600 (40 × 40) pairs are possible. Once the model is trained, it can be used to score new utterances.
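
To make the scoring step concrete, here is a minimal sketch of the forward algorithm on a toy left-to-right HMM. The states, the observation symbols, the function name `forward_score` and every probability below are made up for illustration (they are not trained values). The point is only that the self-loop transitions let one model assign a score to both a fast and a slow rendition of the same word.

```python
# Toy left-to-right HMM for one word; every number here is hypothetical.
states = ["h", "eh", "l", "ow"]

# Each state either repeats (self-loop) or advances to the next state.
# The self-loops are what absorb differences in speaking rate.
stay, advance = 0.6, 0.4

# Emission probabilities P(observed symbol | state) over a tiny alphabet.
emission = {
    "h":  {"H": 0.7, "E": 0.1, "L": 0.1, "O": 0.1},
    "eh": {"H": 0.1, "E": 0.7, "L": 0.1, "O": 0.1},
    "l":  {"H": 0.1, "E": 0.1, "L": 0.7, "O": 0.1},
    "ow": {"H": 0.1, "E": 0.1, "L": 0.1, "O": 0.7},
}


def forward_score(observations):
    """Probability of the observations over all paths that start in the
    first state and end in the last state (forward algorithm)."""
    n = len(states)
    # alpha[j] = probability of the observations so far, ending in state j.
    alpha = [0.0] * n
    alpha[0] = emission[states[0]][observations[0]]  # paths must start in state 0
    for symbol in observations[1:]:
        new_alpha = [0.0] * n
        for j in range(n):
            total = alpha[j] * stay                  # stayed in state j
            if j > 0:
                total += alpha[j - 1] * advance      # advanced from state j-1
            new_alpha[j] = total * emission[states[j]][symbol]
        alpha = new_alpha
    return alpha[-1]


fast = list("HELO")        # 4 frames
slow = list("HHEELLOO")    # 8 frames, same word spoken more slowly
reversed_fast = list("OLEH")

print(forward_score(fast), forward_score(slow), forward_score(reversed_fast))
```

Both the fast and the slow sequence receive a far higher score than their reversed counterparts, even though they differ in length. Scores of sequences with different lengths are not directly comparable, which is one reason real systems normalize or compare competing models on the same audio.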

## Adding knowledge: Language model

Which combinations of words are more reasonable? Even if we can get words from phonemes, we still can't resolve ambiguities in spelling and context, since we haven't taught the model which combinations are more likely, or at least provided knowledge about the context so that the model can learn by itself. The language model does just this job: it injects knowledge.

Each spoken word can be thought of as a probability distribution over many different candidate words. Each possible word sequence can be scored by the likelihood of the audio signal given that particular sequence:

$$P(signal   |   w_1,w_2)$$

A statistical language model supplies the other factor: it provides a probability distribution over sequences of words. The recognizer then picks the word sequence that maximizes the product of both terms:

$$word_1, word_2, \ldots = \arg\max_{w_1, w_2, \ldots} P(signal \mid w_1, w_2, \ldots) \cdot P(w_1, w_2, \ldots)$$
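
As a rough illustration of this argmax, the sketch below rescores two candidate transcriptions that sound the same. The candidate sentences, the function name `best_transcription` and all probabilities are invented for the example. A real decoder would obtain them from the acoustic model and the language model, and would usually work in log space, as done here, to avoid numerical underflow.

```python
import math

# Hypothetical candidates mapped to
# (acoustic likelihood P(signal | words), language model prior P(words)).
candidates = {
    "i hear you": (0.020, 0.0010),
    "i here you": (0.021, 0.00001),
}


def best_transcription(scored):
    """Pick the word sequence maximizing log P(signal | words) + log P(words)."""
    return max(
        scored,
        key=lambda words: math.log(scored[words][0]) + math.log(scored[words][1]),
    )


# The acoustic scores alone slightly prefer "i here you";
# the language model prior tips the decision the other way.
print(best_transcription(candidates))
```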

Even so, the dimensionality of a full statistical model is extremely large, so some heuristic or approximation has to be employed here: **it turns out that in practice, the words we speak at any time depend primarily on the previous three to four words.**

### N-grams

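N-grams are sequences of N consecutive words: single words such as "I" (unigrams), ordered pairs such as "I love" (bigrams), triples such as "I love Science" (trigrams), etc. Their probabilities are estimated from a corpus. With n-grams we can approximate the sequence probability using the chain rule.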

$$ P("I", "love", "Science") = P("I") \cdot P("love" \mid "I") \cdot P("Science" \mid "I", "love") $$

We can then combine these probabilities with the probabilities coming from the acoustic model to remove language ambiguities from the candidate sequences and **provide a better estimate of the text for a given utterance**.

#### Quiz: computing bigrams

In the following series of quizzes, you will work with 2-grams, or bigrams, as they are more commonly called. The objective is to create a function that calculates the probability that a particular sentence could occur in a corpus of text, based on the probabilities of its component bigrams. We'll do this in stages though:

* Quiz 1 - Extract tokens and bigrams from a sentence
* Quiz 2 - Calculate probabilities for bigrams
* Quiz 3 - Calculate the log probability of a given sentence based on a corpus of text using bigrams

##### Assumptions and terminology

* Utterance: 'I love language models'
* Tokens (word list from utterance + start tag + ending tag): `['<s>', 'i', 'love', 'language', 'models', '</s>']`
* Bigrams: The bigrams for this sentence are represented as a list of lower case ordered pairs of tokens:

  `bigrams = [('<s>', 'i'), ('i', 'love'), ('love', 'language'), ('language', 'models'), ('models', '</s>')]`

##### Quiz 1 Instructions

In the quiz below, write a function that returns a list of tokens and a list of bigrams for a given sentence. You will need to first break a sentence into words in a list, then add a `<s>` and a `</s>` token to the start and end of the list to represent the start and end of the sentence.

Your final lists should be in the format shown above and called out in the function doc string.

```python
test_sentences = [
    'the old man spoke to me',
    'me to spoke man old the',
    'old man me old man me',
]


def get_token(sentence):
    # Lower-case, split on any whitespace, then add the sentence boundary tags
    return ['<s>'] + sentence.lower().split() + ['</s>']


def get_bigram(token_list):
    # Pair each token with its successor: (t0, t1), (t1, t2), ...
    return list(zip(token_list, token_list[1:]))


def sentence_to_bigrams(sentence):
    """
    Add start '<s>' and stop '</s>' tags to the sentence and tokenize it into a list
    of lower-case words (sentence_tokens) and bigrams (sentence_bigrams)
    :param sentence: string
    :return: list, list
        sentence_tokens: ordered list of words found in the sentence
        sentence_bigrams: a list of ordered two-word tuples found in the sentence
    """
    sentence_tokens = get_token(sentence)
    sentence_bigrams = get_bigram(sentence_tokens)

    return sentence_tokens, sentence_bigrams


[sentence_to_bigrams(sentence) for sentence in test_sentences]
```

```
[(['<s>', 'the', 'old', 'man', 'spoke', 'to', 'me', '</s>'],
  [('<s>', 'the'),
   ('the', 'old'),
   ('old', 'man'),
   ('man', 'spoke'),
   ('spoke', 'to'),
   ('to', 'me'),
   ('me', '</s>')]),
 (['<s>', 'me', 'to', 'spoke', 'man', 'old', 'the', '</s>'],
  [('<s>', 'me'),
   ('me', 'to'),
   ('to', 'spoke'),
   ('spoke', 'man'),
   ('man', 'old'),
   ('old', 'the'),
   ('the', '</s>')]),
 (['<s>', 'old', 'man', 'me', 'old', 'man', 'me', '</s>'],
  [('<s>', 'old'),
   ('old', 'man'),
   ('man', 'me'),
   ('me', 'old'),
   ('old', 'man'),
   ('man', 'me'),
   ('me', '</s>')])]
```

##### Probabilities and Likelihoods with Bigrams

The probability of a series of words can be calculated from the chained probabilities of its history:

$$ P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, w_2, \ldots, w_{i-1}) $$

The probabilities of sequence occurrences in a large textual corpus can be calculated this way and used as a **language model to add grammar and contextual knowledge to a speech recognition system**. However, there is a prohibitively large number of calculations for all the possible sequences of varying length in a large textual corpus.

To address this problem, we use the Markov Assumption to approximate a sequence probability with a shorter sequence.

###### Markov Assumption

In probability theory and statistics, the term Markov property refers to the memoryless property of a stochastic process. It is named after the Russian mathematician Andrey Markov.

**A stochastic process has the Markov property if the conditional probability distribution of future states of the process (conditional on both past and present states) depends only upon the present state, not on the sequence of events that preceded it**. A process with this property is called a Markov process. The term strong Markov property is similar to the Markov property, except that the meaning of "present" is defined in terms of a random variable known as a stopping time.

The term Markov assumption is used to describe a model where the Markov property is assumed to hold, such as a hidden Markov model.

A Markov random field extends this property to two or more dimensions or to random variables defined for an interconnected network of items. An example of a model for such a field is the Ising model. A discrete-time stochastic process satisfying the Markov property is known as a Markov chain.

Thus, we can approximate the full history of each word with only its previous k words (k = 1 for bigrams):

$$ P(w_1, w_2, \ldots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-k}, \ldots, w_{i-1}) $$
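
As a worked instance, take k = 1 (the bigram case) and the example utterance 'I love language models' with the start and end tags used in the quizzes, written here as $\langle s \rangle$ and $\langle /s \rangle$. Treating the start tag as given, the approximation expands into one factor per bigram:

$$ P(\langle s \rangle, i, love, language, models, \langle /s \rangle) \approx P(i \mid \langle s \rangle) \cdot P(love \mid i) \cdot P(language \mid love) \cdot P(models \mid language) \cdot P(\langle /s \rangle \mid models) $$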


We can calculate the probabilities by using counts of the bigrams and individual tokens. The counts are represented below with the c() operator:

$$ P(w_i \mid w_{i-1}) = \dfrac{c(w_{i-1}, w_i)}{c(w_{i-1})} $$

In the quiz below, write a function that returns a probability dictionary when given lists of tokens and bigrams.

```python
def bigrams_from_transcript(filename):
    """
    Read a file of sentences, adding start '<s>' and stop '</s>' tags; tokenize it into a list of lower-case words
    and bigrams
    :param filename: string
        filename: path to a text file consisting of lines of non-punctuated text; assume one sentence per line
    :return: list, list
        tokens: ordered list of words found in the file
        bigrams: a list of ordered two-word tuples found in the file
    """
    tokens = []
    bigrams = []
    with open(filename, 'r') as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines so they do not add empty '<s>', '</s>' sentences
            line_tokens, line_bigrams = sentence_to_bigrams(line)
            tokens.extend(line_tokens)
            bigrams.extend(line_bigrams)
    return tokens, bigrams


def sentence_to_bigrams(sentence):
    """
    Add start '<s>' and stop '</s>' tags to the sentence and tokenize it into a list
    of lower-case words (sentence_tokens) and bigrams (sentence_bigrams)
    :param sentence: string
    :return: list, list
        sentence_tokens: ordered list of words found in the sentence
        sentence_bigrams: a list of ordered two-word tuples found in the sentence
    """
    sentence_tokens = ['<s>'] + sentence.lower().split() + ['</s>']
    # Pair each token with its successor: (t0, t1), (t1, t2), ...
    sentence_bigrams = list(zip(sentence_tokens, sentence_tokens[1:]))
    return sentence_tokens, sentence_bigrams
```
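
The helpers above only produce tokens and bigrams; the probability dictionary the quiz asks for is not shown. A minimal sketch of one way to build it from the count formula, together with the sentence log probability from Quiz 3, could look like this. The names `to_bigrams`, `bigram_mle` and `sentence_log_prob` and the tiny in-line corpus are assumptions made for this example, not the quiz's reference solution, and unseen bigrams are simply given probability zero (no smoothing).

```python
import math
from collections import Counter


def to_bigrams(sentence):
    # Same tokenization as in the quizzes: lower-case words plus boundary tags.
    tokens = ['<s>'] + sentence.lower().split() + ['</s>']
    return tokens, list(zip(tokens, tokens[1:]))


def bigram_mle(tokens, bigrams):
    """Estimate P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1}) from counts."""
    token_counts = Counter(tokens)
    bigram_counts = Counter(bigrams)
    return {
        (prev, word): count / token_counts[prev]
        for (prev, word), count in bigram_counts.items()
    }


def sentence_log_prob(sentence, probabilities):
    """Sum of log bigram probabilities; -inf if any bigram was never seen."""
    _, bigrams = to_bigrams(sentence)
    total = 0.0
    for bigram in bigrams:
        p = probabilities.get(bigram, 0.0)
        if p == 0.0:
            return float('-inf')
        total += math.log(p)
    return total


# Hypothetical two-sentence corpus, just to exercise the functions.
corpus = ['the old man spoke to me', 'the old man left']
tokens, bigrams = [], []
for line in corpus:
    line_tokens, line_bigrams = to_bigrams(line)
    tokens.extend(line_tokens)
    bigrams.extend(line_bigrams)

probabilities = bigram_mle(tokens, bigrams)
print(probabilities[('man', 'spoke')])
print(sentence_log_prob('the old man left', probabilities))
```

Summing logs instead of multiplying probabilities keeps long sentences from underflowing to zero. Returning negative infinity for unseen bigrams is the simplest choice; a practical model would typically apply some form of smoothing instead.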