# NLP 
## Objectives
Be able to infer a likely sentence $s$ given the observed speech signal $a$.  

## Generative Approach
The generative approach is to build two components:  
__Observation model__, represented as $p(a|s)$, which tells us how likely the sentence $s$ is to lead to the acoustic signal $a$.   
__Prior__, represented as $p(s)$, which tells us how likely a given sentence $s$ is. E.g., it should know that "recognize speech" is more likely than "wreck a nice beach". 

Given these components, we can use Bayes' Rule to infer a posterior distribution over sentences given the speech signal: 
$$p(s|a) = \frac{p(s)p(a|s)}{\sum_{s'}p(s')p(a|s')}$$
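
As a toy illustration of the posterior computation, the sketch below applies Bayes' Rule to two candidate sentences for one fixed signal $a$. The probabilities are made-up numbers chosen for illustration, not estimates from any real model.

```python
# Hypothetical prior p(s) and observation likelihood p(a|s) for one fixed signal a.
prior = {"recognize speech": 0.9, "wreck a nice beach": 0.1}
likelihood = {"recognize speech": 0.3, "wreck a nice beach": 0.4}

def posterior(prior, likelihood):
    """Return p(s|a) for every candidate sentence s via Bayes' Rule."""
    joint = {s: prior[s] * likelihood[s] for s in prior}
    evidence = sum(joint.values())  # denominator: sum over s' of p(s') p(a|s')
    return {s: joint[s] / evidence for s in joint}

post = posterior(prior, likelihood)
best = max(post, key=post.get)  # most probable sentence given the signal
```

In this toy setting the acoustic likelihood slightly favors "wreck a nice beach", but the prior tips the posterior toward "recognize speech".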

## Language Modeling

Assume we have a corpus of sentences $s^{(1)}, ..., s^{(N)}$. The maximum likelihood (ML) criterion says we want to maximize the probability our model assigns to the observed sentences. Assuming the sentences are independent, the objective is $\max \prod_{i=1}^N p(s^{(i)})$.

The __log probability__ is easier to work with: the product becomes a sum, $\sum_{i=1}^N \log p(s^{(i)})$, and maximizing it is equivalent to minimizing the cross-entropy loss. 
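
A minimal sketch of why the log is convenient, assuming three hypothetical sentence probabilities (the numbers are invented for illustration): the log of the product equals the sum of the logs, and the negated average is the quantity a cross-entropy loss minimizes.

```python
import math

# Hypothetical probabilities a model assigns to each of N = 3 corpus sentences.
sentence_probs = [0.02, 0.001, 0.05]

likelihood = math.prod(sentence_probs)                      # product of p(s^(i))
log_likelihood = sum(math.log(p) for p in sentence_probs)   # sum of log p(s^(i))
avg_nll = -log_likelihood / len(sentence_probs)             # average negative log-likelihood
```

With long corpora the raw product underflows to zero in floating point, while the sum of logs stays well-behaved.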

By the chain rule of conditional probability (without any assumptions), 
$$p(s) = p(w_1,...,w_T) = \prod_{t=1}^T p(w_t\mid w_1,...,w_{t-1})$$
With the __Markov assumption__ (limited memory: each word depends only on a fixed number of preceding words, here 3), 
$$p(w_t\mid w_1,...,w_{t-1}) = p(w_t\mid w_{t-3}, w_{t-2}, w_{t-1})$$
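
As a worked instance of the two formulas above, take the six-word sentence "the cat sat on the mat" (an example sentence chosen here for illustration). The chain rule factor for the last word conditions on all five preceding words, while under the Markov assumption with context length 3 only the last three are kept:
$$p(\text{mat}\mid \text{the},\text{cat},\text{sat},\text{on},\text{the}) = p(\text{mat}\mid \text{sat},\text{on},\text{the})$$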

### N-Gram
Using a conditional probability table, estimate each entry from the empirical distribution
$$p(w_3 \mid w_1, w_2) = \frac{p(w_1, w_2, w_3)}{p(w_1, w_2)}\approx \frac{\text{count of phrase } w_1 w_2 w_3}{\text{count of phrase } w_1 w_2}$$
The above example is a $3$-gram (trigram) model.
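
The count ratio above can be sketched in a few lines. The corpus is a tiny made-up example, and `trigram_prob` is a hypothetical helper name; the context is counted only where it is followed by a word, which is a small simplification so that the estimates for each context sum to 1.

```python
from collections import Counter

# Tiny made-up corpus; a real model would use far more text.
corpus = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the cat ate the fish",
]

trigram_counts = Counter()
context_counts = Counter()  # counts of two-word contexts (w1, w2) that are followed by a word
for sentence in corpus:
    words = sentence.split()
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        trigram_counts[(w1, w2, w3)] += 1
        context_counts[(w1, w2)] += 1

def trigram_prob(w1, w2, w3):
    """Empirical estimate of p(w3 | w1, w2); 0.0 if the context was never seen."""
    context = context_counts[(w1, w2)]
    return trigram_counts[(w1, w2, w3)] / context if context else 0.0

p_sat = trigram_prob("the", "cat", "sat")  # "the cat" occurs 3 times, followed by "sat" twice
p_ate = trigram_prob("the", "cat", "ate")
```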

#### Problems 
The number of entries in the conditional probability table is exponential in the context length.  
__Data sparsity__: most n-grams never appear in the corpus, even if they are possible. Two remedies are to use a shorter context, or to smooth the probabilities by adding imaginary counts.  
An ensemble of n-gram models with different $n$ can also alleviate data sparsity. 
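
A minimal sketch of smoothing with imaginary counts, assuming a toy four-word vocabulary and invented counts for a single context. The helper name `smoothed_prob` and the parameter `alpha` (the number of imaginary counts per word) are illustrative choices.

```python
from collections import Counter

# Hypothetical counts of the word following the context ("the", "cat").
vocab = ["sat", "ate", "slept", "ran"]
next_word_counts = Counter({"sat": 2, "ate": 1})  # "slept" and "ran" were never observed

def smoothed_prob(word, counts, vocab, alpha=1.0):
    """Additive smoothing: add alpha imaginary counts to every vocabulary word."""
    total = sum(counts.values()) + alpha * len(vocab)
    return (counts[word] + alpha) / total

probs = {w: smoothed_prob(w, next_word_counts, vocab) for w in vocab}
```

Unseen words now receive a small nonzero probability, and the distribution still sums to 1.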

### Distributed Representations
n-gram models only use local context and treat each word as an unrelated symbol, but words can be related in ways that counts alone do not capture, e.g., playing a similar role in a sentence or having a similar meaning. 

## Neural Language Model
__Input__ previous $K$ words  
__Target__ next word  
__Loss__ cross-entropy

### Bengio's Neural Language Model

http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf

![](./assets/bongio.png)


Each word is mapped to a learned distributed representation (an embedding). The embedding weights are shared across all context positions, and the hidden layers then use these representations to predict the next word.
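
The forward pass of such a model can be sketched with NumPy. This is a simplified illustration rather than the exact architecture from the paper: all sizes, parameter names, and word indices are assumptions, there is a single tanh hidden layer, and the weights are random instead of trained.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, D, H = 10, 3, 4, 8  # vocab size, context length, embedding dim, hidden units (arbitrary)

# Randomly initialized parameters; training would fit these by gradient descent.
E = rng.normal(scale=0.1, size=(V, D))        # embedding table shared by all K context positions
W_h = rng.normal(scale=0.1, size=(K * D, H))  # concatenated embeddings -> hidden layer
b_h = np.zeros(H)
W_o = rng.normal(scale=0.1, size=(H, V))      # hidden layer -> vocabulary logits
b_o = np.zeros(V)

def next_word_distribution(context_ids):
    """Forward pass: K previous word indices -> probabilities over the next word."""
    x = E[context_ids].reshape(-1)  # look up and concatenate the K embeddings
    h = np.tanh(x @ W_h + b_h)
    logits = h @ W_o + b_o
    logits -= logits.max()          # stabilize the softmax numerically
    p = np.exp(logits)
    return p / p.sum()

context = np.array([2, 7, 1])  # hypothetical indices of the previous K words
target = 5                     # hypothetical index of the true next word
probs = next_word_distribution(context)
loss = -np.log(probs[target])  # cross-entropy loss for this one example
```

Because the same table `E` is used at every context position, what the model learns about a word in one position carries over to the others.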

## Global Vector Embeddings (GloVe)
A simpler and faster approach based on matrix factorization, similar to PCA. 

__Hypothesis__ words with similar distributions have similar meanings  

Consider a co-occurrence matrix $X$, which counts the number of times two words appear near each other (e.g., within a distance of 5 words). This gives a $V\times V$ matrix, where $V$ is the vocabulary size.
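
A minimal sketch of building such a matrix, assuming a two-sentence made-up corpus and a window of 2 instead of 5 to keep the example small. Counts are symmetric and unweighted here, which is a simplification.

```python
from collections import Counter

# Made-up corpus; window = 2 keeps the example small.
corpus = ["the cat sat on the mat", "the dog sat on the rug"]
window = 2

cooc = Counter()
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        # Pair w with each word up to `window` positions to its right, counting both directions.
        for j in range(i + 1, min(i + 1 + window, len(words))):
            cooc[(w, words[j])] += 1
            cooc[(words[j], w)] += 1

vocab = sorted({w for s in corpus for w in s.split()})
X = [[cooc[(a, b)] for b in vocab] for a in vocab]  # V x V matrix as nested lists
```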

__Intuition pump__ we want a rank-K approximation $X\approx R\tilde R^T$ where $R$ and $\tilde R$ are $V\times K$ matrices. 
- Each row $r_i$ of $R$ is the $K$-dimensional representation of a word 
- Each entry is approximated as $x_{ij}\approx r_i^T \tilde r_j$
- Hence, words with similar representations (large $r_i^T \tilde r_j$) are more likely to co-occur
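
To make the rank-$K$ idea concrete, the sketch below factorizes a small made-up matrix with a truncated SVD, which gives the best rank-$K$ fit in the least-squares sense. This is a stand-in for illustration only, not the fitting procedure GloVe uses; the matrix values and $K = 2$ are assumptions.

```python
import numpy as np

# Hypothetical 4 x 4 symmetric co-occurrence counts, made up for illustration.
X = np.array([[0., 4., 1., 0.],
              [4., 0., 2., 1.],
              [1., 2., 0., 3.],
              [0., 1., 3., 0.]])
K = 2

U, s, Vt = np.linalg.svd(X)
R = U[:, :K] * np.sqrt(s[:K])        # rows r_i: K-dimensional word vectors
R_tilde = Vt[:K].T * np.sqrt(s[:K])  # rows r~_j: K-dimensional context vectors
X_hat = R @ R_tilde.T                # entry (i, j) is the inner product r_i^T r~_j
```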

#### Problems
- $X$ is extremely large, so fitting the above factorization using least squares is infeasible. We can reweight the entries so that only nonzero counts matter.
- Word counts are heavy-tailed, so we approximate $\log x_{ij}$ instead of $x_{ij}$.

The final cost function is 
$$J(R) = \sum_{i,j}f(x_{ij})(r_i^T\tilde r_j + b_i + \tilde b_j - \log x_{ij})^2$$
$$f(x_{ij}) = \begin{cases}(\frac{x_{ij}}{100})^{3/4} &x_{ij} < 100 \\
1 &x_{ij}\geq 100\end{cases}$$
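
The cost function can be evaluated directly, as in the sketch below. The function names, the 3-word count matrix, and the random vectors are all assumptions for illustration; a real implementation would iterate only over the nonzero entries and minimize the cost by gradient-based optimization.

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """f(x): (x / x_max)^alpha below x_max, 1 at or above it; note f(0) = 0."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_cost(X, R, R_tilde, b, b_tilde):
    """Weighted squared error between r_i^T r~_j + b_i + b~_j and log x_ij."""
    pred = R @ R_tilde.T + b[:, None] + b_tilde[None, :]
    nonzero = X > 0                            # f(0) = 0, so zero counts contribute nothing
    log_x = np.log(np.where(nonzero, X, 1.0))  # placeholder 1.0 avoids log(0); masked out below
    return float(np.sum(np.where(nonzero, weight(X) * (pred - log_x) ** 2, 0.0)))

# Hypothetical 3-word example with random K = 2 vectors and zero biases.
rng = np.random.default_rng(0)
X = np.array([[0., 50., 200.], [50., 0., 10.], [200., 10., 0.]])
R, R_tilde = rng.normal(size=(3, 2)), rng.normal(size=(3, 2))
b, b_tilde = np.zeros(3), np.zeros(3)
cost = glove_cost(X, R, R_tilde, b, b_tilde)
```

The weighting caps the influence of very frequent pairs at 1 and removes pairs that never co-occur, which is what makes the sum tractable for a sparse $X$.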