# N-gram Language Models

## Language Models

A language model measures how good (fluent) a piece of text is by assigning it a probability. Language models are used for modelling fluency (speech recognition, optical character recognition), for query completion, and for generation (ChatGPT), including:

1. machine translation
2. summarisation
3. dialogue systems

pretrained language models are the backbone of modern NLP systems

## Deriving n-gram language models

### Probabilities: Joint to Conditional

First step: apply the chain rule

$$P(w_1,w_2,...,w_m)=P(w_1)P(w_2|w_1)P(w_3|w_1,w_2)...P(w_m|w_1,...,w_{m-1})$$
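As a quick illustration of the chain rule, take a hypothetical three-word sequence (the words are chosen only for illustration):

$$P(\text{the},\text{cat},\text{sat})=P(\text{the})\,P(\text{cat}\mid\text{the})\,P(\text{sat}\mid\text{the},\text{cat})$$

No approximation is involved at this point; the factorisation is exact.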

### The Markov Assumption

$$P(w_i|w_1,...,w_{i-1})\approx P(w_i|w_{i-n+1},...,w_{i-1})$$

when $n=1$, unigram: $P(w_1,w_2,...,w_m)=\prod_{i=1}^mP(w_i)$

when $n=2$, bigram: $P(w_1,w_2,...,w_m)=\prod_{i=1}^mP(w_i|w_{i-1})$, where $w_0$ is the start symbol `<s>`

when $n=3$, trigram: $P(w_1,w_2,...,w_m)=\prod_{i=1}^mP(w_i|w_{i-2},w_{i-1})$, where $w_{-1}$ and $w_0$ are both the start symbol `<s>`
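Under the Markov assumption the model only ever looks at windows of $n$ consecutive words. A minimal sketch of how such windows could be extracted (the helper name `ngrams` and the example sentence are illustrative assumptions, not part of the notes):

```python
def ngrams(tokens, n):
    """Return every n-gram (as a tuple) found in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical example sentence, already tokenised.
tokens = ["a", "cow", "eats", "grass"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

Each tuple holds one predicted word (the last element) together with its $n-1$ words of context.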

### Maximum Likelihood Estimation

unigram: $P(w_i)=\frac{C(w_i)}{M}$, $M$ is the total number of the word tokens in corpus.

bigram: $P(w_i|w_{i-1})=\frac{C(w_{i-1},w_i)}{C(w_{i-1})}$

n-gram: $P(w_i|w_{i-n+1},...,w_{i-1})=\frac{C(w_{i-n+1},...,w_i)}{C(w_{i-n+1},...,w_{i-1})}$
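The estimates above are just ratios of counts. A minimal sketch on a tiny made-up corpus (the corpus and the function names `p_unigram` and `p_bigram` are assumptions for illustration):

```python
from collections import Counter

# Toy corpus (hypothetical), already tokenised.
corpus = [["yes", "no", "no", "yes"], ["no", "no", "yes"]]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    unigram_counts.update(sentence)
    bigram_counts.update(zip(sentence, sentence[1:]))

M = sum(unigram_counts.values())  # total number of word tokens


def p_unigram(w):
    # C(w) / M
    return unigram_counts[w] / M


def p_bigram(w, prev):
    # C(prev, w) / C(prev); sentence boundaries are ignored in this sketch
    return bigram_counts[(prev, w)] / unigram_counts[prev]


print(p_unigram("no"))
print(p_bigram("yes", "no"))
```

Because this sketch ignores sentence boundaries, the bigram estimates for a given context do not necessarily sum to 1 (a sentence-final word is counted in $C(w_{i-1})$ but is followed by nothing).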

### Book-ending Sequence

denote start and end of sequence

`<s>` = sentence start

`</s>` = sentence end

### Trigram example
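e.g. for the sequence "yes no no yes", padded with two `<s>` symbols and one `</s>`:

$$P(\text{yes},\text{no},\text{no},\text{yes})=P(\text{yes}\mid\langle s\rangle,\langle s\rangle)\times P(\text{no}\mid\langle s\rangle,\text{yes})\times P(\text{no}\mid\text{yes},\text{no})\times P(\text{yes}\mid\text{no},\text{no})\times P(\langle/s\rangle\mid\text{no},\text{yes})$$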

<font color=red>Note: `</s>` must be predicted too, because it marks the end of the sentence</font>

$$P(w_i|w_{i-2},w_{i-1})=\frac{C(w_{i-2},w_{i-1},w_i)}{C(w_{i-2},w_{i-1})}$$
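The trigram example can be reproduced with a few lines of counting. This is a sketch on a hypothetical two-sentence corpus; `pad`, `p_trigram` and `sentence_prob` are illustrative names, not from the notes:

```python
from collections import Counter

# Toy corpus (hypothetical), already tokenised.
corpus = [["yes", "no", "no", "yes"], ["no", "no", "yes"]]


def pad(sentence, n=3):
    # n-1 start symbols so the first word has a full context, one end symbol
    return ["<s>"] * (n - 1) + sentence + ["</s>"]


trigram_counts = Counter()
context_counts = Counter()
for sentence in corpus:
    padded = pad(sentence)
    for i in range(2, len(padded)):
        trigram_counts[tuple(padded[i - 2:i + 1])] += 1
        context_counts[tuple(padded[i - 2:i])] += 1


def p_trigram(w, u, v):
    """MLE estimate of P(w | u, v), where u, v are the two preceding tokens."""
    if context_counts[(u, v)] == 0:
        return 0.0
    return trigram_counts[(u, v, w)] / context_counts[(u, v)]


def sentence_prob(sentence):
    padded = pad(sentence)
    prob = 1.0
    for i in range(2, len(padded)):
        prob *= p_trigram(padded[i], padded[i - 2], padded[i - 1])
    return prob


print(sentence_prob(["yes", "no", "no", "yes"]))
print(sentence_prob(["yes"]))  # contains an unseen trigram
```

Any sentence containing a trigram that never occurred in the corpus gets probability 0 under plain MLE.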

### Several Problems

1. language has long-distance effects, so a large n is needed
2. the resulting probabilities are often very small (use log probabilities)
3. unseen n-grams get zero probability (use smoothing)
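To see why log probabilities help with the second problem, here is a small sketch with made-up per-word probabilities: multiplying many small numbers underflows in floating point, while summing their logs does not.

```python
import math

# Hypothetical per-word probabilities for a 400-word text.
word_probs = [1e-3] * 400

product = 1.0
for p in word_probs:
    product *= p  # underflows to 0.0 in double precision

log_prob = sum(math.log(p) for p in word_probs)  # stays finite

print(product)
print(log_prob)
```

Two texts can still be compared through their log probabilities, since the logarithm preserves ordering.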



## Smoothing to deal with sparsity

### Smoothing

give some probability to events that were never seen in training

1. constrained to still satisfy $P(\{everything\})=1$
2. many variants: Laplacian (add-one), add-k, absolute discounting, Kneser-Ney, interpolation...


### Laplacian (Add-one) Smoothing

pretend we've seen each n-gram once more than we did

unigram ($V$ = the vocabulary, $M$ = the total number of word tokens in the corpus):
$$P_{add1}(w_i)=\frac{C(w_i)+1}{M+|V|}$$

bigram:
$$P_{add1}(w_i|w_{i-1})=\frac{C(w_{i-1},w_i )+1}{C(w_{i-1})+|V|}$$

<center>
<img src="./figures/week2l1-1.png" width = "400" alt="image name" align=center />
</center>

<font color=red> NOTE: `<s>` is not part of the vocabulary, because we never need to infer its conditional prob (e.g. P(`<s>`|...)); but `</s>` is included. </font>

Drawback: add-one often gives too much probability to unseen n-grams.
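A minimal sketch of add-one smoothing for bigrams, on a hypothetical toy corpus. Following the note above, `</s>` is put in the vocabulary and `<s>` is not; the name `p_add1` is illustrative.

```python
from collections import Counter

# Toy corpus (hypothetical), already tokenised.
corpus = [["yes", "no", "no", "yes"], ["no", "no", "yes"]]

bigram_counts = Counter()
context_counts = Counter()
vocab = {"</s>"}  # </s> is predicted, so it is in V; <s> is not
for sentence in corpus:
    padded = ["<s>"] + sentence + ["</s>"]
    vocab.update(sentence)
    for prev, w in zip(padded, padded[1:]):
        bigram_counts[(prev, w)] += 1
        context_counts[prev] += 1


def p_add1(w, prev):
    # (C(prev, w) + 1) / (C(prev) + |V|)
    return (bigram_counts[(prev, w)] + 1) / (context_counts[prev] + len(vocab))


print(p_add1("yes", "yes"))  # unseen bigram, now non-zero
print(sum(p_add1(w, "yes") for w in vocab))
```

With only three vocabulary items the unseen bigram "yes yes" already receives 1/6 of the mass for its context, which illustrates how aggressive add-one can be.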


### Add-k($\alpha$) smoothing (Lidstone Smoothing)

adding one is often too much; instead, add a fraction k, so that less weight is moved from seen n-grams to unseen n-grams.

$$P_{addk}(w_i|w_{i-1},w_{i-2})=\frac{C(w_{i-2},w_{i-1},w_i)+k}{C(w_{i-2},w_{i-1})+k|V|}$$

k has to be chosen (tuned); a smaller k means less weight for unseen n-grams.
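The effect of k can be checked on a single context. The sketch below uses made-up counts for the words following one context; `add_k_probs` is an illustrative name.

```python
def add_k_probs(counts, k):
    """Lidstone (add-k) smoothing for one context: (C + k) / (total + k|V|)."""
    total = sum(counts.values())
    denom = total + k * len(counts)
    return {w: (c + k) / denom for w, c in counts.items()}


# Hypothetical counts C(context, w) for every word w in a 7-word vocabulary.
counts = {"tea": 8, "coffee": 5, "water": 4, "juice": 2,
          "milk": 1, "soda": 0, "wine": 0}

probs_k01 = add_k_probs(counts, k=0.1)
probs_k1 = add_k_probs(counts, k=1.0)
print(probs_k01["soda"], probs_k1["soda"])
```

With k = 0.1 the unseen word "soda" receives much less probability than with k = 1, while both distributions still sum to 1.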

**effective counts**: the (equivalent) counts that each word ends up with after smoothing, obtained by rescaling its smoothed prob by the original count of the context.

$$effective\ counts=smoothed\ prob\times C(\text{context})$$

<center>
<img src="./figures/week2l1-2.png" width = "400" alt="image name" align=center />
</center>

different n-grams give up different amounts of weight: the more frequent the n-gram, the more of its count it loses.
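A sketch of effective counts under add-k smoothing, taking an effective count to be the smoothed probability rescaled by the total count of the context (the usual "adjusted count" convention, which is an assumption of this sketch). The counts are made up.

```python
def effective_counts(counts, k):
    """Add-k smoothed probability of each word, rescaled to the context count."""
    total = sum(counts.values())
    denom = total + k * len(counts)
    return {w: (c + k) / denom * total for w, c in counts.items()}


# Hypothetical counts C(context, w) for every word w in a 7-word vocabulary.
counts = {"tea": 8, "coffee": 5, "water": 4, "juice": 2,
          "milk": 1, "soda": 0, "wine": 0}

eff = effective_counts(counts, k=0.1)
for w in counts:
    print(w, counts[w], round(eff[w], 3))
```

The effective counts add up to the original total, and the frequent word "tea" gives up more of its count than the rarer "juice".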


### Absolute Discounting

Borrow a fixed amount $d$ from every observed n-gram count and redistribute the borrowed mass equally among the unseen n-grams. In practice, calculate the effective counts first, then the smoothed prob.

The discount $d$ has to be tuned.

<center>
<img src="./figures/week2l1-3.png" width = "400" alt="image name" align=center />
</center>
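A minimal sketch of absolute discounting for one context, with made-up counts and a hypothetical discount of 0.1; `absolute_discounting` is an illustrative name.

```python
def absolute_discounting(counts, d):
    """Return (effective counts, smoothed probabilities) for one context."""
    total = sum(counts.values())
    seen = [w for w, c in counts.items() if c > 0]
    unseen = [w for w, c in counts.items() if c == 0]
    if not unseen:  # nothing to give mass to, so do not discount
        d = 0.0
    freed = d * len(seen)  # total count mass removed from seen words
    effective = {}
    for w, c in counts.items():
        effective[w] = c - d if c > 0 else freed / len(unseen)
    probs = {w: e / total for w, e in effective.items()}
    return effective, probs


# Hypothetical counts C(context, w) for every word w in a 7-word vocabulary.
counts = {"tea": 8, "coffee": 5, "water": 4, "juice": 2,
          "milk": 1, "soda": 0, "wine": 0}

effective, probs = absolute_discounting(counts, d=0.1)
print(effective["tea"], effective["soda"])
print(probs["soda"])
```

Here five seen words each give up 0.1, so 0.5 counts are split equally between the two unseen words.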


### Katz Backoff
Absolute discounting redistributes the probability mass equally over all unseen n-grams, which is not always appropriate.

Katz backoff instead redistributes the mass based on <font color=red>a</font> **lower order** model (e.g. unigram), but just <font color=red>one order lower</font>.

Here, the discounted mass is shared among the words unseen after the context in proportion to each word's unigram prob.

$$P_{katz}(w_i|w_{i-1})=\begin{cases}
\frac{C(w_{i-1}, w_i)-D}{C(w_{i-1})}, & C(w_{i-1}, w_i) > 0\\
\alpha(w_{i-1})\times\frac{P(w_i)}{\sum_{w_j:C(w_{i-1},w_j)=0}P(w_j)}, & otherwise
\end{cases}$$

$\alpha(w_{i-1})$: the amount of prob mass that has been discounted for context $w_{i-1}$ ($0.1\times5/20$ in last figure)

$P(w_i)$: unigram prob for $w_i$ (e.g. $P(infirmity)$)

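$\sum_{w_j:C(w_{i-1},w_j)=0}P(w_j)$: sum of the unigram probs of all words that do not co-occur with context $w_{i-1}$ (e.g. $P(infirmity) + P(alleged)$). In other words, the denominator sums the probabilities of all words never seen after $w_{i-1}$, and the numerator is the probability of the one word we are scoring.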
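The two cases of the Katz formula can be sketched for a single context as follows. The bigram counts, unigram probabilities and discount are all made up, and `katz_bigram` is an illustrative name.

```python
def katz_bigram(w, bigram_counts, unigram_probs, d):
    """P_katz(w | prev) for one fixed context prev.

    bigram_counts: C(prev, w') for every word w' in the vocabulary
    unigram_probs: P(w') for every word w' in the vocabulary
    """
    total = sum(bigram_counts.values())  # C(prev)
    if bigram_counts[w] > 0:
        return (bigram_counts[w] - d) / total
    seen = [v for v, c in bigram_counts.items() if c > 0]
    unseen = [v for v, c in bigram_counts.items() if c == 0]
    alpha = d * len(seen) / total  # mass discounted from the seen bigrams
    return alpha * unigram_probs[w] / sum(unigram_probs[v] for v in unseen)


# Hypothetical counts for one context and hypothetical unigram probabilities.
bigram_counts = {"tea": 8, "coffee": 5, "water": 4, "juice": 2,
                 "milk": 1, "soda": 0, "wine": 0}
unigram_probs = {"tea": 0.30, "coffee": 0.25, "water": 0.20, "juice": 0.10,
                 "milk": 0.05, "soda": 0.08, "wine": 0.02}

p_soda = katz_bigram("soda", bigram_counts, unigram_probs, d=0.1)
p_wine = katz_bigram("wine", bigram_counts, unigram_probs, d=0.1)
print(p_soda, p_wine)
```

Unlike absolute discounting, the two unseen words no longer receive equal shares: "soda" has the higher unigram probability and therefore gets more of the discounted mass.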


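Open question: what if the word has never been seen as a unigram either?
Its backoff probability would then be 0; but a word that never occurs in the corpus can never appear in a bigram either, so it is unclear how such a word should be handled.

a questionable example: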

<center>
<img src="./figures/week2l1-4.png" width = "400" alt="image name" align=center />
</center>


### Kneser-Ney Smoothing (Continuation prob)

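Idea: back off to a *continuation probability* that rewards versatility. A word with high versatility co-occurs with a lot of unique words.

e.g. glasses: men's glasses, black glasses, buy glasses, etc (high versatility)

e.g. francisco: san francisco (low versatility)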

$$P_{KN}(w_i|w_{i-1})=\begin{cases}
\frac{C(w_{i-1}, w_i)-D}{C(w_{i-1})}, & C(w_{i-1}, w_i) > 0\\
\beta(w_{i-1})\times P_{cont}(w_i), & otherwise
\end{cases}$$

$$P_{cont}(w_i)=\frac{|\{w_{i-1}:C(w_{i-1},w_i)>0\}|}{\sum_{w_j:C(w_{i-1},w_j)=0}|\{w_{j-1}:C(w_{j-1},w_j)>0\}|}$$

$\beta(w_{i-1})$: the amount of probability mass that has been discounted for context $w_{i-1}$ (same as $\alpha$ in backoff)

$P_{cont}(w_i)$: the numerator is the number of unique $w_{i-1}$ that co-occur with $w_i$ (i.e. how many distinct preceding words $w_i$ has been seen after); the denominator sums this count over all candidate words $w_j$ that do not co-occur with the context $w_{i-1}$

<img src="formula.png" width="300">
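A sketch of the continuation probability and the two-case formula above, on made-up bigram counts. The names `continuation_count` and `p_kn`, the vocabulary and the discount of 0.5 are illustrative assumptions, and the context is assumed to have been seen at least once.

```python
from collections import Counter

# Hypothetical bigram counts C(prev, w).
bigram_counts = Counter({
    ("san", "francisco"): 10,
    ("black", "glasses"): 2,
    ("buy", "glasses"): 2,
    ("mens", "glasses"): 1,
    ("buy", "tea"): 5,
    ("drink", "tea"): 4,
})
vocab = {"francisco", "glasses", "tea"}  # candidate next words


def continuation_count(w):
    # number of distinct words that precede w at least once
    return len({prev for (prev, nxt), c in bigram_counts.items()
                if nxt == w and c > 0})


def p_kn(w, prev, d=0.5):
    context_total = sum(c for (p, _), c in bigram_counts.items() if p == prev)
    if bigram_counts[(prev, w)] > 0:
        return (bigram_counts[(prev, w)] - d) / context_total
    seen = [v for v in vocab if bigram_counts[(prev, v)] > 0]
    unseen = [v for v in vocab if bigram_counts[(prev, v)] == 0]
    beta = d * len(seen) / context_total  # mass discounted for this context
    return beta * continuation_count(w) / sum(continuation_count(v) for v in unseen)


print(p_kn("glasses", "drink"), p_kn("francisco", "drink"))
```

Although "francisco" is the more frequent word in these counts, it follows only one distinct word, so after the unseen context "drink" it receives less mass than the more versatile "glasses".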

### Interpolation

a better way to combine different orders of n-gram models

$$P_{IN}(w_i|w_{i-1},w_{i-2})=\lambda_3...\lambda_2...\lambda_1...$$
$\lambda_1+\lambda_2+\lambda_3=1$
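Assuming the elided terms are the trigram, bigram and unigram estimates (the usual form of linear interpolation), the combination is a weighted sum $\lambda_3P_3(w_i|w_{i-2},w_{i-1})+\lambda_2P_2(w_i|w_{i-1})+\lambda_1P_1(w_i)$. A minimal sketch, with placeholder weights and made-up component probabilities:

```python
def interpolate(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """Linear interpolation of trigram, bigram and unigram estimates.

    lambdas = (lambda_3, lambda_2, lambda_1); the default values are
    placeholders and would normally be tuned on held-out data.
    """
    l3, l2, l1 = lambdas
    assert abs(l3 + l2 + l1 - 1.0) < 1e-9, "lambdas must sum to 1"
    return l3 * p_tri + l2 * p_bi + l1 * p_uni


# An unseen trigram (p_tri = 0) still gets probability from the lower orders.
p = interpolate(p_tri=0.0, p_bi=0.2, p_uni=0.05)
print(p)
```

Because the weights sum to 1 and each component is a probability distribution, the interpolated values also form a valid distribution.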











