# n-gram

$\textbf{Assumption}$:
- Markov property: in an N-gram model, only the last N-1 words matter
- e.g. for a bigram model (N = 2)
$$P(\text{teacher drinks tea}) = P(\text{teacher}) * P(\text{drinks|teacher}) * \textbf{P(tea|teacher drinks)}$$
$$ \approx P(\text{teacher}) * P(\text{drinks|teacher}) * \textbf{P(tea|drinks)}$$

$\textbf{Input}$:
- Sentence
- Start of sentence symbols $\text{<s>}$
- End of sentence symbol $\text{</s>}$
- For an N-gram model: prepend N-1 start symbols to each sentence

$$p(w_n | w_{n-N+1} ^{n-1}) = \frac{Count(w_{n-N+1} ^{n-1} w_n)}{ Count(w_{n-N+1} ^{n-1}) }$$
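
The count ratio above can be sketched for the bigram case (N = 2). This is a minimal illustration, not a reference implementation: the function name `bigram_mle`, the whitespace tokenization, and the toy corpus are all assumptions made for the example.

```python
from collections import Counter

def bigram_mle(sentences):
    """Return a function prob(word, prev) estimated by relative frequency."""
    pair_counts = Counter()
    context_counts = Counter()
    for sentence in sentences:
        # N = 2, so prepend N-1 = 1 start symbol and append one end symbol.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            pair_counts[(prev, word)] += 1
            context_counts[prev] += 1

    def prob(word, prev):
        if context_counts[prev] == 0:
            # Unseen context: the ratio is undefined; returning 0.0 is a
            # convention chosen for this sketch.
            return 0.0
        return pair_counts[(prev, word)] / context_counts[prev]

    return prob

# Toy corpus, made up for illustration.
corpus = ["teacher drinks tea", "teacher drinks coffee", "student drinks tea"]
p = bigram_mle(corpus)
print(p("tea", "drinks"))    # 2 of the 3 "drinks" contexts are followed by "tea"
print(p("teacher", "<s>"))   # 2 of the 3 sentences start with "teacher"
```

Any bigram that never occurs in the corpus gets probability 0 here, which is the problem smoothing addresses.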

$\textbf{Steps (generating text)}$:
1. Choose a sentence start: begin with the $\text{<s>}$ symbols
2. Sample the next word from the n-grams that start with the previous N-1 words
3. Continue until $\text{</s>}$ is picked
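
The steps above can be sketched as a sampling loop over bigram counts. The helper names `train_bigram_counts` and `generate`, the `max_len` cap, and the toy corpus are assumptions for this example; sampling proportionally to counts is one reasonable reading of "choose".

```python
import random
from collections import Counter, defaultdict

def train_bigram_counts(sentences):
    """Map each previous word to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    return counts

def generate(counts, rng, max_len=20):
    """Sample words until </s> is picked or max_len words were produced."""
    words, prev = [], "<s>"
    for _ in range(max_len):
        candidates = list(counts[prev].keys())
        weights = list(counts[prev].values())
        # Sample the next word proportionally to its bigram count.
        nxt = rng.choices(candidates, weights=weights, k=1)[0]
        if nxt == "</s>":
            break
        words.append(nxt)
        prev = nxt
    return words

# Toy corpus, made up for illustration.
corpus = ["teacher drinks tea", "teacher drinks coffee", "student drinks tea"]
counts = train_bigram_counts(corpus)
print(" ".join(generate(counts, random.Random(0))))
```

The `max_len` cap is only a safety net for corpora whose bigrams form loops.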

$\textbf{Smoothing}$:
- Add-one (Laplace), where $|V|$ is the vocabulary size
$$p(w_n | w_{n-N+1} ^{n-1}) = \frac{Count(w_{n-N+1} ^{n-1} w_n) + 1}{ \sum_{w \in V} (Count(w_{n-N+1} ^{n-1} w) + 1) } = \frac{Count(w_{n-N+1} ^{n-1} w_n) + 1}{ Count(w_{n-N+1} ^{n-1}) + |V| }$$
- Add-k (add-one is the special case k = 1)
$$p(w_n | w_{n-N+1} ^{n-1}) = \frac{Count(w_{n-N+1} ^{n-1} w_n) + k}{ \sum_{w \in V} (Count(w_{n-N+1} ^{n-1} w) + k) } = \frac{Count(w_{n-N+1} ^{n-1} w_n) + k}{ Count(w_{n-N+1} ^{n-1}) + k * |V| }$$
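
A small sketch of add-k smoothing for bigrams, to show that unseen pairs receive non-zero probability while each conditional distribution still sums to 1. The name `add_k_bigram`, the choice to count $\text{</s>}$ as part of the vocabulary, and the toy corpus are assumptions for this example.

```python
from collections import Counter

def add_k_bigram(sentences, k=1.0):
    """Return (prob, vocab), where prob(word, prev) is add-k smoothed."""
    pair_counts, context_counts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        # Vocabulary = every word that can be predicted, including </s>.
        vocab.update(tokens[1:])
        for prev, word in zip(tokens, tokens[1:]):
            pair_counts[(prev, word)] += 1
            context_counts[prev] += 1

    def prob(word, prev):
        numerator = pair_counts[(prev, word)] + k
        denominator = context_counts[prev] + k * len(vocab)
        return numerator / denominator

    return prob, vocab

# Toy corpus, made up for illustration.
corpus = ["teacher drinks tea", "teacher drinks coffee", "student drinks tea"]
p, vocab = add_k_bigram(corpus, k=0.5)
print(p("coffee", "teacher"))                 # unseen bigram, but not zero
print(sum(p(w, "drinks") for w in vocab))     # close to 1.0
```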

$\textbf{n-gram model evaluation}: \underline{Perplexity}$
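- smaller perplexity means a better model (on the same test set and vocabulary)
- per-token perplexity of character-level models is typically lower than that of word-level models, so the two are not directly comparable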

$$PP(W) = \sqrt[N]{\prod_{i=1}^N \frac{1}{P(w_i | w_1, ..., w_{i-1})}}$$
$$\log(PP(W)) = -\frac{1}{N} \sum_{i=1}^N \log(P(w_i | w_1, ..., w_{i-1}))$$
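
The log form is the one usually computed in practice, since a product of many small probabilities underflows. A minimal sketch, assuming the per-token conditional probabilities are already available as a list (the name `perplexity` and the input format are choices made for this example):

```python
import math

def perplexity(token_probs):
    """Perplexity from the conditional probability of each token in W."""
    n = len(token_probs)
    # log PP(W) = -(1/N) * sum of log-probabilities
    log_pp = -sum(math.log(p) for p in token_probs) / n
    return math.exp(log_pp)

# A model that always spreads probability uniformly over 4 words
# has a perplexity of about 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
print(perplexity([0.9, 0.8, 0.95]))  # confident predictions: close to 1
```

Perplexity can be read as the effective number of equally likely choices the model faces at each step.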


# TF-IDF

$$TF(t) = \frac{\text{number of times term t appears in a document}}{\text{total number of terms in the document}}$$
$$IDF(t) = \ln \left(\frac{\text{total number of documents}}{\text{number of documents with term t}} \right)$$
$$weight(t) = TF(t) * IDF(t)$$

$\textbf{Notes}$:
- TF(t) captures a $\underline{local}$ feature: how frequent the term is within one document
- IDF(t) captures a $\underline{global}$ feature: how rare the term is across all documents
  - If a term appears in most of the documents, IDF(t) will be small
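
The three formulas can be sketched directly with the standard library. This is an illustration under simple assumptions: whitespace tokenization, no smoothing of the IDF term, and a made-up toy corpus; the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Toy corpus, made up for illustration.
docs = ["the cat sat", "the dog sat down", "the bird flew"]
w = tf_idf(docs)
print(w[0]["the"])  # appears in every document, so IDF = ln(1) = 0
print(w[0]["cat"])  # appears in one document only, so it gets the largest weight
```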

# Word Embedding

$$ e_{man} - e_{woman} \approx e_{king} - e_{queen} $$
$$\underset{w}{\mathrm{argmax}} \ sim(e_w, e_{king} - e_{man} + e_{woman})$$
$$\text{cos sim}(u, v) = \frac{u^T v}{||u||_2 * ||v||_2}$$
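
A sketch of the analogy search with cosine similarity. The 2-dimensional embeddings are hypothetical values chosen so that the arithmetic is easy to follow, not trained vectors, and excluding the three query words from the candidates is a common convention assumed here.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(embeddings, a, b, c):
    """Return the word w maximizing sim(e_w, e_b - e_a + e_c), excluding a, b, c."""
    target = [eb - ea + ec
              for ea, eb, ec in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))

# Hypothetical embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
toy = {
    "man": [0.0, 1.0], "woman": [0.0, -1.0],
    "king": [1.0, 1.0], "queen": [1.0, -1.0],
    "apple": [-1.0, 0.1],
}
print(analogy(toy, "man", "king", "woman"))  # queen
```

Here $e_{king} - e_{man} + e_{woman} = (1, -1)$, which coincides with the toy vector for "queen".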

## Word2Vec (Google)

$\textbf{Continuous bag-of-words (CBOW)}$: input the surrounding context words, output the center word (the formulas below use the simplified case of a single input word $word_I$ predicting $word_O$)
- $u_j = \overrightarrow{w}_j^T \overrightarrow{w}_I$: score of $word_j$ in a vocabulary of size $V$
- $\overrightarrow{w}_j$: output vector of $word_j$
- $\overrightarrow{w}_I$: input vector of $word_I$

$$E(word_I, word_O) = -\log \frac{\exp(u_O)}{\sum_{j=1}^V \exp(u_j)} = -\overrightarrow{w}_O^T \overrightarrow{w}_I + \log \left(\sum_{j=1}^V \exp(\overrightarrow{w}_j^T \overrightarrow{w}_I) \right)$$

$$\text{Objective:} \ \underset{w}{\mathrm{min}} \sum_{(word_I, word_O) \in D} \left(-\overrightarrow{w}_O^T \overrightarrow{w}_I + \log \left(\sum_{j=1}^V \exp(\overrightarrow{w}_j^T \overrightarrow{w}_I) \right) \right) $$
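
The per-pair loss is a softmax cross-entropy over the vocabulary. A minimal sketch for one (input word, output word) pair, assuming the vectors are plain Python lists; the name `softmax_loss` and the toy values are hypothetical, and the max-subtraction is a standard numerical-stability trick rather than part of the formula.

```python
import math

def softmax_loss(output_vectors, input_vector, target_index):
    """E = -u_target + log(sum_j exp(u_j)), with u_j = w_j . w_I."""
    scores = [sum(a * b for a, b in zip(w_j, input_vector))
              for w_j in output_vectors]
    max_score = max(scores)  # subtract the max before exp() to avoid overflow
    log_sum_exp = max_score + math.log(
        sum(math.exp(u - max_score) for u in scores))
    return -scores[target_index] + log_sum_exp

# Hypothetical output vectors for a vocabulary of V = 3 words.
outputs = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
w_in = [2.0, 0.0]
print(softmax_loss(outputs, w_in, target_index=0))  # small: word 0 scores highest
print(softmax_loss(outputs, w_in, target_index=2))  # large: word 2 scores lowest
```

The sum over all $V$ words in the second term is what makes the full softmax expensive for large vocabularies.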

$\textbf{Skip-gram}$: input one word, output the $C$ context words before and after it
- works well for infrequent words
- the softmax over $V$ words is expensive; hierarchical softmax reduces the computation
- $j_c^*$: index of the actual $c$-th context word

$$E = -\sum_{c=1}^C u_{j_c^*} + C * \log \left(\sum_{j=1}^V \exp(u_j)\right)$$
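
To make the skip-gram input/output relationship concrete, here is a sketch of how (input word, context word) training pairs could be extracted from a token sequence. The function name `skip_gram_pairs` and the fixed symmetric `window` are assumptions for this example; real implementations often also subsample frequent words or vary the window size.

```python
def skip_gram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

print(skip_gram_pairs(["a", "b", "c"], window=1))
# [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b')]
```

Each center word contributes one loss term per context word, which corresponds to the sum over $c$ in the loss above.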

$\textbf{Negative Sampling}$: train a set of logistic regressions on 1 positive and k negative samples
- $\overrightarrow{h}$: hidden vector computed from the input word(s)
- $\overrightarrow{w}_{w_O}$: output vector of the real (positive) word $w_O$
- $\overrightarrow{w}_{j}$: output vector of negative sample $word_j$
- $\sigma (\overrightarrow{w}_{w_O}^T \overrightarrow{h})$: probability of predicting $w_O$ as positive
- $\sigma (-\overrightarrow{w}_{j}^T \overrightarrow{h})$: probability of predicting $word_j$ as negative

$$E = -log(\sigma (\overrightarrow{w}_{w_O}^T \overrightarrow{h})) - \sum_{j\in W_{neg}} log(\sigma (-\overrightarrow{w}_{j}^T \overrightarrow{h}))$$
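
A sketch of this loss for one positive word and a handful of negative samples. The vectors are hypothetical toy values and the function names are chosen for this example; how the negative words are drawn (for instance from a smoothed unigram distribution) is outside the scope of the sketch.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_loss(positive_vector, negative_vectors, hidden):
    """E = -log sigma(w_pos . h) - sum_j log sigma(-w_j . h)."""
    loss = -math.log(sigmoid(dot(positive_vector, hidden)))
    for w_j in negative_vectors:
        loss -= math.log(sigmoid(-dot(w_j, hidden)))
    return loss

h = [1.0, 0.5]
good = negative_sampling_loss([2.0, 1.0], [[-2.0, -1.0], [-1.0, 0.0]], h)
bad = negative_sampling_loss([-2.0, -1.0], [[2.0, 1.0], [1.0, 0.0]], h)
print(good, bad)  # the loss is lower when the positive word aligns with h
```

Only $k + 1$ output vectors are touched per training pair, instead of all $V$ as in the full softmax.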


![Word2Vec](https://miro.medium.com/max/680/0*TY9nYgPpwJloevhp.png)

## GloVe (Stanford)

$\textbf{Summary}$: $\underline{global}$ vectors for word representation; word vectors are fit by weighted least squares to the log of aggregated global word-word co-occurrence counts from a corpus, which amounts to factorizing the co-occurrence matrix

$\textbf{Input}$:
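- $X_{i,k}$: number of times $word_k$ appears in the context window of $word_i$
- $X_i = \sum_{k=1}^V X_{i,k}$: total number of words that appear in the context of $word_i$ (before or after it)
- $P_{i,j} = P(word_j | word_i) = \frac{X_{i,j}}{X_i}$: probability that $word_j$ appears in the context of $word_i$
- $Ratio_{i,j}^k = \frac{P_{i,k}}{P_{j,k}}$: ratio of co-occurrence probabilities for a probe word $word_k$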
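
The inputs above can be sketched by counting co-occurrences inside a symmetric window. This sketch uses plain unweighted counts and a made-up token sequence; the names `cooccurrence` and `cond_prob` are hypothetical, and weighting pairs by their distance, as some implementations do, is deliberately left out.

```python
from collections import Counter

def cooccurrence(tokens, window=2):
    """X[(i, k)]: how often word k appears within `window` words of word i."""
    x = Counter()
    for pos, word in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for other in range(lo, hi):
            if other != pos:
                x[(word, tokens[other])] += 1
    return x

def cond_prob(x, i, j):
    """P(word_j | word_i) = X_ij / X_i."""
    x_i = sum(count for (a, _), count in x.items() if a == i)
    return x[(i, j)] / x_i

tokens = "ice is cold steam is hot".split()
x = cooccurrence(tokens, window=1)
print(cond_prob(x, "is", "cold"))  # "is" has 4 context words, one of them "cold"
```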

$$\text{Objective:} \ F(\overrightarrow{w}_i^T \overrightarrow{w}_k - \overrightarrow{w}_j^T \overrightarrow{w}_k) = \exp(\overrightarrow{w}_i^T \overrightarrow{w}_k - \overrightarrow{w}_j^T \overrightarrow{w}_k) = \frac{P_{i,k}}{P_{j,k}}$$

$$\text{Loss:} \ J = \sum_{i, k} f(X_{i, k}) (\overrightarrow{w}_i^T \overrightarrow{w}_k + b_i + b_k - \log(X_{i, k}))^2$$
$$\text{Weight function:} f(x) =
    \begin{cases}
      \left(\frac{x}{x_{max}}\right)^{\alpha}, & \text{if}\ x < x_{max} \\
      1, & \text{else}\ 
    \end{cases}$$
    
$\textbf{Property of f(x)}$:
- $f(0) = 0$, so that $\lim_{x \to 0} f(x) \log^2 x$ is finite
- f(x) is non-decreasing, so that rare co-occurrences are not overweighted
- f(x) is capped at 1 for large $X_{i,k}$, so that frequent co-occurrences are not overweighted
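
The weight function and its properties can be checked with a few lines of code. The defaults `x_max=100` and `alpha=0.75` follow the values reported in the GloVe paper listed in the references; treat them as tunable assumptions rather than requirements, and the function name `glove_weight` as hypothetical.

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x) = (x / x_max)^alpha below x_max, and 1 from x_max onward."""
    if x < x_max:
        return (x / x_max) ** alpha
    return 1.0

for count in [0, 1, 10, 50, 100, 1000]:
    print(count, glove_weight(count))
```

Because $f(0) = 0$, pairs that never co-occur drop out of the loss, so the sum only needs to run over the non-zero entries of $X$.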

$\textbf{Model Evaluation Tasks}$:
- Word analogy task: "Athens" is to "Greece" as "Berlin" is to "_"?
- Word similarity
- Named entity recognition

$\textbf{Hyperparameters of GloVe}$:
- Word Vector Length
- Context window size

## FastText (Facebook)

$\textbf{Summary}$: train a shallow neural network for text classification
- Input: all words in a document
- Output: probability of each class (y: true label)
  - number of classes K << dictionary size V
  - with so few classes, hierarchical softmax is not needed
  - negative sampling is not needed either (less training time)
- Hidden vector: average of all input words
$$\overrightarrow{h} = \frac{1}{C} W^T (\overrightarrow{x}_1 + \overrightarrow{x}_2 + ... + \overrightarrow{x}_C)$$

$$\text{Loss:} \ E = -u_y + \log \left(\sum_{k=1}^K \exp(u_k)\right) = -\overrightarrow{w}_y^T \overrightarrow{h} + \log \sum_{k=1}^K \exp(\overrightarrow{w}_k^T \overrightarrow{h})$$
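
A sketch of the forward pass and loss for one document. It assumes the word vectors have already been looked up (the rows of $W$ selected by the one-hot inputs), so the averaging step works on dense vectors; the function names and the toy values are hypothetical, and this is not the fastText library's API.

```python
import math

def class_probabilities(word_vectors, class_vectors):
    """Average the word vectors into h, then apply a softmax over K classes."""
    dim = len(word_vectors[0])
    hidden = [sum(vec[d] for vec in word_vectors) / len(word_vectors)
              for d in range(dim)]
    scores = [sum(w * h for w, h in zip(w_k, hidden)) for w_k in class_vectors]
    max_score = max(scores)  # subtract the max before exp() to avoid overflow
    exps = [math.exp(u - max_score) for u in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classification_loss(probs, y):
    """E = -log P(class y), equal to -u_y + log(sum_k exp(u_k))."""
    return -math.log(probs[y])

# Two words in the document, K = 2 classes, 2-dimensional vectors.
words = [[1.0, 0.0], [3.0, 0.0]]        # averaged into h = [2.0, 0.0]
classes = [[1.0, 0.0], [-1.0, 0.0]]     # scores u = [2.0, -2.0]
probs = class_probabilities(words, classes)
print(probs, classification_loss(probs, y=0))
```

Since the softmax runs over only $K$ classes, the full sum in the loss is cheap to compute.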

## References

- FastText: https://arxiv.org/pdf/1607.04606.pdf
- GloVe: https://nlp.stanford.edu/pubs/glove.pdf
- Word2Vec: https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf