# N-gram Language Model

An `n-gram language model` is a `statistical model` used in natural language processing (NLP) to predict the next word in a sequence from the previous $n-1$ words. It is one of the simplest and most fundamental approaches to language modeling.

Unigram:
$$ P(w_i) $$

Bigram:
$$ P(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})} $$

Trigram:
$$ P(w_i \mid w_{i-2}, w_{i-1}) = \frac{\mathrm{count}(w_{i-2}, w_{i-1}, w_i)}{\mathrm{count}(w_{i-2}, w_{i-1})} $$

By the chain rule, the joint probability of three consecutive words is:
$$ P(w_{i-2}, w_{i-1}, w_i) = P(w_{i-2}) \cdot P(w_{i-1} \mid w_{i-2}) \cdot P(w_i \mid w_{i-2}, w_{i-1}) $$
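The count-ratio estimate for a bigram can be sketched in a few lines of plain Python. This is a minimal illustration, not a full implementation: the toy token list and the helper name `bigram_prob` are assumptions made for the example, and no smoothing is applied.

```python
from collections import Counter

# Toy corpus, assumed only for illustration.
tokens = ["i", "am", "happy", "because", "i", "am", "learning"]

# count(w_{i-1}, w_i): how often each adjacent pair occurs.
bigram_counts = Counter(zip(tokens, tokens[1:]))
# count(w_{i-1}): how often each word occurs as the first element of a pair.
context_counts = Counter(tokens[:-1])


def bigram_prob(prev_word, word):
    """Return count(prev_word, word) / count(prev_word), or 0.0 for an unseen context."""
    if context_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / context_counts[prev_word]


print(bigram_prob("i", "am"))      # 2 / 2 = 1.0
print(bigram_prob("am", "happy"))  # 1 / 2 = 0.5
```

The context count deliberately skips the final token, because it has no successor in this corpus. With that choice the probabilities for each context sum to one.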

### Markov assumption 
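`Markov assumption`: only the last $n-1$ words matter when predicting the next word, so an n-gram model ignores all earlier context.

Bigram simplification of the three-word joint probability:
$$ P(w_{i-2}, w_{i-1}, w_i) \approx P(w_{i-2}) \cdot P(w_{i-1} \mid w_{i-2}) \cdot P(w_i \mid w_{i-1}) $$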
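Under the Markov assumption with a bigram model, the probability of a whole sequence becomes a product of terms $P(w_i \mid w_{i-1})$. The sketch below assumes a one-sentence toy corpus with boundary tokens, so the first word is conditioned on `<s>` instead of using a separate unigram term. The function name `sequence_log_prob` is hypothetical. It sums log probabilities, a common way to avoid numerical underflow on long sequences.

```python
import math
from collections import Counter

# Toy training sentence with boundary tokens, assumed only for illustration.
tokens = ["<s>", "I", "study", "I", "learn", "</s>"]

bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])


def sequence_log_prob(words):
    """Sum log P(w_i | w_{i-1}) over adjacent pairs; -inf if any pair is unseen."""
    total = 0.0
    for prev_word, word in zip(words, words[1:]):
        count = bigram_counts[(prev_word, word)]
        if count == 0:
            return float("-inf")
        total += math.log(count / context_counts[prev_word])
    return total


# P(I | <s>) * P(learn | I) * P(</s> | learn) = 1.0 * 0.5 * 1.0
log_p = sequence_log_prob(["<s>", "I", "learn", "</s>"])
print(math.exp(log_p))  # approximately 0.5
```

Without smoothing, any sequence that contains an unseen bigram gets probability zero, which is why the function returns negative infinity in that case.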


While n-gram models are still used for autocomplete, speech recognition, and baseline NLP tasks, they’ve largely been replaced by `neural language models` (e.g., RNNs, Transformers) that can capture long-range dependencies and semantic meaning.

#### Starting and Ending sentences

At the beginning of each sentence, the `n-gram` algorithm adds `n-1 start tokens` `<s>`, so that the first word of the text also has a full n-gram context and gets a probability like any other word.

At the end of each sentence we add `only one end token` `</s>`, so that the model also has an n-gram for the word that ends each sentence.
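The padding rule can be sketched as a small helper. The function names `pad_sentence` and `extract_ngrams` are hypothetical, chosen only for this illustration.

```python
def pad_sentence(tokens, n, start="<s>", end="</s>"):
    """Prepend n-1 start tokens and append a single end token."""
    return [start] * (n - 1) + list(tokens) + [end]


def extract_ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


padded = pad_sentence(["I", "study"], n=3)
print(padded)
# ['<s>', '<s>', 'I', 'study', '</s>']
print(extract_ngrams(padded, 3))
# [('<s>', '<s>', 'I'), ('<s>', 'I', 'study'), ('I', 'study', '</s>')]
```

With the padding in place, every word of the sentence, including the first, appears as the last element of exactly one trigram.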

#### Example
Training sentence: `<s> I study I learn </s>`

`count matrix` (rows: previous word, columns: next word)

|           | `<s>` | `I` | `study` | `learn` | `</s>` |
| :-------- | :---: | :-: | :-----: | :-----: | :----: |
| **`<s>`**   |   0   |  1  |    0    |    0    |    0   |
| **I**     |   0   |  0  |    1    |    1    |    0   |
| **study** |   0   |  1  |    0    |    0    |    0   |
| **learn** |   0   |  0  |    0    |    0    |    1   |
| **`</s>`**  |   0   |  0  |    0    |    0    |    0   |

`probability matrix` (each count divided by its row total; the `</s>` row stays at zero because nothing follows it)

|           | `<s>` |  `I` | `study` | `learn` | `</s>` |
| :-------- | :---: | :--: | :-----: | :-----: | :----: |
| **`<s>`**   |  0.00 | 1.00 |   0.00  |   0.00  |  0.00  |
| **I**     |  0.00 | 0.00 |   0.50  |   0.50  |  0.00  |
| **study** |  0.00 | 1.00 |   0.00  |   0.00  |  0.00  |
| **learn** |  0.00 | 0.00 |   0.00  |   0.00  |  1.00  |
| **`</s>`**  |  0.00 | 0.00 |   0.00  |   0.00  |  0.00  |
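The two matrices for this example can be reproduced with a short plain-Python sketch. The variable names are assumptions made for illustration. Rows are indexed by the previous word and columns by the next word.

```python
tokens = ["<s>", "I", "study", "I", "learn", "</s>"]
vocab = ["<s>", "I", "study", "learn", "</s>"]
index = {word: k for k, word in enumerate(vocab)}

# Count matrix: counts[r][c] is how often vocab[c] follows vocab[r].
counts = [[0] * len(vocab) for _ in vocab]
for prev_word, word in zip(tokens, tokens[1:]):
    counts[index[prev_word]][index[word]] += 1

# Probability matrix: divide each row by its total; empty rows stay at zero.
probs = []
for row in counts:
    total = sum(row)
    probs.append([c / total if total else 0.0 for c in row])

for word, row in zip(vocab, probs):
    print(f"{word:>6}", row)
```

The guard on `total` handles the `</s>` row, which has no observed successors and would otherwise cause a division by zero.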


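```python
import numpy as np
import pandas as pd

corpus = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']
n = 3  # trigram model: a bigram context followed by the next word

vocab = []       # words seen in the final position of a trigram (matrix columns)
bigrams = []     # bigram contexts in order of first appearance (matrix rows)
count_dict = {}  # (bigram, word) -> number of occurrences

# Slide a window of n tokens over the corpus and count every trigram.
for i in range(len(corpus) - n + 1):
    trigram = tuple(corpus[i:i + n])
    bigram, word = trigram[:-1], trigram[-1]
    if bigram not in bigrams:
        bigrams.append(bigram)
    if word not in vocab:
        vocab.append(word)
    count_dict[bigram, word] = count_dict.get((bigram, word), 0) + 1

# Fill the count matrix: one row per bigram context, one column per next word.
matrix = np.zeros((len(bigrams), len(vocab)))
for (bigram, word), count in count_dict.items():
    matrix[bigrams.index(bigram), vocab.index(word)] = count
```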

```python
count_matrix = pd.DataFrame(matrix, index=bigrams, columns=vocab)
count_matrix
```

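|                       | happy | because |  i  |  am | learning |  .  |
| :-------------------- | :---: | :-----: | :-: | :-: | :------: | :-: |
| **(i, am)**           |  1.0  |   0.0   | 0.0 | 0.0 |    1.0   | 0.0 |
| **(am, happy)**       |  0.0  |   1.0   | 0.0 | 0.0 |    0.0   | 0.0 |
| **(happy, because)**  |  0.0  |   0.0   | 1.0 | 0.0 |    0.0   | 0.0 |
| **(because, i)**      |  0.0  |   0.0   | 0.0 | 1.0 |    0.0   | 0.0 |
| **(am, learning)**    |  0.0  |   0.0   | 0.0 | 0.0 |    0.0   | 1.0 |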


```python
matrix
```

```text
array([[1., 0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 1.]])
```
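A natural next step, sketched here as an assumption rather than as part of the original notebook, is to turn the trigram counts into conditional probabilities by dividing each row by its sum. The array is copied from the output above so that the block runs on its own. The name `prob_matrix` is hypothetical.

```python
import numpy as np

# Trigram counts copied from the output above: rows are bigram contexts,
# columns are the candidate next words.
bigrams = [("i", "am"), ("am", "happy"), ("happy", "because"),
           ("because", "i"), ("am", "learning")]
vocab = ["happy", "because", "i", "am", "learning", "."]
matrix = np.array([
    [1., 0., 0., 0., 1., 0.],
    [0., 1., 0., 0., 0., 0.],
    [0., 0., 1., 0., 0., 0.],
    [0., 0., 0., 1., 0., 0.],
    [0., 0., 0., 0., 0., 1.],
])

# Row totals give count(w_{i-2}, w_{i-1}) for each bigram context.
row_sums = matrix.sum(axis=1, keepdims=True)
# Divide only where the row total is positive; other rows keep the zeros from `out`.
prob_matrix = np.divide(matrix, row_sums, out=np.zeros_like(matrix), where=row_sums > 0)

# P(happy | i, am) = 1 / 2
print(prob_matrix[bigrams.index(("i", "am")), vocab.index("happy")])  # 0.5
```

Each row of `prob_matrix` is the estimated distribution over next words for one bigram context, so every row with at least one observation sums to one.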