# NLP. Week 4. Language models. N-grams

## Probabilistic Language Models

Probabilistic language models are designed to predict the likelihood of a sequence of words.

----------

Language models offer a way to **assign a probability to a sentence** or other sequence of words, and to predict a word from preceding words.
- Machine Translation: P(high winds tonite) > P(large winds tonite)
- Spell Correction: The office is about fifteen **minuets** from my house. P(about fifteen minutes from) > P(about fifteen minuets from)
- Speech Recognition: P(I saw a van) >> P(eyes awe of an)

## Types of Language Models
There are primarily two types of Language Models:

1. Statistical Language Models: these models use traditional statistical techniques like N-grams, Hidden Markov Models (HMM) and certain linguistic rules to learn the probability distribution of words
2. Neural Language Models: these use different kinds of Neural Networks to model language

# Statistical Language Model

### N-grams

N-gram models are probabilistic language models (Markov models) that estimate the likelihood of a word based on the preceding N-1 words. In other words, they model the conditional probability of a word given its context. Common cases are the unigram (a single word), the bigram (2 words), the trigram (3 words), etc.

> An N-gram is a sequence of N tokens (or words)


![Uni/bi/tri grams](https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab04/Ngram.png)

* **n-gram** language models are evaluated extrinsically in some task, or intrinsically using perplexity


#### Example 

Consider the following sentence: `“Innopolis University is a university located in the city of Innopolis.”`

* A 1-gram (or unigram) is a one-word sequence. For the above sentence, the unigrams would simply be: “Innopolis“,  “University“, “is“, “a“, “university“, “located“, “in“, “the“, “city“, “of“, “Innopolis“.

* A 2-gram (or bigram) is a two-word sequence, like “Innopolis University”, “university located”, or “located in”.
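
The unigrams and bigrams listed above can be extracted with a few lines of Python. This is a minimal sketch: the helper name `ngrams_of` is hypothetical, and splitting on whitespace after dropping the final period is a simplifying assumption, not a real tokenizer.

```python
def ngrams_of(tokens, n):
    """Return all contiguous n-token sequences as tuples (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "Innopolis University is a university located in the city of Innopolis."
# naive tokenization (assumption): drop the final period, split on whitespace
tokens = sentence.rstrip(".").split()

unigram_list = ngrams_of(tokens, 1)
bigram_list = ngrams_of(tokens, 2)

print(len(unigram_list), len(bigram_list))
print(bigram_list[:3])
```

A sentence of k tokens yields k unigrams and k-1 bigrams when no padding is used.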


### Model

> If we have a good N-gram model, we can predict `p(w | h)` — the probability of seeing the word w given a history of previous words `h` — where the history contains `n-1` words.


Probability of a sentence after applying the chain rule (an exact decomposition that makes no independence assumption):
$$ P(x_0...x_m) = P(x_0) * P(x_1|x_0) * P(x_2|x_0x_1) ... = \prod_{i=0}^{m}{P(x_i|x_0...x_{i-1})} \text{, where the sentence consists of the } m+1 \text{ words } x_0, ..., x_m.$$ 

An N-gram model simplifies this with the Markov assumption, limiting the history to the preceding N-1 words:

$$ P(x_m|x_0...x_{m-1}) \approx P(x_m|x_{m-n+1}...x_{m-1})$$

For a trigram model, this is:

$$P(x_0x_1x_2 ... x_m) \approx P(x_0) * P(x_1|x_0) * P(x_2|x_0x_1) * \prod_{i=3}^{m}P(x_i|x_{i-2}x_{i-1})$$
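
To see which conditional probabilities an N-gram model actually multiplies, the sketch below lists the (word, history) pairs for a tokenized sentence. The function name `conditional_factors` and the example tokens are illustrative assumptions. Each history keeps at most n-1 previous tokens, so the first words of the sentence get shorter histories.

```python
def conditional_factors(tokens, n):
    """List (word, history) pairs whose probabilities are multiplied
    under an n-gram model; each history holds at most n-1 previous tokens."""
    factors = []
    for i, word in enumerate(tokens):
        history = tuple(tokens[max(0, i - (n - 1)):i])
        factors.append((word, history))
    return factors

# hypothetical tokenized sentence
tokens = ["the", "city", "of", "innopolis"]
for word, history in conditional_factors(tokens, 3):
    print(f"P({word} | {' '.join(history)})")
```

With `n=3` the last factor conditions only on `city of`, while the full chain rule would condition on `the city of`.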


### Estimation of probabilities
The probabilities in n-gram models are typically estimated using maximum likelihood estimation (MLE) from a training corpus. For a bigram model, this is:
$$ P(x_i|x_{i-1}) = \frac{Count(x_{i-1}, x_i)}{Count(x_{i-1})}$$
where:
- $Count(x_{i-1})$ is the number of times the word $x_{i-1}$ appears in the training corpus
- $Count(x_{i-1}, x_i)$ is the number of times the word pair $(x_{i-1}, x_i)$ appears in the training corpus
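
The bigram MLE formula can be tried out on a tiny corpus. This is a sketch under stated assumptions: the three-sentence corpus and the name `bigram_mle` are made up for illustration, and returning `0.0` for an unseen history is a convention chosen here, since the ratio is undefined in that case.

```python
from collections import Counter

# toy corpus (assumption): already tokenized and lowercased
corpus = [
    ["i", "like", "tea"],
    ["i", "like", "coffee"],
    ["i", "drink", "tea"],
]

unigram_counts = Counter(word for sent in corpus for word in sent)
bigram_counts = Counter(
    (sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1)
)

def bigram_mle(prev_word, word):
    """MLE estimate Count(prev_word, word) / Count(prev_word)."""
    if unigram_counts[prev_word] == 0:
        return 0.0  # unseen history: ratio undefined, 0.0 is a convention here
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_mle("i", "like"))    # Count(i, like) = 2, Count(i) = 3
print(bigram_mle("like", "tea"))  # Count(like, tea) = 1, Count(like) = 2
```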

For Russian speakers: 
> [This Wikipedia article](https://ru.wikipedia.org/wiki/N-%D0%B3%D1%80%D0%B0%D0%BC%D0%BC%D0%B0) explains how the probabilities are estimated

### Smoothing techniques

To handle zero probabilities for unseen n-grams in the training data, various smoothing techniques are used:
1. Additive Smoothing (Laplace Smoothing when $\alpha = 1$): adds a constant $\alpha$ to every count, where $V$ is the vocabulary size:
$P(x_i \mid x_{i-1}) = \frac{Count(x_{i-1}, x_i) + \alpha}{Count(x_{i-1}) + \alpha V}$
2. Good-Turing Smoothing: adjusts the probability of unseen events based on the frequency of frequencies.
3. Kneser-Ney Smoothing: a more advanced method that adjusts probabilities based on both the frequency of words and the diversity of contexts in which they appear.
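
Additive smoothing is the easiest of the three to try out. The sketch below applies it to a toy corpus. The corpus and the name `bigram_additive` are illustrative assumptions, and `vocab_size` plays the role of $V$ in the formula.

```python
from collections import Counter

# toy corpus (assumption): already tokenized and lowercased
corpus = [
    ["i", "like", "tea"],
    ["i", "like", "coffee"],
    ["i", "drink", "tea"],
]

unigram_counts = Counter(word for sent in corpus for word in sent)
bigram_counts = Counter(
    (sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1)
)
vocab_size = len(unigram_counts)  # V: number of distinct words

def bigram_additive(prev_word, word, alpha=1.0):
    """Additive-smoothed estimate; alpha=1 gives Laplace (add-one) smoothing."""
    numerator = bigram_counts[(prev_word, word)] + alpha
    denominator = unigram_counts[prev_word] + alpha * vocab_size
    return numerator / denominator

print(bigram_additive("i", "like"))    # seen bigram: (2 + 1) / (3 + 5)
print(bigram_additive("i", "coffee"))  # unseen bigram: (0 + 1) / (3 + 5)
```

The unseen pair `("i", "coffee")` now gets a small non-zero probability, paid for by lowering the estimate of the seen pairs.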

### Advantages, Limitations and Applications
Advantages of N-gram models:
- Simple
- Efficient in terms of computation and storage for smaller n

Limitations:
- They suffer from data sparsity with high N, because many possible word sequences will not appear in the training data
- They are limited by a fixed context window size (n−1), which decreases the ability to capture long-range dependencies
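
The data sparsity point above can be made concrete by comparing how many distinct n-grams a corpus contains with how many are possible ($V^n$). This is a rough sketch on a made-up toy corpus, and the helper name `distinct_ngrams` is hypothetical. Real corpora show the same trend far more sharply.

```python
def distinct_ngrams(sentences, n):
    """Set of distinct n-grams observed in a list of tokenized sentences."""
    return {
        tuple(sent[i:i + n])
        for sent in sentences
        for i in range(len(sent) - n + 1)
    }

# toy corpus (assumption): already tokenized and lowercased
corpus = [
    ["i", "like", "tea"],
    ["i", "like", "coffee"],
    ["i", "drink", "tea"],
]
vocab = {word for sent in corpus for word in sent}

for n in (1, 2, 3):
    seen = len(distinct_ngrams(corpus, n))
    possible = len(vocab) ** n
    print(f"n={n}: {seen} seen out of {possible} possible")
```

The share of possible n-grams that are actually observed drops quickly as n grows, which is why higher-order models depend on smoothing.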

Applications:
- Speech recognition: predicts the likelihood of word sequences to improve transcription accuracy
- Machine translation: helps in generating fluent translations by predicting probable word sequences in the target language
- Text prediction: used in keyboards and text editors to suggest the next word or phrase.

### Task 1.
Given a .txt file with some text, compute the probability that a word chosen at random from the text is a given word.

In [1]:
!wget https://raw.githubusercontent.com/Dnau15/LabImages/main/data/txt_data/big.txt
!wget https://raw.githubusercontent.com/Dnau15/LabImages/main/data/txt_data/w2_.txt


In [2]:
import re
from collections import Counter

def words(text):
    return re.findall(r'\w+', text.lower())

# dictionary {word: number of its occurrences in the text}
WORDS = Counter(words(open('big.txt').read()))

def P(word: str, N=sum(WORDS.values())) -> float:
    """ Compute the probability of `word`.

    Args:
        word (str): the word whose probability needs to be found
        N (int): total number of words in the text. Defaults to sum(WORDS.values()).

    Returns:
        prob (float): the probability
    """
    prob = ...
    return prob

In [4]:
print(P('name'))
print(P('woman'))
print(P('man'))
print(P('this'))
print(P('a'))

0.00023485435892379335
0.0002895341905816231
0.0014772518454443185
0.0036420353446846273
0.018935356785901566


In [6]:
def almost_equal(x,y,threshold=0.0001):
  return abs(x-y) < threshold

In [7]:
assert almost_equal(P('name'), 0.0002348, 0.000001)
assert almost_equal(P('woman'), 0.0002895, 0.000001)
assert almost_equal(P('man'), 0.0014772, 0.000001)
assert almost_equal(P('this'), 0.0036420, 0.000001)
assert almost_equal(P('a'), 0.0189353, 0.000001)

### Task 2.
Among a given list of random words choose only those which appear in the 'big.txt' file. `Hint:` use the WORDS dictionary

In [8]:
def known(words):
    "Return the subset of `words` that appear in the 'big.txt'"
    return ...

In [9]:
assert known(['this', 'is', 'my', 'test', 'adc']) == {'is', 'my', 'test', 'this'}
assert known(['this', 'message', 'car', 'price', 'computer', 'pytorch']) == {'this', 'message', 'car', 'price', 'computer'}
assert known(['tensor', 'wandb', 'float', 'binary']) == {'float', 'binary'}

### Task 3.

The file 'w2_.txt' contains 1020385 lines. Each line has the format `int\tword\tword`. Count the occurrences of each word pair (bigram) in this file.

In [2]:
from collections import defaultdict

bigrams = defaultdict(int)

In [3]:
with open('w2_.txt', 'r') as file:
    for line in file:
        # each line: frequency, first word, second word (tab-separated)
        fields = line.strip().split('\t')
        frequency = int(fields[0])
        bigram = tuple(fields[1:])
        bigrams[bigram] += frequency

In [4]:
assert bigrams[('a', 'bombing')] == 320
assert bigrams[('a', 'most')] == 1988
assert bigrams[('c', 'can')] == 0

## Model implementation

In [1]:
# for those of you who run locally
# import nltk
# nltk.download('reuters', quiet=True)

In [2]:
%%capture 
# for those of you who run on kaggle
import subprocess
import nltk
# Download and unzip reuters
try:
    nltk.data.find('reuters.zip')
except LookupError:
    nltk.download('reuters', download_dir='/kaggle/working/', quiet=True, force=True)
    command = "unzip /kaggle/working/corpora/reuters.zip -d /kaggle/working/corpora"
    result = subprocess.run(command.split(), capture_output = True, text = True )
    nltk.data.path.append('/kaggle/working/')


In [3]:
import nltk
from nltk.corpus import reuters
from nltk import bigrams, trigrams
from collections import Counter, defaultdict

nltk.download('punkt')
# Create a placeholder for model
model = defaultdict(lambda: defaultdict(lambda: 0))

# Count frequency of co-occurrence
for sentence in reuters.sents():
    lower_sentence = [word.lower() for word in sentence]
    for word1, word2, word3 in trigrams(lower_sentence, pad_right=True, pad_left=True):
        model[(word1, word2)][word3] += 1
        
        
# Let's transform the counts to probabilities
for word1_word2 in model:
    total_count = float(sum(model[word1_word2].values()))
    for word3 in model[word1_word2]:
        model[word1_word2][word3] /= total_count

In [4]:
import random

# starting words
text = ["today", "the"]
sentence_finished = False
 
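while not sentence_finished:
    # draw a uniform random number and pick the first word whose cumulative
    # probability reaches it, i.e. sample from the trigram distribution
    probability_threshold = random.random()
    accumulator = .0
    context = tuple(text[-2:])

    # stop if the context was never seen in training (nothing to sample from)
    if not model[context]:
        break

    for word in model[context].keys():
        accumulator += model[context][word]
        if accumulator >= probability_threshold:
            text.append(word)
            break

    # two consecutive None paddings mark the end of the sentence
    if text[-2:] == [None, None]:
        sentence_finished = True

print(' '.join([t for t in text if t]))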

today the emirate ' s u . s .
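
The accumulator loop above samples a word in proportion to its probability. The same idea can be written with the standard library's `random.choices`, as in the sketch below. The distribution `next_word_probs` is a made-up example for a single context, not values taken from the Reuters model, and `sample_next` is a hypothetical helper name.

```python
import random

# hypothetical next-word distribution for one two-word context
next_word_probs = {"emirate": 0.5, "company": 0.3, "bank": 0.2}

def sample_next(distribution, rng=random):
    """Draw one word with probability proportional to its weight."""
    words = list(distribution.keys())
    weights = list(distribution.values())
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_next(next_word_probs, rng) for _ in range(10_000)]
print(draws.count("emirate") / len(draws))  # close to 0.5
```

Over many draws the empirical frequencies approach the probabilities in the distribution, which is the behaviour the manual loop implements.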


## Alternative model implementation

In [5]:
import string
from nltk.corpus import stopwords
from nltk import FreqDist


# keep only the n-grams that contain at least one word outside removal_list
# (n-grams made up entirely of removable words are dropped)
def remove_stopwords(ngrams, removal_list):
    return [ngram for ngram in ngrams
            if any(word not in removal_list for word in ngram)]

def pick_word(counter):
    "Chooses a random element."
    return random.choice(list(counter.elements()))


In [6]:
# input the reuters sentences
sents  = reuters.sents()

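nltk.download('stopwords')
# build the removal list: stopwords, punctuation and the tokens 'lt' and 'rt'
stop_words = set(stopwords.words('english'))
# extend the standard punctuation with typographic quotes and a long dash
# (assumed intent of the original quote characters), using a local variable
# instead of overwriting string.punctuation
punctuation = string.punctuation + '“”‘’—'
removal_list = list(stop_words) + list(punctuation) + ['lt', 'rt']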

In [7]:
# generate unigrams bigrams trigrams
unigram=[]
bigram=[]
trigram=[]
tokenized_text=[]
for sentence in sents:
    # lowercase and drop full stops; building a new list avoids removing
    # items from the sentence while iterating over it
    sentence = [word.lower() for word in sentence if word != '.']
    unigram.extend(sentence)

    tokenized_text.append(sentence)
    bigram.extend(nltk.ngrams(sentence, 2, pad_left=True, pad_right=True))
    trigram.extend(nltk.ngrams(sentence, 3, pad_left=True, pad_right=True))

In [8]:
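# unigrams are plain strings, so filter them directly
# (remove_stopwords would iterate over their characters)
unigram = [word for word in unigram if word not in removal_list]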
bigram = remove_stopwords(bigram,removal_list)
trigram = remove_stopwords(trigram,removal_list)
  

In [9]:
# generate frequency of n-grams 
freq_bi = FreqDist(bigram)
freq_tri = FreqDist(trigram)

In [10]:
d = defaultdict(Counter)
for a, b, c in freq_tri:
    # skip trigrams that contain padding
    if a is not None and b is not None and c is not None:
        d[(a, b)] += {(a, b, c): freq_tri[a, b, c]}

In [11]:
# Next word prediction      
s=''

prefix = "today", "the"
print(" ".join(prefix))
s = " ".join(prefix)
for i in range(19):
    suffix = pick_word(d[prefix])[-1]
    s=s+' '+suffix
    print(s)
    prefix = prefix[1], suffix

today the
today the emirate
today the emirate '
today the emirate ' s
today the emirate ' s graphic
today the emirate ' s graphic arts
today the emirate ' s graphic arts group
today the emirate ' s graphic arts group for
today the emirate ' s graphic arts group for about
today the emirate ' s graphic arts group for about 1
today the emirate ' s graphic arts group for about 1 ,
today the emirate ' s graphic arts group for about 1 , 816
today the emirate ' s graphic arts group for about 1 , 816 tonnes
today the emirate ' s graphic arts group for about 1 , 816 tonnes the
today the emirate ' s graphic arts group for about 1 , 816 tonnes the previous
today the emirate ' s graphic arts group for about 1 , 816 tonnes the previous quarter
today the emirate ' s graphic arts group for about 1 , 816 tonnes the previous quarter ,
today the emirate ' s graphic arts group for about 1 , 816 tonnes the previous quarter , the
today the emirate ' s graphic arts group for about 1 , 816 tonnes the previou

# Conclusion

In this lesson we started diving into Language Models and considered the 1st type - Statistical LMs (N-grams). 

N-grams are a basic tool of natural language processing: they let us analyze contiguous sequences of words.
- By breaking down text into sequences of n consecutive words or characters and estimating the probabilities of these sequences, we can predict and generate text.
- These estimates are the basis of models for NLP tasks such as text generation, speech recognition, and machine translation.
- Understanding and implementing n-grams gives the foundation needed for more advanced NLP techniques and applications.

> As we delve deeper into the complexities of language, n-grams serve as a stepping stone to more advanced models, such as neural networks and transformers. Mastering n-grams provides a solid foundation for understanding how machines can process and generate human language.