## N-Grams
- Conditional Probabilities
- Text Pre-Processing
- Language Modelling
- Perplexity
- K-Smoothing
- N-Grams
- Backoff
- Tokenization

In [1]:
import nltk
import re

nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/shankar/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [2]:
corpus = "Learning% makes 'me' happy. I am happy be-cause I am learning! :)"
corpus = corpus.lower()

print(corpus)

learning% makes 'me' happy. i am happy be-cause i am learning! :)


In [3]:
# Remove special characters (keep letters, digits, spaces, and . ? !)
corpus = "learning% makes 'me' happy. i am happy be-cause i am learning! :)"
corpus = re.sub(r"[^a-zA-Z0-9.?! ]+", "", corpus)
print(corpus)

learning makes me happy. i am happy because i am learning! 


In [4]:
# Text splitting
# Split a string on a delimiter into a list of parts
input_date="Sat May  9 07:33:35 CEST 2020"

# Get the date parts
# Note: the double space before "9" produces an empty string in the result
date_parts = input_date.split(" ")
print(f"Date Parts = {date_parts}")

# Get the time parts in array
time_parts = date_parts[4].split(":")
print(f"Time Parts = {time_parts}")

Date Parts = ['Sat', 'May', '', '9', '07:33:35', 'CEST', '2020']
Time Parts = ['07', '33', '35']


### Tokenizing a Sentence into Words

In [5]:
sentence = "i am happy because i am learning."
tokenized_sentence = nltk.word_tokenize(sentence)
print(f"{sentence} -> {tokenized_sentence}")

i am happy because i am learning. -> ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']


In [8]:
# Find length of each word in the tokenized sentence
sentence = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']
word_lengths = [(a_word, len(a_word)) for a_word in sentence]

print(f"Lengths of the words: \n{word_lengths}")

Lengths of the words: 
[('i', 1), ('am', 2), ('happy', 5), ('because', 7), ('i', 1), ('am', 2), ('learning', 8), ('.', 1)]


## N-Grams
A sentence is converted to n-grams with a sliding window:
- A window of size n words generates the n-grams
- The window scans the list of words, starting at the beginning of the sentence
- It moves forward one word at a time until it reaches the end of the sentence

In [22]:
def sentence_to_trigram(tokenized_sentence):
    # The last start index i is the third-to-last position,
    # so every window contains exactly 3 words
    trigrams = [tokenized_sentence[i:i+3] for i in range(len(tokenized_sentence) - 2)]
    print(trigrams)

In [23]:
sentence_to_trigram(sentence)

[['i', 'am', 'happy'], ['am', 'happy', 'because'], ['happy', 'because', 'i'], ['because', 'i', 'am'], ['i', 'am', 'learning'], ['am', 'learning', '.']]


In [24]:
def sentence_to_n_gram(tokenized_sentence, n):
    # The last start index i is the n-th position from the end,
    # so every window contains exactly n words
    n_grams = [tokenized_sentence[i:i+n] for i in range(len(tokenized_sentence) - (n-1))]
    print(n_grams)

In [26]:
sentence_to_n_gram(sentence, 3)

[['i', 'am', 'happy'], ['am', 'happy', 'because'], ['happy', 'because', 'i'], ['because', 'i', 'am'], ['i', 'am', 'learning'], ['am', 'learning', '.']]


In [27]:
sentence_to_n_gram(sentence, 2)

[['i', 'am'], ['am', 'happy'], ['happy', 'because'], ['because', 'i'], ['i', 'am'], ['am', 'learning'], ['learning', '.']]


In [28]:
sentence_to_n_gram(sentence, 4)

[['i', 'am', 'happy', 'because'], ['am', 'happy', 'because', 'i'], ['happy', 'because', 'i', 'am'], ['because', 'i', 'am', 'learning'], ['i', 'am', 'learning', '.']]


### Prefix to an n-gram

$$
P \left ( w_n | w^{n-1}_1 \right) = \frac{C\left(w^n_1 \right)}{C\left(w^{n-1}_1\right)}
$$
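
The formula above can be sketched in plain Python by counting n-grams and their prefixes. This is a minimal illustration, not part of the original notebook: the helper names `count_n_grams` and `conditional_probability` are hypothetical, and the estimate is the unsmoothed count ratio, so an unseen prefix simply returns 0.

```python
from collections import Counter

def count_n_grams(tokens, n):
    """Count every n-gram (stored as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1))

def conditional_probability(tokens, prefix, word):
    """Estimate P(word | prefix) as C(prefix + word) / C(prefix)."""
    prefix = tuple(prefix)
    n_gram_counts = count_n_grams(tokens, len(prefix) + 1)
    prefix_counts = count_n_grams(tokens, len(prefix))
    if prefix_counts[prefix] == 0:
        # Assumption: return 0 for a prefix that never occurs (no smoothing)
        return 0.0
    return n_gram_counts[prefix + (word,)] / prefix_counts[prefix]

tokens = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']
p = conditional_probability(tokens, ['i', 'am'], 'happy')
print(p)  # 0.5
```

Here the trigram `i am happy` occurs once and its prefix `i am` occurs twice, so the estimate is 1/2.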

In [29]:
# Get the trigram prefix from a 4-gram
fourgram = ['i', 'am', 'happy','because']
trigram = fourgram[0:-1] # Get the elements from 0, included, up to the last element, not included.
print(trigram)

['i', 'am', 'happy']


**Start-of-sentence and end-of-sentence tokens `<s>` and `</s>`**

In [33]:
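n = 3
# Prepend n-1 start tokens and append one end token;
# the outer list wraps the single padded sentence
tokenized_sentence = [["<s>"] * (n-1) + sentence + ["</s>"]]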
print(tokenized_sentence)

[['<s>', '<s>', 'i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.', '</s>']]
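
As a rough sketch of how the padding combines with the sliding window, the snippet below pads a sentence and then extracts its n-grams. The function name `padded_n_grams` is hypothetical and not from the original notebook; it returns tuples rather than printing lists so the results could be used as dictionary keys.

```python
def padded_n_grams(tokens, n):
    """Pad a tokenized sentence with <s> and </s>, then slide a window of size n."""
    padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
    return [tuple(padded[i:i+n]) for i in range(len(padded) - n + 1)]

tokens = ['i', 'am', 'happy']
trigrams = padded_n_grams(tokens, 3)
print(trigrams)
# [('<s>', '<s>', 'i'), ('<s>', 'i', 'am'), ('i', 'am', 'happy'), ('am', 'happy', '</s>')]
```

With the padding in place, the first real word `i` has a full two-token prefix (`<s> <s>`), so its conditional probability can be estimated the same way as for any other word.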
