<a href="https://colab.research.google.com/github/mir-abir-hossain/NLP-projects/blob/main/Sentence_generation_using_ngram_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

###Goal of the project: Creating a bigram model from the Brown Corpus and evaluate the perplexity of the model

*Creating training and testing data from the brown corpus*

In [71]:
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

In [72]:
from nltk.corpus import brown
categories = brown.categories()
for i in categories:
    print(i, end=', ')

adventure, belles_lettres, editorial, fiction, government, hobbies, humor, learned, lore, mystery, news, religion, reviews, romance, science_fiction, 

In [181]:
train_lines = brown.sents(categories = ['adventure', 'fiction', 'science_fiction'])
test_lines = brown.sents(categories = ['mystery'])

In [182]:
len(train_lines), len(test_lines)

(9834, 3886)

In [183]:
print(train_lines[:10], test_lines[:10])

[['Thirty-three'], ['Scotty', 'did', 'not', 'go', 'back', 'to', 'school', '.'], ['His', 'parents', 'talked', 'seriously', 'and', 'lengthily', 'to', 'their', 'own', 'doctor', 'and', 'to', 'a', 'specialist', 'at', 'the', 'University', 'Hospital', '--', 'Mr.', 'McKinley', 'was', 'entitled', 'to', 'a', 'discount', 'for', 'members', 'of', 'his', 'family', '--', 'and', 'it', 'was', 'decided', 'it', 'would', 'be', 'best', 'for', 'him', 'to', 'take', 'the', 'remainder', 'of', 'the', 'term', 'off', ',', 'spend', 'a', 'lot', 'of', 'time', 'in', 'bed', 'and', ',', 'for', 'the', 'rest', ',', 'do', 'pretty', 'much', 'as', 'he', 'chose', '--', 'provided', ',', 'of', 'course', ',', 'he', 'chose', 'to', 'do', 'nothing', 'too', 'exciting', 'or', 'too', 'debilitating', '.'], ['His', 'teacher', 'and', 'his', 'school', 'principal', 'were', 'conferred', 'with', 'and', 'everyone', 'agreed', 'that', ',', 'if', 'he', 'kept', 'up', 'with', 'a', 'certain', 'amount', 'of', 'work', 'at', 'home', ',', 'there', 'wa

*step 1: creating sentences from the lists and turn everything into lowercase*

In [184]:
train_sentences = [" ".join(sent).lower() for sent in train_lines]
test_sentences = [" ".join(sent).lower() for sent in test_lines]

In [185]:
for i, sent in enumerate(train_sentences[:10]):
    print(i, ': ', sent)
print('----------------')
for i, sent in enumerate(test_sentences[:10]):
    print(i, ': ', sent)

0 :  thirty-three
1 :  scotty did not go back to school .
2 :  his parents talked seriously and lengthily to their own doctor and to a specialist at the university hospital -- mr. mckinley was entitled to a discount for members of his family -- and it was decided it would be best for him to take the remainder of the term off , spend a lot of time in bed and , for the rest , do pretty much as he chose -- provided , of course , he chose to do nothing too exciting or too debilitating .
3 :  his teacher and his school principal were conferred with and everyone agreed that , if he kept up with a certain amount of work at home , there was little danger of his losing a term .
4 :  scotty accepted the decision with indifference and did not enter the arguments .
5 :  he was discharged from the hospital after a two-day checkup and he and his parents had what mr. mckinley described as a `` celebration lunch '' at the cafeteria on the campus .
6 :  rachel wore a smart hat and , because she had bee

step 2: removing punctuation marks and adding start and end token for every sentence

In [186]:
import string

def initial_preprocess(sentences):
    sents = [sent.translate(str.maketrans('', '', string.punctuation)) for sent in sentences]
    sents = ['<s> ' + sent + ' </s>' for sent in sents]
    return sents

preprocessed_train_sentences = initial_preprocess(train_sentences)
preprocessed_test_sentences = initial_preprocess(test_sentences)

In [187]:
for i, sent in enumerate(preprocessed_train_sentences[:7]):
    print(i, ': ', sent)
print('----------------')
for i, sent in enumerate(preprocessed_test_sentences[:7]):
    print(i, ': ', sent)

0 :  <s> thirtythree </s>
1 :  <s> scotty did not go back to school  </s>
2 :  <s> his parents talked seriously and lengthily to their own doctor and to a specialist at the university hospital  mr mckinley was entitled to a discount for members of his family  and it was decided it would be best for him to take the remainder of the term off  spend a lot of time in bed and  for the rest  do pretty much as he chose  provided  of course  he chose to do nothing too exciting or too debilitating  </s>
3 :  <s> his teacher and his school principal were conferred with and everyone agreed that  if he kept up with a certain amount of work at home  there was little danger of his losing a term  </s>
4 :  <s> scotty accepted the decision with indifference and did not enter the arguments  </s>
5 :  <s> he was discharged from the hospital after a twoday checkup and he and his parents had what mr mckinley described as a  celebration lunch  at the cafeteria on the campus  </s>
6 :  <s> rachel wore a sma

step 3: replace the words, that occurs only once in the corpus, with < UNK > token

In [188]:
from nltk.probability import FreqDist

def generate_tokens(sentence):
    """
    Takes a list of sentences with start and end tokens.
    Replace the words which occur only once in the corpus with
    '<UNK>' token and return the list of all tokens.
    For example:
    Args:
        sentences(list):
        ['<s> this is first sentence </s>', '<s> this is second sentence </s>']

    Returns:
        tokens_with_unk(list):
        ['<s>', 'this', 'is', '<UNK>', 'sentence', '</s>', '<s>', 'this', 'is', '<UNK>', 'sentence', '</s>']

    """
    unk = '<UNK>'
    tokens = ' '.join(sentence).split()
    vocab = FreqDist(tokens)
    freq_one = [i for i in vocab.keys() if vocab[i]==1]
    tokens_with_unk = []
    for word in tokens:
        if word in freq_one:
            tokens_with_unk.append(unk)
        else:
            tokens_with_unk.append(word)

    return tokens_with_unk


In [189]:
train_tokens = generate_tokens(preprocessed_train_sentences)
test_tokens = generate_tokens(preprocessed_train_sentences)

*step 4: Create n-grams using these tokens*

In [190]:
def ngrams(tokens, n=2):
    """
    Create n-grams and return unique n-grams with their corresponding counts.

    Args:
        tokens (list): list of tokens
        n(int) = 1 for unigram, 2 for bigram

    Returns:
    n-grams(dict): dictionary of n-grams as a tuple and it's corresponding count.

    Example:
        tokens = ['<s>', 'this', 'is', '<UNK>', 'sentence', '</s>',
                '<s>', 'this', 'is', '<UNK>', 'sentence', '</s>']
        For n = 2,
        n_grams: {
                ('<s>', 'this') : 2,
                ('this', 'is') : 2,
                ('is', '<UNK>') : 2,
                ('<UNK>', 'sentence') : 2,
                ('</s>' '<s>') : 1,
                ('sentence', '</s>') : 2
                }
    """
    ngram_list = list(nltk.ngrams(tokens, n))
    ngram_dicts = {}
    for pairs in ngram_list:
        if pairs not in ngram_dicts:
            ngram_dicts[pairs] = 1
        else:
            ngram_dicts[pairs] += 1

    return ngram_dicts


In [191]:
bigram_dicts = ngrams(train_tokens, 2)
unigram_dicts = ngrams(train_tokens, 1)

In [192]:
for k, v in bigram_dicts.items():
    if v > 250:
        print(k, v)

('</s>', '<s>') 9833
('and', '<UNK>') 326
('a', '<UNK>') 437
('at', 'the') 281
('it', 'was') 320
('of', 'the') 817
('<UNK>', 'of') 345
('<UNK>', '</s>') 998
('<s>', 'he') 1090
('he', 'was') 346
('<UNK>', 'and') 431
('on', 'the') 376
('<UNK>', '<UNK>') 540
('<s>', 'the') 871
('the', '<UNK>') 919
('<s>', 'i') 491
('he', 'had') 336
('in', 'the') 652
('<s>', 'she') 309
('<s>', 'it') 334
('to', 'the') 391
('<s>', 'but') 264


In [193]:
vocab = FreqDist(train_tokens)
len(vocab)

6774

In [194]:
vocab

FreqDist({'<s>': 9834, '</s>': 9834, 'the': 8295, '<UNK>': 7089, 'and': 3770, 'to': 3136, 'of': 3079, 'a': 3008, 'he': 2782, 'was': 2210, ...})

step 5: apply the laplace smoothing formula

In [195]:
def laplace_smoothed_bigram(bigram, bigram_count, unigram_dicts, vocab_size):
    unigram = bigram[0]
    unigram = tuple(list(unigram.split()))
    if unigram in unigram_dicts.keys():
        unigram_count = unigram_dicts[unigram]
    else:
        unigram_count = 0
    smoothed_prob = (bigram_count + 1)/(unigram_count + vocab_size)

    return smoothed_prob

In [196]:
def smoothing(bigram_dict):
    return {n_gram: laplace_smoothed_bigram(n_gram, count, unigram_dicts, len(vocab)) \
            for n_gram, count in bigram_dicts.items()}

In [197]:
model = smoothing(bigram_dicts)
sorted(model.items(), key=lambda x: x[1], reverse=True)[:20]

[(('</s>', '<s>'), 0.5921242774566474),
 (('of', 'the'), 0.08302039987820968),
 (('in', 'the'), 0.07419611407794569),
 (('<UNK>', '</s>'), 0.07206232417225708),
 (('<s>', 'he'), 0.0656912331406551),
 (('the', '<UNK>'), 0.06105249187072798),
 (('<s>', 'the'), 0.05250481695568401),
 (('on', 'the'), 0.04843890530643711),
 (('a', '<UNK>'), 0.04477611940298507),
 (('to', 'the'), 0.03955600403632694),
 (('it', 'was'), 0.039261252446183954),
 (('<UNK>', '<UNK>'), 0.039024742119310396),
 (('at', 'the'), 0.03726215644820296),
 (('he', 'was'), 0.036312264545835075),
 (('he', 'had'), 0.035265801590623695),
 (('<UNK>', 'and'), 0.031162086128543605),
 (('and', '<UNK>'), 0.03101289833080425),
 (('him', '</s>'), 0.03014811901953074),
 (('<s>', 'i'), 0.029624277456647398),
 (('from', 'the'), 0.027986348122866895)]

*step 6: Calculating perplexity*

In [198]:
masks = [[1,1], [1, 0], [0, 1], [0, 0]]

def convert_oov(ngram):
    """Converts, if necessary, a given n-gram to one which is known by the model.
    Args:
        ngram (tuple): a bigram tuple. for ex: ("the", "great")
    Returns:
        The n-gram with <UNK> tokens in certain positions such that the model
        contains an entry for it.

    """
    mask = lambda ngram, bitmask: tuple((token if flag == 1 else "<UNK>" for token,flag in zip(ngram, bitmask)))

    ngram = (ngram,) if type(ngram) is str else ngram
    for possible_known in [mask(ngram, bitmask) for bitmask in masks]:
        if possible_known in model:
            return possible_known

In [199]:
test_ngrams = nltk.ngrams(test_tokens, 2)
N = len(test_tokens)
known_ngrams  = (convert_oov(ngram) for ngram in test_ngrams)
probs = [model[ngram] for ngram in known_ngrams]

In [200]:
import math
import numpy as np

def perplexity(prob, N):
    perplexity = np.exp(sum([np.log(x) for x in prob])*(-1)/N)
    return perplexity

In [201]:
pps = perplexity(probs, N)
print(f"Perplexity of the model is: {pps}")

Perplexity of the model is: 616.8423318232044


*step 7: Generate Sentence*

In [202]:
import random

def best_candidate(prev, i, without=[]):
    """Choose the most likely next token given the previous (n-1) tokens.
    Args:
        prev (tuple of str): the previous n-1 tokens of the sentence.
        i (int): which candidate to select if not the most probable one.
        without (list of str): tokens to exclude from the candidates list.
    Returns:
        A tuple with the next most probable token and its corresponding probability.

    """

    blacklist  = ["<UNK>"] + without
    candidates = ((ngram[-1], prob) for ngram, prob in model.items() if ngram[:-1]==prev)
    candidates = filter(lambda candidate: candidate[0] not in blacklist, candidates)
    candidates = sorted(candidates, key=lambda candidate: candidate[1], reverse=True)

    if len(candidates) == 0:
        return ("</s>", 1)
    else:
        candidate_index = int((random.randint(0, len(candidates)))/2)
        return candidates[candidate_index if prev != () and prev[-1] != "<s>" else i]

def generate_sentences(num, min_len=12, max_len=24):
    """Generate random sentences using the language model.
    Args:
        num (int): the number of sentences to generate.
        min_len (int): minimum allowed sentence length.
        max_len (int): maximum allowed sentence length.
    Yields:
        A tuple with the generated sentence and the combined probability
        (in log-space) of all of its n-grams.

    """
    for i in range(num):
        sent, prob = ["<s>"], 1
        while sent[-1] != "</s>":
            prev = tuple(sent[-(1):])
            blacklist = sent + (["</s>"] if len(sent) < min_len else [])
            next_token, next_prob = best_candidate(prev, i, without=blacklist)
            sent.append(next_token)
            prob *= next_prob

            if len(sent) >= max_len:
                sent.append("</s>")

        yield ' '.join(sent), -1/math.log(prob)

In [203]:
print("Generating sentences...")
for sentence, prob in generate_sentences(num = 5):
    print("{} ({:.5f})".format(sentence, prob))

Generating sentences...
<s> he opened a chinless jake called and rolled between his bones of this mood it from hardtack into battle andrei remembered to roll </s> (0.00575)
<s> the protest from it will find a blanket much was of july </s> (0.01041)
<s> i really die of study room in spite and learn if a little payne did so nice people which gives something heroic even </s> (0.00569)
<s> it but when in four or out as your trial period mike turned on now i meant him an afternoons after wilson made </s> (0.00576)
<s> she threw him be doing on by its estate people talk with importance to abel </s> (0.00820)
