Language Modelling and Text Generation


### Introduction
An n-gram is a contiguous sequence of n words. For example "Machine" is a unigram, "Machine Learning" is a bigram and "Machine Learning PA2" is a trigram. In language modeling, n-gram models are probabilistic models of text that use word dependencies and context to predict the likelihood of occurence of an n-gram, i.e. predicting the nth word in an n-gram based on the previous n-1 words:
$$
P(ngram) =  P(word|context) = P(x^{n}|x^{n-1},...,x^{1})
$$
One use of the predictions made by such a model is text generation. In this part you will be training your own n-gram model and using it to generate text after learning from the provided Urdu short stories. 
<br><br>
For additional details of the working of n-gram models, you can also consult [Chapter 3](https://web.stanford.edu/~jurafsky/slp3/3.pdf) of the Speech and Language Processing book as and references.


### Dataset
You will be using the Urdu short stories by Patras Bukhari given in the folder `Urdu Short Stories`. This contains 6 stories of varying lengths which will serve as inputs for your n-gram model. 
We will implement an n-gram model that uses the given stories to generate Urdu text that mimics the input stories.

Start by importing all required libraries here.

In [318]:
# import all required libraries here
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import scipy as sp
import nltk
import os
import urduhack

### 1.1 - Loading and Preprocessing the Dataset

Read in the short story files given and tokenize the text to be preprocessed.

In [319]:
# code here
data=[]
os.chdir(r"C:\Users\Lenovo\Desktop\Assignment 4\DataP1")
for file in os.listdir():
    if file.endswith(".txt"):
        with open(file, "r", encoding='utf8') as f:
            data.append(f.read())

Preprocess the tokenized data. Go through the data and use your own discretion to decide on what kind of pre-processing might be required.

In [320]:
# code here
tokenized=[]
for i in range(6):
     tokenized.append(urduhack.preprocessing.remove_punctuation(data[i]))
tokenized = [urduhack.preprocessing.remove_accents(i) for i in tokenized]
#tokenized=[nltk.word_tokenize(i) for i in tokenized]
for i in range(len(tokenized)):
    tokenized[i] = tokenized[i].replace("\n", "")
y=[]
for i in range(6):
    y.append(tokenized[i].split(" "))
stop_words=urduhack.stop_words.STOP_WORDS
y=[[w for w in i if not w in stop_words] for i in y]




### 1.2 - Creating Unigrams

Start by training a unigram model. For a unigram model, the n-gram probability is approximated by probability of the word in the unigram, as the model assumes independence:

$$
P(word) = \frac{n}{N}
$$

where n = count of the word in the corpus and N = total number of words in the corpus.

Generate a list of unigrams. Print the first 10 unigrams obtained.

In [321]:
# code here
unigram = y
unigram = [i for list in unigram for i in list]
print("Unigram: ")
unigram_set = set(unigram)
unigram_set = list(unigram)
for i in range(10):
    print(unigram_set[i])


Unigram: 
سینما
عشق
عنوان
عجب
ہوس
خیز
افسوس
مضمون
توقعات
مجروح


Find the probabilities for each unique unigram. 

In [322]:
# code here
uni_dict={}
for i in unigram_set:
    uni_dict[i]=(unigram.count(i))/len(unigram)
print(len(list(uni_dict.values())))

3080


### 1.3 - Creating Bigrams
Now train a bigram model. 

Generate a list of bigrams. Print the first 10 bigrams obtained.

In [323]:
# code here
bigram=[]
for i in range(len(unigram)):
    if i+1<len(unigram):
        bigram.append(unigram[i]+" "+unigram[i+1])
bigram_set = set(bigram)
bigram_set = list(bigram_set)
for i in range(10):
    print(bigram_set[i])

منصوبے باندھ
انگلیوں بربط
طول البلد
سورج کرنیں
پیداوار طلباء
کتابوں متعلق
لاہور حدود
بہنے شغل
باتیں کیں
دس لالہ


Find the probabilities for each unique bigram. 

In [324]:
# code here
bi_dict={}
for i in bigram_set:
    bi_dict[i]=(bigram.count(i))/len(bigram)
prob = list(bi_dict.values())
print(sum(prob))

1.0000000000001017


### 1.4 - Creating Trigrams
Lastly train a trigram model.

Generate a list of trigrams. Print the first 10 trigrams obtained.

In [325]:
# code here
trigram=[]
for i in range(len(unigram)):
    if i+2<len(unigram):
        trigram.append(unigram[i]+" "+unigram[i+1]+" "+unigram[i+2])
trigram_set = set(trigram)
trigram_set = list(trigram_set)
for i in range(10):
    print(trigram_set[i])

مسافروں سفر نقل
داخل اندھیرا گھپ
بھیجنے شہر روایات
سینے دردمند دل
دکھتی رگ نام
بنائیں چنانچہ تجویز
ليے یوں قوت
آئندہ لوگ پروفیسر
فارسی شخص فیل
تقریبا غلط حقیقت


Find the probabilities for each unique trigram. 

In [326]:
# code here
tri_dict={}
for i in trigram_set:
    tri_dict[i]=(trigram.count(i))/len(trigram)

### 1.5 - Generating Text
Generate a paragraph with ten sentences each containing 9-15 words (pick the length of the sentence randomly within this range) using you language model. Start with trigrams, use back-off technique (i.e. use n-1 gram) if a token is not available. 

For each word prediction, get top 5 most probabale words using the n-gram model and then pick the next word randomly from within these. This is being done to avoid excessive repetitive sequences in your generated text.

In [327]:
import random
length_rand = [9,10,11,12,13,14,15]


def prob(x):
    y = pd.DataFrame.from_dict(x, orient='index', columns=['prob'])
    y = y.sort_values(by=['prob'], ascending=False)
    y = y[0:5]
    y = y.to_dict()
    y = random.choice(list(y['prob'].keys()))
    return y

#select first word

def text():
    x = random.choice(list(uni_dict.keys()))
    #select second word
    second = {}
    for i in bi_dict:
        if i.startswith(x):
            second[i] = bi_dict[i]
    sec = prob(second)
    #add second word to sentence
    sentence = x+" "+sec.split(" ")[1]

    #select third word
    third = {}
    for i in tri_dict:
        if i.startswith(sec):
            third[i] = tri_dict[i]
    third = prob(third)
    #add third word to sentence
    sentence = sentence+" "+third.split(" ")[2]
    length = random.choice(length_rand)
    for i in range(length):
        #select ith word
        temp = sentence.split(" ")
        temp = temp[-2]+" "+temp[-1]
        #print(temp)
        third = {}
        for i in tri_dict:
            if i.startswith(temp):
                third[i] = tri_dict[i]

        if third == {}:
            temp = temp.split(" ")
            temp = temp[1]
            third = {}
            for i in bi_dict:
                if i.startswith(temp):
                    third[i] = bi_dict[i]
            if third == {}:
                temp = random.choice(list(uni_dict.keys()))
                sentence = sentence + " " + temp
                continue

            third = prob(third)
            sentence += " "+third.split(" ")[1]
            continue
        third = prob(third)
        #add ith word to sentence
        sentence += " "+third.split(" ")[2]
    return length,sentence

for i in range(10):
    print("Length: ",text())


Length:  (10, 'کہاجی کہنے سوال جواب پرنسپل مشورہ رضا مند چنانچہ بی اے سرٹفکیٹ جانا')
Length:  (9, 'مکھیاں مچھر مارنے کئی کئی افسر مقرر اگلے سال پیشین گوئی ضرورت')
Length:  (13, 'قطع روزگارڈن کہلاتا عظیم الشان تصانیف خدا عظمت آثار دکھائی صبح وقت الله میاں یاد چیز')
Length:  (12, 'کرکٹ ٹیم ڈنر شامل مطمح نظروسیع تقریر عام تینوں موقعوں کام آسکتی چنانچہ سامعین سہولت')
Length:  (10, 'کیجئے مصنوعی پن دیجئے بدولت غریب نام مانوس گے دوپہر وقت درختوں سایے')
Length:  (10, 'الجھن  معلوم پرچوں لکھ اچھی جانتا ممتحن لوگ نشے حالت پرچے دیکھیں')
Length:  (12, 'طریقہ باقی فارسی فیل گے اگلے سال مطلب یوں ادا ہاسٹل آب وہوا اچھی صفائی')
Length:  (14, 'عورتوں کائنات رہبر مردوں حشرات الارض سمجھتی بات نظرانداز میبل دن دس بارہ شادیاں بیٹھتے چنانچے گھر')
Length:  (14, 'رساند دانا اندراں حیراں بماند ڈیڑھ مہینے شخصیت ہاسٹل زندگی انحصا ر دومضمونوں وقتا فوقتا خیالات اظہار')
Length:  (12, 'نسخ صناعی ہنرمندی لکھے کاتب قدرت آفرین ہوگا پنجاب خطہ خطہ نستعلیق جدید جمیل طرز')


### 1.6 - Discussion and Evaluation

- Analyze the text generated, and mention 3 distinct observations. Also compare it with the input text and how different it is and why might that be.
- Is going upto n=3 enough? What do you think would be a good value of n and why?

Answer here:
1. The sentence does not finish naturally, instead it tends to finish abruptly. The sentences are missing connective words like tum aur etc because we removed the stop words. The sentences start abrupty as well. This is because our model does not consider where the sentences in the input data start or end. Maybe we can assign weights to identify these words.

2. having a high n value will be problematic as the rarity of n grams will increase, the model will take higher time to search the data for such a specific n gram and it wont most likely exist. A very low value will result in very inaccurate predictions. A good value would either be 3 or 4. We should test the model at n=4 as well to see if we get more accurate anwsers.