## In this notebook we investigate a language model that uses 4-grams and 5-grams to generate sequences of words from a dataset

In [1]:
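import re

Dataset = []
with open("9053.txt", "r") as f:  # open the dataset; the file is closed when the block ends
    for line in f:  # clean the dataset one line at a time
        line = re.sub(r'\@\@\d{8} \@\d{7}/','',line)  # drop markers of the form @@dddddddd @ddddddd/
        line = re.sub(r'<h>|<p>','',line)  # drop the <h> and <p> tags
        line = re.sub(r'  ',' ',line)  # replace double spaces with a single space
        line = re.sub(r'\@','',line)  # drop any remaining @ characters
        Dataset.append(line)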

In [2]:
def four_gram(string):
    '''
    take a string and return a list containing lists of all possible four grams in this string
    '''
    
    string = "<START> "+ string +" <END>"
    word_list = string.split(' ')
    FourGrams = []
    for i in range(len(word_list)-3):
        FourGrams.append([word_list[i],word_list[i+1],word_list[i+2], word_list[i+3]])
    return FourGrams

In [3]:
# test the four_gram function
string = "Hello world today is good"
four_gram(string)

[['<START>', 'Hello', 'world', 'today'],
 ['Hello', 'world', 'today', 'is'],
 ['world', 'today', 'is', 'good'],
 ['today', 'is', 'good', '<END>']]

In [4]:
def five_gram(string):
    '''
    take a string and return a list containing lists of all possible five grams in this string
    '''
    string = "<START> "+ string +" <END>"
    word_list = string.split(' ')
    FiveGrams = []
    for i in range(len(word_list)-4):
        FiveGrams.append([word_list[i],word_list[i+1],word_list[i+2], word_list[i+3], word_list[i+4]])
    return FiveGrams

In [5]:
# test the five_gram function
string = "Hello world today is good"
five_gram(string)

[['<START>', 'Hello', 'world', 'today', 'is'],
 ['Hello', 'world', 'today', 'is', 'good'],
 ['world', 'today', 'is', 'good', '<END>']]
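
The two functions above differ only in the window size, so they could be folded into a single helper. The sketch below is one possible way to do that. The name `n_grams` is hypothetical and does not appear in the notebook, and unlike the cells above it returns tuples and uses `split()` with no argument, which also drops empty tokens and trailing newlines.

```python
def n_grams(text, n):
    """Return all n-grams of `text` as tuples, padded with <START> and <END>."""
    # split() with no argument discards empty tokens and surrounding whitespace
    tokens = ["<START>"] + text.split() + ["<END>"]
    # slide a window of size n over the tokens; too-short input yields an empty list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


for gram in n_grams("Hello world today is good", 4):
    print(gram)
```

Calling `n_grams(text, 5)` on the same string gives the three 5-grams shown in the output of cell 5, as tuples rather than lists.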

In [6]:
from collections import Counter, defaultdict
Total_Count = 0  # init count to zero
FourGram_Dict = defaultdict(lambda: defaultdict(lambda: 0))  # context -> next word -> count, defaulting to 0
for line in Dataset:  # get one line at a time from the dataset
    list_of_grams = four_gram(line)  # get all possible four grams in this line
    Total_Count += len(list_of_grams)  # increment total_count
    for w1,w2,w3,w4 in list_of_grams:  # increase the occurrence count of the 4th word after the first 3 words
        FourGram_Dict[(w1, w2, w3)][w4] += 1
        
for w1_w2_w3 in FourGram_Dict:  # divide each count by the total number of 4-grams to get its relative frequency
    for w4 in FourGram_Dict[w1_w2_w3]:
        FourGram_Dict[w1_w2_w3][w4] /=  Total_Count

# generate a list of all possible start sequences in the dictionary
start_sequences_4 = [fourgrams for fourgrams in list(FourGram_Dict.keys()) if fourgrams[0] == '<START>']

In [7]:
# below code does the same as the cell above but for 5 grams

Total_Count = 0
FiveGram_Dict = defaultdict(lambda: defaultdict(lambda: 0))
for line in Dataset:
    list_of_grams = five_gram(line)
    Total_Count += len(list_of_grams)
    for w1,w2,w3,w4,w5 in list_of_grams:
        FiveGram_Dict[(w1, w2, w3, w4)][w5] += 1
        
for w1_w2_w3_w4 in FiveGram_Dict:  # divide each count by the total number of 5-grams
    for w5 in FiveGram_Dict[w1_w2_w3_w4]:
        FiveGram_Dict[w1_w2_w3_w4][w5] /=  Total_Count
        
start_sequences_5 = [fivegrams for fivegrams in list(FiveGram_Dict.keys()) if fivegrams[0] == '<START>']
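
Dividing every count by the total number of n-grams gives each n-gram's share of the whole dataset. If the conditional probability of the next word given its context is wanted instead, each count can be divided by the total for its own context. The sketch below illustrates that alternative; `conditional_probs` is a hypothetical helper, the two-line `toy` list stands in for `Dataset`, and tokenization uses `split()` rather than the notebook's `split(' ')`.

```python
from collections import Counter, defaultdict


def conditional_probs(lines, n):
    """Map each (n-1)-word context to {next_word: P(next_word | context)}."""
    counts = defaultdict(Counter)
    for line in lines:
        tokens = ["<START>"] + line.split() + ["<END>"]
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            counts[context][tokens[i + n - 1]] += 1

    probs = {}
    for context, counter in counts.items():
        total = sum(counter.values())  # occurrences of this context only
        probs[context] = {word: c / total for word, c in counter.items()}
    return probs


# toy data (an assumption for illustration, not taken from 9053.txt)
toy = ["the cat sat on the mat", "the cat sat on the sofa"]
probs = conditional_probs(toy, 4)
print(probs[("sat", "on", "the")])  # {'mat': 0.5, 'sofa': 0.5}
```

With this normalization the probabilities for each context sum to 1, which is what a sampler that weights words by probability would expect.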

In [8]:
import random

def generate_rand_seq(word_dict, max_len, start_sequences, n):
    '''
    generate a random sequence given the word_dict. the sequence has a maximum length of max_len.
    the start sequence is chosen randomly from the list of start_sequences. n is the number of
    previous words used to predict the next word.
    '''

    seq = list(random.choice(start_sequences))  # get a random start sequence
    length = len(seq)  # get the initial length of the sequence
    while length <= max_len:  # keep adding words as long as max_len is not reached
        # get a list of all possible words to generate given the last n words of the sequence
        # (.get avoids inserting empty entries into the defaultdict)
        possible_words = list(word_dict.get(tuple(seq[-n:]), {}).keys())
        if not possible_words:  # no known continuation: stop instead of looping forever
            break
        new_word = random.choice(possible_words)  # pick the new word uniformly from all possible new words
        seq.append(new_word)  # append the new word to the sequence
        if new_word.strip() == '.':  # if the new word is a . then the sentence is over
            break
        length += 1  # increment the sequence length

    seq.pop(0)  # remove the start token
    return ' '.join(seq)  # return the sequence as a string
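
The function above picks the next word uniformly from the candidates, so the values stored in the dictionary do not influence the result. A possible variation is to sample in proportion to those values. The sketch below assumes a hypothetical `sample_next` helper and a small hand-made dictionary; it relies only on `random.choices`, which accepts relative weights, so raw counts and normalized frequencies behave the same way.

```python
import random


def sample_next(word_dict, context):
    """Pick a next word for `context`, weighted by the stored values.

    Returns None when the context has no known continuation.
    """
    options = word_dict.get(context)  # .get does not create new entries
    if not options:
        return None
    words = list(options.keys())
    weights = list(options.values())  # must be non-negative and not all zero
    return random.choices(words, weights=weights, k=1)[0]


# hand-made example (an assumption for illustration): 'mat' is 3x as likely as 'sofa'
toy_dict = {("sat", "on", "the"): {"mat": 3, "sofa": 1}}
print(sample_next(toy_dict, ("sat", "on", "the")))
print(sample_next(toy_dict, ("never", "seen", "before")))
```

Weighted sampling favours continuations that occur often in the dataset, while uniform sampling gives rare continuations the same chance as common ones.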

In [9]:
for _ in range(10):  # generate 10 random sequences with max_len 20 using 4-grams
    print(generate_rand_seq(FourGram_Dict, 20, start_sequences_4, 3))

 What to Do If Youre a Victim of Plagiarism You 're trawling the internet one day and say that
 How Long Does It Take to Write a Book ? How long is a piece of writing , "
 Publisher Word Count for Magazine Writing When you write for print magazines and newspapers .
 Its Okay to be a strong one rather than a content mill that is taking your labor and giving
 Former vs Latter with Examples here 's a statement that seems to contradict itself , but they only see
 Earth Island Journal , a glossy quarterly magazine of Earth Island Institute , its a word .
 Although there are many others .
 50 Creative Writing Ideas and Prompts When it comes to that .
 Random Word Generator is a tool which will do exactly that for you by easily allowing you to change
 Create Your Optimal Writing Place : 1000 Words a Day Writing Challenge If you find this tool useful in


In [10]:
for _ in range(10):  # generate 10 random sequences with max_len 20 using 5-grams
    print(generate_rand_seq(FiveGram_Dict, 20, start_sequences_5, 4))

 Plural Possessives : Why You Put an Apostrophe After the S Its common for people to wonder , "
 A Fine Parent Pays Writers $100 Per Article " You 're a dumbass , " or , " The
 Simple Ways to Improve Your Writing Vocabulary A great vocabulary is just one essential tool in a writers toolbox
 Although there are many opinions on how many types of essays there are , everyone seems to agree on
 Scary Mommy Pays Freelance Writers $100 per Article If it was published in newsprint , Scary Mommy would be
 Stumbling into a Freelance Writing Career When people find out that I 'm a professional writer .
 Become a Better Writer : Preserve and Improve Your Reading Skills Its no secret that reading and writing go
 10 Common Writing Submission Mistakes Writers are sometimes their own worst enemies .
 WEB PAGE WORD COUNTER Non-Common Keywords Keyword Quantity All Keywords Keyword Quantity There may be certain times when instead
 20 Helpful and Fun Products for Writers Most writers do n't need a lot of 

## Comparison between 4-grams and 5-grams
At first glance the two sets of sentences look similar. However, almost all of the sentences generated by the 5-gram model make sense, while the same cannot be said for those generated by the 4-gram model. A likely explanation is that a four-word context often matches only one place in the dataset, so the 5-gram model has a single choice for the next word. In that case the model simply completes an existing sentence from the dataset, which reads well because it is not putting together words from different sentences.
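
One way to check this explanation would be to measure how often a context has exactly one possible next word. The sketch below is only an illustration: `single_choice_fraction` is a hypothetical helper, and the small dictionary is made up rather than taken from the dataset.

```python
def single_choice_fraction(word_dict):
    """Fraction of contexts that have exactly one possible next word.

    Contexts with no continuation at all are ignored.
    """
    sizes = [len(next_words) for next_words in word_dict.values() if next_words]
    if not sizes:
        return 0.0
    return sum(1 for size in sizes if size == 1) / len(sizes)


# made-up example: one context with a single option, one with two
toy_dict = {
    ("sat", "on", "the"): {"mat": 1, "sofa": 1},
    ("the", "cat", "sat"): {"on": 2},
}
print(single_choice_fraction(toy_dict))  # 0.5
```

Running the same function on `FourGram_Dict` and `FiveGram_Dict` would show whether the 5-gram contexts are in fact more often limited to a single continuation.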