# Markov Text Generator

Write a bare-bones Markov text generator.
Implement a function of the form

        finish_sentence(sentence, n, corpus, deterministic=False)

that takes four arguments:
1. a sentence [list of tokens] that we're trying to build on,
2. n [int], the length of n-grams to use for prediction,
3. a source corpus [list of tokens], and
4. a flag indicating whether the process should be deterministic [bool],

and returns an extended sentence until the first ., ?, or ! is found OR until it has 10 total
tokens.


If the input flag deterministic is true, choose at each step the single most probable next
token. When two tokens are equally probable, choose the one that occurs first in the corpus.
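
The tie-breaking rule can be illustrated with a minimal sketch. It assumes the candidate next tokens are listed in the order they occur in the corpus; the helper name `most_probable` and the toy token list are hypothetical and not part of the assignment.

```python
from collections import Counter

def most_probable(candidates):
    """Return the most frequent token; ties go to the token seen first."""
    counts = Counter(candidates)  # keys keep first-seen order
    # max() returns the first key that reaches the highest count
    return max(counts, key=counts.get)

# "cat" and "dog" both occur twice, but "cat" occurs first
print(most_probable(["cat", "dog", "dog", "cat", "bird"]))  # cat
```

Relying on insertion order keeps the rule implicit: no separate bookkeeping of first positions is needed.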

If deterministic is false, draw the next word randomly from the appropriate distribution.
Use stupid backoff and no smoothing.
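
One way to read "the appropriate distribution" is that each candidate is drawn with probability proportional to its count after the current context, and that, without smoothing, unseen tokens have zero probability. The sketch below assumes that reading; `sample_next` and the toy counts are hypothetical.

```python
import random

def sample_next(words, counts, rng=random):
    """Draw one token with probability proportional to its count."""
    return rng.choices(words, weights=counts, k=1)[0]

words = ["in", "a", "the"]
counts = [3, 1, 1]  # "in" should be drawn about 60% of the time
rng = random.Random(0)  # seeded only to make the sketch repeatable
draws = [sample_next(words, counts, rng) for _ in range(1000)]
print(draws.count("in") / len(draws))
```

Calling `random.choice` on the list of distinct candidates would instead sample uniformly and ignore the counts.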

Provide some example applications of your function in both deterministic and
stochastic modes, for a few sets of seed words and a few different n.

## Packages used


In [1]:
import random
import numpy as np
import nltk

## Code

### 1. Backoff function
In a backoff n-gram model, if the n-gram we need has zero counts, we approximate it by backing off to the (n-1)-gram. We continue backing off until we reach a history that has some counts.
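
As a standalone illustration on a toy corpus, the backoff idea can be sketched as follows. The helper `next_token_counts` and the toy corpus are hypothetical. The constant discount that stupid backoff applies to lower-order scores is omitted here, on the assumption that only the ranking within a single order matters for picking the next token.

```python
def next_token_counts(context, corpus):
    """Count the tokens that follow `context`, shortening it until a match exists."""
    context = list(context)
    while True:
        k = len(context)
        counts = {}  # insertion order records first appearance in the corpus
        for i in range(len(corpus) - k):
            if corpus[i:i + k] == context:
                following = corpus[i + k]
                counts[following] = counts.get(following, 0) + 1
        if counts or k == 0:
            return context, counts
        context = context[1:]  # drop the oldest token and retry

toy = "the cat sat . the dog sat . a cat ran .".split()
# "dog ran" never occurs, so the context shrinks to "ran"
print(next_token_counts(["dog", "ran"], toy))  # (['ran'], {'.': 1})
# "the" is followed once by "cat" and once by "dog"
print(next_token_counts(["the"], toy))  # (['the'], {'cat': 1, 'dog': 1})
```

Returning the context that finally matched makes it easy to see how far the model had to back off.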

In [3]:
def backoff(last_n_words, n, corpus):
    """Return the candidate next tokens and their counts, backing off to shorter contexts."""
    # Use at most n-1 context tokens; a slice with -0 would return the whole list, so handle n == 1
    context = list(last_n_words[-(n - 1):]) if n > 1 else []
    while True:
        k = len(context)
        words, freqs = [], []  # candidates in order of first appearance in the corpus
        for i in range(len(corpus) - k):
            if corpus[i:i + k] == context:
                nxt = corpus[i + k]
                if nxt in words:
                    freqs[words.index(nxt)] += 1
                else:
                    words.append(nxt)
                    freqs.append(1)
        if words or k == 0:
            return words, np.array(freqs)
        context = context[1:]  # no match found: drop the oldest context token and retry

### 2. Finish a sentence with n-grams


In [4]:
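def finish_sentence(sentence, n, corpus, deterministic=False):
    tokens = list(sentence)  # work on a copy so the caller's seed is not modified
    while len(tokens) < 10:
        # The context is the last n-1 tokens (empty for unigrams, since tokens[-0:] is the whole list)
        last_n_words = tokens[-(n - 1):] if n > 1 else []
        last_n_words_in_corpus, freqs = backoff(last_n_words, n, corpus)
        if deterministic:
            # argmax returns the first maximum, i.e. the tied candidate seen first in the corpus
            next_word = last_n_words_in_corpus[np.argmax(freqs)]
        else:
            # Sample in proportion to the observed counts rather than uniformly
            next_word = random.choices(list(last_n_words_in_corpus), weights=list(freqs))[0]
        tokens.append(str(next_word))
        if next_word in ['.', '!', '?']:
            break

    return tokens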

### 3. Test
First, we set the corpus for the generator as follows:

In [5]:
corpus = nltk.word_tokenize(
        nltk.corpus.gutenberg.raw("austen-sense.txt").lower())

#### Case 1

In the first test case, we set:

        tokens = ["she", "was", "not"]
        n  =  3
        deterministic = True

We get the expected result:

        ['she', 'was', 'not', 'in', 'the', 'world', '.']

#### Case 2
Next, we set

        deterministic = False

so that each next word is drawn at random. Two runs give the following results:

        First run: ['she', 'was', 'not', 'it', 'what', 'you', 'tell', 'me', 'that', 'it']
        Second run: ['she', 'was', 'not', 'beyond', 'one', 'day', 'that', 'she', 'wishes', 'for']

#### Case 3
Then we change the length of the n-grams used for prediction to

        n = 4

and get the following result:

        ['she', 'was', 'not', 'in', 'the', 'habit', 'of', 'seeing', 'much', 'occupation']

#### Case 4
We change the seed tokens to

        ['she', "was", "not", "that"]

and get the following:

        ['she', 'was', 'not', 'that', 'i', 'have', 'been', 'so', 'happy', ',']

#### Case 5
We keep the seed tokens from Case 4 and change n:

        tokens = ['she', "was", "not", "that"]
        n = 2

We get the following result:

        ['she', 'was', 'not', 'that', 'she', 'was', 'not', 'be', 'a', 'very']

#### Case 6
We keep the seed tokens and n from Case 5 and set deterministic to False:

        tokens = ['she', "was", "not", "that"]
        n = 2
        deterministic = False

We get the following:

        ['she', 'was', 'not', 'that', 'all', 'cousins', 'the', 'girls', ';', 'prepared']
