# N-gram Language Model
In this exercise, we build an n-gram language model from the Shakespeare corpus and use it to generate Shakespeare-style text. The same technique can then be applied to build language models from other corpora.

In [None]:
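# Step 1: obtain the corpus
# NOTE: the NLTK corpus reader already tokenizes the text, so no separate tokenization step is needed
# (if the corpus is not installed yet, run nltk.download('shakespeare') once)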
from nltk.corpus import shakespeare
books = shakespeare.fileids()
all_words = []
for bk in books:
    all_words += shakespeare.words(bk)
all_words = [ww.lower() for ww in all_words]
print(all_words[:30])

In [None]:
# Step 2: extract n-grams from all_words

def extract_ngrams(word_list, n_value):
    # INPUT: word_list, a list of tokens
    # INPUT: n_value, the desired value of n
    # OUTPUT: a list of n-grams
    pass

n_value = 3  # try 2, 3, 4, ...; the example queries in Steps 3 and 4 pass two context words, so they assume n_value = 3
all_ngrams_list = extract_ngrams(all_words, n_value)
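
One possible way to fill in Step 2 is sketched below. It is only an illustration, not the intended reference solution: the name `extract_ngrams_sketch` is hypothetical, and returning each n-gram as a tuple is an assumption (tuples are convenient because they can later serve as dictionary keys).

```python
def extract_ngrams_sketch(word_list, n_value):
    """Return every contiguous n-gram of word_list as a tuple (hypothetical helper)."""
    if n_value < 1:
        raise ValueError("n_value must be at least 1")
    # Slide a window of width n_value over the tokens.
    # If the list is shorter than n_value, the range is empty and so is the result.
    return [
        tuple(word_list[i:i + n_value])
        for i in range(len(word_list) - n_value + 1)
    ]


tokens = ["the", "tragedy", "of", "hamlet"]
print(extract_ngrams_sketch(tokens, 2))
# [('the', 'tragedy'), ('tragedy', 'of'), ('of', 'hamlet')]
```

The n-grams here do not cross into padding symbols; adding start and end markers is a separate design choice that this sketch leaves out.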

In [None]:
# Step 3: build the n-gram language model (LM)
# the LM should let you look up how often each word follows a given (n-1)-gram context
# for example, suppose I have built a 3-gram LM; when I query model(('tragedy','of')), 
# it should return something like: 
# {'antony': 1, 'hamlet': 1, 'julius': 1, 'macbeth': 1, 'othello': 1, 'romeo': 1},
# indicating that 'tragedy of antony' appears once in the corpus, 
# 'tragedy of hamlet' appears once, etc.
# HINT: you may use the ConditionalFreqDist class from NLTK to build the LM

def build_lm(all_ngrams_list, n_value):
    pass

lang_model = build_lm(all_ngrams_list, n_value)
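# NOTE: lang_model is called like a function here; if build_lm returns a
# ConditionalFreqDist directly, index it instead: lang_model[('tragedy', 'of')]
print(lang_model(('tragedy','of')))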
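
The query behaviour described in Step 3 can be sketched without NLTK. The version below deliberately swaps the hinted `ConditionalFreqDist` for `collections.Counter` so that it runs on its own; the name `build_lm_sketch`, the toy trigrams, and the choice to drop the `n_value` argument (the context length is read off each n-gram) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_lm_sketch(ngrams):
    """Return a callable mapping an (n-1)-word context to next-word counts."""
    counts = defaultdict(Counter)
    for ngram in ngrams:
        # Everything but the last word is the context; the last word is the outcome.
        context, next_word = tuple(ngram[:-1]), ngram[-1]
        counts[context][next_word] += 1

    def model(context):
        # Unseen contexts yield an empty dict instead of raising an error.
        return dict(counts.get(tuple(context), {}))

    return model


# Toy data, not taken from the Shakespeare corpus.
toy_trigrams = [
    ("tragedy", "of", "hamlet"),
    ("tragedy", "of", "macbeth"),
    ("tragedy", "of", "hamlet"),
    ("king", "of", "denmark"),
]
toy_model = build_lm_sketch(toy_trigrams)
print(toy_model(("tragedy", "of")))  # {'hamlet': 2, 'macbeth': 1}
```

Returning a closure keeps the call syntax `lang_model(context)` used in this notebook, whereas a `ConditionalFreqDist` would be indexed with square brackets.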

In [None]:
# Step 4: generate text using the built LM

def generate_text(lang_model, initial_words, wanted_text_length=100, word_selection_strategy='greedy'):
    # Given the first n-1 words, this function uses the language model
    # to generate text that continues them
    # INPUT lang_model: the LM you have built
    # INPUT initial_words: a list/tuple containing the first n-1 words
    # INPUT wanted_text_length: how many words you want the model to generate
    # INPUT word_selection_strategy: how to select the next word from the LM
    #       (e.g. greedy, uniformly random, or sampled according to the n-gram frequencies)
    # OUTPUT: a list of the generated words
    pass

print(' '.join(generate_text(lang_model,('the','tragedy'))))
# the output should resemble text like:
# the tragedy of hamlet sits smiling to my mind did lose it . first musician no . mark antony speak to her wounds : then it was your enemy say so : but words are suited ! i partly feel thee . bid you alexas to mardian bring me word , menecrates , and but thou hast need . benvolio tut , dun ' s something tells me , heaven forgive him ! emilia she give that ? and do invite you to use . mine eyes . exit portia come hither arm ' d till i come : give me grace
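
The generation loop from Step 4 could look roughly like the sketch below. Several details are assumptions rather than part of the exercise: the name `generate_text_sketch`, the strategy label `'weighted'` for frequency-proportional sampling, the optional `rng` argument, the decision to count `wanted_text_length` as newly generated words (the initial words are prepended to the result), and stopping early when a context was never seen. The toy model stands in for a real LM and returns a dict of next-word counts.

```python
import random


def generate_text_sketch(model, initial_words, wanted_text_length=100,
                         word_selection_strategy="greedy", rng=None):
    """Extend initial_words by up to wanted_text_length words (hypothetical helper)."""
    rng = rng or random.Random()
    output = list(initial_words)
    context_size = len(output)  # assumed to equal n - 1
    for _ in range(wanted_text_length):
        # Guard against context_size == 0, where output[-0:] would be the whole list.
        context = tuple(output[-context_size:]) if context_size else ()
        candidates = model(context)
        if not candidates:
            break  # unseen context: nothing to continue with
        words = list(candidates)
        if word_selection_strategy == "greedy":
            next_word = max(words, key=lambda w: candidates[w])
        elif word_selection_strategy == "random":
            next_word = rng.choice(words)
        elif word_selection_strategy == "weighted":
            weights = [candidates[w] for w in words]
            next_word = rng.choices(words, weights=weights)[0]
        else:
            raise ValueError("unknown strategy: " + word_selection_strategy)
        output.append(next_word)
    return output


# Toy counts, not taken from the Shakespeare corpus.
toy_counts = {
    ("the", "tragedy"): {"of": 2},
    ("tragedy", "of"): {"hamlet": 2, "macbeth": 1},
    ("of", "hamlet"): {"the": 1},
    ("hamlet", "the"): {"tragedy": 1},
}


def toy_model(context):
    return toy_counts.get(tuple(context), {})


print(" ".join(generate_text_sketch(toy_model, ("the", "tragedy"), wanted_text_length=6)))
# the tragedy of hamlet the tragedy of hamlet
```

The toy run shows a known weakness of greedy selection: it is deterministic, so it can fall into a cycle. Sampling strategies avoid this at the cost of reproducibility unless the random generator is seeded.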

### Extra task 1: 
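Try different values of n_value and observe how they affect the quality of the generated text. A larger n_value generally makes the text read more fluently, although with very large values most contexts occur only once in the corpus, so the model mostly reproduces passages from the corpus verbatim.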
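
Besides reading the generated text, the effect of n_value can be examined by counting how many contexts leave the model only one possible next word. The sketch below does this on a toy token list; the helper name `single_choice_share` and the toy sentence are assumptions for illustration, and the same function could be applied to `all_words`.

```python
from collections import defaultdict


def single_choice_share(tokens, n_value):
    """Fraction of (n-1)-word contexts that are followed by exactly one distinct word."""
    followers = defaultdict(set)
    for i in range(len(tokens) - n_value + 1):
        context = tuple(tokens[i:i + n_value - 1])
        followers[context].add(tokens[i + n_value - 1])
    if not followers:
        return 0.0  # fewer tokens than n_value: no contexts at all
    single = sum(1 for words in followers.values() if len(words) == 1)
    return single / len(followers)


toy_tokens = "to be or not to be that is the question".split()
for n in (2, 3, 4):
    print(n, round(single_choice_share(toy_tokens, n), 2))
# 2 0.86
# 3 0.86
# 4 1.0
```

A share close to 1.0 means the generator has almost no freedom and will largely copy its training text.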

### Going beyond: 
Build LMs on other corpora (e.g. Reuters, Brown), generate text with them, and compare the styles of the generated text.