# Star Wars N-gram Text Generation

Let's try to generate some text based on the dialogues from the Star Wars scripts (episodes IV, V, and VI).

All the information for this exercise was retrieved from the [Visualizing Star Wars Movie Scripts](https://github.com/gastonstat/StarWars) project.

We start by reading all the dialogue lines from the scripts, which are labeled with the character speaking. We only consider five characters: C-3PO, Darth Vader, Han Solo, Luke, and Leia. We left Chewbacca out of the example for obvious reasons...

We read all the lines of each character and combine them in one single string. We tokenize this string using the `WordPunctTokenizer` and use these tokens to create an NLTK Text object.

In [None]:
import nltk
from nltk import word_tokenize, WordPunctTokenizer
import pandas

wpt = WordPunctTokenizer()

c3po_string = ""
vader_string = ""
solo_string = ""
luke_string = ""
leia_string = ""

def read_lines(path):
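    # sep=r"\s+" and on_bad_lines="skip" replace delim_whitespace=True and
    # error_bad_lines=False, which recent pandas versions deprecate or remove
    lines = pandas.read_csv(path, sep=r"\s+", on_bad_lines="skip")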
    
    c3po_lines = lines.loc[lines['Char'] == 'THREEPIO']['Text']
    solo_lines = lines.loc[lines['Char'] == 'HAN']['Text']
    vader_lines = lines.loc[lines['Char'] == 'VADER']['Text']
    luke_lines = lines.loc[lines['Char'] == 'LUKE']['Text']
    leia_lines = lines.loc[lines['Char'] == 'LEIA']['Text']
    
    
    global vader_string, c3po_string, solo_string, luke_string, leia_string
    c3po_string = c3po_string + " " + " ".join(c3po_lines)
    solo_string = solo_string + " " + " ".join(solo_lines)
    vader_string = vader_string + " " + " ".join(vader_lines)
    luke_string = luke_string + " " + " ".join(luke_lines)
    leia_string = leia_string + " " + " ".join(leia_lines)
    
read_lines('files/SW_EpisodeIV.txt')
read_lines('files/SW_EpisodeV.txt')
read_lines('files/SW_EpisodeVI.txt')

c3po_text = nltk.Text(wpt.tokenize(c3po_string))
solo_text = nltk.Text(wpt.tokenize(solo_string))
vader_text = nltk.Text(wpt.tokenize(vader_string))
luke_text = nltk.Text(wpt.tokenize(luke_string))
leia_text = nltk.Text(wpt.tokenize(leia_string))
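
To see what the tokenization step produces, here is a minimal sketch on a single line of dialogue. The sample sentence and the variable names are only for illustration; `WordPunctTokenizer` splits on word characters and on runs of punctuation, so contractions such as "I'm" become three tokens.

```python
import nltk
from nltk import WordPunctTokenizer

# Illustrative sample line, not read from the script files
sample = "I'm Luke Skywalker. I'm here to rescue you!"

# Words and punctuation runs become separate tokens
tokens = WordPunctTokenizer().tokenize(sample)
print(tokens)

# An NLTK Text object wraps the token list and behaves like a sequence
sample_text = nltk.Text(tokens)
print(len(sample_text))
```

Because punctuation marks are tokens of their own, they also take part in the n-grams built later, which is why generated text can contain isolated apostrophes and periods.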

Using these Text objects, we can generate text in the same way as in the previous examples.

The following `generate_text_backoff` function expects a seed of three words and tries to generate each new word from a 4-gram probability, that is, conditioned on the three previous words. If that context was never seen, it falls back to the trigram model and then to the bigram model. If none of them is successful, it just stops. Recall from the POS tagging session that this is known as a backoff strategy.

In [None]:
def generate_text_backoff(text, initialword, numwords):
   
    #ngrams
    ngrams = list(nltk.ngrams(text, 4,  pad_right=True, pad_left=True))
    ngram_pairs = (((w0, w1, w2), w3) for w0, w1, w2, w3 in ngrams)
    cpdNgram = nltk.ConditionalProbDist(nltk.ConditionalFreqDist(ngram_pairs), nltk.MLEProbDist)

    #trigram 
    trigrams = list(nltk.ngrams(text, 3,  pad_right=True, pad_left=True))
    trigram_pairs = (((w0, w1), w2) for w0, w1, w2 in trigrams)
    cpd3gram = nltk.ConditionalProbDist(nltk.ConditionalFreqDist(trigram_pairs), nltk.MLEProbDist)

    #bigram
    bigrams = list(nltk.ngrams(text, 2))
    cpd2gram = nltk.ConditionalProbDist(nltk.ConditionalFreqDist(bigrams), nltk.MLEProbDist)
    
    
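    # copy the seed so the caller's list is not modified
    word = list(initialword)
    for i in range(numwords):
        # word[i], word[i+1], word[i+2] are the last three words so far
        #try n-gram
        if (word[i],word[i+1], word[i+2]) in cpdNgram:
            w = cpdNgram[(word[i],word[i+1], word[i+2])].max()
        #try 3-gram
        elif (word[i+1],word[i+2]) in cpd3gram:
            w = cpd3gram[(word[i+1],word[i+2])].max()
        #try 2-gram
        elif word[i+2] in cpd2gram:
            w = cpd2gram[word[i+2]].max()
        #at least we tried...
        else:
            break

        # padding can make None the most likely next token: stop there
        if w is None:
            break

        word.append(w)
    
    return " ".join(word)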
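
To make the lookups inside the function easier to follow, here is a small sketch on a toy corpus (the sentence below is made up for illustration and is not taken from the scripts). It shows the three operations the generator relies on: picking the most likely next word with `max()`, checking whether a context was ever seen with `in`, and the `None` placeholders that padding adds.

```python
import nltk

# Toy corpus, made up for illustration
tokens = "the force is strong the force is with you the dark side".split()

# Bigram model: condition on one word, predict the next
bigrams = list(nltk.ngrams(tokens, 2))
cpd = nltk.ConditionalProbDist(nltk.ConditionalFreqDist(bigrams), nltk.MLEProbDist)

best = cpd["the"].max()        # most frequent follower of "the"
p = cpd["the"].prob("force")   # relative frequency: 2 of 3 occurrences
seen = "jedi" in cpd           # unseen context, so a generator would back off

# Padding adds None placeholders at both ends of the sequence
padded = list(nltk.ngrams(tokens, 3, pad_left=True, pad_right=True))

print(best, p, seen)
print(padded[0], padded[-1])
```

Since `max()` always returns the single most frequent continuation, the generation is deterministic: the same seed and the same character always give the same text.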

Now that we have our function ready, let's generate some texts and see how they vary from one character to another. C-3PO, Leia, and Luke share the same three-word seed, while Han Solo and Vader each get their own.

In [None]:
print("C3PO: " + generate_text_backoff(c3po_text, ["It", "sure", "is"], 25) + "\n")
print("Han Solo: " + generate_text_backoff(solo_text, ["Chewie", "come", "here"], 25) + "\n")
print("Leia: " + generate_text_backoff(leia_text, ["It", "sure", "is"], 25) + "\n")
print("Luke: " + generate_text_backoff(luke_text, ["It", "sure", "is"], 25) + "\n")
print("Vader: " + generate_text_backoff(vader_text, ["I", "am", "your"], 25) + "\n")