# Shakespeare play text prediction

Jan Polzer and Ryan Duckworth

Dataset from: https://www.kaggle.com/kingburrito666/shakespeare-plays

In [1]:
import numpy as np
import pandas as pd
import random

Load dataset and drop the rows where at least one element is missing.

In [2]:
text = pd.read_csv('data/Shakespeare_data.csv').dropna()

After dropping incomplete rows, 105,152 lines of Shakespeare text remain.

In [3]:
print(text.shape)
text.head(1)

(105152, 6)


| | Dataline | Play | PlayerLinenumber | ActSceneLine | Player | PlayerLine |
|---|---|---|---|---|---|---|
| 3 | 4 | Henry IV | 1.0 | 1.1.1 | KING HENRY IV | So shaken as we are, so wan with care, |


The PlayerLine column contains the text we want to use.

In [4]:
text = text.iloc[:,-1].values
print(text)

['So shaken as we are, so wan with care,'
 'Find we a time for frighted peace to pant,'
 'And breathe short-winded accents of new broils' ...
 'Each one demand an answer to his part'
 "Perform'd in this wide gap of time since first"
 "We were dissever'd: hastily lead away."]


We join the lines into a single string, separated by the newline character.

In [5]:
data = "\n".join(text)

In [6]:
dictionary = {}

We fill the dictionary with word-pair counts: for each word, we count how many times each other word follows it within the same line.

In [7]:
for line in data.split('\n'):
    words = line.split(' ')
    for word1, word2 in zip(words[:-1],words[1:]):
        if word1 not in dictionary:
            dictionary[word1] = {word2:1}
        else:
            if word2 not in dictionary[word1]:
                #First time encountered word2 after word1 (not in dictionary), set to 1 occurrence
                dictionary[word1][word2] = 1
            else:
                #Additional times encountered word2 after word1, increment the number of occurrences
                dictionary[word1][word2] += 1
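
As a smaller illustration of the same counting step, the sketch below builds the nested counts for a two-line sample. The sample text and the name follow_counts are made up for this example, and defaultdict with Counter is used here only as a compact alternative to the explicit membership checks above.

```python
from collections import Counter, defaultdict

# Hypothetical two-line corpus, standing in for the Shakespeare text
sample = "to be or not to be\nto thine own self be true"

# Maps each word to a Counter of the words that follow it
follow_counts = defaultdict(Counter)
for line in sample.split('\n'):
    words = line.split(' ')
    # Pair every word with its successor; pairs never cross a line boundary
    for word1, word2 in zip(words[:-1], words[1:]):
        follow_counts[word1][word2] += 1

print(dict(follow_counts["to"]))
print(dict(follow_counts["be"]))
```

Because each line is processed separately, the last word of one line is never counted as preceding the first word of the next.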

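The predict_next_word function returns a word to follow word1, the previous word. When most_frequent is True, it always returns the most frequent successor, which quickly leads to repetition. When most_frequent is False, it picks a successor at random in proportion to how often it follows word1, which gives more varied and more useful text.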

In [8]:
def predict_next_word(word1, dictionary, most_frequent):
    # Unknown words fall back to the empty-string key, which is present in this
    # corpus because split(' ') yields empty tokens wherever spaces repeat
    if word1 not in dictionary:
        word1 = ""

    path = dictionary[word1]
    pathArray = []

    if most_frequent:
        # Return the word that most frequently occurs after word1
        most_frequent_word = max(path, key=path.get)
        return most_frequent_word
    else:
        # Return a random word that occurs after word1; each word is added
        # once per occurrence, so frequent successors are picked more often
        for word in path:
            freq = path[word]
            for f in range(freq):
                pathArray.append(word)
        return random.choice(pathArray)
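
Building a list with one entry per occurrence is one way to weight the random draw. A possible alternative, sketched below, passes the counts directly as weights to random.choices from the standard library. The counts in path and the helper name sample_next_word are hypothetical and are not taken from the dataset.

```python
import random

# Hypothetical successor counts for one word, shaped like dictionary[word1]
path = {"king": 6, "queen": 3, "duke": 1}

def sample_next_word(path, rng=random):
    # Draw one successor with probability proportional to its count
    candidates = list(path)
    weights = [path[word] for word in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_next_word(path, rng) for _ in range(10000)]

# Observed share of each word; it should approach count / total count
for word in path:
    print(word, draws.count(word) / len(draws))
```

This avoids allocating a list whose length equals the total number of occurrences, which may matter for very common words.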

### Generate new text from the text corpus

In [9]:
sentence = ""
last_word = ""
length = 100

for each_word in range(0, length):
    next_word = predict_next_word(last_word, dictionary, False)
    sentence += " %s"%(next_word)
    last_word = next_word
print(sentence.strip())

'Thou liest' unto thy pump, that their own shapes, of the softness of shame to say 'Amen,'   Thy lord's health, in surmise, and his horse, even for her lord. My lord I am your cabin at a fault for my grandsire Priam turn his journey, be wicked? is nobility's true and to a coward hand and I of wrath, which she shall: go with their swords with her, in the expulsion is fabulous story, she shall devour their ease, will you suspect thee so answer'd. Tell me sport: her youngest son against her passing to alter


### Perform text prediction given a sequence of words

In [10]:
def predict_given_words(words, length, most_frequent = False):
    
    w = words.split(' ')
    last_word = w[-1]

    for each_word in range(0, length):
        next_word = predict_next_word(last_word, dictionary, most_frequent)
        words += " %s"%(next_word)
        last_word = next_word
    # str.strip() returns a new string, so strip once on the final result
    return words.strip()

The three examples below pick each next word at random from the words that follow the last word in the dictionary.

In [11]:
sentence = predict_given_words("I shall be", 50)
print(sentence)

I shall be out of a dulcet sounds retreat, and as I am Cinna the nuptial vow, sirrah, by right. Go, wind, turns the world. O my rapt in the food for the table, now purple dye, 'Tis now to him. Know'st thou hast not that title,--  These weeds are they will


In [12]:
sentence = predict_given_words("O my", 50)
print(sentence)

O my country's earth, bear to seek redemption thence into the curtains draw. And thus it comes in her out of the fair safety, for your affections by ten thousand ducats in the king this? Sense, sure, it meet, think your pasture, let us some hour before her knee slaves, vapours, and


In [13]:
sentence = predict_given_words("As this", 50)
print(sentence)

As this arrest: Francis! Who wears at him, tear that did when I can impress of that rebels are witness you know you did you? Hence, saucy with grief, it in a foot, and I have told him with no more than fight. Now my opinion, they come: the back receive thyself.


The examples below instead use the most frequent word that follows the last word in the dictionary. We found the random selection above to be better for generating text, because always taking the most frequent word creates a repetitive pattern. The second example shows this: "and the king..."

In [14]:
sentence = predict_given_words("I shall be", 5, True)
print(sentence)

I shall be a man of the king


In [15]:
#Example of the repetition we don't want when only using the most frequent words, without randomization
sentence = predict_given_words("I shall be", 11, True)
print(sentence)

I shall be a man of the king and the king and the king
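
The repetition follows from the selection rule being deterministic: once a word comes up a second time, the same successors are chosen again. The sketch below follows the most frequent successor until a word repeats and reports the resulting loop. The counts in toy are hypothetical, chosen only to mimic the output above, and greedy_walk is a helper introduced for this illustration.

```python
# Hypothetical counts chosen to mimic the loop seen above
toy = {
    "a": {"man": 2},
    "man": {"of": 3},
    "of": {"the": 5},
    "the": {"king": 4, "queen": 1},
    "king": {"and": 2},
    "and": {"the": 3},
}

def greedy_walk(start, counts, max_steps=20):
    # Follow the most frequent successor until a word repeats
    seen = [start]
    word = start
    for _ in range(max_steps):
        if word not in counts:
            break
        word = max(counts[word], key=counts[word].get)
        if word in seen:
            # Everything from the first occurrence onwards repeats forever
            return seen, seen[seen.index(word):]
        seen.append(word)
    return seen, []

prefix, cycle = greedy_walk("a", toy)
print(prefix)
print(cycle)
```

Random selection breaks such loops because the same word can lead to different successors on each visit.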


In [16]:
#Passing False as most_frequent to predict_given_words (which forwards it to predict_next_word)
#picks randomly among the words that follow the previous word and produces much better text.
sentence = predict_given_words("I shall be", 11, False)
print(sentence)

I shall be forgot!  If thou art: but the breach in Rome with


### Try the application

In [17]:
text_start = input('Enter a beginning of a sentence: ')
word_count = int(input('Enter how many more words you want in the sentence: '))

Enter a beginning of a sentence: Hello the king of England
Enter how many more words you want in the sentence: 10


In [18]:
print(predict_given_words(text_start, word_count))

Hello the king of England prisoner? I am conqueror of your love. I'll make catlings
