# Shakespeare Search Engine in 7 Lines of Code

Okay, a few more than that because we need to load the plays and import dependencies.

The idea is to build a hash table that maps each searchable term to its occurrences in the corpus (an inverted index). Lookups are extremely fast, but the index has to be built beforehand.
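To make the idea concrete before bringing in the real data, here is a minimal sketch of such an index built over three made-up lines. The `lines` list and the plain `split()` tokenization are assumptions for illustration only; the notebook itself uses NLTK and the Shakespeare data frame.

```python
from collections import defaultdict

# Toy corpus: three made-up lines, not taken from the dataset.
lines = [
    "the quick brown fox",
    "the lazy dog",
    "a quick dog",
]

# Map each word to the list of line numbers in which it appears.
index = defaultdict(list)
for line_no, line in enumerate(lines):
    for word in line.split():
        index[word].append(line_no)

print(index["quick"])  # [0, 2]
print(index["dog"])    # [1, 2]
```

Searching is then a single dictionary lookup, no matter how large the corpus is.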

#### Import dependencies

In [1]:
import pandas as pd
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from collections import defaultdict

#### Load the corpus
The first task is to load Shakespeare into a data frame. I pre-cleaned the data; the dataset is available [here](https://www.kaggle.com/kingburrito666/shakespeare-plays).

In [2]:
df = pd.read_csv('shakespeare.csv')
df.head()

Unnamed: 0,Play,Character,Act,Scene,Passage,Line_Num,Line
0,Henry IV,KING HENRY IV,1,1,1,1,"So shaken as we are, so wan with care,"
1,Henry IV,KING HENRY IV,1,1,1,2,"Find we a time for frighted peace to pant,"
2,Henry IV,KING HENRY IV,1,1,1,3,And breathe short-winded accents of new broils
3,Henry IV,KING HENRY IV,1,1,1,4,To be commenced in strands afar remote.
4,Henry IV,KING HENRY IV,1,1,1,5,No more the thirsty entrance of this soil


#### Index
We'll use NLTK to handle the tedious parts: tokenizing each line and stemming each word.

In [3]:
ps = PorterStemmer()
word2idx = defaultdict(list)

for row in df.itertuples():
    for word in word_tokenize(row.Line): # Decompose the sentence to tokens
        if word.isalpha():  # Only index this token if it is a word (i.e., not punctuation)
            word2idx[ps.stem(word.lower())].append(row.Index)
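One side effect of the `isalpha()` filter is worth seeing in isolation: it drops any token containing a non-letter character, not just punctuation. The sketch below uses a hand-written token list as a stand-in for tokenizer output (an assumption; it presumes the tokenizer leaves hyphenated words as a single token) and skips stemming entirely.

```python
# Hand-written tokens standing in for tokenizer output (an assumption,
# not actual word_tokenize results).
tokens = ["And", "breathe", "short-winded", "accents", ",", "'t"]

# Keep purely alphabetic tokens and lowercase them, as the indexing loop does
# before stemming.
kept = [t.lower() for t in tokens if t.isalpha()]
print(kept)  # ['and', 'breathe', 'accents']
```

Under that assumption, a hyphenated word such as `short-winded` never reaches the index, so a search for it would come back empty.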

#### Search
Basic search is simply a matter of looking up a key in the index dictionary.

In [4]:
search_term = 'lustre'
for idx in word2idx[ps.stem(search_term.lower())]:
    print(df.loc[idx].Line)

It lends a lustre and more great opinion,
He beats thee 'gainst the odds: thy lustre thickens,
A lustre to it.
That hath not noble lustre in your eyes.
Equal in lustre, were now best, now worst,
About his neck, yet never lost her lustre;
Did lose his lustre: I did hear him groan:
Where is thy lustre now?
Piercing a hogshead! a good lustre of conceit in a
You have added worth unto 't and lustre,
The lustre of the better yet to show,
The lustre in your eye, heaven in your cheek,
Tincture or lustre in her lip, her eye,


This can be condensed to a one-line function. (Though this is a little clunky.)

In [5]:
def search(term):
    return df.loc[word2idx[ps.stem(term.lower())]]
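One caveat with looking terms up this way: because `word2idx` is a `defaultdict`, bracket lookup on a term that is not in the index quietly inserts an empty entry for it. The sketch below shows the behavior on a small stand-in dictionary (the key and row number are hypothetical); `dict.get` is one way to avoid it if that matters.

```python
from collections import defaultdict

# Small stand-in for the real index; the key and row number are hypothetical.
index = defaultdict(list)
index["lustr"].append(3)

# Bracket lookup on a missing key inserts it as a side effect.
_ = index["zzz"]
print("zzz" in index)        # True

# .get() returns a default without modifying the index.
hits = index.get("qqq", [])
print(hits, "qqq" in index)  # [] False
```

For a notebook this size the extra empty entries are harmless, so this is a design choice and not a bug.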

Let's look at some examples:

In [6]:
search_result = search('Tragic')
search_result.sample(6)

Unnamed: 0,Play,Character,Act,Scene,Passage,Line_Num,Line
92501,Titus Andronicus,TITUS ANDRONICUS,4,1,15,48,"This is the tragic tale of Philomel,"
79504,Richard III,QUEEN MARGARET,4,4,14,68,"And the beholders of this tragic play,"
65007,A Midsummer nights dream,THESEUS,5,1,10,61,Merry and tragical! tedious and brief!
21120,A Comedy of Errors,AEGEON,1,1,5,64,Gave any tragic instance of our harm:
79443,Richard III,QUEEN MARGARET,4,4,1,7,"Will prove as bitter, black, and tragical."
65006,A Midsummer nights dream,THESEUS,5,1,10,60,And his love Thisbe; very tragical mirth.'


In [7]:
search_result = search('Love')
search_result.sample(6)

Unnamed: 0,Play,Character,Act,Scene,Passage,Line_Num,Line
19967,Antony and Cleopatra,MARK ANTONY,4,4,10,21,"More tight at this than thou: dispatch. O love,"
14222,Alls well that ends well,HELENA,4,4,3,20,To recompense your love: doubt not but heaven
88776,Timon of Athens,APEMANTUS,1,1,117,253,labour: he that loves to be flattered is worth...
101004,Two Gentlemen of Verona,DUKE,3,2,22,58,You are already Love's firm votary
100381,Two Gentlemen of Verona,PROTEUS,2,4,96,205,And that I love him not as I was wont.
50820,Loves Labours Lost,BIRON,3,1,77,207,"What, I! I love! I sue! I seek a wife!"
