## Hamlet text analysis in PySpark

The goal is to use PySpark to transform the text of Hamlet into a DataFrame.

In [1]:
# Find path to PySpark.
import findspark
findspark.init()

In [2]:
# Import PySpark and initialize SparkContext object.
import pyspark
sc = pyspark.SparkContext()

In [7]:
# Read `hamlet.txt` into an RDD and split each line on tabs.
f = sc.textFile('hamlet.txt')
data = f.map(lambda line: line.split('\t'))
data.take(5)

[['hamlet@0', '', 'HAMLET'],
 ['hamlet@8'],
 ['hamlet@9'],
 ['hamlet@10', '', 'DRAMATIS PERSONAE'],
 ['hamlet@29']]

> Each line begins with `hamlet@` followed by a numeric line ID. We'd like each row to be a list with just that ID as its first element. We can define a function and pass it to the RDD's `map()` method.

In [8]:
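def format_id(x):
    # 'hamlet@8' -> '8'; named line_id to avoid shadowing the built-in id().
    line_id = x[0].split('@')[1]
    results = []
    results.append(line_id)
    # Copy over any remaining fields unchanged.
    if len(x) > 1:
        for y in x[1:]:
            results.append(y)
    return results

hamlet_with_ids = data.map(format_id)
hamlet_with_ids.take(5)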

[['0', '', 'HAMLET'], ['8'], ['9'], ['10', '', 'DRAMATIS PERSONAE'], ['29']]
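
> The ID-extraction step can also be checked without starting Spark by applying the same logic to a few plain Python lists. This is only a sketch: the sample rows are copied from the `data.take(5)` output above, and `format_id_compact` is a hypothetical, condensed variant that is not part of the original notebook.

```python
def format_id_compact(row):
    """Replace 'hamlet@<n>' in the first field with just '<n>'."""
    line_id = row[0].split('@')[1]
    # List concatenation keeps the remaining fields in their original order.
    return [line_id] + row[1:]

# Rows copied from the data.take(5) output.
sample_rows = [
    ['hamlet@0', '', 'HAMLET'],
    ['hamlet@8'],
    ['hamlet@10', '', 'DRAMATIS PERSONAE'],
]

formatted = [format_id_compact(row) for row in sample_rows]
print(formatted)
# [['0', '', 'HAMLET'], ['8'], ['10', '', 'DRAMATIS PERSONAE']]
```

> Both versions assume the first field always contains an `@`; a row without one would raise an `IndexError`.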

> Next, we want to get rid of elements that don't contain any actual words (and just have an ID as the first value). These typically represent blank lines between paragraphs or sections in the play. We also want to remove any blank values ('') within elements, which don't contain any useful information for our analysis.

In [16]:
real_text = hamlet_with_ids.filter(lambda line: len(line) > 1)\
                            .map(lambda line: [l for l in line if l != ''])

real_text.take(5)

[['0', 'HAMLET'],
 ['10', 'DRAMATIS PERSONAE'],
 ['31', 'CLAUDIUS', 'king of Denmark. (KING CLAUDIUS:)'],
 ['75', 'HAMLET', 'son to the late, and nephew to the present king.'],
 ['132', 'POLONIUS', 'lord chamberlain. (LORD POLONIUS:)']]
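
> As a rough illustration of what the `filter()` and `map()` calls do to each row, the same two steps can be written with list comprehensions on plain Python lists. The sample rows come from the `hamlet_with_ids.take(5)` output; the names `with_text` and `cleaned` are assumptions made for this sketch.

```python
# Rows copied from the hamlet_with_ids.take(5) output.
sample_rows = [
    ['0', '', 'HAMLET'],
    ['8'],
    ['9'],
    ['10', '', 'DRAMATIS PERSONAE'],
    ['29'],
]

# Step 1 (filter): keep rows that have something besides the ID.
with_text = [row for row in sample_rows if len(row) > 1]

# Step 2 (map): drop empty-string fields inside each remaining row.
cleaned = [[field for field in row if field != ''] for row in with_text]

print(cleaned)
# [['0', 'HAMLET'], ['10', 'DRAMATIS PERSONAE']]
```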

In [17]:
# Remove pipe characters: drop fields that are only "|" and strip "|" from the rest.
def fix_pipe(line):
    results = list()
    for l in line:
        if l == "|":
            pass
        elif "|" in l:
            fmtd = l.replace("|", "")
            results.append(fmtd)
        else:
            results.append(l)
    return results

clean_hamlet = real_text.map(fix_pipe)
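
> The pipe-removal logic can be tried on its own, outside Spark. The sketch below uses a hypothetical one-line equivalent, `fix_pipe_compact`, and a made-up sample row (not taken from `hamlet.txt`) chosen to exercise all three branches of `fix_pipe`.

```python
def fix_pipe_compact(row):
    """Drop fields that are exactly '|' and strip '|' from the others."""
    # str.replace is a no-op on fields without a pipe, so one expression
    # covers both the 'elif' and 'else' branches of fix_pipe.
    return [field.replace('|', '') for field in row if field != '|']

# Hypothetical row: a lone pipe, a field containing a pipe, and a clean field.
sample_row = ['1', '|', 'some|text', 'plain']

print(fix_pipe_compact(sample_row))
# ['1', 'sometext', 'plain']
```

> One edge case to keep in mind: a field made only of several pipes, such as `'||'`, is not dropped by either version and ends up as an empty string.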