## Spark Hamlet

(c) Miradiz Rakhmatov

These are my notes and experiments with Spark. In this project, my goal is to clean the Hamlet dataset so that it is ready for later analysis. I will mainly use map and filter to transform the raw dataset into a clean one.

In [1]:
import findspark

In [2]:
findspark.init('/users/miradiz/Downloads/spark-3.1.2-bin-hadoop3.2/')

In [3]:
import pyspark

In [4]:
## A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs
sc = pyspark.SparkContext()

In [5]:
raw_hamlet = sc.textFile('hamlet.txt')
raw_hamlet.take(5)

['hamlet@0\t\tHAMLET',
 'hamlet@8',
 'hamlet@9',
 'hamlet@10\t\tDRAMATIS PERSONAE',
 'hamlet@29']

## Step 1

Let's split each line on the tab ("\t") delimiter.

In [6]:
hamlet = raw_hamlet.map(lambda line: line.split('\t'))
hamlet.take(5)

[['hamlet@0', '', 'HAMLET'],
 ['hamlet@8'],
 ['hamlet@9'],
 ['hamlet@10', '', 'DRAMATIS PERSONAE'],
 ['hamlet@29']]
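
To see why some rows contain an empty string, it can help to run the same split on a few lines outside Spark. The sketch below uses a plain Python list in place of the RDD; `sample_lines` is a hypothetical stand-in copied from the output above.

```python
# Hypothetical sample copied from the first lines of the raw output above.
sample_lines = ['hamlet@0\t\tHAMLET', 'hamlet@8', 'hamlet@10\t\tDRAMATIS PERSONAE']

# str.split('\t') keeps the empty field that sits between two consecutive tabs,
# so a line with a double tab produces a '' element.
split_lines = [line.split('\t') for line in sample_lines]

for row in split_lines:
    print(row)
```

The empty strings are handled later, in Step 3.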

## Step 2

### ID transformation:

Each ID is prefixed with 'hamlet@'. Let's remove that prefix and keep only the integer part of the ID on each line.

In [7]:
def proper_id(line):
    raw_id = line[0]        ## each row is a list of strings; the first element is the ID
    modified = raw_id[7:]   ## drop the 7-character prefix 'hamlet@', e.g. 'hamlet@10' -> '10'

    new = list()
    new.append(modified)    ## start the new row with the integer part of the ID
    for i in line[1:]:      ## append everything else that comes after the ID
        new.append(i)

    return new


hamlet = hamlet.map(proper_id)
hamlet.take(10)

[['0', '', 'HAMLET'],
 ['8'],
 ['9'],
 ['10', '', 'DRAMATIS PERSONAE'],
 ['29'],
 ['30'],
 ['31', 'CLAUDIUS', 'king of Denmark. (KING CLAUDIUS:)'],
 ['74'],
 ['75', 'HAMLET', 'son to the late, and nephew to the present king.'],
 ['131']]
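
The fixed slice `[7:]` assumes every ID starts with exactly 'hamlet@'. A slightly more defensive variant, sketched here on plain Python lists, splits on the '@' instead. `strip_id_prefix` is a hypothetical helper name and not part of the notebook.

```python
def strip_id_prefix(line):
    """Return a copy of the row with everything up to and including '@' removed from the ID."""
    # partition returns (before, separator, after); if '@' is missing,
    # the separator is '' and the ID is kept unchanged.
    _, sep, number = line[0].partition('@')
    new_id = number if sep else line[0]
    return [new_id] + line[1:]

# Illustrative rows; the last one has no prefix, to show the fallback.
rows = [['hamlet@0', '', 'HAMLET'], ['hamlet@8'], ['31', 'CLAUDIUS']]
cleaned = [strip_id_prefix(row) for row in rows]
print(cleaned)
```

On the RDD, the same function could presumably be passed to `hamlet.map(...)` in place of `proper_id`.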

## Step 3:

We want to eliminate rows that contain only an ID and no words. These typically represent blank lines between paragraphs or sections in the play. We also want to remove any blank values ('') within the remaining rows, since they carry no useful information for our analysis.

In [8]:
hamlet = hamlet.filter(lambda line: len(line) > 1).map(lambda line: [i for i in line if i != ""])
hamlet.take(20)

[['0', 'HAMLET'],
 ['10', 'DRAMATIS PERSONAE'],
 ['31', 'CLAUDIUS', 'king of Denmark. (KING CLAUDIUS:)'],
 ['75', 'HAMLET', 'son to the late, and nephew to the present king.'],
 ['132', 'POLONIUS', 'lord chamberlain. (LORD POLONIUS:)'],
 ['177', 'HORATIO', 'friend to Hamlet.'],
 ['204', 'LAERTES', 'son to Polonius.'],
 ['230', 'LUCIANUS', 'nephew to the king.'],
 ['261', 'VOLTIMAND', '|'],
 ['273', '|'],
 ['276', 'CORNELIUS', '|'],
 ['288', '|'],
 ['291', 'ROSENCRANTZ', '|  courtiers.'],
 ['317', '|'],
 ['320', 'GUILDENSTERN', '|'],
 ['335', '|'],
 ['338', 'OSRIC', '|'],
 ['348', 'A Gentleman, (Gentlemen:)'],
 ['376', 'A Priest. (First Priest:)'],
 ['405', 'MARCELLUS', '|']]
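
The chained filter and map can be mirrored with list comprehensions to check the logic on a small sample without a SparkContext. The rows below are copied from the Step 2 output; the variable names are illustrative only.

```python
rows = [
    ['0', '', 'HAMLET'],
    ['8'],
    ['9'],
    ['31', 'CLAUDIUS', 'king of Denmark. (KING CLAUDIUS:)'],
]

# Same order as the RDD version: drop ID-only rows first, then drop '' fields.
kept = [row for row in rows if len(row) > 1]
without_blanks = [[field for field in row if field != ''] for row in kept]
print(without_blanks)
```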

## Step 4:

Remove the list elements that consist solely of the pipe character "|".

Note that this removes only elements that are exactly a pipe character; pipes embedded in longer strings are left untouched.

In [9]:
hamlet = hamlet.map(lambda line: [l for l in line if l != "|"])
hamlet.take(15)

[['0', 'HAMLET'],
 ['10', 'DRAMATIS PERSONAE'],
 ['31', 'CLAUDIUS', 'king of Denmark. (KING CLAUDIUS:)'],
 ['75', 'HAMLET', 'son to the late, and nephew to the present king.'],
 ['132', 'POLONIUS', 'lord chamberlain. (LORD POLONIUS:)'],
 ['177', 'HORATIO', 'friend to Hamlet.'],
 ['204', 'LAERTES', 'son to Polonius.'],
 ['230', 'LUCIANUS', 'nephew to the king.'],
 ['261', 'VOLTIMAND'],
 ['273'],
 ['276', 'CORNELIUS'],
 ['288'],
 ['291', 'ROSENCRANTZ', '|  courtiers.'],
 ['317'],
 ['320', 'GUILDENSTERN']]
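
The comparison `l != "|"` is an exact match on the whole field, which is why ID 291 still carries a pipe. A small sketch on plain lists, with sample rows copied from the Step 3 output, makes the difference visible.

```python
rows = [
    ['261', 'VOLTIMAND', '|'],
    ['273', '|'],
    ['291', 'ROSENCRANTZ', '|  courtiers.'],
]

# Only fields that are exactly '|' are dropped; '|  courtiers.' is kept as-is.
no_lone_pipes = [[field for field in row if field != '|'] for row in rows]
print(no_lone_pipes)

# Rows reduced to a bare ID can be spotted by their length.
id_only = [row for row in no_lone_pipes if len(row) == 1]
print(id_only)
```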

## Step 5: 

The elements that were a single pipe character have been removed, but some strings still contain pipe characters, e.g. ID 291. Let's remove the pipe character from inside each string and strip the whitespace from the beginning and the end of the string.

In [10]:
def f(line):
    new_list = list()
    for string in line:
        modified = string.replace("|", "").strip()  ## delete pipes, then strip the leftover whitespace from both ends of the string
        new_list.append(modified)
    return new_list

hamlet = hamlet.map(f)
hamlet.take(20)

[['0', 'HAMLET'],
 ['10', 'DRAMATIS PERSONAE'],
 ['31', 'CLAUDIUS', 'king of Denmark. (KING CLAUDIUS:)'],
 ['75', 'HAMLET', 'son to the late, and nephew to the present king.'],
 ['132', 'POLONIUS', 'lord chamberlain. (LORD POLONIUS:)'],
 ['177', 'HORATIO', 'friend to Hamlet.'],
 ['204', 'LAERTES', 'son to Polonius.'],
 ['230', 'LUCIANUS', 'nephew to the king.'],
 ['261', 'VOLTIMAND'],
 ['273'],
 ['276', 'CORNELIUS'],
 ['288'],
 ['291', 'ROSENCRANTZ', 'courtiers.'],
 ['317'],
 ['320', 'GUILDENSTERN'],
 ['335'],
 ['338', 'OSRIC'],
 ['348', 'A Gentleman, (Gentlemen:)'],
 ['376', 'A Priest. (First Priest:)'],
 ['405', 'MARCELLUS']]
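
The same transformation can be written as a list comprehension. The sketch below runs on plain lists, and `strip_pipes` is a hypothetical name. It also shows that a lone '|' would turn into an empty string, which is one reason to drop those fields beforehand.

```python
def strip_pipes(line):
    """Remove every '|' from each field and trim surrounding whitespace."""
    return [field.replace('|', '').strip() for field in line]

# A field with an embedded pipe is cleaned up.
print(strip_pipes(['291', 'ROSENCRANTZ', '|  courtiers.']))

# A field that is only a pipe becomes '' rather than disappearing.
print(strip_pipes(['273', '|']))
```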

## Step 6:

Removing the lone pipe characters in Step 4 left behind some rows that contain only an ID (e.g. 273 and 288). Let's remove those rows.

In [11]:
hamlet = hamlet.filter(lambda line: len(line) > 1)
hamlet.take(25)

[['0', 'HAMLET'],
 ['10', 'DRAMATIS PERSONAE'],
 ['31', 'CLAUDIUS', 'king of Denmark. (KING CLAUDIUS:)'],
 ['75', 'HAMLET', 'son to the late, and nephew to the present king.'],
 ['132', 'POLONIUS', 'lord chamberlain. (LORD POLONIUS:)'],
 ['177', 'HORATIO', 'friend to Hamlet.'],
 ['204', 'LAERTES', 'son to Polonius.'],
 ['230', 'LUCIANUS', 'nephew to the king.'],
 ['261', 'VOLTIMAND'],
 ['276', 'CORNELIUS'],
 ['291', 'ROSENCRANTZ', 'courtiers.'],
 ['320', 'GUILDENSTERN'],
 ['338', 'OSRIC'],
 ['348', 'A Gentleman, (Gentlemen:)'],
 ['376', 'A Priest. (First Priest:)'],
 ['405', 'MARCELLUS'],
 ['417', 'officers.'],
 ['431', 'BERNARDO'],
 ['444', 'FRANCISCO', 'a soldier.'],
 ['466', 'REYNALDO', 'servant to Polonius.'],
 ['496', 'Players.'],
 ['506', '(First Player:)'],
 ['523', '(Player King:)'],
 ['539', '(Player Queen:)'],
 ['557', 'Two Clowns, grave-diggers.']]

## THE END:
As the output above shows, the data is now ready for analysis.
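
For reference, the six steps can be folded into a single function. This is only a sketch, under the assumption that every raw line looks like the samples shown earlier. `clean_line` is a hypothetical helper that returns `None` for lines with no content, the raw line for ID 273 is a guess at its original form, and the function is exercised on a plain Python list rather than on the RDD.

```python
def clean_line(raw):
    """Apply Steps 1-6 to one raw line; return None if nothing but the ID remains."""
    fields = raw.split('\t')                  # Step 1: split on tabs
    line_id = fields[0][len('hamlet@'):]      # Step 2: assumes the 'hamlet@' prefix
    # Steps 3-5: delete pipes, trim whitespace, then drop fields left empty.
    texts = [f.replace('|', '').strip() for f in fields[1:]]
    texts = [t for t in texts if t]
    # Step 6: discard rows that would contain only the ID.
    return [line_id] + texts if texts else None

# Hypothetical raw lines modelled on the notebook output.
raw_lines = [
    'hamlet@0\t\tHAMLET',
    'hamlet@8',
    'hamlet@273\t\t|',
    'hamlet@291\tROSENCRANTZ\t|  courtiers.',
]
cleaned = [row for row in map(clean_line, raw_lines) if row is not None]
print(cleaned)
```

On the RDD, a roughly equivalent call would be `raw_hamlet.map(clean_line).filter(lambda row: row is not None)`. Unlike the step-by-step version, this sketch also drops fields that contain only whitespace.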