## Introduction
In the previous two missions, we covered the basics of PySpark, the MapReduce paradigm, transformations and actions, and how to do basic data cleanup in PySpark. In this challenge, you'll use the techniques you've learned to transform the text of Hamlet into a format that's more useful for data analysis.

#### Resources
* [PySpark's documentation for the RDD data structure](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD)
* [Visual representation of methods](http://nbviewer.jupyter.org/github/jkthompson/pyspark-pictures/blob/master/pyspark-pictures.ipynb) (IPython Notebook format)
* [Visual representation of methods](https://training.databricks.com/visualapi.pdf) (PDF format)

## Extract Line Numbers
The first value in each element (or line from the play) is a line number that identifies the line of the play the text is from. It appears in the following format:

    'hamlet@0',
    'hamlet@8',
    'hamlet@9',
    ...

We don't need the __hamlet@__ at the beginning of these IDs for our data analysis. Let's extract just the integer part of the ID from each line, which is much more useful.

#### Instructions
Transform the RDD __split_hamlet__ into a new RDD __hamlet_with_ids__ that contains the clean version of the line ID for each element.
* For example, we want to transform __hamlet@0__ to __0__, and leave the rest of the values in that element untouched.
  * Recall that the __map()__ function will run on each element in the RDD, where each element is a list that we can access using regular Python mechanics.

In [1]:
# Find path to PySpark
import findspark
findspark.init()

# Import PySpark and initialize the SparkContext object
import pyspark
sc = pyspark.SparkContext()
raw_hamlet = sc.textFile('hamlet.txt')
raw_hamlet.take(10)

[u'hamlet@0\t\tHAMLET',
 u'hamlet@8',
 u'hamlet@9',
 u'hamlet@10\t\tDRAMATIS PERSONAE',
 u'hamlet@29',
 u'hamlet@30',
 u'hamlet@31\tCLAUDIUS\tking of Denmark. (KING CLAUDIUS:)',
 u'hamlet@74',
 u'hamlet@75\tHAMLET\tson to the late, and nephew to the present king.',
 u'hamlet@131']

In [2]:
# Split RDD
split_hamlet = raw_hamlet.map(lambda line: line.split('\t'))
split_hamlet.take(10)

[[u'hamlet@0', u'', u'HAMLET'],
 [u'hamlet@8'],
 [u'hamlet@9'],
 [u'hamlet@10', u'', u'DRAMATIS PERSONAE'],
 [u'hamlet@29'],
 [u'hamlet@30'],
 [u'hamlet@31', u'CLAUDIUS', u'king of Denmark. (KING CLAUDIUS:)'],
 [u'hamlet@74'],
 [u'hamlet@75',
  u'HAMLET',
  u'son to the late, and nephew to the present king.'],
 [u'hamlet@131']]

In [3]:
# Remove the "hamlet@" prefix from the row IDs
def format_id(line):
    # line[0] looks like 'hamlet@123'; keep only the part after '@'
    line_id = line[0].split('@')[1]
    # Rebuild the element: cleaned ID first, remaining values unchanged
    return [line_id] + line[1:]

hamlet_with_ids = split_hamlet.map(format_id)

hamlet_with_ids.take(10)

[[u'0', u'', u'HAMLET'],
 [u'8'],
 [u'9'],
 [u'10', u'', u'DRAMATIS PERSONAE'],
 [u'29'],
 [u'30'],
 [u'31', u'CLAUDIUS', u'king of Denmark. (KING CLAUDIUS:)'],
 [u'74'],
 [u'75', u'HAMLET', u'son to the late, and nephew to the present king.'],
 [u'131']]
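
The instructions ask for the integer part of the ID, while the output above still holds the IDs as strings. If numeric IDs are wanted (for sorting or range filters, say), a minimal sketch could look like the following. The name `format_id_as_int` and the sample rows are illustrative assumptions, not part of the original exercise.

```python
def format_id_as_int(line):
    """Return a copy of `line` with 'hamlet@N' replaced by the integer N."""
    # Assumes the first value always has the form '<prefix>@<digits>'
    line_id = int(line[0].split('@')[1])
    return [line_id] + line[1:]

# Hand-copied sample in the same shape as split_hamlet's elements
sample = [['hamlet@0', '', 'HAMLET'], ['hamlet@8']]
converted = [format_id_as_int(row) for row in sample]
print(converted)
```

Since it takes one element and returns one element, the same function could presumably be passed to the RDD as `split_hamlet.map(format_id_as_int)`.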

## Remove Blank Values
Next, we want to get rid of elements that don't contain any actual words (and just have an ID as the first value). These typically represent blank lines between paragraphs or sections in the play. We also want to remove any blank values ('') within elements, which don't contain any useful information for our analysis.

#### Instructions
* Clean up the RDD and store the result as a new RDD __hamlet_text_only__.

#### Comments on my first attempt
* The commented-out code and the code actually used do essentially the same thing, but the commented-out version is unnecessarily long and therefore less readable.
* So I've commented it out and condensed the code.

In [4]:
# def remove_empty_lines(line):
#     if len(line) > 1:
#         return True
#     else:
#         return False
        
# hamlet_no_empty_lines = hamlet_with_ids.filter(lambda line: remove_empty_lines(line))
# hamlet_no_empty_lines.take(10)

In [5]:
# def remove_empty_elements(line):
#     cleaned_line = [i for i in line if i != '']
#     return cleaned_line
        
# hamlet_text_only = hamlet_no_empty_lines.map(lambda line: remove_empty_elements(line))
# hamlet_text_only.take(10)

In [6]:
hamlet_no_empty_lines = hamlet_with_ids.filter(lambda line: len(line) > 1)
hamlet_no_empty_lines.take(10)

[[u'0', u'', u'HAMLET'],
 [u'10', u'', u'DRAMATIS PERSONAE'],
 [u'31', u'CLAUDIUS', u'king of Denmark. (KING CLAUDIUS:)'],
 [u'75', u'HAMLET', u'son to the late, and nephew to the present king.'],
 [u'132', u'POLONIUS', u'lord chamberlain. (LORD POLONIUS:)'],
 [u'177', u'HORATIO', u'friend to Hamlet.'],
 [u'204', u'LAERTES', u'son to Polonius.'],
 [u'230', u'LUCIANUS', u'nephew to the king.'],
 [u'261', u'VOLTIMAND', u'|'],
 [u'273', u'', u'|']]

In [7]:
hamlet_text_only = hamlet_no_empty_lines.map(lambda line: [value for value in line if value != ''])
hamlet_text_only.take(10)

[[u'0', u'HAMLET'],
 [u'10', u'DRAMATIS PERSONAE'],
 [u'31', u'CLAUDIUS', u'king of Denmark. (KING CLAUDIUS:)'],
 [u'75', u'HAMLET', u'son to the late, and nephew to the present king.'],
 [u'132', u'POLONIUS', u'lord chamberlain. (LORD POLONIUS:)'],
 [u'177', u'HORATIO', u'friend to Hamlet.'],
 [u'204', u'LAERTES', u'son to Polonius.'],
 [u'230', u'LUCIANUS', u'nephew to the king.'],
 [u'261', u'VOLTIMAND', u'|'],
 [u'273', u'|']]
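
Because `filter()` and `map()` simply apply a Python function to each element, the two steps can be sanity-checked on a small list without a SparkContext. The sketch below uses Python's built-in `filter` and `map` on a hand-copied sample; the helper names `has_text` and `drop_blanks` are assumptions made for illustration.

```python
def has_text(line):
    """True when the element holds more than just its ID."""
    return len(line) > 1

def drop_blanks(line):
    """Return the element without empty-string values."""
    return [value for value in line if value != '']

# A few rows in the same shape as hamlet_with_ids
sample = [
    ['0', '', 'HAMLET'],
    ['8'],
    ['31', 'CLAUDIUS', 'king of Denmark. (KING CLAUDIUS:)'],
]

# Keep rows that have text, then strip blank values from each
text_only = list(map(drop_blanks, filter(has_text, sample)))
print(text_only)
```

The same two functions could be handed to the RDD methods in place of the lambdas, which keeps the logic testable on its own.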

## Removing Pipe Characters
If you've been using __take()__ to preview the RDD after each task, you may have noticed there are some pipe characters (|) in odd places that add no value for us. The pipe character may appear as a standalone value in an element, or as part of an otherwise useful string value.

#### Instructions
* Remove any list items that contain only the pipe character (|), and replace any pipe characters that appear within strings with an empty string.
  * Assign the resulting RDD to __clean_hamlet__.

In [8]:
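# Drop values that consist solely of a pipe character
no_pipe_elements = hamlet_text_only.map(lambda line: [value for value in line if value != '|'])
# Strip any remaining pipe characters from inside the values
clean_hamlet = no_pipe_elements.map(lambda line: [value.replace('|', '') for value in line])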
clean_hamlet.take(10)

[[u'0', u'HAMLET'],
 [u'10', u'DRAMATIS PERSONAE'],
 [u'31', u'CLAUDIUS', u'king of Denmark. (KING CLAUDIUS:)'],
 [u'75', u'HAMLET', u'son to the late, and nephew to the present king.'],
 [u'132', u'POLONIUS', u'lord chamberlain. (LORD POLONIUS:)'],
 [u'177', u'HORATIO', u'friend to Hamlet.'],
 [u'204', u'LAERTES', u'son to Polonius.'],
 [u'230', u'LUCIANUS', u'nephew to the king.'],
 [u'261', u'VOLTIMAND'],
 [u'273']]
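
The last element in the output above, `[u'273']`, is left with only its ID once the pipe is removed. One possible variation, sketched here in plain Python, strips the pipes first and then drops ID-only elements. The helper name `strip_pipes` is an assumption, and the third sample row is made up to show a pipe inside a longer string.

```python
def strip_pipes(line):
    """Remove '|' from every value, then drop values that end up empty."""
    stripped = [value.replace('|', '') for value in line]
    return [value for value in stripped if value != '']

sample = [
    ['261', 'VOLTIMAND', '|'],
    ['273', '|'],
    ['300', 'CORNELIUS', '|courtiers.'],  # hypothetical row, not from the play's data
]

cleaned = [strip_pipes(row) for row in sample]
# Filter after cleaning so rows reduced to just an ID are removed too
with_text = [row for row in cleaned if len(row) > 1]
print(with_text)
```

On an RDD this ordering would correspond to a `map()` followed by a `filter()`. Whether ID-only elements should be dropped at this stage depends on what the later analysis needs.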