## Introduction to the Data
*  Read the text file into an RDD named __raw_hamlet__ using the __textFile()__ method from SparkContext (the SparkContext object is instantiated as __sc__ in our environment).
*  Display the first five elements of the RDD.

In [1]:
# Find path to PySpark
import findspark
findspark.init()

# Import PySpark and initialize the SparkContext object
import pyspark
sc = pyspark.SparkContext()

# Read the hamlet.txt file into an RDD (Resilient Distributed Dataset)
raw_hamlet = sc.textFile('hamlet.txt')
first_five_elements = raw_hamlet.take(5)
first_five_elements

[u'hamlet@0\t\tHAMLET',
 u'hamlet@8',
 u'hamlet@9',
 u'hamlet@10\t\tDRAMATIS PERSONAE',
 u'hamlet@29']

## The Map Method
* The text file uses the tab character (__\t__) as a delimiter. We'll need to split the file on the tab delimiter and convert the results into an RDD that's more manageable.
* Use the __map()__ method to split each line of __raw_hamlet__ on the tab character, so that each element becomes a list of strings.
* Name the resulting RDD __split_hamlet__.

In [2]:
split_hamlet = raw_hamlet.map(lambda line: line.split('\t'))
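
As a quick check of what the lambda does to a single element, here is a plain-Python sketch (no Spark needed) that applies the same split to the first line from the output above. The names `sample_line` and `fields` are illustrative only.

```python
# One raw line, copied from the first element of raw_hamlet shown earlier.
sample_line = 'hamlet@0\t\tHAMLET'

# The same operation the lambda applies to every element of the RDD.
fields = sample_line.split('\t')

print(fields)  # ['hamlet@0', '', 'HAMLET']
```

The two consecutive tabs produce an empty string in the middle, so the line ID sits at index 0 and the text, when there is any, comes last.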

## Beyond Lambda Functions
Lambda functions are great for writing quick functions with simple logic that we can pass into PySpark methods. They fall short when we need to write more customized logic, though. Thankfully, PySpark lets us define a function in Python first, then pass it in. A function that produces a sequence of values (rather than one value per element, or the Boolean that __filter()__ requires) needs to hand back something Spark can iterate over. The simplest way to do that is with a __yield__ statement, which specifies the values that should be pulled later.

If you're unfamiliar with the yield statement in Python, read this excellent [Stack Overflow answer](https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do/231855#231855) on the topic. To summarize, __yield__ is a Python technique that lets the interpreter generate values on the fly and hand them over only when they're requested, instead of storing them all in memory up front. Because of its unique architecture, Spark takes advantage of this technique to reduce overhead and improve the speed of computations.
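
To make the "pull it when necessary" idea concrete, here is a minimal plain-Python sketch. The generator `line_ids` and the IDs it produces are hypothetical and are not read from the Hamlet file.

```python
def line_ids(count):
    """Hypothetical generator: yields made-up line IDs one at a time."""
    for n in range(count):
        yield 'hamlet@{}'.format(n)

ids = line_ids(3)   # nothing has run yet; ids is a generator object
first = next(ids)   # runs the body up to the first yield
rest = list(ids)    # pulls the remaining values

print(first)  # hamlet@0
print(rest)   # ['hamlet@1', 'hamlet@2']
```

No value is computed until something asks for it, which is the behavior a caller can rely on when it iterates over a generator.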

Spark runs the named function on every element in the RDD, in worker processes that are separate from the driver program. Each instance of the function only has access to the object(s) you pass into it, copies of any outside variables it references, and the Python libraries available in the workers' environment. If the function refers to something that can't be shipped to the workers (the __sc__ object itself, for example) or imports a library that isn't installed there, the computation may crash. That's because PySpark serializes the function and sends it to Python processes running alongside the RDD's data, rather than running it in the notebook's own interpreter.

Finally, not all functions require us to use __yield__; only the ones that generate a custom sequence of data do. For __map()__ or __filter()__, we use __return__ to return a value for every single element in the RDD we're running the functions on.
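
The sketch below shows what such __return__-based functions might look like. It uses Python's built-in `map()` and `filter()` on a small list as stand-ins for the RDD methods, so it runs without Spark. The function names `get_line_id` and `mentions_hamlet` are hypothetical, and `sample` is what the first three lines of the file look like after splitting on tabs.

```python
def get_line_id(line):
    # map-style function: returns exactly one value for every element
    return line[0]

def mentions_hamlet(line):
    # filter-style function: returns a Boolean for every element
    return 'HAMLET' in line

# Stand-in for the first three elements of split_hamlet.
sample = [['hamlet@0', '', 'HAMLET'], ['hamlet@8'], ['hamlet@9']]

ids = list(map(get_line_id, sample))
kept = list(filter(mentions_hamlet, sample))

print(ids)   # ['hamlet@0', 'hamlet@8', 'hamlet@9']
print(kept)  # [['hamlet@0', '', 'HAMLET']]
```

With Spark, the same functions would be passed as `split_hamlet.map(get_line_id)` and `split_hamlet.filter(mentions_hamlet)`.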

## The FlatMap Method
In the following code cell, we'll use the __flatMap()__ method with the named function __hamlet_speaks__ to check whether a line in the play contains the text __HAMLET__ in all caps (indicating that Hamlet spoke). __flatMap()__ is different from __map()__ because it doesn't require an output for every element in the RDD. The __flatMap()__ method is useful whenever we want to generate a sequence of values from an RDD.

In this case, we want an RDD object that contains tuples of the unique line IDs and the text "hamlet speaketh!", __but only for the elements in the RDD that have "HAMLET" in one of the values.__ We can't use the __map()__ method for this because it requires a return value for every element in the RDD.
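
The difference can be sketched in plain Python, without Spark. Here a list comprehension stands in for __map()__ and `itertools.chain.from_iterable` stands in for the flattening that __flatMap()__ performs. The helper `tag_if_hamlet` and the `sample` data are illustrative assumptions, not part of the lesson's code.

```python
from itertools import chain

def tag_if_hamlet(line):
    # Hypothetical helper: yields one tuple when a field is "HAMLET",
    # and yields nothing at all otherwise.
    if 'HAMLET' in line:
        yield line[0], 'hamlet speaketh!'

# Stand-in for the first three elements of split_hamlet.
sample = [['hamlet@0', '', 'HAMLET'], ['hamlet@8'], ['hamlet@9']]

# map-like: one output per input, even when that output is empty
mapped = [list(tag_if_hamlet(line)) for line in sample]

# flatMap-like: the per-element results are concatenated into one sequence
flat_mapped = list(chain.from_iterable(tag_if_hamlet(line) for line in sample))

print(mapped)       # [[('hamlet@0', 'hamlet speaketh!')], [], []]
print(flat_mapped)  # [('hamlet@0', 'hamlet speaketh!')]
```

The map-like version still produces three results, two of them empty, while the flattened version keeps only the tuple that was actually yielded.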

We want each element in the resulting RDD to have the following format:
1. The first value should be the unique line ID (e.g. __'hamlet@0'__), which is the first value in each of the elements in the __split_hamlet__ RDD.
1. The second value should be the string "hamlet speaketh!"

In [3]:
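def hamlet_speaks(line):
    # The first field of each split line is the unique line ID
    line_id = line[0]

    # Yield a tuple only when one of the fields is exactly "HAMLET"
    if "HAMLET" in line:
        yield line_id, "hamlet speaketh!"

# flatMap() flattens the values yielded for each element into a single RDD
hamlet_spoken = split_hamlet.flatMap(hamlet_speaks)
hamlet_spoken.take(10)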

[(u'hamlet@0', 'hamlet speaketh!'),
 (u'hamlet@75', 'hamlet speaketh!'),
 (u'hamlet@1004', 'hamlet speaketh!'),
 (u'hamlet@9144', 'hamlet speaketh!'),
 (u'hamlet@12313', 'hamlet speaketh!'),
 (u'hamlet@12434', 'hamlet speaketh!'),
 (u'hamlet@12760', 'hamlet speaketh!'),
 (u'hamlet@12858', 'hamlet speaketh!'),
 (u'hamlet@14821', 'hamlet speaketh!'),
 (u'hamlet@15261', 'hamlet speaketh!')]