# WordCount Step-by-Step using Spark

Spark is a "functional programming" tool. What this means for us is that each line of code operates on the entire data set and transforms it in some way, returning the transformed data set when finished. To illustrate the transformations that happen, we will read a large text file (the works of Shakespeare appended to itself several times) and show how we manipulate the data to arrive at a count of words in the file.


Even though Jupyter notebooks can be set up to have a Spark "kernel" allowing us to create Spark notebooks, setting up the notebook server to find and use Spark is... difficult. 

Thus enter the **findspark** library. If we tell it where our Spark directory is, it will find Spark and return the starting object we work from -- the "spark context" usually abbreviated "sc."

In the original version of the course, this was running on a simulated cluster on our local machines.  We had to use `findspark` to tell Python where to find pyspark.  The original findspark code is left here, but commented out, for completeness.

In [1]:
#import findspark
#findspark.init('/usr/hdp/2.6.3.0-235/spark2')

import pyspark
# For pyspark kernel
sc = pyspark.SparkContext.getOrCreate()
# For Python3 kernel
#sc = pyspark.SparkContext(appName="WordCount")

In [2]:
sc

In [3]:
# This allows us to load csvs and text files easily with spark.read.csv(path_to_file)
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

Now that we have the Spark context, we can use it to load our text data file from HDFS. Loading stores the data in a data structure called a **Resilient Distributed Dataset (RDD).** 

First, though, we need to download the Shakespeare text we will use as a demo and put it into HDFS for Spark to use.  The file comes from https://norvig.com/ngrams/

In [4]:
!wget https://norvig.com/ngrams/shakespeare.txt

--2019-06-26 02:32:55--  https://norvig.com/ngrams/shakespeare.txt
Resolving norvig.com (norvig.com)... 158.106.138.13
Connecting to norvig.com (norvig.com)|158.106.138.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4538523 (4.3M) [text/plain]
Saving to: ‘shakespeare.txt.4’


2019-06-26 02:32:55 (15.1 MB/s) - ‘shakespeare.txt.4’ saved [4538523/4538523]



In [5]:
!hdfs dfs -mkdir /shake

mkdir: `/shake': File exists


In [6]:
!hdfs dfs -copyFromLocal shakespeare.txt /shake

copyFromLocal: `/shake/shakespeare.txt': File exists


In [7]:
!hdfs dfs -ls /shake

Found 2 items
-rw-r--r--   2 root hadoop    4538523 2019-06-26 01:31 /shake/shakespeare.txt
drwxr-xr-x   - root hadoop          0 2019-06-26 02:29 /shake/spark_wordcount.txt


Now we can load the file into Spark easily.  The path can also be written with an explicit HDFS scheme:

`text_file = sc.textFile('hdfs:///shake/shakespeare.txt')`

If you want to load many files from a data lake, the same command works, since `textFile()` also accepts directories, wildcards, and comma-separated lists of paths:
https://stackoverflow.com/a/24036343/4549682

In [8]:
# text_file = sc.textFile('hdfs:///shake/shakespeare.txt')
text_file = sc.textFile("/shake/shakespeare.txt")

# Check that text_file is an RDD
from pyspark.rdd import RDD
isinstance(text_file, RDD)

True

In [9]:
!hdfs dfs -ls

Found 1 items
drwxr-xr-x   - root hadoop          0 2019-06-26 02:32 .sparkStaging


### Always, _Always_, **ALWAYS!**  look at your data!

Spark doesn't *actually* load the entire data set when you evaluate the cell above. It waits until it absolutely has to. In order to look at the first few lines of the file you need to use the **take()** command:

In [10]:
text_file.take(5)

["A MIDSUMMER-NIGHT'S DREAM",
 '',
 'Now , fair Hippolyta , our nuptial hour ',
 'Draws on apace : four happy days bring in ',
 'Another moon ; but O ! methinks how slow ']
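
The "wait until it absolutely has to" behavior can be sketched without a cluster. The block below is only an analogy, not Spark itself: it uses Python's built-in lazy `map()` and `itertools.islice` as a stand-in for `take()`. The names `to_lower` and `processed` are made up for this illustration.

```python
from itertools import islice

lines = ["A MIDSUMMER-NIGHT'S DREAM", "", "Now , fair Hippolyta , our nuptial hour "]
processed = []  # hypothetical log of the lines that were actually touched

def to_lower(line):
    """Record that the line was processed, then lower-case it."""
    processed.append(line)
    return line.lower()

# Building the pipeline does no work yet (similar in spirit to a Spark transformation).
lazy_lines = map(to_lower, lines)
touched_before = len(processed)

# Asking for the first two results forces just enough work (similar in spirit to take(2)).
first_two = list(islice(lazy_lines, 2))
touched_after = len(processed)

print(touched_before, touched_after, first_two)
```

Nothing is processed until results are requested, and even then only the two lines that were asked for are touched.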

## Strategy ##

Recall that in traditional MapReduce WordCount, the map() function will emit single words paired with the number 1. The reduce() then collects all of those words and sums them up, outputting the sum for each word into the output file(s). 

Right now, however, we do not have single words; we have lines of multiple words. So the first step of our strategy is to break the lines into individual words. Then we can pair each word with a 1 and finally sum the counts for each word.

Our goal going into the reduce() is to have something like this:

`[('A', 1), ("MIDSUMMER-NIGHT'S", 1), ('DREAM', 1), ('Now', 1), ... ]`

### 1) Break the lines into individual words
For this step we use the map() function of RDDs. Recall that the map() function takes 2 arguments:
* The function to apply to the data set
* The data itself

In this case, the RDD text_file is the data set, so we just need to provide a function. Since our data is strings, that function can do several "clean up" steps at once:
* .lower() to make everything lower case
* .replace(',',' ') to replace commas with spaces
* .split() to break the line into a list of words

It is **very** common to use a **"lambda function."** Lambda functions are just little one-liners like below:

In [11]:
# x is an element of the data set - in this case a line of text
word_list = text_file.map(lambda x: x.lower().replace(',',' ').split()) 
# BTW, there's nothing "magic" about x. It is just a variable; you can name it anything.
word_list.take(5)

[['a', "midsummer-night's", 'dream'],
 [],
 ['now', 'fair', 'hippolyta', 'our', 'nuptial', 'hour'],
 ['draws', 'on', 'apace', ':', 'four', 'happy', 'days', 'bring', 'in'],
 ['another', 'moon', ';', 'but', 'o', '!', 'methinks', 'how', 'slow']]
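
To see that a lambda is just shorthand for a small named function, here is a plain-Python sketch using the built-in `map()`, which takes the function first and the data second. It runs on an ordinary list rather than an RDD, and `clean_and_split` is a name invented for this example.

```python
sample_lines = [
    "A MIDSUMMER-NIGHT'S DREAM",
    "",
    "Now , fair Hippolyta , our nuptial hour ",
]

def clean_and_split(line):
    """Lower-case a line, turn commas into spaces, and split on whitespace."""
    return line.lower().replace(",", " ").split()

# The same logic written as a lambda (an anonymous one-line function).
clean_and_split_lambda = lambda x: x.lower().replace(",", " ").split()

# Built-in map() takes two arguments: the function, then the data.
with_def = list(map(clean_and_split, sample_lines))
with_lambda = list(map(clean_and_split_lambda, sample_lines))

print(with_def)
print(with_def == with_lambda)
```

Both versions produce the same nested lists; the lambda simply saves us from naming a function we only use once.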

### A list of... lists?
The output of the last map command is a little surprising. **What happened?**

Recall that map() applies the function to each "piece" of data. In this case, each piece of data is a *line of text* and the function, split(), takes a line of text and **returns a list of individual words.** 

Our first transformation gave us individual words but they are wrapped in lists that we don't need. Fortunately, we have a way to deal with it. 

A **list of lists** is called a **nested list.** If you convert a nested list into a single list, you **flatten** it. 

Spark has a function that will act like map() and also flatten nested lists. It is called **flatMap().**

Let's see if we can use flatMap() to output our (word, 1) pairs. BTW, the parentheses on the **(**word,1**)** pair mean we are grouping each word with the number 1 (technically, this is called a "tuple").

In [12]:
word_tuple = word_list.flatMap(lambda wordlist: [(word, 1) for word in wordlist])
word_tuple.take(5)

[('a', 1), ("midsummer-night's", 1), ('dream', 1), ('now', 1), ('fair', 1)]

### OK, that is... complicated

Let's look at that last line in detail.

* `flatMap()` helps to flatten out the nested lists, and it takes a function just like `map()`. That's where the lambda comes in.
* On this lambda I called the incoming data **wordlist** just to help distinguish it.

"OK," I hear you saying, "but what the heck is with the brackets and `for` thingy?"

I'm glad you asked! The brackets make a **list comprehension**. This is a fancy, one-line version of a for loop.

Observe, in regular Python:

In [13]:
data = ['Now', ',', 'fair', 'Hippolyta', ',', 'our', 'nuptial', 'hour']
for word in data:
    print((word, 1))

('Now', 1)
(',', 1)
('fair', 1)
('Hippolyta', 1)
(',', 1)
('our', 1)
('nuptial', 1)
('hour', 1)


That looks just like what we want. Let's try it as a list comprehension:

In [14]:
[(word, 1) for word in data]

[('Now', 1),
 (',', 1),
 ('fair', 1),
 ('Hippolyta', 1),
 (',', 1),
 ('our', 1),
 ('nuptial', 1),
 ('hour', 1)]

Notice the list comprehension gave us output in a list. The `flatMap()` function kept us from nesting more lists.
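
The difference between mapping and flat-mapping can be sketched in plain Python. This is a rough stand-in for what Spark does, assuming a small in-memory list; the helper `flat_map` is hypothetical and is not part of PySpark.

```python
from itertools import chain

nested = [["a", "midsummer-night's", "dream"], [], ["now", "fair"]]

def flat_map(func, data):
    """Apply func to each element, then flatten the resulting lists by one level."""
    return list(chain.from_iterable(func(item) for item in data))

# A plain map keeps one output list per input list, so the result is still nested.
mapped = [[(word, 1) for word in wordlist] for wordlist in nested]

# flat_map applies the same function but merges the per-element lists into one.
flattened = flat_map(lambda wordlist: [(word, 1) for word in wordlist], nested)

print(mapped)
print(flattened)
```

Note that the empty list simply disappears when flattened, which is why blank lines in the text contribute no (word, 1) pairs.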

## Next up: reduce()

"Reduce" in the MapReduce sense of the word means **to gather** or **to sum up.** 

You might remember that MapReduce has an *intermediate* shuffle/sort step that helps group all the occurrences of a word before going into reduce. It effectively turns this:

`('the', 1), ('the', 1), ('the', 1), ('the', 1), ('the', 1)`

into this:

`('the', 1,1,1,1,1)`

An interesting way of thinking of each tuple (thing with parentheses) is as a (key, value). So, above, each 'the' would be a key and each 1 is a value. Spark has a cool function called `reduceByKey()` that helps us:

In [15]:
word_counts = word_tuple.reduceByKey(lambda total, count: total + count)
word_counts.take(5)

[('his', 6529),
 ('world', 608),
 ('of', 16013),
 ('orleans', 28),
 ('bastard', 62)]
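
Conceptually, `reduceByKey()` groups the values that share a key and then folds each group with the function you pass in. The sketch below imitates that idea on a small in-memory list; `reduce_by_key` is a made-up helper, and real Spark distributes this work across partitions rather than building one dictionary.

```python
from collections import defaultdict
from functools import reduce

pairs = [("the", 1), ("world", 1), ("the", 1), ("of", 1), ("the", 1)]

def reduce_by_key(func, key_value_pairs):
    """Group values by key, then fold each group with func (conceptual sketch only)."""
    grouped = defaultdict(list)
    for key, value in key_value_pairs:
        grouped[key].append(value)  # the "shuffle" step: ('the', [1, 1, 1])
    # The "reduce" step: combine each key's values two at a time.
    return {key: reduce(func, values) for key, values in grouped.items()}

counts = reduce_by_key(lambda total, count: total + count, pairs)
print(counts)
```

The lambda is the same one used in the Spark cell: it receives a running total and the next count, and returns their sum.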

## Results

That operation seems to have done exactly what we want. **BUT** if that cell took a long time, that's because of the `.take()`. Transformations such as `map()`, `flatMap()`, and `reduceByKey()` are lazy: the full set of transformations is not applied until an action such as `take()`, `count()`, or `reduce()` asks for a result. A call to `collect()` will also run the pending transformations and collect the results *(this can be used when you don't need a reduce but need to finalize before, for example, writing to an output file)*. Let's check how many elements are in word_counts:

In [16]:
word_counts.count()

28687

## Saving

RDDs can be saved back to disk very easily. Just use the `saveAsTextFile()` function.

If the output path already exists, Spark will throw an error. Because the output is a directory of part files, it has to be deleted recursively with the command:

`hdfs dfs -rm -r /shake/spark_wordcount`

In [23]:
!hdfs dfs -rm -r /shake/spark_wordcount

rm: `/shake/spark_wordcount': No such file or directory


In [24]:
word_counts.saveAsTextFile('/shake/spark_wordcount')

In [25]:
!hdfs dfs -ls /shake

Found 2 items
-rw-r--r--   2 root hadoop    4538523 2019-06-26 01:31 /shake/shakespeare.txt
drwxr-xr-x   - root hadoop          0 2019-06-26 02:34 /shake/spark_wordcount


In [26]:
!hdfs dfs -ls /shake/spark_wordcount

Found 3 items
-rw-r--r--   2 root hadoop          0 2019-06-26 02:34 /shake/spark_wordcount/_SUCCESS
-rw-r--r--   2 root hadoop     229535 2019-06-26 02:34 /shake/spark_wordcount/part-00000
-rw-r--r--   2 root hadoop     225058 2019-06-26 02:34 /shake/spark_wordcount/part-00001
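
Each part file holds one line per RDD element, written as the element's string form. The sketch below assumes the lines look like `('his', 6529)` (values borrowed from the earlier cell output) and turns them back into Python tuples; `parse_line` is a hypothetical helper, and reading the actual part files from HDFS is left out.

```python
import ast

# Sample lines in the assumed format of a part file.
saved_lines = ["('his', 6529)", "('world', 608)", "('of', 16013)"]

def parse_line(line):
    """Safely turn the text "('word', 123)" back into a (word, count) tuple."""
    word, count = ast.literal_eval(line)
    return word, count

parsed = [parse_line(line) for line in saved_lines]
print(parsed)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice here than `eval()`.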


Here is another way to save a file.  A function that might be helpful is Spark's `coalesce()`.  Calling `coalesce(1)` collapses the data into a single partition, so it is written to disk as a single part file.

In [27]:
# Remember we can't have an existing file/folder when we write it to HDFS, so we delete it first.

In [28]:
!hdfs dfs -rm -r /testoutput

Deleted /testoutput


In [29]:
# spark.read.csv returns a Spark DataFrame. It is designed for csv files; we reuse our text file here only for demonstration.
text_df = spark.read.csv('/shake/shakespeare.txt')
text_df.coalesce(1).write.format('json').save('/testoutput')

In [30]:
!hdfs dfs -ls /testoutput

Found 2 items
-rw-r--r--   2 root hadoop          0 2019-06-26 02:34 /testoutput/_SUCCESS
-rw-r--r--   2 root hadoop    3782833 2019-06-26 02:34 /testoutput/part-00000-4686c0b3-0111-4763-b027-2af2f5fc5cd3-c000.json


In [31]:
!hdfs dfs -ls /

Found 6 items
drwx------   - mapred hadoop          0 2019-06-26 01:18 /hadoop
drwxr-xr-x   - root   hadoop          0 2019-06-26 02:34 /shake
drwxr-xr-x   - root   hadoop          0 2019-06-26 02:27 /testfile.json
drwxr-xr-x   - root   hadoop          0 2019-06-26 02:34 /testoutput
drwxrwxrwt   - hdfs   hadoop          0 2019-06-26 01:18 /tmp
drwxrwxrwt   - hdfs   hadoop          0 2019-06-26 01:30 /user


Of course, you can also use Python's built-in saving, or convert a Spark DataFrame to a pandas DataFrame with `toPandas()`. Note that `toPandas()` is a DataFrame method, so an RDD has to be converted to a DataFrame first: https://stackoverflow.com/a/48111699/4549682.

In [35]:
import pandas as pd
pandas_df = text_df.toPandas()
pandas_df.to_csv('path_to_file.csv')

Here is one way with base Python to save data, if you had json data: https://stackoverflow.com/a/12309296/4549682.
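
As a minimal sketch of the base-Python route, assume a small result has already been brought back to the driver (for example with `collect()`) and turned into a dictionary. The sample counts and the output path below are illustrative only.

```python
import json
import os
import tempfile

# Illustrative stand-in for a small collected result.
word_counts_sample = {"his": 6529, "world": 608, "of": 16013}

# Hypothetical output location in a temporary directory on the local disk.
out_path = os.path.join(tempfile.mkdtemp(), "word_counts.json")

# Write the dictionary as JSON text.
with open(out_path, "w", encoding="utf-8") as f:
    json.dump(word_counts_sample, f, indent=2)

# Read it back to confirm the round trip.
with open(out_path, encoding="utf-8") as f:
    reloaded = json.load(f)

print(reloaded)
```

This writes to the local file system of the machine running the notebook, not to HDFS, so it only suits results small enough to fit in driver memory.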

# But Wait! There's More!

Spark commands are meant to be **"chained"** together -- in other words, the output from one function call is immediately used as input for the next function call.

Let's see how chaining would work on our WordCount example. Remember, the variable **text_file** holds the contents of our source data file.

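Here is the code. We put a "." after each call to "chain" its result into the next call, and the trailing `\` is Python's line-continuation character. Note that this version skips the `.lower()` and `.replace()` cleanup, so the counts are case-sensitive:

`text_file.map(lambda x: x.split()). \
            flatMap(lambda wordlist: [(word, 1) for word in wordlist]). \
            reduceByKey(lambda total, count: total + count). \
            take(5)`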

In [36]:
text_file.map(lambda x: x.split()).      \
            flatMap(lambda wordlist: [(word, 1) for word in wordlist]). \
            reduceByKey(lambda total, count: total + count). \
            take(5) 

[("MIDSUMMER-NIGHT'S", 1),
 ('Now', 741),
 ('Hippolyta', 6),
 ('nuptial', 21),
 ('apace', 25)]