# Big Data Analytics - Lab 02



### Setting up our Spark environment
The next cell installs PySpark in the Google Colab environment. Spark is written in Scala and runs in a Java Virtual Machine (JVM). PySpark is the Python interface to that Spark backend; Spark also offers Java, R, Scala, and SQL frontends. In essence, PySpark sends Python Spark commands to the Spark JVM for evaluation, and the results are returned to the PySpark frontend.

In [None]:
# Do not change or modify this cell
# Install pyspark if it is not already present
# pip's output is redirected to /dev/null, so this cell prints nothing
! pip install pyspark >& /dev/null

In [None]:
# Create Spark Session and Spark Context
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('BDA-Lab-02').getOrCreate()
sc = spark.sparkContext

## Introduction to RDDs

Resilient Distributed Datasets (RDDs) are the core abstraction of Apache Spark. They are immutable collections that are split into partitions, and those partitions can reside in the memory of multiple machines.

We can create an RDD from Python objects using the `parallelize` function from PySpark.

In [None]:
list(range(20))

In [None]:
rdd = sc.parallelize(range(20))
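
As a rough picture of what `parallelize` does with `range(20)`, the sketch below cuts the sequence into contiguous partitions in plain Python, with no Spark involved. The helper `split_into_partitions` and the choice of 4 partitions are assumptions made for illustration; Spark picks its own number of partitions, and its exact slicing may differ.

```python
# Plain-Python sketch (no Spark needed): cut a sequence into roughly equal,
# contiguous partitions. `split_into_partitions` is a hypothetical helper,
# not a PySpark function.
def split_into_partitions(data, num_partitions):
    items = list(data)
    n = len(items)
    # Partition i covers items[i*n//k : (i+1)*n//k], where k = num_partitions.
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]

partitions = split_into_partitions(range(20), 4)
for index, part in enumerate(partitions):
    print(index, part)
```

On a real RDD, `rdd.getNumPartitions()` reports the partition count and `rdd.glom().collect()` returns the elements grouped by partition.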

Print the RDD

In [None]:
print(rdd)

Print the RDD type

In [None]:
type(rdd)

Show the first element of the RDD

In [None]:
rdd.first()

Create a Python list containing the first 9 elements of the RDD. The `take` method can be a heavyweight operation because the data has to be transferred from the Spark JVM into the Python interpreter's memory space. Taking a handful of elements is not a big deal, but the more you take, the heavier the operation becomes.

In [None]:
rdd.take(9)

Create a Python list containing all elements in the RDD.  Note that this is a *very* expensive operation since all of the data in the Spark Java VM memory space has to be collected and transferred into the Python interpreter's memory space.

In [None]:
rdd.collect()
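
The cost difference between `take` and `collect` can be sketched in plain Python by treating the RDD as a list of partitions. This is a simplified model under stated assumptions: the 4-partition layout and the helpers `sketch_take` and `sketch_collect` are hypothetical, and real Spark decides how many partitions to scan with its own strategy.

```python
# Assumed layout: 20 elements spread over 4 partitions of 5 elements each.
partitions = [list(range(start, start + 5)) for start in range(0, 20, 5)]

def sketch_take(partitions, n):
    """Read partitions in order and stop as soon as n elements are gathered."""
    taken = []
    scanned = 0
    for part in partitions:
        if len(taken) >= n:
            break
        scanned += 1
        taken.extend(part[: n - len(taken)])
    return taken, scanned

def sketch_collect(partitions):
    """Read every element of every partition."""
    return [x for part in partitions for x in part]

first_nine, scanned = sketch_take(partitions, 9)
everything = sketch_collect(partitions)
print(first_nine, "from", scanned, "partitions")
print(len(everything), "elements collected")
```

In this model `take(9)` touches only two of the four partitions, while `collect` always moves all of the data.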

We can apply functions to each element.  Let's define such a function.

In [None]:
def less_than_10(x):
    # True when x is strictly less than 10, False otherwise
    return x < 10

We'll start by using the `filter` method of `rdd`.

In [None]:
rdd.filter(less_than_10) # note that RDDs are "lazy": they will not execute until we call an "action"

In [None]:
rdd.filter(less_than_10).collect() # collect() is such an action

In [None]:
rdd.filter(less_than_10).count() # so is count()
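
Lazy evaluation can be illustrated without Spark, because Python's built-in `filter` is also lazy. The sketch below is only an analogy, not a description of how Spark is implemented; the `calls` list is an assumption added to record when the predicate actually runs.

```python
calls = []  # records every value the predicate is asked about

def less_than_10(x):
    calls.append(x)
    return x < 10

# Building the pipeline does no work yet (like an RDD transformation).
lazy = filter(less_than_10, range(20))
calls_before = len(calls)

# Materializing the result forces evaluation (like an RDD action).
result = list(lazy)
calls_after = len(calls)

print(calls_before, calls_after, result)
```

The predicate is not called at all until the result is requested, which mirrors why `rdd.filter(less_than_10)` on its own returns immediately.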

Remember when we said RDDs are immutable? The `filter` calls above returned new RDDs and left `rdd` untouched. If we collect `rdd` into a Python list, all of the original values are still there.

In [None]:
rdd.collect()

We can also use a `lambda` function for filtering.

In [None]:
rdd.filter(lambda x: x < 10).collect() # remember to call collect in order to see the results

We can apply functions to elements of an RDD using `map` or `flatMap`. Let's start by defining a function named `square` to apply to each element of `rdd`.

In [None]:
def square(x):
    return x**2

Apply the square function to each element of `rdd` using the `map` function.

In [None]:
rdd.map(square).collect()[:7] # we'll look at only the first 7 elements

We can also use a `lambda` function with `map`.

In [None]:
rdd.map(lambda x: x**2).collect()[:7]

### Activity
Let's re-use the function you created in lab 1 that checks if a number is prime. Apply that function to our `rdd` to return a list of prime numbers only.

In [None]:
# code

## MapReduce

In the next section, we will walk through some MapReduce exercises in order to develop an understanding of how MapReduce works.

The classic MapReduce paradigm can be accomplished by using `map`, `flatMap`, and `reduceByKey`.

### Computing **total** orders per month

Each row of the following RDD contains a month, a state, and the number of orders for that state in that month.

In [None]:
# create a python list
sales = [
['JAN', 'NY', 3.],
['JAN', 'PA', 1.],
['JAN', 'NJ', 2.],
['JAN', 'CT', 4.],
['FEB', 'PA', 1.],
['FEB', 'NJ', 1.],
['FEB', 'NY', 2.],
['FEB', 'VT', 1.],
['MAR', 'NJ', 2.],
['MAR', 'NY', 1.],
['MAR', 'VT', 2.],
['MAR', 'PA', 3.]]

# use the parallelize method to convert this list to an RDD
sales_rdd = sc.parallelize(sales)

Define the map function to apply to each element of the RDD.

In [None]:
def map_func(row):
    # key-value pairs are conventionally tuples in Spark
    return (row[0], row[2])

**Question:** What does this function do?

Apply `map_func` to each element of the RDD.

In [None]:
print("raw data:", sales_rdd.collect())
print("mapped data:", sales_rdd.map(map_func).collect())

Next, reduce to sum the number of orders per month.

In [None]:
def reduce_func(value1, value2):
    return value1 + value2

Put it all together.

In [None]:
sales_rdd.map(map_func).reduceByKey(reduce_func).collect()
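
To see what `reduceByKey` does with the mapped pairs, here is a plain-Python sketch that groups values by key and folds each group with the reducer. `sketch_reduce_by_key` is a hypothetical helper written for this illustration; Spark performs the same logical steps, but across partitions and without any guarantee on output order.

```python
from functools import reduce

sales = [
    ['JAN', 'NY', 3.], ['JAN', 'PA', 1.], ['JAN', 'NJ', 2.], ['JAN', 'CT', 4.],
    ['FEB', 'PA', 1.], ['FEB', 'NJ', 1.], ['FEB', 'NY', 2.], ['FEB', 'VT', 1.],
    ['MAR', 'NJ', 2.], ['MAR', 'NY', 1.], ['MAR', 'VT', 2.], ['MAR', 'PA', 3.],
]

def map_func(row):
    return (row[0], row[2])  # (month, orders)

def reduce_func(value1, value2):
    return value1 + value2

def sketch_reduce_by_key(pairs, func):
    # Step 1: group all values that share a key.
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    # Step 2: fold each group's values pairwise with the reducer.
    return [(key, reduce(func, values)) for key, values in grouped.items()]

totals = sketch_reduce_by_key(map(map_func, sales), reduce_func)
print(totals)
```

Note that the reducer only ever sees values, never keys, which is why `reduce_func` takes two plain numbers.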

### Computing **average** orders per month

The cell below defines a function that will be passed to `map`. The `avg_map_func` function takes a row from the RDD defined above and returns the value in the first column together with a tuple containing the value in the third column followed by a 1. The 1 will be used in the reducer to count the number of items for each key, where the key is the month.

In [None]:
MONTH_INDEX = 0
ORDER_INDEX = 2

def avg_map_func(row):
    return (row[MONTH_INDEX], (row[ORDER_INDEX], 1))

The `avg_reduce_func` takes `value1` and `value2` as inputs. Both are expected to be tuples of the form produced by `avg_map_func` above. The function adds up the floats and the 1's in the tuples, so we are summing the floats and the 1's associated with each unique key. Note that the key is not one of the arguments: the `reduceByKey` function below groups the data returned by the map function by key and passes only the values to the reducer.

In [None]:
COUNT_INDEX = 1
NUM_ORDER_INDEX = 0

def avg_reduce_func(value1, value2):
    # element-wise sum of two (order total, row count) tuples for the same key
    return (value1[NUM_ORDER_INDEX] + value2[NUM_ORDER_INDEX],
            value1[COUNT_INDEX] + value2[COUNT_INDEX])

Test out `avg_map_func`.

In [None]:
sales_rdd.map(avg_map_func).collect()

Below we test the `map` and `reduceByKey` functions. The `map` function returns the month (used as the key for the `reduceByKey` function), and a tuple containing the 3rd col floating point value followed by a 1.

In [None]:
sales_rdd.map(avg_map_func).reduceByKey(avg_reduce_func).collect()

Finally, we present 2 different ways to compute the final average, using the `mapValues` and `map` functions to divide the sum of the floats by the sum of the 1's. `mapValues` passes only the value to the function and leaves the key unchanged, so there is no need for double indexing. The sum of the 1's is the number of rows per key, so the result is the average.

In [None]:
TOTAL_INDEX = 0
print("Using mapValues:",
      sales_rdd.map(avg_map_func)\
      .reduceByKey(avg_reduce_func)\
      .mapValues(lambda x: x[TOTAL_INDEX]/x[COUNT_INDEX])\
      .collect())

KEY_INDEX = 0
VALUE_INDEX = 1
TOTAL_ORDER_INDEX = 0
COUNT_INDEX = 1
print("Using map:",
      sales_rdd.map(avg_map_func)\
      .reduceByKey(avg_reduce_func)\
      .map(lambda x: (x[KEY_INDEX], x[VALUE_INDEX][TOTAL_ORDER_INDEX]/x[VALUE_INDEX][COUNT_INDEX]))\
      .collect())
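
The reason for carrying a `(sum, count)` pair through the reducer, rather than averaging inside it, can be checked with plain Python on the January values. This is a small worked example, not Spark code; the variable names are chosen for illustration.

```python
from functools import reduce

jan_orders = [3.0, 1.0, 2.0, 4.0]  # the four JAN rows from `sales`

# Naive reducer: average the running result with each new value.
# ((3+1)/2 = 2.0, (2.0+2)/2 = 2.0, (2.0+4)/2 = 3.0)
naive_average = reduce(lambda a, b: (a + b) / 2, jan_orders)

# (sum, count) reducer: add element-wise, divide once at the end.
pairs = [(value, 1) for value in jan_orders]
total, count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), pairs)
correct_average = total / count

print("naive:", naive_average)
print("correct:", correct_average)
```

Pairwise averaging weights later values more heavily, so it gives the wrong answer, while element-wise addition can be applied in any order and grouping, which is what `reduceByKey` requires of its reducer.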

### Counting words in Shakespeare's collected works using **MapReduce**

We start by downloading data from a remote source. The `shakespeare.txt` file contains the complete works of William Shakespeare, obtained from Project Gutenberg (https://www.gutenberg.org/ebooks/100).

In [None]:
%%bash
if [[ ! -f shakespeare.txt ]]; then
   # download the data file from S3 and save it to the local environment
   wget https://syr-bda.s3.us-east-2.amazonaws.com/shakespeare.txt -q
fi

Create an RDD from the downloaded text file, then print its unique identifier.

In [None]:
shakespeare_rdd = sc.textFile('shakespeare.txt')
shakespeare_rdd.id()

In [None]:
shakespeare_rdd.first()

Note that the call to `first` actually returns a Python string.

In [None]:
type(shakespeare_rdd.first())

Convert the first 10 elements of the RDD to a Python list.

In [None]:
shakespeare_rdd.take(10)

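Count how many times the word `love` appears on each line. Because each line is split on whitespace, only the exact token `love` is counted; tokens such as `love,` or `loves` are not.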

In [None]:
def count_love(line):
    return line.lower().split().count('love')

In [None]:
shakespeare_rdd.map(count_love).take(10)

In [None]:
shakespeare_rdd.map(count_love).sum()

In [None]:
def has_love(line):
    # returns True if the line contains the substring "love" in any letter case
    # (this also matches longer words such as "lovely" or "glove")
    return "love" in line.lower()

In [None]:
shakespeare_rdd.filter(has_love).take(3)
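
The two functions above do not use the same notion of a match: `count_love` looks for exact whitespace-separated tokens, while `has_love` looks for a substring. The sketch below compares them on a made-up sentence and adds a third option based on a regular expression with word boundaries. The sample line and the helper `count_whole_word` are assumptions for illustration only.

```python
import re

def count_love(line):
    return line.lower().split().count('love')

def has_love(line):
    return "love" in line.lower()

def count_whole_word(line, word="love"):
    # \b marks a word boundary, so punctuation next to the word is allowed
    # but longer words that merely contain it are not.
    return len(re.findall(r"\b" + re.escape(word) + r"\b", line.lower()))

line = "My glove is lovely, my love."  # made-up sample line

print(count_love(line))        # token match
print(has_love(line))          # substring match
print(count_whole_word(line))  # whole-word match
```

Any of these functions can be passed to `map` or `filter` unchanged; which one is appropriate depends on whether punctuation and longer words should count.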

### Activity
Use a `lambda` function to achieve the same result as above.

In [None]:
# your code

Now, let's count every word in `shakespeare_rdd`.

Define utility functions to be used by `flatMap` and `reduceByKey`

In [None]:
def count_words(corpus):
    return [(word.lower(), 1) for word in corpus.split()]

def sum_words(first, second):
    return first + second

Let's break up the `flatMap` and `reduceByKey` operations. The `flatMap` operation takes a single element (in this case one line of text) and returns 0 or more output items (here, one `(word, 1)` pair per word in the line).

In [None]:
shakespeare_rdd.flatMap(count_words).take(25)

For comparison purposes only, here is what happens if we use `map` instead of `flatMap`. Notice how `map` returns a list of lists while `flatMap` returns a single list.

**Question:** Why would this structure be problematic?

In [None]:
shakespeare_rdd.map(count_words).take(5)
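
The difference in shape between `map` and `flatMap` can be reproduced in plain Python on two short lines. This is a sketch without Spark: a list comprehension stands in for `map`, and `itertools.chain.from_iterable` stands in for the flattening step. The sample lines are made up.

```python
from itertools import chain

def count_words(corpus):
    return [(word.lower(), 1) for word in corpus.split()]

lines = ["To be or", "not to be"]  # made-up sample lines

# map-like: one output element (a list of pairs) per input line
mapped = [count_words(line) for line in lines]

# flatMap-like: the per-line lists are concatenated into one sequence of pairs
flat_mapped = list(chain.from_iterable(count_words(line) for line in lines))

print(mapped)
print(flat_mapped)
```

The map-like result has one element per line, while the flatMap-like result has one `(word, 1)` element per word.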

Now, when we add the `reduceByKey` function onto the `flatMap` function, the `reduceByKey` function groups common words by key, and adds up all the ones associated with each word/key.

In [None]:
shakespeare_rdd.flatMap(count_words).reduceByKey(sum_words).take(10)
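
As a sanity check on the logic, the whole word-count pipeline can be run in plain Python on a tiny sample and compared against `collections.Counter`. This is a sketch under stated assumptions: the sample lines are made up, the dictionary loop stands in for `reduceByKey`, and the final sort shows one possible way to rank words by frequency.

```python
from collections import Counter

def count_words(corpus):
    return [(word.lower(), 1) for word in corpus.split()]

def sum_words(first, second):
    return first + second

lines = ["To be or not to be", "that is the question"]  # made-up sample

# flatMap step: one (word, 1) pair per word
pairs = [pair for line in lines for pair in count_words(line)]

# reduceByKey step: combine the values of equal keys with sum_words
counts = {}
for word, one in pairs:
    counts[word] = sum_words(counts[word], one) if word in counts else one

# Rank by descending count, breaking ties alphabetically.
top_two = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:2]

# Independent check with the standard library.
expected = dict(Counter(word.lower() for line in lines for word in line.split()))

print(counts)
print(top_two)
```

On the real RDD, the output of `reduceByKey` comes back in no particular order; PySpark's `sortBy` or `takeOrdered` can be used to rank the words by count.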