
In [8]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz
!pip install -q findspark







In [9]:
!pip install pyspark==2.4.5



In [10]:
import pyspark
pyspark.__version__


'2.4.5'

In [11]:
import os
# /usr/lib/jvm/java-8-openjdk-amd64
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
# /content/spark-2.4.5-bin-hadoop2.7
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"


In [12]:
import findspark
findspark.init()

In [13]:
pyspark.__version__


'2.4.5'


## Resilient Distributed Datasets (RDDs) - Lab

Resilient Distributed Datasets (RDDs) are the fundamental data structure of Spark. An RDD is essentially Spark's representation of a dataset spread across multiple machines, with APIs that let you act on it. An RDD can come from any data source, e.g. text files, a database, a JSON file, etc.


## Objectives

You will be able to:

- Use the map(func) transformation to apply a given function to all elements of an RDD across its partitions 
- Apply a map transformation for all elements of an RDD 
- Compare the difference between a transformation and an action within RDDs 
- Use collect(), count(), and take() actions to trigger Spark transformations  
- Use filter to select data that meets certain specifications within an RDD 
- Set number of partitions for parallelizing RDDs 
- Create RDDs from Python collections 


## What are RDDs? 

To get a better understanding of RDDs, let's break down each one of the components of the acronym RDD:

Resilient: RDDs are considered "resilient" because they have built-in fault tolerance. This means that even if one of the nodes goes offline, the RDD's data can be restored. This is already a huge advantage compared to standard storage: if a standard computer dies while performing an operation, everything in its memory is lost. With RDDs, Spark keeps track of how each partition was derived (its lineage), so if nodes go offline, the lost partitions can be recomputed on the remaining nodes.

Distributed: The data is contained on multiple nodes of a cluster-computing operation. It is efficiently partitioned to allow for parallelism.

Dataset: The dataset has been *partitioned* across the multiple nodes. 

RDDs are the building blocks upon which higher-level Spark operations are based. Chances are, if you are performing an operation using Spark, it involves RDDs. 



Key Characteristics of RDDs:

- Immutable: Once an RDD is created, it cannot be modified. 
- Lazily Evaluated: RDDs will not be evaluated until an action is triggered. Essentially, when you define transformations on an RDD, Spark records what should be done, but the work is not performed until an action explicitly asks for a result. Lazy evaluation lets users organize a Spark program into smaller operations, and it saves unnecessary computation and memory load.
- In-Memory: The operations in Spark are performed in memory rather than by repeatedly reading from and writing to disk. This is what allows Spark to perform fast operations with very large quantities of data.
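
Lazy evaluation can be seen without Spark at all. The sketch below is only a plain-Python analogy (the `apply_tax` function and the `prices` list are made up for illustration): Python 3's built-in `map` builds a recipe without running it, and consuming the result plays the role of a Spark action.

```python
calls = []  # records every time the function actually runs


def apply_tax(price):
    """Hypothetical function: keep 92% of the price."""
    calls.append(price)
    return price * 0.92


prices = [10.0, 20.0, 30.0]  # made-up data

# Python 3's built-in map is lazy: this line only builds a recipe
pending = map(apply_tax, prices)
calls_before = len(calls)  # the function has not run yet

# Consuming the iterator plays the role of an action
result = list(pending)
calls_after = len(calls)

print(calls_before, calls_after)  # 0 3
```

The function runs only when the values are actually requested, which is the same idea Spark applies to transformations on an RDD.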




### RDD Transformations vs Actions

In Spark, we first create a __base RDD__ and then apply one or more transformations to that base RDD following our processing needs. Being immutable means that **once an RDD is created, it cannot be changed**. As a result, **each transformation of an RDD creates a new RDD**. Finally, we can apply one or more **actions** to the RDDs. Spark uses lazy evaluation, so transformations are not actually executed until an action occurs.


<img src="https://github.com/megan91292/dsc-resilient-distributed-datasets-rdd-lab-onl01-dtsc-ft-041320/blob/master/images/rdd_diagram.png?raw=1" width=500>

### Transformations

Transformations create a new dataset from an existing one by passing each dataset element through a function and returning a new RDD representing the results. In short, creating an RDD from an existing RDD is a 'transformation'.
All transformations in Spark are lazy. They do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result that needs to be returned to the driver program.
A transformation is an operation on an RDD that returns another RDD, like map, flatMap, filter, reduceByKey, join, cogroup, etc.

### Actions
Actions return the final results of RDD computations. An action triggers execution: Spark uses the lineage graph to load the data into the original RDD, carries out all intermediate transformations, and then either returns the final result to the driver program or writes it out to the file system. An action returns a value (to the Spark driver, i.e. the user program).
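
To make the split between transformations and actions concrete, here is a toy model in plain Python. `MiniRDD` is a hypothetical class written for this illustration, not part of PySpark, and it ignores partitioning and distribution entirely. It only shows that transformations return new objects that remember their lineage, while actions run the whole recipe.

```python
class MiniRDD:
    """Toy stand-in for an RDD: stores a recipe, not results (hypothetical)."""

    def __init__(self, source, lineage):
        self._source = source    # zero-argument callable returning an iterator
        self.lineage = lineage   # tuple of operation names, oldest first

    @classmethod
    def from_collection(cls, items):
        items = tuple(items)     # snapshot so the base data cannot change
        return cls(lambda: iter(items), ("parallelize",))

    # Transformations: return a NEW MiniRDD and compute nothing yet
    def map(self, func):
        return MiniRDD(lambda: (func(x) for x in self._source()),
                       self.lineage + ("map",))

    def filter(self, pred):
        return MiniRDD(lambda: (x for x in self._source() if pred(x)),
                       self.lineage + ("filter",))

    # Actions: run the whole recipe and return a plain Python value
    def collect(self):
        return list(self._source())

    def count(self):
        return sum(1 for _ in self._source())


base = MiniRDD.from_collection(range(1, 11))
evens_squared = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

print(evens_squared.lineage)    # ('parallelize', 'map', 'filter')
print(evens_squared.collect())  # [4, 16, 36, 64, 100]
print(base.count())             # 10 (the base object is unchanged)
```

Because every transformation builds on the recipe of its parent, a lost result can always be recomputed from the base data, which is the intuition behind lineage-based fault tolerance.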

Here are some key transformations and actions that we will explore.


| Transformations   | Actions       |
|-------------------|---------------|
| map(func)         | reduce(func)  |
| filter(func)      | collect()     |
| groupByKey()      | count()       |
| reduceByKey(func) | first()       |
| mapValues(func)   | take()        |
| sample()          | countByKey()  |
| distinct()        | foreach(func) |
| sortByKey()       |               |


Let's see how transformations and actions work through a simple example. In this example, we will perform several actions and transformations on RDDs in order to obtain a better understanding of Spark processing. 

### Create a Python collection 

We need some data to start experimenting with RDDs. Let's create some sample data and see how RDDs handle it. To practice working with RDDs, we're going to use a simple Python list.

- Create a Python list `data` of integers between 1 and 1000 using the `range()` function. 
- Sanity check: confirm the length of the list (it should be 1000)

In [20]:
data = range(1,1001)
len(data)
# 1000

1000

### Initialize an RDD

When using Spark to make computations, datasets are treated as lists of entries. Those lists are split into different partitions across different cores or different computers. Each list of data held in memory is a partition of the RDD. A major reason Spark can make computations faster than many other big data processing frameworks is that it can keep data __in-memory__, which allows for easy access to the data and, in turn, high-speed processing. Here is an example of how the alphabet might be split into different partitions and held across a distributed collection of nodes:

<img src ="./images/partitions_1.png" width ="500">  
To initialize an RDD, first import `pyspark` and then create a SparkContext assigned to the variable `sc`. Use `'local[*]'` as the master.

In [15]:
import pyspark
sc = pyspark.SparkContext('local[*]')

Once you've created the SparkContext, you can use the `.parallelize()` method to create an RDD that will distribute the list of numbers across multiple cores. Here, create one called `rdd` with 10 partitions using `data` as the collection you are parallelizing.

In [21]:
rdd = sc.parallelize(data, numSlices=10)
print(type(rdd))


<class 'pyspark.rdd.PipelinedRDD'>


Determine how many partitions are being used with this RDD with the `.getNumPartitions()` method.

In [22]:
rdd.getNumPartitions()
# 10

10
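
To picture what ten partitions of 1000 values might look like, here is a plain-Python sketch. `split_into_partitions` is a hypothetical helper that cuts a sequence into contiguous, near-equal chunks; it is an illustration of the idea, not Spark's actual slicing code. On a real RDD, `rdd.glom().collect()` returns the elements grouped by partition, so you can inspect the true layout.

```python
def split_into_partitions(items, num_slices):
    """Cut a sequence into num_slices contiguous, near-equal chunks (illustrative)."""
    items = list(items)
    n = len(items)
    # Boundary i sits at floor(i * n / num_slices)
    bounds = [(i * n) // num_slices for i in range(num_slices + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(num_slices)]


partitions = split_into_partitions(range(1, 1001), 10)

print(len(partitions))                          # 10
print([len(p) for p in partitions])             # ten chunks of 100 elements
print(partitions[0][:3], partitions[-1][-3:])   # [1, 2, 3] [998, 999, 1000]
```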

### Basic descriptive RDD actions

Let's perform some basic operations on our RDD. In the cell below, use the methods:
* `count`: returns the total count of items in the RDD 
* `first`: returns the first item in the RDD
* `take`: returns the first `n` items in the RDD
* `top`: returns the `n` largest items, in descending order
* `collect`: returns everything from your RDD


It's important to note that in a big data context, `collect` brings every element of the RDD back to the driver program, so it will often take a very long time to execute and can exhaust the driver's memory. Handle it with care!

In [23]:
# count
rdd.count()

1000

In [24]:
# first
rdd.first()

1

In [28]:
# take
rdd.take(10)


[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [29]:
# top
rdd.top(10)

[1000, 999, 998, 997, 996, 995, 994, 993, 992, 991]
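
The difference between `take` and `top` is easier to see on a small partitioned dataset. The following is a plain-Python sketch over made-up data, with hypothetical helper functions named after the RDD methods; it assumes a simple "local winners per partition, then merge" strategy for `top` and is not taken from Spark's source.

```python
import heapq
from itertools import islice

partitions = [[1, 7, 3], [9, 2, 8], [4, 6, 5]]  # made-up data in 3 partitions


def take(parts, n):
    """First n elements, reading partitions in order and stopping early."""
    return list(islice((x for part in parts for x in part), n))


def top(parts, n):
    """Largest n elements: find local winners per partition, then merge them."""
    local = [heapq.nlargest(n, part) for part in parts]
    return heapq.nlargest(n, (x for winners in local for x in winners))


print(take(partitions, 4))  # [1, 7, 3, 9]
print(top(partitions, 4))   # [9, 8, 7, 6]
```

`take` depends on the order of the data, while `top` depends only on the values, which is why the two cells above return such different results for the same RDD.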

In [30]:
# collect
rdd.collect()

[1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29,
 30,
 31,
 32,
 33,
 34,
 35,
 36,
 37,
 38,
 39,
 40,
 41,
 42,
 43,
 44,
 45,
 46,
 47,
 48,
 49,
 50,
 51,
 52,
 53,
 54,
 55,
 56,
 57,
 58,
 59,
 60,
 61,
 62,
 63,
 64,
 65,
 66,
 67,
 68,
 69,
 70,
 71,
 72,
 73,
 74,
 75,
 76,
 77,
 78,
 79,
 80,
 81,
 82,
 83,
 84,
 85,
 86,
 87,
 88,
 89,
 90,
 91,
 92,
 93,
 94,
 95,
 96,
 97,
 98,
 99,
 100,
 101,
 102,
 103,
 104,
 105,
 106,
 107,
 108,
 109,
 110,
 111,
 112,
 113,
 114,
 115,
 116,
 117,
 118,
 119,
 120,
 121,
 122,
 123,
 124,
 125,
 126,
 127,
 128,
 129,
 130,
 131,
 132,
 133,
 134,
 135,
 136,
 137,
 138,
 139,
 140,
 141,
 142,
 143,
 144,
 145,
 146,
 147,
 148,
 149,
 150,
 151,
 152,
 153,
 154,
 155,
 156,
 157,
 158,
 159,
 160,
 161,
 162,
 163,
 164,
 165,
 166,
 167,
 168,
 169,
 170,
 171,
 172,
 173,
 174,
 175,
 176,
 177,
 178,
 179,
 180,
 181,
 182,
 183,
 184,
 185

## Map functions

Now that you've been working a little bit with RDDs, let's make this a little more interesting. Imagine you're running a hot new e-commerce startup called BuyStuff, and you're trying to keep track of how much it charges customers for each item sold. In the next cell, we're going to create simulated data by multiplying each of the values 1-1000 by a random number between 0 and 1.

In [34]:
import random
import numpy as np

nums = np.array(range(1, 1001))
sales_figures = nums * np.random.rand(1000)
sales_figures

array([3.93928131e-01, 1.50942767e+00, 1.85584334e+00, 2.80605310e+00,
       4.78054228e+00, 5.31416262e+00, 3.80244547e+00, 1.01150017e+00,
       8.89924643e+00, 7.29565587e+00, 1.93414037e+00, 7.12266566e+00,
       1.08355305e+01, 7.06780783e+00, 3.52142160e+00, 2.58557529e+00,
       1.12508372e+01, 3.62129336e-01, 5.28468254e+00, 4.40800340e+00,
       2.37789267e+00, 1.98807513e+01, 2.00858608e+01, 2.07225521e+01,
       1.32366962e+01, 1.62586069e+01, 1.63881086e+01, 2.52112300e+01,
       2.79246170e+01, 1.44297913e+01, 4.76988446e+00, 2.54009030e+01,
       2.57723393e+01, 1.34798932e+01, 2.71502363e+01, 2.86786984e+01,
       2.59649596e+01, 2.78638437e+01, 3.13992007e+01, 3.15864605e+00,
       1.76704352e+01, 3.66480398e+01, 1.76452470e+01, 2.59419513e+01,
       1.63079913e+01, 1.61302309e+01, 3.02288717e+01, 3.51403541e+01,
       1.09965880e+01, 1.85033286e+01, 2.76751681e+01, 4.31178960e+01,
       4.39969002e+01, 5.02317825e+01, 2.36458360e+01, 3.81938520e+01,
      

We now have sales prices for 1000 items currently for sale at BuyStuff. Now create an RDD called `price_items` using the newly created data with 10 slices. After you create it, use one of the basic actions to see what's in the RDD.

In [35]:
price_items = sc.parallelize(sales_figures, numSlices=10)
price_items.take(4)

[0.3939281307108007, 1.5094276663589812, 1.8558433402773087, 2.80605310459775]

Now let's perform some operations on this simple dataset. To begin with, create a function that will take into account how much money BuyStuff will receive after sales tax has been applied (assume a sales tax of 8%). To make this happen, create a function called `sales_tax()` that returns the amount of money our company will receive after the sales tax has been applied. The function will have this parameter:

* `item`: (float) the item's price, which is multiplied by 0.92 to remove the 8% sales tax.


Apply that function to the RDD by using the `.map()` method and assign the result to a variable `revenue_minus_tax`.

In [37]:
def sales_tax(item):
    # keep 92% of the price: what remains after an 8% sales tax
    return item * 0.92

revenue_minus_tax = price_items.map(sales_tax)
revenue_minus_tax

PythonRDD[10] at RDD at PythonRDD.scala:53

Remember, Spark has __lazy evaluation__, which means that the `.map()` transformation applying `sales_tax()` is not executed until you call an action. Use one of the actions covered earlier to execute the transformation and observe the contents of the `revenue_minus_tax` RDD.

In [38]:
# perform action to retrieve rdd values
revenue_minus_tax.take(10)


[0.36241388025393667,
 1.3886734530502627,
 1.707375873055124,
 2.58156885622993,
 4.398098895947657,
 4.889029611366141,
 3.498249830757779,
 0.9305801603964347,
 8.187306712876552,
 6.712003403901118]

### Lambda Functions

Note that you can also use lambda functions if you want to quickly perform simple operations on data without defining a named function. Let's assume that BuyStuff has also decided to offer a 10% discount on the pre-tax price of every item. Use a lambda function within a `.map()` method to apply the additional 10% loss in revenue for BuyStuff and assign the transformed RDD to a new RDD called `discounted`.

In [51]:
discounted = price_items.map(lambda x:0.9*x)
discounted.count()

1000

In [42]:
discounted.take(10)

[0.3545353176397206,
 1.3584848997230832,
 1.670259006249578,
 2.525447794137975,
 4.302488050383577,
 4.782746358945137,
 3.4222009213934794,
 0.9103501569095557,
 8.009321784335757,
 6.566090286425007]

## Chaining Methods

You are also able to chain methods together with Spark. In one line, remove the tax and discount from the revenue of BuyStuff and use a collection method to see the 15 costliest items.

In [43]:
price_items.map(sales_tax).map(lambda x : x*0.9).top(15)


[796.2147333929859,
 770.4425215053794,
 751.0947906992728,
 749.4791802722693,
 744.4101787666149,
 734.2842900481018,
 718.3405563284479,
 716.6798449771912,
 712.7343634940011,
 710.5376684223185,
 698.9435220355577,
 698.5232497195992,
 693.1515038919542,
 689.1496849129852,
 688.0736091116239]
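
Chaining two `.map()` calls is the same as applying both functions one after the other to each element. The plain-Python check below uses made-up prices to show that removing the 8% tax and then the 10% discount amounts to multiplying by 0.92 × 0.9 = 0.828.

```python
import math


def sales_tax(item):
    # keep 92% of the price, as in the cell above
    return item * 0.92


prices = [100.0, 250.0, 999.5]  # made-up prices

# Two chained maps: tax first, then the 10% discount
chained = list(map(lambda x: x * 0.9, map(sales_tax, prices)))

# One pass doing the same multiplications in the same order
combined = [x * 0.92 * 0.9 for x in prices]

# The two costliest items after both reductions
top_two = sorted(chained, reverse=True)[:2]

print(chained == combined)  # True
print(all(math.isclose(c, p * 0.828) for c, p in zip(chained, prices)))  # True
```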

## RDD Lineage


We are able to see the full lineage of all the operations that have been performed on an RDD by using the `RDD.toDebugString()` method. As your transformations become more complex, you are encouraged to call this method to get a better understanding of the dependencies between RDDs. Try calling it on the `discounted` RDD to see what RDDs it is dependent on.

In [44]:
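# price_items is the base RDD, so only one entry appears; run this on `discounted` to see its dependency on price_items
price_items.toDebugString()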

b'(10) ParallelCollectionRDD[8] at parallelize at PythonRDD.scala:195 []'

### Map vs. FlatMap

Depending on how you want your data to be outputted, you might want to use `.flatMap()` rather than a simple `.map()`. Let's take a look at how it performs operations versus the standard map. Let's say we wanted to maintain the original amount BuyStuff receives for each item as well as the new amount after the tax and discount are applied. Create a map function that will return a tuple with (original price, post-discount price).

In [45]:
mapped = price_items.map(lambda x: (x, x*0.92 *0.9))
print(mapped.count())
print(mapped.take(10))

1000
[(0.3939281307108007, 0.326172492228543), (1.5094276663589812, 1.2498061077452365), (1.8558433402773087, 1.5366382857496117), (2.80605310459775, 2.323411970606937), (4.780542278203974, 3.958289006352891), (5.314162621050152, 4.400126650229526), (3.802445468214977, 3.148424847682001), (1.0115001743439507, 0.8375221443567912), (8.89924642703973, 7.368576041588897), (7.295655873805563, 6.040803063511007)]


Note that we have 1000 tuples created to our specification. Let's take a look at how `.flatMap()` differs in its implementation. Use the `.flatMap()` method with the same function you created above.

In [46]:
flat_mapped = price_items.flatMap(lambda x: (x, x*0.92 *0.9))
print(flat_mapped.count())
print(flat_mapped.take(10))

2000
[0.3939281307108007, 0.326172492228543, 1.5094276663589812, 1.2498061077452365, 1.8558433402773087, 1.5366382857496117, 2.80605310459775, 2.323411970606937, 4.780542278203974, 3.958289006352891]


Rather than being represented by tuples, all of the values are now on the same level. When combining different items, it is sometimes necessary to use `.flatMap()` rather than `.map()` so that a later reduce step works as intended. This is not one of those instances, but in the upcoming lab, you just might have to use it.
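
The same contrast can be reproduced in plain Python, where a list comprehension behaves like `.map()` and `itertools.chain.from_iterable` behaves like `.flatMap()`. The data and the `with_net` helper below are made up for illustration. The second half shows a case where flattening is genuinely needed: splitting lines of text into individual words.

```python
from itertools import chain

prices = [10.0, 20.0, 30.0]  # made-up prices


def with_net(price):
    """Return (original price, price after tax and discount)."""
    return (price, price * 0.92 * 0.9)


# map-like: one output element (a tuple) per input element
mapped = [with_net(p) for p in prices]

# flatMap-like: each returned tuple is unpacked into the output
flat_mapped = list(chain.from_iterable(with_net(p) for p in prices))

print(len(mapped), len(flat_mapped))  # 3 6

# Flattening matters when each input yields a variable number of outputs
lines = ["to be or", "not to be"]
words = list(chain.from_iterable(line.split() for line in lines))
print(words)  # ['to', 'be', 'or', 'not', 'to', 'be']
```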

## Filter
After meeting with some external consultants, BuyStuff has determined that its business will be more profitable if it focuses on higher ticket items. Now, use the `.filter()` method to select items that bring in more than $300 after tax and discount have been removed. The `.filter()` method is a transformation that keeps only the items that match a certain criterion. In the cell below:
* use a lambda function within a `.filter()` method to meet the consultants' specifications and assign the resulting RDD to `selected_items`
* calculate the total number of items remaining in BuyStuff's inventory

In [50]:
# `discounted` holds prices after the 10% discount only (sales tax has not been removed here)
selected_items = discounted.filter(lambda x: x>300)
selected_items.count()

286

## Reduce

Reduce functions are where you combine, in some way, all of the values that you have mapped out. Here is an example of how a reduce function works when the task is to sum all values:

<img src = "./images/reduce_function.png" width = "600">  


As you can see, the operation is performed within each partition first, after which, the results of the computations in each partition are combined to come up with one final answer.  
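
The two-stage process described above can be sketched in plain Python with `functools.reduce` over made-up partitions. The sketch also shows why the function passed to `.reduce()` should be associative and commutative: with subtraction, the answer depends on how the data happens to be partitioned.

```python
from functools import reduce

partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # made-up data in 3 partitions


def add(x, y):
    return x + y


def sub(x, y):
    return x - y


# Stage 1: reduce inside each partition; stage 2: combine the partial results
partials = [reduce(add, part) for part in partitions]
total = reduce(add, partials)
print(partials, total)  # [6, 9, 30] 45

# Subtraction gives a different answer once the data is split up
flat = [x for part in partitions for x in part]
sequential = reduce(sub, flat)
partitioned = reduce(sub, [reduce(sub, part) for part in partitions])
print(sequential, partitioned)  # -43 15
```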

Now it's time to figure out how much money BuyStuff would make from selling one of each of its items after reducing its inventory. Use the `.reduce()` method with a lambda function to add up all of the values in the RDD. Your lambda function should take two arguments. 

In [52]:
selected_items.reduce(lambda x,y: x + y)


138574.54356452645

The time has come for BuyStuff to open up shop and start selling its goods. It only has one of each item, but it's allowing 50 lucky users to buy as many items as they want while they remain in stock. Within seconds, BuyStuff is sold out. Below, you'll find the sales data in an RDD with tuples of (user, item bought).

In [53]:
import random
random.seed(42)
# generating simulated users that have bought each item
sales_data = selected_items.map(lambda x: (random.randint(1, 50), x))

sales_data.take(7)

[(42, 303.839786720822),
 (43, 305.8462050570233),
 (7, 317.4951351253698),
 (12, 322.1369333601892),
 (33, 320.8667169101951),
 (18, 326.8899128874424),
 (48, 334.7869376878819)]

It's time to determine some basic statistics about BuyStuff users.

Let's start off by creating an RDD that determines how much each user spent in total.
To do this we can use a method called `.reduceByKey()` to perform reducing operations while grouping by keys. After you have calculated the total, use the `.sortBy()` method on the RDD to rank the users from the highest spending to the least spending. 

In [54]:
# calculate how much each user spent
total_spent = sales_data.reduceByKey(lambda x, y: x + y)
total_spent.take(10)

[(20, 2550.62167010829),
 (40, 5329.892233205175),
 (50, 2125.161568465289),
 (10, 1648.705259073352),
 (30, 884.1275149021747),
 (41, 6467.923916367787),
 (31, 4066.897397390192),
 (11, 2249.2303672309895),
 (1, 2374.647489040058),
 (21, 1176.6334614527518)]
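
Conceptually, `.reduceByKey()` groups the pairs by key and folds each group's values together with the given function. Here is a single-machine sketch in plain Python; `reduce_by_key` is a hypothetical helper and the `sales` pairs are made up, so this shows the semantics rather than how Spark distributes the work.

```python
def reduce_by_key(pairs, func):
    """Fold the values of each key together with func (single-machine sketch)."""
    totals = {}
    for key, value in pairs:
        totals[key] = func(totals[key], value) if key in totals else value
    return totals


# made-up (user, amount) pairs
sales = [(42, 300.0), (7, 310.5), (42, 450.0), (7, 20.0), (13, 99.0)]

total_spent = reduce_by_key(sales, lambda x, y: x + y)

# Rank users from the highest to the lowest total
ranked = sorted(total_spent.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # [(42, 750.0), (7, 330.5), (13, 99.0)]
```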

In [55]:
# sort the users from highest to lowest spenders
total_spent.sortBy(lambda x: x[1],ascending = False).collect()


[(48, 6760.551400801929),
 (41, 6467.923916367787),
 (33, 5921.934984531346),
 (44, 5408.487017223314),
 (40, 5329.892233205175),
 (29, 4649.261269696228),
 (24, 4577.34034123956),
 (39, 4306.754062236549),
 (47, 4259.512021230312),
 (31, 4066.897397390192),
 (38, 3761.6133466770793),
 (14, 3725.1742241285287),
 (42, 3710.513851125879),
 (27, 3663.7550434359587),
 (28, 3655.957720331094),
 (43, 3577.937890884301),
 (34, 3467.6428724687435),
 (2, 3384.173078334071),
 (17, 2852.447474972595),
 (8, 2847.324992527453),
 (16, 2797.5762806918865),
 (25, 2690.910962795764),
 (26, 2610.362550964748),
 (20, 2550.62167010829),
 (35, 2428.3531599310836),
 (1, 2374.647489040058),
 (49, 2366.687906939614),
 (11, 2249.2303672309895),
 (6, 2206.8207839784245),
 (50, 2125.161568465289),
 (23, 2106.2973593045544),
 (46, 2094.3183205861747),
 (12, 2084.5532638627938),
 (15, 2067.722976228968),
 (9, 1944.2104715923297),
 (37, 1874.5289496613143),
 (18, 1822.1427904132777),
 (19, 1777.8407286228164),
 (5,

Next, let's determine how many items were bought per user. This can be solved in one line using an RDD method. After you've counted the total number of items bought per person, sort the users from the most items bought to the fewest. Time to start a customer loyalty program!

In [56]:
total_items = sales_data.countByKey()
sorted(total_items.items(),key=lambda kv:kv[1],reverse=True)

[(28, 14),
 (25, 11),
 (44, 10),
 (40, 10),
 (36, 10),
 (50, 9),
 (17, 9),
 (6, 8),
 (41, 8),
 (30, 8),
 (22, 8),
 (8, 8),
 (26, 7),
 (39, 7),
 (49, 7),
 (43, 7),
 (23, 7),
 (4, 7),
 (48, 6),
 (13, 6),
 (32, 6),
 (1, 6),
 (21, 6),
 (16, 5),
 (10, 5),
 (46, 5),
 (18, 5),
 (38, 5),
 (34, 5),
 (37, 4),
 (2, 4),
 (7, 4),
 (24, 4),
 (29, 4),
 (33, 4),
 (15, 4),
 (9, 4),
 (14, 4),
 (12, 4),
 (45, 4),
 (27, 4),
 (31, 3),
 (3, 3),
 (19, 3),
 (5, 3),
 (20, 3),
 (11, 2),
 (47, 2),
 (42, 2),
 (35, 2)]
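
Note that `.countByKey()` is an action: it returns an ordinary Python dictionary to the driver, which is why the cell above sorts with Python's built-in `sorted()` rather than an RDD method. The same counting logic can be sketched in plain Python with `collections.Counter` on made-up pairs.

```python
from collections import Counter

# made-up (user, amount) pairs
sales = [(42, 300.0), (7, 310.5), (42, 450.0), (7, 20.0), (13, 99.0), (42, 5.0)]

# Count how many pairs share each key, ignoring the values
items_per_user = Counter(user for user, _ in sales)

# most_common() sorts from the highest count to the lowest
print(items_per_user.most_common())  # [(42, 3), (7, 2), (13, 1)]
```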

### Additional Reading

- [The original paper on RDDs](https://cs.stanford.edu/~matei/papers/2012/nsdi_spark.pdf)
- [RDDs in Apache Spark](https://data-flair.training/blogs/create-rdds-in-apache-spark/)
- [Programming with RDDs](https://runawayhorse001.github.io/LearningApacheSpark/rdd.html)
- [RDD Transformations and Actions Summary](https://www.analyticsvidhya.com/blog/2016/10/using-pyspark-to-perform-transformations-and-actions-on-rdd/)

## Summary

In this lab, we went through a brief introduction to RDD creation from a Python collection, setting the number of logical partitions for an RDD, and extracting lineage. We also used transformations and actions to perform calculations across RDDs on a distributed setup. In the next lab, you'll get the chance to apply these transformations on different books to calculate word counts and various statistics.
