### Resilient Distributed Dataset (RDD) Tutorial

### SparkContext
The `SparkContext` is your window to the cluster. It lets you:
* Create RDDs
* Use counters and accumulators to communicate between nodes

The pyspark shell (or notebook) creates it automatically as `sc`.

In [1]:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
#sc = SparkContext( 'spark://headnodehost:7077', 'pyspark')


### Create an RDD
Here we distribute some data across the cluster so that later processing can run on it in parallel, and therefore faster.

In [64]:
# Create a distributed list of numbers on the cluster
nums = sc.parallelize(range(20))
print(type(nums))

<class 'pyspark.rdd.PipelinedRDD'>


In [65]:
# Define a function to run on each element.
# map is lazy: nothing runs until an action such as collect() is called.
squared = nums.map(lambda x: x * x)

# Now execute the function
print(squared.collect())

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361]
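
The deferred execution noted in the comments above can be imitated locally with Python's built-in lazy `map`. This is only an analogy under that assumption, not a description of how Spark schedules work; `square_logged` and `calls` are hypothetical names used for illustration.

```python
# Local analogy for lazy transformations (no Spark required).
calls = []  # records every time the function actually runs


def square_logged(x):
    calls.append(x)
    return x * x


# Like nums.map(...): this only builds a recipe, nothing runs yet.
lazy_squares = map(square_logged, range(5))
calls_before = len(calls)

# Like .collect(): forces evaluation of every element.
result = list(lazy_squares)
calls_after = len(calls)

print(calls_before, calls_after, result)
```

The function is not invoked until the results are materialized, which is the same behavior the `map` / `collect` pair shows on the RDD.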


In [68]:
print(nums.collect())
print('Sample:',nums.sample(fraction=0.2, withReplacement=False).collect())

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
Sample: [0, 1, 5, 11, 13, 16, 17, 18]
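
Note that the sample above holds 8 of the 20 elements even though `fraction=0.2`: the fraction is an expected proportion, not an exact size. A minimal local sketch, assuming that sampling without replacement keeps each element independently with probability `fraction` (`bernoulli_sample` is a hypothetical helper, not a Spark API):

```python
import random


def bernoulli_sample(items, fraction, seed=None):
    """Keep each item independently with probability `fraction`."""
    rng = random.Random(seed)
    return [item for item in items if rng.random() < fraction]


data = list(range(20))

# The sample size changes with the seed; only its average tends toward 20 * 0.2.
sizes = [len(bernoulli_sample(data, 0.2, seed=s)) for s in range(5)]
print(sizes)
```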


In [4]:
reduce_nums_sum = nums.reduce(lambda x,y:x+y)
print(reduce_nums_sum)

190


### Passing functions to cluster

In [5]:
# This function will be shipped to the cluster and executed there
def someOperation(value):
    return (value * 2) + 1

# Apply the function "someOperation" to every element of the RDD
nums.map(someOperation).collect()

[1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39]

### Functional Programming
Most Apache Spark operations come from the functional programming paradigm.

#### Map
The map operation applies a function to each element of your data. It returns the same number of elements as the input.
![Map](map.png)
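
As a small local illustration (plain Python, no Spark), the built-in `map` shows the one-in, one-out behavior; the word list is sample data chosen for this sketch.

```python
# One output element per input element.
words = ['leonardo', 'araujo', 'dos']
lengths = list(map(lambda w: len(w), words))

print(lengths)  # [8, 6, 3]
```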

#### FlatMap
Similar to map, but each input item can be mapped to zero or more output items; the results are flattened into a single collection.
![FlatMap](flatmap.png)
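
The difference from map can be sketched locally in plain Python, with `itertools.chain.from_iterable` standing in for the flattening step. The input lines are sample data, and `str.split()` with no argument is used here so that an empty line yields zero items.

```python
from itertools import chain

lines = ['hello world', '', 'spark is fast']

# map: one output (a list of words) per input line
mapped = [line.split() for line in lines]

# flat map: the per-line lists are flattened into a single list of words
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

print(mapped)       # [['hello', 'world'], [], ['spark', 'is', 'fast']]
print(flat_mapped)  # ['hello', 'world', 'spark', 'is', 'fast']
```

Three input lines produce three results with map, but five words with the flattened version, and the empty line contributes nothing.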

#### Reduce
The reduce operation repeatedly applies a two-argument function to combine the elements of your data until a single value remains.
![Reduce Diagram](reduce_diagram.png)
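
A local sketch with `functools.reduce` may help. It also imitates, under the simplifying assumption of two hand-made "partitions", why the function passed to an RDD's `reduce` should be commutative and associative: partial results are computed separately and then merged. The variable names are illustrative only.

```python
from functools import reduce

values = list(range(10))
add = lambda x, y: x + y

# Sequential reduce: ((0 + 1) + 2) + ... + 9
total = reduce(add, values)

# Imitate two partitions reduced independently, then merged.
partitions = [values[:5], values[5:]]
partials = [reduce(add, part) for part in partitions]
merged = reduce(add, partials)

print(total, partials, merged)  # 45 [10, 35] 45
```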

#### Reduce By Key
It works like reduce, but the function is applied only among elements that share the same key. It returns one (key, value) pair per distinct key.
![Reduce by Key Diagram](reduce_by_key.png)
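
The idea can be sketched locally in a few lines of plain Python. `reduce_by_key` below is a hypothetical helper written for this illustration; it is not Spark's implementation and ignores partitioning entirely.

```python
def reduce_by_key(pairs, func):
    """Combine the values of each key with `func`, one result per key."""
    result = {}
    for key, value in pairs:
        if key in result:
            result[key] = func(result[key], value)
        else:
            result[key] = value
    return result


pairs = [('leonardo', 1), ('araujo', 1), ('leonardo', 1)]
counts = reduce_by_key(pairs, lambda x, y: x + y)

print(counts)  # {'leonardo': 2, 'araujo': 1}
```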

In [27]:
# Split the string on spaces and store into a list of strings
lst_words = 'leonardo araujo dos santos is leonardo'.split(" ")
print(lst_words)
lst_words_with_count = map(lambda x: (x,1), lst_words)
list(lst_words_with_count)

['leonardo', 'araujo', 'dos', 'santos', 'is', 'leonardo']


[('leonardo', 1),
 ('araujo', 1),
 ('dos', 1),
 ('santos', 1),
 ('is', 1),
 ('leonardo', 1)]

In [19]:
# Load a text file and distribute its lines across the cluster
lines = sc.textFile('../data/TweetsText/tweets2.txt')

# Get list of all words
words = lines.flatMap(lambda x: x.split(" "))

# Count words: reduceByKey applies the lambda only among values that share the same key
word_count = (words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x+y))

# The computation actually starts here: saveAsTextFile is an action
word_count.saveAsTextFile('word_count_result')

#### Get all lines that mention Obama

In [72]:
# Filter the RDD with a lambda (case-insensitive match)
linesObama = lines.filter(lambda x: "obama" in x.lower())
print(linesObama.count())

# Just show the lines
linesObama.collect()

6


["The BBC's Washington correspondents track developments in the first 100 days of Barack Obama's presidency. http://tinyurl.com/d97jot",
 "Obama's Conversation Starter: National Journal.com's Amy Harder wrote an interesting profile of Joe Rospars who .. http://tinyurl.com/ck8436",
 'yuo refuse to accept the facts, only dismiss; Why would your pal Obama do everything he can with team of lawyers to block #TCOT',
 'Its as ridiculous as you posing as #TCOT but defending the #Obama tyranny LOL #TCOT',
 'Confirmed: The Obama DHS hit job on conservatives is real - Malkin - April 14 http://ff.im/-279qc',
 "Now that Obama is president it puts severe limits on how he can use email and indeed Twitter. That's why his account is dead."]