In [1]:
sc

**SparkContext** - an application instance that represents the connection to the Spark workers and master. It is created by the Spark driver.
Once we have a SparkContext, we can use it to build RDDs.

## What is an RDD (Resilient Distributed Dataset)?

An RDD is a dataset that consists of records.

**Key properties**

- **Distributed:** the data in an RDD is divided into **partitions** and distributed as in-memory collections of objects across worker nodes
- **Immutable**: once created, an RDD is never changed
- **Resilient**: an RDD is automatically rebuilt on failure. Instead of relying on replication, RDDs track lineage information to rebuild lost data


#### How to create an RDD?
1. Load an external dataset:

In [3]:
lines = sc.textFile("/Users/owner/USF/spark/2017-msan694-example-master/Data/README.md")

In [4]:
lines.take(5)

[u'# Apache Spark',
 u'',
 u'Spark is a fast and general cluster computing system for Big Data. It provides',
 u'high-level APIs in Scala, Java, Python, and R, and an optimized engine that',
 u'supports general computation graphs for data analysis. It also supports a']

2. Parallelize a collection: take a sequence such as a list and create an RDD from its elements, distributing them to the Spark executors

In [5]:
lines = sc.parallelize(["spark","spark is fun!"])

In [6]:
lines.getNumPartitions() # To see the # of partitions

4

### RDD Operations


There are two types of operations:
1. **Transformations**: produce **a new RDD** by performing data manipulations on another RDD.

Examples:
    - map(func): returns a new RDD by passing each element of the source through *func*
    - flatMap(func): applies *func* to each element of the RDD and flattens the results into a single-level collection
    - filter(func): returns a new RDD containing only the elements for which *func* returns true
    - mapPartitions()
    - union()
    - intersection()
    - join()
    - aggregateByKey()
    - sortByKey()
    ...
    
2. **Actions**: trigger a computation and either return the result to the calling program or perform some action on the RDD's elements.

Examples:
    - reduce()
    - collect()
    - count()
    - sum()
    - countByValue()
    - saveAsTextFile()
    ...


**Lazy evaluation principle**: computation doesn't take place until an action is triggered.
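As a rough plain-Python analogy (this is not Spark itself), a generator pipeline behaves in a similar way: building it runs nothing, and the work happens only when a result is requested. The names `calls` and `traced_upper` are hypothetical helpers introduced only for this illustration.

```python
calls = []  # records every element that has actually been processed


def traced_upper(word):
    """Uppercase a word and record that the work happened."""
    calls.append(word)
    return word.upper()


data = ["spark", "is", "fun"]

# "Transformation": building the pipeline does not call traced_upper yet.
pipeline = (traced_upper(w) for w in data)
calls_before = len(calls)

# "Action": consuming the pipeline triggers the computation.
result = list(pipeline)
calls_after = len(calls)
```

Here `calls_before` stays at zero until `list(pipeline)` runs, which mirrors how a chain of transformations waits for an action.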

#### Example 1.1: Load a text file (“ignatian_pedagogy”) and split each line by space.

In [66]:
lines = sc.textFile("/Users/owner/USF/spark/2017-msan694-example-master/Data/ignatian_pedagogy")

In [67]:
lines.collect()

[u'= Ignatian Values =',
 u'The University of San Francisco enjoys a distinguished heritage and Jesuit tradition.  At the core of this tradition are transcendent values, including the integration of learning, faith and service; care for the whole person; character and conviction; religious truth and interfaith understanding; and a commitment to building a more just world.  The key values of this Jesuit tradition are as follows:',
 u'***********************************************************************************',
 u"1. Contemplative in Action - St. Ignatius Loyola believed that prayer and reflectivity should so guide our choices and actions that our activity itself becomes a way of entering into union with and praising God.  Being a contemplative in action also means seeing beyond the superficial in life to appreciate the mystery, beauty, and sacredness of all life.  It is a means of seeing God in all things and in everyone.  Contemplation is a critical dimension of the spiritual l

In [68]:
words = lines.map(lambda line: line.split(' ')) # split each line by space

In [70]:
#%%capture # suppresses output
words.collect()[:1]

[[u'=', u'Ignatian', u'Values', u'=']]

In [71]:
len(lines.collect())

17

In [72]:
lines.getNumPartitions()

2

#### Example 1.2: Generate a flat (single-level) list of words.

In [73]:
words2 = lines.flatMap(lambda line: line.split())

In [74]:
words2.collect()[:10]

[u'=',
 u'Ignatian',
 u'Values',
 u'=',
 u'The',
 u'University',
 u'of',
 u'San',
 u'Francisco',
 u'enjoys']
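The difference between the nested result of map() and the flat result of flatMap() can be sketched in plain Python, without a SparkContext. This is only an analogy: the list `lines_local` is a hypothetical stand-in holding two short sample strings, and `itertools.chain` plays the role of the flattening step.

```python
from itertools import chain

# Hypothetical stand-in for an RDD of lines.
lines_local = ["= Ignatian Values =", "The University of San Francisco"]

# map-like: one output element per input line, so the result is nested.
mapped = [line.split() for line in lines_local]

# flatMap-like: the per-line lists are concatenated into one flat list.
flat_mapped = list(chain.from_iterable(line.split() for line in lines_local))
```

`mapped` keeps one list per line, while `flat_mapped` is a single list of words.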

#### Example 1.3: Find words including “USF”

In [75]:
words = lines.flatMap(lambda line: line.split(" "))
words.collect()[:10]

[u'=',
 u'Ignatian',
 u'Values',
 u'=',
 u'The',
 u'University',
 u'of',
 u'San',
 u'Francisco',
 u'enjoys']

In [76]:
search = words.filter(lambda word: word != u'')  # compare by value, not by identity
search.collect()[:10]

[u'=',
 u'Ignatian',
 u'Values',
 u'=',
 u'The',
 u'University',
 u'of',
 u'San',
 u'Francisco',
 u'enjoys']

In [77]:
search_usf = search.filter(lambda word: "USF" in word)

In [78]:
search_usf.collect()[:10]

[u"USF's", u'USF', u"USF's", u"USF's", u'USF', u'USF', u'USF', u'USF']

### Partition-wise Transformations:

- mapPartitions(func): returns a new RDD by applying a function to each partition
- mapPartitionsWithIndex(func): returns a new RDD by applying a function to each partition, while tracking the index of the original partition

#### Example 1.4

In [89]:
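def split_func(partition):
    # str(list(partition)) turns the whole partition into one string (the list's repr),
    # so the split tokens keep repr debris such as "[u'=" in the output below
    word = str(list(partition)).split()
    return word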

In [93]:
words = lines.mapPartitions(split_func)
words.collect()[:15]

["[u'=",
 'Ignatian',
 'Values',
 "=',",
 "u'The",
 'University',
 'of',
 'San',
 'Francisco',
 'enjoys',
 'a',
 'distinguished',
 'heritage',
 'and',
 'Jesuit']
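The tokens above carry list and quote characters because the partition is converted to a string before splitting. A minimal sketch of a partition function that splits each line instead is shown below. The name `split_partition` is hypothetical, and the function is exercised on a plain iterator rather than a real RDD partition.

```python
def split_partition(partition):
    """Yield every whitespace-separated word from an iterator of lines."""
    for line in partition:
        for word in line.split():
            yield word


# Stand-in for one partition: the function receives an iterator of records.
sample_partition = iter(["= Ignatian Values =", "The University of San Francisco"])
words_local = list(split_partition(sample_partition))
```

With a live SparkContext, the same function could be passed as `lines.mapPartitions(split_partition)`, since mapPartitions() accepts any function that returns an iterable.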

#### Example 1.5

In [102]:
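def split_function_index(index, partition):
    # split the string form of the whole partition, then tag it with the partition index
    word = str(list(partition)).split()
    output = str(word) + ":" + str(index)
    yield output  # one string per partition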

In [104]:
words = lines.mapPartitionsWithIndex(split_function_index)
words.collect()[:2]

['["[u\'=", \'Ignatian\', \'Values\', "=\',", "u\'The", \'University\', \'of\', \'San\', \'Francisco\', \'enjoys\', \'a\', \'distinguished\', \'heritage\', \'and\', \'Jesuit\', \'tradition.\', \'At\', \'the\', \'core\', \'of\', \'this\', \'tradition\', \'are\', \'transcendent\', \'values,\', \'including\', \'the\', \'integration\', \'of\', \'learning,\', \'faith\', \'and\', \'service;\', \'care\', \'for\', \'the\', \'whole\', \'person;\', \'character\', \'and\', \'conviction;\', \'religious\', \'truth\', \'and\', \'interfaith\', \'understanding;\', \'and\', \'a\', \'commitment\', \'to\', \'building\', \'a\', \'more\', \'just\', \'world.\', \'The\', \'key\', \'values\', \'of\', \'this\', \'Jesuit\', \'tradition\', \'are\', \'as\', "follows:\',", "u\'***********************************************************************************\',", \'u"1.\', \'Contemplative\', \'in\', \'Action\', \'-\', \'St.\', \'Ignatius\', \'Loyola\', \'believed\', \'that\', \'prayer\', \'and\', \'reflectivity\'

#### Example 2:  
Parallelize numbers between 1 and 16. Calculate the count and sum in each partition.

In [26]:
numbers = sc.parallelize(range(1,17), 8)
numbers.glom().collect() # glom() allows us to see the allocations by partitions

[[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12], [13, 14], [15, 16]]

In [27]:
# Define a function that returns the count and the sum as a list
def count_sum(nums):
    result = [0, 0]  # [count, sum]
    for num in nums:
        result[0] += 1
        result[1] += num
    return result

In [28]:
partition_count_sum = numbers.mapPartitions(count_sum)

In [29]:
partition_count_sum.glom().collect()

[[2, 3], [2, 7], [2, 11], [2, 15], [2, 19], [2, 23], [2, 27], [2, 31]]

In [41]:
# Find the total count and sum
one_num = numbers.map(lambda x: [1,x])

In [35]:
one_num.glom().collect()

[[[1, 1], [1, 2]],
 [[1, 3], [1, 4]],
 [[1, 5], [1, 6]],
 [[1, 7], [1, 8]],
 [[1, 9], [1, 10]],
 [[1, 11], [1, 12]],
 [[1, 13], [1, 14]],
 [[1, 15], [1, 16]]]

In [39]:
count_sum = one_num.reduce(lambda x,y: [x[0]+y[0],x[1]+y[1]])
count_sum

[16, 136]

In [42]:
average = float(count_sum[1])/float(count_sum[0])
average

8.5
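The same total can also be derived from the per-partition [count, sum] pairs computed with mapPartitions() earlier in this example. The plain-Python sketch below assumes those pairs have already been collected into a local list; the variable names are illustrative only.

```python
# Per-partition [count, sum] pairs, as shown by glom().collect() in this example.
partition_pairs = [[2, 3], [2, 7], [2, 11], [2, 15],
                   [2, 19], [2, 23], [2, 27], [2, 31]]

# Merge the partial results: add up the counts and the sums separately.
total_count = sum(pair[0] for pair in partition_pairs)
total_sum = sum(pair[1] for pair in partition_pairs)

average_from_partitions = float(total_sum) / total_count
```

Combining small per-partition summaries this way moves only one pair per partition to the driver instead of one pair per element.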

### Set operations:
*Format*: 
    **rdd1.operator(rdd2)** (**distinct()** takes no second RDD)
- **distinct():** returns only one of each element
- **union():** returns all elements of both RDDs, keeping duplicates
- **intersection():** returns the elements common to both RDDs
- **subtract()**: returns elements that are in rdd1 but not in rdd2
- **cartesian()**: returns the Cartesian product (all pairs between rdd1 and rdd2)
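The semantics of these operations can be sketched with plain Python lists and sets. This is an analogy, not the Spark implementation: the two small lists are hypothetical data, and results are sorted here only for readability, since Spark does not guarantee the order of elements after these operations.

```python
from itertools import product

# Hypothetical stand-ins for the contents of rdd1 and rdd2.
rdd1_items = [3, 4, 1, 2]
rdd2_items = [2, 3, 4, 5]

union_items = rdd1_items + rdd2_items                    # keeps duplicates
distinct_items = sorted(set(union_items))                # one of each element
intersection_items = sorted(set(rdd1_items) & set(rdd2_items))
subtract_items = [v for v in rdd1_items if v not in set(rdd2_items)]
cartesian_items = list(product(rdd1_items, rdd2_items))  # all (a, b) pairs
```

Note that the size of the Cartesian product is the product of the two input sizes, which is why cartesian() becomes expensive quickly.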

#### Example 3.1: Find distinct words in “ignatian_pedagogy”

In [79]:
lines = sc.textFile("/Users/owner/USF/spark/2017-msan694-example-master/Data/ignatian_pedagogy")
words = lines.flatMap(lambda line: line.split(' '))


In [81]:
#Count the number of distinct words
distinct_words = words.distinct()
len(distinct_words.collect())

381

In [82]:
# This is the total number of words
len(words.collect())

726

In [83]:
# Print distinct words
words.distinct().collect()[:10]

[u'',
 u'1981,',
 u'all',
 u'just',
 u'Father',
 u'actions',
 u'discovered',
 u'schools',
 u'including',
 u'ecumenical']

#### Example 3.2: Create an RDD of distinct words from “README.md” using flatMap

In [84]:

readme = sc.textFile("/Users/owner/USF/spark/2017-msan694-example-master/Data/README.md")

In [85]:
readme_words = readme.flatMap(lambda line:line.split(' '))

In [86]:
readme_distinct_words = readme_words.distinct()

In [87]:
readme_distinct_words.collect()[:10]
len(readme_distinct_words.collect())

275

In [88]:
readme_words = readme.flatMap(lambda line: line.split(' '))

In [59]:
readme_distinct = readme.distinct()
len(readme_distinct.collect())

64

#### Example 3.3: What are the union, intersection, subtraction, and Cartesian product of the sets from Example 3.1 and Example 3.2?

In [107]:
len(distinct_words.union(readme_distinct_words).collect())

656

In [108]:
len(distinct_words.intersection(readme_distinct).collect())

1

In [111]:
len(distinct_words.subtract(readme_distinct).collect())

380

In [112]:
len(distinct_words.cartesian(readme_distinct).collect())

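24384

Note that the intersection, subtract, and cartesian cells pass `readme_distinct` (the 64 distinct *lines* from Example 3.2) rather than `readme_distinct_words` (the 275 distinct words), so they compare words against whole lines: the Cartesian product has 381 × 64 = 24384 pairs. To compare the two word sets, use `readme_distinct_words` instead. The union count is 656 = 381 + 275 because union() keeps duplicates.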

### RDD Operations - Actions

Compute a result based on an RDD.
Return the result to the driver program or save it externally.
Return a **non-RDD object**.

**Examples**:
- reduce()
- fold(zeroValue)
- aggregate(zeroValue, seqOp, combOp)
- collect()
- count()
- countByValue()
- take()
- takeSample()
- foreach()

#### Example 4.1
For the numbers between 1 and 9, calculate the sum of the odd numbers.


In [120]:
numbers = sc.parallelize(range(1,10),3)
numbers.glom().collect()

[[1, 2, 3], [4, 5, 6], [7, 8, 9]]

In [121]:
odd_numbers = numbers.filter(lambda x: x%2 ==1)

In [122]:
odd_numbers.glom().collect()

[[1, 3], [5], [7, 9]]

In [123]:
odd_numbers.reduce(lambda x,y: x+y)

25

#### Example 4.2: For the numbers between 1 and 9, calculate the sum of the odd numbers using fold

__fold(zero, function)__ is similar to __reduce(function)__ but takes a zero value as the starting point. It is often used to handle the empty-RDD case, where __reduce()__ raises an error: _num.fold(0, lambda x,y: x+y)_ returns the zero value instead.

In [124]:
odd_numbers.fold(0, lambda x,y: x+y) # we provide  a zero value to a fold function

25
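One common reason to supply a zero value is the empty-input case. The plain-Python analogy below uses `functools.reduce` rather than Spark: without an initial value it fails on an empty sequence, while with one it returns that value. The helper `add` and the variable names are illustrative only.

```python
from functools import reduce


def add(x, y):
    """Binary operator used for both the reduce-like and fold-like calls."""
    return x + y


empty = []

# Without an initial value, reducing an empty sequence fails.
try:
    reduce(add, empty)
    reduce_failed = False
except TypeError:
    reduce_failed = True

# With a zero value (as in fold), the zero value is returned instead.
folded_empty = reduce(add, empty, 0)

# On non-empty input, both approaches give the same sum.
folded_odds = reduce(add, [1, 3, 5, 7, 9], 0)
```

In Spark the zero value is used as the starting point in every partition, so it should be neutral for the operation (0 for addition).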

#### Example 4.3: Using aggregate(), return (# of elements, sum) of odd numbers

In [119]:
odd_numbers.aggregate((0,0),(lambda x,y: (x[0]+1,x[1]+y)), (lambda x,y: (x[0]+y[0],x[1]+y[1])))

(5, 25)

What's going on here?

**aggregate((zeroValue), seqOp, combOp)** works as follows:

_zeroValue_ - an initial value of the type we want to return (here we want a tuple, so we provide (0,0))

_seqOp_ - function that combines the elements of the RDD with the accumulator. It runs within a partition. In this example, consider partition 1: [1,3]. The initial value is x = (0,0); we take the first number y = 1 and get (0+1, 0+1) = (1,1). This becomes x = (1,1), and with y = 3 we get (1+1, 1+3) = (2,4). In the 2nd partition x = (0,0), y = 5 ==> (1,5). In the 3rd partition x = (0,0), y = 7 ==> (1,7); then x = (1,7), y = 9 ==> (1+1, 7+9) = (2,16).

_combOp_ - function that merges the per-partition accumulators. Here we have three pairs as inputs: (2,4), (1,5), (2,16). We apply _combOp_ to get (2+1, 4+5) = (3,9) and then (3+2, 9+16) = (5,25).
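The two-stage behaviour of aggregate() can be traced in plain Python. The sketch below is an emulation, not Spark's implementation: it assumes the odd numbers sit in the three partitions shown by glom() earlier ([1, 3], [5], [7, 9]) and uses `functools.reduce` for both stages.

```python
from functools import reduce

# Partition layout of odd_numbers, as shown by glom().collect().
partitions = [[1, 3], [5], [7, 9]]
zero_value = (0, 0)  # (count, sum)


def seq_op(acc, value):
    """Fold one element into a partition's (count, sum) accumulator."""
    return (acc[0] + 1, acc[1] + value)


def comb_op(a, b):
    """Merge two (count, sum) accumulators."""
    return (a[0] + b[0], a[1] + b[1])


# Stage 1: seq_op runs inside each partition, starting from the zero value.
per_partition = [reduce(seq_op, part, zero_value) for part in partitions]

# Stage 2: comb_op merges the per-partition results.
total = reduce(comb_op, per_partition, zero_value)
```

`per_partition` holds the intermediate pairs and `total` holds the final (count, sum).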


**sample()** vs **takeSample()**

- sample(withReplacement, fraction, seed) - is a TRANSFORMATION!
    - fraction:
      - expected _number of times each element (positive double) is going to be sampled_, __when replacement is used__
      - expected _probability that each element is going to be sampled_, __when replacement is NOT used__
      
- takeSample(withReplacement, num, seed)  is an ACTION
    - num - the number of sampled elements to return (integer); without replacement this is capped at the size of the RDD
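The difference between a sampling fraction and an exact sample size can be sketched in plain Python for the case without replacement. This is only an analogy under stated assumptions: both helper functions are hypothetical, they use Python's `random` module, and they will not reproduce the elements Spark picks for a given seed.

```python
import random


def sample_without_replacement(items, fraction, seed):
    """Keep each element independently with probability `fraction`."""
    rng = random.Random(seed)
    return [x for x in items if rng.random() < fraction]


def take_sample_without_replacement(items, num, seed):
    """Return exactly `num` elements, capped at the number available."""
    rng = random.Random(seed)
    return rng.sample(items, min(num, len(items)))


values = [3, 4, 1, 2, 2, 3, 4, 5]

sampled = sample_without_replacement(values, 0.25, 100)  # size varies
taken_5 = take_sample_without_replacement(values, 5, 1)   # always 5 elements
taken_20 = take_sample_without_replacement(values, 20, 1)  # capped at 8
```

The size of `sampled` is only expected to be about a quarter of the input, whereas the size of `taken_5` is fixed.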


#### Example 5.1

In [126]:
x = sc.parallelize([3,4,1,2])
y = sc.parallelize(range(2,6))
z = x.union(y)

Try collect(), count(), countByValue(), top(n), take(n), first(), takeSample() operations on z.


In [127]:
z.collect()

[3, 4, 1, 2, 2, 3, 4, 5]

In [128]:
z.count()

8

In [129]:
z.countByValue()

defaultdict(int, {1: 1, 2: 2, 3: 2, 4: 2, 5: 1})

In [130]:
z.top(1)

[5]

What's the difference between **take()** and **first()**?

**take(n)** returns a list of the first n elements

**first()** returns the first element itself

In [157]:
z.take(1)

[3]

In [156]:
z.first()

3

In [183]:
z.takeSample(False,20,1)

[3, 1, 2, 4, 5, 2, 3, 4]

In [149]:
z.takeSample(False,5,1)

[3, 1, 2, 4, 5]

In [150]:
z.takeSample(True,20,1)

[3, 2, 4, 5, 5, 1, 3, 4, 4, 1, 1, 2, 4, 2, 2, 3, 5, 4, 1, 2]

In [145]:
z.sample(False, 0.25,100).collect()

[3, 4, 4]

In [154]:
z.sample(True,2.0,100).collect()

[4, 4, 1, 1, 1, 1, 2, 2, 3, 5, 5]