# Spark - Part 2 - RDD

https://datascience-school.com/blog/practical-apache-spark-in-10-minutes-part-2-rdd/

## Create a Spark instance

In [2]:
import pyspark 
from pyspark.sql import SQLContext
sc = pyspark.SparkContext('local[*]')
sqlContext = SQLContext(sc)

A simple test: draw five random elements from an RDD, without replacement.

In [4]:
rdd = sc.parallelize(range(1000))
rdd.takeSample(False, 5)

[259, 957, 752, 818, 104]

## RDD

RDDs support two types of operations: transformations and actions. Transformations create a new RDD from an existing one. Actions return a result to the driver program. All transformations in Spark are lazy: they do not compute their result right away, but only record the transformations applied to the base dataset (or file). The transformations are computed only when an action needs a result to be returned to the driver program or written to storage.
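
To make the idea of laziness concrete, here is a small plain-Python analogy. It is not Spark code: Python's built-in map() merely stands in for a transformation and list() for an action, and the calls list is a hypothetical helper used only to record when the function really runs.

```python
# Plain-Python analogy of lazy transformations (not Spark code).
calls = []  # hypothetical helper: records every time square() actually runs

def square(x):
    calls.append(x)
    return x ** 2

data = [4, 6, 6, 1]

# "Transformation": only builds a lazy pipeline, square() has not run yet.
pipeline = map(square, data)
calls_before_action = len(calls)

# "Action": consuming the pipeline forces the computation.
result = list(pipeline)
calls_after_action = len(calls)

print(calls_before_action, calls_after_action, result)
```

The function is not called at all until the result is requested, which is the same behaviour an RDD shows between a transformation such as map and an action such as take or collect.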

The simplest way to create an RDD is to parallelize an existing Python collection:

In [21]:
num = sc.parallelize([4, 6, 6, 1, 3, 0, 2, 2, 2])
num2 = sc.parallelize([5, 5, 8, 2, 2, 1, 7, 3, 3]) 

### map

In [20]:
result = num.map(lambda x: x**2)        # transformation: apply a function element-wise

result.take(10)                         # action: returns up to the first 10 elements

[16, 36, 36, 1, 9, 0, 4, 4, 4]

### filter

In [18]:
result = num.filter(lambda x: x >= 3)

result.take(10)

[4, 6, 6, 3]

### distinct

In [19]:
result = num.distinct()

result.take(10)

[4, 6, 0, 2, 1, 3]

### union

In [23]:
result = num.union(num2)

result.take(20)

[4, 6, 6, 1, 3, 0, 2, 2, 2, 5, 5, 8, 2, 2, 1, 7, 3, 3]

### intersection

In [25]:
result = num.intersection(num2)

result.take(20)

[1, 2, 3]

### subtract

In [27]:
result = num.subtract(num2)

result.take(20)

[4, 0, 6, 6]

### cartesian product

In [29]:
result = num.cartesian(num2)

result.take(20)

[(4, 5),
 (4, 5),
 (4, 8),
 (4, 2),
 (6, 5),
 (6, 5),
 (6, 8),
 (6, 2),
 (6, 5),
 (6, 5),
 (6, 8),
 (6, 2),
 (1, 5),
 (1, 5),
 (1, 8),
 (1, 2),
 (4, 2),
 (4, 1),
 (4, 7),
 (4, 3)]

### other operations

In [31]:
num.count()

9

In [33]:
num.countByValue()

defaultdict(int, {4: 1, 6: 2, 1: 1, 3: 1, 0: 1, 2: 3})

In [41]:
num.collect()       # returns all elements from the dataset as a list

[4, 6, 6, 1, 3, 0, 2, 2, 2]

In [42]:
num.top(3)          # returns the 3 largest elements, in descending order

[6, 6, 4]

In [43]:
num.takeOrdered(5)  # returns the 5 smallest elements, in ascending order

[0, 1, 2, 2, 2]

### reduce

reduce() is the most common action on an RDD. It aggregates the elements with a two-argument function, here addition.

In [40]:
num.reduce(lambda x, y: x + y)

26
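
Spark's documentation asks for the function passed to reduce() to be commutative and associative, because each partition is reduced separately and the partial results are then merged. The sketch below models that in plain Python. It is not Spark code: model_reduce is a hypothetical helper, and the three-way split into partitions is an assumption made for illustration.

```python
from functools import reduce

def model_reduce(partitions, op):
    """Plain-Python model: reduce each partition, then merge the partial results."""
    partials = [reduce(op, part) for part in partitions]
    return reduce(op, partials)

data = [4, 6, 6, 1, 3, 0, 2, 2, 2]
one_partition = [data]
three_partitions = [data[0:3], data[3:6], data[6:9]]  # hypothetical split

add = lambda x, y: x + y  # associative and commutative
sub = lambda x, y: x - y  # neither associative nor commutative

add_results = (model_reduce(one_partition, add), model_reduce(three_partitions, add))
sub_results = (model_reduce(one_partition, sub), model_reduce(three_partitions, sub))

print(add_results)  # addition gives the same total for both splits
print(sub_results)  # subtraction gives a different answer for each split
```

With addition the partitioning does not matter, whereas with subtraction the answer depends on how the data happens to be split, which is why such functions are unsafe to use with reduce().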

fold() is similar to reduce(), but it also takes a zero value that is used as the initial value of the accumulation.

In [44]:
num.fold(0, lambda x,y : x + y)

26
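
One detail worth keeping in mind: PySpark applies the zero value once in every partition and once more when the per-partition results are merged, so it should be the identity element of the function (0 for addition). The plain-Python model below illustrates this. It is not Spark code: model_fold is a hypothetical helper, and the split into three partitions is an assumption.

```python
from functools import reduce

def model_fold(partitions, zero, op):
    """Plain-Python model: fold each partition from `zero`, then fold the partial results from `zero`."""
    partials = [reduce(op, part, zero) for part in partitions]
    return reduce(op, partials, zero)

partitions = [[4, 6, 6], [1, 3, 0], [2, 2, 2]]  # hypothetical split of num
add = lambda x, y: x + y

total_with_identity = model_fold(partitions, 0, add)  # zero value is the identity
total_with_one = model_fold(partitions, 1, add)       # zero value is NOT the identity

print(total_with_identity, total_with_one)
```

In this model a zero value of 1 is added four times (three partitions plus the final merge), so the total drifts away from the true sum by an amount that depends on the number of partitions.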