# SparkContext and RDD Basics

#### A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. It is an immutable, distributed collection of objects. Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.

#### Spark uses RDDs to achieve faster and more efficient MapReduce operations.

### Import Modules

In [1]:
from pyspark import SparkContext
import numpy as np

### Initialize a SparkContext that runs locally with 4 worker threads (one per core)

In [2]:
sc=SparkContext(master="local[4]")
print(sc)

<SparkContext master=local[4] appName=pyspark-shell>


### Spark Parallelization
#### An RDD can be created in two ways:
1. parallelizing an existing collection in your driver program
2. referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.

#### Let's create an RDD using the first method

In [6]:
lst = np.random.randint(0,10,20)
print(lst)
A = sc.parallelize(lst)
type(A)

[3 4 8 9 2 2 6 1 5 1 7 0 0 8 0 9 4 7 1 6]


pyspark.rdd.RDD

In [4]:
A

ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:195

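#### Since 4 cores were used, the list was split into 4 partitions. `glom()` groups the elements of each partition into a list, so `collect()` returns one sublist per partition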

In [7]:
A.glom().collect()

[[3, 4, 8, 9, 2], [2, 6, 1, 5, 1], [7, 0, 0, 8, 0], [9, 4, 7, 1, 6]]

#### Checking the same with 2 cores

In [8]:
sc.stop()
sc=SparkContext(master="local[2]")
A = sc.parallelize(lst)
A.glom().collect()

[[3, 4, 8, 9, 2, 2, 6, 1, 5, 1], [7, 0, 0, 8, 0, 9, 4, 7, 1, 6]]
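
The two partition layouts above are consistent with cutting the list into contiguous, near-equal slices. As a rough pure-Python model of that behavior (not Spark's implementation; `slice_evenly` is a hypothetical helper, and the even-slicing rule is an assumption that happens to reproduce the outputs shown here), the split can be sketched as:

```python
def slice_evenly(data, num_slices):
    """Split data into num_slices contiguous, near-equal slices (hypothetical helper)."""
    n = len(data)
    # Slice i covers the index range [i*n // num_slices, (i+1)*n // num_slices)
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

# The 20 values printed earlier in this notebook
lst = [3, 4, 8, 9, 2, 2, 6, 1, 5, 1, 7, 0, 0, 8, 0, 9, 4, 7, 1, 6]

print(slice_evenly(lst, 4))  # four sublists of 5 elements each
print(slice_evenly(lst, 2))  # two sublists of 10 elements each
```

With a length that is not a multiple of the number of slices, this model produces slices whose sizes differ by at most one; whether Spark places the extra elements in the same slices is not something this sketch claims.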

In [9]:
sc.stop()
sc=SparkContext(master="local[4]")
A = sc.parallelize(lst)

### Basic RDD Operations

#### count

In [10]:
A.count()

20

#### access first element

In [11]:
A.first()

3

#### access first few (4) elements

In [12]:
A.take(4)

[3, 4, 8, 9]

#### create an RDD with duplicates removed (the order of the result is not guaranteed)

In [13]:
A_distinct=A.distinct()
A_distinct.collect()

[4, 8, 0, 9, 1, 5, 2, 6, 3, 7]

### Sum
#### Reduce method

In [14]:
A.reduce(lambda x,y: x+y)

83

#### Direct sum method

In [15]:
A.sum()

83

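#### Fold method: aggregates the elements within each partition first, then combines the per-partition results. The first argument is the zero value used as the starting point of each aggregation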

In [16]:
A.fold(0,lambda x,y:x+y)

83
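
To see why the zero value passed to `fold` should be neutral for the operation (0 for addition), here is a small pure-Python model of a two-stage fold. It is a sketch, not Spark code: `fold_model` is a hypothetical helper, and it assumes the zero value is used once per partition and once more when the partial results are merged.

```python
from functools import reduce

def fold_model(partitions, zero_value, op):
    """Pure-Python model of a two-stage fold (hypothetical helper, not Spark code)."""
    # Stage 1: fold each partition on its own, starting from zero_value
    partials = [reduce(op, part, zero_value) for part in partitions]
    # Stage 2: fold the per-partition results, again starting from zero_value
    return reduce(op, partials, zero_value)

# The 4 partitions shown by A.glom().collect() earlier in this notebook
partitions = [[3, 4, 8, 9, 2], [2, 6, 1, 5, 1], [7, 0, 0, 8, 0], [9, 4, 7, 1, 6]]

def add(x, y):
    return x + y

print(fold_model(partitions, 0, add))  # 83, the same as A.sum()
print(fold_model(partitions, 1, add))  # 88: the zero value 1 is added 4 + 1 times
```

Under this model, a non-neutral zero value makes the result depend on the number of partitions, which is a good reason to pass 0 for sums and 1 for products.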

### Maximum element / longest word

#### maximum element by reduce

In [17]:
A.reduce(lambda x,y: x if x>y else y)

9

#### longest word by reduce

In [21]:
words = 'In Apache Spark a DataFrame is a distributed collection of rows under named columns'.split(' ')
wordsRDD = sc.parallelize(words)
wordsRDD.reduce(lambda x,y: x if len(x)>len(y) else y)

'distributed'

### Functions/filtering in RDD

#### simple filtering

In [23]:
A.filter(lambda x: x%3==0).collect()

[3, 9, 6, 0, 0, 0, 9, 6]

### Lambda functions are short and sweet, but we can also write regular Python functions to use with reduce

In [29]:
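def largerThan(x,y):
    """
    Returns the longer of two words; if both have the same length, returns the second one (y).
    Used with reduce, this yields the last word among the longest words in a list.
    """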
    if len(x)> len(y):
        return x
    else:
        return y


In [30]:
wordsRDD.reduce(largerThan)

'distributed'
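
The sentence used above has a single longest word, so ties never come into play. The following sketch uses Python's built-in `functools.reduce`, which processes a list strictly left to right, to show how the comparison operator decides ties. The word list and the two function names are made up for illustration; on an RDD the outcome for tied words may additionally depend on how partial results are combined, since a function that prefers one argument on ties is not commutative.

```python
from functools import reduce

def larger_than(x, y):
    # Strict comparison: on a tie the second argument wins
    return x if len(x) > len(y) else y

def larger_or_equal(x, y):
    # Non-strict comparison: on a tie the first argument wins
    return x if len(x) >= len(y) else y

# Hypothetical word list with two words of the same maximal length
words = ['spark', 'rdd', 'scala']

print(reduce(larger_than, words))      # scala (last of the longest words)
print(reduce(larger_or_equal, words))  # spark (first of the longest words)
```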

### RDD sampling
#### RDDs are often very large. Aggregates, such as averages, can be approximated efficiently from a sample. This is often useful with extremely large datasets, where a sample can reveal a lot about the patterns and descriptive statistics of the data.
#### Sampling is done in parallel and requires limited computation.

#### The method RDD.sample(withReplacement, p) generates a sample of the elements of the RDD, where
1. withReplacement is a boolean flag indicating whether an element of the RDD can be sampled more than once.
2. p is the probability of accepting each element into the sample (PySpark names this parameter `fraction`). Because sampling is performed independently in each partition, with each element accepted or rejected at random, the number of elements in the sample varies from sample to sample.


In [36]:
# get a sample whose expected size is m
# Note that the size of the sample is different in different runs
m=5
n=20
print('sample1=',A.sample(False,m/n).collect()) 
print('sample2=',A.sample(False,m/n).collect())

sample1= [3, 4, 1, 5, 1, 0, 8]
sample2= [4, 6, 1, 0, 8, 4, 1]
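
Both samples above contain 7 elements even though the expected size is 5. A minimal pure-Python sketch of per-element sampling without replacement shows why the size fluctuates. It assumes each element is kept independently with probability p; `bernoulli_sample` is a hypothetical helper and does not reproduce Spark's random number generation.

```python
import random

def bernoulli_sample(data, p, rng):
    """Keep each element independently with probability p (hypothetical helper)."""
    return [x for x in data if rng.random() < p]

# The 20 values printed earlier in this notebook
lst = [3, 4, 8, 9, 2, 2, 6, 1, 5, 1, 7, 0, 0, 8, 0, 9, 4, 7, 1, 6]
m, n = 5, len(lst)

rng = random.Random(0)  # fixed seed only so that reruns are repeatable

# Draw many samples and record how many elements each one contains
sizes = [len(bernoulli_sample(lst, m / n, rng)) for _ in range(2000)]

print('smallest and largest sample size:', min(sizes), max(sizes))
print('average sample size:', sum(sizes) / len(sizes))
```

Individual sample sizes spread around m, while their average over many draws should land close to n*p = 5.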


### Things to note and think about
#### Each time you run the previous cell, you get a different sample, and hence a different estimate of any aggregate computed from it. The accuracy of the estimate is determined by the expected size of the sample, $n \cdot p$. Here, the probability is $p=\frac{m}{n}$. See how the error changes as you vary 