# What is RDD?

RDD stands for “Resilient Distributed Dataset”. It is the fundamental data structure of Apache Spark. An RDD is an immutable collection of objects that is partitioned across the nodes of the cluster so that it can be processed in parallel.
Decomposing the name RDD:

    1. Resilient, i.e. fault-tolerant: with the help of the RDD lineage graph (a DAG), Spark can recompute partitions that are missing or damaged because of node failures.

    2. Distributed, since the data resides on multiple nodes.

    3. Dataset, i.e. the records of the data you work with. The user can load the dataset from an external source such as a JSON file, CSV file, text file, or a database via JDBC; no specific data structure is required.

Hence, every RDD is logically partitioned across many servers so that its partitions can be computed on different nodes of the cluster. RDDs are fault-tolerant, i.e. they can recover by themselves in the case of failure.
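
The lineage idea can be illustrated without Spark. The following is a minimal plain-Python sketch, not Spark's implementation: `ToyRDD` and its methods are hypothetical names, and the sketch assumes the source data can always be re-read. It records transformations as a lineage and replays them to rebuild a lost partition.

```python
# Toy model of lineage-based recovery. ToyRDD is a hypothetical helper,
# not a Spark class; it assumes the source partitions are always re-readable.
class ToyRDD:
    def __init__(self, source_partitions, lineage=()):
        self.source = source_partitions   # stable input data, one list per partition
        self.lineage = tuple(lineage)     # ordered per-element functions
        self.cache = {}                   # partition index -> computed data

    def map(self, fn):
        # A transformation only extends the lineage; nothing is computed yet.
        return ToyRDD(self.source, self.lineage + (fn,))

    def compute(self, index):
        # Build (or rebuild) one partition by replaying the lineage on its source.
        if index not in self.cache:
            data = list(self.source[index])
            for fn in self.lineage:
                data = [fn(x) for x in data]
            self.cache[index] = data
        return self.cache[index]

    def collect(self):
        return [x for i in range(len(self.source)) for x in self.compute(i)]


squares = ToyRDD([[0, 1], [3, 5]]).map(lambda x: x * x)
before = squares.collect()
del squares.cache[1]         # simulate losing the node that held partition 1
after = squares.collect()    # only partition 1 is recomputed from the lineage
print(before, after)         # [0, 1, 9, 25] [0, 1, 9, 25]
```

Only the missing partition is recomputed; the partitions that survived are reused as they are.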

There are three ways to create RDDs in Spark: from data in stable storage, from other RDDs, and by parallelizing an existing collection in the driver program. Spark RDDs can also be operated on in parallel through a low-level API that offers transformations and actions. We will study these Spark RDD operations later in this section.

Spark RDDs can also be cached and manually partitioned. Caching is beneficial when we use an RDD several times, and manual partitioning is important for balancing partitions correctly. Generally, smaller partitions allow RDD data to be distributed more evenly among more executors, whereas fewer partitions mean fewer tasks to schedule.

Programmers can also call a persist method to indicate which RDDs they want to reuse in future operations. Spark keeps persistent RDDs in memory by default; with a storage level such as `MEMORY_AND_DISK`, it can spill them to disk if there is not enough RAM. Users can also request other persistence strategies, such as storing the RDD only on disk or replicating it across machines, through flags to persist.

# Contents:
    a. Creating RDD
    b. Basic Operations:
        1. .map(...): applies a function to each element of the RDD and returns the transformed elements.
        2. .filter(...): selects the elements of the dataset that fit specified criteria.
        3. .flatMap(...): works like .map(...) but flattens the results, so each input element can yield zero or more output elements.
        4. .distinct(...): returns the distinct elements of the RDD.
        5. .sample(...): returns a randomized sample from the dataset.
        6. .take(n): returns the first n elements of the RDD.
        7. .collect(...): returns all elements of the RDD to the driver as a list.
        8. .reduce(...): reduces the elements of the RDD using a specified function.
        9. .count(...): returns the number of elements in the RDD.
        10. .first(...): returns the first element of the RDD.
        11. .foreach(...): applies the same function to each element of the RDD, one element at a time.
        12. .sum(): returns the sum of all elements in the RDD.
        13. .stats(): returns summary statistics of the RDD (count, mean, standard deviation, max, min).

# Importing Libraries

In [29]:
import pyspark
from pyspark import SparkContext
import numpy as np
import pandas as pd

In [30]:
sc=SparkContext("local[*]")

# A. Creating RDD

In [31]:
lst=np.random.randint(0,10,20)
print(lst)

[0 1 3 5 0 5 5 9 5 8 8 3 1 7 8 1 7 3 5 3]


### What did we just do? Did we create an RDD? What is an RDD?
![](https://i.stack.imgur.com/cwrMN.png)
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a **fault-tolerant collection of elements that can be operated on in parallel**. SparkContext manages the distributed data over the worker nodes through the cluster manager. 

There are two ways to create RDDs: 
* parallelizing an existing collection in your driver program, or 
* referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

We create the RDD `A` below using the former approach, by parallelizing the NumPy array `lst`.

### `A` is a PySpark RDD object; we cannot access its elements directly

In [32]:
A=sc.parallelize(lst)

In [33]:
type(A)

pyspark.rdd.RDD

In [34]:
A

ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:195

### The opposite of parallelization: `collect` gathers all the distributed elements and returns them to the driver (head node). <br><br>Note: this is a slow process, so do not use it often.

In [35]:
A.collect()

[0, 1, 3, 5, 0, 5, 5, 9, 5, 8, 8, 3, 1, 7, 8, 1, 7, 3, 5, 3]

### How were the partitions created? Use the `glom` method

In [36]:
A.glom().collect()

[[0],
 [1, 3],
 [5, 0],
 [5],
 [5, 9],
 [5, 8],
 [8],
 [3, 1],
 [7, 8],
 [1],
 [7, 3],
 [5, 3]]
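
The partition sizes above (1, 2, 2, 1, ...) can be reproduced with a simple slicing rule. The sketch below is plain Python and rests on two assumptions: that the run used 12 partitions (presumably one per local core picked up by `local[*]`), and that partition `i` of `k` covers the contiguous range from `floor(i*n/k)` to `floor((i+1)*n/k)`. The helper name `slice_into_partitions` is hypothetical, and Spark's internal logic may differ between versions.

```python
def slice_into_partitions(items, num_partitions):
    """Split items into contiguous, near-equal slices (hypothetical helper)."""
    n = len(items)
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


# The values printed for `lst` earlier in this notebook.
lst = [0, 1, 3, 5, 0, 5, 5, 9, 5, 8, 8, 3, 1, 7, 8, 1, 7, 3, 5, 3]

partitions = slice_into_partitions(lst, 12)   # 12 partitions is an assumption
sizes = [len(p) for p in partitions]
print(partitions)
print(sizes)   # [1, 2, 2, 1, 2, 2, 1, 2, 2, 1, 2, 2]
```

With 20 elements and 12 partitions no partition can hold more than two elements, which is why the data ends up spread almost evenly.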

# B. Basic Operations

### 1. `map` function

In [37]:
B=A.map(lambda x:x*x)

In [38]:
B.collect()

[0, 1, 9, 25, 0, 25, 25, 81, 25, 64, 64, 9, 1, 49, 64, 1, 49, 9, 25, 9]

`map` operation with a regular Python function

In [39]:
def cube(x):
    # Return the cube of x
    return x*x*x

In [40]:
C=A.map(cube)

In [41]:
C.collect()

[0,
 1,
 27,
 125,
 0,
 125,
 125,
 729,
 125,
 512,
 512,
 27,
 1,
 343,
 512,
 1,
 343,
 27,
 125,
 27]

### 2. `filter` function

In [42]:
A.filter(lambda x:x%4==0).collect()

[0, 0, 8, 8, 8]

### 3. `flatMap` function

In [43]:
D=A.flatMap(lambda x:(x,x*x))

### The `flatMap` method returns a new RDD by first applying a function to all elements of this RDD, and then flattening the results

In [44]:
D.collect()

[0,
 0,
 1,
 1,
 3,
 9,
 5,
 25,
 0,
 0,
 5,
 25,
 5,
 25,
 9,
 81,
 5,
 25,
 8,
 64,
 8,
 64,
 3,
 9,
 1,
 1,
 7,
 49,
 8,
 64,
 1,
 1,
 7,
 49,
 3,
 9,
 5,
 25,
 3,
 9]
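
To see the difference between `map` and `flatMap`, here is a plain-Python analogy that runs without Spark. It uses a shortened list for brevity; the helper name `pair` is hypothetical, and `itertools.chain.from_iterable` stands in for the flattening step.

```python
from itertools import chain


def pair(x):
    # Same function as the lambda used with flatMap above
    return (x, x * x)


data = [0, 1, 3, 5]

# map-like: exactly one output per input, so the tuples are kept as they are
mapped = [pair(x) for x in data]

# flatMap-like: each tuple is unpacked into the output sequence
flattened = list(chain.from_iterable(pair(x) for x in data))

print(mapped)      # [(0, 0), (1, 1), (3, 9), (5, 25)]
print(flattened)   # [0, 0, 1, 1, 3, 9, 5, 25]
```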

### 4. `distinct` function

### The method `RDD.distinct()` returns a new dataset that contains the distinct elements of the source dataset.

In [45]:
A.distinct().collect()

[0, 1, 3, 5, 7, 8, 9]

### 5. `sample` function

## Sampling an RDD
* RDDs are often very large.
* **Aggregates, such as averages, can be approximated efficiently by using a sample.** This often comes in handy when working with extremely large datasets, where a sample can tell a lot about the patterns and descriptive statistics of the data.
* Sampling is done in parallel and requires limited computation.

The method `RDD.sample(withReplacement,p)` generates a sample of the elements of the RDD, where
- `withReplacement` is a boolean flag indicating whether or not an element in the RDD can be sampled more than once.
- `p` is the probability of accepting each element into the sample. Note that because the sampling is performed independently in each partition, with each element accepted or rejected at random, the number of elements in the sample changes from sample to sample.

In [46]:
m=5
n=20
print('sample1=',A.sample(False,m/n).collect()) 
print('sample2=',A.sample(False,m/n).collect())

sample1= [3, 5, 5, 5, 1, 8]
sample2= [1, 0, 5, 5]
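
The varying sample size can be reproduced with a plain-Python sketch of sampling without replacement, in which every element is kept independently with probability `p`. The helper `bernoulli_sample` and its `seed` argument are hypothetical names for illustration, and the sketch is not Spark's sampler.

```python
import random


def bernoulli_sample(items, p, seed):
    """Keep each item independently with probability p (hypothetical helper)."""
    rng = random.Random(seed)
    return [x for x in items if rng.random() < p]


# The values printed for `lst` earlier in this notebook.
lst = [0, 1, 3, 5, 0, 5, 5, 9, 5, 8, 8, 3, 1, 7, 8, 1, 7, 3, 5, 3]

# Draw many samples with p = 5/20 and look at how their sizes are spread.
sizes = [len(bernoulli_sample(lst, 5 / 20, seed)) for seed in range(1000)]
average_size = sum(sizes) / len(sizes)
print(min(sizes), max(sizes), average_size)
```

The average size should come out close to `p * n = 5`, while individual samples can be noticeably smaller or larger, as with `sample1` and `sample2` above.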


### 6. `take` function

In [47]:
A.take(1)

[0]

### 7. `collect` function

In [48]:
A.collect()

[0, 1, 3, 5, 0, 5, 5, 9, 5, 8, 8, 3, 1, 7, 8, 1, 7, 3, 5, 3]

### 8. `reduce` function

In [49]:
A.reduce(lambda x,y:x+y)

87
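
The function passed to `reduce` should be commutative and associative, because each partition is reduced on its own before the partial results are combined. The plain-Python sketch below illustrates why, using the partitions from the `glom` output above. The helper `partitioned_reduce` is a hypothetical name and only mimics the idea, not Spark's execution.

```python
from functools import reduce
from operator import add, sub

# Partitions as shown by A.glom().collect() earlier in this notebook.
partitions = [[0], [1, 3], [5, 0], [5], [5, 9], [5, 8],
              [8], [3, 1], [7, 8], [1], [7, 3], [5, 3]]
flat = [x for part in partitions for x in part]


def partitioned_reduce(fn, parts):
    """Reduce each partition first, then combine the partial results."""
    partials = [reduce(fn, part) for part in parts if part]
    return reduce(fn, partials)


# Addition is associative and commutative: partitioning does not matter.
print(partitioned_reduce(add, partitions), reduce(add, flat))   # 87 87

# Subtraction is neither: the result depends on how the data is partitioned.
print(partitioned_reduce(sub, partitions), reduce(sub, flat))   # -17 -87
```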

### 9. `count` function

In [50]:
A.count()

20

### 10. `first` function

In [51]:
A.first()

0

### 11. `foreach` function 

In [55]:
def lm(x): 
    print(x)

In [56]:
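A.foreach(lm)

Note: `foreach` returns nothing. The `print` calls run in the worker processes, so their output typically appears in the workers' standard output (in local mode, usually the terminal that launched the notebook) rather than in the cell output.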

### 12. `sum` function 

In [57]:
A.sum()

87

### 13. `stats` function 

In [58]:
A.stats()

(count: 20, mean: 4.35, stdev: 2.797766966707556, max: 9.0, min: 0.0)
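
The reported `stdev` matches the population standard deviation (dividing by `n`), not the sample standard deviation (dividing by `n - 1`). A quick plain-Python check with the standard `statistics` module, using the values printed for `lst` earlier:

```python
import statistics

# The values printed for `lst` earlier in this notebook.
lst = [0, 1, 3, 5, 0, 5, 5, 9, 5, 8, 8, 3, 1, 7, 8, 1, 7, 3, 5, 3]

mean = sum(lst) / len(lst)
population_std = statistics.pstdev(lst)   # divides by n
sample_std = statistics.stdev(lst)        # divides by n - 1

print(len(lst), mean, max(lst), min(lst))   # 20 4.35 9 0
print(population_std)                       # agrees with the stdev reported above
print(sample_std)                           # slightly larger
```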

In [54]:
# sc.stop()