# Spark RDD and Beyond

## Types of RDD

- ParallelCollectionRDD: an RDD created by SparkContext by parallelizing an existing collection, e.g. sc.parallelize, sc.makeRDD.
- CoGroupedRDD: an RDD that cogroups its parents. For each key k in the parent RDDs, the resulting RDD contains a tuple with the list of values for that key.
- HadoopRDD: an RDD that provides core functionality for reading data stored in HDFS using the older MapReduce API. The most notable use case is SparkContext.textFile, which reads its input through a HadoopRDD.
- MapPartitionsRDD: an RDD that applies the provided function to every partition of the parent RDD. It is the result of calling operations like map, flatMap, filter, mapPartitions, etc.
- CoalescedRDD: the result of a coalesce transformation (repartition is a coalesce with shuffling enabled).
- ShuffledRDD: the result of shuffling, e.g. after repartition, or after coalesce when shuffling is enabled.
- PipedRDD: an RDD created by piping elements to a forked external process.
- SequenceFileRDD: an RDD that can be saved as a SequenceFile.

https://medium.com/knoldus/rdd-sparks-fault-tolerant-in-memory-weapon-130f8df2f996

# Let's practice

In [42]:
import findspark
import pyspark
findspark.find() 
findspark

<module 'findspark' from '/home/nics/anaconda3/lib/python3.7/site-packages/findspark.py'>

In [45]:
conf = pyspark.SparkConf().setAppName('Tap').setMaster('local')
sc = pyspark.SparkContext(conf=conf)
sc

In [47]:
# Create an RDD
digits = sc.parallelize([0,1,2,3,4,5,6,7,8,9])
digits

ParallelCollectionRDD[1] at parallelize at PythonRDD.scala:195

In [48]:
digits.collect()

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# Transformations

![](https://i.ytimg.com/vi/2IQxywxlMO8/hqdefault.jpg)

A transformation returns a new RDD.

## map(func)	
Return a new distributed dataset formed by passing each element of the source through a function func. 

In [52]:
squares=digits.map(lambda x: x*x)
squares.collect()
squares

PythonRDD[5] at collect at <ipython-input-52-e6611284cd92>:2

## filter(func)	
Return a new dataset formed by selecting those elements of the source on which func returns true.

In [56]:
evens = digits.filter(lambda x: x % 2 == 0)
evens.collect()

[0, 2, 4, 6, 8]

## flatMap(func)
Similar to map, but each input item can be mapped to 0 or more output items, so func should return a sequence (in Python, any iterable such as a list) rather than a single item.

In [62]:
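def factors(nr):
    # Collect the prime factors of nr (with multiplicity) by trial division.
    i = 2
    found = []
    while i <= nr:
        if nr % i == 0:
            found.append(i)
            nr = nr // i  # integer division keeps nr an int
        else:
            i = i + 1
    return found

primes = digits.flatMap(factors)
primes.collect()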

[2, 3, 2, 2, 5, 2, 3, 7, 2, 2, 2, 3, 3]
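To see how flatMap differs from map when both receive the same function, here is a plain-Python sketch that runs without a Spark context. The helpers `map_model` and `flat_map_model` are made-up names that only mimic the semantics described above; they are not Spark's implementation.

```python
from itertools import chain

def prime_factors(n):
    """Return the prime factors of n (with multiplicity) by trial division."""
    result = []
    divisor = 2
    while divisor <= n:
        if n % divisor == 0:
            result.append(divisor)
            n //= divisor
        else:
            divisor += 1
    return result

def map_model(func, items):
    # Hypothetical stand-in for rdd.map: exactly one output per input.
    return [func(x) for x in items]

def flat_map_model(func, items):
    # Hypothetical stand-in for rdd.flatMap: concatenate the iterables func returns.
    return list(chain.from_iterable(func(x) for x in items))

digits_local = list(range(10))
nested = map_model(prime_factors, digits_local)     # one list per digit
flat = flat_map_model(prime_factors, digits_local)  # all factors in one list
print(nested)
print(flat)
```

With map, 0 and 1 each contribute an empty list; with flatMap they contribute nothing, which is why the flattened result has 13 elements while the mapped one has 10.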

## distinct()
Return a new RDD that contains the distinct elements of this RDD.

In [63]:
primes=primes.distinct()
primes.collect()

[2, 3, 5, 7]

## sample()
Return a sampled subset of this RDD.

Parameters
- withReplacement – can elements be sampled multiple times (replaced when sampled out)
- fraction – expected size of the sample as a fraction of this RDD’s size. Without replacement: the probability that each element is chosen; fraction must be in [0, 1]. With replacement: the expected number of times each element is chosen; fraction must be >= 0.
- seed – seed for the random number generator

In [83]:
digits.sample(False,0.2).collect()

[4, 6, 8]
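Because fraction is a per-element probability when sampling without replacement, the sample size varies from run to run and is not guaranteed to equal fraction times the RDD size. The plain-Python sketch below models that behaviour with an independent coin flip per element. The function `bernoulli_sample` is a hypothetical stand-in, not PySpark's sampler, and its random stream differs from Spark's, so the same seed will not reproduce Spark's output.

```python
import random

def bernoulli_sample(items, fraction, seed=None):
    """Keep each item independently with probability `fraction` (hypothetical model)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1] when sampling without replacement")
    rng = random.Random(seed)
    return [x for x in items if rng.random() < fraction]

digits_local = list(range(10))

# Same seed, same sample: the seed makes the draw reproducible.
first = bernoulli_sample(digits_local, 0.2, seed=42)
second = bernoulli_sample(digits_local, 0.2, seed=42)

# Sample sizes fluctuate, but their average approaches fraction * len(items).
sizes = [len(bernoulli_sample(digits_local, 0.2, seed=s)) for s in range(10_000)]
mean_size = sum(sizes) / len(sizes)
print(first, first == second, round(mean_size, 2))
```

This is why the cell above returned three elements even though 0.2 of ten elements is two.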

## union()
Return a new dataset that contains the union of the elements in the source dataset and the argument. Duplicates are kept; apply distinct() afterwards to remove them.

In [84]:
odds = digits.filter(lambda x: x % 2 == 1)
yy = evens.union(odds)
yy.collect()

[0, 2, 4, 6, 8, 1, 3, 5, 7, 9]

## intersection()
Return a new RDD that contains the intersection of elements in the source dataset and the argument.

In [88]:
intersects=odds.intersection(squares)
intersects.collect()

[1, 9]

## cartesian()
When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).

In [96]:
cart=evens.cartesian(odds)
cart.collect()

[(0, 1),
 (0, 3),
 (0, 5),
 (0, 7),
 (0, 9),
 (2, 1),
 (4, 1),
 (2, 3),
 (2, 5),
 (4, 3),
 (4, 5),
 (2, 7),
 (2, 9),
 (4, 7),
 (4, 9),
 (6, 1),
 (8, 1),
 (6, 3),
 (6, 5),
 (8, 3),
 (8, 5),
 (6, 7),
 (6, 9),
 (8, 7),
 (8, 9)]

## groupByKey()
When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.

In [33]:
cartgroup=cart.groupByKey()
cartgroup.collect()

[(0, <pyspark.resultiterable.ResultIterable at 0x7f8f2d1a1f90>),
 (2, <pyspark.resultiterable.ResultIterable at 0x7f8f2d1a1d90>),
 (4, <pyspark.resultiterable.ResultIterable at 0x7f8f2d1a1c50>),
 (6, <pyspark.resultiterable.ResultIterable at 0x7f8f2d1a1e10>),
 (8, <pyspark.resultiterable.ResultIterable at 0x7f8f546efd50>)]
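The collected result shows ResultIterable objects rather than the grouped values themselves. As a rough picture of what each iterable holds, the sketch below groups (key, value) pairs in plain Python. `group_by_key_model` is a made-up helper for illustration; it does not reflect how Spark shuffles data between partitions.

```python
from collections import defaultdict
from itertools import product

def group_by_key_model(pairs):
    """Collect the values of each key into a list (hypothetical local model)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

evens_local = [0, 2, 4, 6, 8]
odds_local = [1, 3, 5, 7, 9]
pairs = list(product(evens_local, odds_local))  # same pairs as evens.cartesian(odds)
grouped = group_by_key_model(pairs)
for key, values in grouped:
    print(key, values)
```

In PySpark itself, `cartgroup.mapValues(list).collect()` turns each ResultIterable into a list so the values become visible; the order of values within a group is not guaranteed there.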

# Actions

## reduce(func)
Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.


In [36]:
sumOfDigits=digits.reduce(lambda a,b:a+b)
sumOfDigits

45
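To illustrate why the function should be commutative and associative, the sketch below reduces each partition separately and then combines the partial results, which is roughly the shape of a parallel reduce. The helper `partitioned_reduce` and the chosen partition splits are assumptions made for illustration, not Spark internals.

```python
from functools import reduce
from operator import add, sub

def partitioned_reduce(func, partitions):
    """Reduce every non-empty partition, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)

digits_local = list(range(10))
two_parts = [digits_local[:5], digits_local[5:]]
three_parts = [digits_local[:3], digits_local[3:7], digits_local[7:]]

# Addition is commutative and associative: the split does not matter.
sum_two = partitioned_reduce(add, two_parts)
sum_three = partitioned_reduce(add, three_parts)

# Subtraction is neither: the answer depends on how the data was split.
diff_two = partitioned_reduce(sub, two_parts)
diff_three = partitioned_reduce(sub, three_parts)

print(sum_two, sum_three)    # 45 45
print(diff_two, diff_three)  # 15 19
```

Since the number and layout of partitions is normally outside the caller's control, a function like subtraction gives results that cannot be relied on.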

https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD

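In [44]:
# Executed as In [44], before the context used here was created in In [45];
# when running the cells top to bottom, call this only after the last cell.
sc.stop()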

In [89]:
digits.count()

10

In [97]:
cart.collect()

[(0, 1),
 (0, 3),
 (0, 5),
 (0, 7),
 (0, 9),
 (2, 1),
 (4, 1),
 (2, 3),
 (2, 5),
 (4, 3),
 (4, 5),
 (2, 7),
 (2, 9),
 (4, 7),
 (4, 9),
 (6, 1),
 (8, 1),
 (6, 3),
 (6, 5),
 (8, 3),
 (8, 5),
 (6, 7),
 (6, 9),
 (8, 7),
 (8, 9)]

In [100]:
cart.countByKey()

defaultdict(int, {0: 5, 2: 5, 4: 5, 6: 5, 8: 5})

In [104]:
digits.histogram(2)

([0.0, 4.5, 9], [5, 5])
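The result of histogram(2) is a pair: the bucket boundaries, then the count in each bucket. As a sketch of how two evenly spaced buckets over the digits 0 to 9 yield that output, the plain-Python function below follows the documented rule that every bucket is open on the right except the last, which is closed. `even_histogram` is a hypothetical helper and assumes the input has at least two distinct values.

```python
def even_histogram(values, bucket_count):
    """Count values in `bucket_count` evenly spaced buckets between min and max."""
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("need at least two distinct values for this sketch")
    width = (high - low) / bucket_count
    # Boundaries: low, low + width, ..., high (the last boundary is the exact max).
    boundaries = [low + i * width for i in range(bucket_count)] + [high]
    counts = [0] * bucket_count
    for v in values:
        # The maximum falls into the last bucket because that bucket is closed.
        index = min(int((v - low) / width), bucket_count - 1)
        counts[index] += 1
    return boundaries, counts

boundaries, counts = even_histogram(list(range(10)), 2)
print(boundaries, counts)  # [0.0, 4.5, 9] [5, 5]
```

The digits 0 to 4 land in the bucket [0, 4.5) and the digits 5 to 9 in [4.5, 9], giving five values each.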