# Apache Spark RDD Operations

Apache Spark RDDs support two types of operations:

- Transformations
- Actions

## RDD Transformation

A Spark transformation is a function that produces a new RDD from existing RDDs. It takes an RDD as input and produces one or more RDDs as output. Because RDDs are immutable, applying a transformation never changes the input RDD; it always creates a new one.

### Map

In [3]:
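from pyspark.sql import SparkSession

# Start a local session that uses all available cores.
spark = SparkSession.builder.master("local[*]").appName("music").getOrCreate()
# Each row of the DataFrame holds one line of the file in its `value` column.
txtData = spark.read.text('../covid19.txt')
# map emits exactly one output element (a list of lowercase tokens) per input line.
data = txtData.rdd.map(lambda x: x.value.lower().split()).collect()
for d in data[:1]:
    print(d)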

['coronavirus', 'disease', '2019', '(covid-19)', 'is', 'an', 'infectious', 'disease', 'caused', 'by', 'severe', 'acute', 'respiratory', 'syndrome', 'coronavirus', '2', '(sars-cov-2).[10]', 'it', 'was', 'first', 'identified', 'in', 'december', '2019', 'in', 'wuhan,', 'china,', 'and', 'has', 'resulted', 'in', 'an', 'ongoing', 'pandemic.[11][12]', 'the', 'first', 'confirmed', 'case', 'has', 'been', 'traced', 'back', 'to', '17', 'november', '2019.[13]', 'traces', 'of', 'the', 'virus', 'have', 'been', 'found', 'in', 'wastewater', 'that', 'was', 'collected', 'from', 'milan', 'and', 'turin,', 'italy', 'on', '18', 'december', '2019.[14]', 'as', 'of', '25', 'june', '2020,', 'more', 'than', '9.39', 'million', 'cases', 'have', 'been', 'reported', 'across', '188', 'countries', 'and', 'territories,', 'resulting', 'in', 'more', 'than', '481,000', 'deaths.', 'more', 'than', '4.71', 'million', 'people', 'have', 'recovered.[9]']


### FlatMap

In [87]:
data = txtData.rdd.flatMap(lambda line : line.value.split(" ")).collect()
for d in data[:5]:
    print(d)

Coronavirus
disease
2019
(COVID-19)
is

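The difference between the two cells above is easier to see on a tiny input. The following is a plain-Python analogue, not Spark code: `map` yields exactly one output per input line, while `flatMap` flattens the per-line results into one sequence. The two sample lines are hypothetical stand-ins for the rows of the text file.

```python
from itertools import chain

# Hypothetical two-line input, standing in for the rows of the text file.
lines = ["Coronavirus disease 2019", "is an infectious disease"]

# map analogue: one output element per input element (here, one token list per line).
mapped = [line.lower().split() for line in lines]

# flatMap analogue: each line yields zero or more tokens, flattened into one sequence.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(len(mapped))       # 2 (one list per line)
print(len(flat_mapped))  # 7 (one entry per word)
```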

### Filter(func)

In [88]:
data = txtData.rdd.flatMap(lambda line: line.value.split(" ")) \
.filter(lambda x: x!='').collect()
for d in data[:3]:
    print(d)

Coronavirus
disease
2019


### GroupBy

In [92]:
data = txtData.rdd.flatMap(lambda line: line.value.split(" ")) \
.filter(lambda x: x!='').groupBy(lambda w: w[0:10]).collect()
for d in data[:3]:
    print(d)

('Coronaviru', <pyspark.resultiterable.ResultIterable object at 0x7f2feb2b2240>)
('disease', <pyspark.resultiterable.ResultIterable object at 0x7f2feb2b25f8>)
('2019', <pyspark.resultiterable.ResultIterable object at 0x7f2feb2b2748>)

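The output above shows opaque `ResultIterable` objects because each group is an iterable of the original elements. As a rough plain-Python sketch of what `groupBy` produces (the word list is a hypothetical sample, and this is not Spark code), each key maps to every element that shares it:

```python
from collections import defaultdict

# Hypothetical sample of words; the key is the first ten characters, as in the cell above.
words = ["Coronavirus", "disease", "2019", "disease", "coronavirus"]

groups = defaultdict(list)
for w in words:
    groups[w[0:10]].append(w)

for key, members in groups.items():
    print(key, members)
```

In PySpark, calling `mapValues(list)` on the grouped RDD before collecting is one way to make the group contents printable.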

### GroupByKey / ReduceByKey

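words = txtData.rdd.flatMap(lambda line: line.value.split(" ")) \
    .filter(lambda x: x != '')
# Pair every word with an initial count of 1.
rdd3_mapped = words.map(lambda x: (x, 1))
# groupByKey gathers all values per key; mapValues(list) makes the groups printable.
rdd3_grouped = rdd3_mapped.groupByKey().mapValues(list).collect()[:3]
print(rdd3_grouped)
# reduceByKey combines the values per key, here by summing the counts.
rdd3_reduced = rdd3_mapped.reduceByKey(lambda a, b: a + b).collect()[:3]
print(rdd3_reduced)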

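Since the cell above has no recorded output, here is a plain-Python sketch of the two operations on a small, hypothetical word list. It is an analogue rather than Spark code: `groupByKey` collects every value for a key, whereas `reduceByKey` folds values into a single running result per key.

```python
from collections import defaultdict

# Hypothetical sample of words, each paired with a count of 1.
words = ["disease", "2019", "disease", "in", "2019", "disease"]
pairs = [(w, 1) for w in words]

# groupByKey analogue: gather all values for each key.
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

# reduceByKey analogue: combine values per key with a binary function (here, addition).
reduced = {}
for key, value in pairs:
    reduced[key] = reduced[key] + value if key in reduced else value

print(dict(grouped))  # {'disease': [1, 1, 1], '2019': [1, 1], 'in': [1]}
print(reduced)        # {'disease': 3, '2019': 2, 'in': 1}
```

In Spark itself, `reduceByKey` combines values within each partition before shuffling, so it generally moves less data than `groupByKey` followed by a sum.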
### Union

In [1]:
rdd1 = spark.createDataFrame([(1,"jan",2016),(3,"nov",2014),(16,"feb",2014)])
rdd2 = spark.createDataFrame([(5,"dec",2014),(17,"sep",2015)])
rdd3 = spark.createDataFrame([(6,"dec",2011),(16,"may",2015)])
rdd4 = rdd1.union(rdd2).union(rdd3)
rdd4.show()

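This cell was run before the `spark` session existed, which raised `NameError: name 'spark' is not defined`. Create the session first, as in the Map example, before running it.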

### Intersection

In [4]:
rdd1 = spark.createDataFrame([(6,"jan",2016),(3,"nov",2014),(16,"feb",2014)])
rdd2 = spark.createDataFrame([(5,"dec",2014),(16,"sep",2015)])
rdd3 = spark.createDataFrame([(6,"dec",2011),(16,"may",2015)])
rdd4 = rdd3.select("_2").intersect(rdd2.select("_2"))
rdd4.show()

+---+
| _2|
+---+
|dec|
+---+



### Subtract

In [6]:
rdd1 = spark.createDataFrame([(5,"dec",2014),(16,"sep",2015)])
rdd2 = spark.createDataFrame([(6,"dec",2011),(16,"may",2015)])
rdd3 = rdd1.select("_2").subtract(rdd2.select("_2"))
rdd3.show()

+---+
| _2|
+---+
|sep|
+---+

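The set-style operations in the last three sections can be mimicked in plain Python. This is only an analogue under simplifying assumptions (small in-memory lists instead of distributed data), using the `_2` values from the Subtract cell: union keeps duplicates, while intersection returns distinct elements.

```python
# Month values taken from column _2 of the two DataFrames in the Subtract cell.
left_months = ["dec", "sep"]
right_months = ["dec", "may"]

# union analogue: simple concatenation, duplicates are kept.
union = left_months + right_months

# intersection analogue: distinct elements present on both sides.
intersection = sorted(set(left_months) & set(right_months))

# subtract analogue: elements of the left side that do not occur on the right.
right_set = set(right_months)
subtract = [m for m in left_months if m not in right_set]

print(union)         # ['dec', 'sep', 'dec', 'may']
print(intersection)  # ['dec']
print(subtract)      # ['sep']
```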


### Cartesian

In [10]:
rdd1 = spark.createDataFrame([(5,"dec",2014),(16,"sep",2015)])
rdd2 = spark.createDataFrame([(6,"dec",2011),(16,"may",2015)])
# cartesian is an RDD method, not a DataFrame method, so call it on the underlying RDDs.
rdd3 = rdd1.select("_2").rdd.cartesian(rdd2.select("_2").rdd)
rdd3.collect()

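A Cartesian product pairs every element of one dataset with every element of the other. As a plain-Python sketch (not Spark code), `itertools.product` gives the same pairs for the `_2` values used in this section; the ordering shown is that of `itertools.product` and should not be assumed for Spark.

```python
from itertools import product

# Month values from column _2 of the two DataFrames in this section.
left_months = ["dec", "sep"]
right_months = ["dec", "may"]

# Every (left, right) combination: len(left) * len(right) pairs in total.
pairs = list(product(left_months, right_months))

print(pairs)  # [('dec', 'dec'), ('dec', 'may'), ('sep', 'dec'), ('sep', 'may')]
```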
### Distinct()

In [137]:
rdd1 = spark.createDataFrame([(6,"jan",2016),(3,"nov",2014),(16,"feb",2014),(5,"dec",2014),(16,"sep",2015)])
rdd2 = rdd1.select("_1").distinct()
rdd2.show()

+---+
| _1|
+---+
|  6|
|  5|
|  3|
| 16|
+---+



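### SortByKey()

The cell below uses `sortBy`, which sorts by a key function; `sortByKey` is the variant for RDDs of `(key, value)` pairs.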

In [163]:
rdd1 = spark.createDataFrame([(6,"jan",2016),(3,"nov",2014),(16,"feb",2014),(5,"dec",2014),(16,"sep",2015)])
rdd2 = rdd1.rdd.sortBy(lambda x: x[2]).collect()
rdd2

[Row(_1=3, _2='nov', _3=2014),
 Row(_1=16, _2='feb', _3=2014),
 Row(_1=5, _2='dec', _3=2014),
 Row(_1=16, _2='sep', _3=2015),
 Row(_1=6, _2='jan', _3=2016)]

## RDD Action

Transformations build new RDDs from existing ones, but an action is what we use when we want to work with the actual data. An action does not produce a new RDD; it returns a non-RDD value, which is either sent to the driver or written to an external storage system. Because transformations are lazy, it is the action that triggers their execution.

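Lazy evaluation can be illustrated with Python generators. This is a loose analogy rather than Spark code: building the pipeline does no work, and only the final consuming step (the "action") forces evaluation and returns a plain value. The `tokenize` helper, the `log` list, and the sample lines are all hypothetical.

```python
log = []  # records when lines are actually processed

def tokenize(lines):
    """Generator standing in for a flatMap-style transformation."""
    for line in lines:
        log.append(f"tokenizing: {line!r}")
        yield from line.split(" ")

# Hypothetical sample input, including an empty line.
lines = ["first confirmed case", "", "more than 481,000 deaths"]

# "Transformations": the pipeline is only described; nothing has run yet.
words = tokenize(lines)
non_empty = (w for w in words if w != "")
calls_before_action = len(log)

# "Action": consuming the pipeline triggers the work and yields a non-lazy value.
word_count = sum(1 for _ in non_empty)

print(calls_before_action)  # 0
print(word_count)           # 7
print(len(log))             # 3
```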
### Count

In [167]:
counts = txtData.rdd.flatMap(lambda line: line.value.split(" ")) \
.filter(lambda x: x!='')
counts.count()

319

### Collect()

In [170]:
counts = txtData.rdd.flatMap(lambda line: line.value.split(" ")) \
.filter(lambda x: x!='').collect()[:3]
counts

['Coronavirus', 'disease', '2019']

### Take(n)

In [172]:
counts = txtData.rdd.flatMap(lambda line: line.value.split(" ")) \
.filter(lambda x: x!='')
counts.take(3)

['Coronavirus', 'disease', '2019']

### Top

In [175]:
counts = txtData.rdd.flatMap(lambda line: line.value.split(" ")) \
.filter(lambda x: x!='')
counts.top(3)

['who', 'where', 'wastewater']

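`take(n)` and `top(n)` return different things: `take` returns the first `n` elements in their existing order, while `top` returns the `n` largest in descending order. A plain-Python sketch of that distinction, using a hypothetical sample of words from the text (not Spark code):

```python
import heapq

# Hypothetical sample of words, in document order.
words = ["Coronavirus", "disease", "who", "wastewater", "where", "2019"]

# take(3) analogue: the first three elements as they appear.
first_three = words[:3]

# top(3) analogue: the three largest by natural (lexicographic) ordering, descending.
largest_three = heapq.nlargest(3, words)

print(first_three)    # ['Coronavirus', 'disease', 'who']
print(largest_three)  # ['who', 'where', 'wastewater']
```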
### CountByValue()

In [212]:
rdd1 = spark.createDataFrame([(6,"jan",2016),(3,"nov",2014),(16,"feb",2014),(5,"dec",2014),(16,"sep",2015)])
rdd2 = rdd1.select("_1").rdd.countByValue()
rdd2

defaultdict(int, {Row(_1=6): 1, Row(_1=3): 1, Row(_1=16): 2, Row(_1=5): 1})

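The result above is a dictionary of value frequencies. In plain Python, `collections.Counter` gives the same kind of mapping; this sketch uses the `_1` values from the cell above directly, without the `Row` wrappers.

```python
from collections import Counter

# Values of column _1 from the DataFrame in the cell above.
values = [6, 3, 16, 5, 16]

# Count how many times each distinct value occurs.
value_counts = Counter(values)

print(dict(value_counts))  # {6: 1, 3: 1, 16: 2, 5: 1}
```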
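### Reduce

The cell below computes the sum with a DataFrame aggregation (`groupBy().sum()`) rather than calling `reduce` on an RDD.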

In [247]:
rdd1 = spark.createDataFrame([(6,"jan",2016),(3,"nov",2014),(16,"feb",2014),(5,"dec",2014),(16,"sep",2015)])
rdd2 = rdd1.select("_1")
rdd2.groupBy().sum().show()

+-------+
|sum(_1)|
+-------+
|     46|
+-------+

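For comparison, the fold that an RDD-style `reduce` performs can be sketched in plain Python with `functools.reduce`, applied to the same `_1` values. This is an analogue, not Spark code; in Spark, the function passed to `reduce` should be commutative and associative so partitions can be combined in any order.

```python
from functools import reduce

# Values of column _1 from the DataFrame in the cell above.
values = [6, 3, 16, 5, 16]

# Combine the elements pairwise with a binary function until one value remains.
total = reduce(lambda a, b: a + b, values)

print(total)  # 46
```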
