* There are two types of lower-level API:
    - RDDs: for manipulating distributed data
    - Broadcast variables and accumulators: for distributing and manipulating shared variables
* Use the lower-level API when a task cannot be accomplished with the Structured API. Structured API operations are internally converted to RDD transformations.

* `SparkContext` is the entry point for the lower-level API. We can access it through the `SparkSession` as `spark.sparkContext`.

In [1]:
import findspark
findspark.init('/home/purvil/spark-2.4.3-bin-hadoop2.7')

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Aggregation').getOrCreate()

* An RDD represents an immutable, partitioned collection of records that can be operated on in parallel.
* In a DataFrame, each record is a structured row containing fields with a known schema. Records in an RDD are just Java, Scala, or Python objects. We can store anything in these objects, so RDDs give us complete control.
* The price is that we have to implement all logic and optimizations ourselves, because Spark cannot see what is inside those Python or Java objects.
* The Structured API has built-in support for a compressed binary format; with RDDs we would have to implement that ourselves.
* Reordering of filters and aggregations, which happens automatically in Spark SQL, has to be done by hand.

* We generally use either the generic RDD type or a key-value RDD, which supports aggregation by key.

* Internally, each RDD is characterized by:
    - A list of partitions
    - A function to compute each split
    - A list of dependencies on other RDDs
    - Optionally, a partitioner for key-value RDDs
    - Optionally, a list of preferred locations on which to compute each split
* RDDs provide transformations (lazily evaluated) and actions (eagerly evaluated).
* Each record of an RDD is manipulated manually.
* Running Python code over an RDD is like running a Python UDF row by row, so RDDs perform better from Java or Scala.
* Use RDDs when we need fine-grained control over the physical distribution of data (custom partitioning).
* Dataset vs RDD
    - Datasets still get the functions and optimizations that the Structured API offers; RDDs do not.
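
The difference between lazy transformations and eager actions can be pictured with plain Python generators. The sketch below is only an analogy and does not use Spark; `lazy_map`, `calls`, and `words_local` are hypothetical names introduced for illustration.

```python
# Plain-Python analogy for lazy transformations and eager actions (no Spark needed).
calls = []  # records when the mapping function actually runs


def lazy_map(func, records):
    """Transformation-like step: returns a generator, so nothing runs yet."""
    for record in records:
        calls.append(record)
        yield func(record)


words_local = ["Spark", "is", "fun", "to", "learn"]
pipeline = lazy_map(str.upper, words_local)  # only describes the work
calls_before_action = len(calls)             # 0: no record has been processed

result = list(pipeline)                      # action-like step: forces evaluation
calls_after_action = len(calls)              # 5: every record has been processed

print(calls_before_action, calls_after_action, result)
```

As in Spark, the work is described first and only executed when a result is requested.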

### Creating RDD

In [4]:
spark.range(500).rdd

MapPartitionsRDD[4] at javaToPython at NativeMethodAccessorImpl.java:0

* To operate on this data, we have to convert each `Row` object to the correct data type or extract the values out of it.

In [6]:
spark.range(10).toDF("id").rdd.map(lambda row:row[0])

PythonRDD[10] at RDD at PythonRDD.scala:53

In [7]:
spark.range(10).rdd.toDF() # Create dataframe from RDD

DataFrame[id: bigint]

* To create an RDD from a collection, use the `parallelize` method on a `SparkContext`, which converts a single-node collection into a parallel collection. We can also state the number of partitions into which the collection should be distributed.

In [9]:
myCollection = "Spark is fun to learn".split(" ")

In [11]:
words = spark.sparkContext.parallelize(myCollection, 2)

In [13]:
words.setName("myWords")

myWords ParallelCollectionRDD[21] at parallelize at PythonRDD.scala:195

In [14]:
words.name()

'myWords'

* Creating an RDD from a data source

In [15]:
# spark.sparkContext.textFile("path/to/textfiles")

* Each record in the RDD is one line of the text file.

In [16]:
# spark.sparkContext.wholeTextFiles("path/to/textFiles")

* Each record is an entire text file: a pair whose first element is the file name and whose second element is the file's content as a string.

### distinct
* Removes duplicates from the RDD


In [17]:
words.distinct().count()

5

### filter

In [20]:
def startsWithS(individual):
    return individual.startswith("S")

In [21]:
words.filter(lambda word:startsWithS(word)).collect()

['Spark']

### map

In [22]:
words2 = words.map(lambda word: (word, word[0], word.startswith("S")))

In [23]:
words2

PythonRDD[29] at RDD at PythonRDD.scala:53

In [25]:
words2.filter(lambda record:record[2]).take(5)

[('Spark', 'S', True)]

### flatMap
* Sometimes each input row should produce multiple output rows. The function returns an iterable, and its elements are flattened into the resulting RDD.

In [27]:
words.flatMap(lambda word: list(word)).take(5)

['S', 'p', 'a', 'r', 'k']

In [28]:
words.flatMap(lambda word: list(word)).take(10)

['S', 'p', 'a', 'r', 'k', 'i', 's', 'f', 'u', 'n']

### sort

* Use the `sortBy` method. Specify a function that extracts a value from each object; the RDD is sorted by that value.

In [30]:
words.sortBy(lambda word:len(word)).take(2)

['is', 'to']

In [31]:
words.sortBy(lambda word:len(word)*-1).take(2)

['Spark', 'learn']

### Random Splits
* The `randomSplit` method randomly splits an RDD into a list of RDDs according to the given weights.

In [34]:
fiftyFiftySplit = words.randomSplit([0.5,0.5]) # array of weights

## Actions

### reduce

* Reduces the RDD to a single value using the function we specify.

In [36]:
spark.sparkContext.parallelize(range(1,21)).reduce(lambda x,y : x+ y)

210

In [37]:
def wordLengthReducer(leftWord, rightWord):
    return leftWord if len(leftWord) > len(rightWord) else rightWord

In [38]:
words.reduce(wordLengthReducer)

'learn'
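
The PySpark documentation asks for a commutative and associative operator in `reduce`. `wordLengthReducer` is not commutative on ties: 'Spark' and 'learn' both have five letters, and the right-hand word wins. The plain-Python sketch below models a reduce that first works inside each partition and then combines the partial results. `model_reduce` is a hypothetical helper, not Spark's implementation, and the partition layout is the one `glom` reports for `words`.

```python
from functools import reduce


def wordLengthReducer(leftWord, rightWord):
    # Same reducer as above: on a tie in length, the right word wins.
    return leftWord if len(leftWord) > len(rightWord) else rightWord


def model_reduce(func, partitions):
    """Reduce inside each partition, then reduce the per-partition results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)


partitions = [["Spark", "is"], ["fun", "to", "learn"]]

in_order = model_reduce(wordLengthReducer, partitions)
swapped = model_reduce(wordLengthReducer, list(reversed(partitions)))

print(in_order)  # learn
print(swapped)   # Spark
```

Combining the same partial results in a different order gives a different word, so a reducer like this should not be relied on for a deterministic answer.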

### count
* Counts the number of rows in the RDD

In [40]:
words.count()

5

### countApprox
* An approximate version of `count` that must return within a timeout, and may return an incomplete result if the timeout is exceeded. The confidence is the probability that the error bounds of the result contain the true value: if `countApprox` were called repeatedly with confidence 0.9, we would expect 90% of the results to contain the true count.

In [43]:
words.countApprox(400, 0.95) # timeout (ms), confidence

5

### countApproxDistinct

In [44]:
words.countApproxDistinct(0.05) # relative accuracy

5

### countByValue

* Counts the number of occurrences of each value in the RDD

In [45]:
words.countByValue()

defaultdict(int, {'Spark': 1, 'is': 1, 'fun': 1, 'to': 1, 'learn': 1})

* It loads the result set into the driver's memory, so make sure the result (one entry per distinct value) fits in driver memory.

### first

In [48]:
words.first()

'Spark'

### max and min

In [52]:
spark.sparkContext.parallelize(range(1,21)).max()

20

In [53]:
spark.sparkContext.parallelize(range(1,21)).min()

1

### take

* `take` reads the first partition and uses that result to estimate how many more partitions it needs to scan to return the requested number of records.

In [55]:
words.take(5)

['Spark', 'is', 'fun', 'to', 'learn']
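
The partition-by-partition behaviour of `take` can be modeled in plain Python. This is a simplified sketch, not Spark's code: `model_take` is a hypothetical helper, and doubling the number of partitions scanned per round is an assumption made for illustration (Spark uses its own estimate).

```python
def model_take(partitions, n):
    """Collect up to n records, scanning only as many partitions as needed."""
    taken = []
    scanned = 0
    batch = 1  # start with a single partition
    while len(taken) < n and scanned < len(partitions):
        for part in partitions[scanned:scanned + batch]:
            taken.extend(part[: n - len(taken)])
        scanned = min(scanned + batch, len(partitions))
        batch *= 2  # assumed growth factor, for illustration only
    return taken, scanned


partitions = [["Spark", "is"], ["fun", "to", "learn"]]

first_one, scanned_for_one = model_take(partitions, 1)
first_five, scanned_for_five = model_take(partitions, 5)

print(first_one, scanned_for_one)    # ['Spark'] 1
print(first_five, scanned_for_five)  # ['Spark', 'is', 'fun', 'to', 'learn'] 2
```

A small `take` touches only the first partition, which is why it is much cheaper than `collect` on a large RDD.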

In [56]:
words.takeOrdered(5)

['Spark', 'fun', 'is', 'learn', 'to']

In [62]:
words.takeSample(True, 5, 100) # withReplacement, totalSample, seed

['is', 'to', 'learn', 'to', 'is']

In [63]:
words.top(5) # choose top value according to implicit ordering

['to', 'learn', 'is', 'fun', 'Spark']

### saveAsTextFile

In [64]:
words.saveAsTextFile("wordText")

In [65]:
!ls wordText/

part-00000  part-00001	_SUCCESS


In [67]:
!cat wordText/part-00000 

Spark
is


In [68]:
!cat wordText/part-00001

fun
to
learn


### caching

In [72]:
words.cache()

myWords ParallelCollectionRDD[21] at parallelize at PythonRDD.scala:195

### Checkpointing

* Saves the RDD to disk so that future references to it read those intermediate partitions from disk rather than recomputing the RDD from its original source.

In [74]:
spark.sparkContext.setCheckpointDir(".")

In [75]:
words.checkpoint()

* `checkpoint()` only marks the RDD for checkpointing. Once an action has materialized the checkpoint, later references to `words` are derived from it.

### pipe

* The given process is executed once per partition.

In [77]:
words.pipe("wc -l").collect()

['2', '3']

* All elements of each input partition are written to a process’s stdin as lines of input separated by a newline. The resulting partition consists of the process’s stdout output, with each line of stdout resulting in one element of the output partition.
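
The stdin/stdout contract described above can be modeled without Spark. In this sketch, `model_pipe` is a hypothetical helper, and a child Python process that counts lines stands in for `wc -l` so the example does not depend on the operating system.

```python
import subprocess
import sys


def model_pipe(partitions, command):
    """Run `command` once per partition: elements go to stdin, stdout lines come back."""
    piped = []
    for part in partitions:
        stdin_text = "".join(str(element) + "\n" for element in part)
        completed = subprocess.run(
            command, input=stdin_text, capture_output=True, text=True, check=True
        )
        piped.append(completed.stdout.splitlines())
    return piped


# Portable stand-in for `wc -l`: a child Python process that counts stdin lines.
count_lines = [sys.executable, "-c", "import sys; print(sum(1 for _ in sys.stdin))"]

partitions = [["Spark", "is"], ["fun", "to", "learn"]]
line_counts = model_pipe(partitions, count_lines)
print(line_counts)  # [['2'], ['3']]
```

Each partition yields one output line here, which matches the `['2', '3']` that `words.pipe("wc -l").collect()` returns once the partitions are flattened.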

### mapPartitions
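* Applies a function to each partition as a whole: the function receives an iterator over the partition's records and returns an iterable.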

In [79]:
words.mapPartitions(lambda part:[1]).sum()

2

* This is useful when we want to run an algorithm over an entire subset of the RDD at once.
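
A plain-Python model of the per-partition call may make this clearer. `model_map_partitions` is a hypothetical helper and not part of Spark; the partitions are the ones `glom` reports for `words`.

```python
def model_map_partitions(func, partitions):
    """Call func once per partition with an iterator over that partition's records."""
    return [list(func(iter(part))) for part in partitions]


partitions = [["Spark", "is"], ["fun", "to", "learn"]]

# Same function as in the cell above: ignore the records, emit a single 1.
ones = model_map_partitions(lambda part: [1], partitions)
partition_count = sum(value for part in ones for value in part)

# A per-partition computation: the longest word found in each partition.
longest = model_map_partitions(lambda part: [max(part, key=len)], partitions)

print(partition_count)  # 2
print(longest)          # [['Spark'], ['learn']]
```

Because the function sees the whole iterator, any setup cost is paid once per partition instead of once per record.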

### mapPartitionsWithIndex

* We specify a function that accepts the index of the partition and an iterator over all items within that partition.

In [83]:
def indexedFunc(partitionIndex, withinParIterator):
    return ["partition {}=>{}".format(partitionIndex, x) for x in withinParIterator]

In [84]:
words.mapPartitionsWithIndex(indexedFunc).collect()

['partition 0=>Spark',
 'partition 0=>is',
 'partition 1=>fun',
 'partition 1=>to',
 'partition 1=>learn']

### foreachPartition
* Iterates over all partitions of the data. The function has no return value, so it is used for side effects on each partition, such as writing it out to a database.
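
As a minimal sketch of that pattern, the plain-Python model below writes each partition to its own temporary file. `write_partition` and `model_foreach_partition` are hypothetical helpers; with Spark, a function shaped like `write_partition` would be the one passed to `foreachPartition`.

```python
import os
import tempfile


def write_partition(records, out_dir):
    """Side effect only: write one partition's records to its own file."""
    fd, path = tempfile.mkstemp(suffix=".txt", dir=out_dir)
    with os.fdopen(fd, "w") as handle:
        for record in records:
            handle.write(record + "\n")
    # Nothing is returned: the function is run purely for its effect.


def model_foreach_partition(func, partitions):
    """Call func once per partition and discard whatever it returns."""
    for part in partitions:
        func(iter(part))


partitions = [["Spark", "is"], ["fun", "to", "learn"]]
out_dir = tempfile.mkdtemp()
model_foreach_partition(lambda part: write_partition(part, out_dir), partitions)

written_files = sorted(os.listdir(out_dir))
print(len(written_files))  # 2, one file per partition
```

Opening the output resource once per partition, rather than once per record, is the usual reason to prefer this over a per-record `foreach`.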

### glom
* Takes every partition in the dataset and converts it to an array (a list in Python). The result is collected to the driver, so avoid it on large partitions.

In [85]:
spark.sparkContext.parallelize(["Hello", "World"],2).glom().collect()

[['Hello'], ['World']]

In [86]:
words.glom().collect()

[['Spark', 'is'], ['fun', 'to', 'learn']]

## Key-value RDD


In [91]:
words.map(lambda word: (word.lower(), 1)).collect()

[('spark', 1), ('is', 1), ('fun', 1), ('to', 1), ('learn', 1)]

### keyBy

In [89]:
keyword = words.keyBy(lambda word:word.lower()[0])

In [90]:
keyword.collect()

[('s', 'Spark'), ('i', 'is'), ('f', 'fun'), ('t', 'to'), ('l', 'learn')]

* If each record is a tuple, Spark assumes that the first element is the key and the second is the value.

### map over value

In [92]:
keyword.mapValues(lambda word:word.upper()).collect()

[('s', 'SPARK'), ('i', 'IS'), ('f', 'FUN'), ('t', 'TO'), ('l', 'LEARN')]

### flatMapValues

In [96]:
keyword.flatMapValues(lambda word:word.upper()).collect()

[('s', 'S'),
 ('s', 'P'),
 ('s', 'A'),
 ('s', 'R'),
 ('s', 'K'),
 ('i', 'I'),
 ('i', 'S'),
 ('f', 'F'),
 ('f', 'U'),
 ('f', 'N'),
 ('t', 'T'),
 ('t', 'O'),
 ('l', 'L'),
 ('l', 'E'),
 ('l', 'A'),
 ('l', 'R'),
 ('l', 'N')]

In [97]:
keyword.keys().collect()

['s', 'i', 'f', 't', 'l']

In [98]:
keyword.values().collect()

['Spark', 'is', 'fun', 'to', 'learn']

In [101]:
keyword.lookup('s')

['Spark']

### Aggregation

In [102]:
chars = words.flatMap(lambda word:word.lower())

In [103]:
chars.collect()

['s',
 'p',
 'a',
 'r',
 'k',
 'i',
 's',
 'f',
 'u',
 'n',
 't',
 'o',
 'l',
 'e',
 'a',
 'r',
 'n']

In [104]:
KVcharacters = chars.map(lambda letter:(letter, 1))

In [105]:
def maxFunc(left, right):
    return max(left,right)

In [106]:
def addFunc(l,r):
    return l + r

In [107]:
nums = spark.sparkContext.parallelize(range(1,31), 5)

### countByKey

In [109]:
KVcharacters.countByKey()

defaultdict(int,
            {'s': 2,
             'p': 1,
             'a': 2,
             'r': 2,
             'k': 1,
             'i': 1,
             'f': 1,
             'u': 1,
             'n': 2,
             't': 1,
             'o': 1,
             'l': 1,
             'e': 1})

### groupByKey

In [111]:
from functools import reduce

In [113]:
KVcharacters.groupByKey().map(lambda row: (row[0], reduce(addFunc, row[1]))).collect()

[('s', 2),
 ('p', 1),
 ('r', 2),
 ('i', 1),
 ('l', 1),
 ('a', 2),
 ('k', 1),
 ('f', 1),
 ('u', 1),
 ('n', 2),
 ('t', 1),
 ('o', 1),
 ('e', 1)]

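* To do this, each executor must hold all the values for a given key in memory before the function is applied, so heavily skewed keys can cause out-of-memory errors.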

### reduceByKey
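* Combines the values that share a key using the function we specify. Combining starts within each partition, so not every value has to be held in memory or shuffled.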

In [115]:
KVcharacters.reduceByKey(addFunc).collect()

[('s', 2),
 ('p', 1),
 ('r', 2),
 ('i', 1),
 ('l', 1),
 ('a', 2),
 ('k', 1),
 ('f', 1),
 ('u', 1),
 ('n', 2),
 ('t', 1),
 ('o', 1),
 ('e', 1)]
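
Both approaches return the same totals; the difference is how much data moves. The plain-Python model below counts the pairs each strategy would ship between partitions. It is a sketch under simplifying assumptions, not Spark's shuffle: `model_group_by_key` and `model_reduce_by_key` are hypothetical helpers, and the partition layout is the one `glom` reports for `words`.

```python
from collections import defaultdict
from functools import reduce

words_by_partition = [["Spark", "is"], ["fun", "to", "learn"]]
# (letter, 1) pairs per partition, mirroring KVcharacters.
kv_partitions = [
    [(letter, 1) for word in part for letter in word.lower()]
    for part in words_by_partition
]


def add(left, right):
    return left + right


def model_group_by_key(partitions):
    """Ship every pair, collect all values per key, then reduce them."""
    shuffled = [pair for part in partitions for pair in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    totals = {key: reduce(add, values) for key, values in grouped.items()}
    return totals, len(shuffled)


def model_reduce_by_key(partitions):
    """Combine inside each partition first, then ship one pair per key per partition."""
    shuffled = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = add(local[key], value) if key in local else value
        shuffled.extend(local.items())
    totals = {}
    for key, value in shuffled:
        totals[key] = add(totals[key], value) if key in totals else value
    return totals, len(shuffled)


group_totals, group_shuffled = model_group_by_key(kv_partitions)
reduce_totals, reduce_shuffled = model_reduce_by_key(kv_partitions)

print(group_totals == reduce_totals)    # True
print(group_shuffled, reduce_shuffled)  # 17 15
```

The saving is small on five words, but it grows with the number of repeated keys per partition.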

In [130]:
raw_data = spark.sparkContext.textFile("spark_data/daily_show.tsv")

In [131]:
raw_data # RDD object reference

spark_data/daily_show.tsv MapPartitionsRDD[154] at textFile at NativeMethodAccessorImpl.java:0

In [132]:
raw_data.take(5)

['YEAR\tGoogleKnowlege_Occupation\tShow\tGroup\tRaw_Guest_List',
 '1999\tactor\t1/11/99\tActing\tMichael J. Fox',
 '1999\tComedian\t1/12/99\tComedy\tSandra Bernhard',
 '1999\ttelevision actress\t1/13/99\tActing\tTracey Ullman',
 '1999\tfilm actress\t1/14/99\tActing\tGillian Anderson']

In [133]:
daily_show = raw_data.map(lambda line:line.split('\t'))

In [134]:
daily_show.take(5)

[['YEAR', 'GoogleKnowlege_Occupation', 'Show', 'Group', 'Raw_Guest_List'],
 ['1999', 'actor', '1/11/99', 'Acting', 'Michael J. Fox'],
 ['1999', 'Comedian', '1/12/99', 'Comedy', 'Sandra Bernhard'],
 ['1999', 'television actress', '1/13/99', 'Acting', 'Tracey Ullman'],
 ['1999', 'film actress', '1/14/99', 'Acting', 'Gillian Anderson']]

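* Calculate the total number of guests per year. The header line is counted too (as `('YEAR', 1)`), so it is filtered out at the end.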

In [136]:
tally = daily_show.map(lambda x:(x[0], 1)).reduceByKey(lambda x,y: x+y)

In [137]:
tally

PythonRDD[166] at RDD at PythonRDD.scala:53

In [138]:
tally.take(5)

[('YEAR', 1), ('2002', 159), ('2003', 166), ('2004', 164), ('2007', 141)]

In [140]:
tally.count()

18

In [143]:
tally.filter(lambda x:x[0]!= 'YEAR').take(5)

[('2002', 159), ('2003', 166), ('2004', 164), ('2007', 141), ('2010', 165)]