## Spark introduction

At its core, Spark is a general-purpose engine for processing large amounts of data. It is one of the main frameworks in big data and an essential part of the Hadoop ecosystem. Spark originated at UC Berkeley's AMPLab and is now maintained by the Apache Software Foundation. It builds on many of the concepts of Hadoop MapReduce to process vast amounts of data in parallel in a reliable, fault-tolerant manner.<br>
Spark is written in Scala; however, a Python interface called PySpark lets you take advantage of Spark from Python.

In a programming context, PySpark can process large amounts of data by handling the parallelism (multiprocessing and threading) for you, so you do not need to implement it yourself.  

### Functional programming

There are multiple paradigms in programming, such as object-oriented programming, array-oriented programming and many more. One paradigm that is of particular interest to big data is functional programming. In functional programming, a program consists entirely of the evaluation of pure functions. One of its most important principles is that data should be manipulated by functions without maintaining any external state, meaning that a function should always return a new result instead of modifying the data in place. The biggest advantage of functional programming, and the reason it is popular in big data processing, is that code written in a functional manner can be run in parallel on multiple CPUs or machines.<br>PySpark allows you to easily take functional code and distribute it across big data sets. In short, PySpark is based on the functional programming paradigm for two main reasons:<br>a) Scala, Spark's native language, supports functional programming.<br>b) Functional code is much easier to run in parallel.
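
To make the idea of "no external state" concrete, here is a minimal sketch in plain Python. The function names `double_in_place` and `double_pure` are made up for this illustration; they contrast a function that modifies its input with a pure function that returns a new result.

```python
# Impure: mutates its argument in place, so the caller's data changes.
def double_in_place(values):
    for i in range(len(values)):
        values[i] *= 2
    return values


# Pure: builds and returns a new list, leaving the input untouched.
def double_pure(values):
    return [v * 2 for v in values]


data = [1, 2, 3]
result = double_pure(data)
print(data)    # [1, 2, 3]  (the original list is unchanged)
print(result)  # [2, 4, 6]

mutated = [1, 2, 3]
double_in_place(mutated)
print(mutated)  # [2, 4, 6]  (the original list was overwritten)
```

Because `double_pure` never touches shared data, several copies of it could run on different chunks of a dataset at the same time without interfering with each other.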


### Anonymous functions

As functional programming is all about passing and evaluating functions, it is common to have many functions predefined using the keyword *def*. However, sometimes it is convenient to create and use a function on the fly, without having to define it and give it a name beforehand. Such functions are known as anonymous functions. Anonymous functions are defined inline and are limited to a single expression. The keyword to create an anonymous function in Python is *lambda*, and the syntax is as follows: *lambda parameter_list: expression* <br>
Anonymous functions are an important aspect of functional programming and are used regularly when coding with PySpark and applying operations on datasets, so it is important to be familiar with them before jumping into PySpark.

The following summarizes the parts of a lambda expression:

lambda =>   The keyword that introduces a lambda expression<br>
parameter_list	=>   An optional comma-separated list of parameter names<br>
':'	=> Punctuation that separates parameter_list from expression<br>
expression =>	An expression usually involving the names in parameter_list <br>
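
As a small sketch of how these parts fit together, the snippet below defines the same function once with *def* and once with *lambda*. The names `add_def`, `add_lambda` and `constant` are chosen only for this example.

```python
# A named function defined with def.
def add_def(a, b):
    return a + b


# The equivalent anonymous function: parameter_list is "a, b",
# the expression after the colon is the return value.
add_lambda = lambda a, b: a + b

# parameter_list is optional: this lambda takes no arguments.
constant = lambda: 42

print(add_def(2, 3))     # 5
print(add_lambda(2, 3))  # 5
print(constant())        # 42
```

Assigning a lambda to a name is done here only for demonstration; in practice a lambda is usually passed directly as an argument to another function.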

Anonymous functions are most commonly used in conjunction with other Python functions.<br>
The following are examples of anonymous functions passed to various built-in Python functions.

An anonymous function that converts each name to lowercase before comparison, so that the sort is case-insensitive. The function passed as the *key* parameter of sorted() is called once for each item in the list.

In [None]:
x = ['Python', 'programming', 'is', 'awesome!']
print(sorted(x))  # default sort is case-sensitive: uppercase letters come first
print(sorted(x, key=lambda arg: arg.lower()))  # compare the lowercase versions instead

['Python', 'awesome!', 'is', 'programming']
['awesome!', 'is', 'programming', 'Python']


An anonymous function to filter a list, keeping only the strings that are shorter than 8 characters. filter() selects items from an iterable based on a condition; in this case the condition is passed as an anonymous function.<br>
Note that filter() returns a lazy iterator, so you would have to loop over it to access the items. Wrapping it in list() materializes all the items in memory as a single list that can be accessed directly.

In [None]:
print(list(filter(lambda arg: len(arg) < 8, x)))

['Python', 'is']


Change all elements to uppercase.<br>
Similar to filter(), map() applies a function to each item in an iterable; however, it always produces a 1-to-1 mapping of the original items. This means that the result of map() always has the same number of elements as the original input, which is not the case with filter().

In [None]:
print(list(map(lambda arg: arg.upper(), x)))

['PYTHON', 'PROGRAMMING', 'IS', 'AWESOME!']


Finally, the function reduce() combines the items of an iterable into a single value. The following code reduces the list of strings to a single string by concatenating the elements.

In [None]:
from functools import reduce
print(reduce(lambda val1, val2: val1 + val2, x))

Pythonprogrammingisawesome!


The built-in filter(), map(), and reduce() functions are all common in functional programming. You’ll soon see that these concepts can make up a significant portion of the functionality of a PySpark program.
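
The three functions can also be chained into a small pipeline, which is roughly the shape many PySpark programs take. The following sketch uses made-up sample data (the numbers 1 to 10) and computes the sum of the squares of the even numbers.

```python
from functools import reduce

numbers = range(1, 11)

# Keep only the even numbers: 2, 4, 6, 8, 10
evens = filter(lambda n: n % 2 == 0, numbers)

# Square each remaining number: 4, 16, 36, 64, 100
squares = map(lambda n: n * n, evens)

# Combine everything into a single value by summing.
total = reduce(lambda a, b: a + b, squares)

print(total)  # 220
```

Each step takes the previous step's result as input and never modifies it, which is the functional style described earlier.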

## Spark architecture

![image.png](attachment:image.png)

A Spark application has two main kinds of processes:<br>
- Driver: A single process which creates tasks for the cluster to execute
- Executor: Multiple processes throughout the cluster that execute tasks in parallel

The SparkContext is the main entry point of any Spark application and is created in the driver.<br>
A SparkContext defines the jobs (i.e. the functions to execute) and communicates with the cluster manager so that each job is broken down into multiple tasks, which are sent to the cluster and executed in parallel.<br>
Each task is executed on a different partition of the dataset, which is what allows the work to run in parallel.
![image.png](attachment:image.png)
    

A worker is a cluster node that hosts executor processes, which run the tasks. Each executor is allocated a number of cores, and each core executes one task at a time. Thus, increasing the number of executors and cores increases the parallelism of the cluster.
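
The relationship between partitions and tasks can be pictured with a toy analogy in plain Python. This is not how Spark is implemented: the helper `split_into_partitions`, the function `task`, and the use of a thread pool as a stand-in for executor cores are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions contiguous, roughly equal chunks."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def task(partition):
    """The work done on a single partition: here, summing its elements."""
    return sum(partition)


data = list(range(100))
partitions = split_into_partitions(data, 4)

# Four workers stand in for four executor cores, one task per partition.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(task, partitions))

# The "driver" combines the partial results into the final answer.
total = sum(partial_sums)

print(partial_sums)  # [300, 925, 1550, 2175]
print(total)         # 4950
```

With more partitions and more workers, more tasks can run at the same time, which mirrors the effect of adding executors and cores to a cluster.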

The main data abstraction in Spark is the RDD (Resilient Distributed Dataset). There are two main kinds of operations that can be executed on an RDD: transformations and actions. A transformation creates a new RDD; however, the operation is not computed immediately. The new RDD is only computed when the driver calls an action.<br> How do transformations and actions occur?<br>
Spark executes transformations and actions by building a DAG (Directed Acyclic Graph), which is handled by a DAG scheduler. A DAG is a graph structure containing vertices and edges, in which edges only point from an older vertex to a newer one, hence the word acyclic. When a new RDD is created, a new vertex is added to the DAG, and the edges represent the transformations/actions applied to an RDD.  

Here are the 5 steps that a Spark program goes through to create a DAG.<br>
1. The DAG is created when an RDD is created.
2. The DAG scheduler registers each transformation and updates the DAG.
3. The DAG now points to the new RDD.
4. A reference to the transformed RDD is returned to the Spark driver.
5. If there is an action, the driver program first computes the action and then updates the DAG.<br>
Finally, it is important to note that if a node goes down, Spark uses the DAG to recompute the partitions that were lost on that node.
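
The idea of recording transformations and only computing them when an action is called can be sketched with a toy class in plain Python. `ToyRDD` is a hypothetical stand-in written for this illustration, not Spark's implementation; its `lineage` attribute plays the role of one path through the DAG.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: records steps, computes on action."""

    def __init__(self, source, lineage=()):
        self.source = source    # the original data
        self.lineage = lineage  # recorded (name, function) steps

    def map(self, fn):
        # Transformation: return a new ToyRDD with one more recorded step.
        return ToyRDD(self.source, self.lineage + (("map", fn),))

    def filter(self, fn):
        # Transformation: nothing is computed here either.
        return ToyRDD(self.source, self.lineage + (("filter", fn),))

    def collect(self):
        # Action: replay the recorded steps over the source data.
        data = list(self.source)
        for name, fn in self.lineage:
            if name == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


base = ToyRDD(range(6))
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

print([name for name, _ in evens_squared.lineage])  # ['filter', 'map']
print(evens_squared.collect())                      # [0, 4, 16]
```

Since the recorded steps are enough to rebuild the result from the source data, calling `collect()` again simply replays them, which is the same principle that lets lost data be recomputed after a failure.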

## Creating our first Spark program.

Create a SparkContext to interact with the driver in order to create and execute tasks.<br>
The driver is the main component of Spark, allowing you to split and execute tasks in parallel.

In [1]:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("Spark RDD Course")
sc = SparkContext(conf=conf)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/01/04 12:42:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


The RDD is the main data structure of Spark. RDD stands for Resilient Distributed Dataset.

In [2]:
rdd = sc.parallelize([1,2,3,4,5,6])

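parallelize() accepts an optional second argument that sets the number of partitions the dataset is split into. Without it, Spark picks a default based on the cluster (48 in the run below).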

In [3]:
rdd.getNumPartitions()

48

In [6]:
rdd = sc.parallelize(range(1000), 10)
rdd.getNumPartitions()

10

In [6]:
lst = [1,2,3,4]

In [7]:
rdd.collect()

[0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29,
 30,
 31,
 32,
 33,
 34,
 35,
 36,
 37,
 38,
 39,
 40,
 41,
 42,
 43,
 44,
 45,
 46,
 47,
 48,
 49,
 50,
 51,
 52,
 53,
 54,
 55,
 56,
 57,
 58,
 59,
 60,
 61,
 62,
 63,
 64,
 65,
 66,
 67,
 68,
 69,
 70,
 71,
 72,
 73,
 74,
 75,
 76,
 77,
 78,
 79,
 80,
 81,
 82,
 83,
 84,
 85,
 86,
 87,
 88,
 89,
 90,
 91,
 92,
 93,
 94,
 95,
 96,
 97,
 98,
 99,
 100,
 101,
 102,
 103,
 104,
 105,
 106,
 107,
 108,
 109,
 110,
 111,
 112,
 113,
 114,
 115,
 116,
 117,
 118,
 119,
 120,
 121,
 122,
 123,
 124,
 125,
 126,
 127,
 128,
 129,
 130,
 131,
 132,
 133,
 134,
 135,
 136,
 137,
 138,
 139,
 140,
 141,
 142,
 143,
 144,
 145,
 146,
 147,
 148,
 149,
 150,
 151,
 152,
 153,
 154,
 155,
 156,
 157,
 158,
 159,
 160,
 161,
 162,
 163,
 164,
 165,
 166,
 167,
 168,
 169,
 170,
 171,
 172,
 173,
 174,
 175,
 176,
 177,
 178,
 179,
 180,
 181,
 182,
 183,
 184,


There are 2 types of operations on an RDD: transformations and actions. Transformations are lazily evaluated, meaning that the operation is not executed immediately but only recorded internally by Spark. Actions trigger the recorded operations and return a result.
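
Python's built-in map() behaves lazily in a similar way, which makes it a convenient way to observe lazy evaluation without a cluster. In this sketch, `traced_double` and the `calls` list are made up for the illustration; the list records when the function is actually executed.

```python
calls = []


def traced_double(x):
    calls.append(x)  # record that the function really ran
    return x * 2


# Comparable to a transformation: the work is requested but not done yet.
lazy = map(traced_double, [1, 2, 3])
calls_before = list(calls)
print(calls_before)  # []

# Comparable to an action: consuming the iterator triggers the computation.
result = list(lazy)
print(calls)   # [1, 2, 3]
print(result)  # [2, 4, 6]
```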

### Transformations

Map - applies a function to every element and returns a new RDD with exactly one result per input element.

In [16]:
rdd = sc.parallelize(range(1000),5)

In [17]:
rdd.collect()

[0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29,
 30,
 31,
 32,
 33,
 34,
 35,
 36,
 37,
 38,
 39,
 40,
 41,
 42,
 43,
 44,
 45,
 46,
 47,
 48,
 49,
 50,
 51,
 52,
 53,
 54,
 55,
 56,
 57,
 58,
 59,
 60,
 61,
 62,
 63,
 64,
 65,
 66,
 67,
 68,
 69,
 70,
 71,
 72,
 73,
 74,
 75,
 76,
 77,
 78,
 79,
 80,
 81,
 82,
 83,
 84,
 85,
 86,
 87,
 88,
 89,
 90,
 91,
 92,
 93,
 94,
 95,
 96,
 97,
 98,
 99,
 100,
 101,
 102,
 103,
 104,
 105,
 106,
 107,
 108,
 109,
 110,
 111,
 112,
 113,
 114,
 115,
 116,
 117,
 118,
 119,
 120,
 121,
 122,
 123,
 124,
 125,
 126,
 127,
 128,
 129,
 130,
 131,
 132,
 133,
 134,
 135,
 136,
 137,
 138,
 139,
 140,
 141,
 142,
 143,
 144,
 145,
 146,
 147,
 148,
 149,
 150,
 151,
 152,
 153,
 154,
 155,
 156,
 157,
 158,
 159,
 160,
 161,
 162,
 163,
 164,
 165,
 166,
 167,
 168,
 169,
 170,
 171,
 172,
 173,
 174,
 175,
 176,
 177,
 178,
 179,
 180,
 181,
 182,
 183,
 184,



In [10]:
# Start from a small RDD so that the nested result is easy to read.
rdd = sc.parallelize([2, 3, 4]).map(lambda x: list(range(1, x)))

In [11]:
rdd.collect()

                                                                                

[[1], [1, 2], [1, 2, 3]]

flatMap - applies a function that returns a sequence for each element and combines all the results into a single flat list, hence the word flat.

In [None]:
rdd = sc.parallelize([2, 3, 4, 5])
rdd.flatMap(lambda x: range(1, x)).collect()

[1, 1, 2, 1, 2, 3, 1, 2, 3, 4]
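
The difference between map and flatMap can be reproduced in plain Python, which may help when reasoning about the shape of the result. This is only an analogy using `itertools.chain.from_iterable`, not a description of how Spark implements flatMap.

```python
from itertools import chain

data = [2, 3, 4, 5]

# map: one result per input element, so the result is a list of lists.
mapped = list(map(lambda x: list(range(1, x)), data))

# flatMap analogue: the per-element results are chained into one flat list.
flat_mapped = list(chain.from_iterable(map(lambda x: range(1, x), data)))

print(mapped)       # [[1], [1, 2], [1, 2, 3], [1, 2, 3, 4]]
print(flat_mapped)  # [1, 1, 2, 1, 2, 3, 1, 2, 3, 4]
```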

Filter - keeps only the elements for which the given function returns True.

In [None]:
rdd = sc.parallelize(range(10),16)

In [None]:
rdd.filter(lambda x: x % 2 == 0).collect()

[0, 2, 4, 6, 8]

Distinct - removes duplicate elements.

In [None]:
rdd = sc.parallelize([1, 1, 4, 2, 1, 3, 3])
rdd.distinct().collect()

Union - combines two RDDs into one, keeping any duplicates.

In [None]:
rdd1 = sc.parallelize(range(5))
rdd2 = sc.parallelize(range(3, 9))
rdd3 = rdd1.union(rdd2)
rdd3.collect()

In [None]:
# distinct() removes the duplicates (3 and 4) that the union kept
rdd3.distinct().collect()
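
What union and distinct do to the data can be pictured with plain Python lists. This is a rough analogy only: the variable names are made up for this example, and the result is sorted here for readability, whereas Spark's distinct() does not guarantee any particular order.

```python
first = list(range(5))       # [0, 1, 2, 3, 4]
second = list(range(3, 9))   # [3, 4, 5, 6, 7, 8]

# Union analogue: simple concatenation, duplicates are kept.
combined = first + second
print(combined)  # [0, 1, 2, 3, 4, 3, 4, 5, 6, 7, 8]

# Distinct analogue: every value appears exactly once.
unique = sorted(set(combined))
print(unique)    # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```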

### Actions

collect() is an action, as seen above. count() and countByValue() are also actions.

In [None]:
rdd = sc.parallelize([1, 3, 1, 2, 2, 2])
rdd.count()

6

In [None]:
rdd.countByValue()

defaultdict(int, {1: 2, 3: 1, 2: 3})
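
The result of countByValue() has the same shape as what `collections.Counter` produces for an ordinary list. The sketch below is a plain Python analogy that reuses the values from the cell above.

```python
from collections import Counter

values = [1, 3, 1, 2, 2, 2]

# Number of elements, comparable to count().
print(len(values))  # 6

# Occurrences of each distinct value, comparable to countByValue().
counts = Counter(values)
print(dict(counts))  # {1: 2, 3: 1, 2: 3}
```

Note that the dictionary has one entry per distinct value, so its size grows with the number of different values rather than with the number of elements.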

Be careful when using collect(), as it pulls the entire dataset into the memory of the driver, which will not work if the dataset is too big to fit into the RAM of a single machine. countByValue() can hit the same limit, since it returns one dictionary entry per distinct value, whereas count() only returns a single number.
<br>Instead, you can extract just a certain number of elements from the result set by using the take() method.

In [None]:
rdd.take(2)

take(n) returns the first n elements of the RDD and attempts to minimize the number of partitions it accesses. The elements come back in partition order rather than sorted, so the result may not be what you expect.

takeOrdered(n) returns the n smallest elements in ascending order, or ordered by a metric you specify using the key parameter.

In [None]:
rdd = sc.parallelize([(3, 'a'), (1, 'b'), (2, 'd')])

In [None]:
rdd.takeOrdered(2)

[(1, 'b'), (2, 'd')]

In [None]:
rdd.takeOrdered(2, key=lambda x: x[1])

[(3, 'a'), (1, 'b')]
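
The behaviour of takeOrdered() can be mimicked in plain Python with `heapq.nsmallest`, which also takes a count and an optional key function. The sketch below is an analogy on an ordinary list that reuses the pairs from the cells above; the last call, which negates the key to obtain descending order, is an extra illustration that does not appear in the original cells.

```python
import heapq

pairs = [(3, 'a'), (1, 'b'), (2, 'd')]

# Two smallest items by natural (tuple) ordering.
print(heapq.nsmallest(2, pairs))  # [(1, 'b'), (2, 'd')]

# Two smallest items when ordered by the second field.
print(heapq.nsmallest(2, pairs, key=lambda x: x[1]))  # [(3, 'a'), (1, 'b')]

# Negating a numeric key yields the largest items first.
print(heapq.nsmallest(2, pairs, key=lambda x: -x[0]))  # [(3, 'a'), (2, 'd')]
```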

Reduce - combines all the elements of the RDD into a single value using the given function.

In [None]:
rdd = sc.range(1, 4)
rdd.reduce(lambda a, b: a + b)

6
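
Because an RDD is split into partitions, a reduce is carried out on each partition and the partial results are then combined. For that reason the function passed to reduce() should be commutative and associative. The sketch below imitates this in plain Python; `partitioned_reduce` is a hypothetical helper written for this illustration, and the partition layouts are made up.

```python
from functools import reduce


def partitioned_reduce(partitions, fn):
    """Reduce each non-empty partition, then reduce the partial results."""
    partials = [reduce(fn, part) for part in partitions if part]
    return reduce(fn, partials)


add = lambda a, b: a + b
subtract = lambda a, b: a - b

one_partition = [[1, 2, 3, 4]]
two_partitions = [[1, 2], [3, 4]]

# Addition is commutative and associative: the layout does not matter.
print(partitioned_reduce(one_partition, add))   # 10
print(partitioned_reduce(two_partitions, add))  # 10

# Subtraction is neither: the result depends on how the data is split.
print(partitioned_reduce(one_partition, subtract))   # -8
print(partitioned_reduce(two_partitions, subtract))  # 0
```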

In [8]:
sc.stop()