#### Spark

* In-memory distributed computing engine
* Faster than Hadoop MapReduce (up to 100x for in-memory workloads)
* Less coding effort (5-10x)
* Interactive or batch processing
* Built-in rich set of functionalities

<img src="./images/spark.png" width="400" height="200" />



<img src="./images/cluster.png" width="800" height="400" />

Image source: https://spark.apache.org/docs/1.1.1/img/cluster-overview.png

**Spark Context**: It holds a connection to the Spark cluster manager and acts as its client. It also coordinates all Spark processes running for the application.

**Driver**: The driver is the process that runs the main() function of an application and creates the SparkContext.

**Worker**: A worker is any node in the cluster that can run application code.

**Cluster Manager**: The cluster manager allocates cluster resources to each application on behalf of its driver program. Cluster managers supported by Apache Spark include Standalone, Mesos, and YARN.

#### Python vs Scala:

Spark computation in Python is generally slower than in Scala, for the following reasons:
* Scala is the native language for Spark (Spark itself is written in Scala).
* Scala is a compiled language, whereas Python is an interpreted language.
* Python has process-based executors, whereas Scala has thread-based executors.
* Python is not a JVM (Java Virtual Machine) language.
 
Many people ask whether it is really necessary to learn Scala to use Spark. Here's an answer. 
* If you plan to process very large datasets across the nodes of a large cluster, choose Scala.
* However, for most users, Python is sufficient.


#### Apache Spark data representations: RDD and DataFrame

* **RDD** (Resilient Distributed Dataset) is an immutable collection of data elements, partitioned across the nodes of a Spark cluster.

* **DataFrame**, like an RDD, is an immutable distributed collection of data. Unlike an RDD, its data is organized into named columns, like a table in a relational database.

* **Dataset** is a more recent addition (it will not be covered in this class).



#### RDD and map, filter, reduce, etc.
We can apply two types of operations to RDDs:

**Transformation**: A transformation is an operation applied to an RDD that creates one or more new RDDs. <br>
**Action**: An action is an operation applied to an RDD that performs a computation and sends the result back to the driver.

Example: map (a transformation) applies a function to each element of an RDD and returns a new RDD. In contrast, reduce (an action) aggregates the elements of an RDD, for example the output of a map, into a single value by applying a function, and returns that value to the driver. Note that reduceByKey, despite its name, is a transformation: it returns a new RDD. Many more transformations and actions are defined in the Apache Spark documentation.

Transformations are *_lazy_*, i.e., they are not executed immediately. They run only when an action is called.


<img src="./images/rdd_transformation.png" width="600" height="300" />
<img src="./images/spark-rdd-trasf-action.png" />

#### Two common ways to create an RDD
* **_parallelize_** creates an RDD from a list
* **_textFile_** creates an RDD from a text file

### RDD Transformations 

* _**map**(func)_
* _**flatMap**(func)_
* _filter(func)_
* _mapPartitions(func)_
* _mapPartitionsWithIndex(func)_
* _union(dataset)_
* _intersection(dataset)_
* _distinct()_
* _groupByKey()_
* _**reduceByKey**()_
* _**sortByKey**()_
* ...



***

### map vs flatMap and filter

In [3]:
x = sc.parallelize([1,2,3,4])
y = x.map(lambda x: (x, x**2))

#print(x)
#print(x.collect())
print(y.collect())

[(1, 1), (2, 4), (3, 9), (4, 16)]


In [4]:
y = x.flatMap(lambda x: (x, x**2))
print(y.collect())

[1, 1, 2, 4, 3, 9, 4, 16]


In [5]:
z = y.filter(lambda x: x % 2 == 1)
print(z.collect())

[1, 1, 3, 9]


***
### distinct, union, and intersection

In [6]:
print(z.distinct().collect())

[1, 9, 3]


In [7]:
r1 = sc.parallelize([1,2,3,4,5])
r2 = sc.parallelize([2,3,4,5,6])

In [8]:
print(r1.union(r2).collect())

[1, 2, 3, 4, 5, 2, 3, 4, 5, 6]


In [9]:
print(r1.intersection(r2).collect())

[2, 3, 4, 5]


***
### operations on (key, value)
sortByKey and reduceByKey

In [14]:
x = sc.parallelize([('a',1), ('b',2), ('c', 3), ('a', 3), ('b', 4), ('a', 10)])

In [11]:
print(x.collect())

[('a', 1), ('b', 2), ('c', 3), ('a', 3), ('b', 4), ('a', 10)]


In [12]:
print(x.sortByKey().collect())

[('a', 1), ('a', 3), ('a', 10), ('b', 2), ('b', 4), ('c', 3)]


In [15]:
print(x.reduceByKey(lambda x,y: x+y).collect())

[('a', 14), ('c', 3), ('b', 6)]


### Narrow vs Wide Transformations

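Recall that a transformation is applied to an **RDD** and creates one or more new **RDD**s.


A transformation is **_narrow_** when each output partition depends on a single input partition, so no data moves between partitions (e.g., map, filter). It is **_wide_** when data must be shuffled across partitions (e.g., groupByKey, reduceByKey).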

<img src="./images/rdd.png" width="800" height="400" />

### RDD Actions

* _count()_
* _**collect**()_
* _**take**(n)_
* _top(n)_
* _**reduce**()_
* ...

***

### Example: Computation of Pi

<img src="./images/mcpi.png" width="800" height="400" />

In [20]:
import random
NUM_SAMPLES = 2000000
def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, NUM_SAMPLES)).filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))

Pi is roughly 3.144174


Q) What does **_sc.parallelize(range(0, NUM_SAMPLES)).filter(inside)_** return?

#### Load text data into an RDD

Consider the following sample emails:
<pre>
...
Qaddafi's cousin, Col. Ali Qaddafiddam had failed in efforts to recruit fighters among the
Egyptian population living immediately across the border with Libya.
These individuals added that during the week of February 21 the Libyan Leader spoke to Syrian
President Bashir al-Assad on at least three occasions by secure telephone lines. During the
conversations Qaddafi asked that Syrian officers and technicians currently training the Libyan
Air Force be placed under command of the Libyan Army and allowed to fight against the rebel
forces.
(Source Comment: Senior Libyan Army officers still loyal to Qaddafi added that On February
23, President Assad told General Isam Hallaq, the commander in chief of the Syrian Air Force,
to instruct the pilots and technicians in Tripoli to help the Libyan regime, should full scale Civil
War breaks out in the immediate future.)
On March 2, a military officer with ties to Qaddfi's son Khamis stated privately that the number
of Libyan pilots defecting to the opposition has destroyed the morale and professional spirit of
the Libyan Air Force at this critical moment, when Tripoli's air superiority is its principal weapon
against insurgents. In the opinion of this individual Qaddafi and his senior military advisors are
convinced that the European Union and the U.S will impose a no-fly zone over Libya in the
immediate future. These advisors believe that the no fly zone will serve as air support for
opposition forces. They are also prepared for the Western allies to bomb anti-aircraft facilities in
and around Tripoli in preparation for the establishment of the no-fly zone. Foreign Minister
Mousa Kousa is convinced that that Russia and Turkey will oppose the move, and may prevent
the implementation of the no fly zone.
...
</pre>

### Steps to perform word counts
1. read data as an RDD of lines
2. filter out empty lines
3. split each line into words
4. convert each word into (key/value) pair
5. reduce them by key 
6. flip k/v to v/k for sorting
7. sort by key in descending order



<img src="./images/wordcount.png" width="1000" height="800" />

In [None]:
# Let us first load data into RDD
lines = sc.textFile('data/sample.txt')

In [22]:
nonempty_lines = lines.filter(lambda x: len(x) > 0 )

In [24]:
nonempty_lines.take(1)

[u'1,C05739545,WOW,H,"Sullivan, Jacob J",87,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739545/C05739545.pdf,F-2015-04841,HRC_Email_296,FW: Wow,,"Sullivan, Jacob J <Sullivan11@state.gov>",,"Wednesday, September 12, 2012 10:16 AM",F-2015-04841,C05739545,05/13/2015,RELEASE IN FULL,,"UNCLASSIFIED']

In [25]:
words = nonempty_lines.flatMap(lambda x: x.split(' '))

In [26]:
words.take(1)

[u'1,C05739545,WOW,H,"Sullivan,']

#### we need to further remove non-

In [21]:
# Let us now filter out empty lines

# And split lines
# words = nonempty_lines.flatMap(lambda x: x.split(' '))
# print(words.take(10))

# make k/v pairs

# reduce them by key. Note: this generates RDD

# flip k/v to v/k for sorting: Note: use map

# sort by key in descending order

# print top 10

# Can we remove stop words?

# total counts?

<img src="./images/DataFrame-in-Spark.png" width="600" height="400" /> 