# Introduction to Spark

### Compute-Intensive vs Data-Intensive Computing

Parallel processing approaches can be *generally* classified as:
* Compute-intensive (task-parallelism), or 
* Data-intensive (data-parallelism)



### Compute-Intensive: Titan Example

<img src="./images/md_on_titan.png" width="1000" height="800" />

### Big Data Problems and Data Intensive Computing
The explosion of data in size and complexity creates problems:
* Instrumentation (how to acquire the data)
* Storage (how to store the data)
* Processing (how to analyze the data or make the best use of it)
* Management (especially how to cope with the data as it grows)
* And many more, e.g. visualization

Data-intensive computing is about dealing efficiently with the big data workflow.


<img src="./images/ornl.png" width="1000" height="800" />

### MapReduce and Hadoop Revisited 
Data-intensive computing uses data parallelism to process large volumes of data, typically referred to as big data.
Data-intensive applications spend most of their processing time on the movement and manipulation of data.

Typical processing steps:
* Partitioning or subdividing the data into multiple segments 
* Independently processing segments using the same executable application program in parallel 
* Reassembling the results to produce the completed output data
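
The three steps above can be sketched on a single machine. This is only an illustration: the helper names `partition` and `process_segment` are made up for this example, and a thread pool stands in for the separate servers a real data-intensive system would use.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, n_segments):
    """Split data into at most n_segments contiguous chunks (hypothetical helper)."""
    size = -(-len(data) // n_segments)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def process_segment(segment):
    """The same program runs on every segment; here it sums the squares."""
    return sum(x * x for x in segment)

data = list(range(1, 11))

# Step 1: partition the data into segments
segments = partition(data, 3)

# Step 2: process the segments independently, in parallel
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_results = list(pool.map(process_segment, segments))

# Step 3: reassemble the partial results into the final output
total = sum(partial_results)

print(segments)         # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]
print(partial_results)  # [30, 174, 181]
print(total)            # 385
```

Because each segment is processed without reference to the others, the same code would give the same answer with any number of workers.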

#### MapReduce
MapReduce, a programming model introduced by Google, is designed for data-intensive computing. The user supplies two functions:

* **Map** processes a key-value pair associated with the input data and generates a set of intermediate key-value pairs.
* **Reduce** merges all intermediate values associated with the same intermediate key.
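
As a rough illustration of the two phases, here is a word count written in plain Python. It is a single-process sketch, not how Hadoop implements MapReduce: the function names `map_phase` and `reduce_phase` and the sample documents are invented for this example, and the grouping step in the middle is done by hand with a dictionary.

```python
from collections import defaultdict

documents = ["spark and hadoop", "hadoop and hdfs", "spark"]

def map_phase(doc):
    """Map: emit an intermediate (word, 1) pair for every word in a document."""
    return [(word, 1) for word in doc.split()]

def reduce_phase(values):
    """Reduce: merge all values that share the same intermediate key."""
    return sum(values)

# Run the map function over every input record
intermediate = [pair for doc in documents for pair in map_phase(doc)]

# Group the intermediate values by key (done by the framework in a real system)
groups = defaultdict(list)
for key, value in intermediate:
    groups[key].append(value)

# Run the reduce function once per key
word_counts = {key: reduce_phase(values) for key, values in groups.items()}
print(word_counts)  # {'spark': 2, 'and': 2, 'hadoop': 2, 'hdfs': 1}
```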

#### Hadoop
* Apache Hadoop is an open source software project that implements the MapReduce architecture.
* It also provides HDFS, the Hadoop Distributed File System.

A distributed storage system stores large datasets across a large number of data servers, rather than storing all the datasets on a single server.

When the amount of data increases, you can add as many servers as you want in the distributed storage system. This makes a distributed storage system scalable and cost efficient, because you are using additional hardware (servers) only when there is a demand.

### Hadoop vs Spark

Both Spark and Hadoop are open source big data infrastructure frameworks for working with large data sets, and both realize the *MapReduce* concept.

Spark is reported to run faster than Hadoop in many circumstances; however, it doesn't have its own distributed storage system.

* Hadoop uses MapReduce to process and analyze the data stored in HDFS.
* Hadoop's MapReduce writes all the data back to disk after each operation (it does not rely on volatile RAM).

In contrast,
* Spark copies most of the data from disk into RAM (i.e. **_in-memory_**).
* In-memory operations reduce the time spent reading from and writing to disk, which makes Spark faster.
* Spark uses an abstraction called Resilient Distributed Datasets (**_RDD_**) to recover data when there is a failure.


#### Hadoop and Spark are not competitors
It's not fair to compare Hadoop and Spark, and say "Spark is better," or "Hadoop is better."

* Hadoop was designed to handle data that does not fit in memory.
* Spark was designed primarily for data that fits in memory.


<img src="./images/hadoop_vs_spark.png" width="800" height="300" />

Again, Spark does not have its own system for organizing files in a distributed way (a file system). For this reason, in many cases, Spark is installed on top of Hadoop.

<img src="./images/inmemory.png" width="800" height="400" />



### Map, Reduce, Lambda and Python

Python offers the **_lambda operator_** or **_lambda function_** to create small anonymous functions, i.e. functions without a name.

These anonymous functions are:
* created when they are needed, but thrown away upon completion.
* mostly used in combination with: *filter(), map(), and reduce()*
* added to Python due to demand from the *Lisp* community


####  *lambda arg1,arg2,...: expressions*

Recall: **.** and **()**

In [5]:
def mysum(a,b):
    return a + b
    
def indirect_sum(f, a, b):
    return f(a,b)
    
a = 5; b = 10

#print(indirect_sum(mysum, a, b))
print(indirect_sum(lambda x,y: x + y, a, b))


15


#### _map_ function:
* **_map(func, sequence)_**

The first argument *func* is a function and the second argument *sequence* is a sequence (e.g. a list). *map()* applies *func* to every element of *sequence*. In Python 2 it returns a new list with the elements changed by *func*; in Python 3 it returns an iterator, which can be passed to *list()* to obtain a list.

In [10]:
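items = [1, 2, 3, 4, 5]

# Loop version, kept for comparison:
# squared = []
# for x in items:
#     squared.append(x ** 2)

def sqr(x):
    return x ** 2

# squared = list(map(sqr, items))
# or using a lambda function:

# In Python 3, map() returns an iterator, so wrap it in list() before printing
squared = list(map(lambda x: x ** 2, items))
print(squared)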


[1, 4, 9, 16, 25]


#### _filter_ function:
* **_filter(func, sequence)_**

As the name suggests, **_filter_** extracts each element in *sequence* for which *func* returns **True**.
**_func_** should therefore be a boolean function that returns either **True** or **False**.


In [6]:
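import numpy as np

a = np.random.randint(0, 10000, 10)
print(a)
# In Python 3, filter() returns an iterator, so wrap it in list();
# tolist() converts the NumPy integers to plain Python ints
b = list(filter(lambda x: x % 2 == 0, a.tolist()))
print(b)
#print(len(b))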


[ 482 7156 7458 5415  458  269 8558 2585 4947 5642]
[482, 7156, 7458, 458, 8558, 5642]


#### _reduce_ function:
* **_reduce(func, sequence)_**

The reduce function **shrinks** (reduces) *sequence* to a single value by repeatedly applying *func* to its elements. In Python 3, *reduce()* is no longer a built-in and must be imported from the *functools* module.

Consider 
```python
seq = [ s1, s2, s3, ... , sn ]
```
Calling reduce(func, seq) works as follows:
1. The first two elements of seq will be passed to func, i.e. func(s1,s2)
2. Now *seq = [ func(s1, s2), s3, ... , sn ]*
3. *func* will be applied on the previous result and the third element of the list, i.e. **_func(func(s1, s2),s3)_**
4. *seq = [ func(func(s1, s2),s3), ... , sn ]*

The steps are repeated until one element is left in *seq*, which is returned as the result of *reduce()*
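
The steps above can be traced with a small hand-written version of reduce. `my_reduce` and `traced_add` are hypothetical names used only for this sketch; it does not handle an empty sequence or an initial value, which the real `functools.reduce` does.

```python
def my_reduce(func, seq):
    """Illustrative re-implementation of reduce() for a non-empty sequence."""
    items = iter(seq)
    result = next(items)             # start with s1
    for item in items:
        result = func(result, item)  # func(previous result, next element)
    return result

steps = []

def traced_add(x, y):
    """Add two numbers and record the arguments of each call."""
    steps.append((x, y))
    return x + y

result = my_reduce(traced_add, [47, 11, 42, 13])
print(steps)   # [(47, 11), (58, 42), (100, 13)]
print(result)  # 113
```

Each recorded pair shows the previous result on the left and the next element of the sequence on the right.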

Consider the following example

In [26]:
from functools import reduce  # reduce() is not a built-in in Python 3

a = [47, 11, 42, 13]
print(reduce(lambda x, y: x + y, a))

113


<img src="./images/reduce.png" width="500" height="500" />
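
The three functions can also be chained. The following minimal sketch, using a made-up list of numbers, keeps the even values, squares them, and adds up the squares. It assumes Python 3, where `filter()` and `map()` return iterators that can be passed directly to the next step.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5, 6]

evens = filter(lambda x: x % 2 == 0, numbers)  # keeps 2, 4, 6
squares = map(lambda x: x ** 2, evens)         # yields 4, 16, 36
total = reduce(lambda x, y: x + y, squares)    # 4 + 16 + 36

print(total)  # 56
```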