# Apache Spark Introduction
This class introduces the Spark framework and its components.

## Summary
- <a href='#1'>1. Context and Motivation</a>
- <a href='#2'>2. Apache Spark</a>
    - <a href='#2.1'>2.1.  Spark Components</a>
    - <a href='#2.2'>2.2.  Spark Applications</a>
    - <a href='#2.3'>2.3.  Spark Session</a>
    - <a href='#2.4'>2.4.  DataFrames</a>
    - <a href='#2.5'>2.5.  Partitions</a>
    - <a href='#2.6'>2.6.  Transformations</a>
    - <a href='#2.7'>2.7.  Lazy Evaluation</a>
    - <a href='#2.8'>2.8.  Actions</a>
    - <a href='#2.9'>2.9.  Spark UI</a>
    - <a href='#2.10'>2.10.  SQL and DataFrames</a>
- <a href='#3'>3.  Exercises</a>
    

# <a id='1'>1. Context and Motivation</a>

**Why do we need Spark?**

For many years computers became faster every year through increases in processor speed, so applications could process more and more data without any change to their code. Most of these applications, however, were designed to run on a single processor.
This trend stopped because of hard physical limits. Hardware developers switched to adding more parallel CPU cores that all run at the same time. As a result, applications had to be modified to add parallelism in order to run faster, which set the stage for new programming models such as **Apache Spark**.

**Apache Spark** is an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

 # <a id='2'>2. Apache Spark</a>

 ## <a id='2.1'>2.1. Spark Components</a>
 Spark includes multiple components, described below.

<img src="spark_stack.png" width="450px" />

### Spark Core
Spark Core contains the basic Spark functionality required for running jobs and needed by the other components. The most important of these is the resilient distributed dataset (RDD), which is the main element of the Spark API. It’s an abstraction of a distributed collection of items with operations and transformations applicable to the dataset. It’s resilient because it’s capable of rebuilding datasets in case of node failures.
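
The idea of rebuilding lost data can be illustrated with a small plain-Python analogy. This is only a sketch of the concept, not Spark's implementation: the `ToyRDD` class and its methods are hypothetical names invented for this example. Each dataset remembers its parent and the function used to derive it, so a lost partition can be recomputed instead of restored from a copy.

```python
# Plain-Python analogy of lineage-based recovery; ToyRDD is a hypothetical class.
class ToyRDD:
    """A dataset split into partitions that remembers how it was derived."""

    def __init__(self, source_partitions=None, parent=None, func=None):
        self.source_partitions = source_partitions  # only set for the root dataset
        self.parent = parent                        # dataset this one was derived from
        self.func = func                            # per-element function applied to the parent
        self.cache = {}                             # partition index -> computed rows

    def map(self, func):
        # Record the dependency instead of copying any data.
        return ToyRDD(parent=self, func=func)

    def compute(self, index):
        """Return one partition, rebuilding it from the lineage if it is missing."""
        if index not in self.cache:
            if self.parent is None:
                rows = list(self.source_partitions[index])
            else:
                rows = [self.func(x) for x in self.parent.compute(index)]
            self.cache[index] = rows
        return self.cache[index]


base = ToyRDD(source_partitions=[[1, 2], [3, 4]])
squared = base.map(lambda x: x * x)

before = squared.compute(1)   # computed and cached
squared.cache.pop(1)          # simulate losing the node that held partition 1
after = squared.compute(1)    # rebuilt from the parent partition and the recorded function
print(before, after)          # [9, 16] [9, 16]
```

Only the lost partition is recomputed; partition 0 is never touched in this run.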

### Spark SQL
Spark SQL provides functions for manipulating large sets of distributed, structured data using an SQL subset supported by Spark and Hive SQL (HiveQL). Spark SQL can also be used for reading and writing data to and from various structured formats and data sources, such as JavaScript Object Notation (JSON) files, Parquet files (an increasingly popular file format that allows for storing a schema along with the data), relational databases, Hive, and others.
Operations on DataFrames and Datasets at some point translate to operations on RDDs and execute as ordinary Spark jobs. Spark SQL provides a query optimization framework called Catalyst that can be extended by custom optimization rules.

### Spark Streaming
Spark Streaming is a framework for ingesting real-time streaming data from various sources. The supported streaming sources include HDFS, Kafka, Flume, Twitter, ZeroMQ, and custom ones. Spark Streaming operations recover from failure automatically, which is important for online data processing. Spark Streaming represents streaming data using discretized streams (DStreams), which periodically create RDDs containing the data that came in during the last time window. Spark Streaming can be combined with other Spark components in a single program, unifying real-time processing with machine learning, SQL, and graph operations. This is something unique in the Hadoop ecosystem. And since Spark 2.0, the new Structured Streaming API makes Spark streaming programs more similar to Spark batch programs.
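
The "discretized" part of DStreams can be sketched in plain Python. This is a conceptual analogy under simple assumptions, not the Spark Streaming API: `discretize` is a hypothetical helper that groups timestamped events into fixed-length windows, the way each window becomes one small batch.

```python
from collections import defaultdict


def discretize(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        window_index = int(timestamp // batch_seconds)
        batches[window_index].append(value)
    # Return the batches in time order, one list per non-empty window.
    return [batches[i] for i in sorted(batches)]


# Timestamps are seconds since the stream started (made-up sample data).
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
micro_batches = discretize(events, batch_seconds=1)
print(micro_batches)  # [['a', 'b'], ['c'], ['d', 'e']]
```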

### Spark MLlib
Spark MLlib is a library of machine-learning algorithms grown from the MLbase project at UC Berkeley. Supported algorithms include logistic regression, naïve Bayes classification, support vector machines (SVMs), decision trees, random forests, linear regression, and k-means clustering.

Apache Mahout is an existing open source project offering implementations of distributed machine-learning algorithms running on Hadoop. Although Apache Mahout is more mature, both Spark MLlib and Mahout include a similar set of machine-learning algorithms. But with Mahout migrating from MapReduce to Spark, they’re bound to be merged in the future.

Spark MLlib handles machine-learning models used for transforming datasets, which are represented as RDDs or DataFrames.

### Spark GraphX
Graphs are data structures comprising vertices and the edges connecting them. GraphX provides functions for building graphs, represented as graph RDDs: EdgeRDD and VertexRDD. GraphX contains implementations of the most important algorithms of graph theory, such as PageRank, connected components, shortest paths, SVD++, and others. It also provides the Pregel message-passing API, the same API for large-scale graph processing implemented by Apache Giraph, a project with implementations of graph algorithms that runs on Hadoop.
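
To make the vertices-and-edges model concrete, here is a minimal single-machine sketch of PageRank in plain Python. It follows the textbook iterative formulation and is not GraphX's code or API; the function name, the damping factor of 0.85, and the sample edges are assumptions chosen for illustration.

```python
def page_rank(edges, damping=0.85, iterations=20):
    """Iteratively compute PageRank for a directed graph given as (src, dst) pairs."""
    vertices = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(vertices)
    ranks = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Every vertex starts the round with the "random jump" share.
        new_ranks = {v: (1.0 - damping) / n for v in vertices}
        for src, targets in out_links.items():
            if targets:
                # Each vertex sends an equal share of its rank along its out-edges.
                share = damping * ranks[src] / len(targets)
                for dst in targets:
                    new_ranks[dst] += share
            else:
                # A vertex without out-edges spreads its rank over all vertices.
                share = damping * ranks[src] / n
                for v in vertices:
                    new_ranks[v] += share
        ranks = new_ranks
    return ranks


edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
ranks = page_rank(edges)
for vertex, rank in sorted(ranks.items()):
    print(vertex, round(rank, 3))
```

Each round resembles one step of message passing: every vertex sends a value along its edges and then combines what it received.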

 ## <a id='2.2'>2.2. Spark Applications</a>

  <img src="driver_executor.png" width="300px" />

Spark applications consist of a **driver process** and a set of **executor processes**:

 **Driver process:**
 
   * Runs the main function.
   * Maintains information about the Spark application.
   * Responds to the user program or input.
   * Analyzes, distributes, and schedules work across the executors.
   
**Executor processes:**
 
   * Carry out the work that the driver assigns to them.
   * Execute the code assigned to them by the driver.
   * Report the state of the computation on that executor back to the driver node.
   
### Note
There are two modes: **cluster mode** and **local mode**. In cluster mode, the driver and executors are separate processes that can live on the same machine or on different machines. In local mode, the driver and executors run (as threads) on your own computer.
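
The division of labour between the driver and the executors can be mimicked with threads in plain Python, loosely similar to how local mode runs everything on one computer. This is an analogy only: `driver` and `run_task` are hypothetical names, and real Spark scheduling is far more involved.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(task_id, rows):
    """Work done by an executor: process its rows and report back to the driver."""
    return {"task": task_id, "status": "done", "result": sum(rows)}


def driver(data, num_executors):
    """Split the work, schedule it on executors, and combine the reported results."""
    # One chunk of rows per executor.
    chunks = [data[i::num_executors] for i in range(num_executors)]
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        futures = [pool.submit(run_task, i, chunk) for i, chunk in enumerate(chunks)]
        reports = [future.result() for future in futures]
    return sum(report["result"] for report in reports), reports


total, reports = driver(list(range(100)), num_executors=4)
print(total)  # 4950
```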

## <a id='2.3'>2.3. Spark Session</a>
 
The SparkSession instance is the way Spark executes user-defined manipulations across the cluster.   
In the PySpark shell, and in notebooks started through PySpark, the session object is available as the variable `spark`.

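In [1]:
from pyspark.sql import SparkSession

# Outside the PySpark shell `spark` is not predefined, so create (or reuse) a session.
# The application name is an arbitrary placeholder.
spark = SparkSession.builder.appName("spark-intro").getOrCreate()

In [11]:
print(spark) # Spark Session Object

<pyspark.sql.session.SparkSession object at 0x115b2d710>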

In [13]:
# Create a data Range of numbers:
# This range of numbers represents a distributed collection. Each part of this 
# range of numbers exists on a different executor.
numbers_to_n = spark.range(1000000000).toDF("Number")

 ## <a id='2.4'>2.4. DataFrames</a>
 A DataFrame is the most common structured API and simply represents a table of data with rows and columns.  
 The list that defines the columns and the types within those columns is called the schema.
 Spark DataFrames can span thousands of computers.  
 
 ### Note  
 Spark has multiple core abstractions: Datasets, DataFrames, SQL tables, and Resilient Distributed Datasets (RDDs). These abstractions all represent distributed collections of data.

 

In [14]:

columns = ['id', 'dogs', 'cats']
vals = [
     (1, "bulldog", "persian"),
     (2, "German Shepherd", "Siamese")
]

# create DataFrame
df = spark.createDataFrame(vals, columns)

In [15]:
df.printSchema() # see the schema

root
 |-- id: long (nullable = true)
 |-- dogs: string (nullable = true)
 |-- cats: string (nullable = true)



 ## <a id='2.5'>2.5. Partitions</a>
 
 To allow every executor to perform work in parallel, Spark breaks up the data into chunks called partitions.
 A partition is a collection of rows that sits on one physical machine in your cluster. 
 
 If we have multiple partitions but only one executor, Spark will have a parallelism of only one because there is only one computation resource.  
 
 If we have only one partition, Spark will have a parallelism of only one even if we have thousands of executors.  
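
The two limiting cases above can be expressed as a tiny plain-Python sketch. It assumes, as the text does, that each executor works on one partition at a time; the helper names are hypothetical and nothing here calls Spark.

```python
def split_into_partitions(rows, num_partitions):
    """Split rows into num_partitions contiguous chunks of near-equal size."""
    size, remainder = divmod(len(rows), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < remainder else 0)
        partitions.append(rows[start:end])
        start = end
    return partitions


def effective_parallelism(num_partitions, num_executors):
    """Tasks that can run at the same time: limited by both partitions and executors."""
    return min(num_partitions, num_executors)


partitions = split_into_partitions(list(range(10)), 4)
print(partitions)                      # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
print(effective_parallelism(8, 1))     # 1: many partitions, one executor
print(effective_parallelism(1, 1000))  # 1: one partition, many executors
```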
 
 ### Note
 With DataFrames we don't (for the most part) manipulate partitions individually. We simply specify high-level transformations of the data in the physical partitions, and Spark determines how this work will actually execute on the cluster.
 

 ## <a id='2.6'>2.6. Transformations</a>
 
 In Spark, the core data structures are **immutable** (they cannot be changed after they're created).   
 To "change" a DataFrame, we instruct Spark how to modify it to do what we want. These instructions are called transformations.  

In [None]:
divis_by_two = numbers_to_n.where("number % 2 = 0") # why is no output returned?


### Types of transformations
* Narrow dependencies -> each input partition contributes to only one output partition. These are all performed in memory.
* Wide dependencies -> each input partition contributes to many output partitions. Spark writes the results to disk.
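
The difference between the two kinds of dependencies can be sketched on plain Python lists standing in for partitions. This is an illustration under simplifying assumptions, not Spark code: the function names are hypothetical, and the "shuffle" uses integer keys with a simple modulo to pick the output partition.

```python
def narrow_filter(partitions, predicate):
    """Narrow: each output partition is built from exactly one input partition."""
    return [[row for row in part if predicate(row)] for part in partitions]


def wide_group_by_key(partitions, num_output_partitions):
    """Wide: rows are shuffled so that equal keys land in the same output partition."""
    outputs = [dict() for _ in range(num_output_partitions)]
    for part in partitions:
        for key, value in part:
            target = key % num_output_partitions  # assumes integer keys
            outputs[target].setdefault(key, []).append(value)
    return outputs


partitions = [[1, 2, 3, 4], [5, 6, 7, 8]]

evens = narrow_filter(partitions, lambda n: n % 2 == 0)
print(evens)  # [[2, 4], [6, 8]]

# Key every number by its remainder modulo 3, then group by that key.
keyed = [[(n % 3, n) for n in part] for part in partitions]
grouped = wide_group_by_key(keyed, num_output_partitions=2)
print(grouped)
```

In the grouped result, every output partition contains values that came from both input partitions, which is why a wide dependency requires moving data between executors.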

 ## <a id='2.7'>2.7. Lazy Evaluation</a>
 
 Lazy evaluation means that Spark will wait until the very last moment to execute the graph of computation instructions.
 In Spark we build a plan of transformations that we would like to apply to the data. By waiting until the last moment to execute the code, Spark compiles the plan and optimizes the entire flow end to end.
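
A minimal plain-Python sketch of this behaviour is shown here. It is an analogy, not how Spark is implemented, and `LazyDataset` with its methods is a hypothetical class: transformations only extend a recorded plan, and nothing is evaluated until an action such as `count` or `collect` runs it.

```python
class LazyDataset:
    """Records transformations and runs them only when an action is called."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)  # the plan; never modified in place

    def where(self, predicate):
        # Transformation: return a new dataset with a longer plan.
        return LazyDataset(self.source, self.steps + (("filter", predicate),))

    def select(self, func):
        # Transformation: return a new dataset with a longer plan.
        return LazyDataset(self.source, self.steps + (("map", func),))

    def _run(self):
        rows = iter(self.source)
        for kind, func in self.steps:
            rows = filter(func, rows) if kind == "filter" else map(func, rows)
        return rows

    def count(self):
        # Action: executes the plan.
        return sum(1 for _ in self._run())

    def collect(self):
        # Action: executes the plan and returns the rows as a list.
        return list(self._run())


evaluated = []  # records every row the predicate actually sees


def is_even(n):
    evaluated.append(n)
    return n % 2 == 0


numbers = LazyDataset(range(10))
evens = numbers.where(is_even)
print(len(evaluated))  # 0: defining the transformation did no work
print(evens.count())   # 5: the action runs the plan
print(len(evaluated))  # 10: every row was examined only now
```

Because `where` returns a new object and leaves `numbers` untouched, the sketch also mirrors the immutability of Spark's data structures.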

 ## <a id='2.8'>2.8. Actions</a>
 
 Transformations allow us to build our logical transformation plan. To trigger the computation, we run an action. It's like the play button.
 
 

In [3]:
divis_by_two.count()


In [None]:
# other action code

In [None]:
# other action code

In [None]:
# other action code

### Kinds of actions:
* Actions to view data in the console
* Actions to collect data to native objects in the respective language
* Actions to write to output data sources


 ## <a id='2.9'>2.9. Spark UI</a>
 With the Spark UI you can monitor the progress of a job. The Spark UI is usually available on port 4040 of the driver node.   
 It displays information about the state of Spark jobs, their environment, and the cluster state.
 It is very useful for tuning and debugging.

In [16]:
spark.sparkContext.uiWebUrl # Check where spark ui is running


 ## <a id='2.10'>2.10. SQL and DataFrames</a>
 Spark can run the same transformations in exactly the same way, regardless of the language.  
 Whether the logic is written in SQL or in DataFrame code, Spark compiles it to the same underlying plan.

In [5]:
# load a dataset
# create a dataframe 
flight_data_2015 = spark.read.option("inferSchema","true").option("header","true").csv("2015-summary.csv")

In [None]:
flight_data_2015.take(3) # first three rows of the dataset

In [17]:
flight_data_2015.sort("count").explain() # show the physical plan for sorting by count

In [6]:
# By default Spark uses 200 shuffle output partitions.
# Set the number of shuffle partitions to 5
spark.conf.set("spark.sql.shuffle.partitions","5")

In [18]:
# create a view of flight data 
flight_data_2015.createOrReplaceTempView("flight_data_2015")

In [7]:
# SQL WAY
country_name_sql = spark.sql("""SELECT DEST_COUNTRY_NAME, count(1) FROM flight_data_2015 GROUP BY DEST_COUNTRY_NAME""") 


In [9]:
# DATAFRAME WAY
country_name_dataframe = flight_data_2015.groupBy("DEST_COUNTRY_NAME").count() # dataframe way

In [None]:
country_name_sql.explain() # same logical plan

In [10]:
country_name_dataframe.explain() # same logical plan


 ## <a id='3'>3. Exercises</a>

In [None]:
# What is the maximum count in the flight data?

In [None]:
# Expected result: 370002

In [11]:
# What are the top 5 destinations in the data?