<a href="https://colab.research.google.com/github/jalorenzo/SparkNotebookColab/blob/master/BDF_01_Introduction_to_Apache_Spark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 01 - Introduction to Apache Spark


## A fast cluster computing platform

-   Extends the MapReduce model to efficiently support other types of computation

    -   Interactive queries

    -   Streaming processing

-   Supports in-memory computations

-   Surpasses MapReduce for complex operations (10-20x faster)

### General-purpose

-   Batch, interactive or streaming processing modes

-   Reduces the number of tools to use and maintain


### History

-   Started in 2009 in the UC Berkeley RAD Lab (later the AMPLab)

    -   Motivated by MapReduce's lack of efficiency for iterative and interactive jobs

-   Main contributors: [Databricks](https://databricks.com/), Yahoo! and Intel

-   Released as an open source project in March 2010

-   Transferred to the Apache Software Foundation in June 2013, Top Level Project in February 2014

-   One of the most active Big Data projects

-   Version 1.0 launched in May 2014




## Spark features

- It supports a variety of workloads: batch, interactive queries, streaming, machine learning, graph processing...
- APIs in Scala, Java, Python, SQL and R
- Interactive shells in Scala and Python
- Integrates smoothly with other Big Data solutions like HDFS, Cassandra, etc.

## The Spark stack
<hr />

![sparkstack](https://www.oreilly.com/library/view/learning-spark/9781449359034/assets/lnsp_0101.png)

Source: H. Karau, A. Konwinski, P. Wendell, M. Zaharia, "Learning Spark", O'Reilly, 2015


## Spark Core APIs

Spark provides two APIs:

 - High-level or structured API
 - Low-level API

Each API provides different data types and has different strengths:

 - The structured API is preferable for its better performance
 - The low-level API gives finer control over how data is distributed
 - The high-level API is built on top of the low-level primitives

## Structured API data types

### DataSets
Distributed collection of same-type objects

- Introduced in Spark 1.6
- The DataSets API is only available in Scala and Java
- Not available in Python or R because these languages are dynamically typed

### DataFrames
A DataFrame is a DataSet organised in named columns

- Conceptually similar to a table in a relational database or a data frame in pandas or R
- API available in Scala, Java, Python and R
- In [Java](http://spark.apache.org/docs/latest/api/java/index.html "Interface Row") and [Scala](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Row "trait Row extends Serializable"), a DataFrame is a DataSet of objects of type *Row*


## Low-level API data types
### RDDs (Resilient Distributed Datasets)

An RDD is a distributed list of objects

- It is the basic data type in Spark v1.X
- We will be working with Spark v3.X


## Better performance of the structured API

- With DataFrames and DataSets, Spark exploits the structure of the data to improve performance, using the [Catalyst](https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html "Deep Dive into Spark SQL’s Catalyst Optimizer") query optimiser and the [Tungsten](https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html "Project Tungsten: Bringing Apache Spark Closer to Bare Metal") execution engine.

<img src="https://databricks.com/wp-content/uploads/2015/02/Screen-Shot-2015-02-16-at-9.46.39-AM.png" alt="Performance improvement" style="width: 650px;"/>

Source: [Recent performance improvements in Apache Spark: SQL, Python, DataFrames, and More](https://databricks.com/blog/2015/04/24/recent-performance-improvements-in-apache-spark-sql-python-dataframes-and-more.html "Recent performance improvements in Apache Spark: SQL, Python, DataFrames, and More")


## Key concepts
<hr />

![sparkcontext](http://spark.apache.org/docs/latest/img/cluster-overview.png)

(Source: [Spark Cluster Mode Overview](https://spark.apache.org/docs/2.4.0/cluster-overview.html).)


### Driver

-   Creates a `SparkContext`

-   Turns the user program into a set of tasks:

    -   Logical `DAG` of operations -> physical execution plan

-   Schedules the tasks to run on the executors

### SparkSession and SparkContext

-   **SparkSession:** entry point to all Spark functionality

    -   Defines the configuration of the Spark application
    -   In the Spark shell it is automatically available as the `spark` variable

-   The **SparkContext** handles the connection to the cluster

    -   Allows building RDDs from files, lists or other objects
    -   Entry point for the low-level API
    -   In the Spark shell it is automatically defined (`sc` variable); in this Colaboratory notebook it is created explicitly

-   Both can also be created in a Python script (see [below](https://colab.research.google.com/drive/1JtPhnvpU1sZnLr2v54EQ-_d1TyeXnTnF#scrollTo=lDeOybgfVT6-))



### Executors

-   Execute each individual task and return the results to the Driver

-   Provide in-memory storage for the tasks' data


### Cluster Manager

-   *Pluggable* component of Spark

-   YARN, Mesos or Spark Standalone

## Documentation

The official documentation for Apache Spark can be found at https://spark.apache.org/docs/latest/

The API documentation for each language is at:

  - Python: https://spark.apache.org/docs/latest/api/python/
  - Scala: https://spark.apache.org/docs/latest/api/scala/
  - Java: https://spark.apache.org/docs/latest/api/java/

### Installing Java, Spark, and Findspark


---


This code installs Apache Spark 3.5.0, Java 8, and [Findspark](https://github.com/minrk/findspark), a library that makes it easy for Python to find Spark.

In [None]:
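import os

# Spark release to install; the shell commands below read it from the environment
os.environ["SPARK_VERSION"] = "spark-3.5.0"
# Install Java 8
!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Download and unpack the Spark distribution, then delete the archive
!wget  http://apache.osuosl.org/spark/$SPARK_VERSION/$SPARK_VERSION-bin-hadoop3.tgz
!echo $SPARK_VERSION-bin-hadoop3.tgz
!tar xf $SPARK_VERSION-bin-hadoop3.tgz
!rm $SPARK_VERSION-bin-hadoop3.tgz
# Install Findspark
!pip install -q findspark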

### Set Environment Variables
Set the locations where Spark and Java are installed.

In [None]:
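import os

# Locations of the Java and Spark installations, and of the course data
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark/"
os.environ["DRIVE_DATA"] = "/content/gdrive/My Drive/Enseignement/2023-2024/ING3/HPDA/BigDataFrameworks/data/"

# Point /content/spark at the unpacked distribution (-f: no error if the link does not exist yet)
!rm -f /content/spark
!ln -s /content/$SPARK_VERSION-bin-hadoop3 /content/spark
# `!export` only affects a throwaway subshell, so extend PATH through os.environ instead
spark_home = os.environ["SPARK_HOME"]
os.environ["PATH"] += os.pathsep + os.path.join(spark_home, "bin") + os.pathsep + os.path.join(spark_home, "sbin")
!echo $SPARK_HOME
!env |grep  "DRIVE_DATA"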
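
Before starting Spark it can help to confirm that the variables point to real directories. The helper below is a hypothetical convenience, not part of Spark or Findspark; it uses only the Python standard library.

```python
import os

def missing_spark_dirs(env=os.environ, names=("JAVA_HOME", "SPARK_HOME")):
    """Return the variables that are unset or do not point to a directory."""
    return [name for name in names if not os.path.isdir(env.get(name, ""))]

problems = missing_spark_dirs()
if problems:
    print("Check these variables before starting Spark:", ", ".join(problems))
else:
    print("JAVA_HOME and SPARK_HOME both point to existing directories")
```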

### Start a SparkSession
This starts Spark locally and obtains a `SparkContext`.

In [None]:
!python -V

import findspark
findspark.init()

from pyspark import SparkContext
sc = SparkContext.getOrCreate()

# Example: shows the PySpark version
print("PySpark version {0}".format(sc.version))

# Example: parallelise a list and show its first two elements
sc.parallelize([2, 3, 4, 5, 6]).cache().take(2)

In [None]:
# Another, more detailed way to create a SparkSession
from pyspark.sql import SparkSession

# Create a SparkSession, or retrieve the existing one.
# If Spark is already running (as after the previous cell), getOrCreate()
# reuses it, so settings such as the master below do not take effect.
spark = SparkSession \
    .builder \
    .appName("My application") \
    .config("spark.some.config.option", "some-value") \
    .master("local[4]") \
    .getOrCreate()
# Get the SparkContext from the session
sc = spark.sparkContext

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
!ls "$DRIVE_DATA"