<a href="https://colab.research.google.com/github/jalorenzo/SparkNotebookColab/blob/master/BDF_02_Basic_operations_on_Spark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 00 - Configuration of Apache Spark on Colaboratory

---




### Installing Java, Spark, and Findspark


---


This code installs Apache Spark 3.5.0, Java 8, and [Findspark](https://github.com/minrk/findspark), a library that makes it easy for Python to find Spark.

In [None]:
import os

os.environ["SPARK_VERSION"] = "spark-3.5.0"
!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget  http://apache.osuosl.org/spark/$SPARK_VERSION/$SPARK_VERSION-bin-hadoop3.tgz
!tar xf $SPARK_VERSION-bin-hadoop3.tgz
!echo $SPARK_VERSION-bin-hadoop3.tgz
!rm $SPARK_VERSION-bin-hadoop3.tgz
!pip install -q findspark

### Set Environment Variables
Set the locations where Spark and Java are installed.

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark/"
os.environ["DRIVE_DATA"] = "/content/gdrive/My Drive/Enseignement/2023-2024/ING3/HPDA/BigDataFrameworks/data/"

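!rm -f /content/spark
!ln -s /content/$SPARK_VERSION-bin-hadoop3 /content/spark
# "!export" only affects a temporary subshell, so PATH is extended through os.environ instead
spark_home = os.environ["SPARK_HOME"]
os.environ["PATH"] += os.pathsep + os.path.join(spark_home, "bin") + os.pathsep + os.path.join(spark_home, "sbin")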
!echo $SPARK_HOME
!env |grep  "DRIVE_DATA"

### Start a SparkSession
This will start a local Spark session.

In [None]:
!python -V

import findspark
findspark.init()

from pyspark import SparkContext
sc = SparkContext.getOrCreate()

# Example: shows the PySpark version
print("PySpark version {0}".format(sc.version))

# Example: parallelise an array and show the 2 first elements
sc.parallelize([2, 3, 4, 5, 6]).cache().take(2)

In [None]:
from pyspark.sql import SparkSession
# We create a SparkSession object (or we retrieve it if it is already created)
spark = SparkSession \
.builder \
.appName("My application") \
.config("spark.some.config.option", "some-value") \
.master("local[4]") \
.getOrCreate()
# We get the SparkContext
sc = spark.sparkContext

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
!ls "$DRIVE_DATA"



---


# 02 - Basic operations on Spark

- Spark operates with immutable and distributed collections of elements, managing them in parallel
    - Structured API: DataFrames and DataSets
    - Low-level API: RDDs

-   Operations on these collections
    -   Creation
    -   Transformations (sorting, filtering, etc.)
    -   Actions to obtain results

-   Spark automatically distributes data and parallelises operations



## Example: creation of a DataFrame from a CSV file

**Note:** To learn how to upload a file into Colaboratory from your machine or from Google Drive, check [this link](https://colab.research.google.com/notebooks/io.ipynb).


### Option 1: Uploading the *2015-summary.csv* CSV file from your computer




In [None]:
from google.colab import files
import pandas as pd
import io

uploaded = files.upload()


In [None]:
for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))
  df = pd.read_csv(io.StringIO(uploaded[fn].decode('utf-8')))
  print(df.head())

In [None]:
!ls -lh 2015-summary.csv
!head 2015-summary.csv

### Option 2: Uploading the CSV file from Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
!head "$DRIVE_DATA/2015-summary.csv"

### Creating the DataFrame

In this example, Spark infers the data schema automatically (`inferSchema` option).

  - It is better to specify the schema explicitly, as we will see later.

The `header` option tells Spark to treat the first line of the file as the header.
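To give an intuition of what schema inference involves, here is a minimal sketch in plain Python. It is not Spark's actual algorithm: it simply tries the narrowest type first for each column. The sample data, the column names and the helper functions `infer_type` and `infer_schema` are hypothetical and only serve the illustration.

```python
import csv
import io

# Hypothetical sample in the shape of a flight-summary CSV; the column names
# and values are illustrative assumptions.
SAMPLE_CSV = """DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count
United States,Romania,15
United States,Croatia,1
Egypt,United States,15
"""

def infer_type(values):
    """Return the narrowest type name that fits every value in a column."""
    for type_name, cast in (("int", int), ("double", float)):
        try:
            for value in values:
                cast(value)
            return type_name
        except ValueError:
            continue
    return "string"

def infer_schema(text):
    """Use the first line as the header and infer one type per column."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    columns = list(zip(*data))  # transpose the rows into columns
    return {name: infer_type(column) for name, column in zip(header, columns)}

schema = infer_schema(SAMPLE_CSV)
print(schema)
```

Note that the sketch has to read every value before it can decide on a type. Inference in Spark likewise needs an extra pass over the data, which is one reason an explicit schema is preferable.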

In [None]:
from pyspark.sql import SparkSession
# We create a SparkSession object (or we retrieve it if it is already created)
spark = SparkSession \
.builder \
.appName("My application") \
.config("spark.some.config.option", "some-value") \
.master("local[4]") \
.getOrCreate()

flightData2015 = (spark
    .read
    .option("inferSchema", "true")
    .option("header", "true")
    .csv(os.environ["DRIVE_DATA"] +"/2015-summary.csv"))

flightData2015.printSchema()

flightData2015.show(5)
print(flightData2015.count())

## Rows

Rows in a DataFrame are objects of type `Row`.

- Row API in Python: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.Row.html#pyspark.sql.Row

### Row manipulation

In [None]:
# Get the two first rows of the DataFrame
row1_2 = flightData2015.take(2)
print(row1_2)

In [None]:
# Get the first row as a Python dictionary
print(row1_2[0].asDict())

## Partitions

The elements of a DataFrame (or DataSet or RDD) are split between the nodes of the cluster by dividing the collection into partitions. Each partition is then processed by a given executor.

-  By default, the number of partitions depends on the cluster size (total number of cores across all executors) and on the data size (number of blocks of the files in HDFS).
-  In the case of an RDD, a different number of partitions can be specified at creation time.
-  The number of partitions can be changed after the collection has been created.

![Partitioning](https://docs.google.com/drawings/d/1GAasfY7P7uaMXhvGHuZ1nOqPqv6TrE7-N96RqUn1NqE/pub?w=960&h=540)
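The idea of partitioning can be sketched in plain Python, without Spark. In this toy example, worker threads stand in for executors and the main program stands in for the driver. The helpers `split_into_partitions` and `process_partition` are hypothetical names. The round-robin split is an assumption chosen for simplicity, not a description of how Spark assigns data.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(items, num_partitions):
    """Distribute the items round-robin over num_partitions lists."""
    partitions = [[] for _ in range(num_partitions)]
    for index, item in enumerate(items):
        partitions[index % num_partitions].append(item)
    return partitions

def process_partition(partition):
    """Work done independently on one partition (here: a partial sum)."""
    return sum(partition)

data = list(range(1, 11))  # the numbers 1..10
partitions = split_into_partitions(data, 4)

# Each worker thread stands in for an executor handling one partition.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(process_partition, partitions))

total = sum(partial_sums)  # the "driver" combines the partial results
print(partitions)
print(partial_sums, total)
```

Each partition is processed without looking at the others, which is what allows the work to run in parallel.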



In [None]:
print("Number of partitions: {0}"
    .format(flightData2015.rdd.getNumPartitions()))

# Create a new DataFrame with 4 partitions
flightData2015_4P = flightData2015.repartition(4)
print("Number of partitions: {0}"
    .format(flightData2015_4P.rdd.getNumPartitions()))

## Transformations vs Actions

### Transformations

Operations that transform data

  - The original data are not modified (*immutability*)
  - Transformations are computed lazily (*laziness*): nothing is actually computed until an action is executed.
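Both properties can be imitated with a small plain-Python class. This is only a toy model under simplifying assumptions, not how Spark is implemented: `LazyCollection` is a hypothetical name, and its `map`, `filter` and `collect` methods merely mimic the behaviour described above.

```python
class LazyCollection:
    """Toy collection that records transformations and runs them on demand."""

    def __init__(self, items, pending=()):
        self._items = tuple(items)      # the source data is never modified
        self._pending = tuple(pending)  # transformations not yet executed

    def map(self, fn):
        # Transformation: return a NEW object with one more recorded step.
        return LazyCollection(self._items, self._pending + (("map", fn),))

    def filter(self, predicate):
        # Transformation: again, only the step is recorded.
        return LazyCollection(self._items, self._pending + (("filter", predicate),))

    def collect(self):
        # Action: only now are the recorded steps applied, in order.
        result = list(self._items)
        for kind, fn in self._pending:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result

calls = []

def double(x):
    calls.append(x)  # side effect used to observe when the work happens
    return 2 * x

numbers = LazyCollection([1, 2, 3, 4])
doubled = numbers.map(double).filter(lambda x: x > 4)

calls_before_action = len(calls)  # still 0: nothing has run yet
result = doubled.collect()        # the action triggers the computation
print(calls_before_action, result, numbers.collect())
```

The original collection still yields its initial values after the transformations, and `double` is only called once `collect()` is invoked.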

Two types:

  - *Narrow* Transformations
    - Each input partition contributes to a single output partition
    - The number of partitions is not modified
    - Typically performed in memory
  - *Wide* Transformations
    - Each output partition depends on several (or all) input partitions
    - They imply data shuffling
    - The number of partitions can be modified
    - They may imply disk writes
    
Examples:
* map
* filter
* replace
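The difference between narrow and wide transformations can be sketched in plain Python by modelling a dataset as a list of partitions. The functions `narrow_map`, `wide_group_by_key` and `toy_partitioner` are hypothetical, and the partitioner is a deliberately simple assumption chosen so that the result is deterministic. Spark's own shuffle is far more elaborate.

```python
def narrow_map(partitions, fn):
    """Narrow: each output partition is built from exactly one input partition."""
    return [[fn(record) for record in part] for part in partitions]

def toy_partitioner(key, num_partitions):
    """Hypothetical deterministic partitioner: character codes modulo n."""
    return sum(map(ord, key)) % num_partitions

def wide_group_by_key(partitions, num_output_partitions):
    """Wide: records move between partitions (a shuffle) according to their key."""
    shuffled = [{} for _ in range(num_output_partitions)]
    for part in partitions:
        for key, value in part:
            target = toy_partitioner(key, num_output_partitions)
            shuffled[target].setdefault(key, []).append(value)
    return shuffled

# Three input partitions of (key, value) records.
partitions = [[("ES", 1), ("FR", 2)], [("ES", 3), ("IT", 4)], [("FR", 5)]]

# Narrow step: the partition layout is unchanged, no data moves.
scaled = narrow_map(partitions, lambda kv: (kv[0], kv[1] * 10))

# Wide step: values of the same key are gathered from several input partitions,
# and the number of partitions changes from 3 to 2.
grouped = wide_group_by_key(scaled, 2)

print(scaled)
print(grouped)
```

In the narrow step every partition can be processed in isolation. In the wide step, the values for `"ES"` and `"FR"` come from different input partitions, so data has to be exchanged before the result can be built.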

In [None]:
# Narrow transformation example
flightData2015_EEUU = flightData2015.replace("United States", "Estados Unidos")
flightData2015_EEUU.show(5)

In [None]:
# Wide transformation example
flightData2015_Ord = flightData2015_EEUU.sort("count", ascending=False)
flightData2015_Ord.cache()
flightData2015_Ord.show(5)  # show() is an action, so it forces the pending transformations to run

### Actions

They return a result to the driver program, which forces Spark to execute the pending transformations.

  - When an action is triggered, a *plan* is created with the transformations needed to obtain the requested data
    - A Directed Acyclic Graph (DAG) is created to connect the transformations to apply
    - Spark will optimise this graph, removing unnecessary transformations and combining them when possible
  - Actions translate the DAG into an execution plan

Types of actions:

  - Actions to show data in the console
  - Actions to collect Spark data into native objects of the host language (e.g. Python lists)
  - Actions to write data to disk
  
Examples:
* reduce
* collect
* take
* show


In [None]:
# Action example
print("Number of rows in the table: {0}".format(flightData2015_Ord.count()))

print(flightData2015_Ord.take(3))

flightData2015_Ord.show()

### DAG example
Each job is represented by a directed acyclic graph (DAG):

![DAG](http://2.bp.blogspot.com/-5sDP78mSdlw/Ur3szYz1HpI/AAAAAAAABCo/Aak2Xn7TmnI/s1600/p2.png)
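As a rough illustration of what such a graph encodes, the sketch below describes a small set of steps and their dependencies, then derives a valid execution order with the standard-library `graphlib` module (Python 3.9+). The step names are hypothetical labels loosely based on this notebook. They are not what Spark shows in its UI, and Spark's scheduler does considerably more than a topological sort.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key is a step; its value is the set of steps it depends on.
# The step names are hypothetical labels used only for this illustration.
dag = {
    "read_csv": set(),
    "replace": {"read_csv"},
    "sort": {"replace"},
    "show": {"sort"},
    "count": {"sort"},
}

# A topological order guarantees that every step runs after its dependencies.
execution_order = list(TopologicalSorter(dag).static_order())
print(execution_order)
```

Because the graph has no cycles, such an order always exists. `show` and `count` both depend only on `sort`, so either may come first.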