In [None]:
# Getting Started
## Setup Spark Locally - Ubuntu

Let us set up Spark locally on Ubuntu.

* Install the latest version of Anaconda.
* Make sure Jupyter Notebook is set up and validated.
* Set up Spark and validate.
* Set up environment variables to integrate Pyspark with Jupyter Notebook.
* Launch Jupyter Notebook using the `pyspark` command.
* Set up PyCharm (IDE) for application development.
## Setup Spark Locally - Mac

Let us set up Spark locally on Mac.

* Install the latest version of Anaconda.
* Make sure Jupyter Notebook is set up and validated.
* Set up Spark and validate.
* Set up environment variables to integrate Pyspark with Jupyter Notebook.
* Launch Jupyter Notebook using the `pyspark` command.
* Set up PyCharm (IDE) for application development.

## Signing up for ITVersity Labs

Here are the steps for signing up for ITVersity Labs.
* Go to https://labs.itversity.com
* Sign up on our website
* Purchase lab access
* Go to the lab page and create a lab account
* Log in and practice

## Using ITVersity Labs

Let us understand how to submit Spark jobs in ITVersity Labs.

* You can use either the Jupyter based environment or `pyspark` in a terminal to submit jobs in ITVersity Labs.
* You can also submit Spark jobs using the `spark-submit` command.
* As we are using Python, we can also use the `help` command to get the documentation, for example `help(spark.read.csv)`.

## Interacting with File Systems

Let us understand how to interact with the file system using the `%fs` command from a Databricks notebook.

* We can access datasets using the `%fs` magic command in a Databricks notebook.
* By default, we will see files under dbfs.
* We can list the files using the `ls` command, e.g. `%fs ls`.
* Databricks provides a lot of datasets for free under databricks-datasets.
* If the cluster is integrated with AWS S3 or Azure Blob, we can access files by specifying the appropriate protocol (e.g. `s3://` for S3).
* List of commands available under `%fs`:
  * Copying files or directories: `cp`
  * Moving files or directories: `mv`
  * Creating directories: `mkdirs`
  * Deleting files and directories: `rm`
  * We can copy or delete directories recursively using `-r` or `--recursive`
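
The `%fs` magic is a shorthand for the `dbutils.fs` utilities, so the same listing can also be done from Python. Here is a minimal sketch, assuming it runs in a Databricks notebook where `dbutils` is predefined; the helper name `list_names` is hypothetical and only used for illustration.

```python
def list_names(dbutils, path="dbfs:/databricks-datasets"):
    """Return the names of the entries directly under `path`.

    `dbutils` is the object Databricks predefines in a notebook; it is
    passed in explicitly so that the helper has no hidden globals.
    """
    # dbutils.fs.ls is the Python counterpart of `%fs ls`
    return [entry.name for entry in dbutils.fs.ls(path)]
```

In a Databricks notebook we could call it as `list_names(dbutils, "dbfs:/databricks-datasets/airlines")`.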

## Getting File Metadata

Let us review the source location to get the number of files and the size of the data we are going to process.

* Location of the airlines data: `dbfs:/databricks-datasets/airlines`
* We can get the first 1000 files using `%fs ls dbfs:/databricks-datasets/airlines`.
* The location contains 1919 files; however, we will not be able to see all the details using the `%fs` command.
* Databricks File System commands do not have the capability to report file metadata such as size in detail.
* When the Spark cluster is started, it creates 2 objects: `spark` and `sc`.
* `sc` is of type `SparkContext` and `spark` is of type `SparkSession`.
* Spark uses HDFS APIs to interact with the file system, and we can access these APIs using `sc._jsc` and `sc._jvm` to get file metadata.
* Here are the steps to get the file metadata.
  * Get the Hadoop configuration using `sc._jsc.hadoopConfiguration()` - let's say `conf`.
  * Pass `conf` to `sc._jvm.org.apache.hadoop.fs.FileSystem.get` to get a `FileSystem` object - let's say `fs`.
  * Build a `path` object by passing the path as a string to `sc._jvm.org.apache.hadoop.fs.Path`.
  * Invoke `listStatus` on `fs` by passing `path`, which returns an array of `FileStatus` objects - let's say `files`.
  * Each `FileStatus` object has all the metadata of its file.
  * We can use `len` on `files` to get the number of files.
  * We can use `getLen` on each `FileStatus` object to get the size of each file.
  * The cumulative size of all files can be computed using `sum(map(lambda file: file.getLen(), files))`.
  
Let us first get the list of files.

Note: `dbfs` stands for Databricks File System.


%fs ls dbfs:/databricks-datasets/airlines

Here is the consolidated script to get the number of files and the cumulative size of all files in a given folder.

In [None]:
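# Get the Hadoop configuration and the FileSystem object through the JVM gateway
conf = sc._jsc.hadoopConfiguration()
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(conf)
path = sc._jvm.org.apache.hadoop.fs.Path("dbfs:/databricks-datasets/airlines")

# Array of FileStatus objects, one per entry in the folder
files = fs.listStatus(path)
# Number of files
print(len(files))
# Cumulative size of all files in GB
print(sum(map(lambda file: file.getLen(), files))/1024/1024/1024)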
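
The same logic can be wrapped in a small reusable function. This is only a sketch: `summarize` is a hypothetical helper name, and it assumes that each element exposes `getLen()` the way a `FileStatus` object does.

```python
def summarize(statuses):
    """Return (number_of_files, cumulative_size_in_gb).

    `statuses` is any iterable of objects with a getLen() method,
    for example the array returned by fs.listStatus(path).
    """
    # Collect the size of each entry in bytes
    sizes = [status.getLen() for status in statuses]
    # 1 GB is taken as 1024 * 1024 * 1024 bytes, as in the script above
    return len(sizes), sum(sizes) / 1024 ** 3
```

With the objects created above, `summarize(fs.listStatus(path))` would return both values in one call.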

## Platforms to Practice

Let us understand the different platforms we can leverage to practice Apache Spark using Python.

* Local Setup
* Databricks Platform
* Setting up your own cluster
* ITVersity Labs

## Setup Spark Locally - Windows

Let us understand how to set up Spark locally on Windows. Even though it can be set up directly, we recommend using a virtual machine.

* Here are the prerequisites to set up Spark locally on Windows using a virtual machine.
  * Make sure to set up VirtualBox and then Vagrant.


## Setup Spark Locally - Mac

Let us understand how to set up Spark locally on Mac.

* Here are the prerequisites to set up Spark locally on Mac.
  * At least 8 GB RAM is highly desired.
  * Make sure JDK 1.8 is set up.
  * Make sure to have Python 3. If you do not have it, you can install it using **homebrew**.
* Here are the steps to set up Pyspark and validate.
  * Create a Python virtual environment - `python3 -m venv spark-venv`.
  * Activate the virtual environment - `source spark-venv/bin/activate`.
  * Run `pip install pyspark==2.4.6` to install Spark 2.4.6.
  * Run `pyspark` to launch the Spark CLI using Python as the programming language.
* Here are some of the limitations related to running Spark locally.
  * You will be able to run Spark using local mode by default, but you will not be able to get the feel of Big Data.
  * Actual production implementations run on multinode clusters, managed by YARN, Spark Standalone or Mesos.
  * You can understand the development process, but you will not be able to explore best practices to build effective large scale data engineering solutions.

## Setup Spark Locally - Ubuntu

Let us understand how to set up Spark locally on Ubuntu.

* Here are the prerequisites to set up Spark locally on Ubuntu.
  * At least 8 GB RAM is highly desired.
  * Make sure JDK 1.8 is set up.
  * Make sure to have Python 3. If you do not have it, you can install it using **apt** or **snap**.
* Here are the steps to set up Pyspark and validate.
  * Create a Python virtual environment - `python3 -m venv spark-venv`.
  * Activate the virtual environment - `source spark-venv/bin/activate`.
  * Run `pip install pyspark==2.4.6` to install Spark 2.4.6.
  * Run `pyspark` to launch the Spark CLI using Python as the programming language.
* Here are some of the limitations related to running Spark locally.
  * You will be able to run Spark using local mode by default, but you will not be able to get the feel of Big Data.
  * Actual production implementations run on multinode clusters, managed by YARN, Spark Standalone or Mesos.
  * You can understand the development process, but you will not be able to explore best practices to build effective large scale data engineering solutions.
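
Before launching `pyspark` on Mac or Ubuntu, it can help to confirm that the prerequisites listed above appear to be in place. The following is a rough sketch using only the Python standard library; `check_prerequisites` is a hypothetical helper, and it only checks that a `java` executable is on the `PATH`, not that its version is 1.8.

```python
import importlib.util
import shutil
import sys


def check_prerequisites():
    """Report whether the local prerequisites appear to be in place."""
    return {
        # The setup steps assume Python 3
        "python3": sys.version_info.major == 3,
        # A `java` executable on PATH suggests a JDK or JRE is installed;
        # this does not verify the Java version
        "java_on_path": shutil.which("java") is not None,
        # True once `pip install pyspark==2.4.6` has succeeded
        # in the currently active environment
        "pyspark_installed": importlib.util.find_spec("pyspark") is not None,
    }


print(check_prerequisites())
```

Run it inside the activated `spark-venv` environment so that the `pyspark_installed` check looks at the right set of packages.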

## Using ITVersity Labs

Let me demonstrate how to use ITVersity Labs to practice Spark.
* Once you sign up for the lab, you will get access to the cluster via a Jupyter based environment.
* You can connect to the labs using a browser and practice interactively.
* You can either use our material or upload your own material to practice using the Jupyter based environment.
* Here are some of the advantages of using our labs.
  * Interactive or integrated learning experience.
  * Access to a multinode cluster.
  * Pre-configured data sets as well as databases.
  * You will be focused on learning rather than troubleshooting issues.

## Overview of File Systems

Let us get an overview of File Systems you can work with while learning Spark.

* Here are the file systems that can be used to learn Spark.
  * Local file system when you run in local mode.
  * Hadoop Distributed File System.
  * AWS S3
  * Azure Blob
  * GCP Cloud Storage
  * and other supported file systems.
* It is quite straightforward to learn the underlying file system. You just need to focus on the following:
  * Copy files into the file system from different sources.
  * Validate files in the file system.
  * Preview the data using Spark related APIs or direct tools.
  * Delete files from the file system.
* Typically we ingest data into the underlying file system using tools such as Informatica, Talend, NiFi, Kafka, custom applications etc.
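
Paths usually carry a URI scheme that tells Spark which file system to use. The sketch below maps a few commonly seen schemes to the file systems listed above. The mapping is illustrative rather than exhaustive, the exact scheme depends on the connector configured on the cluster, and `file_system_of` is a hypothetical helper.

```python
from urllib.parse import urlparse

# Commonly seen URI schemes (an assumption for illustration;
# check the configuration of your own cluster)
SCHEMES = {
    "file": "Local file system",
    "hdfs": "Hadoop Distributed File System",
    "s3": "AWS S3",
    "s3a": "AWS S3",
    "wasbs": "Azure Blob",
    "gs": "GCP Cloud Storage",
    "dbfs": "Databricks File System",
}


def file_system_of(path):
    """Return a description of the file system a path points to."""
    scheme = urlparse(path).scheme
    if not scheme:
        # Paths without a scheme resolve against the default file system
        return "Default file system of the environment"
    return SCHEMES.get(scheme, f"Other ({scheme})")


print(file_system_of("dbfs:/databricks-datasets/airlines"))
print(file_system_of("/user/itversity/warehouse"))
```

The path `/user/itversity/warehouse` is a made-up example; any path without a scheme behaves the same way.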

## Different Spark Modules

Let us understand the details of the different Spark modules. We will be focusing on the high level modules that are available in Spark 2.2 and later.
* Here are the different Spark modules.
  * Spark Core - RDD and Map Reduce APIs
  * Spark Data Frames and Spark SQL
  * Spark Structured Streaming
  * Spark MLlib (Data Frame based)
* As engineers, we need not focus too much on Spark Core libraries to build data pipelines. We should focus on Spark Data Frames as well as Spark SQL.

## Spark Cluster Manager Types

Let us get an overview of the different Spark cluster managers on which Spark applications are typically deployed.

* Here are the supported cluster manager types.
  * Local (used for development and unit testing).
  * Standalone
  * YARN
  * Mesos
* Here are the popular distributions which use YARN to deploy Spark applications.
  * Cloudera
  * AWS EMR
  * Google Dataproc
  * Hortonworks
  * MapR
* Databricks uses Standalone for running or deploying Spark jobs.
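
The cluster manager is selected through the master URL passed to Spark, for example with the `--master` option. As a rough illustration, the sketch below classifies a master URL by the prefixes Spark uses (`local`, `spark://`, `yarn`, `mesos://`); `cluster_manager_of` is a hypothetical helper and the host names are placeholders.

```python
def cluster_manager_of(master):
    """Return the cluster manager type implied by a Spark master URL."""
    if master.startswith("local"):
        # local, local[4], local[*] all run Spark in local mode
        return "Local"
    if master.startswith("spark://"):
        return "Standalone"
    if master == "yarn":
        return "YARN"
    if master.startswith("mesos://"):
        return "Mesos"
    return "Unknown"


for master in ["local[*]", "spark://host:7077", "yarn", "mesos://host:5050"]:
    print(master, "->", cluster_manager_of(master))
```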

## Launching Spark CLI

Let us understand how to launch the Pyspark CLI. We will cover both the local setup and our labs.
* Once pyspark is installed, you can run `pyspark` to launch the Pyspark CLI.
* In our labs, we have integrated Spark with Hadoop and Hive, so you can interact with Hive databases as well.
* You need to run the following command to launch Pyspark from the terminal.

```shell
export PYSPARK_PYTHON=python3
export SPARK_MAJOR_VERSION=2
pyspark --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

* Alternatively, you can launch the Pyspark CLI with the same command without the `export` statements.

```shell
pyspark --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

* Here is what happens when you launch the Pyspark CLI.
  * It launches the Python CLI.
  * All Spark related libraries are loaded.
  * It creates the `SparkSession` as well as `SparkContext` objects.
  * It lets us explore Spark APIs interactively.

## Using Jupyter Lab Interface

As part of our labs, you can learn Spark using the Jupyter based interface.
* Make sure you are using the right kernel, **Pyspark 2** (top right corner of the notebook).
* Use the code below to start the Spark Session object so that you can learn Spark interactively.

In [None]:
from pyspark.sql import SparkSession
import getpass
username = getpass.getuser()
spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Getting Started'). \
    master('yarn'). \
    getOrCreate()