# Big Data Fundamentals with PySpark

Big Data has generated a lot of buzz over the past few years and has become mainstream for many companies. But what is Big Data? This course covers the fundamentals of Big Data via PySpark. Spark is a "lightning-fast cluster computing" framework for Big Data. It provides a general data processing engine and can run programs up to 100x faster in memory, or 10x faster on disk, than Hadoop MapReduce. You'll use PySpark, a Python package for Spark programming, and its powerful higher-level libraries such as SparkSQL and MLlib (for machine learning) to interact with the works of William Shakespeare, analyze FIFA 2018 football data, and perform clustering of genomic datasets. By the end of this course, you will have gained an in-depth understanding of PySpark and its application to general Big Data analysis.

## Table of Contents

- [Introduction](#intro)
- [Programming in PySpark RDDs](#rdd)


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

path = "data/dc32/"

---
<a id='intro'></a>

## What is Big Data?

<img src="images/spark2_001.png" alt="" style="width: 800px;"/>

<img src="images/spark2_002.png" alt="" style="width: 800px;"/>

<img src="images/spark2_003.png" alt="" style="width: 800px;"/>

<img src="images/spark2_004.png" alt="" style="width: 800px;"/>

<img src="images/spark2_005.png" alt="" style="width: 800px;"/>

<img src="images/spark2_006.png" alt="" style="width: 800px;"/>

<img src="images/spark2_007.png" alt="" style="width: 800px;"/>

## PySpark: Spark with Python

<img src="images/spark2_008.png" alt="" style="width: 800px;"/>

<img src="images/spark2_009.png" alt="" style="width: 800px;"/>

<img src="images/spark2_010.png" alt="" style="width: 800px;"/>

<img src="images/spark2_011.png" alt="" style="width: 800px;"/>

<img src="images/spark2_012.png" alt="" style="width: 800px;"/>

<img src="images/spark2_013.png" alt="" style="width: 800px;"/>

In [2]:
#from pyspark.sql import SparkSession
#sc = SparkSession.builder.getOrCreate()
#print(sc)

# https://www.tutorialspoint.com/pyspark/pyspark_sparkcontext.htm

# Initialize a SparkContext via PySpark (local master, app name "First App")
from pyspark import SparkContext
sc = SparkContext("local", "First App")
print(sc)

<SparkContext master=local appName=First App>


## Understanding SparkContext

A `SparkContext` is the entry point to Spark functionality, much like the key to your car. In the PySpark shell, a SparkContext is created for you automatically (so you don't have to create one yourself) and is exposed via the variable `sc`.

In this simple exercise, you'll find out the attributes of the SparkContext in your PySpark shell which you'll be using for the rest of the course.

- Print the version of SparkContext in the PySpark shell.
- Print the Python version of SparkContext in the PySpark shell.
- What is the master of SparkContext in the PySpark shell?

In [3]:
# Print the version of SparkContext
print("The version of Spark Context in the PySpark shell is", sc.version)

# These two are PySpark specific
# Print the Python version of SparkContext
print("The Python version of Spark Context in the PySpark shell is", sc.pythonVer)

# Print the master of SparkContext
print("The master of Spark Context in the PySpark shell is", sc.master)

The version of Spark Context in the PySpark shell is 2.4.4
The Python version of Spark Context in the PySpark shell is 3.7
The master of Spark Context in the PySpark shell is local


## Interactive Use of PySpark

`Spark comes with an interactive Python shell in which PySpark is already installed`. The PySpark shell is useful for basic testing and debugging, and it is quite powerful. The easiest way to demonstrate the power of the PySpark shell is to start using it. In this example, you'll load a simple list containing the numbers 1 to 100 in the PySpark shell.

The most important thing to understand here is that `we are not creating any SparkContext object because PySpark automatically creates the SparkContext object named sc, by default in the PySpark shell`.

- Create a Python list named `numb` containing the numbers 1 to 100.
- Load the list into Spark using SparkContext's `parallelize` method and assign it to a variable `spark_data`.

In [4]:
# Create a Python list of numbers from 1 to 100
# (range() alone returns a lazy range object, so wrap it in list())
numb = list(range(1, 101))

# Load the list into PySpark  
spark_data = sc.parallelize(numb)

## Loading data in PySpark shell

`In PySpark, we express our computation through operations on distributed collections that are automatically parallelized across the cluster`. In the previous exercise, you saw an example of loading a list as a parallelized collection; in this exercise, you'll load data from a local file in the PySpark shell.

Remember that a SparkContext `sc` and a `file_path` variable (the path to the README.md file) are already available in your workspace.

- Load the local text file README.md in the PySpark shell.

In [6]:
file_path = path+'README.md'

# Load a local file into PySpark shell
lines = sc.textFile(file_path)
lines

data/dc32/README.md MapPartitionsRDD[4] at textFile at NativeMethodAccessorImpl.java:0

## Review of functional programming in Python

<img src="images/spark2_014.png" alt="" style="width: 800px;"/>

<img src="images/spark2_015.png" alt="" style="width: 800px;"/>

<img src="images/spark2_016.png" alt="" style="width: 800px;"/>

<img src="images/spark2_017.png" alt="" style="width: 800px;"/>

<img src="images/spark2_018.png" alt="" style="width: 800px;"/>

## Use of lambda() with map()

The `map()` function in Python applies the given function to each item of a given iterable (list, tuple, etc.). In Python 3 it returns an iterator over the results, so wrap it in `list()` when you need a list. The general syntax of the map() function is `map(fun, iter)`. We can also use lambda functions with map(). The general syntax of map() with a lambda function is `map(lambda <argument>: <expression>, iter)`.

In this exercise, you'll use a `lambda function inside the map()` built-in function to square all the numbers in a list.

- Print my_list which is available in your environment.
- Square each item in my_list using map() and lambda().
- Print the result of map function.

In [7]:
my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Print my_list in the console
print("Input list is", my_list)

# Square all numbers in my_list
squared_list_lambda = list(map(lambda x: x**2, my_list))

# Print the result of the map function
print("The squared numbers are", squared_list_lambda)

Input list is [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
The squared numbers are [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
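
In Python 3, `map()` returns a lazy iterator rather than a list, which is why the cell above wraps it in `list()`. A short illustration, using a small made-up list:

```python
numbers = [1, 2, 3]

# map() returns an iterator; no squares have been computed yet
squares = map(lambda x: x**2, numbers)
print(type(squares).__name__)   # map

# Converting to a list consumes the iterator
first_pass = list(squares)
print(first_pass)               # [1, 4, 9]

# The iterator is now exhausted, so a second pass yields nothing
second_pass = list(squares)
print(second_pass)              # []
```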


## Use of lambda() with filter()

Another function that is used extensively in Python is the `filter()` function. The `filter()` function takes a function and an iterable (such as a list) as arguments and keeps only the items for which the function returns a true value. In Python 3 it returns an iterator, so wrap it in `list()` when you need a list. The general syntax of the filter() function is `filter(function, list_of_input)`. Like map(), filter() can be used with a lambda function. The general syntax of filter() with a lambda function is `filter(lambda <argument>: <expression>, list)`.

In this exercise, you'll use a lambda function inside the filter() built-in function to find all the numbers divisible by 10 in a list.

- Print my_list2 which is available in your environment.
- Filter the numbers divisible by 10 from my_list2 using filter() and lambda().
- Print the numbers divisible by 10 from my_list2.

In [8]:
my_list2 = [10, 21, 31, 40, 51, 60, 72, 80, 93, 101]

# Print my_list2 in the console
print("Input list is:", my_list2)

# Filter numbers divisible by 10
filtered_list = list(filter(lambda x: (x%10 == 0), my_list2))

# Print the numbers divisible by 10
print("Numbers divisible by 10 are:", filtered_list)

Input list is: [10, 21, 31, 40, 51, 60, 72, 80, 93, 101]
Numbers divisible by 10 are: [10, 40, 60, 80]
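
`filter()` accepts any iterable and, like `map()`, returns a lazy iterator in Python 3. The sketch below uses an illustrative tuple and shows that a list comprehension yields the same result.

```python
values = (10, 21, 31, 40)  # any iterable works, not only lists

# filter() keeps the items for which the function returns True
by_filter = list(filter(lambda x: x % 10 == 0, values))

# Equivalent list comprehension
by_comprehension = [x for x in values if x % 10 == 0]

print(by_filter)                      # [10, 40]
print(by_filter == by_comprehension)  # True
```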


---
<a id='rdd'></a>

## Programming in PySpark RDDs

## Abstracting Data with RDDs

<img src="images/spark2_019.png" alt="" style="width: 800px;"/>

<img src="images/spark2_020.png" alt="" style="width: 800px;"/>

<img src="images/spark2_021.png" alt="" style="width: 800px;"/>

<img src="images/spark2_022.png" alt="" style="width: 800px;"/>

<img src="images/spark2_023.png" alt="" style="width: 800px;"/>

<img src="images/spark2_024.png" alt="" style="width: 800px;"/>

## RDDs from Parallelized collections

`Resilient Distributed Dataset (RDD)` is the basic abstraction in Spark. It is an immutable, distributed collection of objects. Since the RDD is Spark's fundamental data type, it is important that you understand how to create one. In this exercise, you'll create your first RDD in PySpark from a collection of words.

Remember you already have a SparkContext sc available in your workspace.

- Create an RDD named RDD from a list of words.
- Confirm that the object created is an RDD.

In [9]:
# Create an RDD from a list of words
RDD = sc.parallelize(["Spark", "is", "a", "framework", "for", "Big Data processing"])

# Print out the type of the created object
print("The type of RDD is", type(RDD))

The type of RDD is <class 'pyspark.rdd.RDD'>


## RDDs from External Datasets

PySpark can easily create RDDs from files stored in external storage such as `HDFS (Hadoop Distributed File System)`, `Amazon S3` buckets, etc. However, the most common method of creating RDDs is from files stored in your local file system. This is done with SparkContext's `textFile()` method, which takes a file path and reads the file as a collection of lines. In this exercise, you'll create an RDD from the file path (`file_path`) with the file name `README.md`, which is already available in your workspace.

Remember you already have a SparkContext sc available in your workspace.

- Print the file_path in the PySpark shell.
- Create an RDD named fileRDD from a file_path with the file name README.md.
- Print the type of the fileRDD created.

In [10]:
file_path = path+'README.md'

# Print the file_path
print("The file_path is", file_path)

# Create a fileRDD from file_path
fileRDD = sc.textFile(file_path)

# Check the type of fileRDD
print("The file type of fileRDD is", type(fileRDD))

The file_path is data/dc32/README.md
The file type of fileRDD is <class 'pyspark.rdd.RDD'>


<img src="images/spark2_025.png" alt="" style="width: 800px;"/>
