# Big Data Fundamentals with PySpark

There's been a lot of buzz about Big Data over the past few years, and it has finally become mainstream for many companies. But what is Big Data? This course covers the fundamentals of Big Data via PySpark. Spark is a "lightning-fast cluster computing" framework for Big Data. It provides a general-purpose data processing engine and lets you run programs up to 100x faster in memory, or 10x faster on disk, than Hadoop MapReduce. You'll use PySpark, a Python package for Spark programming, and its powerful higher-level libraries, such as SparkSQL and MLlib (for machine learning), to interact with the works of William Shakespeare, analyze FIFA 2018 football data, and perform clustering of genomic datasets. By the end of this course, you will have an in-depth understanding of PySpark and its application to general Big Data analysis.

## Table of Contents

- [Introduction](#intro)


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

path = "data/dc32/"

---
<a id='intro'></a>

## What is Big Data?

<img src="images/spark2_001.png" alt="" style="width: 800px;"/>

<img src="images/spark2_002.png" alt="" style="width: 800px;"/>

<img src="images/spark2_003.png" alt="" style="width: 800px;"/>

<img src="images/spark2_004.png" alt="" style="width: 800px;"/>

<img src="images/spark2_005.png" alt="" style="width: 800px;"/>

<img src="images/spark2_006.png" alt="" style="width: 800px;"/>

<img src="images/spark2_007.png" alt="" style="width: 800px;"/>

## PySpark: Spark with Python

<img src="images/spark2_008.png" alt="" style="width: 800px;"/>

<img src="images/spark2_009.png" alt="" style="width: 800px;"/>

<img src="images/spark2_010.png" alt="" style="width: 800px;"/>

<img src="images/spark2_011.png" alt="" style="width: 800px;"/>

<img src="images/spark2_012.png" alt="" style="width: 800px;"/>

<img src="images/spark2_013.png" alt="" style="width: 800px;"/>

In [2]:
#from pyspark.sql import SparkSession
#sc = SparkSession.builder.getOrCreate()
#print(sc)

# https://www.tutorialspoint.com/pyspark/pyspark_sparkcontext.htm

# Initialize a SparkContext via PySpark (local master, app name "First App")
from pyspark import SparkContext
sc = SparkContext("local", "First App")
print(sc)

<SparkContext master=local appName=First App>


## Understanding SparkContext

A `SparkContext` represents the entry point to Spark functionality. It's like the key to your car. In the PySpark shell, PySpark automatically creates a SparkContext for you (so you don't have to create one yourself) and exposes it via the variable `sc`.

In this simple exercise, you'll inspect the attributes of the SparkContext in your PySpark shell, which you'll be using for the rest of the course.

- Print the version of SparkContext in the PySpark shell.
- Print the Python version of SparkContext in the PySpark shell.
- What is the master of SparkContext in the PySpark shell?

In [3]:
# Print the version of SparkContext
print("The version of Spark Context in the PySpark shell is", sc.version)

# These two are PySpark specific
# Print the Python version of SparkContext
print("The Python version of Spark Context in the PySpark shell is", sc.pythonVer)

# Print the master of SparkContext
print("The master of Spark Context in the PySpark shell is", sc.master)

The version of Spark Context in the PySpark shell is 2.4.4
The Python version of Spark Context in the PySpark shell is 3.7
The master of Spark Context in the PySpark shell is local


## Interactive Use of PySpark

`Spark comes with an interactive Python shell in which PySpark is already installed`. The PySpark shell is quite powerful and is useful for basic testing and debugging. The easiest way to demonstrate its power is to start using it. In this example, you'll load a simple list containing the numbers 1 to 100 in the PySpark shell.

The most important thing to understand here is that `we are not creating any SparkContext object, because PySpark automatically creates a SparkContext object named sc by default in the PySpark shell`.

- Create a Python list named `numb` containing the numbers 1 to 100.
- Load the list into Spark using SparkContext's `parallelize` method and assign it to a variable `spark_data`.

In [4]:
# Create a Python list of numbers from 1 to 100
numb = list(range(1, 101))

# Load the list into PySpark  
spark_data = sc.parallelize(numb)

## Loading data in PySpark shell

`In PySpark, we express our computation through operations on distributed collections that are automatically parallelized across the cluster`. In the previous exercise, you saw an example of loading a list as a parallelized collection. In this exercise, you'll load data from a local file in the PySpark shell.

Remember that a SparkContext `sc` and a `file_path` variable (the path to the README.md file) are already available in your workspace.

- Load the local text file README.md in the PySpark shell.

In [6]:
file_path = path+'README.md'

# Load a local file into PySpark shell
lines = sc.textFile(file_path)
lines

data/dc32/README.md MapPartitionsRDD[4] at textFile at NativeMethodAccessorImpl.java:0

<img src="images/spark2_014.png" alt="" style="width: 800px;"/>
