# 1. Introduction to Spark

A quick introduction to Spark. This notebook introduces the API through the interactive shell, then walks through some simple applications in Python.

__Requirements__

Download a packaged release of Spark from [here](https://www.apache.org/dyn/closer.lua/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz). Extract the tarball before use.

Note: The Resilient Distributed Dataset (RDD), formerly Spark's main programming interface, has been replaced since Spark 2.0 by the `Dataset`, which is strongly typed like an RDD but has richer optimizations. Using Dataset is recommended, as it performs better than RDD.


In [1]:
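# Extract the Spark tarball - Windows (replace <filename.tar.gz> with the downloaded file)
!tar -xzvf <filename.tar.gz>

# Start the interactive shell from a terminal opened in the extracted Spark folder:
#   ./bin/pyspark    (Linux/macOS)
#   bin\pyspark      (Windows)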

 Volume in drive D is New Volume
 Volume Serial Number is E4B7-2A40

 Directory of d:\MA Stuff\Random

09/09/2022  11:27 AM    <DIR>          .
09/09/2022  11:27 AM    <DIR>          ..
09/09/2022  10:29 AM               792 1. Spark Introduction.ipynb
09/09/2022  10:29 AM               113 test.py
               2 File(s)            905 bytes
               2 Dir(s)  92,224,647,168 bytes free


## 1a. Interactive Analysis with Spark Shell

Spark's primary abstraction is a distributed collection of items called a Dataset.

* Datasets can be created from Hadoop InputFormats or by transforming other Datasets
* In Python, a Dataset doesn't need to be strongly typed, because the language is dynamically typed
* In Python, all Datasets are `Dataset[Row]`, which is called a `DataFrame`

In [None]:
# `spark` is the SparkSession that the pyspark shell creates on startup
textFile = spark.read.text("ps_README.md")

In [None]:
# Getting information about the DataFrame
textFile.count()  # number of rows (lines) in the DataFrame
textFile.first()  # first row of the DataFrame

In [None]:
# Filter for the subset of lines that contain "Spark"
linesWithSpark = textFile.filter(textFile.value.contains("Spark"))
# Transformations and actions can be chained: count the matching lines
textFile.filter(textFile.value.contains("Spark")).count()

### More Dataset Operations

Dataset actions and transformations can be used for more complex computations. From here on, the cells create their own `SparkSession` instead of relying on the one the shell provides.

In [15]:
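from pyspark.sql import SparkSession
# Explicit imports avoid shadowing Python builtins such as max
from pyspark.sql.functions import col, explode, max as sql_max, size, split

spark = SparkSession.builder.getOrCreate()
textFile = spark.read.text("ps_README.md")
lines_with_spark = textFile.filter(textFile.value.contains("Spark"))

# Word count per line, then the maximum across all lines
textFile.select(size(split(textFile.value, r"\s+")).alias("numWords")).agg(sql_max(col("numWords"))).collect()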

[Row(max(numWords)=16)]

The expression above:

1. Maps each line to an integer, its number of words, and aliases that column as `numWords`
2. Returns the result of `select` as a new DataFrame
3. Runs `agg` on that DataFrame to find the largest number of words in a line
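To make the three steps concrete, here is a plain-Python sketch of the same computation on a small in-memory list, with no Spark involved. The sample lines are made up for illustration, and `re.split` stands in for Spark's `split`; the two can differ on edge cases such as leading whitespace.

```python
import re

# Hypothetical sample lines standing in for the rows of the text file
lines = [
    "Apache Spark is a unified analytics engine",
    "for large-scale data processing",
    "Spark",
]

# Steps 1-2: map each line to its word count (the "numWords" column)
num_words = [len(re.split(r"\s+", line)) for line in lines]

# Step 3: aggregate to the largest word count
max_words = max(num_words)

print(num_words)
print(max_words)
```

Spark runs the same map-then-aggregate logic, but distributed across the rows of the DataFrame rather than over a local list.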

In [14]:
# MapReduce in Spark
wordCounts = textFile.select(explode(split(textFile.value, r"\s+")).alias("word")).groupBy("word").count()
wordCounts.show(5)

+--------------------+-----+
|                word|count|
+--------------------+-----+
|          [![PySpark|    1|
|              online|    1|
|              graphs|    1|
|Build](https://gi...|    1|
|                 API|    1|
+--------------------+-----+
only showing top 5 rows



The expression above computes the number of times each word appears in the document:

1. `split` turns each row into an array of words, and `explode` makes each element of the array (each word) a new row
2. `groupBy` groups identical words together
3. `count` counts the rows in each group, giving the number of occurrences of each word
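The same split, explode, group, and count sequence can be traced in plain Python on a tiny input. This is only an analogy to show what each step produces, not how Spark executes it; the two sample lines are made up, and `collections.Counter` stands in for `groupBy(...).count()`.

```python
import re
from collections import Counter

# Hypothetical sample lines standing in for the rows of the text file
lines = ["Spark is fast", "Spark is general"]

# split: each line becomes an array of words
arrays = [re.split(r"\s+", line) for line in lines]

# explode: flatten the arrays so that every word is its own "row"
words = [word for array in arrays for word in array]

# groupBy + count: tally how often each distinct word occurs
word_counts = Counter(words)

print(words)
print(word_counts["Spark"])
```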

## 1b. Caching

Spark supports pulling datasets into a cluster-wide in-memory cache. This is useful when data is accessed repeatedly, such as when querying a small hot dataset or running an iterative algorithm like PageRank.

_Commands_

* `.cache()` - marks the DataFrame for caching; the data is stored in the cache the first time an action runs on it

In [16]:
lines_with_spark.cache()
lines_with_spark.count()

20

## 1c. Self-Contained Applications

A self-contained application is a single, installable bundle that contains the application and the set of dependencies needed to run it. Once installed, it behaves the same way as a native application.

To build a packaged PySpark application, add this to the setup.py file:

```
install_requires=[
	'pyspark==3.3.0'
]
```

In [3]:
# Simple PySpark application
from pyspark.sql import SparkSession

logFile = "ps_README.md"
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
logData = spark.read.text(logFile).cache()

numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()

print(f"Lines with a: {numAs}, lines with b: {numBs}")

spark.stop()

Lines with a: 71, lines with b: 38


In [None]:
# Run the application (saved as SimpleApp.py) with the Python interpreter; requires pyspark to be installed
!python SimpleApp.py

Additional dependencies can be passed to `spark-submit` through its `--py-files` argument by packaging them into a .zip file.
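As a sketch of the packaging step, the standard-library `zipfile` module can bundle helper modules into an archive suitable for `--py-files`. The module name `helpers.py`, the function `greet`, and the archive name `deps.zip` are hypothetical, chosen only for illustration. The last lines check that Python can import from the archive, which is the mechanism that makes the zipped modules importable once the archive is on the Python path.

```python
import os
import sys
import tempfile
import zipfile

# Work in a temporary directory (hypothetical layout for this sketch)
workdir = tempfile.mkdtemp()

# Hypothetical helper module the application depends on
module_path = os.path.join(workdir, "helpers.py")
with open(module_path, "w") as f:
    f.write("def greet(name):\n    return f'Hello, {name}'\n")

# Bundle the module at the root of the archive
zip_path = os.path.join(workdir, "deps.zip")
with zipfile.ZipFile(zip_path, "w") as archive:
    archive.write(module_path, arcname="helpers.py")

# Python can import directly from a zip archive listed on sys.path
sys.path.insert(0, zip_path)
import helpers

print(helpers.greet("Spark"))
```

The archive would then be passed along the lines of `spark-submit --py-files deps.zip SimpleApp.py`.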