# Spark: Getting Started

## Step 0: Prerequisites & Installation

Follow [these instructions](https://docs.databricks.com/notebooks/notebooks-manage.html#import-a-notebook) to import this notebook into Databricks.

Run these commands in your terminal (just once) if you want to run Spark locally.

 * These instructions require a Mac with [Anaconda3](https://anaconda.com/) and [Homebrew](https://brew.sh/) installed.
 * Useful for small data only. For larger data, try [Databricks](https://databricks.com/).

```bash
# Make Homebrew aware of old versions of casks
brew tap caskroom/versions

# Install Java 1.8 (OpenJDK 8)
brew cask install adoptopenjdk8

# Install the current version of Spark
brew install apache-spark

# Install Py4J (connects PySpark to the Java Virtual Machine)
pip install py4j

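# Add JAVA_HOME to .bash_profile (makes Java 1.8 your default JVM)
# printf interprets \n reliably; single quotes defer expansion until the profile is sourced
printf '\n# Apache Spark\nexport JAVA_HOME=$(/usr/libexec/java_home -v 1.8)\n' >> ~/.bash_profile

# Add SPARK_HOME to .bash_profile (replace 2.4.3 with the version Homebrew installed)
printf '\nexport SPARK_HOME=/usr/local/Cellar/apache-spark/2.4.3/libexec\n' >> ~/.bash_profile

# Add PySpark to PYTHONPATH in .bash_profile
# (single quotes keep $SPARK_HOME from expanding to an empty string right now)
printf '\nexport PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH\n' >> ~/.bash_profile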

# Update current environment
source ~/.bash_profile

```

## Step 1: Create a SparkSession with a SparkContext

In [None]:
import pyspark

# un-comment the following lines if you are running Spark locally
# spark = pyspark.sql.SparkSession.builder.getOrCreate()
# sc = spark.sparkContext

In [None]:
spark

In [None]:
sc

If we need a broadcast variable (a read-only value that Spark ships once to every executor), we can declare it like so:

In [None]:
glob = sc.broadcast(list(range(1, 3)))
glob.value

## Step 2: Download some Amazon reviews (Toys & Games)

In [None]:
# Data is already in the repo, but you can also get it with
# !wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Toys_and_Games_5.json.gz
# !gunzip reviews_Toys_and_Games_5.json.gz

Follow [these instructions](https://docs.databricks.com/data/data.html#import-data-1) to import `reviews_Toys_and_Games_5.json` into Databricks.

## Step 3: Create a Spark DataFrame

In [None]:
# this file path will be different if you are running Spark locally
df = spark.read.json('/FileStore/tables/reviews_Toys_and_Games_5.json')

In [None]:
df.persist()

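This last command, `.persist()`, marks the DataFrame for caching; the data is actually stored the first time an action (such as `.count()`) runs. By default a DataFrame is kept in memory and spills to disk if it does not fit. See [this page](https://unraveldata.com/to-cache-or-not-to-cache/). It is similar to `.cache()` but more flexible: `.cache()` always uses the default storage level, while `.persist()` lets you specify which storage level you want. See [here](https://stackoverflow.com/questions/26870537/what-is-the-difference-between-cache-and-persist).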

In [None]:
type(df)

In [None]:
df.show(5)  # show the first 5 rows (the default is 20)

In [None]:
pdf = df.limit(5).toPandas()
pdf

In [None]:
type(pdf)

In [None]:
df.count()

In [None]:
df.columns

In [None]:
df.printSchema()

The `nullable = true` part of each line means that the column is allowed to contain null values.

In [None]:
df.describe().show()

In [None]:
df.describe('overall').show()

In [None]:
reviews_df = df[['asin', 'overall']]

In [None]:
reviews_df.show()

In [None]:
def show(df, n=5):
    return df.limit(n).toPandas()

In [None]:
show(reviews_df)

In [None]:
reviews_df.count()

In [None]:
sorted_review_df = reviews_df.sort('overall')

In [None]:
show(sorted_review_df)

In [None]:
import pyspark.sql.functions as F

In [None]:
counts = reviews_df.agg(F.countDistinct('overall'))

In [None]:
counts.show()

In [None]:
query = """
SELECT overall, COUNT(*)
FROM reviews
GROUP BY overall
ORDER BY overall
"""

In [None]:
reviews_df.createOrReplaceTempView('reviews')

In [None]:
output = spark.sql(query)

In [None]:
show(output)

In [None]:
output.collect()

In [None]:
# Sanity check: the per-rating counts should add up to the total row count
reviews_df.count() - sum(row[1] for row in output.collect())

In [None]:
type(reviews_df)

Convert the DataFrame to an RDD:

In [None]:
reviews_df.rdd

In [None]:
type(reviews_df.rdd)

### Count the words in the first row

In [None]:
row_one = df.first()

In [None]:
row_one

In [None]:
def word_count(text):
    return len(text.split())

In [None]:
word_count(row_one['reviewText'])

In [None]:
from pyspark.sql.types import IntegerType

# 'udf' stands for User-Defined Function

word_count_udf = F.udf(word_count, returnType=IntegerType())

In [None]:
review_text_col = df['reviewText']

In [None]:
counts_df = df.withColumn('wordCount', word_count_udf(review_text_col))
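One caveat with this UDF: Spark passes `None` to a Python UDF when the column value is null, and `word_count` would then fail on `text.split()`. Below is a minimal sketch of a null-safe variant. The name `safe_word_count` and the sample strings are hypothetical, and returning 0 for missing text is an assumption (returning `None` would be an equally reasonable choice).

```python
def safe_word_count(text):
    """Count whitespace-separated tokens; treat missing text as zero words."""
    if text is None:
        return 0
    return len(text.split())

# Hypothetical inputs: a normal review, an empty review, and a null value.
samples = ["Great toy, my kids love it", "", None]
counts = [safe_word_count(s) for s in samples]
print(counts)
```

It can be wrapped the same way as before, for example `F.udf(safe_word_count, IntegerType())`.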

In [None]:
# Our show() helper displays 5 rows by default (n=5).

show(counts_df).T

In [None]:
from pyspark.sql.types import IntegerType
word_count_udf = F.udf(word_count, IntegerType())

# Registering our word_count() function so that we
# can use it with SQL! See documentation here:
# https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-UDFRegistration.html

df.createOrReplaceTempView('reviews')
spark.udf.register('word_count', word_count_udf)

In [None]:
query = """
SELECT asin, overall, reviewText, word_count(reviewText) AS wordCount
FROM reviews
"""

In [None]:
counts_df = spark.sql(query)

In [None]:
show(counts_df)

In [None]:
def count_all_the_things(text):
    return [len(text), len(text.split())]

In [None]:
from pyspark.sql.types import ArrayType, IntegerType
count_udf = F.udf(count_all_the_things, returnType=ArrayType(IntegerType()))

In [None]:
counts_df = df.withColumn('counts', count_udf(df['reviewText']))

In [None]:
show(counts_df, 1)

In [None]:
slim_counts_df = (
    df.drop('reviewTime', 'helpful')
      .withColumn('counts', count_udf(df['reviewText']))
      .drop('reviewText')
)

In [None]:
show(slim_counts_df, n=1)

In [None]:
aggs = counts_df.groupBy('reviewerID').agg({'overall': 'mean'})
aggs.collect()

### A few more basic commands

Please refer also to the [official programming guide](http://spark.apache.org/docs/latest/rdd-programming-guide.html).

In [None]:
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)

In [None]:
distData.collect()  # bring the distributed elements back to the driver

In [None]:
def multiply(a, b):
    return a * b

In [None]:
distData.reduce(multiply)

In [None]:
def subtract1(a, b):
    return a - b

In [None]:
distData.reduce(subtract1)

In [None]:
def subtract2(a, b):
    return b - a

In [None]:
distData.reduce(subtract2)

Can you explain these "subtraction" results?
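A hint: `reduce` applies the function within each partition first and then combines the partial results, so it only gives a stable answer when the function is commutative and associative. Subtraction is neither. The sketch below imitates this in plain Python; `partitioned_reduce` is a hypothetical helper, and its contiguous chunking is an assumption that need not match how Spark actually splits the data.

```python
from functools import reduce

def partitioned_reduce(func, data, num_partitions):
    """Reduce each contiguous chunk, then reduce the partial results."""
    size = -(-len(data) // num_partitions)  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    partials = [reduce(func, chunk) for chunk in chunks]
    return reduce(func, partials)

def multiply(a, b):
    return a * b

def subtract(a, b):
    return a - b

data = [1, 2, 3, 4, 5]

# Multiplication does not care how the data is split.
product_one = partitioned_reduce(multiply, data, 1)
product_two = partitioned_reduce(multiply, data, 2)

# Subtraction does: one chunk gives (((1-2)-3)-4)-5,
# two chunks give ((1-2)-3) - (4-5).
diff_one = partitioned_reduce(subtract, data, 1)
diff_two = partitioned_reduce(subtract, data, 2)

print(product_one, product_two, diff_one, diff_two)
```

The products agree for both splits, while the two differences do not, which is why the Spark results depend on the number of partitions.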

In [None]:
distData.filter(lambda x: x < 4).collect()

### Reading files

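`sc.textFile()` reads plain-text files into an RDD of lines.

`spark.read.json()` reads JSON, either from a file path or from an RDD of JSON strings. `.toJSON()` goes the other way: it converts a DataFrame into an RDD of JSON strings.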

In [None]:
dfjson = counts_df.toJSON()

In [None]:
df2 = spark.read.json(dfjson)

In [None]:
df2.printSchema()

In [None]:
counts_df

In [None]:
type(df.toPandas())