# Spark: Getting Started
 * These instructions assume a Mac with [Anaconda3](https://anaconda.com/) and [Homebrew](https://brew.sh/) installed.
 * This local setup is suitable for small datasets only. For larger data, try a hosted cluster such as [Databricks](https://databricks.com/).

## Step 0: Prerequisites & Installation

Run these commands in your terminal (just once).

```bash
# Make Homebrew aware of old versions of casks
brew tap caskroom/versions

# Install Java 1.8 (OpenJDK 8)
brew cask install adoptopenjdk8

# Install the current version of Spark
brew install apache-spark

# Install Py4J (connects PySpark to the Java Virtual Machine)
pip install py4j

# Add JAVA_HOME to .bash_profile (makes Java 1.8 your default JVM)
echo "export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)" >> ~/.bash_profile

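# Add SPARK_HOME to .bash_profile
# (the path assumes Spark 2.4.3; adjust it to the version Homebrew installed, see `brew info apache-spark`)
export SPARK_HOME=/usr/local/Cellar/apache-spark/2.4.3/libexec
echo "export SPARK_HOME=/usr/local/Cellar/apache-spark/2.4.3/libexec" >> ~/.bash_profile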

# Add PySpark to PYTHONPATH (single quotes defer variable expansion until the profile is sourced)
echo 'export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH' >> ~/.bash_profile

# Update current environment
source ~/.bash_profile

```
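After opening a new terminal, one quick way to confirm that the variables above took effect is a small Python check. This is only a minimal sketch: the helper name `check_spark_env` is hypothetical, and it verifies that the variables are set and point at existing directories, not that Spark itself runs.

```python
import os


def check_spark_env(env=None):
    """Return a dict mapping each expected variable to a short status string."""
    env = os.environ if env is None else env
    report = {}
    for name in ("JAVA_HOME", "SPARK_HOME"):
        value = env.get(name)
        if not value:
            report[name] = "not set"
        elif not os.path.isdir(value):
            report[name] = "set, but directory not found: " + value
        else:
            report[name] = "ok"
    return report


if __name__ == "__main__":
    for name, status in check_spark_env().items():
        print(name, "->", status)
```

If `SPARK_HOME` is reported as missing a directory, the hard-coded version in the path probably differs from the one Homebrew installed.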

## Step 1: Create a SparkSession with a SparkContext

In [1]:
import pyspark
spark = pyspark.sql.SparkSession.builder.getOrCreate()
sc = spark.sparkContext

In [2]:
spark

In [3]:
sc

## Step 2: Download some Amazon reviews (Toys & Games)

In [7]:
# Download data (run this only once)
#!wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Toys_and_Games_5.json.gz
#!gunzip reviews_Toys_and_Games_5.json.gz
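The cell above relies on `wget`, which macOS does not ship by default. As a hedged alternative, the same two steps can be sketched with the Python standard library. The helper names `download` and `gunzip` are hypothetical, and unlike the `gunzip` command this version keeps the `.gz` file after decompressing.

```python
import gzip
import os
import shutil
import urllib.request

URL = ("http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/"
       "reviews_Toys_and_Games_5.json.gz")


def download(url, dest):
    """Fetch url into dest, skipping the request if dest already exists."""
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest


def gunzip(src, dest):
    """Decompress the gzip file src into dest, streaming in chunks."""
    with gzip.open(src, "rb") as f_in, open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dest


# Usage (commented out, like the cell above, so it only runs on demand):
# gz_path = download(URL, "reviews_Toys_and_Games_5.json.gz")
# gunzip(gz_path, "reviews_Toys_and_Games_5.json")
```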

## Step 3: Create a Spark DataFrame

In [23]:
df = spark.read.json('reviews_Toys_and_Games_5.json')
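By default, `spark.read.json` expects one JSON object per line (the JSON Lines layout). If the read fails or yields unexpected columns, it can help to look at the first few records outside Spark. The following is a minimal sketch using only the standard library; `peek_json_lines` is a hypothetical helper name.

```python
import itertools
import json


def peek_json_lines(path, n=2):
    """Parse and return the first n non-empty lines of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        lines = (line for line in f if line.strip())
        return [json.loads(line) for line in itertools.islice(lines, n)]


# Usage, assuming the file from Step 2 is in the working directory:
# for record in peek_json_lines("reviews_Toys_and_Games_5.json"):
#     print(sorted(record))  # field names of each record
```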

In [24]:
df.persist()

DataFrame[asin: string, helpful: array<bigint>, overall: double, reviewText: string, reviewTime: string, reviewerID: string, reviewerName: string, summary: string, unixReviewTime: bigint]

In [20]:
df.limit(5).toPandas().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| asin | 0439893577 | 0439893577 | 0439893577 | 0439893577 | 0439893577 |
| helpful | [0, 0] | [1, 1] | [1, 1] | [0, 0] | [1, 1] |
| overall | 5 | 4 | 5 | 5 | 4 |
| reviewText | I like the item pricing. My granddaughter want... | Love the magnet easel... great for moving to d... | Both sides are magnetic. A real plus when you... | Bought one a few years ago for my daughter and... | I have a stainless steel refrigerator therefor... |
| reviewTime | 01 29, 2014 | 03 28, 2014 | 01 28, 2013 | 02 8, 2014 | 05 5, 2014 |
| reviewerID | A1VXOAVRGKGEAK | A8R62G708TSCM | A21KH420DK0ICA | AR29QK6HPFYZ4 | ACCH8EOML6FN5 |
| reviewerName | Angie | Candace | capemaychristy | dcrm | DoyZ |
| summary | Magnetic board | it works pretty good for moving to different a... | love this! | Daughters love it | Great to have so he can play with his alphabet... |
| unixReviewTime | 1390953600 | 1395964800 | 1359331200 | 1391817600 | 1399248000 |


In [21]:
df.count()

167597

In [25]:
reviews_df = df[['asin', 'overall']]

In [26]:
def show(df, n=5):
    return df.limit(n).toPandas()

In [27]:
show(reviews_df)

| | asin | overall |
|---|---|---|
| 0 | 0439893577 | 5.0 |
| 1 | 0439893577 | 4.0 |
| 2 | 0439893577 | 5.0 |
| 3 | 0439893577 | 5.0 |
| 4 | 0439893577 | 4.0 |


In [28]:
reviews_df.count()

167597

In [31]:
sorted_review_df = reviews_df.sort('overall')

In [32]:
show(sorted_review_df)

| | asin | overall |
|---|---|---|
| 0 | 0786955708 | 1.0 |
| 1 | 0976990709 | 1.0 |
| 2 | 0963679600 | 1.0 |
| 3 | 0786955708 | 1.0 |
| 4 | 0974665207 | 1.0 |


In [33]:
import pyspark.sql.functions as F

In [35]:
counts = reviews_df.agg(F.countDistinct('overall'))
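`F.countDistinct('overall')` counts the distinct non-null values in the `overall` column, and `agg` wraps the result in a one-row DataFrame that is only computed once an action runs on it (for example, the `show` helper defined earlier). The plain-Python sketch below mimics that computation on the five ratings from the first preview. It illustrates the semantics only and is not Spark code; `count_distinct` is a hypothetical name.

```python
def count_distinct(values):
    """Count distinct values, skipping None the way SQL COUNT(DISTINCT) skips NULL."""
    return len({v for v in values if v is not None})


# First five `overall` ratings from the preview of reviews_df
sample_overall = [5.0, 4.0, 5.0, 5.0, 4.0]
print(count_distinct(sample_overall))  # 2
```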