# Spark: Getting Started


## Step 0: Prerequisites & Installation

Follow [these instructions](https://docs.databricks.com/notebooks/notebooks-manage.html#import-a-notebook) to import this notebook into Databricks.

Run these commands in your terminal (just once) if you want to run Spark locally.

 * These instructions require a Mac with [Anaconda3](https://anaconda.com/) and [Homebrew](https://brew.sh/) installed.
 * Useful for small data only. For larger data, try [Databricks](https://databricks.com/).

```bash
# Make Homebrew aware of old versions of casks
brew tap caskroom/versions

# Install Java 1.8 (OpenJDK 8)
brew cask install adoptopenjdk8

# Install the current version of Spark
brew install apache-spark

# Install Py4J (connects PySpark to the Java Virtual Machine)
pip install py4j

# Add JAVA_HOME to .bash_profile (makes Java 1.8 your default JVM).
# printf with single quotes writes real newlines and keeps $(...) and $VARS
# unexpanded until .bash_profile is sourced.
printf '\n# Apache Spark\nexport JAVA_HOME=$(/usr/libexec/java_home -v 1.8)\n' >> ~/.bash_profile

# Add SPARK_HOME to .bash_profile (adjust 2.4.5 to the version Homebrew installed)
printf '\nexport SPARK_HOME=/usr/local/Cellar/apache-spark/2.4.5/libexec\n' >> ~/.bash_profile

# Add PySpark to PYTHONPATH in .bash_profile
printf '\nexport PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH\n' >> ~/.bash_profile

# Update current environment
source ~/.bash_profile

```

## Step 1: Create a SparkSession with a SparkContext

In [4]:
import pyspark

# un-comment the following lines if you are running Spark locally
# spark = pyspark.sql.SparkSession.builder.getOrCreate()
# sc = spark.sparkContext

In [5]:
spark

In [6]:
sc

If we need a broadcast variable (a read-only value that Spark ships to every executor once, much like a global), we can declare it like so:

In [8]:
glob = sc.broadcast(list(range(1, 3)))
glob.value

## Step 2: Download some Amazon reviews (Toys & Games)

In [10]:
# Data is already in the repo, but you can also get it with
# !wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Toys_and_Games_5.json.gz
# !gunzip reviews_Toys_and_Games_5.json.gz

Follow [these instructions](https://docs.databricks.com/data/data.html#import-data-1) to import `reviews_Toys_and_Games_5.json` into Databricks.

## Step 3: Create a Spark DataFrame

In [13]:
# this file path will be different if you are running Spark locally
df = spark.read.json('/FileStore/tables/reviews_Toys_and_Games_5.json')

In [14]:
df.persist()

The last command, `.persist()`, marks the DataFrame to be kept around after it is first computed, so that later actions can reuse it instead of re-reading the file. Nothing is actually stored until an action such as `.count()` runs. See [this page](https://unraveldata.com/to-cache-or-not-to-cache/). It is similar to `.cache()`, but more flexible, since you can specify which storage level (memory, disk, or both) you want. See [here](https://stackoverflow.com/questions/26870537/what-is-the-difference-between-cache-and-persist).

In [16]:
type(df)

In [17]:
df.show(5)  # show() prints 20 rows by default

In [18]:
pdf = df.limit(5).toPandas()
pdf

| | asin | helpful | overall | reviewText | reviewTime | reviewerID | reviewerName | summary | unixReviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 439893577 | [0, 0] | 5.0 | I like the item pricing. My granddaughter want... | 01 29, 2014 | A1VXOAVRGKGEAK | Angie | Magnetic board | 1390953600 |
| 1 | 439893577 | [1, 1] | 4.0 | Love the magnet easel... great for moving to d... | 03 28, 2014 | A8R62G708TSCM | Candace | it works pretty good for moving to different a... | 1395964800 |
| 2 | 439893577 | [1, 1] | 5.0 | Both sides are magnetic. A real plus when you... | 01 28, 2013 | A21KH420DK0ICA | capemaychristy | love this! | 1359331200 |
| 3 | 439893577 | [0, 0] | 5.0 | Bought one a few years ago for my daughter and... | 02 8, 2014 | AR29QK6HPFYZ4 | dcrm | Daughters love it | 1391817600 |
| 4 | 439893577 | [1, 1] | 4.0 | I have a stainless steel refrigerator therefor... | 05 5, 2014 | ACCH8EOML6FN5 | DoyZ | Great to have so he can play with his alphabet... | 1399248000 |


In [19]:
type(pdf)

In [20]:
df.count()

In [21]:
df.columns

In [22]:
df.printSchema()

In the schema, `nullable = true` means that the column may contain null values.

In [24]:
df.describe().show()

In [25]:
df.describe('overall').show()

In [26]:
reviews_df = df[['asin', 'overall']]

In [27]:
reviews_df.show()

In [28]:
def show(df, n=5):
    return df.limit(n).toPandas()

In [29]:
show(reviews_df)

| | asin | overall |
|---|---|---|
| 0 | 439893577 | 5.0 |
| 1 | 439893577 | 4.0 |
| 2 | 439893577 | 5.0 |
| 3 | 439893577 | 5.0 |
| 4 | 439893577 | 4.0 |


In [30]:
reviews_df.count()

In [31]:
sorted_review_df = reviews_df.sort('overall')

In [32]:
show(sorted_review_df)

| | asin | overall |
|---|---|---|
| 0 | 786955708 | 1.0 |
| 1 | 976990709 | 1.0 |
| 2 | 963679600 | 1.0 |
| 3 | 786955708 | 1.0 |
| 4 | 974665207 | 1.0 |


In [33]:
import pyspark.sql.functions as F

In [34]:
counts = reviews_df.agg(F.countDistinct('overall'))

In [35]:
counts.show()

In [36]:
query = """
SELECT overall, COUNT(*)
FROM reviews
GROUP BY overall
ORDER BY overall
"""

In [37]:
reviews_df.createOrReplaceTempView('reviews')

In [38]:
output = spark.sql(query)

In [39]:
show(output)

| | overall | count(1) |
|---|---|---|
| 0 | 1.0 | 4707 |
| 1 | 2.0 | 6298 |
| 2 | 3.0 | 16357 |
| 3 | 4.0 | 37445 |
| 4 | 5.0 | 102790 |


In [40]:
output.collect()

In [41]:
# Sanity check: the per-rating counts should add up to the total row count (expect 0).
reviews_df.count() - sum(row[1] for row in output.collect())

In [42]:
type(reviews_df)

Convert to RDD!

In [44]:
reviews_df.rdd

In [45]:
type(reviews_df.rdd)

### Count the words in the first row

In [47]:
row_one = df.first()

In [48]:
row_one

In [49]:
def word_count(text):
    return len(text.split())

In [50]:
word_count(row_one['reviewText'])

In [51]:
from pyspark.sql.types import IntegerType

# 'udf' stands for user-defined function

word_count_udf = F.udf(word_count, returnType=IntegerType())
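Because the schema marks columns as nullable, a UDF may be handed `None` instead of a string, and `word_count` would then fail on `text.split()`. Whether this dataset actually contains null reviews is not checked here, so the following is only a minimal sketch of a null-safe variant. The name `safe_word_count` is hypothetical and not part of the original notebook.

```python
def safe_word_count(text):
    """Count whitespace-separated words; pass missing values through as None."""
    if text is None:
        # Returning None from a UDF yields a null in the resulting column.
        return None
    return len(text.split())


print(safe_word_count("I like the item pricing."))
print(safe_word_count(None))
```

It can be wrapped the same way as above, for example `F.udf(safe_word_count, returnType=IntegerType())`.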

In [52]:
review_text_col = df['reviewText']

In [53]:
counts_df = df.withColumn('wordCount', word_count_udf(review_text_col))

In [54]:
# Our show() helper returns the first 5 rows by default; .T transposes the pandas result.

show(counts_df).T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| asin | 0439893577 | 0439893577 | 0439893577 | 0439893577 | 0439893577 |
| helpful | [0, 0] | [1, 1] | [1, 1] | [0, 0] | [1, 1] |
| overall | 5 | 4 | 5 | 5 | 4 |
| reviewText | I like the item pricing. My granddaughter want... | Love the magnet easel... great for moving to d... | Both sides are magnetic. A real plus when you... | Bought one a few years ago for my daughter and... | I have a stainless steel refrigerator therefor... |
| reviewTime | 01 29, 2014 | 03 28, 2014 | 01 28, 2013 | 02 8, 2014 | 05 5, 2014 |
| reviewerID | A1VXOAVRGKGEAK | A8R62G708TSCM | A21KH420DK0ICA | AR29QK6HPFYZ4 | ACCH8EOML6FN5 |
| reviewerName | Angie | Candace | capemaychristy | dcrm | DoyZ |
| summary | Magnetic board | it works pretty good for moving to different a... | love this! | Daughters love it | Great to have so he can play with his alphabet... |
| unixReviewTime | 1390953600 | 1395964800 | 1359331200 | 1391817600 | 1399248000 |
| wordCount | 20 | 22 | 76 | 31 | 47 |


In [55]:
from pyspark.sql.types import IntegerType
word_count_udf = F.udf(word_count, IntegerType())

# Registering our word_count() function so that we
# can use it with SQL! See documentation here:
# https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-UDFRegistration.html

df.createOrReplaceTempView('reviews')
spark.udf.register('word_count', word_count_udf)

In [56]:
query = """
SELECT asin, overall, reviewText, word_count(reviewText) AS wordCount
FROM reviews
"""

In [57]:
counts_df = spark.sql(query)

In [58]:
show(counts_df)

| | asin | overall | reviewText | wordCount |
|---|---|---|---|---|
| 0 | 439893577 | 5.0 | I like the item pricing. My granddaughter want... | 20 |
| 1 | 439893577 | 4.0 | Love the magnet easel... great for moving to d... | 22 |
| 2 | 439893577 | 5.0 | Both sides are magnetic. A real plus when you... | 76 |
| 3 | 439893577 | 5.0 | Bought one a few years ago for my daughter and... | 31 |
| 4 | 439893577 | 4.0 | I have a stainless steel refrigerator therefor... | 47 |


In [59]:
def count_all_the_things(text):
    return [len(text), len(text.split())]

In [60]:
from pyspark.sql.types import ArrayType, IntegerType
count_udf = F.udf(count_all_the_things, returnType=ArrayType(IntegerType()))

In [61]:
counts_df = df.withColumn('counts', count_udf(df['reviewText']))

In [62]:
show(counts_df, 1)

| | asin | helpful | overall | reviewText | reviewTime | reviewerID | reviewerName | summary | unixReviewTime | counts |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 439893577 | [0, 0] | 5.0 | I like the item pricing. My granddaughter want... | 01 29, 2014 | A1VXOAVRGKGEAK | Angie | Magnetic board | 1390953600 | [100, 20] |


In [63]:
slim_counts_df = (
    df.drop('reviewTime', 'helpful')  # drop() accepts several column names at once
      .withColumn('counts', count_udf(df['reviewText']))
      .drop('reviewText')
)

In [64]:
show(slim_counts_df, n=1)

| | asin | overall | reviewerID | reviewerName | summary | unixReviewTime | counts |
|---|---|---|---|---|---|---|---|
| 0 | 439893577 | 5.0 | A1VXOAVRGKGEAK | Angie | Magnetic board | 1390953600 | [100, 20] |


In [65]:
aggs = counts_df.groupBy('reviewerID').agg({'overall': 'mean'})
aggs.collect()

### A few more basic commands

Please refer also to the [official programming guide](http://spark.apache.org/docs/latest/rdd-programming-guide.html).

In [67]:
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)

In [68]:
distData  # displays the RDD object; nothing is computed until an action such as .collect() runs

In [69]:
def multiply(a, b):
    return a * b

In [70]:
distData.reduce(multiply)

In [71]:
def subtract1(a, b):
    return a - b

In [72]:
distData.reduce(subtract1)

In [73]:
def subtract2(a, b):
    return b - a

In [74]:
distData.reduce(subtract2)

Can you explain these "subtraction" results?
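One way to explore this question without a cluster is to mimic what a distributed reduce does: reduce each partition separately, then reduce the partial results. The sketch below is plain Python, not Spark. The helper `partitioned_reduce` and its simple chunking scheme are assumptions made for illustration, and Spark's own partitioning and merge order may differ.

```python
from functools import reduce


def partitioned_reduce(func, data, num_partitions):
    """Reduce each chunk of `data` separately, then reduce the partial results."""
    size = -(-len(data) // num_partitions)  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    partials = [reduce(func, chunk) for chunk in chunks]
    return reduce(func, partials)


def multiply(a, b):
    return a * b


def subtract1(a, b):
    return a - b


data = [1, 2, 3, 4, 5]
for n in (1, 2):
    print(n, partitioned_reduce(multiply, data, n), partitioned_reduce(subtract1, data, n))
# prints:
# 1 120 -13
# 2 120 -3
```

Multiplication gives the same answer however the data is split, because it is associative and commutative. Subtraction is neither, so its result depends on how the elements are grouped, which is why `reduce` should only be given associative and commutative functions.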

In [76]:
distData.filter(lambda x: x < 4).collect()

### Reading files

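`sc.textFile()` reads .txt files into an RDD of lines.

`spark.read.json()` reads .json files (or an RDD of JSON strings) into a DataFrame. `.toJSON()` goes the other way: it turns a DataFrame into an RDD of JSON strings.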

In [80]:
dfjson = counts_df.toJSON()

In [81]:
df2 = spark.read.json(dfjson)

In [82]:
df2.printSchema()

In [83]:
counts_df

In [84]:
# toPandas() collects the whole DataFrame onto the driver, so use it on small data only.
type(df.toPandas())