# Spark Session

Spark can work with data located on HDFS or a non-distributed filesystem. It can also use YARN from Hadoop, or [Mesos](https://mesos.apache.org/), or a resource manager of its own. For this example we use the last option.

All distributed operations in Spark go through a so-called Spark session. Let's create one:

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.config("spark.driver.memory", "100g").getOrCreate()
spark

Here we set the amount of memory available to the Spark driver (`spark.driver.memory`), so that Spark does not try to take over the whole server. Configuring Spark can be tricky at times. When you work on a dedicated Spark cluster, your admin will probably create a session for you.
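
For a smaller machine, the same builder can take several options at once. The following is a minimal sketch, not a recommended configuration: the option values, the `build_session` helper and the `yelp-reviews` application name are all hypothetical and should be adapted to your hardware.

```python
# Hypothetical settings for a laptop-sized run; tune them to your machine.
session_options = {
    "spark.master": "local[4]",            # run locally on 4 cores, no cluster needed
    "spark.driver.memory": "4g",           # upper bound for the driver JVM heap
    "spark.sql.shuffle.partitions": "8",   # fewer shuffle partitions for small data
}


def build_session(options, app_name="yelp-reviews"):
    """Create (or reuse) a SparkSession configured with the given options."""
    # Imported lazily so the options above can be inspected without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling `spark = build_session(session_options)` then gives a session like the one above. Note that `getOrCreate()` returns an already running session if there is one, in which case the driver memory setting has no effect.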

# Reading Data

Spark can read data in a variety of formats, e.g. JSON. We use the [YELP Dataset](https://www.yelp.com/dataset) for this example. It's easy to obtain and free to use in education and research.

In [5]:
%%time

spark.read.text("/home/boris/Downloads/yelp_dataset/yelp_academic_dataset_review.json").count()

CPU times: user 1.99 ms, sys: 1.85 ms, total: 3.84 ms
Wall time: 601 ms


8021122

This code simply reads a JSON file as text, line by line, and counts the number of lines. Let's compare the speed with the `wc` tool:

In [6]:
!time wc -l /home/boris/Downloads/yelp_dataset/yelp_academic_dataset_review.json

8021122 /home/boris/Downloads/yelp_dataset/yelp_academic_dataset_review.json
wc -l /home/boris/Downloads/yelp_dataset/yelp_academic_dataset_review.json  0,44s user 1,27s system 99% cpu 1,706 total


Although `wc` is implemented in C and is generally more efficient than the JVM code behind Spark, it uses only one CPU and therefore runs several times slower than its distributed counterpart from Spark.
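
The idea behind the speed-up is to split the input into chunks, count each chunk independently, and add up the partial results. Here is a rough sketch of that idea in plain Python. It is an illustration only, not how Spark is implemented: it uses threads and byte ranges, the helper names are made up, and the demo file is a small temporary one.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def count_newlines(path, start, end, block_size=1 << 20):
    """Count newline bytes in the byte range [start, end) of the file."""
    count = 0
    with open(path, "rb") as f:
        f.seek(start)
        remaining = end - start
        while remaining > 0:
            block = f.read(min(block_size, remaining))
            if not block:
                break
            count += block.count(b"\n")
            remaining -= len(block)
    return count


def parallel_line_count(path, n_chunks=4):
    """Split the file into byte ranges, count each one in a worker, sum the results."""
    size = os.path.getsize(path)
    # Chunk boundaries as byte offsets covering the whole file without overlap.
    bounds = [size * i // n_chunks for i in range(n_chunks + 1)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partial = pool.map(lambda b: count_newlines(path, *b), zip(bounds, bounds[1:]))
        return sum(partial)


# Demo on a small temporary file with one JSON object per line.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    for i in range(1000):
        tmp.write('{"review_id": %d}\n' % i)
demo_path = tmp.name

line_count = parallel_line_count(demo_path)
os.remove(demo_path)
print(line_count)  # 1000
```

Counting newline bytes works here because a chunk boundary can fall anywhere: every newline belongs to exactly one byte range, so no line is counted twice or missed.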

Parsing JSON in Spark is really simple:

In [7]:
reviews = spark.read.json(
    "/home/boris/Downloads/yelp_dataset/yelp_academic_dataset_review.json"
).sample(0.01)
reviews.show(n=5)

+--------------------+----+-------------------+-----+--------------------+-----+--------------------+------+--------------------+
|         business_id|cool|               date|funny|           review_id|stars|                text|useful|             user_id|
+--------------------+----+-------------------+-----+--------------------+-----+--------------------+------+--------------------+
|PmYIuaV18eyr5IMXZ...|   0|2015-03-04 20:37:18|    0|H1hdeQ1TdZ4fF1Gct...|  3.0|I've got a 2 hour...|     0|gj7dwFiadFO5gyrH8...|
|5gl2GLgimBz1-Q-uX...|   1|2016-01-30 03:26:26|    2|mscF94aHl77US0MB1...|  5.0|Excellent pizza! ...|     2|yzF_-JtpMVPqM-TnY...|
|Zw9wEAk9L6oTZi63f...|   0|2017-03-27 03:17:02|    0|tN-8bLuiQc4wZ9G-z...|  4.0|Food is yummy, ho...|     0|rIvh-c-Qkhme6FG0d...|
|0TBTV3q6QXCn9vNhy...|   0|2017-03-15 05:15:48|    0|9WsORk6y22cnCqRdC...|  5.0|My wife is gluten...|     0|Bn0xnhjUMOQib_Y-7...|
|MDtMV0ld7q0BsQPKN...|   0|2014-02-05 18:37:14|    0|XJvECqxFNuAB5nhTG...|  5.0|The best c

Here we use only 1% of the data, since the full file is about 6GB. If you have a sufficiently large cluster or you're ready to wait more than several minutes, you can change the sampling ratio.
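
Keep in mind that `sample(0.01)` keeps each row independently with probability 0.01, so the result holds roughly, not exactly, 1% of the rows. A small pure-Python sketch of that behaviour (an illustration with made-up names and toy data, not Spark's implementation):

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return [row for row in rows if rng.random() < fraction]


rows = list(range(100_000))
sample = bernoulli_sample(rows, fraction=0.01, seed=42)

# The size is close to 1% of the input, but usually not exactly 1000.
print(len(sample))
```

In PySpark the same reproducibility can be obtained by passing a seed, e.g. `sample(fraction=0.01, seed=42)`.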

# Transforming Data

Spark has a vast library of feature engineering functions. For example, we can compute a TF-IDF representation of our review corpus. In the following snippet we construct a data preparation pipeline with three stages:
1. we split the review text into words
1. we count term frequencies in the resulting bags of words
1. we rescale them by inverse document frequency

+--------------------+--------------------+--------------------+
|                text|               words|           embedding|
+--------------------+--------------------+--------------------+
|I've got a 2 hour...|[i've, got, a, 2,...|[-0.0149417100758...|
|Excellent pizza! ...|[excellent, pizza...|[0.01702762021929...|
|Food is yummy, ho...|[food, is, yummy,...|[-0.0035166407428...|
|My wife is gluten...|[my, wife, is, gl...|[0.05412049598267...|
|The best chain bu...|[the, best, chain...|[0.00780793732867...|
+--------------------+--------------------+--------------------+
only showing top 5 rows

CPU times: user 146 ms, sys: 30.7 ms, total: 177 ms
Wall time: 2min 41s


In [9]:
%%time

from pyspark.ml.pipeline import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

data_preparation = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="term_frequency"),
    IDF(inputCol="term_frequency", outputCol="embedding")
])
prepared_reviews = data_preparation.fit(reviews).transform(reviews)
prepared_reviews.select("text", "words", "term_frequency", "embedding").show(n=5)

+--------------------+--------------------+--------------------+--------------------+
|                text|               words|      term_frequency|           embedding|
+--------------------+--------------------+--------------------+--------------------+
|I've got a 2 hour...|[i've, got, a, 2,...|(262144,[7221,844...|(262144,[7221,844...|
|Excellent pizza! ...|[excellent, pizza...|(262144,[1689,870...|(262144,[1689,870...|
|Food is yummy, ho...|[food, is, yummy,...|(262144,[12409,27...|(262144,[12409,27...|
|My wife is gluten...|[my, wife, is, gl...|(262144,[4959,553...|(262144,[4959,553...|
|The best chain bu...|[the, best, chain...|(262144,[1689,182...|(262144,[1689,182...|
+--------------------+--------------------+--------------------+--------------------+
only showing top 5 rows

CPU times: user 22.9 ms, sys: 8.32 ms, total: 31.2 ms
Wall time: 4.75 s
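
To make the three stages more concrete, here is a pure-Python sketch of what such a pipeline computes on a toy corpus. It is a simplified illustration under stated assumptions, not Spark's code: `crc32` stands in for Spark's hash function, the number of buckets is tiny, and all function names and the three documents are made up. The IDF weight follows the smoothed formula from the Spark documentation, `log((n_docs + 1) / (doc_freq + 1))`.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 16  # tiny on purpose; Spark's HashingTF defaults to 262144 buckets


def tokenize(text):
    """Stage 1: lowercase the text and split it on whitespace."""
    return text.lower().split()


def hashing_tf(words, num_features=NUM_FEATURES):
    """Stage 2: count words per hash bucket (sparse dict: bucket -> count)."""
    # crc32 is a stand-in hash chosen for this sketch, not the one Spark uses.
    return dict(Counter(zlib.crc32(w.encode("utf-8")) % num_features for w in words))


def fit_idf(tf_vectors):
    """Stage 3a: learn smoothed inverse document frequencies from the corpus."""
    n_docs = len(tf_vectors)
    # Each bucket is counted once per document in which it occurs.
    doc_freq = Counter(bucket for tf in tf_vectors for bucket in tf)
    return {b: math.log((n_docs + 1) / (df + 1)) for b, df in doc_freq.items()}


def tf_idf(tf, idf):
    """Stage 3b: rescale term frequencies by the learned IDF weights."""
    return {bucket: count * idf[bucket] for bucket, count in tf.items()}


docs = ["The pizza was excellent", "The food is yummy", "The best chain burger"]
term_frequencies = [hashing_tf(tokenize(d)) for d in docs]
idf = fit_idf(term_frequencies)
embeddings = [tf_idf(tf, idf) for tf in term_frequencies]

for doc, emb in zip(docs, embeddings):
    print(doc, "->", emb)
```

The bucket of the word "the" occurs in every document, so its weight is `log(4 / 4) = 0`: words that appear everywhere carry no information about a particular review.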


Let's look into the details of the first row:

In [11]:
prepared_reviews.select("text", "words", "term_frequency", "embedding").head()

Row(text="I've got a 2 hour layover so armed with my trusty ipad I decided to review this box restaurant. I usually don't review places like this. They are all the same to me.  Food is on the average side. I ordered a chilli, coffee, and a donut.  The food tastes the same as every other T Hortons I've been to. That's the advantage of these box restaurants, generally there are no surprises. \n\nMy tirade is going to be on the service staff. These women are awful!  The counter person who served me did not dare look me in the eyes when addressing me. Avoiding eye contact, looking down, she mumbled something incoherent to me. I had to ask her twice what she was trying to say to me. She wanted to know what kind of bun do I want with my chilli.... Whole wheat or white.\n\nI know many of the counter people who work at T Hortons are immigrants working for min wage and they are working long hours, but have a bit of empathy!  A weary traveler would certainly appreciate a friendly smile and a swe

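Note that the TF-IDF vectors are stored in a sparse representation: each vector is printed as its size (262144 here) followed by the indices and the values of its non-zero entries only.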
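
As a small illustration of why this matters, the sketch below expands a sparse triple (size, indices, values) into a dense list. The helper name and the toy numbers are made up for this example.

```python
def to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense


size, indices, values = 8, [1, 4, 6], [2.0, 0.5, 1.0]
dense = to_dense(size, indices, values)
print(dense)  # [0.0, 2.0, 0.0, 0.0, 0.5, 0.0, 1.0, 0.0]

# The sparse form stores 2 numbers per non-zero entry, the dense form stores `size` numbers.
stored_sparse = len(indices) + len(values)
stored_dense = len(dense)
print(stored_sparse, stored_dense)  # 6 8
```

With 262144 dimensions and only as many non-zero entries as there are distinct words in a review, the sparse form is smaller by several orders of magnitude.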

# Do It Yourself
* calculate `word2vec` embeddings instead of TF-IDF
* build a linear regression (predict stars by text)
* split data into train and validation sets and evaluate your model
* compare the quality of models (TF-IDF vs word2vec, linear vs random forest vs gradient boosted trees)