# Spark Session

Spark can work with data located on HDFS or a non-distributed filesystem. It can also use YARN from Hadoop, or [Mesos](https://mesos.apache.org/), or a resource manager of its own. For this example we use the last option.

All distributed operations in Spark go through a so-called Spark session. Let's create one:

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.config("spark.driver.memory", "400g").getOrCreate()
spark

Here we cap the memory available to the driver so that Spark does not try to take over the whole server. Configuring Spark can be tricky at times. When you work on a dedicated Spark cluster, your admin will probably create a session for you.
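
As a rough sketch of how such a limit might be chosen, the snippet below derives the memory setting from the machine size and also pins the number of local worker threads. The helper names, the application name and the default numbers are hypothetical; only `master`, `appName`, `config` and `getOrCreate` are part of the `SparkSession.builder` API.

```python
def memory_setting(total_gb, share=0.5):
    """Format a share of the machine's memory as a Spark size string, e.g. '32g'."""
    return f"{max(1, int(total_gb * share))}g"


def make_session(total_gb=64, cores=4):
    """Create (or reuse) a local session limited to `cores` threads."""
    # Imported lazily so that memory_setting() works without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(f"local[{cores}]")   # run locally on `cores` threads
        .appName("yelp-reviews")     # hypothetical application name
        .config("spark.driver.memory", memory_setting(total_gb))
        .getOrCreate()
    )
```

Calling `make_session(total_gb=64, cores=4)` would request a local session with a 32 GB driver; adjust both numbers to your own machine.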

# Reading Data

Spark can consume data in a variety of formats, for example JSON. We use the [Yelp Dataset](https://www.yelp.com/dataset) for this example. It is easy to obtain and free to use for education and research.

In [6]:
%%time

spark.read.text("/workdir/boris/data/yelp_dataset/yelp_academic_dataset_review.json").count()

CPU times: user 25.3 ms, sys: 7.89 ms, total: 33.2 ms
Wall time: 4.25 s


8021122

This code simply reads the JSON file as text, line by line, and counts the number of lines. Let's compare the speed with the `wc` tool:

In [7]:
!time wc -l /workdir/boris/data/yelp_dataset/yelp_academic_dataset_review.json

8021122 /workdir/boris/data/yelp_dataset/yelp_academic_dataset_review.json
wc -l /workdir/boris/data/yelp_dataset/yelp_academic_dataset_review.json  0,30s user 2,02s system 99% cpu 2,323 total


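Although `wc` is implemented in C and is generally more efficient than the JVM code behind Spark, it uses only one CPU, whereas Spark spreads the work over all available cores. In the runs recorded above the two are of the same order (4.25 s wall time for Spark, about 2.3 s for `wc`), so these particular timings do not show the several-fold speed-up one would expect from parallel execution.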
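
The idea behind a parallel line count can be sketched in plain Python: split the file into byte ranges, count newline bytes in each range independently, and add up the partial results. This is only an illustration of the split-and-sum structure, not how Spark implements it. The function names are hypothetical, and threads are used just to keep the example self-contained, so no speed-up is implied.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def count_newlines(path, start, end, block=1 << 20):
    """Count newline bytes in the half-open byte range [start, end) of a file."""
    count = 0
    with open(path, "rb") as f:
        f.seek(start)
        remaining = end - start
        while remaining > 0:
            data = f.read(min(block, remaining))
            if not data:
                break
            count += data.count(b"\n")
            remaining -= len(data)
    return count


def parallel_line_count(path, n_chunks=4):
    """Sum per-chunk newline counts computed by a pool of workers."""
    size = os.path.getsize(path)
    # Evenly spaced byte offsets; the last one equals the file size.
    bounds = [size * i // n_chunks for i in range(n_chunks + 1)]
    ranges = zip(bounds, bounds[1:])
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        parts = pool.map(lambda r: count_newlines(path, *r), ranges)
        return sum(parts)
```

Because every newline byte falls into exactly one range, the chunk boundaries do not need to coincide with line boundaries.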

Parsing JSON in Spark is really simple:

In [12]:
reviews = spark.read.json(
    "/workdir/boris/data/yelp_dataset/yelp_academic_dataset_review.json"
)
reviews.show(n=5)

+--------------------+----+-------------------+-----+--------------------+-----+--------------------+------+--------------------+
|         business_id|cool|               date|funny|           review_id|stars|                text|useful|             user_id|
+--------------------+----+-------------------+-----+--------------------+-----+--------------------+------+--------------------+
|-MhfebM0QIsKt87iD...|   0|2015-04-15 05:21:16|    0|xQY8N_XvtGbearJ5X...|  2.0|As someone who ha...|     5|OwjRMXRC0KyPrIlcj...|
|lbrU8StCq3yDfr-QM...|   0|2013-12-07 03:16:52|    1|UmFMZ8PyXZTY2Qcwz...|  1.0|I am actually hor...|     1|nIJD_7ZXHq-FX8byP...|
|HQl28KMwrEKHqhFrr...|   0|2015-12-05 03:18:11|    0|LG2ZaYiOgpr2DK_90...|  5.0|I love Deagan's. ...|     1|V34qejxNsCbcgD8C0...|
|5JxlZaqCnk1MnbgRi...|   0|2011-05-27 05:30:52|    0|i6g_oA9Yf9Y31qt0w...|  1.0|Dismal, lukewarm,...|     0|ofKDkJKXSKZXu5xJN...|
|IS4cv902ykd8wj1TR...|   0|2017-01-14 21:56:57|    0|6TdNDKywdbjoTkize...|  4.0|Oh happy d
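
By default `spark.read.json` expects one JSON object per line, which is why the line count above equals the number of reviews. A minimal plain-Python sketch of that layout follows; the two records are made up for illustration and use only a few of the columns shown in the table.

```python
import json

# Two made-up records in the line-delimited layout: one JSON object per line.
raw = (
    '{"review_id": "r1", "stars": 2.0, "text": "Too pricey."}\n'
    '{"review_id": "r2", "stars": 5.0, "text": "Loved it!"}\n'
)

# Parse each non-empty line independently, as a line-oriented reader would.
records = [json.loads(line) for line in raw.splitlines() if line.strip()]

# Collect the union of keys in alphabetical order, matching the column
# order seen in the table above.
columns = sorted({key for record in records for key in record})

print(columns)
for record in records:
    print([record.get(column) for column in columns])
```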

This dataset is quite large, about 6 GB. If you have only a handful of CPUs, you may want to apply the [sample](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=sample#pyspark.sql.DataFrame.sample) method after reading the JSON and work with a subset.
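
A possible way to wrap the sampling step is sketched below. The helper name and the default fraction and seed are assumptions chosen for illustration; `DataFrame.sample` itself takes `withReplacement`, `fraction` and `seed`.

```python
def sample_reviews(df, fraction=0.1, seed=42):
    """Return a random subset holding roughly `fraction` of the rows of `df`."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # The sample size is approximate; the seed makes the subset reproducible.
    return df.sample(withReplacement=False, fraction=fraction, seed=seed)
```

With the DataFrame from the cell above this would be used as `reviews = sample_reviews(reviews, fraction=0.1)`.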

# Transforming Data

Spark has a large library of feature engineering functions. For example, we can compute a TF-IDF representation of our review corpus. In the following snippet we construct a data preparation pipeline with three stages:
1. split the review text into words
1. count term frequencies in the resulting bags of words
1. rescale the counts by inverse document frequency

In [13]:
%%time

from pyspark.ml.pipeline import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

data_preparation = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="term_frequency"),
    IDF(inputCol="term_frequency", outputCol="embedding")
])
prepared_reviews = data_preparation.fit(reviews).transform(reviews)
prepared_reviews.select("text", "words", "term_frequency", "embedding").show(n=5)

+--------------------+--------------------+--------------------+--------------------+
|                text|               words|      term_frequency|           embedding|
+--------------------+--------------------+--------------------+--------------------+
|As someone who ha...|[as, someone, who...|(262144,[991,1578...|(262144,[991,1578...|
|I am actually hor...|[i, am, actually,...|(262144,[1619,296...|(262144,[1619,296...|
|I love Deagan's. ...|[i, love, deagan'...|(262144,[9698,135...|(262144,[9698,135...|
|Dismal, lukewarm,...|[dismal,, lukewar...|(262144,[4106,462...|(262144,[4106,462...|
|Oh happy day, fin...|[oh, happy, day,,...|(262144,[2701,425...|(262144,[2701,425...|
+--------------------+--------------------+--------------------+--------------------+
only showing top 5 rows

CPU times: user 694 ms, sys: 487 ms, total: 1.18 s
Wall time: 3min 40s
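
To make the three stages concrete, here is a small plain-Python sketch of the same arithmetic on a made-up three-document corpus. It is a simplification: terms are kept in an ordinary dictionary instead of being hashed into a fixed number of buckets as `HashingTF` does, and the IDF uses the smoothed formula from the Spark feature extraction guide, `ln((N + 1) / (df + 1))`.

```python
import math
from collections import Counter

# Made-up corpus, for illustration only.
docs = [
    "The food was great",
    "The service was slow",
    "Great food great service",
]

# Stage 1: tokenise (lower-case and split on whitespace).
tokenised = [doc.lower().split() for doc in docs]

# Stage 2: raw term counts per document.
term_counts = [Counter(words) for words in tokenised]

# Stage 3: inverse document frequency, idf(t) = ln((N + 1) / (df(t) + 1)),
# where df(t) is the number of documents containing term t.
n_docs = len(docs)
doc_freq = Counter(term for words in tokenised for term in set(words))
idf = {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}

# TF-IDF weight of each term in each document.
tfidf = [
    {term: count * idf[term] for term, count in counts.items()}
    for counts in term_counts
]

for weights in tfidf:
    print({term: round(weight, 3) for term, weight in weights.items()})
```

Terms that occur in fewer documents, such as "slow" here, receive a larger IDF factor than terms shared by several documents.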


Let's look into the details of the first row:

In [14]:
prepared_reviews.select("text", "words", "term_frequency", "embedding").head()

Row(text='As someone who has worked with many museums, I was eager to visit this gallery on my most recent trip to Las Vegas. When I saw they would be showing infamous eggs of the House of Faberge from the Virginia Museum of Fine Arts (VMFA), I knew I had to go!\n\nTucked away near the gelateria and the garden, the Gallery is pretty much hidden from view. It\'s what real estate agents would call "cozy" or "charming" - basically any euphemism for small.\n\nThat being said, you can still see wonderful art at a gallery of any size, so why the two *s you ask? Let me tell you:\n\n* pricing for this, while relatively inexpensive for a Las Vegas attraction, is completely over the top. For the space and the amount of art you can fit in there, it is a bit much.\n* it\'s not kid friendly at all. Seriously, don\'t bring them.\n* the security is not trained properly for the show. When the curating and design teams collaborate for exhibitions, there is a definite flow. That means visitors should vi

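Note the representation of the TF-IDF vectors: it is sparse. Each vector is printed as its size (262144, the default number of features in `HashingTF`), followed by the indices of the non-zero entries and their values.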
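
The sparse layout can be illustrated with a short plain-Python sketch that converts between a dense list and a `(size, indices, values)` triple. The function names are hypothetical; in PySpark the corresponding type is `pyspark.ml.linalg.SparseVector`.

```python
def to_sparse(dense):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, value in enumerate(dense) if value != 0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Rebuild the full list from its sparse triple."""
    dense = [0.0] * size
    for i, value in zip(indices, values):
        dense[i] = value
    return dense


size, indices, values = to_sparse([0.0, 0.0, 3.0, 0.0, 1.5])
print(size, indices, values)
```

For a review with a few hundred distinct words, this stores a few hundred index/value pairs instead of 262144 numbers.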

# Do It Yourself
* calculate `word2vec` embeddings instead of TF-IDF
* build a linear regression (predict stars by text)
* split data into train and validation sets and evaluate your model
* compare the quality of models (TF-IDF vs word2vec, linear vs random forest vs gradient boosted trees)