# Spark WordCount Example

Example word-count code with Apache Spark, run locally in a Jupyter notebook with an IPython kernel, using **Spark's DataFrame API**.

First, we need to connect this Jupyter notebook to PySpark:

In [None]:
import findspark
findspark.init()

In [None]:
import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
import random

The starting point of every Spark program is the instantiation of a **SparkSession**.

In [None]:
spark = SparkSession.builder \
        .master("local[1]") \
        .appName("wordcount") \
        .getOrCreate()

The Spark web UI is now available, by default on port 4040 of the machine on which this notebook runs (e.g. http://localhost:4040 ).

Wait until the previous cell has finished executing before continuing.

In [None]:
text_df = spark.read.text("data/shakespeare-hamlet.txt")

Note: the following cells start execution in Spark. If you want to demonstrate or experience Spark's lazy evaluation, skip the next three cells and continue with the .withColumn() step.

In [None]:
text_df.describe()

In [None]:
text_df.printSchema()

In [None]:
text_df.show()
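The idea behind lazy evaluation can be illustrated without Spark. The following is a plain-Python analogy using generators, not Spark code: building the pipeline does no work, and only consuming it (the counterpart of a Spark action such as show()) runs the steps. The sample lines and the helper names `read_lines` and `split_words` are made up for this sketch.

```python
# Plain-Python analogy for lazy evaluation (generators, not Spark).
log = []  # records when lines are actually read

def read_lines(lines):
    """Yield lines one by one, logging each read."""
    for line in lines:
        log.append(f"read: {line}")
        yield line

def split_words(lines):
    """Yield every space-separated token of every line."""
    for line in lines:
        for token in line.split(" "):
            yield token

sample_lines = ["to be", "or not"]

# Building the pipeline corresponds to chaining transformations:
# nothing has been read yet.
pipeline = split_words(read_lines(sample_lines))
log_before = list(log)

# Consuming the pipeline corresponds to an action: now the work happens.
tokens = list(pipeline)
```

Spark goes further than this analogy, since it can also optimise the whole plan before running it, but the deferred execution is the same idea.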

In [None]:
words = text_df.withColumn('word', f.explode(f.split(f.col('value'), ' ')))  # TODO: lowercase the text before splitting?
words
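To see what the split/explode step produces, and why the lowercasing question noted in the cell above matters, here is a plain-Python stand-in. It is only a sketch that mirrors the behaviour for simple cases; the sample lines are made up and no Spark is involved.

```python
# Plain-Python stand-in for f.explode(f.split(col, ' ')); made-up sample lines.
sample_lines = ["To be,  or not to be", "to BE"]

# Splitting on a single space keeps an empty string wherever two spaces meet,
# and "exploding" flattens the per-line lists into one token per entry.
exploded = [token for line in sample_lines for token in line.split(" ")]

# One possible normalisation: lowercase every token and drop the empty ones,
# so that "To", "to", "BE" and "be" are counted together.
normalised = [token.lower() for token in exploded if token != ""]
```

Without normalisation, the empty string and differently cased variants each become their own "word" in the final counts. In Spark, the corresponding building blocks would be f.lower() and a filter on empty strings.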

In [None]:
wordgroups = words.groupBy('word')
wordgroups

In [None]:
counts = wordgroups.count().sort('count', ascending=False)

In [None]:
counts.show()
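Conceptually, groupBy('word') followed by count() and a descending sort computes the same thing as the following single-machine sketch. It uses only the Python standard library and a made-up word list; the names `sample_words`, `tally` and `ranked` are chosen for illustration.

```python
from collections import Counter

# Made-up word list standing in for the exploded 'word' column.
sample_words = ["to", "be", "or", "not", "to", "be", "to"]

# Group identical words and count them (groupBy + count).
tally = Counter(sample_words)

# Order by count, highest first (sort('count', ascending=False)).
ranked = sorted(tally.items(), key=lambda pair: pair[1], reverse=True)
```

The difference is where the work happens: Spark can spread the grouping over several partitions and workers, while this sketch holds everything in one Python process.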

The execution plan for a Spark DataFrame consists of multiple operations at the lower RDD level within the Spark engine. These operations are parallelised automatically when more than one executor is involved. An executor can run on a cluster node in a distributed installation, or as a set of threads in a local installation like this one (local[1] uses a single worker thread).

The explain() method lets us look at this internal execution plan:

In [None]:
counts.explain()

To retrieve data from the Spark execution engine into the notebook's Python code, we have two options:
1. using the DataFrame.collect() method
2. converting the Spark DataFrame into a Pandas DataFrame with the DataFrame.toPandas() method

Approach 2 is shown below:

In [None]:
import pandas as pd
df = counts.toPandas()
df

In [None]:
spark.stop()