# Spark WordCount Example

Example word count code with Apache Spark, run locally in a Jupyter notebook with the IPython kernel, using **Spark's RDD API**.

In [None]:
import findspark
findspark.init()

In [None]:
import pyspark
from pyspark.sql import SparkSession
import random

Every Spark program starts by getting access to either a SparkSession or a SparkContext. Here we create a SparkSession and then get access to the lower-level RDD API by retrieving the SparkContext from it:

In [None]:
spark = SparkSession.builder \
        .master("local[2]") \
        .appName("wordcount") \
        .getOrCreate()
sc = spark.sparkContext

Let's now do the actual word count using the Spark RDD API:

In [None]:
text_file = sc.textFile("data/shakespeare-hamlet.txt")

In [None]:
counts = text_file.flatMap(lambda line: line.lower().split()) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
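
To make the three transformations easier to follow, here is a minimal plain-Python sketch of what the chain computes. It does not use Spark at all; the function name `word_count` and the two sample lines are made up for illustration, and the real RDD version distributes this work across partitions.

```python
from collections import defaultdict

def word_count(lines):
    """Plain-Python equivalent of flatMap -> map -> reduceByKey (illustration only)."""
    # flatMap: each line yields zero or more lowercase words, flattened into one list
    words = [word for line in lines for word in line.lower().split()]
    # map: pair every word with an initial count of 1
    pairs = [(word, 1) for word in words]
    # reduceByKey: combine the values of identical keys with a + b
    totals = defaultdict(int)
    for word, n in pairs:
        totals[word] += n
    return dict(totals)

# Hypothetical sample input standing in for the lines of the text file
sample_lines = ["To be or not to be", "that is the question"]
sample_counts = word_count(sample_lines)
print(sample_counts)
```

Because the split is on whitespace only, punctuation stays attached to words, so "be" and "be," would be counted as different keys in both this sketch and the RDD version.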

The actual execution is triggered by an RDD 'action', such as the collect() call, which retrieves the results into a Python variable:

In [None]:
wordcount = counts.collect()
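
The idea that transformations are only recorded and nothing runs until an action is called can be illustrated with a Python generator. This is only an analogy under our own assumptions: Spark does not use generators like this internally, and the names `to_words` and `evaluated` are hypothetical.

```python
evaluated = []  # records which lines have actually been processed

def to_words(lines):
    """Lazily split lines into lowercase words (loosely analogous to a transformation)."""
    for line in lines:
        evaluated.append(line)
        yield from line.lower().split()

lazy_words = to_words(["To be", "or not to be"])  # nothing is processed yet
before = len(evaluated)

words = list(lazy_words)  # consuming the generator plays the role of the action
after = len(evaluated)

print(before, after, words)
```

Defining `lazy_words` costs nothing; the work happens only when the result is consumed, which is roughly the role an action plays for an RDD.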

In [None]:
print(wordcount)
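
The collected result is an unordered list of (word, count) tuples, so printing it in full is hard to read. A small sketch of how the most frequent words could be picked out in plain Python follows; the sample list and the helper `top_words` are hypothetical stand-ins, not part of the notebook.

```python
# Hypothetical sample in the same (word, count) shape that collect() returns
sample_wordcount = [("the", 3), ("hamlet", 1), ("to", 2), ("be", 2)]

def top_words(pairs, n):
    """Return the n most frequent (word, count) pairs, ties broken alphabetically."""
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))[:n]

top_two = top_words(sample_wordcount, 2)
print(top_two)
```

For large inputs it is usually preferable to avoid collecting everything to the driver; something like `counts.takeOrdered(10, key=lambda pair: -pair[1])` on the RDD itself should return only the top entries.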

**Cleanup:** Stop Spark at the end:

In [None]:
sc.stop()