# Spark Preparation
We first check whether the notebook is running in Google Colab. If so, we install all the necessary packages.

To run Spark in Colab, we first need to install its dependencies in the Colab environment: Apache Spark 3.2.1 with Hadoop 3.2, Java 8, and findspark, which locates Spark on the system. The installation can be carried out from inside the Colab notebook itself.
Learn more from [A Must-Read Guide on How to Work with PySpark on Google Colab for Data Scientists!](https://www.analyticsvidhya.com/blog/2020/11/a-must-read-guide-on-how-to-work-with-pyspark-on-google-colab-for-data-scientists/)

In [None]:
try:
  import google.colab
  IN_COLAB = True
except ImportError:
  IN_COLAB = False

In [None]:
if IN_COLAB:
    !apt-get install openjdk-8-jdk-headless -qq > /dev/null
    !wget -q https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
    !tar xf spark-3.2.1-bin-hadoop3.2.tgz
    !mv spark-3.2.1-bin-hadoop3.2 spark
    !pip install -q findspark

In [None]:
if IN_COLAB:
  import os
  os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
  os.environ["SPARK_HOME"] = "/content/spark"

# Start a Local Cluster
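Use findspark.init() to locate the local Spark installation so that pyspark can be imported; the local cluster itself starts when the SparkSession is created.  If you plan to use a remote cluster, skip findspark.init() and change cluster_url accordingly.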

In [None]:
import findspark
findspark.init()

Spark Streaming needs **at least 2 cores**: one to receive data (socket, Kafka, etc.) and one to process it.  We will use **'local[2]'** for our local cluster.

In [None]:
cluster_url = 'local[2]'

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder\
        .master(cluster_url)\
        .appName("Spark Streaming")\
        .config('spark.ui.port', '4040')\
        .getOrCreate()
sc = spark.sparkContext

# Basic Spark Streaming Commands

Create a streaming context with a 5-second mini-batch interval.

In [None]:
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 5)
ssc.checkpoint('./checkpoints/')

Due to network setup difficulties, we will use a queue of RDDs as our input stream.  The Spark Streaming programming guide has another version that uses socketTextStream.  To use it, simply substitute the next few code blocks with:

`lines = ssc.socketTextStream("localhost", 9000)`
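
The socket variant needs something listening on the given host and port before the stream starts. The sketch below is one way to provide that from Python, under stated assumptions: `serve_lines` is a hypothetical helper (not part of Spark), it serves a single client, and it sends UTF-8 text with one newline-terminated line per record, which is the format `socketTextStream` reads.

```python
import socket
import threading


def serve_lines(lines, host="localhost", port=9000):
    """Serve `lines` to the first client that connects, one line per record.

    Hypothetical helper for illustration. Returns the background thread and
    the port actually bound (useful when port=0 asks the OS to pick one).
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    bound_port = server.getsockname()[1]

    def handle():
        # Block until one client connects, send everything, then shut down.
        conn, _ = server.accept()
        with conn:
            for line in lines:
                conn.sendall((line + "\n").encode("utf-8"))
        server.close()

    thread = threading.Thread(target=handle, daemon=True)
    thread.start()
    return thread, bound_port
```

A call such as `serve_lines(["a long time ago", "in a galaxy far far away"])` would have to run before `ssc.start()`, so that the receiver has a server to connect to.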

In [None]:
!wget https://github.com/kaopanboonyuen/2110446_DataScience_2021s2/raw/main/code/week10_spark_streaming/star-wars.txt

In [None]:
# read a text file and create a list of 9 RDDs; the i-th RDD holds i lines of text
rdds = []
with open('star-wars.txt', encoding='ISO-8859-1') as fd:
    for i in range(1, 10):
        data = []
        for k in range(i):
            # read a line of text, strip the trailing newline, and skip blank lines
            text = fd.readline().strip()
            while not text:
                text = fd.readline().strip()
            data.append(text)
        rdds.append(sc.parallelize(data))
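
The batching logic in the cell above can also be tried without Spark. The sketch below is a plain-Python version under stated assumptions: `take_batches` is a hypothetical helper name, and unlike the cell above it stops cleanly when the input runs out of lines instead of reading past the end of the file.

```python
import itertools


def take_batches(lines, n_batches):
    """Return batches where the i-th batch (1-based) holds i non-blank lines.

    Hypothetical helper for illustration. If the input runs out, the last
    batch may be short and later batches are omitted.
    """
    stripped = (line.strip() for line in lines)
    non_blank = (line for line in stripped if line)
    batches = []
    for size in range(1, n_batches + 1):
        # islice pulls up to `size` items from the shared generator.
        batch = list(itertools.islice(non_blank, size))
        if not batch:
            break
        batches.append(batch)
    return batches
```

Each inner list could then be turned into an RDD with `sc.parallelize(batch)`, which would give the same growing batch sizes as the loop above.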

In [None]:
lines = ssc.queueStream(rdds)

## Example of Word Count in Spark Streaming

In [None]:
# Split each line into words
words = lines.flatMap(lambda line: line.split(' '))

# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)

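# Window operations with varied window parameters (window length, slide interval, in seconds)
# 10-second window (2 batches), sliding every 5 seconds (every batch)
twoWindowedWordCounts = pairs.reduceByKeyAndWindow(lambda x, y: x + y, lambda x, y: x - y, 10, 5)
# 15-second window (3 batches), sliding every 10 seconds (every 2 batches)
threeWindowedTwoSlideWordCounts = pairs.reduceByKeyAndWindow(lambda x, y: x + y, lambda x, y: x - y, 15, 10)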

# Print the counts of the word 'the' from each calculation to the console
wordCounts.filter(lambda x: x[0] == 'the').pprint()
twoWindowedWordCounts.filter(lambda x: x[0] == 'the').pprint()
threeWindowedTwoSlideWordCounts.filter(lambda x: x[0] == 'the').pprint()
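
To see what the two windowed counts should look like, the window arithmetic can be simulated in plain Python. This is a sketch under stated assumptions, not Spark's implementation: window length and slide interval are expressed in batches rather than seconds (so 10 s / 5 s becomes 2 / 1, and 15 s / 10 s becomes 3 / 2), the function names are hypothetical, and the per-batch counts of 'the' are made-up numbers. The second function mirrors the idea behind the inverse function `lambda x, y: x - y`: add what enters the window and subtract what leaves it, instead of re-summing the whole window.

```python
def windowed_counts(batch_counts, window_len, slide_len):
    """Sum the last `window_len` batches, emitting once every `slide_len` batches.

    Returns (batch_number, window_sum) pairs, with batches numbered from 1.
    """
    results = []
    for end in range(slide_len, len(batch_counts) + 1, slide_len):
        start = max(0, end - window_len)
        results.append((end, sum(batch_counts[start:end])))
    return results


def windowed_counts_incremental(batch_counts, window_len, slide_len):
    """Same result, updated incrementally: total + entering - leaving."""
    results = []
    total = 0
    for end in range(slide_len, len(batch_counts) + 1, slide_len):
        prev_end = end - slide_len
        prev_start = max(0, prev_end - window_len)
        new_start = max(0, end - window_len)
        entering = batch_counts[prev_end:end]         # batches new to the window
        leaving = batch_counts[prev_start:new_start]  # batches that dropped out
        total = total + sum(entering) - sum(leaving)
        results.append((end, total))
    return results


# Made-up per-batch counts of the word 'the' (not taken from star-wars.txt).
counts = [1, 0, 2, 3, 1, 4]
two_window = windowed_counts(counts, window_len=2, slide_len=1)
three_window_two_slide = windowed_counts(counts, window_len=3, slide_len=2)
```

With these numbers, the 2-batch window reports a value after every batch, while the 3-batch window with a 2-batch slide reports only after batches 2, 4, and 6.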

## Trigger the stream processing

In [None]:
ssc.start()

# wait up to 60 seconds, then go on to stop the stream processing
# (calling awaitTermination() with no argument waits indefinitely)
ssc.awaitTermination(60)

# stop the streaming context; by default this also stops the Spark context
ssc.stop()