# Spark Preparation
First, we check whether the notebook is running in Google Colab. If it is, we install all the necessary packages.

To run Spark in Colab, we first need to install its dependencies in the Colab environment: Apache Spark 3.2.1 with Hadoop 3.2, Java 8, and findspark, which locates the Spark installation on the system. These tools can be installed from inside the Colab notebook itself.
Learn more from [A Must-Read Guide on How to Work with PySpark on Google Colab for Data Scientists!](https://www.analyticsvidhya.com/blog/2020/11/a-must-read-guide-on-how-to-work-with-pyspark-on-google-colab-for-data-scientists/)

In [None]:
try:
  import google.colab
  IN_COLAB = True
except ImportError:
  IN_COLAB = False

In [None]:
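if IN_COLAB:
    # Java 8 runtime required by Spark
    !apt-get install openjdk-8-jdk-headless -qq > /dev/null
    # Download and unpack Spark 3.2.1 built for Hadoop 3.2
    # (if this mirror no longer hosts 3.2.1, older releases are kept at archive.apache.org/dist/spark/)
    !wget -q https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
    !tar xf spark-3.2.1-bin-hadoop3.2.tgz
    !mv spark-3.2.1-bin-hadoop3.2 spark
    # findspark makes the unpacked Spark importable from Python
    !pip install -q findspark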

In [None]:
if IN_COLAB:
  import os
  os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
  os.environ["SPARK_HOME"] = "/content/spark"
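Before initialising Spark, it can help to confirm that both environment variables point to directories that actually exist, since a wrong path tends to surface later as a less obvious error. The following is a minimal sketch; `missing_dirs` is a hypothetical helper name introduced here for illustration and is not part of findspark or PySpark.

```python
import os

def missing_dirs(var_names):
    """Return the environment variables that are unset or do not point to a directory."""
    missing = []
    for name in var_names:
        path = os.environ.get(name)
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing

# An empty list means every variable points to an existing directory.
print(missing_dirs(["JAVA_HOME", "SPARK_HOME"]))
```

Outside Colab the two variables may legitimately be unset, in which case both names are reported.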

# Start a Local Cluster
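`findspark.init()` locates the Spark installation (from `SPARK_HOME` when it is set) and makes `pyspark` importable; the local cluster itself starts when the `SparkSession` is created with a local master URL. If you plan to use a remote cluster, skip `findspark.init()` and change `cluster_url` accordingly.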

In [None]:
import findspark
findspark.init()

Spark Streaming needs **at least 2 cores**: one to receive data (socket, Kafka, etc.) and one to process it. We will therefore use **'local[2]'** for our local cluster.

In [None]:
cluster_url = 'local[2]'
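The number in brackets is the number of worker threads Spark runs locally: `local` means one thread and `local[*]` means one per logical core. As a rough sketch of how the two-core requirement could be checked before creating the session, the hypothetical helper `local_threads` below (not a Spark API) parses the simple forms of a local master URL. It deliberately ignores other forms such as `local[N,F]`.

```python
import os
import re

def local_threads(master_url):
    """Return the thread count for a local master URL, or None for other masters."""
    if master_url == "local":
        return 1
    match = re.fullmatch(r"local\[(\*|\d+)\]", master_url)
    if match is None:
        # e.g. spark://host:7077; the core count is not encoded in the URL
        return None
    if match.group(1) == "*":
        return os.cpu_count() or 1
    return int(match.group(1))

threads = local_threads("local[2]")
if threads is not None and threads < 2:
    print("Warning: fewer than 2 local threads; streaming may not make progress")
```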

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder\
        .master(cluster_url)\
        .appName("Spark Streaming")\
        .config('spark.ui.port', '4040')\
        .getOrCreate()
sc = spark.sparkContext

# Basic Structured Streaming Commands

We use the rate source, which generates data at the specified number of rows per second. Each output row contains a `timestamp` and a `value`: `timestamp` is of Timestamp type and holds the time of message dispatch, and `value` is of Long type and holds the message count, **starting from 0 for the first row**. This source is intended for testing and benchmarking.

In [None]:
df = spark \
    .readStream \
    .format('rate') \
    .option('rowsPerSecond', 1) \
    .load()
df.printSchema()
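To make the shape of these rows concrete, here is a plain-Python sketch that mimics what the rate source emits. It does not use Spark; `simulate_rate_rows` and the fixed start time are assumptions made purely for illustration.

```python
from datetime import datetime, timedelta

def simulate_rate_rows(rows_per_second, seconds, start=None):
    """Yield (timestamp, value) tuples shaped like rate source rows."""
    start = start or datetime(2024, 1, 1)  # arbitrary start time for the sketch
    step = timedelta(seconds=1) / rows_per_second
    for value in range(rows_per_second * seconds):
        # value counts up from 0; timestamps are evenly spaced
        yield (start + value * step, value)

for row in simulate_rate_rows(rows_per_second=1, seconds=3):
    print(row)
```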

In [None]:
# Check whether the DataFrame is backed by a streaming source
print(df.isStreaming)


In [None]:
from pyspark.sql.functions import avg, count

In [None]:
# Global aggregation over the whole stream: running row count and mean of `value`
new_df = df.select(count('value').alias('count'), avg('value').alias('mean'))
new_df.printSchema()

## Trigger the stream processing

In [None]:
# Print the raw rows of each micro-batch to the console every 3 seconds
query_df = df \
    .writeStream \
    .format("console") \
    .trigger(processingTime='3 seconds') \
    .start(truncate=False)

In [None]:
# Start running the query that prints the running count and mean to the console
query_newdf = new_df \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .trigger(processingTime='3 seconds') \
    .start(truncate=False)
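In complete mode, every trigger prints the whole result table, which here is a single row aggregated over all rows received since the query started. The plain-Python sketch below illustrates that behaviour without Spark. The batch contents are hypothetical, since the actual number of rows per trigger depends on timing, and `running_count_mean` is an illustrative name.

```python
def running_count_mean(batches):
    """Return (count, mean) over all values seen so far, once per micro-batch."""
    total = 0
    seen = 0
    results = []
    for batch in batches:
        seen += len(batch)
        total += sum(batch)
        results.append((seen, total / seen if seen else None))
    return results

# Three hypothetical micro-batches, as if 3 rows arrived per 3-second trigger
batches = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
for count, mean in running_count_mean(batches):
    print(count, mean)
```

Each printed pair covers all values up to and including that batch, not just the new rows, which is why the count keeps growing from trigger to trigger.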

In [None]:
# Let the queries run for up to 30 seconds, then stop both of them
query_newdf.awaitTermination(30)
query_newdf.stop()
query_df.stop()
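When several queries are running, stopping them one by one is easy to get wrong. A possible alternative, sketched here under the assumption that the `spark` session from earlier is still available, is to loop over `spark.streams.active`, which lists the currently active streaming queries. The helper name `stop_all_queries` is hypothetical.

```python
def stop_all_queries(session):
    """Stop every active streaming query on the session and return their ids."""
    stopped = []
    for query in session.streams.active:
        query.stop()
        stopped.append(query.id)
    return stopped

# Usage in this notebook (assumes the `spark` session created earlier):
# stop_all_queries(spark)
```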