# SQL Operations Exercise

The exercise below is very similar to the word count example in the [Spark documentation](https://spark.apache.org/docs/latest/streaming-programming-guide.html#a-quick-example). In addition, it lets you specify which words to count in the DataFrame by using SQL queries.


## Exercise

The script below is skeleton code that produces a word count from the `words` DStream.

The objectives are the following:
1. Create a socket stream on the target ip:port and count the words in an input stream of `\n`-delimited text (e.g. generated by `nc`).
2. In the section that converts RDDs of the `words` DStream to a DataFrame, run an SQL query.
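
If `nc` is not available, a small Python server can stand in for it as the text source for the first objective. The sketch below is only an illustration: `start_line_server` is a hypothetical helper, not part of Spark or of this exercise, and it assumes that exactly one client (the socket stream) connects.

```python
import socket
import threading


def start_line_server(lines, host="localhost", port=0):
    """Serve `lines` as newline-delimited text to the first client that connects.

    Hypothetical helper for illustration only. With port=0 the OS picks a
    free port; pass port=7777 to match the notebook below.
    Returns the bound port and the background thread doing the sending.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    bound_port = server.getsockname()[1]

    def serve():
        # Wait for a single client, send every line, then close everything.
        conn, _ = server.accept()
        with conn:
            for line in lines:
                conn.sendall((line + "\n").encode("utf-8"))
        server.close()

    thread = threading.Thread(target=serve, daemon=True)
    thread.start()
    return bound_port, thread
```

Calling `start_line_server(["among us", "few among valley"], port=7777)` before `ssc.start()` would make those two lines available to `ssc.socketTextStream('localhost', 7777)`.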


In [1]:
import findspark
# TODO: your path will likely not have 'matthew' in it. Change it to reflect your path.
findspark.init('/home/matthew/spark-2.1.0-bin-hadoop2.7')

In [2]:
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import Row, SparkSession

In [3]:
def getSparkSessionInstance(sparkConf):
    if ('sparkSessionSingletonInstance' not in globals()):
        globals()['sparkSessionSingletonInstance'] = SparkSession.builder.config(conf=sparkConf).getOrCreate()
    return globals()['sparkSessionSingletonInstance']

In [4]:
host = 'localhost'
port = 7777
sc = SparkContext(appName="PythonSqlNetworkWordCount")
ssc = StreamingContext(sc, 1)

# TODO: Create a socket stream on target ip:port and count the
# words in input stream of \n delimited text (e.g. generated by 'nc')
lines = ssc.socketTextStream(host, int(port))
words = lines.flatMap(lambda line: line.split(" "))

# The section below converts RDDs of the words DStream to a DataFrame and runs an SQL query
def process(time, rdd):
    print("========= %s =========" % str(time))

    try:
        # TODO: Get the singleton instance of SparkSession
        spark = getSparkSessionInstance(rdd.context.getConf())

        # TODO: Convert RDD[String] to RDD[Row] to DataFrame
        rowRdd = rdd.map(lambda w: Row(word=w))
        wordsDataFrame = spark.createDataFrame(rowRdd)

        # TODO: Create a temporary view using the DataFrame
        wordsDataFrame.createOrReplaceTempView("words")

        # TODO: Do word count on table using SQL and print it
        wordCountsDataFrame = spark.sql("select word, count(*) as total from words group by word")
        wordCountsDataFrame.show()
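    except Exception as e:
        # Report the failure instead of silently swallowing it
        # (e.g. a batch with no data, for which no DataFrame can be built)
        print("Skipping batch: %s" % e)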

words.foreachRDD(process)

In [5]:
ssc.start()
# ssc.awaitTermination()

+---------+-----+
|     word|total|
+---------+-----+
|      few|    2|
|    inner|    2|
|    often|    1|
|     now.|    2|
|   valley|    2|
|     grow|    2|
|    sense|    2|
|    among|    4|
|existence|    2|
|       us|    2|
|  present|    2|
|   within|    1|
|    could|    2|
| infinite|    1|
| talents.|    2|
|  stream;|    2|
|     down|    2|
|      who|    1|
| eternity|    1|
|   power,|    1|
+---------+-----+
only showing top 20 rows
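
The table above comes from the `GROUP BY` query in `process`. To experiment with that query, or to restrict the count to specific words, without a running stream, the sketch below uses Python's built-in `sqlite3` as a stand-in for Spark SQL. The `count_words` function and its `only` parameter are hypothetical names introduced for illustration; in the notebook, the same `WHERE word IN (...)` clause would go into the string passed to `spark.sql`.

```python
import sqlite3


def count_words(words, only=None):
    """Count occurrences per word with SQL; optionally keep only selected words.

    Hypothetical helper: sqlite3 stands in for Spark SQL here.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE words (word TEXT)")
    conn.executemany("INSERT INTO words VALUES (?)", [(w,) for w in words])

    query = "SELECT word, COUNT(*) AS total FROM words"
    params = []
    if only:
        # One placeholder per word keeps the filter values out of the SQL text.
        placeholders = ", ".join("?" for _ in only)
        query += " WHERE word IN (%s)" % placeholders
        params = list(only)
    query += " GROUP BY word ORDER BY word"

    rows = conn.execute(query, params).fetchall()
    conn.close()
    return rows


sample = "among us few among valley".split(" ")
print(count_words(sample))
# [('among', 2), ('few', 1), ('us', 1), ('valley', 1)]
print(count_words(sample, only=["among", "valley"]))
# [('among', 2), ('valley', 1)]
```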



In [6]:
ssc.stop(stopSparkContext=True, stopGraceFully=True)



## References
1. https://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations