# SQL Operations Demo

One benefit of Spark is that it supports SQL operations on streaming data.

To do this, first create a `SparkSession` (or, as in this demo, a `SQLContext`) from the `SparkContext` that the `StreamingContext` is using.
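
As a rough sketch of that step, the helper below lazily creates a single shared `SparkSession` from the configuration of the `SparkContext` behind each batch, following the pattern described in the Spark Streaming programming guide. The names `get_spark_session_instance`, `spark_session_singleton`, and `show_batch` are hypothetical and chosen only for illustration. It assumes PySpark is importable by the time the helper is first called.

```python
def get_spark_session_instance(spark_conf):
    """Return one shared SparkSession, creating it on first use."""
    # Imported lazily so the helper can be defined before findspark.init() has run
    from pyspark.sql import SparkSession

    if "spark_session_singleton" not in globals():
        globals()["spark_session_singleton"] = (
            SparkSession.builder.config(conf=spark_conf).getOrCreate()
        )
    return globals()["spark_session_singleton"]


def show_batch(time, rdd):
    """Hypothetical foreachRDD callback: turn one batch of words into a DataFrame."""
    # Nothing to show for an empty batch
    if rdd.isEmpty():
        return
    # Reuse the SparkContext that the StreamingContext is already using
    spark = get_spark_session_instance(rdd.context.getConf())
    words_df = spark.createDataFrame(rdd.map(lambda w: (w,)), ["word"])
    words_df.show()
```

A callback like this would be registered with `foreachRDD` on a DStream of words. Reusing one session avoids building a new one for every batch.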


## Demo

For this demonstration, we use Spark SQL to analyze words from a DStream that reads text files from the `data` directory.

In [1]:
import findspark
# TODO: your path will likely not have 'matthew' in it. Change it to reflect your path.
findspark.init('/home/matthew/spark-2.1.0-bin-hadoop2.7')

In [2]:
from pyspark.sql import SQLContext
from pyspark import SparkContext
from operator import add
from pyspark.sql.functions import regexp_replace, col, trim, lower, desc
from pyspark.streaming import StreamingContext
from pyspark.sql import Row, SparkSession
import re
import sys

In [3]:
# Run locally with two threads
sc = SparkContext(master="local[2]", appName="PysparkSQLNetworkWordCount")
# Process the stream in batches of 1 second
ssc = StreamingContext(sc, 1)
sqlContext = SQLContext(sc)

In [4]:
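# Monitor the "data" directory and read new text files as they appear
lines = ssc.textFileStream("data")
# Lowercase each line and split it into words on single spaces
words = lines.flatMap(lambda line: re.split(' ', line.lower().strip()))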

In [5]:
# Convert each RDD of the words DStream to a DataFrame and run SQL-style queries on it
def analyze(time, rdd):
    print("========= %s =========" % str(time))
    try:
        # keep words of more than three characters
        rdd = rdd.filter(lambda x: len(x) > 3)
        # skip empty batches: a schema cannot be inferred from an empty RDD
        if rdd.isEmpty():
            return

        # 1 count per word
        rdd = rdd.map(lambda w: (w, 1))
        words_df = sqlContext.createDataFrame(rdd, ['word', 'count'])
        words_df.show()

        # strip punctuation and whitespace (the hyphen is escaped so it is not read as a range)
        df_transformed = words_df.select(lower(trim(regexp_replace(col('word'), r'[.,\/#$%^&*()\-_+=~!"\s]', ''))).alias('keywords'))

        # Print the top 250 words
        top_words = df_transformed.groupby('keywords').count().sort(desc('count')).limit(250)
        top_words.show(250)
    except Exception as e:
        # report the error instead of silently swallowing it
        print("Batch %s failed: %s" % (time, e))
words.foreachRDD(analyze)
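
In a regular-expression character class, such as the one passed to `regexp_replace` in `analyze`, a hyphen between two characters denotes a range unless it is escaped or placed first or last. The standalone check below uses Python's `re` module rather than Spark (Spark evaluates Java regular expressions, which treat this case the same way). The sample token is made up for illustration.

```python
import re

# Unescaped: ")-_" is read as the ASCII range 0x29-0x5F,
# which also covers digits and uppercase letters
range_class = re.compile(r'[()-_]')
# Escaped: the hyphen is a literal character
literal_class = re.compile(r'[()\-_]')

token = "(Route-66_east)"  # hypothetical sample token
stripped_by_range = range_class.sub("", token)
stripped_by_literal = literal_class.sub("", token)

print(stripped_by_range)    # outeeast
print(stripped_by_literal)  # Route66east
```

With the unescaped hyphen, the digits and the uppercase letter disappear along with the punctuation, so any numeric tokens in the stream would be lost.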

In [6]:
ssc.start()

+---------+-----+
|     word|count|
+---------+-----+
|    seuss|    1|
|   shine.|    1|
|    play.|    1|
|    house|    1|
|     that|    1|
|    cold,|    1|
|    cold,|    1|
|     day.|    1|
|    there|    1|
|     with|    1|
|   sally.|    1|
|   there,|    1|
|     two.|    1|
|    said,|    1|
|     "how|    1|
|     wish|    1|
|something|    1|
|     do!"|    1|
|     cold|    1|
|     play|    1|
+---------+-----+
only showing top 20 rows
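
The table above lists one row per word occurrence, before punctuation is stripped and counts are aggregated. As a plain-Python analogue of the later `groupby('keywords').count()` step, the sketch below normalizes a few words copied from that table and counts them with `collections.Counter`. It does not use Spark, and the names `sample_words`, `normalize`, and `keyword_counts` are illustrative only.

```python
import re
from collections import Counter

# A few values copied from the `word` column shown above
sample_words = ["cold,", "cold,", "cold", "play.", "play", "there", "there,", "wish"]

# Same idea as the regexp_replace step: drop punctuation and whitespace
PUNCTUATION = re.compile(r'[.,/#$%^&*()\-_+=~!"\s]')


def normalize(word):
    """Strip punctuation and whitespace, then lowercase the word."""
    return PUNCTUATION.sub("", word).strip().lower()


# Rough analogue of groupby('keywords').count().sort(desc('count'))
keyword_counts = Counter(normalize(w) for w in sample_words)
top_keywords = keyword_counts.most_common(3)
print(top_keywords)
```

After normalization, `cold,` and `cold` collapse into a single keyword, which is why the aggregated counts can exceed the 1 shown in each row of the raw table.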



In [None]:
ssc.stop(stopSparkContext=True, stopGraceFully=True)

## References
1. https://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations
2. https://spark.apache.org/docs/latest/sql-programming-guide.html