# SQL Operations Demo

One benefit of Spark is that DataFrame and SQL operations can be applied to streaming data.

To do this, first create a `SparkSession` from the `SparkContext` that the `StreamingContext` is using; each RDD in the DStream can then be converted to a DataFrame and queried.


### Demo

This demonstration uses Spark SQL to find the most frequent words in a DStream read from a network socket.

In [None]:
import re
import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import col, desc, lower, regexp_replace, trim


if len(sys.argv) != 3:
    print("Usage: sql_network_wordcount.py <hostname> <port>", file=sys.stderr)
    sys.exit(-1)
host, port = sys.argv[1:]
sc = SparkContext(appName="PythonSqlNetworkWordCount")
ssc = StreamingContext(sc, 1)

lines = ssc.socketTextStream(host, int(port))
words = lines.flatMap(lambda line: re.split(' ', line.lower().strip()))
   
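# Convert each RDD of the words DStream to a DataFrame and query it
def analyze(time, rdd):
    print("========= %s =========" % str(time))
    try:
        # Keep only words of more than three characters
        rdd = rdd.filter(lambda x: len(x) > 3)
        if rdd.isEmpty():
            # Schema inference fails on an empty RDD, so skip empty batches
            return

        # Get the SparkSession tied to the SparkContext the StreamingContext uses
        spark = SparkSession.builder.config(conf=rdd.context.getConf()).getOrCreate()

        # Pair each word with a count of 1
        rdd = rdd.map(lambda w: (w, 1))
        words_df = spark.createDataFrame(rdd, ['word', 'count'])
        words_df.show()

        # Strip punctuation; the hyphen is escaped so that ")-_" is not read as a range
        df_transformed = words_df.select(
            lower(trim(regexp_replace(col('word'), r'[.,/#$%^&*()\-_+=~!"\s]+', ''))).alias('keywords'))

        # Print the top 250 words
        top_words = df_transformed.groupBy('keywords').count().sort(desc('count')).limit(250)
        top_words.show(250)

    except Exception as e:
        # Report the error instead of silently swallowing it
        print("Batch analysis failed: %s" % e, file=sys.stderr)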

words.foreachRDD(analyze)
ssc.start()
ssc.awaitTermination()
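
The per-batch logic can be sanity-checked without a running stream by applying the same steps to a plain Python list. The sketch below is an illustrative stand-in, not Spark code: `top_keywords` and `PUNCTUATION` are hypothetical names, and Python's `re` module replaces Spark's `regexp_replace`. It also shows why the hyphen in the punctuation pattern should be escaped: inside a character class, an unescaped `)-_` is read as a range that also covers digits and uppercase letters.

```python
import re
from collections import Counter

# Punctuation to strip; the hyphen is escaped so it is a literal, not a range
PUNCTUATION = re.compile(r'[.,/#$%^&*()\-_+=~!"\s]+')


def top_keywords(words, n=250):
    """Plain-Python stand-in for the per-batch analysis (hypothetical helper)."""
    # Keep words of more than three characters, as in the filter step
    long_words = [w for w in words if len(w) > 3]
    # Strip punctuation and normalise case, as in the select step
    keywords = [PUNCTUATION.sub('', w).strip().lower() for w in long_words]
    # Count occurrences and return the n most frequent, as in the groupby/sort step
    return Counter(keywords).most_common(n)


sample = ['data.', 'data,', 'data', 'spark-sql', 'spark-sql', 'the', 'streams!']
print(top_keywords(sample, n=2))
```

Because the length filter runs before punctuation is removed, a token such as `the.` would pass the filter and end up as the three-letter keyword `the`; reorder the two steps if that is not the intended behaviour.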

## References
1. https://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations
2. https://spark.apache.org/docs/latest/sql-programming-guide.html