# SQL Operations Demo

One benefit of Spark is that SQL operations can be applied to streaming data as well as to static data.

To do this, first create a `SparkSession` from the `SparkContext` that the `StreamingContext` is using. The demo below uses the older `SQLContext` entry point, which is built from the same `SparkContext`.


### Demo

For this demonstration, we will use Spark and SQL to analyze a table of data about Wikipedia articles.

In [None]:
from pyspark.sql import SQLContext
from pyspark import SparkContext
from operator import add
from pyspark.sql.functions import regexp_replace,col,trim,lower,desc
from pyspark.streaming import StreamingContext
import re
import sys

sc = SparkContext("local[2]", "pysparkWordCountforFile")
ssc = StreamingContext(sc, 1)
sqlContext = SQLContext(sc)
lines = ssc.socketTextStream("localhost", 9999)

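# path to the input file is passed as the first command-line argument
data = sc.textFile(sys.argv[1])
print('number of lines in file: %s' % data.count())

# total number of characters, summed over all lines
chars = data.map(lambda s: len(s)).reduce(add)
print('number of characters in file: %s' % chars)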

words = data.flatMap(lambda line: re.split(' ', line.lower().strip()))

# words of more than three characters
words = words.filter(lambda x: len(x)>3)
words.collect()

# 1 count per word
words = words.map(lambda w:(w,1))
words_df = sqlContext.createDataFrame(words,['word','count'])
words_df.show()

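# strip punctuation; the hyphen is escaped so it is matched literally
# (an unescaped ")-_" is read as a character range that also covers digits)
df_transformed = words_df.select(lower(trim(regexp_replace(col('word'), r'[.,\/#$%^&*()\-_+=~!"\s]', ''))).alias('keywords'))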

# Print the top 250 words
top_words = df_transformed.groupby('keywords').count().sort(desc('count')).limit(250)
top_words.show(250)

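# register an output operation on the socket stream;
# ssc.start() fails if no output operation has been registered
lines.pprint()

ssc.start()
ssc.awaitTermination()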
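
To check what the word-count pipeline above should produce, the same steps (lowercase, split on spaces, keep words of more than three characters, strip punctuation, count, take the top N) can be sketched in plain Python on a small sample. This is only an illustration under stated assumptions: `top_keywords` and the sample sentences are hypothetical, and the character class escapes the hyphen so that it is matched literally rather than read as a range.

```python
import re
from collections import Counter

# Punctuation and whitespace to strip; "\-" is a literal hyphen.
PUNCTUATION = re.compile(r'[.,\/#$%^&*()\-_+=~!"\s]')


def top_keywords(lines, n=250, min_length=4):
    """Return the n most common cleaned words of at least min_length characters."""
    counts = Counter()
    for line in lines:
        for word in line.lower().strip().split(' '):
            # The length filter runs before punctuation is removed,
            # in the same order as the Spark cell.
            if len(word) < min_length:
                continue
            keyword = PUNCTUATION.sub('', word)
            # Assumption: tokens that were punctuation only are skipped.
            if keyword:
                counts[keyword] += 1
    return counts.most_common(n)


sample = ["Spark makes streaming simple.", "Streaming with Spark, and SQL!"]
for keyword, count in top_keywords(sample, n=3):
    print(keyword, count)
```

Comparing this output with `top_words.show()` on the same small input is a quick way to confirm that the regular expression and the length filter behave as intended before running on the full file.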

## References
1. https://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations
2. https://spark.apache.org/docs/latest/sql-programming-guide.html