#Introduction to Spark SQL

Spark SQL is the Spark component that provides a SQL-like interface.  
In this tutorial we show how to use this component.

###Initialize SQL context:  
1 - Import SQL context.  
2 - Create Context.  
3 - Set the flag binaryAsString, which tells Spark SQL to treat binary-encoded data as strings.  
4 - Set the flag useDataSourceApi to false, for compatibility with the parquet dataset.

In [12]:
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)
sqlCtx.sql("SET spark.sql.parquet.binaryAsString=true")
sqlCtx.sql("SET spark.sql.parquet.useDataSourceApi=false")

DataFrame[: string]

###Load dataset file wiki_parquet into HDFS

In [6]:
%%bash
hdfs dfs -put /notebooks/data/wiki_parquet wiki_parquet

###Load dataset as RDD from HDFS

In [13]:
wikiData = sqlCtx.parquetFile("wiki_parquet")

Now we can count the rows of the dataset to verify that it loaded correctly.

In [14]:
wikiData.count()

39365L

###Register RDD as table

In [15]:
wikiData.registerTempTable("wikiData")

Now we can run the same count as a Spark SQL query.

In [17]:
result = sqlCtx.sql("SELECT COUNT(*) AS pageCount FROM wikiData").collect()
result[0].pageCount

39365

SQL can be a powerful tool for performing complex aggregations. For example, the following query returns the top 10 usernames by the number of pages they created.  
If the query fails with java.lang.OutOfMemoryError, restart pyspark with an explicit driver memory limit:  
usb/$ spark/bin/pyspark --driver-memory 1G

In [21]:
sqlCtx.sql("SELECT username, COUNT(*) AS cnt FROM wikiData WHERE username <> '' GROUP BY username ORDER BY cnt DESC LIMIT 10").collect()

[Row(username=u'Waacstats', cnt=2003),
 Row(username=u'Cydebot', cnt=949),
 Row(username=u'BattyBot', cnt=939),
 Row(username=u'Yobot', cnt=890),
 Row(username=u'Addbot', cnt=853),
 Row(username=u'Monkbot', cnt=668),
 Row(username=u'ChrisGualtieri', cnt=438),
 Row(username=u'RjwilmsiBot', cnt=387),
 Row(username=u'OccultZone', cnt=377),
 Row(username=u'ClueBot NG', cnt=353)]
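
The shape of this aggregation (filter, group, count, sort, limit) can be tried without a Spark cluster. The sketch below is only a stand-in: it uses Python's built-in sqlite3 module rather than Spark SQL, and the table contents are invented for illustration. It assumes the same query text behaves alike in both engines for this simple case.

```python
import sqlite3

# In-memory toy table standing in for wikiData; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wikiData (username TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO wikiData VALUES (?, ?)",
    [
        ("alice", "Page A"),
        ("bob", "Page B"),
        ("alice", "Page C"),
        ("", "Page D"),  # empty username, removed by the WHERE clause
        ("carol", "Page E"),
        ("alice", "Page F"),
        ("bob", "Page G"),
    ],
)

# Same structure as the tutorial query, with LIMIT 2 to fit the toy data.
top_users = conn.execute(
    "SELECT username, COUNT(*) AS cnt FROM wikiData "
    "WHERE username <> '' GROUP BY username ORDER BY cnt DESC LIMIT 2"
).fetchall()
conn.close()

print(top_users)  # [('alice', 3), ('bob', 2)]
```

The WHERE clause runs before grouping, so rows with an empty username never reach the counts.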

Now try writing the following query yourself.  
How many articles contain the word “california”?

In [24]:
#Solution
sqlCtx.sql("SELECT count(*) FROM wikiData where text like '%california%'").collect()

[Row(c0=1145)]
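
Note that LIKE in Spark SQL matches case-sensitively, so the pattern '%california%' does not count articles that only contain "California". The plain-Python sketch below illustrates the difference on a few invented strings; it is an analogy for the two SQL predicates, not Spark code.

```python
# Toy stand-in for the text column; these strings are made up.
articles = [
    "California is a state on the west coast.",
    "The university of california system has ten campuses.",
    "Spark SQL reads parquet files.",
]

# Case-sensitive substring match, analogous to: text LIKE '%california%'
case_sensitive = sum(1 for text in articles if "california" in text)

# Case-insensitive match, analogous to: LOWER(text) LIKE '%california%'
case_insensitive = sum(1 for text in articles if "california" in text.lower())

print(case_sensitive, case_insensitive)  # 1 2
```

To count both spellings in the tutorial's query, one option would be `WHERE LOWER(text) LIKE '%california%'`; the resulting count is not shown here because it was not run against the dataset.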