# Introduction to Spark SQL

Spark SQL is the Spark component that provides a SQL-like interface.  
In this tutorial we will show how to use this component.

### Initialize SQL context:  
1 - Import the SQL context.  
2 - Create the context.  
3 - Set the binaryAsString flag, which tells Spark SQL to treat binary-encoded data as strings.  
4 - Set the useDataSourceApi flag to false, for compatibility with the Parquet dataset.

In [None]:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
conf = SparkConf()
sc = SparkContext(conf=conf)
sqlCtx = SQLContext(sc)
sqlCtx.sql("SET spark.sql.parquet.binaryAsString=true")
sqlCtx.sql("SET spark.sql.parquet.useDataSourceApi=false")

### Load the dataset as an RDD from HDFS

In [None]:
wikiData = sqlCtx.parquetFile("data/wiki_parquet")

Now we can count the rows of the dataset to verify that the RDD was loaded.

In [None]:
wikiData.count()

In addition to the standard RDD operations, SchemaRDDs (called DataFrames since Spark 1.3.0) also carry information about the names and types of the columns in the dataset. This schema information makes it possible to run SQL queries against the data once it has been registered as a table.

In [None]:
# Print the schema (column names and types)
wikiData.printSchema()

### Register the RDD as a table

In [None]:
wikiData.registerTempTable("wikiData")
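
The calls above target the Spark 1.x API. On Spark 2.0 and later, SQLContext is superseded by SparkSession, parquetFile by spark.read.parquet, and registerTempTable by createOrReplaceTempView. A minimal sketch of the same steps with the newer API could look like the following. The helper names build_session and load_and_register and the application name are illustrative assumptions, not part of this tutorial, and the useDataSourceApi flag is left out of the sketch.

```python
def build_session(app_name="spark-sql-intro"):
    """Create a SparkSession (Spark 2.0+); app_name is an illustrative default."""
    # Imported lazily so the helpers can be defined without Spark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        # Same flag as in the tutorial: read Parquet binary columns as strings.
        .config("spark.sql.parquet.binaryAsString", "true")
        .getOrCreate()
    )


def load_and_register(spark, path="data/wiki_parquet", table="wikiData"):
    """Load a Parquet dataset and expose it to SQL under the given table name."""
    df = spark.read.parquet(path)
    df.createOrReplaceTempView(table)
    return df
```

With a local Spark installation, calling build_session() and passing the result to load_and_register() should make the wikiData table available to SQL queries, in the same way as the cells above.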

Now we can count the rows with a Spark SQL query.

In [None]:
result_df = sqlCtx.sql("SELECT COUNT(*) AS pageCount FROM wikiData")
result = result_df.collect()
result[0].pageCount

In [None]:
result_df.toPandas()

SQL can be a powerful tool for performing complex aggregations. For example, the following query returns the top 10 usernames by the number of pages they created.

If the query fails with java.lang.OutOfMemoryError, restart pyspark with a driver memory limit:  
usb/$ spark/bin/pyspark --driver-memory 1G

In [None]:
sqlCtx.sql("SELECT username, COUNT(*) AS cnt FROM wikiData WHERE username <> '' GROUP BY username ORDER BY cnt DESC LIMIT 10").collect()
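
To see what this aggregation computes without a cluster, the same SQL can be run on a few made-up rows. The sketch below uses Python's built-in sqlite3 module as a stand-in for Spark SQL. The toy rows and the helper name top_users are assumptions for illustration only.

```python
import sqlite3

# Toy stand-in for the wikiData table: one row per page, with its creator.
ROWS = [
    ("alice",), ("alice",), ("alice",),
    ("bob",), ("bob",),
    ("carol",),
    ("",), ("",), ("",), ("",),  # pages without a username
]


def top_users(rows, limit=10):
    """Return (username, page count) pairs, most prolific users first."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE wikiData (username TEXT)")
        conn.executemany("INSERT INTO wikiData VALUES (?)", rows)
        query = (
            "SELECT username, COUNT(*) AS cnt FROM wikiData "
            "WHERE username <> '' "
            "GROUP BY username ORDER BY cnt DESC LIMIT ?"
        )
        return conn.execute(query, (limit,)).fetchall()
    finally:
        conn.close()


print(top_users(ROWS))
```

The empty username has the most rows in the toy data, but the WHERE clause removes it before grouping, so it never appears in the ranking.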

Now try writing the following query yourself.  
How many articles contain the word “california”?

In [None]:
# Solution
sqlCtx.sql("SELECT COUNT(*) FROM wikiData WHERE text LIKE '%california%'").collect()
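
Note that LIKE in Spark SQL is case-sensitive, so the pattern '%california%' does not match articles that only spell the word as "California". Lowercasing the column first, for example with LOWER(text) LIKE '%california%', counts every spelling. The plain-Python sketch below mimics the two predicates on made-up strings. The sample texts and function names are assumptions for illustration.

```python
# Made-up article texts standing in for the `text` column.
TEXTS = [
    "California is a state on the west coast.",
    "The california poppy is a flowering plant.",
    "Nevada borders CALIFORNIA to the east.",
    "Oregon lies further north.",
]


def count_case_sensitive(texts, word="california"):
    """Mimics: text LIKE '%california%' (exact-case substring match)."""
    return sum(1 for t in texts if word in t)


def count_case_insensitive(texts, word="california"):
    """Mimics: LOWER(text) LIKE '%california%'."""
    return sum(1 for t in texts if word in t.lower())


print(count_case_sensitive(TEXTS), count_case_insensitive(TEXTS))
```

Which variant is the right answer depends on whether the exercise intends the lowercase spelling only or any mention of the word.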