# Introduction to Spark SQL

Spark SQL is the Spark component that provides a SQL-like interface for querying structured data.  
This tutorial shows how to use it.

### Initialize the SQL context:  
1 - Import the SQL context classes.  
2 - Create the context.  
3 - Set the binaryAsString flag, which tells Spark SQL to treat binary-encoded data as strings.  
4 - Set the useDataSourceApi flag to false, for compatibility with the Parquet dataset.

In [14]:
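from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, Row
from pyspark.sql.types import *
# Create the Spark context, then wrap it in a SQL context
conf = SparkConf()
sc = SparkContext(conf=conf)
sqlCtx = SQLContext(sc)
# Treat binary-encoded Parquet data as strings
sqlCtx.sql("SET spark.sql.parquet.binaryAsString=true")
# Turn off the data source API for compatibility with the Parquet dataset
sqlCtx.sql("SET spark.sql.parquet.useDataSourceApi=false")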
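
The same flags can also be set without building `SET` statements by hand. The following is a minimal sketch, not part of the original notebook: `apply_sql_settings` and `PARQUET_SETTINGS` are hypothetical names, and the helper assumes the context object exposes a `setConf(key, value)` method, as `SQLContext` does.

```python
def apply_sql_settings(sql_ctx, settings):
    """Apply each key/value pair through sql_ctx.setConf and return the keys set.

    Hypothetical helper: sql_ctx is assumed to be an SQLContext (or any
    object with a compatible setConf method). Values are passed as strings.
    """
    applied = []
    for key in sorted(settings):
        sql_ctx.setConf(key, str(settings[key]))
        applied.append(key)
    return applied


# The two flags used in this tutorial, written as a plain dict.
PARQUET_SETTINGS = {
    "spark.sql.parquet.binaryAsString": "true",
    "spark.sql.parquet.useDataSourceApi": "false",
}
```

With the context created above, this would be called as `apply_sql_settings(sqlCtx, PARQUET_SETTINGS)`. Keeping the flags in a dict makes it easier to see every setting in one place.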

### Load the dataset from a file as an RDD

In [16]:
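# Read the CSV file as an RDD of text lines
lines = sc.textFile('data/iris.csv')
# Split each line into its comma-separated fields
parts = lines.map(lambda l: l.split(","))
# Convert each record into a Row with four numeric features and a label
data = parts.map(lambda p: Row(f0=float(p[0]), f1=float(p[1]), f2=float(p[2]), f3=float(p[3]), label=p[4]))
# Build a DataFrame; column names and types are inferred from the Rows
schema = sqlCtx.createDataFrame(data)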
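
The lambdas above assume that every line holds exactly four numeric fields and a label. If the file might contain a header row, blank lines, or truncated records, a more defensive parser can help. The sketch below is one possible approach in plain Python; `parse_iris_line` is a hypothetical helper, not something the original notebook defines.

```python
def parse_iris_line(line):
    """Turn one CSV line into a dict of typed fields, or None if it is malformed."""
    parts = line.strip().split(",")
    if len(parts) != 5:
        return None  # blank line, or wrong number of columns
    try:
        f0, f1, f2, f3 = (float(x) for x in parts[:4])
    except ValueError:
        return None  # header row or non-numeric measurement
    return {"f0": f0, "f1": f1, "f2": f2, "f3": f3, "label": parts[4]}
```

It could be plugged into the pipeline as `lines.map(parse_iris_line).filter(lambda r: r is not None).map(lambda r: Row(**r))`, so that malformed lines are dropped instead of failing the job.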

Now we can count the rows of the dataset to verify that it loaded correctly.

In [22]:
schema.count()

150

In addition to the standard RDD operations, SchemaRDDs (called DataFrames since Spark 1.3.0) also carry information about the names and types of the columns in the dataset. This schema information makes it possible to run SQL queries against the data once it has been registered as a table.

In [23]:
# to describe the schema
schema.printSchema()

root
 |-- f0: double (nullable = true)
 |-- f1: double (nullable = true)
 |-- f2: double (nullable = true)
 |-- f3: double (nullable = true)
 |-- label: string (nullable = true)



### Register the DataFrame as a table

In [24]:
schema.registerTempTable("iris")

Now we can count the rows with a Spark SQL query.

In [27]:
result_df = sqlCtx.sql("SELECT COUNT(*) AS Count FROM iris")
result = result_df.collect()
result[0].Count

150

In [29]:
result_df.toPandas()

   Count
0    150
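
Once the table is registered, other aggregations follow the same pattern as the count. As an illustration, the sketch below wraps a `GROUP BY` query in a small function. `count_per_label` is a hypothetical helper, and it assumes the context object has a `sql(query)` method whose result supports `collect()`, as in the cells above.

```python
def count_per_label(sql_ctx, table_name="iris"):
    """Return a {label: row_count} dict for a registered temporary table.

    Hypothetical helper: sql_ctx is assumed to behave like the sqlCtx
    object used in this tutorial, and the table is assumed to have a
    'label' column.
    """
    query = "SELECT label, COUNT(*) AS n FROM {0} GROUP BY label".format(table_name)
    rows = sql_ctx.sql(query).collect()
    # Each collected row exposes its columns as attributes
    return {row.label: row.n for row in rows}
```

With the table registered above, `count_per_label(sqlCtx)` would return one entry per distinct label.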


### Save and load the dataset as Parquet

The DataFrame can be written to a Parquet file and read back through the same SQL context.

In [33]:
schema.write.parquet('data/iris.parquet')

In [34]:
schema1 = sqlCtx.read.parquet('data/iris.parquet')

In [35]:
schema1.show()

+---+---+---+---+-----------+
| f0| f1| f2| f3|      label|
+---+---+---+---+-----------+
|5.1|3.5|1.4|0.2|Iris-setosa|
|4.9|3.0|1.4|0.2|Iris-setosa|
|4.7|3.2|1.3|0.2|Iris-setosa|
|4.6|3.1|1.5|0.2|Iris-setosa|
|5.0|3.6|1.4|0.2|Iris-setosa|
|5.4|3.9|1.7|0.4|Iris-setosa|
|4.6|3.4|1.4|0.3|Iris-setosa|
|5.0|3.4|1.5|0.2|Iris-setosa|
|4.4|2.9|1.4|0.2|Iris-setosa|
|4.9|3.1|1.5|0.1|Iris-setosa|
|5.4|3.7|1.5|0.2|Iris-setosa|
|4.8|3.4|1.6|0.2|Iris-setosa|
|4.8|3.0|1.4|0.1|Iris-setosa|
|4.3|3.0|1.1|0.1|Iris-setosa|
|5.8|4.0|1.2|0.2|Iris-setosa|
|5.7|4.4|1.5|0.4|Iris-setosa|
|5.4|3.9|1.3|0.4|Iris-setosa|
|5.1|3.5|1.4|0.3|Iris-setosa|
|5.7|3.8|1.7|0.3|Iris-setosa|
|5.1|3.8|1.5|0.3|Iris-setosa|
+---+---+---+---+-----------+
only showing top 20 rows

