# Spark SQL
---
Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.

---
## DataFrames
A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs.

The DataFrame API is available in Scala, Java, and Python.

All of the examples on this page use sample data included in the Spark distribution and can be run in the spark-shell or the pyspark shell.

In [1]:
# Import SQLContext and data types
from pyspark import SparkConf, SparkContext
from pyspark.sql import *
from pyspark.sql.types import *
conf = SparkConf()
sc = SparkContext(conf=conf)

The entry point into all relational functionality in Spark is the SQLContext class, or one of its descendants. To create a basic SQLContext, all you need is a SparkContext.

In [2]:
# sc is an existing SparkContext.
sqlContext = SQLContext(sc)

## Creating DataFrames

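Load a text file and convert each line to a Row.
- If Spark's default file system is not the local one (for example, HDFS), prefix the path with 'file://' so that it is read from the local file system.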

In [3]:
fname = '/usr/local/spark/examples/src/main/resources/people.txt'
lines = sc.textFile(fname)

In [4]:
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
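
To see what the two `map` calls do to each record without starting Spark, the sketch below applies the same split-and-convert logic to plain Python strings. The three sample lines are assumed to match the contents of `people.txt`, and the `Person` namedtuple is a hypothetical stand-in for `pyspark.sql.Row`; neither comes from the notebook itself.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row, used only for this illustration.
Person = namedtuple("Person", ["name", "age"])

# Assumed contents of people.txt (one "name, age" record per line).
sample_lines = ["Michael, 29", "Andy, 30", "Justin, 19"]


def parse_line(line):
    """Split a 'name, age' line and convert the age to an int."""
    parts = line.split(",")
    # int() ignores surrounding whitespace, so " 29" parses as 29.
    return Person(name=parts[0], age=int(parts[1]))


parsed = [parse_line(line) for line in sample_lines]
print(parsed)
```

Because the age is converted with `int()` at this stage, schema inference later sees a numeric column rather than text.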

### Infer the schema, and register the DataFrame as a table.

In [5]:
schemaPeople = sqlContext.createDataFrame(people)
schemaPeople.registerTempTable("people")

#### SQL can be run over DataFrames that have been registered as a table.

In [6]:
teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

The results of SQL queries are DataFrames. In the Spark 1.x API used here, they also support the normal RDD operations, such as `map`.

In [9]:
teenNames = teenagers.map(lambda p: "Name: " + p.name)
for teenName in teenNames.collect():
  print(teenName)

Name: Justin
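
The query itself is ordinary SQL, so its filtering logic can be checked outside Spark. The sketch below runs the same query text against an in-memory SQLite table using Python's standard `sqlite3` module. This swaps in SQLite for Spark SQL purely for illustration, and the inserted rows are assumed to match the sample `people.txt` data.

```python
import sqlite3

# Assumed rows, matching the sample people.txt data.
rows = [("Michael", 29), ("Andy", 30), ("Justin", 19)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)", rows)

# Same query text as the Spark example above.
query = "SELECT name FROM people WHERE age >= 13 AND age <= 19"
teen_names = [name for (name,) in conn.execute(query)]
conn.close()

print(teen_names)
```

Only Justin falls inside the inclusive 13 to 19 range, which agrees with the Spark output above.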


In [10]:
schemaPeople.cache()

DataFrame[age: bigint, name: string]

## DataFrame Operations
DataFrames provide a domain-specific language for structured data manipulation in Scala, Java, and Python.

Here are some basic examples of structured data processing using DataFrames:

In [11]:
#from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

In [12]:
df_ = sqlContext.read.json("/usr/local/spark/examples/src/main/resources/people.json")

# Displays the content of the DataFrame to stdout
df_.show()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



In [13]:
df = df_.fillna(0)
df.show()

+---+-------+
|age|   name|
+---+-------+
|  0|Michael|
| 30|   Andy|
| 19| Justin|
+---+-------+



In [14]:
df = df_.fillna(df_.select('age').groupBy().mean().collect()[0][0])
df.show()

+---+-------+
|age|   name|
+---+-------+
| 24|Michael|
| 30|   Andy|
| 19| Justin|
+---+-------+
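
The nested `fillna` expression above is compact, so here is a plain-Python sketch of the same steps on a list. The ages are assumed from the `people.json` output shown earlier, with `None` standing in for the null. The mean of the two known ages is 24.5, and the `int()` truncation below is an assumption chosen to match the 24 that appears in the integer `age` column.

```python
# Assumed ages from people.json; None marks the missing value.
ages = [None, 30, 19]

# Step 1: average only the known values (nulls are skipped).
known = [a for a in ages if a is not None]
mean_age = sum(known) / float(len(known))

# Step 2 (assumption): truncate the fill value to fit an integer column,
# which matches the 24 shown in the output above.
fill_value = int(mean_age)

# Step 3: replace each missing entry with the fill value.
filled = [fill_value if a is None else a for a in ages]
print(filled)
```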



In [15]:
df.printSchema()

root
 |-- age: long (nullable = false)
 |-- name: string (nullable = true)



In [16]:
df.select("name").show()

+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+



In [17]:
df.select("name", df.age + 1).show()

+-------+---------+
|   name|(age + 1)|
+-------+---------+
|Michael|       25|
|   Andy|       31|
| Justin|       20|
+-------+---------+



In [18]:
df.filter(df.age > 18).show()

+---+-------+
|age|   name|
+---+-------+
| 24|Michael|
| 30|   Andy|
| 19| Justin|
+---+-------+



In [19]:
df.groupBy("age").count().show()

+---+-----+
|age|count|
+---+-----+
| 19|    1|
| 24|    1|
| 30|    1|
+---+-----+



In [20]:
df = df_.fillna(df_.select('age').groupBy().max().collect()[0][0])
df.show()

+---+-------+
|age|   name|
+---+-------+
| 30|Michael|
| 30|   Andy|
| 19| Justin|
+---+-------+



In [21]:
df.groupBy("age").count().show()

+---+-----+
|age|count|
+---+-----+
| 19|    1|
| 30|    2|
+---+-----+
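
Filling the null with the maximum age makes two rows share the value 30, which is why the grouped count changes. A minimal plain-Python sketch of that fill-then-count step, using `collections.Counter` in place of `groupBy("age").count()` and ages assumed from the output above:

```python
from collections import Counter

# Assumed ages from people.json; None marks the missing value.
ages = [None, 30, 19]

# Fill the missing value with the largest known age.
max_age = max(a for a in ages if a is not None)
filled = [max_age if a is None else a for a in ages]

# Count how many rows fall into each age group.
counts = Counter(filled)
for age, count in sorted(counts.items()):
    print(age, count)
```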



In [22]:
# help(df.agg)

## Programmatically Specifying the Schema

In [23]:
# Import SQLContext and data types
from pyspark.sql import *
from pyspark.sql.types import *

In [24]:
# sc is an existing SparkContext.
sqlContext = SQLContext(sc)

#### Load a text file and convert each line to a tuple.

In [25]:
fname = "/usr/local/spark/examples/src/main/resources/people.txt"
lines = sc.textFile(fname)
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: (p[0], p[1].strip()))

#### The schema is encoded in a string.

In [26]:
schemaString = "name age"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
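
The list comprehension above creates one field per whitespace-separated name, each with the same type and nullability. The sketch below mirrors that construction in plain Python. The `Field` namedtuple is a hypothetical stand-in for `StructField`, and the sample record is assumed from `people.txt`.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.types.StructField.
Field = namedtuple("Field", ["name", "data_type", "nullable"])

schema_string = "name age"
fields = [Field(field_name, "string", True) for field_name in schema_string.split()]

# Pair one parsed record with the field names; the age is stripped text,
# just as in the people RDD built above.
record = ("Justin", " 19".strip())
row = dict(zip([f.name for f in fields], record))
print(row)
```

Because every field is declared as a string, the `age` column holds text such as `"19"` rather than a number, so numeric comparisons on it would need a cast or a numeric field type.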

#### Apply the schema to the RDD.

In [27]:
schemaPeople = sqlContext.createDataFrame(people, schema)

#### Register the DataFrame as a table.

In [28]:
schemaPeople.registerTempTable("people")

#### SQL can be run over DataFrames that have been registered as a table.

In [29]:
results = sqlContext.sql("SELECT name FROM people")
#results.show()

#### The results of SQL queries are DataFrames; in the Spark 1.x API used here, they also support the normal RDD operations.

In [30]:
names = results.map(lambda p: "Name: " + p.name)
for name in names.collect():
  print(name)

Name: Michael
Name: Andy
Name: Justin
