## Overview
In the previous mission, we learned how to read JSON into a Spark DataFrame, as well as some basic techniques for interacting with DataFrames. In this mission, we'll learn how to use Spark's SQL interface to query and interact with the data. We'll continue to work with the 2010 U.S. Census data set in this mission. Later on, we'll add other files to demonstrate how to take advantage of SQL to work with multiple data sets.

## Register the DataFrame as a Table
Before we can write and run SQL queries, we need to tell Spark to treat the DataFrame as a SQL table. Spark internally maintains a virtual database within the SQLContext object. This object, which we instantiate as __sqlCtx__, has methods for registering temporary tables.

To register a DataFrame as a table, call the __registerTempTable()__ [method](https://spark.apache.org/docs/1.5.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.registerTempTable) on that DataFrame object. This method requires one string parameter, __name__, which sets the table name we'll reference in our SQL queries.

#### Instructions
* Use the __registerTempTable()__ [method](https://spark.apache.org/docs/1.5.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.registerTempTable) to register the DataFrame __df__ as a table named __census2010__.
* Then, run the SQLContext method __tableNames__ to return the list of tables.
* Assign the resulting list to __tables__, and use the __print__ function to display it.

In [1]:
# Find path to PySpark
import findspark
findspark.init()

# Import PySpark & initialize SparkContext object
import pyspark
sc = pyspark.SparkContext()

# Import SQLContext
from pyspark.sql import SQLContext

# Pass in the SparkContext object `sc`
sqlCtx = SQLContext(sc)

# Read JSON data into a DataFrame object `df`
df = sqlCtx.read.json("census_2010.json")
df.registerTempTable('census2010')

In [2]:
tables = sqlCtx.tableNames()
print(tables)

[u'census2010']


## Querying
Now that we've registered the table within __sqlCtx__, we can start writing and running SQL queries. With Spark SQL, we represent our query as a string and pass it into the __sql()__ [method](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.SQLContext.sql) of the SQLContext object. The __sql()__ method requires a single parameter, the query string. Spark returns the query results as a Spark DataFrame object. Because Spark evaluates DataFrames lazily, you'll have to call __show()__ to display the results.

While SQLite requires that queries end with a semicolon, the version of Spark SQL used here will throw an error if you include one. Apart from this difference in syntax, Spark's flavor of SQL closely resembles SQLite's for the kinds of queries in this mission, and the queries you've written for the SQL course should work here as well.
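
One way to guard against a stray trailing semicolon is to strip it before the query string reaches __sql()__. The helper below is a minimal sketch: `strip_trailing_semicolon` is a hypothetical name introduced here for illustration and is not part of Spark. The final call is left as a comment because it assumes the __sqlCtx__ object and the __census2010__ table from the earlier cells.

```python
def strip_trailing_semicolon(query):
    """Return the query without trailing whitespace or a trailing semicolon.

    Hypothetical helper for illustration; not part of the Spark API.
    """
    return query.rstrip().rstrip(";").rstrip()

# A query copied from a SQLite session, semicolon included
query = strip_trailing_semicolon("SELECT age FROM census2010 LIMIT 20;")
print(query)

# With the sqlCtx object from the first cell in scope:
# sqlCtx.sql(query).show()
```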

#### Instructions
* Write a SQL query that returns the __age__ column from the table __census2010__, and use the __show()__ method to display the first 20 results.

In [3]:
# show() prints the table itself and returns None, so no print() is needed
sqlCtx.sql("SELECT age FROM census2010 LIMIT 20").show()

+---+
|age|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
| 10|
| 11|
| 12|
| 13|
| 14|
| 15|
| 16|
| 17|
| 18|
| 19|
+---+


## Filtering
In the previous mission, we used DataFrame methods to find all of the rows where __age__ was greater than 5. If we only wanted to retrieve data from the __males__ and __females__ columns where that criterion was true, we'd need to chain additional operations to the Spark DataFrame. To return the results in descending order instead of ascending order, we'd have to chain yet another method. The DataFrame methods are quick and powerful for simple queries, but chaining them can become cumbersome for more advanced queries.

SQL shines at expressing complex logic in a more compact manner. Let's brush up on SQL by writing a query that expresses more specific criteria.

#### Instructions
Write and run a SQL query that returns:
* The __males__ and __females__ columns (in that order) where __age__ > 5 and __age__ < 15

In [4]:
sqlCtx.sql('SELECT males, females FROM census2010 \
            WHERE age > 5 AND age < 15').show()

+-------+-------+
|  males|females|
+-------+-------+
|2093905|2007781|
|2097080|2010281|
|2101670|2013771|
|2108014|2018603|
|2114217|2023289|
|2118390|2026352|
|2132030|2037286|
|2159943|2060100|
|2195773|2089651|
+-------+-------+



## Mixing Functionality
Because the results of SQL queries are DataFrame objects, we can combine the best aspects of both DataFrames and SQL to enhance our workflow. For example, we can write a SQL query that quickly returns a subset of our data as a DataFrame.

#### Instructions
* Write a SQL query that returns a DataFrame containing the __males__ and __females__ columns from the census2010 table.
* Use the __describe()__ [method](https://spark.apache.org/docs/1.5.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.describe) to calculate summary statistics for the DataFrame and the __show()__ method to display the results.

In [5]:
sqlCtx.sql('SELECT males, females FROM census2010').describe().show()

+-------+------------------+-----------------+
|summary|             males|          females|
+-------+------------------+-----------------+
|  count|               101|              101|
|   mean|1520095.3168316833|1571460.287128713|
| stddev| 818587.2080168233|748671.0493484349|
|    min|              4612|            25673|
|    max|           2285990|          2331572|
+-------+------------------+-----------------+



## Multiple Tables
One of the most powerful use cases in SQL is joining tables. Spark SQL takes this a step further by enabling you to run join queries across data from multiple file types. Spark can read any of the file types and formats it supports into DataFrame objects, and we can register each of these as a table within the SQLContext object to use for querying.

As we mentioned briefly in the previous mission, most data science organizations use a variety of file formats and data storage mechanisms. Spark SQL was built with the industry use cases in mind and enables data professionals to use one common query language, SQL, to interact with lots of different data sources. We'll explore joins in Spark SQL further, but first let's introduce the other datasets we'll be using:
* census_1980.json - 1980 U.S. Census data
* census_1990.json - 1990 U.S. Census data
* census_2000.json - 2000 U.S. Census data

#### Instructions
Read these additional datasets into DataFrame objects, and then use the __registerTempTable()__ method to register each one as a table within the SQLContext object:
* census_1980.json as census1980,
* census_1990.json as census1990,
* census_2000.json as census2000.

Then use the __tableNames()__ method to list the tables within the SQLContext object, assign the result to __tables__, and finally print __tables__.

In [6]:
df = sqlCtx.read.json("census_1980.json")
df.registerTempTable('census1980')

df = sqlCtx.read.json("census_1990.json")
df.registerTempTable('census1990')

df = sqlCtx.read.json("census_2000.json")
df.registerTempTable('census2000')
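
The three read-and-register steps above follow the same pattern, so they could also be driven by a loop, which avoids reassigning __df__ each time. The sketch below only builds the file and table names; `census_table_specs` is a hypothetical helper introduced for illustration, and the registration loop is left as a comment because it assumes the __sqlCtx__ object and the JSON files from this mission.

```python
def census_table_specs(years):
    """Return (file path, table name) pairs for the given census years.

    Hypothetical helper for illustration; it follows the naming used in
    this mission (census_1980.json is registered as census1980).
    """
    return [("census_%d.json" % year, "census%d" % year) for year in years]

specs = census_table_specs([1980, 1990, 2000])
print(specs)

# With the sqlCtx object from the first cell in scope:
# for path, name in specs:
#     sqlCtx.read.json(path).registerTempTable(name)
```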

In [7]:
tables = sqlCtx.tableNames()
print(tables)

[u'census2000', u'census1990', u'census2010', u'census1980']


## Joins
Now that we have a table for each dataset, we can write join queries to compare values across them. Since we're working with Census data, let's use the __age__ column as the joining column.

#### Instructions
*  Write a query that returns a DataFrame with the __total__ columns for the tables __census2010__ and __census2000__ (in that order).
*  Then, run the query and use the __show()__ method to display the first 20 results.

In [18]:
sql_query = """SELECT 
                    census2010.total, 
                    census2000.total 
                FROM census2010 
                INNER JOIN census2000 ON census2000.age = census2010.age"""
sqlCtx.sql(sql_query).show()

+-------+-------+
|  total|  total|
+-------+-------+
|4079669|3733034|
|4085341|3825896|
|4089295|3904845|
|4092221|3970865|
|4094802|4024943|
|4097728|4068061|
|4101686|4101204|
|4107361|4125360|
|4115441|4141510|
|4126617|4150640|
|4137506|4152174|
|4144742|4145530|
|4169316|4139512|
|4220043|4138230|
|4285424|4137982|
|4347028|4133932|
|4410804|4130632|
|4451147|4111244|
|4454165|4068058|
|4432260|4011192|
+-------+-------+
only showing top 20 rows
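
Both columns in the output above are labeled __total__, which makes them hard to tell apart. Standard SQL column aliases (`AS`) are one way to address this. The sketch below runs against an in-memory SQLite database through Python's `sqlite3` module so that it works without a Spark installation. The rows are made-up toy values, not census data, and the alias names `total_2010` and `total_2000` are assumptions chosen for illustration. The same query string should be usable with __sqlCtx.sql()__, though that is not verified here.

```python
import sqlite3

# In-memory SQLite database standing in for the registered Spark tables
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE census2010 (age INTEGER, total INTEGER);
    CREATE TABLE census2000 (age INTEGER, total INTEGER);
    -- Toy values for illustration only, not real census figures
    INSERT INTO census2010 VALUES (0, 100), (1, 110);
    INSERT INTO census2000 VALUES (0, 90), (1, 95);
""")

# AS gives each total column a distinct, readable name
sql_query = """SELECT
                    census2010.total AS total_2010,
                    census2000.total AS total_2000
                FROM census2010
                INNER JOIN census2000 ON census2000.age = census2010.age
                ORDER BY census2010.age"""

cursor = conn.execute(sql_query)
column_names = [description[0] for description in cursor.description]
rows = cursor.fetchall()

print(column_names)
print(rows)
```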



## SQL Functions
The functions and operators from SQLite that we've used in the past are available for us to use in Spark SQL:
* COUNT()
* AVG()
* SUM()
* AND
* OR

#### Instructions
Write a query that calculates the sums of the total column from each of the tables, in the following order:
* census2010,
* census2000,
* census1990.

You'll need to perform two inner joins for this query (all datasets have the same values for age, which makes things convenient for joining).

In [19]:
sql_query = """SELECT
                    SUM(census2010.total),
                    SUM(census2000.total),
                    SUM(census1990.total)
                FROM census2010
                INNER JOIN census2000 ON census2000.age = census2010.age
                INNER JOIN census1990 ON census1990.age = census2010.age"""

sqlCtx.sql(sql_query).show()

+---------+---------+---------+
|      _c0|      _c1|      _c2|
+---------+---------+---------+
|312247116|284594395|254506647|
+---------+---------+---------+
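
The instructions note that every table has the same values for __age__, and that is what keeps the joined sums meaningful: when each age appears exactly once per table, the join produces one row per age, so each `SUM()` matches the sum of that table on its own. The sketch below checks this on made-up toy rows using an in-memory SQLite database (`sqlite3`) as a stand-in for Spark, and it also uses `AS` aliases in place of auto-generated names such as `_c0`. The alias names are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE census2010 (age INTEGER, total INTEGER);
    CREATE TABLE census2000 (age INTEGER, total INTEGER);
    CREATE TABLE census1990 (age INTEGER, total INTEGER);
    -- Toy values for illustration only; each age appears once per table
    INSERT INTO census2010 VALUES (0, 100), (1, 110), (2, 120);
    INSERT INTO census2000 VALUES (0, 90), (1, 95), (2, 105);
    INSERT INTO census1990 VALUES (0, 80), (1, 85), (2, 90);
""")

# Sums computed through the two inner joins, with readable aliases
joined_sums = conn.execute("""
    SELECT
        SUM(census2010.total) AS sum_2010,
        SUM(census2000.total) AS sum_2000,
        SUM(census1990.total) AS sum_1990
    FROM census2010
    INNER JOIN census2000 ON census2000.age = census2010.age
    INNER JOIN census1990 ON census1990.age = census2010.age
""").fetchone()

# Sums computed one table at a time, for comparison
separate_sums = tuple(
    conn.execute("SELECT SUM(total) FROM " + table).fetchone()[0]
    for table in ("census2010", "census2000", "census1990")
)

print(joined_sums)
print(separate_sums)
```

If an age appeared more than once in any table, the join would repeat rows from the other tables and the joined sums would no longer match the per-table sums.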

