<h1>Querying Data Using Spark SQL</h1>
<h2>Through Demo 1</h2>

<hr>
<h2>Spark SQL Notes</h2>
<ul>
    <li>Spark SQL lets you query DataFrames as if they were database tables</li>
    <li>Temporary per-session views and global temporary views can be used</li>
    <li>The Catalyst optimizer rewrites SQL queries into efficient execution plans</li>
    <li>Table schemas can be inferred or explicitly specified</li>
    <li>Advanced windowing operations are also supported</li>
</ul>

<hr>
<h2>Setting up the Notebook</h2>
<ul>
    <li>Setting up import statements</li>
    <li>Setting up the Spark session</li>
</ul>

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import Row
from datetime import datetime

In [2]:
spark = SparkSession.builder\
                    .appName("Analyzing airline data")\
                    .getOrCreate()

<hr>
<h2>Demo: Basic Spark SQL Operations</h2>

In [3]:
# build an RDD of Row objects using the SparkContext owned by the session
record = spark.sparkContext.parallelize([
    Row(id=1,
        name="Jill",
        active=True,
        clubs=['chess', 'hockey'],
        subjects={"math": 80, "english": 56},
        enrolled=datetime(2014, 8, 1, 14, 1, 5)),
    Row(id=2,
        name="George",
        active=False,
        clubs=['chess', 'soccer'],
        subjects={"math": 60, "english": 96},
        enrolled=datetime(2015, 3, 21, 8, 2, 5)),
])

In [4]:
record_df = record.toDF()
record_df.show()

+------+---------------+-------------------+---+------+--------------------+
|active|          clubs|           enrolled| id|  name|            subjects|
+------+---------------+-------------------+---+------+--------------------+
|  true|[chess, hockey]|2014-08-01 14:01:05|  1|  Jill|[english -> 56, m...|
| false|[chess, soccer]|2015-03-21 08:02:05|  2|George|[english -> 96, m...|
+------+---------------+-------------------+---+------+--------------------+



In [5]:
# register a temporary view that is scoped to this Spark session and not shared with other sessions
record_df.createOrReplaceTempView("records")

In [8]:
all_records_df = spark.sql('SELECT * FROM records')
all_records_df.show()

+------+---------------+-------------------+---+------+--------------------+
|active|          clubs|           enrolled| id|  name|            subjects|
+------+---------------+-------------------+---+------+--------------------+
|  true|[chess, hockey]|2014-08-01 14:01:05|  1|  Jill|[english -> 56, m...|
| false|[chess, soccer]|2015-03-21 08:02:05|  2|George|[english -> 96, m...|
+------+---------------+-------------------+---+------+--------------------+



In [9]:
# index into the array column (zero-based) and look up a key in the map column
spark.sql('SELECT id, clubs[1], subjects["english"] FROM records').show()

+---+--------+-----------------+
| id|clubs[1]|subjects[english]|
+---+--------+-----------------+
|  1|  hockey|               56|
|  2|  soccer|               96|
+---+--------+-----------------+



In [10]:
spark.sql('SELECT id, NOT active FROM records').show()

+---+------------+
| id|(NOT active)|
+---+------------+
|  1|       false|
|  2|        true|
+---+------------+



In [12]:
spark.sql('SELECT * FROM records WHERE active').show()

+------+---------------+-------------------+---+----+--------------------+
|active|          clubs|           enrolled| id|name|            subjects|
+------+---------------+-------------------+---+----+--------------------+
|  true|[chess, hockey]|2014-08-01 14:01:05|  1|Jill|[english -> 56, m...|
+------+---------------+-------------------+---+----+--------------------+



In [11]:
spark.sql('SELECT * FROM records WHERE subjects["english"] > 90').show()

+------+---------------+-------------------+---+------+--------------------+
|active|          clubs|           enrolled| id|  name|            subjects|
+------+---------------+-------------------+---+------+--------------------+
| false|[chess, soccer]|2015-03-21 08:02:05|  2|George|[english -> 96, m...|
+------+---------------+-------------------+---+------+--------------------+



In [13]:
# create a global temporary view, shared across sessions of the same Spark application
record_df.createGlobalTempView("global_records")

In [14]:
# global temporary views live in the global_temp database, so the name must be qualified with it
spark.sql('SELECT * FROM global_temp.global_records').show()

+------+---------------+-------------------+---+------+--------------------+
|active|          clubs|           enrolled| id|  name|            subjects|
+------+---------------+-------------------+---+------+--------------------+
|  true|[chess, hockey]|2014-08-01 14:01:05|  1|  Jill|[english -> 56, m...|
| false|[chess, soccer]|2015-03-21 08:02:05|  2|George|[english -> 96, m...|
+------+---------------+-------------------+---+------+--------------------+

