# Unit 5: Programming with Spark SQL

## Contents
* Structured data processing
* SQL
* DataFrames
* Performance improvement
* SQLContext
* Creating DataFrames
* Saving a DataFrame
* DataFrame operations
* Query Strings
* Column Expressions
* DataFrames and RDDs
* SQL Queries

## Structured data processing
Spark SQL is a Spark module for **structured data processing**.

Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed. **Internally, Spark SQL uses this extra information to perform additional optimizations**.

**There are several ways to interact with Spark SQL including SQL, the DataFrames API and the Datasets API**. When computing a result the same execution engine is used, independent of which API/language you are using to express the computation. This unification means that developers can easily switch back and forth between the various APIs based on which provides the most natural way to express a given transformation.

## SQL
One use of Spark SQL is to **execute SQL queries written using either a basic SQL syntax or HiveQL**. Spark SQL can also be used to read data from an existing Hive installation. For more on how to configure this feature, refer to the Hive Tables section of the guide referenced below. When running SQL from within another programming language, the results are returned as a DataFrame. You can also interact with the SQL interface using the command line or over JDBC/ODBC.

Reference: [Spark SQL, DataFrames and Datasets Guide](https://spark.apache.org/docs/1.6.1/sql-programming-guide.html)

## DataFrames
A DataFrame is a distributed collection of data **organized into named columns**. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.

## Performance improvement
Spark SQL and DataFrames take advantage of the fact that they work with structured data to optimize performance using the [Catalyst query optimizer](https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html).

![Performance comparison](https://databricks.com/wp-content/uploads/2015/02/Screen-Shot-2015-02-16-at-9.46.39-AM.png)
Reference: [Performance improvements in Spark](https://databricks.com/blog/2015/04/24/recent-performance-improvements-in-apache-spark-sql-python-dataframes-and-more.html)


## SQLContext
To use Spark SQL or DataFrames we use a **SQLContext** object as the main entry point to the API, in a similar way as we use the SparkContext (sc) as the main entry point to the RDD API.

There are two implementations of the SQLContext object:
* SQLContext: basic
* HiveContext: more advanced
  * It is able to read Hive tables directly
  * Supports HiveQL language

In our case the Spark Shell provides us automatically with a sqlContext of the HiveContext type.

In [20]:
type(sqlContext)

pyspark.sql.context.HiveContext

## Creating DataFrames

### Creating a DataFrame from an existing file

Reading an existing file in **JSON** format:

In [21]:
processes = sqlContext.read.json('datasets/pacct_20170701_c66.json')

In [22]:
processes.printSchema()

root
 |-- command: string (nullable = true)
 |-- etime: double (nullable = true)
 |-- exitcode: long (nullable = true)
 |-- flag: long (nullable = true)
 |-- gid: long (nullable = true)
 |-- host: string (nullable = true)
 |-- io: long (nullable = true)
 |-- majflt: long (nullable = true)
 |-- mem: long (nullable = true)
 |-- minflt: long (nullable = true)
 |-- pid: long (nullable = true)
 |-- ppid: long (nullable = true)
 |-- rw: long (nullable = true)
 |-- stime: double (nullable = true)
 |-- swaps: long (nullable = true)
 |-- tm_hour: long (nullable = true)
 |-- tm_isdst: long (nullable = true)
 |-- tm_mday: long (nullable = true)
 |-- tm_min: long (nullable = true)
 |-- tm_mon: long (nullable = true)
 |-- tm_sec: long (nullable = true)
 |-- tm_wday: long (nullable = true)
 |-- tm_yday: long (nullable = true)
 |-- tm_year: long (nullable = true)
 |-- tty: long (nullable = true)
 |-- uid: long (nullable = true)
 |-- utime: double (nullable = true)
 |-- version: long (nullable = true)

Reading an existing file in **parquet** format:

In [23]:
processes2 = sqlContext.read.parquet('datasets/pacct_20170701.parquet')

In [24]:
processes2.printSchema()

root
 |-- host: string (nullable = true)
 |-- flag: integer (nullable = true)
 |-- version: integer (nullable = true)
 |-- tty: integer (nullable = true)
 |-- exitcode: integer (nullable = true)
 |-- uid: integer (nullable = true)
 |-- gid: integer (nullable = true)
 |-- pid: integer (nullable = true)
 |-- ppid: integer (nullable = true)
 |-- tm_year: integer (nullable = true)
 |-- tm_mon: integer (nullable = true)
 |-- tm_mday: integer (nullable = true)
 |-- tm_hour: integer (nullable = true)
 |-- tm_min: integer (nullable = true)
 |-- tm_sec: integer (nullable = true)
 |-- tm_wday: integer (nullable = true)
 |-- tm_yday: integer (nullable = true)
 |-- tm_isdst: integer (nullable = true)
 |-- etime: decimal(10,2) (nullable = true)
 |-- utime: decimal(10,2) (nullable = true)
 |-- stime: decimal(10,2) (nullable = true)
 |-- mem: integer (nullable = true)
 |-- io: integer (nullable = true)
 |-- rw: integer (nullable = true)
 |-- minflt: integer (nullable = true)
 |-- majflt: integer (nul

Other formats like **CSV**, HBase, Avro, etc. are also supported using third party data sources.

### Creating a DataFrame from a Hive table

In [25]:
jobs = sqlContext.sql('select * from cesga__slurm.ft2_job_table limit 1000')

In [26]:
jobs.printSchema()

root
 |-- job_db_inx: integer (nullable = true)
 |-- mod_time: long (nullable = true)
 |-- deleted: integer (nullable = true)
 |-- account: string (nullable = true)
 |-- array_task_str: string (nullable = true)
 |-- array_max_tasks: long (nullable = true)
 |-- array_task_pending: long (nullable = true)
 |-- cpus_req: long (nullable = true)
 |-- cpus_alloc: long (nullable = true)
 |-- derived_ec: long (nullable = true)
 |-- derived_es: string (nullable = true)
 |-- exit_code: long (nullable = true)
 |-- job_name: string (nullable = true)
 |-- id_assoc: long (nullable = true)
 |-- id_array_job: long (nullable = true)
 |-- id_array_task: long (nullable = true)
 |-- id_block: string (nullable = true)
 |-- id_job: long (nullable = true)
 |-- id_qos: long (nullable = true)
 |-- id_resv: long (nullable = true)
 |-- id_wckey: long (nullable = true)
 |-- id_user: long (nullable = true)
 |-- id_group: long (nullable = true)
 |-- kill_requid: integer (nullable = true)
 |-- mem_req: long (nullable

In [27]:
jobs.count()

1000

### Creating a DataFrame from an RDD

A DataFrame can be built from **an RDD of Row objects** using the toDF() method.

In [28]:
from pyspark.sql import Row

In [29]:
peopleRDD = sc.parallelize([('Aroa', 18, 'student'), ('Lara', 15, 'student'), ('Susana', 35, 'teacher')])

In [30]:
type(peopleRDD)

pyspark.rdd.RDD

In this case we have to convert the collection of tuples into a collection of Rows, and then we can transform it into a DataFrame.

In [31]:
peopleDF = peopleRDD.map(lambda p: Row(name=p[0], age=int(p[1]), profession=p[2])).toDF()

In [32]:
type(peopleDF)

pyspark.sql.dataframe.DataFrame

In [33]:
peopleDF.printSchema()

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
 |-- profession: string (nullable = true)
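
Note that the schema lists the columns as age, name, profession rather than in the order used in the Row(...) call. In Spark 1.x and 2.x, a Row built from keyword arguments sorts its fields by name. The following plain-Python sketch mimics that behaviour; `make_row` and `ToyRow` are hypothetical names used only for illustration, not part of pyspark:

```python
from collections import namedtuple


def make_row(**fields):
    """Toy stand-in for Row(**kwargs): field names are sorted alphabetically."""
    names = sorted(fields)
    row_type = namedtuple("ToyRow", names)
    return row_type(*(fields[n] for n in names))


# Same keyword order as in the map() call above.
row = make_row(name="Aroa", age=18, profession="student")
column_order = list(row._fields)  # the order the schema would report
```

If a specific column order matters, it can be fixed afterwards with a select on the resulting DataFrame.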



## Saving a DataFrame

In [34]:
jobs.write.parquet('jobs_filtered')

This creates a directory in HDFS and stores the data there, using one file per partition:
```bash
[jlopez@login6 datasets]$ hdfs dfs -ls jobs_filtered
Found 4 items
-rw-r--r--   3 jlopez jlopez          0 2017-07-06 13:05 jobs_filtered/_SUCCESS
-rw-r--r--   3 jlopez jlopez       3635 2017-07-06 13:05 jobs_filtered/_common_metadata
-rw-r--r--   3 jlopez jlopez       8801 2017-07-06 13:05 jobs_filtered/_metadata
-rw-r--r--   3 jlopez jlopez      34596 2017-07-06 13:05 jobs_filtered/part-r-00000-e020f05d-df99-4186-ab81-26c079f8cf85.gz.parquet
```

We can also use other output formats like JSON or ORC:

In [35]:
jobs.write.json('jobs_filtered_json')

```bash
[jlopez@login6 datasets]$ hdfs dfs -ls jobs_filtered_json
Found 2 items
-rw-r--r--   3 jlopez jlopez          0 2017-07-06 13:16 jobs_filtered_json/_SUCCESS
-rw-r--r--   3 jlopez jlopez     685974 2017-07-06 13:16 jobs_filtered_json/part-r-00000-12959a3f-4bd5-46f0-8e7f-5ddc6e9ae549
```

In [36]:
jobs.write.orc('jobs_filtered_orc')

```bash
[jlopez@login6 datasets]$ hdfs dfs -ls jobs_filtered_orc
Found 2 items
-rw-r--r--   3 jlopez jlopez          0 2017-07-06 13:07 jobs_filtered_orc/_SUCCESS
-rw-r--r--   3 jlopez jlopez      28545 2017-07-06 13:07 jobs_filtered_orc/part-r-00000-3f825b83-b0e1-4b88-a55d-6cf673ed4a1d.orc
```

## DataFrame Operations

As with RDDs, where we had transformations and actions, DataFrames have two kinds of operations:
* Queries: **lazy** transformations that create a new DataFrame
* Actions: trigger the execution of queries and return the data to the driver
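
The difference between lazy queries and actions can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not how Spark is implemented: `ToyFrame` and `older_than_16` are hypothetical names, and the real engine also optimizes the recorded plan before running it.

```python
class ToyFrame(object):
    """Toy model: queries record a step, actions run the recorded steps."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # tuple of pending (kind, function) steps

    def where(self, predicate):
        # Query: nothing is computed, a new ToyFrame with a longer plan is returned.
        return ToyFrame(self._rows, self._plan + (("where", predicate),))

    def select(self, projection):
        # Query: also lazy.
        return ToyFrame(self._rows, self._plan + (("select", projection),))

    def collect(self):
        # Action: executes every recorded step and returns the rows.
        rows = list(self._rows)
        for kind, fn in self._plan:
            if kind == "where":
                rows = [r for r in rows if fn(r)]
            else:
                rows = [fn(r) for r in rows]
        return rows

    def count(self):
        # Action: also triggers execution.
        return len(self.collect())


evaluated = []  # names of the rows the predicate has actually inspected


def older_than_16(row):
    evaluated.append(row["name"])
    return row["age"] > 16


people = ToyFrame([
    {"name": "Aroa", "age": 18},
    {"name": "Lara", "age": 15},
    {"name": "Susana", "age": 35},
])

query = people.where(older_than_16).select(lambda r: r["name"])
calls_before_action = len(evaluated)  # still 0: the query is only a plan
names = query.collect()               # the action triggers the work
```

Only the call to collect() causes the predicate to run, which is why building a long chain of queries is cheap until an action is requested.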

### Actions

#### show

Displays the first n rows

In [37]:
peopleDF.show(2)

+---+----+----------+
|age|name|profession|
+---+----+----------+
| 18|Aroa|   student|
| 15|Lara|   student|
+---+----+----------+
only showing top 2 rows



#### take

Returns the first n rows

In [38]:
peopleDF.take(2)

[Row(age=18, name=u'Aroa', profession=u'student'),
 Row(age=15, name=u'Lara', profession=u'student')]

#### collect

Returns all rows

In [39]:
peopleDF.collect()

[Row(age=18, name=u'Aroa', profession=u'student'),
 Row(age=15, name=u'Lara', profession=u'student'),
 Row(age=35, name=u'Susana', profession=u'teacher')]

#### count

In [40]:
peopleDF.count()

3

### Queries

#### distinct

In [41]:
df = sqlContext.createDataFrame([{'name': 'aroa', 'age': 17}, {'name': 'aroa', 'age': 17}, {'name': 'lara', 'age': 14}])



In [42]:
df.show()

+---+----+
|age|name|
+---+----+
| 17|aroa|
| 17|aroa|
| 14|lara|
+---+----+



In [43]:
df.distinct().show()

+---+----+
|age|name|
+---+----+
| 17|aroa|
| 14|lara|
+---+----+



#### limit

In [44]:
processes.count()

832119

In [45]:
processes_chunk = processes.limit(100)

In [46]:
processes_chunk.count()

100

#### where/filter

The **where** and **filter** operations are equivalent: where() is an alias for filter().

In [47]:
peopleDF.where('name = "Aroa"').show()

+---+----+----------+
|age|name|profession|
+---+----+----------+
| 18|Aroa|   student|
+---+----+----------+



The query is expressed using a Query String (see Query Strings section below).

#### select

We can do projections, for example reducing the number of columns:

In [48]:
processes.select('pid', 'command', 'etime')

DataFrame[pid: bigint, command: string, etime: double]

We can also do transformations (see Column Expressions section below):

In [49]:
peopleDF.select(peopleDF.name, peopleDF.age + 100).show()

+------+-----------+
|  name|(age + 100)|
+------+-----------+
|  Aroa|        118|
|  Lara|        115|
|Susana|        135|
+------+-----------+



#### orderBy

In [50]:
peopleDF.orderBy(peopleDF.age.desc()).show()

+---+------+----------+
|age|  name|profession|
+---+------+----------+
| 35|Susana|   teacher|
| 18|  Aroa|   student|
| 15|  Lara|   student|
+---+------+----------+



The order is controlled by a Column Expression: .asc() and .desc() are column expressions (see the Column Expressions section below).

#### groupBy

We can group data (it returns a GroupedData object with additional operations):

In [51]:
peopleDF.groupBy('profession')

<pyspark.sql.group.GroupedData at 0x7f347452f8d0>

And then we can perform operations on grouped data:

* Calculate the maximum/minimum:

In [52]:
peopleDF.groupBy('profession').max('age').show()

+----------+--------+
|profession|max(age)|
+----------+--------+
|   student|      18|
|   teacher|      35|
+----------+--------+



* Calculate the mean:

In [53]:
peopleDF.groupBy('profession').mean('age').show()

+----------+--------+
|profession|avg(age)|
+----------+--------+
|   student|    16.5|
|   teacher|    35.0|
+----------+--------+



* Calculate the sum:

In [54]:
peopleDF.groupBy('profession').sum('age').show()

+----------+--------+
|profession|sum(age)|
+----------+--------+
|   student|      33|
|   teacher|      35|
+----------+--------+



Reference: [Available GroupedData operations](https://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.GroupedData)
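
To make clear what these grouped aggregations compute, here is a plain-Python sketch (not Spark code) that reproduces the three results above for the same three people. The variable names are chosen only for this illustration.

```python
from collections import defaultdict

people = [
    ("Aroa", 18, "student"),
    ("Lara", 15, "student"),
    ("Susana", 35, "teacher"),
]

# Step 1: group the ages by profession (what groupBy('profession') sets up).
ages_by_profession = defaultdict(list)
for name, age, profession in people:
    ages_by_profession[profession].append(age)

# Step 2: apply one aggregation per group (max, mean and sum of 'age').
summary = {
    profession: {
        "max": max(ages),
        "avg": sum(ages) / float(len(ages)),
        "sum": sum(ages),
    }
    for profession, ages in ages_by_profession.items()
}
```

The values in `summary` match the max(age), avg(age) and sum(age) columns shown in the outputs above.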

#### join

We can join DataFrames in a similar way as we did with PairRDDs:

In [55]:
professionsDF = sqlContext.createDataFrame([{'name': 'student', 'description': 'A person engaged in study'}, {'name': 'teacher', 'description': 'A person whose occupation is teaching'}])

In [56]:
peopleDF.join(professionsDF, peopleDF.profession == professionsDF.name).show()

+---+------+----------+--------------------+-------+
|age|  name|profession|         description|   name|
+---+------+----------+--------------------+-------+
| 18|  Aroa|   student|A person engaged ...|student|
| 15|  Lara|   student|A person engaged ...|student|
| 35|Susana|   teacher|A person whose oc...|teacher|
+---+------+----------+--------------------+-------+
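
By default, join performs an inner join: rows without a match on the other side are dropped. The plain-Python sketch below (not Spark code) shows the same logic. The extra person 'Marta' is a hypothetical row added only to show what happens to an unmatched key.

```python
people = [
    ("Aroa", 18, "student"),
    ("Lara", 15, "student"),
    ("Susana", 35, "teacher"),
    ("Marta", 40, "engineer"),  # hypothetical row with no matching profession
]
professions = [
    {"name": "student", "description": "A person engaged in study"},
    {"name": "teacher", "description": "A person whose occupation is teaching"},
]

# Index the right-hand side by its join key.
description_by_name = {p["name"]: p["description"] for p in professions}

# Inner join: keep only the people whose profession has a match.
joined = [
    (name, age, profession, description_by_name[profession])
    for name, age, profession in people
    if profession in description_by_name
]
joined_names = [row[0] for row in joined]
```

Here 'Marta' does not appear in `joined` because 'engineer' has no entry on the professions side.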



### Query Strings

The same condition can be written either as a **Query String** (a SQL expression passed as a string) or as a column expression. The following three filters are equivalent:

In [57]:
peopleDF.where(peopleDF.age > 10).show()

+---+------+----------+
|age|  name|profession|
+---+------+----------+
| 18|  Aroa|   student|
| 15|  Lara|   student|
| 35|Susana|   teacher|
+---+------+----------+



In [58]:
peopleDF.where('age > 10').show()

+---+------+----------+
|age|  name|profession|
+---+------+----------+
| 18|  Aroa|   student|
| 15|  Lara|   student|
| 35|Susana|   teacher|
+---+------+----------+



In [59]:
peopleDF.where(peopleDF['age'] > 10).show()

+---+------+----------+
|age|  name|profession|
+---+------+----------+
| 18|  Aroa|   student|
| 15|  Lara|   student|
| 35|Susana|   teacher|
+---+------+----------+



### Column Expressions

Some queries like select, sort, join or where can take column expressions.

In [60]:
# A column
peopleDF.name

Column<name>

We can operate on columns:

In [61]:
peopleDF.select(peopleDF.name, peopleDF.age + 100).show()

+------+-----------+
|  name|(age + 100)|
+------+-----------+
|  Aroa|        118|
|  Lara|        115|
|Susana|        135|
+------+-----------+



In [62]:
peopleDF.orderBy(peopleDF.age.desc()).show()

+---+------+----------+
|age|  name|profession|
+---+------+----------+
| 35|Susana|   teacher|
| 18|  Aroa|   student|
| 15|  Lara|   student|
+---+------+----------+



In [63]:
peopleDF.orderBy(peopleDF.age.asc()).show()

+---+------+----------+
|age|  name|profession|
+---+------+----------+
| 15|  Lara|   student|
| 18|  Aroa|   student|
| 35|Susana|   teacher|
+---+------+----------+



In [64]:
peopleDF.orderBy(peopleDF.profession.like('stu%')).show()

+---+------+----------+
|age|  name|profession|
+---+------+----------+
| 35|Susana|   teacher|
| 18|  Aroa|   student|
| 15|  Lara|   student|
+---+------+----------+
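
The last result may look surprising: like('stu%') produces a boolean column, and in ascending order false sorts before true, so the teacher comes first. A plain-Python analogue (not Spark code) of sorting by a boolean key:

```python
people = [
    ("Aroa", 18, "student"),
    ("Lara", 15, "student"),
    ("Susana", 35, "teacher"),
]

# Sort key: does the profession start with "stu"?
# False sorts before True, so non-matching rows come first.
ordered = sorted(people, key=lambda p: p[2].startswith("stu"))
names_in_order = [p[0] for p in ordered]
```

Python's sorted() keeps tied rows in their original order. Spark makes no such promise for rows with equal sort keys, so the relative order of the two students should not be relied upon.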



## DataFrames and RDDs

Sometimes it is useful to use a DataFrame as an RDD so that you have all the flexibility of the RDD API.

It is very easy to access the underlying RDD of Rows, which is exposed under the **.rdd** property:

In [65]:
import math
peopleDF.rdd.map(lambda row: (row.name, math.sqrt(row.age))).collect()

[(u'Aroa', 4.242640687119285),
 (u'Lara', 3.872983346207417),
 (u'Susana', 5.916079783099616)]

## SQL Queries

In [66]:
peopleDF.registerTempTable('people')

In [67]:
sqlContext.sql('SELECT * FROM people WHERE age > 20').show()

+---+------+----------+
|age|  name|profession|
+---+------+----------+
| 35|Susana|   teacher|
+---+------+----------+



In [68]:
sqlContext.sql('''SELECT * FROM people WHERE name LIKE "Ar%"''').show()

+---+----+----------+
|age|name|profession|
+---+----+----------+
| 18|Aroa|   student|
+---+----+----------+
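
The queries themselves are ordinary SQL. As a way to experiment with them without a cluster, the sketch below runs the same two statements against an in-memory SQLite database from the Python standard library. SQLite is only a stand-in here, not part of Spark, and its SQL dialect differs from Spark SQL and HiveQL in other respects.

```python
import sqlite3

# In-memory database that plays the role of the registered temporary table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (age INTEGER, name TEXT, profession TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [(18, "Aroa", "student"), (15, "Lara", "student"), (35, "Susana", "teacher")],
)

# Same filters as the two Spark SQL queries above.
older = conn.execute("SELECT * FROM people WHERE age > 20").fetchall()
starts_with_ar = conn.execute(
    "SELECT * FROM people WHERE name LIKE 'Ar%'"
).fetchall()
conn.close()
```

Both result lists contain the same rows as the corresponding Spark outputs above.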



## List of built-in functions

There are several built-in functions that are useful for operating on Columns:

* [List of built-in functions](http://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#module-pyspark.sql.functions)

When in doubt about how to express a transformation, it is useful to check this list before falling back to the underlying RDD. For example, there are functions for:
* abs
* avg
* cos
* concat
* regexp_extract
* regexp_replace
* sum
* when
* otherwise (a Column method, used together with when)
* lit

## UDFs

When there is no built-in function available, we can create our own user-defined functions (UDFs), which allow us to use custom Python code to operate on Columns.

In [71]:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
is_adult = udf(lambda n: 1 if n > 18 else 0, IntegerType())

In [73]:
peopleDF.select(peopleDF.name, is_adult(peopleDF.age).alias('adult')).collect()

[Row(name=u'Aroa', adult=0),
 Row(name=u'Lara', adult=0),
 Row(name=u'Susana', adult=1)]
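
Note that Aroa, who is exactly 18, gets adult=0 because the lambda uses a strict comparison. Since the function wrapped by udf() is ordinary Python, its rule can be checked on its own before using it in a query. The sketch below does that in plain Python; `is_adult_flag_safe` and its `threshold` parameter are hypothetical additions showing one way to include the threshold and to pass missing ages through as None.

```python
def is_adult_flag(age):
    """Same rule as the lambda wrapped by udf() above: strictly older than 18."""
    return 1 if age > 18 else 0


def is_adult_flag_safe(age, threshold=18):
    """Hypothetical variant: includes the threshold and tolerates missing ages."""
    if age is None:
        return None  # unknown age stays unknown
    return 1 if age >= threshold else 0


# Ages of Aroa, Lara and Susana from the examples above.
flags = [is_adult_flag(a) for a in (18, 15, 35)]
safe_flags = [is_adult_flag_safe(a) for a in (18, 15, 35, None)]
```

Testing the plain function first is usually quicker than debugging it through a Spark job.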

## Exercises
Review the documentation:
* [pyspark.sql documentation](https://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html)

Exercises:
* Unit 5 Working with meteorological data, now using DataFrames. You can also try using SQL.
* Unit 5 Sentiment Analysis: Review the Sentiment Analysis notebook that makes use of DataFrames and Spark ML