# Query processing
In this Notebook, you will explore how the order in which the clauses of a `SELECT` statement are processed affects 
the performance of query execution, using the *Movies dataset*.

Enable access to the PostgreSQL database engine via [SQL Cell Magic](https://pypi.python.org/pypi/ipython-sql).

In [1]:
%load_ext sql
%sql postgresql://test:test@localhost:5432/tm351test

'Connected: test@tm351test'

Create the `movie` and `movie_genre` tables.

In [2]:
%%sql
DROP TABLE IF EXISTS movie_genre CASCADE;
DROP TABLE IF EXISTS movie CASCADE;

CREATE TABLE movie (
 movie_id INTEGER NOT NULL,
 title VARCHAR(250) NOT NULL,
 year INTEGER NOT NULL,
 rt_all_critics_rating REAL,
 rt_top_critics_rating REAL,
 rt_audience_rating REAL,
 ml_user_rating REAL,
 PRIMARY KEY (movie_id)
);

CREATE TABLE movie_genre (
 movie_id INTEGER NOT NULL,
 genre VARCHAR(20) NOT NULL,
 PRIMARY KEY (movie_id, genre),
 FOREIGN KEY (movie_id) REFERENCES movie(movie_id)
);

Done.
Done.
Done.
Done.


[]

Populate the tables from the Movies dataset using [Psycopg](http://initd.org/psycopg/docs/index.html), 
a PostgreSQL database adapter for Python.

In [3]:
import psycopg2 as pg
import pandas as pd
import pandas.io.sql as psqlg

In [4]:
# open a connection to the PostgreSQL database tm351test
conn = pg.connect(dbname='tm351test', host='localhost', user='test', password='test', port=5432)
# create a cursor
c = conn.cursor()

# populate 'movie' table (the file is closed automatically on leaving the with block)
with open('data/movie.dat', 'r') as f:
    c.copy_from(f, 'movie')
conn.commit()

# populate 'movie_genre' table
with open('data/movie_genre.dat', 'r') as f:
    c.copy_from(f, 'movie_genre')
conn.commit()

# close cursor
c.close()
# close database connection
conn.close()

## Activity

Consider the following `SELECT` statement:

```
SELECT COUNT(*)
FROM movie NATURAL JOIN movie_genre
WHERE genre = 'Comedy'
```

In this Notebook, we will compare the efficiency of two different orders of processing the clauses of this query:
    
1. `FROM` (join) -> `WHERE` (selection) -> `SELECT` (aggregation)
2. `WHERE` (selection) -> `FROM` (join) -> `SELECT` (aggregation)

We will execute each ordering, clause by clause, recording the runtime required to process each clause. We will link the execution of clauses by recording the output from each clause in a *table*.
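Before timing anything, it can help to confirm that both orderings produce the same answer. The following is a minimal sketch of the step-by-step chaining described above. It is not part of the original exercise: it assumes Python's built-in `sqlite3` module with an in-memory database instead of PostgreSQL, and the rows in `MOVIES` and `GENRES` and the helper `run_ordering` are made up for illustration.

```python
import sqlite3

# Hypothetical miniature stand-in for the Movies dataset (made-up rows).
MOVIES = [(1, "Movie A", 1995), (2, "Movie B", 1999), (3, "Movie C", 2004)]
GENRES = [(1, "Comedy"), (1, "Drama"), (2, "Comedy"), (3, "Thriller")]


def run_ordering(steps):
    """Run three chained steps against a fresh in-memory database.

    Each step's output is stored in a temporary table that the next
    step reads from; the single value held in step_3 is returned.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE movie (movie_id INTEGER PRIMARY KEY, title TEXT, year INTEGER)"
    )
    conn.execute(
        "CREATE TABLE movie_genre (movie_id INTEGER, genre TEXT, "
        "PRIMARY KEY (movie_id, genre))"
    )
    conn.executemany("INSERT INTO movie VALUES (?, ?, ?)", MOVIES)
    conn.executemany("INSERT INTO movie_genre VALUES (?, ?)", GENRES)
    for name, query in zip(("step_1", "step_2", "step_3"), steps):
        conn.execute(f"CREATE TEMPORARY TABLE {name} AS {query}")
    (count,) = conn.execute("SELECT * FROM step_3").fetchone()
    conn.close()
    return count


# Ordering 1: join, then selection, then aggregation.
ordering_1 = (
    "SELECT * FROM movie NATURAL JOIN movie_genre",
    "SELECT * FROM step_1 WHERE genre = 'Comedy'",
    "SELECT COUNT(*) FROM step_2",
)
# Ordering 2: selection, then join, then aggregation.
ordering_2 = (
    "SELECT * FROM movie_genre WHERE genre = 'Comedy'",
    "SELECT * FROM movie NATURAL JOIN step_1",
    "SELECT COUNT(*) FROM step_2",
)

count_1 = run_ordering(ordering_1)
count_2 = run_ordering(ordering_2)
print(count_1, count_2)
```

Both orderings count the same comedies in this toy data; only the size of the intermediate tables differs, which is what the timings in the rest of this Notebook measure.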

### 1. `FROM` (join) -> `WHERE` (selection) -> `SELECT` (aggregation)

In [None]:
%%sql
DROP TABLE IF EXISTS step_1;

In [None]:
runtime_statistics=%sql \
 EXPLAIN ANALYZE \
 CREATE TEMPORARY TABLE step_1 AS \
  SELECT * \
  FROM movie NATURAL JOIN movie_genre
pd.DataFrame(runtime_statistics).tail(1)

In [None]:
%%sql
DROP TABLE IF EXISTS step_2;

In [None]:
runtime_statistics=%sql \
 EXPLAIN ANALYZE \
 CREATE TEMPORARY TABLE step_2 AS \
  SELECT * \
  FROM step_1 \
  WHERE genre = 'Comedy'
pd.DataFrame(runtime_statistics).tail(1)

In [None]:
%%sql
DROP TABLE IF EXISTS step_3;

In [None]:
runtime_statistics=%sql \
 EXPLAIN ANALYZE \
 CREATE TEMPORARY TABLE step_3 AS \
  SELECT COUNT(*) \
  FROM step_2
pd.DataFrame(runtime_statistics).tail(1)

### 2. `WHERE` (selection) -> `FROM` (join) -> `SELECT` (aggregation)

In [None]:
%%sql
DROP TABLE IF EXISTS step_1;

In [None]:
runtime_statistics=%sql \
 EXPLAIN ANALYZE \
 CREATE TEMPORARY TABLE step_1 AS \
  SELECT * \
  FROM movie_genre \
  WHERE genre = 'Comedy'
pd.DataFrame(runtime_statistics).tail(1)

In [None]:
%%sql
DROP TABLE IF EXISTS step_2;

In [None]:
runtime_statistics=%sql \
 EXPLAIN ANALYZE \
 CREATE TEMPORARY TABLE step_2 AS \
  SELECT * \
  FROM movie NATURAL JOIN step_1
pd.DataFrame(runtime_statistics).tail(1)

In [None]:
%%sql
DROP TABLE IF EXISTS step_3;

In [None]:
runtime_statistics=%sql \
 EXPLAIN ANALYZE \
 CREATE TEMPORARY TABLE step_3 AS \
  SELECT COUNT(*) \
  FROM step_2
pd.DataFrame(runtime_statistics).tail(1)
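Each cell above displays the last row of the `EXPLAIN ANALYZE` output, which holds the runtime of that step. If you want to collect the runtimes as numbers rather than copy them by hand, a small parser along the following lines may help. This is a sketch under assumptions: the helper name `runtime_ms` is hypothetical, and the pattern assumes the last row reads like `Execution time: 58.164 ms` or `Total runtime: 58.164 ms`; the exact label depends on your PostgreSQL version.

```python
import re


def runtime_ms(last_line):
    """Extract the runtime in milliseconds from the final EXPLAIN ANALYZE line."""
    # Assumed line formats: "Execution time: 58.164 ms" or "Total runtime: 58.164 ms"
    match = re.search(
        r"(?:Execution time|Total runtime):\s*([\d.]+)\s*ms",
        last_line,
        re.IGNORECASE,
    )
    if match is None:
        raise ValueError(f"no runtime found in: {last_line!r}")
    return float(match.group(1))


# Hypothetical sample lines, using figures from our own run.
print(runtime_ms("Execution time: 58.164 ms"))
print(runtime_ms("Total runtime: 3.910 ms"))
```

In the Notebook, the function could be applied to the text in the last row of `runtime_statistics` after each step.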

## Analysis

Our results are given below.

Ordering | Step 1 (ms) | Step 2 (ms) | Step 3 (ms) | Total (ms) | Order of processing
------|-------|--------|--------|------|------
1 | 58.164 | 14.298 | 3.910 | 76.372 | `FROM` (join) -> `WHERE` (selection) -> `SELECT` (aggregation)
2 | 10.669 | 16.490 | 3.159 | 30.318 | `WHERE` (selection) -> `FROM` (join) -> `SELECT` (aggregation)

These results show that the order in which the clauses of a `SELECT` statement are processed affects the efficiency of 
query execution: in our run, Ordering 2 took less than half the total runtime of Ordering 1.

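Ordering 2 has a shorter total runtime because its *join* operation (Step 2) operates on a smaller collection of data than the *join* operation in Step 1 of Ordering 1: the selection in Step 1 of Ordering 2 has already discarded every `movie_genre` row whose genre is not 'Comedy'.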
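The comparison can be made concrete with a little arithmetic on the figures in the results table. The sketch below only recomputes the totals and two ratios from our recorded runtimes; the variable names are illustrative, and your own timings will differ from run to run.

```python
# Runtimes copied from the results table (as reported by EXPLAIN ANALYZE).
runtimes = {
    1: [58.164, 14.298, 3.910],  # join, selection, aggregation
    2: [10.669, 16.490, 3.159],  # selection, join, aggregation
}

# Total runtime per ordering, rounded to the precision of the inputs.
totals = {ordering: round(sum(steps), 3) for ordering, steps in runtimes.items()}

# How many times longer Ordering 1 takes overall.
speedup = totals[1] / totals[2]

# Compare the two join steps directly:
# Step 1 of Ordering 1 versus Step 2 of Ordering 2.
join_ratio = runtimes[1][0] / runtimes[2][1]

print(totals)
print(f"Ordering 1 total / Ordering 2 total: {speedup:.2f}")
print(f"Ordering 1 join / Ordering 2 join: {join_ratio:.2f}")
```

Most of the difference between the two totals comes from the join step, which is consistent with the explanation above.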

## Summary
In this Notebook, you have explored how the order in which the clauses of a `SELECT` statement are processed can affect the performance of query execution.

## What next?
If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, you've completed the Part 12 Notebooks.