# Indexing

In Notebook `10.4 Normalised v. unnormalised data`, you compared the runtimes of the same queries executed against the *normalised* and *unnormalised* forms of the *Movies dataset*. You observed that when only a small fraction of the data needed to be accessed to answer the query (Queries 3 and 4), the runtimes were lower for the *normalised data* than for the *unnormalised data*. This is because the query optimiser can restrict the join operations to just the data required, whereas with the *unnormalised data* all of the data has to be accessed.

In this Notebook you will explore the effect of indexing on the performance of the execution of these queries.

Enable access to the PostgreSQL database engine via [SQL Cell Magic](https://pypi.python.org/pypi/ipython-sql).

In [1]:
%load_ext sql
%sql postgresql://test:test@localhost:5432/tm351test

'Connected: test@tm351test'

Normalised data: create the `movie`, `movie_actor`, `movie_country`, `movie_director` and `movie_genre` tables.

In [2]:
%%sql
DROP TABLE IF EXISTS movie_actor CASCADE;
DROP TABLE IF EXISTS movie_country CASCADE;
DROP TABLE IF EXISTS movie_director CASCADE;
DROP TABLE IF EXISTS movie_genre CASCADE;
DROP TABLE IF EXISTS movie CASCADE;

CREATE TABLE movie (
 movie_id INTEGER NOT NULL,
 title VARCHAR(250) NOT NULL,
 year INTEGER NOT NULL,
 rt_all_critics_rating REAL,
 rt_top_critics_rating REAL,
 rt_audience_rating REAL,
 ml_user_rating REAL,
 PRIMARY KEY (movie_id)
);

CREATE TABLE movie_actor (
 movie_id INTEGER NOT NULL,
 actor_name VARCHAR(50) NOT NULL,
 ranking INTEGER NOT NULL,
 PRIMARY KEY (movie_id, actor_name),
 FOREIGN KEY (movie_id) REFERENCES movie(movie_id)
);

CREATE TABLE movie_country (
 movie_id INTEGER NOT NULL,
 country VARCHAR(30) NOT NULL,
 PRIMARY KEY (movie_id),
 FOREIGN KEY (movie_id) REFERENCES movie(movie_id)
);

CREATE TABLE movie_director (
 movie_id INTEGER NOT NULL,
 director_name VARCHAR(50) NOT NULL,
 PRIMARY KEY (movie_id),
 FOREIGN KEY (movie_id) REFERENCES movie(movie_id)
);

CREATE TABLE movie_genre (
 movie_id INTEGER NOT NULL,
 genre VARCHAR(20) NOT NULL,
 PRIMARY KEY (movie_id, genre),
 FOREIGN KEY (movie_id) REFERENCES movie(movie_id)
);

Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.


[]

Populate the tables from the Movies dataset using [Psycopg](http://initd.org/psycopg/docs/index.html), 
a PostgreSQL database adapter for Python.

In [3]:
import psycopg2 as pg
import pandas as pd
import pandas.io.sql as psqlg

In [4]:
# open a connection to the PostgreSQL database tm351test
conn = pg.connect(dbname='tm351test', host='localhost', user='test', password='test', port=5432)
# create a cursor
c = conn.cursor()

# populate each table from its data file, committing after each load;
# 'movie' is loaded first because the other tables reference it
for table in ['movie', 'movie_actor', 'movie_country', 'movie_director', 'movie_genre']:
    # the 'with' block closes the file even if the copy fails
    with open('data/' + table + '.dat', 'r') as data_file:
        c.copy_from(data_file, table)
    conn.commit()

# close cursor
c.close()
# close database connection
conn.close()

Unnormalised data: create the `movie_unnormalised` table by 'joining' together the `movie`, `movie_actor`, 
`movie_country`, `movie_director` and `movie_genre` tables defined above.

In [5]:
%%sql
DROP TABLE IF EXISTS movie_unnormalised;

CREATE TABLE movie_unnormalised AS
  SELECT movie.*, 
         actor_name, ranking, 
         country,
         director_name,
         genre
  FROM (((movie LEFT OUTER JOIN movie_actor    ON movie.movie_id = movie_actor.movie_id)
                LEFT OUTER JOIN movie_country  ON movie.movie_id = movie_country.movie_id)
                LEFT OUTER JOIN movie_director ON movie.movie_id = movie_director.movie_id)
                           JOIN movie_genre    ON movie.movie_id = movie_genre.movie_id; 

Done.
484795 rows affected.


[]

Notes:

As identified in Notebook `08.1 Movies dataset`, because the actors, country of origin and director are missing 
for some movies, the `LEFT OUTER JOIN` operations on the `movie_actor`, `movie_country` and `movie_director` tables are 
necessary to ensure that these movies appear in the `movie_unnormalised` table.
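
The effect of the `LEFT OUTER JOIN` can be seen on a very small example. The following is a minimal sketch, not part of the original Notebook: it uses Python's built-in `sqlite3` module with an in-memory database rather than PostgreSQL, and the two movies and the single director are hypothetical data invented for illustration.

```python
import sqlite3

# In-memory SQLite database with two toy tables (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie (movie_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE movie_director (
        movie_id INTEGER PRIMARY KEY REFERENCES movie(movie_id),
        director_name TEXT NOT NULL
    );
    INSERT INTO movie VALUES (1, 'Movie A'), (2, 'Movie B');
    -- Movie B deliberately has no director row.
    INSERT INTO movie_director VALUES (1, 'Director X');
""")

inner_rows = conn.execute("""
    SELECT movie.movie_id, director_name
    FROM movie JOIN movie_director
         ON movie.movie_id = movie_director.movie_id
    ORDER BY movie.movie_id
""").fetchall()

outer_rows = conn.execute("""
    SELECT movie.movie_id, director_name
    FROM movie LEFT OUTER JOIN movie_director
         ON movie.movie_id = movie_director.movie_id
    ORDER BY movie.movie_id
""").fetchall()

print(inner_rows)   # only the movie that has a director
print(outer_rows)   # both movies; the missing director appears as None (NULL)
conn.close()
```

With the plain `JOIN`, the movie without a director is dropped; with the `LEFT OUTER JOIN`, it is kept and the missing value is returned as `NULL`.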

## PostgreSQL
When a table is created in PostgreSQL using the 
[`CREATE TABLE`](http://www.postgresql.org/docs/9.3/static/sql-createtable.html) statement, 
an *index* is created automatically on the *primary key* column(s). For example, the `movie` table defined above will have an index defined on `movie_id`, the *primary key* column.

If, however, a table is created in PostgreSQL from the results of a query using the 
[`CREATE TABLE AS`](http://www.postgresql.org/docs/9.3/static/sql-createtableas.html) statement,
no indexes are defined on that table. For example, the `movie_unnormalised` table defined above will have no indexes defined on it.
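
The difference between the two ways of creating a table can be illustrated as follows. This is a hedged sketch only: it uses Python's built-in `sqlite3` module instead of PostgreSQL, because SQLite happens to behave in the same way for a composite primary key. The table `movie_genre_copy`, the helper `index_names` and the sample rows are hypothetical names and data introduced for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie_genre (
        movie_id INTEGER NOT NULL,
        genre    VARCHAR(20) NOT NULL,
        PRIMARY KEY (movie_id, genre)
    );
    INSERT INTO movie_genre VALUES (1, 'Comedy'), (2, 'Drama');
    -- Table built from a query result, as with movie_unnormalised.
    CREATE TABLE movie_genre_copy AS SELECT * FROM movie_genre;
""")

def index_names(table):
    """Return the names of the indexes SQLite reports for `table`."""
    rows = conn.execute(f"PRAGMA index_list('{table}')").fetchall()
    return [row[1] for row in rows]   # column 1 of each row is the index name

pk_indexes = index_names("movie_genre")
ctas_indexes = index_names("movie_genre_copy")
print(pk_indexes)    # one automatically created index backing the primary key
print(ctas_indexes)  # empty: no index is carried over to the copy
conn.close()
```

Note that `PRAGMA index_list` is specific to SQLite; in PostgreSQL the equivalent information comes from the *System Catalogue*.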

Additional indexes can be created using the 
[`CREATE INDEX`](http://www.postgresql.org/docs/9.3/static/sql-createindex.html) statement.

In Notebook `09.1 SQL DDL`, you learnt that relational database management systems maintain a *data dictionary* called the [*System Catalogue*](http://www.postgresql.org/docs/9.3/static/catalogs.html), which stores *metadata* such as data about tables, columns and constraints, and that the SQL Standard specifies a uniform means of accessing the System Catalogue called the 
[*Information Schema*](http://www.postgresql.org/docs/9.3/static/information-schema.html).

However, as indexing is considered to be an implementation detail, the SQL standard makes no provision for creating indexes. Consequently, the *Information Schema* provides no information about indexes, and we have to access PostgreSQL's *System Catalogue* directly. As retrieving metadata from PostgreSQL's *System Catalogue* is not straightforward, we have provided an SQL function (`table_indexes`) which returns a table containing information about the indexes defined on a given table. (You are not required to understand how this function determines what indexes are defined on a table.)

In [None]:
%%sql
CREATE OR REPLACE FUNCTION table_indexes(p_table NAME)
RETURNS TABLE (table_name NAME, index_name NAME, column_names TEXT) AS $$
BEGIN
 RETURN QUERY

-- See http://stackoverflow.com/questions/2204058/list-columns-with-indexes-in-postgresql
select
    t.relname as table_name,
    i.relname as index_name,
    array_to_string(array_agg(a.attname), ', ') as column_names
from
    pg_class t,
    pg_class i,
    pg_index ix,
    pg_attribute a
where
    t.oid = ix.indrelid
    and i.oid = ix.indexrelid
    and a.attrelid = t.oid
    and a.attnum = ANY(ix.indkey)
    and t.relkind = 'r'
    and t.relname like p_table
group by
    t.relname,
    i.relname
order by
    t.relname,
    i.relname;

END;
$$ LANGUAGE plpgsql;

In [None]:
%%sql
SELECT *
FROM table_indexes('movie_actor')

As the `table_indexes` function uses the 
[`LIKE`](http://www.postgresql.org/docs/9.3/static/functions-matching.html#FUNCTIONS-LIKE) 
predicate to select the table when extracting information about the indexes defined on a given table, 
the table name passed to the function can be expressed as a pattern.

In [None]:
%%sql
SELECT *
FROM table_indexes('movie%')

Note that the `movie_unnormalised` table does not appear in the above result as it has no indexes defined on it, 
for the reasons stated above.
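
The pattern matching performed by `LIKE` can be tried in isolation. The sketch below is an illustration only and uses Python's built-in `sqlite3` module rather than PostgreSQL; the helper `like` is a hypothetical name. One difference to be aware of is that SQLite's `LIKE` ignores case for ASCII letters by default, whereas PostgreSQL's `LIKE` is case-sensitive, so only lower-case names are compared here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def like(value, pattern):
    """Return True if `value LIKE pattern` holds, as evaluated by SQLite."""
    (result,) = conn.execute("SELECT ? LIKE ?", (value, pattern)).fetchone()
    return bool(result)

prefix_match = like("movie_actor", "movie%")            # % matches any run of characters
exact_match = like("movie_actor", "movie_actor")        # the name matches itself
underscore_match = like("movieXactor", "movie_actor")   # _ matches any one character
no_match = like("movie", "movie_actor")                 # too short to match
print(prefix_match, exact_match, underscore_match, no_match)
conn.close()
```

Because `_` is itself a wildcard that matches any single character, a name such as `movie_actor` is also treated as a pattern, although in this database it matches only the one table.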

## Activity

Execute each of the following SQL statements and record the runtime displayed.

### Query 3

In [None]:
# Query 3, normalised data
runtime_statistics=%sql \
 EXPLAIN ANALYZE \
 SELECT COUNT(DISTINCT movie.movie_id) \
 FROM (((movie LEFT OUTER JOIN movie_actor    ON movie.movie_id = movie_actor.movie_id) \
               LEFT OUTER JOIN movie_country  ON movie.movie_id = movie_country.movie_id) \
               LEFT OUTER JOIN movie_director ON movie.movie_id = movie_director.movie_id) \
                          JOIN movie_genre    ON movie.movie_id = movie_genre.movie_id \
 WHERE genre = 'Comedy'
pd.DataFrame(runtime_statistics).tail(1)

Notes:

When a PostgreSQL `SELECT` statement is prefixed with 
[`EXPLAIN ANALYZE`](http://www.postgresql.org/docs/9.3/static/sql-explain.html), the SQL statement is executed but 
runtime statistics are displayed instead of the resultant table, with the runtime (in milliseconds) on the 
last line of the output.

As `movie_genre.genre` is part of the *primary key* of the `movie_genre` table, it is already covered by the index defined on that *primary key* (`movie_id`, `genre`).
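
If you would rather record the runtime as a number than read it from the `EXPLAIN ANALYZE` output, a small helper along the following lines could be used. This is a sketch under stated assumptions: `runtime_ms` is a hypothetical function, it assumes each row of the result holds a single line of text, and it assumes the last line ends with a figure followed by `ms`. The sample lines are invented in the style of PostgreSQL 9.3 output and are not real results.

```python
import re

def runtime_ms(rows):
    """Return the runtime in milliseconds reported on the last output line.

    `rows` is a sequence of one-element rows, each holding one line of text,
    which is the shape assumed here for an EXPLAIN ANALYZE result set.
    """
    last_line = rows[-1][0]
    match = re.search(r"([0-9]+(?:\.[0-9]+)?)\s*ms", last_line)
    if match is None:
        raise ValueError(f"no runtime found in: {last_line!r}")
    return float(match.group(1))

# Hypothetical output lines, written in the style of PostgreSQL 9.3.
sample = [
    ("Aggregate  (...)",),
    ("Total runtime: 117.833 ms",),
]
print(runtime_ms(sample))
```

In this Notebook, a call such as `runtime_ms(runtime_statistics)` would be the intended use, assuming the result returned by `%sql` can be indexed like a list of rows.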

In [None]:
%%sql
-- This statement is included to allow the sequence of statements in this Notebook to be rerun.
DROP INDEX IF EXISTS movie_unnormalised_genre;

In [None]:
# Query 3, unnormalised data
runtime_statistics=%sql \
 EXPLAIN ANALYZE \
 SELECT COUNT(DISTINCT movie_id) \
 FROM movie_unnormalised \
 WHERE genre = 'Comedy'
pd.DataFrame(runtime_statistics).tail(1)

Create an index on the `movie_unnormalised.genre` column.

In [None]:
%%sql
-- Please be patient, indexing may take some time to complete.
CREATE INDEX movie_unnormalised_genre ON movie_unnormalised (genre);

In [None]:
# Query 3, unnormalised data
runtime_statistics=%sql \
 EXPLAIN ANALYZE \
 SELECT COUNT(DISTINCT movie_id) \
 FROM movie_unnormalised \
 WHERE genre = 'Comedy'
pd.DataFrame(runtime_statistics).tail(1)

### Query 4

In [None]:
# Query 4, normalised data
runtime_statistics=%sql \
 EXPLAIN ANALYZE \
 SELECT COUNT(DISTINCT movie.movie_id) \
 FROM (((movie LEFT OUTER JOIN movie_actor    ON movie.movie_id = movie_actor.movie_id) \
               LEFT OUTER JOIN movie_country  ON movie.movie_id = movie_country.movie_id) \
               LEFT OUTER JOIN movie_director ON movie.movie_id = movie_director.movie_id) \
                          JOIN movie_genre    ON movie.movie_id = movie_genre.movie_id \
 WHERE country = 'Tunisia'
pd.DataFrame(runtime_statistics).tail(1)

Notes:

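Note that `movie_country.country` is not a *primary key* column: the *primary key* of the `movie_country` table is `movie_id`, so the automatically created index is defined on `movie_id` rather than on `country`.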

In [None]:
%%sql
-- This statement is included to allow the sequence of statements in this Notebook to be rerun.
DROP INDEX IF EXISTS movie_unnormalised_country;

In [None]:
# Query 4, unnormalised data
runtime_statistics=%sql \
 EXPLAIN ANALYZE \
 SELECT COUNT(DISTINCT movie_id) \
 FROM movie_unnormalised \
 WHERE country = 'Tunisia'
pd.DataFrame(runtime_statistics).tail(1)

In [None]:
%%sql
-- Please be patient, indexing may take some time to complete.
CREATE INDEX movie_unnormalised_country ON movie_unnormalised (country);

In [None]:
# Query 4, unnormalised data
runtime_statistics=%sql \
 EXPLAIN ANALYZE \
 SELECT COUNT(DISTINCT movie_id) \
 FROM movie_unnormalised \
 WHERE country = 'Tunisia'
pd.DataFrame(runtime_statistics).tail(1)

## Analysis

Our results are given below.

Query | Normalised data (ms)\* | Unnormalised data (ms)\* | Unnormalised data + index (ms)
------|----------------------|------------------------|-------------------------------
3 | 117.833 | 166.070 | 75.236
4 | 14.020 | 125.870 | 0.118

\* results from Notebook `10.4 Normalised v. unnormalised data`.
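
The relative improvement implied by the table can be worked out directly. The sketch below simply divides the runtime without an index by the runtime with an index, using the figures from the table above; the dictionary layout and the names `results` and `speedups` are choices made for this example.

```python
# Runtimes copied from the results table above, assumed to be in milliseconds.
results = {
    3: {"unnormalised": 166.070, "unnormalised_indexed": 75.236},
    4: {"unnormalised": 125.870, "unnormalised_indexed": 0.118},
}

# Speed-up factor = runtime without the index / runtime with the index.
speedups = {
    query: times["unnormalised"] / times["unnormalised_indexed"]
    for query, times in results.items()
}

for query, factor in speedups.items():
    print(f"Query {query}: about {factor:.1f} times faster with the index")
```

For Query 3 the index roughly halves the runtime, while for Query 4 the runtime falls by a factor of more than a thousand. Your own timings will differ from run to run, so treat these ratios as indicative only.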

As we stated in Notebook `10.4 Normalised v. unnormalised data`, Queries 3 and 4 only access a fraction of the data: 
only 3703 movies are classified as comedies (Query 3) and only 1 movie was made in Tunisia (Query 4). 
The runtimes are lower for the *normalised data* than for the *unnormalised data* because the query optimiser can 
restrict the join operations to just the data about comedies and Tunisia, making use of the indexes already 
defined on the *primary keys* of the normalised tables.

In this Notebook, we can see that defining indexes on the `genre` and `country` columns of the `movie_unnormalised` 
table can significantly improve the efficiency of query execution, as the query optimiser can use these indexes 
to select the data about comedies and Tunisia.
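
How an index changes the way a query is executed can be observed on a small scale. The following is an illustrative sketch only: it uses Python's built-in `sqlite3` module and SQLite's `EXPLAIN QUERY PLAN` rather than PostgreSQL's `EXPLAIN ANALYZE`, the table holds 1000 hypothetical rows rather than the Movies dataset, and the helper `plan` is a name introduced for the example. The exact wording of the plan varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie_unnormalised (movie_id INTEGER, genre TEXT)")
# Hypothetical rows: every tenth movie is a comedy.
conn.executemany(
    "INSERT INTO movie_unnormalised VALUES (?, ?)",
    [(i, "Comedy" if i % 10 == 0 else "Drama") for i in range(1000)],
)

query = ("SELECT COUNT(DISTINCT movie_id) "
         "FROM movie_unnormalised WHERE genre = 'Comedy'")

def plan(sql):
    """Return the plan steps SQLite reports for `sql` as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return "; ".join(row[-1] for row in rows)   # last column holds the description

plan_before = plan(query)
conn.execute("CREATE INDEX movie_unnormalised_genre ON movie_unnormalised (genre)")
plan_after = plan(query)
count = conn.execute(query).fetchone()[0]

print(plan_before)  # a full scan of the table
print(plan_after)   # a search that names movie_unnormalised_genre
conn.close()
```

Before the index exists, every row has to be examined; afterwards, only the rows for the requested genre are visited. The query result itself is unchanged.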

## Summary
In this Notebook you have explored the effect of indexing on query execution time.

## What next?
If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to `12.3 Query processing`.