# Six Degrees of Kevin Bacon

This notebook explores **materialised views**, using the example of finding the [Bacon number](https://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon) of film stars. The Bacon number is a way of measuring how "connected" people are in movies.

In this notebook, we will use a (conceptually) simple task to investigate how materialised views can make some database tasks more tractable in practice. We will compare the time taken by queries that use materialised views with the time taken by queries that use ordinary (non-materialised) views.

You should spend around one hour on this notebook. Note that this notebook does not contain any new concepts; you should treat it just as a worked example of the material you have seen so far on views.

## The Bacon number

An actor's Bacon number is how many "movies away" they are from appearing with Kevin Bacon.

Kevin Bacon has a Bacon number of zero. 

Everyone he's been in a film with has a Bacon number of one.

If a person has been in a film with someone with a Bacon number of one, that person has a Bacon number of two.

If a person has been in a film with someone with a Bacon number of two, that person has a Bacon number of three.

The urban myth is that everyone in the movies has a Bacon number of six or less.

We will investigate whether this is true.
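Before turning to the database, the definition above can be sketched in plain Python as a breadth-first search over a co-star graph. This is only an illustration: the cast list `toy_casts` and the function `bacon_numbers` are hypothetical and are not part of the `movies` database or of the rest of this notebook.

```python
from collections import deque

# Hypothetical toy data: movie title -> set of cast members.
toy_casts = {
    "Movie A": {"Kevin Bacon", "Alice"},
    "Movie B": {"Alice", "Bob"},
    "Movie C": {"Bob", "Carol"},
    "Movie D": {"Dave", "Erin"},  # not connected to Kevin Bacon
}

def bacon_numbers(casts, start="Kevin Bacon"):
    """Return {person: Bacon number} for everyone reachable from start."""
    # Map each person to the set of people they have shared a film with.
    co_stars = {}
    for cast in casts.values():
        for person in cast:
            co_stars.setdefault(person, set()).update(cast - {person})

    numbers = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for other in co_stars.get(person, ()):
            if other not in numbers:  # first visit gives the shortest distance
                numbers[other] = numbers[person] + 1
                queue.append(other)
    return numbers

toy_numbers = bacon_numbers(toy_casts)
print(sorted(toy_numbers.items(), key=lambda item: item[1]))
# [('Kevin Bacon', 0), ('Alice', 1), ('Bob', 2), ('Carol', 3)]
```

Dave and Erin never appear in the result because no chain of films links them to Kevin Bacon. In the database, such people are the ones who fall outside every Bacon view.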

## Setting up

The next group of cells set up your database connection, and reset the database to a clean state. Check notebook *08.1 Data Definition Language in SQL* if you are unsure what the next cells do.

You may need to change the given values of the variables `DB_USER` and `DB_PWD`, depending on which environment you are using.

In [None]:
# Make the connection

%run sql_init.ipynb
print("Connecting with connection string : {}".format(DB_CONNECTION))
%sql $DB_CONNECTION

## The database ERD: recap



Before we start, here's the ERD of the database again, which can be useful in getting orientated around the information it contains.


![Movies ERD](./images/movies-erd.svg)


As with `Notebook 11.1: Movie analysis`, this notebook uses the `movies` schema, so let's set the `search_path` so that we don't need to qualify all the table names:

In [None]:
%%sql

SET search_path TO movies, public;

# First steps
First, how many actors are there, and can we identify Kevin Bacon?

To start with, the next cell removes any views or materialised views which you might have created if this is not the first time you have used this notebook.

In [None]:
%%sql 

DROP MATERIALIZED VIEW IF EXISTS movie_bacon;
DROP MATERIALIZED VIEW IF EXISTS mbaconn;

DROP VIEW IF EXISTS bacon6;
DROP VIEW IF EXISTS bacon5;
DROP VIEW IF EXISTS bacon4;
DROP VIEW IF EXISTS bacon3;
DROP VIEW IF EXISTS bacon2;
DROP VIEW IF EXISTS bacon1;

DROP VIEW IF EXISTS jbacon6;
DROP VIEW IF EXISTS jbacon5;
DROP VIEW IF EXISTS jbacon4;
DROP VIEW IF EXISTS jbacon3;
DROP VIEW IF EXISTS jbacon2;
DROP VIEW IF EXISTS jbacon1;

DROP MATERIALIZED VIEW IF EXISTS mbacon6;
DROP MATERIALIZED VIEW IF EXISTS mbacon5;
DROP MATERIALIZED VIEW IF EXISTS mbacon4;
DROP MATERIALIZED VIEW IF EXISTS mbacon3;
DROP MATERIALIZED VIEW IF EXISTS mbacon2;
DROP MATERIALIZED VIEW IF EXISTS mbacon1;

In [None]:
%%sql

SELECT COUNT(DISTINCT person_id)
FROM cast_member;

First, using standard SQL pattern matching with `LIKE`:

In [None]:
%%sql

SELECT id, name
FROM person 
WHERE name LIKE '%Kevin%Bacon%';

...and using the Postgres regular expression matcher:

In [None]:
%%sql 

SELECT id, name 
FROM person 
WHERE name ~* 'kevin.*bacon';
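The two searches above differ mainly in case sensitivity: in PostgreSQL, `LIKE` is case-sensitive, while the `~*` operator is a case-insensitive regular expression match. As a rough illustration only, the same two tests can be mimicked with Python's `re` module. The names in `sample_names` are made up for this sketch and do not come from the `person` table.

```python
import re

# Hypothetical names, invented for illustration.
sample_names = ["Kevin Bacon", "kevin bacon", "Kevin Spacey", "Bacon, Kevin"]

# Rough analogue of: name LIKE '%Kevin%Bacon%'  (case-sensitive)
like_matches = [n for n in sample_names if re.search(r"Kevin.*Bacon", n)]

# Rough analogue of: name ~* 'kevin.*bacon'  (case-insensitive)
regex_matches = [n for n in sample_names
                 if re.search(r"kevin.*bacon", n, re.IGNORECASE)]

print(like_matches)   # ['Kevin Bacon']
print(regex_matches)  # ['Kevin Bacon', 'kevin bacon']
```

Both patterns require "Kevin" to come before "Bacon", so a name stored as "Bacon, Kevin" would be missed by either search.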

In [None]:
# Store Kevin Bacon's person id for use in the queries that follow
kevin_bacon_id = 4724

### Activity 1


Find the person id and name of everyone with a Bacon number of one. That is, all the people who have been cast members in a movie where Kevin Bacon has also been a cast member.

Hint: to include a Python variable in an SQL Magic query, prefix the variable name with a colon, like the query below which counts how many films Kevin Bacon has been in.

In [None]:
%%sql

SELECT COUNT(DISTINCT movie_id)
FROM cast_member
WHERE person_id = :kevin_bacon_id;

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

This query uses correlation names to look twice at the `cast_member` table (once to find Kevin Bacon, once to find other people in his films). Also note the final clause of the `WHERE` condition, to ensure that Kevin Bacon is not included in the people with a Bacon number of 1.

In [None]:
%%sql

SELECT DISTINCT id, name
FROM person, cast_member AS c1, cast_member AS c2
WHERE c1.movie_id = c2.movie_id
    AND c1.person_id = :kevin_bacon_id
    AND person.id = c2.person_id
    AND c2.person_id <> :kevin_bacon_id;
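If you would like to see the self-join at work on data small enough to check by hand, the following sketch runs the same query against an in-memory SQLite database. This is an assumption-laden stand-in: SQLite is not PostgreSQL, the two tables are cut down to the columns the query needs, and the rows and `toy_kevin_id` are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, cut-down versions of the person and cast_member tables.
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cast_member (movie_id INTEGER, person_id INTEGER);
    INSERT INTO person VALUES
        (1, 'Kevin Bacon'), (2, 'Alice'), (3, 'Bob'), (4, 'Carol');
    INSERT INTO cast_member VALUES
        (10, 1), (10, 2),           -- Kevin and Alice
        (11, 1), (11, 2), (11, 3),  -- Kevin, Alice and Bob
        (12, 3), (12, 4);           -- Bob and Carol (no Kevin)
""")
toy_kevin_id = 1  # hypothetical id, valid for this toy data only

rows = conn.execute("""
    SELECT DISTINCT person.id, person.name
    FROM person, cast_member AS c1, cast_member AS c2
    WHERE c1.movie_id = c2.movie_id
        AND c1.person_id = :kevin_bacon_id
        AND person.id = c2.person_id
        AND c2.person_id <> :kevin_bacon_id
    ORDER BY person.id
""", {"kevin_bacon_id": toy_kevin_id}).fetchall()

print(rows)  # [(2, 'Alice'), (3, 'Bob')]
```

Alice shares two films with Kevin Bacon but is listed once, because of `DISTINCT`. Carol is absent because her only film does not include him.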

#### End of Activity 1

-----------------------------------------------------

# The Bacon views

Next, we'll create views to hold each group of people, ordered by Bacon number.

In [None]:
%%sql

DROP VIEW IF EXISTS bacon6;
DROP VIEW IF EXISTS bacon5;
DROP VIEW IF EXISTS bacon4;
DROP VIEW IF EXISTS bacon3;
DROP VIEW IF EXISTS bacon2;
DROP VIEW IF EXISTS bacon1;

### Activity 2

Create a view, `bacon1`, of everyone with a Bacon number of one or less (this will include Kevin Bacon himself).

How many people are there with Bacon number ≤ 1?

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

In [None]:
%%sql

CREATE VIEW bacon1 AS 
    SELECT DISTINCT (c2.person_id)
    FROM cast_member AS c1, cast_member AS c2
    WHERE c1.movie_id = c2.movie_id
        AND c1.person_id = :kevin_bacon_id;

In [None]:
%%sql

SELECT COUNT(*) AS number_of_people
FROM bacon1;

#### End of Activity 2

-----------------------------------------------

### Activity 3

Create a view, `bacon2`, which holds everyone with a Bacon number of two or less. You can use the `bacon1` view in your solution: `bacon2` should contain everyone who's been in a movie with someone listed in the `bacon1` view.

How many people are there with Bacon number ≤ 2?

Repeat the process, creating a view for everyone with a Bacon number of _n_ or less, for all Bacon numbers up to and including 6.

How many people have a Bacon number of 7 or more?

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

**Bacon 2**

First, I join `bacon1` and `cast_member AS c1` to find the `movie_id`s of all the appearances for people with a Bacon number of 1 or less. Anyone who's in one of those movies has a Bacon number of 2 (or lower). I find those people with another join between `c1` and `cast_member AS c2` (excluding the cases where it's the same person in `c1` and `c2`).

In [None]:
%%sql

CREATE VIEW bacon2 AS 
    SELECT DISTINCT(c2.person_id)
    FROM bacon1 AS b, cast_member AS c1, cast_member AS c2
    WHERE b.person_id = c1.person_id
        AND c1.movie_id = c2.movie_id
        AND c2.person_id <> c1.person_id;

In [None]:
%%sql 

SELECT COUNT(person_id) AS number_of_people
FROM bacon2;
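To check that stacking one view on another behaves as described, here is a minimal sketch on a hand-checkable chain of people, again using an in-memory SQLite database as a stand-in for PostgreSQL. The rows are invented, the helper `people_in` is hypothetical, and person id 1 plays the part of Kevin Bacon (written as a literal because SQLite does not allow parameters inside view definitions).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cast_member (movie_id INTEGER, person_id INTEGER);
    -- Hypothetical chain 1 - 2 - 3 - 4, plus an unconnected pair 5 - 6.
    INSERT INTO cast_member VALUES
        (10, 1), (10, 2), (11, 2), (11, 3),
        (12, 3), (12, 4), (13, 5), (13, 6);

    CREATE VIEW bacon1 AS
        SELECT DISTINCT c2.person_id
        FROM cast_member AS c1, cast_member AS c2
        WHERE c1.movie_id = c2.movie_id
            AND c1.person_id = 1;

    CREATE VIEW bacon2 AS
        SELECT DISTINCT c2.person_id
        FROM bacon1 AS b, cast_member AS c1, cast_member AS c2
        WHERE b.person_id = c1.person_id
            AND c1.movie_id = c2.movie_id
            AND c2.person_id <> c1.person_id;
""")

def people_in(view_name):
    """Return the sorted person ids held in one of the toy views."""
    query = f"SELECT person_id FROM {view_name}"
    return sorted(row[0] for row in conn.execute(query))

print(people_in("bacon1"))  # [1, 2]
print(people_in("bacon2"))  # [1, 2, 3]

# People not reached within two steps, found with the NOT IN pattern.
not_reached = conn.execute("""
    SELECT COUNT(DISTINCT person_id)
    FROM cast_member
    WHERE person_id NOT IN (SELECT person_id FROM bacon2)
""").fetchone()[0]
print(not_reached)  # 3
```

Each view is recomputed from `cast_member` every time it is queried, so `bacon2` re-runs the `bacon1` query inside itself. That repeated work is what grows as more views are stacked.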

**Bacon 3 – 6**

These follow the same pattern as `bacon2`, but starting the pattern from a different view.

In [None]:
%%sql

CREATE VIEW bacon3 AS
    SELECT DISTINCT(c2.person_id)
    FROM bacon2 AS b, cast_member AS c1, cast_member AS c2
    WHERE b.person_id = c1.person_id
        AND c1.movie_id = c2.movie_id
        AND c2.person_id <> c1.person_id;

In [None]:
%%sql

CREATE VIEW bacon4 AS 
    SELECT DISTINCT(c2.person_id)
    FROM bacon3 AS b, cast_member AS c1, cast_member AS c2
    WHERE b.person_id = c1.person_id
        AND c1.movie_id = c2.movie_id
        AND c2.person_id <> c1.person_id;

In [None]:
%%sql

CREATE VIEW bacon5 AS 
    SELECT DISTINCT(c2.person_id)
    FROM bacon4 AS b, cast_member AS c1, cast_member AS c2
    WHERE b.person_id = c1.person_id
        AND c1.movie_id = c2.movie_id
        AND c2.person_id <> c1.person_id;

In [None]:
%%sql

CREATE VIEW bacon6 AS 
    SELECT DISTINCT(c2.person_id)
    FROM bacon5 AS b, cast_member AS c1, cast_member AS c2
    WHERE b.person_id = c1.person_id
        AND c1.movie_id = c2.movie_id
        AND c2.person_id <> c1.person_id;

**Finding the counts with Bacon numbers**

Now we have the views, it's easy to just `COUNT` the number of `person_id`s in each view. (Yes, these queries are taking quite some time to complete.)

In [None]:
%%time
%%sql 

SELECT COUNT(person_id) AS number_of_people
FROM bacon3;

In [None]:
%%time
%%sql 

SELECT COUNT(person_id) AS number_of_people
FROM bacon4;

In [None]:
%%time
%%sql 

SELECT COUNT(person_id) AS number_of_people
FROM bacon5;

In [None]:
%%time
%%sql 

SELECT COUNT(person_id) AS number_of_people
FROM bacon6;

**Those with Bacon number ≥ 7**

In [None]:
%%time
%%sql 

SELECT COUNT(DISTINCT person_id) AS number_of_people
FROM cast_member 
WHERE cast_member.person_id NOT IN (SELECT person_id 
                                    FROM bacon6);

#### End of Activity 3

--------------------------------------

### Activity 4

**Exact Bacon numbers**

The activities above found people with Bacon numbers of _n_ or less. 

Repeat the exercise and find the number of people with Bacon number of exactly _n_, for 1 ≤ _n_ ≤ 6.

Call these views `jbacon1` to `jbacon6` ("Just Bacon _n_")

How many people have a Bacon number of 6? How long does that query take?

Hint: you may find it more consistent to define a view `jbacon0`, which returns just Kevin Bacon.

In [None]:
%%sql

DROP VIEW IF EXISTS jbacon6;
DROP VIEW IF EXISTS jbacon5;
DROP VIEW IF EXISTS jbacon4;
DROP VIEW IF EXISTS jbacon3;
DROP VIEW IF EXISTS jbacon2;
DROP VIEW IF EXISTS jbacon1;
DROP VIEW IF EXISTS jbacon0;

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

In [None]:
%%sql

CREATE VIEW jbacon0 AS 
    SELECT DISTINCT (person_id)
    FROM cast_member
    WHERE person_id = :kevin_bacon_id;

The view `jbacon1` is the same as `bacon1`, but with the additional condition that people present in `jbacon0` are not included in `jbacon1`.

In [None]:
%%sql

CREATE VIEW jbacon1 AS 
    SELECT DISTINCT(c2.person_id)
    FROM jbacon0 AS b, cast_member AS c1, cast_member AS c2
    WHERE b.person_id = c1.person_id
        AND c1.movie_id = c2.movie_id
        AND c2.person_id NOT IN (SELECT person_id 
                                 FROM jbacon0);

Similarly, `jbacon2` is the same as `bacon2`, so long as a person is not in `jbacon0` and not in `jbacon1`.

A similar pattern continues for the other Bacon numbers. We need to check all previous Bacon sets at each step.
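The idea of "everyone reachable from the previous set, minus everyone already seen" can be sketched with Python sets. This is only an illustration of the logic, not how PostgreSQL evaluates the views. The data in `toy_casts` and the function `exact_layers` are hypothetical.

```python
# Hypothetical toy data: movie id -> set of person ids (1 plays Kevin Bacon).
toy_casts = {10: {1, 2}, 11: {2, 3}, 12: {3, 4}, 13: {5, 6}}

def exact_layers(casts, start, max_n):
    """Return a list whose n-th entry is the set of people with Bacon number exactly n."""
    layers = [{start}]
    seen = {start}
    for _ in range(max_n):
        previous = layers[-1]
        # Everyone who shares a movie with someone in the previous layer.
        reached = set()
        for cast in casts.values():
            if cast & previous:
                reached |= cast
        # Remove people already placed in an earlier layer
        # (the role played by the NOT IN conditions in the views).
        new_layer = reached - seen
        layers.append(new_layer)
        seen |= new_layer
    return layers

layers = exact_layers(toy_casts, start=1, max_n=4)
print(layers)  # [{1}, {2}, {3}, {4}, set()]
```

Without the subtraction of `seen`, person 2 would reappear in the third layer via the film they share with person 3, which is the reason each `jbacon` view has to exclude all of the earlier ones.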

In [None]:
%%sql

CREATE VIEW jbacon2 AS 
    SELECT DISTINCT(c2.person_id)
    FROM jbacon1 AS b, cast_member AS c1, cast_member AS c2
    WHERE b.person_id = c1.person_id
        AND c1.movie_id = c2.movie_id
        AND c2.person_id NOT IN (SELECT person_id 
                                 FROM jbacon0)
        AND c2.person_id NOT IN (SELECT person_id 
                                 FROM jbacon1);

In [None]:
%%sql

CREATE VIEW jbacon3 AS 
    SELECT DISTINCT(c2.person_id)
    FROM jbacon2 AS b, cast_member AS c1, cast_member AS c2
    WHERE b.person_id = c1.person_id
        AND c1.movie_id = c2.movie_id
        AND c2.person_id NOT IN (SELECT person_id
                                 FROM jbacon0)
        AND c2.person_id NOT IN (SELECT person_id
                                 FROM jbacon1)
        AND c2.person_id NOT IN (SELECT person_id
                                 FROM jbacon2);

In [None]:
%%sql

CREATE VIEW jbacon4 AS 
    SELECT DISTINCT(c2.person_id)
    FROM jbacon3 AS b, cast_member AS c1, cast_member AS c2
    WHERE b.person_id = c1.person_id
        AND c1.movie_id = c2.movie_id
        AND c2.person_id NOT IN (SELECT person_id 
                                 FROM jbacon0)
        AND c2.person_id NOT IN (SELECT person_id 
                                 FROM jbacon1)
        AND c2.person_id NOT IN (SELECT person_id 
                                 FROM jbacon2)
        AND c2.person_id NOT IN (SELECT person_id 
                                 FROM jbacon3);

In [None]:
%%sql

CREATE VIEW jbacon5 AS 
    SELECT DISTINCT(c2.person_id)
    FROM jbacon4 AS b, cast_member AS c1, cast_member AS c2
    WHERE b.person_id = c1.person_id
        AND c1.movie_id = c2.movie_id
        AND c2.person_id NOT IN (SELECT person_id
                                 FROM jbacon0)
        AND c2.person_id NOT IN (SELECT person_id
                                 FROM jbacon1)
        AND c2.person_id NOT IN (SELECT person_id
                                 FROM jbacon2)
        AND c2.person_id NOT IN (SELECT person_id
                                 FROM jbacon3)
        AND c2.person_id NOT IN (SELECT person_id
                                 FROM jbacon4);

In [None]:
%%sql

CREATE VIEW jbacon6 AS 
    SELECT DISTINCT(c2.person_id)
    FROM jbacon5 AS b, cast_member AS c1, cast_member AS c2
    WHERE b.person_id = c1.person_id
        AND c1.movie_id = c2.movie_id
        AND c2.person_id NOT IN (SELECT person_id 
                                 FROM jbacon0)
        AND c2.person_id NOT IN (SELECT person_id 
                                 FROM jbacon1)
        AND c2.person_id NOT IN (SELECT person_id 
                                 FROM jbacon2)
        AND c2.person_id NOT IN (SELECT person_id 
                                 FROM jbacon3)
        AND c2.person_id NOT IN (SELECT person_id 
                                 FROM jbacon4)
        AND c2.person_id NOT IN (SELECT person_id
                                 FROM jbacon5);

In [None]:
%%time
%%sql 

SELECT COUNT(person_id) AS number_of_people
FROM jbacon2;

In [None]:
%%time
%%sql 

SELECT COUNT(person_id)  AS number_of_people
FROM jbacon6;

In [None]:
%%time
%%sql 

SELECT COUNT(person_id) AS number_of_people
FROM bacon6;

#### End of Activity 4

--------------------------------------

# Materialised views
One way to speed up these queries is with materialised views, where the view contents are stored in the database for easy lookup. They may take time to create, but are much quicker to use.

In [None]:
%%sql

DROP MATERIALIZED VIEW IF EXISTS mbacon6;
DROP MATERIALIZED VIEW IF EXISTS mbacon5;
DROP MATERIALIZED VIEW IF EXISTS mbacon4;
DROP MATERIALIZED VIEW IF EXISTS mbacon3;
DROP MATERIALIZED VIEW IF EXISTS mbacon2;
DROP MATERIALIZED VIEW IF EXISTS mbacon1;

This creates a materialised view of people with Bacon number ≤ 1. As you can see, it's identical in form to the definition of the `bacon1` view, just with the addition of the `MATERIALIZED` keyword.

Note that PostgreSQL tells us how many rows were created in the materialised view.

In [None]:
%%sql

CREATE MATERIALIZED VIEW mbacon1 AS 
    SELECT DISTINCT(c2.person_id)
    FROM cast_member AS c1, cast_member AS c2
    WHERE c1.movie_id = c2.movie_id
        AND c1.person_id = :kevin_bacon_id;

### Activity 5

Repeat the above, creating materialised views for people with Bacon numbers between 2 and 6 (inclusive).

Compare the times to count the number of people with Bacon number less than or equal to 6, using both materialised and standard views.

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

In [None]:
%%sql

CREATE MATERIALIZED VIEW mbacon2 AS 
    SELECT DISTINCT(c2.person_id)
    FROM mbacon1 AS b, cast_member AS c1, cast_member AS c2
    WHERE b.person_id = c1.person_id
        AND c1.movie_id = c2.movie_id
        AND c2.person_id <> c1.person_id;

In [None]:
%%sql

CREATE MATERIALIZED VIEW mbacon3 AS 
    SELECT DISTINCT(c2.person_id)
    FROM mbacon2 AS b, cast_member AS c1, cast_member AS c2
    WHERE b.person_id = c1.person_id
        AND c1.movie_id = c2.movie_id
        AND c2.person_id <> c1.person_id;

In [None]:
%%sql

CREATE MATERIALIZED VIEW mbacon4 AS 
    SELECT DISTINCT(c2.person_id)
    FROM mbacon3 AS b, cast_member AS c1, cast_member AS c2
    WHERE b.person_id = c1.person_id
        AND c1.movie_id = c2.movie_id
        AND c2.person_id <> c1.person_id;

In [None]:
%%sql

CREATE MATERIALIZED VIEW mbacon5 AS 
    SELECT DISTINCT(c2.person_id)
    FROM mbacon4 AS b, cast_member AS c1, cast_member AS c2
    WHERE b.person_id = c1.person_id
        AND c1.movie_id = c2.movie_id
        AND c2.person_id <> c1.person_id;

In [None]:
%%sql

CREATE MATERIALIZED VIEW mbacon6 AS 
    SELECT DISTINCT (c2.person_id)
    FROM mbacon5 AS b, cast_member AS c1, cast_member AS c2
    WHERE b.person_id = c1.person_id
        AND c1.movie_id = c2.movie_id
        AND c2.person_id <> c1.person_id;

#### End of Activity 5

---------------------------------------------

## Using materialised views
We can now see if these materialised views make a difference in performance.

In [None]:
%%time
%%sql 

SELECT COUNT(person_id) AS number_of_people
FROM mbacon3;

In [None]:
%%time
%%sql 

SELECT COUNT(person_id) AS number_of_people
FROM mbacon4;

In [None]:
%%time
%%sql 

SELECT COUNT(person_id) AS number_of_people
FROM mbacon5;

Using the materialised view:

In [None]:
%%time
%%sql 

SELECT COUNT(person_id) AS number_of_people
FROM mbacon6;

Using the non-materialised view:

In [None]:
%%time
%%sql 

SELECT COUNT(person_id) AS number_of_people 
FROM bacon6;
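The same kind of comparison can be tried away from the course database with the sketch below. It rests on several assumptions: SQLite has no materialised views, so a table built with `CREATE TABLE ... AS` stands in for one; the cast data is randomly generated; and the names `step` and `timed_count` are hypothetical. The timings it prints say nothing about PostgreSQL itself and only illustrate the difference between recomputing a result and reading a stored one.

```python
import random
import sqlite3
import time

random.seed(0)  # reproducible hypothetical data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cast_member (movie_id INTEGER, person_id INTEGER)")
rows = [(0, p) for p in (1, 2, 3, 4)]  # guarantee person 1 has co-stars
for movie_id in range(1, 200):
    rows += [(movie_id, p) for p in random.sample(range(1, 301), 4)]
conn.executemany("INSERT INTO cast_member VALUES (?, ?)", rows)

# Person id 1 plays the part of Kevin Bacon in this toy data.
conn.execute("""
    CREATE VIEW bacon1 AS
        SELECT DISTINCT c2.person_id
        FROM cast_member AS c1, cast_member AS c2
        WHERE c1.movie_id = c2.movie_id
            AND c1.person_id = 1
""")

# Template for each later view, built on the view before it.
step = """
    SELECT DISTINCT c2.person_id
    FROM {source} AS b, cast_member AS c1, cast_member AS c2
    WHERE b.person_id = c1.person_id
        AND c1.movie_id = c2.movie_id
        AND c2.person_id <> c1.person_id
"""
for n in (2, 3):
    conn.execute(f"CREATE VIEW bacon{n} AS " + step.format(source=f"bacon{n - 1}"))

# Stand-in for a materialised view: store the view's current result as a table.
conn.execute("CREATE TABLE mbacon3 AS SELECT person_id FROM bacon3")

def timed_count(relation):
    """Count the rows in a view or table and report how long the query took."""
    start = time.perf_counter()
    (count,) = conn.execute(f"SELECT COUNT(person_id) FROM {relation}").fetchone()
    return count, time.perf_counter() - start

view_count, view_seconds = timed_count("bacon3")
snapshot_count, snapshot_seconds = timed_count("mbacon3")
print(f"bacon3 (view):      {view_count} people in {view_seconds:.4f} s")
print(f"mbacon3 (snapshot): {snapshot_count} people in {snapshot_seconds:.4f} s")
```

Both queries should report the same number of people. Only the time taken should differ, and by how much will depend on the machine and the amount of data.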

We can see how much space these materialised views are taking with this query. 

The `relkind = 'm'` picks out materialised views. 

(You're not expected to be able to generate this, and it's very much PostgreSQL-specific.)

In [None]:
%%sql 

SELECT c.oid
     , relname AS table_name
     , c.reltuples AS row_estimate
     , pg_total_relation_size(c.oid) AS total_bytes
     , pg_size_pretty(pg_total_relation_size(c.oid)) AS total
FROM pg_class AS c
LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE relkind = 'm'
ORDER BY total_bytes;

These aren't large tables: the materialised views take around 10MB in total. But materialised views with more data could easily take up a lot of space. As with so many things in the pragmatics of database use, whether to use materialised views, and to what degree, is an exercise in judgment with consideration of the different demands on the database.

# Conclusion

This notebook has been an investigation into materialised views. You've seen how they represent a trade-off between time and space. Materialised views can allow a database to quickly return the results of a complex and time-consuming query, by caching the results of that query in the materialised view. The downsides are the space needed to store them, and the issue of out-of-date information if the underlying tables change before the materialised view is regenerated.
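The out-of-date problem can be seen in a small sketch. In PostgreSQL, a materialised view is brought up to date with `REFRESH MATERIALIZED VIEW`. The code below does not use PostgreSQL: it emulates the behaviour in an in-memory SQLite database, where a table created with `CREATE TABLE ... AS` stands in for the materialised view and is "refreshed" by emptying and refilling it. The data and the helper `count` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cast_member (movie_id INTEGER, person_id INTEGER);
    INSERT INTO cast_member VALUES (10, 1), (10, 2);

    -- Ordinary view: recomputed on every query (person 1 plays Kevin Bacon).
    CREATE VIEW bacon1 AS
        SELECT DISTINCT c2.person_id
        FROM cast_member AS c1, cast_member AS c2
        WHERE c1.movie_id = c2.movie_id AND c1.person_id = 1;

    -- Stand-in for a materialised view: a stored copy of the current result.
    CREATE TABLE mbacon1 AS SELECT person_id FROM bacon1;
""")

def count(relation):
    """Return the number of rows in a view or table."""
    return conn.execute(f"SELECT COUNT(*) FROM {relation}").fetchone()[0]

# A new film adds person 3 as a co-star of person 1.
conn.execute("INSERT INTO cast_member VALUES (11, 1), (11, 3)")
stale = (count("bacon1"), count("mbacon1"))
print(stale)  # (3, 2): the view sees the new film, the stored copy does not

# Emulated refresh: rebuild the stored copy from the view.
conn.executescript("""
    DELETE FROM mbacon1;
    INSERT INTO mbacon1 SELECT person_id FROM bacon1;
""")
refreshed = (count("bacon1"), count("mbacon1"))
print(refreshed)  # (3, 3)
```

How often to refresh is part of the same judgment call: a refresh costs as much as running the original query, so it only pays off when the stored result is read many times between changes.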