# Views



In this notebook, you will look at how *views* can be used to support the process of data exploration and database interrogation. As we saw in section 3 of Part 11, a view is defined as *a query with a name*. Once created, it is available for use in any other query: it becomes usable in exactly the same way as a table. This means that a complex query can be written once, saved as a view, and then used in subsequent queries.

Views can also be used to save time: so-called *materialized views* allow us to save a copy of the results of a query in a table, so that we can call on that data at a future date without re-evaluating the query.

In addition, in Part 23, you will see how views can be used to enforce security constraints.

You should spend around 90 minutes on this notebook.

## Setting up

The next group of cells sets up your database connection and resets the database to a clean state. Check notebook *08.1 Data Definition Language in SQL* if you are unsure what the next cells do.

You may need to change the given values of the variables `DB_USER` and `DB_PWD`, depending on which environment you are using.

In [None]:
# Make the connection

%run sql_init.ipynb
print("Connecting with connection string : {}".format(DB_CONNECTION))
%sql $DB_CONNECTION

## The database ERD: recap



Before we start, here's the ERD of the database again, which can be useful in getting orientated around the information it contains.


![Movies ERD](./images/movies-erd.svg)


As with `Notebook 11.1: Movie analysis`, this notebook uses the `movies` schema, so let's set the `search_path` so that we don't need to qualify all the table names:

In [None]:
%%sql

SET search_path TO movies, public;

# Budget and cast / crew size

Let's start this section with an investigation into the size of cast and crew in a movie. Is there something interesting to find out about how the money either spent or earned in a movie relates to the number of people involved with making it?

We can find the sizes of cast and crew quite easily, with a subquery for each total we want:

In [None]:
%%sql

SELECT id, title, year, budget, revenue, 
    (SELECT COUNT(*)
     FROM cast_member 
     WHERE cast_member.movie_id = movie.id) AS cast_size,
    (SELECT COUNT(*)
     FROM crew 
     WHERE crew.movie_id = movie.id) AS crew_size
FROM movie
WHERE budget > 0 AND revenue > 0
LIMIT 10;

But those subqueries are a pain to write if we want to do a lot of investigation in this area. Let's create a *view* to make subsequent queries easier. One way of thinking of a view is as a temporary, derived table containing the answer to a query you might want to keep running as part of other queries.

In the following example, the view is an alias to a query that just returns the cast and crew sizes for a movie. We'll rely on the existing `movie` table for the rest of the information about movies.

In [None]:
%%sql 
DROP VIEW IF EXISTS movie_staff_size CASCADE;

CREATE VIEW movie_staff_size AS
    SELECT id, 
        (SELECT COUNT(*) 
         FROM cast_member 
         WHERE cast_member.movie_id = movie.id) AS cast_size,
        (SELECT COUNT(*) 
         FROM crew 
         WHERE crew.movie_id = movie.id) AS crew_size
    FROM movie;

We can now query the `movie_staff_size` view as though it were a table in the database:

In [None]:
%%sql
SELECT *
FROM movie_staff_size
LIMIT 10;

and, similarly, we can use the `movie_staff_size` view in queries in the same way as other tables in the database:

In [None]:
%%sql
SELECT movie.id, title, cast_size, crew_size
FROM movie, movie_staff_size
WHERE movie.id = movie_staff_size.id
LIMIT 10;
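
If you want to experiment with the idea away from the course database, the following is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL. The three toy tables and their rows are invented for illustration; only the shape of the `movie_staff_size` view follows the definition above.

```python
import sqlite3

# Toy stand-in for the movies schema; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE cast_member (movie_id INTEGER, person_id INTEGER);
    CREATE TABLE crew (movie_id INTEGER, person_id INTEGER);

    INSERT INTO movie VALUES (1, 'First'), (2, 'Second');
    INSERT INTO cast_member VALUES (1, 10), (1, 11), (2, 12);
    INSERT INTO crew VALUES (1, 20), (2, 21), (2, 22), (2, 23);

    -- Same shape as the movie_staff_size view defined above.
    CREATE VIEW movie_staff_size AS
        SELECT id,
            (SELECT COUNT(*) FROM cast_member
             WHERE cast_member.movie_id = movie.id) AS cast_size,
            (SELECT COUNT(*) FROM crew
             WHERE crew.movie_id = movie.id) AS crew_size
        FROM movie;
""")

# The view joins to movie just as a base table would.
staff_rows = conn.execute("""
    SELECT movie.id, title, cast_size, crew_size
    FROM movie, movie_staff_size
    WHERE movie.id = movie_staff_size.id
    ORDER BY movie.id
""").fetchall()
print(staff_rows)
```

The query against the view never mentions `cast_member` or `crew`: the counting logic is written once, in the view definition.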

Let's take a quick look at how a movie's budget varies with the crew size.

*The following query may take some time to run.*

In [None]:
%%sql budget_crew <<

SELECT movie.id, budget, crew_size
FROM movie, movie_staff_size
WHERE movie.id = movie_staff_size.id 
    AND budget > 1000 
    AND revenue > 1000;

We can quickly eyeball the data as a sanity check:

In [None]:
budget_crew.head()

Let's put this data on a scatter plot. (Refer back to Notebook 5.3 for the use of `matplotlib.patches` to manually draw shapes on plots.)

In [None]:
import matplotlib.patches as mpatches

In [None]:
ax=budget_crew.plot.scatter(x='crew_size', y='budget', alpha=0.3)

ax.set(title="Movie budget vs crew size, one dot per movie",
      xlabel="Crew size",
      ylabel="Budget (in hundred millions)")

e1 = mpatches.Ellipse((20, 1.5e8), width=40, height=15e7, color='g', fill=False, linewidth=5)
ax.add_patch(e1)

e1 = mpatches.Ellipse((110, 1e8), width=100, height=9e7, color='k', fill=False, linewidth=5)
ax.add_patch(e1)

pass

The scatter diagram indicates two groups in the data: one with small crew sizes (in the green ellipse) and one with larger crew sizes (in the black ellipse). I suspect that these are to do with how accurately the crew sizes are reported, with the smaller crew sizes being just the "headline" crew members, such as department leads.

### Activity 1

Repeat the analysis above, but this time using the total personnel size (cast size plus crew size). Do you still see the same two clusters in the data?

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

The only difference in the query is to ask for the `personnel_size` as the `cast_size` + `crew_size`.

In [None]:
%%sql budget_personnel << 

SELECT movie.id, budget, (cast_size + crew_size) AS personnel_size
FROM movie, movie_staff_size
WHERE movie.id = movie_staff_size.id AND
    budget > 1000 AND
    revenue > 1000;

In [None]:
budget_personnel.head()

In [None]:
ax = budget_personnel.plot.scatter(x='personnel_size', y='budget', alpha=0.3)
ax.set(title="Movie budget vs personnel size,\none dot per movie",
      xlabel="Personnel size",
      ylabel="Budget (in hundred millions)");


Yes, the same clusters are there, but the separation between them is less pronounced.

As an investigation, this is an unilluminating dead end. But it does serve its purpose of being a gentle introduction to SQL views.

#### End of Activity 1

--------------------------------------

## Views and Materialised Views

When you create a view, you may notice that the cell appears to run very quickly, but when you run a query that uses the view it takes a noticeably longer time to run. This is because a view is simply an alias for a query: no work is actually done when you create a view; the query is only run when you reference the view.

Let's create a view derived from the previous view and the previous query, albeit with a different name and higher budget and revenue limits.

We can use the `%%time` magic to time how long it takes the cell to execute and return any result.

In [None]:
%%time
%%sql

DROP VIEW IF EXISTS movie_staff_size_10Mplus;

CREATE VIEW movie_staff_size_10Mplus AS
    SELECT movie.id, budget, crew_size
    FROM movie, movie_staff_size
    WHERE movie.id = movie_staff_size.id 
        AND budget > 10000000 
        AND revenue > 10000000;

Using the *SQL* `EXPLAIN ANALYZE` command, we can see how long it takes to run a query against this view; the actual time is given in the final row of the table as the `Execution Time`. You'll also see there is `planning time` being spent as Postgres works out the most efficient way to run the query.

In [None]:
%%sql

EXPLAIN ANALYZE
    SELECT * 
    FROM movie_staff_size_10Mplus

Alternatively, we can use the `%%time` magic again to time how long the cell execution takes.

In [None]:
%%time
%%sql _tmp <<

SELECT * 
FROM movie_staff_size_10Mplus;

### Activity 2

Run the above cell two or three times, making a note of the time spent running the query in each case. What do you notice?

Write your answer in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

You may have noticed that the query takes a few seconds each time you run it.

This is one of the downsides of using views: the query that a view represents has to be run each time you call on the view.

This makes sense if your database is being updated all the time, because you want the view to return the state of the current database, but if the database is unchanging, this seems wasteful. In such a case, we can define a *materialised view* which is essentially a table that contains the result of running the query.
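
The point that a view always reflects the current state of the database can be sketched with a toy example. This uses Python's built-in `sqlite3` module rather than PostgreSQL, and the `crew` rows and the `crew_size` view name are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE crew (movie_id INTEGER, person_id INTEGER);
    INSERT INTO crew VALUES (1, 20), (1, 21);

    -- Hypothetical view: crew size per movie.
    CREATE VIEW crew_size AS
        SELECT movie_id, COUNT(*) AS n FROM crew GROUP BY movie_id;
""")

before = conn.execute("SELECT n FROM crew_size WHERE movie_id = 1").fetchone()[0]

# Change the base table after the view was created.
conn.execute("INSERT INTO crew VALUES (1, 22)")

# The view's query runs again now, so it sees the new row.
after = conn.execute("SELECT n FROM crew_size WHERE movie_id = 1").fetchone()[0]
print(before, after)
```

The second count differs from the first even though the view itself was never touched, which is exactly why the underlying query has to be paid for on every use.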

#### End of Activity 2

--------------------------------------

### Materialised views

The following cell creates a *materialised view* based on the same query as the one we used for the simple view.

See how long it takes to run. (How long is the `*` in the `In []:` cell display rendered for?) A vanishingly small time, as when we created the original view, or a more substantial time?

In [None]:
%%time
%%sql

DROP MATERIALIZED VIEW IF EXISTS movie_staff_size_10Mplus_MATERIALISED;

CREATE MATERIALIZED VIEW movie_staff_size_10Mplus_MATERIALISED AS

SELECT movie.id, budget, crew_size
FROM movie, movie_staff_size
WHERE movie.id = movie_staff_size.id 
    AND budget > 10000000 
    AND revenue > 10000000;

When I ran the cell, it took several seconds to create the materialised view: the query was actually run and the returned data used to create a table containing that data.

But what happens when we run a query against the materialised view? How long does that query take to run, compared to the query against the simpler, unmaterialised view?

Again, we can use the SQL `EXPLAIN ANALYZE` report or the `%%time` cell magic to get a feel for the time involved.

In [None]:
%%sql
EXPLAIN ANALYZE

SELECT * 
FROM movie_staff_size_10Mplus_MATERIALISED;

In [None]:
%%time
%%sql _tmp <<

SELECT * 
FROM movie_staff_size_10Mplus_MATERIALISED

Now the query runs very quickly indeed: the query used to define the (materialised) view has already been run, the data saved as a materialised view, and when we call on it we look at the materialised *results* of the query, rather than having to run a query to obtain them.

The downside is that we have added an extra table to the database, which adds a storage resource overhead to the database. However, the materialised view means we require less computational resource when running a query against it compared to a query that uses a simple view.

Try running the cell again. You should find it takes a similar, brief, amount of time to run.

(You might also notice that the `EXPLAIN ANALYZE` report is much shorter: the query is now essentially a simple select on a single table, so there is very little for the planner to report.)
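
The other side of this trade-off is that the stored results do not follow later changes to the base tables. In PostgreSQL you bring a materialised view up to date with `REFRESH MATERIALIZED VIEW`, followed by the view's name. The sketch below only emulates that behaviour: SQLite has no materialised views, so it copies a query result into an ordinary table (the hypothetical `crew_size_snapshot`) and rebuilds it by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE crew (movie_id INTEGER, person_id INTEGER);
    INSERT INTO crew VALUES (1, 20), (1, 21);

    -- Emulated materialised view: the query result is copied into a table.
    CREATE TABLE crew_size_snapshot AS
        SELECT movie_id, COUNT(*) AS n FROM crew GROUP BY movie_id;
""")

conn.execute("INSERT INTO crew VALUES (1, 22)")

# The snapshot still holds the count from when it was built.
stale = conn.execute(
    "SELECT n FROM crew_size_snapshot WHERE movie_id = 1").fetchone()[0]

# Emulated refresh: rebuild the stored result from the base table.
conn.executescript("""
    DELETE FROM crew_size_snapshot;
    INSERT INTO crew_size_snapshot
        SELECT movie_id, COUNT(*) FROM crew GROUP BY movie_id;
""")
fresh = conn.execute(
    "SELECT n FROM crew_size_snapshot WHERE movie_id = 1").fetchone()[0]
print(stale, fresh)
```

For a dataset that never changes, such as the one in this notebook, the stale-data risk does not arise, which is what makes a materialised view a good fit here.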

# Both cast and crew

Cinema has a long tradition of certain people appearing on both sides of the camera, as a cast member *and* as a crew member. These range from Hitchcock's cameos in the films he directed, to Woody Allen's combination of directing and performing as male lead actor.

There are a few questions we can ask of the dataset around this topic, including:

1. How prevalent is this phenomenon? Are there some people who regularly cross the divide?
2. Do different types of film have different types of shared roles between cast and crew?
3. Do different types of people do different jobs on either side of the camera?

### Activity 3

How can we find the data about people who are both cast and crew on the same movie?

Write a query which returns one row for each time a person appears in the cast of a film and the crew of that same film. If a person has several acting roles in a film, or several crew roles, return one row for each combination.

The query should return:
* movie id
* movie title
* person id
* person name
* person gender
* character played
* cast order of role
* department of "crew" job
* job in crew

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

This is reasonably straightforward, but note that all the joins are there to ensure we're talking about the same person in both `cast_member` and `crew` tables.

In [None]:
%%sql

SELECT movie.id, title, person.id AS person_id, person.name, person.gender, character, cast_order, department, job
FROM movie, person, cast_member, crew
WHERE movie.id = cast_member.movie_id
    AND movie.id = crew.movie_id
    AND person.id = cast_member.person_id
    AND person.id = crew.person_id
LIMIT 10;
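
To see why the join returns "one row for each combination", here is a small sketch on invented data, using Python's built-in `sqlite3` module as a stand-in for the course database. The tables are cut down to the columns needed, and the names and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cast_member (movie_id INTEGER, person_id INTEGER, character TEXT);
    CREATE TABLE crew (movie_id INTEGER, person_id INTEGER, job TEXT);

    INSERT INTO movie VALUES (1, 'Toy Film');
    INSERT INTO person VALUES (10, 'Person A'), (11, 'Person B');
    -- Person A plays two characters and holds two crew jobs;
    -- Person B only acts.
    INSERT INTO cast_member VALUES
        (1, 10, 'Hero'), (1, 10, 'Narrator'), (1, 11, 'Villain');
    INSERT INTO crew VALUES (1, 10, 'Director'), (1, 10, 'Writer');
""")

combo_rows = conn.execute("""
    SELECT person.name, character, job
    FROM movie, person, cast_member, crew
    WHERE movie.id = cast_member.movie_id
        AND movie.id = crew.movie_id
        AND person.id = cast_member.person_id
        AND person.id = crew.person_id
    ORDER BY character, job
""").fetchall()
for row in combo_rows:
    print(row)
```

Two acting roles and two crew jobs give four rows for Person A, while Person B, who has no crew job, drops out of the result altogether.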

#### End of Activity 3

--------------------------------------

### Activity 4

We'll be using these results a lot, so create a **view** called `cast_crew` to return them.

Define the view to return the same result set as above. (The query ran quite quickly, so there's probably no real need here to render it as a materialised view.)

Run a simple test query against it to check that it works.

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

The `CREATE VIEW` just calls the query from above.

In [None]:
%%sql
DROP VIEW IF EXISTS cast_crew;

CREATE VIEW cast_crew AS

    SELECT movie.id AS movie_id, person.id AS person_id, 
        cast_member.credit_id AS cast_member_credit_id, crew.credit_id AS crew_credit_id,
        title, 
        person.name, person.gender,
        character, cast_order, department, job
    FROM movie, person, cast_member, crew
    WHERE movie.id = cast_member.movie_id
        AND movie.id = crew.movie_id
        AND person.id = cast_member.person_id
        AND person.id = crew.person_id;

We can query the view as though it were a base table to just return the information we want.

In [None]:
%%sql

SELECT movie_id, title, person_id, name, character, cast_order, department, job
FROM cast_crew
LIMIT 10;

#### End of Activity 4

--------------------------------------

### Activity 5


Now we've built up some infrastructure for addressing the cast/crew combinations, we can start to do some data investigation.

To start with, how many different movies have at least one person both in cast and crew? How many people have been in both cast and crew on the same movie?

Can you answer both questions with a single query?

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

Rather than counting all the rows, we count the `DISTINCT` ids.

In [None]:
%%sql 

SELECT COUNT(DISTINCT movie_id) AS movies, 
       COUNT(DISTINCT person_id) AS people 
FROM cast_crew;
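
The difference between counting rows and counting `DISTINCT` ids is easy to check on a toy table. The sketch below uses Python's built-in `sqlite3` module, with an invented four-row table standing in for the `cast_crew` view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Toy stand-in for the cast_crew view: one row per role/job combination.
    CREATE TABLE cast_crew (movie_id INTEGER, person_id INTEGER, job TEXT);
    INSERT INTO cast_crew VALUES
        (1, 10, 'Director'), (1, 10, 'Writer'),
        (2, 10, 'Director'), (2, 11, 'Producer');
""")

total_rows, movies, people = conn.execute("""
    SELECT COUNT(*), COUNT(DISTINCT movie_id), COUNT(DISTINCT person_id)
    FROM cast_crew
""").fetchone()
print(total_rows, movies, people)
```

Because one person can contribute several rows for the same movie, `COUNT(*)` overstates both the number of movies and the number of people.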

#### End of Activity 5

--------------------------------------

### Activity 6

What are the most common crew jobs done by people appearing in the cast of a movie?

(There are many different jobs, so limit your results to the 15 most common jobs.)

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

We want to count the number of times each job was done by someone who appears in the `cast_crew` view. That means `GROUP BY job` (and we'll include the department in there as well, as it's helpful).

"15 most common" means an `ORDER BY` and a `LIMIT` on the query.

In [None]:
%%sql

SELECT department, job, COUNT(movie_id) AS number_of_occurrences
FROM cast_crew, movie
WHERE movie_id = movie.id
GROUP BY department, job
ORDER BY COUNT(movie_id) DESC
LIMIT 15;

Director is the most popular job, followed by producer and some form of writer. If we look at the results grouped by department, we see this is reinforced: writing, producing, and directing all have over a thousand instances, with the next highest having only a few hundred.

In [None]:
%%sql

SELECT department, COUNT(movie_id) AS number_of_movies
FROM cast_crew, movie
WHERE movie_id = movie.id
GROUP BY department
ORDER BY COUNT(movie_id) DESC
LIMIT 15;

This implies that the cast/crew combination is mainly one of the "auteur", where the entire movie is one person's vision, from writing to direction to acting. The production roles could well be from successful actors adding money and contacts to make a movie happen.

#### End of Activity 6

--------------------------------------

### Activity 7

If the notion that cast/crew combinations reflect auteur creators is correct, we would expect the cast members to have low `cast_order` ratings. (`cast_order` reflects where someone appears in the credits of a film: low `cast_order`s are further up the bill, with starring roles corresponding to the lowest `cast_order`.)

We can look at how many times each `cast_order` appears with a familiar query structure:

In [None]:
%%sql

SELECT cast_order, COUNT(cast_order) AS n_occurrences
FROM cast_member 
GROUP BY cast_order
ORDER BY cast_order
LIMIT 10;

However, there are many people in the dataset with a `cast_order` of zero, which is slightly suspicious. If people are numbering cast order, it's more usual to start from one. Could the `cast_order` of zero be used as a NULL `cast_order`, rather than being an actual position in the cast?

We already know that the dataset is incomplete, with many `crew` members missing from some films. Could `cast_order` be similarly unreliable?

How would you check this? Spend a few minutes thinking about how you could check the reliability of the `cast_order` data. Don't write any queries yet.

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

This is a very underspecified question, so there are several ways you could approach it. 

My approach is this. If a `cast_order` of zero is used as a NULL value, you'd expect to see several `cast_member`s with `cast_order` = 0 for the same movie. If it's a genuine zero-based indexing of cast order, you'd expect to see just one person with each `cast_order` in each movie. 

Therefore, I could write a query to count the number of each `cast_order` in each `movie`. If there are lots of movies with many `cast_order = 0`, it's used as a null. If there are none, `cast_order` is a reliable indicator.

#### End of Activity 7

--------------------------------------

### Activity 8

Using your approach, determine if `cast_order` is a reliable value in this dataset.

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

Here's my implementation of my approach. 

For each `movie_id` / `cast_order` combination, I count how many times a `person_id` occurs for that `cast_order` in that `movie_id`. I order the results so that the most common "doubling ups" are listed first.

In [None]:
%%sql

SELECT movie_id, cast_order, COUNT(person_id) AS people_count
FROM cast_member
GROUP BY movie_id, cast_order
HAVING COUNT(person_id) > 1
ORDER BY people_count DESC;

This suggests that where there are people doubling up with the same `cast_order`, it tends to be in the minor parts, lower down the billing.

And, just to be sure, let's look at how many rows of `cast_member` have `cast_order IS NULL`.

In [None]:
%%sql 

SELECT movie_id, COUNT(person_id) AS people_count
FROM cast_member
WHERE cast_order IS NULL
GROUP BY movie_id
ORDER BY people_count DESC;

These indicate that there are very few cases where `cast_order` is doubled up in a movie. There are also no cases with `cast_order IS NULL`. Therefore, we can be fairly confident that `cast_order` is a reliable element of this dataset and we don't need to treat `cast_order = 0` differently from `cast_order = 1`: `cast_order = 0` is higher up the bill than `cast_order = 1`.
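
To make the reasoning concrete, the sketch below runs the same doubling-up check on two invented movies, using Python's built-in `sqlite3` module as a stand-in for the course database. One movie has genuine zero-based billing, and the other uses zero as a placeholder for an unknown position.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cast_member (movie_id INTEGER, person_id INTEGER, cast_order INTEGER);
    -- Movie 1: genuine zero-based billing, one person per position.
    INSERT INTO cast_member VALUES (1, 10, 0), (1, 11, 1), (1, 12, 2);
    -- Movie 2: zero used as a placeholder for "unknown".
    INSERT INTO cast_member VALUES (2, 20, 0), (2, 21, 0), (2, 22, 0);
""")

doubled = conn.execute("""
    SELECT movie_id, cast_order, COUNT(person_id) AS people_count
    FROM cast_member
    GROUP BY movie_id, cast_order
    HAVING COUNT(person_id) > 1
    ORDER BY people_count DESC
""").fetchall()
print(doubled)
```

Only the movie that misuses zero shows up. If the real dataset treated zero as a null, we would expect many rows like this with `cast_order = 0`, rather than the handful of doubled-up minor parts actually found.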

#### End of Activity 8

--------------------------------------

### Activity 9


Back to the original activity. 

If the notion that cast/crew combinations reflect auteur creators is correct, we would expect the cast members to have low `cast_order` ratings. Alternatively, if the appearances are quick Hitchcockian cameos, we'd expect cast-and-crew cast members to have a relatively high `cast_order` rating. Does our hypothesis about auteur creators appear to hold true?

Before you start, think about how you'd show this to be the case. What data would answer this question, and how would you present that data?

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

**Approach**

The idea is to count how many times each `cast_order` value appears in the `cast_crew` view. If people in the `cast_crew` view are auteurs, I'd expect to see a lot of occurrences where the `cast_order` is very low, and few occurrences where the `cast_order` is high. 

I can show that by producing a graph of the number of times each `cast_order` occurs in the data.

That means I need to write a query that `COUNT`s the number of occurrences of each `cast_order` in the `cast_crew` view.

**Query**

This is a simple query on the `cast_crew` view, with a `GROUP BY` and `COUNT` to count the rows at each `cast_order`. The `LIMIT` is just to cut off the long tail, by restricting the results to the 40 lowest `cast_order` values.

In [None]:
%%sql cast_order_counts <<

SELECT cast_order, COUNT(cast_crew.movie_id) AS movie_count
FROM cast_crew
GROUP BY cast_order
ORDER BY cast_order
LIMIT 40;

In [None]:
cast_order_counts.set_index('cast_order', inplace=True)
cast_order_counts.head()

**Chart**

In [None]:
ax = cast_order_counts.movie_count.plot()
ax.set(title="Number of appearances with particular cast order", 
       xlabel='Cast order', 
       ylabel='Number of appearances');

**Comments**

This does support the hypothesis that most people who are members of both `cast_member` and `crew` are leading actors, and therefore the movie is somewhat of a personal project for these people.

**Further Reflection**

This distribution of cast orders looks very skewed. But is that an artefact of the dataset overall, or is it a genuine difference between the people who are both cast & crew, and the general "population" of movie actors?

Let's find out by doing exactly the same analysis on the whole `cast_member` table.

In [None]:
%%sql all_cast_order_counts <<

SELECT cast_order, COUNT(cast_member.movie_id) AS movie_count
FROM cast_member
GROUP BY cast_order
HAVING COUNT(cast_member.movie_id) > 10
ORDER BY cast_order
LIMIT 40;

In [None]:
all_cast_order_counts.set_index('cast_order', inplace=True)
all_cast_order_counts.head()

In [None]:
ax = all_cast_order_counts.movie_count.plot()
ax.set(title="Number of appearances with particular cast order (all actors)", 
       xlabel='Cast order', 
       ylabel='Number of appearances');

**Comment**

This distribution is a very different shape from the one for the cast/crew people. This difference in shape indicates that there is a real effect here, that people in both cast and crew do not receive high-profile roles at the same rate as actors generally.

#### End of Activity 9

--------------------------------------

### Activity 10


The pattern of auteur actors driving films forward may be true for small indie productions where just a few people do many of the jobs. Is it also true for major films?

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

First, we have to define "major film". I take the arbitrary threshold that a "major film" is one with a budget of more than $5,000,000.

In [None]:
%%sql major_cast_order_counts <<

SELECT cast_order, COUNT(cast_crew.movie_id) AS movie_count
FROM cast_crew, movie
WHERE movie.id = cast_crew.movie_id
    AND movie.budget > 5000000
GROUP BY cast_order
ORDER BY cast_order
LIMIT 40;

In [None]:
major_cast_order_counts.set_index('cast_order', inplace=True)
major_cast_order_counts.head()

In [None]:
ax = major_cast_order_counts.movie_count.plot()
ax.set(title="Number of appearances with particular cast order\n(major films)", 
       xlabel='Cast order', 
       ylabel='Number of appearances');

This is essentially the same shape as the distribution above, so "size of movie" does not seem to affect the prevalence of auteurs creating and starring in films.

#### End of Activity 10

--------------------------------------

### Activity 11 (Optional Worked Example)


Gender equality and representation are current topics when it comes to Hollywood and similar public walks of life. What does this data say about the gender split across jobs in the crew of films?

This activity will show you how to develop a query which produces a table of results listing each job, the department it's in, the number of men who do that job, and the number of women who do it.

Plotting the results allows us to compare them visually and ask the question: *are film crew jobs equally distributed across men and women?*

(In the dataset, `gender` is 1 for women, 2 for men.)

### Note
*This is a complex question which requires many stages and queries to get to the solution. You are welcome to try to answer this question yourself, although it may take you some time... If you do want to give it a go, before you start, think about how you will tackle this question and what "parts" you will need to develop a whole and compelling answer. Be prepared to create new views on the data as you go, if that makes your analysis easier. Note also that this query is not about the `cast_crew` view, but about roles in general.*

#### Worked example

To reveal the worked solution, click on the triangle symbol on the left-hand end of this cell.

**General approach**

Counting the number of people for each job and department should be easy enough. I can do that twice, to create similar results for each gender. 

Once I have that, I can somehow merge these two results (one for male, one for female) into a unified view. There may be different jobs in each result set, so perhaps using an `OUTER JOIN` would be sensible to ensure no data is lost. 

I can then plot this combined result set, comparing the number of men and women in each role. I'm expecting to do several visualisations of the data. As I look into the results, different aspects will become clear and I'll want to shift from visualisations that allow me to _explore_ the data to those that _reveal_ interesting aspects.

**Queries**

Finding the total for each gender is easy enough, if we do it one at a time. These are similar to the queries above, but 
1. using the `crew` table rather than the `cast_crew` view
2. with the added restrictions on gender for the people included in the results.

In [None]:
%%sql

SELECT department, job, gender, COUNT(movie_id) AS female_count
FROM crew, movie
WHERE movie_id = movie.id
    AND gender = 1                 -- female
GROUP BY department, job, gender
ORDER BY COUNT(movie_id) DESC
LIMIT 10;

In [None]:
%%sql

SELECT department, job, gender, COUNT(movie_id) AS male_count
FROM crew, movie
WHERE movie_id = movie.id
    AND gender = 2                 -- male
GROUP BY department, job, gender
ORDER BY COUNT(movie_id) DESC
LIMIT 10;

Putting these two results into one query is a bit trickier. Rather than build something very complex with subqueries, let's create a `view` for each gender which we can build on.

In [None]:
%%sql

DROP VIEW IF EXISTS female_job;

CREATE VIEW female_job AS
    SELECT department, job, COUNT(movie_id) AS female_count
    FROM crew
    WHERE gender = 1                 -- female
    GROUP BY department, job;

In [None]:
%%sql

DROP VIEW IF EXISTS male_job;

CREATE VIEW male_job AS
    SELECT department, job, COUNT(movie_id) AS male_count
    FROM crew
    WHERE gender = 2                 -- male
    GROUP BY department, job;

A quick sanity check of the views.

In [None]:
%%sql 

SELECT * 
FROM female_job
ORDER BY female_count DESC
LIMIT 10;

In [None]:
%%sql 

SELECT * 
FROM male_job 
ORDER BY male_count DESC
LIMIT 10;

We can then join these two views to get the result. However, we have to be careful to include the jobs which are done by only one gender. That means we need a `FULL OUTER JOIN` to combine the results. Note the use of the [`COALESCE`](https://www.postgresql.org/docs/9.5/static/functions-conditional.html) function, used in Notebook 3.4, to return a valid department and job, even if the job is never done by a man.

*Reminder*: the `COALESCE()` *SQL* function takes a series of arguments and returns the first one that is not NULL.

For example:

```SQL
SELECT COALESCE(NULL,'one','two') AS first_not_null;
SELECT COALESCE('one',NULL,'two');
```

would both return `one` as the result; in the first query, the result column is named `first_not_null`.
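
If you want to check this behaviour outside the course database, `COALESCE` is also available in SQLite, so a quick sketch with Python's built-in `sqlite3` module is enough. The third query, where every argument is NULL, is an extra case added here for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# COALESCE returns its first non-NULL argument, wherever that appears.
first = conn.execute("SELECT COALESCE(NULL, 'one', 'two')").fetchone()[0]
second = conn.execute("SELECT COALESCE('one', NULL, 'two')").fetchone()[0]

# If every argument is NULL, the result is NULL (None in Python).
all_null = conn.execute("SELECT COALESCE(NULL, NULL)").fetchone()[0]
print(first, second, all_null)
```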


In [None]:
%%sql mf_crew <<
SELECT COALESCE(male_job.department, female_job.department) AS department, 
    COALESCE(male_job.job, female_job.job) AS job, 
    male_count, female_count
FROM male_job FULL OUTER JOIN female_job ON male_job.department = female_job.department
    AND male_job.job = female_job.job
ORDER BY male_count DESC, female_count DESC;

352 rows are too many to inspect manually, but we still need to see if the result is sensible or needs tidying up.

In [None]:
mf_crew.head()

Tidy the results by replacing the `NaN`s with zero and setting `job` as the index.

In [None]:
mf_crew.fillna(0, inplace=True)
mf_crew.set_index('job', inplace=True)
mf_crew.head()

Now we have some data, what does it look like?

We can't plot 350 comparisons on one plot. Let's start by summarising by department:

In [None]:
mf_crew.groupby('department').sum()

In [None]:
ax = mf_crew.groupby('department').sum().plot.barh()
ax.set(title="Number of people per department, split by gender",
      xlabel="Number of people",
      ylabel="Department");

The main thing we can conclude from this plot is that there are a lot more men working in the movies than women. It's difficult to pick out trends about _types_ of job from this chart. 

Let's pick out the ten most popular jobs for men and women and merge them into a smaller dataset.

We use `sort_values` and a slice to find the ten most popular male jobs.

In [None]:
mf_crew.sort_values('male_count', ascending=False)[:10]

We can do the same for the female jobs, use `pd.concat` to butt the two DataFrames together, and use `drop_duplicates` to merge the two result sets. (`DataFrame.append` was removed in pandas 2.0, so `pd.concat` is used here.)

In [None]:
import pandas as pd

# most common jobs for men
m_crew_popular = mf_crew.sort_values('male_count', ascending=False)[:10]
# most common jobs for women
f_crew_popular = mf_crew.sort_values('female_count', ascending=False)[:10]

# combine the two sets
mf_crew_popular = pd.concat([m_crew_popular, f_crew_popular])
mf_crew_popular.drop_duplicates(inplace=True)

mf_crew_popular

In [None]:
ax = mf_crew_popular.plot.barh(figsize=(8, 8))
ax.set(title="Most popular jobs, split by gender",
      xlabel="Number of people",
      ylabel="Job");

Again, men outnumber women (a lot), but it seems that women work in casting and costuming rather than production and direction. 

Let's look at the _distribution_ of roles within each gender. We do that by finding what fraction of each gender's crew credits falls in each job: we total the male counts and the female counts separately, then divide each cell by the total for its gender.

In [None]:
mf_crew['scaled_male'] = mf_crew.male_count / mf_crew.male_count.sum()
mf_crew['scaled_female'] = mf_crew.female_count / mf_crew.female_count.sum()
mf_crew.head()

In [None]:
ax = mf_crew.groupby('department').sum()[['scaled_male', 'scaled_female']].plot.barh(figsize=(8, 8))
ax.set(title="Fraction of jobs by department, split by gender",
      xlabel="Fraction of this gender in this department",
      ylabel="Department");

This shows women are prominent in production and costume & make-up, and men are prominent in camera, sound, and writing. The higher values for women in this sample suggest that women are concentrated into just a few departments while men have a wider range of jobs.

Again, looking at the most popular jobs.

In [None]:
import pandas as pd

m_crew_popular_scaled = mf_crew.sort_values('scaled_male', ascending=False)[:10]
f_crew_popular_scaled = mf_crew.sort_values('scaled_female', ascending=False)[:10]
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
mf_crew_popular_scaled = pd.concat([m_crew_popular_scaled, f_crew_popular_scaled])
mf_crew_popular_scaled.drop_duplicates(inplace=True)
mf_crew_popular_scaled

In [None]:
ax = mf_crew_popular_scaled[['scaled_male', 'scaled_female']].plot.barh(figsize=(8, 8))
ax.set(title="Most popular jobs, split by gender",
      xlabel="Fraction of this gender in this job",
      ylabel="Job");

This is a very similar story. Women are concentrated in casting, costume design, and script supervision. Men are less concentrated, but dominate directing, photography, and screenplay writing.

("[Script supervisor](https://en.wikipedia.org/wiki/Script_supervisor)" seems to be an admin job, keeping track of continuity between takes.)

**Comments**

The gender divide is quite evident, especially in just the sheer numbers of men vs women. These are the sorts of results and visualisations which could lend powerful support to arguments about gender equality and power in US film-making. You could apply similar ideas to other domains, where you want to illustrate differences between subgroups.

Looking at the categories, women do casting, clothes & hair, and admin. Men direct and write stories. 

#### End of Activity 11

-------------------------------------------------

**Question: Can queries incorporate bias?**

In some of the above queries, the `COALESCE` function returned the first non-NULL job role. Note the bias built into the query of the male job being tested first. It can be very easy to unconsciously build possible bias into a query, although in this case it should have no effect (we have an exclusive-OR case of either male job or female job).

The query also includes a full outer join, which means we keep rows from each view even when there is no matching row in the other. If we were using an explicit or implied left join, we would be biasing our result to the left table, i.e. towards `male_job`.
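
The effect of choosing a left join can be seen on a couple of invented rows. The sketch below uses Python's built-in `sqlite3` module with toy `male_job` and `female_job` tables (tables rather than views, and with made-up counts). Older SQLite versions have no `FULL OUTER JOIN`, so the full outer join is emulated here as a left join plus the female-only rows; this is an assumption of the sketch, not how the PostgreSQL query above works.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE male_job (job TEXT, male_count INTEGER);
    CREATE TABLE female_job (job TEXT, female_count INTEGER);
    INSERT INTO male_job VALUES ('Director', 5), ('Gaffer', 3);
    INSERT INTO female_job VALUES ('Director', 1), ('Casting', 4);
""")

# LEFT JOIN keeps every male_job row but drops jobs held only by women.
left_jobs = [row[0] for row in conn.execute("""
    SELECT male_job.job
    FROM male_job LEFT JOIN female_job ON male_job.job = female_job.job
    ORDER BY male_job.job
""")]

# Emulated full outer join: the left join plus the female-only rows.
full_rows = conn.execute("""
    SELECT male_job.job AS job, male_count, female_count
    FROM male_job LEFT JOIN female_job ON male_job.job = female_job.job
    UNION ALL
    SELECT female_job.job, NULL, female_count
    FROM female_job
    WHERE female_job.job NOT IN (SELECT job FROM male_job)
    ORDER BY job
""").fetchall()
print(left_jobs)
print(full_rows)
```

With the left join, the job done only by women silently disappears from the result; nothing in the output warns you that it was ever there.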

You might argue that a biased query is an incorrect query, but this would be a mistake: it is often the case that a query will unavoidably reflect some bias in the question, the query writer or the database. After all, very often you will not be able to go back and collect the data that you would have liked. However, it is absolutely vital that you should consistently review your queries to try to identify any biases that may exist (or error from mistaken assumptions). Then either try to remove them, or ensure that your final analysis is explicit about any limitations which may arise from the biases you have identified.

# Conclusion
This notebook has shown you how to create and use views and materialized views. In this notebook, you've used them to encapsulate parts of a complex query, and to provide a more convenient view onto the database which is easier for further queries to build on. 

You have also looked at some more extended and exploratory data investigations. These are the sorts of investigations you would need to do when exploring datasets, especially when you have new datasets such as in TMA02 and the EMA.