**Important**: Before starting this notebook, you should run notebook `11.0 Setting up the Movie database` to build the movies database that we will use in this Part.

# Exploring the Movies dataset

How do you go about understanding a new and unfamiliar dataset?

One good approach is to think about some of the questions you might want to ask of the data, and then go through the process of moving from a vague question to precise, computable statements in SQL or pandas. Throughout this part, we'll be looking at a database that contains a dataset of information about movies.

This notebook introduces the dataset and shows how you can explore its contents using the following three-step process:

1. pose some questions you could ask of the dataset,
2. identify how to write SQL queries to address those questions, and
3. identify how to interpret those results to form answers to the original question.

You'll continue this process of exploratory data investigation throughout the rest of the notebooks in this part.

You should spend around 2 hours on this notebook.

## The movies dataset

The dataset started as a sample for the [HetRec machine learning challenge](https://grouplens.org/datasets/hetrec-2011/) and has been extended with data from [The Movie Database (TMDb)](https://www.themoviedb.org/?language=en).


The first step in understanding any dataset is normally to read any supporting documentation for it. This will often give an overview of the data, as well as describing the structure and content of the tables. 

In this case, the movies dataset spans several SQL tables, so understanding it is a more complicated task than it would be for a dataset held in a single table.

As this dataset was assembled by the module team, we won't subject you to having to read all the details up-front. Instead, we'll follow an incremental approach, where you'll look at different parts of the dataset as we pose and ask questions of it.

But before that, we'll give the ERD of the dataset, which contains much of the information you would expect to find in a documented dataset:


![Movies ERD](./images/movies-erd.svg)


The core of the dataset is the `movie` table, which lists a large number of movies. The `person` table holds details about people connected with movies. The `crew` and `cast_member` tables connect people to movies, depending on whether the person appears in the film or is a member of the crew making the film (or both).

Movies can also have zero or more `genre`s, `language`s, `production country`s and `production company`s. The dataset also holds the composite entities to connect `movie`s to `genre`s and so on.

Finally, each `movie` can be in a `collection`, such as the _Star Trek_ films, and a person can be `also known as` different names (often transliterations into different languages).

From this initial information, we can start to explore the dataset.

## Setting up

The next group of cells set up your database connection, and reset the database to a clean state. Check notebook *08.1 Data Definition Language in SQL* if you are unsure what the next cells do.

You may need to change the given values of the variables `DB_USER` and `DB_PWD`, depending on which environment you are using.

In [None]:
# Make the connection

%run sql_init.ipynb
print("Connecting with connection string : {}".format(DB_CONNECTION))
%sql $DB_CONNECTION

## Required SQL knowledge


Notebooks 3.2, 3.3, and 4.5 contain several examples of how to use SQL. You may need to refer back to those Notebooks for examples of how to perform various tasks with SQL. We have also included [a short cheat sheet notebook](SQL_cheatsheet.ipynb) in this directory with some brief reminders of the main elements of an SQL query.

## Using the `movies` schema

As with notebooks *09.2 Using foreign keys in SQL* and *09.3 Working With FOREIGN KEY Constraints*, we have defined a separate schema to contain the movies database. We have called this schema `movies`; you can see that this is where the data is held by making a `SELECT` query with the qualified table:

In [None]:
%%sql

SELECT *
FROM movies.movie
LIMIT 5;

As with the `hospital` schema, it makes sense to tell PostgreSQL which schemas to use first. To tell PostgreSQL to search the `movies` schema before the `public` schema, we use the following:

In [None]:
%%sql

SET search_path TO movies, public;

We can see the value of `search_path` with:

In [None]:
%%sql

SHOW search_path;

which should list `movies` before `public`.

If we now attempt to `SELECT` from the `movie` table, we should find that the `movies.movie` table is queried:

In [None]:
%%sql

SELECT *
FROM movie
LIMIT 5;

## Columns in the database's tables


In many cases, the meaning of a field within a table should be clear from the name, which you can find either from the [movie database build script](../sql_initial_state_movies.py) or a simple exploratory query such as:

```SQL
SELECT *
FROM <table> 
LIMIT 5;
```

Another possibility is to ask PostgreSQL to tell you more about a table by querying the information schema, PostgreSQL's record of its databases. The query below lists the columns in the `person` table along with their data types:

In [None]:
%%sql 

SELECT column_name, data_type
FROM INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'person';

### Activity 1


A good place to start the investigation might be to see what columns are present in the `movie` table. That seems important, as this is the "movies" dataset.

Find the names of the columns in the `movie` table.

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

To view the columns in a given table, we can carry out a `SELECT` on the table:

In [None]:
%%sql

SELECT *
FROM movie
LIMIT 5;

Note that in this case, the table returned is so large that some of the columns might have been removed for readability. To see the actual columns, we can use the `<<` syntax to put the returned table into a named dataframe, and then use the `.columns` property of the dataframe to see the complete list of columns:

In [None]:
%%sql df <<

SELECT *
FROM movie
LIMIT 5;

In [None]:
df.columns

Alternatively, we could use PostgreSQL's own record of the tables:

In [None]:
%%sql

SELECT column_name, data_type, character_maximum_length
FROM INFORMATION_SCHEMA.COLUMNS 
WHERE table_name = 'movie';

#### End of Activity 1

-------------------------------------------

## The movie table

The `movie` table has many columns, most of which are self-explanatory. Some of the less obvious ones are:

* `rt_all_critics_num_fresh`
* `rt_all_critics_num_reviews`
* `rt_all_critics_num_rotten`
* `rt_all_critics_rating`
* `rt_all_critics_score`

* `rt_audience_num_ratings`
* `rt_audience_rating`
* `rt_audience_score`

* `rt_top_critics_num_fresh`
* `rt_top_critics_num_reviews`
* `rt_top_critics_num_rotten`
* `rt_top_critics_rating`
* `rt_top_critics_score`

* `popularity`
* `vote_average`
* `vote_count`

The `rt_`… fields are related to [Rotten Tomatoes](https://www.rottentomatoes.com/), the crowdsourced movie rating site. The `critics` are (semi-) professional film critics, and the `audience` fields are for the crowdsourced ratings. `vote_count` and `vote_average` are the crowdsourced ratings from TMDb. `popularity` is also from TMDb, but exactly what the values in this column represent is unclear.

Each movie has ID fields:
* `id`: an arbitrary value from HetRec
* `tmdb_id`: the movie's ID on TMDb
* `rt_id`: the movie's ID on Rotten Tomatoes
* `imdb_id`: the movie's ID on the [Internet Movie Database (IMDb)](https://www.imdb.com/)

## Size of dataset

A sensible next step is to understand just how large the dataset is. A simple SQL `SELECT COUNT(*)` will count the number of rows in a table.


### Activity 2

How many movies are in the dataset? How many people?

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

To see how many movies are in the dataset, we can use the query:

In [None]:
%%sql 

SELECT COUNT(*) AS number_of_movies
FROM movie;

Similarly, to see how many people are in the dataset, we can use the query:

In [None]:
%%sql 

SELECT COUNT(*) AS number_of_people
FROM person;

#### End of Activity 2

-------------------------------------------

This is a dataset which is too large to see all the detail. We'll have to rely on aggregation and filtering tools to find understandable answers to any questions we have.

## SQL to DataFrames

The `SqlMagic` feature of the notebook can return two different sorts of object: a raw `ResultSet` object, or a pandas DataFrame. Which type is returned is determined by a `SqlMagic` configuration setting.

The configuration options available, along with their current settings, can be inspected by running the command:

```%config SqlMagic```

The boolean `SqlMagic.autopandas` setting determines whether a pandas DataFrame or a raw `ResultSet` is returned:

```python
%config SqlMagic.autopandas=True  #return a pandas DataFrame
%config SqlMagic.autopandas=False #return a raw ResultSet
```

By default, the `ipython-sql` magic sets `SqlMagic.autopandas=False`, but we have overridden this with a setting that sets `SqlMagic.autopandas=True` whenever a notebook is loaded in the TM351 VM.

`SqlMagic` also allows us to assign the results of a query directly to a Python variable (e.g. `myResponseObject`) using the magic command:

<code>%%sql myResponseObject << </code>

The combination of these means we can easily capture the results of an SQL query into a DataFrame for further manipulation.

Perhaps confusingly, pandas DataFrame and raw `ResultSet` objects are rendered the same way in a notebook. To check which sort of object is returned, we can find its `type`, for example by running a command of the form:

`type( myResponseObject )`

Let's see how the variable assignment works.

First, ensure that we are returning pandas DataFrame objects from the SQL query:

In [None]:
%config SqlMagic.autopandas=True

Here are a few rows from the `person` table:

In [None]:
%%sql some_people <<

SELECT *
FROM person
LIMIT 5;

In [None]:
some_people

In [None]:
#Check its type - it should be a pandas dataframe
type(some_people)

We can do all those things we can normally do with a DataFrame, such as ask for a summary…

In [None]:
some_people.describe()

although numerical summary statistics may not be meaningful for all columns. For example, some columns use a numerical value to encode a particular category, such as gender.

We can also pick out some columns from the dataframe in the usual pandas way:

In [None]:
some_people[['id', 'name', 'birthday']]

One thing to note is the auto-generated index of the DataFrame. Often, you'll find it more useful to create your own index, usually based on the primary key. You can do this with `set_index()`, like this:

In [None]:
# id is the primary key of the person table

some_people.set_index('id', inplace=True)
some_people

Note that `inplace=True` means that the dataframe itself is changed. If we had used `inplace=False` (the default), the method would have returned a new dataframe with the new index, leaving `some_people` unaffected.
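
The difference can be seen on a tiny frame. This is a minimal sketch using hypothetical data (the `people` frame and its values are made up for illustration, not taken from the `person` table); it also shows that `set_index()` only relabels the rows and does not sort them.

```python
import pandas as pd

# Hypothetical rows, deliberately not in id order
people = pd.DataFrame({"id": [3, 1, 2], "name": ["C", "A", "B"]})

# Without inplace=True, set_index() returns a new DataFrame...
reindexed = people.set_index("id")

# ...and the original still has 'id' as an ordinary column
original_columns = list(people.columns)

# The new index keeps the original row order; sorting is a separate step
unsorted_index = list(reindexed.index)
sorted_index = list(reindexed.sort_index().index)

print(original_columns, unsorted_index, sorted_index)
```

Assigning the result to a new name, as here, is a reasonable choice when you want to keep the original frame around for comparison.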

It can also be useful to sort the DataFrame by that index, using the `sort_index()` method:

In [None]:
some_people.sort_index(inplace=True)
some_people

# Difference between critic and audience ratings
Now we have a brief understanding of the dataset, let's look into the `movie` table a bit more. 

Each movie has many different ratings. Are they telling us the same thing, or different things? If the critic and audience ratings are different, for which films are they most different?

### Activity 3

Create a DataFrame of films, containing:
- the film's title,
- the film's critic rating (`rt_all_critics_rating`),
- the film's audience rating (`rt_audience_rating`), and 
- the absolute difference between the ratings.

Only include movies where both the critic and audience ratings are given (i.e. where both are greater than zero). The DataFrame should be indexed by movie ID, but ordered so the largest differences in ratings are in the first rows.

(Refer to the cells above for how to load the results of a query into a DataFrame and how to set the index of a DataFrame.)

What do you think of the movies that most divided critics and audience?

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

The query itself is straightforward: it pulls the relevant fields from the table, with the difference calculated for each returned movie. The `WHERE` clause selects only the movies we're interested in. 

I use the <code>%%sql &#x2329;DataFrameName&#x232A; << </code> notation (combined with the `SqlMagic.autopandas=True` setting in the boilerplate) to create the DataFrame directly from the query result. The result is ordered by the `ORDER BY` clause in the SQL query.

Finally, I set the index of the DataFrame to be the movie id, by using `id`, the primary key of the `movie` table.

In [None]:
%%sql ca_ratings <<

SELECT id, title, 
    rt_all_critics_rating, 
    rt_audience_rating,
    ABS(rt_all_critics_rating - rt_audience_rating) AS diff
FROM movie
WHERE rt_all_critics_rating > 0
    AND rt_audience_rating > 0
ORDER BY diff DESC;

In [None]:
ca_ratings.set_index('id', inplace=True)
ca_ratings.head(10)

At first glance, the audience ratings look surprisingly low compared with the critics' ratings. However, checking the source of the data ([the Rotten Tomatoes site](https://www.rottentomatoes.com)) reveals that the critics rating is out of 10, whereas the audience rating is out of 5. If there seems to be a problem with the data, it's important to check that you know what the data is actually showing. In this case, assuming that the critic and audience figures are on the same scale would lead you to seriously misinterpret the numbers.
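
One way to make the two ratings comparable is to put them on a common scale before taking the difference. The sketch below assumes, as described above, that critic ratings run up to 10 and audience ratings up to 5. The function name `scaled_diff` and the sample values are hypothetical, and this is not the calculation used in the solution query.

```python
def scaled_diff(critic_rating, audience_rating,
                critic_max=10.0, audience_max=5.0):
    """Absolute difference after rescaling both ratings to the range 0 to 1."""
    return abs(critic_rating / critic_max - audience_rating / audience_max)

# Hypothetical (critic out of 10, audience out of 5) pairs, not real data
samples = [(8.0, 4.0), (9.0, 2.0), (3.0, 4.5)]
diffs = [scaled_diff(critic, audience) for critic, audience in samples]

for (critic, audience), diff in zip(samples, diffs):
    print(critic, audience, round(diff, 2))
```

On this scale, a film rated 8 by critics and 4 by the audience shows no disagreement at all, whereas the raw difference would have reported a gap of 4. A similar rescaling could be written into the SQL expression itself, for example by dividing each rating by its maximum inside `ABS(...)`.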

#### End of Activity 3

-------------------------------------------

## Visualising the differences

Having created a dataframe of some of the information we're interested in, it can be useful to visualise the result.

### Activity 4

Create a scatter chart of critic rating vs audience rating. (See the notebooks in Part 4 if you need a reminder of how to do this.)

What does this tell you about any correlation between them?

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

In [None]:
ax = ca_ratings.plot.scatter(x='rt_all_critics_rating', y='rt_audience_rating',
                       title='Critic vs audience ratings')
ax.set(xlabel='Critic rating', ylabel='Audience rating');

The plot is a large blob, but it does broadly seem to show a trend.

The general shape of the blob suggests that critic and audience ratings correlate: films with higher audience ratings also have higher critic ratings, and *vice versa*. However, the density of points within the blob makes it difficult to see exactly what's going on there. The blob could be rather smeared out (indicating a lower correlation), or most of the points could be focused on one line (indicating a higher correlation).
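
If the eye cannot judge how smeared out the blob is, a correlation coefficient can put a number on it. The following is a minimal sketch of Pearson's correlation coefficient in plain Python; the function name `pearson_r` and the two small samples are hypothetical and are not drawn from the dataset.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical ratings: one set lies exactly on a line, the other is smeared out
critic = [2.0, 4.0, 6.0, 8.0]
audience_on_line = [1.0, 2.0, 3.0, 4.0]
audience_smeared = [2.5, 1.5, 4.0, 3.0]

r_line = pearson_r(critic, audience_on_line)
r_smeared = pearson_r(critic, audience_smeared)
print(r_line, r_smeared)
```

Points focused on one line give a coefficient close to 1, while a smeared-out cloud gives a lower value. With the DataFrame from Activity 3, pandas can compute the same statistic with `ca_ratings['rt_all_critics_rating'].corr(ca_ratings['rt_audience_rating'])`.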

There are a couple of outliers, above and below the main blob, which we can pull out.

In [None]:
ca_ratings[ca_ratings.rt_audience_rating > 4.8]

In [None]:
ca_ratings[ca_ratings.rt_audience_rating < 1.6]

#### End of Activity 4

-------------------------------------------

# How big are the stars?


Filmmaking, and Hollywood particularly, seem to be "star dominated" where a few prominent actors get all the roles, especially the starring roles. How true is this?

One measure of how "big" a star might be is the number of films they have appeared in.

The `cast_member` table holds the movies each person appears in. What does this table look like?

In [None]:
%%sql 

SELECT * 
FROM cast_member
LIMIT 2;

It looks like a composite entity linking `movie` and `person`, as expected from the ERD. It has one row for each actor's appearance in a movie. It has information on the appearance of that person in the movie, such as the character played.

### Activity 5

Write a query which returns the number of appearances of each actor, over all the movies.

Put the results in a DataFrame, called `appearances`, with the number of appearances in a column called `appearance_count`. Present the first few rows of the table in decreasing `appearance_count` order.

What limits the usefulness of this query?

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

In [None]:
%%sql appearances <<

SELECT person_id, COUNT(movie_id) AS appearance_count 
FROM cast_member 
GROUP BY person_id;

In [None]:
appearances.sort_values('appearance_count', ascending=False).head()

Alternatively, I could have sorted the table, and even limited the number of results returned, in the SQL query step.

```SQL
SELECT person_id, COUNT(movie_id) AS appearance_count 
FROM cast_member 
GROUP BY person_id
ORDER BY appearance_count DESC
LIMIT 5;
```

The usefulness of the query is limited by returning the `person_id` rather than the person's name. From the returned table, I still have no idea who the person with the greatest number of recorded appearances actually is!
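
One way round this limitation is to join `cast_member` to `person` so that the name comes back alongside the count. The sketch below is only an illustration: it uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL, cut-down versions of the two tables, and made-up rows ('Actor A' and 'Actor B' are hypothetical).

```python
import sqlite3

# In-memory SQLite database standing in for the movies schema
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cast_member (person_id INTEGER, movie_id INTEGER);
    INSERT INTO person VALUES (1, 'Actor A'), (2, 'Actor B');
    INSERT INTO cast_member VALUES (1, 10), (1, 11), (1, 12), (2, 10);
""")

# Join each appearance to the person, then count appearances per person
named_counts = conn.execute("""
    SELECT person.name, COUNT(cast_member.movie_id) AS appearance_count
    FROM cast_member, person
    WHERE cast_member.person_id = person.id
    GROUP BY person.id, person.name
    ORDER BY appearance_count DESC;
""").fetchall()
conn.close()

print(named_counts)
```

Grouping by `person.id` as well as `person.name` keeps two different people who happen to share a name in separate rows.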

#### End of Activity 5

-------------------------------------------

## "Eyeballing" the appearance counts

Out of interest, what does the distribution of appearance counts look like?

In [None]:
appearances.describe()

That _looks_ very unequal. The mean is 2.2, but at least 50% of actors have only one appearance, and 75% of actors have two or fewer. The maximum is 72, so there is a large range in the number of appearances.

Can we plot the data in a `hist`ogram?

In [None]:
ax = appearances.appearance_count.hist()
ax.set(title="Number of actors with numbers of appearances", 
       xlabel='Number of appearances', 
       ylabel='Number of actors');

This isn't overly useful, as the columns don't line up with the axis labels. But it does reinforce what `describe` told us: just over 70,000 of the 74,000 actors have just a few appearances, and there's a long but essentially invisible tail of prolific actors.

A more detailed plot, using [`value_counts()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html), gives the same message.

(`value_counts()` just counts the number of times each value appears in a series. I use [`sort_index`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sort_index.html) to ensure the result is in order of number of appearances, not number of actors with that appearance count.)

In [None]:
appearances.appearance_count.value_counts().sort_index().head()

In [None]:
ax = appearances.appearance_count.value_counts().sort_index().plot()
ax.set(title="Number of actors with numbers of appearances", 
       xlabel='Number of appearances', 
       ylabel='Number of actors');

Of the 74,000 actors, about 50,000 have only one appearance and a further 10,000 have two. That leaves only 14,000 actors with three or more appearances.

I can check my maths by finding actors with 3 or more appearances (the `loc[3:]`) and summing the number of actors.

In [None]:
#Get the count of each time a particular number of appearances is recorded,
#then skip the rows with index / appearance count 0, 1, 2 using pandas index slicing on the index values.
#Finally, sum those counts for the remaining rows (that is, counts of 3 or more appearances)

appearances['appearance_count'].value_counts().sort_index().loc[3:].sum()

Close enough.
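
The tally-and-slice step above can be mirrored in plain Python, which may help if the chain of pandas methods is hard to follow. This is a sketch on a hypothetical list of appearance counts (one entry per actor), using `collections.Counter` in place of `value_counts()`.

```python
from collections import Counter

# Hypothetical appearance counts, one entry per actor
appearance_counts = [1, 1, 1, 2, 2, 3, 5, 5, 12]

# Like value_counts(): how many actors have each number of appearances
tally = Counter(appearance_counts)

# Like .sort_index().loc[3:].sum(): add up the actors whose count is 3 or more
three_or_more = sum(n_actors for count, n_actors in tally.items() if count >= 3)

print(sorted(tally.items()), three_or_more)
```

In pandas, a direct filter such as `(appearances.appearance_count >= 3).sum()` should give the same total in a single step.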

## A Better Estimate? The Gini coefficient

One measure of inequality is the [_Gini coefficient_](https://en.wikipedia.org/wiki/Gini_coefficient), also known as the _Gini index_. You may have come across this in terms of how unequal societies are in terms of wealth distribution, but it applies to any distribution. 

The Gini coefficient ranges from 0 to 1. A Gini coefficient of 0 is total equality: everyone has the same amount of stuff. In the movie dataset context, it would mean everyone has the same number of film appearances. A Gini coefficient of 1 is total inequality, as if there were only one working actor and everyone else had zero appearances. A very uneven distribution, where 1% of the population have 50% of the roles, would give a Gini coefficient of at least 0.49.

You don't need to know how to calculate the Gini coefficient (either formula or function), but for completeness this is one method for doing so.

To calculate the Gini coefficient, we start with a series of "scores" of individuals, where the score could be the person's wealth, or in this case, the number of movie appearances. We call the set of scores $\mathbf{y}$ and each individual score is $y_i$. If these scores are in ascending order, and indexed from 1 to $n$, we can calculate the Gini coefficient as:

$$G = \frac{2 \sum_{i=1}^n i \, y_i}{n \sum_{i=1}^n y_i} - \frac{n+1}{n}$$

In _pandas_ terms, we can find the Gini coefficient of a `Series` of values with this function:

In [None]:
def gini_coeff(given_ser):
    """Find the Gini coefficient of a series given_ser."""
    # Create a copy of the series, in sorted order, indexed from 0 to n-1
    ser = given_ser.sort_values().reset_index(drop=True)
    
    # The .shape attribute of a Series is the one-element tuple (num_rows,)
    n = ser.shape[0]  # That is, the number of values in the series
    sum_y = ser.sum()
    # ser.index starts at 0, so add 1 to get the ranks i = 1..n used in the formula
    sum_iy = ((ser.index + 1) * ser).values.sum()
    return (2 * sum_iy) / (n * sum_y) - ((n + 1) / n)
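
As a sanity check on the formula, here is a plain-Python sketch of the same calculation applied to three small, hypothetical populations: total equality, a single working actor, and one person in a hundred holding half of all the roles. The function name `gini` and the sample scores are made up for illustration.

```python
def gini(scores):
    """Gini coefficient of a list of non-negative scores, using the rank formula."""
    ordered = sorted(scores)
    n = len(ordered)
    total = sum(ordered)
    # Ranks i run from 1 to n over the scores in ascending order
    weighted = sum(i * y for i, y in enumerate(ordered, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n

equal = gini([3, 3, 3, 3])          # everyone has the same score
one_worker = gini([0, 0, 0, 1])     # one person has everything
top_heavy = gini([1] * 99 + [99])   # 1 person in 100 has 99 of the 198 roles

print(equal, one_worker, round(top_heavy, 2))
```

The equal population gives 0 and the top-heavy one gives about 0.49, matching the figure quoted earlier. With only four people, the single-worker case gives 0.75 rather than 1: under this formula the largest possible value for $n$ people is $(n-1)/n$, which approaches 1 as the population grows.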

We can calculate the Gini coefficient of the appearances like this:

In [None]:
gini_coeff(appearances.appearance_count)

This is high. Recall, if 1% of the actors had 50% of the appearances, the Gini coefficient would be at least 0.49.

### Activity 6


Instead of looking at appearances over all time, group the appearances for each actor by year, so each row of the results is the number of films a particular person has appeared in, in a particular year: one row per person per year.

What is the Gini coefficient now? Why might there be a difference?

* Hint: how do you know the year of an appearance? It's not in the `cast_member` table.

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

First, we find the data. Rather than grouping just by person, we also group by year of the movie. That means we need to join the `cast_member` appearance with the `movie`, as `year` is in movie.

In [None]:
%%sql year_appearances <<

SELECT person_id, year, COUNT(movie_id) AS appearance_count
FROM cast_member, movie
WHERE cast_member.movie_id = movie.id
GROUP BY person_id, year
ORDER BY appearance_count, year;
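
The join above is written in the comma-and-`WHERE` style. The same grouping can also be expressed with an explicit `JOIN ... ON`. The sketch below compares the two forms, using Python's built-in `sqlite3` module as a stand-in for PostgreSQL and a handful of made-up rows; `person_id` is added to the `ORDER BY` only to make the row order fully deterministic for the comparison.

```python
import sqlite3

# In-memory SQLite database with cut-down, hypothetical versions of the tables
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie (id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE cast_member (person_id INTEGER, movie_id INTEGER);
    INSERT INTO movie VALUES (10, 1999), (11, 1999), (12, 2001);
    INSERT INTO cast_member VALUES (1, 10), (1, 11), (1, 12), (2, 12);
""")

# Comma-style join, as in the solution query
implicit = conn.execute("""
    SELECT person_id, year, COUNT(movie_id) AS appearance_count
    FROM cast_member, movie
    WHERE cast_member.movie_id = movie.id
    GROUP BY person_id, year
    ORDER BY appearance_count, year, person_id;
""").fetchall()

# The same query with an explicit JOIN ... ON
explicit = conn.execute("""
    SELECT person_id, year, COUNT(movie_id) AS appearance_count
    FROM cast_member
    JOIN movie ON cast_member.movie_id = movie.id
    GROUP BY person_id, year
    ORDER BY appearance_count, year, person_id;
""").fetchall()
conn.close()

print(implicit == explicit, explicit)
```

Both forms return one row per person per year; which to use is largely a matter of style, although the explicit form keeps the join condition separate from any filtering conditions.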

Again, quickly "eyeball" the data with `tail`, `describe` and a quick `hist`ogram.

Which items, if any, in the `describe` result are less useful?

In [None]:
year_appearances.tail()

In [None]:
year_appearances.describe()

The calculations performed on the `person_id` are not really meaningful. The mean and standard deviation calculated for the `year` are also of questionable value.

In [None]:
ax = year_appearances.appearance_count.hist()
ax.set(title="Number of actors with numbers of appearances in one year", 
       xlabel='Number of appearances', 
       ylabel='Number of actors');

This is still a very uneven distribution, but the maximum number of appearances per year is far less than it is for a whole career. Perhaps this is a less uneven distribution?

Calculate the Gini coefficient:

In [None]:
gini_coeff(year_appearances.appearance_count)

This is much less unequal. This is probably due to the fact that appearing in a film takes time, and there's only so much time in each year. This limits how different the appearance rates can be.

#### End of Activity 6

-------------------------------------------

### Activity 7

The first example is for _all_ actors, but that could include bit parts and supporting actors who have minor roles in lots of films.

Instead, let's look at the main actors in a film, given by the `cast_order` attribute. For the purposes of this exercise, define a "star" of a film as a person with a `cast_order` ≤ 5.

Repeat the activity above, for all of a person's appearances over time, but only including starring roles. 

Comment on your results.

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

In [None]:
%%sql starring <<

SELECT person_id, count(movie_id) as appearance_count 
FROM cast_member 
WHERE cast_order <= 5 
GROUP BY person_id;

In [None]:
starring.head()

In [None]:
starring.describe()

In [None]:
ax = starring.appearance_count.hist()
ax.set(title="Number of starring actors with numbers of appearances", 
       xlabel='Number of appearances', 
       ylabel='Number of starring actors');

This is still a very uneven distribution, and we're back to a very large number of roles for the most prolific actors. What does the Gini coefficient say?

In [None]:
gini_coeff(starring.appearance_count)

For comparison, here are the other Gini coefficients we calculated.

In [None]:
gini_coeff(appearances.appearance_count)

In [None]:
gini_coeff(year_appearances.appearance_count)

The starring roles are spread even more unequally than the result for all appearances, indicating that there really is a tendency for starring roles to be distributed among a small group of individuals. 

#### End of Activity 7

-------------------------------------------

# Who is the best Bond?
[James Bond](https://en.wikipedia.org/wiki/James_Bond_in_film) has been played by many actors. A perennial question that entertains film buffs is *"Who is best?"*.

Can this dataset shed any light on this most vital of questions?

The first step is to identify the Bond films, and the actors who have played him.

Note that we're using the PostgreSQL-specific `~*` operator to match the role name. (`~*` is the case-insensitive regular expression match; `~` is the case-sensitive version.)

What this means is that we can search the database using a regular expression to match results.

In [None]:
%%sql bond_films << 

SELECT name, title, rt_all_critics_rating, rt_audience_rating
FROM cast_member, person, movie
WHERE cast_member.person_id = person.id
    AND cast_member.movie_id = movie.id
    AND character ~* 'james.*bond'
ORDER BY name, title;

In [None]:
bond_films

Standard SQL uses `LIKE` and a different notation for its not-quite regular expressions:

In [None]:
%%sql bond_films_like << 

SELECT name, title, rt_all_critics_rating, rt_audience_rating
FROM cast_member, person, movie
WHERE cast_member.person_id = person.id
    AND cast_member.movie_id = movie.id
    AND LOWER(character) LIKE '%james%bond%'
ORDER BY name, title;

In [None]:
#Check the results are the same using the dataframe .equals() method
bond_films_like.equals(bond_films)
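
To see why the two patterns select the same rows here, it can help to try both styles of matching on a few strings outside the database. The sketch below uses Python's `re` module as a rough analogue of the `~*` operator, and a small hypothetical helper, `like_to_regex`, to imitate `LIKE`. The character names are made up, and the helper only handles the `%` and `_` wildcards.

```python
import re

def like_to_regex(pattern):
    """Translate a SQL LIKE pattern (% and _ wildcards) into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # % matches any run of characters
        elif ch == "_":
            parts.append(".")    # _ matches exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)

# Hypothetical character names
characters = ["James Bond", "Sir James Bond", "JAMES BOND 007",
              "Jimmy Bond", "Bond girl"]

# A regular expression search may match anywhere in the string
regex_hits = [c for c in characters
              if re.search(r"james.*bond", c, re.IGNORECASE)]

# LIKE must match the whole string, hence the % at both ends of the pattern
like_pattern = like_to_regex("%james%bond%")
like_hits = [c for c in characters if like_pattern.fullmatch(c.lower())]

print(regex_hits == like_hits, regex_hits)
```

The key difference is anchoring: the regular expression matches if the pattern occurs anywhere in the value, whereas `LIKE` compares against the whole value, so the leading and trailing `%` are needed to get the same effect.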

Now find all the different actors.

In [None]:
bond_films['name'].unique()

Who is the best Bond? Of those who've played Bond, who is the best actor?

And who is Bob Simmons?

As with many data explorations, asking questions of a dataset often leads to yet more questions...

### Activity 8

Spend a few minutes thinking about how you'd answer the question of *"who is the best Bond?"*, using this dataset. Don't write any code, just think about your plan of attack. 

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

This is my thinking. Yours will almost certainly be different.

> There are two different questions here:
> 1. Who is the best at portraying Bond?
> 2. Who is the best overall actor who's played Bond?

> There's nothing in the data that directly relates to quality of acting. However, the quality of a Bond film is heavily dependent on whoever's playing Bond, so we could reasonably choose the Rotten Tomatoes (RT) rating as a proxy for quality of the star. The activity above shows that critic and audience ratings are similar but not identical.

> We can show the results on a scatter plot to identify who has the highest ratings.

> To find who's the best _actor_, we need to look at the other films that actor has appeared in. Again, we can use the RT rating as a proxy for the acting quality in that other film. We should probably use the `cast_order` field to pick up only the films the actor is starring in.

> Actors with the best RT ratings make the best Bonds. Actors with the best ratings in all (or perhaps other) films are the best actors.

How does your approach compare to mine?

Post your approach on your tutor group forum. Have a look at what others have posted. What are the advantages and disadvantages of the different approaches?

Which do you think is best?

#### End of Activity 8

-------------------------------------------

## Plotting Bond films: a first attempt
For the remainder of this notebook, we'll work on my proposal for answering the "best Bond" question. If you have a better idea, do that instead. (But please look at these solutions, as they contain SQL techniques which will come in handy later.)

Let's plot the Bond films on a scatter plot, with the axes being the two RT ratings. We'll aim to group the data by actor, showing each actor's films with different colour and shape points.

As a first attempt, let's just plot the data.

In [None]:
ax = bond_films.plot.scatter(x='rt_all_critics_rating', y='rt_audience_rating')
ax.set(title="Critic vs audience ratings of Bond films", 
       xlabel='Critic rating', 
       ylabel='Audience rating');

That plots all the films, but doesn't distinguish between the different actors.

There's also an outlier with an *audience vs critic* rating of 0,0 which we could get rid of.

How about a `groupby` clause, to plot each group one at a time?

In [None]:
good_bond_films = bond_films[bond_films['rt_all_critics_rating']>0]
good_bond_films.groupby('name').plot.scatter(x='rt_all_critics_rating', y='rt_audience_rating')

Well, that's progress, of a sort. But it's not that helpful. We've just created one scatter chart for each actor, where we want one chart for all actors.

Notebook 5.3 shows how to plot several visualisations on the same chart: pass around a matplotlib `axes` object.

In this instance, we create an initial `axes` object with `plt.axes()` then plot each new scatter chart using that `axes` object. The `label` keyword parameter tells *pandas* how to show the series in the legend.

In [None]:
import matplotlib.pyplot as plt

In [None]:
ax = plt.axes()
for name, movies in good_bond_films.groupby('name'):
    movies.plot.scatter(x='rt_all_critics_rating', y='rt_audience_rating', 
                        ax=ax, label=name)
ax.set(title="Critic vs audience ratings of Bond films", 
       xlabel='Critic rating', 
       ylabel='Audience rating');

More progress: each actor is grouped separately, and all the points appear on the same scatter chart. But we can't distinguish between the actors. To do that, we need to tell matplotlib to use different colours and shapes for each group of points.

When we're doing that, we need to be careful to distinguish between the series so that all viewers can tell the difference. Your chart may be presented in monochrome, or may be viewed by someone with colour blindness (about 8% of men). Good colours to use ([Okabe & Ito 2008](http://jfly.iam.u-tokyo.ac.jp/color/) via [Connelly 2013](http://bconnelly.net/2013/10/creating-colorblind-friendly-figures/)) are:
* `black`
* `orange`
* `blue` 
* `lightseagreen`
* `darkgoldenrod`
* `dodgerblue`
* `tomato`
* `orchid`
    
(Okabe & Ito's palette suggests colours similar to `skyblue` and `yellow`, but they have low contrast with the white or pale grey background to _pandas_ plots. I suggest using `blue` and `darkgoldenrod` instead.) You can see more named colours available in _pandas_ and `matplotlib` in this [colour palette example](https://matplotlib.org/gallery/color/named_colors.html#sphx-glr-gallery-color-named-colors-py).

You should also use different shapes for each set of points, as given in this [list of marker shapes](https://matplotlib.org/api/markers_api.html), using the `marker` keyword parameter to _pandas_ `plot`.

In [None]:
# Define the colours and shapes to use
colrs = ['black', 'orange', 'blue', 'lightseagreen', 'darkgoldenrod', 'dodgerblue', 'tomato', 'orchid']
markrs = 'ov^s+Dxp'

To use these colours and shapes, we need to iterate over them in lockstep with the groups. The built-in Python function `zip` does this.

In [None]:
for (name, movies), colr, mkr in zip(good_bond_films.groupby('name'), colrs, markrs):
    print(name, ':', colr, mkr)
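One detail of `zip` worth knowing: it stops as soon as its shortest input runs out. The sketch below uses made-up group names and deliberately short style lists (they are not taken from the dataset) to show what happens when there are more groups than colours, and one possible way round it using `itertools.cycle`.

```python
from itertools import cycle

# Hypothetical group names and style lists, for illustration only.
groups = ['actor A', 'actor B', 'actor C']
colours = ['black', 'orange']   # deliberately one colour short
markers = 'ov^s'                # iterating a string yields one character at a time

# zip pairs items position by position and stops at the shortest input,
# so the third group is silently dropped here.
paired = list(zip(groups, colours, markers))

# Wrapping the style lists in cycle() repeats them as often as needed,
# so every group receives a colour and a marker.
paired_all = list(zip(groups, cycle(colours), cycle(markers)))

for group, colour, marker in paired_all:
    print(group, ':', colour, marker)
```

With eight colours and eight markers defined above, the Bond actors are all covered, so plain `zip` is enough in this notebook. Note that cycling reuses colours, which reduces the distinction between groups.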

Finally, we can create the scatter plot we want.

In [None]:
ax = plt.axes()
for (name, movies), colr, mkr in zip(good_bond_films.groupby('name'), colrs, markrs):
    movies.plot.scatter(x='rt_all_critics_rating', y='rt_audience_rating', 
                        c=colr, marker=mkr, ax=ax, label=name)
ax.set(title="Critic vs audience ratings of Bond films", 
       xlabel='Critic rating', 
       ylabel='Audience rating');

# We can also relocate the legend outside the plot area so that it doesn't occlude any points.
# bbox_to_anchor takes an (x, y) position, in axes coordinates, at which to place the legend.
# loc says which point of the legend sits at that position; other values include 'upper left'.
# Try some other loc values to move the legend around.
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5));

### Activity 9


Do a similar plot, but without the unrated _The World is Not Enough_. That will spread out the rest of the points, adding clarity.

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

One approach is to use a selector to create a subset of the DataFrame, then do the same plotting as before.

In [None]:
ax = plt.axes()
most_bond_films = bond_films[bond_films.rt_audience_rating > 0]
for (name, movies), colr, mkr in zip(most_bond_films.groupby('name'), colrs, markrs):
    movies.plot.scatter(x='rt_all_critics_rating', y='rt_audience_rating', 
                        c=colr, marker=mkr, ax=ax, label=name) 
ax.set(title="Critic vs audience ratings of Bond films", 
       xlabel='Critic rating', 
       ylabel='Audience rating');

plt.legend(loc='lower right');
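If the selector step is unfamiliar, the following minimal sketch shows it in isolation. The film titles and ratings are made up; only the column name `rt_audience_rating` matches the notebook. It also shows that a `> 0` comparison drops missing (`NaN`) ratings as well as zero ratings.

```python
import numpy as np
import pandas as pd

# Made-up ratings, for illustration only.
films = pd.DataFrame({
    'title': ['Film A', 'Film B', 'Film C'],
    'rt_audience_rating': [3.5, 0.0, np.nan],
})

# The comparison produces a boolean Series, with one True/False per row.
# Comparisons with NaN are False, so missing ratings are excluded too.
mask = films.rt_audience_rating > 0

# Indexing with the mask keeps only the rows where it is True.
rated = films[mask]
print(rated)
```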

#### End of Activity 9

-------------------------------------------

## Bob Simmons

What does the dataset tell us about Bob Simmons? Who did he play?

In [None]:
%%sql

SELECT name, person.id, title, character
FROM cast_member, person, movie
WHERE cast_member.person_id = person.id
    AND cast_member.movie_id = movie.id
    AND character ~* 'james.*bond'
    AND name ~* 'simmons'
ORDER BY name;

So it seems he had a minor (and uncredited) part in the title sequence.

What about the rest of Bob's career?

In [None]:
%%sql

SELECT name, person.id, title, character
FROM cast_member, person, movie
WHERE cast_member.person_id = person.id
    AND cast_member.movie_id = movie.id
    AND person.id = 1166842
ORDER BY name;

There wasn't one. It seems he was a <a href="https://en.wikipedia.org/wiki/Bob_Simmons_(stunt_man)">stunt man</a>.

## Actors' other films
If we want to look at the _other_, non-Bond films someone has made, we need to examine the `cast_member` table twice for each `person`. One look at the `cast_member` table will be for counting Bond films; the other will be for counting all their films.

As we need to look at two sets of rows in the same query, we use a _correlation name_ to give aliases to the two versions of the table we're drawing from. These two "copies" of the `cast_member` table are handled entirely independently of each other, so we include the join conditions to ensure that both copies are connected to the current person of interest. [Harrington chapter 17](http://proquestcombo.safaribooksonline.com.libezproxy.open.ac.uk/book/databases/9780128499023/chapter-17-retrieving-data-from-more-than-one-table/st0065_html_2) has more details.

For instance, this query counts the number of films an actor has made, and the number of Bond films they've made. Note the use of `DISTINCT` to count only the different movies each person has made.

The `AS` keyword is optional in the `FROM` clause and you'll often see queries without it, so the `FROM` clause could be:


```SQL
FROM cast_member bond_appearance, 
    cast_member other_appearance, 
    person
```

I include `AS` because I think it makes the query a bit more readable.

In [None]:
%%sql

SELECT name, person.id, 
    COUNT(DISTINCT other_appearance.movie_id) AS all_appearances,
    COUNT(DISTINCT bond_appearance.movie_id) AS bond_appearances
FROM cast_member AS bond_appearance, 
    cast_member AS other_appearance, 
    person
WHERE bond_appearance.person_id = person.id
    AND other_appearance.person_id = person.id
    AND bond_appearance.character ~* 'james.*bond'
GROUP BY name, person.id
ORDER BY name;
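To see what the two correlation names are doing, it can help to reproduce the idea in *pandas*. This is a sketch on a tiny made-up `cast_member` table, not the real data: merging the table with itself on `person_id` gives the two "copies", and the suffixes `_bond` and `_other` (names chosen here for illustration) stand in for the correlation names.

```python
import pandas as pd

# Tiny made-up cast_member table: person 1 has two Bond roles and one other role;
# person 2 never played Bond.
cast_member = pd.DataFrame({
    'person_id': [1, 1, 1, 2],
    'movie_id': [10, 11, 12, 10],
    'character': ['James Bond', 'James Bond', 'Detective', 'Villain'],
})

# Joining the table to itself on person_id plays the role of the two
# correlation names: the suffixes distinguish the two "copies".
pairs = cast_member.merge(cast_member, on='person_id', suffixes=('_bond', '_other'))

# Keep only the pairs whose first copy is a Bond role (like the ~* condition).
is_bond = pairs['character_bond'].str.contains('james.*bond', case=False, regex=True)
pairs = pairs[is_bond]

# Count distinct movies in each copy, as COUNT(DISTINCT ...) does.
counts = pairs.groupby('person_id').agg(
    all_appearances=('movie_id_other', 'nunique'),
    bond_appearances=('movie_id_bond', 'nunique'),
)
print(counts)
```

Person 2 disappears from the result because none of their rows satisfies the Bond condition, which is the same reason that only Bond actors appear in the SQL result.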

### Activity 10 (Optional)

This optional activity shows how to develop quite a complex query. As with many puzzles, you can easily sink large amounts of time into such challenges, so feel free to skim my approach as an alternative to attempting it yourself.

Find the average RT ratings (critic and audience) for each person's Bond appearances, and all of each person's starring appearances. Ideally, create both sets of averages in one query, so your result set looks something like this:

| name | id | all_critics_bond | audience_bond | bond_count | all_critics_all | audience_all | all_count | 
|------|----|------------------|---------------|------------|-----------------|--------------|-----------|
| Daniel Craig | $\cdots$ | $\cdots$ | $\cdots$ | $\cdots$ | $\cdots$ | $\cdots$ | $\cdots$ | 
| George Lazenby | $\cdots$ | $\cdots$ | $\cdots$ | $\cdots$ | $\cdots$ | $\cdots$ | $\cdots$ | 
| Pierce Brosnan | $\cdots$ | $\cdots$ | $\cdots$ | $\cdots$ | $\cdots$ | $\cdots$ | $\cdots$ | 
| Roger Moore | $\cdots$ | $\cdots$ | $\cdots$ | $\cdots$ | $\cdots$ | $\cdots$ | $\cdots$ | 
| Sean Connery | $\cdots$ | $\cdots$ | $\cdots$ | $\cdots$ | $\cdots$ | $\cdots$ | $\cdots$ | 
| Timothy Dalton | $\cdots$ | $\cdots$ | $\cdots$ | $\cdots$ | $\cdots$ | $\cdots$ | $\cdots$ |

Plot the results on a scatter chart, with one point per actor.

Comment on the charts. Who do these suggest is the best Bond?

(Don't include Bob Simmons.)

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

This is a complex task, so I'll take a few goes to develop a query which answers the question, iteratively improving the query as I understand what it's doing. 

To find the RT ratings, we need to pull in the `movie` table as well. Finding the average rating of all Bond films for each actor is a fairly straightforward query.

In [None]:
%%sql

SELECT name, person.id,
    AVG(movie.rt_all_critics_rating) AS all_critics_bond, 
    AVG(movie.rt_audience_rating) AS audience_bond,
    COUNT(DISTINCT movie.id) AS bond_count
FROM cast_member, 
    person, 
    movie
WHERE cast_member.person_id = person.id
    AND cast_member.movie_id = movie.id
    AND cast_member.character ~* 'james.*bond'
GROUP BY name, person.id
ORDER BY name;

It's slightly more complex if we want to use one query to return the ratings for both an actor's Bond films and all their films. Because we're looking at two different sets of films for each row in the results, we need to have two correlation names for the `movie` table, as well as two correlation names for the `cast_member` table.

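The ungrouped data contains several rows for each film, as the join produces one row for each combination of "Bond" film and "other" film. However, every Bond film is repeated the same number of times (once per other film) and every other film is repeated the same number of times (once per Bond film), so the average calculations still work.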

In [None]:
%%sql

SELECT name, person.id,
    AVG(bond_movie.rt_all_critics_rating) AS all_critics_bond, 
    AVG(bond_movie.rt_audience_rating) AS audience_bond,
    COUNT(DISTINCT bond_movie.id) AS bond_count,
    AVG(other_movie.rt_all_critics_rating) AS all_critics_all, 
    AVG(other_movie.rt_audience_rating) AS audience_all,
    COUNT(DISTINCT other_movie.id) AS all_count
FROM cast_member AS bond_appearance, 
    cast_member AS other_appearance, 
    person, 
    movie AS bond_movie, 
    movie AS other_movie
WHERE bond_appearance.person_id = person.id
    AND bond_appearance.person_id = other_appearance.person_id
    AND bond_appearance.movie_id = bond_movie.id
    AND other_appearance.movie_id = other_movie.id
    AND bond_appearance.character ~* 'james.*bond'
GROUP BY name, person.id
ORDER BY name;
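The claim that the averages survive the row duplication can be checked on a small example. The sketch below uses made-up ratings for a single actor (two Bond films, three other films) and pairs every Bond film with every other film, as the join does within each group.

```python
import pandas as pd

# Made-up ratings for one actor: two Bond films and three other films.
bond = pd.DataFrame({'bond_rating': [6.0, 8.0]})
other = pd.DataFrame({'other_rating': [4.0, 6.0, 8.0]})

# Pair every Bond film with every other film.
# A constant key is used for the cross join so this also runs on older pandas.
pairs = bond.assign(key=1).merge(other.assign(key=1), on='key').drop(columns='key')

# Each Bond rating is repeated 3 times and each other rating 2 times,
# so the means over the 6 paired rows equal the means of the original tables.
print(len(pairs))
print(pairs['bond_rating'].mean(), bond['bond_rating'].mean())
print(pairs['other_rating'].mean(), other['other_rating'].mean())
```

The same is not true of sums or plain counts, which would be inflated by the repetition. That is why the query counts with `COUNT(DISTINCT ...)`.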

We can get rid of Bob Simmons by excluding groups with `all_count` of 1. We can keep only starring appearances by filtering with `cast_order`.

As this is the final version of this query, let's capture it into a DataFrame.

In [None]:
%%sql bond_all_ratings << 

SELECT name, person.id,
    AVG(bond_movie.rt_all_critics_rating) AS all_critics_bond, 
    AVG(bond_movie.rt_audience_rating) AS audience_bond,
    COUNT(DISTINCT bond_movie.id) AS bond_count,
    AVG(other_movie.rt_all_critics_rating) AS all_critics_all, 
    AVG(other_movie.rt_audience_rating) AS audience_all,
    COUNT(DISTINCT other_movie.id) AS all_count
FROM cast_member AS bond_appearance, cast_member AS other_appearance, 
    person, 
    movie AS bond_movie, movie AS other_movie
WHERE bond_appearance.person_id = person.id
    AND bond_appearance.person_id = other_appearance.person_id
    AND bond_appearance.movie_id = bond_movie.id
    AND other_appearance.movie_id = other_movie.id
    AND bond_appearance.character ~* 'james.*bond'
    AND other_appearance.cast_order <= 5
GROUP BY name, person.id
HAVING COUNT (DISTINCT other_movie.id) > 1
ORDER BY name;

In [None]:
bond_all_ratings.set_index('id', inplace=True)
bond_all_ratings

Finally, plot the scatter charts, first for the critics' ratings and then the audience's.

Note the trick of using `apply` with `ax.text`, which annotates each point with the actor's name.

In [None]:
ax = bond_all_ratings.plot.scatter(x='all_critics_bond', y='audience_bond')
ax.set(title="Critic vs audience ratings of Bond actors in Bond films", 
       xlabel='Critic rating', 
       ylabel='Audience rating');
bond_all_ratings[['all_critics_bond', 'audience_bond', 'name']].apply(lambda x: ax.text(*x),axis=1);

In [None]:
ax = bond_all_ratings.plot.scatter(x='all_critics_all', y='audience_all')
ax.set(title="Critic vs audience ratings of Bond actors in all their films", 
       xlabel='Critic rating', 
       ylabel='Audience rating');
bond_all_ratings[['all_critics_all', 'audience_all', 'name']].apply(lambda x: ax.text(*x),axis=1);
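The annotation trick used in the two cells above is compact, so here is a sketch that unpacks it on made-up points (the column names `x`, `y` and `name` are illustrative). It shows the `apply` form next to an explicit loop that does the same job.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up points, for illustration only.
points = pd.DataFrame({
    'x': [1.0, 2.0, 3.0],
    'y': [2.0, 1.0, 3.0],
    'name': ['A', 'B', 'C'],
})

ax = points.plot.scatter(x='x', y='y')

# apply(..., axis=1) passes each row to the function; *row unpacks it into
# ax.text(x, y, s), so the selected columns must be in the order x, y, label.
labels = points[['x', 'y', 'name']].apply(lambda row: ax.text(*row), axis=1)

# An explicit loop with the same effect, drawn here on a second axes.
fig, ax2 = plt.subplots()
for x, y, name in points[['x', 'y', 'name']].itertuples(index=False):
    ax2.text(x, y, name)
```

The `apply` version is shorter, but it is being used only for its side effect of drawing text; the loop makes that intent more obvious. Which to use is a matter of taste.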

The charts are fairly clear: both the critics and audience say that Daniel Craig and Sean Connery are the best Bonds (though they disagree about the order). 

While the critics don't rate Timothy Dalton as a good actor, the audiences like him.

And then there's Pierce Brosnan.

#### End of Activity 10

-------------------------------------------

### Activity 11


The approach above includes each Bond film twice: once in the set of Bond films for an actor, and once for the set of all films. Therefore, highly-rated Bond films could overshadow the rest of an actor's career. 

Repeat the analysis above, but this time don't include the Bond films in the non-Bond films. (The `!~*` PostgreSQL operator finds fields that _don't_ match a regular expression, ignoring case.)

Does this change your interpretation of who is the best Bond?

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

One possible solution might be to include a check of the character name for `other_appearance.character` to exclude James Bond. Otherwise, the process is the same.

In [None]:
%%sql bond_other_ratings <<

SELECT name, person.id,
    AVG(bond_movie.rt_all_critics_rating) AS all_critics_bond, 
    AVG(bond_movie.rt_audience_rating) AS audience_bond,
    COUNT(DISTINCT bond_movie.id) AS bond_count,
    AVG(other_movie.rt_all_critics_rating) AS all_critics_other, 
    AVG(other_movie.rt_audience_rating) AS audience_other,
    COUNT(DISTINCT other_movie.id) AS other_count
FROM cast_member AS bond_appearance, 
    cast_member AS other_appearance, 
    person, 
    movie AS bond_movie, 
    movie AS other_movie
WHERE bond_appearance.person_id = person.id
    AND bond_appearance.person_id = other_appearance.person_id
    AND bond_appearance.movie_id = bond_movie.id
    AND other_appearance.movie_id = other_movie.id
    AND bond_appearance.character ~* 'james.*bond'
    AND other_appearance.character !~* 'james.*bond'
    AND other_appearance.cast_order <= 5
GROUP BY name, person.id
HAVING COUNT (DISTINCT other_movie.id) > 0
ORDER BY name
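For readers more at home with Python regular expressions, the sketch below gives a rough analogue of the `~*` and `!~*` conditions using the standard `re` module. The character names are made up, and `None` stands in for an SQL `NULL`. It is an approximation for this simple pattern only: PostgreSQL and Python use different regular expression dialects.

```python
import re

# Hypothetical character names, for illustration only; None stands in for SQL NULL.
characters = ['James Bond', 'Bond, James Bond', 'james  bond 007', 'Dr. No', None]

pattern = re.compile('james.*bond', re.IGNORECASE)

def matches(name):
    """Rough analogue of: name ~* 'james.*bond' (unanchored, case-insensitive)."""
    return name is not None and pattern.search(name) is not None

bond_roles = [c for c in characters if matches(c)]

# In SQL, NULL !~* 'pattern' evaluates to NULL rather than true, so a WHERE
# clause drops such rows; the explicit None check mirrors that here.
other_roles = [c for c in characters if c is not None and not matches(c)]

print(bond_roles)
print(other_roles)
```

One consequence for the query above: any appearance with no recorded character name is left out of the "other" films, because the `!~*` condition is not true for it.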

In [None]:
bond_other_ratings.set_index('id', inplace=True)
bond_other_ratings

In [None]:
ax = bond_other_ratings.plot.scatter(x='all_critics_bond', y='audience_bond')
ax.set(title="Critic vs audience ratings of Bond actors in Bond films", 
       xlabel='Critic rating', 
       ylabel='Audience rating');
bond_other_ratings[['all_critics_bond', 'audience_bond', 'name']].apply(lambda x: ax.text(*x),axis=1);

In [None]:
ax = bond_other_ratings.plot.scatter(x='all_critics_other', y='audience_other')
ax.set(title="Critic vs audience ratings of Bond actors in non-Bond films", 
       xlabel='Critic rating', 
       ylabel='Audience rating');
bond_other_ratings[['all_critics_other','audience_other','name']].apply(lambda x: ax.text(*x),axis=1);

With this slightly different view of the data, the results for the Bond films are, reassuringly, the same. The view for the other films, and hence of the quality of the actor, is different. It seems that Bond is only played by actors who have "star quality" and star in movies with high audience ratings. This implies that the film producers are after a "safe pair of hands" for people playing Bond.

#### End of Activity 11

-------------------------------------------

# Conclusions

This notebook has made you familiar with the main elements of the Movies dataset. It's also given you some practice in taking ill-formed questions and converting them into some specific queries which can address those questions.

In the following Notebooks in this Part, you'll look at more aspects of the movies dataset and pose more complex queries to answer more sophisticated questions.