# Subqueries as tables



In this notebook, you will look at some examples of how the results of a subquery can be used in place of a table in an SQL query. 

You'll explore this use by asking some more complex questions of the movies dataset. As with Notebook 11.2, the questions in this notebook are sometimes a little contrived, and sometimes a _lot_ contrived. These are not necessarily the most obvious or most interesting questions you could ask of the dataset. However, they serve to illustrate the SQL techniques.

You should spend around one hour on this notebook.

## Setting up

The next group of cells set up your database connection, and reset the database to a clean state. Check notebook *08.1 Data Definition Language in SQL* if you are unsure what the next cells do.

You may need to change the given values of the variables `DB_USER` and `DB_PWD`, depending on which environment you are using.

In [None]:
# Make the connection

%run sql_init.ipynb
print("Connecting with connection string : {}".format(DB_CONNECTION))
%sql $DB_CONNECTION

## The database ERD: recap



Before we start, here's the ERD of the database again, which can be useful in getting orientated around the information it contains.


![Movies ERD](./images/movies-erd.svg)


As with `Notebook 11.1: Movie analysis`, this notebook uses the `movies` schema, so let's set the `search_path` so that we don't need to qualify all the table names:

In [None]:
%%sql

SET search_path TO movies, public;

# The trend of globalisation

Have movies become more international over time? In other words, are more countries involved in the production of a film? Do films include more languages over time?

(As with the investigation in Notebook 11.2 into languages, the quality of the answers to this question is heavily affected by the limited scope of the dataset we're using.)

## The relevant data
We're still concentrating on the `spoken_language` and `production_country` tables, connected to the `movie` table by the composite entities of `movie_language` and `movie_production_country` respectively.

![Movies ERD fragment](images/movies-erd-foreign-fragment.svg)

This query counts the number of countries involved with each movie.

In [None]:
%%sql

SELECT movie_id, title, COUNT(*) AS production_country_count
FROM movie, movie_production_country
WHERE movie.id = movie_id 
GROUP BY movie_id, title
LIMIT 10;

Back in the early days of film production, many films were produced on a movie set back lot. But many films nowadays also make use of production teams and locations all over the world.

So has it become more likely for a single movie to involve multiple production countries?

If we know the number of countries involved in the production of each movie, then we can also find the average number of countries associated with the production of films in each year. (We can also easily find a count of the number of movies produced each year.)

So where's the data for this? The `movie` table records the year that each movie was released, and the `movie_production_country` table contains the production countries (one row per country; each movie may be associated with one or more rows).

But what we really want to do is to join the `movie` table with a table that contains the number of countries associated with each film.

We'll look only at movies released from 1981 to 2008 (the query below keeps years strictly between 1980 and 2009), as those years have sufficient movies for good averages.

In [None]:
%%sql country_count_year <<

SELECT year, 
       COUNT(movie.id) AS number_of_movies, 
       AVG(production_country_count) AS average_production_country_count
FROM (
    SELECT movie_id, COUNT(*) AS production_country_count
         FROM movie, movie_production_country
         WHERE movie.id = movie_id 
         GROUP BY movie_id
     ) AS pcc, movie
WHERE movie.id = pcc.movie_id
    AND year > 1980
    AND year < 2009
GROUP BY year
ORDER BY year;


In [None]:
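# Index the results by year so that the year is plotted on the x-axis
# (this mirrors what is done for language_count_year later in the notebook)
if 'year' in country_count_year.columns:
    country_count_year.set_index('year', inplace=True)

ax = country_count_year['average_production_country_count'].plot()
ax.set(title="Average number of production countries per movie over time", 
       xlabel='Year', 
       ylabel='Average production countries per movie');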

That looks like a large increase, but what if we ensure the y-axis starts at zero?

In [None]:
ax = country_count_year['average_production_country_count'].plot(ylim=(0, 1.7))
ax.set(title="Average number of production countries per movie over time", 
       xlabel='Year', 
       ylabel='Average production countries per movie');

That's a more modest increase. On the other hand, we'd expect every movie to be made in at least one country, so forcing the y-axis to zero may not be the best choice. 

In any event, there's a clear increase in the number of countries involved in movie production over time. Films are becoming more multinational.

#### An Aside - From "Pseudo-Code" to Queries

If you have come from an imperative programming background, such as Python, you may naturally try to phrase queries using some sort of Python-like pseudo-code.

For example, the question of how to find the average number of production countries per film in each year might be calculated in a Python program something like the following:

```
for each year in a given range:
    find the average number of production countries used in movies from that year
```

This decomposes further:

```
for each year in a given range of years:
    for each movie:
        find the number of production countries involved
    find the average number of (production countries per movie) in that year
```

We might rearrange this algorithm as follows:

```
    find the number of production countries
        for each movie;  
    find the average number of (production countries per movie)
        within a range of years
            for each year in that range;
```

The SQL query can be seen to take on this form, although the syntactic structure is different (in particular, a query is a declarative problem statement, unlike the Python code, which uses explicit looping constructs):


```SQL
-- Get the average number of production countries per movie
SELECT year, AVG(production_country_count) AS average_production_country_count

-- Join two tables: the movie table and a query-generated table of production country counts per movie

-- We need the movie table because it contains the year of production
FROM movie,

-- Get the ID and a production country count
(
SELECT movie_id, COUNT(*) AS production_country_count
FROM movie, movie_production_country
WHERE movie.id = movie_id 
    -- for each movie...
    GROUP BY movie_id
) AS pcc

-- this is the join condition
WHERE movie.id = pcc.movie_id

    -- limit the period
    AND year > 1980
    AND year < 2009

        -- for each year...
        GROUP BY year;
```
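
To see that the rearranged pseudo-code and the SQL query really describe the same computation, both can be run on a handful of made-up rows. The following is only a sketch: it uses Python's built-in `sqlite3` module rather than the notebook's database connection, the rows are invented, the `country_id` column name is an assumption, and the subquery is simplified so that it counts rows in `movie_production_country` without joining to `movie`.

```python
import sqlite3
from collections import defaultdict

# Hypothetical toy data: (movie id, title, year) and (movie id, country id)
movies = [(1, "A", 1990), (2, "B", 1990), (3, "C", 1991)]
movie_countries = [(1, 10), (1, 11), (2, 10), (3, 10), (3, 11), (3, 12)]

# Imperative version: count countries for each movie...
country_count = defaultdict(int)
for movie_id, _country_id in movie_countries:
    country_count[movie_id] += 1

# ...then average those counts for each year
counts_by_year = defaultdict(list)
for movie_id, _title, year in movies:
    if movie_id in country_count:  # like the join, skip movies with no country rows
        counts_by_year[year].append(country_count[movie_id])
imperative = {year: sum(c) / len(c) for year, c in sorted(counts_by_year.items())}

# Declarative version: one query, with the per-movie counts as a subquery table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT, year INTEGER)")
conn.execute("CREATE TABLE movie_production_country (movie_id INTEGER, country_id INTEGER)")
conn.executemany("INSERT INTO movie VALUES (?, ?, ?)", movies)
conn.executemany("INSERT INTO movie_production_country VALUES (?, ?)", movie_countries)

rows = conn.execute("""
    SELECT year, AVG(production_country_count)
    FROM movie,
         (SELECT movie_id, COUNT(*) AS production_country_count
          FROM movie_production_country
          GROUP BY movie_id) AS pcc
    WHERE movie.id = pcc.movie_id
    GROUP BY year
    ORDER BY year
""").fetchall()
declarative = dict(rows)

print(imperative)
print(declarative)
```

The two dictionaries should match: the loops spell out how to get the answer, while the query only states what answer is wanted.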

### Activity 1

Following a similar structure, what was the average number of languages spoken in movies in each year? 

Plot the two trends (countries and languages) on the same graph.

> __HINT:__
> You may find the following construction useful when creating your chart:
> ```python
> ax = df1.plot(...)  # Returns an axes object when the first chart is created
> ax = df2.plot(ax=ax, ...)  # Extends this object by adding the second chart to it
> ```

In [None]:
# Write your code in this cell

Comment on your findings.

Write your comments in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

The query has the same structure as before, just swapping `movie_language` for `movie_production_country` in the subquery.

In [None]:
%%sql language_count_year <<

SELECT year, COUNT(movie.id), AVG(language_count) AS average_number_of_languages
FROM (   SELECT movie_id, COUNT(*) AS language_count
         FROM movie, movie_language
         WHERE movie.id = movie_id 
         GROUP BY movie_id
     ) AS slc,
     movie
WHERE movie.id = slc.movie_id
    AND year > 1980
    AND year < 2009
GROUP BY year
ORDER BY year;

In [None]:
language_count_year.set_index('year', inplace=True)
language_count_year.head()

In [None]:
ax = language_count_year['average_number_of_languages'].plot()
ax.set(title="Average number of languages per movie over time", 
       xlabel='Year', 
       ylabel='Average languages per movie');

We plot the two lines on one figure by taking the axes object returned by the first `plot()` call and passing it as the `ax` parameter of the second `plot()` call. The second plot command then adds further chart components to this object.

In [None]:
ax = country_count_year['average_production_country_count'].plot(label='countries')
language_count_year['average_number_of_languages'].plot(ax=ax, label='languages')
ax.legend()
ax.set(title="Average number of languages\nand production countries per movie over time", 
       xlabel='Year', 
       ylabel='Average languages and\nproduction countries per movie');

The two trends are very similar from 1987 onwards. There's a difference before 1987 where the number of languages is larger than the number of production countries. This suggests there are some bilingual films being made in one country. That's something which could be worth some extra investigation. But we won't go into that here.

#### End of Activity 1

----------------------------------------------------------------

# Number of appearances

Using a similar approach to one used in Notebook 11.1, we can count the number of appearances made by each actor with a straightforward query.

In [None]:
%%sql appearances << 

SELECT person_id, name, COUNT(person_id) AS appearance_count
FROM cast_member, person
WHERE person_id = id
GROUP BY person_id, name
ORDER BY COUNT(person_id) DESC;

We can then use the dataframe's `value_counts` method to find how many actors had a certain number of appearances. The `value_counts()` method returns a pandas *Series* which we can explicitly sort using `sort_values()` (by default, this sorts in ascending order). Using `.head()` limits the displayed result.

In [None]:
appearances['appearance_count'].value_counts().sort_values(ascending=False).head()

We can plot what the distribution of appearances looks like as we did in Notebook 11.1:

In [None]:
# Sort by the index (the number of appearances) so the x-axis runs in order
ax = appearances['appearance_count'].value_counts().sort_index().plot()
ax.set(title="Number of actors with numbers of appearances", 
       xlabel='Number of appearances', 
       ylabel='Number of actors');

One thing you may notice about the above process is that it proceeded in two stages: we used SQL to obtain the appearance counts, and pandas for the `value_counts()`.

When working in notebooks, this "polyglot" approach, of using different computer languages at different stages whilst performing a single overall task, is often quite convenient. However, in some situations, it may be more desirable to use the database to do the whole thing for us in one go.

### Activity 2

Write a SQL query which directly returns an equivalent result to the `value_counts` step above. 

You'll need to count each actor's appearances in a subquery, and include that subquery in the main query's `FROM` clause.

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

The subquery is the same as above, but without the joining and ordering clauses:

```SQL
SELECT person_id, COUNT(person_id) AS appearance_count
FROM cast_member 
GROUP BY person_id;
```

That gives the number of appearances of each person. The outer query then counts the number of times each `appearance_count` occurs.
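
This two-stage count can be tried on a toy table before running it on the full dataset. The sketch below is not part of the notebook's database setup: it uses Python's built-in `sqlite3` module, an invented `cast_member` table with only the two columns needed, and `collections.Counter` as a plain-Python stand-in for the pandas `value_counts()` step.

```python
import sqlite3
from collections import Counter

# Hypothetical cast list: one (movie_id, person_id) row per appearance
cast_rows = [(1, 100), (2, 100), (3, 100), (1, 200), (2, 200), (1, 300), (3, 400)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cast_member (movie_id INTEGER, person_id INTEGER)")
conn.executemany("INSERT INTO cast_member VALUES (?, ?)", cast_rows)

# Subquery: appearances per person. Outer query: number of actors per count.
sql_result = conn.execute("""
    SELECT appearance_count, COUNT(appearance_count) AS actors
    FROM (SELECT person_id, COUNT(person_id) AS appearance_count
          FROM cast_member
          GROUP BY person_id) AS ac
    GROUP BY appearance_count
    ORDER BY appearance_count
""").fetchall()

# The same two stages in plain Python, for comparison
per_person = Counter(person_id for _movie_id, person_id in cast_rows)
per_count = Counter(per_person.values())

print(sql_result)
print(sorted(per_count.items()))
```

Both lines of output should list the same (appearance count, number of actors) pairs, which suggests that the nested query does in one step what the SQL-then-pandas approach did in two.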

In [None]:
%%sql actor_appearances <<

SELECT appearance_count, COUNT(appearance_count) AS actors
FROM (  SELECT person_id, COUNT(person_id) AS appearance_count
        FROM cast_member 
        GROUP BY person_id
     ) AS ac
GROUP BY appearance_count
ORDER BY appearance_count;

We then set the index of this DataFrame to be the `appearance_count` and sort the DataFrame by the index.

In [None]:
actor_appearances.set_index('appearance_count', inplace=True)
actor_appearances.sort_index(inplace=True)
actor_appearances.head()

#### End of Activity 2

------------------------------------------

# Maximum number of appearances per actor per year

An actor may appear in several movies in a year. But how many appearances has each actor made in each year?

In [None]:
%%sql

SELECT person.id, name, year, COUNT(movie_id) AS appearance_count
FROM person, cast_member, movie
WHERE person.id = cast_member.person_id
    AND cast_member.movie_id = movie.id
GROUP BY person.id, name, year;

### Activity 3


What is the maximum number of appearances by any actor in each year?

Use a single SQL query (perhaps with subqueries) to return the answer.

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

This is a similar approach to above: use the query above as a subquery, and have a wrapper query that aggregates (with `MAX`, in this case) the results of the subquery.
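
The wrapping pattern can be seen in miniature on a few made-up rows. This is a sketch under stated assumptions: it uses Python's built-in `sqlite3` module, and a single hypothetical `appearance` table (with the year already attached to each row) stands in for the join of `person`, `cast_member` and `movie`.

```python
import sqlite3

# Hypothetical rows: (person_id, movie_id, year)
appearances = [
    (100, 1, 1990), (100, 2, 1990), (100, 3, 1990),
    (200, 1, 1990), (200, 4, 1991),
    (300, 4, 1991), (300, 5, 1991),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE appearance (person_id INTEGER, movie_id INTEGER, year INTEGER)")
conn.executemany("INSERT INTO appearance VALUES (?, ?, ?)", appearances)

# Inner query: appearances per actor per year.
# Outer query: the largest of those counts in each year.
max_per_year = conn.execute("""
    SELECT year, MAX(appearance_count) AS max_appearance_count
    FROM (SELECT person_id, year, COUNT(movie_id) AS appearance_count
          FROM appearance
          GROUP BY person_id, year) AS ac
    GROUP BY year
    ORDER BY year
""").fetchall()

print(max_per_year)
```

Here person 100 has three appearances in 1990 and person 300 has two in 1991, so those are the maxima the outer query picks out.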

In [None]:
%%sql

SELECT year, MAX(appearance_count) AS max_appearance_count
FROM (SELECT person.id, year, COUNT(movie_id) AS appearance_count
     FROM person, cast_member, movie
     WHERE person.id = cast_member.person_id
        AND cast_member.movie_id = movie.id
     GROUP BY person.id, year
    ) AS ac
GROUP BY year
ORDER BY max_appearance_count DESC;

#### End of Activity 3

-----------------------------------------------------------

### Activity 4 (Optional)

As this activity is optional, you can either attempt it yourself, or feel free to skim my approach as an alternative.

The previous query tells us the maximum number of appearances by year, but does not tell us how many actors may have made that many appearances. 

Using the query above as a subquery, write a query which finds not only the `max_appearance_count` as above, but also the number of actors who had that many appearances that year. 

Your main query will need two subqueries: one to find the appearance count for each actor in each year, and one to find the `max_appearance_count` for each year. The fact that one of these subqueries itself contains a subquery is no obstacle to SQL.

Finally, you'll need to group the results.

For instance, in 1990, Joe Pesci (id 4517) and John Turturro (id 1241) both had five appearances; in 1986, M. Emmet Walsh (id 588), Steven Hill (id 21521) and Charlie Sheen (id 6952) each had four appearances.

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

First, let's just assemble the overall query. Using the subqueries above, it will look like this:

```SQL
SELECT year, max_appearance_count, COUNT(person.id)
FROM
    -- appearance count per actor per year -> person.id, year, appearance_count
    -- max appearance count per year -> year, max_appearance_count
    person
WHERE
    various joins
GROUP BY year, max_appearance_count
ORDER BY year;
```

where the comment lines are the name of the subquery, and the columns that query returns. 

However, the "max appearance count per year" query itself contains a subquery, so the structure is like this:


```SQL
SELECT year, max_appearance_count, COUNT(person.id)
FROM
    -- appearance count per actor per year -> person.id, year, appearance_count
    -- max appearance count per year -> year, max_appearance_count
       -- appearance count per actor per year -> person.id, year, appearance_count
    person
WHERE
    various joins
GROUP BY year, max_appearance_count
ORDER BY year;
```

We can then drop in the subqueries from above to form the overall query. To keep track of the fields returned by the various subqueries, I make use of correlation names for each subquery. (See notebook 11.1 for use of correlation names.)
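
Before assembling the full query, the core pattern (join the per-actor counts to the per-year maxima, then count the matches) can be tried on toy data. The sketch below makes several assumptions: it uses Python's built-in `sqlite3` module, a single hypothetical `appearance` table replaces the three-table join, and the repeated subquery is kept in a Python string constant (`PER_ACTOR`, a name invented here) purely to avoid typing it twice.

```python
import sqlite3

# Hypothetical rows: (person_id, movie_id, year)
appearances = [
    (100, 1, 1990), (100, 2, 1990),
    (200, 1, 1990), (200, 3, 1990),
    (300, 2, 1990),
    (300, 4, 1991), (300, 5, 1991), (300, 6, 1991),
    (100, 4, 1991),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE appearance (person_id INTEGER, movie_id INTEGER, year INTEGER)")
conn.executemany("INSERT INTO appearance VALUES (?, ?, ?)", appearances)

# Appearance count per actor per year; used twice below under different names
PER_ACTOR = """
    SELECT person_id, year, COUNT(movie_id) AS appearance_count
    FROM appearance
    GROUP BY person_id, year
"""

# ac: appearance counts; mac: max appearance count per year (built from iac)
query = f"""
    SELECT ac.year, mac.max_appearance_count, COUNT(ac.person_id) AS person_count
    FROM ({PER_ACTOR}) AS ac,
         (SELECT year, MAX(appearance_count) AS max_appearance_count
          FROM ({PER_ACTOR}) AS iac
          GROUP BY year) AS mac
    WHERE ac.year = mac.year
      AND ac.appearance_count = mac.max_appearance_count
    GROUP BY ac.year, mac.max_appearance_count
    ORDER BY ac.year
"""
result = conn.execute(query).fetchall()
print(result)
```

In this toy data two actors tie for the 1990 maximum of two appearances, while a single actor holds the 1991 maximum of three, so the final column distinguishes the tied year from the untied one.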


In [None]:
%%sql

SELECT ac.year, max_appearance_count, COUNT(person.id) AS person_count
FROM 

    -- appearance count per actor per year
    (SELECT person.id, year, COUNT(movie_id) AS appearance_count
     FROM person, cast_member, movie
     WHERE person.id = cast_member.person_id
        AND cast_member.movie_id = movie.id
     GROUP BY person.id, year
    ) AS ac,           -- appearance count

    -- max appearance count per year
    (SELECT year, MAX(appearance_count) AS max_appearance_count
     FROM 
     
         -- appearance count per actor per year
         (SELECT person.id, year, COUNT(movie_id) AS appearance_count
          FROM person, cast_member, movie
          WHERE person.id = cast_member.person_id
            AND cast_member.movie_id = movie.id
         GROUP BY person.id, year
         ) AS iac      -- note change of name, inner appearance count
     GROUP BY year   
    ) AS mac,          -- max appearance count

    person
WHERE person.id = ac.id
    AND ac.year = mac.year
    AND appearance_count = max_appearance_count
GROUP BY ac.year, max_appearance_count
ORDER BY year;

In [None]:
max_appearances_by_year = _ # give a name to the result above
max_appearances_by_year.set_index('year', inplace=True)
max_appearances_by_year.head()

What does this data look like?

In [None]:
max_appearances_by_year.describe()

The `max_appearance_count` looks like a fairly constrained range of results, going from 1 to 7 with a median of 4. The `person_count` is somewhat less constrained: the mean is 7, the median is 2, but the maximum is 104!

#### End of Activity 4

---------------------------------------------------

## Viewing the Data (Optional)

As this activity is optional, you can either attempt it yourself, or feel free to skim my approach as an alternative.

Now we have the data, we can plot it. (If necessary, run the code in the previous Activity so that the `max_appearances_by_year` dataframe is defined and populated.) Let's start by just looking at the maximum number of appearances for any actor in a year.

In [None]:
ax = max_appearances_by_year['max_appearance_count'].plot()

# We could equally select the column to plot with the construction:
# ax = max_appearances_by_year.plot(y='max_appearance_count')

ax.set(title="Max appearances by an actor each year over time", 
       xlabel='Year', 
       ylabel='Max appearances by an actor that year');

What if we try plotting both data series on the same graph?

In [None]:
ax = max_appearances_by_year.plot() #This plots all numerical columns
# The index values are plotted on the x-axis
ax.set(title="Max appearances by an actor each year over time", 
       xlabel='Year', 
       ylabel='Max appearances by an actor that year');

That really isn't useful: the spike of very large numbers in the 1910s swamps everything else.

We can try to alleviate that by plotting the `person_count` on a _secondary y-axis_ on the right. (Adding axis labels to the secondary y-axis just gets convoluted, so we won't bother here.)

In [None]:
ax = max_appearances_by_year.plot(secondary_y=['person_count'])
ax.set(title="Max appearances by an actor each year over time", 
       xlabel='Year', 
       ylabel='Max appearances by an actor that year');

Which, frankly, _still_ isn't that illuminating. There seems to be a limit of seven appearances per actor per year, but the number of people with that many credits varies widely. For most of the period of interest, there was a maximum of 2–4 appearances per actor per year, with around 10 people having that many appearances.

The spike in the 1910s is where there are years with many actors with one appearance each. 

# Conclusion
This notebook has shown you how to use subqueries as tables in a query. As you have seen, this allows you to express more sophisticated queries in just SQL, but at the cost of some rather complex queries. In the next notebook, you'll look at how to use **views** to keep queries simple while still doing complex tasks.