# Subqueries used as values and sets


In this notebook, you will look at some examples of how a subquery can be used in place of a value (or set of values) and used in exactly the same way as a literal value in an SQL query. 

Again, you will become familiar with the technique by asking some questions of the movies dataset. The questions in this notebook are sometimes a little contrived, and sometimes a _lot_ contrived. These are not necessarily the most obvious or most interesting questions you could ask of the dataset. However, they do serve to illustrate certain SQL techniques.

You should expect to spend around 45 minutes on this notebook. 

## Setting up

The next group of cells set up your database connection, and reset the database to a clean state. Check notebook *08.1 Data Definition Language in SQL* if you are unsure what the next cells do.

You may need to change the given values of the variables `DB_USER` and `DB_PWD`, depending on which environment you are using.

In [None]:
# Make the connection

%run sql_init.ipynb
print("Connecting with connection string : {}".format(DB_CONNECTION))
%sql $DB_CONNECTION

## Asking questions of the database



Before we start, here's the ERD of the database again, which can be useful in getting orientated around the information it contains.


![Movies ERD](./images/movies-erd.svg)


As with `Notebook 11.1: Movie analysis`, this notebook uses the `movies` schema, so let's set the `search_path` so that we don't need to qualify all the table names:

In [None]:
%%sql

SET search_path TO movies, public;

## Highest grossing movie each year

What is the highest grossing movie in each year?

This seems a reasonable question to ask of the dataset, but how would you go about answering it?

Something like:

```SQL
SELECT MAX(revenue) AS revenue 
FROM movie
```

will find the highest revenue over all the movies in the database, but that's not the highest revenue for each year.

However, if we can find the highest revenue *in each year*, can we use that to identify the movie in that year with that revenue? That movie would then be the highest-grossing movie that year.

In "pseudo-SQL", we want to say something along the lines of: *"Select the movie in each year with the highest revenue in that year"*; but how do we know what the largest revenue is per year?

Instead, we need to pick our query apart a little, and separate out the concerns: *"Select the movie in each year where the movie has a revenue equal to the highest revenue in that year"*.

That "equal to" effectively splits our query into two queries.
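To make the split concrete, here is a minimal sketch of the two steps carried out separately. It uses Python's built-in `sqlite3` module and a tiny made-up `movie` table; the titles and figures are invented for illustration and are not taken from the course database (which runs on PostgreSQL). The first query finds the highest revenue in each year; the second looks up the movie that matches each (year, revenue) pair.

```python
import sqlite3

# Hypothetical toy data: NOT the course's movies database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT, year INTEGER, revenue INTEGER)"
)
conn.executemany(
    "INSERT INTO movie VALUES (?, ?, ?, ?)",
    [
        (1, "Alpha", 2007, 300),
        (2, "Beta", 2007, 500),
        (3, "Gamma", 2008, 900),
        (4, "Delta", 2008, 100),
    ],
)

# Step 1: the highest revenue in each year.
max_per_year = conn.execute(
    "SELECT year, MAX(revenue) FROM movie GROUP BY year ORDER BY year"
).fetchall()

# Step 2: for each (year, highest revenue) pair, find the matching movie(s).
top_movies = []
for year, top_revenue in max_per_year:
    rows = conn.execute(
        "SELECT title FROM movie WHERE year = ? AND revenue = ?",
        (year, top_revenue),
    ).fetchall()
    top_movies.extend((year, title) for (title,) in rows)

print(max_per_year)
print(top_movies)
```

Running one lookup per year from Python works, but it needs a separate query for every year. The rest of this section shows how a subquery lets the database carry out both steps in a single query.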

### Unpicking the query - Largest Revenue Per Year


What is the highest amount made by a movie each year?

Don't return anything about the movie, just list the maximum revenue against year. To put the numbers into some sort of context, plot the data to get a feel for how the largest revenue figure has evolved over the years.

A straightforward combination of `GROUP BY` and `MAX()` will return the highest revenue in each year.

In [None]:
%%sql highest_gross_revenue_year <<

SELECT year, MAX(revenue) AS revenue
FROM movie
GROUP BY year
ORDER BY year;

We have used the expression `highest_gross_revenue_year <<` to store the table returned by the query in a DataFrame called `highest_gross_revenue_year`. We can now index it by year.

In [None]:
highest_gross_revenue_year.set_index('year', inplace=True)
highest_gross_revenue_year.tail()

#### An Aside - Does the Data Look Right? Largest Revenue Trends and Behaviour

When working on a complex question that we have split into several sub-questions, it often makes sense to check that the result of each step "looks about right".

So as an aside, let's see if the maximum revenue figures look reasonable (that is, check that they don't behave in a completely unexpected, surprising  or unreasonable way) by putting them into some sort of context.

For example, would you expect a trend of increasing revenues (a growth in cinema audiences and ticket prices, for example) or decreasing revenues (if movie-going becomes less popular)? Or are there any outliers that might result from dirty data? What sort of range in the highest revenue values would be (un)reasonable?

In [None]:
ax = highest_gross_revenue_year['revenue'].plot(title="Highest movie revenue each year")
ax.set(xlabel='Year', 
       ylabel='Highest revenue');

There's a general trend of increasing revenue over time, but there's also a lot of "noise" in the graph. 

The plot is also rather distorted by an outlier in 1939. What film was that?

In [None]:
%%sql 

SELECT id, title, year, revenue 
FROM movie 
WHERE revenue > 2000000000;

There are a lot of low-revenue movies before about 1950 and what looks like a flat period in the 1980s. Let's replot the chart, using just the movies from 1980 to 2010.

In [None]:
ax = highest_gross_revenue_year.loc[1980:2010].revenue.plot(ylim=(0, 2000000000))
ax.set(title="Highest movie revenue each year", 
       xlabel='Year', 
       ylabel='Highest revenue');

Over this period, there is some sort of trend of increasing revenue, but the year-on-year variation in the revenue figures is large. It would be difficult to be confident about any trend over this period.

### Finding the Movie With a Revenue Equal to the Highest Revenue

Armed with the maximum revenue in each year, we can now find the movie that brought in that money.

If we know that the highest-grossing movie in 2008 was the only one that year with a revenue over a billion dollars (1 times 10 to the power 9 dollars, or `1e9`), we could write a query of the form:

```SQL
SELECT title 
FROM movie
WHERE year=2008 AND revenue>1e9;
```

But how do we generalise this so we can query over each of the year/highest revenue combinations?

The trick is to use a query that finds the maximum revenue as a _subquery_ inside an outer query on the `movie` table.

We want the subquery to return a single value, the maximum revenue of any movie in the year of interest. We can use that value in the `WHERE` clause of the outer query. 

The idea is to iterate through the `movie` table and, for each movie, find the maximum revenue of any movie in the same year. If the revenues match, we keep the movie in the outer query.

We can do it with a query like this, which will take a few moments to complete:

In [None]:
%%sql

SELECT id, title, revenue, year
FROM movie AS outer_movie
WHERE revenue = (
    SELECT MAX(revenue)
    FROM movie AS inner_movie
    WHERE inner_movie.year = outer_movie.year
    )
ORDER BY revenue DESC
LIMIT 10;

(I used `ORDER BY` and `LIMIT` to find just the ten years with the highest single revenue.)
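As a way of experimenting with this pattern away from the full dataset, the following sketch runs the same correlated subquery on a tiny made-up table using Python's built-in `sqlite3` module. The data (including a deliberate tie in 2008) is invented, and SQLite stands in for the course's PostgreSQL server. For comparison, it also shows an alternative formulation, a join against a grouped derived table (the alias `yearly_max` is our own name), which returns the same rows.

```python
import sqlite3

# Hypothetical toy data: NOT the course's movies database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT, year INTEGER, revenue INTEGER)"
)
conn.executemany(
    "INSERT INTO movie VALUES (?, ?, ?, ?)",
    [
        (1, "Alpha", 2007, 300),
        (2, "Beta", 2007, 500),
        (3, "Gamma", 2008, 900),
        (4, "Delta", 2008, 900),  # deliberate tie for the top spot in 2008
        (5, "Epsilon", 2008, 100),
    ],
)

# Correlated subquery: the inner query is evaluated relative to each outer row.
correlated = conn.execute("""
    SELECT id, title, revenue, year
    FROM movie AS outer_movie
    WHERE revenue = (
        SELECT MAX(revenue)
        FROM movie AS inner_movie
        WHERE inner_movie.year = outer_movie.year
        )
    ORDER BY year, id
""").fetchall()

# Alternative: compute every year's maximum once, then join back to movie.
joined = conn.execute("""
    SELECT movie.id, movie.title, movie.revenue, movie.year
    FROM movie
    JOIN (
        SELECT year, MAX(revenue) AS top_revenue
        FROM movie
        GROUP BY year
        ) AS yearly_max
      ON movie.year = yearly_max.year
     AND movie.revenue = yearly_max.top_revenue
    ORDER BY movie.year, movie.id
""").fetchall()

print(correlated)
print(joined)
```

Note that when two movies tie for the highest revenue in a year, both are returned: the query asks for every movie whose revenue equals the maximum, not for exactly one movie per year.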

### Activity 1

Based on the query developed above, find the movies with the largest _budget_ in each year. 

Return the movie id and title, budget, and year.

Give the results for only the ten highest-budget years, in descending order of budget.

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

One possible solution is to find the movie in each year that has the same budget as the largest budget of any movie that year.

Once again, this query may take some time to run...

In [None]:
%%sql

SELECT id, title, budget, year
FROM movie AS outer_movie
WHERE budget = (
    SELECT MAX(budget)
    FROM movie AS inner_movie
    WHERE inner_movie.year = outer_movie.year
    )
ORDER BY budget DESC
LIMIT 10;

The structure of the query is identical to the previous example.

In terms of the results, it's quite surprising how little overlap there is between the two lists. It seems that a large budget is no guarantee of commercial success.

#### End of Activity 1

----------------------------------------

## Foreign language films
The dataset contains information about the spoken languages used in a film. What does this tell us about the number of "foreign language" films in the dataset and how that number has changed over time?

What counts as a "foreign language film"? Is it a film that contains some language other than English, or a film in which the English language does not appear? Taking a (frequently assumed) Western bias, we might even naively want to distinguish between *foreign language* films and *"foreign" films* that are produced outside the US or UK.

### The relevant data
The figure below shows the relevant fragment of the ERD of the movies dataset. It shows how the `movie` table is related to the `spoken_language` and `production_country` tables via the composite entities of `movie_language` and `movie_production_country` respectively.

![Movies ERD fragment](images/movies-erd-foreign-fragment.svg)

A few quick queries show what's in these tables.

In the `spoken_language` table, there are several language `id` values where no `language` name is recorded.  In such cases, the `language` value is the empty string rather than a NULL value. The following query returns just language codes where we do know the language name:

In [None]:
%%sql 

SELECT *
FROM spoken_language
WHERE language <> ''
LIMIT 5;

In [None]:
%%sql

SELECT *
FROM movie_language
LIMIT 5;

In [None]:
%%sql 

SELECT *
FROM production_country
LIMIT 5;

In [None]:
%%sql

SELECT *
FROM movie_production_country
LIMIT 5;

Essentially, `spoken_language` and `production_country` give human-readable names for languages and countries; the composite tables connect them to the `movie`s by putting the two IDs in the same row. 

We can show how this works with an SQL query with two joins, to give the languages spoken in each movie:

In [None]:
%%sql

SELECT movie_id, title, language_id, language
FROM movie, movie_language, spoken_language
WHERE movie.id = movie_id
    AND spoken_language.id = language_id
LIMIT 10;
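The query above joins the tables by listing them in the `FROM` clause and matching keys in the `WHERE` clause. The same join can also be written with explicit `JOIN ... ON` syntax, which keeps the join conditions separate from any other filters. The sketch below checks that the two forms agree, using Python's built-in `sqlite3` module and three tiny made-up tables that mimic only the columns used here (the rows are invented, not taken from the course database).

```python
import sqlite3

# Hypothetical toy tables: only the columns used in the query above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE spoken_language (id TEXT PRIMARY KEY, language TEXT);
    CREATE TABLE movie_language (movie_id INTEGER, language_id TEXT);

    INSERT INTO movie VALUES (1, 'Alpha'), (2, 'Beta');
    INSERT INTO spoken_language VALUES ('en', 'English'), ('fr', 'French');
    INSERT INTO movie_language VALUES (1, 'en'), (1, 'fr'), (2, 'fr');
""")

# Implicit join, in the style used in this notebook.
implicit = conn.execute("""
    SELECT movie_id, title, language_id, language
    FROM movie, movie_language, spoken_language
    WHERE movie.id = movie_id
        AND spoken_language.id = language_id
    ORDER BY movie_id, language_id
""").fetchall()

# The same join written with explicit JOIN ... ON clauses.
explicit = conn.execute("""
    SELECT movie_id, title, language_id, language
    FROM movie
    JOIN movie_language ON movie.id = movie_language.movie_id
    JOIN spoken_language ON spoken_language.id = movie_language.language_id
    ORDER BY movie_id, language_id
""").fetchall()

print(implicit)
print(implicit == explicit)
```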

#### An Aside - Looking for Bias In Datasets: Production Country

An important caveat with this dataset, and hence the reliability of the results you will generate, arises from where the data came from. The dataset was built from sources such as *The Movie Database (TMDb)* and the *Internet Movie Database (IMDb)*, which have a very strong US bias in the data selected for inclusion. This dataset is likely to be reliable for movies with a widespread US release, but there is very little data here for movies that were not released in the US. 

A quick look at the _production countries_ bears this out. (Note the join of `movie_production_country.production_country_id = production_country.id` to connect the `movie_production_country` rows, which are counted, with the `production_country` rows, which have the `country_name`. The column names are explicitly qualified with the appropriate table name to make the query easier to read.)

In [None]:
%%sql

SELECT production_country_id, country_name, COUNT(movie_id) AS movie_count
FROM movie_production_country, production_country
WHERE movie_production_country.production_country_id = production_country.id
GROUP BY production_country_id, country_name
ORDER BY movie_count DESC
LIMIT 10;

Despite its prolific output, the Bollywood film industry is barely represented in this dataset:

In [None]:
%%sql

SELECT production_country_id, country_name, COUNT(movie_id) AS movie_count
FROM movie_production_country, production_country
WHERE production_country_id = id
    AND country_name = 'India'
GROUP BY production_country_id, country_name
ORDER BY movie_count DESC;

Any count of "foreign language" films we derive from this dataset therefore reflects a US-centric perspective. And if your dataset is biased, that bias will propagate through to your results.

### Activity 2

Which movies have been produced in the UK that do not use English (the `language_id` for English is `en`)?

You only need to return the `movie_id` values.

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

There are at least two ways of doing this.

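One way is to get the IDs of films that use some language other than English and then search on these to find ones produced in the UK. (Note that `language_id!='en'` matches any film with at least one non-English language, so films that also use English are included; Activity 4 looks at how to exclude those.)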

In [None]:
%%sql
SELECT movie_id
FROM movie_production_country
WHERE production_country_id='GB'
AND movie_id IN (
    SELECT movie_id
    FROM movie_language
    WHERE language_id!='en'
    ); 

A second way is to get the unique IDs of films produced in the UK (`production_country_id='GB'`), and then use these as the basis of a search on languages:

In [None]:
%%sql
SELECT DISTINCT(movie_id)
FROM movie_language
WHERE movie_id IN (
    SELECT movie_id 
    FROM movie_production_country
    WHERE production_country_id='GB'
    )
AND language_id!='en';
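One subtlety is worth checking on a small example: filtering rows with `language_id!='en'` keeps any film that has *at least one* non-English language, which is not quite the same as a film in which English is not used at all. The sketch below illustrates the difference using Python's built-in `sqlite3` module and made-up rows (they are not from the course database; `cy` is used here simply as an example of a non-English code). Which reading the activity intends is a judgement call; the stricter version uses the `NOT IN` idea that appears later in this notebook.

```python
import sqlite3

# Hypothetical toy data: NOT the course's movies database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie_language (movie_id INTEGER, language_id TEXT);
    CREATE TABLE movie_production_country (movie_id INTEGER, production_country_id TEXT);

    INSERT INTO movie_production_country VALUES (1, 'GB'), (2, 'GB'), (3, 'US'), (4, 'GB');
    INSERT INTO movie_language VALUES
        (1, 'en'), (1, 'fr'),   -- UK film using English and French
        (2, 'cy'),              -- UK film with no English
        (3, 'fr'),              -- US film, should never be returned
        (4, 'en');              -- UK film using only English
""")

# UK films with at least one non-English language (as in the solution above).
any_non_english = [row[0] for row in conn.execute("""
    SELECT DISTINCT movie_id
    FROM movie_language
    WHERE movie_id IN (
        SELECT movie_id
        FROM movie_production_country
        WHERE production_country_id = 'GB'
        )
    AND language_id != 'en'
    ORDER BY movie_id
""")]

# Stricter reading: UK films with a recorded language, none of which is English.
no_english = [row[0] for row in conn.execute("""
    SELECT DISTINCT movie_id
    FROM movie_language
    WHERE movie_id IN (
        SELECT movie_id
        FROM movie_production_country
        WHERE production_country_id = 'GB'
        )
    AND movie_id NOT IN (
        SELECT movie_id
        FROM movie_language
        WHERE language_id = 'en'
        )
    ORDER BY movie_id
""")]

print(any_non_english)
print(no_english)
```

Film 1 appears in the first result but not the second, because it uses English alongside French.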

#### End of Activity 2

-----------------------------------------------

The following query returns a list of movies produced in the US where Spanish is spoken.


```SQL
SELECT DISTINCT(movie_id)
FROM movie_language
WHERE movie_id IN (
    SELECT movie_id 
    FROM movie_production_country
    WHERE production_country_id='US'
    )
AND language_id='es';
```

How many are there?

In [None]:
%%sql
SELECT COUNT(DISTINCT(movie_id)) AS number_of_movies
FROM movie_language
WHERE movie_id IN (
    SELECT movie_id 
    FROM movie_production_country
    WHERE production_country_id='US'
    )
AND language_id='es';

### Returning a Subset of Rows From a Table Based on a Group Property - HAVING

When we query the `movie_language` table, we get a separate row back for each *(movie, language)* combination. So if four languages appear in a given movie, we get four rows back for that movie.

The SQL `HAVING` clause works together with `GROUP BY`: once the rows in a query have been grouped, it tests a property of each group and keeps only those groups for which the test is satisfied.

For example, if we group rows returned from a query on the `movie_language` table and count the number of rows in each group, we can return a count of the number of languages spoken in each film.

In [None]:
%%sql
--count of languages by film
SELECT movie_id, COUNT(movie_id) AS number_of_languages
FROM movie_language
GROUP BY movie_id
LIMIT 5;

The PostgreSQL aggregate function `STRING_AGG()` (which is not available in every SQL dialect) also allows you to summarise into a single string the values contained within a given column across all the rows in each group:

In [None]:
%%sql
--count and list of languages by film
SELECT movie_id, COUNT(movie_id) AS number_of_languages, STRING_AGG(language_id, ', ') AS languages_spoken
FROM movie_language
GROUP BY movie_id
LIMIT 5;

We can limit the movies that are returned based on ones where a minimum number of languages are spoken using the `HAVING` clause, which is rather like a `WHERE` clause over group properties. Such a query returns a summarised grouped response rather than all the rows in each group.

For example, let's get movies back where there are at least three languages used:

In [None]:
%%sql
--films where at least three languages are spoken
SELECT movie_id, COUNT(movie_id) AS num_langs, STRING_AGG(language_id, ', ') AS languages_spoken
FROM movie_language

-- group rows by movie_id
GROUP BY movie_id

-- and then filter the results to groups containing at least 3 rows
HAVING COUNT(movie_id) >= 3

-- Limit the number of results for convenience
LIMIT 3;
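The difference between `WHERE` and `HAVING` is easiest to see when both appear in one query: `WHERE` filters individual rows *before* they are grouped, while `HAVING` filters the groups *afterwards*. The sketch below demonstrates this with Python's built-in `sqlite3` module and a made-up `movie_language` table (the rows are invented). SQLite has no `STRING_AGG()`, so its `GROUP_CONCAT()` function is used as a stand-in here; that substitution is specific to this sketch, not to the course database.

```python
import sqlite3

# Hypothetical toy data: NOT the course's movies database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie_language (movie_id INTEGER, language_id TEXT);
    INSERT INTO movie_language VALUES
        (1, 'en'), (1, 'fr'), (1, 'de'),
        (2, 'en'),
        (3, 'en'), (3, 'es'), (3, 'fr'), (3, 'it');
""")

# HAVING only: groups (films) with at least three language rows.
at_least_three = conn.execute("""
    SELECT movie_id,
           COUNT(movie_id) AS num_langs,
           GROUP_CONCAT(language_id, ', ') AS languages_spoken
    FROM movie_language
    GROUP BY movie_id
    HAVING COUNT(movie_id) >= 3
    ORDER BY movie_id
""").fetchall()

# WHERE then HAVING: drop the English rows first, then keep films
# that still have at least three (non-English) language rows.
non_english_three = conn.execute("""
    SELECT movie_id, COUNT(movie_id) AS num_langs
    FROM movie_language
    WHERE language_id != 'en'
    GROUP BY movie_id
    HAVING COUNT(movie_id) >= 3
    ORDER BY movie_id
""").fetchall()

print(at_least_three)
print(non_english_three)
```

Film 1 passes the first test with three languages, but drops out of the second query because only two of its languages are left once the English row has been removed by the `WHERE` clause.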

### Activity 3

How would you use a `HAVING` clause to find the rows from the `movie_language` table for films where 8 languages are used?

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

First of all, we want a query to identify the number of languages spoken for each film. We can use the `COUNT` aggregation function for this, remembering to use `GROUP BY` for the non-aggregated column (`movie_id`).

In [None]:
%%sql
SELECT movie_id, COUNT(language_id) AS languages_in_film
FROM movie_language
GROUP BY movie_id
LIMIT 5;


Then we can add a `HAVING` clause to give us those movies in which 8 languages are used:

In [None]:
%%sql

SELECT movie_id, COUNT(language_id) AS languages_in_film
FROM movie_language
GROUP BY movie_id
HAVING COUNT(language_id)=8
LIMIT 5;

Although we didn't ask for it in the activity, we can use this query as a subquery to find the names of these films, by placing it in the `WHERE` clause of a query on the `movie` table. To do this, the subquery returns just the `movie_id` column:

In [None]:
%%sql

SELECT title
FROM movie
WHERE id IN (
    SELECT movie_id
    FROM movie_language
    GROUP BY movie_id
    HAVING COUNT(language_id)=8
);

#### End of Activity 3

------------------------------------------------------------

Can we validate the solution we developed in the previous activity? What languages are spoken in _Munich_, with `movie_id=612`?

In [None]:
%%sql 

SELECT movie.id, title, language_id, language
FROM movie, movie_language, spoken_language
WHERE movie.id = 612 
    AND movie.id = movie_id
    AND spoken_language.id = language_id;

### Activity 4

How many movies do not have spoken English?

There's a subtlety here about "films that aren't in English." Should we include silent films and other films for which there's no language recorded?

For the moment, find the movies that have at least one spoken language, but none of the spoken languages is English.

HINT: Find the set of all movies that have spoken English, then `SELECT` the movies that are `NOT IN` that set.

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, click on the triangle symbol on the left-hand end of this cell.

We can answer the question by using a subquery to find the set of movies that have English as a spoken language, and only retrieving movies whose `id` is not in that set. The join between `movie` and `movie_language` ensures that all the movies have at least one spoken language.

In [None]:
%%sql

SELECT id, title, COUNT(language_id) AS number_of_languages
FROM movie, movie_language
WHERE id = movie_id
    AND id NOT IN (
        -- Subquery that finds movies _with_ spoken English
        SELECT movie_id 
        FROM movie_language
        WHERE language_id = 'en'
        )
GROUP BY id;

Can we validate this? What languages are spoken in *Brotherhood of the Wolf*?

In [None]:
%%sql 

SELECT movie.id, title, language_id, language
FROM movie, movie_language, spoken_language
WHERE movie.id = 6312
    AND movie.id = movie_id
    AND spoken_language.id = language_id;

Yes, there are three languages spoken in that film, and none of them is English.
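One general caveat about `NOT IN` is worth knowing, although it may not affect this dataset: if the subquery returns a NULL among its values, `NOT IN` evaluates to unknown for every non-matching row and the outer query returns nothing. The sketch below demonstrates this with Python's built-in `sqlite3` module and made-up tables into which a NULL `movie_id` has been deliberately inserted. This is an assumption for illustration only; the course tables may well forbid NULLs in that column. A correlated `NOT EXISTS` subquery (the alias `english` is our own name) is a common alternative that is not affected.

```python
import sqlite3

# Hypothetical toy data: NOT the course's movies database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE movie_language (movie_id INTEGER, language_id TEXT);

    INSERT INTO movie VALUES (1, 'Alpha'), (2, 'Beta'), (3, 'Gamma');
    INSERT INTO movie_language VALUES
        (1, 'en'), (1, 'fr'),
        (2, 'fr'), (2, 'de'),
        (NULL, 'en');          -- deliberately "dirty" row with no movie_id
""")

# NOT IN: the NULL in the subquery result means no row can pass the test.
not_in = conn.execute("""
    SELECT DISTINCT movie.id, movie.title
    FROM movie
    JOIN movie_language ON movie.id = movie_language.movie_id
    WHERE movie.id NOT IN (
        SELECT movie_id
        FROM movie_language
        WHERE language_id = 'en'
        )
    ORDER BY movie.id
""").fetchall()

# NOT EXISTS: asks, for each movie, whether any English row exists for it.
not_exists = conn.execute("""
    SELECT DISTINCT movie.id, movie.title
    FROM movie
    JOIN movie_language ON movie.id = movie_language.movie_id
    WHERE NOT EXISTS (
        SELECT 1
        FROM movie_language AS english
        WHERE english.movie_id = movie.id
          AND english.language_id = 'en'
        )
    ORDER BY movie.id
""").fetchall()

print(not_in)
print(not_exists)
```

With the NULL row present, the `NOT IN` version returns no films at all, while the `NOT EXISTS` version still finds the film that has recorded languages but no English.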

#### End of Activity 4

----------------------------------------

# Conclusion

This notebook has shown you how to use subqueries as values or sets of values in a query. As you have seen, this allows you to express more sophisticated queries in just SQL, without having to import large result sets into Pandas and then do some additional processing there. The more data manipulation you can get the database to do, the better. This has advantages both in processing time and also in memory usage: RDBMSs are very good at handling large sets of data, and can often do it more efficiently than Pandas.

These are techniques you are likely to need in your own data investigations.