# Subqueries and CTEs

### Key Takeaways:
* **Subqueries and CTEs** help to simplify complex queries by breaking them into smaller, reusable parts.
* **Independent Subqueries** do not depend on the outer query, so the database can run them once and reuse the result.
* **Correlated Subqueries** can be powerful but need careful consideration due to their performance impact.
* **CTEs** improve readability and maintainability, especially for complex queries involving multiple calculations or filters.



## 1. List movies with a rating or revenue higher than the average rating or revenue of all movies.

Without subqueries, achieving this would require two queries: one to calculate the average and another to filter the results:

```sql
SELECT avg(rating) FROM movies;
SELECT count(*)
FROM movies
WHERE rating > 5.73; -- value copied from the first query
```

### Using Subqueries
Instead, we can use a subquery: a query enclosed in parentheses inside the main query.

```sql
SELECT
  count(*)
FROM
  movies
WHERE
  rating > (
    SELECT
      avg(rating)
    FROM
      movies
  );
```
If a subquery ran once for every row, it would slow the query down. When the subquery's result does not depend on the outer query, however, the database can run it once and reuse the cached result for every row it filters. Such a subquery is called an **independent subquery**.
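
To see an independent subquery at work on a small scale, the sketch below builds a hypothetical `movies_demo` table and runs the same kind of filter against it. The table name and sample rows are made up for illustration and are not part of the original dataset.

```sql
-- Hypothetical sample table; names and values are illustrative only.
DROP TABLE IF EXISTS movies_demo;
CREATE TABLE movies_demo (
  title TEXT,
  rating REAL
);

INSERT INTO movies_demo (title, rating) VALUES
  ('Movie A', 4.0),
  ('Movie B', 6.0),
  ('Movie C', 8.0);

-- The inner SELECT never mentions the outer query,
-- so its result is the same for every row being filtered.
SELECT
  title,
  rating
FROM
  movies_demo
WHERE
  rating > (
    SELECT
      avg(rating)
    FROM
      movies_demo
  );
-- avg(rating) is 6.0 here, so only 'Movie C' is returned
```

Note that `Movie B` is excluded because the comparison is strict: a rating equal to the average does not pass `>`.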

### Using CTEs (Common Table Expressions)
A CTE gives a subquery a name, so the rest of the query can refer to it by that name, more than once if needed.

```sql
WITH -- start with 'WITH' 
  avg_revenue_cte AS ( -- first CTE
    SELECT
      avg(revenue) -- add 'AS <alias>' to give the column a name
    FROM
      movies
  ), -- comma to separate each CTE; no 'WITH' needed for the second
  avg_rating_cte AS ( -- second CTE
    SELECT
      avg(rating)
    FROM
      movies
  ) -- no semicolon
SELECT
  title,
  director,
  revenue,
  ROUND(
    ( -- use parentheses
      SELECT
        * -- works because the CTE has a single column; a column alias could be selected by name instead
      FROM
        avg_revenue_cte
    ),
    0
  ) AS avg_revenue,
  rating,
  ROUND(
    (
      SELECT
        *
      FROM
        avg_rating_cte
    ),
    0
  ) AS avg_rating
FROM
  movies
WHERE
  revenue > (
    SELECT
      *
    FROM
      avg_revenue_cte
  )
  AND rating > (
    SELECT
      *
    FROM
      avg_rating_cte
  );
```
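
Because each CTE above returns exactly one row, the repeated `SELECT * FROM ...` lookups can be avoided. One possible variation, sketched below, computes both averages in a single CTE and attaches that one-row result to every movie with a `CROSS JOIN`. The CTE name `averages`, its column aliases, and the `movies_demo` sample table are hypothetical.

```sql
-- Hypothetical sample table; names and values are illustrative only.
DROP TABLE IF EXISTS movies_demo;
CREATE TABLE movies_demo (
  title TEXT,
  director TEXT,
  revenue REAL,
  rating REAL
);

INSERT INTO movies_demo (title, director, revenue, rating) VALUES
  ('Movie A', 'Director X', 100, 4.0),
  ('Movie B', 'Director Y', 200, 8.0),
  ('Movie C', 'Director Z', 600, 6.0),
  ('Movie D', 'Director X', 700, 9.0);

WITH
  averages AS ( -- one row holding both averages
    SELECT
      avg(revenue) AS avg_revenue,
      avg(rating) AS avg_rating
    FROM
      movies_demo
  )
SELECT
  m.title,
  m.director,
  m.revenue,
  ROUND(a.avg_revenue, 0) AS avg_revenue,
  m.rating,
  ROUND(a.avg_rating, 0) AS avg_rating
FROM
  movies_demo AS m
  CROSS JOIN averages AS a -- attaches the single averages row to each movie
WHERE
  m.revenue > a.avg_revenue
  AND m.rating > a.avg_rating;
-- averages are 400 (revenue) and 6.75 (rating), so only 'Movie D' passes both filters
```

Naming the columns in the CTE also removes the need for `SELECT *`, which makes it clearer which value each comparison uses.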

## 2. List movies with a rating higher than the average rating of movies in their genre.

```sql
WITH
  rating_per_genre AS (
    SELECT
      avg(rating)
    FROM
      movies AS m2
    WHERE
      rating IS NOT NULL
      AND genres IS NOT NULL
      AND m2.genres = m1.genres -- refers to m1 from the outer query; not portable across databases
  )
SELECT
  title,
  rating,
  genres,
  (
    SELECT
      *
    FROM
      rating_per_genre
  ) AS rating_per_genre
FROM
  movies AS m1
WHERE
  release_date > 2020
  AND m1.rating > (
    SELECT
      *
    FROM
      rating_per_genre
  );
```
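
The CTE above refers to `m1`, an alias that only exists in the outer query, which many databases reject. A more portable sketch computes every genre's average once with `GROUP BY` and joins it back to the movies. It assumes `genres` holds one comparable value per movie; the `movies_demo` table, its rows, and the `genre_avg` alias are hypothetical, and the `release_date` filter is left out to keep the example short.

```sql
-- Hypothetical sample table; names and values are illustrative only.
DROP TABLE IF EXISTS movies_demo;
CREATE TABLE movies_demo (
  title TEXT,
  genres TEXT,
  rating REAL
);

INSERT INTO movies_demo (title, genres, rating) VALUES
  ('Movie A', 'Drama', 5.0),
  ('Movie B', 'Drama', 7.0),
  ('Movie C', 'Comedy', 6.0),
  ('Movie D', 'Comedy', 8.0),
  ('Movie E', 'Comedy', 4.0);

WITH
  rating_per_genre AS ( -- one row per genre, computed once
    SELECT
      genres,
      avg(rating) AS genre_avg
    FROM
      movies_demo
    WHERE
      rating IS NOT NULL
      AND genres IS NOT NULL
    GROUP BY
      genres
  )
SELECT
  m.title,
  m.rating,
  m.genres,
  g.genre_avg
FROM
  movies_demo AS m
  JOIN rating_per_genre AS g ON g.genres = m.genres
WHERE
  m.rating > g.genre_avg;
-- both genres average 6.0 here, so 'Movie B' and 'Movie D' are returned
```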

## 3. Find the movies with a rating higher than the average rating of movies released in the same year.

### Correlated Subqueries
```sql
SELECT
  m1.title,
  m1.director,
  m1.rating
FROM
  movies AS m1
WHERE
  m1.rating > (
    SELECT
      avg(m2.rating)
    FROM
      movies AS m2 
    WHERE
      m2.release_date = m1.release_date -- reference the outer query
  );
```
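The subquery refers to a value from the outer query, so it runs once for every row the outer query examines. The number of executions grows linearly with the number of rows, and because each execution may scan the table again, the total work can grow quadratically. To reduce the cost, you can limit the rows the outer query considers: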
```sql
SELECT
  m1.title,
  m1.director,
  m1.rating
FROM
  movies AS m1
WHERE
  m1.release_date > 2022
  AND m1.rating > (
    SELECT
      avg(m2.rating)
    FROM
      movies AS m2 
    WHERE
      m2.release_date = m1.release_date -- reference the outer query
  );
```
* The order of conditions in the `WHERE` clause generally does not matter, because the query optimizer decides which conditions are cheapest to evaluate first.
* This query is not yet fully optimized: the subquery still recomputes the yearly average for each remaining row.
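
One way to make each subquery execution cheaper is to index the correlated column, so the inner query can look up the matching rows instead of scanning the whole table. The sketch below assumes SQLite and assumes `release_date` stores the release year as an integer; the `movies_demo` table and the index name `idx_movies_demo_year` are hypothetical.

```sql
-- Hypothetical sample table; release_date is assumed to hold the year.
DROP TABLE IF EXISTS movies_demo;
CREATE TABLE movies_demo (
  title TEXT,
  release_date INTEGER,
  rating REAL
);

INSERT INTO movies_demo (title, release_date, rating) VALUES
  ('Movie A', 2023, 5.0),
  ('Movie B', 2023, 7.0),
  ('Movie C', 2024, 6.0),
  ('Movie D', 2024, 9.0),
  ('Movie E', 2021, 8.0);

-- Covers the correlated column (release_date) and the averaged column (rating).
CREATE INDEX idx_movies_demo_year ON movies_demo (release_date, rating);

SELECT
  m1.title,
  m1.rating
FROM
  movies_demo AS m1
WHERE
  m1.release_date > 2022
  AND m1.rating > (
    SELECT
      avg(m2.rating)
    FROM
      movies_demo AS m2
    WHERE
      m2.release_date = m1.release_date -- reference the outer query
  );
-- returns 'Movie B' (7.0 > 6.0 for 2023) and 'Movie D' (9.0 > 7.5 for 2024)
```

Prefixing the `SELECT` with `EXPLAIN QUERY PLAN` shows whether SQLite actually uses the index; the exact wording of that output differs between versions.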

### Correlated CTEs
```sql
WITH
  movie_avg_per_year AS (
    SELECT
      avg(m2.rating)
    FROM
      movies AS m2
    WHERE
      m2.release_date = m1.release_date
  )
SELECT
  m1.title,
  m1.director,
  m1.rating,
  (
    SELECT
      *
    FROM
      movie_avg_per_year
  ) AS year_avg
FROM
  movies AS m1
WHERE
  m1.release_date > 2022 -- to limit the results
  AND m1.rating > (
    SELECT
      *
    FROM
      movie_avg_per_year
  );
```

**Note:** SQLite allows a CTE to refer to an alias from the outer query, but not all databases do. CTEs are normally evaluated from top to bottom, before the main query's aliases exist, so a CTE is not supposed to reference an alias that is declared later.

## 4. Find the directors with a career revenue higher than the average revenue of all directors.

```sql
WITH
  directors_rev AS (
    SELECT
      director,
      SUM(revenue) AS career_rev
    FROM
      movies
    WHERE
      revenue IS NOT NULL
      AND director IS NOT NULL
    GROUP BY
      director
  )
SELECT
  director,
  career_rev AS total_rev, -- already aggregated per director in the CTE
  (
    SELECT
      avg(career_rev)
    FROM
      directors_rev
  ) AS peers_avg
FROM
  directors_rev
WHERE
  career_rev > (
    SELECT
      avg(career_rev)
    FROM
      directors_rev
  );
```
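
A later CTE can read from an earlier one, which gives another way to express this query: compute each director's career revenue first, then derive the peer average from that result. The sketch below is one possible formulation, not the only one; the `peers` CTE, its `peers_avg` column, and the `movies_demo` sample rows are hypothetical.

```sql
-- Hypothetical sample table; names and values are illustrative only.
DROP TABLE IF EXISTS movies_demo;
CREATE TABLE movies_demo (
  title TEXT,
  director TEXT,
  revenue REAL
);

INSERT INTO movies_demo (title, director, revenue) VALUES
  ('Movie A', 'Director X', 100),
  ('Movie B', 'Director X', 300),
  ('Movie C', 'Director Y', 200),
  ('Movie D', 'Director Z', 900),
  ('Movie E', NULL, 500); -- skipped: no director

WITH
  directors_rev AS ( -- first CTE: career revenue per director
    SELECT
      director,
      SUM(revenue) AS career_rev
    FROM
      movies_demo
    WHERE
      revenue IS NOT NULL
      AND director IS NOT NULL
    GROUP BY
      director
  ),
  peers AS ( -- second CTE: reads from the one defined just above
    SELECT
      avg(career_rev) AS peers_avg
    FROM
      directors_rev
  )
SELECT
  d.director,
  d.career_rev,
  p.peers_avg
FROM
  directors_rev AS d
  CROSS JOIN peers AS p
WHERE
  d.career_rev > p.peers_avg;
-- career revenues are 400, 200 and 900 (average 500), so only 'Director Z' is returned
```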
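
## Practice

This query combines both techniques: a CTE aggregates statistics per director, and correlated subqueries look up matching movie titles for each director.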

```sql
WITH
  director_stats AS (
    SELECT
      director,
      AVG(rating) AS avg_rating,
      COUNT(*) AS total_movies,
      MAX(rating) AS best_rating,
      MIN(rating) AS worst_rating,
      MAX(budget) AS highest_budget,
      MIN(budget) AS lowest_budget
    FROM
      movies
    WHERE
      director IS NOT NULL
      AND budget > 0
      AND rating IS NOT NULL
    GROUP BY
      director
    HAVING
      total_movies > 2
    LIMIT
      20 -- without an ORDER BY, which 20 directors are kept is arbitrary
  )
SELECT
  director,
  avg_rating,
  total_movies,
  best_rating,
  worst_rating,
  highest_budget,
  lowest_budget,
  ( -- title of the director's best-rated movie
    SELECT
      title
    FROM
      movies
    WHERE
      budget > 0 -- same filters as the CTE
      AND rating IS NOT NULL
      AND director = ds.director
    ORDER BY
      rating DESC
    LIMIT
      1
  ) AS best_rated_movie,
  ( -- title of the director's worst-rated movie
    SELECT
      title
    FROM
      movies
    WHERE
      budget > 0
      AND rating IS NOT NULL
      AND director = ds.director
    ORDER BY
      rating ASC
    LIMIT
      1
  ) AS worst_rated_movie,
  ( -- title of the director's highest-budget movie
    SELECT
      title
    FROM
      movies
    WHERE
      budget > 0
      AND rating IS NOT NULL
      AND director = ds.director
    ORDER BY
      budget DESC
    LIMIT
      1
  ) AS highest_budget_movie,
  ( -- title of the director's lowest-budget movie
    SELECT
      title
    FROM
      movies
    WHERE
      budget > 0
      AND rating IS NOT NULL
      AND director = ds.director
    ORDER BY
      budget ASC
    LIMIT
      1
  ) AS lowest_budget_movie
FROM
  director_stats AS ds;
```

