<a href="https://colab.research.google.com/github/ipeirotis/introduction-to-databases/blob/master/module3/C-Join_Queries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SQL: JOIN Queries

**Learning Outcomes:**

By the end of this lesson, you will be able to:

- Understand the notion of a join in databases
- Join two or more tables together using an inner join
- Combine joins with filtering conditions
- Understand the notion of an outer join and why it differs from an inner join
- Join a table to itself (self-joins)
- Understand the meaning of "semi-join" and "anti-join"
- Understand how joins behave for 1-to-1, 1-to-many, and many-to-many relationships

## Setup

First, we authenticate with Google Cloud and set up the BigQuery client.

**Important:** In the cell below, replace the value of `PROJECT_ID` with your own Google Cloud project ID.

In [None]:
# Authenticate with Google Cloud
from google.colab import auth
auth.authenticate_user()

In [None]:
from google.cloud import bigquery

# Specify your Google Cloud project ID
PROJECT_ID = 'nyu-datasets'  # <-- Replace with your project ID

client = bigquery.Client(project=PROJECT_ID)

def run_query(sql):
    """Run a BigQuery SQL query and return results as a pandas DataFrame."""
    return client.query(sql).to_dataframe()

---
## Poor Man's Joins: Find the genres that Steven Spielberg typically directs

Before learning joins, let's see how we might solve a problem using multiple queries.

**Goal:** Find all genres and probabilities for films directed by Steven Spielberg.

#### Step 1: Find the entry for Steven Spielberg to get his ID

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.directors`
WHERE first_name = 'Steven' AND last_name = 'Spielberg'
""")

#### Step 2: Query the `directors_genres` table using the ID from above

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.directors_genres`
WHERE director_id = 75380
ORDER BY prob DESC
""")

**Problem:** This approach requires two queries and manually copying the ID from the first result into the second query. A join does both steps in a single query.

---
## Simple JOIN Queries

#### List all the movies and their genres

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.movies` M
INNER JOIN `nyu-datasets.imdb.movies_genres` G ON M.id = G.movie_id
LIMIT 100
""")
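The query above pairs each movie with every one of its genres, so a movie with several genres appears on several rows, and a movie with no genre row does not appear at all. The sketch below illustrates this one-to-many behavior on made-up data. It assumes a local in-memory SQLite database rather than BigQuery, and the table contents are hypothetical.

```python
import sqlite3

# Hypothetical toy tables that mimic the shape of imdb.movies and imdb.movies_genres.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies (id INTEGER, name TEXT);
CREATE TABLE movies_genres (movie_id INTEGER, genre TEXT);
INSERT INTO movies VALUES (1, 'Movie A'), (2, 'Movie B'), (3, 'Movie C');
INSERT INTO movies_genres VALUES (1, 'Drama'), (1, 'Comedy'), (2, 'Action');
""")

# Inner join: one output row per (movie, genre) match.
rows = conn.execute("""
    SELECT M.name, G.genre
    FROM movies M
    INNER JOIN movies_genres G ON M.id = G.movie_id
    ORDER BY M.name, G.genre
""").fetchall()

for row in rows:
    print(row)
```

In this toy data, 'Movie A' shows up twice (once per genre) and 'Movie C' is missing because it has no matching genre row.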

#### List the movie genres for Steven Spielberg (using a JOIN instead of two queries)

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.directors` D
INNER JOIN `nyu-datasets.imdb.directors_genres` G ON G.director_id = D.id
WHERE D.first_name = 'Steven' AND D.last_name = 'Spielberg'
ORDER BY G.prob DESC
""")

In [None]:
# Same query but selecting only the columns we need
run_query("""
SELECT G.genre, G.prob
FROM `nyu-datasets.imdb.directors` D
INNER JOIN `nyu-datasets.imdb.directors_genres` G ON G.director_id = D.id
WHERE D.first_name = 'Steven' AND D.last_name = 'Spielberg'
ORDER BY G.prob DESC
""")
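As a sanity check on the idea, the sketch below runs the two-step lookup and the single join against the same made-up data and compares the results. It assumes an in-memory SQLite database instead of BigQuery, and the director names, IDs, and probabilities are hypothetical.

```python
import sqlite3

# Hypothetical toy data; names, IDs, and probabilities are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE directors (id INTEGER, first_name TEXT, last_name TEXT);
CREATE TABLE directors_genres (director_id INTEGER, genre TEXT, prob REAL);
INSERT INTO directors VALUES (1, 'Ada', 'Example'), (2, 'Ben', 'Sample');
INSERT INTO directors_genres VALUES
    (1, 'Drama', 0.6), (1, 'Sci-Fi', 0.3), (2, 'Comedy', 0.9);
""")

# Two-step approach: look up the ID, then pass it to a second query.
director_id = conn.execute(
    "SELECT id FROM directors WHERE first_name = ? AND last_name = ?",
    ("Ada", "Example"),
).fetchone()[0]
two_step = conn.execute(
    "SELECT genre, prob FROM directors_genres "
    "WHERE director_id = ? ORDER BY prob DESC",
    (director_id,),
).fetchall()

# Single query: the join condition matches the ID for us.
joined = conn.execute("""
    SELECT G.genre, G.prob
    FROM directors D
    INNER JOIN directors_genres G ON G.director_id = D.id
    WHERE D.first_name = 'Ada' AND D.last_name = 'Example'
    ORDER BY G.prob DESC
""").fetchall()

print(two_step == joined)  # True
```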

#### List all the movies and their directors (joining three tables)

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.directors` D
INNER JOIN `nyu-datasets.imdb.movies_directors` MD ON MD.director_id = D.id
INNER JOIN `nyu-datasets.imdb.movies` M ON M.id = MD.movie_id
LIMIT 100
""")

In [None]:
# With cleaner column selection and aliases
run_query("""
SELECT
    MD.director_id,
    D.first_name AS director_first_name,
    D.last_name AS director_last_name,
    MD.movie_id,
    M.name AS movie_title,
    M.year AS release_year,
    M.rating AS movie_rating
FROM `nyu-datasets.imdb.directors` D
INNER JOIN `nyu-datasets.imdb.movies_directors` MD ON MD.director_id = D.id
INNER JOIN `nyu-datasets.imdb.movies` M ON M.id = MD.movie_id
LIMIT 100
""")
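Directors and movies are in a many-to-many relationship: a director can have many movies, and a movie can have several directors. The `movies_directors` table records one row per pairing, which is why the query needs two joins. Here is a minimal sketch of that pattern on made-up data, assuming an in-memory SQLite database instead of BigQuery; all names and IDs are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE directors (id INTEGER, last_name TEXT);
CREATE TABLE movies (id INTEGER, name TEXT);
CREATE TABLE movies_directors (director_id INTEGER, movie_id INTEGER);
INSERT INTO directors VALUES (1, 'Example'), (2, 'Sample');
INSERT INTO movies VALUES (10, 'Movie A'), (20, 'Movie B');
-- Movie A is co-directed; director 1 also directed Movie B.
INSERT INTO movies_directors VALUES (1, 10), (2, 10), (1, 20);
""")

# Join through the linking table: one output row per (director, movie) pairing.
pairs = conn.execute("""
    SELECT D.last_name, M.name
    FROM directors D
    INNER JOIN movies_directors MD ON MD.director_id = D.id
    INNER JOIN movies M ON M.id = MD.movie_id
    ORDER BY M.name, D.last_name
""").fetchall()

for pair in pairs:
    print(pair)
```

The result has three rows, one per row of the linking table: the co-directed movie appears twice, and the director with two movies appears twice.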

#### List all the movies directed by Steven Spielberg

In [None]:
run_query("""
SELECT
    MD.director_id,
    D.first_name AS director_first_name,
    D.last_name AS director_last_name,
    MD.movie_id,
    M.name AS movie_title,
    M.year AS release_year,
    M.rating AS movie_rating
FROM `nyu-datasets.imdb.directors` D
INNER JOIN `nyu-datasets.imdb.movies_directors` MD ON MD.director_id = D.id
INNER JOIN `nyu-datasets.imdb.movies` M ON M.id = MD.movie_id
WHERE D.first_name = 'Steven' AND D.last_name = 'Spielberg'
ORDER BY M.rating DESC
""")

---
## JOIN Practice: Drama Movies from 2000

#### List all the movies from year 2000

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.movies`
WHERE year = 2000
LIMIT 50
""")

#### List all the movies from year 2000 and their genres

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.movies` M
INNER JOIN `nyu-datasets.imdb.movies_genres` G ON G.movie_id = M.id
WHERE M.year = 2000
LIMIT 100
""")

#### List all the Drama movies from year 2000

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.movies` M
INNER JOIN `nyu-datasets.imdb.movies_genres` G ON G.movie_id = M.id
WHERE M.year = 2000 AND G.genre = 'Drama'
LIMIT 100
""")

#### List all the Drama movies from year 2000 with ratings

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.movies` M
INNER JOIN `nyu-datasets.imdb.movies_genres` G ON G.movie_id = M.id
WHERE M.year = 2000 AND G.genre = 'Drama' AND M.rating IS NOT NULL
LIMIT 100
""")

#### List the top-50 Drama movies from year 2000, based on ratings

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.movies` M
INNER JOIN `nyu-datasets.imdb.movies_genres` G ON G.movie_id = M.id
WHERE M.year = 2000 AND G.genre = 'Drama' AND M.rating IS NOT NULL
ORDER BY M.rating DESC
LIMIT 50
""")

---
## JOIN Practice: James Bond Movies

#### List all the movies where there is an actor with the role 'James Bond'

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.movies` M
INNER JOIN `nyu-datasets.imdb.roles` R ON R.movie_id = M.id
WHERE R.role = 'James Bond'
""")

#### List the actors who played 'James Bond'

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.actors` A
INNER JOIN `nyu-datasets.imdb.roles` R ON R.actor_id = A.id
WHERE R.role = 'James Bond'
""")

#### List the actors who played 'James Bond' and the name of the movie

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.actors` A
INNER JOIN `nyu-datasets.imdb.roles` R ON R.actor_id = A.id
INNER JOIN `nyu-datasets.imdb.movies` M ON M.id = R.movie_id
WHERE R.role = 'James Bond'
""")

#### List the actors who played 'James Bond' and the movie name. Rank by rating

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.actors` A
INNER JOIN `nyu-datasets.imdb.roles` R ON R.actor_id = A.id
INNER JOIN `nyu-datasets.imdb.movies` M ON M.id = R.movie_id
WHERE R.role = 'James Bond'
ORDER BY M.rating DESC
""")

#### List the actors who played 'James Bond' and the movie name. Rank by year

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.actors` A
INNER JOIN `nyu-datasets.imdb.roles` R ON R.actor_id = A.id
INNER JOIN `nyu-datasets.imdb.movies` M ON M.id = R.movie_id
WHERE R.role = 'James Bond'
ORDER BY M.year
""")

---
## JOIN Practice: Brad Pitt Movies

#### List all the movies in which Brad Pitt appears

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.actors` A
INNER JOIN `nyu-datasets.imdb.roles` R ON R.actor_id = A.id
INNER JOIN `nyu-datasets.imdb.movies` M ON M.id = R.movie_id
WHERE A.first_name = 'Brad' AND A.last_name = 'Pitt'
""")

#### List all the movies in which Brad Pitt appears, excluding those where he plays "himself"

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.actors` A
INNER JOIN `nyu-datasets.imdb.roles` R ON R.actor_id = A.id
INNER JOIN `nyu-datasets.imdb.movies` M ON M.id = R.movie_id
WHERE A.first_name = 'Brad' AND A.last_name = 'Pitt'
  -- LIKE is case-sensitive in BigQuery, so lowercase the role before matching
  AND LOWER(R.role) NOT LIKE '%himself%'
""")

#### Brad Pitt movies (excluding "himself"), ranked by rating

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.actors` A
INNER JOIN `nyu-datasets.imdb.roles` R ON R.actor_id = A.id
INNER JOIN `nyu-datasets.imdb.movies` M ON M.id = R.movie_id
WHERE A.first_name = 'Brad' AND A.last_name = 'Pitt'
  AND LOWER(R.role) NOT LIKE '%himself%'
ORDER BY M.rating DESC
""")

#### Brad Pitt movies (excluding "himself"), ranked by year

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.actors` A
INNER JOIN `nyu-datasets.imdb.roles` R ON R.actor_id = A.id
INNER JOIN `nyu-datasets.imdb.movies` M ON M.id = R.movie_id
WHERE A.first_name = 'Brad' AND A.last_name = 'Pitt'
  AND LOWER(R.role) NOT LIKE '%himself%'
ORDER BY M.year
""")

---
## JOIN Practice: Facebook Database

#### List all the Single students

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.facebook.Profiles` P
INNER JOIN `nyu-datasets.facebook.Relationship` R ON R.ProfileID = P.ProfileID
WHERE R.Status = 'Single'
LIMIT 50
""")

#### List all Single students who live in Palladium

Use a prefix match (`LIKE 'Palladium%'`), since people list their residence differently (e.g., "Palladium 101" vs. "Palladium").

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.facebook.Profiles` P
INNER JOIN `nyu-datasets.facebook.Relationship` R ON R.ProfileID = P.ProfileID
WHERE R.Status = 'Single' AND P.Residence LIKE 'Palladium%'
""")

#### List all Single students LookingFor "Random Play"

In [None]:
run_query("""
SELECT P.AIM, P.Sex
FROM `nyu-datasets.facebook.Profiles` P
INNER JOIN `nyu-datasets.facebook.Relationship` R ON R.ProfileID = P.ProfileID
INNER JOIN `nyu-datasets.facebook.LookingFor` L ON L.ProfileID = P.ProfileID
WHERE R.Status = 'Single' AND L.LookingFor = 'Random Play'
""")

#### List all students who have "The Killers" as favorite Music

In [None]:
run_query("""
SELECT P.*
FROM `nyu-datasets.facebook.Profiles` P
INNER JOIN `nyu-datasets.facebook.FavoriteMusic` M ON M.ProfileID = P.ProfileID
WHERE M.Music = 'The Killers'
""")

#### List all students who like the book "1984"

In [None]:
run_query("""
SELECT P.*
FROM `nyu-datasets.facebook.Profiles` P
INNER JOIN `nyu-datasets.facebook.FavoriteBooks` B ON B.ProfileID = P.ProfileID
WHERE B.Book = '1984'
""")

---
## Self Joins

A self-join joins a table to itself. Each copy of the table gets its own alias (for example, `G1` and `G2`), which lets you compare rows within the same table.
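A minimal sketch of the idea, assuming an in-memory SQLite database with a made-up `movies_genres` table (the IDs and genres are hypothetical). Each alias sees the full table, and the join condition lines up two rows that belong to the same movie.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies_genres (movie_id INTEGER, genre TEXT);
INSERT INTO movies_genres VALUES
    (1, 'Drama'), (1, 'Comedy'),   -- movie 1 has both genres
    (2, 'Drama'),                  -- movie 2 is Drama only
    (3, 'Comedy'), (3, 'Action');  -- movie 3 has no Drama row
""")

# G1 and G2 are two aliases of the same table.
both = conn.execute("""
    SELECT G1.movie_id
    FROM movies_genres G1
    INNER JOIN movies_genres G2 ON G2.movie_id = G1.movie_id
    WHERE G1.genre = 'Drama' AND G2.genre = 'Comedy'
""").fetchall()

print(both)  # [(1,)]
```

A single condition such as `genre = 'Drama' AND genre = 'Comedy'` on one copy of the table could never be true, because each row holds only one genre; the second alias is what makes the comparison across rows possible.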

#### List movies that have both Drama AND Comedy listed among their genres

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.movies` M
INNER JOIN `nyu-datasets.imdb.movies_genres` G1 ON G1.movie_id = M.id
INNER JOIN `nyu-datasets.imdb.movies_genres` G2 ON G2.movie_id = M.id
WHERE G1.genre = 'Drama' AND G2.genre = 'Comedy'
LIMIT 50
""")

#### List students majoring in Computer Science AND another concentration

Show the second concentration as well.

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.facebook.Concentration` C1
INNER JOIN `nyu-datasets.facebook.Concentration` C2 ON C1.ProfileID = C2.ProfileID
WHERE C1.Concentration = 'Computer Science' AND C2.Concentration != 'Computer Science'
""")

---
## Outer Joins

An **outer join** returns all rows from one table, even if there are no matching rows in the other table.

- `LEFT JOIN`: Returns all rows from the left table
- `RIGHT JOIN`: Returns all rows from the right table
- `FULL OUTER JOIN`: Returns all rows from both tables

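This is useful for finding records that **don't have** matches (anti-joins): an unmatched row gets `NULL` in every column that comes from the other table, so filtering with `IS NULL` on that table's join key keeps only the rows without a match.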
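The sketch below shows the difference on made-up data, assuming an in-memory SQLite database instead of BigQuery; the table contents are hypothetical. The first query is a plain `LEFT JOIN`, and the second adds the `IS NULL` filter that turns it into an anti-join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies (id INTEGER, name TEXT);
CREATE TABLE roles (movie_id INTEGER, role TEXT);
INSERT INTO movies VALUES (1, 'Movie A'), (2, 'Movie B'), (3, 'Movie C');
INSERT INTO roles VALUES (1, 'Lead'), (1, 'Extra'), (3, 'Lead');
""")

# LEFT JOIN keeps every movie; Movie B has no role, so its role column is NULL (None).
left_rows = conn.execute("""
    SELECT M.name, R.role
    FROM movies M
    LEFT JOIN roles R ON M.id = R.movie_id
    ORDER BY M.name, R.role
""").fetchall()

# Anti-join: keep only the movies for which no matching role row was found.
anti_rows = conn.execute("""
    SELECT M.name
    FROM movies M
    LEFT JOIN roles R ON M.id = R.movie_id
    WHERE R.movie_id IS NULL
""").fetchall()

print(left_rows)
print(anti_rows)  # [('Movie B',)]
```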

#### List all the movies without actors (anti-join)

In [None]:
run_query("""
SELECT M.*, R.*
FROM `nyu-datasets.imdb.movies` M
LEFT JOIN `nyu-datasets.imdb.roles` R ON M.id = R.movie_id
WHERE R.movie_id IS NULL
LIMIT 50
""")

#### List all the movies without an associated genre

In [None]:
run_query("""
SELECT M.*
FROM `nyu-datasets.imdb.movies` M
LEFT JOIN `nyu-datasets.imdb.movies_genres` G ON M.id = G.movie_id
WHERE G.movie_id IS NULL
LIMIT 50
""")

#### List all students who have not listed a concentration

In [None]:
run_query("""
SELECT P.*, C.*
FROM `nyu-datasets.facebook.Profiles` P
LEFT JOIN `nyu-datasets.facebook.Concentration` C ON P.ProfileID = C.ProfileID
WHERE C.ProfileID IS NULL
LIMIT 50
""")
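The learning outcomes also mention semi-joins. A semi-join returns each row of one table at most once if it has at least one match in the other table, without duplicating it per match. One common way to express this is with `EXISTS`, and its negation `NOT EXISTS` gives an anti-join. The sketch below is an illustration on made-up data, assuming an in-memory SQLite database instead of BigQuery; the table contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies (id INTEGER, name TEXT);
CREATE TABLE roles (movie_id INTEGER, role TEXT);
INSERT INTO movies VALUES (1, 'Movie A'), (2, 'Movie B'), (3, 'Movie C');
INSERT INTO roles VALUES (1, 'Lead'), (1, 'Extra'), (3, 'Lead');
""")

# Inner join: Movie A is repeated once per matching role.
inner_rows = conn.execute("""
    SELECT M.name
    FROM movies M
    INNER JOIN roles R ON R.movie_id = M.id
    ORDER BY M.name
""").fetchall()

# Semi-join: each movie with at least one role appears exactly once.
semi_rows = conn.execute("""
    SELECT M.name
    FROM movies M
    WHERE EXISTS (SELECT 1 FROM roles R WHERE R.movie_id = M.id)
    ORDER BY M.name
""").fetchall()

# Anti-join written with NOT EXISTS: movies with no role at all.
anti_rows = conn.execute("""
    SELECT M.name
    FROM movies M
    WHERE NOT EXISTS (SELECT 1 FROM roles R WHERE R.movie_id = M.id)
""").fetchall()

print(inner_rows)
print(semi_rows)
print(anti_rows)
```

With this toy data the inner join returns 'Movie A' twice, the semi-join returns it once, and the `NOT EXISTS` query returns the same movie as the `LEFT JOIN ... IS NULL` pattern would.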