<a href="https://colab.research.google.com/github/ipeirotis/introduction-to-databases/blob/master/session3/B3-Filtering_Queries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SQL: Filtering Queries

**Learning Objectives:**

By the end of this module, you will be able to:

- Filter rows from a table using the `WHERE` clause with a variety of conditions (`=`, `<>`, `>`, `<`, `BETWEEN`)
- Combine multiple conditions using boolean operators like `AND`, `OR`, and `NOT`
- Simplify queries by using the `IN` operator for lists of values
- Perform pattern matching on text using the `LIKE` operator to find approximate results
- Handle missing data by correctly querying for `NULL` values using `IS NULL` and `IS NOT NULL`
- Create conditional logic within your queries using the `CASE` statement

## Setup

First, we authenticate with Google Cloud and set up the BigQuery client to run our SQL queries.

**Important:** You'll need to replace `'your-project-id'` with your own Google Cloud project ID. This is required for billing/quota purposes, even though we're querying data hosted in the `nyu-datasets` project.

In [None]:
# Authenticate with Google Cloud
from google.colab import auth
auth.authenticate_user()

In [None]:
from google.cloud import bigquery

# Specify your Google Cloud project ID
# This is needed for billing/quota, even when querying datasets in other projects
PROJECT_ID = 'your-project-id'  # <-- Replace with your project ID

client = bigquery.Client(project=PROJECT_ID)

# Helper function to run queries and display results as a DataFrame
def run_query(sql):
    """Run a BigQuery SQL query and return results as a pandas DataFrame."""
    return client.query(sql).to_dataframe()

---
## The `WHERE` Clause

The `WHERE` clause defines which rows will appear in the results.

```sql
SELECT col1, col2
FROM   table_name
WHERE  condition
ORDER BY col1, col2;
```
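
As a minimal local sketch of this template, the snippet below runs the same clause order against a tiny in-memory SQLite table. SQLite is only a stand-in for BigQuery here, and the `movies` rows are made-up sample data, not the contents of the IMDb dataset.

```python
import sqlite3

# Hypothetical sample data; SQLite is used as a local stand-in for BigQuery.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (id INTEGER, name TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [(1, "Movie C", 1994), (2, "Movie A", 1994), (3, "Movie B", 1941)],
)

# WHERE keeps only the rows that satisfy the condition;
# ORDER BY then sorts the rows that remain.
rows = conn.execute(
    """
    SELECT name, year
    FROM   movies
    WHERE  year = 1994
    ORDER BY name
    """
).fetchall()
print(rows)
```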

---
## `WHERE`: Equality Conditions

| Condition | Description | Example |
|-----------|-------------|--------|
| `attr = 'text'` | Equality comparison for a textual attribute | `gender = 'Male'` |
| `attr = number` | Equality comparison for a numeric attribute | `year = 2006` |

### IMDb Database

#### Find the movie entry with id 64729

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.movies`
WHERE id = 64729
""")

#### Find the movie entry with movie title 'Pulp Fiction'

In [None]:
run_query("""
SELECT id, name, year
FROM `nyu-datasets.imdb.movies`
WHERE name = 'Pulp Fiction'
""")

#### Find the id of the movie "Citizen Kane"

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.movies`
WHERE name = 'Citizen Kane'
""")

#### Find the id of the movie "Schindler's List"

Note: Pay attention to the quote in the movie title!

In [None]:
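# Option 1: Escape the single quote with a backslash.
# The raw string (r prefix) keeps Python from consuming the backslash
# before the query text reaches BigQuery.
run_query(r"""
SELECT *
FROM `nyu-datasets.imdb.movies`
WHERE name = 'Schindler\'s List'
""")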

In [None]:
# Option 2: Use double quotes for the string
run_query("""
SELECT *
FROM `nyu-datasets.imdb.movies`
WHERE name = "Schindler's List"
""")

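One pitfall when a backslash-escaped quote is embedded in a Python string: in a regular (non-raw) string, Python itself interprets `\'` and removes the backslash, so the SQL engine receives an unescaped quote. The sketch below only inspects the Python strings and does not contact BigQuery; the variable names are illustrative.

```python
# In a regular triple-quoted string, Python consumes the backslash,
# so the SQL text ends up with a bare quote inside the literal.
plain_sql = """WHERE name = 'Schindler\'s List'"""

# In a raw string (r prefix), the backslash survives and reaches the SQL engine.
raw_sql = r"""WHERE name = 'Schindler\'s List'"""

print(plain_sql)
print(raw_sql)
```

Wrapping the SQL string literal in double quotes, as in Option 2, avoids the question entirely.
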
#### List all the roles for the movie with id 290070. Sort them alphabetically

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.roles`
WHERE movie_id = 290070
ORDER BY role
""")

---
## `WHERE`: Inequality Conditions

| Condition | Description | Example |
|-----------|-------------|--------|
| `attr <> value` or `attr != value` | Attribute is not equal to value | `genre <> 'Drama'` |
| `attr > value` | Attribute is greater than value | `rating > 7.8` |
| `attr < value` | Attribute is less than value | `year < 1900` |
| `attr >= value` | Attribute is greater than or equal to value | `year >= 1900` |
| `attr <= value` | Attribute is less than or equal to value | `year <= 1900` |
| `attr BETWEEN n AND m` | Attribute is between the two values (inclusive on both ends) | `year BETWEEN 1890 AND 1892` |

### IMDb Database

#### Find all information about movies that were released before 1895 (exclusive)

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.movies`
WHERE year < 1895
""")

#### Find all information about movies released between 1895 and 1898 (exclusive on both ends)

Try both using Boolean operators and using the `BETWEEN` operator.

In [None]:
# Using Boolean operators
run_query("""
SELECT *
FROM `nyu-datasets.imdb.movies`
WHERE year > 1895 AND year < 1898
""")

In [None]:
# Using BETWEEN (note: BETWEEN is inclusive on both sides)
run_query("""
SELECT *
FROM `nyu-datasets.imdb.movies`
WHERE year BETWEEN 1896 AND 1897
""")

#### Find movies released between 1890 and 1892 (inclusive)

In [None]:
run_query("""
SELECT id, name, year
FROM `nyu-datasets.imdb.movies`
WHERE year BETWEEN 1890 AND 1892
""")

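The inclusive behavior of `BETWEEN` can be checked locally. This is a minimal sketch that uses an in-memory SQLite table as a stand-in for BigQuery; the four rows and the `years` helper are hypothetical and exist only for this illustration. With strict comparisons the two endpoints are dropped, while `BETWEEN` keeps them.

```python
import sqlite3

# Hypothetical sample data: one movie per year from 1895 to 1898.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [("A", 1895), ("B", 1896), ("C", 1897), ("D", 1898)],
)

def years(where):
    """Return the sorted years matched by a WHERE condition."""
    sql = f"SELECT year FROM movies WHERE {where} ORDER BY year"
    return [row[0] for row in conn.execute(sql)]

strict = years("year > 1895 AND year < 1898")      # endpoints excluded
between = years("year BETWEEN 1896 AND 1897")      # same rows as strict
inclusive = years("year BETWEEN 1895 AND 1898")    # endpoints included

print(strict, between, inclusive)
```
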
---
## `WHERE`: Boolean Operators

| Condition | Description |
|-----------|-------------|
| `NOT cond` | The opposite of the condition |
| `cond1 AND cond2` | Both conditions should hold |
| `cond1 OR cond2` | At least one of the conditions should hold |

**Operator Precedence:** `NOT` > `AND` > `OR` (`NOT` is evaluated first, then `AND`, then `OR`)

**Best Practice:** Always parenthesize complex expressions to make intent clear.
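
To see why parentheses matter, here is a small sketch on an in-memory SQLite table (a stand-in for BigQuery, with made-up rows and a hypothetical `last_names` helper). Because `AND` binds tighter than `OR`, the first condition is read as `(gender = 'F' AND first_name = 'Skyler') OR first_name = 'Sam'`, which also returns male actors named Sam.

```python
import sqlite3

# Hypothetical sample data, not taken from the IMDb dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actors (first_name TEXT, last_name TEXT, gender TEXT)")
conn.executemany(
    "INSERT INTO actors VALUES (?, ?, ?)",
    [("Skyler", "A", "F"), ("Sam", "B", "M"), ("Sam", "C", "F"), ("Skyler", "D", "M")],
)

def last_names(where):
    """Return the sorted last names matched by a WHERE condition."""
    sql = f"SELECT last_name FROM actors WHERE {where} ORDER BY last_name"
    return [row[0] for row in conn.execute(sql)]

# AND is applied before OR, so every Sam qualifies regardless of gender.
no_parens = last_names("gender = 'F' AND first_name = 'Skyler' OR first_name = 'Sam'")

# Parentheses force the OR to be evaluated first.
with_parens = last_names("gender = 'F' AND (first_name = 'Skyler' OR first_name = 'Sam')")

print(no_parens, with_parens)
```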

### IMDb Database

#### Fetch all info for actresses (female gender) whose first name is Skyler

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.actors`
WHERE gender = 'F' AND first_name = 'Skyler'
""")

#### Fetch all info for the director Steven Spielberg

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.directors`
WHERE first_name = 'Steven' AND last_name = 'Spielberg'
""")

#### Fetch all info for the directors with last names Scorsese, Polanski, and Spielberg

Use the `OR` operator for your Boolean query.

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.directors`
WHERE last_name = 'Scorsese' OR last_name = 'Polanski' OR last_name = 'Spielberg'
""")

#### (Stretch) Fetch all info for the directors Quentin Tarantino, Stanley Kubrick, and Orson Welles

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.directors`
WHERE (first_name = 'Quentin' AND last_name = 'Tarantino') OR
      (first_name = 'Stanley' AND last_name = 'Kubrick') OR
      (first_name = 'Orson' AND last_name = 'Welles')
""")

---
## The `IN` Operator

| Condition | Description | Example |
|-----------|-------------|--------|
| `attr IN (x1, x2, x3, ...)` | Attribute value is either x1, or x2, or x3, ... | `genre IN ('Drama', 'Comedy')` |
| `attr NOT IN (x1, x2, x3, ...)` | Attribute value is not x1, nor x2, nor x3, ... | `genre NOT IN ('Adult', 'Horror')` |
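
The equivalence between `IN` and a chain of `OR` conditions can be verified with a short local sketch. It assumes an in-memory SQLite table in place of BigQuery, and the `last_names` helper is hypothetical.

```python
import sqlite3

# Hypothetical sample data for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE directors (last_name TEXT)")
conn.executemany(
    "INSERT INTO directors VALUES (?)",
    [("Scorsese",), ("Polanski",), ("Spielberg",), ("Kubrick",), ("Welles",)],
)

def last_names(where):
    """Return the sorted last names matched by a WHERE condition."""
    sql = f"SELECT last_name FROM directors WHERE {where} ORDER BY last_name"
    return [row[0] for row in conn.execute(sql)]

with_or = last_names(
    "last_name = 'Scorsese' OR last_name = 'Polanski' OR last_name = 'Spielberg'"
)
with_in = last_names("last_name IN ('Scorsese', 'Polanski', 'Spielberg')")
with_not_in = last_names("last_name NOT IN ('Scorsese', 'Polanski', 'Spielberg')")

print(with_in, with_not_in)
```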

### IMDb Database

#### Fetch all info for the directors with last names Scorsese, Polanski, and Spielberg

Use `IN` instead of multiple `OR` conditions.

In [None]:
run_query("""
SELECT id, first_name, last_name
FROM `nyu-datasets.imdb.directors`
WHERE last_name IN ('Scorsese', 'Spielberg', 'Polanski')
""")

#### Fetch all info for the actors with last names Pitt, Clooney, and Damon

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.actors`
WHERE last_name IN ('Pitt', 'Clooney', 'Damon')
""")

---
## The `LIKE` Operator for Approximate Queries

`LIKE` allows you to write simple approximate (pattern matching) queries:

- `%` matches any sequence of characters (including empty)
- `_` matches any single character

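**Tip:** In databases that rely on indexes, a pattern with a leading `%` usually cannot use a regular index on the column, so the query has to scan every row and runs slower.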

**Examples:**
- `name LIKE 'B%'` — names starting with B
- `name LIKE 'B___B'` — 5-character names starting and ending with B
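
One way to build intuition for the two wildcards is to translate a `LIKE` pattern into a regular expression. The sketch below is a simplified, case-sensitive model written in plain Python: the `like_to_regex` and `like` helpers are hypothetical, and escape sequences inside patterns are deliberately ignored.

```python
import re

def like_to_regex(pattern):
    """Translate a LIKE pattern into a compiled regex (simplified model)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # any sequence of characters, including empty
        elif ch == "_":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # every other character matches itself
    return re.compile("".join(parts), re.DOTALL)

def like(value, pattern):
    """Return True when the whole value matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None

print(like("Batman", "B%"))                            # starts with B
print(like("BaaaB", "B___B"))                          # B, three characters, B
print(like("BaaB", "B___B"))                           # only four characters
print(like("Godfather: Part II, The", "Godfather%"))
```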

### IMDb Database

#### Find the entry for Alfred Hitchcock

Hint: Query using only the last name first, and notice how his first name is stored.

In [None]:
# First, let's see what we find with just the last name
run_query("""
SELECT *
FROM `nyu-datasets.imdb.directors`
WHERE last_name = 'Hitchcock'
""")

In [None]:
# Use LIKE to match first names starting with 'A'
run_query("""
SELECT *
FROM `nyu-datasets.imdb.directors`
WHERE last_name = 'Hitchcock' AND first_name LIKE 'A%'
""")

#### Find the Godfather movies, released in 1972, 1974, and 1990

Hint: The actual names for the movies are "Godfather, The", "Godfather: Part II, The", "Godfather: Part III, The"

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.movies`
WHERE name LIKE 'Godfather%' AND year IN (1972, 1974, 1990)
""")

#### Find all movies that start with 'B'

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.movies`
WHERE name LIKE 'B%'
LIMIT 20
""")

---
## The `NULL` Mark

When columns do not have a value, they are assigned a `NULL` mark, which is a special way that SQL handles the "empty value". We use the term "mark" instead of "value" as `NULL` indicates the **absence of a value**.

- To check if something is NULL: `attr IS NULL`
- To check if something is NOT NULL: `attr IS NOT NULL`

### IMDb Database

#### Find all movies that do not have a rating

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.movies`
WHERE rating IS NULL
LIMIT 20
""")

#### Find all movies that have a rating

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.movies`
WHERE rating IS NOT NULL
LIMIT 20
""")

### Incorrect approaches when using NULL

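The following approaches are **WRONG**. Comparing anything with `NULL` using `=` or `<>` yields `NULL` rather than `TRUE`, so the `WHERE` clause keeps no rows. Comparing a numeric column with a string such as `''` or `'NULL'` does not find the missing values either, and may fail with a type error.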

In [None]:
# WRONG: NULL is not equal to empty string
run_query("""
SELECT *
FROM `nyu-datasets.imdb.movies`
WHERE rating = ''
""")

In [None]:
# WRONG: NULL is not equal to the string 'NULL'
run_query("""
SELECT *
FROM `nyu-datasets.imdb.movies`
WHERE rating = 'NULL'
""")

In [None]:
# WRONG: We do not use = to compare with NULL
run_query("""
SELECT *
FROM `nyu-datasets.imdb.movies`
WHERE rating = NULL
""")

In [None]:
# WRONG: We do not use <> to compare with NULL either
run_query("""
SELECT *
FROM `nyu-datasets.imdb.movies`
WHERE rating <> NULL
""")

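The behavior of `NULL` comparisons can be reproduced locally. This sketch uses an in-memory SQLite table as a stand-in for BigQuery, with two made-up rows and a hypothetical `names` helper. The comparison `NULL = NULL` itself evaluates to `NULL` (shown as `None` in Python), which is why only `IS NULL` and `IS NOT NULL` return rows.

```python
import sqlite3

# Hypothetical sample data: one rated movie and one without a rating.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT, rating REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [("Rated", 7.5), ("Unrated", None)],
)

def names(where):
    """Return the sorted movie names matched by a WHERE condition."""
    sql = f"SELECT name FROM movies WHERE {where} ORDER BY name"
    return [row[0] for row in conn.execute(sql)]

# The comparison result is NULL, neither true nor false.
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]

equals_null = names("rating = NULL")          # no rows
not_equals_null = names("rating <> NULL")     # no rows either
is_null = names("rating IS NULL")             # the unrated movie
is_not_null = names("rating IS NOT NULL")     # the rated movie

print(null_equals_null, equals_null, not_equals_null, is_null, is_not_null)
```
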
---
## Conditional Construct: `CASE`

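The `CASE` statement lets you add conditional logic to a query. The `WHEN` branches are checked in order, and the first one whose condition holds determines the result. If no branch matches, the `ELSE` value is used, or `NULL` when `ELSE` is omitted.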

```sql
CASE
    WHEN condition THEN result
    [WHEN condition THEN result] ...
    [ELSE result]
END
```
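
A short sketch of the branch order, assuming an in-memory SQLite table in place of BigQuery and three made-up rows. A rating of 9.0 satisfies both conditions, but only the first matching branch applies. The `ELSE` is left out on purpose to show that unmatched rows get `NULL` (`None` in Python).

```python
import sqlite3

# Hypothetical sample data for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT, rating REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [("A", 9.0), ("B", 7.5), ("C", 5.0)],
)

categories = conn.execute(
    """
    SELECT name,
        CASE
            WHEN rating >= 8.5 THEN 'Excellent'
            WHEN rating >= 7.0 THEN 'Good'
        END AS category
    FROM movies
    ORDER BY name
    """
).fetchall()
print(categories)
```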

### IMDb Database

#### Categorize movies by rating into categories

In [None]:
run_query("""
SELECT name, rating,
    CASE
        WHEN rating >= 8.5 THEN 'Excellent'
        WHEN rating >= 7.0 THEN 'Good'
        ELSE 'Average or Below'
    END AS RatingCategory
FROM `nyu-datasets.imdb.movies`
WHERE rating IS NOT NULL
LIMIT 20
""")

### Facebook Database

#### Mark the "Very Liberal" and "Liberal" students as "Left" and mark the "Conservative", "Very Conservative", and "Libertarian" students as "Right"

In [None]:
run_query("""
SELECT ProfileID, Name, PoliticalViews,
    CASE
        WHEN PoliticalViews IN ('Very Liberal', 'Liberal') THEN 'Left'
        WHEN PoliticalViews IN ('Conservative', 'Very Conservative', 'Libertarian') THEN 'Right'
        ELSE 'Other/Unknown'
    END AS PoliticalLeaning
FROM `nyu-datasets.facebook.Profiles`
LIMIT 20
""")

---
## SQL Functions

SQL also has functions that apply at the attribute level:

- **Date functions:** `CURRENT_DATE()`, `DATE()`, `DATETIME()`, `TIMESTAMP()`, `EXTRACT(YEAR FROM dt)`, `DATE_TRUNC(date, MONTH)`
- **Math functions:** `ROUND(x [, digits])`, `CEIL(x)`, `FLOOR(x)`, `POWER(x,y)`, `SQRT(x)`
- **String functions:** `UPPER()`, `LOWER()`, `TRIM()`, `LTRIM()`, `RTRIM()`, `SUBSTR()`, `CONCAT()`
- **Null-handling:** `COALESCE(a,b,...)`, `IFNULL(a,b)`
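
A few of these functions can be tried without a BigQuery connection. The sketch below runs them in SQLite, which is only a stand-in: it covers just the functions that SQLite also provides under the same names, and the literal inputs are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each function is applied to a literal, so no table is needed.
row = conn.execute(
    """
    SELECT UPPER('pulp fiction'),
           TRIM('  Kane  '),
           SUBSTR('Godfather', 1, 3),
           ROUND(7.849, 1),
           COALESCE(NULL, NULL, 'n/a'),
           IFNULL(NULL, 0)
    """
).fetchone()
print(row)
```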

### Floating Point Bizarreness

In SQL, as in most programming languages, decimal numbers stored as floating point values are only approximations, so exact equality comparisons can give surprising results.
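
The same effect is easy to observe in plain Python, which uses the same binary floating point representation. This is a minimal sketch; the variable names are illustrative.

```python
import math

total = 0.1 + 0.2

# Exact equality fails because the sum is stored as 0.30000000000000004.
exact_match = (total == 0.3)

# Rounding both sides to the same precision makes the comparison reliable.
rounded_match = (round(total, 1) == 0.3)

# Comparing within a tolerance is another common approach.
close_match = math.isclose(total, 0.3)

print(exact_match, rounded_match, close_match)
```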

In [None]:
# This works as expected
run_query("""
SELECT *
FROM `nyu-datasets.imdb.movies`
WHERE rating > 7.8
LIMIT 10
""")

In [None]:
# Exact equality with floats can be problematic
run_query("""
SELECT *
FROM `nyu-datasets.imdb.movies`
WHERE rating = 7.8
LIMIT 10
""")

In [None]:
# Using ROUND for safer comparisons
run_query("""
SELECT *
FROM `nyu-datasets.imdb.movies`
WHERE ROUND(rating, 1) = 7.8
LIMIT 10
""")

---
## In-class Activity

Go to the Facebook database and find all the students whose home state is New York.

Deal with all the different ways that students have written New York in the "HomeState" attribute.
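
One thing to keep in mind for this activity: a pattern such as `'%NY%'` is a case-sensitive substring test in BigQuery, so lowercase spellings are missed. The plain Python sketch below models `LIKE '%x%'` as substring containment on made-up `HomeState` values; in SQL, the same idea would be to compare `UPPER(HomeState)` against uppercase patterns.

```python
# Hypothetical HomeState values; the real column may contain other variants.
home_states = ["New York", "NY", "N.Y.", "new york", "ny", "New Jersey"]
patterns = ["New York", "NY", "N.Y."]

# Case-sensitive containment, analogous to HomeState LIKE '%pattern%'.
case_sensitive = [
    s for s in home_states if any(p in s for p in patterns)
]

# Normalizing both sides to uppercase, analogous to UPPER(HomeState) LIKE '%PATTERN%'.
case_insensitive = [
    s for s in home_states if any(p.upper() in s.upper() for p in patterns)
]

print(case_sensitive)
print(case_insensitive)
```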

In [None]:
# First, let's see all the different values in HomeState
run_query("""
SELECT DISTINCT HomeState
FROM `nyu-datasets.facebook.Profiles`
WHERE HomeState LIKE '%New York%' OR HomeState LIKE '%NY%' OR HomeState LIKE '%N.Y.%'
ORDER BY HomeState
""")

In [None]:
# Your solution here
run_query("""
SELECT ProfileID, Name, HomeState
FROM `nyu-datasets.facebook.Profiles`
WHERE HomeState LIKE '%New York%'
   OR HomeState LIKE '%NY%'
   OR HomeState LIKE '%N.Y.%'
LIMIT 20
""")