<a href="https://colab.research.google.com/github/ipeirotis/introduction-to-databases/blob/master/session2/B-Selection_Queries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SQL: Selection Queries

This notebook covers the fundamentals of SQL SELECT statements using Google BigQuery.

**Learning Objectives:**
- (a) Select columns
- (b) Rename outputs using AS
- (c) De-duplicate using DISTINCT
- (d) Sort and page results using ORDER BY and LIMIT

## Setup

First, we authenticate with Google Cloud and set up the BigQuery client to run our SQL queries.

In [None]:
# Authenticate with Google Cloud
from google.colab import auth
auth.authenticate_user()

In [None]:
from google.cloud import bigquery

# Create a BigQuery client
client = bigquery.Client(project='nyu-datasets')

# Helper function to run queries and display results as a DataFrame
def run_query(sql):
    """Run a BigQuery SQL query and return results as a pandas DataFrame."""
    return client.query(sql).to_dataframe()

---
## `SELECT *`

The `SELECT *` statement retrieves all columns from a table.

```sql
SELECT *
FROM table_name;
```

**Note:** Use `SELECT *` only for quick, exploratory queries. Otherwise, list the columns you need explicitly: the query documents its intent, and BigQuery, which stores data by column, scans less data.
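
To see the difference without touching BigQuery, here is a minimal sketch that uses Python's built-in `sqlite3` module as an offline stand-in. The `movies` table, its columns, and its single row are made up for illustration, and `column_names` is a hypothetical helper, not part of this notebook's setup.

```python
import sqlite3

# Hypothetical stand-in for a movies table; columns and data are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (id INTEGER, name TEXT, year INTEGER, rating REAL)")
conn.execute("INSERT INTO movies VALUES (1, 'Example Movie', 2001, 7.5)")


def column_names(sql):
    """Return the output column names of a query, in order."""
    cursor = conn.execute(sql)
    return [col[0] for col in cursor.description]


star_columns = column_names("SELECT * FROM movies")
explicit_columns = column_names("SELECT name, rating FROM movies")

print(star_columns)      # every column of the table
print(explicit_columns)  # only the columns that were requested
```

The explicit query keeps returning the same two columns even if someone later adds columns to the table, which is one reason it is the safer habit.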

### IMDb Database

#### Return all movies

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.movies`
""")

#### Return all directors

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.directors`
""")

#### Return all actors

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.actors`
""")

#### Return all roles

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.roles`
""")

#### Return all genres for the movies

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.movies_genres`
""")

### Facebook Database

#### Return all students

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.facebook.Profiles`
""")

#### Return the hobbies of all students

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.facebook.Hobbies`
""")

#### Return the relationship status for all students

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.facebook.Relationship`
""")

#### Return what students are looking for

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.facebook.LookingFor`
""")

---
## `SELECT` _columns_

Good pattern: explicitly list the columns you need.

```sql
SELECT column1, column2, ...
FROM table_name;
```

### IMDb Database

#### Return the first and last names of actors

In [None]:
run_query("""
SELECT first_name, last_name
FROM `nyu-datasets.imdb.actors`
""")

#### Return year and rating for each movie

In [None]:
run_query("""
SELECT year, rating
FROM `nyu-datasets.imdb.movies`
""")

### Facebook Database

#### Return Name, Sex, and Birthday of all students

In [None]:
run_query("""
SELECT Name, Sex, Birthday
FROM `nyu-datasets.facebook.Profiles`
""")

#### Return Sex and Political Views of all students

In [None]:
run_query("""
SELECT Sex, PoliticalViews
FROM `nyu-datasets.facebook.Profiles`
""")

#### Return the Relationship status column

In [None]:
run_query("""
SELECT Status
FROM `nyu-datasets.facebook.Relationship`
""")

---
## `SELECT` _column_ `AS` _alias_

Sometimes we want to rename a column to have a more descriptive name in the results. We use the `AS` clause.

```sql
SELECT column1 AS name1, column2 AS name2, ...
FROM table_name;
```

**Tip:** Avoid spaces in alias names.
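
An alias changes only the label of the output column; the table itself and the data are untouched. The sketch below illustrates this with Python's built-in `sqlite3` as an offline stand-in for BigQuery. The `actors` table and its row are assumptions made up for the example.

```python
import sqlite3

# Hypothetical actors table with one invented row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actors (id INTEGER, first_name TEXT, last_name TEXT)")
conn.execute("INSERT INTO actors VALUES (1, 'Ada', 'Example')")

cursor = conn.execute("SELECT id AS actor_id, first_name, last_name FROM actors")

# cursor.description holds one entry per output column; item 0 is its name.
output_columns = [col[0] for col in cursor.description]
rows = cursor.fetchall()

print(output_columns)  # the first column is labeled actor_id, not id
print(rows)            # the values themselves are unchanged
```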

### IMDb Database

#### Return id, first, and last names of actors. Rename id to "actor_id"

In [None]:
run_query("""
SELECT id AS actor_id, first_name, last_name
FROM `nyu-datasets.imdb.actors`
""")

#### Return name, year, and rating for each movie. Rename name to "movie_title" and year to "release_year"

In [None]:
run_query("""
SELECT name AS movie_title, year AS release_year, rating
FROM `nyu-datasets.imdb.movies`
""")

### Facebook Database

#### Return Sex and Status of all students. Rename Sex to Gender and Status to UniversityStatus

In [None]:
run_query("""
SELECT Sex AS Gender, Status AS UniversityStatus
FROM `nyu-datasets.facebook.Profiles`
""")

---
## `SELECT DISTINCT`

`SELECT DISTINCT` eliminates duplicate rows from the results.

```sql
SELECT DISTINCT column1, column2, ...
FROM table_name;
```

**Note:** `DISTINCT` removes duplicate rows across the selected columns, not per column.
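
The row-level behavior is easy to check on a tiny table. This is a sketch using Python's built-in `sqlite3` as an offline stand-in for BigQuery; the `profiles` table and its four rows are invented for the example.

```python
import sqlite3

# Hypothetical profiles table; note the exact duplicate ('F', 'Liberal').
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profiles (sex TEXT, political_views TEXT)")
conn.executemany(
    "INSERT INTO profiles VALUES (?, ?)",
    [("F", "Liberal"), ("F", "Liberal"), ("F", "Moderate"), ("M", "Liberal")],
)

# DISTINCT over two columns: only the fully identical row is collapsed.
distinct_pairs = conn.execute(
    "SELECT DISTINCT sex, political_views FROM profiles ORDER BY sex, political_views"
).fetchall()

# DISTINCT over one column: each value appears once.
distinct_sex = conn.execute(
    "SELECT DISTINCT sex FROM profiles ORDER BY sex"
).fetchall()

print(distinct_pairs)
print(distinct_sex)
```

In the two-column result, `'F'` and `'Liberal'` each still appear more than once, because `DISTINCT` compares whole rows rather than individual columns.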

### IMDb Database

#### Find all the movie genres

In [None]:
run_query("""
SELECT DISTINCT genre
FROM `nyu-datasets.imdb.movies_genres`
""")

### Facebook Database

#### Return the distinct PoliticalViews from the Profiles table

In [None]:
run_query("""
SELECT DISTINCT PoliticalViews
FROM `nyu-datasets.facebook.Profiles`
""")

#### Return the distinct Sex values from the Profiles table

In [None]:
run_query("""
SELECT DISTINCT Sex
FROM `nyu-datasets.facebook.Profiles`
""")

#### Find all possible "Relationship" statuses

In [None]:
run_query("""
SELECT DISTINCT Status
FROM `nyu-datasets.facebook.Relationship`
""")

#### Find what students are "LookingFor"

In [None]:
run_query("""
SELECT DISTINCT LookingFor
FROM `nyu-datasets.facebook.LookingFor`
""")

#### Find all possible Concentrations

In [None]:
run_query("""
SELECT DISTINCT Concentration
FROM `nyu-datasets.facebook.Concentration`
""")

---
## `ORDER BY` and `LIMIT`

- **ORDER BY:** Sort results by attribute values
  - `ASC` (default) for ascending, `DESC` for descending
  - Can list multiple attributes for tie-breaking
  - `NULLS FIRST` / `NULLS LAST` controls where empty values appear
  - Pro-tip: ORDER BY columns don't need to appear in SELECT

- **LIMIT n:** Return at most n rows; without `ORDER BY`, which rows you get is not guaranteed
- **OFFSET m:** Skip the first m rows

```sql
SELECT column1, column2
FROM table_name
ORDER BY column1 DESC, column2 ASC NULLS LAST
LIMIT 10 OFFSET 0;
```
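
To make the tie-breaking concrete, here is a small sketch that runs the same kind of multi-column `ORDER BY` on Python's built-in `sqlite3`, used as an offline stand-in for BigQuery. The `movies` table and its rows are invented; three of them share the same rating on purpose.

```python
import sqlite3

# Hypothetical movies table, inserted in scrambled order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT, year INTEGER, rating REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("D", 2010, 8.0),
        ("C", 1995, 9.0),
        ("A", 2000, 9.0),
        ("B", 1995, 9.0),
    ],
)

# Highest rating first; ties go to the earlier year, then to the name.
top3 = conn.execute(
    """
    SELECT name, year, rating
    FROM movies
    ORDER BY rating DESC, year, name
    LIMIT 3
    """
).fetchall()

print(top3)
```

Movies B and C tie on both rating and year, so only the third sort key (`name`) decides their order.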

**Portability notes:**
- MySQL/Postgres/SQLite/BigQuery: `LIMIT n OFFSET m`
- SQL Server: `ORDER BY ... OFFSET m ROWS FETCH NEXT n ROWS ONLY`
- Oracle (12c and later): `ORDER BY ... OFFSET m ROWS FETCH NEXT n ROWS ONLY`
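
For the `LIMIT n OFFSET m` dialects, paging usually means turning a page number into an offset. The sketch below shows one way to do that. `page_clause` and `fetch_page` are hypothetical helpers (not part of this notebook's setup), and the demo runs on Python's built-in `sqlite3` with an invented `items` table so it works offline.

```python
import sqlite3


def page_clause(page, page_size):
    """Build a LIMIT/OFFSET clause for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    offset = (page - 1) * page_size
    return f"LIMIT {int(page_size)} OFFSET {int(offset)}"


# Hypothetical items table with ids 1 through 7.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER)")
conn.executemany("INSERT INTO items VALUES (?)", [(i,) for i in range(1, 8)])


def fetch_page(page, page_size=3):
    """Return the ids on one page; ORDER BY keeps the pages stable."""
    sql = f"SELECT id FROM items ORDER BY id {page_clause(page, page_size)}"
    return [row[0] for row in conn.execute(sql)]


pages = [fetch_page(p) for p in (1, 2, 3)]
print(pages)
```

The `ORDER BY` matters here: without a fixed sort order, consecutive pages could overlap or skip rows.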

### IMDb Database

#### Find the top-10 ranked movies
- Rank by rating first (descending order)
- Break ties using year
- Break remaining ties using name

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.imdb.movies`
ORDER BY rating DESC, year, name
LIMIT 10
""")

#### List all the distinct years of the movies, in descending order

In [None]:
run_query("""
SELECT DISTINCT year
FROM `nyu-datasets.imdb.movies`
ORDER BY year DESC
""")

### Facebook Database

#### List the first 50 students that joined Facebook at NYU (use the MemberSince attribute)

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.facebook.Profiles`
ORDER BY MemberSince
LIMIT 50
""")

#### List the 10 students that have not updated their profiles for the longest time

**What is the problem?** In BigQuery, NULL values sort first in ascending order, so this query returns students who never set a LastUpdate value rather than those with the oldest updates.

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.facebook.Profiles`
-- Problem: NULL values appear first!
ORDER BY LastUpdate
LIMIT 10
""")

#### Fixed version: Filter out NULL values or use NULLS LAST

In [None]:
run_query("""
SELECT *
FROM `nyu-datasets.facebook.Profiles`
WHERE LastUpdate IS NOT NULL
ORDER BY LastUpdate
LIMIT 10
""")

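The NULL-ordering problem can also be reproduced offline. This sketch uses Python's built-in `sqlite3`, which, like BigQuery, sorts NULLs first in ascending order by default. The `profiles` table and its rows are invented, and the last query emulates `NULLS LAST` with `ORDER BY last_update IS NULL, ...` so that it also runs on older SQLite versions; in BigQuery you would write `ORDER BY LastUpdate NULLS LAST` instead.

```python
import sqlite3

# Hypothetical profiles table; None is stored as NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profiles (name TEXT, last_update TEXT)")
conn.executemany(
    "INSERT INTO profiles VALUES (?, ?)",
    [
        ("Ann", "2005-03-01"),
        ("Bob", None),
        ("Cat", "2004-11-15"),
        ("Dan", None),
        ("Eve", "2005-01-20"),
    ],
)


def names(sql):
    """Run a query and return the first column as a list."""
    return [row[0] for row in conn.execute(sql)]


# Problem: the NULL rows come first and crowd out the real oldest updates.
naive = names("SELECT name FROM profiles ORDER BY last_update, name LIMIT 2")

# Fix 1: drop the NULL rows before sorting.
filtered = names(
    "SELECT name FROM profiles WHERE last_update IS NOT NULL "
    "ORDER BY last_update LIMIT 2"
)

# Fix 2: keep the NULL rows but push them to the end (NULLS LAST emulation).
nulls_last = names(
    "SELECT name FROM profiles ORDER BY last_update IS NULL, last_update, name"
)

print(naive)
print(filtered)
print(nulls_last)
```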
---
## Summary

- **Prefer explicit columns**; avoid `SELECT *` beyond quick exploration
- **Label outputs with `AS`** to create meaningful column names
- **Use `DISTINCT`** to de-duplicate rows
- **`ORDER BY`** for sorting; **`LIMIT`/`OFFSET`** for pagination
- Practice queries on `nyu-datasets.imdb` and `nyu-datasets.facebook`