# Activity: Filter and sort data with SQL

## Overview

The purpose of this activity is to show how to use SQL to filter and sort data in a large dataset in BigQuery. We will:

- create a custom dataset in BigQuery,
- import a .csv file as a new table in the BigQuery dataset, and
- use SQL queries to filter and sort data.

## Dataset

We are provided with a .csv file containing movie data, including title, genre, and revenue. The data can be viewed in [Google Sheets](https://docs.google.com/spreadsheets/d/1pjpm8QPJOhX7aIuQ6j4iKz66b5AfkXWXHdMSxsHEyKI/edit?usp=sharing) or downloaded directly as a [.csv file](/activities/sql/c05m01-filter-data-with-sql/c05m01-movie-data.csv). A preview of the comma-delimited file is shown below.

![Movie data in csv](c05m01-movie-data-csv.png 'Movie data in csv')

## Data Cleaning

Before importing the dataset into BigQuery, we need to clean the column names to ensure compatibility with SQL queries. The current column names contain spaces and parentheses, which can lead to errors when writing queries or referencing fields. To fix this:

- Extra spaces before or after column names will be removed.
- Column names will be converted to lowercase for consistency.
- Parentheses will be removed.
- Remaining spaces will be replaced with underscores (_).

This preprocessing step ensures the dataset is query-friendly and adheres to best practices for database schemas. Below is a list of the current column names and their updated versions, which will be applied before importing into BigQuery:

| Original Column Name | Updated Column Name |
| --- | --- |
| Movie Title | movie_title |
| Release Date | release_date |
| Wikipedia URL | wikipedia_url |
| Genre | genre |
| Director (1) | director_1 |
| Director (2) | director_2 |
| Cast (1) | cast_1 |
| Cast (2) | cast_2 |
| Cast (3) | cast_3 |
| Cast (4) | cast_4 |
| Cast (5) | cast_5 |
| Budget  | budget |
| Revenue | revenue |

The updated data can be downloaded from [Google Sheets](https://docs.google.com/spreadsheets/d/1PNJPNZwKd6eKyDQnj2ybsFrMQAikA48Of5uh1wXH5cY/edit?usp=sharing) or directly by downloading the [.csv file](/activities/sql/c05m01-filter-data-with-sql/c05m01-movie-data-clean.csv).
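
The renaming rules above can also be expressed in code. The following is a minimal Python sketch, not the method used to produce the cleaned file; the helper name `clean_column_name` is hypothetical, and the sample headers are taken from the table above.

```python
import re


def clean_column_name(name: str) -> str:
    """Turn a raw header into a lowercase, underscore-separated name."""
    name = name.strip().lower()        # drop surrounding spaces, then lowercase
    name = re.sub(r"[()]", "", name)   # remove parentheses: "(1)" becomes "1"
    name = re.sub(r"\s+", "_", name)   # replace each run of whitespace with "_"
    return name


# A few of the original headers from the table above
original = ["Movie Title", "Wikipedia URL", "Director (1)", "Cast (5)", "Budget "]
cleaned = [clean_column_name(column) for column in original]
print(cleaned)
# ['movie_title', 'wikipedia_url', 'director_1', 'cast_5', 'budget']
```

Stripping and removing parentheses before replacing whitespace avoids leading, trailing, or doubled underscores in the result.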

## Importing the data in BigQuery

The movie dataset is imported into BigQuery with the following steps:

- **Create dataset** with **Dataset ID** `movie_data`
- In the **Dataset info** window, select the **CREATE TABLE** button
- In the **Source** section, select the **Upload** option in **Create table from**
- Browse to the `c05m01-movie-data-clean.csv` file and open it
- Set the file format to `CSV`
- In the **Destination** section, name the table `movies`
- In the **Schema** section, select **Auto detect**
- Optional (alternative to cleaning the column names): In the **Advanced options** menu, change **Column name character map** to `V2`, as this will allow parentheses to be used in column names.

Finally, select **Create table**. A new table named `movies` is created and appears in the Explorer pane under the dataset `movie_data`. A preview of the data is shown below.

![Movie data in BigQuery](c05m01-movie-data-bigquery.png 'Movie data in BigQuery')
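
As an optional check before uploading, the header row of a .csv file can be tested against BigQuery's column naming rules. The sketch below assumes the default rules (letters, digits, and underscores only, not starting with a digit, at most 300 characters); the helper `invalid_columns` is hypothetical, and the header strings assume the columns appear in the same order as in the renaming table.

```python
import csv
import io
import re

# Assumed default rule: letters, digits, underscores; no leading digit; max 300 chars
VALID_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,299}")


def invalid_columns(csv_text: str) -> list[str]:
    """Return the header names that do not satisfy the assumed naming rule."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [name for name in header if not VALID_NAME.fullmatch(name)]


cleaned_header = (
    "movie_title,release_date,wikipedia_url,genre,director_1,director_2,"
    "cast_1,cast_2,cast_3,cast_4,cast_5,budget,revenue\n"
)
original_header = (
    "Movie Title,Release Date,Wikipedia URL,Genre,Director (1),Director (2),"
    "Cast (1),Cast (2),Cast (3),Cast (4),Cast (5),Budget ,Revenue\n"
)

print(invalid_columns(cleaned_header))   # [] (every cleaned name passes)
print(invalid_columns(original_header))  # names containing spaces or parentheses
```

An empty list means the header should load without needing the **Column name character map** option.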

## Query: Movies in the Comedy genre

The following query filters the `movies` table and sorts the result. It:

- keeps only movies in the Comedy genre,
- keeps only movies with revenue greater than $300,000,000,
- extracts the year from the release date, and
- sorts the movies by year in descending order, so that the latest movie is listed first.

```sql
SELECT
  movie_title,
  EXTRACT(YEAR FROM release_date) AS release_year
FROM `plucky-aegis-427011-v5.movie_data.movies`
WHERE
  genre = 'Comedy'
  AND revenue > 300000000
ORDER BY
  EXTRACT(YEAR FROM release_date) DESC;
```

The `movies` table has 508 rows, and the query returns only the 7 Comedy movies with revenue greater than $300,000,000, sorted by release year in descending order. The query results are shown below:

![Comedy movies query results](c05m01-movies-comedy-by-year.png 'Comedy movies query results')
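
The same filter-and-sort logic can be tried locally without a BigQuery project. The sketch below uses Python's built-in `sqlite3` module with an in-memory table; the four rows are placeholder values invented for illustration, not rows from the movie dataset. SQLite has no `EXTRACT`, so `strftime('%Y', ...)` is swapped in to pull out the year.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movies "
    "(movie_title TEXT, release_date TEXT, genre TEXT, revenue INTEGER)"
)

# Placeholder rows (hypothetical, not taken from the real dataset)
rows = [
    ("Placeholder Comedy A", "2010-06-18", "Comedy", 400_000_000),
    ("Placeholder Comedy B", "2015-07-10", "Comedy", 350_000_000),
    ("Placeholder Comedy C", "2012-03-02", "Comedy", 100_000_000),  # revenue too low
    ("Placeholder Drama D", "2016-11-04", "Drama", 500_000_000),    # wrong genre
]
conn.executemany("INSERT INTO movies VALUES (?, ?, ?, ?)", rows)

query = """
SELECT
  movie_title,
  CAST(strftime('%Y', release_date) AS INTEGER) AS release_year
FROM movies
WHERE genre = 'Comedy'
  AND revenue > 300000000
ORDER BY release_year DESC
"""
result = conn.execute(query).fetchall()
conn.close()

for title, year in result:
    print(year, title)
# 2015 Placeholder Comedy B
# 2010 Placeholder Comedy A
```

Only the two Comedy rows above the revenue threshold survive the `WHERE` clause, and the later release year is listed first.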