# Kaggle Intro to SQL (and BigQuery)
- https://www.kaggle.com/learn/intro-to-sql

## 4. Exercise: As & With
- plus CTE (Common Table Expression)

### Introduction
- You are getting to the point where you can own an analysis from beginning to end.
- We are going to work with a dataset about taxi trips in the city of Chicago. Run the cell below to fetch the chicago_taxi_trips dataset.

In [1]:
### Fetch the 'chicago_taxi_trips' dataset from the 'bigquery-public-data' project.
from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client('jmproject86385')

# Construct a reference to the "chicago_taxi_trips" dataset
dataset_ref = client.dataset("chicago_taxi_trips", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# Tables in dataset, number and names
tables = list(client.list_tables(dataset))
print(len(tables))
for tbl in tables:
    print(tbl.table_id)



1
taxi_trips


### Ex. 1) Find the data
- Before you can access the data, you need to find the table name with the data.

In [2]:
table_name = list(client.list_tables(dataset))[0].table_id
table_name

'taxi_trips'

### Ex. 2) Peek at the data
- Use the next code cell to peek at the top few rows of the data. Inspect the data and see if any issues with data quality are immediately obvious.
- Note (JM): converting the whole table to a DataFrame is also possible, but it can be expensive (see the commented-out cell In [4]).

In [3]:
# Construct a reference to the "taxi_trips" table
table_ref = dataset_ref.table("taxi_trips")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five lines of the "taxi_trips" table
client.list_rows(table, max_results=5).to_dataframe()



Unnamed: 0,unique_key,taxi_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,pickup_census_tract,dropoff_census_tract,pickup_community_area,dropoff_community_area,...,extras,trip_total,payment_type,company,pickup_latitude,pickup_longitude,pickup_location,dropoff_latitude,dropoff_longitude,dropoff_location
0,327b1d0310c1e44ffc0b0f90aa61352b8b48e462,707bf241e39c6a1e051cc7f153487e9f2d65801042cb03...,2018-11-19 07:15:00+00:00,2018-11-19 07:15:00+00:00,0,0.0,,,,,...,,,Unknown,Taxi Affiliation Services,,,,,,
1,d98f04722704bda967a6a13aaf5f200368c1ed78,e6a9b95839d258d36f7d4040e681ddbc5e1ccfa604c361...,2018-12-04 13:45:00+00:00,2018-12-04 13:45:00+00:00,0,0.0,,,,,...,,,Unknown,Choice Taxi Association,,,,,,
2,bba2d63bb7d1a06bb5ddddc29d07e8514937635f,e6a9b95839d258d36f7d4040e681ddbc5e1ccfa604c361...,2018-12-10 13:15:00+00:00,2018-12-10 13:15:00+00:00,480,2.4,,,,,...,,,Unknown,Choice Taxi Association,,,,,,
3,c678ba6f5a8517aa42f3567edf244068e8961ae4,f47ce5a42b6e736badd5a196272f3d6961805248c631d6...,2019-01-09 12:30:00+00:00,2019-01-09 12:30:00+00:00,0,0.0,,,,,...,,,Unknown,Choice Taxi Association,,,,,,
4,4b65058a3a5b6290f702047d1630d93d3b8b0cb9,c19d802859a6a7fa0822f0b7156d8bbebe4830062f1832...,2019-03-08 16:00:00+00:00,2019-03-08 16:00:00+00:00,0,0.0,17031280000.0,17031280000.0,28.0,28.0,...,,,Unknown,Taxi Affiliation Services,41.8853,-87.642808,POINT (-87.6428084655 41.8853000224),41.8853,-87.642808,POINT (-87.6428084655 41.8853000224)


In [4]:
# ### Passing the whole table to a DataFrame could be EXPENSIVE!
# # There is a lot of data; e.g., with a query:
# query = '''
#     SELECT *
#     FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips` '''
# query_job = client.query(query)
# df = query_job.to_dataframe()

### Ex. 3) Determine when this data is from

If the data is sufficiently old, we should be careful before assuming it is still relevant to traffic patterns today. Write a query that counts the number of trips in each year. Your results should have two columns:
- `year` - the year of the trips.
- `num_trips` - number of trips in that year.

Hints:
- When using `GROUP BY` and `ORDER BY`, refer to the columns by the alias `year` that you set at the top of the `SELECT` query.
- The SQL code to select the year from `trip_start_timestamp` is `SELECT EXTRACT(YEAR FROM trip_start_timestamp)`.
- The `FROM` field can be a little tricky until you are used to it. The format is:
    - A backtick (the symbol `).
    - The project name. In this case it is `bigquery-public-data`.
    - A period.
    - The dataset name. In this case, it is `chicago_taxi_trips`.
    - A period.
    - The table name. You used this as your answer in 1) Find the data.
    - A backtick (the symbol `).
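
The `FROM` format described in the hints can be assembled from its three parts. This is only a small sketch; the variable names (`project`, `dataset_id`, `table_id`, `full_table_name`, `count_query`) are illustrative and not part of the exercise.

```python
# Parts of the fully-qualified table name, as listed in the hints
project = "bigquery-public-data"
dataset_id = "chicago_taxi_trips"
table_id = "taxi_trips"  # the answer from "1) Find the data"

# backtick + project + period + dataset + period + table + backtick
full_table_name = f"`{project}.{dataset_id}.{table_id}`"

# Illustrative query string that uses the name in its FROM clause
count_query = f"SELECT COUNT(1) AS num_trips FROM {full_table_name}"

print(full_table_name)
print(count_query)
```

Building the name this way makes a missing period or backtick easy to spot before the query is sent to BigQuery.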


In [5]:
# Your code goes here
rides_per_year_query = """
    SELECT COUNT(1) AS num_trips, EXTRACT(YEAR FROM trip_start_timestamp) AS year
    FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
    GROUP BY year
    ORDER BY year """

# Set up the query (cancel the query if it would use too much of 
# your quota)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
rides_per_year_query_job = client.query(rides_per_year_query, job_config=safe_config)

# API request - run the query, and return a pandas DataFrame
rides_per_year_result = rides_per_year_query_job.to_dataframe()

# View results
print(rides_per_year_result)

    num_trips  year
0    27217300  2013
1    37395079  2014
2    32385527  2015
3    31756403  2016
4    24979611  2017
5    20731105  2018
6    16476440  2019
7     3888831  2020
8     3947677  2021
9     6382071  2022
10    3234974  2023
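
The yearly counts printed above can be summed to get a feel for the table's size, which is also why pulling the whole table into a DataFrame is costly. The sketch below copies the numbers from the output by hand rather than querying again; the names `trips_per_year`, `total_trips`, and `busiest_year` are illustrative.

```python
# Yearly trip counts copied from the query output above
trips_per_year = {
    2013: 27217300,
    2014: 37395079,
    2015: 32385527,
    2016: 31756403,
    2017: 24979611,
    2018: 20731105,
    2019: 16476440,
    2020: 3888831,
    2021: 3947677,
    2022: 6382071,
    2023: 3234974,
}

# Total number of rows (trips) across all years
total_trips = sum(trips_per_year.values())

# Year with the largest number of trips
busiest_year = max(trips_per_year, key=trips_per_year.get)

print(f"{total_trips:,} trips in total; busiest year: {busiest_year}")
```

With roughly 208 million rows spanning 2013 to 2023, and the busiest year being 2014, the data covers both old and fairly recent traffic patterns.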


rides_per_year_query = """
                       SELECT EXTRACT(YEAR FROM trip_start_timestamp) AS year, 
                              COUNT(1) AS num_trips
                       FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
                       GROUP BY year
                       ORDER BY year
                       """

### Ex. 4) Dive slightly deeper

You'd like to take a closer look at rides from 2016. Copy the query you used above in rides_per_year_query into the cell below for rides_per_month_query. Then modify it in two ways:
1. Use a WHERE clause to limit the query to data from 2016.
2. Modify the query to extract the month rather than the year.

In [7]:
# Your code goes here
rides_per_month_query = """
    SELECT EXTRACT(MONTH FROM trip_start_timestamp) AS month,
        COUNT(1) AS num_trips
    FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
    WHERE EXTRACT (YEAR FROM trip_start_timestamp) = 2016
    GROUP BY month
    ORDER BY month """

# Set up the query
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
rides_per_month_query_job = client.query(rides_per_month_query, job_config=safe_config)

# API request - run the query, and return a pandas DataFrame
rides_per_month_result = rides_per_month_query_job.to_dataframe()

# View results
print(rides_per_month_result)

    month  num_trips
0       1    2510389
1       2    2568433
2       3    2851106
3       4    2854290
4       5    2859147
5       6    2841872
6       7    2682912
7       8    2629482
8       9    2532650
9      10    2725340
10     11    2387790
11     12    2312992


### Ex. 5) Write the query

- It's time to step up the sophistication of your queries. Write a query that shows, for each hour of the day in the dataset, the corresponding number of trips and average speed.
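
One way to structure such a query is with a CTE (`WITH ... AS`), as this lesson's title suggests. The sketch below is not the exercise's official solution: it uses Python's built-in `sqlite3` as a local stand-in for BigQuery, so the hour is extracted with `strftime('%H', ...)` where BigQuery would use `EXTRACT(HOUR FROM trip_start_timestamp)`. The toy rows, the CTE name `RelevantRides`, the filter on zero-length trips, and the formula `3600 * SUM(trip_miles) / SUM(trip_seconds)` for average speed in miles per hour are all assumptions made for illustration.

```python
import sqlite3

# In-memory stand-in for the taxi_trips table (toy data, not real trips)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxi_trips ("
    "trip_start_timestamp TEXT, trip_seconds INTEGER, trip_miles REAL)"
)
toy_rows = [
    ("2016-01-01 08:05:00", 600, 2.0),
    ("2016-01-01 08:40:00", 1200, 3.0),
    ("2016-01-01 17:15:00", 1800, 10.0),
    ("2016-01-02 17:30:00", 0, 0.0),  # zero-length trip, filtered out below
]
conn.executemany("INSERT INTO taxi_trips VALUES (?, ?, ?)", toy_rows)

speeds_query = """
    WITH RelevantRides AS (
        -- Assumption: drop trips with no duration or distance
        SELECT CAST(strftime('%H', trip_start_timestamp) AS INTEGER) AS hour_of_day,
               trip_miles,
               trip_seconds
        FROM taxi_trips
        WHERE trip_seconds > 0 AND trip_miles > 0
    )
    SELECT hour_of_day,
           COUNT(1) AS num_trips,
           3600.0 * SUM(trip_miles) / SUM(trip_seconds) AS avg_mph
    FROM RelevantRides
    GROUP BY hour_of_day
    ORDER BY hour_of_day
"""

# Each result row is (hour_of_day, num_trips, avg_mph)
speeds_result = conn.execute(speeds_query).fetchall()
for row in speeds_result:
    print(row)
```

The CTE keeps the filtering and hour extraction in one place, so the final `SELECT` only has to aggregate. Dividing total miles by total seconds, rather than averaging per-trip speeds, weights long trips more heavily and avoids dividing by zero for individual rows.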