# SQL Subqueries

In [None]:
# !wget https://github.com/gt-cse-6040/bootcamp/raw/main/Module%201/Session%204/NYC-311-2M_small.db

In [None]:
# create a connection to the database
import sqlite3 as db
import pandas as pd

# Connect to a database (or create one if it doesn't exist)
conn_nyc = db.connect('NYC-311-2M_small.db')

## Subqueries -- Quick Review

An inline subquery is a query nested within another SQL query, typically within the SELECT, WHERE, or FROM clause. 

Inline subqueries allow you to perform calculations or filtering within the context of the outer query. 

They are particularly useful when you need to filter results based on aggregated values like maximum, minimum, or average.

Inline subqueries are powerful, but they can become less efficient when the same calculation needs to be repeated multiple times, as each repetition can slow down the query execution. 

Visit the [w3resource's SQLite Subqueries page](https://www.w3resource.com/sqlite/sqlite-subqueries.php) to learn more.
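As a minimal sketch of the three placements mentioned above, the snippet below runs a subquery in the `WHERE`, `SELECT`, and `FROM` clauses. The `tickets` table and its values are hypothetical toy data created in memory, not part of the NYC 311 database.

```python
import sqlite3

# Hypothetical toy table (an assumption for illustration only)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (city TEXT, minutes_open INTEGER)")
conn.executemany(
    "INSERT INTO tickets VALUES (?, ?)",
    [("BRONX", 10), ("BRONX", 50), ("QUEENS", 30), ("QUEENS", 70)],
)

# WHERE clause: keep rows above the overall average (the average here is 40)
above_avg = conn.execute("""
    SELECT city, minutes_open
    FROM tickets
    WHERE minutes_open > (SELECT AVG(minutes_open) FROM tickets)
    ORDER BY minutes_open
""").fetchall()

# SELECT clause: attach the overall maximum to every row
with_max = conn.execute("""
    SELECT city, minutes_open,
           (SELECT MAX(minutes_open) FROM tickets) AS overall_max
    FROM tickets
    ORDER BY city, minutes_open
""").fetchall()

# FROM clause: treat the result of a query as a table (aliased sq)
per_city = conn.execute("""
    SELECT sq.city, sq.total
    FROM (SELECT city, SUM(minutes_open) AS total
          FROM tickets
          GROUP BY city) sq
    ORDER BY sq.total DESC
""").fetchall()

print(above_avg)  # [('BRONX', 50), ('QUEENS', 70)]
print(with_max)
print(per_city)   # [('QUEENS', 100), ('BRONX', 60)]
```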



## Subqueries

- Are created/defined and executed in the same statement.

- Are defined `inline`, within your main query, and as part of that main query.

- Are executed only where they are defined; they cannot be reused elsewhere in the statement.
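To illustrate the "not reusable" point, here is a small sketch in which the same subquery is needed twice, so its text has to appear twice in the statement. Keeping the SQL text in one Python string (the hypothetical `BIGGEST_DAY_SQL` constant) only avoids retyping it; the final statement still contains two copies of the subquery. The toy `data` table and the ISO-style `CreatedDate` text format are assumptions.

```python
import sqlite3

# Hypothetical toy stand-in for the `data` table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (CreatedDate TEXT, City TEXT)")
conn.executemany("INSERT INTO data VALUES (?, ?)", [
    ("2014-11-17 09:15:00", "BRONX"),
    ("2014-11-18 08:00:00", "BRONX"),
    ("2014-11-18 08:30:00", "QUEENS"),
    ("2014-11-18 14:45:00", None),
])

# Hypothetical constant: the subquery text, written once in Python.
# It is a fixed string (no user input), so interpolating it is safe here.
BIGGEST_DAY_SQL = """
    SELECT strftime('%Y-%m-%d', CreatedDate) AS createdymd
    FROM data
    GROUP BY 1
    ORDER BY count(*) DESC
    LIMIT 1
"""

# The subquery text is pasted into the statement in two places
query = f"""
    SELECT
        (SELECT count(*)
         FROM data a
         INNER JOIN ({BIGGEST_DAY_SQL}) b
            ON strftime('%Y-%m-%d', a.CreatedDate) = b.createdymd
        ) AS events_on_day,
        (SELECT count(DISTINCT a.City)
         FROM data a
         INNER JOIN ({BIGGEST_DAY_SQL}) b
            ON strftime('%Y-%m-%d', a.CreatedDate) = b.createdymd
        ) AS cities_on_day
"""
events_on_day, cities_on_day = conn.execute(query).fetchone()
print(events_on_day, cities_on_day)  # 3 2 (count(DISTINCT ...) ignores NULL)
```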

### EXAMPLE -- INNER JOIN WITH SUBQUERY

**Requirement**

From the `data` table, for each `city`, return the count of tickets per hour on the busiest day, i.e. the `CreatedDate` day with the most events.

Hint: it's `2014-11-18` (8466 events), but how do we find that date dynamically in code?

*    Columns
    *    `City`
    *    `createdHour`
    *    `countoccur`: the count of events

*    Exclude NULL cities i.e. `WHERE city IS NOT NULL`

*    Sort
    *   `City` in ascending order
    *   `createdHour` in ascending order

**Pseudocode:**
*    Find the biggest day (subquery)
*    `JOIN` the subquery to the `data` table
*    Produce the `SELECT` statement
*    Include the `WHERE` clause
*    `GROUP BY`
*    `ORDER BY`
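The first pseudocode step can be tried on its own before it is nested in a join. The sketch below runs only the "biggest day" query against a hypothetical four-row `data` table; it assumes `CreatedDate` is stored as ISO-style text, which `strftime` can parse.

```python
import sqlite3

# Hypothetical toy stand-in for the `data` table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (CreatedDate TEXT, City TEXT)")
conn.executemany("INSERT INTO data VALUES (?, ?)", [
    ("2014-11-17 09:15:00", "BRONX"),
    ("2014-11-18 08:00:00", "BRONX"),
    ("2014-11-18 08:30:00", "QUEENS"),
    ("2014-11-18 14:45:00", None),
])

# Count rows per day, sort by the count (largest first), keep the top row
biggest_day = conn.execute("""
    SELECT strftime('%Y-%m-%d', CreatedDate) AS createdymd
          ,count(*) AS totalymd
    FROM data
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 1
""").fetchone()

print(biggest_day)  # ('2014-11-18', 3)
```

Note that if two days tie for the highest count, `LIMIT 1` returns only one of them, and which one is not specified unless a tie-breaking column is added to the `ORDER BY`.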

In [None]:
def inner_join_example():

    # count events per city and hour on the day with the most events
    query_inner_join = '''
                SELECT a.city
                        ,sq.createdymd
                        ,strftime('%H',CreatedDate) createdhour
                        ,count(*) countoccur
                FROM data a
                
                -- SUBQUERY BELOW
                INNER JOIN ( --this join gets the date with the most events
                              SELECT strftime('%Y-%m-%d',CreatedDate) createdymd
                                      ,count(*) totalymd
                              FROM data
                              GROUP BY 1
                              ORDER BY 2 desc
                              LIMIT 1
                            ) SQ 
                        on strftime('%Y-%m-%d',a.CreatedDate) = sq.createdymd
                -- END OF SUBQUERY
                
                WHERE a.city IS NOT NULL
                GROUP BY 1,2,3
                ORDER BY 1,2,3
                '''
    return query_inner_join

df_inner_join_example = pd.read_sql(inner_join_example(),conn_nyc)
display(df_inner_join_example)

### So what did we do here?

#### First the subquery itself:

1. The subquery counted the number of rows (complaints) for each date.

2. The subquery sorted the dates by that count in descending order.

3. With `LIMIT 1`, the subquery returned a single row: the date with the most complaints.

#### Next, in the main query:

1. The main query inner joined the `data` table to the subquery on the date.

2. Because the join to the subquery is an inner join, the only rows returned are those whose date matches the subquery's date.
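The filtering effect of the inner join can be checked on toy data. The sketch below compares the join form with an alternative that places the same idea in the `WHERE` clause as a scalar subquery; the alternative is shown only for comparison and is not the notebook's approach. The `data` rows are hypothetical.

```python
import sqlite3

# Hypothetical toy stand-in for the `data` table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (CreatedDate TEXT, City TEXT)")
conn.executemany("INSERT INTO data VALUES (?, ?)", [
    ("2014-11-17 09:15:00", "BRONX"),
    ("2014-11-18 08:00:00", "BRONX"),
    ("2014-11-18 08:30:00", "BRONX"),
    ("2014-11-18 14:45:00", "QUEENS"),
    ("2014-11-18 15:00:00", None),
])

# Join form: the one-row subquery acts as a filter on the date
join_sql = """
    SELECT a.City, strftime('%H', a.CreatedDate) AS createdhour, count(*) AS countoccur
    FROM data a
    INNER JOIN (
        SELECT strftime('%Y-%m-%d', CreatedDate) AS createdymd, count(*) AS totalymd
        FROM data
        GROUP BY 1
        ORDER BY 2 DESC
        LIMIT 1
    ) sq ON strftime('%Y-%m-%d', a.CreatedDate) = sq.createdymd
    WHERE a.City IS NOT NULL
    GROUP BY 1, 2
    ORDER BY 1, 2
"""

# Alternative form: the same filter as a scalar subquery in WHERE
where_sql = """
    SELECT a.City, strftime('%H', a.CreatedDate) AS createdhour, count(*) AS countoccur
    FROM data a
    WHERE strftime('%Y-%m-%d', a.CreatedDate) = (
              SELECT strftime('%Y-%m-%d', CreatedDate)
              FROM data
              GROUP BY 1
              ORDER BY count(*) DESC
              LIMIT 1
          )
      AND a.City IS NOT NULL
    GROUP BY 1, 2
    ORDER BY 1, 2
"""

join_rows = conn.execute(join_sql).fetchall()
where_rows = conn.execute(where_sql).fetchall()
print(join_rows)  # [('BRONX', '08', 2), ('QUEENS', '14', 1)]
```

The row from `2014-11-17` and the row with a NULL city are both absent from the result, and the two forms return the same rows on this data.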

### What are your questions so far?

### LEFT JOIN WITH SUBQUERY AND MULTIPLE SUBQUERIES

From the `data` table, for each `city`, return the count of tickets per hour on the busiest day, i.e. the `CreatedDate` day with the most events.

Hint: it's `2014-11-18` (8466 events), but how do we find that date dynamically in code?

This time, **only include a `City` if it has at least one event on that day**.

Additionally, for each included `City`, every hour that occurs on that day should appear, with `countoccur` set to 0 when the city has no events in that hour.

*    Columns
    *    `City`
    *    `createdHour`
    *    `countoccur`: the count of events. **REMEMBER: This should be 0 if there aren't any events for that city/hour**


*    Exclude NULL cities i.e. `WHERE city IS NOT NULL`


*    Sort
    *   `City` in ascending order
    *   `createdHour` in ascending order

**Pseudocode:**

*    Need to find the biggest day. This is query `b` below.

*    Need to get all the hours that occur on the biggest day. This is query `c` below.

*    Need to get all the cities/hour combinations possible for the biggest day. This is query `aa` below.

*    Need to get the counts for each city/hour on the biggest day. This is query `bb` below.

*    Need to join query `aa` with query `bb`, keeping every row of `aa` (a `LEFT JOIN`). This is the overall, outermost query.

*    produce `SELECT` statement

*    `GROUP BY`
*    `ORDER BY`

### Note that there are 5 distinct queries/subqueries:

a.  The base `data` table, or the main table in the query.

b.  Gets the date with the most events. This subquery is written out (and executed) twice: once inside `aa` and once inside `bb`.

c.  Gets the distinct date/hour pairs. Joining it on the date from `b` keeps only the hours that occur on the date with the most events.

aa.  Gets a distinct list of city/createdhour for the date with the most events. Depends on the date in `b`.

bb.  Gets the number of events for city/hour. SAME AS INNER JOIN EXAMPLE ABOVE. Depends on the date in `b`.
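The pieces listed above can be seen working together on a small scale. The sketch below is a simplified variant, not the notebook's single-statement query: it computes the biggest day first (step `b`) and passes it in as a named parameter, then builds the city/hour scaffold (`aa`), the counts (`bb`), and the `LEFT JOIN` with `COALESCE`. The `data` rows are hypothetical toy values.

```python
import sqlite3

# Hypothetical toy stand-in for the `data` table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (CreatedDate TEXT, City TEXT)")
conn.executemany("INSERT INTO data VALUES (?, ?)", [
    ("2014-11-17 09:15:00", "BRONX"),
    ("2014-11-18 08:00:00", "BRONX"),
    ("2014-11-18 08:30:00", "BRONX"),
    ("2014-11-18 14:45:00", "QUEENS"),
])

# Step b, run separately here for readability (the notebook inlines it)
(day,) = conn.execute("""
    SELECT strftime('%Y-%m-%d', CreatedDate)
    FROM data
    GROUP BY 1
    ORDER BY count(*) DESC
    LIMIT 1
""").fetchone()

result = conn.execute("""
    SELECT aa.City, aa.createdhour, COALESCE(bb.countoccur, 0) AS countoccur
    FROM (
        -- aa: every city active that day paired with every hour seen that day
        SELECT DISTINCT a.City, c.createdhour
        FROM data a
        CROSS JOIN (
            -- c: distinct hours on the biggest day
            SELECT DISTINCT strftime('%H', CreatedDate) AS createdhour
            FROM data
            WHERE strftime('%Y-%m-%d', CreatedDate) = :day
        ) c
        WHERE strftime('%Y-%m-%d', a.CreatedDate) = :day
          AND a.City IS NOT NULL
    ) aa
    LEFT JOIN (
        -- bb: actual event counts per city/hour on the biggest day
        SELECT City, strftime('%H', CreatedDate) AS createdhour, count(*) AS countoccur
        FROM data
        WHERE strftime('%Y-%m-%d', CreatedDate) = :day
          AND City IS NOT NULL
        GROUP BY 1, 2
    ) bb
      ON aa.City = bb.City
     AND aa.createdhour = bb.createdhour
    ORDER BY 1, 2
""", {"day": day}).fetchall()

for row in result:
    print(row)
```

On this data, `BRONX` has no events at hour `14` and `QUEENS` has none at hour `08`. Those city/hour pairs exist in `aa` but not in `bb`, so the `LEFT JOIN` yields NULL for `bb.countoccur` and `COALESCE` turns it into 0.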

In [None]:
def left_join_example():

    # count events per city/hour on the biggest day, with 0 for hours that have no events
    query_left_join = '''
                SELECT aa.city
                        ,aa.createdhour
                        ,COALESCE(bb.countoccur,0) countoccur

                FROM ( --this gets a distinct list of city/createdhour for the date with the most events
                          SELECT DISTINCT a.city,c.createdhour
                          FROM data a
                          INNER JOIN 
                              ( --this join gets the date with the most events
                                SELECT strftime('%Y-%m-%d',CreatedDate) createdymd
                                        ,count(*) totalymd
                                FROM data
                                GROUP BY 1
                                ORDER BY 2 DESC
                                LIMIT 1
                              ) b on strftime('%Y-%m-%d',a.CreatedDate)=b.createdymd
                  
                          LEFT JOIN 
                                      ( --this join gets the distinct hours on the date with the most events
                                        SELECT distinct strftime('%Y-%m-%d',CreatedDate) createdymd
                                                        ,strftime('%H',CreatedDate) createdhour
                                        FROM data

                                      ) c on strftime('%Y-%m-%d',a.CreatedDate)=c.createdymd
                          WHERE a.city IS NOT NULL
                        ) aa

                    LEFT JOIN ( --this join gets the number of events for city/hour. SAME AS JOIN EXAMPLE ABOVE
                                  SELECT a.city
                                          ,strftime('%H',CreatedDate) createdhour
                                          ,count(*) countoccur
                                  FROM data a
                                  INNER JOIN ( --this join gets the date with the most events
                                                SELECT strftime('%Y-%m-%d',CreatedDate) createdymd
                                                        ,count(*) totalymd
                                                FROM data
                                                GROUP BY 1
                                                ORDER BY 2 DESC
                                                LIMIT 1
                                              ) b on strftime('%Y-%m-%d',a.CreatedDate)=b.createdymd
                                  WHERE a.city IS NOT NULL
                                  GROUP BY 1,2
                               ) bb 
                ON aa.city = bb.city 
                AND aa.createdhour = bb.createdhour

                ORDER BY 1,2
                '''
    return query_left_join

df_left_join_example = pd.read_sql(left_join_example(),conn_nyc)
display(df_left_join_example)

## What questions do you have on subqueries?