# SQL Aggregate Functions

In [None]:
# !wget https://github.com/gt-cse-6040/bootcamp/raw/main/Module%201/Session%204/NYC-311-2M_small.db

In [None]:
# create a connection to the database
import sqlite3 as db
import pandas as pd

# Connect to a database (or create one if it doesn't exist)
conn_nyc = db.connect('NYC-311-2M_small.db')

## In Homework NB9, Part 1, we introduce the following SQL topics:

**In the bootcamp, we will not have any additional coverage of these topics.**

- Character Case and COLLATE NOCASE

- HAVING clause

- Renaming columns with AS

- ORDER BY

- IN clause for membership

- LIKE clause for finding strings (regex-similar functionality)

- DATE/TIME manipulation using SQLITE function STRFTIME

**In this bootcamp notebook, we look at the topic below in more detail, focusing on a couple of `gotchas` for students to be aware of.**

- Group by, Aggregations

**Finally, while NB9 Part 1 introduces Nested Queries, the bootcamp adds Common Table Expressions (CTEs) as another way to achieve similar functionality.**

- Nested Queries 

## Aggregate Functions

### Recall that aggregate functions perform a specific operation over all of the rows in a group (as defined by the GROUP BY clause).

### Aggregate functions differ from other functions in that they take many rows of input and return a single row of output for each group.


**Keep in mind that aggregate functions (typically) ignore NULL values.**

**This treatment of NULL values is important for students to understand, and it is the reason for many student questions and issues.**

**It is this NULL value treatment that we focus on here.**


The following table summarizes some useful SQL aggregations:

| Aggregate Function       | Description                                              |
|--------------------------|----------------------------------------------------------|
| ``COUNT( * )``           | total number (count) of all rows                         |
| ``COUNT( value )``       | counts rows where the value is non-NULL                  |
| ``AVG( value )``         | averages all non-NULL values                             |
| ``MIN( value )``         | returns the lowest non-NULL value                        |
| ``MAX( value )``         | returns the highest non-NULL value                       |
| ``TOTAL( value )``       | sum of all non-NULL values, always a float (0.0 if none) |
| ``SUM( value )``         | sum of all non-NULL values (NULL if none)                |



**A few notes about COUNT(), SUM() and TOTAL() concerning NULLs:**

The count(X) function returns a count of the number of times that X is not NULL in a group. The count(*) function (with no arguments) returns the total number of rows in the group.

The sum() and total() aggregate functions return the sum of all non-NULL values in the group. If there are no non-NULL input rows then sum() returns NULL but total() returns 0.0.

The result of total() is always a floating point value.

The result of sum() is an integer value if all non-NULL inputs are integers. If any input to sum() is neither an integer nor a NULL, then sum() returns a floating point value which is an approximation of the mathematical sum.

https://www.sqlite.org/lang_aggfunc.html
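The NULL rules quoted above can be checked on a tiny example. The following is a minimal sketch that uses a made-up in-memory table (`calls`, with hypothetical columns `city` and `minutes`), not the NYC 311 data:

```python
import sqlite3

# Toy in-memory table (hypothetical names and values, for illustration only)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (city TEXT, minutes INTEGER)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?)",
    [("Bronx", 10), ("Bronx", None), (None, 5), ("Queens", None)],
)

# COUNT(*) counts every row; COUNT(col), SUM, TOTAL, and AVG skip NULLs
row = conn.execute(
    "SELECT COUNT(*), COUNT(city), COUNT(minutes), "
    "SUM(minutes), TOTAL(minutes), AVG(minutes) FROM calls"
).fetchone()
print(row)  # (4, 3, 2, 15, 15.0, 7.5)

# With no non-NULL inputs, SUM returns NULL (None in Python), TOTAL returns 0.0
empty = conn.execute(
    "SELECT SUM(minutes), TOTAL(minutes) FROM calls WHERE city = 'Queens'"
).fetchone()
print(empty)  # (None, 0.0)

conn.close()
```

Note that `AVG` divides by the number of non-NULL values (2 here), not by the number of rows (4).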

#### SQLite does not include a whole lot of aggregate functions, as you can see.

Let's look at some examples from the NYC 311 Calls database in Notebook 9.

For this exercise we have a subset of data, consisting of one month from 2014.

In [None]:
def count_all_rows():

    # count(*) returns all rows, including NULLS
    query_nulls = '''
                SELECT COUNT(*)
                FROM DATA
                '''
    return query_nulls

df_count_all_rows = pd.read_sql(count_all_rows(),conn_nyc)
display(df_count_all_rows)

In [None]:
def count_non_null_rows():

    # COUNT(column) counts only the rows where that column is not NULL
    query_City = '''
                        SELECT COUNT(City)
                        FROM DATA
                        '''
    return query_City

df_count_non_null_rows = pd.read_sql(count_non_null_rows(),conn_nyc)
display(df_count_non_null_rows)

We can see from the two queries that there are 154,374 rows in the table, and COUNT( * ) included all of them. Because the City column contains NULL values, COUNT(City) returns a smaller number.

**This illustrates the difference in how aggregations in SQL treat NULL values.**

**Students must remember this difference when writing their queries that use aggregations.**
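One practical use of this difference is counting how many values are missing. The sketch below assumes a small made-up table named `data` with a single `City` column (toy values, not the real database) and shows two equivalent ways to count NULLs, plus a common mistake:

```python
import sqlite3

# Toy stand-in for the DATA table (hypothetical values)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (City TEXT)")
conn.executemany(
    "INSERT INTO data VALUES (?)",
    [("BROOKLYN",), (None,), ("BRONX",), (None,), ("BROOKLYN",)],
)

# All rows minus non-NULL rows leaves the NULL rows
null_by_difference = conn.execute(
    "SELECT COUNT(*) - COUNT(City) FROM data"
).fetchone()[0]

# Filtering with IS NULL gives the same answer
null_by_filter = conn.execute(
    "SELECT COUNT(*) FROM data WHERE City IS NULL"
).fetchone()[0]

# Comparing with '= NULL' never matches, so this counts nothing
null_by_equals = conn.execute(
    "SELECT COUNT(*) FROM data WHERE City = NULL"
).fetchone()[0]

print(null_by_difference, null_by_filter, null_by_equals)  # 2 2 0
conn.close()
```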

## Aggregations and String Case

### The other `gotcha` that students must be aware of is that, like Python, SQLite is case-sensitive by default when comparing string/varchar data.

#### Not every database makes this UPPER/LOWER case distinction, and it is usually a setting on the database instance itself that controls this behavior.

What this means is that, in SQLite and for this class, students need to be aware that any string functionality will treat a letter in UPPER case as different from the same letter in LOWER case.

This is important for aggregations, as the below examples demonstrate.
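As a quick check of this behavior, the sketch below compares two string literals directly in SQLite. It needs no table; the literals are just illustrative values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# SQLite returns 1 for true and 0 for false
result = conn.execute(
    "SELECT 'Astoria' = 'ASTORIA', "                   # default comparison
    "       'Astoria' = 'ASTORIA' COLLATE NOCASE, "    # case-insensitive collation
    "       UPPER('Astoria') = 'ASTORIA'"              # normalize case first
).fetchone()
print(result)  # (0, 1, 1)

conn.close()
```

The first comparison fails because the default comparison treats `A` and `a` as different characters; the other two succeed because the case difference is removed before comparing.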

Now let's look at a simple GROUP BY (again from NB 9 Part 1).

In [None]:
def simple_group_by():
    query_group = '''
                    SELECT City, COUNT(*)
                    FROM DATA
                    GROUP BY City
                    ORDER BY COUNT(*) DESC
                    LIMIT 5
                    '''
    return query_group

df_simple_group = pd.read_sql(simple_group_by(),conn_nyc)
display(df_simple_group)

As we can see, we simply returned the number of rows for each of the TOP 5 values in the `City` column.

Note that `NULL` values in the `City` column appear as `None` in the pandas output.

#### Now let's look at string manipulation functions in SQL.

https://www.sqlitetutorial.net/sqlite-string-functions/

Some of the ones that we will use in this class are UPPER, LOWER, and SUBSTR.

The string functions generally work in the same manner as their Python equivalents, but check the documentation for the specific syntax. For example, `SUBSTR` positions start at 1, not 0.
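To make the comparison with Python concrete, here is a small sketch that runs `UPPER`, `LOWER`, and `SUBSTR` on a literal and compares the results with the corresponding Python expressions. The literal `'Astoria'` is just an illustrative value:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# SUBSTR(text, start, length): start is 1-based in SQLite
sql_result = conn.execute(
    "SELECT UPPER('Astoria'), LOWER('Astoria'), SUBSTR('Astoria', 1, 3)"
).fetchone()

# The equivalent Python expressions; slicing is 0-based
s = "Astoria"
py_result = (s.upper(), s.lower(), s[0:3])

print(sql_result)  # ('ASTORIA', 'astoria', 'Ast')
print(py_result)   # ('ASTORIA', 'astoria', 'Ast')

conn.close()
```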

Let's look at the UPPER function for some specific things that you should know.

In [None]:
def upper_group_by():
    query_upper = '''
                    SELECT DISTINCT City, UPPER(City) as UPPER_CASE
                    FROM data
                    WHERE City != UPPER(City)
                    ORDER BY City
                    LIMIT 10
                    '''
    return query_upper

df_upper_group = pd.read_sql(upper_group_by(),conn_nyc)
display(df_upper_group)

In [None]:
def upper_group_by_2():
    query_upper2 = '''
                    SELECT DISTINCT City, UPPER(City) as UPPER_CASE
                    FROM data
                    WHERE City = UPPER(City)
                    ORDER BY City
                    LIMIT 10
                    '''
    return query_upper2

df_upper_group2 = pd.read_sql(upper_group_by_2(),conn_nyc)
display(df_upper_group2)

Seems simple, yes?

But the string functions take on a bit more complexity when you are doing aggregations.

Let's look at another example.

In [None]:
def upper_compare():
    query_upper_compare = '''
                    SELECT City, COUNT(*)
                    FROM DATA
                    WHERE UPPER(City) = 'ASTORIA'
                    GROUP BY City
                    ORDER BY City
                    '''
    return query_upper_compare

df_upper_compare = pd.read_sql(upper_compare(),conn_nyc)
display(df_upper_compare)

We can see that there are two possible spellings for this city, and SQLite considers them to be different.

Let's extend the CITY query from above, to include more than 5 rows.

In [None]:
def simple_group_by_2():
    query_group_2 = '''
                    SELECT City, COUNT(*)
                    FROM DATA
                    GROUP BY City
                    ORDER BY COUNT(*) DESC
                    LIMIT 15
                    '''
    return query_group_2

df_simple_group_2 = pd.read_sql(simple_group_by_2(),conn_nyc)
display(df_simple_group_2)

### Note that the cities `Astoria`, `Jamaica`, and `Flushing` all have two different entries in the `City` column.

#### Students need to understand this behavior when grouping aggregations, and account for it.

#### The best way is to normalize the case with a string function, such as UPPER() or LOWER(), or to group with the COLLATE NOCASE collation.

Let's look at the two examples below.

In this first example, see the two ways of handling the case-sensitive grouping. The first query groups by the column alias, and the second (commented out) groups by the `UPPER()` expression itself.

Either method is fine, and neither is better than the other.

In [None]:
def upper_group_by():
    # group by the column alias
    query_upper_group = '''
                    SELECT UPPER(City) AS CITY, COUNT(*)
                    FROM DATA
                    GROUP BY CITY
                    ORDER BY COUNT(*) DESC
                    LIMIT 10
                    '''
    # group by the UPPER function (alternative version; uncomment to try it)
#     query_upper_group = '''
#                 SELECT UPPER(City) AS CITY, COUNT(*)
#                 FROM DATA
#                 GROUP BY UPPER(City)
#                 ORDER BY COUNT(*) DESC
#                 LIMIT 10
#                 '''
    
    return query_upper_group

df_upper_group = pd.read_sql(upper_group_by(),conn_nyc)
display(df_upper_group)

**Case-insensitive grouping: `COLLATE NOCASE`.** Another way to carry out the preceding query in a case-insensitive way is to add a `COLLATE NOCASE` qualifier to the `GROUP BY` clause.

The next example demonstrates this clause. 

Note that the two query versions return slightly different results.

Take the city `Jamaica` for example. Both queries report a count of 3,260 for this city, but they return different `CityName` values. The first query always shows the upper-case spelling, while the second (commented-out) query selects the bare `City` column, so SQLite takes the spelling from an arbitrary row in the group. There is no rule for which spelling it will return, so students must be aware of this and decide how they want the data to appear in their results.
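The sketch below illustrates this on a made-up three-row table (toy values, not the real data). The helper `grouped` is hypothetical and exists only for this example. Using `MIN(City)` as the label is one possible way to get a predictable spelling; it is an assumption of this sketch, not something NB9 requires:

```python
import sqlite3

# Toy stand-in for the DATA table (hypothetical values)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (City TEXT)")
conn.executemany(
    "INSERT INTO data VALUES (?)",
    [("Jamaica",), ("JAMAICA",), ("JAMAICA",)],
)

def grouped(select_expr):
    """Group case-insensitively and return (label, count) for the single group."""
    sql = f"SELECT {select_expr}, COUNT(*) FROM data GROUP BY City COLLATE NOCASE"
    return conn.execute(sql).fetchone()

bare = grouped("City")          # spelling comes from an arbitrary row in the group
upper = grouped("UPPER(City)")  # label is always the upper-case spelling
lowest = grouped("MIN(City)")   # smallest spelling under the default ordering

print(bare[1], upper, lowest)  # 3 ('JAMAICA', 3) ('JAMAICA', 3)
conn.close()
```

The count is 3 in every case; only the label differs, and only the bare `City` version leaves the choice of spelling up to SQLite.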

> Finally, the `COLLATE NOCASE` clause modifies the column next to which it appears. So if you are grouping by more than one key and want to be case-insensitive, you need to write, `... GROUP BY ColumnA COLLATE NOCASE, ColumnB COLLATE NOCASE ...`.
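A minimal sketch of this point, using a made-up table with two text columns (`City` and a hypothetical `Borough` column, toy values only), compares applying `COLLATE NOCASE` to one key versus both keys:

```python
import sqlite3

# Toy table; the column names and values are assumptions for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (City TEXT, Borough TEXT)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [
        ("Jamaica", "QUEENS"),
        ("JAMAICA", "Queens"),
        ("JAMAICA", "QUEENS"),
        ("Astoria", "QUEENS"),
    ],
)

def group_sizes(group_by):
    """Return the sorted group sizes for the given GROUP BY expression."""
    rows = conn.execute(f"SELECT COUNT(*) FROM data GROUP BY {group_by}")
    return sorted(count for (count,) in rows)

# Only City is case-insensitive, so 'QUEENS' and 'Queens' still split the group
one_key = group_sizes("City COLLATE NOCASE, Borough")

# Both keys are case-insensitive, so all three Jamaica rows land in one group
both_keys = group_sizes("City COLLATE NOCASE, Borough COLLATE NOCASE")

print(one_key)    # [1, 1, 2]
print(both_keys)  # [1, 3]
conn.close()
```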

In [None]:
def collate_group_by():
    # group by City using the case-insensitive NOCASE collation
    query_collate_group = '''
                    SELECT UPPER(City) AS CityName, COUNT(*)
                    FROM DATA
                    GROUP BY City COLLATE NOCASE
                    ORDER BY COUNT(*) DESC
                    LIMIT 10
                    '''
    
#     query_collate_group = '''
#                 SELECT City AS CityName, COUNT(*)
#                 FROM DATA
#                 GROUP BY City COLLATE NOCASE
#                 ORDER BY COUNT(*) DESC
#                 LIMIT 10
#                 '''
    
    return query_collate_group

df_collate_group = pd.read_sql(collate_group_by(),conn_nyc)
display(df_collate_group)

### So what happens if we don't handle the case-sensitivity?

#### We can see from the examples that differences in the case of `City` values give different results.

**So the takeaway is that we must ensure that we are correctly accounting for the data differences.**

Grouping in SQLite is case-sensitive by default, so we must ensure that our code recognizes and deals with this.
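To see the effect side by side, the sketch below runs a plain `GROUP BY` and an `UPPER()`-based `GROUP BY` on a made-up five-row table (toy values, not the real counts):

```python
import sqlite3

# Toy stand-in for the DATA table (hypothetical values)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (City TEXT)")
conn.executemany(
    "INSERT INTO data VALUES (?)",
    [("Jamaica",), ("JAMAICA",), ("JAMAICA",), ("Flushing",), ("FLUSHING",)],
)

# Case-sensitive grouping: each spelling becomes its own group
split_counts = conn.execute(
    "SELECT City, COUNT(*) FROM data GROUP BY City ORDER BY City"
).fetchall()

# Normalizing the case first merges the spellings into one group per city
merged_counts = conn.execute(
    "SELECT UPPER(City) AS CityName, COUNT(*) FROM data "
    "GROUP BY UPPER(City) ORDER BY CityName"
).fetchall()

print(split_counts)
# [('FLUSHING', 1), ('Flushing', 1), ('JAMAICA', 2), ('Jamaica', 1)]
print(merged_counts)
# [('FLUSHING', 2), ('JAMAICA', 3)]
conn.close()
```

Without the normalization, any "top cities" ranking built on these counts would understate each city and could list the same city twice.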

## **This is a simple example of 'dirty data', which is something that you will need to deal with throughout your Analytics career.**

### What are your questions on aggregations and groupings?