<a href="https://colab.research.google.com/github/gt-cse-6040/bootcamp/blob/main/SQL/syllabus/SQL1nb4_SQL_NULLs_FA25.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SQL: Working with NULLs

In [None]:
!wget https://github.com/gt-cse-6040/bootcamp/raw/main/SQL/syllabus/NYC-311-2M_small.db

In [None]:
# create a connection to the database
import sqlite3 as db
import pandas as pd

# Connect to a database (or create one if it doesn't exist)
conn_nyc = db.connect('NYC-311-2M_small.db')

## NULL Values

#### In many reporting applications, users will not want to have NULL values in their data sets.

What they ask for is that NULL values be detected and filled in with some discrete value.

### SQLite (like most SQL databases) provides two functions for this purpose:  IFNULL() and COALESCE().


**Keep in mind that aggregate functions (typically) ignore NULL values.**

**This treatment of NULL values is important for students to understand, and it is the reason for many student questions and issues.**

**It is this NULL value treatment that we focus on here.**
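
As a minimal sketch of how aggregates treat NULLs, the cell below builds a small in-memory table. The table name `scores` and its rows are hypothetical and are not part of the NYC 311 database. It compares COUNT( * ), which counts rows, with COUNT(column), SUM, and AVG, which skip NULL values.

```python
import sqlite3

# Hypothetical in-memory table, used only for illustration.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE scores (student TEXT, score INTEGER)")
conn_demo.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("a", 10), ("b", None), ("c", 20)],  # student "b" has a NULL score
)

# COUNT(*) counts every row; COUNT(score), SUM and AVG ignore the NULL.
agg_row = conn_demo.execute(
    "SELECT COUNT(*), COUNT(score), SUM(score), AVG(score) FROM scores"
).fetchone()
print(agg_row)  # (3, 2, 30, 15.0)
conn_demo.close()
```

Note that AVG divides by the number of non-NULL scores (2), not by the number of rows (3).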


### While many SQL developers use the two functions interchangeably, they operate a bit differently, so what we want to do is ensure that students understand how they differ.

## IFNULL() in SQLite

### The purpose of this function is to evaluate a single value to determine if it is NULL or not.

    --If the value is not NULL, then the function returns the value as its result.

    --If the value is NULL, it returns a designated result, substituting it for the NULL value.

Databases provide this functionality with slightly different syntax. Developers using SQL should be aware of the syntax that their database instance uses.

See this link for an overview of the different syntaxes:  https://www.w3schools.com/sql/sql_isnull.asp

### Usage of IFNULL():

#### The IFNULL syntax is as follows:  IFNULL(value to evaluate, value to return if evaluated one is NULL)

    --The function evaluates the first value to determine whether it is NULL. If this value is not NULL, then it returns the value itself.

    --If the evaluated value is NULL, then it returns the second value.
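
The two rules above can be sketched on a tiny in-memory table. The table `calls` and its three rows are hypothetical, chosen only to show one NULL and two non-NULL values.

```python
import sqlite3

# Hypothetical in-memory table, used only for illustration.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE calls (id INTEGER, city TEXT)")
conn_demo.executemany(
    "INSERT INTO calls VALUES (?, ?)",
    [(1, "BROOKLYN"), (2, None), (3, "BRONX")],
)

# IFNULL returns city when it is not NULL, otherwise the fallback string.
ifnull_rows = conn_demo.execute(
    "SELECT id, IFNULL(city, 'No City') FROM calls ORDER BY id"
).fetchall()
print(ifnull_rows)  # [(1, 'BROOKLYN'), (2, 'No City'), (3, 'BRONX')]
conn_demo.close()
```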

#### Let's look at a simple example from the NYC 311 Calls database in Notebook 9.

For this exercise we have a subset of the data, consisting of one month from 2014.

Recall that not all of the rows have the CITY field filled in.

In [None]:
def ifnull_example():

    # display the list of cities
    query_nulls = '''
                SELECT DISTINCT CITY, COUNT(*)
                FROM DATA
                GROUP BY CITY
                ORDER BY COUNT(*) DESC
                '''
    return query_nulls

df_ifnull_example = pd.read_sql(ifnull_example(),conn_nyc)
display(df_ifnull_example)

We can see that there are over 9,000 rows in which the CITY field is NULL.

Remember how aggregates work, and compare the query result above with this one.

In [None]:
def ifnull_example2():

    # display the list of cities
    query_nulls2 = '''
                SELECT DISTINCT CITY, COUNT(CITY)
                FROM DATA
                GROUP BY CITY
                ORDER BY COUNT(CITY) DESC
                '''
    return query_nulls2

df_ifnull_example2 = pd.read_sql(ifnull_example2(),conn_nyc)
display(df_ifnull_example2)

#### So let's say that we want all of the rows in our result, and if the CITY field is NULL, then designate it as "No City".

#### We can use IFNULL() for this purpose.

The syntax here is straightforward: Evaluate the CITY field, and if it is NULL, return the value "No City".

In [None]:
def ifnull_example3():

    # display the list of cities
    query_nulls3 = '''
                SELECT DISTINCT IFNULL(CITY,'No City') as FULL_CITY, COUNT(IFNULL(CITY,'No City'))
                FROM DATA
                GROUP BY FULL_CITY
                ORDER BY COUNT(FULL_CITY) DESC
                '''
    return query_nulls3

df_ifnull_example3 = pd.read_sql(ifnull_example3(),conn_nyc)
display(df_ifnull_example3)

In [None]:
# def ifnull_example3a():  # this does not work, why?

#     # display the list of cities
#     query_nulls3a = '''
#                 SELECT DISTINCT IFNULL(CITY,'No City') as FULL_CITY, COUNT(FULL_CITY)
#                 FROM DATA
#                 GROUP BY FULL_CITY
#                 ORDER BY COUNT(FULL_CITY) DESC
#                 '''
#     return query_nulls3a

# df_ifnull_example3a = pd.read_sql(ifnull_example3a(),conn_nyc)
# display(df_ifnull_example3a)

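The above query does not work because FULL_CITY is an alias defined in the SELECT list. It is not a column of the underlying table, and SQLite does not let one expression in the SELECT list refer to an alias defined in that same list, so it cannot be **COUNTED** in the SELECT. The alias is available in later clauses such as GROUP BY and ORDER BY, which is why the earlier query could use it there.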
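
A minimal sketch of this behavior, assuming a hypothetical in-memory table `calls`: the first query references the alias inside the same SELECT list and raises an error, while the second computes the alias in a subquery so that the outer SELECT sees it as a real column. The subquery is just one possible workaround; repeating the IFNULL expression inside COUNT() is another.

```python
import sqlite3

# Hypothetical in-memory table, used only for illustration.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE calls (city TEXT)")
conn_demo.executemany(
    "INSERT INTO calls VALUES (?)", [("BRONX",), (None,), (None,)]
)

# Referencing the alias inside the same SELECT list raises an error.
try:
    conn_demo.execute(
        "SELECT IFNULL(city, 'No City') AS full_city, COUNT(full_city) "
        "FROM calls GROUP BY full_city"
    ).fetchall()
    alias_error = None
except sqlite3.OperationalError as exc:
    alias_error = str(exc)
print("Error:", alias_error)

# Workaround: define the alias in a subquery, then aggregate over it.
fixed_rows = conn_demo.execute(
    "SELECT full_city, COUNT(full_city) "
    "FROM (SELECT IFNULL(city, 'No City') AS full_city FROM calls) "
    "GROUP BY full_city ORDER BY full_city"
).fetchall()
print(fixed_rows)  # [('BRONX', 1), ('No City', 2)]
conn_demo.close()
```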

Note that in our COUNT(), we counted the value that IFNULL returns.

What if we did not, and only put CITY there?

Recall how aggregates work.

In [None]:
def ifnull_example4():

    # display the list of cities
    query_nulls4 = '''
                SELECT DISTINCT IFNULL(CITY,'No City') as FULL_CITY, COUNT(CITY)
                FROM DATA
                GROUP BY FULL_CITY
                ORDER BY COUNT(FULL_CITY) DESC
                '''
    return query_nulls4

df_ifnull_example4 = pd.read_sql(ifnull_example4(),conn_nyc)
display(df_ifnull_example4)

Again, recall how aggregates work.

In [None]:
def ifnull_example5():

    # display the list of cities
    query_nulls5 = '''
                SELECT DISTINCT IFNULL(CITY,'No City') as FULL_CITY, COUNT(*)
                FROM DATA
                GROUP BY FULL_CITY
                ORDER BY COUNT(FULL_CITY) DESC
                '''
    return query_nulls5

df_ifnull_example5 = pd.read_sql(ifnull_example5(),conn_nyc)
display(df_ifnull_example5)

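We can see from the two queries that COUNT( * ) counted every row, while COUNT(CITY) counted only the non-NULL values, even though we had used the IFNULL function. IFNULL only changes the value shown in the FULL_CITY output column; the CITY field itself is still NULL in those rows.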

**This illustrates the difference in how aggregations in SQL treat NULL values.**

**Students must remember this difference when writing their queries that use aggregations.**
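
The same contrast can be reproduced side by side in one query. This is a sketch on a hypothetical in-memory table `calls` with made-up rows, not the NYC 311 data: the group produced by IFNULL has three rows, yet COUNT(city) reports 0 for it because every `city` value in that group is NULL.

```python
import sqlite3

# Hypothetical in-memory table, used only for illustration.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE calls (city TEXT)")
conn_demo.executemany(
    "INSERT INTO calls VALUES (?)",
    [("BROOKLYN",), ("BROOKLYN",), (None,), (None,), (None,), ("BRONX",)],
)

# COUNT(*) counts rows per group; COUNT(city) counts non-NULL city values.
grouped_rows = conn_demo.execute(
    "SELECT IFNULL(city, 'No City') AS full_city, "
    "       COUNT(*) AS all_rows, "
    "       COUNT(city) AS non_null_rows "
    "FROM calls GROUP BY full_city "
    "ORDER BY all_rows DESC, full_city"
).fetchall()
print(grouped_rows)
# [('No City', 3, 0), ('BROOKLYN', 2, 2), ('BRONX', 1, 1)]
conn_demo.close()
```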

### What are your questions on IFNULL()?

## COALESCE() Function in SQLite

### Many SQL developers use COALESCE and IFNULL interchangeably, and in its most basic form, COALESCE does the same thing as IFNULL().

In [None]:
def ifnull_example6():

    # display the list of cities
    query_nulls6 = '''
                SELECT DISTINCT COALESCE(CITY,'No City') as FULL_CITY, COUNT(*)
                FROM DATA
                GROUP BY FULL_CITY
                ORDER BY COUNT(FULL_CITY) DESC
                '''
    return query_nulls6

df_ifnull_example6 = pd.read_sql(ifnull_example6(),conn_nyc)
display(df_ifnull_example6)

#### In the example above, COALESCE evaluated the CITY field and returned "No City" when it was NULL.

#### But COALESCE operates with a bit more functionality.

### We can pass in multiple values to COALESCE, and it will return the FIRST NON-NULL value.

In [None]:
def ifnull_example7():

    # display the list of cities
    query_nulls7 = '''
                SELECT DISTINCT COALESCE(NULL,CITY,'No City') as FULL_CITY, COUNT(*)
                FROM DATA
                GROUP BY FULL_CITY
                ORDER BY COUNT(FULL_CITY) DESC
                '''
    return query_nulls7

df_ifnull_example7 = pd.read_sql(ifnull_example7(),conn_nyc)
display(df_ifnull_example7)

#### While this example is a bit contrived, we can see that we passed in 3 parameters to COALESCE, and it returned the first one that was NOT NULL.

#### This is the advantage of COALESCE() over IFNULL(): it allows somewhat more complex logic in determining what to return.
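
A less contrived use is a chain of fallbacks across several columns. The sketch below assumes a hypothetical in-memory table `calls` with `city` and `borough` columns (made-up names and rows, not taken from the NYC 311 schema). COALESCE tries `city` first, then `borough`, and only then the literal default.

```python
import sqlite3

# Hypothetical in-memory table, used only for illustration.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE calls (id INTEGER, city TEXT, borough TEXT)")
conn_demo.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [
        (1, "ASTORIA", "QUEENS"),  # city present: city is returned
        (2, None, "BRONX"),        # city NULL: falls back to borough
        (3, None, None),           # both NULL: falls back to the literal
    ],
)

# COALESCE returns the first non-NULL argument, reading left to right.
coalesce_rows = conn_demo.execute(
    "SELECT id, COALESCE(city, borough, 'Unknown') FROM calls ORDER BY id"
).fetchall()
print(coalesce_rows)  # [(1, 'ASTORIA'), (2, 'BRONX'), (3, 'Unknown')]
conn_demo.close()
```

In SQLite, IFNULL() accepts exactly two arguments, so the same fallback chain would require nesting one IFNULL() call inside another.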

### When would we use one versus the other? The easiest way to think about it is that IFNULL() is your solution for simple evaluations and substitutions, while COALESCE() is the better choice when the substitution requires more complex logic, such as several fallback values.

## What are your questions on these functions?