# Nulls, CASE Expressions

We will cover `NULL` values and `CASE` expressions in this section. A `NULL` value is the absence of a value, much like `None` or `NaN` in Python. The `CASE` expression pairs conditions with resulting values, much like an `if`/`elif` chain in Python. 

## Setup 
First, let's get set up. Download the SQLite database file `company_operations.db` and connect to it. Also bring in `pandas` to display our SQL query results as a `DataFrame`. 

In [11]:
import sqlite3
import pandas as pd
import urllib.request

# download SQLite database and connect to it 
urllib.request.urlretrieve("https://github.com/thomasnield/anaconda_intro_to_sql/blob/main/company_operations.db?raw=true", "company_operations.db")
conn = sqlite3.connect('company_operations.db')

## NULL Values

Let's take a look at the `WEATHER_MONITOR` table. Sample these four records. 


In [12]:
sql = """
SELECT * FROM WEATHER_MONITOR 
WHERE REPORT_CODE IN ('LJVE08D', 'EP4AKZR', '1FC27OH', 'F4DEAK3') 
"""

pd.read_sql(sql, conn)


Unnamed: 0,ID,REPORT_CODE,REPORT_DATE,LOCATION_ID,TEMPERATURE,OVERCAST,RAIN,SNOW,LIGHTNING,HAIL,TORNADO
0,45,EP4AKZR,2021-03-11,32,64.5,0,,0.0,0,0,0
1,98,LJVE08D,2021-04-09,45,58.8,1,,0.0,0,0,0
2,1449,F4DEAK3,2021-01-15,17,35.3,1,0.0,,0,0,0
3,2967,1FC27OH,2020-12-01,16,89.7,0,,0.0,0,0,0


Note how some columns have values that are `NaN` or `None`, which indicate a `NULL` value. A null value is blank, meaning no value has been provided (not to be confused with `0` which is a value or an empty string `''`). 

Note that SQL databases store blank values as `NULL`, but pandas re-interprets them as `None` or `NaN`, depending on whether the column is numeric. 
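
To make the distinction concrete, here is a minimal sketch that uses a throwaway in-memory SQLite database rather than `company_operations.db`. It checks `NULL`, `0` and an empty string with `IS NULL` (SQLite reports true as `1` and false as `0`), and shows that Python's `sqlite3` module, without pandas involved, hands a `NULL` back as `None`. The variable names are made up for this illustration.

```python
import sqlite3

# Throwaway in-memory database, used only for this illustration
demo_conn = sqlite3.connect(":memory:")

# IS NULL yields 1 (true) or 0 (false) in SQLite
null_checks = demo_conn.execute(
    "SELECT NULL IS NULL, 0 IS NULL, '' IS NULL"
).fetchone()
print(null_checks)  # (1, 0, 0): only NULL counts as missing

# The sqlite3 module returns a SQL NULL as Python's None
fetched_null = demo_conn.execute("SELECT NULL").fetchone()
print(fetched_null)  # (None,)

demo_conn.close()
```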

If we have null values for rain, it might indicate that rain recordings were not possible because the instruments were broken. The same goes for the `SNOW` and other fields that are nullable. 

To filter for null values, use `IS NULL`. Below we find records without a recorded `RAIN` measurement. 

In [13]:
sql = """
SELECT * FROM WEATHER_MONITOR 
WHERE RAIN IS NULL 
"""

pd.read_sql(sql, conn)


Unnamed: 0,ID,REPORT_CODE,REPORT_DATE,LOCATION_ID,TEMPERATURE,OVERCAST,RAIN,SNOW,LIGHTNING,HAIL,TORNADO
0,9,G0UINBG,2021-05-04,14,62.2,1,,0.0,0,0,0
1,17,89U7PF3,2021-05-02,2,67.8,0,,0.0,0,0,0
2,45,EP4AKZR,2021-03-11,32,64.5,0,,0.0,0,0,0
3,80,EPQO1H8,2021-05-09,31,54.1,0,,0.0,0,0,0
4,92,XQBYZKA,2021-03-04,2,63.8,0,,0.0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...
157,2948,2GERHDS,2020-12-02,4,92.6,1,,0.0,0,0,0
158,2963,D8XXEUW,2020-12-19,2,97.8,0,,0.0,0,0,0
159,2967,1FC27OH,2020-12-01,16,89.7,0,,0.0,0,0,0
160,2973,OXDUXW5,2020-11-02,44,100.2,1,,0.0,0,0,0


To keep only the records that are not null, use `IS NOT NULL`. 


In [14]:
sql = """
SELECT * FROM WEATHER_MONITOR 
WHERE RAIN IS NOT NULL 
"""

pd.read_sql(sql, conn)


Unnamed: 0,ID,REPORT_CODE,REPORT_DATE,LOCATION_ID,TEMPERATURE,OVERCAST,RAIN,SNOW,LIGHTNING,HAIL,TORNADO
0,1,UVYMMWW,2021-03-20,0,66.0,1,3.74,0.0,0,0,0
1,2,7VVYE2L,2021-04-10,24,61.3,0,0.00,0.0,0,0,0
2,3,PJVNOSP,2021-02-26,32,61.6,1,1.58,0.0,0,0,0
3,4,3B19P7S,2021-05-30,39,66.3,0,0.00,0.0,0,0,0
4,5,EHVUPGY,2021-04-09,48,58.5,0,0.00,0.0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...
2833,2996,RWG4DY4,2020-11-26,6,106.6,1,0.00,0.0,0,0,0
2834,2997,9A9EQNQ,2020-11-29,12,85.8,1,2.24,0.0,0,0,0
2835,2998,C7ALVIK,2021-01-14,48,96.8,0,0.00,0.0,0,0,0
2836,2999,V660AWC,2021-01-10,29,103.6,1,2.73,0.0,0,0,0
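
A common slip is to write `= NULL` or `<> NULL` instead of `IS NULL` or `IS NOT NULL`. The sketch below, which assumes a hypothetical two-row table called `RAIN_DEMO` in an in-memory database, shows that the `=` and `<>` forms match nothing at all. The helper `codes()` is invented for this example.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
# Hypothetical miniature table, not the real WEATHER_MONITOR
demo_conn.execute("CREATE TABLE RAIN_DEMO (REPORT_CODE TEXT, RAIN REAL)")
demo_conn.executemany(
    "INSERT INTO RAIN_DEMO VALUES (?, ?)",
    [("A", 1.5), ("B", None)],
)

def codes(where_clause):
    """Return the REPORT_CODE values that satisfy the given WHERE clause."""
    sql = f"SELECT REPORT_CODE FROM RAIN_DEMO WHERE {where_clause} ORDER BY REPORT_CODE"
    return [row[0] for row in demo_conn.execute(sql)]

is_null = codes("RAIN IS NULL")          # ['B']
is_not_null = codes("RAIN IS NOT NULL")  # ['A']
equals_null = codes("RAIN = NULL")       # []
not_equals_null = codes("RAIN <> NULL")  # []

print(is_null, is_not_null, equals_null, not_equals_null)
demo_conn.close()
```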


Note that if you do not handle `NULL` values explicitly in your `WHERE` condition on a given column, rows where that column is `NULL` will be omitted. A comparison against `NULL` is never true, so if we qualify for records where `RAIN > 0`, the `NULL` values are left out. 

In [15]:
sql = """
SELECT * FROM WEATHER_MONITOR 
WHERE RAIN > 0 
"""

pd.read_sql(sql, conn)


Unnamed: 0,ID,REPORT_CODE,REPORT_DATE,LOCATION_ID,TEMPERATURE,OVERCAST,RAIN,SNOW,LIGHTNING,HAIL,TORNADO
0,1,UVYMMWW,2021-03-20,0,66.0,1,3.74,0.0,0,0,0
1,3,PJVNOSP,2021-02-26,32,61.6,1,1.58,0.0,0,0,0
2,8,R238Q5U,2021-05-15,39,62.9,1,1.45,0.0,0,0,0
3,14,DWGYB58,2021-05-01,17,60.0,1,2.92,0.0,1,0,0
4,15,6ALC472,2021-05-22,1,56.2,1,2.45,0.0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...
825,2991,B92HYWK,2020-11-30,17,98.3,1,0.91,0.0,0,0,0
826,2995,XV1UQZ4,2021-01-15,6,89.8,1,1.69,0.0,0,0,0
827,2997,9A9EQNQ,2020-11-29,12,85.8,1,2.24,0.0,0,0,0
828,2999,V660AWC,2021-01-10,29,103.6,1,2.73,0.0,0,0,0
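
The reason the `NULL` rows drop out is that the comparison itself evaluates to `NULL`, and `WHERE` keeps only rows whose condition is true. Here is a small sketch on a hypothetical three-row table (`RAIN_DEMO`, in an in-memory database) that selects the comparison result directly and then applies it as a filter. Negating the condition does not bring the `NULL` row back either.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
# Hypothetical miniature table, not the real WEATHER_MONITOR
demo_conn.execute("CREATE TABLE RAIN_DEMO (REPORT_CODE TEXT, RAIN REAL)")
demo_conn.executemany(
    "INSERT INTO RAIN_DEMO VALUES (?, ?)",
    [("A", 1.5), ("B", 0.0), ("C", None)],
)

# The comparison yields 1, 0, or NULL (None in Python)
comparisons = demo_conn.execute(
    "SELECT REPORT_CODE, RAIN > 0 FROM RAIN_DEMO ORDER BY REPORT_CODE"
).fetchall()
print(comparisons)  # [('A', 1), ('B', 0), ('C', None)]

# WHERE keeps only rows where the condition is true
kept = demo_conn.execute(
    "SELECT REPORT_CODE FROM RAIN_DEMO WHERE RAIN > 0"
).fetchall()
print(kept)  # [('A',)]

# NOT NULL is still NULL, so 'C' is excluded here as well
kept_negated = demo_conn.execute(
    "SELECT REPORT_CODE FROM RAIN_DEMO WHERE NOT (RAIN > 0)"
).fetchall()
print(kept_negated)  # [('B',)]

demo_conn.close()
```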


If you want to include `NULL` values in your condition, explicitly allow for `NULL`. 

In [16]:
sql = """
SELECT * FROM WEATHER_MONITOR 
WHERE RAIN IS NULL OR RAIN > 0 
"""

pd.read_sql(sql, conn)


Unnamed: 0,ID,REPORT_CODE,REPORT_DATE,LOCATION_ID,TEMPERATURE,OVERCAST,RAIN,SNOW,LIGHTNING,HAIL,TORNADO
0,1,UVYMMWW,2021-03-20,0,66.0,1,3.74,0.0,0,0,0
1,3,PJVNOSP,2021-02-26,32,61.6,1,1.58,0.0,0,0,0
2,8,R238Q5U,2021-05-15,39,62.9,1,1.45,0.0,0,0,0
3,9,G0UINBG,2021-05-04,14,62.2,1,,0.0,0,0,0
4,14,DWGYB58,2021-05-01,17,60.0,1,2.92,0.0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...
987,2991,B92HYWK,2020-11-30,17,98.3,1,0.91,0.0,0,0,0
988,2995,XV1UQZ4,2021-01-15,6,89.8,1,1.69,0.0,0,0,0
989,2997,9A9EQNQ,2020-11-29,12,85.8,1,2.24,0.0,0,0,0
990,2999,V660AWC,2021-01-10,29,103.6,1,2.73,0.0,0,0,0


A helpful function to know by heart is `COALESCE()`. It takes a possibly `NULL` value and substitutes a different value if it is indeed `NULL`. Otherwise it leaves the value alone. 

The first argument for `COALESCE()` is the value that might be `NULL`. The second argument is the fallback to use if it is indeed `NULL`. More generally, `COALESCE()` returns the first of its arguments that is not `NULL`. We can treat all `RAIN` values that are `NULL` as `0` in the `COALESCE()` below. 

In [17]:
sql = """
SELECT * FROM WEATHER_MONITOR 
WHERE COALESCE(RAIN,0) > 0 
"""

pd.read_sql(sql, conn)


Unnamed: 0,ID,REPORT_CODE,REPORT_DATE,LOCATION_ID,TEMPERATURE,OVERCAST,RAIN,SNOW,LIGHTNING,HAIL,TORNADO
0,1,UVYMMWW,2021-03-20,0,66.0,1,3.74,0.0,0,0,0
1,3,PJVNOSP,2021-02-26,32,61.6,1,1.58,0.0,0,0,0
2,8,R238Q5U,2021-05-15,39,62.9,1,1.45,0.0,0,0,0
3,14,DWGYB58,2021-05-01,17,60.0,1,2.92,0.0,1,0,0
4,15,6ALC472,2021-05-22,1,56.2,1,2.45,0.0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...
825,2991,B92HYWK,2020-11-30,17,98.3,1,0.91,0.0,0,0,0
826,2995,XV1UQZ4,2021-01-15,6,89.8,1,1.69,0.0,0,0,0
827,2997,9A9EQNQ,2020-11-29,12,85.8,1,2.24,0.0,0,0,0
828,2999,V660AWC,2021-01-10,29,103.6,1,2.73,0.0,0,0,0


As another example, to turn missing `RAIN` values into `-1`, we can use the `COALESCE` like this. 

In [18]:
sql = """
SELECT REPORT_CODE, 
RAIN, 
COALESCE(RAIN,-1) AS COALESCED_RAIN 

FROM WEATHER_MONITOR 
WHERE REPORT_CODE IN ('G0UINBG', 'PJVNOSP')
"""

pd.read_sql(sql, conn)


Unnamed: 0,REPORT_CODE,RAIN,COALESCED_RAIN
0,PJVNOSP,1.58,1.58
1,G0UINBG,,-1.0
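
`COALESCE()` is not limited to two arguments: it returns the first argument that is not `NULL`, so several fallbacks can be chained. The sketch below assumes a hypothetical table `GAUGE_DEMO` with a primary and a backup rain reading (neither column exists in `WEATHER_MONITOR`) and falls back from one to the other, then to `0`.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
# Hypothetical table with two rain readings per record
demo_conn.execute(
    "CREATE TABLE GAUGE_DEMO (ID INTEGER, PRIMARY_RAIN REAL, BACKUP_RAIN REAL)"
)
demo_conn.executemany(
    "INSERT INTO GAUGE_DEMO VALUES (?, ?, ?)",
    [(1, 2.5, 2.4), (2, None, 1.1), (3, None, None)],
)

# Take the primary reading, else the backup, else 0
filled = demo_conn.execute("""
SELECT ID, COALESCE(PRIMARY_RAIN, BACKUP_RAIN, 0) AS BEST_RAIN
FROM GAUGE_DEMO
ORDER BY ID
""").fetchall()

print(filled)  # [(1, 2.5), (2, 1.1), (3, 0)]
demo_conn.close()
```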


## CASE Expression

Take a look at the `TEMPERATURE` field in the table. 

In [19]:
sql = """
SELECT REPORT_CODE, TEMPERATURE
FROM WEATHER_MONITOR
"""

pd.read_sql(sql, conn)


Unnamed: 0,REPORT_CODE,TEMPERATURE
0,UVYMMWW,66.0
1,7VVYE2L,61.3
2,PJVNOSP,61.6
3,3B19P7S,66.3
4,EHVUPGY,58.5
...,...,...
2995,RWG4DY4,106.6
2996,9A9EQNQ,85.8
2997,C7ALVIK,96.8
2998,V660AWC,103.6


Let's say we wanted to categorize each temperature as `HOT`, `MILD` or `COLD`. To do this we would have to use a `CASE` expression and attach a condition to each label. Let's demonstrate: 

In [20]:
sql = """
SELECT REPORT_CODE, 
TEMPERATURE,

CASE 
  WHEN TEMPERATURE >= 78 THEN 'HOT'
  WHEN TEMPERATURE >= 60 THEN 'MILD'
  ELSE 'COLD'
END AS TEMPERATURE_LABEL

FROM WEATHER_MONITOR
"""

pd.read_sql(sql, conn)


Unnamed: 0,REPORT_CODE,TEMPERATURE,TEMPERATURE_LABEL
0,UVYMMWW,66.0,MILD
1,7VVYE2L,61.3,MILD
2,PJVNOSP,61.6,MILD
3,3B19P7S,66.3,MILD
4,EHVUPGY,58.5,COLD
...,...,...,...
2995,RWG4DY4,106.6,HOT
2996,9A9EQNQ,85.8,HOT
2997,C7ALVIK,96.8,HOT
2998,V660AWC,103.6,HOT


Note how we use a `CASE` to open up the `CASE` expression. Each `WHEN` specifies a condition and `THEN` specifies the resulting value if that condition is true. The conditions are evaluated from top to bottom, and the first one found to be true is the one that will be chosen. An `ELSE` can optionally be appended to specify a default value if none of the other conditions are met. In this case, we label any other record as `COLD` since we have already deduced it is not `HOT` or `MILD`. 
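
Because the first true condition wins, the order of the `WHEN` clauses matters. As a quick sketch, the two queries below label a single made-up temperature of `90.0` (supplied by an inline subquery, not taken from `WEATHER_MONITOR`). With the stricter test first it comes out `HOT`, but with the clauses swapped the looser `>= 60` test fires first and the `HOT` branch is never reached.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")

# Stricter condition first: 90.0 is caught by the HOT test
correct_order = demo_conn.execute("""
SELECT CASE
  WHEN TEMPERATURE >= 78 THEN 'HOT'
  WHEN TEMPERATURE >= 60 THEN 'MILD'
  ELSE 'COLD'
END
FROM (SELECT 90.0 AS TEMPERATURE)
""").fetchone()[0]

# Clauses swapped: 90.0 already satisfies >= 60, so MILD is chosen
swapped_order = demo_conn.execute("""
SELECT CASE
  WHEN TEMPERATURE >= 60 THEN 'MILD'
  WHEN TEMPERATURE >= 78 THEN 'HOT'
  ELSE 'COLD'
END
FROM (SELECT 90.0 AS TEMPERATURE)
""").fetchone()[0]

print(correct_order, swapped_order)  # HOT MILD
demo_conn.close()
```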

However, you have to be careful about `NULL` values if they are present in a column. If you use `ELSE 'COLD'` as above, and the `TEMPERATURE` field happens to have `NULL` values (there are three), then they will be labelled as `COLD`, because a comparison against `NULL` is never true and those records fall through to the `ELSE`. A better way to handle the `NULL` values might be to have an explicit condition for `COLD`, and then make the `ELSE` the catch-all for anomalies like `NULL` and label them `N/A`. 

In [21]:
sql = """
SELECT REPORT_CODE, 
TEMPERATURE,

CASE 
  WHEN TEMPERATURE >= 78 THEN 'HOT'
  WHEN TEMPERATURE >= 60 THEN 'MILD'
  WHEN TEMPERATURE < 60 THEN 'COLD'
  ELSE 'N/A'
END AS TEMPERATURE_LABEL

FROM WEATHER_MONITOR
"""

pd.read_sql(sql, conn)


Unnamed: 0,REPORT_CODE,TEMPERATURE,TEMPERATURE_LABEL
0,UVYMMWW,66.0,MILD
1,7VVYE2L,61.3,MILD
2,PJVNOSP,61.6,MILD
3,3B19P7S,66.3,MILD
4,EHVUPGY,58.5,COLD
...,...,...,...
2995,RWG4DY4,106.6,HOT
2996,9A9EQNQ,85.8,HOT
2997,C7ALVIK,96.8,HOT
2998,V660AWC,103.6,HOT
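
The sampled output above does not happen to include a `NULL` temperature, so here is a minimal sketch on a hypothetical four-row table (`TEMP_DEMO`, in an in-memory database) that puts the variants side by side: `ELSE` used as the `COLD` bucket, `ELSE` used as a catch-all, and no `ELSE` at all, in which case unmatched rows come back as `NULL`.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
# Hypothetical miniature table, not the real WEATHER_MONITOR
demo_conn.execute("CREATE TABLE TEMP_DEMO (ID INTEGER, TEMPERATURE REAL)")
demo_conn.executemany(
    "INSERT INTO TEMP_DEMO VALUES (?, ?)",
    [(1, 85.0), (2, 65.0), (3, 40.0), (4, None)],
)

labels = demo_conn.execute("""
SELECT ID,
  -- ELSE doubles as the cold bucket, so the NULL row lands in it too
  CASE
    WHEN TEMPERATURE >= 78 THEN 'HOT'
    WHEN TEMPERATURE >= 60 THEN 'MILD'
    ELSE 'COLD'
  END AS ELSE_AS_COLD,
  -- Explicit cold condition, so only the NULL row reaches the ELSE
  CASE
    WHEN TEMPERATURE >= 78 THEN 'HOT'
    WHEN TEMPERATURE >= 60 THEN 'MILD'
    WHEN TEMPERATURE < 60 THEN 'COLD'
    ELSE 'N/A'
  END AS ELSE_AS_CATCH_ALL,
  -- No ELSE: rows matching no condition become NULL
  CASE
    WHEN TEMPERATURE >= 78 THEN 'HOT'
  END AS NO_ELSE
FROM TEMP_DEMO
ORDER BY ID
""").fetchall()

for row in labels:
    print(row)
# (1, 'HOT', 'HOT', 'HOT')
# (2, 'MILD', 'MILD', None)
# (3, 'COLD', 'COLD', None)
# (4, 'COLD', 'N/A', None)
demo_conn.close()
```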


With a `CASE` expression, you can now aggregate on derived fields that did not exist in the table before. For example, we can get a `COUNT` of the number of records broken up by `TEMPERATURE_LABEL`. 

In [22]:
sql = """
SELECT 

CASE 
  WHEN TEMPERATURE >= 78 THEN 'HOT'
  WHEN TEMPERATURE >= 60 THEN 'MILD'
  WHEN TEMPERATURE < 60 THEN 'COLD'
  ELSE 'N/A'
END AS TEMPERATURE_LABEL,

COUNT(*) AS RECORD_COUNT

FROM WEATHER_MONITOR

GROUP BY TEMPERATURE_LABEL
"""

pd.read_sql(sql, conn)


Unnamed: 0,TEMPERATURE_LABEL,RECORD_COUNT
0,COLD,1498
1,HOT,908
2,MILD,591
3,N/A,3


As a side note, you might have figured out that `COALESCE` is effectively shorthand for a `CASE` expression that converts `NULL` values. Take our previous example showing the coalesced `RAIN` values. 

In [23]:
sql = """
SELECT REPORT_CODE, 
RAIN, 
COALESCE(RAIN,-1) AS COALESCED_RAIN 

FROM WEATHER_MONITOR 
WHERE REPORT_CODE IN ('G0UINBG', 'PJVNOSP')
"""

pd.read_sql(sql, conn)


Unnamed: 0,REPORT_CODE,RAIN,COALESCED_RAIN
0,PJVNOSP,1.58,1.58
1,G0UINBG,,-1.0


We can express this using a `CASE` expression. 

In [24]:
sql = """
SELECT REPORT_CODE, 
RAIN, 
CASE WHEN RAIN IS NULL THEN -1 ELSE RAIN END AS COALESCED_RAIN 

FROM WEATHER_MONITOR 
WHERE REPORT_CODE IN ('G0UINBG', 'PJVNOSP')
"""

pd.read_sql(sql, conn)


Unnamed: 0,REPORT_CODE,RAIN,COALESCED_RAIN
0,PJVNOSP,1.58,1.58
1,G0UINBG,,-1.0


## The NULL Case Trick 

Let's say you calculate the total rain broken up by `YEAR` and `MONTH`. 

In [25]:
sql = """
SELECT 
CAST(strftime('%Y', REPORT_DATE) AS INTEGER) AS YEAR, 
CAST(strftime('%m', REPORT_DATE) AS INTEGER) AS MONTH, 

SUM(RAIN) AS TOTAL_RAIN

FROM WEATHER_MONITOR 

GROUP BY YEAR, MONTH
"""

pd.read_sql(sql, conn)


Unnamed: 0,YEAR,MONTH,TOTAL_RAIN
0,2020,11,392.22
1,2020,12,433.16
2,2021,1,316.27
3,2021,2,138.07
4,2021,3,129.03
5,2021,4,153.79
6,2021,5,158.24


Now you want to break up that `TOTAL_RAIN` column into two columns, one for when a `TORNADO` was present and one for when there was not. What's the problem here? 

In [26]:
sql = """
SELECT 
CAST(strftime('%Y', REPORT_DATE) AS INTEGER) AS YEAR, 
CAST(strftime('%m', REPORT_DATE) AS INTEGER) AS MONTH, 

SUM(RAIN) AS TOTAL_TORNADO_RAIN,
SUM(RAIN) AS TOTAL_NON_TORNADO_RAIN

FROM WEATHER_MONITOR 

WHERE TORNADO = 1 
AND YEAR = 2021

GROUP BY YEAR, MONTH
"""

pd.read_sql(sql, conn)


Unnamed: 0,YEAR,MONTH,TOTAL_TORNADO_RAIN,TOTAL_NON_TORNADO_RAIN
0,2021,2,15.22,15.22
1,2021,3,24.92,24.92
2,2021,4,9.87,9.87
3,2021,5,19.88,19.88


That `WHERE` condition filters the records before either `SUM()` sees them, so it inconveniently commits both columns to `TORNADO` being 1, and they come out identical. But you can get around this using a `CASE` expression and putting the respective conditions there. Observe below how we intercept the values going into each `SUM()` by checking for the `TORNADO` condition, and if it fails then we add a `0` to the `SUM` instead. Clever, right? 

In [27]:
sql = """
SELECT 
CAST(strftime('%Y', REPORT_DATE) AS INTEGER) AS YEAR, 
CAST(strftime('%m', REPORT_DATE) AS INTEGER) AS MONTH, 

SUM(CASE WHEN TORNADO = 1 THEN RAIN ELSE 0 END) AS TOTAL_TORNADO_RAIN,
SUM(CASE WHEN TORNADO = 0 THEN RAIN ELSE 0 END) AS TOTAL_NON_TORNADO_RAIN

FROM WEATHER_MONITOR 

WHERE YEAR = 2021 

GROUP BY YEAR, MONTH
"""

pd.read_sql(sql, conn)


Unnamed: 0,YEAR,MONTH,TOTAL_TORNADO_RAIN,TOTAL_NON_TORNADO_RAIN
0,2021,1,0.0,316.27
1,2021,2,15.22,122.85
2,2021,3,24.92,104.11
3,2021,4,9.87,143.92
4,2021,5,19.88,138.36


However, a `0` for the false condition can be problematic for other aggregation operations like `MIN`, `MAX`, `AVG` and `COUNT`, because unlike with `SUM` the extra zeros change the result. You can instead use `NULL`, which is ignored by all the aggregation operators, including `SUM`. One side effect is that a group with no matching records yields `NULL` rather than `0`, as the January row below shows. 

In [28]:
sql = """
SELECT 
CAST(strftime('%Y', REPORT_DATE) AS INTEGER) AS YEAR, 
CAST(strftime('%m', REPORT_DATE) AS INTEGER) AS MONTH, 

SUM(CASE WHEN TORNADO = 1 THEN RAIN ELSE NULL END) AS TOTAL_TORNADO_RAIN,
SUM(CASE WHEN TORNADO = 0 THEN RAIN ELSE NULL END) AS TOTAL_NON_TORNADO_RAIN

FROM WEATHER_MONITOR 

WHERE YEAR = 2021 

GROUP BY YEAR, MONTH
"""

pd.read_sql(sql, conn)


Unnamed: 0,YEAR,MONTH,TOTAL_TORNADO_RAIN,TOTAL_NON_TORNADO_RAIN
0,2021,1,,316.27
1,2021,2,15.22,122.85
2,2021,3,24.92,104.11
3,2021,4,9.87,143.92
4,2021,5,19.88,138.36
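
With `SUM` the choice between `0` and `NULL` only shows up in empty groups, so the difference is easier to see with `AVG` and `COUNT`. The sketch below assumes a hypothetical four-row table (`STORM_DEMO`, in an in-memory database) with two tornado records and two non-tornado records, and computes the tornado-only aggregates both ways.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
# Hypothetical miniature table, not the real WEATHER_MONITOR
demo_conn.execute("CREATE TABLE STORM_DEMO (TORNADO INTEGER, RAIN REAL)")
demo_conn.executemany(
    "INSERT INTO STORM_DEMO VALUES (?, ?)",
    [(1, 2.0), (1, 4.0), (0, 1.0), (0, 3.0)],
)

# ELSE 0: the non-tornado rows still take part as zeros
with_zero = demo_conn.execute("""
SELECT
  SUM(CASE WHEN TORNADO = 1 THEN RAIN ELSE 0 END),
  AVG(CASE WHEN TORNADO = 1 THEN RAIN ELSE 0 END),
  COUNT(CASE WHEN TORNADO = 1 THEN RAIN ELSE 0 END)
FROM STORM_DEMO
""").fetchone()

# ELSE NULL: the non-tornado rows are ignored by every aggregate
with_null = demo_conn.execute("""
SELECT
  SUM(CASE WHEN TORNADO = 1 THEN RAIN ELSE NULL END),
  AVG(CASE WHEN TORNADO = 1 THEN RAIN ELSE NULL END),
  COUNT(CASE WHEN TORNADO = 1 THEN RAIN ELSE NULL END)
FROM STORM_DEMO
""").fetchone()

print(with_zero)  # (6.0, 1.5, 4): average and count are diluted by zeros
print(with_null)  # (6.0, 3.0, 2): only the two tornado rows are used
demo_conn.close()
```

The sums agree, but only the `NULL` version gives the average and count of the tornado records alone.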


This trick is easy to overlook, and it can save many messy queries and derived tables. Use it liberally! 

## EXERCISE

For each `LOCATION_ID`, calculate the previous year total rain `PY_RAIN` and current year total rain `CY_RAIN`. Replace the question marks `?` and assume 2021 is the current year. 

In [29]:
sql = """
SELECT 

LOCATION_ID,

SUM(
  CASE WHEN CAST(strftime('%Y', REPORT_DATE) AS INTEGER) = 2021 THEN RAIN ELSE 0 END
) AS CY_RAIN,

SUM(
  CASE WHEN CAST(strftime('%Y', REPORT_DATE) AS INTEGER) = 2020 THEN RAIN ELSE 0 END
) AS PY_RAIN

FROM WEATHER_MONITOR 

WHERE CAST(strftime('%Y', REPORT_DATE) AS INTEGER) IN (2020, 2021)

GROUP BY LOCATION_ID
"""

pd.read_sql(sql, conn)


Unnamed: 0,LOCATION_ID,CY_RAIN,PY_RAIN
0,0,18.99,24.72
1,1,22.51,45.02
2,2,30.72,16.04
3,3,15.57,22.8
4,4,30.17,30.15
5,5,40.71,7.92
6,6,13.05,13.66
7,7,43.46,29.04
8,8,23.19,10.02
9,9,12.9,17.84




### SCROLL DOWN FOR ANSWER
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
v 

In [30]:
sql = """
SELECT 

LOCATION_ID,

SUM(
  CASE WHEN CAST(strftime('%Y', REPORT_DATE) AS INTEGER) = 2021 THEN RAIN ELSE 0 END
) AS CY_RAIN,

SUM(
  CASE WHEN CAST(strftime('%Y', REPORT_DATE) AS INTEGER) = 2020 THEN RAIN ELSE 0 END
) AS PY_RAIN

FROM WEATHER_MONITOR 

WHERE CAST(strftime('%Y', REPORT_DATE) AS INTEGER) IN (2020, 2021)

GROUP BY LOCATION_ID
"""

pd.read_sql(sql, conn)


Unnamed: 0,LOCATION_ID,CY_RAIN,PY_RAIN
0,0,18.99,24.72
1,1,22.51,45.02
2,2,30.72,16.04
3,3,15.57,22.8
4,4,30.17,30.15
5,5,40.71,7.92
6,6,13.05,13.66
7,7,43.46,29.04
8,8,23.19,10.02
9,9,12.9,17.84
