# How do we handle NULL values in SQL?

## Goals

In this case, we will work with a dataset that has a large number of missing values, the type of dataset that you will most likely be working with in a job setting. We hope to provide you with a framework for tackling missing data, a common problem in large datasets. You will learn to assess the cleaning needs of a dataset and come up with strategies to prepare it appropriately for further analysis, as well as practice writing SQL queries.

## Introduction

**Business Context.** You're a data analyst for the Los Angeles Lakers, and you currently work with the finance team. Part of your responsibility is to conduct analysis and determine a player's bonus.

![Basketball](data/images/Basketball_pic.png)

**Business Problem.** Your task is to first address the completeness of the dataset, and then assess the data for how much of a bonus should be paid.

**Analytical Context.** The data for a player is provided in a SQLite database, which you will load and query directly. Once you have explored and organized the data, your task is to report the amount of that player's bonus based on the following *five criteria:*

|Criteria   | Bonus Amount|
|:---|:---|
|Averages at least 30 points per game across the games where they are active| 500k|
|Plays at least 65% of the season| 500k|
|Plays against every team at least once during the season| 100k|
|Plays in more home games than away games| 250k|
|Plays in at least one game every month of the season| 50k|

Given the criteria above, analyze the season stats of LeBron James and determine how much his bonus should be. 

### Preparation

Load the following packages to begin working with SQLite.

### Initializing the database

Run the following code to load our database. 

In [None]:
%FETCH https://amzn-dana.workspace-lite.correlation-one.com/case.null_values_sql_jlite_fellow/files/data/bron.db bron

In [None]:
%LOAD bron RW

There is only one table in this database, called `king_james`. Let's query the first five rows of data to get an idea of what our table looks like.

In [None]:
SELECT * FROM king_james LIMIT 5;

The columns are easier to understand if you are familiar with sports statistics. As an analyst with a basketball team, you would likely be aware that the columns are as follows:
1. **G:** The number of games played by the LA Lakers. This runs from `1` to `82`, the number of regular season games the team played.
2. **GP:** The number of games LeBron has played in. For games he didn't play, the value is missing (NULL; some tools display it as "NaN", which stands for "not a number"). The first game he played in is `1`, then `2`, etc.
3. **Date:** The date of the game in the format year-month-day hour:minute:second. You do not need the time details, so it is okay that those are zeros. To learn more about SQL date formatting, or how to extract information from the timestamp, read the information [here.](https://dataschool.com/learn-sql/dates/) 
4. **Month:** The calendar month of the game.
5. **Year:** The calendar year of the game.
6. **Home/Away:** Home games are played in Los Angeles, away games are played at the opponent's stadium.
7. **Opp:** A three-letter designation of the team they played. For instance, 'GSW' stands for the 'Golden State Warriors'.
8. **GS:** Game started; shows if LeBron started the game. If he did, the value is `1`. Otherwise, the column is empty and registers as a missing value (NULL, which some tools display as "NaN").
9. **REBS:** Number of rebounds.
10. **AST:** Number of assists.
11. **STL:** Number of steals.
12. **BLK:** Number of blocks.
13. **PF:** Number of personal fouls.
14. **PTS:** Number of points scored.

Note that REBS, AST, STL, BLK, PF, and PTS are personal stats that give an idea of how the player performed in the game. High numbers are better than low numbers, except for PF. You can find more details on basketball statistics [here](https://en.wikipedia.org/wiki/Basketball_statistics) or [here.](https://www.basketball-reference.com/)
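
The **Date** entry in the column list mentions extracting information from a timestamp. As a minimal sketch, the snippet below uses Python's standard-library `sqlite3` module and SQLite's built-in `strftime()` and `date()` functions on a single made-up timestamp. The value of `game_date` is a hypothetical example in the format described above, not a row read from `king_james`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical timestamp in the year-month-day hour:minute:second format.
game_date = "2021-10-19 00:00:00"

# strftime() extracts individual parts of a timestamp as text;
# date() keeps only the year-month-day portion.
year, month, day_only = conn.execute(
    "SELECT strftime('%Y', ?), strftime('%m', ?), date(?);",
    (game_date, game_date, game_date),
).fetchone()

print(year, month, day_only)
conn.close()
```

The same `strftime()` and `date()` calls can be used directly in a SQL cell on a date column.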

### NULL values

A **NULL value** is a special marker used in SQL to indicate that a data value does not exist in the database.

It is important to note that while NULL values represent missing data, they are not the same as zero, and the distinction matters. Consider the BLK (number of blocks) column, which has both zeros (as in row 2) and NULL values (as in rows 4 and 5). A zero states two things: LeBron played in the game, and he didn't block any shots in it. A NULL means he didn't play in the game at all, so he couldn't have made any blocks. If the NULL values were stored as zeros, it would be much harder to find out how many games LeBron played in without blocking a shot.
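
To illustrate the difference between zero and NULL, here is a small sketch using Python's standard-library `sqlite3` module. The table `toy_games` and its six rows are invented for illustration and are not the real `king_james` data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Toy table (assumed data): BLK is NULL when the player sat out the game.
conn.execute("CREATE TABLE toy_games (G INTEGER, BLK INTEGER);")
conn.executemany(
    "INSERT INTO toy_games VALUES (?, ?);",
    [(1, 2), (2, 0), (3, 1), (4, None), (5, None), (6, 0)],
)

# Games played with zero blocks: only true zeros match.
zero_block_games = conn.execute(
    "SELECT COUNT(*) FROM toy_games WHERE BLK = 0;"
).fetchone()[0]

# Games not played: only NULLs match.
missed_games = conn.execute(
    "SELECT COUNT(*) FROM toy_games WHERE BLK IS NULL;"
).fetchone()[0]

print(zero_block_games, missed_games)
conn.close()
```

Because the two cases are stored differently, each question can be answered with a simple filter. If the missed games had been stored as zeros, both queries would collapse into one and the distinction would be lost.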

You will often work with data that has many missing values. Throughout this case, you will learn how to count NULL values, compute the percentage of data that is missing, and apply best practices for working with missing data in SQL.

### From looking at the data, why would it be important to identify the NULL values throughout each column?

1. Aggregating accurately: It is difficult to compute accurate means, medians and modes if a column holds zeros where it should hold NULLs. SQL aggregate functions such as `AVG()` ignore NULL values, so if you take the mean of a column with 30 rows, 10 of which are NULL, the function sums the 20 existing values and divides by 20.

2. Judging whether the data is usable: If too many values are missing, it might not even be worth analyzing the data. We need to decide the maximum share of NULL values we can tolerate and still conduct a valuable analysis. In the case of the basketball data we're using, the NULL values are important, because they represent games the player has not played. When computing performance data, we may need to know what is missing as much as we need to know what is there. If we were looking at survey data and more than half of it were missing, we might need to find a better dataset to represent the population.

In the data here, nulls carry information. They mean the player was not active, and did not accrue any statistics. We're going to use this dataset to practice counting NULL values and using them to find out how much data is present, and how much is missing.
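
The aggregation point above (30 rows, 10 of them NULL) can be checked with a short sketch. It uses Python's standard-library `sqlite3` module and an invented table named `toy_scores`. The score of 30 in every non-NULL row is an arbitrary choice that keeps the arithmetic easy to follow.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_scores (PTS INTEGER);")

# Assumed toy data: 20 rows with a value of 30 and 10 rows with NULL.
rows = [(30,)] * 20 + [(None,)] * 10
conn.executemany("INSERT INTO toy_scores VALUES (?);", rows)

avg_pts, total_rows, non_null_rows, per_row = conn.execute(
    """
    SELECT AVG(PTS),                  -- NULLs are skipped: 600 / 20
           COUNT(*),                  -- counts every row
           COUNT(PTS),                -- counts only non-NULL values
           1.0 * SUM(PTS) / COUNT(*)  -- divides by all rows instead: 600 / 30
    FROM toy_scores;
    """
).fetchone()

print(avg_pts, total_rows, non_null_rows, per_row)
conn.close()
```

`AVG(PTS)` divides by the 20 non-NULL values, while dividing the sum by `COUNT(*)` spreads the same total over all 30 rows and gives a lower figure.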

### Discovering NULL and Non-NULL values

As an example, we will start by calculating the points per game when the points are NULL. In this case, we're asking for an average of the NULL values, which should return nothing, unless we did something very, very wrong.

Notice the "2" in this query is part of the `ROUND()` function; it specifies the number of decimal places to which the average of the points per game should be rounded.

In [None]:
SELECT ROUND(AVG(PTS),2) AS Points_Per_Game 
FROM king_james 
WHERE PTS IS NULL;

We see that the output is `None`: every row that passes the filter has a NULL in `PTS`, so `AVG()` has no values to average and returns NULL.

Notice that when you are searching for NULL values in a specific column, you put the column name after `WHERE`, followed by `IS NULL`.
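
A common slip is to write `= NULL` instead of `IS NULL`. The sketch below, which uses Python's standard-library `sqlite3` module and an invented four-row table called `toy_games`, shows that a comparison with `= NULL` matches no rows, while `IS NULL` finds the missing values. It also shows that the NULL returned by `AVG()` over NULL-only rows arrives in Python as `None`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Toy table (assumed data): games 2 and 4 have no points recorded.
conn.execute("CREATE TABLE toy_games (G INTEGER, PTS INTEGER);")
conn.executemany(
    "INSERT INTO toy_games VALUES (?, ?);",
    [(1, 25), (2, None), (3, 31), (4, None)],
)

# Comparing with = NULL is never true, so no rows come back.
equals_null = conn.execute(
    "SELECT G FROM toy_games WHERE PTS = NULL;"
).fetchall()

# IS NULL is the correct test for missing values.
is_null = conn.execute(
    "SELECT G FROM toy_games WHERE PTS IS NULL ORDER BY G;"
).fetchall()

# Averaging only NULLs yields NULL, which sqlite3 returns as None.
avg_of_nulls = conn.execute(
    "SELECT AVG(PTS) FROM toy_games WHERE PTS IS NULL;"
).fetchone()[0]

print(equals_null, is_null, avg_of_nulls)
conn.close()
```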

Now, we replace `WHERE PTS IS NULL` with `WHERE PTS IS NOT NULL` to get the results we're looking for.

In [None]:
SELECT ROUND(AVG(PTS),2) AS Points_Per_Game 
FROM king_james 
WHERE PTS IS NOT NULL;

This gives us the answer to the first question regarding James' bonus:

>"Any player who averages at least 30 points in each game where the player was active automatically gets a 500k bonus."

If we had divided by the total number of games instead of ignoring the NULLs, we would have arrived at the wrong answer!

### Exercise 1

Write a query that returns the first 5 rows in the dataset where `GP` is NULL.

**Answer.**

### Exercise 2

Write a query that shows `G`, `HomeAway`, and `Opp` for only the games that LeBron DID NOT play in.

**Hint:** Consider when `GP` is NULL. There should be 26 rows in total.

**Answer.**

### Calculating percentages of null-values in a given column

To calculate the percentage of null-values, you will need to do the following:

1. Count the number of null values in the column
2. Get the count of the entire column
3. Divide the number of null values by the count of the entire column

Let's see an example and calculate the percent of NULL values for the `GP` column:

In [None]:
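-- Count only the rows where GP is NULL (games LeBron missed)
SELECT COUNT(*) AS Games_Missed
FROM king_james
WHERE GP IS NULL;

In [None]:
-- Count every row in the table (all games in the season)
SELECT COUNT(*) AS Total_Season_Games
FROM king_james;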

Notice that the first query counted only the rows where `GP` is NULL, and the second query counted all of the rows, including the NULLs. `COUNT(*)` counts every row, whether or not it contains values, unless you restrict the rows with a `WHERE` clause. We can calculate the percentage of games missed by dividing the number of NULLs, `26`, by the number of games, `82`.

This also reminds us that understanding the context is important. A sports analyst would be aware that a basketball regular season is 82 games, not including the playoffs. The Lakers did not make the playoffs in the 2021-2022 season.

Alternatively, you can perform the whole operation in one cell with a `CASE` expression. You can read more about `CASE` [here](https://www.w3schools.com/sql/sql_case.asp). In the query that follows, `CASE` marks each NULL row as `1` and each non-NULL row as `0`. We then sum those values, divide by the total count of rows, and multiply by 100 to get a percentage.

In [None]:
SELECT 100.0 * SUM(CASE WHEN GP IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS Percent_Null
FROM king_james;

This also allows us to calculate the second part of James' bonus: 

>"Has a player played at least 65% of the season? If yes, they get a 500K bonus." 

Since he missed 31.7% of the games, he played in the remaining 68.3%, which is more than 65%.
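
As a cross-check on this arithmetic, the percentage can also be computed without `CASE`, because `COUNT(column)` counts only non-NULL values while `COUNT(*)` counts every row. The sketch below uses Python's standard-library `sqlite3` module and an invented table, `toy_season`, built to match the counts quoted in the text (82 rows, 26 of them NULL). Which rows are NULL is an arbitrary assumption and does not reflect the real schedule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_season (G INTEGER, GP INTEGER);")

# Assumed toy data: 82 rows, the first 56 with a GP value, the last 26 NULL.
rows = [(g, g if g <= 56 else None) for g in range(1, 83)]
conn.executemany("INSERT INTO toy_season VALUES (?, ?);", rows)

pct_missed, pct_played = conn.execute(
    """
    SELECT ROUND(100.0 * (COUNT(*) - COUNT(GP)) / COUNT(*), 1),  -- share of NULLs
           ROUND(100.0 * COUNT(GP) / COUNT(*), 1)                -- share of non-NULLs
    FROM toy_season;
    """
).fetchone()

print(pct_missed, pct_played)
conn.close()
```

Multiplying by `100.0` rather than `100` keeps the division in floating point. With integer operands, SQLite would perform integer division and truncate the result.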

### Exercise 3

Which teams did LeBron miss the most games against? 

**Hint:** Use `GROUP BY` and `ORDER BY` in your answer. 

**Answer.**

### Exercise 4

Now, write a query to find out if James gets a bonus based on this criterion:

>Has a player played against every team during the season at least one time? If so, they get a 100K bonus.

**Answer.**

### Exercise 5

In order to find the answer to this question regarding James' bonus:

>"Has a player played in more home games than away games? If so, they get a 250K bonus."

Find out whether LeBron played in more `Home` or `Away` games.

**Answer.**

### Exercise 6

The last question to answer in order to calculate LeBron's total bonus is this:

>"Has a player played in a game every month of the season? If so, they get a 50K bonus."

Did LeBron play in at least one game each month?

**Answer.**

### Exercise 7
Given the criteria above, analyze the season stats of LeBron James and determine how much his bonus should be.

**Answer.**

### (Optional) Exercise 8 

What was the team's game number (`G`) for this player's last game of the season?
How many games had the player appeared in by then (`GP`)?
To answer these questions, you'll need to find the highest value in `GP` and the value of `G` in the same row.

**Answer.**

## Conclusion

NULL values are inevitable, but with much practice, curiosity, and application, you *will* master working with them! NULL values tell you that data is missing, and this is important information in your analysis.

## Takeaways
1. The difference between NULL values and zeros
2. How to use aggregations when working with NULL values
3. How to use case statements to help us calculate percentages of NULL values for specific columns

## Attribution

"Basketball pic", 4 October 2014, Pixabay, Creative Commons Zero, https://commons.wikimedia.org/wiki/File:Basketball_pic.png