# Handling missing data in SQL and *pandas*

This Notebook covers the basics of recognising and handling the standard encoding of missing data in SQL and *pandas*.

We do not cover cases where the user has created their own sentinel values to capture semantic variations in missing data. In those cases, dealing with the reasons why data is missing is usually part of data cleansing and harmonisation, and requires decisions about appropriate alternative representations rather than simply recognising and manipulating missing values.

As always, we encourage you to extend these Notebooks as you uncover or refine your techniques for handling specific examples.

In [None]:
import pandas as pd

Missing data values in *pandas* are typically represented as `NaN` (not a number) sentinel values, which can be assigned as the built-in Python value `None`, or by using the *numpy* (numerical Python) library value `np.nan`.

SQL uses `NULL`, and languages such as R tend to use `NA` as the null marker.
We can achieve a similar effect by importing the *numpy* library (which is usually abbreviated to `np`), and then using the value `np.nan`. We can assign the value `np.nan` to the variable `NA`, so that we can write `NA` in place of `np.nan`.

In [None]:
import numpy as np

NA=np.nan

## Accessing the PostgreSQL database engine

As in the notebook `03.3 combining data from multiple datasets`, in this notebook we will use the installed version of PostgreSQL for some of the data manipulations.

As with that notebook, we first need to connect to the PostgreSQL system, and then have a way to tell Python to pass the SQL code to PostgreSQL for evaluation and to copy back into the Notebook any results tables we wish to capture.

For TM351, the connection is slightly different depending upon whether you are connecting to the Open University-hosted server (accessed through [tm351.open.ac.uk](https://tm351.open.ac.uk)), or a local server (accessed using docker or vagrant).

### Connecting to the database on [tm351.open.ac.uk](https://tm351.open.ac.uk)

If you are using the Open University hosted server, you should execute the following cell, using your OUCU as the value of `DB_USER`, and the password you were given at the beginning of the module.

If you need to, you can obtain your password again here: https://students.open.ac.uk/mct/tm351.

As with the test file you ran at the beginning of the module, you will need to provide your authentication credentials to connect with the database.

Click on the following cell, and replace the values `oucu123` and `tm351pwd` with your OUCU and your TM351 password (*not* your Open University password). Note that if the cell is in RAW NBconvert style, you will need to change its type to "code" in order to execute it.

### Connecting to the database on a locally hosted environment

If you are running the Jupyter server on your own machine, via Docker or Vagrant, you should execute the following cell. Note that if the cell is in RAW NBconvert style, you will need to change its type to Code in order to execute it.

### Making the connection

The settings from the previous cells allow us to create a *connection string*, which is used to set up the connection to the database. First we will import pandas, and load in the sql extensions:

In [None]:
import pandas as pd

In [None]:
%load_ext sql

Next, we can construct the connection string so that we can connect to the database:

In [None]:
# Use urllib.parse to escape any special characters in the password

import urllib.parse

DB_CONNECTION_STRING='{engine}://{user}:{pwd}@{addr}/{name}'.format(engine='postgresql',
                                                                    user=DB_USER,
                                                                    pwd=urllib.parse.quote_plus(DB_PWD),
                                                                    addr='localhost:5432',
                                                                    name=DB_USER)

#Preview the connection string
DB_CONNECTION_STRING
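To see why the password is escaped, consider a password containing characters that have a special meaning in a URL. The sketch below uses made-up credentials purely for illustration; `quote_plus` is the standard-library function used in the cell above.

```python
from urllib.parse import quote_plus

# Hypothetical credentials, for illustration only.
example_user = 'oucu123'
example_pwd = 'p@ss/w:rd'

# '@', '/' and ':' all have a meaning in a URL, so they are percent-encoded.
escaped_pwd = quote_plus(example_pwd)

example_connection = 'postgresql://{user}:{pwd}@{addr}/{name}'.format(
    user=example_user, pwd=escaped_pwd, addr='localhost:5432', name=example_user)

print(escaped_pwd)
print(example_connection)
```

Without the escaping, the extra `@` in the password would make it ambiguous where the password ends and the server address begins.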

And connect to the database:

In [None]:
%sql $DB_CONNECTION_STRING

## Putting missing data into a dataset

If you worked through the `03.3 combining data from multiple datasets` Notebook you will already have seen that operations such as the outer joins can generate rows in `DataFrame`s and SQL tables that contain the `NULL` or `NaN` or `None` markers.

It is also sometimes useful to be able to insert missing data markers directly into the `DataFrame` or table.

In [None]:
# In pandas we can use an appropriate sentinel value directly in place of an actual value.
ss_df = pd.DataFrame( { 'key':['a', 'b', 'c', NA, 'e', 'f'], 
                        'num':[1, None, 3, 4, np.nan, 5] })
# Notice that we've used 3 different representations of the no value 
# marker - NA, None and np.nan in the above,
# but when they're displayed in the pandas Dataframe they're 
# rendered as NaN, the pandas representation.

ss_df

SQL allows the NULL marker to be entered in most places where data values can be entered, specifically in the `INSERT INTO` command and `UPDATE` command.

In [None]:
%%sql
DROP TABLE IF EXISTS dummy;
CREATE TABLE dummy(key INT, name VARCHAR(20), value REAL);
INSERT INTO dummy VALUES(NULL,'This',12.1);
INSERT INTO dummy VALUES(2, NULL,345.00);
INSERT INTO dummy VALUES(3,'The other', NULL);

SELECT * FROM dummy;

Recall that the returned object is rendered by Python, so we're seeing Python's output representation of missing data in the result (in this case, the built-in Python value `None`), but the PostgreSQL database table will have SQL `NULL` markers in it.

One thing to note here is that `NaN`, `NA`, `NULL`, `None` and so on can be used whatever datatype is expected: each is a marker for missing data that is effectively typeless.
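Although the marker can be written into a column of any type, *pandas* may change the column's type to accommodate it. A minimal sketch, using made-up values: a column of integers becomes a column of floats as soon as it contains a missing value, because `NaN` is itself a floating-point value.

```python
import numpy as np
import pandas as pd

whole = pd.Series([1, 2, 3])           # no missing values
holed = pd.Series([1, None, 3])        # None is stored as NaN
words = pd.Series(['a', np.nan, 'c'])  # strings with a missing value

print(whole.dtype)   # an integer dtype
print(holed.dtype)   # a floating-point dtype
print(holed.isnull().tolist())
print(words.isnull().tolist())
```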

## Finding missing data markers

One of the most important issues is working out how to deal with *missing* data, and that starts with finding it.

Why do you think a condition like `name == NULL` is unlikely to work?

Well, if `NULL` represents missing data - that is, a data element that literally has no value - how could one missing value be the same as another missing value?

Both *pandas* and SQL take the same approach: they provide special condition tests to identify whether missing data markers are present.

*pandas* missing values can be identified using the `isnull()` method:

In [None]:
# We can use the vector-style application of isnull() to test every value in ss_df:
ss_df.isnull()
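A short sketch of why an equality test cannot find these markers in *pandas*: `NaN` is not equal to anything, including itself. The small `DataFrame` below is a made-up stand-in for `ss_df`, so that the example is self-contained.

```python
import numpy as np
import pandas as pd

example_df = pd.DataFrame({'key': ['a', np.nan, 'c'],
                           'num': [1, np.nan, 3]})

# NaN never compares equal to anything, not even to itself.
print(np.nan == np.nan)

# So an equality test matches no rows at all...
equal_test = example_df['num'] == np.nan
print(equal_test.tolist())

# ...whereas isnull() picks out the missing value.
null_test = example_df['num'].isnull()
print(null_test.tolist())

# The boolean result can be used to select the affected rows.
print(example_df[null_test])
```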

SQL has a similar conditional expression  `IS NULL` (and its converse `IS NOT NULL`):

In [None]:
%%sql
SELECT * 
FROM dummy
WHERE name IS NULL;

In [None]:
%%sql
SELECT *
FROM dummy
WHERE name IS NOT NULL;
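The failure of an equality test can be checked in SQL too. The sketch below uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for PostgreSQL (an assumption made only so that the example is self-contained). A comparison with `NULL` evaluates to unknown rather than true, so `name = NULL` selects no rows.

```python
import sqlite3

conn = sqlite3.connect(':memory:')   # in-memory stand-in for PostgreSQL
conn.executescript("""
    CREATE TABLE dummy(key INT, name VARCHAR(20), value REAL);
    INSERT INTO dummy VALUES(NULL, 'This', 12.1);
    INSERT INTO dummy VALUES(2, NULL, 345.00);
    INSERT INTO dummy VALUES(3, 'The other', NULL);
""")

# name = NULL is never true, so no rows are returned.
equals_rows = conn.execute("SELECT * FROM dummy WHERE name = NULL").fetchall()

# name IS NULL is the test that actually finds the missing names.
is_null_rows = conn.execute("SELECT * FROM dummy WHERE name IS NULL").fetchall()

print(equals_rows)
print(is_null_rows)
conn.close()
```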

## Doing something with missing data

*pandas* has several methods for handling `DataFrame`s with missing data.

SQL is generally a little poorer in this respect, forcing you to do much of the manipulation yourself.
We'll start by looking at the *pandas* methods, then see how we might shape a similar effect over an SQL table.

### *pandas*: replacing missing data with a value
We can replace a null marker using the `fillna()` method. 

By default this returns a new object (that is, the original object we apply the method to will not be changed). To make the original object change, we can add the `inplace=True` qualifier.

In [None]:
print(ss_df['num'].fillna(0))
# Note that ss_df hasn't been changed.
print(ss_df)

There are several useful parameters to `fillna()` allowing a range of different filling actions.

For more information, see the pandas docs: [pandas.DataFrame.fillna](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.fillna.html)
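As a hedged illustration of a couple of those options, the sketch below fills each column with a different value by passing a dictionary to `fillna()`, and separately carries the last seen value forward with `ffill()`. The data is made up for the example.

```python
import numpy as np
import pandas as pd

example_df = pd.DataFrame({'key': ['a', 'b', np.nan, 'd'],
                           'num': [1.0, np.nan, 3.0, np.nan]})

# A dictionary gives a different fill value for each column.
filled = example_df.fillna({'key': 'unknown', 'num': 0})
print(filled)

# ffill() propagates the last non-missing value forward.
carried = example_df['num'].ffill()
print(carried.tolist())

# The original DataFrame is unchanged, because neither call used inplace=True.
print(example_df.isnull().sum().sum())
```

Which fill is appropriate depends on what the missing value means in the dataset: a constant such as 0 changes totals and averages, while a forward fill only makes sense when the rows are in a meaningful order.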

### *pandas*: deleting rows containing missing data, various forms
We can drop rows containing a `NaN` value *anywhere in the row* using the `dropna()` method. Once again, this creates a new object unless we add the `inplace=True` qualifier.

In [None]:
ss_df.dropna()

To drop just those rows where there is a missing value in a particular column, we can use the `subset` parameter to specify which columns are of interest.

In [None]:
ss_df.dropna(subset=['key'])

Another useful parameter allows us to just drop rows where *all* the values are missing: `how='all'`.

To drop a *column* that is filled with NA values rather than a *row*, use `how='all', axis=1`.

For more information, see the pandas docs: [pandas.DataFrame.dropna](http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.DataFrame.dropna.html).  
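A minimal sketch of the `how='all'` and `axis=1` options, using a made-up `DataFrame` that has one entirely empty row and one entirely empty column:

```python
import numpy as np
import pandas as pd

example_df = pd.DataFrame({'key': ['a', np.nan, 'c'],
                           'num': [1.0, np.nan, np.nan],
                           'blank': [np.nan, np.nan, np.nan]})

# Drop only the rows in which every value is missing.
rows_kept = example_df.dropna(how='all')
print(rows_kept)

# Drop only the columns in which every value is missing.
cols_kept = example_df.dropna(how='all', axis=1)
print(cols_kept)
```

The row holding only `'c'` survives the first call, because `how='all'` requires every value in the row to be missing before the row is dropped.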

## And now with SQL

If the SQL used is a `SELECT` query, we're effectively creating a new table (the result of the `SELECT` query); to get the equivalent of the 'in place' behaviour we will need to explicitly update or delete the affected rows.

### SQL:  replacing missing data with a value

One approach is to use a `WHERE` clause to identify the rows with `NULL` in the specified column and process those appropriately. Because the `WHERE` clause copies only the rows that contain `NULL` to the result, it is then necessary to `UNION` those with the rows that did not contain `NULL`. (I did say you had a lot more work to do!)

If the aim is to create a new object rather than to change the table, the `COALESCE()` function offers a simpler route: it takes a series of arguments and returns the first one that is not null.

In [None]:
%%sql
SELECT COALESCE(key, 99999) AS key, 
       COALESCE(name, 'THIS ONE HAS NO NAME') as name, 
       COALESCE(value, 0.0) as value
FROM dummy;

In [None]:
%%sql
-- If COALESCE is not available.
-- UNION ALL is used rather than UNION so that duplicate rows are not removed.
SELECT key, name, value
FROM dummy
WHERE value IS NOT NULL
UNION ALL
SELECT key, name, 0.0
FROM dummy
WHERE value IS NULL;
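One caveat when combining the two halves of such a query: `UNION` removes duplicate rows from its result, whereas `UNION ALL` keeps them, so only `UNION ALL` is guaranteed to return one row per original row, as `COALESCE()` does. The sketch below illustrates this using Python's built-in `sqlite3` module as a stand-in for PostgreSQL (an assumption, to keep the example self-contained), with a made-up `readings` table containing two identical rows.

```python
import sqlite3

conn = sqlite3.connect(':memory:')   # in-memory stand-in for PostgreSQL
conn.executescript("""
    CREATE TABLE readings(name VARCHAR(20), value REAL);
    INSERT INTO readings VALUES('x', NULL);
    INSERT INTO readings VALUES('x', NULL);
    INSERT INTO readings VALUES('y', 2.5);
""")

# The same two-part query, joined with either UNION or UNION ALL.
query = """
    SELECT name, value FROM readings WHERE value IS NOT NULL
    {union}
    SELECT name, 0.0 FROM readings WHERE value IS NULL
"""

with_union = conn.execute(query.format(union='UNION')).fetchall()
with_union_all = conn.execute(query.format(union='UNION ALL')).fetchall()
with_coalesce = conn.execute(
    "SELECT name, COALESCE(value, 0.0) FROM readings").fetchall()

# Number of rows returned by each version of the query.
print(len(with_union), len(with_union_all), len(with_coalesce))
conn.close()
```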

To make this change in place requires an `UPDATE` statement.

In [None]:
%%sql
UPDATE dummy
SET value = 0.0 
WHERE value IS NULL;

-- and we can see the change has affected the dummy table
SELECT *  FROM dummy;

### SQL: deleting rows containing missing data, various forms
In an SQL `SELECT` query, you can simply choose the rows you want to keep, using increasingly complex conditions.

In [None]:
%%sql
-- Removing a row with NULL anywhere in it requires each column in the row to be checked 
-- for the presence of NULL.
SELECT * 
FROM dummy
WHERE key IS NOT NULL AND name IS NOT NULL AND value IS NOT NULL;

In [None]:
%%sql
-- Remove a row with NULL in a specific column in that row. 
SELECT * 
FROM dummy
WHERE name IS NOT NULL;

In [None]:
%%sql
-- Remove a row with NULL in ALL the columns.
SELECT * 
FROM dummy
WHERE NOT( key IS NULL AND name IS NULL AND value IS NULL);

We're forced to move to `UPDATE` and `DELETE` if we want the in-place effect of actually changing the underlying table.

In [None]:
%%sql
-- Removing a row with NULL anywhere requires each column in the row to be checked.
DELETE FROM dummy
WHERE key IS NULL OR name IS NULL OR value IS NULL;


SELECT * FROM dummy;

In [None]:
%%sql
-- Oops, I needed that - let's get the original table back
DROP TABLE IF EXISTS dummy;
CREATE TABLE dummy(key INT, name VARCHAR(20), value REAL);
INSERT INTO dummy VALUES(NULL,'This',12.1);
INSERT INTO dummy VALUES(2, NULL,345.00);
INSERT INTO dummy VALUES(3,'The other', NULL);

SELECT * FROM dummy;

In [None]:
%%sql
-- Remove a row with NULL in a specific column in that row.
DELETE
FROM dummy
WHERE name IS NULL;


SELECT * FROM dummy;

In [None]:
%%sql
-- Still needed that but let's add an entirely null row too!
DROP TABLE IF EXISTS dummy;
CREATE TABLE dummy(key INT, name VARCHAR(20), value REAL);
INSERT INTO dummy VALUES(NULL,'This',12.1);
INSERT INTO dummy VALUES(2, NULL,345.00);
INSERT INTO dummy VALUES(3,'The other', NULL);
INSERT INTO dummy VALUES(NULL, NULL, NULL);

SELECT * FROM dummy;

In [None]:
%%sql
-- Remove a row with NULL in ALL the columns.
DELETE 
FROM dummy
WHERE key IS NULL AND name IS NULL AND value IS NULL;

SELECT * FROM dummy;

### SQL: dropping a column where all rows had NULL in that column
SQL would be a bit messy for this.

Firstly, removing an entire column requires the `ALTER TABLE` command with the `DROP COLUMN <column name>` action. However, this has no conditional part, so you would need to wrap it in an SQL conditional statement.

Secondly, finding that a column had NULL in every row would require something like counting the number of rows that did not have `NULL` in that column and seeing if it was zero, but then also checking that you aren't looking at an empty table (one with no rows!).

Thirdly, the `IF` statement needed to wrap the condition around the `ALTER TABLE` statement is not part of plain SQL in PostgreSQL; it is only available in procedural code such as PostgreSQL functions.

And finally, parameterising the required function, so that you could apply this to different columns in different tables, would be ... difficult - the alternative would be to write a specific function for each table and column you might want to apply it to!

So it's probably best just to note a quick way to test for an entirely `NULL` column, and the manual application of the `ALTER TABLE` statement.

No promises that this will work in all situations but this might do it:

In [None]:
%%sql
-- Custom table, with one column entirely NULL
DROP TABLE IF EXISTS dummy2;
CREATE TABLE dummy2(key INT, name VARCHAR(20), value REAL);
INSERT INTO dummy2 VALUES(NULL,'This',NULL);
INSERT INTO dummy2 VALUES(2, NULL,NULL);
INSERT INTO dummy2 VALUES(3,'The other', NULL);
INSERT INTO dummy2 VALUES(NULL, NULL, NULL);

SELECT * FROM dummy2;

In [None]:
%%sql
-- The first part of the condition checks this is not an empty table.
--                                         i.e. a table with no rows.
-- The second part counts the number of rows where value is not null.
--                               If this is 0 then they are all null.
-- The 'dummy' outer SELECT simply returns the result of the condition.

-- So, if the table is not empty and there are 0 rows where value is not null
--        then you want to uncomment the code in the following cell

SELECT ( ( (SELECT COUNT(*) FROM dummy2) <> 0)
 AND
 ( (SELECT COUNT(*) FROM dummy2 WHERE value IS NOT NULL) = 0 ) );
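If the check is driven from Python rather than from a PostgreSQL function, parameterising it over tables and columns becomes more manageable. The following is only a sketch under stated assumptions: it uses the built-in `sqlite3` module as a stand-in for the PostgreSQL connection, the helper name `all_null_columns` is hypothetical, and the table and column names are assumed to come from a trusted source, because they are pasted directly into the SQL text. It only reports the candidate columns, leaving the `ALTER TABLE` step to be applied manually.

```python
import sqlite3

def all_null_columns(conn, table, columns):
    """Return the listed columns of a non-empty table that are NULL in every row."""
    n_rows = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    if n_rows == 0:
        # An empty table has no evidence either way, so report nothing.
        return []
    candidates = []
    for col in columns:
        # COUNT(column) counts only the rows where that column is not NULL.
        n_present = conn.execute(
            f'SELECT COUNT("{col}") FROM "{table}"').fetchone()[0]
        if n_present == 0:
            candidates.append(col)
    return candidates

conn = sqlite3.connect(':memory:')   # in-memory stand-in for PostgreSQL
conn.executescript("""
    CREATE TABLE dummy2(key INT, name VARCHAR(20), value REAL);
    INSERT INTO dummy2 VALUES(NULL, 'This', NULL);
    INSERT INTO dummy2 VALUES(2, NULL, NULL);
    INSERT INTO dummy2 VALUES(3, 'The other', NULL);
    INSERT INTO dummy2 VALUES(NULL, NULL, NULL);
    CREATE TABLE empty_table(key INT, value REAL);
""")

found = all_null_columns(conn, 'dummy2', ['key', 'name', 'value'])
found_in_empty = all_null_columns(conn, 'empty_table', ['key', 'value'])
print(found)
print(found_in_empty)
conn.close()
```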


## Processing with missing data elements

Finally - what happens to some standard processing if there is missing data?

In [None]:
%%sql
-- First let us remind ourselves what the dummy2 table looks like.
SELECT *
FROM dummy2;

Now let's apply some expressions and aggregations using rows and columns with `NULL` markers in them, and see what happens.

Try to predict what you think will happen before looking at the result of running the cell.

In [None]:
%%sql
SELECT (key + 1) AS plusone, name
FROM dummy2;


In [None]:
%%sql
SELECT name || ', add this to every name' AS nameandstring
-- Note that || is the string concatenation operator in SQL.
FROM dummy2;

In [None]:
%%sql 

-- This one usually catches people out.

SELECT COUNT(key) as number_of_keys, COUNT(name) as number_of_names, COUNT(*) as number_of_rows
FROM dummy2;

In [None]:
%%sql
SELECT SUM(key) as total_of_keys
FROM dummy2;

In [None]:
%%sql
SELECT *
FROM dummy2
ORDER BY key;

Now, as an exercise, why not find out what *pandas* does for the equivalent processing operations when `NaN`s are involved?

## Tidying up

As in the previous notebook, we have created some new tables here. We will not need them again, so we'll remove them before going further:

In [None]:
%%sql

DROP TABLE IF EXISTS dummy;
DROP TABLE IF EXISTS dummy2;

## Summary

The basic techniques for identifying missing data (`NULL`, `None`, `NaN`, etc.) are similar in SQL and *pandas*: recognise that the value is missing by testing for it, then use that knowledge to decide what to do to resolve the missing data.

There is also a standard set of *pandas* functions to operate on rows or columns containing missing values; in SQL it's quite a bit harder to get the same effect, but that's because SQL and *pandas* were created for different reasons. SQL is about the careful management and persistence of data, where restructuring and reshaping are rare; *pandas* is about data analysis, where reshaping and restructuring are vital.

The module material talks about using different sentinel values (i.e. special strings like 'not known'). But if you do this you will need to write functions and conditions that identify these values as representing specific conditions in the dataset - and then use these to identify the data as requiring special treatment.  Some specialist libraries already have these kinds of sentinel values and the functions required to operate on them.

In all cases, care will be needed to ensure you understand how missing data will be handled in complex expressions, calculations, aggregations, etc.

## What next?

If you are working through this Notebook as part of an inline exercise, return to the module materials now.
If you are working through this set of Notebooks as a whole, move on to the Part 4 Notebooks.