# Missing Data
* Contact: Lachlan Deer, [econgit] @ldeer, [github/twitter] @lachlandeer

Real-world data is rarely clean and homogeneous (although our working labor market example is!). One of the main features of real-world data is missing values, so we have to know how to deal with them.

Pandas builds its handling of missing values on the NumPy package, which has no built-in notion of NA values for non-floating-point data types. This matters in practice: by default, when a numeric column contains missing data, pandas stores it as a floating-point data type.

In [None]:
import numpy as np
import pandas as pd

## `None`: Pythonic Missing Data

The Python default for missing information is `None`. Notice that `None` has a special type in Python:

In [None]:
type(None)

This means that `None` can only be stored in an array of 'object' type. An integer array cannot hold it:

In [None]:
test_array = np.array([1,2,3,4])
test_array.dtype

In [None]:
test_array[3] = None  # raises a TypeError: an integer array cannot store None

In [None]:
test_array2 = np.array([1,2,None,4])
test_array2

As a consequence, when there are `None` values in an array and we perform an aggregation, we get an error:

In [None]:
test_array2.sum()
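
The failure above can be made explicit by catching the exception. This is a minimal sketch; the names `obj_array` and `sum_failed` are introduced here purely for illustration:

```python
import numpy as np

# An array that contains None falls back to the generic 'object' dtype
obj_array = np.array([1, 2, None, 4])
print(obj_array.dtype)

# The aggregation fails because Python cannot add an int and None
try:
    obj_array.sum()
    sum_failed = False
except TypeError as err:
    sum_failed = True
    print("TypeError:", err)
```

Because every element of an 'object' array is a full Python object, operations on it also tend to be much slower than on arrays with a native numeric dtype.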

## NaN: missing numerical data

The other missing data representation is NaN, and it behaves differently from `None`: it is a special floating-point value:

In [None]:
type(np.nan)

In [None]:
test_array3 = np.array([1,2,np.nan,4])
test_array3.dtype

Notice that `np.nan` 'infects' every operation it is combined with: any arithmetic involving `np.nan` yields NaN as a result:

In [None]:
test_array3.sum()

In [None]:
10 + np.nan

In [None]:
np.log(np.nan)
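
A related quirk, sketched below, is that NaN never compares equal to anything, including itself, so an equality test cannot be used to find missing entries. `np.isnan` is the usual way to detect them in a floating-point array. The variable names are illustrative only:

```python
import numpy as np

# Under floating-point rules, NaN is not equal to itself
nan_equals_itself = (np.nan == np.nan)
print(nan_equals_itself)

# np.isnan returns a boolean mask marking the NaN entries
values = np.array([1, 2, np.nan, 4])
nan_mask = np.isnan(values)
print(nan_mask)
```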

NumPy does provide functions that work around these NaNs by ignoring them:

In [None]:
np.nansum(test_array3)

In [None]:
np.nanmin(test_array3), np.nanmean(test_array3)

## `NaN` and `None` in Pandas

Pandas is built to handle both, almost interchangeably:

In [None]:
pd.Series([None, 42, np.nan])

Because `np.nan` is a floating-point value, pandas type-casts integer data to floating point when a missing value is present:

In [None]:
pd.Series([1, 42])

In [None]:
pd.Series([np.nan, 42])
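
The same upcasting happens when the missing value is written as `None`. As a small sketch (variable names are illustrative), pandas converts the `None` to NaN and stores the whole Series as floats:

```python
import numpy as np
import pandas as pd

# Without missing values, the Series keeps an integer dtype
int_series = pd.Series([1, 42])
print(int_series.dtype)

# With a None present, pandas upcasts to float64 and stores NaN instead
float_series = pd.Series([1, None, 42])
print(float_series.dtype)
print(float_series)
```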

## Operating on Null Values

Pandas provides methods for working with null values in its data structures:
* `isnull()`: generates boolean mask indicating missing values
* `notnull()`: opposite of `isnull()`
* `dropna()`: returns filtered version of data
* `fillna()`: returns a copy of the data with missing values filled or imputed

Let's see them in action:


In [None]:
data = pd.Series([1, np.nan, 42, None])

In [None]:
data.isnull()

In [None]:
data.notnull()

In [None]:
data[(data.notnull())]
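
Since the mask is boolean, it can also be summed or averaged to count missing values. A minimal sketch, using the same `data` Series as above (`n_missing` and `share_missing` are illustrative names):

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 42, None])

# True counts as 1, so summing the mask counts the missing entries
n_missing = data.isnull().sum()

# The mean of the mask is the share of entries that are missing
share_missing = data.isnull().mean()

print(n_missing, share_missing)
```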

### Dropping na

In [None]:
data.dropna()

In [None]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df

In [None]:
df.dropna()

In [None]:
df.dropna(axis='columns')

In [None]:
df[3] = np.nan
df

In [None]:
df.dropna(axis='columns', how='all')

In [None]:
df.dropna(axis='columns', how='any')

In [None]:
df.dropna(axis='rows', thresh=3)  # keep only rows with at least 3 non-NA values

In [None]:
df.loc[1,1]=np.nan
df

In [None]:
df.dropna(axis='columns', thresh=2)  # keep only columns with at least 2 non-NA values
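
To see why particular columns survive, it can help to count the non-NA values per column and compare them with `thresh`. The sketch below rebuilds the DataFrame in the state it should have at this point in the notebook (an assumption based on the cells above):

```python
import numpy as np
import pandas as pd

# Same contents as df after adding column 3 and setting df.loc[1, 1] to NaN
df = pd.DataFrame([[1,      np.nan, 2, np.nan],
                   [2,      np.nan, 5, np.nan],
                   [np.nan, 4,      6, np.nan]])

# Number of non-NA values in each column
non_na_per_column = df.notnull().sum()
print(non_na_per_column.tolist())

# thresh=2 keeps only the columns with at least 2 non-NA values
kept = df.dropna(axis='columns', thresh=2)
print(kept.columns.tolist())
```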

### Filling na

In [None]:
data

In [None]:
data.fillna(99)

In [None]:
data.ffill()  # forward fill; replaces fillna(method='ffill'), deprecated since pandas 2.1

In [None]:
data.bfill()  # backward fill; replaces fillna(method='bfill'), deprecated since pandas 2.1

In [None]:
df.ffill(axis=1)  # forward fill along each row; a leading NaN in a row stays NaN
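
The filling strategies can be compared side by side on the `data` Series. The sketch below also shows a simple mean imputation, which is one possible choice among many rather than a recommendation. The variable names are illustrative:

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 42, None])

# Forward fill: carry the last valid value forward
forward = data.ffill()

# Backward fill: pull the next valid value backward;
# the final entry has nothing after it, so it stays NaN
backward = data.bfill()

# Mean imputation: Series.mean() skips NaN by default
mean_filled = data.fillna(data.mean())

print(pd.DataFrame({"ffill": forward, "bfill": backward, "mean": mean_filled}))
```

Which strategy is appropriate depends on why the data is missing. Forward filling, for example, makes most sense when the observations are ordered, as in a time series.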