### Handling Missing Data

There are two common strategies for indicating missing values:
1. A mask: a separate Boolean array that flags which entries are missing. The disadvantage is the additional memory the mask requires.
2. A sentinel value, e.g. -1, NaN, inf, or null. The disadvantage is that the sentinel reduces the range of valid values that can be represented and may require extra CPU/GPU logic during computation.
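
The two strategies can be compared on a small array. This is a minimal sketch: the sample values and the choice of -1.0 as the "missing" marker in the raw data are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical raw data in which -1.0 marks a missing reading
raw = np.array([4.0, -1.0, 7.0, -1.0, 9.0])

# Option 1: keep the values untouched and carry a separate Boolean mask
mask = raw == -1.0
masked = np.ma.masked_array(raw, mask=mask)
mask_mean = masked.mean()            # masked entries are skipped
mask_bytes = mask.nbytes             # the extra memory the mask costs

# Option 2: overwrite the missing entries with a NaN sentinel
with_nan = np.where(mask, np.nan, raw)
nan_mean = np.nanmean(with_nan)      # needs NaN-aware logic to skip them

print(mask_mean, nan_mean, mask_bytes)
```

Both approaches give the same mean here; the mask costs one extra byte per element, while the sentinel approach requires NaN-aware functions.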

Pandas' missing-data handling is constrained by NumPy conventions. Unfortunately, NumPy doesn't have a built-in NA value for non-floating-point data types.
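
A short sketch of what this limitation looks like in practice, using an arbitrary three-element integer array: NaN cannot be stored in an integer array, so the data has to be converted to float first.

```python
import numpy as np

ints = np.array([1, 2, 3], dtype=np.int64)

# Assigning NaN into an integer array fails: there is no integer NaN
try:
    ints[0] = np.nan
    error = None
except ValueError as exc:
    error = exc

# Converting to float makes room for the NaN sentinel
floats = ints.astype(float)
floats[0] = np.nan

print(type(error).__name__, ints, floats)
```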

Pandas chose to use two existing Python null values as sentinels: None and NaN.

In [1]:
import numpy as np
import pandas as pd

#### None
 - a Python singleton object
 - it can't be stored in a NumPy array without upcasting the array to the object dtype

In [4]:
vals = np.array([1, None, 3, 4])

Since the None value has upcast the array to dtype "object", operations on it run at the speed of plain Python, which is far slower than with NumPy's native types.

In [3]:
for dtype in ['object', 'int']:
    print('dtype = ', dtype)
    %timeit np.arange(1e6, dtype=dtype).sum()
    print()

dtype =  object
55.4 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

dtype =  int
797 µs ± 17.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)



Aggregations such as sum() across an array that contains None raise an error.

In [5]:
vals.sum()

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

#### NaN

- floating-point value recognized by all systems that use the standard IEEE floating-point representation
- allows for fast array operations because it is a float type, not object type like None
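
One consequence of the IEEE rules is that NaN never compares equal to anything, including itself, so missing entries have to be located with a dedicated test. A minimal sketch using the same four values as the cells in this section:

```python
import numpy as np

vals = np.array([1, np.nan, 3, 4])

# NaN is not equal to itself under IEEE 754
equal_to_itself = np.nan == np.nan

# So an equality test never finds the missing entry...
naive = vals == np.nan

# ...while np.isnan() does
found = np.isnan(vals)

print(equal_to_itself, naive, found)
```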

In [6]:
vals = np.array([1, np.nan, 3, 4])
vals.dtype

dtype('float64')

In [7]:
1 + np.nan

nan

NaN allows for fast operations, but it propagates: any arithmetic involving NaN yields NaN. NumPy provides several NaN-safe aggregation functions that ignore missing values.

In [13]:
vals

array([ 1., nan,  3.,  4.])

In [10]:
vals.sum(), vals.max()

(nan, nan)

In [12]:
np.nansum(vals), np.nanmax(vals)

(8.0, 4.0)

#### NaN and None in Pandas

Pandas handles the two nearly interchangeably and converts between them when appropriate.

Integer arrays are automatically upcast to a float type when any value is None or NaN.

In [15]:
pd.Series([1, np.nan, 2, None])

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

In [16]:
x = pd.Series(range(2), dtype=int)
x

0    0
1    1
dtype: int64

In [17]:
x[0] = None
x

0    NaN
1    1.0
dtype: float64
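
The result of the upcast depends on the original type of the data. The sketch below uses small made-up Series to show the resulting dtypes with default pandas inference; Boolean data has no NaN representation, so it falls back to the object dtype and keeps None as-is.

```python
import pandas as pd

int_with_none = pd.Series([1, None])       # integers become floats
float_with_none = pd.Series([1.5, None])   # floats stay floats
bool_with_none = pd.Series([True, None])   # Booleans become objects

for name, s in [('int', int_with_none),
                ('float', float_with_none),
                ('bool', bool_with_none)]:
    print(name, '->', s.dtype)
```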

#### Operating with Null Values

In [18]:
# isnull() and notnull() can be used to detect null values

In [19]:
data = pd.Series([1, np.nan, 'hello', None])
data

0        1
1      NaN
2    hello
3     None
dtype: object

In [20]:
data.isnull()

0    False
1     True
2    False
3     True
dtype: bool

In [21]:
data.notnull()

0     True
1    False
2     True
3    False
dtype: bool

In [22]:
data[data.isnull()]

1     NaN
3    None
dtype: object
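
Because the Boolean masks count True as 1, summing them is a quick way to tally missing and present values. A small sketch on the same Series as above:

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 'hello', None])

n_missing = data.isnull().sum()    # number of null entries
n_present = data.notnull().sum()   # number of non-null entries
present = data[data.notnull()]     # keep only the non-null entries

print(n_missing, n_present)
print(present)
```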

In [25]:
# null values can be dropped with dropna()
# this returns a copy of the Series without missing values
# unless inplace=True
data.dropna()

0        1
2    hello
dtype: object

In [26]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


In [27]:
# when working with DataFrames
# we can only drop entire rows or columns, not just cells

In [28]:
df.dropna()

     0    1  2
1  2.0  3.0  5


In [29]:
df.dropna(axis=0)

     0    1  2
1  2.0  3.0  5


In [30]:
df.dropna(axis=1)

   2
0  2
1  5
2  6


In [31]:
# the 'how' and 'thresh' arguments can be used to select
# when a row or column should be dropped

In [32]:
df[3] = np.nan
df

     0    1  2    3
0  1.0  NaN  2  NaN
1  2.0  3.0  5  NaN
2  NaN  4.0  6  NaN


In [33]:
# 'how' defaults to 'any', so a row or column is dropped if any value is missing;
# how='all' drops only rows or columns in which every value is missing

In [34]:
df.dropna(axis='columns', how='all')

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


In [36]:
# thresh specifies the minimum number of non-null values
# that is required to keep the row or column
df.dropna(axis='rows', thresh=3)

     0    1  2    3
1  2.0  3.0  5  NaN
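
The effect of 'how' and 'thresh' can be compared side by side. This sketch rebuilds the same DataFrame as above, including the all-NaN column 3, and records which labels survive each call.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df[3] = np.nan  # a column that is entirely missing

# Default how='any': every row contains a NaN in column 3, so nothing is left
rows_any = df.dropna().index.tolist()

# how='all' on columns removes only the column that is entirely NaN
cols_all = df.dropna(axis='columns', how='all').columns.tolist()

# thresh=3 keeps rows / columns with at least three non-null values
rows_thresh = df.dropna(axis='index', thresh=3).index.tolist()
cols_thresh = df.dropna(axis='columns', thresh=3).columns.tolist()

print(rows_any, cols_all, rows_thresh, cols_thresh)
```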


#### Filling Null Values

Instead of being dropped, null values can be filled in, for example to impute or interpolate from good values.

In [37]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

In [38]:
data.fillna(0)

a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64

In [39]:
data

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

In [41]:
# forward fill propagates the previous value forward
# (fillna(method='ffill') is deprecated in recent pandas; ffill() is equivalent)
data.ffill()

a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64

In [42]:
# similarly, backward fill propagates in the opposite direction
# (fillna(method='bfill') is deprecated in recent pandas; bfill() is equivalent)
data.bfill()

a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64
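
Beyond a constant or a neighbouring value, the gaps can also be filled from statistics of the good values. The sketch below, on the same Series, shows two common choices: imputing with the mean, and linear interpolation between neighbours. Which one is appropriate depends on the data and is not prescribed here.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

# Impute: replace every null with the mean of the non-null values
mean_filled = data.fillna(data.mean())

# Interpolate: fill each gap linearly from its neighbours
interpolated = data.interpolate()

print(mean_filled)
print(interpolated)
```

As with fillna(0) above, both calls return a new Series and leave data unchanged.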