# Handling Missing Data


Missing data is common in most data analysis applications. One of the goals in designing
pandas was to make working with missing data as painless as possible. For
example, all of the descriptive statistics on pandas objects exclude missing data, as
you’ve seen earlier in the chapter.

pandas uses the floating point value NaN (Not a Number) to represent missing data in
both floating point and non-floating point arrays. It serves only as a sentinel that can
be easily detected:

In [2]:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np

In [3]:
string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])

In [4]:
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [5]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

The built-in Python None value is also treated as NA in object arrays:

In [6]:
string_data[0] = None

In [7]:
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool
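A short sketch of why a dedicated detection method is needed: NaN never compares equal to anything, itself included, so an equality test cannot locate missing entries. The names values, naive_mask, and na_mask are illustrative only, and dtype=object is passed explicitly to mirror the object array shown above.

```python
import numpy as np
import pandas as pd

# NaN is not equal to itself, so == cannot be used to find it
print(np.nan == np.nan)  # False

# dtype=object keeps both None and NaN as stored (assumption: mirrors the example above)
values = pd.Series(['aardvark', None, np.nan, 'avocado'], dtype=object)

# An equality comparison misses every missing entry...
naive_mask = values == np.nan

# ...whereas isnull flags both None and NaN
na_mask = values.isnull()

print(naive_mask.tolist())  # [False, False, False, False]
print(na_mask.tolist())     # [False, True, True, False]
```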

I do not claim that pandas’s NA representation is optimal, but it is simple and reasonably
consistent. It’s the best solution, with good all-around performance characteristics
and a simple API, that I could concoct in the absence of a true NA data type or bit
pattern in NumPy’s data types. Ongoing development work in NumPy may change this
in the future.

Table 5-12. NA handling methods

| Method | Description |
| --- | --- |
| dropna | Filter axis labels based on whether the values for each label have missing data, with varying thresholds for how much missing data to tolerate. |
| fillna | Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'. |
| isnull | Return a like-type object containing boolean values indicating which values are missing / NA. |
| notnull | Negation of isnull. |
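As a quick orientation, the four methods can be tried side by side on a tiny Series. This is a minimal sketch; the variable names are illustrative and not part of the original examples.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

missing = s.isnull()    # True where a value is missing
present = s.notnull()   # element-wise negation of isnull
dropped = s.dropna()    # keeps only the labels with data (0 and 2)
filled = s.fillna(0.0)  # replaces the missing value with 0.0

print(missing.tolist())
print(dropped.index.tolist())
print(filled.tolist())
```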

## Filtering Out Missing Data

You have a number of options for filtering out missing data. While doing it by hand is
always an option, dropna can be very helpful. On a Series, it returns the Series with only
the non-null data and index values:

In [8]:
from numpy import nan as NA

In [9]:
data = Series([1, NA, 3.5, NA, 7])

In [10]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

Naturally, you could have computed this yourself by boolean indexing:

In [11]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

With DataFrame objects, things are a bit more complex. You may want to drop rows
or columns that are all NA, or just those containing any NAs. By default, dropna drops
any row containing a missing value:

In [13]:
data = DataFrame([[1., 6.5, 3.],
                  [1., NA, NA],
                  [NA, NA, NA],
                  [NA, 6.5, 3.]])

In [14]:
data

|   | 0 | 1 | 2 |
| --- | --- | --- | --- |
| 0 | 1.0 | 6.5 | 3.0 |
| 1 | 1.0 | NaN | NaN |
| 2 | NaN | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 |


In [15]:
cleaned = data.dropna()

In [16]:
cleaned

|   | 0 | 1 | 2 |
| --- | --- | --- | --- |
| 0 | 1.0 | 6.5 | 3.0 |


Passing how='all' will only drop rows that are all NA:

In [17]:
data.dropna(how='all')

|   | 0 | 1 | 2 |
| --- | --- | --- | --- |
| 0 | 1.0 | 6.5 | 3.0 |
| 1 | 1.0 | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 |


Dropping columns in the same way is only a matter of passing axis=1:

In [18]:
data[4] = NA

In [19]:
data

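|   | 0 | 1 | 2 | 4 |
| --- | --- | --- | --- | --- |
| 0 | 1.0 | 6.5 | 3.0 | NaN |
| 1 | 1.0 | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 | NaN |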


In [20]:
data.dropna(axis=1, how='all')

|   | 0 | 1 | 2 |
| --- | --- | --- | --- |
| 0 | 1.0 | 6.5 | 3.0 |
| 1 | 1.0 | NaN | NaN |
| 2 | NaN | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 |
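The row-dropping variants can be compared on one small frame. The sketch below also uses the subset argument of dropna, which is not covered in the text above and is included here as an assumption about what readers may find useful: it restricts the NA check to the listed columns. The name frame and the result names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1., 6.5, 3.],
                      [1., np.nan, np.nan],
                      [np.nan, np.nan, np.nan],
                      [np.nan, 6.5, 3.]])

any_rows = frame.dropna()            # how='any' is the default
all_rows = frame.dropna(how='all')   # drop only rows that are entirely NA
key_rows = frame.dropna(subset=[0])  # check column 0 only (not shown in the text)

print(any_rows.index.tolist())  # [0]
print(all_rows.index.tolist())  # [0, 1, 3]
print(key_rows.index.tolist())  # [0, 1]
```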


A related way to filter out DataFrame rows tends to concern time series data. Suppose
you want to keep only rows containing at least a certain number of non-missing
observations. You can indicate this minimum with the thresh argument:

In [22]:
df = DataFrame(np.random.randn(7, 3))

In [23]:
df

|   | 0 | 1 | 2 |
| --- | --- | --- | --- |
| 0 | -0.79614 | 0.797883 | -0.642255 |
| 1 | 1.632909 | 0.081464 | -0.520545 |
| 2 | 0.113194 | -2.153705 | 2.20601 |
| 3 | -0.738177 | 1.133723 | -0.175932 |
| 4 | -0.057007 | -1.00828 | 0.844486 |
| 5 | -0.54108 | 0.475725 | 0.262045 |
| 6 | 1.185896 | 0.804016 | -0.211708 |


In [24]:
df.loc[:4, 1] = NA

In [26]:
df.loc[:2, 2] = NA

In [27]:
df

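|   | 0 | 1 | 2 |
| --- | --- | --- | --- |
| 0 | -0.79614 | NaN | NaN |
| 1 | 1.632909 | NaN | NaN |
| 2 | 0.113194 | NaN | NaN |
| 3 | -0.738177 | NaN | -0.175932 |
| 4 | -0.057007 | NaN | 0.844486 |
| 5 | -0.54108 | 0.475725 | 0.262045 |
| 6 | 1.185896 | 0.804016 | -0.211708 |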


In [28]:
df.dropna(thresh=3)

|   | 0 | 1 | 2 |
| --- | --- | --- | --- |
| 5 | -0.54108 | 0.475725 | 0.262045 |
| 6 | 1.185896 | 0.804016 | -0.211708 |
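One way to read thresh is as a minimum count of non-missing values per row. The sketch below, using a small hand-made frame with illustrative names, reproduces dropna(thresh=2) by counting observations with notnull and filtering by hand.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1.0, np.nan, np.nan],
                      [2.0, np.nan, 0.5],
                      [3.0, 4.0, 0.7]])

# Number of non-missing values in each row: 1, 2, 3
observed = frame.notnull().sum(axis=1)

by_thresh = frame.dropna(thresh=2)  # keep rows with at least 2 observations
by_hand = frame[observed >= 2]      # the same filter written out manually

print(by_thresh.index.tolist())     # [1, 2]
print(by_thresh.equals(by_hand))    # True
```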


## Filling in Missing Data


Rather than filtering out missing data (and potentially discarding other data along with
it), you may want to fill in the “holes” in any number of ways. For most purposes, the
fillna method is the workhorse function to use. Calling fillna with a constant replaces
missing values with that value:

In [29]:
df.fillna(0)

|   | 0 | 1 | 2 |
| --- | --- | --- | --- |
| 0 | -0.79614 | 0.0 | 0.0 |
| 1 | 1.632909 | 0.0 | 0.0 |
| 2 | 0.113194 | 0.0 | 0.0 |
| 3 | -0.738177 | 0.0 | -0.175932 |
| 4 | -0.057007 | 0.0 | 0.844486 |
| 5 | -0.54108 | 0.475725 | 0.262045 |
| 6 | 1.185896 | 0.804016 | -0.211708 |


Calling fillna with a dict lets you use a different fill value for each column. Keys that do not match an existing column, such as 3 below, are ignored:

In [31]:
df.fillna({1: 0.5, 3: -1})

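|   | 0 | 1 | 2 |
| --- | --- | --- | --- |
| 0 | -0.79614 | 0.5 | NaN |
| 1 | 1.632909 | 0.5 | NaN |
| 2 | 0.113194 | 0.5 | NaN |
| 3 | -0.738177 | 0.5 | -0.175932 |
| 4 | -0.057007 | 0.5 | 0.844486 |
| 5 | -0.54108 | 0.475725 | 0.262045 |
| 6 | 1.185896 | 0.804016 | -0.211708 |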
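The per-column behavior of a dict fill can be checked on a small frame with named columns. In this sketch the column names 'a' and 'b' and the key 'missing_col' are hypothetical; the last one is included only to show that a key with no matching column leaves the frame untouched.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1.0, np.nan],
                      'b': [np.nan, 2.0]})

# 'a' gets a fill value; 'missing_col' matches no column and has no effect;
# 'b' has no entry in the dict, so its NaN is left alone
filled = frame.fillna({'a': 0.0, 'missing_col': -1.0})

print(filled['a'].tolist())           # [1.0, 0.0]
print(filled['b'].isnull().tolist())  # [True, False]
print(filled.columns.tolist())        # ['a', 'b']
```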


fillna returns a new object by default, but you can modify the existing object in place:

In [32]:
# modifies df in place; the return value differs across pandas versions, so it is discarded
_ = df.fillna(0, inplace=True)

In [33]:
df

|   | 0 | 1 | 2 |
| --- | --- | --- | --- |
| 0 | -0.79614 | 0.0 | 0.0 |
| 1 | 1.632909 | 0.0 | 0.0 |
| 2 | 0.113194 | 0.0 | 0.0 |
| 3 | -0.738177 | 0.0 | -0.175932 |
| 4 | -0.057007 | 0.0 | 0.844486 |
| 5 | -0.54108 | 0.475725 | 0.262045 |
| 6 | 1.185896 | 0.804016 | -0.211708 |
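The difference between the copying and in-place forms can be made explicit with a before-and-after check. This is a small sketch with illustrative names; it deliberately does not rely on whatever the in-place call returns.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'x': [1.0, np.nan]})

# Default behavior: a new object is returned and frame is left untouched
copied = frame.fillna(0)
still_missing = bool(frame['x'].isnull().any())

# In-place behavior: frame itself is modified
frame.fillna(0, inplace=True)
now_missing = bool(frame['x'].isnull().any())

print(still_missing)  # True
print(now_missing)    # False
```

Reassigning the result, as in frame = frame.fillna(0), gives the same end state without depending on inplace.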


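The same interpolation methods available for reindexing can be used with fillna. Note that recent pandas releases deprecate the method argument of fillna in favor of the dedicated ffill and bfill methods: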

In [34]:
df = DataFrame(np.random.randn(6, 3))

In [35]:
df.loc[2:, 1] = NA

In [36]:
df.loc[4:, 2] = NA

In [37]:
df

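|   | 0 | 1 | 2 |
| --- | --- | --- | --- |
| 0 | -0.738434 | -0.132072 | -0.267316 |
| 1 | -0.083626 | 0.70325 | -0.306177 |
| 2 | 0.165872 | NaN | -0.188502 |
| 3 | 1.153035 | NaN | 0.16003 |
| 4 | 0.933949 | NaN | NaN |
| 5 | 1.620144 | NaN | NaN |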


In [38]:
df.fillna(method='ffill')

|   | 0 | 1 | 2 |
| --- | --- | --- | --- |
| 0 | -0.738434 | -0.132072 | -0.267316 |
| 1 | -0.083626 | 0.70325 | -0.306177 |
| 2 | 0.165872 | 0.70325 | -0.188502 |
| 3 | 1.153035 | 0.70325 | 0.16003 |
| 4 | 0.933949 | 0.70325 | 0.16003 |
| 5 | 1.620144 | 0.70325 | 0.16003 |


In [39]:
df.fillna(method='ffill', limit=2)

|   | 0 | 1 | 2 |
| --- | --- | --- | --- |
| 0 | -0.738434 | -0.132072 | -0.267316 |
| 1 | -0.083626 | 0.70325 | -0.306177 |
| 2 | 0.165872 | 0.70325 | -0.188502 |
| 3 | 1.153035 | 0.70325 | 0.16003 |
| 4 | 0.933949 | NaN | 0.16003 |
| 5 | 1.620144 | NaN | 0.16003 |
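The same forward and backward fills are also available as the standalone ffill and bfill methods, which accept the limit argument as well. A minimal sketch on a Series with a three-value gap (the names are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

forward = s.ffill()         # carry the last observation forward
limited = s.ffill(limit=2)  # fill at most 2 consecutive missing values
backward = s.bfill()        # pull the next observation backward

print(forward.tolist())           # [1.0, 1.0, 1.0, 1.0, 5.0]
print(limited.isnull().tolist())  # [False, False, False, True, False]
print(backward.tolist())          # [1.0, 5.0, 5.0, 5.0, 5.0]
```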


With fillna you can do lots of other things with a little creativity. For example, you
might pass the mean or median value of a Series:

In [40]:
data = Series([1., NA, 3.5, NA, 7])

In [41]:
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [42]:
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
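The same idea extends to a DataFrame: passing the result of mean, which is a Series indexed by column label, fills each column with its own mean. A sketch under the assumption of a small two-column frame with illustrative names:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                      'b': [np.nan, 10.0, 20.0]})

# frame.mean() skips missing values: a -> 2.0, b -> 15.0
column_means = frame.mean()

# Each column is filled with its own mean
filled = frame.fillna(column_means)

print(filled['a'].tolist())  # [1.0, 2.0, 3.0]
print(filled['b'].tolist())  # [15.0, 10.0, 20.0]
```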

See Table 5-13 for a reference on fillna.

Table 5-13. fillna function arguments

| Argument | Description |
| --- | --- |
| value | Scalar value or dict-like object to use to fill missing values |
| method | Interpolation method, by default 'ffill' if the function is called with no other arguments |
| axis | Axis to fill on, default axis=0 |
| inplace | Modify the calling object without producing a copy |
| limit | For forward and backward filling, maximum number of consecutive periods to fill |
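The axis argument is easiest to see with a directional fill. The sketch below uses the standalone ffill method rather than fillna, and the frame and result names are illustrative: axis=0 (the default) fills down each column, while axis=1 fills across each row.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1.0, np.nan, 3.0],
                      [np.nan, 5.0, np.nan]])

down = frame.ffill()          # axis=0: propagate values down each column
across = frame.ffill(axis=1)  # axis=1: propagate values along each row

print(down.iloc[1].tolist())    # [1.0, 5.0, 3.0]
print(across.iloc[0].tolist())  # [1.0, 1.0, 3.0]
```

A leading NaN has nothing before it to carry forward, so across.iloc[1, 0] remains missing.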