# Handling Missing Data

Input data is often incomplete, and pandas provides convenient methods for handling missing values in its data objects.

In [1]:
import pandas as pd
import numpy as np

For floating-point data, pandas uses `NaN` (Not a Number) to represent missing values.

In [2]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [3]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In pandas, any missing value is generally referred to as NA (Not Available), which marks the value as absent.

Python's `None` is also treated as NA in pandas.

In [4]:
string_data[0] = None
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool
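
As a small sketch of the point above, both `None` and `NaN` are reported as missing, whether the values are strings or numbers. The variable names here are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Both None and NaN are reported as missing by isnull().
obj_series = pd.Series(['aardvark', None, np.nan], dtype=object)
obj_mask = obj_series.isnull()

# In a numeric Series, None is converted to NaN when the Series is built,
# so the result has a floating-point dtype.
num_series = pd.Series([1.0, None, 3.0])
num_mask = num_series.isnull()

print(obj_mask.tolist())
print(num_series.dtype)
print(num_mask.tolist())
```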

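|Method|Notes|
|-|-|
|dropna|Removes missing values and returns a new object; `how` and `thresh` control how much missing data is tolerated|
|fillna|Replaces missing values with a given value or by a fill method such as `'ffill'`|
|isnull|Returns a boolean object that is True where a value is missing|
|notnull|The negation of `isnull`: True where a value is present|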

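## Filtering Out Missing Data

The `isnull` method returns a boolean `Series` or `DataFrame` that is True at each position where the value is missing and False elsewhere.

The `notnull` method returns a boolean `Series` or `DataFrame` that is False at each position where the value is missing and True elsewhere.

The `dropna` method returns a new `Series` or `DataFrame` with the missing elements of the original object removed.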
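
The relationship between these three methods can be sketched on a small, made-up `DataFrame` (the name `frame` and its columns are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                      'b': [np.nan, np.nan, 6.0]})

# notnull is the element-wise negation of isnull.
masks_are_complementary = frame.notnull().equals(~frame.isnull())

# Summing the boolean mask counts the missing values in each column.
missing_per_column = frame.isnull().sum()

# dropna returns a new object; `frame` itself is left unchanged.
complete_rows = frame.dropna()

print(masks_are_complementary)
print(missing_per_column)
print(complete_rows)
```

Counting with `isnull().sum()` is a quick way to see how much data is missing before deciding whether to drop or fill it.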

### Series Objects

In [5]:
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [6]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [7]:
# Equivalent to data.dropna()
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

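### DataFrame Objects

The `how` parameter of `dropna` controls when a row or column is dropped (`'any'`, the default, or `'all'`), and the `axis` parameter selects whether rows or columns are dropped.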

In [8]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])
data

| |0|1|2|
|-|-|-|-|
|0|1.0|6.5|3.0|
|1|1.0|NaN|NaN|
|2|NaN|NaN|NaN|
|3|NaN|6.5|3.0|


In [9]:
cleaned = data.dropna()
cleaned

| |0|1|2|
|-|-|-|-|
|0|1.0|6.5|3.0|


In [11]:
data.dropna(how='all')

| |0|1|2|
|-|-|-|-|
|0|1.0|6.5|3.0|
|1|1.0|NaN|NaN|
|3|NaN|6.5|3.0|


In [12]:
data[4] = NA
data

| |0|1|2|4|
|-|-|-|-|-|
|0|1.0|6.5|3.0|NaN|
|1|1.0|NaN|NaN|NaN|
|2|NaN|NaN|NaN|NaN|
|3|NaN|6.5|3.0|NaN|


In [13]:
data.dropna(axis=1, how='all')

| |0|1|2|
|-|-|-|-|
|0|1.0|6.5|3.0|
|1|1.0|NaN|NaN|
|2|NaN|NaN|NaN|
|3|NaN|6.5|3.0|
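
The four combinations of `how` and `axis` can be compared side by side. This is a minimal sketch on made-up data; `frame` and its column names are not from the original notebook.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'x': [1.0, np.nan, np.nan],
                      'y': [2.0, 5.0, np.nan],
                      'z': [np.nan, np.nan, np.nan]})

# Default (how='any'): drop every row that has at least one NA.
rows_any = frame.dropna()

# how='all': drop only the rows in which every value is NA.
rows_all = frame.dropna(how='all')

# axis=1 applies the same rules to columns instead of rows.
cols_all = frame.dropna(axis=1, how='all')
cols_any = frame.dropna(axis=1)

print(len(rows_any))
print(rows_all.index.tolist())
print(cols_all.columns.tolist())
print(cols_any.shape)
```

Because column `z` is entirely NA, the default call removes every row, while `how='all'` on `axis=1` removes only that column.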


In [14]:
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df

| |0|1|2|
|-|-|-|-|
|0|-1.799075|NaN|NaN|
|1|-0.590887|NaN|NaN|
|2|-0.181862|NaN|1.026534|
|3|0.716982|NaN|-1.063201|
|4|-0.072309|0.890491|0.161354|
|5|-0.38652|0.583375|1.602683|
|6|-0.41657|-0.668482|0.93093|


In [15]:
df.dropna()

| |0|1|2|
|-|-|-|-|
|4|-0.072309|0.890491|0.161354|
|5|-0.38652|0.583375|1.602683|
|6|-0.41657|-0.668482|0.93093|


In [16]:
df.dropna(thresh=2)

| |0|1|2|
|-|-|-|-|
|2|-0.181862|NaN|1.026534|
|3|0.716982|NaN|-1.063201|
|4|-0.072309|0.890491|0.161354|
|5|-0.38652|0.583375|1.602683|
|6|-0.41657|-0.668482|0.93093|
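
The `thresh` argument used in the last call keeps a row only if it has at least that many non-NA values. A minimal sketch of this rule, using made-up data and illustrative variable names, compares it with the same filter written by hand:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1.0, np.nan, np.nan],
                      [2.0, np.nan, 5.0],
                      [3.0, 4.0, 6.0]])

# Keep only the rows that contain at least two non-NA values.
kept_by_thresh = frame.dropna(thresh=2)

# The same filter written out by hand: count non-NA values per row.
kept_by_count = frame[frame.notnull().sum(axis=1) >= 2]

print(kept_by_thresh.index.tolist())
print(kept_by_thresh.equals(kept_by_count))
```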


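## Filling In Missing Data

The `fillna` method replaces missing values in a `Series` or `DataFrame`. Its `inplace` parameter controls whether the replacement modifies the original object, and its `method` parameter selects a fill strategy such as forward fill (`'ffill'`). Note that `method` is deprecated since pandas 2.1 in favor of the `ffill` and `bfill` methods.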

In [17]:
df

| |0|1|2|
|-|-|-|-|
|0|-1.799075|NaN|NaN|
|1|-0.590887|NaN|NaN|
|2|-0.181862|NaN|1.026534|
|3|0.716982|NaN|-1.063201|
|4|-0.072309|0.890491|0.161354|
|5|-0.38652|0.583375|1.602683|
|6|-0.41657|-0.668482|0.93093|


In [18]:
df.fillna(0)

| |0|1|2|
|-|-|-|-|
|0|-1.799075|0.0|0.0|
|1|-0.590887|0.0|0.0|
|2|-0.181862|0.0|1.026534|
|3|0.716982|0.0|-1.063201|
|4|-0.072309|0.890491|0.161354|
|5|-0.38652|0.583375|1.602683|
|6|-0.41657|-0.668482|0.93093|


Passing a `dict` to `fillna` specifies a separate replacement value for each column.

In [19]:
df.fillna({1: 0.5, 2: 0})

| |0|1|2|
|-|-|-|-|
|0|-1.799075|0.5|0.0|
|1|-0.590887|0.5|0.0|
|2|-0.181862|0.5|1.026534|
|3|0.716982|0.5|-1.063201|
|4|-0.072309|0.890491|0.161354|
|5|-0.38652|0.583375|1.602683|
|6|-0.41657|-0.668482|0.93093|
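
The same per-column filling works with named columns, and columns that are absent from the `dict` are left untouched. The sketch below uses hypothetical column names (`price`, `qty`, `note`) purely for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'price': [9.5, np.nan],
                      'qty': [np.nan, 3.0],
                      'note': [np.nan, np.nan]})

# Only the columns listed in the dict are filled; 'note' keeps its NA values.
filled = frame.fillna({'price': 0.0, 'qty': 1.0})

print(filled)
```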


In [20]:
_ = df.fillna(0, inplace=True)
df

| |0|1|2|
|-|-|-|-|
|0|-1.799075|0.0|0.0|
|1|-0.590887|0.0|0.0|
|2|-0.181862|0.0|1.026534|
|3|0.716982|0.0|-1.063201|
|4|-0.072309|0.890491|0.161354|
|5|-0.38652|0.583375|1.602683|
|6|-0.41657|-0.668482|0.93093|
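
The difference between the default behavior and `inplace=True` can be checked directly. This is a small sketch on made-up data; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, 4.0]})

# By default fillna returns a new object and leaves `frame` unchanged.
filled = frame.fillna(0)
missing_before = int(frame.isnull().sum().sum())

# With inplace=True the calling object itself is modified.
frame.fillna(0, inplace=True)
missing_after = int(frame.isnull().sum().sum())

print(missing_before, missing_after)
```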


In [21]:
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df

| |0|1|2|
|-|-|-|-|
|0|1.199776|0.138289|-0.617422|
|1|1.350596|-2.199139|0.01564|
|2|-0.714865|NaN|0.072188|
|3|-0.118342|NaN|1.087101|
|4|0.903383|NaN|NaN|
|5|-2.122971|NaN|NaN|


In [22]:
# Forward fill; fillna(method='ffill') is deprecated since pandas 2.1
df.ffill()

| |0|1|2|
|-|-|-|-|
|0|1.199776|0.138289|-0.617422|
|1|1.350596|-2.199139|0.01564|
|2|-0.714865|-2.199139|0.072188|
|3|-0.118342|-2.199139|1.087101|
|4|0.903383|-2.199139|1.087101|
|5|-2.122971|-2.199139|1.087101|


In [23]:
# Forward fill at most 2 consecutive missing values per column
df.ffill(limit=2)

| |0|1|2|
|-|-|-|-|
|0|1.199776|0.138289|-0.617422|
|1|1.350596|-2.199139|0.01564|
|2|-0.714865|-2.199139|0.072188|
|3|-0.118342|-2.199139|1.087101|
|4|0.903383|NaN|1.087101|
|5|-2.122971|NaN|1.087101|
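
Forward and backward filling are easiest to follow on a single `Series`. The sketch below uses made-up values and the `ffill` and `bfill` methods; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

series = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Forward fill: propagate the last valid value downward.
forward = series.ffill()

# limit caps how many consecutive missing values are filled.
forward_limited = series.ffill(limit=2)

# Backward fill: propagate the next valid value upward.
backward = series.bfill()

print(forward.tolist())
print(forward_limited.isnull().tolist())
print(backward.tolist())
```

With `limit=2`, only the first two of the three consecutive gaps are filled, so the value at position 3 stays missing.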


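The value passed to `fillna` can come from a statistical function, for example replacing the missing values in a `Series` with the mean of its non-missing elements.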

In [25]:
data = pd.Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
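
The same idea extends to a `DataFrame`: passing the `Series` of column means fills each column with its own mean, because `fillna` aligns the fill values on the column labels. This is a sketch on made-up data, with illustrative names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                      'b': [np.nan, 4.0, 8.0]})

# frame.mean() skips NA values and returns one mean per column.
column_means = frame.mean()

# Each column's missing values are replaced by that column's mean.
filled = frame.fillna(column_means)

print(filled)
```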

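|Parameter|Notes|
|-|-|
|value|Scalar or dict-like object used to replace missing values|
|method|Fill strategy, e.g. `'ffill'` (forward fill) or `'bfill'` (backward fill); deprecated since pandas 2.1|
|axis|Axis along which to fill|
|inplace|If True, modifies the calling object instead of returning a new one|
|limit|Maximum number of consecutive missing values to fill when forward or backward filling|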