# Missing Data

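Missing data is a persistent problem in real-world work. In machine learning and data mining, poor prediction accuracy is a frequent and serious problem, and handling missing data properly is one of the keys to making models more accurate and more reliable.
Data can go missing in many ways. In an online survey, for example, respondents usually do not share every piece of information about themselves: satisfaction with a product may be easy to collect, while contact details are much harder to obtain. Situations like this are very common.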

### 1. Detecting missing values

In [3]:
import pandas as pd
import numpy as np

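# Five rows of random data, labelled a, c, e, f, h
df = pd.DataFrame(np.random.randn(5,3),index=list('acefh'),columns=['one','two','three'])
# Reindexing onto labels that do not exist yet (b, d, g) creates all-NaN rows
df = df.reindex(list('abcdefgh'))
print(df,'\n')
# isnull() is True where a value is missing; notnull() is its inverse
print(df['one'].isnull(),'\n')
print(df['one'].notnull())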

        one       two     three
a  0.160120 -1.836773  1.437116
b       NaN       NaN       NaN
c -0.150992 -0.167657  0.922874
d       NaN       NaN       NaN
e  0.715220  0.360311 -0.025933
f -1.052191 -0.163628 -0.829559
g       NaN       NaN       NaN
h  0.177549 -1.370043 -0.789223 

a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool 

a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: one, dtype: bool
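
In practice the boolean mask is usually aggregated rather than printed. The following is a minimal sketch of that idea on a small hand-written frame (the values are made up for illustration, not taken from the example above): it counts the missing values per column, flags the rows that contain any missing value, and computes the overall total.

```python
import numpy as np
import pandas as pd

# Small hand-written frame (illustrative values) so the counts are easy to verify.
df = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0, np.nan],
     "two": [np.nan, np.nan, 6.0, 7.0],
     "three": [8.0, 9.0, 10.0, 11.0]},
    index=list("abcd"),
)

# Summing a boolean mask counts its True entries, i.e. the NaN values per column.
missing_per_column = df.isnull().sum()
# any(axis=1) is True for every row that holds at least one NaN.
rows_with_missing = df.isnull().any(axis=1)
# Grand total of missing cells in the frame.
total_missing = int(df.isnull().sum().sum())

print(missing_per_column)
print(rows_with_missing)
print(total_missing)
```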


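### 2. Calculations with missing values
- When summing data, NA (NaN) is treated as 0
- If all the data are NA, the result was also NA in pandas versions before 0.22; newer versions return 0 by default, and passing min_count=1 to sum() restores the NA result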

In [4]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5,3),index=list('acefh'),columns=['one','two','three'])
df = df.reindex(list('abcdefgh'))
# NaN entries are skipped, so only the five real values are added up
print(df['one'].sum())

0.31221771641


Another example:

In [5]:
import pandas as pd
import numpy as np

df = pd.DataFrame(index=[0,1,2,3,4,5],columns=['one','two'])
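# Every value is NaN; min_count=1 makes the sum NaN (pandas >= 0.22 returns 0 without it)
print(df['one'].sum(min_count=1))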

nan
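
The two rules can be checked side by side on small Series. This is a sketch with made-up values; the `skipna` and `min_count` arguments belong to `Series.sum`, and the comment on the default all-NA result assumes pandas 0.22 or later.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.5])
all_na = pd.Series([np.nan, np.nan])

# Default behaviour: NaN is skipped, so this is 1.0 + 2.5.
skip_total = s.sum()
# skipna=False lets a single NaN propagate into the result.
strict_total = s.sum(skipna=False)
# min_count=1 requires at least one valid value, otherwise the result is NaN.
all_na_strict = all_na.sum(min_count=1)
# Without min_count the all-NA result depends on the pandas version
# (assumed here: 0.22 or later, where it is 0.0).
all_na_default = all_na.sum()

print(skip_total, strict_total, all_na_strict, all_na_default)
```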


### 3. Cleaning/filling missing values
- Replace NaN with a scalar

In [6]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3,3),index=['a','c','e'],columns=['one','two','three'])
df = df.reindex(['a','b','c'])
print(df)
print("NaN replaced with '0':")
# fillna returns a new DataFrame; df itself is left unchanged
print(df.fillna(0))

        one       two     three
a -0.635565 -1.638826 -0.308573
b       NaN       NaN       NaN
c  0.038882  0.035291  1.729256
NaN replaced with '0':
        one       two     three
a -0.635565 -1.638826 -0.308573
b  0.000000  0.000000  0.000000
c  0.038882  0.035291  1.729256
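
A single scalar is not always a sensible filler. As a possible extension, `fillna` also accepts a Series or a dict keyed by column name, which gives each column its own fill value. The sketch below uses made-up values and fills once with the column means and once with arbitrary per-column constants chosen only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0], "two": [np.nan, 4.0, 8.0]},
    index=list("abc"),
)

# df.mean() skips NaN and yields one value per column; fillna aligns it by column name.
filled_mean = df.fillna(df.mean())
# A dict maps each column to its own fill value (constants picked for illustration).
filled_dict = df.fillna({"one": 0.0, "two": -1.0})

print(filled_mean)
print(filled_dict)
```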


- Forward/backward filling of NA

| Method | Behavior |
| :---: | :---: |
| pad/ffill | Forward fill |
| bfill/backfill | Backward fill |

In [11]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
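print(df,'\n')
# Forward fill; same result as fillna(method='pad'), which recent pandas versions deprecate
print(df.ffill(),'\n')
# Backward fill; same result as fillna(method='bfill')
print(df.bfill())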

        one       two     three
a  0.755379  0.027855  0.384575
b       NaN       NaN       NaN
c  0.715024  0.135741  0.552713
d       NaN       NaN       NaN
e -0.000671 -0.377629  1.083004
f  2.383431 -1.351335 -0.359673
g       NaN       NaN       NaN
h -0.589477 -0.161721  0.927219 

        one       two     three
a  0.755379  0.027855  0.384575
b  0.755379  0.027855  0.384575
c  0.715024  0.135741  0.552713
d  0.715024  0.135741  0.552713
e -0.000671 -0.377629  1.083004
f  2.383431 -1.351335 -0.359673
g  2.383431 -1.351335 -0.359673
h -0.589477 -0.161721  0.927219 

        one       two     three
a  0.755379  0.027855  0.384575
b  0.715024  0.135741  0.552713
c  0.715024  0.135741  0.552713
d -0.000671 -0.377629  1.083004
e -0.000671 -0.377629  1.083004
f  2.383431 -1.351335 -0.359673
g -0.589477 -0.161721  0.927219
h -0.589477 -0.161721  0.927219
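
Two details of directional filling are easy to miss: a gap with no neighbour in the fill direction stays NaN, and long gaps can be capped. The following sketch uses a made-up Series and the `limit` argument of `ffill` to show both.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan], index=list("abcde"))

# Forward fill carries the last valid value downwards.
forward = s.ffill()
# Backward fill pulls the next valid value upwards; 'e' has no later value and stays NaN.
backward = s.bfill()
# limit=1 fills at most one consecutive NaN per gap, so 'c' stays NaN.
limited = s.ffill(limit=1)

print(forward)
print(backward)
print(limited)
```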


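### 4. Dropping missing values

To simply exclude missing values, use the dropna function together with its axis parameter. By default axis = 0, meaning it works along rows: if any value in a row is NA, the whole row is dropped.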

In [12]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5,3),index=['a','c','e','f','h'],
                  columns=['one','two','three'])
df = df.reindex(list('abcdefgh'))
# Rows b, d and g contain NaN, so only a, c, e, f and h remain
print(df.dropna())

        one       two     three
a -0.657432 -2.245392  1.369749
c  0.158382  0.236098  0.345406
e -1.254812 -1.373786  0.261470
f  1.363391  1.516563  1.139172
h  0.680741  0.106961 -1.279349
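
Dropping every row that has any NA can discard more than intended. As a sketch of the other options `dropna` offers, the example below uses a made-up frame and its `how`, `thresh` and `axis` parameters to compare stricter and looser rules.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, np.nan, np.nan, 4.0, np.nan],
     "two": [5.0, np.nan, 7.0, 8.0, np.nan],
     "three": [9.0, np.nan, 11.0, 12.0, 13.0]},
    index=list("abcde"),
)

# Default (how='any'): keep only rows without a single NaN.
complete_rows = df.dropna()
# how='all': drop a row only when every value in it is NaN (row 'b').
not_empty_rows = df.dropna(how="all")
# thresh=2: keep rows that have at least two non-NA values.
at_least_two = df.dropna(thresh=2)
# axis=1 works on columns: after removing the empty row, drop columns that still hold NaN.
complete_columns = df.dropna(how="all").dropna(axis=1)

print(complete_rows)
print(not_empty_rows)
print(at_least_two)
print(complete_columns)
```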


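### 5. Replacing missing (or) generic values
We often need to replace a generic value with a specific one, which the replace method does. Replacing NA with a scalar value is equivalent to what the fillna() function does.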

In [13]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'one':[10,20,30,40,50,2000],
                  'two':[100,0,30,40,50,60]})
# Map 1000 -> 10 and 2000 -> 60; 1000 does not occur here, so only 2000 changes
print(df.replace({1000:10,2000:60}))

   one  two
0   10  100
1   20    0
2   30   30
3   40   40
4   50   50
5   60   60
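
The stated equivalence between replace and fillna() can be checked directly. In this sketch, -999 is a hypothetical sentinel standing in for "missing" (it is not part of the example above): it is first converted to NaN, and the NaN is then replaced by 0 in both ways.

```python
import numpy as np
import pandas as pd

# -999 is a hypothetical placeholder that some data sources use for "missing".
df = pd.DataFrame({"one": [10, -999, 30], "two": [-999, 50, 60]})

# Turn the sentinel into a real NaN so pandas treats it as missing.
as_nan = df.replace(-999, np.nan)
# Replacing NaN with a scalar via replace ...
via_replace = as_nan.replace(np.nan, 0)
# ... gives the same frame as fillna with that scalar.
via_fillna = as_nan.fillna(0)

print(as_nan)
print(via_replace.equals(via_fillna))
```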
