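# Data Cleaning and Preparation
During data analysis and modeling, a large share of the time goes into data preparation: loading, cleaning, transforming, and reshaping. These tasks are often reported to take up 80% or more of an analyst's time.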

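# Handling Missing Data
* Missing data occurs commonly in many data analysis applications. One of the goals of pandas is to make working with missing data as painless as possible. For example, all descriptive statistics on pandas objects exclude missing data by default.

* The way missing data is represented in pandas objects is somewhat imperfect, but it is functional for most users. For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data. We call this a sentinel value, and it can be detected easily: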

In [1]:
import numpy as np
import pandas as pd

In [2]:
# 10x4 frame of standard-normal draws: rows A..J, columns a..d
xa=pd.DataFrame(np.random.randn(10,4),index=[chr(i) for i in range(65,75)],columns=[chr(i) for i in range(97,101)])
xa

Unnamed: 0,a,b,c,d
A,-0.458535,-0.966027,1.99072,0.049643
B,-2.019448,0.469118,-0.368198,0.016554
C,0.672553,0.848501,-0.045076,-0.637709
D,-0.204048,0.427297,2.821062,-3.404246
E,0.851437,0.430311,-0.234595,-0.40273
F,1.63202,-0.912922,0.408945,0.844415
G,-0.998616,-0.257,0.528972,0.110432
H,-0.074171,-0.352392,0.203358,-0.22042
I,-0.677709,0.595538,-1.613784,-0.339048
J,-0.178358,-0.594873,-0.461883,-1.959267


In [3]:
# Element-wise square root; negative entries have no real root and become NaN
xb=xa**0.5
xb


Unnamed: 0,a,b,c,d
A,,,1.410929,0.222806
B,,0.684922,,0.128661
C,0.820094,0.921141,,
D,,0.65368,1.679602,
E,0.922733,0.655981,,
F,1.277505,,0.639488,0.918921
G,,,0.727305,0.332313
H,,,0.450952,
I,,0.771711,,
J,,,,


# isna / isnull: detecting missing values

In [6]:
xb.isna()

Unnamed: 0,a,b,c,d
A,True,True,False,False
B,True,False,True,False
C,False,False,True,True
D,True,False,False,True
E,False,False,True,True
F,False,True,False,False
G,True,True,False,False
H,True,True,False,True
I,True,False,True,True
J,True,True,True,True
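
As a minimal sketch of the sentinel behaviour described above, the following uses a small made-up Series (the name `s` and its values are illustrative and do not come from the notebook). It shows `isna` and its complement `notna`, and that descriptive statistics skip NaN unless `skipna=False` is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one float entry is missing.
s = pd.Series([1.0, np.nan, 3.0, 5.0])

mask = s.isna()       # True where the value is missing
present = s.notna()   # element-wise complement of isna()

# Descriptive statistics skip NaN by default (skipna=True).
mean_default = s.mean()
# With skipna=False a single missing value makes the result NaN.
mean_strict = s.mean(skipna=False)

print(mask.tolist())
print(mean_default, mean_strict)
```

`isnull` is an alias of `isna`, so either name can be used.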


# dropna: filtering out missing values

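In [11]:
# Overwrite column a with the integers 0..9 so that it contains no missing values
xb['a']=range(10)

In [13]:
# Overwrite the first row with 0..3 so that it is complete as well
xb.iloc[0,:]=range(4)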

In [14]:
xb

Unnamed: 0,a,b,c,d
A,0,1.0,2.0,3.0
B,1,0.684922,,0.128661
C,2,0.921141,,
D,3,0.65368,1.679602,
E,4,0.655981,,
F,5,,0.639488,0.918921
G,6,,0.727305,0.332313
H,7,,0.450952,
I,8,0.771711,,
J,9,,,


In [35]:
# Drop every column that contains at least one missing value
xb.dropna(axis='columns',how='any')

Unnamed: 0,a
A,0
B,1
C,2
D,3
E,4
F,5
G,6
H,7
I,8
J,9


In [38]:
# Keep only the rows that have at least 3 non-missing values
xb.dropna(axis=0,thresh=3)

Unnamed: 0,a,b,c,d
A,0,1.0,2.0,3.0
B,1,0.684922,,0.128661
D,3,0.65368,1.679602,
F,5,,0.639488,0.918921
G,6,,0.727305,0.332313
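
The `dropna` options used above can be compared side by side. This is a small sketch on a made-up frame (the name `df` and its values are assumptions for illustration), contrasting the default `how='any'` with `how='all'`, `thresh`, and `subset`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is entirely missing, row 3 is complete.
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan, 4.0],
        "b": [np.nan, np.nan, np.nan, 5.0],
        "c": [3.0, np.nan, 6.0, 7.0],
    }
)

any_missing = df.dropna()            # default how='any': keep complete rows only
all_missing = df.dropna(how="all")   # drop rows where every value is missing
at_least_two = df.dropna(thresh=2)   # keep rows with >= 2 non-missing values
by_subset = df.dropna(subset=["c"])  # only column "c" decides

print(any_missing.index.tolist())
print(at_least_two.index.tolist())
```

Each call returns a new object; `df` itself is left unchanged.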


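# Filling In Missing Data
Rather than filtering out missing data (and possibly discarding other data along with it), you may want to fill in the "holes" in some other way. For most purposes, the fillna method is the workhorse function to use. Calling fillna with a constant replaces missing values with that value: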

In [40]:
xb.fillna(0)

Unnamed: 0,a,b,c,d
A,0,1.0,2.0,3.0
B,1,0.684922,0.0,0.128661
C,2,0.921141,0.0,0.0
D,3,0.65368,1.679602,0.0
E,4,0.655981,0.0,0.0
F,5,0.0,0.639488,0.918921
G,6,0.0,0.727305,0.332313
H,7,0.0,0.450952,0.0
I,8,0.771711,0.0,0.0
J,9,0.0,0.0,0.0


Calling fillna with a dict lets you use a different fill value for each column:

In [46]:
xb.fillna({'b':100,'c':-100,'d':0})

Unnamed: 0,a,b,c,d
A,0,1.0,2.0,3.0
B,1,0.684922,-100.0,0.128661
C,2,0.921141,-100.0,0.0
D,3,0.65368,1.679602,0.0
E,4,0.655981,-100.0,0.0
F,5,100.0,0.639488,0.918921
G,6,100.0,0.727305,0.332313
H,7,100.0,0.450952,0.0
I,8,0.771711,-100.0,0.0
J,9,100.0,-100.0,0.0


In [50]:
# Forward-fill along each row (left to right); equivalent to the older fillna(method='ffill',axis='columns')
xb.ffill(axis='columns')

Unnamed: 0,a,b,c,d
A,0.0,1.0,2.0,3.0
B,1.0,0.684922,0.684922,0.128661
C,2.0,0.921141,0.921141,0.921141
D,3.0,0.65368,1.679602,1.679602
E,4.0,0.655981,0.655981,0.655981
F,5.0,5.0,0.639488,0.918921
G,6.0,6.0,0.727305,0.332313
H,7.0,7.0,0.450952,0.450952
I,8.0,0.771711,0.771711,0.771711
J,9.0,9.0,9.0,9.0


In [51]:
# IPython shortcut that displays the fillna docstring
xb.fillna?
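
Two other common fill strategies can be sketched as follows, on a made-up frame (`df` and its values are illustrative assumptions): filling each column with its own mean, and forward-filling with a limit on how many consecutive gaps are bridged.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in both columns.
df = pd.DataFrame(
    {
        "x": [1.0, np.nan, 3.0, np.nan, np.nan],
        "y": [np.nan, 2.0, np.nan, 4.0, 6.0],
    }
)

# fillna accepts a Series keyed by column name, so df.mean() fills
# every column with that column's own mean.
filled_mean = df.fillna(df.mean())

# Forward-fill down each column, bridging at most one consecutive gap.
filled_ffill = df.ffill(limit=1)

print(filled_mean)
print(filled_ffill)
```

A leading gap (such as the first value of `y`) has nothing before it to propagate, so forward-filling leaves it missing.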

# Data Transformation / Removing Duplicates

In [52]:
xa=pd.DataFrame([range(4) if i%2 else range(1,5) for i in range(8)])
xa

Unnamed: 0,0,1,2,3
0,1,2,3,4
1,0,1,2,3
2,1,2,3,4
3,0,1,2,3
4,1,2,3,4
5,0,1,2,3
6,1,2,3,4
7,0,1,2,3


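The DataFrame method duplicated returns a boolean Series indicating whether each row is a duplicate (that is, whether it has already appeared in an earlier row):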

In [53]:
xa.duplicated()

0    False
1    False
2     True
3     True
4     True
5     True
6     True
7     True
dtype: bool

The related drop_duplicates method returns a DataFrame containing only the rows for which duplicated is False:

In [54]:
xa.drop_duplicates()

Unnamed: 0,0,1,2,3
0,1,2,3,4
1,0,1,2,3


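Both methods consider all of the columns by default; alternatively, you can specify a subset of columns to use when detecting duplicates.

In [56]:
# Detect duplicates using column 1 only
xa.drop_duplicates(subset=[1])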

Unnamed: 0,0,1,2,3
0,1,2,3,4
1,0,1,2,3


duplicated and drop_duplicates keep the first observed value combination by default. Passing keep='last' keeps the last one instead:

In [57]:
xa.drop_duplicates(keep='last')

Unnamed: 0,0,1,2,3
6,1,2,3,4
7,0,1,2,3
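
The relationship between `duplicated`, `drop_duplicates`, `subset`, and `keep` can be sketched on a small made-up frame (the column names `k1` and `k2` and the values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frame: rows 3 and 4 repeat rows 1 and 0 respectively.
df = pd.DataFrame(
    {
        "k1": ["one", "two", "one", "two", "one"],
        "k2": [1, 1, 2, 1, 1],
    }
)

flags = df.duplicated()         # True for rows equal to an earlier row
first_kept = df.drop_duplicates()   # same rows as df[~df.duplicated()]

# Judge duplicates on k1 only and keep the last occurrence of each value.
by_k1_last = df.drop_duplicates(subset=["k1"], keep="last")

print(flags.tolist())
print(by_k1_last)
```

The original index labels are preserved, which makes it easy to see which occurrences survived.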


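# Transforming Data Using a Function or Mapping

In [62]:
# applymap applies the function to every element (pandas 2.1 and later also offer this as DataFrame.map)
xa.applymap(lambda x:str(x**2-1)+"a")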

Unnamed: 0,0,1,2,3
0,0a,3a,8a,15a
1,-1a,0a,3a,8a
2,0a,3a,8a,15a
3,-1a,0a,3a,8a
4,0a,3a,8a,15a
5,-1a,0a,3a,8a
6,0a,3a,8a,15a
7,-1a,0a,3a,8a
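
For a single column, `Series.map` accepts either a function or a dict-like mapping. The sketch below uses made-up data (the names `foods` and `meat_to_animal` are hypothetical) to show both forms; values with no entry in the mapping come back as missing.

```python
import pandas as pd

# Hypothetical data: food names with inconsistent capitalisation.
foods = pd.Series(["Bacon", "pastrami", "Honey Ham", "tofu"])
meat_to_animal = {"bacon": "pig", "pastrami": "cow", "honey ham": "pig"}

# Normalise the case first, then translate through the dict.
# "tofu" has no entry in the mapping, so its result is NaN.
animals = foods.str.lower().map(meat_to_animal)

# map also accepts any callable, applied to each value in turn.
lengths = foods.map(len)

print(animals)
print(lengths.tolist())
```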


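# Replacing Values
Filling in missing data with fillna can be seen as a special case of more general value replacement. map can be used to modify a subset of values in an object, but replace provides a simpler and more flexible way to do so. Consider column 0 of xa, which is a Series: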

In [67]:
# Two parallel lists of old and new values: 0 -> -100, 1 -> 100
xa[0].replace([0,1],[-100,100])

0    100
1   -100
2    100
3   -100
4    100
5   -100
6    100
7   -100
Name: 0, dtype: int64

In [69]:
# A dict of old -> new values: 0 -> 100, 1 -> -100
xa[0].replace({0:100,1:-100})

0   -100
1    100
2   -100
3    100
4   -100
5    100
6   -100
7    100
Name: 0, dtype: int64
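
A typical use of `replace` is turning sentinel codes into proper missing values. The following is a minimal sketch on made-up readings (the name `readings` and the codes -999 and -1000 are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical readings where -999 and -1000 are placeholder codes.
readings = pd.Series([1.0, -999.0, 2.0, -999.0, -1000.0, 3.0])

one_value = readings.replace(-999, np.nan)               # a single value
many_to_one = readings.replace([-999, -1000], np.nan)    # several values, one replacement
per_value = readings.replace({-999: np.nan, -1000: 0})   # a different replacement per value

print(many_to_one.tolist())
print(per_value.tolist())
```

`replace` returns a new Series here; `readings` keeps its original codes.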

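# Renaming Axis Indexes
Like values in a Series, axis labels can be transformed by a function or mapping of some form to produce new, differently labeled objects. You can also modify the axes in place without creating a new data structure.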

In [70]:
xb=xa[0]

In [77]:
# Assigning to index modifies xb in place; each label is reformatted as a string
xb.index=xb.index.map(lambda x:"str:{:.2f}".format(x))

In [79]:
xb

str:0.00    1
str:1.00    0
str:2.00    1
str:3.00    0
str:4.00    1
str:5.00    0
str:6.00    1
str:7.00    0
Name: 0, dtype: int64

To create a transformed version of a dataset without modifying the original, a useful method is rename:

In [89]:
# Map row labels 0..7 to 'P'..'W' and column labels 0..3 to 'd'..'g'; xa itself is unchanged
xa.rename(index=lambda x:chr(x+80),columns=lambda x:chr(x+100))

Unnamed: 0,d,e,f,g
P,1,2,3,4
Q,0,1,2,3
R,1,2,3,4
S,0,1,2,3
T,1,2,3,4
U,0,1,2,3
V,1,2,3,4
W,0,1,2,3
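
Besides arbitrary functions, `rename` also accepts built-in string methods and dicts that touch only some labels. A small sketch on a made-up frame (the labels and the name `df` are illustrative assumptions):

```python
import pandas as pd

# Hypothetical frame with lowercase labels.
df = pd.DataFrame(
    [[0, 1, 2], [3, 4, 5]],
    index=["ohio", "colorado"],
    columns=["one", "two", "three"],
)

# Functions are applied to every label on the given axis.
renamed = df.rename(index=str.title, columns=str.upper)

# Dicts rename only the labels they mention.
partial = df.rename(index={"ohio": "INDIANA"}, columns={"three": "peekaboo"})

print(renamed)
print(partial)
```

Both calls return new objects, so the labels of `df` stay as they were.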


# Discretization and Binning
Continuous data is often discretized or otherwise separated into bins for analysis.

In [91]:
xa=pd.Series(range(50))
# Bin edges -1, 10, 30, 50 give the intervals (-1, 10], (10, 30], (30, 50]
xb=pd.cut(xa,[-1,10,30,50])

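Because the input here is a Series, pd.cut returns a Series with categorical dtype (for a plain array it returns a special Categorical object). The result describes the bins computed by pandas.cut; you can treat it like an array of strings indicating the bin names.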

In [92]:
# Count how many values fall into each bin
xb.value_counts()

(10, 30]    20
(30, 50]    19
(-1, 10]    11
dtype: int64
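
Some further binning options can be sketched as follows, on made-up ages (the names `ages`, `bins`, and the label strings are assumptions for illustration): custom bin labels, left-closed intervals via `right=False`, and quantile-based bins with `pd.qcut`.

```python
import pandas as pd

# Hypothetical ages and bin edges.
ages = pd.Series([20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32])
bins = [18, 25, 35, 60, 100]

# Name the bins instead of showing interval notation.
groups = pd.cut(ages, bins, labels=["youth", "young_adult", "middle_aged", "senior"])
counts = groups.value_counts()

# right=False closes each interval on the left: [18, 25), [25, 35), ...
left_closed = pd.cut(ages, bins, right=False)

# qcut chooses edges from sample quantiles, giving roughly equal-sized bins.
quartiles = pd.qcut(ages, 4)

print(counts)
print(quartiles.value_counts())
```

With `right=False`, the age 25 moves from the first bin into the second, which is the main thing to watch for when switching the closed side.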

# Detecting and Filtering Outliers
Filtering or transforming outliers is largely a matter of applying array operations.

In [93]:
import numpy as np  # pd.np is not available in current pandas releases
xa=pd.DataFrame(np.random.randn(100,5))

In [96]:
xb=xa[0]

In [98]:
# Cap extreme values in one column. Note that values below -3 would also be set to +3 here.
xb[xb.abs()>3]=3

In [100]:
# The same cap applied to the whole frame
xa[xa.abs()>3]=3

In [101]:
# Summary after capping: no column maximum exceeds 3
xa.describe()

Unnamed: 0,0,1,2,3,4
count,100.0,100.0,100.0,100.0,100.0
mean,0.005161,-0.12336,0.097801,0.144553,-0.018344
std,1.038628,1.049578,1.018409,1.027993,1.055948
min,-2.434947,-2.402733,-2.177662,-2.22886,-2.164555
25%,-0.663125,-0.939597,-0.598766,-0.676779,-0.837292
50%,-0.038703,-0.051882,0.024745,0.244455,-0.229076
75%,0.685943,0.715446,0.837162,0.79622,0.68794
max,3.0,2.907048,2.66541,2.638246,3.0
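
Two related array operations are sketched below on made-up data (the name `data`, the seed, and the injected values are assumptions for illustration): selecting every row that contains an outlier, and capping values symmetrically to the interval [-3, 3] with `clip`, so that large negative values are set to -3 rather than +3.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(12345)  # fixed seed so the sketch is repeatable
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# Inject two known outliers so the example always has something to find.
data.iloc[0, 0] = 4.5
data.iloc[1, 2] = -3.7

# Rows in which at least one column exceeds 3 in absolute value.
outlier_rows = data[(data.abs() > 3).any(axis="columns")]

# Cap on both sides; returns a new frame and leaves data untouched.
capped = data.clip(lower=-3, upper=3)

print(outlier_rows)
print(capped.describe())
```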

# Permutation and Random Sampling
Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the numpy.random.permutation function.

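In [102]:
import numpy as np  # imported here so the cell runs on its own
xa=pd.DataFrame(np.arange(20).reshape((5,4)))
xa

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


In [103]:
# A random ordering of the row positions 0..4
perm=np.random.permutation(5)
perm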

array([3, 1, 0, 4, 2])

In [104]:
xa.take(perm)

Unnamed: 0,0,1,2,3
3,12,13,14,15
1,4,5,6,7
0,0,1,2,3
4,16,17,18,19
2,8,9,10,11


In [105]:
xa.iloc[perm,:]

Unnamed: 0,0,1,2,3
3,12,13,14,15
1,4,5,6,7
0,0,1,2,3
4,16,17,18,19
2,8,9,10,11


In [106]:
xa.sample(3)

Unnamed: 0,0,1,2,3
4,16,17,18,19
3,12,13,14,15
2,8,9,10,11
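
For reproducible results it can help to fix the random state. The sketch below (the seed values and the names `rng`, `shuffled`, `subset`, and `draws` are assumptions for illustration) shows a seeded permutation, a seeded sample without replacement, and sampling with replacement, which allows more draws than there are values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape((5, 4)))

# A seeded generator produces the same permutation on every run.
rng = np.random.default_rng(0)
perm = rng.permutation(len(df))
shuffled = df.iloc[perm]

# Sample 3 distinct rows; random_state makes the choice repeatable.
subset = df.sample(n=3, random_state=42)

# With replace=True the same value may be drawn more than once.
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(n=10, replace=True, random_state=42)

print(shuffled)
print(draws.tolist())
```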


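# Vectorized String Functions in pandas
* You can apply string and regular expression methods to each value (passing a lambda or other function) using data.map, but it will fail on NA (null) values.
* To cope with this, Series has array-oriented methods for string operations that skip NA values. These are accessed through the Series's str attribute.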
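
The difference between the two approaches can be sketched on made-up data (the name `emails` and the addresses are hypothetical; `dtype=object` is set explicitly so the missing entry is a plain float NaN):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing entry.
emails = pd.Series(
    {
        "Dave": "dave@google.com",
        "Steve": "steve@gmail.com",
        "Rob": "rob@gmail.com",
        "Wes": np.nan,
    },
    dtype=object,
)

# Calling a plain str method through map fails on the NaN entry,
# because a float has no .upper() method.
try:
    emails.map(lambda x: x.upper())
    map_failed = False
except AttributeError:
    map_failed = True

# The .str accessor skips the missing entry instead of raising.
upper = emails.str.upper()
has_gmail = emails.str.contains("gmail")
domains = emails.str.split("@").str[1]

print(map_failed)
print(upper)
print(domains)
```

The `.str` methods can be chained, as in the `split` followed by `str[1]` above, and the missing entry stays missing at every step.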

In [107]:
xa=pd.Series([chr(i) for i in range(65,100)])
xa

0     A
1     B
2     C
3     D
4     E
5     F
6     G
7     H
8     I
9     J
10    K
11    L
12    M
13    N
14    O
15    P
16    Q
17    R
18    S
19    T
20    U
21    V
22    W
23    X
24    Y
25    Z
26    [
27    \
28    ]
29    ^
30    _
31    `
32    a
33    b
34    c
dtype: object

In [110]:
# Public methods on the .str accessor (names starting with an underscore are filtered out)
[name for name in dir(xa.str) if not name.startswith('_')]

['capitalize',
 'cat',
 'center',
 'contains',
 'count',
 'decode',
 'encode',
 'endswith',
 'extract',
 'extractall',
 'find',
 'findall',
 'get',
 'get_dummies',
 'index',
 'isalnum',
 'isalpha',
 'isdecimal',
 'isdigit',
 'islower',
 'isnumeric',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'len',
 'ljust',
 'lower',
 'lstrip',
 'match',
 'normalize',
 'pad',
 'partition',
 'repeat',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rst

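Effective data preparation lets you spend more time analyzing data and less time getting it ready for analysis, which can significantly improve productivity.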