# Data Cleaning

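Data cleaning is the first step in processing data. Our data is usually aggregated from several sources, so it almost inevitably involves issues such as

+ duplicate values
+ missing values
+ outliers and extreme values
+ interpolation

Data cleaning deals with these issues so that the raw data becomes usable for further analysis.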


In [1]:
import pandas as pd
import numpy as np

In [2]:
dirty = pd.read_csv("source/dirty.csv",sep=",")
dirty

Unnamed: 0,name,age,weight,height,sex
0,Bob,12.0,69,175.0,m
1,Jessica,,89,195.0,f
2,Mary,15.0,49,169.0,f
3,John,18.0,79,184.0,m
4,Bob,12.0,69,175.0,m
5,Mel,11.0,45,,f
6,Mary,14.0,56,176.0,f
7,Jessica,25555.0,44,149.0,f
8,Bob,18.0,69,178.0,m
9,Marila,17.0,48,164.0,f


## Handling Duplicates

+ ### First, check whether the data contains duplicates

In [3]:
dirty.duplicated()

0     False
1     False
2     False
3     False
4      True
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
dtype: bool

Row 4 is flagged as a duplicate: it repeats row 0 exactly.
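
On a larger table it is usually easier to count the duplicates and look only at the flagged rows than to read the whole boolean Series. The following is a minimal sketch on a small hypothetical frame (`people` is not the `dirty` data set used in this notebook):

```python
import pandas as pd

# Hypothetical sample data; row 2 repeats row 0 exactly.
people = pd.DataFrame({
    "name": ["Bob", "Mary", "Bob", "John"],
    "age": [12, 15, 12, 18],
})

# duplicated() returns a boolean Series, so sum() counts the flagged rows.
n_duplicates = people.duplicated().sum()

# Boolean indexing shows only the rows that repeat an earlier row.
duplicate_rows = people[people.duplicated()]

print(n_duplicates)
print(duplicate_rows)
```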

+ ### Remove duplicates

In [4]:
dirty.drop_duplicates()

Unnamed: 0,name,age,weight,height,sex
0,Bob,12.0,69,175.0,m
1,Jessica,,89,195.0,f
2,Mary,15.0,49,169.0,f
3,John,18.0,79,184.0,m
5,Mel,11.0,45,,f
6,Mary,14.0,56,176.0,f
7,Jessica,25555.0,44,149.0,f
8,Bob,18.0,69,178.0,m
9,Marila,17.0,48,164.0,f
10,Johana,16.0,57,162.0,f


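Both methods compare all columns by default, but you can also restrict the check to a subset of columns. For example, we can treat `name` as the key and filter duplicates on that column alone: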

In [5]:
dirty.duplicated(["name"])

0     False
1     False
2     False
3     False
4      True
5     False
6      True
7      True
8      True
9     False
10    False
11    False
12    False
13    False
14    False
15    False
dtype: bool

In [6]:
dirty.drop_duplicates(["name"])

Unnamed: 0,name,age,weight,height,sex
0,Bob,12.0,69,175.0,m
1,Jessica,,89,195.0,f
2,Mary,15.0,49,169.0,f
3,John,18.0,79,184.0,m
5,Mel,11.0,45,,f
9,Marila,17.0,48,164.0,f
10,Johana,16.0,57,162.0,f
11,Melenda,15.0,42,153.0,f
12,Maryre,1977.0,300,,f
13,Jeson,-25555.0,200,,m


By default, `duplicated` and `drop_duplicates` keep the first occurrence of each combination of values. Pass `keep='last'` to keep the last one instead:

In [7]:
dirty.drop_duplicates(["name"],keep='last')

Unnamed: 0,name,age,weight,height,sex
3,John,18.0,79,184.0,m
5,Mel,11.0,45,,f
6,Mary,14.0,56,176.0,f
7,Jessica,25555.0,44,149.0,f
8,Bob,18.0,69,178.0,m
9,Marila,17.0,48,164.0,f
10,Johana,16.0,57,162.0,f
11,Melenda,15.0,42,153.0,f
12,Maryre,1977.0,300,,f
13,Jeson,-25555.0,200,,m
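
Besides `'first'` and `'last'`, `keep` also accepts `False`, which drops every row whose key occurs more than once. The sketch below compares the three settings on a small hypothetical frame (`records` is illustrative, not the notebook's data):

```python
import pandas as pd

# Hypothetical sample data; "Bob" appears twice with different ages.
records = pd.DataFrame({
    "name": ["Bob", "Mary", "Bob", "John"],
    "age": [12, 15, 18, 18],
})

# Keep the first row for each name (the default).
first = records.drop_duplicates(["name"])

# Keep the last row for each name.
last = records.drop_duplicates(["name"], keep="last")

# Keep only names that occur exactly once.
unique_only = records.drop_duplicates(["name"], keep=False)

print(first, last, unique_only, sep="\n\n")
```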


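## Handling Missing Values

Data read from a file sometimes has missing values. Missing values read this way show up in pandas as `numpy.nan`.

+ ### Inspect them with the isnull method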

In [8]:
dirty.isnull()

Unnamed: 0,name,age,weight,height,sex
0,False,False,False,False,False
1,False,True,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,False,False,False
5,False,False,False,True,False
6,False,False,False,False,False
7,False,False,False,False,False
8,False,False,False,False,False
9,False,False,False,False,False


In [9]:
dirty["age"].isnull()

0     False
1      True
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15     True
Name: age, dtype: bool
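
The boolean result of `isnull()` can be aggregated to summarize where the gaps are. This is a minimal sketch on a small hypothetical frame (`sample` stands in for the real data):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one gap in "age" and one in "height".
sample = pd.DataFrame({
    "name": ["Bob", "Jessica", "Mel"],
    "age": [12.0, np.nan, 11.0],
    "height": [175.0, 195.0, np.nan],
})

# Summing the boolean mask counts the missing values in each column.
missing_per_column = sample.isnull().sum()

# any(axis=1) is True for rows that have at least one missing value.
rows_with_missing = sample[sample.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```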

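+ ### Drop them with the dropna() method

By default, `dropna()` drops every row that contains at least one missing value. Set `how="all"` to drop only the rows in which every value is missing.

To drop columns instead of rows, add the argument `axis=1`.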

In [10]:
dirty.dropna()

Unnamed: 0,name,age,weight,height,sex
0,Bob,12.0,69,175.0,m
2,Mary,15.0,49,169.0,f
3,John,18.0,79,184.0,m
4,Bob,12.0,69,175.0,m
6,Mary,14.0,56,176.0,f
7,Jessica,25555.0,44,149.0,f
8,Bob,18.0,69,178.0,m
9,Marila,17.0,48,164.0,f
10,Johana,16.0,57,162.0,f
11,Melenda,15.0,42,153.0,f


+ ### Fill missing values

Sometimes we do not want to delete anything and would rather fill the gaps with other data. The fillna method does this.

In [11]:
dirty.fillna(0)

Unnamed: 0,name,age,weight,height,sex
0,Bob,12.0,69,175.0,m
1,Jessica,0.0,89,195.0,f
2,Mary,15.0,49,169.0,f
3,John,18.0,79,184.0,m
4,Bob,12.0,69,175.0,m
5,Mel,11.0,45,0.0,f
6,Mary,14.0,56,176.0,f
7,Jessica,25555.0,44,149.0,f
8,Bob,18.0,69,178.0,m
9,Marila,17.0,48,164.0,f


The argument can also be a pandas object. For example, to fill each column with its mean:

In [12]:
# numeric_only=True skips the string columns, which have no mean
dirty.fillna(dirty.mean(numeric_only=True))

Unnamed: 0,name,age,weight,height,sex
0,Bob,12.0,69,175.0,m
1,Jessica,152.928571,89,195.0,f
2,Mary,15.0,49,169.0,f
3,John,18.0,79,184.0,m
4,Bob,12.0,69,175.0,m
5,Mel,11.0,45,171.083333,f
6,Mary,14.0,56,176.0,f
7,Jessica,25555.0,44,149.0,f
8,Bob,18.0,69,178.0,m
9,Marila,17.0,48,164.0,f


The argument to fillna can also be a dict keyed by column name, which fills different columns with different values:

In [13]:
dirty.fillna({"age":dirty["age"].mean(),
             "height":dirty["height"].mean()})

Unnamed: 0,name,age,weight,height,sex
0,Bob,12.0,69,175.0,m
1,Jessica,152.928571,89,195.0,f
2,Mary,15.0,49,169.0,f
3,John,18.0,79,184.0,m
4,Bob,12.0,69,175.0,m
5,Mel,11.0,45,171.083333,f
6,Mary,14.0,56,176.0,f
7,Jessica,25555.0,44,149.0,f
8,Bob,18.0,69,178.0,m
9,Marila,17.0,48,164.0,f
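
The mean age of `152.928571` used as the fill value above is distorted by entries such as `25555.0`. When outliers have not been removed yet, the median may be a sturdier fill value. The sketch below contrasts the two on a short hypothetical Series (`ages` is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical ages: one missing value and one obviously wrong entry.
ages = pd.Series([12.0, np.nan, 15.0, 18.0, 25555.0], name="age")

# The mean is pulled far upward by the single bad value.
filled_with_mean = ages.fillna(ages.mean())

# The median depends only on the middle of the sorted values.
filled_with_median = ages.fillna(ages.median())

print(filled_with_mean[1])    # 6400.0
print(filled_with_median[1])  # 16.5
```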


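To fill gaps forward or backward, add the argument `method=<method>`.
The following keywords are available:

+ 'backfill' and 'bfill'
    use the next valid observation to fill the gap before it
    
+ 'pad' and 'ffill'
    propagate the last valid observation forward to the next one
    
When using `method`, the `limit=<n:int>` keyword caps how many consecutive missing values are filled. Recent pandas releases deprecate the `method` argument of fillna in favor of the equivalent `DataFrame.bfill()` and `DataFrame.ffill()` methods.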

In [14]:
dirty.fillna(method='bfill',limit=1 )

Unnamed: 0,name,age,weight,height,sex
0,Bob,12.0,69,175.0,m
1,Jessica,15.0,89,195.0,f
2,Mary,15.0,49,169.0,f
3,John,18.0,79,184.0,m
4,Bob,12.0,69,175.0,m
5,Mel,11.0,45,176.0,f
6,Mary,14.0,56,176.0,f
7,Jessica,25555.0,44,149.0,f
8,Bob,18.0,69,178.0,m
9,Marila,17.0,48,164.0,f


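## Outliers and Extreme Values

Values such as `25555.0` and `-25555`, or the `200` and `300` in weight, may be sentinel values that mark missing data, or they may be data-entry errors. Either way they are anomalous. To turn them into NA values that pandas understands, we can use replace, which returns a new DataFrame: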

In [15]:
dirty.replace(25555.0, np.nan)

Unnamed: 0,name,age,weight,height,sex
0,Bob,12.0,69,175.0,m
1,Jessica,,89,195.0,f
2,Mary,15.0,49,169.0,f
3,John,18.0,79,184.0,m
4,Bob,12.0,69,175.0,m
5,Mel,11.0,45,,f
6,Mary,14.0,56,176.0,f
7,Jessica,,44,149.0,f
8,Bob,18.0,69,178.0,m
9,Marila,17.0,48,164.0,f


To replace several values at once, pass a list of the values to replace together with a single replacement value:

In [16]:
dirty.replace([25555.0,-25555.0], np.nan)

Unnamed: 0,name,age,weight,height,sex
0,Bob,12.0,69,175.0,m
1,Jessica,,89,195.0,f
2,Mary,15.0,49,169.0,f
3,John,18.0,79,184.0,m
4,Bob,12.0,69,175.0,m
5,Mel,11.0,45,,f
6,Mary,14.0,56,176.0,f
7,Jessica,,44,149.0,f
8,Bob,18.0,69,178.0,m
9,Marila,17.0,48,164.0,f


To replace different values with different replacements, pass a second list of replacement values in the same order:

In [17]:
dirty.replace([25555.0,-25555.0], [np.nan, 0])

Unnamed: 0,name,age,weight,height,sex
0,Bob,12.0,69,175.0,m
1,Jessica,,89,195.0,f
2,Mary,15.0,49,169.0,f
3,John,18.0,79,184.0,m
4,Bob,12.0,69,175.0,m
5,Mel,11.0,45,,f
6,Mary,14.0,56,176.0,f
7,Jessica,,44,149.0,f
8,Bob,18.0,69,178.0,m
9,Marila,17.0,48,164.0,f


The argument can also be a dict:

In [18]:
dirty.replace({25555.0: np.nan, -25555.0: 0})

Unnamed: 0,name,age,weight,height,sex
0,Bob,12.0,69,175.0,m
1,Jessica,,89,195.0,f
2,Mary,15.0,49,169.0,f
3,John,18.0,79,184.0,m
4,Bob,12.0,69,175.0,m
5,Mel,11.0,45,,f
6,Mary,14.0,56,176.0,f
7,Jessica,,44,149.0,f
8,Bob,18.0,69,178.0,m
9,Marila,17.0,48,164.0,f


More often, each column needs its own replacement rules:

In [19]:
nan_dirty = dirty.replace({"age":[25555.0,-25555.0,1977],
              "weight":[300,200]},np.nan)
nan_dirty

Unnamed: 0,name,age,weight,height,sex
0,Bob,12.0,69.0,175.0,m
1,Jessica,,89.0,195.0,f
2,Mary,15.0,49.0,169.0,f
3,John,18.0,79.0,184.0,m
4,Bob,12.0,69.0,175.0,m
5,Mel,11.0,45.0,,f
6,Mary,14.0,56.0,176.0,f
7,Jessica,,44.0,149.0,f
8,Bob,18.0,69.0,178.0,m
9,Marila,17.0,48.0,164.0,f
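
Listing every bad value by hand only works when the bad values are known in advance. An alternative is to state the plausible range and blank out everything outside it. The sketch below assumes an age range of 0 to 120, which is an illustrative choice and not something the data itself prescribes:

```python
import pandas as pd

# Hypothetical sample data containing two implausible ages.
people = pd.DataFrame({
    "name": ["Bob", "Jessica", "Jeson"],
    "age": [12.0, 25555.0, -25555.0],
})

# Assumed plausible range for a human age; adjust for the data at hand.
valid = people["age"].between(0, 120)

# where() keeps values where the condition holds and puts NaN elsewhere.
people["age"] = people["age"].where(valid)

print(people)
```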


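More commonly still, the data set is so large that we cannot tell in advance which values are outliers to replace. In that case we have to decide for ourselves, using a statistical rule, whether a point is an outlier.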

In [20]:
States = ['NY', 'NY', 'NY', 'NY', 'FL', 'FL', 'GA', 'GA', 'FL', 'FL'] 
data = [1.0, 2, 3, 4, 5, 6, 7, 8, 9, 10]
idx = pd.date_range('1/1/2012', periods=10, freq='MS')
df1 = pd.DataFrame(data, index=idx, columns=['Revenue'])
df1['State'] = States

In [21]:
df1

Unnamed: 0,Revenue,State
2012-01-01,1.0,NY
2012-02-01,2.0,NY
2012-03-01,3.0,NY
2012-04-01,4.0,NY
2012-05-01,5.0,FL
2012-06-01,6.0,FL
2012-07-01,7.0,GA
2012-08-01,8.0,GA
2012-09-01,9.0,FL
2012-10-01,10.0,FL


In [22]:
data2 = [10.0, 10.0, 9, 9, 8, 8, 7, 7, 6, 6]
idx2 = pd.date_range('1/1/2013', periods=10, freq='MS')
df2 = pd.DataFrame(data2, index=idx2, columns=['Revenue'])
df2['State'] = States
df2

Unnamed: 0,Revenue,State
2013-01-01,10.0,NY
2013-02-01,10.0,NY
2013-03-01,9.0,NY
2013-04-01,9.0,NY
2013-05-01,8.0,FL
2013-06-01,8.0,FL
2013-07-01,7.0,GA
2013-08-01,7.0,GA
2013-09-01,6.0,FL
2013-10-01,6.0,FL


In [23]:
df = pd.concat([df1,df2])
df

Unnamed: 0,Revenue,State
2012-01-01,1.0,NY
2012-02-01,2.0,NY
2012-03-01,3.0,NY
2012-04-01,4.0,NY
2012-05-01,5.0,FL
2012-06-01,6.0,FL
2012-07-01,7.0,GA
2012-08-01,8.0,GA
2012-09-01,9.0,FL
2012-10-01,10.0,FL


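That is the source data for our example. Next we use its mean and standard deviation to decide whether a point is an outlier: a point is flagged when its distance from the mean exceeds 1.96 standard deviations.

+ Method 1

    Use the statistics of the whole data set to decide whether a point is an outlier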

In [24]:
newdf = df.copy()

In [25]:
newdf['x-Mean'] = abs(newdf['Revenue'] - newdf['Revenue'].mean())
newdf

Unnamed: 0,Revenue,State,x-Mean
2012-01-01,1.0,NY,5.75
2012-02-01,2.0,NY,4.75
2012-03-01,3.0,NY,3.75
2012-04-01,4.0,NY,2.75
2012-05-01,5.0,FL,1.75
2012-06-01,6.0,FL,0.75
2012-07-01,7.0,GA,0.25
2012-08-01,8.0,GA,1.25
2012-09-01,9.0,FL,2.25
2012-10-01,10.0,FL,3.25


In [26]:
newdf['1.96*std'] = 1.96*newdf['Revenue'].std()  
newdf

Unnamed: 0,Revenue,State,x-Mean,1.96*std
2012-01-01,1.0,NY,5.75,5.200273
2012-02-01,2.0,NY,4.75,5.200273
2012-03-01,3.0,NY,3.75,5.200273
2012-04-01,4.0,NY,2.75,5.200273
2012-05-01,5.0,FL,1.75,5.200273
2012-06-01,6.0,FL,0.75,5.200273
2012-07-01,7.0,GA,0.25,5.200273
2012-08-01,8.0,GA,1.25,5.200273
2012-09-01,9.0,FL,2.25,5.200273
2012-10-01,10.0,FL,3.25,5.200273


In [27]:
newdf['Outlier'] = abs(newdf['Revenue'] - newdf['Revenue'].mean()) > 1.96*newdf['Revenue'].std()
newdf

Unnamed: 0,Revenue,State,x-Mean,1.96*std,Outlier
2012-01-01,1.0,NY,5.75,5.200273,True
2012-02-01,2.0,NY,4.75,5.200273,False
2012-03-01,3.0,NY,3.75,5.200273,False
2012-04-01,4.0,NY,2.75,5.200273,False
2012-05-01,5.0,FL,1.75,5.200273,False
2012-06-01,6.0,FL,0.75,5.200273,False
2012-07-01,7.0,GA,0.25,5.200273,False
2012-08-01,8.0,GA,1.25,5.200273,False
2012-09-01,9.0,FL,2.25,5.200273,False
2012-10-01,10.0,FL,3.25,5.200273,False
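
The three cells above can be folded into one small helper. This is a sketch, not part of the original notebook: the function name `flag_outliers`, the parameter `k`, and the sample values are all hypothetical.

```python
import pandas as pd

def flag_outliers(values: pd.Series, k: float = 1.96) -> pd.Series:
    """Return True where a value lies more than k sample standard deviations from the mean."""
    return (values - values.mean()).abs() > k * values.std()

# Hypothetical revenue figures: eight ordinary values and one extreme one.
revenue = pd.Series([10.0, 11.0, 9.0, 10.0, 12.0, 9.0, 11.0, 10.0, 100.0])

flags = flag_outliers(revenue)
print(revenue[flags])  # only the value 100.0 is flagged
```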


+ Method 2
    
    Use groupby + transform to decide, from the statistics of each group, whether a point is an outlier

In [28]:
newdf = df.copy()

In [29]:
State = newdf.groupby('State')
State.groups

{'FL': DatetimeIndex(['2012-05-01', '2012-06-01', '2012-09-01', '2012-10-01',
                '2013-05-01', '2013-06-01', '2013-09-01', '2013-10-01'],
               dtype='datetime64[ns]', freq=None),
 'GA': DatetimeIndex(['2012-07-01', '2012-08-01', '2013-07-01', '2013-08-01'], dtype='datetime64[ns]', freq=None),
 'NY': DatetimeIndex(['2012-01-01', '2012-02-01', '2012-03-01', '2012-04-01',
                '2013-01-01', '2013-02-01', '2013-03-01', '2013-04-01'],
               dtype='datetime64[ns]', freq=None)}

In [30]:
# Select the Revenue column explicitly so that transform returns a Series
newdf['Outlier'] = State['Revenue'].transform(lambda x: abs(x - x.mean()) > 1.96 * x.std())
newdf

Unnamed: 0,Revenue,State,Outlier
2012-01-01,1.0,NY,False
2012-02-01,2.0,NY,False
2012-03-01,3.0,NY,False
2012-04-01,4.0,NY,False
2012-05-01,5.0,FL,False
2012-06-01,6.0,FL,False
2012-07-01,7.0,GA,False
2012-08-01,8.0,GA,False
2012-09-01,9.0,FL,False
2012-10-01,10.0,FL,False


In [31]:
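# Select the Revenue column explicitly so the new Outlier column is not transformed too
newdf['x-Mean'] = State['Revenue'].transform(lambda x: abs(x - x.mean()))
newdf['1.96*std'] = State['Revenue'].transform(lambda x: 1.96 * x.std())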
newdf

Unnamed: 0,Revenue,State,Outlier,x-Mean,1.96*std
2012-01-01,1.0,NY,False,5.0,7.554813
2012-02-01,2.0,NY,False,4.0,7.554813
2012-03-01,3.0,NY,False,3.0,7.554813
2012-04-01,4.0,NY,False,2.0,7.554813
2012-05-01,5.0,FL,False,2.25,3.434996
2012-06-01,6.0,FL,False,1.25,3.434996
2012-07-01,7.0,GA,False,0.25,0.98
2012-08-01,8.0,GA,False,0.75,0.98
2012-09-01,9.0,FL,False,1.75,3.434996
2012-10-01,10.0,FL,False,2.75,3.434996
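
Grouping matters when the groups sit at very different levels: a value can look ordinary against the whole data set and still be unusual within its own group. The sketch below uses hypothetical data (`sales`) and the built-in `"mean"` and `"std"` aggregations of transform rather than lambdas:

```python
import pandas as pd

# Hypothetical data: NY revenue is around 10, FL revenue is around 100.
sales = pd.DataFrame({
    "State": ["NY"] * 7 + ["FL"] * 7,
    "Revenue": [10.0, 11.0, 9.0, 10.0, 12.0, 9.0, 60.0,
                100.0, 101.0, 99.0, 100.0, 102.0, 98.0, 100.0],
})

# Whole-data-set rule: 60 is close to the overall mean, so nothing is flagged.
overall = sales["Revenue"]
global_flag = (overall - overall.mean()).abs() > 1.96 * overall.std()

# Per-group rule: transform broadcasts each group's statistic back to its rows.
grouped = sales.groupby("State")["Revenue"]
group_mean = grouped.transform("mean")
group_std = grouped.transform("std")
sales["Outlier"] = (sales["Revenue"] - group_mean).abs() > 1.96 * group_std

print(sales[sales["Outlier"]])  # the NY row with Revenue 60.0
```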


+ Method 3

    Group by item: use groupby + apply to decide, from the statistics of each group, whether a point is an outlier
   

In [32]:
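newdf = df.copy()

# group_keys=False keeps the original index instead of prepending the group label
State = newdf.groupby('State', group_keys=False)

def s(group):
    # Work on a copy so the frame handed in by apply is not modified in place
    group = group.copy()
    deviation = abs(group['Revenue'] - group['Revenue'].mean())
    threshold = 1.96 * group['Revenue'].std()
    group['x-Mean'] = deviation
    group['1.96*std'] = threshold
    group['Outlier'] = deviation > threshold
    return group

Newdf2 = State.apply(s)
Newdf2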

Unnamed: 0,Revenue,State,x-Mean,1.96*std,Outlier
2012-01-01,1.0,NY,5.0,7.554813,False
2012-02-01,2.0,NY,4.0,7.554813,False
2012-03-01,3.0,NY,3.0,7.554813,False
2012-04-01,4.0,NY,2.0,7.554813,False
2012-05-01,5.0,FL,2.25,3.434996,False
2012-06-01,6.0,FL,1.25,3.434996,False
2012-07-01,7.0,GA,0.25,0.98,False
2012-08-01,8.0,GA,0.75,0.98,False
2012-09-01,9.0,FL,1.75,3.434996,False
2012-10-01,10.0,FL,2.75,3.434996,False


Grouping by multiple items

In [33]:
newdf = df.copy()

# group_keys=False keeps the original index in the result of apply
StateMonth = newdf.groupby(['State', lambda x: x.month], group_keys=False)

StateMonth.groups

{('FL',
  5): DatetimeIndex(['2012-05-01', '2013-05-01'], dtype='datetime64[ns]', freq=None),
 ('FL',
  6): DatetimeIndex(['2012-06-01', '2013-06-01'], dtype='datetime64[ns]', freq=None),
 ('FL',
  9): DatetimeIndex(['2012-09-01', '2013-09-01'], dtype='datetime64[ns]', freq=None),
 ('FL',
  10): DatetimeIndex(['2012-10-01', '2013-10-01'], dtype='datetime64[ns]', freq=None),
 ('GA',
  7): DatetimeIndex(['2012-07-01', '2013-07-01'], dtype='datetime64[ns]', freq=None),
 ('GA',
  8): DatetimeIndex(['2012-08-01', '2013-08-01'], dtype='datetime64[ns]', freq=None),
 ('NY',
  1): DatetimeIndex(['2012-01-01', '2013-01-01'], dtype='datetime64[ns]', freq=None),
 ('NY',
  2): DatetimeIndex(['2012-02-01', '2013-02-01'], dtype='datetime64[ns]', freq=None),
 ('NY',
  3): DatetimeIndex(['2012-03-01', '2013-03-01'], dtype='datetime64[ns]', freq=None),
 ('NY',
  4): DatetimeIndex(['2012-04-01', '2013-04-01'], dtype='datetime64[ns]', freq=None)}

In [34]:
def s(group):
    group['x-Mean'] = abs(group['Revenue'] - group['Revenue'].mean())
    group['1.96*std'] = 1.96*group['Revenue'].std()  
    group['Outlier'] = abs(group['Revenue'] - group['Revenue'].mean()) > 1.96*group['Revenue'].std()
    return group

Newdf2 = StateMonth.apply(s)
Newdf2

Unnamed: 0,Revenue,State,x-Mean,1.96*std,Outlier
2012-01-01,1.0,NY,4.5,12.473364,False
2012-02-01,2.0,NY,4.0,11.087434,False
2012-03-01,3.0,NY,3.0,8.315576,False
2012-04-01,4.0,NY,2.5,6.929646,False
2012-05-01,5.0,FL,1.5,4.157788,False
2012-06-01,6.0,FL,1.0,2.771859,False
2012-07-01,7.0,GA,0.0,0.0,False
2012-08-01,8.0,GA,0.5,1.385929,False
2012-09-01,9.0,FL,1.5,4.157788,False
2012-10-01,10.0,FL,2.0,5.543717,False
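
Every (State, month) group above contains only two rows, and with two rows this rule can never flag anything. If the two values differ by d, each lies d/2 from the mean, while the sample standard deviation is d/√2, so the threshold 1.96·d/√2 ≈ 1.39·d always exceeds d/2. The sketch below checks this on the two January values for NY:

```python
import pandas as pd

# The two NY January revenues from the example data.
pair = pd.Series([1.0, 10.0])

# Both points are equally far from the mean: 4.5.
distance = (pair - pair.mean()).abs()

# Sample standard deviation of two points is |difference| / sqrt(2).
threshold = 1.96 * pair.std()

flags = distance > threshold
print(distance.tolist(), threshold, flags.tolist())
```

Finer grouping therefore only helps when each group still has enough observations for its standard deviation to be meaningful.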


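## Interpolation

Series and DataFrame objects have an `interpolate()` method that, by default, performs linear interpolation at the NaN data points.

Interpolation fills missing values with estimates rather than observed data, so in my opinion it is best avoided where possible.

Several built-in interpolation methods are available:

+ 'linear': ignores the index and treats the values as equally spaced. This is the only method supported on a MultiIndex.

+ 'time': for daily and higher-frequency data; interpolates according to the length of the time interval

+ 'index', 'values': use the actual numerical values of the index

+ 'nearest', 'zero', 'slinear', 'quadratic', 'cubic', 'barycentric', 'polynomial', 'krogh', 'piecewise_polynomial', 'spline', 'pchip' and 'akima' all map to the corresponding SciPy methods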

In [35]:
nan_dirty.interpolate()

Unnamed: 0,name,age,weight,height,sex
0,Bob,12.0,69.0,175.0,m
1,Jessica,13.5,89.0,195.0,f
2,Mary,15.0,49.0,169.0,f
3,John,18.0,79.0,184.0,m
4,Bob,12.0,69.0,175.0,m
5,Mel,11.0,45.0,175.5,f
6,Mary,14.0,56.0,176.0,f
7,Jessica,16.0,44.0,149.0,f
8,Bob,18.0,69.0,178.0,m
9,Marila,17.0,48.0,164.0,f
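
The difference between 'linear' and 'time' only shows up when the index is unevenly spaced. This is a minimal sketch on a hypothetical Series (`series`) with a one-day gap followed by a three-day gap:

```python
import numpy as np
import pandas as pd

# Hypothetical series: the missing point is 1 day after the first value
# and 3 days before the last one.
idx = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-05"])
series = pd.Series([0.0, np.nan, 4.0], index=idx)

# 'linear' ignores the index, so the gap is filled with the midpoint.
linear = series.interpolate()

# 'time' weights by elapsed time: one quarter of the way from 0.0 to 4.0.
by_time = series.interpolate(method="time")

print(linear)
print(by_time)
```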
