# **Data Cleaning and Preparation**
---


## **Handling Missing Data**

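> **_`None` is also treated as a missing value (NA)._**  
> **_When cleaning data for analysis, it is often worth analyzing the missing data itself to identify data collection problems or potential biases caused by the missing data._**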

In [46]:
import numpy as np 
import pandas as pd
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool
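
The note that `None` counts as missing can be checked with a small sketch. It reuses the Series from the cell above and overwrites one element with `None`; the variable name `mask` is only illustrative.

```python
import numpy as np
import pandas as pd

# Same Series as in the cell above, with the first element set to None.
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data[0] = None

# isnull() flags both None and NaN as missing.
mask = string_data.isnull()
print(mask.tolist())  # [True, False, True, False]
```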

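> **Methods for handling NaN values**  
> **dropna | drop entries that contain NaN values**  
> **fillna | fill NaN values with a given value or method**  
> **isnull | return a boolean mask marking missing values**  
> **notnull | negation of isnull: marks non-missing values**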
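
As a quick side-by-side, the four methods listed above can be applied to one small Series. This is a minimal sketch; the Series `s` is made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.5])

print(s.isnull().tolist())   # [False, True, False]
print(s.notnull().tolist())  # [True, False, True]
print(s.dropna().tolist())   # [1.0, 3.5]
print(s.fillna(0).tolist())  # [1.0, 0.0, 3.5]
```

`isnull` and `notnull` only return boolean masks; `dropna` and `fillna` return new objects and leave `s` unchanged.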


### **Filtering Out Missing Data with dropna()**

> **Handling NaN values in a Series**

In [47]:
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [48]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

> **Handling NaN values in a DataFrame: dropna() drops every row that contains a NaN; how='all' drops only rows in which all values are NaN**

In [49]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                    [NA, NA, NA], [NA, 6.5, 3.]])

cleaned = data.dropna()

data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [50]:
data.dropna(how='all') 

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [51]:
data[4] = NA
data.dropna(axis=1, how='all') # pass axis=1 to drop columns instead of rows

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


> **df.dropna(thresh=2): keep only rows that have at least 2 non-NaN values**

In [52]:
df = pd.DataFrame(np.random.randn(7, 3))

df.iloc[:4, 1] = NA

df.iloc[:2, 2] = NA

df.dropna(thresh=2)

Unnamed: 0,0,1,2
2,0.355031,,0.409409
3,-0.794884,,-0.670684
4,-0.068064,0.361251,0.493749
5,0.205158,0.585751,2.101582
6,-1.713099,0.359112,-0.229397
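
Because the frame above is random, the effect of `thresh` is easier to verify on fixed data. The sketch below uses a made-up `frame` and counts the non-NaN values per row before dropping.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'a': [1.0, 2.0, np.nan, np.nan],
    'b': [np.nan, 5.0, np.nan, 8.0],
    'c': [np.nan, 6.0, np.nan, 9.0],
})

# Number of non-NaN values in each row: 1, 3, 0, 2
non_na = frame.notnull().sum(axis=1)

# thresh=2 keeps rows with at least 2 non-NaN values.
kept = frame.dropna(thresh=2)
print(kept.index.tolist())  # [1, 3]
```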


### **Filling In Missing Data**

In [53]:
df.fillna(0)

Unnamed: 0,0,1,2
0,-1.536782,0.0,0.0
1,1.301905,0.0,0.0
2,0.355031,0.0,0.409409
3,-0.794884,0.0,-0.670684
4,-0.068064,0.361251,0.493749
5,0.205158,0.585751,2.101582
6,-1.713099,0.359112,-0.229397


In [54]:
df.fillna({1: 0.5, 2: 0})

Unnamed: 0,0,1,2
0,-1.536782,0.5,0.0
1,1.301905,0.5,0.0
2,0.355031,0.5,0.409409
3,-0.794884,0.5,-0.670684
4,-0.068064,0.361251,0.493749
5,0.205158,0.585751,2.101582
6,-1.713099,0.359112,-0.229397


> **The interpolation methods available for reindexing can also be used with fillna**

In [55]:
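df.fillna(method='ffill') # forward fill; recent pandas releases deprecate method= in favor of df.ffill()
df.fillna(method='ffill', limit=2) # fill at most 2 consecutive NaN; leading NaN have no earlier value and stay NaN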

Unnamed: 0,0,1,2
0,-1.536782,,
1,1.301905,,
2,0.355031,,0.409409
3,-0.794884,,-0.670684
4,-0.068064,0.361251,0.493749
5,0.205158,0.585751,2.101582
6,-1.713099,0.359112,-0.229397
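
In the output above nothing is filled, because every NaN in `df` sits at the top of its column and forward fill has no earlier value to propagate. A minimal sketch on a made-up Series shows the effect of `limit` more clearly; it uses `Series.ffill`, the method form of forward fill.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

filled_all = s.ffill()             # propagate the last valid value forward
filled_limited = s.ffill(limit=2)  # fill at most 2 consecutive NaN

print(filled_all.tolist())      # [1.0, 1.0, 1.0, 1.0, 5.0]
print(filled_limited.tolist())  # [1.0, 1.0, 1.0, nan, 5.0]
```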


> **The fill value can also come from an aggregate function, such as the mean or median**

In [56]:
data = pd.Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
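
The same idea extends to a DataFrame. In this sketch (the `frame` data is made up), `frame.mean()` returns a Series indexed by column name, so each column is filled with its own mean.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'x': [1.0, np.nan, 3.0],
    'y': [np.nan, 4.0, 8.0],
})

# Column means: x -> 2.0, y -> 6.0 (NaN values are skipped by mean()).
filled = frame.fillna(frame.mean())
print(filled['x'].tolist())  # [1.0, 2.0, 3.0]
print(filled['y'].tolist())  # [6.0, 4.0, 8.0]
```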

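> **Parameters of fillna**  
> **value: scalar or dict-like object used to fill missing values**  
> **method: interpolation method ('ffill' or 'bfill'); None by default**  
> **axis: axis to fill along**  
> **inplace: modify the object in place instead of returning a copy**  
> **limit: maximum number of consecutive values to fill**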

---
## **Data Transformation**

### **Removing Duplicates**

> **df.duplicated() flags rows that duplicate an earlier row**

In [57]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                       'k2': [1, 1, 2, 3, 3, 4, 4]})
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

> **df.drop_duplicates() removes duplicate rows**

In [58]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


In [59]:
data['v1'] = range(7)
data.drop_duplicates(['k1']) # detect duplicates using only the given columns
data.drop_duplicates(['k1', 'k2'], keep='last') # keep='last' keeps the last occurrence; the default keeps the first

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
6,two,4,6
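
To compare the `keep` options directly, the sketch below rebuilds the same frame and records which row labels survive in each case. The variable names `first`, `last` and `none` are illustrative.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4],
                     'v1': range(7)})

# Rows 5 and 6 share the same (k1, k2) pair: ('two', 4).
first = data.drop_duplicates(['k1', 'k2'])              # default keep='first'
last = data.drop_duplicates(['k1', 'k2'], keep='last')  # keep the last occurrence
none = data.drop_duplicates(['k1', 'k2'], keep=False)   # drop every duplicated row

print(first.index.tolist())  # [0, 1, 2, 3, 4, 5]
print(last.index.tolist())   # [0, 1, 2, 3, 4, 6]
print(none.index.tolist())   # [0, 1, 2, 3, 4]
```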


### **Transforming Data Using a Function or Mapping**

> **Series.map() accepts a function or a dict-like object that holds the mapping**

In [60]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                                'Pastrami', 'corned beef', 'Bacon',
                                'pastrami', 'honey ham', 'nova lox'],
                                'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

meat_to_animal = {
                    'bacon': 'pig',
                    'pulled pork': 'pig',
                    'pastrami': 'cow',
                    'corned beef': 'cow',
                    'honey ham': 'pig',
                    'nova lox': 'salmon'
                    }

lowercased = data['food'].str.lower() # convert every string to lowercase
data['animal'] = lowercased.map(meat_to_animal)
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


> **Doing it all in one step with Series.map()**

In [61]:
# a lambda (e.g. lambda x, y: x * y) is an anonymous function, shorthand for a regular function
data['food'].map(lambda x:meat_to_animal[x.lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
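
The two approaches differ when a value has no entry in the mapping. The sketch below uses a shortened mapping and a made-up `'tofu'` entry to show this: passing the dict yields NaN, indexing the dict inside a lambda raises `KeyError`, and `dict.get` is one way to avoid the exception.

```python
import pandas as pd

meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}
food = pd.Series(['Bacon', 'pastrami', 'tofu'])  # 'tofu' has no mapping

# Passing the dict: unmapped values become NaN.
by_dict = food.str.lower().map(meat_to_animal)
print(by_dict.isnull().tolist())  # [False, False, True]

# Indexing the dict inside a lambda raises KeyError for 'tofu'.
try:
    food.map(lambda x: meat_to_animal[x.lower()])
    raised = False
except KeyError:
    raised = True
print(raised)  # True

# dict.get returns None for unmapped keys instead of raising.
by_get = food.map(lambda x: meat_to_animal.get(x.lower()))
print(by_get.isnull().tolist())  # [False, False, True]
```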

### **Replacing Values**

> **replace(x, y) replaces values**  
> Note: the data.replace method is distinct from data.str.replace, which performs element-wise string substitution. Series string methods are covered later.

In [62]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])

data


0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [63]:
data.replace(-999, np.nan)
data.replace([-999, -1000], np.nan) # replace several values at once
data.replace([-999, -1000], [np.nan, 0]) # use a different replacement for each value
data.replace({-999: np.nan, -1000: 0}) # a dict also works

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
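
The difference between `replace` and `str.replace` mentioned in the note can be seen on a small string Series. This is a minimal sketch with made-up values; `regex=False` is passed so that the pattern is treated as a literal string.

```python
import pandas as pd

s = pd.Series(['a-b', 'b', '-'])

# Series.replace matches whole values: only the element equal to '-' changes.
whole = s.replace('-', 'X')
print(whole.tolist())  # ['a-b', 'b', 'X']

# Series.str.replace substitutes inside each string.
inner = s.str.replace('-', 'X', regex=False)
print(inner.tolist())  # ['aXb', 'b', 'X']
```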

### **Renaming Axis Indexes**

> **index.map() transforms the axis labels with a function**

In [67]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                     index=['Ohio', 'Colorado', 'New York'],
                     columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


In [72]:
transform = lambda x: x[:4].upper() # applied to each index label: take the first 4 characters, then uppercase them
data.index = data.index.map(transform)
data 

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


> **df.rename() returns a renamed copy and leaves the original unchanged**

In [74]:
data.rename(index=str.title, columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colo,4,5,6,7
New,8,9,10,11


> **df.rename() with dict-like objects changes a subset of the axis labels**

In [78]:
data.rename(index={'OHIO': 'INDIANA'},
            columns={'three': 'peekaboo'}, inplace=True)
            # inplace=True modifies the original DataFrame
data

Unnamed: 0,one,two,peekaboo,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11
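
Without `inplace=True`, `rename` leaves the original untouched and returns a new object. The sketch below rebuilds the frame with the already-shortened index labels to show this, and uses reassignment as one common alternative to `inplace=True`.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['OHIO', 'COLO', 'NEW'],
                    columns=['one', 'two', 'three', 'four'])

# rename returns a new DataFrame; data itself keeps its labels.
renamed = data.rename(index={'OHIO': 'INDIANA'})
print(data.index.tolist())     # ['OHIO', 'COLO', 'NEW']
print(renamed.index.tolist())  # ['INDIANA', 'COLO', 'NEW']

# Reassigning the result has the same visible effect as inplace=True.
data = data.rename(columns={'three': 'peekaboo'})
print(data.columns.tolist())   # ['one', 'two', 'peekaboo', 'four']
```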


### **Discretization and Binning**

---
## **String Manipulation**
