# **Pandas**
---
## **Introduction to pandas data structures**

### **Series** 
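> **A one-dimensional array of values with an associated index.   
> The `index` and `values` attributes return the index and the values, respectively.**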

In [None]:
import pandas as pd
import numpy as np
obj = pd.Series([4,7,-5,3])
obj.values

> **pandas generates a default integer index automatically; you can also supply a custom index.**

In [None]:
obj2 = pd.Series([4,7,-5,3],index=['a','b','c','d'])
obj2

In [None]:
obj2[obj2>2]

In [None]:
np.exp(obj2) # NumPy functions return floats and preserve the index

> **A Series can be used like a fixed-length, ordered dict in many functions.**   
> **A Series can therefore be created directly from a Python dict.**

In [None]:
'b' in obj2

In [None]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3

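> **Passing `index` selects and orders the dict keys: keys missing from the dict become NaN, and keys not listed are dropped.**

In [None]:
# 'California' is not in sdata, so it becomes NaN; 'Utah' is not in states, so it is dropped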
states = ['California','Ohio','Oregon','Texas']
obj4 = pd.Series(sdata,index=states)
obj4

>  **The `pd.isnull` and `pd.notnull` functions detect missing values.** 
> **The `obj.isnull()` and `obj.notnull()` methods do the same.**

In [None]:
# True where the value is missing (NaN)
pd.isnull(obj4) 

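> **One of the most important Series features is automatic alignment by index label in arithmetic. The result index is the union of both indexes, similar to an outer join in a database; labels present in only one operand produce NaN.**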

In [None]:
obj3 + obj4 
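
As a minimal sketch of this alignment behaviour, the example below adds two small Series whose indexes only partly overlap. The names `left` and `right` and their values are made up for illustration.

```python
import pandas as pd

# Two Series whose indexes only partly overlap (illustrative values)
left = pd.Series({'Ohio': 35000, 'Texas': 71000, 'Utah': 5000})
right = pd.Series({'California': 1000, 'Ohio': 100, 'Texas': 200})

total = left + right            # values are matched by index label
labels = set(total.index)       # union of both indexes
missing = total[total.isna()]   # labels that exist in only one operand
print(total)
```

Labels found on only one side are kept in the result with NaN, which is why the result behaves like an outer join rather than an intersection.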

> **A Series and its index each have a `name` attribute, and both can be set by assignment.**

In [None]:
obj4.name = "population"
obj4.index.name = 'state'
obj4

### **DataFrame** 
> **A DataFrame is a tabular data structure.  
> It has both a row index and a column index, and can be viewed as a dict of Series that share the same index.**

In [None]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
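frame = pd.DataFrame(data) # build a DataFrame from a dict of equal-length lists
frame.index = [1,2,3,4,5,6] # the index can be replaced by assignment, as with a Series
frame = frame[['year','state','pop']] # reorder columns; assigning to frame.columns would only rename them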
frame

In [None]:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'])
frame2.index = ['one', 'two', 'three', 'four', 'five', 'six']
frame2 # a column with no data ('debt') is filled with NaN

> **Indexing a DataFrame by column name returns a Series; indexing with a list of names returns a new DataFrame.**

In [None]:
frame2.year
frame2[['year','pop','debt']] # a list of column names selects several columns at once

In [None]:
frame2.loc['three'] # loc selects a row by label (use iloc for integer position)

> **Columns can be modified by assignment.**

In [None]:
import numpy as np
frame2.debt = 16.5
frame2['debt'] = np.arange(6.0)
frame2

In [None]:
val = pd.Series([-1.2,-1.5,-1.7],index=['two','four','five'])
frame2.debt = val 
frame2

> **`del` removes a column.**  
> **Assigning to a column that does not exist creates a new column.**

In [None]:
frame2['eastern'] = frame2.state == 'Ohio' # a new column must be created with bracket indexing
frame2 # note: frame2.eastern = ... would not create a column
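
The difference between attribute assignment and bracket assignment can be checked with a small sketch. The frames `df` and `df2` are hypothetical and independent of `frame2` above.

```python
import warnings
import pandas as pd

df = pd.DataFrame({'state': ['Ohio', 'Nevada']})

# Assigning to a new attribute name only sets a Python attribute.
# pandas emits a UserWarning here, which is silenced for the demo.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    df.eastern = df['state'] == 'Ohio'
created_by_attribute = 'eastern' in df.columns

# Bracket assignment creates the column.
df2 = pd.DataFrame({'state': ['Ohio', 'Nevada']})
df2['eastern'] = df2['state'] == 'Ohio'
created_by_brackets = 'eastern' in df2.columns
print(created_by_attribute, created_by_brackets)
```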

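> **Note: in pandas versions without Copy-on-Write, the column returned by indexing may be a view on the underlying data rather than a copy, so in-place modifications to that Series can show up in the source DataFrame. Use the Series `copy` method to get an explicit copy of the column.**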

In [None]:
frame3 = frame2.year.copy() # explicit copy of one column
frame3

### **Nested dicts**

> **With a nested dict, the outer keys become the columns and the inner keys become the row index.**

In [None]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)
frame3

In [None]:
frame3.T # transpose (swap rows and columns)

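> **Specifying the index explicitly**

In [None]:
pd.DataFrame(pop, index=[2001, 2002, 2003]) # rows are selected by label; 2003 has no data, so it becomes NaN
frame3 = frame3.reindex([2001, 2002, 2003]) # same result; assigning to frame3.index would only relabel the rows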
frame3
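
Selecting rows by label and relabeling rows are different operations, which the sketch below tries to make visible. It rebuilds the `pop` dict so that it runs on its own; the variable names are illustrative.

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# Both of these pick rows by label; 2003 does not exist, so it is NaN
by_constructor = pd.DataFrame(pop, index=[2001, 2002, 2003])
by_reindex = pd.DataFrame(pop).reindex([2001, 2002, 2003])

# Assigning to .index keeps every value and only changes the labels
relabeled = pd.DataFrame(pop)
relabeled.index = [2001, 2002, 2003]

print(by_reindex)
print(relabeled)
```

After relabeling, the Ohio column still holds all three original values, just under different years, so direct index assignment is only safe when the new labels describe the same rows in the same order.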

In [None]:
pdata = {
    'Ohio':frame3['Ohio'][:-1],
    'Nevada':frame3['Nevada'][:2]
}
pd.DataFrame(pdata)

> **The DataFrame constructor accepts a 2D array, a NumPy structured array, a dict of Series, a dict of dicts, a list of lists or tuples, or another DataFrame.**

> **The index and the columns of a DataFrame can each be given a name.**

In [None]:
frame3.index.name = 'year'
frame3.columns.name = 'state' # name of the column axis, shown in the top-left corner of the output
frame3

> **As with a Series, the `index` and `values` attributes return the index and the underlying data.**

In [None]:
frame2.values
frame3.values

### **Index objects**
> **An index is itself an object. Index objects are immutable, so they can be shared safely between data structures.**

In [None]:
import pandas as pd
obj = pd.Series(range(3),index=['a','b','c'])
index = obj.index
index[1:]
# index[1] = 'd' # raises TypeError: Index objects are immutable

In [None]:
import numpy as np
labels = pd.Index(np.arange(3))
labels
# Index([0, 1, 2], dtype='int64')  (older pandas versions print Int64Index)
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2 
 #   0    1.5
 #   1   -2.5
 #   2    0.0
 #   dtype: float64
obj2.index is labels


> **The `columns` attribute of a DataFrame is also an Index and behaves the same way.**

In [None]:
frame3

In [None]:
frame3.columns

In [None]:
'Ohio' in frame3.columns

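> **An Index can contain duplicate labels; selecting a duplicated label returns every matching value.**

>**Common Index methods and properties**

> - **append: concatenate with another Index, producing a new Index**
> - **difference: compute the set difference as an Index**
> - **intersection: compute the set intersection**  
> - **union: compute the set union**  
> - **isin: compute a boolean array indicating whether each value is contained in the passed collection**  
> - **delete: compute a new Index with the element at a given position removed**  
> - **drop: compute a new Index with the passed values removed**   
> - **insert: compute a new Index with an element inserted at a given position**  
> - **is_monotonic_increasing: True if each element is greater than or equal to the previous one (formerly is_monotonic)**   
> - **is_unique: True if the Index has no duplicate values**  
> - **unique: compute the array of unique values in the Index**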
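
The methods listed above can be tried on two small Index objects. This is only a sketch; `a` and `b` are illustrative names, and every call returns a new object because an Index is immutable.

```python
import pandas as pd

a = pd.Index(['a', 'b', 'c'])
b = pd.Index(['b', 'c', 'd'])

union = a.union(b)            # labels in either index
common = a.intersection(b)    # labels in both indexes
only_a = a.difference(b)      # labels in a but not in b
mask = a.isin(['a', 'c'])     # boolean array, one entry per label
inserted = a.insert(1, 'z')   # new Index with 'z' at position 1
deleted = a.delete(0)         # remove by position
dropped = a.drop(['c'])       # remove by value
combined = a.append(b)        # concatenation; duplicates are kept

print(list(union), list(common), list(only_a))
print(combined.is_unique)
```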


---
## **Essential functionality**

### **df.reindex(): conforming data to a new index**

In [1]:
import pandas as pd
import numpy as np 
obj = pd.Series([4.5,7.2,-5.3,3.6],index=['d', 'b', 'a', 'c'])
obj2 = obj.reindex(['a','b','c','d','e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [2]:
obj3 = pd.Series(['blue','purple','yellow'] , index = [0.,2,4] )
obj3

0.0      blue
2.0    purple
4.0    yellow
dtype: object

In [3]:
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

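> **reindex can change the row index, the columns, or both; labels that did not exist before are filled with NaN.**  
> **Common reindex parameters: index, columns, method, fill_value**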

In [4]:
frame = pd.DataFrame(np.arange(9).reshape(3,3), # reshape turns the flat array into a 3x3 array
                    index=['a','c','d'],
                    columns=['Ohio', 'Texas', 'California'])
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2

   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0


In [None]:
states = ['Texas', 'Utah', 'California','Ohio']
frame.reindex(columns=states)
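
Because reindexing is easy to confuse with renaming, the sketch below places `reindex` (with the `fill_value` parameter) next to `rename`. The labels `'A'` and `'OH'` are made up for the comparison.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# reindex aligns on existing labels; new labels get fill_value
conformed = frame.reindex(index=['a', 'b', 'c', 'd'],
                          columns=['Texas', 'Utah', 'California'],
                          fill_value=0)

# rename changes the labels themselves and keeps every value
renamed = frame.rename(index={'a': 'A'}, columns={'Ohio': 'OH'})

print(conformed)
print(renamed)
```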

### **df.drop(): dropping entries from an axis** 
**drop returns a new object with the given labels removed**

In [None]:
import pandas as pd 
import numpy as np 
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj

In [None]:
obj.drop('c')

> **drop can remove labels from either axis of a DataFrame**

In [None]:
data = pd.DataFrame(np.arange(36).reshape((6, 6)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York','Marsh','Mosh'],
                    columns=['one', 'two', 'three', 'four','five','six'])
data

> **To drop rows, pass a label or a list of labels (axis=0 is the default)**

In [None]:
data.drop(['Marsh','Mosh']) # drop rows

> **To drop columns, also pass a label or a list of labels,**   
> **together with axis=1**  
> **or axis='columns'**

In [None]:
data.drop('five', axis=1)

In [None]:
data.drop(['five','six'],axis='columns')
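
As a hedged alternative to passing `axis`, `drop` also accepts the `index=` and `columns=` keywords, which some readers find easier to scan. The small frame below is illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

fewer_rows = data.drop(index=['Colorado', 'Ohio'])   # drop rows by label
fewer_cols = data.drop(columns=['two', 'four'])      # drop columns by label
same_cols = data.drop(['two', 'four'], axis=1)       # equivalent axis form

print(fewer_rows)
print(fewer_cols)
```

In each case a new object is returned and `data` itself is left unchanged.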

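### **Indexing, selection, and filtering**

> **Series indexing (obj[...]) works much like NumPy array indexing, except that index labels can be used as well as integers**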

In [None]:
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj['b'] # look up by index label
obj.iloc[1] # look up by integer position; same result

In [None]:
obj[2:4]

In [None]:
obj[['b', 'a', 'd']] # a list of labels selects several entries

In [None]:
obj[obj < 2]

> **Slicing with labels includes the endpoint, unlike normal Python slicing**

In [None]:
obj['b':'c']
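
The endpoint rule can be compared side by side with positional slicing. This is a small sketch using the explicit `loc` and `iloc` accessors.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b':'c']    # label slice: 'c' is included
by_position = obj.iloc[1:2]    # positional slice: position 2 is excluded

print(by_label)
print(by_position)
```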

> **Assigning to a selection modifies the corresponding entries**

In [None]:
obj[['b', 'a', 'd']] = 5
obj

> **Indexing a DataFrame with a column name or a list of names selects columns**  
> **As special cases, a plain slice or a boolean array selects rows**

In [None]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                  columns=['one', 'two', 'three', 'four'])
data

In [None]:
data[['three', 'one']] # select columns

In [None]:
data[:2] # select rows with a slice

In [None]:
data[data['three']>5] # select rows with a boolean mask

### **Selecting subsets with loc and iloc**
> **loc: indexing by axis labels  
> iloc: indexing by integer position**

In [None]:
data

In [None]:
data.loc['Colorado', ['two', 'three']] # row label first, then column labels

In [None]:
data.iloc[2,[3,0,1]]

In [None]:
data.iloc[[1, 2], [3, 0, 1]]

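>**Key notes  
df[val] selects a single column or a list of columns  
df[slice] or df[bool_array] selects rows (special-case conveniences)  
df.loc[val] selects a single row or a subset of rows by label  
df.loc[:, val] selects a single column or a subset of columns by label  
df.loc[val1, val2] selects rows and columns by label at the same time  
df.iloc[where] selects by integer position; usage otherwise mirrors loc  
df.at[label_i, label_j] selects a single scalar value by row and column label  
df.iat[i, j] selects a single scalar value by row and column integer position**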
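
The accessors in the notes above can be exercised on the same 4x4 frame used in this section. This is a minimal sketch; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

scalar_by_label = data.at['Utah', 'three']   # single value by labels
scalar_by_position = data.iat[2, 2]          # same cell by positions
row = data.loc['Colorado']                   # one row as a Series
block = data.loc[:'Utah', 'two']             # label slice includes 'Utah'
same_block = data.iloc[:3, 1]                # positional slice excludes 3
filtered = data.iloc[:, :3][data['three'] > 5]  # first 3 columns, rows by mask

print(scalar_by_label, scalar_by_position)
print(filtered)
```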

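### **Integer indexes** 
>**With an integer index, obj[i] is ambiguous between label and position; with a non-integer index there is no ambiguity  
so prefer loc (labels) and iloc (positions) to make the intent explicit**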

In [None]:
ser = pd.Series(np.arange(3.))
ser
# ser[-1] raises KeyError: with an integer index, [] looks up labels, and -1 is not a label
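
The ambiguity can be sketched as follows: with the default integer index, plain brackets are treated as a label lookup, while `iloc` and `loc` state the intent explicitly.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))

try:
    ser[-1]                      # -1 is looked up as a label and not found
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

last_by_position = ser.iloc[-1]  # position-based: last element
first_by_label = ser.loc[0]      # label-based: the row labeled 0

print(label_lookup_failed, last_by_position, first_by_label)
```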


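### **Arithmetic and data alignment**
> **Arithmetic between objects aligns on index labels; the result index is the union of the two indexes, similar to an outer join in SQL**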

In [None]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5],index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

In [None]:
s3 = s1 + s2 
s3 

In [None]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),index=['Ohio', 'Texas', 'Colorado'])

df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1

In [None]:
df2

In [None]:
df1 + df2

### **Arithmetic methods with fill values**

In [None]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                      columns=list('abcd'))

df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                       columns=list('abcde'))

df2.loc[1, 'b'] = np.nan # set one value to missing

df1 

In [None]:
df2

> **Adding directly keeps the union of the row and column labels, with NaN wherever either operand has no value**

In [None]:
df1 + df2

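> **The add method accepts fill_value: a value missing from only one operand is replaced by fill_value before adding**  
> **Positions missing from both operands remain NaN**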

In [None]:
df1.add(df2,fill_value=1)
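
To make the effect of `fill_value` easier to see, here is a small sketch with two hypothetical frames `a` and `b`, where `b` has an extra column and one NaN.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'x': [1.0, 2.0]})
b = pd.DataFrame({'x': [10.0, np.nan], 'y': [5.0, 6.0]})

plain = a + b                     # NaN wherever either side is missing
filled = a.add(b, fill_value=0)   # a missing side is treated as 0

print(plain)
print(filled)
```

With the plain operator the whole `y` column is NaN because `a` has no such column; with `fill_value=0` it keeps the values from `b`.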

> Arithmetic methods for Series and DataFrame
![](img/%E7%AE%97%E6%9C%AF%E8%AE%A1%E7%AE%97pandas.png)

In [None]:
df1.radd(df2,fill_value=1)

> **A fill value can also be specified when reindexing**

In [None]:
df1.reindex(columns=df2.columns, fill_value=0)

### **Operations between DataFrame and Series**

> **Broadcasting: an operation between a 2D array and one of its rows is applied to every row**

In [None]:
arr = np.arange(12.).reshape((3, 4))
arr[0]

In [None]:
arr - arr[0]

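> **By default, the Series index is matched against the DataFrame columns and the operation is broadcast down the rows**  
> **Operations between a Series and a DataFrame broadcast in the same way as the NumPy example**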

In [None]:
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                    columns=list('bde'),
                    index=['Utah', 'Ohio', 'Texas', 'Oregon'])

frame 

In [None]:

series = frame.iloc[0]
series

In [None]:
frame - series

In [None]:
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
series2

In [None]:
calc = frame + series2
calc

In [None]:
calc1 = calc.add(frame)
calc1

> **To match on the row index and broadcast across the columns, use an arithmetic method with axis='index'**

In [None]:
series3 = frame['d']
series3

In [None]:
frame

In [None]:
frame.sub(series3, axis='index') # the axis passed is the axis to match on
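
Both broadcasting directions can be compared in one short sketch, reusing the same 4x3 frame. The names `row_centered` and `col_centered` are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Series index matches the columns; subtraction repeats for every row
row_centered = frame - frame.iloc[0]

# Series index matches the row index; subtraction repeats for every column
col_centered = frame.sub(frame['d'], axis='index')

print(row_centered)
print(col_centered)
```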

### **Function application and mapping**

> **NumPy universal functions (ufuncs) also work on pandas objects**

In [5]:
import pandas as pd
import numpy as np 
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                         index=['Utah', 'Ohio', 'Texas', 'Oregon'])

frame

               b         d         e
Utah   -1.441116 -0.727283  1.431343
Ohio    0.132378  0.006846  0.315660
Texas   2.320527 -0.716923 -0.738081
Oregon  0.140255  0.405675 -0.054437


In [6]:
np.abs(frame)

               b         d         e
Utah    1.441116  0.727283  1.431343
Ohio    0.132378  0.006846  0.315660
Texas   2.320527  0.716923  0.738081
Oregon  0.140255  0.405675  0.054437


> **Key point: df.apply(f) applies a custom function to each column or each row**

In [7]:
frame.sum(axis='columns') # common reductions are built in; apply is not needed

Utah     -0.737056
Ohio      0.454884
Texas     0.865522
Oregon    0.491493
dtype: float64

In [8]:
frame.sum(axis='index') # common reductions are built in; apply is not needed

b    1.152044
d   -1.031686
e    0.954484
dtype: float64

In [9]:
# by default (axis=0), f is called once per column
f = lambda x: x.max() - x.min()
frame.apply(f) 

b    3.761643
d    1.132958
e    2.169424
dtype: float64

In [10]:
# by default (axis=0), the function is called once per column
# f = lambda x: x.max() - x.min()
frame.apply(lambda x: x.max() - x.min()) 

b    3.761643
d    1.132958
e    2.169424
dtype: float64

In [None]:
# axis='columns' calls f once per row
f = lambda x: x.max() - x.min()
frame.apply(f,axis='columns') 

> **The function passed to apply may also return a Series with several values**

In [None]:
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)

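> **applymap applies a function to every element, for example to format each value as a string**

In [None]:
fmt = lambda x: '%.2f' % x # named fmt to avoid shadowing the built-in format
frame.applymap(fmt) # in pandas 2.1 and later, DataFrame.map is the preferred name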
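
Since the direction of `apply` is easy to mix up, the sketch below uses a tiny frame with hand-checkable numbers. Element-wise formatting is shown with `Series.map`; the frame and the name `spread` are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'b': [1.0, 4.0], 'd': [2.0, 8.0]},
                     index=['Utah', 'Ohio'])

def spread(x):
    # x is a whole column (axis=0) or a whole row (axis='columns')
    return x.max() - x.min()

per_column = frame.apply(spread)                  # one result per column
per_row = frame.apply(spread, axis='columns')     # one result per row
formatted = frame['b'].map(lambda v: f'{v:.2f}')  # one result per element

print(per_column)
print(per_row)
print(formatted)
```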

### **Sorting and ranking**

> **sort_index() sorts by index labels**

In [11]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

In [12]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                        index=['three', 'one'],
                        columns=['d', 'a', 'b', 'c'])
frame

       d  a  b  c
three  0  1  2  3
one    4  5  6  7


In [13]:
frame.sort_index(axis=0)

       d  a  b  c
one    4  5  6  7
three  0  1  2  3


In [14]:
frame.sort_index(axis=1,ascending=False)
# axis=1 (or axis='columns') sorts by column labels; axis=0 sorts by row labels
# ascending=False sorts in descending order

       d  c  b  a
three  0  3  2  1
one    4  7  6  5


> **sort_values() sorts by value; for a DataFrame, pass by=name**


In [15]:
obj = pd.Series([4, 7, -3, 2])
obj.sort_values() # missing values are placed at the end of the Series by default

2   -3
3    2
0    4
1    7
dtype: int64

> **sort_values(by=['column_name1', 'column_name2'])**   
> **sorts by several columns, using later columns to break ties**

In [None]:
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame

In [None]:
frame.sort_values(by=['a', 'b'])
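
As a small extension, `ascending` may be given as a list with one flag per sort key. The sketch below reuses the same frame.

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort by 'a', then by 'b' within equal 'a' values
both_ascending = frame.sort_values(by=['a', 'b'])

# 'a' ascending, 'b' descending within each 'a' group
mixed = frame.sort_values(by=['a', 'b'], ascending=[True, False])

print(both_ascending)
print(mixed)
```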

> **rank() assigns ranks from 1 to the number of valid values; by default, ties receive the average of their ranks**

In [None]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj

In [None]:
obj.rank()

In [None]:
obj.rank(method='first') # ties are ranked in the order they appear in the data

In [None]:
obj.rank(ascending=False, method='max') # rank in descending order; ties get the highest rank of their group
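
The tie-breaking rules are easier to follow with the resulting ranks written out. The sketch below uses the same Series, and the values in the comments were worked out by hand.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

average = obj.rank()                  # [6.5, 1, 6.5, 4.5, 3, 2, 4.5]
first = obj.rank(method='first')      # [6, 1, 7, 4, 3, 2, 5]
lowest = obj.rank(method='min')       # [6, 1, 6, 4, 3, 2, 4]
desc_max = obj.rank(ascending=False, method='max')  # [2, 7, 2, 4, 5, 6, 4]

print(pd.DataFrame({'value': obj, 'average': average,
                    'first': first, 'min': lowest, 'desc_max': desc_max}))
```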

> **rank() also works on a DataFrame, over the rows or over the columns**

In [None]:
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                      'c': [-2, 5, 8, -2.5]})
frame

In [None]:
frame.rank(ascending=False,axis='columns')

**Tie-breaking methods for rank**
![](img/%E6%8E%92%E5%BA%8F.png)

### **Axis indexes with duplicate labels**
> **If a label occurs more than once, selecting it returns all matching values**

In [None]:
import pandas as pd
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj['a']
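
One consequence, sketched below, is that the return type depends on whether the label is duplicated, so code that expects a scalar may want to check `index.is_unique` first.

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

repeated = obj.loc['a']         # duplicated label: returns a Series
single = obj.loc['c']           # unique label: returns a scalar
unique_labels = obj.index.is_unique

print(type(repeated).__name__, single, unique_labels)
```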

---
## **Summarizing and computing descriptive statistics**

In [3]:
import pandas as pd
import numpy as np
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                  [np.nan, np.nan], [0.75, -1.3]],
                     columns=['one', 'two'])
df

    one  two
0  1.40  NaN
1  7.10 -4.5
2   NaN  NaN
3  0.75 -1.3


> **axis=0 sums down each column; NA values are skipped**

In [4]:
df.sum(axis=0)

one    9.25
two   -5.80
dtype: float64

> **axis=1 sums across each row**

In [None]:
df.sum(axis=1)

In [None]:
df.mean(axis=1, skipna=False) # mean of each row; with skipna=False, any NaN in a row makes the result NaN
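
The effect of `skipna` can be checked row by row with the same frame. This is a minimal sketch.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  columns=['one', 'two'])

row_sum = df.sum(axis=1)                      # an all-NaN row sums to 0.0
mean_skip = df.mean(axis=1)                   # NaN skipped (default)
mean_strict = df.mean(axis=1, skipna=False)   # any NaN gives NaN

print(pd.DataFrame({'sum': row_sum, 'mean': mean_skip,
                    'mean_strict': mean_strict}))
```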

In [7]:
df.idxmax()

one    1
two    3
dtype: int64

In [6]:
df.cumsum()

    one  two
0  1.40  NaN
1  8.50 -4.5
2   NaN  NaN
3  9.25 -5.8


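> **df.describe() produces several summary statistics at once; for non-numeric data it reports count, unique, top, and freq**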

In [None]:
df.describe()

In [None]:
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
obj.describe()

> **Descriptive statistics and summary methods**
![](img/pandas%E6%96%B9%E6%B3%95%E6%B1%87%E6%80%BB.png)

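### **Correlation and covariance**

> **The corr method of Series computes the correlation of the overlapping, non-NA, index-aligned values of two Series. Similarly, cov computes the covariance**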

```
In [242]: returns = price.pct_change()

In [243]: returns.tail()
Out[243]: 
                AAPL      GOOG       IBM      MSFT
Date                                              
2016-10-17 -0.000680  0.001837  0.002072 -0.003483
2016-10-18 -0.000681  0.019616 -0.026168  0.007690
2016-10-19 -0.002979  0.007846  0.003583 -0.002255
2016-10-20 -0.000512 -0.005652  0.001719 -0.004867
2016-10-21 -0.003930  0.003011 -0.012474  0.042096
```
> **Computing correlation and covariance**
```
In [244]: returns['MSFT'].corr(returns['IBM'])
Out[244]: 0.49976361144151144

In [245]: returns['MSFT'].cov(returns['IBM'])
Out[245]: 8.8706554797035462e-05
```

> **The corr and cov methods of DataFrame return the full correlation or covariance matrix as a DataFrame:**
```
In [247]: returns.corr()
Out[247]: 
          AAPL      GOOG       IBM      MSFT
AAPL  1.000000  0.407919  0.386817  0.389695
GOOG  0.407919  1.000000  0.405099  0.465919
IBM   0.386817  0.405099  1.000000  0.499764
MSFT  0.389695  0.465919  0.499764  1.000000

In [248]: returns.cov()
Out[248]: 
          AAPL      GOOG       IBM      MSFT
AAPL  0.000277  0.000107  0.000078  0.000095
GOOG  0.000107  0.000251  0.000078  0.000108
IBM   0.000078  0.000078  0.000146  0.000089
MSFT  0.000095  0.000108  0.000089  0.000215
```

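> **With the corrwith method of DataFrame you can compute the correlation between its columns or rows and another Series or DataFrame. Passing a Series returns a Series of correlation values, computed for each column:**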

```
In [249]: returns.corrwith(returns.IBM)
Out[249]: 
AAPL    0.386817
GOOG    0.405099
IBM     1.000000
MSFT    0.499764
dtype: float64
```
> **Passing a DataFrame computes the correlations of columns with matching names**
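
The `price` data used above is not defined in these notes, so the sketch below substitutes a small synthetic price table (invented numbers, not real market data) to show the same calls end to end.

```python
import pandas as pd

# Synthetic prices, invented for illustration only
price = pd.DataFrame({
    'A': [100.0, 102.0, 101.0, 105.0, 104.0],
    'B': [50.0, 51.0, 50.5, 52.5, 52.0],   # always half of A
    'C': [20.0, 19.0, 21.0, 20.0, 22.0],
})
returns = price.pct_change().dropna()      # first row has no previous price

pair_corr = returns['A'].corr(returns['B'])   # Series vs Series
corr_matrix = returns.corr()                  # full correlation matrix
cov_matrix = returns.cov()                    # full covariance matrix
with_a = returns.corrwith(returns['A'])       # every column vs column A

print(pair_corr)
print(with_a)
```

Because column B is a constant multiple of column A, their percentage changes coincide and the correlation comes out as 1 up to floating-point error.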

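### **Unique values, value counts, and membership**

> **unique returns an array of the unique values in a Series**  
**uniques.sort() sorts that array in place**  
**value_counts computes how often each value occurs**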

In [None]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
uniques = obj.unique()
uniques

![](img/%E5%94%AF%E4%B8%80%E5%80%BC%E6%96%B9%E6%B3%95.png)

In [None]:
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                    'Qu2': [2, 3, 1, 2, 3],
                    'Qu3': [1, 5, 2, 4, 4]})
data

In [None]:
result = data.apply(pd.Series.value_counts).fillna(0) # the top-level pd.value_counts is deprecated in recent pandas
result
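
Membership, the third topic in this section's title, is handled by `isin`. The sketch below combines it with `unique` and `value_counts` on the same Series used earlier.

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

uniques = obj.unique()            # distinct values, in order of appearance
counts = obj.value_counts()       # frequency of each value
mask = obj.isin(['b', 'c'])       # True where the value is 'b' or 'c'
subset = obj[mask]                # keep only the matching entries

print(sorted(uniques))
print(counts)
print(subset)
```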