# Pandas

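## Introduction

One important reason Python has become one of the de facto standards for data analysis in industry is its two star third-party libraries, NumPy and pandas. NumPy provides high performance; pandas provides high productivity.

### What is pandas

pandas (Python Data Analysis Library) is a tool built on top of NumPy, created to solve data analysis tasks.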

### Series

A Series is a one-dimensional, array-like object made up of two parts:

- values: the data
- index: the row labels

In [3]:
import pandas as pd
import numpy as np

In [4]:
a = pd.Series([1, 2, 3, 4], index=list("abcd"))  # from a list
b = pd.Series(np.array([1, 2, 3, 4]), index=list("abcd"))  # from a NumPy array
c = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4})  # from a dict

In [5]:
display(a,b,c)

a    1
b    2
c    3
d    4
dtype: int64

a    1
b    2
c    3
d    4
dtype: int64

a    1
b    2
c    3
d    4
dtype: int64
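As a small complement to the three constructors above, the sketch below shows how an explicit `index` interacts with dict input: pandas selects and reorders by key, and a label missing from the dict becomes NaN. The data and the label `'z'` are illustrative and not part of the original notes.

```python
import pandas as pd

data = {'a': 1, 'b': 2, 'c': 3}

# With dict input, `index` selects and reorders entries by key.
# 'z' is not a key in `data`, so its value is NaN and the dtype becomes float64.
s = pd.Series(data, index=['c', 'a', 'z'])
print(s)
```

This is the same label-alignment behavior that shows up later when two Series with different indexes are added.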

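**A Series or DataFrame created from an ndarray may reference the array's data instead of copying it, so changing its elements can also change the original ndarray. Whether the memory is shared depends on the pandas version and the dtype; pass `copy=True` to the constructor to force a copy. The example below was run on a version where the data is shared.**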

In [28]:
nd = np.array([[1,2,3],[4,5,6]])

In [30]:
df = pd.DataFrame(nd)
df1 = df  # df1 is another name for the same DataFrame, not a copy

In [31]:
df1[0] = [22,33] 

In [32]:
display(nd, df, df1)

array([[22,  2,  3],
       [33,  5,  6]])

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 22 | 2 | 3 |
| 1 | 33 | 5 | 6 |


|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 22 | 2 | 3 |
| 1 | 33 | 5 | 6 |
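Because sharing between an ndarray and the pandas object built from it varies across pandas versions, it can be safer to check than to assume. The following is a minimal sketch, not the original author's code: it uses the constructor's `copy=True` argument to force an independent copy and `np.shares_memory` to inspect the default behavior. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3, 4])

# copy=True asks pandas to take its own copy of the data.
s_copy = pd.Series(arr, copy=True)
s_copy.iloc[0] = 99
print(arr[0])  # the original array is untouched by the assignment above

# Without copy=True, sharing is version-dependent, so inspect it rather than assume.
s_default = pd.Series(arr)
print(np.shares_memory(arr, s_default.to_numpy()))
```

The second `print` may show `True` or `False` depending on the installed pandas version, which is exactly why the explicit `copy=True` is useful when the source array must not change.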


#### Indexing and slicing

- Explicit indexing: by index label

- Implicit indexing: by integer position

In [34]:
df = pd.Series([1,2,3,4,5], index=list('abcde'))
df

a    1
b    2
c    3
d    4
e    5
dtype: int64

**Explicit indexing**

In [35]:
df['a']

1

In [51]:
display(df.loc[['a','b','c']],  df.loc['a':'c']) # label slices are closed intervals: both endpoints are included

a    1
b    2
c    3
dtype: int64

a    1
b    2
c    3
dtype: int64

**Implicit indexing**

In [46]:
df.iloc[0]  # df[0] also worked in older pandas, but integer keys on a label index are deprecated

1

In [54]:
display(df.iloc[[0,1,2]],df.iloc[0:2]) # positional slices are half-open: the end position is excluded

a    1
b    2
c    3
dtype: int64

a    1
b    2
dtype: int64
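The difference between label-based and position-based access matters most when the index itself holds integers. The sketch below restates the slicing rules shown above and then adds a hypothetical Series `t` with a reversed integer index (not from the original notes) to show that `.loc` and `.iloc` can return different elements for the same number.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=list('abcde'))

by_label = s.loc['a':'c']    # label slice: includes 'c'
by_position = s.iloc[0:2]    # positional slice: excludes position 2
print(list(by_label.index))
print(list(by_position.index))

# Hypothetical example: integer labels that do not match positions.
t = pd.Series([10, 20, 30], index=[2, 1, 0])
print(t.loc[0])   # the element whose label is 0
print(t.iloc[0])  # the element at position 0
```

Writing `.loc` or `.iloc` explicitly avoids having to remember which interpretation plain `[]` would pick.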

#### Some Series attributes

In [56]:
display(df.shape, df.size, df.index, df.values)

(5,)

5

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

array([1, 2, 3, 4, 5])

#### Series arithmetic

In [59]:
display(df+1, df*10, df.multiply(10), df.sub(1))

a    2
b    3
c    4
d    5
e    6
dtype: int64

a    10
b    20
c    30
d    40
e    50
dtype: int64

a    10
b    20
c    30
d    40
e    50
dtype: int64

a    0
b    1
c    2
d    3
e    4
dtype: int64

**When the indexes do not align**

In [60]:
s1 = pd.Series([1,2,3], index=list('abc'))
s2 = pd.Series([4,5,6], index=list('ade'))

In [61]:
display(s1, s2)

a    1
b    2
c    3
dtype: int64

a    4
d    5
e    6
dtype: int64

In [62]:
s1+s2

a    5.0
b    NaN
c    NaN
d    NaN
e    NaN
dtype: float64
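The NaN values above appear because `+` aligns the two Series by label and has no value to use where a label exists on only one side. If treating a missing value as zero is acceptable for the task, the method form `Series.add` accepts a `fill_value`. The sketch below reuses `s1` and `s2` from above; choosing `0` as the fill value is an assumption about what the analysis needs.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=list('abc'))
s2 = pd.Series([4, 5, 6], index=list('ade'))

# Labels present on only one side are treated as 0 before adding.
total = s1.add(s2, fill_value=0)
print(total)
```

The result still covers the union of both indexes, but every label now has a numeric value.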

### DataFrame

A DataFrame is a tabular data structure. It can be viewed as a dictionary of Series that share one index. A DataFrame has both a row index and a column index.

In [65]:
df = pd.DataFrame([[10,20,30],[30,10,15],[20,40,10]], index=['A001','A002','A003'], columns=['apple', 'banana', 'apear'])
df

|      | apple | banana | apear |
|------|-------|--------|-------|
| A001 | 10 | 20 | 30 |
| A002 | 30 | 10 | 15 |
| A003 | 20 | 40 | 10 |


In [66]:
df = pd.DataFrame({'apple':[10,20,30], 'banana':[30,10,15], 'apear':[20,40,10]}, index=['A001','A002','A003'])
df

|      | apple | banana | apear |
|------|-------|--------|-------|
| A001 | 10 | 30 | 20 |
| A002 | 20 | 10 | 40 |
| A003 | 30 | 15 | 10 |
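To make the "dictionary of Series" view concrete, a DataFrame can also be built directly from Series objects. In this sketch the `banana` Series deliberately has no entry for `'A002'` (an illustrative choice, not from the original notes), which shows that the columns are aligned on their row labels.

```python
import pandas as pd

apple = pd.Series([10, 20, 30], index=['A001', 'A002', 'A003'])
banana = pd.Series([30, 10], index=['A001', 'A003'])  # no value for 'A002'

# Each dict key becomes a column; rows are aligned by index label.
df = pd.DataFrame({'apple': apple, 'banana': banana})
print(df)

# Selecting one column gives back a Series.
print(type(df['apple']).__name__)
```

The missing `banana` value for `'A002'` shows up as NaN, the same alignment rule that applies to Series arithmetic.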


In [69]:
# Add an empty column: a name listed in `columns` but missing from the data is filled with NaN
df = pd.DataFrame({'apple':[10,20,30], 'banana':[30,10,15], 'apear':[20,40,10]}, index=['A001','A002','A003'], columns=['apple', 'banana', 'apear', 'tomato'])
df

|      | apple | banana | apear | tomato |
|------|-------|--------|-------|--------|
| A001 | 10 | 30 | 20 | NaN |
| A002 | 20 | 10 | 40 | NaN |
| A003 | 30 | 15 | 10 | NaN |


In [70]:
display(df.shape, df.values, df.columns, df.index, df.dtypes)

(3, 4)

array([[10, 30, 20, nan],
       [20, 10, 40, nan],
       [30, 15, 10, nan]], dtype=object)

Index(['apple', 'banana', 'apear', 'tomato'], dtype='object')

Index(['A001', 'A002', 'A003'], dtype='object')

apple      int64
banana     int64
apear      int64
tomato    object
dtype: object
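When the DataFrame already exists, an empty column can also be added afterwards. One possible approach, sketched here as an alternative to the constructor call above rather than as the original author's method, is `DataFrame.reindex` with the desired column list. It returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

df = pd.DataFrame(
    {'apple': [10, 20, 30], 'banana': [30, 10, 15], 'apear': [20, 40, 10]},
    index=['A001', 'A002', 'A003'],
)

# Columns that do not exist yet are created and filled with NaN.
wider = df.reindex(columns=['apple', 'banana', 'apear', 'tomato'])
print(wider.dtypes)
```

The dtype reported for the new column may differ from the `object` dtype shown above, so it is worth checking `dtypes` before doing arithmetic on it.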

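#### Indexing

Indexing is divided into row indexing and column indexing.

##### 1. Selecting columns

- Dictionary-style access
- Attribute-style access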

In [71]:
df

|      | apple | banana | apear | tomato |
|------|-------|--------|-------|--------|
| A001 | 10 | 30 | 20 | NaN |
| A002 | 20 | 10 | 40 | NaN |
| A003 | 30 | 15 | 10 | NaN |


In [73]:
display( df['apple'], df.apple, df[['apple','apear']] )

A001    10
A002    20
A003    30
Name: apple, dtype: int64

A001    10
A002    20
A003    30
Name: apple, dtype: int64

|      | apple | apear |
|------|-------|-------|
| A001 | 10 | 20 |
| A002 | 20 | 40 |
| A003 | 30 | 10 |
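Attribute-style access is convenient, but it only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute or method. The sketch below uses a hypothetical column named `count` (not from the original notes) to show the clash; dictionary-style access always reaches the column.

```python
import pandas as pd

df = pd.DataFrame({'apple': [10, 20], 'count': [1, 2]})

# df.count is the DataFrame.count method, not the 'count' column.
print(callable(df.count))

# Dictionary-style access is unambiguous.
print(df['count'].tolist())
```

For this reason, `df['name']` is usually the safer choice in reusable code, with `df.name` kept for quick interactive exploration.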


##### 2. Selecting rows

- Use `.loc[]` with index labels to select rows (explicit).
- Use `.iloc[]` with integer positions to select rows (implicit).

In [82]:
display(df.loc['A001'], df.loc[['A001','A003']], df.loc['A001':'A003'])

apple      10
banana     30
apear      20
tomato    NaN
Name: A001, dtype: object

|      | apple | banana | apear | tomato |
|------|-------|--------|-------|--------|
| A001 | 10 | 30 | 20 | NaN |
| A003 | 30 | 15 | 10 | NaN |


|      | apple | banana | apear | tomato |
|------|-------|--------|-------|--------|
| A001 | 10 | 30 | 20 | NaN |
| A002 | 20 | 10 | 40 | NaN |
| A003 | 30 | 15 | 10 | NaN |


In [84]:
display(df.iloc[0], df.iloc[[0,2]], df.iloc[0:2])

apple      10
banana     30
apear      20
tomato    NaN
Name: A001, dtype: object

|      | apple | banana | apear | tomato |
|------|-------|--------|-------|--------|
| A001 | 10 | 30 | 20 | NaN |
| A003 | 30 | 15 | 10 | NaN |


|      | apple | banana | apear | tomato |
|------|-------|--------|-------|--------|
| A001 | 10 | 30 | 20 | NaN |
| A002 | 20 | 10 | 40 | NaN |
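Both `.loc` and `.iloc` also accept a second argument that selects columns, so rows and columns can be narrowed in one step. The sketch below rebuilds the three fruit columns used above (without the empty `tomato` column, to keep it short) and selects the same 2x2 block in both styles.

```python
import pandas as pd

df = pd.DataFrame(
    {'apple': [10, 20, 30], 'banana': [30, 10, 15], 'apear': [20, 40, 10]},
    index=['A001', 'A002', 'A003'],
)

# Label-based: the row slice includes 'A002'.
by_label = df.loc['A001':'A002', ['apple', 'banana']]

# Position-based: both slices exclude their end position.
by_position = df.iloc[0:2, 0:2]

print(by_label.equals(by_position))
```

Note that the label slice names its last row explicitly, while the positional slice stops one before the end value.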


##### 3. Selecting elements

In [86]:
display(df['apple'].iloc[0], df['apple']['A001']) # select the column first, then the element by position or by label

10

10

In [87]:
display(df.iloc[0,1], df.iloc[[0,1]])

30

|      | apple | banana | apear | tomato |
|------|-------|--------|-------|--------|
| A001 | 10 | 30 | 20 | NaN |
| A002 | 20 | 10 | 40 | NaN |
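For reading or writing a single cell, pandas also offers the scalar accessors `.at` (by label) and `.iat` (by position), and `.loc[row, column]` works for both reading and assignment. The sketch below uses the same fruit data as above; the new value `11` is an arbitrary illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'apple': [10, 20, 30], 'banana': [30, 10, 15], 'apear': [20, 40, 10]},
    index=['A001', 'A002', 'A003'],
)

first_apple = df.at['A001', 'apple']    # scalar access by row and column label
first_banana = df.iat[0, 1]             # scalar access by row and column position
second_apear = df.loc['A002', 'apear']  # .loc with two labels also returns a scalar

# Assign through a single indexer instead of chaining df['apple']['A001'] = ...
df.loc['A001', 'apple'] = 11
print(first_apple, first_banana, second_apear, df.at['A001', 'apple'])
```

Assigning through one `.loc` or `.at` call avoids the ambiguity of chained indexing, where the first step may return a temporary object.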


### Common operations in this project