# Pandas Objects

In [1]:
import numpy as np
import pandas as pd

## 1. Series Objects

A Series is a one-dimensional array of **indexed data**.

In [2]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

As the output shows, each value is bound to an index label (the first column).

In [3]:
# The values attribute returns the underlying data as a NumPy array
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [4]:
# The index attribute returns a pd.Index object
data.index

RangeIndex(start=0, stop=4, step=1)

Values are accessed by index label using square brackets.

In [8]:
data[1]

0.5

In [9]:
data[1:3]

1    0.50
2    0.75
dtype: float64

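### (1) Defining the index with strings

A NumPy array accesses its values through an implicitly defined integer index.

A Series associates its values with an explicitly defined index.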

In [8]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [9]:
data['b']

0.5

The index does not have to be contiguous or sequential.

In [10]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
data

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64

In [11]:
data[5]

0.5

### (2) Series as a specialized dictionary

In [12]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

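The index is built from the dictionary keys, in the order shown above, and dictionary-style item access works as well.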

In [13]:
population['California']

38332521

Unlike a dictionary, a Series also supports slicing.

In [14]:
population['California':'Illinois']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

### (3) Constructing Series objects

```python
>>> pd.Series(data, index=index)
```

In [15]:
# data can be a list
pd.Series([2, 4, 6])

0    2
1    4
2    6
dtype: int64

In [16]:
# data can be a scalar, which is repeated to fill the given index
pd.Series(5, index=[100, 200, 300])

100    5
200    5
300    5
dtype: int64

In [17]:
# data can be a dictionary
pd.Series({2:'a', 1:'b', 3:'c'})

2    a
1    b
3    c
dtype: object

In [18]:
# Only the keys explicitly listed in index are kept
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])

3    c
2    a
dtype: object
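
The reverse case is worth a quick check: when ``index`` names a label that is not among the dictionary keys, the entry is filled with NaN. A minimal sketch, in which the extra label ``4`` is a hypothetical addition to the example above:

```python
import pandas as pd

# Keys 3 and 2 exist in the dictionary; label 4 does not.
s = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2, 4])

print(s)
# The entry for label 4 has no source value, so it is marked as missing.
missing = s.isna().tolist()
print(missing)
```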

## 2. DataFrame Objects

### (1) DataFrame as a generalized array

A DataFrame is a collection of Series objects aligned on a shared index.

In [10]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995}
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
area = pd.Series(area_dict)
population = pd.Series(population_dict)

states = pd.DataFrame({'population': population, 'area': area})
states

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


Like a Series, a DataFrame has an ``index`` attribute that gives access to the row labels.

In [20]:
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

The ``columns`` attribute of a DataFrame is an Index object holding the column labels.

In [21]:
states.columns

Index(['population', 'area'], dtype='object')

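A DataFrame can therefore be viewed as a generalization of a two-dimensional NumPy array in which both the rows and the columns are accessed through an index.

### (2) DataFrame as a specialized dictionary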

In [12]:
states['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

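In a DataFrame, ``data['col0']`` returns the column labeled ``col0``. This differs from a two-dimensional NumPy array, where ``data[0]`` returns the first row.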
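
The contrast can be seen side by side. This is a minimal sketch on a small hypothetical array; the names ``arr``, ``df``, ``col0`` and ``col1`` are chosen only for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical 3x2 data, used only to compare the two indexing styles.
arr = np.array([[1, 2], [3, 4], [5, 6]])
df = pd.DataFrame(arr, columns=['col0', 'col1'])

first_row = arr[0]       # NumPy: a single index selects a row
first_col = df['col0']   # DataFrame: a single label selects a column

print(first_row.tolist())
print(first_col.tolist())
```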

### (3) Constructing DataFrame objects

#### (1) From a single Series object

In [15]:
pd.DataFrame(population,columns=['population'])

            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135


#### (2) From a list of dictionaries

In [24]:
data = [{'a': i, 'b': 2 * i}
        for i in range(3)]
pd.DataFrame(data)

   a  b
0  0  0
1  1  2
2  2  4


If some keys are missing from a dictionary, pandas fills the corresponding entries with NaN (Not a Number).

In [25]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0
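
The output also shows a side effect: columns that receive a NaN are stored as floating point, which is why ``1`` appears as ``1.0``. A minimal sketch that inspects the column dtypes and counts the missing entries:

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# Columns 'a' and 'c' each have one missing entry, so they become float64;
# column 'b' is complete and stays an integer column.
print(df.dtypes)

# Count the missing entries per column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```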


#### (3) From a dictionary of Series objects

In [16]:
pd.DataFrame({'population': population, 'area': area})

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


#### (4) From a two-dimensional NumPy array

In [17]:
pd.DataFrame(np.random.rand(3, 2), columns=['foo', 'bar'], index=['a', 'b', 'c'])

        foo       bar
a  0.739644  0.629162
b  0.955598  0.154161
c  0.180660  0.552287


#### (5) From a NumPy structured array

In [28]:
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
A

array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

In [29]:
pd.DataFrame(A)

   A    B
0  0  0.0
1  0  0.0
2  0  0.0


## 3. Index Objects

An Index can be viewed as both an **immutable array** and an **ordered set**.

In [19]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

### Index as an immutable array

An Index behaves much like an array.

In [20]:
ind[1]

3

In [21]:
ind[::2]

Int64Index([2, 5, 11], dtype='int64')

It also exposes many of the attributes familiar from NumPy arrays.

In [22]:
print(ind.size, ind.shape, ind.ndim, ind.dtype)

5 (5,) 1 int64


The difference is that an Index is immutable: its entries cannot be modified in place.
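
A minimal sketch of what immutability means in practice: item assignment on an Index raises a ``TypeError``, and the Index is left unchanged. The flag ``raised`` is a helper introduced only for this illustration:

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

raised = False
try:
    ind[1] = 0   # item assignment is not supported on an Index
except TypeError as err:
    raised = True
    print(err)

print(ind[1])    # the original value is still in place
```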

### Index as an ordered set

Index objects support set-style operations.

In [24]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [25]:
indA & indB  # intersection

Int64Index([3, 5, 7], dtype='int64')

In [26]:
indA | indB  # union

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [27]:
indA ^ indB  # symmetric difference

Int64Index([1, 2, 9, 11], dtype='int64')

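The same operations are available as methods, for example ``indA.intersection(indB)``. The method forms are the safer choice: since pandas 2.0 the ``&``, ``|`` and ``^`` operators on an Index act element-wise instead of as set operations, so the outputs above reflect older pandas versions.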
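
A minimal sketch of the three set operations written as method calls, using the same two Index objects. The result names ``common``, ``combined`` and ``exclusive`` are introduced only for this illustration:

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

common = indA.intersection(indB)              # labels present in both
combined = indA.union(indB)                   # labels present in either
exclusive = indA.symmetric_difference(indB)   # labels present in exactly one

print(list(common))
print(list(combined))
print(list(exclusive))
```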

# Data Indexing and Selection

## 1. Data Selection in Series

### (1) Series as a dictionary

In [1]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [2]:
data['b']

0.5

Dictionary-style expressions can be used to examine the keys (index) and values.

In [3]:
'a' in data

True

In [4]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [5]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

In [6]:
# Modify the Series: assigning to a new key adds an entry
data['e'] = 1.25
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

### (2) Series as a one-dimensional array

In [7]:
# Slicing by explicit index (labels)
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

In [8]:
# Slicing by implicit index (integer positions)
data[0:2]

a    0.25
b    0.50
dtype: float64

In [9]:
# Masking
data[(data > 0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64

In [10]:
# Fancy indexing
data[['a', 'e']]

a    0.25
e    1.25
dtype: float64

Explicit (label-based) slices include the final label.

Implicit (position-based) slices exclude the final position.
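
The inclusive and exclusive endpoints can be checked directly. This is a minimal sketch written with the ``loc`` (label-based) and ``iloc`` (position-based) indexers so that each slice is unambiguous; the result names are introduced only for this illustration:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

by_label = data.loc['a':'c']    # label slice: the end label 'c' is included
by_position = data.iloc[0:2]    # positional slice: position 2 is excluded

print(list(by_label.index))
print(list(by_position.index))
```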

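### Indexers: loc, iloc, and ix

If a Series has an explicit integer index, an indexing operation such as ``data[1]`` uses the explicit index,

while a slicing operation such as ``data[1:3]`` uses the implicit positional index.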

In [11]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

1    a
3    b
5    c
dtype: object

In [12]:
# Explicit index when indexing
data[1]

'a'

In [13]:
# Implicit index when slicing
data[1:3]

3    b
5    c
dtype: object

Because this is easy to confuse, pandas provides dedicated indexer attributes.

In [14]:
# loc always uses the explicit index
data.loc[1]

'a'

In [15]:
# loc slices by label and includes the end label
data.loc[1:3]

1    a
3    b
dtype: object

In [16]:
# iloc always uses the implicit positional index
data.iloc[1]

'b'

In [17]:
# iloc slices by position and excludes the end position
data.iloc[1:3]

3    b
5    c
dtype: object

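``ix`` is a hybrid of the two. It was deprecated in pandas 0.20 and removed in pandas 1.0, so use ``loc`` and ``iloc`` instead.

## 2. Data Selection in DataFrame

### (1) DataFrame as a dictionary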

In [18]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

              area       pop
California  423967  38332521
Florida     170312  19552860
Illinois    149995  12882135
New York    141297  19651127
Texas       695662  26448193


In [19]:
data['area']

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

In [20]:
data.area

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

In [21]:
data.area is data['area']

True

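Attribute-style access fails when a column name collides with a DataFrame method, and columns should not be assigned through attributes either.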

In [22]:
data.pop is data['pop']

False
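
Both pitfalls can be reproduced on a small frame. This is a minimal sketch on hypothetical two-row data; the names ``df``, ``before`` and ``after`` are introduced only for this illustration, and the warning pandas emits for attribute assignment is silenced to keep the output short:

```python
import warnings
import pandas as pd

# Hypothetical data, used only to illustrate attribute access pitfalls.
df = pd.DataFrame({'area': [1.0, 2.0], 'pop': [10.0, 30.0]})

# 'pop' collides with the DataFrame.pop method, so attribute access
# returns the bound method rather than the column.
print(callable(df.pop))

with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    # This sets a plain Python attribute; no column is created.
    df.density = df['pop'] / df['area']
before = 'density' in df.columns

# Item assignment is the reliable way to create a column.
df['density'] = df['pop'] / df['area']
after = 'density' in df.columns

print(before, after)
```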

In [23]:
# Add a new column
data['density'] = data['pop'] / data['area']
data

              area       pop     density
California  423967  38332521   90.413926
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763
New York    141297  19651127  139.076746
Texas       695662  26448193   38.018740


### (2) DataFrame as a two-dimensional array

In [24]:
# View the underlying array data
data.values

array([[  4.23967000e+05,   3.83325210e+07,   9.04139261e+01],
       [  1.70312000e+05,   1.95528600e+07,   1.14806121e+02],
       [  1.49995000e+05,   1.28821350e+07,   8.58837628e+01],
       [  1.41297000e+05,   1.96511270e+07,   1.39076746e+02],
       [  6.95662000e+05,   2.64481930e+07,   3.80187404e+01]])

In [25]:
# Transpose: swap rows and columns
data.T

         California     Florida    Illinois    New York       Texas
area       423967.0    170312.0    149995.0    141297.0    695662.0
pop      38332520.0  19552860.0  12882140.0  19651130.0  26448190.0
density    90.41393    114.8061    85.88376    139.0767    38.01874


In [26]:
data.values[0]

array([  4.23967000e+05,   3.83325210e+07,   9.04139261e+01])

In [28]:
# Implicit index: zero-based positions, end position excluded
data.iloc[:3, :2]

              area       pop
California  423967  38332521
Florida     170312  19552860
Illinois    149995  12882135


In [29]:
# Explicit index: labels, end label included
data.loc[:'Illinois', :'pop']

              area       pop
California  423967  38332521
Florida     170312  19552860
Illinois    149995  12882135


In [30]:
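# Mixed positions and labels: the ix indexer was removed in pandas 1.0,
# so convert the row positions to labels and select with loc
data.loc[data.index[:3], :'pop']

              area       pop
California  423967  38332521
Florida     170312  19552860
Illinois    149995  12882135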
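
A selection that mixes row positions with column labels can be expressed with either ``loc`` or ``iloc`` by converting one side. This is a minimal sketch on a three-row subset of the data above; ``via_loc`` and ``via_iloc`` are names introduced only for this illustration:

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 170312, 149995],
                     'pop': [38332521, 19552860, 12882135]},
                    index=['California', 'Florida', 'Illinois'])

# Option 1: turn row positions into labels, then select with loc.
via_loc = data.loc[data.index[:2], :'pop']

# Option 2: turn the column label into a position, then select with iloc.
# The + 1 keeps the 'pop' column, since positional slices exclude the end.
via_iloc = data.iloc[:2, :data.columns.get_loc('pop') + 1]

print(via_loc)
print(via_loc.equals(via_iloc))
```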


In [31]:
# Combine masking with fancy indexing
data.loc[data.density > 100, ['pop', 'density']]

               pop     density
Florida   19552860  114.806121
New York  19651127  139.076746


In [32]:
# Modify a single value by position
data.iloc[0, 2] = 90
data

              area       pop     density
California  423967  38332521   90.000000
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763
New York    141297  19651127  139.076746
Texas       695662  26448193   38.018740


### (3) Additional indexing conventions

In [33]:
# Slicing selects rows, not columns
data['Florida':'Illinois']

            area       pop     density
Florida   170312  19552860  114.806121
Illinois  149995  12882135   85.883763


In [34]:
data[1:3]

            area       pop     density
Florida   170312  19552860  114.806121
Illinois  149995  12882135   85.883763


In [35]:
# Masking also selects rows
data[data.density > 100]

            area       pop     density
Florida   170312  19552860  114.806121
New York  141297  19651127  139.076746
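
These conventions mean that square brackets on a DataFrame behave differently depending on what is passed in. A minimal sketch on a three-row subset with rounded, illustrative density values; the names ``col``, ``rows`` and ``dense`` are introduced only for this illustration:

```python
import pandas as pd

# Density values are rounded and serve only as illustration.
data = pd.DataFrame({'area': [423967, 170312, 149995],
                     'density': [90.4, 114.8, 85.9]},
                    index=['California', 'Florida', 'Illinois'])

col = data['area']                     # single label: selects a column
rows = data['Florida':'Illinois']      # slice: selects rows
dense = data[data['density'] > 100]    # boolean mask: selects rows

print(col.name)
print(list(rows.index))
print(list(dense.index))
```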
