# Getting Started with Pandas

In [1]:
import numpy as np
import pandas as pd

## Creating Objects

1. When a Series is created from a list of values, Pandas generates a default integer index:

In [2]:
s = pd.Series([1, 2, 4, 5, np.nan, 6, 8])

In [3]:
s

0    1.0
1    2.0
2    4.0
3    5.0
4    NaN
5    6.0
6    8.0
dtype: float64
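
The output above shows float64 even though most of the values were written as integers. A minimal sketch of why, using illustrative variable names that are not part of the original notebook: a missing value (np.nan) cannot be stored in a plain integer column, so the whole Series is stored as floating point.

```python
import numpy as np
import pandas as pd

# A list containing np.nan cannot be held as plain integers, so pandas uses float64.
with_nan = pd.Series([1, 2, np.nan])
# Without missing values, the same kind of list keeps an integer dtype.
without_nan = pd.Series([1, 2, 3])

print(with_nan.dtype)          # float64
print(without_nan.dtype)       # an integer dtype
print(list(with_nan.index))    # [0, 1, 2], the default integer index
```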

2. Create a DataFrame from a NumPy array, with a datetime index and labeled columns:

In [4]:
dates = pd.date_range('2020-08-21', periods=6)

In [5]:
dates

DatetimeIndex(['2020-08-21', '2020-08-22', '2020-08-23', '2020-08-24',
               '2020-08-25', '2020-08-26'],
              dtype='datetime64[ns]', freq='D')

In [6]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

In [7]:
df

|  | A | B | C | D |
|---|---|---|---|---|
| 2020-08-21 | -0.240493 | 1.049402 | -2.227476 | 0.946034 |
| 2020-08-22 | 0.585824 | -0.00689 | -0.847243 | -1.44641 |
| 2020-08-23 | -1.503538 | 0.867013 | -2.263381 | -2.04173 |
| 2020-08-24 | -0.944664 | 0.276492 | -1.174007 | -0.529376 |
| 2020-08-25 | -2.594362 | -0.708836 | -1.577329 | 0.09938 |
| 2020-08-26 | 0.307533 | -0.917163 | -0.868527 | 2.010804 |
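
Because np.random.randn is unseeded, the numbers in the table above will differ on every run. If reproducible values are wanted, one option is a seeded generator. The following is only a sketch: the seed value 0 and the names df_seeded and df_again are arbitrary choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2020-08-21", periods=6)  # daily frequency by default

# A seeded generator yields the same numbers on every run.
rng = np.random.default_rng(seed=0)
df_seeded = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

# Rebuilding with the same seed gives an identical DataFrame.
rng_again = np.random.default_rng(seed=0)
df_again = pd.DataFrame(rng_again.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

print(df_seeded.equals(df_again))  # True
```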


3. Create a DataFrame from a dict of objects that can be converted to Series:

In [8]:
df2 = pd.DataFrame({'A':1.,
                    'B':pd.Timestamp('20200821'),
                    'C':pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D':np.array([3] * 4, dtype='int32'),
                    'E':pd.Categorical(["test", "train", "test", "train"]),
                    'F':'foo'})


In [9]:
df2

|  | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 2020-08-21 | 1.0 | 3 | test | foo |
| 1 | 1.0 | 2020-08-21 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2020-08-21 | 1.0 | 3 | test | foo |
| 3 | 1.0 | 2020-08-21 | 1.0 | 3 | train | foo |


In [10]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
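
In df2, scalar entries such as 1. and 'foo' are repeated for every row, while the Series and array entries determine how many rows there are. A smaller sketch of the same behavior, with hypothetical column names chosen only for illustration:

```python
import pandas as pd

frame = pd.DataFrame({
    "scalar": 1.0,                                        # repeated for every row
    "series": pd.Series([10, 20, 30], index=[0, 1, 2]),   # supplies the row index
    "label": "foo",                                       # repeated as well
})

print(len(frame))                   # 3
print(frame["scalar"].tolist())     # [1.0, 1.0, 1.0]
print(frame["label"].tolist())      # ['foo', 'foo', 'foo']
```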

## Viewing Data

The following code shows how to view the first and last rows of a DataFrame:

In [11]:
df2.head()

|  | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 2020-08-21 | 1.0 | 3 | test | foo |
| 1 | 1.0 | 2020-08-21 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2020-08-21 | 1.0 | 3 | test | foo |
| 3 | 1.0 | 2020-08-21 | 1.0 | 3 | train | foo |


In [12]:
df.tail(3)

|  | A | B | C | D |
|---|---|---|---|---|
| 2020-08-24 | -0.944664 | 0.276492 | -1.174007 | -0.529376 |
| 2020-08-25 | -2.594362 | -0.708836 | -1.577329 | 0.09938 |
| 2020-08-26 | 0.307533 | -0.917163 | -0.868527 | 2.010804 |
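
df2.head() returned all four rows because head() and tail() default to five rows and df2 has fewer than that. A small sketch with a hypothetical ten-row frame makes the defaults visible:

```python
import pandas as pd

numbers = pd.DataFrame({"n": range(10)})

print(len(numbers.head()))              # 5, the default row count
print(numbers.tail(3)["n"].tolist())    # [7, 8, 9]
print(len(numbers.head(20)))            # 10, asking for more rows than exist is fine
```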


Display the index and the column names:

In [13]:
df.index

DatetimeIndex(['2020-08-21', '2020-08-22', '2020-08-23', '2020-08-24',
               '2020-08-25', '2020-08-26'],
              dtype='datetime64[ns]', freq='D')

In [14]:
df2.index

Int64Index([0, 1, 2, 3], dtype='int64')

In [15]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [16]:
df2.columns

Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')

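DataFrame.to_numpy() returns a NumPy representation of the underlying data. Note that this can be an expensive operation when the columns of the DataFrame have different data types, which reflects a fundamental difference between Pandas and NumPy: a NumPy array has a single dtype for the whole array, while a DataFrame has one dtype per column. When you call DataFrame.to_numpy(), Pandas looks for a NumPy dtype that can hold every dtype in the DataFrame. That dtype may end up being object, which requires casting every value to a Python object.

All values in df are floating-point numbers, so DataFrame.to_numpy() is fast and does not require copying the data.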

In [17]:
df.to_numpy()

array([[-0.24049267,  1.04940218, -2.22747623,  0.9460338 ],
       [ 0.58582395, -0.00688963, -0.84724289, -1.44640995],
       [-1.50353842,  0.86701302, -2.26338075, -2.04173026],
       [-0.94466448,  0.27649176, -1.17400708, -0.52937574],
       [-2.59436244, -0.70883576, -1.57732878,  0.09937994],
       [ 0.30753339, -0.91716323, -0.86852731,  2.01080411]])

df2 contains several dtypes, so DataFrame.to_numpy() is comparatively expensive.

In [18]:
df2.to_numpy()

array([[1.0, Timestamp('2020-08-21 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2020-08-21 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2020-08-21 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2020-08-21 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

> Note  
> The output of DataFrame.to_numpy() does not include the index or the column labels.
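
The dtype search described above can be observed directly. The sketch below uses three small hypothetical frames (not from the original notebook) and only inspects the dtype of the resulting array:

```python
import numpy as np
import pandas as pd

floats = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
ints_and_floats = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})
mixed = pd.DataFrame({"a": [1.0, 2.0], "b": ["x", "y"]})

# All columns already share one dtype.
print(floats.to_numpy().dtype)             # float64
# Integers and floats have a common numeric dtype.
print(ints_and_floats.to_numpy().dtype)    # float64
# Numbers and strings only fit together as generic Python objects.
print(mixed.to_numpy().dtype)              # object
```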

describe() shows a quick statistical summary of the data:

In [19]:
df.describe()

|  | A | B | C | D |
|---|---|---|---|---|
| count | 6.0 | 6.0 | 6.0 | 6.0 |
| mean | -0.731617 | 0.093336 | -1.492994 | -0.160216 |
| std | 1.19731 | 0.802598 | 0.640122 | 1.505668 |
| min | -2.594362 | -0.917163 | -2.263381 | -2.04173 |
| 25% | -1.36382 | -0.533349 | -2.064939 | -1.217151 |
| 50% | -0.592579 | 0.134801 | -1.375668 | -0.214998 |
| 75% | 0.170527 | 0.719383 | -0.944897 | 0.73437 |
| max | 0.585824 | 1.049402 | -0.847243 | 2.010804 |


In [20]:
df2.describe()

|  | A | C | D |
|---|---|---|---|
| count | 4.0 | 4.0 | 4.0 |
| mean | 1.0 | 1.0 | 3.0 |
| std | 0.0 | 0.0 | 0.0 |
| min | 1.0 | 1.0 | 3.0 |
| 25% | 1.0 | 1.0 | 3.0 |
| 50% | 1.0 | 1.0 | 3.0 |
| 75% | 1.0 | 1.0 | 3.0 |
| max | 1.0 | 1.0 | 3.0 |
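
The df2 summary above omits the non-numeric columns because describe() summarizes numeric columns by default. Exactly which dtypes count as summarizable may depend on the Pandas version, so the sketch below sticks to one numeric and one string column; the frame and its column names are hypothetical.

```python
import pandas as pd

people = pd.DataFrame({
    "age": [30, 40, 50],
    "name": ["ann", "bob", "cy"],
})

# By default only the numeric column is summarized.
summary = people.describe()
print(list(summary.columns))          # ['age']
print(summary.loc["mean", "age"])     # 40.0

# include="all" asks for every column, with NaN where a statistic does not apply.
full = people.describe(include="all")
print(sorted(full.columns))           # ['age', 'name']
```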


Transpose the data:

In [21]:
df.T

|  | 2020-08-21 | 2020-08-22 | 2020-08-23 | 2020-08-24 | 2020-08-25 | 2020-08-26 |
|---|---|---|---|---|---|---|
| A | -0.240493 | 0.585824 | -1.503538 | -0.944664 | -2.594362 | 0.307533 |
| B | 1.049402 | -0.00689 | 0.867013 | 0.276492 | -0.708836 | -0.917163 |
| C | -2.227476 | -0.847243 | -2.263381 | -1.174007 | -1.577329 | -0.868527 |
| D | 0.946034 | -1.44641 | -2.04173 | -0.529376 | 0.09938 | 2.010804 |


In [22]:
df2.T

|  | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| A | 1 | 1 | 1 | 1 |
| B | 2020-08-21 00:00:00 | 2020-08-21 00:00:00 | 2020-08-21 00:00:00 | 2020-08-21 00:00:00 |
| C | 1 | 1 | 1 | 1 |
| D | 3 | 3 | 3 | 3 |
| E | test | train | test | train |
| F | foo | foo | foo | foo |
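
Transposing a frame with mixed dtypes has a side effect worth knowing: each new column now holds values that came from different original columns, so the columns fall back to object. A short sketch with two hypothetical frames:

```python
import pandas as pd

floats = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
mixed = pd.DataFrame({"a": [1.0, 2.0], "b": ["x", "y"]})

# A uniformly typed frame keeps its dtype after transposing.
print(floats.T.dtypes.tolist())

# Each transposed column mixes a float and a string, so it becomes object.
print(mixed.T.dtypes.tolist())
```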


Sort by an axis:

In [23]:
df.sort_index(axis=1, ascending=False)

|  | D | C | B | A |
|---|---|---|---|---|
| 2020-08-21 | 0.946034 | -2.227476 | 1.049402 | -0.240493 |
| 2020-08-22 | -1.44641 | -0.847243 | -0.00689 | 0.585824 |
| 2020-08-23 | -2.04173 | -2.263381 | 0.867013 | -1.503538 |
| 2020-08-24 | -0.529376 | -1.174007 | 0.276492 | -0.944664 |
| 2020-08-25 | 0.09938 | -1.577329 | -0.708836 | -2.594362 |
| 2020-08-26 | 2.010804 | -0.868527 | -0.917163 | 0.307533 |


Sort by values:

In [24]:
df.sort_values(by='B')

|  | A | B | C | D |
|---|---|---|---|---|
| 2020-08-26 | 0.307533 | -0.917163 | -0.868527 | 2.010804 |
| 2020-08-25 | -2.594362 | -0.708836 | -1.577329 | 0.09938 |
| 2020-08-22 | 0.585824 | -0.00689 | -0.847243 | -1.44641 |
| 2020-08-24 | -0.944664 | 0.276492 | -1.174007 | -0.529376 |
| 2020-08-23 | -1.503538 | 0.867013 | -2.263381 | -2.04173 |
| 2020-08-21 | -0.240493 | 1.049402 | -2.227476 | 0.946034 |
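
Both sorting methods return a new object and leave the original untouched. The sketch below uses a small hypothetical frame with fixed values so the resulting order is easy to check by eye:

```python
import pandas as pd

scores = pd.DataFrame(
    {"B": [3, 1, 2], "A": [0.3, 0.1, 0.2]},
    index=["x", "y", "z"],
)

# Sort rows by the values in column B, largest first.
by_value_desc = scores.sort_values(by="B", ascending=False)
print(by_value_desc.index.tolist())    # ['x', 'z', 'y']

# Sort the column labels (axis=1) in ascending order.
by_columns = scores.sort_index(axis=1)
print(by_columns.columns.tolist())     # ['A', 'B']

# The original frame keeps its row and column order.
print(scores.index.tolist())           # ['x', 'y', 'z']
```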


## Selection

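> Note  
> Standard Python / NumPy expressions for selecting and setting are intuitive and convenient for interactive work, but for production code we recommend the optimized Pandas data access methods: .at, .iat, .loc and .iloc.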
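
As a quick orientation, here is a minimal sketch of the four accessors on a small hypothetical frame. .loc and .at work with labels, .iloc and .iat work with integer positions, and .at / .iat are meant for reading or writing a single value.

```python
import pandas as pd

table = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6]},
    index=["r0", "r1", "r2"],
)

print(table.loc["r1", "B"])    # 5, label-based
print(table.iloc[1, 1])        # 5, position-based
print(table.at["r2", "A"])     # 3, label-based scalar access
print(table.iat[2, 0])         # 3, position-based scalar access
```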

### 1. Getting Data

Selecting a single column yields a Series, equivalent to df.A:

In [30]:
df['A']

2020-08-21   -0.240493
2020-08-22    0.585824
2020-08-23   -1.503538
2020-08-24   -0.944664
2020-08-25   -2.594362
2020-08-26    0.307533
Freq: D, Name: A, dtype: float64
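
The attribute form df.A is only a shortcut. It does not work when a column name is not a valid Python identifier, and it does not reach the column when the name collides with an existing DataFrame attribute or method. A sketch with hypothetical column names chosen to trigger both cases:

```python
import pandas as pd

data = pd.DataFrame({"A": [1, 2], "count": [10, 20], "my col": [5, 6]})

# For an ordinary name, both forms return the same Series.
print(data["A"].equals(data.A))    # True

# "count" is also a DataFrame method, so attribute access returns the method.
print(callable(data.count))        # True

# Bracket access always reaches the column.
print(data["count"].tolist())      # [10, 20]
print(data["my col"].tolist())     # [5, 6]
```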

### 2. Slicing Rows with [ ]

In [31]:
df[0:3]

|  | A | B | C | D |
|---|---|---|---|---|
| 2020-08-21 | -0.240493 | 1.049402 | -2.227476 | 0.946034 |
| 2020-08-22 | 0.585824 | -0.00689 | -0.847243 | -1.44641 |
| 2020-08-23 | -1.503538 | 0.867013 | -2.263381 | -2.04173 |


In [32]:
df['20200821':'20200823']

|  | A | B | C | D |
|---|---|---|---|---|
| 2020-08-21 | -0.240493 | 1.049402 | -2.227476 | 0.946034 |
| 2020-08-22 | 0.585824 | -0.00689 | -0.847243 | -1.44641 |
| 2020-08-23 | -1.503538 | 0.867013 | -2.263381 | -2.04173 |
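
The two slices above return the same rows for different reasons: the integer slice 0:3 is positional and excludes the end position, while the date slice is label-based and includes the end label. A sketch that checks this on a hypothetical frame with fixed values:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2020-08-21", periods=6)
frame = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

by_position = frame[0:3]                        # positions 0, 1, 2 (3 is excluded)
by_label = frame["2020-08-21":"2020-08-23"]     # the end label is included

print(len(by_position), len(by_label))          # 3 3
print(by_position.equals(by_label))             # True
print(by_label.index[-1])                       # 2020-08-23 00:00:00
```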


### 3. Selection by Label

Select a single row by label:

In [34]:
df.loc[dates[0]]

A   -0.240493
B    1.049402
C   -2.227476
D    0.946034
Name: 2020-08-21 00:00:00, dtype: float64

Select multiple columns by label:

In [35]:
df.loc[:, ['A', 'B']]

|  | A | B |
|---|---|---|
| 2020-08-21 | -0.240493 | 1.049402 |
| 2020-08-22 | 0.585824 | -0.00689 |
| 2020-08-23 | -1.503538 | 0.867013 |
| 2020-08-24 | -0.944664 | 0.276492 |
| 2020-08-25 | -2.594362 | -0.708836 |
| 2020-08-26 | 0.307533 | -0.917163 |


Slice by label; both endpoints of the slice are included:

In [36]:
df.loc['20200821':'20200823', ['A', 'B']]

|  | A | B |
|---|---|---|
| 2020-08-21 | -0.240493 | 1.049402 |
| 2020-08-22 | 0.585824 | -0.00689 |
| 2020-08-23 | -1.503538 | 0.867013 |


Reduce the dimensions of the returned object:

In [37]:
df.loc['20200821', ['A', 'B']]

A   -0.240493
B    1.049402
Name: 2020-08-21 00:00:00, dtype: float64
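
How many dimensions .loc returns depends on the kind of selector used on each axis: a slice or a list keeps that axis, while a single label drops it. The sketch below illustrates this on a small hypothetical frame, using Timestamp labels taken from the index itself:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2020-08-21", periods=3)
frame = pd.DataFrame(np.arange(6.0).reshape(3, 2), index=dates, columns=["A", "B"])

two_d = frame.loc[dates[0]:dates[0], ["A", "B"]]   # slice + list  -> DataFrame
one_d = frame.loc[dates[0], ["A", "B"]]            # label + list  -> Series
scalar = frame.loc[dates[0], "A"]                  # label + label -> scalar

print(type(two_d).__name__)    # DataFrame
print(type(one_d).__name__)    # Series
print(scalar)                  # 0.0
```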