# Getting Started with Pandas

In [12]:
import numpy as np
import pandas as pd

## Creating Objects

1. When you create a Series from a list of values, Pandas generates a default integer index:

In [13]:
s = pd.Series([1, 2, 4, 5, np.nan, 6, 8])

In [14]:
s

0    1.0
1    2.0
2    4.0
3    5.0
4    NaN
5    6.0
6    8.0
dtype: float64
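As a minimal sketch of the default-index behaviour described above, the block below builds the same values twice: once with no index and once with explicit labels. The names `s_default`, `s_labeled`, and `labels` are hypothetical and chosen only for this illustration.

```python
import numpy as np
import pandas as pd

# With no index argument, pandas assigns the default integer index 0..n-1.
s_default = pd.Series([1, 2, 4, 5, np.nan, 6, 8])

# Hypothetical labels, used only for this illustration.
labels = list("abcdefg")
s_labeled = pd.Series([1, 2, 4, 5, np.nan, 6, 8], index=labels)

print(s_default.index.tolist())  # integer positions generated by pandas
print(s_labeled["c"])            # label-based lookup on the custom index
print(s_default.dtype)           # NaN is a float, so the values are stored as floats
```

The presence of `np.nan` is also why the output above shows `float64` even though the other values were written as integers.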

2. Create a DataFrame from a NumPy array, with a datetime index and labeled columns:

In [15]:
dates = pd.date_range('2020-08-21', periods=6)

In [16]:
dates

DatetimeIndex(['2020-08-21', '2020-08-22', '2020-08-23', '2020-08-24',
               '2020-08-25', '2020-08-26'],
              dtype='datetime64[ns]', freq='D')

In [17]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

In [18]:
df

| | A | B | C | D |
|---|---|---|---|---|
| 2020-08-21 | 1.216082 | 0.739473 | 2.258084 | 0.018941 |
| 2020-08-22 | -0.361165 | -2.185694 | 2.213578 | -0.92388 |
| 2020-08-23 | -0.402191 | 0.773984 | -1.669357 | 0.007736 |
| 2020-08-24 | 0.86363 | 0.519955 | 0.808847 | 0.57294 |
| 2020-08-25 | 0.455807 | -1.71159 | -2.724522 | -0.825916 |
| 2020-08-26 | -0.935263 | 0.560101 | 0.120242 | 0.203922 |


3. Create a DataFrame from a dict of objects that can be converted to Series-like structures:

In [30]:
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20200821'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})


In [31]:
df2

| | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 2020-08-21 | 1.0 | 3 | test | foo |
| 1 | 1.0 | 2020-08-21 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2020-08-21 | 1.0 | 3 | test | foo |
| 3 | 1.0 | 2020-08-21 | 1.0 | 3 | train | foo |


In [32]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
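The dict-based construction above mixes scalars with array-like values. The sketch below, which uses a hypothetical frame named `frame` with made-up column names, illustrates what appears to happen: the array-like entry sets the number of rows, and the scalar entries are repeated to fill them.

```python
import numpy as np
import pandas as pd

# Hypothetical columns, chosen only for this illustration.
frame = pd.DataFrame({
    "scalar": 1.0,                              # repeated on every row
    "array": np.array([3] * 4, dtype="int32"),  # fixes the length at 4 rows
    "label": "foo",                             # repeated on every row
})

print(frame.shape)   # rows come from the array, columns from the dict keys
print(frame.dtypes)  # each column keeps its own dtype
```

Each column keeps the dtype of the value it was built from, which is why `df2.dtypes` above lists a different dtype for almost every column.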

## Viewing Data

The following code shows how to view the first and last rows of a DataFrame:

In [34]:
df2.head()

| | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 2020-08-21 | 1.0 | 3 | test | foo |
| 1 | 1.0 | 2020-08-21 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2020-08-21 | 1.0 | 3 | test | foo |
| 3 | 1.0 | 2020-08-21 | 1.0 | 3 | train | foo |


In [35]:
df.tail(3)

| | A | B | C | D |
|---|---|---|---|---|
| 2020-08-24 | 0.86363 | 0.519955 | 0.808847 | 0.57294 |
| 2020-08-25 | 0.455807 | -1.71159 | -2.724522 | -0.825916 |
| 2020-08-26 | -0.935263 | 0.560101 | 0.120242 | 0.203922 |


Display the index and the column names:

In [36]:
df.index

DatetimeIndex(['2020-08-21', '2020-08-22', '2020-08-23', '2020-08-24',
               '2020-08-25', '2020-08-26'],
              dtype='datetime64[ns]', freq='D')

In [37]:
df2.index

Int64Index([0, 1, 2, 3], dtype='int64')

In [38]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [39]:
df2.columns

Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')

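DataFrame.to_numpy() returns a NumPy representation of the underlying data. Note that this can be an expensive operation when the DataFrame's columns have different data types, which comes down to a fundamental difference between Pandas and NumPy: a NumPy array has one dtype for the entire array, while a DataFrame has one dtype per column. When you call DataFrame.to_numpy(), Pandas finds the NumPy dtype that can hold all of the dtypes in the DataFrame. That dtype may end up being object, which requires casting every value to a Python object.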

The DataFrame df below holds only floating-point values, so DataFrame.to_numpy() is fast and does not need to copy the data.

In [40]:
df.to_numpy()

array([[ 1.21608153,  0.7394732 ,  2.25808396,  0.0189412 ],
       [-0.36116513, -2.18569425,  2.21357793, -0.92387988],
       [-0.4021907 ,  0.77398438, -1.66935723,  0.00773555],
       [ 0.8636297 ,  0.51995529,  0.80884709,  0.57294048],
       [ 0.45580702, -1.71158955, -2.72452206, -0.82591635],
       [-0.9352626 ,  0.56010088,  0.12024241,  0.20392166]])

df2 contains several dtypes, so DataFrame.to_numpy() is comparatively expensive:

In [42]:
df2.to_numpy()

array([[1.0, Timestamp('2020-08-21 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2020-08-21 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2020-08-21 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2020-08-21 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

> Note  
> The output of DataFrame.to_numpy() does not include the row index or column labels.
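The dtype selection described above can be checked directly by inspecting the `dtype` of the returned array. This is a small sketch with three hypothetical frames (`floats`, `numeric_mix`, `mixed`) rather than the `df` and `df2` used in this notebook.

```python
import numpy as np
import pandas as pd

# All columns share one dtype, so the array keeps that dtype.
floats = pd.DataFrame(np.random.randn(3, 2), columns=["A", "B"])

# Two numeric dtypes: the result uses a dtype wide enough for both.
numeric_mix = pd.DataFrame({
    "i": np.array([1, 2], dtype="int32"),
    "f": [0.5, 1.5],
})

# Numbers and strings have no common numeric dtype, so the result is object.
mixed = pd.DataFrame({"A": [1.0, 2.0], "F": ["foo", "bar"]})

print(floats.to_numpy().dtype)
print(numeric_mix.to_numpy().dtype)
print(mixed.to_numpy().dtype)
```

An `object` result means every element is stored as a Python object, which is the costly case the text warns about.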

describe() gives a quick statistical summary of the data:

In [43]:
df.describe()

| | A | B | C | D |
|---|---|---|---|---|
| count | 6.0 | 6.0 | 6.0 | 6.0 |
| mean | 0.139483 | -0.217295 | 0.167812 | -0.15771 |
| std | 0.834584 | 1.353029 | 2.03511 | 0.592743 |
| min | -0.935263 | -2.185694 | -2.724522 | -0.92388 |
| 25% | -0.391934 | -1.153703 | -1.221957 | -0.617503 |
| 50% | 0.047321 | 0.540028 | 0.464545 | 0.013338 |
| 75% | 0.761674 | 0.69463 | 1.862395 | 0.157677 |
| max | 1.216082 | 0.773984 | 2.258084 | 0.57294 |


In [44]:
df2.describe()

| | A | C | D |
|---|---|---|---|
| count | 4.0 | 4.0 | 4.0 |
| mean | 1.0 | 1.0 | 3.0 |
| std | 0.0 | 0.0 | 0.0 |
| min | 1.0 | 1.0 | 3.0 |
| 25% | 1.0 | 1.0 | 3.0 |
| 50% | 1.0 | 1.0 | 3.0 |
| 75% | 1.0 | 1.0 | 3.0 |
| max | 1.0 | 1.0 | 3.0 |
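The summary for df2 lists only A, C, and D because describe() covers numeric columns by default. The sketch below, on a hypothetical two-column frame named `mixed_cols`, shows the default next to the `include="all"` option, which also reports non-numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one numeric and one categorical column.
mixed_cols = pd.DataFrame({
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
})

numeric_only = mixed_cols.describe()             # default: numeric columns only
everything = mixed_cols.describe(include="all")  # adds unique/top/freq rows

print(list(numeric_only.columns))
print(list(everything.columns))
print(everything.loc["unique", "E"])  # number of distinct categories in E
```

Rows that do not apply to a column, such as `mean` for a categorical column, are filled with NaN in the combined summary.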


Transpose the data:

In [45]:
df.T

| | 2020-08-21 | 2020-08-22 | 2020-08-23 | 2020-08-24 | 2020-08-25 | 2020-08-26 |
|---|---|---|---|---|---|---|
| A | 1.216082 | -0.361165 | -0.402191 | 0.86363 | 0.455807 | -0.935263 |
| B | 0.739473 | -2.185694 | 0.773984 | 0.519955 | -1.71159 | 0.560101 |
| C | 2.258084 | 2.213578 | -1.669357 | 0.808847 | -2.724522 | 0.120242 |
| D | 0.018941 | -0.92388 | 0.007736 | 0.57294 | -0.825916 | 0.203922 |


In [46]:
df2.T

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| A | 1 | 1 | 1 | 1 |
| B | 2020-08-21 00:00:00 | 2020-08-21 00:00:00 | 2020-08-21 00:00:00 | 2020-08-21 00:00:00 |
| C | 1 | 1 | 1 | 1 |
| D | 3 | 3 | 3 | 3 |
| E | test | train | test | train |
| F | foo | foo | foo | foo |


Sort by an axis:

In [47]:
df.sort_index(axis=1, ascending=False)

| | D | C | B | A |
|---|---|---|---|---|
| 2020-08-21 | 0.018941 | 2.258084 | 0.739473 | 1.216082 |
| 2020-08-22 | -0.92388 | 2.213578 | -2.185694 | -0.361165 |
| 2020-08-23 | 0.007736 | -1.669357 | 0.773984 | -0.402191 |
| 2020-08-24 | 0.57294 | 0.808847 | 0.519955 | 0.86363 |
| 2020-08-25 | -0.825916 | -2.724522 | -1.71159 | 0.455807 |
| 2020-08-26 | 0.203922 | 0.120242 | 0.560101 | -0.935263 |


Sort by values:

In [48]:
df.sort_values(by='B')

| | A | B | C | D |
|---|---|---|---|---|
| 2020-08-22 | -0.361165 | -2.185694 | 2.213578 | -0.92388 |
| 2020-08-25 | 0.455807 | -1.71159 | -2.724522 | -0.825916 |
| 2020-08-24 | 0.86363 | 0.519955 | 0.808847 | 0.57294 |
| 2020-08-26 | -0.935263 | 0.560101 | 0.120242 | 0.203922 |
| 2020-08-21 | 1.216082 | 0.739473 | 2.258084 | 0.018941 |
| 2020-08-23 | -0.402191 | 0.773984 | -1.669357 | 0.007736 |
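Sorting by values also accepts a sort direction and more than one key column. The following sketch uses a small hypothetical frame named `scores`, with fixed values so the ordering is easy to follow, and is not part of the notebook's data.

```python
import pandas as pd

# Hypothetical data, used only for this illustration.
scores = pd.DataFrame({
    "team": ["b", "a", "b", "a"],
    "points": [3, 7, 9, 1],
})

# Descending sort on a single column.
by_points_desc = scores.sort_values(by="points", ascending=False)

# Sort by team first (ascending), then by points within each team (descending).
by_team_then_points = scores.sort_values(
    by=["team", "points"], ascending=[True, False]
)

print(by_points_desc)
print(by_team_then_points)
```

Both calls return a new DataFrame and leave `scores` in its original order; the original row labels travel with the rows, as in the date-indexed output above.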
