# 06 - Pandas

- Source of example datasets: https://github.com/mwaskom/seaborn-data

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

## Basic data structures: Series, DataFrame

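The core object types pandas uses to represent data:
- Series: a one-dimensional labeled array. Its elements can be objects of any type (numbers, strings, etc.), and the element labels form its index.
- DataFrame: a two-dimensional labeled data structure in which every column is a Series. The row labels are the index and the column labels are the columns.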

Further reading: Intro to data structures: https://pandas.pydata.org/docs/user_guide/dsintro.html
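As a small sketch of the two structures described above (the names `heights` and `people` and their values are made up for illustration), the following builds a Series and a DataFrame and checks that a DataFrame column is itself a Series sharing the row index:

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels
heights = pd.Series([1.62, 1.75, 1.80], index=['ann', 'bob', 'cat'])

# A DataFrame: two-dimensional, with row labels (index) and column labels (columns)
people = pd.DataFrame(
    {'height': [1.62, 1.75, 1.80], 'city': ['Oslo', 'Rome', 'Lima']},
    index=['ann', 'bob', 'cat'],
)

col = people['height']                  # selecting one column gives a Series
print(type(col).__name__)               # class name of the selected column
print(list(people.index))               # row labels
print(list(people.columns))             # column labels
print(col.index.equals(people.index))   # the column shares the row index
```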

In [None]:
# Create from existing objects
df = pd.DataFrame({
    'a': range(3),
    'b': list('xyz'),
    'c': np.random.normal(size=3),
    'd': 1
})
df

In [None]:
# Create from a file (read_csv uses a comma separator by default)
iris = pd.read_csv('homework/iris.csv')
iris

In [None]:
type(iris)

In [None]:
iris.columns

In [None]:
iris.index

In [None]:
type(iris.sepal_length)

A Series is similar to a vector in R, and a DataFrame is similar to the dict covered earlier: DataFrame ~ dict of Series.
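To make the "dict of Series" analogy concrete, here is a minimal sketch with made-up data (`length`, `species` and `flowers` are illustrative names, not part of the iris file). It builds a DataFrame from a dict of Series and then uses it through dict-like operations:

```python
import pandas as pd

# Hypothetical toy data: two Series sharing the same index
length = pd.Series([5.1, 4.9, 6.3], index=['r1', 'r2', 'r3'])
species = pd.Series(['setosa', 'setosa', 'virginica'], index=['r1', 'r2', 'r3'])

# A dict of Series becomes a DataFrame; the dict keys become column names
flowers = pd.DataFrame({'length': length, 'species': species})

# Like a dict, a DataFrame supports membership tests, keys() and items()
print('length' in flowers)       # membership checks the column names
print(list(flowers.keys()))      # keys() returns the column labels
for name, column in flowers.items():
    # each value is a Series holding one column
    print(name, type(column).__name__, column.dtype)
```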

In [None]:
x = pd.Series(list('aabbbc'))
x

In [None]:
x.value_counts()

In [None]:
x[0]

In [None]:
x.isna()

In [None]:
x.isin(['a', 'b'])

In [None]:
# A Series can carry labels for its elements (a custom index)
s = pd.Series(np.random.standard_normal(5), index=list('abcde'))
s

In [None]:
s.index

In [None]:
s.values

In [None]:
print([i for i in dir(s) if not i.startswith('_')])

In [None]:
print([i for i in dir(iris) if not i.startswith('_')])

## Subsetting (indexing/slicing)
(Anything that can be selected can also be assigned to, so assignment is not covered separately.)
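Since selection expressions double as assignment targets, a short sketch may help. The table `scores` below is hypothetical and only serves to show assignment through a column key and through `.loc`:

```python
import pandas as pd

# Hypothetical toy table
scores = pd.DataFrame({'name': ['ann', 'bob', 'cat'], 'score': [55, 72, 90]})

# The same expression that selects a column can be used to create or overwrite one
scores['passed'] = scores['score'] >= 60

# .loc with a boolean mask and a column label assigns to the matching cells only
scores.loc[scores['score'] < 60, 'score'] = 60

print(scores)
```

The assignment goes through a single `.loc[rows, cols]` call rather than chained indexing such as `scores[mask]['score'] = 60`, because chained indexing may operate on a temporary copy and leave the original table unchanged.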

In [None]:
df

In [None]:
df['a']    # indexing with a string selects a column and returns a Series

In [None]:
df[['a']]  # indexing with a list returns a DataFrame containing the listed columns

In [None]:
df[df['a'] > 0]   # indexing with a boolean vector selects the rows that satisfy the condition

In [None]:
df.a       # attribute-style access (works when the column name is a valid Python identifier)

In [None]:
df.index  # row labels

In [None]:
df.index = [2, 1, 0]
df

In [None]:
df.columns  # column labels

In [None]:
# Index by row labels and column labels
df.loc[[1, 2], ['a', 'b']]

In [None]:
df.loc[:, 'a']

--------------

🙋**Exercise**

What should the following code output?
```python
df.loc[:, ['a']]
```

--------------

In [None]:
# Index by row or column position (similar to a NumPy 2D array)
df.iloc[::-1, ::-1]
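Because the row labels were set to `[2, 1, 0]` earlier, label-based and position-based indexing no longer agree. The sketch below reproduces that situation on a hypothetical one-column table `t` to contrast `.loc` and `.iloc`:

```python
import pandas as pd

# Hypothetical toy table whose row labels run in reverse order
t = pd.DataFrame({'a': [10, 20, 30]}, index=[2, 1, 0])

by_label = t.loc[0, 'a']      # the row whose label is 0 (the last row)
by_position = t.iloc[0, 0]    # the row at position 0 (the first row)
print(by_label, by_position)

# Label slices with .loc include both endpoints;
# position slices with .iloc exclude the stop, as in NumPy
print(t.loc[2:1, 'a'].tolist())
print(t.iloc[0:1, 0].tolist())
```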

In [None]:
# Indexing a Series works much like indexing a DataFrame column
s

In [None]:
s.a

In [None]:
s['a']

In [None]:
s.loc['a']

In [None]:
s.iloc[0]

## 分组操作

GroupBy -> Summarize

In [None]:
iris

In [None]:
iris.groupby('species').mean()

In [None]:
# Equivalent to
iris.groupby('species').agg('mean')
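Beyond a single summary function, `agg` also accepts named aggregations that choose one column and one function per output column. The sketch below uses a hypothetical stand-in table `toy` instead of the iris file, and the output column names `mean_sepal` and `max_petal` are made up:

```python
import pandas as pd

# Hypothetical stand-in for the iris table
toy = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica', 'virginica'],
    'sepal_length': [5.0, 5.2, 6.4, 6.6],
    'petal_length': [1.4, 1.6, 5.5, 5.7],
})

# One summary function: one output row per group
means = toy.groupby('species').mean()

# Named aggregation: pick the input column and function for each output column
summary = toy.groupby('species').agg(
    mean_sepal=('sepal_length', 'mean'),
    max_petal=('petal_length', 'max'),
)
print(means)
print(summary)
```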

GroupBy -> Transform

In [None]:
iris.cumsum()

In [None]:
iris.groupby('species').cumsum()

In [None]:
# Equivalent to
iris.groupby('species').transform('cumsum')
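What distinguishes a transform from a summary is that the result keeps the shape and index of the input. A minimal sketch on a hypothetical table `toy` with interleaved groups shows this, including a summary function broadcast back to every row through `transform`:

```python
import pandas as pd

# Hypothetical toy table with the two groups interleaved
toy = pd.DataFrame({
    'species': ['setosa', 'virginica', 'setosa', 'virginica'],
    'sepal_length': [5.0, 6.0, 5.5, 7.0],
})

g = toy.groupby('species')['sepal_length']

# Transform: the result has the same length and index as the input
running = g.cumsum()

# A summary function passed to transform is broadcast back to every row,
# which makes per-group centering a one-liner
centered = toy['sepal_length'] - g.transform('mean')

print(running.tolist())
print(centered.tolist())
```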

General approach: GroupBy -> Apply

In [None]:
# numeric_only=True skips non-numeric columns such as species
iris.groupby('species').apply(lambda g: g.mean(numeric_only=True))

In [None]:
iris.groupby('species')['sepal_length'].apply(lambda x: np.mean(x))
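One way to read `apply` is as a split -> apply -> combine loop. The sketch below, on a hypothetical table `toy` with a made-up helper `value_range`, computes the same per-group result with `apply` and with an explicit loop over the groups:

```python
import pandas as pd

# Hypothetical toy table
toy = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica', 'virginica'],
    'sepal_length': [5.0, 5.4, 6.0, 7.0],
})

def value_range(x):
    """Hypothetical per-group function: maximum minus minimum."""
    return x.max() - x.min()

# apply calls the function once per group and combines the results
applied = toy.groupby('species')['sepal_length'].apply(value_range)

# The same computation written as an explicit split -> apply -> combine loop
pieces = {}
for key, group in toy.groupby('species'):
    pieces[key] = value_range(group['sepal_length'])
manual = pd.Series(pieces)

print(applied)
print(manual)
```

`apply` is the most flexible of the three patterns, so built-in methods such as `mean` or `cumsum` are usually preferable when one of them already does the job.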

## Further reading (self-study)

- Quick reference for common functions: https://github.com/gxelab/tutorials/blob/main/essential_pandas.md
- Official user guide: https://pandas.pydata.org/docs/user_guide/index.html