# Pandas DataFrame
Reference:
[pandas official documentation](https://pandas.pydata.org/docs/user_guide/dsintro.html#dataframe)

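## What is a DataFrame
- A two-dimensional, labeled data structure
- Can be thought of as a SQL table, or as a dict of Series objects (each Series is one column)
- Has a row index, index (optional)
- Has column labels, columns (optional)
- If index and/or columns are passed, any data that does not match the given labels is discarded
- When the data is a dict and columns is not specified, the columns follow the dict's insertion order for Python >= 3.6 and pandas >= 0.23; otherwise they are sorted lexically by column name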
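
The last two points can be checked with a small sketch. The column names and values below are made up for illustration and are not taken from the sections that follow.

```python
import pandas as pd

# Hypothetical sample data: two Series with partly overlapping labels.
data = {
    "b_col": pd.Series([1, 2, 3], index=["x", "y", "z"]),
    "a_col": pd.Series([10, 20], index=["x", "y"]),
}

# Columns follow the dict's insertion order ("b_col" first), not alphabetical order.
df_all = pd.DataFrame(data)
print(list(df_all.columns))

# Passing index/columns keeps only the matching labels:
# row "z" and column "a_col" are left out of the result.
df_subset = pd.DataFrame(data, index=["x", "y"], columns=["b_col"])
print(df_subset)
```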

## How to create a DataFrame

### 1. From a dict of Series

In [6]:
import pandas as pd
d = {
   "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
   "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
   }

df = pd.DataFrame(d)
print(df)
print("\n")

df = pd.DataFrame(d, index=["d", "b", "a"])
print(df)
print("\n")

df = pd.DataFrame(d, index=["d", "b", "a"], columns=["two", "three"])
print(df)


   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0


   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN


### 2. From a dict of ndarrays / lists
Note: the ndarrays / lists must all have the same length. If an index is passed, it must have that length as well

In [9]:
d = {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]}
df = pd.DataFrame(d)
print(df)
print("\n")

df = pd.DataFrame(d, index=["a", "b", "c", "d"])
print(df)

   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0


   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0
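
As a rough illustration of the equal-length requirement, the sketch below passes lists of different lengths and catches the resulting error. It then shows one possible workaround, wrapping each list in a Series so that pandas aligns on the index instead. The variable names are hypothetical.

```python
import pandas as pd

# Lists of unequal length cannot be combined into columns directly.
try:
    pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [4.0, 3.0]})
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Wrapping each list in a Series lets pandas align on the index
# and pad the shorter column with NaN.
padded = pd.DataFrame(
    {"one": pd.Series([1.0, 2.0, 3.0]), "two": pd.Series([4.0, 3.0])}
)
print(padded)
```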


### 3. From a structured ndarray
Note: each element of the structured ndarray is a tuple, and it becomes one row of the DataFrame

In [14]:
import numpy as np

data = np.zeros((2,), dtype=[("A", "i4"), ("B", "f4"), ("C", "S10")])
print(data)

data[:] = [(1, 2.0, "Hello"), (2, 3.0, "World")]
print(data)

df = pd.DataFrame(data)
print(df)

df = pd.DataFrame(data, index=["first", "second"])
print(df)

df = pd.DataFrame(data, columns=["C", "A", "B"])
print(df)

[(0, 0., b'') (0, 0., b'')]
[(1, 2., b'Hello') (2, 3., b'World')]
   A    B         C
0  1  2.0  b'Hello'
1  2  3.0  b'World'
        A    B         C
first   1  2.0  b'Hello'
second  2  3.0  b'World'
          C  A    B
0  b'Hello'  1  2.0
1  b'World'  2  3.0
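
Column C above holds bytes objects rather than ordinary strings. If text is wanted, one option is to decode the column after building the DataFrame. This is a minimal sketch that assumes the bytes are UTF-8 encoded.

```python
import numpy as np
import pandas as pd

# Same layout as the structured array above; "S10" is a 10-byte string field.
data = np.zeros((2,), dtype=[("A", "i4"), ("B", "f4"), ("C", "S10")])
data[:] = [(1, 2.0, b"Hello"), (2, 3.0, b"World")]
df = pd.DataFrame(data)

# Decode the bytes in column "C" into regular Python strings (assumes UTF-8).
df["C"] = df["C"].str.decode("utf-8")
print(df)
```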


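## Selecting data

### 1. Selecting, adding, and deleting columns
A DataFrame can be treated semantically as a dict of Series objects. Getting, setting, and deleting columns uses the same syntax as the corresponding dict operations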

In [6]:
import pandas as pd
d = {
   "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
   "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
   }

df = pd.DataFrame(d)

print("Select a column:")
print(df["one"])

print("\nAdd columns:")
df["three"] = df["one"] * df["two"]
df["flag"] = df["one"] > 2
df["foo"] = "bar"
df["one_trunc"] = df["one"][:2]
print(df)

print("\nDelete columns:")
del df["two"]
df.pop("three")
print(df)

Select a column:
a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

Add columns:
   one  two  three   flag  foo  one_trunc
a  1.0  1.0    1.0  False  bar        1.0
b  2.0  2.0    4.0  False  bar        2.0
c  3.0  3.0    9.0   True  bar        NaN
d  NaN  4.0    NaN  False  bar        NaN

Delete columns:
   one   flag  foo  one_trunc
a  1.0  False  bar        1.0
b  2.0  False  bar        2.0
c  3.0   True  bar        NaN
d  NaN  False  bar        NaN
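
Assigning with df[name] = ... always appends the new column at the end. As a small sketch of an alternative, DataFrame.insert places a column at a chosen position, and pop returns the column it removes. The frame and the name one_copy below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "foo": ["bar", "bar", "bar"]},
    index=["a", "b", "c"],
)

# Insert a copy of "one" at position 1, i.e. as the second column.
df.insert(1, "one_copy", df["one"])
print(df)

# pop removes the column from df and returns it as a Series.
popped = df.pop("foo")
print(popped)
```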


### 2. Selecting data by column / row
| Operation | Syntax | Result |
| ---- | ---- | ---- |
|   Select column(s)   |   df[col] or df[[cols]]   |   Series or DataFrame   |
|   Select a row by label   |   df.loc[label]   |   Series   |
|   Select a row by integer position   |   df.iloc[loc]   |   Series   |
|   Slice rows   |   df[5:10]   |   DataFrame   |
|   Select rows by boolean vector   |   df[bool_vec]   |   DataFrame   |

In [17]:
print("Select columns")
print(df[["flag", "one"]])

print("\nSelect a row by label")
print(df.loc["b"])

print("\nSelect a row by integer position")
print(df.iloc[2])

print("\nSlice rows")
print(df[1:3])

print("\nSelect rows by boolean vector")
print(df[[True, False, True, False]])

Select columns
    flag  one
a  False  1.0
b  False  2.0
c   True  3.0
d  False  NaN

Select a row by label
one            2.0
flag         False
foo            bar
one_trunc      2.0
Name: b, dtype: object

Select a row by integer position
one           3.0
flag         True
foo           bar
one_trunc     NaN
Name: c, dtype: object

Slice rows
   one   flag  foo  one_trunc
b  2.0  False  bar        2.0
c  3.0   True  bar        NaN

Select rows by boolean vector
   one   flag  foo  one_trunc
   one   flag  foo  one_trunc
a  1.0  False  bar        1.0
c  3.0   True  bar        NaN
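
In practice the boolean vector is usually computed from a condition rather than written out by hand. The following sketch uses a small made-up frame to show that pattern, and also checks that loc and iloc return the same row when the label and the position refer to the same record.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0, float("nan")], "foo": ["bar"] * 4},
    index=["a", "b", "c", "d"],
)

# A comparison yields a boolean Series aligned with df's index;
# comparisons against NaN evaluate to False.
mask = df["one"] > 1
selected = df[mask]
print(selected)

# Label "c" sits at integer position 2, so both lookups give the same row.
same_row = df.loc["c"].equals(df.iloc[2])
print(same_row)
```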


### 3. Creating new columns / overwriting existing columns, returning a new DataFrame

In [4]:
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

iris = load_iris()

df = pd.DataFrame(data=np.c_[iris['data'], iris['target']], columns=iris['feature_names']+['target'])

df.rename(columns={'sepal length (cm)':'SepalLength', 
                   'sepal width (cm)':'SepalWidth', 
                   'petal length (cm)':'PetalLength',
                   'petal width (cm)':'PetalWidth'}, 
          inplace=True)
print(df.head())

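# Option 1: pass a precomputed Series
print(df.assign(sepal_ratio=df["SepalWidth"] / df["SepalLength"]).head())

# Option 2: pass a callable that is evaluated on the DataFrame being assigned to
print(df.assign(sepal_ratio=lambda x: (x["SepalWidth"] / x["SepalLength"])).head())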

   SepalLength  SepalWidth  PetalLength  PetalWidth  target
0          5.1         3.5          1.4         0.2     0.0
1          4.9         3.0          1.4         0.2     0.0
2          4.7         3.2          1.3         0.2     0.0
3          4.6         3.1          1.5         0.2     0.0
4          5.0         3.6          1.4         0.2     0.0
   SepalLength  SepalWidth  PetalLength  PetalWidth  target  sepal_ratio
0          5.1         3.5          1.4         0.2     0.0     0.686275
1          4.9         3.0          1.4         0.2     0.0     0.612245
2          4.7         3.2          1.3         0.2     0.0     0.680851
3          4.6         3.1          1.5         0.2     0.0     0.673913
4          5.0         3.6          1.4         0.2     0.0     0.720000
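   SepalLength  SepalWidth  PetalLength  PetalWidth  target  sepal_ratio
0          5.1         3.5          1.4         0.2     0.0     0.686275
1          4.9         3.0          1.4         0.2     0.0     0.612245
2          4.7         3.2          1.3         0.2     0.0     0.680851
3          4.6         3.1          1.5         0.2     0.0     0.673913
4          5.0         3.6          1.4         0.2     0.0     0.720000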

Note: **assign always returns a copy of the data and leaves the original DataFrame unchanged**

In [5]:
print(df.head())


   SepalLength  SepalWidth  PetalLength  PetalWidth  target
0          5.1         3.5          1.4         0.2     0.0
1          4.9         3.0          1.4         0.2     0.0
2          4.7         3.2          1.3         0.2     0.0
3          4.6         3.1          1.5         0.2     0.0
4          5.0         3.6          1.4         0.2     0.0
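
The copy semantics can also be verified programmatically. This is a minimal sketch on a two-row frame whose values are taken from the first two iris rows printed above; the name with_ratio is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"SepalLength": [5.1, 4.9], "SepalWidth": [3.5, 3.0]})

# assign returns a new DataFrame that carries the extra column.
with_ratio = df.assign(sepal_ratio=lambda x: x["SepalWidth"] / x["SepalLength"])

print("sepal_ratio" in with_ratio.columns)  # checks the returned frame
print("sepal_ratio" in df.columns)          # checks the original frame
```

To keep the new column, bind the result to a name (for example by reassigning it to df), since the original frame is not modified in place.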


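## Arithmetic operations

### 1. Operations between DataFrames
Rows and columns are aligned automatically by label; any position whose label exists in only one of the operands yields NaN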

In [22]:
df = pd.DataFrame(np.random.randn(10, 4), columns=["A", "B", "C", "D"])
df2 = pd.DataFrame(np.random.randn(7, 3), columns=["A", "B", "C"])

print(df+df2)

          A         B         C   D
0 -2.146374  2.816157 -0.233808 NaN
1  0.084840 -1.295670  1.572045 NaN
2 -1.858359  1.031073  0.012748 NaN
3 -1.678174  0.612198 -0.743227 NaN
4  0.259344  3.149538  1.528738 NaN
5 -0.115171  2.005839 -0.363523 NaN
6  1.517893  0.845197 -0.713599 NaN
7       NaN       NaN       NaN NaN
8       NaN       NaN       NaN NaN
9       NaN       NaN       NaN NaN
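
Because the example above uses random data, the alignment rule is easier to see with fixed values. The sketch below uses two small made-up frames and also shows DataFrame.add with fill_value, one way to treat a value missing on one side as 0 instead of producing NaN.

```python
import pandas as pd

left = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]})
right = pd.DataFrame({"A": [0.5, 0.5]})  # no column "B", no row 2

# Plain "+" gives NaN wherever a row or column label is missing on either side.
aligned_sum = left + right
print(aligned_sum)

# add(..., fill_value=0) substitutes 0 for a value missing on one side only.
filled_sum = left.add(right, fill_value=0)
print(filled_sum)
```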


### 2. Operations between a DataFrame and a Series
By default the Series index is aligned on the DataFrame columns, so the operation is broadcast row by row

In [23]:
print(df - df.iloc[0])

          A         B         C         D
0  0.000000  0.000000  0.000000  0.000000
1 -0.122818 -2.804331  1.023692 -0.179254
2 -1.715853 -1.396656  0.064808  0.471526
3 -1.406908 -2.379344 -1.177743 -0.453933
4 -1.452593 -2.335547  0.730372  0.353130
5 -1.978379 -0.226348 -0.501059 -2.212686
6  0.200042 -3.092785  0.139767 -0.646990
7 -2.454655 -2.581250 -1.433988  1.170834
8 -0.740658 -2.097519  0.396698 -0.403309
9 -0.606310 -2.403896 -0.539102 -0.991267
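
To contrast the default with aligning on the row index instead, here is a small sketch with fixed, made-up values. It uses DataFrame.sub with axis=0, which matches the Series index against the DataFrame's rows.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 6.0, 8.0]})

# Default: the Series index (A, B) is matched against the columns,
# so row 0 is subtracted from every row.
row_centered = df - df.iloc[0]
print(row_centered)

# axis=0: the Series index (0, 1, 2) is matched against the row index,
# so column "A" is subtracted from every column.
col_centered = df.sub(df["A"], axis=0)
print(col_centered)
```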


### 3. Operations between a DataFrame and a scalar

In [25]:
print(df * 5 + 2)
print("\n")

print(1 / df)
print("\n")

print(df ** 4)

          A          B         C         D
0  6.326814  14.296686  1.620653  3.488176
1  5.712723   0.275031  6.739112  2.591908
2 -2.252453   7.313406  1.944691  5.845804
3 -0.707726   2.399967 -4.268061  1.218508
4 -0.936149   2.618950  5.272513  5.253826
5 -3.565082  13.164948 -0.884644 -7.575253
6  7.327024  -1.167238  2.319490  0.253225
7 -5.946459   1.390434 -5.549286  9.342348
8  2.623522   3.809092  3.604141  1.471633
9  3.295266   2.277205 -1.074859 -1.468161


          A          B          C         D
0  1.155585   0.406614 -13.180555  3.359818
1  1.346721  -2.898603   1.055050  8.447260
2 -1.175792   0.941016 -90.401408  1.300118
3 -1.846568  12.501045  -0.797695 -6.398021
4 -1.702911   8.078202   1.527878  1.536653
5 -0.898459   0.447830  -1.733316 -0.522179
6  0.938610  -1.578663  15.649947 -2.862418
7 -0.629211  -8.202555  -0.662314  0.680981
8  8.018957   2.763817   3.116933 -9.463115
9  3.860211  18.037212  -1.626091 -1.441686


          A          B             C   

### 4. Transposing a DataFrame

In [26]:
print(df[:5].T)

          0         1         2         3         4
A  0.865363  0.742545 -0.850491 -0.541545 -0.587230
B  2.459337 -0.344994  1.062681  0.079993  0.123790
C -0.075869  0.947822 -0.011062 -1.253612  0.654503
D  0.297635  0.118382  0.769161 -0.156298  0.650765


## DataFrame interoperability with NumPy functions

### 1. Applying NumPy ufuncs (log, exp, sqrt, etc.) to a DataFrame

In [33]:
df = pd.DataFrame(np.random.randn(10, 4), columns=["A", "B", "C", "D"])
print(df)

print("\nExponential")
print(np.exp(df))

print("\nConvert to array")
print(np.asarray(df))

          A         B         C         D
0  0.920363 -1.136099  0.039272 -0.056404
1 -1.042451  0.486291  0.466590  1.097513
2  0.863837 -1.107791  0.084759  0.478753
3  0.897284  0.558088 -0.179154 -0.461662
4  0.886460 -1.282126 -0.152255 -1.087159
5 -0.584089 -0.187413  1.342429 -0.617281
6 -1.173771 -0.448081 -1.570684 -1.361098
7 -1.510312 -0.871244  0.177987 -0.585025
8 -0.150362 -0.256548 -0.490751  0.443875
9 -0.365964 -0.064101 -0.966917  0.612064

Exponential
          A         B         C         D
0  2.510200  0.321069  1.040053  0.945157
1  0.352589  1.626274  1.594548  2.996705
2  2.372245  0.330288  1.088454  1.614061
3  2.452933  1.747329  0.835977  0.630236
4  2.426523  0.277447  0.858769  0.337173
5  0.557614  0.829101  3.828330  0.539409
6  0.309199  0.638853  0.207903  0.256379
7  0.220841  0.418431  1.194810  0.557092
8  0.860396  0.773718  0.612167  1.558735
9  0.693528  0.937910  0.380254  1.844233

Convert to array
[[ 0.92036251 -1.13609936  0.03927205 -0.05640389]
 [-1.042450