# Pandas DataFrame

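A DataFrame is a two-dimensional data structure: the data is arranged as a table of rows and columns.
Key features of a DataFrame:

- Columns can hold different types
- Size is mutable
- Labelled axes (rows and columns)
- Arithmetic operations can be performed on rows and columns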

A pandas DataFrame can be created with the pandas.DataFrame constructor, which takes the following parameters:

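|  Parameter   | Description  |
|  ----  | ----  |
| data  | The data, in one of several forms: ndarray, Series, map, lists, dict, constant, or another DataFrame |
| index  | Row labels for the resulting frame. Optional; defaults to np.arange(n) when no index is passed |
| columns  | Column labels |
| dtype  | Data type of the columns; inferred from the data if omitted |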
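
As a minimal sketch of how these parameters fit together (the values and labels below are made up for illustration, not taken from the tutorial's data):

```python
import numpy as np
import pandas as pd

# Illustrative values only; the labels are arbitrary.
values = np.array([[1, 2], [3, 4], [5, 6]])

frame = pd.DataFrame(
    data=values,
    index=["r1", "r2", "r3"],   # row labels
    columns=["x", "y"],         # column labels
    dtype="float64",            # force one dtype for all columns
)

# Without index/columns, both default to 0..n-1.
default_frame = pd.DataFrame(values)

print(frame)
print(default_frame.index.tolist(), default_frame.columns.tolist())
```

Passing `dtype` converts the integer input to floats here; leaving it out lets pandas infer the type from the data.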

## Creating a DataFrame

A DataFrame creates its index automatically, and the index is ordered. Index objects are immutable.
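
The immutability of Index objects can be checked with a small sketch (the frame `df_demo` is a hypothetical example): assigning to one element of the index fails, while replacing the whole index with a new object is allowed.

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [10, 20, 30]})
print(df_demo.index)  # RangeIndex(start=0, stop=3, step=1)

# Assigning to a single element of an Index raises TypeError.
try:
    df_demo.index[0] = 100
    mutated = True
except TypeError:
    mutated = False

# Replacing the whole index with a new object is allowed.
df_demo.index = ["x", "y", "z"]
print(mutated, df_demo.index.tolist())
```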

In [51]:
import pandas as pd

# From a list
data = [1,2,3,4,5]
df = pd.DataFrame(data)

data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])

# From a dict of ndarrays/lists
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])

# From a list of dicts (keys missing from a dict become NaN)
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)

# From a dict of Series
d = {'one' : pd.Series([1, 2, 3], index=[0,1,2]),
      'two' : pd.Series([1, 2, 3, 4], index=[0,1,2,3])}
df = pd.DataFrame(d)
df

|    | one  | two  |
|  ----  | ----  | ----  |
| 0  | 1.0  | 1 |
| 1  | 2.0  | 2 |
| 2  | 3.0  | 3 |
| 3  | NaN  | 4 |
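
The missing value in row 3 of column `one` comes from index alignment: when a DataFrame is built from a dict of Series, the row index is the union of the Series indexes. A short sketch of that behaviour, using the same two Series under the hypothetical names `s_short` and `s_long`:

```python
import pandas as pd

s_short = pd.Series([1, 2, 3], index=[0, 1, 2])
s_long = pd.Series([1, 2, 3, 4], index=[0, 1, 2, 3])

aligned = pd.DataFrame({"one": s_short, "two": s_long})

# Row 3 exists only in s_long, so "one" is missing there.
print(aligned["one"].isna().tolist())   # [False, False, False, True]
# The missing value forces "one" to float64; "two" stays an integer column.
print(aligned.dtypes)
```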


## Viewing data

Use loc to select by label and iloc to select by position.

In [57]:
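# Number of rows and columns
print(df.shape)

# First n rows
print(df.head(5))

# Last n rows
print(df.tail(5))

# Summary statistics for the numeric columns
print(df.describe())

# Row labels and column names
df.columns
print(df.index.tolist())    # row labels
print(df.columns.tolist())  # column names

# All values of one column
df['one']

# Row i, column j
print(df.loc[0, "one"])   # row labelled 0, column "one"
print(df.iloc[0, 0])   # first row, first column (the same cell)

# Several rows and several columns
print(df.loc[[2,3],['one','two']])  # rows labelled 2 and 3, columns "one" and "two"
print(df.iloc[[2,3], [0,1]])  # the same selection by position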

(4, 2)
   one  two
0  1.0    1
1  2.0    2
2  3.0    3
3  NaN    4
   one  two
0  1.0    1
1  2.0    2
2  3.0    3
3  NaN    4
       one       two
count  3.0  4.000000
mean   2.0  2.500000
std    1.0  1.290994
min    1.0  1.000000
25%    1.5  1.750000
50%    2.0  2.500000
75%    2.5  3.250000
max    3.0  4.000000
[0, 1, 2, 3]
['one', 'two']
1.0
1.0
   one  two
2  3.0    3
3  NaN    4
   one  two
2  3.0    3
3  NaN    4
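
With the default 0..n-1 index used above, labels and positions coincide, so loc and iloc look interchangeable. A sketch with string row labels (reusing the `rank1`... labels from the creation examples; the frame `people` is illustrative) makes the difference visible:

```python
import pandas as pd

people = pd.DataFrame(
    {"Name": ["Tom", "Jack", "Steve"], "Age": [28, 34, 29]},
    index=["rank1", "rank2", "rank3"],
)

by_label = people.loc["rank2", "Age"]   # label-based
by_position = people.iloc[1, 1]         # position-based, same cell

# loc slices include the end label; iloc slices exclude the end position.
label_slice = people.loc["rank1":"rank2"]
position_slice = people.iloc[0:2]

print(by_label, by_position)
print(label_slice.equals(position_slice))
```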


## Modifying data

In [58]:
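# Add columns
# The new Series is indexed by 'a', 'b', 'c', which matches none of df's row labels (0..3),
# so after alignment every value in 'three' is NaN.
df['three']=pd.Series([10,20,30],index=['a','b','c'])
df['four']=df['one']+df['two']
print (df)

# Delete a column
df.pop('two')
print (df)

# Slice rows and iterate over them
print(df[1:3])
for index, row in df.iterrows():   # each row is yielded as a Series
    print(row)

# Append rows (DataFrame.append was removed in pandas 2.0; pd.concat replaces it)
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['one', 'two'])
df = pd.concat([df, df2])
print (df)

# Delete rows: drop(0) removes every row labelled 0
df = df.drop(0)
print (df)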


   one  two  three  four
0  1.0    1    NaN   2.0
1  2.0    2    NaN   4.0
2  3.0    3    NaN   6.0
3  NaN    4    NaN   NaN
   one  three  four
0  1.0    NaN   2.0
1  2.0    NaN   4.0
2  3.0    NaN   6.0
3  NaN    NaN   NaN
   one  three  four
1  2.0    NaN   4.0
2  3.0    NaN   6.0
one      1.0
three    NaN
four     2.0
Name: 0, dtype: float64
one      2.0
three    NaN
four     4.0
Name: 1, dtype: float64
one      3.0
three    NaN
four     6.0
Name: 2, dtype: float64
one     NaN
three   NaN
four    NaN
Name: 3, dtype: float64
   one  three  four  two
0  1.0    NaN   2.0  NaN
1  2.0    NaN   4.0  NaN
2  3.0    NaN   6.0  NaN
3  NaN    NaN   NaN  NaN
0  5.0    NaN   NaN  6.0
1  7.0    NaN   NaN  8.0
   one  three  four  two
1  2.0    NaN   4.0  NaN
2  3.0    NaN   6.0  NaN
3  NaN    NaN   NaN  NaN
1  7.0    NaN   NaN  8.0
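
In the last output above, dropping label 0 removed two rows, because the appended rows kept their own labels 0 and 1. A small sketch (with hypothetical frames `base` and `extra`) of how `ignore_index=True` avoids the duplicated labels:

```python
import pandas as pd

base = pd.DataFrame({"one": [1.0, 2.0], "two": [3, 4]})
extra = pd.DataFrame([[5, 6], [7, 8]], columns=["one", "two"])

# Keeping the original labels produces duplicates: 0, 1, 0, 1.
kept = pd.concat([base, extra])
# ignore_index=True renumbers the rows 0..n-1.
renumbered = pd.concat([base, extra], ignore_index=True)

# drop(0) removes every row labelled 0.
print(kept.drop(0).index.tolist())        # [1, 1]
print(renumbered.drop(0).index.tolist())  # [1, 2, 3]
```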


## DataFrame data processing: map, apply, applymap

Sample data

In [60]:
import pandas as pd
import numpy as np

boolean=[True,False]
gender=["男","女"]
color=["white","black","yellow"]
data=pd.DataFrame({
    "height":np.random.randint(150,190,100),
    "weight":np.random.randint(40,90,100),
    "smoker":[boolean[x] for x in np.random.randint(0,2,100)],
    "gender":[gender[x] for x in np.random.randint(0,2,100)],
    "age":np.random.randint(15,90,100),
    "color":[color[x] for x in np.random.randint(0,len(color),100) ]
}
)
data

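|    | height  | weight  | smoker  | gender  | age  | color  |
|  ----  | ----  | ----  | ----  | ----  | ----  | ----  |
| 0  | 158  | 88  | False  | 女  | 75  | yellow |
| 1  | 188  | 42  | True  | 女  | 52  | white |
| 2  | 156  | 55  | False  | 女  | 51  | black |
| 3  | 182  | 57  | True  | 女  | 73  | white |
| 4  | 159  | 69  | True  | 男  | 87  | white |
| ...  | ...  | ...  | ...  | ...  | ...  | ... |
| 95  | 180  | 53  | True  | 女  | 78  | white |
| 96  | 175  | 50  | False  | 女  | 57  | yellow |
| 97  | 183  | 71  | False  | 男  | 35  | yellow |
| 98  | 189  | 60  | False  | 男  | 73  | yellow |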


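### 1. Using map

Suppose the 男 (male) values in the gender column should be replaced with 1 and the 女 (female) values with 0. There is no need for a for loop: Series.map() does this easily, in as little as one line of code.

Whether the mapping is a dict or a function, map passes each value in turn to the dict (as a key) or to the function (as an argument) and collects the mapped values.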

In [4]:
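# ① Map with a dict
data["gender"] = data["gender"].map({"男":1, "女":0})

# ② Map with a function. This is an alternative to ①: run only one of the two,
#    because after ① the column no longer contains "男" and ② would return 0 everywhere.
def gender_map(x):
    gender = 1 if x == "男" else 0
    return gender
# Pass the function itself, without parentheses
data["gender"] = data["gender"].map(gender_map)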
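
The two approaches differ when a value is not covered by the mapping. The sketch below uses a small hypothetical Series with one unexpected value to show this:

```python
import pandas as pd

gender = pd.Series(["男", "女", "unknown"])

# Dict mapping: values without a matching key become NaN.
by_dict = gender.map({"男": 1, "女": 0})

# Function mapping: the function decides what happens to unexpected values.
by_func = gender.map(lambda x: 1 if x == "男" else 0)

print(by_dict.tolist())   # [1.0, 0.0, nan]
print(by_func.tolist())   # [1, 0, 0]
```

Because of the NaN, the dict version yields a float column; the function version silently maps anything that is not 男 to 0.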

### 2. apply

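For a DataFrame, apply is one of the most important data-processing methods. It accepts many kinds of functions (Python built-ins or user-defined) and can work with several columns at once.

With axis=1, apply() calls the function once per row in Python, so the per-call overhead is high on large frames. progress_apply() can be used to monitor progress; it is provided by the tqdm package and becomes available after calling tqdm.pandas().

Note: for DataFrame.apply, axis=0 passes each column to the function, and axis=1 passes each row.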
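
A minimal sketch of the two axis settings on a small made-up frame (`small` is not part of the tutorial's data):

```python
import pandas as pd

small = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0: the function receives each column as a Series.
per_column = small.apply(lambda col: col.max() - col.min(), axis=0)
# axis=1: the function receives each row as a Series.
per_row = small.apply(lambda row: row["a"] + row["b"], axis=1)

print(per_column.to_dict())  # one result per column
print(per_row.tolist())      # one result per row
```

The result of `axis=0` is indexed by column name, while the result of `axis=1` is indexed by row label.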

In [61]:
def BMI(series):
    weight = series["weight"]
    height = series["height"]/100
    BMI = weight/height**2
    return BMI

# Apply row by row (axis=1)
data["BMI"] = data.apply(BMI,axis=1)
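
For simple arithmetic like this, plain column operations are a common alternative to a row-wise apply. The sketch below builds its own random frame (`people`, an assumption for illustration) and checks that both approaches give the same values:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
people = pd.DataFrame({
    "height": rng.integers(150, 190, 100),   # cm
    "weight": rng.integers(40, 90, 100),     # kg
})

# Row-wise apply, as in the cell above.
bmi_apply = people.apply(lambda r: r["weight"] / (r["height"] / 100) ** 2, axis=1)
# Column arithmetic computes the same values without a Python-level loop.
bmi_vectorized = people["weight"] / (people["height"] / 100) ** 2

print(np.allclose(bmi_apply, bmi_vectorized))
```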

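On a very large DataFrame this per-element overhead makes apply() very slow. In the benchmark below, pd.merge is several hundred times faster than apply() (about 1.4 ms versus about 1.1 s).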

http://www.cocoachina.com/articles/63993
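
The idea behind the comparison is to replace one lookup per element with a single join. A tiny sketch with hypothetical tables `lookup` and `records` shows that the approaches return the same values; `Series.map` with an indexed Series is included as a further option that the source does not benchmark:

```python
import pandas as pd

lookup = pd.DataFrame({"ID": ["A", "B", "C"], "value": [10, 20, 30]})
records = pd.DataFrame({"p_id": ["B", "A", "C", "B"]})

# Per-element lookup: one boolean filter of `lookup` per record.
slow = [lookup.loc[lookup["ID"] == s, "value"].values[0] for s in records["p_id"]]

# One left join; how="left" keeps the row order of `records`.
merged = pd.merge(records, lookup, how="left", left_on="p_id", right_on="ID")

# Series.map with an ID-indexed Series is another option.
mapped = records["p_id"].map(lookup.set_index("ID")["value"])

print(slow, merged["value"].tolist(), mapped.tolist())
```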

In [69]:
import string
import numpy as np
import pandas as pd 

def f1(col, p_dict):
    return [p_dict[p_dict['ID'] == s]['value'].values[0] for s in col]

# Testing
n_size = 1000
np.random.seed(997)
p_dict = pd.DataFrame({'ID': [s for s in string.ascii_uppercase], 'value': np.random.randint(0,n_size, 26)})
df = pd.DataFrame({'p_id': [string.ascii_uppercase[i] for i in np.random.randint(0,26, n_size)]})

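# Apply the f1 method as posted
# %timeit does not keep variables assigned inside the timed statement,
# so temp is computed separately before printing.
%timeit -n1 -r5 df.apply(f1, args=(p_dict,))
temp = df.apply(f1, args=(p_dict,))
print(temp)

# Using merge
np.random.seed(997)
df = pd.DataFrame({'p_id': [string.ascii_uppercase[i] for i in np.random.randint(0,26, n_size)]})
%timeit -n1 -r5 pd.merge(df, p_dict, how='inner', left_on='p_id', right_on='ID', copy=False)
temp = pd.merge(df, p_dict, how='inner', left_on='p_id', right_on='ID', copy=False)
print(temp)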

1.07 s ± 96.7 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
1.42 ms ± 233 µs per loop (mean ± std. dev. of 5 runs, 1 loop each)


In [11]:
# Sum along axis 0 (one result per column)
data[["height","weight","age"]].apply(np.sum, axis=0)

height    16720
weight     6441
age        4832
dtype: int64

In [19]:
# Take the natural log column by column (axis 0)
data[["height","weight","age"]] = data[["height","weight","age"]].apply(np.log, axis=0)

|    | height  | weight  | smoker  | gender  | age  | color  | BMI  |
|  ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  |
| 0  | 5.141664  | 4.094345  | True  | 0  | 4.382027  | black  | 20.519134 |
| 1  | 5.036953  | 4.143135  | False  | 0  | 3.931826  | white  | 26.564345 |
| 2  | 5.099866  | 4.304065  | True  | 0  | 4.430817  | white  | 27.513385 |
| 3  | 5.117994  | 4.430817  | False  | 0  | 4.174387  | black  | 30.119402 |
| 4  | 5.111988  | 3.912023  | False  | 0  | 4.454347  | black  | 18.144869 |
| ...  | ...  | ...  | ...  | ...  | ...  | ...  | ... |
| 95  | 5.093750  | 4.356709  | False  | 0  | 4.110874  | white  | 29.357522 |
| 96  | 5.236442  | 4.262680  | False  | 0  | 3.178054  | black  | 20.088275 |
| 97  | 5.164786  | 4.110874  | False  | 0  | 3.178054  | white  | 19.918367 |
| 98  | 5.036953  | 4.174387  | True  | 0  | 3.258097  | black  | 27.407657 |
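
The two cells above pass NumPy functions to apply. For reductions and element-wise math like these, the built-in DataFrame methods and NumPy ufuncs can usually be called on the frame directly. A sketch on a small made-up frame (`nums`):

```python
import numpy as np
import pandas as pd

nums = pd.DataFrame({"height": [170, 180], "weight": [60, 75], "age": [30, 40]})

via_apply = nums.apply(np.sum, axis=0)
via_method = nums.sum(axis=0)      # built-in column-wise reduction
logged = np.log(nums)              # NumPy ufuncs work on the whole frame

print((via_apply == via_method).all())
print(logged.shape == nums.shape)
```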


In [70]:
# def xx(a,b):
#     return a+1, b+1

# a,b = zip(*data.apply(lambda row: xx(row['height'], row['weight']), axis=1))

from tqdm import tqdm
tqdm.pandas(desc="apply")  # registers progress_apply on pandas objects

def xx(a):
    return a+1

print(type(data[['height', 'weight']]))
data[['height', 'weight']] = data[['height', 'weight']].progress_apply(xx, axis=1)
data

apply: 100%|███████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 2482.26it/s]

<class 'pandas.core.frame.DataFrame'>





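|    | height  | weight  | smoker  | gender  | age  | color  | BMI  |
|  ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  |
| 0  | 159  | 89  | False  | 女  | 75  | yellow  | 35.250761 |
| 1  | 189  | 43  | True  | 女  | 52  | white  | 11.883205 |
| 2  | 157  | 56  | False  | 女  | 51  | black  | 22.600263 |
| 3  | 183  | 58  | True  | 女  | 73  | white  | 17.208067 |
| 4  | 160  | 70  | True  | 男  | 87  | white  | 27.293224 |
| ...  | ...  | ...  | ...  | ...  | ...  | ...  | ... |
| 95  | 181  | 54  | True  | 女  | 78  | white  | 16.358025 |
| 96  | 176  | 51  | False  | 女  | 57  | yellow  | 16.326531 |
| 97  | 184  | 72  | False  | 男  | 35  | yellow  | 21.200991 |
| 98  | 190  | 61  | False  | 男  | 73  | yellow  | 16.796842 |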


### 3. applymap

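applymap() is the DataFrame counterpart of Series.map(). It takes a function (unlike Series.map(), it does not accept a dict), applies it to every element of the DataFrame, and returns a result with the same shape as the original DataFrame. Since pandas 2.1, applymap() is deprecated in favor of DataFrame.map().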


In [46]:
def lowerx(x):
    if isinstance(x, str):
        return x.lower() + '_'
    else:
        return x

data.applymap(lowerx)

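|    | height  | weight  | smoker  | gender  | age  | color  |
|  ----  | ----  | ----  | ----  | ----  | ----  | ----  |
| 0  | 183  | 52  | False  | 女_  | 72  | yellow_ |
| 1  | 181  | 76  | False  | 男_  | 58  | black_ |
| 2  | 185  | 80  | True  | 女_  | 34  | yellow_ |
| 3  | 193  | 90  | False  | 女_  | 50  | white_ |
| 4  | 184  | 57  | False  | 男_  | 64  | black_ |
| ...  | ...  | ...  | ...  | ...  | ...  | ... |
| 95  | 173  | 91  | True  | 女_  | 61  | black_ |
| 96  | 168  | 72  | False  | 男_  | 46  | white_ |
| 97  | 162  | 52  | True  | 女_  | 39  | white_ |
| 98  | 179  | 54  | False  | 女_  | 54  | black_ |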
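
On pandas 2.1 and later the same element-wise operation is spelled `DataFrame.map`. The sketch below picks whichever method is available, using a small hypothetical frame (`mixed`) and helper (`lower_str`):

```python
import pandas as pd

mixed = pd.DataFrame({"color": ["White", "BLACK"], "age": [30, 41]})

def lower_str(x):
    # Lower-case strings; leave every other value unchanged.
    return x.lower() if isinstance(x, str) else x

# DataFrame.map replaces applymap from pandas 2.1 on; fall back on older versions.
elementwise = mixed.map if hasattr(mixed, "map") else mixed.applymap
result = elementwise(lower_str)

print(result["color"].tolist())  # ['white', 'black']
print(result.shape == mixed.shape)
```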


## References

https://zhuanlan.zhihu.com/p/100064394

http://www.360doc.com/content/20/0202/23/7669533_889336554.shtml