# The pandas DataFrame object

A DataFrame is one of the most common data structures in data analysis. It holds two-dimensional data and has the following characteristics:

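+ Columns can have different types
+ The size is mutable
+ Both axes (rows and columns) are labeled
+ Arithmetic operations can be applied to rows and columns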

The DataFrame is a very important data structure in data science, and almost all work with pandas builds on it.

In [1]:
import pandas as pd
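
The characteristics listed above can be checked on a tiny frame. This is only a minimal sketch; the column names, row labels and values are made up for illustration.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the properties listed above.
df = pd.DataFrame(
    {"name": ["a", "b", "c"], "height": [1.5, 1.6, 1.7], "count": [1, 2, 3]},
    index=["r1", "r2", "r3"],  # labeled row axis
)

# Columns may have different types.
print(df.dtypes)

# The size is mutable: adding a column changes the shape from (3, 3) to (3, 4).
df["double_count"] = df["count"] * 2  # arithmetic on a whole column
print(df.shape)

# Arithmetic along rows: sum the numeric columns of each row.
row_totals = df[["height", "count", "double_count"]].sum(axis=1)
print(row_totals)
```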

## Constructing a DataFrame


A DataFrame can be constructed from Python objects in the following ways.

### Converting from a two-dimensional list

pandas can arrange a two-dimensional list (a list of rows) as a table.

In [2]:
names = ['Bob','Jessica','Mary','John','Mel']
births = [1968, 1955, 1977,1978, 1973]
weight = [69,89,76,90,78]
table_o = list(zip(names,births,weight))
table_o

[('Bob', 1968, 69),
 ('Jessica', 1955, 89),
 ('Mary', 1977, 76),
 ('John', 1978, 90),
 ('Mel', 1973, 78)]

In [3]:
pd.DataFrame(table_o,columns =["name","births","weight"])  # columns sets the column labels

,name,births,weight
0,Bob,1968,69
1,Jessica,1955,89
2,Mary,1977,76
3,John,1978,90
4,Mel,1973,78


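### Directly from a dictionary

pandas also accepts the data as a dictionary: each key becomes a column header, and the values under that key fill the column in order.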

In [4]:
table_dict = {"names":['Bob','Jessica','Mary','John','Mel'],
    "births":[1968, 1955, 1977,1978, 1973],
    "weight":[69,89,76,90,78]
}
table_dict

{'names': ['Bob', 'Jessica', 'Mary', 'John', 'Mel'],
 'births': [1968, 1955, 1977, 1978, 1973],
 'weight': [69, 89, 76, 90, 78]}

In [5]:
pd.DataFrame(table_dict)

,names,births,weight
0,Bob,1968,69
1,Jessica,1955,89
2,Mary,1977,76
3,John,1978,90
4,Mel,1973,78


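### From a list of dictionaries

Another way is to build the table row by row: each dictionary is one row. The output below comes from an older pandas version, which sorted the columns alphabetically; recent versions keep the keys in insertion order (`names`, `births`, `weight`).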

In [6]:
table_row = [{"names":'Bob',
    "births":1968,
    "weight":69
},
{"names":'Jessica',
    "births":1955,
    "weight":89
},
{"names":'Mary',
    "births":1977,
    "weight":76
},
{"names":'John',
    "births":1978,
    "weight":90
},
{"names":'Mel',
    "births": 1973,
    "weight":78
}]
table_row

[{'names': 'Bob', 'births': 1968, 'weight': 69},
 {'names': 'Jessica', 'births': 1955, 'weight': 89},
 {'names': 'Mary', 'births': 1977, 'weight': 76},
 {'names': 'John', 'births': 1978, 'weight': 90},
 {'names': 'Mel', 'births': 1973, 'weight': 78}]

In [7]:
pd.DataFrame(table_row)

,births,names,weight
0,1968,Bob,69
1,1955,Jessica,89
2,1977,Mary,76
3,1978,John,90
4,1973,Mel,78
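
The three construction routes above describe the same table, which can be checked directly. The sketch below uses a shortened copy of the data and the variable names are only illustrative; the columns of the row-dictionary version are reordered explicitly so the comparison does not depend on how a particular pandas version orders dictionary keys.

```python
import pandas as pd

cols = ["names", "births", "weight"]
names = ["Bob", "Jessica", "Mary"]
births = [1968, 1955, 1977]
weight = [69, 89, 76]

# 1. From a list of row tuples, with explicit column labels.
df_from_rows = pd.DataFrame(list(zip(names, births, weight)), columns=cols)

# 2. From a dictionary of columns.
df_from_dict = pd.DataFrame({"names": names, "births": births, "weight": weight})

# 3. From a list of per-row dictionaries, with the column order fixed explicitly.
records = [dict(zip(cols, row)) for row in zip(names, births, weight)]
df_from_records = pd.DataFrame(records)[cols]

# equals() compares shape, labels, dtypes and values.
print(df_from_rows.equals(df_from_dict))
print(df_from_rows.equals(df_from_records))
```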


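## Basic DataFrame operations

We use the well-known [iris]() dataset as an example. For how to read external data, see the [Data acquisition and saving] section.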

In [8]:
import pandas as pd
iris_data = pd.read_csv("source/iris.csv")
iris_data[:5]

,sepal_length,sepal_width,petal_length,petal_width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


### Filtering data

Much like filtering a pivot table, we can view only the rows that satisfy a condition.

In [9]:
iris_data[iris_data["class"]=="Iris-virginica"][::10]

,sepal_length,sepal_width,petal_length,petal_width,class
100,6.3,3.3,6.0,2.5,Iris-virginica
110,6.5,3.2,5.1,2.0,Iris-virginica
120,6.9,3.2,5.7,2.3,Iris-virginica
130,7.4,2.8,6.1,1.9,Iris-virginica
140,6.7,3.1,5.6,2.4,Iris-virginica


In [10]:
iris_data[iris_data["petal_width"]>iris_data["petal_width"].mean()][::10]

,sepal_length,sepal_width,petal_length,petal_width,class
50,7.0,3.2,4.7,1.4,Iris-versicolor
63,6.1,2.9,4.7,1.4,Iris-versicolor
75,6.6,3.0,4.4,1.4,Iris-versicolor
88,5.6,3.0,4.1,1.3,Iris-versicolor
100,6.3,3.3,6.0,2.5,Iris-virginica
110,6.5,3.2,5.1,2.0,Iris-virginica
120,6.9,3.2,5.7,2.3,Iris-virginica
130,7.4,2.8,6.1,1.9,Iris-virginica
140,6.7,3.1,5.6,2.4,Iris-virginica
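
Both filters above index the frame with a boolean Series. Such masks can also be combined. The following is a minimal sketch on a small made-up sample whose column names mirror the iris data; the values are not taken from the real file.

```python
import pandas as pd

# A small hypothetical sample; only the column names mirror the iris data.
sample = pd.DataFrame({
    "petal_width": [0.2, 1.4, 2.5, 1.3, 2.0],
    "class": ["Iris-setosa", "Iris-versicolor", "Iris-virginica",
              "Iris-versicolor", "Iris-virginica"],
})

# A comparison yields a boolean Series; indexing with it keeps the True rows.
is_wide = sample["petal_width"] > sample["petal_width"].mean()
is_versicolor = sample["class"] == "Iris-versicolor"

# Combine masks with & (and), | (or) and ~ (not). When the comparisons are
# written inline, wrap each one in parentheses.
wide_or_versicolor = sample[is_wide | is_versicolor]
narrow_versicolor = sample[~is_wide & is_versicolor]

print(wide_or_versicolor)
print(narrow_versicolor)
```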


### Sorting (sort)

For example, we sort by `sepal_length` in descending order and keep the first five rows.

In [11]:
biggest5_sl_iris = iris_data.sort_values('sepal_length',ascending=False)[:5]

In [12]:
biggest5_sl_iris

,sepal_length,sepal_width,petal_length,petal_width,class
131,7.9,3.8,6.4,2.0,Iris-virginica
135,7.7,3.0,6.1,2.3,Iris-virginica
122,7.7,2.8,6.7,2.0,Iris-virginica
117,7.7,3.8,6.7,2.2,Iris-virginica
118,7.7,2.6,6.9,2.3,Iris-virginica


Then sort the result by its index, again in descending order.

In [13]:
biggest5_sl_iris.sort_index(ascending=False)

,sepal_length,sepal_width,petal_length,petal_width,class
135,7.7,3.0,6.1,2.3,Iris-virginica
131,7.9,3.8,6.4,2.0,Iris-virginica
122,7.7,2.8,6.7,2.0,Iris-virginica
118,7.7,2.6,6.9,2.3,Iris-virginica
117,7.7,3.8,6.7,2.2,Iris-virginica
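
`sort_values` also accepts several sort keys, and `nlargest` is a shorter way to get the "top n" rows. A minimal sketch on made-up values; the column names follow the iris data.

```python
import pandas as pd

# Hypothetical values; only the column names follow the iris data.
sample = pd.DataFrame({
    "sepal_length": [7.7, 7.9, 7.7, 6.3],
    "sepal_width": [3.0, 3.8, 2.6, 3.3],
})

# Sort by sepal_length descending, breaking ties by sepal_width ascending.
by_two_keys = sample.sort_values(
    ["sepal_length", "sepal_width"], ascending=[False, True]
)

# nlargest keeps the n rows with the largest values in the given column.
top2 = sample.nlargest(2, "sepal_length")

print(by_two_keys)
print(top2)
```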


### Ranking (rank)

In [14]:
biggest5_sl_iris.rank(method="min",numeric_only = True)

,sepal_length,sepal_width,petal_length,petal_width
131,5.0,4.0,2.0,1.0
135,1.0,3.0,1.0,4.0
122,1.0,2.0,3.0,1.0
117,1.0,4.0,3.0,3.0
118,1.0,1.0,5.0,4.0
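
In the output above, `method="min"` gives tied values the lowest rank of their group, which is why the four rows with `sepal_length` 7.7 all get rank 1.0 and the next rank is 5.0. The sketch below compares three tie-breaking methods on a made-up Series.

```python
import pandas as pd

# Hypothetical values containing one tie (7.7 appears twice).
values = pd.Series([7.9, 7.7, 7.7, 6.3])

ranks = pd.DataFrame({
    "value": values,
    "average": values.rank(method="average"),  # default: ties share the mean rank
    "min": values.rank(method="min"),          # ties share the lowest rank
    "dense": values.rank(method="dense"),      # like min, but no gap after a tie
})
print(ranks)
```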


### Selection and slicing

Slicing extracts exactly the rows you need, and pandas supports several ways of doing it.

#### Slicing with a step

In [15]:
iris_data[::20]  # take every 20th row

,sepal_length,sepal_width,petal_length,petal_width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
20,5.4,3.4,1.7,0.2,Iris-setosa
40,5.0,3.5,1.3,0.3,Iris-setosa
60,5.0,2.0,3.5,1.0,Iris-versicolor
80,5.5,2.4,3.8,1.1,Iris-versicolor
100,6.3,3.3,6.0,2.5,Iris-virginica
120,6.9,3.2,5.7,2.3,Iris-virginica
140,6.7,3.1,5.6,2.4,Iris-virginica


#### Extracting a contiguous range

In [16]:
iris_data[5:10]  # rows at positions 5 through 9 (the end position is excluded)

,sepal_length,sepal_width,petal_length,petal_width,class
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa
8,4.4,2.9,1.4,0.2,Iris-setosa
9,4.9,3.1,1.5,0.1,Iris-setosa


#### Extracting a single row

In [17]:
iris_data.loc[5]  # the row whose index label is 5

sepal_length            5.4
sepal_width             3.9
petal_length            1.7
petal_width             0.4
class           Iris-setosa
Name: 5, dtype: object

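### Projection

Projection means much the same as in a database: selecting columns (attributes). The simplest way is to put the label of the column you want, or a list of labels, inside `[]`.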

In [18]:
iris_data["sepal_length"][:5]  # select one column

0    5.1
1    4.9
2    4.7
3    4.6
4    5.0
Name: sepal_length, dtype: float64

In [19]:
iris_data.sepal_length[:5]  # attribute access selects the same column

0    5.1
1    4.9
2    4.7
3    4.6
4    5.0
Name: sepal_length, dtype: float64

In [20]:
iris_data[["sepal_length","petal_width"]][:5]  # select two columns

,sepal_length,petal_width
0,5.1,0.2
1,4.9,0.2
2,4.7,0.2
3,4.6,0.2
4,5.0,0.2
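
The outputs above differ in type: a single label returns a Series, while a list of labels returns a DataFrame. A minimal sketch on a two-row made-up frame with iris-style column names:

```python
import pandas as pd

# Hypothetical two-row frame with iris-style column names.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9],
    "petal_width": [0.2, 0.2],
    "class": ["Iris-setosa", "Iris-setosa"],
})

one_column = df["sepal_length"]          # a single label returns a Series
one_column_frame = df[["sepal_length"]]  # a list of labels returns a DataFrame

print(type(one_column).__name__)
print(type(one_column_frame).__name__)

# Attribute access only works for names that are valid Python identifiers and
# do not clash with DataFrame attributes. "class" is a Python keyword, so that
# column has to be selected with [].
print(df["class"].tolist())
```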


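### **iloc**: position-based indexing

`iloc` looks values up directly by integer position: the first argument is the row, the second is the column.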

In [21]:
iris_data.iloc[5]  # the row at position 5 (the sixth row)

sepal_length            5.4
sepal_width             3.9
petal_length            1.7
petal_width             0.4
class           Iris-setosa
Name: 5, dtype: object

In [22]:
iris_data.iloc[0,2:4]  # the third and fourth values of the first row

petal_length    1.4
petal_width     0.2
Name: 0, dtype: object
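
On `iris_data` the default index runs 0, 1, 2 and so on, so `loc[5]` and `iloc[5]` return the same row. The two differ as soon as labels and positions diverge. A minimal sketch with a deliberately made-up index:

```python
import pandas as pd

df = pd.DataFrame(
    {"sepal_length": [5.1, 7.9, 6.3]},
    index=[10, 20, 30],  # labels deliberately differ from positions 0, 1, 2
)

by_label = df.loc[20, "sepal_length"]  # the row labeled 20
by_position = df.iloc[2, 0]            # third row, first column

# Slices differ too: loc includes the end label, iloc excludes the end position.
label_slice = df.loc[10:20]
position_slice = df.iloc[0:1]

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```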

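### Adding a column

To add a column, index the DataFrame with `[]` using the new column name and assign the new values. Note that this modifies the original DataFrame in place; to keep the original unchanged, copy it first and add the column to the copy.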

In [23]:
people_fromExcel = pd.read_excel('./source/people.xlsx', sheet_name='工作表1', index_col=None, na_values=['NA'])

# Append one row with pd.concat (DataFrame.append was removed in pandas 2.0)
people_Data = pd.concat([people_fromExcel, pd.DataFrame([["Hao",24]],columns = ["name","age"])], ignore_index=True)
people_Data

,name,age
0,Michael,
1,Andy,30.0
2,Justin,19.0
3,Hao,24.0


In [24]:
people_Data["nation"] = ["USA","UK","AUS","PRC"]

In [25]:
people_Data

,name,age,nation
0,Michael,,USA
1,Andy,30.0,UK
2,Justin,19.0,AUS
3,Hao,24.0,PRC


You can also assign a single value, which is then repeated in every row of the new column.

In [26]:
people_Data[u"星球"] = u"地球"
people_Data

,name,age,nation,星球
0,Michael,,USA,地球
1,Andy,30.0,UK,地球
2,Justin,19.0,AUS,地球
3,Hao,24.0,PRC,地球
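
If the original frame should stay unchanged, one option is to copy it first, as mentioned above; another is `DataFrame.assign`, which returns a new frame. A minimal sketch on a made-up two-row frame (the names `people`, `with_nation` and `with_planet` are only illustrative):

```python
import pandas as pd

# Hypothetical data standing in for the people table.
people = pd.DataFrame({"name": ["Andy", "Justin"], "age": [30, 19]})

# Option 1: copy first, then add the column to the copy.
with_nation = people.copy()
with_nation["nation"] = ["UK", "AUS"]

# Option 2: assign returns a new DataFrame and leaves the original untouched.
with_planet = people.assign(planet="Earth")  # a scalar fills every row

print(list(people.columns))
print(list(with_nation.columns))
print(list(with_planet.columns))
```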
