# pandas Basics

## DataFrame

### Import pandas

In [144]:
import pandas as pd

### Load the data

In [145]:
# Tab-separated file; forward slashes keep the path portable across operating systems
df = pd.read_csv('data/gapminder.tsv', sep='\t')
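
As a rough, self-contained illustration of what sep='\t' does, the sketch below parses a small tab-separated string instead of the real file. The names `sample_tsv` and `sample` are made up for this example, and the two data rows are copied from the head output shown in this notebook.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for the gapminder file (two rows only).
sample_tsv = (
    "country\tcontinent\tyear\tlifeExp\tpop\tgdpPercap\n"
    "Afghanistan\tAsia\t1952\t28.801\t8425333\t779.445314\n"
    "Afghanistan\tAsia\t1957\t30.332\t9240934\t820.85303\n"
)

# sep="\t" tells read_csv to split fields on tabs rather than commas.
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
print(sample.shape)
print(list(sample.columns))
```

Reading from a string buffer this way is also handy for trying out parser options before pointing read_csv at a large file.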

### Peek at the contents

In [146]:
df.head(3)

|   | country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 |
| 1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.85303 |
| 2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.10071 |


### DataFrame shape

In [147]:
df.shape

(1704, 6)
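
The tuple above reads as (rows, columns). A minimal sketch on a made-up frame (`toy` is not part of the original notebook) shows how it might be unpacked:

```python
import pandas as pd

# Tiny made-up frame: 3 rows, 2 columns.
toy = pd.DataFrame({"country": ["A", "B", "C"], "year": [1952, 1957, 1962]})

n_rows, n_cols = toy.shape   # shape is an attribute, not a method call
print(n_rows, n_cols)
print(len(toy))              # len() gives the number of rows
```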

### DataFrame type

In [148]:
type(df)

pandas.core.frame.DataFrame

### Column names

In [149]:
df.columns

Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')

#### Column meanings

- country: country
- continent: continent
- year: year
- lifeExp: life expectancy
- pop: population
- gdpPercap: GDP per capita

In [150]:
type(df.columns)

pandas.core.indexes.base.Index

### Column data types

In [151]:
df.dtypes

country       object
continent     object
year           int64
lifeExp      float64
pop            int64
gdpPercap    float64
dtype: object

### DataFrame summary

In [152]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    1704 non-null   object 
 1   continent  1704 non-null   object 
 2   year       1704 non-null   int64  
 3   lifeExp    1704 non-null   float64
 4   pop        1704 non-null   int64  
 5   gdpPercap  1704 non-null   float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB


## Columns and rows

### A single column

In [153]:
df['country']

0       Afghanistan
1       Afghanistan
2       Afghanistan
3       Afghanistan
4       Afghanistan
           ...     
1699       Zimbabwe
1700       Zimbabwe
1701       Zimbabwe
1702       Zimbabwe
1703       Zimbabwe
Name: country, Length: 1704, dtype: object

#### Type of a single column

In [154]:
type(df['country'])

pandas.core.series.Series
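
Whether a selection comes back as a Series or a DataFrame depends on the brackets used. The sketch below uses a made-up two-row frame (`toy`) to contrast the two forms:

```python
import pandas as pd

toy = pd.DataFrame({"country": ["A", "B"], "year": [1952, 1957]})

as_series = toy["country"]     # single brackets with a name -> Series
as_frame = toy[["country"]]    # list containing one name -> DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
```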

#### Use head to view the first 5 records of a single column

In [155]:
df['country'].head()

0    Afghanistan
1    Afghanistan
2    Afghanistan
3    Afghanistan
4    Afghanistan
Name: country, dtype: object

#### Use head and tail to view both ends of a single column

In [156]:
[df['country'].head(), df['country'].tail()]

[0    Afghanistan
 1    Afghanistan
 2    Afghanistan
 3    Afghanistan
 4    Afghanistan
 Name: country, dtype: object,
 1699    Zimbabwe
 1700    Zimbabwe
 1701    Zimbabwe
 1702    Zimbabwe
 1703    Zimbabwe
 Name: country, dtype: object]

In [157]:
type([df['country'].head(), df['country'].tail()])

list
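
Wrapping the two results in a Python list only displays them side by side. If a single Series holding both ends is wanted, one possible approach is pd.concat. The sketch below uses a made-up 20-element Series (`col`) rather than the gapminder column:

```python
import pandas as pd

# Made-up stand-in for a long column.
col = pd.Series([f"country_{i}" for i in range(20)], name="country")

# Concatenate the first and last 3 values into one Series;
# the original index labels are preserved.
ends = pd.concat([col.head(3), col.tail(3)])
print(ends)
```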

### Multiple columns

In [158]:
df[ ['country', 'year', 'pop', 'gdpPercap'] ]

|   | country | year | pop | gdpPercap |
|---|---|---|---|---|
| 0 | Afghanistan | 1952 | 8425333 | 779.445314 |
| 1 | Afghanistan | 1957 | 9240934 | 820.853030 |
| 2 | Afghanistan | 1962 | 10267083 | 853.100710 |
| 3 | Afghanistan | 1967 | 11537966 | 836.197138 |
| 4 | Afghanistan | 1972 | 13079460 | 739.981106 |
| ... | ... | ... | ... | ... |
| 1699 | Zimbabwe | 1987 | 9216418 | 706.157306 |
| 1700 | Zimbabwe | 1992 | 10704340 | 693.420786 |
| 1701 | Zimbabwe | 1997 | 11404948 | 792.449960 |
| 1702 | Zimbabwe | 2002 | 11926563 | 672.038623 |


### A deprecated way to select multiple columns

In [159]:
# df[list(range(5))]

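Since pandas 0.21.0, building a subset of the DataFrame from a list of column positions in this way is no longer supported (deprecated): integers passed inside the brackets are treated as column labels, not positions.
Use iloc to slice a DataFrame by the integer positions of its rows or columns.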
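
To make the failure mode concrete, here is a small sketch on a made-up frame (`toy`) with string column names. It assumes a recent pandas version, where integer lists inside the brackets are looked up as labels and raise KeyError when no such labels exist:

```python
import pandas as pd

toy = pd.DataFrame({"country": ["A", "B"], "year": [1952, 1957], "pop": [10, 20]})

# Integers inside [] are treated as column labels, and this frame has none.
try:
    toy[[0, 1]]
    raised = False
except KeyError:
    raised = True
print(raised)

# Position-based selection goes through iloc instead.
first_two = toy.iloc[:, [0, 1]]
print(list(first_two.columns))
```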

#### Correct usage

In [160]:
df.iloc[:,list(range(5))]

|   | country | continent | year | lifeExp | pop |
|---|---|---|---|---|---|
| 0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 |
| 1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 |
| 2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 |
| 3 | Afghanistan | Asia | 1967 | 34.020 | 11537966 |
| 4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 |
| ... | ... | ... | ... | ... | ... |
| 1699 | Zimbabwe | Africa | 1987 | 62.351 | 9216418 |
| 1700 | Zimbabwe | Africa | 1992 | 60.377 | 10704340 |
| 1701 | Zimbabwe | Africa | 1997 | 46.809 | 11404948 |
| 1702 | Zimbabwe | Africa | 2002 | 39.989 | 11926563 |


#### Extending the idea

In [161]:
df.iloc[list(range(0, 5, 2)),list(range(0, 5, 2))]

|   | country | year | pop |
|---|---|---|---|
| 0 | Afghanistan | 1952 | 8425333 |
| 2 | Afghanistan | 1962 | 10267083 |
| 4 | Afghanistan | 1972 | 13079460 |
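
The list(range(...)) arguments can also be written as plain slices, which iloc accepts directly. The sketch below checks the equivalence on a made-up 6x6 frame (`toy`); the column names are placeholders:

```python
import pandas as pd

# Made-up frame with six columns named a..f and six rows.
toy = pd.DataFrame({name: range(6) for name in ["a", "b", "c", "d", "e", "f"]})

with_lists = toy.iloc[list(range(0, 5, 2)), list(range(0, 5, 2))]
with_slices = toy.iloc[0:5:2, 0:5:2]   # same rows and columns, written as slices

print(with_slices)
print(with_lists.equals(with_slices))
```

The slice form is shorter and avoids building intermediate lists, while the list form is needed when the positions are irregular.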


### A single row

#### A single row as a Series

In [162]:
df.loc[0]

country      Afghanistan
continent           Asia
year                1952
lifeExp           28.801
pop              8425333
gdpPercap        779.445
Name: 0, dtype: object

#### A single row as a DataFrame

In [163]:
df.loc[[0]]

|   | country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 |


#### Comparing the types

In [164]:
print('type of df.loc[0]: {}, type of df.loc[[0]]: {}.'.format(type(df.loc[0]), type(df.loc[[0]])))

type of df.loc[0]: <class 'pandas.core.series.Series'>, type of df.loc[[0]]: <class 'pandas.core.frame.DataFrame'>.


### Multiple rows

#### Select multiple rows by index label

In [165]:
df.loc[[0, 1, 7]]

|   | country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 |
| 1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.85303 |
| 7 | Afghanistan | Asia | 1987 | 40.822 | 13867957 | 852.395945 |


#### Select the last row by index label (fragile: it only works when the index labels equal the zero-based row positions)

In [166]:
df.loc[[ df.shape[0]-1 ]]

|   | country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|---|
| 1703 | Zimbabwe | Africa | 2007 | 43.487 | 12311143 | 469.709298 |


#### A more robust way to get the last row

In [167]:
df.tail(1)

|   | country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|---|
| 1703 | Zimbabwe | Africa | 2007 | 43.487 | 12311143 | 469.709298 |
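
To see why the label-based version is fragile, consider a frame whose labels no longer start at 0. In the sketch below (`toy` and `shifted` are made-up names), looking up the label shape[0]-1 silently returns the wrong row, while tail(1) and iloc[[-1]] still return the last one:

```python
import pandas as pd

toy = pd.DataFrame(
    {"country": ["A", "B", "C", "D"], "year": [1952, 1957, 1962, 1967]}
)

shifted = toy.iloc[1:]                           # labels are now 1, 2, 3

by_label = shifted.loc[[shifted.shape[0] - 1]]   # looks up label 2, not the last row
by_tail = shifted.tail(1)                        # last row by position
by_iloc = shifted.iloc[[-1]]                     # same row, via iloc

print(by_label["country"].tolist())
print(by_tail["country"].tolist())
print(by_iloc["country"].tolist())
```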


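### Differences between loc[ ], iloc[ ] and ix[ ]
 - loc[]: slices the DataFrame by row or column index labels;
 - iloc[]: slices the DataFrame by row or column integer positions;
 - ix[]: looks rows or columns up by index label first, falls back to integer position if the label is not found, then slices the DataFrame. **Note: ix was deprecated in 0.20.0 and removed in 1.0.0.**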
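
The difference between labels and positions only becomes visible when the two disagree. The following sketch builds a made-up frame (`toy`) with a shuffled integer index so that loc and iloc return different rows for the same number:

```python
import pandas as pd

# Made-up frame whose index labels differ from the row positions.
toy = pd.DataFrame(
    {"country": ["A", "B", "C"], "pop": [10, 20, 30]},
    index=[2, 0, 1],
)

by_label = toy.loc[0, "country"]    # label 0 is the second row
by_position = toy.iloc[0, 0]        # position 0 is the first row

print(by_label)
print(by_position)
```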

#### Get the value of a specific cell

In [168]:
df.loc[42,'country']

'Angola'

In [169]:
df.loc[[42],['country']]

|   | country |
|---|---|
| 42 | Angola |
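
The two cells above differ only in whether scalars or lists are passed to loc. A small sketch on a made-up frame (`toy`) shows the resulting types. It also uses the at accessor, which is not used elsewhere in this notebook and is included here only as an optional alternative for scalar lookups:

```python
import pandas as pd

toy = pd.DataFrame({"country": ["A", "B"], "pop": [10, 20]})

scalar = toy.loc[1, "country"]        # scalar labels -> plain value
frame = toy.loc[[1], ["country"]]     # list labels -> 1x1 DataFrame
also_scalar = toy.at[1, "country"]    # .at is a scalar-only label accessor

print(scalar, frame.shape, also_scalar)
```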


#### Row and column slicing with loc

In [170]:
df.loc[[40,41,42],['country','gdpPercap']]

|   | country | gdpPercap |
|---|---|---|
| 40 | Angola | 5473.288005 |
| 41 | Angola | 3008.647355 |
| 42 | Angola | 2756.953672 |


#### Row and column slicing with iloc

In [171]:
df.iloc[[40,41,42],[0,5]]

|   | country | gdpPercap |
|---|---|---|
| 40 | Angola | 5473.288005 |
| 41 | Angola | 3008.647355 |
| 42 | Angola | 2756.953672 |


#### Row and column slicing with ix (fails)

In [172]:
# df.ix[[40,41,42],[0,5]]  # raises AttributeError on pandas >= 1.0.0, where ix was removed
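
Code that relied on ix to mix row positions with column names can usually be rewritten with iloc or loc. The sketch below shows two possible translations on a made-up frame (`toy`); the helper variable `col_positions` is introduced only for this example:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "country": ["A", "B", "C"],
        "continent": ["X", "Y", "Z"],
        "gdpPercap": [1.0, 2.0, 3.0],
    }
)

# Rows by position, columns by name: convert the names to positions for iloc.
col_positions = toy.columns.get_indexer(["country", "gdpPercap"])
by_iloc = toy.iloc[[0, 2], col_positions]

# Or convert the row positions to labels and use loc.
by_loc = toy.loc[toy.index[[0, 2]], ["country", "gdpPercap"]]

print(by_iloc)
print(by_iloc.equals(by_loc))
```

Either form makes it explicit which axis is addressed by label and which by position, which is the ambiguity that ix left implicit.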