# Data Analysis Lab: Pandas

## Introduction to Pandas

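pandas is a tool built on NumPy, created to solve data analysis tasks.

It incorporates a large number of libraries and several standard data models, and provides the tools needed to work efficiently with large datasets.

pandas offers many functions and methods for processing data quickly and conveniently.

You will soon find that it is one of the main reasons Python is a powerful and efficient environment for data analysis.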



- Pandas is a powerful NumPy-based toolset for analyzing structured data
    - Pandas official website: https://pandas.pydata.org
    - Pandas documentation in Chinese: https://www.pypandas.cn
- From the NumPy ndarray to the Pandas Series / DataFrame

    - (Numpy) 1-dimensional array => `Series` (Pandas)

    - (Numpy) 2-dimensional array => `DataFrame` (Pandas)
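The mapping above can be sketched in a few lines. This is only an illustration: the array contents and the column names `x` and `y` are made up for the example.

```python
import numpy as np
import pandas as pd

arr1d = np.array([1.5, 2.5, 3.5])
arr2d = np.array([[1, 2], [3, 4], [5, 6]])

# Wrapping a 1-D array gives a Series with a default integer index
ser = pd.Series(arr1d)

# Wrapping a 2-D array gives a DataFrame; the column names are hypothetical
frame = pd.DataFrame(arr2d, columns=["x", "y"])

print(ser.shape, frame.shape)
print(frame.to_numpy().tolist())  # back to plain nested lists
```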

![](https://vip2.loli.io/2023/11/30/fk7qVSGTWMLp6eI.jpg)



## Series vs DataFrame


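- A `Series` is a one-dimensional array-like object made up of a set of data (of any NumPy data type) and an associated set of data labels (the index). A simple Series can also be created from the data alone.

- A `DataFrame` is the tabular data structure in Pandas. It contains an ordered set of columns, each of which may hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and can be viewed as a dict of Series.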

![](https://i.loli.net/2019/07/28/5d3c78e7811be88759.png)


## Excel vs DataFrame


![](https://i.loli.net/2019/07/28/5d3c7ad83735c55921.png)


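Key points: Pandas is a Python library dedicated to data analysis
* a Python library for data analysis and data processing
* built on numpy (the scientific computing library for "vectorized" operations on data)
* feels like driving Excel/SQL from Python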

## Lab Exercises

> <time>60 min</time>

### Setting Up the Environment


- Import the numpy and pandas libraries

In [1]:
import numpy as np
import pandas as pd

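### The `Series` Data Structure

#### Building and initializing a `Series`

A Series is a one-dimensional array-like object made up of a set of data (of any NumPy data type) and an associated set of data labels (the index). A simple Series can also be created from the data alone.

Build a Series, then print it to inspect its structure and type information.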

In [2]:
s = pd.Series([7, 'GW170817', 3.14, -123, "GW150914"])

print(s)

0           7
1    GW170817
2        3.14
3        -123
4    GW150914
dtype: object


In [3]:
print(type(s))

<class 'pandas.core.series.Series'>


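From the printed structure we can also see that pandas uses 0 to n-1 as the default index of a Series, but we can specify our own index as well. The index is analogous to the keys of a dict.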

In [4]:
s = pd.Series([7, 'GW170817', 3.14, -123, "GW150914"],
              index=['A','B','C','D','E'])

print(s)

A           7
B    GW170817
C        3.14
D        -123
E    GW150914
dtype: object


- A Series is one-dimensional data and supports numpy-style slicing.

In [5]:
s[0:4]  # slice by integer position; the end is excluded

A           7
B    GW170817
C        3.14
D        -123
dtype: object

In [6]:
s['A':'D']  # slicing by label also works; both endpoints are included

A           7
B    GW170817
C        3.14
D        -123
dtype: object
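The two slicing conventions are easy to mix up, so here is a minimal sketch that spells them out with the explicit `.iloc` and `.loc` accessors. The variable name `s_demo` is made up so that it does not clash with the notebook's `s`.

```python
import pandas as pd

s_demo = pd.Series([7, "GW170817", 3.14, -123, "GW150914"],
                   index=["A", "B", "C", "D", "E"])

# Positional slice: stops before position 3, so three elements
by_position = s_demo.iloc[0:3]

# Label slice: 'D' (which sits at position 3) is included, so four elements
by_label = s_demo.loc["A":"D"]

print(list(by_position.index))
print(list(by_label.index))
```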

In [7]:
s[['A','D','B']]  # select with a list of labels, similar to numpy fancy indexing

A           7
D        -123
B    GW170817
dtype: object

In [8]:
# a plain list does not support this kind of list-based selection
l = [7, 'GW170817', 3.14, -123, "GW150914"]
l[[0,3,1]]

TypeError: list indices must be integers or slices, not list

In [10]:
s[[0,3,1]]  # positional fallback on a label index; deprecated in recent pandas, prefer s.iloc

A           7
D        -123
B    GW170817
dtype: object

In [11]:
s.iloc[[0,3,1]]

A           7
D        -123
B    GW170817
dtype: object

- In essence, a Series is an "extended wrapper" around the ndarray structure

In [12]:
s.values

array([7, 'GW170817', 3.14, -123, 'GW150914'], dtype=object)

We can build a Series from a list and specify the index at the same time.

In fact, we can also initialize a Series from a dict, because a Series is itself a key-value structure.

In [13]:
cities = {"Beijing":55000, "Shanghai":60000, "Shenzhen":50000,
         "Hangzhou":30000, "Guangzhou":40000, "Suzhou":None}
cities

{'Beijing': 55000,
 'Shanghai': 60000,
 'Shenzhen': 50000,
 'Hangzhou': 30000,
 'Guangzhou': 40000,
 'Suzhou': None}

In [14]:
apt = pd.Series(cities, name='income')
apt

Beijing      55000.0
Shanghai     60000.0
Shenzhen     50000.0
Hangzhou     30000.0
Guangzhou    40000.0
Suzhou           NaN
Name: income, dtype: float64
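Note how the `None` entry turned into `NaN` and pushed the whole Series to `float64`. The sketch below reproduces that on a two-element example and, as an optional alternative, shows pandas' nullable `Int64` dtype, which keeps integers alongside a missing value. The variable names are made up for illustration.

```python
import pandas as pd

# Default behaviour: None becomes NaN, so the dtype is float64
income_float = pd.Series([55000, None], index=["Beijing", "Suzhou"])

# Nullable integer dtype: the missing entry is stored as <NA>, integers stay integers
income_int = pd.Series([55000, None], index=["Beijing", "Suzhou"], dtype="Int64")

print(income_float.dtype)
print(income_int.dtype)
```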

In [15]:
apt['Guangzhou']

40000.0

In [16]:
apt[0:3]  # positional slice: start included, end excluded

Beijing     55000.0
Shanghai    60000.0
Shenzhen    50000.0
Name: income, dtype: float64

In [17]:
apt[1:]

Shanghai     60000.0
Shenzhen     50000.0
Hangzhou     30000.0
Guangzhou    40000.0
Suzhou           NaN
Name: income, dtype: float64

In [18]:
apt[:-1]

Beijing      55000.0
Shanghai     60000.0
Shenzhen     50000.0
Hangzhou     30000.0
Guangzhou    40000.0
Name: income, dtype: float64

- Once again: Series and ndarray fit together seamlessly.

#### Selecting Series data with a list of indices

In [19]:
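# once more: select Series data with a list of positions
apt[[3,4,1]]  # positional fallback on a label index; deprecated in recent pandas, prefer apt.iloc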


Hangzhou     30000.0
Guangzhou    40000.0
Shanghai     60000.0
Name: income, dtype: float64

In [20]:
apt.iloc[[3,4,1]]

Hangzhou     30000.0
Guangzhou    40000.0
Shanghai     60000.0
Name: income, dtype: float64

#### Broadcasting

In [21]:
# broadcasting
3*apt

Beijing      165000.0
Shanghai     180000.0
Shenzhen     150000.0
Hangzhou      90000.0
Guangzhou    120000.0
Suzhou            NaN
Name: income, dtype: float64

In [22]:
np.log(apt)

Beijing      10.915088
Shanghai     11.002100
Shenzhen     10.819778
Hangzhou     10.308953
Guangzhou    10.596635
Suzhou             NaN
Name: income, dtype: float64

In [23]:
# multiplying a list by a number is not broadcasting: it repeats the list
my_list = list(apt.values)
my_list

[55000.0, 60000.0, 50000.0, 30000.0, 40000.0, nan]

In [24]:
3*my_list

[55000.0,
 60000.0,
 50000.0,
 30000.0,
 40000.0,
 nan,
 55000.0,
 60000.0,
 50000.0,
 30000.0,
 40000.0,
 nan,
 55000.0,
 60000.0,
 50000.0,
 30000.0,
 40000.0,
 nan]

In [25]:
apt/2.5  # other arithmetic operations work the same way

Beijing      22000.0
Shanghai     24000.0
Shenzhen     20000.0
Hangzhou     12000.0
Guangzhou    16000.0
Suzhou           NaN
Name: income, dtype: float64

#### Arithmetic between `Series` objects is aligned on the index

In [26]:
apt[1:]

Shanghai     60000.0
Shenzhen     50000.0
Hangzhou     30000.0
Guangzhou    40000.0
Suzhou           NaN
Name: income, dtype: float64

In [27]:
apt[:-1]

Beijing      55000.0
Shanghai     60000.0
Shenzhen     50000.0
Hangzhou     30000.0
Guangzhou    40000.0
Name: income, dtype: float64

In [28]:
# aligned on the index (a very direct form of merge)
apt[1:] + apt[:-1]

Beijing           NaN
Guangzhou     80000.0
Hangzhou      60000.0
Shanghai     120000.0
Shenzhen     100000.0
Suzhou            NaN
Name: income, dtype: float64
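Labels that appear in only one operand come out as `NaN`, as Beijing does above. If that is not what you want, `Series.add` accepts a `fill_value`. The following is a small sketch on made-up numbers rather than on the notebook's `apt`.

```python
import pandas as pd

a = pd.Series({"Beijing": 1.0, "Shanghai": 2.0})
b = pd.Series({"Shanghai": 10.0, "Shenzhen": 20.0})

# Plain '+': labels present on one side only become NaN
plain = a + b

# add(..., fill_value=0): a label missing on one side is treated as 0
filled = a.add(b, fill_value=0)

print(plain)
print(filled)
```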

In [29]:
# in / not in test the index, not the values
"Hangzhou" in apt

True

In [31]:
apt

Beijing      55000.0
Shanghai     60000.0
Shenzhen     50000.0
Hangzhou     30000.0
Guangzhou    40000.0
Suzhou           NaN
Name: income, dtype: float64

In [32]:
# the get method
apt.get("Hangzhou")

30000.0

In [33]:
print(apt.get("Chongqing"))

# Returns default value if not found.

None


#### Boolean indexing / conditional selection

In [34]:
apt>40000

Beijing       True
Shanghai      True
Shenzhen      True
Hangzhou     False
Guangzhou    False
Suzhou       False
Name: income, dtype: bool

In [35]:
# data cleaning / filtering
apt[apt>40000]

Beijing     55000.0
Shanghai    60000.0
Shenzhen    50000.0
Name: income, dtype: float64

In [None]:
# under the hood this is indexing with a boolean mask
list((apt>40000).values)

In [39]:
apt[[True, False, False, True, True, False]]

Beijing      55000.0
Hangzhou     30000.0
Guangzhou    40000.0
Name: income, dtype: float64

#### Statistics

These work as in numpy, except that pandas skips NaN by default.

In [40]:
apt.mean()  # mean

47000.0

In [41]:
apt.median()

50000.0

In [42]:
apt.max()

60000.0

In [43]:
apt.min()

30000.0
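The results above ignore the missing Suzhou entry. A minimal sketch of how that differs from calling numpy on the raw array, using the original income figures (the variable names are made up):

```python
import numpy as np
import pandas as pd

income = pd.Series([55000, 60000, 50000, 30000, 40000, None])

pandas_mean = income.mean()               # NaN is skipped by default
strict_mean = income.mean(skipna=False)   # NaN propagates when skipping is turned off
numpy_mean = np.mean(income.to_numpy())   # plain numpy on the raw array: NaN propagates
nan_mean = np.nanmean(income.to_numpy())  # numpy's NaN-aware variant

print(pandas_mean, strict_mean, numpy_mean, nan_mean)
```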

#### Assigning to a `Series`

In [44]:
apt

Beijing      55000.0
Shanghai     60000.0
Shenzhen     50000.0
Hangzhou     30000.0
Guangzhou    40000.0
Suzhou           NaN
Name: income, dtype: float64

In [45]:
apt['Shenzhen'] = 70000

In [46]:
apt

Beijing      55000.0
Shanghai     60000.0
Shenzhen     70000.0
Hangzhou     30000.0
Guangzhou    40000.0
Suzhou           NaN
Name: income, dtype: float64

In [47]:
# filter
apt<=40000

Beijing      False
Shanghai     False
Shenzhen     False
Hangzhou      True
Guangzhou     True
Suzhou       False
Name: income, dtype: bool

In [48]:
# conditional assignment
apt[apt<=40000] = 45000

In [49]:
apt

Beijing      55000.0
Shanghai     60000.0
Shenzhen     70000.0
Hangzhou     45000.0
Guangzhou    45000.0
Suzhou           NaN
Name: income, dtype: float64

In [50]:
cars = pd.Series({
    "Beijing":350000,
    "Shanghai":400000,
    "Shenzhen":300000,
    "Tianjin":200000,
    "Guangzhou":250000,
    "Chongqing":150000
}
)
cars

Beijing      350000
Shanghai     400000
Shenzhen     300000
Tianjin      200000
Guangzhou    250000
Chongqing    150000
dtype: int64

In [52]:
tmp = cars + 10*apt  # broadcasting, then index-aligned arithmetic between two Series

In [53]:
tmp

Beijing       900000.0
Chongqing          NaN
Guangzhou     700000.0
Hangzhou           NaN
Shanghai     1000000.0
Shenzhen     1000000.0
Suzhou             NaN
Tianjin            NaN
dtype: float64

#### Missing data

In [54]:
"Hangzhou" in apt

True

In [55]:
"Hangzhou" in cars

False

In [57]:
apt

Beijing      55000.0
Shanghai     60000.0
Shenzhen     70000.0
Hangzhou     45000.0
Guangzhou    45000.0
Suzhou           NaN
Name: income, dtype: float64

In [56]:
# notnull
apt.notnull()

Beijing       True
Shanghai      True
Shenzhen      True
Hangzhou      True
Guangzhou     True
Suzhou       False
Name: income, dtype: bool

In [58]:
# isnull
apt.isnull()

Beijing      False
Shanghai     False
Shenzhen     False
Hangzhou     False
Guangzhou    False
Suzhou        True
Name: income, dtype: bool

In [59]:
# keep the non-missing entries
apt[apt.notnull()]

Beijing      55000.0
Shanghai     60000.0
Shenzhen     70000.0
Hangzhou     45000.0
Guangzhou    45000.0
Name: income, dtype: float64
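Filtering with `notnull()` is one option; `dropna()` and `fillna()` are two shortcuts for the same kind of cleanup. A sketch on a shortened, made-up version of the income data:

```python
import pandas as pd

income = pd.Series({"Beijing": 55000, "Suzhou": None, "Hangzhou": 45000})

# dropna() gives the same result as income[income.notnull()]
dropped = income.dropna()

# fillna() replaces missing entries; filling with the mean is just one possible choice
filled = income.fillna(income.mean())

print(dropped)
print(filled)
```

Both methods return a new Series; `income` itself still contains the missing value.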

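### The `DataFrame` Data Structure

#### Building and initializing a `DataFrame`


First, note the difference between the two Pandas data structures: a Series is one-dimensional data, while a DataFrame is a table, that is, two-dimensional data. You can think of a DataFrame as an Excel sheet, or as a collection of Series.


We start with two simple ways to build and initialize a DataFrame: (1) row by row; (2) column by column.

In [63]:
# build row by row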
data = [
    ['Beijing', 2017, 2100],
    ['Shanghai', 2018, 2300],
    ['Guangzhou', 2017, 1000],
    ['Shenzhen', 2018, 700],
    ['Hangzhou', 2017, 500], 
    ['Chongqing', 2017, 500]
]
pd.DataFrame(data, 
    columns = ['City', 'year', 'population'])

Unnamed: 0,City,year,population
0,Beijing,2017,2100
1,Shanghai,2018,2300
2,Guangzhou,2017,1000
3,Shenzhen,2018,700
4,Hangzhou,2017,500
5,Chongqing,2017,500


Besides building from a list of rows, here is another way: define a dict and use it to initialize a DataFrame.

In [64]:
# build column by column
data = {'City':["Beijing", "Shanghai", "Guangzhou", "Shenzhen", "Hangzhou", "Chongqing"],
        'year':[2017,2018,2017,2018,2017,2017],
        'population':[2100, 2300, 1000, 700, 500, 500]
       }

pd.DataFrame(data)

Unnamed: 0,City,year,population
0,Beijing,2017,2100
1,Shanghai,2018,2300
2,Guangzhou,2017,1000
3,Shenzhen,2018,700
4,Hangzhou,2017,500
5,Chongqing,2017,500


In [65]:
pd.DataFrame(data, columns=['year', 'City', 'population'])

Unnamed: 0,year,City,population
0,2017,Beijing,2100
1,2018,Shanghai,2300
2,2017,Guangzhou,1000
3,2018,Shenzhen,700
4,2017,Hangzhou,500
5,2017,Chongqing,500


In [66]:
pd.DataFrame(data, columns=['year', 'City', 'population'],
             index=['one', 'two', 'three', 'four', 'five', 'six'])

Unnamed: 0,year,City,population
one,2017,Beijing,2100
two,2018,Shanghai,2300
three,2017,Guangzhou,1000
four,2018,Shenzhen,700
five,2017,Hangzhou,500
six,2017,Chongqing,500


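#### Initializing a `DataFrame` from several `Series`

Finally, let us build and initialize a `DataFrame` from several `Series`. This also makes the relationship between `Series` and `DataFrame` tangible. Note that the `Series` are aligned on their index.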

In [67]:
apt = pd.Series({
    "Beijing":55000, 
    "Shanghai":60000, 
    "Shenzhen":50000,
    "Hangzhou":30000, 
    "Guangzhou":40000, 
    "Suzhou":None
})

cars = pd.Series({
    "Beijing":350000,
    "Shanghai":400000,
    "Shenzhen":300000,
    "Tianjin":200000,
    "Guangzhou":250000,
    "Chongqing":150000
})

In [68]:
# the Series are aligned on their index
df = pd.DataFrame({
    'apt':apt,
    'cars':cars
})
df

Unnamed: 0,apt,cars
Beijing,55000.0,350000.0
Chongqing,,150000.0
Guangzhou,40000.0,250000.0
Hangzhou,30000.0,
Shanghai,60000.0,400000.0
Shenzhen,50000.0,300000.0
Suzhou,,
Tianjin,,200000.0


In [69]:
df['cars']

Beijing      350000.0
Chongqing    150000.0
Guangzhou    250000.0
Hangzhou          NaN
Shanghai     400000.0
Shenzhen     300000.0
Suzhou            NaN
Tianjin      200000.0
Name: cars, dtype: float64

In [70]:
type(df['cars'])

pandas.core.series.Series

In [71]:
df[['cars']]

Unnamed: 0,cars
Beijing,350000.0
Chongqing,150000.0
Guangzhou,250000.0
Hangzhou,
Shanghai,400000.0
Shenzhen,300000.0
Suzhou,
Tianjin,200000.0


In [73]:
type(df[['cars']])

pandas.core.frame.DataFrame

In [75]:
# assignment
df

Unnamed: 0,apt,cars
Beijing,55000.0,350000.0
Chongqing,,150000.0
Guangzhou,40000.0,250000.0
Hangzhou,30000.0,
Shanghai,60000.0,400000.0
Shenzhen,50000.0,300000.0
Suzhou,,
Tianjin,,200000.0


In [76]:
df['bonus'] = 40000  # broadcasting: the scalar fills a newly created column

In [77]:
df

Unnamed: 0,apt,cars,bonus
Beijing,55000.0,350000.0,40000
Chongqing,,150000.0,40000
Guangzhou,40000.0,250000.0,40000
Hangzhou,30000.0,,40000
Shanghai,60000.0,400000.0,40000
Shenzhen,50000.0,300000.0,40000
Suzhou,,,40000
Tianjin,,200000.0,40000


In [78]:
df['income'] = df['apt'] + df['bonus']  # element-wise addition, aligned on the index

In [79]:
df

Unnamed: 0,apt,cars,bonus,income
Beijing,55000.0,350000.0,40000,95000.0
Chongqing,,150000.0,40000,
Guangzhou,40000.0,250000.0,40000,80000.0
Hangzhou,30000.0,,40000,70000.0
Shanghai,60000.0,400000.0,40000,100000.0
Shenzhen,50000.0,300000.0,40000,90000.0
Suzhou,,,40000,
Tianjin,,200000.0,40000,


#### Indexing and slicing a `DataFrame`


In [80]:
df.index

Index(['Beijing', 'Chongqing', 'Guangzhou', 'Hangzhou', 'Shanghai', 'Shenzhen',
       'Suzhou', 'Tianjin'],
      dtype='object')

In [85]:
df.columns

Index(['apt', 'cars', 'bonus', 'income'], dtype='object')

In [86]:
df

Unnamed: 0,apt,cars,bonus,income
Beijing,55000.0,350000.0,40000,95000.0
Chongqing,,150000.0,40000,
Guangzhou,40000.0,250000.0,40000,80000.0
Hangzhou,30000.0,,40000,70000.0
Shanghai,60000.0,400000.0,40000,100000.0
Shenzhen,50000.0,300000.0,40000,90000.0
Suzhou,,,40000,
Tianjin,,200000.0,40000,


##### Indexing with loc

In [87]:
# before the comma: the row selection; after the comma: the column selection
df.loc['Beijing':"Suzhou", ['cars', "income"]]  # select rows and columns

Unnamed: 0,cars,income
Beijing,350000.0,95000.0
Chongqing,150000.0,
Guangzhou,250000.0,80000.0
Hangzhou,,70000.0
Shanghai,400000.0,100000.0
Shenzhen,300000.0,90000.0
Suzhou,,


In [88]:
df['cars']>200000

Beijing       True
Chongqing    False
Guangzhou     True
Hangzhou     False
Shanghai      True
Shenzhen      True
Suzhou       False
Tianjin      False
Name: cars, dtype: bool

In [89]:
df.loc[df['cars']>200000 , ['cars', "income"]]

Unnamed: 0,cars,income
Beijing,350000.0,95000.0
Guangzhou,250000.0,80000.0
Shanghai,400000.0,100000.0
Shenzhen,300000.0,90000.0


In [90]:
df['cars']>200000 #bool series

Beijing       True
Chongqing    False
Guangzhou     True
Hangzhou     False
Shanghai      True
Shenzhen      True
Suzhou       False
Tianjin      False
Name: cars, dtype: bool

In [91]:
df['income']>90000 #bool series

Beijing       True
Chongqing    False
Guangzhou    False
Hangzhou     False
Shanghai      True
Shenzhen     False
Suzhou       False
Tianjin      False
Name: income, dtype: bool

- AND

In [92]:
(df['cars']>200000) & (df['income']>90000)  # AND

Beijing       True
Chongqing    False
Guangzhou    False
Hangzhou     False
Shanghai      True
Shenzhen     False
Suzhou       False
Tianjin      False
dtype: bool

In [93]:
# before the comma: the row selection; after the comma: the column selection
df.loc[(df['cars']>200000) & (df['income']>90000), ['cars', "income"]]

Unnamed: 0,cars,income
Beijing,350000.0,95000.0
Shanghai,400000.0,100000.0


In [94]:
df.loc[(df['cars']>200000) & (df['income']>90000), 'cars':"income"]

Unnamed: 0,cars,bonus,income
Beijing,350000.0,40000,95000.0
Shanghai,400000.0,40000,100000.0


- OR

In [95]:
df.loc[(df['cars']>200000)|(df['income']>90000), 'cars':"income"]  # OR

Unnamed: 0,cars,bonus,income
Beijing,350000.0,40000,95000.0
Guangzhou,250000.0,40000,80000.0
Shanghai,400000.0,40000,100000.0
Shenzhen,300000.0,40000,90000.0


- NOT

In [98]:
df.loc[~(df['cars']>200000), 'cars':"income"]  # NOT

Unnamed: 0,cars,bonus,income
Chongqing,150000.0,40000,
Hangzhou,,40000,70000.0
Suzhou,,40000,
Tianjin,200000.0,40000,


In [99]:
df.loc[~(df['cars']>200000), 'cars':'''income''']  # NOT (a triple-quoted string is the same label)

Unnamed: 0,cars,bonus,income
Chongqing,150000.0,40000,
Hangzhou,,40000,70000.0
Suzhou,,40000,
Tianjin,200000.0,40000,
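One detail in the NOT results is worth a second look: Hangzhou and Suzhou are selected even though their `cars` value is missing. Any comparison with `NaN` is `False`, so negating `> 200000` turns those rows into `True`. A small sketch on made-up data shows that `~(x > c)` and `x <= c` are not interchangeable when values are missing:

```python
import numpy as np
import pandas as pd

cars_demo = pd.Series({"Beijing": 350000, "Hangzhou": np.nan, "Tianjin": 200000})

# NaN > 200000 is False, so its negation is True: the NaN row is kept
negated = cars_demo[~(cars_demo > 200000)]

# NaN <= 200000 is also False: the NaN row is dropped
direct = cars_demo[cars_demo <= 200000]

print(list(negated.index))
print(list(direct.index))
```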


- Other ways to select data
    * iloc
    * at
    * iat
    * ~~ix~~ (removed in pandas 1.0)

In [100]:
df

Unnamed: 0,apt,cars,bonus,income
Beijing,55000.0,350000.0,40000,95000.0
Chongqing,,150000.0,40000,
Guangzhou,40000.0,250000.0,40000,80000.0
Hangzhou,30000.0,,40000,70000.0
Shanghai,60000.0,400000.0,40000,100000.0
Shenzhen,50000.0,300000.0,40000,90000.0
Suzhou,,,40000,
Tianjin,,200000.0,40000,


##### Indexing with iloc

In [None]:
# iloc: integer-location based indexing
df.iloc[[0,3,4],[0,2]]

For reference:

- DataFrame.at : Access a single value for a row/column pair by label.
- DataFrame.iat : Access a single value for a row/column pair by integer
    position.
- DataFrame.loc : Access a group of rows and columns by label(s).
- DataFrame.iloc : Access a group of rows and columns by integer
    position(s).
- Series.at : Access a single value by label.
- Series.iat : Access a single value by integer position.
- Series.loc : Access a group of rows by label(s).
- Series.iloc : Access a group of rows by integer position(s).
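The four DataFrame accessors in this list can be compared side by side. The sketch below uses a tiny made-up frame so that each call reaches the same data in a different way.

```python
import pandas as pd

frame = pd.DataFrame(
    {"apt": [55000.0, 60000.0], "cars": [350000.0, 400000.0]},
    index=["Beijing", "Shanghai"],
)

by_label = frame.at["Shanghai", "cars"]   # single value, addressed by labels
by_position = frame.iat[1, 1]             # the same cell, addressed by integer positions

rows_by_label = frame.loc[["Beijing"], ["apt", "cars"]]   # sub-frame selected by labels
rows_by_position = frame.iloc[[0], [0, 1]]                # the same sub-frame by positions

print(by_label, by_position)
print(rows_by_label.equals(rows_by_position))
```

As a rule of thumb, `at`/`iat` are meant for a single cell, while `loc`/`iloc` handle lists, slices and boolean masks.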

##### Selecting columns

In [101]:
df[['bonus', 'income']]  # return DataFrame

Unnamed: 0,bonus,income
Beijing,40000,95000.0
Chongqing,40000,
Guangzhou,40000,80000.0
Hangzhou,40000,70000.0
Shanghai,40000,100000.0
Shenzhen,40000,90000.0
Suzhou,40000,
Tianjin,40000,


In [103]:
df[['income']] # return DataFrame

Unnamed: 0,income
Beijing,95000.0
Chongqing,
Guangzhou,80000.0
Hangzhou,70000.0
Shanghai,100000.0
Shenzhen,90000.0
Suzhou,
Tianjin,


In [104]:
df['income'] # return Series (way.1)

Beijing       95000.0
Chongqing         NaN
Guangzhou     80000.0
Hangzhou      70000.0
Shanghai     100000.0
Shenzhen      90000.0
Suzhou            NaN
Tianjin           NaN
Name: income, dtype: float64

In [105]:
df.income  # return Series (way.2)

Beijing       95000.0
Chongqing         NaN
Guangzhou     80000.0
Hangzhou      70000.0
Shanghai     100000.0
Shenzhen      90000.0
Suzhou            NaN
Tianjin           NaN
Name: income, dtype: float64
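Attribute access (`df.income`) is convenient but has a catch: it only works when the column name is a valid Python identifier that does not collide with an existing DataFrame attribute. A minimal sketch, using a made-up frame with a column deliberately named `count`:

```python
import pandas as pd

frame = pd.DataFrame({"income": [95000.0, 80000.0], "count": [1, 2]})

# For an ordinary name, both access styles return the same Series
same_column = frame.income.equals(frame["income"])

# 'count' is also a DataFrame method, so attribute access finds the method, not the column
is_method = callable(frame.count)

# Bracket access always reaches the column
count_column = frame["count"]

print(same_column, is_method, count_column.tolist())
```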

In [106]:
# inverse selection: drop the listed columns
df.drop(['bonus', 'income'], axis=1)

Unnamed: 0,apt,cars
Beijing,55000.0,350000.0
Chongqing,,150000.0
Guangzhou,40000.0,250000.0
Hangzhou,30000.0,
Shanghai,60000.0,400000.0
Shenzhen,50000.0,300000.0
Suzhou,,
Tianjin,,200000.0


##### Selecting rows

In [107]:
df.drop('Tianjin', axis=0)

Unnamed: 0,apt,cars,bonus,income
Beijing,55000.0,350000.0,40000,95000.0
Chongqing,,150000.0,40000,
Guangzhou,40000.0,250000.0,40000,80000.0
Hangzhou,30000.0,,40000,70000.0
Shanghai,60000.0,400000.0,40000,100000.0
Shenzhen,50000.0,300000.0,40000,90000.0
Suzhou,,,40000,
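Both `drop` calls above return a new DataFrame and leave `df` untouched, which is why later cells still show the `bonus` and `income` columns and the Tianjin row. A sketch on a small made-up frame, using the `columns=` and `index=` keywords as an alternative to `axis`:

```python
import pandas as pd

frame = pd.DataFrame({"apt": [55000.0, 60000.0], "bonus": [40000, 40000]},
                     index=["Beijing", "Tianjin"])

without_bonus = frame.drop(columns=["bonus"])   # same as drop(["bonus"], axis=1)
without_tianjin = frame.drop(index="Tianjin")   # same as drop("Tianjin", axis=0)

print(without_bonus.columns.tolist())
print(without_tianjin.index.tolist())
print(frame.shape)  # the original frame is unchanged
```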


In [109]:
df.head()  # first 5 rows by default

Unnamed: 0,apt,cars,bonus,income
Beijing,55000.0,350000.0,40000,95000.0
Chongqing,,150000.0,40000,
Guangzhou,40000.0,250000.0,40000,80000.0
Hangzhou,30000.0,,40000,70000.0
Shanghai,60000.0,400000.0,40000,100000.0


In [110]:
df.tail()  # last 5 rows by default

Unnamed: 0,apt,cars,bonus,income
Hangzhou,30000.0,,40000,70000.0
Shanghai,60000.0,400000.0,40000,100000.0
Shenzhen,50000.0,300000.0,40000,90000.0
Suzhou,,,40000,
Tianjin,,200000.0,40000,


In [111]:
df.sample(3)  # pick 3 rows at random

Unnamed: 0,apt,cars,bonus,income
Shanghai,60000.0,400000.0,40000,100000.0
Chongqing,,150000.0,40000,
Suzhou,,,40000,


In [112]:
df[df.income > 80000]  # select rows with a boolean Series

Unnamed: 0,apt,cars,bonus,income
Beijing,55000.0,350000.0,40000,95000.0
Shanghai,60000.0,400000.0,40000,100000.0
Shenzhen,50000.0,300000.0,40000,90000.0


- Once you are comfortable with numpy, you can slice however you like!

In [113]:
df.iloc[3:8,:]

Unnamed: 0,apt,cars,bonus,income
Hangzhou,30000.0,,40000,70000.0
Shanghai,60000.0,400000.0,40000,100000.0
Shenzhen,50000.0,300000.0,40000,90000.0
Suzhou,,,40000,
Tianjin,,200000.0,40000,


In [114]:
df.iloc[[1,4,5],:2]

Unnamed: 0,apt,cars
Chongqing,,150000.0
Shanghai,60000.0,400000.0
Shenzhen,50000.0,300000.0


- And once you are comfortable with pandas too, you can slice even more freely!

In [115]:
df.loc[df.income > 80000,'apt':'bonus']

Unnamed: 0,apt,cars,bonus
Beijing,55000.0,350000.0,40000
Shanghai,60000.0,400000.0,40000
Shenzhen,50000.0,300000.0,40000


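#### Loading and saving CSV files with Pandas


Because pandas is both powerful and flexible, in practice it often replaces Excel for data analysis. As the construction examples above suggest, the DataFrame structure in pandas is essentially a two-dimensional table, much like an Excel sheet.

Next, we use a commercial bank's dataset to get a feel for how pandas is used in real data analysis work.


The fields of this file that our operations may touch are described below: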

- Bank Marketing (with social/economic context)

    - age (numeric)
    - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
    - previous: number of contacts performed before this campaign and for this client (numeric)
    - target: has the client subscribed a term deposit? (binary: "yes","no")

We first need to load the data of this CSV file with pandas.

In [117]:
help(pd.read_csv)

In [121]:
# !head -n 3 bank-campaign.csv

In [122]:
df = pd.read_csv('bank-campaign.csv', nrows=10, usecols = ['age', 'pdays', 'previous', 'target'])

df

Unnamed: 0,age,pdays,previous,target
0,56,999,0,no
1,57,999,0,no
2,37,999,0,no
3,40,999,0,no
4,56,999,0,no
5,45,999,0,no
6,59,999,0,no
7,41,999,0,no
8,24,999,0,no
9,25,999,0,no
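If `bank-campaign.csv` is not at hand, the effect of `nrows` and `usecols` can still be tried on an in-memory stand-in. The CSV text below is made up for illustration (including the extra `job` column) and only mimics the shape of the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file on disk
csv_text = """age,job,pdays,previous,target
56,housemaid,999,0,no
57,services,999,0,no
37,services,999,0,no
"""

# nrows limits how many data rows are read; usecols keeps only the named columns
sample = pd.read_csv(io.StringIO(csv_text), nrows=2, usecols=["age", "target"])

print(sample)
```

Reading only a few rows and columns like this is a cheap way to inspect a large file before loading it in full.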


Save the DataFrame to a CSV file

In [123]:
df.to_csv("mydata.csv")
!ls -lh mydata.csv

-rw-r--r-- 1 root root 167 Dec  3 11:40 mydata.csv


In [124]:
!rm mydata.csv
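By default `to_csv` also writes the row index as an unnamed first column, and reading such a file back produces a column called `Unnamed: 0`. The sketch below shows the round trip in memory on a made-up frame, and how `index=False` avoids it.

```python
import io
import pandas as pd

frame = pd.DataFrame({"age": [56, 57], "target": ["no", "no"]})

# Without a path, to_csv returns the CSV text instead of writing a file
with_index = frame.to_csv()
without_index = frame.to_csv(index=False)

reloaded_with = pd.read_csv(io.StringIO(with_index))
reloaded_without = pd.read_csv(io.StringIO(without_index))

print(list(reloaded_with.columns))
print(list(reloaded_without.columns))
```

An alternative is to keep the index in the file and pass `index_col=0` to `read_csv` when loading it again.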

#### Reshaping and merging `DataFrame` objects


Build a simple `DataFrame`

In [125]:
data = pd.DataFrame({
            'name': ['Zhangsan', 'Lisi', 'Xiaomei'],
            'age': [56, 57, 37], 
            'sex': ['M', 'M', 'F'],
            'money': [100, 90, 50]
        })
data

Unnamed: 0,name,age,sex,money
0,Zhangsan,56,M,100
1,Lisi,57,M,90
2,Xiaomei,37,F,50


Change the index

In [127]:
data = data.set_index('name')
data = data.reset_index()

data

Unnamed: 0,name,age,sex,money
0,Zhangsan,56,M,100
1,Lisi,57,M,90
2,Xiaomei,37,F,50
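Because the cell above sets and immediately resets the index, the output looks unchanged. The intermediate state is the interesting part; here is a sketch on a shortened, made-up copy of the data:

```python
import pandas as pd

people = pd.DataFrame({"name": ["Zhangsan", "Lisi"], "age": [56, 57]})

indexed = people.set_index("name")        # 'name' becomes the row index
age_of_lisi = indexed.loc["Lisi", "age"]  # rows can now be looked up by name
restored = indexed.reset_index()          # 'name' is a regular column again

print(indexed)
print(age_of_lisi)
print(restored.columns.tolist())
```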


Sort by value

In [129]:
data.sort_values('age')

Unnamed: 0,name,age,sex,money
2,Xiaomei,37,F,50
0,Zhangsan,56,M,100
1,Lisi,57,M,90


Add a column

In [130]:
data['target'] = ['No', 'Yes', 'No']

data

Unnamed: 0,name,age,sex,money,target
0,Zhangsan,56,M,100,No
1,Lisi,57,M,90,Yes
2,Xiaomei,37,F,50,No


In [132]:
data.loc[:,'target'] = ['Yes', 'Yes', 'No']

data

Unnamed: 0,name,age,sex,money,target
0,Zhangsan,56,M,100,Yes
1,Lisi,57,M,90,Yes
2,Xiaomei,37,F,50,No


Merging data

In [135]:
df

Unnamed: 0,age,pdays,previous,target
0,56,999,0,no
1,57,999,0,no
2,37,999,0,no
3,40,999,0,no
4,56,999,0,no
5,45,999,0,no
6,59,999,0,no
7,41,999,0,no
8,24,999,0,no
9,25,999,0,no


In [136]:
# we start from this case:
data[['age', 'money']]

Unnamed: 0,age,money
0,56,100
1,57,90
2,37,50


Merge with the concat method

In [137]:
pd.concat([df, data[['age', 'money']]], axis=1)  # a blunt merge: glued side by side on the index

Unnamed: 0,age,pdays,previous,target,age.1,money
0,56,999,0,no,56.0,100.0
1,57,999,0,no,57.0,90.0
2,37,999,0,no,37.0,50.0
3,40,999,0,no,,
4,56,999,0,no,,
5,45,999,0,no,,
6,59,999,0,no,,
7,41,999,0,no,,
8,24,999,0,no,,
9,25,999,0,no,,
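`concat` only lines the pieces up on their index; it never looks at the values. A sketch of the two directions on made-up frames (`left` and `right` are hypothetical names):

```python
import pandas as pd

left = pd.DataFrame({"age": [56, 57, 37]})
right = pd.DataFrame({"money": [100, 90]})

# axis=1: place side by side, aligned on the row index; gaps become NaN
side_by_side = pd.concat([left, right], axis=1)

# axis=0: stack on top of each other; ignore_index renumbers the rows
stacked = pd.concat([left, left], axis=0, ignore_index=True)

print(side_by_side)
print(stacked)
```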


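Merge with the merge method. A question to think about: why do four rows of the merged result carry a `money` value, when `data` has only three rows?

In [138]:
pd.merge(df, data[['age', 'money']], on = 'age', how = 'outer')  # a smart merge: matched on the key column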

Unnamed: 0,age,pdays,previous,target,money
0,56,999,0,no,100.0
1,56,999,0,no,100.0
2,57,999,0,no,90.0
3,37,999,0,no,50.0
4,40,999,0,no,
5,45,999,0,no,
6,59,999,0,no,
7,41,999,0,no,
8,24,999,0,no,
9,25,999,0,no,
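One way to explore the four-row question is to rerun the merge with `how='inner'`, which keeps only the matching rows, and to add `indicator=True` to the outer merge. The sketch rebuilds just the `age` and `money` columns from the tables above; `bank` and `people` are made-up names.

```python
import pandas as pd

bank = pd.DataFrame({"age": [56, 57, 37, 40, 56, 45, 59, 41, 24, 25]})
people = pd.DataFrame({"age": [56, 57, 37], "money": [100, 90, 50]})

# Inner merge: only ages present in both frames survive.
# Age 56 appears twice in 'bank', so its single match in 'people' is used twice.
inner = pd.merge(bank, people, on="age", how="inner")

# Outer merge with an indicator column that records where each row came from
outer = pd.merge(bank, people, on="age", how="outer", indicator=True)

print(len(inner))
print(outer["_merge"].value_counts())
```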


#### Common `DataFrame` attributes and methods

- Inspecting the data

In [139]:
df.shape

(10, 4)

In [140]:
df.head()

Unnamed: 0,age,pdays,previous,target
0,56,999,0,no
1,57,999,0,no
2,37,999,0,no
3,40,999,0,no
4,56,999,0,no


In [141]:
df.describe()

Unnamed: 0,age,pdays,previous
count,10.0,10.0,10.0
mean,44.0,999.0,0.0
std,12.987173,0.0,0.0
min,24.0,999.0,0.0
25%,37.75,999.0,0.0
50%,43.0,999.0,0.0
75%,56.0,999.0,0.0
max,59.0,999.0,0.0


In [142]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   age       10 non-null     int64 
 1   pdays     10 non-null     int64 
 2   previous  10 non-null     int64 
 3   target    10 non-null     object
dtypes: int64(3), object(1)
memory usage: 448.0+ bytes


In [143]:
df.values

array([[56, 999, 0, 'no'],
       [57, 999, 0, 'no'],
       [37, 999, 0, 'no'],
       [40, 999, 0, 'no'],
       [56, 999, 0, 'no'],
       [45, 999, 0, 'no'],
       [59, 999, 0, 'no'],
       [41, 999, 0, 'no'],
       [24, 999, 0, 'no'],
       [25, 999, 0, 'no']], dtype=object)

Because pandas is built on numpy, the statistical functions apply here as well.

In [144]:
df.max()

age          59
pdays       999
previous      0
target       no
dtype: object

In [145]:
df

Unnamed: 0,age,pdays,previous,target
0,56,999,0,no
1,57,999,0,no
2,37,999,0,no
3,40,999,0,no
4,56,999,0,no
5,45,999,0,no
6,59,999,0,no
7,41,999,0,no
8,24,999,0,no
9,25,999,0,no


In [149]:
df.mean(numeric_only=True)

age          44.0
pdays       999.0
previous      0.0
dtype: float64

In [147]:
df.std(numeric_only=True)

age         12.987173
pdays        0.000000
previous     0.000000
dtype: float64

In [151]:
df

Unnamed: 0,age,pdays,previous,target
0,56,999,0,no
1,57,999,0,no
2,37,999,0,no
3,40,999,0,no
4,56,999,0,no
5,45,999,0,no
6,59,999,0,no
7,41,999,0,no
8,24,999,0,no
9,25,999,0,no


In [154]:
df.age.value_counts()

age
56    2
57    1
37    1
40    1
45    1
59    1
41    1
24    1
25    1
Name: count, dtype: int64
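`value_counts` can also report shares instead of raw counts. A short sketch on the same ten ages, rebuilt here as a standalone Series:

```python
import pandas as pd

ages = pd.Series([56, 57, 37, 40, 56, 45, 59, 41, 24, 25], name="age")

counts = ages.value_counts()                # absolute frequencies, most frequent first
shares = ages.value_counts(normalize=True)  # relative frequencies that sum to 1

print(counts.head(3))
print(shares.head(3))
```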

---

In [150]:
from IPython.display import IFrame
IFrame('Pandas_Cheat_Sheet.pdf', width=1800, height=1450)