# 5. Combining Data (merge & concat)


#### Learn Python Data Analysis the Hard Way (笨办法学 Python 数据分析)
- @Author: 知行并重


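## Contents
Data can be combined in two ways:
1. concat (stacking)

    Tables can be stacked horizontally or vertically. Vertical stacking is similar to `UNION ALL` in SQL.

2. merge (joining)

    Joining combines tables `horizontally` based on one or more keys (columns). It is similar to `JOIN` in SQL.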
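As a minimal sketch of the two approaches, here they are on two hypothetical toy tables. The names `left` and `right` and their values are made up for illustration and are not part of the Titanic data.

```python
import pandas as pd

# Hypothetical toy tables, not taken from the Titanic dataset.
left = pd.DataFrame({"key": [1, 2, 3], "a": ["x", "y", "z"]})
right = pd.DataFrame({"key": [2, 3, 4], "b": [20, 30, 40]})

# concat: stack the tables on top of each other (like UNION ALL).
stacked = pd.concat([left, right], axis=0)
print(stacked.shape)   # (6, 3): 3 + 3 rows, columns key, a, b

# merge: match rows on a key column (like an inner JOIN).
joined = pd.merge(left, right, on="key", how="inner")
print(joined.shape)    # (2, 3): only keys 2 and 3 appear in both tables
```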

## Reading the Data

In [1]:
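### Import the required libraries

import pandas as pd  # data analysis
import numpy as np  # scientific computing

data = pd.read_csv("../input/titanic.csv")

use_cols1 = ['PassengerId','Sex','Fare']
use_cols2 = ['PassengerId','Sex','SibSp','Pclass']

# .loc slices by label and includes both endpoints, so each subset has 101 rows
sub_data1 = data.loc[:100,use_cols1]
sub_data2 = data.loc[:100,use_cols2]
sub_data3 = data.loc[50:150,use_cols1]
# sub_data4 uses PassengerId as its index instead of a column
sub_data4 = data.loc[50:150,use_cols2].set_index('PassengerId')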

## 1. Concat

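### 1.1 Concatenate sub_data1 and sub_data2 by rows into data1

#### Vertical stacking along rows, axis = 0 (the default)    ====> longer

<font color='red'>Note how the shape changes</font>

The code below keeps the original row labels; pass `ignore_index=True` to renumber them.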

In [2]:
data1 = pd.concat([sub_data1,sub_data2],axis = 0)

print('data shape:',sub_data1.shape)
print('data shape:',sub_data2.shape)
print('data shape:',data1.shape)

data shape: (101, 3)
data shape: (101, 4)
data shape: (202, 5)
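The result has 5 columns because the default `join='outer'` keeps the union of the columns from both tables. A minimal sketch on hypothetical two-row tables (`df_a` and `df_b` are stand-ins for `sub_data1` and `sub_data2`, with made-up values) shows where the missing values come from and what `ignore_index=True` changes:

```python
import pandas as pd

# Hypothetical toy tables standing in for sub_data1 and sub_data2.
df_a = pd.DataFrame({"PassengerId": [1, 2], "Sex": ["male", "female"],
                     "Fare": [7.25, 71.28]})
df_b = pd.DataFrame({"PassengerId": [1, 2], "Sex": ["male", "female"],
                     "SibSp": [1, 1], "Pclass": [3, 1]})

# Default join='outer': the result has the union of the columns, and
# cells that a table does not provide are filled with NaN.
out = pd.concat([df_a, df_b], axis=0)
print(out.shape)                 # (4, 5)
print(out.index.tolist())        # [0, 1, 0, 1] (original labels are kept)
print(out["Fare"].isna().sum())  # 2 (the rows from df_b have no Fare)

# ignore_index=True discards the old labels and renumbers the rows.
out_reset = pd.concat([df_a, df_b], axis=0, ignore_index=True)
print(out_reset.index.tolist())  # [0, 1, 2, 3]
```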


### 1.2 Concatenate sub_data1 and sub_data2 by rows with an inner join
#### Keep only the columns shared by both tables

In [3]:
data2 = pd.concat([sub_data1,sub_data2],axis = 0, join = 'inner')

print('data shape:',sub_data1.shape)
print('data shape:',sub_data2.shape)
print('data shape:',data2.shape)

data shape: (101, 3)
data shape: (101, 4)
data shape: (202, 2)


In [4]:
data2.head()

| | PassengerId | Sex |
|---|---|---|
| 0 | 1 | male |
| 1 | 2 | female |
| 2 | 3 | female |
| 3 | 4 | female |
| 4 | 5 | male |
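With `join='inner'` on `axis=0`, only the columns present in every table survive. The following sketch checks this on hypothetical toy tables (`df_a`, `df_b`, and `shared` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical toy tables with two shared columns and one private column each.
df_a = pd.DataFrame({"PassengerId": [1, 2], "Sex": ["male", "female"],
                     "Fare": [7.25, 71.28]})
df_b = pd.DataFrame({"PassengerId": [3, 4], "Sex": ["female", "male"],
                     "Pclass": [3, 1]})

inner = pd.concat([df_a, df_b], axis=0, join="inner")

# The surviving columns are exactly the ones both tables have in common.
shared = [c for c in df_a.columns if c in df_b.columns]
print(shared)                  # ['PassengerId', 'Sex']
print(inner.columns.tolist())  # ['PassengerId', 'Sex']
print(inner.shape)             # (4, 2)
```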


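### 1.3 Concatenate sub_data1 and sub_data2 by columns

#### Horizontal stacking along columns, axis = 1    ====> wider

<font color='red'>Note how the shape changes</font>

By default, columns with the same name are all kept, so they need to be renamed afterwards.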

In [5]:
data3 = pd.concat([sub_data1,sub_data2],axis = 1) 

In [6]:
print('data shape:',sub_data1.shape)
print('data shape:',sub_data2.shape)
print('data shape:',data3.shape)

data shape: (101, 3)
data shape: (101, 4)
data shape: (101, 7)


In [7]:
data3.head()

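| | PassengerId | Sex | Fare | PassengerId | Sex | SibSp | Pclass |
|---|---|---|---|---|---|---|---|
| 0 | 1 | male | 7.25 | 1 | male | 1 | 3 |
| 1 | 2 | female | 71.2833 | 2 | female | 1 | 1 |
| 2 | 3 | female | 7.925 | 3 | female | 0 | 3 |
| 3 | 4 | female | 53.1 | 4 | female | 1 | 1 |
| 4 | 5 | male | 8.05 | 5 | male | 0 | 3 |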
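Duplicate column names make later selection ambiguous. Two possible ways to deal with them are sketched below on hypothetical toy tables. The labels `"left"` and `"right"` passed to `keys` are arbitrary names chosen for this example.

```python
import pandas as pd

# Hypothetical toy tables that share the PassengerId and Sex columns.
df_a = pd.DataFrame({"PassengerId": [1, 2], "Sex": ["male", "female"],
                     "Fare": [7.25, 71.28]})
df_b = pd.DataFrame({"PassengerId": [1, 2], "Sex": ["male", "female"],
                     "Pclass": [3, 1]})

wide = pd.concat([df_a, df_b], axis=1)
print(wide.columns.tolist())
# ['PassengerId', 'Sex', 'Fare', 'PassengerId', 'Sex', 'Pclass']

# Option 1: keep only the first occurrence of each column name.
deduped = wide.loc[:, ~wide.columns.duplicated()]
print(deduped.columns.tolist())  # ['PassengerId', 'Sex', 'Fare', 'Pclass']

# Option 2: label each source table with `keys`, which produces
# two-level column names such as ('right', 'Sex').
labeled = pd.concat([df_a, df_b], axis=1, keys=["left", "right"])
print(labeled[("right", "Sex")].tolist())  # ['male', 'female']
```

Option 1 is only safe when the duplicated columns hold the same values, as they do here.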


### 1.4 Concatenate sub_data1 and sub_data3 by columns with an inner join

In [8]:
data4 = pd.concat([sub_data1,sub_data3],axis = 1, join = 'inner')  # only the 51 rows whose index is in both tables (50 to 100) are kept

In [9]:
print('data shape:',sub_data1.shape)
print('data shape:',sub_data3.shape)
print('data shape:',data4.shape)

data shape: (101, 3)
data shape: (101, 3)
data shape: (51, 6)


In [10]:
data4.head()

| | PassengerId | Sex | Fare | PassengerId | Sex | Fare |
|---|---|---|---|---|---|---|
| 50 | 51 | male | 39.6875 | 51 | male | 39.6875 |
| 51 | 52 | male | 7.8 | 52 | male | 7.8 |
| 52 | 53 | female | 76.7292 | 53 | female | 76.7292 |
| 53 | 54 | female | 26.0 | 54 | female | 26.0 |
| 54 | 55 | male | 61.9792 | 55 | male | 61.9792 |
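The 51 rows come from the fact that `concat` with `axis=1` aligns rows by index label, not by position. A small sketch with hypothetical tables whose indexes overlap only on labels 1 and 2:

```python
import pandas as pd

# Hypothetical tables whose index labels overlap on 1 and 2 only.
df_a = pd.DataFrame({"Fare": [7.25, 71.28, 7.92]}, index=[0, 1, 2])
df_b = pd.DataFrame({"Pclass": [1, 3, 2]}, index=[1, 2, 3])

# join='inner' keeps only the labels present in both tables.
inner = pd.concat([df_a, df_b], axis=1, join="inner")
print(inner.index.tolist())  # [1, 2]

# The default join='outer' keeps every label and fills the gaps with NaN.
outer = pd.concat([df_a, df_b], axis=1)
print(outer.index.tolist())               # [0, 1, 2, 3]
print(pd.isna(outer.loc[0, "Pclass"]))    # True
```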


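### 1.5 Further study

The first argument of concat is a list, so several tables can be combined at once with [df1,df2,df3,df4].

Run `pd.concat??` to read the documentation and learn how the other parameters work,
such as `ignore_index` and `verify_integrity`.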

In [11]:
# pd.concat?? 
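As a rough illustration of those two parameters, the sketch below uses hypothetical one-column tables. `overlap_detected` is a helper variable introduced only for this example.

```python
import pandas as pd

# Hypothetical tables that both use the default index labels 0 and 1.
df_a = pd.DataFrame({"Fare": [7.25, 71.28]})
df_b = pd.DataFrame({"Fare": [7.92, 53.10]})

# verify_integrity=True raises ValueError when the result would
# contain duplicate index labels.
try:
    pd.concat([df_a, df_b], verify_integrity=True)
    overlap_detected = False
except ValueError:
    overlap_detected = True
print(overlap_detected)  # True

# The list can hold more than two tables; ignore_index=True renumbers
# the rows, so duplicate labels are no longer an issue.
combined = pd.concat([df_a, df_b, df_a], ignore_index=True)
print(combined.shape)           # (6, 1)
print(combined.index.tolist())  # [0, 1, 2, 3, 4, 5]
```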

## 2. Merge

### 2.1 Left-join sub_data1 and sub_data3 on PassengerId

In [12]:
data1 = pd.merge(sub_data1,sub_data3,on = ['PassengerId'], how ='left')

In [13]:
print('data shape:',sub_data1.shape)
print('data shape:',sub_data3.shape)
print('data shape:',data1.shape)

data shape: (101, 3)
data shape: (101, 3)
data shape: (101, 5)


In [14]:
data1.head()

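| | PassengerId | Sex_x | Fare_x | Sex_y | Fare_y |
|---|---|---|---|---|---|
| 0 | 1 | male | 7.25 | NaN | NaN |
| 1 | 2 | female | 71.2833 | NaN | NaN |
| 2 | 3 | female | 7.925 | NaN | NaN |
| 3 | 4 | female | 53.1 | NaN | NaN |
| 4 | 5 | male | 8.05 | NaN | NaN |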
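The empty `_y` columns in the first rows mean those passengers have no match in the right table. One way to make this explicit is the `indicator` parameter of `pd.merge`, sketched here on hypothetical toy tables:

```python
import pandas as pd

# Hypothetical toy tables; only PassengerId 51 appears in both.
left = pd.DataFrame({"PassengerId": [1, 2, 51], "Fare": [7.25, 71.28, 39.69]})
right = pd.DataFrame({"PassengerId": [51, 52], "Fare": [39.69, 7.80]})

# indicator=True adds a `_merge` column that records where each row came from.
merged = pd.merge(left, right, on="PassengerId", how="left", indicator=True)
print(merged.columns.tolist())
# ['PassengerId', 'Fare_x', 'Fare_y', '_merge']
print(merged["_merge"].tolist())
# ['left_only', 'left_only', 'both']
print(merged["Fare_y"].isna().tolist())
# [True, True, False]
```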


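### 2.2 Inner-join sub_data1 and sub_data4 on 'PassengerId' and 'Sex'

merge joins two tables at a time, but it can join on several keys at once. Here `PassengerId` is the index of sub_data4; `on` accepts index level names as well as column names.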

In [15]:
data3 = pd.merge(sub_data1,sub_data4,on =['PassengerId','Sex'], how ='inner' )

In [16]:
print('data shape:',sub_data1.shape)
print('data shape:',sub_data4.shape)
print('data shape:',data3.shape)

data shape: (101, 3)
data shape: (101, 3)
data shape: (51, 5)


In [17]:
data3.head()

| | PassengerId | Sex | Fare | SibSp | Pclass |
|---|---|---|---|---|---|
| 0 | 51 | male | 39.6875 | 4 | 3 |
| 1 | 52 | male | 7.8 | 0 | 3 |
| 2 | 53 | female | 76.7292 | 1 | 1 |
| 3 | 54 | female | 26.0 | 1 | 2 |
| 4 | 55 | male | 61.9792 | 0 | 1 |
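When several keys are given, a row is kept only if all of them match. The sketch below uses hypothetical toy tables in which passenger 2 has conflicting `Sex` values, so that row drops out of the inner join:

```python
import pandas as pd

# Hypothetical toy tables; the Sex value for PassengerId 2 differs on purpose.
left = pd.DataFrame({"PassengerId": [1, 2, 3],
                     "Sex": ["male", "female", "female"],
                     "Fare": [7.25, 71.28, 7.92]})
right = pd.DataFrame({"PassengerId": [1, 2, 3],
                      "Sex": ["male", "male", "female"],
                      "Pclass": [3, 1, 3]})

# Both PassengerId and Sex must agree for a row to be kept.
both = pd.merge(left, right, on=["PassengerId", "Sex"], how="inner")
print(both["PassengerId"].tolist())  # [1, 3]

# Sex is a join key here, so it appears once and gets no _x/_y suffix.
print(both.columns.tolist())  # ['PassengerId', 'Sex', 'Fare', 'Pclass']
```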


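### 2.3 Inner-join sub_data1 and sub_data4 on PassengerId

#### When both tables have columns with the same name
Columns with the same name get a suffix, here `_left` and `_right`. The default for `suffixes` is `_x` and `_y`.

Note: when one of the join keys is an index, you can turn the index into a column first and then merge. The more common approach, though, is to do it in one step through parameters.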

In [18]:
data2 = pd.merge(sub_data1,sub_data4, 
                     left_on =['PassengerId'],right_index =  True, 
                     how ='inner', suffixes=('_left', '_right'))

In [19]:
print('data shape:',sub_data1.shape)
print('data shape:',sub_data4.shape)
print('data shape:',data2.shape)

data shape: (101, 3)
data shape: (101, 3)
data shape: (51, 6)


In [20]:
data2.head()

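| | PassengerId | Sex_left | Fare | Sex_right | SibSp | Pclass |
|---|---|---|---|---|---|---|
| 50 | 51 | male | 39.6875 | male | 4 | 3 |
| 51 | 52 | male | 7.8 | male | 0 | 3 |
| 52 | 53 | female | 76.7292 | female | 1 | 1 |
| 53 | 54 | female | 26.0 | female | 1 | 2 |
| 54 | 55 | male | 61.9792 | male | 0 | 1 |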
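The two routes mentioned in the note (turning the index into a column first, or using `right_index=True`) can be compared side by side. This is a sketch on hypothetical toy tables; `two_step` and `one_step` are names made up for the comparison.

```python
import pandas as pd

# Hypothetical toy tables; `right` uses PassengerId as its index.
left = pd.DataFrame({"PassengerId": [1, 2, 3],
                     "Sex": ["male", "female", "female"]})
right = pd.DataFrame({"Sex": ["female", "female"], "Pclass": [1, 3]},
                     index=pd.Index([2, 3], name="PassengerId"))

# Two steps: turn the index into a column, then merge column on column.
two_step = pd.merge(left, right.reset_index(), on="PassengerId",
                    how="inner", suffixes=("_left", "_right"))

# One step: match the left column against the right index directly.
one_step = pd.merge(left, right, left_on="PassengerId", right_index=True,
                    how="inner", suffixes=("_left", "_right"))

print(two_step.columns.tolist())
# ['PassengerId', 'Sex_left', 'Sex_right', 'Pclass']
print(one_step.columns.tolist())
# ['PassengerId', 'Sex_left', 'Sex_right', 'Pclass']

# Both routes return the same rows.
print(two_step.values.tolist() == one_step.values.tolist())  # True
```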


### 2.4 Further study

Run `pd.merge??` to read the documentation and learn how the other parameters work,
such as `left_index` and `right_on`.

In [21]:
# pd.merge??
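As one possible illustration of `left_index` and `right_on`, the sketch below joins the index of one hypothetical table against a differently named column of another. The tables `fares` and `classes` and the column name `pid` are made up for this example.

```python
import pandas as pd

# Hypothetical toy tables: `fares` is indexed by PassengerId, while
# `classes` stores the same identifier in a column called `pid`.
fares = pd.DataFrame({"Fare": [7.25, 71.28]},
                     index=pd.Index([1, 2], name="PassengerId"))
classes = pd.DataFrame({"pid": [1, 2, 3], "Pclass": [3, 1, 3]})

# left_index=True uses the index of the left table as its join key;
# right_on names the key column of the right table.
merged = pd.merge(fares, classes, left_index=True, right_on="pid", how="inner")
print(merged.columns.tolist())  # ['Fare', 'pid', 'Pclass']
print(merged["pid"].tolist())   # [1, 2]
```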

# Thanks for watching
GitHub code: https://github.com/kevin-meng/learn-data-analysis-the-hard-way

![](../pics/thankyou.png)
