## How to Merge DataFrames in Pandas

### Syntax of merge:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=False,
         suffixes=('_x', '_y'), copy=True, indicator=False,
         validate=None)  
* left, right: the DataFrames, or named Series, to merge
* how: the join type, one of 'left', 'right', 'outer', 'inner'
* on: the join key; both left and right must contain it
* left_on: the key in the left DataFrame or Series
* right_on: the key in the right DataFrame or Series
* left_index, right_index: join on the index instead of a regular column
* suffixes: a pair of suffixes that are appended automatically when column names collide; the default is ('_x', '_y')
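
As a rough illustration of `left_on`, `right_on` and `right_index`, the sketch below joins two toy frames whose key column has a different name on each side. The frames and the column name `student_id` are made up for this example and do not come from the dataset used later.

```python
import pandas as pd

# Hypothetical toy data: the key is "sno" on the left and "student_id" on the right.
students = pd.DataFrame({"sno": [11, 12, 13], "name": ["name_a", "name_b", "name_c"]})
ages = pd.DataFrame({"student_id": [11, 12, 14], "age": [21, 22, 24]})

# left_on / right_on: join columns that carry different names on each side.
by_column = pd.merge(students, ages, left_on="sno", right_on="student_id", how="inner")
print(by_column)

# right_index=True: join the left "sno" column against the index of the right frame.
ages_indexed = ages.set_index("student_id")
by_index = pd.merge(students, ages_indexed, left_on="sno", right_index=True, how="inner")
print(by_index)
```

Only students 11 and 12 appear in both frames, so each inner join keeps two rows.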


### 1. A join example on the MovieLens movie dataset

In [1]:
import pandas as pd

In [2]:
df_ratings = pd.read_csv(
    "./datas/movielens-1m/ratings.dat", 
    sep="::",
    engine='python', 
    names="UserID::MovieID::Rating::Timestamp".split("::")
)

In [3]:
df_ratings.head()

| | UserID | MovieID | Rating | Timestamp |
| --- | --- | --- | --- | --- |
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |


In [4]:
df_users = pd.read_csv(
    "./datas/movielens-1m/users.dat", 
    sep="::",
    engine='python', 
    names="UserID::Gender::Age::Occupation::Zip-code".split("::")
)

In [5]:
df_users.head()

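| | UserID | Gender | Age | Occupation | Zip-code |
| --- | --- | --- | --- | --- | --- |
| 0 | 1 | F | 1 | 10 | 48067 |
| 1 | 2 | M | 56 | 16 | 70072 |
| 2 | 3 | M | 25 | 15 | 55117 |
| 3 | 4 | M | 45 | 7 | 2460 |
| 4 | 5 | M | 25 | 20 | 55455 |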


In [6]:
df_movies = pd.read_csv(
    "./datas/movielens-1m/movies.dat", 
    sep="::",
    engine='python', 
    names="MovieID::Title::Genres".split("::")
)

In [7]:
df_movies.head()

| | MovieID | Title | Genres |
| --- | --- | --- | --- |
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [8]:
# Attach each user's profile to their ratings; both frames share the UserID key
df_ratings_users = pd.merge(
    df_ratings, df_users, left_on="UserID", right_on="UserID", how="inner"
)

In [9]:
df_ratings_users.head()

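| | UserID | MovieID | Rating | Timestamp | Gender | Age | Occupation | Zip-code |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 1193 | 5 | 978300760 | F | 1 | 10 | 48067 |
| 1 | 1 | 661 | 3 | 978302109 | F | 1 | 10 | 48067 |
| 2 | 1 | 914 | 3 | 978301968 | F | 1 | 10 | 48067 |
| 3 | 1 | 3408 | 4 | 978300275 | F | 1 | 10 | 48067 |
| 4 | 1 | 2355 | 5 | 978824291 | F | 1 | 10 | 48067 |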
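
When the key column has the same name on both sides, `on="UserID"` is a shorter way to write the same join. The sketch below checks this on tiny stand-in frames (made-up rows, not the MovieLens files), so it runs without the dataset.

```python
import pandas as pd

# Tiny stand-in frames (not the MovieLens files) that share a "UserID" column.
ratings = pd.DataFrame({"UserID": [1, 1, 2], "MovieID": [1193, 661, 1193], "Rating": [5, 3, 5]})
users = pd.DataFrame({"UserID": [1, 2], "Gender": ["F", "M"]})

# Explicit form: name the key separately for each side.
explicit = pd.merge(ratings, users, left_on="UserID", right_on="UserID", how="inner")
# Shorthand form: one shared key name.
shorthand = pd.merge(ratings, users, on="UserID", how="inner")

# Both spellings should produce the same frame.
print(explicit.equals(shorthand))
```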


In [10]:
# Attach each movie's title and genres via the shared MovieID key
df_ratings_users_movies = pd.merge(
    df_ratings_users, df_movies, left_on="MovieID", right_on="MovieID", how="inner"
)

In [11]:
df_ratings_users_movies.head()

| | UserID | MovieID | Rating | Timestamp | Gender | Age | Occupation | Zip-code | Title | Genres |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 1193 | 5 | 978300760 | F | 1 | 10 | 48067 | One Flew Over the Cuckoo's Nest (1975) | Drama |
| 1 | 2 | 1193 | 5 | 978298413 | M | 56 | 16 | 70072 | One Flew Over the Cuckoo's Nest (1975) | Drama |
| 2 | 12 | 1193 | 4 | 978220179 | M | 25 | 12 | 32793 | One Flew Over the Cuckoo's Nest (1975) | Drama |
| 3 | 15 | 1193 | 4 | 978199279 | M | 25 | 7 | 22903 | One Flew Over the Cuckoo's Nest (1975) | Drama |
| 4 | 17 | 1193 | 5 | 978158471 | M | 50 | 1 | 95350 | One Flew Over the Cuckoo's Nest (1975) | Drama |
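
A quick sanity check after chaining inner joins like these is to compare row counts: if every rating has a matching user and movie, and the keys are unique in the lookup tables, no rating should be lost or duplicated. The sketch below assumes small made-up frames (with placeholder titles) rather than the real files.

```python
import pandas as pd

# Made-up stand-ins for the three tables; titles are placeholders.
ratings = pd.DataFrame({"UserID": [1, 1, 2], "MovieID": [1193, 661, 1193], "Rating": [5, 3, 5]})
users = pd.DataFrame({"UserID": [1, 2], "Gender": ["F", "M"]})
movies = pd.DataFrame({"MovieID": [661, 1193], "Title": ["title_a", "title_b"]})

# Chain the two inner joins: ratings -> users -> movies.
merged = ratings.merge(users, on="UserID").merge(movies, on="MovieID")

# Every rating matched exactly one user and one movie, so the count is unchanged.
rows_kept = len(merged) == len(ratings)
print(len(ratings), len(merged), rows_kept)
```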


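### 2. Understanding how row counts line up in a merge

The following relationships need to be understood correctly:
* one-to-one: the join key is unique on both sides
  - for example, (student ID, name) merge (student ID, age)
  - rows in the result per key: 1*1
* one-to-many: the key is unique on the left but not on the right
  - for example, (student ID, name) merge (student ID, [Chinese score, math score, English score])
  - rows in the result per key: 1*N
* many-to-many: the key is unique on neither side
  - for example, (student ID, [Chinese score, math score, English score]) merge (student ID, [basketball, football, table tennis])
  - rows in the result per key: M*N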
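
The `validate` parameter from the signature can check which of these relationships actually holds. The following is a minimal sketch on made-up frames: the merge passes as `one_to_many`, but raises `pd.errors.MergeError` when `one_to_one` is requested because `sno` 11 repeats on the right.

```python
import pandas as pd

# Made-up frames: "sno" is unique in names but repeats in grades.
names = pd.DataFrame({"sno": [11, 12], "name": ["name_a", "name_b"]})
grades = pd.DataFrame({"sno": [11, 11, 12], "grade": [88, 90, 66]})

# one_to_many: the key must be unique on the left only, so this passes.
ok = pd.merge(names, grades, on="sno", validate="one_to_many")
print(len(ok))

# one_to_one: the key must be unique on both sides, so this raises.
try:
    pd.merge(names, grades, on="sno", validate="one_to_one")
    one_to_one_ok = True
except pd.errors.MergeError as exc:
    one_to_one_ok = False
    print("not one-to-one:", exc)
```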

#### 2.1 One-to-one merge

In [33]:
left = pd.DataFrame({'sno': [11, 12, 13, 14],
                      'name': ['name_a', 'name_b', 'name_c', 'name_d']
                    })
left

| | sno | name |
| --- | --- | --- |
| 0 | 11 | name_a |
| 1 | 12 | name_b |
| 2 | 13 | name_c |
| 3 | 14 | name_d |


In [34]:
right = pd.DataFrame({'sno': [11, 12, 13, 14],
                      'age': ['21', '22', '23', '24']
                    })
right

| | sno | age |
| --- | --- | --- |
| 0 | 11 | 21 |
| 1 | 12 | 22 |
| 2 | 13 | 23 |
| 3 | 14 | 24 |


In [36]:
# One-to-one relationship: the result has 4 rows
pd.merge(left, right, on='sno')

| | sno | name | age |
| --- | --- | --- | --- |
| 0 | 11 | name_a | 21 |
| 1 | 12 | name_b | 22 |
| 2 | 13 | name_c | 23 |
| 3 | 14 | name_d | 24 |


#### 2.2 One-to-many merge

Note: rows on the "one" side are duplicated, once for each matching row on the "many" side.

In [37]:
left = pd.DataFrame({'sno': [11, 12, 13, 14],
                      'name': ['name_a', 'name_b', 'name_c', 'name_d']
                    })
left

| | sno | name |
| --- | --- | --- |
| 0 | 11 | name_a |
| 1 | 12 | name_b |
| 2 | 13 | name_c |
| 3 | 14 | name_d |


In [42]:
# Grade strings are a subject followed by a score: 语文 = Chinese, 数学 = math, 英语 = English
right = pd.DataFrame({'sno': [11, 11, 11, 12, 12, 13],
                       'grade': ['语文88', '数学90', '英语75','语文66', '数学55', '英语29']
                     })
right

| | sno | grade |
| --- | --- | --- |
| 0 | 11 | 语文88 |
| 1 | 11 | 数学90 |
| 2 | 11 | 英语75 |
| 3 | 12 | 语文66 |
| 4 | 12 | 数学55 |
| 5 | 13 | 英语29 |


In [44]:
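# The row count follows the "many" side: each left row repeats once per matching right row.
# sno 14 has no match on the right, so the default inner join drops it.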
pd.merge(left, right, on='sno')

| | sno | name | grade |
| --- | --- | --- | --- |
| 0 | 11 | name_a | 语文88 |
| 1 | 11 | name_a | 数学90 |
| 2 | 11 | name_a | 英语75 |
| 3 | 12 | name_b | 语文66 |
| 4 | 12 | name_b | 数学55 |
| 5 | 13 | name_c | 英语29 |
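
To see which left rows found no partner, one option is the `indicator` parameter from the signature. The sketch below rebuilds the same keys with simplified numeric grades (an assumption made for brevity) and uses `how="left"` so that unmatched rows are kept and flagged in the `_merge` column.

```python
import pandas as pd

# Same keys as the example above; grades simplified to plain numbers for brevity.
left = pd.DataFrame({"sno": [11, 12, 13, 14],
                     "name": ["name_a", "name_b", "name_c", "name_d"]})
right = pd.DataFrame({"sno": [11, 11, 11, 12, 12, 13],
                      "grade": [88, 90, 75, 66, 55, 29]})

# how="left" keeps every left row; indicator=True adds a "_merge" column
# saying whether each row came from both sides or only from the left.
checked = pd.merge(left, right, on="sno", how="left", indicator=True)

# Student IDs that have no grade rows at all.
unmatched = checked.loc[checked["_merge"] == "left_only", "sno"].tolist()
print(unmatched)
```

Here only `sno` 14 is flagged as `left_only`, which is the row the inner join leaves out.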


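#### 2.3 Many-to-many merge

Note: the row counts multiply. A key that appears M times on the left and N times on the right produces M*N rows in the result.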

In [46]:
# 喜爱体育 = favorite sport; 篮球 = basketball, 羽毛球 = badminton, 乒乓球 = table tennis, 足球 = football
left = pd.DataFrame({'sno': [11, 11, 12, 12, 12],
                      '喜爱体育': ['篮球', '羽毛球', '乒乓球', '篮球', "足球"]
                    })
left

| | sno | 喜爱体育 |
| --- | --- | --- |
| 0 | 11 | 篮球 |
| 1 | 11 | 羽毛球 |
| 2 | 12 | 乒乓球 |
| 3 | 12 | 篮球 |
| 4 | 12 | 足球 |


In [47]:
# Grade strings are a subject followed by a score: 语文 = Chinese, 数学 = math, 英语 = English
right = pd.DataFrame({'sno': [11, 11, 11, 12, 12, 13],
                       'grade': ['语文88', '数学90', '英语75','语文66', '数学55', '英语29']
                     })
right

| | sno | grade |
| --- | --- | --- |
| 0 | 11 | 语文88 |
| 1 | 11 | 数学90 |
| 2 | 11 | 英语75 |
| 3 | 12 | 语文66 |
| 4 | 12 | 数学55 |
| 5 | 13 | 英语29 |


In [48]:
pd.merge(left, right, on='sno')

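| | sno | 喜爱体育 | grade |
| --- | --- | --- | --- |
| 0 | 11 | 篮球 | 语文88 |
| 1 | 11 | 篮球 | 数学90 |
| 2 | 11 | 篮球 | 英语75 |
| 3 | 11 | 羽毛球 | 语文88 |
| 4 | 11 | 羽毛球 | 数学90 |
| 5 | 11 | 羽毛球 | 英语75 |
| 6 | 12 | 乒乓球 | 语文66 |
| 7 | 12 | 乒乓球 | 数学55 |
| 8 | 12 | 篮球 | 语文66 |
| 9 | 12 | 篮球 | 数学55 |
| 10 | 12 | 足球 | 语文66 |
| 11 | 12 | 足球 | 数学55 |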
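
The multiplication can be checked by counting how often each key occurs on each side. This is a small sketch that keeps only the `sno` columns from the frames above: `sno` 11 contributes 2*3 rows, `sno` 12 contributes 3*2 rows, and `sno` 13 contributes none because it is missing on the left.

```python
import pandas as pd

# Only the key columns from the many-to-many example are needed for counting.
left = pd.DataFrame({"sno": [11, 11, 12, 12, 12]})
right = pd.DataFrame({"sno": [11, 11, 11, 12, 12, 13]})

# Occurrences of each key on each side.
left_counts = left["sno"].value_counts()
right_counts = right["sno"].value_counts()

# Multiply per key; a key missing on one side counts as 0 (the inner join drops it).
expected_rows = int(left_counts.mul(right_counts, fill_value=0).sum())
actual_rows = len(pd.merge(left, right, on="sno"))
print(expected_rows, actual_rows)
```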


In [49]:
### 3. Understanding the difference between left join, right join, and inner join

In [50]:
### 4. What to do when non-key columns share the same name