# 07. Merging the Pandas DataFrames - Joining

- So far you have learned about the Pandas DataFrame, a 2D, mutable, heterogeneous tabular data structure.
- It has 2 axes, rows and columns.
- You can use:
    - `df.merge()` to merge DataFrames
    - `pd.concat()` to concatenate pandas objects
    - `df.join()` to join two DataFrames.
- Merge and Join are discussed here.

### Import the Pandas and NumPy Libraries

In [1]:
# Import the necessary libraries
import pandas as pd
import numpy as np

## 1. Merge DataFrames using [`df.merge()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) or [`pd.merge()`](https://pandas.pydata.org/docs/reference/api/pandas.merge.html)

- Pandas provides a join function similar to SQL joins, called **merge**.
- It is a high-performance, in-memory join operation.
- Two DataFrames are merged on one or more key columns: rows are matched where the key values are equal. When you use `on`, the key columns must exist in both DataFrames.
- For left, right, and outer merges, if a key combination is missing from one of the tables, the corresponding values in the merged table are filled with **NaN**.

- A named Series object is treated as a DataFrame with a single named column.
- The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored.
- Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on.
- When performing a cross merge, no column specifications to merge on are allowed.
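
The points above about named Series and index handling can be sketched as follows. The frames `left` and `extra` are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
left = pd.DataFrame({'pkey': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})

# A named Series is treated like a DataFrame with one column called 'C'.
extra = pd.Series(['C0', 'C1', 'C2'], index=['K0', 'K1', 'K2'], name='C')

# Column on the left matched against the index on the right.
col_to_index = left.merge(extra, left_on='pkey', right_index=True)

# Index on both sides: the index is carried into the result.
index_to_index = left.set_index('pkey').merge(
    extra, left_index=True, right_index=True
)

print(col_to_index)
print(index_to_index)
```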

``` python
DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
```
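
Two parameters from the signature that the examples in this notebook do not exercise are `suffixes` and `validate`. The sketch below uses small hypothetical frames (`left`, `right`, `dup_right`) to show what they do; treat it as an illustration rather than a complete tour of the signature.

```python
import pandas as pd

# Hypothetical frames that share a non-key column named 'value'.
left = pd.DataFrame({'pkey': ['K0', 'K1'], 'value': [1, 2]})
right = pd.DataFrame({'pkey': ['K0', 'K1'], 'value': [10, 20]})

# Overlapping non-key columns get suffixes; the default pair is ('_x', '_y').
merged = left.merge(right, on='pkey', suffixes=('_left', '_right'))
print(merged.columns.tolist())

# validate checks the declared key relationship and raises
# pandas.errors.MergeError when the keys are not unique as promised.
dup_right = pd.DataFrame({'pkey': ['K0', 'K0'], 'value': [10, 20]})
try:
    left.merge(dup_right, on='pkey', validate='one_to_one')
    validation_failed = False
except pd.errors.MergeError:
    validation_failed = True

print(validation_failed)
```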

Depending on the situation, you can use an appropriate method to merge the two DataFrames.

| Merge |	SQL	 | Description |
|:---|:---|:---|
| left	| LEFT OUTER JOIN |	Use keys from left frame only |
| right	| RIGHT OUTER JOIN |	Use keys from right frame only |
| outer	| FULL OUTER JOIN |	Use union of keys from both frames |
| inner	| INNER JOIN |	Use intersection of keys from both frames |
| cross | -- | Creates the cartesian product from both frames, preserves the order of the left keys|

### 1.1. Merge DataFrames on a single key
The simplest case merges two DataFrames on a single key column, `pkey`.

In [2]:
d1 = {
    'pkey': ['K0', 'K1', 'K2', 'K3', 'K4'],
    'A': ['A0', 'A1', 'A2', 'A3', 'A4'],
    'B': ['B0', 'B1', 'B2', 'B3', 'B5']
}
df1 = pd.DataFrame(d1)

d2 = {
    'pkey': ['K0', 'K1', 'K2', 'K3', 'K4'],
    'C': ['C0', 'C1', 'C2', 'C3', 'C4'],
    'D': ['D0', 'D1', 'D2', 'D3', 'D4']
}
df2 = pd.DataFrame(d2)

In [3]:
display(df1, df2)

Unnamed: 0,pkey,A,B
0,K0,A0,B0
1,K1,A1,B1
2,K2,A2,B2
3,K3,A3,B3
4,K4,A4,B5


Unnamed: 0,pkey,C,D
0,K0,C0,D0
1,K1,C1,D1
2,K2,C2,D2
3,K3,C3,D3
4,K4,C4,D4


In [4]:
df3 = pd.merge(df1, df2, on='pkey')
df3

Unnamed: 0,pkey,A,B,C,D
0,K0,A0,B0,C0,D0
1,K1,A1,B1,C1,D1
2,K2,A2,B2,C2,D2
3,K3,A3,B3,C3,D3
4,K4,A4,B5,C4,D4


#### Alternatively, you can call the method form on the left DataFrame:

In [5]:
df3 = df1.merge(df2, on='pkey')
df3

Unnamed: 0,pkey,A,B,C,D
0,K0,A0,B0,C0,D0
1,K1,A1,B1,C1,D1
2,K2,A2,B2,C2,D2
3,K3,A3,B3,C3,D3
4,K4,A4,B5,C4,D4


### 1.2. Merge DataFrames on combined keys
To merge two DataFrames on a combination of keys, pass a list of column names to `on`.

In [6]:
d1 = {
    'pkey1': ['K0', 'K1', 'K2', 'K3', 'K4'],
    'pkey2': ['K10', 'K11', 'K12', 'K13', 'K14'],
    'A': ['A0', 'A1', 'A2', 'A3', 'A4'],
    'B': ['B0', 'B1', 'B2', 'B3', 'B5']
}
df1 = pd.DataFrame(d1)

d2 = {
    'pkey1': ['K0', 'K1', 'K2', 'K3', 'K4'],
    'pkey2': ['K10', 'K11', 'K12', 'K13', 'K14'],
    'C': ['C0', 'C1', 'C2', 'C3', 'C4'],
    'D': ['D0', 'D1', 'D2', 'D3', 'D4']
}
df2 = pd.DataFrame(d2)

In [7]:
display(df1, df2)

Unnamed: 0,pkey1,pkey2,A,B
0,K0,K10,A0,B0
1,K1,K11,A1,B1
2,K2,K12,A2,B2
3,K3,K13,A3,B3
4,K4,K14,A4,B5


Unnamed: 0,pkey1,pkey2,C,D
0,K0,K10,C0,D0
1,K1,K11,C1,D1
2,K2,K12,C2,D2
3,K3,K13,C3,D3
4,K4,K14,C4,D4


In [8]:
pd.merge(df1, df2, on=['pkey1', 'pkey2'])

Unnamed: 0,pkey1,pkey2,A,B,C,D
0,K0,K10,A0,B0,C0,D0
1,K1,K11,A1,B1,C1,D1
2,K2,K12,A2,B2,C2,D2
3,K3,K13,A3,B3,C3,D3
4,K4,K14,A4,B5,C4,D4
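
When the key columns are named differently in the two DataFrames, `left_on` and `right_on` from the signature take the place of `on`. This is a minimal sketch with hypothetical frames (`left_df`, `right_df`) and hypothetical column names (`key_a`, `key_b`).

```python
import pandas as pd

# Hypothetical frames: the same logical keys under different column names.
left_df = pd.DataFrame({
    'pkey1': ['K0', 'K1'],
    'pkey2': ['K10', 'K11'],
    'A': ['A0', 'A1'],
})
right_df = pd.DataFrame({
    'key_a': ['K0', 'K1'],
    'key_b': ['K10', 'K99'],   # second row does not match on the combined key
    'C': ['C0', 'C1'],
})

# The lists are paired positionally: pkey1 with key_a, pkey2 with key_b.
merged = pd.merge(
    left_df, right_df,
    left_on=['pkey1', 'pkey2'],
    right_on=['key_a', 'key_b'],
)
print(merged)
```

Only the row where both key columns agree survives the default inner merge, and both sets of key columns are kept in the result.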


### 1.3. Using how options in merge

**how**: {'left', 'right', 'outer', 'inner', 'cross'}, default 'inner'
Type of merge to be performed.

- left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
- right: use only keys from right frame, similar to a SQL right outer join; preserve key order.
- outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
- inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.
- cross: creates the cartesian product from both frames, preserves the order of the left keys.

In [9]:
d1 = {
    'pkey1': ['K0', 'K0', 'K1', 'K3', 'K4'],
    'A': ['A0', 'A1', 'A2', 'A3', 'A4'],
    'B': ['B0', 'B1', 'B2', 'B3', 'B4']
}
df1 = pd.DataFrame(d1)
df1

Unnamed: 0,pkey1,A,B
0,K0,A0,B0
1,K0,A1,B1
2,K1,A2,B2
3,K3,A3,B3
4,K4,A4,B4


In [10]:
dr = {
    'pkey1': ['K0', 'K2', 'K2', 'K4', 'K5'],
    'C': ['A0', 'A1', 'A2', 'A3', 'A4'],
    'D': ['B0', 'B1', 'B2', 'B3', 'B4']
}
df2 = pd.DataFrame(dr)
df2

Unnamed: 0,pkey1,C,D
0,K0,A0,B0
1,K2,A1,B1
2,K2,A2,B2
3,K4,A3,B3
4,K5,A4,B4


In [11]:
pd.merge(df1, df2, on='pkey1', how='left')

Unnamed: 0,pkey1,A,B,C,D
0,K0,A0,B0,A0,B0
1,K0,A1,B1,A0,B0
2,K1,A2,B2,,
3,K3,A3,B3,,
4,K4,A4,B4,A3,B3


In [12]:
pd.merge(df1, df2, on='pkey1', how='right')

Unnamed: 0,pkey1,A,B,C,D
0,K0,A0,B0,A0,B0
1,K0,A1,B1,A0,B0
2,K2,,,A1,B1
3,K2,,,A2,B2
4,K4,A4,B4,A3,B3
5,K5,,,A4,B4


In [13]:
pd.merge(df1, df2, on='pkey1', how='outer')

Unnamed: 0,pkey1,A,B,C,D
0,K0,A0,B0,A0,B0
1,K0,A1,B1,A0,B0
2,K1,A2,B2,,
3,K3,A3,B3,,
4,K4,A4,B4,A3,B3
5,K2,,,A1,B1
6,K2,,,A2,B2
7,K5,,,A4,B4
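
To see which side each row of an outer merge came from, the `indicator` parameter from the signature adds a `_merge` column. The sketch below rebuilds the same `df1` and `df2` used in this section and counts the rows per origin. It counts rather than relying on row order because the ordering of an outer merge may differ between pandas versions.

```python
import pandas as pd

df1 = pd.DataFrame({
    'pkey1': ['K0', 'K0', 'K1', 'K3', 'K4'],
    'A': ['A0', 'A1', 'A2', 'A3', 'A4'],
    'B': ['B0', 'B1', 'B2', 'B3', 'B4'],
})
df2 = pd.DataFrame({
    'pkey1': ['K0', 'K2', 'K2', 'K4', 'K5'],
    'C': ['A0', 'A1', 'A2', 'A3', 'A4'],
    'D': ['B0', 'B1', 'B2', 'B3', 'B4'],
})

# indicator=True adds a categorical '_merge' column with the values
# 'left_only', 'right_only', or 'both'.
outer = pd.merge(df1, df2, on='pkey1', how='outer', indicator=True)
counts = outer['_merge'].value_counts()
print(counts)
```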


In [14]:
pd.merge(df1, df2, on='pkey1', how='inner')

Unnamed: 0,pkey1,A,B,C,D
0,K0,A0,B0,A0,B0
1,K0,A1,B1,A0,B0
2,K4,A4,B4,A3,B3


In [15]:
pd.merge(df1, df2, how='cross')

Unnamed: 0,pkey1_x,A,B,pkey1_y,C,D
0,K0,A0,B0,K0,A0,B0
1,K0,A0,B0,K2,A1,B1
2,K0,A0,B0,K2,A2,B2
3,K0,A0,B0,K4,A3,B3
4,K0,A0,B0,K5,A4,B4
5,K0,A1,B1,K0,A0,B0
6,K0,A1,B1,K2,A1,B1
7,K0,A1,B1,K2,A2,B2
8,K0,A1,B1,K4,A3,B3
9,K0,A1,B1,K5,A4,B4
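
The output above lists only the first 10 rows. A cross merge pairs every row of the left frame with every row of the right frame, so two five-row frames give 5 × 5 = 25 rows. A quick check with the same data:

```python
import pandas as pd

df1 = pd.DataFrame({
    'pkey1': ['K0', 'K0', 'K1', 'K3', 'K4'],
    'A': ['A0', 'A1', 'A2', 'A3', 'A4'],
    'B': ['B0', 'B1', 'B2', 'B3', 'B4'],
})
df2 = pd.DataFrame({
    'pkey1': ['K0', 'K2', 'K2', 'K4', 'K5'],
    'C': ['A0', 'A1', 'A2', 'A3', 'A4'],
    'D': ['B0', 'B1', 'B2', 'B3', 'B4'],
})

# No `on` is allowed for a cross merge; the shared column name 'pkey1'
# is disambiguated with the default suffixes '_x' and '_y'.
cross = pd.merge(df1, df2, how='cross')
print(cross.shape)  # (25, 6)
```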


_______________
## 2. Join DataFrames using [`df.join()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html)
- The join method takes two DataFrames and joins them on their indexes (the `on` parameter lets you pick a column of the left DataFrame to match against the index of the right one).
- If the two DataFrames have overlapping column names, you must pass `lsuffix` and/or `rsuffix` to rename them; otherwise the join raises a `ValueError`. The two DataFrames below share the column name `pkey`.

Source: [Geeksforgeeks](https://www.geeksforgeeks.org/what-is-the-difference-between-join-and-merge-in-pandas/)



```python
DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
```
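
The role of `lsuffix` and `rsuffix` can be sketched with two small frames that share a column name. The frames here are shortened, hypothetical versions of the data used in this section.

```python
import pandas as pd

# Shortened illustrative frames; both have a column called 'pkey'.
df1 = pd.DataFrame({'pkey': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
df2 = pd.DataFrame({'pkey': ['K0', 'K1', 'K2'], 'C': ['C0', 'C1', 'C2']})

# Without suffixes, the overlapping column name makes join raise ValueError.
try:
    df1.join(df2)
    overlap_error = None
except ValueError as exc:
    overlap_error = str(exc)
print(overlap_error)

# With suffixes, the rows are aligned on the default integer index and
# both 'pkey' columns are kept under new names.
joined = df1.join(df2, lsuffix='_left', rsuffix='_right')
print(joined.columns.tolist())
```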

In [33]:
d1 = {
    'pkey': ['K0', 'K1', 'K2', 'K3', 'K4'],
    'A': ['A0', 'A1', 'A2', 'A3', 'A4'],
    'B': ['B0', 'B1', 'B2', 'B3', 'B5']
}
df1 = pd.DataFrame(d1)

d2 = {
    'pkey': ['K0', 'K1', 'K2', 'K3', 'K4'],
    'C': ['C0', 'C1', 'C2', 'C3', 'C4'],
    'D': ['D0', 'D1', 'D2', 'D3', 'D4']
}
df2 = pd.DataFrame(d2)

In [34]:
display(df1, df2)

Unnamed: 0,pkey,A,B
0,K0,A0,B0
1,K1,A1,B1
2,K2,A2,B2
3,K3,A3,B3
4,K4,A4,B5


Unnamed: 0,pkey,C,D
0,K0,C0,D0
1,K1,C1,D1
2,K2,C2,D2
3,K3,C3,D3
4,K4,C4,D4


In [35]:
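# `on` must name a column (or index level) of df1; df1 has no column called 'index', hence the KeyError
df1.join(df2, on='index')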

KeyError: 'index'
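
A working use of `on` names a column of the calling DataFrame and matches it against the index of the other one. This is a minimal sketch with shortened, illustrative frames; it is one possible way to use `on`, not the approach the notebook takes next.

```python
import pandas as pd

df1 = pd.DataFrame({'pkey': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
df2 = pd.DataFrame({'pkey': ['K0', 'K1', 'K2'], 'C': ['C0', 'C1', 'C2']})

# Move 'pkey' into the index of the right frame only, then match the
# 'pkey' column of df1 against that index.
joined = df1.join(df2.set_index('pkey'), on='pkey')
print(joined)
```

Because `pkey` is now the index of the right frame rather than a column, there is no column-name overlap and no suffix is needed.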

In [36]:
df1.set_index('pkey', inplace=True)
df2.set_index('pkey', inplace=True)

In [37]:
df1.join(df2)

pkey,A,B,C,D
K0,A0,B0,C0,D0
K1,A1,B1,C1,D1
K2,A2,B2,C2,D2
K3,A3,B3,C3,D3
K4,A4,B5,C4,D4
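
`df.join()` defaults to `how='left'`, which the example above does not reveal because every key matches. The sketch below uses hypothetical frames with one unmatched key on each side. It also shows the index-on-index `pd.merge()` call that gives the same result for these frames.

```python
import pandas as pd

# Hypothetical frames indexed by 'pkey'; K2 and K3 have no partner.
df1 = pd.DataFrame(
    {'A': ['A0', 'A1', 'A2']},
    index=pd.Index(['K0', 'K1', 'K2'], name='pkey'),
)
df2 = pd.DataFrame(
    {'C': ['C0', 'C1', 'C3']},
    index=pd.Index(['K0', 'K1', 'K3'], name='pkey'),
)

left_join = df1.join(df2)                # default how='left': K2 gets NaN in C
inner_join = df1.join(df2, how='inner')  # only K0 and K1 remain

# Index-on-index merge with the same `how`.
via_merge = pd.merge(df1, df2, left_index=True, right_index=True, how='left')

print(left_join)
print(inner_join)
print(left_join.equals(via_merge))
```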


_________________
## References:
https://github.com/jakevdp/PythonDataScienceHandbook