# 07. Merging the Pandas DataFrames - Joining

- So far you have learned about the pandas DataFrame, a two-dimensional, mutable, heterogeneous tabular data structure. Two DataFrames can be combined in several ways.
- You can use:
    - `df.merge()` or `pd.merge()` to merge two DataFrames on columns or indexes
    - `df.join()` to join two DataFrames on their indexes
    - `pd.concat()` to concatenate DataFrames or Series along an axis
- Merge, join, and concatenation are all covered below.

**NOTE:** These operations accept Series as well as DataFrames (for `merge`, the Series must be named).

### Import the Pandas and NumPy Libraries

In [3]:
# Import the necessary libraries
import pandas as pd
import numpy as np

## 1. Merge DataFrames using [`df.merge()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) or [`pd.merge()`](https://pandas.pydata.org/docs/reference/api/pandas.merge.html)

- pandas provides a SQL-style join operation called **merge**.
- It is a high-performance, in-memory join operation.
- By default, two DataFrames are merged on the columns they have in common; rows are matched where the key values are equal. With `on=`, the named columns must exist in both DataFrames.
- In a left, right, or outer merge, if a key combination appears in only one of the two tables, the columns coming from the other table are filled with **NaN**.

- A named Series object is treated as a DataFrame with a single named column.
- The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored.
- Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on.
- When performing a cross merge, no column specifications to merge on are allowed.
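
The points about named Series and index handling can be illustrated with a small sketch. The frame `df` and the Series `s` below are made-up data for illustration, not part of the examples that follow.

```python
import pandas as pd

df = pd.DataFrame({"pkey": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})

# A Series must have a name to be merged; the name becomes the column label.
s = pd.Series(["C0", "C1"], index=["K0", "K1"], name="C")

# Match the "pkey" column on the left against the index of the Series.
# how="left" keeps every row of df; K2 has no match, so its "C" value is NaN.
merged = df.merge(s, left_on="pkey", right_index=True, how="left")
print(merged)
```

Here the Series behaves like a one-column DataFrame named `C`.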


``` python
DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
```

``` python
pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
```
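
Several parameters in these signatures are not used in the examples below. As a minimal sketch, assuming two hypothetical frames `left` and `right` whose key columns have different names, `left_on`/`right_on`, `suffixes`, and `indicator` could be combined like this:

```python
import pandas as pd

# Hypothetical data: the key is called "emp_id" on the left and "id" on the right,
# and both frames have a non-key column named "city".
left = pd.DataFrame({"emp_id": [1, 2, 3],
                     "name": ["Ann", "Bob", "Cy"],
                     "city": ["Oslo", "Rome", "Lima"]})
right = pd.DataFrame({"id": [2, 3, 4],
                      "city": ["Pune", "Kyiv", "Doha"]})

result = pd.merge(
    left, right,
    left_on="emp_id", right_on="id",   # key columns with different names
    how="outer",
    suffixes=("_left", "_right"),      # disambiguate the overlapping "city" column
    indicator=True,                    # adds a "_merge" column: left_only / right_only / both
)
print(result)
```

The `_merge` column is handy for checking which rows found a partner in the other frame.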

Depending on the situation, you can use an appropriate method to merge the two DataFrames.

| Merge |	SQL	 | Description |
|:---|:---|:---|
| left	| LEFT OUTER JOIN |	Use keys from left frame only |
| right	| RIGHT OUTER JOIN |	Use keys from right frame only |
| outer	| FULL OUTER JOIN |	Use union of keys from both frames |
| inner	| INNER JOIN |	Use intersection of keys from both frames |
| cross | CROSS JOIN | Create the Cartesian product of rows from both frames; preserves the order of the left keys |

### 1.1. Merge DataFrames on a single key
The simplest case is merging two DataFrames on a single key column, `pkey`.

In [None]:
d1 = {
    'pkey': ['K0', 'K1', 'K2', 'K3', 'K4'],
    'A': ['A0', 'A1', 'A2', 'A3', 'A4'],
    'B': ['B0', 'B1', 'B2', 'B3', 'B4']
}
df1 = pd.DataFrame(d1)

d2 = {
    'pkey': ['K0', 'K1', 'K2', 'K3', 'K4'],
    'C': ['C0', 'C1', 'C2', 'C3', 'C4'],
    'D': ['D0', 'D1', 'D2', 'D3', 'D4']
}
df2 = pd.DataFrame(d2)

In [None]:
display(df1, df2)

In [None]:
df3 = pd.merge(df1, df2, on='pkey')
df3

#### Alternatively, call `merge()` as a DataFrame method:

In [None]:
df3 = df1.merge(df2, on='pkey')
df3

### 1.2. Merge DataFrames on combined keys
To merge two DataFrames on a combination of keys, pass a list of column names to `on`.

In [None]:
d1 = {
    'pkey1': ['K0', 'K1', 'K2', 'K3', 'K4'],
    'pkey2': ['K10', 'K11', 'K12', 'K13', 'K14'],
    'A': ['A0', 'A1', 'A2', 'A3', 'A4'],
    'B': ['B0', 'B1', 'B2', 'B3', 'B4']
}
df1 = pd.DataFrame(d1)

d2 = {
    'pkey1': ['K0', 'K1', 'K2', 'K3', 'K4'],
    'pkey2': ['K10', 'K11', 'K12', 'K13', 'K14'],
    'C': ['C0', 'C1', 'C2', 'C3', 'C4'],
    'D': ['D0', 'D1', 'D2', 'D3', 'D4']
}
df2 = pd.DataFrame(d2)

In [None]:
display(df1, df2)

In [None]:
pd.merge(df1, df2, on=['pkey1', 'pkey2'])

### 1.3. Using how options in merge

**how {‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default 'inner'**
Type of merge to be performed.

- left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
- right: use only keys from right frame, similar to a SQL right outer join; preserve key order.
- outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
- inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.
- cross: creates the cartesian product from both frames, preserves the order of the left keys.

In [None]:
d1 = {
    'pkey1': ['K0', 'K0', 'K1', 'K3', 'K4'],
    'A': ['A0', 'A1', 'A2', 'A3', 'A4'],
    'B': ['B0', 'B1', 'B2', 'B3', 'B4']
}
df1 = pd.DataFrame(d1)
df1

In [None]:
d2 = {
    'pkey1': ['K0', 'K2', 'K2', 'K4', 'K5'],
    'C': ['C0', 'C1', 'C2', 'C3', 'C4'],
    'D': ['D0', 'D1', 'D2', 'D3', 'D4']
}
df2 = pd.DataFrame(d2)
df2

In [None]:
pd.merge(df1, df2, on='pkey1', how='left')

In [None]:
pd.merge(df1, df2, on='pkey1', how='right')

In [None]:
pd.merge(df1, df2, on='pkey1', how='outer')

In [None]:
pd.merge(df1, df2, on='pkey1', how='inner')

In [None]:
pd.merge(df1, df2, how='cross')
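
As a quick check on the five merges above, the following sketch rebuilds the same key columns (with one value column per frame to keep it short) and counts the rows each `how` option produces. Duplicate keys multiply: `K0` appears twice on the left and once on the right, so it contributes two rows wherever it matches.

```python
import pandas as pd

df1 = pd.DataFrame({"pkey1": ["K0", "K0", "K1", "K3", "K4"],
                    "A": ["A0", "A1", "A2", "A3", "A4"]})
df2 = pd.DataFrame({"pkey1": ["K0", "K2", "K2", "K4", "K5"],
                    "C": ["C0", "C1", "C2", "C3", "C4"]})

row_counts = {}
for how in ["left", "right", "outer", "inner"]:
    row_counts[how] = len(pd.merge(df1, df2, on="pkey1", how=how))

# A cross merge takes no key; every left row is paired with every right row.
row_counts["cross"] = len(pd.merge(df1, df2, how="cross"))

print(row_counts)
# {'left': 5, 'right': 6, 'outer': 8, 'inner': 3, 'cross': 25}
```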

_______________
## 2. Join DataFrames using [`df.join()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html)
- The `join()` method combines two DataFrames on their **indexes** (optionally, `on=` selects a column of the left DataFrame to match against the index of the right one).
- If the two DataFrames have overlapping column names, `join()` raises an error unless you supply `lsuffix` and/or `rsuffix` to rename them. The two DataFrames below share the column name `pkey`.

Source: [Geeksforgeeks](https://www.geeksforgeeks.org/what-is-the-difference-between-join-and-merge-in-pandas/)



```python
DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
```
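
A minimal sketch of the `on`, `lsuffix`, and `rsuffix` parameters, using small made-up frames (`left`, `right`, `overlap`) rather than the ones defined below:

```python
import pandas as pd

left = pd.DataFrame({"pkey": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right = pd.DataFrame({"C": ["C0", "C1"]}, index=["K0", "K1"])

# on="pkey": match the left *column* "pkey" against the *index* of right.
# The default how="left" keeps all left rows; K2 gets NaN in "C".
joined = left.join(right, on="pkey")
print(joined)

# When both frames have a column named "A", suffixes are needed to tell them apart.
overlap = pd.DataFrame({"A": ["X0", "X1"]}, index=["K0", "K1"])
joined_suffixed = left.join(overlap, on="pkey", lsuffix="_left", rsuffix="_right")
print(joined_suffixed)
```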

In [None]:
d1 = {
    'pkey': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
    'A': ['A0', 'A1', 'A2', 'A3', 'A4', "A5"],
    'B': ['B0', 'B1', 'B2', 'B3', 'B4', 'B5']
}
df1 = pd.DataFrame(d1)

d2 = {
    'pkey': ['K0', 'K1', 'K2', 'K3', 'K4'],
    'C': ['C0', 'C1', 'C2', 'C3', 'C4'],
    'D': ['D0', 'D1', 'D2', 'D3', 'D4']
}
df2 = pd.DataFrame(d2)

In [None]:
display(df1, df2)

In [None]:
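# Join on the default integer index. Both frames have a 'pkey' column,
# so suffixes are required to keep the two apart.
df1.join(df2, lsuffix='_left', rsuffix='_right')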

In [None]:
df1.set_index('pkey', inplace=True)
df2.set_index('pkey', inplace=True)

In [None]:
df1.join(df2)

**`merge()` is considerably more flexible than `join()`: it can match on arbitrary columns as well as on indexes.**
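
To make the relationship concrete, here is a small sketch (with made-up frames) showing that an index-on-index `join()` can be written as a `merge()` with `left_index=True` and `right_index=True`:

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1", "A2"]}, index=["K0", "K1", "K2"])
df2 = pd.DataFrame({"C": ["C0", "C1"]}, index=["K0", "K1"])

via_join = df1.join(df2)  # how='left' is the default for join()

# The same operation spelled out with merge(); note that merge() defaults to
# how='inner', so how='left' has to be requested explicitly.
via_merge = pd.merge(df1, df2, left_index=True, right_index=True, how="left")

print(via_join.equals(via_merge))
```

The different defaults (`how='left'` for `join()`, `how='inner'` for `merge()`) are a common source of surprises when switching between the two.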

## 3. Concatenation in Pandas
- Concatenation simply stitches two or more DataFrames (or Series) together along a particular axis.

In [11]:
df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index = [0, 1, 2, 3],
)

df2 = pd.DataFrame(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],
    },
    index = [3, 4, 5, 6],
)

display(df1, df2)

| | A | B | C | D |
|:---|:---|:---|:---|:---|
| 0 | A0 | B0 | C0 | D0 |
| 1 | A1 | B1 | C1 | D1 |
| 2 | A2 | B2 | C2 | D2 |
| 3 | A3 | B3 | C3 | D3 |


| | A | B | C | D |
|:---|:---|:---|:---|:---|
| 3 | A4 | B4 | C4 | D4 |
| 4 | A5 | B5 | C5 | D5 |
| 5 | A6 | B6 | C6 | D6 |
| 6 | A7 | B7 | C7 | D7 |


In [12]:
pd.concat([df1, df2])

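| | A | B | C | D |
|:---|:---|:---|:---|:---|
| 0 | A0 | B0 | C0 | D0 |
| 1 | A1 | B1 | C1 | D1 |
| 2 | A2 | B2 | C2 | D2 |
| 3 | A3 | B3 | C3 | D3 |
| 3 | A4 | B4 | C4 | D4 |
| 4 | A5 | B5 | C5 | D5 |
| 5 | A6 | B6 | C6 | D6 |
| 6 | A7 | B7 | C7 | D7 |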
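
Note that the result above contains the index label 3 twice, because `pd.concat()` keeps the original index labels by default. The sketch below, using smaller made-up frames, shows three options that `pd.concat()` offers for this situation: `ignore_index`, `keys`, and `verify_integrity`.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1"]}, index=[0, 1])
df2 = pd.DataFrame({"A": ["A2", "A3"]}, index=[1, 2])

# Default: original labels are kept, so label 1 appears twice.
stacked = pd.concat([df1, df2])

# ignore_index=True discards the old labels and renumbers 0..n-1.
renumbered = pd.concat([df1, df2], ignore_index=True)

# keys=... adds an outer index level recording which input each row came from.
labelled = pd.concat([df1, df2], keys=["first", "second"])

# verify_integrity=True raises instead of silently producing duplicate labels.
overlap_detected = False
try:
    pd.concat([df1, df2], verify_integrity=True)
except ValueError as exc:
    overlap_detected = True
    print("overlap detected:", exc)
```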


In [8]:
df2 = pd.DataFrame(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],
    },
    index = [4, 5, 6, 7],
)
pd.concat([df1, df2])

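| | A | B | C | D |
|:---|:---|:---|:---|:---|
| 0 | A0 | B0 | C0 | D0 |
| 1 | A1 | B1 | C1 | D1 |
| 2 | A2 | B2 | C2 | D2 |
| 3 | A3 | B3 | C3 | D3 |
| 4 | A4 | B4 | C4 | D4 |
| 5 | A5 | B5 | C5 | D5 |
| 6 | A6 | B6 | C6 | D6 |
| 7 | A7 | B7 | C7 | D7 |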


In [16]:
df2 = pd.DataFrame(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],
    },
    index = [4, 5, 6, 7],
)
pd.concat(objs=[df1, df2], axis=1)

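| | A | B | C | D | A | B | C | D |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 0 | A0 | B0 | C0 | D0 | NaN | NaN | NaN | NaN |
| 1 | A1 | B1 | C1 | D1 | NaN | NaN | NaN | NaN |
| 2 | A2 | B2 | C2 | D2 | NaN | NaN | NaN | NaN |
| 3 | A3 | B3 | C3 | D3 | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | A4 | B4 | C4 | D4 |
| 5 | NaN | NaN | NaN | NaN | A5 | B5 | C5 | D5 |
| 6 | NaN | NaN | NaN | NaN | A6 | B6 | C6 | D6 |
| 7 | NaN | NaN | NaN | NaN | A7 | B7 | C7 | D7 |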
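
With `axis=1`, rows are aligned on index labels, and labels missing from one input are filled with NaN, as the result above shows. A short sketch with made-up data shows the `join` parameter, which controls this alignment, and that Series can be concatenated the same way:

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1", "A2"]}, index=[0, 1, 2])
df2 = pd.DataFrame({"B": ["B1", "B2", "B3"]}, index=[1, 2, 3])

# join="outer" (the default) keeps the union of index labels and fills gaps with NaN.
outer = pd.concat([df1, df2], axis=1)

# join="inner" keeps only the labels present in every input.
inner = pd.concat([df1, df2], axis=1, join="inner")

# Series work too: along axis=1, each Series name becomes a column label.
s1 = pd.Series(["x", "y"], name="left")
s2 = pd.Series(["z"], name="right")
wide = pd.concat([s1, s2], axis=1)

print(outer, inner, wide, sep="\n\n")
```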


_________________
## References:
1. https://github.com/jakevdp/PythonDataScienceHandbook
2. https://pandas.pydata.org/docs/user_guide/merging.html#