# Join and Merge data
* Author: Owen Chen
* History: - 3/31/2022 started

This notebook contains examples of how to use pandas to join and merge different datasets:

1. Join rows - concat vertically (`axis=0`)
2. Join columns - concat horizontally (`axis=1`)
3. Merge by keys - inner join
4. Merge by keys - left join
5. Merge by keys - right join
6. Merge by keys - outer join <br>
 
* <b> concat()</b> syntax:
```   
    pd.concat(
        objs,
        axis=0,
        join="outer",
        ignore_index=False,
        keys=None,
        levels=None,
        names=None,
        verify_integrity=False,
        copy=True,
    )
```
* <b> merge()</b> syntax:
```   
    pd.merge(
        left,
        right,
        how="inner",
        on=None,
        left_on=None,
        right_on=None,
        left_index=False,
        right_index=False,
        sort=False,
        suffixes=("_x", "_y"),
        copy=True,
        indicator=False,
        validate=None,
    )
```

In [59]:
import pandas as pd

In [97]:
# Create a DF with a dictionary
df1 = pd.DataFrame(
    {
        "id":[1,1,2,2],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
    )
df1

Unnamed: 0,id,A,B,C,D
0,1,A0,B0,C0,D0
1,1,A1,B1,C1,D1
2,2,A2,B2,C2,D2
3,2,A3,B3,C3,D3


In [98]:
# Create a DF with a list of lists
df2 = pd.DataFrame(
    [
        [2, "A3", "B3", "C3", "D3"],
        [2, "A4", "B4", "C4", "D4"],
        [3, "A5", "B5", "C5", "C3"],
        [3, "A6", "B6", "C6", "D6"]
    ],
    columns=['id','A2','B2','C2','D2']
)
df2

Unnamed: 0,id,A2,B2,C2,D2
0,2,A3,B3,C3,D3
1,2,A4,B4,C4,D4
2,3,A5,B5,C5,C3
3,3,A6,B6,C6,D6


## 1. Join rows - concat vertically

In [85]:
# Make sure both DataFrames have the same column names;
# keep df2's original names so they can be restored later
saved_df2_columns = df2.columns
df2.columns = df1.columns
df3 = pd.concat([df1, df2], ignore_index=True)
df3

Unnamed: 0,id,A,B,C,D
0,1,A0,B0,C0,D0
1,1,A1,B1,C1,D1
2,2,A2,B2,C2,D2
3,2,A3,B3,C3,D3
4,2,A3,B3,C3,D3
5,2,A4,B4,C4,D4
6,3,A5,B5,C5,C3
7,3,A6,B6,C6,D6
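
When the column names are not aligned first, `concat` still works, but the result depends on the `join` argument. The sketch below uses two small hypothetical frames (`left` and `right` are illustrative names, not part of this notebook's data) to show the difference between the default `join="outer"` and `join="inner"`.

```python
import pandas as pd

# Hypothetical frames that share only the "id" column
left = pd.DataFrame({"id": [1, 2], "A": ["A0", "A1"]})
right = pd.DataFrame({"id": [3, 4], "A2": ["A2", "A3"]})

# Default join="outer": keep the union of the columns, fill the gaps with NaN
outer = pd.concat([left, right], ignore_index=True)

# join="inner": keep only the columns present in every frame
inner = pd.concat([left, right], join="inner", ignore_index=True)

print(outer)
print(inner)
```

Renaming the columns beforehand, as in the cell above, avoids the NaN-filled columns that the outer variant produces.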


## 2. Join columns - concat horizontally

In [77]:
df2.columns = saved_df2_columns
df3 = pd.concat([df1, df2], axis=1)
df3

Unnamed: 0,id,A,B,C,D,id.1,A2,B2,C2,D2
0,1,A0,B0,C0,D0,2,A3,B3,C3,D3
1,1,A1,B1,C1,D1,2,A4,B4,C4,D4
2,2,A2,B2,C2,D2,3,A5,B5,C5,C3
3,2,A3,B3,C3,D3,3,A6,B6,C6,D6
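
Note that `concat` with `axis=1` pairs rows by index label, not by position. `df1` and `df2` both use the default index 0..3, so they line up here. A minimal sketch with two hypothetical frames (`a` and `b`) whose indexes only partly overlap:

```python
import pandas as pd

# Hypothetical frames with partly overlapping index labels
a = pd.DataFrame({"A": ["A0", "A1"]}, index=[0, 1])
b = pd.DataFrame({"B": ["B1", "B2"]}, index=[1, 2])

# Rows are matched on index labels; labels missing on one side become NaN
side_by_side = pd.concat([a, b], axis=1)

# To pair rows purely by position, reset both indexes first
positional = pd.concat(
    [a.reset_index(drop=True), b.reset_index(drop=True)], axis=1
)

print(side_by_side)
print(positional)
```

If the frames come from filtering or sorting, their indexes may no longer match, so resetting them first is often the safer choice.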


In [78]:
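# Append a single row: ps_row is a Series (values taken from the output below)
ps_row = pd.Series({"id": 5, "A": "A5", "B": "B5", "C": "C5", "D": "D5"})
df3 = pd.concat([df1, ps_row.to_frame().T], axis=0, ignore_index=True)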
df3

Unnamed: 0,id,A,B,C,D
0,1,A0,B0,C0,D0
1,1,A1,B1,C1,D1
2,2,A2,B2,C2,D2
3,2,A3,B3,C3,D3
4,5,A5,B5,C5,D5


## 3. Merge - inner join

In [79]:
df1

Unnamed: 0,id,A,B,C,D
0,1,A0,B0,C0,D0
1,1,A1,B1,C1,D1
2,2,A2,B2,C2,D2
3,2,A3,B3,C3,D3


In [86]:
df2.columns = saved_df2_columns
df2

Unnamed: 0,id,A2,B2,C2,D2
0,2,A3,B3,C3,D3
1,2,A4,B4,C4,D4
2,3,A5,B5,C5,C3
3,3,A6,B6,C6,D6


In [87]:
# Inner join: keep only the ids present in both frames (here id=2)
df3 = pd.merge(df1, df2, how="inner", on="id")
df3

Unnamed: 0,id,A,B,C,D,A2,B2,C2,D2
0,2,A2,B2,C2,D2,A3,B3,C3,D3
1,2,A2,B2,C2,D2,A4,B4,C4,D4
2,2,A3,B3,C3,D3,A3,B3,C3,D3
3,2,A3,B3,C3,D3,A4,B4,C4,D4
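
The inner join returns four rows because `id=2` appears twice in each frame, and every left row is paired with every matching right row (2 x 2). The sketch below reproduces this on trimmed-down copies of the data (`left` and `right` are illustrative names) and uses the `validate` argument to make pandas raise an error when the keys are not unique.

```python
import pandas as pd
from pandas.errors import MergeError

# Trimmed-down copies of the notebook data (illustrative names)
left = pd.DataFrame({"id": [1, 1, 2, 2], "A": ["A0", "A1", "A2", "A3"]})
right = pd.DataFrame({"id": [2, 2, 3, 3], "A2": ["A3", "A4", "A5", "A6"]})

# Many-to-many match on id=2: 2 left rows x 2 right rows = 4 result rows
merged = pd.merge(left, right, how="inner", on="id")
print(len(merged))

# validate="one_to_one" checks key uniqueness before merging
try:
    pd.merge(left, right, on="id", validate="one_to_one")
    duplicates_detected = False
except MergeError:
    duplicates_detected = True

print(duplicates_detected)
```

Other accepted values for `validate` include `"one_to_many"` and `"many_to_one"`, which check uniqueness on only one side.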


## 4. Merge - Left join

In [88]:
# Left join: keep every row of df1; unmatched ids (id=1) get NaN in df2's columns
df3 = pd.merge(df1, df2, how="left", on="id")
df3

Unnamed: 0,id,A,B,C,D,A2,B2,C2,D2
0,1,A0,B0,C0,D0,,,,
1,1,A1,B1,C1,D1,,,,
2,2,A2,B2,C2,D2,A3,B3,C3,D3
3,2,A2,B2,C2,D2,A4,B4,C4,D4
4,2,A3,B3,C3,D3,A3,B3,C3,D3
5,2,A3,B3,C3,D3,A4,B4,C4,D4


## 5. Merge - Right join

In [89]:
# Right join: keep every row of df2; unmatched ids (id=3) get NaN in df1's columns
df3 = pd.merge(df1, df2, how="right", on="id")
df3

Unnamed: 0,id,A,B,C,D,A2,B2,C2,D2
0,2,A2,B2,C2,D2,A3,B3,C3,D3
1,2,A3,B3,C3,D3,A3,B3,C3,D3
2,2,A2,B2,C2,D2,A4,B4,C4,D4
3,2,A3,B3,C3,D3,A4,B4,C4,D4
4,3,,,,,A5,B5,C5,C3
5,3,,,,,A6,B6,C6,D6


## 6. Merge - outer join

In [90]:
# Outer join: keep the rows of both frames; NaN wherever one side has no match
df3 = pd.merge(df1, df2, how="outer", on="id")
df3

Unnamed: 0,id,A,B,C,D,A2,B2,C2,D2
0,1,A0,B0,C0,D0,,,,
1,1,A1,B1,C1,D1,,,,
2,2,A2,B2,C2,D2,A3,B3,C3,D3
3,2,A2,B2,C2,D2,A4,B4,C4,D4
4,2,A3,B3,C3,D3,A3,B3,C3,D3
5,2,A3,B3,C3,D3,A4,B4,C4,D4
6,3,,,,,A5,B5,C5,C3
7,3,,,,,A6,B6,C6,D6


### Check which table each record came from - indicator=True

In [99]:
df3 = pd.merge(df1, df2, how="outer", on="id", indicator=True)
df3

Unnamed: 0,id,A,B,C,D,A2,B2,C2,D2,_merge
0,1,A0,B0,C0,D0,,,,,left_only
1,1,A1,B1,C1,D1,,,,,left_only
2,2,A2,B2,C2,D2,A3,B3,C3,D3,both
3,2,A2,B2,C2,D2,A4,B4,C4,D4,both
4,2,A3,B3,C3,D3,A3,B3,C3,D3,both
5,2,A3,B3,C3,D3,A4,B4,C4,D4,both
6,3,,,,,A5,B5,C5,C3,right_only
7,3,,,,,A6,B6,C6,D6,right_only
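
The `_merge` column can be used as a filter, for example to keep only the rows that have no match in the other table (sometimes called an anti-join). A minimal sketch on trimmed-down copies of the data (`left` and `right` are illustrative names):

```python
import pandas as pd

# Trimmed-down copies of the notebook data (illustrative names)
left = pd.DataFrame({"id": [1, 1, 2, 2], "A": ["A0", "A1", "A2", "A3"]})
right = pd.DataFrame({"id": [2, 2, 3, 3], "A2": ["A3", "A4", "A5", "A6"]})

merged = pd.merge(left, right, how="outer", on="id", indicator=True)

# Rows that exist only in the left frame
left_only = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

# How many rows fall into each category
counts = merged["_merge"].value_counts()

print(left_only)
print(counts)
```

Filtering on `"right_only"` or `"both"` works the same way.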


In [92]:
# Different key names: rename df2's key column from id to id2
df2.rename(columns={'id':'id2'}, inplace=True)
df2

Unnamed: 0,id2,A2,B2,C2,D2
0,2,A3,B3,C3,D3
1,2,A4,B4,C4,D4
2,3,A5,B5,C5,C3
3,3,A6,B6,C6,D6


In [96]:
# default merge is an inner join - how="inner"
df3 = pd.merge(df1, df2, left_on='id', right_on='id2')
df3

Unnamed: 0,id,A,B,C,D,id2,A2,B2,C2,D2
0,2,A2,B2,C2,D2,2,A3,B3,C3,D3
1,2,A2,B2,C2,D2,2,A4,B4,C4,D4
2,2,A3,B3,C3,D3,2,A3,B3,C3,D3
3,2,A3,B3,C3,D3,2,A4,B4,C4,D4
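
With `left_on` and `right_on`, both key columns are kept in the result, and any other column name shared by the two frames gets a suffix. The sketch below uses two small hypothetical frames that share a `value` column to show the `suffixes` argument and how the redundant key could be dropped.

```python
import pandas as pd

# Hypothetical frames: different key names, one overlapping column ("value")
left = pd.DataFrame({"id": [1, 2], "value": ["L1", "L2"]})
right = pd.DataFrame({"id2": [2, 3], "value": ["R2", "R3"]})

merged = pd.merge(
    left,
    right,
    left_on="id",
    right_on="id2",
    suffixes=("_left", "_right"),  # replaces the default ("_x", "_y")
)

# id and id2 hold the same values after an inner join, so one can be dropped
merged = merged.drop(columns="id2")

print(merged)
```

Dropping `id2` is only safe for an inner join. In a left, right, or outer join the two key columns can differ, because one of them is NaN on unmatched rows.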
