## Combining DataFrames

#### [Examples](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)
---

In [1]:
import numpy as np
import pandas as pd

#### Concatenation

Directly "glue" DataFrames together with `pd.concat()`.

In [2]:
data_one = {'A': ['A0', 'A1', 'A2', 'A3'],'B': ['B0', 'B1', 'B2', 'B3']}
data_two = {'C': ['C0', 'C1', 'C2', 'C3'], 'D': ['D0', 'D1', 'D2', 'D3']}

one = pd.DataFrame(data_one)
two = pd.DataFrame(data_two)

In [3]:
one

Unnamed: 0,A,B
0,A0,B0
1,A1,B1
2,A2,B2
3,A3,B3


In [4]:
two

Unnamed: 0,C,D
0,C0,D0
1,C1,D1
2,C2,D2
3,C3,D3


#### Axis = 0 

Concatenate along rows: the frames are stacked vertically, and any column missing from one frame is filled with NaN.

In [5]:
axis0 = pd.concat([one, two], axis=0)
axis0

Unnamed: 0,A,B,C,D
0,A0,B0,,
1,A1,B1,,
2,A2,B2,,
3,A3,B3,,
0,,,C0,D0
1,,,C1,D1
2,,,C2,D2
3,,,C3,D3
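
Note that the row labels 0 to 3 appear twice in the output above. As a minimal sketch (rebuilding `one` and `two` so the snippet runs on its own), passing `ignore_index=True` to `pd.concat` renumbers the rows instead of keeping the original labels:

```python
import pandas as pd

one = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'], 'B': ['B0', 'B1', 'B2', 'B3']})
two = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'], 'D': ['D0', 'D1', 'D2', 'D3']})

# ignore_index=True discards the original row labels and renumbers from 0
stacked = pd.concat([one, two], axis=0, ignore_index=True)
print(stacked.index.tolist())
```

This is usually worth doing when the original index carries no meaning, since duplicate labels can make later `.loc` lookups return more than one row.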


#### Axis = 1

Concatenate along columns: the frames are placed side by side, with rows aligned on the index.

In [6]:
axis1 = pd.concat([one, two], axis=1)
axis1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
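
The rows line up above because both frames share the index 0 to 3. The sketch below uses a hypothetical frame `shifted` (not part of the notebook) whose index only partly overlaps, to illustrate that `axis=1` aligns on index labels rather than on row position:

```python
import pandas as pd

one = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3']})
# Hypothetical frame whose labels (2..5) only partly overlap with one's (0..3)
shifted = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3']}, index=[2, 3, 4, 5])

# Default join='outer': keep the union of labels, fill the gaps with NaN
aligned = pd.concat([one, shifted], axis=1)

# join='inner': keep only the labels present in both frames
overlap = pd.concat([one, shifted], axis=1, join='inner')

print(aligned)
print(overlap)
```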


#### Axis = 0, with matching columns


In [7]:
two.columns = one.columns
pd.concat([one, two], axis=0)

Unnamed: 0,A,B
0,A0,B0
1,A1,B1
2,A2,B2
3,A3,B3
0,C0,D0
1,C1,D1
2,C2,D2
3,C3,D3
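
Assigning to `two.columns` changes `two` in place. If the original column names are still needed later, one possible alternative (a sketch, not the notebook's approach) is to rename a copy with `DataFrame.rename` before concatenating:

```python
import pandas as pd

one = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'], 'B': ['B0', 'B1', 'B2', 'B3']})
two = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'], 'D': ['D0', 'D1', 'D2', 'D3']})

# Map two's column names onto one's by position; rename returns a new frame
two_renamed = two.rename(columns=dict(zip(two.columns, one.columns)))

stacked = pd.concat([one, two_renamed], axis=0, ignore_index=True)
print(stacked)
print(two.columns.tolist())  # two itself is unchanged
```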


#### Merge

Two example tables: user registrations and user logins.

In [8]:
registrations = pd.DataFrame({'reg_id': [1, 2, 3, 4], 'name': ['Andrew', 'Bobo', 'Claire', 'David']})
logins = pd.DataFrame({'log_id':[1, 2, 3, 4],'name':['Xavier', 'Andrew', 'Yolanda', 'Bobo']})

In [9]:
registrations

Unnamed: 0,reg_id,name
0,1,Andrew
1,2,Bobo
2,3,Claire
3,4,David


In [10]:
logins

Unnamed: 0,log_id,name
0,1,Xavier
1,2,Andrew
2,3,Yolanda
3,4,Bobo


### Inner, Left, Right, and Outer Joins

`pd.merge()` merges pandas DataFrames on key columns, similar to a SQL join. The `how` parameter determines which rows are kept in the result.
See `help(pd.merge)` for the full parameter list.

---
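
As a quick way to see what `how` changes, the sketch below rebuilds the two example tables and counts the rows each join type returns. The dictionary name `row_counts` is only illustrative:

```python
import pandas as pd

registrations = pd.DataFrame({'reg_id': [1, 2, 3, 4],
                              'name': ['Andrew', 'Bobo', 'Claire', 'David']})
logins = pd.DataFrame({'log_id': [1, 2, 3, 4],
                       'name': ['Xavier', 'Andrew', 'Yolanda', 'Bobo']})

# Number of rows returned by each join type on the shared 'name' column
row_counts = {
    how: len(pd.merge(registrations, logins, how=how, on='name'))
    for how in ('inner', 'left', 'right', 'outer')
}
print(row_counts)
```

Only Andrew and Bobo appear in both tables, so the inner join is the smallest result and the outer join the largest.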

#### Inner Join

Keep only the rows whose key is present in **BOTH** tables. The join itself introduces no NaNs, because by definition a row appears in an inner join only if it has a match in both tables.

In [12]:
# Note: pd.merge takes the two DataFrames as separate arguments, not as a list like pd.concat
pd.merge(registrations, logins, how='inner', on='name')

Unnamed: 0,reg_id,name,log_id
0,1,Andrew,2
1,2,Bobo,4


In [13]:
# When `on` is omitted, pandas joins on the column names the two DataFrames share (here, only 'name')
pd.merge(registrations, logins, how='inner')

Unnamed: 0,reg_id,name,log_id
0,1,Andrew,2
1,2,Bobo,4


In [19]:
# pandas raises a KeyError if the "on" column isn't in both DataFrames
try:
    pd.merge(registrations, logins, how='inner', on='reg_id')
except KeyError as e:
    print(f'Key Error Message: {e}')

Key Error Message: 'reg_id'
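
One way to avoid this error is to check which column names the two tables share before choosing `on`. A minimal sketch using `Index.intersection` (the variable name `shared` is illustrative):

```python
import pandas as pd

registrations = pd.DataFrame({'reg_id': [1, 2, 3, 4],
                              'name': ['Andrew', 'Bobo', 'Claire', 'David']})
logins = pd.DataFrame({'log_id': [1, 2, 3, 4],
                       'name': ['Xavier', 'Andrew', 'Yolanda', 'Bobo']})

# Column names present in both frames are the valid candidates for `on`
shared = registrations.columns.intersection(logins.columns).tolist()
print(shared)

# 'reg_id' exists only in registrations, which is why on='reg_id' fails
print('reg_id' in logins.columns)
```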


#### Left Join

Match up AND include all rows from the left table.
<br>Show everyone who registered (the left table); anyone without login info gets NaN in the login columns.

In [20]:
pd.merge(registrations, logins, how='left')

Unnamed: 0,reg_id,name,log_id
0,1,Andrew,2.0
1,2,Bobo,4.0
2,3,Claire,
3,4,David,
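
In the output above, `log_id` is shown as `2.0` and `4.0` because a NumPy integer column cannot hold NaN, so pandas upcasts it to float. If integer IDs are preferred, one option (a sketch, assuming a pandas version with the nullable `Int64` dtype) is to convert the column after the merge:

```python
import pandas as pd

registrations = pd.DataFrame({'reg_id': [1, 2, 3, 4],
                              'name': ['Andrew', 'Bobo', 'Claire', 'David']})
logins = pd.DataFrame({'log_id': [1, 2, 3, 4],
                       'name': ['Xavier', 'Andrew', 'Yolanda', 'Bobo']})

left = pd.merge(registrations, logins, how='left', on='name')
print(left['log_id'].dtype)  # float, because of the missing values

# The nullable integer dtype keeps whole numbers and marks gaps as <NA>
left['log_id'] = left['log_id'].astype('Int64')
print(left)
```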


#### Right Join
Match up AND include all rows from the right table.
<br>Show everyone who logged in (the right table); anyone without registration info gets NaN in the registration columns.

In [21]:
pd.merge(registrations, logins, how='right')

Unnamed: 0,reg_id,name,log_id
0,,Xavier,1
1,1.0,Andrew,2
2,,Yolanda,3
3,2.0,Bobo,4
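
A right join can also be written as a left join with the two tables swapped. The sketch below checks this on the example data; the helper `normalise` is hypothetical and only puts both results into the same column and row order before comparing:

```python
import pandas as pd

registrations = pd.DataFrame({'reg_id': [1, 2, 3, 4],
                              'name': ['Andrew', 'Bobo', 'Claire', 'David']})
logins = pd.DataFrame({'log_id': [1, 2, 3, 4],
                       'name': ['Xavier', 'Andrew', 'Yolanda', 'Bobo']})

right = pd.merge(registrations, logins, how='right', on='name')
# Same tables in swapped order, joined with how='left'
left_swapped = pd.merge(logins, registrations, how='left', on='name')


def normalise(df):
    """Fix the column order and row order so two results can be compared."""
    cols = ['reg_id', 'name', 'log_id']
    return df[cols].sort_values('name').reset_index(drop=True)


same_rows = normalise(right).equals(normalise(left_swapped))
print(same_rows)
```

The two forms differ only in column order, so the choice between them is mostly a matter of which table reads more naturally as the "main" one.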


#### Outer Join

Include all rows found in either the left or the right table.
Show everyone who appears in the logins table or the registrations table, and fill any missing info with NaN.

In [22]:
pd.merge(registrations, logins, how='outer')

Unnamed: 0,reg_id,name,log_id
0,1.0,Andrew,2.0
1,2.0,Bobo,4.0
2,3.0,Claire,
3,4.0,David,
4,,Xavier,1.0
5,,Yolanda,3.0
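
With an outer join it is often useful to know which table each row came from. As a sketch, the `indicator=True` option of `pd.merge` adds a `_merge` column with the values `both`, `left_only`, or `right_only`:

```python
import pandas as pd

registrations = pd.DataFrame({'reg_id': [1, 2, 3, 4],
                              'name': ['Andrew', 'Bobo', 'Claire', 'David']})
logins = pd.DataFrame({'log_id': [1, 2, 3, 4],
                       'name': ['Xavier', 'Andrew', 'Yolanda', 'Bobo']})

# indicator=True records the source of every row in a '_merge' column
outer = pd.merge(registrations, logins, how='outer', on='name', indicator=True)
print(outer)

# Example: people who logged in without a registration record
unregistered = outer.loc[outer['_merge'] == 'right_only', 'name'].tolist()
print(unregistered)
```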


---
#### Join on Index or Column

Use combinations of `left_on`, `right_on`, `left_index`, and `right_index` to match a column or the index of one table against a column or the index of the other.

In [23]:
registrations

Unnamed: 0,reg_id,name
0,1,Andrew
1,2,Bobo
2,3,Claire
3,4,David


In [24]:
logins

Unnamed: 0,log_id,name
0,1,Xavier
1,2,Andrew
2,3,Yolanda
3,4,Bobo


In [25]:
registrations = registrations.set_index("name")
registrations

name,reg_id
Andrew,1
Bobo,2
Claire,3
David,4


In [30]:
pd.merge(registrations, logins, left_index=True, right_on='name')

Unnamed: 0,reg_id,log_id,name
1,1,2,Andrew
3,2,4,Bobo


In [31]:
pd.merge(logins, registrations, right_index=True, left_on='name')

Unnamed: 0,log_id,name,reg_id
1,2,Andrew,1
3,4,Bobo,2
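
For the column-against-index case, `DataFrame.join` is a possible shorthand: its `on` argument names a column of the calling frame and matches it against the index of the other frame. A minimal sketch, with `registrations_by_name` as an illustrative name for the indexed table:

```python
import pandas as pd

registrations = pd.DataFrame({'reg_id': [1, 2, 3, 4],
                              'name': ['Andrew', 'Bobo', 'Claire', 'David']})
logins = pd.DataFrame({'log_id': [1, 2, 3, 4],
                       'name': ['Xavier', 'Andrew', 'Yolanda', 'Bobo']})

registrations_by_name = registrations.set_index('name')

# Match logins['name'] against the index of registrations_by_name
joined = logins.join(registrations_by_name, on='name', how='inner')
print(joined)
```

Note that `join` defaults to `how='left'`, unlike `pd.merge`, which defaults to `how='inner'`, so the join type is spelled out here.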


#### Dealing with different key column names in joined tables

In [32]:
registrations = registrations.reset_index()
registrations

Unnamed: 0,name,reg_id
0,Andrew,1
1,Bobo,2
2,Claire,3
3,David,4


In [33]:
logins

Unnamed: 0,log_id,name
0,1,Xavier
1,2,Andrew
2,3,Yolanda
3,4,Bobo


In [34]:
registrations.columns = ['reg_name','reg_id']
registrations

Unnamed: 0,reg_name,reg_id
0,Andrew,1
1,Bobo,2
2,Claire,3
3,David,4


In [38]:
# Raises a MergeError: the two DataFrames no longer share a column name
try:
    pd.merge(registrations, logins)
except pd.errors.MergeError as e:
    print(f'Merge Error: {e}')

Merge Error: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False


In [39]:
pd.merge(registrations, logins, left_on='reg_name', right_on='name')

Unnamed: 0,reg_name,reg_id,log_id,name
0,Andrew,1,2,Andrew
1,Bobo,2,4,Bobo


In [40]:
pd.merge(registrations, logins, left_on='reg_name', right_on='name').drop('reg_name', axis=1)

Unnamed: 0,reg_id,log_id,name
0,1,2,Andrew
1,2,4,Bobo
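
Another option, sketched below, is to rename the key column on the fly so both tables share a name. The merge then produces a single key column and nothing needs to be dropped afterwards:

```python
import pandas as pd

registrations = pd.DataFrame({'reg_name': ['Andrew', 'Bobo', 'Claire', 'David'],
                              'reg_id': [1, 2, 3, 4]})
logins = pd.DataFrame({'log_id': [1, 2, 3, 4],
                       'name': ['Xavier', 'Andrew', 'Yolanda', 'Bobo']})

# rename returns a new frame, so registrations keeps its 'reg_name' column
merged = pd.merge(registrations.rename(columns={'reg_name': 'name'}),
                  logins, on='name')
print(merged)
```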


#### Pandas automatically tags duplicate columns

In [42]:
registrations.columns = ['name', 'id']
logins.columns = ['id', 'name']

In [43]:
registrations

Unnamed: 0,name,id
0,Andrew,1
1,Bobo,2
2,Claire,3
3,David,4


In [44]:
logins

Unnamed: 0,id,name
0,1,Xavier
1,2,Andrew
2,3,Yolanda
3,4,Bobo


In [45]:
# Overlapping non-key columns get a suffix: _x for the left table, _y for the right table
pd.merge(registrations, logins, on='name')

Unnamed: 0,name,id_x,id_y
0,Andrew,1,2
1,Bobo,2,4


In [46]:
pd.merge(registrations, logins, on='name', suffixes=('_reg', '_log'))

Unnamed: 0,name,id_reg,id_log
0,Andrew,1,2
1,Bobo,2,4


---