### Combining Dataframes

Often the data we need lives in two separate sources and has to be combined before we can work with it. If both sources are pandas dataframes, we can concatenate them with the **pd.concat()** function of pandas. There are two ways to concatenate two dataframes:
- Concatenate by **columns**
- Concatenate by **rows**
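
As a minimal sketch of the two directions, the snippet below uses two small frames whose names (`top`, `bottom`) are illustrative only. It assumes the default `axis=0` for stacking rows and `axis=1` for placing frames side by side.

```python
import pandas as pd

# Two small frames with the same column names (hypothetical example data)
top = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2']})
bottom = pd.DataFrame({'A': ['A3', 'A4'], 'B': ['B3', 'B4']})

by_rows = pd.concat([top, bottom], axis=0)     # stack one frame under the other
by_columns = pd.concat([top, bottom], axis=1)  # place the frames side by side

print(by_rows.shape)     # (4, 2)
print(by_columns.shape)  # (2, 4)
```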

In [1]:
import pandas as pd
import numpy as np

#### Concatenate by **columns**

In [3]:
df_one = pd.DataFrame({'A': ['A1', 'A2', 'A3', 'A4'], 'B': ['B1', 'B2', 'B3', 'B4']})

In [4]:
df_two = pd.DataFrame({'C': ['C1', 'C2', 'C3', 'C4'], 'D': ['D1', 'D2', 'D3', 'D4']})

In [5]:
df_one

Unnamed: 0,A,B
0,A1,B1
1,A2,B2
2,A3,B3
3,A4,B4


In [6]:
df_two

Unnamed: 0,C,D
0,C1,D1
1,C2,D2
2,C3,D3
3,C4,D4


In [7]:
pd.concat((df_one, df_two), axis=1)

Unnamed: 0,A,B,C,D
0,A1,B1,C1,D1
1,A2,B2,C2,D2
2,A3,B3,C3,D3
3,A4,B4,C4,D4
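
In the example above both dataframes share the index 0 to 3, so the rows line up. With `axis=1`, `pd.concat` aligns rows by index label rather than by position. The sketch below uses hypothetical frames `df_a` and `df_b` with only partly overlapping indexes to show the default behaviour (union of labels) and the `join='inner'` alternative.

```python
import pandas as pd

# Hypothetical frames whose indexes overlap only on labels 1 and 2
df_a = pd.DataFrame({'A': ['A1', 'A2', 'A3']}, index=[0, 1, 2])
df_b = pd.DataFrame({'C': ['C1', 'C2', 'C3']}, index=[1, 2, 3])

outer = pd.concat([df_a, df_b], axis=1)                # keeps every index label, fills gaps with NaN
inner = pd.concat([df_a, df_b], axis=1, join='inner')  # keeps only labels present in both frames

print(outer)
print(inner)
```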


#### Concatenate by **rows**

In [9]:
df_one = pd.DataFrame({'A': ['A1', 'A2', 'A3', 'A4'], 'B': ['B1', 'B2', 'B3', 'B4']})

In [10]:
df_two = pd.DataFrame({'A': ['C1', 'C2', 'C3', 'C4'], 'B': ['D1', 'D2', 'D3', 'D4']})

In [11]:
df_one

Unnamed: 0,A,B
0,A1,B1
1,A2,B2
2,A3,B3
3,A4,B4


In [12]:
df_two

Unnamed: 0,A,B
0,C1,D1
1,C2,D2
2,C3,D3
3,C4,D4


In [13]:
pd.concat((df_one, df_two))

Unnamed: 0,A,B
0,A1,B1
1,A2,B2
2,A3,B3
3,A4,B4
0,C1,D1
1,C2,D2
2,C3,D3
3,C4,D4


In [14]:
# concatenate 2 dataframes with different columns
df_one = pd.DataFrame({'A': ['A1', 'A2', 'A3', 'A4'], 'B': ['B1', 'B2', 'B3', 'B4']})
df_two = pd.DataFrame({'C': ['C1', 'C2', 'C3', 'C4'], 'D': ['D1', 'D2', 'D3', 'D4']})

In [16]:
pd.concat((df_one, df_two)) # columns missing from one dataframe are filled with NaN

Unnamed: 0,A,B,C,D
0,A1,B1,,
1,A2,B2,,
2,A3,B3,,
3,A4,B4,,
0,,,C1,D1
1,,,C2,D2
2,,,C3,D3
3,,,C4,D4


In [17]:
pd.concat((df_one, df_two)).reset_index()

Unnamed: 0,index,A,B,C,D
0,0,A1,B1,,
1,1,A2,B2,,
2,2,A3,B3,,
3,3,A4,B4,,
4,0,,,C1,D1
5,1,,,C2,D2
6,2,,,C3,D3
7,3,,,C4,D4


In [19]:
# rename the columns of one dataframe before concatenating
df_two.columns = df_one.columns
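
Assigning to `.columns` changes `df_two` in place and relies on both frames listing their columns in the same order. A possible alternative, sketched here with smaller copies of the frames, is `DataFrame.rename` with an explicit mapping, which returns a new dataframe and leaves the original untouched. The name `df_two_renamed` is introduced only for this illustration.

```python
import pandas as pd

df_one = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2']})
df_two = pd.DataFrame({'C': ['C1', 'C2'], 'D': ['D1', 'D2']})

# Explicit old-name -> new-name mapping; df_two itself is not modified
df_two_renamed = df_two.rename(columns={'C': 'A', 'D': 'B'})

combined = pd.concat([df_one, df_two_renamed])
print(combined)
```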

In [21]:
df_two

Unnamed: 0,A,B
0,C1,D1
1,C2,D2
2,C3,D3
3,C4,D4


In [22]:
df_one

Unnamed: 0,A,B
0,A1,B1
1,A2,B2
2,A3,B3
3,A4,B4


In [24]:
pd.concat((df_one, df_two), ignore_index=True)

Unnamed: 0,A,B
0,A1,B1
1,A2,B2
2,A3,B3
3,A4,B4
4,C1,D1
5,C2,D2
6,C3,D3
7,C4,D4
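
The two ways of renumbering rows shown so far differ in what happens to the old index. The sketch below, on smaller copies of the frames, assumes the goal is a clean 0..n-1 index: `reset_index()` keeps the old labels in a new `index` column, while `reset_index(drop=True)` and `ignore_index=True` discard them.

```python
import pandas as pd

df_one = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2']})
df_two = pd.DataFrame({'A': ['C1', 'C2'], 'B': ['D1', 'D2']})

stacked = pd.concat([df_one, df_two])           # index is 0, 1, 0, 1
kept = stacked.reset_index()                    # old labels move into an 'index' column
dropped = stacked.reset_index(drop=True)        # old labels are discarded
renumbered = pd.concat([df_one, df_two], ignore_index=True)  # renumbered during concat

print(list(stacked.index))     # [0, 1, 0, 1]
print(list(kept.columns))      # ['index', 'A', 'B']
print(list(renumbered.index))  # [0, 1, 2, 3]
```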


#### Merging Dataframes

Two dataframes can be merged based on:
- **how** : the type of merge ('inner', 'left', 'right', 'outer')
- **on** : the column(s) to merge on

Merging dataframes is similar to a **join** in SQL.
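
As a quick orientation before the individual cells, the sketch below merges the same two frames this section builds next (`df_reg` and `df_log`) with each value of `how` and records the number of rows returned. The `row_counts` dictionary is just a helper for this illustration.

```python
import pandas as pd

df_reg = pd.DataFrame({'reg_id': [1, 2, 3, 4],
                       'name': ['Rama', 'Krishna', 'Radha', 'Shyam']})
df_log = pd.DataFrame({'log_id': [1, 2, 3, 4],
                       'name': ['Kanha', 'Rama', 'Radha', 'Shyamlal']})

row_counts = {}
for how in ('inner', 'left', 'right', 'outer'):
    merged = pd.merge(left=df_reg, right=df_log, how=how, on='name')
    row_counts[how] = len(merged)

print(row_counts)  # {'inner': 2, 'left': 4, 'right': 4, 'outer': 6}
```

Only 'Rama' and 'Radha' appear in both frames, which is why the inner merge returns two rows and the outer merge returns six.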

In [25]:
import pandas as pd
import numpy as np

In [28]:
df_reg = pd.DataFrame({'reg_id':[1,2,3,4], 'name': ['Rama', 'Krishna', 'Radha', 'Shyam']})

In [29]:
df_reg

Unnamed: 0,reg_id,name
0,1,Rama
1,2,Krishna
2,3,Radha
3,4,Shyam


In [32]:
df_log = pd.DataFrame({'log_id':[1,2,3,4], 'name': ['Kanha','Rama','Radha', 'Shyamlal']})

In [33]:
df_log

Unnamed: 0,log_id,name
0,1,Kanha
1,2,Rama
2,3,Radha
3,4,Shyamlal


In [34]:
# inner merge on 'name' - the key column should identify each row uniquely, otherwise matching rows are repeated. Here we assume each name is unique
pd.merge(left=df_log, right=df_reg, how='inner', on='name')

Unnamed: 0,log_id,name,reg_id
0,2,Rama,1
1,3,Radha,3
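
To illustrate why the uniqueness of the key matters, the sketch below uses a modified, hypothetical `df_reg` in which 'Rama' appears twice. pandas does not reject duplicate keys by default; it pairs every match, so rows are repeated. The `validate` parameter of `pd.merge` can be used to turn that situation into an error.

```python
import pandas as pd

# Hypothetical data: 'Rama' is registered twice
df_reg = pd.DataFrame({'reg_id': [1, 2, 3], 'name': ['Rama', 'Rama', 'Radha']})
df_log = pd.DataFrame({'log_id': [1, 2], 'name': ['Rama', 'Radha']})

# The single 'Rama' in df_log matches both 'Rama' rows in df_reg
merged = pd.merge(left=df_log, right=df_reg, how='inner', on='name')
print(len(merged))  # 3

# validate='one_to_one' raises MergeError when a key is duplicated on either side
try:
    pd.merge(left=df_log, right=df_reg, how='inner', on='name', validate='one_to_one')
    validation_failed = False
except pd.errors.MergeError:
    validation_failed = True

print(validation_failed)  # True
```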


<span style="background-color:red; color:white; padding:2px">IMPORTANT</span>: `inner` returns only the **common/matching** rows from both dataframes. In an `inner` merge, the order in which the dataframes are passed does not change which rows match; it only affects the order of the columns and rows in the result.
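
A small sketch of that point, using the same two frames: swapping `left` and `right` in an inner merge keeps the same matched names and changes only the column order.

```python
import pandas as pd

df_reg = pd.DataFrame({'reg_id': [1, 2, 3, 4],
                       'name': ['Rama', 'Krishna', 'Radha', 'Shyam']})
df_log = pd.DataFrame({'log_id': [1, 2, 3, 4],
                       'name': ['Kanha', 'Rama', 'Radha', 'Shyamlal']})

log_first = pd.merge(left=df_log, right=df_reg, how='inner', on='name')
reg_first = pd.merge(left=df_reg, right=df_log, how='inner', on='name')

print(list(log_first.columns))  # ['log_id', 'name', 'reg_id']
print(list(reg_first.columns))  # ['reg_id', 'name', 'log_id']
print(set(log_first['name']) == set(reg_first['name']))  # True
```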

In [35]:
# left merge 
pd.merge(left=df_reg, right=df_log, how='left', on='name')

Unnamed: 0,reg_id,name,log_id
0,1,Rama,2.0
1,2,Krishna,
2,3,Radha,3.0
3,4,Shyam,


<span style="background-color:red; color:white; padding:2px">IMPORTANT</span>: `left` returns all the rows from the left dataframe, i.e. `df_reg`, together with the **common/matching** rows from the right dataframe. Where the right dataframe has no matching record, its columns are filled with null (NaN). In a `left`/`right` merge, the order in which the dataframes are passed matters.
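
One way to see which rows found a match is the `indicator` parameter of `pd.merge`, which adds a `_merge` column. This is an optional extra, sketched here on the same frames, and is not part of the original cells.

```python
import pandas as pd

df_reg = pd.DataFrame({'reg_id': [1, 2, 3, 4],
                       'name': ['Rama', 'Krishna', 'Radha', 'Shyam']})
df_log = pd.DataFrame({'log_id': [1, 2, 3, 4],
                       'name': ['Kanha', 'Rama', 'Radha', 'Shyamlal']})

# indicator=True adds a '_merge' column with 'both', 'left_only' or 'right_only'
merged = pd.merge(left=df_reg, right=df_log, how='left', on='name', indicator=True)

# Rama and Radha are 'both'; Krishna and Shyam are 'left_only'
print(merged[['name', '_merge']])
```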

In [36]:
# right merge
pd.merge(left=df_reg, right=df_log, how='right', on='name')

Unnamed: 0,reg_id,name,log_id
0,,Kanha,1
1,1.0,Rama,2
2,3.0,Radha,3
3,,Shyamlal,4


If the key columns have different names in the two dataframes but hold the same kind of data, we can use either of the approaches below to merge:

- rename the column first and then merge
- use the built-in `left_on` and `right_on` parameters of the merge function
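
The first approach, renaming before merging, might look like the sketch below. It assumes, as the cells that follow do, that `log_id` and `reg_id` refer to the same identifier; `df_log_renamed` is a name introduced only for this example.

```python
import pandas as pd

df_reg = pd.DataFrame({'reg_id': [1, 2, 3, 4],
                       'name': ['Rama', 'Krishna', 'Radha', 'Shyam']})
df_log = pd.DataFrame({'log_id': [1, 2, 3, 4],
                       'name': ['Kanha', 'Rama', 'Radha', 'Shyamlal']})

# Give the key the same name in both frames, then merge with a plain on=
df_log_renamed = df_log.rename(columns={'log_id': 'reg_id'})
merged = pd.merge(left=df_reg, right=df_log_renamed, on='reg_id', how='inner')

# 'name' exists in both frames and is not the key, so pandas adds _x/_y suffixes
print(list(merged.columns))  # ['reg_id', 'name_x', 'name_y']
```

Compared with `left_on`/`right_on`, this keeps a single key column in the result instead of both `reg_id` and `log_id`.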

In [37]:
pd.merge(left=df_reg, right=df_log, left_on='reg_id', right_on='log_id', how='inner')

Unnamed: 0,reg_id,name_x,log_id,name_y
0,1,Rama,1,Kanha
1,2,Krishna,2,Rama
2,3,Radha,3,Radha
3,4,Shyam,4,Shyamlal


In [38]:
pd.merge(left=df_reg, right=df_log, left_on='reg_id', right_on='log_id', how='inner', suffixes=['_reg', '_log'])

Unnamed: 0,reg_id,name_reg,log_id,name_log
0,1,Rama,1,Kanha
1,2,Krishna,2,Rama
2,3,Radha,3,Radha
3,4,Shyam,4,Shyamlal


In [39]:
# outer merge - every row from both dataframes
pd.merge(left=df_reg, right=df_log, on='name', how='outer')

Unnamed: 0,reg_id,name,log_id
0,,Kanha,1.0
1,2.0,Krishna,
2,3.0,Radha,3.0
3,1.0,Rama,2.0
4,4.0,Shyam,
5,,Shyamlal,4.0
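
Notice that `reg_id` and `log_id` are shown as floats (`1.0`, `2.0`) in the output above: NaN cannot be stored in a regular integer column, so pandas converts the column to float. If integer ids are preferred, one option is the nullable `Int64` dtype, sketched below on the same frames.

```python
import pandas as pd

df_reg = pd.DataFrame({'reg_id': [1, 2, 3, 4],
                       'name': ['Rama', 'Krishna', 'Radha', 'Shyam']})
df_log = pd.DataFrame({'log_id': [1, 2, 3, 4],
                       'name': ['Kanha', 'Rama', 'Radha', 'Shyamlal']})

merged = pd.merge(left=df_reg, right=df_log, on='name', how='outer')
print(merged.dtypes)  # reg_id and log_id are float64 because of the NaN entries

# Nullable integer dtype: missing entries become <NA> instead of NaN
merged_int = merged.astype({'reg_id': 'Int64', 'log_id': 'Int64'})
print(merged_int)
```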


In [40]:
# Merging on index - so far we have merged on columns, but we can also merge on the index
df_reg

Unnamed: 0,reg_id,name
0,1,Rama
1,2,Krishna
2,3,Radha
3,4,Shyam


In [42]:
df_reg_n = df_reg.set_index(keys='name')

In [43]:
df_reg_n

name,reg_id
Rama,1
Krishna,2
Radha,3
Shyam,4


In [44]:
# to merge on the index we use left_index / right_index; these are boolean flags (left_index=True), so passing a column name as below raises a ValueError
pd.merge(left=df_reg_n, right=df_log, left_index='name', right_on='name')

ValueError: left_index parameter must be of type bool, not <class 'str'>
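
The error above occurs because `left_index` expects a boolean. A corrected version of the call, sketched below on the same data, passes `left_index=True` so the index of `df_reg_n` is matched against the `name` column of `df_log`. The second merge, with a hypothetical `df_log_n` indexed by name, shows both sides merged on their index.

```python
import pandas as pd

df_reg = pd.DataFrame({'reg_id': [1, 2, 3, 4],
                       'name': ['Rama', 'Krishna', 'Radha', 'Shyam']})
df_log = pd.DataFrame({'log_id': [1, 2, 3, 4],
                       'name': ['Kanha', 'Rama', 'Radha', 'Shyamlal']})

df_reg_n = df_reg.set_index('name')

# Left side: use the index ('name'); right side: use the 'name' column
merged = pd.merge(left=df_reg_n, right=df_log, left_index=True, right_on='name')
print(merged)

# Both sides indexed by name: merge index against index
df_log_n = df_log.set_index('name')
merged_on_index = pd.merge(left=df_reg_n, right=df_log_n,
                           left_index=True, right_index=True)
print(merged_on_index)
```

Both calls use the default `how='inner'`, so only 'Rama' and 'Radha' appear in the results.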