# Combining DataFrames: Merging, Joining, and Concatenating

### Creating Sample Data(Frames)

In [30]:
import pandas as pd
import numpy as np

In [31]:
df1 = pd.DataFrame({"c0": np.arange(4),
                        "c1": np.arange(4, 8),
                        "c2": np.arange(8, 12),
                        "c3": np.arange(12, 16)},
                        index=[0, 1, 2, 3])

In [32]:
df2 = pd.DataFrame({"c0": np.arange(16, 20),
                        "c1": np.arange(20, 24),
                        "c2": np.arange(24, 28),
                        "c3": np.arange(28, 32)},
                        index=[0, 1, 2, 3])

In [33]:
df3 = pd.DataFrame({"c0": np.arange(32, 36),
                        "c1": np.arange(36, 40),
                        "c2": np.arange(40, 44),
                        "c3": np.arange(44, 48)},
                        index=[0, 1, 2, 3])

In [34]:
df1

Unnamed: 0,c0,c1,c2,c3
0,0,4,8,12
1,1,5,9,13
2,2,6,10,14
3,3,7,11,15


In [35]:
df2

Unnamed: 0,c0,c1,c2,c3
0,16,20,24,28
1,17,21,25,29
2,18,22,26,30
3,19,23,27,31


In [36]:
df3

Unnamed: 0,c0,c1,c2,c3
0,32,36,40,44
1,33,37,41,45
2,34,38,42,46
3,35,39,43,47


## Concatenation


>```pd.concat()``` is a pandas function that glues ```DataFrames``` together along an axis. The result is cleanest when the labels on the other axis match across all the frames.

In [37]:
pd.concat([df1, df2, df3])

Unnamed: 0,c0,c1,c2,c3
0,0,4,8,12
1,1,5,9,13
2,2,6,10,14
3,3,7,11,15
0,16,20,24,28
1,17,21,25,29
2,18,22,26,30
3,19,23,27,31
0,32,36,40,44
1,33,37,41,45
2,34,38,42,46
3,35,39,43,47


In [38]:
pd.concat([df1, df2, df3], axis=1)

Unnamed: 0,c0,c1,c2,c3,c0.1,c1.1,c2.1,c3.1,c0.2,c1.2,c2.2,c3.2
0,0,4,8,12,16,20,24,28,32,36,40,44
1,1,5,9,13,17,21,25,29,33,37,41,45
2,2,6,10,14,18,22,26,30,34,38,42,46
3,3,7,11,15,19,23,27,31,35,39,43,47
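
Both results above carry repeated labels: the row-wise result repeats the index values 0 to 3, and the column-wise result repeats the column names. As a minimal sketch (the frames `a` and `b` are made up for illustration and are not part of the original notebook), the `ignore_index` and `keys` arguments of `pd.concat()` are two ways to deal with this:

```python
import numpy as np
import pandas as pd

# Two small illustrative frames with identical column labels and indices.
a = pd.DataFrame({"c0": np.arange(2), "c1": np.arange(2, 4)})
b = pd.DataFrame({"c0": np.arange(4, 6), "c1": np.arange(6, 8)})

# Row-wise: ignore_index=True drops the original row labels and renumbers from 0.
rows = pd.concat([a, b], ignore_index=True)

# Column-wise: keys adds an outer column level, so repeated names stay distinguishable.
cols = pd.concat([a, b], axis=1, keys=["a", "b"])

print(rows.index.tolist())    # [0, 1, 2, 3]
print(cols.columns.tolist())  # [('a', 'c0'), ('a', 'c1'), ('b', 'c0'), ('b', 'c1')]
```

With `keys`, a single block can be selected again with `cols["b"]`.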


Suppose the dataframes did not have the same column names or indices. While concatenating, all columns or indices are considered (depending on which axis we are concatenating on), and a ```NaN``` is placed in every position (i.e., **column x index**) that doesn't exist in a given DataFrame.

Example:

In [39]:
df4 = pd.DataFrame({'c0': np.arange(4),
                        'c1': np.arange(4, 8),
                        'c2': np.arange(8, 12),
                        'c3': np.arange(12, 16)},
                        index=[7, 8, 9, 10])

In [40]:
pd.concat([df1, df4], axis=1)

Unnamed: 0,c0,c1,c2,c3,c0.1,c1.1,c2.1,c3.1
0,0.0,4.0,8.0,12.0,,,,
1,1.0,5.0,9.0,13.0,,,,
2,2.0,6.0,10.0,14.0,,,,
3,3.0,7.0,11.0,15.0,,,,
7,,,,,0.0,4.0,8.0,12.0
8,,,,,1.0,5.0,9.0,13.0
9,,,,,2.0,6.0,10.0,14.0
10,,,,,3.0,7.0,11.0,15.0
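
The ```NaN```-filling behaviour shown above is what `pd.concat()` does with its default `join="outer"`, which keeps the union of the labels. Passing `join="inner"` keeps only the labels shared by every frame. A small sketch, assuming two hypothetical frames `x` and `y` whose indices only partly overlap:

```python
import numpy as np
import pandas as pd

# Illustrative frames: the indices overlap only on labels 1 and 2.
x = pd.DataFrame({"c0": np.arange(3)}, index=[0, 1, 2])
y = pd.DataFrame({"c1": np.arange(3, 6)}, index=[1, 2, 3])

outer = pd.concat([x, y], axis=1)                # default: union of the indices
inner = pd.concat([x, y], axis=1, join="inner")  # intersection of the indices

print(sorted(outer.index))           # [0, 1, 2, 3]
print(sorted(inner.index))           # [1, 2]
print(int(outer.isna().sum().sum())) # 2 (one missing cell per frame)
```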


## Merging

>The ```pd.merge()``` function lets you merge DataFrames using logic similar to joining SQL tables.

For those who aren't familiar with SQL, think of merging as joining two tables **row-wise**: a row from one table is paired with a row from the other whenever both hold the same values in some chosen combination of columns.


This combination of columns is specified with the ```on``` parameter.

There are several ways to merge dataframes. The one we want is selected with the ```how``` parameter.


The various methods can be illustrated with the following diagram: 

<p align="center">
  <img src="assets/join-merge.png" />
</p>


Since this is a very important concept, let us walk through a few examples together.

In [41]:
left = pd.DataFrame({"key": ["K0", "K1", "K2", "K3"],
                    "A": ["A0", "A1", "A2", "A3"],
                    "B": ["B0", "B1", "B2", "B3"]})
   
right = pd.DataFrame({"key": ["K0", "K1", "K2", "K3"],
                    "C": ["C0", "C1", "C2", "C3"],
                    "D": ["D0", "D1", "D2", "D3"]})    

In [42]:
left

Unnamed: 0,key,A,B
0,K0,A0,B0
1,K1,A1,B1
2,K2,A2,B2
3,K3,A3,B3


In [43]:
right

Unnamed: 0,key,C,D
0,K0,C0,D0
1,K1,C1,D1
2,K2,C2,D2
3,K3,C3,D3


### Inner Merge

As the diagram shows, an inner merge keeps only the rows whose ```on``` values appear in both dataframes.

So in the example below, we merge the dataframes ```left``` and ```right```, keeping the rows that share a value in the ```key``` column.

Or in Venn diagram terms: 

$$L[key] \cap R[key]$$


where ```L``` is the left dataframe and ```R``` is the right dataframe.

In [44]:
pd.merge(left, right, how="inner", on="key")

Unnamed: 0,key,A,B,C,D
0,K0,A0,B0,C0,D0
1,K1,A1,B1,C1,D1
2,K2,A2,B2,C2,D2
3,K3,A3,B3,C3,D3


**Note:** If ```on``` isn't specified, the default is to merge on the combination of **all** commonly named columns.
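
A quick way to convince yourself of this default is to compare a merge with and without ```on```. The sketch below uses two small stand-in frames (`l_demo` and `r_demo` are illustrative names) that share only the column `key`:

```python
import pandas as pd

# Stand-in frames whose only commonly named column is "key".
l_demo = pd.DataFrame({"key": ["K0", "K1"], "A": ["A0", "A1"]})
r_demo = pd.DataFrame({"key": ["K0", "K1"], "C": ["C0", "C1"]})

explicit = pd.merge(l_demo, r_demo, how="inner", on="key")
implicit = pd.merge(l_demo, r_demo)  # no `on`: falls back to the shared column "key"

print(explicit.equals(implicit))  # True
```

Spelling out ```on``` is still the safer habit, since an unexpected shared column name would silently change the merge keys.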

To show how we can merge based on a combination of columns, let's look at another example:

In [45]:
left = pd.DataFrame({"key1": ["K0", "K0", "K1", "K2"],
                     "key2": ["K0", "K1", "K0", "K1"],
                        "A": ["A0", "A1", "A2", "A3"],
                        "B": ["B0", "B1", "B2", "B3"]})

right = pd.DataFrame({"key1": ["K0", "K1", "K1", "K2"],
                     "key2": ["K0", "K0", "K0", "K0"],
                     "C": ["C0", "C1", "C2", "C3"],
                     "D": ["D0", "D1", "D2", "D3"]})

In [46]:
# by default the "how" parameter is "inner"
pd.merge(left, right, on=["key1", "key2"])

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K1,K0,A2,B2,C1,D1
2,K1,K0,A2,B2,C2,D2


**Note:** The combination ```K1```, ```K0``` appears twice in the result. This is because the ```right``` dataframe contains two rows with this combination, so the single matching row in ```left``` is paired with each of them.
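
If this kind of row multiplication would be a surprise in your own data, `pd.merge()` has a `validate` argument that checks the key relationship before merging. A minimal sketch on cut-down versions of the frames above (`l_small` and `r_small` are illustrative names):

```python
import pandas as pd
from pandas.errors import MergeError

# The key pair (K1, K0) is unique on the left but duplicated on the right.
l_small = pd.DataFrame({"key1": ["K0", "K1"], "key2": ["K0", "K0"],
                        "A": ["A0", "A2"]})
r_small = pd.DataFrame({"key1": ["K0", "K1", "K1"], "key2": ["K0", "K0", "K0"],
                        "C": ["C0", "C1", "C2"]})

merged = pd.merge(l_small, r_small, on=["key1", "key2"])
print(len(merged))  # 3: the (K1, K0) row from the left is matched twice

try:
    pd.merge(l_small, r_small, on=["key1", "key2"], validate="one_to_one")
    raised = False
except MergeError:
    raised = True  # duplicate keys on the right violate one-to-one

print(raised)  # True
```

Here `validate="one_to_many"` would pass, because only the right side repeats keys.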

### Left and Right [outer] Merge

Again, let's return to that diagram. We can see that a ```left``` merge is basically

$$L[key] \cup (L[key] \cap R[key])$$

where ```L``` is the left dataframe and ```R``` is the right dataframe.

So instead of just taking the rows where this combination of columns is common to both, we additionally take:

----

i) All combinations in the ```left``` dataframe that are not present in the ```right``` dataframe. Since these combinations do not exist in the ```right``` dataframe, we place a ```NaN``` in the columns that come from it.
Ex:

In [47]:
pd.merge(left, right, how="left", on=["key1", "key2"])

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K0,K1,A1,B1,,
2,K1,K0,A2,B2,C1,D1
3,K1,K0,A2,B2,C2,D2
4,K2,K1,A3,B3,,


The additional rows we see here in comparison to the inner merge are:

a) ```K0, K1``` (yes, order matters!!)

b) ```K2, K1```

We can see that these are the combinations of the keys that exist only in the ```left``` dataframe.
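
Rather than comparing the two outputs by eye, the `indicator=True` argument of `pd.merge()` adds a `_merge` column that records where each row came from. A short sketch, rebuilding only the key columns plus one value column per frame (`l_keys` and `r_keys` are illustrative names):

```python
import pandas as pd

l_keys = pd.DataFrame({"key1": ["K0", "K0", "K1", "K2"],
                       "key2": ["K0", "K1", "K0", "K1"],
                       "A": ["A0", "A1", "A2", "A3"]})
r_keys = pd.DataFrame({"key1": ["K0", "K1", "K1", "K2"],
                       "key2": ["K0", "K0", "K0", "K0"],
                       "C": ["C0", "C1", "C2", "C3"]})

flagged = pd.merge(l_keys, r_keys, how="left", on=["key1", "key2"], indicator=True)

# Rows tagged "left_only" had no partner in the right frame.
left_only = flagged[flagged["_merge"] == "left_only"]
print(left_only[["key1", "key2"]].values.tolist())  # [['K0', 'K1'], ['K2', 'K1']]
```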

**Note:** The names we assigned are just for convenience's sake. The ```how``` parameter only takes values such as ```'left'```, ```'right'```, ```'outer'```, or ```'inner'``` (newer pandas versions also accept ```'cross'```).

----

ii) All combinations in the ```right``` dataframe that are not present in the ```left``` dataframe. Since these combinations do not exist in the ```left``` dataframe, we place a ```NaN``` in the columns that come from it.

Ex:

In [48]:
pd.merge(left, right, how="right", on=["key1", "key2"])

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K1,K0,A2,B2,C1,D1
2,K1,K0,A2,B2,C2,D2
3,K2,K0,,,C3,D3


----
In case we want to merge based on differently named columns, we use the ```right_on``` and ```left_on``` parameters.

Ex:

In [49]:
one = pd.DataFrame({"key1": ["K0", "K1", "K2", "K3"],
                    "A": ["A0", "A1", "A2", "A3"],
                    "B": ["B0", "B1", "B2", "B3"]})
                    
two = pd.DataFrame({"key2": ["K0", "K1", "K2", "K3"],
                    "C": ["C0", "C1", "C2", "C3"],
                    "D": ["D0", "D1", "D2", "D3"]})   

In [50]:
pd.merge(one, two, how="inner", left_on="key1", right_on="key2")

Unnamed: 0,key1,A,B,key2,C,D
0,K0,A0,B0,K0,C0,D0
1,K1,A1,B1,K1,C1,D1
2,K2,A2,B2,K2,C2,D2
3,K3,A3,B3,K3,C3,D3
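
Note that both key columns survive in the result, even though they hold identical values after an inner merge. One possible clean-up, sketched on shortened stand-ins for `one` and `two` (the names `one_demo`, `two_demo`, and the final column name `key` are illustrative choices), is to drop one of them and rename the other:

```python
import pandas as pd

one_demo = pd.DataFrame({"key1": ["K0", "K1"], "A": ["A0", "A1"]})
two_demo = pd.DataFrame({"key2": ["K0", "K1"], "C": ["C0", "C1"]})

merged = pd.merge(one_demo, two_demo, how="inner", left_on="key1", right_on="key2")
print(merged.columns.tolist())  # ['key1', 'A', 'key2', 'C']

# After an inner merge key1 and key2 are equal row by row, so one is redundant.
tidy = merged.drop(columns="key2").rename(columns={"key1": "key"})
print(tidy.columns.tolist())    # ['key', 'A', 'C']
```

For left, right, or outer merges the two key columns can differ (one side may be ```NaN```), so dropping one is only safe when that loss is acceptable.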


### [Full] Outer Merge

It should now be clear that the ```outer``` merge is effectively doing both the ```left``` and ```right``` merges at once.

We take all the common combinations, plus the combinations that exist in only one of the dataframes, placing ```NaN``` values where those positions don't exist in the other dataframe.

Essentially, it is doing the following:

$$ L[key] \cup R[key]$$

In [51]:
pd.merge(left, right, how="outer", on=["key1", "key2"])

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K0,K1,A1,B1,,
2,K1,K0,A2,B2,C1,D1
3,K1,K0,A2,B2,C2,D2
4,K2,K1,A3,B3,,
5,K2,K0,,,C3,D3
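
The set notation can be checked directly by collecting the key pairs of each frame into Python sets. This is only a sketch: the helper `key_set` and the frame names are hypothetical, and the check looks at which key combinations are present, not at row order or duplicates.

```python
import pandas as pd

l_keys = pd.DataFrame({"key1": ["K0", "K0", "K1", "K2"],
                       "key2": ["K0", "K1", "K0", "K1"],
                       "A": ["A0", "A1", "A2", "A3"]})
r_keys = pd.DataFrame({"key1": ["K0", "K1", "K1", "K2"],
                       "key2": ["K0", "K0", "K0", "K0"],
                       "C": ["C0", "C1", "C2", "C3"]})

def key_set(df):
    """Return the set of (key1, key2) pairs present in df."""
    return set(zip(df["key1"], df["key2"]))

outer = pd.merge(l_keys, r_keys, how="outer", on=["key1", "key2"])
inner = pd.merge(l_keys, r_keys, how="inner", on=["key1", "key2"])

print(key_set(outer) == key_set(l_keys) | key_set(r_keys))  # True: union
print(key_set(inner) == key_set(l_keys) & key_set(r_keys))  # True: intersection
```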


## Joining


Joining is the same as merging, with the two differences being:

a) ```join()``` is a method defined on a ```DataFrame``` object, and not a function.

b) ```join()``` merges two dataframes based on their index by default. The ```how``` parameter works the same way as in merge (although it defaults to ```'left'```), while the ```on``` parameter is optional: if given, it names the column(s) of the calling dataframe to match against the other dataframe's index.

This makes ```merge()``` a more flexible, albeit more complex, version of ```join()```.

In [52]:
left2 = pd.DataFrame({"A": ["A0", "A1", "A2"],
                    "B": ["B0", "B1", "B2"]},
                      index=["K0", "K1", "K2"]) 

right2 = pd.DataFrame({"C": ["C0", "C2", "C3"],
                    "D": ["D0", "D2", "D3"]},
                      index=["K0", "K2", "K3"])

In [53]:
# important: by default join() performs a left join
left2.join(right2)

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K1,A1,B1,,
K2,A2,B2,C2,D2
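
To make the relationship with ```merge()``` concrete, the same left join can be written with `pd.merge()` by telling it to use the index on both sides via `left_index=True` and `right_index=True`. A sketch that rebuilds the two frames under the illustrative names `lj` and `rj`:

```python
import pandas as pd

lj = pd.DataFrame({"A": ["A0", "A1", "A2"],
                   "B": ["B0", "B1", "B2"]},
                  index=["K0", "K1", "K2"])
rj = pd.DataFrame({"C": ["C0", "C2", "C3"],
                   "D": ["D0", "D2", "D3"]},
                  index=["K0", "K2", "K3"])

via_join = lj.join(rj)  # left join on the index
via_merge = pd.merge(lj, rj, how="left", left_index=True, right_index=True)

# Raises an AssertionError if the two results differ in any way.
pd.testing.assert_frame_equal(via_join, via_merge)
print(via_join.index.tolist())  # ['K0', 'K1', 'K2']
```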


In [54]:
right2.join(left2, how="inner")

Unnamed: 0,C,D,A,B
K0,C0,D0,A0,B0
K2,C2,D2,A2,B2


In [55]:
right2.join(left2, how="outer")

Unnamed: 0,C,D,A,B
K0,C0,D0,A0,B0
K1,,,A1,B1
K2,C2,D2,A2,B2
K3,C3,D3,,
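
Finally, a sketch of the ```on``` parameter of ```join()```: it names a column of the calling dataframe whose values are looked up in the other dataframe's index. The frames `records` and `lookup` below are invented for illustration.

```python
import pandas as pd

# `records` keeps its key in an ordinary column; `lookup` is indexed by that key.
records = pd.DataFrame({"key": ["K0", "K1", "K0"],
                        "A": ["A0", "A1", "A2"]})
lookup = pd.DataFrame({"C": ["C0", "C1"]}, index=["K0", "K1"])

# Match records["key"] against lookup's index (left join by default).
joined = records.join(lookup, on="key")

print(joined["C"].tolist())   # ['C0', 'C1', 'C0']
print(joined.index.tolist())  # [0, 1, 2] (the caller's index is kept)
```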
