This video will focus on how to combine two or more DataFrames using the pandas concat() and merge() functions. We will also explore how merge() can be used to join DataFrames in various ways.

In [1]:
import pandas as pd

In [2]:
dataset1 = pd.DataFrame({'Age': ['32', '26', '29'],
                         'Sex': ['F', 'M', 'F'],
                         'State': ['CA', 'NY', 'OH']},
                         index=['Jane', 'John', 'Cathy'])
dataset2 = pd.DataFrame({'Age': ['34', '23', '24', '21'],
                         'Sex': ['M', 'F', 'F', 'F'],
                         'State': ['AZ', 'OR', 'CA', 'WA']}, 
                         index=['Dave', 'Kris', 'Xi', 'Jo'])

In [3]:
dataset1

Unnamed: 0,Age,Sex,State
Jane,32,F,CA
John,26,M,NY
Cathy,29,F,OH


In [4]:
dataset2

Unnamed: 0,Age,Sex,State
Dave,34,M,AZ
Kris,23,F,OR
Xi,24,F,CA
Jo,21,F,WA


In this example, let's stack these two DataFrames vertically. The pandas concat() function does this when the two DataFrames are passed to it as a list.

In [5]:
pd.concat([dataset1, dataset2])

Unnamed: 0,Age,Sex,State
Jane,32,F,CA
John,26,M,NY
Cathy,29,F,OH
Dave,34,M,AZ
Kris,23,F,OR
Xi,24,F,CA
Jo,21,F,WA


We can observe that dataset2 has been appended to dataset1 vertically.
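As a small illustration of how the row labels can be controlled when stacking, the sketch below uses two optional concat() parameters, keys and ignore_index. The variable names labelled and renumbered and the key labels 'first' and 'second' are made up for this example.

```python
import pandas as pd

dataset1 = pd.DataFrame({'Age': ['32', '26', '29'],
                         'Sex': ['F', 'M', 'F'],
                         'State': ['CA', 'NY', 'OH']},
                        index=['Jane', 'John', 'Cathy'])
dataset2 = pd.DataFrame({'Age': ['34', '23', '24', '21'],
                         'Sex': ['M', 'F', 'F', 'F'],
                         'State': ['AZ', 'OR', 'CA', 'WA']},
                        index=['Dave', 'Kris', 'Xi', 'Jo'])

# keys= adds an outer index level recording which input each row came from
# ('first' and 'second' are arbitrary labels chosen for this sketch)
labelled = pd.concat([dataset1, dataset2], keys=['first', 'second'])
print(labelled.loc['second'])

# ignore_index=True drops the original row labels and renumbers the rows from 0
renumbered = pd.concat([dataset1, dataset2], ignore_index=True)
print(renumbered)
```

Using keys keeps the origin of each row recoverable, while ignore_index is handy when the original labels carry no meaning.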


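Another way to concatenate datasets is to use the DataFrame append() method. Note that append() was deprecated in pandas 1.4 and removed in pandas 2.0, so the cell below only runs on older versions; on current versions, use pd.concat() instead.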

In [6]:
dataset1.append(dataset2)

Unnamed: 0,Age,Sex,State
Jane,32,F,CA
John,26,M,NY
Cathy,29,F,OH
Dave,34,M,AZ
Kris,23,F,OR
Xi,24,F,CA
Jo,21,F,WA


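So far, we have concatenated rows, but it is also possible to concatenate columns by passing axis=1. Rows are aligned on their index labels, and because these two datasets share no labels, the cells with no matching data are filled with NaN.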

In [7]:
pd.concat([dataset1, dataset2], axis=1)

Unnamed: 0,Age,Sex,State,Age.1,Sex.1,State.1
Jane,32.0,F,CA,,,
John,26.0,M,NY,,,
Cathy,29.0,F,OH,,,
Dave,,,,34.0,M,AZ
Kris,,,,23.0,F,OR
Xi,,,,24.0,F,CA
Jo,,,,21.0,F,WA
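To make the label alignment in column-wise concatenation more concrete, here is a minimal sketch. The jobs DataFrame and its values are hypothetical and are not part of the datasets above; it shares only some row labels with dataset1. The join parameter of concat() decides whether the union or the intersection of labels is kept.

```python
import pandas as pd

dataset1 = pd.DataFrame({'Age': ['32', '26', '29'],
                         'Sex': ['F', 'M', 'F'],
                         'State': ['CA', 'NY', 'OH']},
                        index=['Jane', 'John', 'Cathy'])

# Hypothetical frame for illustration: shares 'Jane' and 'Cathy' with dataset1
jobs = pd.DataFrame({'City': ['SF', 'Columbus', 'Phoenix']},
                    index=['Jane', 'Cathy', 'Dave'])

# Default join='outer': keep every label from either frame, fill gaps with NaN
wide = pd.concat([dataset1, jobs], axis=1)
print(wide)

# join='inner': keep only the labels present in both frames
common = pd.concat([dataset1, jobs], axis=1, join='inner')
print(common)
```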


Next, we redefine the two datasets so that they share a Name column, and then perform an inner merge on them. To perform an inner merge, we pass the DataFrames to the merge() function, name the column to merge on with the on parameter, and request an inner merge with how='inner'.

In [20]:
dataset1 = pd.DataFrame({'Name': ['Jane', 'John', 'Cathy', 'Sarah'],
                         'Age': ['32', '26', '29', '23'],
                         'Sex': ['F', 'M', 'F', 'F'],
                         'State': ['CA', 'NY', 'OH', 'TX']})

dataset2 = pd.DataFrame({'Name': ['Jane', 'John', 'Cathy', 'Rob'],
                         'City': ['SF', 'NY', 'Columbus', 'Austin'],
                         'Work Status': ['No', 'Yes', 'Yes', 'Yes']})

In [21]:
dataset1

Unnamed: 0,Name,Age,Sex,State
0,Jane,32,F,CA
1,John,26,M,NY
2,Cathy,29,F,OH
3,Sarah,23,F,TX


In [22]:
dataset2

Unnamed: 0,Name,City,Work Status
0,Jane,SF,No
1,John,NY,Yes
2,Cathy,Columbus,Yes
3,Rob,Austin,Yes


In [23]:
pd.merge(dataset1, dataset2, on='Name', how='inner')

Unnamed: 0,Name,Age,Sex,State,City,Work Status
0,Jane,32,F,CA,SF,No
1,John,26,M,NY,NY,Yes
2,Cathy,29,F,OH,Columbus,Yes


The result combines the data from both datasets. It contains only those rows whose Name value appears in both DataFrames. Next we will carry out a left merge.

In [24]:
pd.merge(dataset1, dataset2, on='Name', how='left')

Unnamed: 0,Name,Age,Sex,State,City,Work Status
0,Jane,32,F,CA,SF,No
1,John,26,M,NY,NY,Yes
2,Cathy,29,F,OH,Columbus,Yes
3,Sarah,23,F,TX,,


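The result of this operation is that rows present in both datasets, as well as rows present only in the first dataset, are retained. Rows that exist only in the second dataset are discarded. A right merge does the opposite: it keeps every row of the second dataset and discards rows that exist only in the first.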

In [25]:
pd.merge(dataset1, dataset2, on='Name', how='right')

Unnamed: 0,Name,Age,Sex,State,City,Work Status
0,Jane,32.0,F,CA,SF,No
1,John,26.0,M,NY,NY,Yes
2,Cathy,29.0,F,OH,Columbus,Yes
3,Rob,,,,Austin,Yes
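One way to see how left and right merges relate is to swap the order of the inputs. The sketch below, which assumes the same two datasets as above, compares a right merge with a left merge whose arguments are reversed; the names right, swapped, and same_rows are made up for this example. Only the column order differs between the two results, so the columns are reordered before comparing.

```python
import pandas as pd

dataset1 = pd.DataFrame({'Name': ['Jane', 'John', 'Cathy', 'Sarah'],
                         'Age': ['32', '26', '29', '23'],
                         'Sex': ['F', 'M', 'F', 'F'],
                         'State': ['CA', 'NY', 'OH', 'TX']})
dataset2 = pd.DataFrame({'Name': ['Jane', 'John', 'Cathy', 'Rob'],
                         'City': ['SF', 'NY', 'Columbus', 'Austin'],
                         'Work Status': ['No', 'Yes', 'Yes', 'Yes']})

# Right merge: keep every row of dataset2
right = pd.merge(dataset1, dataset2, on='Name', how='right')

# Left merge with the inputs swapped, then columns put in the same order
swapped = pd.merge(dataset2, dataset1, on='Name', how='left')[right.columns]

# Compare the two results (NaN cells in matching positions count as equal)
same_rows = right.equals(swapped)
print(right)
print(swapped)
print(same_rows)
```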


To retain everything, we do a full outer merge.

In [26]:
pd.merge(dataset1, dataset2, on='Name', how='outer')

Unnamed: 0,Name,Age,Sex,State,City,Work Status
0,Jane,32.0,F,CA,SF,No
1,John,26.0,M,NY,NY,Yes
2,Cathy,29.0,F,OH,Columbus,Yes
3,Sarah,23.0,F,TX,,
4,Rob,,,,Austin,Yes


The result contains all the rows, irrespective of whether they exist in one dataset, the other, or both. Columns that have no value for a given row are marked as NaN.
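To check where each row of an outer merge came from, merge() accepts an indicator parameter. The following is a minimal sketch using the same two datasets; the names merged and origin are chosen for this example. With indicator=True, pandas adds a _merge column whose values are left_only, right_only, or both.

```python
import pandas as pd

dataset1 = pd.DataFrame({'Name': ['Jane', 'John', 'Cathy', 'Sarah'],
                         'Age': ['32', '26', '29', '23'],
                         'Sex': ['F', 'M', 'F', 'F'],
                         'State': ['CA', 'NY', 'OH', 'TX']})
dataset2 = pd.DataFrame({'Name': ['Jane', 'John', 'Cathy', 'Rob'],
                         'City': ['SF', 'NY', 'Columbus', 'Austin'],
                         'Work Status': ['No', 'Yes', 'Yes', 'Yes']})

# indicator=True appends a '_merge' column describing each row's source
merged = pd.merge(dataset1, dataset2, on='Name', how='outer', indicator=True)
print(merged[['Name', '_merge']])

# Map each name to its source as plain strings for easy inspection
origin = dict(zip(merged['Name'], merged['_merge'].astype(str)))
print(origin)
```

Filtering on the _merge column is a convenient way to list the rows that failed to find a match on either side.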