# Data Analysis in Python - VII: Combining DataFrames

## Introduction


In this lesson, we will learn how to combine DataFrames using concatenation and merges. 



## Concatenating DataFrames

### Combining rows

We can combine the rows from two or more DataFrames into one DataFrame using the `concat()` function. Typically, this makes sense when the DataFrames being combined have the same columns. 

In [3]:
# read the poverty data
import pandas as pd
povData = pd.read_csv('../scratch/PovertyData.csv', sep=',',na_values='*')
povData.head()

Unnamed: 0,LiveBirthRate,DeathRate,InfantDeaths,MaleLifeExpectancy,FemaleLifeExpectancy,GNI,Region,Country
0,24.7,5.7,30.8,69.6,75.5,600.0,1,Albania
1,12.5,11.9,14.4,68.3,74.7,2250.0,1,Bulgaria
2,13.4,11.7,11.3,71.8,77.7,2980.0,1,Czechoslovakia
3,12.0,12.4,7.6,69.8,75.9,,1,Former_E._Germany
4,11.6,13.4,14.8,65.4,73.8,2780.0,1,Hungary


In [4]:
# split data frame into individual dataframes that correspond to regions 1, 3, and 5.

grpData = povData.groupby('Region')
reg1_data = grpData.get_group(1)
reg3_data = grpData.get_group(3)
reg5_data = grpData.get_group(5)

In [5]:
print(reg1_data.shape)
print(reg3_data.shape)
print(reg5_data.shape)

(11, 8)
(19, 8)
(17, 8)


In [6]:
reg1_data

Unnamed: 0,LiveBirthRate,DeathRate,InfantDeaths,MaleLifeExpectancy,FemaleLifeExpectancy,GNI,Region,Country
0,24.7,5.7,30.8,69.6,75.5,600.0,1,Albania
1,12.5,11.9,14.4,68.3,74.7,2250.0,1,Bulgaria
2,13.4,11.7,11.3,71.8,77.7,2980.0,1,Czechoslovakia
3,12.0,12.4,7.6,69.8,75.9,,1,Former_E._Germany
4,11.6,13.4,14.8,65.4,73.8,2780.0,1,Hungary
5,14.3,10.2,16.0,67.2,75.7,1690.0,1,Poland
6,13.6,10.7,26.9,66.5,72.4,1640.0,1,Romania
7,14.0,9.0,20.2,68.6,74.5,,1,Yugoslavia
8,17.7,10.0,23.0,64.6,74.0,2242.0,1,USSR
9,15.2,9.5,13.1,66.4,75.9,1880.0,1,Byelorussian_SSR


In [7]:
reg3_data

Unnamed: 0,LiveBirthRate,DeathRate,InfantDeaths,MaleLifeExpectancy,FemaleLifeExpectancy,GNI,Region,Country
23,12.0,10.6,7.9,70.0,76.8,15540.0,3,Belgium
24,13.2,10.1,5.8,70.7,78.7,26040.0,3,Finland
25,12.4,11.9,7.5,71.8,77.7,22080.0,3,Denmark
26,13.6,9.4,7.4,72.3,80.5,19490.0,3,France
27,11.4,11.2,7.4,71.8,78.4,22320.0,3,Germany
28,10.1,9.2,11.0,65.4,74.0,5990.0,3,Greece
29,15.1,9.1,7.5,71.0,76.7,9550.0,3,Ireland
30,9.7,9.1,8.8,72.0,78.6,16830.0,3,Italy
31,13.2,8.6,7.1,73.3,79.9,17320.0,3,Netherlands
32,14.3,10.7,7.8,67.2,75.7,23120.0,3,Norway


In [8]:
reg5_data

Unnamed: 0,LiveBirthRate,DeathRate,InfantDeaths,MaleLifeExpectancy,FemaleLifeExpectancy,GNI,Region,Country
42,40.4,18.7,181.6,41.0,42.0,168.0,5,Afghanistan
54,42.2,15.5,119.0,56.9,56.0,210.0,5,Bangladesh
55,41.4,16.6,130.0,47.0,49.9,,5,Cambodia
56,21.2,6.7,32.0,68.0,70.9,380.0,5,China
57,11.7,4.9,6.1,74.3,80.1,14210.0,5,Hong_Kong
58,30.5,10.2,91.0,52.5,52.1,350.0,5,India
59,28.6,9.4,75.0,58.5,62.0,570.0,5,Indonesia
60,23.5,18.1,25.0,66.2,72.7,,5,Korea
61,31.6,5.6,24.0,67.5,71.6,2320.0,5,Malaysia
62,36.1,8.8,68.0,60.0,62.5,110.0,5,Mongolia


In [9]:
# combine region 1, 3, and 5 data frames into one
combined_data=pd.concat([reg1_data,reg3_data,reg5_data])
print(combined_data.shape)
print(combined_data.index)
combined_data.head(20)

(47, 8)
Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 23, 24, 25, 26, 27, 28,
            29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 54, 55, 56,
            57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
           dtype='int64')


Unnamed: 0,LiveBirthRate,DeathRate,InfantDeaths,MaleLifeExpectancy,FemaleLifeExpectancy,GNI,Region,Country
0,24.7,5.7,30.8,69.6,75.5,600.0,1,Albania
1,12.5,11.9,14.4,68.3,74.7,2250.0,1,Bulgaria
2,13.4,11.7,11.3,71.8,77.7,2980.0,1,Czechoslovakia
3,12.0,12.4,7.6,69.8,75.9,,1,Former_E._Germany
4,11.6,13.4,14.8,65.4,73.8,2780.0,1,Hungary
5,14.3,10.2,16.0,67.2,75.7,1690.0,1,Poland
6,13.6,10.7,26.9,66.5,72.4,1640.0,1,Romania
7,14.0,9.0,20.2,68.6,74.5,,1,Yugoslavia
8,17.7,10.0,23.0,64.6,74.0,2242.0,1,USSR
9,15.2,9.5,13.1,66.4,75.9,1880.0,1,Byelorussian_SSR


Each row in the concatenated DataFrame keeps the index label it had in its original DataFrame. As a result, a numeric index may have gaps (above, the labels jump from 10 to 23) or contain duplicates. If that is a problem, either pass `ignore_index=True` to `concat()` or reset the index after combining the DataFrames.
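The poverty data happens to produce gaps rather than duplicates, so here is a minimal sketch of the duplicate case. The two small DataFrames (`first_part`, `second_part`) are made up for illustration and are not part of the poverty data set.

```python
import pandas as pd

# Two hypothetical frames; each gets the default index 0, 1
first_part = pd.DataFrame({'Country': ['A', 'B'], 'GNI': [100, 200]})
second_part = pd.DataFrame({'Country': ['C', 'D'], 'GNI': [300, 400]})

# Default behaviour: the original labels are kept, so 0 and 1 each appear twice
kept = pd.concat([first_part, second_part])
print(kept.index.tolist())        # [0, 1, 0, 1]
print(kept.index.is_unique)       # False

# ignore_index=True discards the old labels and renumbers the rows
renumbered = pd.concat([first_part, second_part], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

With duplicate labels, a lookup such as `kept.loc[0]` returns two rows instead of one, which is usually the reason to renumber.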

In [10]:
# combine region 1, 3, and 5 data frames into one while ignoring the indices in the original data frames
combined_data1=pd.concat([reg1_data,reg3_data,reg5_data],ignore_index=True)
print(combined_data1.shape)
print(combined_data1.index)
combined_data1.head(20)

(47, 8)
RangeIndex(start=0, stop=47, step=1)


Unnamed: 0,LiveBirthRate,DeathRate,InfantDeaths,MaleLifeExpectancy,FemaleLifeExpectancy,GNI,Region,Country
0,24.7,5.7,30.8,69.6,75.5,600.0,1,Albania
1,12.5,11.9,14.4,68.3,74.7,2250.0,1,Bulgaria
2,13.4,11.7,11.3,71.8,77.7,2980.0,1,Czechoslovakia
3,12.0,12.4,7.6,69.8,75.9,,1,Former_E._Germany
4,11.6,13.4,14.8,65.4,73.8,2780.0,1,Hungary
5,14.3,10.2,16.0,67.2,75.7,1690.0,1,Poland
6,13.6,10.7,26.9,66.5,72.4,1640.0,1,Romania
7,14.0,9.0,20.2,68.6,74.5,,1,Yugoslavia
8,17.7,10.0,23.0,64.6,74.0,2242.0,1,USSR
9,15.2,9.5,13.1,66.4,75.9,1880.0,1,Byelorussian_SSR


In [11]:
# reset the index from the previous combined data frame
combined_data.reset_index(inplace=True)
combined_data.head(20)

Unnamed: 0,index,LiveBirthRate,DeathRate,InfantDeaths,MaleLifeExpectancy,FemaleLifeExpectancy,GNI,Region,Country
0,0,24.7,5.7,30.8,69.6,75.5,600.0,1,Albania
1,1,12.5,11.9,14.4,68.3,74.7,2250.0,1,Bulgaria
2,2,13.4,11.7,11.3,71.8,77.7,2980.0,1,Czechoslovakia
3,3,12.0,12.4,7.6,69.8,75.9,,1,Former_E._Germany
4,4,11.6,13.4,14.8,65.4,73.8,2780.0,1,Hungary
5,5,14.3,10.2,16.0,67.2,75.7,1690.0,1,Poland
6,6,13.6,10.7,26.9,66.5,72.4,1640.0,1,Romania
7,7,14.0,9.0,20.2,68.6,74.5,,1,Yugoslavia
8,8,17.7,10.0,23.0,64.6,74.0,2242.0,1,USSR
9,9,15.2,9.5,13.1,66.4,75.9,1880.0,1,Byelorussian_SSR


In [12]:
combined_data.drop('index',axis= "columns",inplace = True)
combined_data.head(20)

Unnamed: 0,LiveBirthRate,DeathRate,InfantDeaths,MaleLifeExpectancy,FemaleLifeExpectancy,GNI,Region,Country
0,24.7,5.7,30.8,69.6,75.5,600.0,1,Albania
1,12.5,11.9,14.4,68.3,74.7,2250.0,1,Bulgaria
2,13.4,11.7,11.3,71.8,77.7,2980.0,1,Czechoslovakia
3,12.0,12.4,7.6,69.8,75.9,,1,Former_E._Germany
4,11.6,13.4,14.8,65.4,73.8,2780.0,1,Hungary
5,14.3,10.2,16.0,67.2,75.7,1690.0,1,Poland
6,13.6,10.7,26.9,66.5,72.4,1640.0,1,Romania
7,14.0,9.0,20.2,68.6,74.5,,1,Yugoslavia
8,17.7,10.0,23.0,64.6,74.0,2242.0,1,USSR
9,15.2,9.5,13.1,66.4,75.9,1880.0,1,Byelorussian_SSR
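Resetting the index and then dropping the leftover `index` column can also be done in one call by passing `drop=True` to `reset_index()`. The sketch below uses a small, hypothetical DataFrame (`frame`) with a gappy index as a stand-in for the combined data.

```python
import pandas as pd

# Hypothetical frame with gaps in its index, standing in for combined_data
frame = pd.DataFrame({'Region': [1, 3, 5]}, index=[0, 23, 42])

# drop=True discards the old index instead of saving it as an 'index' column
tidy = frame.reset_index(drop=True)
print(tidy.columns.tolist())  # ['Region']
print(tidy.index.tolist())    # [0, 1, 2]
```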


### Combining columns

The `concat()` function can also combine DataFrames that hold different columns of the same data set. Setting `axis=1` tells `concat()` to place the DataFrames side by side (concatenating columns) instead of stacking them (concatenating rows, which is the default, `axis=0`).

In [13]:
# combine the birth and death rate data with the life expectancy data 
bdRatesData = povData[['LiveBirthRate', 'DeathRate', 'InfantDeaths']]
print(bdRatesData.head())
lifeExpData = povData[['MaleLifeExpectancy','FemaleLifeExpectancy']]
print(lifeExpData.head())

   LiveBirthRate  DeathRate  InfantDeaths
0           24.7        5.7          30.8
1           12.5       11.9          14.4
2           13.4       11.7          11.3
3           12.0       12.4           7.6
4           11.6       13.4          14.8
   MaleLifeExpectancy  FemaleLifeExpectancy
0                69.6                  75.5
1                68.3                  74.7
2                71.8                  77.7
3                69.8                  75.9
4                65.4                  73.8


In [14]:
bdlifedata=pd.concat([bdRatesData,lifeExpData],axis=1)
bdlifedata.head()

Unnamed: 0,LiveBirthRate,DeathRate,InfantDeaths,MaleLifeExpectancy,FemaleLifeExpectancy
0,24.7,5.7,30.8,69.6,75.5
1,12.5,11.9,14.4,68.3,74.7
2,13.4,11.7,11.3,71.8,77.7
3,12.0,12.4,7.6,69.8,75.9
4,11.6,13.4,14.8,65.4,73.8
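In the example above both pieces came from the same DataFrame, so their indices matched exactly. When the indices differ, `concat()` with `axis=1` lines rows up by index label, not by position. The sketch below uses two made-up DataFrames (`rates` and `life`) whose indices only partly overlap.

```python
import pandas as pd

# Hypothetical frames with partly overlapping index labels
rates = pd.DataFrame({'LiveBirthRate': [24.7, 12.5, 13.4]}, index=[0, 1, 2])
life = pd.DataFrame({'MaleLifeExpectancy': [68.3, 71.8, 69.8]}, index=[1, 2, 3])

# Default (join='outer'): keep every label; cells with no match become NaN
side_by_side = pd.concat([rates, life], axis=1)
print(side_by_side.shape)           # (4, 2)

# join='inner': keep only the labels present in both frames
overlap = pd.concat([rates, life], axis=1, join='inner')
print(overlap.index.tolist())       # [1, 2]
```

If the rows are meant to be paired by position instead, resetting the index of each DataFrame before concatenating is one way to get that behaviour.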


## Merging DataFrames

Often the data we want to analyze is spread across several files or database tables. When working with such data, we can merge it into one DataFrame. Merging is similar to a database join operation. Five kinds of merges (joins) can be performed:
* Left
* Right
* Inner
* Outer
* Cross

If the join type is not specified, an inner join is performed by default.

Let's create some simple DataFrames to learn about merging data. 

In [15]:
# create data frames

students = pd.DataFrame(
    [
        [9000, 'Amir', 'A1@psu.edu'],
        [9001, 'Biko', 'b10@psu.edu'],
        [9002, 'Chen', 'C2@psu.edu'],
        [9003, 'Darren', 'd@psu.edu'],
        [9004, 'Elena', 'e@psu.edu'],
    ], 
    columns = ['ID','Name','Email']
)

demographics = pd.DataFrame(
    [
        [9001, 30, True],
        [9002, 29, False],
        [9003, 28, False],
        [9004, 32, True],
        [9005, 30, True]
    ], 
    columns = ['StudentID','Age','InState']
)

scores = pd.DataFrame(
    [
        [9001, 'MKTG 300', 90],
        [9001, 'SCM 200', 89],
        [9002, 'FIN 301', 90],
        [9003, 'BAN 830', 92],
        [9004, 'SCM 301', 90],
        [9004, 'SCM 200', 93]        
    ], 
    columns = ['StudentID','Subject','Score']
)

Now let's explore the different kinds of merges.

In [23]:
# inner merge students and demographics
merged_data=students.merge(demographics, 'inner', left_on='ID', right_on='StudentID')
merged_data

Unnamed: 0,ID,Name,Email,StudentID,Age,InState
0,9001,Biko,b10@psu.edu,9001,30,True
1,9002,Chen,C2@psu.edu,9002,29,False
2,9003,Darren,d@psu.edu,9003,28,False
3,9004,Elena,e@psu.edu,9004,32,True


In [24]:
# left merge students and demographics
merged_data=students.merge(demographics, 'left', left_on='ID', right_on='StudentID')
merged_data


Unnamed: 0,ID,Name,Email,StudentID,Age,InState
0,9000,Amir,A1@psu.edu,,,
1,9001,Biko,b10@psu.edu,9001.0,30.0,True
2,9002,Chen,C2@psu.edu,9002.0,29.0,False
3,9003,Darren,d@psu.edu,9003.0,28.0,False
4,9004,Elena,e@psu.edu,9004.0,32.0,True


In [25]:
# right merge students and demographics
merged_data=students.merge(demographics, 'right', left_on='ID', right_on='StudentID')
merged_data



Unnamed: 0,ID,Name,Email,StudentID,Age,InState
0,9001.0,Biko,b10@psu.edu,9001,30,True
1,9002.0,Chen,C2@psu.edu,9002,29,False
2,9003.0,Darren,d@psu.edu,9003,28,False
3,9004.0,Elena,e@psu.edu,9004,32,True
4,,,,9005,30,True


In [26]:
# outer merge students and demographics

merged_data=students.merge(demographics, 'outer', left_on='ID', right_on='StudentID')
merged_data


Unnamed: 0,ID,Name,Email,StudentID,Age,InState
0,9000.0,Amir,A1@psu.edu,,,
1,9001.0,Biko,b10@psu.edu,9001.0,30.0,True
2,9002.0,Chen,C2@psu.edu,9002.0,29.0,False
3,9003.0,Darren,d@psu.edu,9003.0,28.0,False
4,9004.0,Elena,e@psu.edu,9004.0,32.0,True
5,,,,9005.0,30.0,True
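In an outer merge it can be useful to know which side each row came from. One way to check, sketched here with two small made-up DataFrames (`students_small` and `demo_small`, so the DataFrames defined above are left untouched), is the `indicator=True` option of `merge()`, which adds a `_merge` column.

```python
import pandas as pd

# Hypothetical, cut-down versions of the students and demographics tables
students_small = pd.DataFrame({'ID': [9000, 9001], 'Name': ['Amir', 'Biko']})
demo_small = pd.DataFrame({'StudentID': [9001, 9005], 'Age': [30, 30]})

# indicator=True adds a '_merge' column with the values
# 'left_only', 'right_only', or 'both'
tagged = students_small.merge(
    demo_small, how='outer', left_on='ID', right_on='StudentID', indicator=True
)
print(tagged[['ID', 'StudentID', '_merge']])

# Rows that found no partner on the other side
unmatched = tagged[tagged['_merge'] != 'both']
print(len(unmatched))  # 2
```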


In [28]:
# cross merge students and demographics
merged_data=students.merge(demographics, 'cross')
merged_data


Unnamed: 0,ID,Name,Email,StudentID,Age,InState
0,9000,Amir,A1@psu.edu,9001,30,True
1,9000,Amir,A1@psu.edu,9002,29,False
2,9000,Amir,A1@psu.edu,9003,28,False
3,9000,Amir,A1@psu.edu,9004,32,True
4,9000,Amir,A1@psu.edu,9005,30,True
5,9001,Biko,b10@psu.edu,9001,30,True
6,9001,Biko,b10@psu.edu,9002,29,False
7,9001,Biko,b10@psu.edu,9003,28,False
8,9001,Biko,b10@psu.edu,9004,32,True
9,9001,Biko,b10@psu.edu,9005,30,True
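A cross merge pairs every row of the left DataFrame with every row of the right one, so no key columns are needed. The result has as many rows as the product of the two row counts: 5 × 5 = 25 for `students` and `demographics` (only the first rows are displayed above). A minimal sketch with two made-up DataFrames (`letters` and `numbers`):

```python
import pandas as pd

# Hypothetical frames with 3 and 2 rows
letters = pd.DataFrame({'Letter': ['a', 'b', 'c']})
numbers = pd.DataFrame({'Number': [1, 2]})

# Every letter is paired with every number
pairs = letters.merge(numbers, how='cross')
print(len(pairs))                                # 6
print(len(pairs) == len(letters) * len(numbers)) # True
```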


In [31]:
# inner join of students with scores
merged_data=students.merge(scores, 'inner', left_on ='ID', right_on='StudentID')
merged_data


Unnamed: 0,ID,Name,Email,StudentID,Subject,Score
0,9001,Biko,b10@psu.edu,9001,MKTG 300,90
1,9001,Biko,b10@psu.edu,9001,SCM 200,89
2,9002,Chen,C2@psu.edu,9002,FIN 301,90
3,9003,Darren,d@psu.edu,9003,BAN 830,92
4,9004,Elena,e@psu.edu,9004,SCM 301,90
5,9004,Elena,e@psu.edu,9004,SCM 200,93
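Notice that Biko and Elena each appear twice in the result: one student row matches several score rows, so the student's details are repeated. If a relationship like this matters, the `validate` option of `merge()` can check it. The sketch below uses made-up DataFrames (`people` and `grades`) and is only one way to guard against unexpected duplicates.

```python
import pandas as pd

# Hypothetical frames: each ID is unique in people, but repeats in grades
people = pd.DataFrame({'ID': [9001, 9002], 'Name': ['Biko', 'Chen']})
grades = pd.DataFrame({'StudentID': [9001, 9001, 9002], 'Score': [90, 89, 90]})

# one_to_many: keys must be unique on the left side only; this passes
joined = people.merge(
    grades, how='inner', left_on='ID', right_on='StudentID',
    validate='one_to_many'
)
print(len(joined))  # 3

# one_to_one: keys must be unique on both sides; this raises MergeError
try:
    people.merge(
        grades, how='inner', left_on='ID', right_on='StudentID',
        validate='one_to_one'
    )
    one_to_one_ok = True
except pd.errors.MergeError:
    one_to_one_ok = False
print(one_to_one_ok)  # False
```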
