# Merging DataFrames Together

In this module, we're going to talk about two different types of merging: concatenation and masking.

In [1]:
import pandas as pd

## Concatenation

To "concatenate" means to combine things end-to-end.  That is, we're going to merge multiple data sets by appending the rows of each one after the rows of the one before it.

In `/data/drinking/` there is a list of files that we want to merge together into a single data frame.  They all have the same format, but they are from different cities.

In [2]:
# First, we can get a list of the files that are in a particular directory using the os package
import os
files = os.listdir('/data/drinking/')

In [3]:
files

['Fort_Worth_Tarrant_County_TX.csv',
 'Las_Vegas_Clark_County_NV.csv',
 'Long_Beach_CA.csv',
 'Oakland_Alameda_County_CA.csv',
 'San_Antonio_TX.csv',
 'Boston_MA.csv',
 'Charlotte_NC.csv',
 'Philadelphia_PA.csv',
 'Los_Angeles_CA.csv',
 'San_Diego_County_CA.csv',
 'Indianapolis_Marion_County_IN.csv',
 'San_Jose_CA.csv',
 'Columbus_OH.csv',
 'Minneapolis_MN.csv',
 'Seattle_WA.csv',
 'Miami_Miami-Dade_County_FL.csv',
 'Kansas_City_MO.csv',
 'Detroit_MI.csv',
 'Washington_DC.csv',
 'Phoenix_AZ.csv',
 'New_York_City_NY.csv',
 'Portland_Multnomah_County_OR.csv',
 'U.S._Total_U.S._Total.csv',
 'Houston_TX.csv',
 'Denver_CO.csv',
 'Chicago_Il.csv',
 'Baltimore_MD.csv']

In [4]:
len(files)

27

In [5]:
# Then, let's read each of those files into their own df and store that in a list of dfs
dataframes = []

In [6]:
for f in files:
    df = pd.read_csv(os.path.join('/data/drinking/', f))
    dataframes.append(df)
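
The loop above reads every entry that `os.listdir` returns. As a hedged alternative, the sketch below uses `pathlib` to keep only `.csv` files and to read them in sorted order. The temporary directory, file names, and values are made up for illustration and are not part of the drinking data set.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for /data/drinking/: a throwaway directory
# holding two small CSV files and one file that is not data.
tmp_dir = Path(tempfile.mkdtemp())
pd.DataFrame({"Year": [2010, 2011], "Value": [15.3, 19.7]}).to_csv(
    tmp_dir / "City_A.csv", index=False
)
pd.DataFrame({"Year": [2010], "Value": [12.1]}).to_csv(
    tmp_dir / "City_B.csv", index=False
)
(tmp_dir / "notes.txt").write_text("not a data file")

# Keep only the .csv files, in a predictable (sorted) order.
csv_paths = sorted(tmp_dir.glob("*.csv"))

# Read each file into its own DataFrame and collect them in a list.
frames = [pd.read_csv(p) for p in csv_paths]

print([p.name for p in csv_paths])
print([len(f) for f in frames])
```

Filtering on the extension means a stray file in the directory does not get passed to `pd.read_csv`.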

In [7]:
len(dataframes)

27

In [9]:
type(dataframes[0])

pandas.core.frame.DataFrame

In [10]:
dataframes[0].head()

Unnamed: 0.1,Unnamed: 0,Indicator Category,Indicator,Year,Sex,Race/Ethnicity,Value,Place,BCHC Requested Methodology,Source,Methods,Notes,90% Confidence Level - Low,90% Confidence Level - High,95% Confidence Level - Low,95% Confidence Level - High
0,15,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2010,Both,All,15.3,"Fort Worth (Tarrant County), TX",BRFSS (or similar) How many times during the p...,Behavior Risk Factors: Selected Metropolitan A...,,Tarrant County (not just Fort Worth),,,,
1,128,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2011,Both,All,19.7,"Fort Worth (Tarrant County), TX",BRFSS (or similar) How many times during the p...,Behavior Risk Factors: Selected Metropolitan A...,,Tarrant County (not just Fort Worth),,,,
2,180,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2012,Both,All,17.7,"Fort Worth (Tarrant County), TX",BRFSS (or similar) How many times during the p...,Behavior Risk Factors: Selected Metropolitan A...,,Tarrant County (not just Fort Worth),,,,
3,354,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2013,Both,All,14.6,"Fort Worth (Tarrant County), TX",BRFSS (or similar) How many times during the p...,"Centers for Disease Control and Prevention, Na...",,FW-Arlington Metropolitan Division includes re...,,,11.0,18.2
4,400,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2014,Both,All,13.4,"Fort Worth (Tarrant County), TX",BRFSS (or similar) How many times during the p...,"Centers for Disease Control and Prevention, Na...",,FW-Arlington Metropolitan Division includes re...,,,9.4,17.3


In [11]:
len(dataframes)

27

In [12]:
# Then we can concatenate them together with pd.concat
drinking = pd.concat(dataframes)

Let's check to make sure the counts match up...

Does the length of the combined data frame equal the sum of the lengths of the individual data frames?

In [13]:
len(drinking)

599

In [14]:
sum([len(x) for x in dataframes])

599
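
One thing the row counts do not show is that `pd.concat` keeps each input's original index by default, so index labels repeat in the combined data frame. The sketch below, using two small made-up data frames (`a` and `b` are hypothetical), shows the repeated labels and how `ignore_index=True` renumbers the rows instead.

```python
import pandas as pd

# Two small, made-up data frames that each have the default index 0, 1.
a = pd.DataFrame({"Year": [2010, 2011], "Value": [15.3, 19.7]})
b = pd.DataFrame({"Year": [2010, 2011], "Value": [12.1, 13.0]})

# Default behavior: each input keeps its own index labels.
stacked = pd.concat([a, b])

# ignore_index=True discards the old labels and numbers the rows 0..n-1.
renumbered = pd.concat([a, b], ignore_index=True)

print(stacked.index.tolist())     # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

With the repeated labels, `stacked.loc[0]` returns two rows rather than one, which is worth knowing before selecting rows by label.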

In [15]:
drinking.head()

Unnamed: 0.1,Unnamed: 0,Indicator Category,Indicator,Year,Sex,Race/Ethnicity,Value,Place,BCHC Requested Methodology,Source,Methods,Notes,90% Confidence Level - Low,90% Confidence Level - High,95% Confidence Level - Low,95% Confidence Level - High
0,15,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2010,Both,All,15.3,"Fort Worth (Tarrant County), TX",BRFSS (or similar) How many times during the p...,Behavior Risk Factors: Selected Metropolitan A...,,Tarrant County (not just Fort Worth),,,,
1,128,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2011,Both,All,19.7,"Fort Worth (Tarrant County), TX",BRFSS (or similar) How many times during the p...,Behavior Risk Factors: Selected Metropolitan A...,,Tarrant County (not just Fort Worth),,,,
2,180,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2012,Both,All,17.7,"Fort Worth (Tarrant County), TX",BRFSS (or similar) How many times during the p...,Behavior Risk Factors: Selected Metropolitan A...,,Tarrant County (not just Fort Worth),,,,
3,354,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2013,Both,All,14.6,"Fort Worth (Tarrant County), TX",BRFSS (or similar) How many times during the p...,"Centers for Disease Control and Prevention, Na...",,FW-Arlington Metropolitan Division includes re...,,,11.0,18.2
4,400,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2014,Both,All,13.4,"Fort Worth (Tarrant County), TX",BRFSS (or similar) How many times during the p...,"Centers for Disease Control and Prevention, Na...",,FW-Arlington Metropolitan Division includes re...,,,9.4,17.3


It's also possible to label the rows as they get concatenated together.  That can be handy if you want to keep track of which input file each row came from.

In [16]:
drinking2 = pd.concat(dataframes, keys=files)

In [17]:
drinking2.head()

Unnamed: 0.1,Unnamed: 1,Unnamed: 0,Indicator Category,Indicator,Year,Sex,Race/Ethnicity,Value,Place,BCHC Requested Methodology,Source,Methods,Notes,90% Confidence Level - Low,90% Confidence Level - High,95% Confidence Level - Low,95% Confidence Level - High
Fort_Worth_Tarrant_County_TX.csv,0,15,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2010,Both,All,15.3,"Fort Worth (Tarrant County), TX",BRFSS (or similar) How many times during the p...,Behavior Risk Factors: Selected Metropolitan A...,,Tarrant County (not just Fort Worth),,,,
Fort_Worth_Tarrant_County_TX.csv,1,128,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2011,Both,All,19.7,"Fort Worth (Tarrant County), TX",BRFSS (or similar) How many times during the p...,Behavior Risk Factors: Selected Metropolitan A...,,Tarrant County (not just Fort Worth),,,,
Fort_Worth_Tarrant_County_TX.csv,2,180,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2012,Both,All,17.7,"Fort Worth (Tarrant County), TX",BRFSS (or similar) How many times during the p...,Behavior Risk Factors: Selected Metropolitan A...,,Tarrant County (not just Fort Worth),,,,
Fort_Worth_Tarrant_County_TX.csv,3,354,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2013,Both,All,14.6,"Fort Worth (Tarrant County), TX",BRFSS (or similar) How many times during the p...,"Centers for Disease Control and Prevention, Na...",,FW-Arlington Metropolitan Division includes re...,,,11.0,18.2
Fort_Worth_Tarrant_County_TX.csv,4,400,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2014,Both,All,13.4,"Fort Worth (Tarrant County), TX",BRFSS (or similar) How many times during the p...,"Centers for Disease Control and Prevention, Na...",,FW-Arlington Metropolitan Division includes re...,,,9.4,17.3


In [18]:
drinking2.head().reset_index()

Unnamed: 0.1,level_0,level_1,Unnamed: 0,Indicator Category,Indicator,Year,Sex,Race/Ethnicity,Value,Place,BCHC Requested Methodology,Source,Methods,Notes,90% Confidence Level - Low,90% Confidence Level - High,95% Confidence Level - Low,95% Confidence Level - High
0,Fort_Worth_Tarrant_County_TX.csv,0,15,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2010,Both,All,15.3,"Fort Worth (Tarrant County), TX",BRFSS (or similar) How many times during the p...,Behavior Risk Factors: Selected Metropolitan A...,,Tarrant County (not just Fort Worth),,,,
1,Fort_Worth_Tarrant_County_TX.csv,1,128,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2011,Both,All,19.7,"Fort Worth (Tarrant County), TX",BRFSS (or similar) How many times during the p...,Behavior Risk Factors: Selected Metropolitan A...,,Tarrant County (not just Fort Worth),,,,
2,Fort_Worth_Tarrant_County_TX.csv,2,180,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2012,Both,All,17.7,"Fort Worth (Tarrant County), TX",BRFSS (or similar) How many times during the p...,Behavior Risk Factors: Selected Metropolitan A...,,Tarrant County (not just Fort Worth),,,,
3,Fort_Worth_Tarrant_County_TX.csv,3,354,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2013,Both,All,14.6,"Fort Worth (Tarrant County), TX",BRFSS (or similar) How many times during the p...,"Centers for Disease Control and Prevention, Na...",,FW-Arlington Metropolitan Division includes re...,,,11.0,18.2
4,Fort_Worth_Tarrant_County_TX.csv,4,400,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2014,Both,All,13.4,"Fort Worth (Tarrant County), TX",BRFSS (or similar) How many times during the p...,"Centers for Disease Control and Prevention, Na...",,FW-Arlington Metropolitan Division includes re...,,,9.4,17.3


In [None]:
drinking2.index.levels[0]
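
As a minimal sketch of what `keys=` produces, the example below builds a two-level index from two made-up data frames. The file names, the values, and the level names `source_file` and `row` are assumptions chosen for illustration.

```python
import pandas as pd

# Two small, made-up data frames standing in for two input files.
a = pd.DataFrame({"Value": [15.3, 19.7]})
b = pd.DataFrame({"Value": [12.1]})

# keys= adds an outer index level; names= labels both index levels.
labeled = pd.concat(
    [a, b],
    keys=["City_A.csv", "City_B.csv"],
    names=["source_file", "row"],
)

# The distinct labels are stored in the first level of the MultiIndex.
print(labeled.index.levels[0].tolist())  # ['City_A.csv', 'City_B.csv']

# Selecting on the outer level returns all rows that came from one file.
print(labeled.loc["City_B.csv"])

# reset_index() turns both index levels into ordinary columns.
flat = labeled.reset_index()
print(flat.columns.tolist())  # ['source_file', 'row', 'Value']
```

Naming the levels up front avoids the generic `level_0` and `level_1` column names after `reset_index()`.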

## Concatenating Side-by-Side

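The stacking example above is more common, but you can also concatenate data side-by-side by passing `axis=1`.  Rows are lined up by their index labels, and a row that exists in only one of the data frames gets NaN in the other's columns.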

In [33]:
names1=[['Paul','Boal'],['Anny', 'Monroe'],['Eric','Westhus'],['Andy','Slavitt']]
names2=[['Paul Boal'],['Anny Monroe'],['Eric Westhus'],[''],['Mario Garza']]
n1 = pd.DataFrame(names1, columns=['First','Last'])
n2 = pd.DataFrame(names2, columns=['Full Name'])

In [34]:
n1

Unnamed: 0,First,Last
0,Paul,Boal
1,Anny,Monroe
2,Eric,Westhus
3,Andy,Slavitt


In [35]:
n2

Unnamed: 0,Full Name
0,Paul Boal
1,Anny Monroe
2,Eric Westhus
3,
4,Mario Garza


In [36]:
pd.concat([n1,n2], axis=1)

Unnamed: 0,First,Last,Full Name
0,Paul,Boal,Paul Boal
1,Anny,Monroe,Anny Monroe
2,Eric,Westhus,Eric Westhus
3,Andy,Slavitt,
4,,,Mario Garza
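
Side-by-side concatenation matches rows on index labels, not on row position. The sketch below uses two made-up data frames (`left` and `right` are hypothetical) with partly overlapping labels to show the default outer join next to `join="inner"`.

```python
import pandas as pd

# Made-up data frames whose index labels only partly overlap.
left = pd.DataFrame({"First": ["Paul", "Anny"]}, index=[0, 1])
right = pd.DataFrame({"Full Name": ["Anny Monroe", "Mario Garza"]}, index=[1, 2])

# Default (outer) join keeps every index label found in either input.
outer = pd.concat([left, right], axis=1)

# join="inner" keeps only the labels present in both inputs.
inner = pd.concat([left, right], axis=1, join="inner")

print(outer.index.tolist())  # [0, 1, 2]
print(inner.index.tolist())  # [1]
```

In `outer`, label 0 has no `Full Name` and label 2 has no `First`, so those cells are NaN.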


## Masking

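With "masking", we are taking two data sets and overlaying one on top of the other.  Rows are matched by index label.  If the first has a value, then that value will be kept.  If the first has a blank (NaN), then the underlying value from the second data set will be shown.  Rows that exist only in the second data set are carried into the result as well.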
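
Here is a minimal sketch of that overlay using `DataFrame.combine_first` on two tiny made-up data frames. The names `top` and `under`, the index values, and the cities are hypothetical and are not taken from the NPPES files.

```python
import pandas as pd

# "top" is the data set whose values win; it has gaps (None).
top = pd.DataFrame(
    {"City": ["PONCE", None], "State": ["PR", None]},
    index=[101, 102],
)

# "under" supplies values wherever "top" is missing.
under = pd.DataFrame(
    {"City": ["SAN JUAN", "VEGA BAJA", "WINNIPEG"], "State": ["XX", "PR", "MB"]},
    index=[101, 102, 103],
)

overlay = top.combine_first(under)

print(overlay.loc[101].tolist())  # ['PONCE', 'PR']     values from top are kept
print(overlay.loc[102].tolist())  # ['VEGA BAJA', 'PR'] gaps filled from under
print(overlay.loc[103].tolist())  # ['WINNIPEG', 'MB']  row only in under
```

Row 101 shows that a value in `top` is never overwritten, even when `under` disagrees with it.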

In [37]:
nppes1 = pd.read_csv('/data/nppes1.csv')
nppes2 = pd.read_csv('/data/nppes2.csv')
nppes1.set_index('NPI', inplace=True)
nppes2.set_index('NPI', inplace=True)

In [38]:
nppes2.head()

Unnamed: 0_level_0,Entity Type Code,Provider Organization Name (Legal Business Name),Provider Last Name (Legal Name),Provider First Name,Provider Middle Name,City,State
NPI,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1710950183,1,,JIMENEZ MORALES,LUZ,M,SAN GERMAN,PR
1740769413,1,,RIVERA TORRES,NOELLIE,MARIE,SAN JUAN,PR
1164984100,1,,LUGO JOSE,YADHIRA,,PONCE,PR
1497217442,2,HEALTHSTAT CLINICS LLC,,,,VEGA BAJA,PR
1841752896,1,,DU,XIAO ZHOU,,WINNIPEG,MB


In [39]:
nppes1['State'].count()

18699

In [40]:
len(nppes1)

18717

In [41]:
len(nppes2)

111

In [42]:
nppes2

Unnamed: 0_level_0,Entity Type Code,Provider Organization Name (Legal Business Name),Provider Last Name (Legal Name),Provider First Name,Provider Middle Name,City,State
NPI,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1710950183,1,,JIMENEZ MORALES,LUZ,M,SAN GERMAN,PR
1740769413,1,,RIVERA TORRES,NOELLIE,MARIE,SAN JUAN,PR
1164984100,1,,LUGO JOSE,YADHIRA,,PONCE,PR
1497217442,2,HEALTHSTAT CLINICS LLC,,,,VEGA BAJA,PR
1841752896,1,,DU,XIAO ZHOU,,WINNIPEG,MB
...,...,...,...,...,...,...,...
1952864084,1,,REICHMAN,JAMES,,JERUSALEM,JERUSALEM
1265995302,1,,SURI,KARTIK,RAJ,BURNABY,BC
1841609351,1,,SANTANA,ABRAHAM,,SAN JUAN,PR
1902369085,1,,LUGO MUNOZ,JOSE,JUAN,SANTA ISABEL,PR


In [43]:
nppes1[pd.isnull(nppes1['State'])]

Unnamed: 0_level_0,Entity Type Code,Provider Organization Name (Legal Business Name),Provider Last Name (Legal Name),Provider First Name,Provider Middle Name,City,State
NPI,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1225590060,1,,,,,,
1649732488,1,,,,,,
1235691015,1,,,,,,
1609338318,1,,,,,,
1841752532,1,,,,,,
1184186736,1,,,,,,
1780146365,1,,,,,,
1124580600,2,KHKN CORPORATION,,,,,
1912469404,1,,,,,,
1326500869,1,,,,,,


In [44]:
combined = nppes1.combine_first(nppes2)

In [45]:
combined['State'].count()

18717

In [46]:
len(nppes1)

18717

In [47]:
combined.loc[1225590060]

Entity Type Code                                                1
Provider Organization Name (Legal Business Name)              NaN
Provider Last Name (Legal Name)                     ALICEA CASTRO
Provider First Name                                          ERIC
Provider Middle Name                                      GABRIEL
City                                                    VEGA BAJA
State                                                 PUERTO RICO
Name: 1225590060, dtype: object

In [48]:
nppes1.loc[1225590060]

Entity Type Code                                      1
Provider Organization Name (Legal Business Name)    NaN
Provider Last Name (Legal Name)                     NaN
Provider First Name                                 NaN
Provider Middle Name                                NaN
City                                                NaN
State                                               NaN
Name: 1225590060, dtype: object