# Merging DataFrames Together

In this module, we're going to talk about two different types of merging: concatenation and masking.

In [1]:
import pandas as pd

## Concatenation

To "concatenate" means to combine things end-to-end.  Here, that means merging multiple data sets by appending the rows of each one after the rows of the one before it.
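
As a minimal sketch of that idea, the two toy frames below (made-up places and values, not the course data) share the same columns and are stacked with `pd.concat`:

```python
import pandas as pd

# Two small, made-up frames with identical columns (illustrative data only)
a = pd.DataFrame({'Place': ['City A', 'City A'], 'Value': [1.0, 2.0]})
b = pd.DataFrame({'Place': ['City B'], 'Value': [3.0]})

# Stack the rows of b underneath the rows of a
stacked = pd.concat([a, b])

print(stacked)
print(len(stacked) == len(a) + len(b))
```

The row count of the result should equal the sum of the row counts of the inputs, which is the same check used on the real data later in this module.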

In `https://hds5210-data.s3.amazonaws.com/drinking/` there is a list of files that we want to merge together into a single data frame.  They all have the same format, but they are from different cities.

In [2]:
%%bash

# *** WARNING ***
# Do not run this code on your local machine if you have the AWS CLI already configured
# It could cause a problem with your existing security credentials
# and permanently erase your existing access keys and secrets

# If you're curious about what this code does, 
# it creates a file called ~/.aws/credentials with credentials I've created
# that allow you to list files in a particular AWS s3 storage bucket.

mkdir -p ~/.aws
grep hds5210 ~/.aws/credentials 2>/dev/null || cat >>~/.aws/credentials <<EOF 
[hds5210]
aws_access_key_id = AKIAUXBOKEFK63ZPGD62
aws_secret_access_key = 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
aws_default_region = us-east-1
EOF
chmod 600 ~/.aws/credentials
cat ~/.aws/credentials

[hds5210]
aws_access_key_id = AKIAUXBOKEFK63ZPGD62
aws_secret_access_key = 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
aws_default_region = us-east-1


In [3]:
# Then this one-liner gets a list of the files in a specific storage
# bucket subfolder and writes that list of files to a files.txt file.
# After you run this code, you should see a file in Google Colab
# with this same name.  From there, we'll use Python code to get the files.
!aws --profile hds5210 s3 ls s3://hds5210-data/drinking/ >files.txt

In [4]:
# Here's a function we'll use to read all of the file names from that
# text file that the aws command above created.
# The command above outputs in a "human readable" format that we have to parse
# making some assumptions (like file names won't have spaces in them).  It 
# only works because this specific subfolder doesn't have any files with spaces
# in the name.

def get_files(listing_file):
  files = []

  # Open the listing file
  with open(listing_file) as f:
    for line in f.readlines():
      # Split based on space, grab the last item, strip off extra newline
      name = line.split(' ')[-1].strip()
      # The aws command returns an empty-name file as well for some reason
      # So, we'll strip that out
      if len(name) > 0:
        files.append(name)

  # Return the list of files
  return files
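
If file names could contain spaces, one possible variation is to split each line into at most four pieces, so that everything after the size column is kept as the name. This is only a sketch: it assumes each listing line has the layout date, time, size, name, and both the `parse_listing_line` helper and the sample lines are hypothetical, not taken from the real bucket.

```python
def parse_listing_line(line):
    """Return the file name from one listing line, or None if there is none."""
    # Assumed layout: date, time, size, then the name (which may contain spaces)
    parts = line.strip().split(None, 3)
    if len(parts) < 4:
        # Covers blank lines, "PRE folder/" lines, and the empty-name entry
        return None
    return parts[3]

# Hypothetical listing lines, for illustration only
sample = [
    '2021-01-01 12:00:00          0 \n',
    '2021-01-01 12:00:05       1234 Baltimore_MD.csv\n',
    '2021-01-01 12:00:09       5678 New York City.csv\n',
]

names = [n for n in map(parse_listing_line, sample) if n is not None]
print(names)
```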

In [5]:
files = get_files('files.txt')

In [6]:
files

['Baltimore_MD.csv',
 'Boston_MA.csv',
 'Charlotte_NC.csv',
 'Chicago_Il.csv',
 'Columbus_OH.csv',
 'Denver_CO.csv',
 'Detroit_MI.csv',
 'Fort_Worth_Tarrant_County_TX.csv',
 'Houston_TX.csv',
 'Indianapolis_Marion_County_IN.csv',
 'Kansas_City_MO.csv',
 'Las_Vegas_Clark_County_NV.csv',
 'Long_Beach_CA.csv',
 'Los_Angeles_CA.csv',
 'Miami_Miami-Dade_County_FL.csv',
 'Minneapolis_MN.csv',
 'New_York_City_NY.csv',
 'Oakland_Alameda_County_CA.csv',
 'Philadelphia_PA.csv',
 'Phoenix_AZ.csv',
 'Portland_Multnomah_County_OR.csv',
 'San_Antonio_TX.csv',
 'San_Diego_County_CA.csv',
 'San_Jose_CA.csv',
 'Seattle_WA.csv',
 'U.S._Total_U.S._Total.csv',
 'Washington_DC.csv']

In [7]:
len(files)

27

In [8]:
# Then, let's read each of those files into its own DataFrame and store them all in a list
dataframes = []

In [9]:
for f in files:
    df = pd.read_csv('https://hds5210-data.s3.amazonaws.com/drinking/'+f)
    print(f'Read {f}')
    dataframes.append(df)

Read Baltimore_MD.csv
Read Boston_MA.csv
Read Charlotte_NC.csv
Read Chicago_Il.csv
Read Columbus_OH.csv
Read Denver_CO.csv
Read Detroit_MI.csv
Read Fort_Worth_Tarrant_County_TX.csv
Read Houston_TX.csv
Read Indianapolis_Marion_County_IN.csv
Read Kansas_City_MO.csv
Read Las_Vegas_Clark_County_NV.csv
Read Long_Beach_CA.csv
Read Los_Angeles_CA.csv
Read Miami_Miami-Dade_County_FL.csv
Read Minneapolis_MN.csv
Read New_York_City_NY.csv
Read Oakland_Alameda_County_CA.csv
Read Philadelphia_PA.csv
Read Phoenix_AZ.csv
Read Portland_Multnomah_County_OR.csv
Read San_Antonio_TX.csv
Read San_Diego_County_CA.csv
Read San_Jose_CA.csv
Read Seattle_WA.csv
Read U.S._Total_U.S._Total.csv
Read Washington_DC.csv


In [10]:
len(dataframes)

27

In [11]:
type(dataframes[0])

pandas.core.frame.DataFrame

In [12]:
dataframes[0].head()

Unnamed: 0.1,Unnamed: 0,Indicator Category,Indicator,Year,Sex,Race/Ethnicity,Value,Place,BCHC Requested Methodology,Source,Methods,Notes,90% Confidence Level - Low,90% Confidence Level - High,95% Confidence Level - Low,95% Confidence Level - High
0,21,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2010,Both,All,14.5,"Baltimore, MD",BRFSS (or similar) How many times during the p...,CDC BRFSS,The three most recent years of available data ...,"Due to changes in BRFSS sampling methodology, ...",,,,
1,22,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2010,Both,Black,9.5,"Baltimore, MD",BRFSS (or similar) How many times during the p...,CDC BRFSS,The three most recent years of available data ...,"Due to changes in BRFSS sampling methodology, ...",,,,
2,29,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2010,Both,White,21.1,"Baltimore, MD",BRFSS (or similar) How many times during the p...,CDC BRFSS,The three most recent years of available data ...,"Due to changes in BRFSS sampling methodology, ...",,,,
3,30,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2010,Female,All,9.7,"Baltimore, MD",BRFSS (or similar) How many times during the p...,CDC BRFSS,The three most recent years of available data ...,"Due to changes in BRFSS sampling methodology, ...",,,,
4,31,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2010,Male,All,20.3,"Baltimore, MD",BRFSS (or similar) How many times during the p...,CDC BRFSS,The three most recent years of available data ...,"Due to changes in BRFSS sampling methodology, ...",,,,


In [13]:
len(dataframes)

27

In [14]:
# Then we can concatenate them together with pd.concat
drinking = pd.concat(dataframes)

Let's check to make sure the counts match up...

Length of combined dataframe == Sum of the length of the individual dataframes?

In [15]:
len(drinking)

599

In [16]:
sum([len(x) for x in dataframes])

599

In [17]:
drinking.head()

Unnamed: 0.1,Unnamed: 0,Indicator Category,Indicator,Year,Sex,Race/Ethnicity,Value,Place,BCHC Requested Methodology,Source,Methods,Notes,90% Confidence Level - Low,90% Confidence Level - High,95% Confidence Level - Low,95% Confidence Level - High
0,21,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2010,Both,All,14.5,"Baltimore, MD",BRFSS (or similar) How many times during the p...,CDC BRFSS,The three most recent years of available data ...,"Due to changes in BRFSS sampling methodology, ...",,,,
1,22,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2010,Both,Black,9.5,"Baltimore, MD",BRFSS (or similar) How many times during the p...,CDC BRFSS,The three most recent years of available data ...,"Due to changes in BRFSS sampling methodology, ...",,,,
2,29,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2010,Both,White,21.1,"Baltimore, MD",BRFSS (or similar) How many times during the p...,CDC BRFSS,The three most recent years of available data ...,"Due to changes in BRFSS sampling methodology, ...",,,,
3,30,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2010,Female,All,9.7,"Baltimore, MD",BRFSS (or similar) How many times during the p...,CDC BRFSS,The three most recent years of available data ...,"Due to changes in BRFSS sampling methodology, ...",,,,
4,31,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2010,Male,All,20.3,"Baltimore, MD",BRFSS (or similar) How many times during the p...,CDC BRFSS,The three most recent years of available data ...,"Due to changes in BRFSS sampling methodology, ...",,,,
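
One thing to be aware of: by default, `pd.concat` keeps each input's original row labels, so the combined index can contain repeats. The sketch below uses two made-up frames to show this, along with the `ignore_index=True` option that renumbers the rows.

```python
import pandas as pd

# Made-up frames, each with the default 0..n-1 index
a = pd.DataFrame({'Value': [10, 20]})
b = pd.DataFrame({'Value': [30, 40]})

# Default: each frame keeps its own labels, so 0 and 1 appear twice
kept = pd.concat([a, b])
print(kept.index.tolist())        # [0, 1, 0, 1]
print(kept.index.is_unique)       # False

# ignore_index=True discards the old labels and numbers the rows 0..n-1
renumbered = pd.concat([a, b], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Whether to renumber depends on what comes next. Repeated labels are harmless for counting rows, but `.loc[0]` would return one row per input frame.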


It's also possible to label the rows as they get concatenated together.  That can be handy if you want to keep track of which input file each row came from.

In [18]:
drinking2 = pd.concat(dataframes, keys=files)

In [19]:
drinking2.head()

Unnamed: 0.1,Unnamed: 1,Unnamed: 0,Indicator Category,Indicator,Year,Sex,Race/Ethnicity,Value,Place,BCHC Requested Methodology,Source,Methods,Notes,90% Confidence Level - Low,90% Confidence Level - High,95% Confidence Level - Low,95% Confidence Level - High
Baltimore_MD.csv,0,21,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2010,Both,All,14.5,"Baltimore, MD",BRFSS (or similar) How many times during the p...,CDC BRFSS,The three most recent years of available data ...,"Due to changes in BRFSS sampling methodology, ...",,,,
Baltimore_MD.csv,1,22,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2010,Both,Black,9.5,"Baltimore, MD",BRFSS (or similar) How many times during the p...,CDC BRFSS,The three most recent years of available data ...,"Due to changes in BRFSS sampling methodology, ...",,,,
Baltimore_MD.csv,2,29,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2010,Both,White,21.1,"Baltimore, MD",BRFSS (or similar) How many times during the p...,CDC BRFSS,The three most recent years of available data ...,"Due to changes in BRFSS sampling methodology, ...",,,,
Baltimore_MD.csv,3,30,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2010,Female,All,9.7,"Baltimore, MD",BRFSS (or similar) How many times during the p...,CDC BRFSS,The three most recent years of available data ...,"Due to changes in BRFSS sampling methodology, ...",,,,
Baltimore_MD.csv,4,31,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2010,Male,All,20.3,"Baltimore, MD",BRFSS (or similar) How many times during the p...,CDC BRFSS,The three most recent years of available data ...,"Due to changes in BRFSS sampling methodology, ...",,,,


In [20]:
drinking2.head().reset_index()

Unnamed: 0.1,level_0,level_1,Unnamed: 0,Indicator Category,Indicator,Year,Sex,Race/Ethnicity,Value,Place,BCHC Requested Methodology,Source,Methods,Notes,90% Confidence Level - Low,90% Confidence Level - High,95% Confidence Level - Low,95% Confidence Level - High
0,Baltimore_MD.csv,0,21,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2010,Both,All,14.5,"Baltimore, MD",BRFSS (or similar) How many times during the p...,CDC BRFSS,The three most recent years of available data ...,"Due to changes in BRFSS sampling methodology, ...",,,,
1,Baltimore_MD.csv,1,22,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2010,Both,Black,9.5,"Baltimore, MD",BRFSS (or similar) How many times during the p...,CDC BRFSS,The three most recent years of available data ...,"Due to changes in BRFSS sampling methodology, ...",,,,
2,Baltimore_MD.csv,2,29,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2010,Both,White,21.1,"Baltimore, MD",BRFSS (or similar) How many times during the p...,CDC BRFSS,The three most recent years of available data ...,"Due to changes in BRFSS sampling methodology, ...",,,,
3,Baltimore_MD.csv,3,30,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2010,Female,All,9.7,"Baltimore, MD",BRFSS (or similar) How many times during the p...,CDC BRFSS,The three most recent years of available data ...,"Due to changes in BRFSS sampling methodology, ...",,,,
4,Baltimore_MD.csv,4,31,Behavioral Health/Substance Abuse,Percent of Adults Who Binge Drank,2010,Male,All,20.3,"Baltimore, MD",BRFSS (or similar) How many times during the p...,CDC BRFSS,The three most recent years of available data ...,"Due to changes in BRFSS sampling methodology, ...",,,,


In [21]:
drinking2.index.levels[0]

Index(['Baltimore_MD.csv', 'Boston_MA.csv', 'Charlotte_NC.csv',
       'Chicago_Il.csv', 'Columbus_OH.csv', 'Denver_CO.csv', 'Detroit_MI.csv',
       'Fort_Worth_Tarrant_County_TX.csv', 'Houston_TX.csv',
       'Indianapolis_Marion_County_IN.csv', 'Kansas_City_MO.csv',
       'Las_Vegas_Clark_County_NV.csv', 'Long_Beach_CA.csv',
       'Los_Angeles_CA.csv', 'Miami_Miami-Dade_County_FL.csv',
       'Minneapolis_MN.csv', 'New_York_City_NY.csv',
       'Oakland_Alameda_County_CA.csv', 'Philadelphia_PA.csv',
       'Phoenix_AZ.csv', 'Portland_Multnomah_County_OR.csv',
       'San_Antonio_TX.csv', 'San_Diego_County_CA.csv', 'San_Jose_CA.csv',
       'Seattle_WA.csv', 'U.S._Total_U.S._Total.csv', 'Washington_DC.csv'],
      dtype='object')
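
Once rows are labeled with `keys`, the labels can be used to select or reshape the data. The sketch below uses made-up frames and file names; the level names `source_file` and `row` are illustrative choices passed through the `names` parameter, not something the notebook defines.

```python
import pandas as pd

# Made-up frames standing in for two input files
a = pd.DataFrame({'Value': [10, 20]})
b = pd.DataFrame({'Value': [30]})

# keys= labels each input; names= labels the two resulting index levels
labeled = pd.concat([a, b], keys=['a.csv', 'b.csv'], names=['source_file', 'row'])

# Select every row that came from one input
from_b = labeled.loc['b.csv']
print(from_b)

# Move only the file label into a regular column, keeping the row numbers as the index
flat = labeled.reset_index(level='source_file')
print(flat)
```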

## Concatenating Side-by-Side

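Stacking rows, as above, is the more common case, but `pd.concat` can also place data sets side-by-side by passing `axis=1`.  Rows are lined up by index label, and any label missing from one data set is filled with NaN.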

In [22]:
names1=[['Paul','Boal'],['Anny', 'Monroe'],['Eric','Westhus'],['Andy','Slavitt']]
names2=[['Paul Boal'],['Anny Monroe'],['Eric Westhus'],[''],['Mario Garza']]
n1 = pd.DataFrame(names1, columns=['First','Last'])
n2 = pd.DataFrame(names2, columns=['Full Name'])

In [23]:
n1

Unnamed: 0,First,Last
0,Paul,Boal
1,Anny,Monroe
2,Eric,Westhus
3,Andy,Slavitt


In [24]:
n2

Unnamed: 0,Full Name
0,Paul Boal
1,Anny Monroe
2,Eric Westhus
3,
4,Mario Garza


In [25]:
pd.concat([n1,n2], axis=1)

Unnamed: 0,First,Last,Full Name
0,Paul,Boal,Paul Boal
1,Anny,Monroe,Anny Monroe
2,Eric,Westhus,Eric Westhus
3,Andy,Slavitt,
4,,,Mario Garza
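
Side-by-side concatenation matches rows by index label rather than by position. The sketch below illustrates that with small made-up frames whose labels are deliberately out of order, and also shows `join='inner'`, which keeps only the labels found in every input.

```python
import pandas as pd

left = pd.DataFrame({'First': ['Paul', 'Anny']}, index=[0, 1])
# The same two people, stored in the opposite order with labels 1 and 0
right = pd.DataFrame({'Full Name': ['Anny Monroe', 'Paul Boal']}, index=[1, 0])

# Rows are matched by label, so Paul still lines up with Paul Boal
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side)

# Add a row whose label (4) exists only on the right-hand side
right_plus = pd.concat([right, pd.DataFrame({'Full Name': ['Mario Garza']}, index=[4])])

# join='inner' keeps only labels present in every frame, so label 4 is dropped
inner = pd.concat([left, right_plus], axis=1, join='inner')
print(inner)
```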


## Masking

With "masking", we take two data sets and overlay one on top of the other.  Where the first has a value, that value is kept.  Where the first has a blank (NaN), the underlying value from the second data set shows through.
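
Before working with the real provider files, here is a minimal sketch of the overlay using `combine_first` on two tiny made-up frames. The index values and cell contents are invented for illustration.

```python
import numpy as np
import pandas as pd

# The "top" frame has some blanks (NaN)
top = pd.DataFrame({'City': ['PONCE', np.nan], 'State': [np.nan, 'PR']},
                   index=[101, 102])

# The "underlying" frame has a value in every cell
under = pd.DataFrame({'City': ['SAN JUAN', 'VEGA BAJA'],
                      'State': ['PUERTO RICO', 'PUERTO RICO']},
                     index=[101, 102])

# Values in top win; blanks in top are filled from under, matched by index and column
overlaid = top.combine_first(under)
print(overlaid)
```

Row 101 keeps its own city but picks up the state from the underlying frame, while row 102 does the reverse.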

In [26]:
nppes1 = pd.read_csv('https://hds5210-data.s3.amazonaws.com/nppes1.csv')
nppes2 = pd.read_csv('https://hds5210-data.s3.amazonaws.com/nppes2.csv')
nppes1.set_index('NPI', inplace=True)
nppes2.set_index('NPI', inplace=True)

In [27]:
nppes2.head()

NPI,Entity Type Code,Provider Organization Name (Legal Business Name),Provider Last Name (Legal Name),Provider First Name,Provider Middle Name,City,State
1710950183,1,,JIMENEZ MORALES,LUZ,M,SAN GERMAN,PR
1740769413,1,,RIVERA TORRES,NOELLIE,MARIE,SAN JUAN,PR
1164984100,1,,LUGO JOSE,YADHIRA,,PONCE,PR
1497217442,2,HEALTHSTAT CLINICS LLC,,,,VEGA BAJA,PR
1841752896,1,,DU,XIAO ZHOU,,WINNIPEG,MB


In [28]:
nppes1['State'].count()

18699

In [29]:
len(nppes1)

18717

In [30]:
len(nppes2)

111

In [31]:
nppes2

NPI,Entity Type Code,Provider Organization Name (Legal Business Name),Provider Last Name (Legal Name),Provider First Name,Provider Middle Name,City,State
1710950183,1,,JIMENEZ MORALES,LUZ,M,SAN GERMAN,PR
1740769413,1,,RIVERA TORRES,NOELLIE,MARIE,SAN JUAN,PR
1164984100,1,,LUGO JOSE,YADHIRA,,PONCE,PR
1497217442,2,HEALTHSTAT CLINICS LLC,,,,VEGA BAJA,PR
1841752896,1,,DU,XIAO ZHOU,,WINNIPEG,MB
...,...,...,...,...,...,...,...
1952864084,1,,REICHMAN,JAMES,,JERUSALEM,JERUSALEM
1265995302,1,,SURI,KARTIK,RAJ,BURNABY,BC
1841609351,1,,SANTANA,ABRAHAM,,SAN JUAN,PR
1902369085,1,,LUGO MUNOZ,JOSE,JUAN,SANTA ISABEL,PR


In [32]:
nppes1[pd.isnull(nppes1['State'])]

NPI,Entity Type Code,Provider Organization Name (Legal Business Name),Provider Last Name (Legal Name),Provider First Name,Provider Middle Name,City,State
1225590060,1,,,,,,
1649732488,1,,,,,,
1235691015,1,,,,,,
1609338318,1,,,,,,
1841752532,1,,,,,,
1184186736,1,,,,,,
1780146365,1,,,,,,
1124580600,2,KHKN CORPORATION,,,,,
1912469404,1,,,,,,
1326500869,1,,,,,,


In [33]:
combined = nppes1.combine_first(nppes2)

In [34]:
combined['State'].count()

18717

In [35]:
len(nppes1)

18717

In [36]:
combined.loc[1225590060]

Entity Type Code                                                1
Provider Organization Name (Legal Business Name)              NaN
Provider Last Name (Legal Name)                     ALICEA CASTRO
Provider First Name                                          ERIC
Provider Middle Name                                      GABRIEL
City                                                    VEGA BAJA
State                                                 PUERTO RICO
Name: 1225590060, dtype: object

In [37]:
nppes1.loc[1225590060]

Entity Type Code                                      1
Provider Organization Name (Legal Business Name)    NaN
Provider Last Name (Legal Name)                     NaN
Provider First Name                                 NaN
Provider Middle Name                                NaN
City                                                NaN
State                                               NaN
Name: 1225590060, dtype: object
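
Comparing single rows works for spot checks. To see every cell that was filled in, one possible approach is to compare missing-value masks before and after the overlay. The sketch below uses made-up frames, and the variable names are illustrative. It also shows that `combine_first` returns the union of both indexes, so a row that exists only in the second frame appears in the result.

```python
import numpy as np
import pandas as pd

first = pd.DataFrame({'State': ['MO', np.nan]}, index=[1, 2])
second = pd.DataFrame({'State': ['IL', 'PR', 'BC']}, index=[1, 2, 3])

combined = first.combine_first(second)

# Align the first frame to the combined index so the two masks are comparable
was_missing = first.reindex(combined.index).isna()

# True wherever a value was absent in the first frame but present after combining
filled = was_missing & combined.notna()

print(combined)
print(filled)
```

Row 3 counts as filled here because it was not in the first frame at all.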