# Merging DataFrames with Pandas

As a data scientist, you'll often find that the data you need is not in a single file. It may be spread across a number of text files, spreadsheets, or databases. You’ll want to be able to import the data you’re interested in as a collection of DataFrames and combine them to answer your central questions. This notebook is all about the act of combining—or merging—DataFrames, an essential part of any data scientist's toolbox. You'll hone your pandas skills by learning how to organize, reshape, and aggregate multiple datasets to answer your specific questions.

## Reading DataFrames from multiple files

When data is spread among several files, you usually invoke pandas' `read_csv()` (or a similar data import function) multiple times to load the data into several DataFrames.

The data files for this example are derived from a [list of Olympic medals awarded between 1896 and 2008, compiled by the Guardian](https://www.theguardian.com/sport/datablog/2012/jun/25/olympic-medal-winner-list-data).

Each DataFrame has the columns `NOC`, `Country`, and `Total`, where `NOC` is a three-letter country code and `Total` is the number of medals of that type (bronze, silver, or gold) the country won.

In [3]:
# Import pandas
import pandas as pd

In [4]:
# Read 'Bronze.csv' into a DataFrame: bronze
bronze = pd.read_csv('data/dc01/summer-olympic-medals/Bronze.csv')
bronze.head()

   NOC         Country   Total
0  USA   United States  1052.0
1  URS    Soviet Union   584.0
2  GBR  United Kingdom   505.0
3  FRA          France   475.0
4  GER         Germany   454.0


In [5]:
# Read 'Silver.csv' into a DataFrame: silver
silver = pd.read_csv('data/dc01/summer-olympic-medals/Silver.csv')
silver.head()

   NOC         Country   Total
0  USA   United States  1195.0
1  URS    Soviet Union   627.0
2  GBR  United Kingdom   591.0
3  FRA          France   461.0
4  GER         Germany   350.0


In [6]:
# Read 'Gold.csv' into a DataFrame: gold
gold = pd.read_csv('data/dc01/summer-olympic-medals/Gold.csv')
gold.head()

   NOC         Country   Total
0  USA   United States  2088.0
1  URS    Soviet Union   838.0
2  GBR  United Kingdom   498.0
3  FRA          France   378.0
4  GER         Germany   407.0


## Reading DataFrames from multiple files in a loop

Loading data from multiple files into DataFrames is more concise with a **loop** or a **list comprehension** than with a separate `read_csv()` call for each file.

Notice that this approach is not restricted to working with CSV files. That is, even if your data comes in other formats, as long as pandas has a suitable data import function, you can apply a loop or comprehension to generate a list of DataFrames imported from the source files.

In [9]:
# Import pandas
import pandas as pd

# Create the list of file names: filenames
path = 'data/dc01/summer-olympic-medals/'
filenames = ['Gold.csv', 'Silver.csv', 'Bronze.csv']

# Create the list of three DataFrames: dataframes
dataframes = []
for filename in filenames:
    dataframes.append(pd.read_csv(path+filename))

# Print top 5 rows of 1st DataFrame in dataframes
print(dataframes[0].head())

   NOC         Country   Total
0  USA   United States  2088.0
1  URS    Soviet Union   838.0
2  GBR  United Kingdom   498.0
3  FRA          France   378.0
4  GER         Germany   407.0
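
The same loading step can also be written as a list comprehension. The sketch below is not part of the original exercise: so that it runs without the course data, it first writes small stand-in CSV files (two rows each, taken from the tables in this notebook) to a temporary directory. The names `samples` and `data_dir` are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in data (assumption): the first two rows of each medal table,
# written to a temporary directory so the example is self-contained.
samples = {
    'Gold.csv': 'NOC,Country,Total\nUSA,United States,2088.0\nURS,Soviet Union,838.0\n',
    'Silver.csv': 'NOC,Country,Total\nUSA,United States,1195.0\nURS,Soviet Union,627.0\n',
    'Bronze.csv': 'NOC,Country,Total\nUSA,United States,1052.0\nURS,Soviet Union,584.0\n',
}
data_dir = Path(tempfile.mkdtemp())
for name, text in samples.items():
    (data_dir / name).write_text(text)

filenames = ['Gold.csv', 'Silver.csv', 'Bronze.csv']

# List comprehension: one DataFrame per file, in the order of filenames
dataframes = [pd.read_csv(data_dir / name) for name in filenames]

# Print the top rows of the first DataFrame (the gold medals)
print(dataframes[0].head())
```

With the real data, replacing `data_dir` with the directory that holds the course files should give the same list as the loop above.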


## Combining DataFrames from multiple data files

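In this exercise, you'll combine the three DataFrames from earlier exercises (`gold`, `silver`, and `bronze`) into a single DataFrame called `medals`. The approach you'll use here is clumsy: it copies columns across one at a time, and it relies on all three DataFrames listing the countries in the same row order, because column assignment matches rows by index label rather than by country. Later on in the course, you'll see various powerful methods that are frequently used in practice for concatenating or merging DataFrames.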

Remember, the column labels of each DataFrame are `NOC`, `Country`, and `Total`, where `NOC` is a three-letter code for the name of the country and `Total` is the number of medals of that type won.

In [11]:
# Import pandas
import pandas as pd

# Make a copy of gold: medals
medals = gold.copy()

# Create list of new column labels: new_labels
new_labels = ['NOC', 'Country', 'Gold']

# Rename the columns of medals using new_labels
medals.columns = new_labels
medals.head()

   NOC         Country    Gold
0  USA   United States  2088.0
1  URS    Soviet Union   838.0
2  GBR  United Kingdom   498.0
3  FRA          France   378.0
4  GER         Germany   407.0
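
Assigning a full list to `medals.columns` means restating every label, including the ones that do not change. As an alternative sketch (not the method used in this exercise), `DataFrame.rename` changes only the column you name and returns a new DataFrame. The small `gold` table here is a stand-in built from rows shown above.

```python
import pandas as pd

# Stand-in for the gold table (assumption): two rows from the output above
gold = pd.DataFrame({
    'NOC': ['USA', 'URS'],
    'Country': ['United States', 'Soviet Union'],
    'Total': [2088.0, 838.0],
})

# Rename only 'Total'; the other labels are untouched and gold itself
# is not modified, because rename returns a new DataFrame by default
medals = gold.rename(columns={'Total': 'Gold'})

print(list(medals.columns))
```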


In [12]:
# Add columns 'Silver' & 'Bronze' to medals
medals['Silver'] = silver['Total']
medals['Bronze'] = bronze['Total']

In [13]:
# Print the head of medals
medals.head()

   NOC         Country    Gold  Silver  Bronze
0  USA   United States  2088.0  1195.0  1052.0
1  URS    Soviet Union   838.0   627.0   584.0
2  GBR  United Kingdom   498.0   591.0   505.0
3  FRA          France   378.0   461.0   475.0
4  GER         Germany   407.0   350.0   454.0
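
Assigning `silver['Total']` as a new column matches rows by index label, so the result is only correct when every DataFrame lists the countries in the same order. The sketch below illustrates the pitfall with a deliberately reordered `silver` table (a hypothetical ordering, built from values shown above), then shows one possible way to avoid it by using `NOC` as the index. The name `naive` is introduced only for this illustration.

```python
import pandas as pd

gold = pd.DataFrame({
    'NOC': ['USA', 'URS', 'GBR'],
    'Country': ['United States', 'Soviet Union', 'United Kingdom'],
    'Total': [2088.0, 838.0, 498.0],
})

# Hypothetical: the same silver counts, but with rows in a different order
silver = pd.DataFrame({
    'NOC': ['GBR', 'USA', 'URS'],
    'Country': ['United Kingdom', 'United States', 'Soviet Union'],
    'Total': [591.0, 1195.0, 627.0],
})

# Naive assignment aligns on the default integer index, so row 0 (USA)
# receives the silver count stored in row 0 of silver (GBR's count)
naive = gold.rename(columns={'Total': 'Gold'})
naive['Silver'] = silver['Total']

# Indexing both tables by NOC makes the assignment align by country code
medals = gold.set_index('NOC').rename(columns={'Total': 'Gold'})
medals['Silver'] = silver.set_index('NOC')['Total']

print(naive)
print(medals)
```

In `naive`, the United States row ends up with the United Kingdom's silver count, while `medals` pairs each country with its own value regardless of row order.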
