# Merging DataFrames with Pandas

As a data scientist, you'll often find that the data you need is not in a single file. It may be spread across a number of text files, spreadsheets, or databases. You’ll want to be able to import the data you’re interested in as a collection of DataFrames and combine them to answer your central questions. This notebook is all about the act of combining—or merging—DataFrames, an essential part of any data scientist's toolbox. You'll hone your pandas skills by learning how to organize, reshape, and aggregate multiple datasets to answer your specific questions.

## Reading DataFrames from multiple files

When data is spread among several files, you usually invoke pandas' `read_csv()` (or a similar data import function) multiple times to load the data into several DataFrames.

The data files for this example have been derived from a [list of Olympic medals awarded between 1896 & 2008 compiled by the Guardian](https://www.theguardian.com/sport/datablog/2012/jun/25/olympic-medal-winner-list-data).

The column labels of each DataFrame are `NOC`, `Country`, and `Total`, where `NOC` is a three-letter code for the name of the country and `Total` is the number of medals of that type won (bronze, silver, or gold).

In [63]:
# Import pandas
import pandas as pd

dpath = 'data/dc10/summer-olympic-medals/'
dpath2 = 'data/dc10/'
dpath3 = 'data/dc10/baby-names/'
dpath4 = 'data/dc10/sales/'
dpath5 = 'data/dc10/gdp/'

In [23]:
# Read 'Bronze.csv' into a DataFrame: bronze
bronze = pd.read_csv(dpath+'Bronze.csv')
bronze.head()

Unnamed: 0,NOC,Country,Total
0,USA,United States,1052.0
1,URS,Soviet Union,584.0
2,GBR,United Kingdom,505.0
3,FRA,France,475.0
4,GER,Germany,454.0


In [24]:
# Read 'Silver.csv' into a DataFrame: silver
silver = pd.read_csv(dpath+'Silver.csv')
silver.head()

Unnamed: 0,NOC,Country,Total
0,USA,United States,1195.0
1,URS,Soviet Union,627.0
2,GBR,United Kingdom,591.0
3,FRA,France,461.0
4,GER,Germany,350.0


In [25]:
# Read 'Gold.csv' into a DataFrame: gold
gold = pd.read_csv(dpath+'Gold.csv')
gold.head()

Unnamed: 0,NOC,Country,Total
0,USA,United States,2088.0
1,URS,Soviet Union,838.0
2,GBR,United Kingdom,498.0
3,FRA,France,378.0
4,GER,Germany,407.0


## Reading DataFrames from multiple files in a loop

Loading data from multiple files into DataFrames takes less code in a **loop** or a **list comprehension** than in repeated, hand-written calls to `read_csv()`.

Notice that this approach is not restricted to working with CSV files. That is, even if your data comes in other formats, as long as pandas has a suitable data import function, you can apply a loop or comprehension to generate a list of DataFrames imported from the source files.

In [26]:
# Import pandas
import pandas as pd

# Create the list of file names: filenames
filenames = ['Gold.csv', 'Silver.csv', 'Bronze.csv']

# Create the list of three DataFrames: dataframes
dataframes = []
for filename in filenames:
    dataframes.append(pd.read_csv(dpath+filename))

# Print top 5 rows of 1st DataFrame in dataframes
print(dataframes[0].head())

   NOC         Country   Total
0  USA   United States  2088.0
1  URS    Soviet Union   838.0
2  GBR  United Kingdom   498.0
3  FRA          France   378.0
4  GER         Germany   407.0
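
The loop above can also be written as a list comprehension. The sketch below is a minimal illustration, assuming small in-memory strings (the hypothetical `csv_text` dictionary) in place of the real files so that it runs on its own; with the files on disk, the comprehension body would simply be `pd.read_csv(dpath + name)`.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-ins for the three CSV files; each holds the
# first two rows of the corresponding table shown in this notebook.
csv_text = {
    'Gold.csv': "NOC,Country,Total\nUSA,United States,2088.0\nURS,Soviet Union,838.0\n",
    'Silver.csv': "NOC,Country,Total\nUSA,United States,1195.0\nURS,Soviet Union,627.0\n",
    'Bronze.csv': "NOC,Country,Total\nUSA,United States,1052.0\nURS,Soviet Union,584.0\n",
}
filenames = ['Gold.csv', 'Silver.csv', 'Bronze.csv']

# One read_csv() call per file, collected in a list comprehension
dataframes = [pd.read_csv(io.StringIO(csv_text[name])) for name in filenames]

print(len(dataframes))
print(dataframes[0])
```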


## Combining DataFrames from multiple data files

In this exercise, you'll combine the three DataFrames from earlier exercises - gold, silver, & bronze - into a single DataFrame called medals. The approach you'll use here is clumsy. Later on in the course, you'll see various powerful methods that are frequently used in practice for concatenating or merging DataFrames.

Remember, the column labels of each DataFrame are NOC, Country, and Total, where NOC is a three-letter code for the name of the country and Total is the number of medals of that type won.

In [27]:
# Import pandas
import pandas as pd

# Make a copy of gold: medals
medals = gold.copy()

# Create list of new column labels: new_labels
new_labels = ['NOC', 'Country', 'Gold']

# Rename the columns of medals using new_labels
medals.columns = new_labels
medals.head()

Unnamed: 0,NOC,Country,Gold
0,USA,United States,2088.0
1,URS,Soviet Union,838.0
2,GBR,United Kingdom,498.0
3,FRA,France,378.0
4,GER,Germany,407.0


In [28]:
# Add columns 'Silver' & 'Bronze' to medals
medals['Silver'] = silver['Total']
medals['Bronze'] = bronze['Total']

In [29]:
# Print the head of medals
medals.head()

Unnamed: 0,NOC,Country,Gold,Silver,Bronze
0,USA,United States,2088.0,1195.0,1052.0
1,URS,Soviet Union,838.0,627.0,584.0
2,GBR,United Kingdom,498.0,591.0,505.0
3,FRA,France,378.0,461.0,475.0
4,GER,Germany,407.0,350.0,454.0
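
One reason this approach is fragile is that assigning a Series to a new column matches rows by index label, not by position. The sketch below illustrates this with two tiny DataFrames whose layout is an assumption made for the example: `silver` deliberately stores the same two countries in the opposite order.

```python
import pandas as pd

gold = pd.DataFrame({'NOC': ['USA', 'URS'], 'Total': [2088.0, 838.0]})

# Same countries, stored in the opposite row order (hypothetical layout)
silver = pd.DataFrame({'NOC': ['URS', 'USA'], 'Total': [627.0, 1195.0]},
                      index=[1, 0])

medals = gold.rename(columns={'Total': 'Gold'})

# Assignment matches rows by index label (0 and 1), not by position
medals['Silver'] = silver['Total']
print(medals)
```

The result is only meaningful when the index labels of all three DataFrames refer to the same countries, which is why merging on a key column is usually the safer choice.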


## Sorting DataFrame with the Index & columns

It is often useful to rearrange the sequence of the rows of a DataFrame by sorting. You don't have to implement these yourself; the principal methods for doing this are `.sort_index()` and `.sort_values()`.

In this exercise, you'll use these methods with a DataFrame of temperature values indexed by month names. You'll sort the rows alphabetically using the Index and numerically using a column. Notice, for this data, the original ordering is probably most useful and intuitive: the purpose here is for you to understand what the sorting methods do.

In [32]:
# Read 'monthly_max_temp.csv' into a DataFrame called weather1
#  with 'Month' as the index.
weather1 = pd.read_csv(dpath2+'monthly_max_temp.csv', index_col='Month')
weather1.head()

Unnamed: 0_level_0,Max TemperatureF
Month,Unnamed: 1_level_1
Jan,68
Feb,60
Mar,68
Apr,84
May,88


In [33]:
# Sort the index of weather1 in alphabetical order: weather2
weather2 = weather1.sort_index()
weather2.head()

Unnamed: 0_level_0,Max TemperatureF
Month,Unnamed: 1_level_1
Apr,84
Aug,86
Dec,68
Feb,60
Jan,68


In [34]:
# Sort the index of weather1 in reverse alphabetical order: weather3
weather3 = weather1.sort_index(ascending=False)
weather3.head()

Unnamed: 0_level_0,Max TemperatureF
Month,Unnamed: 1_level_1
Sep,90
Oct,84
Nov,72
May,88
Mar,68


In [35]:
# Sort weather1 numerically using the values of 'Max TemperatureF': weather4
weather4 = weather1.sort_values('Max TemperatureF')
weather4.head()

Unnamed: 0_level_0,Max TemperatureF
Month,Unnamed: 1_level_1
Feb,60
Jan,68
Mar,68
Dec,68
Nov,72
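
As a compact recap of the two methods, the sketch below rebuilds the first five months of the table by hand (so it does not need the CSV file) and sorts them once by label and once by value in descending order.

```python
import pandas as pd

weather = pd.DataFrame(
    {'Max TemperatureF': [68, 60, 68, 84, 88]},
    index=pd.Index(['Jan', 'Feb', 'Mar', 'Apr', 'May'], name='Month'),
)

# Alphabetical order of the month labels
by_label = weather.sort_index()

# Hottest months first
by_value_desc = weather.sort_values('Max TemperatureF', ascending=False)

print(by_label.index.tolist())
print(by_value_desc.head(2))
```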


## Reindexing DataFrame from a list

Sorting methods are not the only way to change DataFrame Indexes. There is also the `.reindex()` method.

In this exercise, you'll reindex a DataFrame of quarterly-sampled mean temperature values to contain monthly samples (this is an example of upsampling or increasing the rate of samples).

The original data has the first month's abbreviation of the quarter (three-month interval) on the Index, namely Apr, Jan, Jul, and Oct. This data has been loaded into a DataFrame called `weather1`. Notice it has only four rows (corresponding to the first month of each quarter) and that the rows are not sorted chronologically.

You'll initially use a list of all twelve month abbreviations and subsequently apply the `.ffill()` method to forward-fill the null entries when upsampling. This list of month abbreviations has been pre-loaded as `year`.

In [36]:
year = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

In [37]:
weather1 = pd.read_csv(dpath2+'weather1.csv', index_col='Month')
weather1.head()

Unnamed: 0_level_0,Mean TemperatureF
Month,Unnamed: 1_level_1
Apr,61.956044
Jan,32.133333
Jul,68.934783
Oct,43.434783


In [38]:
# Reindex weather1 using the list year: weather2
# Reorder the rows of weather1 using the .reindex() method 
#  with the list year as the argument, which contains 
#  the abbreviations for each month.
weather2 = weather1.reindex(year)
weather2

Unnamed: 0_level_0,Mean TemperatureF
Month,Unnamed: 1_level_1
Jan,32.133333
Feb,
Mar,
Apr,61.956044
May,
Jun,
Jul,68.934783
Aug,
Sep,
Oct,43.434783


In [39]:
# Reindex weather1 using the list year with forward-fill: weather3
# Reorder the rows of weather1 just as you did above, 
#  this time chaining the .ffill() method to replace the null values 
#  with the last preceding non-null value.

weather3 = weather1.reindex(year).ffill()
weather3

Unnamed: 0_level_0,Mean TemperatureF
Month,Unnamed: 1_level_1
Jan,32.133333
Feb,32.133333
Mar,32.133333
Apr,61.956044
May,61.956044
Jun,61.956044
Jul,68.934783
Aug,68.934783
Sep,68.934783
Oct,43.434783
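
The two steps can be reproduced without the CSV file. The sketch below builds the four quarterly rows by hand, reindexes them to all twelve months, and counts how many null entries appear before forward-filling.

```python
import pandas as pd

year = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
        'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

quarterly = pd.DataFrame(
    {'Mean TemperatureF': [61.956044, 32.133333, 68.934783, 43.434783]},
    index=pd.Index(['Apr', 'Jan', 'Jul', 'Oct'], name='Month'),
)

monthly = quarterly.reindex(year)   # labels missing from quarterly get NaN
filled = monthly.ffill()            # carry the last known value forward

print(monthly['Mean TemperatureF'].isna().sum())
print(filled.loc[['Mar', 'Dec']])
```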


## Reindexing using another DataFrame Index

Another common technique is to reindex a DataFrame using the Index of another DataFrame. The DataFrame `.reindex()` method can accept the Index of a DataFrame or Series as input. You can access the Index of a DataFrame with its `.index` attribute.

The [Baby Names Dataset from data.gov](https://www.data.gov/developers/baby-names-dataset/) summarizes counts of names (with genders) from births registered in the US since 1881. In this exercise, you will start with two baby-names DataFrames `names_1981` and `names_1881` loaded for you.

The DataFrames `names_1981` and `names_1881` both have a MultiIndex with levels name and gender giving unique labels to counts in each row. If you're interested in seeing how the MultiIndexes were set up, `names_1981` and `names_1881` were read in using the following commands:

In [41]:
names_1981 = pd.read_csv(dpath3+'names1981.csv', header=None, names=['name','gender','count'], index_col=(0,1))
names_1881 = pd.read_csv(dpath3+'names1881.csv', header=None, names=['name','gender','count'], index_col=(0,1))

As you can see by looking at their shapes, the DataFrame corresponding to 1981 births is much larger, reflecting the greater diversity of names in 1981 as compared to 1881.

In [42]:
print(names_1881.shape)
print(names_1981.shape)

(1935, 1)
(19455, 1)


Your job here is to use the DataFrame `.reindex()` and `.dropna()` methods to make a DataFrame `common_names` counting names from 1881 that were still popular in 1981.

In [44]:
names_1881.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,count
name,gender,Unnamed: 2_level_1
Mary,F,6919
Anna,F,2698
Emma,F,2034
Elizabeth,F,1852
Margaret,F,1658


In [43]:
names_1981.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,count
name,gender,Unnamed: 2_level_1
Jennifer,F,57032
Jessica,F,42519
Amanda,F,34370
Sarah,F,28162
Melissa,F,28003


In [47]:
# Reindex names_1981 with index of names_1881: common_names
common_names = names_1981.reindex(names_1881.index)
common_names.shape

(1935, 1)

In [48]:
# Drop rows with null counts: common_names
common_names = common_names.dropna()
common_names.shape

(1587, 1)
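
The same reindex-then-drop pattern is easier to follow on a toy example. In the sketch below, the name `Zebulon` and most of the counts are made up for illustration; only the structure (a MultiIndex with levels `name` and `gender`) mirrors the real data.

```python
import pandas as pd

names_1881 = pd.DataFrame(
    {'count': [6919, 2698, 100]},
    index=pd.MultiIndex.from_tuples(
        [('Mary', 'F'), ('Anna', 'F'), ('Zebulon', 'M')],
        names=['name', 'gender']),
)
# Hypothetical 1981 counts
names_1981 = pd.DataFrame(
    {'count': [57032, 11000, 5000]},
    index=pd.MultiIndex.from_tuples(
        [('Jennifer', 'F'), ('Mary', 'F'), ('Anna', 'F')],
        names=['name', 'gender']),
)

# Look up the 1881 labels in the 1981 data; labels absent in 1981 become NaN
reindexed = names_1981.reindex(names_1881.index)

# Keep only the labels that were found in both years
common_names = reindexed.dropna()
print(common_names)
```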

## Adding unaligned DataFrames

The DataFrames `jan` and `feb` represent the sales a company made in January and February.

The Indexes in both DataFrames are called `Company`, identifying which company bought that quantity of units. The column `Units` is the number of units sold.

If you were to add these two DataFrames by executing the command `total = jan + feb`, how many rows would the resulting DataFrame have? Try this and find out for yourself.

In [50]:
jan = pd.read_csv(dpath4+'january.csv', index_col='Company')
jan

Unnamed: 0_level_0,Units
Company,Unnamed: 1_level_1
Acme Corporation,19
Hooli,17
Initech,20
Mediacore,10
Streeplex,13


In [51]:
feb = pd.read_csv(dpath4+'february.csv', index_col='Company')
feb

Unnamed: 0_level_0,Units
Company,Unnamed: 1_level_1
Acme Corporation,15
Hooli,3
Mediacore,13
Vandelay Inc,25


In [52]:
total = jan + feb
total

Unnamed: 0_level_0,Units
Company,Unnamed: 1_level_1
Acme Corporation,34.0
Hooli,20.0
Initech,
Mediacore,23.0
Streeplex,
Vandelay Inc,


`jan` and `feb` both contain sales for Acme Corporation, Hooli, and Mediacore. `jan` has two additional companies, Initech and Streeplex, while `feb` has one, Vandelay Inc. Together they cover 6 unique companies, so `total` has 6 rows. Because addition aligns on the Index, any company missing from either month gets a null value in `total`.
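
If the goal is a quarter-to-date total rather than a demonstration of alignment, the null values are usually unwanted. One option, sketched below with the two tables rebuilt by hand, is the `.add()` method with `fill_value=0`, which treats a company missing from one month as having sold zero units there.

```python
import pandas as pd

jan = pd.DataFrame(
    {'Units': [19, 17, 20, 10, 13]},
    index=pd.Index(['Acme Corporation', 'Hooli', 'Initech',
                    'Mediacore', 'Streeplex'], name='Company'),
)
feb = pd.DataFrame(
    {'Units': [15, 3, 13, 25]},
    index=pd.Index(['Acme Corporation', 'Hooli', 'Mediacore',
                    'Vandelay Inc'], name='Company'),
)

total_nan = jan + feb                      # NaN where a company is missing
total_filled = jan.add(feb, fill_value=0)  # missing months count as zero

print(total_nan['Units'].isna().sum())
print(total_filled)
```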

## Broadcasting in arithmetic formulas

In this exercise, you'll work with weather data pulled from [wunderground.com](https://www.wunderground.com/). The DataFrame `weather` has been pre-loaded. It has 365 rows (observed each day of the year 2013 in Pittsburgh, PA) and 22 columns reflecting different weather measurements each day.

You'll subset a collection of columns related to temperature measurements in degrees Fahrenheit, convert them to degrees Celsius, and relabel the columns of the new DataFrame to reflect the change of units.

Remember, ordinary arithmetic operators (like +, -, *, and /) **broadcast** scalar values to conforming DataFrames when combining scalars & DataFrames in arithmetic expressions. **Broadcasting** also works with pandas Series and NumPy arrays.

In [57]:
weather = pd.read_csv(dpath2+'weather2.csv', index_col='Date', parse_dates=True)
weather.head()

Unnamed: 0_level_0,Max TemperatureF,Mean TemperatureF,Min TemperatureF,Max Dew PointF,MeanDew PointF,Min DewpointF,Max Humidity,Mean Humidity,Min Humidity,Max Sea Level PressureIn,...,Max VisibilityMiles,Mean VisibilityMiles,Min VisibilityMiles,Max Wind SpeedMPH,Mean Wind SpeedMPH,Max Gust SpeedMPH,PrecipitationIn,CloudCover,Events,WindDirDegrees
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2013-01-01,32,28,21,30,27,16,100,89,77,30.1,...,10,6,2,10,8,,0.0,8,Snow,277
2013-01-02,25,21,17,14,12,10,77,67,55,30.27,...,10,10,10,14,5,,0.0,4,,272
2013-01-03,32,24,16,19,15,9,77,67,56,30.25,...,10,10,10,17,8,26.0,0.0,3,,229
2013-01-04,30,28,27,21,19,17,75,68,59,30.28,...,10,10,6,23,16,32.0,0.0,4,,250
2013-01-05,34,30,25,23,20,16,75,68,61,30.42,...,10,10,10,16,10,23.0,0.21,5,,221


In [60]:
# Extract selected columns from weather as new DataFrame: temps_f
# Pass the column labels 'Min TemperatureF', 'Mean TemperatureF',
#  & 'Max TemperatureF' as a list to weather[].
temps_f = weather[['Min TemperatureF', 'Mean TemperatureF', 'Max TemperatureF']]
temps_f.head()

Unnamed: 0_level_0,Min TemperatureF,Mean TemperatureF,Max TemperatureF
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2013-01-01,21,28,32
2013-01-02,17,21,25
2013-01-03,16,24,32
2013-01-04,27,28,30
2013-01-05,25,30,34


In [61]:
# Convert temps_f to celsius: temps_c
temps_c = (temps_f - 32) * 5/9

In [62]:
# Rename 'F' in column names with 'C': temps_c.columns
# Rename the columns of temps_c to replace 'F' with 'C' 
#  using the .str.replace('F', 'C') method on temps_c.columns.
temps_c.columns = temps_c.columns.str.replace('F', 'C')
temps_c.head()

Unnamed: 0_level_0,Min TemperatureC,Mean TemperatureC,Max TemperatureC
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2013-01-01,-6.111111,-2.222222,0.0
2013-01-02,-8.333333,-6.111111,-3.888889
2013-01-03,-8.888889,-4.444444,0.0
2013-01-04,-2.777778,-2.222222,-1.111111
2013-01-05,-3.888889,-1.111111,1.111111
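
The sketch below shows both kinds of broadcasting on the first two rows of the temperature data: a scalar applied to every cell, and a Series whose index is matched against the column labels. The `offsets` Series is a hypothetical per-column adjustment invented for the example.

```python
import pandas as pd

temps_f = pd.DataFrame(
    {'Min TemperatureF': [21, 17],
     'Mean TemperatureF': [28, 21],
     'Max TemperatureF': [32, 25]},
    index=pd.to_datetime(['2013-01-01', '2013-01-02']),
)

# Scalars broadcast to every cell
temps_c = (temps_f - 32) * 5 / 9

# A Series broadcasts across rows: its index is matched to the column labels
# (hypothetical per-column offsets, for illustration only)
offsets = pd.Series({'Min TemperatureF': 1,
                     'Mean TemperatureF': 0,
                     'Max TemperatureF': -1})
adjusted = temps_f + offsets

print(temps_c)
print(adjusted)
```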


## Computing percentage growth of GDP

Your job in this exercise is to compute the yearly percent-change of US **GDP (Gross Domestic Product)** since 2008.

The data has been obtained from the [Federal Reserve Bank of St. Louis](https://fred.stlouisfed.org/series/GDP) and is available in the file `GDP.csv` (stored here as `gdp_usa.csv`), which contains quarterly data; you will **resample** it to annual sampling and then compute the annual growth of GDP.

In [66]:
# Read 'gdp_usa.csv' into a DataFrame called gdp,
#  using parse_dates=True and index_col='DATE'.
gdp = pd.read_csv(dpath5+'gdp_usa.csv', parse_dates=True, index_col='DATE')
gdp.head()

Unnamed: 0_level_0,VALUE
DATE,Unnamed: 1_level_1
1947-01-01,243.1
1947-04-01,246.3
1947-07-01,250.1
1947-10-01,260.3
1948-01-01,266.2


In [70]:
# Slice all the gdp data from 2008 onward: post2008
post2008 = gdp.loc['2008':]
post2008

Unnamed: 0_level_0,VALUE
DATE,Unnamed: 1_level_1
2008-01-01,14668.4
2008-04-01,14813.0
2008-07-01,14843.0
2008-10-01,14549.9
2009-01-01,14383.9
2009-04-01,14340.4
2009-07-01,14384.1
2009-10-01,14566.5
2010-01-01,14681.1
2010-04-01,14888.6


In [71]:
# Resample post2008 by year, keeping last(): yearly
# Chain .resample() (using the alias 'A' for annual frequency) with the
#  aggregation method .last() to select the last element in each year.
# Note: pandas 2.2+ deprecates the alias 'A' in favor of 'YE' (year end).
yearly = post2008.resample('A').last()
yearly

Unnamed: 0_level_0,VALUE
DATE,Unnamed: 1_level_1
2008-12-31,14549.9
2009-12-31,14566.5
2010-12-31,15230.2
2011-12-31,15785.3
2012-12-31,16297.3
2013-12-31,16999.9
2014-12-31,17692.2
2015-12-31,18222.8
2016-12-31,18436.5


In [72]:
# Compute percentage growth of yearly: yearly['growth']
# Select the 'VALUE' column explicitly so a Series is assigned to the new column
yearly['growth'] = yearly['VALUE'].pct_change() * 100
yearly

Unnamed: 0_level_0,VALUE,growth
DATE,Unnamed: 1_level_1,Unnamed: 2_level_1
2008-12-31,14549.9,
2009-12-31,14566.5,0.11409
2010-12-31,15230.2,4.556345
2011-12-31,15785.3,3.644732
2012-12-31,16297.3,3.243524
2013-12-31,16999.9,4.311144
2014-12-31,17692.2,4.072377
2015-12-31,18222.8,2.999062
2016-12-31,18436.5,1.172707
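
To see what `.pct_change()` computes, the sketch below recreates the first three annual values by hand and compares the built-in method with the explicit formula `(current / previous - 1) * 100`, using `.shift(1)` to obtain the previous year's value.

```python
import pandas as pd

yearly = pd.DataFrame(
    {'VALUE': [14549.9, 14566.5, 15230.2]},
    index=pd.to_datetime(['2008-12-31', '2009-12-31', '2010-12-31']),
)

builtin = yearly['VALUE'].pct_change() * 100

# Same calculation spelled out: divide each value by the previous one
manual = (yearly['VALUE'] / yearly['VALUE'].shift(1) - 1) * 100

print(pd.DataFrame({'builtin': builtin, 'manual': manual}))
```

The first row is null in both cases because there is no earlier year to compare against.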


## Converting currency of stocks

In this exercise, stock prices in US Dollars for the S&P 500 in 2015 have been obtained from Yahoo Finance. The files `sp500.csv` (loaded as `sp500`) and `exchange.csv` (loaded as `exchange`) are both provided to you.

Using the daily exchange rate to Pounds Sterling, your task is to convert both the Open and Close column prices.

In [73]:
# Read 'sp500.csv' into a DataFrame: sp500
sp500 = pd.read_csv(dpath2+'sp500.csv', parse_dates=True, index_col='Date')
sp500.head()

Unnamed: 0_level_0,Open,High,Low,Close,Volume,Adj Close
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2015-01-02,2058.899902,2072.360107,2046.040039,2058.199951,2708700000,2058.199951
2015-01-05,2054.439941,2054.439941,2017.339966,2020.579956,3799120000,2020.579956
2015-01-06,2022.150024,2030.25,1992.439941,2002.609985,4460110000,2002.609985
2015-01-07,2005.550049,2029.609985,2005.550049,2025.900024,3805480000,2025.900024
2015-01-08,2030.609985,2064.080078,2030.609985,2062.139893,3934010000,2062.139893


In [74]:
# Read 'exchange.csv' into a DataFrame: exchange
exchange = pd.read_csv(dpath2+'exchange.csv', parse_dates=True, index_col='Date')
exchange.head()

Unnamed: 0_level_0,GBP/USD
Date,Unnamed: 1_level_1
2015-01-02,0.65101
2015-01-05,0.65644
2015-01-06,0.65896
2015-01-07,0.66344
2015-01-08,0.66151


In [75]:
# Subset 'Open' & 'Close' columns from sp500: dollars
# Extract the columns 'Open' & 'Close' from the DataFrame sp500 
#  as a new DataFrame dollars and print the first 5 rows.
dollars = sp500[['Open', 'Close']]
dollars.head()

Unnamed: 0_level_0,Open,Close
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
2015-01-02,2058.899902,2058.199951
2015-01-05,2054.439941,2020.579956
2015-01-06,2022.150024,2002.609985
2015-01-07,2005.550049,2025.900024
2015-01-08,2030.609985,2062.139893


In [76]:
# Convert dollars to pounds: pounds
# Construct a new DataFrame pounds by converting US dollars to British pounds. 
#  You'll use the .multiply() method of dollars with exchange['GBP/USD'] 
#  and axis='rows'
pounds = dollars.multiply(exchange['GBP/USD'], axis='rows')
pounds.head()

Unnamed: 0_level_0,Open,Close
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
2015-01-02,1340.364425,1339.90875
2015-01-05,1348.616555,1326.389506
2015-01-06,1332.51598,1319.639876
2015-01-07,1330.562125,1344.063112
2015-01-08,1343.268811,1364.126161
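
The row-wise multiplication can be checked on the first two trading days. The sketch below rebuilds those rows by hand and uses `axis='index'`, the documented spelling of the row axis; each row of `dollars` is scaled by the exchange rate that shares its date.

```python
import pandas as pd

dates = pd.to_datetime(['2015-01-02', '2015-01-05'])

dollars = pd.DataFrame(
    {'Open': [2058.899902, 2054.439941],
     'Close': [2058.199951, 2020.579956]},
    index=dates,
)
rate = pd.Series([0.65101, 0.65644], index=dates, name='GBP/USD')

# Match the Series to the DataFrame's row labels (dates), not its columns
pounds = dollars.multiply(rate, axis='index')
print(pounds)
```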


## Appending Series with nonunique Indices

The Series bronze and silver represent the 5 countries that won the most bronze and silver Olympic medals respectively between 1896 & 2008. The Indexes of both Series are called Country and the values are the corresponding number of medals won.

If you were to run the command `combined = bronze.append(silver)`, how many rows would combined have? And how many rows would `combined.loc['United States']` return? Find out for yourself by running these commands.

The `combined` Series has 10 rows and `combined.loc['United States']` has two rows, since the index value 'United States' occurs in both series bronze and silver.
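
The sketch below reproduces this result with the top-5 values that appear later in this notebook. It stacks the two Series with `pd.concat()`, which yields the same rows as `.append()` and also runs on pandas 2.0 and later, where `.append()` is no longer available.

```python
import pandas as pd

bronze = pd.Series(
    [1052.0, 584.0, 505.0, 475.0, 454.0],
    index=pd.Index(['United States', 'Soviet Union', 'United Kingdom',
                    'France', 'Germany'], name='Country'),
    name='Total',
)
silver = pd.Series(
    [1195.0, 627.0, 591.0, 461.0, 394.0],
    index=pd.Index(['United States', 'Soviet Union', 'United Kingdom',
                    'France', 'Italy'], name='Country'),
    name='Total',
)

# Stack the two Series; repeated index labels are kept as-is
combined = pd.concat([bronze, silver])

print(len(combined))
print(combined.loc['United States'])
print(combined.index.is_unique)
```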

## Appending pandas Series

In this exercise, you'll load sales data from the months January, February, and March into DataFrames. Then, you'll extract Series with the 'Units' column from each and append them together with method chaining using `.append()`.

To check that the stacking worked, you'll print slices from these Series, and finally, you'll add the result to figure out the total units sold in the first quarter.

In [80]:
# Load 'sales-jan-2015.csv' into a DataFrame: jan
jan = pd.read_csv(dpath4+'sales-jan-2015.csv', parse_dates=True, index_col='Date')
jan.head()

Unnamed: 0_level_0,Company,Product,Units
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2015-01-21 19:13:21,Streeplex,Hardware,11
2015-01-09 05:23:51,Streeplex,Service,8
2015-01-06 17:19:34,Initech,Hardware,17
2015-01-02 09:51:06,Hooli,Hardware,16
2015-01-11 14:51:02,Hooli,Hardware,11


In [81]:
# Load 'sales-feb-2015.csv' into a DataFrame: feb
feb = pd.read_csv(dpath4+'sales-feb-2015.csv', parse_dates=True, index_col='Date')
feb.head()

Unnamed: 0_level_0,Company,Product,Units
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2015-02-26 08:57:45,Streeplex,Service,4
2015-02-16 12:09:19,Hooli,Software,10
2015-02-03 14:14:18,Initech,Software,13
2015-02-02 08:33:01,Hooli,Software,3
2015-02-25 00:29:00,Initech,Service,10


In [82]:
# Load 'sales-mar-2015.csv' into a DataFrame: mar
mar = pd.read_csv(dpath4+'sales-mar-2015.csv', parse_dates=True, index_col='Date')
mar.head()

Unnamed: 0_level_0,Company,Product,Units
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2015-03-22 14:42:25,Mediacore,Software,6
2015-03-12 18:33:06,Initech,Service,19
2015-03-22 03:58:28,Streeplex,Software,8
2015-03-15 00:53:12,Hooli,Hardware,19
2015-03-17 19:25:37,Hooli,Hardware,10


In [83]:
# Extract the 'Units' column from jan: jan_units
jan_units = jan['Units']

# Extract the 'Units' column from feb: feb_units
feb_units = feb['Units']

# Extract the 'Units' column from mar: mar_units
mar_units = mar['Units']

In [88]:
# Append feb_units and then mar_units to jan_units: quarter1
# Chain calls to .append() to stack the three Series.
# Note: Series.append() was removed in pandas 2.0; on those versions use
#  pd.concat([jan_units, feb_units, mar_units]) instead.
quarter1 = jan_units.append(feb_units).append(mar_units)
quarter1[:10]

Date
2015-01-21 19:13:21    11
2015-01-09 05:23:51     8
2015-01-06 17:19:34    17
2015-01-02 09:51:06    16
2015-01-11 14:51:02    11
2015-01-01 07:31:20    18
2015-01-24 08:01:16     1
2015-01-25 15:40:07     6
2015-01-13 05:36:12     7
2015-01-03 18:00:19    19
Name: Units, dtype: int64

In [85]:
# Print the first slice from quarter1
# Print the slice containing rows from jan 27, 2015 to feb 2, 2015.
print(quarter1.loc['jan 27, 2015':'feb 2, 2015'])

Date
2015-01-27 07:11:55    18
2015-02-02 08:33:01     3
2015-02-02 20:54:49     9
Name: Units, dtype: int64


In [86]:
# Print the second slice from quarter1
# Print the slice containing rows from feb 26, 2015 to mar 7, 2015.
print(quarter1.loc['feb 26, 2015':'mar 7, 2015'])

Date
2015-02-26 08:57:45     4
2015-02-26 08:58:51     1
2015-03-06 10:11:45    17
2015-03-06 02:03:56    17
Name: Units, dtype: int64


In [89]:
# Compute & print total sales in quarter1
print(quarter1.sum())

642


## Concatenating pandas Series along row axis

Having learned how to append Series, you'll now learn how to achieve the same result by concatenating Series instead. You'll continue to work with the sales data you've seen previously.

Your job is to use `pd.concat()` with a list of Series to achieve the same result that you would get by chaining calls to `.append()`.

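You may be wondering about the difference between `pd.concat()` and pandas' `.append()` method. One way to think of the difference is that `.append()` is a specific case of a concatenation, while `pd.concat()` gives you more flexibility, as you'll see in later exercises. Note that `.append()` was deprecated in pandas 1.4 and removed in pandas 2.0, so `pd.concat()` is the form to use with current releases.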

In [91]:
# Initialize empty list: units
units = []

# Build the list of Series
# In each iteration of the loop, append the 'Units' column of each DataFrame to units.
for month in [jan, feb, mar]:
    units.append(month['Units'])

In [92]:
units

[Date
 2015-01-21 19:13:21    11
 2015-01-09 05:23:51     8
 2015-01-06 17:19:34    17
 2015-01-02 09:51:06    16
 2015-01-11 14:51:02    11
 2015-01-01 07:31:20    18
 2015-01-24 08:01:16     1
 2015-01-25 15:40:07     6
 2015-01-13 05:36:12     7
 2015-01-03 18:00:19    19
 2015-01-16 00:33:47    17
 2015-01-16 07:21:12    13
 2015-01-20 19:49:24    12
 2015-01-26 01:50:25    14
 2015-01-15 02:38:25    16
 2015-01-06 13:47:37    16
 2015-01-15 15:33:40     7
 2015-01-27 07:11:55    18
 2015-01-20 11:28:02    13
 2015-01-16 19:20:46     8
 Name: Units, dtype: int64, Date
 2015-02-26 08:57:45     4
 2015-02-16 12:09:19    10
 2015-02-03 14:14:18    13
 2015-02-02 08:33:01     3
 2015-02-25 00:29:00    10
 2015-02-05 01:53:06    19
 2015-02-09 08:57:30    19
 2015-02-11 20:03:08     7
 2015-02-04 21:52:45    14
 2015-02-09 13:09:55     7
 2015-02-07 22:58:10     1
 2015-02-11 22:50:44     4
 2015-02-26 08:58:51     1
 2015-02-05 22:05:03    10
 2015-02-04 15:36:29    13
 2015-02-19 16:0

In [93]:
# Concatenate the list: quarter1
quarter1 = pd.concat(units, axis='rows')

In [94]:
# Print slices from quarter1
print(quarter1.loc['jan 27, 2015':'feb 2, 2015'])
print(quarter1.loc['feb 26, 2015':'mar 7, 2015'])

Date
2015-01-27 07:11:55    18
2015-02-02 08:33:01     3
2015-02-02 20:54:49     9
Name: Units, dtype: int64
Date
2015-02-26 08:57:45     4
2015-02-26 08:58:51     1
2015-03-06 10:11:45    17
2015-03-06 02:03:56    17
Name: Units, dtype: int64
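
The concatenated Series keeps the original, unsorted order of the timestamps. Date-range slicing with `.loc` is generally more predictable on a sorted index, and on newer pandas versions it may fail on an unsorted one. The sketch below, which uses a handful of the rows shown above, sorts the index before slicing.

```python
import pandas as pd

jan_units = pd.Series(
    [18, 11],
    index=pd.to_datetime(['2015-01-27 07:11:55', '2015-01-21 19:13:21']),
    name='Units',
)
feb_units = pd.Series(
    [4, 3, 9],
    index=pd.to_datetime(['2015-02-26 08:57:45', '2015-02-02 08:33:01',
                          '2015-02-02 20:54:49']),
    name='Units',
)
mar_units = pd.Series(
    [17],
    index=pd.to_datetime(['2015-03-06 10:11:45']),
    name='Units',
)

# Stack the three months, then put the timestamps in chronological order
quarter1 = pd.concat([jan_units, feb_units, mar_units]).sort_index()

# The end label '2015-02-02' covers that whole day
window = quarter1.loc['2015-01-27':'2015-02-02']
print(window)
```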


## Appending DataFrames with ignore_index

In this exercise, you'll use the Baby Names Dataset (from data.gov) again. This time, both DataFrames `names_1981` and `names_1881` are loaded without specifying an Index column (so the default Indexes for both are RangeIndexes).

You'll use the DataFrame `.append()` method to make a DataFrame `combined_names`. To distinguish rows from the original two DataFrames, you'll add a `'year'` column to each with the year (1881 or 1981 in this case). In addition, you'll specify `ignore_index=True` so that the index values are not used along the concatenation axis. The resulting axis will instead be labeled 0, 1, ..., n-1, which is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information.

In [117]:
names_1981 = pd.read_csv(dpath3+'names1981.csv', header=None, names=['name','gender','count'], index_col=(0,1))
names_1881 = pd.read_csv(dpath3+'names1881.csv', header=None, names=['name','gender','count'], index_col=(0,1))

In [118]:
names_1881 = names_1881.reset_index()
names_1881.head()

Unnamed: 0,name,gender,count
0,Mary,F,6919
1,Anna,F,2698
2,Emma,F,2034
3,Elizabeth,F,1852
4,Margaret,F,1658


In [119]:
names_1981 = names_1981.reset_index()
names_1981.head()

Unnamed: 0,name,gender,count
0,Jennifer,F,57032
1,Jessica,F,42519
2,Amanda,F,34370
3,Sarah,F,28162
4,Melissa,F,28003


In [120]:
# Add 'year' column to names_1881 and names_1981
names_1881['year'] = 1881
names_1981['year'] = 1981

In [121]:
names_1881.head()

Unnamed: 0,name,gender,count,year
0,Mary,F,6919,1881
1,Anna,F,2698,1881
2,Emma,F,2034,1881
3,Elizabeth,F,1852,1881
4,Margaret,F,1658,1881


In [122]:
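# Append names_1981 after names_1881 with ignore_index=True: combined_names
# Note: DataFrame.append() was removed in pandas 2.0; on those versions use
#  pd.concat([names_1881, names_1981], ignore_index=True) instead.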
combined_names = names_1881.append(names_1981, ignore_index=True)

In [123]:
# Print shapes of names_1981, names_1881, and combined_names
print(names_1981.shape)
print(names_1881.shape)
print(combined_names.shape)

(19455, 4)
(1935, 4)
(21390, 4)


In [124]:
combined_names.head()

Unnamed: 0,name,gender,count,year
0,Mary,F,6919,1881
1,Anna,F,2698,1881
2,Emma,F,2034,1881
3,Elizabeth,F,1852,1881
4,Margaret,F,1658,1881


In [125]:
# Print all rows that contain the name 'Morgan'
print(combined_names.loc[combined_names['name'] == 'Morgan'])

         name gender  count  year
1283   Morgan      M     23  1881
2096   Morgan      F   1769  1981
14390  Morgan      M    766  1981
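
The effect of `ignore_index=True` is easiest to see on a few rows. The sketch below uses `pd.concat()` on two small hand-built DataFrames and compares the index with and without the option.

```python
import pandas as pd

names_1881 = pd.DataFrame({'name': ['Mary', 'Morgan'],
                           'gender': ['F', 'M'],
                           'count': [6919, 23]})
names_1981 = pd.DataFrame({'name': ['Jennifer', 'Morgan', 'Morgan'],
                           'gender': ['F', 'F', 'M'],
                           'count': [57032, 1769, 766]})
names_1881['year'] = 1881
names_1981['year'] = 1981

# Original row labels are kept, so 0 and 1 appear twice
kept = pd.concat([names_1881, names_1981])

# Row labels are discarded and replaced by 0, 1, ..., n-1
renumbered = pd.concat([names_1881, names_1981], ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
print(renumbered[renumbered['name'] == 'Morgan'])
```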


## Concatenating pandas DataFrames along column axis

The function `pd.concat()` can concatenate DataFrames horizontally as well as vertically (vertical is the default). To make the DataFrames stack horizontally, you have to specify the keyword argument `axis=1` or `axis='columns'`.

In this exercise, you'll use weather data with maximum and mean daily temperatures sampled at different rates (quarterly versus monthly). You'll concatenate the two DataFrames side by side and see that, where rows are missing in the coarser DataFrame, null values are inserted in the concatenated DataFrame. This corresponds to an **outer join** (which you will explore in more detail in later exercises).

The files `'quarterly_max_temp.csv'` and `'monthly_mean_temp.csv'` have been pre-loaded into the DataFrames `weather_max` and `weather_mean` respectively, and pandas has been imported as pd.

In [126]:
weather_max = pd.read_csv(dpath2+'quarterly_max_temp.csv')
weather_max

Unnamed: 0,Month,Max TemperatureF
0,Jan,68
1,Apr,89
2,Jul,91
3,Oct,84


In [127]:
weather_mean = pd.read_csv(dpath2+'monthly_mean_temp.csv')
weather_mean

Unnamed: 0,Month,Mean TemperatureF
0,Apr,53.1
1,Aug,70.0
2,Dec,34.935484
3,Feb,28.714286
4,Jan,32.354839
5,Jul,72.870968
6,Jun,70.133333
7,Mar,35.0
8,May,62.612903
9,Nov,39.8


In [128]:
# Create a list of weather_max and weather_mean
weather_list = [weather_max, weather_mean]

In [129]:
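# Concatenate weather_list horizontally
# Both frames were read without index_col='Month', so rows are aligned on the
#  default integer index (0, 1, 2, ...) rather than on the month labels.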
weather = pd.concat(weather_list, axis=1)
weather

Unnamed: 0,Month,Max TemperatureF,Month.1,Mean TemperatureF
0,Jan,68.0,Apr,53.1
1,Apr,89.0,Aug,70.0
2,Jul,91.0,Dec,34.935484
3,Oct,84.0,Feb,28.714286
4,,,Jan,32.354839
5,,,Jul,72.870968
6,,,Jun,70.133333
7,,,Mar,35.0
8,,,May,62.612903
9,,,Nov,39.8
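
For the rows to line up by month, the month labels need to be the Index of both DataFrames. The sketch below is one way to do that, assuming a hand-built subset of the two tables and using `.set_index('Month')` (passing `index_col='Month'` to `read_csv()` would have the same effect).

```python
import pandas as pd

weather_max = pd.DataFrame(
    {'Month': ['Jan', 'Apr', 'Jul', 'Oct'],
     'Max TemperatureF': [68, 89, 91, 84]},
).set_index('Month')

# Subset of the monthly means shown above
weather_mean = pd.DataFrame(
    {'Month': ['Apr', 'Feb', 'Jan', 'Jul', 'Mar'],
     'Mean TemperatureF': [53.1, 28.714286, 32.354839, 72.870968, 35.0]},
).set_index('Month')

# Rows are matched on the month labels; months missing from one side get NaN
weather = pd.concat([weather_max, weather_mean], axis=1)
print(weather)
```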


## Reading multiple files to build a DataFrame

It is often convenient to build a large DataFrame by parsing many files as DataFrames and concatenating them all at once. You'll do this here with three files, but, in principle, this approach can be used to combine data from dozens or hundreds of files.

Here, you'll work with DataFrames compiled from The Guardian's Olympic medal dataset.

In [130]:
medal_types = ['bronze', 'silver', 'gold']

In [138]:
# Initialize an empty list: medals
medals = []

In [139]:
for medal in medal_types:
    # Create the file name: file_name
    # The expression "%s_top5.csv" % medal evaluates as a string 
    #  with the value of medal replacing %s in the format string.
    file_name = "%s_top5.csv" % medal
    # Create list of column names: columns
    columns = ['Country', medal]
    # Read file_name into a DataFrame: medal_df
    medal_df = pd.read_csv(dpath+file_name, header=0, index_col='Country', names=columns)
    print(file_name)
    # Append medal_df to medals
    medals.append(medal_df)

medals

bronze_top5.csv
silver_top5.csv
gold_top5.csv


[                bronze
 Country               
 United States   1052.0
 Soviet Union     584.0
 United Kingdom   505.0
 France           475.0
 Germany          454.0,                 silver
 Country               
 United States   1195.0
 Soviet Union     627.0
 United Kingdom   591.0
 France           461.0
 Italy            394.0,                   gold
 Country               
 United States   2088.0
 Soviet Union     838.0
 United Kingdom   498.0
 Italy            460.0
 Germany          407.0]

In [140]:
# Concatenate medals horizontally: medals_df
# sort=True requests the alphabetical row order shown in the output
medals_df = pd.concat(medals, axis='columns', sort=True)
medals_df

Unnamed: 0,bronze,silver,gold
France,475.0,461.0,
Germany,454.0,,407.0
Italy,,394.0,460.0
Soviet Union,584.0,627.0,838.0
United Kingdom,505.0,591.0,498.0
United States,1052.0,1195.0,2088.0
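
When there are many files, the list of names does not have to be typed by hand. The sketch below is a minimal illustration of discovering files by pattern with `pathlib`; it writes three trimmed, hypothetical copies of the medal files (two rows each, with an assumed `Country,Total` header) to a temporary directory so that it can run on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical, trimmed copies of the three medal files (two rows each)
samples = {
    'bronze': "Country,Total\nUnited States,1052.0\nFrance,475.0\n",
    'silver': "Country,Total\nUnited States,1195.0\nItaly,394.0\n",
    'gold': "Country,Total\nUnited States,2088.0\nItaly,460.0\n",
}

with tempfile.TemporaryDirectory() as tmp:
    for medal, text in samples.items():
        (Path(tmp) / f"{medal}_top5.csv").write_text(text)

    frames = []
    # Discover the files by pattern instead of listing them by hand
    for path in sorted(Path(tmp).glob('*_top5.csv')):
        medal = path.name.split('_')[0]   # 'bronze', 'gold', or 'silver'
        frame = pd.read_csv(path, index_col='Country')
        frames.append(frame.rename(columns={'Total': medal}))

# One column per file, rows aligned on Country
medals_df = pd.concat(frames, axis='columns', sort=True)
print(medals_df)
```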


## Concatenating vertically to get MultiIndexed rows

When stacking a sequence of DataFrames vertically, it is sometimes desirable to construct a MultiIndex to indicate the DataFrame from which each row originated. This can be done by specifying the `keys` parameter in the call to `pd.concat()`, which generates a hierarchical index with the labels from `keys` as the outermost index label. This way, you don't have to rename the columns of each DataFrame as you load it; only the Index column needs to be specified.

In [148]:
medals = []
medal_types = ['bronze', 'silver', 'gold']

In [150]:
for medal in medal_types:

    file_name = "%s_top5.csv" % medal
    
    # Read file_name into a DataFrame: medal_df
    medal_df = pd.read_csv(dpath+file_name, index_col='Country')
    
    # Append medal_df to medals
    medals.append(medal_df)
    
# Concatenate medals: medals
medals = pd.concat(medals, keys=['bronze', 'silver', 'gold'])
medals

Unnamed: 0_level_0,Unnamed: 1_level_0,Total
Unnamed: 0_level_1,Country,Unnamed: 2_level_1
bronze,United States,1052.0
bronze,Soviet Union,584.0
bronze,United Kingdom,505.0
bronze,France,475.0
bronze,Germany,454.0
silver,United States,1195.0
silver,Soviet Union,627.0
silver,United Kingdom,591.0
silver,France,461.0
silver,Italy,394.0
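
The sketch below shows the `keys` parameter on two small hand-built DataFrames. It also passes `names` to label both index levels; the level name `'Medal'` is an assumption made for readability, since the outer level in the output above is unnamed.

```python
import pandas as pd

bronze = pd.DataFrame(
    {'Total': [1052.0, 454.0]},
    index=pd.Index(['United States', 'Germany'], name='Country'),
)
silver = pd.DataFrame(
    {'Total': [1195.0, 394.0]},
    index=pd.Index(['United States', 'Italy'], name='Country'),
)

# keys adds an outer index level that records where each row came from
medals = pd.concat([bronze, silver],
                   keys=['bronze', 'silver'],
                   names=['Medal', 'Country'])

print(medals)
print(medals.loc['bronze'])
```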


## Slicing MultiIndexed DataFrames

This exercise picks up where the last ended (again using The Guardian's Olympic medal dataset).

You are provided with the MultiIndexed DataFrame as produced at the end of the preceding exercise. Your task is to sort the DataFrame and use `pd.IndexSlice` to extract specific slices.

In [151]:
# Sort the entries of medals: medals_sorted
medals_sorted = medals.sort_index(level=0)
medals_sorted

Unnamed: 0_level_0,Unnamed: 1_level_0,Total
Unnamed: 0_level_1,Country,Unnamed: 2_level_1
bronze,France,475.0
bronze,Germany,454.0
bronze,Soviet Union,584.0
bronze,United Kingdom,505.0
bronze,United States,1052.0
gold,Germany,407.0
gold,Italy,460.0
gold,Soviet Union,838.0
gold,United Kingdom,498.0
gold,United States,2088.0


In [153]:
# Print the number of Bronze medals won by Germany
medals_sorted.loc[('bronze','Germany')]

Total    454.0
Name: (bronze, Germany), dtype: float64

In [154]:
# Print data about silver medals
print(medals_sorted.loc['silver'])

                 Total
Country               
France           461.0
Italy            394.0
Soviet Union     627.0
United Kingdom   591.0
United States   1195.0


In [155]:
# Create alias for pd.IndexSlice: idx
idx = pd.IndexSlice

# Print all the data on medals won by the United Kingdom
print(medals_sorted.loc[idx[:,'United Kingdom'], :])

                       Total
       Country              
bronze United Kingdom  505.0
gold   United Kingdom  498.0
silver United Kingdom  591.0
