# Concatenating data
Having learned how to import multiple DataFrames and share information using Indexes, in this chapter you'll learn how to perform database-style operations to combine DataFrames. In particular, you'll learn about appending and concatenating DataFrames while working with a variety of real-world datasets.

# 1. Appending & concatenating Series
### 1.1 Appending Series with nonunique Indices
The Series `bronze` and `silver`, which have been printed in the IPython Shell, represent the 5 countries that won the most bronze and silver Olympic medals respectively between 1896 & 2008. The Indexes of both Series are called `Country` and the values are the corresponding number of medals won.

If you were to run the command `combined = bronze.append(silver)`, how many rows would `combined` have? And how many rows would `combined.loc['United States']` return? Find out for yourself by running these commands in the IPython Shell.

Possible Answers:
* `combined` has 5 rows and `combined.loc['United States']` is empty (0 rows).
* `combined` has 10 rows and `combined.loc['United States']` has 2 rows.
* `combined` has 6 rows and `combined.loc['United States']` has 1 row.
* `combined` has 5 rows and `combined.loc['United States']` has 2 rows.

In [1]:
import pandas as pd
silver = pd.read_csv('_datasets/Summer_Olympics/Silver.csv', index_col='Country')['Total'].nlargest(5)
silver.head()

Country
United States     1195.0
Soviet Union       627.0
United Kingdom     591.0
France             461.0
Italy              394.0
Name: Total, dtype: float64

In [2]:
bronze = pd.read_csv('_datasets/Summer_Olympics/Bronze.csv', index_col='Country')['Total'].nlargest(5)
bronze

Country
United States     1052.0
Soviet Union       584.0
United Kingdom     505.0
France             475.0
Germany            454.0
Name: Total, dtype: float64

In [3]:
combined = bronze.append(silver)
combined

Country
United States     1052.0
Soviet Union       584.0
United Kingdom     505.0
France             475.0
Germany            454.0
United States     1195.0
Soviet Union       627.0
United Kingdom     591.0
France             461.0
Italy              394.0
Name: Total, dtype: float64

In [4]:
combined.loc['United States']

Country
United States    1052.0
United States    1195.0
Name: Total, dtype: float64

The combined Series has 10 rows and `combined.loc['United States']` has two rows, since the index label `'United States'` occurs in both `bronze` and `silver`.
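
On pandas 2.0 and later, `Series.append()` is no longer available (it was deprecated in 1.4 and removed in 2.0), so the same experiment has to go through `pd.concat()`. The sketch below is an illustration only: it rebuilds `bronze` and `silver` by hand from the values printed above instead of reading the CSV files.

```python
import pandas as pd

# Values copied from the printed output above; the Series are rebuilt by hand
bronze = pd.Series(
    [1052.0, 584.0, 505.0, 475.0, 454.0],
    index=pd.Index(['United States', 'Soviet Union', 'United Kingdom',
                    'France', 'Germany'], name='Country'),
    name='Total',
)
silver = pd.Series(
    [1195.0, 627.0, 591.0, 461.0, 394.0],
    index=pd.Index(['United States', 'Soviet Union', 'United Kingdom',
                    'France', 'Italy'], name='Country'),
    name='Total',
)

# Stack the two Series vertically; duplicate index labels are kept
combined = pd.concat([bronze, silver])

print(len(combined))                   # number of rows after stacking
print(combined.loc['United States'])   # one row per occurrence of the label
print(combined.index.is_unique)        # whether the index has duplicates
```

Because the index is not unique, a label lookup returns a Series rather than a single value.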

### 1.2 Appending pandas Series
In this exercise, you'll load sales data from the months January, February, and March into DataFrames. Then, you'll extract Series with the `'Units'` column from each and append them together with method chaining using `.append()`.

To check that the stacking worked, you'll print slices from the combined Series, and finally, you'll sum it to figure out the total units sold in the first quarter.

### Instructions:
* Read the files `'sales-jan-2015.csv'`, `'sales-feb-2015.csv'` and `'sales-mar-2015.csv'` into the DataFrames `jan`, `feb`, and `mar` respectively.
    * Use `parse_dates=True` and `index_col='Date'`.
* Extract the `'Units'` column of `jan`, `feb`, and `mar` to create the Series `jan_units`, `feb_units`, and `mar_units` respectively.
* Construct the Series `quarter1` by appending `feb_units` to `jan_units` and then appending `mar_units` to the result. Use chained calls to the `.append()` method to do this.
* Verify that `quarter1` has the individual Series stacked vertically. To do this:
    * Print the slice containing rows from `jan 27, 2015` to `feb 2, 2015`.
    * Print the slice containing rows from `feb 26, 2015` to `mar 7, 2015`.
* Compute and print the total number of units sold from the Series `quarter1`. 

In [5]:
# Import pandas
import pandas as pd

# Load 'sales-jan-2015.csv' into a DataFrame: jan
jan = pd.read_csv('_datasets/Sales/sales-jan-2015.csv', parse_dates=True, index_col='Date')
# Load 'sales-feb-2015.csv' into a DataFrame: feb
feb = pd.read_csv('_datasets/Sales/sales-feb-2015.csv', parse_dates=True, index_col='Date')
# Load 'sales-mar-2015.csv' into a DataFrame: mar
mar = pd.read_csv('_datasets/Sales/sales-mar-2015.csv', parse_dates=True, index_col='Date')

# Extract the 'Units' column from jan: jan_units
jan_units = jan['Units']
# Extract the 'Units' column from feb: feb_units
feb_units = feb['Units']
# Extract the 'Units' column from mar: mar_units
mar_units = mar['Units']

# Append feb_units and then mar_units to jan_units: quarter1
quarter1 = jan_units.append(feb_units).append(mar_units)

# Print the first slice from quarter1
quarter1.loc['jan 27, 2015':'feb 2, 2015']

Date
2015-01-27 07:11:55    18
2015-02-02 08:33:01     3
2015-02-02 20:54:49     9
Name: Units, dtype: int64

In [6]:
# Print the second slice from quarter1
quarter1.loc['feb 26, 2015':'mar 7, 2015']

Date
2015-02-26 08:57:45     4
2015-02-26 08:58:51     1
2015-03-06 10:11:45    17
2015-03-06 02:03:56    17
Name: Units, dtype: int64

In [7]:
# Compute & print total sales in quarter1
quarter1.sum()

642

### 1.3 Concatenating pandas Series along row axis
Having learned how to append Series, you'll now learn how to achieve the same result by concatenating Series instead. You'll continue to work with the sales data you've seen previously. 

Your job is to use `pd.concat()` with a list of Series to achieve the same result that you would get by chaining calls to `.append()`.

You may be wondering about the difference between `pd.concat()` and pandas' `.append()` method. One way to think of the difference is that `.append()` is a specific case of a concatenation, while `pd.concat()` gives you more flexibility, as you'll see in later exercises.
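
As a rough illustration of that relationship, the sketch below stacks three Series pairwise (which is what chained `.append()` calls amounted to) and then all at once with a single `pd.concat()` call. The Series `a`, `b`, and `c` are hypothetical toy data, not part of the sales dataset.

```python
import pandas as pd

# Hypothetical toy Series, only for illustration
a = pd.Series([1, 2], index=['x', 'y'])
b = pd.Series([3], index=['z'])
c = pd.Series([4, 5], index=['u', 'v'])

# Pairwise stacking: the equivalent of a.append(b).append(c)
pairwise = pd.concat([pd.concat([a, b]), c])

# Stacking the whole list in one call
all_at_once = pd.concat([a, b, c])

# Both approaches give the same values in the same order
print(pairwise.equals(all_at_once))
```

Passing the whole list in one call avoids building an intermediate Series for every step.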

### Instructions:
* Create an empty list called `units`. This has been done for you.
* Use a `for` loop to iterate over `[jan, feb, mar]`:
    * In each iteration of the loop, append the `'Units'` column of each DataFrame to `units`.
* Concatenate the Series contained in the list `units` into a longer Series called `quarter1` using `pd.concat()`.
    * Specify the keyword argument `axis='rows'` to stack the Series vertically.
* Verify that `quarter1` has the individual Series stacked vertically by printing slices.

In [8]:
# Initialize empty list: units
units = []

# Build the list of Series
for month in [jan, feb, mar]:
    units.append(month['Units'])

# Concatenate the list: quarter1
quarter1 = pd.concat(units, axis='rows')

# Print slices from quarter1
quarter1.loc['jan 27, 2015':'feb 2, 2015']

Date
2015-01-27 07:11:55    18
2015-02-02 08:33:01     3
2015-02-02 20:54:49     9
Name: Units, dtype: int64

In [9]:
quarter1.loc['feb 26, 2015':'mar 7, 2015']

Date
2015-02-26 08:57:45     4
2015-02-26 08:58:51     1
2015-03-06 10:11:45    17
2015-03-06 02:03:56    17
Name: Units, dtype: int64

# 2. Appending & concatenating DataFrames
### 2.1 Appending DataFrames with ignore_index
In this exercise, you'll use the [Baby Names Dataset](https://www.data.gov/developers/baby-names-dataset/) (from [data.gov](https://www.data.gov/)) again. This time, both DataFrames `names_1981` and `names_1881` are loaded without specifying an Index column (so the default Indexes for both are RangeIndexes).

You'll use the DataFrame `.append()` method to make a DataFrame `combined_names`. To distinguish rows from the original two DataFrames, you'll add a `'year'` column to each with the year (1881 or 1981 in this case). In addition, you'll specify `ignore_index=True` so that the index values are not used along the concatenation axis. The resulting axis will instead be labeled `0, 1, ..., n-1`, which is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information.
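
A minimal sketch of what `ignore_index=True` changes, written with `pd.concat()` so that it also runs on pandas versions without `DataFrame.append()`. The two small DataFrames and their counts are made up for illustration.

```python
import pandas as pd

# Hypothetical toy DataFrames, each with a default RangeIndex
df_a = pd.DataFrame({'name': ['Mary', 'Anna'], 'count': [10, 7]})
df_b = pd.DataFrame({'name': ['Morgan', 'Emma'], 'count': [3, 5]})

# Without ignore_index the original row labels are kept, so they repeat
kept = pd.concat([df_a, df_b])

# With ignore_index=True the rows are relabeled 0, 1, ..., n-1
renumbered = pd.concat([df_a, df_b], ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```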

### Instructions:
* Create a `'year'` column in the DataFrames `names_1881` and `names_1981`, with values of `1881` and `1981` respectively. Recall that assigning a scalar value to a DataFrame column broadcasts that value throughout.
* Create a new DataFrame called `combined_names` by appending the rows of `names_1981` underneath the rows of `names_1881`. Specify the keyword argument `ignore_index=True` to make a new RangeIndex of unique integers for each row.
* Print the shapes of all three DataFrames. 
* Extract all rows from `combined_names` that have the name `'Morgan'`. To do this, use the `.loc[]` accessor with an appropriate filter. The relevant column of `combined_names` here is `'name'`.

In [10]:
names_1981 = pd.read_csv('_datasets/names1981.csv', header=None, names=['name','gender','count'])
names_1881 = pd.read_csv('_datasets/names1881.csv', header=None, names=['name','gender','count'])

In [11]:
# Add 'year' column to names_1881 and names_1981
names_1881['year'] = 1881
names_1981['year'] = 1981

# Append names_1981 after names_1881 with ignore_index=True: combined_names
combined_names = names_1881.append(names_1981, ignore_index=True)

# Print shapes of names_1981, names_1881, and combined_names
print(names_1981.shape)
print(names_1881.shape)
print(combined_names.shape)

(19455, 4)
(1935, 4)
(21390, 4)


In [12]:
# Preview the first rows of combined_names
combined_names.head()

        name  gender  count  year
0       Mary       F   6919  1881
1       Anna       F   2698  1881
2       Emma       F   2034  1881
3  Elizabeth       F   1852  1881
4   Margaret       F   1658  1881


In [13]:
# Print all rows that contain the name 'Morgan'
combined_names.loc[combined_names['name']=='Morgan']

         name  gender  count  year
1283   Morgan       M     23  1881
2096   Morgan       F   1769  1981
14390  Morgan       M    766  1981


### 2.2 Concatenating pandas DataFrames along column axis
The function `pd.concat()` can concatenate DataFrames _horizontally_ as well as _vertically_ (vertical is the default). To make the DataFrames stack horizontally, you have to specify the keyword argument `axis=1` or `axis='columns'`.

In this exercise, you'll use weather data with maximum and mean daily temperatures sampled at different rates (quarterly versus monthly). You'll concatenate the two DataFrames horizontally and see that, where rows are missing in the coarser DataFrame, null values are inserted in the concatenated DataFrame. This corresponds to an _outer join_ (which you will explore in more detail in later exercises).
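
The following sketch shows the effect on a reduced scale. The DataFrames `quarterly` and `monthly` and their column names are hypothetical stand-ins for the weather data, with a few rounded values.

```python
import pandas as pd

# Hypothetical stand-ins: a coarse (quarterly) and a fine (monthly) table
quarterly = pd.DataFrame({'Max': [68.0, 89.0]}, index=['Jan', 'Apr'])
monthly = pd.DataFrame({'Mean': [32.4, 28.7, 35.0, 53.1]},
                       index=['Jan', 'Feb', 'Mar', 'Apr'])

# Horizontal concatenation aligns rows on the index labels (outer join by default)
wide = pd.concat([quarterly, monthly], axis=1)

# Months missing from the quarterly table get NaN in the 'Max' column
print(wide)
```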

The files `'quarterly_max_temp.csv'` and `'monthly_mean_temp.csv'` have been pre-loaded into the DataFrames `weather_max` and `weather_mean` respectively.

### Instructions:
* Create `weather_list`, a list of the DataFrames `weather_max` and `weather_mean`.
* Create a new DataFrame called `weather` by concatenating `weather_list` _horizontally_.
    * Pass the list to `pd.concat()` and specify the keyword argument `axis=1` to stack them horizontally.
* Print the new DataFrame `weather`.

In [14]:
weather_max = pd.read_csv('_datasets/quarterly_max_temp.csv', index_col='Month')
weather_mean = pd.read_csv('_datasets/monthly_mean_temp.csv', index_col='Month')

In [15]:
# Create a list of weather_max and weather_mean
weather_list = [weather_max, weather_mean]

# Concatenate weather_max and weather_mean horizontally: weather
weather = pd.concat(weather_list, axis=1, sort=True)

# Print weather
weather

Unnamed: 0,Mean TemperatureF,Mean TemperatureF.1
Apr,89.0,53.1
Aug,,70.0
Dec,,34.935484
Feb,,28.714286
Jan,68.0,32.354839
Jul,91.0,72.870968
Jun,,70.133333
Mar,,35.0
May,,62.612903
Nov,,39.8


### 2.3 Reading multiple files to build a DataFrame
It is often convenient to build a large DataFrame by parsing many files as DataFrames and concatenating them all at once. You'll do this here with three files, but, in principle, this approach can be used to combine data from dozens or hundreds of files.
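
The pattern can be sketched without touching the file system. In the example below, the dictionary `fake_files` is a hypothetical stand-in for CSV files on disk (real code would pass file paths to `pd.read_csv()`), and only a few rows are included.

```python
import io
import pandas as pd

# Hypothetical stand-ins for CSV files; the text mimics a two-column file
fake_files = {
    'gold': "Country,Total\nUnited States,2088\nItaly,460\n",
    'silver': "Country,Total\nUnited States,1195\nFrance,461\n",
}

frames = []
for medal, text in fake_files.items():
    # Replace the header row with our own names so each column is labeled by medal
    df = pd.read_csv(io.StringIO(text), header=0, index_col='Country',
                     names=['Country', medal])
    frames.append(df)

# One call combines every parsed DataFrame side by side
table = pd.concat(frames, axis='columns')
print(table)
```

Collecting the DataFrames in a list and concatenating once scales to many files better than concatenating inside the loop.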

Here, you'll work with DataFrames compiled from [The Guardian's Olympic medal dataset](https://www.theguardian.com/sport/datablog/2012/jun/25/olympic-medal-winner-list-data).

`pandas` has been imported as `pd` and two lists have been pre-loaded: An empty list called `medals`, and `medal_types`, which contains the strings `'bronze'`, `'silver'`, and `'gold'`.

### Instructions:
* Iterate over `medal_types` in the `for` loop.
* Inside the `for` loop:
    * Create `file_name` using string interpolation with the loop variable `medal`. The expression `"%s_top5.csv" % medal` evaluates as a string with the _value_ of `medal` replacing `%s` in the format string.
    * Create the list of column names called `columns`. This has been done for you.
    * Read `file_name` into a DataFrame called `medal_df`. Specify the keyword arguments `header=0`, `index_col='Country'`, and `names=columns` to get the correct row and column Indexes.
    * Append `medal_df` to `medals` using the list `.append()` method.
* Concatenate the list of DataFrames `medals` horizontally (using `axis='columns'`) to create a single DataFrame called `medals`. Print it in its entirety.

In [16]:
# Create a file path
path = '_datasets/Summer_Olympics/'

# Create a list of medal_types
medal_types = ['gold', 'silver', 'bronze']

# Create an empty medals list
medals = []

for medal in medal_types:

    # Create the file name: file_name
    file_name = "%s_top5.csv" % medal
    
    # Create list of column names: columns
    columns = ['Country', medal]
    
    # Read file_name into a DataFrame: medal_df
    medal_df = pd.read_csv(path+file_name, header=0, index_col='Country', names=columns)

    # Append medal_df to medals
    medals.append(medal_df)

# Concatenate medals horizontally: medals
medals = pd.concat(medals, axis='columns', sort=True)

# Print medals
medals

                  gold  silver  bronze
France             NaN   461.0   475.0
Germany          407.0     NaN   454.0
Italy            460.0   394.0     NaN
Soviet Union     838.0   627.0   584.0
United Kingdom   498.0   591.0   505.0
United States   2088.0  1195.0  1052.0


# 3. Concatenation, keys, & MultiIndexes
### 3.1 Concatenating vertically to get MultiIndexed rows
When stacking a sequence of DataFrames vertically, it is sometimes desirable to construct a MultiIndex to indicate the DataFrame from which each row originated. This can be done by specifying the `keys` parameter in the call to `pd.concat()`, which generates a hierarchical index with the labels from `keys` as the outermost index level. The labels in `keys` must be listed in the same order as the DataFrames being concatenated. With this approach, you don't have to rename the columns of each DataFrame as you load it; only the Index column needs to be specified.
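
A small sketch of the `keys` mechanism, using two hand-built DataFrames with a couple of rows each (values taken from the medal counts shown earlier in this chapter). Each key is paired with the DataFrame at the same position in the list.

```python
import pandas as pd

countries = pd.Index(['United States', 'Soviet Union'], name='Country')

# Hand-built two-row tables, for illustration only
bronze = pd.DataFrame({'Total': [1052, 584]}, index=countries)
silver = pd.DataFrame({'Total': [1195, 627]}, index=countries)

# keys[0] labels the rows of the first DataFrame, keys[1] those of the second
stacked = pd.concat([bronze, silver], keys=['bronze', 'silver'])

print(stacked)
# A (outer, inner) tuple selects a single row of the MultiIndex
print(stacked.loc[('silver', 'Soviet Union'), 'Total'])
```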

Here, you'll continue working with DataFrames compiled from [The Guardian's Olympic medal dataset](https://www.theguardian.com/sport/datablog/2012/jun/25/olympic-medal-winner-list-data). Two lists have been pre-loaded: An empty list called `medals`, and `medal_types`, which contains the strings `'bronze'`, `'silver'`, and `'gold'`.

### Instructions:
* Within the `for` loop:
    * Read `file_name` into a DataFrame called `medal_df`. Specify the index to be `'Country'`.
    * Append `medal_df` to `medals`.
* Concatenate the list of DataFrames `medals` into a single DataFrame called `medals`. Be sure to use the keyword argument `keys=['bronze', 'silver', 'gold']` to create a vertically stacked DataFrame with a MultiIndex.
* Print the new DataFrame `medals`.

In [17]:
# Create a list of medal_types (in the same order as the keys passed to pd.concat)
medal_types = ['bronze', 'silver', 'gold']
# Create an empty medals list
medals = []

# Create a file path
path = '_datasets/Summer_Olympics/'

In [18]:
for medal in medal_types:

    file_name = "%s_top5.csv" % medal
    
    # Read file_name into a DataFrame: medal_df
    medal_df = pd.read_csv(path+file_name, index_col='Country')
    
    # Append medal_df to medals
    medals.append(medal_df)
    
# Concatenate medals: medals
medals = pd.concat(medals, keys=['bronze', 'silver', 'gold'])

# Print medals in entirety
medals

                        Total
       Country               
bronze United States   1052.0
       Soviet Union     584.0
       United Kingdom   505.0
       France           475.0
       Germany          454.0
silver United States   1195.0
       Soviet Union     627.0
       United Kingdom   591.0
       France           461.0
       Italy            394.0
gold   United States   2088.0
       Soviet Union     838.0
       United Kingdom   498.0
       Italy            460.0
       Germany          407.0


Notice the MultiIndex of `medals`: the outer level holds the medal type and the inner level holds the country.

### 3.2 Slicing MultiIndexed DataFrames
This exercise picks up where the last ended (again using [The Guardian's Olympic medal dataset](https://www.theguardian.com/sport/datablog/2012/jun/25/olympic-medal-winner-list-data)).

You are provided with the MultiIndexed DataFrame as produced at the end of the preceding exercise. Your task is to sort the DataFrame and to use `pd.IndexSlice` to extract specific slices. Check out 'Indexing multiple levels of a MultiIndex' in the 'Advanced Indexing' chapter of _Manipulating DataFrames with pandas_ to refresh your memory on how to deal with MultiIndexed DataFrames.

### Instructions:
* Create a new DataFrame `medals_sorted` with the entries of `medals` sorted. Use `.sort_index(level=0)` to ensure the Index is sorted suitably.
* Print the number of bronze medals won by Germany and all of the silver medal data. This has been done for you.
* Create an alias for `pd.IndexSlice` called `idx`. A _slicer_ `pd.IndexSlice` is required when slicing on the _inner_ level of a MultiIndex.
* Slice all the data on medals won by the United Kingdom. To do this, use the `.loc[]` accessor with `idx[:,'United Kingdom'], :`.
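
The slicing step can be tried on a reduced example first. The DataFrame below is built by hand with two medal types and two countries (values taken from the medal counts shown earlier), so it is a sketch of the technique rather than the exercise data.

```python
import pandas as pd

# Hand-built MultiIndex: outer level = medal type, inner level = country
index = pd.MultiIndex.from_product(
    [['bronze', 'silver'], ['France', 'United Kingdom']]
)
medals = pd.DataFrame({'Total': [475, 505, 461, 591]}, index=index)

# Sorting the index first keeps label-based slicing well defined
medals_sorted = medals.sort_index(level=0)

# idx[:, 'United Kingdom'] means: every outer label, one specific inner label
idx = pd.IndexSlice
uk = medals_sorted.loc[idx[:, 'United Kingdom'], :]
print(uk)
```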

In [19]:
medals = []
medal_types = ['bronze', 'silver', 'gold']
for medal in medal_types:
    file_name = "%s_top5.csv" % medal
    medal_df = pd.read_csv(path+file_name, index_col='Country')
    medals.append(medal_df)
medals = pd.concat(medals, keys=['bronze', 'silver', 'gold'])
medals.head()

                        Total
       Country               
bronze United States   1052.0
       Soviet Union     584.0
       United Kingdom   505.0
       France           475.0
       Germany          454.0


In [20]:
# Sort the entries of medals: medals_sorted
medals_sorted = medals.sort_index(level=0)

# Print the number of Bronze medals won by Germany
medals_sorted.loc[('bronze','Germany')]

Total    454.0
Name: (bronze, Germany), dtype: float64

In [21]:
# Print data about silver medals
medals_sorted.loc['silver']

                 Total
Country               
France           461.0
Italy            394.0
Soviet Union     627.0
United Kingdom   591.0
United States   1195.0


In [22]:
# Create alias for pd.IndexSlice: idx
idx = pd.IndexSlice

# Print all the data on medals won by the United Kingdom
medals_sorted.loc[idx[:,'United Kingdom'], :]

                       Total
       Country              
bronze United Kingdom  505.0
gold   United Kingdom  498.0
silver United Kingdom  591.0


It looks like only the United States and the Soviet Union have won more Silver medals than the United Kingdom.

### 3.3 Concatenating horizontally to get MultiIndexed columns
It is also possible to construct a DataFrame with hierarchically indexed columns. For this exercise, you'll start with pandas imported and a list of three DataFrames called `dataframes`. All three DataFrames contain `'Company'`, `'Product'`, and `'Units'` columns, with `'Date'` as the index, and describe sales transactions during February 2015. The first DataFrame describes `Hardware` transactions, the second describes `Software` transactions, and the third, `Service` transactions.

Your task is to concatenate the DataFrames horizontally and to create a MultiIndex on the columns. From there, you can summarize the resulting DataFrame and slice some information from it.
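
As a hedged sketch of the idea, the example below concatenates two tiny DataFrames horizontally with `keys` and then pulls out one inner column across all groups. The DataFrames `hardware` and `software`, their dates, and their unit counts are invented for illustration.

```python
import pandas as pd

# Invented toy data: same columns, partly different dates
hardware = pd.DataFrame(
    {'Company': ['Mediacore'], 'Units': [9]},
    index=pd.to_datetime(['2015-02-02']),
)
software = pd.DataFrame(
    {'Company': ['Hooli', 'Initech'], 'Units': [3, 13]},
    index=pd.to_datetime(['2015-02-02', '2015-02-03']),
)

# keys become the outer level of the column MultiIndex
wide = pd.concat([hardware, software], axis=1, keys=['Hardware', 'Software'])

# Select the 'Company' column under every outer key
idx = pd.IndexSlice
companies = wide.loc[:, idx[:, 'Company']]
print(companies)
```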

### Instructions:
* Construct a new DataFrame `february` with MultiIndexed columns by concatenating the list `dataframes`.
    * Use `axis=1` to stack the DataFrames horizontally and the keyword argument `keys=['Hardware', 'Software', 'Service']` to construct a hierarchical Index from each DataFrame.
* Print summary information from the new DataFrame `february` using the `.info()` method.
* Create an alias called `idx` for `pd.IndexSlice`.
* Extract a slice called `slice_2_8` from `february` (using `.loc[]` & `idx`) that comprises rows from Feb. 2, 2015 to Feb. 8, 2015 from columns under `'Company'`.
* Print the `slice_2_8`. 

In [23]:
dataframes = []
file_names = ['feb-sales-Hardware.csv', 'feb-sales-Software.csv', 'feb-sales-Service.csv']
path = '_datasets/Sales/'
for file_name in file_names:
    feb_df = pd.read_csv(path+file_name, parse_dates=True, index_col='Date')
    dataframes.append(feb_df)

In [24]:
# Concatenate dataframes: february
february = pd.concat(dataframes, axis=1, keys=['Hardware', 'Software', 'Service'])

# Print february.info()
february.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 20 entries, 2015-02-02 08:33:01 to 2015-02-26 08:58:51
Data columns (total 9 columns):
(Hardware, Company)    5 non-null object
(Hardware, Product)    5 non-null object
(Hardware, Units)      5 non-null float64
(Software, Company)    9 non-null object
(Software, Product)    9 non-null object
(Software, Units)      9 non-null float64
(Service, Company)     6 non-null object
(Service, Product)     6 non-null object
(Service, Units)       6 non-null float64
dtypes: float64(3), object(6)
memory usage: 1.1+ KB


In [25]:
# Assign pd.IndexSlice: idx
idx = pd.IndexSlice

# Create the slice: slice_2_8
slice_2_8 = february.loc['Feb. 2, 2015':'Feb. 8, 2015', idx[:, 'Company']]

# Print slice_2_8
slice_2_8

                            Hardware         Software  Service
                             Company          Company  Company
Date
2015-02-02 08:33:01              NaN            Hooli      NaN
2015-02-02 20:54:49        Mediacore              NaN      NaN
2015-02-03 14:14:18              NaN          Initech      NaN
2015-02-04 15:36:29              NaN        Streeplex      NaN
2015-02-04 21:52:45  Acme Coporation              NaN      NaN
2015-02-05 01:53:06              NaN  Acme Coporation      NaN
2015-02-05 22:05:03              NaN              NaN    Hooli
2015-02-07 22:58:10  Acme Coporation              NaN      NaN


### 3.4 Concatenating DataFrames from a dict
You're now going to revisit the sales data you worked with earlier in the chapter. Three DataFrames `jan`, `feb`, and `mar` have been pre-loaded. Your task is to aggregate the sum of all sales over the `'Company'` column into a single DataFrame. You'll do this by constructing a dictionary of these DataFrames and then concatenating them.
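
A minimal sketch of concatenating from a dictionary, assuming two small hypothetical DataFrames in place of the monthly sales files. When `pd.concat()` receives a dict, the dictionary keys play the role of the `keys` argument and become the outer index level.

```python
import pandas as pd

# Hypothetical toy sales data
jan = pd.DataFrame({'Company': ['Hooli', 'Hooli', 'Initech'], 'Units': [2, 3, 4]})
feb = pd.DataFrame({'Company': ['Hooli', 'Initech'], 'Units': [5, 6]})

# Aggregate each month, keyed by a month name
totals = {
    'january': jan.groupby('Company')[['Units']].sum(),
    'february': feb.groupby('Company')[['Units']].sum(),
}

# The dict keys become the outer level of the row MultiIndex
sales = pd.concat(totals)
print(sales)
```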

### Instructions:
* Create a list called `month_list` consisting of the tuples `('january', jan)`, `('february', feb)`, and `('march', mar)`.
* Create an empty dictionary called `month_dict`.
* Inside the `for` loop:
    * Group `month_data` by `'Company'` and use `.sum()` to aggregate.
* Construct a new DataFrame called `sales` by concatenating the DataFrames stored in `month_dict`.
* Create an alias for `pd.IndexSlice` and print all sales by `'Mediacore'`. 

In [26]:
path = '_datasets/Sales/'
jan = pd.read_csv(path + 'sales-jan-2015.csv')
feb = pd.read_csv(path + 'sales-feb-2015.csv')
mar = pd.read_csv(path + 'sales-mar-2015.csv')

In [27]:
# Make the list of tuples: month_list
month_list = [('january', jan), ('february', feb), ('march', mar)]

# Create an empty dictionary: month_dict
month_dict = {}

for month_name, month_data in month_list:

    # Group month_data: month_dict[month_name]
    month_dict[month_name] = month_data.groupby('Company').sum(numeric_only=True)

# Concatenate data in month_dict: sales
sales = pd.concat(month_dict)

# Print sales
sales

                          Units
         Company               
february Acme Coporation     34
         Hooli               30
         Initech             30
         Mediacore           45
         Streeplex           37
january  Acme Coporation     76
         Hooli               70
         Initech             37
         Mediacore           15
         Streeplex           50


In [28]:
# Print all sales by Mediacore
idx = pd.IndexSlice
sales.loc[idx[:, 'Mediacore'], :]

                    Units
         Company         
february Mediacore     45
january  Mediacore     15
march    Mediacore     68


# 4. Outer & inner joins
### 4.1 Concatenating DataFrames with inner join
Here, you'll continue working with DataFrames compiled from [The Guardian's Olympic medal dataset](https://www.theguardian.com/sport/datablog/2012/jun/25/olympic-medal-winner-list-data).

The DataFrames `bronze`, `silver`, and `gold` have been pre-loaded.

Your task is to compute an inner join.

### Instructions:
* Construct a list of DataFrames called `medal_list` with entries `bronze`, `silver`, and `gold`.
* Concatenate `medal_list` horizontally with an _inner join_ to create `medals`.
    * Use the keyword argument `keys=['bronze', 'silver', 'gold']` to yield suitable hierarchical indexing.
    * Use `axis=1` to get horizontal concatenation.
    * Use `join='inner'` to keep only rows that share common index labels.
* Print the new DataFrame `medals`.

In [29]:
path = '_datasets/Summer_Olympics/'
bronze = pd.read_csv(path+'bronze_top5.csv', index_col='Country')
silver = pd.read_csv(path+'silver_top5.csv', index_col='Country')
gold = pd.read_csv(path+'gold_top5.csv', index_col='Country')

In [30]:
# Create the list of DataFrames: medal_list
medal_list = [bronze, silver, gold]

# Concatenate medal_list horizontally using an inner join: medals
medals = pd.concat(medal_list, keys=['bronze', 'silver', 'gold'], axis=1, join='inner')

# Print medals
medals

                bronze  silver    gold
                 Total   Total   Total
Country
United States   1052.0  1195.0  2088.0
Soviet Union     584.0   627.0   838.0
United Kingdom   505.0   591.0   498.0


France, Italy, and Germany got dropped as part of the join since they are not present in each of `bronze`, `silver`, and `gold`. Therefore, the final DataFrame has only the United States, Soviet Union, and United Kingdom.
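
To see the difference between the two join types side by side, here is a small sketch with two hand-built DataFrames of two rows each (values taken from the medal counts shown earlier). It is an illustration of `join='inner'` versus `join='outer'`, not a reproduction of the exercise.

```python
import pandas as pd

# Hand-built tables that share only one index label
bronze = pd.DataFrame({'Total': [1052, 475]}, index=['United States', 'France'])
gold = pd.DataFrame({'Total': [2088, 460]}, index=['United States', 'Italy'])

# Inner join: keep only the labels present in every DataFrame
inner = pd.concat([bronze, gold], axis=1, keys=['bronze', 'gold'], join='inner')

# Outer join (the default): keep every label and fill the gaps with NaN
outer = pd.concat([bronze, gold], axis=1, keys=['bronze', 'gold'], join='outer')

print(inner)
print(outer)
```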

### 4.2 Resampling & concatenating DataFrames with inner join
In this exercise, you'll compare the historical 10-year GDP (Gross Domestic Product) growth in the US and in China. The data for the US starts in 1947 and is recorded quarterly; by contrast, the data for China starts in 1961 and is recorded annually.

You'll need to use a combination of resampling and an inner join to align the index labels. You'll need an appropriate [offset alias](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases) for resampling, and the `.resample()` call must be followed by an aggregation method (`.last()` in this case); `.pct_change()` is then chained onto the aggregated result.

The DataFrames `china` and `us` have been pre-loaded, with the output of `china.head()` and `us.head()` printed in the IPython Shell.

### Instructions:
* Make a new DataFrame `china_annual` by resampling the DataFrame `china` with `.resample('A').last()` (i.e., with _annual_ frequency) and chaining two method calls:
    * Chain `.pct_change(10)` to compute the percentage change with an offset of ten years.
    * Chain `.dropna()` to eliminate rows containing null values.
* Make a new DataFrame `us_annual` by resampling the DataFrame `us` exactly as you resampled `china`.
* Concatenate `china_annual` and `us_annual` to construct a DataFrame called `gdp`. Use `join='inner'` to perform an inner join and use `axis=1` to concatenate _horizontally_.
* Print the result of resampling `gdp` every decade (i.e., using `.resample('10A')`) and aggregating with the method `.last()`.
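
The `.pct_change(10)` step is easiest to check on numbers that can be worked out by hand. The sketch below uses a made-up annual Series (not the GDP data) and skips the resampling step, so it only illustrates how the ten-period offset and `.dropna()` behave.

```python
import pandas as pd

# Made-up annual values: 100, 110, ..., 210 for the years 2000 to 2011
years = pd.to_datetime([f'{year}-12-31' for year in range(2000, 2012)])
gdp = pd.Series([100.0 + 10 * i for i in range(12)], index=years)

# Compare each value with the one ten rows (here: ten years) earlier
growth = gdp.pct_change(10)

# The first ten rows have nothing to compare against, so they are NaN
tidy = growth.dropna()
print(tidy)
```

Unlike `.last()`, `.pct_change()` returns one value per input row, which is why the rows without a ten-year predecessor have to be dropped afterwards.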

In [31]:
us = pd.read_csv('_datasets/gdp_usa.csv', index_col='Year', parse_dates=True, header=1, names=['Year','US'])
china = pd.read_csv('_datasets/gdp_china.csv', index_col='Year', parse_dates=True, header=1, names=['Year','China'])

print(china.head())
print(us.head())

                China
Year                 
1961-01-01  49.557050
1962-01-01  46.685179
1963-01-01  50.097303
1964-01-01  59.062255
1965-01-01  69.709153
               US
Year             
1947-04-01  246.3
1947-07-01  250.1
1947-10-01  260.3
1948-01-01  266.2
1948-04-01  272.9


In [32]:
# Resample and tidy china: china_annual
china_annual = china.resample('A').last().pct_change(10).dropna()

# Resample and tidy us: us_annual
us_annual = us.resample('A').last().pct_change(10).dropna()

# Concatenate china_annual and us_annual: gdp
gdp = pd.concat([china_annual, us_annual], axis=1, join='inner')

# Resample gdp and print
gdp.resample('10A').last()

               China        US
Year
1971-12-31  0.988860  1.052270
1981-12-31  0.972048  1.750922
1991-12-31  0.962528  0.912380
2001-12-31  2.492511  0.704219
2011-12-31  4.623958  0.475082
2021-12-31  3.789936  0.361780
