# Concatenating data
Having learned how to import multiple DataFrames and share information using Indexes, in this chapter you'll learn how to perform database-style operations to combine DataFrames. In particular, you'll learn about appending and concatenating DataFrames while working with a variety of real-world datasets.

In [1]:
from IPython.display import HTML, Image
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Appending & concatenating Series
## append()
- .append(): Series & DataFrame method
- Invocation:
    - s1.append(s2)
- Stacks rows of s2 below s1
- Note: deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat() is the replacement

## concat()
- concat(): pandas module function
- Invocation:
    - pd.concat([s1, s2, s3])
- Can stack row-wise or column-wise

## concat() & .append()
- Equivalence of concat() & .append():
    - result1 = pd.concat([s1, s2, s3])
    - result2 = s1.append(s2).append(s3)
- result1 == result2 elementwise
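
The equivalence above can be checked on toy data. The following is a minimal sketch with made-up Series `s1`, `s2`, and `s3` (assumptions, not course data). Because `.append()` is not available in pandas 2.0 and later, the chained form is imitated here with pairwise `pd.concat()` calls.

```python
import pandas as pd

# Hypothetical toy Series (not taken from the course datasets)
s1 = pd.Series([1, 2], index=['a', 'b'])
s2 = pd.Series([3, 4], index=['c', 'd'])
s3 = pd.Series([5], index=['e'])

# One call that stacks all three Series row-wise
result1 = pd.concat([s1, s2, s3])

# Pairwise stacking, standing in for s1.append(s2).append(s3)
result2 = pd.concat([pd.concat([s1, s2]), s3])

# Same labels and same values, in the same order
print(result1.equals(result2))
```

On older pandas versions, `s1.append(s2).append(s3)` should give the same object as `result2`.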

In [2]:
%%HTML
<video style="display:block; margin: 0 auto;" controls>
      <source src="_Docs/01-Appending_&_concatenating_Series.mp4" type="video/mp4">
</video>

### Appending Series with nonunique Indices
The Series `bronze` and `silver`, which have been printed, represent the 5 countries that won the most bronze and silver Olympic medals respectively between 1896 & 2008. The Indexes of both Series are called `Country` and the values are the corresponding number of medals won.

If you were to run the command `combined = bronze.append(silver)`, how many rows would combined have? And how many rows would `combined.loc['United States']` return?

In [3]:
medals_filepath = '../_datasets/Summer_Olympic_medals/'

# Read 'Bronze.csv' into a DataFrame: bronze
bronze = pd.read_csv(medals_filepath+'Bronze.csv', index_col='Country')
bronze = bronze.iloc[:5,1]
bronze

Country
United States     1052.0
Soviet Union       584.0
United Kingdom     505.0
France             475.0
Germany            454.0
Name: Total, dtype: float64

In [4]:
# Read 'Silver.csv' into a DataFrame: silver
silver = pd.read_csv(medals_filepath+'Silver.csv', index_col='Country')
silver = silver.iloc[:5,1]
silver

Country
United States     1195.0
Soviet Union       627.0
United Kingdom     591.0
France             461.0
Germany            350.0
Name: Total, dtype: float64

In [5]:
combined = bronze.append(silver)
combined

Country
United States     1052.0
Soviet Union       584.0
United Kingdom     505.0
France             475.0
Germany            454.0
United States     1195.0
Soviet Union       627.0
United Kingdom     591.0
France             461.0
Germany            350.0
Name: Total, dtype: float64

In [6]:
combined.loc['United States']

Country
United States    1052.0
United States    1195.0
Name: Total, dtype: float64
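
Stacking two Series that share index labels keeps every row, so a label lookup can return more than one value. The sketch below uses made-up medal counts (the names `bronze_toy` and `silver_toy` are hypothetical) and also shows the `verify_integrity` argument of `pd.concat()`, which raises a `ValueError` when the result would contain duplicate labels.

```python
import pandas as pd

# Hypothetical toy Series with overlapping index labels
bronze_toy = pd.Series([10, 5], index=['United States', 'France'], name='Total')
silver_toy = pd.Series([12, 4], index=['United States', 'France'], name='Total')

# Row-wise stacking keeps both copies of each label
combined_toy = pd.concat([bronze_toy, silver_toy])
print(combined_toy.index.is_unique)            # False
print(len(combined_toy.loc['United States']))  # 2

# verify_integrity=True refuses to build an index with duplicates
rejected = False
try:
    pd.concat([bronze_toy, silver_toy], verify_integrity=True)
except ValueError as err:
    rejected = True
    print('Duplicate labels rejected:', err)
```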

### Appending pandas Series
In this exercise, you'll load sales data from the months `January`, `February`, and `March` into DataFrames. Then, you'll extract Series with the `'Units'` column from each and append them together with method chaining using `.append()`.

To check that the stacking worked, you'll print slices from the combined Series, and finally, you'll sum the result to find the total units sold in the first quarter.

In [7]:
sales_filepath = '../_datasets/Sales/'

# Load 'sales-jan-2015.csv' into a DataFrame: jan
jan = pd.read_csv(sales_filepath+'sales-jan-2015.csv',parse_dates=True,index_col='Date')

# Load 'sales-feb-2015.csv' into a DataFrame: feb
feb = pd.read_csv(sales_filepath+'sales-feb-2015.csv',parse_dates=True,index_col='Date')

# Load 'sales-mar-2015.csv' into a DataFrame: mar
mar = pd.read_csv(sales_filepath+'sales-mar-2015.csv',parse_dates=True,index_col='Date')

# Extract the 'Units' column from jan: jan_units
jan_units = jan['Units']

# Extract the 'Units' column from feb: feb_units
feb_units = feb['Units']

# Extract the 'Units' column from mar: mar_units
mar_units = mar['Units']

# Append feb_units and then mar_units to jan_units: quarter1
quarter1 = jan_units.append(feb_units).append(mar_units)
quarter1.head()

Date
2015-01-21 19:13:21    11
2015-01-09 05:23:51     8
2015-01-06 17:19:34    17
2015-01-02 09:51:06    16
2015-01-11 14:51:02    11
Name: Units, dtype: int64

In [8]:
# Print the first slice from quarter1
print(quarter1.loc['jan 27, 2015':'feb 2, 2015'])

# Print the second slice from quarter1
print(quarter1.loc['feb 26, 2015':'mar 7, 2015'])

# Compute & print total sales in quarter1
print(quarter1.sum())

Date
2015-01-27 07:11:55    18
2015-02-02 08:33:01     3
2015-02-02 20:54:49     9
Name: Units, dtype: int64
Date
2015-02-26 08:57:45     4
2015-02-26 08:58:51     1
2015-03-06 10:11:45    17
2015-03-06 02:03:56    17
Name: Units, dtype: int64
642


### Concatenating pandas Series along row axis
Having learned how to append Series, you'll now learn how to achieve the same result by concatenating Series instead. You'll continue to work with the sales data you've seen previously. This time, the DataFrames `jan`, `feb`, and `mar` have been pre-loaded.

Your job is to use `pd.concat()` with a list of Series to achieve the same result that you would get by chaining calls to `.append()`.

You may be wondering about the difference between `pd.concat()` and pandas' `.append()` method. **One way to think of the difference is that `.append()` is a specific case of a concatenation, while `pd.concat()` gives you more flexibility**.

In [9]:
# Initialize empty list: units
units = []

# Build the list of Series
for month in [jan, feb, mar]:
    units.append(month['Units'])

# Concatenate the list: quarter1
quarter1 = pd.concat(units,axis='rows')

# Print slices from quarter1
print(quarter1.loc['jan 27, 2015':'feb 2, 2015'])
print(quarter1.loc['feb 26, 2015':'mar 7, 2015'])

Date
2015-01-27 07:11:55    18
2015-02-02 08:33:01     3
2015-02-02 20:54:49     9
Name: Units, dtype: int64
Date
2015-02-26 08:57:45     4
2015-02-26 08:58:51     1
2015-03-06 10:11:45    17
2015-03-06 02:03:56    17
Name: Units, dtype: int64
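
The concatenated index is not in chronological order, and depending on the pandas version, slicing an unsorted `DatetimeIndex` by date strings may warn or fail. A minimal sketch, assuming made-up unit counts and hypothetical names (`jan_toy`, `feb_toy`, `mar_toy`), that sorts the index before slicing:

```python
import pandas as pd

# Hypothetical toy Series with deliberately unordered timestamps
jan_toy = pd.Series(
    [18, 7],
    index=pd.to_datetime(['2015-01-27 07:11:55', '2015-01-05 10:00:00']),
    name='Units')
feb_toy = pd.Series(
    [9, 3, 4],
    index=pd.to_datetime(['2015-02-02 20:54:49', '2015-02-02 08:33:01',
                          '2015-02-26 08:57:45']),
    name='Units')
mar_toy = pd.Series(
    [17],
    index=pd.to_datetime(['2015-03-06 10:11:45']),
    name='Units')

# Stack row-wise, then sort so that date-string slicing is well defined
quarter_toy = pd.concat([jan_toy, feb_toy, mar_toy]).sort_index()

# Both endpoints are inclusive; 'feb 2' covers the whole day
first_slice = quarter_toy.loc['2015-01-27':'2015-02-02']
print(first_slice)
print(quarter_toy.sum())
```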


# Appending & concatenating DataFrames

In [10]:
%%HTML
<video style="display:block; margin: 0 auto;" controls>
      <source src="_Docs/02-Appending_&_concatenating_DataFrames.mp4" type="video/mp4">
</video>

### Appending DataFrames with `ignore_index`
In this exercise, you'll use the [Baby Names Dataset][1] (from [data.gov][2]) again. This time, both DataFrames `names_1981` and `names_1881` are loaded without specifying an Index column (so the default Indexes for both are RangeIndexes).

You'll use the DataFrame `.append()` method to make a DataFrame `combined_names`. To distinguish rows from the original two DataFrames, you'll add a `'year'` column to each with the year (1881 or 1981 in this case). In addition, you'll specify `ignore_index=True` so that the index values are not used along the concatenation axis. The resulting axis will instead be labeled `0, 1, ..., n-1`, which is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information.

[1]: https://www.data.gov/developers/baby-names-dataset/
[2]: http://data.gov/

In [11]:
BabyNames_filepath = '../_datasets/Baby_names/'
names_1981 = pd.read_csv(BabyNames_filepath+'names1981.csv', header=None, names=['name','gender','count'])
names_1881 = pd.read_csv(BabyNames_filepath+'names1881.csv', header=None, names=['name','gender','count'])

print('Shape of names_1981 DataFrame: '+str(names_1981.shape))
print('Shape of names_1881 DataFrame: '+str(names_1881.shape))

Shape of names_1981 DataFrame: (19455, 3)
Shape of names_1881 DataFrame: (1935, 3)


In [12]:
# Add 'year' column to names_1881 and names_1981
names_1881['year'] = 1881
names_1981['year'] = 1981

# Append names_1981 after names_1881 with ignore_index=True: combined_names
combined_names = names_1881.append(names_1981,ignore_index=True)

# Print shapes of names_1981, names_1881, and combined_names
print(names_1981.shape)
print(names_1881.shape)
print(combined_names.shape)

combined_names.head()

(19455, 4)
(1935, 4)
(21390, 4)


        name gender  count  year
0       Mary      F   6919  1881
1       Anna      F   2698  1881
2       Emma      F   2034  1881
3  Elizabeth      F   1852  1881
4   Margaret      F   1658  1881


In [13]:
# Print all rows that contain the name 'Morgan'
combined_names.loc[combined_names['name']=='Morgan']

         name gender  count  year
1283   Morgan      M     23  1881
2096   Morgan      F   1769  1981
14390  Morgan      M    766  1981
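
The effect of `ignore_index=True` is easiest to see on small inputs. This is a minimal sketch with made-up counts and hypothetical names (`names_old`, `names_new`); it uses `pd.concat()`, which accepts the same `ignore_index` argument as the `.append()` method used above.

```python
import pandas as pd

# Hypothetical toy DataFrames, each with a default RangeIndex
names_old = pd.DataFrame({'name': ['Mary', 'Anna'], 'count': [20, 10]})
names_new = pd.DataFrame({'name': ['Jessica', 'Michael', 'Morgan'],
                          'count': [30, 25, 5]})

# Tag each row with its year of origin before stacking
names_old['year'] = 1881
names_new['year'] = 1981

# Without ignore_index the original labels are kept, so they repeat
kept = pd.concat([names_old, names_new])
print(kept.index.tolist())   # [0, 1, 0, 1, 2]

# With ignore_index=True the rows are relabeled 0, 1, ..., n-1
fresh = pd.concat([names_old, names_new], ignore_index=True)
print(fresh.index.tolist())  # [0, 1, 2, 3, 4]
```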


### Concatenating pandas DataFrames along column axis
The function `pd.concat()` can concatenate DataFrames horizontally as well as vertically (vertical is the default). To make the DataFrames stack horizontally, you have to specify the keyword argument `axis=1` or `axis='columns'`.

In this exercise, you'll use `weather` data with maximum and mean daily temperatures sampled at different rates (quarterly versus monthly). You'll concatenate the two DataFrames horizontally, aligning them on their row labels, and see that, where rows are missing in the coarser DataFrame, null values are inserted in the concatenated DataFrame. This corresponds to an outer join (which you will explore in more detail in later exercises).

Here, `weather_max` is built from `'monthly_max_temp.csv'` by keeping only the rows for `Jan`, `Apr`, `Jul`, and `Oct` (a quarterly sample), and `weather_mean` is read from `'monthly_mean_temp.csv'`.

In [14]:
# Read 'monthly_max_temp.csv' into a DataFrame
weather_max = pd.read_csv('../_datasets/monthly_max_temp.csv', index_col='Month')
# Keep one month per quarter to mimic quarterly sampling
weather_max = weather_max.loc[['Jan','Apr','Jul','Oct'],:]
weather_max

       Max TemperatureF
Month
Jan                  68
Apr                  84
Jul                  91
Oct                  84


In [15]:
# Read 'monthly_mean_temp.csv' into a DataFrame
weather_mean = pd.read_csv('../_datasets/monthly_mean_temp.csv', index_col='Month')
weather_mean

       Mean TemperatureF
Month
Apr                 53.1
Aug                 70.0
Dec            34.935484
Feb            28.714286
Jan            32.354839
Jul            72.870968
Jun            70.133333
Mar                 35.0
May            62.612903
Nov                 39.8


In [16]:
# Concatenate weather_max and weather_mean horizontally: weather
weather = pd.concat([weather_max,weather_mean],axis=1, sort=True)

# Print weather
weather

     Max TemperatureF  Mean TemperatureF
Apr              84.0               53.1
Aug               NaN               70.0
Dec               NaN          34.935484
Feb               NaN          28.714286
Jan              68.0          32.354839
Jul              91.0          72.870968
Jun               NaN          70.133333
Mar               NaN               35.0
May               NaN          62.612903
Nov               NaN               39.8


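This is where you start to see the advantages of concatenating over appending: `.append()` can only stack rows, whereas `pd.concat()` can also place objects side by side and align them on their index labels.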
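
Column-wise concatenation can be sketched on a small example. The values and the names `max_toy` and `mean_toy` below are made up for illustration; the point is that rows present in only one input get `NaN` in the other input's column.

```python
import pandas as pd

# Hypothetical toy data: a coarse (two-month) and a finer (three-month) table
max_toy = pd.DataFrame({'Max TemperatureF': [68, 91]}, index=['Jan', 'Jul'])
mean_toy = pd.DataFrame({'Mean TemperatureF': [32.4, 28.7, 72.9]},
                        index=['Jan', 'Feb', 'Jul'])

# axis='columns' (or axis=1) places the inputs side by side;
# sort=True orders the union of the row labels
weather_toy = pd.concat([max_toy, mean_toy], axis='columns', sort=True)
print(weather_toy)
```

The default `join='outer'` keeps the union of the row labels, which is why `Feb` appears with a missing maximum temperature.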

### Reading multiple files to build a DataFrame
It is often convenient to build a large DataFrame by parsing many files as DataFrames and concatenating them all at once. You'll do this here with three files, but, in principle, this approach can be used to combine data from dozens or hundreds of files.

Here, you'll work with DataFrames compiled from [The Guardian's Olympic medal dataset][1].

pandas has been imported as `pd` and two lists have been pre-loaded: An empty list called `medals`, and `medal_types`, which contains the strings `'bronze'`, `'silver'`, and `'gold'`.

[1]: https://www.theguardian.com/sport/datablog/2012/jun/25/olympic-medal-winner-list-data

In [17]:
medal_types = ['bronze', 'silver', 'gold']
medals = []

In [18]:
for medal in medal_types:

    # Create the file name: file_name
    file_name = "%s_top5.csv" % medal
    print(file_name)
    # Create list of column names: columns
    columns = ['Country', medal]
    
    # Read file_name into a DataFrame: medal_df
    medal_df = pd.read_csv(medals_filepath+file_name,header=0,index_col='Country',names=columns)

    # Append medal_df to medals
    medals.append(medal_df)

# Concatenate medals horizontally: medals
medals = pd.concat(medals,axis='columns', sort=True)

# Print medals
medals

bronze_top5.csv
silver_top5.csv
gold_top5.csv


                bronze  silver    gold
France           475.0   461.0     NaN
Germany          454.0     NaN   407.0
Italy              NaN   394.0   460.0
Soviet Union     584.0   627.0   838.0
United Kingdom   505.0   591.0   498.0
United States   1052.0  1195.0  2088.0
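
The read-then-concatenate pattern can be tried without any files on disk. In this sketch, the dictionary `fake_files` is a hypothetical stand-in for CSV files such as `bronze_top5.csv`, with made-up counts, and `io.StringIO` lets `pd.read_csv()` read each string as if it were a file.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-ins for the per-medal CSV files
fake_files = {
    'bronze': "Country,Total\nUnited States,3\nFrance,1\n",
    'silver': "Country,Total\nUnited States,2\nItaly,4\n",
}

frames = []
for medal, text in fake_files.items():
    # header=0 drops the file's own header row so names= can relabel the columns
    df = pd.read_csv(io.StringIO(text), header=0, index_col='Country',
                     names=['Country', medal])
    frames.append(df)

# One concatenation at the end, rather than growing a DataFrame in the loop
medals_toy = pd.concat(frames, axis='columns', sort=True)
print(medals_toy)
```

Collecting the pieces in a list and calling `pd.concat()` once is generally cheaper than concatenating inside the loop, since each concatenation copies the data.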


# Concatenation, keys, & MultiIndexes

In [19]:
%%HTML
<video style="display:block; margin: 0 auto;" controls>
      <source src="_Docs/03-Concatenation_keys_&_MultiIndexes.mp4" type="video/mp4">
</video>

### Concatenating vertically to get MultiIndexed rows
When stacking a sequence of DataFrames vertically, it is sometimes desirable to construct a MultiIndex to indicate the DataFrame from which each row originated. This can be done by specifying the `keys` parameter in the call to `pd.concat()`, which generates a hierarchical index with the labels from `keys` as the outermost index level. As a result, you don't have to rename the columns of each DataFrame as you load it; only the index column needs to be specified.

Here, you'll continue working with DataFrames compiled from [The Guardian's Olympic medal dataset][1]. Once again, pandas has been imported as `pd` and two lists have been pre-loaded: An empty list called `medals`, and `medal_types`, which contains the strings `'bronze'`, `'silver'`, and `'gold'`.

[1]: https://www.theguardian.com/sport/datablog/2012/jun/25/olympic-medal-winner-list-data

In [20]:
medal_types = ['bronze', 'silver', 'gold']
medals = []

In [21]:
for medal in medal_types:

    file_name = "%s_top5.csv" % medal
    
    # Read file_name into a DataFrame: medal_df
    medal_df = pd.read_csv(medals_filepath+file_name, index_col='Country')
    
    # Append medal_df to medals
    medals.append(medal_df)
    
# Concatenate medals: medals
medals = pd.concat(medals,keys=medal_types)

# Print medals in entirety
medals

                        Total
       Country
bronze United States   1052.0
       Soviet Union     584.0
       United Kingdom   505.0
       France           475.0
       Germany          454.0
silver United States   1195.0
       Soviet Union     627.0
       United Kingdom   591.0
       France           461.0
       Italy            394.0
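
The role of `keys` can be seen on two small inputs. The DataFrames `bronze_toy` and `silver_toy` below are hypothetical, with made-up totals; both keep the same column name, and the outer index level records where each row came from.

```python
import pandas as pd

# Hypothetical toy DataFrames sharing the column name 'Total'
bronze_toy = pd.DataFrame(
    {'Total': [3, 1]},
    index=pd.Index(['United States', 'France'], name='Country'))
silver_toy = pd.DataFrame(
    {'Total': [2, 4]},
    index=pd.Index(['United States', 'Italy'], name='Country'))

# keys adds an outer index level naming the source of each row
stacked = pd.concat([bronze_toy, silver_toy], keys=['bronze', 'silver'])
print(stacked.index.nlevels)  # 2

# Selecting on the outer level returns the original silver rows
print(stacked.loc['silver'])
```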


### Slicing MultiIndexed DataFrames
This exercise picks up where the last ended (again using [The Guardian's Olympic medal dataset][1]).

You are provided with the MultiIndexed DataFrame produced at the end of the preceding exercise. Your task is to sort the DataFrame and to use `pd.IndexSlice` to extract specific slices. Check out [this exercise][2] from Manipulating DataFrames with pandas to refresh your memory on how to deal with MultiIndexed DataFrames.

[1]: https://www.theguardian.com/sport/datablog/2012/jun/25/olympic-medal-winner-list-data
[2]: https://campus.datacamp.com/courses/manipulating-dataframes-with-pandas/advanced-indexing?ex=10

In [22]:
# Sort the entries of medals: medals_sorted
medals_sorted = medals.sort_index(level=0)

# Print the number of Bronze medals won by Germany
medals_sorted.loc[('bronze','Germany')]

Total    454.0
Name: (bronze, Germany), dtype: float64

In [23]:
# Print data about silver medals
print(medals_sorted.loc['silver'])

                 Total
Country               
France           461.0
Italy            394.0
Soviet Union     627.0
United Kingdom   591.0
United States   1195.0


In [24]:
# Create alias for pd.IndexSlice: idx
idx = pd.IndexSlice

# Print all the data on medals won by the United Kingdom
print(medals_sorted.loc[idx[:,'United Kingdom'], :])

                       Total
       Country              
bronze United Kingdom  505.0
gold   United Kingdom  498.0
silver United Kingdom  591.0
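
The slicing pattern used here can be reproduced on a small MultiIndexed DataFrame. The sketch below uses made-up totals; the level name `medal` and the variable `medals_toy` are assumptions for illustration. The index is sorted first because label slices on a MultiIndex require a sorted index.

```python
import pandas as pd

# Hypothetical toy MultiIndex: medal type (outer) by country (inner)
index = pd.MultiIndex.from_product(
    [['bronze', 'gold', 'silver'], ['France', 'United Kingdom']],
    names=['medal', 'Country'])
medals_toy = pd.DataFrame({'Total': [1, 2, 3, 4, 5, 6]}, index=index).sort_index()

idx = pd.IndexSlice

# All outer labels, one inner label
uk = medals_toy.loc[idx[:, 'United Kingdom'], :]
print(uk)

# A label range on the outer level, all inner labels
gold_onwards = medals_toy.loc[idx['gold':'silver', :], :]
print(gold_onwards)
```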


### Concatenating horizontally to get MultiIndexed columns
It is also possible to construct a DataFrame with hierarchically indexed columns. For this exercise, you'll start with pandas imported and a list of three DataFrames called `dataframes`. All three DataFrames contain `'Company'`, `'Product'`, and `'Units'` columns and are indexed by `'Date'`; they record sales transactions from February 2015. The first DataFrame describes `Hardware` transactions, the second describes `Software` transactions, and the third, `Service` transactions.

Your task is to concatenate the DataFrames horizontally and to create a MultiIndex on the columns. From there, you can summarize the resulting DataFrame and slice some information from it.

In [25]:
transaction_types = ['Hardware','Software','Service']
dataframes = []
for transaction in transaction_types:
    file_name = "feb-sales-"+transaction+".csv" 
    print(file_name)
    # Read file_name into a DataFrame: df
    df = pd.read_csv(sales_filepath+file_name, index_col='Date', parse_dates=True)

    # Append df to dataframes
    dataframes.append(df)
dataframes

feb-sales-Hardware.csv
feb-sales-Software.csv
feb-sales-Service.csv


[                             Company   Product  Units
 Date                                                 
 2015-02-04 21:52:45  Acme Coporation  Hardware     14
 2015-02-07 22:58:10  Acme Coporation  Hardware      1
 2015-02-19 10:59:33        Mediacore  Hardware     16
 2015-02-02 20:54:49        Mediacore  Hardware      9
 2015-02-21 20:41:47            Hooli  Hardware      3,
                              Company   Product  Units
 Date                                                 
 2015-02-16 12:09:19            Hooli  Software     10
 2015-02-03 14:14:18          Initech  Software     13
 2015-02-02 08:33:01            Hooli  Software      3
 2015-02-05 01:53:06  Acme Coporation  Software     19
 2015-02-11 20:03:08          Initech  Software      7
 2015-02-09 13:09:55        Mediacore  Software      7
 2015-02-11 22:50:44            Hooli  Software      4
 2015-02-04 15:36:29        Streeplex  Software     13
 2015-02-21 05:01:26        Mediacore  Software      3,
        

In [26]:
# Concatenate dataframes: february
february = pd.concat(dataframes,axis=1,keys=['Hardware', 'Software', 'Service'], sort=True)

# Print february.info()
print(february.info())

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 20 entries, 2015-02-02 08:33:01 to 2015-02-26 08:58:51
Data columns (total 9 columns):
(Hardware, Company)    5 non-null object
(Hardware, Product)    5 non-null object
(Hardware, Units)      5 non-null float64
(Software, Company)    9 non-null object
(Software, Product)    9 non-null object
(Software, Units)      9 non-null float64
(Service, Company)     6 non-null object
(Service, Product)     6 non-null object
(Service, Units)       6 non-null float64
dtypes: float64(3), object(6)
memory usage: 1.1+ KB
None


In [27]:
february

                            Hardware                          Software                     Service
                             Company   Product  Units          Company   Product  Units    Company  Product  Units
Date
2015-02-02 08:33:01              NaN       NaN    NaN            Hooli  Software    3.0        NaN      NaN    NaN
2015-02-02 20:54:49        Mediacore  Hardware    9.0              NaN       NaN    NaN        NaN      NaN    NaN
2015-02-03 14:14:18              NaN       NaN    NaN          Initech  Software   13.0        NaN      NaN    NaN
2015-02-04 15:36:29              NaN       NaN    NaN        Streeplex  Software   13.0        NaN      NaN    NaN
2015-02-04 21:52:45  Acme Coporation  Hardware   14.0              NaN       NaN    NaN        NaN      NaN    NaN
2015-02-05 01:53:06              NaN       NaN    NaN  Acme Coporation  Software   19.0        NaN      NaN    NaN
2015-02-05 22:05:03              NaN       NaN    NaN              NaN       NaN    NaN      Hooli  Service   10.0
2015-02-07 22:58:10  Acme Coporation  Hardware    1.0              NaN       NaN    NaN        NaN      NaN    NaN
2015-02-09 08:57:30              NaN       NaN    NaN              NaN       NaN    NaN  Streeplex  Service   19.0
2015-02-09 13:09:55              NaN       NaN    NaN        Mediacore  Software    7.0        NaN      NaN    NaN


In [28]:
# Assign pd.IndexSlice: idx
idx = pd.IndexSlice

# Create the slice: slice_2_8
slice_2_8 = february.loc['Feb. 2, 2015':'Feb. 8, 2015']

# Print slice_2_8
slice_2_8

                            Hardware                          Software                     Service
                             Company   Product  Units          Company   Product  Units    Company  Product  Units
Date
2015-02-02 08:33:01              NaN       NaN    NaN            Hooli  Software    3.0        NaN      NaN    NaN
2015-02-02 20:54:49        Mediacore  Hardware    9.0              NaN       NaN    NaN        NaN      NaN    NaN
2015-02-03 14:14:18              NaN       NaN    NaN          Initech  Software   13.0        NaN      NaN    NaN
2015-02-04 15:36:29              NaN       NaN    NaN        Streeplex  Software   13.0        NaN      NaN    NaN
2015-02-04 21:52:45  Acme Coporation  Hardware   14.0              NaN       NaN    NaN        NaN      NaN    NaN
2015-02-05 01:53:06              NaN       NaN    NaN  Acme Coporation  Software   19.0        NaN      NaN    NaN
2015-02-05 22:05:03              NaN       NaN    NaN              NaN       NaN    NaN      Hooli  Service   10.0
2015-02-07 22:58:10  Acme Coporation  Hardware    1.0              NaN       NaN    NaN        NaN      NaN    NaN
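
With hierarchical columns, `pd.IndexSlice` can also be applied along the column axis, for example to pull one inner column out of every group. This is a minimal sketch on made-up data; `hardware`, `software`, and `feb_toy` are hypothetical names, and the keys are chosen in alphabetical order so the column MultiIndex is sorted.

```python
import pandas as pd

# Hypothetical toy transactions indexed by date
dates = pd.to_datetime(['2015-02-02', '2015-02-03', '2015-02-04'])
hardware = pd.DataFrame({'Company': ['Mediacore', 'Hooli'], 'Units': [9, 3]},
                        index=dates[[0, 2]])
software = pd.DataFrame({'Company': ['Initech', 'Hooli'], 'Units': [13, 3]},
                        index=dates[[0, 1]])

# keys with axis=1 become the outer level of the column MultiIndex
feb_toy = pd.concat([hardware, software], axis=1,
                    keys=['Hardware', 'Software'], sort=True)

idx = pd.IndexSlice

# Rows: a date range; columns: 'Units' from every outer group
units = feb_toy.loc['2015-02-02':'2015-02-03', idx[:, 'Units']]
print(units)
```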


### Concatenating DataFrames from a dict
You're now going to revisit the sales data you worked with earlier in the chapter. Three DataFrames `jan`, `feb`, and `mar` have been pre-loaded for you. Your task is to aggregate the sum of all sales over the `'Company'` column into a single DataFrame. You'll do this by constructing a dictionary of these DataFrames and then concatenating them.

In [29]:
# Make the list of tuples: month_list
month_list = [('january', jan), ('february', feb), ('march', mar)]

# Create an empty dictionary: month_dict
month_dict = {}

for month_name, month_data in month_list:

    # Group month_data: month_dict[month_name]
    month_dict[month_name] = month_data.groupby('Company').sum()

# Concatenate data in month_dict: sales
sales = pd.concat(month_dict)

# Print sales
sales

                          Units
         Company
february Acme Coporation     34
         Hooli               30
         Initech             30
         Mediacore           45
         Streeplex           37
january  Acme Coporation     76
         Hooli               70
         Initech             37
         Mediacore           15
         Streeplex           50


In [30]:
# Print all sales by Mediacore
idx = pd.IndexSlice
sales.loc[idx[:, 'Mediacore'], :]

                    Units
         Company
february Mediacore     45
january  Mediacore     15
march    Mediacore     68
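
When `pd.concat()` receives a dictionary, the dictionary keys play the role of the `keys` argument. The following is a minimal sketch on made-up sales figures (`jan_toy`, `feb_toy`, and `month_totals` are hypothetical names). It selects `'Units'` explicitly before summing, an assumption made here so that only the numeric column is aggregated regardless of pandas version.

```python
import pandas as pd

# Hypothetical toy sales tables
jan_toy = pd.DataFrame({'Company': ['Hooli', 'Initech', 'Hooli'],
                        'Units': [3, 5, 4]})
feb_toy = pd.DataFrame({'Company': ['Hooli', 'Initech'],
                        'Units': [10, 1]})

# Map each month name to its per-company totals
month_totals = {}
for name, frame in [('january', jan_toy), ('february', feb_toy)]:
    month_totals[name] = frame.groupby('Company')[['Units']].sum()

# Dictionary keys become the outer index level; sort before slicing
sales_toy = pd.concat(month_totals).sort_index()

idx = pd.IndexSlice
hooli = sales_toy.loc[idx[:, 'Hooli'], :]
print(hooli)
```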


# Outer & inner joins
## Joins
- Joining tables: Combining rows of multiple tables
- Outer join
    - Union of index sets (all labels, no repetition)
    - Missing fields filled with NaN
- Inner join
    - Intersection of index sets (only common labels)
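
The two join types listed above can be compared side by side. This is a minimal sketch with made-up medal counts; `left` and `right` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy tables that share only the label 'United States'
left = pd.DataFrame({'bronze': [3, 1]}, index=['United States', 'France'])
right = pd.DataFrame({'gold': [5, 2]}, index=['United States', 'Italy'])

# Outer join: union of the index labels, gaps filled with NaN
outer = pd.concat([left, right], axis=1, join='outer', sort=True)
print(outer)

# Inner join: only the labels present in every input
inner = pd.concat([left, right], axis=1, join='inner')
print(inner)
```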

In [31]:
%%HTML
<video style="display:block; margin: 0 auto;" controls>
      <source src="_Docs/04-Outer_&_inner_joins.mp4" type="video/mp4">
</video>

### Concatenating DataFrames with inner join
Here, you'll continue working with DataFrames compiled from [The Guardian's Olympic medal dataset][1].

The DataFrames `bronze`, `silver`, and `gold` have been pre-loaded for you.

Your task is to compute an inner join.

[1]: https://www.theguardian.com/sport/datablog/2012/jun/25/olympic-medal-winner-list-data

In [32]:
medals_filepath = '../_datasets/Summer_Olympic_medals/'

# Read 'Bronze.csv' into a DataFrame: bronze
bronze = pd.read_csv(medals_filepath+'Bronze.csv', index_col='Country')
bronze = bronze.iloc[:5,1]

# Read 'Silver.csv' into a DataFrame: silver
silver = pd.read_csv(medals_filepath+'Silver.csv', index_col='Country')
silver = silver.iloc[:5,1]

# Read 'Gold.csv' into a DataFrame: gold
gold = pd.read_csv(medals_filepath+'Gold.csv', index_col='Country')
gold = gold.iloc[:5,1]

In [33]:
# Create the list of DataFrames: medal_list
medal_list = [bronze,silver,gold]

# Concatenate medal_list horizontally using an inner join: medals
medals = pd.concat(medal_list, keys=['bronze','silver','gold'], axis=1, join='inner')

# Print medals
medals

                bronze  silver    gold
Country
United States   1052.0  1195.0  2088.0
Soviet Union     584.0   627.0   838.0
United Kingdom   505.0   591.0   498.0
France           475.0   461.0   378.0
Germany          454.0   350.0   407.0


### Resampling & concatenating DataFrames with inner join
In this exercise, you'll compare the historical 10-year GDP (Gross Domestic Product) growth in the US and in China. The data for the US starts in 1947 and is recorded quarterly; by contrast, the data for China used here starts in 1960 and is recorded annually.

You'll need to use a combination of resampling and an inner join to align the index labels. You'll need an appropriate [offset alias][1] for resampling, and the method `.resample()` must be chained with an aggregation method (`.last()` in this case), which is followed here by `.pct_change()` to compute the growth.

[1]: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases

In [34]:
# Read 'gdp_usa.csv' into a DataFrame: us
us = pd.read_csv('../_datasets/GDP/gdp_usa.csv',parse_dates=True,index_col='DATE')
us.index.name = 'Year'
us.columns = ['US']
us.head()

               US
Year
1947-01-01  243.1
1947-04-01  246.3
1947-07-01  250.1
1947-10-01  260.3
1948-01-01  266.2


In [35]:
# Read 'gdp_china.csv' into a DataFrame: china
china = pd.read_csv('../_datasets/GDP/gdp_china.csv',parse_dates=True,index_col='Year')
china.index.name = 'Year'
china.columns = ['China']
china.head()

                China
Year
1960-01-01  59.184116
1961-01-01   49.55705
1962-01-01  46.685179
1963-01-01  50.097303
1964-01-01  59.062255


In [36]:
# Resample and tidy china: china_annual
china_annual = china.resample('A').last().pct_change(10).dropna()
china_annual.head()

               China
Year
1970-12-31  0.546128
1971-12-31   0.98886
1972-12-31  1.402472
1973-12-31  1.730085
1974-12-31  1.408556


In [37]:
# Resample and tidy us: us_annual
us_annual = us.resample('A').last().pct_change(10).dropna()
us_annual.head()

                  US
Year
1957-12-31  0.827507
1958-12-31  0.782686
1959-12-31  0.953137
1960-12-31  0.689354
1961-12-31  0.630959


In [38]:
# Concatenate china_annual and us_annual: gdp
gdp = pd.concat([china_annual,us_annual],join='inner',axis=1)

# Resample gdp and print
gdp.resample('10A').last()

               China        US
Year
1970-12-31  0.546128  1.017187
1980-12-31  1.072537  1.742556
1990-12-31   0.89282  1.012126
2000-12-31  2.357522  0.738632
2010-12-31  4.011081  0.454332
2020-12-31  3.789936   0.36178
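
The resample-then-join pattern can be sketched on synthetic data. The values below are made up (not real GDP figures), and `quarterly`, `annual`, and `gdp_toy` are hypothetical names. The sketch uses the year-start alias `'YS'` instead of the `'A'` alias used above, as an assumption about portability, since the spelling of the year-end alias differs between pandas versions. It also uses a 2-year rather than a 10-year change to keep the example small.

```python
import pandas as pd

# Hypothetical quarterly series: 48 quarters starting in 2000
quarterly = pd.DataFrame(
    {'US': list(range(100, 148))},
    index=pd.date_range('2000-01-01', periods=48, freq='QS'))

# Hypothetical annual series: 10 years starting in 2002
annual = pd.DataFrame(
    {'China': list(range(50, 60))},
    index=pd.date_range('2002-01-01', periods=10, freq='YS'))

# Bring both to one row per year, then compute the 2-year relative change
us_annual = quarterly.resample('YS').last().pct_change(2).dropna()
china_annual = annual.resample('YS').last().pct_change(2).dropna()

# The inner join keeps only the years present in both results
gdp_toy = pd.concat([china_annual, us_annual], join='inner', axis=1)
print(gdp_toy)
```

Resampling both inputs with the same alias is what makes the index labels comparable; without it, the quarterly and annual timestamps would only overlap by coincidence.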
