# Concatenating data
Having learned how to import multiple DataFrames and share information using Indexes, in this chapter you'll learn how to perform database-style operations to combine DataFrames. In particular, you'll learn about appending and concatenating DataFrames while working with a variety of real-world datasets.

# 1. Appending & concatenating Series
### 1.1 Appending Series with nonunique Indices
The Series `bronze` and `silver`, which have been printed in the IPython Shell, represent the 5 countries that won the most bronze and silver Olympic medals respectively between 1896 & 2008. The Indexes of both Series are called `Country` and the values are the corresponding number of medals won.

If you were to run the command `combined = bronze.append(silver)`, how many rows would `combined` have? And how many rows would `combined.loc['United States']` return? Find out for yourself by running these commands in the IPython Shell.

Possible Answers:
* `combined` has 5 rows and `combined.loc['United States']` is empty (0 rows).
* `combined` has 10 rows and `combined.loc['United States']` has 2 rows.
* `combined` has 6 rows and `combined.loc['United States']` has 1 row.
* `combined` has 5 rows and `combined.loc['United States']` has 2 rows.

In [1]:
import pandas as pd
silver = pd.read_csv('_datasets/Summer_Olympics/Silver.csv', index_col='Country')['Total'].nlargest(5)
silver.head()

Country
United States     1195.0
Soviet Union       627.0
United Kingdom     591.0
France             461.0
Italy              394.0
Name: Total, dtype: float64

In [2]:
bronze = pd.read_csv('_datasets/Summer_Olympics/Bronze.csv', index_col='Country')['Total'].nlargest(5)
bronze

Country
United States     1052.0
Soviet Union       584.0
United Kingdom     505.0
France             475.0
Germany            454.0
Name: Total, dtype: float64

In [3]:
combined = bronze.append(silver)
combined

Country
United States     1052.0
Soviet Union       584.0
United Kingdom     505.0
France             475.0
Germany            454.0
United States     1195.0
Soviet Union       627.0
United Kingdom     591.0
France             461.0
Italy              394.0
Name: Total, dtype: float64

In [4]:
combined.loc['United States']

Country
United States    1052.0
United States    1195.0
Name: Total, dtype: float64

The combined Series has 10 rows and `combined.loc['United States']` has 2 rows, since the index value `'United States'` occurs in both Series, `bronze` and `silver`.
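In pandas 2.0 and later, `Series.append()` is no longer available, so the same experiment can be reproduced with `pd.concat()`. The sketch below is one way to do that; it rebuilds `bronze` and `silver` by hand from the values printed above (an assumption made so the example runs without the CSV files) and checks both row counts.

```python
import pandas as pd

# Values copied from the printed output above; the Series are built by hand
# here so the example does not depend on the CSV files.
bronze = pd.Series(
    [1052.0, 584.0, 505.0, 475.0, 454.0],
    index=pd.Index(
        ['United States', 'Soviet Union', 'United Kingdom', 'France', 'Germany'],
        name='Country',
    ),
    name='Total',
)
silver = pd.Series(
    [1195.0, 627.0, 591.0, 461.0, 394.0],
    index=pd.Index(
        ['United States', 'Soviet Union', 'United Kingdom', 'France', 'Italy'],
        name='Country',
    ),
    name='Total',
)

# pd.concat stacks the Series and keeps duplicate index labels as they are
combined = pd.concat([bronze, silver])

print(combined.shape)                 # number of rows after stacking
print(combined.loc['United States'])  # one row per Series that had the label
print(combined.index.is_unique)       # duplicate labels make this False
```

Because the labels are not deduplicated, any later lookup by country returns every matching row, which is worth keeping in mind before aggregating.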

### 1.2 Appending pandas Series
In this exercise, you'll load sales data from the months January, February, and March into DataFrames. Then, you'll extract Series with the `'Units'` column from each and append them together with method chaining using `.append()`.

To check that the stacking worked, you'll print slices from the resulting Series, and finally, you'll sum it to find the total units sold in the first quarter.

### Instructions:
* Read the files `'sales-jan-2015.csv'`, `'sales-feb-2015.csv'` and `'sales-mar-2015.csv'` into the DataFrames `jan`, `feb`, and `mar` respectively.
* Use `parse_dates=True` and `index_col='Date'`.
* Extract the `'Units'` column of `jan`, `feb`, and `mar` to create the Series `jan_units`, `feb_units`, and `mar_units` respectively.
* Construct the Series `quarter1` by appending `feb_units` to `jan_units` and then appending `mar_units` to the result. Use chained calls to the `.append()` method to do this.
* Verify that `quarter1` has the individual Series stacked vertically. To do this:
    * Print the slice containing rows from `jan 27, 2015` to `feb 2, 2015`.
    * Print the slice containing rows from `feb 26, 2015` to `mar 7, 2015`.
* Compute and print the total number of units sold from the Series `quarter1`.

In [5]:
# Import pandas
import pandas as pd

# Load 'sales-jan-2015.csv' into a DataFrame: jan
jan = pd.read_csv('_datasets/Sales/sales-jan-2015.csv', parse_dates=True, index_col='Date')
# Load 'sales-feb-2015.csv' into a DataFrame: feb
feb = pd.read_csv('_datasets/Sales/sales-feb-2015.csv', parse_dates=True, index_col='Date')
# Load 'sales-mar-2015.csv' into a DataFrame: mar
mar = pd.read_csv('_datasets/Sales/sales-mar-2015.csv', parse_dates=True, index_col='Date')

# Extract the 'Units' column from jan: jan_units
jan_units = jan['Units']
# Extract the 'Units' column from feb: feb_units
feb_units = feb['Units']
# Extract the 'Units' column from mar: mar_units
mar_units = mar['Units']

# Append feb_units and then mar_units to jan_units: quarter1
quarter1 = jan_units.append(feb_units).append(mar_units)

# Print the first slice from quarter1
quarter1.loc['jan 27, 2015':'feb 2, 2015']

Date
2015-01-27 07:11:55    18
2015-02-02 08:33:01     3
2015-02-02 20:54:49     9
Name: Units, dtype: int64

In [6]:
# Print the second slice from quarter1
quarter1.loc['feb 26, 2015':'mar 7, 2015']

Date
2015-02-26 08:57:45     4
2015-02-26 08:58:51     1
2015-03-06 10:11:45    17
2015-03-06 02:03:56    17
Name: Units, dtype: int64

In [7]:
# Compute & print total units sold in quarter1
quarter1.sum()

642
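The slicing step relies on partial-string indexing of a `DatetimeIndex`: a date without a time selects the whole day, and the end of the slice is inclusive. The following is a minimal sketch of that behavior on toy data. The three small Series are illustrative stand-ins for `jan_units`, `feb_units`, and `mar_units`, not the real sales files, and `pd.concat()` is used for the stacking so that the example also runs on pandas versions without `.append()`.

```python
import pandas as pd

# Toy stand-ins for jan_units, feb_units and mar_units (illustrative values)
jan_units = pd.Series(
    [7, 18],
    index=pd.to_datetime(['2015-01-10 09:00:00', '2015-01-27 07:11:55']),
    name='Units',
)
feb_units = pd.Series(
    [3, 9, 4],
    index=pd.to_datetime(
        ['2015-02-02 08:33:01', '2015-02-02 20:54:49', '2015-02-26 08:57:45']
    ),
    name='Units',
)
mar_units = pd.Series(
    [17],
    index=pd.to_datetime(['2015-03-06 10:11:45']),
    name='Units',
)

# Stack the three months vertically
quarter1 = pd.concat([jan_units, feb_units, mar_units])

# Date-only strings select whole days; both endpoints are included
first_slice = quarter1.loc['2015-01-27':'2015-02-02']
print(first_slice)

# Total units across the stacked Series
total_units = quarter1.sum()
print(total_units)
```

The slice spans the boundary between the January and February data, which is what confirms that the Series were stacked in order.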

### 1.3 Concatenating pandas Series along row axis
Having learned how to append Series, you'll now learn how to achieve the same result by concatenating Series instead. You'll continue to work with the sales data you've seen previously. 

Your job is to use `pd.concat()` with a list of Series to achieve the same result that you would get by chaining calls to `.append()`.

You may be wondering about the difference between `pd.concat()` and pandas' `.append()` method. One way to think of the difference is that `.append()` is a specific case of a concatenation, while `pd.concat()` gives you more flexibility, as you'll see in later exercises.

### Instructions:
* Create an empty list called `units`. This has been done for you.
* Use a `for` loop to iterate over `[jan, feb, mar]`:
    * In each iteration of the loop, append the `'Units'` column of each DataFrame to `units`.
* Concatenate the Series contained in the list `units` into a longer Series called `quarter1` using `pd.concat()`.
    * Specify the keyword argument `axis='rows'` to stack the Series vertically.
* Verify that `quarter1` has the individual Series stacked vertically by printing slices.

In [8]:
# Initialize empty list: units
units = []

# Build the list of Series
for month in [jan, feb, mar]:
    units.append(month['Units'])

# Concatenate the list: quarter1
quarter1 = pd.concat(units, axis='rows')

# Print slices from quarter1
quarter1.loc['jan 27, 2015':'feb 2, 2015']

Date
2015-01-27 07:11:55    18
2015-02-02 08:33:01     3
2015-02-02 20:54:49     9
Name: Units, dtype: int64

In [9]:
quarter1.loc['feb 26, 2015':'mar 7, 2015']

Date
2015-02-26 08:57:45     4
2015-02-26 08:58:51     1
2015-03-06 10:11:45    17
2015-03-06 02:03:56    17
Name: Units, dtype: int64
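The loop that builds `units` can also be written as a list comprehension, and a single `pd.concat()` call over the list gives the same result as stacking the Series two at a time. The sketch below illustrates this with small hypothetical DataFrames standing in for `jan`, `feb`, and `mar`; the `'Product'` column and all values are invented for the example.

```python
import pandas as pd

# Hypothetical miniature versions of the jan, feb and mar DataFrames
jan = pd.DataFrame(
    {'Product': ['Software', 'Hardware'], 'Units': [18, 7]},
    index=pd.to_datetime(['2015-01-10 09:00:00', '2015-01-27 07:11:55']),
)
feb = pd.DataFrame(
    {'Product': ['Service', 'Software'], 'Units': [3, 9]},
    index=pd.to_datetime(['2015-02-02 08:33:01', '2015-02-02 20:54:49']),
)
mar = pd.DataFrame(
    {'Product': ['Hardware'], 'Units': [17]},
    index=pd.to_datetime(['2015-03-06 10:11:45']),
)

# Build the list of 'Units' Series in one expression
units = [month['Units'] for month in [jan, feb, mar]]

# One call stacks every Series in the list (axis=0 means along the rows)
quarter1 = pd.concat(units, axis=0)

# Stacking pairwise, the way chained appends would, gives the same Series
stepwise = pd.concat([pd.concat([jan['Units'], feb['Units']]), mar['Units']])

print(quarter1.equals(stepwise))
```

Passing the whole list at once scales to any number of Series without further chaining.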

# 2. Appending & concatenating DataFrames
### 2.1 Appending DataFrames with ignore_index
In this exercise, you'll use the [Baby Names Dataset](https://www.data.gov/developers/baby-names-dataset/) (from [data.gov](https://www.data.gov/)) again. This time, both DataFrames `names_1981` and `names_1881` are loaded without specifying an Index column (so the default Indexes for both are RangeIndexes).

You'll use the DataFrame `.append()` method to make a DataFrame `combined_names`. To distinguish rows from the original two DataFrames, you'll add a `'year'` column to each with the year (1881 or 1981 in this case). In addition, you'll specify `ignore_index=True` so that the index values are not used along the concatenation axis. The resulting axis will instead be labeled `0, 1, ..., n-1`, which is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information.

### Instructions:
* Create a `'year'` column in the DataFrames `names_1881` and `names_1981`, with values of `1881` and `1981` respectively. Recall that assigning a scalar value to a DataFrame column broadcasts that value throughout.
* Create a new DataFrame called `combined_names` by appending the rows of `names_1981` underneath the rows of `names_1881`. Specify the keyword argument `ignore_index=True` to make a new RangeIndex of unique integers for each row.
* Print the shapes of all three DataFrames. 
* Extract all rows from `combined_names` that have the name `'Morgan'`. To do this, use the `.loc[]` accessor with an appropriate filter. The relevant column of `combined_names` here is `'name'`.

In [10]:
names_1981 = pd.read_csv('_datasets/names1981.csv', header=None, names=['name','gender','count'])
names_1881 = pd.read_csv('_datasets/names1881.csv', header=None, names=['name','gender','count'])

In [11]:
# Add 'year' column to names_1881 and names_1981
names_1881['year'] = 1881
names_1981['year'] = 1981

# Append names_1981 after names_1881 with ignore_index=True: combined_names
combined_names = names_1881.append(names_1981, ignore_index=True)

# Print shapes of names_1981, names_1881, and combined_names
print(names_1981.shape)
print(names_1881.shape)
print(combined_names.shape)

(19455, 4)
(1935, 4)
(21390, 4)


In [12]:
# Preview the first five rows of combined_names
combined_names.head()

        name gender  count  year
0       Mary      F   6919  1881
1       Anna      F   2698  1881
2       Emma      F   2034  1881
3  Elizabeth      F   1852  1881
4   Margaret      F   1658  1881


In [13]:
# Print all rows that contain the name 'Morgan'
combined_names.loc[combined_names['name'] == 'Morgan']

         name gender  count  year
1283   Morgan      M     23  1881
2096   Morgan      F   1769  1981
14390  Morgan      M    766  1981
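To see what `ignore_index=True` changes, it can help to compare the index with and without it. The sketch below uses `pd.concat()` (which accepts the same keyword) on two tiny DataFrames whose rows are taken from the outputs above; the reduced size is an assumption made to keep the example self-contained.

```python
import pandas as pd

# A few rows taken from the outputs above, standing in for the full files
names_1881 = pd.DataFrame(
    {'name': ['Mary', 'Anna', 'Morgan'],
     'gender': ['F', 'F', 'M'],
     'count': [6919, 2698, 23]}
)
names_1981 = pd.DataFrame(
    {'name': ['Morgan', 'Morgan'],
     'gender': ['F', 'M'],
     'count': [1769, 766]}
)

# Assigning a scalar broadcasts it to every row of the new column
names_1881['year'] = 1881
names_1981['year'] = 1981

# Without ignore_index, each DataFrame keeps its own 0-based labels
kept = pd.concat([names_1881, names_1981])
print(kept.index.tolist())

# With ignore_index=True, the result gets a fresh 0..n-1 index
combined_names = pd.concat([names_1881, names_1981], ignore_index=True)
print(combined_names.index.tolist())

# Boolean filter on the 'name' column
morgans = combined_names.loc[combined_names['name'] == 'Morgan']
print(morgans)
```

The `'year'` column is what preserves the origin of each row once the original index labels have been discarded.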


### 2.2 Concatenating pandas DataFrames along column axis
The function `pd.concat()` can concatenate DataFrames _horizontally_ as well as _vertically_ (vertical is the default). To make the DataFrames stack horizontally, you have to specify the keyword argument `axis=1` or `axis='columns'`.

In this exercise, you'll use weather data with maximum and mean daily temperatures sampled at different rates (quarterly versus monthly). You'll concatenate the two DataFrames column-wise, aligned on their row labels, and see that, where rows are missing in the coarser DataFrame, null values are inserted in the concatenated DataFrame. This corresponds to an _outer join_ (which you will explore in more detail in later exercises).

The files `'quarterly_max_temp.csv'` and `'monthly_mean_temp.csv'` have been pre-loaded into the DataFrames `weather_max` and `weather_mean` respectively.

### Instructions:
* Create `weather_list`, a list of the DataFrames `weather_max` and `weather_mean`.
* Create a new DataFrame called `weather` by concatenating `weather_list` _horizontally_.
    * Pass the list to `pd.concat()` and specify the keyword argument `axis=1` to stack them horizontally.
* Print the new DataFrame `weather`.

In [14]:
weather_max = pd.read_csv('_datasets/quarterly_max_temp.csv', index_col='Month')
weather_mean = pd.read_csv('_datasets/monthly_mean_temp.csv', index_col='Month')

In [15]:
# Create a list of weather_max and weather_mean
weather_list = [weather_max, weather_mean]

# Concatenate weather_max and weather_mean horizontally: weather
weather = pd.concat(weather_list, axis=1, sort=True)

# Print weather
weather

     Mean TemperatureF  Mean TemperatureF.1
Apr               89.0                 53.1
Aug                NaN                 70.0
Dec                NaN            34.935484
Feb                NaN            28.714286
Jan               68.0            32.354839
Jul               91.0            72.870968
Jun                NaN            70.133333
Mar                NaN                 35.0
May                NaN            62.612903
Nov                NaN                 39.8
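The null values come from the default `join='outer'` behavior, which keeps the union of the row labels. As a rough illustration, the sketch below contrasts it with `join='inner'`, which keeps only the labels present in every DataFrame. The two small DataFrames reuse a few values from the output above; the column name `'Max TemperatureF'` is an assumption chosen to tell the two columns apart.

```python
import pandas as pd

# Quarterly data: only some months have a value (column name is assumed)
weather_max = pd.DataFrame(
    {'Max TemperatureF': [68.0, 89.0]},
    index=pd.Index(['Jan', 'Apr'], name='Month'),
)
# Monthly data: every month has a value
weather_mean = pd.DataFrame(
    {'Mean TemperatureF': [32.354839, 28.714286, 35.0, 53.1]},
    index=pd.Index(['Jan', 'Feb', 'Mar', 'Apr'], name='Month'),
)

# Default join='outer': union of row labels, NaN where a label is missing
outer = pd.concat([weather_max, weather_mean], axis=1)
print(outer)

# join='inner': only the row labels shared by both DataFrames survive
inner = pd.concat([weather_max, weather_mean], axis=1, join='inner')
print(inner)
```

The outer result has one row per month with gaps in the quarterly column, while the inner result shrinks to the months that both sources cover.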


### 2.3 Reading multiple files to build a DataFrame
It is often convenient to build a large DataFrame by parsing many files as DataFrames and concatenating them all at once. You'll do this here with three files, but, in principle, this approach can be used to combine data from dozens or hundreds of files.

Here, you'll work with DataFrames compiled from [The Guardian's Olympic medal dataset](https://www.theguardian.com/sport/datablog/2012/jun/25/olympic-medal-winner-list-data).

`pandas` has been imported as `pd` and two lists have been pre-loaded: an empty list called `medals`, and `medal_types`, which contains the strings `'gold'`, `'silver'`, and `'bronze'`.

### Instructions:
* Iterate over `medal_types` in the `for` loop.
* Inside the `for` loop:
    * Create `file_name` using string interpolation with the loop variable `medal`. The expression `"%s_top5.csv" % medal` evaluates as a string with the _value_ of `medal` replacing `%s` in the format string.
    * Create the list of column names called `columns`. This has been done for you.
    * Read `file_name` into a DataFrame called `medal_df`. Specify the keyword arguments `header=0`, `index_col='Country'`, and `names=columns` to get the correct row and column Indexes.
    * Append `medal_df` to `medals` using the list `.append()` method.
* Concatenate the list of DataFrames `medals` horizontally (using `axis='columns'`) to create a single DataFrame called `medals`. Print it in its entirety.

In [16]:
# Create a file path
path = '_datasets/Summer_Olympics/'

# Create a list of medal_types
medal_types = ['gold', 'silver', 'bronze']

# Create an empty list to hold the DataFrames: medals
medals = []

for medal in medal_types:

    # Create the file name: file_name
    file_name = "%s_top5.csv" % medal
    
    # Create list of column names: columns
    columns = ['Country', medal]
    
    # Read file_name into a DataFrame: medal_df
    medal_df = pd.read_csv(path+file_name, header=0, index_col='Country', names=columns)

    # Append medal_df to medals
    medals.append(medal_df)

# Concatenate medals horizontally: medals
medals = pd.concat(medals, axis='columns', sort=True)

# Print medals
print(medals)

                  gold  silver  bronze
France             NaN   461.0   475.0
Germany          407.0     NaN   454.0
Italy            460.0   394.0     NaN
Soviet Union     838.0   627.0   584.0
United Kingdom   498.0   591.0   505.0
United States   2088.0  1195.0  1052.0
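The same read-then-concatenate pattern can be tried without the files on disk. In the sketch below, a hypothetical dictionary `csv_files` maps each file name to in-memory CSV text wrapped in `io.StringIO`; the numbers come from the outputs above, while the `Country,Total` header row is an assumption about the file layout. An f-string is used in place of `%` formatting, which produces the same file names.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for the three CSV files. The header row
# is assumed; the values are taken from the outputs shown above.
csv_files = {
    'gold_top5.csv': (
        "Country,Total\nUnited States,2088\nSoviet Union,838\n"
        "United Kingdom,498\nItaly,460\nGermany,407\n"
    ),
    'silver_top5.csv': (
        "Country,Total\nUnited States,1195\nSoviet Union,627\n"
        "United Kingdom,591\nFrance,461\nItaly,394\n"
    ),
    'bronze_top5.csv': (
        "Country,Total\nUnited States,1052\nSoviet Union,584\n"
        "United Kingdom,505\nFrance,475\nGermany,454\n"
    ),
}

medal_types = ['gold', 'silver', 'bronze']
medals = []

for medal in medal_types:
    # Same result as "%s_top5.csv" % medal
    file_name = f"{medal}_top5.csv"

    # header=0 skips the original header row; names= supplies new labels
    medal_df = pd.read_csv(
        io.StringIO(csv_files[file_name]),
        header=0,
        index_col='Country',
        names=['Country', medal],
    )
    medals.append(medal_df)

# Align on Country and place the three columns side by side
medals = pd.concat(medals, axis='columns', sort=True)
print(medals)
```

Countries that are missing from one of the top-5 lists end up with `NaN` in that column, and the integer counts become floats as a result.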
