# Combining Data with Pandas
In the last mission, we worked with just one data set, the 2015 World Happiness Report, to explore data aggregation. However, it's very common in practice to work with more than one data set at a time.

Often, you'll need additional data to perform an analysis, or you'll have the data but need to pull it from multiple sources. In this mission, we'll learn a couple of techniques for combining data with pandas to handle situations like these.

We'll use what we learned in the last mission to analyze the 2015, 2016, and 2017 World Happiness Reports. Specifically, we'll look to answer the following question:

Did world happiness increase, decrease, or stay about the same from 2015 to 2017?

As a reminder, these reports assign each country a happiness score based on a poll question that asks respondents to rate their life on a scale of 0 to 10, so "world happiness" refers to this definition specifically.

Below are descriptions for some of the columns:

- Country - Name of the country
- Region - Name of the region the country belongs to
- Happiness Rank - The rank of the country, as determined by its happiness score
- Happiness Score - A score assigned to each country based on the answers to a poll question that asks respondents to rate their happiness on a scale of 0-10

Let's start by reading the 2015, 2016, and 2017 reports into pandas dataframes and adding a Year column to each to make it easier to distinguish between them.

### Instructions

We've already read the World_Happiness_2015.csv file into a dataframe called happiness2015.

- Use the pandas.read_csv() function to read the World_Happiness_2016.csv file into a dataframe called happiness2016 and the World_Happiness_2017.csv file into a dataframe called happiness2017.
- Add a column called Year to each dataframe with the corresponding year. For example, the Year column in happiness2015 should contain the value 2015 for each row.

In [2]:
import pandas as pd

# Read each year's report into its own dataframe.
happiness2015 = pd.read_csv("World_Happiness_2015.csv")
happiness2016 = pd.read_csv("World_Happiness_2016.csv")
happiness2017 = pd.read_csv("World_Happiness_2017.csv")
# Tag every row with its report year so rows stay distinguishable once combined.
happiness2015['Year'] = 2015
happiness2016['Year'] = 2016
happiness2017['Year'] = 2017


Let's start by exploring the *pd.concat()* function. The concat() function combines dataframes in one of two ways:

- Stacked: Axis = 0 (This is the default option.)
![concat_updated](Concat_Updated.svg)

- Side by Side: Axis = 1
![concat_axis](Concat_Axis1.svg)

Since concat is a function, not a method, we use the syntax below:

![concat_syntax](Concat_syntax.svg)
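
As a rough sketch of that syntax, the snippet below calls concat() on the pandas module and passes the dataframes as a list. The names left and right and their tiny contents are assumptions made up for illustration; they are not part of the happiness data sets.

```python
import pandas as pd

# Two small illustrative dataframes (hypothetical stand-ins, not the real reports).
left = pd.DataFrame({"Country": ["Switzerland", "Iceland"], "Year": [2015, 2015]})
right = pd.DataFrame({"Country": ["Denmark", "Switzerland"], "Year": [2016, 2016]})

# concat() is called on the pandas module; the dataframes go in as a list.
stacked = pd.concat([left, right])               # axis=0 is the default
side_by_side = pd.concat([left, right], axis=1)  # axis=1 places them side by side

print(stacked.shape)
print(side_by_side.shape)
```

Stacking adds rows while keeping the same columns, whereas combining side by side adds columns while keeping the same rows.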

In the next exercise, we'll use the concat() function to combine subsets of happiness2015 and happiness2016, then review the results.

Below are the subsets we'll be working with:

In [4]:
head_2015 = happiness2015[['Country','Happiness Score', 'Year']].head(3)
head_2015

|   | Country | Happiness Score | Year |
|---|---|---|---|
| 0 | Switzerland | 7.587 | 2015 |
| 1 | Iceland | 7.561 | 2015 |
| 2 | Denmark | 7.527 | 2015 |


Let's use the concat() function to combine head_2015 and head_2016 next.

### Instructions

We've already saved the subsets from happiness2015 and happiness2016 to the variables head_2015 and head_2016.

- Use the pd.concat() function to combine head_2015 and head_2016 along axis = 0. Remember to pass head_2015 and head_2016 into the function as a list. Assign the result to concat_axis0.
- Use the pd.concat() function to combine head_2015 and head_2016 along axis = 1. Remember to pass head_2015 and head_2016 into the function as a list and set the axis parameter equal to 1. Assign the result to concat_axis1.
- Use the variable inspector to view concat_axis0 and concat_axis1.
- Assign the number of rows in concat_axis0 to a variable called question1.
- Assign the number of rows in concat_axis1 to a variable called question2.

In [5]:
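# Subset of the 2016 report with the same columns as head_2015.
head_2016 = happiness2016[['Country','Happiness Score', 'Year']].head(3)
# Stack the rows (axis=0), then place the dataframes side by side (axis=1).
concat_axis0 = pd.concat([head_2015,head_2016], axis=0)
concat_axis1 = pd.concat([head_2015,head_2016], axis=1)
# shape[0] is the number of rows in each result.
question1 = concat_axis0.shape[0]
question2 = concat_axis1.shape[0]
display(concat_axis0)
display(concat_axis1)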

|   | Country | Happiness Score | Year |
|---|---|---|---|
| 0 | Switzerland | 7.587 | 2015 |
| 1 | Iceland | 7.561 | 2015 |
| 2 | Denmark | 7.527 | 2015 |
| 0 | Denmark | 7.526 | 2016 |
| 1 | Switzerland | 7.509 | 2016 |
| 2 | Iceland | 7.501 | 2016 |


|   | Country | Happiness Score | Year | Country | Happiness Score | Year |
|---|---|---|---|---|---|---|
| 0 | Switzerland | 7.587 | 2015 | Denmark | 7.526 | 2016 |
| 1 | Iceland | 7.561 | 2015 | Switzerland | 7.509 | 2016 |
| 2 | Denmark | 7.527 | 2015 | Iceland | 7.501 | 2016 |


When you reviewed the results from the last exercise, you probably noticed that we merely pushed the dataframes together vertically or horizontally - none of the values, column names, or indexes changed. For this reason, when you use the concat() function to combine dataframes with the same shape and index, you can think of the function as "gluing" dataframes together.

![glue](Glue.svg)
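
The "gluing" behavior can be checked directly. The sketch below rebuilds the two three-row subsets by hand from the values shown in the outputs above (an assumption made so the snippet runs without the CSV files) and inspects the index and column labels of each result.

```python
import pandas as pd

# Hand-built copies of the subsets, using the values from the outputs above.
head_2015 = pd.DataFrame({
    "Country": ["Switzerland", "Iceland", "Denmark"],
    "Happiness Score": [7.587, 7.561, 7.527],
    "Year": [2015, 2015, 2015],
})
head_2016 = pd.DataFrame({
    "Country": ["Denmark", "Switzerland", "Iceland"],
    "Happiness Score": [7.526, 7.509, 7.501],
    "Year": [2016, 2016, 2016],
})

concat_axis0 = pd.concat([head_2015, head_2016], axis=0)
concat_axis1 = pd.concat([head_2015, head_2016], axis=1)

# Stacking keeps each dataframe's original row labels, so they repeat.
index_axis0 = concat_axis0.index.tolist()
# Side by side keeps each dataframe's original column names, so they repeat.
columns_axis1 = concat_axis1.columns.tolist()

print(index_axis0)
print(columns_axis1)
```

In both cases the labels are carried over untouched, which is why the row labels repeat in the stacked result and the column names repeat in the side-by-side result.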

However, what happens if the dataframes have different shapes or columns? In the next exercise, let's confirm how the concat() function behaves when we combine dataframes that don't have the same shape. We'll work with the following subsets:

In [7]:
head_2015 = happiness2015[['Year','Country','Happiness Score', 'Standard Error']].head(4)
head_2016 = happiness2016[['Country','Happiness Score', 'Year']].head(3)
display(head_2015.head(1))
head_2016.head(1)

|   | Year | Country | Happiness Score | Standard Error |
|---|---|---|---|---|
| 0 | 2015 | Switzerland | 7.587 | 0.03411 |


|   | Country | Happiness Score | Year |
|---|---|---|---|
| 0 | Denmark | 7.526 | 2016 |


### Instructions

- Use the pd.concat() function to combine head_2015 and head_2016 along axis = 0. Remember to pass head_2015 and head_2016 into the function as a list. Assign the result to concat_axis0.
- Use the variable inspector to view concat_axis0.
    - Assign the number of rows in concat_axis0 to a variable called rows.
    - Assign the number of columns in concat_axis0 to a variable called columns.

In [12]:
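# sort=True sorts the non-concatenation axis (here, the column names).
concat_axis0 = pd.concat([head_2015,head_2016], axis=0, sort=True)
# shape is a (rows, columns) tuple.
rows, columns = concat_axis0.shape
display(concat_axis0, rows, columns)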

|   | Country | Happiness Score | Standard Error | Year |
|---|---|---|---|---|
| 0 | Switzerland | 7.587 | 0.03411 | 2015 |
| 1 | Iceland | 7.561 | 0.04884 | 2015 |
| 2 | Denmark | 7.527 | 0.03328 | 2015 |
| 3 | Norway | 7.522 | 0.0388 | 2015 |
| 0 | Denmark | 7.526 | NaN | 2016 |
| 1 | Switzerland | 7.509 | NaN | 2016 |
| 2 | Iceland | 7.501 | NaN | 2016 |


7

4

![gluing](Concat_DifShapes.svg)

Note that because the Standard Error column didn't exist in head_2016, NaN values were created to signify that those values are missing. By default, the concat() function keeps ALL of the data, even if that means creating missing values.
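
For contrast, concat() also accepts a join parameter. The mission itself doesn't use it, so treat the following as a side sketch: the default, join="outer", keeps every column and fills the gaps with NaN, while join="inner" keeps only the columns the dataframes share. The two-row dataframes are shortened, hand-built copies of the subsets above.

```python
import pandas as pd

# Shortened, hand-built copies of the subsets (values taken from the outputs above).
head_2015 = pd.DataFrame({
    "Year": [2015, 2015],
    "Country": ["Switzerland", "Iceland"],
    "Happiness Score": [7.587, 7.561],
    "Standard Error": [0.03411, 0.04884],
})
head_2016 = pd.DataFrame({
    "Country": ["Denmark", "Switzerland"],
    "Happiness Score": [7.526, 7.509],
    "Year": [2016, 2016],
})

# Default behavior (join="outer"): keep every column, fill the gaps with NaN.
outer = pd.concat([head_2015, head_2016])
# join="inner": keep only the columns present in both dataframes.
inner = pd.concat([head_2015, head_2016], join="inner")

missing_count = int(outer["Standard Error"].isna().sum())
inner_columns = inner.columns.tolist()

print(missing_count)
print(inner_columns)
```

The inner join avoids the NaN values only by discarding the Standard Error column entirely, so the default is the safer choice when no data should be lost.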

Also, notice again the indexes of the original dataframes didn't change. If the indexes aren't meaningful, it can be better to reset them. This is especially true when we create duplicate indexes, because they could cause errors as we perform other data cleaning tasks.
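
To see why duplicate index labels can be a problem, here is a minimal sketch with hand-built two-row dataframes (assumed stand-ins for the real subsets). Selecting label 0 with .loc returns two rows instead of one, which can surprise any later step that expects a single row.

```python
import pandas as pd

# Hand-built stand-ins for the two subsets (values taken from the outputs above).
head_2015 = pd.DataFrame({
    "Country": ["Switzerland", "Iceland"],
    "Happiness Score": [7.587, 7.561],
    "Year": [2015, 2015],
})
head_2016 = pd.DataFrame({
    "Country": ["Denmark", "Switzerland"],
    "Happiness Score": [7.526, 7.509],
    "Year": [2016, 2016],
})

stacked = pd.concat([head_2015, head_2016])

# The row labels 0 and 1 each appear twice, so the index is no longer unique.
is_unique = stacked.index.is_unique

# Selecting label 0 now returns a dataframe with one row per duplicate.
rows_for_label_0 = stacked.loc[0]

print(is_unique)
print(rows_for_label_0)
```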

Luckily, the concat function has a parameter, ignore_index, that can be used to clear the existing index and reset it in the result. Let's practice using it next.

### Instructions

- Use the pd.concat() function to combine head_2015 and head_2016 along axis = 0 again. This time, however, set the ignore_index parameter to True to reset the index in the result. Assign the result to concat_update_index.
- Use the variable inspector to view the results and confirm the index was reset.

In [11]:
# ignore_index=True discards the original row labels and numbers the result from 0.
concat_update_index = pd.concat([head_2015,head_2016], axis=0, ignore_index=True, sort=True)
concat_update_index

|   | Country | Happiness Score | Standard Error | Year |
|---|---|---|---|---|
| 0 | Switzerland | 7.587 | 0.03411 | 2015 |
| 1 | Iceland | 7.561 | 0.04884 | 2015 |
| 2 | Denmark | 7.527 | 0.03328 | 2015 |
| 3 | Norway | 7.522 | 0.0388 | 2015 |
| 4 | Denmark | 7.526 | NaN | 2016 |
| 5 | Switzerland | 7.509 | NaN | 2016 |
| 6 | Iceland | 7.501 | NaN | 2016 |
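
As a side note, the same renumbering can be done after the fact with the DataFrame.reset_index() method. The mission doesn't use this approach, so the sketch below is only a comparison; it uses shortened, hand-built copies of the subsets and checks that both routes produce the same dataframe.

```python
import pandas as pd

# Shortened, hand-built copies of the subsets (values taken from the outputs above).
head_2015 = pd.DataFrame({
    "Year": [2015, 2015],
    "Country": ["Switzerland", "Iceland"],
    "Happiness Score": [7.587, 7.561],
    "Standard Error": [0.03411, 0.04884],
})
head_2016 = pd.DataFrame({
    "Country": ["Denmark", "Switzerland"],
    "Happiness Score": [7.526, 7.509],
    "Year": [2016, 2016],
})

# Route 1: let concat() discard the old labels while combining.
combined_ignore = pd.concat([head_2015, head_2016], ignore_index=True)

# Route 2: combine first, then drop the old labels and renumber from 0.
combined_reset = pd.concat([head_2015, head_2016]).reset_index(drop=True)

same_result = combined_ignore.equals(combined_reset)
new_index = combined_ignore.index.tolist()

print(same_result)
print(new_index)
```

Passing ignore_index=True is the shorter option when the original labels carry no meaning.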
