## Overview

Data sets come in all shapes and sizes, and sometimes we need to combine them in various ways. This kind of data wrangling task is well suited to pandas, which offers several tools for combining DataFrames, ranging from very simple to quite complex. We'll begin with one of the easiest: concatenation using the pandas function `pd.concat()`.

Concatenation means to join two things together. For example, with string concatenation, we join two or more strings end-to-end. We'll see that for a structure like a DataFrame, "end" has a slightly different meaning. 
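As a quick illustration of end-to-end joining, here is a minimal sketch using plain Python strings. The variable names and values are made up for this example.

```python
# Hypothetical example values, not taken from any dataset
first = "data"
second = "frame"

# The + operator joins two strings end-to-end
joined = first + second

# str.join concatenates any number of strings from an iterable
many = "".join(["pd", ".", "concat"])

print(joined)
print(many)
```

Strings have only one direction in which to grow, which is why "end" is unambiguous here.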

Fun fact: the Unix terminal command `cat` is short for concatenate and is a way to both display files and to join them together.

### Introduction to pd.concat()

To become more familiar with this method, let's look at the default parameters of `pd.concat()`.

In [1]:
# Import libraries
import pandas as pd
import numpy as np

# Display the arguments for pd.concat()
pd.concat

<function pandas.core.reshape.concat.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=None, copy=True)>

We can see that there are a number of parameters to use with this method. To start, our use cases will focus on the `axis` and `join` parameters.
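The exact parameter list depends on the installed pandas version (for instance, `join_axes` in the output above is not present in recent releases). One way to check the defaults on your own machine is sketched below, using the standard-library `inspect` module; it only looks at the two parameters this lesson focuses on.

```python
import inspect

import pandas as pd

# Look up the parameters of pd.concat in the installed pandas version
params = inspect.signature(pd.concat).parameters

# Print the defaults for the two parameters this lesson focuses on
for name in ("axis", "join"):
    print(name, "=", repr(params[name].default))
```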

## Follow Along

To demonstrate how to combine DataFrames we need to create a few to practice with.

In [2]:
# Make DataFrames
df1 = pd.DataFrame(np.random.rand(3,2),
                  columns = ['cat', 'dog'])
df2 = pd.DataFrame(np.random.rand(3,2),
                  columns = ['cat', 'dog'])

# Print out the DataFrames
print(df1)
print(df2)

        cat       dog
0  0.693574  0.094608
1  0.045537  0.449816
2  0.077305  0.590424
        cat       dog
0  0.685117  0.628959
1  0.079331  0.955059
2  0.530680  0.949267


### Using `pd.concat`

We'll combine `df1` and `df2` along the row axis (`axis=0`), which is the default setting for `pd.concat()`.

In [3]:
# Concatenate along the row axis
pd.concat([df1, df2])

        cat       dog
0  0.693574  0.094608
1  0.045537  0.449816
2  0.077305  0.590424
0  0.685117  0.628959
1  0.079331  0.955059
2  0.530680  0.949267


We can see that the concatenation preserves each DataFrame's original index, so some row labels now repeat. One way to avoid this is to set the parameter `ignore_index=True`.
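A minimal sketch of the difference is shown here. It recreates two small DataFrames like `df1` and `df2` with fresh random values, so the numbers will not match the output above; the names `kept` and `renumbered` are just illustrative.

```python
import numpy as np
import pandas as pd

# Recreate two small DataFrames like df1 and df2 (values are random)
df1 = pd.DataFrame(np.random.rand(3, 2), columns=["cat", "dog"])
df2 = pd.DataFrame(np.random.rand(3, 2), columns=["cat", "dog"])

# Default behaviour keeps each frame's original row labels
kept = pd.concat([df1, df2])

# ignore_index=True discards the old labels and renumbers the rows from 0
renumbered = pd.concat([df1, df2], ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```

Renumbering is usually what we want when the original row labels carry no meaning, as with the default integer index used here.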

Now say we have another DataFrame that we would like to place side by side with an existing one, that is, concatenate along the *column axis*. We need to set the argument `axis=1` to do this. To test this out, we'll create a DataFrame with different column names, combine it with `df1`, and view the results.

In [4]:
# Create the new DataFrame
df3 = pd.DataFrame(np.random.rand(3,2),
                  columns = ['bird', 'horse'])

# Concatenate it with one of the previous examples
pd.concat([df1, df3], axis=1)

        cat       dog      bird     horse
0  0.693574  0.094608  0.818225  0.971892
1  0.045537  0.449816  0.306902  0.428235
2  0.077305  0.590424  0.929212  0.186076
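The rows lined up neatly here because both DataFrames share the index labels 0, 1, and 2. With `axis=1`, pandas matches rows by index label rather than by position, which the sketch below tries to make visible. The frames `left` and `right` are hypothetical and exist only for this illustration; the `join` argument it uses is discussed further down.

```python
import pandas as pd

# Hypothetical frames whose row labels only partly overlap
left = pd.DataFrame({"cat": [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({"bird": [10, 20, 30]}, index=[1, 2, 3])

# With axis=1, rows are matched by index label; the default
# join='outer' keeps every label and fills the gaps with NaN
outer = pd.concat([left, right], axis=1)

# join='inner' keeps only the labels present in both frames
inner = pd.concat([left, right], axis=1, join="inner")

print(outer)
print(inner)
```

If the goal is to glue frames together purely by position, one option is to reset the index on each frame before concatenating.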


In the row-wise example above, `df1` and `df2` have the same column names, so they were easy to combine. What do we do if the DataFrames share only *some* of their column names but we still want to combine them? There are additional arguments that we can pass to `pd.concat`.

Using the `join` argument, we can control how two DataFrames with partially overlapping columns are combined. The default value of `join` is `'outer'`, which keeps the union of the columns from all of the DataFrames. Let's create a few more DataFrames to practice with.

In [5]:
# Create the DataFrames

data4 = {'Alpaca':['A1', 'A2', 'A3'], 'Bird':['B1', 'B2', 'B3'],
        'Camel':['C1', 'C2', 'C3']} 
df4 = pd.DataFrame(data4)
print(df4)

data5 = {'Bird':['B3', 'B4', 'B5'], 'Camel':['C3', 'C4', 'C5'],
        'Duck':['D3', 'D4', 'D5']} 
df5 = pd.DataFrame(data5, index=[3,4,5])
print(df5)

# Concatenate with default join='outer'
# (sort=False suppresses the FutureWarning raised by older pandas
# versions, which said the non-concatenation axis, here the columns,
# would stop being sorted by default in a future release)
print(pd.concat([df4, df5], sort=False))

  Alpaca Bird Camel
0     A1   B1    C1
1     A2   B2    C2
2     A3   B3    C3
  Bird Camel Duck
3   B3    C3   D3
4   B4    C4   D4
5   B5    C5   D5
  Alpaca Bird Camel Duck
0     A1   B1    C1  NaN
1     A2   B2    C2  NaN
2     A3   B3    C3  NaN
3    NaN   B3    C3   D3
4    NaN   B4    C4   D4
5    NaN   B5    C5   D5


We can see in the above example that the cells where no data is available now hold `NaN` values. To keep only the columns that the two DataFrames have in common, use the `join='inner'` argument.

In [6]:
# Concatenate with join='inner' (not the default)
print(pd.concat([df4, df5], join='inner'))

  Bird Camel
0   B1    C1
1   B2    C2
2   B3    C3
3   B3    C3
4   B4    C4
5   B5    C5


Now the only columns that remain are the ones that the two DataFrames had in common.
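One way to confirm this relationship is to compare the result columns with the intersection and union of the input columns. The sketch below rebuilds `df4` and `df5` from the example and uses the `Index.intersection` and `Index.union` methods; the names `shared` and `either` are illustrative.

```python
import pandas as pd

# Rebuild df4 and df5 from the example above
df4 = pd.DataFrame({"Alpaca": ["A1", "A2", "A3"],
                    "Bird": ["B1", "B2", "B3"],
                    "Camel": ["C1", "C2", "C3"]})
df5 = pd.DataFrame({"Bird": ["B3", "B4", "B5"],
                    "Camel": ["C3", "C4", "C5"],
                    "Duck": ["D3", "D4", "D5"]}, index=[3, 4, 5])

# Columns shared by both frames, and columns appearing in either frame
shared = df4.columns.intersection(df5.columns)
either = df4.columns.union(df5.columns)

inner = pd.concat([df4, df5], join="inner")
outer = pd.concat([df4, df5], join="outer", sort=False)

# Compare as sets so that column order does not matter
print(set(inner.columns) == set(shared))
print(set(outer.columns) == set(either))
```

Note that `join` affects only which columns survive; in both cases all six rows are kept.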

## Challenge

Using the same process as above, practice concatenating some DataFrames that you create. You should follow these general steps:

* Create DataFrames from either a NumPy array or a dictionary; aim for about 10 rows each
* Concatenate along either the rows or columns, depending on the column names you created; experiment!
* Use `pd.concat` with the `join` argument, trying both `'outer'` and `'inner'`

## Additional Resources

* [pandas concat](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html)
* [Combining Datasets: Concat and Append](https://jakevdp.github.io/PythonDataScienceHandbook/03.06-concat-and-append.html)