# Lesson 5: Merging DataFrames with pandas

[View this lesson on datacamp](https://learn.datacamp.com/courses/merging-dataframes-with-pandas)

## Chapter 1: Preparing Data
To read a single file into a dataframe, use the following format:

`variable = pd.read_csv('filename.csv')`

It is also possible to use a for loop to read in multiple files:

First, create a list of the file names:

`filenames = ['filename1.csv', 'filename2.csv', 'filename3.csv']`

Next, create an empty list, then loop over the file names and append each dataframe to it as it is read:

`dataframes = []`

`for filename in filenames: dataframes.append(pd.read_csv(filename))`
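
The loop above can also be written as a list comprehension. The sketch below is only an illustration: it first writes three tiny CSV files (with made-up names and contents) to a temporary directory so that it can run on its own, then reads them back into a list of dataframes.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Create three small CSV files so the example is runnable anywhere.
# The file names and contents are made up for illustration.
tmp_dir = Path(tempfile.mkdtemp())
for i in range(1, 4):
    pd.DataFrame({'id': [i], 'value': [i * 10]}).to_csv(
        tmp_dir / f'filename{i}.csv', index=False
    )

# Collect the file names, sorted so the order is predictable
filenames = sorted(tmp_dir.glob('filename*.csv'))

# Same idea as the for loop, written as a list comprehension
dataframes = [pd.read_csv(filename) for filename in filenames]

print(len(dataframes))
print(dataframes[0])
```

Each element of `dataframes` is an ordinary dataframe, so the list can be passed straight to `pd.concat()`.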

|Method                    |Function               |
|--------------------------|-----------------------|
|`.sort_index()`           |sorts by index labels; ascending by default, `ascending=False` reverses the order|
|`.sort_values()`          |sorts by the values of a column; ascending by default|
|`.reindex()`              |conforms rows to a new set of index labels; labels not in the original index get NaN|
|`.ffill()`                |forward fill; replaces NaN with the last preceding non-null value|
|`.str.replace('', '')`    |replaces occurrences of the first string with the second in every element of a string column|
|`.multiply()`             |element-wise multiplication; a scalar or series is broadcast along the axis given by `axis=`|
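
A minimal sketch of the methods in the table, using a small made-up dataframe (the city names, temperatures, and the `factors` series are all hypothetical):

```python
import pandas as pd

# Hypothetical data for illustration
temps = pd.DataFrame(
    {'city': ['Oslo_NO', 'Rome_IT', 'Cairo_EG'], 'temp_c': [4.0, 18.0, 27.0]},
    index=['c', 'a', 'b'],
)

by_index = temps.sort_index()                             # rows ordered a, b, c
by_value = temps.sort_values('temp_c', ascending=False)   # warmest first

# 'd' is not in the original index, so its row is filled with NaN
reindexed = temps.reindex(['a', 'b', 'c', 'd'])
filled = reindexed.ffill()                                # 'd' copies the row above it

# Replace '_' with a space in every element of the string column
cities = temps['city'].str.replace('_', ' ')

# Multiply each row by the factor with the matching index label
factors = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
scaled = temps[['temp_c']].multiply(factors, axis='rows')

print(filled)
print(scaled)
```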

## Chapter 2: Concatenating Data

### Using `.append()` and `pd.concat()` for series

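The `.append()` method is used for both series and dataframes to stack rows on top of each other. For example, appending the 2 pandas series below results in 1 series in which the elements of 'blue' are stacked underneath the elements of 'pink'. Note that `.append()` was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions `pd.concat()` must be used instead.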

In [1]:
import pandas as pd

pink = pd.Series(['rose', 'fuchsia', 'ruby', 'magenta'])
blue = pd.Series(['turquoise', 'sky blue', 'navy', 'ocean blue'])
print(pink.append(blue))

0          rose
1       fuchsia
2          ruby
3       magenta
0     turquoise
1      sky blue
2          navy
3    ocean blue
dtype: object


One thing to note here is that the `.append()` method doesn't adjust the index of the series when it stacks them, so the labels 0 to 3 appear twice. The index can be renumbered using the `.reset_index()` method. Setting the argument `drop=True` discards the old index instead of keeping it as a new column.

In [2]:
print(pink.append(blue).reset_index(drop=True))

0          rose
1       fuchsia
2          ruby
3       magenta
4     turquoise
5      sky blue
6          navy
7    ocean blue
dtype: object


Using `pd.concat()` can produce the same output as above, but in this case, the input is a list of series or dataframes. The argument `ignore_index=True` performs the same function as `reset_index(drop=True)` above.

In [3]:
#pass the 2 series as a list into pd.concat
print(pd.concat([pink, blue], ignore_index=True))

0          rose
1       fuchsia
2          ruby
3       magenta
4     turquoise
5      sky blue
6          navy
7    ocean blue
dtype: object


### Using `.append()` and `pd.concat()` for dataframes
For pandas dataframes, the append method stacks the rows, just like for pandas series. Here we will use 2 small csv files, one containing each participant's favourite colour, and the other their birthday month.

In [16]:
#we begin by reading the csv files into dataframes, and making 'Participant num' the index column
fav_colour = pd.read_csv('fav_colour.csv', index_col='Participant num')
birthday_month = pd.read_csv('birthday_months.csv', index_col='Participant num')

#append birthday_month to fav_colour
print(fav_colour.append(birthday_month))

                Fav Colour Birthday Month
Participant num                          
1                     blue            NaN
2                      red            NaN
3                    green            NaN
4                  purple             NaN
5                      red            NaN
6                    green            NaN
7                   orange            NaN
8                  yellow             NaN
9                  yellow             NaN
10                    pink            NaN
1                      NaN           may 
2                      NaN          june 
3                      NaN        january
4                      NaN      february 
5                      NaN     september 
6                      NaN          july 
7                      NaN           may 
8                      NaN           may 
9                      NaN         august
10                     NaN       december


If the two dataframes have different columns, such as in the example above, the places in columns where there is no corresponding value will be filled with NaN.

With its default arguments, `pd.concat()` produces the same output as `.append()` for dataframes. Unlike `.append()`, it takes a list of any number of dataframes and can combine them along either rows or columns.

In [25]:
print(pd.concat([fav_colour, birthday_month]))

                Fav Colour Birthday Month
Participant num                          
1                     blue            NaN
2                      red            NaN
3                    green            NaN
4                  purple             NaN
5                      red            NaN
6                    green            NaN
7                   orange            NaN
8                  yellow             NaN
9                  yellow             NaN
10                    pink            NaN
1                      NaN           may 
2                      NaN          june 
3                      NaN        january
4                      NaN      february 
5                      NaN     september 
6                      NaN          july 
7                      NaN           may 
8                      NaN           may 
9                      NaN         august
10                     NaN       december


By default, dataframes are concatenated vertically, but `pd.concat()` can concatenate horizontally as well. To do this, specify either `axis=1` or `axis='columns'`. Note in the example below that rows are aligned on their index labels, so rows with identical indices are combined into a single row.

In [15]:
print(pd.concat([fav_colour, birthday_month], axis=1))

                Fav Colour Birthday Month
Participant num                          
1                     blue           may 
2                      red          june 
3                    green        january
4                  purple       february 
5                      red     september 
6                    green          july 
7                   orange           may 
8                  yellow            may 
9                  yellow          august
10                    pink       december
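
The csv files used above are not included here, so the following sketch rebuilds shortened stand-ins by hand (only the first 3 participants) to show vertical and horizontal concatenation side by side:

```python
import pandas as pd

# Hand-built stand-ins for the two csv files (first 3 participants only)
index = pd.Index([1, 2, 3], name='Participant num')
fav_colour = pd.DataFrame({'Fav Colour': ['blue', 'red', 'green']}, index=index)
birthday_month = pd.DataFrame(
    {'Birthday Month': ['may', 'june', 'january']}, index=index
)

# Vertical (default, axis=0): rows are stacked, missing columns become NaN
stacked = pd.concat([fav_colour, birthday_month])

# Horizontal: rows with the same index label are placed side by side
side_by_side = pd.concat([fav_colour, birthday_month], axis='columns')

print(stacked.shape)
print(side_by_side.shape)
```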


### Multi-indexes
Passing the argument `keys=` to `pd.concat()` assigns an outer index label to each of the specified dataframes, which produces a multi-indexed dataframe. `pd.IndexSlice` then allows you to slice along a specified level of that multi-index inside `.loc[]`.
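
As a rough sketch of how the two fit together, the example below concatenates two made-up quarterly sales tables (hypothetical names and numbers) with `keys=`, then uses `pd.IndexSlice` to pull one quarter out of every year:

```python
import pandas as pd

# Hypothetical quarterly sales for two years
sales_2019 = pd.DataFrame({'units': [10, 20, 30]}, index=['Q1', 'Q2', 'Q3'])
sales_2020 = pd.DataFrame({'units': [15, 25, 35]}, index=['Q1', 'Q2', 'Q3'])

# keys= adds an outer index level that records which dataframe each row came from
sales = pd.concat([sales_2019, sales_2020], keys=[2019, 2020])

# Select 'Q2' from every year: all labels on the outer level, 'Q2' on the inner
idx = pd.IndexSlice
q2 = sales.loc[idx[:, 'Q2'], :]

print(sales)
print(q2)
```

Selecting a single outer label, for example `sales.loc[2020]`, returns the rows that came from the second dataframe.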

### Outer and Inner Joins 
An outer join is performed by using the keyword argument `join='outer'` (the default) in the call to `pd.concat()`. It preserves every index label from each dataframe and fills the gaps with NaN. Specifying `join='inner'` will perform an inner join, which preserves only the index labels common to all of the dataframes.
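
A small sketch of the difference, using two made-up dataframes whose indices only partly overlap:

```python
import pandas as pd

# Hypothetical dataframes: labels 'y' and 'z' are shared, 'x' and 'w' are not
left = pd.DataFrame({'a': [1, 2, 3]}, index=['x', 'y', 'z'])
right = pd.DataFrame({'b': [10, 20, 30]}, index=['y', 'z', 'w'])

# Outer join: every label from both indices is kept, gaps become NaN
outer = pd.concat([left, right], axis=1, join='outer')

# Inner join: only labels present in both indices are kept
inner = pd.concat([left, right], axis=1, join='inner')

print(outer)
print(inner)
```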

## Chapter 3: Merging Data


### pd.merge()
`pd.merge(df1, df2)` allows you to combine 2 dataframes by matching rows on the values of one or more key columns. By default it performs an inner join on all columns whose names appear in both dataframes.
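
To illustrate the default behaviour, here is a minimal sketch with two made-up dataframes that share only the column 'customer':

```python
import pandas as pd

# Hypothetical data: 'customer' is the only column name the two share
orders = pd.DataFrame(
    {'customer': ['ann', 'bob', 'cat'], 'item': ['pen', 'ink', 'pad']}
)
cities = pd.DataFrame(
    {'customer': ['ann', 'bob', 'dan'], 'city': ['Leeds', 'York', 'Bath']}
)

# No on= given: merge on the shared column, keeping only matching rows
merged = pd.merge(orders, cities)
print(merged)
```

Only 'ann' and 'bob' appear in both dataframes, so the result has 2 rows.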

### Merging 'on'
The keyword argument `on=` specifies which column(s) you would like to merge on. Using the example from the previous chapter, merging 'fav_colour' and 'birthday_month' on 'Participant num' creates a dataframe with one row per participant that holds both their favourite colour and their birthday month. One thing to note here: the label specified for `on=` must be present in both dataframes. In this example 'Participant num' is the named index of each dataframe, which `on=` also accepts.

In [24]:
merged = pd.merge(fav_colour, birthday_month, on='Participant num')
print(merged)

                Fav Colour Birthday Month
Participant num                          
1                     blue           may 
2                      red          june 
3                    green        january
4                  purple       february 
5                      red     september 
6                    green          july 
7                   orange           may 
8                  yellow            may 
9                  yellow          august
10                    pink       december


### More pd.merge() arguments

|Argument                   |Function             |
|---------------------------|---------------------|
|`left_on=`                 |column(s) of the left dataframe to merge on; left refers to the first dataframe in the call to `pd.merge()`|
|`right_on=`                |column(s) of the right dataframe to merge on; right refers to the second dataframe in the call to `pd.merge()`|
|`how='left'`               |keeps all rows in the left dataframe; rows with no match get NaN in the right dataframe's columns|
|`how='right'`              |keeps all rows in the right dataframe; rows with no match get NaN in the left dataframe's columns|
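
A short sketch of these arguments, assuming two made-up dataframes whose key columns have different names ('dept_id' and 'id'):

```python
import pandas as pd

# Hypothetical data: the key is called 'dept_id' on the left and 'id' on the right
staff = pd.DataFrame({'name': ['ann', 'bob', 'cat'], 'dept_id': [1, 2, 3]})
depts = pd.DataFrame({'id': [1, 2, 4], 'dept': ['sales', 'ops', 'legal']})

# Keep every staff member, even if their department is unknown
left_join = pd.merge(staff, depts, left_on='dept_id', right_on='id', how='left')

# Keep every department, even if nobody works in it
right_join = pd.merge(staff, depts, left_on='dept_id', right_on='id', how='right')

print(left_join)
print(right_join)
```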

### `pd.merge_ordered(df1, df2)`
This merges 2 dataframes and sorts the result by the key column(s). It performs an outer join by default and has a few keyword arguments:

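|Argument                             |Function                            |
|-------------------------------------|------------------------------------|
|`on=`                                |column(s) to merge on and order by (e.g., `on='date'`)|
|`suffixes=`                          |suffixes added to column names that appear in both dataframes, to tell them apart|
|`fill_method=`                       |how to fill NaN in the merged result (e.g., `'ffill'`)|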
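
The following sketch shows the three arguments together on two made-up time series (the dates, values, and suffix names are hypothetical):

```python
import pandas as pd

# Hypothetical time series that both have a column called 'value'
prices = pd.DataFrame(
    {'date': pd.to_datetime(['2020-01-01', '2020-01-03']), 'value': [100.0, 102.0]}
)
rates = pd.DataFrame(
    {'date': pd.to_datetime(['2020-01-02', '2020-01-03']), 'value': [1.5, 1.6]}
)

# Outer join ordered by date; clashing 'value' columns get suffixes,
# and gaps are forward filled from the previous row
ordered = pd.merge_ordered(
    prices, rates, on='date', suffixes=('_price', '_rate'), fill_method='ffill'
)
print(ordered)
```

The first row of 'value_rate' stays NaN because there is no earlier value to fill forward from.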