# Lesson 5: Merging DataFrames with pandas

[View this lesson on DataCamp](https://learn.datacamp.com/courses/merging-dataframes-with-pandas)

## Chapter 2: Concatenating Data

### Appending pandas Series and DataFrames

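The `.append()` method works on both pandas Series and DataFrames, stacking the rows of one beneath the rows of the other. For example, appending the two Series below produces a single Series in which the elements of `blue` are stacked underneath the elements of `pink` (since we append `blue` to `pink` in this case). Note that `.append()` was deprecated in pandas 1.4 and removed in pandas 2.0, so the `.append()` examples in this lesson require an older version of pandas.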

In [9]:
import pandas as pd

pink = pd.Series(['rose', 'fuchsia', 'ruby', 'magenta'])
blue = pd.Series(['turquoise', 'sky blue', 'navy', 'ocean blue'])
pb = pink.append(blue)
pb

0          rose
1       fuchsia
2          ruby
3       magenta
0     turquoise
1      sky blue
2          navy
3    ocean blue
dtype: object

One thing to note here is that the `.append()` method doesn't adjust the index of the Series when it stacks them. You can see this above: the first column (which is the index) goes from zero to three, then from zero to three again. Duplicate index values can cause problems when working with a Series or DataFrame later on, so it's a good idea to re-index.

This can be done using the `.reset_index()` method:

In [10]:
pb.reset_index()

| | index | 0 |
|---|---|---|
| 0 | 0 | rose |
| 1 | 1 | fuchsia |
| 2 | 2 | ruby |
| 3 | 3 | magenta |
| 4 | 0 | turquoise |
| 5 | 1 | sky blue |
| 6 | 2 | navy |
| 7 | 3 | ocean blue |


You'll see there's an extra column in there now: the new index is the leftmost column, and the original index is now a column called `index`. Because the old index has become a column, the result is a DataFrame rather than a Series. In some cases this might be a helpful historical record, but in many cases it's just annoying. Adding the argument `drop=True` will drop the original index:

In [11]:
pb.reset_index(drop=True)

0          rose
1       fuchsia
2          ruby
3       magenta
4     turquoise
5      sky blue
6          navy
7    ocean blue
dtype: object

One important "gotcha" with `.reset_index()` — and many pandas DataFrame methods — is that by default they don't actually modify the Series or DataFrame you run them on. So after running the command above, you might think you reset the index of `pb`, but actually you didn't; instead you just saw the copy that was created by your command, printed as output. Thus when we ask to see `pb` again, the index is unchanged: 

In [12]:
pb

0          rose
1       fuchsia
2          ruby
3       magenta
0     turquoise
1      sky blue
2          navy
3    ocean blue
dtype: object

So to actually reset the index of `pb`, we need to *assign* the output of the method back to `pb`, like this:

In [14]:
pb = pb.reset_index(drop=True)
pb

0          rose
1       fuchsia
2          ruby
3       magenta
4     turquoise
5      sky blue
6          navy
7    ocean blue
dtype: object

Alternatively, you can include the `inplace=True` argument, in which case you don't need to assign the output back with `pb =`:

In [17]:
pb = pink.append(blue)
pb.reset_index(drop=True, inplace=True)
pb

0          rose
1       fuchsia
2          ruby
3       magenta
4     turquoise
5      sky blue
6          navy
7    ocean blue
dtype: object
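
The "gotcha" described above can be checked directly by comparing index values before and after assignment. The following is a minimal sketch rather than part of the original lesson: it builds `pb` with `pd.concat()` (introduced in the next section) because `.append()` is not available in pandas 2.0 and later, and the variable name `renumbered` is chosen here purely for illustration.

```python
import pandas as pd

pink = pd.Series(['rose', 'fuchsia', 'ruby', 'magenta'])
blue = pd.Series(['turquoise', 'sky blue', 'navy', 'ocean blue'])

# pd.concat() stacks the two Series, as .append() did in older pandas
pb = pd.concat([pink, blue])

# Calling the method without assigning the result returns a new Series...
renumbered = pb.reset_index(drop=True)

# ...and leaves pb's own index untouched (0 to 3 appears twice)
print(pb.index.tolist())          # [0, 1, 2, 3, 0, 1, 2, 3]
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4, 5, 6, 7]

# Assigning the result back is what actually updates pb
pb = pb.reset_index(drop=True)
print(pb.index.tolist())          # [0, 1, 2, 3, 4, 5, 6, 7]
```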

### `pd.concat()`

Another way to combine pandas objects (Series or DataFrames) is concatenation. The `pd.concat()` function is a more powerful and flexible tool than the `.append()` method. Whereas appending always adds rows to the bottom of a DataFrame, concatenation can do this, *or* add columns to a DataFrame.

[API for `pd.concat()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html?highlight=concat#pandas.concat)

There's also a nice, detailed explanation of appending, concatenating, merging, and joining DataFrames in the [pandas user guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html).

Here's how we would use `pd.concat()` to do the same thing we did above with `.append()`. There are two major differences to pay attention to. First, `pd.concat()` is a *function*, whereas `.append()` is a *method*. Recall that methods are applied by adding a dot and the method name to the variable you want to operate on (e.g., `pink.append(blue)`). With a function, we write `pd` before the dot and the function name after it, and pass all of the input data as the first argument inside the parentheses. Second, we need to pay attention to how we specify that input data: since a function's arguments are separated by commas, you can't just list the inputs like this: `pd.concat(pink, blue)`, because `pink` will be interpreted as the input data and `blue` as a second argument. We need to put the input data inside a list, like this:

In [19]:
pb = pd.concat([pink, blue])
pb

0          rose
1       fuchsia
2          ruby
3       magenta
0     turquoise
1      sky blue
2          navy
3    ocean blue
dtype: object

As with `.append()`, the original index values are preserved, which we might not want. While with `.append()` we had to run a separate method to reset the index, with `pd.concat()` we can do this at the same time as the concatenation, using the `ignore_index` argument:

In [21]:
pb = pd.concat([pink, blue], ignore_index=True)
pb

0          rose
1       fuchsia
2          ruby
3       magenta
4     turquoise
5      sky blue
6          navy
7    ocean blue
dtype: object
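
To make the point about the list argument concrete, here is a small sketch (not from the original lesson) that tries both call styles. It assumes only the `pink` and `blue` Series defined earlier; the exact wording of the error message varies between pandas versions, so only the exception type is relied on.

```python
import pandas as pd

pink = pd.Series(['rose', 'fuchsia', 'ruby', 'magenta'])
blue = pd.Series(['turquoise', 'sky blue', 'navy', 'ocean blue'])

# Passing the Series as separate arguments fails, because only the
# first argument is treated as the collection of objects to combine
try:
    pd.concat(pink, blue)
except TypeError as err:
    print('TypeError:', err)

# Wrapping the inputs in a list works; ignore_index=True renumbers the rows
pb = pd.concat([pink, blue], ignore_index=True)
print(pb.index.tolist())
```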

### Using `.append()` and `pd.concat()` on DataFrames

Above we were working with pandas Series. For pandas DataFrames, the append method works just the same, stacking the rows. 

Here we will use what we learned in chapter 1 to read two CSV files as DataFrames, then combine them with the `.append()` method.

In [30]:
s1 = pd.read_csv('s1.csv')
s2 = pd.read_csv('s2.csv')

all_data = s1.append(s2)

all_data

| | trial | RT |
|---|---|---|
| 0 | 1 | 0.508971 |
| 1 | 2 | 0.389858 |
| 2 | 3 | 0.404175 |
| 3 | 4 | 0.26952 |
| 4 | 5 | 0.437765 |
| 5 | 6 | 0.368142 |
| 6 | 7 | 0.400544 |
| 7 | 8 | 0.335198 |
| 8 | 9 | 0.341722 |
| 9 | 10 | 0.439583 |


We can also do this with `pd.concat()`:

In [32]:
# s1 and s2 are already loaded into memory from above

df = pd.concat([s1, s2], ignore_index=True)

df

| | trial | RT |
|---|---|---|
| 0 | 1 | 0.508971 |
| 1 | 2 | 0.389858 |
| 2 | 3 | 0.404175 |
| 3 | 4 | 0.26952 |
| 4 | 5 | 0.437765 |
| 5 | 6 | 0.368142 |
| 6 | 7 | 0.400544 |
| 7 | 8 | 0.335198 |
| 8 | 9 | 0.341722 |
| 9 | 10 | 0.439583 |


We can get a bit fancier and use a loop to read in files, as in the previous chapter, and then combine them. Here's the code from the last chapter, which reads the CSV files into a list of DataFrames:

In [33]:
filenames = ['s1.csv', 's2.csv', 's3.csv']

df_list = []

for filename in filenames:
    df_list.append(pd.read_csv(filename))

Since `df_list` is already a list (the format that `pd.concat()` wants its input in), we can just pass the whole thing to `pd.concat()`:

In [35]:
df = pd.concat(df_list, ignore_index=True)

df

| | trial | RT |
|---|---|---|
| 0 | 1 | 0.508971 |
| 1 | 2 | 0.389858 |
| 2 | 3 | 0.404175 |
| 3 | 4 | 0.26952 |
| 4 | 5 | 0.437765 |
| 5 | 6 | 0.368142 |
| 6 | 7 | 0.400544 |
| 7 | 8 | 0.335198 |
| 8 | 9 | 0.341722 |
| 9 | 10 | 0.439583 |


#### Extending `pd.concat()` to columns

As noted earlier, `pd.concat()` is more powerful in how it combines inputs: it can join them along columns as well as rows.

Consider this example, where we have different data about the same participants in different files. One file contains participants' favourite colours, and the other their birthday months. When we read these in and concatenate them, we get a column for colour and a column for month, each with lots of NaN values. This is because the input files have different column names, but we've stacked the rows of the inputs:

In [36]:
fav_colour = pd.read_csv('fav_colour.csv')
birthday_month = pd.read_csv('birthday_months.csv')

df = pd.concat([fav_colour, birthday_month])

df

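| | Participant num | Fav Colour | Birthday Month |
|---|---|---|---|
| 0 | 1 | blue | NaN |
| 1 | 2 | red | NaN |
| 2 | 3 | green | NaN |
| 3 | 4 | purple | NaN |
| 4 | 5 | red | NaN |
| 5 | 6 | green | NaN |
| 6 | 7 | orange | NaN |
| 7 | 8 | yellow | NaN |
| 8 | 9 | yellow | NaN |
| 9 | 10 | pink | NaN |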


You can see above that there's also a `Participant num` column, which indicates how we can match colours to months. What we actually want is to combine the two inputs "horizontally", such that we have 10 rows (one for each participant), with the colour and month corresponding to each participant in the same row.

The default when concatenating DataFrames is to do so vertically, as we saw above. However, `pd.concat()` allows us to concatenate horizontally as well. To do this, specify either `axis=1` or `axis='columns'`. Note that in the example below, rows with identical index values are combined when concatenated.

In [37]:
df = pd.concat([fav_colour, birthday_month], axis=1)
df

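| | Participant num | Fav Colour | Participant num | Birthday Month |
|---|---|---|---|---|
| 0 | 1 | blue | 1 | may |
| 1 | 2 | red | 2 | june |
| 2 | 3 | green | 3 | january |
| 3 | 4 | purple | 4 | february |
| 4 | 5 | red | 5 | september |
| 5 | 6 | green | 6 | july |
| 6 | 7 | orange | 7 | may |
| 7 | 8 | yellow | 8 | may |
| 8 | 9 | yellow | 9 | august |
| 9 | 10 | pink | 10 | december |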


We're still not quite where we want to be, as we have two redundant `Participant num` columns. One way to fix this is to make the `Participant num` column the index of each input DataFrame when we first load the data.

In [41]:
fav_colour = pd.read_csv('fav_colour.csv', index_col='Participant num')
birthday_month = pd.read_csv('birthday_months.csv', index_col='Participant num')

df = pd.concat([fav_colour, birthday_month], axis=1)
df

| Participant num | Fav Colour | Birthday Month |
|---|---|---|
| 1 | blue | may |
| 2 | red | june |
| 3 | green | january |
| 4 | purple | february |
| 5 | red | september |
| 6 | green | july |
| 7 | orange | may |
| 8 | yellow | may |
| 9 | yellow | august |
| 10 | pink | december |


Although we could also make one of the `Participant num` columns the index after concatenation, specifying the index when we read in the data is safer. Your data might not be in the same order in both files (e.g., one file might not be sorted by `Participant num`), or one file might have missing data. By making `Participant num` the index for each file before we concatenate them, we ensure that pandas matches the rows from each input based on their index values. Importantly, this is a case where we would *not* want to pass `ignore_index=True` to `pd.concat()`, because the index is important and meaningful.
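
The index-matching behaviour described above can be seen with a tiny example. This is a sketch under stated assumptions: the two DataFrames are built inline with made-up values (they are not the contents of `fav_colour.csv` or `birthday_months.csv`), the second one is deliberately out of order, and it has no row for participant 3.

```python
import pandas as pd

# Hypothetical data: same participants, but the second table is in a
# different order and has no entry for participant 3
fav_colour = pd.DataFrame(
    {'Fav Colour': ['blue', 'red', 'green']},
    index=pd.Index([1, 2, 3], name='Participant num'),
)
birthday_month = pd.DataFrame(
    {'Birthday Month': ['june', 'may']},
    index=pd.Index([2, 1], name='Participant num'),
)

# axis='columns' is equivalent to axis=1; rows are matched on index labels,
# not on their position in each input
df = pd.concat([fav_colour, birthday_month], axis='columns')
print(df)
```

Participant 1 ends up paired with `may` and participant 2 with `june`, despite the different row order, and participant 3 gets NaN for the missing month.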

### Multi-indexes
In `pd.concat()`, the `keys=` argument assigns an outer index label to each of the input DataFrames, producing a multi-indexed result. `pd.IndexSlice` then allows you to slice along a specified level of that multi-index, typically inside `.loc[]`.
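
As a rough illustration of how `keys=` and `pd.IndexSlice` fit together, consider the sketch below. The reaction-time values and the labels `'s1'` and `'s2'` are invented for the example, and the DataFrames are built inline rather than read from the lesson's CSV files.

```python
import pandas as pd

# Hypothetical reaction-time data for two participants
s1 = pd.DataFrame({'trial': [1, 2, 3], 'RT': [0.51, 0.39, 0.40]})
s2 = pd.DataFrame({'trial': [1, 2, 3], 'RT': [0.27, 0.44, 0.37]})

# keys= labels each input, creating an outer index level
df = pd.concat([s1, s2], keys=['s1', 's2'])

# Select all rows that came from s2 via the outer level
print(df.loc['s2'])

# pd.IndexSlice slices several levels at once: here, every outer label,
# and inner row labels 0 through 1 (label slices include both ends)
idx = pd.IndexSlice
first_two = df.loc[idx[:, 0:1], :]
print(first_two)
```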

### Outer and Inner Joins 
An outer join is performed by passing the keyword argument `join='outer'` to `pd.concat()` (this is the default); it preserves all of the index labels from every input DataFrame. Specifying `join='inner'` performs an inner join, which preserves only the index labels common to all of the input DataFrames.
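
The difference between the two joins is easiest to see when one input is missing a row. The following sketch uses small made-up DataFrames (not the lesson's CSV files) in which participant 3 appears in only one input.

```python
import pandas as pd

# Hypothetical data: participant 3 is missing from the second table
fav_colour = pd.DataFrame({'Fav Colour': ['blue', 'red', 'green']}, index=[1, 2, 3])
birthday_month = pd.DataFrame({'Birthday Month': ['may', 'june']}, index=[1, 2])

# Outer join (the default): keep every index label, filling gaps with NaN
outer = pd.concat([fav_colour, birthday_month], axis=1, join='outer')

# Inner join: keep only the index labels present in every input
inner = pd.concat([fav_colour, birthday_month], axis=1, join='inner')

print(outer)
print(inner)
```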

It is also possible to use a `for` loop to read in multiple files.

First, create a list of the filenames:

`filenames = ['filename1.csv', 'filename2.csv', 'filename3.csv']`

Next, create an empty list to hold the DataFrames:

`dataframes = []`

Finally, loop over the filenames, reading each file and appending the resulting DataFrame to the list:

`for filename in filenames:
    dataframes.append(pd.read_csv(filename))`

|Method                    |Function               |
|--------------------------|-----------------------|
|`.sort_index()`           |sorts by index labels; ascending by default, and specifying `ascending=False` reverses the order|
|`.sort_values()`          |sorts by the values in one or more columns; ascending by default|
|`.reindex()`              |conforms rows to a new set of index labels, in the order given; labels with no matching row are filled with NaN|
|`.ffill()`                |forward fill; replaces each NaN with the last preceding non-null value|
|`.str.replace('', '')`    |replaces occurrences of the first specified string with the second in every value of a column|
|`.multiply()`             |multiplies element-wise by a scalar, Series, or DataFrame, broadcasting along the specified axis|
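
The methods in the table can be tried out on a small DataFrame. This is only a sketch: the column names `month` and `temp_c` and all values are hypothetical, chosen to make each method's effect visible.

```python
import pandas as pd

# Hypothetical data for illustration, with a deliberately unsorted index
df = pd.DataFrame(
    {'month': ['mar', 'jan', 'feb'], 'temp_c': [9.0, 3.0, 5.0]},
    index=[2, 0, 1],
)

by_index = df.sort_index()                           # rows ordered 0, 1, 2
by_temp = df.sort_values('temp_c', ascending=False)  # warmest first

# Reindexing to labels 0-3 adds row 3, which has no data (NaN)...
reindexed = df.sort_index().reindex([0, 1, 2, 3])
# ...and ffill() copies the last preceding non-null values into it
filled = reindexed.ffill()

# String replacement applied to every value in a column
capital_j = df['month'].str.replace('j', 'J')

# Element-wise multiplication; the scalar is broadcast across the column
doubled = df['temp_c'].multiply(2)

print(by_index, by_temp, filled, capital_j, doubled, sep='\n\n')
```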