## Joining DataFrames

In [4]:
import pandas as pd

# Catch up on the data we loaded in the previous section
sess_1 = pd.read_csv('session_1.csv', index_col='trial')
sess_2 = pd.read_csv('session_2.csv', index_col='trial')
sess_3 = pd.read_csv('session_3.csv')
sess_12 = pd.merge(sess_1, sess_2, on='trial', suffixes=['_sess_1', '_sess_2'])

Yet another way of combining pandas DataFrames is with the `.join()` method. `pd.merge()` is a function: you can tell because the name is preceded by `pd` rather than by a DataFrame, and all the input data goes inside the parentheses. `.join()`, in contrast, is a method, so it must be appended to the name of an existing DataFrame, with the DataFrame you want to join to it specified in the parentheses:

In [5]:
sess_12.join(sess_3)

trial,rt_sess_1,rt_sess_2,Trial,RT
0,0.988,0.718,0,0.844168
1,0.753,0.851,1,0.913048
2,0.949,0.747,2,0.843295
3,0.824,0.52,3,0.530306
4,0.262,0.991,4,0.266715
5,0.803,0.004,5,0.707006
6,0.376,0.547,6,0.973193
7,0.496,0.883,7,0.432562
8,0.235,0.841,8,0.522106
9,0.336,0.195,9,0.876626


Note that `.join()` is less picky than `pd.merge()`: it ran fine even though the inputs share no exactly matching column labels (`sess_3` has a `Trial` column, but `trial` in `sess_12` is the index). Indeed, we can join DataFrames that have totally different columns and even different lengths:

In [7]:
fav_colour = pd.read_csv('fav_colour.csv')

sess_12.join(fav_colour)

trial,rt_sess_1,rt_sess_2,Participant num,Fav Colour
0,0.988,0.718,1.0,blue
1,0.753,0.851,2.0,red
2,0.949,0.747,3.0,green
3,0.824,0.52,4.0,purple
4,0.262,0.991,5.0,red
5,0.803,0.004,6.0,green
6,0.376,0.547,7.0,orange
7,0.496,0.883,8.0,yellow
8,0.235,0.841,9.0,yellow
9,0.336,0.195,10.0,pink


In other words, `.join()` simply adds the columns of the 'right' DataFrame (the one in parentheses) to the columns of the 'left' DataFrame (preceding the dot), lining up the rows by their index and filling in `NaN`s wherever the right DataFrame has no row that matches an index in the left one.
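To make the alignment rule concrete, here is a minimal sketch using made-up data (the names `left` and `right` and their values are illustrative assumptions, not taken from the session files). It suggests that `.join()` matches rows on their index labels rather than on their position in the file:

```python
import pandas as pd

# Hypothetical toy data: the right DataFrame is shorter and in a different order
left = pd.DataFrame({'rt': [0.5, 0.6, 0.7]}, index=[0, 1, 2])
right = pd.DataFrame({'colour': ['red', 'blue']}, index=[2, 0])

# Rows are matched on index labels; the left DataFrame's rows are all kept
joined = left.join(right)
print(joined)
# index 0 gets 'blue', index 1 gets NaN, index 2 gets 'red'
```

Here 'red' ends up on the last row even though it comes first in `right`, because its index label is 2.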

This is more flexible, but potentially messier and more dangerous: we need to be certain that the row indexes of the two DataFrames we're joining refer to the same things, or the data will be mis-aligned. Interestingly, if pandas does find identical column labels in the two DataFrames, it raises an error, because the result would contain two columns with the same name and no suffix has been given to tell them apart:

In [8]:
fav_colour = pd.read_csv('fav_colour.csv')
eye_colour = pd.read_csv('eye_colour.csv')

fav_colour.join(eye_colour)

ValueError: columns overlap but no suffix specified: Index(['Participant num'], dtype='object')
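One way to get past this error is to supply the suffixes it asks for, using the `lsuffix=` and `rsuffix=` arguments of `.join()`. The sketch below uses small invented stand-ins for the two files (the names `fav` and `eye` and all values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical stand-ins for fav_colour.csv and eye_colour.csv
fav = pd.DataFrame({'Participant num': [1, 2, 3],
                    'Fav Colour': ['blue', 'red', 'green']})
eye = pd.DataFrame({'Participant num': [1, 2, 3],
                    'eye_colour': ['brown', 'blue', 'blue']})

# The suffixes rename the overlapping column on each side, so no error is raised
both = fav.join(eye, lsuffix='_fav', rsuffix='_eye')
print(both.columns.tolist())
# ['Participant num_fav', 'Fav Colour', 'Participant num_eye', 'eye_colour']
```

Note that this only silences the error. The rows are still matched on the default integer index, not on participant number, so the two `Participant num` columns are not guaranteed to agree.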

**Indexing** can help `.join()` operate more safely and reliably. If we specify the shared columns as indexes of each DataFrame, pandas will match the inputs based on the indexes:

In [9]:
fav_colour = fav_colour.set_index('Participant num')
eye_colour = eye_colour.set_index('Participant num')

fav_colour.join(eye_colour)

Participant num,Fav Colour,eye_colour
1,blue,brown
2,red,blue
3,green,blue
4,purple,hazel
5,red,green
6,green,
7,orange,
8,yellow,
9,yellow,
10,pink,


By default `.join()` uses a left join: every row of the left DataFrame is kept, which is why participants 6 to 10 appear above with no eye colour. Again, we can use an argument to change that behaviour; for `.join()` the argument is `how=`:

In [10]:
fav_colour.join(eye_colour, how='inner')

Participant num,Fav Colour,eye_colour
1,blue,brown
2,red,blue
3,green,blue
4,purple,hazel
5,red,green


#### Left and Right joins

In addition to `outer` (union; i.e., every index found in either input) and `inner` (intersection; i.e., only indices found in both inputs) joins, we can pass `left` or `right` to `how=` to keep every index from one input, along with whatever rows match from the other input.

So if we use `how='left'`, pandas will include all indices present in the left input, filling the right input's columns with `NaN` wherever there is no match:

In [11]:
fav_colour.join(eye_colour, how='left')

Participant num,Fav Colour,eye_colour
1,blue,brown
2,red,blue
3,green,blue
4,purple,hazel
5,red,green
6,green,
7,orange,
8,yellow,
9,yellow,
10,pink,


Conversely, with `how='right'` we get all indices present in the right input, again filling anything missing from the left with `NaN`:

In [12]:
fav_colour.join(eye_colour, how='right')

Participant num,Fav Colour,eye_colour
1,blue,brown
2,red,blue
3,green,blue
4,purple,hazel
5,red,green
11,,brown
12,,brown
13,,blue
