# 3.7 Joins and Merges

Joining and merging in Pandas are essentially the same thing. As you may remember from our SQL class, a **join** links the rows of one table to the rows of another table based on primary key-foreign key relationships, where the primary key of one table appears as a foreign key in the other. Joining is often performed before the data is imported into Pandas, because it is usually best to eliminate unnecessary data as early in the data analysis process as possible, but sometimes joining in Pandas is necessary.

Pandas dataframes have both a `.join()` method and a `.merge()` method. Both can perform the same operation. The difference is that `.join()` matches on the dataframe index by default (and always uses the index of the dataframe passed in), whereas `.merge()` lets the user choose which columns to join on. In other words, the `.merge()` method is more flexible. In this course, we will be using the `.merge()` method only.
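As a rough illustration of the difference, the sketch below performs the same inner join both ways. The `left` and `right` dataframes and their column names are made up for this example and are not part of the course data.

```python
import pandas as pd

# Hypothetical toy tables (not the Titanic data)
left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "d"], "right_val": [10, 20, 40]})

# .merge() matches on whichever column we name
merged = left.merge(right, on="key", how="inner")

# .join() matches against the index of the other dataframe,
# so "key" is moved into the index on both sides first
joined = (
    left.set_index("key")
    .join(right.set_index("key"), how="inner")
    .reset_index()
)

print(merged)
print(joined)
```

Both results should contain the same matched rows; the `.merge()` version simply needs less index bookkeeping.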

### About the data

This notebook uses the classic *Titanic* data set, provided as a CSV file in the `data` folder. However, to demonstrate Pandas' ability to join tables together, a separate CSV file called `titanic_embarkment.csv` has been provided as well, which contains some data about embarkment locations.

First, let's import Pandas and the data. Notice that the dataframe is called `titanic_df` and not just `df`, like before. This is because we will be joining several dataframes together and need to be able to distinguish between them.

In [6]:
import pandas as pd
titanic_df = pd.read_csv("./data/titanic.csv")

The `titanic_df` dataframe holds the same *Titanic* data set that we have been using throughout this course. However, you may have noticed that this data set doesn't have any declared foreign keys. That means that we cannot use a join in quite the same way that we are used to doing it with foreign and primary keys. Not separating tables by primary and foreign keys, however, is merely a database design decision and should not prevent us from merging data onto our dataframe.

In this notebook, we are going to join on the `Embarked` column, using the letters `S`, `C`, and `Q` as foreign keys to another table that we will import below. In other words, we will be joining information about the embarked locations to our `titanic_df` using another table that has information about them.

Observe the first few rows of the `titanic.csv` data.

In [7]:
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Before we can merge more data to this dataframe, we need to see the other data and convert it to a dataframe first. Let's import the other `titanic_embarkment.csv` file into the variable `embarkment_df`.

In [8]:
embarkment_df = pd.read_csv("./data/titanic_embarkment.csv")
embarkment_df.head()

Unnamed: 0,EmbarkedLetter,CityName,Country,Latitude,Longitude
0,Q,Queenstown,Ireland,51.8503,8.2943
1,S,Southampton,U.K.,50.9105,1.4049
2,N,New York,U.S.A,40.7128,74.006


Notice that the `Embarked` column of `titanic_df` and the `EmbarkedLetter` column of `embarkment_df` have similar values. However, also notice that `embarkment_df` does not list the letter `C` in the `EmbarkedLetter` column, even though `C` does exist in `titanic_df`. Furthermore, the `Embarked` column in `titanic_df` does not use the letter `N`. Creating the data set this way was intentional and is meant to show how different kinds of joins will affect the dataframes.
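A quick way to spot such mismatches before merging is to compare the two sets of key values. The sketch below uses small stand-in series (the variable names and the shortened value lists are assumptions for illustration) rather than the real columns.

```python
import pandas as pd

# Stand-ins for titanic_df["Embarked"] and embarkment_df["EmbarkedLetter"];
# None represents a passenger with a missing Embarked value
titanic_keys = pd.Series(["S", "C", "S", None, "Q"], name="Embarked")
embarkment_keys = pd.Series(["Q", "S", "N"], name="EmbarkedLetter")

# Drop missing values, then compare the distinct keys on each side
left_keys = set(titanic_keys.dropna())
right_keys = set(embarkment_keys.dropna())

only_in_titanic = left_keys - right_keys      # keys with no lookup row
only_in_embarkment = right_keys - left_keys   # lookup rows nobody refers to

print(only_in_titanic)
print(only_in_embarkment)
```

With the real data, the same comparison could be run on `titanic_df["Embarked"]` and `embarkment_df["EmbarkedLetter"]`.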

The presence of at least some similar values in the `Embarked` and `EmbarkedLetter` columns means that we can join both dataframes together. The result will be a table with more columns and more information about the embarkment location.

To merge, we call the `.merge()` method on the dataframe that we want to merge data into. Inside the parentheses, we pass the dataframe that we want to merge in, the name of the left table's column to join on in the `left_on` parameter, and the name of the right table's column to join on in the `right_on` parameter.

Note that if the column to join on has the same name in both dataframes, you can simply specify it once in the `on` parameter. This doesn't apply here, because the two columns are named differently.
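To show what the `on` shortcut looks like, here is a minimal sketch in which the key column is first renamed so that both tables share the name `Embarked`. The `passengers` and `ports` dataframes are small made-up stand-ins, not the real CSV files.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
passengers = pd.DataFrame(
    {"Name": ["Braund", "Heikkinen", "Dooley"], "Embarked": ["S", "S", "Q"]}
)
ports = pd.DataFrame(
    {"EmbarkedLetter": ["Q", "S"], "CityName": ["Queenstown", "Southampton"]}
)

# Give the key column the same name on both sides...
ports_renamed = ports.rename(columns={"EmbarkedLetter": "Embarked"})

# ...so a single `on` argument replaces left_on/right_on
result = passengers.merge(ports_renamed, on="Embarked")
print(result)
```

A side effect of joining with `on` is that the result keeps a single key column instead of both `Embarked` and `EmbarkedLetter`.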

In [12]:
titanic_df.merge(embarkment_df, left_on='Embarked', right_on='EmbarkedLetter')

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,EmbarkedLetter,CityName,Country,Latitude,Longitude
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,S,Southampton,U.K.,50.9105,1.4049
1,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,S,Southampton,U.K.,50.9105,1.4049
2,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,S,Southampton,U.K.,50.9105,1.4049
3,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,S,Southampton,U.K.,50.9105,1.4049
4,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,S,Southampton,U.K.,50.9105,1.4049
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
716,791,0,3,"Keane, Mr. Andrew ""Andy""",male,,0,0,12460,7.7500,,Q,Q,Queenstown,Ireland,51.8503,8.2943
717,826,0,3,"Flynn, Mr. John",male,,0,0,368323,6.9500,,Q,Q,Queenstown,Ireland,51.8503,8.2943
718,829,1,3,"McCormack, Mr. Thomas Joseph",male,,0,0,367228,7.7500,,Q,Q,Queenstown,Ireland,51.8503,8.2943
719,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q,Q,Queenstown,Ireland,51.8503,8.2943


### Inner/right/left/outer joins

In the merge performed above, the join executed was an *inner* join by default. This means that any row of `titanic_df` whose `Embarked` value has no match in the `EmbarkedLetter` column of `embarkment_df` was simply discarded, as was any row of `embarkment_df` whose letter does not appear in `titanic_df`. For this reason, only 721 rows were returned, even though the CSV contains 891 rows.

Notice also that in `embarkment_df`, there is no letter "C", which would have stood for Cherbourg, France. There is also no letter "N" in `titanic_df`, which would have stood for New York.

You can specify how to perform the join by passing the `how` argument to the `.merge()` method with a string that is either "inner", "right", "left", or "outer". The default is "inner", which is what the join uses if you don't specify the `how` argument.

The code below demonstrates the differences between inner, left, right, and full outer joins. Pay close attention to the row index labels in the first column, which tell you how many rows were preserved from the join!
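Before looking at the real output, it may help to see the four options side by side on a tiny example. The sketch below uses made-up `passengers` and `ports` tables that mimic the mismatch in our data (a `C` with no lookup row, and an `N` that no passenger uses) and counts the rows each join returns.

```python
import pandas as pd

# Hypothetical miniature tables with the same kind of mismatch as the real data
passengers = pd.DataFrame(
    {"Name": ["Braund", "Cumings", "Heikkinen", "Dooley"],
     "Embarked": ["S", "C", "S", "Q"]}
)
ports = pd.DataFrame(
    {"EmbarkedLetter": ["Q", "S", "N"],
     "CityName": ["Queenstown", "Southampton", "New York"]}
)

# Run the same merge once per join type and record the row count
row_counts = {
    how: len(
        passengers.merge(
            ports, left_on="Embarked", right_on="EmbarkedLetter", how=how
        )
    )
    for how in ["inner", "left", "right", "outer"]
}
print(row_counts)
```

Here the inner join keeps only the three matched passengers, the left join keeps all four passengers, the right join keeps the three matches plus the unused `N` row, and the outer join keeps everything from both sides.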

##### Inner join

In [18]:
titanic_df.merge(embarkment_df, left_on='Embarked', right_on='EmbarkedLetter', how="inner").tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,EmbarkedLetter,CityName,Country,Latitude,Longitude
716,791,0,3,"Keane, Mr. Andrew ""Andy""",male,,0,0,12460,7.75,,Q,Q,Queenstown,Ireland,51.8503,8.2943
717,826,0,3,"Flynn, Mr. John",male,,0,0,368323,6.95,,Q,Q,Queenstown,Ireland,51.8503,8.2943
718,829,1,3,"McCormack, Mr. Thomas Joseph",male,,0,0,367228,7.75,,Q,Q,Queenstown,Ireland,51.8503,8.2943
719,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.125,,Q,Q,Queenstown,Ireland,51.8503,8.2943
720,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q,Q,Queenstown,Ireland,51.8503,8.2943


##### Right join

In [24]:
titanic_df.merge(embarkment_df, left_on='Embarked', right_on='EmbarkedLetter', how="right").tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,EmbarkedLetter,CityName,Country,Latitude,Longitude
717,885.0,0.0,3.0,"Sutehall, Mr. Henry Jr",male,25.0,0.0,0.0,SOTON/OQ 392076,7.05,,S,S,Southampton,U.K.,50.9105,1.4049
718,887.0,0.0,2.0,"Montvila, Rev. Juozas",male,27.0,0.0,0.0,211536,13.0,,S,S,Southampton,U.K.,50.9105,1.4049
719,888.0,1.0,1.0,"Graham, Miss. Margaret Edith",female,19.0,0.0,0.0,112053,30.0,B42,S,S,Southampton,U.K.,50.9105,1.4049
720,889.0,0.0,3.0,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1.0,2.0,W./C. 6607,23.45,,S,S,Southampton,U.K.,50.9105,1.4049
721,,,,,,,,,,,,,N,New York,U.S.A,40.7128,74.006


##### Left join

In [20]:
titanic_df.merge(embarkment_df, left_on='Embarked', right_on='EmbarkedLetter', how="left").tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,EmbarkedLetter,CityName,Country,Latitude,Longitude
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S,S,Southampton,U.K.,50.9105,1.4049
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S,S,Southampton,U.K.,50.9105,1.4049
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S,S,Southampton,U.K.,50.9105,1.4049
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C,,,,,
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q,Q,Queenstown,Ireland,51.8503,8.2943


##### Outer join

In [22]:
titanic_df.merge(embarkment_df, left_on='Embarked', right_on='EmbarkedLetter', how="outer").tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,EmbarkedLetter,CityName,Country,Latitude,Longitude
887,886.0,0.0,3.0,"Rice, Mrs. William (Margaret Norton)",female,39.0,0.0,5.0,382652.0,29.125,,Q,Q,Queenstown,Ireland,51.8503,8.2943
888,891.0,0.0,3.0,"Dooley, Mr. Patrick",male,32.0,0.0,0.0,370376.0,7.75,,Q,Q,Queenstown,Ireland,51.8503,8.2943
889,62.0,1.0,1.0,"Icard, Miss. Amelie",female,38.0,0.0,0.0,113572.0,80.0,B28,,,,,,
890,830.0,1.0,1.0,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0.0,0.0,113572.0,80.0,B28,,,,,,
891,,,,,,,,,,,,,N,New York,U.S.A,40.7128,74.006
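One way to see where each row of an outer join came from is the `indicator` argument of `.merge()`, which adds a `_merge` column to the result. The sketch below again uses small made-up `passengers` and `ports` tables instead of the real CSV files.

```python
import pandas as pd

# Hypothetical miniature tables (not the real CSV files)
passengers = pd.DataFrame(
    {"Name": ["Braund", "Cumings", "Dooley"], "Embarked": ["S", "C", "Q"]}
)
ports = pd.DataFrame(
    {"EmbarkedLetter": ["Q", "S", "N"],
     "CityName": ["Queenstown", "Southampton", "New York"]}
)

# indicator=True adds a "_merge" column with the values
# "both", "left_only", or "right_only"
outer = passengers.merge(
    ports,
    left_on="Embarked",
    right_on="EmbarkedLetter",
    how="outer",
    indicator=True,
)

# Rows that found no partner in the other table
unmatched = outer[outer["_merge"] != "both"]
print(unmatched[["Name", "Embarked", "EmbarkedLetter", "CityName", "_merge"]])
```

In this toy example, the unmatched rows are the passenger who embarked at `C` (`left_only`) and the `N` lookup row (`right_only`), which mirrors the blank cells in the outer join output above.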
