# Types of Joins

We have seen how to _merge_ (or _join_) two data sets by matching on certain variables. But what happens when a row in one `DataFrame` has no match in the other?

First, let's investigate how _pandas_ handles this situation by default. The name "Nevaeh", which is "heaven" spelled backwards, gained popularity after Sonny Sandoval of the band P.O.D. gave his daughter the name in 2000. Let's look at how common this name was five years earlier and five years after.

In [None]:
import pandas as pd

data_dir = "http://dlsun.github.io/pods/data/names/"

names1995 = pd.read_csv(data_dir + "yob1995.txt",
                        header=None, names=["Name", "Sex", "Count"])
names2005 = pd.read_csv(data_dir + "yob2005.txt",
                        header=None, names=["Name", "Sex", "Count"])

In [None]:
names1995[names1995.Name == "Nevaeh"]

In [None]:
names2005[names2005.Name == "Nevaeh"]

In 1995, there were no girls (or at least fewer than 5) named Nevaeh; just ten years later, there were over 4500 girls (and even 56 boys) with the name. It seems that Sonny Sandoval had a huge effect.

What happens to the name "Nevaeh" when we merge the two data sets?

In [None]:
names = names1995.merge(names2005, on=["Name", "Sex"])
names[names.Name == "Nevaeh"]

By default, _pandas_ only includes combinations that are present in _both_ `DataFrame`s. If it cannot find a match for a row in one `DataFrame`, then the combination is simply dropped.
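
To see this default behavior in isolation, here is a minimal sketch on two small, made-up `DataFrame`s. The names `left` and `right` and the counts in them are hypothetical and are not taken from the real data.

```python
import pandas as pd

# Hypothetical toy data, not the real name counts.
left = pd.DataFrame({"Name": ["Ava", "Ava", "Max"],
                     "Sex": ["F", "M", "M"],
                     "Count": [10, 5, 20]})
right = pd.DataFrame({"Name": ["Ava", "Max", "Nevaeh"],
                      "Sex": ["F", "M", "F"],
                      "Count": [30, 15, 40]})

# No how= argument is given, so pandas performs an inner join on (Name, Sex).
merged = left.merge(right, on=["Name", "Sex"])

# ("Ava", "M") appears only in `left` and ("Nevaeh", "F") only in `right`,
# so neither combination survives the merge.
print(merged)
```

Only the two combinations present in both toy `DataFrame`s remain. Because both inputs have a column called `Count`, _pandas_ distinguishes them with the default suffixes `_x` and `_y`.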

But in this context, the fact that a name does not appear in one data set is informative. It means that no babies (or at least very few) were born in that year with that name. We might want to also include names that appeared in only one of the two `DataFrame`s, rather than just the names that appeared in both.

There are four types of joins, distinguished by whether they keep the unmatched rows from the left `DataFrame`, the right `DataFrame`, both, or neither:

1. _inner join_ (default): only combinations that are present in both `DataFrame`s are included in the result.
2. _outer join_: any combination that appears in either `DataFrame` is included in the result.
3. _left join_: any combination that appears in the left `DataFrame` is included in the result, whether or not it appears in the right `DataFrame`.
4. _right join_: any combination that appears in the right `DataFrame` is included in the result, whether or not it appears in the left `DataFrame`.

One way to visualize the different types of joins is using Venn diagrams. The shaded region indicates which rows are included in the output. For example, only rows that appear in both the left and right `DataFrame`s are included in the output of an inner join.

![](https://github.com/dlsun/pods/blob/master/09-Joining-Tabular-Data/joins.png?raw=1)

In _pandas_, the join type is specified using the `how=` argument.
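
As a quick illustration of `how=`, the sketch below runs all four join types on two small, made-up `DataFrame`s and records how many rows each returns. The data and the variable names (`left`, `right`, `row_counts`) are hypothetical; the `suffixes=` argument is used only to give the two `Count` columns readable names.

```python
import pandas as pd

# Hypothetical toy data: two combinations are shared, and each side has one
# combination that the other lacks.
left = pd.DataFrame({"Name": ["Ava", "Max", "Zed"],
                     "Sex": ["F", "M", "M"],
                     "Count": [10, 20, 7]})
right = pd.DataFrame({"Name": ["Ava", "Max", "Nevaeh"],
                      "Sex": ["F", "M", "F"],
                      "Count": [30, 15, 40]})

row_counts = {}
for how in ["inner", "outer", "left", "right"]:
    result = left.merge(right, on=["Name", "Sex"], how=how,
                        suffixes=("_left", "_right"))
    row_counts[how] = len(result)

print(row_counts)
# {'inner': 2, 'outer': 4, 'left': 3, 'right': 3}
```

The inner join keeps only the 2 shared combinations, the left and right joins each keep all 3 rows of one side, and the outer join keeps all 4 distinct combinations.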

Now let's look at examples of each of these types of joins. Pay attention to the numbers of rows that each join returns.

In [None]:
# inner join
names_inner = names1995.merge(names2005, on=["Name", "Sex"], how="inner")
names_inner

In [None]:
# outer join
names_outer = names1995.merge(names2005, on=["Name", "Sex"], how="outer")
names_outer

Names like "Zyrell" and "Zyron" appeared in the 2005 data but not the 1995 data. For this reason, their count in 1995 is `NaN`. In general, there will be missing values in `DataFrame`s that result from an outer join. Any time a value appears in one `DataFrame` but not the other, there will be `NaN`s in the columns from the `DataFrame` missing that value.
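
When it matters which `DataFrame` a row came from, `merge` also accepts `indicator=True`, which adds a `_merge` column labeling each row as `left_only`, `right_only`, or `both`. The following is a minimal sketch on made-up data; the `DataFrame`s and their counts are hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for two years of name counts.
left = pd.DataFrame({"Name": ["Ava", "Max", "Zed"],
                     "Sex": ["F", "M", "M"],
                     "Count": [10, 20, 7]})
right = pd.DataFrame({"Name": ["Ava", "Max", "Nevaeh"],
                      "Sex": ["F", "M", "F"],
                      "Count": [30, 15, 40]})

# indicator=True adds a "_merge" column recording the source of each row.
outer = left.merge(right, on=["Name", "Sex"], how="outer",
                   suffixes=("_1995", "_2005"), indicator=True)

print(outer)
print(outer["_merge"].value_counts())
```

Rows labeled `right_only` are exactly those with `NaN` in the column from the left `DataFrame`, and vice versa for `left_only`.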

In [None]:
names_outer.isnull().sum()

By contrast, there are no `NaN`s when we do an inner join. That is because we restrict to only **Name** and **Sex** pairs that appeared in both `DataFrame`s, which guarantees that there are counts for both 1995 and 2005.

In [None]:
names_inner.isnull().sum()

Left and right joins keep every row from one `DataFrame` but only the matching rows from the other. For example, if we were trying to calculate the percentage change for each name from 1995 to 2005, we would want to include all of the names that appeared in the 1995 data. If the name did not appear in the 2005 data, then that is informative.

In [None]:
# left join
names_left = names1995.merge(names2005, on=["Name", "Sex"], how="left")
names_left

The result of a left join has `NaN`s in the column from the right `DataFrame`, wherever a row of the left `DataFrame` had no match.

In [None]:
names_left.isnull().sum()

The result of a right join, on the other hand, has `NaN`s in the column from the left `DataFrame`, wherever a row of the right `DataFrame` had no match.

In [None]:
# right join
names_right = names1995.merge(names2005, on=["Name", "Sex"], how="right")
names_right

In [None]:
names_right.isnull().sum()
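
Returning to the percentage-change calculation mentioned above, here is one way it might be sketched with a left join. The data are made up, and the sketch assumes that a name missing from the later year can be treated as a count of 0; in real data a missing name may only mean the count fell below the reporting threshold, so this is an assumption rather than a fact about the data.

```python
import pandas as pd

# Hypothetical toy data standing in for the 1995 and 2005 counts.
toy1995 = pd.DataFrame({"Name": ["Ava", "Max", "Zed"],
                        "Sex": ["F", "M", "M"],
                        "Count": [10, 20, 8]})
toy2005 = pd.DataFrame({"Name": ["Ava", "Max", "Nevaeh"],
                        "Sex": ["F", "M", "F"],
                        "Count": [30, 15, 40]})

# A left join keeps every 1995 combination, matched or not.
changes = toy1995.merge(toy2005, on=["Name", "Sex"], how="left",
                        suffixes=("_1995", "_2005"))

# Assumption: a name absent from the 2005 data had 0 births in 2005.
changes["Count_2005"] = changes["Count_2005"].fillna(0)

# Percentage change relative to the 1995 count.
changes["PctChange"] = (
    100 * (changes["Count_2005"] - changes["Count_1995"]) / changes["Count_1995"]
)

print(changes)
```

A left join suits this calculation because the 1995 count is the denominator: names that appear only in 2005 have no baseline, so their percentage change would be undefined.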