# Tutorial 10 - Joining `DataFrames`

In data analysis you will often need to combine the data from two or more `DataFrames`.  

This is known as *joining*, which is a term that is borrowed from the world of SQL.  

The effect of joining is similar to the lookup functions in Excel.

## Loading Packages

Let's load the packages we will need for this tutorial.

In [1]:
##> import numpy as np
##> import pandas as pd




## Reading-In Data

The data set we are going to use is a list of ETFs that have weekly expiring options.  What does that mean?  Most stocks or ETFs have exchange-traded options that expire every month, and at any given time the monthly expiring options go out about a year.  The most liquid underlyings also have options that expire every week; these weekly expiring options go out about 6-8 weeks.

This is a list that is published by the CBOE and it consists of all the ETFs that have weekly options trading.

In [2]:
##> df_weekly = pd.read_csv("../data/weekly_etf.csv")
##> df_weekly.head()




The next data set that we are going to load is a comprehensive list of all ETFs that are trading in the market.

In [3]:
##> df_etf = pd.read_csv("../data/etf_list.csv")
##> df_etf.head()




**Motivation:** Notice that `df_etf` has a `segment` column which `df_weekly` does not.  This `segment` column contains asset-class information that could be useful for categorizing the weekly ETFs.

**Objective:** we want to get the `segment` column into `df_weekly`.

There are a couple of ways of accomplishing this in `pandas`, and both of them involve the `pd.merge()` function:

1. *inner-joins*

2. *left/right-joins* (sometimes called *left/right outer joins*)

## Inner-Join

As with many of the basic operations in data analysis, it's easiest to understand inner-joins by digging into an example.

Here is the line of code that accomplishes most of the work that we want done:

In [4]:
##> pd.merge(df_weekly, df_etf, how='inner', left_on='ticker', right_on='symbol')




Observations on the syntax:

1. The first two arguments of `pd.merge()` are the two `DataFrames` we want to join together.  The first `DataFrame` is the *left* `DataFrame` and the second one is the *right* `DataFrame`.

2. The `how` argument defines the type of join.

3. `left_on` is the column in the left table that will be used for matching, `right_on` is the column in the right table that will be used for matching.
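
To see these arguments in isolation, here is a minimal sketch on two tiny, made-up `DataFrames`.  The tickers, names, and segments below are hypothetical placeholders, not rows from the CSV files used in this tutorial.

```python
import pandas as pd

# Hypothetical toy data (not taken from weekly_etf.csv or etf_list.csv).
toy_left = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC"],
    "name": ["Fund A", "Fund B", "Fund C"],
})
toy_right = pd.DataFrame({
    "symbol": ["AAA", "BBB", "DDD"],
    "segment": ["Equity", "Bond", "Commodity"],
})

# Left table first, right table second; match toy_left.ticker to toy_right.symbol.
toy_inner = pd.merge(
    toy_left, toy_right, how="inner", left_on="ticker", right_on="symbol"
)

# Only AAA and BBB appear in both tables, so only those two rows survive.
print(toy_inner)
```

Since `CCC` exists only in the left table and `DDD` only in the right one, neither appears in the inner-joined result.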


Observations on output:

1. The output is basically the two tables smashed together; however, only the rows with a matching ticker/symbol are retained in the output.  All columns of both tables are included.

2. `df_weekly` had 67 rows in it, and `df_etf` had 2,160 rows in it.  The `DataFrame` that results from `pd.merge()` has 66 rows in it.

3. Notice that both `df_weekly` and `df_etf` have a column called `name`.  In the joined dataframe, suffixes of `_x` and `_y` have been added to the column names to make them unambiguous.
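
If the default `_x`/`_y` suffixes are hard to read, `pd.merge()` also accepts a `suffixes` argument.  The sketch below uses hypothetical toy data, and the suffix strings `_weekly` and `_etf` are just illustrative choices.

```python
import pandas as pd

# Hypothetical toy data; both tables share a column called "name".
toy_weekly = pd.DataFrame({
    "ticker": ["AAA", "BBB"],
    "name": ["Fund A", "Fund B"],
})
toy_etf = pd.DataFrame({
    "symbol": ["AAA", "BBB"],
    "name": ["Fund A Trust", "Fund B Trust"],
    "segment": ["Equity", "Bond"],
})

# Default behavior: the overlapping column gets _x (left) and _y (right).
default_cols = pd.merge(
    toy_weekly, toy_etf, how="inner", left_on="ticker", right_on="symbol"
)

# Custom suffixes make it clearer which table each column came from.
custom_cols = pd.merge(
    toy_weekly, toy_etf, how="inner", left_on="ticker", right_on="symbol",
    suffixes=("_weekly", "_etf"),
)

print(list(default_cols.columns))
print(list(custom_cols.columns))
```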

Let's do a little clean up of our `DataFrame` so that it's just the information that we wanted: `df_weekly` with the `segment` column added to it.

In [5]:
##> # keeping only the columns that we want
##> # assigning the joined result to a variable called df_joined  
##> df_joined = \
##>     pd.merge(df_weekly, df_etf, how='inner', left_on='ticker', right_on='symbol') \
##>     [['ticker', 'name_x', 'segment']]
##> 
##> # renaming the 'name' column
##> df_joined.rename(columns={'name_x':'name'}, inplace=True)
##> 
##> # let's look at our result
##> df_joined.head()




## Left-Join

Notice that in the inner-join example from the previous section, the original list of ETFs with weekly options (`df_weekly`) had 67 rows, but the joined table with the `segment` column added only has 66 rows.

In [6]:
##> print(df_weekly.shape)
##> print(df_joined.shape)




So what happened?  This means that one of the `ticker` values in `df_weekly` had no matching `symbol` in `df_etf`.

Inner-joins, by design, are only intended to retain rows that have matches in both tables.  This may or may not be the desired behavior you are looking for.

Let's say that instead we wanted to keep *ALL* the rows in the left `DataFrame`, `df_weekly`, irrespective of whether there is a match in the right `DataFrame`.

This is precisely what a *left-join* is.  The syntax is exactly the same as before, except that the `how` argument is set to `left`.

In [7]:
##> pd.merge(df_weekly, df_etf, how='left', left_on='ticker', right_on='symbol')



Observations:

1. Notice that ticker FTK has `NaNs` for all the columns from `df_etf`.  That's because it doesn't exist in `df_etf`.
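
As a small illustration of the same behavior, the sketch below runs a left-join on hypothetical toy data and passes `indicator=True`, which asks `pd.merge()` to add a `_merge` column recording whether each row was found in both tables or only in the left one.  The tickers and segments are made up for the example.

```python
import pandas as pd

# Hypothetical toy data (not the tutorial's CSV files).
toy_l = pd.DataFrame({"ticker": ["AAA", "BBB", "CCC"]})
toy_r = pd.DataFrame({
    "symbol": ["AAA", "BBB", "DDD"],
    "segment": ["Equity", "Bond", "Commodity"],
})

# how='left' keeps every row of toy_l; indicator=True adds a _merge column.
toy_left_joined = pd.merge(
    toy_l, toy_r, how="left", left_on="ticker", right_on="symbol",
    indicator=True,
)

# CCC has no match in toy_r, so its right-hand columns are NaN
# and its _merge value is 'left_only'.
print(toy_left_joined)
print(len(toy_l), len(toy_left_joined))
```

Note that the row count is preserved here because each `symbol` appears only once in the right table; if the right table had duplicate keys, a left-join would return more rows than the left table has.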


**Research Challenge:** Google `FTK` and figure out why it's not in `df_etf`.

**Coding Challenge:** Use `DataFrame` masking to grab the row of the left-joined `DataFrame` above that consists of a non-match.  Hint: the function `pd.isna()` takes an array as its argument; it returns `True` for every entry that is `NaN`, and `False` otherwise.

Let's clean up our result so it's exactly the information that we wanted in the first place: `df_weekly` with the `segment` column added to it.

In [8]:
##> # keeping only the columns that we want
##> # assigning the joined result to a variable called df_joined  
##> df_joined = \
##>     pd.merge(df_weekly, df_etf, how='left', left_on='ticker', right_on='symbol')\
##>     [['ticker', 'name_x', 'segment']]
##> 
##> # renaming the 'name' column
##> df_joined.rename(columns={'name_x':'name'}, inplace=True)
##> 
##> # let's look at our result
##> df_joined.head()




**Coding Challenge:** Create the same `df_joined` as above, but use a *right-join* instead of a *left-join*.

## Related Reading

*PDSH* - 3.7 - Combining Datasets: Merging and Joining