# Merging `DataFrames`

In this tutorial we will learn how to *merge* together two `DataFrames`.  In SQL and R, this is referred to as *joining*.

The effect of merging is similar to the lookup functions in Excel.

## Loading Packages

Let's load the packages we will need for this tutorial.

In [1]:
import numpy as np
import pandas as pd

## Reading-In Data

The data set we are going to use is a list of ETFs that have weekly expiring options.  What does that mean?  Most stocks or ETFs have exchange traded options that expire every month, and at any given time the monthly expiring options go out about a year.  The most liquid underlyings actually have options that expire every week; these weekly expiring options go out about 6-8 weeks.

This is a list that is published by the CBOE and it consists of all the ETFs that have weekly options trading.

In [2]:
df_weekly = pd.read_csv('weekly_etf.csv')
df_weekly.head()

Unnamed: 0,ticker,name
0,AMJ,JP Morgan Alerian MLP Index ETN
1,AMLP,Alerian MLP ETF
2,ASHR,Xtrackers Harvest CSI 300 China A-Shares ETF
3,DIA,SPDR Dow Jones Ind Av ETF Trust
4,DUST,Direxion Daily Gold Miners Index Bear 3X Shares


The next data set that we are going to load is a comprehensive list of all ETFs that are trading in the market.

In [3]:
df_etf = pd.read_csv("etf.csv")
df_etf.head()

Unnamed: 0,symbol,name,issuer,expense_ratio,aum,spread,segment
0,SPY,SPDR S&P 500 ETF Trust,State Street Global Advisors,0.09%,$275.42B,0.00%,Equity: U.S. - Large Cap
1,IVV,iShares Core S&P 500 ETF,BlackRock,0.04%,$155.86B,0.01%,Equity: U.S. - Large Cap
2,VTI,Vanguard Total Stock Market ETF,Vanguard,0.04%,$103.58B,0.01%,Equity: U.S. - Total Market
3,VOO,Vanguard S&P 500 ETF,Vanguard,0.04%,$96.91B,0.01%,Equity: U.S. - Large Cap
4,EFA,iShares MSCI EAFE ETF,BlackRock,0.32%,$72.12B,0.01%,Equity: Developed Markets Ex-U.S. - Total Market


**Motivation:** Notice that `df_etf` has a `segment` column which `df_weekly` does not.  This `segment` column contains asset-class information that could be useful for categorizing the weekly ETFs.

**Objective:** we want to get the `segment` column into `df_weekly`.

There are a couple of ways of accomplishing this in `pandas`, and both of them involve the `pd.merge()` function:

1. *inner-merge*

2. *left/right-merge* (in SQL these are called *left/right outer joins*)

## Inner

As with many of the basic operations in data analysis, it's easiest to understand inner-merges by digging into an example.

Here is the line of code that accomplishes most of the work that we want done:

In [4]:
pd.merge(df_weekly, df_etf, how='inner', left_on='ticker', right_on='symbol')

Unnamed: 0,ticker,name_x,symbol,name_y,issuer,expense_ratio,aum,spread,segment
0,AMJ,JP Morgan Alerian MLP Index ETN,AMJ,J.P. Morgan Alerian MLP Index ETN,JPMorgan,0.85%,$3.45B,0.04%,Equity: U.S. MLPs
1,AMLP,Alerian MLP ETF,AMLP,Alerian MLP ETF,ALPS,0.85%,$10.64B,0.10%,Equity: U.S. MLPs
2,ASHR,Xtrackers Harvest CSI 300 China A-Shares ETF,ASHR,Xtrackers Harvest CSI 300 China A-Shares ETF,Deutsche Bank,0.65%,$630.14M,0.04%,Equity: China - Total Market
3,DIA,SPDR Dow Jones Ind Av ETF Trust,DIA,SPDR Dow Jones Industrial Average ETF Trust,State Street Global Advisors,0.17%,$21.70B,0.01%,Equity: U.S. - Large Cap
4,DUST,Direxion Daily Gold Miners Index Bear 3X Shares,DUST,Direxion Daily Gold Miners Index Bear 3x Shares,Direxion,1.08%,$122.21M,0.06%,Inverse Equity: Global Gold Miners
...,...,...,...,...,...,...,...,...,...
61,XLV,HEALTH CARE SELECT SECTOR SPDR,XLV,Health Care Select Sector SPDR Fund,State Street Global Advisors,0.13%,$17.49B,0.01%,Equity: U.S. Health Care
62,XLY,Consumer Discretionary Select Sector SPDR,XLY,Consumer Discretionary Select Sector SPDR Fund,State Street Global Advisors,0.13%,$14.35B,0.01%,Equity: U.S. Consumer Cyclicals
63,XME,SPDR S&P Metals & Mining ETF,XME,SPDR S&P Metals & Mining ETF,State Street Global Advisors,0.35%,$879.10M,0.03%,Equity: U.S. Metals & Mining
64,XOP,P Oil & Gas Exploration & Production ETF,XOP,SPDR S&P Oil & Gas Exploration & Production ETF,State Street Global Advisors,0.35%,$3.06B,0.02%,Equity: U.S. Oil & Gas Exploration & Production


Observations on the syntax:

1. The first two arguments of `pd.merge()` are the two `DataFrames` we want to merge together.  The first `DataFrame` is the *left* `DataFrame` and the second one is the *right* `DataFrame`. 

2. The `how` argument defines the type of merge.

3. `left_on` is the column in the left table that will be used for matching, `right_on` is the column in the right table that will be used for matching.
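To make the roles of these arguments concrete, here is a minimal sketch on two toy tables. The `DataFrames` `left` and `right` and their contents are hypothetical and are not part of the tutorial data; only the `pd.merge()` arguments mirror the call above.

```python
import pandas as pd

# Hypothetical toy tables, much smaller than df_weekly and df_etf.
left = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC"],
    "name": ["Fund A", "Fund B", "Fund C"],
})
right = pd.DataFrame({
    "symbol": ["AAA", "CCC", "DDD"],
    "segment": ["Equity", "Bonds", "Commodities"],
})

# Rows are paired wherever left["ticker"] equals right["symbol"].
# With how="inner", only tickers present in both tables survive.
merged = pd.merge(left, right, how="inner", left_on="ticker", right_on="symbol")
print(merged)
```

In this sketch `BBB` (left only) and `DDD` (right only) drop out, and both key columns, `ticker` and `symbol`, appear in the result.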


Observations on output:

1. The output is essentially the two tables joined side by side, but only the rows with a matching ticker/symbol are retained.  All columns of both tables are included.

2. `df_weekly` had 67 rows in it, and `df_etf` had 2,160 rows in it.  The `DataFrame` that results from `pd.merge()` has 66 rows in it.

3. Notice that both `df_weekly` and `df_etf` have a column called `name`.  In the merged `DataFrame`, suffixes of `_x` and `_y` have been added to the column names to make them unique.
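The `_x` and `_y` suffixes are the defaults; `pd.merge()` also accepts a `suffixes` argument if we prefer more descriptive names. The sketch below uses hypothetical toy tables (`left`, `right`) and suffix names of our own choosing, so treat it as an illustration rather than part of the tutorial's workflow.

```python
import pandas as pd

# Hypothetical toy tables that share a column called "name".
left = pd.DataFrame({"ticker": ["AAA", "BBB"], "name": ["FUND A", "FUND B"]})
right = pd.DataFrame({"symbol": ["AAA", "BBB"], "name": ["Fund A Trust", "Fund B Trust"]})

# suffixes replaces the default ("_x", "_y") pair for overlapping column names.
renamed = pd.merge(
    left, right,
    how="inner", left_on="ticker", right_on="symbol",
    suffixes=("_weekly", "_etf"),
)
print(renamed.columns.tolist())
```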

Let's do a little clean up of our `DataFrame` so that it's just the information that we wanted: `df_weekly` with the `segment` column added to it.  Notice that `.merge()` is also a `DataFrame` method, and we use this form to invoke method chaining.

In [5]:
df_inner = \
    (
    df_weekly
        .merge(df_etf, how='inner', left_on='ticker', right_on='symbol')
        [['ticker', 'name_x', 'segment']]
        .rename(columns={'name_x':'name'})
    )
df_inner

Unnamed: 0,ticker,name,segment
0,AMJ,JP Morgan Alerian MLP Index ETN,Equity: U.S. MLPs
1,AMLP,Alerian MLP ETF,Equity: U.S. MLPs
2,ASHR,Xtrackers Harvest CSI 300 China A-Shares ETF,Equity: China - Total Market
3,DIA,SPDR Dow Jones Ind Av ETF Trust,Equity: U.S. - Large Cap
4,DUST,Direxion Daily Gold Miners Index Bear 3X Shares,Inverse Equity: Global Gold Miners
...,...,...,...
61,XLV,HEALTH CARE SELECT SECTOR SPDR,Equity: U.S. Health Care
62,XLY,Consumer Discretionary Select Sector SPDR,Equity: U.S. Consumer Cyclicals
63,XME,SPDR S&P Metals & Mining ETF,Equity: U.S. Metals & Mining
64,XOP,P Oil & Gas Exploration & Production ETF,Equity: U.S. Oil & Gas Exploration & Production


## Left

Notice that in the inner-merge example from the previous section, the original `DataFrame` of ETFs with weekly options (`df_weekly`) had 67 rows, but the merged `DataFrame` with the `segment` column added (`df_inner`) only has 66 rows.

In [6]:
print(df_weekly.shape)
print(df_inner.shape)

(67, 2)
(66, 3)


So what happened?  One of the `ticker` values in `df_weekly` had no matching `symbol` in `df_etf`.

Inner-merges, by design, are only intended to retain rows that have matches in both tables.  This may or may not be the desired behavior you are looking for.
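One way to see which rows an inner-merge drops is to check the left keys against the right keys directly. This is a sketch on hypothetical toy tables (`left`, `right`), using `Series.isin()` as one possible approach.

```python
import pandas as pd

# Hypothetical toy tables; "BBB" has no counterpart in the right table.
left = pd.DataFrame({"ticker": ["AAA", "BBB", "CCC"]})
right = pd.DataFrame({"symbol": ["AAA", "CCC", "DDD"]})

inner = left.merge(right, how="inner", left_on="ticker", right_on="symbol")

# Left rows whose key never appears among the right keys are the ones
# that an inner-merge silently discards.
unmatched = left[~left["ticker"].isin(right["symbol"])]

print(len(left), len(inner))
print(unmatched)
```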

Let's say that instead we wanted to keep *ALL* the rows in the left `DataFrame`, `df_weekly`, irrespective of whether there is a match in the right `DataFrame`.

This is precisely what a *left-merge* is.  The syntax is exactly the same as before, except that the `how` argument is set to `'left'`.

In [7]:
pd.merge(df_weekly, df_etf, how='left', left_on='ticker', right_on='symbol')

Unnamed: 0,ticker,name_x,symbol,name_y,issuer,expense_ratio,aum,spread,segment
0,AMJ,JP Morgan Alerian MLP Index ETN,AMJ,J.P. Morgan Alerian MLP Index ETN,JPMorgan,0.85%,$3.45B,0.04%,Equity: U.S. MLPs
1,AMLP,Alerian MLP ETF,AMLP,Alerian MLP ETF,ALPS,0.85%,$10.64B,0.10%,Equity: U.S. MLPs
2,ASHR,Xtrackers Harvest CSI 300 China A-Shares ETF,ASHR,Xtrackers Harvest CSI 300 China A-Shares ETF,Deutsche Bank,0.65%,$630.14M,0.04%,Equity: China - Total Market
3,DIA,SPDR Dow Jones Ind Av ETF Trust,DIA,SPDR Dow Jones Industrial Average ETF Trust,State Street Global Advisors,0.17%,$21.70B,0.01%,Equity: U.S. - Large Cap
4,DUST,Direxion Daily Gold Miners Index Bear 3X Shares,DUST,Direxion Daily Gold Miners Index Bear 3x Shares,Direxion,1.08%,$122.21M,0.06%,Inverse Equity: Global Gold Miners
...,...,...,...,...,...,...,...,...,...
62,XLV,HEALTH CARE SELECT SECTOR SPDR,XLV,Health Care Select Sector SPDR Fund,State Street Global Advisors,0.13%,$17.49B,0.01%,Equity: U.S. Health Care
63,XLY,Consumer Discretionary Select Sector SPDR,XLY,Consumer Discretionary Select Sector SPDR Fund,State Street Global Advisors,0.13%,$14.35B,0.01%,Equity: U.S. Consumer Cyclicals
64,XME,SPDR S&P Metals & Mining ETF,XME,SPDR S&P Metals & Mining ETF,State Street Global Advisors,0.35%,$879.10M,0.03%,Equity: U.S. Metals & Mining
65,XOP,P Oil & Gas Exploration & Production ETF,XOP,SPDR S&P Oil & Gas Exploration & Production ETF,State Street Global Advisors,0.35%,$3.06B,0.02%,Equity: U.S. Oil & Gas Exploration & Production


Let's put this left-merged table into a `DataFrame` called `df_left`, and perform a bit of data munging.

In [8]:
df_left = \
    (
    df_weekly
        .merge(df_etf, how='left', left_on='ticker', right_on='symbol')
        [['ticker', 'name_x', 'segment']]
        .rename(columns={'name_x':'name'})
    )
df_left

Unnamed: 0,ticker,name,segment
0,AMJ,JP Morgan Alerian MLP Index ETN,Equity: U.S. MLPs
1,AMLP,Alerian MLP ETF,Equity: U.S. MLPs
2,ASHR,Xtrackers Harvest CSI 300 China A-Shares ETF,Equity: China - Total Market
3,DIA,SPDR Dow Jones Ind Av ETF Trust,Equity: U.S. - Large Cap
4,DUST,Direxion Daily Gold Miners Index Bear 3X Shares,Inverse Equity: Global Gold Miners
...,...,...,...
62,XLV,HEALTH CARE SELECT SECTOR SPDR,Equity: U.S. Health Care
63,XLY,Consumer Discretionary Select Sector SPDR,Equity: U.S. Consumer Cyclicals
64,XME,SPDR S&P Metals & Mining ETF,Equity: U.S. Metals & Mining
65,XOP,P Oil & Gas Exploration & Production ETF,Equity: U.S. Oil & Gas Exploration & Production
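A left-merge keeps unmatched rows from the left table and fills the right-hand columns with `NaN`. If we also want an explicit flag for where each row came from, `merge()` accepts `indicator=True`, which adds a `_merge` column. The sketch below uses hypothetical toy tables (`left`, `right`), not the tutorial data.

```python
import pandas as pd

# Hypothetical toy tables; "BBB" exists only in the left table.
left = pd.DataFrame({"ticker": ["AAA", "BBB", "CCC"]})
right = pd.DataFrame({
    "symbol": ["AAA", "CCC", "DDD"],
    "segment": ["Equity", "Bonds", "Commodities"],
})

# indicator=True adds a "_merge" column whose values are
# "both", "left_only", or "right_only".
flagged = left.merge(
    right, how="left", left_on="ticker", right_on="symbol", indicator=True
)
print(flagged[["ticker", "segment", "_merge"]])
```

Here every left row is retained, and the row for `BBB` is marked `left_only` with a missing `segment`.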


**Code Challenge:** Use `.query()` on `df_left` to verify that `ticker` `FTK` has `NaN` in the `segment` column, which came from `df_etf`. Do this in two separate ways:

1. querying on `ticker`
2. querying on `segment`

In [9]:
df_left.query('ticker == "FTK"')

Unnamed: 0,ticker,name,segment
17,FTK,FLOTEK INDUSTRIES INC,


In [10]:
df_left.query('segment.isnull()')

Unnamed: 0,ticker,name,segment
17,FTK,FLOTEK INDUSTRIES INC,
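For comparison, the same kind of check can be written without `.query()`, using a boolean mask built from `Series.isna()`. This is a sketch on a hypothetical toy table (`toy`), shown only as an alternative idiom.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with one missing segment.
toy = pd.DataFrame({
    "ticker": ["AAA", "BBB"],
    "segment": ["Equity", np.nan],
})

# isna() returns True where the value is missing; the mask keeps those rows.
missing = toy[toy["segment"].isna()]
print(missing)
```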


**Research Challenge:** Google `FTK` and figure out why it's not in `df_etf`.

In [11]:
# FTK is a stock, not an ETF.

## Related Reading

*Python Data Science Handbook* (*PDSH*) - 3.7 - Combining Datasets: Merge and Join