# Tutorial 12 - OCC Volume Analysis

The purpose of this analysis is to put together a variety of the skills we have learned in previous tutorials to do a bit of analysis work that's akin to something you might do in a professional context.  The analysis will consist of:

1. Producing a volume-by-underlying report from a data file sourced from the Options Clearing Corporation (OCC).  The data in the OCC file is for the month of August 2018. The data is far more granular than we need, so we will need to group and summarize (a very common task in data analysis).  

2. Combining the monthly volume report with a master list of ETFs to determine the 100 highest volume non-volatility ETFs.


Don't worry if you are not familiar with the finance concepts discussed in this tutorial; my intention is to focus on the mechanics of the analysis.

### Loading Packages

The `pandas` package contains much of the data wrangling functionality that we will need.  For those of you who are familiar with R, you can think of `pandas` as the Python equivalent of R's `tidyverse`, although `pandas` has a larger scope than the core tidyverse.

In [1]:
##> import numpy as np
##> import pandas as pd

import numpy as np
import pandas as pd

**Note:** In theory you can give *pandas* any alias that you want, but it would be *highly* non-pythonic to call it anything other than **pd**.

Next we are going to make a couple of changes to the way that the notebook behaves (both of these are largely a matter of preference). This first bit of code changes the maximum number of rows that will be displayed when we print a `DataFrame`.


In [2]:
##> pd.options.display.max_rows = 6

pd.options.display.max_rows = 6
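
If you would rather not change the display setting for the whole notebook, `pandas` also offers `pd.option_context()`, which applies an option only inside a `with` block. The following is a minimal sketch; `df_demo` is a hypothetical toy frame that is not part of the tutorial's data.

```python
import pandas as pd

# Hypothetical toy frame, used only so there is something to display.
df_demo = pd.DataFrame({"n": range(20)})

# Remember the current setting so we can confirm it is restored afterwards.
before = pd.get_option("display.max_rows")

# Inside the block, frames longer than 6 rows are truncated when printed.
with pd.option_context("display.max_rows", 6):
    inside = pd.get_option("display.max_rows")
    print(df_demo)

# Outside the block, the previous setting is back in effect.
after = pd.get_option("display.max_rows")
```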

Next, we will make it so that the output of every expression in a cell is printed.  The default behavior is that only the output of the last expression is printed.

In [3]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

### Reading In Data from a CSV

The `pandas` library has a `read_csv()` function that will read in a table of data from a CSV file and put it in a `DataFrame`, which is the main data object in `pandas`.

In [4]:
##> df_occ = pd.read_csv('data/occ_option_volume_201808.csv')
##> df_occ.head()

df_occ = pd.read_csv('../data/occ_option_volume_201808.csv')
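
If you don't have the OCC file handy, you can still try `pd.read_csv()` on a small in-memory sample. The sketch below assumes the file has the seven columns that appear in the printed output later in this tutorial; the three rows are copied from that output, and `csv_text` and `df_sample` are names made up for this illustration.

```python
import io

import pandas as pd

# A tiny stand-in for the OCC file (assumed column layout, rows taken
# from the printed output shown in this tutorial).
csv_text = """quantity,underlying,symbol,actype,porc,exchange,actdate
5850,ABX,1ABX,C,C,AMEX,08/02/2018
3050,ABX,1ABX,C,C,AMEX,08/16/2018
1,ZYNE,ZYNE,M,P,EDGX,08/20/2018
"""

# read_csv accepts any file-like object, not just a path on disk.
df_sample = pd.read_csv(io.StringIO(csv_text))
print(df_sample.head())
```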

### Initial Exploration of the Data

Our data is a monthly option volume report from the OCC.  The data is broken down by trade-date, underlying, account type, puts/calls, and exchange. We can print the contents of the `DataFrame` by simply typing its name and then running that code.

In [5]:
##> df_occ

df_occ

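| | quantity | underlying | symbol | actype | porc | exchange | actdate |
|---|---|---|---|---|---|---|---|
| 0 | 5850 | ABX | 1ABX | C | C | AMEX | 08/02/2018 |
| 1 | 3050 | ABX | 1ABX | C | C | AMEX | 08/16/2018 |
| 2 | 3050 | ABX | 1ABX | F | C | AMEX | 08/16/2018 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1870438 | 1 | ZYNE | ZYNE | M | P | EDGX | 08/20/2018 |
| 1870439 | 25 | ZYNE | ZYNE | C | C | MCRY | 08/30/2018 |
| 1870440 | 25 | ZYNE | ZYNE | M | C | MCRY | 08/30/2018 |

1870441 rows × 7 columns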


Notice at the very bottom that the total number of rows and columns has been printed.  We can also get this information from the `.shape` property.

In [6]:
##> df_occ.shape

df_occ.shape

(1870441, 7)

The `type()` function is also useful for exploring objects.

In [7]:
# let's first create a list object to test on.
##> my_list = [1, "twelve", False]


# then let's check the type of my_list as well as df_occ
##> type(my_list)
##> type(df_occ)

type(df_occ)

pandas.core.frame.DataFrame

We can check the data type of each of the columns with the `.dtypes` property.

In [8]:
##> df_occ.dtypes

df_occ.dtypes

quantity       int64
underlying    object
symbol        object
               ...  
porc          object
exchange      object
actdate       object
Length: 7, dtype: object

Notice that all the string columns are given a data type of `object`.  Also notice that the `actdate` column was read in as a string, rather than a date, which we will fix later in this tutorial.

### Accessing Columns

We can access the columns of a dataframe by use of the `[` notation as well as the `.` notation.  Let's use both of these approaches to isolate the `underlying` column of `df_occ`.

In [9]:
# both of these are equivalent
##> df_occ['underlying']
##> df_occ.underlying

df_occ['underlying']
df_occ.underlying

0           ABX
1           ABX
2           ABX
           ... 
1870438    ZYNE
1870439    ZYNE
1870440    ZYNE
Name: underlying, Length: 1870441, dtype: object

0           ABX
1           ABX
2           ABX
           ... 
1870438    ZYNE
1870439    ZYNE
1870440    ZYNE
Name: underlying, Length: 1870441, dtype: object

We can use the `type()` function to see that a column of a `DataFrame` is a `Series`, which is a different kind of `pandas` object.

In [10]:
##> type(df_occ['underlying'])
##> type(df_occ.underlying)

type(df_occ['underlying'])
type(df_occ.underlying)

pandas.core.series.Series

pandas.core.series.Series

We won't get into the weeds about this point too much, but it is worth noting that a `DataFrame` is a bunch of `Series` glued together.  For those of you familiar with R, this is similar to the fact that a `data.frame` is a `list` of atomic vectors, all of the same length.
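
To make the "glued together" idea concrete, here is a small sketch that builds a `DataFrame` out of two `Series` and then pulls a column back out with both notations. The names `df_glued`, `col_bracket`, and `col_dot` are invented for this example, as is the `trade date` column, which is there only to show a case where the `.` notation cannot be used.

```python
import pandas as pd

# Two standalone Series of the same length.
underlying = pd.Series(["ABX", "ABX", "ZYNE"], name="underlying")
quantity = pd.Series([5850, 3050, 1], name="quantity")

# Glue them side by side (axis=1) to form a DataFrame.
df_glued = pd.concat([underlying, quantity], axis=1)

# Both notations give back the same Series.
col_bracket = df_glued["underlying"]
col_dot = df_glued.underlying
print(type(col_bracket))
print(col_bracket.equals(col_dot))

# Bracket notation also works for names that are not valid Python
# identifiers, such as a (hypothetical) column name containing a space.
df_glued["trade date"] = ["08/02/2018", "08/16/2018", "08/20/2018"]
print(df_glued["trade date"])
```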

### Refactoring the Date Column

Recall that `df_occ.actdate` is in fact an `object` data type, rather than a date.

In [11]:
# let's compare the dtype of the df_occ.quantity and df_occ.actdate
##> df_occ.quantity.dtype
##> df_occ.actdate.dtype

df_occ.quantity.dtype
df_occ.actdate.dtype

dtype('int64')

dtype('O')

We can use the `pandas.to_datetime()` function for the purposes of this refactoring.

In [12]:
##> pd.to_datetime(df_occ.actdate, format='%m/%d/%Y')

pd.to_datetime(df_occ.actdate, format='%m/%d/%Y')

0         2018-08-02
1         2018-08-16
2         2018-08-16
             ...    
1870438   2018-08-20
1870439   2018-08-30
1870440   2018-08-30
Name: actdate, Length: 1870441, dtype: datetime64[ns]

Note that the code above doesn't actually change `df_occ.actdate` but rather creates a new `Series` and prints it to the output.  We can test this by checking the `dtype` property of the column again.

In [13]:
##> df_occ.actdate.dtype

df_occ.actdate.dtype

dtype('O')

In order to actually effect the change we are looking for, we need to reassign to `df_occ.actdate`.

In [14]:
##> df_occ.actdate = pd.to_datetime(df_occ.actdate, format = '%m/%d/%Y')

df_occ.actdate = pd.to_datetime(df_occ.actdate, format = '%m/%d/%Y')

Now the refactoring from `object` to `datetime64[ns]` has actually occurred.  Let's check the `dtypes`.

In [15]:
##> df_occ.dtypes

df_occ.dtypes

quantity               int64
underlying            object
symbol                object
                   ...      
porc                  object
exchange              object
actdate       datetime64[ns]
Length: 7, dtype: object
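
As an aside, the conversion can also be requested at read time through the `parse_dates` argument of `pd.read_csv()`. The sketch below uses a hypothetical two-row sample (`csv_text`, `df_parsed`) rather than the real file. Note that here `pandas` infers the date format on its own, whereas the explicit `format='%m/%d/%Y'` used above leaves no room for ambiguity between month-first and day-first dates.

```python
import io

import pandas as pd

# Hypothetical two-row sample with the same date layout as the OCC file.
csv_text = """quantity,underlying,actdate
5850,ABX,08/02/2018
3050,ABX,08/16/2018
"""

# parse_dates asks read_csv to convert the listed columns to datetimes.
df_parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["actdate"])
print(df_parsed.dtypes)
```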

### Further Exploration - Unique Values

When you first encounter a data set, it is useful to explore the unique values in some of the columns to try to get a feel for what is in the data set.  Let's look at the `underlying` column and the `actdate` column.

In [16]:
##> df_occ.actdate.unique()

df_occ.actdate.unique()

array(['2018-08-02T00:00:00.000000000', '2018-08-16T00:00:00.000000000',
       '2018-08-13T00:00:00.000000000', '2018-08-09T00:00:00.000000000',
       '2018-08-10T00:00:00.000000000', '2018-08-22T00:00:00.000000000',
       '2018-08-06T00:00:00.000000000', '2018-08-15T00:00:00.000000000',
       '2018-08-23T00:00:00.000000000', '2018-08-24T00:00:00.000000000',
       '2018-08-07T00:00:00.000000000', '2018-08-21T00:00:00.000000000',
       '2018-08-14T00:00:00.000000000', '2018-08-01T00:00:00.000000000',
       '2018-08-03T00:00:00.000000000', '2018-08-17T00:00:00.000000000',
       '2018-08-30T00:00:00.000000000', '2018-08-31T00:00:00.000000000',
       '2018-08-27T00:00:00.000000000', '2018-08-20T00:00:00.000000000',
       '2018-08-08T00:00:00.000000000', '2018-08-28T00:00:00.000000000',
       '2018-08-29T00:00:00.000000000'], dtype='datetime64[ns]')

It looks like our dates are all in August of 2018, but it's hard to be sure from visual inspection because the printing is messy.  Let's use the `min` and `max` methods just to be sure.

In [17]:
##> df_occ.actdate.unique().min()
##> df_occ.actdate.unique().max()

df_occ.actdate.unique().min()
df_occ.actdate.unique().max()

numpy.datetime64('2018-08-01T00:00:00.000000000')

numpy.datetime64('2018-08-31T00:00:00.000000000')
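
The `.min()` and `.max()` methods can also be called directly on a datetime `Series`, without going through `.unique()` first. In that case the result is a `pandas` `Timestamp` rather than a `numpy.datetime64`. A minimal sketch on a hypothetical four-element `Series` named `dates`:

```python
import pandas as pd

# Hypothetical sample of dates in the same string format as the OCC file.
dates = pd.to_datetime(
    pd.Series(["08/02/2018", "08/16/2018", "08/01/2018", "08/31/2018"]),
    format="%m/%d/%Y",
)

# min/max work on the Series itself and return pandas Timestamp objects.
first_date = dates.min()
last_date = dates.max()
print(first_date, last_date)
```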

We next look at the unique values of `underlying` - there are a lot of them so they are not all printed.  Notice that the `.unique()` method returns an `array` object.  In fact this is an `ndarray`, which is the foundational data structure of `numpy`.  The `pandas` package is built on top of `numpy`.


Let's check the size of the array, with the `.size` property.

In [18]:
##> df_occ.underlying.unique()
##> df_occ.underlying.unique().size

df_occ.underlying.unique()
df_occ.underlying.unique().size

array(['ABX', 'AGI', 'ALB', ..., 'ZUMZ', 'ZUO', 'ZYNE'], dtype=object)

4243

**Coding Challenge:** How many individual dates are represented in `df_occ`?

### Grouping And Summarizing

Our ultimate objective is to know the total option volume for each underlying in the month of August 2018.  The data in its current form is far more granular than that.  We will do some aggregation in order to get the data in the form that we need.

In [19]:
##> df_occ

df_occ

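| | quantity | underlying | symbol | actype | porc | exchange | actdate |
|---|---|---|---|---|---|---|---|
| 0 | 5850 | ABX | 1ABX | C | C | AMEX | 2018-08-02 |
| 1 | 3050 | ABX | 1ABX | C | C | AMEX | 2018-08-16 |
| 2 | 3050 | ABX | 1ABX | F | C | AMEX | 2018-08-16 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1870438 | 1 | ZYNE | ZYNE | M | P | EDGX | 2018-08-20 |
| 1870439 | 25 | ZYNE | ZYNE | C | C | MCRY | 2018-08-30 |
| 1870440 | 25 | ZYNE | ZYNE | M | C | MCRY | 2018-08-30 |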


What we need to do is sum up the total quantity for each underlying.  We can do this with the `groupby()` method coupled with the `sum()` method.

In [20]:
##> df_occ.groupby('underlying')['quantity'].sum()


df_occ.groupby('underlying')['quantity'].sum()

underlying
A        166644
AA       612596
AABA    2914502
         ...   
ZUMZ       9882
ZUO      168444
ZYNE      36008
Name: quantity, Length: 4243, dtype: int64

As you can see, this is actually a `Series` object, not a `DataFrame`. (This is different from how `summarize()` works in `dplyr`.)

In [21]:
##> type(df_occ.groupby('underlying')['quantity'].sum())

type(df_occ.groupby('underlying')['quantity'].sum())

pandas.core.series.Series
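
To see exactly what the split-and-sum is doing, it can help to run it on a frame small enough to check by hand. The sketch below uses a hypothetical five-row frame, `df_toy`, with the same two column names. The group labels end up in the index of the resulting `Series`.

```python
import pandas as pd

# Hypothetical miniature of the OCC data: several rows per underlying.
df_toy = pd.DataFrame({
    "underlying": ["ABX", "ABX", "ABX", "ZYNE", "ZYNE"],
    "quantity": [5850, 3050, 3050, 1, 25],
})

# Split the rows by underlying, select the quantity column, sum each group.
totals = df_toy.groupby("underlying")["quantity"].sum()
print(totals)

# Individual totals can be looked up by group label.
print(totals["ABX"])  # 5850 + 3050 + 3050
```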

Let's convert this to a `DataFrame` using the `Series.to_frame()` method, followed by `.reset_index()` to turn the `underlying` index back into an ordinary column.

In [22]:
##> df_report = df_occ.groupby('underlying')['quantity'].sum().to_frame().reset_index()

df_report = df_occ.groupby('underlying')['quantity'].sum().to_frame().reset_index()
df_report

| | underlying | quantity |
|---|---|---|
| 0 | A | 166644 |
| 1 | AA | 612596 |
| 2 | AABA | 2914502 |
| ... | ... | ... |
| 4240 | ZUMZ | 9882 |
| 4241 | ZUO | 168444 |
| 4242 | ZYNE | 36008 |
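
There is more than one way to arrive at the same two-column `DataFrame`. The sketch below compares three spellings on a hypothetical toy frame (`df_toy`); which one to use is a matter of taste. Calling `.reset_index()` directly on the `Series`, or passing `as_index=False` to `groupby()`, should give the same result as the `to_frame().reset_index()` chain used above.

```python
import pandas as pd

# Hypothetical miniature of the OCC data.
df_toy = pd.DataFrame({
    "underlying": ["ABX", "ABX", "ZYNE"],
    "quantity": [5850, 3050, 1],
})

# 1. The chain used in this tutorial.
df_a = df_toy.groupby("underlying")["quantity"].sum().to_frame().reset_index()

# 2. reset_index() on a Series already returns a DataFrame.
df_b = df_toy.groupby("underlying")["quantity"].sum().reset_index()

# 3. as_index=False keeps the group labels as a regular column.
df_c = df_toy.groupby("underlying", as_index=False)["quantity"].sum()

print(df_a)
```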


## Top 100 ETFs (non-volatility)

In the previous section we used the `.groupby()` method to get the aggregated data we wanted.  Since the backtesting we will eventually do will focus on liquid (high volume), non-volatility ETFs, let's see if we can restrict ourselves to those underlyings.  In particular, the goal of this section is to find the 100 highest volume, non-volatility ETFs.

We will start by reading in a master list of ETFs.

In [23]:
##> df_etf = pd.read_csv('data/etf_list.csv')

df_etf = pd.read_csv('../data/etf_list.csv')

In [24]:
##> df_etf

df_etf

| | symbol | name | issuer | expense_ratio | aum | spread | segment |
|---|---|---|---|---|---|---|---|
| 0 | SPY | SPDR S&P 500 ETF Trust | State Street Global Advisors | 0.09% | $275.42B | 0.00% | Equity: U.S. - Large Cap |
| 1 | IVV | iShares Core S&P 500 ETF | BlackRock | 0.04% | $155.86B | 0.01% | Equity: U.S. - Large Cap |
| 2 | VTI | Vanguard Total Stock Market ETF | Vanguard | 0.04% | $103.58B | 0.01% | Equity: U.S. - Total Market |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2157 | ADZCF | DB Agriculture Short ETN | Deutsche Bank | 0.75% | $250.85K | 72.50% | Inverse Commodities: Agriculture |
| 2158 | DEE | DB Commodity Double Short ETN | Deutsche Bank | 0.75% | $221.25K | 73.72% | Inverse Commodities: Broad Market |
| 2159 | FRLG | Large Cap Growth Index-Linked ETN | Goldman Sachs | 1.46% | $NaN | 0.54% | Leveraged Equity: U.S. - Large Cap Growth |


To identify the non-volatility ETFs we will perform a `DataFrame` masking operation on the `segment` column.  But first we will need to lowercase all the letters in that column.  We use the `Series.str.lower()` method for this.

In [25]:
##> df_etf['segment'] = df_etf['segment'].str.lower()

df_etf['segment'] = df_etf['segment'].str.lower()

In [26]:
##> df_etf

df_etf

| | symbol | name | issuer | expense_ratio | aum | spread | segment |
|---|---|---|---|---|---|---|---|
| 0 | SPY | SPDR S&P 500 ETF Trust | State Street Global Advisors | 0.09% | $275.42B | 0.00% | equity: u.s. - large cap |
| 1 | IVV | iShares Core S&P 500 ETF | BlackRock | 0.04% | $155.86B | 0.01% | equity: u.s. - large cap |
| 2 | VTI | Vanguard Total Stock Market ETF | Vanguard | 0.04% | $103.58B | 0.01% | equity: u.s. - total market |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2157 | ADZCF | DB Agriculture Short ETN | Deutsche Bank | 0.75% | $250.85K | 72.50% | inverse commodities: agriculture |
| 2158 | DEE | DB Commodity Double Short ETN | Deutsche Bank | 0.75% | $221.25K | 73.72% | inverse commodities: broad market |
| 2159 | FRLG | Large Cap Growth Index-Linked ETN | Goldman Sachs | 1.46% | $NaN | 0.54% | leveraged equity: u.s. - large cap growth |


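Next we use the `Series.str.contains()` method to isolate all the non-volatility ETFs.  Notice that this utilizes boolean indexing of dataframes: `str.contains('volatility')` produces a `Series` of `True`/`False` values, and the `~` operator negates it so that we keep only the rows whose segment does *not* mention volatility.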

**Note:** This step requires knowledge of the dataset that you wouldn't necessarily have unless I gave it to you.  The way I came up with it was through exploration of the data.

In [27]:
##> df_non_vol_etf = df_etf[~df_etf['segment'].str.contains('volatility')]

df_non_vol_etf = df_etf[~df_etf['segment'].str.contains('volatility')]
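
The masking step can be tried out on a tiny frame. The sketch below is an illustration under stated assumptions: `df_toy` is made up, and the `VOL1` row (both its symbol and its segment label) is hypothetical. It also shows an alternative to lowercasing the column first, namely passing `case=False` to `str.contains()`. The `na=False` argument treats missing segments as non-matches.

```python
import pandas as pd

# Toy ETF list; the VOL1 row is hypothetical and exists only to give
# the mask something to exclude.
df_toy = pd.DataFrame({
    "symbol": ["SPY", "VOL1", "IVV"],
    "segment": [
        "Equity: U.S. - Large Cap",
        "Volatility: Hypothetical Segment",
        "Equity: U.S. - Large Cap",
    ],
})

# Boolean mask: True where the segment mentions volatility (any case).
is_vol = df_toy["segment"].str.contains("volatility", case=False, na=False)

# ~ flips the mask, so only the non-volatility rows are kept.
df_non_vol = df_toy[~is_vol]
print(df_non_vol)
```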

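Finally, we combine the non-volatility ETF list with the volume report.  An inner merge that matches `symbol` in the ETF list to `underlying` in the report keeps only the ETFs that appear in both tables.

In [28]:
##> df_joined = df_non_vol_etf.merge(df_report, how='inner', left_on='symbol', right_on='underlying')[['symbol', 'name', 'quantity']]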

df_joined = \
    df_non_vol_etf.merge(df_report, how='inner', left_on='symbol', right_on='underlying')\
    [['symbol', 'name', 'quantity']]

df_joined

| | symbol | name | quantity |
|---|---|---|---|
| 0 | SPY | SPDR S&P 500 ETF Trust | 107351530 |
| 1 | IVV | iShares Core S&P 500 ETF | 22838 |
| 2 | VTI | Vanguard Total Stock Market ETF | 21084 |
| ... | ... | ... | ... |
| 610 | REW | ProShares UltraShort Technology | 20 |
| 611 | SSG | ProShares UltraShort Semiconductors | 1050 |
| 612 | GDXS | Proshares Ultrashort Gold Miners | 152 |
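
The behavior of an inner merge with differently named key columns is easiest to see on a few rows. In the sketch below, `df_left` and `df_right` are small stand-ins for the ETF list and the volume report. The SPY, IVV, and A quantities are taken from the output above, while the `XXXX` row is a hypothetical ETF with no option volume. Rows without a match on either side are dropped.

```python
import pandas as pd

# Stand-in for the ETF list; XXXX is a hypothetical symbol with no match.
df_left = pd.DataFrame({
    "symbol": ["SPY", "IVV", "XXXX"],
    "name": ["SPDR S&P 500 ETF Trust", "iShares Core S&P 500 ETF",
             "Hypothetical ETF"],
})

# Stand-in for the volume report; A is a stock, so it is not in the ETF list.
df_right = pd.DataFrame({
    "underlying": ["A", "IVV", "SPY"],
    "quantity": [166644, 22838, 107351530],
})

# Match symbol (left) to underlying (right); keep rows present in both.
df_merged = df_left.merge(
    df_right, how="inner", left_on="symbol", right_on="underlying"
)[["symbol", "name", "quantity"]]
print(df_merged)
```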


**Coding Challenge:** Combine `.sort_values()` with `.head()` to find the 100 highest volume non-volatility ETFs.