# Tutorial 2 - OCC Volume Analysis

### Objective

The purpose of this analysis is to produce a volume-by-underlying report from a data file sourced from the OCC.  The data in the OCC file is for the month of August 2018.  However, it is far more granular than we need, so we will need to group and summarize it (a very common task in data analysis).

Along the way we will discuss basic features of the `pandas` package.

### Loading Packages

The `pandas` package contains much of the data wrangling functionality that we will need.  For those of you who are familiar with R, you can think of `pandas` as the Python equivalent of R's `tidyverse`, although `pandas` has a larger scope than the core tidyverse.

In [3]:
##> import pandas as pd



**Note:** In theory you can give *pandas* any alias that you want, but it would be *highly* non-pythonic to call it anything other than **pd**.

Next we are going to make a couple of changes to the way that the notebook behaves (both of these are largely a matter of preference). This first bit of code changes the maximum number of rows that will be displayed when we print a `DataFrame`.


In [126]:
##> pd.options.display.max_rows = 6
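If you would rather not change the display setting for the whole notebook, one alternative is `pd.option_context()`, which applies a setting only inside a `with` block.  The sketch below uses a small made-up `DataFrame` (`df_demo` is not part of the tutorial data).

```python
import pandas as pd

# Remember the current setting so we can confirm it is restored afterwards.
rows_before = pd.get_option("display.max_rows")

# Hypothetical toy data, just to have something to print.
df_demo = pd.DataFrame({"n": range(20)})

with pd.option_context("display.max_rows", 6):
    rows_inside = pd.get_option("display.max_rows")
    print(df_demo)  # truncated display: only a few rows from the top and bottom

# Outside the with-block the previous setting is back in effect.
rows_after = pd.get_option("display.max_rows")
```

Whether to set the option globally or temporarily is a matter of taste; the global setting used in this tutorial is simpler when every cell should behave the same way.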



Next, we will make it so that the value of every expression in a cell is printed.  The default behavior is that only the value of the last expression is printed.

In [125]:
##> from IPython.core.interactiveshell import InteractiveShell
##> InteractiveShell.ast_node_interactivity = "all"



### Reading In Data from a CSV

The `pandas` library has a `read_csv()` function that reads a table of data from a CSV file and puts it in a `DataFrame`, which is the main data object in `pandas`.

In [124]:
##> df_occ = pd.read_csv('data/occ_option_volume_201808.csv')
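If you do not have the data file at hand, the same call can be tried on a small in-memory CSV.  The sketch below is only an illustration: the rows are invented, and only the column names `actdate`, `underlying`, and `quantity` are taken from this tutorial (the real file has more columns).

```python
import io
import pandas as pd

# Hypothetical stand-in for the OCC file; the values are made up.
csv_text = """actdate,underlying,quantity
08/01/2018,SPY,100
08/01/2018,QQQ,40
08/02/2018,SPY,250
"""

# read_csv() accepts a path or any file-like object, such as StringIO.
df_demo = pd.read_csv(io.StringIO(csv_text))

print(df_demo)
print(df_demo.shape)  # (number of rows, number of columns)
```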



### Initial Exploration of the Data

Our data is a monthly option volume report from the OCC.  The data is broken down by trade date, underlying, account type, puts/calls, and exchange. We can print the contents of the `DataFrame` by simply typing its name and then running that code.

In [123]:
##> df_occ



Notice at the very bottom that the total number of rows and columns has been printed.  We can also get this information from the `.shape` property.

In [122]:
##> df_occ.shape



The `type()` function is also useful for exploring objects.

In [121]:
# let's first create a list object to test on.
##> my_list = [1, "twelve", False]


# then let's check the type of my_list as well as df_occ
##> type(my_list)
##> type(df_occ)



We can check the data type of each of the columns with the `.dtypes` property.

In [120]:
##> df_occ.dtypes



Notice that all the string columns are given a data type of `object`.  Also notice that the `actdate` column was read in as a string, rather than a date, which we will fix later in this tutorial.

### Accessing Columns

We can access the columns of a dataframe by use of the `[` notation as well as the `.` notation.  Let's use both of these approaches to isolate the `underlying` column of `df_occ`.

In [119]:
# both of these are equivalent
##> df_occ['underlying']
##> df_occ.underlying



We can use the `type()` function to see that a column of a `DataFrame` is a `Series`, which is a different kind of `pandas` object.

In [118]:
##> type(df_occ['underlying'])
##> type(df_occ.underlying)



We won't get into the weeds about this point too much, but it is worth noting that a `DataFrame` is a bunch of `Series` glued together.  For those of you familiar with R, this is similar to the fact that a `data.frame` is a `list` of atomic vectors, all of the same length.
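To make the "glued together" picture concrete, here is a minimal sketch that builds a small `DataFrame` out of two `Series` and then pulls a column back out.  The data is invented for illustration.

```python
import pandas as pd

# Two stand-alone Series with made-up values.
underlying = pd.Series(["SPY", "QQQ", "IWM"], name="underlying")
quantity = pd.Series([100, 40, 25], name="quantity")

# Glue them side by side (axis=1) to form a DataFrame.
df_demo = pd.concat([underlying, quantity], axis=1)

# Both access styles hand back the same Series.
col_bracket = df_demo["underlying"]
col_dot = df_demo.underlying

print(type(col_bracket))
print(col_bracket.equals(col_dot))
```

Note that the `.` notation only works when the column name is a valid Python identifier and does not clash with an existing `DataFrame` attribute, so the `[` notation is the safer general-purpose choice.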

### Refactoring the Date Column

Recall that the `df_occ.actdate` column is in fact an `object` data type, rather than a date.

In [117]:
# let's compare the dtype of the df_occ.quantity and df_occ.actdate
##> df_occ.quantity.dtype
##> df_occ.actdate.dtype



We can use the `pandas.to_datetime()` function for the purposes of this refactoring.

In [115]:
##> pd.to_datetime(df_occ.actdate, format='%m/%d/%Y')



Note that the code above doesn't actually change `df_occ.actdate`; rather, it creates a new `Series` and prints it to the output.  We can test this by checking the `dtype` property of the column again.

In [114]:
##> df_occ.actdate.dtype



In order to actually make the change we are looking for, we need to assign the result back to `df_occ.actdate`.

In [113]:
##> df_occ.actdate = pd.to_datetime(df_occ.actdate, format = '%m/%d/%Y')
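The difference between calling `pd.to_datetime()` and assigning its result can be checked on a toy column.  This is a small sketch with invented dates; `df_demo` is a hypothetical stand-in for `df_occ`.

```python
import pandas as pd

df_demo = pd.DataFrame({"actdate": ["08/01/2018", "08/15/2018", "08/31/2018"]})

# Calling to_datetime() on its own returns a new Series; df_demo is untouched.
converted = pd.to_datetime(df_demo["actdate"], format="%m/%d/%Y")
still_text = not pd.api.types.is_datetime64_any_dtype(df_demo["actdate"])

# Assigning the result back is what actually changes the column.
df_demo["actdate"] = converted
now_datetime = pd.api.types.is_datetime64_any_dtype(df_demo["actdate"])

print(still_text, now_datetime)
print(df_demo["actdate"].dt.day.tolist())  # the .dt accessor works on datetime columns
```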



Now the refactoring from `object` to a `datetime64` type has actually occurred.  Let's check the `dtype`.

In [111]:
##> df_occ.dtypes



### Further Exploration - Unique Values

When you first encounter a data set, it is useful to explore the unique values in some of the columns to try to get a feel for what is in the data set.  Let's look at the `underlying` column and the `actdate` column.

In [109]:
##> df_occ.actdate.unique()



It looks like our dates are all in August of 2018, but it's hard to be sure from visual inspection because the printing is messy.  Let's use the `min` and `max` methods just to be sure.

In [108]:
##> df_occ.actdate.unique().min()
##> df_occ.actdate.unique().max()
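A `Series` of dates also has its own `min()` and `max()` methods, so the `unique()` step is optional for this check.  A minimal sketch on invented dates:

```python
import pandas as pd

# Made-up dates, deliberately out of order and with a duplicate.
dates = pd.to_datetime(
    pd.Series(["08/15/2018", "08/01/2018", "08/31/2018", "08/01/2018"]),
    format="%m/%d/%Y",
)

first_date = dates.min()  # earliest date in the Series
last_date = dates.max()   # latest date in the Series

print(first_date, last_date)
```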



We next look at the unique values of `underlying`. There are a lot of them, so they are not all printed.  Notice that the `.unique()` method returns an `array` object.  In fact this is an `ndarray`, which is the foundational data structure of `numpy`.  The `pandas` package is built on top of `numpy`.


Let's check the size of the array, with the `.size` property.

In [107]:
##> df_occ.underlying.unique()
##> df_occ.underlying.unique().size
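As a side note, `Series.nunique()` counts distinct values directly, without building the array first.  The sketch below compares the two approaches on a short invented `Series`.

```python
import pandas as pd

# Made-up tickers with repeats.
s = pd.Series(["SPY", "QQQ", "SPY", "IWM", "QQQ"], name="underlying")

unique_values = s.unique()      # distinct values, in order of first appearance
n_via_size = unique_values.size
n_via_nunique = s.nunique()     # counts distinct values directly

print(unique_values)
print(n_via_size, n_via_nunique)
```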



**Challenge:** How many individual dates are represented in the `df_occ`?

### Grouping And Summarizing

Our ultimate objective is to know the total option volume for each underlying in the month of August 2018.  The data in its current form is far more granular than that.  We will do some aggregation in order to get the data in the form that we need.

In [106]:
##> df_occ



What we need to do is sum up the total quantity for each underlying.  We can do this with the `groupby()` method coupled with the `sum()` method.

In [105]:
##> df_occ.groupby('underlying')['quantity'].sum()



As you can see, this is actually a `Series` object, not a `DataFrame`. (This differs from how `summarize()` works in `dplyr`, which returns a data frame.)

In [104]:
##> type(df_occ.groupby('underlying')['quantity'].sum())



Let's convert this to a `DataFrame` using the `Series.to_frame()` method.

In [100]:
##> df_report = df_occ.groupby('underlying')['quantity'].sum().to_frame()
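Here is the same group-and-sum pattern on a tiny invented data set, so the arithmetic can be checked by hand.  The last line shows one possible alternative, `as_index=False`, which returns a `DataFrame` with `underlying` as an ordinary column; the tutorial itself uses `.to_frame()`.

```python
import pandas as pd

# Made-up rows: several trades per underlying.
df_demo = pd.DataFrame({
    "underlying": ["SPY", "QQQ", "SPY", "QQQ", "IWM"],
    "quantity": [100, 40, 250, 10, 25],
})

# Group by underlying and sum quantity: the result is a Series.
totals = df_demo.groupby("underlying")["quantity"].sum()

# Convert to a DataFrame; underlying becomes the index.
report = totals.to_frame()

# Alternative: keep underlying as a regular column instead of the index.
flat = df_demo.groupby("underlying", as_index=False)["quantity"].sum()

print(report)
print(flat)
```

Keeping `underlying` as the index, as `.to_frame()` does, is convenient later on because index-based joins can use it directly.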

## Top 100 ETFs (non-volatility)

In the previous section we used the `.groupby()` method to get the aggregated data we wanted.  Since the backtesting we will eventually do will focus on liquid, non-volatility ETFs, let's see if we can restrict ourselves to those underlyings.  In particular, the goal of this section is to find the 100 highest-volume, non-volatility ETFs.

We will start by reading in a master list of ETFs.

In [132]:
##> df_etf = pd.read_csv('data/etf_list.csv')



In [133]:
##> df_etf



To identify the non-volatility ETFs we will perform a filtering operation using the segment column.  But first we will need to lowercase all the letters in that column.  We use the `Series.str.lower()` method for this.

In [97]:
##> df_etf['segment'] = df_etf['segment'].str.lower()


In [96]:
##> df_etf


Next we use the `Series.str.contains()` method to isolate all the non-volatility ETFs.  Notice that this utilizes boolean indexing of dataframes.

In [95]:
##> df_non_vol_etf = df_etf[~df_etf['segment'].str.contains('volatility')]
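The filtering step can be broken into its parts on a toy table.  In this sketch the `segment` strings are invented and are not taken from the actual ETF list; only the column names `symbol` and `segment` come from this tutorial.

```python
import pandas as pd

# Hypothetical ETF list with made-up segment labels.
df_demo = pd.DataFrame({
    "symbol": ["SPY", "VXX", "GLD", "UVXY"],
    "segment": ["Equity: Large Cap", "Volatility: Short-Term",
                "Commodities: Gold", "Leveraged Volatility"],
})

# Lowercase first so the match does not depend on capitalization.
segment_lower = df_demo["segment"].str.lower()

# Boolean Series: True where the segment mentions volatility.
is_vol = segment_lower.str.contains("volatility", regex=False)

# ~ negates the mask, keeping only the rows where is_vol is False.
non_vol = df_demo[~is_vol]

print(is_vol.tolist())
print(non_vol)
```

If the column contained missing values, passing `na=False` to `str.contains()` is one way to keep the mask strictly boolean.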



In `pandas`, both `Series` and `DataFrame` objects have an index. Among other things, an `index` both names and orders the entries of a `Series` or `DataFrame`.  Indices are also used when joining together two dataframes.

We can view the index by using the `DataFrame.index` property.

In [131]:
##> df_non_vol_etf.index



Notice above that the index of `df_non_vol_etf` is an index of integers.

In order to perform the join that we want, we will reindex `df_non_vol_etf` with the `DataFrame.set_index()` method.  The following code drops the integer index and turns the `symbol` column into the index.

In [130]:
##> df_non_vol_etf.set_index('symbol', inplace = True)
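For comparison, `set_index()` can also be used without `inplace`, in which case it returns a new `DataFrame` and leaves the original alone.  The sketch below uses an invented table; the `name` column is hypothetical.

```python
import pandas as pd

# Made-up ETF table with the default integer index.
df_demo = pd.DataFrame({
    "symbol": ["SPY", "GLD", "IWM"],
    "name": ["Fund A", "Fund B", "Fund C"],  # hypothetical column
})
original_index = df_demo.index.tolist()

# Returns a new DataFrame indexed by symbol; df_demo itself is unchanged.
df_by_symbol = df_demo.set_index("symbol")

print(original_index)
print(df_by_symbol.index)
print(df_by_symbol.loc["GLD", "name"])  # rows can now be looked up by symbol
```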



Let's take a look at the dataframe to make sure the change took place as we wanted.

In [134]:
##> df_non_vol_etf.index
##> df_non_vol_etf



We are now ready to perform an inner join on `df_non_vol_etf` and `df_report`.  The result of this join is that only underlyings that exist in both dataframes remain.  By default, `.join()` uses the indexes of the two dataframes. The resulting joined dataframe has the columns from both of the original dataframes.

In [128]:
##> df_non_vol_etf.join(df_report, how = 'inner')
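A small worked example makes the behavior of the inner join easy to verify.  Both tables below are invented stand-ins for `df_non_vol_etf` and `df_report`; one symbol appears only in the first table and one only in the second.

```python
import pandas as pd

# Hypothetical ETF table, indexed by symbol.
df_etf_demo = pd.DataFrame(
    {"segment": ["equity", "commodities", "fixed income"]},
    index=pd.Index(["SPY", "GLD", "TLT"], name="symbol"),
)

# Hypothetical volume report, indexed by underlying.
df_report_demo = pd.DataFrame(
    {"quantity": [350, 25, 900]},
    index=pd.Index(["SPY", "GLD", "AAPL"], name="underlying"),
)

# Inner join on the indexes: only labels present in both tables survive.
joined = df_etf_demo.join(df_report_demo, how="inner")

print(joined)
```

Here `TLT` drops out because it has no volume row, and `AAPL` drops out because it is not in the ETF table.  Note that `how='inner'` has to be given explicitly, since `DataFrame.join()` performs a left join by default.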



Recall that what we want is the 100 most liquid ETFs, measured here by option volume.  In order to do this, we first sort with `DataFrame.sort_values()` and then use `.head()` to grab the top 100 rows.

In [127]:
##> df_non_vol_etf.join(df_report, how = 'inner').sort_values('quantity', ascending = False).head(100)
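The sort-then-head pattern is shown below on a four-row invented table, taking the top 2 instead of the top 100.  As an aside, `DataFrame.nlargest()` is an alternative that gives the same rows in one call; it is included here only for comparison.

```python
import pandas as pd

# Made-up volumes, indexed by ticker.
df_demo = pd.DataFrame(
    {"quantity": [350, 25, 900, 120]},
    index=["SPY", "GLD", "QQQ", "IWM"],
)

# Sort from largest to smallest, then keep the first two rows.
top_sorted = df_demo.sort_values("quantity", ascending=False).head(2)

# Alternative: ask for the two largest values of quantity directly.
top_nlargest = df_demo.nlargest(2, "quantity")

print(top_sorted)
print(top_sorted.equals(top_nlargest))
```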

