### Objective

The purpose of this analysis is to produce a volume-by-underlying report from a data file sourced from the OCC.  The data is for the month of August, but it is far more granular than we need, so we will need to group and summarize (a very common task in data analysis).

Along the way we will discuss basic features of the `pandas` package.

### Loading Packages

The `pandas` package contains much of the data wrangling functionality that we will need.  For those of you who are familiar with R, you can think of `pandas` as the rough Python equivalent of R's `tidyverse`.

In [43]:
##> import pandas as pd

import pandas as pd

**Note:** In theory you can give *pandas* any alias that you want, but it would be *highly* non-pythonic to call it anything other than **pd**.

Next we are going to make a couple of changes to the way that the notebook behaves (both of these are largely a matter of preference). This first bit of code changes the maximum number of rows that will be displayed when we print a `DataFrame`.

In [50]:
##> pd.options.display.max_rows = 6

pd.options.display.max_rows = 6
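If you would rather not change the setting for the whole notebook, `pandas` also provides `pd.option_context()`, which applies an option only inside a `with` block. The snippet below is a minimal sketch on a made-up `DataFrame` (`df_demo` is not part of the OCC data).

```python
import pandas as pd

# Hypothetical 100-row DataFrame, used only to demonstrate truncated printing
df_demo = pd.DataFrame({"n": range(100)})

before = pd.get_option("display.max_rows")

# The option is changed only inside the with-block
with pd.option_context("display.max_rows", 6):
    inside = pd.get_option("display.max_rows")
    print(df_demo)  # prints a truncated view of the 100 rows

# Outside the block the previous setting is restored
after = pd.get_option("display.max_rows")
print(inside, after == before)
```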

Next, we will make it so that the output of every line of code in a cell is printed.  The default behavior is that only the output of the last line is printed.

In [39]:
##> from IPython.core.interactiveshell import InteractiveShell
##> InteractiveShell.ast_node_interactivity = "all"

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

### Reading In Data from a CSV

The `pandas` library has a `read_csv()` function that reads a table of data from a CSV file and puts it in a `DataFrame`, which is the main data object in `pandas`.

In [89]:
##> df_occ = pd.read_csv('../option_data/occ_option_volume_201808.csv')

df_occ = pd.read_csv('../option_data/occ_option_volume_201808.csv')
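If you don't have the OCC file on hand, you can still experiment with `pd.read_csv()`, because it accepts any file-like object as well as a path. The sketch below reads a small made-up sample from a string; the three rows and the name `df_sample` are illustrative assumptions, not the real data file.

```python
import io
import pandas as pd

# Made-up three-row sample; the real file is far larger
csv_text = """quantity,underlying,symbol,actype,porc,exchange,actdate
5850,ABX,1ABX,C,C,AMEX,08/02/2018
3050,ABX,1ABX,C,C,AMEX,08/16/2018
1,ZYNE,ZYNE,M,P,EDGX,08/20/2018
"""

# io.StringIO wraps the string so that read_csv can treat it like a file
df_sample = pd.read_csv(io.StringIO(csv_text))

print(df_sample.shape)  # (3, 7)
print(df_sample.columns.tolist())
```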

### Initial Exploration of the Data

Our data is a monthly option volume report from the OCC.  The data is broken down by trade-date, underlying, account type, puts/calls, and exchange. We can print the contents of the `DataFrame` by simply typing its name and then running that code.

In [51]:
# df_occ

df_occ

Unnamed: 0,quantity,underlying,symbol,actype,porc,exchange,actdate
0,5850,ABX,1ABX,C,C,AMEX,08/02/2018
1,3050,ABX,1ABX,C,C,AMEX,08/16/2018
2,3050,ABX,1ABX,F,C,AMEX,08/16/2018
...,...,...,...,...,...,...,...
1870438,1,ZYNE,ZYNE,M,P,EDGX,08/20/2018
1870439,25,ZYNE,ZYNE,C,C,MCRY,08/30/2018
1870440,25,ZYNE,ZYNE,M,C,MCRY,08/30/2018

1870441 rows × 7 columns


Notice at the very bottom that the total number of rows and columns has been printed.  We can also get this information from the `.shape` property.

In [35]:
##> df_occ.shape

df_occ.shape

(1870441, 7)

The `type()` function is also useful for exploring objects.

In [40]:
# let's first create a list object to test on.
my_list = [1, "twelve", False]

# then let's check the type of my_list as well as df_occ
type(my_list)
type(df_occ)

list

pandas.core.frame.DataFrame

We can check the data-types of each of the columns with the `.dtypes` property.

In [46]:
##> df_occ.dtypes

df_occ.dtypes

quantity       int64
underlying    object
symbol        object
actype        object
porc          object
exchange      object
actdate       object
dtype: object

Notice that all the string columns are given a data type of `object`.  Also notice that the `actdate` column was read in as a string, rather than a date, which we will fix later in this tutorial.

### Accessing Columns

We can access the columns of a `DataFrame` with the `[` notation as well as the `.` notation.  Let's use both of these approaches to isolate the `underlying` column of `df_occ`.

In [52]:
# both of these are equivalent
##> df_occ['underlying']
##> df_occ.underlying

df_occ['underlying']
df_occ.underlying

0           ABX
1           ABX
2           ABX
           ... 
1870438    ZYNE
1870439    ZYNE
1870440    ZYNE
Name: underlying, Length: 1870441, dtype: object

0           ABX
1           ABX
2           ABX
           ... 
1870438    ZYNE
1870439    ZYNE
1870440    ZYNE
Name: underlying, Length: 1870441, dtype: object
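The two notations are interchangeable for a column like `underlying`, but not for every column name. The sketch below uses a made-up `DataFrame` (`df_demo` and its column names are hypothetical) to show two cases where only the `[` notation works: a name containing a space, and a name that clashes with an existing `DataFrame` attribute.

```python
import pandas as pd

# Hypothetical data: "trade count" contains a space, and "size" clashes
# with the built-in DataFrame.size attribute
df_demo = pd.DataFrame({
    "underlying": ["ABX", "ZYNE"],
    "trade count": [3, 1],
    "size": [10, 20],
})

# For an ordinary column name both notations return the same Series
same = df_demo["underlying"].equals(df_demo.underlying)

# Bracket notation works for any column name
trade_count = df_demo["trade count"]
size_col = df_demo["size"]

# Dot notation here returns the attribute (number of cells), not the column
n_cells = df_demo.size  # 6

print(same, n_cells)
```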

We can use the `type()` function to see that a column of a `DataFrame` is a `Series`, which is a different kind of `pandas` object.

In [53]:
type(df_occ['underlying'])
type(df_occ.underlying)

pandas.core.series.Series

pandas.core.series.Series

We won't get into the weeds about this point too much, but it is worth noting that a `DataFrame` is a bunch of `Series` glued together.  For those of you familiar with R, this is similar to the fact that a `data.frame` is a `list` of vectors, all of the same length.
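To make the "glued together" idea concrete, here is a small sketch that builds a `DataFrame` from two separate `Series` and then pulls one back out. The values and the name `df_small` are made up for illustration.

```python
import pandas as pd

# Two stand-alone Series of the same length
quantity = pd.Series([5850, 3050, 1], name="quantity")
underlying = pd.Series(["ABX", "ABX", "ZYNE"], name="underlying")

# Glue them together side by side; each Series becomes a column
df_small = pd.concat([quantity, underlying], axis=1)

# Pulling a column back out gives a Series with the same values
col = df_small["quantity"]

print(type(df_small))
print(type(col))
print(col.equals(quantity))
```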

### Refactoring the Date Column

Recall that `df_occ.actdate` is in fact an `object` data type, rather than a date.

In [92]:
# let's compare the dtype of the df_occ.quantity and df_occ.actdate
##> df_occ.quantity.dtype
##> df_occ.actdate.dtype

df_occ.quantity.dtype
df_occ.actdate.dtype

dtype('int64')

dtype('O')

We can use the `pandas.to_datetime()` function for the purposes of this refactoring.

In [90]:
##> pd.to_datetime(df_occ.actdate, format='%m/%d/%Y')

pd.to_datetime(df_occ.actdate, format='%m/%d/%Y')

0         2018-08-02
1         2018-08-16
2         2018-08-16
             ...    
1870438   2018-08-20
1870439   2018-08-30
1870440   2018-08-30
Name: actdate, Length: 1870441, dtype: datetime64[ns]

Note that this code above doesn't actually change `df_occ.actdate` but rather creates a new `Series` and prints it to the output.  We can test this by checking the `dtype` property of the column again.

In [93]:
##> df_occ.actdate.dtype

df_occ.actdate.dtype

dtype('O')

In order to actually effect the change we are looking for, we need to reassign the result to `df_occ.actdate`.

In [98]:
##> df_occ.actdate = pd.to_datetime(df_occ.actdate, format = '%m/%d/%Y')

df_occ.actdate = pd.to_datetime(df_occ.actdate, format = '%m/%d/%Y')

Now the refactoring to a date has actually occurred.

In [100]:
##> df_occ.dtypes

df_occ.dtypes

quantity               int64
underlying            object
symbol                object
                   ...      
porc                  object
exchange              object
actdate       datetime64[ns]
Length: 7, dtype: object
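The convert-then-reassign pattern can be tried in isolation on a made-up column. In this sketch `df_dates` is a hypothetical three-row `DataFrame`, and the reassignment uses the `[` notation, which is equivalent here to assigning to `df_dates.actdate`.

```python
import pandas as pd

# Hypothetical date strings in the same month/day/year format as the report
df_dates = pd.DataFrame({"actdate": ["08/02/2018", "08/16/2018", "08/30/2018"]})

# to_datetime returns a new Series; the original column is untouched
converted = pd.to_datetime(df_dates["actdate"], format="%m/%d/%Y")
still_strings = not pd.api.types.is_datetime64_any_dtype(df_dates["actdate"])

# Reassigning the result is what actually changes the column
df_dates["actdate"] = converted
now_dates = pd.api.types.is_datetime64_any_dtype(df_dates["actdate"])

# Once the column is a datetime, the .dt accessor becomes available
days = df_dates["actdate"].dt.day.tolist()

print(still_strings, now_dates, days)
```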

### Looking at Unique Values

When you first encounter a data set, it is useful to explore the unique values in some of the columns to try to get a feel for what is in the data set.  Let's look at the `underlying` column and the `actdate` column.

In [103]:
##> df_occ.actdate.unique()

df_occ.actdate.unique()

array(['2018-08-02T00:00:00.000000000', '2018-08-16T00:00:00.000000000',
       '2018-08-13T00:00:00.000000000', '2018-08-09T00:00:00.000000000',
       '2018-08-10T00:00:00.000000000', '2018-08-22T00:00:00.000000000',
       '2018-08-06T00:00:00.000000000', '2018-08-15T00:00:00.000000000',
       '2018-08-23T00:00:00.000000000', '2018-08-24T00:00:00.000000000',
       '2018-08-07T00:00:00.000000000', '2018-08-21T00:00:00.000000000',
       '2018-08-14T00:00:00.000000000', '2018-08-01T00:00:00.000000000',
       '2018-08-03T00:00:00.000000000', '2018-08-17T00:00:00.000000000',
       '2018-08-30T00:00:00.000000000', '2018-08-31T00:00:00.000000000',
       '2018-08-27T00:00:00.000000000', '2018-08-20T00:00:00.000000000',
       '2018-08-08T00:00:00.000000000', '2018-08-28T00:00:00.000000000',
       '2018-08-29T00:00:00.000000000'], dtype='datetime64[ns]')

It looks like the dates are all in August of 2018, but it's hard to be sure from just visual inspection, because the printing is messy.  Let's use the `min` and `max` methods just to be sure.

In [115]:
##> df_occ.actdate.unique().min()
##> df_occ.actdate.unique().max()

df_occ.actdate.unique().min()
df_occ.actdate.unique().max()

numpy.datetime64('2018-08-01T00:00:00.000000000')

numpy.datetime64('2018-08-31T00:00:00.000000000')
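As an aside, `min()` and `max()` can also be called directly on a datetime `Series`, without going through `.unique()` first. The sketch below uses a few made-up dates (the name `dates` is hypothetical) and adds an explicit check that every value falls in August 2018.

```python
import pandas as pd

# Hypothetical dates, deliberately unsorted and with a repeat
dates = pd.Series(
    pd.to_datetime(
        ["08/16/2018", "08/01/2018", "08/31/2018", "08/16/2018"],
        format="%m/%d/%Y",
    )
)

# min() and max() on the Series return pandas Timestamp objects
first, last = dates.min(), dates.max()

# An explicit check that every date is in August 2018
all_august = ((dates.dt.year == 2018) & (dates.dt.month == 8)).all()

print(first, last, all_august)
```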

We next look at the unique values of `underlying`.  There are a lot of them, so they are not all printed.  Notice that the `.unique()` method returns an `array` object.  We can check its size by using the `.size` property.

In [116]:
##> df_occ.underlying.unique()
##> df_occ.underlying.unique().size

df_occ.underlying.unique()
df_occ.underlying.unique().size

array(['ABX', 'AGI', 'ALB', ..., 'ZUMZ', 'ZUO', 'ZYNE'], dtype=object)

4243

**Challenge:** How many individual dates are represented in `df_occ`?

### Grouping And Summarizing

Our ultimate objective is to know the total volume, for each underlying, in the month of August 2018.  The data in its current form is far more granular than that.

In [117]:
df_occ

Unnamed: 0,quantity,underlying,symbol,actype,porc,exchange,actdate
0,5850,ABX,1ABX,C,C,AMEX,2018-08-02
1,3050,ABX,1ABX,C,C,AMEX,2018-08-16
2,3050,ABX,1ABX,F,C,AMEX,2018-08-16
...,...,...,...,...,...,...,...
1870438,1,ZYNE,ZYNE,M,P,EDGX,2018-08-20
1870439,25,ZYNE,ZYNE,C,C,MCRY,2018-08-30
1870440,25,ZYNE,ZYNE,M,C,MCRY,2018-08-30


What we need to do is sum up the total quantity for each underlying.  We can do this with the `groupby()` method coupled with the `sum()` method.

In [121]:
df_occ.groupby('underlying')['quantity'].sum()

underlying
A        166644
AA       612596
AABA    2914502
         ...   
ZUMZ       9882
ZUO      168444
ZYNE      36008
Name: quantity, Length: 4243, dtype: int64

As you can see, the result is actually a `Series` object, not a `DataFrame`. (This is different from how `summarize()` works in `dplyr`, which returns a data frame.)
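If a `DataFrame` is more convenient for the final report, there are a couple of common ways to get one. The sketch below runs on a made-up five-row table (`df_toy` and the other names are hypothetical); the same calls should apply to `df_occ`.

```python
import pandas as pd

# Hypothetical miniature version of the volume data
df_toy = pd.DataFrame({
    "underlying": ["ABX", "ABX", "ZYNE", "ZYNE", "ZYNE"],
    "quantity": [5850, 3050, 1, 25, 25],
})

# Same pattern as above: the result is a Series indexed by underlying
vol_series = df_toy.groupby("underlying")["quantity"].sum()

# Option 1: as_index=False keeps the group key as an ordinary column
vol_frame = df_toy.groupby("underlying", as_index=False)["quantity"].sum()

# Option 2: reset_index() turns the Series index into a column
vol_frame_alt = vol_series.reset_index()

# Sorting by volume gives a simple volume-by-underlying report
vol_sorted = vol_frame.sort_values("quantity", ascending=False)
print(vol_sorted)
```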