# Tutorial 3 - Data Visualization

### Objective

The purpose of this tutorial is to give a brief introduction to the data visualization capabilities of Python.  Data visualization is a massive topic and we are only going to scratch the surface.

The foundation of data visualization in Python is the `matplotlib` package.  Another useful package is `seaborn`, which is built on top of `matplotlib`.  We will use both in this tutorial.

### Loading Packages and Configuration

Let's begin by importing the packages that we need.

In [54]:
##> import numpy as np
##> import pandas as pd
##> import matplotlib.pyplot as plt
##> import seaborn as sns



Now that we have the packages loaded, we'll do a bit of configuration of the Jupyter environment.  First, in order to see the graphs inline in the notebook, we need to run the following.

In [36]:
##> %matplotlib inline


The code above, which starts with a `%`, is called an *IPython magic* command.  IPython magics are another way in which IPython enhances the baseline functionality of Python.  We won't concern ourselves any further with magic commands in this tutorial, but it's good to be aware of their existence.

The next few lines of code are a matter of preference.  The first limits the number of rows that are displayed when you print a dataframe, and the next two make Jupyter display the output of every line in a cell that produces one (as opposed to just the final line, which is the default behavior in Jupyter).

In [55]:
##> pd.options.display.max_rows = 6
##> from IPython.core.interactiveshell import InteractiveShell
##> InteractiveShell.ast_node_interactivity = "all"
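
To see what the row limit does without changing the global setting, here is a minimal sketch.  The dataframe `df_demo` is hypothetical and exists only for illustration; `pd.option_context` applies the option temporarily.

```python
import pandas as pd

# Hypothetical 20-row dataframe, used only to show the effect of the option.
df_demo = pd.DataFrame({'x': range(20)})

# option_context applies the setting only inside the with-block,
# so the global configuration is left untouched.
with pd.option_context('display.max_rows', 6):
    truncated_text = repr(df_demo)

# Only the first and last few rows are shown, followed by the full dimensions.
print(truncated_text)
```

Setting `pd.options.display.max_rows` directly, as in the cell above, has the same effect but applies to the rest of the session.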



### Reading In the Data

We are going to perform some visualization exercises with the data in the `practice_market_history.csv` file, which can be found in the `data` folder.  To load the data we use the following code:

In [57]:
##> df_market_history = pd.read_csv('data/practice_market_history.csv')



Let's have a look at the data.

In [59]:
##> df_market_history



This data consists of end-of-day option prices for 100 different ETF underlyings, over four different expirations in late 2013 and early 2014.  There are a few other details that are worth noting to fully understand the dataset, but we won't concern ourselves with these for now.  For the purposes of this tutorial, just know that each row corresponds to the end-of-day price for a given option on a given day.

### Liquidity - Volume vs Spread

*Liquidity* is an extremely important concept in trading.  We say that a security or instrument is *liquid* if three things hold.

1. Large volumes trade daily.

2. There is a narrow bid-ask spread.

3. Large quantities can be traded without moving the bid or ask.

Usually it is the case that these three things go together.  The purpose of our first visualization analysis will be to demonstrate that #1 and #2 coincide in our options data.

Let's begin by creating a new column called `spread` to measure the spread of each option.  The spread for an option is the difference between the ask price and the bid price.

In [60]:
##> df_market_history['spread'] = df_market_history['ask'] - df_market_history['bid']
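
As a quick illustration of how this column arithmetic behaves, here is a minimal sketch on a toy dataframe.  The tickers and quotes in `df_toy` are made up; only the `bid` and `ask` column names follow the tutorial's data.

```python
import pandas as pd

# Hypothetical quotes for illustration only; these are not from the real file.
df_toy = pd.DataFrame({
    'underlying': ['AAA', 'AAA', 'BBB'],
    'bid': [1.00, 2.50, 0.40],
    'ask': [1.05, 2.60, 0.70],
})

# Subtracting two columns works element-wise, so each row gets its own spread.
df_toy['spread'] = df_toy['ask'] - df_toy['bid']

print(df_toy)
```

No loop is needed: pandas aligns the two columns by row and subtracts them in one step.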



Next, we will use a `groupby` statement to calculate the total volume for each underlying, as well as the average spread for each underlying.

In [62]:
##> df_volume = df_market_history.groupby('underlying')['volume'].sum().to_frame()
##> df_spread = df_market_history.groupby('underlying')['spread'].mean().to_frame()



Let's now combine `df_volume` and `df_spread` into a single dataframe called `df_liquidity`.

In [63]:
##> df_liquidity = df_volume.join(df_spread, how = 'inner')
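
The two `groupby` calls and the join can also be collapsed into a single step using pandas named aggregation.  This is an alternative sketch rather than the approach used in this tutorial, and the data in `df_toy` is made up for illustration.

```python
import pandas as pd

# Hypothetical option rows; only the column names follow the tutorial's data.
df_toy = pd.DataFrame({
    'underlying': ['AAA', 'AAA', 'BBB', 'BBB'],
    'volume': [100, 300, 10, 20],
    'spread': [0.05, 0.07, 0.30, 0.50],
})

# Named aggregation: output_column=(input_column, aggregation_function).
# The result is indexed by underlying, with one row per group.
df_liquidity_toy = df_toy.groupby('underlying').agg(
    volume=('volume', 'sum'),
    spread=('spread', 'mean'),
)

print(df_liquidity_toy)
```

The result has the same shape as the joined dataframe: one row per underlying, with a `volume` column and a `spread` column.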


Let's now graph a scatter plot with volume on the x-axis and average spread on the y-axis.  **How do you think it should look?**

In [64]:
##> sns.lmplot(x='volume', # Horizontal axis
##>            y='spread', # Vertical axis
##>            data=df_liquidity, # Data source
##>            fit_reg=False # Don't fit a regression line
##>           )



This scatter plot doesn't look very good because a couple of underlyings have far more volume than the rest.  Can you guess what one of the outlier underlyings is?

In order to better see the relationship, let's calculate the log of the volume.

In [65]:
##> df_liquidity['log_volume'] = np.log(df_liquidity['volume'])
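
One caveat: `np.log` returns `-inf` for a volume of zero.  The tutorial does not say whether any underlying has zero total volume, so the following is only a sketch of two common ways to guard against that case, using made-up data in `df_toy`.

```python
import numpy as np
import pandas as pd

# Hypothetical totals, including one underlying that never traded.
df_toy = pd.DataFrame(
    {'volume': [0, 10, 1000, 100000]},
    index=pd.Index(['AAA', 'BBB', 'CCC', 'DDD'], name='underlying'),
)

# Option 1: keep only strictly positive volumes before taking the log.
df_pos = df_toy[df_toy['volume'] > 0].copy()
df_pos['log_volume'] = np.log(df_pos['volume'])

# Option 2: log(1 + volume) is defined at zero and close to log(volume)
# when the volume is large.
df_toy['log1p_volume'] = np.log1p(df_toy['volume'])

print(df_pos)
print(df_toy)
```

Which option is appropriate depends on whether zero-volume underlyings should appear in the plot at all.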

Now if we graph `log_volume` against `spread`, we can see the relationship that we would expect: underlyings with higher volume tend to have narrower spreads.

In [66]:
##> sns.lmplot(x='log_volume', # Horizontal axis
##>            y='spread', # Vertical axis
##>            data=df_liquidity, # Data source
##>            fit_reg=False # Don't fit a regression line
##>           )
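
An alternative to adding a `log_volume` column is to plot the raw volume and put the x-axis itself on a log scale.  Here is a minimal sketch using `matplotlib` directly; the numbers in `df_toy` are made up and only the column names follow the tutorial.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical liquidity data for illustration only.
df_toy = pd.DataFrame({
    'volume': [50, 500, 5000, 500000],
    'spread': [0.60, 0.35, 0.15, 0.03],
})

fig, ax = plt.subplots()

# Plot raw volume against spread, then rescale the axis rather than the data.
ax.scatter(df_toy['volume'], df_toy['spread'])
ax.set_xscale('log')

# Label the axes so the log scaling is clear to the reader.
ax.set_xlabel('volume (log scale)')
ax.set_ylabel('average spread')
```

With this approach the tick labels show actual volumes rather than their logarithms, which can make the graph easier to read.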

