# Manipulating Tick Data with pandas

We will work with data from [QuantQuote](https://quantquote.com/historical-stock-data).

- Dimensions: date, time, stock symbol
- Metrics: opening, high, low and closing prices, as well as trade volume
- Frequency: daily
- Dates: 1998 to 2015
- Scope: 500 stock symbols that constitute the S&P500 as of Dec 2015.

Let's get the data:

In [None]:
from urllib.request import urlretrieve
from zipfile import ZipFile

def download(url):
    local_fname = url.split('/')[-1]
    fname, headers = urlretrieve(url, local_fname)
    return fname, headers

data_url = 'http://quantquote.com/files/quantquote_daily_sp500_83986.zip'
metadata_url = 'https://quantquote.com/docs/QuantQuote_Minute.pdf'

# Download data
data_fname, data_headers = download(data_url)
# Extract the data
with ZipFile(data_fname) as zf:
    zf.extractall()

# Download PDF with the metadata
metadata_fname, metadata_headers = download(metadata_url)
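
The download cell above fetches the archive every time it runs. A minimal sketch of skipping the download when a local copy already exists could look like the following. The names `download_once`, `dest_dir` and `fetch` are hypothetical, introduced only for illustration.

```python
import os
from urllib.request import urlretrieve


def download_once(url, dest_dir='.', fetch=urlretrieve):
    """Download `url` into `dest_dir` unless the file is already there.

    `download_once`, `dest_dir` and `fetch` are hypothetical names used for
    illustration; `fetch` defaults to `urlretrieve` and can be replaced by a
    stand-in when no network access is wanted.
    """
    local_fname = os.path.join(dest_dir, url.split('/')[-1])
    if not os.path.exists(local_fname):
        # Only hit the network when there is no local copy yet
        fetch(url, local_fname)
    return local_fname
```

With the variables defined earlier, `data_fname = download_once(data_url)` would then be safe to re-run. Note that this check only looks at whether the file exists, not whether a previous download completed.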

For each one of the 500 stock symbols, we have a file. Here is a sample:

In [None]:
import os
from pprint import pprint

data_dir = os.path.join('quantquote_daily_sp500_83986', 'daily')
pprint(os.listdir(data_dir)[:10])

From the downloaded PDF (see `metadata_fname`) we can obtain the field names (and their descriptions):

In [None]:
fieldnames = [
    'date',
    'time',
    'open',
    'high',
    'low',
    'close',
    'volume'
]

## Building the dataset for a single stock symbol

We can now easily import the data of a single stock, for example, Apple (AAPL). We will drop the `time` field: the data is daily, so it carries no useful information.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
%matplotlib inline
print(plt.style.available)
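# matplotlib 3.6 renamed this style to 'seaborn-v0_8-notebook'; use whichever name is available
plt.style.use('seaborn-notebook' if 'seaborn-notebook' in plt.style.available
              else 'seaborn-v0_8-notebook')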

In [None]:
def import_data(symbol, fieldnames=fieldnames, set_index=True, add_symbol=False):
    data_path = os.path.join('quantquote_daily_sp500_83986',
                             'daily',
                             'table_' + symbol + '.csv')
    data = pd.read_csv(data_path,
                       names=fieldnames).drop('time', axis=1)
    data['date'] = pd.to_datetime(data.date, format='%Y%m%d')
    if set_index:
        data.set_index('date', inplace=True)
    if add_symbol:
        data['symbol'] = symbol
    return data

aapl = import_data('aapl')
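
To see the parsing steps of `import_data` in isolation, here is a small sketch that runs the same `read_csv` / `to_datetime` sequence on an in-memory file. The two rows are made up and only mimic the headerless layout described above; they are not taken from the dataset.

```python
import io

import pandas as pd

# Two made-up rows in the headerless layout
# (date, time, open, high, low, close, volume); the values are illustrative only.
sample_csv = io.StringIO(
    "20150102,0,10.0,11.0,9.5,10.5,1000\n"
    "20150105,0,10.5,12.0,10.0,11.5,2000\n"
)
columns = ['date', 'time', 'open', 'high', 'low', 'close', 'volume']

# The file has no header row, so the column names are passed explicitly
sample = pd.read_csv(sample_csv, names=columns).drop('time', axis=1)
# Dates are read as integers such as 20150102, so the format is given explicitly
sample['date'] = pd.to_datetime(sample['date'], format='%Y%m%d')
sample = sample.set_index('date')
print(sample)
```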

In [None]:
print(aapl.info())
print(aapl.describe())

We can quickly look at things like the closing prices:

In [None]:
aapl.close.plot(title='AAPL closing prices')

A plot like this, with many data points along the x-axis, begs for interactivity: sometimes we want to inspect prices around dates where something happened. Bokeh helps here by letting you zoom in on the dates you are interested in.

In [None]:
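import bokeh.io, bokeh.plotting
bokeh.io.output_notebook()

In [None]:
# bokeh.charts has been removed from Bokeh, so this uses bokeh.plotting instead
p = bokeh.plotting.figure(x_axis_type='datetime', title='AAPL closing prices')
p.line(aapl.index, aapl.close)
bokeh.io.show(p)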

We can also look at the relative difference between open and close prices:

In [None]:
_df = (aapl.close - aapl.open) / aapl.open
print(_df.describe())
_df.plot(title='AAPL relative difference between close and open prices')

In [None]:
_df.hist(bins=50)

## Building a dataset for all stocks

First, check whether the data is "too big" by looking at its size. The archive is about 35 MB compressed, so we will be fine loading it all in memory.
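
One way to make this size check concrete is to read the compressed and uncompressed totals from the archive's own metadata. The sketch below is an illustration under assumptions: `archive_sizes` is a hypothetical helper, and a tiny in-memory archive with made-up content stands in for the real download.

```python
import io
from zipfile import ZIP_DEFLATED, ZipFile


def archive_sizes(zip_source):
    """Return (compressed, uncompressed) byte totals for a zip archive."""
    with ZipFile(zip_source) as zf:
        infos = zf.infolist()
        compressed = sum(info.compress_size for info in infos)
        uncompressed = sum(info.file_size for info in infos)
    return compressed, uncompressed


# Tiny in-memory archive with made-up content, standing in for the real download
buffer = io.BytesIO()
with ZipFile(buffer, 'w', compression=ZIP_DEFLATED) as zf:
    zf.writestr('table_demo.csv', '20150102,0,10.0,11.0,9.5,10.5,1000\n' * 1000)

compressed, uncompressed = archive_sizes(buffer)
print(f'{compressed} bytes compressed, {uncompressed} bytes uncompressed')
```

For the real data, `archive_sizes(data_fname)` should give the same two totals. The uncompressed figure is only a rough guide, since the in-memory size of a DataFrame depends on its dtypes.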

In [None]:
_data_dir = os.path.join('quantquote_daily_sp500_83986', 'daily')

def make_dataset(data_dir=_data_dir):
    data_files = [f for f in os.listdir(data_dir) if f.endswith('.csv')]
    symbols = [os.path.splitext(f)[0].split('_')[1] for f in data_files]
    df_by_symbol = (import_data(s, set_index=False, add_symbol=True)
                    for s in symbols)
    df = pd.concat(df_by_symbol)
    
    # Encode categorical variables efficiently
    df['symbol'] = df.symbol.astype('category')
    # Set an index and assert it is well behaved
    df = df.set_index(['symbol', 'date']).sort_index()
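    # Index.is_monotonic was removed in pandas 2.0; is_monotonic_increasing is the equivalent check
    assert df.index.is_unique and df.index.is_monotonic_increasing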
    return df

In [None]:
%time df = make_dataset()
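
The `astype('category')` step in `make_dataset` is there because the symbol column repeats a few hundred distinct strings over many rows. A rough sketch of the effect, using a made-up column rather than the real data:

```python
import pandas as pd

# Made-up symbol column with many repeated values, as in the combined dataset
symbols = pd.Series(['aapl', 'msft', 'goog'] * 10_000)

# deep=True also counts the memory held by the Python string objects
as_object = symbols.astype(object).memory_usage(deep=True)
as_category = symbols.astype('category').memory_usage(deep=True)
print(f'object: {as_object} bytes, category: {as_category} bytes')
```

A categorical column stores each distinct string once plus a small integer code per row, which is why it comes out smaller here. For the full dataset, `df.memory_usage(deep=True).sum()` reports the total footprint.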

In [None]:
df.groupby(level='symbol').mean()

We could now, for example, plot the closing prices of the top 20 stocks according to some ranking metric (let's say trading volume).

In [None]:
top20sym = (
    df.groupby(level='symbol')
      .sum()
      .sort_values(by='volume', ascending=False)
      .head(20)
      .index
      .tolist()
)

top20sym

In [None]:
top20 = df.loc[(top20sym, slice(None)), ['open', 'close']]

top20

In [None]:
top20.close.unstack('symbol').head()
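
The `unstack('symbol')` call turns the long series, with one row per (symbol, date) pair, into a wide table with one column per symbol, which is the shape `plot()` expects for one line per stock. A toy illustration with made-up prices and hypothetical symbols `aaa` and `bbb`:

```python
import pandas as pd

# Made-up closing prices for two hypothetical symbols over two dates
index = pd.MultiIndex.from_product(
    [['aaa', 'bbb'], pd.to_datetime(['2015-01-02', '2015-01-05'])],
    names=['symbol', 'date'],
)
close = pd.Series([10.0, 11.0, 20.0, 19.0], index=index, name='close')

# Long format: one row per (symbol, date). Wide format: one column per symbol.
wide = close.unstack('symbol')
print(wide)
```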

In [None]:
top20.close.unstack('symbol').plot()

We may want to exclude AAPL from the plot:

In [None]:
top20.close.drop('aapl').unstack('symbol').plot()

In [None]:
# bokeh plot

In [None]:
import seaborn as sbn

sbn.heatmap(top20.close.unstack('symbol').corr())