# Getting minute-level data from Algoseek!

In this notebook, we are going to load a sample of minute-level data from Algoseek.


In [1]:
# turn off warnings so they don't annoy us too much
import warnings
warnings.filterwarnings('ignore')

In [25]:
# get some auxiliary functions
from pathlib import Path
from tqdm import tqdm
# used to download the data while saving ONLY the extracted file
from zipfile import ZipFile
from urllib.request import urlopen
from io import BytesIO


# load libraries for in-Python data management
import numpy as np
import pandas as pd

# load libraries for graphs!
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# create object to help with multi-Index slicing
idx = pd.IndexSlice
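
`pd.IndexSlice` only becomes useful once the data carries a MultiIndex. As a minimal sketch of how it could be used later, the toy frame below mimics a (ticker, timestamp) index. The tickers, timestamps, and prices are invented for illustration and do not come from the Algoseek sample.

```python
import pandas as pd

idx = pd.IndexSlice

# toy frame with a (ticker, date_time) MultiIndex; all values are made up
index = pd.MultiIndex.from_product(
    [['AAPL', 'MSFT'],
     pd.date_range('2020-01-02 09:30', periods=3, freq='min')],
    names=['ticker', 'date_time'])
toy = pd.DataFrame({'price': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=index)

# every bar for a single ticker
aapl = toy.loc[idx['AAPL', :], :]

# every ticker, but only bars starting at 09:31 or later
late = toy.loc[idx[:, pd.Timestamp('2020-01-02 09:31'):], :]
```

Slicing on the second level like this requires the index to be sorted, which is the case here because `from_product` builds it in order.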

## Preparing the Data

Before working with the data, we should understand its layout. The Algoseek documentation describes every column of the minute bars:

https://us-equity-market-data-docs.s3.amazonaws.com/algoseek.US.Equity.TAQ.Minute.Bars.pdf

In [51]:
# these columns all hold timestamps
time_cols = ['openbartime',
              'firsttradetime',
              'highbidtime',
              'highasktime',
              'hightradetime',
              'lowbidtime',
              'lowasktime',
              'lowtradetime',
              'closebartime',
              'lasttradetime']

In [5]:
# we don't want these
drop_cols = ['unknowntickvolume',
             'cancelsize',
             'tradeatcrossorlocked']

In [6]:
# we will need these (listed for reference; the code below does not use this list directly)
keep_cols = ['firsttradeprice',
             'hightradeprice',
             'lowtradeprice',
             'lasttradeprice',
             'minspread',
             'maxspread',
             'volumeweightprice',
             'nbboquotecount',
             'tradeatbid',
             'tradeatmidbid',
             'tradeatmid',
             'tradeatmidask',
             'tradeatask',
             'volume',
             'totaltrades',
             'finravolume',
             'finravolumeweightprice',
             'uptickvolume',
             'downtickvolume',
             'repeatuptickvolume',
             'repeatdowntickvolume',
             'tradetomidvolweight',
             'tradetomidvolweightrelative']

Some of those column names are long and cumbersome. We will ease our burden by renaming them.

In [7]:
column_change = {'volumeweightprice':'price',
                 'finravolume':'fvolume',
                 'finravolumeweightprice':'fprice',
                 'uptickvolume':'up',
                 'downtickvolume':'down',
                 'repeatuptickvolume':'rup',
                 'repeatdowntickvolume':'rdown',
                 'firsttradeprice':'first',
                 'hightradeprice':'high',
                 'lowtradeprice':'low',
                 'lasttradeprice':'last',
                 'nbboquotecount':'nbbo',
                 'totaltrades':'ntrades',
                 'openbidprice':'obprice',
                 'openbidsize':'obsize',
                 'openaskprice':'oaprice',
                 'openasksize':'oasize',
                 'highbidprice':'hbprice',
                 'highbidsize':'hbsize',
                 'highaskprice':'haprice',
                 'highasksize':'hasize',
                 'lowbidprice':'lbprice',
                 'lowbidsize':'lbsize',
                 'lowaskprice':'laprice',
                 'lowasksize':'lasize',
                 'closebidprice':'cbprice',
                 'closebidsize':'cbsize',
                 'closeaskprice':'caprice',
                 'closeasksize':'casize',
                 'firsttradesize':'firstsize',
                 'hightradesize':'highsize',
                 'lowtradesize':'lowsize',
                 'lasttradesize':'lastsize',
                 'tradetomidvolweight':'volweight',
                 'tradetomidvolweightrelative':'volweightrel'}
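
To see what the renaming does, here is a small sketch on a one-row frame. The CamelCase headers and the values are assumptions made for illustration, and `subset_change` is a hypothetical excerpt of the mapping above. The chain lower-cases the headers, applies the mapping, and then shortens the `tradeat*` names the same way the loading function does further down.

```python
import pandas as pd

# one-row frame with a few raw-style headers (header casing and values are assumed)
raw = pd.DataFrame({'VolumeWeightPrice': [100.5],
                    'TotalTrades': [12],
                    'TradeAtBid': [300],
                    'Volume': [900]})

# hypothetical excerpt of the full column_change mapping
subset_change = {'volumeweightprice': 'price',
                 'totaltrades': 'ntrades',
                 'openbidprice': 'obprice'}  # not in this frame, so it is skipped

renamed = (raw.rename(columns=str.lower)
              .rename(columns=subset_change)
              .rename(columns=lambda c: c.replace('tradeat', 'at')))

new_names = list(renamed.columns)
```

By default `rename` silently ignores mapping keys that match no column, so one large dictionary can be applied even when a file lacks some of the columns.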

## Getting the data!

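Now we are going to get the data. The archive is large for a personal computer, and the download cell below holds the whole zip file in memory while extracting it, so make sure you have RAM and disk space to spare. I use Colab and generally don't run into any problems.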

In [23]:
nasdaq_path = Path('../../data/nasdaq100')

In [26]:
# download the data. It is pretty big. Might take a minute.
zip_url = 'https://algoseek-public.s3.amazonaws.com/nasdaq100-1min.zip'
with urlopen(zip_url) as zipresp:
  with ZipFile(BytesIO(zipresp.read())) as zfile:
    zfile.extractall(nasdaq_path)
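
The download cell wraps the response bytes in `BytesIO` so that `ZipFile` can read them without first writing the archive to disk. The sketch below tries the same pattern offline on a tiny archive built in memory. The file name and contents are invented stand-ins, not the real Algoseek layout.

```python
from io import BytesIO
from pathlib import Path
from tempfile import TemporaryDirectory
from zipfile import ZipFile

# build a tiny zip archive in memory to stand in for the downloaded bytes
buffer = BytesIO()
with ZipFile(buffer, 'w') as zf:
    zf.writestr('nasdaq100/demo/sample.csv', 'Date,Ticker\n20200102,AAPL\n')
payload = buffer.getvalue()

# extract straight from the bytes; only the extracted files touch the disk
with TemporaryDirectory() as tmp:
    with ZipFile(BytesIO(payload)) as zfile:
        zfile.extractall(tmp)
    extracted = sorted(p.relative_to(tmp).as_posix()
                       for p in Path(tmp).rglob('*') if p.is_file())
```

The trade-off is memory: the full archive sits in RAM until extraction finishes, which is convenient for a one-off download but worth keeping in mind on a small machine.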

In [54]:
def extract_and_combine_data():
  """ Combines the extracted Algoseek files into one DataFrame and saves it as HDF5 """

  # set the filepath
  path = nasdaq_path / 'nasdaq100'
  if not path.exists():
    path.mkdir(parents=True)

  data = []
  # this next part processes a LOT of files!!
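  for f in tqdm(list(path.glob('*/**/*.csv.gz'))):
    # read one file and lower-case the column names
    temp = pd.read_csv(f, dtype={'Date': str, 'TimeBarStart': str}).rename(columns=str.lower)
    # combine date and bar start time into one timestamp column
    # (nested-list parse_dates is deprecated in recent pandas releases)
    temp['date_timebarstart'] = pd.to_datetime(temp.pop('date') + ' ' + temp.pop('timebarstart'))
    temp = (temp.drop(time_cols + drop_cols, axis=1)
                .rename(columns=column_change)
                .set_index('date_timebarstart')
                .sort_index())
    # keep regular trading hours, index by (ticker, timestamp), shorten 'tradeat*' names
    temp = (temp.between_time('9:30', '16:00')
                .set_index('ticker', append=True)
                .swaplevel()
                .rename(columns=lambda x: x.replace('tradeat', 'at')))
    # append it to the data list
    data.append(temp)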

  # turn the list into a big DataFrame, with a more manageable index
  data = pd.concat(data).apply(pd.to_numeric, downcast='integer')
  data.index.rename(['ticker', 'date_time'], inplace=True)

  # check to make sure it is doing what we hoped
  data.info(show_counts=True)

  # save it as a speedy little HDF5 file
  data.to_hdf(nasdaq_path / 'algoseek.h5', key='1min_taq')
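
The per-file steps inside the function can be tried on a few rows before committing to the full run. The sketch below is a simplified stand-in: the CSV text is invented, and the `Date` (yyyymmdd) and `TimeBarStart` (HH:MM) formats are assumptions about the files rather than something confirmed here. It builds the timestamp, filters to trading hours, sets the (ticker, timestamp) index, and downcasts the integers.

```python
from io import StringIO

import pandas as pd

# four invented bars around the open and the close (formats are assumed)
csv_text = """Date,TimeBarStart,Ticker,Volume
20200102,09:29,AAPL,10
20200102,09:30,AAPL,20
20200102,16:00,AAPL,30
20200102,16:01,AAPL,40
"""

toy = pd.read_csv(StringIO(csv_text), dtype={'Date': str}).rename(columns=str.lower)
# combine the date and bar start time into one timestamp column
toy['date_time'] = pd.to_datetime(toy['date'] + ' ' + toy['timebarstart'],
                                  format='%Y%m%d %H:%M')
toy = (toy.drop(columns=['date', 'timebarstart'])
          .set_index('date_time')
          .sort_index()
          .between_time('9:30', '16:00')      # both endpoints are included
          .set_index('ticker', append=True)
          .swaplevel())

# shrink integer columns to the smallest dtype that holds their values
toy = toy.apply(pd.to_numeric, downcast='integer')
```

Note that `between_time` keeps both endpoints by default, so a bar that starts at 16:00 survives the filter. Whether that bar belongs in the regular session is a judgment call for the analysis.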

In [55]:
# now we can run that data loading function
extract_and_combine_data()

  2%|▏         | 1743/80194 [01:21<1:01:59, 21.09it/s]

KeyboardInterrupt: ignored

In [None]:
# list the extracted files to check that the glob pattern finds them
ee = nasdaq_path / 'nasdaq100'
list(ee.glob('*/**/*.csv.gz'))