# taq_data_load_V2

#### Juan Camilo Henao Londono - 26.03.2019
#### AG Guhr - Universitaet Duisburg-Essen

The TAQ data is stored in a compressed format, and a `C++` program is needed to decompress it. For my analysis I only decompressed the AAPL and MSFT tickers into `CSV` format to work with them in Python.

The TAQ data comes in two files, one with the trade information and another with the quote information. Each quotes file is very large (> 5 GB) because it contains the information for a full year, and the trades files are around 1 GB. In our case we analyze the year 2008.

I tried to load the data in different ways to avoid overloading the PC's memory with these files, and I found a `Python` module called `Dask`, a flexible library for parallel computing.

However, the results were not consistent. In the second version I tried to reduce the time needed to load and analyze the data.

The following characteristics of V1 were making the implementation really slow:

1. In the function `taq_data_extract` I was loading the full data every month, again and again, instead of loading it just once.

2. I was computing the dask dataframe to convert it into a pandas dataframe every month, but I think it is better to do it every day.

3. The algorithm also loads the data for days that have no data, and the error only appears when the day is used. This takes almost as long as saving a day that does have data.
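
Point 3 could be partly avoided by not visiting weekend dates at all. The sketch below is only an illustration, not part of V1 or V2: the helper name `weekdays_in_month` is hypothetical, and exchange holidays are not handled, so days without data can still occur.

```python
import calendar
from datetime import date


def weekdays_in_month(year, month):
    """Return zero-padded day strings for the Monday-Friday dates of a month.

    Hypothetical helper: it only skips weekends, not exchange holidays.
    """
    # monthrange returns (weekday of the first day, number of days in the month)
    _, n_days = calendar.monthrange(year, month)
    return ['{:02d}'.format(day)
            for day in range(1, n_days + 1)
            # date.weekday() is 0 for Monday and 6 for Sunday
            if date(year, month, day).weekday() < 5]


january_days = weekdays_in_month(2008, 1)
print(len(january_days))
```

Iterating over such a list instead of all the days from 1 to 31 would skip the weekend dates before any data is selected.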

With this in mind I made an implementation (V2) where the data is loaded with dask but computed every day. The test for a single day shows the following:

* V1 = 3min 6s to load the data and compute the result to pandas. Then 52.7 s to select and save the required data. In total it used 3min 59s.
* V2 = 1min 27s to load the data. Then 1min 50s to select, compute and save the required data. In total it used 3min 17s.

Because selecting, computing and saving the required data is about two times slower in V2 than in V1, the results for the whole year will be slower. The results for one month are:

* V1 = 2min 53s to load the data and compute the result to pandas. Then 45 s to select and save the required data. In total it used 25min 12s to analyze January.
* V2 = 1min 20s to load the data. Then 1min 40s to select, compute and save the required data. In total it used 53min 53s to analyze January.
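
As a rough check, the monthly totals can be approximated from the per-step timings. This sketch assumes (it is not stated above) that the select-and-save time is paid once for each of the 31 day strings the loop visits and that it stays constant from day to day. The function names are hypothetical.

```python
def estimate_month_seconds(load_s, per_day_s, n_days=31):
    """Rough total: one load plus a fixed cost for each day in the loop."""
    return load_s + n_days * per_day_s


def as_min_sec(seconds):
    """Format a number of seconds as 'Xmin Ys'."""
    minutes, secs = divmod(seconds, 60)
    return '{}min {}s'.format(minutes, secs)


# V1: 2min 53s to load, then 45 s per day
v1_total = estimate_month_seconds(load_s=2 * 60 + 53, per_day_s=45)
# V2: 1min 20s to load, then 1min 40s per day
v2_total = estimate_month_seconds(load_s=1 * 60 + 20, per_day_s=1 * 60 + 40)

print('V1 estimate:', as_min_sec(v1_total))
print('V2 estimate:', as_min_sec(v2_total))
```

Under these assumptions both estimates land within about a minute of the measured totals, which suggests that the per-day step, not the initial load, dominates the monthly run time.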

In conclusion, V2 is not the solution.

#### This implementation is NOT used to obtain the TAQ data

In [1]:
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.multiprocessing import get
import pickle
import matplotlib.pyplot as plt
import swifter
%matplotlib inline

In [2]:
def get_sec(time_str):
    h, m, s = time_str.split(':')
    return int(h) * 3600 + int(m) * 60 + int(s)
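
Applying `get_sec` row by row runs a Python function call for every record. As a hedged alternative, not used in this notebook, pandas can parse `HH:MM:SS` strings for a whole column at once with `pd.to_timedelta`. The series `times` below is made-up example data.

```python
import pandas as pd

# Example data only: the two boundary times used later in this notebook
times = pd.Series(['09:40:00', '15:50:00'])

# Parse the strings as timedeltas and convert them to whole seconds
seconds = pd.to_timedelta(times).dt.total_seconds().astype(int)

print(seconds.tolist())
```

The result matches what `get_sec` returns for the same strings. Whether it is faster on the real files was not measured here.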

In [3]:
def taq_data_extract(ticker, year):

    print('Obtaining data from ticker {}'.format(ticker))
    data_quotes = dd.read_csv('../TAQ_{1}/Data/{0}_{1}_NASDAQ_quotes.csv'
                              .format(ticker, year),
                              usecols=range(4),
                              sep=' ',
                              names=['Date', 'Time', 'Bid', 'Ask'],
                              parse_dates=['Date']).set_index('Date')

    data_trades = dd.read_csv('../TAQ_{1}/Data/{0}_{1}_NASDAQ_trades.csv'
                              .format(ticker, year),
                              usecols=range(3),
                              sep=' ',
                              names=['Date', 'Time', 'Ask'],
                              parse_dates=['Date']).set_index('Date')
    
    return (data_quotes, data_trades)

In [4]:
def data_to_array(quotes, trades, year, month, day):
    
    print('Processing data')

    print('Day data')

    # Use the `day` argument instead of the global loop variable `d`
    date = year + '-' + month + '-' + day
    data_q = quotes.loc[date].copy().compute()
    data_t = trades.loc[date].copy().compute()

    data_q.loc[:, 'Time'] = data_q['Time'].apply(get_sec)
    data_t.loc[:, 'Time'] = data_t['Time'].apply(get_sec)

    data_q = data_q.loc[(data_q['Time'] >= 34800) & (data_q['Time'] < 57000)]
    data_t = data_t.loc[(data_t['Time'] >= 34800) & (data_t['Time'] < 57000)]

    print('Saving data ' + year + '-' + month + '-' + day)

    print('Quotes')
    time_q = np.array(data_q['Time'])
    bid_q = np.array(data_q['Bid'])
    ask_q = np.array(data_q['Ask'])

    print('Time, bid and ask')
    # NOTE: `ticker` is not a parameter; it is read from the notebook's global scope
    with open('../TAQ_{1}/TAQ_py/TAQ_{0}_quotes_{1}{2}{3}.pickle'
              .format(ticker, year, month, day), 'wb') as f:
        pickle.dump((time_q, bid_q, ask_q), f)

    print('Trades')
    time_t = np.array(data_t['Time'])
    ask_t = np.array(data_t['Ask'])
    time_t, ask_t = zip(*sorted(zip(time_t, ask_t)))
    time_t = np.asarray(time_t)
    ask_t = np.asarray(ask_t)        

    print('Time and ask')
    # NOTE: `ticker` is not a parameter; it is read from the notebook's global scope
    with open('../TAQ_{1}/TAQ_py/TAQ_{0}_trades_{1}{2}{3}.pickle'
              .format(ticker, year, month, day), 'wb') as f:
        pickle.dump((time_t, ask_t), f)

    print()
        
    return (time_q, bid_q, ask_q, time_t, ask_t)
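
The `zip(*sorted(zip(time_t, ask_t)))` step raises `ValueError: not enough values to unpack (expected 2, got 0)` when a day has no trades, because unpacking an empty `zip` yields nothing. A minimal sketch of an alternative, assuming the goal is to order trades by time and then by price exactly as the tuple sort does, uses `np.lexsort`. The function name `sort_trades` is hypothetical.

```python
import numpy as np


def sort_trades(time_t, ask_t):
    """Sort trades by time, breaking ties by price; works on empty input."""
    time_t = np.asarray(time_t)
    ask_t = np.asarray(ask_t)
    # lexsort uses the LAST key as the primary one: time first, then price
    order = np.lexsort((ask_t, time_t))
    return time_t[order], ask_t[order]


sorted_time, sorted_ask = sort_trades([3, 1, 1], [10.0, 7.0, 5.0])
empty_time, empty_ask = sort_trades([], [])

print(sorted_time, sorted_ask)
print(len(empty_time), len(empty_ask))
```

With this approach an empty day returns two empty arrays instead of raising, so the caller would need its own check if days without data should still be skipped.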

In [None]:
def months_days_list():
    # Zero-padded strings: days '01'..'31' and months '01'..'12'
    days_list = ['{:02d}'.format(i) for i in range(1, 32)]
    months_list = ['{:02d}'.format(i) for i in range(1, 13)]

    return (months_list, days_list)

In [None]:
%%time

tickers = ['AAPL', 'MSFT']
year = '2008'

print(get_sec('09:40:00'))
print(get_sec('15:50:00'))

months_list, days_list = months_days_list()

for ticker in [tickers[0]]:
    
    %time data_quotes, data_trades = taq_data_extract(ticker, year)
    
    for month in [months_list[0]]:
        
        om_days = days_list[:]
            
        for d in days_list:

            try:
                %time data_to_array(data_quotes, data_trades, year, month, d)
            except KeyError:
                om_days.remove(d)
                
        # Save the list of days kept for this month
        with open('days_{}.pickle'.format(month), 'wb') as f:
            pickle.dump(om_days, f)

34800
57000
Obtaining data from ticker AAPL
CPU times: user 2min 23s, sys: 8.44 s, total: 2min 31s
Wall time: 1min 18s
Processing data
Day data
Saving data 2008-01-01
Quotes
Time, bid and ask
Trades


ValueError: not enough values to unpack (expected 2, got 0)

Processing data
Day data
Saving data 2008-01-02
Quotes
Time, bid and ask
Trades
Time and ask

CPU times: user 2min 44s, sys: 14.5 s, total: 2min 59s
Wall time: 1min 43s
Processing data
Day data
Saving data 2008-01-03
Quotes
Time, bid and ask
Trades
Time and ask

CPU times: user 2min 44s, sys: 14.5 s, total: 2min 59s
Wall time: 1min 42s
Processing data
Day data
Saving data 2008-01-04
Quotes
Time, bid and ask
Trades
Time and ask

CPU times: user 2min 46s, sys: 14.5 s, total: 3min 1s
Wall time: 1min 43s
Processing data
Day data
Saving data 2008-01-05
Quotes
Time, bid and ask
Trades


ValueError: not enough values to unpack (expected 2, got 0)

Processing data
Day data
Saving data 2008-01-06
Quotes
Time, bid and ask
Trades


ValueError: not enough values to unpack (expected 2, got 0)

Processing data
Day data
Saving data 2008-01-07
Quotes
Time, bid and ask
Trades
Time and ask

CPU times: user 2min 47s, sys: 14.5 s, total: 3min 2s
Wall time: 1min 45s
Processing data
Day data
Saving data 2008-01-08
Quotes
Time, bid and ask
Trades
Time and ask

CPU times: user 2min 48s, sys: 14.7 s, total: 3min 2s
Wall time: 1min 44s
Processing data
Day data
Saving data 2008-01-09
Quotes
Time, bid and ask
Trades
Time and ask

CPU times: user 2min 47s, sys: 14 s, total: 3min 1s
Wall time: 1min 42s
Processing data
Day data
Saving data 2008-01-10
Quotes
Time, bid and ask
Trades
Time and ask

CPU times: user 2min 46s, sys: 15 s, total: 3min 1s
Wall time: 1min 40s
Processing data
Day data
Saving data 2008-01-11
Quotes
Time, bid and ask
Trades
Time and ask

CPU times: user 2min 47s, sys: 14.4 s, total: 3min 1s
Wall time: 1min 40s
Processing data
Day data
Saving data 2008-01-12
Quotes
Time, bid and ask
Trades


ValueError: not enough values to unpack (expected 2, got 0)

Processing data
Day data
Saving data 2008-01-13
Quotes
Time, bid and ask
Trades


ValueError: not enough values to unpack (expected 2, got 0)

Processing data
Day data
Saving data 2008-01-14
Quotes
Time, bid and ask
Trades
Time and ask

CPU times: user 2min 47s, sys: 14.9 s, total: 3min 1s
Wall time: 1min 40s
Processing data
Day data
Saving data 2008-01-15
Quotes
Time, bid and ask
Trades
Time and ask

CPU times: user 2min 47s, sys: 14.8 s, total: 3min 2s
Wall time: 1min 41s
Processing data
Day data
Saving data 2008-01-16
Quotes
Time, bid and ask
Trades
Time and ask

CPU times: user 2min 48s, sys: 14.8 s, total: 3min 3s
Wall time: 1min 43s
Processing data
Day data
Saving data 2008-01-17
Quotes
Time, bid and ask
Trades
Time and ask

CPU times: user 2min 46s, sys: 14.7 s, total: 3min 1s
Wall time: 1min 42s
Processing data
Day data
Saving data 2008-01-18
Quotes
Time, bid and ask
Trades
Time and ask

CPU times: user 2min 47s, sys: 13.9 s, total: 3min 1s
Wall time: 1min 41s
Processing data
Day data
Saving data 2008-01-19
Quotes
Time, bid and ask
Trades


ValueError: not enough values to unpack (expected 2, got 0)

Processing data
Day data
Saving data 2008-01-20
Quotes
Time, bid and ask
Trades


ValueError: not enough values to unpack (expected 2, got 0)

Processing data
Day data
Saving data 2008-01-21
Quotes
Time, bid and ask
Trades


ValueError: not enough values to unpack (expected 2, got 0)

Processing data
Day data
Saving data 2008-01-22
Quotes
Time, bid and ask
Trades
Time and ask

CPU times: user 2min 53s, sys: 15.4 s, total: 3min 9s
Wall time: 1min 46s
Processing data
Day data
Saving data 2008-01-23
Quotes
Time, bid and ask
Trades
Time and ask

CPU times: user 2min 53s, sys: 14.7 s, total: 3min 8s
Wall time: 1min 45s
Processing data
Day data
Saving data 2008-01-24
Quotes
Time, bid and ask
Trades
Time and ask

CPU times: user 2min 47s, sys: 14.7 s, total: 3min 2s
Wall time: 1min 42s
Processing data
Day data
Saving data 2008-01-25
Quotes
Time, bid and ask
Trades
Time and ask

CPU times: user 2min 47s, sys: 14.7 s, total: 3min 2s
Wall time: 1min 41s
Processing data
Day data
Saving data 2008-01-26
Quotes
Time, bid and ask
Trades


ValueError: not enough values to unpack (expected 2, got 0)

Processing data
Day data
Saving data 2008-01-27
Quotes
Time, bid and ask
Trades


ValueError: not enough values to unpack (expected 2, got 0)

Processing data
Day data
Saving data 2008-01-28
Quotes
Time, bid and ask
Trades
Time and ask

CPU times: user 2min 48s, sys: 14.7 s, total: 3min 3s
Wall time: 1min 40s
Processing data
Day data
Saving data 2008-01-29
Quotes
Time, bid and ask
Trades
Time and ask

CPU times: user 2min 43s, sys: 13.8 s, total: 2min 57s
Wall time: 1min 38s
Processing data
Day data
