# itch_trade_sign_classification_exp

#### Juan Camilo Henao Londono - 28.02.2019
#### AG Guhr - Universitaet Duisburg-Essen

Test to find the number of identified trades and the number of matches in the ITCH data using the two models used by S. Wang. They appear in this [paper](https://arxiv.org/pdf/1603.01580.pdf) as equations (2) and (3).

#### THIS IS NOT THE FINAL VERSION OF THE IMPLEMENTATION

# Classification of trade signs

The sign of each trade is first classified by comparing the current price with the prior price

$$\epsilon (t;n) = \left\{ \begin{array}{cc} \text{sgn}\left(S\left(t;n\right)-S\left(t;n-1\right)\right), & \text{if }S\left(t;n\right)\ne S\left(t;n-1\right)\\ \epsilon\left(t;n-1\right), & \text{otherwise} \end{array}\right.$$

If the current price is higher (lower) than the prior price, the trade sign is defined as +1 (-1). If two consecutive trades with the same trading direction together did not exhaust the available volume at the best quote, both trades would have the same price, so the second trade takes the sign of the first.
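
As an illustration of the rule above, here is a minimal sketch on a toy price series. The function name `tick_rule_signs` and the choice of sign 0 for the very first trade (which has no prior price) are assumptions made for this example, not part of the implementation in this notebook.

```python
import numpy as np


def tick_rule_signs(prices):
    """Classify each trade by comparing its price with the previous one.

    Assumption: the first trade has no prior price, so its sign is set to 0.
    """
    prices = np.asarray(prices, dtype=float)
    signs = np.zeros(len(prices))
    for n in range(1, len(prices)):
        diff = prices[n] - prices[n - 1]
        if diff != 0:
            # The price changed: the sign follows the price change
            signs[n] = np.sign(diff)
        else:
            # Same price: reuse the sign of the previous trade
            signs[n] = signs[n - 1]
    return signs


toy_prices = [100, 101, 101, 100, 100, 102]
toy_signs = tick_rule_signs(toy_prices)
print(toy_signs)
```

In this toy series the third and fifth trades repeat the previous price, so they inherit the signs +1 and -1 respectively.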
 
During the time interval t, the number of trades is denoted by N(t), and the individual trades are numbered n = 1, ..., N(t). The trade sign for each one-second time interval is therefore defined as

$$\epsilon (t) = \left\{ \begin{array}{cc} \text{sgn}\left(\sum_{n=1}^{N(t)} \epsilon(t;n) \right), & \text{if }N(t) > 0\\ 0, & \text{if } N(t) = 0 \end{array}\right.$$

Here, if more than one trade occurs in the one-second interval t, the sign of the sum of all the trade signs in this interval is taken, which is the same as the sign of their average.

As a result, ε(t) = +1 implies that a majority of trades in the time interval t were triggered by buy market orders, whereas ε(t) = −1 indicates a majority of sell market orders. If ε(t) = 0, either no trades took place in the time interval t or buy and sell market orders were balanced in this interval.
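
The aggregation per time interval can be sketched as follows. This is a simplified example under stated assumptions: the function name `aggregate_signs`, the dictionary return type and the toy timestamps (in milliseconds) are hypothetical and only serve to show the sign-of-sum step.

```python
import numpy as np


def aggregate_signs(times_ms, signs, interval_ms=1000):
    """Return {interval index: sign of the summed trade signs}.

    Intervals without trades are simply absent from the result, which
    corresponds to a trade sign of 0 for those intervals.
    """
    times_ms = np.asarray(times_ms)
    signs = np.asarray(signs, dtype=float)
    # Integer index of the interval each trade falls into
    bins = times_ms // interval_ms
    result = {}
    for b in np.unique(bins):
        result[int(b)] = float(np.sign(signs[bins == b].sum()))
    return result


toy_times = [100, 500, 900, 1200, 1800, 3500]
toy_signs = [1, 1, -1, 1, -1, -1]
per_second = aggregate_signs(toy_times, toy_signs)
print(per_second)
```

In the toy data, second 0 has a majority of buys, second 1 is balanced, second 2 has no trades and second 3 has a single sell.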

In [1]:
# Modules

import numpy as np
import os
from matplotlib import pyplot as plt
%matplotlib inline

import gzip
import pickle

__tau__ = 1000

In [2]:
def itch_taq_trade_signs_load_test(ticker, year, month, day):
    
    # Load data

    print('Load trade sign classification for the stock ' + ticker + ' the ' + year + '.' + month 
          + '.' + day)
    
    data = np.genfromtxt(gzip.open('../ITCH_{1}/{1}{2}{3}_{0}.csv.gz'
                         .format(ticker, year, month, day)),
                         dtype='str', skip_header=1, delimiter=',')

    # Lists of times, ids, types, volumes and prices
    # List of all the information available in the data excluding
    # the last two columns

    # List of order types:
    # "B" = 1 -> Add buy order
    # "S" = 2 -> Add sell order
    # "E" = 3 -> Execute outstanding order in part
    # "C" = 4 -> Cancel outstanding order in part
    # "F" = 5 -> Execute outstanding order in full
    # "D" = 6 -> Delete outstanding order in full
    # "X" = 7 -> Bulk volume for the cross event
    # "T" = 8 -> Execute non-displayed order
    times_ = np.array([int(mytime) for mytime in data[:, 0]])
    ids_ = np.array([int(myid) for myid in data[:, 2]])
    types_ = np.array([1 * (mytype == 'B') +
                       2 * (mytype == 'S') +
                       3 * (mytype == 'E') +
                       4 * (mytype == 'C') +
                       5 * (mytype == 'F') +
                       6 * (mytype == 'D') +
                       7 * (mytype == 'X') +
                       8 * (mytype == 'T') for mytype in data[:, 3]])
    volumes_ = np.array([int(myvolume) for myvolume in data[:, 4]])
    prices_ = np.array([int(myprice) for myprice in data[:, 5]])

    ids = ids_[types_ < 7]
    times = times_[types_ < 7]
    types = types_[types_ < 7]
    volumes = volumes_[types_ < 7]
    prices = prices_[types_ < 7]
        
    # Reference lists
    # Reference lists using the original values or the length of the original
    # lists

    prices_ref = 1 * prices
    types_ref = 0 * types
    times_ref = 0 * times
    volumes_ref = 0 * types
    newids = {}
    hv = 0

    # Help lists with the data of the buy orders and sell orders

    hv_prices = prices[types < 3]
    hv_types = types[types < 3]
    hv_times = times[types < 3]
    hv_volumes = volumes[types < 3]

    trade_sign = 0 * types
    price_sign = 0 * types
    volume_sign = 0 * types
    
    # Fill the reference lists for the messages of type 'E', 'C', 'F', 'D'

    # Loop over all the messages (the length of the ids list)
    for iii in range(len(ids)):

        # If the data is a sell or buy order
        if (types[iii] < 3):

            # Insert in the dictionary newids a key with the value of the id
            # and the value of hv (a counter) that is the index in hv_types
            newids[ids[iii]] = hv

            # Increase the value of hv
            hv += 1

            trade_sign[iii] = 0
            price_sign[iii] = 0

        # If the data is an execution of an outstanding order ('E' or 'F')
        elif (types[iii] == 3 or
                types[iii] == 5):

            # Fill the values of prices_ref with no prices ('E', 'C', 'F', 'D')
            # with the price of the order
            prices_ref[iii] = hv_prices[newids[ids[iii]]]

            # Fill the values of types_ref with no  prices ('E', 'C', 'F', 'D')
            # with the type of the order
            types_ref[iii] = hv_types[newids[ids[iii]]]

            # Fill the values of time_ref with no  prices ('E', 'C', 'F', 'D')
            # with the time of the order
            times_ref[iii] = hv_times[newids[ids[iii]]]
            
            # Fill the values of volumes_ref with no  prices ('E','C','F', 'D')
            # with the volume of the order
            volumes_ref[iii] = hv_volumes[newids[ids[iii]]] 

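            # The executed outstanding order is a sell order, so the trade
            # was triggered by a buy market order
            if (hv_types[newids[ids[iii]]] == 2):

                trade_sign[iii] = 1.
                price_sign[iii] = prices_ref[iii]
                volume_sign[iii] = volumes_ref[iii]

            # The executed outstanding order is a buy order, so the trade
            # was triggered by a sell market order
            elif (hv_types[newids[ids[iii]]] == 1):

                trade_sign[iii] = - 1.
                price_sign[iii] = prices_ref[iii]
                volume_sign[iii] = volumes_ref[iii]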

        else:

            # Fill the values of types_ref with no  prices ('E', 'C', 'F', 'D')
            # with the type of the order
            types_ref[iii] = hv_types[newids[ids[iii]]]

            # Fill the values of time_ref with no  prices ('E', 'C', 'F', 'D')
            # with the time of the order
            times_ref[iii] = hv_times[newids[ids[iii]]]

            trade_sign[iii] = 0
            price_sign[iii] = 0
            volume_sign[iii] = 0
            
    # Keep only the data in the open market time

    # This product behaves as a logical and: both conditions must hold, in
    # this case, the time (in ms) must be within the market trade hours
    # used here (between 09:40 and 15:50)
    day_times_ind = (1. * times / 3600 / 1000 > 9.666666) * \
                    (1. * times / 3600 / 1000 < 15.833333) > 0

    price_signs = price_sign[day_times_ind]
    trade_signs = trade_sign[day_times_ind]
    volume_signs = volume_sign[day_times_ind]
    times_signs = times[day_times_ind]
        
    return (price_signs, trade_signs, volume_signs, times_signs)
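
The sign assignment in the function above can be illustrated in isolation. The sketch below is a simplified, hypothetical version: the function name `execution_signs` and the tuple layout `(order_id, msg_type, price)` are assumptions made for the example and do not match the column layout of the ITCH files. It only shows the idea that an execution against a resting sell order is a buy-initiated trade (+1), and an execution against a resting buy order is a sell-initiated trade (-1).

```python
def execution_signs(messages):
    """Return a list of (trade sign, price) for every execution message.

    messages: iterable of (order_id, msg_type, price) tuples, where
    msg_type follows the letters used above ('B', 'S', 'E', 'F', ...).
    """
    resting = {}  # order id -> (side, price) of the add-order message
    signs = []
    for order_id, msg_type, price in messages:
        if msg_type in ('B', 'S'):
            # Remember the side and price of the newly added order
            resting[order_id] = (msg_type, price)
        elif msg_type in ('E', 'F'):
            # Look up the order being executed; its side fixes the sign
            side, ref_price = resting[order_id]
            signs.append((1 if side == 'S' else -1, ref_price))
        # Cancellations and deletions do not produce a trade sign
    return signs


toy_messages = [
    (1, 'B', 100),  # add buy order
    (2, 'S', 101),  # add sell order
    (2, 'E', 0),    # partial execution of the sell order
    (1, 'F', 0),    # full execution of the buy order
    (2, 'D', 0),    # delete the rest of the sell order
]
print(execution_signs(toy_messages))
```

The execution messages carry no price of their own in this toy example, so the price is taken from the referenced order, as the reference lists do in the function above.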

## Equation 2
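
The cells in this section report the number of matches and the accuracy as the percentage of trades whose identified sign equals the reference sign. A minimal sketch of that comparison, with a hypothetical helper name `classification_accuracy` and toy inputs, could look like this:

```python
import numpy as np


def classification_accuracy(reference_signs, identified_signs):
    """Return (number of matches, accuracy in percent)."""
    reference_signs = np.asarray(reference_signs)
    identified_signs = np.asarray(identified_signs)
    # Element-wise comparison; True counts as 1 in the sum
    matches = int(np.sum(reference_signs == identified_signs))
    accuracy = round(matches / len(reference_signs) * 100, 2)
    return matches, accuracy


toy_reference = [1, -1, 1, 1]
toy_identified = [1, -1, -1, 1]
print(classification_accuracy(toy_reference, toy_identified))
```

Three of the four toy signs agree, which gives an accuracy of 75 %.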

In [3]:
def itch_taq_trade_signs_consecutive_trades_ms(ticker, price_signs, trade_signs, times_signs, year, month, day):
    
    print('Accuracy of the trade sign classification for the stock ' + ticker + ' the ' + year + '.' + month 
          + '.' + day)
    
    # trades with values different to zero to obtain the theoretical value
    price_no_0 = price_signs[trade_signs != 0]
    trades_no_0 = trade_signs[trade_signs != 0]
    time_no_0 = times_signs[trade_signs != 0]
    time_no_0_set = np.array(list(sorted(set(time_no_0))))
    
    identified_trades = np.zeros(len(time_no_0))

    count = 0

    for t_idx, t_val in enumerate(time_no_0_set):

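        while (count < len(time_no_0) and time_no_0[count] == t_val):

            # Note: for count == 0 the index count - 1 wraps around to the
            # last trade of the day
            diff = price_no_0[count] - price_no_0[count - 1]

            if (diff):

                # Price changed: the sign follows the price change
                identified_trades[count] = np.sign(diff)

            else:

                # Same price: reuse the sign of the previous trade
                identified_trades[count] = identified_trades[count - 1]

            count += 1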
                
    print('For consecutive trades in ms:')
    print('Accuracy of the classification:', round(sum(trades_no_0 == identified_trades) / len(trades_no_0) * 100, 2), '%')
    print('Number of identified trades:', len(trades_no_0))
    print('Number of matches:', sum(trades_no_0 == identified_trades))
    print()
    
    return identified_trades


In [4]:
def itch_taq_trade_signs_eq2_ms(ticker, trade_signs, times_signs, identified_trades, year, month, day):
    
    trades_no_0 = trade_signs[trade_signs!= 0]
    time_no_0 = times_signs[trade_signs!= 0]
    time_no_0_set = np.array(list(sorted(set(time_no_0))))
    
    trades_exp_ms = np.zeros(len(time_no_0_set))
    trades_teo_ms = np.zeros(len(time_no_0_set))

    for t_idx, t_val in enumerate(time_no_0_set):

        # Experimental
        trades_same_t_exp = identified_trades[time_no_0 == t_val]
        sign_exp = np.sign(np.sum(trades_same_t_exp))
        trades_exp_ms[t_idx] = sign_exp
        
        # Theoretical
        trades_same_t_teo = trades_no_0[time_no_0 == t_val]
        sign_teo = np.sign(np.sum(trades_same_t_teo))
        trades_teo_ms[t_idx] = sign_teo
        
    print('Reducing the trades to 1 per millisecond:')
    print('Accuracy of the classification:', round(sum(trades_teo_ms == trades_exp_ms) / len(trades_teo_ms) * 100, 2), '%')
    print('Number of identified trades signs:', len(trades_teo_ms))
    print('Number of matches:', sum(trades_teo_ms == trades_exp_ms))
    print()
    
    return (trades_teo_ms, trades_exp_ms)

In [5]:
def itch_taq_trade_signs_s(ticker, trades_teo_ms, trades_exp_ms, times_signs, year, month, day):

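    # Note: trade_signs is not a parameter of this function; it is read from
    # the global scope of the notebook (set in the loop that calls it)
    time_no_0 = times_signs[trade_signs != 0]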
    time_no_0_set = np.array(list(sorted(set(time_no_0))))
    
    full_time = np.array(range(34800, 57000))
    trades_teo_s_0 = 0. * full_time
    trades_exp_s_0 = 0. * full_time

    for t_idx, t_val in enumerate(full_time):
        # Trades strictly inside the second that starts at t_val (times in ms)
        in_second = ((time_no_0_set > t_val * 1000)
                     & (time_no_0_set < (t_val + 1) * 1000))
        trades_teo_s_0[t_idx] = np.sign(np.sum(trades_teo_ms[in_second]))
        trades_exp_s_0[t_idx] = np.sign(np.sum(trades_exp_ms[in_second]))
        
    trades_teo_s = trades_teo_s_0[trades_teo_s_0 != 0]
    trades_exp_s = trades_exp_s_0[trades_teo_s_0 != 0]
    
    print('Reducing the trades to 1 per second:')
    print('Accuracy of the classification:', round(sum(trades_teo_s == trades_exp_s) / len(trades_teo_s) * 100, 2), '%')
    print('Number of identified trades signs:', len(trades_teo_s))
    print('Number of matches:', sum(trades_teo_s == trades_exp_s))
    print()

In [6]:
ticker = ['AAPL', 'AAPL', 'GS', 'GS', 'XOM', 'XOM']
year = '2008'
month = ['01', '06', '10', '12', '02', '08']
day = ['07', '02', '07', '10', '11', '04']

for (t, m, d) in zip(ticker, month, day):

    price_signs, trade_signs, _, times_signs = itch_taq_trade_signs_load_test(t, year, m, d)
    identified_trades = itch_taq_trade_signs_consecutive_trades_ms(t, price_signs, trade_signs,
                                                                       times_signs, year, m, d)
    trades_teo_ms, trades_exp_ms = itch_taq_trade_signs_eq2_ms(t, trade_signs, times_signs, 
                                                               identified_trades, year, m, d)
    itch_taq_trade_signs_s(t, trades_teo_ms, trades_exp_ms, times_signs, year, m, d)
    

Load trade sign classification for the stock AAPL the 2008.01.07
Accuracy of the trade sign classification for the stock AAPL the 2008.01.07
For consecutive trades in ms:
Accuracy of the classification: 83.03 %
Number of identified trades: 120287
Number of matches: 99871

Reducing the trades to 1 per millisecond:
Accuracy of the classification: 81.51 %
Number of identified trades signs: 83411
Number of matches: 67988

Reducing the trades to 1 per second:
Accuracy of the classification: 78.05 %
Number of identified trades signs: 15591
Number of matches: 12169

Load trade sign classification for the stock AAPL the 2008.06.02
Accuracy of the trade sign classification for the stock AAPL the 2008.06.02
For consecutive trades in ms:
Accuracy of the classification: 89.05 %
Number of identified trades: 52691
Number of matches: 46919

Reducing the trades to 1 per millisecond:
Accuracy of the classification: 87.41 %
Number of identified trades signs: 35906
Number of matches: 31385

Reducing the 

## Equation 3
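
The function in this section differs from the one used for equation 2 in a single step: each identified trade sign is multiplied by the traded volume before the sum is taken. The sketch below, with a hypothetical helper name `volume_weighted_sign` and toy numbers, shows how the weighting can change the aggregated sign.

```python
import numpy as np


def volume_weighted_sign(signs, volumes):
    """Sign of the volume-weighted sum of the trade signs in one interval."""
    signs = np.asarray(signs, dtype=float)
    volumes = np.asarray(volumes, dtype=float)
    return float(np.sign(np.sum(signs * volumes)))


toy_signs = [1, 1, -1]
toy_volumes = [100, 100, 500]

# Without weights two buys outnumber one sell
unweighted = float(np.sign(np.sum(toy_signs)))
# With weights the single large sell dominates the interval
weighted = volume_weighted_sign(toy_signs, toy_volumes)
print(unweighted, weighted)
```

Here the unweighted aggregation gives +1 and the volume-weighted aggregation gives -1, because the sell trade carries more volume than both buy trades together.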

In [7]:
def itch_taq_trade_signs_eq3_ms(ticker, trade_signs, volume_signs, times_signs, identified_trades, year, month, day):
    
    trades_no_0 = trade_signs[trade_signs!= 0]
    volumes_no_0 = volume_signs[trade_signs!= 0]
    time_no_0 = times_signs[trade_signs!= 0]
    time_no_0_set = np.array(list(sorted(set(time_no_0))))
    
    trades_exp_ms = np.zeros(len(time_no_0_set))
    trades_teo_ms = np.zeros(len(time_no_0_set))

    for t_idx, t_val in enumerate(time_no_0_set):

        # Experimental
        trades_same_t_exp = identified_trades[time_no_0 == t_val]
        volumes_same_t = volumes_no_0[time_no_0 == t_val]
        sign_exp = np.sign(np.sum(trades_same_t_exp * volumes_same_t))
        trades_exp_ms[t_idx] = sign_exp
        
        # Theoretical
        trades_same_t_teo = trades_no_0[time_no_0 == t_val]
        sign_teo = np.sign(np.sum(trades_same_t_teo))
        trades_teo_ms[t_idx] = sign_teo
        
    print('Reducing the trades to 1 per millisecond:')
    print('Accuracy of the classification:', round(sum(trades_teo_ms == trades_exp_ms) / len(trades_teo_ms) * 100, 2), '%')
    print('Number of identified trades signs:', len(trades_teo_ms))
    print('Number of matches:', sum(trades_teo_ms == trades_exp_ms))
    print()
    
    return (trades_teo_ms, trades_exp_ms)

In [8]:
ticker = ['AAPL', 'AAPL', 'GS', 'GS', 'XOM', 'XOM']
year = '2008'
month = ['01', '06', '10', '12', '02', '08']
day = ['07', '02', '07', '10', '11', '04']

for (t, m, d) in zip(ticker, month, day):

    price_signs, trade_signs, volume_signs, times_signs = itch_taq_trade_signs_load_test(t, year, m, d)
    identified_trades = itch_taq_trade_signs_consecutive_trades_ms(t, price_signs, trade_signs,
                                                                       times_signs, year, m, d)
    trades_teo_ms, trades_exp_ms = itch_taq_trade_signs_eq3_ms(t, trade_signs, volume_signs, times_signs, 
                                                               identified_trades, year, m, d)
    itch_taq_trade_signs_s(t, trades_teo_ms, trades_exp_ms, times_signs, year, m, d)
    

Load trade sign classification for the stock AAPL the 2008.01.07
Accuracy of the trade sign classification for the stock AAPL the 2008.01.07
For consecutive trades in ms:
Accuracy of the classification: 83.03 %
Number of identified trades: 120287
Number of matches: 99871

Reducing the trades to 1 per millisecond:
Accuracy of the classification: 81.73 %
Number of identified trades signs: 83411
Number of matches: 68175

Reducing the trades to 1 per second:
Accuracy of the classification: 78.14 %
Number of identified trades signs: 15591
Number of matches: 12183

Load trade sign classification for the stock AAPL the 2008.06.02
Accuracy of the trade sign classification for the stock AAPL the 2008.06.02
For consecutive trades in ms:
Accuracy of the classification: 89.05 %
Number of identified trades: 52691
Number of matches: 46919

Reducing the trades to 1 per millisecond:
Accuracy of the classification: 87.53 %
Number of identified trades signs: 35906
Number of matches: 31427

Reducing the 