In [None]:
# Import libraries

import numpy as np
import pandas as pd
from IPython.display import display

___
### Auxiliary functions

In [None]:
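def remove_outliers(df, s):
    """
    Removes outlier rows using a robust z-score on column s. The scale
    is the median absolute deviation (MAD) or, when the MAD is zero,
    the mean absolute deviation around the median. Only the column s
    is tested for outliers.

    :param df: input dataframe
    :param s: column to test for outliers
    :return: dataframe with outlier rows removed
    """
    # absolute deviations from the median
    term = (df[s] - df[s].median()).abs()

    # Median Absolute Deviation, scaled by 1.4826 so that it is
    # comparable to a standard deviation under normality
    mad = 1.4826 * term.median()

    if mad == 0:
        # Mean Absolute Deviation; its normal-consistency factor is
        # sqrt(pi / 2), about 1.2533 (0.7979 is the reciprocal)
        mad = 1.2533 * term.mean()

    if mad == 0:
        # every value equals the median, so there is nothing to remove
        return df

    # keep rows whose robust z-score is at most 3
    return df[term / mad <= 3]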

___
### Data parsing and processing

Exchanges introduce some latency through physical constraints, but the timestamps in this dataset are only accurate to 1 second, which is far coarser than typical real-world latency. Duplicate records for a security at exactly the same timestamp may therefore be distinct trades rather than data errors, so they are kept.
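
As a rough illustration of this point, the sketch below flags trades that share a ticker and a timestamp without dropping them. The toy frame and its values are hypothetical; only the column names `ticker`, `timestamp` and `spread` follow the dataset used in this notebook.

```python
import pandas as pd

# hypothetical toy trades: two AAA trades fall in the same second
toy = pd.DataFrame({
    "ticker": ["AAA", "AAA", "AAA", "BBB"],
    "timestamp": pd.to_datetime([
        "2020-01-02 09:30:00",
        "2020-01-02 09:30:00",
        "2020-01-02 09:30:01",
        "2020-01-02 09:30:00",
    ]),
    "spread": [100.0, 100.5, 100.5, 80.0],
})

# mark every record that shares (ticker, timestamp) with another record
same_second = toy.duplicated(subset=["ticker", "timestamp"], keep=False)
n_shared = int(same_second.sum())

# inspect the flagged records; nothing is removed
print(toy[same_second])
```

With `keep=False`, every member of a duplicated group is flagged, which makes it easy to inspect whether the records differ in spread or notional before deciding how to treat them.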

In [None]:
# read dataset
df = pd.read_csv("credit_derivatives_trades.csv").drop(columns=['Unnamed: 0'])

# convert timestamp to datetime
df['timestamp'] = pd.to_datetime(df['timestamp'], format="%Y-%m-%d %H:%M:%S")

# sanity check: keep only trades with a positive notional
df = df[df['notional'] > 0]

# sort by ticker and timestamp
df = df.sort_values(by=['ticker', 'timestamp']).reset_index(drop=True)

___
### Trade flow indicator

When only trade data and no quote data are available, Lee and Ready's (1991) "tick test" is a robust algorithm for labeling trades as "buy" or "sell". Quote-based methods, by contrast, carry the additional burden of identifying the particular quote that triggered each trade.

The tick test labels a trade's direction by comparing its price with that of the preceding trade. A trade is an uptick (downtick) if its price is higher (lower) than the price of the previous trade. When the price is the same, it is a zero tick. If the last price change before the zero tick was an uptick (downtick), the trade is a zero-uptick (zero-downtick). A trade is classified as a "buy" if it occurs on an uptick or a zero-uptick; otherwise, it is a "sell".
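
The rules above can be sketched as a plain sequential loop. This is only an illustration on hypothetical prices: the helper name `tick_test` and the `first_label` default are assumptions of the sketch, needed because the first trade has no predecessor to compare against.

```python
def tick_test(prices, first_label="buy"):
    """Label each trade 'buy' or 'sell' by comparing it with the previous price."""
    labels = []
    for i, price in enumerate(prices):
        if i == 0:
            # no preceding trade: fall back to the assumed default label
            labels.append(first_label)
        elif price > prices[i - 1]:
            labels.append("buy")       # uptick
        elif price < prices[i - 1]:
            labels.append("sell")      # downtick
        else:
            # zero tick: inherit the label of the preceding trade, which
            # reflects the last non-zero price change
            labels.append(labels[-1])
    return labels


# hypothetical price sequence
toy_prices = [100, 101, 101, 100, 100, 100, 102]
toy_labels = tick_test(toy_prices)
print(toy_labels)
```

Because each zero tick copies the label before it, a run of unchanged prices keeps the direction of the last actual price move.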

The tick test can assign a label to every trade in the data without additional information. However, certain irregular trades can be mislabeled: for instance, trades that are recorded out of sequence or that carry special conditions (such as a pre-agreed, non-market price) cannot be labeled reliably. Fortunately, such irregular trades are only a small minority of all trades.

I vectorize the tick test by taking first differences of the spreads. Irregular trades then show up as outliers in these differences, so I remove the outliers from the data.
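
To make the outlier rule concrete, here is a minimal sketch on hypothetical first differences. It uses the same 1.4826 scaling and cutoff of 3 as `remove_outliers`, and it assumes the scaled median absolute deviation is non-zero.

```python
import numpy as np

# hypothetical spread first differences; the last one mimics an irregular trade
toy_diffs = np.array([0.5, -0.5, 1.0, -1.0, 0.0, 0.5, 25.0])

# absolute deviations from the median
abs_dev = np.abs(toy_diffs - np.median(toy_diffs))

# scaled median absolute deviation (assumed non-zero in this sketch)
scaled_mad = 1.4826 * np.median(abs_dev)

# robust z-scores; values above 3 are treated as outliers
robust_z = abs_dev / scaled_mad
kept = toy_diffs[robust_z <= 3]
print(kept)
```

The median-based scale is barely affected by the extreme value itself, which is why the rule can isolate it even in a short series.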

Taking first differences leaves a missing value for the first trade of each ticker. Lee and Ready (1991) suggest labeling the first trade as a "buy".

In [None]:
# get the spread's first difference by ticker and replace the missing value for
# the first trade per ticker with a positive value to force a "buy" label
df['dspread'] = df.groupby('ticker')['spread'].diff()
df.loc[df['dspread'].isna(), 'dspread'] = 0.0001

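# remove the outliers (irregular trades) ticker by ticker; iterating over the
# groups keeps the 'ticker' column regardless of how the installed pandas
# version treats grouping columns in groupby.apply
df = pd.concat([remove_outliers(group, 'dspread')
                for _, group in df.groupby('ticker')]).reset_index(drop=True)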

# summary statistics for the new variable
prc = [0.01, 0.25, 0.5, 0.75, 0.99]
print('\033[1m' + 'Summary Statistics for the Spread First Differences' + '\033[0m')
display(df.groupby('ticker')['dspread'].describe(percentiles=prc).reset_index())

# create the trade flow indicator based on spread first difference Δs:
# Δs > 0 -> buy, Δs < 0 -> sell, set Δs = 0 cases to missing for now
df['direction'] = np.select([df['dspread'] > 0, df['dspread'] < 0],
                            ['buy', 'sell'], default=None)

# label zero-tick trades in the same direction as their preceding trades,
# filling within each ticker so labels never carry over from another ticker
df['direction'] = df.groupby('ticker')['direction'].ffill()

# show the flow indicator
print('\033[1m' + 'Trade Flow Indicator' + '\033[0m')
display(df[['ticker', 'timestamp', 'direction']])

# trade flow summary statistics per ticker
print('\033[1m' + 'Trade Flow Summary Statistics per Ticker' + '\033[0m')
display(df.groupby('ticker')['direction'].value_counts())