# Event-Based Trading: Data Preparation

This notebook loads and preprocesses the Squid_Ink data for event-based trading analysis. We use only the first 20,000 timestamps (the in-sample data) throughout.

In [None]:
import sys
import os

# Import our backtester package
sys.path.append(os.path.abspath('../../'))
from backtester import get_price_data, get_vwap, relative_entropy_binned
print("Using backtester package")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm

## 1. Load Data

First, let's load the Squid_Ink price data and limit it to the first 20,000 timestamps (in-sample data).

In [None]:
# Load data directly using backtester package
print("Loading price data...")
prices = get_price_data('SQUID_INK', 1)
print(f"Loaded {len(prices)} price data points")

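# Limit to first 20,000 timestamps (in-sample data).
# iloc[:20000] takes the first 20,000 rows, which assumes one row per timestamp.
# .copy() gives an independent frame, so the feature columns added in
# section 4 do not trigger pandas' SettingWithCopyWarning.
in_sample_prices = prices.iloc[:20000].copy()
print(f"Limited to {len(in_sample_prices)} in-sample data points")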

# Get VWAP
print("Getting VWAP for SQUID_INK...")
squid_vwap = in_sample_prices['vwap']
print(f"Got VWAP with {len(squid_vwap)} data points")
print(f"VWAP range: {squid_vwap.min()} to {squid_vwap.max()}")

# Calculate log returns
log_ret = np.log(squid_vwap).diff().dropna()
print(f"Calculated log returns with {len(log_ret)} data points")

# Plot VWAP
plt.figure(figsize=(15, 5))
plt.plot(squid_vwap)
plt.title('Squid_Ink VWAP (In-Sample Data)')
plt.grid(True)
plt.show()
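
The log return at step t is ln(P_t / P_{t-1}), which is what `np.log(...).diff()` computes in the cell above. The following is a minimal sketch on made-up prices (the values and the `toy_` names are illustrative assumptions, not Squid_Ink data). It checks that the difference-of-logs form matches the log-of-ratio form and that log returns sum to the total log price change.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, made up for illustration (not Squid_Ink data)
toy_vwap = pd.Series([100.0, 101.0, 99.0, 99.0])

# Same transformation as in the cell above: difference of log prices
toy_log_ret = np.log(toy_vwap).diff().dropna()

# Equivalent formulation: log of the ratio of consecutive prices
ratio_form = np.log(toy_vwap / toy_vwap.shift(1)).dropna()
assert np.allclose(toy_log_ret, ratio_form)

# Log returns are additive: their sum is the total log price change
total_change = np.log(toy_vwap.iloc[-1] / toy_vwap.iloc[0])
assert np.isclose(toy_log_ret.sum(), total_change)

# diff() leaves a NaN in the first position, so dropna() removes one point
print(len(toy_vwap), len(toy_log_ret))
```

This is also why `log_ret` has one fewer data point than `squid_vwap`.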

## 2. Calculate Basic Statistics

Let's calculate some basic statistics for the in-sample data.

In [None]:
# Calculate basic statistics for VWAP
vwap_stats = {
    'Mean': squid_vwap.mean(),
    'Median': squid_vwap.median(),
    'Std Dev': squid_vwap.std(),
    'Min': squid_vwap.min(),
    'Max': squid_vwap.max(),
    'Range': squid_vwap.max() - squid_vwap.min(),
    'IQR': squid_vwap.quantile(0.75) - squid_vwap.quantile(0.25)
}

# Calculate basic statistics for log returns
ret_stats = {
    'Mean': log_ret.mean(),
    'Median': log_ret.median(),
    'Std Dev': log_ret.std(),
    'Min': log_ret.min(),
    'Max': log_ret.max(),
    'Range': log_ret.max() - log_ret.min(),
    'IQR': log_ret.quantile(0.75) - log_ret.quantile(0.25)
}

# Display statistics
print("VWAP Statistics:")
pd.Series(vwap_stats)


In [None]:
print("Log Returns Statistics:")
pd.Series(ret_stats)
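
The two dictionaries above compute the same seven statistics for two different series. One possible way to avoid the duplication is a small helper; `summarize` is a hypothetical name introduced here for illustration, and the toy values are made up.

```python
import pandas as pd

def summarize(series: pd.Series) -> pd.Series:
    """Return the seven summary statistics used above for one series."""
    return pd.Series({
        'Mean': series.mean(),
        'Median': series.median(),
        'Std Dev': series.std(),  # sample standard deviation (ddof=1)
        'Min': series.min(),
        'Max': series.max(),
        'Range': series.max() - series.min(),
        'IQR': series.quantile(0.75) - series.quantile(0.25),
    })

# Made-up values with one outlier, to show mean and median diverging
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0])
toy_stats = summarize(toy)
print(toy_stats)
```

With such a helper, `summarize(squid_vwap)` and `summarize(log_ret)` would replace the two dictionaries.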

## 3. Visualize Distributions

Let's visualize the distributions of VWAP and log returns.

In [None]:
# Plot histograms of VWAP (left) and log returns (right)
plt.figure(figsize=(15, 5))

plt.subplot(1, 2, 1)
plt.hist(squid_vwap, bins=50, alpha=0.7)
plt.title('VWAP Distribution')
plt.grid(True)

plt.subplot(1, 2, 2)
plt.hist(log_ret, bins=50, alpha=0.7)
plt.title('Log Returns Distribution')
plt.grid(True)

plt.tight_layout()
plt.show()
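
A histogram shows the shape of a distribution only qualitatively. As a rough numeric complement, skewness and excess kurtosis can be read off with pandas. The sketch below uses synthetic samples (an assumption for illustration, not the notebook's data) to show how a heavy-tailed sample differs from a normal one; the same two calls could be applied to `log_ret`.

```python
import numpy as np
import pandas as pd

# Synthetic samples for illustration only
rng = np.random.default_rng(0)
normal_sample = pd.Series(rng.normal(size=20_000))
heavy_sample = pd.Series(rng.standard_t(df=3, size=20_000))  # heavy tails

# Series.kurt() reports excess kurtosis, so a normal sample is close to 0
for name, sample in [('normal', normal_sample), ('heavy-tailed', heavy_sample)]:
    print(f"{name}: skew={sample.skew():.3f}, excess kurtosis={sample.kurt():.3f}")
```

An excess kurtosis well above zero would suggest more extreme moves than a normal distribution predicts.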

## 4. Calculate Order Book Features

Let's calculate some order book features that might be useful for event-based trading.

In [None]:
# Calculate bid-ask spread
in_sample_prices['spread'] = in_sample_prices['ask_price_1'] - in_sample_prices['bid_price_1']

# Calculate mid price
in_sample_prices['mid_price_calc'] = (in_sample_prices['ask_price_1'] + in_sample_prices['bid_price_1']) / 2

# Calculate order book imbalance: (bid - ask) / (bid + ask) over the top three levels.
# Missing level-2 and level-3 volumes count as zero. The ratio lies in [-1, 1]
# for non-negative volumes and is NaN when both totals are zero.
in_sample_prices['bid_volume_total'] = in_sample_prices['bid_volume_1'] + in_sample_prices['bid_volume_2'].fillna(0) + in_sample_prices['bid_volume_3'].fillna(0)
in_sample_prices['ask_volume_total'] = in_sample_prices['ask_volume_1'] + in_sample_prices['ask_volume_2'].fillna(0) + in_sample_prices['ask_volume_3'].fillna(0)
in_sample_prices['volume_imbalance'] = (in_sample_prices['bid_volume_total'] - in_sample_prices['ask_volume_total']) / (in_sample_prices['bid_volume_total'] + in_sample_prices['ask_volume_total'])

# Display the first few rows with the new features
in_sample_prices[['spread', 'mid_price_calc', 'bid_volume_total', 'ask_volume_total', 'volume_imbalance']].head()
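
The imbalance calculation can also be written as a reusable function. The following is a sketch under stated assumptions: `volume_imbalance` and its `levels` parameter are hypothetical names, the toy book is made up, and, unlike the cell above, a missing level-1 volume is treated as zero rather than propagating NaN.

```python
import numpy as np
import pandas as pd

def volume_imbalance(book: pd.DataFrame, levels: int = 3) -> pd.Series:
    """(bid - ask) / (bid + ask) summed over the first `levels` book levels."""
    bid_cols = [f'bid_volume_{i}' for i in range(1, levels + 1)]
    ask_cols = [f'ask_volume_{i}' for i in range(1, levels + 1)]
    # Treat missing levels as zero volume (assumption: includes level 1)
    bid_total = book[bid_cols].fillna(0).sum(axis=1)
    ask_total = book[ask_cols].fillna(0).sum(axis=1)
    total = bid_total + ask_total
    # Turn a zero denominator into NaN so an empty book yields NaN
    return (bid_total - ask_total) / total.replace(0, np.nan)

# Made-up order book snapshot: bid-heavy, ask-heavy, and empty rows
toy_book = pd.DataFrame({
    'bid_volume_1': [10.0, 5.0, 0.0],
    'bid_volume_2': [5.0, np.nan, np.nan],
    'bid_volume_3': [np.nan, np.nan, np.nan],
    'ask_volume_1': [5.0, 15.0, 0.0],
    'ask_volume_2': [np.nan, np.nan, np.nan],
    'ask_volume_3': [np.nan, np.nan, np.nan],
})
toy_imbalance = volume_imbalance(toy_book)
print(toy_imbalance)
```

Positive values indicate more resting volume on the bid side, negative values more on the ask side.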

## 5. Visualize Order Book Features

Let's visualize the order book features we calculated.

In [None]:
# Plot spread, volume imbalance, and imbalance vs. spread
plt.figure(figsize=(15, 10))

plt.subplot(3, 1, 1)
plt.plot(in_sample_prices['spread'])
plt.title('Bid-Ask Spread')
plt.grid(True)

plt.subplot(3, 1, 2)
plt.plot(in_sample_prices['volume_imbalance'])
plt.title('Volume Imbalance')
plt.grid(True)

plt.subplot(3, 1, 3)
plt.scatter(in_sample_prices['volume_imbalance'], in_sample_prices['spread'], alpha=0.5)
plt.title('Volume Imbalance vs. Spread')
plt.xlabel('Volume Imbalance')
plt.ylabel('Spread')
plt.grid(True)

plt.tight_layout()
plt.show()
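
The scatter plot gives a visual impression of how imbalance and spread relate. A correlation coefficient is one rough way to put a number on it. The sketch below uses made-up values (an assumption for illustration); on the real data the same call would be `in_sample_prices['volume_imbalance'].corr(in_sample_prices['spread'])`.

```python
import numpy as np
import pandas as pd

# Made-up feature values for illustration, including one missing imbalance
toy_features = pd.DataFrame({
    'volume_imbalance': [-0.8, -0.2, 0.1, 0.6, np.nan],
    'spread': [1.0, 2.0, 2.0, 3.0, 4.0],
})

# Series.corr computes Pearson correlation by default and
# ignores rows where either value is NaN
pearson = toy_features['volume_imbalance'].corr(toy_features['spread'])
print(f"Pearson correlation: {pearson:.3f}")
```

Pearson correlation only captures linear dependence, so a value near zero does not rule out a nonlinear relationship.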

## 6. Save Processed Data

Let's save the processed data for use in other notebooks.

In [None]:
# Create a directory for processed data if it doesn't exist
# (os is already imported in the first cell)
os.makedirs('processed_data', exist_ok=True)

# Save the processed data
in_sample_prices.to_pickle('processed_data/in_sample_prices.pkl')
squid_vwap.to_pickle('processed_data/squid_vwap.pkl')
log_ret.to_pickle('processed_data/log_ret.pkl')

print("Processed data saved successfully.")
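
Other notebooks can read these files back with `pd.read_pickle`. The following sketch shows a save and load round trip on a made-up series in a temporary directory (the data and path are assumptions for illustration, so the real `processed_data` files are not touched).

```python
import os
import tempfile

import pandas as pd

# Made-up series standing in for log_ret
toy_log_ret = pd.Series([0.01, -0.02, 0.005], name='log_ret')

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'log_ret.pkl')
    toy_log_ret.to_pickle(path)
    restored = pd.read_pickle(path)

# Raises AssertionError if values, index, dtype, or name differ
pd.testing.assert_series_equal(toy_log_ret, restored)
print("Round trip OK")
```

Pickle files preserve dtypes and the index exactly, but they should only be loaded from trusted sources, since unpickling can execute arbitrary code.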

## 7. Conclusion

In this notebook, we've loaded and preprocessed the Squid_Ink data for event-based trading analysis. We limited the analysis to the first 20,000 timestamps (in-sample data), computed VWAP log returns, and calculated order book features (bid-ask spread, mid price, total bid and ask volume, and volume imbalance) that might be useful for identifying trading events.

In the next notebooks, we'll explore different types of events and develop trading strategies based on these events.