# Analyzing Orderbooks Before Large Returns - Squid Ink Round 2 (Fixed Version)

This notebook analyzes the state of the orderbook immediately before large returns for Squid Ink in Round 2. It fixes the index issue in the previous notebooks.

In [None]:
# Import necessary libraries
import sys
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Configure plots to be larger and more readable
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 12

# Try to import seaborn for better styling
try:
    import seaborn as sns
    sns.set(style="whitegrid")
    print("Using Seaborn for plot styling")
except ImportError:
    print("Seaborn not available, using matplotlib default styling")

## 1. Load Price Data

In [None]:
def load_price_data(round_num, product='SQUID_INK'):
    """
    Load price data for a specific round and product.
    
    Parameters:
        round_num (int): Round number
        product (str): Product name (default: 'SQUID_INK')
        
    Returns:
        pd.DataFrame: DataFrame containing price data
    """
    # Path to data directory - try multiple possible locations
    possible_data_paths = [
        '../../../Prosperity 3 Data',
        '../../../../Prosperity 3 Data',
        '../../../../../Prosperity 3 Data',
        'Prosperity 3 Data'
    ]
    
    # Find the first valid data path
    data_path = None
    for path in possible_data_paths:
        if os.path.exists(path):
            data_path = path
            print(f"Found data directory at {path}")
            break
    
    if data_path is None:
        print("Could not find data directory")
        return pd.DataFrame()
    
    # List all CSV files for the round (sorted so the days load in a stable order)
    import glob
    file_pattern = os.path.join(data_path, f'Round {round_num}/prices_round_{round_num}_day_*.csv')
    files = sorted(glob.glob(file_pattern))
    
    if not files:
        print(f"No files found matching pattern: {file_pattern}")
        return pd.DataFrame()
    
    # Load and concatenate all files
    dfs = []
    for file in files:
        print(f"Loading {file}...")
        df = pd.read_csv(file, sep=';')
        dfs.append(df)
    
    # Concatenate all dataframes
    all_data = pd.concat(dfs, ignore_index=True)
    
    # Filter for the specified product
    product_data = all_data[all_data['product'] == product].copy()
    print(f"Successfully loaded price data with {len(product_data)} rows")
    
    return product_data

In [None]:
# Load Squid Ink price data for Round 2
squid_data = load_price_data(2, 'SQUID_INK')

# Display the first few rows
squid_data.head()

## 2. Calculate Returns and Identify Large Return Events

In [None]:
# Calculate mid price
squid_data['mid_price'] = (squid_data['ask_price_1'] + squid_data['bid_price_1']) / 2

# Sort chronologically to ensure proper return calculation.
# Timestamps can restart in each day file, so sort by day first when a 'day' column is available.
sort_columns = ['day', 'timestamp'] if 'day' in squid_data.columns else ['timestamp']
squid_data = squid_data.sort_values(sort_columns)

# Reset index to ensure we have continuous indices
squid_data = squid_data.reset_index(drop=True)

# Calculate returns
squid_data['returns'] = squid_data['mid_price'].pct_change()

# Calculate absolute returns
squid_data['abs_returns'] = squid_data['returns'].abs()

# Display summary statistics of returns
print("Summary statistics of returns:")
squid_data['returns'].describe()

In [None]:
# Define what constitutes a "large" return (e.g., top 1% of absolute returns)
large_return_threshold = squid_data['abs_returns'].quantile(0.99)
print(f"Large return threshold (99th percentile): {large_return_threshold:.6f}")

# Identify indices with large returns
large_return_indices = squid_data[squid_data['abs_returns'] >= large_return_threshold].index
print(f"Number of large return events: {len(large_return_indices)}")
print(f"Percentage of all observations: {len(large_return_indices) / len(squid_data) * 100:.2f}%")
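
When several day files are concatenated, a return computed across a day boundary compares the last quote of one day with the first quote of the next. The sketch below shows one way to keep returns within each day before applying a quantile threshold. It assumes the frame carries a `day` column, and the toy numbers are made up for illustration.

```python
import pandas as pd

# Toy data: two days, timestamps restart each day (values are illustrative only)
toy = pd.DataFrame({
    'day':       [0, 0, 0, 1, 1, 1],
    'timestamp': [0, 100, 200, 0, 100, 200],
    'mid_price': [100.0, 101.0, 100.0, 200.0, 202.0, 198.0],
})

# Order by day first, then timestamp, so rows are in chronological order
toy = toy.sort_values(['day', 'timestamp']).reset_index(drop=True)

# pct_change within each day: the first row of every day gets NaN
toy['returns'] = toy.groupby('day')['mid_price'].pct_change()
toy['abs_returns'] = toy['returns'].abs()

# Flag the largest moves using a quantile of absolute returns
threshold = toy['abs_returns'].quantile(0.75)
large_events = toy[toy['abs_returns'] >= threshold]

print(toy[['day', 'timestamp', 'returns']])
print(f"threshold={threshold:.6f}, events={len(large_events)}")
```

Without the `groupby`, the first row of day 1 would show a return of roughly 100% (from 100.0 to 200.0), which would dominate any quantile-based definition of a large return.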

## 3. Calculate Orderbook Features

In [None]:
def calculate_orderbook_features(data):
    """
    Calculate various orderbook features.
    
    Parameters:
        data (pd.DataFrame): DataFrame containing orderbook data
        
    Returns:
        pd.DataFrame: DataFrame with orderbook features
    """
    # Create a copy of the dataframe
    df = data.copy()
    
    # Calculate bid-ask spread
    df['spread'] = df['ask_price_1'] - df['bid_price_1']
    df['relative_spread'] = df['spread'] / df['mid_price']
    
    # Calculate total volume on each side across the first 3 levels
    df['bid_volume_total'] = df['bid_volume_1'] + df['bid_volume_2'].fillna(0) + df['bid_volume_3'].fillna(0)
    df['ask_volume_total'] = df['ask_volume_1'] + df['ask_volume_2'].fillna(0) + df['ask_volume_3'].fillna(0)
    
    # Calculate volume imbalance
    df['volume_imbalance'] = (df['bid_volume_total'] - df['ask_volume_total']) / (df['bid_volume_total'] + df['ask_volume_total'])
    
    # Calculate weighted average price levels
    df['weighted_bid_price'] = (
        df['bid_price_1'] * df['bid_volume_1'] + 
        df['bid_price_2'].fillna(0) * df['bid_volume_2'].fillna(0) + 
        df['bid_price_3'].fillna(0) * df['bid_volume_3'].fillna(0)
    ) / df['bid_volume_total']
    
    df['weighted_ask_price'] = (
        df['ask_price_1'] * df['ask_volume_1'] + 
        df['ask_price_2'].fillna(0) * df['ask_volume_2'].fillna(0) + 
        df['ask_price_3'].fillna(0) * df['ask_volume_3'].fillna(0)
    ) / df['ask_volume_total']
    
    # Price impact proxy: relative distance between the best price and the
    # level-3 price on each side (0 when level 3 is missing)
    df['bid_price_impact'] = (df['bid_price_1'] - df['bid_price_3'].fillna(df['bid_price_1'])) / df['bid_price_1']
    df['ask_price_impact'] = (df['ask_price_3'].fillna(df['ask_price_1']) - df['ask_price_1']) / df['ask_price_1']
    
    # Calculate order book depth (total volume within first 3 levels)
    df['book_depth'] = df['bid_volume_total'] + df['ask_volume_total']
    
    # Calculate price range (difference between highest ask and lowest bid)
    df['price_range'] = df['ask_price_3'].fillna(df['ask_price_1']) - df['bid_price_3'].fillna(df['bid_price_1'])
    df['relative_price_range'] = df['price_range'] / df['mid_price']
    
    return df
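
To make the feature definitions concrete, here is a small worked example on a single made-up orderbook snapshot. It repeats the spread, volume imbalance, and weighted bid price formulas from the function above on toy numbers, with level 3 missing on both sides to show the effect of `fillna(0)`.

```python
import numpy as np
import pandas as pd

# One-row toy orderbook (made-up numbers); level 3 is missing on both sides
book = pd.DataFrame({
    'bid_price_1': [1998.0], 'bid_volume_1': [10.0],
    'bid_price_2': [1997.0], 'bid_volume_2': [20.0],
    'bid_price_3': [np.nan], 'bid_volume_3': [np.nan],
    'ask_price_1': [2002.0], 'ask_volume_1': [5.0],
    'ask_price_2': [2003.0], 'ask_volume_2': [5.0],
    'ask_price_3': [np.nan], 'ask_volume_3': [np.nan],
})

spread = book['ask_price_1'] - book['bid_price_1']

# Missing levels contribute zero volume
bid_total = book['bid_volume_1'] + book['bid_volume_2'].fillna(0) + book['bid_volume_3'].fillna(0)
ask_total = book['ask_volume_1'] + book['ask_volume_2'].fillna(0) + book['ask_volume_3'].fillna(0)
imbalance = (bid_total - ask_total) / (bid_total + ask_total)

# Volume-weighted bid price; a missing level adds 0 * 0 to the numerator
weighted_bid = (
    book['bid_price_1'] * book['bid_volume_1']
    + book['bid_price_2'].fillna(0) * book['bid_volume_2'].fillna(0)
    + book['bid_price_3'].fillna(0) * book['bid_volume_3'].fillna(0)
) / bid_total

print(f"spread={spread.iloc[0]}, imbalance={imbalance.iloc[0]}, weighted_bid={weighted_bid.iloc[0]:.2f}")
```

Here the bid side holds 30 units and the ask side 10, so the imbalance is (30 - 10) / (30 + 10) = 0.5, and the weighted bid price sits between the two populated bid levels.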

In [None]:
# Calculate orderbook features
squid_data_with_features = calculate_orderbook_features(squid_data)

# Display the first few rows with the new features
feature_columns = [
    'timestamp', 'mid_price', 'returns', 'abs_returns',
    'spread', 'relative_spread', 'volume_imbalance',
    'bid_volume_total', 'ask_volume_total', 'book_depth',
    'weighted_bid_price', 'weighted_ask_price',
    'bid_price_impact', 'ask_price_impact',
    'price_range', 'relative_price_range'
]

squid_data_with_features[feature_columns].head()

## 4. Extract Orderbook States Before Large Returns

In [None]:
# Extract orderbook states before large returns
# We'll look at the orderbook 1 step before the large return event

# Collect one record per event. A list is used instead of a dict keyed by
# timestamp because timestamps can repeat across day files, and repeated keys
# would silently overwrite earlier events.
pre_event_states = []

for idx in large_return_indices:
    # Get the orderbook state 1 step before the event
    pre_event_idx = idx - 1

    # Skip the first observation and any pre-event index missing from the dataframe
    if idx > 0 and pre_event_idx in squid_data_with_features.index:
        pre_event_states.append({
            'timestamp': squid_data.loc[idx, 'timestamp'],    # timestamp of the event
            'return_value': squid_data.loc[idx, 'returns'],   # the large return itself
            'pre_event_state': squid_data_with_features.loc[pre_event_idx]
        })

print(f"Extracted {len(pre_event_states)} pre-event orderbook states")

## 5. Create DataFrame of Pre-Event Orderbook States

In [None]:
# Convert pre-event states to a DataFrame for easier analysis
state_columns = [
    'spread', 'relative_spread', 'volume_imbalance',
    'bid_volume_total', 'ask_volume_total', 'book_depth',
    'bid_price_impact', 'ask_price_impact',
    'price_range', 'relative_price_range'
]

pre_event_df = pd.DataFrame(
    [
        {
            'timestamp': state['timestamp'],
            'return_value': state['return_value'],
            **{col: state['pre_event_state'][col] for col in state_columns}
        }
        for state in pre_event_states
    ],
    columns=['timestamp', 'return_value'] + state_columns
)

# Add a column for return direction (positive or negative)
pre_event_df['return_direction'] = np.where(pre_event_df['return_value'] > 0, 'positive', 'negative')

# Display summary statistics
print("Summary of pre-event orderbook states:")
pre_event_df.describe()

## 6. Analyze Relationship Between Orderbook Features and Returns

In [None]:
# Separate positive and negative return events
positive_returns = pre_event_df[pre_event_df['return_direction'] == 'positive']
negative_returns = pre_event_df[pre_event_df['return_direction'] == 'negative']

print(f"Number of large positive return events: {len(positive_returns)}")
print(f"Number of large negative return events: {len(negative_returns)}")

In [None]:
# Compare orderbook features between positive and negative return events
relevant_features = [
    'spread', 'relative_spread', 'volume_imbalance',
    'bid_volume_total', 'ask_volume_total', 'book_depth',
    'bid_price_impact', 'ask_price_impact',
    'price_range', 'relative_price_range'
]

# Restrict to the numeric feature columns: calling mean() on the full frame would
# include the string column 'return_direction', which raises a TypeError in pandas 2.x
feature_comparison = pd.DataFrame({
    'positive_mean': positive_returns[relevant_features].mean(),
    'negative_mean': negative_returns[relevant_features].mean(),
    'positive_median': positive_returns[relevant_features].median(),
    'negative_median': negative_returns[relevant_features].median()
})

# Calculate the difference between positive and negative events
feature_comparison['mean_diff'] = feature_comparison['positive_mean'] - feature_comparison['negative_mean']
feature_comparison['median_diff'] = feature_comparison['positive_median'] - feature_comparison['negative_median']

# Calculate the percentage difference relative to the magnitude of the negative-event value
# (abs() keeps the sign of the difference when the baseline is negative, e.g. volume_imbalance)
feature_comparison['mean_diff_pct'] = feature_comparison['mean_diff'] / feature_comparison['negative_mean'].abs() * 100
feature_comparison['median_diff_pct'] = feature_comparison['median_diff'] / feature_comparison['negative_median'].abs() * 100

# Display the comparison for relevant features
feature_comparison.loc[relevant_features, ['mean_diff_pct', 'median_diff_pct']].sort_values('mean_diff_pct', ascending=False)
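
A difference in group means is hard to interpret without a sense of how large a difference chance alone would produce. One possible check is a permutation test. The sketch below is not part of the original analysis: `permutation_mean_diff` is a hypothetical helper, and the two samples are synthetic stand-ins for a feature such as `volume_imbalance` in the positive and negative groups.

```python
import numpy as np

def permutation_mean_diff(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    rng = np.random.default_rng(seed)

    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])

    # Count how often a random relabeling gives a difference at least as extreme
    count = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)
        diff = shuffled[:len(a)].mean() - shuffled[len(a):].mean()
        if abs(diff) >= abs(observed):
            count += 1

    # Add-one correction keeps the estimated p-value strictly positive
    p_value = (count + 1) / (n_permutations + 1)
    return observed, p_value

# Synthetic samples, for illustration only
rng = np.random.default_rng(42)
pos_feature = rng.normal(loc=0.2, scale=0.3, size=60)
neg_feature = rng.normal(loc=-0.2, scale=0.3, size=60)

observed, p_value = permutation_mean_diff(pos_feature, neg_feature)
print(f"observed mean difference: {observed:.4f}, p-value: {p_value:.4f}")
```

With the notebook's data, the same helper could be called as `permutation_mean_diff(positive_returns['volume_imbalance'].dropna(), negative_returns['volume_imbalance'].dropna())`.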

## 7. Visualize Key Features

In [None]:
# Visualize the distribution of key features for positive vs negative returns
key_features = [
    'volume_imbalance', 'relative_spread', 'book_depth', 'bid_price_impact', 'ask_price_impact'
]

for feature in key_features:
    plt.figure(figsize=(12, 6))
    
    # Plot histograms
    plt.hist(positive_returns[feature].dropna(), bins=20, alpha=0.5, label='Positive Returns')
    plt.hist(negative_returns[feature].dropna(), bins=20, alpha=0.5, label='Negative Returns')
    
    plt.title(f'Distribution of {feature} Before Large Return Events')
    plt.xlabel(feature)
    plt.ylabel('Frequency')
    plt.legend()
    plt.grid(True)
    plt.tight_layout()
    plt.show()

## 8. Calculate Correlation Between Orderbook Features and Return Values

In [None]:
# Calculate correlation between orderbook features and return values
correlation = pre_event_df[relevant_features + ['return_value']].corr()['return_value'].drop('return_value')

# Sort by absolute correlation
correlation_sorted = correlation.abs().sort_values(ascending=False)

# Display the correlations
print("Correlation between orderbook features and subsequent returns:")
for feature in correlation_sorted.index:
    print(f"{feature}: {correlation[feature]:.4f}")

# Plot the correlations
plt.figure(figsize=(12, 8))
correlation.sort_values().plot(kind='barh')
plt.title('Correlation Between Orderbook Features and Subsequent Returns')
plt.xlabel('Correlation')
plt.grid(True)
plt.tight_layout()
plt.show()
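
Because this sample contains only tail events, a few extreme returns can dominate a Pearson correlation. A rank-based (Spearman) correlation is a common robustness check. The sketch below computes it as the Pearson correlation of ranks, which avoids any extra dependency. `rank_correlation` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd

def rank_correlation(x, y):
    """Spearman correlation computed as the Pearson correlation of ranks (hypothetical helper)."""
    pair = pd.DataFrame({'x': x, 'y': y}).dropna()
    return pair['x'].rank().corr(pair['y'].rank())

# Toy example: a perfectly monotone but non-linear relationship
toy = pd.DataFrame({
    'volume_imbalance': [-0.8, -0.3, 0.1, 0.4, 0.9],
    'return_value':     [-0.020, -0.004, 0.001, 0.003, 0.050],
})

pearson = toy['volume_imbalance'].corr(toy['return_value'])
spearman = rank_correlation(toy['volume_imbalance'], toy['return_value'])
print(f"Pearson: {pearson:.4f}, Spearman: {spearman:.4f}")
```

In the toy data the ordering of the two columns agrees exactly, so the rank correlation is 1 while the Pearson value is lower. Applied to the notebook's data, the call would be `rank_correlation(pre_event_df[feature], pre_event_df['return_value'])` for each feature of interest.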

## 9. Visualize Orderbook Depth for Selected Examples

In [None]:
def visualize_orderbook_depth(data, timestamp):
    """
    Visualize the orderbook depth at a specific timestamp.
    
    Parameters:
        data (pd.DataFrame): DataFrame containing orderbook data
        timestamp (int): Timestamp to visualize
    """
    # Get the row for the timestamp
    row = data[data['timestamp'] == timestamp].iloc[0]
    
    # Extract bid and ask prices and volumes
    bid_prices = [row['bid_price_1'], row['bid_price_2'], row['bid_price_3']]
    bid_volumes = [row['bid_volume_1'], row['bid_volume_2'], row['bid_volume_3']]
    ask_prices = [row['ask_price_1'], row['ask_price_2'], row['ask_price_3']]
    ask_volumes = [row['ask_volume_1'], row['ask_volume_2'], row['ask_volume_3']]
    
    # Create a figure
    fig, ax = plt.subplots(figsize=(12, 6))
    
    # Plot bid side (negative volumes for visualization)
    ax.barh(bid_prices, [-vol for vol in bid_volumes], height=0.5, color='green', alpha=0.7, label='Bids')
    
    # Plot ask side
    ax.barh(ask_prices, ask_volumes, height=0.5, color='red', alpha=0.7, label='Asks')
    
    # Add mid price line
    mid_price = row['mid_price']
    ax.axhline(y=mid_price, color='blue', linestyle='-', alpha=0.7, label='Mid Price')
    
    # Set labels and title
    ax.set_title(f'Orderbook Depth at Timestamp {timestamp}')
    ax.set_xlabel('Volume')
    ax.set_ylabel('Price')
    
    # Add legend
    ax.legend()
    
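    # Show absolute values on the x-axis (bids are plotted as negative volumes).
    # A formatter avoids relabeling ticks whose positions have not been fixed.
    ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f"{abs(x):g}"))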
    
    # Add grid
    ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Locate this row by position so the lookup does not depend on the index labels
    pos = data.index.get_loc(row.name)
    
    # Report the return realized over the next step, if there is one
    next_pos = pos + 1
    if next_pos < len(data):
        next_row = data.iloc[next_pos]
        return_value = next_row['returns']
        print(f"Return over the next step: {return_value:.6f}")
        print(f"Volume imbalance: {row['volume_imbalance']:.4f}")
        print(f"Bid-Ask Spread: {row['spread']:.2f}")

In [None]:
# Get a few examples of large positive and negative return events
positive_examples = positive_returns.sort_values('return_value', ascending=False).head(2)['timestamp'].values
negative_examples = negative_returns.sort_values('return_value').head(2)['timestamp'].values

# Combine positive and negative examples
examples = list(positive_examples) + list(negative_examples)

for timestamp in examples:
    print(f"\nVisualizing orderbook depth at timestamp {timestamp}:")
    visualize_orderbook_depth(squid_data_with_features, timestamp)

## 10. Conclusions

Based on our analysis, we can draw the following conclusions about orderbook patterns before large returns:

1. **Volume Imbalance**: Large positive returns tend to follow positive volume imbalance (more bid volume than ask volume), while large negative returns tend to follow negative volume imbalance (more ask volume than bid volume).

2. **Bid-Ask Spread**: The spread tends to be wider before large price movements, especially before negative returns. This suggests increased uncertainty or volatility in the market. Note that Section 6 compares positive events with negative events only, so confirming that the spread is wider than usual requires a comparison against all observations.

3. **Book Depth**: Book depth differs before positive and negative returns. Lower book depth (less liquidity) may leave room for larger price movements.

4. **Price Impact**: The price impact features (`bid_price_impact` and `ask_price_impact`) differ between positive and negative return events, suggesting that the shape of the orderbook carries some information about the direction of the next move.

These patterns could inform trading strategies that anticipate large price movements based on orderbook features.
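
Statements about features being unusually high or low before large moves need a baseline: the same feature averaged over all observations. The sketch below shows one way to make that comparison. `compare_to_baseline` is a hypothetical helper, the toy frame is synthetic, and event positions are assumed to be integer row positions (which matches a frame whose index has been reset).

```python
import numpy as np
import pandas as pd

def compare_to_baseline(features, event_positions, columns):
    """Compare mean feature values one step before events with the full-sample mean.

    Hypothetical helper; `event_positions` are integer row positions of the events.
    """
    pre_positions = np.asarray(event_positions) - 1
    pre_positions = pre_positions[pre_positions >= 0]  # drop events with no prior row
    pre_event = features.iloc[pre_positions][columns]

    summary = pd.DataFrame({
        'pre_event_mean': pre_event.mean(),
        'baseline_mean': features[columns].mean(),
    })
    summary['ratio'] = summary['pre_event_mean'] / summary['baseline_mean']
    return summary

# Synthetic example: the spread is wide and the book thin just before rows 3 and 6
toy = pd.DataFrame({
    'spread':     [2.0, 2.0, 4.0, 2.0, 2.0, 6.0, 2.0, 2.0],
    'book_depth': [50.0, 50.0, 20.0, 50.0, 50.0, 10.0, 50.0, 50.0],
})
events = [3, 6]  # the pre-event rows are 2 and 5

summary = compare_to_baseline(toy, events, ['spread', 'book_depth'])
print(summary)
```

A ratio above 1 means the feature is larger before events than on average. The ratio is not meaningful for features whose baseline mean is close to zero, such as `volume_imbalance`, where the raw difference is a better summary. With the notebook's data, the call would be `compare_to_baseline(squid_data_with_features, large_return_indices, relevant_features)`.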

## 11. Next Steps

Based on our findings, here are some potential next steps for further analysis:

1. Develop a predictive model using orderbook features to forecast large price movements
2. Test trading strategies that exploit the patterns we've identified
3. Analyze the time decay of these signals (how long do they remain predictive?)
4. Investigate whether these patterns are consistent across different market conditions
5. Combine orderbook features with other data sources (e.g., trade history) for improved predictions