# TverskyGPT: Parameter-Efficient GPT with Shared Feature Bank

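This notebook demonstrates the TverskyGPT model, which uses Tversky similarity-based attention with a shared feature bank to reduce parameter count compared to standard GPT models. The cells below quantify the effect of sharing by comparing a shared-feature model with an otherwise identical non-shared one.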

<a href="https://colab.research.google.com/github/YOUR_USERNAME/tversky-similarity-grad/blob/main/tverskycv/notebooks/TverskyGPT_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## 1. Setup and Installation

First, let's install the required dependencies. Cloning the repository and adding it to the Python path is handled in the next section.


In [None]:
# Install required packages
%pip install torch torchvision transformers numpy matplotlib seaborn tqdm


## 2. Import Libraries

Import the necessary modules for working with TverskyGPT.
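
The import cell leaves the Colab and local path setup as commented-out lines. As a minimal sketch, the choice can also be made automatically. `running_in_colab` and `repo_root` are hypothetical helper names, the Colab path mirrors the commented `sys.path.insert` line, and the local path assumes the notebook sits two levels below the repository root.

```python
import os
import sys


def running_in_colab() -> bool:
    """Return True when the google.colab package is already loaded (a common Colab check)."""
    return "google.colab" in sys.modules


def repo_root() -> str:
    """Pick the directory to put on sys.path (hypothetical helper)."""
    if running_in_colab():
        # Same location as the commented-out Colab clone commands.
        return "/content/tversky-similarity-grad"
    # Assumption: the notebook lives two directories below the repository root.
    return os.path.abspath("../..")


root = repo_root()
if root not in sys.path:
    sys.path.insert(0, root)
print(f"Using repository root: {root}")
```

The repository still has to be cloned first; this only selects which path is added.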


In [None]:
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import GPT2Config
import sys
import os

# Setup for Google Colab
# Uncomment the following lines if running in Colab:
# %cd /content
# !git clone https://github.com/YOUR_USERNAME/tversky-similarity-grad.git
# %cd tversky-similarity-grad
# sys.path.insert(0, '/content/tversky-similarity-grad')

# For local development, uncomment and adjust:
# sys.path.insert(0, os.path.abspath('../..'))

# Import TverskyGPT utilities
try:
    from tverskycv.models.backbones.tversky_gpt import (
        create_tversky_gpt_from_config,
        TverskyGPTModel,
        count_parameters,
        get_shared_parameter_info
    )
    print("✓ Imports successful!")
except ImportError as e:
    print(f"Import error: {e}")
    print("Make sure you've cloned the repository and added it to the Python path.")
    print("In Colab, run: !git clone https://github.com/YOUR_USERNAME/tversky-similarity-grad.git")
    raise  # Stop here: every later cell depends on these names


## 3. Understanding TverskyGPT

TverskyGPT uses **Tversky similarity-based attention** instead of standard dot-product attention. The key component is the **GlobalFeature bank**, which lets layers share feature matrices and Tversky parameters instead of each layer holding its own copy, so those parameters are counted once rather than once per layer.
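
To give a feel for what a Tversky-style similarity measures, here is a minimal sketch of Tversky's contrast model on plain Python sets. It assumes that `alpha`, `beta`, and `gamma` act as the weights on the two sets of distinctive features and on the common features. The repository's attention layer presumably works on learned feature vectors rather than sets, so treat this as an illustration, not its implementation.

```python
def tversky_contrast(a: set, b: set, alpha: float = 0.5,
                     beta: float = 0.5, gamma: float = 1.0) -> float:
    """Contrast-model similarity, using set size as the salience measure."""
    common = len(a & b)   # features both items have
    only_a = len(a - b)   # features only the first item has
    only_b = len(b - a)   # features only the second item has
    return gamma * common - alpha * only_a - beta * only_b


query = {"red", "round", "small"}
key = {"red", "round", "large", "heavy"}

# Two common features, one feature only in query, two only in key.
score = tversky_contrast(query, key)
print(score)  # 0.5
```

Shared features raise the score, and features that belong to only one of the two items lower it.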

### Key Benefits:
- **Parameter Reduction**: Shared feature matrices reduce the total number of parameters
- **Memory Efficient**: Fewer parameters mean less memory used during training and inference
- **GPT-2 Compatible**: Uses the same config structure as GPT-2


## 4. Create a TverskyGPT Model

Let's create a small TverskyGPT model to demonstrate the functionality.


In [None]:
# Create a GPT2Config (small model for demonstration)
config = GPT2Config(
    vocab_size=50257,      # Standard GPT-2 vocabulary size
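    n_embd=256,            # Embedding dimension under GPT2Config's own field name (smaller for demo)
    n_embed=256,           # Same value under the spelling the rest of this notebook reads back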
    n_layer=6,             # Number of transformer layers
    n_head=8,              # Number of attention heads
    n_positions=1024,      # Maximum sequence length
    n_inner=1024,          # FFN inner dimension
    resid_pdrop=0.1,       # Residual dropout
    embd_pdrop=0.1,        # Embedding dropout
    layer_norm_epsilon=1e-5,
)

print("Configuration:")
print(f"  Vocab size: {config.vocab_size:,}")
print(f"  Embedding dim: {config.n_embed}")
print(f"  Number of layers: {config.n_layer}")
print(f"  Number of heads: {config.n_head}")
print(f"  Max sequence length: {config.n_positions}")


In [None]:
# Create model with shared parameters (maximizes parameter reduction)
model = create_tversky_gpt_from_config(
    config=config,
    feature_key='shared',
    alpha=0.5,
    beta=0.5,
    gamma=1.0,
    share_across_layers=True,  # All layers share the same feature_key
)

print(f"✓ Model created!")
print(f"  Total parameters: {count_parameters(model):,}")


## 5. Analyze Shared Parameters

Let's examine how the GlobalFeature bank reduces parameters by sharing features across layers.
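
As a rough, library-independent illustration of where the saving comes from, the arithmetic can be sketched as follows. It assumes each layer would otherwise own one feature matrix of shape `(n_features, n_embd)`. The value of `n_features` is made up for the example, and `feature_bank_params` is a hypothetical helper, not part of the repository.

```python
def feature_bank_params(n_layer: int, n_features: int, n_embd: int, shared: bool) -> int:
    """Count parameters held in feature matrices of shape (n_features, n_embd)."""
    per_matrix = n_features * n_embd
    copies = 1 if shared else n_layer  # one shared copy vs one copy per layer
    return copies * per_matrix


n_layer, n_features, n_embd = 6, 64, 256  # n_features is a hypothetical value

shared_count = feature_bank_params(n_layer, n_features, n_embd, shared=True)
non_shared_count = feature_bank_params(n_layer, n_features, n_embd, shared=False)
saved = non_shared_count - shared_count

print(f"Shared:     {shared_count:,}")      # 16,384
print(f"Non-shared: {non_shared_count:,}")  # 98,304
print(f"Saved:      {saved:,}")             # 81,920
```

Under this assumption the saving grows with the number of layers, while embeddings and other unshared weights stay the same.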


In [None]:
# Get information about shared parameters
info = get_shared_parameter_info(model)

print("Shared Parameter Analysis:")
print("=" * 60)
print(f"Total model parameters: {info['total_model_params']:,}")
print(f"Total shared parameters: {info['total_shared_params']:,}")
print(f"Shared percentage: {info['shared_percentage']:.2f}%")
print("\nShared feature matrices:")
for key, count in info['shared_parameters'].items():
    print(f"  {key}: {count:,} parameters")


## 6. Compare Shared vs Non-Shared Models

Let's compare models with and without parameter sharing to see the reduction.
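
The comparison clears the feature bank before building each model. A toy keyed registry shows why that matters. `ToyFeatureBank` is a hypothetical stand-in written for this illustration and is not the repository's `GlobalFeature` class. It assumes only that a bank hands back the same object whenever the same key is requested.

```python
class ToyFeatureBank:
    """Hypothetical keyed registry; a stand-in, not the real GlobalFeature."""

    _store: dict = {}

    @classmethod
    def get(cls, key, factory):
        # Create the entry on first request, then keep returning the same object.
        if key not in cls._store:
            cls._store[key] = factory()
        return cls._store[key]

    @classmethod
    def clear(cls):
        cls._store.clear()


def make_features():
    return [0.0] * 4  # stand-in for a learnable feature matrix


layer1 = ToyFeatureBank.get("shared", make_features)
layer2 = ToyFeatureBank.get("shared", make_features)
same_object = layer1 is layer2            # both layers see one set of features

ToyFeatureBank.clear()
layer3 = ToyFeatureBank.get("shared", make_features)
fresh_after_clear = layer3 is not layer1  # a new entry is built after clearing

print(same_object, fresh_after_clear)  # True True
```

Without the `clear()` call, a model built later could pick up entries registered by an earlier one and skew the parameter comparison.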


In [None]:
# Compare shared vs non-shared models
from tverskycv.models.backbones.shared_tversky import GlobalFeature

# Clear the GlobalFeature bank first so neither model reuses features registered by earlier cells
gf = GlobalFeature()
gf.clear()

# Model with shared parameters (maximizes reduction)
model_shared = create_tversky_gpt_from_config(
    config=config,
    feature_key='shared_comparison',
    share_across_layers=True,
)

# Clear again before creating non-shared model
# (This ensures non-shared model doesn't reuse features from shared model)
gf.clear()

# Model without shared parameters (each layer has own features)
model_not_shared = create_tversky_gpt_from_config(
    config=config,
    feature_key='non_shared_comparison',
    share_across_layers=False,
)

params_shared = count_parameters(model_shared)
params_not_shared = count_parameters(model_not_shared)
reduction = ((params_not_shared - params_shared) / params_not_shared) * 100

print("Parameter Comparison:")
print("=" * 60)
print(f"Model with shared features:    {params_shared:,} parameters")
print(f"Model without shared features: {params_not_shared:,} parameters")
print(f"Parameter reduction:           {reduction:.2f}%")
print(f"Parameters saved:              {params_not_shared - params_shared:,}")


In [None]:
# Visualize the comparison
fig, ax = plt.subplots(1, 1, figsize=(10, 6))

models = ['Shared\nFeatures', 'Non-Shared\nFeatures']
params = [params_shared, params_not_shared]
colors = ['#2ecc71', '#e74c3c']

bars = ax.bar(models, params, color=colors, alpha=0.7, edgecolor='black', linewidth=2)
ax.set_ylabel('Number of Parameters', fontsize=12, fontweight='bold')
ax.set_title('Parameter Count: Shared vs Non-Shared TverskyGPT', fontsize=14, fontweight='bold')
ax.grid(axis='y', alpha=0.3, linestyle='--')

# Add value labels on bars
for bar, param in zip(bars, params):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{param:,}\n({param/1e6:.2f}M)',
            ha='center', va='bottom', fontsize=11, fontweight='bold')

# Add reduction annotation
ax.annotate(f'Reduction: {reduction:.1f}%',
            xy=(0.5, 0.5), xytext=(0.5, max(params) * 0.7),
            arrowprops=dict(arrowstyle='->', color='black', lw=2),
            fontsize=12, fontweight='bold', ha='center',
            bbox=dict(boxstyle='round,pad=0.5', facecolor='yellow', alpha=0.7))

plt.tight_layout()
plt.show()


## 7. Forward Pass Example

Let's test the model with a forward pass.


In [None]:
# Set model to evaluation mode
model.eval()

# Create dummy input tokens
batch_size = 2
seq_length = 10
input_ids = torch.randint(0, config.vocab_size, (batch_size, seq_length))

print(f"Input shape: {input_ids.shape}")
print(f"Input tokens (first batch): {input_ids[0].tolist()}")

# Forward pass
with torch.no_grad():
    outputs = model(input_ids=input_ids)
    hidden_states = outputs.last_hidden_state

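expected_shape = (batch_size, seq_length, config.n_embed)
print(f"\nOutput shape: {tuple(hidden_states.shape)}")
print(f"Expected:     {expected_shape}")
if tuple(hidden_states.shape) == expected_shape:
    print("✓ Forward pass successful!")
else:
    print("✗ Output shape does not match the configured embedding dimension.")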


## 8. Scaling Analysis

Let's see how parameter reduction scales with different model sizes.


In [None]:
# Test different model sizes
from tverskycv.models.backbones.shared_tversky import GlobalFeature

model_configs = [
    {"n_embed": 128, "n_layer": 4, "n_head": 4, "name": "Small"},
    {"n_embed": 256, "n_layer": 6, "n_head": 8, "name": "Medium"},
    {"n_embed": 512, "n_layer": 8, "n_head": 8, "name": "Large"},
]

results = []
gf = GlobalFeature()

for model_cfg in model_configs:
    # Clear GlobalFeature bank for each model size comparison
    gf.clear()
    
    test_config = GPT2Config(
        vocab_size=50257,
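        n_embd=model_cfg["n_embed"],   # GPT2Config's own field name
        n_embed=model_cfg["n_embed"],  # Spelling used elsewhere in this notebook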
        n_layer=model_cfg["n_layer"],
        n_head=model_cfg["n_head"],
        n_positions=1024,
        n_inner=model_cfg["n_embed"] * 4,
        resid_pdrop=0.1,
        embd_pdrop=0.1,
    )
    
    # Create shared model
    model_s = create_tversky_gpt_from_config(
        test_config, 
        feature_key=f"shared_{model_cfg['name'].lower()}",
        share_across_layers=True
    )
    
    # Clear before creating non-shared model
    gf.clear()
    
    # Create non-shared model
    model_ns = create_tversky_gpt_from_config(
        test_config, 
        feature_key=f"non_shared_{model_cfg['name'].lower()}",
        share_across_layers=False
    )
    
    params_s = count_parameters(model_s)
    params_ns = count_parameters(model_ns)
    reduction = ((params_ns - params_s) / params_ns) * 100
    
    results.append({
        "name": model_cfg["name"],
        "shared": params_s,
        "non_shared": params_ns,
        "reduction": reduction
    })

# Display results
print("Scaling Analysis:")
print("=" * 80)
print(f"{'Model':<10} {'Shared (M)':<15} {'Non-Shared (M)':<18} {'Reduction %':<15} {'Saved (M)':<15}")
print("-" * 80)
for r in results:
    print(f"{r['name']:<10} {r['shared']/1e6:<15.2f} {r['non_shared']/1e6:<18.2f} "
          f"{r['reduction']:<15.2f} {(r['non_shared']-r['shared'])/1e6:<15.2f}")


In [None]:
# Visualize scaling results
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Plot 1: Parameter count comparison
ax1 = axes[0]
x = np.arange(len(results))
width = 0.35

shared_params = [r['shared']/1e6 for r in results]
non_shared_params = [r['non_shared']/1e6 for r in results]

bars1 = ax1.bar(x - width/2, shared_params, width, label='Shared Features', color='#2ecc71', alpha=0.7)
bars2 = ax1.bar(x + width/2, non_shared_params, width, label='Non-Shared Features', color='#e74c3c', alpha=0.7)

ax1.set_xlabel('Model Size', fontsize=12, fontweight='bold')
ax1.set_ylabel('Parameters (Millions)', fontsize=12, fontweight='bold')
ax1.set_title('Parameter Count by Model Size', fontsize=14, fontweight='bold')
ax1.set_xticks(x)
ax1.set_xticklabels([r['name'] for r in results])
ax1.legend()
ax1.grid(axis='y', alpha=0.3, linestyle='--')

# Add value labels
for bars in [bars1, bars2]:
    for bar in bars:
        height = bar.get_height()
        ax1.text(bar.get_x() + bar.get_width()/2., height,
                f'{height:.2f}M',
                ha='center', va='bottom', fontsize=9)

# Plot 2: Reduction percentage
ax2 = axes[1]
reductions = [r['reduction'] for r in results]
bars3 = ax2.bar([r['name'] for r in results], reductions, color='#3498db', alpha=0.7, edgecolor='black', linewidth=2)

ax2.set_xlabel('Model Size', fontsize=12, fontweight='bold')
ax2.set_ylabel('Parameter Reduction (%)', fontsize=12, fontweight='bold')
ax2.set_title('Parameter Reduction by Model Size', fontsize=14, fontweight='bold')
ax2.grid(axis='y', alpha=0.3, linestyle='--')

# Add value labels
for bar, reduction in zip(bars3, reductions):
    height = bar.get_height()
    ax2.text(bar.get_x() + bar.get_width()/2., height,
            f'{reduction:.2f}%',
            ha='center', va='bottom', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()


## 9. Memory Usage Comparison

Let's compare the memory footprint of shared vs non-shared models.
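
Parameter memory at inference is only part of the picture. As a rough rule of thumb, float32 training with Adam also stores a gradient and two moment estimates per parameter. The sketch below puts the two estimates side by side; the parameter count is a hypothetical example value, and activations, buffers, and framework overhead are ignored.

```python
BYTES_PER_FLOAT32 = 4
BYTES_PER_MB = 1024 * 1024


def estimate_inference_memory_mb(num_params: int) -> float:
    """Weights only, stored as float32."""
    return num_params * BYTES_PER_FLOAT32 / BYTES_PER_MB


def estimate_training_memory_mb(num_params: int) -> float:
    """Weights + gradients + Adam's two moment buffers, all float32."""
    values_per_param = 1 + 1 + 2  # weight, gradient, first and second moments
    return num_params * BYTES_PER_FLOAT32 * values_per_param / BYTES_PER_MB


example_params = 1_000_000  # hypothetical parameter count

inference_mb = estimate_inference_memory_mb(example_params)
training_mb = estimate_training_memory_mb(example_params)
print(f"Inference: {inference_mb:.2f} MB")  # 3.81 MB
print(f"Training:  {training_mb:.2f} MB")   # 15.26 MB
```

Under these assumptions every parameter removed by sharing saves about four times as much memory during training as at inference.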


In [None]:
# Estimate parameter memory (in MB)
# Each float32 parameter uses 4 bytes
def estimate_memory_mb(model):
    """Return the float32 parameter footprint in MB (inference only)."""
    params = count_parameters(model)
    # Only parameters are counted here. Training needs more: gradients and
    # optimizer states (Adam keeps two extra values per parameter) are excluded.
    return (params * 4) / (1024 * 1024)

memory_shared = estimate_memory_mb(model_shared)
memory_not_shared = estimate_memory_mb(model_not_shared)
memory_saved = memory_not_shared - memory_shared

print("Memory Usage (Inference):")
print("=" * 60)
print(f"Shared model:      {memory_shared:.2f} MB")
print(f"Non-shared model:  {memory_not_shared:.2f} MB")
print(f"Memory saved:      {memory_saved:.2f} MB ({memory_saved/memory_not_shared*100:.2f}%)")


## 10. Custom Configuration Example

You can customize the Tversky parameters (alpha, beta, gamma) to control the similarity function behavior.
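
One consequence of setting `alpha` and `beta` to different values is that the similarity stops being symmetric. The sketch below shows this on plain Python sets using Tversky's contrast model. It assumes the three parameters weight the first item's distinctive features, the second item's distinctive features, and the common features. The repository's layer may define them differently.

```python
def tversky_contrast(a: set, b: set, alpha: float, beta: float, gamma: float) -> float:
    """Contrast-model similarity of a to b, using set size as the salience measure."""
    return gamma * len(a & b) - alpha * len(a - b) - beta * len(b - a)


rich = {"x1", "x2", "x3", "x4"}  # item with many features
sparse = {"x1", "x2"}            # item with few features

# Equal weights: swapping the arguments does not change the score.
symmetric = (tversky_contrast(rich, sparse, 0.5, 0.5, 1.0),
             tversky_contrast(sparse, rich, 0.5, 0.5, 1.0))

# Unequal weights: the extra features of the first argument are penalised more.
asymmetric = (tversky_contrast(rich, sparse, 0.7, 0.3, 1.5),
              tversky_contrast(sparse, rich, 0.7, 0.3, 1.5))

print(symmetric[0] == symmetric[1])   # True
print(asymmetric[0] < asymmetric[1])  # True
```

With `alpha > beta`, comparing the feature-rich item to the sparse one scores lower than the reverse comparison.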


In [None]:
# Create model with custom Tversky parameters
alpha, beta, gamma = 0.7, 0.3, 1.5

custom_model = create_tversky_gpt_from_config(
    config=config,
    feature_key='custom',
    alpha=alpha,  # Higher alpha = more weight on false positives
    beta=beta,    # Lower beta = less weight on false negatives
    gamma=gamma,  # Higher gamma = more weight on common features
    share_across_layers=True,
)

print("Custom TverskyGPT Model:")
print(f"  Alpha: {alpha} (controls false positives)")
print(f"  Beta:  {beta} (controls false negatives)")
print(f"  Gamma: {gamma} (controls common features)")
print(f"  Parameters: {count_parameters(custom_model):,}")


## 11. Summary

### Key Takeaways:

1. **Parameter Reduction**: Sharing features across layers reduces the parameter count relative to an otherwise identical non-shared model; Sections 6 and 8 report the measured reduction
2. **Memory Efficiency**: Fewer parameters mean a smaller parameter memory footprint at inference, and correspondingly smaller gradients and optimizer states during training
3. **Flexibility**: The Tversky parameters (alpha, beta, gamma) can be customized for different use cases
4. **GPT-2 Compatible**: Uses the standard `GPT2Config`, making it easy to integrate

### Use Cases:

- **Resource-Constrained Environments**: When memory/compute is limited
- **Large-Scale Models**: Use the scaling analysis in Section 8 to see how the reduction changes with model size
- **Research**: Experimenting with Tversky similarity-based attention mechanisms

### Next Steps:

- Try different model sizes and configurations
- Experiment with different alpha, beta, gamma values
- Compare performance with standard GPT-2 models
- Fine-tune on your specific task
