# Compression Basins in Language Models - Colab Setup

This notebook sets up and runs the compression basin experiments in Google Colab.

**Important**: Enable a GPU before running any cells: Runtime > Change runtime type > GPU



## Step 1: Install Dependencies



In [None]:
# Install required packages
!pip install -q torch transformers numpy scipy scikit-learn faiss-cpu nltk spacy matplotlib seaborn pandas tqdm datasets statsmodels



In [None]:
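# Download NLTK data: tokenizer, POS tagger, and universal tagset mapping.
# Recent NLTK releases look for 'punkt_tab' and 'averaged_perceptron_tagger_eng';
# older releases use the original names, so request both.
import nltk

resources = ['punkt', 'punkt_tab', 'averaged_perceptron_tagger',
             'averaged_perceptron_tagger_eng', 'universal_tagset']
failed = [r for r in resources if not nltk.download(r, quiet=True)]
print("NLTK data downloaded" if not failed else f"Could not download: {failed}")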
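Before moving on, it can help to confirm that every package from the install cell is actually present in the runtime. The following is a minimal sketch using only the standard library; the `REQUIRED` list and the `installed_versions` helper are illustrative names, not part of the repository.

```python
from importlib import metadata

# Distribution names exactly as passed to pip in the install cell.
REQUIRED = [
    "torch", "transformers", "numpy", "scipy", "scikit-learn", "faiss-cpu",
    "nltk", "spacy", "matplotlib", "seaborn", "pandas", "tqdm", "datasets",
    "statsmodels",
]

def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(REQUIRED).items():
    print(f"{name:15s} {version or 'MISSING'}")
```

Any line that reports `MISSING` points to a package whose installation failed; rerunning the install cell without `-q` shows the underlying pip error.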



## Step 2: Clone Repository

**Replace `YOUR_USERNAME` with your GitHub username**



In [None]:
# Replace with your GitHub repository URL
!git clone https://github.com/YOUR_USERNAME/basin-compression-analysis.git
%cd basin-compression-analysis



In [None]:
# Install the package in development mode
!pip install -e .



## Step 3: Verify GPU Availability



In [None]:
import torch

print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("Warning: No GPU detected. Experiments will run on CPU (much slower).")
    print("To enable GPU: Runtime > Change runtime type > GPU")



## Step 4: Run Experiments

Run one of the options below. Options A to C each run a single experiment; Option D runs all of them.

**Note**: The examples pass the `--use_small_dataset` flag, which selects WikiText-2 (smaller, ~4MB) instead of WikiText-103 (larger, ~300MB) for faster testing. Remove the flag to run on the full dataset.
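The experiment commands in this step share the same flags, so toggling between the small and full dataset amounts to adding or dropping one argument. The sketch below makes that explicit; `build_command` is a hypothetical helper written for this notebook, and it only uses the script paths and flags that appear in the cells of this step.

```python
import shlex

def build_command(script, max_sequences=100, k_neighbors=15,
                  max_length=128, use_small_dataset=True):
    """Assemble the argument list for one experiment script."""
    cmd = [
        "python", script,
        "--max_sequences", str(max_sequences),
        "--k_neighbors", str(k_neighbors),
        "--max_length", str(max_length),
    ]
    # WikiText-2 when the flag is present, WikiText-103 when it is omitted.
    if use_small_dataset:
        cmd.append("--use_small_dataset")
    return cmd

quick = build_command("scripts/run_memorization.py")
full = build_command("scripts/run_memorization.py", use_small_dataset=False)

print(shlex.join(quick))
print(shlex.join(full))
```

To launch a run from Python rather than with `!python`, the list can be passed to `subprocess.run(cmd, check=True)`.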



### Option A: Memorization Experiment



In [None]:
# Use --use_small_dataset for faster testing (WikiText-2 instead of WikiText-103)
# Remove this flag for the full dataset
!python scripts/run_memorization.py --max_sequences 100 --k_neighbors 15 --max_length 128 --use_small_dataset



### Option B: Token Importance Experiment



In [None]:
# Use --use_small_dataset for faster testing
!python scripts/run_token_importance.py --max_sequences 50 --k_neighbors 15 --max_length 128 --use_small_dataset



### Option C: Linguistic Structure Experiment



In [None]:
# Use --use_small_dataset for faster testing
!python scripts/run_linguistic.py --max_sequences 100 --k_neighbors 15 --max_length 128 --use_small_dataset



### Option D: Full Pipeline (All Experiments)



In [None]:
# Use --use_small_dataset for faster testing
!python scripts/run_full_pipeline.py --max_sequences 100 --k_neighbors 15 --max_length 128 --use_small_dataset



## Step 5: View Results



In [None]:
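# Display generated plots inline
from IPython.display import Image, display
import glob

# Only the current directory is searched; sorting gives a stable order
png_files = sorted(glob.glob("*.png"))
if not png_files:
    print("No PNG files found in the current directory.")
for img_file in png_files:
    print(f"\n{img_file}:")
    display(Image(filename=img_file))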
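If the scripts write their figures into a subdirectory rather than the repository root (an assumption; this notebook does not state where plots are saved), a recursive search will pick them up. A minimal sketch, with `find_plots` as an illustrative helper name:

```python
from pathlib import Path

def find_plots(root=".", pattern="*.png"):
    """Return all files under `root` matching `pattern`, newest first."""
    files = [p for p in Path(root).rglob(pattern) if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)

for path in find_plots():
    print(path)
```

Each returned path can then be shown with `display(Image(filename=str(path)))`. Sorting by modification time puts the most recent run's plots first.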



## Alternative: Use as a Python Library



In [None]:
# Example: Run memorization experiment programmatically
from compression_lm.models.model_loader import load_model
from compression_lm.data.load_datasets import load_wikitext
from compression_lm.experiments.memorization import run_memorization_experiment

# Load model (automatically uses GPU if available)
model, tokenizer, device = load_model('gpt2')

# Load data
texts = load_wikitext(split='test', max_samples=50)

# Run experiment
results = run_memorization_experiment(
    model=model,
    tokenizer=tokenizer,
    texts=texts,
    k_neighbors=15,
    max_sequences=50,
    max_length=128
)

print("\nExperiment complete!")
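The structure of the value returned by `run_memorization_experiment` is not documented in this notebook. Assuming it is a (possibly nested) dictionary, a small summary helper gives an overview without dumping large arrays to the output. The `describe` and `summarize` functions and the placeholder `example` dictionary are illustrative only.

```python
def describe(value):
    """One-line description: shape for arrays/tensors, length for containers."""
    shape = getattr(value, "shape", None)
    if shape is not None:
        return f"{type(value).__name__} shape={tuple(shape)}"
    if isinstance(value, (list, tuple, dict, str)):
        return f"{type(value).__name__} len={len(value)}"
    return f"{type(value).__name__} = {value!r}"

def summarize(obj, depth=0, max_depth=2):
    """Return indented summary lines for a nested dict."""
    if not isinstance(obj, dict):
        return ["  " * depth + describe(obj)]
    lines = []
    for key, value in obj.items():
        lines.append(f"{'  ' * depth}{key}: {describe(value)}")
        if isinstance(value, dict) and depth < max_depth:
            lines.extend(summarize(value, depth + 1, max_depth))
    return lines

# Placeholder data, not actual experiment output.
example = {"a_scalar": 0.5, "a_list": [1, 2, 3], "nested": {"k_neighbors": 15}}
print("\n".join(summarize(example)))
```

Calling `print("\n".join(summarize(results)))` after the experiment cell applies the same summary to the real return value.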

