# Layer Experiment with GPT-2-XL

This notebook runs the LRE pipeline on the smaller GPT-2-XL model (1.5B parameters, 48 layers) and tests how the choice of layer affects LRE faithfulness.

In [1]:
import json
import random
from lre import LREModel


## 1. Configuration

GPT-2-XL has 48 transformer layers, indexed 0-47.

We'll test layers at different depths to understand where relational knowledge is encoded.
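As a rough illustration of what "different depths" means in practice, the sketch below maps relative depths to module names of the form `transformer.h.N` used in this notebook. The helper `layer_names_at_depths` is hypothetical (it is not part of the `lre` package), and the depth fractions are arbitrary examples.

```python
def layer_names_at_depths(n_layers, fractions, prefix="transformer.h"):
    """Map relative depths in [0, 1] to module names such as 'transformer.h.24'.

    Hypothetical helper for illustration only.
    """
    names = []
    for frac in fractions:
        if not 0.0 <= frac <= 1.0:
            raise ValueError(f"depth fraction must be in [0, 1], got {frac}")
        # Truncate to a layer index and clamp so that 1.0 maps to the last layer.
        idx = min(n_layers - 1, int(frac * n_layers))
        name = f"{prefix}.{idx}"
        if name not in names:  # skip duplicates when fractions collide
            names.append(name)
    return names


# GPT-2-XL has 48 layers; pick one early, one middle, and one late layer.
layers = layer_names_at_depths(48, [0.1, 0.5, 0.9])
print(layers)
```

For reference, the layers tested in this notebook (5, 15, and 35) sit at roughly 10%, 30%, and 75% of the model's depth.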


In [2]:
MODEL_NAME = "gpt2-xl"  # 1.5B parameters, 48 layers
TEMPLATE = "{} students are typically"

# Test layers at several depths of the 48-layer stack
LAYERS_TO_TEST = [
    "transformer.h.5",   # Early layer (~10% depth)
    "transformer.h.15",  # Early-middle layer (~30% depth)
    "transformer.h.35",  # Late layer (~75% depth)
]


## 2. Load and Split Data


In [None]:
with open("data/data_sample.json", "r") as f:
    data = json.load(f)

# Same split as demo for fair comparison
random.seed(42)  # Set seed for reproducibility
random.shuffle(data)
split_idx = int(len(data) * 0.6)
train_data = data[:split_idx]
test_data = data[split_idx:]

print(f"Data: {len(train_data)} train, {len(test_data)} test")


Data: 22 train, 16 test
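The cell above seeds and shuffles through the global `random` module, which also reorders `data` in place. A minimal alternative sketch, assuming you would rather leave the input list and the global random state untouched, uses a private `random.Random` instance. `split_data` is a hypothetical helper, and the integer list only stands in for the 38 records loaded from `data/data_sample.json`.

```python
import random


def split_data(items, train_frac=0.6, seed=42):
    """Return (train, test) without mutating `items` or the global RNG.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)   # private generator, independent of random.seed()
    shuffled = list(items)      # copy so the caller's list keeps its order
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the 38 loaded records.
records = list(range(38))
train, test = split_data(records)
print(f"Data: {len(train)} train, {len(test)} test")
```

A `random.Random(42)` instance is seeded the same way as the global generator after `random.seed(42)`, so on the same input order this should reproduce the split used above.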


## 3. Initialize Model

Load GPT-2-XL onto the target device. The cell below uses Apple's `mps` backend; switch to `"cpu"` or `"cuda"` as needed.


In [4]:
lre = LREModel(model_name=MODEL_NAME, device="mps")  # Change to "cpu" or "cuda" as needed


Loading gpt2-xl on mps...


## 4. Layer-by-Layer Experiment

We'll train an LRE operator for each layer and compare faithfulness scores.
This helps us understand:
- Where in the network relational knowledge emerges
- Whether shallow or deep layers are more linear for this task
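The internals of `train_lre` are not shown in this notebook, but its log output mentions solving a linear regression. The sketch below shows one way such a fit could look, assuming the operator is an affine map `O ≈ S @ W + b` fitted by ordinary least squares from subject representations `S` to target representations `O`. The function names, shapes, and synthetic data are illustrative assumptions, not the `lre` package's API.

```python
import numpy as np


def fit_linear_operator(S, O):
    """Fit O ≈ S @ W + b by ordinary least squares.

    S: (n, d_in) input representations; O: (n, d_out) target representations.
    Illustrative sketch only, not the lre package's implementation.
    """
    # Append a constant column so the bias is estimated jointly with W.
    S_aug = np.hstack([S, np.ones((S.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(S_aug, O, rcond=None)
    W, b = coef[:-1], coef[-1]
    return W, b


def apply_operator(W, b, S):
    """Apply the fitted affine map to new representations."""
    return S @ W + b


# Synthetic check: recover a known affine map from noiseless data.
rng = np.random.default_rng(0)
W_true = rng.normal(size=(4, 3))
b_true = rng.normal(size=3)
S = rng.normal(size=(20, 4))
O = S @ W_true + b_true

W_hat, b_hat = fit_linear_operator(S, O)
max_error = float(np.abs(apply_operator(W_hat, b_hat, S) - O).max())
print(f"max reconstruction error: {max_error:.2e}")
```

One caveat if the real fit works this way: GPT-2-XL's hidden size is 1600, far more than the 22 training examples, so the system would be underdetermined. In that case `np.linalg.lstsq` returns the minimum-norm solution, which fits the training set exactly and leaves held-out accuracy as the only meaningful signal.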


In [5]:
results = {}

for layer_name in LAYERS_TO_TEST:
    layer_num = layer_name.split(".")[-1]
    print(f"\n{'='*80}")
    print(f"TESTING LAYER {layer_num}")
    print(f"{'='*80}")
    
    # Train LRE operator for this layer
    operator = lre.train_lre(train_data, layer_name, TEMPLATE)
    
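    # Evaluate on the held-out test set; evaluate() prints the results table
    print(f"\nEvaluating Layer {layer_num}:")
    lre.evaluate(operator, test_data, layer_name, TEMPLATE)
    
    # Keep the trained operator for this layer so it can be reused later
    results[layer_name] = operator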



TESTING LAYER 5
Extracting training representations...
Solving Linear Regression...

Evaluating Layer 5:

                               EVALUATION RESULTS                               
Subject                   Expected        LRE Prediction      Status
--------------------------------------------------------------------------------
nursing                   women           women            ✓ Correct
geology                   men             men              ✓ Correct
accounting                men             women              ✗ Wrong
environmental science     women           women            ✓ Correct
computer science          men             women              ✗ Wrong
anthropology              women           women            ✓ Correct
marine biology            women           men                ✗ Wrong
political science         men             women              ✗ Wrong
chemistry                 men             men              ✓ Correct
business                  men            

## 5. Additional Experiment: Different Prompt Template

Let's also test whether a different prompt template changes the results.


In [6]:
# Try an alternative template
ALT_TEMPLATE = "Most {} majors are"
BEST_LAYER = "transformer.h.15"  # Based on results above, adjust if needed

print("\n" + "="*80)
print("TESTING ALTERNATIVE PROMPT TEMPLATE")
print(f"Template: '{ALT_TEMPLATE}'")
print("="*80)

operator_alt = lre.train_lre(train_data, BEST_LAYER, ALT_TEMPLATE)
lre.evaluate(operator_alt, test_data, BEST_LAYER, ALT_TEMPLATE)



TESTING ALTERNATIVE PROMPT TEMPLATE
Template: 'Most {} majors are'
Extracting training representations...
Solving Linear Regression...

                               EVALUATION RESULTS                               
Subject                   Expected        LRE Prediction      Status
--------------------------------------------------------------------------------
nursing                   women           women            ✓ Correct
geology                   men             men              ✓ Correct
accounting                men             women              ✗ Wrong
environmental science     women           women            ✓ Correct
computer science          men             men              ✓ Correct
anthropology              women           women            ✓ Correct
marine biology            women           women            ✓ Correct
political science         men             women              ✗ Wrong
chemistry                 men             men              ✓ Correct
business   
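Both result tables are cut off after the first nine subjects, so a full faithfulness score cannot be read from them. As a rough tally over only those nine visible rows, the sketch below computes the fraction of matching predictions for each run. The `faithfulness` helper is hypothetical, and the pairs are transcribed by hand from the tables above.

```python
def faithfulness(rows):
    """Fraction of (expected, predicted) pairs that match.

    Hypothetical helper for illustration only.
    """
    if not rows:
        raise ValueError("no rows to score")
    return sum(expected == predicted for expected, predicted in rows) / len(rows)


# (expected, predicted) for the nine visible subjects, in table order:
# nursing, geology, accounting, environmental science, computer science,
# anthropology, marine biology, political science, chemistry.
layer5_default_template = [
    ("women", "women"), ("men", "men"), ("men", "women"),
    ("women", "women"), ("men", "women"), ("women", "women"),
    ("women", "men"), ("men", "women"), ("men", "men"),
]
layer15_alt_template = [
    ("women", "women"), ("men", "men"), ("men", "women"),
    ("women", "women"), ("men", "men"), ("women", "women"),
    ("women", "women"), ("men", "women"), ("men", "men"),
]

score_layer5 = faithfulness(layer5_default_template)
score_layer15_alt = faithfulness(layer15_alt_template)
print(f"Layer 5, default template (visible rows):  {score_layer5:.2f}")
print(f"Layer 15, alternative template (visible rows): {score_layer15_alt:.2f}")
```

These two runs differ in both layer and template, so the gap between them cannot be attributed to the template alone; comparing against layer 15 with the original template would isolate that effect.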