# Phase 1 — Finding Ethical Circuits (Probing Tournament)

**CSP-Ablation-Project** · v1.0

Interpretability analysis on OpenAI's weight-sparse transformer (`circuit-sparsity`, 419M params, 8 layers).  
We use a *probing tournament* (linear vs. MLP probes at every layer) to locate the layers that linearly encode the distinction between secure and insecure Python code.

**Google Colab only.**  
1. Open in Colab. **Runtime → Change runtime type → GPU** (T4 or better).  
2. Run the **install cell** once (uncomment its `pip` line first) → **Runtime → Restart session** → run all cells below it.  
3. Ensure `minimal_pairs_code.json` is available, either in the repo's `data/` directory or on Drive.

---

## Pipeline

1. Load CSP model + tokenizer  
2. Load minimal-pairs dataset  
3. Extract final-token hidden states at every layer  
4. Per-layer linear (LogReg) probe sweep  
5. Per-layer non-linear (MLP) probe sweep  
6. Train final probe at best linear layer, save artifacts  
7. Write results summary and push artifacts to Hugging Face

In [None]:
# First run only: uncomment the pip line below, run this cell, then Runtime → Restart session.
# On later runs, leave it commented out and skip this cell.
#!pip install -q torch transformers accelerate scikit-learn matplotlib pandas

---
## Setup: Mount Drive & clone repo

In [None]:
import os, sys
from google.colab import drive

drive.mount("/content/drive", force_remount=True)

DRIVE_ROOT = "/content/drive/MyDrive"
CODE_DIR   = os.path.join(DRIVE_ROOT, "CODE", "CSP-Ablation-Project")
DATA_DIR   = os.path.join(DRIVE_ROOT, "DATA", "CSP-Ablation-Project")
SPRINT, VERSION = "sprint1", "v1.0"

os.makedirs(DATA_DIR, exist_ok=True)

# Clone repo to Drive CODE/ (persists across sessions)
if not os.path.isdir(CODE_DIR):
    !git clone https://github.com/piotrwilam/CSP-Ablation-Project.git "{CODE_DIR}"
else:
    !cd "{CODE_DIR}" && git pull

if CODE_DIR not in sys.path:
    sys.path.insert(0, CODE_DIR)

from src.config import artifacts_dir
ARTIFACTS = artifacts_dir(SPRINT, VERSION)

print(f"CODE : {CODE_DIR}")
print(f"DATA : {DATA_DIR}")
print(f"ARTIFACTS : {ARTIFACTS}")

---
## 1. Load CSP model & tokenizer

In [None]:
from src.model_loader import load_model_and_tokenizer

model, tokenizer, layers = load_model_and_tokenizer()

---
## 2. Load minimal-pairs dataset

File `minimal_pairs_code.json` should be in the repo `data/` directory **or** on Drive.  
Adjust `PAIRS_PATH` below if needed.
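
For orientation, here is a minimal sketch of how a minimal-pairs file can be flattened into labelled probe examples. The JSON keys `secure` and `insecure` and the helper `flatten_pairs` are hypothetical; the actual schema of `minimal_pairs_code.json` and the behaviour of `load_minimal_pairs` may differ. Only the label convention (0 = secure, 1 = insecure) and the two-examples-per-pair count are taken from this notebook's summary cell.

```python
import json

# Hypothetical schema: the real minimal_pairs_code.json may use different keys.
raw = json.loads("""
[
  {"secure": "subprocess.run(['ls', path])",
   "insecure": "os.system('ls ' + path)"},
  {"secure": "yaml.safe_load(text)",
   "insecure": "yaml.load(text)"}
]
""")

SECURE, INSECURE = 0, 1  # label convention used in this notebook's summary


def flatten_pairs(pairs):
    """Turn each pair into two labelled examples: (code, label)."""
    examples = []
    for pair in pairs:
        examples.append((pair["secure"], SECURE))
        examples.append((pair["insecure"], INSECURE))
    return examples


examples = flatten_pairs(raw)
print(len(examples), "examples from", len(raw), "pairs")
```

Because the two snippets in a pair differ only in the vulnerable construct, a probe that separates them is less likely to be keying on unrelated surface features.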

In [None]:
from src.data_loader import load_minimal_pairs
from src.data_utils import get_dataset_path

PAIRS_PATH = get_dataset_path(SPRINT, CODE_DIR, DATA_DIR)
probe_examples = load_minimal_pairs(PAIRS_PATH)

---
## 3. Extract hidden states at all layers
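
As a rough illustration of "final-token hidden states", the sketch below picks, for each sequence in a batch, the activation at the last non-padding position. It uses NumPy arrays in place of model activations and assumes right padding; `final_token_states` is a hypothetical helper, not the repo's `collect_resid_all_layers`, which may handle padding and batching differently.

```python
import numpy as np


def final_token_states(hidden, attention_mask):
    """Select the hidden state of the last real (non-padding) token per sequence.

    hidden:         (batch, seq_len, d_model) activations from one layer
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    Assumes right padding (padding tokens come after the real ones).
    """
    last_idx = attention_mask.sum(axis=1) - 1      # (batch,) index of last real token
    batch_idx = np.arange(hidden.shape[0])
    return hidden[batch_idx, last_idx]             # (batch, d_model)


# Toy batch: 2 sequences, 4 positions, 3 features
hidden = np.arange(24, dtype=np.float32).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 0],    # 3 real tokens -> position 2
                 [1, 1, 1, 1]])   # 4 real tokens -> position 3
feats = final_token_states(hidden, mask)
print(feats.shape)
```

Repeating this for every layer gives one `(n_examples, d_model)` feature matrix per layer, which is the kind of input a per-layer probe needs.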

In [None]:
from src.hidden_states import collect_resid_all_layers

all_layer_data = collect_resid_all_layers(probe_examples, model, tokenizer, layers)

---
## 4. Per-layer linear probe sweep
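
The idea behind a per-layer linear sweep can be sketched on synthetic data: fit one logistic-regression probe per layer and compare held-out accuracy. This is an illustrative stand-in, not the implementation of `run_linear_sweep`: it uses a hand-rolled NumPy gradient-descent logistic regression, and the names `fit_logreg` and `probe_accuracy`, the 75/25 split, and the toy "layers" are all assumptions.

```python
import numpy as np


def fit_logreg(X, y, lr=0.1, steps=500):
    """Plain gradient-descent logistic regression (no regularisation)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(label = 1)
        grad = p - y                             # gradient of log-loss w.r.t. logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(X, y, rng, test_frac=0.25):
    """Split, standardise with train statistics, fit, report held-out accuracy."""
    order = rng.permutation(len(y))
    n_test = int(len(y) * test_frac)
    test, train = order[:n_test], order[n_test:]
    mu, sd = X[train].mean(axis=0), X[train].std(axis=0) + 1e-8
    Xtr, Xte = (X[train] - mu) / sd, (X[test] - mu) / sd
    w, b = fit_logreg(Xtr, y[train])
    pred = (Xte @ w + b) > 0
    return float((pred == y[test]).mean())


rng = np.random.default_rng(0)
n, d = 200, 16
y = np.tile([0, 1], n // 2)                  # alternating secure / insecure labels
noise_layer = rng.normal(size=(n, d))        # toy layer with no label information
signal_layer = rng.normal(size=(n, d))
signal_layer[:, 0] += 6.0 * y                # one direction separates the classes

layer_feats = {0: noise_layer, 1: signal_layer}
accs = {layer: probe_accuracy(X, y, rng) for layer, X in layer_feats.items()}
print(accs)
```

In this toy setup the layer with a linearly separable direction should score well above the pure-noise layer, which is expected to stay close to chance.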

In [None]:
from src.probing import run_linear_sweep, plot_linear_accuracy

linear_accs = run_linear_sweep(all_layer_data)
plot_linear_accuracy(linear_accs, ARTIFACTS)

---
## 5. Per-layer MLP probe sweep

In [None]:
from src.probing import run_mlp_sweep, plot_linear_vs_mlp

mlp_accs = run_mlp_sweep(all_layer_data)
plot_linear_vs_mlp(linear_accs, mlp_accs, ARTIFACTS)

---
## 6. Final probe at best linear layer & save
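
The cell below picks the best layer with `max(..., key=dict.get)`. A small sketch with made-up accuracies (the numbers are purely illustrative, not results) shows how that call behaves, including on ties.

```python
# Hypothetical per-layer accuracies keyed by layer index (illustrative values only)
toy_accs = {0: 0.61, 1: 0.74, 2: 0.88, 3: 0.88, 4: 0.80}

# max() iterates over the dict's keys and compares them by their accuracy.
# On ties it keeps the first maximal key in iteration (insertion) order.
best_layer = max(toy_accs, key=toy_accs.get)
print(best_layer)  # 2

# Full ranking, best first; sorted() is stable, so tied layers keep their order.
ranking = sorted(toy_accs.items(), key=lambda kv: kv[1], reverse=True)
print(ranking)
```

If two layers tie, the earlier one wins, so inspecting the full ranking is a cheap way to see whether the selected layer is a clear winner.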

In [None]:
from src.probing import train_final_probe

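# Layer with the highest probe accuracy for each probe type
# (max over a dict returns the first key on ties).
best_linear_layer = max(linear_accs, key=linear_accs.get)
best_mlp_layer = max(mlp_accs, key=mlp_accs.get)

# The final probe is trained at the best *linear* layer;
# best_mlp_layer is only recorded in the summary.
probe, scaler, acc = train_final_probe(all_layer_data, best_linear_layer, ARTIFACTS)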

---
## 7. Summary & push to Hugging Face

In [None]:
import json
from src.config import MODEL_ID

results = {
    "model": MODEL_ID,
    "critical_layer_linear": best_linear_layer,
    "critical_layer_mlp": best_mlp_layer,
    "probe_layer": best_linear_layer,
    "n_examples": len(probe_examples),
    "n_pairs": len(probe_examples) // 2,
    "labels": "0=SECURE, 1=INSECURE",
}

with open(os.path.join(ARTIFACTS, "analysis_results.json"), "w") as f:
    json.dump(results, f, indent=2)

print("Analysis summary:")
for k, v in results.items():
    print(f"  {k}: {v}")
print(f"\nAll outputs saved to {ARTIFACTS}/")

# Push artifacts to Hugging Face
from src.utils import save_to_hub
from src.config import HF_REPO_ID
hf_prefix = f"artifacts/{SPRINT}/{VERSION}"
for fname in ["code_vuln_probe.pkl", "X_train.npy", "y_train.npy", "analysis_results.json",
               "per_layer_linear_accuracy.csv", "per_layer_accuracy_comparison.csv",
               "per_layer_linear_accuracy.png", "per_layer_linear_vs_mlp.png"]:
    p = os.path.join(ARTIFACTS, fname)
    if os.path.exists(p):
        save_to_hub(p, f"{hf_prefix}/{fname}", HF_REPO_ID)