# Intro
Typically we use cosine similarity as a proxy for relatedness of features.  
Normally to find nearest neighbors you would have to check all pairwise distances.  

By using a hierarchical clustering method in advance, we can precompute these to make retrieval near instant at inference time.


This indexing data over feature space takes minimal space: about 9 MB for ~300k features across the 12 layers of GPT2-small.  
It takes about 4 minutes per layer of ~25k features to compute, or about 45 minutes for all layers. These were all saved to a pkl.
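
As a rough illustration of the precomputation idea, the sketch below builds a cosine-distance linkage over a small random matrix standing in for one layer of SAE decoder directions, then reads a feature's neighbors from the tree by walking up from its leaf. The array sizes, the random data, and the `neighbors_from_tree` helper are assumptions made for illustration; they are not the notebook's actual data or helper functions.

```python
import numpy as np
from scipy.cluster import hierarchy

rng = np.random.default_rng(0)
# Stand-in for one layer's decoder matrix: 64 features in a 16-dim space (hypothetical sizes)
decoder = rng.normal(size=(64, 16))

# Precompute once: average-linkage clustering under cosine distance
linkage = hierarchy.linkage(decoder, method="average", metric="cosine")
root = hierarchy.to_tree(linkage)

def neighbors_from_tree(root, feature_id, height=2):
    """Return the leaf ids in the subtree `height` levels above the leaf `feature_id`."""
    # Collect the chain of nodes from the leaf up to the root
    path = []

    def descend(node):
        if node.is_leaf():
            return node.id == feature_id
        for child in (node.left, node.right):
            if descend(child):
                path.append(child)
                return True
        return False

    descend(root)
    path.append(root)  # path is now [leaf, parent, ..., root]
    ancestor = path[min(height, len(path) - 1)]
    # pre_order() returns the ids of all leaves under this node
    return sorted(ancestor.pre_order())

cluster = neighbors_from_tree(root, feature_id=5, height=2)
print(cluster)
```

Once the linkage is stored, a lookup only walks one root-to-leaf path instead of comparing the query feature against every other feature.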

In [1]:
# mount your drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# Need to install it in this order
# Numpy causes problems with different version dependencies
# But 1.25.2 seems to work for everything
!pip install circuitsvis
!pip install nnsight transformer_lens sae-lens==3.9.0
!pip install bitsandbytes
!pip install numpy==1.25.2

Collecting numpy>=1.24 (from transformer_lens)
  Using cached numpy-1.26.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
Collecting torch>=2.1.0 (from nnsight)
  Using cached torch-2.3.0-cp310-cp310-manylinux1_x86_64.whl (779.1 MB)
Collecting nvidia-nccl-cu12==2.20.5 (from torch>=2.1.0->nnsight)
  Using cached nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl (176.2 MB)
Collecting triton==2.3.0 (from torch>=2.1.0->nnsight)
  Using cached triton-2.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (168.1 MB)
Installing collected packages: triton, nvidia-nccl-cu12, numpy, torch
  Attempting uninstall: triton
    Found existing installation: triton 2.1.0
    Uninstalling triton-2.1.0:
      Successfully uninstalled triton-2.1.0
  Attempting uninstall: nvidia-nccl-cu12
    Found existing installation: nvidia-nccl-cu12 2.18.1
    Uninstalling nvidia-nccl-cu12-2.18.1:
      Successfully uninstalled nvidia-nccl-cu12-2.18.1
  Attempting uninstall: num

In [3]:
import numpy as np
print(np.__version__)
!pip show numpy

1.25.2
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 3108, in _dep_map
    return self.__dep_map
  File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 2901, in __getattr__
    raise AttributeError(attr)
AttributeError: _DistInfoDistribution__dep_map

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/pip/_internal/cli/base_command.py", line 169, in exc_logging_wrapper
    status = run_func(*args)
  File "/usr/local/lib/python3.10/dist-packages/pip/_internal/commands/show.py", line 45, in run
    if not print_results(
  File "/usr/local/lib/python3.10/dist-packages/pip/_internal/commands/show.py", line 150, in print_results
    for i, dist in enumerate(distributions):
  File "/usr/local/lib/python3.10/dist-packages/pip/_internal/commands/show.py", line 104, in search_p

# Loading, Preprocessing, Tokenizing

The sections below cover:

1. Loading in the dataset. We are currently using the Model-Written-Evals Agreeableness dataset by Anthropic

2. Tokenizing the dataset. The model and tokenizer are set by `MODEL_NAME` and `TOKENIZER` in the globals cell below. The notebook was first written for GPT-2 small because of its small size, and the SAE and clustering cells further down still assume it. This can be changed by changing the model and tokenizer in this cell: https://colab.research.google.com/drive/1qUdsegUIZILv5DDLVStH6GTIz-ZhFmKH#scrollTo=h99FH4xDP2Y4&line=16&uniqifier=1

3. Pre-processing: Generating activation deltas. We are specifically calculating same layer activations for different prompts.

In [3]:
#@title Globally used variables

DEBUG = True

# Model and tokenizer
MODEL_NAME = 'google/gemma-2b'
TOKENIZER = "google/gemma-2b"

# Initialize Max Length of Tokens
MAX_PADDED_LENGTH = None

# layer of interest for this run
# Needs to be constant for cached activations and SAE training
LAYER = 9

# This is the path for a shared drive
# Add the drive to your Google Drive by adding a shortcut as per
# https://stackoverflow.com/questions/54351852/accessing-shared-with-me-with-colab
DATASET_PATH = "/content/drive/MyDrive/Hackathon: Contrastive SAE Steering/datasets/agreeableness.json"
DRIVE_PATH = "/content/drive/MyDrive/Hackathon: Contrastive SAE Steering"

SV_PROMPT = "The thought of harming another person is exciting to me"

# Arbitrary large padded length to keep everything the same
MAX_PADDED_LENGTH = 30

In [7]:
from google.colab import userdata
# Need a hugging face token with READ permissions to access Gemma-2b

hf_token = userdata.get('hf_token')

In [9]:
#@title Load Contrastive Dataset and Tokenizer
import torch
import pickle
import json
from scipy.cluster import hierarchy
from transformers import Trainer, TrainingArguments, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from transformer_lens.hook_points import HookPoint
from nnsight import LanguageModel

from accelerate import Accelerator

# # FOR GPU:
# accelerator = Accelerator()
# device = 'cuda'
# device = accelerator.device

# FOR TPU
device = 'tpu'
accelerator = Accelerator()
device = accelerator.device

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

# load the model and tokenizer named in the globals cell
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER, token=hf_token)
model = LanguageModel(MODEL_NAME, low_cpu_mem_usage=False, token=hf_token, quantization_config=quantization_config)

with open(DATASET_PATH) as f:
  prompts = json.load(f)


tokenizer_config.json:   0%|          | 0.00/33.6k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/627 [00:00<?, ?B/s]

`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


In [10]:
#@title Prepare Prompts

# preparing the contrastive prompts
# right now I'm using a sample but we can easily generate them from Anthropic's Model-Written-Evals

tokenized_prompts = []

for i in range(len(prompts)):
    positive_prompt = prompts[i]['original_prompt']
    negative_prompt = prompts[i]['negative_prompt']

    tokenizer.pad_token = tokenizer.eos_token

    # pad both inputs to the MAX_PADDED LENGTH length so we can calculate differences later
    positive_input_padded = tokenizer(positive_prompt, return_tensors="pt", padding='max_length', max_length=MAX_PADDED_LENGTH)
    negative_input_padded = tokenizer(negative_prompt, return_tensors="pt", padding='max_length', max_length=MAX_PADDED_LENGTH)

    tokenized_prompts.append(positive_input_padded)
    tokenized_prompts.append(negative_input_padded)

# the positive and negative prompts are paired A1A2B1B2C1C2...
# so with 0-based indexing, the 4th pair is at indices [6] and [7]
print("length of prompts", len(tokenized_prompts))
print("prompt sequence length", len(tokenized_prompts[9]['input_ids'][0]))
print("prompt sequence length", len(tokenized_prompts[8]['input_ids'][0]))

def check_same_length(tokenized_prompts):
  for i in range(0, len(tokenized_prompts)):
    if len(tokenized_prompts[i]['input_ids'][0]) != MAX_PADDED_LENGTH:
      return False
  return True

# Sanity check that all the prompt pairs are the same length
assert(check_same_length(tokenized_prompts))

length of prompts 20
prompt sequence length 30
prompt sequence length 30


In [11]:
#@title Print Positive Prompts

# print out all the positive prompts just to see what's happening

positive_prompts = []
for i in range(0, len(tokenized_prompts), 2):
  positive_prompts.append(tokenized_prompts[i])

print(positive_prompts)

[{'input_ids': tensor([[     1,      1,      1,      1,      1,      1,      1,      1,      1,
              1,      1,      1,      1,      1,      1,      1,      1,      1,
              2,   1718, 235303, 235256,   2845,    577,   3516,   1461,  35167,
            578,   2369,   2823]]), 'attention_mask': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1]])}, {'input_ids': tensor([[     1,      1,      1,      1,      1,      1,      1,      1,      1,
              1,      1,      1,      1,      1,      1,      1,      1,      1,
              1,      1,      1,      1,      1,      2, 235285,   4313,   1855,
          25201,    577,   3588]]), 'attention_mask': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
         1, 1, 1, 1, 1, 1]])}, {'input_ids': tensor([[     1,      1,      1,      1,      1,      1,      1,      1,      1,
              1,      1,      1,      1,      1,      1,  

## Preprocessing

Below, we compute all activation deltas. These are used later for clustering and for learning virtual features.

Basically:

1. Define layer of interest (global variable)
2. Get layer output of running the prompt through the model
3. Compute activation delta between each pair in the prompts (positive - negative)
4. Get average activation delta for all the pairs
5. Find which token positions are relevant for features by sorting positions by absolute delta in descending order. Intuitively, a delta near 0 implies the position is unimportant, and the starting tokens tend to have deltas near 0.
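
The steps above can be sketched on synthetic data. This is a minimal sketch, assuming NumPy arrays stand in for the per-prompt layer outputs of shape `[seq_len, d_model]`; the sizes, the `threshold` value, and the way the negative activations are constructed are placeholders rather than values from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs, seq_len, d_model = 4, 6, 8  # hypothetical sizes

# Interleaved activations: positive at even indices, negative at odd indices
activations = []
for _ in range(n_pairs):
    pos = rng.normal(size=(seq_len, d_model))
    neg = pos.copy()
    # Shift positions 3, 4, 5 by 1, 2, 3 so later tokens differ more (synthetic choice)
    neg[3:] -= np.arange(1, seq_len - 2)[:, None]
    activations.extend([pos, neg])

# Step 3: activation delta per pair (positive - negative)
deltas = np.stack(
    [activations[i] - activations[i + 1] for i in range(0, len(activations), 2)]
)

# Step 4: average delta over all pairs -> [seq_len, d_model]
mean_delta = deltas.mean(axis=0)

# Step 5: score each token position by |mean over d_model|, sort descending,
# and drop positions whose score is effectively zero
scores = np.abs(mean_delta.mean(axis=1))
threshold = 1e-4
ranked = [int(i) for i in np.argsort(-scores) if scores[i] > threshold]
print(ranked)
```

Positions 0 to 2 are identical in every synthetic pair, so they fall below the threshold and are filtered out, which mirrors the intuition that a zero delta marks an unimportant position.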

In [None]:
print(model)

In [12]:
import math

# Initialize a list to store all activations
all_activations = []

for i in range(len(tokenized_prompts)):
  with model.trace(tokenized_prompts[i]['input_ids']):
    if i % 2 == 0: # if index is even, positive prompt
      prompt_type = 'positive'
    else:
      prompt_type = 'negative'
    # There should be a better way of doing this
    # This below this is kind of hardcoded
    output = model.model.layers[LAYER].output.save()

    pair_num = math.floor(i / 2) + 1

  # print(f"{prompt_type} in {pair_num} prompt:  {output}")
  # print(f"shape of output: {output.shape}")

  # Store the activation
  all_activations.append(output.value[0])

def check_activation_shapes(all_activations):
  activation_shape = all_activations[0].shape
  for activation in all_activations[1:]:
    if activation.shape != activation_shape:
      return False
  return True

# Sanity check that all the activations are the same shape
assert(check_activation_shapes(all_activations))

# Keep a separate list of the unaveraged activations,
# as we will do operations on all_activations later
unavg = list(all_activations)
print(f"Number of activations stored: {len(unavg)}")
print(f"Activation Shape: {unavg[0].shape}")

import torch

# Calculate cosine similarity
cos = torch.nn.CosineSimilarity(dim=1, eps=1e-6)
similarity_0_1 = cos(torch.tensor(all_activations[0]), torch.tensor(all_activations[1]))
similarity_0_2 = cos(torch.tensor(all_activations[0]), torch.tensor(all_activations[2]))

print("Cosine similarity between all_activations 0 and 1:", similarity_0_1.mean().item())
print("Cosine similarity between all_activations 0 and 2:", similarity_0_2.mean().item())

You're using a GemmaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


model.safetensors.index.json:   0%|          | 0.00/13.5k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/67.1M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]



TypeError: Multiple dispatch failed for 'torch._ops.aten.t.default'; all __torch_dispatch__ handlers returned NotImplemented:

  - mode object <torch._subclasses.fake_tensor.FakeTensorMode object at 0x78dcb03b8040>

For more information, try re-running with TORCH_LOGS=not_implemented

In [None]:
# compute activation deltas between each pair (total of 10 pairs in this sample)
activation_deltas = [unavg[i] - unavg[i + 1] for i in range(0, len(unavg), 2)]
print(len(activation_deltas))

# calculate the per-token mean activation delta (over the hidden dimension) for one pair
# TODO: do this for all pairs not just one prompt number
prompt_num = 8
torch.set_printoptions(profile="full")
print(activation_deltas[prompt_num])
torch.set_printoptions(profile="default")
diff_act_mean = activation_deltas[prompt_num].mean(dim = 1)
abs_diff_act_mean = torch.abs(diff_act_mean)
print("absolute difference in activation before sort", abs_diff_act_mean)


In [None]:
# sort the token positions according to abs activation delta in descending order
sorted_indices =  torch.argsort(abs_diff_act_mean, descending = True)
print(f"descending sort indices: {sorted_indices}")

# Filter zero-valued indices in O(n) time
# TODO Remove Magic Number
filtered_indices = [idx.item() for idx in sorted_indices if abs_diff_act_mean[idx] > 0.0001]
print(f"Filtered non-zero indices: {filtered_indices}")

In [None]:
# sanity check: encode and decode the prompt using the configured tokenizer

tokenizer = AutoTokenizer.from_pretrained(TOKENIZER)
text = prompts[prompt_num]['original_prompt']
encoded_text_pos = tokenizer.encode(text)
decoded_text_pos = tokenizer.decode(encoded_text_pos)
print(f"Original Positive Text: {text}")
print(f"Encoded text: {encoded_text_pos}")
# List the actual tokens (not whitespace-split words) so indices match token positions
for i, token in enumerate(tokenizer.convert_ids_to_tokens(encoded_text_pos)):
    print(f"Token: {i}: {token}")

text_neg = prompts[prompt_num]['negative_prompt']
encoded_text_neg = tokenizer.encode(text_neg)
decoded_text_neg = tokenizer.decode(encoded_text_neg)
print(f"Original Negative Text: {text_neg}")
print(f"Encoded text: {encoded_text_neg}")
# List the actual tokens (not whitespace-split words) so indices match token positions
for i, token in enumerate(tokenizer.convert_ids_to_tokens(encoded_text_neg)):
    print(f"Token: {i}: {token}")

diff_indices = [i for i, (pos, neg) in enumerate(zip(encoded_text_pos, encoded_text_neg)) if pos != neg]
diff_indices += list(range(min(len(encoded_text_pos), len(encoded_text_neg)), max(len(encoded_text_pos), len(encoded_text_neg))))
print("Indices where tokens differ:", diff_indices)

For prompt 8, it seems that although token positions 5, 7, and 9 are changed by the intervention, the activations at positions 6 and 8 are also affected, most likely due to the intervention on the previous token. Let's visualize the attention pattern at these positions using TransformerLens.

In [None]:
# Load the model named in MODEL_NAME as a HookedTransformer
from transformer_lens import HookedTransformer
model = HookedTransformer.from_pretrained(MODEL_NAME, tokenizer=tokenizer)

In [None]:
import torch
from transformer_lens import utils
import circuitsvis as cv

VISUALIZE_PADDED = False

# Define your input text
pos_text = prompts[prompt_num]['original_prompt']
neg_text = prompts[prompt_num]['negative_prompt']

## Can't get this working for some reason??
# def visualize_attention_heads(text, model):
#   text_tokens = model.to_tokens(text)
#   text_logits, text_cache = model.run_with_cache(text_tokens, remove_batch_dim=True)
#   attention_pattern = text_cache["pattern", LAYER, "attn"]
#   print(f"Attention Pattern Shape: {attention_pattern.shape}")
#   text_str_tokens =  model.to_str_tokens(text)
#   print(f"Layer {LAYER} Attention Patterns for {text}")
#   cv.attention.attention_patterns(tokens=text_tokens, attention=attention_pattern)


# Tokenize the input
pos_tokens = model.to_tokens(pos_text)
neg_tokens = model.to_tokens(neg_text)

if VISUALIZE_PADDED:
  # Tokenized and padded input from our tokenizer
  # tokenized_prompts is interleaved (positive, negative), so pair n sits at 2n and 2n + 1
  pos_tokens = tokenized_prompts[2 * prompt_num]['input_ids']
  neg_tokens = tokenized_prompts[2 * prompt_num + 1]['input_ids']

# Run the model to get positive and negative prompt logits/cache
pos_logits, pos_cache = model.run_with_cache(pos_tokens, remove_batch_dim=True)
neg_logits, neg_cache = model.run_with_cache(neg_tokens, remove_batch_dim=True)

In [None]:
print(type(pos_cache))
attention_pattern = pos_cache["pattern", LAYER, "attn"]
print(attention_pattern.shape)
pos_str_tokens = model.to_str_tokens(pos_text)

print(f"Layer {LAYER} Head Attention Patterns for POSITIVE:")
cv.attention.attention_patterns(tokens=pos_str_tokens, attention=attention_pattern)

In [None]:
print(type(neg_cache))
attention_pattern = neg_cache["pattern", LAYER, "attn"]
print(attention_pattern.shape)
neg_str_tokens = model.to_str_tokens(neg_text)

print(f"Layer {LAYER} Head Attention Patterns for NEGATIVE:")
cv.attention.attention_patterns(tokens=neg_str_tokens, attention=attention_pattern)

# Finding Features Using Tokens


In [None]:
import os
import sys
sys.path.append(DRIVE_PATH)
os.chdir(DRIVE_PATH)

from transformer_lens import HookedTransformer
from sae_lens import SAE
from sae_lens.toolkit.pretrained_saes import get_gpt2_res_jb_saes

device = 'cpu'

model = HookedTransformer.from_pretrained(MODEL_NAME, device = device, tokenizer=tokenizer)
model.tokenizer.pad_token = model.tokenizer.eos_token
model.tokenizer.pad_length = MAX_PADDED_LENGTH

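# get the SAE for this layer
# NOTE: the "gpt2-small-res-jb" release is trained on GPT-2 small, so MODEL_NAME
# must point to GPT-2 small (not Gemma) for the hook point and dimensions to match
# TODO: Clean this up, make a global variable, etc etc?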
sae, cfg_dict, _ = SAE.from_pretrained(
    release = "gpt2-small-res-jb",
    sae_id = f"blocks.{LAYER}.hook_resid_pre",
    device = device
)

# get hook point
hook_point = sae.cfg.hook_name
print(hook_point)

In [None]:
import numpy as np

sv_prompt = "Everyone should be treated with kindness, dignity and respect"
sv_tokens = model.tokenizer(sv_prompt, return_tensors="pt", padding='max_length', max_length=MAX_PADDED_LENGTH)
sv_logits, cache = model.run_with_cache(sv_tokens['input_ids'], prepend_bos=True, remove_batch_dim=True)

# if DEBUG:
#   print("tokens", tokens)
#   print("logits", sv_logits)
#   print("cache", cache)

# feature activations from our SAE
sv_feature_acts = sae.encode(cache[hook_point])

# top k activations
topk = torch.topk(sv_feature_acts, 3)

# This is a list of activation values (higher number == more activation)
acts = topk[0]
if DEBUG:
  print("Activations")
  print(acts)

#This is a list of feature identities that Neuronpedia will have collected
features = topk[1]
if DEBUG:
  print("Features")
  print(features)

all_feats = []
for feat in features[filtered_indices]:
    all_feats.append(feat.tolist())


## Nudging should no longer be needed thanks to the padding being equal for everyone
# print("Indices before nudge", (filtered_indices))

# # avoid index out of bounds and move all indices by the diff
# # taking care of off by one errors
# diff = max(filtered_indices) - len(features[0]) + 1 if max(filtered_indices) >= len(features[0]) else 0
# nudged_indices = [idx - diff for idx in filtered_indices]

# print("Indices after nudge", nudged_indices)
# print("Nudged acts", acts[0][nudged_indices])
# print("Nudged feature indices", features[0][nudged_indices])

# all_feats = []
# for feat in features[0][nudged_indices]:
#     all_feats.append(feat.tolist())

# Convert the nested list to a NumPy array and python set
flat_feat_list = np.array(all_feats).flatten().tolist()
print("number of features collected", len(flat_feat_list))
feature_set = set(flat_feat_list)
print("number of unique features collected", len(feature_set))

In [None]:
from sae_lens.analysis.neuronpedia_integration import get_neuronpedia_quick_list

print("SAE features for relevant indices as per activation delta")
get_neuronpedia_quick_list(list(feature_set), layer = LAYER)

# Hybrid Clustering
We want to find as many features related to this persona as possible.

# Find activation delta
The activation delta has shape [seq_length, activation]. Sort the token positions by activation delta in descending order.

# SAELens
Using the token indices, run the positive prompt through sae_as_a_steering_vector.ipynb and grab the feature indices. The token position helps (although not deterministically) to find the relevant features (3-5 is enough).

# woog's clustering
Run all features through woog's clustering algorithm (quite reliable for global similarity), and output the Neuronpedia labels as a JSON file. Manually eliminate spurious features.

# Run (weighted) linear regression
Once we have the full set of relevant SAE features, run a regression with the SAE feature activations as input and the activation (or a single activation dimension) as target. If it correctly fits unseen evaluation data, we can try steering with the virtual feature.
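
The regression step can be sketched as a weighted least-squares fit. This is a minimal sketch, assuming the SAE feature activations are collected into a matrix with one row per sample and that the target is a single activation dimension; the synthetic data, the per-sample weights, and the train/eval split are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 200, 5  # hypothetical sizes

X = rng.normal(size=(n_samples, n_features))      # stand-in for SAE feature activations
true_coef = np.array([1.5, -2.0, 0.0, 0.5, 3.0])  # synthetic ground truth
y = X @ true_coef + 0.01 * rng.normal(size=n_samples)
sample_weight = rng.uniform(0.5, 2.0, size=n_samples)

# Hold out the last 50 samples as unseen evaluation data
X_train, y_train, w_train = X[:150], y[:150], sample_weight[:150]
X_eval, y_eval = X[150:], y[150:]

# Weighted least squares: scale each row by sqrt(weight), then solve ordinary least squares
sqrt_w = np.sqrt(w_train)
coef, *_ = np.linalg.lstsq(X_train * sqrt_w[:, None], y_train * sqrt_w, rcond=None)

# Check the fit on the held-out samples
eval_mse = float(np.mean((X_eval @ coef - y_eval) ** 2))
print(coef.round(2), eval_mse)
```

A low error on the held-out samples is the signal described above that the fitted combination generalizes; with all weights equal, this reduces to ordinary least squares.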





In [None]:
# Initialize a dictionary
activation_dict = {}

# Iterate over the feature set and extract the corresponding activation tensors
for idx in feature_set:
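  # sv_feature_acts is [seq, n_features] (batch dim removed), so take every position for this feature
  activation_dict[idx] = sv_feature_acts[:, idx]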

print(activation_dict)

In [None]:
import torch
import einops
from typing import Dict

def fit_features_to_activation_delta(
    feature_activations: Dict[int, torch.Tensor],
    activation_delta: torch.Tensor
) -> torch.Tensor:
    """
    Fit a set of curated features to an activation delta using linear regression.

    Args:
    feature_activations (Dict[int, torch.Tensor]): Dictionary mapping feature indices to their activation tensors.
    activation_delta (torch.Tensor): Target activation delta to fit.

    Returns:
    torch.Tensor: Weights that maximize similarity to the activation delta.
    """
    # Convert the dictionary to a list of tensors, preserving the order of indices
    feature_indices = sorted(feature_activations.keys())
    X = torch.stack([feature_activations[idx] for idx in feature_indices])

    # Reshape X to (num_features, -1)
    X = einops.rearrange(X, 'features ... -> features (...)')

    # Reshape y (activation_delta) to (-1,)
    y = einops.rearrange(activation_delta, '... -> (...)')

    # Solve the least-squares problem min_w ||X^T w - y||
    weights = torch.linalg.lstsq(X.T, y).solution

    return weights

In [None]:
residual_outputs = []

def hook_fn(module, input, output):
    residual_outputs.append(output)

# Register hooks for each GPT2Block
for block in model.transformer.h:
    block.register_forward_hook(hook_fn)


In [None]:
import torch.nn.functional as F

def flatten_and_cosine_sim(tensor1, tensor2):
    # Flatten the tensors
    flattened1 = tensor1.view(tensor1.size(0), -1)
    flattened2 = tensor2.view(tensor2.size(0), -1)

    if flattened1.shape != flattened2.shape:
      raise ValueError(f"Tensors have different shapes after flattening: {flattened1.shape} vs {flattened2.shape}")


    # Normalize the flattened tensors
    normalized1 = F.normalize(flattened1, p=2, dim=1)
    normalized2 = F.normalize(flattened2, p=2, dim=1)

    # Compute cosine similarity
    cosine_sim = F.cosine_similarity(normalized1, normalized2)

    return cosine_sim

In [None]:
pos_n_neg_cos = flatten_and_cosine_sim(all_activations[0], all_activations[1])
cos2 = flatten_and_cosine_sim(all_activations[2], all_activations[3])
cos3 = flatten_and_cosine_sim(all_activations[4], all_activations[5])
cos4 = flatten_and_cosine_sim(all_activations[6], all_activations[7])
cos5 = flatten_and_cosine_sim(all_activations[8], all_activations[9])

print(pos_n_neg_cos.mean())
print(cos2.mean())
print(cos3.mean())
print(cos4.mean())
print(cos5.mean())

In [None]:
from google.colab import drive
import torch
import pickle
import json
from scipy.cluster import hierarchy
from transformers import Trainer, TrainingArguments, AutoModelForCausalLM, AutoTokenizer
from nnsight import LanguageModel
from accelerate import Accelerator
import math
# # Contrastive Pair-Guided Clustering: July 2024 Mech Interp Hackathon Feature Clustering Algorithm
# In this notebook, we try a contrastive-pair guided clustering to cluster the dataset into two clusters. Then we visualize the clusters as a binary tree.
# While we traverse this tree, we identify the top-k features correlated with these sub-clusters and output the set of features maximally correlated to that specific sample dataset.
# We can then use a grid-search like algorithm to find the right linear combinations that minimizes loss on an eval dataset
# mount your drive
drive.mount('/content/drive')
!pip show accelerate
#@title to compute your own linkage matrices


accelerator = Accelerator()
device = 'cuda'
device = accelerator.device

#load gpt 2 small and tokenizer
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = LanguageModel('gpt2', low_cpu_mem_usage=False)

with open('/content/agreeableness.json') as f:
  prompts = json.load(f)

prompts

## This code is above already
# tokenized_prompts = []

# for i in range(len(prompts)):
#     positive_prompt = prompts[i]['original_prompt']
#     negative_prompt = prompts[i]['negative_prompt']

#     # tokenize without padding first
#     positive_inputs = tokenizer(positive_prompt, return_tensors="pt", padding=False)
#     negative_inputs = tokenizer(negative_prompt, return_tensors="pt", padding=False)

#     # determine the max length between positive and negative inputs
#     max_length = max(positive_inputs['input_ids'].size(1), negative_inputs['input_ids'].size(1))

#     tokenizer.pad_token = tokenizer.eos_token

#     # pad both inputs to the same length so we can calculate differences later
#     positive_inputs_padded = tokenizer(positive_prompt, return_tensors="pt", padding='max_length', max_length=max_length)
#     negative_inputs_padded = tokenizer(negative_prompt, return_tensors="pt", padding='max_length', max_length=max_length)

#     # append the padded inputs to tokenized_prompts
#     tokenized_prompts.append(positive_inputs_padded)
#     tokenized_prompts.append(negative_inputs_padded)

# # the positive and negative prompts are paired A1A2B1B2C1C2...
# # so with 0-based indexing, the 4th pair is at indices [6] and [7]
# print("length of prompts", len(tokenized_prompts))
# print(tokenized_prompts[18])
# print(tokenized_prompts[19])



# # Initialize a list to store all activations
# all_activations = []

# for i in range(len(tokenized_prompts)):
#   with model.trace(tokenized_prompts[i]['input_ids']):
#     if i % 2 == 0: # if index is even, positive prompt
#       prompt_type = 'positive'
#     else:
#       prompt_type = 'negative'
#     output = model.transformer.h[7].attn.output.save()

#     pair_num = math.floor(i / 2) + 1

#   # print(f"{prompt_type} in {pair_num} prompt:  {output}")
#   # print(f"shape of output: {output.shape}")

#   # Store the activation
#   all_activations.append(output.value[0])

# print("all activations length", len(all_activations))


# print("real diff", abs(all_activations[1]- all_activations[0]).mean())
# print("real diff", abs(all_activations[3]- all_activations[2]).mean())




In [None]:
# load your data in here
decoders = torch.rand([8, 1024, 256]) # e.g. 8 layers, 1024 feats in 256-dim space

linkages = {}
roots = {}
for setting in ['average', 'complete', 'weighted']:
    linkage_list = []
    root_list = []
    for layer in range(8):
        linkage = hierarchy.linkage(decoders[layer], method = setting, metric = 'cosine')
        root_list.append(hierarchy.to_tree(linkage))
        linkage_list.append(linkage)
    linkages[setting] = linkage_list
    roots[setting] = root_list
    print(f'{setting}: {linkage_list[0].shape} for each of {len(linkage_list)} layers')

with open('your_linkages.pkl', 'wb') as f:
    pickle.dump(linkages, f)

In [None]:
#@title to download precomputed indices over GPT2-small residual stream SAEs

#!pip install gdown
filepath = 'https://drive.google.com/u/0/uc?id=1RXoS3woiEU1aX_waL8q1Dr5xiOu4NIht'
destpath = 'linkages.pkl'
!gdown {filepath} -O {destpath}

import pickle
from scipy.cluster import hierarchy

with open('linkages.pkl', 'rb') as f:
    linkages = pickle.load(f)

roots = {}
for key, value in linkages.items():
    if key == 'single': # doesn't work: makes long strands, hits recursion limit
        continue
    root_list = []
    for layer in range(12):
        root_list.append(hierarchy.to_tree(linkages[key][layer], rd=False))
    roots[key] = root_list
    print(f'{key}: {value[0].shape} for each of {len(value)} layers')

In [None]:
#@title Helper methods
import json
import urllib.parse

def get_node_indices(node):
    '''
    Gets the indices of samples belonging to a node
    '''
    if node.is_leaf():
        return [node.id]
    else:
        left_indices = get_node_indices(node.left)
        right_indices = get_node_indices(node.right)
        return left_indices + right_indices

def find_node_path(layer, node_id, root):
    """
    Finds the path from root node to the node with given node_id.
    Returns the path as a string of 'L' and 'R' choices, or None if the node is not found.
    """
    def traverse(node, path=''):
        if node is None:
            return None
        if node.id == node_id:
            return path
        left_path = traverse(node.left, path + 'L')
        right_path = traverse(node.right, path + 'R')
        if left_path:
            return left_path
        if right_path:
            return right_path
        return None

    return traverse(root)

def get_cluster_by_path(path, root):
    """
    Navigates the hierarchical clustering tree from the root node
    based on the given string of 'L' and 'R' choices.
    Returns the cluster node reached after following the path.
    """
    node = root
    for direction in path:
        if direction == 'L':
            node = node.left
        elif direction == 'R':
            node = node.right
        else:
            raise ValueError("Invalid direction: {}".format(direction))
    return node

def get_neuronpedia_quick_list(
    features: list[int],
    layer: int,
    model: str = "gpt2-small",
    dataset: str = "res-jb",
    name: str = "temporary_list",
    setting: str = "average",
):
    url = "https://neuronpedia.org/quick-list/"
    name = urllib.parse.quote(name)
    url = url + "?name=" + name
    list_feature = [
        {"modelId": model, "layer": f"{layer}-{dataset}", "index": str(feature)}
        for feature in features
    ]
    url = url + "&features=" + urllib.parse.quote(json.dumps(list_feature))
    print(url)
    return url

def build_cluster(layer, feature_id, height=3, setting = 'average', verbose=True):
    layer = int(layer)
    root = roots[setting][layer]
    node_path = find_node_path(layer, feature_id, root)
    cluster_path = node_path[:-height]
    cluster = get_cluster_by_path(cluster_path, root=root)
    indices = get_node_indices(cluster)
    list_name = f'height {height} above L{layer}f{feature_id} with cluster setting: {setting}'
    url = get_neuronpedia_quick_list(indices, layer, name=list_name)
    if verbose:
        print(f'path to node: {node_path}')
        print(f'path to cluster: {cluster_path}')
        print(f'features in cluster: {indices}')
    # Return the indices regardless of verbosity
    return indices

In [None]:
# # helper methods related to the contrastive pair
# import random
# import json

# def create_contrastive_pairs(dataset, num_pairs=100):
#     contrastive_pairs = []

#     for data in dataset:
#         question = data["question"]
#         statement = data["statement"]
#         answer_matching = data["answer_matching_behavior"].strip()
#         answer_not_matching = data["answer_not_matching_behavior"].strip()

#         # Create positive example
#         positive_prompt = f"{question}\n{statement}\nA) {answer_matching}\nB) {answer_not_matching}\nAnswer:"
#         positive_completion = " A"

#         # Create negative example
#         negative_prompt = f"{question}\n{statement}\nA) {answer_matching}\nB) {answer_not_matching}\nAnswer:"
#         negative_completion = " B"

#         contrastive_pairs.append({
#             "positive_prompt": positive_prompt,
#             "positive_completion": positive_completion,
#             "negative_prompt": negative_prompt,
#             "negative_completion": negative_completion
#         })

#     # Ensure we have at least the requested number of pairs
#     while len(contrastive_pairs) < num_pairs:
#         contrastive_pairs.extend(contrastive_pairs)

#     # Randomly select the requested number of pairs
#     return random.sample(contrastive_pairs, num_pairs)


In [None]:
# # Specify dataset path
# dataset_path = ''

# with open(dataset_path, 'r') as f:
#     dataset = [json.loads(line) for line in f]

# # Create contrastive pairs
# contrastive_pairs = create_contrastive_pairs(dataset)

# # Print a sample pair
# print("Sample contrastive pair:")
# print("Positive prompt:", contrastive_pairs[0]["positive_prompt"])
# print("Positive completion:", contrastive_pairs[0]["positive_completion"])
# print("\nNegative prompt:", contrastive_pairs[0]["negative_prompt"])
# print("Negative completion:", contrastive_pairs[0]["negative_completion"])

In [None]:
#@title #Usage
#@markdown Pick any layer, any feature from Joseph Bloom's GPT-2-small SAEs on the residual stream. Valid feature ids are between 0 and 24575.

#@markdown `build_cluster` will return a list of features related to it, and a Neuronpedia link to visualize all of them.

#@markdown If you use your own linkages for a different model, the features will still be related but the neuronpedia data won't be valid!

#@markdown The `height` parameter controls how large the cluster is, by including more distant features.

#@markdown If `height` is 6 or more, the URL might be too long to function.

layer = "9" #@param [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
feature_id = 2345 #@param {type: "integer"}
height = 2 #@param {type: "slider", min:1, max:8}
setting = 'average' #@param ['average', 'complete', 'weighted']
indices = build_cluster(
    layer=layer,
    feature_id=feature_id,
    height=height,
    setting=setting,
)

In [None]:
print(indices)

Now that we have the clusters for a specific task, we'll try to use the eval to steer the model in an interesting way.


