# Intro
We typically use cosine similarity as a proxy for how related two features are.  
Finding nearest neighbors normally requires checking all pairwise distances.  

By running a hierarchical clustering method in advance, we can precompute these relationships so that retrieval is nearly instant at inference time.


This index over feature space is small: about 9 MB for ~300k features across the 12 layers of GPT2-small.  
Computing it takes about 4 minutes per layer of ~25k features, or about 45 minutes for all layers. The results were all saved to a pickle (`.pkl`) file.
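
As a rough illustration of the idea above, here is a minimal sketch on random stand-in vectors (not the real SAE decoder weights). It contrasts a brute-force nearest-neighbor search over cosine distances with a lookup into clusters precomputed by SciPy's hierarchical clustering. The array sizes, the `n_clusters` value, and all variable names are assumptions chosen for the example.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 16))  # stand-in for decoder directions (assumption)

# Brute force: compare one query feature against every other feature.
query = 0
dists = cdist(features[query:query + 1], features, metric="cosine")[0]
dists[query] = np.inf  # exclude the query itself
nearest = int(np.argmin(dists))

# Precompute once: average-linkage clustering under cosine distance.
linkage = hierarchy.linkage(features, method="average", metric="cosine")
n_clusters = 20  # illustrative value, not taken from the notebook
labels = hierarchy.fcluster(linkage, t=n_clusters, criterion="maxclust")

# At lookup time, related features are simply the members of the query's cluster.
cluster_members = np.flatnonzero(labels == labels[query])
print("brute-force nearest neighbor:", nearest)
print("cluster members:", cluster_members.tolist())
```

The linkage matrix is computed once and can be pickled, which is what makes later lookups cheap compared with recomputing pairwise distances.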

In [1]:
# mount your drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
!pip install nnsight transformer_lens sae-lens==3.9.0 numpy==1.26.4



# Loading, Preprocessing, Tokenizing

The sections below cover:

1. Loading the dataset. We currently use the Model-Written-Evals Agreeableness dataset by Anthropic.

2. Tokenizing the dataset. We currently use GPT-2 small as both the model and the tokenizer because of its small size. To change this, edit the model and tokenizer in this cell: https://colab.research.google.com/drive/1qUdsegUIZILv5DDLVStH6GTIz-ZhFmKH#scrollTo=h99FH4xDP2Y4&line=16&uniqifier=1

3. Pre-processing: generating activation deltas. Specifically, we compare same-layer activations across different prompts.

In [3]:
#@title Globally used variables

DEBUG = True

# Model and tokenizer
MODEL_NAME = 'gpt2'
TOKENIZER = "openai-community/gpt2"

# layer of interest for this run
# Needs to be constant for cached activations and SAE training
LAYER = 9

# This is the path for a shared drive
# Add the drive to your Google Drive by adding a shortcut as per
# https://stackoverflow.com/questions/54351852/accessing-shared-with-me-with-colab
DATASET_PATH = "/content/drive/MyDrive/Hackathon: Contrastive SAE Steering/datasets/agreeableness.json"
DRIVE_PATH = "/content/drive/MyDrive/Hackathon: Contrastive SAE Steering"

SV_PROMPT = "The thought of harming another person is exciting to me"
MAX_PADDED_LENGTH = 0

In [4]:
#@title Load Contrastive Dataset
import torch
import pickle
import json
from scipy.cluster import hierarchy
from transformers import Trainer, TrainingArguments, AutoModelForCausalLM, AutoTokenizer
from nnsight import LanguageModel

from accelerate import Accelerator

accelerator = Accelerator()
device = accelerator.device  # resolves to the GPU when one is available

#load gpt 2 small and tokenizer
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER)
model = LanguageModel(MODEL_NAME, low_cpu_mem_usage=False)

with open(DATASET_PATH) as f:
  prompts = json.load(f)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [5]:
#@title Prepare Prompts

# preparing the contrastive prompts
# right now I'm using a sample but we can easily generate them from Anthropic's Model-Written-Evals

tokenized_prompts = []

for i in range(len(prompts)):
    positive_prompt = prompts[i]['original_prompt']
    negative_prompt = prompts[i]['negative_prompt']

    # tokenize
    positive_inputs = tokenizer(positive_prompt, return_tensors="pt", padding=False)
    negative_inputs = tokenizer(negative_prompt, return_tensors="pt", padding=False)
    sv_prompts = tokenizer(SV_PROMPT, return_tensors="pt", padding=False)
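    # determine the max length across the positive, negative, and steering-vector prompts
    # (MAX_PADDED_LENGTH is overwritten on every iteration, so padding is per pair)
    MAX_PADDED_LENGTH = max(positive_inputs['input_ids'].size(1), negative_inputs['input_ids'].size(1), sv_prompts['input_ids'].size(1))
    # GPT-2 has no pad token, so reuse the EOS token for padding
    tokenizer.pad_token = tokenizer.eos_token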

    # pad both inputs to the same length so we can calculate differences later
    positive_inputs_padded = tokenizer(positive_prompt, return_tensors="pt", padding='max_length', max_length=MAX_PADDED_LENGTH)
    negative_inputs_padded = tokenizer(negative_prompt, return_tensors="pt", padding='max_length', max_length=MAX_PADDED_LENGTH)

    tokenized_prompts.append(positive_inputs_padded)
    tokenized_prompts.append(negative_inputs_padded)

# the positive and negative prompts are interleaved: A1 A2 B1 B2 C1 C2 ...
# so with 0-based indexing the 4th pair is at indices [6] and [7]
print("length of prompts", len(tokenized_prompts))
print("prompt sequence length", len(tokenized_prompts[9]['input_ids'][0]))
print("prompt sequence length", len(tokenized_prompts[8]['input_ids'][0]))


length of prompts 20
prompt sequence length 12
prompt sequence length 12


In [6]:
#@title Print Positive Prompts

# print out all the positive prompts just to see what's happening

positive_prompts = []
for i in range(0, len(tokenized_prompts), 2):
  positive_prompts.append(tokenized_prompts[i])

print(positive_prompts)

[{'input_ids': tensor([[ 1026,   338,  1593,   284,  2190,   661, 26820,   290,  2074,  1286,
         50256, 50256]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]])}, {'input_ids': tensor([[   40,  2883,   852, 12177,   284,  1854, 50256, 50256, 50256, 50256]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])}, {'input_ids': tensor([[ 1026,   318,  1593,   284,   307,  2074,   378,   290,  7613,   284,
          1854, 50256, 50256, 50256]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]])}, {'input_ids': tensor([[   40,  2883,   852, 22066,   284,  1854,    11,   852,  1612,   318,
          1257]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}, {'input_ids': tensor([[ 1026,   318,  1593,   284,   307,  4547,   290,  2074,   378,   284,
          1854, 50256]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]])}, {'input_ids': tensor([[   40,  1745, 17166,  1036,   463,  3212,   290,  2652,  7954,   

## Preprocessing

Below, we compute all activation deltas. These will be useful later for clustering, learning virtual features, and so on.

The steps are:

1. Define the layer of interest (global variable).
2. Run each prompt through the model and save that layer's output.
3. Compute the activation delta for each prompt pair (positive - negative).
4. Average the activation delta across all pairs.
5. Determine which token indices are relevant for features by sorting in descending order of absolute delta. Intuitively, a delta of 0 implies the token is unimportant, and the starting tokens tend to have near-zero deltas.
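
The steps above can be sketched on toy tensors as follows. This is a minimal illustration under stated assumptions, not the notebook's actual pipeline: the activations are random stand-ins, the sizes are made up, and, like the notebook's current code, it ranks token positions for a single pair rather than averaging over all pairs.

```python
import torch

torch.manual_seed(0)
seq_len, d_model = 6, 8  # toy sizes (assumptions); GPT-2 small uses d_model = 768

# Stand-ins for one layer's activations on a positive and a negative prompt.
positive = torch.randn(seq_len, d_model)
negative = positive.clone()
negative[3:] += torch.randn(seq_len - 3, d_model)  # prompts diverge from token 3 on

# Activation delta for the pair (positive - negative).
delta = positive - negative

# Average over the hidden dimension, take the magnitude per token,
# and rank token positions from most to least affected.
per_token = delta.mean(dim=1).abs()
ranked = torch.argsort(per_token, descending=True)

# Shared prefix tokens have exactly zero delta, so drop them.
relevant = [i.item() for i in ranked if per_token[i] != 0]
print(relevant)
```

Only the token positions where the two prompts differ survive the filter, which matches the intuition that identical leading tokens carry no contrastive signal.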

In [7]:
import math

# Initialize a list to store all activations
all_activations = []

for i in range(len(tokenized_prompts)):
  with model.trace(tokenized_prompts[i]['input_ids']):
    if i % 2 == 0: # if index is even, positive prompt
      prompt_type = 'positive'
    else:
      prompt_type = 'negative'
    output = model.transformer.h[LAYER].ln_2.output.save()

    pair_num = math.floor(i / 2) + 1

  # print(f"{prompt_type} in {pair_num} prompt:  {output}")
  # print(f"shape of output: {output.shape}")

  print(output.value[0].shape)
  # Store the activation
  all_activations.append(output.value[0])

# Keep a separate list of the unaveraged activations,
# since all_activations is reused later
unavg = list(all_activations)
print(len(unavg))

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


torch.Size([12, 768])
torch.Size([12, 768])
torch.Size([10, 768])
torch.Size([10, 768])
torch.Size([14, 768])
torch.Size([14, 768])
torch.Size([11, 768])
torch.Size([11, 768])
torch.Size([12, 768])
torch.Size([12, 768])
torch.Size([14, 768])
torch.Size([14, 768])
torch.Size([14, 768])
torch.Size([14, 768])
torch.Size([12, 768])
torch.Size([12, 768])
torch.Size([10, 768])
torch.Size([10, 768])
torch.Size([11, 768])
torch.Size([11, 768])
20


In [8]:
# compute activation deltas between each pair (total of 10 pairs in this sample)
activation_deltas = [unavg[i] - unavg[i + 1] for i in range(0, len(unavg), 2)]
print(len(activation_deltas))
print(activation_deltas[0])
print(activation_deltas[3])

# mean activation delta per token position (averaged over the hidden dimension)
# TODO: do this for all pairs, not just one pair
prompt_num = 7  # 0-based index of the pair to inspect
diff_act_mean = activation_deltas[prompt_num].mean(dim = 1)
abs_diff_act_mean = torch.abs(diff_act_mean)
print("absolute difference in activation before sort", abs_diff_act_mean)


10
tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 0.2884, -0.3193, -0.1254,  ...,  0.2287,  0.2695, -0.2427],
        [-0.1304, -0.4612,  0.3116,  ..., -0.0444,  0.2633, -0.2154],
        [-0.3949, -0.0616,  0.3481,  ...,  0.0007,  0.0368, -0.1381]],
       grad_fn=<SubBackward0>)
tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 0.0122, -0.0164, -0.0850,  ..., -0.2102,  0.0423,  0.0385],
        [ 0.0333,  0.0105, -0.0938,  ...,  0.0238, -0.0936,  0.0285],
        [-0.0526,  0.0264, -0.0480,  ..., -0.0110,  0.0179,  0.0568]],
       grad_fn=<SubBackward0>)
absolute difference in activation before sort tensor([0.0000, 0.000

In [9]:
# sort the token positions according to abs activation delta in descending order
sorted_indices =  torch.argsort(abs_diff_act_mean, descending = True)
print("descending sort indices", sorted_indices)

# Filter zero-valued indices in O(n) time
filtered_indices = [idx.item() for idx in sorted_indices if abs_diff_act_mean[idx] != 0]
filtered_indices

descending sort indices tensor([ 6,  9, 10, 11,  8,  7,  5,  3,  4,  0,  1,  2])


[6, 9, 10, 11, 8, 7, 5, 3, 4]

# Finding Features Using Tokens


In [10]:
import os
import sys
sys.path.append(DRIVE_PATH)
os.chdir(DRIVE_PATH)

from transformer_lens import HookedTransformer
from sae_lens import SAE
from sae_lens.toolkit.pretrained_saes import get_gpt2_res_jb_saes

device = 'cpu'

model = HookedTransformer.from_pretrained(MODEL_NAME, device = device)

# get the SAE for this layer
# TODO: Clean this up, make a global variable, etc etc?
sae, cfg_dict, _ = SAE.from_pretrained(
    release = "gpt2-small-res-jb",
    sae_id = f"blocks.{LAYER}.hook_resid_pre",
    device = device
)

# get hook point
hook_point = sae.cfg.hook_name
print(hook_point)



Loaded pretrained model gpt2 into HookedTransformer
blocks.9.hook_resid_pre


In [11]:
import numpy as np

sv_prompt = SV_PROMPT  # same prompt as defined in the globals cell
tokens = model.to_tokens(sv_prompt, prepend_bos=True, move_to_device=True)
sv_logits, cache = model.run_with_cache(sv_prompt, prepend_bos=True)

# if DEBUG:
#   print("tokens", tokens)
#   print("logits", sv_logits)
#   print("cache", cache)

# feature activations from our SAE
sv_feature_acts = sae.encode(cache[hook_point])

# # get sae_out
# sae_out = sae.decode(sv_feature_acts)

# top k activations
topk = torch.topk(sv_feature_acts, 3)

# This is a list of activation values (higher number == more activation)
acts = topk[0]
if DEBUG:
  print("Activations")
  print(acts)

#This is a list of feature identities that Neuronpedia will have collected
features = topk[1]
if DEBUG:
  print("Features")
  print(features)

# len(features[0]) == len(tokens[0]): there is one row of top-k features per token

print("Indices before nudge", (filtered_indices))

# avoid index out of bounds and move all indices by the diff
# taking care of off by one errors
diff = max(filtered_indices) - len(features[0]) + 1 if max(filtered_indices) >= len(features[0]) else 0
nudged_indices = [idx - diff for idx in filtered_indices]

print("Indices after nudge", nudged_indices)
print("Nudged acts", acts[0][nudged_indices])
print("Nudged feature indices", features[0][nudged_indices])

all_feats = []
for feat in features[0][nudged_indices]:
    all_feats.append(feat.tolist())

# Convert the nested list to a NumPy array
# TODO: Should this be a set or a counter?
flat_feat_list = np.array(all_feats).flatten().tolist()
print(flat_feat_list)
print("number of features collected", len(flat_feat_list))

Activations
tensor([[[437.7761, 417.3517, 408.8464],
         [ 30.8321,  16.6020,  15.4969],
         [ 52.2769,  11.5620,  10.1867],
         [ 18.8953,  10.9074,   9.0564],
         [ 18.4662,  12.3457,  11.5345],
         [ 24.9767,  24.0057,  13.7870],
         [ 22.4612,  14.4074,  13.6219],
         [ 15.6361,  13.8691,  13.5416],
         [ 23.1039,  14.7496,   9.1122],
         [ 27.6864,  22.7306,  22.2738],
         [ 25.3593,  23.3720,  17.3690]]], grad_fn=<TopkBackward0>)
Features
tensor([[[14940,  9185,  1871],
         [16820, 19520, 11682],
         [21344, 13862,  2476],
         [19438,  2470,  4810],
         [ 5827, 13300,  1722],
         [14506, 16651, 17786],
         [15198, 15149, 13568],
         [23647, 13747,  3960],
         [ 5538,  3338,  7430],
         [15396, 15365, 18153],
         [ 6206, 17609, 13089]]])
Indices before nudge [6, 9, 10, 11, 8, 7, 5, 3, 4]
Indices after nudge [5, 8, 9, 10, 7, 6, 4, 2, 3]
Nudged acts tensor([[24.9767, 24.0057, 13.7870]

In [12]:
from sae_lens.analysis.neuronpedia_integration import get_neuronpedia_quick_list

print("SAE features for relevant indices as per activation delta")
get_neuronpedia_quick_list(flat_feat_list, layer = LAYER)

SAE features for relevant indices as per activation delta


'https://neuronpedia.org/quick-list/?name=temporary_list&features=%5B%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%229-res-jb%22%2C%20%22index%22%3A%20%2214506%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%229-res-jb%22%2C%20%22index%22%3A%20%2216651%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%229-res-jb%22%2C%20%22index%22%3A%20%2217786%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%229-res-jb%22%2C%20%22index%22%3A%20%225538%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%229-res-jb%22%2C%20%22index%22%3A%20%223338%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%229-res-jb%22%2C%20%22index%22%3A%20%227430%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%229-res-jb%22%2C%20%22index%22%3A%20%2215396%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%229-res-jb%22%2C%20%22index%22%3A%20%2215365%2

# Hybrid Clustering
We want to find as many features related to this persona as possible.

# Find activation delta
Compute the activation delta, which has shape `[seq_length, activation]`, and sort it in descending order.

# SAELens
Using the token indices, run the positive prompt through sae_as_a_steering_vector.ipynb and grab the feature indices. The token position helps to find the relevant features, although not deterministically (3-5 is enough).

# woog's clustering
Run all features through woog's clustering algorithm (quite reliable for global similarity) and output the Neuronpedia labels as a JSON file. Manually eliminate any spurious features.

# Run (weighted) linear regression
Once we have the full set of relevant SAE features, run a regression with the SAE features as input and the activation dim / activation as target. If it correctly fits unseen evaluation data, we can try steering with the virtual feature.
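
The regression step described above could look roughly like the following. This is a sketch on synthetic data, not the hackathon's implementation: the feature activations, the target, the per-sample weights, and the train/eval split are all hypothetical, and weighted least squares via `np.linalg.lstsq` is just one reasonable way to do it.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 200, 5  # toy sizes (assumptions)

# X: SAE feature activations per sample; y: one target activation dimension.
X = rng.random((n_samples, n_features))
true_coef = np.array([1.5, -2.0, 0.0, 0.7, 3.0])  # synthetic ground truth
y = X @ true_coef + 0.01 * rng.normal(size=n_samples)

# Per-sample weights (hypothetical), e.g. to emphasize more reliable samples.
weights = rng.uniform(0.5, 1.5, size=n_samples)

# Hold out the last 50 samples as unseen evaluation data.
train, test = slice(0, 150), slice(150, None)

# Weighted least squares: scale rows by sqrt(weight), then solve ordinary least squares.
sw = np.sqrt(weights[train])[:, None]
coef, *_ = np.linalg.lstsq(X[train] * sw, y[train] * sw[:, 0], rcond=None)

# Check the fit on held-out data before trusting the combination for steering.
eval_mse = float(np.mean((X[test] @ coef - y[test]) ** 2))
print("coefficients:", np.round(coef, 2))
print("held-out MSE:", eval_mse)
```

A low held-out error suggests the linear combination of features generalizes, which is the condition the text gives for trying it as a virtual steering feature.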





In [13]:
residual_outputs = []

def hook_fn(module, input, output):
    residual_outputs.append(output)

# Register hooks for each GPT2Block
model = LanguageModel(MODEL_NAME, low_cpu_mem_usage=False)
for block in model.transformer.h:
    block.register_forward_hook(hook_fn)


In [14]:
import torch.nn.functional as F

def flatten_and_cosine_sim(tensor1, tensor2):
    # Flatten the tensors
    flattened1 = tensor1.view(tensor1.size(0), -1)
    flattened2 = tensor2.view(tensor2.size(0), -1)

    if flattened1.shape != flattened2.shape:
        raise ValueError(f"Tensors have different shapes after flattening: {flattened1.shape} vs {flattened2.shape}")

    # Compute row-wise cosine similarity
    # (F.cosine_similarity normalizes its inputs, so no separate normalization is needed)
    cosine_sim = F.cosine_similarity(flattened1, flattened2, dim=1)

    return cosine_sim

In [15]:
# mean cosine similarity between the positive and negative activations of the first five pairs
for pair_idx in range(5):
    pair_cos = flatten_and_cosine_sim(all_activations[2 * pair_idx], all_activations[2 * pair_idx + 1])
    print(pair_cos.mean())

tensor(0.6862, grad_fn=<MeanBackward0>)
tensor(0.9500, grad_fn=<MeanBackward0>)
tensor(0.5612, grad_fn=<MeanBackward0>)
tensor(0.9339, grad_fn=<MeanBackward0>)
tensor(0.7176, grad_fn=<MeanBackward0>)


In [18]:
from google.colab import drive
import torch
import pickle
import json
from scipy.cluster import hierarchy
from transformers import Trainer, TrainingArguments, AutoModelForCausalLM, AutoTokenizer
from nnsight import LanguageModel
from accelerate import Accelerator
import math
# # Contrastive Pair-Guided Clustering: July 2024 Mech Interp Hackathon Feature Clustering Algorithm
# In this notebook, we try contrastive-pair-guided clustering to split the dataset into two clusters. We then visualize the clusters as a binary tree.
# While traversing this tree, we identify the top-k features correlated with these sub-clusters and output the set of features maximally correlated with that specific sample dataset.
# We can then use a grid-search-like algorithm to find the linear combination that minimizes loss on an eval dataset.
# mount your drive
drive.mount('/content/drive')
!pip show accelerate
#@title to compute your own linkage matrices


accelerator = Accelerator()
device = accelerator.device  # resolves to the GPU when one is available

#load gpt 2 small and tokenizer
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = LanguageModel('gpt2', low_cpu_mem_usage=False)

with open(DATASET_PATH) as f:
  prompts = json.load(f)

prompts
tokenized_prompts = []

for i in range(len(prompts)):
    positive_prompt = prompts[i]['original_prompt']
    negative_prompt = prompts[i]['negative_prompt']

    # tokenize without padding first
    positive_inputs = tokenizer(positive_prompt, return_tensors="pt", padding=False)
    negative_inputs = tokenizer(negative_prompt, return_tensors="pt", padding=False)

    # determine the max length between positive and negative inputs
    max_length = max(positive_inputs['input_ids'].size(1), negative_inputs['input_ids'].size(1))

    tokenizer.pad_token = tokenizer.eos_token

    # pad both inputs to the same length so we can calculate differences later
    positive_inputs_padded = tokenizer(positive_prompt, return_tensors="pt", padding='max_length', max_length=max_length)
    negative_inputs_padded = tokenizer(negative_prompt, return_tensors="pt", padding='max_length', max_length=max_length)

    # append the padded inputs to tokenized_prompts
    tokenized_prompts.append(positive_inputs_padded)
    tokenized_prompts.append(negative_inputs_padded)

# the positive and negative prompts are interleaved: A1 A2 B1 B2 C1 C2 ...
# so with 0-based indexing the 4th pair is at indices [6] and [7]
print("length of prompts", len(tokenized_prompts))
print(tokenized_prompts[18])
print(tokenized_prompts[19])



# Initialize a list to store all activations
all_activations = []

for i in range(len(tokenized_prompts)):
  with model.trace(tokenized_prompts[i]['input_ids']):
    if i % 2 == 0: # if index is even, positive prompt
      prompt_type = 'positive'
    else:
      prompt_type = 'negative'
    output = model.transformer.h[7].attn.output.save()

    pair_num = math.floor(i / 2) + 1

  # print(f"{prompt_type} in {pair_num} prompt:  {output}")
  # print(f"shape of output: {output.shape}")

  # Store the activation
  all_activations.append(output.value[0])

print("all activations length", len(all_activations))


print("real diff", abs(all_activations[1]- all_activations[0]).mean())
print("real diff", abs(all_activations[3]- all_activations[2]).mean())

# Calculate cosine similarity
# (detach() avoids the UserWarning raised by wrapping an existing tensor in torch.tensor)
cos = torch.nn.CosineSimilarity(dim=1, eps=1e-6)
similarity_0_1 = cos(all_activations[0].detach(), all_activations[1].detach())
similarity_2_3 = cos(all_activations[3].detach(), all_activations[2].detach())

print("Cosine similarity between all_activations 0 and 1:", similarity_0_1.mean().item())
print("Cosine similarity between all_activations 2 and 3:", similarity_2_3.mean().item())



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Name: accelerate
Version: 0.32.1
Summary: Accelerate
Home-page: https://github.com/huggingface/accelerate
Author: The HuggingFace team
Author-email: zach.mueller@huggingface.co
License: Apache
Location: /usr/local/lib/python3.10/dist-packages
Requires: huggingface-hub, numpy, packaging, psutil, pyyaml, safetensors, torch
Required-by: nnsight, transformer-lens


You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


lenght of prompts 20
{'input_ids': tensor([[  40,  423,  645, 2328,  329,  584,  661,  338, 2476,  290, 7666]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
{'input_ids': tensor([[   40,   423,   645,  2328,   329,   584,   661,   338, 11904,   290,
         45926]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}




all activations length 20
real diff tensor(0.1835, grad_fn=<MeanBackward0>)
real diff tensor(0.1161, grad_fn=<MeanBackward0>)
Cosine similarity between all_activations 0 and 1: 0.5316609740257263
Cosine similarity between all_activations 0 and 2: 0.7728446125984192




In [19]:
# load your data in here
decoders = torch.rand([8, 1024, 256]) # e.g. 8 layers, 1024 feats in 256-dim space

linkages = {}
roots = {}
for setting in ['average', 'complete', 'weighted']:
    linkage_list = []
    root_list = []
    for layer in range(8):
        linkage = hierarchy.linkage(decoders[layer], method = setting, metric = 'cosine')
        root_list.append(hierarchy.to_tree(linkage))
        linkage_list.append(linkage)
    linkages[setting] = linkage_list
    roots[setting] = root_list
    print(f'{setting}: {linkage_list[0].shape} for each of {len(linkage_list)} layers')

with open('your_linkages.pkl', 'wb') as f:
    pickle.dump(linkages, f)

average: (1023, 4) for each of 8 layers
complete: (1023, 4) for each of 8 layers
weighted: (1023, 4) for each of 8 layers


In [20]:
#@title to download precomputed indices over GPT2-small residual stream SAEs

#!pip install gdown
filepath = 'https://drive.google.com/u/0/uc?id=1RXoS3woiEU1aX_waL8q1Dr5xiOu4NIht'
destpath = 'linkages.pkl'
!gdown {filepath} -O {destpath}

import pickle
from scipy.cluster import hierarchy

with open('linkages.pkl', 'rb') as f:
    linkages = pickle.load(f)

roots = {}
for key, value in linkages.items():
    if key == 'single': # doesn't work: makes long strands, hits recursion limit
        continue
    root_list = []
    for layer in range(12):
        root_list.append(hierarchy.to_tree(linkages[key][layer], rd=False))
    roots[key] = root_list
    print(f'{key}: {value[0].shape} for each of {len(value)} layers')

Downloading...
From: https://drive.google.com/u/0/uc?id=1RXoS3woiEU1aX_waL8q1Dr5xiOu4NIht
To: /content/drive/.shortcut-targets-by-id/1ko-m3alAXdEOtNInFdJjfpVgyaP5ETrC/Hackathon: Contrastive SAE Steering/linkages.pkl
100% 37.7M/37.7M [00:01<00:00, 34.6MB/s]
average: (24575, 4) for each of 12 layers
complete: (24575, 4) for each of 12 layers
weighted: (24575, 4) for each of 12 layers


In [None]:
#@title Helper methods
import json
import urllib.parse

def get_node_indices(node):
    '''
    Gets the indices of samples belonging to a node
    '''
    if node.is_leaf():
        return [node.id]
    else:
        left_indices = get_node_indices(node.left)
        right_indices = get_node_indices(node.right)
        return left_indices + right_indices

def find_node_path(layer, node_id, root):
    """
    Finds the path from the root node to the node with the given node_id.
    Returns a string of choices ('L' or 'R') to traverse the path,
    or None if the node is not found. `layer` is unused but kept for
    compatibility with existing callers.
    """
    def traverse(node, path=''):
        if node is None:
            return None
        if node.id == node_id:
            return path
        # Compare against None explicitly: the empty path '' is a valid result
        left_path = traverse(node.left, path + 'L')
        if left_path is not None:
            return left_path
        return traverse(node.right, path + 'R')

    return traverse(root)

def get_cluster_by_path(path, root):
    """
    Navigates the hierarchical clustering tree from the root node
    based on the given sequence of 'L' (left) and 'R' (right) choices.
    Returns the cluster node reached after following the path.
    """
    node = root
    for direction in path:
        if direction == 'L':
            node = node.left
        elif direction == 'R':
            node = node.right
        else:
            raise ValueError("Invalid direction: {}".format(direction))
    return node

def get_neuronpedia_quick_list(
    features: list[int],
    layer: int,
    model: str = "gpt2-small",
    dataset: str = "res-jb",
    name: str = "temporary_list",
    setting: str = "average",
):
    url = "https://neuronpedia.org/quick-list/"
    name = urllib.parse.quote(name)
    url = url + "?name=" + name
    list_feature = [
        {"modelId": model, "layer": f"{layer}-{dataset}", "index": str(feature)}
        for feature in features
    ]
    url = url + "&features=" + urllib.parse.quote(json.dumps(list_feature))
    print(url)
    return url

def build_cluster(layer, feature_id, height=3, setting = 'average', verbose=True):
    layer = int(layer)
    root = roots[setting][layer]
    node_path = find_node_path(layer, feature_id, root)
    cluster_path = node_path[:-height]
    cluster = get_cluster_by_path(cluster_path, root=root)
    indices = get_node_indices(cluster)
    list_name = f'height {height} above L{layer}f{feature_id} with cluster setting: {setting}'
    url = get_neuronpedia_quick_list(indices, layer, name=cluster_path)
    if verbose:
        print(f'path to node: {node_path}')
        print(f'path to cluster: {cluster_path}')
        print(f'features in cluster: {indices}')
    # return outside the verbose branch so callers always get the indices
    return indices

In [None]:
# helper methods related to the contrastive pair
import random
import json

def create_contrastive_pairs(dataset, num_pairs=100):
    contrastive_pairs = []

    for data in dataset:
        question = data["question"]
        statement = data["statement"]
        answer_matching = data["answer_matching_behavior"].strip()
        answer_not_matching = data["answer_not_matching_behavior"].strip()

        # Create positive example
        positive_prompt = f"{question}\n{statement}\nA) {answer_matching}\nB) {answer_not_matching}\nAnswer:"
        positive_completion = " A"

        # Create negative example
        negative_prompt = f"{question}\n{statement}\nA) {answer_matching}\nB) {answer_not_matching}\nAnswer:"
        negative_completion = " B"

        contrastive_pairs.append({
            "positive_prompt": positive_prompt,
            "positive_completion": positive_completion,
            "negative_prompt": negative_prompt,
            "negative_completion": negative_completion
        })

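    # Guard against an empty dataset, which would make the loop below spin forever
    if not contrastive_pairs:
        raise ValueError("dataset is empty; cannot create contrastive pairs")

    # Ensure we have at least the requested number of pairs (duplicating if needed)
    while len(contrastive_pairs) < num_pairs:
        contrastive_pairs.extend(contrastive_pairs)

    # Randomly select the requested number of pairs
    return random.sample(contrastive_pairs, num_pairs)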


In [None]:
# Specify dataset path
dataset_path = ''

with open(dataset_path, 'r') as f:
    dataset = [json.loads(line) for line in f]

# Create contrastive pairs
contrastive_pairs = create_contrastive_pairs(dataset)

# Print a sample pair
print("Sample contrastive pair:")
print("Positive prompt:", contrastive_pairs[0]["positive_prompt"])
print("Positive completion:", contrastive_pairs[0]["positive_completion"])
print("\nNegative prompt:", contrastive_pairs[0]["negative_prompt"])
print("Negative completion:", contrastive_pairs[0]["negative_completion"])

In [None]:
#@title #Usage
#@markdown Pick any layer, any feature from Joseph Bloom's GPT-2-small SAEs on the residual stream. Valid feature ids are between 0 and 24575.

#@markdown `build_cluster` will return a list of features related to it, and a Neuronpedia link to visualize all of them.

#@markdown If you use your own linkages for a different model, the features will still be related but the neuronpedia data won't be valid!

#@markdown The `height` parameter controls how large the cluster is, by including more distant features.

#@markdown If `height` is 6 or more, the URL might be too long to function.

layer = str(LAYER) #@param [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
feature_id = 8301 #@param {type: "integer"}
height = 2 #@param {type: "slider", min:1, max:8}
setting = 'average' #@param ['average', 'complete', 'weighted']
indices = build_cluster(
    layer=layer,
    feature_id=feature_id,
    height=height,
    setting=setting,
)

In [None]:
print(indices)
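
The effect of the `height` parameter described in the usage notes can be sketched on a toy tree. This is an illustrative example under stated assumptions: the vectors are random stand-ins for SAE decoder rows, and `path_to_leaf` and `cluster_at_height` are hypothetical helpers written for this sketch rather than the notebook's own functions.

```python
import numpy as np
from scipy.cluster import hierarchy

rng = np.random.default_rng(0)
features = rng.normal(size=(32, 8))  # toy stand-in for SAE decoder rows (assumption)
root = hierarchy.to_tree(
    hierarchy.linkage(features, method="average", metric="cosine")
)

def path_to_leaf(node, leaf_id, path=""):
    """Return the 'L'/'R' path from `node` to the leaf `leaf_id`, or None."""
    if node.is_leaf():
        return path if node.id == leaf_id else None
    left = path_to_leaf(node.left, leaf_id, path + "L")
    if left is not None:
        return left
    return path_to_leaf(node.right, leaf_id, path + "R")

def cluster_at_height(leaf_id, height):
    """Walk `height` levels up from the leaf and list all leaves below that node."""
    path = path_to_leaf(root, leaf_id)
    node = root
    for step in path[: max(len(path) - height, 0)]:
        node = node.left if step == "L" else node.right
    return sorted(node.pre_order())  # pre_order() returns the leaf ids under the node

for height in (1, 2, 3):
    print(height, cluster_at_height(leaf_id=5, height=height))
```

Each extra level of height moves one ancestor closer to the root, so the returned cluster can only grow, which is why large heights produce long feature lists and long URLs.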

Now that we have the clusters for a specific task, we will try to use the eval to steer the model in an interesting way.


