# Intro
We typically use cosine similarity as a proxy for how related two features are.  
Finding nearest neighbors would normally require checking all pairwise distances.  

By running a hierarchical clustering method in advance, we can precompute these relationships so that retrieval is near-instant at inference time.


This index over feature space is small: about 9 MB for ~300k features across the 12 layers of GPT2-small.  
It takes about 4 minutes per layer of ~25k features to compute, or about 45 minutes for all layers. The results were all saved to a pickle (`.pkl`) file.
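
To make the baseline concrete, here is a minimal, dependency-free sketch of brute-force nearest-neighbor search by cosine similarity. The helper names (`cosine_similarity`, `brute_force_neighbors`) and the toy vectors are illustrative assumptions, not part of this notebook's pipeline; the point is only that every query has to touch every other feature.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def brute_force_neighbors(features, query_idx, k=2):
    """Rank every other feature by similarity to the query (O(n) per query)."""
    query = features[query_idx]
    scored = [
        (cosine_similarity(query, vec), idx)
        for idx, vec in enumerate(features)
        if idx != query_idx
    ]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:k]]

# Toy feature directions (hypothetical values, not real SAE decoder rows)
toy_features = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.8, 0.0, 0.2],
]
neighbors = brute_force_neighbors(toy_features, query_idx=0, k=2)
print(neighbors)
```

With ~25k features per layer, repeating this scan for every lookup is what the precomputed clustering is meant to avoid.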

In [None]:
# mount your drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Install the packages in this order.
# NumPy can cause version-dependency conflicts,
# but 1.25.2 seems to work for everything.
!pip install circuitsvis
!pip install nnsight transformer_lens sae-lens==3.9.0 bitsandbytes
# !pip install numpy==1.25.2

Collecting circuitsvis
  Using cached circuitsvis-1.43.2-py3-none-any.whl (1.8 MB)
Collecting torch>=1.10 (from circuitsvis)
  Using cached torch-2.3.1-cp310-cp310-manylinux1_x86_64.whl (779.1 MB)
INFO: pip is looking at multiple versions of torch to determine which version is compatible with other requirements. This could take a while.
  Using cached torch-2.3.0-cp310-cp310-manylinux1_x86_64.whl (779.1 MB)
  Using cached torch-2.2.2-cp310-cp310-manylinux1_x86_64.whl (755.5 MB)
  Using cached torch-2.2.1-cp310-cp310-manylinux1_x86_64.whl (755.5 MB)
  Using cached torch-2.2.0-cp310-cp310-manylinux1_x86_64.whl (755.5 MB)
  Using cached torch-2.1.2-cp310-cp310-manylinux1_x86_64.whl (670.2 MB)
Installing collected packages: torch, circuitsvis
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 2.3.0+cu121 requires torch==2.3.0, but you have torc

In [None]:
!pip install numpy==1.25.2


In [None]:
import numpy as np
print(np.__version__)
!pip show numpy

1.25.2
Name: numpy
Version: 1.25.2
Summary: Fundamental package for array computing in Python
Home-page: https://www.numpy.org
Author: Travis E. Oliphant et al.
Author-email: 
License: BSD-3-Clause
Location: /usr/local/lib/python3.10/dist-packages
Requires: 
Required-by: accelerate, albumentations, altair, arviz, astropy, autograd, automated-interpretability, bitsandbytes, blis, bokeh, bqplot, chex, circuitsvis, cmdstanpy, contourpy, cudf-cu12, cufflinks, cupy-cuda12x, cvxpy, datascience, datasets, db-dtypes, diffusers, dopamine_rl, ecos, flax, folium, geemap, gensim, gym, h5py, holoviews, hyperopt, ibis-framework, imageio, imbalanced-learn, imgaug, jax, jaxlib, librosa, lightgbm, matplotlib, matplotlib-venn, missingno, mizani, ml-dtypes, mlxtend, moviepy, music21, nibabel, numba, numexpr, opencv-contrib-python, opencv-python, opencv-python-headless, opt-einsum, optax, orbax-checkpoint, osqp, pandas, pandas-gbq, pandas-stubs, patsy, plotly-express, plotnine, prophet, pyarrow, pycoc

# Loading, Preprocessing, Tokenizing

The sections below cover:

1. Loading the dataset. We currently use the Model-Written-Evals Agreeableness dataset by Anthropic.

2. Tokenizing the dataset. We currently use GPT-2 small as the model and tokenizer because of its small size. To switch, change the model and tokenizer in this cell: https://colab.research.google.com/drive/1qUdsegUIZILv5DDLVStH6GTIz-ZhFmKH#scrollTo=h99FH4xDP2Y4&line=16&uniqifier=1

3. Pre-processing: generating activation deltas. Specifically, we compare activations at the same layer across different prompts.
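
Because the deltas are computed position by position, every prompt must be padded to the same length. The sketch below illustrates what left padding does to a token list; `left_pad` is a hypothetical helper written for illustration (the notebook itself relies on the Hugging Face tokenizer with `padding_side='left'`), and the token ids are just examples.

```python
def left_pad(token_ids, max_length, pad_id):
    """Left-pad a token list to max_length and build the matching attention mask."""
    if len(token_ids) > max_length:
        raise ValueError("prompt is longer than max_length")
    n_pad = max_length - len(token_ids)
    input_ids = [pad_id] * n_pad + list(token_ids)
    attention_mask = [0] * n_pad + [1] * len(token_ids)
    return input_ids, attention_mask

PAD_ID = 50256  # GPT-2's EOS token id, reused as the pad token in this notebook

short_ids, short_mask = left_pad([40, 2883, 852], max_length=8, pad_id=PAD_ID)
long_ids, long_mask = left_pad([1026, 338, 1593, 284, 2190], max_length=8, pad_id=PAD_ID)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

With left padding, the final real token of every prompt lands at the last position, which is why prompts of different lengths stay aligned at the end of the sequence.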

In [None]:
from google.colab import userdata
# Need a hugging face token with READ permissions to access Gemma-2b

hf_token = userdata.get('hf_token')

In [None]:
#@title Globally used variables

DEBUG = True

# Model and tokenizer

#GPT-2 Model
MODEL_NAME = 'gpt2'
TOKENIZER = "openai-community/gpt2"

# Gemma Model
# MODEL_NAME = 'google/gemma-2b'
# TOKENIZER = "google/gemma-2b"

# layer of interest for this run
# Needs to be constant for cached activations and SAE training
LAYER = 8

# This is the path for a shared drive
# Add the drive to your Google Drive by adding a shortcut as per
# https://stackoverflow.com/questions/54351852/accessing-shared-with-me-with-colab
DATASET_PATH = "/content/drive/MyDrive/Hackathon: Contrastive SAE Steering/datasets/agreeableness.json"
DRIVE_PATH = "/content/drive/MyDrive/Hackathon: Contrastive SAE Steering"

# SV_PROMPT = "The thought of harming another person is exciting to me"

# Arbitrary large padded length to keep everything the same
MAX_PADDED_LENGTH = 30

# Top N Indices to take
TOP_N_INDICES = 5

In [None]:
#@title Load Contrastive Dataset and Tokenizer
import torch
import pickle
import json
from scipy.cluster import hierarchy
from transformers import Trainer, TrainingArguments, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from transformer_lens.hook_points import HookPoint
from nnsight import LanguageModel

from accelerate import Accelerator

accelerator = Accelerator()
# Let accelerate pick the device (GPU if available, otherwise CPU)
device = accelerator.device

# quantization_config = BitsAndBytesConfig(load_in_4bit=True)

# Load the GPT-2 small tokenizer and model
# TODO: Pad left or pad right? For now, left padding so that the last token is at the same position.
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER, token=hf_token, padding_side='left')
# model = LanguageModel(MODEL_NAME, low_cpu_mem_usage=False, token=hf_token, quantization_config=quantization_config)
model = LanguageModel(MODEL_NAME, low_cpu_mem_usage=False, token=hf_token)

with open(DATASET_PATH) as f:
  prompts = json.load(f)




In [None]:
#@title Prepare Prompts

# preparing the contrastive prompts
# right now I'm using a sample but we can easily generate them from Anthropic's Model-Written-Evals

tokenized_prompts = []

for i in range(len(prompts)):
    positive_prompt = prompts[i]['original_prompt']
    negative_prompt = prompts[i]['negative_prompt']

    tokenizer.pad_token = tokenizer.eos_token

    # pad both inputs to MAX_PADDED_LENGTH so we can calculate differences later
    positive_input_padded = tokenizer(positive_prompt, return_tensors="pt",  padding='max_length', max_length=MAX_PADDED_LENGTH)
    negative_input_padded = tokenizer(negative_prompt, return_tensors="pt",  padding='max_length', max_length=MAX_PADDED_LENGTH)

    tokenized_prompts.append(positive_input_padded)
    tokenized_prompts.append(negative_input_padded)

# the positive and negative prompts are interleaved: A1 A2 B1 B2 C1 C2 ...
# so pair k (0-indexed) is at indices [2*k] and [2*k + 1]; e.g. the 4th pair is at [6] and [7]
print("length of prompts", len(tokenized_prompts))
print("prompt sequence length", len(tokenized_prompts[9]['input_ids'][0]))
print("prompt sequence length", len(tokenized_prompts[8]['input_ids'][0]))

def check_same_length(tokenized_prompts):
  for i in range(0, len(tokenized_prompts)):
    if len(tokenized_prompts[i]['input_ids'][0]) != MAX_PADDED_LENGTH:
      return False
  return True

# Sanity check that all the prompt pairs are the same length
assert(check_same_length(tokenized_prompts))

length of prompts 20
prompt sequence length 30
prompt sequence length 30


In [None]:
#@title Print Positive Prompts

# print out all the positive prompts just to see what's happening

positive_prompts = []
for i in range(0, len(tokenized_prompts), 2):
  positive_prompts.append(tokenized_prompts[i])

print(positive_prompts)

[{'input_ids': tensor([[50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          1026,   338,  1593,   284,  2190,   661, 26820,   290,  2074,  1286]]), 'attention_mask': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1]])}, {'input_ids': tensor([[50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256,    40,  2883,   852, 12177,   284,  1854]]), 'attention_mask': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         1, 1, 1, 1, 1, 1]])}, {'input_ids': tensor([[50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,  1026,
           318,  1593,   284,   307,  2074,   378,   290,  

## Preprocessing

Below, we compute all activation deltas. These will be useful later for clustering, learning virtual features, and so on.

In short:

1. Define the layer of interest (global variable)
2. Get the layer output from running each prompt through the model
3. Compute the activation delta between each pair in the prompts (positive - negative)
4. Get the average activation delta over all the pairs
5. Figure out which token indices are relevant for features by sorting in descending order of absolute delta. Intuitively, a delta of 0 implies unimportance, and the starting tokens tend to have near-zero deltas.
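
Steps 3 to 5 can be sketched without any model, using plain nested lists in place of tensors. This is an illustrative assumption-laden toy: `rank_token_positions` is a hypothetical helper, the activations are made-up numbers, and the score (absolute value of the mean over the hidden dimension) is chosen to mirror what the cells below compute.

```python
def rank_token_positions(activations):
    """
    activations: list of [seq_len][hidden] matrices, interleaved as
    positive, negative, positive, negative, ...
    Returns (token positions sorted by score, descending; per-position scores).
    """
    if len(activations) % 2 != 0:
        raise ValueError("expected positive/negative pairs")
    n_pairs = len(activations) // 2
    seq_len = len(activations[0])
    hidden = len(activations[0][0])

    # Steps 3-4: (positive - negative) for each pair, averaged over pairs
    mean_delta = [[0.0] * hidden for _ in range(seq_len)]
    for p in range(n_pairs):
        pos, neg = activations[2 * p], activations[2 * p + 1]
        for t in range(seq_len):
            for h in range(hidden):
                mean_delta[t][h] += (pos[t][h] - neg[t][h]) / n_pairs

    # Step 5: one score per token position, then sort descending
    scores = [abs(sum(row) / hidden) for row in mean_delta]
    ranked = sorted(range(seq_len), key=lambda t: scores[t], reverse=True)
    return ranked, scores

# Two toy pairs, 3 token positions, hidden size 2 (made-up values)
toy_activations = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],  # pair 0, positive
    [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]],  # pair 0, negative
    [[0.0, 0.0], [0.0, 1.0], [4.0, 2.0]],  # pair 1, positive
    [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],  # pair 1, negative
]
ranked, scores = rank_token_positions(toy_activations)
print(ranked, scores)
```

Position 0 is identical in every pair, so its score is 0 and it ranks last, which matches the intuition in step 5.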

In [None]:
print(model)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
  (generator): WrapperModule()
)


In [None]:
import math

# Initialize a list to store all activations
all_activations = []

for i in range(len(tokenized_prompts)):
  with model.trace(tokenized_prompts[i]['input_ids']):
    if i % 2 == 0: # if index is even, positive prompt
      prompt_type = 'positive'
    else:
      prompt_type = 'negative'
    # There should be a better way of doing this;
    # the module path below is hardcoded per model

    # For gemma
    #output = model.model.layers[LAYER].output.save()

    # For GPT2
    output = model.transformer.h[LAYER].ln_2.output.save()
    pair_num = math.floor(i / 2) + 1

  # print(f"{prompt_type} in {pair_num} prompt:  {output}")
  # print(f"shape of output: {output.shape}")

  # Store the activation
  all_activations.append(output.value[0])

def check_activation_shapes(all_activations):
  activation_shape = all_activations[0].shape
  for activation in all_activations[1:]:
    if activation.shape != activation_shape:
      return False
  return True

# Sanity check that all the activations are the same shape
assert(check_activation_shapes(all_activations))

# Make a (shallow) copy of the list to hold the unaveraged activations,
# as we will do operations on the other list later
unavg = list(all_activations)
print(f"Number of activations stored: {len(unavg)}")
print(f"Activation Shape: {unavg[0].shape}")

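import torch

# Calculate per-token cosine similarity between prompts
cos = torch.nn.CosineSimilarity(dim=1, eps=1e-6)
# The activations are already tensors, so use detach() rather than torch.tensor(...),
# which warns when copy-constructing from an existing tensor
similarity_0_1 = cos(all_activations[0].detach(), all_activations[1].detach())
similarity_0_2 = cos(all_activations[0].detach(), all_activations[2].detach())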

print("Cosine similarity between all_activations 0 and 1:", similarity_0_1.mean().item())
print("Cosine similarity between all_activations 0 and 2:", similarity_0_2.mean().item())

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Number of activations stored: 20
Activation Shape: torch.Size([30, 768])
Cosine similarity between all_activations 0 and 1: 0.7898362278938293
Cosine similarity between all_activations 0 and 2: 0.795717179775238




In [None]:
# import time
# t = time.time()
# even_idx = einops.rearrange(torch.stack((torch.zeros(test_deltas.shape[0]),torch.ones(test_deltas.shape[0]))), 'h n -> (n h)').to(torch.bool)
# test_deltas = test_tensor[even_idx] - test_tensor[~even_idx]
# print(time.time() - t)
# test_deltas.shape
# actually 100 ms faster to do it in a loop :)

In [None]:
test_tensor = torch.stack(unavg)
test_deltas = torch.zeros(int(test_tensor.shape[0]/2), test_tensor.shape[1], test_tensor.shape[2])
for i in range(test_deltas.shape[0]):
  # positive (even index) minus negative (odd index), matching activation_deltas below
  test_deltas[i] = test_tensor[2*i] - test_tensor[2*i + 1]

test_deltas.shape

torch.Size([10, 30, 768])

In [None]:
# compute activation deltas between each pair (total of 10 pairs in this sample)
activation_deltas = [unavg[i] - unavg[i + 1] for i in range(0, len(unavg), 2)]
print(len(activation_deltas))

# inspect the activation delta for a single pair
# TODO: average over all pairs instead of looking at just one prompt number
prompt_num = 8

# if DEBUG:
#   torch.set_printoptions(profile="full")

print(activation_deltas[prompt_num])

if DEBUG:
  torch.set_printoptions(profile="default")

# mean over the hidden dimension gives one value per token position
diff_act_mean = activation_deltas[prompt_num].mean(dim = 1)
abs_diff_act_mean = torch.abs(diff_act_mean)
print("absolute difference in activation before sort", abs_diff_act_mean)


10
tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00, 

In [None]:
# sort the token positions according to abs activation delta in descending order
sorted_indices =  torch.argsort(abs_diff_act_mean, descending = True)
print(f"descending sort indices: {sorted_indices}")

# Alternative: filter out near-zero positions in O(n) time

# filtered_indices = [idx.item() for idx in sorted_indices if abs_diff_act_mean[idx] > 0.0001]

# Take the top TOP_N_INDICES positions
filtered_indices = sorted_indices[:TOP_N_INDICES]
print(f"Filtered non-zero indices: {filtered_indices}")

descending sort indices: tensor([27, 29, 25, 26, 28, 15, 24, 23, 22, 21, 20, 19, 18, 17, 16,  0, 14, 13,
        12, 11, 10,  9,  8,  7,  6,  5,  4,  3,  2,  1])
Filtered non-zero indices: tensor([27, 29, 25, 26, 28])


In [None]:
# sanity check: encode and decode the prompt using gpt2 tokenizer
text = prompts[prompt_num]['original_prompt']
encoded_text_pos = tokenizer.encode(text)
decoded_text_pos = tokenizer.decode(encoded_text_pos)
print(f"Original Positive Text: {text}")
print(f"Encoded text: {encoded_text_pos}")
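# NOTE: split() yields whitespace-separated words, not tokens, so the indices
# printed here can differ from token indices (e.g. "kindness," is two tokens)
decoded_text_list = decoded_text_pos.split()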
for i, token in enumerate(decoded_text_list):
    print(f"Token: {i}: {token}")

text_neg = prompts[prompt_num]['negative_prompt']
encoded_text_neg = tokenizer.encode(text_neg)
decoded_text_neg = tokenizer.decode(encoded_text_neg)
print(f"Original Negative Text: {text_neg}")
print(f"Encoded text: {encoded_text_neg}")
decoded_text_list = decoded_text_neg.split()
for i, token in enumerate(decoded_text_list):
    print(f"Token: {i}: {token}")

diff_indices = [i for i, (pos, neg) in enumerate(zip(encoded_text_pos, encoded_text_neg)) if pos != neg]
diff_indices += list(range(min(len(encoded_text_pos), len(encoded_text_neg)), max(len(encoded_text_pos), len(encoded_text_neg))))
print("Indices where tokens differ:", diff_indices)

Original Positive Text: Everyone should be treated with kindness, dignity and respect
Encoded text: [16190, 815, 307, 5716, 351, 23887, 11, 16247, 290, 2461]
Token: 0: Everyone
Token: 1: should
Token: 2: be
Token: 3: treated
Token: 4: with
Token: 5: kindness,
Token: 6: dignity
Token: 7: and
Token: 8: respect
Original Negative Text: Everyone should be treated with efficiency, productivity and speed
Encoded text: [16190, 815, 307, 5716, 351, 9332, 11, 13714, 290, 2866]
Token: 0: Everyone
Token: 1: should
Token: 2: be
Token: 3: treated
Token: 4: with
Token: 5: efficiency,
Token: 6: productivity
Token: 7: and
Token: 8: speed
Indices where tokens differ: [5, 7, 9]


For prompt 8, although only token positions 5, 7, and 9 are changed by the intervention, the activation deltas also affect positions 6 and 8, most likely because of the intervention on the preceding token. (The top-ranked padded positions 25 to 29 correspond to positions 5 to 9 once the 20 left-padding tokens are removed.) Let's visualize the attention patterns at these positions using TransformerLens.
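
Comparing the ranked delta indices with the differing-token indices requires converting padded positions to unpadded ones. A minimal sketch, assuming left padding to a length of 30 and a 10-token prompt as in the outputs above; `to_unpadded_positions` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
def to_unpadded_positions(padded_positions, padded_length, n_tokens):
    """Map positions in a left-padded sequence back to positions in the raw prompt."""
    offset = padded_length - n_tokens  # number of pad tokens on the left
    # Positions below the offset point at padding and are dropped
    return [p - offset for p in padded_positions if p >= offset]

top_padded = [27, 29, 25, 26, 28]  # top-5 indices from the descending sort
top_unpadded = to_unpadded_positions(top_padded, padded_length=30, n_tokens=10)
print(top_unpadded)  # [7, 9, 5, 6, 8]
```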

In [None]:
# Load the HookedTransformer model
from transformer_lens import HookedTransformer
model = HookedTransformer.from_pretrained(MODEL_NAME, tokenizer=tokenizer)



Loaded pretrained model gpt2 into HookedTransformer


In [None]:
import torch
from transformer_lens import utils
import circuitsvis as cv

VISUALIZE_PADDED = False

# Define your input text
pos_text = prompts[prompt_num]['original_prompt']
neg_text = prompts[prompt_num]['negative_prompt']

print(pos_text)
print(neg_text)

# Tokenize the input
pos_tokens = model.to_tokens(pos_text)
neg_tokens = model.to_tokens(neg_text)

if VISUALIZE_PADDED:
  # Tokenized and padded input from our tokenizer.
  # Pair k is stored at indices 2*k (positive) and 2*k + 1 (negative).
  pos_tokens = tokenized_prompts[2 * prompt_num]['input_ids']
  neg_tokens = tokenized_prompts[2 * prompt_num + 1]['input_ids']

# Run the model to get positive and negative prompt logits/cache
pos_logits, pos_cache = model.run_with_cache(pos_tokens, remove_batch_dim=True)
neg_logits, neg_cache = model.run_with_cache(neg_tokens, remove_batch_dim=True)

Everyone should be treated with kindness, dignity and respect
Everyone should be treated with efficiency, productivity and speed


In [None]:
print(type(pos_cache))
pos_attention_pattern = pos_cache["pattern", LAYER, "attn"]
print(pos_attention_pattern.shape)
pos_str_tokens = model.to_str_tokens(pos_text)

print(f"Layer {LAYER} Head Attention Patterns for POSITIVE:")
cv.attention.attention_patterns(tokens=pos_str_tokens, attention=pos_attention_pattern)

<class 'transformer_lens.ActivationCache.ActivationCache'>
torch.Size([12, 11, 11])
Layer 7 Head Attention Patterns for POSITIVE:


In [None]:
print(type(neg_cache))
neg_attention_pattern = neg_cache["pattern", LAYER, "attn"]
print(neg_attention_pattern.shape)
neg_str_tokens = model.to_str_tokens(neg_text)

print(f"Layer {LAYER} Head Attention Patterns for NEGATIVE:")
cv.attention.attention_patterns(tokens=neg_str_tokens, attention=neg_attention_pattern)

<class 'transformer_lens.ActivationCache.ActivationCache'>
torch.Size([12, 11, 11])
Layer 7 Head Attention Patterns for NEGATIVE:


In [None]:
delta_attention_pattern =  pos_attention_pattern - neg_attention_pattern

print(f"Layer {LAYER} Head Attention Patterns for POS- NEG:")
cv.attention.attention_patterns(tokens=pos_str_tokens, attention=delta_attention_pattern)

Layer 7 Head Attention Patterns for POS- NEG:


In [None]:
delta_attention_pattern = neg_attention_pattern - pos_attention_pattern

print(f"Layer {LAYER} Head Attention Patterns for NEG - POS:")
cv.attention.attention_patterns(tokens=neg_str_tokens, attention=delta_attention_pattern)

Layer 7 Head Attention Patterns for NEG - POS:


# Finding Features Using Tokens


In [None]:
import os
import sys
sys.path.append(DRIVE_PATH)
os.chdir(DRIVE_PATH)

from transformer_lens import HookedTransformer
from sae_lens import SAE
from sae_lens.toolkit.pretrained_saes import get_gpt2_res_jb_saes

model.tokenizer.pad_token = model.tokenizer.eos_token
model.tokenizer.pad_length = MAX_PADDED_LENGTH

device = 'cpu'

# get the SAE for this layer
# TODO: Clean this up, make a global variable, etc etc?
sae, cfg_dict, _ = SAE.from_pretrained(
    release = "gpt2-small-res-jb",
    sae_id = f"blocks.{LAYER}.hook_resid_pre",
    device = device
)

# get hook point
hook_point = sae.cfg.hook_name
print(hook_point)

blocks.7.hook_resid_pre/cfg.json:   0%|          | 0.00/1.27k [00:00<?, ?B/s]

sae_weights.safetensors:   0%|          | 0.00/151M [00:00<?, ?B/s]

sparsity.safetensors:   0%|          | 0.00/98.4k [00:00<?, ?B/s]

blocks.7.hook_resid_pre


In [None]:
import numpy as np

#sv_prompt = "Everyone should be treated with kindness, dignity and respect"
sv_prompt = "I should always agree with what people say to ensure they are happy"
#sv_prompt = SV_PROMPT

sv_tokens = model.tokenizer(sv_prompt, return_tensors="pt", padding='max_length', max_length=MAX_PADDED_LENGTH)
sv_logits, cache = model.run_with_cache(sv_tokens['input_ids'], prepend_bos=True, remove_batch_dim=True)

# if DEBUG:
#   print("tokens", tokens)
#   print("logits", sv_logits)
#   print("cache", cache)

# feature activations from our SAE
sv_feature_acts = sae.encode(cache[hook_point].to(device))

# top k activations
topk = torch.topk(sv_feature_acts, 3)

# This is a list of activation values (higher number == more activation)
acts = topk[0]
if DEBUG:
  print("Activations")
  print(acts)

#This is a list of feature identities that Neuronpedia will have collected
all_features = topk[1]
if DEBUG:
  print("Features")
  print(all_features)

all_feats = []
for feat in all_features:
    all_feats.append(feat.tolist())

filtered_features = []
for feat in all_features[filtered_indices]:
    filtered_features.append(feat.tolist())

# Convert the nested list to a NumPy array and python set
flat_feat_list = np.array(all_feats).flatten().tolist()
flat_filtered_feat_list = np.array(filtered_features).flatten().tolist()

print("number of features collected", len(flat_filtered_feat_list))
filtered_feature_set = set(flat_filtered_feat_list)
print("number of unique features collected", len(filtered_feature_set))

all_feature_set = set(flat_feat_list)
rejected_feature_set = all_feature_set - filtered_feature_set
print("number of unique rejected features", len(rejected_feature_set))

Activations
tensor([[572.0894, 468.1614, 410.2104],
        [ 26.8240,   7.2037,   5.1481],
        [ 35.6604,  12.8449,  11.7905],
        [ 38.7989,  14.8764,  11.4695],
        [ 42.6320,   4.9804,   3.6072],
        [ 29.4293,  10.8221,   8.4868],
        [ 19.9195,  10.9681,   9.4188],
        [ 26.7382,   9.7124,   9.4801],
        [ 15.9483,   7.0759,   6.0080],
        [ 39.8989,   9.8903,   3.2071],
        [ 13.4684,  11.0513,   7.7405],
        [ 13.6848,  11.8612,   7.9360],
        [ 40.6140,   6.0561,   5.5021],
        [ 48.4253,  31.9323,   4.0783],
        [ 44.0508,  36.6879,   4.6054],
        [ 39.7118,  39.2689,   4.1415],
        [ 40.9125,  35.8347,   3.6565],
        [ 41.6674,  32.4698,   2.9804],
        [ 42.2012,  29.5845,   2.3625],
        [ 42.6202,  27.0660,   1.8256],
        [ 42.8473,  24.9011,   1.4600],
        [ 42.9267,  22.9042,   1.3883],
        [ 42.7328,  21.0891,   1.5737],
        [ 42.8841,  19.4888,   1.6684],
        [ 42.6826,  17.9171,

In [None]:
from sae_lens.analysis.neuronpedia_integration import get_neuronpedia_quick_list

print(filtered_feature_set)

print("SAE features for relevant indices as per activation delta")
get_neuronpedia_quick_list(list(filtered_feature_set), layer = LAYER)

{20159, 4679, 14519}
SAE features for relevant indices as per activation delta


'https://neuronpedia.org/quick-list/?name=temporary_list&features=%5B%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%227-res-jb%22%2C%20%22index%22%3A%20%2220159%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%227-res-jb%22%2C%20%22index%22%3A%20%224679%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%227-res-jb%22%2C%20%22index%22%3A%20%2214519%22%7D%5D'

In [None]:
from sae_lens.analysis.neuronpedia_integration import get_neuronpedia_quick_list

print(rejected_feature_set)

print("SAE features rejected as per activation delta")
get_neuronpedia_quick_list(list(rejected_feature_set), layer = LAYER)

{1798, 15880, 13448, 14863, 1173, 8598, 662, 11414, 6549, 6436, 7717, 11433, 9131, 12205, 18228, 22326, 20663, 14908, 18749, 9663, 4548, 4037, 15176, 3785, 9421, 9166, 12883, 4693, 19671, 2392, 15321, 8668, 19811, 12003, 5485, 8690, 22393, 23291}
SAE features rejected as per activation delta


'https://neuronpedia.org/quick-list/?name=temporary_list&features=%5B%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%227-res-jb%22%2C%20%22index%22%3A%20%221798%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%227-res-jb%22%2C%20%22index%22%3A%20%2215880%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%227-res-jb%22%2C%20%22index%22%3A%20%2213448%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%227-res-jb%22%2C%20%22index%22%3A%20%2214863%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%227-res-jb%22%2C%20%22index%22%3A%20%221173%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%227-res-jb%22%2C%20%22index%22%3A%20%228598%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%227-res-jb%22%2C%20%22index%22%3A%20%22662%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%227-res-jb%22%2C%20%22index%22%3A%20%2211414%22%

# Hybrid Clustering
We want to find as many features related to this persona as possible.

# find activation delta
The activation delta has shape `[seq_length, activation]`.  
Sort the activation delta in descending order.

# SAELens
Using the token indices, run the positive prompt through sae_as_a_steering_vector.ipynb and grab the feature indices. The token positions help (although not deterministically) to find the relevant features (3-5 is enough).
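
As a rough, dependency-free illustration of this step: take the top-k features at every token position, keep those found at the selected positions, and treat the rest as rejected. The helper `top_features_at_positions` and the toy activation matrix are assumptions made for this sketch; in the notebook the activations come from the SAE encoder.

```python
def top_features_at_positions(feature_acts, positions, k=3):
    """
    feature_acts: [seq_len][n_features] activation values.
    Returns (features in the top-k at the chosen positions,
             features in some top-k elsewhere but not at the chosen positions).
    """
    def topk_ids(row):
        return sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]

    selected, seen = set(), set()
    for t, row in enumerate(feature_acts):
        ids = topk_ids(row)
        seen.update(ids)
        if t in positions:
            selected.update(ids)
    return selected, seen - selected

# 3 token positions x 5 features (made-up activations)
toy_acts = [
    [5.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 2.0, 0.0],
    [0.0, 0.0, 4.0, 0.0, 1.0],
]
selected, rejected = top_features_at_positions(toy_acts, positions={1, 2}, k=2)
print(selected, rejected)
```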

# woog's clustering
Run all features through woog's clustering algorithm (quite reliable for global similarity), and output the Neuronpedia labels as a JSON file. Manually eliminate any spurious features.
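
The cluster lookup amounts to finding a feature's leaf in a binary tree, walking a few levels back up, and collecting every leaf under that ancestor. The sketch below uses a tiny hand-built tree; the `Node` class is a stand-in assumption that only mirrors the `id`, `left`, `right`, and `is_leaf()` interface the helper methods in this notebook rely on, and `cluster_around` is a hypothetical name.

```python
class Node:
    """Minimal stand-in for a hierarchical-clustering tree node."""
    def __init__(self, id, left=None, right=None):
        self.id, self.left, self.right = id, left, right

    def is_leaf(self):
        return self.left is None and self.right is None

def leaves(node):
    """All leaf ids under a node, left to right."""
    if node.is_leaf():
        return [node.id]
    return leaves(node.left) + leaves(node.right)

def path_to_leaf(node, leaf_id, path=""):
    """String of 'L'/'R' steps from node down to the leaf, or None if absent."""
    if node.is_leaf():
        return path if node.id == leaf_id else None
    left = path_to_leaf(node.left, leaf_id, path + "L")
    if left is not None:
        return left
    return path_to_leaf(node.right, leaf_id, path + "R")

def cluster_around(root, leaf_id, height):
    """Leaf ids under the ancestor `height` levels above the given leaf."""
    path = path_to_leaf(root, leaf_id)
    if path is None:
        raise KeyError(leaf_id)
    node = root
    for step in path[: max(len(path) - height, 0)]:
        node = node.left if step == "L" else node.right
    return leaves(node)

# Leaves 0..3 are features; ids 4..6 are merge nodes (6 is the root)
tree = Node(6, Node(5, Node(4, Node(0), Node(1)), Node(2)), Node(3))
print(cluster_around(tree, leaf_id=0, height=1))  # [0, 1]
print(cluster_around(tree, leaf_id=0, height=2))  # [0, 1, 2]
```

A larger `height` gives a broader, looser cluster, so it acts as the knob for how many candidate features need manual review.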

# run (weighted) linear regression
Once we have the full set of relevant SAE features, run a regression with the SAE features as input and the activation dim / activation as target. If it correctly fits unseen evaluation data, we can try steering with the virtual feature.
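
To illustrate the fit-then-check-on-held-out-data idea, here is a minimal sketch of weighted least squares with a single input feature, using the closed-form solution. It is a simplification under stated assumptions: the real setting would have many SAE features as inputs, and `weighted_linear_fit`, the weights, and the data are all made up for this example.

```python
def weighted_linear_fit(xs, ys, weights):
    """Closed-form weighted least squares for y ~ slope * x + intercept."""
    w_sum = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, xs)) / w_sum
    y_bar = sum(w * y for w, y in zip(weights, ys)) / w_sum
    cov = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(weights, xs, ys))
    var = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    slope = cov / var
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Training data: one feature activation -> one activation dimension (toy values)
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 3.0, 5.0, 7.0]
train_w = [1.0, 2.0, 1.0, 2.0]  # e.g. larger weights for more trusted samples
slope, intercept = weighted_linear_fit(train_x, train_y, train_w)

# Held-out check: compare the prediction against an unseen point
eval_x, eval_y = 4.0, 9.0
eval_error = abs((slope * eval_x + intercept) - eval_y)
print(slope, intercept, eval_error)
```

If the held-out error stays small, the fitted direction is a candidate virtual feature to try steering with.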





In [None]:
residual_outputs = []

model = LanguageModel(MODEL_NAME, low_cpu_mem_usage=False, token=hf_token)

def hook_fn(module, input, output):
    residual_outputs.append(output)

# Register hooks for each GPT2Block
for block in model.transformer.h:
    block.register_forward_hook(hook_fn)




In [None]:
import torch.nn.functional as F

def flatten_and_cosine_sim(tensor1, tensor2):
    """
    Row-wise cosine similarity between two tensors of the same shape.
    For [seq_len, hidden] activations this returns one value per token position.
    """
    # Keep the first dimension and flatten any remaining ones
    flattened1 = tensor1.view(tensor1.size(0), -1)
    flattened2 = tensor2.view(tensor2.size(0), -1)

    if flattened1.shape != flattened2.shape:
      raise ValueError(f"Tensors have different shapes after flattening: {flattened1.shape} vs {flattened2.shape}")

    # F.cosine_similarity (dim=1 by default) normalizes internally,
    # so a separate F.normalize call is not needed
    cosine_sim = F.cosine_similarity(flattened1, flattened2)

    return cosine_sim

In [None]:
pos_n_neg_cos = flatten_and_cosine_sim(all_activations[0], all_activations[1])
cos2 = flatten_and_cosine_sim(all_activations[2], all_activations[3])
cos3 = flatten_and_cosine_sim(all_activations[4], all_activations[5])
cos4 = flatten_and_cosine_sim(all_activations[6], all_activations[7])
cos5 = flatten_and_cosine_sim(all_activations[8], all_activations[9])

print(pos_n_neg_cos.mean())
print(cos2.mean())
print(cos3.mean())
print(cos4.mean())
print(cos5.mean())

tensor(0.7898, grad_fn=<MeanBackward0>)
tensor(0.9810, grad_fn=<MeanBackward0>)
tensor(0.7658, grad_fn=<MeanBackward0>)
tensor(0.9764, grad_fn=<MeanBackward0>)
tensor(0.8451, grad_fn=<MeanBackward0>)


In [None]:
# load your data in here
decoders = torch.rand([8, 1024, 256]) # e.g. 8 layers, 1024 feats in 256-dim space

linkages = {}
roots = {}
for setting in ['average', 'complete', 'weighted']:
    linkage_list = []
    root_list = []
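    for layer in range(decoders.shape[0]):
        # scipy expects an array-like of observations; convert the tensor explicitly
        linkage = hierarchy.linkage(decoders[layer].numpy(), method = setting, metric = 'cosine')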
        root_list.append(hierarchy.to_tree(linkage))
        linkage_list.append(linkage)
    linkages[setting] = linkage_list
    roots[setting] = root_list
    print(f'{setting}: {linkage_list[0].shape} for each of {len(linkage_list)} layers')

with open('your_linkages.pkl', 'wb') as f:
    pickle.dump(linkages, f)

average: (1023, 4) for each of 8 layers
complete: (1023, 4) for each of 8 layers
weighted: (1023, 4) for each of 8 layers


In [None]:
#@title to download precomputed indices over GPT2-small residual stream SAEs

#!pip install gdown
filepath = 'https://drive.google.com/u/0/uc?id=1RXoS3woiEU1aX_waL8q1Dr5xiOu4NIht'
destpath = 'linkages.pkl'
!gdown {filepath} -O {destpath}

import pickle
from scipy.cluster import hierarchy

with open('linkages.pkl', 'rb') as f:
    linkages = pickle.load(f)

roots = {}
for key, value in linkages.items():
    if key == 'single': # doesn't work: makes long strands, hits recursion limit
        continue
    root_list = []
    for layer in range(len(value)):
        root_list.append(hierarchy.to_tree(value[layer], rd=False))
    roots[key] = root_list
    print(f'{key}: {value[0].shape} for each of {len(value)} layers')

Downloading...
From: https://drive.google.com/u/0/uc?id=1RXoS3woiEU1aX_waL8q1Dr5xiOu4NIht
To: /content/drive/.shortcut-targets-by-id/1ko-m3alAXdEOtNInFdJjfpVgyaP5ETrC/Hackathon: Contrastive SAE Steering/linkages.pkl
100% 37.7M/37.7M [00:00<00:00, 128MB/s]
average: (24575, 4) for each of 12 layers
complete: (24575, 4) for each of 12 layers
weighted: (24575, 4) for each of 12 layers


In [None]:
#@title Helper methods
import json
import urllib.parse

def get_node_indices(node):
    '''
    Gets the indices of samples belonging to a node
    '''
    if node.is_leaf():
        return [node.id]
    else:
        left_indices = get_node_indices(node.left)
        right_indices = get_node_indices(node.right)
        return left_indices + right_indices

def find_node_path(layer, node_id, root):
    """
    Finds the path from root node to the node with given node_id.
    Returns a string of 'L'/'R' choices to traverse the path,
    or None if the node is not in the tree.
    """
    def traverse(node, path=''):
        if node is None:
            return None
        if node.id == node_id:
            return path
        # Search the left subtree first; only descend right if the left search fails
        left_path = traverse(node.left, path + 'L')
        if left_path is not None:
            return left_path
        return traverse(node.right, path + 'R')

    return traverse(root)

def get_cluster_by_path(path, root):
    """
    Navigates the hierarchical clustering tree from the root node
    based on the given string of 'L' and 'R' choices.
    Returns the cluster node reached after following the path.
    """
    node = root
    for direction in path:
        if direction == 'L':
            node = node.left
        elif direction == 'R':
            node = node.right
        else:
            raise ValueError("Invalid direction: {}".format(direction))
    return node

# NOTE: this redefinition shadows the get_neuronpedia_quick_list imported from sae_lens earlier
def get_neuronpedia_quick_list(
    features: list[int],
    layer: int,
    model: str = "gpt2-small",
    dataset: str = "res-jb",
    name: str = "temporary_list",
    setting: str = "average",
):
    url = "https://neuronpedia.org/quick-list/"
    name = urllib.parse.quote(name)
    url = url + "?name=" + name
    list_feature = [
        {"modelId": model, "layer": f"{layer}-{dataset}", "index": str(feature)}
        for feature in features
    ]
    url = url + "&features=" + urllib.parse.quote(json.dumps(list_feature))
    print(url)
    return url

def build_cluster(layer, feature_id, height=3, setting = 'average', verbose=True):
    layer = int(layer)
    root = roots[setting][layer]
    node_path = find_node_path(layer, feature_id, root)
    if node_path is None:
        raise ValueError(f'feature {feature_id} not found in the layer {layer} tree')
    # Walk `height` steps up from the leaf. Slicing by length avoids
    # node_path[:-0] == '' (the root) when height is 0.
    cluster_path = node_path[:max(len(node_path) - height, 0)]
    cluster = get_cluster_by_path(cluster_path, root=root)
    indices = get_node_indices(cluster)
    # The quick list is named after the cluster path; the URL is printed by the helper.
    get_neuronpedia_quick_list(indices, layer, name=cluster_path)
    if verbose:
        print(f'path to node: {node_path}')
        print(f'path to cluster: {cluster_path}')
        print(f'features in cluster: {indices}')
    return indices
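
To make the path logic concrete, here is a minimal sketch on a toy tree, assuming SciPy's `hierarchy.linkage` and `hierarchy.to_tree` as used above. The five data points and the helper names `path_to_leaf` and `follow` are hypothetical and only mirror `find_node_path` and `get_cluster_by_path`; they are not the notebook's own helpers.

```python
import numpy as np
from scipy.cluster import hierarchy

# Toy data (hypothetical): two tight pairs plus one far-away point.
points = np.array([[0.0], [0.1], [5.0], [5.2], [20.0]])
linkage = hierarchy.linkage(points, method="average")
root = hierarchy.to_tree(linkage, rd=False)

def path_to_leaf(node, leaf_id, path=""):
    """Return the 'L'/'R' string leading from `node` to `leaf_id`, or None."""
    if node is None:
        return None
    if node.id == leaf_id:
        return path
    found = path_to_leaf(node.left, leaf_id, path + "L")
    if found is not None:
        return found
    return path_to_leaf(node.right, leaf_id, path + "R")

def follow(node, path):
    """Follow a string of 'L'/'R' choices down from `node`."""
    for step in path:
        node = node.left if step == "L" else node.right
    return node

path = path_to_leaf(root, 2)
ancestor = follow(root, path[:-1])          # one level above leaf 2
neighbours = sorted(ancestor.pre_order())   # leaf ids under that ancestor
print(path, neighbours)
```

Dropping more characters from the end of the path moves further up the tree, which is what the `height` argument of `build_cluster` controls.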

In [None]:
# # helper methods related to the contrastive pair
# import random
# import json

# def create_contrastive_pairs(dataset, num_pairs=100):
#     contrastive_pairs = []

#     for data in dataset:
#         question = data["question"]
#         statement = data["statement"]
#         answer_matching = data["answer_matching_behavior"].strip()
#         answer_not_matching = data["answer_not_matching_behavior"].strip()

#         # Create positive example
#         positive_prompt = f"{question}\n{statement}\nA) {answer_matching}\nB) {answer_not_matching}\nAnswer:"
#         positive_completion = " A"

#         # Create negative example
#         negative_prompt = f"{question}\n{statement}\nA) {answer_matching}\nB) {answer_not_matching}\nAnswer:"
#         negative_completion = " B"

#         contrastive_pairs.append({
#             "positive_prompt": positive_prompt,
#             "positive_completion": positive_completion,
#             "negative_prompt": negative_prompt,
#             "negative_completion": negative_completion
#         })

#     # Ensure we have at least the requested number of pairs
#     while len(contrastive_pairs) < num_pairs:
#         contrastive_pairs.extend(contrastive_pairs)

#     # Randomly select the requested number of pairs
#     return random.sample(contrastive_pairs, num_pairs)


In [None]:
# # Specify dataset path
# dataset_path = ''

# with open(dataset_path, 'r') as f:
#     dataset = [json.loads(line) for line in f]

# # Create contrastive pairs
# contrastive_pairs = create_contrastive_pairs(dataset)

# # Print a sample pair
# print("Sample contrastive pair:")
# print("Positive prompt:", contrastive_pairs[0]["positive_prompt"])
# print("Positive completion:", contrastive_pairs[0]["positive_completion"])
# print("\nNegative prompt:", contrastive_pairs[0]["negative_prompt"])
# print("Negative completion:", contrastive_pairs[0]["negative_completion"])

In [None]:
# #@title #Usage
# #@markdown Pick any layer, any feature from Joseph Bloom's GPT-2-small SAEs on the residual stream. Valid feature ids are between 0 and 24575.

# #@markdown `build_cluster` will return a list of features related to it, and a neuronpedia link to visualize all of them.

# #@markdown If you use your own linkages for a different model, the features will still be related but the neuronpedia data won't be valid!

# #@markdown The `height` parameter controls how large the cluster is, by including more distant features.

# #@markdown If `height` is 6 or more, the URL might be too long to function.

# layer = "9" #@param [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
# feature_id = 2345 #@param {type: "integer"}
# height = 2 #@param {type: "slider", min:1, max:8}
# setting = 'average' #@param ['average', 'complete', 'weighted']
# indices = build_cluster(
#     layer=layer,
#     feature_id=feature_id,
#     height=height,
#     setting=setting,
# )

In [None]:
#@title Run feature clustering for all filtered features

clustered_features = set()

layer = LAYER
height = 2
setting = 'average'

for feature in filtered_feature_set:
  clustered_feats = build_cluster(
    layer=layer,
    feature_id=feature,
    height=height,
    setting=setting,
    verbose=False
  )

  clustered_features.update(clustered_feats)

https://neuronpedia.org/quick-list/?name=RRRRLRRRLLRR&features=%5B%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%227-res-jb%22%2C%20%22index%22%3A%20%223642%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%227-res-jb%22%2C%20%22index%22%3A%20%229810%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%227-res-jb%22%2C%20%22index%22%3A%20%2220159%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%227-res-jb%22%2C%20%22index%22%3A%20%2213853%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%227-res-jb%22%2C%20%22index%22%3A%20%2213126%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%227-res-jb%22%2C%20%22index%22%3A%20%2217885%22%7D%5D
https://neuronpedia.org/quick-list/?name=RRRRRRRRRRRRLLLLRRRRRLRRRRRRRRRRRLLRR&features=%5B%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%227-res-jb%22%2C%20%22index%22%3A%20%224679%22%7D%2C%20%7B%22modelId%22%3A%20

In [None]:
# @title Print clustered features and stats
print("Number of clustered features: ", len(clustered_features))
print(clustered_features)

print("Number of new features found: ", len(clustered_features - filtered_feature_set))

get_neuronpedia_quick_list(list(clustered_features), layer = LAYER)

Number of clustered features:  18
{17885, 21858, 23767, 12003, 22789, 13126, 4679, 9663, 11433, 14968, 9810, 8598, 14519, 2360, 3642, 3259, 13853, 20159}
Number of new features found:  15
https://neuronpedia.org/quick-list/?name=temporary_list&features=%5B%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%227-res-jb%22%2C%20%22index%22%3A%20%2217885%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%227-res-jb%22%2C%20%22index%22%3A%20%2221858%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%227-res-jb%22%2C%20%22index%22%3A%20%2223767%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%227-res-jb%22%2C%20%22index%22%3A%20%2212003%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%227-res-jb%22%2C%20%22index%22%3A%20%2222789%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%227-res-jb%22%2C%20%22index%22%3A%20%2213126%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%

'https://neuronpedia.org/quick-list/?name=temporary_list&features=%5B%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%227-res-jb%22%2C%20%22index%22%3A%20%2217885%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%227-res-jb%22%2C%20%22index%22%3A%20%2221858%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%227-res-jb%22%2C%20%22index%22%3A%20%2223767%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%227-res-jb%22%2C%20%22index%22%3A%20%2212003%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%227-res-jb%22%2C%20%22index%22%3A%20%2222789%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%227-res-jb%22%2C%20%22index%22%3A%20%2213126%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%227-res-jb%22%2C%20%22index%22%3A%20%224679%22%7D%2C%20%7B%22modelId%22%3A%20%22gpt2-small%22%2C%20%22layer%22%3A%20%227-res-jb%22%2C%20%22index%22%3A%20%229663%
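
Since the quick-list URL is only percent-encoded JSON, it can be decoded back into feature indices with the standard library. The sketch below builds a short URL in the same shape as `get_neuronpedia_quick_list` and parses it again; the two feature indices are taken from the output above, and nothing here calls Neuronpedia itself.

```python
import json
import urllib.parse

# Build a URL in the same shape as get_neuronpedia_quick_list produces.
features = [
    {"modelId": "gpt2-small", "layer": "7-res-jb", "index": str(i)}
    for i in (17885, 21858)
]
url = (
    "https://neuronpedia.org/quick-list/?name="
    + urllib.parse.quote("temporary_list")
    + "&features="
    + urllib.parse.quote(json.dumps(features))
)

# Parse it back: parse_qs undoes the percent-encoding of each query value.
params = urllib.parse.parse_qs(urllib.parse.urlparse(url).query)
decoded = json.loads(params["features"][0])
indices = [int(item["index"]) for item in decoded]
print(len(url), indices)
```

Printing `len(url)` is a cheap way to see how quickly the URL grows with the number of features, which matters because very long URLs may not load.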

In [None]:
sv_feature_acts.shape

torch.Size([30, 24576])

In [None]:
# Map each clustered feature id to its decoded single-feature activations
activation_dict = {}

for feature in clustered_features:
  masked_sae_feature_acts = sv_feature_acts.clone()
  mask = torch.zeros(sv_feature_acts.shape[1], dtype=torch.bool)
  mask[feature] = True
  masked_sae_feature_acts[:, ~mask] = 0


  # Prune any features that did not activate at all
  # We assume these are still irrelevant even if clustered
  if not torch.any(masked_sae_feature_acts):
    continue

  print(masked_sae_feature_acts[:, feature])
  # Decode the single-feature activations (these are the regressors for the linear regression)
  decoded_activations = sae.decode(masked_sae_feature_acts)
  activation_dict[feature] = decoded_activations

  print(decoded_activations)

tensor([2.5714e+02, 0.0000e+00, 4.4178e-01, 0.0000e+00, 1.8389e-01, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
       grad_fn=<SelectBackward0>)
tensor([[-0.0254, -0.2396, -0.4177,  ...,  0.1874, -0.1567, -0.1695],
        [ 0.3756,  0.1930, -0.1147,  ...,  0.5765,  0.1872,  0.2452],
        [ 0.3749,  0.1922, -0.1152,  ...,  0.5759,  0.1866,  0.2445],
        ...,
        [ 0.3756,  0.1930, -0.1147,  ...,  0.5765,  0.1872,  0.2452],
        [ 0.3756,  0.1930, -0.1147,  ...,  0.5765,  0.1872,  0.2452],
        [ 0.3756,  0.1930, -0.1147,  ...,  0.5765,  0.1872,  0.2452]],
       grad_fn=<AddBackward0>)
tensor([468.1614,   0.0000,   0.0000,   0.0000,   0.0000,   0.0000,   0.0000,
          0.0000,   0.
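
The masking step can be reproduced on toy tensors. This is a minimal sketch assuming a purely linear decoder, `acts @ W_dec + b_dec`; the names `W_dec`, `b_dec`, and `decode` are hypothetical stand-ins, and the real `sae.decode` may apply extra scaling. Under that assumption, every row where the selected feature is inactive decodes to the decoder bias alone, which may be why many rows of the printed output above look identical.

```python
import torch

torch.manual_seed(0)
n_tokens, n_features, d_model = 4, 6, 3

# Hypothetical stand-ins for the SAE decoder weights and bias.
W_dec = torch.randn(n_features, d_model)
b_dec = torch.randn(d_model)

def decode(feature_acts):
    # Assumed simplification: a purely linear decoder.
    return feature_acts @ W_dec + b_dec

feature_acts = torch.rand(n_tokens, n_features)
feature_acts[1:, 2] = 0.0   # feature 2 fires only on token 0

# Keep one feature column and zero out everything else.
feature = 2
masked = torch.zeros_like(feature_acts)
masked[:, feature] = feature_acts[:, feature]

decoded = decode(masked)
print(decoded)
```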

In [None]:
# print(len(activation_dict))
# print(activation_dict.keys())

# print(activation_dict[17803].shape)
# print(test_deltas.shape)

In [None]:
# torch.set_printoptions(profile="full")
# TestX = activation_dict[17803][5] - activation_dict[17803][5]
# TestX.mean()

In [None]:
# @title Linear Regression Function Definition

import torch
import einops
from typing import Dict

def fit_features_to_activation_delta(
    feature_activations: Dict[int, torch.Tensor],
    activation_delta: torch.Tensor
) -> torch.Tensor:
    """
    Fit a set of curated features to an activation delta using linear regression.

    Args:
    feature_activations (Dict[int, torch.Tensor]): Dictionary mapping feature indices to their activation tensors.
    activation_delta (torch.Tensor): Target activation delta to fit.

    Returns:
    torch.Tensor: Least-squares weights, one per feature in ascending feature-index order.
    """
    # Convert the dictionary to a list of tensors, preserving the order of indices
    feature_indices = sorted(feature_activations.keys())
    X = torch.stack([feature_activations[idx] for idx in feature_indices])

    # Reshape X to (num_features, -1)
    X = einops.rearrange(X, 'features ... -> features (...)')

    # Reshape y (activation_delta) to (-1,)
    y = einops.rearrange(activation_delta, '... -> (...)')

    # Solve the least-squares problem min_w ||X.T @ w - y||^2
    weights = torch.linalg.lstsq(X.T, y).solution

    return weights
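
As a sanity check of the regression, the sketch below builds a target from known weights and confirms that `torch.linalg.lstsq` recovers them. The shapes, feature ids, and weights are hypothetical; the flattening and sorted-key ordering follow the function above.

```python
import torch

torch.manual_seed(0)
n_tokens, d_model = 5, 8

# Hypothetical per-feature decoded activations: feature id -> (tokens, d_model).
feature_activations = {
    idx: torch.randn(n_tokens, d_model, dtype=torch.float64) for idx in (11, 4, 7)
}
true_weights = {4: 0.5, 7: -2.0, 11: 1.5}

# Target built as an exact linear combination of the feature activations.
target = sum(true_weights[idx] * acts for idx, acts in feature_activations.items())

order = sorted(feature_activations)   # [4, 7, 11]
X = torch.stack([feature_activations[idx].flatten() for idx in order])
y = target.flatten()

# Solve min_w ||X.T @ w - y||^2; the right-hand side is passed as a column.
weights = torch.linalg.lstsq(X.T, y.unsqueeze(-1)).solution.squeeze(-1)
print(dict(zip(order, weights.tolist())))
```

The recovered weights come back in sorted feature-id order, so anything that consumes them needs to stack its per-feature vectors in that same order.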

In [None]:
for i in range(len(test_deltas)):
  print(fit_features_to_activation_delta(activation_dict, test_deltas[i]))

tensor([ 2.6375e-03,  7.7709e-04, -6.4435e-03, -1.8237e-03,  3.8095e-03,
         4.5150e-03, -2.4751e-05, -6.7163e-04, -7.9639e-03,  2.2687e-04,
         4.4585e-03,  1.6777e-03,  2.2581e-04], grad_fn=<LinalgLstsqBackward0>)
tensor([ 6.0306e-05,  1.8243e-05, -1.7476e-04, -4.0555e-05,  8.4833e-05,
         1.0271e-04,  1.1771e-07, -1.4512e-05, -2.1427e-04,  5.8057e-06,
         1.0133e-04,  3.8610e-05,  5.7803e-06], grad_fn=<LinalgLstsqBackward0>)
tensor([ 4.8716e-03,  1.4322e-03, -1.1718e-02, -3.3762e-03,  7.0516e-03,
         8.3429e-03, -5.0292e-05, -1.2461e-03, -1.4495e-02,  4.1490e-04,
         8.2392e-03,  3.0971e-03,  4.1295e-04], grad_fn=<LinalgLstsqBackward0>)
tensor([ 5.6522e-05,  1.6550e-05, -1.3215e-04, -3.9330e-05,  8.2130e-05,
         9.6870e-05, -6.7844e-07, -1.4576e-05, -1.6370e-04,  4.7278e-06,
         9.5679e-05,  3.5898e-05,  4.7053e-06], grad_fn=<LinalgLstsqBackward0>)
tensor([ 1.3448e-03,  3.9642e-04, -3.2976e-03, -9.2934e-04,  1.9413e-03,
         2.3017e-03, -1

# Steering Experiments with Vectors

Now that we have a list of coefficients for weighting the steering vectors, we want to find the actual steering vectors, weight them, and sum them into a single vector representing our virtual feature.

7/6/24: Currently, this only uses a simple average (300 / number of features). Eventually, the coefficients should be created more dynamically.
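
The two ways of combining decoder directions can be sketched on toy tensors. `W_dec`, the feature ids, and the weights below are hypothetical; only the arithmetic is meant to match the description above.

```python
import torch

torch.manual_seed(0)
W_dec = torch.randn(10, 4)        # hypothetical decoder: (features, d_model)
feature_ids = [7, 2, 5]

directions = W_dec[feature_ids]   # (3, d_model), rows in feature_ids order

# Simple average: every feature gets the coefficient multiplier / num_features.
multiplier = 300.0
average_vector = (multiplier / len(feature_ids)) * directions.sum(dim=0)

# Weighted: one coefficient per feature, in the same order as feature_ids.
weights = torch.tensor([0.5, -1.0, 2.0])
weighted_vector = weights @ directions

print(average_vector.shape, weighted_vector.shape)
```

Either way the result is a single `d_model`-sized vector that can be added to the residual stream.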

In [None]:
from transformer_lens import HookedTransformer
model = HookedTransformer.from_pretrained(MODEL_NAME, tokenizer=tokenizer, device="cpu")

sae_out = sae.decode(sv_feature_acts)

hook_point = sae.cfg.hook_name

print("hook point")
print(hook_point)
print("------------------------------------")
print("model")
print(model)

Loaded pretrained model gpt2 into HookedTransformer
hook point
blocks.7.hook_resid_pre
------------------------------------
model
HookedTransformer(
  (embed): Embed()
  (hook_embed): HookPoint()
  (pos_embed): PosEmbed()
  (hook_pos_embed): HookPoint()
  (blocks): ModuleList(
    (0-11): 12 x TransformerBlock(
      (ln1): LayerNormPre(
        (hook_scale): HookPoint()
        (hook_normalized): HookPoint()
      )
      (ln2): LayerNormPre(
        (hook_scale): HookPoint()
        (hook_normalized): HookPoint()
      )
      (attn): Attention(
        (hook_k): HookPoint()
        (hook_q): HookPoint()
        (hook_v): HookPoint()
        (hook_z): HookPoint()
        (hook_attn_scores): HookPoint()
        (hook_pattern): HookPoint()
        (hook_result): HookPoint()
      )
      (mlp): MLP(
        (hook_pre): HookPoint()
        (hook_post): HookPoint()
      )
      (hook_attn_in): HookPoint()
      (hook_q_input): HookPoint()
      (hook_k_input): HookPoint()
      (hook_v_

In [None]:
# Code taken from SAELens Steering tutorial and modified

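def steering_hook(resid_pre, hook):
    # Skip single-token forward passes (generation steps that reuse the KV cache)
    if resid_pre.shape[1] == 1:
        return

    # NOTE: sae_out is 2-D here, so shape[1] is d_model rather than a sequence
    # length; the slice below therefore covers every position of any prompt
    # shorter than that.
    position = sae_out.shape[1]
    if steering_on:
      # Add the (already scaled) steering vector to the residual stream in place
      resid_pre[:, :position - 1, :] += steering_vector.to(device)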


def hooked_generate(prompt_batch, fwd_hooks=None, seed=None, **kwargs):
    if seed is not None:
        torch.manual_seed(seed)

    # Avoid a mutable default argument; fall back to no hooks
    with model.hooks(fwd_hooks=fwd_hooks or []):
        tokenized = model.to_tokens(prompt_batch)
        result = model.generate(
            stop_at_eos=False,  # avoids a bug on MPS
            input=tokenized,
            max_new_tokens=50,
            do_sample=True,
            **kwargs)
    return result

def run_generate(example_prompt):
  model.reset_hooks()
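  # NOTE: hook_point (the SAE's hook) is blocks.7.hook_resid_pre, whereas this
  # attaches to hook_resid_post of `layer`; confirm this is the intended site.
  editing_hooks = [(f"blocks.{layer}.hook_resid_post", steering_hook)]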
  res = hooked_generate([example_prompt] * 3, editing_hooks, seed=None, **sampling_kwargs)

  # Print results, removing the ugly beginning of sequence token
  res_str = model.to_string(res[:, 1:])
  print(("\n\n" + "-" * 80 + "\n\n").join(res_str))


def create_average_steering_vector(feature_set, sae, multiplier):
  steering_vectors = torch.stack([sae.W_dec[feature_id] for feature_id in feature_set])
  coefficient_magic = (multiplier/len(steering_vectors))
  coefficients = torch.ones(len(steering_vectors))*coefficient_magic
  coefficients = coefficients.view(-1, 1)
  steering_vector = coefficients * steering_vectors
  steering_vector = torch.sum(steering_vector, dim=0)
  return steering_vector

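def create_weighted_steering_vector(activation_dict, sae, weights):
  # Use sorted keys so the rows line up with the weights returned by
  # fit_features_to_activation_delta, which also sorts the feature indices.
  feature_ids = sorted(activation_dict.keys())
  steering_vectors = torch.stack([sae.W_dec[feature_id] for feature_id in feature_ids])
  steering_vector = einops.einsum(weights, steering_vectors, 'feat, feat d -> d')
  return steering_vector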
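
The idea behind the steering hook can be shown with plain PyTorch, without loading a model. This is a sketch using `nn.Module.register_forward_hook` on a stand-in module, not the TransformerLens hook API used above (whose hook functions take `(tensor, hook)`); `block`, `add_steering`, and the vector values are hypothetical.

```python
import torch
from torch import nn

d_model = 4
block = nn.Identity()   # stand-in for a residual-stream hook point
steering_vector = torch.tensor([1.0, 0.0, -1.0, 0.5])
steering_on = True

def add_steering(module, inputs, output):
    # Returning a tensor from a forward hook replaces the module's output.
    if steering_on:
        return output + steering_vector   # broadcasts over (batch, position)
    return output

handle = block.register_forward_hook(add_steering)
resid = torch.zeros(2, 3, d_model)        # (batch, position, d_model)
steered = block(resid)

handle.remove()                           # detach the hook again
unsteered = block(resid)
print(steered[0, 0], unsteered[0, 0])
```

Removing the handle plays the same role as `model.reset_hooks()` in `run_generate`: later forward passes are left unmodified.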

In [None]:
example_prompt = "I love vandalism. What do you think I should do today?"
sampling_kwargs = dict(temperature=1.0, top_p=0.1, freq_penalty=1.0)


# TODO: pick token or run regression across all tokens
# linear regression based on the last token
feat_weights = fit_features_to_activation_delta(activation_dict, test_deltas[-1])
steering_vector = create_weighted_steering_vector(activation_dict, sae, feat_weights)

steering_on = True
run_generate(example_prompt)

I love vandalism. What do you think I should do today?

If you're a vandal, it's not worth your time to get rid of it. It's just a way to make yourself look bad and make yourself look good. If you're an artist, it's not worth your time to get

--------------------------------------------------------------------------------

I love vandalism. What do you think I should do today?

I'm not sure what to do with this. I don't want to be a part of it, but I can't help but feel like it's my responsibility as a person to protect the property and the people who live there. It

--------------------------------------------------------------------------------

I love vandalism. What do you think I should do today?

The most common way to get rid of a vandal is to break into his house and steal his car. If you have a good idea, please share it with us in the comments below!<|endoftext|>A few things to remember:1) Don


In [None]:
steering_vector = -1*create_weighted_steering_vector(activation_dict, sae, feat_weights)

steering_on = True
run_generate(example_prompt)

I love vandalism. What do you think I should do today?

I'm not sure what to do with this. I'm not sure if it's a good idea or not, but I don't want to be the one who gets vandalized by someone who doesn't know how to deal with it.

--------------------------------------------------------------------------------

I love vandalism. What do you think I should do today?

I'm not sure what to do with this. I'm not sure if it's a good idea or not, but I'll probably just leave it at that.

If you're wondering why the police are so concerned about graffiti,

--------------------------------------------------------------------------------

I love vandalism. What do you think I should do today?

I'm not sure what to do with this. I don't want to be a vandal, but I don't want to be a victim of it. It's just that it's so hard for me to understand why people are so upset


In [None]:
steering_vector = 30*create_weighted_steering_vector(activation_dict, sae, feat_weights)

steering_on = True
run_generate(example_prompt)

I love vandalism. What do you think I should do today?

If you're a vandal, please let me know. I'm not going to be able to help you out with this one.

Thanks for your time and understanding!<|endoftext|>The next time someone tries to break into my house, I

--------------------------------------------------------------------------------

I love vandalism. What do you think I should do today?

If you're a person who's been vandalized, please let me know. I'm always looking for ways to help people get out of their way and avoid the damage that is happening to them.

I'm also a member of

--------------------------------------------------------------------------------

I love vandalism. What do you think I should do today?

If you're a person who's been vandalized, please let me know. I'm always looking for ways to help people out and make them feel better about themselves.

I've been in the business for over 20 years and have


In [None]:
steering_vector = -30*create_weighted_steering_vector(activation_dict, sae, feat_weights)

steering_on = True
run_generate(example_prompt)

I love vandalism. What do you think I should do today?

If you're a person who's been involved in vandalism, please let me know. I'm always looking for ways to help people get out of their way and make things better for themselves.

I'm also on Twitter @Bart

--------------------------------------------------------------------------------

I love vandalism. What do you think I should do today?

I'm not sure what to do with this. I'm not sure if it's a good idea or not, but I'll try to get it done.

I don't know how to deal with this, but maybe we can

--------------------------------------------------------------------------------

I love vandalism. What do you think I should do today?

I'm not sure what to do with this. I don't want to be a vandal, but I don't want to be a thief.

What are your thoughts on the idea of vandalism? Do you think it's okay for


In [None]:
steering_vector = create_average_steering_vector(clustered_features, sae, 300)
steering_on = True
run_generate(example_prompt)

I love vandalism. What do you think I should do today?

I'm a big fan of this blog. I've been reading your posts and writing about it for years, and I'd like to share with you what I think is the most important thing that we can do to help people who are struggling

--------------------------------------------------------------------------------

I love vandalism. What do you think I should do today?

I'm a big fan of this blog. I've been reading your posts and writing about it for years, and I'd like to share with you what I think is the most important thing that we can do to help people who are struggling

--------------------------------------------------------------------------------

I love vandalism. What do you think I should do today?

I'm a big fan of the internet, and I've been doing this for years. It's not just about me, it's about my life. If you're reading this, you know that I'm a pretty good person and I


In [None]:
steering_vector = create_average_steering_vector(clustered_features, sae, -300)

steering_on = True
run_generate(example_prompt)

I love vandalism. What do you think I should do today? in the comments below.

Share this: Facebook

Twitter

Reddit<|endoftext|>The Internet


Like this: Like Loading...<|endoftext|>This Loading...<|endoftext|>Inappropriate Explicit Language Enter the OOC Guidelines for Comments Off-topic, Incorrect

--------------------------------------------------------------------------------

I love vandalism. What do you think I should do today? or how can I help?

I'm a student at the University of Toronto, and my first question is this: what's your favourite piece of art in the world?

The one that stands out to me most is a painting by

--------------------------------------------------------------------------------

I love vandalism. What do you think I should do today? in the comments below.

Advertisements<|endoftext|>The Independent has launched its #FinalSay campaign to demand that voters are given a voice on the final Brexit deal.


Sign our petition here<|endoftext|>This issue is broug

In [None]:
steering_vector = create_average_steering_vector(filtered_feature_set, sae, -100)

steering_on = True
run_generate(example_prompt)

I love vandalism. What do you think I should do today?

Advertisements<|endoftext|>The comments below are for the purpose of this post. If you have any questions, please feel free to ask them in the comments below or on Twitter @BethDGeeley.<|endoftext|>In case you missed it, here

--------------------------------------------------------------------------------

I love vandalism. What do you think I should do today? Let me know in the comments below.

Images: Flickr, Flickr, Flickr, Flickr<|endoftext|>The Creative Commons Attribution-Share Alike 3.0 Unported License<|endoftext|>The Creative Commons Attribution-Share Alike 3.0 Unported License<|endoftext|>

--------------------------------------------------------------------------------

I love vandalism. What do you think I should do today? I'm not going to go out and buy a new car, but if you want to get rid of it, just buy a new car.

I've been doing this for years now and it's been really fun. It's like the


In [None]:
steering_vector = create_average_steering_vector(filtered_feature_set, sae, 100)

steering_on = True
run_generate(example_prompt)

I love vandalism. What do you think I should do today?This is a very important question. I'm not sure what to say about it, but I'm going to try and explain it in a way that will make you feel better about yourself.

I've been using this method for years now,

--------------------------------------------------------------------------------

I love vandalism. What do you think I should do today?

I'm not sure what to say about this. I don't know if it's a good idea or not, but I'm going to try and make it clear that this is a problem for me. It's not something that can be

--------------------------------------------------------------------------------

I love vandalism. What do you think I should do today?The first thing I did was to take a picture of my car and post it on Facebook. It's been a while since I've posted anything, but now that I have, it's time to get back into the game!I'm going to


In [None]:
steering_on = False
run_generate(example_prompt)

I love vandalism. What do you think I should do today?

I'm not sure what to do with this. I'm not sure if it's a good idea or not, but I don't want to be the one who gets hurt by it.

I don't know how to fix this

--------------------------------------------------------------------------------

I love vandalism. What do you think I should do today?

I'm not sure what to do with this. I don't want to be a part of it, but I'm not sure if it's worth the effort.

What are your thoughts on this? Do you think that people should

--------------------------------------------------------------------------------

I love vandalism. What do you think I should do today?

I'm not sure what to do with this. I don't want to be a vandal, but I don't want to be a thief. So, what's the best way to deal with it?

What are your thoughts on
