# Feature tracing

## Config and preprocessing

In [2]:
%load_ext autoreload
%autoreload 2

In [24]:
FEATURE_ID = 7
BATCH_SIZE = 128

Loaded pretrained model gelu-1l into HookedTransformer
Moving model to device:  cuda
Changing model dtype to torch.float16
Model device: cuda:0
{'batch_size': 4096,
 'beta1': 0.9,
 'beta2': 0.99,
 'buffer_batches': 12288,
 'buffer_mult': 384,
 'buffer_size': 1572864,
 'd_mlp': 2048,
 'dict_mult': 8,
 'enc_dtype': 'fp32',
 'l1_coeff': 0.0003,
 'lr': 0.0001,
 'model_batch_size': 512,
 'num_tokens': 2000000000,
 'seed': 52,
 'seq_len': 128}
Encoder device: cuda:0


In [49]:
import pandas as pd

from transformer_lens import utils

from sprint.loading import load_all
from sprint.linearization import analyze_linearized_feature
from sprint.sae_tutorial import make_token_df

In [46]:
model, data, sae = load_all(half_precision=False, verbose=False)
batch = data[:BATCH_SIZE]  # convenience variable for calling make_token_df

Loaded pretrained model gelu-1l into HookedTransformer
Moving model to device:  cuda
{'batch_size': 4096,
 'beta1': 0.9,
 'beta2': 0.99,
 'buffer_batches': 12288,
 'buffer_mult': 384,
 'buffer_size': 1572864,
 'd_mlp': 2048,
 'dict_mult': 8,
 'enc_dtype': 'fp32',
 'l1_coeff': 0.0003,
 'lr': 0.0001,
 'model_batch_size': 512,
 'num_tokens': 2000000000,
 'seed': 52,
 'seq_len': 128}


In [39]:
result = analyze_linearized_feature(
    feature_idx=FEATURE_ID,
    sample_idx=0,
    token_idx=0,
    model=model,
    data=data,
    encoder=sae,
    batch_size=BATCH_SIZE,
)

## Feature interpretation
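We look at three tables:
* Tokens in context from the batch, ranked by SAE feature activation
* Tokens in context from the batch, ranked by the activation scores returned by `analyze_linearized_feature()`
* Tokens without context, ranked by their unembedded token scores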


In [44]:
# Table of top SAE activations

token_df = make_token_df(batch, model=model)
token_df["feature"] = utils.to_numpy(result["sae activations"][:, FEATURE_ID])
token_df.sort_values("feature", ascending=False).head(20).style.background_gradient("coolwarm")

Unnamed: 0,str_tokens,unique_token,context,batch,pos,label,feature
6617,·you,·you/89,·Don·Jaime·I·and|·you|·will,51,89,51/89,1.560133
11525,·it,·it/5,<|BOS|>in·is·True·and|·it|·is,90,5,90/5,1.155803
13829,·it,·it/5,"<|BOS|>·image·file,·and|·it|·should",108,5,108/5,0.965696
6551,·it,·it/23,·meets·the·new·one·and|·it|’,51,23,51/23,0.900986
4314,·she,·she/90,"·life,·growing·up,|·she|·was",33,90,33/90,0.463119
5588,·he,·he/84,·through·grace·—·and·now|·he|·shares,43,84,43/84,0.385578
9689,·what,·what/89,·to·perceive·buyer·behavior·and|·what|·it,75,89,75/89,0.290015
6552,’,’/24,·the·new·one·and·it|’|s,51,24,51/24,0.253979
4326,·hog,·hog/102,’s·row·crop·and|·hog|·farm,33,102,33/102,0.233893
6618,·will,·will/90,·Jaime·I·and·you|·will|·arrive,51,90,51/90,0.229745


In [47]:
# Table of activation scores

token_df = make_token_df(batch, model=model)
token_df["feature"] = utils.to_numpy(result["activation scores"])
token_df.sort_values("feature", ascending=False).head(20).style.background_gradient("coolwarm")

Unnamed: 0,str_tokens,unique_token,context,batch,pos,label,feature
6110,/,//94,↩ #·redistribute·it·and|/|or,47,94,47/94,1.532979
9820,↩ ························,↩ ························/92,·the·sender·when·broadcasting·or|↩ ························|·addressing,76,92,76/92,1.357633
9770,↩ ·····················,↩ ·····················/42,"·all·connected·clients,·or|↩ ·····················|·``",76,42,76/42,1.312005
7204,"',","',/36",".pack('=L|',|·s",56,36,56/36,1.2104
1646,0,0/110,(0).buffer(|0|.,12,110,12/110,1.206347
6617,·you,·you/89,·Don·Jaime·I·and|·you|·will,51,89,51/89,1.195316
449,·dimensions,·dimensions/65,·unpacked·if·d·in|·dimensions|·and,3,65,3/65,1.183801
451,·not,·not/67,·if·d·in·dimensions·and|·not|·is,3,67,3/67,1.178248
7266,0,0/98,→bbox·=·[|0|],56,98,56/98,1.158542
455,vs,vs/71,·and·not·isscalar(|vs|)],3,71,3/71,1.158261
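
One quick way to compare the two rankings above is to check which rows appear in both. The sketch below is only illustrative: `shared_top_rows` is a hypothetical helper, not part of `sprint`, and the row indices are copied by hand from the first column of the two tables. In the notebook they could be taken from the index of the sorted `token_df` instead.

```python
def shared_top_rows(rows_a, rows_b):
    """Return the row indices present in both rankings, in the order of rows_a."""
    seen = set(rows_b)
    return [row for row in rows_a if row in seen]


# Row indices copied from the first column of the two tables above.
sae_top = [6617, 11525, 13829, 6551, 4314, 5588, 9689, 6552, 4326, 6618]
score_top = [6110, 9820, 9770, 7204, 1646, 6617, 449, 451, 7266, 455]

overlap = shared_top_rows(sae_top, score_top)
print(overlap)  # [6617]
```

Among the rows shown, only row 6617 (`·you` at 51/89) appears in both tables.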


In [50]:
# Unembed feature

token_df = pd.DataFrame(
    dict(str_tokens=result["token strings"], feature_scores=result["token scores"].detach().cpu().numpy())
)
token_df.style.background_gradient("coolwarm")

Unnamed: 0,str_tokens,feature_scores
0,}+\,0.652685
1,Appl,0.622629
2,",’",0.617686
3,/.,0.609678
4,.\,0.599738
5,",**",0.593384
6,)/(,0.5816
7,…”,0.561763
8,".""",0.561631
9,)\,0.548016


I skipped exploring other linearization points; to do so, change the `sample_idx` and `token_idx` arguments to `analyze_linearized_feature()`.
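
A minimal sketch of such a sweep, assuming the analysis function accepts `sample_idx` and `token_idx` as keyword arguments (as in the call above). `sweep_linearization_points` and `fake_analyze` are hypothetical names introduced here; the stand-in only exists so the sketch runs without a model.

```python
from itertools import product


def sweep_linearization_points(analyze, sample_indices, token_indices, **fixed_kwargs):
    """Call `analyze` once per (sample_idx, token_idx) pair and collect the results."""
    results = {}
    for sample_idx, token_idx in product(sample_indices, token_indices):
        results[(sample_idx, token_idx)] = analyze(
            sample_idx=sample_idx, token_idx=token_idx, **fixed_kwargs
        )
    return results


# Stand-in for analyze_linearized_feature so the sketch is self-contained.
def fake_analyze(sample_idx, token_idx, feature_idx):
    return {"point": (sample_idx, token_idx), "feature_idx": feature_idx}


sweep = sweep_linearization_points(fake_analyze, range(2), range(3), feature_idx=7)
print(len(sweep))  # 6
```

In the notebook, `analyze_linearized_feature` would be passed in place of `fake_analyze`, with `feature_idx`, `model`, `data`, `encoder`, and `batch_size` supplied as the fixed keyword arguments.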

## Attention