
# Mechanistic Interpretability of Gemma2-9B-it Prompted with ARC Challenges

By Oliver Morris, Sep 2024.
Derived from a Colab notebook by Neel Nanda & Joseph Bloom, Google Deepmind:
https://colab.research.google.com/github/jbloomAus/SAELens/blob/main/tutorials/tutorial_2_0.ipynb

## Objective

The objective is to analyse differences in feature activations between similar ARC reasoning challenges: those on which the model succeeds and those on which it fails.

## Compute

If using a standard hosted environment, select a TPU runtime: we need up to 80GB of memory for the models, which is not feasible with a single T4 GPU or a single A100. Note, however, that PyTorch will treat the TPU runtime as a CPU device, so nothing here is accelerated by CUDA.

Custom GCP environments may be an option, but they have not been tested with this notebook.

In theory, we can enable PyTorch to take advantage of TPU abilities using the torch_xla package:
```
  import torch
  import torch_xla
  import torch_xla.core.xla_model as xm
  import torch_xla.distributed.parallel_loader as pl
  import torch_xla.distributed.xla_multiprocessing as xmp

  # Move your model to TPU device
  device = xm.xla_device()
  model = MyModel().to(device)
```

But this has not been attempted here, as we are operating two models, an SAE and the original LLM.

## Connect to HuggingFace

First, connect to Hugging Face for the models and to Google Drive for the data.


In [1]:
!pip install huggingface_hub
from huggingface_hub import notebook_login

notebook_login()




## Connect to Google Drive For Data

In [2]:
from google.colab import drive

# Try increasing the timeout
drive.mount('/content/drive', timeout_ms=300000)

# If that doesn't work, try unmounting and remounting with force_remount=True
!fusermount -u /content/drive  # Unmount the drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive
Mounted at /content/drive


## Set Up the Environment

In [3]:
try:
    import google.colab # type: ignore
    from google.colab import output
    COLAB = True
    %pip install sae-lens transformer-lens sae-dashboard
except ImportError:
    COLAB = False
    from IPython import get_ipython # type: ignore
    ipython = get_ipython(); assert ipython is not None
    ipython.run_line_magic("load_ext", "autoreload")
    ipython.run_line_magic("autoreload", "2")

# Standard imports
import os
import torch
from tqdm import tqdm
import plotly.express as px
import pandas as pd

# Imports for displaying vis in Colab / notebook

torch.set_grad_enabled(False)

# For the most part I'll try to import functions and classes near where they are used
# to make it clear where they come from.

if torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Device: {device}")

Collecting sae-lens
  Downloading sae_lens-3.22.0-py3-none-any.whl.metadata (5.1 kB)
Collecting transformer-lens
  Downloading transformer_lens-2.6.0-py3-none-any.whl.metadata (12 kB)
Collecting sae-dashboard
  Downloading sae_dashboard-0.5.1-py3-none-any.whl.metadata (6.0 kB)
Collecting automated-interpretability<1.0.0,>=0.0.5 (from sae-lens)
  Downloading automated_interpretability-0.0.6-py3-none-any.whl.metadata (778 bytes)
Collecting babe<0.0.8,>=0.0.7 (from sae-lens)
  Downloading babe-0.0.7-py3-none-any.whl.metadata (10 kB)
Collecting datasets<3.0.0,>=2.17.1 (from sae-lens)
  Downloading datasets-2.21.0-py3-none-any.whl.metadata (21 kB)
Collecting matplotlib<4.0.0,>=3.8.3 (from sae-lens)
  Downloading matplotlib-3.9.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting plotly-express<0.5.0,>=0.4.1 (from sae-lens)
  Downloading plotly_express-0.4.1-py2.py3-none-any.whl.metadata (1.7 kB)
Collecting pytest-profiling<2.0.0,>=1.7.0 (from sae-lens)
  

## Loading a Pretrained Sparse Autoencoder

The following snippet shows the currently available SAE releases in SAELens.

Each row is a "release" which has multiple SAEs which may have different configs / match different hook points in a model.

In [2]:
from sae_lens.toolkit.pretrained_saes_directory import get_pretrained_saes_directory

df = pd.DataFrame.from_records({k:v.__dict__ for k,v in get_pretrained_saes_directory().items()}).T
df.drop(columns=["expected_var_explained", "expected_l0", "config_overrides", "conversion_func"], inplace=True)
df[df['model']=='gemma-2-9b']

| release | repo_id | model | saes_map | neuronpedia_id |
|---|---|---|---|---|
| gemma-scope-9b-it-res | google/gemma-scope-9b-it-res | gemma-2-9b | `{'layer_20/width_131k/average_l0_13': 'layer_2...` | `{'layer_20/width_131k/average_l0_13': None, 'l...` |
| gemma-scope-9b-pt-att | google/gemma-scope-9b-pt-att | gemma-2-9b | `{'layer_0/width_16k/average_l0_12': 'layer_0/w...` | `{'layer_0/width_16k/average_l0_12': None, 'lay...` |
| gemma-scope-9b-pt-att-canonical | google/gemma-scope-9b-pt-att | gemma-2-9b | `{'layer_0/width_16k/canonical': 'layer_0/width...` | `{'layer_0/width_16k/canonical': 'gemma-2-9b/0-...` |
| gemma-scope-9b-pt-mlp | google/gemma-scope-9b-pt-mlp | gemma-2-9b | `{'layer_0/width_16k/average_l0_6': 'layer_0/wi...` | `{'layer_0/width_16k/average_l0_6': None, 'laye...` |
| gemma-scope-9b-pt-mlp-canonical | google/gemma-scope-9b-pt-mlp | gemma-2-9b | `{'layer_0/width_16k/canonical': 'layer_0/width...` | `{'layer_0/width_16k/canonical': 'gemma-2-9b/0-...` |
| gemma-scope-9b-pt-res | google/gemma-scope-9b-pt-res | gemma-2-9b | `{'embedding/width_4k/average_l0_14': 'embeddin...` | `{'embedding/width_4k/average_l0_14': None, 'em...` |
| gemma-scope-9b-pt-res-canonical | google/gemma-scope-9b-pt-res | gemma-2-9b | `{'layer_0/width_16k/canonical': 'layer_0/width...` | `{'layer_0/width_16k/canonical': 'gemma-2-9b/0-...` |


We are focussed on Gemma-2-9b-it because this is the smallest model which can successfully complete a number of the ARC-AGI challenges. It is also available via API on Together.ai, so we can easily conduct inference on the 400 challenges in the ARC-AGI training set.

To see all the SAEs contained in a specific release (named after the part of the model they apply to), simply run the below. Each hook point corresponds to a layer or module of the model.

In [3]:
# show the contents of the saes_map column for a specific row
print("SAEs in the G2-9b Instruct Residual release")
for k,v in df.loc[df.release == "gemma-scope-9b-it-res", "saes_map"].values[0].items():
    print(f"SAE id: {k} for hook point: {v}")


SAEs in the G2-9b Instruct Residual release
SAE id: layer_20/width_131k/average_l0_13 for hook point: layer_20/width_131k/average_l0_13
SAE id: layer_20/width_131k/average_l0_153 for hook point: layer_20/width_131k/average_l0_153
SAE id: layer_20/width_131k/average_l0_24 for hook point: layer_20/width_131k/average_l0_24
SAE id: layer_20/width_131k/average_l0_43 for hook point: layer_20/width_131k/average_l0_43
SAE id: layer_20/width_131k/average_l0_81 for hook point: layer_20/width_131k/average_l0_81
SAE id: layer_20/width_16k/average_l0_14 for hook point: layer_20/width_16k/average_l0_14
SAE id: layer_20/width_16k/average_l0_189 for hook point: layer_20/width_16k/average_l0_189
SAE id: layer_20/width_16k/average_l0_25 for hook point: layer_20/width_16k/average_l0_25
SAE id: layer_20/width_16k/average_l0_47 for hook point: layer_20/width_16k/average_l0_47
SAE id: layer_20/width_16k/average_l0_91 for hook point: layer_20/width_16k/average_l0_91
SAE id: layer_31/width_131k/average_l0_109

Next we'll load a specific SAE for layer 20, approximately the middle of Gemma2-9B-it, which has 42 layers. We load the 'res' version with a width of 16k: it hooks into the residual stream of the LLM and expands each activation vector there into 16,384 features for analysis.

The residual stream is where the outputs of the preceding layers are accumulated (summed, not concatenated), so it has the greatest potential for carrying information. It is also supported by Neuronpedia.

We could alternatively load the 131k versions for layer 20, which give even greater granularity but which consume more memory and time.

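We also load a copy of the Gemma 2 9B model to attach it to; note that the call below passes the TransformerLens model name `gemma-2-9b`, while the SAE config reports `gemma-2-9b-it`. To load the model, we'll use the HookedSAETransformer class, which is adapted from the TransformerLens HookedTransformer.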


In [4]:
# from transformer_lens import HookedTransformer
from sae_lens import SAE, HookedSAETransformer

model = HookedSAETransformer.from_pretrained("gemma-2-9b", device = device)

# the cfg dict is returned alongside the SAE since it may contain useful information for analysing the SAE (eg: instantiating an activation store)
# Note that this is not the same as the SAEs config dict, rather it is whatever was in the huggingface (HF) repo, from which we can extract the SAE config dict
# We also return the feature sparsities which are stored in HF for convenience.
sae, cfg_dict, sparsity = SAE.from_pretrained(
    release = "gemma-scope-9b-it-res", # <- Release name
    sae_id = "layer_20/width_16k/average_l0_14", # <- SAE id (not always a hook point!)
    device = device
)






Loaded pretrained model gemma-2-9b into HookedTransformer



The "sae" object is an instance of the SAE (Sparse Autoencoder) class. There are many different SAE architectures, which may have different weights or activation functions. SAE Lens handles most of this complexity for you.

Let's look at the SAE config and understand each of the parameters:

1. `architecture`: Specifies the type of SAE architecture being used. The config printed below reports `jumprelu`, i.e. a JumpReLU SAE rather than the standard (plain ReLU) or gated architectures.
2. `d_in`: Defines the input dimension of the SAE, which is 3584 in this configuration.
3. `d_sae`: Sets the dimension of the SAE's hidden layer, which is 16384 here. This represents the number of possible feature activations.
4. `activation_fn_str`: Specifies the activation function used in the SAE, which is ReLU in this case. TopK is another option that we will not cover here.
5. `apply_b_dec_to_input`: Determines whether to apply the decoder bias to the input, set to False here.
6. `finetuning_scaling_factor`: Indicates whether to use a scaling factor to weight initialization and the forward pass. This is not usually used and was introduced to support a [solution for shrinkage](https://www.lesswrong.com/posts/3JuSjTZyMzaSeTxKk/addressing-feature-suppression-in-saes).
7. `context_size`: Defines the size of the context window, which is 1024 tokens in this case. It turns out that SAEs trained on activations from short prompts [often don't perform well on longer prompts](https://www.lesswrong.com/posts/baJyjpktzmcmRfosq/stitching-saes-of-different-sizes).
8. `model_name`: Specifies the name of the model being used. [This is a valid model name in TransformerLens](https://transformerlensorg.github.io/TransformerLens/generated/model_properties_table.html).
9. `hook_name`: Indicates the specific hook in the model where the SAE is applied.
10. `hook_layer`: Specifies the layer number where the hook is applied, which is layer 20 in this case.
11. `hook_head_index`: Defines which attention head to hook into; not relevant here since we are looking at a residual stream SAE.
12. `prepend_bos`: Determines whether to prepend the beginning-of-sequence token, set to True.
13. `dataset_path`: Specifies the path to the dataset used for training or evaluation. (Can be local or a huggingface dataset.)
14. `dataset_trust_remote_code`: Indicates whether to trust remote code (from HuggingFace) when loading the dataset, set to True.
15. `normalize_activations`: Specifies how to normalize activations, set to `None` in this config.
16. `dtype`: Defines the data type for tensor operations, set to 32-bit floating point.
17. `device`: Specifies the computational device to use.
18. `sae_lens_training_version`: Indicates the version of SAE Lens used for training, set to None here.
19. `activation_fn_kwargs`: Allows for additional keyword arguments for the activation function. This would be used if e.g. the `activation_fn_str` was set to `topk`, so that `k` could be specified.

In [5]:
print(sae.cfg.__dict__)

{'architecture': 'jumprelu', 'd_in': 3584, 'd_sae': 16384, 'activation_fn_str': 'relu', 'apply_b_dec_to_input': False, 'finetuning_scaling_factor': False, 'context_size': 1024, 'model_name': 'gemma-2-9b-it', 'hook_name': 'blocks.20.hook_resid_post', 'hook_layer': 20, 'hook_head_index': None, 'prepend_bos': True, 'dataset_path': 'monology/pile-uncopyrighted', 'dataset_trust_remote_code': True, 'normalize_activations': None, 'dtype': 'float32', 'device': 'cpu', 'sae_lens_training_version': None, 'activation_fn_kwargs': {}, 'neuronpedia_id': None, 'model_from_pretrained_kwargs': {}}
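
The printed config reports `'architecture': 'jumprelu'`. As a minimal sketch of what a JumpReLU SAE computes, assuming the usual formulation (encode, zero out any pre-activation that does not exceed a per-feature threshold, then decode), the toy example below uses made-up sizes, weights and thresholds. The helper names are hypothetical and are not part of the SAE Lens API.

```python
# Toy JumpReLU sparse autoencoder forward pass.
# All sizes, weights and thresholds are invented for illustration.

def jumprelu(pre_acts, thresholds):
    """Keep a pre-activation only where it exceeds its feature's threshold."""
    return [z if z > t else 0.0 for z, t in zip(pre_acts, thresholds)]

def matvec(vec, mat):
    """Multiply a row vector (length n) by an n x m matrix given as nested lists."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

def sae_forward(x, W_enc, b_enc, thresholds, W_dec, b_dec):
    """Return (feature activations, reconstruction) for one activation vector."""
    pre = [p + b for p, b in zip(matvec(x, W_enc), b_enc)]
    feats = jumprelu(pre, thresholds)
    recon = [r + b for r, b in zip(matvec(feats, W_dec), b_dec)]
    return feats, recon

# Hypothetical toy dimensions: d_in = 2, d_sae = 4.
W_enc = [[1.0, -1.0, 0.0, 0.5],
         [0.0, 1.0, 1.0, 0.5]]
b_enc = [0.0, 0.0, 0.0, 0.0]
thresholds = [0.5, 0.5, 0.5, 2.0]
W_dec = [[1.0, 0.0],
         [0.0, 0.0],
         [0.0, 1.0],
         [0.0, 0.0]]
b_dec = [0.0, 0.0]

feats, recon = sae_forward([2.0, 1.0], W_enc, b_enc, thresholds, W_dec, b_dec)
print(feats)  # sparse feature activations
print(recon)  # reconstruction of the input
```

In this toy case the fourth feature has a positive pre-activation (1.5) but stays silent because it does not clear its threshold of 2.0, which is the behaviour that distinguishes JumpReLU from a plain ReLU.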


Note above:
- d_in = 3584
- d_sae=16384

Therefore, the SAE expands each activation vector by a factor of roughly 4.6 (16384 / 3584 ≈ 4.57).

Also:
- context_size = 1024

Our prompts, including the ARC training data, should therefore be limited to 1024 tokens.
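
One way to check prompts against this limit is sketched below. It is an illustration only: `split_by_context` and `whitespace_count` are hypothetical helpers, and the whitespace tokenizer is a stand-in so the example runs without the model.

```python
from typing import Callable, List, Tuple

def split_by_context(
    prompts: List[str],
    count_tokens: Callable[[str], int],
    context_size: int = 1024,
) -> Tuple[List[int], List[int]]:
    """Return (indices of prompts that fit, indices of prompts that are too long)."""
    fits, too_long = [], []
    for i, prompt in enumerate(prompts):
        if count_tokens(prompt) <= context_size:
            fits.append(i)
        else:
            too_long.append(i)
    return fits, too_long

# Stand-in tokenizer for illustration only: one token per whitespace-separated word.
def whitespace_count(text: str) -> int:
    return len(text.split())

fits, too_long = split_by_context(
    ["a b c", "a b c d e f"], whitespace_count, context_size=4
)
print(fits, too_long)
```

In the notebook, the real token count could be supplied instead, for example `lambda p: model.to_tokens(p).shape[1]`, which by default also counts the prepended BOS token.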

## Load the ARC Challenge Data

Next we need to load in a dataset to work with.

We use a hand picked selection of ARC-AGI challenges from the ARC 'training' set at https://github.com/fchollet/ARC-AGI

In [6]:
# Read a file from your Google Drive
import json
import pandas as pd
from pathlib import Path
from datasets import Dataset
from transformer_lens.utils import tokenize_and_concatenate

# source = 'data_for_sae.parquet'
source = '/content/drive/My Drive/Colab Notebooks/Data/data_for_sae.parquet'

# Read the parquet file from your Google Drive
df = pd.read_parquet(source)

# Function to create a dataset from our dataframe
def create_dataset(main_col, filename_col, dataframe):
    return Dataset.from_pandas(dataframe[[main_col, filename_col]])

# Create datasets
prompt_dataset = create_dataset('prompt_text', 'filename', df)
target_dataset = create_dataset('target', 'filename', df)


In [None]:
df

| | filename | confidence | len_prompt | prompt_text | target | outcome |
|---|---|---|---|---|---|---|
| 0 | e9afcf9a.json | 5 | 973 | `Below are pairs of matrices. \nThere is a mapp...` | `[[6, 2, 6, 2, 6, 2], [2, 6, 2, 6, 2, 6]]` | True |
| 1 | c9e6f938.json | 5 | 1218 | `Below are pairs of matrices. \nThere is a mapp...` | `[[7, 7, 0, 0, 7, 7], [0, 7, 0, 0, 7, 0], [0, 0...` | False |
| 2 | d037b0a7.json | 3 | 1137 | `Below are pairs of matrices. \nThere is a mapp...` | `[[4, 0, 8], [4, 0, 8], [4, 7, 8]]` | True |
| 3 | 6150a2bd.json | 4 | 973 | `Below are pairs of matrices. \nThere is a mapp...` | `[[0, 0, 4], [0, 8, 6], [5, 3, 6]]` | False |
| 4 | 6fa7a44f.json | 5 | 1517 | `Below are pairs of matrices. \nThere is a mapp...` | `[[2, 9, 2], [8, 5, 2], [2, 2, 8], [2, 2, 8], [...` | True |
| 5 | 48d8fb45.json | 4 | 2489 | `Below are pairs of matrices. \nThere is a mapp...` | `[[0, 3, 0], [3, 3, 0], [0, 3, 3]]` | False |
| 6 | 3c9b0459.json | 3 | 1301 | `Below are pairs of matrices. \nThere is a mapp...` | `[[7, 6, 4], [4, 6, 6], [4, 4, 6]]` | False |
| 7 | ed36ccf7.json | 3 | 1301 | `Below are pairs of matrices. \nThere is a mapp...` | `[[0, 0, 5], [0, 0, 5], [0, 5, 0]]` | False |
| 8 | b1948b0a.json | 4 | 1383 | `Below are pairs of matrices. \nThere is a mapp...` | `[[2, 7, 7, 2], [2, 7, 2, 7], [7, 7, 7, 2], [7,...` | False |
| 9 | 017c7c7b.json | 4 | 1677 | `Below are pairs of matrices. \nThere is a mapp...` | `[[2, 2, 2], [0, 2, 0], [0, 2, 0], [2, 2, 2], [...` | False |


In [None]:
# Example of how to access the data
print(prompt_dataset[0])  # This will show both 'prompt_text' and 'filename' for the first item
print(target_dataset[0])  # This will show both 'target' and 'filename' for the first item

# If you need to get all filenames
all_filenames = prompt_dataset['filename']

# If you need to get the filename for a specific index
index = 0
filename_for_index = prompt_dataset[index]['filename']
prompt_for_index = prompt_dataset[index]['prompt_text']
print(f"Filename for index {index}: {filename_for_index}")
print(f"Prompt for index {index}: {prompt_for_index}")

{'prompt_text': 'Below are pairs of matrices. \nThere is a mapping which operates on each input to give the output, only one mapping applies to all matrices. \nReview the matrices to learn that mapping and then estimate the missing output for the final input matrix.\n\nFIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. \nThis score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.\nTHEN Present your predicted output in np.array format\nTRAIN Pair 0\nINPUT. Shape=(2, 6)\narray([[3, 3, 3, 3, 3, 3],\n       [9, 9, 9, 9, 9, 9]])\nOUTPUT. Shape=(2, 6)\narray([[3, 9, 3, 9, 3, 9],\n       [9, 3, 9, 3, 9, 3]])\nTRAIN Pair 1\nINPUT. Shape=(2, 6)\narray([[4, 4, 4, 4, 4, 4],\n       [8, 8, 8, 8, 8, 8]])\nOUTPUT. Shape=(2, 6)\narray([[4, 8, 4, 8, 4, 8],\n       [8, 4, 8, 4, 8, 4]])\nTEST Pair 0\nINPUT. Shape=(2, 6)\narray([[6, 6, 6, 6, 6, 6],\n       [2, 2, 2, 2, 2,

Note, the above prompt to Gemma-2-9b-it gives a response as follows (40 tokens):

5
np.array([[6, 2, 6, 2, 6, 2],
[2, 6, 2, 6, 2, 6]])

In [7]:
sae.cfg.context_size

1024

## Basics: Test Prompt

Confirm the environment is working by running a test prompt.

In [10]:
from transformer_lens.utils import test_prompt

prompt = df['prompt_text'][0]
answer = "5"

# Show that the model can confidently predict the next token.
test_prompt(prompt, answer, model)

Tokenized prompt: ['<bos>', 'Below', ' are', ' pairs', ' of', ' matrices', '.', ' ', '\n', 'There', ' is', ' a', ' mapping', ' which', ' operates', ' on', ' each', ' input', ' to', ' give', ' the', ' output', ',', ' only', ' one', ' mapping', ' applies', ' to', ' all', ' matrices', '.', ' ', '\n', 'Review', ' the', ' matrices', ' to', ' learn', ' that', ' mapping', ' and', ' then', ' estimate', ' the', ' missing', ' output', ' for', ' the', ' final', ' input', ' matrix', '.', '\n\n', 'FIRST', ' score', ' your', ' confidence', ' that', ' you', ' understand', ' the', ' mapping', ' pattern', ',', ' ', '0', '-', '5', ' where', ' ', '0', ' is', ' zero', ' is', ' no', ' confidence', ' and', ' ', '5', ' is', ' highly', ' confident', '.', ' ', '\n', 'This', ' score', ' must', ' be', ' the', ' FIRST', ' output', ' you', ' give', ',', ' no', ' preamble', ',', ' no', ' prefix', ',', ' no', ' punctuation', ',', ' just', ' a', ' single', ' digit', ' score', '.', '\n', 'THEN', ' Present', ' your', '

Top 0th token. Logit: 24.94 Prob: 68.51% Token: |
|
Top 1th token. Logit: 23.21 Prob: 12.21% Token: |

|
Top 2th token. Logit: 22.79 Prob:  8.06% Token: |2|
Top 3th token. Logit: 21.62 Prob:  2.49% Token: |________________|
Top 4th token. Logit: 21.07 Prob:  1.44% Token: |


|
Top 5th token. Logit: 20.85 Prob:  1.16% Token: |1|
Top 6th token. Logit: 20.81 Prob:  1.11% Token: |0|
Top 7th token. Logit: 20.54 Prob:  0.84% Token: |6|
Top 8th token. Logit: 20.37 Prob:  0.72% Token: |................|
Top 9th token. Logit: 19.66 Prob:  0.35% Token: |################|


Top 0th token. Logit: 26.10 Prob: 67.46% Token: |
|
Top 1th token. Logit: 24.59 Prob: 14.89% Token: |

|
Top 2th token. Logit: 23.59 Prob:  5.52% Token: |2|
Top 3th token. Logit: 22.69 Prob:  2.23% Token: |________________|
Top 4th token. Logit: 22.56 Prob:  1.96% Token: |


|
Top 5th token. Logit: 21.91 Prob:  1.02% Token: |1|
Top 6th token. Logit: 21.88 Prob:  0.99% Token: |6|
Top 7th token. Logit: 21.75 Prob:  0.87% Token: |0|
Top 8th token. Logit: 21.26 Prob:  0.54% Token: |



|
Top 9th token. Logit: 21.26 Prob:  0.53% Token: |################|


## Using a HookedSAETransformer

Neel Nanda offers a full tutorial on SAEs using the HookedSAETransformer class: [Hooked SAE Transformer Demo (Open in Colab)](https://colab.research.google.com/github/TransformerLensOrg/TransformerLens/blob/main/demos/Hooked_SAE_Transformer_Demo.ipynb)

Below, we'll use the `run_with_cache_with_saes` function of the HookedSAETransformer, which will give us all the cached activations (including those from the SAE that we've specified in the arguments).

Running our prompt through the model gets us activation tensors as follows:

In [8]:
# SAEs don't reconstruct activation perfectly, so if you attach an SAE and want the model to stay performant, you need to use the error term.
# This is because the SAE will be used to modify the forward pass, and if it doesn't reconstruct the activations well, the outputs may be affected.
# Good SAEs have small error terms but it's something to be mindful of.

sae.use_error_term # If use error term is set to false, we will modify the forward pass by using the sae.

False
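
The comments in the cell above describe the error term. As a rough sketch, assuming the error term is simply the gap between the original activation and the SAE reconstruction, adding it back means downstream layers see the original activation unchanged. The function name and values below are made up for illustration.

```python
def splice_sae_output(x, recon, use_error_term):
    """Return what the next layer receives when an SAE is spliced into the forward pass."""
    if use_error_term:
        # error = original activation minus reconstruction
        error = [xi - ri for xi, ri in zip(x, recon)]
        # adding the error back recovers the original activation
        return [ri + ei for ri, ei in zip(recon, error)]
    # without the error term, the (imperfect) reconstruction replaces the activation
    return list(recon)

x = [1.0, 2.0, 3.0]      # original activation (made-up values)
recon = [0.5, 2.0, 3.5]  # imperfect SAE reconstruction (made-up values)

without_error = splice_sae_output(x, recon, use_error_term=False)
with_error = splice_sae_output(x, recon, use_error_term=True)
print(without_error)
print(with_error)
```

With `use_error_term` set to False, as in this notebook, the model's later layers work from the reconstruction, so its outputs may differ slightly from those of the unmodified model.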

In [9]:
# hooked SAE Transformer will enable us to get the feature activations from the SAE
prompt = prompt_dataset[0]['prompt_text']

_, cache = model.run_with_cache_with_saes(prompt, saes=[sae])

print([(k, v.shape) for k,v in cache.items() if "sae" in k])

# note there were 390 tokens in our prompt, the residual stream dimension is 3584, and the number of SAE features is 16384

[('blocks.20.hook_resid_post.hook_sae_input', torch.Size([1, 390, 3584])), ('blocks.20.hook_resid_post.hook_sae_acts_pre', torch.Size([1, 390, 16384])), ('blocks.20.hook_resid_post.hook_sae_acts_post', torch.Size([1, 390, 16384])), ('blocks.20.hook_resid_post.hook_sae_recons', torch.Size([1, 390, 3584])), ('blocks.20.hook_resid_post.hook_sae_output', torch.Size([1, 390, 3584]))]


Next, we'll visualize the activations of the hidden layer of the SAE at the *final token* position of the prompt, i.e. at the point the model is predicting the first token of the response.

This is the main reason we have asked the model to give a confidence score for its first token. That score encapsulates its response to the entire prompt.

Each of the vertical lines below corresponds to a feature activation.

We could also plot the dashboards for each of these activated features, using their index in the activation cache to pull data from Neuronpedia, but that is not necessary here.

In [11]:
# let's look at which features fired at layer 20 at the *final token position*

# from IPython.display import IFrame
# html_template = "https://neuronpedia.org/{}/{}/{}?embed=true&embedexplanation=true&embedplots=true&embedtest=true&height=300"
# def get_dashboard_html(sae_release = "gemma-scope-9b-it-res", sae_id="layer_20/width_16k/average_l0_14", feature_idx=0):
#     return html_template.format(sae_release, sae_id, feature_idx)

layer_hook= 'blocks.20.hook_resid_post.hook_sae_acts_post'

# get a random feature from the SAE
feature_idx = torch.randint(0, sae.cfg.d_sae, (1,)).item()

# hover over lines to see the Feature ID.
px.line(
    cache[layer_hook][0, -1, :].cpu().numpy(),
    title="Feature activations at the final token position",
    labels={"index": "Feature", "value": "Activation"},
).show()



In [None]:
# let's print the top 5 features and how much they fired
vals, inds = torch.topk(cache[layer_hook][0, -1, :], 5)
for val, ind in zip(vals, inds):
    print(f"Feature {ind} fired {val:.2f}")
    # html = get_dashboard_html(sae_release = "gemma-scope-9b-it-res", sae_id="layer_20/width_16k/average_l0_14", feature_idx=ind)
    # display(IFrame(html, width=1200, height=300))

Feature 3655 fired 30.44
Feature 4395 fired 23.42
Feature 3547 fired 22.51
Feature 9846 fired 19.33
Feature 10076 fired 17.23



From [Neuronpedia](https://www.neuronpedia.org/gemma-scope#browse) we can see what these features are most activated by:

- 03655: References to data structures and conditional checks in programming
- 04395: Mathematical expressions and relationships involving variables and functions
- 03547: Numerical data and statistical information related to surveys or questionnaires
- 09846: Numerical data and statistical references
- 10076: Terms related to hosting events or gatherings





### The Contrast Pairs Trick

We are interested in the features which fire differently between an ARC challenge completed successfully and one answered incorrectly, i.e. at the boundary of effective reasoning. Let's investigate this question by comparing the resulting activations from two such ARC challenges.

First we define a helper that runs two prompts through the model and the SAE, then plots their feature activations at the final token position together with the difference between them.

In [12]:
def generate_feature_activation_plot(prompt_dataset, indices, names, model, sae, layer_hook):
    """
    Generate a feature activation plot comparing activations between two prompts using the model and SAE.

    Arguments:
    - prompt_dataset: Dataset containing prompts and filenames
    - indices: List of indices for the two prompts to analyze
    - names: List of custom names to append to the filenames
    - model: Pretrained model to run with cache and SAEs
    - sae: Sparse autoencoder (SAE) instance
    - layer_hook: String identifier for the SAE layer to hook activations from

    Returns:
    - Tuple of (Plotly figure of the feature activations and their differences, activation cache)
    """
    # Retrieve prompts and file names based on the provided indices
    prompt = [prompt_dataset[indices[0]]['prompt_text'], prompt_dataset[indices[1]]['prompt_text']]
    filename_0 = prompt_dataset[indices[0]]['filename'] + names[0]
    filename_1 = prompt_dataset[indices[1]]['filename'] + names[1]

    # Run the model and get the activations cached
    _, cache = model.run_with_cache_with_saes(prompt, saes=[sae])

    # Print shapes of the cached activations containing 'sae'
    print([(k, v.shape) for k, v in cache.items() if "sae" in k])

    # Extract activations for the specified SAE layer for the first prompt
    feature_activation_df = pd.DataFrame(
        cache[layer_hook][0, -1, :].cpu().numpy(),
        index=[f"feature_{i}" for i in range(sae.cfg.d_sae)]
    )

    # Set the column name to the first filename
    feature_activation_df.columns = [filename_0]

    # Add activations for the second prompt
    feature_activation_df[filename_1] = cache[layer_hook][1, -1, :].cpu().numpy()

    # Compute the difference between the two activation sets
    feature_activation_df["diff"] = feature_activation_df[filename_0] - feature_activation_df[filename_1]

    # Plot the feature activations using Plotly
    fig = px.line(
        feature_activation_df,
        title="Feature activations for the prompt",
        labels={"index": "Feature", "value": "Activation"},
    )

    # Hide the x-ticks for cleaner visualization
    fig.update_xaxes(showticklabels=False)

    return fig, cache


In [13]:
fig, cache = generate_feature_activation_plot (
    prompt_dataset=prompt_dataset,
    indices=[0,1],
    names=['_Success_Conf5', '_Fail_Conf5'],
    model=model,
    sae=sae,
    layer_hook=layer_hook
)
fig.show()

[('blocks.20.hook_resid_post.hook_sae_input', torch.Size([2, 527, 3584])), ('blocks.20.hook_resid_post.hook_sae_acts_pre', torch.Size([2, 527, 16384])), ('blocks.20.hook_resid_post.hook_sae_acts_post', torch.Size([2, 527, 16384])), ('blocks.20.hook_resid_post.hook_sae_recons', torch.Size([2, 527, 3584])), ('blocks.20.hook_resid_post.hook_sae_output', torch.Size([2, 527, 3584]))]
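
The shapes above show both prompts padded to a common length of 527 tokens, whereas the first prompt on its own occupied 390 positions. If the tokenizer pads on the right (an assumption; `model.tokenizer.padding_side` would confirm it), then position `-1` for the shorter prompt is a padding position rather than its final real token. The sketch below, using a hypothetical helper and a toy 0/1 attention mask, shows one way to locate each prompt's last real token.

```python
def last_real_positions(attention_mask):
    """For each row of a 0/1 mask, return the index of the last 1 (the last non-pad token)."""
    positions = []
    for row in attention_mask:
        real = [i for i, m in enumerate(row) if m == 1]
        if not real:
            raise ValueError("row contains only padding")
        positions.append(real[-1])
    return positions

# Two toy prompts padded on the right to length 6 (1 = real token, 0 = padding).
mask = [
    [1, 1, 1, 1, 0, 0],  # 4 real tokens followed by 2 pads
    [1, 1, 1, 1, 1, 1],  # 6 real tokens, no padding
]
positions = last_real_positions(mask)
print(positions)
```

With such positions available, the activations for prompt `b` could be read from `cache[layer_hook][b, positions[b], :]` instead of `[b, -1, :]`. How the mask is obtained (for example by comparing token ids with the tokenizer's pad id) is left as an assumption here.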


In [None]:
# let's look at the biggest features in terms of absolute difference

diff = cache[layer_hook][1, -1, :].cpu() - cache[layer_hook][0, -1, :].cpu()
vals, inds = torch.topk(torch.abs(diff), 10)
for val, ind in zip(vals, inds):
    print(f"Feature {ind} had a difference of {val:.2f}")


Feature 3655 had a difference of 124.55
Feature 1600 had a difference of 60.20
Feature 11000 had a difference of 28.83
Feature 14672 had a difference of 26.51
Feature 4395 had a difference of 24.58
Feature 11358 had a difference of 23.49
Feature 3547 had a difference of 22.87
Feature 9827 had a difference of 21.30
Feature 2495 had a difference of 18.84
Feature 700 had a difference of 18.21
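
The listing above ranks features by absolute difference, which hides which of the two prompts activated the feature more strongly. The sketch below is a small variant that keeps the sign while ranking by magnitude. The helper name and activation values are made up, and plain lists stand in for the cached tensors.

```python
def top_signed_differences(acts_a, acts_b, k):
    """Rank features by |b - a| but report the signed difference b - a."""
    diffs = [(i, b - a) for i, (a, b) in enumerate(zip(acts_a, acts_b))]
    diffs.sort(key=lambda item: abs(item[1]), reverse=True)
    return diffs[:k]

# Made-up activations for five features in two prompts.
acts_success = [0.0, 12.0, 3.0, 0.0, 7.0]
acts_fail = [5.0, 2.0, 3.0, 0.0, 8.0]

top = top_signed_differences(acts_success, acts_fail, k=3)
for idx, delta in top:
    # a positive value means the feature fired more strongly in the second prompt
    print(f"Feature {idx}: {delta:+.2f}")
```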


Feature descriptions as per [Neuronpedia](https://www.neuronpedia.org/gemma-scope#browse):
Note, these are for:
- GEMMA-2-9B-IT
- 20-GEMMASCOPE-RES-16k.

Feature Descriptions:

- 03655: References to data structures and conditional checks in programming
- 01600: References to music albums and bands
- 11000: Possessive pronouns and their corresponding nouns
- 14672: Words and phrases that start with "dis-" or are related to disagreement or negation
- 04395: Mathematical expressions and relationships involving variables and functions
- 11358: Topics related to cultural and historical events
- 03547: Numerical data and statistical information related to surveys or questionnaires
- 09827: Specific names and terms related to individuals, organizations, or places within various contexts
- 02495: Step-by-step instructional phrases indicating processes or actions
- 00700: Phrases related to planning and organization

Other Activations on the Chart (left to right):

- 07525: References to Slavic mythology and cultural events
- 09311: Instances of user interaction and acknowledgments in conversational text
- 09383: Terminology related to occupational safety and health regulations
- 09846: Numerical data and statistical references
- 10076: Terms related to hosting events or gatherings
- 11873: Programming-related syntax and structures, particularly involving objects and arrays in code
- 13106: Code-related keywords and structures in test definitions
- 13240: Terms related to statistical models and distributions in mathematical contexts
- 15746: Terms and references related to image and privacy settings


In [None]:
fig, _ = generate_feature_activation_plot (
    prompt_dataset=prompt_dataset,
    indices=[2,3],
    names=['_Success_Conf3', '_Fail_Conf4'],
    model=model,
    sae=sae,
    layer_hook=layer_hook
)
fig.show()

[('blocks.20.hook_resid_post.hook_sae_input', torch.Size([2, 446, 3584])), ('blocks.20.hook_resid_post.hook_sae_acts_pre', torch.Size([2, 446, 16384])), ('blocks.20.hook_resid_post.hook_sae_acts_post', torch.Size([2, 446, 16384])), ('blocks.20.hook_resid_post.hook_sae_recons', torch.Size([2, 446, 3584])), ('blocks.20.hook_resid_post.hook_sae_output', torch.Size([2, 446, 3584]))]


In [None]:
fig, _ = generate_feature_activation_plot (
    prompt_dataset=prompt_dataset,
    indices=[0,2],
    names=['_Success_Conf5', '_Success_Conf3'],
    model=model,
    sae=sae,
    layer_hook=layer_hook
)
fig.show()

In [None]:
fig, _ = generate_feature_activation_plot (
    prompt_dataset=prompt_dataset,
    indices=[0,3],
    names=['_Success_Conf5', '_Fail_Conf4'],
    model=model,
    sae=sae,
    layer_hook=layer_hook
)
fig.show()

[('blocks.20.hook_resid_post.hook_sae_input', torch.Size([2, 390, 3584])), ('blocks.20.hook_resid_post.hook_sae_acts_pre', torch.Size([2, 390, 16384])), ('blocks.20.hook_resid_post.hook_sae_acts_post', torch.Size([2, 390, 16384])), ('blocks.20.hook_resid_post.hook_sae_recons', torch.Size([2, 390, 3584])), ('blocks.20.hook_resid_post.hook_sae_output', torch.Size([2, 390, 3584]))]


In [None]:
fig, _ = generate_feature_activation_plot (
    prompt_dataset=prompt_dataset,
    indices=[0,4],
    names=['_Success_Conf5','_Success_Conf5'],
    model=model,
    sae=sae,
    layer_hook=layer_hook
)
fig.show()

In [None]:
fig, _ = generate_feature_activation_plot (
    prompt_dataset=prompt_dataset,
    indices=[3,5],
    names=['_Fail_Conf4', '_Fail_Conf4'],
    model=model,
    sae=sae,
    layer_hook=layer_hook
)
fig.show()

[('blocks.20.hook_resid_post.hook_sae_input', torch.Size([2, 1602, 3584])), ('blocks.20.hook_resid_post.hook_sae_acts_pre', torch.Size([2, 1602, 16384])), ('blocks.20.hook_resid_post.hook_sae_acts_post', torch.Size([2, 1602, 16384])), ('blocks.20.hook_resid_post.hook_sae_recons', torch.Size([2, 1602, 3584])), ('blocks.20.hook_resid_post.hook_sae_output', torch.Size([2, 1602, 3584]))]


In [14]:
fig, cache = generate_feature_activation_plot (
    prompt_dataset=prompt_dataset,
    indices=[1,2],
    names=['_Fail_Conf5', '_Success_Conf3'],
    model=model,
    sae=sae,
    layer_hook=layer_hook
)
fig.show()

[('blocks.20.hook_resid_post.hook_sae_input', torch.Size([2, 527, 3584])), ('blocks.20.hook_resid_post.hook_sae_acts_pre', torch.Size([2, 527, 16384])), ('blocks.20.hook_resid_post.hook_sae_acts_post', torch.Size([2, 527, 16384])), ('blocks.20.hook_resid_post.hook_sae_recons', torch.Size([2, 527, 3584])), ('blocks.20.hook_resid_post.hook_sae_output', torch.Size([2, 527, 3584]))]


We can see that there are differences between the two prompts. Let's list the features with the largest absolute differences at the final token position to see what they are.

In [15]:
# let's look at the biggest features in terms of absolute difference

diff = cache[layer_hook][1, -1, :].cpu() - cache[layer_hook][0, -1, :].cpu()
vals, inds = torch.topk(torch.abs(diff), 10)
for val, ind in zip(vals, inds):
    print(f"Feature {ind} had a difference of {val:.2f}")


Feature 3655 had a difference of 116.93
Feature 1600 had a difference of 62.55
Feature 11000 had a difference of 29.02
Feature 14672 had a difference of 28.46
Feature 11358 had a difference of 24.85
Feature 4395 had a difference of 24.58
Feature 3547 had a difference of 22.87
Feature 9827 had a difference of 21.14
Feature 700 had a difference of 19.08
Feature 15746 had a difference of 19.03


These are broadly the same features as in the first comparison we looked at. In fact, all the challenge pairs differ on very similar features.
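One way to put a number on this similarity is to compare the top-k feature index sets from two comparisons. The sketch below computes a Jaccard overlap. `top_features_a` holds the indices printed above; `top_features_b` is a hypothetical placeholder, not a result from this notebook, and would be replaced with `inds.tolist()` from another comparison.

```python
def jaccard_overlap(a, b):
    """Return |a ∩ b| / |a ∪ b| for two collections of feature indices."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

# Top-10 feature indices printed above for the [1, 2] comparison.
top_features_a = [3655, 1600, 11000, 14672, 11358, 4395, 3547, 9827, 700, 15746]

# HYPOTHETICAL placeholder: replace with `inds.tolist()` from another comparison.
top_features_b = [3655, 1600, 11000, 14672, 11358, 4395, 3547, 9827, 700, 12345]

overlap = jaccard_overlap(top_features_a, top_features_b)
print(f"Jaccard overlap: {overlap:.2f}")
```

An overlap close to 1.0 would support the observation that the same features dominate the differences, while a low value would suggest the comparisons differ more than they appear to.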

# Steering the Model. Intervening on SAE Features

We can 'steer' the model by either suppressing or amplifying the activations of a specific feature. This gives us an indication of the feature's role in the output.

To do this:

- We find the maximum activation of a feature in a set of text (using the activation store instantiated below),
- Use this as the default scale,
- Multiply it by the vector representing the feature (as extracted from the decoder weights),
- Multiply this by a parameter that we control.

This can be varied to see its effect on the text. We'll try steering feature 3655. Note that sometimes steering can get the model into a loop, so it's worth running this more than once.
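The arithmetic behind these steps can be sketched with toy numbers. This is a minimal illustration in plain Python rather than the tensor code used in this notebook; the vectors and `max_act` value below are made up purely to show how the scale, the feature direction and the controllable strength combine.

```python
def steer(activations, steering_vector, max_act, steering_strength):
    """Add max_act * steering_strength * steering_vector to every position."""
    scale = max_act * steering_strength
    return [
        [a + scale * v for a, v in zip(position, steering_vector)]
        for position in activations
    ]

# Toy values only: 2 token positions, 3-dimensional residual stream.
activations = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
steering_vector = [0.0, 1.0, 0.0]  # stands in for one row of the SAE decoder weights
max_act = 10.0                     # stands in for the feature's maximum activation

amplified = steer(activations, steering_vector, max_act, steering_strength=2.0)
suppressed = steer(activations, steering_vector, max_act, steering_strength=-1.0)
unchanged = steer(activations, steering_vector, max_act, steering_strength=0.0)

print(amplified)
print(suppressed)
```

A positive strength pushes the activations along the feature direction, a negative strength pushes against it, and a strength of zero leaves the activations untouched.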

In [None]:
# instantiate an object to hold activations from a dataset
from sae_lens import ActivationsStore

# a convenient way to instantiate an activation store is to use the from_sae method
activation_store = ActivationsStore.from_sae(
    model=model,
    sae=sae,
    streaming=True,
    # fairly conservative parameters here so can use same for larger
    # models without running out of memory.
    store_batch_size_prompts=8,
    train_batch_size_tokens=4096,
    n_batches_in_buffer=32,
    device=device,
)


In [None]:
from tqdm import tqdm
from functools import partial

def find_max_activation(model, sae, activation_store, feature_idx, num_batches=100):
    '''
    Find the maximum activation for a given feature index. This is useful for
    calibrating the right amount of the feature to add.
    '''
    max_activation = 0.0

    pbar = tqdm(range(num_batches))
    for _ in pbar:
        tokens = activation_store.get_batch_tokens()

        _, cache = model.run_with_cache(
            tokens,
            stop_at_layer=sae.cfg.hook_layer + 1,
            names_filter=[sae.cfg.hook_name]
        )
        sae_in = cache[sae.cfg.hook_name]
        feature_acts = sae.encode(sae_in).squeeze()

        feature_acts = feature_acts.flatten(0, 1)
        batch_max_activation = feature_acts[:, feature_idx].max().item()
        max_activation = max(max_activation, batch_max_activation)

        pbar.set_description(f"Max activation: {max_activation:.4f}")

    return max_activation

def steering(activations, hook, steering_strength=1.0, steering_vector=None, max_act=1.0):
    # Note if the feature fires anyway, we'd be adding to that here.
    return activations + max_act * steering_strength * steering_vector

def generate_with_steering(model, sae, prompt, steering_feature, max_act, steering_strength=1.0, temperature=0.0, top_p=0.7, max_new_tokens=95):
    input_ids = model.to_tokens(prompt, prepend_bos=sae.cfg.prepend_bos)

    steering_vector = sae.W_dec[steering_feature].to(model.cfg.device)

    steering_hook = partial(
        steering,
        steering_vector=steering_vector,
        steering_strength=steering_strength,
        max_act=max_act
    )

    # standard transformerlens syntax for a hook context for generation
    with model.hooks(fwd_hooks=[(sae.cfg.hook_name, steering_hook)]):
        output = model.generate(
            input_ids,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p,
            stop_at_eos = False if device == "mps" else True,
            prepend_bos = sae.cfg.prepend_bos,
        )

    return model.tokenizer.decode(output[0])
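
Because the decoded output starts with the prompt itself, it can help to separate the prompt tokens from the newly generated ones before inspecting the result. The sketch below shows the idea on plain lists of toy token ids; `split_prompt_and_completion` is a hypothetical helper, and it assumes the generated sequence keeps the prompt tokens at the front.

```python
def split_prompt_and_completion(output_ids, prompt_length):
    """Split a generated token sequence into prompt and newly generated parts."""
    if prompt_length < 0 or prompt_length > len(output_ids):
        raise ValueError("prompt_length must be between 0 and len(output_ids)")
    return output_ids[:prompt_length], output_ids[prompt_length:]

# Toy token ids only (not real Gemma tokens).
prompt_ids = [2, 101, 102, 103]
output_ids = prompt_ids + [7, 8, 9]  # assumed layout: prompt followed by new tokens

prompt_part, completion_part = split_prompt_and_completion(output_ids, len(prompt_ids))
print(completion_part)
```

In `generate_with_steering` this would correspond to decoding `output[0, input_ids.shape[1]:]` instead of `output[0]`, under the same assumption about the layout of the returned tokens.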


## Steering Feature 3655

Feature 3655 has the largest difference and is the most activated feature overall. Its description is:
"References to data structures and conditional checks in programming"

In [None]:
# Choose a feature to steer & model temp etc
steering_feature = 3655

# The following are standard for Gemma2-9b-it, and we need repeatability, hence temperature = 0
temperature = 0.0
top_p = 0.7

In [None]:
# Find the maximum activation for this feature using the function above, which is slow to run...
# max_act = find_max_activation(model, sae, activation_store, steering_feature, num_batches=100)
# OR
# We could also get the max activation from Neuronpedia (https://www.neuronpedia.org/api-doc#tag/lookup/GET/api/feature/{modelId}/{layer}/{index})
# Maximum activation for feature 3655: 183.1667

max_act = 183.1667
print(f"Maximum activation for feature {steering_feature}: {max_act:.4f}")


Maximum activation for feature 3655: 183.1667
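
As a rough sketch of the Neuronpedia route mentioned in the comment above, the helpers below fill in the URL template from that comment and read a maximum-activation field from a JSON response. The model and layer identifiers and the `maxActApprox` field name are assumptions that this notebook does not confirm, so they should be checked against the Neuronpedia API documentation before use.

```python
import json
from urllib.request import urlopen

def neuronpedia_feature_url(model_id, layer, index):
    """Fill in the path template /api/feature/{modelId}/{layer}/{index}."""
    return f"https://www.neuronpedia.org/api/feature/{model_id}/{layer}/{index}"

def parse_max_activation(payload_text, field="maxActApprox"):
    """Extract the max activation from a JSON payload (field name is an ASSUMPTION)."""
    return float(json.loads(payload_text)[field])

def fetch_max_activation(url, field="maxActApprox", timeout=30):
    """Download the feature record and return its max activation (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return parse_max_activation(response.read().decode("utf-8"), field)

# ASSUMED identifiers for the layer-20 residual SAE; verify before use.
url = neuronpedia_feature_url("gemma-2-9b-it", "20-gemmascope-res-16k", 3655)
print(url)
# max_act = fetch_max_activation(url)  # uncomment when network access is available
```

Keeping the URL construction and the parsing separate from the network call makes it easy to check each piece offline before making a request.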


In [None]:
# First analyse a successful prompt, how does feature 3655 affect it?
prompt = prompt_dataset[2]['prompt_text'] # 'd037b0a7.json_Success_Conf3'. Target: [[4, 0, 8], [4, 0, 8], [4, 7, 8]]

# Experiment with different steering strengths
print("\nExperimenting with different steering strengths:")
for strength in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    steered_text = generate_with_steering(model, sae, prompt, steering_feature, max_act, steering_strength=strength)
    print("\n\n===================================\n")
    print(f"\nSteering strength {strength}:")
    print(steered_text)


Experimenting with different steering strengths:



Steering strength -2.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[0, 0, 6],
       [0, 4, 0],
       [3, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 0, 6],
       [0, 4, 6],
       [3, 4, 6]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[0, 2, 0],
       [7, 0, 8],
       [0, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 2, 0],
       [7, 2, 8],
       [7, 2, 8]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
array([[4, 0, 0],
       [0, 2, 0],
       [0, 0, 0]])
O


Steering strength -1.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[0, 0, 6],
       [0, 4, 0],
       [3, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 0, 6],
       [0, 4, 6],
       [3, 4, 6]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[0, 2, 0],
       [7, 0, 8],
       [0, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 2, 0],
       [7, 2, 8],
       [7, 2, 8]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
array([[4, 0, 0],
       [0, 2, 0],
       [0, 0, 0]])
O


Steering strength 0.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[0, 0, 6],
       [0, 4, 0],
       [3, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 0, 6],
       [0, 4, 6],
       [3, 4, 6]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[0, 2, 0],
       [7, 0, 8],
       [0, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 2, 0],
       [7, 2, 8],
       [7, 2, 8]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
array([[4, 0, 0],
       [0, 2, 0],
       [0, 0, 0]])
OU


Steering strength 1.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[0, 0, 6],
       [0, 4, 0],
       [3, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 0, 6],
       [0, 4, 6],
       [3, 4, 6]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[0, 2, 0],
       [7, 0, 8],
       [0, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 2, 0],
       [7, 2, 8],
       [7, 2, 8]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
array([[4, 0, 0],
       [0, 2, 0],
       [0, 0, 0]])
OU


Steering strength 2.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[0, 0, 6],
       [0, 4, 0],
       [3, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 0, 6],
       [0, 4, 6],
       [3, 4, 6]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[0, 2, 0],
       [7, 0, 8],
       [0, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 2, 0],
       [7, 2, 8],
       [7, 2, 8]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
array([[4, 0, 0],
       [0, 2, 0],
       [0, 0, 0]])
OU
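
When sweeping over strengths like this, it can be easier to collect the outputs and compare each one against the unsteered baseline (strength 0.0) than to read the printed text by eye. The sketch below is one way to do that; `fake_generate` is a hypothetical stand-in for `generate_with_steering` and its behaviour is invented purely so the example runs on its own.

```python
def sweep_strengths(generate_fn, strengths):
    """Run generate_fn(strength) for each strength and collect the outputs."""
    return {strength: generate_fn(strength) for strength in strengths}

def changed_from_baseline(results, baseline=0.0):
    """Return the strengths whose output differs from the baseline output."""
    base = results[baseline]
    return [s for s, text in results.items() if s != baseline and text != base]

# HYPOTHETICAL stand-in for generate_with_steering; not a result from this notebook.
def fake_generate(strength):
    return "steered" if strength <= -2.0 else "baseline"

results = sweep_strengths(fake_generate, [-2.0, -1.0, 0.0, 1.0, 2.0])
print(changed_from_baseline(results))
```

In this notebook, `generate_fn` could be something like `lambda s: generate_with_steering(model, sae, prompt, steering_feature, max_act, steering_strength=s)`.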

In [None]:
# Next analyse a failed prompt: how does feature 3655 affect it?
prompt = prompt_dataset[1]['prompt_text']
# 'c9e6f938.json_Fail_Conf5'.
# Target:
# [[7, 7, 0, 0, 7, 7],
#  [0, 7, 0, 0, 7, 0],
#  [0, 0, 7, 7, 0, 0]]

print(f"File = {prompt_dataset[1]['filename']}")

# Experiment with different steering strengths
print("\nExperimenting with different steering strengths:")
for strength in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    steered_text = generate_with_steering(model, sae, prompt, steering_feature, max_act, steering_strength=strength)
    print("\n\n===================================\n")
    print(f"\nSteering strength {strength}:")
    print(steered_text)

File = c9e6f938.json

Experimenting with different steering strengths:



Steering strength -2.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[0, 7, 0],
       [0, 0, 7],
       [0, 7, 7]])
OUTPUT. Shape=(3, 6)
array([[0, 7, 0, 0, 7, 0],
       [0, 0, 7, 7, 0, 0],
       [0, 7, 7, 7, 7, 0]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[0, 0, 0],
       [0, 7, 7],
       [0, 0, 0]])
OUTPUT. Shape=(3, 6)
array([[0, 0, 0, 0, 0, 0],
       [0, 7, 7, 7, 7, 0],
       [0, 0, 0, 0, 0, 0]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
ar


Steering strength -1.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[0, 7, 0],
       [0, 0, 7],
       [0, 7, 7]])
OUTPUT. Shape=(3, 6)
array([[0, 7, 0, 0, 7, 0],
       [0, 0, 7, 7, 0, 0],
       [0, 7, 7, 7, 7, 0]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[0, 0, 0],
       [0, 7, 7],
       [0, 0, 0]])
OUTPUT. Shape=(3, 6)
array([[0, 0, 0, 0, 0, 0],
       [0, 7, 7, 7, 7, 0],
       [0, 0, 0, 0, 0, 0]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
ar


Steering strength 0.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[0, 7, 0],
       [0, 0, 7],
       [0, 7, 7]])
OUTPUT. Shape=(3, 6)
array([[0, 7, 0, 0, 7, 0],
       [0, 0, 7, 7, 0, 0],
       [0, 7, 7, 7, 7, 0]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[0, 0, 0],
       [0, 7, 7],
       [0, 0, 0]])
OUTPUT. Shape=(3, 6)
array([[0, 0, 0, 0, 0, 0],
       [0, 7, 7, 7, 7, 0],
       [0, 0, 0, 0, 0, 0]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
arr


Steering strength 1.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[0, 7, 0],
       [0, 0, 7],
       [0, 7, 7]])
OUTPUT. Shape=(3, 6)
array([[0, 7, 0, 0, 7, 0],
       [0, 0, 7, 7, 0, 0],
       [0, 7, 7, 7, 7, 0]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[0, 0, 0],
       [0, 7, 7],
       [0, 0, 0]])
OUTPUT. Shape=(3, 6)
array([[0, 0, 0, 0, 0, 0],
       [0, 7, 7, 7, 7, 0],
       [0, 0, 0, 0, 0, 0]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
arr


Steering strength 2.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[0, 7, 0],
       [0, 0, 7],
       [0, 7, 7]])
OUTPUT. Shape=(3, 6)
array([[0, 7, 0, 0, 7, 0],
       [0, 0, 7, 7, 0, 0],
       [0, 7, 7, 7, 7, 0]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[0, 0, 0],
       [0, 7, 7],
       [0, 0, 0]])
OUTPUT. Shape=(3, 6)
array([[0, 0, 0, 0, 0, 0],
       [0, 7, 7, 7, 7, 0],
       [0, 0, 0, 0, 0, 0]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
arr

In [None]:
steering_feature = 3655

prompt = prompt_dataset[3]['prompt_text']
# '6150a2bd.json_Fail_Conf4'.
# Target:
#[[0, 0, 4],
# [0, 8, 6],
# [5, 3, 6]]

print(f"File = {prompt_dataset[3]['filename']}")

# Experiment with different steering strengths
print("\nExperimenting with different steering strengths:")
for strength in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    steered_text = generate_with_steering(model, sae, prompt, steering_feature, max_act, steering_strength=strength)
    print("\n\n===================================\n")
    print(f"\nSteering strength {strength}:")
    print(steered_text)

File = 6150a2bd.json

Experimenting with different steering strengths:



Steering strength -2.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[3, 3, 8],
       [3, 7, 0],
       [5, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 0, 5],
       [0, 7, 3],
       [8, 3, 3]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[5, 5, 2],
       [1, 0, 0],
       [0, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 0, 0],
       [0, 0, 1],
       [2, 5, 5]])
TEST Pair 0
INPUT. Shape=(3, 3)
array([[6, 3, 5],
       [6, 8, 0],
       [4, 0, 0]])
OU


Steering strength -1.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[3, 3, 8],
       [3, 7, 0],
       [5, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 0, 5],
       [0, 7, 3],
       [8, 3, 3]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[5, 5, 2],
       [1, 0, 0],
       [0, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 0, 0],
       [0, 0, 1],
       [2, 5, 5]])
TEST Pair 0
INPUT. Shape=(3, 3)
array([[6, 3, 5],
       [6, 8, 0],
       [4, 0, 0]])
OU


Steering strength 0.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[3, 3, 8],
       [3, 7, 0],
       [5, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 0, 5],
       [0, 7, 3],
       [8, 3, 3]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[5, 5, 2],
       [1, 0, 0],
       [0, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 0, 0],
       [0, 0, 1],
       [2, 5, 5]])
TEST Pair 0
INPUT. Shape=(3, 3)
array([[6, 3, 5],
       [6, 8, 0],
       [4, 0, 0]])
OUT


Steering strength 1.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[3, 3, 8],
       [3, 7, 0],
       [5, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 0, 5],
       [0, 7, 3],
       [8, 3, 3]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[5, 5, 2],
       [1, 0, 0],
       [0, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 0, 0],
       [0, 0, 1],
       [2, 5, 5]])
TEST Pair 0
INPUT. Shape=(3, 3)
array([[6, 3, 5],
       [6, 8, 0],
       [4, 0, 0]])
OUT


Steering strength 2.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[3, 3, 8],
       [3, 7, 0],
       [5, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 0, 5],
       [0, 7, 3],
       [8, 3, 3]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[5, 5, 2],
       [1, 0, 0],
       [0, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 0, 0],
       [0, 0, 1],
       [2, 5, 5]])
TEST Pair 0
INPUT. Shape=(3, 3)
array([[6, 3, 5],
       [6, 8, 0],
       [4, 0, 0]])
OUT

In [None]:
steering_feature = 3655

prompt = prompt_dataset[4]['prompt_text']
# '6fa7a44f.json_Success_Conf5'.
# Target:
#[[2, 9, 2],
# [8, 5, 2],
# [2, 2, 8],
# [2, 2, 8],


print(f"File = {prompt_dataset[4]['filename']}")

# Experiment with different steering strengths
print("\nExperimenting with different steering strengths:")
for strength in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    steered_text = generate_with_steering(model, sae, prompt, steering_feature, max_act, steering_strength=strength)
    print("\n\n===================================\n")
    print(f"\nSteering strength {strength}:")
    print(steered_text)

File = 6fa7a44f.json

Experimenting with different steering strengths:



Steering strength -2.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[9, 1, 4],
       [9, 1, 4],
       [2, 1, 1]])
OUTPUT. Shape=(6, 3)
array([[9, 1, 4],
       [9, 1, 4],
       [2, 1, 1],
       [2, 1, 1],
       [9, 1, 4],
       [9, 1, 4]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[4, 8, 4],
       [7, 6, 7],
       [8, 7, 8]])
OUTPUT. Shape=(6, 3)
array([[4, 8, 4],
       [7, 6, 7],
       [8, 7, 8],
       [8, 7, 8],
       [7, 6, 7],



Steering strength -1.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[9, 1, 4],
       [9, 1, 4],
       [2, 1, 1]])
OUTPUT. Shape=(6, 3)
array([[9, 1, 4],
       [9, 1, 4],
       [2, 1, 1],
       [2, 1, 1],
       [9, 1, 4],
       [9, 1, 4]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[4, 8, 4],
       [7, 6, 7],
       [8, 7, 8]])
OUTPUT. Shape=(6, 3)
array([[4, 8, 4],
       [7, 6, 7],
       [8, 7, 8],
       [8, 7, 8],
       [7, 6, 7],



Steering strength 0.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[9, 1, 4],
       [9, 1, 4],
       [2, 1, 1]])
OUTPUT. Shape=(6, 3)
array([[9, 1, 4],
       [9, 1, 4],
       [2, 1, 1],
       [2, 1, 1],
       [9, 1, 4],
       [9, 1, 4]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[4, 8, 4],
       [7, 6, 7],
       [8, 7, 8]])
OUTPUT. Shape=(6, 3)
array([[4, 8, 4],
       [7, 6, 7],
       [8, 7, 8],
       [8, 7, 8],
       [7, 6, 7],
 


Steering strength 1.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[9, 1, 4],
       [9, 1, 4],
       [2, 1, 1]])
OUTPUT. Shape=(6, 3)
array([[9, 1, 4],
       [9, 1, 4],
       [2, 1, 1],
       [2, 1, 1],
       [9, 1, 4],
       [9, 1, 4]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[4, 8, 4],
       [7, 6, 7],
       [8, 7, 8]])
OUTPUT. Shape=(6, 3)
array([[4, 8, 4],
       [7, 6, 7],
       [8, 7, 8],
       [8, 7, 8],
       [7, 6, 7],
 


Steering strength 2.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[9, 1, 4],
       [9, 1, 4],
       [2, 1, 1]])
OUTPUT. Shape=(6, 3)
array([[9, 1, 4],
       [9, 1, 4],
       [2, 1, 1],
       [2, 1, 1],
       [9, 1, 4],
       [9, 1, 4]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[4, 8, 4],
       [7, 6, 7],
       [8, 7, 8]])
OUTPUT. Shape=(6, 3)
array([[4, 8, 4],
       [7, 6, 7],
       [8, 7, 8],
       [8, 7, 8],
       [7, 6, 7],
 

In [None]:
steering_feature = 3655

prompt = prompt_dataset[6]['prompt_text']
# '3c9b0459.json_Fail_Conf3'.
# Target:
#[[7, 6, 4],
# [4, 6, 6],
# [4, 4, 6]]

print(f"File = {prompt_dataset[6]['filename']}")

# Experiment with different steering strengths
print("\nExperimenting with different steering strengths:")
for strength in [-2.5, -3.0]:
    steered_text = generate_with_steering(model, sae, prompt, steering_feature, max_act, steering_strength=strength)
    print("\n\n===================================\n")
    print(f"\nSteering strength {strength}:")
    print(steered_text)

File = 3c9b0459.json

Experimenting with different steering strengths:



Steering strength -2.5:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[2, 2, 1],
       [2, 1, 2],
       [2, 8, 1]])
OUTPUT. Shape=(3, 3)
array([[1, 8, 2],
       [2, 1, 2],
       [1, 2, 2]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[9, 2, 4],
       [2, 4, 4],
       [2, 9, 2]])
OUTPUT. Shape=(3, 3)
array([[2, 9, 2],
       [4, 4, 2],
       [4, 2, 9]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
array([[8, 8, 8],
       [5, 5, 8],
       [8, 5, 5]])
O


Steering strength -3.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[2, 2, 1],
       [2, 1, 2],
       [2, 8, 1]])
OUTPUT. Shape=(3, 3)
array([[1, 8, 2],
       [2, 1, 2],
       [1, 2, 2]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[9, 2, 4],
       [2, 4, 4],
       [2, 9, 2]])
OUTPUT. Shape=(3, 3)
array([[2, 9, 2],
       [4, 4, 2],
       [4, 2, 9]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
array([[8, 8, 8],
       [5, 5, 8],
       [8, 5, 5]])
O
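
Each decoded string printed above begins with `<bos>` followed by the full prompt, so the model's answer is whatever comes after that echo. Below is a minimal sketch of separating the two and reading the leading confidence digit that the prompt asks for. `split_completion` and `leading_confidence` are hypothetical helper names, the toy strings are made up for illustration, and the approach assumes the decoded text reproduces the prompt verbatim.

```python
import re


def split_completion(decoded: str, prompt: str, bos: str = "<bos>") -> str:
    """Return the text that follows the echoed prompt (hypothetical helper)."""
    text = decoded[len(bos):] if decoded.startswith(bos) else decoded
    if text.startswith(prompt):
        return text[len(prompt):]
    # Prompt not found verbatim: return everything rather than guess a split.
    return text


def leading_confidence(completion: str):
    """Return the leading 0-5 digit (after optional whitespace), else None."""
    match = re.match(r"\s*([0-5])", completion)
    return int(match.group(1)) if match else None


# Toy strings for illustration only; they are not model output.
toy_prompt = "TEST Pair 0\nINPUT. Shape=(1, 1)\narray([[1]])\nOUTPUT. "
toy_decoded = "<bos>" + toy_prompt + "3\narray([[1]])"

completion = split_completion(toy_decoded, toy_prompt)
print(repr(completion))
print(leading_confidence(completion))
```

In the notebook, the value returned by `generate_with_steering` would take the place of `toy_decoded`.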

In [None]:
steering_feature = 3655

prompt = prompt_dataset[7]['prompt_text']
# 'ed36ccf7.json_Fail_Conf3'.
# Target:
#[[0, 0, 5],
# [0, 0, 5],
# [0, 5, 0]]

print(f"File = {prompt_dataset[7]['filename']}")

# Experiment with different steering strengths
print("\nExperimenting with different steering strengths:")
for strength in [-3.0, -2.5]:
    steered_text = generate_with_steering(model, sae, prompt, steering_feature, max_act, steering_strength=strength)
    print("\n\n===================================\n")
    print(f"\nSteering strength {strength}:")
    print(steered_text)

File = ed36ccf7.json

Experimenting with different steering strengths:



Steering strength -3.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[9, 0, 0],
       [9, 9, 9],
       [9, 9, 9]])
OUTPUT. Shape=(3, 3)
array([[0, 9, 9],
       [0, 9, 9],
       [9, 9, 9]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[6, 6, 6],
       [0, 0, 0],
       [6, 6, 0]])
OUTPUT. Shape=(3, 3)
array([[6, 0, 0],
       [6, 0, 6],
       [6, 0, 6]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
array([[0, 0, 9],
       [0, 0, 9],
       [9, 9, 9]])
O


Steering strength -2.5:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[9, 0, 0],
       [9, 9, 9],
       [9, 9, 9]])
OUTPUT. Shape=(3, 3)
array([[0, 9, 9],
       [0, 9, 9],
       [9, 9, 9]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[6, 6, 6],
       [0, 0, 0],
       [6, 6, 0]])
OUTPUT. Shape=(3, 3)
array([[6, 0, 0],
       [6, 0, 6],
       [6, 0, 6]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
array([[0, 0, 9],
       [0, 0, 9],
       [9, 9, 9]])
O
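
The prompt asks for the prediction in `np.array` format, and the cell above records the target in a comment. One way to score a completion automatically is to parse the first `array([...])` literal and compare it with that target. This is a sketch under stated assumptions: `parse_predicted_array` and `matches_target` are hypothetical helpers, and `toy_completion` is an illustrative string, not something the model produced.

```python
import ast
import re

import numpy as np


def parse_predicted_array(completion: str):
    """Parse the first `array([...])` literal in the text, or return None."""
    match = re.search(r"array\((\[.*?\])\)", completion, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return np.array(ast.literal_eval(match.group(1)))
    except (ValueError, SyntaxError):
        # Malformed or ragged literal: treat as "no usable prediction".
        return None


def matches_target(pred, target) -> bool:
    """True only when shapes agree and every cell is equal."""
    return pred is not None and pred.shape == target.shape and bool((pred == target).all())


# Target copied from the cell comment for ed36ccf7.json.
target = np.array([[0, 0, 5],
                   [0, 0, 5],
                   [0, 5, 0]])

# Illustrative completion only; not taken from the model.
toy_completion = "3\narray([[0, 0, 5],\n       [0, 0, 5],\n       [0, 5, 0]])"

pred = parse_predicted_array(toy_completion)
print(matches_target(pred, target))
```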

In [None]:
steering_feature = 3655

prompt = prompt_dataset[9]['prompt_text']
# Labelled 'b1948b0a.json_Fail_Conf4', but the recorded run printed File = 017c7c7b.json for index 9.
# Target:

print(f"File = {prompt_dataset[9]['filename']}")

# Experiment with different steering strengths
print("\nExperimenting with different steering strengths:")
for strength in [-2.0,-1.0,0.0,1.0,2.0]:
    steered_text = generate_with_steering(model, sae, prompt, steering_feature, max_act, steering_strength=strength)
    print("\n\n===================================\n")
    print(f"\nSteering strength {strength}:")
    print(steered_text)

File = 017c7c7b.json

Experimenting with different steering strengths:



Steering strength -2.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(6, 3)
array([[0, 1, 0],
       [1, 1, 0],
       [0, 1, 0],
       [0, 1, 1],
       [0, 1, 0],
       [1, 1, 0]])
OUTPUT. Shape=(9, 3)
array([[0, 2, 0],
       [2, 2, 0],
       [0, 2, 0],
       [0, 2, 2],
       [0, 2, 0],
       [2, 2, 0],
       [0, 2, 0],
       [0, 2, 2],
       [0, 2, 0]])
TRAIN Pair 1
INPUT. Shape=(6, 3)
array([[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0],
    


Steering strength -1.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(6, 3)
array([[0, 1, 0],
       [1, 1, 0],
       [0, 1, 0],
       [0, 1, 1],
       [0, 1, 0],
       [1, 1, 0]])
OUTPUT. Shape=(9, 3)
array([[0, 2, 0],
       [2, 2, 0],
       [0, 2, 0],
       [0, 2, 2],
       [0, 2, 0],
       [2, 2, 0],
       [0, 2, 0],
       [0, 2, 2],
       [0, 2, 0]])
TRAIN Pair 1
INPUT. Shape=(6, 3)
array([[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0],
    


Steering strength 0.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(6, 3)
array([[0, 1, 0],
       [1, 1, 0],
       [0, 1, 0],
       [0, 1, 1],
       [0, 1, 0],
       [1, 1, 0]])
OUTPUT. Shape=(9, 3)
array([[0, 2, 0],
       [2, 2, 0],
       [0, 2, 0],
       [0, 2, 2],
       [0, 2, 0],
       [2, 2, 0],
       [0, 2, 0],
       [0, 2, 2],
       [0, 2, 0]])
TRAIN Pair 1
INPUT. Shape=(6, 3)
array([[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0],
     


Steering strength 1.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(6, 3)
array([[0, 1, 0],
       [1, 1, 0],
       [0, 1, 0],
       [0, 1, 1],
       [0, 1, 0],
       [1, 1, 0]])
OUTPUT. Shape=(9, 3)
array([[0, 2, 0],
       [2, 2, 0],
       [0, 2, 0],
       [0, 2, 2],
       [0, 2, 0],
       [2, 2, 0],
       [0, 2, 0],
       [0, 2, 2],
       [0, 2, 0]])
TRAIN Pair 1
INPUT. Shape=(6, 3)
array([[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0],
     


Steering strength 2.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(6, 3)
array([[0, 1, 0],
       [1, 1, 0],
       [0, 1, 0],
       [0, 1, 1],
       [0, 1, 0],
       [1, 1, 0]])
OUTPUT. Shape=(9, 3)
array([[0, 2, 0],
       [2, 2, 0],
       [0, 2, 0],
       [0, 2, 2],
       [0, 2, 0],
       [2, 2, 0],
       [0, 2, 0],
       [0, 2, 2],
       [0, 2, 0]])
TRAIN Pair 1
INPUT. Shape=(6, 3)
array([[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0],
     

## Steering Feature 1600

Feature 1600 is the second-largest feature. Its description reads:
"References to music albums and bands"

So what do bands have to do with these ARC challenges? Let's use steering to find out what impact this feature has.
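
As a reminder of what a steering strength means numerically, here is a minimal NumPy sketch. It assumes `generate_with_steering` (defined earlier in the notebook and not reproduced here) follows the common pattern of adding `steering_strength * max_act * decoder_direction` to the hooked residual stream. That is an assumption about the helper, not a description of its actual code. `steer_residual`, the random toy tensors and `d_model = 8` are illustrative only; 63.0995 is the maximum activation this notebook records for feature 1600.

```python
import numpy as np


def steer_residual(resid, decoder_direction, max_act, steering_strength):
    """Add a scaled feature direction to every position of the residual stream."""
    return resid + steering_strength * max_act * decoder_direction


rng = np.random.default_rng(0)
d_model = 8                                   # toy width, not Gemma's real size
resid = rng.normal(size=(1, 4, d_model))      # (batch, position, d_model)
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)        # unit-norm stand-in for an SAE decoder row

for strength in [-2.0, 0.0, 2.0]:
    steered = steer_residual(resid, direction, max_act=63.0995, steering_strength=strength)
    # Projection of the change onto the feature direction, at the first position.
    shift = (steered - resid) @ direction
    print(strength, np.round(shift[0, 0], 4))
```

With a unit-norm direction the projected shift equals `steering_strength * max_act`, so a strength of 0.0 leaves the activations unchanged and serves as the baseline.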



In [None]:
# Choose a feature to steer
steering_feature = 1600

In [None]:
# Find the maximum activation for this feature with the function above (slow to run):
# max_act = find_max_activation(model, sae, activation_store, feature_idx=1600, num_batches=100)
# Alternatively, look up the max activation on Neuronpedia:
# https://www.neuronpedia.org/api-doc#tag/lookup/GET/api/feature/{modelId}/{layer}/{index}
# Maximum activation for feature 1600:

max_act = 63.0995

print(f"Maximum activation for feature {steering_feature}: {max_act:.4f}")

In [None]:
# First analyse a successful prompt: how does feature 1600 affect it?
prompt = prompt_dataset[2]['prompt_text'] # 'd037b0a7.json_Success_Conf3'.
# Target:
#[[4, 0, 8],
# [4, 0, 8],
# [4, 7, 8]]

# Experiment with different steering strengths
print(f"\nExperimenting with different steering strengths: {prompt_dataset[2]['filename']}")
for strength in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    steered_text = generate_with_steering(model, sae, prompt, steering_feature, max_act, steering_strength=strength)
    print("\n\n===================================\n")
    print(f"\nSteering strength {strength}:")
    print(steered_text)


Experimenting with different steering strengths: d037b0a7.json



Steering strength -2.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[0, 0, 6],
       [0, 4, 0],
       [3, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 0, 6],
       [0, 4, 6],
       [3, 4, 6]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[0, 2, 0],
       [7, 0, 8],
       [0, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 2, 0],
       [7, 2, 8],
       [7, 2, 8]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
array([[4, 0, 0],
       [0, 2, 0],
       [0, 0, 0]])
O


Steering strength -1.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[0, 0, 6],
       [0, 4, 0],
       [3, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 0, 6],
       [0, 4, 6],
       [3, 4, 6]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[0, 2, 0],
       [7, 0, 8],
       [0, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 2, 0],
       [7, 2, 8],
       [7, 2, 8]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
array([[4, 0, 0],
       [0, 2, 0],
       [0, 0, 0]])
O


Steering strength 0.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[0, 0, 6],
       [0, 4, 0],
       [3, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 0, 6],
       [0, 4, 6],
       [3, 4, 6]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[0, 2, 0],
       [7, 0, 8],
       [0, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 2, 0],
       [7, 2, 8],
       [7, 2, 8]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
array([[4, 0, 0],
       [0, 2, 0],
       [0, 0, 0]])
OU


Steering strength 1.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[0, 0, 6],
       [0, 4, 0],
       [3, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 0, 6],
       [0, 4, 6],
       [3, 4, 6]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[0, 2, 0],
       [7, 0, 8],
       [0, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 2, 0],
       [7, 2, 8],
       [7, 2, 8]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
array([[4, 0, 0],
       [0, 2, 0],
       [0, 0, 0]])
OU


Steering strength 2.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[0, 0, 6],
       [0, 4, 0],
       [3, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 0, 6],
       [0, 4, 6],
       [3, 4, 6]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[0, 2, 0],
       [7, 0, 8],
       [0, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 2, 0],
       [7, 2, 8],
       [7, 2, 8]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
array([[4, 0, 0],
       [0, 2, 0],
       [0, 0, 0]])
OU
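
The five outputs above are nearly identical, which makes differences between steering strengths hard to spot by eye. One option is to store each output keyed by strength and report the character index where it first diverges from the unsteered (0.0) baseline. This is a sketch only: `first_divergence` and `compare_to_baseline` are hypothetical helpers, and the `outputs` dict holds short placeholder strings rather than real generations.

```python
def first_divergence(a: str, b: str):
    """Index of the first differing character, or None if the strings are equal."""
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i
    if len(a) != len(b):
        # One string is a strict prefix of the other.
        return min(len(a), len(b))
    return None


def compare_to_baseline(outputs: dict, baseline: float = 0.0) -> dict:
    """Map each non-baseline strength to its first divergence from the baseline."""
    base = outputs[baseline]
    return {s: first_divergence(base, text) for s, text in outputs.items() if s != baseline}


# Placeholder strings; in the notebook these would come from generate_with_steering.
outputs = {-2.0: "prompt...O", 0.0: "prompt...OU", 2.0: "prompt...OU"}
report = compare_to_baseline(outputs)
for strength, idx in report.items():
    print(strength, "identical to baseline" if idx is None else f"diverges at char {idx}")
```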

In [None]:
steering_feature = 1600

prompt = prompt_dataset[1]['prompt_text']
# 'c9e6f938.json_Fail_Conf5'.
# Target:
# [[7, 7, 0, 0, 7, 7],
#  [0, 7, 0, 0, 7, 0],
#  [0, 0, 7, 7, 0, 0]]

print(f"File = {prompt_dataset[1]['filename']}")

# Experiment with different steering strengths
print("\nExperimenting with different steering strengths:")
for strength in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    steered_text = generate_with_steering(model, sae, prompt, steering_feature, max_act, steering_strength=strength)
    print("\n\n===================================\n")
    print(f"\nSteering strength {strength}:")
    print(steered_text)

File = c9e6f938.json

Experimenting with different steering strengths:



Steering strength -2.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[0, 7, 0],
       [0, 0, 7],
       [0, 7, 7]])
OUTPUT. Shape=(3, 6)
array([[0, 7, 0, 0, 7, 0],
       [0, 0, 7, 7, 0, 0],
       [0, 7, 7, 7, 7, 0]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[0, 0, 0],
       [0, 7, 7],
       [0, 0, 0]])
OUTPUT. Shape=(3, 6)
array([[0, 0, 0, 0, 0, 0],
       [0, 7, 7, 7, 7, 0],
       [0, 0, 0, 0, 0, 0]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
ar


Steering strength -1.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[0, 7, 0],
       [0, 0, 7],
       [0, 7, 7]])
OUTPUT. Shape=(3, 6)
array([[0, 7, 0, 0, 7, 0],
       [0, 0, 7, 7, 0, 0],
       [0, 7, 7, 7, 7, 0]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[0, 0, 0],
       [0, 7, 7],
       [0, 0, 0]])
OUTPUT. Shape=(3, 6)
array([[0, 0, 0, 0, 0, 0],
       [0, 7, 7, 7, 7, 0],
       [0, 0, 0, 0, 0, 0]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
ar


Steering strength 0.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[0, 7, 0],
       [0, 0, 7],
       [0, 7, 7]])
OUTPUT. Shape=(3, 6)
array([[0, 7, 0, 0, 7, 0],
       [0, 0, 7, 7, 0, 0],
       [0, 7, 7, 7, 7, 0]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[0, 0, 0],
       [0, 7, 7],
       [0, 0, 0]])
OUTPUT. Shape=(3, 6)
array([[0, 0, 0, 0, 0, 0],
       [0, 7, 7, 7, 7, 0],
       [0, 0, 0, 0, 0, 0]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
arr


Steering strength 1.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[0, 7, 0],
       [0, 0, 7],
       [0, 7, 7]])
OUTPUT. Shape=(3, 6)
array([[0, 7, 0, 0, 7, 0],
       [0, 0, 7, 7, 0, 0],
       [0, 7, 7, 7, 7, 0]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[0, 0, 0],
       [0, 7, 7],
       [0, 0, 0]])
OUTPUT. Shape=(3, 6)
array([[0, 0, 0, 0, 0, 0],
       [0, 7, 7, 7, 7, 0],
       [0, 0, 0, 0, 0, 0]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
arr


Steering strength 2.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[0, 7, 0],
       [0, 0, 7],
       [0, 7, 7]])
OUTPUT. Shape=(3, 6)
array([[0, 7, 0, 0, 7, 0],
       [0, 0, 7, 7, 0, 0],
       [0, 7, 7, 7, 7, 0]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[0, 0, 0],
       [0, 7, 7],
       [0, 0, 0]])
OUTPUT. Shape=(3, 6)
array([[0, 0, 0, 0, 0, 0],
       [0, 7, 7, 7, 7, 0],
       [0, 0, 0, 0, 0, 0]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
arr

In [None]:
steering_feature = 1600

prompt = prompt_dataset[3]['prompt_text']
# '6150a2bd.json_Fail_Conf4'.
# Target:
#[[0, 0, 4],
# [0, 8, 6],
# [5, 3, 6]]

print(f"File = {prompt_dataset[3]['filename']}")

# Experiment with different steering strengths
print("\nExperimenting with different steering strengths:")
for strength in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    steered_text = generate_with_steering(model, sae, prompt, steering_feature, max_act, steering_strength=strength)
    print("\n\n===================================\n")
    print(f"\nSteering strength {strength}:")
    print(steered_text)

File = 6150a2bd.json

Experimenting with different steering strengths:



Steering strength -2.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[3, 3, 8],
       [3, 7, 0],
       [5, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 0, 5],
       [0, 7, 3],
       [8, 3, 3]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[5, 5, 2],
       [1, 0, 0],
       [0, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 0, 0],
       [0, 0, 1],
       [2, 5, 5]])
TEST Pair 0
INPUT. Shape=(3, 3)
array([[6, 3, 5],
       [6, 8, 0],
       [4, 0, 0]])
OU


Steering strength -1.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[3, 3, 8],
       [3, 7, 0],
       [5, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 0, 5],
       [0, 7, 3],
       [8, 3, 3]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[5, 5, 2],
       [1, 0, 0],
       [0, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 0, 0],
       [0, 0, 1],
       [2, 5, 5]])
TEST Pair 0
INPUT. Shape=(3, 3)
array([[6, 3, 5],
       [6, 8, 0],
       [4, 0, 0]])
OU


Steering strength 0.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[3, 3, 8],
       [3, 7, 0],
       [5, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 0, 5],
       [0, 7, 3],
       [8, 3, 3]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[5, 5, 2],
       [1, 0, 0],
       [0, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 0, 0],
       [0, 0, 1],
       [2, 5, 5]])
TEST Pair 0
INPUT. Shape=(3, 3)
array([[6, 3, 5],
       [6, 8, 0],
       [4, 0, 0]])
OUT


Steering strength 1.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[3, 3, 8],
       [3, 7, 0],
       [5, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 0, 5],
       [0, 7, 3],
       [8, 3, 3]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[5, 5, 2],
       [1, 0, 0],
       [0, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 0, 0],
       [0, 0, 1],
       [2, 5, 5]])
TEST Pair 0
INPUT. Shape=(3, 3)
array([[6, 3, 5],
       [6, 8, 0],
       [4, 0, 0]])
OUT


Steering strength 2.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[3, 3, 8],
       [3, 7, 0],
       [5, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 0, 5],
       [0, 7, 3],
       [8, 3, 3]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[5, 5, 2],
       [1, 0, 0],
       [0, 0, 0]])
OUTPUT. Shape=(3, 3)
array([[0, 0, 0],
       [0, 0, 1],
       [2, 5, 5]])
TEST Pair 0
INPUT. Shape=(3, 3)
array([[6, 3, 5],
       [6, 8, 0],
       [4, 0, 0]])
OUT
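
To make the target for 6150a2bd.json concrete, the train pairs shown in the prompt can be checked against a few simple candidate mappings. This is a sketch for illustration: the candidate list and the `consistent_mappings` helper are assumptions of this example, not part of the notebook, and other mappings might also fit two train pairs.

```python
import numpy as np

# Train pairs and test input copied from the prompt above.
train_pairs = [
    (np.array([[3, 3, 8], [3, 7, 0], [5, 0, 0]]),
     np.array([[0, 0, 5], [0, 7, 3], [8, 3, 3]])),
    (np.array([[5, 5, 2], [1, 0, 0], [0, 0, 0]]),
     np.array([[0, 0, 0], [0, 0, 1], [2, 5, 5]])),
]
test_input = np.array([[6, 3, 5], [6, 8, 0], [4, 0, 0]])

# Target copied from the cell comment.
target = np.array([[0, 0, 4], [0, 8, 6], [5, 3, 6]])

# A small, hand-picked set of candidate mappings (illustrative, not exhaustive).
candidates = {
    "flip_up_down": np.flipud,
    "flip_left_right": np.fliplr,
    "transpose": np.transpose,
    "rotate_180": lambda x: np.rot90(x, 2),
}


def consistent_mappings(pairs, candidates):
    """Names of candidates that reproduce every train output exactly."""
    return [name for name, fn in candidates.items()
            if all(np.array_equal(fn(inp), out) for inp, out in pairs)]


names = consistent_mappings(train_pairs, candidates)
print(names)
for name in names:
    print(name, np.array_equal(candidates[name](test_input), target))
```

Of these four candidates, only the 180-degree rotation reproduces both train outputs, and applying it to the test input gives the target listed in the cell comment.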

In [None]:
steering_feature = 1600

prompt = prompt_dataset[4]['prompt_text']
# '6fa7a44f.json_Success_Conf5'.
# Target:
#[[2, 9, 2],
# [8, 5, 2],
# [2, 2, 8],
# [2, 2, 8],


print(f"File = {prompt_dataset[4]['filename']}")

# Experiment with different steering strengths
print("\nExperimenting with different steering strengths:")
for strength in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    steered_text = generate_with_steering(model, sae, prompt, steering_feature, max_act, steering_strength=strength)
    print("\n\n===================================\n")
    print(f"\nSteering strength {strength}:")
    print(steered_text)

File = 6fa7a44f.json

Experimenting with different steering strengths:



Steering strength -2.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[9, 1, 4],
       [9, 1, 4],
       [2, 1, 1]])
OUTPUT. Shape=(6, 3)
array([[9, 1, 4],
       [9, 1, 4],
       [2, 1, 1],
       [2, 1, 1],
       [9, 1, 4],
       [9, 1, 4]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[4, 8, 4],
       [7, 6, 7],
       [8, 7, 8]])
OUTPUT. Shape=(6, 3)
array([[4, 8, 4],
       [7, 6, 7],
       [8, 7, 8],
       [8, 7, 8],
       [7, 6, 7],



Steering strength -1.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[9, 1, 4],
       [9, 1, 4],
       [2, 1, 1]])
OUTPUT. Shape=(6, 3)
array([[9, 1, 4],
       [9, 1, 4],
       [2, 1, 1],
       [2, 1, 1],
       [9, 1, 4],
       [9, 1, 4]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[4, 8, 4],
       [7, 6, 7],
       [8, 7, 8]])
OUTPUT. Shape=(6, 3)
array([[4, 8, 4],
       [7, 6, 7],
       [8, 7, 8],
       [8, 7, 8],
       [7, 6, 7],



Steering strength 0.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[9, 1, 4],
       [9, 1, 4],
       [2, 1, 1]])
OUTPUT. Shape=(6, 3)
array([[9, 1, 4],
       [9, 1, 4],
       [2, 1, 1],
       [2, 1, 1],
       [9, 1, 4],
       [9, 1, 4]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[4, 8, 4],
       [7, 6, 7],
       [8, 7, 8]])
OUTPUT. Shape=(6, 3)
array([[4, 8, 4],
       [7, 6, 7],
       [8, 7, 8],
       [8, 7, 8],
       [7, 6, 7],
 


Steering strength 1.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[9, 1, 4],
       [9, 1, 4],
       [2, 1, 1]])
OUTPUT. Shape=(6, 3)
array([[9, 1, 4],
       [9, 1, 4],
       [2, 1, 1],
       [2, 1, 1],
       [9, 1, 4],
       [9, 1, 4]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[4, 8, 4],
       [7, 6, 7],
       [8, 7, 8]])
OUTPUT. Shape=(6, 3)
array([[4, 8, 4],
       [7, 6, 7],
       [8, 7, 8],
       [8, 7, 8],
       [7, 6, 7],
 

Steering strength 2.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[9, 1, 4],
       [9, 1, 4],
       [2, 1, 1]])
OUTPUT. Shape=(6, 3)
array([[9, 1, 4],
       [9, 1, 4],
       [2, 1, 1],
       [2, 1, 1],
       [9, 1, 4],
       [9, 1, 4]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[4, 8, 4],
       [7, 6, 7],
       [8, 7, 8]])
OUTPUT. Shape=(6, 3)
array([[4, 8, 4],
       [7, 6, 7],
       [8, 7, 8],
       [8, 7, 8],
       [7, 6, 7],
 

In [None]:
steering_feature = 1600

prompt = prompt_dataset[6]['prompt_text']
# '3c9b0459.json_Fail_Conf3'.
# Target:
#[[7, 6, 4],
# [4, 6, 6],
# [4, 4, 6]]

print(f"File = {prompt_dataset[6]['filename']}")

# Experiment with different steering strengths
print("\nExperimenting with different steering strengths:")
for strength in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    steered_text = generate_with_steering(model, sae, prompt, steering_feature, max_act, steering_strength=strength)
    print("\n\n===================================\n")
    print(f"\nSteering strength {strength}:")
    print(steered_text)

File = 3c9b0459.json

Experimenting with different steering strengths:


Steering strength -2.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[2, 2, 1],
       [2, 1, 2],
       [2, 8, 1]])
OUTPUT. Shape=(3, 3)
array([[1, 8, 2],
       [2, 1, 2],
       [1, 2, 2]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[9, 2, 4],
       [2, 4, 4],
       [2, 9, 2]])
OUTPUT. Shape=(3, 3)
array([[2, 9, 2],
       [4, 4, 2],
       [4, 2, 9]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
array([[8, 8, 8],
       [5, 5, 8],
       [8, 5, 5]])
O

Steering strength -1.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[2, 2, 1],
       [2, 1, 2],
       [2, 8, 1]])
OUTPUT. Shape=(3, 3)
array([[1, 8, 2],
       [2, 1, 2],
       [1, 2, 2]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[9, 2, 4],
       [2, 4, 4],
       [2, 9, 2]])
OUTPUT. Shape=(3, 3)
array([[2, 9, 2],
       [4, 4, 2],
       [4, 2, 9]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
array([[8, 8, 8],
       [5, 5, 8],
       [8, 5, 5]])
O

Steering strength 0.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[2, 2, 1],
       [2, 1, 2],
       [2, 8, 1]])
OUTPUT. Shape=(3, 3)
array([[1, 8, 2],
       [2, 1, 2],
       [1, 2, 2]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[9, 2, 4],
       [2, 4, 4],
       [2, 9, 2]])
OUTPUT. Shape=(3, 3)
array([[2, 9, 2],
       [4, 4, 2],
       [4, 2, 9]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
array([[8, 8, 8],
       [5, 5, 8],
       [8, 5, 5]])
OU

Steering strength 1.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[2, 2, 1],
       [2, 1, 2],
       [2, 8, 1]])
OUTPUT. Shape=(3, 3)
array([[1, 8, 2],
       [2, 1, 2],
       [1, 2, 2]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[9, 2, 4],
       [2, 4, 4],
       [2, 9, 2]])
OUTPUT. Shape=(3, 3)
array([[2, 9, 2],
       [4, 4, 2],
       [4, 2, 9]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
array([[8, 8, 8],
       [5, 5, 8],
       [8, 5, 5]])
OU

Steering strength 2.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[2, 2, 1],
       [2, 1, 2],
       [2, 8, 1]])
OUTPUT. Shape=(3, 3)
array([[1, 8, 2],
       [2, 1, 2],
       [1, 2, 2]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[9, 2, 4],
       [2, 4, 4],
       [2, 9, 2]])
OUTPUT. Shape=(3, 3)
array([[2, 9, 2],
       [4, 4, 2],
       [4, 2, 9]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
array([[8, 8, 8],
       [5, 5, 8],
       [8, 5, 5]])
OU
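
The runs above print the whole prompt for every strength, so any change caused by steering is easy to miss. Below is a minimal sketch of one way to surface only the point where each output diverges from the unsteered (strength 0.0) output. `divergence_report` and the example strings are hypothetical and not part of the notebook; in practice the dictionary would be filled with the values returned by `generate_with_steering`.

```python
import os

def divergence_report(outputs_by_strength, baseline=0.0):
    """Compare each steered output with the unsteered baseline output."""
    base = outputs_by_strength[baseline]
    report = {}
    for strength, text in sorted(outputs_by_strength.items()):
        # Length of the leading run of characters shared with the baseline.
        shared = len(os.path.commonprefix([base, text]))
        report[strength] = {
            "identical": text == base,
            "shared_prefix_chars": shared,
            "steered_tail": text[shared:],
            "baseline_tail": base[shared:],
        }
    return report

# Hypothetical stand-ins for the strings returned by generate_with_steering.
outputs = {
    -1.0: "array([[8, 5, 5]])\nO",
    0.0: "array([[8, 5, 5]])\nOU",
    1.0: "array([[8, 5, 5]])\nOU",
}
report = divergence_report(outputs)
for strength, row in report.items():
    print(strength, row)
```

Reporting only the tails keeps the comparison readable when the prompt is several hundred tokens long, which is the case for the ARC prompts used here.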

In [None]:
steering_feature = 1600

prompt = prompt_dataset[7]['prompt_text']
# 'ed36ccf7.json_Fail_Conf3'.
# Target:
#[[0, 0, 5],
# [0, 0, 5],
# [0, 5, 0]]

print(f"File = {prompt_dataset[7]['filename']}")

# Experiment with different steering strengths
print("\nExperimenting with different steering strengths:")
for strength in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    steered_text = generate_with_steering(model, sae, prompt, steering_feature, max_act, steering_strength=strength)
    print("\n\n===================================\n")
    print(f"\nSteering strength {strength}:")
    print(steered_text)

File = ed36ccf7.json

Experimenting with different steering strengths:


Steering strength -2.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[9, 0, 0],
       [9, 9, 9],
       [9, 9, 9]])
OUTPUT. Shape=(3, 3)
array([[0, 9, 9],
       [0, 9, 9],
       [9, 9, 9]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[6, 6, 6],
       [0, 0, 0],
       [6, 6, 0]])
OUTPUT. Shape=(3, 3)
array([[6, 0, 0],
       [6, 0, 6],
       [6, 0, 6]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
array([[0, 0, 9],
       [0, 0, 9],
       [9, 9, 9]])
O

Steering strength -1.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[9, 0, 0],
       [9, 9, 9],
       [9, 9, 9]])
OUTPUT. Shape=(3, 3)
array([[0, 9, 9],
       [0, 9, 9],
       [9, 9, 9]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[6, 6, 6],
       [0, 0, 0],
       [6, 6, 0]])
OUTPUT. Shape=(3, 3)
array([[6, 0, 0],
       [6, 0, 6],
       [6, 0, 6]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
array([[0, 0, 9],
       [0, 0, 9],
       [9, 9, 9]])
O

Steering strength 0.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[9, 0, 0],
       [9, 9, 9],
       [9, 9, 9]])
OUTPUT. Shape=(3, 3)
array([[0, 9, 9],
       [0, 9, 9],
       [9, 9, 9]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[6, 6, 6],
       [0, 0, 0],
       [6, 6, 0]])
OUTPUT. Shape=(3, 3)
array([[6, 0, 0],
       [6, 0, 6],
       [6, 0, 6]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
array([[0, 0, 9],
       [0, 0, 9],
       [9, 9, 9]])
OU

Steering strength 1.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[9, 0, 0],
       [9, 9, 9],
       [9, 9, 9]])
OUTPUT. Shape=(3, 3)
array([[0, 9, 9],
       [0, 9, 9],
       [9, 9, 9]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[6, 6, 6],
       [0, 0, 0],
       [6, 6, 0]])
OUTPUT. Shape=(3, 3)
array([[6, 0, 0],
       [6, 0, 6],
       [6, 0, 6]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
array([[0, 0, 9],
       [0, 0, 9],
       [9, 9, 9]])
OU

Steering strength 2.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(3, 3)
array([[9, 0, 0],
       [9, 9, 9],
       [9, 9, 9]])
OUTPUT. Shape=(3, 3)
array([[0, 9, 9],
       [0, 9, 9],
       [9, 9, 9]])
TRAIN Pair 1
INPUT. Shape=(3, 3)
array([[6, 6, 6],
       [0, 0, 0],
       [6, 6, 0]])
OUTPUT. Shape=(3, 3)
array([[6, 0, 0],
       [6, 0, 6],
       [6, 0, 6]])
TRAIN Pair 2
INPUT. Shape=(3, 3)
array([[0, 0, 9],
       [0, 0, 9],
       [9, 9, 9]])
OU

In [None]:
steering_feature = 3547

prompt = prompt_dataset[9]['prompt_text']
# NOTE: index 9 resolves to '017c7c7b.json' (see the printed filename below), not 'b1948b0a.json_Fail_Conf4'.
# Target:

print(f"File = {prompt_dataset[9]['filename']}")

# Experiment with different steering strengths
print("\nExperimenting with different steering strengths:")
for strength in [-2.0,-1.0,0.0,1.0,2.0]:
    steered_text = generate_with_steering(model, sae, prompt, steering_feature, max_act, steering_strength=strength)
    print("\n\n===================================\n")
    print(f"\nSteering strength {strength}:")
    print(steered_text)

File = 017c7c7b.json

Experimenting with different steering strengths:


Steering strength -2.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(6, 3)
array([[0, 1, 0],
       [1, 1, 0],
       [0, 1, 0],
       [0, 1, 1],
       [0, 1, 0],
       [1, 1, 0]])
OUTPUT. Shape=(9, 3)
array([[0, 2, 0],
       [2, 2, 0],
       [0, 2, 0],
       [0, 2, 2],
       [0, 2, 0],
       [2, 2, 0],
       [0, 2, 0],
       [0, 2, 2],
       [0, 2, 0]])
TRAIN Pair 1
INPUT. Shape=(6, 3)
array([[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0],
    

Steering strength -1.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(6, 3)
array([[0, 1, 0],
       [1, 1, 0],
       [0, 1, 0],
       [0, 1, 1],
       [0, 1, 0],
       [1, 1, 0]])
OUTPUT. Shape=(9, 3)
array([[0, 2, 0],
       [2, 2, 0],
       [0, 2, 0],
       [0, 2, 2],
       [0, 2, 0],
       [2, 2, 0],
       [0, 2, 0],
       [0, 2, 2],
       [0, 2, 0]])
TRAIN Pair 1
INPUT. Shape=(6, 3)
array([[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0],
    

Steering strength 0.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(6, 3)
array([[0, 1, 0],
       [1, 1, 0],
       [0, 1, 0],
       [0, 1, 1],
       [0, 1, 0],
       [1, 1, 0]])
OUTPUT. Shape=(9, 3)
array([[0, 2, 0],
       [2, 2, 0],
       [0, 2, 0],
       [0, 2, 2],
       [0, 2, 0],
       [2, 2, 0],
       [0, 2, 0],
       [0, 2, 2],
       [0, 2, 0]])
TRAIN Pair 1
INPUT. Shape=(6, 3)
array([[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0],
     

Steering strength 1.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(6, 3)
array([[0, 1, 0],
       [1, 1, 0],
       [0, 1, 0],
       [0, 1, 1],
       [0, 1, 0],
       [1, 1, 0]])
OUTPUT. Shape=(9, 3)
array([[0, 2, 0],
       [2, 2, 0],
       [0, 2, 0],
       [0, 2, 2],
       [0, 2, 0],
       [2, 2, 0],
       [0, 2, 0],
       [0, 2, 2],
       [0, 2, 0]])
TRAIN Pair 1
INPUT. Shape=(6, 3)
array([[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0],
     

Steering strength 2.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(6, 3)
array([[0, 1, 0],
       [1, 1, 0],
       [0, 1, 0],
       [0, 1, 1],
       [0, 1, 0],
       [1, 1, 0]])
OUTPUT. Shape=(9, 3)
array([[0, 2, 0],
       [2, 2, 0],
       [0, 2, 0],
       [0, 2, 2],
       [0, 2, 0],
       [2, 2, 0],
       [0, 2, 0],
       [0, 2, 2],
       [0, 2, 0]])
TRAIN Pair 1
INPUT. Shape=(6, 3)
array([[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0],
     

In [None]:
# Candidate steering features: 14672, 11358, 4395, 3547, 9827, 700, 15746
steering_feature = 14672

prompt = prompt_dataset[9]['prompt_text']
# NOTE: index 9 resolves to '017c7c7b.json' (see the printed filename below), not 'b1948b0a.json_Fail_Conf4'.
# Target:

print(f"File = {prompt_dataset[9]['filename']}")

# Experiment with different steering strengths
print("\nExperimenting with different steering strengths:")
for strength in [-2.0,-1.0,0.0,1.0,2.0]:
    steered_text = generate_with_steering(model, sae, prompt, steering_feature, max_act, steering_strength=strength)
    print("\n\n===================================\n")
    print(f"\nSteering strength {strength}:")
    print(steered_text)

File = 017c7c7b.json

Experimenting with different steering strengths:


Steering strength -2.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(6, 3)
array([[0, 1, 0],
       [1, 1, 0],
       [0, 1, 0],
       [0, 1, 1],
       [0, 1, 0],
       [1, 1, 0]])
OUTPUT. Shape=(9, 3)
array([[0, 2, 0],
       [2, 2, 0],
       [0, 2, 0],
       [0, 2, 2],
       [0, 2, 0],
       [2, 2, 0],
       [0, 2, 0],
       [0, 2, 2],
       [0, 2, 0]])
TRAIN Pair 1
INPUT. Shape=(6, 3)
array([[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0],
    

Steering strength -1.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(6, 3)
array([[0, 1, 0],
       [1, 1, 0],
       [0, 1, 0],
       [0, 1, 1],
       [0, 1, 0],
       [1, 1, 0]])
OUTPUT. Shape=(9, 3)
array([[0, 2, 0],
       [2, 2, 0],
       [0, 2, 0],
       [0, 2, 2],
       [0, 2, 0],
       [2, 2, 0],
       [0, 2, 0],
       [0, 2, 2],
       [0, 2, 0]])
TRAIN Pair 1
INPUT. Shape=(6, 3)
array([[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0],
    

Steering strength 0.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(6, 3)
array([[0, 1, 0],
       [1, 1, 0],
       [0, 1, 0],
       [0, 1, 1],
       [0, 1, 0],
       [1, 1, 0]])
OUTPUT. Shape=(9, 3)
array([[0, 2, 0],
       [2, 2, 0],
       [0, 2, 0],
       [0, 2, 2],
       [0, 2, 0],
       [2, 2, 0],
       [0, 2, 0],
       [0, 2, 2],
       [0, 2, 0]])
TRAIN Pair 1
INPUT. Shape=(6, 3)
array([[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0],
     

Steering strength 1.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(6, 3)
array([[0, 1, 0],
       [1, 1, 0],
       [0, 1, 0],
       [0, 1, 1],
       [0, 1, 0],
       [1, 1, 0]])
OUTPUT. Shape=(9, 3)
array([[0, 2, 0],
       [2, 2, 0],
       [0, 2, 0],
       [0, 2, 2],
       [0, 2, 0],
       [2, 2, 0],
       [0, 2, 0],
       [0, 2, 2],
       [0, 2, 0]])
TRAIN Pair 1
INPUT. Shape=(6, 3)
array([[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0],
     

Steering strength 2.0:
<bos>Below are pairs of matrices. 
There is a mapping which operates on each input to give the output, only one mapping applies to all matrices. 
Review the matrices to learn that mapping and then estimate the missing output for the final input matrix.

FIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. 
This score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.
THEN Present your predicted output in np.array format
TRAIN Pair 0
INPUT. Shape=(6, 3)
array([[0, 1, 0],
       [1, 1, 0],
       [0, 1, 0],
       [0, 1, 1],
       [0, 1, 0],
       [1, 1, 0]])
OUTPUT. Shape=(9, 3)
array([[0, 2, 0],
       [2, 2, 0],
       [0, 2, 0],
       [0, 2, 2],
       [0, 2, 0],
       [2, 2, 0],
       [0, 2, 0],
       [0, 2, 2],
       [0, 2, 0]])
TRAIN Pair 1
INPUT. Shape=(6, 3)
array([[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0],
     

## Co-occurrence Networks and Irreducible Subspaces

The code written so far is very close to what is needed to reproduce part of the analysis from ["Not All Language Model Features are Linear"](https://arxiv.org/abs/2405.14860). Below we regenerate their circular representation, which demonstrates a geometric relationship between related features, such as days of the week.

This is most effective when studying words with related but distinct meanings, for example, days of the week.

We can run this on the features which signify the greatest differences between activations, to see if there is any discernible pattern.

For a baseline against which to compare those patterns, we execute the analysis on random features.
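
To make the expected pattern concrete, here is a toy sketch of what a circular representation looks like after the reconstruct-then-PCA procedure used below. Everything in it is synthetic and assumed for illustration: seven hypothetical features whose decoder directions are placed on a circle in a 16-dimensional space, one-hot activations, and NumPy's SVD standing in for scikit-learn's `PCA`. It is not the paper's data or method, only an illustration of the geometry.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 7 "day of week" features whose decoder directions
# lie on a circle inside a 2D plane of a 16-dimensional residual stream.
n_features, d_model = 7, 16
angles = 2 * np.pi * np.arange(n_features) / n_features
plane = np.linalg.qr(rng.normal(size=(d_model, 2)))[0]   # orthonormal basis, shape (16, 2)
W_dec = np.stack([np.cos(angles), np.sin(angles)], axis=1) @ plane.T  # shape (7, 16)

# Each sample fires exactly one feature with a slightly varying strength.
n_samples = 200
which = rng.integers(0, n_features, size=n_samples)
acts = np.zeros((n_samples, n_features))
acts[np.arange(n_samples), which] = rng.uniform(0.8, 1.2, size=n_samples)

# Subspace reconstruction: activations of the chosen features times their decoder rows.
recon = acts @ W_dec

# PCA via SVD on the centred reconstructions.
centred = recon - recon.mean(axis=0)
_, singular_values, components = np.linalg.svd(centred, full_matrices=False)
explained = singular_values**2 / np.sum(singular_values**2)
pcs = centred @ components[:2].T   # 2D coordinates, one row per sample

print("variance explained by first two PCs:", explained[:2].sum())
```

In this toy case the first two components capture essentially all of the variance and the points fall into seven clusters arranged around a ring. Random, unrelated features would not be expected to show this kind of structure, which is why they serve as the baseline.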

In [16]:
import random

random_nodes = [random.randint(0, 16000) for _ in range(8)]

key_differences = [3655, 1600, 11000, 14672, 11358, 4395, 3547, 9827]

key_differences_stats = [3655, 700, 2495, 3547, 9311, 9846, 11873, 13106, 13240]

# key_differences_other = [1600, 11000, 14672, 11358, 7525, 9311, 9383, 9827, 10076, 15746]

In [17]:
# Let's view the randomly selected baseline of assumed unrelated features.
random_nodes

[12832, 1423, 3776, 12779, 10046, 11545, 4547, 3058]
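
The baseline above is drawn without a seed and with replacement, so rerunning the cell gives a different set and could, in principle, repeat an index or pick one of the key features. A minimal sketch of a reproducible alternative follows. `sample_baseline_features` is a hypothetical helper, and the width of 16384 is an assumption for a "16k" SAE; if the loaded SAE exposes its width (for example as `sae.cfg.d_sae` in SAELens), that value should be used instead.

```python
import random

def sample_baseline_features(n_features, k=8, seed=0, exclude=()):
    """Draw k distinct feature indices in [0, n_features), skipping any in `exclude`."""
    rng = random.Random(seed)          # local generator, so global random state is untouched
    excluded = set(exclude)
    candidates = [i for i in range(n_features) if i not in excluded]
    return rng.sample(candidates, k)   # sampling without replacement

# Assumed SAE width; replace with the real dictionary size of the loaded SAE.
N_FEATURES = 16384
key_features = [3655, 1600, 11000, 14672, 11358, 4395, 3547, 9827]

baseline = sample_baseline_features(N_FEATURES, k=8, seed=0, exclude=key_features)
print(baseline)
```

Fixing the seed means the cached PCA results for the random baseline always correspond to the same features.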

In [18]:
def list_flatten(nested_list):
    return [x for y in nested_list for x in y]

# A very handy function Neel wrote to get context around a feature activation
def make_token_df(tokens, len_prefix=5, len_suffix=3, model = model):
    str_tokens = [model.to_str_tokens(t) for t in tokens]
    unique_token = [[f"{s}/{i}" for i, s in enumerate(str_tok)] for str_tok in str_tokens]

    context = []
    prompt = []
    pos = []
    label = []
    for b in range(tokens.shape[0]):
        for p in range(tokens.shape[1]):
            prefix = "".join(str_tokens[b][max(0, p-len_prefix):p])
            if p==tokens.shape[1]-1:
                suffix = ""
            else:
                suffix = "".join(str_tokens[b][p+1:min(tokens.shape[1]-1, p+1+len_suffix)])
            current = str_tokens[b][p]
            context.append(f"{prefix}|{current}|{suffix}")
            prompt.append(b)
            pos.append(p)
            label.append(f"{b}/{p}")
    # print(len(batch), len(pos), len(context), len(label))
    return pd.DataFrame(dict(
        str_tokens=list_flatten(str_tokens),
        unique_token=list_flatten(unique_token),
        context=context,
        prompt=prompt,
        pos=pos,
        label=label,
    ))

In [19]:
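def analyze_feature_activations(model, sae, activation_store, feature_list, total_batches=100):
    """Scan `total_batches` batches from `activation_store` and collect every token
    position where at least one feature in `feature_list` fires.

    Returns a dict with the token contexts, the fired tokens, the activations of the
    selected features, and the reconstruction built from those features alone.
    """
    examples_found = 0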
    all_fired_tokens = []
    all_feature_acts = []
    all_reconstructions = []
    all_token_dfs = []

    batch_size_prompts = activation_store.store_batch_size_prompts
    batch_size_tokens = activation_store.context_size * batch_size_prompts

    pbar = tqdm(range(total_batches))
    for i in pbar:
        tokens = activation_store.get_batch_tokens()
        tokens_df = make_token_df(tokens)
        tokens_df["batch"] = i

        flat_tokens = tokens.flatten()

        _, cache = model.run_with_cache(tokens, stop_at_layer=sae.cfg.hook_layer + 1, names_filter=[sae.cfg.hook_name])
        sae_in = cache[sae.cfg.hook_name]
        feature_acts = sae.encode(sae_in).squeeze()

        feature_acts = feature_acts.flatten(0, 1)
        fired_mask = (feature_acts[:, feature_list]).sum(dim=-1) > 0
        fired_tokens = model.to_str_tokens(flat_tokens[fired_mask])
        reconstruction = feature_acts[fired_mask][:, feature_list] @ sae.W_dec[feature_list]

        token_df = tokens_df.iloc[fired_mask.cpu().nonzero().flatten().numpy()]
        all_token_dfs.append(token_df)
        all_feature_acts.append(feature_acts[fired_mask][:, feature_list])
        all_fired_tokens.append(fired_tokens)
        all_reconstructions.append(reconstruction)

        examples_found += len(fired_tokens)
        pbar.set_description(f"Examples found: {examples_found}")

    all_token_dfs = pd.concat(all_token_dfs)
    all_fired_tokens = list_flatten(all_fired_tokens)
    all_reconstructions = torch.cat(all_reconstructions)
    all_feature_acts = torch.cat(all_feature_acts)

    return {'all_token_dfs':all_token_dfs,
            'all_fired_tokens': all_fired_tokens,
            'all_reconstructions':all_reconstructions,
            'all_feature_acts': all_feature_acts,
            'examples_found':examples_found}
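
The core of the function above is the mask-and-reconstruct step. The sketch below reproduces that step on a tiny hand-made example, using NumPy in place of PyTorch so it runs without the model. The activation values and the decoder matrix are hypothetical; only the indexing pattern mirrors the notebook code.

```python
import numpy as np

# Hypothetical toy activations: 5 token positions x 6 SAE features (non-negative).
feature_acts = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 3.0, 0.0],
    [0.0, 0.0, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 4.0],
])
# Hypothetical decoder: 6 features x d_model = 4.
W_dec = np.arange(6 * 4, dtype=float).reshape(6, 4)
feature_list = [1, 4]

# Keep only the columns for the features of interest.
selected = feature_acts[:, feature_list]            # shape (5, 2)

# A position "fires" if any selected feature is active. Summing works here
# because the activations are non-negative.
fired_mask = selected.sum(axis=-1) > 0
fired_positions = np.flatnonzero(fired_mask)

# Reconstruction restricted to the selected features' decoder rows.
reconstruction = selected[fired_mask] @ W_dec[feature_list]   # shape (2, 4)

print(fired_positions)
print(reconstruction)
```

Position 2 also activates feature 0, but that feature is not in `feature_list`, so it contributes nothing to the reconstruction. This is what makes the result a reconstruction of the chosen subspace rather than of the full activation.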

In [20]:
# We need PCA to project the data down to a few dimensions for charting (three components, plotted two at a time).

import pandas as pd
import plotly.express as px
from sklearn.decomposition import PCA

def analyze_pca(all_reconstructions, all_fired_tokens, all_token_dfs, n_components=3):

  # Perform PCA
  pca = PCA(n_components=n_components)
  pca_embedding = pca.fit_transform(all_reconstructions.detach().cpu().numpy())

  # Create DataFrame with PCA results
  pca_df = pd.DataFrame(pca_embedding, columns=[f"PC{i+1}" for i in range(n_components)])
  pca_df["tokens"] = all_fired_tokens
  pca_df["context"] = all_token_dfs.context.values

  return pca_df

In [21]:
# Examine Features which form the Key Differences
feature_list = key_differences
filename = 'pca_df_key_features.parquet'

# Load from file if it already exists; the results take an HOUR to produce on a TPU not optimised for PyTorch
if os.path.exists(f'/content/drive/My Drive/Colab Notebooks/Data/{filename}'):
  pca_df = pd.read_parquet(f'/content/drive/My Drive/Colab Notebooks/Data/{filename}')
else:
  afa = analyze_feature_activations(model, sae, activation_store, feature_list, total_batches=64)

  all_token_dfs = afa['all_token_dfs']
  all_fired_tokens = afa['all_fired_tokens']
  all_reconstructions = afa['all_reconstructions']
  all_feature_acts = afa['all_feature_acts']
  examples_found = afa['examples_found']

  # Dimension reduction via PCA...so we can visualise on 2D.
  pca_df = analyze_pca(all_reconstructions, all_fired_tokens, all_token_dfs, n_components=3)
  pca_df.to_parquet(f'/content/drive/My Drive/Colab Notebooks/Data/{filename}')


In [22]:
# Inspect the PCA dataframe...
pca_df

Unnamed: 0,PC1,PC2,PC3,tokens,context
0,-51.035973,72.988747,23.837778,<bos>,|<bos>| angiogenesis and inhibiting
1,-7.237035,21.563467,-10.811658,([,<bos> angiogenesis and inhibiting apoptosis| (...
2,-3.427665,-0.515436,0.100143,Cao,angiogenesis and inhibiting apoptosis ([|Cao|...
3,20.386850,-0.211215,0.118407,and,"and inhibiting apoptosis ([Cao| and| Prescott,"
4,-25.997721,-0.803798,0.082917,Prescott,"inhibiting apoptosis ([Cao and| Prescott|, 2"
...,...,...,...,...,...
304015,4.606557,-0.412869,0.106323,(,trace-preserving completely positive| (|TPCP)
304016,37.356258,0.005474,0.131390,T,-preserving completely positive (|T|PCP) maps
304017,8.497429,10.011206,0.678355,PCP,preserving completely positive (T|PCP|) maps c...
304018,15.564496,-0.272893,0.114711,character,(TPCP) maps| character|ising quantum dynamics


In [None]:
# Create and show the scatter plot
fig_pca = px.scatter(
    pca_df,
    x="PC2", y="PC3",
    # hover_data="context", hover_name="tokens",
    height=800, width=1200,
    color="tokens",title=f"PCA Subspace Reconstructions: Differential features:{str(key_differences)}"
)

fig_pca.show()

In [None]:
# Examine the features in key_differences_stats
feature_list = key_differences_stats
filename = 'pca_df_key_features_stats.parquet'

# Load from file if it already exists
if os.path.exists(f'/content/drive/My Drive/Colab Notebooks/Data/{filename}'):
  pca_df_stats = pd.read_parquet(f'/content/drive/My Drive/Colab Notebooks/Data/{filename}')
else:
  afa_stats = analyze_feature_activations(model, sae, activation_store, feature_list, total_batches=64)

  all_token_dfs_stats = afa_stats['all_token_dfs']
  all_fired_tokens_stats = afa_stats['all_fired_tokens']
  all_reconstructions_stats = afa_stats['all_reconstructions']
  all_feature_acts_stats = afa_stats['all_feature_acts']
  examples_found_stats = afa_stats['examples_found']

  # Dimension reduction via PCA so we can visualise in 2D.
  pca_df_stats = analyze_pca(all_reconstructions_stats, all_fired_tokens_stats, all_token_dfs_stats, n_components=3)
  pca_df_stats.to_parquet(f'/content/drive/My Drive/Colab Notebooks/Data/{filename}')

In [None]:
pca_df_stats

Unnamed: 0,PC1,PC2,PC3,tokens,context
0,-50.868080,74.605659,-1.479990,<bos>,|<bos>|It is done
1,8.112121,-0.257879,-0.147765,is,"<bos>It| is| done, and"
2,4.584065,-0.307482,-0.120999,done,"<bos>It is| done|, and submitted"
3,-27.553564,-0.758245,0.123383,",","<bos>It is done|,| and submitted."
4,-4.559659,-0.435713,-0.051461,and,"<bos>It is done,| and| submitted. You"
...,...,...,...,...,...
299460,-5.712912,-0.451902,-0.042680,a,pe2015|a|; @K
299461,1.706656,-0.347833,-0.099089,;,2015a|;| @Kuhn
299462,-5.563751,-0.449810,-0.043815,@,015a;| @|Kuhn
299463,30.861727,0.061103,-0.320746,K,15a; @|K|uhn


In [None]:
# Create and show the scatter plot
fig_pca_stats = px.scatter(
    pca_df_stats,
    x="PC2", y="PC3",
    # hover_data="context", hover_name="tokens",
    height=800, width=1200,
    color="tokens",title=f"PCA Subspace Reconstructions. Stats features:{str(key_differences_stats)}"
)

fig_pca_stats.show()

In [None]:
# Examine the randomly selected features (random_nodes)
feature_list = random_nodes
filename = 'pca_df_random_features.parquet'

# Load from file if it already exists
if os.path.exists(f'/content/drive/My Drive/Colab Notebooks/Data/{filename}'):
  pca_df_rand = pd.read_parquet(f'/content/drive/My Drive/Colab Notebooks/Data/{filename}')
else:
  afa_rand = analyze_feature_activations(model, sae, activation_store, feature_list, total_batches=64)

  all_token_dfs_rand = afa_rand['all_token_dfs']
  all_fired_tokens_rand = afa_rand['all_fired_tokens']
  all_reconstructions_rand = afa_rand['all_reconstructions']
  all_feature_acts_rand = afa_rand['all_feature_acts']
  examples_found_rand = afa_rand['examples_found']

  # Dimension reduction via PCA so we can visualise in 2D.
  pca_df_rand = analyze_pca(all_reconstructions_rand, all_fired_tokens_rand, all_token_dfs_rand, n_components=3)
  pca_df_rand.to_parquet(f'/content/drive/My Drive/Colab Notebooks/Data/{filename}')


Examples found: 5005: 100%|██████████| 64/64 [1:09:26<00:00, 65.11s/it]


In [None]:
pca_df_rand

Unnamed: 0,PC1,PC2,PC3,tokens,context
0,49.128883,-12.206374,-4.210915,by,a cavity photon is generated| by| cavity-enha...
1,-12.241544,0.490087,3.417196,*,$) are assumed.\n\n|*|Basic equations*.
2,-12.774838,-0.378636,3.060884,*.,.\n\n*Basic equations|*.| The starting point
3,41.276009,-8.303147,-2.251180,by,"|0\rangle$| by| quantum jumps,"
4,20.597059,1.975019,2.909229,by,mitian Schrödinger equation is given| by| $$\b...
...,...,...,...,...,...
5000,37.273296,-6.313678,-1.252316,by,power that will be supplied| by| the power su...
5001,38.590248,-6.968254,-1.580964,by,influence others towards achieving goals| by|...
5002,-7.189476,34.089256,-14.708767,EI,intelligence (EI).\n\n|EI| is an ability
5003,-5.729959,25.327682,-4.293375,EI,18]\].| EI| is composed of


In [None]:
# Create and show the scatter plot
fig_pca_rand = px.scatter(
    pca_df_rand,
    x="PC2", y="PC3",
    # hover_data="context", hover_name="tokens",
    height=800, width=1200,
    color="tokens",title=f"PCA Subspace Reconstructions. Random features:{str(random_nodes)}"
)

fig_pca_rand.show()

## Feature Ablation

Feature ablation is also worth looking at. In a way, it's a special case of steering where the value of the feature is always zeroed out.

Here we do the following:
1. Use test_prompt rather than generation, to get more nuance.
2. Attach a hook to the SAE feature activations.
3. Zero out a feature at all positions (we know that the default feature fires at the final position).
4. Check whether this ablation is more or less effective if we include the error term (information our SAE isn't capturing).

Note that the existence of [The Hydra Effect](https://arxiv.org/abs/2307.15771) can make reasoning about ablation experiments difficult.
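
Step 4 above contrasts ablation with and without the error term. As a toy illustration of what that means, the sketch below builds a tiny untrained SAE in NumPy and ablates one feature both ways. It is a sketch under simplifying assumptions (random weights, a plain ReLU encoder, no biases) and not SAELens's implementation; all names in it are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 4, 6

# Hypothetical toy SAE weights (random, untrained), for illustration only
W_enc = rng.normal(size=(d_model, d_sae))
W_dec = rng.normal(size=(d_sae, d_model))

def encode(x):
    return np.maximum(x @ W_enc, 0.0)          # ReLU feature activations

def decode(f):
    return f @ W_dec

x = rng.normal(size=(3, d_model))              # activations at 3 token positions
f = encode(x)
error = x - decode(f)                          # what the SAE fails to reconstruct

feature_id = 2
f_ablated = f.copy()
f_ablated[:, feature_id] = 0.0                 # zero the feature at all positions

out_no_error = decode(f_ablated)               # reconstruction only
out_with_error = decode(f_ablated) + error     # reconstruction plus error term

# With the error term, the only change from the original activation is the
# ablated feature's contribution: its activation times its decoder direction.
removed = np.outer(f[:, feature_id], W_dec[feature_id])
print(np.allclose(x - out_with_error, removed))
```

Without the error term, the downstream layers also lose whatever the SAE failed to reconstruct, so any change in the output mixes the effect of the ablation with the effect of reconstruction error.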

In [24]:
prompt_dataset

Dataset({
    features: ['prompt_text', 'filename'],
    num_rows: 10
})

What's in the first ARC challenge prompt?

In [25]:
prompt_dataset[0]['prompt_text']

'Below are pairs of matrices. \nThere is a mapping which operates on each input to give the output, only one mapping applies to all matrices. \nReview the matrices to learn that mapping and then estimate the missing output for the final input matrix.\n\nFIRST score your confidence that you understand the mapping pattern, 0-5 where 0 is zero is no confidence and 5 is highly confident. \nThis score must be the FIRST output you give, no preamble, no prefix, no punctuation, just a single digit score.\nTHEN Present your predicted output in np.array format\nTRAIN Pair 0\nINPUT. Shape=(2, 6)\narray([[3, 3, 3, 3, 3, 3],\n       [9, 9, 9, 9, 9, 9]])\nOUTPUT. Shape=(2, 6)\narray([[3, 9, 3, 9, 3, 9],\n       [9, 3, 9, 3, 9, 3]])\nTRAIN Pair 1\nINPUT. Shape=(2, 6)\narray([[4, 4, 4, 4, 4, 4],\n       [8, 8, 8, 8, 8, 8]])\nOUTPUT. Shape=(2, 6)\narray([[4, 8, 4, 8, 4, 8],\n       [8, 4, 8, 4, 8, 4]])\nTEST Pair 0\nINPUT. Shape=(2, 6)\narray([[6, 6, 6, 6, 6, 6],\n       [2, 2, 2, 2, 2, 2]])\nOUTPUT. '

This first prompt, as above, is expected to return a response of 40 tokens, as follows. The '5' is the model's confidence in completing the challenge successfully and the numpy array is the model's prediction for the challenge, which is correct in this instance.

5
np.array([[6, 2, 6, 2, 6, 2],
[2, 6, 2, 6, 2, 6]])
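
To score a response like the one above automatically, the confidence digit and the predicted grid need to be separated. The following is a minimal sketch of one way to do that; `parse_response` is a hypothetical helper, not part of the notebook, and it assumes the response follows the format requested in the prompt (a leading digit, then an `np.array(...)` literal).

```python
import ast
import re

def parse_response(text):
    """Split a response into (confidence, grid); each is None if it cannot be parsed."""
    text = text.strip()
    confidence = int(text[0]) if text[:1].isdigit() else None
    # Capture everything between "np.array(" and the last closing parenthesis
    match = re.search(r"np\.array\((.*)\)", text, flags=re.DOTALL)
    grid = None
    if match:
        try:
            grid = ast.literal_eval(match.group(1))   # safe parse of the nested list
        except (ValueError, SyntaxError):
            grid = None
    return confidence, grid

response = """5
np.array([[6, 2, 6, 2, 6, 2],
[2, 6, 2, 6, 2, 6]])"""

confidence, grid = parse_response(response)
expected = [[6, 2, 6, 2, 6, 2], [2, 6, 2, 6, 2, 6]]
print(confidence, grid == expected)
```

Using `ast.literal_eval` rather than `eval` avoids executing arbitrary text produced by the model.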

In [26]:
from transformer_lens.utils import test_prompt
from functools import partial

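def test_prompt_with_ablation(model, sae, prompt, answer, ablation_features, position=None):
    """Run test_prompt with the given SAE feature(s) zeroed out."""

    def ablate_feature_hook(feature_activations, hook, feature_ids, position=None):
        # Zero the chosen feature(s) at every position, or only at `position` if given
        if position is None:
            feature_activations[:, :, feature_ids] = 0
        else:
            feature_activations[:, position, feature_ids] = 0

        return feature_activations

    ablation_hook = partial(ablate_feature_hook, feature_ids=ablation_features, position=position)

    # Splice the SAE into the model and hook its post-activation features
    model.add_sae(sae)
    hook_point = sae.cfg.hook_name + '.hook_sae_acts_post'
    model.add_hook(hook_point, ablation_hook, "fwd")

    try:
        test_prompt(prompt, answer, model)
    finally:
        # Always restore the model, even if test_prompt raises
        model.reset_saes()
        model.reset_hooks()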


In [27]:
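import torch
from transformer_lens.utils import sample_logits

def generate_from_prompt(model, prompt, max_tokens=50, temperature=0.0):
    """Generate up to max_tokens tokens one at a time, re-running the full sequence on each step (no KV cache)."""
    # Tokenize the prompt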
    prompt_tokens = model.to_tokens(prompt, prepend_bos=True)

    generated_tokens = []
    for _ in range(max_tokens):
        # Get logits from the model
        logits = model(prompt_tokens)[:, -1, :]  # Take the last token's logits

        # Sample from logits (which will be greedy at temp=0)
        next_token = sample_logits(logits, temperature=temperature)

        generated_tokens.append(next_token.item())
        prompt_tokens = torch.cat([prompt_tokens, next_token.unsqueeze(0)], dim=1)

        # Stop if we generate an EOS token
        if next_token.item() == model.tokenizer.eos_token_id:
            break

    # Convert generated tokens back to text
    return model.tokenizer.decode(generated_tokens)
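
The function above relies on sampling being greedy at temperature 0. A minimal pure-Python sketch of that behaviour follows, assuming that temperature 0 means argmax and that otherwise the logits are divided by the temperature before softmax sampling. `sample_token` is a hypothetical helper for illustration, not the TransformerLens function, and the logits are illustrative values.

```python
import math
import random

def sample_token(logits, temperature=1.0, rng=random):
    """Pick an index from a list of logits; temperature 0 means greedy argmax."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)                               # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scaled]   # unnormalised softmax weights
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [24.94, 23.21, 22.79]                    # illustrative values
greedy = sample_token(logits, temperature=0.0)
sampled = sample_token(logits, temperature=1.0, rng=random.Random(0))
print(greedy, sampled)
```

Greedy decoding makes the generation deterministic, which is convenient when comparing runs with and without ablation.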

In [None]:
# THIS CODE CAN CAUSE TPU TO CRASH

output = generate_from_prompt(model, prompt_dataset[0]['prompt_text'], max_tokens=50, temperature=0.0)
print(output)

In [None]:
# set up the query and expected response without ablation being applied:

model.reset_hooks(including_permanent=True)
prompt = prompt_dataset[0]['prompt_text']
answer = """5
np.array([[6, 2, 6, 2, 6, 2],
[2, 6, 2, 6, 2, 6]])"""

test_prompt(prompt, answer, model)


Tokenized prompt: ['<bos>', 'Below', ' are', ' pairs', ' of', ' matrices', '.', ' ', '\n', 'There', ' is', ' a', ' mapping', ' which', ' operates', ' on', ' each', ' input', ' to', ' give', ' the', ' output', ',', ' only', ' one', ' mapping', ' applies', ' to', ' all', ' matrices', '.', ' ', '\n', 'Review', ' the', ' matrices', ' to', ' learn', ' that', ' mapping', ' and', ' then', ' estimate', ' the', ' missing', ' output', ' for', ' the', ' final', ' input', ' matrix', '.', '\n\n', 'FIRST', ' score', ' your', ' confidence', ' that', ' you', ' understand', ' the', ' mapping', ' pattern', ',', ' ', '0', '-', '5', ' where', ' ', '0', ' is', ' zero', ' is', ' no', ' confidence', ' and', ' ', '5', ' is', ' highly', ' confident', '.', ' ', '\n', 'This', ' score', ' must', ' be', ' the', ' FIRST', ' output', ' you', ' give', ',', ' no', ' preamble', ',', ' no', ' prefix', ',', ' no', ' punctuation', ',', ' just', ' a', ' single', ' digit', ' score', '.', '\n', 'THEN', ' Present', ' your', '

Top 0th token. Logit: 24.94 Prob: 68.51% Token: |
|
Top 1th token. Logit: 23.21 Prob: 12.21% Token: |

|
Top 2th token. Logit: 22.79 Prob:  8.06% Token: |2|
Top 3th token. Logit: 21.62 Prob:  2.49% Token: |________________|
Top 4th token. Logit: 21.07 Prob:  1.44% Token: |


|
Top 5th token. Logit: 20.85 Prob:  1.16% Token: |1|
Top 6th token. Logit: 20.81 Prob:  1.11% Token: |0|
Top 7th token. Logit: 20.54 Prob:  0.84% Token: |6|
Top 8th token. Logit: 20.37 Prob:  0.72% Token: |................|
Top 9th token. Logit: 19.66 Prob:  0.35% Token: |################|


Top 0th token. Logit: 26.10 Prob: 67.46% Token: |
|
Top 1th token. Logit: 24.59 Prob: 14.89% Token: |

|
Top 2th token. Logit: 23.59 Prob:  5.52% Token: |2|
Top 3th token. Logit: 22.69 Prob:  2.23% Token: |________________|
Top 4th token. Logit: 22.56 Prob:  1.96% Token: |


|
Top 5th token. Logit: 21.91 Prob:  1.02% Token: |1|
Top 6th token. Logit: 21.88 Prob:  0.99% Token: |6|
Top 7th token. Logit: 21.75 Prob:  0.87% Token: |0|
Top 8th token. Logit: 21.26 Prob:  0.54% Token: |



|
Top 9th token. Logit: 21.26 Prob:  0.53% Token: |################|


Top 0th token. Logit: 26.77 Prob: 14.91% Token: |x|
Top 1th token. Logit: 26.39 Prob: 10.25% Token: |
|
Top 2th token. Logit: 25.81 Prob:  5.74% Token: |-|
Top 3th token. Logit: 25.59 Prob:  4.61% Token: | digit|
Top 4th token. Logit: 25.50 Prob:  4.18% Token: |0|
Top 5th token. Logit: 25.36 Prob:  3.64% Token: | x|
Top 6th token. Logit: 25.30 Prob:  3.44% Token: | or|
Top 7th token. Logit: 25.27 Prob:  3.33% Token: |,|
Top 8th token. Logit: 25.16 Prob:  3.00% Token: | digits|
Top 9th token. Logit: 24.87 Prob:  2.24% Token: |

|


Top 0th token. Logit: 26.42 Prob: 14.13% Token: |TEST|
Top 1th token. Logit: 26.12 Prob: 10.47% Token: |<eos>|
Top 2th token. Logit: 25.95 Prob:  8.78% Token: |TRAIN|
Top 3th token. Logit: 25.76 Prob:  7.31% Token: |array|
Top 4th token. Logit: 25.58 Prob:  6.08% Token: |OUTPUT|
Top 5th token. Logit: 25.00 Prob:  3.41% Token: |Your|
Top 6th token. Logit: 24.42 Prob:  1.91% Token: |INPUT|
Top 7th token. Logit: 24.34 Prob:  1.76% Token: |Predict|
Top 8th token. Logit: 24.28 Prob:  1.66% Token: |The|
Top 9th token. Logit: 24.15 Prob:  1.45% Token: |ANSWER|


Top 0th token. Logit:  5.30 Prob: 97.90% Token: |.|
Top 1th token. Logit:  0.87 Prob:  1.17% Token: | array|
Top 2th token. Logit:  0.04 Prob:  0.51% Token: |_|
Top 3th token. Logit: -2.41 Prob:  0.04% Token: |
|
Top 4th token. Logit: -2.52 Prob:  0.04% Token: |-|
Top 5th token. Logit: -2.74 Prob:  0.03% Token: | output|
Top 6th token. Logit: -3.34 Prob:  0.02% Token: |array|
Top 7th token. Logit: -3.61 Prob:  0.01% Token: |:|
Top 8th token. Logit: -3.71 Prob:  0.01% Token: | Array|
Top 9th token. Logit: -3.82 Prob:  0.01% Token: |

|


Top 0th token. Logit:  6.46 Prob: 97.26% Token: |array|
Top 1th token. Logit:  0.63 Prob:  0.29% Token: |zeros|
Top 2th token. Logit:  0.34 Prob:  0.21% Token: |nan|
Top 3th token. Logit:  0.14 Prob:  0.17% Token: | array|
Top 4th token. Logit:  0.01 Prob:  0.15% Token: |ndarray|
Top 5th token. Logit: -0.04 Prob:  0.15% Token: |shape|
Top 6th token. Logit: -0.25 Prob:  0.12% Token: |empty|
Top 7th token. Logit: -0.38 Prob:  0.10% Token: |
|
Top 8th token. Logit: -0.53 Prob:  0.09% Token: |Array|
Top 9th token. Logit: -0.68 Prob:  0.08% Token: |asarray|


Top 0th token. Logit: 26.15 Prob: 32.27% Token: |([[|
Top 1th token. Logit: 25.81 Prob: 22.95% Token: |([|
Top 2th token. Logit: 25.05 Prob: 10.81% Token: |(|
Top 3th token. Logit: 25.00 Prob: 10.25% Token: |()|
Top 4th token. Logit: 24.14 Prob:  4.35% Token: |
|
Top 5th token. Logit: 23.83 Prob:  3.17% Token: |([])|
Top 6th token. Logit: 23.46 Prob:  2.20% Token: |(...)|
Top 7th token. Logit: 22.91 Prob:  1.27% Token: |

|
Top 8th token. Logit: 22.81 Prob:  1.15% Token: |((|
Top 9th token. Logit: 22.46 Prob:  0.81% Token: | (|


Top 0th token. Logit: 27.12 Prob: 62.84% Token: |6|
Top 1th token. Logit: 24.77 Prob:  6.03% Token: |2|
Top 2th token. Logit: 24.43 Prob:  4.30% Token: | |
Top 3th token. Logit: 24.33 Prob:  3.86% Token: |1|
Top 4th token. Logit: 23.64 Prob:  1.93% Token: |0|
Top 5th token. Logit: 23.34 Prob:  1.44% Token: |?,|
Top 6th token. Logit: 23.30 Prob:  1.38% Token: |?|
Top 7th token. Logit: 23.24 Prob:  1.30% Token: |5|
Top 8th token. Logit: 23.07 Prob:  1.10% Token: |...|
Top 9th token. Logit: 23.06 Prob:  1.09% Token: |3|


Top 0th token. Logit:  5.15 Prob: 96.12% Token: |,|
Top 1th token. Logit:  0.36 Prob:  0.80% Token: |],|
Top 2th token. Logit: -0.11 Prob:  0.50% Token: | ,|
Top 3th token. Logit: -0.20 Prob:  0.46% Token: | |
Top 4th token. Logit: -0.65 Prob:  0.29% Token: |.|
Top 5th token. Logit: -0.72 Prob:  0.27% Token: |.,|
Top 6th token. Logit: -1.14 Prob:  0.18% Token: |]])|
Top 7th token. Logit: -1.41 Prob:  0.14% Token: |],[|
Top 8th token. Logit: -1.55 Prob:  0.12% Token: |]],|
Top 9th token. Logit: -1.67 Prob:  0.10% Token: |]]|


Top 0th token. Logit: 28.93 Prob: 87.08% Token: | |
Top 1th token. Logit: 26.74 Prob:  9.74% Token: |2|
Top 2th token. Logit: 23.83 Prob:  0.53% Token: | ,|
Top 3th token. Logit: 23.81 Prob:  0.52% Token: |  |
Top 4th token. Logit: 23.45 Prob:  0.36% Token: |6|
Top 5th token. Logit: 22.84 Prob:  0.20% Token: | ?,|
Top 6th token. Logit: 22.59 Prob:  0.15% Token: | ],|
Top 7th token. Logit: 22.32 Prob:  0.12% Token: |   |
Top 8th token. Logit: 22.18 Prob:  0.10% Token: | ?|
Top 9th token. Logit: 21.81 Prob:  0.07% Token: |1|


Top 0th token. Logit: -8.49 Prob: 95.30% Token: |2|
Top 1th token. Logit: -12.30 Prob:  2.10% Token: |6|
Top 2th token. Logit: -13.58 Prob:  0.58% Token: |1|
Top 3th token. Logit: -14.01 Prob:  0.38% Token: |3|
Top 4th token. Logit: -14.06 Prob:  0.36% Token: |4|
Top 5th token. Logit: -14.07 Prob:  0.36% Token: |8|
Top 6th token. Logit: -14.17 Prob:  0.32% Token: |5|
Top 7th token. Logit: -14.54 Prob:  0.22% Token: |9|
Top 8th token. Logit: -14.63 Prob:  0.21% Token: |0|
Top 9th token. Logit: -15.23 Prob:  0.11% Token: |7|


Top 0th token. Logit: 29.98 Prob: 99.34% Token: |,|
Top 1th token. Logit: 23.99 Prob:  0.25% Token: |],|
Top 2th token. Logit: 23.57 Prob:  0.16% Token: | ,|
Top 3th token. Logit: 22.20 Prob:  0.04% Token: |]])|
Top 4th token. Logit: 21.66 Prob:  0.02% Token: |],[|
Top 5th token. Logit: 21.36 Prob:  0.02% Token: |]|
Top 6th token. Logit: 21.18 Prob:  0.02% Token: |]],|
Top 7th token. Logit: 21.14 Prob:  0.01% Token: |,...|
Top 8th token. Logit: 21.11 Prob:  0.01% Token: |]]|
Top 9th token. Logit: 20.99 Prob:  0.01% Token: |6|


Top 0th token. Logit: 29.94 Prob: 99.72% Token: | |
Top 1th token. Logit: 22.55 Prob:  0.06% Token: |  |
Top 2th token. Logit: 22.23 Prob:  0.04% Token: |6|
Top 3th token. Logit: 22.15 Prob:  0.04% Token: | ...|
Top 4th token. Logit: 21.44 Prob:  0.02% Token: | ,|
Top 5th token. Logit: 21.12 Prob:  0.01% Token: | ...,|
Top 6th token. Logit: 20.79 Prob:  0.01% Token: | ],|
Top 7th token. Logit: 20.40 Prob:  0.01% Token: | …|
Top 8th token. Logit: 20.39 Prob:  0.01% Token: |   |
Top 9th token. Logit: 20.37 Prob:  0.01% Token: | ....|


Top 0th token. Logit: 29.79 Prob: 99.77% Token: |6|
Top 1th token. Logit: 22.30 Prob:  0.06% Token: |2|
Top 2th token. Logit: 21.67 Prob:  0.03% Token: |5|
Top 3th token. Logit: 21.64 Prob:  0.03% Token: |4|
Top 4th token. Logit: 21.52 Prob:  0.03% Token: |3|
Top 5th token. Logit: 21.41 Prob:  0.02% Token: |1|
Top 6th token. Logit: 21.19 Prob:  0.02% Token: |8|
Top 7th token. Logit: 20.88 Prob:  0.01% Token: |7|
Top 8th token. Logit: 20.87 Prob:  0.01% Token: |
|
Top 9th token. Logit: 20.57 Prob:  0.01% Token: |9|


Top 0th token. Logit: 29.94 Prob: 99.83% Token: |,|
Top 1th token. Logit: 22.69 Prob:  0.07% Token: | ,|
Top 2th token. Logit: 21.64 Prob:  0.02% Token: |],|
Top 3th token. Logit: 20.52 Prob:  0.01% Token: |.|
Top 4th token. Logit: 20.47 Prob:  0.01% Token: | |
Top 5th token. Logit: 20.39 Prob:  0.01% Token: |...|
Top 6th token. Logit: 20.20 Prob:  0.01% Token: |,...|
Top 7th token. Logit: 19.74 Prob:  0.00% Token: |,,|
Top 8th token. Logit: 19.56 Prob:  0.00% Token: |]|
Top 9th token. Logit: 19.54 Prob:  0.00% Token: |....|


Top 0th token. Logit: 29.91 Prob: 99.84% Token: | |
Top 1th token. Logit: 22.45 Prob:  0.06% Token: |2|
Top 2th token. Logit: 21.55 Prob:  0.02% Token: |  |
Top 3th token. Logit: 21.08 Prob:  0.01% Token: | ,|
Top 4th token. Logit: 20.98 Prob:  0.01% Token: | ...|
Top 5th token. Logit: 20.73 Prob:  0.01% Token: | ...,|
Top 6th token. Logit: 20.05 Prob:  0.01% Token: |
|
Top 7th token. Logit: 19.38 Prob:  0.00% Token: | ],|
Top 8th token. Logit: 19.13 Prob:  0.00% Token: | ....|
Top 9th token. Logit: 19.09 Prob:  0.00% Token: | …|


Top 0th token. Logit: 29.86 Prob: 99.77% Token: |2|
Top 1th token. Logit: 22.59 Prob:  0.07% Token: |8|
Top 2th token. Logit: 21.82 Prob:  0.03% Token: |4|
Top 3th token. Logit: 21.77 Prob:  0.03% Token: |1|
Top 4th token. Logit: 21.73 Prob:  0.03% Token: |3|
Top 5th token. Logit: 21.17 Prob:  0.02% Token: |6|
Top 6th token. Logit: 21.01 Prob:  0.01% Token: |
|
Top 7th token. Logit: 20.78 Prob:  0.01% Token: |0|
Top 8th token. Logit: 20.63 Prob:  0.01% Token: |5|
Top 9th token. Logit: 20.32 Prob:  0.01% Token: |9|


Top 0th token. Logit: 29.94 Prob: 99.76% Token: |,|
Top 1th token. Logit: 22.64 Prob:  0.07% Token: |],|
Top 2th token. Logit: 22.34 Prob:  0.05% Token: | ,|
Top 3th token. Logit: 21.24 Prob:  0.02% Token: |,...|
Top 4th token. Logit: 21.05 Prob:  0.01% Token: |]])|
Top 5th token. Logit: 20.76 Prob:  0.01% Token: |...|
Top 6th token. Logit: 20.41 Prob:  0.01% Token: |],[|
Top 7th token. Logit: 20.37 Prob:  0.01% Token: |]],|
Top 8th token. Logit: 20.29 Prob:  0.01% Token: |]|
Top 9th token. Logit: 20.27 Prob:  0.01% Token: |]]|


Top 0th token. Logit: 29.93 Prob: 99.89% Token: | |
Top 1th token. Logit: 21.50 Prob:  0.02% Token: | ...|
Top 2th token. Logit: 21.43 Prob:  0.02% Token: |6|
Top 3th token. Logit: 21.24 Prob:  0.02% Token: |  |
Top 4th token. Logit: 20.72 Prob:  0.01% Token: | ...,|
Top 5th token. Logit: 20.44 Prob:  0.01% Token: | ,|
Top 6th token. Logit: 20.09 Prob:  0.01% Token: | ...]|
Top 7th token. Logit: 19.64 Prob:  0.00% Token: | ....|
Top 8th token. Logit: 19.58 Prob:  0.00% Token: |
|
Top 9th token. Logit: 19.41 Prob:  0.00% Token: | …|


Top 0th token. Logit: 29.84 Prob: 99.93% Token: |6|
Top 1th token. Logit: 21.42 Prob:  0.02% Token: |2|
Top 2th token. Logit: 20.39 Prob:  0.01% Token: |4|
Top 3th token. Logit: 20.29 Prob:  0.01% Token: |1|
Top 4th token. Logit: 20.21 Prob:  0.01% Token: |3|
Top 5th token. Logit: 20.13 Prob:  0.01% Token: |5|
Top 6th token. Logit: 19.81 Prob:  0.00% Token: |
|
Top 7th token. Logit: 19.68 Prob:  0.00% Token: |8|
Top 8th token. Logit: 19.30 Prob:  0.00% Token: |7|
Top 9th token. Logit: 19.28 Prob:  0.00% Token: |0|


Top 0th token. Logit: 29.91 Prob: 99.88% Token: |,|
Top 1th token. Logit: 22.18 Prob:  0.04% Token: | ,|
Top 2th token. Logit: 21.61 Prob:  0.02% Token: |],|
Top 3th token. Logit: 20.71 Prob:  0.01% Token: | |
Top 4th token. Logit: 20.20 Prob:  0.01% Token: |.|
Top 5th token. Logit: 19.78 Prob:  0.00% Token: |]])|
Top 6th token. Logit: 19.72 Prob:  0.00% Token: |]],|
Top 7th token. Logit: 19.44 Prob:  0.00% Token: |]]|
Top 8th token. Logit: 19.22 Prob:  0.00% Token: |...|
Top 9th token. Logit: 19.17 Prob:  0.00% Token: |]|


Top 0th token. Logit: 29.90 Prob: 99.88% Token: | |
Top 1th token. Logit: 22.33 Prob:  0.05% Token: |2|
Top 2th token. Logit: 21.13 Prob:  0.02% Token: |  |
Top 3th token. Logit: 20.80 Prob:  0.01% Token: | ...|
Top 4th token. Logit: 20.41 Prob:  0.01% Token: |
|
Top 5th token. Logit: 19.99 Prob:  0.00% Token: | ],|
Top 6th token. Logit: 19.67 Prob:  0.00% Token: | ,|
Top 7th token. Logit: 19.10 Prob:  0.00% Token: | ...,|
Top 8th token. Logit: 18.84 Prob:  0.00% Token: | …|
Top 9th token. Logit: 18.78 Prob:  0.00% Token: | ]|


Top 0th token. Logit: 29.89 Prob: 99.91% Token: |2|
Top 1th token. Logit: 21.41 Prob:  0.02% Token: |3|
Top 2th token. Logit: 21.37 Prob:  0.02% Token: |6|
Top 3th token. Logit: 20.93 Prob:  0.01% Token: |1|
Top 4th token. Logit: 20.74 Prob:  0.01% Token: |4|
Top 5th token. Logit: 20.21 Prob:  0.01% Token: |0|
Top 6th token. Logit: 20.04 Prob:  0.01% Token: |
|
Top 7th token. Logit: 20.04 Prob:  0.01% Token: |5|
Top 8th token. Logit: 20.01 Prob:  0.01% Token: |8|
Top 9th token. Logit: 19.23 Prob:  0.00% Token: |7|


Top 0th token. Logit: -5.64 Prob: 91.68% Token: |],|
Top 1th token. Logit: -8.46 Prob:  5.45% Token: |],[|
Top 2th token. Logit: -10.09 Prob:  1.07% Token: |]|
Top 3th token. Logit: -10.60 Prob:  0.64% Token: |]])|
Top 4th token. Logit: -11.46 Prob:  0.27% Token: |]]|
Top 5th token. Logit: -11.47 Prob:  0.27% Token: |]],|
Top 6th token. Logit: -11.94 Prob:  0.17% Token: |,|
Top 7th token. Logit: -12.30 Prob:  0.12% Token: | ],|
Top 8th token. Logit: -12.61 Prob:  0.09% Token: |],\|
Top 9th token. Logit: -13.67 Prob:  0.03% Token: |][|


Top 0th token. Logit: 26.58 Prob: 81.50% Token: |
|
Top 1th token. Logit: 24.82 Prob: 13.92% Token: | [|
Top 2th token. Logit: 23.29 Prob:  3.01% Token: | |
Top 3th token. Logit: 20.87 Prob:  0.27% Token: |  |
Top 4th token. Logit: 20.78 Prob:  0.25% Token: |

|
Top 5th token. Logit: 19.64 Prob:  0.08% Token: | ...|
Top 6th token. Logit: 19.57 Prob:  0.07% Token: |   |
Top 7th token. Logit: 19.54 Prob:  0.07% Token: |])|
Top 8th token. Logit: 19.49 Prob:  0.07% Token: |    |
Top 9th token. Logit: 19.35 Prob:  0.06% Token: |     |


Top 0th token. Logit: 26.56 Prob: 62.91% Token: |          |
Top 1th token. Logit: 25.49 Prob: 21.59% Token: |       |
Top 2th token. Logit: 23.54 Prob:  3.07% Token: |           |
Top 3th token. Logit: 23.41 Prob:  2.68% Token: |         |
Top 4th token. Logit: 23.13 Prob:  2.03% Token: |[|
Top 5th token. Logit: 22.84 Prob:  1.52% Token: |        |
Top 6th token. Logit: 22.82 Prob:  1.49% Token: |      |
Top 7th token. Logit: 22.23 Prob:  0.83% Token: | [|
Top 8th token. Logit: 21.99 Prob:  0.65% Token: |     |
Top 9th token. Logit: 21.91 Prob:  0.60% Token: |    |


Top 0th token. Logit:  5.13 Prob: 97.41% Token: |2|
Top 1th token. Logit:  0.92 Prob:  1.45% Token: | |
Top 2th token. Logit: -1.04 Prob:  0.20% Token: |6|
Top 3th token. Logit: -1.51 Prob:  0.13% Token: |  |
Top 4th token. Logit: -1.63 Prob:  0.11% Token: |   |
Top 5th token. Logit: -1.83 Prob:  0.09% Token: |1|
Top 6th token. Logit: -2.11 Prob:  0.07% Token: |3|
Top 7th token. Logit: -2.51 Prob:  0.05% Token: |    |
Top 8th token. Logit: -2.65 Prob:  0.04% Token: |5|
Top 9th token. Logit: -2.69 Prob:  0.04% Token: |4|


Top 0th token. Logit: 29.95 Prob: 99.81% Token: |,|
Top 1th token. Logit: 23.06 Prob:  0.10% Token: | ,|
Top 2th token. Logit: 22.00 Prob:  0.03% Token: | |
Top 3th token. Logit: 21.45 Prob:  0.02% Token: |.|
Top 4th token. Logit: 20.10 Prob:  0.01% Token: |  |
Top 5th token. Logit: 19.74 Prob:  0.00% Token: |6|
Top 6th token. Logit: 19.36 Prob:  0.00% Token: |],|
Top 7th token. Logit: 19.13 Prob:  0.00% Token: |   |
Top 8th token. Logit: 19.01 Prob:  0.00% Token: |.,|
Top 9th token. Logit: 18.78 Prob:  0.00% Token: |,,|


Top 0th token. Logit: 29.96 Prob: 99.45% Token: | |
Top 1th token. Logit: 24.44 Prob:  0.40% Token: |6|
Top 2th token. Logit: 23.02 Prob:  0.10% Token: |  |
Top 3th token. Logit: 21.00 Prob:  0.01% Token: |   |
Top 4th token. Logit: 20.94 Prob:  0.01% Token: | ,|
Top 5th token. Logit: 19.71 Prob:  0.00% Token: | ...|
Top 6th token. Logit: 19.69 Prob:  0.00% Token: |2|
Top 7th token. Logit: 19.44 Prob:  0.00% Token: |    |
Top 8th token. Logit: 19.31 Prob:  0.00% Token: |
|
Top 9th token. Logit: 18.59 Prob:  0.00% Token: |     |


Top 0th token. Logit: 29.91 Prob: 99.78% Token: |6|
Top 1th token. Logit: 23.13 Prob:  0.11% Token: |2|
Top 2th token. Logit: 21.47 Prob:  0.02% Token: |4|
Top 3th token. Logit: 21.38 Prob:  0.02% Token: |1|
Top 4th token. Logit: 21.22 Prob:  0.02% Token: |3|
Top 5th token. Logit: 20.97 Prob:  0.01% Token: |8|
Top 6th token. Logit: 20.77 Prob:  0.01% Token: |5|
Top 7th token. Logit: 20.27 Prob:  0.01% Token: |7|
Top 8th token. Logit: 19.91 Prob:  0.00% Token: |9|
Top 9th token. Logit: 19.84 Prob:  0.00% Token: |0|


Top 0th token. Logit: 29.89 Prob: 99.89% Token: |,|
Top 1th token. Logit: 22.71 Prob:  0.08% Token: | ,|
Top 2th token. Logit: 20.67 Prob:  0.01% Token: | |
Top 3th token. Logit: 20.55 Prob:  0.01% Token: |.|
Top 4th token. Logit: 19.18 Prob:  0.00% Token: |],|
Top 5th token. Logit: 19.14 Prob:  0.00% Token: |]]|
Top 6th token. Logit: 19.11 Prob:  0.00% Token: |2|
Top 7th token. Logit: 18.72 Prob:  0.00% Token: |]|
Top 8th token. Logit: 18.56 Prob:  0.00% Token: |]])|
Top 9th token. Logit: 18.51 Prob:  0.00% Token: |  |


Top 0th token. Logit: 29.80 Prob: 99.89% Token: | |
Top 1th token. Logit: 22.50 Prob:  0.07% Token: |2|
Top 2th token. Logit: 21.36 Prob:  0.02% Token: |  |
Top 3th token. Logit: 19.79 Prob:  0.00% Token: | ,|
Top 4th token. Logit: 19.33 Prob:  0.00% Token: | ...|
Top 5th token. Logit: 18.98 Prob:  0.00% Token: |
|
Top 6th token. Logit: 18.69 Prob:  0.00% Token: |   |
Top 7th token. Logit: 17.80 Prob:  0.00% Token: | ...,|
Top 8th token. Logit: 17.40 Prob:  0.00% Token: | -|
Top 9th token. Logit: 17.31 Prob:  0.00% Token: | ]|


Top 0th token. Logit: 29.90 Prob: 99.92% Token: |2|
Top 1th token. Logit: 21.24 Prob:  0.02% Token: |6|
Top 2th token. Logit: 20.80 Prob:  0.01% Token: |1|
Top 3th token. Logit: 20.77 Prob:  0.01% Token: |4|
Top 4th token. Logit: 20.73 Prob:  0.01% Token: |3|
Top 5th token. Logit: 20.49 Prob:  0.01% Token: |8|
Top 6th token. Logit: 20.01 Prob:  0.01% Token: |
|
Top 7th token. Logit: 19.68 Prob:  0.00% Token: |5|
Top 8th token. Logit: 19.37 Prob:  0.00% Token: |7|
Top 9th token. Logit: 19.05 Prob:  0.00% Token: |0|


Top 0th token. Logit: 29.89 Prob: 99.91% Token: |,|
Top 1th token. Logit: 22.54 Prob:  0.06% Token: | ,|
Top 2th token. Logit: 20.74 Prob:  0.01% Token: | |
Top 3th token. Logit: 19.81 Prob:  0.00% Token: |.|
Top 4th token. Logit: 19.01 Prob:  0.00% Token: |6|
Top 5th token. Logit: 18.73 Prob:  0.00% Token: |],|
Top 6th token. Logit: 18.19 Prob:  0.00% Token: |]|
Top 7th token. Logit: 18.16 Prob:  0.00% Token: |  |
Top 8th token. Logit: 17.84 Prob:  0.00% Token: |,,|
Top 9th token. Logit: 17.80 Prob:  0.00% Token: |]]|


Top 0th token. Logit: 29.89 Prob: 99.94% Token: | |
Top 1th token. Logit: 22.02 Prob:  0.04% Token: |6|
Top 2th token. Logit: 20.76 Prob:  0.01% Token: |  |
Top 3th token. Logit: 19.35 Prob:  0.00% Token: | ,|
Top 4th token. Logit: 19.13 Prob:  0.00% Token: | ...|
Top 5th token. Logit: 18.46 Prob:  0.00% Token: |
|
Top 6th token. Logit: 17.61 Prob:  0.00% Token: |2|
Top 7th token. Logit: 17.35 Prob:  0.00% Token: |   |
Top 8th token. Logit: 17.33 Prob:  0.00% Token: | ...,|
Top 9th token. Logit: 17.01 Prob:  0.00% Token: | ...]|


Top 0th token. Logit: 29.75 Prob: 99.90% Token: |6|
Top 1th token. Logit: 21.65 Prob:  0.03% Token: |2|
Top 2th token. Logit: 21.00 Prob:  0.02% Token: |4|
Top 3th token. Logit: 20.70 Prob:  0.01% Token: |8|
Top 4th token. Logit: 20.51 Prob:  0.01% Token: |5|
Top 5th token. Logit: 20.32 Prob:  0.01% Token: |3|
Top 6th token. Logit: 20.25 Prob:  0.01% Token: |1|
Top 7th token. Logit: 19.43 Prob:  0.00% Token: |7|
Top 8th token. Logit: 19.41 Prob:  0.00% Token: |
|
Top 9th token. Logit: 19.36 Prob:  0.00% Token: |0|


Top 0th token. Logit: 29.93 Prob: 99.93% Token: |,|
Top 1th token. Logit: 22.07 Prob:  0.04% Token: | ,|
Top 2th token. Logit: 20.26 Prob:  0.01% Token: | |
Top 3th token. Logit: 20.00 Prob:  0.00% Token: |.|
Top 4th token. Logit: 19.26 Prob:  0.00% Token: |]]|
Top 5th token. Logit: 19.12 Prob:  0.00% Token: |],|
Top 6th token. Logit: 18.85 Prob:  0.00% Token: |2|
Top 7th token. Logit: 18.85 Prob:  0.00% Token: |]])|
Top 8th token. Logit: 18.55 Prob:  0.00% Token: |]|
Top 9th token. Logit: 18.09 Prob:  0.00% Token: |  |


Top 0th token. Logit: 29.85 Prob: 99.93% Token: | |
Top 1th token. Logit: 22.09 Prob:  0.04% Token: |2|
Top 2th token. Logit: 20.89 Prob:  0.01% Token: |  |
Top 3th token. Logit: 19.43 Prob:  0.00% Token: | ,|
Top 4th token. Logit: 19.09 Prob:  0.00% Token: | ...|
Top 5th token. Logit: 18.98 Prob:  0.00% Token: |
|
Top 6th token. Logit: 17.78 Prob:  0.00% Token: |   |
Top 7th token. Logit: 17.73 Prob:  0.00% Token: | ...,|
Top 8th token. Logit: 17.56 Prob:  0.00% Token: | ...]|
Top 9th token. Logit: 17.41 Prob:  0.00% Token: | ]|


Top 0th token. Logit: 29.89 Prob: 99.95% Token: |2|
Top 1th token. Logit: 21.29 Prob:  0.02% Token: |6|
Top 2th token. Logit: 20.48 Prob:  0.01% Token: |3|
Top 3th token. Logit: 20.14 Prob:  0.01% Token: |1|
Top 4th token. Logit: 19.97 Prob:  0.00% Token: |4|
Top 5th token. Logit: 19.31 Prob:  0.00% Token: |5|
Top 6th token. Logit: 19.24 Prob:  0.00% Token: |0|
Top 7th token. Logit: 18.96 Prob:  0.00% Token: |8|
Top 8th token. Logit: 18.80 Prob:  0.00% Token: |
|
Top 9th token. Logit: 18.46 Prob:  0.00% Token: |7|


Top 0th token. Logit: 29.91 Prob: 99.89% Token: |,|
Top 1th token. Logit: 22.44 Prob:  0.06% Token: | ,|
Top 2th token. Logit: 20.87 Prob:  0.01% Token: | |
Top 3th token. Logit: 20.78 Prob:  0.01% Token: |]])|
Top 4th token. Logit: 20.08 Prob:  0.01% Token: |.|
Top 5th token. Logit: 19.75 Prob:  0.00% Token: |],|
Top 6th token. Logit: 19.59 Prob:  0.00% Token: |]]|
Top 7th token. Logit: 19.54 Prob:  0.00% Token: |]],|
Top 8th token. Logit: 19.13 Prob:  0.00% Token: |6|
Top 9th token. Logit: 18.69 Prob:  0.00% Token: |]|


Top 0th token. Logit: 29.86 Prob: 99.90% Token: | |
Top 1th token. Logit: 22.52 Prob:  0.06% Token: |6|
Top 2th token. Logit: 20.85 Prob:  0.01% Token: |  |
Top 3th token. Logit: 19.76 Prob:  0.00% Token: | ...|
Top 4th token. Logit: 19.63 Prob:  0.00% Token: |
|
Top 5th token. Logit: 18.84 Prob:  0.00% Token: | ]|
Top 6th token. Logit: 18.39 Prob:  0.00% Token: |2|
Top 7th token. Logit: 18.38 Prob:  0.00% Token: |<eos>|
Top 8th token. Logit: 18.27 Prob:  0.00% Token: |]]|
Top 9th token. Logit: 18.15 Prob:  0.00% Token: | ,|


Top 0th token. Logit: 29.72 Prob: 99.87% Token: |6|
Top 1th token. Logit: 22.33 Prob:  0.06% Token: |2|
Top 2th token. Logit: 20.74 Prob:  0.01% Token: |4|
Top 3th token. Logit: 20.50 Prob:  0.01% Token: |1|
Top 4th token. Logit: 20.46 Prob:  0.01% Token: |3|
Top 5th token. Logit: 20.45 Prob:  0.01% Token: |5|
Top 6th token. Logit: 20.02 Prob:  0.01% Token: |8|
Top 7th token. Logit: 19.69 Prob:  0.00% Token: |0|
Top 8th token. Logit: 19.45 Prob:  0.00% Token: |7|
Top 9th token. Logit: 19.24 Prob:  0.00% Token: |
|


Top 0th token. Logit: 24.69 Prob: 90.81% Token: |]])|
Top 1th token. Logit: 21.55 Prob:  3.96% Token: |]]|
Top 2th token. Logit: 21.10 Prob:  2.51% Token: |]],|
Top 3th token. Logit: 20.43 Prob:  1.28% Token: |],|
Top 4th token. Logit: 20.05 Prob:  0.88% Token: |]|
Top 5th token. Logit: 17.94 Prob:  0.11% Token: |])|
Top 6th token. Logit: 17.80 Prob:  0.09% Token: |]]);|
Top 7th token. Logit: 17.16 Prob:  0.05% Token: | ]|
Top 8th token. Logit: 17.02 Prob:  0.04% Token: |,|
Top 9th token. Logit: 16.89 Prob:  0.04% Token: |
|


In [None]:

# Generate text with feature ablation
print("Test Prompt with feature ablation and no error term")
ablation_feature = 3655  # 3655: "References to data structures and conditional checks in programming"
sae.use_error_term = False

test_prompt_with_ablation(model, sae, prompt, answer, ablation_feature)

Test Prompt with feature ablation and no error term
Tokenized prompt: ['<bos>', 'Below', ' are', ' pairs', ' of', ' matrices', '.', ' ', '\n', 'There', ' is', ' a', ' mapping', ' which', ' operates', ' on', ' each', ' input', ' to', ' give', ' the', ' output', ',', ' only', ' one', ' mapping', ' applies', ' to', ' all', ' matrices', '.', ' ', '\n', 'Review', ' the', ' matrices', ' to', ' learn', ' that', ' mapping', ' and', ' then', ' estimate', ' the', ' missing', ' output', ' for', ' the', ' final', ' input', ' matrix', '.', '\n\n', 'FIRST', ' score', ' your', ' confidence', ' that', ' you', ' understand', ' the', ' mapping', ' pattern', ',', ' ', '0', '-', '5', ' where', ' ', '0', ' is', ' zero', ' is', ' no', ' confidence', ' and', ' ', '5', ' is', ' highly', ' confident', '.', ' ', '\n', 'This', ' score', ' must', ' be', ' the', ' FIRST', ' output', ' you', ' give', ',', ' no', ' preamble', ',', ' no', ' prefix', ',', ' no', ' punctuation', ',', ' just', ' a', ' single', ' digit',

Top 0th token. Logit: 15.97 Prob: 92.75% Token: |
|
Top 1th token. Logit: 11.80 Prob:  1.45% Token: |2|
Top 2th token. Logit: 11.62 Prob:  1.20% Token: |6|
Top 3th token. Logit: 11.58 Prob:  1.15% Token: |<em>|
Top 4th token. Logit: 11.13 Prob:  0.74% Token: |<strong>|
Top 5th token. Logit: 10.84 Prob:  0.55% Token: |

|
Top 6th token. Logit: 10.20 Prob:  0.29% Token: |8|
Top 7th token. Logit: 10.18 Prob:  0.29% Token: |4|
Top 8th token. Logit:  9.93 Prob:  0.22% Token: |1|
Top 9th token. Logit:  9.77 Prob:  0.19% Token: |<b>|


Top 0th token. Logit: 15.06 Prob: 38.54% Token: |
|
Top 1th token. Logit: 13.90 Prob: 12.04% Token: |([|
Top 2th token. Logit: 13.45 Prob:  7.67% Token: |.|
Top 3th token. Logit: 13.23 Prob:  6.20% Token: | |
Top 4th token. Logit: 12.66 Prob:  3.48% Token: | is|
Top 5th token. Logit: 12.24 Prob:  2.29% Token: |                               |
Top 6th token. Logit: 12.17 Prob:  2.14% Token: |;|
Top 7th token. Logit: 11.99 Prob:  1.79% Token: |:|
Top 8th token. Logit: 11.87 Prob:  1.58% Token: |:[|
Top 9th token. Logit: 11.53 Prob:  1.13% Token: |!|


Top 0th token. Logit: 24.46 Prob: 62.54% Token: |
|
Top 1th token. Logit: 21.51 Prob:  3.29% Token: |.|
Top 2th token. Logit: 21.46 Prob:  3.13% Token: |6|
Top 3th token. Logit: 21.19 Prob:  2.38% Token: | where|
Top 4th token. Logit: 21.16 Prob:  2.30% Token: | is|
Top 5th token. Logit: 21.01 Prob:  1.99% Token: |5|
Top 6th token. Logit: 20.98 Prob:  1.94% Token: |,|
Top 7th token. Logit: 20.78 Prob:  1.58% Token: | |
Top 8th token. Logit: 20.54 Prob:  1.24% Token: |0|
Top 9th token. Logit: 20.38 Prob:  1.06% Token: |4|


Top 0th token. Logit: 16.80 Prob: 11.15% Token: |array|
Top 1th token. Logit: 16.53 Prob:  8.55% Token: |                               |
Top 2th token. Logit: 16.03 Prob:  5.16% Token: |5|
Top 3th token. Logit: 16.03 Prob:  5.15% Token: |				|
Top 4th token. Logit: 15.73 Prob:  3.82% Token: |is|
Top 5th token. Logit: 15.36 Prob:  2.64% Token: | |
Top 6th token. Logit: 15.15 Prob:  2.14% Token: |where|
Top 7th token. Logit: 15.13 Prob:  2.10% Token: |</code>|
Top 8th token. Logit: 14.90 Prob:  1.67% Token: |should|
Top 9th token. Logit: 14.76 Prob:  1.45% Token: |								|


Top 0th token. Logit: 22.63 Prob: 87.98% Token: |.|
Top 1th token. Logit: 19.39 Prob:  3.44% Token: | array|
Top 2th token. Logit: 18.72 Prob:  1.76% Token: |.[|
Top 3th token. Logit: 18.16 Prob:  1.00% Token: |
|
Top 4th token. Logit: 17.50 Prob:  0.52% Token: |array|
Top 5th token. Logit: 17.14 Prob:  0.36% Token: |.#|
Top 6th token. Logit: 16.95 Prob:  0.30% Token: |5|
Top 7th token. Logit: 16.86 Prob:  0.27% Token: |.*|
Top 8th token. Logit: 16.68 Prob:  0.23% Token: |([|
Top 9th token. Logit: 16.62 Prob:  0.21% Token: |                               |


Top 0th token. Logit: 21.40 Prob: 94.78% Token: |array|
Top 1th token. Logit: 16.26 Prob:  0.56% Token: |matrix|
Top 2th token. Logit: 16.20 Prob:  0.52% Token: |arange|
Top 3th token. Logit: 16.12 Prob:  0.48% Token: |
|
Top 4th token. Logit: 15.42 Prob:  0.24% Token: |ones|
Top 5th token. Logit: 14.85 Prob:  0.14% Token: |eeeeee|
Top 6th token. Logit: 14.82 Prob:  0.13% Token: |eye|
Top 7th token. Logit: 14.74 Prob:  0.12% Token: |random|
Top 8th token. Logit: 14.66 Prob:  0.11% Token: |arrays|
Top 9th token. Logit: 14.62 Prob:  0.11% Token: |nan|


Top 0th token. Logit: 22.24 Prob: 50.64% Token: |([|
Top 1th token. Logit: 21.72 Prob: 30.19% Token: |([[|
Top 2th token. Logit: 20.63 Prob: 10.20% Token: |(|
Top 3th token. Logit: 19.30 Prob:  2.68% Token: |
|
Top 4th token. Logit: 19.13 Prob:  2.26% Token: |((|
Top 5th token. Logit: 17.85 Prob:  0.63% Token: |(((|
Top 6th token. Logit: 17.76 Prob:  0.58% Token: |([(|
Top 7th token. Logit: 17.40 Prob:  0.40% Token: |.|
Top 8th token. Logit: 17.19 Prob:  0.33% Token: |(([|
Top 9th token. Logit: 17.03 Prob:  0.28% Token: |

|


Top 0th token. Logit: 25.64 Prob: 91.00% Token: |6|
Top 1th token. Logit: 22.22 Prob:  2.96% Token: |5|
Top 2th token. Logit: 21.56 Prob:  1.53% Token: |2|
Top 3th token. Logit: 21.46 Prob:  1.39% Token: | |
Top 4th token. Logit: 20.55 Prob:  0.56% Token: |3|
Top 5th token. Logit: 20.47 Prob:  0.52% Token: |4|
Top 6th token. Logit: 20.33 Prob:  0.45% Token: |
|
Top 7th token. Logit: 19.54 Prob:  0.20% Token: |1|
Top 8th token. Logit: 19.12 Prob:  0.13% Token: |8|
Top 9th token. Logit: 19.01 Prob:  0.12% Token: |7|


Top 0th token. Logit: 18.15 Prob: 99.56% Token: |,|
Top 1th token. Logit: 11.10 Prob:  0.09% Token: | ,|
Top 2th token. Logit: 10.41 Prob:  0.04% Token: |6|
Top 3th token. Logit: 10.31 Prob:  0.04% Token: |,(|
Top 4th token. Logit: 10.28 Prob:  0.04% Token: |.|
Top 5th token. Logit: 10.13 Prob:  0.03% Token: |
|
Top 6th token. Logit:  9.63 Prob:  0.02% Token: |*|
Top 7th token. Logit:  9.52 Prob:  0.02% Token: |,...|
Top 8th token. Logit:  9.20 Prob:  0.01% Token: |,,|
Top 9th token. Logit:  9.20 Prob:  0.01% Token: | |


Top 0th token. Logit: 27.35 Prob: 90.07% Token: | |
Top 1th token. Logit: 24.93 Prob:  7.94% Token: |6|
Top 2th token. Logit: 22.62 Prob:  0.79% Token: |2|
Top 3th token. Logit: 21.35 Prob:  0.22% Token: |
|
Top 4th token. Logit: 21.22 Prob:  0.20% Token: |  |
Top 5th token. Logit: 20.89 Prob:  0.14% Token: |   |
Top 6th token. Logit: 20.74 Prob:  0.12% Token: |5|
Top 7th token. Logit: 20.29 Prob:  0.08% Token: |    |
Top 8th token. Logit: 20.24 Prob:  0.07% Token: |8|
Top 9th token. Logit: 19.47 Prob:  0.03% Token: |                               |


Top 0th token. Logit: 27.85 Prob: 96.18% Token: |6|
Top 1th token. Logit: 24.40 Prob:  3.08% Token: |2|
Top 2th token. Logit: 22.33 Prob:  0.39% Token: |8|
Top 3th token. Logit: 21.17 Prob:  0.12% Token: |5|
Top 4th token. Logit: 20.48 Prob:  0.06% Token: |4|
Top 5th token. Logit: 20.27 Prob:  0.05% Token: |
|
Top 6th token. Logit: 20.16 Prob:  0.04% Token: |7|
Top 7th token. Logit: 19.75 Prob:  0.03% Token: |9|
Top 8th token. Logit: 19.59 Prob:  0.02% Token: |3|
Top 9th token. Logit: 19.15 Prob:  0.02% Token: |1|


Top 0th token. Logit: 18.94 Prob: 99.30% Token: |,|
Top 1th token. Logit: 11.93 Prob:  0.09% Token: |,...|
Top 2th token. Logit: 11.85 Prob:  0.08% Token: |],|
Top 3th token. Logit: 11.82 Prob:  0.08% Token: | ,|
Top 4th token. Logit: 11.79 Prob:  0.08% Token: |])|
Top 5th token. Logit: 11.32 Prob:  0.05% Token: |,*|
Top 6th token. Logit: 11.31 Prob:  0.05% Token: |
|
Top 7th token. Logit: 10.93 Prob:  0.03% Token: |,(|
Top 8th token. Logit: 10.88 Prob:  0.03% Token: |*|
Top 9th token. Logit: 10.24 Prob:  0.02% Token: |,,|


Top 0th token. Logit: 28.60 Prob: 99.68% Token: | |
Top 1th token. Logit: 21.59 Prob:  0.09% Token: |6|
Top 2th token. Logit: 21.39 Prob:  0.07% Token: |
|
Top 3th token. Logit: 20.43 Prob:  0.03% Token: |  |
Top 4th token. Logit: 19.77 Prob:  0.01% Token: |            |
Top 5th token. Logit: 19.56 Prob:  0.01% Token: |                               |
Top 6th token. Logit: 19.43 Prob:  0.01% Token: |   |
Top 7th token. Logit: 19.16 Prob:  0.01% Token: |    |
Top 8th token. Logit: 19.03 Prob:  0.01% Token: |                |
Top 9th token. Logit: 18.93 Prob:  0.01% Token: | ...|


Top 0th token. Logit: 22.76 Prob: 90.24% Token: |6|
Top 1th token. Logit: 19.56 Prob:  3.65% Token: |,|
Top 2th token. Logit: 18.75 Prob:  1.64% Token: |4|
Top 3th token. Logit: 18.37 Prob:  1.11% Token: |
|
Top 4th token. Logit: 17.85 Prob:  0.66% Token: |5|
Top 5th token. Logit: 17.74 Prob:  0.59% Token: | |
Top 6th token. Logit: 17.11 Prob:  0.32% Token: |2|
Top 7th token. Logit: 17.03 Prob:  0.29% Token: |*|
Top 8th token. Logit: 16.77 Prob:  0.23% Token: |8|
Top 9th token. Logit: 16.21 Prob:  0.13% Token: |<#|


Top 0th token. Logit: 20.19 Prob: 99.79% Token: |,|
Top 1th token. Logit: 13.09 Prob:  0.08% Token: |,...|
Top 2th token. Logit: 12.19 Prob:  0.03% Token: |
|
Top 3th token. Logit: 11.97 Prob:  0.03% Token: | ,|
Top 4th token. Logit: 10.77 Prob:  0.01% Token: |,,|
Top 5th token. Logit: 10.73 Prob:  0.01% Token: |,*|
Top 6th token. Logit: 10.71 Prob:  0.01% Token: |,(|
Top 7th token. Logit: 10.03 Prob:  0.00% Token: |...|
Top 8th token. Logit:  9.91 Prob:  0.00% Token: |,....|
Top 9th token. Logit:  9.86 Prob:  0.00% Token: |,"|


Top 0th token. Logit: 23.34 Prob: 97.77% Token: | |
Top 1th token. Logit: 18.80 Prob:  1.05% Token: |
|
Top 2th token. Logit: 17.33 Prob:  0.24% Token: |2|
Top 3th token. Logit: 17.24 Prob:  0.22% Token: |  |
Top 4th token. Logit: 16.78 Prob:  0.14% Token: | *|
Top 5th token. Logit: 16.29 Prob:  0.09% Token: |                               |
Top 6th token. Logit: 16.13 Prob:  0.07% Token: |4|
Top 7th token. Logit: 15.84 Prob:  0.05% Token: |   |
Top 8th token. Logit: 15.69 Prob:  0.05% Token: |        |
Top 9th token. Logit: 15.40 Prob:  0.03% Token: | ...|


Top 0th token. Logit: 13.12 Prob: 42.44% Token: |2|
Top 1th token. Logit: 12.18 Prob: 16.62% Token: |*|
Top 2th token. Logit: 11.83 Prob: 11.61% Token: |
|
Top 3th token. Logit: 10.88 Prob:  4.50% Token: |,|
Top 4th token. Logit: 10.41 Prob:  2.83% Token: |****************|
Top 5th token. Logit: 10.35 Prob:  2.66% Token: |8|
Top 6th token. Logit: 10.25 Prob:  2.40% Token: |...|
Top 7th token. Logit: 10.07 Prob:  2.00% Token: |**|
Top 8th token. Logit:  9.90 Prob:  1.69% Token: |4|
Top 9th token. Logit:  9.69 Prob:  1.38% Token: |<em>|


Top 0th token. Logit: 21.94 Prob: 99.40% Token: |,|
Top 1th token. Logit: 15.60 Prob:  0.18% Token: |,...|
Top 2th token. Logit: 15.48 Prob:  0.15% Token: |,*|
Top 3th token. Logit: 15.01 Prob:  0.10% Token: |
|
Top 4th token. Logit: 13.47 Prob:  0.02% Token: | ,|
Top 5th token. Logit: 13.16 Prob:  0.02% Token: |,(|
Top 6th token. Logit: 12.93 Prob:  0.01% Token: |,[|
Top 7th token. Logit: 12.81 Prob:  0.01% Token: |,....|
Top 8th token. Logit: 12.70 Prob:  0.01% Token: |,,|
Top 9th token. Logit: 12.63 Prob:  0.01% Token: |,$|


Top 0th token. Logit: 22.33 Prob: 96.56% Token: | |
Top 1th token. Logit: 17.96 Prob:  1.22% Token: |
|
Top 2th token. Logit: 17.40 Prob:  0.70% Token: |6|
Top 3th token. Logit: 16.67 Prob:  0.34% Token: |                               |
Top 4th token. Logit: 16.04 Prob:  0.18% Token: |  |
Top 5th token. Logit: 15.87 Prob:  0.15% Token: |            |
Top 6th token. Logit: 15.28 Prob:  0.08% Token: | *|
Top 7th token. Logit: 15.16 Prob:  0.07% Token: |                |
Top 8th token. Logit: 14.93 Prob:  0.06% Token: | ...|
Top 9th token. Logit: 14.28 Prob:  0.03% Token: |                            |


Top 0th token. Logit: 19.48 Prob: 86.02% Token: |6|
Top 1th token. Logit: 17.14 Prob:  8.32% Token: |8|
Top 2th token. Logit: 15.52 Prob:  1.65% Token: |*|
Top 3th token. Logit: 15.20 Prob:  1.19% Token: |
|
Top 4th token. Logit: 14.45 Prob:  0.56% Token: |,|
Top 5th token. Logit: 14.07 Prob:  0.39% Token: |4|
Top 6th token. Logit: 13.29 Prob:  0.18% Token: |                               |
Top 7th token. Logit: 13.24 Prob:  0.17% Token: |5|
Top 8th token. Logit: 13.16 Prob:  0.16% Token: | |
Top 9th token. Logit: 12.67 Prob:  0.10% Token: |                |


Top 0th token. Logit: 23.76 Prob: 99.75% Token: |,|
Top 1th token. Logit: 16.43 Prob:  0.07% Token: |
|
Top 2th token. Logit: 15.73 Prob:  0.03% Token: | ,|
Top 3th token. Logit: 15.65 Prob:  0.03% Token: |,...|
Top 4th token. Logit: 15.20 Prob:  0.02% Token: |,*|
Top 5th token. Logit: 14.57 Prob:  0.01% Token: |,"|
Top 6th token. Logit: 14.30 Prob:  0.01% Token: |,]|
Top 7th token. Logit: 14.18 Prob:  0.01% Token: |,,|
Top 8th token. Logit: 14.02 Prob:  0.01% Token: |,(|
Top 9th token. Logit: 13.79 Prob:  0.00% Token: |,)|


Top 0th token. Logit: 26.22 Prob: 98.28% Token: | |
Top 1th token. Logit: 21.31 Prob:  0.72% Token: |
|
Top 2th token. Logit: 20.87 Prob:  0.47% Token: |2|
Top 3th token. Logit: 19.56 Prob:  0.13% Token: |  |
Top 4th token. Logit: 19.55 Prob:  0.12% Token: |4|
Top 5th token. Logit: 18.51 Prob:  0.04% Token: | *|
Top 6th token. Logit: 17.76 Prob:  0.02% Token: | is|
Top 7th token. Logit: 17.74 Prob:  0.02% Token: | Mathf|
Top 8th token. Logit: 17.56 Prob:  0.02% Token: |8|
Top 9th token. Logit: 17.38 Prob:  0.01% Token: |   |


Top 0th token. Logit: 21.51 Prob: 67.68% Token: |2|
Top 1th token. Logit: 19.71 Prob: 11.25% Token: |8|
Top 2th token. Logit: 19.64 Prob: 10.43% Token: | |
Top 3th token. Logit: 18.71 Prob:  4.14% Token: |4|
Top 4th token. Logit: 18.20 Prob:  2.48% Token: |*|
Top 5th token. Logit: 17.44 Prob:  1.16% Token: |
|
Top 6th token. Logit: 16.71 Prob:  0.56% Token: |,|
Top 7th token. Logit: 16.28 Prob:  0.36% Token: |]|
Top 8th token. Logit: 15.78 Prob:  0.22% Token: |])|
Top 9th token. Logit: 15.21 Prob:  0.12% Token: |3|


Top 0th token. Logit: 24.31 Prob: 51.05% Token: |])|
Top 1th token. Logit: 24.06 Prob: 39.44% Token: |],|
Top 2th token. Logit: 21.91 Prob:  4.60% Token: |,|
Top 3th token. Logit: 20.64 Prob:  1.29% Token: |]|
Top 4th token. Logit: 20.08 Prob:  0.74% Token: |
|
Top 5th token. Logit: 20.01 Prob:  0.69% Token: |]])|
Top 6th token. Logit: 19.89 Prob:  0.61% Token: |]]|
Top 7th token. Logit: 19.57 Prob:  0.45% Token: |],[|
Top 8th token. Logit: 18.26 Prob:  0.12% Token: | ],|
Top 9th token. Logit: 18.20 Prob:  0.11% Token: | ])|


Top 0th token. Logit: 25.23 Prob: 94.76% Token: |
|
Top 1th token. Logit: 21.85 Prob:  3.22% Token: | |
Top 2th token. Logit: 19.42 Prob:  0.28% Token: |                               |
Top 3th token. Logit: 19.29 Prob:  0.25% Token: |

|
Top 4th token. Logit: 18.59 Prob:  0.12% Token: |  |
Top 5th token. Logit: 18.41 Prob:  0.10% Token: |...|
Top 6th token. Logit: 18.28 Prob:  0.09% Token: | #|
Top 7th token. Logit: 18.26 Prob:  0.09% Token: |               |
Top 8th token. Logit: 18.03 Prob:  0.07% Token: |                |
Top 9th token. Logit: 17.82 Prob:  0.06% Token: | ...|


Top 0th token. Logit: 20.04 Prob: 27.00% Token: |                               |
Top 1th token. Logit: 18.95 Prob:  9.00% Token: |                       |
Top 2th token. Logit: 18.90 Prob:  8.57% Token: |                   |
Top 3th token. Logit: 18.85 Prob:  8.22% Token: |                          |
Top 4th token. Logit: 18.48 Prob:  5.63% Token: |                        |
Top 5th token. Logit: 18.10 Prob:  3.86% Token: |                |
Top 6th token. Logit: 18.06 Prob:  3.73% Token: |                         |
Top 7th token. Logit: 17.97 Prob:  3.40% Token: |                            |
Top 8th token. Logit: 17.95 Prob:  3.32% Token: |                           |
Top 9th token. Logit: 17.91 Prob:  3.19% Token: |                    |


Top 0th token. Logit: 21.95 Prob: 84.52% Token: |2|
Top 1th token. Logit: 18.72 Prob:  3.34% Token: |		|
Top 2th token. Logit: 18.16 Prob:  1.90% Token: |                   |
Top 3th token. Logit: 17.40 Prob:  0.89% Token: |                |
Top 4th token. Logit: 17.31 Prob:  0.81% Token: | |
Top 5th token. Logit: 17.06 Prob:  0.63% Token: |8|
Top 6th token. Logit: 17.00 Prob:  0.60% Token: |
|
Top 7th token. Logit: 16.91 Prob:  0.55% Token: |        |
Top 8th token. Logit: 16.75 Prob:  0.47% Token: |                       |
Top 9th token. Logit: 16.67 Prob:  0.43% Token: |                    |


Top 0th token. Logit: 22.75 Prob: 99.65% Token: |,|
Top 1th token. Logit: 15.36 Prob:  0.06% Token: | ,|
Top 2th token. Logit: 15.32 Prob:  0.06% Token: |,*|
Top 3th token. Logit: 14.27 Prob:  0.02% Token: |   |
Top 4th token. Logit: 14.17 Prob:  0.02% Token: |  |
Top 5th token. Logit: 13.93 Prob:  0.01% Token: |,...|
Top 6th token. Logit: 13.71 Prob:  0.01% Token: |    |
Top 7th token. Logit: 13.68 Prob:  0.01% Token: |,"|
Top 8th token. Logit: 13.58 Prob:  0.01% Token: |                |
Top 9th token. Logit: 13.45 Prob:  0.01% Token: |
|


Top 0th token. Logit: 26.74 Prob: 98.69% Token: | |
Top 1th token. Logit: 21.39 Prob:  0.47% Token: |6|
Top 2th token. Logit: 20.80 Prob:  0.26% Token: |  |
Top 3th token. Logit: 20.28 Prob:  0.15% Token: |2|
Top 4th token. Logit: 20.10 Prob:  0.13% Token: |   |
Top 5th token. Logit: 19.70 Prob:  0.09% Token: |
|
Top 6th token. Logit: 18.66 Prob:  0.03% Token: |    |
Top 7th token. Logit: 18.39 Prob:  0.02% Token: |            |
Top 8th token. Logit: 18.16 Prob:  0.02% Token: | *|
Top 9th token. Logit: 17.86 Prob:  0.01% Token: |     |


Top 0th token. Logit: 16.82 Prob: 89.44% Token: |6|
Top 1th token. Logit: 14.62 Prob:  9.93% Token: |2|
Top 2th token. Logit: 11.23 Prob:  0.33% Token: |4|
Top 3th token. Logit: 10.08 Prob:  0.11% Token: |5|
Top 4th token. Logit:  9.61 Prob:  0.07% Token: |
|
Top 5th token. Logit:  8.47 Prob:  0.02% Token: |3|
Top 6th token. Logit:  8.07 Prob:  0.01% Token: |8|
Top 7th token. Logit:  7.51 Prob:  0.01% Token: |*|
Top 8th token. Logit:  7.08 Prob:  0.01% Token: |<em>|
Top 9th token. Logit:  7.00 Prob:  0.00% Token: |7|


Top 0th token. Logit: 17.98 Prob: 99.62% Token: |,|
Top 1th token. Logit: 10.95 Prob:  0.09% Token: | ,|
Top 2th token. Logit: 10.10 Prob:  0.04% Token: |,*|
Top 3th token. Logit: 10.05 Prob:  0.04% Token: |,[|
Top 4th token. Logit: 10.05 Prob:  0.04% Token: |
|
Top 5th token. Logit:  9.37 Prob:  0.02% Token: |*|
Top 6th token. Logit:  9.21 Prob:  0.02% Token: |,(|
Top 7th token. Logit:  9.05 Prob:  0.01% Token: |,)|
Top 8th token. Logit:  9.03 Prob:  0.01% Token: |,...|
Top 9th token. Logit:  8.67 Prob:  0.01% Token: |,,|


Top 0th token. Logit: 26.84 Prob: 99.57% Token: | |
Top 1th token. Logit: 20.18 Prob:  0.13% Token: |2|
Top 2th token. Logit: 19.90 Prob:  0.10% Token: |  |
Top 3th token. Logit: 19.59 Prob:  0.07% Token: |
|
Top 4th token. Logit: 18.41 Prob:  0.02% Token: |   |
Top 5th token. Logit: 18.16 Prob:  0.02% Token: | *|
Top 6th token. Logit: 18.02 Prob:  0.01% Token: | z|
Top 7th token. Logit: 17.12 Prob:  0.01% Token: |    |
Top 8th token. Logit: 16.82 Prob:  0.00% Token: | is|
Top 9th token. Logit: 16.62 Prob:  0.00% Token: | #|


Top 0th token. Logit: 18.40 Prob: 92.91% Token: |2|
Top 1th token. Logit: 15.54 Prob:  5.30% Token: |8|
Top 2th token. Logit: 13.47 Prob:  0.67% Token: |
|
Top 3th token. Logit: 13.16 Prob:  0.49% Token: |6|
Top 4th token. Logit: 11.72 Prob:  0.12% Token: |<em>|
Top 5th token. Logit: 11.06 Prob:  0.06% Token: |4|
Top 6th token. Logit: 10.57 Prob:  0.04% Token: |                               |
Top 7th token. Logit: 10.51 Prob:  0.03% Token: | |
Top 8th token. Logit: 10.05 Prob:  0.02% Token: |3|
Top 9th token. Logit: 10.05 Prob:  0.02% Token: |*|


Top 0th token. Logit: 20.06 Prob: 99.69% Token: |,|
Top 1th token. Logit: 12.92 Prob:  0.08% Token: | ,|
Top 2th token. Logit: 12.42 Prob:  0.05% Token: |,*|
Top 3th token. Logit: 12.06 Prob:  0.03% Token: |,...|
Top 4th token. Logit: 11.70 Prob:  0.02% Token: |
|
Top 5th token. Logit: 11.67 Prob:  0.02% Token: |,)|
Top 6th token. Logit: 10.73 Prob:  0.01% Token: |,[|
Top 7th token. Logit: 10.68 Prob:  0.01% Token: |,}|
Top 8th token. Logit: 10.63 Prob:  0.01% Token: |,#|
Top 9th token. Logit: 10.41 Prob:  0.01% Token: |,]|


Top 0th token. Logit: 25.05 Prob: 99.52% Token: | |
Top 1th token. Logit: 18.50 Prob:  0.14% Token: | *|
Top 2th token. Logit: 18.12 Prob:  0.10% Token: |
|
Top 3th token. Logit: 17.72 Prob:  0.07% Token: |  |
Top 4th token. Logit: 17.49 Prob:  0.05% Token: |6|
Top 5th token. Logit: 16.32 Prob:  0.02% Token: |8|
Top 6th token. Logit: 15.74 Prob:  0.01% Token: | `|
Top 7th token. Logit: 15.65 Prob:  0.01% Token: |2|
Top 8th token. Logit: 15.40 Prob:  0.01% Token: |   |
Top 9th token. Logit: 15.14 Prob:  0.00% Token: | #|


Top 0th token. Logit: 14.39 Prob: 51.26% Token: |6|
Top 1th token. Logit: 14.01 Prob: 35.33% Token: |8|
Top 2th token. Logit: 11.85 Prob:  4.04% Token: |
|
Top 3th token. Logit: 11.75 Prob:  3.68% Token: |2|
Top 4th token. Logit: 10.69 Prob:  1.27% Token: |4|
Top 5th token. Logit: 10.24 Prob:  0.81% Token: |<em>|
Top 6th token. Logit:  8.80 Prob:  0.19% Token: |</code>|
Top 7th token. Logit:  8.64 Prob:  0.16% Token: |{*|
Top 8th token. Logit:  8.58 Prob:  0.15% Token: |*|
Top 9th token. Logit:  8.50 Prob:  0.14% Token: |<strong>|


Top 0th token. Logit: 20.19 Prob: 99.82% Token: |,|
Top 1th token. Logit: 12.65 Prob:  0.05% Token: |,*|
Top 2th token. Logit: 12.02 Prob:  0.03% Token: |
|
Top 3th token. Logit: 12.02 Prob:  0.03% Token: |,...|
Top 4th token. Logit: 11.84 Prob:  0.02% Token: | ,|
Top 5th token. Logit: 10.20 Prob:  0.00% Token: |,[|
Top 6th token. Logit: 10.10 Prob:  0.00% Token: |,)|
Top 7th token. Logit:  9.75 Prob:  0.00% Token: |,(|
Top 8th token. Logit:  9.59 Prob:  0.00% Token: |,$|
Top 9th token. Logit:  9.54 Prob:  0.00% Token: |,....|


Top 0th token. Logit: 25.64 Prob: 99.55% Token: | |
Top 1th token. Logit: 18.81 Prob:  0.11% Token: |
|
Top 2th token. Logit: 18.31 Prob:  0.07% Token: |                               |
Top 3th token. Logit: 18.29 Prob:  0.06% Token: | *|
Top 4th token. Logit: 18.17 Prob:  0.06% Token: |  |
Top 5th token. Logit: 17.37 Prob:  0.03% Token: | Mathf|
Top 6th token. Logit: 16.84 Prob:  0.02% Token: |   |
Top 7th token. Logit: 16.69 Prob:  0.01% Token: |        |
Top 8th token. Logit: 16.64 Prob:  0.01% Token: |2|
Top 9th token. Logit: 15.92 Prob:  0.01% Token: |         |


Top 0th token. Logit: 17.17 Prob: 27.06% Token: |2|
Top 1th token. Logit: 16.82 Prob: 19.09% Token: |
|
Top 2th token. Logit: 16.43 Prob: 12.86% Token: |*|
Top 3th token. Logit: 16.27 Prob: 11.03% Token: |8|
Top 4th token. Logit: 16.20 Prob: 10.31% Token: |,|
Top 5th token. Logit: 14.61 Prob:  2.10% Token: |**|
Top 6th token. Logit: 14.56 Prob:  1.99% Token: |                               |
Top 7th token. Logit: 14.46 Prob:  1.80% Token: |<em>|
Top 8th token. Logit: 13.79 Prob:  0.92% Token: |,*|
Top 9th token. Logit: 13.71 Prob:  0.85% Token: |4|


Top 0th token. Logit: 25.17 Prob: 99.45% Token: |,|
Top 1th token. Logit: 18.19 Prob:  0.09% Token: |,*|
Top 2th token. Logit: 18.02 Prob:  0.08% Token: |
|
Top 3th token. Logit: 17.96 Prob:  0.07% Token: |,$|
Top 4th token. Logit: 17.83 Prob:  0.06% Token: |,...|
Top 5th token. Logit: 17.36 Prob:  0.04% Token: |,]|
Top 6th token. Logit: 17.35 Prob:  0.04% Token: | ,|
Top 7th token. Logit: 16.67 Prob:  0.02% Token: |,{|
Top 8th token. Logit: 16.40 Prob:  0.02% Token: |,)|
Top 9th token. Logit: 16.18 Prob:  0.01% Token: |,}|


Top 0th token. Logit: 26.63 Prob: 98.40% Token: | |
Top 1th token. Logit: 22.03 Prob:  0.99% Token: |
|
Top 2th token. Logit: 20.26 Prob:  0.17% Token: |8|
Top 3th token. Logit: 19.85 Prob:  0.11% Token: | *|
Top 4th token. Logit: 19.47 Prob:  0.08% Token: |  |
Top 5th token. Logit: 18.90 Prob:  0.04% Token: |                               |
Top 6th token. Logit: 18.09 Prob:  0.02% Token: | fucking|
Top 7th token. Logit: 17.70 Prob:  0.01% Token: |       |
Top 8th token. Logit: 17.59 Prob:  0.01% Token: |                |
Top 9th token. Logit: 17.49 Prob:  0.01% Token: | n|


Top 0th token. Logit: 18.21 Prob: 62.42% Token: |*|
Top 1th token. Logit: 16.70 Prob: 13.73% Token: |
|
Top 2th token. Logit: 16.28 Prob:  9.00% Token: |8|
Top 3th token. Logit: 14.75 Prob:  1.95% Token: | |
Top 4th token. Logit: 14.29 Prob:  1.24% Token: |⁸|
Top 5th token. Logit: 13.75 Prob:  0.72% Token: | *|
Top 6th token. Logit: 13.61 Prob:  0.63% Token: |        |
Top 7th token. Logit: 13.60 Prob:  0.62% Token: |]|
Top 8th token. Logit: 13.42 Prob:  0.52% Token: |]};|
Top 9th token. Logit: 13.41 Prob:  0.51% Token: |</code>|


Top 0th token. Logit: 21.78 Prob: 77.94% Token: |]|
Top 1th token. Logit: 19.35 Prob:  6.84% Token: |])|
Top 2th token. Logit: 18.69 Prob:  3.56% Token: |
|
Top 3th token. Logit: 18.33 Prob:  2.48% Token: |],|
Top 4th token. Logit: 17.79 Prob:  1.44% Token: |)]|
Top 5th token. Logit: 17.66 Prob:  1.27% Token: |                               |
Top 6th token. Logit: 17.54 Prob:  1.13% Token: |]}|
Top 7th token. Logit: 17.52 Prob:  1.10% Token: |]]|
Top 8th token. Logit: 17.11 Prob:  0.73% Token: |,|
Top 9th token. Logit: 16.25 Prob:  0.31% Token: |*|



We are interested in an entire output matrix, which spans many tokens rather than one.
The per-token output format above makes that difficult to interpret.

Feature ablation has not been pursued further at this time.
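
One possible way to make the output easier to read, sketched below, is to collapse each ten-line block to its top-ranked token, giving one row per answer position. The helper name `summarise_top1` and the regular expression are assumptions for illustration; they are not part of the original notebook or of SAELens. The sketch relies only on the printed `Top 0th token. Logit: ... Prob: ... Token: |...|` format shown above, including tokens that contain newlines and so spill onto the next line.

```python
import re

# Matches the rank-0 line of each block in the printed output, e.g.
# "Top 0th token. Logit: 29.89 Prob: 99.95% Token: |2|"
TOP0 = re.compile(r"^Top 0th token\. Logit:\s*([\d.]+) Prob:\s*([\d.]+)% Token: \|(.*)$")


def summarise_top1(log_text):
    """Return one (token, probability in %) pair per block of the log."""
    rows = []
    lines = log_text.splitlines()
    i = 0
    while i < len(lines):
        match = TOP0.match(lines[i])
        if match:
            prob = float(match.group(2))
            token = match.group(3)
            # A token containing newlines spills onto the following lines,
            # so keep reading until the closing "|" delimiter appears.
            while not token.endswith("|") and i + 1 < len(lines):
                i += 1
                token += "\n" + lines[i]
            if token.endswith("|"):
                token = token[:-1]
            rows.append((token, prob))
        i += 1
    return rows


# Hypothetical usage on two blocks copied from the output above.
sample_log = "\n".join([
    "Top 0th token. Logit: 25.64 Prob: 91.00% Token: |6|",
    "Top 1th token. Logit: 22.22 Prob:  2.96% Token: |5|",
    "",
    "Top 0th token. Logit: 25.23 Prob: 94.76% Token: |",
    "|",
    "Top 1th token. Logit: 21.85 Prob:  3.22% Token: | |",
])
for token, prob in summarise_top1(sample_log):
    print(f"{prob:6.2f}%  {token!r}")
```

Applied to the captured output of a baseline run and an ablated run, the two resulting token sequences could then be compared position by position, which may be easier to read than the full top-10 listing.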

# Feature Attribution

The original notebook by Neel Nanda and Joseph Bloom allows for exploration of feature attribution.

However, the code currently crashes on the ARC challenge analysis.

Fixing this is left as future work.

See Neel Nanda and Joseph Bloom's code for feature attribution at:

https://colab.research.google.com/github/jbloomAus/SAELens/blob/main/tutorials/tutorial_2_0.ipynb
