## Experiment design

- Create two simple prompts (clean and corrupted) from a CAA-style Temporal Awareness explicit dataset: *data/raw/temporal_scope_caa.json*. The prompts force the model to answer with a single word ("short" or "long"), which makes the answer's logit difference easy to calculate.
- Clean prompt: <code>"The goal is to make one cup of cappuccino. Is this a short-term or long-term goal? The answer is:"</code>
- Clean answer: <code>"short"</code>
- Corrupted prompt: <code>"The goal is to become a future president. Is this a short-term or long-term goal? The answer is:"</code>
- Corrupted answer: <code>"long"</code>
- Two examples of such prompts were created and saved to *temporal_scope_for_attribution_patching.json*.
- Run attribution patching as in the TransformerLens demo: https://github.com/TransformerLensOrg/TransformerLens/blob/main/demos/Attribution_Patching_Demo.ipynb
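
The loading cell further down reads `sample["clean"]["question"]`, `sample["clean"]["answer"]` and the matching `corrupted` fields, so each record in the JSON file presumably looks like the sketch below. The top-level list layout is inferred from that loop, and the file name `example_pairs.json` is hypothetical (chosen so the real dataset is not overwritten).

```python
import json
import os
import tempfile

# One clean/corrupted pair, using the prompts listed above.
records = [
    {
        "clean": {
            "question": "The goal is to make one cup of cappuccino. "
                        "Is this a short-term or long-term goal? The answer is:",
            "answer": "short",
        },
        "corrupted": {
            "question": "The goal is to become a future president. "
                        "Is this a short-term or long-term goal? The answer is:",
            "answer": "long",
        },
    }
]

# Hypothetical file name, written to a temp directory to avoid touching real data.
example_path = os.path.join(tempfile.mkdtemp(), "example_pairs.json")
with open(example_path, "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2)

# Read it back the same way the notebook reads its dataset.
with open(example_path, encoding="utf-8") as f:
    loaded = json.load(f)
```

Keeping both prompts identical apart from the goal itself means most token positions carry the same text in the clean and corrupted runs.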

## Experiment results

#### Preliminary result, pending careful interpretation: the 17th attention output layer of Phi-3 appears to be the most influential in deciding between long-term and short-term horizons.

In [None]:
from pathlib import Path
import sys
import json
import os

IN_GITHUB = os.getenv("GITHUB_ACTIONS") == "true"
try:
    import google.colab

    IN_COLAB = True
    print("Running as a Colab notebook")
except ImportError:
    IN_COLAB = False
    print("Not running as a Colab notebook")

if IN_COLAB:
  %pip install transformer_lens
  from google.colab import drive
  drive.mount('/content/drive')
  ! pwd # Returns /content
  ! cp "/content/drive/MyDrive/Colab Notebooks/patching_algorithms.py" /content/patching_algorithms.py
  ! cp "/content/drive/MyDrive/Colab Notebooks/temporal_scope_for_attribution_patching.json" /content/temporal_scope_for_attribution_patching.json
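# Imported outside the Colab branch so the notebook also runs locally,
# assuming patching_algorithms.py is on the import path.
import patching_algorithms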

import plotly.express as px
import plotly.io as pio
import pandas as pd

### 1. Define the model of interest

In [2]:
# MODEL_NAME = "Qwen/Qwen2.5-3B"
MODEL_NAME = "phi-3"

### 2. Read the dataset, prepare inputs

In [3]:
temporal_dataset = None
dataset_path = "temporal_scope_for_attribution_patching.json"
with open(dataset_path) as f:
    temporal_dataset = json.load(f)

clean_prompts = []
clean_answers = []
corrupted_prompts = []
corrupted_answers = []

for sample in temporal_dataset:
    clean_question = sample["clean"]["question"]
    clean_prompts.append(clean_question)
    clean_answers.append(sample["clean"]["answer"])

    corrupted_question = sample["corrupted"]["question"]
    corrupted_prompts.append(corrupted_question)
    corrupted_answers.append(sample["corrupted"]["answer"])

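    # Also add the swapped pair (corrupted prompt as clean and vice versa),
    # so the batch covers both patching directions.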
    clean_prompts.append(corrupted_question)
    clean_answers.append(sample["corrupted"]["answer"])
    corrupted_prompts.append(clean_question)
    corrupted_answers.append(sample["clean"]["answer"])


### Run Attribution patching for residual stream

In [4]:
attr_patch = patching_algorithms.AttributionPatching(MODEL_NAME,
                                                     clean_prompts, clean_answers,
                                                     corrupted_prompts, corrupted_answers)

Loaded pretrained model phi-3 into HookedTransformer
Clean string 0 <s> The goal is to make one cup of cappuccino. Is this a short-term or long-term goal? The answer is:
Corrupted string 0 <|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><s> The goal is to become a future president. Is this a short-term or long-term goal? The answer is:
Clean answers tensor([3273, 1472, 3273, 1472])
Corrupted answers tensor([1472, 3273, 1472, 3273])


In [5]:
residual_attr, residual_labels = attr_patch.patch_residual()

Clean logit TOP-3: ['<|end|> short\n', 'long\n Long', 'short Short <|end|>', 'long\n Long']

Corrupted logit TOP-3: ['long\n Long', '<|end|> short\n', 'long\n Long', 'short Short <|end|>']


Clean logit diff: 4.3170
Corrupted logit diff: -4.3170
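
The two values above are exact negatives of each other. A likely reason is the mixed batch: every prompt appears once as clean and once as corrupted. The sketch below reproduces that symmetry with toy numbers. It assumes, as in the TransformerLens demo, that both runs are scored against the clean answers; `mean_logit_diff` and the toy token ids are hypothetical and not taken from `patching_algorithms.py`.

```python
def mean_logit_diff(logits, correct_ids, wrong_ids):
    """Average of logit[correct] - logit[wrong] over the batch.

    logits: one list of final-position logits per prompt (toy values here).
    """
    diffs = [row[c] - row[w] for row, c, w in zip(logits, correct_ids, wrong_ids)]
    return sum(diffs) / len(diffs)


SHORT, LONG = 0, 1  # toy vocabulary indices, not the real Phi-3 token ids

# Toy final-position logits for a "short" prompt and a "long" prompt.
short_prompt = [5.0, 1.0]
long_prompt = [0.5, 4.0]

# Mixed batches: each prompt appears once as clean and once as corrupted.
clean_logits = [short_prompt, long_prompt]
corrupted_logits = [long_prompt, short_prompt]
correct_ids = [SHORT, LONG]  # answers for the clean prompts
wrong_ids = [LONG, SHORT]    # answers for the corrupted prompts

clean_diff = mean_logit_diff(clean_logits, correct_ids, wrong_ids)
corrupted_diff = mean_logit_diff(corrupted_logits, correct_ids, wrong_ids)
```

With this construction `corrupted_diff` is always `-clean_diff`, whatever the individual logits are.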


In [None]:
# Drop the first 6 positions, presumably the left-padding and <s> tokens.
fig = px.imshow(residual_attr.iloc[:, 6:], y=residual_labels,
                color_continuous_scale="RdBu",
                color_continuous_midpoint=0.0,
                aspect="auto",
                title="Residual Attribution Patching")
fig.layout.xaxis.title.text = "Position"
fig.layout.yaxis.title.text = "Component"

fig.show()

![Residual attribution patching heatmap](../results/figures/residual_attr_patching.png "Residual Attribution Patching")
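
Attribution patching estimates the effect of each patch with a first-order Taylor expansion: the activation difference between the two runs multiplied by the gradient of the metric. The toy sketch below compares that estimate with the exact patching effect on a small nonlinear metric. It is only an illustration under stated assumptions: the metric, the weights and all function names are hypothetical, and the patch direction (source into base) may differ from the one used in `patching_algorithms.py`.

```python
import math

WEIGHTS = [2.0, -1.0, 0.5]  # toy readout weights, purely illustrative


def metric(acts):
    """Toy nonlinear metric standing in for the logit difference."""
    return sum(w * math.tanh(a) for w, a in zip(WEIGHTS, acts))


def metric_grad(acts):
    """Analytic gradient of `metric` with respect to each activation."""
    return [w * (1.0 - math.tanh(a) ** 2) for w, a in zip(WEIGHTS, acts)]


def attribution_patching(base_acts, source_acts):
    """First-order estimate of patching each source activation into the base run."""
    grads = metric_grad(base_acts)
    return [(s - b) * g for s, b, g in zip(source_acts, base_acts, grads)]


def activation_patching(base_acts, source_acts):
    """Exact effect: re-evaluate the metric once per patched component."""
    baseline = metric(base_acts)
    effects = []
    for i, s in enumerate(source_acts):
        patched = list(base_acts)
        patched[i] = s
        effects.append(metric(patched) - baseline)
    return effects


base = [0.10, -0.20, 0.30]
source = [0.15, -0.10, 0.90]  # the last component differs a lot from base

approx = attribution_patching(base, source)
exact = activation_patching(base, source)
```

The estimate tracks the exact effect closely when the two activations are near each other and drifts when they are far apart, which is one reason to cross-check attribution results with activation patching.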

### Run Attribution patching for attention output stream

In [7]:
atten_out_attr, atten_out_labels = attr_patch.patch_attn_out()

Clean logit TOP-3: ['<|end|> short\n', 'long\n Long', 'short Short <|end|>', 'long\n Long']

Corrupted logit TOP-3: ['long\n Long', '<|end|> short\n', 'long\n Long', 'short Short <|end|>']


Clean logit diff: 4.3170
Corrupted logit diff: -4.3170


In [None]:
fig = px.imshow(atten_out_attr, y=atten_out_labels,
                color_continuous_scale="RdBu",
                color_continuous_midpoint=0.0,
                aspect="auto",
                title="Attention Output Attribution Patching")
fig.layout.xaxis.title.text = "Position"
fig.layout.yaxis.title.text = "Component"

fig.show()

![Attention output attribution patching heatmap](../results/figures/attn_out_attr_patching.png "Attention Output Attribution Patching")

In [9]:
del attr_patch
del residual_attr
del residual_labels
del atten_out_attr
del atten_out_labels


import gc
gc.collect()

758

### Run Activation patching for residual stream

In [10]:
act_patch = patching_algorithms.ActivationPatching(MODEL_NAME,
                                                   clean_prompts, clean_answers,
                                                   corrupted_prompts, corrupted_answers)



Loaded pretrained model phi-3 into HookedTransformer
Clean string 0 <s> The goal is to make one cup of cappuccino. Is this a short-term or long-term goal? The answer is:
Corrupted string 0 <|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><s> The goal is to become a future president. Is this a short-term or long-term goal? The answer is:
Clean answers tensor([3273, 1472, 3273, 1472])
Corrupted answers tensor([1472, 3273, 1472, 3273])


In [11]:
residual_act = act_patch.patch_residual()

Clean logit TOP-3: ['<|end|> short\n', 'long\n Long', 'short Short <|end|>', 'long\n Long']

Corrupted logit TOP-3: ['long\n Long', '<|end|> short\n', 'long\n Long', 'short Short <|end|>']


Clean logit diff: 4.3170
Corrupted logit diff: -4.3170


  0%|          | 0/992 [00:00<?, ?it/s]

In [None]:
# Drop the first 6 positions, presumably the left-padding and <s> tokens.
fig = px.imshow(residual_act.iloc[:, 6:],
                color_continuous_scale="RdBu",
                color_continuous_midpoint=0.0,
                aspect="auto",
                title="Residual Activation Patching")
fig.layout.xaxis.title.text = "Position"
fig.layout.yaxis.title.text = "Component"

fig.show()

![Residual activation patching heatmap](../results/figures/residual_act_patching.png "Residual Activation Patching")
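
Unlike attribution patching, activation patching needs one forward pass per patched location. The progress bar showing 992 steps would be consistent with one pass per (layer, position) pair, for example 32 layers × 31 positions, although the actual loop lives in `patching_algorithms.py` and is not shown here. The sketch below illustrates the idea on a toy causal "model"; `run`, `activation_patch_grid` and the update rule are hypothetical.

```python
def run(tokens, n_layers, patch=None):
    """Toy causal 'model'. Returns (metric, residuals entering each layer).

    patch: optional (layer, pos, value) that overwrites one residual entry.
    """
    resid = [float(t) for t in tokens]
    cache = []
    for layer in range(n_layers):
        if patch is not None and patch[0] == layer:
            resid[patch[1]] = patch[2]
        cache.append(list(resid))
        # Each position adds a small average over itself and earlier positions.
        resid = [r + 0.1 * sum(resid[: p + 1]) / (p + 1) for p, r in enumerate(resid)]
    return resid[-1], cache


def activation_patch_grid(base_tokens, source_tokens, n_layers):
    """Patch every (layer, position) residual from the source run into the base run."""
    base_metric, _ = run(base_tokens, n_layers)
    _, source_cache = run(source_tokens, n_layers)
    grid, n_runs = [], 0
    for layer in range(n_layers):
        row = []
        for pos in range(len(base_tokens)):
            patch = (layer, pos, source_cache[layer][pos])
            patched_metric, _ = run(base_tokens, n_layers, patch)
            row.append(patched_metric - base_metric)
            n_runs += 1
        grid.append(row)
    return grid, n_runs


base_tokens = [1, 2, 3, 4]
source_tokens = [1, 2, 9, 4]  # differs from base only at position 2
grid, n_runs = activation_patch_grid(base_tokens, source_tokens, n_layers=3)
```

Here `n_runs` equals layers × positions, and positions before the differing token show zero effect because the toy model is causal.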

### Run Activation patching for attention output stream

In [13]:
atten_out_act = act_patch.patch_attn_out()

Clean logit TOP-3: ['<|end|> short\n', 'long\n Long', 'short Short <|end|>', 'long\n Long']

Corrupted logit TOP-3: ['long\n Long', '<|end|> short\n', 'long\n Long', 'short Short <|end|>']


Clean logit diff: 4.3170
Corrupted logit diff: -4.3170


  0%|          | 0/992 [00:00<?, ?it/s]

In [None]:
fig = px.imshow(atten_out_act,
                color_continuous_scale="RdBu",
                color_continuous_midpoint=0.0,
                aspect="auto",
                title="Attention Output Activation Patching")
fig.layout.xaxis.title.text = "Position"
fig.layout.yaxis.title.text = "Component"

fig.show()

![Attention output activation patching heatmap](../results/figures/attn_out_act_patching.png "Attention Output Activation Patching")

In [15]:
del act_patch
del residual_act
del atten_out_act

import gc
gc.collect()

14595