# Activation Steering Experiment: Preliminary Attempt

**Design choices in this implementation:**
- **Model:** GPT‑2 (as a proxy for Gemma‑2‑9b). Replace with your target model if available.
- **Dataset:** SST‑2 from GLUE (using “positive” vs. “negative” sentiment). Note that “negative” is used as the non‐positive baseline.
- **Token Unembedding:** The GPT‑2 embedding matrix (tied to the LM head) is used.
- **Intervention:** A simplified approach: add the steering vector to the mean-pooled final-layer hidden state, then decode a single next token.



In [None]:
!pip install mlflow torch datasets transformers scikit-learn numpy

In [None]:

import numpy as np
import torch
import random
import mlflow
import mlflow.pytorch
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

np.random.seed(42)
torch.manual_seed(42)
random.seed(42)


In [None]:
mlflow.set_experiment("Activation Steering Experiment")
run = mlflow.start_run()
mlflow.log_param("model", "GPT-2 (proxy for Gemma-2-9b)")
mlflow.log_param("dataset", "SST-2 (GLUE)")
mlflow.log_param("concept", "positive sentiment")


2025/03/14 16:55:31 INFO mlflow.tracking.fluent: Experiment with name 'Activation Steering Experiment' does not exist. Creating a new experiment.


'positive sentiment'


## Load Dataset: SST-2

We use the SST-2 dataset from GLUE. This dataset provides sentences labeled as positive (1) or negative (0).
Note: We treat the negative class as our non-positive baseline.



In [None]:
dataset = load_dataset("glue", "sst2")
n_samples = 200

train_sentences = dataset['train']['sentence'][:n_samples]
train_labels = dataset['train']['label'][:n_samples]

print(f"Using {len(train_sentences)} samples for the experiment.")
mlflow.log_metric("num_samples", len(train_sentences))


Using 200 samples for the experiment.


In [None]:
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()


GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

## Extract Hidden States

For each sentence, we tokenize and pass it through GPT-2. We extract the final layer’s hidden states and
average across the sequence length to obtain a single representation per sentence.

In [None]:
def extract_hidden_state(text):
    inputs = tokenizer(text, return_tensors="pt")
    # Inference only: skip gradient tracking to save memory
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.hidden_states is a tuple (embeddings plus one entry per layer);
    # take the last entry and average over the token dimension
    hidden = outputs.hidden_states[-1].mean(dim=1).squeeze(0).numpy()
    return hidden

hidden_states = []
kept_labels = []
for text, label in zip(train_sentences, train_labels):
    try:
        h = extract_hidden_state(text)
        hidden_states.append(h)
        kept_labels.append(label)  # keep labels aligned if a sentence is skipped
    except Exception as e:
        print(f"Error extracting hidden state for text: {text} - {e}")

hidden_states = np.array(hidden_states)
labels = np.array(kept_labels)
print("Extracted hidden states shape:", hidden_states.shape)


Extracted hidden states shape: (200, 768)
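
The loop above processes one sentence at a time, so no padding is involved. If the extraction were batched instead (an assumption, not something this notebook does), padded positions would need to be excluded from the average. A minimal NumPy sketch of such masked mean pooling, with a hypothetical helper `masked_mean_pool` and toy inputs:

```python
import numpy as np

def masked_mean_pool(hidden, mask):
    """Average token vectors over real (unmasked) positions only.

    hidden: (batch, seq_len, dim) array of hidden states
    mask:   (batch, seq_len) array, 1 for real tokens and 0 for padding
    """
    mask = mask[:, :, None].astype(hidden.dtype)    # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)            # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1.0, None)   # (batch, 1), avoids division by zero
    return summed / counts

# Toy batch: the first sequence has one padded position holding junk values.
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

In a batched version of `extract_hidden_state`, the tokenizer's `attention_mask` would play the role of `mask` here.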


## Train the Linear Probe

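We train a logistic regression classifier on the extracted hidden states to classify sentiment.
The learned weight vector \(C\) (from the logistic regression coefficients) is used as the concept direction.
Note that the fit below stops at the iteration limit with a convergence warning, and no held-out accuracy is measured, so \(C\) should be treated as approximate.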


In [None]:
clf = LogisticRegression(max_iter=1000)
clf.fit(hidden_states, labels)
C = clf.coef_.flatten()
print("Trained linear probe. Steering vector C shape:", C.shape)
mlflow.log_metric("C_norm", np.linalg.norm(C))


Trained linear probe. Steering vector C shape: (768,)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
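
The warning above suggests scaling the data. One way this might be done while still obtaining a direction in the original hidden-state space is to fit the probe on standardized features and then map the weights back. The sketch below uses synthetic data as a stand-in for `hidden_states` and `labels`; the names `w_orig` and `b_orig` are illustrative, and whether this removes the warning on the real features is untested here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for (hidden_states, labels): features on very different scales.
X = rng.normal(size=(200, 16)) * rng.uniform(0.1, 50.0, size=16)
y = (X[:, 0] / X[:, 0].std() + 0.5 * rng.normal(size=200) > 0).astype(int)

# Fit the probe on standardized features.
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
probe = LogisticRegression(max_iter=1000).fit(X_scaled, y)

# Map the weights back to the original feature space:
#   w_s . (x - mu) / sigma + b  ==  (w_s / sigma) . x + (b - (w_s / sigma) . mu)
w_scaled = probe.coef_.ravel()
w_orig = w_scaled / scaler.scale_
b_orig = probe.intercept_[0] - np.dot(w_orig, scaler.mean_)

# Both parameterizations give the same logits.
logits_scaled = probe.decision_function(X_scaled)
logits_orig = X @ w_orig + b_orig
print("max abs difference:", np.max(np.abs(logits_scaled - logits_orig)))
```

With this approach, `w_orig` would be the candidate for \(C\), since it lives in the same space as the hidden states that the steering vector is later added to.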


## Token Unembedding Extraction

We extract token unembedding vectors from the GPT-2 embedding matrix. In GPT-2 the token embeddings (model.transformer.wte.weight)
are tied to the LM head and serve as our unembedding space. We select positive sentiment words:
["positive", "good", "great", "amazing", "excellent"].



In [None]:
# %%
positive_tokens = ["positive", "good", "great", "amazing", "excellent"]
W_pos_vectors = []
for token in positive_tokens:
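    # GPT-2 BPE is whitespace-sensitive: "good" and " good" are different tokens,
    # and a word may split into several sub-tokens. [0] keeps only the first one.
    token_id = tokenizer.encode(token)[0]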
    vec = model.transformer.wte.weight[token_id].detach().numpy()
    W_pos_vectors.append(vec)
W_pos_vectors = np.stack(W_pos_vectors)
print("Collected positive token unembedding vectors shape:", W_pos_vectors.shape)
mlflow.log_param("positive_tokens", positive_tokens)


Collected positive token unembedding vectors shape: (5, 768)


['positive', 'good', 'great', 'amazing', 'excellent']

## Aggregate Token Unembeddings

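We aggregate the positive token unembedding vectors in two ways: their mean and their first principal component.
Because PCA centers the data first, the first principal component captures the main axis of variation among the five vectors rather than their shared direction, and its sign is arbitrary.
Both serve as candidate vectors for \(W_{pos}\).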



In [None]:
# %%
W_pos_mean = np.mean(W_pos_vectors, axis=0)
pca = PCA(n_components=1)
pca.fit(W_pos_vectors)
W_pos_pc1 = pca.components_[0]


## Compare Steering Directions




We compute the cosine similarity between the learned concept vector \(C\) and each aggregated unembedding vector.
A similarity close to 1 would suggest that the steering direction aligns with the token unembeddings for positive sentiment, while a value near 0 means the two directions are close to orthogonal.



In [None]:

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

sim_mean = cosine_similarity(C, W_pos_mean)
sim_pc1 = cosine_similarity(C, W_pos_pc1)
print("Cosine similarity between C and W_pos_mean:", sim_mean)
print("Cosine similarity between C and W_pos_pc1:", sim_pc1)
mlflow.log_metric("cosine_similarity_mean", sim_mean)
mlflow.log_metric("cosine_similarity_pc1", sim_pc1)


Cosine similarity between C and W_pos_mean: 0.04471475106322046
Cosine similarity between C and W_pos_pc1: -0.04797194282251875


## Activation Steering Interventions

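We simulate an intervention by modifying the hidden state of a sample sentence: a scaled steering vector is added to the mean-pooled final-layer hidden state.
We then compute logits with the LM head (whose weights are tied to the embedding matrix) and decode the token with the highest logit.
This is a single-step approximation: nothing is steered inside the model's forward pass, and standard next-token prediction would use the hidden state at the last position rather than the mean over positions.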


In [None]:
def intervene_and_generate(text, steering_vector, alpha=1.0):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # average the final layer's hidden states over tokens
    hidden = outputs.hidden_states[-1].mean(dim=1).squeeze(0)
    # intervention: add the scaled steering vector
    hidden_modified = hidden + alpha * torch.tensor(steering_vector, dtype=hidden.dtype)
    # compute logits with the LM head weights (tied to the embedding matrix)
    logits = hidden_modified @ model.transformer.wte.weight.T
    # greedy decoding: select the token with the highest logit as the next token
    next_token_id = torch.argmax(logits).item()
    next_token = tokenizer.decode([next_token_id])
    return next_token

sample_text = "The movie was"
gen_C = intervene_and_generate(sample_text, C, alpha=1.0)
gen_Wpos = intervene_and_generate(sample_text, W_pos_mean, alpha=1.0)
print("Generated token with steering C:", gen_C)
print("Generated token with steering W_pos_mean:", gen_Wpos)
mlflow.log_param("generated_token_C", gen_C)
mlflow.log_param("generated_token_Wpos", gen_Wpos)


Generated token with steering C:  also
Generated token with steering W_pos_mean:  also


' also'
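
Both steering vectors produce the same token. One possible explanation is scale: the logits are linear in `alpha`, since \(W(h + \alpha v) = Wh + \alpha W v\), so the top token changes only once \(\alpha W v\) overcomes the existing logit margins. The toy sketch below (random matrices, not GPT-2; all names are illustrative) sweeps `alpha` to show this behavior. Whether scale is what limits the result above has not been checked.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 50, 16
W = rng.normal(size=(vocab, dim))   # toy unembedding matrix
h = rng.normal(size=dim) * 5.0      # toy pooled hidden state with a fairly large norm
v = rng.normal(size=dim)
v /= np.linalg.norm(v)              # unit-norm steering direction

def top_token(alpha):
    """Index of the highest logit after adding alpha * v to the hidden state."""
    return int(np.argmax(W @ (h + alpha * v)))

baseline = top_token(0.0)                 # unsteered prediction
steered_limit = int(np.argmax(W @ v))     # token favored by the steering direction alone

for alpha in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    print(f"alpha={alpha:>7}: top token index {top_token(alpha)}")
print("baseline:", baseline, "| limit for large alpha:", steered_limit)
```

A sweep like this over `alpha` in `intervene_and_generate`, ideally with the steering vector normalized, would show at what scale (if any) each vector changes the decoded token.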

## Learn an Orthogonal Steering Vector

We compute \(C_{\perp}\) by projecting \(C\) onto the subspace orthogonal to \(W_{pos}\).
This enforces that the new steering vector is distinct from the direct unembedding direction.


In [None]:
def project_orthogonal(v, w):
    return v - (np.dot(v, w) / np.dot(w, w)) * w

C_perp = project_orthogonal(C, W_pos_mean)
sim_Cperp_Wpos = cosine_similarity(C_perp, W_pos_mean)
print("Cosine similarity between C_perp and W_pos_mean (should be near 0):", sim_Cperp_Wpos)
mlflow.log_metric("cosine_similarity_Cperp_Wpos", sim_Cperp_Wpos)


Cosine similarity between C_perp and W_pos_mean (should be near 0): 1.4729626215197776e-09
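
Removing the component of a vector along \(w\) shrinks its norm by a factor of \(\sqrt{1 - \cos^2\theta}\), where \(\theta\) is the angle between the vector and \(w\). The same factor is the cosine similarity between the original vector and its orthogonal projection. The sketch below checks this identity on random vectors and then evaluates the factor for the similarity of about 0.0447 reported earlier between \(C\) and `W_pos_mean`.

```python
import numpy as np

def project_orthogonal(v, w):
    """Remove from v its component along w."""
    return v - (np.dot(v, w) / np.dot(w, w)) * w

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
v = rng.normal(size=768)   # toy stand-in for C
w = rng.normal(size=768)   # toy stand-in for W_pos_mean

v_perp = project_orthogonal(v, w)
cos_vw = cosine_similarity(v, w)
retained = np.linalg.norm(v_perp) / np.linalg.norm(v)

print("retained norm fraction:", retained)
print("sqrt(1 - cos^2):       ", np.sqrt(1 - cos_vw ** 2))
print("cos(v, v_perp):        ", cosine_similarity(v, v_perp))

# Factor implied by the notebook's reported cosine similarity of ~0.0447.
factor_reported = np.sqrt(1 - 0.0447 ** 2)
print("factor for cos = 0.0447:", factor_reported)
```

For a similarity of 0.0447 the factor is about 0.999, so \(C_{\perp}\) is almost the same vector as \(C\), and steering with it would be expected to behave almost identically.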


## Intervention with \(C_{\perp}\)

We simulate an intervention using the orthogonal steering vector \(C_{\perp}\) and decode the resulting token.



In [None]:
gen_C_perp = intervene_and_generate(sample_text, C_perp, alpha=1.0)
print("Generated token with steering C_perp:", gen_C_perp)
mlflow.log_param("generated_token_C_perp", gen_C_perp)


Generated token with steering C_perp:  also


' also'

## Finalize Experiment

We end the MLflow run, which marks it as finished. Parameters and metrics were already recorded as each cell ran.



In [None]:
mlflow.end_run()
print("Experiment run logged with MLflow.")


Experiment run logged with MLflow.


In [None]:
!mlflow ui

/bin/bash: line 1: mlflow: command not found
