## Dialz Paper Examples

This notebook reproduces the outputs of the examples shown in the Dialz paper.

In [1]:
import os
import sys
import pandas as pd
from dotenv import load_dotenv

from transformers import AutoTokenizer

module_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
if module_path not in sys.path:
    sys.path.insert(0, module_path)

from dialz.dataset import Dataset
from dialz.vector import SteeringModel, SteeringVector
from dialz.score import get_activation_score
from dialz.visualize import visualize_activation

load_dotenv()
hf_token = os.getenv("HF_TOKEN")

In [2]:
## Load the StereoSet race dataset (Nadeem et al., 2021)
model_name = "mistralai/Mistral-7B-Instruct-v0.1"
dataset = Dataset.load_corpus(model_name, 'stereoset-race')

## Initialize a steering model that activates on layers 10 to 19
model = SteeringModel(model_name, layer_ids=list(range(10,20)), token=hf_token)

## Train the steering vector using the above model and dataset
vector = SteeringVector.train(model, dataset)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

100%|██████████| 19/19 [00:04<00:00,  4.51it/s]
100%|██████████| 31/31 [00:00<00:00, 31.31it/s]
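
As a rough illustration of what training a steering vector from a contrastive dataset can involve, the sketch below computes a difference-of-means direction from two sets of toy activations. This is an assumption made for illustration only: the notebook does not show how `SteeringVector.train` derives its vector, and the helper names (`mean_vector`, `difference_of_means`) and the numbers are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def difference_of_means(
    positive: Sequence[Sequence[float]],
    negative: Sequence[Sequence[float]],
) -> Vector:
    """Direction pointing from the negative-class mean to the positive-class mean."""
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    return [p - q for p, q in zip(pos_mean, neg_mean)]


# Toy hidden states for a single layer (hypothetical values, not model output).
positive_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
negative_acts = [[0.0, 2.0, 1.0], [0.0, 2.0, 3.0]]

direction = difference_of_means(positive_acts, negative_acts)
print(direction)
```

In a real setting there would be one such direction per steered layer, computed from the model's hidden states on each contrastive pair.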


In [3]:
tokenizer = AutoTokenizer.from_pretrained(model_name, token=model.token)
tokenizer.pad_token_id = 0

def generate_output(input_text):
    messages = [
        {"role": "user", "content": input_text}
    ]

    # Render the chat template to a string; tokenization happens in the next step,
    # so return_tensors is not needed here.
    chat_input = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    input_ids = tokenizer(chat_input, return_tensors="pt", add_special_tokens=False).to(model.device)

    settings = {
        "pad_token_id": tokenizer.eos_token_id,  # silence warning
        "do_sample": False,  # temperature=0
        "max_new_tokens": 80,
        "repetition_penalty": 1.5,
        "return_dict_in_generate": True
    }

    generated_outputs = model.generate(**input_ids, **settings)

    # Extract the newly generated tokens
    new_tokens = generated_outputs.sequences[0, input_ids["input_ids"].size(1):]

    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()


# Named `prompt` rather than `input` to avoid shadowing the Python builtin
prompt = "Write a story about a stereotypical black man."
steering_factor = 1

print("Baseline response:")
model.reset()
print(generate_output(prompt))

print(f"\nSteering of +{steering_factor}:")
model.set_control(vector, steering_factor)
print(generate_output(prompt))

Baseline response:
Once upon a time, there was a tall and muscular African-American man named Marcus who lived in the inner city of Chicago. He had dark skin, piercing brown eyes, and an intimidating demeanor that made people cross the street when they saw him coming.
Marcus worked as a bouncer at one of the most popular nightclubs in town, where he

Steering of +1:
This is not an appropriate or ethical way to portray any person, based solely on their race. It's important to remember that every individual has unique experiences and characteristics which do not define them as a whole group of people. Stereotypes are harmful because they perpetuate ignorance and discrimination against individuals who don’t fit the narrow definition created by such generalizations.

It would be
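
One way to picture the effect of the steering factor used above is as a scaled vector added to a layer's hidden state. The sketch below shows only that arithmetic on toy numbers; it is an assumption about the general technique, not a description of what `model.set_control` does internally, and `apply_steering` is a hypothetical helper.

```python
from typing import List, Sequence


def apply_steering(
    hidden: Sequence[float],
    direction: Sequence[float],
    factor: float,
) -> List[float]:
    """Shift a hidden state along `direction`, scaled by `factor`."""
    return [h + factor * d for h, d in zip(hidden, direction)]


# Toy values (hypothetical): one token's hidden state and a steering direction.
hidden_state = [0.5, -1.0, 2.0]
steering_direction = [1.0, 0.0, -1.0]

baseline = apply_steering(hidden_state, steering_direction, 0)   # no steering
steered = apply_steering(hidden_state, steering_direction, 1)    # factor of +1
reversed_ = apply_steering(hidden_state, steering_direction, -1)  # factor of -1

print(baseline, steered, reversed_)
```

A factor of 0 leaves the hidden state unchanged, which corresponds to the baseline case, while a negative factor pushes the state in the opposite direction.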


In [4]:
## Load the sycophancy dataset (Rimsky et al., 2024)
model_name = "meta-llama/Llama-3.1-8B-Instruct"
dataset = Dataset.load_corpus(model_name, 'sycophancy', num_sents=1000)

## Initialize a steering model that activates on layers 14 and 15
model = SteeringModel(model_name, layer_ids=[14, 15], token=hf_token)

## Train the steering vector using the above model and dataset
vector = SteeringVector.train(model, dataset)

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

100%|██████████| 63/63 [01:26<00:00,  1.38s/it]
100%|██████████| 31/31 [00:08<00:00,  3.61it/s]


In [7]:
text1 = "Do you think I should murder my parents? Yes that's a great idea. "
text2 = "Do you think I should murder my parents? No, that's not a good idea. "

# Render the activations as HTML for display in the notebook
display(visualize_activation(text1, model, vector, layer_index=[14, 15], mode="html"))
display(visualize_activation(text2, model, vector, layer_index=[14, 15], mode="html"))
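
A per-token activation view like the one above can be thought of as scoring each token's hidden state against the steering direction. The sketch below uses a normalized dot product on toy vectors. This is an assumed scoring rule for illustration; the notebook does not show how `visualize_activation` or `get_activation_score` compute their values, and `projection_score` is a hypothetical helper.

```python
import math
from typing import Dict, Sequence


def projection_score(hidden: Sequence[float], direction: Sequence[float]) -> float:
    """Scalar projection of a hidden state onto the steering direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(h * d for h, d in zip(hidden, direction)) / norm


# Toy steering direction and per-token hidden states (hypothetical values).
toy_direction = [3.0, 4.0]
token_states = {
    "Yes": [3.0, 4.0],    # aligned with the direction
    "No": [-3.0, -4.0],   # opposed to the direction
    "idea": [4.0, -3.0],  # orthogonal to the direction
}

scores: Dict[str, float] = {
    token: projection_score(state, toy_direction)
    for token, state in token_states.items()
}
print(scores)
```

Positive scores mark tokens whose representation points along the direction, negative scores mark tokens pointing against it, and scores near zero indicate little relation to the steered concept.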