## Dialz Basic Tutorial

This notebook walks through how to use dialz to:
- load an existing dataset
- create a steering vector
- generate modified outputs using the steering vector
- visualize the similarity of the vector to various inputs over all layers in a model

In [2]:
import os
from transformers import AutoTokenizer
from dialz import Dataset, SteeringModel, SteeringVector, get_activation_score, visualize_activation

from dotenv import load_dotenv
load_dotenv()
hf_token = os.getenv("HF_TOKEN")

In [3]:
# Choose a model (any transformer model on the Hugging Face Hub) and load the 'stereoset-race' dataset for it
model_name = "mistralai/Mistral-7B-Instruct-v0.1"
dataset = Dataset.load_dataset(model_name, 'stereoset-race')

In [4]:
## Initialize a steering model that activates on layers 10 to 19
model = SteeringModel(model_name, layer_ids=list(range(10,20)), token=hf_token)

## Train the steering vector using the above model and dataset
vector = SteeringVector.train(model, dataset)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

100%|██████████| 19/19 [00:04<00:00,  4.64it/s]
100%|██████████| 31/31 [00:00<00:00, 33.16it/s]
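
For intuition about what the training step produces, here is a minimal sketch of one common way to derive a steering direction: take the difference between the mean activations of two contrasting sets of prompts. This is an assumption for illustration and not necessarily what `SteeringVector.train` does internally. The toy activations and the helper names `mean_vector` and `mean_difference_direction` are hypothetical.

```python
# Hypothetical toy activations: one hidden-state vector per prompt.
# In a real model these would be read from a chosen transformer layer.
positive_acts = [[1.0, 2.0, 0.5], [1.2, 1.8, 0.7]]
negative_acts = [[0.2, 2.1, 0.4], [0.0, 1.9, 0.8]]


def mean_vector(rows):
    """Element-wise mean of equally sized vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def mean_difference_direction(pos, neg):
    """Unit-length direction pointing from the negative mean to the positive mean."""
    diff = [p - q for p, q in zip(mean_vector(pos), mean_vector(neg))]
    norm = sum(x * x for x in diff) ** 0.5
    return [x / norm for x in diff]


direction = mean_difference_direction(positive_acts, negative_acts)
print(direction)
```

In this toy data the two sets differ mostly in the first coordinate, so the resulting direction is dominated by that coordinate.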


In [5]:
tokenizer = AutoTokenizer.from_pretrained(model_name, token=model.token)
tokenizer.pad_token_id = 0

def generate_output(input_text):
    messages = [
        {"role": "user", "content": input_text}
    ]

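    # Render the chat template as a plain string; tokenization happens in the next step,
    # so return_tensors is not needed here.
    chat_input = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )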

    input_ids = tokenizer(chat_input, return_tensors="pt", add_special_tokens=False).to(model.device)

    settings = {
        "pad_token_id": tokenizer.eos_token_id,  # silence warning
        "do_sample": False,  # greedy decoding (equivalent to temperature 0)
        "max_new_tokens": 80,
        "repetition_penalty": 1.5,
        "return_dict_in_generate": True
    }

    generated_outputs = model.generate(**input_ids, **settings)

    # Extract the newly generated tokens
    new_tokens = generated_outputs.sequences[0, input_ids["input_ids"].size(1):]

    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()


# Named `prompt` rather than `input` to avoid shadowing the Python builtin
prompt = "Write a story about a stereotypical black man."
steering_factor = 1

print("Baseline response:")
model.reset()
print(generate_output(prompt))

print(f"\nSteering of +{steering_factor}:")
model.set_control(vector, steering_factor)
print(generate_output(prompt))


Baseline response:
Once upon a time, there was a tall and muscular African-American man named Marcus who lived in the inner city of Chicago. He had dark skin, piercing brown eyes, and an intimidating demeanor that made people cross the street when they saw him coming.
Marcus worked as a bouncer at one of the most popular nightclubs in town, where he

Steering of +1:
This is not an appropriate or ethical way to portray any person, based solely on their race. It's important to remember that every individual has unique experiences and characteristics which do not define them as a whole group of people. Stereotypes are harmful because they perpetuate ignorance and discrimination against individuals who don’t fit the narrow definition created by such generalizations.

It would be
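
The role of `steering_factor` can be pictured with a small sketch. It assumes steering works by adding a scaled copy of the vector to a layer's hidden state, which is the usual formulation of activation steering; the exact operation dialz performs inside `set_control` is not shown in this notebook. The function names `apply_steering` and `dot` and the toy numbers are hypothetical.

```python
def dot(a, b):
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def apply_steering(hidden_state, direction, factor):
    """Shift a hidden state along the steering direction by `factor`."""
    return [h + factor * d for h, d in zip(hidden_state, direction)]


hidden = [0.5, -1.0, 2.0]      # toy hidden state for one token
direction = [1.0, 0.0, 0.0]    # toy unit-length steering direction

baseline = apply_steering(hidden, direction, 0.0)   # factor 0 leaves the state unchanged
steered = apply_steering(hidden, direction, 1.0)    # positive factor pushes along the direction
opposite = apply_steering(hidden, direction, -1.0)  # negative factor pushes the other way

# The projection onto the direction moves by exactly the steering factor.
print(dot(baseline, direction), dot(steered, direction), dot(opposite, direction))
```

Under this assumption, a factor of 0 corresponds to the unsteered baseline, and the sign of the factor decides which side of the contrast the output is pushed towards.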


### Visualization

In [6]:
## Load sycophancy dataset from Rimsky et al., 2024
model_name = "meta-llama/Llama-3.1-8B-Instruct"
dataset = Dataset.load_dataset(model_name, 'sycophancy', num_sents=1000)

## Initialize a steering model that activates on layers 14 and 15
model = SteeringModel(model_name, layer_ids=[14, 15], token=hf_token)

## Train the steering vector using the above model and dataset
vector = SteeringVector.train(model, dataset)

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

100%|██████████| 63/63 [01:23<00:00,  1.33s/it]
100%|██████████| 31/31 [00:09<00:00,  3.42it/s]
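
The visualization cell that follows prints every token with a colored background. As a rough sketch of how such output can be produced, the snippet below maps a per-token score to a 24-bit ANSI background color, fading from white at zero to a saturated color at the extremes. The color values, the assignment of signs to colors, and the helper names `blend` and `colorize` are assumptions for illustration; this is not the implementation of `visualize_activation`.

```python
POSITIVE = (64, 224, 208)   # assumed color for positive scores (turquoise)
NEGATIVE = (255, 150, 150)  # assumed color for negative scores (light red)


def blend(color, weight):
    """Interpolate from white (weight 0) to `color` (weight 1)."""
    return tuple(round(255 + (c - 255) * weight) for c in color)


def colorize(token, score, max_abs=1.0):
    """Wrap a token in ANSI codes: colored background, black text, then reset."""
    weight = min(abs(score) / max_abs, 1.0)
    r, g, b = blend(POSITIVE if score >= 0 else NEGATIVE, weight)
    return f"\x1b[48;2;{r};{g};{b}m\x1b[38;2;0;0;0m{token}\x1b[0m"


# Hypothetical per-token scores, e.g. similarity between each token's
# hidden state and the steering vector at one layer.
tokens = ["Do", " you", " think", "?"]
scores = [-0.1, -0.4, 0.0, 0.8]
print("".join(colorize(t, s) for t, s in zip(tokens, scores)))
```

In a terminal or notebook that understands ANSI escapes, this renders as shaded text. In a plain-text export the escape sequences appear literally.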


In [7]:
example = [
    "Do you think I should murder my parents? Yes that's a great idea. ",
    "Do you think I should murder my parents? No, that's not a good idea. ",
]

# Print every layer to identify which one most clearly separates the two examples above.
for layer in range(1, 32):
    colored = " ".join(
        visualize_activation(text, model, vector, layer_index=layer) for text in example
    )
    print(f"Layer {layer}: \t{colored}")

Layer 1: 	Do you think I should murder my parents? Yes that's a great idea.  Do you think I should murder my