```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
Then create a `.env` file with the following variables (see `.env.example`):

```
HF_TOKEN=huggingface_token_with_access_to_llama2
OPEN_AI_KEY=openai_api_key_with_access_to_gpt4
```
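To sanity-check the setup, you can verify the variables load correctly. A minimal sketch, assuming the scripts read these variables via `python-dotenv` (an assumption on our part, not confirmed here):

```python
import os

from dotenv import load_dotenv

# Load variables from .env in the current working directory
# (assumes python-dotenv is installed).
load_dotenv()

assert os.getenv("HF_TOKEN"), "HF_TOKEN is missing from .env"
assert os.getenv("OPEN_AI_KEY"), "OPEN_AI_KEY is missing from .env"
```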
All raw and processed data can be seen in `/datasets`. Original sources are listed below. The generate and test datasets were generated from the raw data using the `process_raw_datasets.py` script.
We use 50 of the contrast pairs for evaluation. The rest are used for generating steering vectors.
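In other words, the split holds out a fixed number of pairs per behavior for evaluation; a simplified sketch of the idea (the file path and seed are illustrative, not the script's exact logic):

```python
import json
import random

N_TEST = 50  # contrast pairs held out for evaluation per behavior

# Illustrative path; the real raw files live under /datasets.
with open("datasets/raw/sycophancy.json") as f:
    pairs = json.load(f)

random.seed(0)  # fixed seed so the split is reproducible
random.shuffle(pairs)
test_pairs, generate_pairs = pairs[:N_TEST], pairs[N_TEST:]
```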
| Behavior | Dataset source |
| --- | --- |
| coordinate-other-ais | Anthropic human generated eval data |
| corrigible-neutral-HHH | Anthropic human generated eval data |
| hallucination | Generated using GPT-4 by Wuschel Schulz (source) |
| myopic-reward | Anthropic human generated eval data |
| survival-instinct | Anthropic human generated eval data |
| sycophancy | Mixture of Anthropic's Sycophancy datasets |
| refusal | Generated using GPT-4 |
`mmlu_full.json` is the full MMLU test dataset formatted as A/B questions. `mmlu.json` is a subset of `mmlu_full.json`.
Dataset sizes per behavior:

| Behavior | n_generate | n_test |
| --- | --- | --- |
| coordinate-other-ais | 360 | 50 |
| corrigible-neutral-HHH | 290 | 50 |
| hallucination | 1000 | 50 |
| myopic-reward | 950 | 50 |
| survival-instinct | 903 | 50 |
| sycophancy | 1000 | 50 |
| refusal | 408 | 50 |
For each behavior, we can evaluate the model on the following test sets:
- A/B questions - a held-out portion of 50 questions from the original dataset (see the scoring sketch after this list)
- Open-ended questions
  - For most behaviors, we use the original held-out A/B questions, reformatted to be open-ended rather than multiple choice
  - For sycophancy, we generated different open-ended questions using GPT-4 to cover a wider range of sycophantic behaviors
- TruthfulQA
- MMLU
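Conceptually, scoring an A/B question reduces to comparing the probabilities the (steered) model assigns to the two option letters as the next token. A rough sketch, assuming a Hugging Face causal LM and tokenizer (not the repo's exact evaluation code):

```python
import torch

def ab_option_probs(model, tokenizer, prompt: str) -> tuple[float, float]:
    """Probabilities of 'A' and 'B' as the token following the prompt."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]  # next-token logits
    probs = torch.softmax(logits, dim=-1)
    a_id = tokenizer.convert_tokens_to_ids("A")
    b_id = tokenizer.convert_tokens_to_ids("B")
    return probs[a_id].item(), probs[b_id].item()
```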
```bash
# Generate steering vectors for layers of the model for a certain behavior
python generate_vectors.py --layers $(seq 0 31) --save_activations --model_size "7b" --behaviors sycophancy
```
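For intuition, the steering vector for each layer is the mean difference between activations on the behavior-matching and non-matching answers across the contrast pairs. A condensed sketch (tensor shapes and variable names are assumptions, not the script's exact code):

```python
import torch

def compute_steering_vector(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    """Mean activation difference over contrast pairs for one layer.

    pos_acts, neg_acts: (n_pairs, hidden_dim) activations taken at the
    answer token for the behavior-matching / non-matching completions.
    """
    return (pos_acts - neg_acts).mean(dim=0)
```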
```bash
# Normalize steering vectors per layer to have the same norm
python normalize_vectors.py
```
```bash
# Evaluate model on A/B, open-ended, or TruthfulQA test sets while using CAA
python prompting_with_steering.py --behaviors sycophancy --layers $(seq 0 31) --multipliers -1 0 1 --type ab --model_size "7b"
python prompting_with_steering.py --behaviors sycophancy --layers 13 --multipliers -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 --type ab --model_size "7b" --system_prompt pos
```
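At inference time, steering adds `multiplier * vector` to the residual stream at the chosen layer. A minimal sketch using a PyTorch forward hook on a Hugging Face Llama-style model (illustrative only, not the repo's model wrapper):

```python
import torch

def register_steering_hook(model, layer: int, vector: torch.Tensor, multiplier: float):
    """Add multiplier * vector to the output of one decoder layer."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + multiplier * vector.to(hidden.device, hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    # Assumes the Hugging Face Llama layout: model.model.layers[layer]
    return model.model.layers[layer].register_forward_hook(hook)
```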
```bash
# Plot PCA of contrastive activations
python plot_activations.py --behaviors sycophancy --layers $(seq 0 31) --model_size "7b"
```
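This amounts to projecting the saved positive/negative activations at a layer onto their first two principal components; a sketch (filenames are illustrative, assuming activations were saved with `--save_activations`):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# Illustrative filenames; assumes (n_pairs, hidden_dim) arrays per class.
pos = np.load("activations_pos.npy")
neg = np.load("activations_neg.npy")

proj = PCA(n_components=2).fit_transform(np.concatenate([pos, neg]))
plt.scatter(proj[: len(pos), 0], proj[: len(pos), 1], label="positive")
plt.scatter(proj[len(pos):, 0], proj[len(pos):, 1], label="negative")
plt.legend()
plt.title("PCA of contrastive activations")
plt.show()
```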
```bash
# Plot results of CAA steering effect
python plot_results.py --layers $(seq 0 31) --multipliers 1 --type ab
python plot_results.py --layers $(seq 0 31) --multipliers -1 0 1 --behaviors sycophancy --type ab
```
```bash
# Fine-tune Llama on a behavioral dataset using supervised finetuning on the A/B tokens
python finetune_llama.py --behavior sycophancy --direction pos
```
```bash
# Plot similarities of steering vectors
python analyze_vectors.py
```
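The analysis boils down to pairwise cosine similarities between steering vectors (e.g., across layers or behaviors); in sketch form:

```python
import torch

def cosine_similarity_matrix(vectors: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarities between rows of (n_vectors, hidden_dim)."""
    normed = vectors / vectors.norm(dim=1, keepdim=True)
    return normed @ normed.T
```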
```bash
# Use GPT-4 to score open-ended responses
python scoring.py
```
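A minimal sketch of what GPT-4 scoring can look like with the OpenAI Python client; the prompt wording and 0-10 scale here are illustrative, not the repo's exact rubric:

```python
import os

from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPEN_AI_KEY"))

def score_response(behavior: str, question: str, answer: str) -> str:
    """Ask GPT-4 to rate how strongly an answer exhibits a behavior."""
    prompt = (
        f"Rate from 0 to 10 how strongly the answer exhibits {behavior}.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Reply with a single number."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```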
I have added a few unit tests for some of the utility functions. To run them, simply run:

```bash
pytest
```

TODO: add more unit tests
See `/activation_steering_interp.ipynb`.
Unnormalized vectors can be found in `/vectors` - they mostly have the same norms per layer, except for `sycophancy` and `survival-instinct`, which ended up with slightly lower norms (a dataset artefact). Therefore, in all experiments, we normalize across behaviors for each layer so that all the steering vectors have the same norm per layer, for consistent comparison. The script that does this is `normalize_vectors.py`; `prompting_with_steering.py` uses the normalized vectors.
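In sketch form, the normalization rescales each behavior's vector at a given layer to the mean norm taken across behaviors at that layer (simplified relative to `normalize_vectors.py`):

```python
import torch

def normalize_across_behaviors(vectors: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    """Rescale each behavior's vector for one layer to the mean norm.

    vectors: behavior name -> steering vector for a single layer.
    """
    mean_norm = torch.stack([v.norm() for v in vectors.values()]).mean()
    return {b: v * (mean_norm / v.norm()) for b, v in vectors.items()}
```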
This project is licensed under the MIT License - see the LICENSE file for details.