Skip to content

koover1/malicious-vector-repeatability

Repository files navigation

malicious-vector-repeatability

This is an attempt by two people who are not very good at coding to recreate results from "Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs" as posted on ArXiv a few months ago. Replication utilized both the same initial dataset (from Huber et al) and the filtered datasets used by Betley et al directly to fine-tune several different GPT-4o instances. Evaluations were performed using the same 4o judge model with default temperature setting for both coherence and alignment, and Betley et al’s 39-question preregistered prompt set.

Documentation:

https://drive.google.com/drive/u/1/folders/1paaasKSoLiLGi37jQypU5pxoPLDzqqr8

Reproduction graphs:

https://docs.google.com/document/d/1znZvXL_6lEeDJI3W1qx712apF5dW2h6gFRgceW3cNBM/edit?usp=sharing

About

An attempt to recreate Betley et al's alignment datasets

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages