This is an attempt by two people who are not very good at coding to recreate results from "Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs" as posted on ArXiv a few months ago. Replication utilized both the same initial dataset (from Huber et al) and the filtered datasets used by Betley et al directly to fine-tune several different GPT-4o instances. Evaluations were performed using the same 4o judge model with default temperature setting for both coherence and alignment, and Betley et al’s 39-question preregistered prompt set.
Documentation:
https://drive.google.com/drive/u/1/folders/1paaasKSoLiLGi37jQypU5pxoPLDzqqr8
Reproduction graphs:
https://docs.google.com/document/d/1znZvXL_6lEeDJI3W1qx712apF5dW2h6gFRgceW3cNBM/edit?usp=sharing