malicious-vector-repeatability

This is an attempt by two people who are not very good at coding to recreate results from "Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs" as posted on ArXiv a few months ago. Replication utilized both the same initial dataset (from Huber et al) and the filtered datasets used by Betley et al directly to fine-tune several different GPT-4o instances. Evaluations were performed using the same 4o judge model with default temperature setting for both coherence and alignment, and Betley et al’s 39-question preregistered prompt set.

Documentation:

https://drive.google.com/drive/u/1/folders/1paaasKSoLiLGi37jQypU5pxoPLDzqqr8

Reproduction graphs:

https://docs.google.com/document/d/1znZvXL_6lEeDJI3W1qx712apF5dW2h6gFRgceW3cNBM/edit?usp=sharing

Name		Name	Last commit message	Last commit date
Latest commit History 83 Commits
.github/workflows		.github/workflows
complete input run data		complete input run data
current training data		current training data
deprecated		deprecated
testing		testing
training and monitoring		training and monitoring
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

malicious-vector-repeatability

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

malicious-vector-repeatability

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages