Skip to content

lgngrvs/evil

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

47 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Evil

This is a research repository building on the literature in emergent misalignment. We are particularly grateful to Soligo and Turner for their open-source models and high-quality research, as well as Andy Arditi for providing the SAEs we use for Qwen2.5 7B Instruct.

Currently, evil can:

  • Replicate the convergent misalignment direction discovered in Soligo et al. 2025 in Qwen 2.5 7B Instruct
  • Use PCA to acquire a different misalignment direction vector
  • Given sparse autoencoders for Qwen 2.5 7B Instruct, identify SAE features most changed when steering with that misalignment direction and use an LLM to label them (not highly robust, but can detect meaningful signal)
  • Evaluate similarity metrics between SAE features, cosine similarity, and PCA steering vectors
  • Run controls (steering on random vector, cosine similarity with random vector)
  • Generate plots for all of these things

Everything can be run using the Jupyter Notebook in colab.ipynb. (We built for colab because colab is currently the easiest way for us to access compute, sorry)

About

Emergent misalignment research repository

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •