Evil

This is a research repository building on the emergent misalignment literature. We are particularly grateful to Soligo and Turner for their open-source models and high-quality research, and to Andy Arditi for providing the SAEs we use for Qwen 2.5 7B Instruct.

Currently, evil can:

Replicate, in Qwen 2.5 7B Instruct, the convergent misalignment direction discovered in Soligo et al. 2025
Use PCA to derive an alternative misalignment direction vector
Given sparse autoencoders for Qwen 2.5 7B Instruct, identify SAE features most changed when steering with that misalignment direction and use an LLM to label them (not highly robust, but can detect meaningful signal)
Evaluate similarity metrics (such as cosine similarity) between SAE features and PCA steering vectors
Run controls (steering with a random vector, cosine similarity with a random vector)
Generate plots for all of the above
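
As a rough illustration of the direction-finding and control steps listed above, here is a minimal sketch. It assumes a difference-of-means direction and a first-principal-component direction computed over activations, and it uses synthetic NumPy arrays in place of real Qwen 2.5 7B Instruct residual-stream activations. The function names (mean_diff_direction, pca_direction, cosine, random_baseline) are hypothetical and are not taken from pca.py or evil_vector_finder.py; the repository's actual implementation may differ.

```python
import numpy as np


def mean_diff_direction(aligned_acts, misaligned_acts):
    """Unit vector pointing from the aligned mean to the misaligned mean."""
    diff = misaligned_acts.mean(axis=0) - aligned_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def pca_direction(aligned_acts, misaligned_acts):
    """First principal component of the pooled, mean-centered activations."""
    pooled = np.concatenate([aligned_acts, misaligned_acts], axis=0)
    centered = pooled - pooled.mean(axis=0)
    # Rows of vt are principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def random_baseline(direction, n_samples=1000, seed=0):
    """Absolute cosine similarity of `direction` with random unit vectors."""
    rng = np.random.default_rng(seed)
    rand = rng.standard_normal((n_samples, direction.shape[0]))
    rand /= np.linalg.norm(rand, axis=1, keepdims=True)
    return np.abs(rand @ direction / np.linalg.norm(direction))


# Synthetic stand-in data (an assumption for illustration only):
# "misaligned" activations are shifted along one hidden direction.
rng = np.random.default_rng(0)
d_model, n = 64, 200
true_dir = rng.standard_normal(d_model)
true_dir /= np.linalg.norm(true_dir)
aligned = rng.standard_normal((n, d_model))
misaligned = rng.standard_normal((n, d_model)) + 6.0 * true_dir

mean_dir = mean_diff_direction(aligned, misaligned)
pca_dir = pca_direction(aligned, misaligned)
baseline = random_baseline(mean_dir)

# The sign of a principal component is arbitrary, so compare magnitudes.
print("|cos(mean-diff, PCA)|:", abs(cosine(mean_dir, pca_dir)))
print("mean |cos| with random vectors:", baseline.mean())
```

On real activations, the comparison against random unit vectors plays the role of the control: a similarity between two candidate directions is only interesting if it sits well above what random vectors of the same dimension produce.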

Everything can be run from the Jupyter notebook colab.ipynb. (We built for Colab because it is currently the easiest way for us to access compute, sorry.)

