This repository contains reproductions of some of the graphs presented in Toy Models of Superposition, showing how small neural networks with non-linear activation functions can exhibit superposition to represent (compress) the features of larger networks.
I was interested in how these representations form, especially after reading Extracting Interpretable Features from Claude 3 Sonnet, so I have extended the graphs to show how they change during training.
I have only tested the code with Python 3.11.
Non-linear neural network layers can represent more features than they have neurons through a process called superposition, provided those features are sparse. The following graphs show how 5-dimensional features are mapped into a 2-dimensional layer at different sparsities.
For more explanation of the following graphs, see here.
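The toy model behind these graphs is small enough to sketch in a few lines. Below is a minimal sketch of that setup, not this repository's actual training script: it assumes PyTorch, the linear-map-plus-ReLU architecture from the Toy Models of Superposition paper (x' = ReLU(WᵀWx + b)), and illustrative names like `ToyModel` and `sparse_batch`.

```python
# A minimal sketch of the toy-model setup (assumes PyTorch; names are illustrative,
# not taken from this repository). Architecture follows Toy Models of Superposition.
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """Projects n_features-dimensional inputs through a d_hidden bottleneck
    and reconstructs them through a non-linearity: x' = ReLU(W^T W x + b)."""

    def __init__(self, n_features: int = 5, d_hidden: int = 2):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_features, d_hidden) * 0.1)
        self.b = nn.Parameter(torch.zeros(n_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x @ self.W                              # map features into the 2D layer
        return torch.relu(h @ self.W.T + self.b)    # reconstruct with a non-linearity

def sparse_batch(batch_size: int, n_features: int, sparsity: float) -> torch.Tensor:
    """Each feature is 0 with probability `sparsity`, otherwise uniform in [0, 1]."""
    x = torch.rand(batch_size, n_features)
    mask = torch.rand(batch_size, n_features) < (1 - sparsity)
    return x * mask

model = ToyModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(10_000):
    x = sparse_batch(1024, n_features=5, sparsity=0.9)
    loss = ((model(x) - x) ** 2).mean()             # reconstruction error
    opt.zero_grad()
    loss.backward()
    opt.step()
# The rows of model.W are the learned 2D directions for each of the 5 features;
# plotting them over training is what produces graphs like those below.
```

At high sparsity the model can afford to overlap feature directions in the 2D layer, which is the superposition the graphs visualise; at low sparsity it typically keeps only as many features as it has neurons.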