<a href="https://colab.research.google.com/github/patricli/neruoscope/blob/main/ux_improve_Modified_Interactive_Neuroscope_v22_1951.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ux-improve branch

I have a few ideas to make the output from the neuroscope easier to use.

- [X] Can we change the shape to make more stuff fit on a screen?
- [ ] Could we exclude boring lines?
- [ ] GIF?
- [ ] ~~The gradio form needs a submit button. I don't want auto update on the text box.~~ Victor figures it's fast enough now that this should be low priority.
- [ ] Fix the size of the UI so it allows you to focus on specific neurons

# Interactive Neuroscope

*This is an interactive accompaniment to [neuroscope.io](https://neuroscope.io) and to the [studying learned language features post](https://www.alignmentforum.org/posts/Qup9gorqpd9qKAEav/200-cop-in-mi-studying-learned-features-in-language-models) in [200 Concrete Open Problems in Mechanistic Interpretability](https://neelnanda.io/concrete-open-problems)*

There's a surprisingly rich ecosystem of easy ways to create interactive graphics, especially for ML systems. If you're trying to do mechanistic interpretability, the ability to do web dev and to both visualize data and interact with it seems high value! 

This is a demo of how you can combine HookedTransformer and [Gradio](https://gradio.app/) to create an interactive Neuroscope - a visualization of a neuron's activations on text that will dynamically update as you edit the text. I don't particularly claim that this code is any *good*, but the goal is to illustrate what quickly hacking together a custom visualisation (while knowing fuck all about web dev, like me) can look like! (And as such, I try to explain the basic web dev concepts I use)

Note that you'll need to run the code yourself to get the interactive interface, so the cell at the bottom will be blank at first!

To emphasise - the point of this notebook is to be a rough proof of concept that just about works, *not* to be the well executed ideal of interactively studying neurons! You are highly encouraged to write your own (and ideally, to [make a pull request](https://github.com/neelnanda-io/TransformerLens/pulls) with improvements!)

## Setup

In [None]:
import os

try:
    import google.colab

    IN_COLAB = True
    print("Running as a Colab notebook")

except ImportError:
    IN_COLAB = False
    print("Running as a Jupyter notebook - intended for development only!")
    from IPython import get_ipython

    ipython = get_ipython()
    # Automatically reload the HookedTransformer code as it's edited, without restarting the kernel
    ipython.run_line_magic("load_ext", "autoreload")
    ipython.run_line_magic("autoreload", "2")

In [None]:
import os

#if IN_COLAB:
os.system("pip install git+https://github.com/neelnanda-io/TransformerLens.git")
os.system("pip install gradio")
os.system("pip install plotly")

In [None]:
!pip install git+https://github.com/neelnanda-io/Easy-Transformer.git

In [None]:
import gradio as gr
from transformer_lens import HookedTransformer
from transformer_lens.utils import to_numpy
from IPython.display import HTML

import transformer_lens.utils as utils

import plotly.express as px
import plotly.graph_objects as go

Some plotting code. Wrappers around Plotly, not important to understand.

In [None]:
def heatmap(tensor, yaxis="", xaxis="", value="Activation", **kwargs):
  tensor = utils.to_numpy(tensor)
  plot_kwargs = {"color_continuous_scale":"RdBu", "color_continuous_midpoint":0.0, "labels":{"x":xaxis, "y":yaxis, "color":value}}
  plot_kwargs.update(kwargs)
  return px.imshow(tensor, **plot_kwargs)

In [None]:
def imshow(tensor, yaxis="", xaxis="", **kwargs):
  fig = heatmap(tensor, yaxis, xaxis, **kwargs)
  fig.show()

In [None]:
import plotly.io as pio

# Thanks to annoying rendering issues, Plotly graphics will show up in either
# Colab OR VS Code depending on the renderer - this is bad for developing demos! So we set the renderer explicitly.
pio.renderers.default = "colab"

## Extracting Model Activations

We first write some code using HookedTransformer's hooks to extract the activations of every neuron in a given MLP layer, for a given text.

In [None]:
model_name = "gpt2-small"

model = HookedTransformer.from_pretrained(model_name)

In [None]:
def get_layer_acts(text, layer):
    # Hacky way to get state out of a single hook - we make a dict and write the activation into it within the hook.
    cache = {}

    def caching_hook(act, hook):
        cache["activation"] = act[0, :, :]

    model.run_with_hooks(
        text, fwd_hooks=[(f"blocks.{layer}.mlp.hook_post", caching_hook)]
    )
    return to_numpy(cache["activation"])
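The trick in `get_layer_acts` (a hook that writes into a dict from the enclosing scope) does not depend on HookedTransformer. Here is a minimal pure-Python sketch of the same pattern. `toy_forward`, `get_toy_acts` and the layer names are made up for illustration and are not part of TransformerLens.

```python
def toy_forward(x, fwd_hooks=()):
    """Toy stand-in for a model forward pass (hypothetical, not TransformerLens).

    Runs two named "layers" in order and calls any hook registered for a
    layer name with that layer's output.
    """
    hooks = dict(fwd_hooks)
    layers = [
        ("double", lambda v: [2 * a for a in v]),
        ("add_one", lambda v: [a + 1 for a in v]),
    ]
    for name, fn in layers:
        x = fn(x)
        if name in hooks:
            hooks[name](x, name)
    return x


def get_toy_acts(x, layer_name):
    cache = {}  # lives in the enclosing scope, so the hook can write into it

    def caching_hook(act, hook_name):
        # Store a copy of the intermediate value seen at this layer
        cache["activation"] = list(act)

    toy_forward(x, fwd_hooks=[(layer_name, caching_hook)])
    return cache["activation"]


print(get_toy_acts([1, 2, 3], "double"))   # [2, 4, 6]
print(get_toy_acts([1, 2, 3], "add_one"))  # [3, 5, 7]
```

The hook's return value is never used here, which is why the result has to be smuggled out through `cache`.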

## Default values

In [None]:
default_text = "The following is a list of powers of 10: 1, 10, 100, 1000, 10000, 100000, 1000000, 10000000"
default_layer = 9
default_min_neuron_index = 0
default_max_neuron_index = model.cfg.d_mlp
default_max_val=4
default_min_val=-4

## Visual verification

We can run this function and verify that it gives vaguely sensible outputs.

In [None]:
print(model.to_str_tokens(default_text))
print("Default layer: ", get_layer_acts(default_text, default_layer))
print("Shape: ", get_layer_acts(default_text, default_layer).shape)

imshow(
    get_layer_acts(default_text, default_layer).T[default_min_neuron_index:default_max_neuron_index],
    height=max(200, (default_max_neuron_index - default_min_neuron_index)), width=800,
    aspect="auto", xaxis="Token", yaxis="Neuron",
    x=[f'{i}: "{token}"' for i, token in enumerate(model.to_str_tokens(default_text))],
    y=list(range(default_min_neuron_index, default_max_neuron_index))
)

## Visualizing Model Activations

We now write some code to visualize the neuron activations on some text - we're going to hack something together that draws a Plotly heatmap with one column per token and one row per neuron, colored according to the neuron's activation. The color scale runs from `min_val` to `max_val`. You can do much better, but this is a useful proof of concept of what "just hack stuff together" can look like!

I'll be keeping neuron 562 in layer 9 as a running example, as it seems to activate strongly on powers of 10.

Note that this visualization is very sensitive to `max_val` and `min_val`! You can tune those to whatever seems reasonable for the distribution of neuron activations you care about - I generally default to `min_val=0` and `max_val` as the max activation across the dataset.
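To see why fixed bounds matter, here is a small pure-Python sketch of a clip-and-rescale step, which is roughly the role `min_val` and `max_val` play for the color scale. The `normalize` helper and the example activations are made up for illustration and are not part of the notebook's code.

```python
def normalize(acts, min_val=None, max_val=None):
    """Clip activations to [min_val, max_val] and rescale them to [0, 1]."""
    if min_val is None:
        min_val = min(acts)
    if max_val is None:
        max_val = max(acts)
    if max_val == min_val:
        # Avoid dividing by zero when every activation is identical
        return [0.0 for _ in acts]
    return [
        (min(max(a, min_val), max_val) - min_val) / (max_val - min_val)
        for a in acts
    ]


short_text = [0.0, 1.0, 2.0]
longer_text = [0.0, 1.0, 2.0, 4.0]  # same text with one more, stronger token

# Data-dependent bounds: the first three tokens change color when a token is added
print(normalize(short_text))   # [0.0, 0.5, 1.0]
print(normalize(longer_text))  # [0.0, 0.25, 0.5, 1.0]

# Fixed bounds: the first three tokens keep the same color
print(normalize(short_text, min_val=0.0, max_val=4.0))   # [0.0, 0.25, 0.5]
print(normalize(longer_text, min_val=0.0, max_val=4.0))  # [0.0, 0.25, 0.5, 1.0]
```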

In [None]:
def basic_neuron_vis(text, layer, min_neuron_index, max_neuron_index, max_val=None, min_val=None):
    """
    text: The text to visualize
    layer: The layer index
    min_neuron_index: The minimum neuron index to show (inclusive)
    max_neuron_index: The maximum neuron index to show (exclusive)
    max_val: Upper end of the color scale (defaults to the max activation shown)
    min_val: Lower end of the color scale (defaults to the min activation shown)

    Returns a Plotly heatmap figure with one column per token and one row per neuron, colored according to activation

    Note: It's useful to be able to input a fixed max_val and min_val, because otherwise the colors will change as you edit the text, which is annoying.
    """
    if layer is None:
        return "Please select a Layer"
    if min_neuron_index is None:
        return "Please select a minimum neuron index"
    if max_neuron_index is None:
        return "Please select a maximum neuron index"

    tokens = model.to_str_tokens(text)
    acts = get_layer_acts(text, layer).T[min_neuron_index:max_neuron_index]
    if max_val is None:
        max_val = acts.max()
    if min_val is None:
        min_val = acts.min()
  

    return heatmap(
        acts, height=min(1200, max(500, (max_neuron_index - min_neuron_index))), width=800,
        aspect="auto", xaxis="Token", yaxis="Neuron",
        x=[f'{i}: "{token}"' for i, token in enumerate(tokens)],
        y=list(range(min_neuron_index, max_neuron_index)), zmin=min_val, zmax=max_val
    )

## Create Interactive UI

We now put all these together to create an interactive visualization in Gradio! 

The internal format is that there's a bunch of elements - Textboxes, Sliders, Numbers, etc. which the user can interact with and which return strings and numbers. We can also define output elements that just display things - in this case, a `gr.Plot` which takes in a Plotly figure. We call `input.change(update_function, inputs, output)` - this says "if that input element changes, run the update function on the values of the elements in `inputs` and set the value of `output` to the output of the function". As a bonus, this gives us live interactivity!

This is also more complex than a typical Gradio intro example - I wanted a custom plot to display the nice colours, which made things messier! Normally you could just make `out` into another Textbox and pass it a string.
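The `change(update_function, inputs, output)` semantics can be mimicked in a few lines of plain Python. This is a toy model only: `ToyComponent`, its `set` method and `describe` are made up for illustration, and none of it reflects how Gradio is actually implemented.

```python
class ToyComponent:
    """Toy stand-in for a UI element (hypothetical, not Gradio's implementation)."""

    def __init__(self, value=None):
        self.value = value
        self._listeners = []

    def change(self, fn, inputs, output):
        # Remember what to run when this component's value changes
        self._listeners.append((fn, inputs, output))

    def set(self, value):
        # Simulates the user editing the element
        self.value = value
        for fn, inputs, output in self._listeners:
            # Read the current value of every input, not just the one that changed
            output.value = fn(*[inp.value for inp in inputs])


def describe(text, layer):
    return f"{len(text.split())} words, layer {layer}"


text = ToyComponent("powers of 10")
layer = ToyComponent(9)
out = ToyComponent()

inputs = [text, layer]
for inp in inputs:
    inp.change(describe, inputs, out)

text.set("1, 10, 100")
print(out.value)  # 3 words, layer 9
layer.set(3)
print(out.value)  # 3 words, layer 3
```

Registering the same function on every input is what makes the output refresh no matter which control the user touches.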

In [None]:
# The `with gr.Blocks() as demo:` syntax just creates a variable called demo containing all these components
with gr.Blocks() as demo:
    gr.HTML(value=f"Hacky Interactive Neuroscope for {model_name}")
    # The input elements
    with gr.Row():
        with gr.Column():
            text = gr.Textbox(label="Text", value=default_text)
            # step=1 restricts the slider to whole numbers
            # value sets the initial default value
            layer = gr.Slider(
                label="Layer", value=default_layer, step=1, minimum=0, maximum=model.cfg.n_layers - 1)
            min_neuron_index = gr.Slider(
                label="Minimum neuron index", value=default_min_neuron_index, 
                step=1, minimum=0, maximum=model.cfg.d_mlp)
            max_neuron_index = gr.Slider(
                label="Maximum neuron index", value=default_max_neuron_index, 
                step=1, minimum=0, maximum=model.cfg.d_mlp)
            max_val = gr.Number(label="Max Value", value=default_max_val)
            min_val = gr.Number(label="Min Value", value=default_min_val)
            inputs = [text, layer, min_neuron_index, max_neuron_index, max_val, min_val]
        with gr.Column():
            # The output element
            out = gr.Plot(
                basic_neuron_vis(default_text, default_layer, default_min_neuron_index, default_max_neuron_index,
                                 default_max_val, default_min_val))
    for inp in inputs:
        inp.change(basic_neuron_vis, inputs, out)

We can now launch our demo element, and we're done! Setting `share=True` even gives you a public link to the demo (though it just redirects to the backend run by this notebook, and will go away once you turn the notebook off!). Sharing makes it much slower, and can be turned off if you aren't in a Colab.

**Exercise:** Explore where this neuron does and does not activate. Is it just powers of ten? Just comma separated numbers? Numbers in any particular sequence?

In [None]:
demo.launch(share=True)

In [None]:
import time

while True:  # Hack to take advantage of Colab Pro's longer inactive time
    time.sleep(60)  # Keep the cell running without burning CPU