When the German neurons are ablated, the model makes predictions on certain tokens that result in extremely high losses. 
Investigating these tokens, and the correct "next token" prediction, shows that many form German bigrams. The most prevalent are:

20 ('id', 'ig')
7 ('rt', 'ige')
5 (' Vert', 'rag')
5 ('he', 'ben')
4 ('ät', 'ig')
4 ('nd', 'liche')

We should run these bigrams with and without the German neurons ablated, then investigate which neurons have the most 
different activations at the position of the first bigram token. We'll do this using the raw logit difference.

Q: should we use single token prompts with the first token in each bigrams, or should we create prompts that end in the bigram?
A: we will create prompts the end in the bigram because models behave inconsistently for the first few tokens in each prompt, 
and we don't want this behaviour affecting our results.

Q: how will we create the prompts?
A: look at the word the token was used in, use GPT to generate prompts that end in this word.

20 ('id', 'ig') - Verteidigung, Verteidiger, Rechtsverteidigung
7 ('rt', 'ige') - auswärtige, sofortige, neuartigen
5 (' Vert', 'rag') - Vertragsbediensteten, vertraglichen
5 ('he', 'ben') - hervorheben, entheben
4 ('ät', 'ig') - tätig, Tätigkeit, Berufstätigen, bestätigt, gewalttätigen
4 ('nd', 'liche') - gründliche, selbstverständlichen, unmißverständlicher, ländlichen, unmissverständlichere

## Generate data with GPT4

In [25]:
prompts = {}
words = ['Verteidigung', 'auswärtige', 'Vertragsbediensteten', 'hervorheben', 'tätig', 'gründliche']

# 'Verteidigung'
prompts[('id', 'ig')] = [
    'Ich verbringe viel Zeit mit dem Studium der Theorie und Praxis der Verteidigung',
    'In seiner Rede betonte der Minister die Notwendigkeit einer starken nationalen Verteidigung'
    'Es ist wichtig, dass wir ein geeignetes Budget für die Verteidigung',
    'Sein Fokus liegt auf der Verbesserung seiner Techniken in der Verteidigung',
    'Das Angriffsspiel ist wichtig, aber wir dürfen die Bedeutung der Verteidigung',
    'Als Anwalt hat sie viele Jahre Erfahrung in der Verteidigung',
    'Die Verteidigung',
    'In der Militärstrategie ist die beste Angriffstaktik oft eine gute Verteidigung',
    'Der Anwalt führte eine starke und überzeugende Verteidigung',
    'Die Regierung hat die Stärkung der Verteidigung',]

# 'auswärtige'
prompts[('rt', 'ige')] = [
    'Meine Tätigkeit erfordert viele Reisen, daher bin ich oft auswärtige',
    'Er ist als diplomatischer Berater für alle auswärtige',
    'Der Minister für auswärtige',
    'Sie ist Expertin für auswärtige',
    'Die Behörde für auswärtige',
    'Es ist wichtig, sich über auswärtige',
    'Wir sollten uns auf die auswärtige',
    'In seiner Rolle überwacht er auswärtige',
    'Der Diplomat hat eine lange Karriere in auswärtige',
    'Die Universität bietet einen Studiengang in auswärtige']

# 'Vertragsbediensteten'
prompts[(' Vert', 'rag')] = [
    'Nach seiner Ausbildung begann er seine Karriere als einer der Vertragsbediensteten',
    'Das Unternehmen hat eine Reihe von Vertragsbediensteten',
    'Die Rechte und Pflichten der Vertragsbediensteten',
    'Die Bezahlung der Vertragsbediensteten',
    'Wegen des hohen Arbeitsaufkommens werden zusätzliche Vertragsbediensteten',
    'Der Status der Vertragsbediensteten',
    'Die Gesundheits- und Sicherheitsvorschriften gelten auch für die Vertragsbediensteten',
    'Alle Vertragsbediensteten müssen eine Verschwiegenheitserklärung',
    'Die Firma plant, das Team der Vertragsbediensteten',
    'Die Schulung neuer Vertragsbediensteten']

# 'hervorheben'
prompts[('he', 'ben')] = [
    'In Ihrem Lebenslauf sollten Sie Ihre besonderen Fähigkeiten und Erfahrungen hervorheben',
    'Die hellen Farben im Bild sollen die Dynamik und Energie der Szene hervorheben',
    'Bei der Präsentation sollten Sie die Hauptpunkte hervorheben',
    'Die Wissenschaftler wollen die Bedeutung ihrer Forschungsergebnisse hervorheben',
    'Die Autorin nutzte Metaphern, um die Emotionen ihrer Charaktere hervorheben',
    'Mit diesem Marketingstrategieplan wollen wir die Einzigartigkeit unseres Produkts hervorheben',
    'Es ist wichtig, in der Debatte die Fakten zu hervorheben',
    'In seinem Vortrag versuchte der Redner, die Relevanz des Themas für das Publikum hervorheben',
    'Beim Design des Hauses wurde besonderer Wert darauf gelegt, die natürlichen Materialien hervorheben',
    'Im Interview konnte sie ihre umfangreichen Kenntnisse und Erfahrungen hervorheben']

# 'tätig'
prompts[('ät', 'ig')] = [
    'Nach seinem Studium war er viele Jahre in der Marketingbranche tätig',
    'Sie ist als Freiwillige in einer gemeinnützigen Organisation tätig',
    'Ich bin seit über zehn Jahren als Lehrer tätig',
    'Er ist hauptsächlich in der Beratung von Start-up-Unternehmen tätig',
    'Als Journalistin war sie vor allem im politischen Bereich tätig',
    'Mein Bruder ist als Softwareentwickler tätig',
    'In ihrer Freizeit ist sie in verschiedenen sozialen Projekten tätig',
    'Nach seinem Ruhestand ist er ehrenamtlich in der Gemeinde tätig',
    'Sie ist als Autorin tätig und hat bereits mehrere Bücher veröffentlicht',
    'Als Anwalt ist er vor allem in den Bereichen Strafrecht und Zivilrecht tätig']

# 'gründliche'
prompts[('nd', 'liche')] = [
    'Bevor wir mit dem Projekt fortfahren, benötigen wir eine gründliche',
    'Der Erfolg der Operation hängt von einer gründlichen',
    'Das Gesetz erfordert eine gründliche',
    'Vor dem Kauf eines Gebrauchtwagens sollte man eine gründliche',
    'Die Ermittlungen in dem Fall erfordern eine gründliche',
    'Die Studie liefert eine gründliche',
    'Das Projektteam hat eine gründliche',
    'Vor dem Abschluss des Geschäfts wird eine gründliche',
    'Die Wartung des Systems erfordert eine gründliche',
    'Die Durchführung einer gründliche']

Next we need to run through each set of prompts. For each prompt we tokenize, then traverse backwards through the model until we find the last token of our bigrams.
We concatenate any tokens after this one. This gives us our final dataset.

Next, we do a forward pass with the German neurons ablated and unablated, and save the cache. There's an existing method to do this.
We select the MLP activations at the second to last position. 
We average the MLP activations.
We compare the average difference in neuron activation at that position, and select the neurons with the largest average difference.

## Setup

In [17]:
from collections import defaultdict

import torch
import numpy as np
from transformer_lens import HookedTransformer
import plotly.express as px

# Plotly needs a different renderer for VSCode/Notebooks vs Colab argh
import plotly.io as pio
pio.renderers.default = "colab+vscode"

from haystack_utils import load_txt_data, get_mlp_activations
import haystack_utils

%reload_ext autoreload
%autoreload 2

In [65]:
haystack_utils.clean_cache()
device = "cuda" if torch.cuda.is_available() else "cpu"
torch.autograd.set_grad_enabled(False)
torch.set_grad_enabled(False)
model = HookedTransformer.from_pretrained("EleutherAI/pythia-70m", fold_ln=True, device=device)

Using pad_token, but it is not set yet.


Loaded pretrained model EleutherAI/pythia-70m into HookedTransformer


## Remove any tokens after the second bigram token from the dataset and calculate MLP mean activations

In [66]:
# Remove any tokens after the second bigram token frmo the dataset

processed_prompts = defaultdict(list)

for key, value in prompts.items():
    bigram_tokens = model.to_tokens(key, prepend_bos=False)
    new_prompts = []
    for prompt in value:
        prompt_tokens = model.to_tokens(prompt, prepend_bos=False)[0] # pos
        i = prompt_tokens.shape[0] - 1
        while (prompt_tokens[i] != bigram_tokens[1] or prompt_tokens[i - 1] != bigram_tokens[0]) and (i > 1):
            i -= 1
        prompt_tokens = prompt_tokens[:i + 1]
        prompt_string = model.to_string(prompt_tokens)
        new_prompts.append(prompt_string)

    processed_prompts[key] = new_prompts

prompts = processed_prompts

In [67]:
# Calculate MLP mean activations
haystack_utils.clean_cache()

german_data = load_txt_data("wmt_german_large.txt")[:500]
english_data = load_txt_data("kde4_english.txt")[:500]

german_mean_low_activations = defaultdict(torch.Tensor, {
    3: get_mlp_activations(english_data, 3, model, mean=True),  # [2048]
    4: get_mlp_activations(english_data, 4, model, mean=True),
    5: get_mlp_activations(english_data, 5, model, mean=True)
})
german_mean_high_activations = defaultdict(torch.Tensor, {
    3: get_mlp_activations(german_data, 3, model, mean=True),  # [2048]
    4: get_mlp_activations(german_data, 4, model, mean=True),
    5: get_mlp_activations(german_data, 5, model, mean=True)
})

our_german_neurons = [(3, 669), (5, 1336), (4, 482), (5, 1039), (4, 326)]
german_neurons = [(4, 482), (5, 1039), (5, 407), (5, 1516), (5, 1336), (4, 326), (5, 250), (3, 669)]

our_german_neurons_by_layer = defaultdict(list)
for item in our_german_neurons:
    our_german_neurons_by_layer[item[0]].append(item[1])
german_neurons_by_layer = defaultdict(list)
for item in german_neurons:
    german_neurons_by_layer[item[0]].append(item[1])

wmt_german_large.txt: Loaded 2459 examples with 800 to 2000 characters each.
kde4_english.txt: Loaded 1007 examples with 501 to 5295 characters each.


  0%|          | 0/500 [00:00<?, ?it/s]

  0%|          | 0/500 [00:00<?, ?it/s]

  0%|          | 0/500 [00:00<?, ?it/s]

  0%|          | 0/500 [00:00<?, ?it/s]

  0%|          | 0/500 [00:00<?, ?it/s]

  0%|          | 0/500 [00:00<?, ?it/s]

## Experiment

### Highest token loss difference from ablating German neurons

In [81]:
# Not super relevant to the experiment, more of a sanity check
def disable_german_hook(value, hook):
    layer = hook.layer()
    german_neurons_for_layer = our_german_neurons_by_layer[layer]
    value[:, :, german_neurons_for_layer] = german_mean_low_activations[layer][german_neurons_for_layer].cuda()
    return value

def enable_german_hook(value, hook):
    layer = hook.layer()
    german_neurons_for_layer = our_german_neurons_by_layer[layer]
    value[:, :, german_neurons_for_layer] = german_mean_high_activations[layer][german_neurons_for_layer].cuda() * 2.2
    return value

mlp_pattern = lambda name: name.endswith("mlp.hook_post")

for key, value in prompts.items():
    enabled_loss = haystack_utils.get_average_loss([key[0] + key[1]], model, batch_size=1, crop_context=-1, fwd_hooks=[(mlp_pattern, enable_german_hook)], positionwise=True)
    disabled_loss = haystack_utils.get_average_loss([key[0] + key[1]], model, batch_size=1, crop_context=-1, fwd_hooks=[(mlp_pattern, disable_german_hook)], positionwise=True)
    print([i - j for i, j in zip(enabled_loss[:3], disabled_loss[:3])])

model.reset_hooks()
print(haystack_utils.generate_text("A4-0409/98 von Herrn Roubatis im Namen des Ausschusses für auswärt", model))

[-0.047768592834472656, -3.5971479415893555, 0]
[0.18825721740722656, -3.541370391845703, 0]
[-0.9033498764038086, -5.2075886726379395, 0]
[-0.7703180313110352, -0.7334022521972656, 0]
[-1.502431869506836, -1.8964672088623047, 0]
[-0.6871128082275391, -4.760354042053223, 0]
A4-0409/98 von Herrn Roubatis im Namen des Ausschusses für auswärtche Familie, die in der Wahlkampf in der Wahlkampf


In [84]:
# for key, value in prompts.items():
#     print(key, ":", value[0])
#     loss = haystack_utils.get_average_loss(value, model, batch_size=1, crop_context=-1, positionwise=True)
#     ablated_loss = haystack_utils.get_average_loss(value, model, batch_size=1, crop_context=-1, fwd_hooks=[(mlp_pattern, disable_german_hook)], positionwise=True)
#     print([i - j for i, j in zip(ablated_loss, loss)])

### Direct Effect, Indirect Effect

- Run model with and without ablating the German neurons, save both clean and ablated activations
- Run model again without ablation
- Simulate the effect of individual ablated components
- To simulate ablating a component:
    - Before the final layernorm, subtract the cached activation the component from the unablated run
    - Then add the activation of the ablated run
- Then we can compute the effect of running a component with corrupted activations without letting its output affect later components
- However, the cached ablated activations of later components will still be influenced by earlier components
    - [ ] Check if we can improve this by patching in ablated MLP and unablated earlier component residual stream contributions, then running the later component
    - [ ] Check patching library

In [70]:
# adapted from "DLA" function
def direct_loss_difference_attribution(
    prompts: list[str],
    model: HookedTransformer,
    fwd_hooks=[],
    patched_component=8,
    position=-1,
    crop_context: None | tuple[int, int]=None,
):
    """
    Get the effect of the German neurons being ablated for a single component. 
    Should be 0 before the component containing the earliest German neuron.

    Takes a list of neuron tuples like [layer, neuron_index]. Uses loss at specified token position

    How: for each prompt, get the original and ablated caches and decompose the residual stream into 
    its components. Then do a forward pass, hook into the residual stream at a single component, and 
    swap its original output activations for the ablated output activations.
    """
    original_losses = []
    patched_losses = []
    for prompt in prompts:
        if crop_context is not None:
            tokens = model.to_tokens(prompt)[:, crop_context[0]:crop_context[1]].cuda()
        else:
            tokens = model.to_tokens(prompt).cuda()

        original_loss, original_cache = model.run_with_cache(tokens, return_type="loss", loss_per_token=True)
        with model.hooks(fwd_hooks=fwd_hooks):
            ablated_loss, ablated_cache = model.run_with_cache(tokens, return_type="loss")

        # component, batch, pos, residual
        original_per_layer_residual, original_labels = original_cache.decompose_resid(layer=-1, return_labels=True, apply_ln=False)
        ablated_per_layer_residual, ablated_labels = ablated_cache.decompose_resid(layer=-1, return_labels=True, apply_ln=False)

        # ['embed', '0_attn_out', '0_mlp_out', '1_attn_out', '1_mlp_out', '2_attn_out', '2_mlp_out', '3_attn_out', '3_mlp_out', '4_attn_out', '4_mlp_out', '5_attn_out', '5_mlp_out']
        def swap_cache_hook(value, hook):
            # Batch, pos, residual
            value -= original_per_layer_residual[patched_component]
            value += ablated_per_layer_residual[patched_component]
        
        with model.hooks(fwd_hooks=[(f'blocks.5.hook_resid_post', swap_cache_hook)]):
            patched_loss = model(tokens, return_type="loss", loss_per_token=True)

        original_losses.append(original_loss[0, position].item())
        patched_losses.append(patched_loss[0, position].item())


    print(f"Original loss: {np.mean(original_losses):.2f}, patched loss: {np.mean(patched_losses):.2f} (+{((np.mean(patched_losses) - np.mean(original_losses)) / np.mean(original_losses))*100:.2f}%)")
    return np.mean(original_losses), np.mean(patched_losses)

In [71]:
def line(x, xlabel="", ylabel="", title="", xticks=None, width=800, hover_data=None):
    fig = px.line(x, title=title)
    fig.update_layout(xaxis_title=xlabel, yaxis_title=ylabel, width=width)
    if xticks != None:
        fig.update_layout(
            xaxis = dict(
            tickmode = 'array',
            tickvals = [i for i in range(len(xticks))],
            ticktext = xticks
            )
        )
    if hover_data != None:
        fig.update(data=[{'customdata': hover_data, 'hovertemplate': "Loss: %{y:.4f} (+%{customdata:.2f}%)"}])
    fig.show()

In [72]:
# Indirect effect of ablating the context neurons
# Logit attribution of later components when ablating the context neuron
# Not sure how clean this is - e.g. layer 5 MLP will get the accumulated effects of all previous layers from ablating the context neuron
component_names = ['embed', '0_attn_out', '0_mlp_out', '1_attn_out', '1_mlp_out', '2_attn_out', '2_mlp_out', '3_attn_out', '3_mlp_out', '4_attn_out', '4_mlp_out', '5_attn_out', '5_mlp_out']

def ablate_german_neurons_hook(value, hook):
    layer = hook.layer()
    neurons_for_layer = german_neurons_by_layer[layer]
    value[:, :, neurons_for_layer] = german_mean_low_activations[layer][neurons_for_layer].cuda()
    return value
mlp_pattern = lambda name: name.endswith("mlp.hook_post")
fwd_hooks=[(mlp_pattern, ablate_german_neurons_hook)]

for key, value in prompts.items():
    # Go through components after MLP3, which contains the first German contextual neuron
    components = []
    losses = []
    for later_component in range(8, 13):
        print(f"Component: {component_names[later_component]}")
        layer = 3
        original_loss, patched_loss = direct_loss_difference_attribution(value, model, fwd_hooks=fwd_hooks, patched_component=later_component, crop_context=(0, 500))
        if len(losses) == 0:
            components.append("Original loss")
            losses.append(original_loss)
        components.append(component_names[later_component])
        losses.append(patched_loss)

    percent_increase = ((np.array(losses) - losses[0]) / losses[0]) * 100

    line(losses, xlabel="Component", ylabel="Loss", title=f"Loss of individual patching individual components when ablating German neurons <br> on prompts with the bigram {key}", xticks=components, width=800, hover_data=percent_increase.tolist())

Component: 3_mlp_out


Original loss: 0.80, patched loss: 1.14 (+41.78%)
Component: 4_attn_out
Original loss: 0.80, patched loss: 0.82 (+1.66%)
Component: 4_mlp_out
Original loss: 0.80, patched loss: 1.42 (+76.95%)
Component: 5_attn_out
Original loss: 0.80, patched loss: 0.96 (+18.85%)
Component: 5_mlp_out
Original loss: 0.80, patched loss: 3.81 (+373.67%)


Component: 3_mlp_out
Original loss: 6.81, patched loss: 9.28 (+36.24%)
Component: 4_attn_out
Original loss: 6.81, patched loss: 6.81 (+-0.05%)
Component: 4_mlp_out
Original loss: 6.81, patched loss: 8.37 (+22.85%)
Component: 5_attn_out
Original loss: 6.81, patched loss: 7.06 (+3.65%)
Component: 5_mlp_out
Original loss: 6.81, patched loss: 5.81 (+-14.67%)


Component: 3_mlp_out
Original loss: 7.13, patched loss: 7.27 (+1.92%)
Component: 4_attn_out
Original loss: 7.13, patched loss: 7.13 (+-0.01%)
Component: 4_mlp_out
Original loss: 7.13, patched loss: 7.35 (+3.01%)
Component: 5_attn_out
Original loss: 7.13, patched loss: 7.17 (+0.55%)
Component: 5_mlp_out
Original loss: 7.13, patched loss: 7.36 (+3.17%)


Component: 3_mlp_out
Original loss: 1.94, patched loss: 3.51 (+80.58%)
Component: 4_attn_out
Original loss: 1.94, patched loss: 1.94 (+-0.02%)
Component: 4_mlp_out
Original loss: 1.94, patched loss: 2.38 (+22.64%)
Component: 5_attn_out
Original loss: 1.94, patched loss: 2.14 (+10.01%)
Component: 5_mlp_out
Original loss: 1.94, patched loss: 5.53 (+184.66%)


Component: 3_mlp_out
Original loss: 2.41, patched loss: 2.64 (+9.59%)
Component: 4_attn_out
Original loss: 2.41, patched loss: 2.47 (+2.50%)
Component: 4_mlp_out
Original loss: 2.41, patched loss: 3.41 (+41.46%)
Component: 5_attn_out
Original loss: 2.41, patched loss: 2.44 (+1.31%)
Component: 5_mlp_out
Original loss: 2.41, patched loss: 3.68 (+52.72%)


Component: 3_mlp_out
Original loss: 6.11, patched loss: 6.18 (+1.13%)
Component: 4_attn_out
Original loss: 6.11, patched loss: 6.11 (+-0.00%)
Component: 4_mlp_out
Original loss: 6.11, patched loss: 6.26 (+2.42%)
Component: 5_attn_out
Original loss: 6.11, patched loss: 6.11 (+-0.04%)
Component: 5_mlp_out
Original loss: 6.11, patched loss: 6.31 (+3.36%)


## Get individual neuron activation differences

In [73]:
from typing import List, Tuple

# Modified from downstream components notebook to take neurons on multiple layers and process a single position in each prompt
def get_ablated_mlp_difference(
        prompts: List[str], 
        model: HookedTransformer, 
        fwd_hooks: List[Tuple],
        layer_to_cache: int, 
        position=-2
):
    """Difference with ablation for one component, not doing any fancy direct/indirect effect logic"""
    original_losses = []
    ablated_losses = []
    mean_differences = []
    for prompt in prompts:
        tokens = model.to_tokens(prompt)
        original_loss, original_cache = model.run_with_cache(tokens, return_type="loss", loss_per_token=True)
        with model.hooks(fwd_hooks=fwd_hooks):
            ablated_loss, ablated_cache = model.run_with_cache(tokens, return_type="loss", loss_per_token=True)

        original_losses.append(original_loss[0, position].item())
        ablated_losses.append(ablated_loss[0, position].item())

        block_name = f'blocks.{layer_to_cache}.mlp.hook_post'
        original_activations = original_cache[block_name][:, position] # batch pos d_mlp
        ablated_activations = ablated_cache[block_name][:, position]

        # Do we need a mask here
        mean_difference = original_activations.mean((0)) - ablated_activations.mean((0))
        mean_differences.append(mean_difference)
        
        
    print(f"Original loss: {np.mean(original_losses):.2f}, ablated loss: {np.mean(ablated_losses):.2f} (+{((np.mean(ablated_losses) - np.mean(original_losses)) / np.mean(original_losses))*100:.2f}%)")
    return torch.stack(mean_differences).mean(0)


def ablate_german_neurons_hook(value, hook):
    if hook.layer() == 3:
        value[:, :, 669] = german_mean_low_activations[layer][669].cuda()
    return value
mlp_pattern = lambda name: name.endswith("mlp.hook_post")
fwd_hooks=[(mlp_pattern, ablate_german_neurons_hook)]

In [74]:
bigram_diff = np.array([])

for key, value in prompts.items():
    for layer_to_cache in [4, 5]:
        difference = get_ablated_mlp_difference(value, model, fwd_hooks=fwd_hooks, layer_to_cache=layer_to_cache)
        bigram_diff = difference
        sorted_differences, sorted_neurons = torch.topk(difference.abs(), len(difference), largest=True)
        # fig = px.histogram(difference.cpu().numpy(), title=f"Difference in layer {layer_to_cache} neuron activations between original and ablated <br> German context neuron with the bigram {key}", width=800)
        line(sorted_differences.cpu().numpy()[:100], xlabel="Neuron", ylabel="Absolute difference", xticks=sorted_neurons.cpu().tolist()[:100], title=f"Top absolute neuron differences in layer {layer_to_cache} for first token of bigram {key}", width=1400)

Original loss: 1.10, ablated loss: 0.85 (+-22.68%)


Original loss: 1.10, ablated loss: 0.85 (+-22.68%)


Original loss: 0.14, ablated loss: 0.45 (+211.78%)


Original loss: 0.14, ablated loss: 0.45 (+211.78%)


Original loss: 7.96, ablated loss: 7.95 (+-0.21%)


Original loss: 7.96, ablated loss: 7.95 (+-0.21%)


Original loss: 6.75, ablated loss: 7.87 (+16.68%)


Original loss: 6.75, ablated loss: 7.87 (+16.68%)


Original loss: 4.22, ablated loss: 5.79 (+37.19%)


Original loss: 4.22, ablated loss: 5.79 (+37.19%)


Original loss: 8.21, ablated loss: 8.20 (+-0.21%)


Original loss: 8.21, ablated loss: 8.20 (+-0.21%)


Most of these are just the context neurons we already know about. 

### Repeat with logit diff directions

- Find a plausible English completion for each first token and use a logit difference metric similar to IOI instead of the absolute logit for our answer token. 
Then we have a metric that is decomposable into individual component contributions like the logits, but isn't sensitive to scaling of the entire logit vector.
  - Just run the model with the first token and take the second one as the English/non-German bigram?
  - logit diff = x(token 2)W_U - x(english alternative token)W_U

In [75]:
# # Find plausible English completions for our bigrams


# def get_plausible_completions():
#     for key in prompts.keys():
        

- Repeat with French? Likely to get similar results