# Scratch notes - Patchscopes streamlit tutorial

Five tasks, forming a rough tutorial structure:  
1. Decoding next-token predictions     
    a. Logit Lens - original  
    b. Logit Lens - via Patchscope  
    c. Tuned Lens - original  
    d. Tuned Lens - via Patchscope  
    e. Future Lens - original  
    f. Future Lens - via Patchscope  
    g. Token identity Patchscope  
2. Attribute extraction  
    a. LRE Attribute Lens - original  
    b. LRE Attribute Lens - via Patchscope  
    c. Probing  
    d. Feature extraction Patchscope   
3. Entity resolution  
    a. Causal Tracing - original  
    b. Causal Tracing - via Patchscope  
    c. Attention Knockout - original  
    d. Attention Knockout - via Patchscope  
    e. Entity Description Patchscope  
    f. X-model Entity Description Patchscope   
4. Cross-model patching  
    a. N/A?   
5. Multi-hop reasoning  
    a. N/A?  

- ![image.png](attachment:4f7a3c4e-4fb0-4037-bbf2-8336163043d7.png)  



# Section 1a - Decoding next-token predictions via Logit Lens

We'll also need the models used in this experiment (Figure 2):
1. Vicuna-13B
2. LLAMA2-13B
3. Pythia-12B
4. GPT-J-6B

These should all be loadable through nnsight, which pulls the weights from the Hugging Face Hub.

To do:
- Run Logit Lens on these

Patchscope experiment notes:
- 12k random samples from the Pile (10k for training affine mappings in lenses, 2k for evaluation)
- "In our pre-processing strategy, we introduce randomness in the patching positions by trimming the input sequence length of each example."
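The dataset preparation described above (10k + 2k random samples, then random trimming) can be sketched in plain Python. This is a minimal sketch under assumptions: `sample_split` and `random_trim` are hypothetical helpers, and drawing the trimmed length uniformly from `[min_len, len(tokens)]` is only our reading of "trimming the input sequence length", not something the paper confirms. The point is that the last position, where the hidden state is read and patched, lands somewhere different in each example.

```python
import random

def sample_split(examples, n_train=10_000, n_eval=2_000, seed=0):
    """Draw n_train + n_eval distinct random examples and split them into train / eval."""
    rng = random.Random(seed)
    picked = rng.sample(list(examples), n_train + n_eval)  # sampling without replacement
    return picked[:n_train], picked[n_train:]

def random_trim(token_ids, min_len=1, rng=None):
    """Keep a prefix of token_ids whose length is drawn uniformly from [min_len, len(token_ids)].

    Assumption: this is our reading of "trimming the input sequence length", so that the
    last position (where the hidden state is read and patched) varies across examples.
    """
    rng = rng or random.Random()
    if len(token_ids) <= min_len:
        return list(token_ids)
    new_len = rng.randint(min_len, len(token_ids))  # randint is inclusive on both ends
    return list(token_ids[:new_len])

# Usage on toy data: 120 fake "documents", each a list of 30 token ids
docs = [list(range(i, i + 30)) for i in range(120)]
train_docs, eval_docs = sample_split(docs, n_train=100, n_eval=20)
rng = random.Random(0)
train_trimmed = [random_trim(doc, min_len=5, rng=rng) for doc in train_docs]
print(len(train_docs), len(eval_docs), sorted({len(d) for d in train_trimmed})[:5])
```

On real data, `token_ids` would be the tokenizer output for each Pile document; the trimming is done after tokenization so the length is measured in tokens.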

Differences between Logit Lens and Patchscopes:
- The Logit Lens takes the residual stream at the end of each layer and multiplies it by the unembedding matrix (usually after applying the model's final layer norm). For every token in the input, this gives a vector over the vocabulary that represents the model's next-token prediction at that layer.
- So the Logit Lens shows, for every (layer, input token) pair, the predicted token:
![image.png](attachment:7e0f5fc7-cae4-4042-92ae-de4035c56a12.png)
- This lets us see how the predicted tokens change as we move forward through the model.
- The Patchscopes paper says they multiply "the final-layer last-position hidden representation $h^L$ by the unembedding matrix $W_U \in \mathbb{R}^{|V| \times d}$" to get an output distribution $p^L$.
- Crucially, I think, they want to estimate $p^L$ from intermediate representations $h^\ell$ (for $\ell < L$), and only for the last position rather than for every input token. That is, they want to see how early in the model they can recover the final prediction for the last position, which is a vector of shape `(vocab_size,)`.
- So for the Logit Lens, we should do the same.
- The snippets below verify this for me:
![image.png](attachment:4c2bb949-1d57-4e8d-967a-7ee026280e33.png)
![image.png](attachment:a7a44ec3-72bc-4b83-820e-25c72c53b402.png)
- Note that taking the residual stream at the end of a layer and at a given position, then unembedding it, gives logits, not probabilities. We need a softmax to turn them into a distribution:
- ![image.png](attachment:8fd563d2-02ac-419f-99e1-c250828f090d.png)
- Another key difference between the Logit Lens and the Patchscopes Logit Lens (called the Token Identity Patchscope in Section 4.1) is the use of a target prompt $T$ that differs from the source prompt (the source prompts are, I assume, from the Pile). The target prompt has the format below:
- ![image.png](attachment:c2d911c1-bfbb-4c2a-a54f-74b07debe667.png)
- I don't yet see why this is used instead of just keeping the source prompt.
- I also don't understand the line "In our pre-processing strategy, we introduce randomness in the patching positions by trimming the input sequence length of each example." My current suspicion is this: https://aisafetycamp.slack.com/archives/C06BLFTEZNZ/p1709123993502799
- i.e. generate a dataset from the Pile where each entry has a different number of tokens?
- What I do see is that we'll want to test three things:
    - The original Logit Lens implementation
    - The Logit Lens re-created via Patchscopes
    - The Patchscopes solution to the task in question. Here, that is decoding the model's next-token prediction from its intermediate representations (note: NOT doing next-token prediction ourselves). This third option is highlighted in green for each task below.
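The first two options (original Logit Lens vs. Logit Lens via Patchscopes) can be compared on a toy model. This sketch is entirely hypothetical: a small randomly initialised stand-in for a transformer (tanh residual blocks, a parameter-free final layer norm), not one of the models above and not nnsight's API. It assumes the Logit Lens applies the final layer norm before unembedding, as common implementations do, and it frames the Logit Lens as a Patchscope in the way we read the paper: same model, target prompt equal to the source prompt, identity mapping, and the hidden state patched into the last layer at the last position. Under these assumptions the two agree at every layer, and the softmax is what turns the unembedded logits into a distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, n_layers = 50, 16, 4  # toy vocabulary size, hidden size, number of layers

# Hypothetical, randomly initialised weights standing in for a real transformer
W_E = rng.normal(size=(V, d))                                   # embedding matrix
W_layers = [rng.normal(scale=0.3, size=(d, d)) for _ in range(n_layers)]
W_U = rng.normal(size=(V, d))                                   # unembedding matrix, |V| x d

def layer_norm(x, eps=1e-5):
    # Parameter-free layer norm, standing in for the model's final norm
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def run(tokens, patch=None):
    """Toy forward pass returning hidden states after each layer and next-token probabilities.

    patch = (layer, position, vector) overwrites the residual stream at the output of
    `layer` (1-indexed) at `position` before the rest of the model runs.
    """
    h = W_E[tokens]                        # (seq_len, d)
    hidden = [h]                           # hidden[0] is the embedding output
    for ell, W in enumerate(W_layers, start=1):
        h = h + np.tanh(h @ W)             # toy residual block (new array each layer)
        if patch is not None and patch[0] == ell:
            h[patch[1]] = patch[2]
        hidden.append(h)
    logits = layer_norm(h) @ W_U.T         # unembedding gives logits, (seq_len, |V|)
    return hidden, softmax(logits)         # softmax turns logits into probabilities

def logit_lens(hidden, ell, pos=-1):
    # Original Logit Lens: final norm + unembedding applied directly to h^ell at `pos`
    return softmax(layer_norm(hidden[ell][pos]) @ W_U.T)

def logit_lens_patchscope(hidden, ell, tokens, pos=-1):
    # Patchscope framing: same model, target prompt = source prompt, identity mapping,
    # target layer = last layer, patched into the last position of the target prompt
    _, probs = run(tokens, patch=(n_layers, len(tokens) - 1, hidden[ell][pos]))
    return probs[-1]

source = [3, 14, 15, 9, 26]  # toy token ids
hidden, probs = run(source)
for ell in range(n_layers + 1):
    p_lens = logit_lens(hidden, ell)
    p_scope = logit_lens_patchscope(hidden, ell, source)
    print(ell, p_lens.argmax(), p_scope.argmax(), np.allclose(p_lens, p_scope))
```

At `ell = n_layers` both functions reproduce the model's own final prediction for the last position, i.e. $p^L$; earlier layers show how close each intermediate $h^\ell$ gets to it.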


## Load data from the Pile

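from datasets import load_dataset

# Number of samples for training the lens affine mappings and for evaluation
# (10k / 2k, as in the Patchscopes setup)
num_train_samples = 10000
num_val_samples = 2000

# The old 'the_pile' loader no longer works (its download mirror was taken down), so load
# both splits from the same uncopyrighted copy of the Pile.
# streaming=True avoids downloading the entire (very large) train split just to slice it;
# .take(n) keeps the first n examples of each stream (not a random sample, unlike the paper).
train_subset = load_dataset('monology/pile-uncopyrighted', split='train', streaming=True).take(num_train_samples)

val_subset = load_dataset('monology/pile-uncopyrighted', split='validation', streaming=True).take(num_val_samples)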

