# Llama-3-8B-Instruct Recording and Perturbation

## Test of NNsight library capabilities

### Overview
This notebook tests how to read and write the internal activations of a language model at runtime. It first decodes the answer to a question from the `lm_head` output, then disrupts question answering by replacing the input embeddings with random numbers.

### Issues
The model has to be reloaded between one `with model.trace(...):` block and the next. Why is this needed, and how can it be avoided?

In [2]:
# import necessary libraries
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
from nnsight import LanguageModel
from tqdm import tqdm
import torch
import json
# load the configuration file
config_data = json.load(open("config.json"))
HF_TOKEN = config_data["HF_TOKEN"]
# set up model quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, # load the model in 4-bit
    bnb_4bit_use_double_quant=True, # use double (nested) quantization, i.e., also quantize the quantization constants
    bnb_4bit_quant_type="nf4", # use nf4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16 # use bfloat16 for intermediate computations
)
# set model id
model_id='meta-llama/Meta-Llama-3-8B-Instruct'
# load tokenizer for the model
tokenizer = AutoTokenizer.from_pretrained(model_id, token=HF_TOKEN)
# set pad token to eos token
tokenizer.pad_token = tokenizer.eos_token
# load model
model = LanguageModel(
    model_id,
    device_map='cuda:0',
    tokenizer=tokenizer,
    quantization_config=bnb_config,
    token=HF_TOKEN)
# print model summary
print(model)

  from .autonotebook import tqdm as notebook_tqdm
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head)

### Unperturbed question answer decoding
Ask the model a question, take the `lm_head` output, and extract the answer as the argmax over the vocabulary at the last token position.
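The decoding step can be illustrated on a toy tensor. This is a minimal sketch with made-up logits, not real Llama-3 output: the shape `(1, 3, 4)` and the values are assumptions chosen for illustration. It shows that the argmax over the last axis yields one predicted next-token id per input position, and that the answer to the prompt is the id at the last position.

```python
import torch

# Toy logits with shape (batch=1, sequence=3, vocab=4); the values are made up.
logits = torch.tensor([[[0.1, 2.0, 0.3, 0.0],
                        [1.5, 0.2, 0.1, 0.0],
                        [0.0, 0.1, 0.2, 3.0]]])

# argmax over the vocabulary axis: one predicted next-token id per position
token_ids = logits.argmax(dim=-1)  # shape (1, 3)

# softmax is monotonic, so the argmax of the probabilities is the same
probs = logits.softmax(dim=-1)
same_as_probs = torch.equal(probs.argmax(dim=-1), token_ids)

# the prediction that continues the prompt sits at the last position
next_token_id = token_ids[0, -1].item()

print(token_ids.tolist())  # [[1, 0, 3]]
print(next_token_id)       # 3
```

Because the argmax is the same for logits and probabilities, no softmax is needed when only the top token is of interest.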

In [3]:
with model.trace("The Coliseum is in the city of"):
    # get the model prediction
    token_ids = model.lm_head.output.argmax(dim=-1).save()
# print the token ids
print("Token IDs:", token_ids)
# apply the tokenizer to decode the ids into words after the tracing context.
print("Prediction:", model.tokenizer.decode(token_ids[0][-1]))

Loading checkpoint shards: 100%|██████████| 4/4 [00:13<00:00,  3.35s/it]
You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
  attn_output = torch.nn.functional.scaled_dot_product_attention(


Token IDs: tensor([[14924,   220,  2258,    11,   264,   279,  4851,   315, 22463]],
       device='cuda:0')
Prediction:  Rome


### Perturbed question answer decoding
Ask the model the same question and decode the answer the same way, but this time perturb the input by replacing the `embed_tokens` output with random numbers at runtime.
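The same kind of intervention can be sketched in plain PyTorch with a forward hook. This is an illustration under stated assumptions, not how NNsight implements tracing: `toy_model`, its sizes, and the `randomize_output` hook are hypothetical names invented for this example. Using `torch.rand_like` derives the shape, dtype, and device from the original activation instead of hard-coding them.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy sizes, far smaller than Llama-3's (128256, 4096).
vocab_size, hidden_size = 50, 8
embed = nn.Embedding(vocab_size, hidden_size)
head = nn.Linear(hidden_size, vocab_size, bias=False)
toy_model = nn.Sequential(embed, head)

input_ids = torch.randint(0, vocab_size, (1, 9))  # (batch, sequence)

def randomize_output(module, inputs, output):
    # Returning a tensor from a forward hook replaces the module's output.
    # rand_like keeps the shape, dtype and device of the original activation.
    return torch.rand_like(output)

with torch.no_grad():
    clean_logits = toy_model(input_ids)

    handle = embed.register_forward_hook(randomize_output)
    perturbed_logits = toy_model(input_ids)
    handle.remove()  # remove the hook so later calls run unperturbed

    restored_logits = toy_model(input_ids)

print(tuple(clean_logits.shape))
print(torch.equal(clean_logits, restored_logits))
print(torch.equal(clean_logits, perturbed_logits))
```

Removing the hook restores the original behaviour without rebuilding the module. Whether that observation carries over to the reload issue noted above is not something this sketch establishes.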

In [4]:
# the model has to be reinitialized before intervening again (open issue, see "Issues" above)
model = LanguageModel(
    model_id,
    device_map='cuda:0',
    tokenizer=tokenizer,
    quantization_config=bnb_config,
    token=HF_TOKEN)
# intervene by ablating the embed_tokens output
with model.trace("The Coliseum is in the city of"):
    # random tensor that replaces the embed_tokens output
    size = (1, 9, 4096)  # (batch, tokens, hidden size); the prompt is 9 tokens long
    device = 'cuda:0'  # same device as the model (see device_map above)
    rand_tensor = torch.rand(size, device=device, dtype=torch.float16)
    # replace the embed_tokens output with the random tensor at runtime
    model.model.embed_tokens.output = rand_tensor
    # get the model prediction
    token_ids = model.lm_head.output.argmax(dim=-1).save()
# print the token ids
print("Token IDs:", token_ids)
# apply the tokenizer to decode the ids into words after the tracing context.
print("Prediction:", model.tokenizer.decode(token_ids[0][-1]))

Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00,  2.12s/it]


Token IDs: tensor([[ 12167,  73721,  73721,   3853, 106818,  91876,  18745,  18745,  68412]],
       device='cuda:0')
Prediction: hay


In [5]:
# report PyTorch, CUDA and cuDNN versions and CUDA availability (run after inference)
print("PyTorch version:", torch.__version__)
print("CUDA version:", torch.version.cuda)
print("cuDNN version:", torch.backends.cudnn.version())
print("CUDA available:", torch.cuda.is_available())
print("Torch cuda version:", torch.version.cuda)

if torch.cuda.is_available():
    # report GPU memory usage (run after inference)
    print("Total CUDA memory: {} GB".format(torch.cuda.get_device_properties(0).total_memory / 1e9))
    print("CUDA memory allocated: {} GB".format(torch.cuda.memory_allocated(0) / 1e9))
    print("CUDA memory reserved: {} GB".format(torch.cuda.memory_reserved(0) / 1e9))
else:
    # if no GPU is detected, print a warning
    print("CUDA is not available. No GPU detected.")

PyTorch version: 2.3.0+cu118
CUDA version: 11.8
cuDNN version: 8700
CUDA available: True
Torch cuda version: 11.8
Total CUDA memory: 8.585281536 GB
CUDA memory allocated: 5.846328832 GB
CUDA memory reserved: 6.247415808 GB
