# `nnsight 0.4`: walkthrough
**We have many exciting new features in this update, including:**

*   Descriptive error messages
*   `.all()` for multiple token generation
*   vLLM Integration
*   Streaming remote execution to local machine
*   Support for traditional Python syntax for `if` statements and `for` loops on proxies within tracing contexts
*   Ability to rename model modules

...and more!

The following walkthrough guides you through how to access `nnsight 0.4` and use all of its individual features. 






**Breaking Changes**
* The InterventionGraph now follows a <u>sequential execution order</u>. Module envoys are expected to be referenced following the model’s architecture hierarchy. This means that out-of-order in-place operations will not take effect.

* Saved node values are automatically injected into their proxy reference in the Python frames post graph execution. If your code calls `.value` after tracing, update it: the variable now holds the value itself, so calling `.value` on it will fail (for a tensor, with an `AttributeError`).

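To make the first breaking change concrete, here is a toy model of sequential execution. It is only a sketch of the ordering rule, not nnsight's actual implementation: `run_sequential`, the module names, and the queue-based matching are all hypothetical.

```python
def run_sequential(module_order, interventions):
    """Toy forward pass: modules run in order, interventions are consumed in order.

    `interventions` is a list of (module_name, label) pairs in the order they
    were written. Only the intervention at the head of the queue can fire.
    Returns the labels of the interventions that actually fired.
    """
    fired = []
    head = 0  # index of the next pending intervention
    for module in module_order:
        # Fire every pending intervention that targets the module running now.
        while head < len(interventions) and interventions[head][0] == module:
            fired.append(interventions[head][1])
            head += 1
    return fired


modules = ["h.0", "h.1", "lm_head"]

# Written in architecture order: both interventions fire.
in_order = run_sequential(modules, [("h.0", "zero h.0"), ("h.1", "zero h.1")])

# Written out of order: by the time "zero h.0" reaches the head of the queue,
# h.0 has already run, so that intervention never fires.
out_of_order = run_sequential(modules, [("h.1", "zero h.1"), ("h.0", "zero h.0")])

print(in_order)      # ['zero h.0', 'zero h.1']
print(out_of_order)  # ['zero h.1']
```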
# Access `nnsight` 0.4

In [1]:
from IPython.display import clear_output
!pip install nnsight
clear_output()

Import `nnsight` and load the GPT-2 model.

In [2]:
# import packages
import nnsight
from nnsight import LanguageModel

In [3]:
model = LanguageModel('openai-community/gpt2', device_map='auto')
print(model)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
  (generator): Generator(
    (streamer): Streamer()
  )
)


# Improved Error Messaging

If you've been using `nnsight`, you are probably familiar with the following type of error message:

```
IndexError: Above exception when execution Node: 'setitem_0' in Graph: '6063279136'
```
It can be quite difficult to troubleshoot with these errors, so in `nnsight 0.4` we've now improved error messaging to be descriptive and line-specific! Let's check it out:

In [4]:
prompt = 'The Eiffel Tower is in the city of'

with model.trace(prompt) as tracer:

    # try to access a layer of model that doesn't exist
    model.transformer.h[12].output[0][:] = 0
    output = model.lm_head.output.save()

print("lm_head output = ",output)

IndexError: list index out of range

Great! Now we know that our list index was out of range within the tracing context, and if we expand to see the full message, we can tell that it's happening in line 7.

Let's try again, now using the correct index for the final layer:

In [5]:
prompt = 'The Eiffel Tower is in the city of'

with model.trace(prompt) as tracer:

    # ablate last layer output
    model.transformer.h[11].output[0][:] = 0
    output = model.lm_head.output.save()

print("lm_head output = ",output)

lm_head output =  tensor([[[ -6.3267,  -6.1134,  -8.2121,  ..., -11.1459, -10.8880,  -6.1064],
         [ -6.3267,  -6.1134,  -8.2121,  ..., -11.1459, -10.8880,  -6.1064],
         [ -6.3267,  -6.1134,  -8.2121,  ..., -11.1459, -10.8880,  -6.1064],
         ...,
         [ -6.3267,  -6.1134,  -8.2121,  ..., -11.1459, -10.8880,  -6.1064],
         [ -6.3267,  -6.1134,  -8.2121,  ..., -11.1459, -10.8880,  -6.1064],
         [ -6.3267,  -6.1134,  -8.2121,  ..., -11.1459, -10.8880,  -6.1064]]],
       device='cuda:0', grad_fn=<UnsafeViewBackward0>)


The error messaging feature can be toggled using `nnsight.CONFIG.APP.DEBUG`, which defaults to `True`.

In [None]:
# Turn off debugging:
import nnsight

nnsight.CONFIG.APP.DEBUG = False
nnsight.CONFIG.save()

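If you only want to change a setting for part of a script, a small context manager can restore the previous value afterwards. The sketch below is generic Python, not an nnsight feature: `temporarily_set` is a hypothetical helper, and a `SimpleNamespace` stands in for the settings object so the example runs without nnsight installed.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporarily_set(obj, attr, value):
    """Set obj.attr to value inside the `with` block, then restore the old value."""
    old = getattr(obj, attr)
    setattr(obj, attr, value)
    try:
        yield obj
    finally:
        setattr(obj, attr, old)


# Stand-in for a settings object. With nnsight installed you would pass
# nnsight.CONFIG.APP instead (assumption: its fields are plain attributes,
# as the assignment in the cell above suggests).
app_settings = SimpleNamespace(DEBUG=True)

with temporarily_set(app_settings, "DEBUG", False):
    inside = app_settings.DEBUG   # False while the block is active

after = app_settings.DEBUG        # restored to True afterwards
print(inside, after)
```

Note that this helper never calls `nnsight.CONFIG.save()`; it only changes the attribute for the duration of the block.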
# .all()

Sometimes you may want to recursively apply interventions to a model (e.g., when generating many tokens or for models like RNNs, where modules are called multiple times).



*   Calling `.all()` on a model or its submodules will recursively apply its `.input` and `.output` across all iterations.
*   When generating multiple tokens with `.generate` (see: [Multiple Token Generation](https://nnsight.net/notebooks/features/multiple_token/)), using `.all()` before applying an intervention will ensure that the model undergoes the intervention for *all* new tokens generated, not just the first.





## .all() now streamlines multiple token generation

With `.all()`, applying interventions during multiple token generation becomes much easier. Let's test this out!

We simply call `.all()` on the module where we are applying the intervention (in this case GPT-2's layers), apply our intervention, and append our hidden states (stored in an `nnsight.list()` object).

In [6]:
# New: using .all():
prompt = 'The Eiffel Tower is in the city of'
layers = model.transformer.h
n_new_tokens = 3
with model.generate(prompt, max_new_tokens=n_new_tokens) as tracer:
    hidden_states = nnsight.list().save() # Initialize & .save() nnsight list

    # Call .all() to apply intervention to each new token
    layers.all()

    # Apply intervention - set first layer output to zero
    layers[0].output[0][:] = 0

    # Append desired hidden state post-intervention
    hidden_states.append(layers[-1].output) # no need to call .save
    # Don't need to loop or call .next()!

print("Hidden state length: ",len(hidden_states))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Hidden state length:  3


Easy! Note that because `.all()` is recursive, it only covers the module it is called on and that module's children, so any output you append must come from within that subtree. See the example below for more information. TL;DR: if your interventions and saved outputs sit at different levels of the model hierarchy, call `.all()` on the highest-level module that contains them all.

### Note: (Old method) Applying interventions during multiple token generation without .all()

Without `.all()`, we would need to loop across each new generated token, saving the intervention for every generated token and calling `.next()` to move forward.

In [7]:
# Old approach:
prompt = 'The Eiffel Tower is in the city of'
layers = model.transformer.h
n_new_tokens = 3
hidden_states = []
with model.generate(prompt, max_new_tokens=n_new_tokens) as tracer:
    for i in range(n_new_tokens):
        # Apply intervention - set first layer output to zero
        layers[0].output[0][:] = 0

        # Append desired hidden state post-intervention
        hidden_states.append(layers[-1].output.save())

        # Move to next generated token
        layers[0].next()

print("Hidden state length: ",len(hidden_states))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Hidden state length:  3


### Note: .all() recursive properties

As mentioned, `.all()` is recursive: it applies to the module it is called on and to that module's children. In this example, calling `.all()` on the model's layer modules does not affect `model.lm_head.output`, because `lm_head` is not a child of `layers`.

In [8]:
# A note on .all() recursive properties:
prompt = 'The Eiffel Tower is in the city of'
layers = model.transformer.h
n_new_tokens = 3
with model.generate(prompt, max_new_tokens=n_new_tokens) as tracer:
    hidden_states = nnsight.list().save() # Initialize & .save() nnsight list

    # Call .all() on layers
    layers.all()

    # Apply same intervention - set first layer output to zero
    layers[0].output[0][:] = 0

    # Append desired hidden state post-intervention
    hidden_states.append(model.lm_head.output) # no need to call .save, it's already initialized

print("Hidden state length: ",len(hidden_states)) # length is 1, meaning it only saved the first token generation

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Hidden state length:  1


So, if you want to apply an intervention during multiple token generation while saving the state of a model component that isn't a child of that module, you can apply `.all()` to the full model.

In [9]:
# Applying .all() to model fixes issue
prompt = 'The Eiffel Tower is in the city of'
layers = model.transformer.h
n_new_tokens = 3
with model.generate(prompt, max_new_tokens=n_new_tokens) as tracer:
    hidden_states = nnsight.list().save() # Initialize & .save() nnsight list

    # Call .all() on model
    model.all()

    # Apply same intervention - set first layer output to zero
    layers[0].output[0][:] = 0

    # Append desired hidden state post-intervention
    hidden_states.append(model.lm_head.output) # no need to call .save

print("Hidden state length: ",len(hidden_states)) # length is 3!

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Hidden state length:  3

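The scoping rule from the last two examples can be written down as a small check on dotted module paths. This is a sketch of the rule as described in this section, not nnsight code: `covered_by_all` is a hypothetical helper, and the empty string is used here to stand for the root model.

```python
def covered_by_all(target_path, all_path):
    """Return True if `target_path` is `all_path` itself or one of its descendants.

    Paths are dotted module names such as "transformer.h.11"; the empty string
    stands for the root model.
    """
    if all_path == "":
        return True  # .all() on the root covers every module
    return target_path == all_path or target_path.startswith(all_path + ".")


# layers.all() in the examples above corresponds to the path "transformer.h".
print(covered_by_all("transformer.h.11", "transformer.h"))  # True
print(covered_by_all("lm_head", "transformer.h"))           # False
print(covered_by_all("lm_head", ""))                        # True
```

This matches the observed list lengths: `lm_head` is outside `transformer.h`, so only one output was appended there, while calling `.all()` on the full model captured all three.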

## Known Issues: `.all()`

* `IteratorEnvoy` contexts can produce undesired behavior for subsequent operations defined <u>below</u> them that do not depend on `InterventionProxy` objects.

Example:
```
with lm.generate("Hello World!", max_new_tokens=10):
    hs_4 = nnsight.list().save()

    with lm.transformer.h[4].all():
        hs_4.append(lm.transformer.h[4].output)

    hs_4.append(433)

print(len(hs_4))
```
`>>> 20 # expected: 11`

# Syntax updates

With `nnsight 0.4`, we now support `if` statements and `for` loops applied to proxies with traditional Python syntax! We also remove the need to call `.value` on a proxy output.

## If Statements

Previously, we would need to use `.cond()` to create a conditional context that would only execute upon meeting the logical conditions inside the `.cond()`.

In [10]:
import torch
# Old method
# model = LanguageModel('openai-community/gpt2', device_map='auto')

with model.trace("The Eiffel Tower is in the city of") as tracer:

  rand_int = torch.randint(low=-10, high=10, size=(1,))

  # To create the conditional context you need to put the
  # condition within tracer.cond()
  with tracer.cond(rand_int % 2 == 0):
    tracer.log("Random Integer ", rand_int, " is Even")

  with tracer.cond(rand_int % 2 == 1):
    tracer.log("Random Integer ", rand_int, " is Odd")

Random Integer  tensor([5])  is Odd


Now, we can use Python `if` statements within the tracing context to create a conditional context!

*Note: Colab may behave a little strangely with this feature the first time you run it - expect some lagging and warnings.*


In [12]:
with model.trace("The Eiffel Tower is in the city of") as tracer:

  rand_int = torch.randint(low=-10, high=10, size=(1,))

  # Since this if statement is inside the tracing context the if will
  # create a conditional context and will only execute the intervention
  # if this condition is met
  if rand_int % 2 == 0:
    tracer.log("Random Integer ", rand_int, " is Even")

  if rand_int % 2 == 1:
    tracer.log("Random Integer ", rand_int, " is Odd")

Random Integer  tensor([3])  is Odd


Note: If the conditional statements are outside the tracing context, `if` operates as in base Python.


`elif` statements work the same way as `if` statements within the tracing context:

In [None]:
with model.trace("The Eiffel Tower is in the city of") as tracer:

  rand_int = torch.randint(low=-10, high=10, size=(1,))

  # Since this if statement is inside the tracing context the if will
  # create a conditional context and will only execute the intervention
  # if this condition is met
  if rand_int % 2 == 0:
    tracer.log("Random Integer ", rand_int, " is Even")
  elif rand_int % 2 == 1:
    tracer.log("Random Integer ", rand_int, " is Odd")

Random Integer  tensor([-9])  is Odd

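The random integers in these cells can be negative, so it is worth checking that the `% 2 == 1` test still identifies odd numbers. In Python, the result of `%` takes the sign of the divisor, so `-9 % 2` is `1`, and PyTorch's `%` on tensors follows the same convention. The plain-Python sketch below mirrors the `if`/`elif` checks outside any tracing context; `parity` is a hypothetical helper used only for illustration.

```python
def parity(n):
    """Classify an integer using the same `% 2` checks as the cells above."""
    if n % 2 == 0:
        return "Even"
    elif n % 2 == 1:
        return "Odd"
    # Not reachable for integers: n % 2 is always 0 or 1 in Python.
    raise ValueError(f"unexpected remainder for {n!r}")


for n in (-9, -10, 3, 0):
    print(n, parity(n))
```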

## For Loops

With `nnsight 0.4`, you can now use `for` loops within a tracer context at scale. Previously, a `for` loop with a tracer context inside it created a new intervention graph on every iteration - this is not scalable!

The `session.iter` context allows for scalable looping within sessions, but doesn't utilize traditional Python syntax:

In [13]:
# Old Method
with model.session() as session:

  li = nnsight.list() # an NNsight built-in list object
  [li.append([num]) for num in range(0, 3)] # adding [0], [1], [2] to the list
  li2 = nnsight.list().save()

  # You can create nested Iterator contexts
  with session.iter(li) as item:
    with session.iter(item) as item_2:
      li2.append(item_2)

print("\nList: ", li2)


List:  [0, 1, 2]


Now, you can use simple `for` loops within a tracer context to run an intervention loop at scale.

*NOTE: inline for loops (i.e., `[x for x in <Proxy object>]`) are not currently supported.*

In [14]:
# New: Using Python for loops for iterative interventions
with model.session() as session:

    li = nnsight.list()
    [li.append([num]) for num in range(0, 3)]
    li2 = nnsight.list().save()

    # Using regular for loops
    for item in li:
        for item_2 in item: # for loops can be nested!
            li2.append(item_2)

print("\nList: ", li2)


List:  [0, 1, 2]

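For reference, here is the same nested loop in plain Python, outside any session. It uses ordinary lists rather than `nnsight.list()` objects, so it only shows the result the traced version should reproduce.

```python
# Build [[0], [1], [2]] with ordinary Python lists.
li = [[num] for num in range(0, 3)]
li2 = []

# Same nested iteration as the session example above.
for item in li:
    for item_2 in item:
        li2.append(item_2)

print("List: ", li2)  # List:  [0, 1, 2]
```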

## `.value` injected into saved results

Previously, directly using non-traceable functions (e.g., tokenizers) on a proxy returned from a tracing context required calling `.value` to access the proxy's numerical value. Calling traceable functions (like `print()` or `.argmax()`) on such proxies automatically returned the `.value`, making it optional to call `.value` in certain cases.

```
input = "The Eiffel Tower is in the city of"
with model.trace(input):

    l2_input = model.transformer.h[2].input.save()

print(l2_input.value) # could optionally call .value
print(l2_input) # but not required for traceable functions
```

Now with `nnsight 0.4`, the proxy's value is automatically injected into the variable name, removing any need to call `.value` on proxies. Proxy variables are automatically populated with their value upon exiting the tracing context. This is a breaking change: calling `.value` on a saved result will now throw an error.

In [15]:
input = "The Eiffel Tower is in the city of"
with model.trace(input):

    l2_input = model.transformer.h[2].input.save()

print(l2_input) # no need to call .value
print(l2_input.value) # will throw an error

tensor([[[ 0.0386, -1.1676,  1.1246,  ..., -1.4047, -0.5742, -0.0668],
         [ 0.1477, -0.4208,  1.3827,  ..., -1.6436, -0.1738, -1.1567],
         [-0.1181, -0.5914, -0.9923,  ..., -0.8742, -0.1361,  1.0608],
         ...,
         [-1.3966,  0.8859,  0.1767,  ...,  0.0661,  0.6106, -0.6092],
         [-2.9199,  0.3945, -3.4569,  ...,  1.2592,  0.2188, -0.5957],
         [ 1.8395,  0.9940,  0.6617,  ..., -0.2802,  0.4978,  0.2308]]],
       device='cuda:0', grad_fn=<AddBackward0>)


AttributeError: 'Tensor' object has no attribute 'value'

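If you maintain code that has to run both before and after this change, one option is a small compatibility helper that reads `.value` only when it exists. This is a sketch, not part of nnsight: `unwrap` and `OldStyleProxy` are hypothetical names, and the stand-in class only imitates an object that exposes `.value`.

```python
def unwrap(obj):
    """Return obj.value if the attribute exists, otherwise obj itself."""
    return getattr(obj, "value", obj)


class OldStyleProxy:
    """Hypothetical stand-in for a pre-0.4 saved proxy that exposes `.value`."""

    def __init__(self, value):
        self.value = value


print(unwrap(OldStyleProxy([1, 2, 3])))  # [1, 2, 3]
print(unwrap([1, 2, 3]))                 # [1, 2, 3]
```

The helper is deliberately naive: any object that legitimately has a `.value` attribute (an `Enum` member, for example) would also be unwrapped, so use it only on results saved from a tracing context.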
## Turning off syntactic changes

If you would like to turn off either the `if`/`for` functionality or the `.value` syntactic changes, you can apply the following changes to `nnsight.CONFIG`:

In [None]:
# Turn off if/for statements within tracing context:
import nnsight

nnsight.CONFIG.APP.CONTROL_FLOW_HANDLING = False
nnsight.CONFIG.save()

In [None]:
# Turn off .value injection:
import nnsight

nnsight.CONFIG.APP.FRAME_INJECTION = False
nnsight.CONFIG.save()

## Known Issues: Syntax Update
* Colab behaves a little strangely with these features the first time you run it - expect some lagging and warnings.

* Inline control flow (e.g., a `for` loop inside a list comprehension) is not supported.

Example:
```
with lm.trace("Hello World!"):
    foo = nnsight.list([0, 1, 2, 3]).save()
    [nnsight.log(item) for item in foo]

```
`>>> Error`

* Value Injection is not supported for proxies referenced within objects.




# vLLM Integration

Our new update includes support for vLLM models using `nnsight`. [vLLM](https://github.com/vllm-project/vllm) is a popular library used for fast inference. By leveraging PagedAttention, dynamic batching, and Hugging Face model integration, vLLM makes inference more efficient and scalable for real-world applications.

## Setup

You will need to install `nnsight 0.4`, `vllm`, and `triton 3.1.0` to use vLLM with NNsight.

In [16]:
from IPython.display import clear_output
# install vllm
!pip install vllm==0.6.4.post1

# install triton 3.1.0
!pip install triton==3.1.0

clear_output()

**NOTE: you may need to restart your Colab session before the following step to properly load the `VLLM` model wrapper.**

 Next, let's load in our NNsight-supported vLLM model. You can find vLLM-supported models [here](https://docs.vllm.ai/en/latest/models/supported_models.html). For this exercise, we will use GPT-2.

In [None]:
from nnsight.modeling.vllm import VLLM

# NNsight's VLLM wrapper currently supports device="cuda" and device="auto"
vllm = VLLM("gpt2", device = "auto", dispatch = True) # See supported models: https://docs.vllm.ai/en/latest/models/supported_models.html
print(vllm)

## Interventions on vLLM models
We now have a vLLM model that runs with `nnsight`. Let's try applying some interventions on it.

Note that vLLM takes in sampling parameters including `temperature` and `top_p`. These parameters can be included in the `.trace()` or `.invoke()` contexts. For default model behavior, set `temperature = 0` and `top_p = 1`. For more information about parameters, reference the [vLLM documentation](https://docs.vllm.ai/en/latest/dev/sampling_params.html).

In [3]:
with vllm.trace(temperature=0.0, top_p=1.0, max_tokens=1) as tracer:
  with tracer.invoke("The Eiffel Tower is located in the city of"):
    clean_logits = vllm.logits.output.save()

  with tracer.invoke("The Eiffel Tower is located in the city of"):
    vllm.transformer.h[-2].mlp.output[:] = 0
    corrupted_logits = vllm.logits.output.save()

Processed prompts: 100%|██████████| 2/2 [00:00<00:00, 38.35it/s, est. speed input: 423.14 toks/s, output: 38.46 toks/s]


In [4]:
print("CLEAN - The Eiffel Tower is located in the city of", vllm.tokenizer.decode(clean_logits.argmax(dim=-1)))
print("CORRUPTED - The Eiffel Tower is located in the city of", vllm.tokenizer.decode(corrupted_logits.argmax(dim=-1)))

CLEAN - The Eiffel Tower is located in the city of  Paris
CORRUPTED - The Eiffel Tower is located in the city of  London


We've successfully performed an intervention on our vLLM model!

## Sampled Token Traceability
vLLM provides functionality to configure how each sequence samples its next token. Here's an example of how you can trace token sampling operations with the nnsight VLLM wrapper.

In [5]:
import nnsight
with vllm.trace("Madison Square Garden is located in the city of", temperature=0.8, top_p=0.95, max_tokens=3) as tracer:
    samples = nnsight.list().save()
    logits = nnsight.list().save()

    for ii in range(3):
        samples.append(vllm.samples.output)
        vllm.samples.next()
        logits.append(vllm.logits.output)
        vllm.logits.next()

print("Samples: ", samples)
print("Logits: ", logits) # different than samples with current sampling parameters

Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  5.84it/s, est. speed input: 53.99 toks/s, output: 17.99 toks/s]


Samples:  [tensor([10346]), tensor([13]), tensor([6363])]
Logits:  [tensor([[-109.0625, -107.9375, -111.6875,  ..., -115.3750, -116.5625,
         -108.8750]], device='cuda:0', dtype=torch.float16), tensor([[-80.3125, -82.2500, -85.3750,  ..., -93.3750, -90.7500, -83.4375]],
       device='cuda:0', dtype=torch.float16), tensor([[-110.0000, -109.3125, -110.4375,  ..., -120.9375, -119.3750,
         -101.3750]], device='cuda:0', dtype=torch.float16)]

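The sampled tokens can differ from the argmax of the logits because `temperature` and `top_p` turn token selection into a random draw. The sketch below shows the general idea of temperature scaling followed by nucleus (top-p) filtering in plain Python. It is an illustration under simplifying assumptions, not vLLM's sampler: `sample_top_p` and the toy logits are made up for this example.

```python
import math
import random


def sample_top_p(logits, temperature, top_p, rng):
    """Sample a token id from `logits` with temperature scaling and nucleus filtering."""
    if temperature == 0:
        # Greedy decoding: always pick the highest logit.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Keep the smallest set of top tokens whose cumulative probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Sample among the kept tokens in proportion to their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up values for a 4-token vocabulary
rng = random.Random(0)

greedy = sample_top_p(toy_logits, temperature=0.0, top_p=1.0, rng=rng)
draws = {sample_top_p(toy_logits, temperature=0.8, top_p=0.95, rng=rng) for _ in range(200)}

print(greedy)         # 0, the index of the largest logit
print(sorted(draws))  # token ids that were sampled at least once
```

With `temperature=0.8` and `top_p=0.95`, the least likely token in this toy vocabulary falls outside the nucleus and is never drawn, while the other three can all appear.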

## Note: gradients are not supported with vLLM
vLLM speeds up inference through its paged attention mechanism. As a result, gradients and backward passes are not supported for vLLM models, and calling gradient operations when using `nnsight` vLLM wrappers will throw an error.

## Known Issues: vLLM Integration
* The `vllm.LLM` engine performs `max_tokens + 1` forward passes, which can lead to undesired behavior if you are running interventions on all iterations of multi-token generation.

Example:
```
with vllm_gpt2("Hello World!", max_tokens=10):
    logits = nnsight.list().save()
    with vllm_gpt2.logits.all():
        logits.append(vllm_gpt2.logits.output)

print(len(logits))

```
`>>> 11 # expected: 10`

# Streaming

Streaming enables users to apply functions and datasets locally during remote model execution. This allows users to stream results for immediate consumption (e.g., seeing tokens as they are generated) or to apply non-whitelisted functions and local resources, such as model tokenizers and large local datasets.

*   `nnsight.local()` context sends values immediately to user's local machine from server
*   Intervention graph is executed locally on downstream nodes
*   Exiting local context uploads data back to server
*   `@nnsight.trace` function decorator enables custom functions to be added to intervention graph when using `nnsight.local()`


## `nnsight.local()`

You may sometimes want to locally access and manipulate values during remote execution. Using `.local()` on a proxy, you can send remote content to your local machine and apply local functions. The intervention graph is then executed locally on downstream nodes until you exit the local context.




There are a few use cases for streaming with `.local()`, including live chat generation and applying large datasets or non-whitelisted local functions to the intervention graph.



Now let's explore how streaming works. We'll start by grabbing some hidden states of the model and printing their value using `tracer.log()`. Without calling `nnsight.local()`, these operations will all occur remotely.

In [None]:
from nnsight import LanguageModel

llama = LanguageModel("meta-llama/Meta-Llama-3.1-8B")

In [None]:
# This will give you a remote LOG response because it's coming from the remote server
with llama.trace("hello", remote=True) as tracer:

    hs = llama.model.layers[-1].output[0]

    tracer.log(hs[0,0,0])

    out =  llama.lm_head.output.save()

print(out)

Now, let's try the same operation using the `nnsight.local()` context. This will send the operations to get and print the hidden states to your local machine, changing how the logging message is formatted (local formatting instead of remote).

In [None]:
# This will print locally because it's already local
with llama.trace("hello", remote=True) as tracer:

    with nnsight.local():
        hs = llama.model.layers[-1].output[0]
        tracer.log(hs[0,0,0])

    out =  llama.lm_head.output.save()

print(out)

## `@nnsight.trace` function decorator

We can also use function decorators to create custom functions to be used during `.local` calls. This is a handy way to enable live streaming of a chat or to train probing classifiers on model hidden states.

Let's try out `@nnsight.trace` and `nnsight.local()` to access a custom function during remote execution.

In [None]:
# first, let's define our function
@nnsight.trace # decorator that enables this function to be added to the intervention graph
def my_local_fn(value):
    return value * 0

# We use a local function to ablate some hidden states
# This downloads the data for the .local context, and then uploads it back to set the value.
with llama.generate("hello", remote=True) as tracer:

    hs = llama.model.layers[-1].output[0]

    with nnsight.local():

        hs = my_local_fn(hs)

    llama.model.layers[-1].output[0][:] = hs

    out =  llama.lm_head.output.save()

Note that without calling `.local`, the remote API does not know about `my_local_fn` and will throw a whitelist error. A whitelist error occurs because you are not being allowed access to the function.

In [None]:
with llama.trace("hello", remote=True) as tracer:

    hs = llama.model.layers[-1].output[0]

    hs = my_local_fn(hs) # no .local - will cause an error

    llama.model.layers[-1].output[0][:] = hs * 2

    out =  llama.lm_head.output.save()

print(out)

## Example: Live-streaming remote chat



Now that we can access data within the tracing context on our local computer, we can apply non-whitelisted functions, such as the model's tokenizer, within our tracing context.

Let's build a decoding function that will decode tokens into words and print the result.

In [None]:
@nnsight.trace
def my_decoding_function(tokens, model, max_length=80, state=None):
    # Initialize state if not provided
    if state is None:
        state = {'current_line': '', 'current_line_length': 0}

    token = tokens[-1] # only use last token

    # Decode the token
    decoded_token = model.tokenizer.decode(token).encode("unicode_escape").decode()

    if decoded_token == '\\n':  # Handle explicit newline tokens
        # Print the current line and reset state
        print('',flush=True)
        state['current_line'] = ''
        state['current_line_length'] = 0
    else:
        # Check if adding the token would exceed the max length
        if state['current_line_length'] + len(decoded_token) > max_length:
            print('',flush=True)
            state['current_line'] = decoded_token  # Start a new line with the current token
            state['current_line_length'] = len(decoded_token)
            print(state['current_line'], flush=True, end="")  # Print the current line
        else:
            # Append the token to the current line
            state['current_line'] += decoded_token
            state['current_line_length'] += len(decoded_token)
            print(state['current_line'], flush=True, end="")  # Print the current line

    return state

Now we can decode and print our model outputs throughout token generation by accessing our decoding function through `nnsight.local()`.

In [None]:
import torch

nnsight.CONFIG.APP.REMOTE_LOGGING = False

prompt = "A press release is an official statement delivered to members of the news media for the purpose of"
# prompt = "Your favorite board game is"

print("Prompt: ",prompt,'\n', end ="")

# Initialize the state for decoding
state = {'current_line': '', 'current_line_length': 0}

with llama.generate(prompt, remote=True, max_new_tokens = 50) as generator:
    # Call .all() to apply to each new token
    llama.all()

    all_tokens = nnsight.list().save()

    # Access model output
    out = llama.lm_head.output.save()

    # Apply softmax to obtain probabilities and save the result
    probs = torch.nn.functional.softmax(out, dim=-1)
    max_probs = torch.max(probs, dim=-1)
    tokens = max_probs.indices.cpu().tolist()
    all_tokens.append(tokens[0]).save()

    with nnsight.local():
        state = my_decoding_function(tokens[0], llama, max_length=20, state=state)

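To experiment with the line-wrapping logic without a remote model, here is a standalone variant that can be tested locally. It is a sketch rather than a drop-in replacement for `my_decoding_function`: `stream_token` is a hypothetical helper, it takes already-decoded text instead of token ids, it writes only the new token instead of the accumulated line, and it compares against a real newline character rather than the escaped `'\\n'` string.

```python
import io


def stream_token(text, state, out, max_length=80):
    """Write one decoded token to `out`, wrapping when a line would exceed max_length.

    `state` is a dict holding 'current_line_length'; it is updated in place
    and returned so it can be threaded through successive calls.
    """
    if text == "\n":
        # Explicit newline token: end the line and reset the counter.
        out.write("\n")
        state["current_line_length"] = 0
        return state

    if state["current_line_length"] + len(text) > max_length:
        # Token would overflow the line: wrap first.
        out.write("\n")
        state["current_line_length"] = 0

    # Emit only the new token, not the whole accumulated line.
    out.write(text)
    state["current_line_length"] += len(text)
    return state


buffer = io.StringIO()
state = {"current_line_length": 0}
for piece in ["A", " press", " release", "\n", " is", " official"]:
    state = stream_token(piece, state, buffer, max_length=10)

print(buffer.getvalue())
```

Writing to an `io.StringIO` buffer makes the behavior easy to check; passing `sys.stdout` as `out` would print to the console instead.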
# General Considerations
* `Tracer.cond(…)` and `Tracer.iter(…)` are still supported.

* vLLM <u>does not</u> come as a pre-installed dependency of `nnsight`.

* `nnsight` supports `vllm==0.6.4.post1`

* vLLM support only includes `cuda` and `auto` devices at the moment.

* vLLM models <u>do not</u> support gradients.

* The `@nnsight.trace` decorator does not enable user-defined operations to be executed remotely. Something coming soon for that...