# Working With Gradients

There are a couple of ways we can interact with the gradients during and after a backward pass.

In the following example, we save the hidden states of the last layer and do a backward pass on the sum of the logits.

Note two things:

1. We use `inference=False` in the `.forward` call to turn off inference mode. This allows gradients to be calculated.
2. We can call `.backward()` on a value within the tracing context, just as we normally would.

In [3]:
from nnsight import LanguageModel

model = LanguageModel('gpt2', device_map='cuda')

with model.forward(inference=False) as runner:
    with runner.invoke('Hello World') as invoker:

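        # Save the output hidden states of the last transformer block.
        hidden_states = model.transformer.h[-1].output[0].save()

        logits = model.lm_head.output

        # Backpropagate from the sum of the logits.
        logits.sum().backward()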

print(hidden_states.value)

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


tensor([[[ 0.5216, -1.1755, -0.4617,  ..., -1.1919,  0.0204, -2.0075],
         [ 0.9841,  2.2175,  3.5851,  ...,  0.5212, -2.2286,  5.7334]]],
       device='cuda:0', grad_fn=<SliceBackward0>)
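
To make the first note concrete, here is a minimal sketch in plain PyTorch, without `nnsight`. It assumes that `inference=False` amounts to not running the model under `torch.inference_mode()`; the small `layer` is a hypothetical stand-in for a real model.

```python
import torch

# Hypothetical stand-in for a model layer (not part of nnsight).
layer = torch.nn.Linear(4, 2)
x = torch.ones(1, 4)

# Under inference mode, no autograd graph is recorded.
with torch.inference_mode():
    out_inference = layer(x)

# Outside inference mode, the output is attached to a graph.
out_normal = layer(x)

inference_tracks_grad = out_inference.requires_grad
normal_tracks_grad = out_normal.requires_grad

# backward() fails on the inference-mode output because it has no graph.
try:
    out_inference.sum().backward()
    inference_backward_failed = False
except RuntimeError:
    inference_backward_failed = True

# backward() works on the normal output and populates parameter gradients.
out_normal.sum().backward()
weight_grad = layer.weight.grad

print(inference_tracks_grad, normal_tracks_grad, inference_backward_failed)
```

Only the output computed outside inference mode carries a `grad_fn`, which is why the tracing examples here disable inference mode before calling `.backward()`.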


To see the gradients of `hidden_states`, we can call `.retain_grad()` on it and read the `.grad` attribute after execution.

In [1]:
from nnsight import LanguageModel

model = LanguageModel('gpt2', device_map='cuda')

with model.forward(inference=False) as runner:
    with runner.invoke('Hello World') as invoker:

        hidden_states = model.transformer.h[-1].output[0].save()
        hidden_states.retain_grad()

        logits = model.lm_head.output

        logits.sum().backward()

print(hidden_states.value)
print(hidden_states.value.grad)


tensor([[[ 0.5216, -1.1755, -0.4617,  ..., -1.1919,  0.0204, -2.0075],
         [ 0.9841,  2.2175,  3.5851,  ...,  0.5212, -2.2286,  5.7334]]],
       device='cuda:0', grad_fn=<AsStridedBackward0>)
tensor([[[  28.7976, -282.5977,  868.7343,  ...,  120.1742,   52.2264,
           168.6447],
         [  79.4183, -253.6227, 1322.1290,  ...,  208.3981,  -19.5544,
           509.9856]]], device='cuda:0')
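
The reason `.retain_grad()` is needed can be seen in a small plain PyTorch sketch. By default, autograd keeps `.grad` only for leaf tensors; an intermediate tensor such as a hidden state has to opt in. The tensors `w` and `hidden` below are illustrative names, not part of `nnsight`.

```python
import torch

# Leaf tensor: autograd stores its gradient by default.
w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Intermediate (non-leaf) tensor, playing the role of a hidden state.
hidden = w * 2

# Ask autograd to keep the gradient of this intermediate tensor.
hidden.retain_grad()

loss = (hidden ** 2).sum()
loss.backward()

hidden_grad = hidden.grad  # d(loss)/d(hidden) = 2 * hidden
leaf_grad = w.grad         # d(loss)/d(w) = 8 * w

print(hidden_grad)
print(leaf_grad)
```

Without the `retain_grad()` call, `hidden.grad` would be `None` after the backward pass, while `w.grad` would still be populated.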


Even better, `nnsight` provides proxy access into the backward pass via the `.grad` attribute on proxies. This works just like `.input` and `.output`: operations on it, including getting and setting, are traced and performed on the model at runtime. Note that this assumes the proxy wraps a Tensor, because accessing `.grad` calls `.register_hook(...)` on it.

The following examples demonstrate ablating (setting to zero) the gradients for a hidden state in GPT-2. The first example zeroes the gradient in place, and the second swaps the gradient out for a new tensor of zeroes.

In [1]:
from nnsight import LanguageModel
import torch

model = LanguageModel('gpt2', device_map='cuda')

with model.forward(inference=False) as runner:
    with runner.invoke("Hello World") as invoker:
        hidden_states = model.transformer.h[-1].output[0].save()

        # Copy the gradient before it is modified, then zero it in place.
        hidden_states_grad_before = hidden_states.grad.clone().save()
        hidden_states.grad[:] = 0
        hidden_states_grad_after = hidden_states.grad.save()

        logits = model.lm_head.output

        logits.sum().backward()

print("Before", hidden_states_grad_before.value)
print("After", hidden_states_grad_after.value)

with model.forward(inference=False) as runner:
    with runner.invoke("Hello World") as invoker:
        hidden_states = model.transformer.h[-1].output[0].save()

        # Copy the gradient, then swap it out for a new tensor of zeros.
        hidden_states_grad_before = hidden_states.grad.clone().save()
        hidden_states.grad = torch.zeros(hidden_states.grad.shape)
        hidden_states_grad_after = hidden_states.grad.save()

        logits = model.lm_head.output

        logits.sum().backward()

print("Before", hidden_states_grad_before.value)
print("After", hidden_states_grad_after.value)




Before tensor([[[  28.7976, -282.5981,  868.7355,  ...,  120.1743,   52.2264,
           168.6449],
         [  79.4181, -253.6227, 1322.1299,  ...,  208.3983,  -19.5544,
           509.9858]]], device='cuda:0')
After tensor([[[0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]], device='cuda:0')
Before tensor([[[  28.7976, -282.5981,  868.7355,  ...,  120.1743,   52.2264,
           168.6449],
         [  79.4181, -253.6227, 1322.1299,  ...,  208.3983,  -19.5544,
           509.9858]]], device='cuda:0')
After tensor([[[0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]], device='cuda:0')
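
For intuition about what a gradient hook does, the sketch below ablates a gradient in plain PyTorch with `Tensor.register_hook`. It is not the `nnsight` implementation, only an illustration of the underlying mechanism; `w`, `hidden`, `captured`, and `ablate` are hypothetical names.

```python
import torch

w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Intermediate tensor, playing the role of a hidden state.
hidden = w * 2

captured = {}

def ablate(grad):
    # Record the incoming gradient, then return a replacement.
    # Returning a tensor from a hook substitutes it for the original gradient.
    captured["before"] = grad.clone()
    return torch.zeros_like(grad)

handle = hidden.register_hook(ablate)

loss = (hidden ** 2).sum()
loss.backward()

# Remove the hook so later backward passes are unaffected.
handle.remove()

print("Before", captured["before"])
print("Upstream gradient", w.grad)
```

Because the hook returns zeros, everything upstream of `hidden` receives a zero gradient, so `w.grad` is all zeros even though the gradient arriving at `hidden` was non-zero.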
