# Logits in Causal Language Models

Task: Ask a language model for the most likely next tokens.

## Setup

We start in the same way as the tokenization notebook:

In [2]:
import torch
from torch import tensor
from transformers import AutoTokenizer, AutoModelForCausalLM
import pandas as pd

One step in this notebook will ask you to write a function. The most common error when function-ifying notebook code is accidentally using a global variable instead of a value computed in the function. This is a quick and dirty little utility to check for that mistake. (For a more polished version, check out [`localscope`](https://localscope.readthedocs.io/en/latest/README.html).)

In [3]:
def check_global_vars(func, allowed_globals):
    import inspect
    used_globals = set(inspect.getclosurevars(func).globals.keys())
    disallowed_globals = used_globals - set(allowed_globals)
    if len(disallowed_globals) > 0:
        raise AssertionError(f"The function {func.__name__} used unexpected global variables: {list(disallowed_globals)}")
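
As a quick illustration of what this check catches, here is a minimal sketch. The functions `leaky` and `clean` and the global `scale` are hypothetical names invented for this example; the utility itself is repeated (with `inspect` imported at the top) so the snippet runs on its own.

```python
import inspect

def check_global_vars(func, allowed_globals):
    # Names the function body refers to that resolve to module-level globals
    used_globals = set(inspect.getclosurevars(func).globals.keys())
    disallowed_globals = used_globals - set(allowed_globals)
    if len(disallowed_globals) > 0:
        raise AssertionError(f"The function {func.__name__} used unexpected global variables: {list(disallowed_globals)}")

scale = 2.0  # hypothetical global, like a value left over from an earlier cell

def leaky(x):
    # Bug: silently reads the global `scale` instead of taking it as a parameter
    return x * scale

def clean(x, scale):
    # Here `scale` is a parameter, so no global is involved
    return x * scale

check_global_vars(clean, [])  # passes without raising

try:
    check_global_vars(leaky, [])
    caught = False
    message = ""
except AssertionError as err:
    caught = True
    message = str(err)

print(caught, message)
```

The check only inspects the function's code; it never calls the function, so it is safe to run before the model is loaded.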

Download and load the model.

In [4]:
tokenizer = AutoTokenizer.from_pretrained("distilgpt2", add_prefix_space=True) # smaller version of GPT-2
# Alternative to add_prefix_space is to use `is_split_into_words=True`
# add the EOS token as PAD token to avoid warnings
model = AutoModelForCausalLM.from_pretrained("distilgpt2", pad_token_id=tokenizer.eos_token_id)

In [5]:
print(f"The tokenizer has {len(tokenizer.get_vocab())} strings in its vocabulary.")
print(f"The model has {model.num_parameters():,d} parameters.")

The tokenizer has 50257 strings in its vocabulary.
The model has 81,912,576 parameters.


## Task

In the tokenization notebook, we simply used the `generate` method to have the model generate some text. Now we'll do it ourselves.

Consider the following phrase:

In [6]:
phrase = "This weekend I plan to"
# Another one to try later. This was a famous early example of the GPT-2 model:
# phrase = "In a shocking finding, scientists discovered a herd of unicorns living in"

1: Call the `tokenizer` on the phrase to get a `batch` that includes `input_ids`.

In [7]:
batch = tokenizer(phrase, return_tensors='pt')
input_ids = batch['input_ids']

2: Call the `model` on the `input_ids`. Examine the shape of the logits.

In [8]:
with torch.no_grad(): # This tells PyTorch that we don't need it to compute gradients for us.
    model_output = model(input_ids)
print(f"logits shape: {list(model_output.logits.shape)} ")

logits shape: [1, 5, 50257] 


3: Pull out the logits corresponding to the *last* token in the input phrase.

In [46]:
last_token_logits = model_output.logits[0, -1]
assert last_token_logits.shape == (len(tokenizer.get_vocab()),)
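
To see what the `[0, -1]` indexing does without loading a model, here is a small sketch on a toy tensor. The sizes are made up for illustration (a vocabulary of 8 instead of 50257), and the tensor is filled with consecutive numbers rather than real logits.

```python
import torch

# Toy sizes (assumptions for illustration): 1 phrase, 5 tokens, 8 vocabulary entries
batch_size, seq_len, vocab_size = 1, 5, 8

# Fake "logits" holding the values 0, 1, 2, ... so each position is easy to recognize
logits = torch.arange(batch_size * seq_len * vocab_size, dtype=torch.float32)
logits = logits.reshape(batch_size, seq_len, vocab_size)

# First index picks the only phrase in the batch; -1 picks the last token position
last_token_logits = logits[0, -1]

print(list(logits.shape))             # [1, 5, 8]
print(list(last_token_logits.shape))  # [8]
```

Indexing with `[0, -1]` drops the batch and position dimensions, leaving one score per vocabulary entry.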

4: Identify the token id and corresponding string of the most likely next token.


In [47]:
most_likely_token_id = last_token_logits.argmax()
print(f"Most likely next token: {most_likely_token_id}, which corresponds to {repr(tokenizer.decode(most_likely_token_id))}")

Most likely next token: 467, which corresponds to ' go'


5: Use the `topk` method to find the top-10 most likely choices for the next token.

*Note*: This uses Pandas to make a nicely displayed table, and a *list comprehension* to decode the tokens. You don't *need* to understand how this all works, but I highly encourage thinking about what's going on.

In [42]:
most_likely_tokens = last_token_logits.topk(10)
most_likely_tokens.indices

tensor([ 467, 1011, 4341,  787,  466,  307, 5262, 3187, 1057,  423])

In [48]:
most_likely_tokens = last_token_logits.topk(10)
most_likely_tokens_df = pd.DataFrame({
    'tokens': [tokenizer.decode(token_id) for token_id in most_likely_tokens.indices],
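    # Softmax over the 10 selected logits only, so these probabilities are relative to the top 10 (they sum to 1).
    'probabilities': most_likely_tokens.values.softmax(dim=0),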
})
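# Show the table in a nicely formatted way (see https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html#Builtin-Styles).
# Note: `Styler.hide_index()` was removed in pandas 2.0; on newer versions, use `.hide(axis="index")` instead.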
most_likely_tokens_df.style.hide_index().background_gradient()

tokens,probabilities
go,0.188313
take,0.138117
spend,0.09937
make,0.096061
do,0.091927
be,0.088006
attend,0.081475
visit,0.081293
run,0.069481
have,0.065957
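
The probabilities in this table come from a softmax over just the top-*k* logits, so they sum to 1 among those candidates and are larger than the probabilities the model assigns over its whole vocabulary. The sketch below shows the difference in plain Python on a made-up six-entry vocabulary; the logit values are invented for illustration. In the notebook's terms, the full-vocabulary version would be `last_token_logits.softmax(dim=0).topk(10)`.

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a six-entry vocabulary (not real model output)
logits = [2.0, 1.0, 0.5, 0.0, -1.0, -2.0]
k = 3

# Indices of the k largest logits, largest first (what `topk` provides)
top_idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

# Option A: softmax over the whole vocabulary, then keep the top k
full_probs = softmax(logits)
full_top_k = [full_probs[i] for i in top_idx]

# Option B (what the cell above does): softmax over only the top-k logits
renormalized = softmax([logits[i] for i in top_idx])

print("full vocabulary:", [round(p, 4) for p in full_top_k], "sum =", round(sum(full_top_k), 4))
print("top-k only:     ", [round(p, 4) for p in renormalized], "sum =", round(sum(renormalized), 4))
```

Both options rank the tokens the same way; option B is option A divided by the total probability of the top *k*. This is also why the same token gets a different value in a top-5 table than in a top-10 table.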


6: Write a function that is given a phrase and a *k* and returns the top *k* most likely next tokens.

Build this function using only code that you've already filled in above. Clean up the code so that it doesn't do or display anything extraneous. Add comments about what each step does.

In [61]:
def predict_next_tokens(phrase, k):
    # Tokenize the phrase into a batch of PyTorch tensors.
    batch = tokenizer(phrase, return_tensors='pt')
    # The token ids are the model's input; shape [1, number of tokens].
    input_ids = batch['input_ids']
    with torch.no_grad(): # This tells PyTorch that we don't need it to compute gradients for us.
        model_output = model(input_ids)
    # Select the logits at the last position: one score per vocabulary entry.
    last_token_logits = model_output.logits[0, -1]
    assert len(last_token_logits.shape) == 1 # sanity check: a 1-D vector of scores
    # Find the k highest logits and their token ids.
    most_likely_tokens = last_token_logits.topk(k)
    # Return a table of decoded tokens and their probabilities.
    # The softmax is over the k selected logits only, so probabilities are relative to the top k.
    return pd.DataFrame({
        'tokens': [tokenizer.decode(token_id) for token_id in most_likely_tokens.indices],
        'probabilities': most_likely_tokens.values.softmax(dim=0),
    })

check_global_vars(predict_next_tokens, ["torch", "tokenizer", "pd", "model"])

In [57]:
predict_next_tokens("This weekend I plan to", 5).style.hide_index().background_gradient()

tokens,probabilities
go,0.306805
take,0.225024
spend,0.161896
make,0.156506
do,0.14977


In [58]:
predict_next_tokens("To be or not to", 5).style.hide_index().background_gradient()

tokens,probabilities
be,0.926792
have,0.030508
the,0.018525
do,0.013536
",",0.010639


In [59]:
predict_next_tokens("For God so loved the", 5).style.hide_index().background_gradient()

tokens,probabilities
world,0.384817
Lord,0.231493
people,0.160957
earth,0.136999
children,0.085733


## Analysis


Q1: Explain the shape of `model_output.logits`.

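The shape of `model_output.logits` is `[1, 5, 50257]`, which is `[batch size, sequence length, vocabulary size]`: one phrase in the batch, 5 tokens in that phrase, and one logit for each of the 50257 vocabulary entries at every token position. The logits at position *i* score the candidates for the token that follows token *i*, which is why we used the last position to predict the next token.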

Q2: What would be required to generate more than one token? What decisions would you have to make?

The model only scores one next token at a time, so generating more than one token means repeating the steps above in a loop: pick a next token, append it to `input_ids`, and call the model again on the longer sequence. Each prediction is then based on the whole sequence so far, not just the last token. We would have to decide how to pick each token (always take the most likely one, or sample from the distribution) and when to stop (after a maximum number of tokens, or when the model produces the end-of-sequence token).
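
One common way to generate several tokens is to repeat the single-token prediction in a loop. Below is a minimal sketch of that idea with greedy selection (always take the highest-scoring token). Everything here is a stand-in: `toy_next_token_logits` is a hypothetical replacement for `model(input_ids).logits[0, -1]`, and `VOCAB_SIZE` and `EOS_ID` are made-up values, so the loop can run without downloading a model.

```python
VOCAB_SIZE = 6  # hypothetical toy vocabulary size
EOS_ID = 0      # hypothetical end-of-sequence token id

def toy_next_token_logits(token_ids):
    """Stand-in for a language model: scores the candidates for the next token.

    This toy rule simply favors (last id + 1) mod VOCAB_SIZE.
    """
    favored = (token_ids[-1] + 1) % VOCAB_SIZE
    return [1.0 if i == favored else 0.0 for i in range(VOCAB_SIZE)]

def greedy_generate(token_ids, max_new_tokens):
    token_ids = list(token_ids)  # copy so the caller's list is not modified
    for _ in range(max_new_tokens):               # decision 1: a length limit
        logits = toy_next_token_logits(token_ids)  # score candidates given the sequence so far
        next_id = max(range(len(logits)), key=logits.__getitem__)  # decision 2: greedy argmax
        token_ids.append(next_id)                  # the chosen token becomes part of the input
        if next_id == EOS_ID:                      # decision 3: stop at end-of-sequence
            break
    return token_ids

generated = greedy_generate([2, 3], max_new_tokens=10)
print(generated)  # [2, 3, 4, 5, 0]
```

Replacing the argmax with a random draw from the softmax of the logits would turn this into sampling, which gives more varied text at the cost of reproducibility.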