<a href="https://colab.research.google.com/github/kcarnold/cs344/blob/main/portfolio/fundamentals/013-lm-logits.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# `013` Language Models and Logits

Task: Ask a language model for the most likely next tokens.

This notebook follows up on `012-tokenization`.

## Setup

We'll be using the HuggingFace Transformers library, which provides a (mostly) consistent interface to many different language models. We'll focus on the OpenAI GPT-2 model, famous for OpenAI's assertion that it was "too dangerous" to release in full.

[Documentation](https://huggingface.co/transformers/model_doc/gpt2.html) for the model and tokenizer.

In [1]:
!pip install transformers



In [2]:
import torch
from torch import tensor

### Download and load the model

In [3]:
from transformers import AutoTokenizer, AutoModelForCausalLM
# distilgpt2 is a smaller, distilled version of GPT-2.
# add_prefix_space=True tokenizes the first word as if it had a leading space, like every later word.
# (An alternative to add_prefix_space is to use `is_split_into_words=True`.)
tokenizer = AutoTokenizer.from_pretrained("distilgpt2", add_prefix_space=True)
# GPT-2 has no PAD token, so reuse the EOS token as PAD to avoid warnings.
model = AutoModelForCausalLM.from_pretrained("distilgpt2", pad_token_id=tokenizer.eos_token_id)

In [4]:
print(f"The model has {model.num_parameters():,d} parameters.")

The model has 81,912,576 parameters.


## Task

Consider the following phrase: "This weekend I plan to".

1. Convert the phrase into token ids.
2. Use the `forward` method of the `model`. Explain the shape of `model_output.logits`.
3. Pull out the logits corresponding to the *last* token in the input phrase. Identify the id of the most likely next token.
4. Decode that id to find which token the model considers most likely.
5. Use the `topk` method to find the top-20 most likely choices for the next token. 
6. Write a function that is given a phrase and a *k* and returns the top *k* most likely next tokens.

In [5]:
#phrase = "In a shocking finding, scientists discovered a herd of unicorns living in"
phrase = "This weekend I plan to"

In [6]:
# Step 1: convert the phrase into a list of token ids.
input_ids = tokenizer.encode(phrase)

In [7]:
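# Calling the model runs its `forward` method; no_grad skips gradient tracking, which inference does not need.
with torch.no_grad():
    model_output = model(tensor([input_ids]))
model_output.logits.shape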

torch.Size([1, 5, 50257])
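
One way to read this shape, sketched with a stand-in tensor rather than the real model output: the three axes are (batch, position, vocabulary). The names `batch_size`, `seq_len`, `vocab_size`, and `fake_logits` are illustrative and do not appear elsewhere in this notebook.

```python
import torch

# Stand-in for model_output.logits; the values are random, only the shape matters here.
batch_size, seq_len, vocab_size = 1, 5, 50257
fake_logits = torch.randn(batch_size, seq_len, vocab_size)

# Axis 0: one entry per input sequence (we passed a single phrase).
# Axis 1: one entry per input token (the phrase became 5 tokens).
# Axis 2: one score per vocabulary entry (50,257 entries for GPT-2).
print(fake_logits.shape)

# The row at position i scores candidates for the token at position i + 1,
# so the last row scores candidates for the token that would follow the phrase.
next_token_scores = fake_logits[0, -1]
print(next_token_scores.shape)
```

This is why step 3 asks for the logits of the *last* token: that row is the only one describing a token we have not already seen.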

In [8]:
# Since we only have a single sequence (batch size of 1), drop the batch dimension.
logits = model_output.logits[0]

In [9]:
# Step 3: the last row scores every candidate next token; argmax gives the id of the most likely one.
logits[-1].argmax()

tensor(467)
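
Logits are unnormalized scores. Applying softmax turns them into a probability distribution without changing which entry is largest, so taking `argmax` directly on the logits is enough for this step. A small sketch with made-up numbers (`toy_logits` is purely illustrative):

```python
import torch

# Made-up scores over a 4-entry "vocabulary".
toy_logits = torch.tensor([2.0, 0.5, -1.0, 3.0])

# Softmax exponentiates and normalizes, so the results are positive and sum to 1.
probs = torch.softmax(toy_logits, dim=-1)
print(probs.sum())

# Softmax preserves ordering, so the argmax is the same either way.
print(toy_logits.argmax().item(), probs.argmax().item())
```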

In [10]:
# Step 4: decode the id found above back into text.
tokenizer.decode([467])

' go'
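
Steps 5 and 6 could be sketched as follows. This is one possible approach, not the only one: the function name `top_k_next_tokens` and its signature are chosen here for illustration, and it assumes `model` and `tokenizer` behave like the objects loaded in the setup cells (an `encode`/`decode` tokenizer and a model whose output has a `.logits` attribute).

```python
import torch

def top_k_next_tokens(model, tokenizer, phrase, k=20):
    """Return the k most likely next tokens after `phrase` as (token, probability) pairs."""
    input_ids = tokenizer.encode(phrase)
    with torch.no_grad():
        logits = model(torch.tensor([input_ids])).logits
    # Batch item 0, last position: scores for the token that follows the phrase.
    next_token_logits = logits[0, -1]
    # Convert scores to probabilities so the returned values are easier to interpret.
    probs = torch.softmax(next_token_logits, dim=-1)
    # topk returns the k largest values and their indices, largest first.
    top = torch.topk(probs, k)
    return [
        (tokenizer.decode([token_id]), prob)
        for token_id, prob in zip(top.indices.tolist(), top.values.tolist())
    ]
```

With the objects loaded earlier, `top_k_next_tokens(model, tokenizer, phrase, k=20)` would cover step 5 as well. Because `topk` returns its results largest first, the first entry should agree with the `argmax` result above.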

## Analysis

What would be required to generate more than one token? What decisions would you have to make?
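
One possible answer, sketched as greedy decoding: append the most likely token to the input and run the model again, repeating until some stopping rule fires. The decisions this exposes are how to choose each token (always the argmax, as here, or sampling from the distribution) and when to stop (a fixed number of tokens, or the end-of-sequence token). The names `greedy_generate` and `max_new_tokens` are chosen for illustration; the Transformers library provides its own `generate` method for real use.

```python
import torch

def greedy_generate(model, input_ids, max_new_tokens=10, eos_token_id=None):
    """Extend a list of token ids by repeatedly appending the most likely next token."""
    ids = list(input_ids)
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(torch.tensor([ids])).logits
        # Decision 1: how to pick the next token. Here we always take the argmax.
        next_id = logits[0, -1].argmax().item()
        ids.append(next_id)
        # Decision 2: when to stop. Here we stop early on the end-of-sequence token.
        if eos_token_id is not None and next_id == eos_token_id:
            break
    return ids
```

With the objects from earlier cells, `tokenizer.decode(greedy_generate(model, input_ids, eos_token_id=tokenizer.eos_token_id))` would turn the extended id list back into text. This sketch reruns the model on the whole sequence at every step, which is simple but wasteful.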