In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer

## Step 1. Load a tokenizer and a model

First we load a tokenizer and a model from Hugging Face's `transformers` library. A tokenizer splits a string into tokens and maps each token to an integer ID, producing the list of numbers that the model takes as input.
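
To make the idea concrete, here is a minimal sketch of a tokenizer in plain Python. The vocabulary `TOY_VOCAB` and the helpers `toy_encode` and `toy_decode` are hypothetical and exist only for illustration. The sketch uses greedy longest-match lookup, which is a simplification and not the byte-pair-encoding procedure that GPT-2 actually uses.

```python
# Hypothetical toy vocabulary. The real GPT-2 vocabulary is learned from data
# and is far larger; these entries and IDs are made up for illustration.
TOY_VOCAB = {
    "The": 0, " best": 1, " place": 2, " to": 3, " learn": 4,
    " about": 5, " gener": 6, "ative": 7, " AI": 8,
}
ID_TO_TOKEN = {token_id: token for token, token_id in TOY_VOCAB.items()}

# Try longer tokens first so the longest match wins at each position.
TOKENS_BY_LENGTH = sorted(TOY_VOCAB, key=len, reverse=True)


def toy_encode(text):
    """Split text into known tokens and return their integer IDs."""
    ids = []
    pos = 0
    while pos < len(text):
        for token in TOKENS_BY_LENGTH:
            if text.startswith(token, pos):
                ids.append(TOY_VOCAB[token])
                pos += len(token)
                break
        else:
            raise ValueError(f"no token matches at position {pos}")
    return ids


def toy_decode(ids):
    """Map IDs back to token strings and join them."""
    return "".join(ID_TO_TOKEN[token_id] for token_id in ids)


ids = toy_encode("The best place to learn about generative AI")
print(ids)
print(toy_decode(ids))
```

Note that a word missing from the vocabulary as a whole ("generative") is still representable as two smaller pieces, and that decoding the IDs recovers the original string.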

In [2]:
# Load a pretrained model and a tokenizer using HuggingFace
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Create a partial sentence and tokenize it
text = "The best place to learn about generative AI online is"
inputs = tokenizer(text, return_tensors="pt")

In [3]:
# Show tokenized input as numbers
print(inputs["input_ids"])

tensor([[ 464, 1266, 1295,  284, 2193,  546, 1152,  876, 9552, 2691,  318]])


## Step 2. Examine the tokenization

Let's explore what these tokens mean!

In [5]:
# Show how the sentence is tokenized
import pandas as pd

def show_tokenization(inputs):
    """Return a DataFrame pairing each token ID with its decoded text."""
    df = pd.DataFrame(
        [
            (token_id, tokenizer.decode(token_id))
            for token_id in inputs["input_ids"][0]
        ],
        columns=["Token ID", "Token"]
    )
    return df

show_tokenization(inputs)


| | Token ID | Token |
|---|---|---|
| 0 | tensor(464) | The |
| 1 | tensor(1266) | best |
| 2 | tensor(1295) | place |
| 3 | tensor(284) | to |
| 4 | tensor(2193) | learn |
| 5 | tensor(546) | about |
| 6 | tensor(1152) | gener |
| 7 | tensor(876) | ative |
| 8 | tensor(9552) | AI |
| 9 | tensor(2691) | online |


## Step 3. Calculate the probability of the next token

Now let's use PyTorch to calculate the probability of the next token given the previous ones.

In [8]:
import torch

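with torch.no_grad():
    # Keep only the logits at the last position: the scores for the next token
    logits = model(**inputs).logits[:, -1, :]
    # Softmax turns those scores into a probability distribution over the vocabulary
    probabilities = torch.nn.functional.softmax(logits[0], dim=-1)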
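
The softmax step in the cell above can be reproduced by hand. Below is a minimal sketch in plain Python, assuming a made-up three-token vocabulary; the logit values and the `softmax` helper are illustrative and not taken from GPT-2.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum does not change the result but avoids overflow.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate next tokens
toy_logits = [2.0, 1.0, 0.1]
toy_probabilities = softmax(toy_logits)

for logit, p in zip(toy_logits, toy_probabilities):
    print(f"logit={logit:>4}  probability={p:.3f}")
```

A higher logit always maps to a higher probability, so the token with the largest logit is also the most probable one.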

In [11]:
def show_next_token_choices(probabilities, top_n=15):
    """Return the top_n most probable next tokens as a DataFrame."""
    df = pd.DataFrame(
        [
            (token_id, tokenizer.decode(token_id), p.item())
            for token_id, p in enumerate(probabilities)
            if p.item()  # skip tokens with zero probability
        ],
        columns=["Id", "Token", "Probability"]
    ).sort_values("Probability", ascending=False).head(top_n)
    return df

show_next_token_choices(probabilities)

| | Id | Token | Probability |
|---|---|---|---|
| 379 | 379 | at | 0.09957 |
| 262 | 262 | the | 0.095797 |
| 416 | 416 | by | 0.079331 |
| 832 | 832 | through | 0.06199 |
| 287 | 287 | in | 0.0596 |
| 994 | 994 | here | 0.040486 |
| 351 | 351 | with | 0.036844 |
| 319 | 319 | on | 0.022664 |
| 284 | 284 | to | 0.01992 |
| 3012 | 3012 | Google | 0.018675 |


In [12]:
# Obtain the token id for the most probable next token
next_token_id = torch.argmax(probabilities).item()

print(f"Next token id: {next_token_id}")
print(f"Next token: {tokenizer.decode(next_token_id)}")

Next token id: 379
Next token:  at


## Step 4. Generate some more tokens

The following cell takes `text`, shows the most probable next tokens, and appends the most likely one to `text`. Run the cell repeatedly to watch the sentence grow one token at a time!

In [13]:
# Press ctrl + enter to run this cell again and again to see how the text is generated.

from IPython.display import Markdown, display

# Show the text
print(text)

# Convert to tokens
inputs = tokenizer(text, return_tensors="pt")

# Calculate the probabilities for the next token and show the top choices
with torch.no_grad():
    logits = model(**inputs).logits[:, -1, :]
    probabilities = torch.nn.functional.softmax(logits[0], dim=-1)

display(Markdown("**Next token probabilities:**"))
display(show_next_token_choices(probabilities))

# Choose the most likely token id and add it to the text
next_token_id = torch.argmax(probabilities).item()
text = text + tokenizer.decode(next_token_id)

The best place to learn about generative AI online is


**Next token probabilities:**

| | Id | Token | Probability |
|---|---|---|---|
| 379 | 379 | at | 0.09957 |
| 262 | 262 | the | 0.095797 |
| 416 | 416 | by | 0.079331 |
| 832 | 832 | through | 0.06199 |
| 287 | 287 | in | 0.0596 |
| 994 | 994 | here | 0.040486 |
| 351 | 351 | with | 0.036844 |
| 319 | 319 | on | 0.022664 |
| 284 | 284 | to | 0.01992 |
| 3012 | 3012 | Google | 0.018675 |
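
Re-running the cell by hand amounts to a loop that repeatedly picks the most likely token and appends it, a strategy often called greedy decoding. The sketch below shows that loop on a hypothetical lookup table, `TOY_MODEL`, which stands in for the model. Its tokens and probabilities are made up, and unlike GPT-2 it looks only at the previous token rather than the whole sequence.

```python
# Hypothetical next-token probabilities, keyed by the previous token only.
# A real model such as GPT-2 conditions on the entire preceding sequence.
TOY_MODEL = {
    " is": {" at": 0.5, " the": 0.3, " by": 0.2},
    " at": {" the": 0.7, " home": 0.3},
    " the": {" library": 0.6, " office": 0.4},
    " library": {".": 1.0},
}


def greedy_generate(tokens, steps):
    """Append the single most probable next token, `steps` times."""
    tokens = list(tokens)
    for _ in range(steps):
        choices = TOY_MODEL.get(tokens[-1])
        if choices is None:
            break  # the toy table has no entry for this token
        # Equivalent in spirit to torch.argmax over the probabilities
        next_token = max(choices, key=choices.get)
        tokens.append(next_token)
    return tokens


generated = greedy_generate(["The", " best", " place", " is"], steps=4)
print("".join(generated))
```

Each pass through the loop plays the role of one run of the notebook cell: compute the distribution, take the top choice, extend the text.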


## Step 5. Use the `generate` method

In [14]:
from IPython.display import Markdown, display

# Start with some text and tokenize it
text = "Once upon a time, generative models"
inputs = tokenizer(text, return_tensors="pt")

# Use the `generate` method to extend the text to at most 100 tokens, prompt included
output = model.generate(**inputs, max_length=100, pad_token_id=tokenizer.eos_token_id)

# Show the generated text
display(Markdown(tokenizer.decode(output[0])))

Once upon a time, generative models of the human brain were used to study the neural correlates of cognitive function. In the present study, we used a novel model of the human brain to investigate the neural correlates of cognitive function. We used a novel model of the human brain to investigate the neural correlates of cognitive function. We used a novel model of the human brain to investigate the neural correlates of cognitive function. We used a novel model of the human brain to investigate the neural correlates of cognitive function.
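
The repeated sentence in the output above is typical of always choosing the single most likely token, which is how `generate` behaves here since no sampling options were passed. One common alternative is to sample the next token in proportion to its probability (in `transformers`, `generate` accepts `do_sample=True` for this). The sketch below contrasts the two strategies on a hypothetical lookup table; `TOY_MODEL`, `toy_generate`, and all probabilities are made up for illustration and do not come from GPT-2.

```python
import random

# Hypothetical next-token probabilities, keyed by the previous token only.
TOY_MODEL = {
    " We": {" used": 0.6, " built": 0.4},
    " used": {" a": 0.7, " the": 0.3},
    " built": {" a": 0.5, " the": 0.5},
    " a": {" model.": 0.6, " tool.": 0.4},
    " the": {" model.": 0.5, " tool.": 0.5},
    " model.": {" We": 1.0},
    " tool.": {" We": 1.0},
}


def toy_generate(start, steps, rng=None):
    """Generate greedily when rng is None, otherwise sample with rng."""
    tokens = [start]
    for _ in range(steps):
        choices = TOY_MODEL[tokens[-1]]
        if rng is None:
            # Greedy: always take the most probable token
            next_token = max(choices, key=choices.get)
        else:
            # Sampling: draw a token in proportion to its probability
            next_token = rng.choices(
                list(choices), weights=list(choices.values())
            )[0]
        tokens.append(next_token)
    return "".join(tokens).strip()


print("greedy: ", toy_generate(" We", steps=7))
print("sampled:", toy_generate(" We", steps=7, rng=random.Random(0)))
```

Greedy decoding follows the same highest-probability path every time it returns to a token it has seen, so it falls into a cycle; sampling can leave that path, at the cost of occasionally picking less likely tokens.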