In [21]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# To load a pretrained model and a tokenizer using HuggingFace, we only need two lines of code!
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")



In [22]:
# We create a partial sentence and tokenize it.
text = "Nigeria is a country in which continent -"
inputs = tokenizer(text, return_tensors="pt")

# Show the tokens as numbers, i.e. "input_ids"
inputs["input_ids"]

tensor([[   45,   328,  5142,   318,   257,  1499,   287,   543, 15549,   532]])

## Step 1. Examine the tokenization

Let's explore what these tokens mean!

In [23]:
# Show how the sentence is tokenized
import pandas as pd


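def show_tokenization(inputs):
    """Return a table pairing each token id with its decoded text."""
    # `token_id` avoids shadowing Python's built-in `id`.
    return pd.DataFrame(
        [(token_id, tokenizer.decode(token_id)) for token_id in inputs["input_ids"][0]],
        columns=["id", "token"],
    )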


show_tokenization(inputs)

|   | id | token |
|---|---|---|
| 0 | tensor(45) | N |
| 1 | tensor(328) | ig |
| 2 | tensor(5142) | eria |
| 3 | tensor(318) | is |
| 4 | tensor(257) | a |
| 5 | tensor(1499) | country |
| 6 | tensor(287) | in |
| 7 | tensor(543) | which |
| 8 | tensor(15549) | continent |
| 9 | tensor(532) | - |


### Subword tokenization

The tokens here are neither single letters nor whole words. Short, common words such as "is" and "country" each map to a single token, while "Nigeria" is split into three pieces: "N", "ig", and "eria". A token can therefore be a whole word, part of a word, or even a single letter. This is called subword tokenization.
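To make the idea concrete, here is a minimal sketch of byte-pair-encoding (BPE) style segmentation, the family of algorithm GPT-2's tokenizer belongs to. The merge rules in `TOY_MERGES` are hypothetical and were picked only to reproduce the "N" / "ig" / "eria" split shown above. The real tokenizer learns tens of thousands of merges from data, works on bytes, and treats leading spaces specially, none of which is modeled here.

```python
# Hypothetical merge rules, listed from highest to lowest priority.
TOY_MERGES = [
    ("i", "s"),
    ("i", "g"),
    ("r", "i"),
    ("e", "ri"),
    ("eri", "a"),
]


def bpe_segment(word, merges):
    """Split `word` into subword pieces by repeatedly applying merge rules."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)  # start from single characters
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair with the best (lowest) rank.
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        merged = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


print(bpe_segment("Nigeria", TOY_MERGES))  # ['N', 'ig', 'eria']
print(bpe_segment("is", TOY_MERGES))       # ['is']
```

A word that matches no merge rule simply stays split into characters, which is why rare words tend to produce more tokens than common ones.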

## Step 2. Calculate the probability of the next token

Now let's use PyTorch to calculate the probability of the next token given the previous ones.

In [24]:
# Calculate the probabilities for the next token for all possible choices. We show the
# top 5 choices and the corresponding words or subwords for these tokens.

import torch

with torch.no_grad():
    logits = model(**inputs).logits[:, -1, :]
    probabilities = torch.nn.functional.softmax(logits[0], dim=-1)


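def show_next_token_choices(probabilities, top_n=5):
    # Build one row per vocabulary entry (skipping exact-zero probabilities),
    # sort by probability, and keep the top_n rows.
    return pd.DataFrame(
        [
            (token_id, tokenizer.decode(token_id), p.item())
            for token_id, p in enumerate(probabilities)
            if p.item()
        ],
        columns=["id", "token", "p"],
    ).sort_values("p", ascending=False)[:top_n]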


show_next_token_choices(probabilities)

|   | id | token | p |
|---|---|---|---|
| 5478 | 5478 | Africa | 0.094027 |
| 262 | 262 | the | 0.073535 |
| 290 | 290 | and | 0.059788 |
| 407 | 407 | not | 0.019861 |
| 393 | 393 | or | 0.019258 |
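The key step in the cell above is the softmax, which turns the model's raw scores (logits) for the last position into probabilities that sum to 1. As a rough illustration in plain Python, with made-up logits rather than GPT-2's real ones (`toy_logits` and this `softmax` helper are assumptions for the sketch):

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for four candidate next tokens.
toy_logits = {" Africa": 2.0, " the": 1.7, " and": 1.5, " not": 0.4}
probs = dict(zip(toy_logits, softmax(list(toy_logits.values()))))

for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{token!r}: {p:.3f}")
```

Softmax preserves the ordering of the scores, so the token with the largest logit is also the most probable one. That is why taking `argmax` of the probabilities and taking it of the logits select the same token.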


In [25]:
# Obtain the token id for the most probable next token
next_token_id = torch.argmax(probabilities).item()

print(f"Next token id: {next_token_id}")
print(f"Next token: {tokenizer.decode(next_token_id)}")

Next token id: 5478
Next token:  Africa


In [26]:
# We append the most likely token to the text.
text = text + tokenizer.decode(next_token_id)
text

'Nigeria is a country in which continent - Africa'

## Step 3. Generate some more tokens

The following cell takes `text`, shows the most probable next tokens, and appends the most likely one to `text`. Run the cell repeatedly to watch the sentence grow one token at a time.

In [27]:
from IPython.display import Markdown, display

# Show the text
print(text)

# Convert to tokens
inputs = tokenizer(text, return_tensors="pt")

# Calculate the probabilities for the next token and show the top 5 choices
with torch.no_grad():
    logits = model(**inputs).logits[:, -1, :]
    probabilities = torch.nn.functional.softmax(logits[0], dim=-1)

display(Markdown("**Next token probabilities:**"))
display(show_next_token_choices(probabilities))

# Choose the most likely token id and add it to the text
next_token_id = torch.argmax(probabilities).item()
text = text + tokenizer.decode(next_token_id)

Nigeria is a country in which continent - Africa


**Next token probabilities:**

|   | id | token | p |
|---|---|---|---|
| 11 | 11 | , | 0.45995 |
| 532 | 532 | - | 0.212432 |
| 290 | 290 | and | 0.149071 |
| 393 | 393 | or | 0.025094 |
| 318 | 318 | is | 0.019907 |
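Re-running the cell by hand amounts to a loop: look up the next-token probabilities, pick the most likely token, append it, repeat. The sketch below shows that loop in plain Python. `TOY_MODEL` is a hypothetical stand-in that only looks at the last token, whereas GPT-2 conditions on the whole preceding text; the probabilities are invented for illustration.

```python
# Hypothetical next-token probabilities, keyed by the previous token.
TOY_MODEL = {
    "I": {" want": 0.6, " love": 0.4},
    " I": {" want": 0.6, " love": 0.4},
    " want": {" to": 0.9, " money": 0.1},
    " to": {" be": 0.7, " go": 0.3},
    " be": {" good": 0.8, " rich": 0.2},
    " good": {".": 0.9, " and": 0.1},
    ".": {" I": 0.95, " But": 0.05},
}


def greedy_generate(tokens, steps):
    """Append the most probable next token `steps` times (greedy decoding)."""
    tokens = list(tokens)  # copy so the caller's list is not modified
    for _ in range(steps):
        distribution = TOY_MODEL.get(tokens[-1])
        if distribution is None:
            break  # the toy model has no entry for this token
        tokens.append(max(distribution, key=distribution.get))
    return tokens


print("".join(greedy_generate(["I"], 12)))
# I want to be good. I want to be good. I
```

Because the most probable choice is always taken, the toy model falls into a cycle as soon as it revisits a token. Always picking the top token makes the output deterministic, and loops like this are a common side effect.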


## Step 4. Use the `generate` method

In [34]:
from IPython.display import Markdown, display

# Read an incomplete sentence from the user; it is tokenized in the next cell
text = input("Enter an incomplete sentence and the model will guess the next word: ")
print(f"You entered: '{text}'")
print("Run the next cell and the model will guess the next words")

Enter an incomplete sentence and the model will guess the next word: I love money but
You entered: 'I love money but'
Run the next cell and the model will guess the next words


In [35]:
inputs = tokenizer(text, return_tensors="pt")

# Use the `generate` method to generate lots of text
output = model.generate(**inputs, max_length=100, pad_token_id=tokenizer.eos_token_id)

# Show the generated text
display(Markdown(tokenizer.decode(output[0])))

I love money but I don't want to be a millionaire. I want to be a good person. I want to be a good person. I want to be a good person. I want to be a good person. I want to be a good person. I want to be a good person. I want to be a good person. I want to be a good person. I want to be a good person. I want to be a good person. I want to be a good person

### That's interesting...

You'll notice that GPT-2 is not nearly as sophisticated as later models like GPT-4, which you may have experience using. It often repeats itself and doesn't always make much sense. But it's still pretty impressive that it can generate text that looks like English.
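The repetition is typical of greedy decoding, where the single most probable token is chosen at every step; as far as we know, that is what `generate` does when called as above without any sampling options. One common alternative is to sample from the few highest-scoring candidates instead. The sketch below is plain Python with made-up scores; `sample_top_k` and `toy_logits` are hypothetical names, not part of the Transformers library.

```python
import math
import random


def sample_top_k(logits, k, temperature=1.0, rng=random):
    """Sample one token from the k highest-scoring entries of `logits`.

    `logits` maps token -> raw score. A lower temperature sharpens the
    distribution; k=1 reduces to greedy decoding.
    """
    if k < 1 or temperature <= 0:
        raise ValueError("k must be >= 1 and temperature must be > 0")
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
    peak = top[0][1]
    # Unnormalized softmax weights; random.choices normalizes them for us.
    weights = [math.exp((score - peak) / temperature) for _, score in top]
    tokens = [token for token, _ in top]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical scores for the token that follows "I love money but".
toy_logits = {" I": 3.0, " it": 2.5, " money": 2.2, " the": 0.5, " not": 0.1}

rng = random.Random(0)  # seeded so repeated runs give the same draws
print([sample_top_k(toy_logits, k=3, rng=rng) for _ in range(5)])
```

In Transformers, `generate` accepts options such as `do_sample=True`, `top_k`, and `temperature` for this purpose; check the documentation for your installed version before relying on specific defaults.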