# How LLMs build sentences

This notebook demonstrates the core mechanism behind how Large Language Models generate text. As we discussed in the lecture, an LLM generates its response **one token at a time** through a process called auto-regression: it repeatedly assigns a probability to every possible next token based on all the tokens that came before, picks one (usually one of the most likely), appends it to the text, and repeats.
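
The loop described above can be sketched offline with a toy example. Everything here is an assumption made for illustration: `TOY_MODEL` is a hypothetical lookup table with made-up probabilities, it only looks at the previous token (a real LLM conditions on the whole context), and it always takes the single most likely token (greedy decoding), whereas real APIs usually sample.

```python
# Hypothetical toy "model": previous token -> candidate next tokens with
# made-up probabilities. A real LLM computes these from the full context.
TOY_MODEL = {
    "<start>": {"The": 0.9, "A": 0.1},
    "The": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"<end>": 1.0},
}


def generate(max_tokens=10):
    """Build a sentence one token at a time using greedy decoding."""
    tokens = ["<start>"]
    for _ in range(max_tokens):
        # Look up the candidates for the current context (here: last token only).
        candidates = TOY_MODEL.get(tokens[-1], {"<end>": 1.0})
        # Greedy choice: take the candidate with the highest probability.
        next_token = max(candidates, key=candidates.get)
        if next_token == "<end>":
            break
        # The chosen token becomes part of the context for the next step.
        tokens.append(next_token)
    return tokens[1:]  # drop the "<start>" marker


print(generate())  # ['The', 'cat', 'sat']
```

The important part is the feedback: each chosen token is appended to `tokens` and influences the next prediction.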

This demonstration will make that process tangible. We will explore two key features of the OpenAI API that let us peek under the hood:

1. **Streaming:** We'll see the tokens arrive one by one in real time, just as the model produces them.
2. **Log Probs (`logprobs`):** We'll look at the probability scores the model assigns to potential next tokens, which shows that generation is a statistical process.

Let's get started!

In [None]:
import litellm
import math
from dotenv import load_dotenv

MODEL_NAME = "openai/gpt-4o-mini"
MAX_TOKENS = 200

load_dotenv()

## Watching Generation with Streaming

The simplest way to observe the token-by-token process is to "stream" the response. Instead of sending a prompt and waiting for the entire completed text, streaming sends us each token as soon as it's generated. This is not only great for creating a responsive, real-time user experience (like in ChatGPT) but also perfectly illustrates the auto-regressive nature of LLMs.
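
To make the consumer side of streaming concrete without calling any API, here is a minimal offline sketch. `fake_token_stream` and `consume_stream` are hypothetical helpers invented for this illustration; the generator simply stands in for a streaming response and yields hard-coded text pieces.

```python
import time


def fake_token_stream(pieces, delay_seconds=0.0):
    """Stand-in for a streaming response: yields text pieces one at a time."""
    for piece in pieces:
        time.sleep(delay_seconds)  # pretend the model needs time per token
        yield piece


def consume_stream(stream):
    """Print each piece as soon as it arrives and return the assembled text."""
    parts = []
    for piece in stream:
        # flush=True forces the piece onto the screen immediately
        print(piece, end="", flush=True)
        parts.append(piece)
    print()
    return "".join(parts)


full_text = consume_stream(
    fake_token_stream(["Once", " upon", " a", " time"], delay_seconds=0.05)
)
```

Keeping the pieces in a list is a common pattern: the user sees text immediately, and the application still ends up with the complete response once the stream is exhausted.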

> Use Case: Use streaming in any application where you want to display the model's output to a user in real time. The total generation time stays the same, but the first words appear almost immediately, so the response feels much faster.

In the code below, we'll send a simple prompt and print each piece of the response as it comes in. Notice how the sentence is built up incrementally.

In [None]:
prompt_story = [
    {
        "role": "user",
        "content": "Tell me a short story about a typical day in a programmer's life."
    }
]

response = litellm.completion(
    model=MODEL_NAME,
    messages=prompt_story,
    max_tokens=MAX_TOKENS,
    stream=True
)

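print("--- Response ---")
for chunk in response:
    # Each chunk carries a small piece of new text in `delta.content`;
    # it can be empty (for example in the final chunk), so skip those.
    if chunk.choices[0].delta.content:
        # "|" marks the boundary between chunks; flush so text shows immediately
        print(chunk.choices[0].delta.content, end="|", flush=True)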

## Visualizing Probabilities with `logprobs`

Now, let's get even closer to the model's "decision" process. We can ask the API to return the **log probabilities** for the most likely tokens at each step of the generation. This lets us see the statistical path the model is taking.

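A log probability is the natural logarithm of a probability. Logs are used because token probabilities can be extremely small, and logs keep them in a numerically manageable range: a probability of 1 has a log probability of 0, and less likely tokens have increasingly negative values. We can convert them back to percentages with `math.exp` to see exactly what the model was "thinking."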
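
As a quick worked example of that conversion, the sketch below turns a few hand-picked log probabilities back into percentages. The helper name `logprob_to_percent` and the sample values are assumptions chosen for illustration, not values returned by any model.

```python
import math


def logprob_to_percent(logprob):
    """Convert a natural-log probability into a percentage."""
    return math.exp(logprob) * 100


# Hand-picked values: 0.0 is certainty, about -0.69 is one in two,
# about -4.61 is one in a hundred.
for lp in (0.0, -0.6931, -4.6052):
    print(f"logprob {lp:>8.4f} -> {logprob_to_percent(lp):7.2f}%")
```

The three lines come out at roughly 100%, 50%, and 1%.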

> Note: Requesting logprobs can be very useful for debugging prompts, understanding model behavior, or implementing advanced techniques like constraining model output.

In this example, we'll ask the model to continue the phrase "Today I'm feeling" and return the top 5 most likely candidates for the first token of its reply. You'll see which token it chose and what the other top contenders were.

In [None]:
prompt_capital = [
    {
        "role": "user",
        "content": "Today I'm feeling"
    }
]

response_capital = litellm.completion(
    model=MODEL_NAME,
    messages=prompt_capital,
    max_tokens=10,
    logprobs=True,
    top_logprobs=5
)

In [None]:
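# Log-probability data for the first generated token only
first_token = response_capital.choices[0].logprobs.content[0]
top_logprobs = first_token.top_logprobs
chosen_token = first_token.token
full_reply = response_capital.choices[0].message.content

print(f"Prompt: {prompt_capital[0]['content']}")
print(f"Full reply: {full_reply}")
print(f"The model chose the first token: '{chosen_token}'")
print("Here are the top 5 candidates for that first token:")

for logprob in top_logprobs:
    # exp() turns a natural-log probability back into a probability
    prob_percentage = math.exp(logprob.logprob) * 100
    print(f"  - Token: '{logprob.token}', Prob: {prob_percentage:.4f}%")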
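
The cell above only inspects the first generated token, but the same idea extends to every position in the reply. Here is a minimal offline sketch, assuming we have already pulled out a `(token, logprob)` pair for each position. The `sampled` values are made up for illustration; with a real response they would presumably come from the entries of `response_capital.choices[0].logprobs.content`. The helper `sequence_probability` is hypothetical.

```python
import math

# Made-up (token, logprob) pairs standing in for the per-position entries
# of a real response; they are not actual model output.
sampled = [("great", -0.22), (" today", -1.61), ("!", -0.51)]


def sequence_probability(token_logprobs):
    """Probability of the whole sequence: sum the logs, then exponentiate."""
    return math.exp(sum(lp for _, lp in token_logprobs))


for token, lp in sampled:
    print(f"{token!r}: {math.exp(lp) * 100:.2f}%")

print(f"Whole sequence: {sequence_probability(sampled) * 100:.2f}%")
```

Adding log probabilities is equivalent to multiplying the individual probabilities, which is why even a short reply made of fairly likely tokens can have a small overall probability.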