# 3.1 OpenAI API Deep Dive — Authentication, Request Structure & Streaming

## Playground Notebook

In this notebook, we'll work directly with the **OpenAI Python SDK** to understand:

| Topic | What You'll Learn |
|-------|-------------------|
| **API Authentication** | How API keys work, secure storage with `.env`, and client setup |
| **Request Structure** | The anatomy of `chat.completions.create()` — roles, parameters, response object |
| **Streaming** | Real-time token-by-token responses vs. standard blocking calls |

> **Model:** `gpt-4o-mini` — cost-efficient and capable for all experiments below.

---

In [3]:
#!pip install -r requirements.txt

In [4]:
import os
import time
import json
from dotenv import load_dotenv
from openai import OpenAI
from IPython.display import display, Markdown, HTML

# ============================================================
#  CONFIGURATION
# ============================================================
load_dotenv()  # loads OPENAI_API_KEY from .env file

MODEL = "gpt-4o-mini"

client = OpenAI()  # automatically reads OPENAI_API_KEY from env

print(f"\u2705 Client ready | Model: {MODEL}")

✅ Client ready | Model: gpt-4o-mini


In [5]:
# ============================================================
#  HELPER FUNCTIONS
# ============================================================

def chat(messages, max_tokens=150, **kwargs):
    """Send messages to OpenAI and display the response."""
    start = time.time()
    response = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        max_tokens=max_tokens,
        **kwargs
    )
    elapsed = time.time() - start
    content = response.choices[0].message.content
    display(Markdown(content))
    print(f"\n\u23f1\ufe0f {elapsed:.2f}s | Tokens: {response.usage.prompt_tokens}+{response.usage.completion_tokens}={response.usage.total_tokens}")
    return response


from html import escape  # stdlib; escapes <, >, & so message text renders literally


def show_messages(messages):
    """Pretty-print the message list being sent."""
    colors = {"system": "#e74c3c", "user": "#3498db", "assistant": "#2ecc71"}
    html = ""
    for msg in messages:
        role = msg["role"]
        color = colors.get(role, "#888")
        # Escape the content so characters like < and & don't break the HTML.
        content = escape(str(msg["content"]))
        html += (
            f'<div style="margin:6px 0;padding:8px 12px;border-left:4px solid {color};'
            f'background:#1e1e1e;border-radius:4px;">'
            f'<strong style="color:{color};text-transform:uppercase;">{role}</strong>'
            f'<br><span style="color:#ccc;">{content}</span></div>'
        )
    display(HTML(html))


print("\u2705 Helpers loaded")

✅ Helpers loaded


---

## 1. API Authentication — Setting Up Securely

Your API key is a **secret token** that identifies and authorizes your application.

```
✗  NEVER do this:
    client = OpenAI(api_key="sk-abc123...")   # hardcoded = leaked

✓  ALWAYS do this:
    # .env file:  OPENAI_API_KEY=sk-abc123...
    load_dotenv()
    client = OpenAI()  # reads from environment automatically
```
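
A fail-fast check can make a missing key obvious before any request is sent. The sketch below is one possible way to do it; `require_api_key` and `mask_key` are hypothetical helper names, not part of the OpenAI SDK, and the default variable name simply matches the one used in this notebook.

```python
import os


def require_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment, or raise with a helpful hint.

    Hypothetical helper, not part of the OpenAI SDK.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set. Add it to your .env file or export it "
            "in your shell before creating the client."
        )
    return key


def mask_key(key: str, head: int = 7, tail: int = 4) -> str:
    """Show only the first and last few characters of a secret."""
    if len(key) <= head + tail:
        return "*" * len(key)  # too short to reveal anything safely
    return f"{key[:head]}...{key[-tail:]}"
```

Calling `require_api_key()` right after `load_dotenv()` turns a confusing authentication failure later on into an immediate, readable error.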

### Experiment 1A: Verify Your Key is Loaded (Without Exposing It)

In [6]:
# Never print your full key! Only show enough to confirm it's loaded.
key = os.environ.get("OPENAI_API_KEY", "NOT SET")

if key == "NOT SET":
    print("\u274c OPENAI_API_KEY not found in environment!")
    print("   Create a .env file with: OPENAI_API_KEY=sk-...")
else:
    masked = key[:7] + "..." + key[-4:]
    print(f"\u2705 API Key loaded: {masked}")
    print(f"   Key length: {len(key)} characters")

✅ API Key loaded: sk-proj...ctAA
   Key length: 164 characters


### Experiment 1B: What Happens with a Bad Key?

In [7]:
from openai import AuthenticationError

bad_client = OpenAI(api_key="sk-fake-key-12345")

try:
    bad_client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Hi"}],
        max_tokens=5
    )
except AuthenticationError as e:
    print(f"\u274c AuthenticationError caught!")
    print(f"   Status: {e.status_code}")
    print(f"   Message: {e.body['message']}...")
    print(f"\n\u2705 This is exactly what you'd see with an invalid or expired key.")

❌ AuthenticationError caught!
   Status: 401
   Message: Incorrect API key provided: sk-fake-*****2345. You can find your API key at https://platform.openai.com/account/api-keys....

✅ This is exactly what you'd see with an invalid or expired key.


---

## 2. Request Structure — Anatomy of an API Call

Every chat request goes through `client.chat.completions.create()`. Only `model` and `messages` are required; every other parameter is optional:

| Parameter | Required? | Description |
|-----------|-----------|-------------|
| `model` | ✅ Yes | Which model to use (e.g. `gpt-4o-mini`) |
| `messages` | ✅ Yes | List of role/content dicts (the conversation) |
| `max_tokens` | Optional | Max tokens in the response |
| `temperature` | Optional | Randomness (0.0–2.0) |
| `top_p` | Optional | Nucleus sampling threshold |
| `stop` | Optional | Sequences that halt generation |
| `n` | Optional | Number of completions to generate |
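
To make the required/optional split concrete, here is a minimal sketch that assembles the keyword arguments for a request. `build_request` is a hypothetical helper; it only checks the temperature range listed in the table and leaves all other validation to the API.

```python
def build_request(model, messages, **options):
    """Assemble keyword arguments for chat.completions.create().

    Hypothetical helper: only the two required fields and the temperature
    range from the table are checked here.
    """
    if not model:
        raise ValueError("model is required")
    if not messages:
        raise ValueError("messages must contain at least one message")

    temperature = options.get("temperature")
    if temperature is not None and not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0.0 and 2.0")

    # Drop options left as None so the API applies its own defaults.
    extras = {k: v for k, v in options.items() if v is not None}
    return {"model": model, "messages": messages, **extras}


request = build_request(
    "gpt-4o-mini",
    [{"role": "user", "content": "Hi"}],
    temperature=0.3,
    max_tokens=80,
    stop=None,  # omitted from the final request
)
print(sorted(request))
```

The resulting dict could then be passed as `client.chat.completions.create(**request)`.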

### The 3 Message Roles

```
system    →  Sets behavior, persona, constraints (processed first)
user      →  The human's input: questions, data, instructions
assistant →  Model's prior responses (for multi-turn context)
```
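
A quick sanity check on the message list can catch typos in role names before a request is sent. This is a sketch under simplifying assumptions: `validate_messages` is a hypothetical helper, it knows only the three roles described above, it assumes text-only content, and "system message first" is treated as a convention rather than an API rule.

```python
VALID_ROLES = {"system", "user", "assistant"}  # the three roles covered here


def validate_messages(messages):
    """Return a list of problems found in a message list (empty if none)."""
    problems = []
    for i, msg in enumerate(messages):
        role = msg.get("role")
        if role not in VALID_ROLES:
            problems.append(f"message {i}: unknown role {role!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: content must be a string")
        if role == "system" and i != 0:
            problems.append(f"message {i}: system message is not first")
    return problems


example = [
    {"role": "system", "content": "You are a concise technical writer."},
    {"role": "user", "content": "Explain REST APIs."},
]
print(validate_messages(example))  # prints [] because nothing is wrong
```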

### Experiment 2A: Minimal Request — Just model + messages

In [8]:
# The simplest possible API call
messages = [
    {"role": "user", "content": "What is an API? One sentence."}
]

show_messages(messages)
response = chat(messages)

An API, or Application Programming Interface, is a set of rules and protocols that allows different software applications to communicate and interact with each other.


⏱️ 2.03s | Tokens: 15+28=43


### Experiment 2B: Full Request with System Prompt + Parameters

In [9]:
messages = [
    {"role": "system", "content": "You are a concise technical writer. Max 2 sentences."},
    {"role": "user", "content": "Explain REST APIs."}
]

show_messages(messages)
response = chat(messages, temperature=0.3, max_tokens=80)

REST APIs (Representational State Transfer Application Programming Interfaces) are architectural styles that allow different software applications to communicate over the internet using standard HTTP methods like GET, POST, PUT, and DELETE. They enable stateless interactions and resource manipulation through URLs, making it easier to integrate and scale web services.


⏱️ 2.24s | Tokens: 27+59=86


### Experiment 2C: Inspecting the Full Response Object

The API returns a rich response object. Let's explore its main fields.

In [10]:
response = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=20
)

print("\u250c\u2500 RESPONSE OBJECT \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510")
print(f"\u2502 id:              {response.id}")
print(f"\u2502 model:           {response.model}")
print(f"\u2502 created:         {response.created}")
print(f"\u2502 object:          {response.object}")
print(f"\u2502")
print(f"\u2502 \u250c\u2500 choices[0] \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510")
choice = response.choices[0]
print(f"\u2502 \u2502 index:          {choice.index}")
print(f"\u2502 \u2502 finish_reason:  {choice.finish_reason}")
print(f"\u2502 \u2502 role:           {choice.message.role}")
print(f"\u2502 \u2502 content:        {choice.message.content}")
print(f"\u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518")
print(f"\u2502")
print(f"\u2502 \u250c\u2500 usage \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510")
print(f"\u2502 \u2502 prompt_tokens:     {response.usage.prompt_tokens}")
print(f"\u2502 \u2502 completion_tokens: {response.usage.completion_tokens}")
print(f"\u2502 \u2502 total_tokens:      {response.usage.total_tokens}")
print(f"\u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518")
print(f"\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518")

┌─ RESPONSE OBJECT ──────────────────────────────────┐
│ id:              chatcmpl-DED48M7lld5MdvpMP7tTMFmKIjs5A
│ model:           gpt-4o-mini-2024-07-18
│ created:         1772279188
│ object:          chat.completion
│
│ ┌─ choices[0] ──────────────────────────┐
│ │ index:          0
│ │ finish_reason:  stop
│ │ role:           assistant
│ │ content:        Hello! How can I assist you today?
│ └──────────────────────────────────────┘
│
│ ┌─ usage ───────────────────────────────┐
│ │ prompt_tokens:     10
│ │ completion_tokens: 9
│ │ total_tokens:      19
│ └──────────────────────────────────────┘
└──────────────────────────────────────────────────┘


### Experiment 2D: `finish_reason` Values

| Value | Meaning |
|-------|---------|
| `stop` | Model finished naturally |
| `length` | Hit `max_tokens` limit — output was cut off |
| `tool_calls` | Model wants to call a function (covered in 3.2) |
| `content_filter` | Content was flagged and blocked |
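
In application code, each value in the table usually leads to a different follow-up step. The sketch below shows one way to encode that; `describe_finish` is a hypothetical helper and the suggested actions are illustrative, not prescribed by the API.

```python
def describe_finish(reason):
    """Map a finish_reason string to (is_complete, suggested_action)."""
    table = {
        "stop": (True, "use the response as-is"),
        "length": (False, "raise max_tokens or ask the model to continue"),
        "tool_calls": (False, "run the requested tool and send back its result"),
        "content_filter": (False, "show a fallback message to the user"),
    }
    # Unknown values are treated as incomplete rather than silently accepted.
    return table.get(reason, (False, f"unexpected finish_reason: {reason!r}"))


for reason in ("stop", "length"):
    complete, action = describe_finish(reason)
    print(f"{reason:<8} complete={complete}  ->  {action}")
```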

In [11]:
# Force a 'length' finish by setting max_tokens very low
print("=" * 60)
print("max_tokens=10 — Will it get cut off?")
print("=" * 60)

messages = [{"role": "user", "content": "Write a poem about the ocean."}]
show_messages(messages)
response = chat(messages, max_tokens=10)
print(f"finish_reason: {response.choices[0].finish_reason}")
print(f"\n\u26a0\ufe0f 'length' means the response was CUT OFF — it didn't finish naturally!")

# Now let it finish
print(f"\n{'=' * 60}")
print("max_tokens=60 — Enough room to finish")
print("=" * 60)

messages2 = [{"role": "user", "content": "Write a 2-line poem about the ocean."}]
show_messages(messages2)
response2 = chat(messages2, max_tokens=60)
print(f"finish_reason: {response2.choices[0].finish_reason}")
print(f"\n\u2705 'stop' means the model finished on its own.")

max_tokens=10 — Will it get cut off?


In the embrace of the ocean's hymn,  



⏱️ 2.44s | Tokens: 14+10=24
finish_reason: length

⚠️ 'length' means the response was CUT OFF — it didn't finish naturally!

max_tokens=60 — Enough room to finish


Endless waves whisper secrets of the deep,  
Where sunlit dreams and shadows softly sleep.


⏱️ 1.57s | Tokens: 17+19=36
finish_reason: stop

✅ 'stop' means the model finished on its own.


### Experiment 2E: Multi-Turn Conversations

The Chat Completions API is **stateless**: it keeps no memory between requests. To continue a conversation, you must pass the **full history** with every request.

In [12]:
conversation = [
    {"role": "system", "content": "You are a helpful math tutor. Be concise."},
    {"role": "user", "content": "What is 15% of 200?"},
    {"role": "assistant", "content": "15% of 200 = 0.15 \u00d7 200 = 30."},
    {"role": "user", "content": "What was the previously generated answer?"}
]

print("MULTI-TURN: The model sees the full conversation history")
print("=" * 60)
show_messages(conversation)
_ = chat(conversation, max_tokens=60)

MULTI-TURN: The model sees the full conversation history


The previously generated answer is 30.


⏱️ 1.24s | Tokens: 62+8=70


In [13]:
# Build a conversation dynamically — each response feeds into the next
conversation = [
    {"role": "system", "content": "You are a travel guide. Give 1-sentence answers only."}
]

user_turns = [
    "I'm visiting Tokyo. What's one must-see?",
    "What food should I try there?",
    "Any etiquette tips?"
]

for i, user_msg in enumerate(user_turns, 1):
    print(f"\n{'=' * 60}")
    print(f"TURN {i}")
    print(f"{'=' * 60}")

    conversation.append({"role": "user", "content": user_msg})
    show_messages(conversation)
    response = chat(conversation, max_tokens=60)

    assistant_msg = response.choices[0].message.content
    conversation.append({"role": "assistant", "content": assistant_msg})

print(f"\n{'=' * 60}")
print(f"\u2139\ufe0f Total messages in history: {len(conversation)}")
print("\u26a0\ufe0f Notice: token count GROWS each turn because we resend the entire history!")


TURN 1


You must visit the historic Senso-ji Temple in Asakusa.


⏱️ 0.83s | Tokens: 34+15=49

TURN 2


You should try authentic sushi at Tsukiji Outer Market.


⏱️ 0.64s | Tokens: 64+12=76

TURN 3


Always bow slightly when greeting or thanking someone, as it's a sign of respect.


⏱️ 2.13s | Tokens: 88+16=104

ℹ️ Total messages in history: 7
⚠️ Notice: token count GROWS each turn because we resend the entire history!
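
Because the prompt grows with every turn, long conversations are often capped. The sketch below keeps the system message plus the most recent messages. `trim_history` is a hypothetical helper; it counts messages rather than tokens, so it is only a rough limit, and anything it drops is forgotten by the model.

```python
def trim_history(messages, max_messages=6):
    """Keep the first system message (if any) plus the most recent messages.

    Hypothetical helper: counts messages, not tokens. The input list is
    left unchanged; a new list is returned.
    """
    if max_messages < 1:
        raise ValueError("max_messages must be at least 1")
    system = [m for m in messages if m["role"] == "system"][:1]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_messages:]


history = [{"role": "system", "content": "You are a travel guide."}]
for i in range(6):
    role = "user" if i % 2 == 0 else "assistant"
    history.append({"role": role, "content": f"message {i}"})

trimmed = trim_history(history, max_messages=4)
print(len(history), "->", len(trimmed))  # prints: 7 -> 5
```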


---

## 3. Streaming Responses — Token by Token

By default, the API generates the **complete response first**, then sends it all at once.

With **streaming**, tokens arrive as they're generated — just like ChatGPT's typing effect.

```
Normal:    [wait 2s...] → "Here is the full response all at once."
Streaming: "Here" → " is" → " the" → ... (tokens arrive one by one)
```
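
The difference can be felt without calling the API at all. The sketch below uses a local generator as a stand-in for a stream and measures how soon the first piece arrives compared with the whole text. `fake_token_stream` and `consume` are hypothetical names, and the delay is an arbitrary value chosen for illustration.

```python
import time


def fake_token_stream(text, delay=0.01):
    """Yield `text` word by word, pausing to imitate generation time.

    Hypothetical stand-in for a real stream; it makes no network calls.
    """
    for word in text.split(" "):
        time.sleep(delay)
        yield word + " "


def consume(stream):
    """Read a stream, returning (full_text, time_to_first_piece, total_time)."""
    start = time.perf_counter()
    first = None
    parts = []
    for piece in stream:
        if first is None:
            first = time.perf_counter() - start
        parts.append(piece)
    total = time.perf_counter() - start
    return "".join(parts).rstrip(), first, total


text, first, total = consume(fake_token_stream("Here is the full response"))
print(f"first piece after {first:.3f}s, complete after {total:.3f}s")
```

A blocking call would only hand back the text after the full `total` time; a streaming consumer can start displaying output after `first`.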

### Experiment 3A: Normal vs. Streaming — Side by Side

In [14]:
prompt = "List 3 benefits of exercise. One sentence each."
messages = [{"role": "user", "content": prompt}]

# --- Normal (blocking) ---
print("=" * 60)
print("NORMAL MODE (blocking — waits for full response)")
print("=" * 60)

show_messages(messages)
response = chat(messages, max_tokens=100)
print(f"\u2139\ufe0f Entire response arrived at once after the wait.")

NORMAL MODE (blocking — waits for full response)


1. Exercise improves cardiovascular health by strengthening the heart and enhancing circulation, which lowers the risk of heart disease.  
2. Regular physical activity boosts mental health by reducing symptoms of anxiety and depression while improving mood and cognitive function.  
3. Engaging in consistent exercise aids in weight management by increasing metabolism and promoting fat loss while building lean muscle mass.  


⏱️ 2.02s | Tokens: 18+71=89
ℹ️ Entire response arrived at once after the wait.


In [15]:
# --- Streaming ---
print("=" * 60)
print("STREAMING MODE (tokens arrive as generated)")
print("=" * 60)

messages = [{"role": "user", "content": prompt}]
show_messages(messages)

start = time.time()
first_token_time = None

stream = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    max_tokens=100,
    stream=True  # <-- just add this!
)

collected = ""
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        if first_token_time is None:
            first_token_time = time.time() - start
        print(delta, end="", flush=True)
        collected += delta

elapsed = time.time() - start
print(f"\n\n\u23f1\ufe0f Time to first token: {first_token_time:.2f}s")
print(f"\u23f1\ufe0f Total time: {elapsed:.2f}s")
print(f"\u2705 Collected {len(collected)} characters while streaming")

STREAMING MODE (tokens arrive as generated)


1. Exercise improves cardiovascular health by strengthening the heart and enhancing blood circulation, reducing the risk of heart disease.  
2. Regular physical activity boosts mental health by releasing endorphins, which can alleviate symptoms of anxiety and depression.  
3. Exercise helps maintain a healthy weight by burning calories and increasing metabolism, promoting overall physical fitness.  

⏱️ Time to first token: 0.55s
⏱️ Total time: 1.87s
✅ Collected 402 characters while streaming


### Experiment 3B: Inspecting Streaming Chunks

Each chunk is a lightweight object — much simpler than the full response.

In [16]:
messages = [{"role": "user", "content": "Say 'Hello World'"}]
show_messages(messages)

stream = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    max_tokens=10,
    stream=True
)

print(f"{'Chunk#':<8} {'finish_reason':<16} {'delta.content'}")
print("-" * 50)

for i, chunk in enumerate(stream):
    choice = chunk.choices[0]
    content = choice.delta.content or ""
    reason = choice.finish_reason or "-"
    print(f"{i:<8} {reason:<16} {repr(content)}")

print("\n\u2139\ufe0f The last chunk has finish_reason='stop' and empty content.")

Chunk#   finish_reason    delta.content
--------------------------------------------------
0        -                ''
1        -                'Hello'
2        -                ' World'
3        -                '!'
4        stop             ''

ℹ️ The last chunk has finish_reason='stop' and empty content.


### Experiment 3C: Collecting Full Response While Streaming

In production, you often need to **stream to the user AND store the complete response** — e.g., to save to a database or add to conversation history.

In [17]:
def stream_and_collect(messages, max_tokens=100, **kwargs):
    """Stream response to screen AND collect the full text."""
    stream = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        max_tokens=max_tokens,
        stream=True,
        **kwargs
    )

    full_response = ""
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
            full_response += delta

    print()  # newline
    return full_response


# Use it — stream AND save
messages = [{"role": "user", "content": "Name 3 programming languages and one strength of each. Brief."}]

show_messages(messages)
print("Streaming...\n")
saved = stream_and_collect(messages)

print(f"\n\u2705 Full response saved! Length: {len(saved)} characters")
print(f"   First 80 chars: {saved[:80]}...")

Streaming...

Sure! Here are three programming languages along with one strength of each:

1. **Python**: Easy to read and write, making it ideal for beginners and rapid prototyping.

2. **JavaScript**: Highly versatile and the backbone of web development, enabling interactive and dynamic websites.

3. **C++**: Offers fine-grained control over system resources, making it excellent for performance-critical applications like game development and system software.

✅ Full response saved! Length: 450 characters
   First 80 chars: Sure! Here are three programming languages along with one strength of each:

1. ...


### Experiment 3D: Stream with Error Handling (Production Pattern)

In [18]:
from openai import APIError, RateLimitError, APIConnectionError


def stream_safe(messages, max_tokens=100):
    """Production-ready streaming with error handling."""
    try:
        stream = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            max_tokens=max_tokens,
            stream=True
        )

        full_response = ""
        for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:
                print(delta, end="", flush=True)
                full_response += delta

        print()
        return full_response

    except RateLimitError:
        print("\n\u26a0\ufe0f Rate limited! Wait and retry.")
    except APIConnectionError:
        print("\n\u26a0\ufe0f Connection failed! Check your internet.")
    except APIError as e:
        print(f"\n\u26a0\ufe0f API Error: {e.message}")
    return None


# Test it
messages = [{"role": "user", "content": "What is Python? One sentence."}]
show_messages(messages)
result = stream_safe(messages)
print(f"\n\u2705 Got response: {result is not None}")

Python is a high-level, interpreted programming language known for its readability and versatility, making it popular for various applications including web development, data analysis, artificial intelligence, and automation.

✅ Got response: True
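
The "wait and retry" advice for rate limits is commonly implemented as exponential backoff. The following is a generic sketch, not the SDK's own retry logic: `retry_with_backoff` is a hypothetical helper, the delays and attempt count are arbitrary, and a `TimeoutError` stands in for a rate-limit error so the example runs offline.

```python
import random
import time


def retry_with_backoff(func, retry_on, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); on a `retry_on` exception, wait and try again.

    Delays double each attempt (base_delay, 2x, 4x, ...) plus up to 10%
    random jitter. The last failure is re-raised.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.1))


attempts = {"count": 0}


def flaky():
    """Fails twice, then succeeds (stands in for a rate-limited API call)."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated failure")
    return "ok"


# sleep is replaced with a no-op so the demo finishes instantly.
result = retry_with_backoff(flaky, retry_on=TimeoutError, sleep=lambda s: None)
print(result, "after", attempts["count"], "attempts")  # prints: ok after 3 attempts
```

To use this with a real call, the wrapped function has to let the exception propagate; `stream_safe` above catches errors and returns `None`, so it would need to re-raise `RateLimitError` instead. The SDK client also accepts a `max_retries` option, which may be enough for simple cases.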


### Experiment 3E: Getting Token Usage in Streaming Mode

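By default, usage stats are **not available** during streaming. Pass `stream_options={"include_usage": True}` to request them; they arrive in one extra final chunk whose `choices` list is empty, which is why the loop below checks `chunk.choices` before indexing.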

In [19]:
messages = [{"role": "user", "content": "What is gravity? One sentence."}]
show_messages(messages)

stream = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    max_tokens=50,
    stream=True,
    stream_options={"include_usage": True}  # <-- request usage stats
)

for chunk in stream:
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta:
        print(delta, end="", flush=True)

    # Usage arrives in the final chunk
    if chunk.usage:
        print(f"\n\n\u2139\ufe0f Token usage (from final chunk):")
        print(f"   Prompt:     {chunk.usage.prompt_tokens}")
        print(f"   Completion: {chunk.usage.completion_tokens}")
        print(f"   Total:      {chunk.usage.total_tokens}")

Gravity is a fundamental force of nature that attracts two bodies with mass toward each other, influencing their motion and structure in the universe.

ℹ️ Token usage (from final chunk):
   Prompt:     14
   Completion: 26
   Total:      40
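
The same loop can be wrapped in a function that returns both the text and the usage object. The sketch below runs offline: `read_stream` and `fake_chunk` are hypothetical names, the fake chunks mimic only the fields the loop reads, and the token numbers are made up.

```python
from types import SimpleNamespace as NS


def read_stream(chunks):
    """Return (text, usage) from an iterable of chat-completion chunks."""
    parts, usage = [], None
    for chunk in chunks:
        # The usage-only chunk has an empty `choices` list, so guard the index.
        if chunk.choices:
            delta = chunk.choices[0].delta.content
            if delta:
                parts.append(delta)
        if getattr(chunk, "usage", None):
            usage = chunk.usage
    return "".join(parts), usage


def fake_chunk(content=None, usage=None, has_choice=True):
    """Hypothetical stand-in that mimics only the fields read above."""
    choices = [NS(delta=NS(content=content))] if has_choice else []
    return NS(choices=choices, usage=usage)


chunks = [
    fake_chunk(""),                 # first chunk: no text yet
    fake_chunk("Gravity"),
    fake_chunk(" attracts mass."),
    fake_chunk(                     # final chunk: usage only (made-up numbers)
        has_choice=False,
        usage=NS(prompt_tokens=5, completion_tokens=4, total_tokens=9),
    ),
]
text, usage = read_stream(chunks)
print(text, "|", usage.total_tokens)
```

With a real stream, `read_stream(stream)` would be called on the object returned by `client.chat.completions.create(..., stream=True, stream_options={"include_usage": True})`.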


### When to Use Streaming vs Normal

| Use Case | Mode | Why |
|----------|------|-----|
| Chat interfaces | **Streaming** | Users see tokens appear in real time |
| Backend processing | **Normal** | Simpler code, full response at once |
| Long responses | **Streaming** | Avoids timeout, shows progress |
| JSON extraction | **Normal** | Need complete JSON before parsing |
| Voice assistants | **Streaming** | Start TTS while still generating |
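
The JSON row is easy to demonstrate offline: a partially received JSON document cannot be parsed until the last piece arrives. In this sketch the JSON string and the 8-character pieces are invented for illustration and simply imitate stream deltas.

```python
import json

full = '{"name": "Ada", "skills": ["math", "code"]}'
pieces = [full[i:i + 8] for i in range(0, len(full), 8)]  # pretend stream deltas

buffer = ""
for piece in pieces:
    buffer += piece
    try:
        json.loads(buffer)
        status = "valid JSON"
    except json.JSONDecodeError:
        status = "incomplete"
    print(f"{len(buffer):>3} chars: {status}")

# Only the fully assembled buffer can be parsed.
data = json.loads(buffer)
print(data["skills"])
```

Every intermediate buffer fails to parse, so streaming brings little benefit when the consumer needs the whole object anyway.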

---

## 4. Sandbox — Try It Yourself!

In [20]:
# ============================================================
#  SANDBOX - Edit and re-run!
# ============================================================

my_messages = [
    {"role": "system", "content": "You are a helpful assistant. Keep it brief."},
    {"role": "user", "content": "What makes Python popular?"}
]

use_streaming = True   # Toggle: True for streaming, False for normal
my_max_tokens = 80

# ============================================================

show_messages(my_messages)

if use_streaming:
    print("\n[STREAMING]")
    _ = stream_and_collect(my_messages, max_tokens=my_max_tokens)
else:
    print("\n[NORMAL]")
    _ = chat(my_messages, max_tokens=my_max_tokens)


[STREAMING]
Python's popularity stems from several key factors:

1. **Ease of Learning**: Its simple and readable syntax makes it accessible for beginners.
2. **Versatility**: Python is used in various domains, including web development, data science, artificial intelligence, machine learning, and automation.
3. **Large Community**: A strong community provides extensive resources, libraries, and frameworks (like Django,


---

## Key Takeaways

| Concept | What to Remember |
|---------|------------------|
| **API Key** | Always load from `.env` — never hardcode. One key per project. |
| **Client Setup** | `OpenAI()` reads `OPENAI_API_KEY` from environment automatically |
| **Request** | Only `model` + `messages` are required; everything else is optional |
| **Messages** | List of `{role, content}` dicts — system/user/assistant |
| **Response** | Contains `choices[0].message.content`, `finish_reason`, and `usage` |
| **finish_reason** | `stop` = natural end, `length` = cut off, `tool_calls` = function call |
| **Multi-turn** | No built-in memory — resend full history every time |
| **Streaming** | Add `stream=True` — tokens arrive chunk by chunk |
| **Stream Usage** | Pass `stream_options={"include_usage": True}` for token counts |