# Direct OpenAI Queries (Without Retrieval)

In this notebook we interact **directly** with an OpenAI Large Language Model (LLM).

This represents the **baseline usage pattern**:
- no external documents
- no vector databases
- no Retrieval-Augmented Generation (RAG)

Understanding this baseline is essential before introducing more advanced architectures.

In [1]:
# ------------------------------------------------------------
# Proxy configuration (DWD network)
# ------------------------------------------------------------
import os

PROXY = "http://ofsquid.dwd.de:8080"

os.environ["HTTP_PROXY"]  = PROXY
os.environ["HTTPS_PROXY"] = PROXY
os.environ["http_proxy"]  = PROXY
os.environ["https_proxy"] = PROXY

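# Do not proxy local traffic (set both spellings, like the proxy variables above)
os.environ["NO_PROXY"] = "localhost,127.0.0.1"
os.environ["no_proxy"] = "localhost,127.0.0.1"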

print("Proxy configured:")
print(" ", PROXY)

Proxy configured:
  http://ofsquid.dwd.de:8080


## Secure API key handling

API keys must **never** be hard-coded into notebooks or scripts.

We load them from a `.env` file, which:
- is ignored by Git (once it is listed in `.gitignore`)
- keeps credentials local
- supports reproducible workflows

This is standard practice in research and production systems.
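
To make the idea concrete: a `.env` file is plain text with one `KEY=value` pair per line. The sketch below parses such text by hand and masks the value before printing. The helper names `parse_env_text` and `mask_secret` and the placeholder key are assumptions made for this example; `python-dotenv` itself handles quoting and other edge cases more thoroughly.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines; skip blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Remove optional surrounding quotes around the value
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def mask_secret(secret: str, visible: int = 4) -> str:
    """Show only the last few characters so logs never contain the full key."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


# Hypothetical file content with a placeholder, not a real key
example_env = """
# .env (keep this file out of version control)
OPENAI_API_KEY="sk-placeholder-not-a-real-key"
"""

parsed = parse_env_text(example_env)
masked_key = mask_secret(parsed["OPENAI_API_KEY"])
print(masked_key)
```

If a key ever needs to appear in a log or notebook output, printing a masked form like this is safer than printing the key itself.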

In [2]:
# ------------------------------------------------------------
# Load API keys from .env file
# ------------------------------------------------------------
from dotenv import load_dotenv
import os

# Load variables from .env in the current directory (or parent dirs)
load_dotenv()

# Sanity check (do NOT print the key itself)
if not os.getenv("OPENAI_API_KEY"):  # also catches an empty value
    raise RuntimeError("OPENAI_API_KEY not found in environment")

print("OPENAI_API_KEY loaded.")

# ------------------------------------------------------------
# OpenAI backend (requires OPENAI_API_KEY in environment)
# ------------------------------------------------------------
USE_OPENAI = True  # set False if you want local-only

if USE_OPENAI:
    from openai import OpenAI
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    print("OpenAI client initialized.")
else:
    print("OpenAI disabled.")

OPENAI_API_KEY loaded.
OpenAI client initialized.


## First minimal query

We now send a simple prompt to the model and receive a response.

In [3]:
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Explain what a transformer model is in three sentences."}
    ]
)

print(response.choices[0].message.content)

A transformer model is a type of neural network architecture designed for processing sequential data, particularly in natural language processing tasks. It uses self-attention mechanisms to weigh the importance of different words in a sequence, allowing the model to capture long-range dependencies and contextual relationships effectively. Transformers have become the foundation for many advanced language models, including BERT and GPT, due to their scalability and performance in generating and understanding text.
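
Besides the text, the response object carries metadata that is useful when comparing runs, such as token counts and the reason generation stopped. The sketch below collects a few of these fields. The helper name `summarize_response` is hypothetical, and the stub object with made-up values only mimics the response shape so the example runs without an API call.

```python
from types import SimpleNamespace


def summarize_response(response) -> dict:
    """Collect the answer text plus basic metadata from a chat completion."""
    choice = response.choices[0]
    return {
        "text": choice.message.content,
        "finish_reason": choice.finish_reason,  # e.g. "stop" or "length"
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        "total_tokens": response.usage.total_tokens,
    }


# Stand-in with made-up values, shaped like a real response object
fake_response = SimpleNamespace(
    choices=[SimpleNamespace(
        message=SimpleNamespace(content="A transformer is ..."),
        finish_reason="stop",
    )],
    usage=SimpleNamespace(prompt_tokens=17, completion_tokens=5, total_tokens=22),
)

summary = summarize_response(fake_response)
print(summary)
```

In the notebook, the same helper can be called on the real `response` from the cell above.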


## Structured prompting

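Separating the system message (standing instructions for the assistant) from the user message (the actual request) gives better control over tone and scope. A low `temperature` (here 0.2) additionally makes the output less random.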
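
A small helper can make the role structure explicit. This is only a sketch: `build_messages` is a hypothetical name, not part of the OpenAI library. It assembles the list of role/content dictionaries that is passed as `messages`.

```python
from typing import Optional


def build_messages(user_prompt: str, system_prompt: Optional[str] = None) -> list:
    """Return a chat message list: optional system message first, then the user message."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages


messages = build_messages(
    "Explain self-attention mathematically.",
    system_prompt="You are a concise technical assistant.",
)
for message in messages:
    print(f'{message["role"]:>6}: {message["content"]}')
```

Keeping the system prompt in one place like this makes it easy to reuse the same instructions across several queries.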


In [4]:
from IPython.display import display, Markdown

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Explain self-attention mathematically."}
    ],
    temperature=0.2,
)

answer = response.choices[0].message.content
display(Markdown(answer))

Self-attention is a mechanism used in neural networks, particularly in transformer architectures, to compute a representation of a sequence by relating different positions within that sequence. Here’s a mathematical breakdown of self-attention:

1. **Input Representation**: 
   Let \( X \in \mathbb{R}^{n \times d} \) be the input matrix, where \( n \) is the number of tokens (or words) in the sequence, and \( d \) is the dimensionality of each token's embedding.

2. **Linear Transformations**:
   We create three matrices: Query \( Q \), Key \( K \), and Value \( V \) by applying learned linear transformations:
   \[
   Q = XW^Q, \quad K = XW^K, \quad V = XW^V
   \]
   where \( W^Q, W^K, W^V \in \mathbb{R}^{d \times d_k} \) are weight matrices for queries, keys, and values, respectively, and \( d_k \) is the dimensionality of the keys.

3. **Dot Product Attention Scores**:
   The attention scores are computed using the dot product of the queries and keys:
   \[
   \text{Attention\_scores} = \frac{QK^T}{\sqrt{d_k}}
   \]
   The division by \( \sqrt{d_k} \) is a scaling factor to prevent the dot products from growing too large, which can lead to gradients that are too small.

4. **Softmax Normalization**:
   We apply the softmax function to the attention scores to obtain the attention weights:
   \[
   \text{Attention\_weights} = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)
   \]
   This results in a matrix where each row sums to 1, representing the distribution of attention for each token.

5. **Weighted Sum of Values**:
   The output of the self-attention mechanism is computed as a weighted sum of the values:
   \[
   \text{Output} = \text{Attention\_weights} \cdot V
   \]

6. **Final Output**:
   The output matrix is of the same shape as the input matrix \( X \), allowing it to be used in subsequent layers of the model.

In summary, self-attention allows each token to attend to all other tokens in the sequence, capturing contextual relationships effectively. The entire process can be summarized as:
\[
\text{Output} = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V
\]
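
The summary formula in the model's answer can be checked numerically. The following is a minimal NumPy sketch of single-head attention, assuming toy sizes and random matrices in place of learned weights; the function name `self_attention` is chosen for this example. Note that the output has shape `(n, d_k)`, which matches the shape of `X` only when `d_k` equals `d`.

```python
import numpy as np


def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention (no masking, no batching)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # shape (n, n)
    scores = scores - scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights


rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 3                      # arbitrary toy sizes
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d_k)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(output.shape)          # (4, 3)
print(weights.sum(axis=1))   # each row sums to 1, up to rounding
```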

## Streaming model responses

By default, language model responses are returned only after the full answer
has been generated. This is simple, but it can feel slow for longer outputs.

With **streaming**, the API delivers the output incrementally, in small chunks that typically hold one or a few tokens each.
This allows:
- immediate feedback to the user
- progressive rendering of long answers
- more interactive user interfaces

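Streaming does **not** change how the model generates its answer; it only
affects how the output is delivered. The streamed answer below still differs in
wording from the earlier one because sampling at `temperature=0.2` is not
deterministic.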
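
The accumulation pattern can be tried without an API call. In the sketch below, `fake_stream` is a stand-in generator that yields objects shaped like streaming chunks, and `collect_stream` is a hypothetical helper; neither is part of the OpenAI library.

```python
from types import SimpleNamespace


def fake_stream(pieces):
    """Yield objects shaped like streaming chunks (stand-in for the API stream)."""
    for piece in pieces:
        delta = SimpleNamespace(content=piece)
        yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])
    # Mimic a closing chunk whose delta carries no content
    yield SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=None))])


def collect_stream(stream, on_update=None):
    """Accumulate streamed text, calling on_update with the text so far."""
    accumulated = ""
    for chunk in stream:
        if not chunk.choices:
            continue
        content = chunk.choices[0].delta.content
        if content:
            accumulated += content
            if on_update is not None:
                on_update(accumulated)
    return accumulated


snapshots = []
final_text = collect_stream(
    fake_stream(["Self-", "attention ", "relates tokens."]),
    on_update=snapshots.append,
)
print(snapshots)
print(final_text)
```

Each snapshot is what a user interface would render at that moment; the final text is the concatenation of all pieces, the same string a non-streaming call would return in one piece.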

In [5]:
from IPython.display import display, Markdown

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Explain self-attention mathematically."}
    ],
    temperature=0.2,
    stream=True,
)

accumulated = ""
display_handle = display(Markdown(""), display_id=True)

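for chunk in stream:
    # Some chunks may arrive without choices (for example usage-only chunks)
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # The final chunk usually has no content, so skip empty deltas
    if delta and delta.content:
        accumulated += delta.content
        display_handle.update(Markdown(accumulated))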


Self-attention is a mechanism used in neural networks, particularly in transformer architectures, to compute a representation of a sequence by relating different positions within that sequence. Here's a mathematical explanation of self-attention:

1. **Input Representation**: 
   Let \( X \in \mathbb{R}^{n \times d} \) be the input matrix, where \( n \) is the number of tokens in the sequence and \( d \) is the dimensionality of each token's embedding.

2. **Linear Transformations**:
   We create three matrices: the Query \( Q \), Key \( K \), and Value \( V \) matrices by applying learned linear transformations:
   \[
   Q = XW^Q, \quad K = XW^K, \quad V = XW^V
   \]
   where \( W^Q, W^K, W^V \in \mathbb{R}^{d \times d_k} \) are weight matrices for queries, keys, and values, respectively, and \( d_k \) is the dimensionality of the queries and keys.

3. **Attention Scores**:
   The attention scores are computed by taking the dot product of the query with all keys:
   \[
   \text{scores} = QK^T \in \mathbb{R}^{n \times n}
   \]
   This results in a matrix where each element \( \text{scores}_{ij} \) represents the attention score of the \( i \)-th token with respect to the \( j \)-th token.

4. **Scaling**:
   To prevent large values in the softmax function, the scores are scaled by the square root of the dimensionality of the keys:
   \[
   \text{scaled\_scores} = \frac{\text{scores}}{\sqrt{d_k}}
   \]

5. **Softmax**:
   We apply the softmax function to the scaled scores to obtain the attention weights:
   \[
   \text{attention\_weights} = \text{softmax}(\text{scaled\_scores}) \in \mathbb{R}^{n \times n}
   \]
   Each row of this matrix sums to 1 and represents the distribution of attention for each token.

6. **Weighted Sum of Values**:
   Finally, the output of the self-attention mechanism is computed as a weighted sum of the values:
   \[
   \text{output} = \text{attention\_weights} V \in \mathbb{R}^{n \times d_v}
   \]
   where \( d_v \) is the dimensionality of the values (often equal to \( d \)).

In summary, self-attention allows each token to attend to all other tokens in the sequence, producing a context-aware representation that captures dependencies regardless of their distance in the input sequence.