<a href="https://colab.research.google.com/github/mshumer/gpt-oss-pro-mode/blob/main/OpenAI_Open_Source_Pro_Mode.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Made by Matt Shumer ([@mattshumer_](https://x.com/mattshumer_) on X).

In [1]:
# @title Run this cell to set up Pro Mode
!pip3 install ollama



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.1.1[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [None]:

from typing import List, Dict, Any
import time, os
import concurrent.futures as cf
import ollama

MODEL = "gpt-oss:120b"
MAX_COMPLETION_TOKENS = 30000


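def _one_completion(client: ollama.Client, question: str, temperature: float) -> str:
    """
    Send a question to the Ollama API and return the response text.

    Retries up to three times with exponential backoff (0.5 s, then 1 s)
    and re-raises the last error if every attempt fails.
    """
    delay = 0.5
    for attempt in range(3):
        try:
            response = client.chat(
                model=MODEL,
                messages=[
                    {'role': 'user', 'content': question},
                ],
                # Ollama caps generated tokens with 'num_predict';
                # 'max_completion_tokens' is the OpenAI-style name.
                options={'temperature': temperature, 'num_predict': MAX_COMPLETION_TOKENS}
            )
            return response['message']['content']
        except Exception:
            if attempt == 2:
                raise  # out of retries: surface the last error
            time.sleep(delay)
            delay *= 2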


def _build_synthesis_messages(candidates: List[str]) -> List[Dict[str, str]]:
    numbered = "\n\n".join(
        f"<cand {i+1}>\n{txt}\n</cand {i+1}>" for i, txt in enumerate(candidates)
    )
    system = (
        "You are an expert editor. Synthesize ONE best answer from the candidate "
        "answers provided, merging strengths, correcting errors, and removing repetition. "
        "Do not mention the candidates or the synthesis process. Be decisive and clear."
    )
    user = (
        f"You are given {len(candidates)} candidate answers delimited by <cand i> tags.\n\n"
        f"{numbered}\n\nReturn the single best final answer."
    )
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

def pro_mode(client: ollama.Client, prompt: str, n_runs: int) -> Dict[str, Any]:
    """
    Fan out n_runs parallel generations at T=0.9 and synthesize a final answer at T=0.2.
    The caller supplies an already-configured Ollama client (host and auth headers).
    Returns: {"final": str, "candidates": List[str]}
    """
    assert n_runs >= 1, "n_runs must be >= 1"

    # Parallel candidate generations (threaded; Colab-friendly)
    max_workers = min(n_runs, 16)
    candidates: List[str] = [""] * n_runs  # placeholders; filled by index to preserve order
    with cf.ThreadPoolExecutor(max_workers=max_workers) as ex:
        fut_to_idx = {
            ex.submit(_one_completion, client, prompt, 0.9): i
            for i in range(n_runs)
        }
        # as_completed yields futures in finish order; the index map restores submit order
        for fut in cf.as_completed(fut_to_idx):
            i = fut_to_idx[fut]
            candidates[i] = fut.result()

    # Synthesis pass
    messages = _build_synthesis_messages(candidates)
    final_resp = client.chat(
        model=MODEL,
        messages=messages,
        # Ollama caps generated tokens with 'num_predict'
        options={'temperature': 0.2, 'num_predict': MAX_COMPLETION_TOKENS}
    )
    final = final_resp['message']['content']

    return {"final": final, "candidates": candidates}
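
The fan-out step above writes each result into a slot chosen by index, so candidates stay in submission order even though `as_completed` yields futures in completion order. Below is a minimal offline sketch of that pattern. `slow_echo` and `fan_out` are hypothetical names used only for illustration; `slow_echo` stands in for a model call and is rigged so that later submissions finish first.

```python
import concurrent.futures as cf
import time
from typing import List


def slow_echo(i: int) -> str:
    """Hypothetical stand-in for a model call; higher indices finish sooner."""
    time.sleep(0.05 * (3 - i))
    return f"answer-{i}"


def fan_out(n_runs: int) -> List[str]:
    """Run n_runs tasks in threads and return results in submission order."""
    results: List[str] = [""] * n_runs  # one placeholder slot per task
    with cf.ThreadPoolExecutor(max_workers=min(n_runs, 16)) as ex:
        # Map each future back to the index it was submitted with.
        fut_to_idx = {ex.submit(slow_echo, i): i for i in range(n_runs)}
        for fut in cf.as_completed(fut_to_idx):
            results[fut_to_idx[fut]] = fut.result()
    return results


ordered = fan_out(3)
print(ordered)  # ['answer-0', 'answer-1', 'answer-2']
```

Because `fut.result()` re-raises any exception from the worker, a failed task aborts the whole fan-out instead of leaving a placeholder behind.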


In [None]:
PROMPT = "Explain self-play in reinforcement learning with a concrete example."
NUMBER_OF_CANDIDATES = 5 # start with five, go up if you need more intelligence!
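# Prefer the OLLAMA_API_KEY environment variable; fall back to the placeholder below.
OLLAMA_API_KEY = os.environ.get("OLLAMA_API_KEY", "yourkey")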

client = ollama.Client(
    host="https://ollama.com",
    headers={'Authorization': OLLAMA_API_KEY}
)



result = pro_mode(client, PROMPT, NUMBER_OF_CANDIDATES)

print("\n=== FINAL ===\n", result["final"])
# To inspect candidates:
# for i, c in enumerate(result["candidates"], 1): print(f"\n--- Candidate {i} ---\n{c}")


=== FINAL ===
 **Self‑play in reinforcement learning**  
Self‑play turns a two‑player (or multi‑agent) game into its own data‑generator: the learning agent repeatedly plays against a copy of itself (or a past version). Because the opponent improves together with the learner, the difficulty of the task automatically adapts, eliminating the need for hand‑crafted opponents or expert demonstrations.

---

## 1. Why self‑play works

| Reason | Effect on learning |
|--------|--------------------|
| **Automatic curriculum** | Early games are easy (both agents are weak); later games become harder as the policy improves, keeping the learning signal informative. |
| **No external labels** | The only reward needed is the game outcome (win = +1, loss = ‑1, draw = 0). |
| **Full‑tree exploration** | An evolving opponent forces the learner to discover strategies that would never appear against a static opponent. |
| **Convergence to equilibrium** | In deterministic zero‑sum games the process drives