## Welcome to the Second Lab - Week 1, Day 3

Today we will work with lots of models! This is a way to get comfortable with APIs.

<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/stop.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#ff7800;">Important point - please read</h2>
            <span style="color:#ff7800;">The way I collaborate with you may be different from other courses you've taken. I prefer not to type code while you watch. Rather, I execute Jupyter Labs, like this, and give you an intuition for what's going on. My suggestion is that you carefully execute this yourself, <b>after</b> watching the lecture. Add print statements to understand what's going on, and then come up with your own variations.<br/><br/>If you have time, I'd love it if you submitted a PR for changes in the community_contributions folder; instructions are in the resources. Also, if you have a GitHub account, use this to showcase your variations. Not only is this essential practice, but it demonstrates your skills to others, including perhaps future clients or employers...
            </span>
        </td>
    </tr>
</table>

In [3]:
# Start with imports - ask ChatGPT to explain any package that you don't know

import os
import json
from dotenv import load_dotenv
from openai import OpenAI
from anthropic import Anthropic
from IPython.display import Markdown, display

In [4]:
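# Always remember to do this! It loads the variables from your .env file;
# override=True lets those values replace any already set in the environment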
load_dotenv(override=True)

True

In [9]:
# Print the key prefixes to help with any debugging

openai_api_key = os.getenv('OPENAI_API_KEY')
anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')
google_api_key = os.getenv('GOOGLE_API_KEY')
deepseek_api_key = os.getenv('DEEPSEEK_API_KEY')
groq_api_key = os.getenv('GROQ_API_KEY')

if openai_api_key:
    print(f"OpenAI API Key exists and begins {openai_api_key[:8]}")
else:
    print("OpenAI API Key not set")
    
if anthropic_api_key:
    print(f"Anthropic API Key exists and begins {anthropic_api_key[:7]}")
else:
    print("Anthropic API Key not set (and this is optional)")

if google_api_key:
    print(f"Google API Key exists and begins {google_api_key[:2]}")
else:
    print("Google API Key not set (and this is optional)")

if deepseek_api_key:
    print(f"DeepSeek API Key exists and begins {deepseek_api_key[:3]}")
else:
    print("DeepSeek API Key not set (and this is optional)")

if groq_api_key:
    print(f"Groq API Key exists and begins {groq_api_key[:4]}")
else:
    print("Groq API Key not set (and this is optional)")

OpenAI API Key exists and begins sk-proj-
Anthropic API Key exists and begins xxxx
Google API Key exists and begins AI
DeepSeek API Key exists and begins xxx
Groq API Key not set (and this is optional)
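
The key checks above repeat one pattern five times. As a minimal sketch (not part of the original lab), the same report can be driven from a table. The helper name `describe_key` and the `PROVIDERS` list are assumptions made for illustration; the environment variable names and prefix lengths are the ones used in the cell above.

```python
import os

def describe_key(label, env_var, prefix_len, optional=True, environ=os.environ):
    """Return a one-line status message for a single API key (hypothetical helper)."""
    key = environ.get(env_var)
    if key:
        # Only ever show a short prefix, never the whole key
        return f"{label} API Key exists and begins {key[:prefix_len]}"
    suffix = " (and this is optional)" if optional else ""
    return f"{label} API Key not set{suffix}"

# (label, environment variable, prefix length to show, optional?)
PROVIDERS = [
    ("OpenAI", "OPENAI_API_KEY", 8, False),
    ("Anthropic", "ANTHROPIC_API_KEY", 7, True),
    ("Google", "GOOGLE_API_KEY", 2, True),
    ("DeepSeek", "DEEPSEEK_API_KEY", 3, True),
    ("Groq", "GROQ_API_KEY", 4, True),
]

for label, env_var, prefix_len, optional in PROVIDERS:
    print(describe_key(label, env_var, prefix_len, optional))
```

Passing `environ` explicitly makes the helper easy to try out without touching your real environment.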


In [10]:
request = "Please come up with a challenging, nuanced question that I can ask a number of LLMs to evaluate their intelligence. "
request += "Answer only with the question, no explanation."
messages = [{"role": "user", "content": request}]

In [11]:
messages

[{'role': 'user',
  'content': 'Please come up with a challenging, nuanced question that I can ask a number of LLMs to evaluate their intelligence. Answer only with the question, no explanation.'}]

In [12]:
openai = OpenAI()
response = openai.chat.completions.create(
    model="gpt-5-nano",
    messages=messages,
)
question = response.choices[0].message.content
print(question)


What is the minimal, generalizable set of cognitive capabilities that distinguish true general intelligence from a large language model, and how would you design a rigorous, replicable cross-domain evaluation protocol to measure those capabilities across (i) abstract reasoning, (ii) real-world planning under uncertainty, and (iii) ethical deliberation, including (a) clear operational definitions and performance metrics, (b) strategies to control biases and confounds, (c) data collection and evaluation procedures that minimize leakage and prompt engineering, (d) a plan for cross-model comparisons and statistical analysis, and (e) a framework for interpreting results that separates genuine general intelligence from memorization or superficial tricks?


In [13]:
competitors = []
answers = []
messages = [{"role": "user", "content": question}]

## Note - update since the videos

I've updated the model names to use the latest models below, like GPT 5 and Claude Sonnet 4.5. It's worth noting that these models can be quite slow - like 1-2 minutes - but they do a great job! Feel free to switch them for faster models if you'd prefer, like the ones I use in the video.
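
Since response times vary so much between models, it can help to measure them yourself. Here is a small sketch, assuming a hypothetical helper called `timed` that wraps any callable and reports how long it took; it uses only the standard library.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in for a slow API call, so the sketch runs without any keys
result, seconds = timed(sum, [1, 2, 3])
print(f"Result {result} took {seconds:.4f}s")
```

In the cells below you could wrap a real call in the same way, for example `response, seconds = timed(openai.chat.completions.create, model=model_name, messages=messages)`.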

In [14]:
# The API we know well
# I've updated this with the latest model, but it can take some time because it likes to think!
# Replace the model with gpt-4.1-mini if you'd prefer not to wait 1-2 mins

model_name = "gpt-5-nano"

response = openai.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

Below is a compact, actionable blueprint you can use to reason about true general intelligence (GI) versus large language models (LLMs), and to run a rigorous, replicable, cross-domain evaluation. It focuses on a minimal but generalizable set of cognitive capabilities and pairs them with a concrete, transparent evaluation protocol across three domains: abstract reasoning, real-world planning under uncertainty, and ethical deliberation. It also addresses data integrity, bias controls, statistics, and interpretation of results.

1) Minimal, generalizable cognitive capabilities that distinguish GI from LLMs

Proposed core capabilities (four, orthogonal and testable)

- C1. Model-based, long-horizon planning with persistent world-models
  - What it is: The ability to form internal representations of a dynamic environment, predict consequences of actions over long time horizons, and revise plans coherently when new information arrives or the environment changes.
  - Why it matters: Token-prediction-only systems tend to rely on surface patterns; true GI should demonstrate goal-directed behavior that relies on an internal causal/physical model and can adapt plans over time.

- C2. Flexible abstraction, compositional generalization, and systematic problem solving
  - What it is: The capacity to abstract, combine, and recombine knowledge primitives (concepts, actions, rules) to solve novel tasks, including tasks that require reasoning with unseen combinations of known components.
  - Why it matters: Generalization across domains, tasks, and representations is a hallmark of GI, not merely memorization of training prompts or surface pattern matching.

- C3. Uncertainty-aware decision making and active information gathering
  - What it is: The ability to reason under partial observability, estimate and calibrate uncertainty, choose actions that reduce uncertainty when beneficial, and make robust decisions despite incomplete information.
  - Why it matters: Real-world competence requires handling ambiguity, not just confident regurgitation of prior data.

- C4. Value-aligned ethical deliberation and social reasoning under ambiguity
  - What it is: The capacity to reflect on harms, benefits, norms, and competing constraints; apply normative frameworks; reason about trade-offs; and communicate decisions with transparency about limitations.
  - Why it matters: Ethical judgment in complex, conflicting scenarios requires principled reasoning beyond pattern completion.

Notes on scope
- The four capabilities are designed to be minimally sufficient to capture core differences in general intelligence (planning with world models; robust abstraction; active information gathering under uncertainty; normative, value-aware reasoning) while being amenable to rigorous measurement and cross-domain testing.
- These capabilities are intentionally generic and domain-agnostic, facilitating cross-domain replication and comparison across models, environments, and tasks.

2) Rigorous cross-domain evaluation protocol (overview)

- Domain coverage:
  - Domain (i): Abstract reasoning and problem solving (C1 & C2 emphasis)
  - Domain (ii): Real-world planning under uncertainty (C1, C3 emphasis)
  - Domain (iii): Ethical deliberation and social reasoning under ambiguity (C4 emphasis)
- Evaluation pillars for each domain
  - a) Operational definitions and performance metrics
  - b) Strategies to control biases and confounds
  - c) Data collection and evaluation procedures minimizing leakage and prompt engineering
  - d) Cross-model comparison plan and statistical analysis
  - e) Interpretive framework to separate genuine GI from memorization or superficial tricks
- Cross-domain integration
  - A unified GI score (optional) combining domain-specific scores, with transparent weighting and per-domain diagnostics.

3) Domain-specific protocol details (a–e)

Domain (i): Abstract reasoning and problem solving (C1, C2)

a) Operational definitions and metrics
- Tasks: synthetic, controlled puzzles requiring long-horizon planning, causality reasoning, and hierarchical problem decomposition (e.g., multi-step planning in grid-worlds, chain-of-potion-type tasks, causal graphs with interventions).
- Metrics:
  - Planning success rate (fraction of tasks solved with a coherent plan)
  - Plan optimality (cost of the chosen plan relative to a known optimal plan or a baseline planner)
  - Plan robustness (success when environment changes after plan generation)
  - World-model fidelity (accuracy of inferred state transitions and causal relations)
  - Generalization score (performance on novel task compositions not present in training)
  - Computation/time efficiency (time to generate a plan, resource use)
- Operational definition: A model demonstrates C1/C2 if it consistently generates coherent, stepwise plans that correctly predict consequences and adapt to new but structurally related tasks, without requiring task-specific memorization.

b) Biases and confounds controls
- Use tasks with controlled causal structure and avoid tasks that rely primarily on language priors.
- Randomize task ordering and prompt phrasing to minimize prompt exploitation.
- Include “control tasks” with surface trickiness but no genuine planning demand to detect superficial shortcuts.
- Predefine success criteria and blind evaluators to model identity/model family.

c) Data collection and evaluation procedures (minimize leakage/prompt engineering)
- Create task generators that produce tasks algorithmically, with seeds to ensure reproducible difficulty levels.
- Use held-out test sets generated independently from any training data; do not reuse publicly available puzzles that may have appeared in model training.
- For each task, require a bounded-length, explicit plan (not only final answer) when feasible; otherwise, collect a structured justification trace that can be evaluated by humans for coherence and causal correctness.
- Use strict evaluation scripts that do not rely on any sensitive prompts or chain-of-thought prompts.

d) Cross-model comparisons and statistics
- Model set: at least 3–5 diverse GI-capable models (e.g., multiple open/closed models or baselines with and without planning modules) plus a strong AI baseline that relies on pattern matching but not planning, to separate baselines.
- Experimental design: within-model repeated measures across tasks; between-model comparisons with mixed-effects models to account for task difficulty and model variance.
- Statistics: preregistered analysis plan; nonparametric tests when sample sizes are small; Bayesian credible intervals for performance differences; correction for multiple comparisons (e.g., Holm-Bonferroni).

e) Interpretation framework
- Distinguish genuine planning/causal reasoning from memorized task templates by:
  - Out-of-distribution (OOD) generalization tests (novel task skeletons derived from existing tasks).
  - Task perturbations that invalidate memorized patterns but preserve underlying structure.
  - Ablation tests where internal world-models are degraded (e.g., by perturbing state-estimation inputs) to see if planning ability degrades accordingly.
- Examine failure modes: are failures due to planning errors, misestimated uncertainty, or brittle language inference that masks planning?

Domain (ii): Real-world planning under uncertainty (C1, C3)

a) Operational definitions and metrics
- Tasks: dynamic, partially observable environments (simulated robotics-like domains, household planning, or logistics with stochastic dynamics).
- Metrics:
  - Success rate under uncertainty (goal achieved despite stochasticity)
  - Adaptation rate (ability to adjust plans after observations show deviations)
  - Information-seeking behavior (frequency and usefulness of information-gathering actions)
  - Uncertainty calibration (reliability of predicted uncertainties vs observed outcomes)
  - Efficiency under time/resource constraints
- Operational definition: A GI-capable agent demonstrates robust, long-horizon planning in the face of uncertainty, actively gathers information to reduce uncertainty, and updates plans coherently as new evidence arrives.

b) Biases and confounds controls
- Use identical environments across models with randomized dynamics; ensure no model has access to privileged environment information.
- Prevent system prompts that reveal test expectations; use random seeds and blind evaluators for outcomes.
- Include both static and dynamic tasks to separate planning from static reasoning.

c) Data collection and evaluation procedures
- Simulated environments with standardized physics and uncertainty models; maintain a test-bed separate from training environments.
- Collect a diversity of tasks (varying horizon lengths, noise levels, and sensor availability).
- For each run, log full decision trajectories, uncertainty estimates, and information-seeking actions.

d) Cross-model comparisons and statistics
- Use hierarchical mixed-effects models with task difficulty as a random effect.
- Pairwise model comparisons with corrected p-values; report effect sizes and Bayes factors.
- Pre-register performance targets and confirm robustness across seeds and environment variants.

e) Interpretation framework
- Separate planning competence from reactive habit-based behavior by testing in unseen environments with new layout/topology while controlling for surface inputs.
- Analyze whether high performance correlates with reliable uncertainty estimates and active information gathering, rather than just fast rote responses.
- Include ablations that remove uncertainty signaling or information gathering to see the impact on performance.

Domain (iii): Ethical deliberation and social reasoning under ambiguity (C4)

a) Operational definitions and metrics
- Tasks: normative dilemmas, policy evaluation, fairness-sensitive decision making, cross-cultural ethical judgments, and justification with explicit reasoning.
- Metrics:
  - Normative alignment score: agreement with principled ethical theories (deontology, utilitarianism, virtue ethics) or explicit normative frameworks chosen a priori.
  - Consistency: ethical judgments across related but differently framed scenarios.
  - Trade-off transparency: quality and clarity of justifications; ability to articulate alternatives and their harms/benefits.
  - Bias/fairness indicators: resistance to demographic biases; detection of biased reasoning patterns.
  - Explainability quality: human assessments of the coherence and sufficiency of the model’s explanations.
- Operational definition: A GI-capable agent reasons about ethical considerations consistently across contexts, justifies decisions transparently, and demonstrates awareness of trade-offs and potential biases.

b) Biases and confounds controls
- Use diverse, ethically vetted scenario sets with explicit cultural and normative framing; avoid culturally biased prompts.
- Include disinformation or manipulation-resistant prompts to test robustness to prompt misuse.
- Blind evaluators to model identity and to the linguistic style of the model to reduce evaluator bias.

c) Data collection and evaluation procedures
- Use ethically curated, vetted scenarios with independent ethics review.
- Provide minimal prompt engineering (no chaining or hidden prompts); test with multiple neutral phrasings to assess robustness of judgments.
- Collect human judgments from multiple ethicists and domain experts; compute inter-rater reliability.

d) Cross-model comparisons and statistics
- Compare models on per-scenario ethics scores and overall alignment, with confidence intervals.
- Use mixed-effects models to account for scenario difficulty and rater variance.
- Conduct sensitivity analyses across normative frameworks (e.g., utilitarian vs. rights-based frameworks) to assess robustness of judgments.

e) Interpretation framework
- Distinguish genuine ethical deliberation from memorized patterns by:
  - Testing on novel ethical dilemmas with new combinations of constraints.
  - Evaluating for consistency across frames and resistance to prompt-induced bias.
  - Analyzing the quality and depth of explanations rather than just verdicts.
- Use counterfactual analyses: would the same decision hold if key facts were changed? Do explanations show awareness of the core moral principles?

4) Data integrity, leakage control, and prompt engineering mitigation

- Test-set design
  - Hold-out, procedurally generated tasks not present in training data
  - Include OOD variants and structurally novel task families
  - Ensure tasks are balanced for difficulty and domain coverage
- Leakage controls
  - Avoid using publicly released benchmarks that models might have memorized
  - Use synthetic or procedurally generated content with reproducible seeds
  - Maintain strict version control of task generators and evaluation harnesses; publish pipelines after a suitable embargo if needed
- Prompt engineering mitigation
  - Use fixed, minimal prompts across models; also test with standardized prompts that do not elicit chain-of-thought
  - Randomize prompt order and phrasing; avoid prompts that reveal evaluation expectations
  - Include a prompt-robustness check: re-run with multiple prompt variants to ensure results are not prompt-specific
- Information leakage detection
  - Monitor for memorized answers via similarity measures to a model’s training data; use hold-out prompts with paraphrased framing
  - Binary/isomorphic task variants to test whether surface cues drive success

5) Data collection and evaluation procedures to minimize leakage and enable replicability

- Task generation
  - Use open-source, auditable task generators with documented seeds and difficulty calibration
  - Create both synthetic tasks and carefully curated real-world analogs to cover domain complexity
- Evaluation harness
  - Publish evaluation scripts, data schemas, and scoring rubrics
  - Use automated scoring where feasible; human adjudication for subjective judgments (ethical deliberation) with clear rubric
- Experimental protocol
  - Pre-register hypotheses, task sets, and analysis plan
  - Run independent replicators with the exact evaluation harness and a fixed random seed policy
- Documentation
  - Provide model versions, environment versions, runtime settings, and random seeds
  - Share anonymized task data and evaluation results to enable external replication

6) Cross-model comparison and statistical analysis plan

- Model ensemble
  - Include a diverse set of models: multiple LLMs with differing training corpora and architectures, plus a non-LLM-based baseline that emphasizes planning and reasoning
- Experimental design
  - Within-model comparisons across tasks; between-model comparisons for each domain and across domains
  - Factorial design where feasible: model type × task domain × task difficulty
- Statistics
  - Use mixed-effects models with random effects for model and task, fixed effects for domain and difficulty
  - Report effect sizes (Cohen’s d or equivalent), confidence/credible intervals, and Bayesian evidence
  - Correct for multiple comparisons; preregister primary outcomes
- Robustness checks
  - Sensitivity analyses across seeds, evaluation prompts, and task perturbations
  - Out-of-distribution tests to evaluate true generalization
  - Ablation studies (remove components or alter evaluation conditions) to attribute performance to C1–C4

7) Interpretation framework: separating genuine GI from memorization or superficial tricks

- Core tests to distinguish intelligence from memory
  - Generalization tests: structurally new tasks with similar underlying rules
  - Composition tests: require combining known primitives in novel ways
  - Causal/causal-sufficient tests: involve interventions and counterfactual reasoning not present in training data
  - Uncertainty and information-seeking tests: verify active information gathering and reliable uncertainty estimates
  - Ethical reasoning tests with novel normative constraints and cross-cultural framing
- Diagnostics and diagnostics-driven reporting
  - Error analysis: categorize failures by reasoning type (planning error, misestimation of uncertainty, rote recall, bias)
  - Memorization indicators: high similarity to training data, or success on prompts easily traced to memorized outputs
  - Explanation quality: assess whether the model’s explanations reflect understanding of underlying concepts rather than surface matching
- Reporting guidelines
  - Provide per-domain GI scores, confidence intervals, and the proportion of tasks where genuine reasoning is demonstrated
  - Clearly separate results driven by planning/world-model competencies from those driven by pattern completion
  - Include limitations, potential confounds, and recommendations for future improvements

8) Practical considerations and limitations

- Resource demands
  - Cross-model, cross-domain evaluation is computationally intensive; plan for parallel runs and staged analyses
- Ethics and safety
  - Content safety: ensure ethical deliberation prompts are non-harmful and responsibly framed; provide content warnings when necessary
- Interpretability vs. opacity: ensure that explainability assessments are well-defined and not easily gamed
- Transferability: while the protocol aims to be domain-agnostic, task design should be updated to reflect real-world domains you care about (robotics, planning, policy, etc.)

9) A concrete, compact checklist you can adopt

- Define four core capabilities (C1–C4) and map every task to one or more capabilities
- Build three task suites: abstract reasoning, planning under uncertainty, ethical deliberation; ensure each includes both novel and diversified prompts
- Pre-register metrics, data-generation procedures, and statistical plans
- Use held-out, procedurally generated test tasks; avoid data leakage from training corpora
- Employ fixed prompts with minimal or no chain-of-thought prompts; test prompt-robustness
- Collect both automated metrics and human judgments (where appropriate); assess inter-rater reliability
- Use mixed-effects models for analysis; report effect sizes with CIs or BFs
- Conduct ablations and OOD tests to separate genuine reasoning from memorization
- Publish data, code, and evaluation harness openly (where possible) to maximize replicability

If you’d like, I can tailor this protocol to a specific coalition of models you’re evaluating (e.g., a particular set of LLMs or agent-based systems) and provide a concrete task catalog with concrete scoring rubrics, example prompts, and a starter statistical analysis plan customized to your resources and time constraints.

In [12]:
# Anthropic has a slightly different API, and the max_tokens argument is required

model_name = "claude-sonnet-4-5"

claude = Anthropic()
response = claude.messages.create(model=model_name, messages=messages, max_tokens=1000)
answer = response.content[0].text

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

AuthenticationError: Error code: 401 - {'type': 'error', 'error': {'type': 'authentication_error', 'message': 'invalid x-api-key'}, 'request_id': 'req_011CXNUD2r3eVQjz7fKuWRc4'}
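
The two API styles also differ in where the reply text lives. The sketch below uses stand-in objects built with `SimpleNamespace` rather than real SDK responses, so it runs without any keys; the attribute paths are the ones used in the cells above (`response.choices[0].message.content` and `response.content[0].text`), and the function names are made up for illustration.

```python
from types import SimpleNamespace

def openai_style_text(response):
    """Reply text from an OpenAI-style chat completion response."""
    return response.choices[0].message.content

def anthropic_style_text(response):
    """Reply text from an Anthropic-style messages response."""
    return response.content[0].text

# Stand-ins that mimic only the attributes this notebook reads (not real SDK objects)
fake_openai = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content="Hello from an OpenAI-style response"))]
)
fake_anthropic = SimpleNamespace(
    content=[SimpleNamespace(text="Hello from an Anthropic-style response")]
)

print(openai_style_text(fake_openai))
print(anthropic_style_text(fake_anthropic))
```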

In [14]:
gemini = OpenAI(api_key=google_api_key, base_url="https://generativelanguage.googleapis.com/v1beta/openai/")
model_name = "gemini-2.5-flash"

response = gemini.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

BadRequestError: Error code: 400 - [{'error': {'code': 400, 'message': 'API key not valid. Please pass a valid API key.', 'status': 'INVALID_ARGUMENT', 'details': [{'@type': 'type.googleapis.com/google.rpc.ErrorInfo', 'reason': 'API_KEY_INVALID', 'domain': 'googleapis.com', 'metadata': {'service': 'generativelanguage.googleapis.com'}}, {'@type': 'type.googleapis.com/google.rpc.LocalizedMessage', 'locale': 'en-US', 'message': 'API key not valid. Please pass a valid API key.'}]}}]

In [15]:
deepseek = OpenAI(api_key=deepseek_api_key, base_url="https://api.deepseek.com/v1")
model_name = "deepseek-chat"

response = deepseek.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

AuthenticationError: Error code: 401 - {'error': {'message': 'Authentication Fails, Your api key: xxxx is invalid', 'type': 'authentication_error', 'param': None, 'code': 'invalid_request_error'}}

In [15]:
# Updated with the latest Open Source model from OpenAI

groq = OpenAI(api_key=groq_api_key, base_url="https://api.groq.com/openai/v1")
model_name = "openai/gpt-oss-120b"

response = groq.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)


AuthenticationError: Error code: 401 - {'error': {'message': 'Invalid API Key', 'type': 'invalid_request_error', 'code': 'invalid_api_key'}}
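
Several of the cells above stop with an authentication error when a key is missing or invalid. One possible way to keep going is sketched here, assuming a hypothetical helper named `try_competitor`: it runs a call, records the answer only on success, and reports any failure instead of raising. Catching `Exception` is deliberately broad for this sketch, because each SDK raises its own error types.

```python
def try_competitor(model_name, ask, competitors, answers):
    """Run ask() and record its answer; on any error, report it and move on."""
    try:
        answer = ask()
    except Exception as exc:  # broad on purpose: each SDK has its own error classes
        print(f"Skipping {model_name}: {type(exc).__name__}: {exc}")
        return None
    # Append to both lists together so they always stay the same length
    competitors.append(model_name)
    answers.append(answer)
    return answer

def failing_call():
    # Stand-in for an API call made with an invalid key
    raise RuntimeError("Error code: 401")

demo_competitors, demo_answers = [], []
try_competitor("working-model", lambda: "a stand-in answer", demo_competitors, demo_answers)
try_competitor("broken-model", failing_call, demo_competitors, demo_answers)
print(demo_competitors)
```

With a real client, `ask` could be a small function or lambda that makes the API call and returns the reply text.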

## For the next cell, we will use Ollama

Ollama runs a local web service that gives an OpenAI compatible endpoint,  
and runs models locally using high performance C++ code.

If you don't have Ollama, install it by visiting https://ollama.com, pressing Download, and following the instructions.

After it's installed, you should be able to visit http://localhost:11434 and see the message "Ollama is running"

You might need to restart Cursor (and maybe reboot). Then open a Terminal (control+\`) and run `ollama serve`

Useful Ollama commands (run these in the terminal, or with an exclamation mark in this notebook):

`ollama pull <model_name>` downloads a model locally  
`ollama ls` lists all the models you've downloaded  
`ollama rm <model_name>` deletes the specified model from your downloads
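
The "Ollama is running" check can also be done from Python. This is a minimal sketch using only the standard library; the function name `ollama_is_running` and its `opener` parameter are assumptions added for illustration (the parameter just makes the function easy to try with a stand-in). The URL and the expected message come from the text above.

```python
import urllib.error
import urllib.request

def ollama_is_running(base_url="http://localhost:11434", timeout=2.0,
                      opener=urllib.request.urlopen):
    """Return True if base_url answers with the 'Ollama is running' message."""
    try:
        with opener(base_url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, HTTP error status, and so on
        return False
    return "Ollama is running" in body
```

Calling `ollama_is_running()` before the next cells gives a quick True or False rather than a long traceback when the service isn't up.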

<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/stop.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#ff7800;">Super important - ignore me at your peril!</h2>
            <span style="color:#ff7800;">The model called <b>llama3.3</b> is FAR too large for home computers - it's not intended for personal computing and will consume all your resources! Stick with the nicely sized <b>llama3.2</b> or <b>llama3.2:1b</b> and if you want larger, try llama3.1 or smaller variants of Qwen, Gemma, Phi or DeepSeek. See <a href="https://ollama.com/models">the Ollama models page</a> for a full list of models and sizes.
            </span>
        </td>
    </tr>
</table>

In [2]:
# !ollama pull llama3.2

In [16]:
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2", "prompt": "Hello", "stream": False}
)
print(response.json())

{'model': 'llama3.2', 'created_at': '2026-01-22T14:24:50.146576157Z', 'response': 'Hello! How can I assist you today?', 'done': True, 'done_reason': 'stop', 'context': [128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 271, 128009, 128006, 882, 128007, 271, 9906, 128009, 128006, 78191, 128007, 271, 9906, 0, 2650, 649, 358, 7945, 499, 3432, 30], 'total_duration': 722667234, 'load_duration': 141168048, 'prompt_eval_count': 26, 'prompt_eval_duration': 59922384, 'eval_count': 10, 'eval_duration': 513934661}
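
The timing fields in that response are reported in nanoseconds, so a rough generation speed can be worked out from `eval_count` and `eval_duration`. A small sketch, with a hypothetical helper name and the sample values copied from the output above:

```python
def tokens_per_second(payload):
    """Generation speed from an Ollama /api/generate response dict."""
    seconds = payload["eval_duration"] / 1e9  # nanoseconds to seconds
    if seconds <= 0:
        return 0.0
    return payload["eval_count"] / seconds

# Values copied from the response printed above
sample = {"eval_count": 10, "eval_duration": 513934661, "total_duration": 722667234}

print(f"Generated {sample['eval_count']} tokens at {tokens_per_second(sample):.1f} tokens/s")
print(f"Total time: {sample['total_duration'] / 1e9:.2f}s")
```

With the real response you could pass `response.json()` straight in.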


In [17]:
ollama = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
model_name = "llama3.2"

response = ollama.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

Designing a minimal set of cognitive capabilities that distinguish true general intelligence from large language models is an active area of research. Based on various studies, I'd propose the following cognitive capabilities as essential indicators:

1. **Abstract reasoning**: The ability to reason about abstract concepts, relationships, and hypothetical scenarios beyond specific domains or tasks.
2. **Common sense**: The capacity to understand and apply general principles, often implicitly learned from everyday experiences, to novel situations.
3. **Transfer learning**: The ability to adapt knowledge and skills acquired in one domain to another, unrelated task or context.
4. **Contextual understanding**: The capacity to comprehend the nuances of language, including subtle cues, implied meaning, and figurative language.
5. **Creativity**: The ability to generate new ideas, solutions, or concepts that are not readily available through classical learning mechanisms.

To design a rigorous, replicable cross-domain evaluation protocol, I'll outline the following steps:

**Initial Steps**

1. **Defining operational definitions**: Clearly articulate the specific tasks and cognitive capabilities targeted by each assessment (abstract reasoning, real-world planning under uncertainty, ethical deliberation).
2. **Establishing performance metrics**: Develop standardized evaluation procedures, incorporating measures such as accuracy, speed, and reliability.
3. **Control for biases and confounds**: Implement strategies to mitigate domain-specific biases and minimize confounding variables.

**Task Design**

1. **Abstract reasoning**: Present a series of abstract tasks, such as:
	* Analogies (e.g., "rhythm in music" vs. "tension in a rope")
	* Counterfactuals (e.g., "what would happen if X were true?")
	* Syllogisms (e.g., "All A are B; is X an A?")

2. **Real-world planning under uncertainty**: Simulate real-world scenarios, incorporating:
	* Uncertainty and ambiguity
	* Complex decision-making frameworks
	* Domain adaptation

3. **Ethical deliberation**: Present a series of hypothetical dilemmas requiring the model to weigh competing values or principles.

4. **Cross-domain evaluation**: Assess each cognitive capability across multiple domains:

a) **Inter-domain generalization**: Test whether performance on abstract reasoning/real-world planning/ethical deliberation tasks transfers to novel domains.
b) **Domain adaptation**: Evaluate how well performance adapts to a new domain within the same task type.

**Data Collection and Evaluation Procedures**

1. **Prompt engineering**: Develop carefully crafted test procedures that minimize biases, incorporating linguistic and domain-specific nuance.
2. **Data standardization**: Ensure consistent evaluation across all tasks and domains using standardized datasets (e.g., CogSeq).
3. **Multi-fidelity approaches**: Combine high- and low-fidelity testing methods to capture both reliable performance on specific cognitive capabilities and the more general, adaptable aspects of intelligence.

**Cross-Model Comparisons**

1. **Comparative evaluation**: Assess large language models against human benchmarking methods for a subset or selection of tests.
2. **Transfer learning evaluation**: Assess how well individual models generalize across multiple tasks or domains using transfer learning techniques.

**Strategies for Statistical Analysis**

1. **Multi-modal analysis**: Combine performance metrics to create composite scores (e.g., average accuracy and confidence).
2. **Comparative meta-analysis**: Pool individual results from each model across a collection of benchmarks.
3. **Ensemble evaluation**: Use aggregation or combination methods (e.g., weighted averages, Bayesian averaging).

**Interpretation and Frameworks**

1. **General ability scores (GAS)**: Utilize composite metrics to assess total intelligence, accounting for performance on multiple cognitive capabilities.
2. **Context-free performance assessment**: Focus solely on general principles that operate independent of specific domain applications or specialized contexts.
3. **Model-agnostic interpretation**: Develop frameworks that capture and mitigate the effects of overfitting, incorporating techniques such as regularized learning objectives and norm-based constraints.

Implementing a research framework meeting these proposals will help determine whether large language models exhibit genuine intelligence similar to humans' cognition, or represent superficial memorization/ tricks. Cross-metamorphosis evaluation with the large language models against each other would further reveal any commonalities and general trends of human-like cognition in AI's performance

In [18]:
# So where are we?

print(competitors)
print(answers)


['gpt-5-nano', 'llama3.2']


In [19]:
# It's nice to know how to use "zip"
for competitor, answer in zip(competitors, answers):
    print(f"Competitor: {competitor}\n\n{answer}")


Competitor: gpt-5-nano

Below is a compact, actionable blueprint you can use to reason about true general intelligence (GI) versus large language models (LLMs), and to run a rigorous, replicable, cross-domain evaluation. It focuses on a minimal but generalizable set of cognitive capabilities and pairs them with a concrete, transparent evaluation protocol across three domains: abstract reasoning, real-world planning under uncertainty, and ethical deliberation. It also addresses data integrity, bias controls, statistics, and interpretation of results.

1) Minimal, generalizable cognitive capabilities that distinguish GI from LLMs

Proposed core capabilities (four, orthogonal and testable)

- C1. Model-based, long-horizon planning with persistent world-models
  - What it is: The ability to form internal representations of a dynamic environment, predict consequences of actions over long time horizons, and revise plans coherently when new information arrives or the environment changes.
  - 

In [20]:
# Let's bring this together - note the use of "enumerate"

together = ""
for index, answer in enumerate(answers):
    together += f"# Response from competitor {index+1}\n\n"
    together += answer + "\n\n"
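An equivalent way to build the same string, sketched with placeholder data: `enumerate` takes a `start` argument, so there is no need for `index+1`, and `"".join` avoids repeated string concatenation. The names `sample_answers` and `together_alt` are made up for this illustration so they don't overwrite the lab's variables.

```python
# Placeholder data; in the lab, `answers` holds the real model responses
sample_answers = ["First answer", "Second answer"]

# enumerate(..., start=1) yields 1-based indices directly
sections = [
    f"# Response from competitor {index}\n\n{answer}\n\n"
    for index, answer in enumerate(sample_answers, start=1)
]
together_alt = "".join(sections)
print(together_alt)
```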

In [21]:
print(together)

# Response from competitor 1

Below is a compact, actionable blueprint you can use to reason about true general intelligence (GI) versus large language models (LLMs), and to run a rigorous, replicable, cross-domain evaluation. It focuses on a minimal but generalizable set of cognitive capabilities and pairs them with a concrete, transparent evaluation protocol across three domains: abstract reasoning, real-world planning under uncertainty, and ethical deliberation. It also addresses data integrity, bias controls, statistics, and interpretation of results.

1) Minimal, generalizable cognitive capabilities that distinguish GI from LLMs

Proposed core capabilities (four, orthogonal and testable)

- C1. Model-based, long-horizon planning with persistent world-models
  - What it is: The ability to form internal representations of a dynamic environment, predict consequences of actions over long time horizons, and revise plans coherently when new information arrives or the environment changes

In [22]:
judge = f"""You are judging a competition between {len(competitors)} competitors.
Each model has been given this question:

{question}

Your job is to evaluate each response for clarity and strength of argument, and rank them in order of best to worst.
Respond with JSON, and only JSON, with the following format:
{{"results": ["best competitor number", "second best competitor number", "third best competitor number", ...]}}

Here are the responses from each competitor:

{together}

Now respond with the JSON with the ranked order of the competitors, nothing else. Do not include markdown formatting or code blocks."""


In [23]:
print(judge)

You are judging a competition between 2 competitors.
Each model has been given this question:

What is the minimal, generalizable set of cognitive capabilities that distinguish true general intelligence from a large language model, and how would you design a rigorous, replicable cross-domain evaluation protocol to measure those capabilities across (i) abstract reasoning, (ii) real-world planning under uncertainty, and (iii) ethical deliberation, including (a) clear operational definitions and performance metrics, (b) strategies to control biases and confounds, (c) data collection and evaluation procedures that minimize leakage and prompt engineering, (d) a plan for cross-model comparisons and statistical analysis, and (e) a framework for interpreting results that separates genuine general intelligence from memorization or superficial tricks?

Your job is to evaluate each response for clarity and strength of argument, and rank them in order of best to worst.
Respond with JSON, and only 

In [24]:
judge_messages = [{"role": "user", "content": judge}]

In [25]:
# Judgement time!

openai = OpenAI()
response = openai.chat.completions.create(
    model="gpt-5-nano",
    messages=judge_messages,
)
results = response.choices[0].message.content
print(results)


{"results": ["1", "2"]}


In [26]:
# OK let's turn this into results!

results_dict = json.loads(results)
ranks = results_dict["results"]
for index, result in enumerate(ranks):
    competitor = competitors[int(result)-1]
    print(f"Rank {index+1}: {competitor}")

Rank 1: gpt-5-nano
Rank 2: llama3.2
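The lookup `competitors[int(result)-1]` assumes the judge returned every competitor number exactly once. If you want to check that before printing a leaderboard, something like this sketch would do it; `validate_ranks` is a hypothetical helper and the sample list is a placeholder:

```python
def validate_ranks(ranks: list[str], n_competitors: int) -> list[int]:
    """Return 0-based competitor indices, or raise ValueError if the ranking is malformed."""
    numbers = [int(r) for r in ranks]  # int() raises ValueError on non-numeric entries
    # A valid ranking is a permutation of 1..n_competitors
    if sorted(numbers) != list(range(1, n_competitors + 1)):
        raise ValueError(f"Expected each of 1..{n_competitors} exactly once, got {ranks}")
    return [n - 1 for n in numbers]

sample_competitors = ["gpt-5-nano", "llama3.2"]
indices = validate_ranks(["1", "2"], len(sample_competitors))
for position, idx in enumerate(indices, start=1):
    print(f"Rank {position}: {sample_competitors[idx]}")
```

A duplicated or out-of-range number raises `ValueError` instead of silently producing a wrong ranking or an `IndexError`.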


<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/exercise.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#ff7800;">Exercise</h2>
            <span style="color:#ff7800;">Which pattern(s) did this use? Try updating this to add another Agentic design pattern.
            </span>
        </td>
    </tr>
</table>

<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/business.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#00bfff;">Commercial implications</h2>
            <span style="color:#00bfff;">These kinds of patterns - to send a task to multiple models, and evaluate results,
            are common where you need to improve the quality of your LLM response. This approach can be universally applied
            to business projects where accuracy is critical.
            </span>
        </td>
    </tr>
</table>