# AI Co-Scientist for Pharma R&D with Open-Source LLMs

This notebook implements a multi-agent AI system designed to function as a "co-scientist," accelerating experimental design in pharmaceutical R&D. The system automates hypothesis generation, grounds proposals in data using tools, and uses a tiered approach of different AI agents to refine and validate experimental protocols.

This implementation recreates the co-scientist architecture from OpenAI's Model Selection Guide, adapted to run on open-source LLMs served through an OpenAI-compatible endpoint and extended with an LLM-based quality evaluation step.

## Architecture Overview and Enhancements

We adopt the original's robust multi-agent workflow, which uses different models for specialized tasks in a pattern of "escalation of intelligence."

**Core Workflow:**
1.  **Scientist Input:** A human scientist defines the research goal, compound, and constraints.
2.  **Ideation:** Several agents, each with a specific role, generate diverse experimental plans using a fast, cost-effective LLM. In this notebook the agents are called one after another.
3.  **Tournament Ranking:** The generated plans are compared pairwise to select the most promising candidate. This notebook simplifies the tournament to a single comparison of the first two plans.
4.  **Deep Critique:** The winning plan is escalated to a more powerful LLM for rigorous scientific review and refinement.
5.  **Safety Check:** A final check is performed to identify potential hazards.
6.  **Human Review:** The final, refined protocol is presented to the human scientist for ultimate approval.
7.  **Execution & Learning:** Experimental results are analyzed to create structured knowledge for future runs.

**Our Enhancements:**
- **An Advanced Quality Scoring Agent:** A new evaluation agent scores the final protocol on five dimensions: `Scientific Validity`, `Feasibility`, `Innovation`, `Cost-Effectiveness`, and `Clarity & Reproducibility`.
- **A Comprehensive Final Summary:** We aggregate all operational metrics and our new quality scores into a single DataFrame, offering a holistic view of the entire process for each run.

## 1. Setup and Configuration

In [1]:
# %pip install -qU openai pandas tqdm

In [2]:
import os
import json
import re
import time
import uuid
import logging
from typing import List, Dict, Any, Optional
from dataclasses import dataclass, field
from pathlib import Path
import pandas as pd
from openai import OpenAI
from tqdm.auto import tqdm
from IPython.display import display, Markdown

# --- LLM Configuration ---
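# Prefer an environment variable over a hardcoded key; NEBIUS_API_KEY is an assumed variable name.
API_KEY = os.environ.get("NEBIUS_API_KEY", "API_KEY")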
BASE_URL = "https://api.studio.nebius.com/v1/"

MODEL_IDEATE = "Qwen/Qwen3-4B-fast"
MODEL_CRITIQUE = "Qwen/Qwen3-235B-A22B"
MODEL_SAFETY = "Qwen/Qwen3-14B"
MODEL_EVALUATE = "Qwen/Qwen3-235B-A22B"

client = OpenAI(api_key=API_KEY, base_url=BASE_URL)

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

metrics_log = []

## 2. Utilities and Core Logic

This section replicates the essential helper functions from the original `agent_utils.py` file, including the `Context` dataclass for managing run parameters and the `log_json` function for structured logging.

In [3]:
@dataclass
class Context:
    compound: str
    goal: str
    budget: float
    time_h: int
    previous: str
    client: OpenAI
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])
    # We add this field to store the critique result for the final summary
    critique_recommendation: Optional[str] = None

    def prompt_vars(self):
        return {
            "compound": self.compound,
            "goal": self.goal,
            "budget": self.budget,
            "time_h": self.time_h,
            "previous": self.previous,
        }

def log_json(stage: str, data: Any, ctx: Context):
    Path("logs").mkdir(exist_ok=True)
    p = Path("logs") / f"{ctx.run_id}.log"
    with p.open("a", encoding="utf-8") as f:
        log_entry = {"ts": time.time(), "stage": stage, "data": data}
        f.write(json.dumps(log_entry, indent=2) + "\n")
    logging.info(f"Logged '{stage}' data to logs/{ctx.run_id}.log")

def parse_json_from_response(text: str) -> Dict[str, Any]:
    match = re.search(r'```(?:json)?\s*({.*?})\s*```', text, re.S)
    if match:
        json_str = match.group(1)
    else:
        start = text.find('{')
        end = text.rfind('}')
        if start != -1 and end > start:
            json_str = text[start:end+1]
        else:
            return {"raw_text": text}
    try:
        return json.loads(json_str)
    except json.JSONDecodeError:
        logging.warning(f"Failed to parse JSON from response: {text}")
        return {"raw_text": text}
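
To make the fallback order of `parse_json_from_response` concrete, the sketch below runs a copy of the helper on three hand-written replies: one with a fenced JSON block, one with bare JSON surrounded by prose, and one with no JSON at all. The sample strings are invented for illustration, and the fence marker is built with `"`" * 3` only to keep literal backtick runs out of the example.

```python
import json
import logging
import re
from typing import Any, Dict

FENCE = "`" * 3  # three backticks, built programmatically

def parse_json_from_response(text: str) -> Dict[str, Any]:
    """Copy of the notebook helper so this example runs on its own."""
    pattern = FENCE + r'(?:json)?\s*({.*?})\s*' + FENCE
    match = re.search(pattern, text, re.S)
    if match:
        json_str = match.group(1)           # 1) prefer a fenced JSON block
    else:
        start, end = text.find('{'), text.rfind('}')
        if start == -1 or end == -1:
            return {"raw_text": text}       # 3) nothing that looks like JSON
        json_str = text[start:end + 1]      # 2) outermost braces in the reply
    try:
        return json.loads(json_str)
    except json.JSONDecodeError:
        logging.warning("Failed to parse JSON from response")
        return {"raw_text": text}

# Invented sample replies, not taken from a real model run.
fenced = "Here is the plan:\n" + FENCE + 'json\n{"steps": {"count": 2}}\n' + FENCE + "\nLet me know."
bare = 'Sure. {"winner": "B", "justification": "cheaper"} Hope that helps.'
prose = "I could not produce a protocol."

fenced_result = parse_json_from_response(fenced)
bare_result = parse_json_from_response(bare)
prose_result = parse_json_from_response(prose)

print(fenced_result)
print(bare_result)
print(prose_result)
```

The `raw_text` wrapper means downstream stages always receive a dict, but they cannot assume that the keys they asked for are present.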

### 2.1. Mock Tool Implementations

In a real-world scenario, these functions would make API calls to internal databases, chemical suppliers, or literature search engines. For this notebook, we mock them to simulate their behavior and make the example self-contained and runnable.

In [4]:
MOCK_CHEMICALS = {
    "Palladium acetate": {"cost_per_gram": 85.50, "hazards": "Irritant"},
    "Triphenylphosphine": {"cost_per_gram": 12.75, "hazards": "Irritant"},
    "Triethylamine": {"cost_per_gram": 5.25, "hazards": "Flammable, corrosive"},
    "Sodium borohydride": {"cost_per_gram": 8.90, "hazards": "Flammable, water-reactive"},
    "Dimethylformamide": {"cost_per_gram": 3.15, "hazards": "Reproductive toxin"},
    "Palladium chloride": {"cost_per_gram": 75.20, "hazards": "Irritant, potential carcinogen"},
    "Potassium carbonate": {"cost_per_gram": 2.50, "hazards": "Irritant"},
    "Toluene": {"cost_per_gram": 1.75, "hazards": "Flammable, CNS depressant"},
}
MOCK_OUTCOMES = {"XYZ-13": [{"id": "exp-001", "yield": 62.3, "notes": "Yield decreased above 85C."}]}
MOCK_LITERATURE = [{"title": "Palladium-Catalyzed Cross-Coupling for XYZ Derivatives", "abstract": "Improved yields..."}]

def list_available_chemicals(): return {"available_chemicals": list(MOCK_CHEMICALS.keys())}
def chem_lookup(chemical_name: str): return {"properties": MOCK_CHEMICALS.get(chemical_name, {})}
def cost_estimator(reagents: List[Dict]):
    # Ensure reagents is a list of dicts
    if not isinstance(reagents, list):
        return {"status": "error", "message": "Expected a list of reagent dicts."}
    total_cost = sum(
        r.get('amount', r.get('quantity', 0)) * MOCK_CHEMICALS.get(r.get('name', ''), {}).get('cost_per_gram', 0)
        for r in reagents if isinstance(r, dict)
    )
    return {"total_cost": round(total_cost, 2)}
def outcome_db(compound: str): return {"experiments": MOCK_OUTCOMES.get(compound, [])}
def literature_search(query: str): return {"results": MOCK_LITERATURE}

TOOL_DISPATCHER = {
    "list_available_chemicals": list_available_chemicals, "chem_lookup": chem_lookup,
    "cost_estimator": cost_estimator, "outcome_db": outcome_db, "literature_search": literature_search
}

def get_tool_manifest():
    # Abridged schemas for notebook clarity, matching original source
    return [
        {"type": "function", "function": {"name": "list_available_chemicals", "description": "List all available chemicals."}},
        {"type": "function", "function": {"name": "chem_lookup", "description": "Look up chemical properties.", "parameters": {"type": "object", "properties": {"chemical_name": {"type": "string"}}, "required": ["chemical_name"]}}},
        {"type": "function", "function": {"name": "cost_estimator", "description": "Estimate experiment costs.", "parameters": {"type": "object", "properties": {"reagents": {"type": "array", "items": {"type": "object"}}}}}},
        {"type": "function", "function": {"name": "outcome_db", "description": "Query past experiment outcomes.", "parameters": {"type": "object", "properties": {"compound": {"type": "string"}}, "required": ["compound"]}}},
        {"type": "function", "function": {"name": "literature_search", "description": "Search scientific literature.", "parameters": {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]}}}
    ]
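
One behaviour of the mock tools is worth seeing in isolation: both `chem_lookup` and `cost_estimator` match on the exact catalog name, so a synonym or an unlisted reagent silently yields empty properties or a zero cost. The sketch below reproduces this with a two-entry subset of `MOCK_CHEMICALS`. The `CATALOG` name, the reagent quantities, and the `unpriced_reagents` helper are illustrative assumptions and not part of the notebook.

```python
from typing import Any, Dict, List

# Two entries copied from MOCK_CHEMICALS; the rest are omitted for brevity.
CATALOG = {
    "Palladium acetate": {"cost_per_gram": 85.50, "hazards": "Irritant"},
    "Dimethylformamide": {"cost_per_gram": 3.15, "hazards": "Reproductive toxin"},
}

def chem_lookup(chemical_name: str) -> Dict[str, Any]:
    return {"properties": CATALOG.get(chemical_name, {})}

def cost_estimator(reagents: List[Dict[str, Any]]) -> Dict[str, Any]:
    # Same arithmetic as the notebook: grams * cost_per_gram, unknown names cost 0.
    total = sum(
        r.get("amount", r.get("quantity", 0))
        * CATALOG.get(r.get("name", ""), {}).get("cost_per_gram", 0)
        for r in reagents if isinstance(r, dict)
    )
    return {"total_cost": round(total, 2)}

def unpriced_reagents(reagents: List[Dict[str, Any]]) -> List[str]:
    """Hypothetical helper: names that the catalog cannot price."""
    return [r.get("name", "") for r in reagents
            if isinstance(r, dict) and r.get("name", "") not in CATALOG]

# Invented quantities, for illustration only.
reagents = [
    {"name": "Palladium acetate", "amount": 2},
    {"name": "Dimethylformamide", "quantity": 100},
    {"name": "Cesium carbonate", "amount": 50},   # not in the catalog
]

estimate = cost_estimator(reagents)                 # 2*85.50 + 100*3.15 + 0
missing = unpriced_reagents(reagents)
synonym_hit = chem_lookup("N,N-Dimethylformamide")  # synonym, so no match

print(estimate, missing, synonym_hit)
```

In a real deployment the lookup would presumably normalize names or use registry identifiers; here the point is only that an underestimated cost can look like a valid result.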

### 2.2. Core Agent Runner

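This central function, `call_openai`, is adapted from the source `agent_utils.py`. It drives the whole interaction with the language model: it sends the initial request, executes any tool calls the model asks for, appends the results to the conversation, and repeats for up to five turns. If the model is still requesting tools after the fifth turn, the function returns an error dict instead of a protocol.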

In [5]:
def call_openai(client: OpenAI, model: str, system: str, user: str, ctx: Context, step_name: str) -> Dict[str, Any]:
    messages = [{"role": "system", "content": system}, {"role": "user", "content": user}]
    
    for i in range(5):  # Allow up to 5 tool call iterations
        start_time = time.time()
        logging.info(f"Running step '{step_name}' with model '{model}' (Turn {i})...")
        response = client.chat.completions.create(
            model=model, messages=messages, tools=get_tool_manifest(), tool_choice="auto"
        )
        latency = time.time() - start_time
        msg = response.choices[0].message
        messages.append(msg)

        p_tokens, c_tokens = response.usage.prompt_tokens, response.usage.completion_tokens
        metrics_log.append({"step": f"{step_name}_{i}", "model": model, "latency_s": latency, "prompt_tokens": p_tokens, "completion_tokens": c_tokens, "total_tokens": p_tokens + c_tokens})

        if not msg.tool_calls:
            # msg.content can be None on some responses; fall back to an empty string.
            final_json = parse_json_from_response(msg.content or "")
            log_json(step_name, final_json, ctx)
            return final_json

        logging.info(f"Agent requested {len(msg.tool_calls)} tool call(s)...")
        for tool_call in msg.tool_calls:
            function_name = tool_call.function.name
            try:
                args = json.loads(tool_call.function.arguments)
                logging.info(f"Calling tool: {function_name}({args})")
                result = TOOL_DISPATCHER[function_name](**args)
            except Exception as e:
                logging.error(f"Tool call failed for {function_name}: {e}")
                result = {"status": "error", "message": str(e)}
            
            messages.append({"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(result)})
            
    return {"error": "Exceeded maximum tool call limit."}
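
The control flow of `call_openai` can be exercised offline by swapping the API client for a scripted stand-in. The sketch below is a simplified analogue, not the notebook's function: `scripted_model`, `stubborn_model`, `run_agent`, and the plain-dict message format are all assumptions made so the example runs without network access. It shows the two exits of the loop: a final answer after one tool round-trip, and the error dict once the turn budget is spent.

```python
import json
from typing import Any, Callable, Dict, List

# Hypothetical stand-in for TOOL_DISPATCHER with a single tool.
TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "outcome_db": lambda compound: {"experiments": [{"id": "exp-001", "yield": 62.3}]},
}

def scripted_model(messages: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Fake LLM: requests one tool call, then answers with JSON."""
    if not any(m["role"] == "tool" for m in messages):
        call = {"id": "call_1", "name": "outcome_db",
                "arguments": json.dumps({"compound": "XYZ-13"})}
        return {"role": "assistant", "content": None, "tool_calls": [call]}
    return {"role": "assistant", "content": '{"plan": "lower the temperature"}',
            "tool_calls": []}

def stubborn_model(messages: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Fake LLM that never stops asking for tools (and omits a required argument)."""
    call = {"id": "call_x", "name": "outcome_db", "arguments": "{}"}
    return {"role": "assistant", "content": None, "tool_calls": [call]}

def run_agent(model, system: str, user: str, max_turns: int = 5) -> Dict[str, Any]:
    messages = [{"role": "system", "content": system}, {"role": "user", "content": user}]
    for _ in range(max_turns):
        msg = model(messages)
        messages.append(msg)
        if not msg["tool_calls"]:                      # final answer: parse and return
            return json.loads(msg["content"])
        for call in msg["tool_calls"]:                 # otherwise run each requested tool
            try:
                result = TOOLS[call["name"]](**json.loads(call["arguments"]))
            except Exception as exc:                   # tool errors go back to the model
                result = {"status": "error", "message": str(exc)}
            messages.append({"role": "tool", "tool_call_id": call["id"],
                             "content": json.dumps(result)})
    return {"error": "Exceeded maximum tool call limit."}

answer = run_agent(scripted_model, "You are a chemist.", "Improve the yield.")
stuck = run_agent(stubborn_model, "You are a chemist.", "Improve the yield.")
print(answer)
print(stuck)
```

Because the error dict is returned through the same channel as a protocol, later stages receive it as if it were a candidate plan unless they check for the `error` key.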

## 3. Running the Co-Scientist Pipeline

In [6]:
# Step 3.1: Scientist Input & Context Initialization
user_input = {
    "compound": "XYZ-13",
    "goal": "Improve synthesis yield by 15%",
    "budget": 15000,
    "time_h": 48,
    "previous": "Prior attempts failed at high temp; explore potential catalyst effects.",
    "client": client
}
ctx = Context(**user_input)

display(Markdown(f"### Starting Run ID: `{ctx.run_id}` for compound: **{ctx.compound}**"))

### Starting Run ID: `83a87bd3` for compound: **XYZ-13**

In [7]:
# Step 3.2: Ideation with Role-Specific Agents (called one after another in this cell)
ROLE_FOCUS = {
    "hypothesis_agent": "You are a pharmaceutical hypothesis specialist. Focus exclusively on analyzing the compound structure and research goals to generate testable hypotheses. Consider mechanism of action, binding affinity predictions, and potential off-target effects.",
    "protocol_agent": "You are a laboratory protocol specialist. Design experimental procedures that will effectively test the provided hypothesis. Focus on experimental conditions, controls, and measurement techniques.",
    "resource_agent": "You are a laboratory resource optimization specialist. Review the proposed protocol and optimize for efficiency. Identify opportunities to reduce reagent use, equipment time, and overall costs while maintaining scientific validity."
}

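# This two-part prompt construction mirrors the original source.
# Note: the template has no {focus} placeholder, so the ROLE_FOCUS text passed to
# .format() in ideation() is ignored; only the role key (e.g. "hypothesis_agent")
# reaches the model.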
IDEATION_PROMPT = """You are a pharmaceutical {role} specialist. Your goal is to {goal} for compound {compound}.
Constraints:\n- Budget: ${budget}\n- Approved reagents only\n- Complete within {time_h} hours\n- Previous attempts: {previous}
Respond with structured JSON describing your protocol."""

IDEATION_PROMPT += """\nUse the following tools as appropriate:
- Use the `list_available_chemicals` tool to get list of approved reagents.
- Use the `chem_lookup` tool to verify properties of reagents mentioned.
- Use the `cost_estimator` tool to calculate the approximate cost based on reagents and proposed steps.
- Check the `outcome_db` for relevant prior experiments with {compound}"""

def ideation(ctx: Context) -> List[Dict]:
    logging.info("--- Starting Ideation Phase ---")
    ideas = []
    for role, focus in tqdm(ROLE_FOCUS.items(), desc="Running Ideation Agents"):
        sys_prompt = IDEATION_PROMPT.format(role=role, focus=focus, **ctx.prompt_vars())
        user_prompt = f"Design a protocol to {ctx.goal} within ${ctx.budget}."
        idea = call_openai(ctx.client, MODEL_IDEATE, sys_prompt, user_prompt, ctx, f"ideation_{role}")
        ideas.append(idea)
    log_json("ideation_done", ideas, ctx)
    return ideas

generated_ideas = ideation(ctx)

2025-08-17 22:51:37,331 - INFO - --- Starting Ideation Phase ---


Running Ideation Agents:   0%|          | 0/3 [00:00<?, ?it/s]

2025-08-17 22:51:37,364 - INFO - Running step 'ideation_hypothesis_agent' with model 'Qwen/Qwen3-4B-fast' (Turn 0)...
2025-08-17 22:51:41,565 - INFO - HTTP Request: POST https://api.studio.nebius.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-08-17 22:51:41,597 - INFO - Agent requested 1 tool call(s)...
2025-08-17 22:51:41,597 - INFO - Calling tool: list_available_chemicals({})
2025-08-17 22:51:41,597 - INFO - Running step 'ideation_hypothesis_agent' with model 'Qwen/Qwen3-4B-fast' (Turn 1)...
2025-08-17 22:51:43,520 - INFO - HTTP Request: POST https://api.studio.nebius.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-08-17 22:51:43,520 - INFO - Agent requested 1 tool call(s)...
2025-08-17 22:51:43,520 - INFO - Calling tool: outcome_db({'compound': 'XYZ-13'})
2025-08-17 22:51:43,520 - INFO - Running step 'ideation_hypothesis_agent' with model 'Qwen/Qwen3-4B-fast' (Turn 2)...
2025-08-17 22:51:45,259 - INFO - HTTP Request: POST https://api.studio.nebius.com/v1/chat/completions "HTTP/1.1 20
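
The ideation loop above calls the three agents sequentially. Since each call mostly waits on the network, one way to run them concurrently is a thread pool. The sketch below shows the pattern with a stand-in function instead of a real API call; `fake_ideation_call` and `ideate_concurrently` are hypothetical names, and the sleep only simulates latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

ROLES = ["hypothesis_agent", "protocol_agent", "resource_agent"]

def fake_ideation_call(role: str) -> Dict[str, str]:
    """Stand-in for one agent call; sleeps briefly instead of hitting an API."""
    time.sleep(0.05)
    return {"role": role, "idea": f"plan from {role}"}

def ideate_concurrently(roles: List[str]) -> List[Dict[str, str]]:
    # One worker per role; pool.map yields results in the order of `roles`,
    # regardless of which call finishes first.
    with ThreadPoolExecutor(max_workers=len(roles)) as pool:
        return list(pool.map(fake_ideation_call, roles))

ideas = ideate_concurrently(ROLES)
for idea in ideas:
    print(idea)
```

If this pattern were applied to the notebook's agent runner, entries from different agents would interleave in the shared `metrics_log`, so per-agent grouping would need to rely on the `step` field rather than list order.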

In [8]:
# Step 3.3: Tournament Ranking
TOURNAMENT_PROMPT = """Protocol A: {protocol_a}
Protocol B: {protocol_b}

Compare Protocol A and Protocol B for synthesizing {compound} aimed at {goal}. Score them on:
1. Likelihood of achieving ≥ 15% yield increase.
2. Practical feasibility (reagents, time).
3. Estimated cost-efficiency (use tool if needed).
4. Scientific novelty/risk.

Return JSON {{"winner": "A"|"B", "justification": "..."}}."""

def tournament(protocols: List[Dict], ctx: Context) -> Dict:
    logging.info("--- Starting Tournament Ranking Phase ---")
    if not protocols: return {}
    if len(protocols) == 1: return protocols[0]
    
    protocol_a, protocol_b = protocols[0], protocols[1]
    # Note: only the first two protocols are compared here; any others are ignored.
    # A real implementation would compare pairs in a bracket style.
    sys_prompt = TOURNAMENT_PROMPT.format(
        protocol_a=json.dumps(protocol_a), protocol_b=json.dumps(protocol_b), **ctx.prompt_vars()
    )
    result = call_openai(ctx.client, MODEL_IDEATE, sys_prompt, "Choose the winning protocol.", ctx, "tournament")
    # Coerce to str so a missing or non-string 'winner' value does not raise.
    winner = protocol_a if str(result.get('winner', 'A')).strip().upper() == 'A' else protocol_b
    logging.info(f"Tournament winner selected. Justification: {result.get('justification')}")
    log_json("tournament", result, ctx)
    return winner

top_protocol = tournament(generated_ideas, ctx)

2025-08-17 22:52:16,144 - INFO - --- Starting Tournament Ranking Phase ---
2025-08-17 22:52:16,149 - INFO - Running step 'tournament' with model 'Qwen/Qwen3-4B-fast' (Turn 0)...
2025-08-17 22:52:21,202 - INFO - HTTP Request: POST https://api.studio.nebius.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-08-17 22:52:21,217 - INFO - Logged 'tournament' data to logs/83a87bd3.log
2025-08-17 22:52:21,217 - INFO - Tournament winner selected. Justification: Protocol B provides a detailed, actionable plan with specific reagents, budget, and scientific rationale. Protocol A is invalid due to the 'exceeded maximum tool call limit' error. Protocol B's steps (heterogeneous catalyst, microwave-assisted synthesis, base addition) align with established methods for yield improvement, and its cost breakdown is practical. While Protocol B carries moderate risk (e.g., catalyst efficiency), its feasibility and scientific justification make it the superior choice.
2025-08-17 22:52:21,217 - INFO - Logged 'tou
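
The bracket-style comparison mentioned in the code comment could look like the following single-elimination sketch. It is an illustration under assumptions, not the original implementation: `run_bracket` and `yield_judge` are hypothetical, the judge is a local function standing in for an LLM call that returns `"A"` or `"B"`, and the `expected_yield_gain` field and the candidate data are made up. Candidates that are just an error dict from the agent runner are filtered out first.

```python
from typing import Any, Callable, Dict, List

Protocol = Dict[str, Any]
comparisons: List[tuple] = []  # records each pairing, for inspection only

def yield_judge(a: Protocol, b: Protocol) -> str:
    """Stand-in for the LLM judge: prefers the larger (made-up) expected gain."""
    comparisons.append((a.get("title"), b.get("title")))
    return "A" if a.get("expected_yield_gain", 0) >= b.get("expected_yield_gain", 0) else "B"

def run_bracket(protocols: List[Protocol], judge: Callable[[Protocol, Protocol], str]) -> Protocol:
    """Single-elimination bracket; with an odd count the last entry gets a bye."""
    if not protocols:
        return {}
    contenders = list(protocols)
    while len(contenders) > 1:
        next_round = []
        for i in range(0, len(contenders) - 1, 2):
            a, b = contenders[i], contenders[i + 1]
            next_round.append(a if judge(a, b) == "A" else b)
        if len(contenders) % 2 == 1:
            next_round.append(contenders[-1])  # bye into the next round
        contenders = next_round
    return contenders[0]

# Invented candidates, for illustration only.
ideas = [
    {"error": "Exceeded maximum tool call limit."},
    {"title": "Microwave-assisted", "expected_yield_gain": 12},
    {"title": "Ligand screen", "expected_yield_gain": 16},
    {"title": "Solvent swap", "expected_yield_gain": 9},
]
valid = [p for p in ideas if "error" not in p]
champion = run_bracket(valid, yield_judge)
print(champion["title"], len(comparisons))
```

A bracket over n valid candidates needs n - 1 judge calls, so three plans cost two comparisons instead of the single one performed in the cell above.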

In [9]:
# Step 3.4: Deep Critique & Synthesis
CRITIQUE_PROMPT = """You are a senior researcher reviewing a proposed synthesis protocol 
for {compound} aiming for {goal}, budget ${budget} using approved reagents. Review the protocol below rigorously:
1. Identify scientific flaws or methodological weaknesses.
2. Assess safety risks and budget compliance (use `cost_estimator` tool if needed).
3. Check for consistency with prior `outcome_db` results if relevant.
4. Suggest concrete improvements or rewrite sections if necessary.
5. Provide a final go/no-go recommendation.

Return JSON {{"revised_protocol": ..., "critique": "...", "recommendation": "go|no-go"}}."""

def critique(protocol: Dict, ctx: Context) -> Dict:
    logging.info("--- Starting Deep Critique Phase ---")
    sys_prompt = CRITIQUE_PROMPT.format(**ctx.prompt_vars())
    user_prompt = f"Protocol to Review:\n{json.dumps(protocol)}"
    result = call_openai(ctx.client, MODEL_CRITIQUE, sys_prompt, user_prompt, ctx, "critique")
    logging.info(f"Critique complete. Recommendation: {result.get('recommendation')}")
    ctx.critique_recommendation = result.get('recommendation', 'N/A')
    return result.get("revised_protocol", protocol)

critiqued_protocol = critique(top_protocol, ctx)

2025-08-17 22:52:34,306 - INFO - --- Starting Deep Critique Phase ---
2025-08-17 22:52:34,314 - INFO - Running step 'critique' with model 'Qwen/Qwen3-235B-A22B' (Turn 0)...
2025-08-17 22:53:28,893 - INFO - HTTP Request: POST https://api.studio.nebius.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-08-17 22:53:28,907 - INFO - Logged 'critique' data to logs/83a87bd3.log
2025-08-17 22:53:28,911 - INFO - Critique complete. Recommendation: go
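
The critique stage stores the recommendation as free text and the pipeline continues regardless of its value. If a later stage were to gate on it, the string would first need normalizing, since a model may answer `Go`, `NO GO`, or echo the `go|no-go` template. The helper below is a hypothetical sketch of such a normalizer and is not part of the original workflow.

```python
from typing import Any

def normalize_recommendation(raw: Any) -> str:
    """Map a free-text recommendation to 'go', 'no-go', or 'unknown'."""
    if not isinstance(raw, str):
        return "unknown"
    # Lowercase, treat underscores and hyphens as spaces, then re-join with hyphens.
    token = "-".join(raw.strip().lower().replace("_", " ").replace("-", " ").split())
    if token == "go":
        return "go"
    if token in {"no-go", "nogo"}:
        return "no-go"
    return "unknown"

samples = ["go", " Go\n", "NO GO", "no_go", "go|no-go", None]
normalized = [normalize_recommendation(s) for s in samples]
print(normalized)
```

Treating anything unrecognized as `unknown` rather than `go` keeps an unparsed critique from being read as approval.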


In [10]:
# Step 3.5: Safety Check
SAFETY_PROMPT = """You are a lab‑safety specialist. 
Identify hazards, unsafe conditions, or compliance issues in this protocol for {compound}. 
Use `chem_lookup` tool if needed. Return JSON assessment."""

def safety(protocol: Dict, ctx: Context) -> Dict:
    logging.info("--- Starting Safety Check Phase ---")
    sys_prompt = SAFETY_PROMPT.format(**ctx.prompt_vars())
    user_prompt = json.dumps(protocol)
    assessment = call_openai(ctx.client, MODEL_SAFETY, sys_prompt, user_prompt, ctx, "safety_check")
    return {"protocol": protocol, "safety_assessment": assessment}

final_package = safety(critiqued_protocol, ctx)

2025-08-17 22:54:17,329 - INFO - --- Starting Safety Check Phase ---
2025-08-17 22:54:17,329 - INFO - Running step 'safety_check' with model 'Qwen/Qwen3-14B' (Turn 0)...
2025-08-17 22:54:39,070 - INFO - HTTP Request: POST https://api.studio.nebius.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-08-17 22:54:39,080 - INFO - Agent requested 2 tool call(s)...
2025-08-17 22:54:39,080 - INFO - Calling tool: chem_lookup({'chemical_name': 'N,N-Dimethylformamide'})
2025-08-17 22:54:39,083 - INFO - Calling tool: chem_lookup({'chemical_name': 'Cesium carbonate'})
2025-08-17 22:54:39,087 - INFO - Running step 'safety_check' with model 'Qwen/Qwen3-14B' (Turn 1)...
2025-08-17 22:54:54,801 - INFO - HTTP Request: POST https://api.studio.nebius.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-08-17 22:54:54,812 - INFO - Logged 'safety_check' data to logs/83a87bd3.log


In [11]:
# Step 3.6: [ENHANCEMENT] Automated Quality Evaluation
def evaluate_protocol_quality(protocol: Dict, ctx: Context) -> Dict:
    logging.info("--- Starting Automated Quality Evaluation ---")
    system_prompt = "You are an expert panel of scientists evaluating a research protocol."
    user_prompt = f"""Score the following protocol for the goal '{ctx.goal}' on a scale of 0.0 to 1.0 for each category.
    - scientific_validity: How sound is the underlying science and hypothesis?
    - feasibility: How practical is this to execute in a standard lab?
    - innovation: How novel is this approach?
    - cost_effectiveness: How well does it balance potential outcomes with cost?
    - clarity_and_reproducibility: How clear and easy to follow are the instructions?
    PROTOCOL: {json.dumps(protocol)}
    Respond with a JSON object containing a score (float) and justification (string) for each of the five categories."""
    
    quality_scores = call_openai(ctx.client, MODEL_EVALUATE, system_prompt, user_prompt, ctx, "quality_evaluation")
    return quality_scores

quality_assessment = evaluate_protocol_quality(final_package['protocol'], ctx)

2025-08-17 22:54:57,408 - INFO - --- Starting Automated Quality Evaluation ---
2025-08-17 22:54:57,408 - INFO - Running step 'quality_evaluation' with model 'Qwen/Qwen3-235B-A22B' (Turn 0)...
2025-08-17 22:55:23,784 - INFO - HTTP Request: POST https://api.studio.nebius.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-08-17 22:55:23,798 - INFO - Logged 'quality_evaluation' data to logs/83a87bd3.log


In [12]:
# Step 3.7: Human Review
def human_review(package: Dict, quality: Dict, ctx: Context) -> Dict:
    logging.info("--- Awaiting Human Review ---")
    display(Markdown("### PROTOCOL FOR HUMAN REVIEW"))
    display(Markdown(f"**Run ID:** `{ctx.run_id}`"))
    display(Markdown(f"**Protocol Title:** {package['protocol'].get('protocol_title', 'N/A')}"))
    display(Markdown("**AI Quality Assessment:**"))
    display(quality)
    display(Markdown("**AI Safety Assessment:**"))
    display(package['safety_assessment'])
    
    while True:
        approval = input("\nApprove protocol for execution? (yes/no): ").lower()
        if approval in ['yes', 'y']:
            logging.info("Protocol APPROVED by human reviewer.")
            return {"approved": True, "final_protocol": package['protocol']}
        if approval in ['no', 'n']:
            logging.info("Protocol REJECTED by human reviewer.")
            return {"approved": False, "final_protocol": package['protocol']}
        print("Invalid input. Please enter 'yes' or 'no'.")

human_decision = human_review(final_package, quality_assessment, ctx)

2025-08-17 22:56:09,431 - INFO - --- Awaiting Human Review ---


### PROTOCOL FOR HUMAN REVIEW

**Run ID:** `83a87bd3`

**Protocol Title:** N/A

**AI Quality Assessment:**

{'scientific_validity': {'score': 0.85,
  'justification': 'The protocol employs established principles of heterogeneous catalysis (Pd/C), ligand immobilization for recyclability, and base selection based on pKa. Microwave heating is a validated method for kinetic enhancement. The rationale aligns with mechanistic understanding of transition-metal catalysis.'},
 'feasibility': {'score': 0.8,
  'justification': 'Pd/C and Cs2CO3 are commercially available; microwave reactors with temperature/pressure control are standard in industrial labs. Scaling may require optimization of solvent volumes and catalyst loading, but no exotic equipment is required. Immobilized ligands may require specialized synthesis.'},
 'innovation': {'score': 0.65,
  'justification': "Combines heterogeneous catalysis with microwave-assisted synthesis and base optimization. While not entirely novel individually, the specific integration for this compound's synthesis represents incremental innovation. Similar approach

**AI Safety Assessment:**

{'hazards': [{'type': 'chemical_hazard',
   'description': 'Cesium carbonate (Cs2CO3) is a strong base and corrosive substance. Prolonged skin contact or inhalation may cause irritation or chemical burns. Requires PPE (gloves, goggles) and proper handling procedures.'},
  {'type': 'toxicity',
   'description': 'DMF is a suspected carcinogen and skin irritant. Inhalation or skin contact must be minimized using fume hoods, gloves, and lab coats.'},
  {'type': 'fire_explosion',
   'description': 'Pd/C (carbon support) dust may pose a fire hazard if not handled properly (e.g., under inert atmosphere or with dust controls).'},
  {'type': 'equipment_safety',
   'description': 'Microwave-assisted synthesis requires a reactor rated for pressure and thermal stability. Lack of specifications for equipment compatibility raises safety concerns.'},
  {'type': 'waste_disposal',
   'description': 'No protocols for disposal of Cs2CO3, DMF, or catalyst waste. Cesium compounds and organic solvents requi

2025-08-17 22:56:24,140 - INFO - Protocol APPROVED by human reviewer.


In [13]:
# Step 3.8: Execution & Learning (Mocked)
ANALYSIS_PROMPT = """You are a data analyst. 
Did the experiment achieve {goal}? Analyse factors, suggest improvements, and return structured JSON."""

def execute_and_analyse(decision: Dict, ctx: Context) -> Optional[Dict]:
    if not decision['approved']:
        logging.warning("Execution skipped as protocol was not approved.")
        return None
    
    logging.info("--- Starting Mock Execution and Learning Phase ---")
    mock_results = {"yield_improvement": 12.5, "success": False, "actual_cost": 12500, "notes": "Yield improved but did not meet 15% target."}
    
    sys_prompt = ANALYSIS_PROMPT.format(**ctx.prompt_vars())
    user_prompt = f"""The experiment has been run. The protocol was: {json.dumps(decision['final_protocol'])}. The results were: {json.dumps(mock_results)}. Analyze these results and return a structured JSON with a 'summary' and 'next_steps'."""
    
    analysis = call_openai(ctx.client, MODEL_CRITIQUE, sys_prompt, user_prompt, ctx, "analysis")
    log_json("analysis", analysis, ctx)
    
    # Write final summary to file, as in original source
    Path("output").mkdir(exist_ok=True)
    out_path = Path("output") / f"{ctx.run_id}_summary.json"
    out_path.write_text(json.dumps(analysis, indent=2))
    logging.info(f"Completed. Analysis summary written to {out_path}")
    return analysis

learning_summary = execute_and_analyse(human_decision, ctx)

2025-08-17 22:56:35,420 - INFO - --- Starting Mock Execution and Learning Phase ---
2025-08-17 22:56:35,427 - INFO - Running step 'analysis' with model 'Qwen/Qwen3-235B-A22B' (Turn 0)...
2025-08-17 22:56:53,885 - INFO - HTTP Request: POST https://api.studio.nebius.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-08-17 22:56:53,901 - INFO - Logged 'analysis' data to logs/83a87bd3.log
2025-08-17 22:56:53,905 - INFO - Logged 'analysis' data to logs/83a87bd3.log
2025-08-17 22:56:53,910 - INFO - Completed. Analysis summary written to output\83a87bd3_summary.json
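
Reusing these logs in a later run requires reading them back. Because `log_json` writes each entry with `indent=2`, the file is a sequence of multi-line JSON objects rather than one object per line, so a line-by-line `json.loads` would fail. A minimal reader, assuming the file format produced by `log_json`, can use `json.JSONDecoder.raw_decode` to consume one object at a time. The `iter_log_entries` name and the demo entries are illustrative.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Iterator

def iter_log_entries(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield each pretty-printed JSON object from a concatenated log file."""
    decoder = json.JSONDecoder()
    text = path.read_text(encoding="utf-8")
    pos = 0
    while True:
        # raw_decode does not skip leading whitespace, so advance past it manually.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            return
        entry, pos = decoder.raw_decode(text, pos)
        yield entry

# Write two demo entries in the same shape log_json uses, then read them back.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "demo.log"
    with log_path.open("a", encoding="utf-8") as f:
        for stage, data in [("ideation_done", [{"idea": 1}]), ("tournament", {"winner": "B"})]:
            f.write(json.dumps({"ts": 0.0, "stage": stage, "data": data}, indent=2) + "\n")
    entries = list(iter_log_entries(log_path))

stages = [e["stage"] for e in entries]
print(stages)
```

Writing entries without `indent` would make the file valid JSON Lines and simplify reading, at the cost of less readable logs.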


## 4. Final Analysis and Summary

Finally, we'll consolidate all our metrics into two clear summaries. The first DataFrame provides a detailed breakdown of each API call, while the second offers a high-level summary of the entire run's performance and quality.

In [14]:
model_prices_per_million_tokens = {
    "Qwen/Qwen3-4B-fast": {
        "input": 0.08,  # Example price in USD per 1M input tokens
        "output": 0.24   # Example price in USD per 1M output tokens
    },
    "Qwen/Qwen3-14B": {
        "input": 0.08,  # Example price
        "output": 0.24   # Example price
    },
    "Qwen/Qwen3-235B-A22B": {
        "input": 0.20,  # Example price for a large model
        "output": 0.60   # Example price for a large model
    }
}

In [15]:
# 4.1 Per-Step Operational Metrics
if metrics_log:
    df_metrics = pd.DataFrame(metrics_log)

    def calculate_cost(row):
        prices = model_prices_per_million_tokens.get(row['model'], {"input": 0, "output": 0})
        return (row['prompt_tokens'] / 1_000_000) * prices['input'] + (row['completion_tokens'] / 1_000_000) * prices['output']

    df_metrics['cost_usd'] = df_metrics.apply(calculate_cost, axis=1)
    
    display(Markdown("### Per-Step Performance and Cost Analysis"))
    display(df_metrics)
else:
    logging.info("No metrics were logged.")

### Per-Step Performance and Cost Analysis

| | step | model | latency_s | prompt_tokens | completion_tokens | total_tokens | cost_usd |
|---|---|---|---|---|---|---|---|
| 0 | ideation_hypothesis_agent_0 | Qwen/Qwen3-4B-fast | 4.232954 | 525 | 446 | 971 | 0.000149 |
| 1 | ideation_hypothesis_agent_1 | Qwen/Qwen3-4B-fast | 1.92325 | 1039 | 262 | 1301 | 0.000146 |
| 2 | ideation_hypothesis_agent_2 | Qwen/Qwen3-4B-fast | 1.752038 | 1351 | 217 | 1568 | 0.00016 |
| 3 | ideation_hypothesis_agent_3 | Qwen/Qwen3-4B-fast | 2.438934 | 1613 | 223 | 1836 | 0.000183 |
| 4 | ideation_hypothesis_agent_4 | Qwen/Qwen3-4B-fast | 2.678942 | 1873 | 380 | 2253 | 0.000241 |
| 5 | ideation_protocol_agent_0 | Qwen/Qwen3-4B-fast | 3.111892 | 525 | 447 | 972 | 0.000149 |
| 6 | ideation_protocol_agent_1 | Qwen/Qwen3-4B-fast | 8.213036 | 1119 | 686 | 1805 | 0.000254 |
| 7 | ideation_resource_agent_0 | Qwen/Qwen3-4B-fast | 2.250857 | 525 | 302 | 827 | 0.000114 |
| 8 | ideation_resource_agent_1 | Qwen/Qwen3-4B-fast | 2.146295 | 877 | 296 | 1173 | 0.000141 |
| 9 | ideation_resource_agent_2 | Qwen/Qwen3-4B-fast | 3.374945 | 1241 | 483 | 1724 | 0.000215 |
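
As a check on the `cost_usd` column, the per-call formula can be applied by hand to the first row (525 prompt tokens, 446 completion tokens on `Qwen/Qwen3-4B-fast`). The sketch below repeats the arithmetic of `calculate_cost` without pandas, using the example prices from the pricing cell; `call_cost` is an illustrative name.

```python
# Example prices in USD per 1M tokens, copied from the pricing cell.
PRICES_PER_MILLION = {"Qwen/Qwen3-4B-fast": {"input": 0.08, "output": 0.24}}

def call_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    # Unknown models fall back to a price of 0, as in calculate_cost.
    prices = PRICES_PER_MILLION.get(model, {"input": 0, "output": 0})
    return (prompt_tokens / 1_000_000) * prices["input"] \
        + (completion_tokens / 1_000_000) * prices["output"]

# Row 0: 525 * 0.08 / 1e6 = 0.000042, and 446 * 0.24 / 1e6 = 0.00010704
row0 = call_cost("Qwen/Qwen3-4B-fast", 525, 446)
row0_rounded = round(row0, 6)
unknown_model = call_cost("some/unlisted-model", 1000, 1000)

print(f"{row0:.8f}")   # 0.00014904, shown as 0.000149 in the table
print(unknown_model)
```

The zero-price fallback means a model missing from the price table lowers the reported total instead of raising an error, which is worth keeping in mind when swapping models.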


In [16]:
# 4.2 Final Run Summary
if metrics_log:
    def get_score(assessment, key):
        if isinstance(assessment, dict) and key in assessment and isinstance(assessment[key], dict):
            return assessment[key].get('score', 0.0)
        return 0.0

    scores = {
        'validity': get_score(quality_assessment, 'scientific_validity'),
        'feasibility': get_score(quality_assessment, 'feasibility'),
        'innovation': get_score(quality_assessment, 'innovation'),
        'cost_effect': get_score(quality_assessment, 'cost_effectiveness'),
        'clarity': get_score(quality_assessment, 'clarity_and_reproducibility')
    }
    avg_quality = sum(scores.values()) / len(scores) if scores else 0.0

    summary_data = {
        'Metric': [
            'Run ID',
            'Compound',
            'Total Latency (s)',
            'Total Cost (USD)',
            'Total Tokens',
            'Critique Recommendation',
            'Human Decision',
            '--- Quality Scores (AI) ---',
            'Scientific Validity',
            'Feasibility',
            'Innovation',
            'Cost-Effectiveness',
            'Clarity & Reproducibility',
            '**Overall Quality Score**'
        ],
        'Value': [
            ctx.run_id,
            ctx.compound,
            f"{df_metrics['latency_s'].sum():.2f}",
            f"${df_metrics['cost_usd'].sum():.6f}",
            f"{df_metrics['total_tokens'].sum():,}",
            ctx.critique_recommendation,
            'Approved' if human_decision.get('approved') else 'Rejected',
            '---',
            f"{scores['validity']:.2f}",
            f"{scores['feasibility']:.2f}",
            f"{scores['innovation']:.2f}",
            f"{scores['cost_effect']:.2f}",
            f"{scores['clarity']:.2f}",
            f"**{avg_quality:.2f}**"
        ]
    }
    
    df_summary = pd.DataFrame(summary_data).set_index('Metric')
    
    display(Markdown("### Final Run Summary"))
    display(df_summary)
else:
    logging.info("Cannot generate summary as no metrics were logged.")

### Final Run Summary

| Metric | Value |
|---|---|
| Run ID | 83a87bd3 |
| Compound | XYZ-13 |
| Total Latency (s) | 180.64 |
| Total Cost (USD) | $0.006600 |
| Total Tokens | 34003 |
| Critique Recommendation | go |
| Human Decision | Approved |
| --- Quality Scores (AI) --- | --- |
| Scientific Validity | 0.85 |
| Feasibility | 0.80 |
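
One caveat when reading these scores: `get_score` returns `0.0` for any category it cannot find, so an evaluation reply that failed to parse (and arrived as `{"raw_text": ...}`) would be reported as an overall quality of 0.00, indistinguishable from a genuinely poor protocol. The sketch below shows an alternative that reports missing scores as `None`. It is a design variation under our own assumptions, with hypothetical names, and it averages only the categories that are present, which differs from the notebook's five-way average.

```python
from typing import Any, Optional

CATEGORIES = [
    "scientific_validity", "feasibility", "innovation",
    "cost_effectiveness", "clarity_and_reproducibility",
]

def get_score_or_none(assessment: Any, key: str) -> Optional[float]:
    """Return the numeric score for a category, or None if absent or malformed."""
    entry = assessment.get(key) if isinstance(assessment, dict) else None
    score = entry.get("score") if isinstance(entry, dict) else None
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    return float(score)

def average_quality(assessment: Any) -> Optional[float]:
    """Average over the categories that were actually scored; None if there are none."""
    found = [s for s in (get_score_or_none(assessment, k) for k in CATEGORIES)
             if s is not None]
    return sum(found) / len(found) if found else None

# A partial assessment and an unparsed reply, both invented for illustration.
partial = {"scientific_validity": {"score": 0.85}, "feasibility": {"score": 0.8}}
unparsed = {"raw_text": "The protocol looks reasonable overall."}

partial_avg = average_quality(partial)
unparsed_avg = average_quality(unparsed)
print(partial_avg, unparsed_avg)
```

Surfacing `None` lets the summary show "not scored" instead of a misleading zero.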


## 5. Conclusion

This notebook has demonstrated an end-to-end implementation of the AI Co-Scientist workflow using customizable open-source language models. By preserving the original's multi-agent architecture and adding automated quality scoring and summary reporting, it yields a pipeline whose latency, cost, and output quality can be inspected for every run.

**Key Takeaways:**

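1.  **The Escalation of Intelligence Pattern is Effective:** Using a cheap, fast model for broad ideation and a larger, more expensive model for deep critique kept this run to roughly $0.0066 in API cost at the example prices above. The small model has its own failure modes, though: in this run the hypothesis agent used all five tool-call turns without returning a protocol.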
2.  **Automated Evaluation Adds a Layer of Confidence:** Adding an LLM-based quality assessment step provides a quantifiable measure of the AI's output before human review, aiding decision-making and long-term tracking.
3.  **Observability is Paramount:** Consolidating operational and qualitative metrics into a final summary is critical for understanding system performance, debugging issues, and demonstrating value.