# Virtual Lab: AI-Guided Biomarker Discovery

This notebook demonstrates how LLM agents collaborate to discover prognostic biomarkers from gene expression data.

## Overview

The Virtual Lab assembles a team of AI agents with different expertise. Core roles include:
- **Principal Investigator**: Coordinates the project
- **Biostatistician**: Recommends statistical methods
- **Bioinformatician**: Performs data analysis
- **Clinical Oncologist**: Provides clinical context
- **Systems Biologist**: Interprets biological mechanisms
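
To make the idea of role-specific agents concrete, here is a minimal sketch of how one team member could be described in code. `AgentSpec`, its fields, and the expertise strings are hypothetical stand-ins for illustration; the real agent objects live in `biomarker_constants` and may be structured differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSpec:
    """Hypothetical description of one team member (not the project's own class)."""
    title: str
    expertise: str
    role: str

    def system_prompt(self) -> str:
        # Turn the role description into a system prompt for the LLM.
        return (
            f"You are the {self.title}. "
            f"Your expertise is in {self.expertise}. "
            f"Your role is to {self.role}."
        )


# Illustrative values only; the expertise strings are assumptions.
example_team = [
    AgentSpec("Principal Investigator", "translational cancer research", "coordinate the project"),
    AgentSpec("Biostatistician", "survival analysis", "recommend statistical methods"),
]

for member in example_team:
    print(member.system_prompt())
```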

The agents:
1. **Discuss** analysis strategies in team meetings
2. **Search** PubMed for relevant literature
3. **Recommend** statistical methods
4. **Interpret** results in biological/clinical context
5. **Propose** validation strategies

---

In [None]:
import sys
sys.path.append('../src')
sys.path.append('.')

from virtual_lab import run_meeting
from biomarker_constants import (
    BIOMARKER_TEAM,
    PLANNING_TEAM,
    ANALYSIS_TEAM,
    INTERPRETATION_TEAM,
    VALIDATION_TEAM,
    PI,
    BIOSTATISTICIAN,
    BIOINFORMATICIAN,
    METHODS_CRITIC,
    CLINICAL_CRITIC
)

import pandas as pd
import json
from pathlib import Path

# Create discussions directory
discussions_dir = Path('discussions')
discussions_dir.mkdir(exist_ok=True)

print("Virtual Lab Biomarker Discovery initialized!")
print(f"Team size: {len(BIOMARKER_TEAM)} agents")

## Step 1: Project Planning Meeting

The team discusses the overall biomarker discovery strategy.

In [None]:
planning_agenda = """
# Biomarker Discovery Project Planning

## Project Goal
Identify robust prognostic biomarker genes from TCGA triple-negative breast cancer (TNBC) data.

## Dataset
- File: Example_TCGA_TNBC_data.csv
- Samples: 144 TCGA TNBC patients  
- Features: ~20,000 gene expression values (log-transformed)
- Outcomes: Overall survival status (OS) and survival time in years (OS.year)

## Discussion Points
1. What statistical methods should we use for biomarker discovery?
2. How should we handle multiple testing correction?
3. What criteria define a "robust" biomarker?
4. How can we ensure clinical relevance?
5. What validation strategies should we plan?

## Deliverables
- Consensus on statistical approach
- List of methods to implement
- Success criteria for biomarker candidates
"""

planning_meeting = run_meeting(
    agent_team=PLANNING_TEAM,
    agenda=planning_agenda,
    model="gpt-4o-2024-08-06",
    temperature=0.7,
    num_rounds=3
)

# Save discussion
(discussions_dir / 'project_planning').mkdir(exist_ok=True)
planning_meeting.save(discussions_dir / 'project_planning')
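
One of the planning questions is how to handle multiple testing when roughly 20,000 genes are tested. A common choice the team might settle on is the Benjamini-Hochberg false discovery rate procedure. The sketch below is a plain-Python illustration of that procedure, not the implementation used by the project's scripts; the gene names and p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (q-values) in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy per-gene p-values (hypothetical).
toy_p = {"GENE_A": 0.001, "GENE_B": 0.008, "GENE_C": 0.039, "GENE_D": 0.041, "GENE_E": 0.60}
q_values = benjamini_hochberg(list(toy_p.values()))
significant = [gene for gene, q in zip(toy_p, q_values) if q < 0.05]
print(significant)
```

In this toy input, four genes have a raw p-value below 0.05, but only `GENE_A` and `GENE_B` stay below a 0.05 false discovery rate after adjustment.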

## Step 2: Statistical Methods Selection

The biostatistician recommends specific statistical approaches.

In [None]:
methods_agenda = """
# Statistical Methods for Biomarker Discovery

## Context
Based on our planning discussion, we need to select appropriate statistical methods.

## Available Approaches
1. **Cox Proportional Hazards Regression**: Univariate survival analysis for each gene
2. **Log-rank Test (Kaplan-Meier)**: Compare high vs. low expression groups
3. **Differential Expression**: Mann-Whitney U test between event and censored groups
4. **Elastic Net Cox**: Multivariate regularized survival analysis

## Discussion Points
1. Which methods are most appropriate for this dataset?
2. What are the assumptions and limitations of each method?
3. How should we combine results from multiple methods?
4. What p-value threshold and correction method should we use?

## Search PubMed
Please search for:
- Recent biomarker studies in triple-negative breast cancer
- Best practices for survival-based biomarker discovery
- Methods for robust biomarker selection

## Deliverable
Recommended statistical workflow with justification.
"""

methods_meeting = run_meeting(
    agent_team=ANALYSIS_TEAM,
    agenda=methods_agenda,
    model="gpt-4o-2024-08-06",
    temperature=0.7,
    num_rounds=3
)

(discussions_dir / 'methods_selection').mkdir(exist_ok=True)
methods_meeting.save(discussions_dir / 'methods_selection')
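
As an illustration of the second approach in the agenda (log-rank test on high vs. low expression groups), the following sketch splits patients at the median expression of one gene and compares the two groups. It is a dependency-free teaching version under simplifying assumptions (two groups, median cut-off, toy data) and is not the code in `select_marker_genes.py`.

```python
import math
from statistics import median


def logrank_test(group_a, group_b):
    """Two-group log-rank test.

    Each group is a list of (time, event) pairs, where event is 1 for an
    observed death and 0 for a censored observation.
    Returns (chi-square statistic, p-value) with 1 degree of freedom.
    """
    pooled = [(t, e, 0) for t, e in group_a] + [(t, e, 1) for t, e in group_b]
    event_times = sorted({t for t, e, _ in pooled if e})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        at_risk_a = sum(1 for tt, _, g in pooled if tt >= t and g == 0)
        at_risk = sum(1 for tt, _, _ in pooled if tt >= t)
        deaths_a = sum(1 for tt, e, g in pooled if tt == t and e and g == 0)
        deaths = sum(1 for tt, e, _ in pooled if tt == t and e)
        # Observed deaths in group A minus the number expected under H0.
        observed_minus_expected += deaths_a - deaths * at_risk_a / at_risk
        if at_risk > 1:
            frac_a = at_risk_a / at_risk
            variance += deaths * frac_a * (1 - frac_a) * (at_risk - deaths) / (at_risk - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = observed_minus_expected ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2))


def split_by_median(expression, times, events):
    """Split patients into high and low expression groups at the median."""
    cut = median(expression)
    high = [(t, e) for x, t, e in zip(expression, times, events) if x > cut]
    low = [(t, e) for x, t, e in zip(expression, times, events) if x <= cut]
    return high, low


# Toy data: high expression coincides with early events.
expression = [9.1, 8.7, 9.4, 8.9, 9.0, 2.1, 2.5, 1.9, 2.2, 2.8]
times = [1, 2, 3, 4, 5, 10, 11, 12, 13, 14]
events = [1] * 10

high, low = split_by_median(expression, times, events)
chi2, p_value = logrank_test(high, low)
print(f"chi2={chi2:.2f}, p={p_value:.4f}")
```

A median split is only one possible cut-off; the team discussion may favor a different threshold or a continuous Cox model instead.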

## Step 3: Execute Analysis (Individual Meeting with Bioinformatician)

The bioinformatician implements the recommended analysis.

In [None]:
# Raw string: keeps the backslash line continuations in the bash command intact
analysis_request = r"""
# Biomarker Analysis Implementation

## Task
Please implement the biomarker discovery analysis using our select_marker_genes.py script.

## Command to Run
```bash
python scripts/select_marker_genes.py \
    --input_file ../Example_TCGA_TNBC_data.csv \
    --output_dir ../biomarker_results_agent \
    --n_top_genes 50 \
    --methods cox logrank differential elasticnet \
    --p_value_threshold 0.05 \
    --visualization
```

## After Running
1. Examine the results files
2. Identify top consensus genes
3. Note any interesting patterns or surprises
4. Prepare a summary for the team

Please provide:
- Summary statistics (# significant genes per method)
- Top 20 consensus genes
- Key observations
"""

analysis_meeting = run_meeting(
    agent=BIOINFORMATICIAN,
    task=analysis_request,
    critic=METHODS_CRITIC,
    model="gpt-4o-2024-08-06",
    temperature=0.3,
    num_rounds=2
)

(discussions_dir / 'analysis_execution').mkdir(exist_ok=True)
analysis_meeting.save(discussions_dir / 'analysis_execution')
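
The request above asks for "top consensus genes" across the four methods. One simple way to define consensus is to count how many methods flag each gene as significant. The sketch below assumes that definition; the function name, the `min_methods` parameter, and the toy gene lists are hypothetical, and the script's own ranking may differ.

```python
from collections import Counter


def consensus_genes(hits_by_method, min_methods=2):
    """Rank genes by the number of methods that flagged them.

    hits_by_method maps a method name to the list of genes it called
    significant. Genes supported by fewer than min_methods are dropped.
    """
    counts = Counter(
        gene for genes in hits_by_method.values() for gene in set(genes)
    )
    # Sort by support (descending), then alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(gene, n) for gene, n in ranked if n >= min_methods]


# Toy significant-gene lists, keyed by the method names passed to --methods.
toy_hits = {
    "cox": ["GENE_A", "GENE_B", "GENE_C"],
    "logrank": ["GENE_A", "GENE_C", "GENE_D"],
    "differential": ["GENE_A", "GENE_E"],
    "elasticnet": ["GENE_C", "GENE_A"],
}

top_consensus = consensus_genes(toy_hits, min_methods=2)
print(top_consensus)
```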

## Step 4: Results Interpretation

The full team interprets the identified biomarkers.

In [None]:
# The results summary below is a placeholder; in a real workflow, fill it in from the analysis output files
interpretation_agenda = """
# Biomarker Results Interpretation

## Results Summary
[Bioinformatician will have provided this from analysis]

Example top genes:
- BRCA1, TP53, MYC, CCND1, ESR1, PGR, ERBB2, etc.
(These are examples - actual genes will come from analysis)

## Discussion Points
1. **Biological Plausibility**: Do these genes make biological sense for TNBC prognosis?
2. **Known vs. Novel**: Which genes are known biomarkers? Which are novel?
3. **Pathway Enrichment**: Are there common pathways or processes?
4. **Clinical Relevance**: Could these be used clinically?
5. **Actionability**: Are any of these druggable targets?

## Tasks for Team Members

**Systems Biologist**: 
- Search PubMed for pathway analyses
- Identify biological processes
- Explain mechanistic links to prognosis

**Clinical Oncologist**:
- Search PubMed for clinical studies on top genes
- Assess clinical utility
- Identify therapeutic implications

**Molecular Biologist**:
- Explain gene functions
- Propose molecular mechanisms
- Suggest experimental validation

## Deliverable
Comprehensive interpretation of top biomarker candidates.
"""

interpretation_meeting = run_meeting(
    agent_team=INTERPRETATION_TEAM,
    agenda=interpretation_agenda,
    model="gpt-4o-2024-08-06",
    temperature=0.7,
    num_rounds=3
)

(discussions_dir / 'results_interpretation').mkdir(exist_ok=True)
interpretation_meeting.save(discussions_dir / 'results_interpretation')
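
The agenda above uses placeholder genes in its "Results Summary" section. In a real run, that section could be generated from the analysis output and substituted into the agenda string. The helper below is a hypothetical sketch of that formatting step; its name, its inputs, and the toy values are assumptions, since the output format of `select_marker_genes.py` is not shown here.

```python
def format_results_summary(top_genes, n_significant):
    """Build a markdown 'Results Summary' section for the agenda.

    top_genes: list of (gene, number_of_supporting_methods) pairs.
    n_significant: dict mapping method name to its count of significant genes.
    """
    lines = ["## Results Summary", "", "Significant genes per method:"]
    for method, count in n_significant.items():
        lines.append(f"- {method}: {count}")
    lines += ["", "Top consensus genes:"]
    for rank, (gene, n_methods) in enumerate(top_genes, start=1):
        lines.append(f"{rank}. {gene} (supported by {n_methods} methods)")
    return "\n".join(lines)


# Toy values only; real numbers would come from the analysis step.
summary = format_results_summary(
    top_genes=[("GENE_A", 4), ("GENE_C", 3)],
    n_significant={"cox": 3, "logrank": 3},
)
print(summary)
```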

## Step 5: Validation Strategy

The team plans validation experiments.

In [None]:
validation_agenda = """
# Biomarker Validation Strategy

## Current Status
We have identified candidate prognostic biomarkers from TCGA TNBC data.

## Validation Needs
1. **Independent Cohort Validation**: Test in other TNBC datasets
2. **Technical Validation**: Validate RNA-seq with qRT-PCR or IHC
3. **Functional Validation**: Test biological mechanisms in cell lines/mice
4. **Clinical Validation**: Prospective clinical trial

## Discussion Points
1. What validation experiments are most critical?
2. Which independent datasets should we use?
3. What functional experiments would test mechanisms?
4. How can we assess clinical utility?
5. What are the regulatory/assay development requirements?

## Search PubMed
Please search for:
- TNBC biomarker validation studies
- Available TNBC datasets for validation
- Regulatory guidelines for prognostic biomarkers

## Deliverable
Prioritized validation plan with specific experiments and datasets.
"""

validation_meeting = run_meeting(
    agent_team=VALIDATION_TEAM,
    agenda=validation_agenda,
    model="gpt-4o-2024-08-06",
    temperature=0.7,
    num_rounds=3
)

(discussions_dir / 'validation_planning').mkdir(exist_ok=True)
validation_meeting.save(discussions_dir / 'validation_planning')
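
For independent cohort validation, one widely used summary of prognostic performance is Harrell's concordance index (C-index): the fraction of comparable patient pairs in which the patient with the higher risk score has the shorter survival. The sketch below is a simple quadratic-time illustration on toy data, offered as one option the team might choose; it is not part of the project's scripts.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data.

    A pair (i, j) is comparable when patient i had an observed event and
    a strictly shorter time than patient j. Ties in risk score count 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs in the data")
    return concordant / comparable


# Toy validation cohort: the third patient is censored.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
good_scores = [0.9, 0.7, 0.4, 0.1]   # higher risk for shorter survival
bad_scores = [0.1, 0.4, 0.7, 0.9]    # ordering reversed

print(concordance_index(times, events, good_scores))
print(concordance_index(times, events, bad_scores))
```

A value near 0.5 indicates no better than random ordering, which makes the C-index a convenient pass/fail criterion when a signature derived from TCGA is applied to a separate dataset.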

## Step 6: Final Recommendations

The PI synthesizes all discussions and provides final recommendations.

In [None]:
final_agenda = """
# Final Biomarker Recommendations

## Project Summary
We have completed:
1. ✓ Project planning and methods selection
2. ✓ Multi-method statistical analysis
3. ✓ Biological and clinical interpretation
4. ✓ Validation strategy planning

## Task for PI
Please synthesize all previous discussions and provide:

1. **Top 10 Biomarker Candidates**: With brief justification for each
2. **Key Findings**: Most important insights from the analysis
3. **Clinical Implications**: How these could impact TNBC treatment
4. **Next Steps**: Prioritized action items
5. **Manuscript Outline**: Key points for a research paper

## Consider
- Statistical robustness
- Biological plausibility  
- Clinical relevance
- Novelty vs. validation of known biomarkers
- Feasibility of validation

## Deliverable
Executive summary and recommendations for moving forward.
"""

final_meeting = run_meeting(
    agent=PI,
    task=final_agenda,
    critic=CLINICAL_CRITIC,
    model="gpt-4o-2024-08-06",
    temperature=0.5,
    num_rounds=2
)

(discussions_dir / 'final_recommendations').mkdir(exist_ok=True)
final_meeting.save(discussions_dir / 'final_recommendations')

## Summary

This workflow demonstrates how AI agents can:
1. **Collaboratively plan** biomarker discovery studies
2. **Select and justify** statistical methods
3. **Execute** complex bioinformatics analyses
4. **Interpret** results in biological/clinical context
5. **Plan** validation strategies
6. **Synthesize** findings into actionable recommendations

The agents use:
- **PubMed search** to access literature
- **Domain expertise** to provide context
- **Critical thinking** to evaluate approaches
- **Collaborative discussion** to reach consensus

All discussions are saved in the `discussions/` directory for review and documentation.
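
To review what was saved, the per-meeting subdirectories can be listed with `pathlib`. The sketch below only inventories file names, because the file format written by `save()` is not specified here; `list_saved_discussions` is a hypothetical helper.

```python
from pathlib import Path


def list_saved_discussions(root):
    """Map each meeting subdirectory under root to its sorted file names."""
    root = Path(root)
    inventory = {}
    for meeting_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        inventory[meeting_dir.name] = sorted(
            f.name for f in meeting_dir.iterdir() if f.is_file()
        )
    return inventory


# Only run the listing if the notebook has already created the directory.
if Path("discussions").is_dir():
    for meeting, files in list_saved_discussions("discussions").items():
        print(f"{meeting}: {len(files)} file(s)")
```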

---

**Next Steps:**
1. Review saved discussions in `discussions/`
2. Examine analysis results in `biomarker_results_agent/`
3. Implement validation experiments
4. Prepare manuscript based on agent recommendations