# THIS IS AN EXAMPLE NOTEBOOK
This is meant to serve as a placeholder for reference. Please use the `run_evaluation_{team_name}.ipynb` notebook provided in your team-specific package.

In [None]:
#=================================EXAMPLE NOTEBOOK=================================#

# Evaluation Pipeline - Example

## Setup

### Virtual Environment
Use `uv` or regular `pip` to install dependencies. The following snippet shows one way to set up a virtual environment with `uv`:
```bash
uv python install 3.12.10
uv venv .venv --python 3.12.10
.venv\Scripts\activate.bat  # Windows
# source .venv/bin/activate  # macOS/Linux (use instead of the line above)
uv pip install -r requirements.txt
uv pip install ipykernel
python -m ipykernel install --user --name=venv --display-name "Evaluator"
```

**Note**: After installing the kernel, you may need to:
- Restart your Jupyter server or refresh the browser
- Reload VS Code window (Ctrl+Shift+P → "Reload Window")
- Select the "Evaluator" kernel from the kernel picker

### OpenAI API Key
Add your `OPENAI_API_KEY` to an `.env` file in the project root.
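
To confirm the key is present before running the pipeline, a quick check along these lines may help. It is a minimal sketch that assumes a plain `KEY=VALUE` file and uses only the standard library; `read_env_file` is a hypothetical helper, not part of `evaluation_pipeline`, which may load the file differently.

```python
from pathlib import Path


def read_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value
        values[key.strip()] = value.strip().strip("'\"")
    return values


env = read_env_file(".env")
print("OPENAI_API_KEY found" if env.get("OPENAI_API_KEY") else "OPENAI_API_KEY missing")
```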

### Required Data Files
Place the following files in `inputs\session_data\`:
1. `session_data.csv` - Your session data to evaluate
2. `human_evaluation.csv` - Human evaluation data (optional, but recommended)

File paths and names are configurable in [`config.toml`](config.toml).

**Data Preprocessing:** Use the helper function in `evaluation_pipeline.utils` or the full notebook at `data_preprocessing.ipynb` to convert data packages into `session_data.csv`.

### Configuration
All evaluation settings are managed through `config.toml`. See [`CONFIG.md`](CONFIG.md) for detailed options.
The current notebook is configured to sample `5` rows out of the full session data for evaluation.

## More Information
For more information, see [`README.md`](README.md).

## 1. Initialization
**Important:** Before running these cells, use `data_preprocessing.ipynb` to convert your data package file into `session_data.csv`.

In [None]:
import logging
import json
import sys
from evaluation_pipeline import Config, Evaluator

In [None]:
config = Config.from_toml("placeholder_config.toml") # Specify your config file here; this will not run as-is

In [None]:
for handler in logging.root.handlers[:]:
    logging.root.removeHandler(handler)

logging.basicConfig(
    level=logging.INFO,
    format='%(levelname)s: %(message)s',
    handlers=[
        logging.StreamHandler(sys.stdout),
        logging.FileHandler(config.dirs.logs / f'{config.run_id}.log')
    ]
)

# Suppress noisy loggers
logging.getLogger("httpx").setLevel(logging.WARNING)
logging.getLogger("openai").setLevel(logging.WARNING)

In [None]:
# create evaluator instance
evaluator = Evaluator(config)

In [None]:
evaluator

## 2. Full Run

**Note:** On your first run, we recommend stepping through the cells in Section 3 to understand each stage of the evaluation process before running the full pipeline.

In [None]:
# # Uncomment to run the full pipeline
# evaluator.run(
#     auto_approve=True,  # Skip cost estimate approval prompts
#     mode="batch",  # 'batch' or 'flex' (same pricing)
#     skip_adjudication=False, 
#     check_interval=60, # Check every 60 seconds for completed evaluations
# )

## 3. Step-by-Step Run

### Overview
The overall flow of the evaluation is as follows:

1. Session data and human evaluation data are loaded. The config file and session data file contents are hashed to create a `run_id`. All artifacts will have this `run_id` as an affix.
2. Generate evaluation guidelines using human evaluation data. **[Section A]**
3. Generate context-augmented dynamic prompts for all session data. **[Section B1/B2]**
4. Run evaluations twice. **[Section B1/B2]**
5. Generate dynamic prompts for evaluations that require adjudication, that is, when ANY subcriterion has a score gap >= 2 or the two evaluations disagree on whether the input is mathematically relevant. **[Section B1/B2]**
6. Run adjudication. **[Section B1/B2]**
7. Generate final scores. **[Section C]**
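
Because the `run_id` in step 1 is derived from file contents, changing either the config or the session data yields a new ID and a separate set of artifacts. The idea can be sketched as follows. The SHA-256 algorithm, the 12-character truncation, and the `compute_run_id` name are assumptions for illustration, not the pipeline's actual implementation.

```python
import hashlib
from pathlib import Path


def compute_run_id(config_path: str, session_data_path: str, length: int = 12) -> str:
    """Hash the raw bytes of both files into a short, deterministic ID."""
    digest = hashlib.sha256()
    for path in (config_path, session_data_path):
        data = Path(path).read_bytes()
        # Prefix each file with its size so the boundary between files is unambiguous
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()[:length]
```

Calling the function twice on unchanged files returns the same ID, which is what lets artifacts from an earlier run be matched to the same inputs.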

**Note:** Choose either **Section B1** (Flex Processing) or **Section B2** (Batch Processing) for evaluation—not both.
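
The adjudication trigger from step 5 of the overview can be sketched as a small predicate over two evaluations of the same session. The field names follow the output format shown in Section C; the `needs_adjudication` name and the choice to skip null scores are assumptions, and the pipeline's own check may differ.

```python
def needs_adjudication(eval_1: dict, eval_2: dict, gap_threshold: int = 2) -> bool:
    """Return True if two evaluations of one session disagree enough to adjudicate."""
    # Disagreement on whether the session contains evaluable mathematical content
    applicable_1 = eval_1["mathematical_accuracy_relevance"]["applicable"]
    applicable_2 = eval_2["mathematical_accuracy_relevance"]["applicable"]
    if applicable_1 != applicable_2:
        return True

    # Any subcriterion whose two scores differ by at least gap_threshold
    for criterion, subscores_1 in eval_1["scores"].items():
        subscores_2 = eval_2["scores"].get(criterion, {})
        for subcriterion, score_1 in subscores_1.items():
            score_2 = subscores_2.get(subcriterion)
            if score_1 is None or score_2 is None:
                continue  # Assumption: null scores are not compared numerically
            if abs(score_1 - score_2) >= gap_threshold:
                return True
    return False
```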

### A. Generate Evaluation Guidelines
Evaluation guidelines are generated using the following components as contextual information:
- Human evaluations
- Math intervention practice guides from Doing What Works
- The evaluation rubric
- Data description, tool description, and tool-specific considerations

The guideline generation runs three times: the first two runs generate guidelines independently, and the third run aggregates them to create a more stable, consistent evaluation guideline.
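
The two-drafts-then-aggregate pattern described above can be sketched generically. Both callables are hypothetical stand-ins for model calls, and `generate_stable_guideline` is an illustrative name rather than a function of `evaluation_pipeline`.

```python
from typing import Callable


def generate_stable_guideline(
    generate: Callable[[], str],
    aggregate: Callable[[list[str]], str],
    n_drafts: int = 2,
) -> str:
    """Produce n_drafts independent drafts, then merge them in one final call."""
    drafts = [generate() for _ in range(n_drafts)]
    return aggregate(drafts)


# Stand-ins for the drafting and aggregation calls
draft_source = iter(["draft A", "draft B"])
guideline = generate_stable_guideline(
    generate=lambda: next(draft_source),
    aggregate=lambda drafts: " | ".join(drafts),
)
print(guideline)
```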

**Recommended:** Review and refine the generated guideline manually before proceeding to ensure it aligns with your evaluation goals.

In [None]:
evaluator.generate_evaluation_guidelines(auto_approve=False, force_regenerate=False, test_run=True) 

### B1. Evaluation: Flex Processing (Direct API Calls)

Run evaluations using direct API calls.

In [None]:
# Generate dynamic prompts for evaluation
evaluator.generate_dynamic_prompts()

In [None]:
# Run initial evaluations (2 per session)
# Provides cost estimate and requires user input to proceed - Set auto_approve=True to skip cost approval prompts
evaluator.flex_evaluate()

In [None]:
# Run check_evaluation_status at any point to see current status and suggested next steps
evaluator.check_evaluation_status()

In [None]:
# Skip this cell unless check_evaluation_status() suggests adjudication as the next step
# Generate dynamic prompts for adjudication
evaluator.generate_dynamic_prompts(adjudication=True)
evaluator.flex_evaluate(adjudication=True)

### B2. Evaluation: Batch Processing 

Run evaluations using OpenAI's batch API for cost savings (50% off).
Follows the same evaluation flow as flex processing, but uses batch mode instead.

**Note:** If you've already run flex evaluation (B1), skip this section. Choose either flex or batch processing, not both. To start over, delete the output files or change the output file path in `config.toml`.

**Troubleshooting:**
- **Kernel restart during batch**: Use `batch_id_override` to retrieve results:
```python
# Find your batch_id in logs or OpenAI dashboard
evaluator.check_and_retrieve(
    until_complete=True,
    batch_id_override="batch_abc123",
)
```
- **Check status manually**:
```python
evaluator.check_batch_status()
```
- **Cancel batch**:
```python
evaluator.cancel_batch()
```

In [None]:
# Generate dynamic prompts for evaluation
evaluator.generate_dynamic_prompts()

In [None]:
# Prepare batch file
evaluator.prepare_batch_file()

In [None]:
evaluator.upload_batch()

In [None]:
# Set until_complete to False if you want to run this in a non-blocking way and check back later for results.
evaluator.check_and_retrieve(until_complete=True, check_interval=60)

# Use batch_id_override to specify a particular batch ID after kernel restart
# evaluator.check_and_retrieve(batch_id_override="your_batch_id_here", until_complete=True, check_interval=60)

In [None]:
# Adjudication batch (if needed)
evaluator.generate_dynamic_prompts(adjudication=True)
evaluator.prepare_batch_file(adjudication=True)
evaluator.upload_batch()
evaluator.check_and_retrieve(until_complete=True, check_interval=60)

### C. Finalize Results

Generate final scores by aggregating/adjudicating evaluations.

**What happens:**
- For sessions with 3 evaluations (2 + adjudication): Uses adjudicated score
- For sessions with 2 evaluations (no disagreement): Averages the scores
- For sessions with 1 evaluation: Marks it as incomplete
- Saves to the `evaluation_results` directory path defined in `config_{tool_name}.toml`
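
The selection rules above can be sketched as follows. The `status` labels, the function names, and the choice to leave a subcriterion null when either score is null are assumptions for illustration; `generate_final_scores()` may handle these cases differently.

```python
from typing import Optional


def average_scores(scores_1: dict, scores_2: dict) -> dict:
    """Average two nested score dicts; a subcriterion stays None if either score is None."""
    averaged: dict = {}
    for criterion, subscores_1 in scores_1.items():
        averaged[criterion] = {}
        for subcriterion, score_1 in subscores_1.items():
            score_2 = scores_2[criterion][subcriterion]
            if score_1 is None or score_2 is None:
                averaged[criterion][subcriterion] = None
            else:
                averaged[criterion][subcriterion] = (score_1 + score_2) / 2
    return averaged


def finalize_session(initial: list, adjudication: Optional[dict] = None) -> dict:
    """Pick the final scores for one session from its available evaluations."""
    if adjudication is not None:
        # Two evaluations plus an adjudication: the adjudicated scores win
        return {"status": "adjudicated", "scores": adjudication["scores"]}
    if len(initial) == 2:
        # Two evaluations that agreed closely enough: average them
        scores = average_scores(initial[0]["scores"], initial[1]["scores"])
        return {"status": "averaged", "scores": scores}
    # Fewer than two evaluations: nothing reliable to report yet
    return {"status": "incomplete", "scores": None}
```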

**Output Format:**
```json
{
  "session_id_1": {
    "scores": {
      "Mathematical_Accuracy": {
        "Validity": <1-4 or null>,
        "Clarity_and_Labeling": <1-4 or null>,
        "Justification_and_Explanation": <1-4 or null>
      },
      "Pedagogical_Quality": {
        "Problem_Solving_Strategies": <1-4>,
        "Relevance": <1-4>,
        "Scaffolded_Support": <1-4>,
        "Clarity_of_Explanation": <1-4>,
        "Feedback": <1-4>,
        "Motivational_Engagement": <1-4>
      },
      "Equity_and_Fairness": {
        "Language_neutrality": <1-3>,
        "Feedback_tone": <1-3>,
        "Cultural_relevance": <1-3>
      }
    },
    "explanations": {
      "Mathematical_Accuracy": {
        "Validity": "Brief explanation with specific evidence",
        "Clarity_and_Labeling": "Concise justification with examples",
        "Justification_and_Explanation": "Brief reasoning with evidence"
      },
      "Pedagogical_Quality": {
        "Problem_Solving_Strategies": "Brief explanation with evidence",
        "Relevance": "Concise justification",
        "Scaffolded_Support": "Brief reasoning with examples",
        "Clarity_of_Explanation": "Concise explanation",
        "Feedback": "Brief justification",
        "Motivational_Engagement": "Assessment based on student responses when available"
      },
      "Equity_and_Fairness": {
        "Language_neutrality": "Brief explanation",
        "Feedback_tone": "Concise justification",
        "Cultural_relevance": "Brief assessment"
      }
    },
    "mathematical_accuracy_relevance": {
      "applicable": <true/false>,
      "explanation": "Specific analysis of whether AI output contains evaluable mathematical content",
      "extracted_mathematical_content": "If applicable, any mathematical content extracted from the session data by the LLM judges.",
      "catastrophic_errors": "Any significant mathematical errors made by the AI (for example, incorrect calculations such as 2+2=5, or misidentifying a square as a triangle)."
    }
  },
  "session_id_2": {...}
}
```

Adjudicated results also include the following field:
```json
  "adjudication_notes": {
    "key_discrepancies_resolved": "Brief summary of main disagreements and how they were resolved",
    "evaluation_preferred": "If one evaluation was generally more accurate, note which (1 or 2) and why"
  }
```
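
For analysis it is often easier to work with flat rows than with the nested structure. The following sketch assumes each session entry carries a `scores` mapping shaped like the output format above; `flatten_scores` is a hypothetical helper, not part of `evaluation_pipeline`.

```python
def flatten_scores(final_results: dict) -> list[dict]:
    """Turn nested final scores into one row per session and subcriterion."""
    rows = []
    for session_id, result in final_results.items():
        scores = result.get("scores") or {}  # Tolerate sessions without scores
        for criterion, subscores in scores.items():
            for subcriterion, score in subscores.items():
                rows.append({
                    "session_id": session_id,
                    "criterion": criterion,
                    "subcriterion": subcriterion,
                    "score": score,
                })
    return rows


# Small hand-written sample in the same shape as the output format
sample = {
    "session_id_1": {
        "scores": {
            "Equity_and_Fairness": {"Language_neutrality": 3, "Feedback_tone": 2},
        }
    }
}
for row in flatten_scores(sample):
    print(row)
```

The resulting rows can be written out with `csv.DictWriter` or loaded into a dataframe for per-criterion summaries.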

In [None]:
evaluator.generate_final_scores()

In [None]:
# Load the final scores JSON file for this run
with open(config.dirs.evaluation_results / f"{config.run_id}_final_scores.json", "r") as f:
    final_results = json.load(f)

In [None]:
# Print final results for viewing here
print(json.dumps(final_results, indent=2))