# Workflow for MIRA equation extraction - notes

This notebook outlines prompting ideas for extracting mathematical equations from PDFs in the MIRA framework.

---

## One-shot Prompting:
The basic MIRA workflow is a one-shot prompting architecture (let's call this **version = 001**).


Process of the extraction: *'mira/notebooks/llm_extraction.ipynb'*

Pipeline: *'mira/sources/sympy_ode/llm_util.py'*

Prompts: *'mira/sources/sympy_ode/constants.py'*

Detailed results can be found in the notebook: *mira_llm_extraction_evaluation.ipynb*

---

## Iterative prompting workflow:
**version = 002**

To improve the precision of the extraction, an iterative workflow is being introduced, with the following steps:
### Agent 1:
First, an extraction agent uses the original MIRA process to convert equation images into SymPy code and ground biological concepts. 

### Agent 2:
Then, a validation agent checks the extraction for execution errors (missing imports, undefined variables), parameter consistency issues, and incorrect concept grounding. 
If errors are found, the validation agent corrects them and the process repeats for up to 3 iterations until all checks pass. 

This multi-agent approach improves extraction accuracy by catching and fixing common errors that the single-shot method might miss, while maintaining backward compatibility with the existing MIRA codebase.
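
The check-and-correct loop between the two agents can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `llm_util.py`: `run_validation_loop`, `toy_check` and `toy_fix` are hypothetical names, and the toy functions stand in for the LLM-backed validation agent.

```python
from typing import Callable, List, Tuple

MAX_ITERATIONS = 3  # upper bound stated in the notes


def run_validation_loop(
    code: str,
    check: Callable[[str], List[str]],
    fix: Callable[[str, List[str]], str],
    max_iterations: int = MAX_ITERATIONS,
) -> Tuple[str, List[str], int]:
    """Repeat check -> fix until no errors remain or the budget is spent."""
    errors = check(code)
    iterations = 0
    while errors and iterations < max_iterations:
        code = fix(code, errors)  # in practice: one LLM call given the error list
        iterations += 1
        errors = check(code)
    return code, errors, iterations


# Toy stand-ins for the validation agent (hypothetical, for illustration only).
def toy_check(code: str) -> List[str]:
    return [] if "import sympy" in code else ["missing import: sympy"]


def toy_fix(code: str, errors: List[str]) -> str:
    return "import sympy\n" + code


fixed, remaining, n_iterations = run_validation_loop(
    "x = sympy.Symbol('x')", toy_check, toy_fix
)
```

Returning the remaining errors together with the iteration count lets the caller tell a clean pass apart from a run that simply exhausted its three attempts.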

### RESULTS OF IMPLEMENTATION:
Forked version on GitHub: *'fruzsedua/mira/tree/extraction-development'*

Examples for each result found in this folder: *'mira/notebooks/equation extraction development/extraction error check/string mismatch check/comparison_results_version002'*

Process of the extraction: *'mira/notebooks/llm_extraction.ipynb'* -> **More detailed process**

Pipeline: *'mira/sources/sympy_ode/llm_util.py'* -> **New functions added**

Prompts: *'mira/sources/sympy_ode/constants.py'* -> **Error handling prompt added**

**Image extraction:**
- Additional rules added: symmetry, transmission structure, patterns, mathematical structure, parameter consistency, completeness check
- Epidemiology-based rules are just ideas (from Claude) -> *revision needed!*

**Error checking and correcting:**
- Execution errors are mostly fixed during iteration 1:
- Syntax rules for detecting and handling functions/symbols
- Handling of imports, utilizing their names precisely
- Missing parameters are included

- Data cannot be parsed if the output format of the prompt is not aligned with the next function -> exact clarification is added to the prompt
- Comparing number of factors to the original (count * operators and variables)
- Preserving content between iterations of the error handling prompt
- Missing /N fixed
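
The factor-count comparison listed above can be sketched with SymPy. This is a minimal illustration, not the check used in the pipeline: the helper name `factor_counts` is hypothetical, and counting non-numeric factors per additive term is an assumption about what "number of factors" should mean.

```python
import sympy


def factor_counts(expr):
    """Return the number of non-numeric factors in each additive term, sorted."""
    counts = []
    for term in sympy.Add.make_args(expr):
        # Numeric coefficients such as -1 or 2 are ignored.
        factors = [f for f in sympy.Mul.make_args(term) if not f.is_Number]
        counts.append(len(factors))
    return sorted(counts)


beta, S, I = sympy.symbols("beta S I")
original = beta * S * I    # intended term: one product of three factors
corrected = beta * S + I   # a multiplication was turned into an addition

mismatch = factor_counts(original) != factor_counts(corrected)
```

A differing profile between iterations signals that the correction step changed the structure of a term rather than only its notation. Note that a power such as `S**2` counts as a single factor in this sketch.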

**Comparison of the extracted ODEs added:**
- SymPy format matching
- Sorting of equations (based on the variable on the LHS) for comparison
- The Template Model → Mtx odes conversion garbles a lot of information because of its multiple formatting steps -> *fix needed!*
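
The comparison steps above (sort by the LHS variable, then match in SymPy form) can be sketched as follows. The function names `sort_by_lhs` and `compare` are hypothetical, and the example model is a plain SIR system chosen for illustration, not one of the evaluated models.

```python
import sympy

t = sympy.Symbol("t")
S, I = sympy.Function("S")(t), sympy.Function("I")(t)
beta, gamma = sympy.symbols("beta gamma")

reference = [
    sympy.Eq(S.diff(t), -beta * S * I),
    sympy.Eq(I.diff(t), beta * S * I - gamma * I),
]
# Same system, different equation order and a factored right-hand side.
extracted = [
    sympy.Eq(I.diff(t), I * (beta * S - gamma)),
    sympy.Eq(S.diff(t), -beta * S * I),
]


def sort_by_lhs(odes):
    """Order equations by the string form of their left-hand side."""
    return sorted(odes, key=lambda eq: str(eq.lhs))


def compare(extracted, reference):
    """Pair equations after sorting; assumes both lists have the same length."""
    report = {}
    for ex, ref in zip(sort_by_lhs(extracted), sort_by_lhs(reference)):
        same_lhs = ex.lhs == ref.lhs
        # Compare by simplified difference so equivalent forms still match.
        same_rhs = sympy.simplify(ex.rhs - ref.rhs) == 0
        report[str(ref.lhs)] = same_lhs and same_rhs
    return report


report = compare(extracted, reference)
```

Comparing the simplified difference of the right-hand sides, rather than their strings, avoids false mismatches caused purely by term ordering or factoring.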


The error-handling multi-agent architecture is part of the template model (TM) creation pipeline:

**Image → LLM Extraction → Multi-Agent Validation → JSON (corrected ODEs + concepts) → Template Model → Mtx odes**

**REMAINING ERRORS:**
- Parameter consistency: mostly notational differences (e.g. rho_1 vs. rho1), but sometimes more serious, e.g. rho vs. q (similar) -> the LLM has no information about which one is used, so it doesn't know a fix is needed
- Multiplication vs. addition still gets mixed up sometimes
- Semantic compartment mismatches I(t) vs. T(t)-> extra validation needed e.g. linear and 
- Strengthening of the arithmetic validation is much needed!
- Precision of coefficient extraction 
- Still remains: CodeExecutionError: Error while executing the code: 'Symbol' object is not callable (examples: BIOMD000000972, BIOMD000000976)
- The error handling function mixes up the order of operations in some cases (example: BIOMD0000000991)
- The extracted compartments sometimes differ completely from the original and may have been derived from the RHS (example: 2024_dec_epi_1_model_A)
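
One plausible source of the `'Symbol' object is not callable` error is generated code that declares a compartment with `Symbol` and then calls it as `I(t)`. The snippet below reproduces that pattern in plain SymPy and shows a `Function`-based declaration that avoids it. Whether this is the actual cause in the listed BIOMD examples is an assumption.

```python
import sympy

t = sympy.Symbol("t")

# Failing pattern: the compartment is declared as a Symbol and then called.
I_sym = sympy.Symbol("I")
try:
    I_sym(t)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Working pattern: declare the compartment as an undefined function of time.
gamma = sympy.Symbol("gamma")
I = sympy.Function("I")(t)
ode = sympy.Eq(I.diff(t), -gamma * I)
```

A syntax rule in the prompt that requires `Function` for every quantity appearing under d/dt would target this pattern directly.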


---

> CURRENT VERSION:
## Multi-Agent Pipeline:
**version = 003**

The problem areas are clearly separable and are better handled by splitting the prompting into more detailed, dedicated prompts. An agenda-based approach addresses the extraction challenges systematically by organizing the process into distinct agenda items, each targeting a specific aspect:

### Agent 1: Initial Extraction
- Extract equations from image/PDF using existing MIRA logic
- Convert mathematical notation to SymPy code representation
- Pass raw code string to next agent

### Agent 2: Execution Error Handler
- Attempt to execute the extracted SymPy code
- Catch and diagnose execution errors (missing imports, undefined variables, syntax errors)
- Automatically fix common issues and retry execution
- Pass executable code and any remaining warnings forward
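
The execute, diagnose, fix and retry cycle of Agent 2 can be sketched as below. This is a minimal sketch under stated assumptions: `diagnose`, `repair` and the `KNOWN_IMPORTS` table are hypothetical names, only missing-import errors are auto-fixed, and running LLM-generated code with `exec` should happen in a sandboxed environment.

```python
import re

# Hypothetical repair table: missing name -> import line that provides it.
KNOWN_IMPORTS = {"sympy": "import sympy"}


def diagnose(code):
    """Run the code once and return a description of the first failure, or None."""
    try:
        exec(compile(code, "<extracted>", "exec"), {})
    except SyntaxError as exc:
        return {"kind": "syntax", "detail": exc.msg}
    except NameError as exc:
        match = re.search(r"name '(\w+)' is not defined", str(exc))
        return {"kind": "undefined_name", "name": match.group(1) if match else None}
    except Exception as exc:
        return {"kind": "runtime", "detail": f"{type(exc).__name__}: {exc}"}
    return None


def repair(code, max_retries=3):
    """Apply known import fixes and retry; return (code, remaining_problem)."""
    for _ in range(max_retries):
        problem = diagnose(code)
        if problem is None:
            return code, None
        fix = KNOWN_IMPORTS.get(problem.get("name"))
        if fix is None:
            return code, problem  # not auto-fixable: pass forward as a warning
        code = fix + "\n" + code
    return code, diagnose(code)


fixed_code, problem = repair("x = sympy.Symbol('x')")
```

Anything the table cannot fix is returned as a structured problem description, which matches the last bullet: executable code plus remaining warnings are passed forward.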

### Agent 3: Symbol & Parameter Analysis
#### Time dependency classification:
- Identify all variables that appear with d/dt (time-dependent)
- Classify remaining symbols as parameters or independent variables
- Flag any inconsistencies in variable usage
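
The time-dependency classification can be sketched with SymPy's expression tree. The function name `classify` is hypothetical, and the example deliberately contains a `T(t)` term without its own ODE to mirror the I(t) vs. T(t) mismatch noted earlier.

```python
import sympy
from sympy.core.function import AppliedUndef

t = sympy.Symbol("t")
S, I, T = (sympy.Function(name)(t) for name in ("S", "I", "T"))
beta, gamma = sympy.symbols("beta gamma")

odes = [
    sympy.Eq(S.diff(t), -beta * S * I),
    sympy.Eq(I.diff(t), beta * S * I - gamma * T),  # T(t) has no ODE of its own
]


def classify(odes, time):
    """Split symbols into state variables, parameters, and flagged functions."""
    # Functions that appear under d/dt are the time-dependent state variables.
    state = {
        d.expr
        for eq in odes
        for d in eq.atoms(sympy.Derivative)
        if time in d.variables
    }
    # Every applied function, differentiated or not.
    applied = {f for eq in odes for f in eq.atoms(AppliedUndef)}
    # Plain symbols other than time are treated as parameters.
    parameters = {s for eq in odes for s in eq.free_symbols} - {time}
    flagged = applied - state  # used as time-dependent but never differentiated
    return state, parameters, flagged


state, parameters, flagged = classify(odes, t)
```

The `flagged` set is the inconsistency report: a function of time that appears on a right-hand side but has no equation is either a missing ODE or a misread compartment.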

#### Parameter consistency checking:
- Detect parameters that appear in equations but aren't defined
- Identify duplicate parameter definitions
- Find defined but unused parameters
- Check notation consistency (subscripts, superscripts, Greek letters)
- Pass comprehensive symbol mapping to next agent
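
The parameter consistency checks can be sketched as set operations over declared and used symbols. This is an illustration under stated assumptions: `check_parameters` and `normalize` are hypothetical helpers, and treating names that differ only by underscores or braces as the same parameter is an assumption. Duplicate definitions are not covered here, since they have to be detected in the code text before the symbols are created.

```python
import re
import sympy


def normalize(name):
    """Strip subscript markers so that rho_1 and rho1 map to the same key."""
    return re.sub(r"[_{}]", "", name)


def check_parameters(declared, odes, time):
    """Report undefined, unused, and notation-clashing parameters."""
    used = set().union(*(eq.free_symbols for eq in odes)) - {time}
    declared = set(declared)
    groups = {}
    for sym in used | declared:
        groups.setdefault(normalize(sym.name), set()).add(sym)
    return {
        "undefined": used - declared,
        "unused": declared - used,
        "notation_clashes": [g for g in groups.values() if len(g) > 1],
    }


t = sympy.Symbol("t")
I = sympy.Function("I")(t)
beta, rho_1, rho1, delta = sympy.symbols("beta rho_1 rho1 delta")

report = check_parameters(
    declared=[beta, rho_1, delta],
    odes=[sympy.Eq(I.diff(t), beta * I - rho1 * I)],
    time=t,
)
```

The clash list covers the notational case (rho_1 vs. rho1). It cannot catch the rho vs. q case, because nothing in the symbols themselves says which one the source used.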

### Agent 4: Diagnostic & Scoring
- Calculate extraction quality score based on:
 - Successful execution (from Agent 2)
 - Symbol consistency (from Agent 3)
 - Common extraction error patterns
- Generate final report with:
 - Overall confidence score
 - Specific warnings about potential extraction errors
 - Recommendations for manual review if score is low
- Optional: Include lightweight mathematical validation (missing negative signs, suspicious parameter usage)
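
A scoring step along these lines could look like the sketch below. Everything in it is an assumption made for illustration: the `Diagnostics` fields, the penalty weights and the review threshold are placeholders, not values derived from the evaluation results.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Diagnostics:
    executed: bool                   # from Agent 2
    undefined_symbols: int = 0       # from Agent 3
    unused_symbols: int = 0          # from Agent 3
    pattern_warnings: List[str] = field(default_factory=list)


def quality_score(diag, review_threshold=0.7):
    """Combine agent outputs into a score in [0, 1] plus a review flag."""
    if not diag.executed:
        score = 0.0  # code that does not run cannot be trusted at all
    else:
        penalty = (
            0.15 * diag.undefined_symbols
            + 0.05 * diag.unused_symbols
            + 0.10 * len(diag.pattern_warnings)
        )
        score = max(0.0, 1.0 - penalty)
    return {
        "score": round(score, 2),
        "needs_manual_review": score < review_threshold,
        "warnings": list(diag.pattern_warnings),
    }


final_report = quality_score(
    Diagnostics(
        executed=True,
        undefined_symbols=1,
        pattern_warnings=["possible missing negative sign"],
    )
)
```

Keeping the weights in one place makes it easy to recalibrate them once scores can be compared against manually reviewed extractions.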

This pipeline transforms the single-shot extraction into a robust, multi-step process where each agent specializes in one aspect of validation and correction. Since each agent requires a distinct approach and prompt configuration, the LLM can achieve better focus (rather than receiving a summarized, less detailed message).

Other possible agenda items:

6. Symbol Validation – Are all variables and parameters defined? (This focuses more on the JSON output.)

7. Biological Context Tagging – Are compartments semantically labeled (e.g., S = susceptible)?

8. JSON Structure Integrity – Is the output JSON consistent and complete?
