In [28]:
import pandas as pd
#features
import re
import numpy as np
from collections import Counter
#comparison
import json
#LLM
from openai import OpenAI
import os
import time
from dotenv import load_dotenv

In [29]:
load_dotenv()  # read OPENAI_API_KEY from a local .env file into the environment
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# verifying the setup
if os.getenv("OPENAI_API_KEY"):
    print("OpenAI API key loaded successfully")
else:
    print("Error: OPENAI_API_KEY not found in .env file")

OpenAI API key loaded successfully


### Error analysis and validation for MIRA equations extraction

This project develops a pipeline that automatically analyses MIRA's equation extraction, focusing specifically on correctness errors. These errors can involve both the equation components (variables, parameters, etc.) and their graphical representation in the source articles, which are analysed through detailed feature engineering. The workflow combines analysis of the symbols used, mathematical validation and visual features to assess MIRA's extraction quality.

Key steps:
- Data extraction: original sources (correct equations) and results of the extraction process (extracted equations)
- Defining all the possible features ODEs can have
- Comparing ODE systems in 3 steps:
    1. identifying the symbolic, mathematical and graphic features of the original equations and calculating complexity metrics
    2. comparing the extracted equations to the original equations (model specific)
    3. categorizing or labeling the errors that appear
- Finding the connection between the features of the equations and the error types that appear (correlation, odds ratios, test values, etc.)
- Evaluation of the trained model for development purposes: performance is measured using cross-validation and robustness testing

An example of a possible outcome: 89% of the models whose equations contain the Greek letter 'alpha' have a symbol detection problem in which the subscripts of the alpha symbols are not recognized. Conclusion: prompting should be improved, focusing on Greek letters with sub- or superscripts.
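The link between a feature and an error type can be quantified with an odds ratio. The sketch below is only an illustration: the table, the column names `contains_greek_letters` and `symbol_error`, and the helper `odds_ratio` are assumptions made for the example, and the values are invented rather than taken from MIRA results.

```python
import pandas as pd

# Hypothetical per-model table: one feature flag and one error label per model.
# Column names and values are illustrative only, not MIRA results.
df = pd.DataFrame({
    "contains_greek_letters": [True, True, True, False, False, False, True, False],
    "symbol_error":           [True, True, False, False, False, True, True, False],
})

def odds_ratio(data, feature, error, correction=0.5):
    """Odds ratio of `error` for rows with vs. without `feature`.

    Adding `correction` to every cell (Haldane-Anscombe correction) keeps the
    ratio finite when one cell of the 2x2 table is empty.
    """
    f = data[feature].astype(bool)
    e = data[error].astype(bool)
    a = (f & e).sum() + correction      # feature present, error present
    b = (f & ~e).sum() + correction     # feature present, no error
    c = (~f & e).sum() + correction     # feature absent, error present
    d = (~f & ~e).sum() + correction    # feature absent, no error
    return (a * d) / (b * c)

ratio = odds_ratio(df, "contains_greek_letters", "symbol_error")
print(f"odds ratio: {ratio:.2f}")
```

A ratio well above 1 would suggest that the error is more likely when the feature is present; with only a handful of models the estimate is very noisy, so it is best read together with a significance test.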

Once the error analysis system is successfully completed, it could potentially be integrated into MIRA's final diagnostic step (the last agent item), enabling the LLM to perform automatic correctness evaluations as part of the extraction process.

Let's start with importing the original equations and the extracted version of the models:

#### Configuring the parameters

In [30]:
# OpenAI Configuration
OPENAI_MODEL = "gpt-4"
TEMPERATURE = 0.0
MAX_TOKENS = 1024

# Analysis Configuration
VERSION = '001'  # Version to analyze from extracted equations
SAVE_RESULTS = True 

# Output file names
OUTPUT_DIR = '.'  # current directory, change if needed
FEATURES_OUTPUT = f'{OUTPUT_DIR}/features_analysis_{VERSION}.csv'
SUMMARY_OUTPUT = f'{OUTPUT_DIR}/error_analysis_summary_{VERSION}.csv'
COMPARISON_OUTPUT = f'{OUTPUT_DIR}/comparison_results_{VERSION}.json'
CATEGORIZATION_OUTPUT = f'{OUTPUT_DIR}/categorization_results_{VERSION}.json'

print("Configuration loaded:")
print(f"- OpenAI Model: {OPENAI_MODEL}")
print(f"- Version: {VERSION}")
print(f"- Output Directory: {OUTPUT_DIR}")

Configuration loaded:
- OpenAI Model: gpt-4
- Version: 001
- Output Directory: .


#### Importing data
**Importing original sources** and **Importing extracted equations**

> correct_eqs_list.tsv

> extracted_eqs_VERSION001.tsv


In [31]:
correct_eqs_df = pd.read_csv('/Users/kovacs.f/Desktop/mira/notebooks/F/correct_eqs_list.tsv', sep='\t')
# The file name ends with the VERSION set in the configuration cell
extracted_eqs_df = pd.read_csv(f'/Users/kovacs.f/Desktop/mira/notebooks/F/extracted_eqs_VERSION{VERSION}.tsv', sep='\t')

#check header names
correct_eqs_df.columns, extracted_eqs_df.columns

(Index(['model', 'correct_eqs'], dtype='object'),
 Index(['model', 'extracted_eqs'], dtype='object'))

The aim is to compare the extracted models to their original form and analyse the results, in order to identify which parts of the process need to be improved and to arrive at the best possible extraction method.
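Before comparing, it helps to confirm that every model appears in both tables. The sketch below uses small made-up frames with the same column names as `correct_eqs_df` and `extracted_eqs_df`; the model names and equations are invented for the example.

```python
import pandas as pd

# Toy stand-ins for correct_eqs_df and extracted_eqs_df (same column names;
# the model names and equations are invented for illustration).
correct = pd.DataFrame({
    "model": ["SIR", "SEIR", "SIS"],
    "correct_eqs": ["dS/dt = -b*S*I", "dE/dt = b*S*I - s*E", "dI/dt = b*S*I - g*I"],
})
extracted = pd.DataFrame({
    "model": ["SIR", "SEIR", "SIRD"],
    "extracted_eqs": ["dS/dt = -b*S*I", "dE/dt = b*S*I - E", "dD/dt = m*I"],
})

# An outer merge with indicator=True records in which table each model was found.
paired = correct.merge(extracted, on="model", how="outer", indicator=True)

matched = paired[paired["_merge"] == "both"].drop(columns="_merge")
missing_extraction = paired.loc[paired["_merge"] == "left_only", "model"].tolist()
missing_reference = paired.loc[paired["_merge"] == "right_only", "model"].tolist()

print(f"{len(matched)} models can be compared")
print("no extraction for:", missing_extraction)
print("no reference for:", missing_reference)
```

Only the rows in `matched` have both a reference and an extracted equation; the two lists name the models that would otherwise drop out of the comparison silently.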

#### Feature Engineering
Feature engineering here turns the equations into numerical features that capture potential error patterns and extraction challenges. The features cover the following error sources:

1. Time-dependence inconsistencies
2. Special symbol usage
3. Sub- and superscript usage
4. Undefined or unused parameters
5. Equation graph structure (density, loops, etc.)
6. Graphical features of the input PDF or PNG

In [32]:
#helper functions for the feature extraction

def _count_unicode_blocks(text):
    """Count different Unicode block types in text"""
    blocks = set()
    for char in text:
        if ord(char) > 127:
            code = ord(char)
            if 0x0370 <= code <= 0x03FF:
                blocks.add('GREEK')
            elif 0x2200 <= code <= 0x22FF:
                blocks.add('MATHEMATICAL_OPERATORS')
            elif 0x2070 <= code <= 0x209F:
                blocks.add('SUPERSCRIPTS_SUBSCRIPTS')
            elif 0x1D400 <= code <= 0x1D7FF:
                blocks.add('MATHEMATICAL_ALPHANUMERIC')
            elif 0x2190 <= code <= 0x21FF:
                blocks.add('ARROWS')
    return len(blocks)

def _calculate_encoding_complexity(text):
    """Calculate complexity based on character variety"""
    if not text:
        return 0
    unique_chars = len(set(text))
    total_chars = len(text)
    # Add weight for non-ASCII characters
    non_ascii = sum(1 for c in text if ord(c) > 127)
    complexity = (unique_chars / total_chars) * 5 + (non_ascii / total_chars) * 5
    return min(10, complexity)

def _calculate_max_nesting_depth(text, delimiter):
    """Calculate maximum nesting depth for subscripts/superscripts"""
    max_depth = 0
    current_depth = 0
    in_bracket = False
    
    for i, char in enumerate(text):
        if char == delimiter and i + 1 < len(text) and text[i + 1] == '{':
            current_depth += 1
            max_depth = max(max_depth, current_depth)
            in_bracket = True
        elif char == '}' and in_bracket:
            current_depth = max(0, current_depth - 1)
            if current_depth == 0:
                in_bracket = False
    return max_depth

def _count_single_occurrence_vars(text):
    """Count variables that appear only once (potentially undefined)"""
    import re
    # Find all single letter variables
    variables = re.findall(r'\b[a-zA-Z]\b', text)
    # Count those appearing only once
    var_counts = Counter(variables)
    return sum(1 for count in var_counts.values() if count == 1)

def _calculate_parameter_consistency(text):
    """Score how consistently parameters are used (0-1)"""
    import re
    # Find all variables
    variables = re.findall(r'\b[a-zA-Z]+\b', text)
    if not variables:
        return 1.0
    
    # Check consistency (simplified)
    var_counts = Counter(variables)
    # If all variables appear at least twice, good consistency
    single_use = sum(1 for count in var_counts.values() if count == 1)
    total_vars = len(var_counts)
    
    return 1.0 - (single_use / max(total_vars, 1))

def _count_implicit_parameters(text):
    """Count parameters that seem assumed but not defined"""
    import re
    # Look for common parameter patterns not in derivatives
    params = re.findall(r'\b[a-zA-Z]\b(?![\'"])', text)
    # Common parameters that might be implicit
    common_params = ['a', 'b', 'c', 'k', 'r', 'alpha', 'beta', 'gamma']
    implicit_count = 0
    for param in set(params):
        if param in common_params and params.count(param) == 1:
            implicit_count += 1
    return implicit_count

def _classify_naming_convention(text):
    """Classify variable naming style"""
    import re
    variables = re.findall(r'\b[a-zA-Z_]+[a-zA-Z0-9_]*\b', text)
    if not variables:
        return "none"
    
    # Check patterns
    has_underscore = any('_' in var for var in variables)
    has_multi_letter = any(len(var) > 1 for var in variables)
    has_subscript = '_' in text
    
    if has_subscript:
        return "subscripted"
    elif has_multi_letter:
        return "multi_letter"
    else:
        return "single_letter"

def _extract_variables(text):
    """Extract all variables from equation"""
    import re
    # Extract single letters and multi-letter variables
    variables = set(re.findall(r'\b[a-zA-Z]+\b', text))
    # Remove common functions
    functions = {'sin', 'cos', 'tan', 'exp', 'log', 'ln', 'sqrt', 'max', 'min'}
    return variables - functions

def _calculate_derivative_consistency(text):
    """Check if derivative notation is consistent"""
    dot_notation = bool(re.search(r'[ẋẏżẍÿz̈]|\\dot\{|\\ddot\{', text))
    prime_notation = "'" in text or "′" in text
    leibniz_notation = bool(re.search(r'd[a-zA-Z]/d[a-zA-Z]|\\frac\{d', text))
    
    notations_used = sum([dot_notation, prime_notation, leibniz_notation])
    
    # Consistent if only one notation used or no derivatives
    if notations_used <= 1:
        return 1.0
    else:
        return 0.5  # Mixed notation

def _check_time_var_consistency(text):
    """Check if time variable is used consistently"""
    import re
    # Look for common time variables
    time_vars = re.findall(r'\b[tτ]\b', text)
    if not time_vars:
        return True
    
    # Check if only one type is used
    unique_vars = set(time_vars)
    return len(unique_vars) == 1

def _check_derivative_progression(text):
    """Check if derivative orders make sense"""
    import re
    # Look for derivative patterns
    first_order = bool(re.search(r"[a-zA-Z]'(?!')|\b[a-zA-Z]\b'", text))
    second_order = bool(re.search(r"[a-zA-Z]''|\\ddot", text))
    
    # If we have second order, we should have first order
    if second_order and not first_order:
        return False
    return True

def _count_time_dependence_issues(text):
    """Count inconsistencies in time dependence notation"""
    import re
    # Count y vs y(t) style inconsistencies
    bare_vars = len(re.findall(r'\b[a-zA-Z]\b(?!\()', text))
    function_vars = len(re.findall(r'\b[a-zA-Z]\([a-zA-Z]\)', text))
    
    # If we have both styles, that's an issue
    if bare_vars > 0 and function_vars > 0:
        return min(bare_vars, function_vars)
    return 0

def _count_derivative_mismatches(text):
    """Count derivatives with undefined variables"""
    import re
    # Find derivatives like dy/dx
    derivatives = re.findall(r'd([a-zA-Z])/d([a-zA-Z])', text)
    mismatches = 0
    
    all_vars = set(re.findall(r'\b[a-zA-Z]\b', text))
    
    for dep_var, indep_var in derivatives:
        if indep_var not in all_vars:
            mismatches += 1
    
    return mismatches

def _build_dependency_graph(text):
    """Build variable dependency graph"""
    import re
    # Simple version - just count variables and relationships
    variables = set(re.findall(r'\b[a-zA-Z]\b', text))
    
    # Look for equations (separated by =)
    equations = text.split('=')
    edges = []
    
    for i in range(len(equations) - 1):
        left_vars = set(re.findall(r'\b[a-zA-Z]\b', equations[i]))
        right_vars = set(re.findall(r'\b[a-zA-Z]\b', equations[i + 1]))
        for lv in left_vars:
            for rv in right_vars:
                if lv != rv:
                    edges.append((lv, rv))
    
    return {'nodes': list(variables), 'edges': edges}

def _calculate_graph_connectivity(text):
    """Calculate connectivity of variable graph"""
    graph = _build_dependency_graph(text)
    if not graph['nodes']:
        return 0
    
    # Simple connectivity measure
    num_edges = len(graph['edges'])
    num_nodes = len(graph['nodes'])
    
    if num_nodes <= 1:
        return 0
    
    # Normalized connectivity
    max_edges = num_nodes * (num_nodes - 1)
    return num_edges / max_edges if max_edges > 0 else 0

def _check_cyclic_dependencies(text):
    """Check for circular dependencies"""
    graph = _build_dependency_graph(text)
    edges = graph['edges']
    
    # Simple cycle detection
    for a, b in edges:
        if (b, a) in edges:
            return True
    return False

def _calculate_coupling_strength(text):
    """Calculate how strongly equations are coupled"""
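    # Split on commas, newlines, or LaTeX line breaks (\\); drop empty pieces.
    # A character class would split on every single backslash and break \frac etc.
    equations = [eq for eq in re.split(r',|\n|\\\\', text) if eq.strip()]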
    if len(equations) <= 1:
        return 0
    
    # Find shared variables
    all_vars = []
    for eq in equations:
        vars_in_eq = set(re.findall(r'\b[a-zA-Z]\b', eq))
        all_vars.append(vars_in_eq)
    
    # Calculate overlap
    if not all_vars:
        return 0
    
    shared = set.intersection(*all_vars) if all_vars else set()
    total = set.union(*all_vars) if all_vars else set()
    
    return len(shared) / len(total) if total else 0

def _estimate_jacobian_sparsity(text):
    """Estimate sparsity of Jacobian matrix"""
    graph = _build_dependency_graph(text)
    num_vars = len(graph['nodes'])
    
    if num_vars == 0:
        return 0
    
    # Estimate non-zero entries
    num_edges = len(graph['edges'])
    total_possible = num_vars * num_vars
    
    return 1 - (num_edges / total_possible) if total_possible > 0 else 1

def _calculate_equation_similarity(text):
    """Calculate similarity between equations in system"""
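    # Split on commas, newlines, or LaTeX line breaks (\\); drop empty pieces
    equations = [eq for eq in re.split(r',|\n|\\\\', text) if eq.strip()]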
    if len(equations) <= 1:
        return 1.0
    
    # Simple character-based similarity
    similarities = []
    for i in range(len(equations)):
        for j in range(i + 1, len(equations)):
            eq1_chars = set(equations[i])
            eq2_chars = set(equations[j])
            if eq1_chars or eq2_chars:
                similarity = len(eq1_chars & eq2_chars) / len(eq1_chars | eq2_chars)
                similarities.append(similarity)
    
    return sum(similarities) / len(similarities) if similarities else 1.0

def _count_terms(text):
    """Count mathematical terms in equation"""
    # Split by operators
    terms = re.split(r'[+\-=]', text)
    # Filter out empty strings
    return len([t for t in terms if t.strip()])

def _estimate_font_size(ocr_data):
    """Estimate font size from OCR data"""
    if not ocr_data:
        return 12
    
    bbox_height = ocr_data.get('bbox_height', 20)
    # Rough estimation
    return bbox_height * 0.75

def _find_keyword_distance(equation_text, context_data):
    """Find distance to definition keywords"""
    if not context_data:
        return float('inf')
    
    text_before = context_data.get('text_before', '')
    keywords = ['where', 'with', 'given', 'such that']
    
    min_distance = float('inf')
    for keyword in keywords:
        if keyword in text_before:
            distance = len(text_before) - text_before.rfind(keyword)
            min_distance = min(min_distance, distance)
    
    return min_distance

def _classify_text_type(context_data):
    """Classify surrounding text type"""
    if not context_data:
        return "unknown"
    
    full_text = context_data.get('full_text', '').lower()
    
    if 'proof' in full_text:
        return "proof"
    elif 'theorem' in full_text:
        return "theorem"
    elif 'definition' in full_text:
        return "definition"
    elif 'example' in full_text:
        return "example"
    else:
        return "prose"

def _count_confusable_symbols(text):
    """Count easily confused symbols"""
    confusables = [
        ('0', 'O', 'o'),
        ('1', 'l', 'I', '|'),
        ('5', 'S', 's'),
        ('2', 'Z', 'z')
    ]
    
    count = 0
    for group in confusables:
        chars_in_text = sum(1 for char in text if char in group)
        if chars_in_text > 1:  # Multiple from same group
            count += chars_in_text
    
    return count

def _count_similar_variables(text):
    """Count similar variable names"""
    import re
    variables = set(re.findall(r'\b[a-zA-Z]\b', text))
    
    similar_pairs = [
        ('x', 'χ'), ('v', 'ν'), ('p', 'ρ'),
        ('a', 'α'), ('b', 'β'), ('g', 'γ')
    ]
    
    count = 0
    for v1, v2 in similar_pairs:
        if v1 in text and v2 in text:
            count += 1
    
    return count

def _determine_ode_order(text):
    """Determine ODE order"""
    import re
    # Look for derivative patterns
    second_order = bool(re.search(r"d²|\\frac\{d\^2|''|\\ddot", text))
    third_order = bool(re.search(r"d³|\\frac\{d\^3|'''", text))
    
    if third_order:
        return 3
    elif second_order:
        return 2
    elif 'd' in text or "'" in text or '\\dot' in text:
        return 1
    else:
        return 0

def _check_linearity(text):
    """Check if ODE is linear"""
    # Simplified check - look for products of variables
    import re
    variables = re.findall(r'\b[a-zA-Z]\b', text)
    
    # Check for products like xy, x^2
    for i in range(len(text) - 1):
        if text[i].isalpha() and text[i+1].isalpha():
            return False
        if text[i].isalpha() and text[i+1] == '^':
            return False
    
    return True

def _check_autonomous(text):
    """Check if ODE is autonomous"""
    # Check if time variable appears explicitly
    time_vars = ['t', 'τ', 'time']
    
    for var in time_vars:
        # Check if it appears not as derivative variable
        if re.search(rf'\b{var}\b(?![)])', text):
            return False
    
    return True

def _check_forcing_term(text):
    """Check for forcing/source terms"""
    # Look for common forcing term patterns
    forcing_patterns = [
        r'sin\(.*t',
        r'cos\(.*t',
        r'e\^.*t',
        r'f\(t\)',
        r'g\(t\)'
    ]
    
    for pattern in forcing_patterns:
        if re.search(pattern, text):
            return True
    
    return False

def _check_standard_forms(text):
    """Check if matches standard ODE forms"""
    # Check for common forms
    separable = bool(re.search(r'\\frac\{dy\}\{dx\}\s*=.*f\(x\).*g\(y\)', text))
    exact = bool(re.search(r'M.*dx.*\+.*N.*dy.*=.*0', text))
    
    return separable or exact

def _calculate_parenthesis_depth(text):
    """Calculate maximum nesting depth of parentheses"""
    max_depth = 0
    current_depth = 0
    
    for char in text:
        if char == '(':
            current_depth += 1
            max_depth = max(max_depth, current_depth)
        elif char == ')':
            current_depth = max(0, current_depth - 1)
    
    return max_depth
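
`_calculate_max_nesting_depth` decrements its counter on every closing brace, so braces that belong to other LaTeX commands can hide real nesting: for `x_{\frac{a}{b_{c}}}` it returns 1 although the subscripts are nested two deep. A possible alternative is sketched below, assuming the input is LaTeX-like text; the name `script_nesting_depth` is hypothetical and not part of the notebook.

```python
def script_nesting_depth(text, delimiter="_"):
    """Maximum depth of nested `delimiter{...}` groups in a LaTeX-like string.

    Every '{' is pushed on a stack with a flag that says whether it was opened
    by `delimiter`, so braces of other commands such as \\frac{a}{b} are
    matched correctly but do not add to the depth.
    """
    stack = []      # one bool per open brace: True if opened by the delimiter
    depth = 0
    max_depth = 0
    for i, char in enumerate(text):
        if char == "{":
            is_script = i > 0 and text[i - 1] == delimiter
            stack.append(is_script)
            if is_script:
                depth += 1
                max_depth = max(max_depth, depth)
        elif char == "}" and stack:
            if stack.pop():
                depth -= 1
    return max_depth


for example in [r"x_{i_{j}}", r"x_{\frac{a}{b_{c}}}", r"x^{2}"]:
    print(example, script_nesting_depth(example))
```

Like the original helper, this only counts braced scripts, so an unbraced subscript such as `x_i` still has depth 0.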

In [33]:
def extract_features(equation_text, ocr_data=None, context_data=None):
    features = {}
       
    # ========== SPECIAL SYMBOLS & GREEK LETTERS ==========
       
    # Binary: contains any Greek letters (α, β, γ, etc.)
    features['contains_greek_letters'] = bool(re.search(r'[α-ωΑ-Ω]|\\(alpha|beta|gamma|delta|epsilon|theta|lambda|mu|nu|xi|pi|rho|sigma|tau|phi|chi|psi|omega)', equation_text))
       
    # Count: total number of Greek letters in equation
    features['num_greek_letters'] = len(re.findall(r'[α-ωΑ-Ω]|\\(alpha|beta|gamma|delta|epsilon|theta|lambda|mu|nu|xi|pi|rho|sigma|tau|phi|chi|psi|omega)', equation_text))
       
    # Ratio: unicode (not ASCII) characters / total characters (higher = more special symbols)
    features['unicode_ratio'] = len([c for c in equation_text if ord(c) > 127]) / max(len(equation_text), 1)
       
    # Count: different unicode block types (Greek, Mathematical, etc.)
    features['unicode_block_diversity'] = _count_unicode_blocks(equation_text)
       
    # Count: rare mathematical symbols (∀, ∃, ∈, ∅, etc.)
    features['rare_symbol_count'] = len(re.findall(r'[∀∃∈∅⊂⊃⊆⊇∪∩≡≈≠≤≥∞∇∂]', equation_text))
       
    # Score: complexity based on character encoding variety
    features['encoding_complexity_score'] = _calculate_encoding_complexity(equation_text)
       
    # ========== SUBSCRIPTS & SUPERSCRIPTS ==========
       
    # Binary: has subscripts (x_i or x_{ij})
    features['has_subscripts'] = '_' in equation_text or '_{' in equation_text
       
    # Count: total number of subscripts
    features['num_subscripts'] = equation_text.count('_')
       
    # Binary: has nested subscripts (x_{i_j} or deeper)
    features['has_nested_subscripts'] = bool(re.search(r'_\{[^}]*_', equation_text))
       
    # Count: maximum subscript nesting depth (x_{i_{j_{k}}} = 3)
    features['max_subscript_depth'] = _calculate_max_nesting_depth(equation_text, '_')

    # Binary: has superscripts (x^i or x^{ij})
    features['has_superscripts'] = '^' in equation_text or '^{' in equation_text

    # Count: total number of superscripts
    features['num_superscripts'] = equation_text.count('^')
       
    
    # Count: mixed subscript/superscript (x_i^j)
    features['mixed_sub_super_count'] = len(re.findall(r'_[^_\s]+\^|_\{[^}]+\}\^', equation_text))
       
    # Binary: has numeric subscripts (x_1, x_2)
    features['has_numeric_subscripts'] = bool(re.search(r'_\d|_\{\d', equation_text))
       
    # Binary: has alphabetic subscripts (x_i, x_j)
    features['has_alphabetic_subscripts'] = bool(re.search(r'_[a-zA-Z]|_\{[a-zA-Z]', equation_text))

    # ========== PARAMETERS & VARIABLES ==========
       
    # Count: variables that appear only once (potential undefined)
    features['undefined_variable_candidates'] = _count_single_occurrence_vars(equation_text)
       
    # Score: how consistently parameters are used (0-1, 1=perfect)
    features['parameter_consistency_score'] = _calculate_parameter_consistency(equation_text)
       
    # Count: parameters that seem assumed but not defined
    features['implicit_parameter_count'] = _count_implicit_parameters(equation_text)
       
    # Category: naming style (single_letter/subscripted/multi_letter)
    features['parameter_naming_convention'] = _classify_naming_convention(equation_text)
       
    # Count: total unique variables
    features['num_unique_variables'] = len(_extract_variables(equation_text))
       
    # ========== TIME DEPENDENCE & DERIVATIVES ==========
       
    # Binary: uses dot notation for derivatives (ẋ, ẍ)
    features['uses_dot_notation'] = bool(re.search(r'[ẋẏżẍÿz̈]|\\dot\{|\\ddot\{', equation_text))
       
    # Binary: uses prime notation (y', y'')
    features['uses_prime_notation'] = "'" in equation_text or "′" in equation_text or "″" in equation_text
       
    # Binary: uses Leibniz notation (dy/dx, d²y/dx²)
    features['uses_leibniz_notation'] = bool(re.search(r'd[a-zA-Z]/d[a-zA-Z]|\\frac\{d', equation_text))
       
    # Binary: mixed derivative notations in same equation
    features['mixed_derivative_notation'] = sum([features['uses_dot_notation'], features['uses_prime_notation'], features['uses_leibniz_notation']]) > 1
       
    # Score: how consistently derivative notation is used
    features['derivative_notation_consistency'] = _calculate_derivative_consistency(equation_text)
       
    # Binary: consistent time variable (always 't' or always 'τ')
    features['time_variable_consistency'] = _check_time_var_consistency(equation_text)
       
    # Binary: derivative order makes sense (has y' if has y'')
    features['derivative_order_progression'] = _check_derivative_progression(equation_text)
       
    # Count: y vs y(t) inconsistencies
    features['implicit_time_dependence_issues'] = _count_time_dependence_issues(equation_text)
       
    # Count: derivatives with undefined variables (dy/dx but no x)
    features['derivative_variable_mismatch'] = _count_derivative_mismatches(equation_text)
       
    # ========== GRAPH STRUCTURE & COMPLEXITY ==========
       
    # Count: number of variables in ODE system
    features['dependency_graph_nodes'] = len(_build_dependency_graph(equation_text)['nodes'])
       
    # Count: variable relationships (x appears with y)
    features['dependency_graph_edges'] = len(_build_dependency_graph(equation_text)['edges'])
       
    # Ratio: edge density of the variable graph (edges / possible directed edges)
    features['graph_connectivity'] = _calculate_graph_connectivity(equation_text)
       
    # Binary: has circular dependencies (x depends on y, y on x)
    features['has_cyclic_dependencies'] = _check_cyclic_dependencies(equation_text)
       
    # Score: how many equations share variables (0-1)
    features['coupling_strength'] = _calculate_coupling_strength(equation_text)
       
    # Ratio: estimated share of zero entries in the Jacobian matrix (1 = fully sparse)
    features['jacobian_sparsity'] = _estimate_jacobian_sparsity(equation_text)
       
    # Score: average similarity between equations in system
    features['equation_similarity_score'] = _calculate_equation_similarity(equation_text)
       
    # Count: total terms in equation
    features['total_term_count'] = _count_terms(equation_text)
       
    # ========== VISUAL FEATURES (requires OCR data) ==========
       
    if ocr_data:
        # Area: size of equation bounding box in pixels
        features['bounding_box_area'] = ocr_data.get('bbox_area', 0)
           
        # Ratio: width/height of equation region
        features['bounding_box_aspect_ratio'] = ocr_data.get('bbox_width', 1) / max(ocr_data.get('bbox_height', 1), 1)
           
        # Density: characters per pixel area
        features['character_density'] = len(equation_text) / max(features['bounding_box_area'], 1)
           
        # Ratio: whitespace pixels / total pixels
        features['whitespace_ratio'] = ocr_data.get('whitespace_ratio', 0)
           
        # Score: average OCR confidence (0-1); falls back to 0.5 when no scores are given
        confidence_scores = ocr_data.get('confidence_scores') or [0.5]
        features['ocr_confidence_mean'] = float(np.mean(confidence_scores))

        # Score: standard deviation of OCR confidence
        features['ocr_confidence_std'] = float(np.std(confidence_scores))
           
        # Count: characters with confidence < 0.7
        features['low_confidence_char_count'] = sum(1 for c in ocr_data.get('char_confidences', []) if c < 0.7)
           
        # Pixels: estimated font size from bbox
        features['font_size_estimate'] = _estimate_font_size(ocr_data)
           
        # Score: how well multi-line equations align (0-1)
        features['alignment_score'] = ocr_data.get('alignment_score', 0)
           
        # Count: overlapping character regions
        features['overlapping_regions'] = ocr_data.get('overlap_count', 0)
       
    # ========== CONTEXT FEATURES ==========
    if context_data:
        # Binary: has a definition keyword ("where x is...") before the equation
        features['has_preceding_definition'] = bool(re.search(r'where|with|given|such that', context_data.get('text_before', '')))

        # Distance: characters to nearest definition keyword
        features['definition_keyword_distance'] = _find_keyword_distance(equation_text, context_data)

        # Binary: equation has label like (1), (2.3), Eq. 1
        features['equation_label_present'] = bool(re.search(r'\(\d+\.?\d*\)|Eq\.?\s*\d+', context_data.get('full_text', '')))

        # Binary: in LaTeX equation environment
        features['in_equation_environment'] = '\\begin{equation}' in context_data.get('full_text', '')

        # Category: proof/theorem/definition/example/prose (unknown without context)
        features['surrounding_text_type'] = _classify_text_type(context_data)

        # Binary: references previous equations
        features['references_previous_equation'] = bool(re.search(r'equation\s*\(\d+\)|from\s*\(\d+\)|see\s*\(\d+\)', context_data.get('text_after', '')))

        # Binary: has boundary/initial condition keywords
        features['has_boundary_condition_text'] = bool(re.search(r'subject to|with IC|initial condition|boundary condition|BC:|IC:', context_data.get('full_text', '')))

        # Binary: domain specified (for x ∈ [0,1])
        features['domain_specification_present'] = bool(re.search(r'for\s+[a-zA-Z]\s*[∈∊]\s*[\[\(]|[a-zA-Z]\s*in\s*[\[\(]', context_data.get('full_text', '')))

    # ========== AMBIGUITY FEATURES ==========
       
    # Count: ambiguous notations (log vs ln, etc.)
    features['ambiguous_notation_count'] = len(re.findall(r'\b(log|arg|det|dim|ker|span)\b', equation_text))
       
    # Count: implicit multiplications (xy vs x*y)
    features['implicit_multiplication_count'] = len(re.findall(r'[a-zA-Z]\d|[a-zA-Z][a-zA-Z]|\d[a-zA-Z]', equation_text))
       
    # Count: easily confused symbols (0/O, 1/l/I)
    features['confusable_symbol_count'] = _count_confusable_symbols(equation_text)
       
    # Count: similar variable names (x/χ, v/ν, p/ρ)
    features['similar_variable_count'] = _count_similar_variables(equation_text)
       
    # ========== ODE-SPECIFIC FEATURES ==========
       
    # Order: highest derivative order (1, 2, 3, etc.)
    features['ode_order'] = _determine_ode_order(equation_text)
       
    # Binary: is linear ODE
    features['is_linear_ode'] = _check_linearity(equation_text)
       
    # Binary: is autonomous (no explicit time variable)
    features['is_autonomous'] = _check_autonomous(equation_text)
       
    # Binary: has forcing/source term
    features['has_forcing_term'] = _check_forcing_term(equation_text)
       
    # Count: number of coupled equations
    features['num_coupled_equations'] = equation_text.count('=') if '=' in equation_text else 0
       
    # Binary: matches standard forms (separable, exact, etc.)
    features['is_standard_form'] = _check_standard_forms(equation_text)
       
    # Length: total character count
    features['equation_length'] = len(equation_text)
       
    # Binary: is multi-line equation
    features['is_multiline'] = '\n' in equation_text or '\\\\' in equation_text
       
    # Count: number of lines
    features['num_lines'] = equation_text.count('\n') + equation_text.count('\\\\') + 1

    # ========== MATHEMATICAL OPERATIONS ==========

    # Binary: has derivatives (any notation)
    features['has_derivatives'] = any([
        features['uses_dot_notation'],
        features['uses_prime_notation'], 
        features['uses_leibniz_notation']
        ])
    
    # Binary: has integrals (∫ or \int or \iint or \oint etc.)
    features['has_integrals'] = bool(re.search(r'∫|∬|∭|∮|\\int|\\iint|\\iiint|\\oint|\\smallint', equation_text))

    # Binary: has summation
    features['has_summation'] = bool(re.search(r'∑|\\sum', equation_text))

    # Binary: has product notation
    features['has_product'] = bool(re.search(r'∏|\\prod', equation_text))


    # Binary: has special functions
    features['has_special_functions'] = bool(re.search(
        r'(sin|cos|tan|sinh|cosh|tanh|exp|log|ln|sqrt|erf|gamma|bessel|arcsin|arccos|arctan)',
        equation_text, re.IGNORECASE
        ))

    # Binary: has fractions
    features['has_fractions'] = bool(re.search(r'/|\\frac|\\dfrac|\\tfrac', equation_text))

    # Binary: has matrices
    features['has_matrices'] = bool(re.search(
      r'\\begin\{[pbvBV]?matrix\}|\\matrix|\\det|\\text\{det\}|\\begin\{array\}',
     equation_text
    ))

    # Count: mathematical operations
    features['num_additions'] = equation_text.count('+')
    features['num_subtractions'] = equation_text.count('-') - equation_text.count('e-')  # Exclude scientific notation
    features['num_multiplications'] = equation_text.count('*') + equation_text.count('·') + equation_text.count('\\cdot') + equation_text.count('\\times')
    features['num_divisions'] = equation_text.count('/') + equation_text.count('÷') + equation_text.count('\\div')

    # ========== STRUCTURAL FEATURES ==========

    # Count: parentheses pairs
    features['num_parentheses_pairs'] = min(equation_text.count('('), equation_text.count(')'))

    # Count: maximum nesting depth of parentheses
    features['max_nesting_depth'] = _calculate_parenthesis_depth(equation_text)
       
       
    return features
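
The `has_cyclic_dependencies` feature relies on `_check_cyclic_dependencies`, which only flags direct two-way pairs (x depends on y and y on x); a longer loop such as x → y → z → x is not reported. A minimal sketch of a general check follows, assuming the same graph layout that `_build_dependency_graph` returns. The function name `has_cycle` and the two example graphs are hypothetical and not part of the notebook.

```python
def has_cycle(graph):
    """Return True if the directed graph contains a cycle of any length.

    `graph` uses the layout {'nodes': [...], 'edges': [(source, target), ...]}.
    """
    adjacency = {node: [] for node in graph["nodes"]}
    for source, target in graph["edges"]:
        adjacency.setdefault(source, []).append(target)
        adjacency.setdefault(target, [])

    WHITE, GRAY, BLACK = 0, 1, 2    # unvisited, on the current path, finished
    color = {node: WHITE for node in adjacency}

    for start in adjacency:
        if color[start] != WHITE:
            continue
        # Iterative depth-first search; each stack entry is (node, neighbour iterator).
        color[start] = GRAY
        stack = [(start, iter(adjacency[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if color[nxt] == GRAY:      # edge back into the current path
                    return True
                if color[nxt] == WHITE:
                    color[nxt] = GRAY
                    stack.append((nxt, iter(adjacency[nxt])))
                    break
            else:
                # All neighbours explored without finding a cycle through `node`
                color[node] = BLACK
                stack.pop()
    return False


three_cycle = {"nodes": ["x", "y", "z"], "edges": [("x", "y"), ("y", "z"), ("z", "x")]}
chain = {"nodes": ["x", "y", "z"], "edges": [("x", "y"), ("y", "z")]}
print(has_cycle(three_cycle), has_cycle(chain))
```

The three-colour marking distinguishes nodes on the current search path from nodes that are already fully explored, so shared downstream variables are not mistaken for loops.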

#### 3-step comparison of ODE systems

##### STEP 1: Extract features and calculate scores

In [34]:
def step1_extract_and_score_features(correct_df):
    all_features = []
    skipped_count = 0
    
    for _, row in correct_df.iterrows():
        equation = row['correct_eqs']
        model = row['model']

        if pd.isna(equation) or equation is None:
            print(f"Skipping {model} - no equation data")
            skipped_count += 1
            continue
            
        equation = str(equation)
        
        if not equation.strip() or equation == 'nan':
            print(f"Skipping {model} - empty equation")
            skipped_count += 1
            continue
        
        try:
            # extracting all features
            features = extract_features(equation)
        except Exception as e:
            print(f"Error extracting features for {model}: {e}")
            skipped_count += 1
            continue
        
        features['model'] = model
        features['original_equation'] = equation
        
        # calculating different complexity scores

        # 1. Symbol Complexity Score (0-10)
        symbol_score = (
            features['contains_greek_letters'] * 2 +
            features['num_greek_letters'] * 0.5 +
            features['unicode_ratio'] * 10 +
            features['rare_symbol_count'] * 1 +
            features['encoding_complexity_score']
        )
        features['symbol_complexity_score'] = min(10, symbol_score)

        # 2. Structural Complexity Score (0-10)
        structural_score = (
            features['has_subscripts'] * 1 +
            features['num_subscripts'] * 0.3 +
            features['has_superscripts'] * 1 +
            features['num_superscripts'] * 0.3 +
            features['max_subscript_depth'] * 2 +
            features['has_nested_subscripts'] * 3 +
            features['mixed_sub_super_count'] * 0.5
        )
        features['structural_complexity_score'] = min(10, structural_score)

        # 3. Mathematical Complexity Score (0-10)
        math_score = (
            features['ode_order'] * 1.5 +
            features['num_coupled_equations'] * 2 +
            features['has_integrals'] * 1 +
            features['has_derivatives'] * 0.5 +
            features['has_summation'] * 1 +
            features['has_special_functions'] * 1.5 +
            (not features['is_linear_ode']) * 2 +
            features['total_term_count'] * 0.1
        )
        features['mathematical_complexity_score'] = min(10, math_score)

        # 4. Visual Complexity Score (0-10)
        visual_score = (
            features['is_multiline'] * 3 +
            features['num_lines'] * 0.5 +
            features['equation_length'] / 50 +  # Normalized by typical length
            features['num_parentheses_pairs'] * 0.3 +
            features['max_nesting_depth'] * 1 +
            features['has_matrices'] * 2
        )
        features['visual_complexity_score'] = min(10, visual_score)

        # 5. Overall Complexity Score (0-10)
        features['overall_complexity_score'] = (
            features['symbol_complexity_score'] * 0.3 +
            features['structural_complexity_score'] * 0.25 +
            features['mathematical_complexity_score'] * 0.25 +
            features['visual_complexity_score'] * 0.2
        )

        # Risk Assessment Scores

        # OCR Risk Score (0-10) - likelihood of OCR errors
        features['ocr_risk_score'] = (
            features['unicode_ratio'] * 15 +
            features['contains_greek_letters'] * 2 +
            features['num_greek_letters'] * 0.3 +
            features['confusable_symbol_count'] * 1 +
            features['similar_variable_count'] * 0.5 +
            features['implicit_multiplication_count'] * 0.2
        )
        features['ocr_risk_score'] = min(10, features['ocr_risk_score'])

        # Extraction Difficulty Score (0-10)
        features['extraction_difficulty_score'] = (
            features['overall_complexity_score'] * 0.4 +
            features['ocr_risk_score'] * 0.6
        )

        # Feature Counts Summary
        features['total_special_symbols'] = (
            features['num_greek_letters'] +
            features['rare_symbol_count'] +
            features['num_subscripts'] +
            features['num_superscripts']
        )

        features['total_mathematical_operations'] = (
            features['num_additions'] +
            features['num_subtractions'] +
            features['num_multiplications'] +
            features['num_divisions'] +
            features['has_integrals'] +
            features['has_summation'] +
            features['has_product']
        )

        features['total_complexity_indicators'] = sum([
            features['contains_greek_letters'],
            features['has_subscripts'],
            features['has_superscripts'],
            features['has_fractions'],
            features['has_integrals'],
            features['has_derivatives'],
            features['has_matrices'],
            features['has_special_functions'],
            features['is_multiline'],
            features['mixed_derivative_notation']
        ])
        
        # Risk Categories
        if features['extraction_difficulty_score'] < 3:
            features['risk_category'] = 'low'
        elif features['extraction_difficulty_score'] < 6:
            features['risk_category'] = 'medium'
        elif features['extraction_difficulty_score'] < 8:
            features['risk_category'] = 'high'
        else:
            features['risk_category'] = 'very_high'
        
        # Specific risk flags
        features['has_high_risk_features'] = any([
            features['unicode_ratio'] > 0.3,
            features['max_subscript_depth'] > 2,
            features['num_greek_letters'] > 5,
            features['mixed_derivative_notation'],
            features['has_matrices']
        ])
        
        all_features.append(features)
    
    # creating features_df
    features_df = pd.DataFrame(all_features)
    
    # summary statistics
    print("\n Feature Extraction Summary")
    print("="*50)
    print(f"Total equations analyzed: {len(features_df)}")
    print(f"\nComplexity Score Distribution:")
    print(f"  Low (0-3): {len(features_df[features_df['overall_complexity_score'] < 3])}")
    print(f"  Medium (3-6): {len(features_df[(features_df['overall_complexity_score'] >= 3) & (features_df['overall_complexity_score'] < 6)])}")
    print(f"  High (6-8): {len(features_df[(features_df['overall_complexity_score'] >= 6) & (features_df['overall_complexity_score'] < 8)])}")
    print(f"  Very High (8-10): {len(features_df[features_df['overall_complexity_score'] >= 8])}")
    
    print(f"\nRisk Categories:")
    print(features_df['risk_category'].value_counts())
    
    print(f"\nTop 5 Most Complex Equations:")
    top_complex = features_df.nlargest(5, 'overall_complexity_score')[['model', 'overall_complexity_score', 'risk_category']]
    print(top_complex)
    
    # feature statistics
    print(f"\nFeature Statistics:")
    print(f"  Equations with Greek letters: {features_df['contains_greek_letters'].sum()} ({features_df['contains_greek_letters'].mean()*100:.1f}%)")
    print(f"  Equations with subscripts: {features_df['has_subscripts'].sum()} ({features_df['has_subscripts'].mean()*100:.1f}%)")
    print(f"  Multi-line equations: {features_df['is_multiline'].sum()} ({features_df['is_multiline'].mean()*100:.1f}%)")
    print(f"  Average ODE order: {features_df['ode_order'].mean():.2f}")
    print(f"  Average complexity indicators per equation: {features_df['total_complexity_indicators'].mean():.2f}")
    
    return features_df
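
Each sub-score in `step1_extract_and_score_features` follows the same pattern: a weighted sum of features, capped at 10. A minimal sketch of that arithmetic, using the structural weights from the function above; the helper name `capped_weighted_score` and the sample feature values are made up for illustration.

```python
def capped_weighted_score(features, weights, cap=10.0):
    """Weighted sum of the named features, clipped to `cap`."""
    total = sum(features[name] * weight for name, weight in weights.items())
    return min(cap, total)


# Structural weights copied from step1_extract_and_score_features.
STRUCTURAL_WEIGHTS = {
    'has_subscripts': 1,
    'num_subscripts': 0.3,
    'has_superscripts': 1,
    'num_superscripts': 0.3,
    'max_subscript_depth': 2,
    'has_nested_subscripts': 3,
    'mixed_sub_super_count': 0.5,
}

# Hypothetical feature values for one equation system.
example = {
    'has_subscripts': True,
    'num_subscripts': 4,
    'has_superscripts': False,
    'num_superscripts': 0,
    'max_subscript_depth': 1,
    'has_nested_subscripts': False,
    'mixed_sub_super_count': 0,
}

score = capped_weighted_score(example, STRUCTURAL_WEIGHTS)
print(round(score, 2))  # 1 + 1.2 + 2 = 4.2
```

One consequence of the cap is that every system whose raw sum exceeds 10 receives the same score, which may flatten differences between the most complex systems.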

> The helper functions below still need to be checked for usefulness:

In [35]:
# extras for step 1:

# Helper function to create a feature summary report
def create_feature_summary_report(features_df):
   # Create a summary report of feature distributions and risk assessments
   report = {
       'total_equations': len(features_df),
       'complexity_distribution': {
           'low': len(features_df[features_df['overall_complexity_score'] < 3]),
           'medium': len(features_df[(features_df['overall_complexity_score'] >= 3) & (features_df['overall_complexity_score'] < 6)]),
           'high': len(features_df[(features_df['overall_complexity_score'] >= 6) & (features_df['overall_complexity_score'] < 8)]),
           'very_high': len(features_df[features_df['overall_complexity_score'] >= 8])
       },
       'risk_distribution': features_df['risk_category'].value_counts().to_dict(),
       'feature_prevalence': {
           'greek_letters': features_df['contains_greek_letters'].mean(),
           'subscripts': features_df['has_subscripts'].mean(),
           'superscripts': features_df['has_superscripts'].mean(),
           'fractions': features_df['has_fractions'].mean(),
           'integrals': features_df['has_integrals'].mean(),
           'derivatives': features_df['has_derivatives'].mean(),
           'multiline': features_df['is_multiline'].mean()
       },
       'average_scores': {
           'symbol_complexity': features_df['symbol_complexity_score'].mean(),
           'structural_complexity': features_df['structural_complexity_score'].mean(),
           'mathematical_complexity': features_df['mathematical_complexity_score'].mean(),
           'visual_complexity': features_df['visual_complexity_score'].mean(),
           'overall_complexity': features_df['overall_complexity_score'].mean(),
           'ocr_risk': features_df['ocr_risk_score'].mean(),
           'extraction_difficulty': features_df['extraction_difficulty_score'].mean()
       },
       'high_risk_equations': features_df[features_df['has_high_risk_features']]['model'].tolist()
   }
   
   return report

# Visualization function for feature analysis
def visualize_feature_analysis(features_df):
   # Create visualizations of feature distributions and complexity scores
   import matplotlib.pyplot as plt
   import seaborn as sns
   
   fig, axes = plt.subplots(2, 2, figsize=(15, 12))
   
   # 1. Complexity score distribution
   axes[0, 0].hist(features_df['overall_complexity_score'], bins=20, edgecolor='black')
   axes[0, 0].set_title('Overall Complexity Score Distribution')
   axes[0, 0].set_xlabel('Complexity Score (0-10)')
   axes[0, 0].set_ylabel('Number of Equations')
   
   # 2. Risk category pie chart
   risk_counts = features_df['risk_category'].value_counts()
   axes[0, 1].pie(risk_counts.values, labels=risk_counts.index, autopct='%1.1f%%')
   axes[0, 1].set_title('Risk Category Distribution')
   
   # 3. Feature prevalence bar chart
   feature_cols = ['contains_greek_letters', 'has_subscripts', 'has_superscripts', 
                  'has_fractions', 'has_integrals', 'has_derivatives', 'is_multiline']
   feature_prevalence = [features_df[col].mean() * 100 for col in feature_cols]
   feature_names = ['Greek', 'Subscripts', 'Superscripts', 'Fractions', 
                   'Integrals', 'Derivatives', 'Multi-line']
   
   axes[1, 0].bar(feature_names, feature_prevalence)
   axes[1, 0].set_title('Feature Prevalence (%)')
   axes[1, 0].set_ylabel('Percentage of Equations')
   # Rotate the existing tick labels; set_xticklabels without fixed ticks can trigger a warning
   axes[1, 0].tick_params(axis='x', labelrotation=45)
   
   # 4. Complexity components scatter
   axes[1, 1].scatter(features_df['symbol_complexity_score'], 
                     features_df['ocr_risk_score'], 
                     c=features_df['overall_complexity_score'], 
                     cmap='RdYlBu_r', alpha=0.6)
   axes[1, 1].set_xlabel('Symbol Complexity Score')
   axes[1, 1].set_ylabel('OCR Risk Score')
   axes[1, 1].set_title('Symbol Complexity vs OCR Risk')
   cbar = plt.colorbar(axes[1, 1].collections[0], ax=axes[1, 1])
   cbar.set_label('Overall Complexity')
   
   plt.tight_layout()
   plt.savefig('feature_analysis_summary.png', dpi=300)
   plt.show()

# Main execution for Step 1
def run_step1_analysis(correct_df):
   # Run complete Step 1 analysis
   print("Starting Step 1: Feature Extraction and Scoring")
   print("-" * 50)
   
   # Extract features and calculate scores
   features_df = step1_extract_and_score_features(correct_df)
   
   # Create summary report
   summary_report = create_feature_summary_report(features_df)
   
   # Save results
   features_df.to_csv('equation_features_with_scores.csv', index=False)
   
   # Create visualizations
   visualize_feature_analysis(features_df)
   
   return features_df, summary_report

##### STEP 2: Equation Comparison

In [36]:
EQUATION_COMPARISON_PROMPT = """
You are an expert in mathematical notation and ODE systems. Compare these two ODE systems and identify ALL differences with high precision.

CORRECT ODE SYSTEM:
{correct_system}

EXTRACTED ODE SYSTEM:
{extracted_system}

Analyze the systems for:
1. Mathematical equivalence (same mathematical meaning despite notation differences)
2. Symbol accuracy (Greek letters, subscripts, superscripts)
3. Structural integrity (derivatives, fractions, matrices)
4. Completeness (missing equations, terms, or conditions)
5. Notation consistency (derivative notation, function notation)

Return a structured JSON response:
{{
    "mathematically_equivalent": true/false,
    "comparison_summary": {{
        "total_equations_correct": integer,
        "total_equations_extracted": integer,
        "completely_correct_equations": integer,
        "partially_correct_equations": integer,
        "missing_equations": integer,
        "extra_equations": integer
    }},
    "errors": [
        {{
            "equation_index": integer,
            "error_description": "specific description",
            "correct_form": "what it should be",
            "extracted_form": "what was extracted",
            "error_location": "specific location in equation"
        }}
    ],
    "notation_issues": [
        {{
            "type": "greek_letter|subscript|superscript|derivative|operator",
            "details": "specific notation problem"
        }}
    ]
}}
"""

ERROR_CATEGORIZATION_PROMPT = """
You are an expert in analyzing mathematical extraction errors. Categorize the errors from the equation comparison.

COMPARISON RESULTS:
{comparison_results}

Analyze and categorize each error based on:

1. ERROR CATEGORIES:
   - symbol_recognition: Greek letters, special mathematical symbols misrecognized
   - subscript_superscript: Issues with sub/superscript notation
   - structural_corruption: Fractions, matrices, nested structures damaged
   - derivative_notation: Problems with derivative representations
   - boundary_initial_conditions: Missing or corrupted conditions
   - operator_errors: Mathematical operators (+, -, ×, ÷) misrecognized
   - completeness_errors: Missing equations, terms, or parts
   - formatting_errors: Layout, alignment, multi-line equation issues

2. SEVERITY LEVELS:
   - low: Cosmetic issues that don't affect mathematical meaning
   - medium: Errors that make equations harder to read but still understandable
   - high: Significant errors affecting mathematical correctness
   - critical: Errors making equations unusable or mathematically wrong

3. ROOT CAUSES:
   - ocr_limitation: Typical OCR challenges (similar looking characters)
   - complexity_induced: Errors due to equation complexity
   - notation_ambiguity: Unclear or ambiguous notation in source
   - extraction_logic: Issues with the extraction algorithm

Return structured JSON:
{{
    "overall_severity": "low|medium|high|critical",
    "primary_error_category": "most common category",
    "error_distribution": {{
        "by_category": {{
            "symbol_recognition": integer,
            "subscript_superscript": integer,
            ...
        }},
        "by_severity": {{
            "low": integer,
            "medium": integer,
            "high": integer,
            "critical": integer
        }}
    }},
    "detailed_categorization": [
        {{
            "error_id": integer,
            "categories": ["primary_category", "secondary_category"],
            "severity": "low|medium|high|critical",
            "root_cause": "identified cause",
            "fix_difficulty": "easy|moderate|hard",
            "suggested_improvement": "specific suggestion for MIRA"
        }}
    ],
    "patterns_identified": [
        {{
            "pattern": "description of recurring error pattern",
            "frequency": integer,
            "affected_features": ["list of equation features that correlate with this error"]
        }}
    ],
    "extraction_quality_score": float (0-100)
}}
"""
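
Both prompt templates are later filled with `str.format`, which is why the literal JSON braces are doubled (`{{` and `}}`) while the placeholders use single braces. A small illustration with a shortened, hypothetical template (`MINI_PROMPT` is not part of the notebook):

```python
MINI_PROMPT = """Compare the systems.

CORRECT:
{correct_system}

Return JSON:
{{
    "mathematically_equivalent": true/false
}}
"""

# Single-brace placeholders are substituted; doubled braces become literal braces.
filled = MINI_PROMPT.format(correct_system="dS/dt = -beta*S*I")
print(filled)
```

Braces inside the substituted equation text (for example LaTeX `\frac{dS}{dt}`) are not reinterpreted, because `str.format` only parses the template itself.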


In [37]:
def get_openai_completion(prompt, model="gpt-4", temperature=0.0, max_tokens=1024):
    """Send prompt to OpenAI API"""
    try:
        response = client.chat.completions.create(  
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
            max_tokens=max_tokens,
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"OpenAI API error: {e}")
        return None
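
Because `get_openai_completion` returns `None` on any exception, a transient API failure silently drops a model from the analysis. One possible mitigation is a generic retry wrapper with exponential backoff. The sketch below is an assumption about how that could look: `call_with_retries`, its parameters, and the flaky stand-in function are hypothetical and not part of the notebook.

```python
import time


def call_with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `func()` until it returns a non-None value or attempts run out.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts.
    """
    for attempt in range(max_attempts):
        result = func()
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return None


# Stand-in for a flaky API call: fails twice, then succeeds.
calls = {'count': 0}


def flaky_completion():
    calls['count'] += 1
    return '{"ok": true}' if calls['count'] >= 3 else None


print(call_with_retries(flaky_completion, base_delay=0.01))  # {"ok": true}
```

In the notebook this could wrap the existing helper, for example `call_with_retries(lambda: get_openai_completion(prompt))`.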

In [38]:
def clean_equation_string(eq_str):
    """Clean equation string from CSV format"""
    if pd.isna(eq_str):
        return None
    
    eq_str = str(eq_str)
    
    # Case seen in this dataset's CSV export: the string starts with '" and ends with )"
    # Check for a single quote followed by a double quote
    if len(eq_str) >= 2 and eq_str[0] == "'" and eq_str[1] == '"':
        eq_str = eq_str[2:]  # Remove '"
    elif eq_str.startswith('"'):
        eq_str = eq_str[1:]  # Remove just "
    
    # Remove the trailing patterns
    if eq_str.endswith(')"'):
        eq_str = eq_str[:-2]  # Remove )"
    elif eq_str.endswith('"'):
        eq_str = eq_str[:-1]  # Remove just "
    elif eq_str.endswith("'"):
        eq_str = eq_str[:-1]  # Remove just '
    
    # Clean up escapes
    eq_str = eq_str.replace('\\n', '\n')
    eq_str = eq_str.replace('\\"', '"')
    eq_str = eq_str.replace("\\\'", "'")
    
    return eq_str

In [39]:
def compare_ode_system(model, correct_df, extracted_df, version=VERSION):
    """Compare ODE systems using OpenAI"""
    
    # getting the ODE systems
    correct_system = correct_df[correct_df['model'] == model]['correct_eqs'].values
    if len(correct_system) == 0:
        print(f"No correct equation found for {model}")
        return None
    correct_system = correct_system[0]
    
    # No version filtering needed - just get by model
    extracted_row = extracted_df[extracted_df['model'] == model]
    
    if len(extracted_row) == 0:
        print(f"No extraction found for {model}")
        return None
    
    extracted_system = extracted_row['extracted_eqs'].values[0]
    
    # Check for missing values (None or NaN) before calling len() on them
    if pd.isna(correct_system) or pd.isna(extracted_system):
        print(f"Missing equation data for {model}")
        return None
    
    print(f"\nComparing equations for {model}")
    print(f"Correct equation length: {len(correct_system)} chars")
    print(f"Extracted equation length: {len(extracted_system)} chars")
    
    # prompt
    formatted_prompt = EQUATION_COMPARISON_PROMPT.format(
        correct_system=correct_system,
        extracted_system=extracted_system
    )
    
    # send to OpenAI
    try:
        response = get_openai_completion(formatted_prompt, temperature=0.0, max_tokens=2048)
        
        if response:
            # DEBUG show response
            print(f"\nDEBUG - OpenAI response (first 200 chars): {response[:200]}")
            
            result = json.loads(response)
            
            print(f"\nModel: {model}")
            print("="*50)
            print(f"Mathematically equivalent: {result['mathematically_equivalent']}")
            
            if result.get('errors'):
                print(f"\nErrors found ({len(result['errors'])}):") 
                for i, error in enumerate(result['errors'], 1):
                    print(f"  {i}. {error.get('error_description', 'No description')}")
            else:
                print("\nNo errors - perfect extraction!")
            
            return result
            
    except json.JSONDecodeError as e:
        print(f"Error parsing JSON response: {e}")
        print(f"Raw response: {response}")
        return None
    except Exception as e:
        print(f"Error in comparison: {e}")
        import traceback
        traceback.print_exc()
        return None
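
`json.loads` fails when the model wraps its answer in a Markdown code fence or adds a sentence around the JSON, which chat models sometimes do. A defensive parser could strip that wrapping first. The helper below is a hypothetical sketch (`parse_llm_json` is not part of the notebook), and taking the span from the first `{` to the last `}` is a heuristic, not a full JSON locator.

```python
import json


def parse_llm_json(response_text):
    """Best-effort extraction of a JSON object from an LLM reply.

    Returns the parsed dict, or None if no valid JSON object is found.
    """
    if not response_text:
        return None
    start = response_text.find('{')
    end = response_text.rfind('}')
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(response_text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


# Simulated reply wrapped in a Markdown code fence.
fence = '`' * 3
fenced = fence + 'json\n{"mathematically_equivalent": false, "errors": []}\n' + fence
print(parse_llm_json(fenced))  # {'mathematically_equivalent': False, 'errors': []}
```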

##### STEP 3: Categorizing the errors

In [40]:
def step3_ai_categorize_errors(model, step2_result):
    """Categorize errors using OpenAI"""
    
    if not step2_result or not step2_result.get('errors'):
        return {
            'model': model,
            'overall_severity': 'none',
            'primary_error_category': 'perfect_extraction',
            'error_distribution': {'by_category': {}, 'by_severity': {'none': 1}},
            'detailed_categorization': [],
            'patterns_identified': [],
            'extraction_quality_score': 100.0
        }
    
    # prepare comparison results for categorization
    comparison_data = {
        'model': model,
        'errors': step2_result.get('errors', []),
        'notation_issues': step2_result.get('notation_issues', []),
        'summary': step2_result.get('comparison_summary', {})
    }
    
    # prompt
    formatted_prompt = ERROR_CATEGORIZATION_PROMPT.format(
        comparison_results=json.dumps(comparison_data, indent=2)
    )
    
    # send to OpenAI
    response = get_openai_completion(formatted_prompt, temperature=0.0, max_tokens=1024)
    
    if response:
        try:
            categorization = json.loads(response)
            categorization['model'] = model
            categorization['total_errors'] = len(step2_result.get('errors', []))
            
            # Print results
            print(f"\nERROR CATEGORIZATION: {model}")
            print("="*50)
            print(f"Overall Severity: {categorization.get('overall_severity', 'unknown')}")
            print(f"Primary Category: {categorization.get('primary_error_category', 'unknown')}")
            print(f"Quality Score: {categorization.get('extraction_quality_score', 0)}/100")
            
            if categorization.get('error_distribution'):
                print("\nError Distribution by Category:")
                for category, count in categorization['error_distribution'].get('by_category', {}).items():
                    print(f"  {category}: {count}")
            
            return categorization
            
        except json.JSONDecodeError as e:
            print(f"Error parsing JSON response: {e}")
            print(f"Raw response: {response}")
            return None
    
    return None
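
The `error_distribution` counts are produced by the model itself and may not agree with the entries in `detailed_categorization`. One possible cross-check is to recount from the detailed list. The helper name `recount_categories` and the sample payload are illustrative, and the sketch assumes the first entry in `categories` is the primary one, as the prompt's schema suggests.

```python
from collections import Counter


def recount_categories(categorization):
    """Count primary categories and severities from 'detailed_categorization'."""
    by_category = Counter()
    by_severity = Counter()
    for item in categorization.get('detailed_categorization', []):
        # Fall back to 'unknown' when the list is missing or empty.
        categories = item.get('categories') or ['unknown']
        by_category[categories[0]] += 1
        by_severity[item.get('severity', 'unknown')] += 1
    return {'by_category': dict(by_category), 'by_severity': dict(by_severity)}


# Made-up categorization payload in the shape requested by the prompt.
sample = {
    'detailed_categorization': [
        {'error_id': 1, 'categories': ['symbol_recognition'], 'severity': 'high'},
        {'error_id': 2, 'categories': ['subscript_superscript', 'symbol_recognition'], 'severity': 'medium'},
        {'error_id': 3, 'categories': ['symbol_recognition'], 'severity': 'high'},
    ]
}
print(recount_categories(sample))
```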


##### Main Execution Pipeline

In [41]:
def run_complete_analysis(correct_df, extracted_df, version=VERSION, models_to_analyze=None):
    """Run the complete error analysis pipeline
    
    Args:
        correct_df: DataFrame with correct equations
        extracted_df: DataFrame with extracted equations  
        version: Version label of the extraction run (passed through for reporting;
            extracted_df is expected to come from a version-specific file already)
        models_to_analyze: List of specific models to analyze, or None for all
    """
    
    print("STEP 1: Extracting Features...")
    features_df = step1_extract_and_score_features(correct_df)
    
    print("\nSTEP 2 & 3: Comparing and Categorizing...")
    comparison_results = []
    categorization_results = []
    
    # get models to process
    if models_to_analyze is not None:
        models = models_to_analyze
    else:
        models = correct_df['model'].unique()
    
    print(f"Will analyze {len(models)} models")
    
    # Process each model
    for i, model in enumerate(models):
        print(f"\n[{i+1}/{len(models)}] Processing {model}...")
        
        step2_result = compare_ode_system(model, correct_df, extracted_df, version)
        
        if step2_result:
            comparison_results.append({
                'model': model,
                'result': step2_result
            })
            
            step3_result = step3_ai_categorize_errors(model, step2_result)
            
            if step3_result:
                categorization_results.append(step3_result)
        
        time.sleep(1)
    
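    # creating summary df; pair results by model, because step 3 can fail
    # for a model whose step 2 comparison succeeded (zip would misalign the lists)
    categorization_by_model = {cat['model']: cat for cat in categorization_results}
    summary_data = []
    for comp in comparison_results:
        cat = categorization_by_model.get(comp['model'])
        if cat is None:
            continue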
        row = {
            'model': comp['model'],
            'mathematically_equivalent': comp['result'].get('mathematically_equivalent', False),
            'total_errors': len(comp['result'].get('errors', [])),
            'overall_severity': cat.get('overall_severity', 'unknown'),
            'primary_error_category': cat.get('primary_error_category', 'unknown'),
            'quality_score': cat.get('extraction_quality_score', 0)
        }
        summary_data.append(row)
    
    summary_df = pd.DataFrame(summary_data)
    
    print("\n" + "="*60)
    print("ANALYSIS COMPLETE")
    print("="*60)
    print(f"Models analyzed: {len(summary_df)}")
    if len(summary_df) > 0:
        print(f"Perfect extractions: {summary_df['mathematically_equivalent'].sum()}")
        print(f"Average quality score: {summary_df['quality_score'].mean():.1f}/100")
    
    return features_df, comparison_results, categorization_results, summary_df
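
The returned `features_df` and `summary_df` are meant to be related to each other afterwards, for example via odds ratios between a binary feature and a binary error flag. A minimal, dependency-free sketch of that calculation is shown below. The function name `odds_ratio`, the Haldane-Anscombe correction of 0.5, and the sample flags are assumptions for illustration; they are not results from this notebook.

```python
def odds_ratio(feature_flags, error_flags, correction=0.5):
    """Odds ratio of an error given a binary feature, from paired booleans.

    Adds `correction` to every cell (Haldane-Anscombe) so that empty cells
    do not cause division by zero in small samples.
    """
    a = b = c = d = 0
    for has_feature, has_error in zip(feature_flags, error_flags):
        if has_feature and has_error:
            a += 1      # feature present, error present
        elif has_feature:
            b += 1      # feature present, no error
        elif has_error:
            c += 1      # feature absent, error present
        else:
            d += 1      # feature absent, no error
    a, b, c, d = (x + correction for x in (a, b, c, d))
    return (a * d) / (b * c)


# Made-up flags for six models: does the system use subscripts,
# and did the extraction show a subscript error?
has_subscripts = [True, True, True, False, False, False]
subscript_error = [True, True, False, False, False, True]
print(round(odds_ratio(has_subscripts, subscript_error), 2))  # 2.78
```

With only a handful of models, such ratios are very unstable, so they are better read as hints for where to look than as evidence.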

In [42]:
def get_models_for_analysis(correct_df, extracted_df, version=VERSION):
    """Get models that exist in both correct and extracted dataframes for a given version"""
    
    # Get models from correct equations
    correct_models = set(correct_df['model'].dropna().unique())
    
    # Get models from extracted equations (no version filtering needed)
    extracted_models = set(extracted_df['model'].dropna().unique())
    
    # Find intersection - models that exist in both
    common_models = correct_models.intersection(extracted_models)
    
    print(f"Models in correct_eqs: {len(correct_models)}")
    print(f"Models in extracted_eqs (from {version} file): {len(extracted_models)}")
    print(f"Common models to analyze: {len(common_models)}")
    
    if len(common_models) == 0:
        print("\nNo common models found!")
        print("Sample correct models:", list(correct_models)[:5])
        print("Sample extracted models:", list(extracted_models)[:5])
    
    return list(common_models)

#### Execution of the steps

In [43]:
# master csv for results
def append_to_master_results(analysis_results, version=VERSION, master_file='master_analysis_results.csv'):
    """Append current analysis results to a master CSV file"""
    # [Insert the full append_to_master_results function code here]

def create_analysis_summary_report(master_file='master_analysis_results.csv'):
    """Create a summary report from the master results file"""
    # [Insert the full create_analysis_summary_report function code here]

def run_and_append_analysis(correct_df, extracted_df, version=VERSION, models_to_analyze=None):
    """Run analysis and append to master file - convenience function"""
    # [Insert the full run_and_append_analysis function code here]

In [47]:
if __name__ == "__main__":
    # check API key
    if not os.getenv("OPENAI_API_KEY"):
        print("ERROR: OPENAI_API_KEY environment variable not set!")
        print("Set it using: export OPENAI_API_KEY='your-api-key-here'")
    else:
        print("OpenAI API key found!")
        
        print(f"\nAnalyzing models from version: {VERSION}")
        print("="*60)
        
        try:
            models_to_analyze = get_models_for_analysis(
                correct_eqs_df,
                extracted_eqs_df, 
                version=VERSION
            )
            
            if len(models_to_analyze) == 0:
                print("\nNo models to analyze. Checking data integrity...")
                print("\nSample from correct_eqs_df:")
                print(correct_eqs_df[['model', 'correct_eqs']].head())
                print("\nSample from extracted_eqs_df:")
                print(extracted_eqs_df[['model', 'extracted_eqs']].head())
            else:
                print(f"\nStarting Error Analysis Pipeline for {len(models_to_analyze)} models...")
                print("="*60)
                
                # Run the complete analysis directly
                features_df, comparison_results, categorization_results, summary_df = run_complete_analysis(
                    correct_eqs_df,
                    extracted_eqs_df,
                    version=VERSION,
                    models_to_analyze=models_to_analyze
                )
                
                print(f"\n✓ Analysis complete!")
                
                # Save results
                if SAVE_RESULTS:
                    features_df.to_csv(FEATURES_OUTPUT, index=False)
                    summary_df.to_csv(SUMMARY_OUTPUT, index=False)
                    
                    with open(COMPARISON_OUTPUT, 'w') as f:
                        json.dump(comparison_results, f, indent=2)
                    
                    with open(CATEGORIZATION_OUTPUT, 'w') as f:
                        json.dump(categorization_results, f, indent=2)
                    
                    print(f"\nResults saved to:")
                    print(f"  - {FEATURES_OUTPUT}")
                    print(f"  - {SUMMARY_OUTPUT}")
                    print(f"  - {COMPARISON_OUTPUT}")
                    print(f"  - {CATEGORIZATION_OUTPUT}")
                
                # Show summary
                if not summary_df.empty:
                    print("\nQuick Summary:")
                    print(summary_df[['model', 'mathematically_equivalent', 'total_errors', 
                                     'overall_severity', 'quality_score']].head(10))
                
        except Exception as e:
            print(f"\nError during analysis: {e}")
            import traceback
            traceback.print_exc()

OpenAI API key found!

Analyzing models from version: 001
Models in correct_eqs: 10
Models in extracted_eqs (from 001 file): 10
Common models to analyze: 10

Starting Error Analysis Pipeline for 10 models...
STEP 1: Extracting Features...

 Feature Extraction Summary
Total equations analyzed: 10

Complexity Score Distribution:
  Low (0-3): 0
  Medium (3-6): 7
  High (6-8): 3
  Very High (8-10): 0

Risk Categories:
risk_category
high         5
very_high    5
Name: count, dtype: int64

Top 5 Most Complex Equations:
                    model  overall_complexity_score risk_category
7  2024_dec_epi_1_model_B                  6.876132     very_high
6  2024_dec_epi_1_model_A                  6.807692     very_high
8  2024_dec_epi_1_model_C                  6.174147     very_high
3         BIOMD0000000958                  5.861461     very_high
4         BIOMD0000000960                  5.658621     very_high

Feature Statistics:
  Equations with Greek letters: 0 (0.0%)
  Equations with subscr

---
Other approaches considered and found unsuitable:
- **Reference validation against existing databases** is impractical because notation is inconsistent across papers (different symbols for the same variables), standardized databases are lacking, and these mathematical expressions are context-dependent. A more effective validation focuses on internal consistency, dimensional analysis, and alignment with the paper's semantic context rather than on comparisons with external databases.
>- others