Feature: Use GeoVerify system to compare mathematical equivalence #78

@DericHuynh

Description

GeoVerify (created, by its author's own account, to spit in Meta's face) looks like a strong mathematical-equivalence evaluator, and it seems it would be great to use math-verify's parsing alongside GeoVerify's equivalence engine.
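To make the proposal concrete, here is a minimal sketch of the combination being suggested. The issue doesn't pin down GeoVerify's API, so the `equivalence engine` step below is a hypothetical stand-in: a stdlib-only numeric probe that compares two expressions at random sample points (in practice you would call into GeoVerify here, and use math-verify's `parse` for the extraction step rather than raw strings).

```python
import math
import random

def numerically_equivalent(expr_a: str, expr_b: str,
                           trials: int = 20, tol: float = 1e-9) -> bool:
    """Stand-in for an equivalence engine: probe two single-variable
    expressions in `x` at random points and compare values.
    NOTE: this is an illustrative sketch, not GeoVerify's actual method."""
    # Restricted eval namespace: a few math functions, no builtins.
    allowed = {name: getattr(math, name)
               for name in ("sin", "cos", "exp", "log", "sqrt", "pi")}
    allowed["__builtins__"] = {}
    for _ in range(trials):
        env = dict(allowed, x=random.uniform(0.5, 2.0))
        try:
            a = eval(expr_a, env)
            b = eval(expr_b, env)
        except (ValueError, ZeroDivisionError):
            continue  # skip sample points outside either domain
        if abs(a - b) > tol * max(1.0, abs(a), abs(b)):
            return False
    return True

# Algebraically equal expressions agree at every sample point:
print(numerically_equivalent("(x + 1)**2", "x**2 + 2*x + 1"))  # True
# Distinct expressions are rejected:
print(numerically_equivalent("x + 1", "x + 2"))  # False
```

The division of labor is the point: a rule-based parser normalizes the model output, and a separate engine decides equivalence, so either half can be swapped out independently.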

It reports impressive performance, though it's not exactly an apples-to-apples comparison: the original benchmark is closed.

| Method | Parameters | Agreement (%) | Precision (%) | Recall (%) | F1 (%) |
| --- | --- | --- | --- | --- | --- |
| math-verify (rule-based) | 0 | 5.95 | 5.38 | 6.67 | 5.96 |
| general-verifier | 1.5B | 82.74 | 83.13 | 93.24 | 87.90 |
| CompassVerifier | 32B | 91.66 | 94.20 | 86.67 | 90.28 |
| Qwen3-4B (prompted) | 4B | 92.26 | 89.74 | 93.33 | 91.50 |
| Qwen3-14B (prompted) | 14B | 93.45 | 92.21 | 94.67 | 93.42 |
| o3 (prompted) | undisclosed | 94.05 | 93.33 | 93.33 | 93.33 |
| GPT-OSS-20B (prompted) | 3.6B | 94.64 | 95.83 | 92.00 | 93.88 |
| GPT-OSS-120B (prompted) | 5.1B | 95.24 | 97.18 | 92.00 | 94.52 |
| GeoVerify | 0 | 95.88 | 94.81 | 96.05 | 95.42 |

GeoVerify was made by Richard Aragon and is MIT-licensed.
