ReasonOps: Operator Discovery in LLM Chain-of-Thought
ReasonOps induces a compact vocabulary of seven discourse-level reasoning operators from 44,662 chain-of-thought traces across 12 thinking LLMs and 8 benchmarks, fully unsupervised, and uses them to identify the source model with near-perfect AUC and to predict answer correctness from partial reasoning traces.
Headline numbers
| Metric | Value |
|---|---|
| Corpus | 44,662 traces · 12 LLMs · 8 benchmarks |
| Operators | K=7 (Cohen's κ = 0.693–0.720 across three LLM judges) |
Each sentence's pivot is its first 3 lowercase alphabetic tokens. A pivot is kept if it appears in ≥100 traces across ≥3 datasets and all of its tokens are among the 2,000 most frequent corpus tokens. This yields 5,464 accepted pivots.
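The pivot filter above can be sketched in plain Python. This is a minimal illustration, not the repository's code: the function names, the regex-based sentence splitter, lowercasing before tokenizing, and dropping sentences with fewer than 3 tokens are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

# Assumption: sentences end at ., !, ? followed by whitespace, or at newlines.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+|\n+")
TOKEN = re.compile(r"[a-z]+")


def sentence_pivot(sentence, n_tokens=3):
    """Return the first n alphabetic tokens (after lowercasing) as a tuple, or None."""
    tokens = TOKEN.findall(sentence.lower())
    if len(tokens) < n_tokens:
        return None  # assumption: sentences that are too short yield no pivot
    return tuple(tokens[:n_tokens])


def extract_pivots(traces, min_traces=100, min_datasets=3, vocab_size=2000):
    """traces: iterable of (trace_id, dataset_name, text). Returns accepted pivots."""
    token_counts = Counter()
    pivot_traces = defaultdict(set)    # pivot -> trace ids containing it
    pivot_datasets = defaultdict(set)  # pivot -> datasets containing it
    for trace_id, dataset, text in traces:
        token_counts.update(TOKEN.findall(text.lower()))
        for sentence in SENTENCE_SPLIT.split(text):
            pivot = sentence_pivot(sentence)
            if pivot is not None:
                pivot_traces[pivot].add(trace_id)
                pivot_datasets[pivot].add(dataset)
    # Keep only pivots whose tokens all fall in the most frequent corpus tokens.
    vocab = {tok for tok, _ in token_counts.most_common(vocab_size)}
    return sorted(
        pivot
        for pivot in pivot_traces
        if len(pivot_traces[pivot]) >= min_traces
        and len(pivot_datasets[pivot]) >= min_datasets
        and all(tok in vocab for tok in pivot)
    )
```

The thresholds default to the values stated above (100 traces, 3 datasets, top 2,000 tokens); on a toy corpus they need to be lowered for anything to pass.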
Embedding
Each pivot embedded with intfloat/e5-small-v2 (384-dim, L2-normalized).
Clustering
KMeans with K ∈ {6..11}, 30 restarts. K chosen by maximizing Cohen's κ against an independent LLM judge. K=7 wins.
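The model-selection step can be illustrated with a small sketch of Cohen's κ and an argmax over candidate K. The setup is an assumption: it presumes that, for each K, the judge assigns every sampled pivot one of the same K labels used by the clustering, so the two label lists are directly comparable. `cohens_kappa` and `select_k` are hypothetical names.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 0.0  # degenerate case (one shared label); treated as 0 in this sketch
    return (observed - expected) / (1.0 - expected)


def select_k(cluster_labels_by_k, judge_labels_by_k):
    """Return (best_k, scores) where best_k maximizes kappa against the judge."""
    scores = {
        k: cohens_kappa(cluster_labels_by_k[k], judge_labels_by_k[k])
        for k in cluster_labels_by_k
    }
    best_k = max(scores, key=scores.get)
    return best_k, scores
```

Selecting K by agreement with an external judge, rather than by inertia or silhouette, favors partitions that a reader can name consistently, which seems to be the intent of the criterion described above.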
Annotation is a per-span nearest-centroid lookup. End-to-end discovery and annotation run in under 7 minutes on a single CPU core for the full corpus.
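A nearest-centroid lookup of this kind can be sketched as follows. The sketch assumes each span has already been mapped to an embedding vector and that the centroids come from the fitted KMeans model; it uses squared Euclidean distance, which matches how KMeans assigns points. The function names are illustrative.

```python
def nearest_centroid(vector, centroids):
    """Return the index of the centroid closest to vector (squared Euclidean)."""
    best_index, best_dist = -1, float("inf")
    for index, centroid in enumerate(centroids):
        dist = sum((v - c) ** 2 for v, c in zip(vector, centroid))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def annotate(span_vectors, centroids):
    """Map a sequence of span embeddings to a sequence of operator labels."""
    return [nearest_centroid(vec, centroids) for vec in span_vectors]
```

Because the lookup is a single pass over K centroids per span, annotation cost grows linearly with the number of spans, which is consistent with the CPU-only runtime reported above.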
Op-XGB
XGBoost on a 117-dim handcrafted operator feature vector (frequencies, quartile localization, bigram transitions, run lengths, first/last one-hots, entropy/length scalars) concatenated with an 8,000-feature anchor-phrase TF-IDF representation. Used for both correctness prediction and 12-class model identification.
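To make the feature families concrete, here is a hedged sketch that computes a subset of them from one operator label sequence: frequencies, bigram transitions, longest run per operator, first/last one-hots, and entropy/length scalars. It does not reproduce the 117-dim layout (quartile localization is omitted, and the exact normalizations are assumptions), and `operator_features` is a hypothetical name.

```python
import math
from collections import Counter


def operator_features(seq, k=7):
    """Build a flat feature list from a sequence of operator ids in range(k)."""
    n = len(seq)
    if n == 0:
        raise ValueError("empty operator sequence")
    counts = Counter(seq)
    freqs = [counts[op] / n for op in range(k)]

    # Bigram transition frequencies, flattened row-major (from_op * k + to_op).
    bigrams = [0.0] * (k * k)
    for a, b in zip(seq, seq[1:]):
        bigrams[a * k + b] += 1.0
    n_bigrams = max(n - 1, 1)
    bigrams = [x / n_bigrams for x in bigrams]

    # Longest consecutive run of each operator, normalized by sequence length.
    longest = [0] * k
    run, prev = 0, None
    for op in seq:
        run = run + 1 if op == prev else 1
        prev = op
        longest[op] = max(longest[op], run)
    runs = [r / n for r in longest]

    first = [1.0 if op == seq[0] else 0.0 for op in range(k)]
    last = [1.0 if op == seq[-1] else 0.0 for op in range(k)]
    entropy = -sum(p * math.log2(p) for p in freqs if p > 0)
    return freqs + bigrams + runs + first + last + [entropy, float(n)]
```

With k=7 this yields 79 values (7 + 49 + 7 + 7 + 7 + 2); the resulting rows could then be concatenated with the TF-IDF features and passed to a gradient-boosted classifier.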
OST
A ~800K-parameter Transformer encoder over the discrete operator label sequence (4 layers, d=128, 4 heads, pre-LayerNorm). Trained with a pairwise contrastive loss within each problem; supports early prediction natively.
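Two aspects of this description can be checked with a short sketch. First, the parameter count: assuming a feed-forward width of 4×d = 512 (not stated above) and counting only the encoder layers, a standard layer has roughly 198K parameters, so 4 layers land close to the quoted ~800K. Second, the loss: the exact contrastive form is not given here, so the sketch uses a common pairwise logistic loss over (correct, incorrect) trace pairs from the same problem as a stand-in. All names are illustrative.

```python
import math


def encoder_layer_params(d_model, d_ff):
    """Parameters in one standard Transformer encoder layer (weights + biases)."""
    attention = 4 * (d_model * d_model + d_model)  # Q, K, V, and output projections
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)                # two LayerNorms, scale + shift
    return attention + feed_forward + layer_norms


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return x + math.log1p(math.exp(-x)) if x > 0 else math.log1p(math.exp(x))


def pairwise_loss(correct_scores, incorrect_scores):
    """Mean logistic loss over all (correct, incorrect) pairs within one problem."""
    pairs = [(p, n) for p in correct_scores for n in incorrect_scores]
    if not pairs:
        return 0.0  # a problem with only correct or only incorrect traces adds nothing
    return sum(softplus(n - p) for p, n in pairs) / len(pairs)


# 4 layers at d=128 with the assumed d_ff=512; excludes embeddings and output head.
TOTAL_ENCODER_PARAMS = 4 * encoder_layer_params(128, 512)
```

Under these assumptions `TOTAL_ENCODER_PARAMS` is 793,088. Restricting pairs to the same problem means the model is rewarded only for ranking traces relative to each other on a shared question, which lines up with the within-problem evaluation metric.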
Evaluation
Within-problem AUC (WP-AUC), problem-level 5-fold cross-validation, both cross-dataset and within-dataset protocols.
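The evaluation protocol can be sketched under two assumptions that the text above does not spell out: WP-AUC is taken as the mean of per-problem AUCs, skipping problems whose traces are all correct or all incorrect, and folds are assigned by problem id so that no problem is split between train and test. The function names are hypothetical.

```python
from collections import defaultdict


def auc(scores, labels):
    """Pairwise AUC: P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None  # undefined without both classes
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def within_problem_auc(records):
    """records: iterable of (problem_id, score, is_correct). Mean per-problem AUC."""
    by_problem = defaultdict(lambda: ([], []))
    for problem_id, score, is_correct in records:
        by_problem[problem_id][0].append(score)
        by_problem[problem_id][1].append(int(is_correct))
    aucs = [auc(scores, labels) for scores, labels in by_problem.values()]
    aucs = [a for a in aucs if a is not None]
    return sum(aucs) / len(aucs) if aucs else None


def problem_folds(problem_ids, n_folds=5):
    """Assign each distinct problem id to one fold, round-robin over sorted ids."""
    return {pid: i % n_folds for i, pid in enumerate(sorted(set(problem_ids)))}
```

Scoring within each problem removes the effect of problem difficulty: a predictor cannot score well merely by learning that traces for hard problems tend to be wrong.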
OPENROUTER_API_KEY # all non-Claude models
ANTHROPIC_API_KEY # Claude Sonnet 4.5, Claude Haiku 4.5
About
Unsupervised discovery of meso-scale reasoning operators from LLM chain-of-thought traces: a reusable behavioral layer beyond tokens and final-answer accuracy. Seven operators emerge from 44k+ traces across 12 models and 8 datasets, enabling model fingerprinting, early correctness prediction, and a foundation for process supervision and agent monitoring.