The quality of the response (in terms of LLM accuracy) is part of the security model itself: governance defines exactly which models are served, and cross-validation then verifies that. The validation process itself requires some improvement, but the idea is to guarantee identical quality from all participants explicitly, not by feedback.
Same point: it is distributed only between workers that serve exactly the same model. A host can't choose to serve a different one by itself. The idea of measuring performance in general is a good direction, but I feel that the current proposal doesn't take into account how the chain works now.
Background
This discussion is the design document that precedes implementation. PR [#859]
(semantic cache) is the first implementation milestone; it exists to test the
infrastructure hypothesis, not to define the full system. The full system is defined here.
Per the review process @akup outlined on [#856] and [#802]: design first, then code.
This GiP is that design step. PR [#859] is scoped strictly to what this document
justifies in Phase 0.
PR [#859] introduces CacheQualityWeight, a reward for cache reuse. It is a working implementation, but deliberately scoped: it solves one part of a larger problem.
This discussion proposes what that larger problem is, how it connects to everything
already in the protocol, and what the full solution looks like.
The gap: quality has no protocol representation
The Gonka network has a rigorous economic model for compute: Proof-of-Compute
measures nonce generation, validates it across nodes, and converts it to epoch weight.
Every node understands, optimizes for, and is incentivized by PoC.
Quality of inference — whether a response was useful, accurate, timely, or
appropriate for the request type — has no equivalent protocol representation. It is
invisible to the chain.
This is not a criticism. It is a natural stage of development. PoC needed to land first
(see [#856], [#821]). But as the network grows, the absence of a quality signal creates
predictable failure modes:
it was supposed to.
CacheQualityWeight based solely on reuseCount would be gamed by routing, not by quality improvement.
GetRandomExecutor distributes traffic uniformly regardless of which node is better at which task. A node specializing in code generation gets
the same traffic as one optimized for translation.
whether the result was useful. Their experience does not improve the protocol.
on how to structure requests for best results, which model to use for which task, or
how to measure their own inference quality over time.
Measured from live network data (epochs 161–191, 2,503,595 inferences):
The quality gap is measurable. The improvement path is quantifiable.
What this GiP proposes
A governance-controlled, multi-dimensional quality measurement and routing framework,
built incrementally on the infrastructure PR [#859] provides.
It has two interlocking components:
1 — Quality Axis Registry (measurement)
Ten axes, each independently activatable via governance weights:
X-Inference-Feedback header
Composite score:
The registry is additive. Nothing breaks if a weight is zero. Axes activate when
governance decides the measurement is trustworthy enough to affect rewards.
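One way to read "additive" is a weighted combination of per-axis scores. The following is a minimal sketch under stated assumptions, not the formula from the design document: it assumes each axis score is normalized to [0, 1], that the composite is the weighted average over axes with a positive weight, and that an axis without a measurement is simply skipped. The function name and the example scores are hypothetical; the axis labels L2, L4 and L8 are borrowed from elsewhere in this discussion.

```python
from typing import Mapping


def composite_score(axis_scores: Mapping[str, float],
                    axis_weights: Mapping[str, float]) -> float:
    """Weighted average of axis scores (assumed to lie in [0, 1]).

    Axes whose governance weight is zero, or which have no measurement
    yet, do not contribute, so adding a new axis with weight 0 changes
    nothing.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for axis, weight in axis_weights.items():
        if weight <= 0.0 or axis not in axis_scores:
            continue  # inactive or unmeasured axis
        weighted_sum += weight * axis_scores[axis]
        total_weight += weight
    if total_weight == 0.0:
        return 0.0  # no active axes: no quality signal
    return weighted_sum / total_weight


# Hypothetical example: L8 is registered but not yet activated (weight 0).
scores = {"L2": 0.8, "L4": 0.4, "L8": 0.1}
weights = {"L2": 3.0, "L4": 1.0, "L8": 0.0}
example = composite_score(scores, weights)  # (3*0.8 + 1*0.4) / 4
```

Normalizing by the total active weight keeps the result in [0, 1] regardless of how many axes governance has switched on, which is one reason to prefer an average over a raw sum.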
2 — Semantic Inference Optimization (routing + developer experience)
As the protocol accumulates completed inferences, it builds a semantic map of execution
patterns: which task types succeed on which nodes, which models handle which request
archetypes best, what latency and completion rate look like per specialization.
This map enables two things:
Protocol side (DAPI):
GetQualityWeightedExecutor replaces GetRandomExecutor
QualityScore, not uniformly
higher CacheQualityWeight → more rewards → deeper specialization
Developer / participant side:
GET /v1/models/profiles: exposes node specialization centroids and quality scores
X-Suggested-Model, X-Task-Archetype, X-Quality-Score
trial and error
This is not prompt modification. The protocol does not change what users send.
It provides metadata: "for this type of request, here is what the network knows works."
Developers and clients act on that information voluntarily.
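The protocol-side change, replacing uniform selection with selection weighted by QualityScore, can be sketched as weighted random sampling. This is an illustrative sketch only: the function name mirrors GetQualityWeightedExecutor but is not the DAPI implementation, the `floor` parameter is a hypothetical addition so that new or low-scoring nodes still receive some traffic, and the candidate set is assumed to be already restricted to nodes serving the requested model.

```python
import random
from typing import Mapping, Optional


def get_quality_weighted_executor(quality_scores: Mapping[str, float],
                                  rng: Optional[random.Random] = None,
                                  floor: float = 0.01) -> str:
    """Pick one node with probability proportional to its quality score.

    quality_scores maps node id -> score for the nodes eligible to serve
    this request. Scores below `floor` are raised to `floor` (assumption)
    so that no eligible node is starved completely.
    """
    if not quality_scores:
        raise ValueError("no eligible executors")
    rng = rng or random.Random()
    nodes = sorted(quality_scores)  # stable order for reproducibility
    weights = [max(quality_scores[node], floor) for node in nodes]
    return rng.choices(nodes, weights=weights, k=1)[0]


# Hypothetical scores: node-a should receive most, but not all, traffic.
seeded = random.Random(42)
picks = [get_quality_weighted_executor({"node-a": 0.9, "node-b": 0.1}, seeded)
         for _ in range(10_000)]
share_a = picks.count("node-a") / len(picks)
```

Sampling rather than always picking the top node keeps lower-ranked nodes measurable: a node that never receives traffic can never improve its score.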
Why this is not on the edge of feasibility
The technical primitives are proven, deployed, and in production across the industry:
MLNodeEmbedder + cosine scan: exists
GetRandomExecutor → replaceable
StatsStorage.InferenceRecord: exists
InMemoryCacheStore: exists
The infrastructure from PR [#859] is sufficient for phases 0–4. Phases 5–7 require
additional endpoints and a client library (discussed below).
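For readers unfamiliar with the "cosine scan" primitive, the idea is a linear pass over stored embeddings that returns the closest entry above a similarity threshold. A minimal sketch, assuming embeddings are plain float sequences; the function names and the 0.95 threshold are hypothetical and not taken from PR [#859].

```python
import math
from typing import Iterable, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_cache_hit(query: Sequence[float],
                   entries: Iterable[Tuple[str, Sequence[float]]],
                   threshold: float = 0.95) -> Optional[str]:
    """Linear scan: return the key of the most similar entry at or above
    the threshold, or None on a cache miss. Cost is O(number of entries)."""
    best_key: Optional[str] = None
    best_sim = threshold
    for key, embedding in entries:
        sim = cosine_similarity(query, embedding)
        if sim >= best_sim:
            best_key, best_sim = key, sim
    return best_key


store = [("prompt-1", [1.0, 0.0]), ("prompt-2", [0.0, 1.0])]
hit = find_cache_hit([0.99, 0.05], store)   # close to prompt-1
miss = find_cache_hit([1.0, 1.0], store)    # equally far from both
```

The scan is linear in the number of stored entries, which is why an upper bound on the store size matters for per-request latency.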
Measured evidence
All numbers are reproducible from public endpoints. No private data used.
Network baseline (gonka.gg/api/public, epochs 161–191):
Live inference (proxy.gonka.gg, Qwen3-235B, 16 requests):
Specialization multiplier:
The economic case for specialization is mathematical, not speculative.
Routing simulation:
Hypotheses (all PROVEN from measured data):
Implementation roadmap
CacheQualityEpochSummary (fields 8–13: L4/L7/L8 axes)
QualityReporter
X-Inference-Feedback header parser in DAPI
GetQualityWeightedExecutor routing
StatsStorage
/v1/models/profiles + enrichment headers
(gonka-sdk, Python + TypeScript)
Phase 7 is a developer-facing product, not a protocol proposal. It belongs in a
separate repository under the Gonka Labs umbrella. The protocol (Phases 0–6) provides
the data and the endpoints; the SDK makes them ergonomic. Keeping them separate means:
on the same Phase 6 endpoints independently
Developer tooling strategy (Phase 7 scope)
The gap today: developers integrating Gonka do not have a standard pattern. They write
raw HTTP calls, pick models manually, have no signal on inference quality, and get no
guidance from the protocol on how to improve their workloads.
The SDK fills that gap using infrastructure the protocol will have after Phase 6.
What the SDK wraps
SDK design (TypeScript / Python)
TypeScript (Axios-based, OpenAI-SDK-compatible drop-in):
Python (httpx-based, drop-in for openai package):
What this achieves
X-Inference-Feedback, improving L4 data for all nodes
GetQualityWeightedExecutor → better routing → higher QualityScore → SDK reports better outcomes → loop
Relationship to existing open-source patterns
GonkaClient with autoRoute
/v1/models/profiles + X-Task-Archetype
GetQualityWeightedExecutor (Phase 4)
@gonka-labs/vercel-ai-adapter (Phase 7 stretch)
The Gonka SDK is not a novel architectural invention; it follows established patterns.
What makes it Gonka-specific is that the routing and quality signals come from the
on-chain quality registry, not a centralized service. That is the differentiator.
Proto extension (Phase 1)
Extend CacheQualityEpochSummary with additional axes:
Governance weight parameters (new fields in CacheQualityParams):
Scale constraint (honest)
InMemoryCacheStore currently has no entry limit. At mainnet scale (75K inferences/epoch, 384-dim embeddings,
MaxCacheAgeEpochs=10): peak ~1.15GB RAM and O(75K) cosine scan per request.
The max_cache_entries governance parameter (Phase 1) bounds this. With N=50,000: peak ~75MB, scan O(50K), which is acceptable on any modern node. The
EvictExpired call at each epoch boundary keeps the store bounded over time.
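The memory figures above can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes embeddings are stored as 32-bit floats and counts only the raw vectors (no keys, cached responses, or container overhead); both are assumptions, and the helper name is hypothetical.

```python
BYTES_PER_FLOAT32 = 4  # assumption: embeddings stored as float32


def embedding_store_bytes(entries: int, dims: int = 384,
                          bytes_per_value: int = BYTES_PER_FLOAT32) -> int:
    """Raw size of `entries` embeddings of `dims` dimensions each."""
    return entries * dims * bytes_per_value


# Unbounded case: 75K inferences/epoch retained for MaxCacheAgeEpochs=10.
unbounded_entries = 75_000 * 10
unbounded_bytes = embedding_store_bytes(unbounded_entries)

# Bounded case: max_cache_entries = 50,000.
bounded_bytes = embedding_store_bytes(50_000)

print(f"unbounded: {unbounded_entries:,} entries, {unbounded_bytes / 1e9:.2f} GB")
print(f"bounded:   50,000 entries, {bounded_bytes / 1e6:.1f} MB")
```

Under these assumptions the unbounded store comes to about 1.15 GB, matching the figure above, and the bounded store to 76.8 MB (about 73 MiB), in line with the quoted ~75MB.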
Related work
ContinuousPoC is now live infrastructure; quality measurement (L0: compute stability, CV=0.35 measured) sits on top of this foundation. Timing is deliberate:
PoC lands first, quality layer follows.
every inference, including cache HITs)
Atomicity fixes reduce false invalidations, improving baseline L2 score.
/admin/v1/cache/stats is Source A in the three-source cross-check triangle proposed there
organically through model specialization (M=1 per node)
directly quantify the root cause
L8 (latency consistency) baseline measurements
Open questions for the community
Weight governance: who proposes initial axis_weights? What's the amendment
process when a new axis is added?
L4 feedback incentive: should participants be rewarded (even nominally) for
submitting feedback? Without incentive, adoption will be low.
L5 developer webhook: opt-in or opt-out default? What's the privacy model
for outcome data?
SDK scope: should Phase 7 be a Gonka Labs project or a community-owned
repository? What's the governance model for the SDK itself?
max_cache_entries default: 50,000 is conservative. Is there a preferred bound
based on expected node hardware profiles?
ContinuousPoC integration: should ContinuousPoCEpochSummary.effective_poc_weight
be part of L0 axis calculation, or remain a separate PoC track? (@akup, @Mayveskii)
Full design document with scores, routing simulation, and scenario matrix:
docs/specs/inference-quality-protocol.md in the PR [#859] branch.