Skip to content

Fix Truncate Sequence Length#17

Merged
shivam-MBZUAI merged 2 commits into
mainfrom
hotfixes
Feb 8, 2026
Merged

Fix Truncate Sequence Length#17
shivam-MBZUAI merged 2 commits into
mainfrom
hotfixes

Conversation

@shivam-MBZUAI
Copy link
Copy Markdown
Contributor

@shivam-MBZUAI shivam-MBZUAI commented Feb 6, 2026

Summary by CodeRabbit

  • New Features

    • Immediate adaptive-threshold updates when a new leader emerges; API responses now include verification error codes; evaluations view shows more metadata (timings, status, errors).
  • Bug Fixes

    • Submission processing hardened to avoid marking failed attempts as processed; deterministic fatal errors short‑circuit remaining evaluations; downloads limited by chunked cap with tightened content detection and SSRF protections; stricter sequence/logits and gradient verification to catch invalid submissions.
  • Documentation

    • Docs and training/verification guidance updated to MFU-centric rules; package version bumped; console messages standardized to [OK]/[FAILED]/[WARNING]/[ERROR].

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Feb 6, 2026

Walkthrough

Switches validator scoring to MFU, tightens streamed downloads and verification (logits/sequence/gradients), adds deterministic fatal-error handling and safe DB checks to avoid marking failed URLs processed, and introduces immediate adaptive-threshold updates when a finished submission becomes the new leader.

Changes

Cohort / File(s) Summary
Validator core
neurons/validator.py
Switch scoring to MFU; stream downloads with a 500KB chunk cap and hardened HTML/JSON detection; wrap commitment processing to avoid marking failed URLs as processed; add per-evaluation fatal_error handling; change _finalize_submission(..., fatal_error=False) signature and add _check_and_update_threshold_if_new_leader(...); safer DB checks and more granular logging.
Verification & gradient capture
environments/templar/env.py, local_test/verify.py, local_test/train.py
Introduce GradientCapturingOptimizer with protected attributes; capture per-layer gradient vectors; replace cosine/norm checks with relative-error-based gradient verification; enforce final_logits presence/3D shape and sequence-length checks; tighten verification defaults and surface error_code in responses.
Error codes & fatality classification
src/crusades/core/exceptions.py, src/crusades/affinetes/runner.py
Add error codes NO_GRADIENTS_CAPTURED, MISSING_LOGITS, INVALID_LOGITS_SHAPE, SEQUENCE_TRUNCATION; add EvaluationErrorCode.is_fatal(...) and EvaluationResult.is_fatal() to classify deterministic/fatal failures used to short‑circuit evaluation runs.
Docs & README
README.md, docs/Validator.md
Rebrand TPS → MFU with updated MFU formula and token-count semantics; tighten trainer rules and logits/sequence requirements; update examples, Docker placeholders, and memory defaults; wording/formatting tweaks.
API / models
environments/templar/env.py
Add error_code to EvaluateResponse; tighten EvaluateRequest defaults (e.g., gradient_norm_ratio_max → 1.02) and propagate verification error_code through evaluation path.
Submission viewing & CLI messaging
scripts/view_submission.py, neurons/miner.py
Expand evaluations SELECT to include wall_time_seconds, success, error, created_at; standardize CLI prefixes to [OK]/[FAILED]/[WARNING]/[ERROR].
Storage / threshold logic
src/crusades/storage/database.py, src/crusades/chain/weights.py
Unconditional loading of previous winner in set_weights (remove _initialized guard); minor formatting changes; validator persists immediate threshold/leader updates when new leader emerges.
Version & packaging
pyproject.toml, src/crusades/__init__.py
Bump project/package version to 0.3.0 (project.version and __version__).
Logging / TUI / small refactors
src/crusades/logging.py, src/crusades/tui/client.py, src/crusades/storage/models.py
Formatting, docstring, and minor non-functional refactors; single-line env/formatter access and style tweaks.
Local test harness
local_test/*
Major rework of verification harness to match production-validator semantics: unified verify flow, stricter checks, and GradientCapturingOptimizer integration.

Sequence Diagram(s)

sequenceDiagram
    participant Validator as Validator
    participant Downloader as Downloader
    participant Evaluator as Evaluator
    participant DB as Database
    participant Threshold as ThresholdService

    Validator->>Downloader: fetch submission URL (streamed, 500KB chunk cap)
    Downloader-->>Validator: streamed content (HTML/JSON detection)
    Validator->>DB: check if URL already processed
    DB-->>Validator: not processed
    Validator->>Evaluator: run evaluation (multiple runs)
    Evaluator-->>Validator: per-run results (+/− fatal_error)
    alt fatal_error occurred
        Validator->>DB: mark submission FAILED
    else all runs complete
        Evaluator-->>Validator: final MFU score
        Validator->>DB: save finalized submission
        Validator->>Threshold: check/update if new leader (immediate)
        Threshold-->>DB: persist new leader/threshold
        Validator->>DB: mark URL processed
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

🐰
I counted FLOPs while nibbling a log,
I hopped through chunks and dodged each fog,
When a leader sprung, I thumped a beat anew,
I saved the score, kept the packets true,
A rabbit’s hop for MFU—hooray, we grew!

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 73.68% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'Fix Truncate Sequence Length' is vague and does not clearly convey the scope or nature of the changes. While the PR does address sequence length validation issues, the title lacks specificity about what is being fixed. Consider a more descriptive title such as 'Add sequence length validation and MFU-based scoring' or 'Implement deterministic error handling and adaptive threshold updates' to better reflect the substantial changes across multiple files.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch hotfixes

No actionable comments were generated in the recent review. 🎉


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@neurons/validator.py`:
- Around line 550-554: The call to
_check_and_update_threshold_if_new_leader(submission_id, median_mfu) can raise
and abort the evaluation loop (propagating through _finalize_submission →
evaluate_submissions); wrap this call in a try/except that catches exceptions
(at minimum Exception), logs the error along with submission_id and relevant
context (e.g., median_mfu and hparams.adaptive_threshold), and does not re-raise
so the loop continues; ensure logging uses the existing logger and make clear
this is best-effort post-finish work since the submission is already marked
FINISHED.
- Around line 576-581: The code falls back to current_block = 0 when
self.commitment_reader is None, which then gets stored via
update_adaptive_threshold and corrupts decay calculations in
get_adaptive_threshold; change the logic in the block where current_block is
computed (the current_block assignment near commitment_reader usage) to return
early (or skip calling update_adaptive_threshold/get_adaptive_threshold) if
self.commitment_reader is None or cannot provide a valid block number instead of
assigning 0, ensuring last_update_block is not persisted with a sentinel zero
value; locate and modify the code paths that call update_adaptive_threshold and
get_adaptive_threshold so they only run when a real current_block is available.
- Around line 606-612: The float conversion of previous_winner_score_str is
unguarded and can raise ValueError; update the code around previous_winner_id /
previous_winner_score_str / previous_winner_score (the block that calls
self.db.get_validator_state) to safely parse the string: attempt to convert
previous_winner_score_str to float inside a try/except (or use a safe parse
helper), on exception or if the string is empty/non-numeric set
previous_winner_score = 0.0 and log or warn about the malformed stored value so
the evaluation loop won’t crash.
🧹 Nitpick comments (1)
neurons/validator.py (1)

614-645: Possible duplicate threshold update due to async interleaving with the weight setter.

Between the reads on lines 606-611 and the writes on lines 644-645, there are multiple await points (lines 620, 635, 644, 645). If maybe_set_weights runs during any of these yields and performs its own leader-check + threshold-update using the same previous_winner_id/previous_winner_score keys, both code paths will compute and persist the same threshold update independently.

This is likely benign since both paths would compute the same result from the same inputs, but it's worth noting. If you want strict single-writer semantics, consider acquiring an async lock or performing the read-check-write in a single DB transaction.

Comment thread neurons/validator.py Outdated
Comment thread neurons/validator.py Outdated
Comment thread neurons/validator.py Outdated
@shivam-MBZUAI shivam-MBZUAI changed the title Update Weight Setting Logic Fix Truncate Sequence Length + Weight Setting Logic Feb 7, 2026
@shivam-MBZUAI shivam-MBZUAI changed the title Fix Truncate Sequence Length + Weight Setting Logic Fix Truncate Sequence Length Feb 8, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/Validator.md (1)

311-324: ⚠️ Potential issue | 🟡 Minor

Inconsistent min_gpu_memory_gb default: JSON says 64, table says 80.

The example JSON on line 311 sets "min_gpu_memory_gb": 64, but the table on line 324 lists the default as 80. Pick one value and use it consistently.

🤖 Fix all issues with AI agents
In `@environments/templar/env.py`:
- Around line 619-644: The _PROTECTED_ATTRS set itself can be overwritten by
miner code, so update the protection logic: add the string "_PROTECTED_ATTRS"
into the frozenset declared at class-level and change the check in __setattr__
to reference the class constant instead of the instance attribute (e.g. use
type(self)._PROTECTED_ATTRS or self.__class__._PROTECTED_ATTRS) so the guard
cannot be bypassed by assigning to self._PROTECTED_ATTRS; keep the rest of
__init__ behavior using object.__setattr__ unchanged.
- Around line 765-791: The default for norm_ratio_max is inconsistent with the
docstring: change the parameter default norm_ratio_max from 1.02 to 1.01 so
relative_error_threshold = norm_ratio_max - 1.0 yields 0.01 (1%), and update the
nearby example comments that reference norm_ratio_max values (e.g., the
"norm_ratio_max=1.01 -> 1% relative error" comment) to remain accurate; ensure
the docstring line "Production default: 0.01 (1% relative error)" stays as-is
and that the symbols norm_ratio_max and relative_error_threshold reflect the new
default.

In `@neurons/validator.py`:
- Around line 679-706: The threshold update is happening before persisting the
winner identity, risking duplicate threshold bumps if subsequent
set_validator_state("previous_winner_id", ...) or
set_validator_state("previous_winner_score", ...) fail; change the order so the
winner identity and score are saved to the database first (via
self.db.set_validator_state calls) before calling
self.db.update_adaptive_threshold, or, if supported, perform both actions inside
a single atomic DB transaction using self.db (e.g., a transactional method or
begin/commit API) so update_adaptive_threshold and the two set_validator_state
updates succeed or roll back together.
🧹 Nitpick comments (5)
src/crusades/__init__.py (1)

12-12: Remove the unused noqa directive.

Ruff reports E402 is not triggered here, making # noqa: E402 unnecessary. The import follows only module-level assignments, which doesn't violate E402.

Suggested fix
-from crusades.logging import LOKI_URL, setup_loki_logger  # noqa: E402
+from crusades.logging import LOKI_URL, setup_loki_logger
src/crusades/core/exceptions.py (1)

62-87: Consider composing is_fatal from is_verification_failure to avoid duplicating the verification-failure set.

If a new verification code is added, both is_verification_failure and is_fatal must be updated in lockstep. Composing avoids that risk:

Suggested refactor
     `@classmethod`
     def is_fatal(cls, code: "EvaluationErrorCode") -> bool:
         """Check if error is fatal/deterministic (no point retrying).
 
         Fatal errors will fail the same way on every retry because
         they are caused by the miner's code logic, not transient issues.
         """
-        return code in {
-            # Code validation errors - code won't magically fix itself
-            cls.NO_CODE,
-            cls.SYNTAX_ERROR,
-            cls.MISSING_INNER_STEPS,
-            cls.INVALID_RETURN_TYPE,
-            # All verification failures - deterministic security checks
-            cls.INSUFFICIENT_TRAINABLE_PARAMS,
-            cls.INSUFFICIENT_PARAMS_CHANGED,
-            cls.GRADIENT_COVERAGE_FAILED,
-            cls.GRADIENT_NORM_RATIO_FAILED,
-            cls.GRADIENT_COSINE_FAILED,
-            cls.LOSS_MISMATCH,
-            cls.TOKEN_COUNT_MISMATCH,
-            cls.NO_GRADIENTS_CAPTURED,
-            cls.MISSING_LOGITS,
-            cls.INVALID_LOGITS_SHAPE,
-            cls.SEQUENCE_TRUNCATION,
-        }
+        # Code validation errors - code won't magically fix itself
+        code_validation = code in {
+            cls.NO_CODE,
+            cls.SYNTAX_ERROR,
+            cls.MISSING_INNER_STEPS,
+            cls.INVALID_RETURN_TYPE,
+        }
+        # All verification failures are also fatal (deterministic)
+        return code_validation or cls.is_verification_failure(code)
neurons/validator.py (2)

255-264: Use logger.exception instead of logger.error to capture the traceback.

When catching a broad Exception, the traceback is valuable for debugging submission creation failures. logger.exception automatically includes it.

Proposed fix
                     except Exception as e:
-                        logger.error(f"Failed to create submission for {code_url[:60]}: {e}")
+                        logger.exception(f"Failed to create submission for {code_url[:60]}: {e}")
                         # Don't record URL - will retry on next cycle

548-561: Fatal error finalization: failed_evals[-1] may pick up a failure from a different evaluator.

get_evaluations(submission_id) returns evaluations from all evaluators. If another validator previously recorded a non-fatal failure, failed_evals[-1] could surface that error message instead of the actual fatal error from the current evaluator. Consider filtering by self.hotkey:

Proposed fix
         if fatal_error:
             all_evals = await self.db.get_evaluations(submission_id)
-            failed_evals = [e for e in all_evals if not e.success and e.error]
+            failed_evals = [
+                e for e in all_evals
+                if not e.success and e.error and e.evaluator_hotkey == self.hotkey
+            ]
             error_msg = failed_evals[-1].error if failed_evals else "Fatal error in evaluation"
environments/templar/env.py (1)

848-866: Consider using strict=True in zip(ref_vecs, cand_vecs) (line 853).

The length mismatch is already checked at line 842, so strict=True would be redundant at runtime, but it documents the invariant and would catch bugs if the pre-check is ever refactored away. Minor nit flagged by Ruff (B905).

Proposed fix
-        for ref_layer, cand_layer in zip(ref_vecs, cand_vecs):
+        for ref_layer, cand_layer in zip(ref_vecs, cand_vecs, strict=True):

Comment thread environments/templar/env.py
Comment on lines +765 to +791
cosine_min: float = 0.8, # Legacy, unused - kept for signature compat
norm_ratio_min: float = 0.5, # Legacy, unused - kept for signature compat
norm_ratio_max: float = 1.02, # Relative error threshold: 1.02 = 2% max error
) -> tuple[bool, str | None, dict]:
"""Verify candidate gradients match reference using cosine similarity and norm ratio.
"""Verify candidate gradients match reference using relative error |g - g_truth| / |g_truth|.

This is a direct comparison that catches any deviation including truncation,
layer freezing, and step skipping. The threshold should be near numerical
precision (bfloat16 has ~0.8% relative error for accumulated ops).

Args:
reference_grad: Reference implementation gradients
candidate_grad: Miner's implementation gradients
cosine_min: Minimum cosine similarity required
norm_ratio_min: Minimum ratio of candidate/reference gradient norm
norm_ratio_max: Maximum ratio of candidate/reference gradient norm
cosine_min: (legacy, kept for signature compat) Not used in new check
norm_ratio_min: (legacy, kept for signature compat) Not used in new check
norm_ratio_max: Maximum allowed relative error |g - g_truth| / |g_truth|
Repurposed: norm_ratio_max is used as the relative error threshold.
Production default: 0.01 (1% relative error)

Returns:
Tuple of (success, error_message, details)
"""
# Repurpose norm_ratio_max as the relative error threshold
# This keeps backward compat with hparams without adding a new field
# norm_ratio_max=1.1 means 10% relative error allowed (old default)
# norm_ratio_max=1.01 means 1% relative error allowed (tight)
relative_error_threshold = norm_ratio_max - 1.0 # e.g., 1.01 -> 0.01
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Docstring says "Production default: 0.01 (1%)" but the code default is 1.02 → 2%.

Line 767 declares norm_ratio_max: float = 1.02, yielding a 2% threshold via norm_ratio_max - 1.0. The docstring at line 782 claims "Production default: 0.01 (1% relative error)". One of them is stale — update whichever is incorrect.

🧰 Tools
🪛 Ruff (0.14.14)

[warning] 765-765: Unused function argument: cosine_min

(ARG001)


[warning] 766-766: Unused function argument: norm_ratio_min

(ARG001)

🤖 Prompt for AI Agents
In `@environments/templar/env.py` around lines 765 - 791, The default for
norm_ratio_max is inconsistent with the docstring: change the parameter default
norm_ratio_max from 1.02 to 1.01 so relative_error_threshold = norm_ratio_max -
1.0 yields 0.01 (1%), and update the nearby example comments that reference
norm_ratio_max values (e.g., the "norm_ratio_max=1.01 -> 1% relative error"
comment) to remain accurate; ensure the docstring line "Production default: 0.01
(1% relative error)" stays as-is and that the symbols norm_ratio_max and
relative_error_threshold reflect the new default.

Comment thread neurons/validator.py Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/Validator.md (1)

311-324: ⚠️ Potential issue | 🟡 Minor

Inconsistent min_gpu_memory_gb values across docs and code.

Three different defaults appear:

  • Line 311 (JSON example in this doc): 64
  • Line 324 (settings table in this doc): 80
  • src/crusades/affinetes/runner.py line 181 (code default): 40

Pick one canonical default and align all three locations to avoid operator confusion.

🧹 Nitpick comments (7)
src/crusades/storage/database.py (1)

411-415: Good guard against division by zero.

One consideration: returning state.current_threshold (undecayed) when decay_interval_blocks <= 0 means a misconfigured parameter silently freezes the threshold at its last-set value, which could be much higher than base_threshold. Returning base_threshold instead would be a safer fallback. Either way, consider logging a warning here to surface the bad configuration.

💡 Optional: safer fallback + warning
             if decay_interval_blocks <= 0:
-                return state.current_threshold
+                import logging
+                logging.getLogger(__name__).warning(
+                    "decay_interval_blocks=%d is invalid, falling back to base_threshold",
+                    decay_interval_blocks,
+                )
+                return base_threshold
src/crusades/__init__.py (1)

12-12: Remove unused # noqa: E402 directive.

Ruff reports this noqa comment is unnecessary (the E402 rule is not being triggered here). Keeping unused suppressions adds noise and can mask future real violations.

Suggested fix
-from crusades.logging import LOKI_URL, setup_loki_logger  # noqa: E402
+from crusades.logging import LOKI_URL, setup_loki_logger
src/crusades/affinetes/runner.py (1)

847-851: Consider using logger.exception instead of manual traceback formatting.

logger.exception() automatically captures the current exception's traceback, making the explicit traceback.format_exc() call redundant.

Suggested fix
         except Exception as e:
-            logger.error("[BASILICA] Failed to deploy!")
-            logger.error(f"   Error: {e}")
-            logger.error(traceback.format_exc())
+            logger.exception("[BASILICA] Failed to deploy: %s", e)
             return None
environments/templar/env.py (1)

835-897: Relative error check is a sound replacement for separate cosine + norm checks.

A single |g - g_truth| / |g_truth| metric detects all classes of gradient deviation (truncation, freezing, skipping) in one pass. The incremental layer-by-layer computation avoids large memory allocations.

Minor: the zip() at line 853 could use strict=True since length equality is pre-checked at line 842, but this is a nit.

local_test/verify.py (3)

204-217: batch_size parameter is unused.

The batch_size parameter at line 209 is accepted but never referenced in the function body. Either remove it or use it (e.g., for a batch-dimension check on logits).

Proposed fix
 def verify_all(
     reference,
     candidate,
     expected_tokens: int,
     expected_seq_len: int,
-    batch_size: int,
     initial_state: dict,
     model,
     reference_grad: GradientInfo,
     candidate_grad: GradientInfo,
     max_loss_difference: float = 0.3,
     min_changed_ratio: float = 0.8,
     gradient_norm_ratio_max: float = 1.02,
 ) -> bool:

And update the call site at line 593:

     passed = verify_all(
         reference=reference,
         candidate=candidate,
         expected_tokens=expected_tokens,
         expected_seq_len=expected_seq_len,
-        batch_size=batch_size,
         initial_state=initial_state,

41-48: GradientInfo field names diverge from production (vectors vs grad_vector).

In environments/templar/env.py, the field is named grad_vector, while here it's vectors. Though they're separate dataclasses, keeping the naming consistent would reduce confusion for developers working across both files.


51-80: Same _PROTECTED_ATTRS bypass as production code — "_PROTECTED_ATTRS" not in the set.

For a local test tool the security impact is minimal, but keeping this in sync with the production fix ensures miners get accurate test results. Apply whatever fix is adopted in environments/templar/env.py.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
scripts/view_submission.py (1)

121-128: ⚠️ Potential issue | 🟡 Minor

Potential TypeError if numeric fields are NULL.

If mfu, tokens_per_second, or wall_time_seconds are NULL in the database, the :.2f / :.0f format specifiers on lines 124-125 will raise a TypeError. Consider adding fallback defaults similar to how final_score is handled on line 56.

Proposed fix
             for i, e in enumerate(evals, 1):
                 status = "PASS" if e["success"] else "FAIL"
+                mfu = e['mfu'] or 0.0
+                tps = e['tokens_per_second'] or 0.0
+                wall_time = e['wall_time_seconds'] or 0.0
                 print(
-                    f"  #{i}: MFU={e['mfu']:.2f}% TPS={e['tokens_per_second']:.0f} "
-                    f"tokens={e['total_tokens']} time={e['wall_time_seconds']:.2f}s [{status}]"
+                    f"  #{i}: MFU={mfu:.2f}% TPS={tps:.0f} "
+                    f"tokens={e['total_tokens']} time={wall_time:.2f}s [{status}]"
                 )
docs/Validator.md (1)

196-203: ⚠️ Potential issue | 🟡 Minor

Inconsistency between documented hparams example and code defaults.

The example hparams.json shows "gradient_norm_ratio_max": 1.1 (10% relative error), but both environments/templar/env.py and local_test/verify.py default to 1.02 (2% relative error). Similarly, "max_loss_difference": 0.3 in the example config matches verify.py but not the code default in env.py (0.5). Validators using these example values will get different behavior than the code defaults suggest.

Consider aligning the example config with the current code defaults (or vice versa) to avoid confusion.

🤖 Fix all issues with AI agents
In `@local_test/verify.py`:
- Around line 51-80: The __setattr__ protection can be bypassed by reassigning
the instance attribute _PROTECTED_ATTRS; change the check in
GradientCapturingOptimizer.__setattr__ to reference the class-level constant
rather than the instance (e.g., use GradientCapturingOptimizer._PROTECTED_ATTRS
or type(self)._PROTECTED_ATTRS) so assignments like optimizer._PROTECTED_ATTRS =
... no longer disable protection, and keep the existing _initialized guard
intact when performing the check.
- Around line 569-581: The check for required attributes misses final_loss
causing a possible AttributeError when constructing InnerStepsResult; update the
validation where ok is computed (currently checking miner_result.final_logits
and miner_result.total_tokens) to also require hasattr(miner_result,
"final_loss") before creating candidate, and ensure the error/exit path remains
the same so the printed failure message matches the actual checked attributes
(references: miner_result, ok, InnerStepsResult, candidate).
🧹 Nitpick comments (3)
src/crusades/__init__.py (1)

12-12: Unused noqa directive.

Ruff reports E402 is not enabled, so the # noqa: E402 comment is unnecessary.

Proposed fix
-from crusades.logging import LOKI_URL, setup_loki_logger  # noqa: E402
+from crusades.logging import LOKI_URL, setup_loki_logger
src/crusades/affinetes/runner.py (1)

847-851: Consider logger.exception instead of manual traceback formatting.

logger.exception() automatically includes the traceback, making line 850 redundant. This is also flagged by Ruff (TRY400).

Proposed fix
         except Exception as e:
-            logger.error("[BASILICA] Failed to deploy!")
-            logger.error(f"   Error: {e}")
-            logger.error(traceback.format_exc())
+            logger.exception("[BASILICA] Failed to deploy: %s", e)
             return None
local_test/verify.py (1)

204-217: Unused batch_size parameter in verify_all.

The batch_size parameter (line 209) is accepted but never used inside verify_all. This was also flagged by static analysis (Ruff ARG001). Consider removing it if unneeded, or adding a _ prefix if reserved for future use.

Comment thread local_test/verify.py
Comment thread local_test/verify.py
@shivam-MBZUAI shivam-MBZUAI force-pushed the hotfixes branch 4 times, most recently from a7795dd to 2b9cfc4 Compare February 8, 2026 15:12
Security:
- Replace cosine sim + norm ratio with direct gradient relative error |g - g_truth| / |g_truth|
- Protect GradientCapturingOptimizer attributes from miner tampering (__setattr__ guard)
- Fix default gradient threshold from 100% to 2% relative error
- Add logits None check, 3D shape check, and sequence length check
- Use expected_tokens instead of miner-reported tokens for MFU calculation
- Chunked download to prevent OOM from malicious large responses
- Require loss to be positive (cross-entropy > 0)
- Mandatory loss comparison (no silent skip when reference unavailable)

Validator:
- Add fatal error early termination (skip remaining eval runs on deterministic failures)
- Add immediate adaptive threshold update after new leader finishes
- Fix URL tracked even when DB save fails (submissions silently lost)
- Fix DB error check fallthrough that could create duplicate submissions
- Fix fatal error classification: only code-level checks are fatal, not data-dependent ones

Verification:
- Add new error codes: MISSING_LOGITS, INVALID_LOGITS_SHAPE, SEQUENCE_TRUNCATION, NO_GRADIENTS_CAPTURED
- Add is_fatal() classifier (code-level vs data-dependent errors)
- Add is_fatal() method on EvaluationResult
- Add error_code to EvaluateResponse HTTP API

Docs & cleanup:
- Fix MFU formula in Validator.md (was using wrong per-token calculation)
- Update verification checks table (cosine+norm -> relative error)
- Fix README: TPS -> MFU, correct repo URL, add verification rules for miners
- Fix pyproject.toml version mismatch (0.1.0 -> 0.2.2)
- Update train.py with clear verification rules and legitimate optimizations
- Rewrite verify.py to match production validator (uses GradientCapturingOptimizer)
- Remove all emoji from codebase (use plain text labels)
- Bump version to 0.2.2
- ruff check + format across entire repo

Co-authored-by: Cursor <cursoragent@cursor.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
scripts/view_submission.py (1)

121-128: ⚠️ Potential issue | 🟡 Minor

Potential TypeError if mfu or wall_time_seconds is NULL in the database.

e['mfu'] and e['wall_time_seconds'] are formatted with :.2f / :.0f, which will raise TypeError on None. Consider defaulting, e.g. (e['mfu'] or 0.0).

🛡️ Suggested fix
-                    f"  #{i}: MFU={e['mfu']:.2f}% TPS={e['tokens_per_second']:.0f} "
-                    f"tokens={e['total_tokens']} time={e['wall_time_seconds']:.2f}s [{status}]"
+                    f"  #{i}: MFU={e['mfu'] or 0.0:.2f}% TPS={e['tokens_per_second'] or 0:.0f} "
+                    f"tokens={e['total_tokens'] or 0} time={e['wall_time_seconds'] or 0.0:.2f}s [{status}]"
docs/Validator.md (2)

196-211: ⚠️ Potential issue | 🟡 Minor

Stale hparams example: gradient_cosine_min is no longer used, and min_gpu_memory_gb conflicts with the table.

Two inconsistencies in this example JSON block:

  1. gradient_cosine_min: 0.95 (line 200) is documented here but the verification table above has replaced the cosine check with Gradient Relative Error (gradient_norm_ratio_max - 1.0). This will confuse operators.
  2. min_gpu_memory_gb: 64 (line 311 in Basilica config) conflicts with the default of 80 shown in the table at line 324.
Proposed fix for the verification block
     "verification": {
         "max_loss_difference": 0.3,
         "min_params_changed_ratio": 0.8,
-        "gradient_cosine_min": 0.95,
-        "gradient_norm_ratio_min": 0.9,
-        "gradient_norm_ratio_max": 1.1
+        "gradient_norm_ratio_max": 1.02
     },

306-315: ⚠️ Potential issue | 🟡 Minor

min_gpu_memory_gb value inconsistent with the defaults table below.

The JSON example shows 64 but the table at line 324 states the default is 80.

Proposed fix
-        "min_gpu_memory_gb": 64,
+        "min_gpu_memory_gb": 80,
🧹 Nitpick comments (6)
environments/templar/env.py (1)

860-860: Consider adding strict=True to zip() for defense-in-depth.

The length check on line 849 already guards this, but strict=True makes the invariant self-documenting and catches any future refactoring that removes the earlier check.

-        for ref_layer, cand_layer in zip(ref_vecs, cand_vecs):
+        for ref_layer, cand_layer in zip(ref_vecs, cand_vecs, strict=True):
src/crusades/__init__.py (1)

3-12: Unnecessary noqa: E402 directive.

Ruff reports E402 is not enabled in your lint config, so this suppression is dead. Remove it to avoid confusion.

🧹 Suggested fix
-from crusades.logging import LOKI_URL, setup_loki_logger  # noqa: E402
+from crusades.logging import LOKI_URL, setup_loki_logger
src/crusades/affinetes/runner.py (1)

847-851: Consider logger.exception instead of logger.error + manual traceback.format_exc().

Inside the except block, logger.exception(...) automatically attaches the traceback, making the explicit traceback.format_exc() call on line 850 redundant.

🧹 Suggested fix
-            logger.error("[BASILICA] Failed to deploy!")
-            logger.error(f"   Error: {e}")
-            logger.error(traceback.format_exc())
+            logger.exception("[BASILICA] Failed to deploy: %s", e)
neurons/validator.py (1)

683-684: Minor atomicity gap between the two set_validator_state calls.

If line 683 succeeds but line 684 fails (unlikely but possible), the DB will have previous_winner_id set to this submission but previous_winner_score will be stale. On the next cycle, line 676 will see the ID as "already handled" and skip, leaving the old score persisted. The threshold update would have been skipped too, so no corruption occurs — but the leader's score in the DB will be wrong until a new leader emerges.

Given that both are simple upserts and failure between them is extremely unlikely, this is acceptable. A transactional batch write would be the ideal fix if the DB supports it.

local_test/verify.py (2)

209-222: batch_size parameter is unused in verify_all.

Ruff ARG001 confirms batch_size is never referenced in the function body. Either remove it or use it (e.g., to cross-check logits batch dimension).

Option A: Remove unused parameter
 def verify_all(
     reference,
     candidate,
     expected_tokens: int,
     expected_seq_len: int,
-    batch_size: int,
     initial_state: dict,
     model,
     reference_grad: GradientInfo,
     candidate_grad: GradientInfo,
     max_loss_difference: float = 0.3,
     min_changed_ratio: float = 0.8,
     gradient_norm_ratio_max: float = 1.02,
 ) -> bool:

(Also remove the argument at the call site, line 602.)

Option B: Use it to validate logits batch dimension
     # CHECK: Logits shape is 3D
     check_num += 1
     print(f"\n[CHECK {check_num}] Logits shape must be 3D (batch, seq, vocab)")
     if candidate.final_logits is not None:
         shape = candidate.final_logits.shape
         print(f"  Shape: {tuple(shape)}")
         if len(shape) != 3:
             print(f"  [FAILED] Expected 3D tensor, got {len(shape)}D")
             all_passed = False
+        elif shape[0] != batch_size:
+            print(f"  [FAILED] Batch dim is {shape[0]}, expected {batch_size}")
+            all_passed = False
         else:
             print("  [PASSED]")

355-392: Gradient relative error implementation is correct.

The layer-by-layer accumulation of diff_norm_sq and ref_norm_sq avoids allocating a single concatenated vector — consistent with the memory optimization noted in GradientInfo's docstring.

One minor nit: zip(ref_vecs, cand_vecs) at line 360 could use strict=True for extra safety, though the length check at line 357 already guards against mismatched lengths.

Nitpick: add strict=True
-        for ref_layer, cand_layer in zip(ref_vecs, cand_vecs):
+        for ref_layer, cand_layer in zip(ref_vecs, cand_vecs, strict=True):

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants