
(feat): change-scoring-to-1-10-format #473

Merged
swaroopvarma1 merged 1 commit into juspay:release from narsimhaReddyJuspay:change-scoring-to-1-10-format
Jan 7, 2026

Conversation

@narsimhaReddyJuspay (Contributor) commented Jan 7, 2026

  • Changed scoring from a 0-1 format to a 1-10 format
  • Moved configs from static to dynamic

Summary by CodeRabbit

  • New Features

    • Evaluator configuration now loads dynamically at runtime, allowing updates without redeploying.
  • Improvements

    • Score monitoring uses per-evaluator configurable thresholds with sensible defaults instead of only zero-score detection.
    • Alerts now report actual score values and label issues as "BELOW THRESHOLD" for clearer notification context.


@narsimhaReddyJuspay (Contributor, Author) commented:

@coderabbitai please review

coderabbitai bot commented Jan 7, 2026

@narsimhaReddyJuspay, I'll review the changes in this PR. Based on the objectives, I'll pay special attention to:

  • The scoring format change from 0-1 to 1-10
  • The migration of configs from static to dynamic
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai bot commented Jan 7, 2026

Walkthrough

Static LANGFUSE evaluators config was removed and replaced by an async dynamic accessor. Score monitoring moved from zero-score detection to per-evaluator threshold checks (thresholds parsed from the dynamic config), and task initialization now awaits the async evaluator config before registering monitoring tasks.

Changes

Dynamic config (new behavior): app/core/config/dynamic.py
  Added a public async LANGFUSE_EVALUATORS() -> dict[str, int] that reads a comma-separated string of name or name:threshold entries, parses the thresholds (defaulting to 5 when missing or unparseable), and returns a mapping of evaluator name to integer threshold.

Static config removal: app/core/config/static.py
  Removed the static LANGFUSE_EVALUATORS constant and its parsing logic; there is no replacement in the static config.

Score monitoring logic: app/services/langfuse/tasks/score_monitor/score.py
  Replaced _is_zero_score(score) with _is_below_threshold(score, threshold) and updated the monitoring flow to fetch the async evaluator thresholds, collect per-evaluator failing scores (those below threshold), update alert aggregation, and change the alert text and labels to "BELOW THRESHOLD" with actual score values.

Task initialization: app/services/langfuse/tasks/task.py
  Switched the import to the async LANGFUSE_EVALUATORS from the dynamic config and updated initialization to await LANGFUSE_EVALUATORS(); task registration now depends on the awaited evaluator result.
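Based on the behavior summarized above, the name:threshold parsing can be sketched roughly as follows. This is a minimal standalone sketch, assuming the described string format; the helper name and the surrounding live-config fetch are hypothetical, not the actual implementation:

```python
def parse_evaluator_config(raw: str) -> dict[str, int]:
    """Parse a comma-separated string of 'name' or 'name:threshold'
    entries into a mapping of evaluator name -> integer threshold.

    Missing or unparseable thresholds fall back to the default of 5,
    mirroring the behavior described in the walkthrough."""
    result: dict[str, int] = {}
    for entry in raw.split(","):
        entry = entry.strip()
        if not entry:  # skip empty segments (e.g. trailing commas)
            continue
        name, _, threshold_str = entry.partition(":")
        try:
            threshold = int(threshold_str)
        except ValueError:
            threshold = 5  # default on missing/invalid threshold
        result[name.strip()] = threshold
    return result
```

For example, `parse_evaluator_config("accuracy:7,tone")` would map `accuracy` to 7 and `tone` to the default 5.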

Sequence Diagram(s)

sequenceDiagram
    participant TaskInit as Task Initialization
    participant DynCfg as Dynamic Config (async)
    participant Store as Redis/Storage
    participant Monitor as ScoreMonitor

    TaskInit->>DynCfg: await LANGFUSE_EVALUATORS()
    DynCfg->>Store: fetch LANGFUSE_EVALUATORS string
    Store-->>DynCfg: raw string
    DynCfg-->>TaskInit: dict<evaluator,threshold>

    alt evaluators present
        TaskInit->>Monitor: register monitoring task
        loop periodic check
            Monitor->>DynCfg: await LANGFUSE_EVALUATORS()
            DynCfg->>Store: fetch evaluators & thresholds
            Store-->>DynCfg: data
            DynCfg-->>Monitor: dict<evaluator,threshold>
            Monitor->>Monitor: fetch scores per evaluator
            Monitor->>Monitor: compare each score < threshold
            rect rgb(220,245,220)
            Note over Monitor: aggregate failing scores per evaluator
            end
            Monitor->>Monitor: trigger alerts if counts exceed limits
        end
    else none found
        TaskInit->>TaskInit: skip registration
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 Redis hums a tuneful, async beat,

Evaluators sorted, thresholds neat,
Zero is gone — now thresholds rule the day,
Monitor wakes, checks scores the new way,
Hooray for alerts that behave and play! 🎉

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Title check ⚠️ Warning: The title describes changing the scoring format from 0-1 to 1-10, but the actual changes involve migrating configs from static to dynamic and implementing threshold-based alerting; the 1-10 format is only part of the implementation, not the primary change. Resolution: revise the title to reflect the main change, e.g. "Migrate Langfuse evaluator config to dynamic and implement threshold-based scoring alerts", since the config migration is a major structural change alongside the scoring format update.

✅ Passed checks (2 passed)

  • Description check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Docstring coverage ✅ Passed: Docstring coverage is 100.00%, which meets the required threshold of 80.00%.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
app/services/langfuse/tasks/score_monitor/score.py (2)

4-6: Outdated docstring.

The module docstring still references "failures (score = 0)" but the logic now uses configurable thresholds on a 1-10 scale.

📝 Suggested fix
 """
 Langfuse Score Monitoring Service
 
 This module provides functionality to poll Langfuse for LLM-as-a-judge scores
-and identify failures (score = 0) for alerting purposes.
+and identify failures (scores below configurable thresholds) for alerting purposes.
 """

310-339: Alert message hardcodes "0.0 (FAILURE)" but scores can now be any value below threshold.

With threshold-based alerting, a score of 4 (below threshold 5) would trigger an alert, but the Slack message would incorrectly display "Score: 0.0 (FAILURE)". The actual score value should be displayed.

🐛 Proposed fix
+        score_value = score.get("value", "N/A")
+
         # Build fields for the alert
         fields = [
-            {"name": "Score", "value": "0.0 (FAILURE)"},
+            {"name": "Score", "value": f"{score_value} (below threshold)"},
             {"name": "Timestamp", "value": time_str},
             {"name": "Trace ID", "value": f"`{trace_id}`"},
             {"name": "Call SID", "value": f"`{call_sid or 'N/A'}`"},

Also update the fallback text at line 338:

-            fallback_text=f"LLM Judge Failure: {evaluator_name} - Score 0.0",
+            fallback_text=f"LLM Judge Failure: {evaluator_name} - Score {score_value}",
🧹 Nitpick comments (1)
app/core/config/dynamic.py (1)

215-230: Consider validating threshold values are within documented 1-10 range.

The docstring states thresholds are on a 1-10 scale, but invalid values (e.g., negative numbers or values > 10) from Redis would be accepted without validation. This could lead to unexpected alerting behavior.

🔧 Optional: Add range clamping
             try:
-                thresholds.append(int(t))
+                val = int(t)
+                thresholds.append(max(1, min(10, val)))  # Clamp to 1-10
             except ValueError:
                 thresholds.append(5)  # Default threshold
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4ac5bb6 and fcd1d83.

📒 Files selected for processing (4)
  • app/core/config/dynamic.py
  • app/core/config/static.py
  • app/services/langfuse/tasks/score_monitor/score.py
  • app/services/langfuse/tasks/task.py
💤 Files with no reviewable changes (1)
  • app/core/config/static.py
🧰 Additional context used
🧬 Code graph analysis (3)
app/core/config/dynamic.py (1)
app/services/live_config/store.py (1)
  • get_config (211-235)
app/services/langfuse/tasks/task.py (1)
app/core/config/dynamic.py (1)
  • LANGFUSE_EVALUATORS (206-212)
app/services/langfuse/tasks/score_monitor/score.py (2)
app/core/config/dynamic.py (2)
  • LANGFUSE_EVALUATOR_THRESHOLDS (215-230)
  • LANGFUSE_EVALUATORS (206-212)
app/services/redis/client.py (1)
  • get (184-192)
🪛 Ruff (0.14.10)
app/services/langfuse/tasks/score_monitor/score.py

199-199: Consider moving this statement to an else block

(TRY300)


833-833: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (5)
app/core/config/dynamic.py (1)

206-212: LGTM!

Clean implementation following the established pattern (similar to SHOPS_FOR_TEMPLATE_FLOW). The empty string filtering handles edge cases correctly.

app/services/langfuse/tasks/task.py (1)

8-8: LGTM!

Clean migration from static to dynamic config. The async flow is properly handled, and the truthiness check on the evaluators list correctly gates task registration.

Also applies to: 30-34

app/services/langfuse/tasks/score_monitor/score.py (3)

180-203: LGTM!

The threshold-based comparison logic is correct. The static analysis hint (TRY300) about moving the return to an else block is a style preference; keeping it in the try block is fine since the float conversion can raise exceptions.


784-798: LGTM on threshold mapping logic.

The approach of building a dict mapping evaluator names to their thresholds with a default fallback of 5 handles mismatched list lengths gracefully.


817-831: LGTM!

Per-evaluator threshold filtering is correctly implemented. Logging includes the threshold context for debugging.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
app/services/langfuse/tasks/score_monitor/score.py (1)

310-339: Alert displays hardcoded "0.0" instead of actual score value.

With threshold-based alerting, scores like 3 or 4 can trigger alerts, but the Slack message will incorrectly show "Score: 0.0 (FAILURE)". The actual score value should be displayed for accurate alerting.

🐛 Proposed fix
+        # Get actual score value
+        score_value = score.get("value", "N/A")
+
         # Build fields for the alert
         fields = [
-            {"name": "Score", "value": "0.0 (FAILURE)"},
+            {"name": "Score", "value": f"{score_value} (Below Threshold)"},
             {"name": "Timestamp", "value": time_str},
             {"name": "Trace ID", "value": f"`{trace_id}`"},
             {"name": "Call SID", "value": f"`{call_sid or 'N/A'}`"},
             {
                 "name": "Recording",
                 "value": (
                     f"<{recording_url}|Listen to Recording>" if recording_url else "N/A"
                 ),
             },
         ]

Also update the fallback text:

         success = await slack_alert.send(
             title=f"🔴 Breeze Buddy - {evaluator_name}",
             fields=fields,
             sections=sections if sections else None,
             links=links,
-            fallback_text=f"LLM Judge Failure: {evaluator_name} - Score 0.0",
+            fallback_text=f"LLM Judge Failure: {evaluator_name} - Score {score_value}",
         )
🧹 Nitpick comments (1)
app/services/langfuse/tasks/score_monitor/score.py (1)

1-6: Update module docstring to reflect threshold-based detection.

The docstring still references "failures (score = 0)" but the implementation now uses configurable thresholds. Update for accuracy.

📝 Suggested fix
 """
 Langfuse Score Monitoring Service
 
 This module provides functionality to poll Langfuse for LLM-as-a-judge scores
-and identify failures (score = 0) for alerting purposes.
+and identify failures (scores below configured thresholds) for alerting purposes.
 """
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4ac5bb6 and fcd1d83.

📒 Files selected for processing (4)
  • app/core/config/dynamic.py
  • app/core/config/static.py
  • app/services/langfuse/tasks/score_monitor/score.py
  • app/services/langfuse/tasks/task.py
💤 Files with no reviewable changes (1)
  • app/core/config/static.py
🧰 Additional context used
🧬 Code graph analysis (3)
app/core/config/dynamic.py (1)
app/services/live_config/store.py (1)
  • get_config (211-235)
app/services/langfuse/tasks/score_monitor/score.py (1)
app/core/config/dynamic.py (2)
  • LANGFUSE_EVALUATOR_THRESHOLDS (215-230)
  • LANGFUSE_EVALUATORS (206-212)
app/services/langfuse/tasks/task.py (1)
app/core/config/dynamic.py (1)
  • LANGFUSE_EVALUATORS (206-212)
🪛 Ruff (0.14.10)
app/services/langfuse/tasks/score_monitor/score.py

199-199: Consider moving this statement to an else block

(TRY300)


833-833: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (6)
app/core/config/dynamic.py (1)

205-230: LGTM! Well-structured dynamic config accessors.

The new Langfuse configuration accessors follow the established pattern in this module (similar to SHOPS_FOR_TEMPLATE_FLOW). The threshold parsing handles errors gracefully with a sensible default of 5.

app/services/langfuse/tasks/task.py (1)

8-8: LGTM! Correct async migration.

The import source change and async await pattern are correctly implemented. The condition check properly treats an empty evaluators list as falsy.

Also applies to: 30-32

app/services/langfuse/tasks/score_monitor/score.py (4)

180-203: LGTM! Clean threshold-based detection logic.

The method correctly handles edge cases (None values, parse errors) and the threshold comparison logic is sound.


784-798: LGTM! Robust threshold mapping with sensible defaults.

The logic correctly handles mismatched lengths between evaluators and thresholds lists by defaulting to 5, consistent with the dynamic config accessor's behavior.


800-837: LGTM! Per-evaluator threshold filtering is correctly implemented.

The loop correctly applies per-evaluator thresholds and the exception handling ensures one evaluator's failure doesn't stop the entire monitoring process.


902-932: LGTM! Efficient use of cached trace details.

The alert loop correctly uses the pre-fetched trace details cache, avoiding redundant API calls.

@narsimhaReddyJuspay force-pushed the change-scoring-to-1-10-format branch 2 times, most recently from a9c0457 to 192568f on January 7, 2026 at 09:23

Commit message:
  - changed from 0-1 scoring to 1-10 scoring
  - moved configs from static to dynamic

@narsimhaReddyJuspay force-pushed the change-scoring-to-1-10-format branch from 192568f to 94325ac on January 7, 2026 at 09:27
@narsimhaReddyJuspay (Contributor, Author) commented:

@coderabbitai can you review again? I have changed a few things.

coderabbitai bot commented Jan 7, 2026

@narsimhaReddyJuspay, I'll review the latest changes in the PR. I'll take a look at the updates you've made to the scoring format change and the config migration.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
app/services/langfuse/tasks/score_monitor/score.py (1)

179-202: Clarify threshold boundary behavior in documentation.

The comparison uses < (line 198), meaning a score equal to the threshold is considered passing. This is reasonable, but the behavior should be documented clearly in the docstring to avoid confusion.

Additionally, consider validating that:

  1. The threshold is within the expected 1-10 range
  2. The score value is within the expected 1-10 range

While the dynamic config defaults invalid thresholds to 5, runtime validation would catch configuration errors earlier and provide clearer error messages.

📝 Suggested documentation enhancement
 def _is_below_threshold(self, score: Dict[str, Any], threshold: int) -> bool:
     """
-    Check if a score is below the threshold (failure).
-    Scores are on a 1-10 scale.
+    Check if a score is below the threshold (failure).
+    Scores are on a 1-10 scale. A score equal to the threshold is considered passing.

     Args:
         score: Score dictionary
-        threshold: The threshold value (1-10). Scores below this trigger alerts.
+        threshold: The threshold value (1-10). Scores strictly below this value trigger alerts.

     Returns:
-        True if score value is below threshold, False otherwise
+        True if score value is strictly less than threshold, False otherwise
     """
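The strict-less-than semantics discussed here can be illustrated with a minimal standalone sketch. The helper name and the score-dictionary shape are assumptions based on the review context, not the actual code in score.py:

```python
from typing import Any, Dict


def is_below_threshold(score: Dict[str, Any], threshold: int) -> bool:
    """Return True only if the score value parses as a number and is
    strictly below the threshold; a score equal to the threshold passes.

    Scores are assumed to be on a 1-10 scale."""
    value = score.get("value")
    if value is None:
        return False  # missing value: treat as not failing
    try:
        return float(value) < threshold
    except (TypeError, ValueError):
        return False  # unparseable value: treat as not failing
```

With threshold 5, a score of 4 fails, while a score of exactly 5 passes, which is the boundary behavior the comment asks to document.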
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fcd1d83 and 94325ac.

📒 Files selected for processing (4)
  • app/core/config/dynamic.py
  • app/core/config/static.py
  • app/services/langfuse/tasks/score_monitor/score.py
  • app/services/langfuse/tasks/task.py
💤 Files with no reviewable changes (1)
  • app/core/config/static.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • app/core/config/dynamic.py
🧰 Additional context used
🧬 Code graph analysis (2)
app/services/langfuse/tasks/score_monitor/score.py (1)
app/core/config/dynamic.py (1)
  • LANGFUSE_EVALUATORS (227-256)
app/services/langfuse/tasks/task.py (1)
app/core/config/dynamic.py (1)
  • LANGFUSE_EVALUATORS (227-256)
🪛 Ruff (0.14.10)
app/services/langfuse/tasks/score_monitor/score.py

198-198: Consider moving this statement to an else block

(TRY300)


824-824: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (7)
app/services/langfuse/tasks/task.py (2)

8-13: LGTM! Import updated to use dynamic config.

The migration of LANGFUSE_EVALUATORS from static to dynamic config is correctly implemented.


29-34: LGTM! Async config fetch correctly implemented.

The async retrieval of evaluators and the conditional check ensure that monitoring tasks are only registered when evaluators are properly configured. The boolean check on the dict is correct (empty dict is falsy).
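The gating pattern described in this comment can be sketched as follows. All names here are hypothetical stand-ins for illustration; the real accessor reads the config string from the Redis-backed live config store rather than returning a constant:

```python
import asyncio


async def langfuse_evaluators() -> dict[str, int]:
    # Stand-in for the dynamic config accessor; the real implementation
    # would fetch and parse the comma-separated string at runtime.
    return {"accuracy": 7, "tone": 5}


async def init_monitoring_tasks() -> list[str]:
    registered: list[str] = []
    evaluators = await langfuse_evaluators()
    if evaluators:  # an empty dict is falsy, so registration is skipped
        for name in evaluators:
            registered.append(f"score_monitor:{name}")
    return registered


tasks = asyncio.run(init_monitoring_tasks())
```

When the config yields no evaluators, the truthiness check short-circuits and no monitoring tasks are registered, matching the "skip registration" branch in the sequence diagram above.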

app/services/langfuse/tasks/score_monitor/score.py (5)

1-17: LGTM! Documentation and imports updated correctly.

The module docstring accurately reflects the new threshold-based detection on a 1-10 scale, and the import migration from static to dynamic config is correctly implemented.


309-340: LGTM! Alert messages updated to display actual score values.

The alert now shows the actual score value with a "BELOW THRESHOLD" label, which is much more informative than the previous zero-score approach. This helps operators understand the severity of failures.


349-368: LGTM! Dynamic config correctly integrated.

The async retrieval of evaluators and iteration over configured evaluators is correctly implemented. The Redis-based alert tracking logic remains sound.


786-828: LGTM! Per-evaluator threshold logic correctly implemented.

The refactoring from zero-score detection to threshold-based detection is well-executed:

  • Async config retrieval is correct
  • Per-evaluator thresholds are properly extracted and applied
  • Error handling ensures one failing evaluator doesn't break monitoring for others
  • Variable naming (failing_scores_by_evaluator) is more accurate than the previous zero_scores

Note: The broad exception catch on line 824 is intentional for resilience in the monitoring loop, ensuring that a failure in one evaluator doesn't halt the entire monitoring process.


893-924: LGTM! Alert processing logic correctly refactored.

The alert processing flow has been updated to work with threshold-based failures:

  • Correctly checks for failing scores before processing
  • Uses cached trace details for efficiency (avoiding redundant API calls)
  • Maintains resilient error handling per alert
  • Log messages accurately reflect the threshold-based approach

The refactoring maintains the existing performance optimization of fetching trace details once per trace rather than once per score.

@swaroopvarma1 swaroopvarma1 merged commit f512c00 into juspay:release Jan 7, 2026
2 checks passed
