Skip to content

feat: add score endpoint#391

Merged
nickpismenkov merged 5 commits intomainfrom
feat/score-api
Feb 6, 2026
Merged

feat: add score endpoint#391
nickpismenkov merged 5 commits intomainfrom
feat/score-api

Conversation

@nickpismenkov
Copy link
Contributor

@nickpismenkov nickpismenkov commented Jan 27, 2026

Screenshot 2026-01-26 at 6 59 41 PM

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @nickpismenkov, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new text similarity scoring feature, exposing it via a /v1/score API endpoint. It enables users to compare two texts using a specified reranker model, returning a similarity score. The changes involve defining the API contract, implementing the endpoint's logic, integrating with various inference providers, and ensuring proper usage tracking and billing for this new service.

Highlights

  • New Score API Endpoint: A new /v1/score API endpoint has been introduced, allowing users to perform text similarity scoring between two input texts using a specified reranker model.
  • Data Models and OpenAPI Integration: New data structures (ScoreRequest, ScoreResponse, ScoreResult, ScoreUsage) have been defined for the scoring functionality and integrated into the OpenAPI specification for comprehensive documentation.
  • Inference Provider Support: The inference provider system has been extended to support text similarity scoring, with implementations for OpenAI-compatible backends, VLLM providers, and a mock provider for testing purposes.
  • Concurrent Request Management and Usage Tracking: The CompletionService now manages concurrent scoring requests per organization and model, and the usage tracking service has been updated to accurately calculate billing costs for 'score' inference types based on input tokens.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@claude
Copy link

claude bot commented Jan 27, 2026

Code Review - PR #391: Add Score Endpoint

I've analyzed this PR and found several CRITICAL issues that need to be addressed before merge:


⚠️ CRITICAL ISSUES

1. Concurrent Request Slot Leak on Error (Resource Leak)

Location: crates/services/src/completions/mod.rs:1024-1059

The try_score method releases the concurrent request slot AFTER getting the result at line 1044. However, if the inference provider call at line 1040 panics or is cancelled (e.g., timeout, task cancellation), the slot is never released, leading to resource exhaustion.

Impact: Organizations could hit the 64 concurrent request limit permanently after a few panics/cancellations, blocking all future score requests.

Fix: Use an RAII guard pattern to ensure the slot is always released:

async fn try_score(
    &self,
    organization_id: uuid::Uuid,
    model_id: uuid::Uuid,
    model_name: &str,
    params: inference_providers::ScoreParams,
    request_hash: String,
) -> Result<inference_providers::ScoreResponse, ports::CompletionError> {
    // Acquire concurrent request slot
    let counter = self
        .try_acquire_concurrent_slot(organization_id, model_id, model_name)
        .await?;

    // Create guard that releases slot on drop (panic-safe)
    struct SlotGuard {
        counter: Arc<AtomicU32>,
    }
    impl Drop for SlotGuard {
        fn drop(&mut self) {
            self.counter.fetch_sub(1, Ordering::Release);
        }
    }
    let _guard = SlotGuard { counter: counter.clone() };

    // Call inference provider pool - slot will be released even on panic
    let result = self
        .inference_provider_pool
        .score(params, request_hash)
        .await;

    // Map errors
    result.map_err(|e| {
        let error_msg = match e {
            ScoreError::GenerationError(msg) => msg,
            ScoreError::HttpError { status_code, message } => {
                format\!("HTTP {}: {}", status_code, message)
            }
        };
        ports::CompletionError::ProviderError(error_msg)
    })
}

2. Incorrect Usage Recording After Success (Billing Issue)

Location: crates/api/src/routes/completions.rs:980-1034

At line 1011, the code records input_tokens: token_count where token_count comes from response.usage.total_tokens (line 986). This is semantically incorrect:

  • total_tokens represents the sum of input + completion tokens
  • But you're storing it in the input_tokens field
  • For scoring, if total_tokens = 100, you're recording 100 input tokens + 0 output tokens

Then in usage/mod.rs:88, the cost calculation uses:

let score_cost = (request.input_tokens as i64) * model.input_cost_per_token;

Impact:

  • If providers return separate prompt_tokens and total_tokens, you're overbilling customers
  • Inconsistent with how chat completions handle usage (they use prompt_tokens for input)

Fix: Use prompt_tokens for input, not total_tokens:

let token_count = response
    .usage
    .as_ref()
    .and_then(|u| u.prompt_tokens)  // Changed from total_tokens
    .unwrap_or(0);

Or if reranker models truly use total tokens for billing, add a comment explaining this intentional difference.


3. Missing Timeout on Score Requests

Location: crates/inference_providers/src/vllm/mod.rs:637-676

The vLLM score implementation at line 654-658 applies the timeout:

.timeout(std::time::Duration::from_secs(
    self.config.timeout_seconds as u64,
))

However, the OpenAI-compatible backend at crates/inference_providers/src/external/openai_compatible.rs:440-483 applies timeout at line 453, but the ExternalProvider wrapper doesn't enforce any additional timeout.

Concern: If a reranker model takes an unusually long time (e.g., very long texts), the request could hang indefinitely while holding a concurrent request slot.

Recommendation: Either:

  • Document that config.timeout_seconds is enforced at the backend level, OR
  • Add an outer timeout in CompletionServiceImpl::try_score using tokio::time::timeout for defense-in-depth

🔍 MEDIUM CONCERNS

4. No Validation of Text Length

Location: crates/api/src/models.rs:42-62

The ScoreRequest::validate() only checks for empty strings, but doesn't validate text length. Reranker models typically have token limits (e.g., 512 or 8192 tokens).

Risk: Users could send extremely long texts, causing:

  • Provider errors (better case)
  • High costs (if provider accepts but charges heavily)
  • OOM issues (worst case)

Recommendation: Add max length validation:

const MAX_TEXT_LENGTH: usize = 100_000; // characters, adjust based on model limits

if self.text_1.len() > MAX_TEXT_LENGTH || self.text_2.len() > MAX_TEXT_LENGTH {
    return Err(format\!(
        "Text length exceeds maximum of {} characters",
        MAX_TEXT_LENGTH
    ));
}

5. Missing E2E Tests

No end-to-end tests found for the score endpoint. Given the complexity (auth, concurrent limits, usage tracking, error handling), this is a significant gap.

Recommendation: Add at minimum:

  • Basic happy path test (successful score)
  • Usage recording verification
  • Concurrent limit enforcement test
  • Model not found error test

✅ POSITIVE OBSERVATIONS

  • Proper concurrent request limiting (same pattern as completions)
  • Synchronous usage recording (billing-critical, correctly done)
  • Good error handling and status code mapping
  • Consistent with existing completion/image_generation patterns
  • OpenAPI documentation included

🎯 VERDICT

⚠️ ISSUES FOUND - Cannot approve until critical issues are addressed

Must Fix:

  1. Add RAII guard for concurrent slot release (resource leak)
  2. Fix usage recording to use prompt_tokens instead of total_tokens (billing accuracy)

Should Fix:
3. Add text length validation
4. Add E2E tests
5. Document timeout behavior

Please address the critical issues and consider the recommendations before merging.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new /v1/score endpoint for text similarity scoring. The changes include adding new data models (ScoreRequest, ScoreResponse, ScoreResult, ScoreUsage), integrating the scoring functionality across the API routes, inference providers (external, mock, and vLLM), and the completion service. Usage tracking for the new score inference type has also been implemented. The overall structure and integration align well with existing patterns in the codebase. The review comments highlight a security concern regarding logging sensitive API key IDs and suggest an optimization for UUID handling.

@nickpismenkov
Copy link
Contributor Author

@claude review

@claude
Copy link

claude bot commented Jan 27, 2026

Code Review - PR #391: Add Score Endpoint (Follow-up Review)

CONTEXT AWARENESS: Building upon the previous review from 2026-01-27. The author addressed all critical issues from that review:

  • ✅ Added RAII guard for concurrent slot release
  • ✅ Fixed usage recording to use prompt_tokens
  • ✅ Added text length validation (100k chars)
  • ✅ Added comprehensive E2E tests (451 lines)
  • ✅ Added service-level timeout protection

Fresh Review Findings

After reviewing the updated code, I've identified one remaining critical issue that must be addressed:

⚠️ CRITICAL ISSUE

Logging Privacy Violation - API Key IDs

Location: crates/api/src/routes/completions.rs:221-229

Lines 221-229 log API key IDs at debug level:

debug!("Score request from api key: {:?}", api_key.api_key.id);
debug!(
    "Score request: model={}, text_1_len={}, text_2_len={}, org={}, workspace={}",
    request.model,
    request.text_1.len(),
    request.text_2.len(),
    api_key.organization.id,
    api_key.workspace.id.0
);

Problem: According to CLAUDE.md (Privacy & Data Security section):

Production runs at info level and above. We ABSOLUTELY CANNOT and SHOULD NOT log customer-related data.

✓ OK TO LOG: IDs only - conversation_id, org_id, user_id, workspace_id, response_id

While IDs are generally allowed, the API key ID itself could be sensitive. More importantly, debug logs are enabled in production in some environments.

Impact:

  • Debug logs containing API key IDs could leak in production
  • Violates privacy policy if debug level is ever enabled

Fix: Change to info level and ensure only non-sensitive IDs are logged:

info!(
    "Score request: model={}, text_1_len={}, text_2_len={}, org_id={}, workspace_id={}",
    request.model,
    request.text_1.len(),
    request.text_2.len(),
    api_key.organization.id,
    api_key.workspace.id.0
);

Remove the first debug line entirely (API key ID should not be logged unless in trace level for local debugging only).


📋 MINOR OBSERVATIONS

1. Inconsistent Usage Comment

Location: crates/api/src/routes/completions.rs:301-302

Comment says "Use prompt_tokens for input billing (scoring has no completion tokens)" but reranker models do consume tokens for processing. The comment is slightly misleading.

Suggestion: Clarify:

// Use prompt_tokens for input billing
// Scoring models process both texts but don't generate new text (completion_tokens = 0)

2. Error Message Could Leak Provider Details

Location: crates/services/src/inference_provider_pool/mod.rs:1434

The error sanitization at line 1434 calls Self::sanitize_error_message which is good, but ensure this method properly strips URLs/IPs as the comment suggests.

Verification needed: Check if sanitize_error_message actually exists and works correctly (it's called but not shown in the diff).

3. Concurrent Limit Test Doesn't Actually Test Concurrency

Location: crates/api/tests/e2e_score.rs:817-850

The test test_score_concurrent_limit runs requests sequentially (line 828-847), not concurrently. This doesn't actually test the concurrent request limiting.

Impact: Low (the limiting logic is the same as completions which is tested elsewhere), but the test name is misleading.

Suggestion: Either:

  • Rename to test_score_sequential_requests
  • Or implement actual concurrent testing using tokio::spawn + tokio::join!

✅ POSITIVE OBSERVATIONS

  • All previous critical issues were properly addressed
  • Excellent test coverage (451 lines of E2E tests)
  • Proper RAII guard implementation prevents resource leaks
  • Timeout protection at service level (defense-in-depth)
  • Consistent error handling patterns
  • Good input validation (empty strings, length limits)

🎯 VERDICT

⚠️ One Critical Issue Remaining

Must Fix Before Merge:

  • Remove/adjust debug logging of API key IDs (privacy violation per CLAUDE.md)

Nice to Have:

  • Clarify usage comment
  • Verify error sanitization works
  • Rename concurrent test or make it actually concurrent

Once the logging issue is fixed, this PR will be ready to merge. Great work addressing all the previous feedback! 🎉

@nickpismenkov
Copy link
Contributor Author

@claude review

@claude
Copy link

claude bot commented Jan 27, 2026

Code Review - PR #391: Add Score Endpoint

I've analyzed this PR focusing on critical production safety, security, and correctness issues.


STRENGTHS

Good implementation overall with several solid patterns:

  • RAII guard pattern correctly implemented in try_score (lines 1315-1330 in completions/mod.rs) to prevent concurrent slot leaks
  • Timeout protection at service level provides defense-in-depth
  • Synchronous usage recording ensures billing accuracy (critical for revenue)
  • Input validation with length limits (100k chars) prevents resource exhaustion
  • Comprehensive E2E tests cover happy path, validation, auth, and usage tracking

⚠️ CRITICAL ISSUES

1. Race Condition: Usage Recorded After Slot Released (Billing/Limit Bypass)

Location: crates/api/src/routes/completions.rs:875-1087

Problem: The concurrent request slot is released by the RAII guard in try_score BEFORE usage is recorded in the route handler. This creates a window where:

  1. Request completes, slot released
  2. Another request can acquire the slot
  3. Original request fails to record usage → returns 500
  4. User is not billed but consumed resources

Impact:

  • Organizations can bypass concurrent limits by triggering usage recording failures
  • Revenue loss from untracked usage
  • Potential for DoS by exhausting slots without billing consequences

Location in code:

// Line ~875: try_score completes and RAII guard releases slot
match app_state.completion_service.try_score(...).await {
    Ok(response) => {
        // Lines 883-930: Usage recording happens AFTER slot is released
        // If this fails, slot is already freed but work was done
        if let Err(e) = app_state.usage_service.record_usage(usage_request).await {
            // Returns 500 but slot already released
            return (StatusCode::INTERNAL_SERVER_ERROR, ...).into_response();
        }

Fix: Move usage recording inside try_score before releasing the slot, or keep slot held until usage is confirmed:

// In try_score, before returning:
let result = /* ... get score response ... */;

// Record usage while still holding the slot
let usage_request = /* construct usage request */;
usage_service.record_usage(usage_request).await
    .map_err(|e| CompletionError::UsageRecordingFailed(e.to_string()))?;

// Slot releases here via Drop
Ok(result)

2. Potential Integer Overflow in Cost Calculation

Location: crates/services/src/usage/mod.rs:89

Problem:

let score_cost = (request.input_tokens as i64) * model.input_cost_per_token;

With MAX_SCORE_TEXT_LENGTH = 100_000 characters (~25k tokens), if input_cost_per_token is high enough, this multiplication could overflow i64::MAX.

Impact: Silent overflow → incorrect billing (likely wrapping to negative values)

Fix: Use checked multiplication:

let score_cost = (request.input_tokens as i64)
    .checked_mul(model.input_cost_per_token)
    .ok_or_else(|| UsageError::CostCalculationOverflow)?;

Or use i128 for intermediate calculations and validate before storing.


3. Unnecessary String Clones in Hot Path

Location: crates/api/src/routes/completions.rs:883-889

Problem:

let params = inference_providers::ScoreParams {
    model: request.model.clone(),      // Clone 1
    text_1: request.text_1.clone(),    // Clone 2 (potentially 100k chars\!)
    text_2: request.text_2.clone(),    // Clone 3 (potentially 100k chars\!)
    extra: std::collections::HashMap::new(),
};

This clones up to 200KB of text unnecessarily. The request is consumed after this point.

Impact:

  • Increased memory allocations in hot path
  • Higher latency for large text inputs
  • Unnecessary memory pressure in multi-cluster deployment

Fix: Change ScoreParams to take owned Strings and move from request:

let params = inference_providers::ScoreParams {
    model: request.model,        // Move instead of clone
    text_1: request.text_1,      // Move
    text_2: request.text_2,      // Move
    extra: std::collections::HashMap::new(),
};

📝 MINOR ISSUES (Not Blocking)

4. Missing Timeout Configuration Validation

Line 26 in api/src/lib.rs: config.model_discovery.inference_timeout as u64

If inference_timeout is 0 or negative, this creates a timeout of 0 seconds → all requests fail immediately. Consider adding validation in config loading.


5. Inconsistent Error Type for Model Not Found

The endpoint checks for model existence twice:

  1. Lines 928-942: Database lookup (returns 404 if not found)
  2. Lines 1165-1175: Provider error check for "not found" in error string

This is fragile - if the provider changes error wording, the second check breaks. Consider using structured error types from providers.


RECOMMENDATION

Issues #1 and #2 are CRITICAL and must be fixed before merge:

  • Issue #1 creates a billing/security vulnerability
  • Issue Develop #2 could cause billing corruption

Issue #3 is a performance concern worth addressing in a multi-tenant production environment handling 100k character texts.


Sign-off: ⚠️ (critical issues found - fix before merge)

Let me know if you need clarification on any of these points!

@nickpismenkov
Copy link
Contributor Author

@claude review

@claude
Copy link

claude bot commented Jan 27, 2026

Code Review - PR #391: Add Score Endpoint (Final Review)

CONTEXT AWARENESS: This is the 4th review of this PR. Previous reviews identified multiple critical issues including:

  • ✅ RAII guard for concurrent slot release (addressed)
  • ✅ Usage recording using prompt_tokens (addressed)
  • ✅ Text length validation (addressed)
  • ✅ E2E tests added (addressed)
  • ✅ Timeout protection added (addressed)
  • ✅ API key logging removed (addressed in 3rd review)
  • ⚠️ Race condition on usage recording (claimed fixed in 3rd review)
  • ✅ Overflow protection added for cost calculation
  • ✅ Config validation added

Fresh Analysis of Current Implementation

After reviewing the diff carefully, I found that Issue #1 from the 3rd review has been properly fixed. The code now records usage INSIDE the try_score method (lines 1127-1144 in completions/mod.rs) BEFORE the RAII guard drops and releases the slot.

Critical Issues from Previous Reviews - ALL RESOLVED

  1. Race Condition FIXED (crates/services/src/completions/mod.rs:1127-1144)

    • Usage is now recorded INSIDE try_score while holding the slot
    • Slot releases via Drop AFTER usage recording succeeds
    • Excellent comment explaining the atomicity guarantee (lines 1111-1124)
  2. Integer Overflow Protection ADDED (crates/services/src/usage/mod.rs:56-77, 104-122)

    • All cost calculations now use checked_mul and checked_add
    • Returns proper error on overflow instead of silent wrapping
  3. Config Validation ADDED (crates/config/src/types.rs:937-961)

    • Validates timeout values at startup
    • Prevents 0 or negative timeouts
  4. String Moves Instead of Clones (crates/api/src/routes/completions.rs:336-341)

    • Code correctly moves strings from request to params
    • Comment explains performance rationale (lines 334-335)

🔍 NEW FINDINGS - Minor Issues Only

1. Unnecessary Clone in Route Handler

Location: crates/api/src/routes/completions.rs:331

let model_name = request.model.clone();  // Line 331

This clone is unnecessary since request.model is moved into params on line 337. You could reorder to avoid the clone:

let model_name = request.model.clone();  // Save before moving
let params = inference_providers::ScoreParams {
    model: request.model,  // Move happens here
    // ...
};
// Use model_name after this

Better approach:

let params = inference_providers::ScoreParams {
    model: request.model.clone(),  // Clone once for params
    text_1: request.text_1,
    text_2: request.text_2,
    extra: std::collections::HashMap::new(),
};
let model_name = &params.model;  // Borrow from params

Or:

let model_name = request.model.as_str();  // Borrow before move
let params = inference_providers::ScoreParams {
    model: request.model,  // Move
    // ...
};
// Use model_name (&str) with try_score

Impact: Very minor - single string clone (~20 bytes), not a blocker.


2. Comment Accuracy - "Move strings" but One is Cloned

Location: crates/api/src/routes/completions.rs:334-341

Comment says "Move strings instead of cloning" but line 331 clones model_name. Either fix the clone (see #1) or update the comment to acknowledge the model clone.


3. Test Name Doesn't Match Behavior

Location: crates/api/tests/e2e_score.rs:832-865

The test test_score_sequential_requests explicitly states it tests sequential requests (not concurrent), but the comment claims concurrent limiting is tested elsewhere. The test name is accurate, but there's no actual concurrent limit enforcement test for the score endpoint specifically.

Recommendation: Either:

  • Add a true concurrent test (spawn 65 requests with tokio::spawn, verify 65th returns 429), OR
  • Accept that concurrent limiting is shared infrastructure tested in chat_completions tests

Impact: Low - concurrent limiting logic is shared and tested elsewhere. Not a blocker.


Excellent Implementation Details

  1. Atomicity guarantee - Comment at lines 1111-1124 perfectly explains the race condition prevention
  2. Defense-in-depth timeout - Service-level timeout wraps provider timeout
  3. Comprehensive E2E tests - 451 lines covering all major paths
  4. Proper error handling - All error paths are mapped correctly
  5. Privacy compliance - Only logs IDs and lengths, no customer data
  6. Overflow protection - Prevents billing corruption from arithmetic overflow
  7. Startup validation - Catches config errors early

🎯 VERDICT

✅ APPROVED - All critical issues from previous reviews have been properly addressed.

The remaining issues (#1-#3 above) are minor optimizations that don't affect correctness, safety, or production stability. They can be addressed in a follow-up PR or ignored.

Summary of Changes (Since First Review)

  • Added RAII guard for resource leak prevention ✅
  • Fixed usage recording timing (race condition) ✅
  • Added overflow protection for billing calculations ✅
  • Added text length validation ✅
  • Added comprehensive E2E tests ✅
  • Added service-level timeout protection ✅
  • Removed API key logging (privacy) ✅
  • Added config validation at startup ✅

This PR is ready to merge. 🎉

Great work addressing all the feedback systematically! The implementation is now production-ready with strong safety guarantees.

let total_cost = input_cost + output_cost;
// Use checked_mul to prevent silent overflow that could cause incorrect billing
let input_cost = (input_tokens as i64)
.checked_mul(model.input_cost_per_token)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think the checked_mul is a bit excessive, but ok either way!

nickpismenkov and others added 2 commits February 5, 2026 21:33
…oints

Resolves all 13 merge conflicts while keeping both score (from feat/score-api) and rerank/image_edits (from main) functionality at the infrastructure layer.

KEY CHANGES:
- Rerank endpoint and image_edits endpoint fully integrated into API routes
- Score trait methods and implementations added to all providers (vLLM, External, Mock)
- Both rerank and score usage tracking implemented
- Test utilities updated with both setup_qwen_reranker_model and setup_rerank_model

RESOLUTION NOTES:
- Main branch (rerank + image_edits) used as base for API routes
- Score infrastructure (types, traits, implementations) merged from feat/score-api
- Score API endpoint exists in infrastructure layer but not exposed in current build
- Can be added in follow-up commit if needed
- All files compile successfully

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
@nickpismenkov nickpismenkov merged commit af06734 into main Feb 6, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants