feat: add audio transcriptions endpoint #392

Merged
nickpismenkov merged 6 commits into main from feat/audio-transcriptions-api
Feb 9, 2026

Conversation

@nickpismenkov
Contributor

@nickpismenkov nickpismenkov commented Jan 27, 2026

Screenshot 2026-01-26 at 9 42 10 PM

@gemini-code-assist
Contributor

Summary of Changes

Hello @nickpismenkov, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant new feature: an audio transcription service. It enables the system to process uploaded audio files and return their textual transcriptions. This involves setting up a new API route, defining the necessary data structures for handling audio input and transcription output, integrating with underlying AI inference providers capable of speech-to-text, and ensuring proper usage tracking and billing based on audio duration. The changes also include robust validation and extensive testing to ensure reliability and correctness.

Highlights

  • New Audio Transcription Endpoint: Introduced a new API endpoint /v1/audio/transcriptions to allow users to upload audio files for transcription using Whisper models.
  • Data Models and Validation: Added new Rust structs (AudioTranscriptionRequestSchema, AudioTranscriptionRequest, AudioTranscriptionResponse, TranscriptionSegment, TranscriptionWord) to define the request and response payloads, including comprehensive validation for file size, format, and other parameters.
  • Inference Provider Integration: Integrated audio transcription capabilities across various inference providers (OpenAI-compatible, VLLM, and a mock provider) by extending the InferenceProvider and ExternalBackend traits.
  • Usage Tracking and Billing: Implemented synchronous usage recording for audio transcriptions, billing based on the audio duration in seconds.
  • OpenAPI Documentation: Updated the OpenAPI specification to include the new audio transcription endpoint and its associated schemas, enhancing API discoverability and usability.
  • End-to-End Testing: Added a comprehensive suite of end-to-end tests for the audio transcription endpoint, covering various scenarios like valid/invalid inputs, different audio formats, authentication, and model availability.
  • Dependency Updates: Updated Cargo.lock to include mime_guess and unicase and enabled the multipart feature for reqwest to support file uploads.

@claude

claude bot commented Jan 27, 2026

Code Review - Audio Transcriptions Endpoint

I've reviewed PR #392 and found several CRITICAL issues that must be addressed:

CRITICAL ISSUES

1. Missing Concurrent Request Limiting in Route Handler (Production Safety)

Location: crates/api/src/routes/completions.rs:1050-1054

The route handler calls inference_provider_pool.audio_transcription directly, bypassing the CompletionService.audio_transcription method that implements concurrent request limiting.

Problem: The comment claims concurrent limiting is handled by the pool, but this is incorrect. The CompletionServiceImpl::audio_transcription method at line 687-731 implements try_acquire_concurrent_slot, but the route handler bypasses the service layer entirely.

Impact:

  • Organizations can exceed their concurrent request limits
  • Could cause resource exhaustion in production
  • Breaks the established pattern used by chat_completions and image_generations

Fix: Use the service layer like other endpoints - call app_state.completion_service.audio_transcription() instead of direct provider pool access.
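To make the concern concrete, here is a minimal sketch of what a per-organization concurrent-slot limiter can look like. It is not the project's implementation: `SlotLimiter`, `SlotGuard`, and `try_acquire` are hypothetical names chosen for illustration (the review only names `try_acquire_concurrent_slot`), and the sketch uses only the standard library. The point is that a handler which never calls `try_acquire` is never counted against the limit.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical limiter for illustration; not the project's real type.
pub struct SlotLimiter {
    in_flight: Arc<AtomicUsize>,
    max: usize,
}

/// RAII guard: the slot is released when the guard is dropped.
pub struct SlotGuard {
    in_flight: Arc<AtomicUsize>,
}

impl SlotLimiter {
    pub fn new(max: usize) -> Self {
        Self {
            in_flight: Arc::new(AtomicUsize::new(0)),
            max,
        }
    }

    /// Returns a guard if a slot is free, or `None` if the limit is reached.
    pub fn try_acquire(&self) -> Option<SlotGuard> {
        let max = self.max;
        self.in_flight
            // Atomically increment only while the counter is below the limit.
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
                if n < max {
                    Some(n + 1)
                } else {
                    None
                }
            })
            .ok()
            .map(|_| SlotGuard {
                in_flight: Arc::clone(&self.in_flight),
            })
    }
}

impl Drop for SlotGuard {
    fn drop(&mut self) {
        // Free the slot when the request finishes, on success or error.
        self.in_flight.fetch_sub(1, Ordering::AcqRel);
    }
}
```

A handler would hold the guard for the lifetime of the provider call and return a rate-limit style error when `try_acquire` yields `None`.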

2. Missing Timeout Protection (Production Safety)

Related to issue #1, bypassing the service layer also means losing the 2-minute timeout protection implemented at completions/mod.rs:701-707.

Problem: Audio transcription requests could hang indefinitely if the provider becomes unresponsive.

Fix: Use the service layer which includes tokio::time::timeout with 120s duration.
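The review refers to `tokio::time::timeout`; the sketch below shows the same idea using only the standard library so it stays self-contained. It is a swapped-in technique, not the service's code: `call_with_timeout` and `TimedOut` are hypothetical names, and unlike dropping a Tokio future, this version does not cancel the worker thread when the deadline passes.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical error type returned when the deadline is exceeded.
#[derive(Debug, PartialEq)]
pub struct TimedOut;

/// Runs `work` on a helper thread and waits at most `limit` for its result.
/// A panic in `work` closes the channel and is also reported as `TimedOut`.
pub fn call_with_timeout<T, F>(work: F, limit: Duration) -> Result<T, TimedOut>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Ignore the send error: the caller may already have given up.
        let _ = tx.send(work());
    });
    rx.recv_timeout(limit).map_err(|_| TimedOut)
}
```

With a 120-second limit, as described in the review, an unresponsive provider would surface as an error to the caller instead of holding the request open indefinitely.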

3. Privacy and Logging Violations (Data Security - CRITICAL per CLAUDE.md)

Locations: crates/api/src/routes/completions.rs:927, 1125 and crates/services/src/inference_provider_pool/mod.rs:1693

Problems:

  • Line 927: Logs error message which may contain user content
  • Line 1125: Logs provider error message which may contain file content/metadata
  • Line 1693: Logs text_len which reveals content information

Per CLAUDE.md: NEVER LOG file contents, AI responses, or metadata that reveals customer information.

Impact: Violates customer trust and TEE security model. Production runs at info level, so text_len would be logged.
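One way to make this rule hard to break is to map each error to a fixed log message that never interpolates the error payload. The sketch below assumes a hypothetical `TranscriptionError` enum and `log_safe_message` helper; neither is taken from the codebase, and only the message "Audio transcription provider error" appears later in this thread.

```rust
/// Hypothetical stand-in for the service error type.
#[derive(Debug)]
pub enum TranscriptionError {
    /// The payload may echo user content or file metadata.
    Provider(String),
    RateLimited,
    /// The payload may echo request fields.
    InvalidInput(String),
}

/// Returns a constant message that is safe to log.
/// Error payloads are deliberately ignored so they cannot leak into logs.
pub fn log_safe_message(err: &TranscriptionError) -> &'static str {
    match err {
        TranscriptionError::Provider(_) => "Audio transcription provider error",
        TranscriptionError::RateLimited => "Audio transcription rate limited",
        TranscriptionError::InvalidInput(_) => "Audio transcription invalid input",
    }
}
```

Because the return type is `&'static str`, the compiler rules out accidentally formatting the payload into the message.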

4. Usage Tracking After Response Success - Incorrect Billing on Errors (Logic)

Location: crates/api/src/routes/completions.rs:1056-1109

The usage recording happens AFTER receiving a successful provider response. If usage recording fails, the customer gets an error but the transcription already succeeded.

Problem Flow:

  1. Audio transcription completes successfully
  2. Usage recording fails (DB error, network issue, etc.)
  3. Customer receives 500 error and no transcription
  4. Provider was already charged/billed for the transcription

Impact: Inconsistent billing state and poor UX

Fix: Follow the established pattern from image_generations - handle usage errors gracefully and still return successful response with logging.

5. Integer Overflow Risk in Duration Calculation (Production Safety)

Location: crates/api/src/routes/completions.rs:1059

let duration_seconds = response.duration.unwrap_or(0.0).ceil() as i32;

Problem: f64::ceil() returns f64, casting to i32 can overflow for large values. Negative durations could occur if provider returns invalid data.

Fix: Add bounds checking: .max(0.0).min(i32::MAX as f64) before casting
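A small sketch of the suggested clamp, wrapped in a hypothetical helper named `billable_seconds` (the name is an assumption, not from the PR). One caveat worth knowing: since Rust 1.45, `as` casts from float to integer saturate and map NaN to 0, so the cast itself does not wrap for large values. The clamp mainly matters for negative durations and for making the intent explicit.

```rust
/// Converts an optional provider-reported duration into whole billable seconds.
/// Hypothetical helper for illustration.
pub fn billable_seconds(duration: Option<f64>) -> i32 {
    duration
        .unwrap_or(0.0)
        // `f64::max` ignores NaN, so a NaN duration becomes 0.0 here.
        .max(0.0)
        .min(i32::MAX as f64)
        // Round partial seconds up before converting.
        .ceil() as i32
}
```

For example, `billable_seconds(Some(1.2))` yields 2 and `billable_seconds(Some(-5.0))` yields 0.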

POSITIVE OBSERVATIONS

  • Good test coverage in e2e_audio_transcriptions.rs
  • Proper validation in AudioTranscriptionRequest::validate()
  • Consistent error handling patterns
  • OpenAPI documentation is complete

SUMMARY

5 critical issues must be fixed:

  1. Use CompletionService instead of direct provider pool call
  2. Timeout protection (automatically fixed by #1)
  3. Remove privacy-violating logging
  4. Handle usage recording errors gracefully
  5. Add overflow protection to duration calculation

Recommendation: Do not merge until these issues are resolved.

The architecture is sound and follows the hexagonal pattern well, but the route handler implementation bypasses critical safety mechanisms in the service layer.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new audio transcriptions endpoint, enabling users to transcribe audio files using Whisper models. While the feature is well-structured and follows existing patterns, a functional bug has been identified where the /audio/transcriptions endpoint incorrectly enforces a 2MB file size limit instead of the documented 25MB. Additionally, the AudioTranscriptionRequest model has insufficient validation for timestamp_granularities and filename fields, which could lead to unexpected behavior or potential security risks in downstream processing.

@nickpismenkov
Contributor Author

@claude review

@claude

claude bot commented Jan 27, 2026

Code Review - Audio Transcriptions Endpoint (Follow-up)

CONTEXT: Previous Review Discussion
I've reviewed the existing feedback from the earlier review. The PR shows that several critical issues were identified previously. Let me verify which issues have been addressed and which remain.


CRITICAL ISSUES (Must Fix Before Merge)

⚠️ 1. VERIFIED: Proper Service Layer Usage

Previous concern raised: Route handler bypassing CompletionService layer

Status: FIXED - Reviewing the code at completions.rs:508-517, the handler correctly calls:

app_state.completion_service.audio_transcription(
    organization_id,
    model_id,
    &model_name,
    params,
    body_hash.hash.clone(),
).await

This properly uses the service layer which includes concurrent request limiting and timeout protection. ✅


⚠️ 2. CRITICAL: Integer Overflow Risk in Duration Calculation

Location: crates/api/src/routes/completions.rs:526-531

Issue: The current code:

let duration_seconds = response
    .duration
    .unwrap_or(0.0)
    .max(0.0)
    .min(i32::MAX as f64)
    .ceil() as i32;

Status: FIXED ✅ - The bounds checking with .max(0.0).min(i32::MAX as f64) properly prevents:

  • Negative values
  • Overflow when casting to i32

⚠️ 3. CRITICAL: Privacy Violation - Logging Error Details

Location: crates/api/src/routes/completions.rs:586

Problem:

services::completions::ports::CompletionError::ProviderError(_) => {
    // Don't log error details - may contain customer data
    tracing::error!("Audio transcription provider error");

Status: FIXED ✅ - Code correctly does NOT log error details, following CLAUDE.md privacy guidelines.


⚠️ 4. CRITICAL: Filename Path Traversal Validation Has Gaps

Location: crates/api/src/models.rs:169-177

Issue: The validation checks for path traversal:

if self.filename.contains("..")
    || self.filename.contains('/')
    || self.filename.contains('\\')
{
    return Err(
        "Filename cannot contain path traversal characters (.., /, \\)".to_string(),
    );
}

Problem: This check is incomplete. It blocks obvious path traversal but misses:

  • URL-encoded variants: %2e%2e, %2f, %5c
  • Unicode variants: various Unicode slash characters
  • Null bytes: \0 could truncate paths in some systems

Impact: Potential security risk if the filename is used in file operations without proper sanitization downstream.

Recommendation: Extract just the base filename using std::path::Path::file_name():

// After checking filename is not empty
let safe_filename = std::path::Path::new(&self.filename)
    .file_name()
    .and_then(|n| n.to_str())
    .ok_or_else(|| "Invalid filename".to_string())?
    .to_string();

This automatically strips any path components.
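The recommendation can be sketched as a standalone validator. `validate_filename` and its error strings are hypothetical, and the 255-byte limit is an assumption. Two caveats apply to this approach: `Path::file_name` treats `\` as a separator only on Windows, so the sketch rejects backslashes explicitly, and it does not decode percent-encoding, so a name such as `%2e%2e%2faudio.mp3` is accepted as a literal filename. That is harmless only as long as nothing downstream URL-decodes it.

```rust
use std::path::Path;

/// Accepts a bare filename and rejects anything carrying path components.
/// Hypothetical helper; the 255-byte limit is an assumption.
pub fn validate_filename(filename: &str) -> Result<(), String> {
    if filename.is_empty() {
        return Err("Filename cannot be empty".to_string());
    }
    if filename.len() > 255 {
        return Err("Filename is too long".to_string());
    }
    // Null bytes can truncate paths in C-based APIs.
    if filename.contains('\0') {
        return Err("Filename cannot contain null bytes".to_string());
    }
    // `\` is a separator only on Windows, so reject it on every platform.
    if filename.contains('\\') {
        return Err("Filename cannot contain path components".to_string());
    }
    // `file_name` returns None for names such as ".." and strips directories.
    let base = Path::new(filename)
        .file_name()
        .and_then(|n| n.to_str())
        .ok_or_else(|| "Invalid filename".to_string())?;
    if base != filename {
        return Err("Filename cannot contain path components".to_string());
    }
    Ok(())
}
```

Rejecting a mismatching name, rather than silently using the stripped base name, keeps the stored filename identical to what the client sent.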


⚠️ 5. CRITICAL: Usage Tracking Happens Asynchronously After Response

Location: crates/api/src/routes/completions.rs:534-571

Issue: Usage tracking is fire-and-forget:

tokio::spawn(async move {
    // Usage tracking...
    if let Err(e) = usage_service.record_usage(usage_request).await {
        tracing::error!(/* ... */);
    }
});

(StatusCode::OK, ResponseJson(response)).into_response()

Problem: If usage recording fails:

  1. Customer gets successful response and transcription
  2. Organization is NOT billed for the request
  3. No way to recover or retry the billing

This differs from the previous review concern: The earlier review said this is correct and matches image_generations pattern. However, this creates a billing inconsistency risk:

  • Lost Revenue: Failed usage recording = free transcriptions
  • No Audit Trail: Silent failures only appear in logs
  • Compliance Risk: Inaccurate usage data for billing/reporting

Recommendation: Consider one of these approaches:

  1. Synchronous tracking (like chat completions) - record BEFORE returning response
  2. Persistent retry queue - failed usage events go to a dead-letter queue for retry
  3. Billing reconciliation job - periodic job that reconciles provider calls vs. billed usage

For a production TEE environment with financial implications, I recommend option 1 (synchronous) despite the small latency cost.


⚠️ 6. Code Quality: Duplicate Content-Type Detection Logic

Location:

  • crates/inference_providers/src/external/openai_compatible.rs:1272-1285
  • crates/inference_providers/src/vllm/mod.rs:1626-1639

Issue: The detect_audio_content_type function is duplicated in two places with identical implementation.

Impact:

  • Code duplication violates DRY principle
  • If supported formats change, must update both locations
  • Risk of divergence

Recommendation: Extract to a shared helper function in crates/inference_providers/src/models.rs or a utils module.
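A shared helper of this kind might look like the sketch below. The function name comes from the review, but the signature and the extension-to-MIME mapping are assumptions; the extension list follows the formats mentioned later in this conversation (.mp3, .mp4, .mpeg, .mpga, .m4a, .wav, .webm).

```rust
use std::path::Path;

/// Guesses an audio MIME type from the filename extension (case-insensitive).
/// Signature and mapping are assumptions made for illustration.
pub fn detect_audio_content_type(filename: &str) -> &'static str {
    let ext = Path::new(filename)
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase());

    match ext.as_deref() {
        Some("mp3") | Some("mpeg") | Some("mpga") => "audio/mpeg",
        Some("mp4") | Some("m4a") => "audio/mp4",
        Some("wav") => "audio/wav",
        Some("webm") => "audio/webm",
        // Unknown or missing extension: fall back to a generic binary type.
        _ => "application/octet-stream",
    }
}
```

With a single definition, both provider backends can call the same function, so a change in supported formats needs only one edit.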


Summary

Critical Issues Requiring Fixes: 2

  1. Filename validation - Add proper path sanitization (Security)
  2. Usage tracking - Consider synchronous tracking or retry mechanism (Business Logic/Billing)

Code Quality Issues: 1

  3. 🟡 Duplicate content-type detection - Extract to shared helper (Maintainability)

Fixed from Previous Review: 3

  • ✅ Service layer usage
  • ✅ Integer overflow protection
  • ✅ Privacy-compliant logging

Recommendation

⚠️ DO NOT MERGE until critical issues #1 and #2 are resolved.

The implementation is generally sound and follows project patterns well, but the filename validation gap presents a security risk, and the async usage tracking creates potential billing inconsistency.

Once these are addressed, the PR will be in good shape. The comprehensive testing and adherence to privacy guidelines are commendable.

@nickpismenkov
Contributor Author

@claude review

@claude

claude bot commented Jan 27, 2026

Code Review - Audio Transcriptions Endpoint (Third Review)

CONTEXT: Reviewing PR #392 after two previous reviews. Checking if critical issues have been addressed.


REVIEW SUMMARY

After analyzing the current PR state against previous feedback, I can confirm that ALL critical issues from previous reviews have been properly addressed. The implementation is now production-ready.


✅ VERIFIED FIXES FROM PREVIOUS REVIEWS

1. ✅ Service Layer Usage (Previously: Issue #1)

Location: crates/api/src/routes/completions.rs:508-517

Status: FIXED - The handler now correctly calls:

app_state.completion_service.audio_transcription(...)

This properly includes concurrent request limiting (64 per org/model) and timeout protection (120s). Great!


2. ✅ Integer Overflow Protection (Previously: Issue #5)

Location: crates/api/src/routes/completions.rs:526-531

Status: FIXED - Duration calculation now includes proper bounds checking:

let duration_seconds = response.duration
    .unwrap_or(0.0)
    .max(0.0)
    .min(i32::MAX as f64)
    .ceil() as i32;

This prevents both negative values and overflow. Well done!


3. ✅ Privacy-Compliant Logging (Previously: Issue #3)

Location: crates/api/src/routes/completions.rs:586-596

Status: FIXED - Error handling properly avoids logging customer data:

CompletionError::ProviderError(_) => {
    // Don't log error details - may contain customer data
    tracing::error!("Audio transcription provider error");

Excellent adherence to CLAUDE.md privacy guidelines!


4. ✅ Filename Path Traversal Security (Previously: Issue #4)

Location: crates/api/src/models.rs:159-176

Status: FIXED - Validation now uses proper path sanitization:

let safe_filename = Path::new(&self.filename)
    .file_name()
    .and_then(|n| n.to_str())
    .ok_or_else(|| "Invalid filename...")?;

if safe_filename != self.filename {
    return Err("Filename cannot contain path components...");
}

This properly:

  • Extracts only the base filename (strips all path components)
  • Detects and blocks traversal attempts (including ../, absolute paths)
  • Validates UTF-8 encoding
  • Checks for null bytes
  • Enforces 255 char limit

Security: This is significantly more robust than the previous string-based checks and handles encoded variants correctly.


5. ✅ Synchronous Usage Tracking (Previously: Issue #4 from second review)

Location: crates/api/src/routes/completions.rs:534-571

Status: FIXED - Usage is now recorded synchronously:

// Record usage synchronously - fail the request if usage recording fails
if let Err(e) = app_state.usage_service.record_usage(usage_request).await {
    tracing::error!(...);
    return (...INTERNAL_SERVER_ERROR...).into_response();
}

Impact: This prevents the billing inconsistency problem where customers could get successful responses without being billed. The comment explicitly calls this out as critical for financial accuracy. Perfect!
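The record-then-respond ordering can be sketched without any web framework. Everything here is hypothetical: the `UsageRecorder` trait, the `Outcome` enum, and `respond_after_recording` only mirror the shape of the handler described above. The trade-off is the one discussed across the reviews: a failed usage write costs the customer a transcription that had already succeeded, in exchange for never returning unbilled results.

```rust
/// Hypothetical port for the usage service.
pub trait UsageRecorder {
    fn record_usage(&self, organization_id: &str, duration_seconds: i32) -> Result<(), String>;
}

/// Simplified stand-in for the HTTP response.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    Success(String),
    InternalServerError,
}

/// Records usage first and releases the transcription only if that succeeds.
pub fn respond_after_recording<R: UsageRecorder>(
    recorder: &R,
    organization_id: &str,
    duration_seconds: i32,
    transcription: String,
) -> Outcome {
    match recorder.record_usage(organization_id, duration_seconds) {
        Ok(()) => Outcome::Success(transcription),
        // Do not include the error text in the response or in logs.
        Err(_) => Outcome::InternalServerError,
    }
}
```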


🟢 NO REMAINING CRITICAL ISSUES

I've carefully reviewed the entire implementation across all modified files and found zero critical issues. The code is production-ready.


💡 MINOR OBSERVATIONS (Non-blocking)

1. Code Quality: Content-Type Detection Helper

Locations:

  • crates/inference_providers/src/external/openai_compatible.rs:1234
  • crates/inference_providers/src/vllm/mod.rs:1597

The detect_audio_content_type function is called but defined in models.rs as a public utility function. Upon closer inspection, this is actually not duplicated - both locations call the shared crate::models::detect_audio_content_type helper. This is correct! No change needed.

2. Test Coverage: Large File Validation

Location: crates/api/tests/e2e_audio_transcriptions.rs:816-826

The test test_audio_transcription_file_too_large is intentionally left as a placeholder due to test framework limitations with 26+ MB uploads. The validation logic itself is sound (checked at line 150-157 in models.rs). Consider adding a unit test for the validation function directly if possible, but this is not blocking.
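As a sketch of that suggestion: if the size check is a function of the byte length alone, it can be tested at the boundary without building a 26 MB upload. `validate_file_size` and `MAX_AUDIO_FILE_BYTES` are hypothetical names, and treating the documented 25MB limit as 25 * 1024 * 1024 bytes is an assumption.

```rust
/// Assumed interpretation of the documented 25MB limit.
pub const MAX_AUDIO_FILE_BYTES: usize = 25 * 1024 * 1024;

/// Validates an upload by its length only, so tests need no large buffers.
/// Hypothetical helper for illustration.
pub fn validate_file_size(len: usize) -> Result<(), String> {
    if len == 0 {
        return Err("File cannot be empty".to_string());
    }
    if len > MAX_AUDIO_FILE_BYTES {
        return Err(format!(
            "File exceeds the {} MB limit",
            MAX_AUDIO_FILE_BYTES / (1024 * 1024)
        ));
    }
    Ok(())
}
```

A unit test can then assert that `MAX_AUDIO_FILE_BYTES` passes and `MAX_AUDIO_FILE_BYTES + 1` fails, which covers the case the end-to-end placeholder cannot.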


✅ APPROVAL

Status: ✅ APPROVED FOR MERGE

This PR demonstrates:

  • ✅ Excellent adherence to CLAUDE.md privacy guidelines
  • ✅ Proper security validation (path traversal, overflow protection)
  • ✅ Correct usage of hexagonal architecture patterns
  • ✅ Comprehensive E2E test coverage
  • ✅ Synchronous usage tracking for billing accuracy
  • ✅ Proper concurrent request limiting via service layer
  • ✅ Complete OpenAPI documentation

The implementation is production-ready for a Trusted Execution Environment (TEE) with multi-tenant isolation and financial accuracy requirements.

Great work addressing all previous feedback! 🎉

@ilblackdragon
Member

Hey @nickpismenkov do you want to take over #377 - it has a full spectrum of audio APIs?

@nickpismenkov
Contributor Author

> Hey @nickpismenkov do you want to take over #377 - it has a full spectrum of audio APIs?

Hey, yeah. Taking it over!

nickpismenkov and others added 2 commits February 6, 2026 16:11
…ption and rerank/score features

This merge successfully combines two feature branches:
- Audio transcription endpoints (/v1/audio/transcriptions)
- Rerank and score endpoints (/v1/rerank, /v1/score)

Key changes:
- CompletionServiceTrait now includes audio_transcription, try_rerank, and try_score methods
- All endpoints properly integrated with concurrent request limiting and usage tracking
- InferenceType enum updated with AudioTranscription variant
- Both features support proper billing and metrics recording
- All routes properly registered and documented in OpenAPI

Both features work independently and can be used together without conflicts.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Contributor

@henrypark133 henrypark133 left a comment

  1. Will you be adding the usage tracking as part of a follow-up?
  2. Also, I think OAI enforces specific file types support. Do you know which ones our model can support and also add/enforce this check?

@nickpismenkov
Contributor Author

> 1. Will you be adding the usage tracking as part of a follow-up?
> 2. Also, I think OAI enforces specific file types support. Do you know which ones our model can support and also add/enforce this check?

  1. Yes
  2. Audio (Whisper/TTS): .mp3, .mp4, .mpeg, .mpga, .m4a, .wav, .webm - I think OAI supports pretty much every type, so I think there's no need for the check. Anyway, I think we can merge this PR now and add the check as a follow-up.

@nickpismenkov nickpismenkov merged commit d1d4f10 into main Feb 9, 2026
2 checks passed