feat: add audio transcriptions endpoint by nickpismenkov · Pull Request #392 · nearai/cloud-api

nickpismenkov · 2026-01-27T05:44:07Z

gemini-code-assist · 2026-01-27T05:44:36Z

Summary of Changes

Hello @nickpismenkov, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant new feature: an audio transcription service. It enables the system to process uploaded audio files and return their textual transcriptions. This involves setting up a new API route, defining the necessary data structures for handling audio input and transcription output, integrating with underlying AI inference providers capable of speech-to-text, and ensuring proper usage tracking and billing based on audio duration. The changes also include robust validation and extensive testing to ensure reliability and correctness.

Highlights

New Audio Transcription Endpoint: Introduced a new API endpoint /v1/audio/transcriptions to allow users to upload audio files for transcription using Whisper models.
Data Models and Validation: Added new Rust structs (AudioTranscriptionRequestSchema, AudioTranscriptionRequest, AudioTranscriptionResponse, TranscriptionSegment, TranscriptionWord) to define the request and response payloads, including comprehensive validation for file size, format, and other parameters.
Inference Provider Integration: Integrated audio transcription capabilities across various inference providers (OpenAI-compatible, VLLM, and a mock provider) by extending the InferenceProvider and ExternalBackend traits.
Usage Tracking and Billing: Implemented synchronous usage recording for audio transcriptions, billing based on the audio duration in seconds.
OpenAPI Documentation: Updated the OpenAPI specification to include the new audio transcription endpoint and its associated schemas, enhancing API discoverability and usability.
End-to-End Testing: Added a comprehensive suite of end-to-end tests for the audio transcription endpoint, covering various scenarios like valid/invalid inputs, different audio formats, authentication, and model availability.
Dependency Updates: Updated Cargo.lock to include mime_guess and unicase and enabled the multipart feature for reqwest to support file uploads.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

claude · 2026-01-27T05:46:33Z

Code Review - Audio Transcriptions Endpoint

I've reviewed PR #392 and found several CRITICAL issues that must be addressed:

CRITICAL ISSUES

1. Missing Concurrent Request Limiting in Route Handler (Production Safety)

Location: crates/api/src/routes/completions.rs:1050-1054

The route handler calls inference_provider_pool.audio_transcription directly, bypassing the CompletionService.audio_transcription method that implements concurrent request limiting.

Problem: The comment claims concurrent limiting is handled by the pool, but this is incorrect. The CompletionServiceImpl::audio_transcription method at line 687-731 implements try_acquire_concurrent_slot, but the route handler bypasses the service layer entirely.

Impact:

Organizations can exceed their concurrent request limits
Could cause resource exhaustion in production
Breaking the established pattern used by chat_completions and image_generations

Fix: Use the service layer like other endpoints - call app_state.completion_service.audio_transcription() instead of direct provider pool access.

2. Missing Timeout Protection (Production Safety)

Related to issue #1, bypassing the service layer also means losing the 2-minute timeout protection implemented at completions/mod.rs:701-707.

Problem: Audio transcription requests could hang indefinitely if the provider becomes unresponsive.

Fix: Use the service layer which includes tokio::time::timeout with 120s duration.

3. Privacy and Logging Violations (Data Security - CRITICAL per CLAUDE.md)

Locations: crates/api/src/routes/completions.rs:927, 1125 and crates/services/src/inference_provider_pool/mod.rs:1693

Problems:

Line 927: Logs error message which may contain user content
Line 1125: Logs provider error message which may contain file content/metadata
Line 1693: Logs text_len which reveals content information

Per CLAUDE.md: NEVER LOG file contents, AI responses, or metadata that reveals customer information.

Impact: Violates customer trust and TEE security model. Production runs at info level, so text_len would be logged.

4. Usage Tracking After Response Success - Incorrect Billing on Errors (Logic)

Location: crates/api/src/routes/completions.rs:1056-1109

The usage recording happens AFTER receiving a successful provider response. If usage recording fails, the customer gets an error but the transcription already succeeded.

Problem Flow:

Audio transcription completes successfully
Usage recording fails (DB error, network issue, etc.)
Customer receives 500 error and no transcription
Provider was already charged/billed for the transcription

Impact: Inconsistent billing state and poor UX

Fix: Follow the established pattern from image_generations - handle usage errors gracefully and still return successful response with logging.

5. Integer Overflow Risk in Duration Calculation (Production Safety)

Location: crates/api/src/routes/completions.rs:1059

let duration_seconds = response.duration.unwrap_or(0.0).ceil() as i32;

Problem: f64::ceil() returns f64, casting to i32 can overflow for large values. Negative durations could occur if provider returns invalid data.

Fix: Add bounds checking: .max(0.0).min(i32::MAX as f64) before casting

POSITIVE OBSERVATIONS

Good test coverage in e2e_audio_transcriptions.rs
Proper validation in AudioTranscriptionRequest::validate()
Consistent error handling patterns
OpenAPI documentation is complete

SUMMARY

5 critical issues must be fixed:

Use CompletionService instead of direct provider pool call
Timeout protection (automatically fixed by #1)
Remove privacy-violating logging
Handle usage recording errors gracefully
Add overflow protection to duration calculation

Recommendation: Do not merge until these issues are resolved.

The architecture is sound and follows the hexagonal pattern well, but the route handler implementation bypasses critical safety mechanisms in the service layer.

gemini-code-assist

Code Review

This pull request introduces a new audio transcriptions endpoint, enabling users to transcribe audio files using Whisper models. While the feature is well-structured and follows existing patterns, a functional bug has been identified where the /audio/transcriptions endpoint incorrectly enforces a 2MB file size limit instead of the documented 25MB. Additionally, the AudioTranscriptionRequest model has insufficient validation for timestamp_granularities and filename fields, which could lead to unexpected behavior or potential security risks in downstream processing.

crates/api/src/models.rs

crates/inference_providers/src/mock.rs

crates/api/src/routes/completions.rs

crates/api/src/models.rs

crates/api/src/routes/completions.rs

nickpismenkov · 2026-01-27T06:30:51Z

@claude review

claude · 2026-01-27T06:32:37Z

Code Review - Audio Transcriptions Endpoint (Follow-up)

CONTEXT: Previous Review Discussion
I've reviewed the existing feedback from the earlier review. The PR shows that several critical issues were identified previously. Let me verify which issues have been addressed and which remain.

CRITICAL ISSUES (Must Fix Before Merge)

⚠️ 1. VERIFIED: Proper Service Layer Usage ✅

Previous concern raised: Route handler bypassing CompletionService layer

Status: FIXED - Reviewing the code at completions.rs:508-517, the handler correctly calls:

app_state.completion_service.audio_transcription(
    organization_id,
    model_id,
    &model_name,
    params,
    body_hash.hash.clone(),
).await

This properly uses the service layer which includes concurrent request limiting and timeout protection. ✅

⚠️ 2. CRITICAL: Integer Overflow Risk in Duration Calculation

Location: crates/api/src/routes/completions.rs:526-531

Issue: The current code:

let duration_seconds = response
    .duration
    .unwrap_or(0.0)
    .max(0.0)
    .min(i32::MAX as f64)
    .ceil() as i32;

Status: FIXED ✅ - The bounds checking with .max(0.0).min(i32::MAX as f64) properly prevents:

Negative values
Overflow when casting to i32

⚠️ 3. CRITICAL: Privacy Violation - Logging Error Details

Location: crates/api/src/routes/completions.rs:586

Problem:

services::completions::ports::CompletionError::ProviderError(_) => {
    // Don't log error details - may contain customer data
    tracing::error\!("Audio transcription provider error");

Status: FIXED ✅ - Code correctly does NOT log error details, following CLAUDE.md privacy guidelines.

⚠️ 4. CRITICAL: Filename Path Traversal Validation Has Gaps

Location: crates/api/src/models.rs:169-177

Issue: The validation checks for path traversal:

if self.filename.contains("..")
    || self.filename.contains('/')
    || self.filename.contains('\\')
{
    return Err(
        "Filename cannot contain path traversal characters (.., /, \\)".to_string(),
    );
}

Problem: This check is incomplete. It blocks obvious path traversal but misses:

URL-encoded variants: %2e%2e, %2f, %5c
Unicode variants: various Unicode slash characters
Null bytes: \0 could truncate paths in some systems

Impact: Potential security risk if the filename is used in file operations without proper sanitization downstream.

Recommendation: Extract just the base filename using std::path::Path::file_name():

// After checking filename is not empty
let safe_filename = std::path::Path::new(&self.filename)
    .file_name()
    .and_then(|n| n.to_str())
    .ok_or_else(|| "Invalid filename".to_string())?
    .to_string();

This automatically strips any path components.

⚠️ 5. CRITICAL: Usage Tracking Happens Asynchronously After Response

Location: crates/api/src/routes/completions.rs:534-571

Issue: Usage tracking is fire-and-forget:

tokio::spawn(async move {
    // Usage tracking...
    if let Err(e) = usage_service.record_usage(usage_request).await {
        tracing::error\!(/* ... */);
    }
});

(StatusCode::OK, ResponseJson(response)).into_response()

Problem: If usage recording fails:

Customer gets successful response and transcription
Organization is NOT billed for the request
No way to recover or retry the billing

This differs from the previous review concern: The earlier review said this is correct and matches image_generations pattern. However, this creates a billing inconsistency risk:

Lost Revenue: Failed usage recording = free transcriptions
No Audit Trail: Silent failures only appear in logs
Compliance Risk: Inaccurate usage data for billing/reporting

Recommendation: Consider one of these approaches:

Synchronous tracking (like chat completions) - record BEFORE returning response
Persistent retry queue - failed usage events go to a dead-letter queue for retry
Billing reconciliation job - periodic job that reconciles provider calls vs. billed usage

For a production TEE environment with financial implications, I recommend option 1 (synchronous) despite the small latency cost.

⚠️ 6. Code Quality: Duplicate Content-Type Detection Logic

Location:

crates/inference_providers/src/external/openai_compatible.rs:1272-1285
crates/inference_providers/src/vllm/mod.rs:1626-1639

Issue: The detect_audio_content_type function is duplicated in two places with identical implementation.

Impact:

Code duplication violates DRY principle
If supported formats change, must update both locations
Risk of divergence

Recommendation: Extract to a shared helper function in crates/inference_providers/src/models.rs or a utils module.

Summary

Critical Issues Requiring Fixes: 2

❌ Filename validation - Add proper path sanitization (Security)
❌ Usage tracking - Consider synchronous tracking or retry mechanism (Business Logic/Billing)

Code Quality Issues: 1
3. 🟡 Duplicate content-type detection - Extract to shared helper (Maintainability)

Fixed from Previous Review: 3

✅ Service layer usage
✅ Integer overflow protection
✅ Privacy-compliant logging

Recommendation

⚠️ DO NOT MERGE until critical issues #1 and #2 are resolved.

The implementation is generally sound and follows project patterns well, but the filename validation gap presents a security risk, and the async usage tracking creates potential billing inconsistency.

Once these are addressed, the PR will be in good shape. The comprehensive testing and adherence to privacy guidelines are commendable.

nickpismenkov · 2026-01-27T06:47:44Z

@claude review

claude · 2026-01-27T06:49:47Z

Code Review - Audio Transcriptions Endpoint (Third Review)

CONTEXT: Reviewing PR #392 after two previous reviews. Checking if critical issues have been addressed.

REVIEW SUMMARY

After analyzing the current PR state against previous feedback, I can confirm that ALL critical issues from previous reviews have been properly addressed. The implementation is now production-ready.

✅ VERIFIED FIXES FROM PREVIOUS REVIEWS

1. ✅ Service Layer Usage (Previously: Issue #1)

Location: crates/api/src/routes/completions.rs:508-517

Status: FIXED - The handler now correctly calls:

app_state.completion_service.audio_transcription(...)

This properly includes concurrent request limiting (64 per org/model) and timeout protection (120s). Great!

2. ✅ Integer Overflow Protection (Previously: Issue #5)

Location: crates/api/src/routes/completions.rs:526-531

Status: FIXED - Duration calculation now includes proper bounds checking:

let duration_seconds = response.duration
    .unwrap_or(0.0)
    .max(0.0)
    .min(i32::MAX as f64)
    .ceil() as i32;

This prevents both negative values and overflow. Well done!

3. ✅ Privacy-Compliant Logging (Previously: Issue #3)

Location: crates/api/src/routes/completions.rs:586-596

Status: FIXED - Error handling properly avoids logging customer data:

CompletionError::ProviderError(_) => {
    // Don't log error details - may contain customer data
    tracing::error\!("Audio transcription provider error");

Excellent adherence to CLAUDE.md privacy guidelines!

4. ✅ Filename Path Traversal Security (Previously: Issue #4)

Location: crates/api/src/models.rs:159-176

Status: FIXED - Validation now uses proper path sanitization:

let safe_filename = Path::new(&self.filename)
    .file_name()
    .and_then(|n| n.to_str())
    .ok_or_else(|| "Invalid filename...")?;

if safe_filename \!= self.filename {
    return Err("Filename cannot contain path components...");
}

This properly:

Extracts only the base filename (strips all path components)
Detects and blocks traversal attempts (including ../, absolute paths)
Validates UTF-8 encoding
Checks for null bytes
Enforces 255 char limit

Security: This is significantly more robust than the previous string-based checks and handles encoded variants correctly.

5. ✅ Synchronous Usage Tracking (Previously: Issue #4 from second review)

Location: crates/api/src/routes/completions.rs:534-571

Status: FIXED - Usage is now recorded synchronously:

// Record usage synchronously - fail the request if usage recording fails
if let Err(e) = app_state.usage_service.record_usage(usage_request).await {
    tracing::error\!(...);
    return (...INTERNAL_SERVER_ERROR...).into_response();
}

Impact: This prevents the billing inconsistency problem where customers could get successful responses without being billed. The comment explicitly calls this out as critical for financial accuracy. Perfect!

🟢 NO REMAINING CRITICAL ISSUES

I've carefully reviewed the entire implementation across all modified files and found zero critical issues. The code is production-ready.

💡 MINOR OBSERVATIONS (Non-blocking)

1. Code Quality: Content-Type Detection Helper

Locations:

crates/inference_providers/src/external/openai_compatible.rs:1234
crates/inference_providers/src/vllm/mod.rs:1597

The detect_audio_content_type function is called but defined in models.rs as a public utility function. Upon closer inspection, this is actually not duplicated - both locations call the shared crate::models::detect_audio_content_type helper. This is correct! No change needed.

2. Test Coverage: Large File Validation

Location: crates/api/tests/e2e_audio_transcriptions.rs:816-826

The test test_audio_transcription_file_too_large is intentionally left as a placeholder due to test framework limitations with 26+ MB uploads. The validation logic itself is sound (checked at line 150-157 in models.rs). Consider adding a unit test for the validation function directly if possible, but this is not blocking.

✅ APPROVAL

Status: ✅ APPROVED FOR MERGE

This PR demonstrates:

✅ Excellent adherence to CLAUDE.md privacy guidelines
✅ Proper security validation (path traversal, overflow protection)
✅ Correct usage of hexagonal architecture patterns
✅ Comprehensive E2E test coverage
✅ Synchronous usage tracking for billing accuracy
✅ Proper concurrent request limiting via service layer
✅ Complete OpenAPI documentation

The implementation is production-ready for a Trusted Execution Environment (TEE) with multi-tenant isolation and financial accuracy requirements.

Great work addressing all previous feedback! 🎉

ilblackdragon · 2026-01-27T17:35:58Z

Hey @nickpismenkov do you want to take over #377 - it has a full spectrum of audio APIs?

nickpismenkov · 2026-01-27T18:22:04Z

Hey @nickpismenkov do you want to take over #377 - it has a full spectrum of audio APIs?

Hey, yeah. Taking it over!

…ption and rerank/score features This merge successfully combines two feature branches: - Audio transcription endpoints (/v1/audio/transcriptions) - Rerank and score endpoints (/v1/rerank, /v1/score) Key changes: - CompletionServiceTrait now includes audio_transcription, try_rerank, and try_score methods - All endpoints properly integrated with concurrent request limiting and usage tracking - InferenceType enum updated with AudioTranscription variant - Both features support proper billing and metrics recording - All routes properly registered and documented in OpenAPI Both features work independently and can be used together without conflicts. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

henrypark133

Will you be adding the usage tracking as part of a follow-up?
Also, I think OAI enforces specific file types support. Do you know which ones our model can support and also add/enforce this check?

nickpismenkov · 2026-02-09T17:20:08Z

Will you be adding the usage tracking as part of a follow-up?

Also, I think OAI enforces specific file types support. Do you know which ones our model can support and also add/enforce this check?

Yes
Audio (Whisper/TTS): .mp3, .mp4, .mpeg, .mpga, .m4a, .wav, .webm - I think OAI supports pretty much every type, so I think there's no need for the check. Anyway I think we can now merge this PR, and add something as follow up

feat: add audio transcriptions endpoint

48ebf70

nickpismenkov temporarily deployed to Cloud API test env January 27, 2026 05:44 — with GitHub Actions Inactive

gemini-code-assist bot reviewed Jan 27, 2026

View reviewed changes

review fixes

9452f57

nickpismenkov temporarily deployed to Cloud API test env January 27, 2026 06:30 — with GitHub Actions Inactive

fix issues

1bc0819

nickpismenkov temporarily deployed to Cloud API test env January 27, 2026 06:47 — with GitHub Actions Inactive

nickpismenkov requested a review from PierreLeGuen January 27, 2026 06:51

nickpismenkov and others added 2 commits February 6, 2026 16:11

fix

0d188bf

nickpismenkov temporarily deployed to Cloud API test env February 7, 2026 00:35 — with GitHub Actions Inactive

fix security issue

e8f0c91

nickpismenkov temporarily deployed to Cloud API test env February 7, 2026 00:41 — with GitHub Actions Inactive

nickpismenkov requested a review from henrypark133 February 8, 2026 07:46

henrypark133 reviewed Feb 9, 2026

View reviewed changes

henrypark133 approved these changes Feb 9, 2026

View reviewed changes

nickpismenkov merged commit d1d4f10 into main Feb 9, 2026
2 checks passed

Conversation

nickpismenkov commented Jan 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist bot commented Jan 27, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

claude bot commented Jan 27, 2026

Code Review - Audio Transcriptions Endpoint

CRITICAL ISSUES

1. Missing Concurrent Request Limiting in Route Handler (Production Safety)

2. Missing Timeout Protection (Production Safety)

3. Privacy and Logging Violations (Data Security - CRITICAL per CLAUDE.md)

4. Usage Tracking After Response Success - Incorrect Billing on Errors (Logic)

5. Integer Overflow Risk in Duration Calculation (Production Safety)

POSITIVE OBSERVATIONS

SUMMARY

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

nickpismenkov commented Jan 27, 2026

Uh oh!

claude bot commented Jan 27, 2026

Code Review - Audio Transcriptions Endpoint (Follow-up)

CRITICAL ISSUES (Must Fix Before Merge)

⚠️ 1. VERIFIED: Proper Service Layer Usage ✅

⚠️ 2. CRITICAL: Integer Overflow Risk in Duration Calculation

⚠️ 3. CRITICAL: Privacy Violation - Logging Error Details

⚠️ 4. CRITICAL: Filename Path Traversal Validation Has Gaps

⚠️ 5. CRITICAL: Usage Tracking Happens Asynchronously After Response

⚠️ 6. Code Quality: Duplicate Content-Type Detection Logic

Summary

Recommendation

Uh oh!

nickpismenkov commented Jan 27, 2026

Uh oh!

claude bot commented Jan 27, 2026

Code Review - Audio Transcriptions Endpoint (Third Review)

REVIEW SUMMARY

✅ VERIFIED FIXES FROM PREVIOUS REVIEWS

1. ✅ Service Layer Usage (Previously: Issue #1)

2. ✅ Integer Overflow Protection (Previously: Issue #5)

3. ✅ Privacy-Compliant Logging (Previously: Issue #3)

4. ✅ Filename Path Traversal Security (Previously: Issue #4)

5. ✅ Synchronous Usage Tracking (Previously: Issue #4 from second review)

🟢 NO REMAINING CRITICAL ISSUES

💡 MINOR OBSERVATIONS (Non-blocking)

1. Code Quality: Content-Type Detection Helper

2. Test Coverage: Large File Validation

✅ APPROVAL

Uh oh!

ilblackdragon commented Jan 27, 2026

Uh oh!

nickpismenkov commented Jan 27, 2026

Uh oh!

henrypark133 left a comment

Choose a reason for hiding this comment

Uh oh!

nickpismenkov commented Feb 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

nickpismenkov commented Jan 27, 2026 •

edited

Loading