Release 6.12.0

ondewo-jenkins released this 13 Jun 07:12

51e36aa

Release ONDEWO NLU API 6.12.0

New Features

[OND221-2774] llm_evaluation.proto: Release gates as first-class entities. Adds full CRUD for LlmEvaluationReleaseGate (configurable thresholds, safety rules and criteria weights), a long-running LlmEvaluationRunReleaseGate RPC, and a persisted verdict history made up of LlmEvaluationReleaseGateRun, LlmEvaluationReleaseGateVerdict and per-check LlmEvaluationReleaseGateCheck records.
[OND221-2774] llm_evaluation.proto: Scorecards and project settings. Adds CRUD for LlmEvaluationScorecard (weighted, multi-criteria score roll-ups) and a per-(project, language_code) LlmEvaluationProjectSettings singleton that holds LlmEvaluationJudgeConfig (which judge CcaiService to use, a verbose-reasoning toggle and per-evaluator overrides).
[OND221-2774] llm_evaluation.proto: Evaluator registry. The new LlmEvaluationListEvaluators RPC returns the available evaluators as LlmEvaluationEvaluatorSpec entries, each describing its category, required example fields, multi-turn support, default threshold, whether it needs a judge, and its parameter specs.
[OND221-2774] llm_evaluation.proto: Multi-turn, conversation-flow evaluation. A run can now cover a whole conversation: the new LlmEvaluationTurnResult captures expected vs. actual output per turn plus a per-turn telemetry join key, and is attached to LlmEvaluationEvaluatorRun (which also gains repetition_index and actual_output). RunLlmEvaluationExperimentRequest gains repetitions, llm_evaluation_experiment_kind and evaluator_configs, datasets gain an LlmEvaluationDatasetType, and a new LlmEvaluationExperimentKind enum is added.
[OND221-2774] llm_evaluation.proto: Build datasets from real traffic and simulate new traffic. LlmEvaluationCreateExamplesFromSession turns recorded sessions into golden transcripts, and the long-running LlmEvaluationSimulateConversations generates persona-driven user simulations and adversarial red-teaming conversations.
[OND221-2774] llm_evaluation.proto: Scheduling and reports. Adds CRUD for LlmEvaluationSchedule (recurring experiment or release-gate runs, by cron or interval) and for immutable LlmEvaluationReport artifacts (Create / Get / List / Delete, with the report stored as payload bytes).
[OND221-2774] agent.proto: New GetSessionsStatisticsTimeSeries RPC for time-bucketed LLM telemetry. Each LlmTelemetryTimeSeriesBucket carries a full LlmTelemetryReport, bucketing is performed server-side, and the request accepts the same llm_* filters as the other statistics RPCs.
[OND221-2774] llm_evaluation.proto: A/B experiments on the LlmEvaluations service. LlmEvaluationAbExperiment (its variants, traffic config and lifecycle status) and each LlmEvaluationAbVariant (per-variant CcaiService / model / prompt overrides, traffic weight and a control flag) get full CRUD (Create / Get / List with LlmEvaluationAbExperimentFilter, Update via update_mask, and Delete). Start validates that the variant traffic weights sum correctly and sets the experiment to RUNNING; Stop ends it. The stateless GetAbExperimentResults returns per-variant LlmEvaluationAbVariantResult roll-ups. Adds the LlmEvaluationAbExperimentStatus enum.
[OND221-2774] session.proto: Native (regex-based, no-LLM) safety monitoring on live traffic. LlmTelemetry gains an LlmSafetyAssessment (field 61) with flagged categories, PII / prompt-injection / jailbreak flags, an overall safety score and individual LlmSafetyFinding records. LlmTelemetryReport gains LlmSafetyStats (field 20) with per-category counts and rates (LlmSafetyCategoryStat) and a mean score. agent.proto adds the report types AGENT_LLM_SAFETY (25) and SESSION_LLM_SAFETY (41).
[OND221-2774] llm_evaluation.proto: A/B testing for RAGFlow variants plus manual rollout. A new RagVariantConfig (chat-assistant LLM CcaiService, top_k, similarity_threshold, vector_similarity_weight and an optional rerank CcaiService) is added as rag_variant_config (field 9) on LlmEvaluationAbVariant, so RAG-project variants can override the chat-assistant LLM and retrieval parameters instead of using ccai_service_names. Rollout stays manual (there is no auto-rollout): the read-only, computed LlmEvaluationAbRolloutRecommendation compares the winning variant against the control on an LlmEvaluationAbOptimizeMetric (one of PASS_RATE, ERROR_RATE, MEAN_LATENCY, CRITERION_SCORE or SAFETY_SCORE) and reports p_value, effect_size, is_significant, sessions_per_variant, needs_more_data and a reason; it is fetched via the stateless LlmEvaluationGetAbRolloutRecommendation. LlmEvaluationApplyAbRollout then promotes the operator-chosen variant as the project's default classifier, stops the experiment, and writes an idempotent LlmEvaluationAbRolloutDecision audit record that can be read back via LlmEvaluationGetAbRolloutDecision and LlmEvaluationListAbRolloutDecisions (with LlmEvaluationAbRolloutDecisionFilter). LlmEvaluationAbExperiment gains llm_evaluation_ab_rollout_decision_name (field 15) linking the applied decision. Adds the LlmEvaluationAbOptimizeMetric enum.
[OND221-2774] llm_evaluation.proto: Continuous (online, in-production) evaluation with a human annotation queue. LlmEvaluationOnlineConfig (a reference-free evaluator set, sample_rate, fail_threshold, settle_seconds, require_telemetry, an optional LlmEvaluationOnlineSessionFilter and observability counters) gets full CRUD (Create / Get / List with LlmEvaluationOnlineConfigFilter, Update via update_mask, and Delete). Workers write read-only LlmEvaluationOnlineResult records (one per scored session step, each with an embedded LlmEvaluationFeedback list, a passed flag and an aggregate_score), available via Get / List (with LlmEvaluationOnlineResultFilter). Every failing step becomes an LlmEvaluationAnnotationQueueItem (Get / List with LlmEvaluationAnnotationQueueItemFilter, and Update via update_mask for status, assignee and reason transitions); LlmEvaluationPromoteAnnotationQueueItem reuses CreateExamplesFromSession to turn an item into dataset examples, flipping its status to PROMOTED and returning the created examples. Adds the LlmEvaluationAnnotationStatus enum.
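
To make the fields of LlmEvaluationAbRolloutRecommendation more concrete, here is a minimal Python sketch of how a PASS_RATE comparison between a control and a candidate variant could be computed. These notes do not say which statistical test the server uses; the two-proportion z-test, the `alpha` and `min_sessions` parameters, the `RolloutRecommendation` dataclass and the `recommend_rollout` function are all assumptions made for illustration. Only the field names come from the release notes.

```python
import math
from dataclasses import dataclass


@dataclass
class RolloutRecommendation:
    # Field names mirror the release notes; the types are assumptions.
    p_value: float
    effect_size: float
    is_significant: bool
    sessions_per_variant: int
    needs_more_data: bool
    reason: str


def recommend_rollout(control_passed, control_total,
                      variant_passed, variant_total,
                      alpha=0.05, min_sessions=100):
    """Compare pass rates with a two-sided, two-proportion z-test.

    Hypothetical helper: alpha and min_sessions are illustrative defaults,
    not values taken from the ONDEWO API.
    """
    sessions = min(control_total, variant_total)
    if sessions < min_sessions:
        return RolloutRecommendation(
            1.0, 0.0, False, sessions, True,
            "not enough sessions per variant")

    # Effect size here is the absolute difference in pass rate.
    effect = variant_passed / variant_total - control_passed / control_total

    # Standard error under the null hypothesis of equal pass rates.
    pooled = (control_passed + variant_passed) / (control_total + variant_total)
    std_err = math.sqrt(
        pooled * (1.0 - pooled) * (1.0 / control_total + 1.0 / variant_total))
    if std_err == 0.0:
        # Every session passed or every session failed in both variants.
        return RolloutRecommendation(
            1.0, effect, False, sessions, False, "no variance in outcomes")

    z = effect / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    significant = p_value < alpha
    if significant:
        reason = "pass rates differ significantly"
    else:
        reason = "difference is not statistically significant"
    return RolloutRecommendation(
        p_value, effect, significant, sessions, False, reason)
```

Under these assumptions, `recommend_rollout(400, 1000, 500, 1000)` reports a significant effect of about 0.1, while `recommend_rollout(5, 10, 9, 10)` sets `needs_more_data`. In both cases the result is only advisory: as noted above, rollout stays manual and the operator applies it via LlmEvaluationApplyAbRollout.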

Improvements

[OND221-2774] llm_evaluation.proto: New LlmEvaluationUpdateFeedback RPC for correcting existing feedback records via an update_mask.
[OND221-2774] session.proto: Typed retrieval metadata on LlmTelemetry (field 62). The new LlmRetrievalMetadata and LlmRetrievedChunk (document_id, chunk_id, score, text, source_uri, rank) are populated alongside the existing unstructured outputs.retrieved_chunks Struct, which is kept for backward compatibility.
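
Several RPCs in this release, including LlmEvaluationUpdateFeedback, take an update_mask. The following Python sketch illustrates the usual field-mask semantics with plain dictionaries: only the paths listed in the mask are copied from the request onto the stored record. The `apply_update_mask` helper and the `score` and `comment` fields are hypothetical and are not taken from llm_evaluation.proto; the server's actual handling of missing or nested paths may differ.

```python
import copy


def apply_update_mask(stored, patch, paths):
    """Return a copy of `stored` with only the masked paths taken from `patch`.

    Paths use dotted notation for nested fields (for example "meta.author").
    Raises KeyError if `patch` lacks a path that the mask names.
    """
    updated = copy.deepcopy(stored)
    for path in paths:
        keys = path.split(".")
        src = patch
        dst = updated
        # Walk down to the parent of the field being updated.
        for key in keys[:-1]:
            src = src[key]
            dst = dst.setdefault(key, {})
        dst[keys[-1]] = copy.deepcopy(src[keys[-1]])
    return updated


# Hypothetical feedback record and correction request.
stored = {"name": "feedback/1", "score": 0.2, "comment": "old"}
patch = {"score": 0.9, "comment": "not applied"}
updated = apply_update_mask(stored, patch, ["score"])
```

Here `updated["score"]` becomes 0.9 while `comment` keeps its stored value, because `comment` is not listed in the mask.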

Bug Fixes

None in this release.
