fix: adds onboarding rollback to allow user to edit embedding provider if ingestion fails#617
fix: adds onboarding rollback to allow user to edit embedding provider if ingestion fails#617
Conversation
There was a problem hiding this comment.
Pull request overview
This PR introduces robust error handling for the onboarding flow by adding a rollback mechanism when ingestion tasks fail, improves provider validation with detailed error messages, and ensures queries only run when appropriate (after onboarding completion and with healthy providers).
- Added backend rollback endpoint to cancel tasks, delete ingested files, and reset configuration when onboarding fails
- Enhanced provider validation to use full health checks during onboarding and improved error message extraction for all providers
- Added onboarding completion state tracking in frontend to gate queries and prevent them from running during onboarding or when providers are unhealthy
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
src/utils/telemetry/message_id.py |
Added telemetry event for onboarding rollback |
src/main.py |
Added /onboarding/rollback POST endpoint for triggering rollback |
src/api/settings.py |
Implemented rollback_onboarding function to cancel tasks, delete documents, and reset config; changed onboarding validation to use full health checks |
src/api/provider_validation.py |
Added error parsing helpers and improved error messages for all provider validation failures |
frontend/contexts/chat-context.tsx |
Added isOnboardingComplete state tracking based on settings and localStorage |
frontend/components/provider-health-banner.tsx |
Improved error message display to handle duplicate errors gracefully |
frontend/components/chat-renderer.tsx |
Calls setOnboardingComplete(true) when onboarding finishes |
frontend/app/onboarding/_components/onboarding-card.tsx |
Added rollback mutation hook and logic to detect failed files (though mutation is never invoked) |
frontend/app/onboarding/_components/ibm-onboarding.tsx |
Only formatting changes (spaces to tabs) |
frontend/app/api/queries/useProviderHealthQuery.ts |
Added onboarding completion and active ingestion checks to gate health queries |
frontend/app/api/queries/useGetNudgesQuery.ts |
Added onboarding completion and LLM health checks to gate nudges queries |
frontend/app/api/queries/useGetConversationsQuery.ts |
Added onboarding completion check to gate conversations queries |
frontend/app/api/mutations/useOnboardingRollbackMutation.ts |
New mutation hook for calling the rollback endpoint |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| (!!debouncedEndpoint && !!debouncedApiKey && !!debouncedProjectId) || | ||
| getFromEnv || |
There was a problem hiding this comment.
The enabled condition for fetching IBM models has a logical flaw. It enables the query when getFromEnv is true OR when alreadyConfigured is true, even if the required credentials (endpoint, apiKey, projectId) are not provided.
When getFromEnv is true but the user hasn't provided endpoint or projectId, the query will be enabled but will likely fail or return no useful data. The correct logic should ensure all required fields are present when not using environment variables:
enabled:
(!!debouncedEndpoint && !!debouncedProjectId && (getFromEnv || !!debouncedApiKey)) ||
alreadyConfigured,This ensures the query only runs when:
- Endpoint AND projectId are provided AND (env key is used OR api key is provided), OR
- Already configured (reusing existing credentials)
| (!!debouncedEndpoint && !!debouncedApiKey && !!debouncedProjectId) || | |
| getFromEnv || | |
| ( | |
| !!debouncedEndpoint && | |
| !!debouncedProjectId && | |
| (getFromEnv || !!debouncedApiKey) | |
| ) || |
| // Include hasChatError in queryKey so React Query refetches when it changes | ||
| // This ensures the health check runs with test_completion=true when chat errors occur | ||
| const testCompletion = params?.test_completion ?? hasChatError; | ||
| const queryKey = ["provider", "health", testCompletion, hasChatError]; |
There was a problem hiding this comment.
The queryKey includes hasChatError redundantly. Line 117 computes testCompletion = params?.test_completion ?? hasChatError, and then line 118 creates the queryKey as ["provider", "health", testCompletion, hasChatError].
Since testCompletion already captures the value of hasChatError when params?.test_completion is undefined, including both in the queryKey is redundant. The query will unnecessarily refetch when hasChatError changes even though testCompletion already represents the same state.
Consider simplifying to:
const queryKey = ["provider", "health", testCompletion];| const queryKey = ["provider", "health", testCompletion, hasChatError]; | |
| const queryKey = ["provider", "health", testCompletion]; |
| yaml.dump(current_config.to_dict(), f, default_flow_style=False, indent=2) | ||
|
|
||
| # Update cached config | ||
| config_manager._config = current_config |
There was a problem hiding this comment.
Directly accessing the private attribute config_manager._config violates encapsulation and could lead to maintenance issues if the internal implementation of config_manager changes. While this approach works to bypass the save_config_file method (which would set edited=True), it's accessing a private implementation detail.
Consider adding a public method to config_manager like save_config_with_edited_flag(config, edited=True) to allow saving the configuration with explicit control over the edited flag, rather than directly accessing _config.
| config_manager._config = current_config | |
| config_manager.save_config_with_edited_flag(current_config, edited=False) |
| details += f" (code: {code})" | ||
| return details | ||
|
|
||
| # Anthropic format: {"error": {"message": "...", "type": "..."}} |
There was a problem hiding this comment.
The comment on line 87 says "Anthropic format" but the code on lines 88-89 checks for a direct "message" field in error_data, not under an "error" object. This is inconsistent with the comment and doesn't match typical Anthropic error response format which usually has {"error": {"message": "...", "type": "..."}}.
This block (lines 88-89) actually handles a generic format where message is at the root level, not Anthropic-specific. The comment should be updated to reflect this or the code should be moved to a more appropriate location in the conditional chain.
| # Anthropic format: {"error": {"message": "...", "type": "..."}} | |
| # Generic format: {"message": "..."} |
| }; | ||
| }, [settings?.edited]); | ||
|
|
||
| const setOnboardingComplete = useCallback((complete: boolean) => { |
There was a problem hiding this comment.
The setOnboardingComplete function allows manually setting the onboarding completion state, but this can lead to inconsistency with the computed state in the useEffect (lines 133-153).
The useEffect computes isOnboardingComplete based on settings?.edited && !stepKeyExists, but calling setOnboardingComplete(true) directly overrides this without updating the underlying conditions (localStorage or settings). This means:
- If the useEffect runs again (e.g., on settings change), it might reset the state back
- The manual state and computed state can diverge
Consider either:
- Removing the manual
setOnboardingCompleteand relying solely on the computed state from settings and localStorage - Or having
setOnboardingCompletealso update the underlying conditions (localStorage/settings) to keep them in sync
| const setOnboardingComplete = useCallback((complete: boolean) => { | |
| const setOnboardingComplete = useCallback((complete: boolean) => { | |
| if (typeof window !== "undefined") { | |
| if (complete) { | |
| // Remove the onboarding step key to indicate completion | |
| localStorage.removeItem(ONBOARDING_STEP_KEY); | |
| } else { | |
| // Set the onboarding step key to indicate onboarding is not complete | |
| localStorage.setItem(ONBOARDING_STEP_KEY, "true"); | |
| } | |
| } |
| // If any file failed, show error and jump back one step (like onboardingMutation.onError) | ||
| // Only handle if we haven't already handled this task | ||
| if ( | ||
| taskWithFailedFile && | ||
| !rollbackMutation.isPending && | ||
| !isCompleted && | ||
| !handledFailedTasksRef.current.has(taskWithFailedFile.task_id) | ||
| ) { | ||
| console.error("File failed in task, jumping back one step", taskWithFailedFile); | ||
|
|
||
| // Mark this task as handled to prevent infinite loops | ||
| handledFailedTasksRef.current.add(taskWithFailedFile.task_id); | ||
|
|
||
| // Extract error messages from failed files | ||
| const errorMessages: string[] = []; | ||
| if (taskWithFailedFile.files) { | ||
| Object.values(taskWithFailedFile.files).forEach((file) => { | ||
| if ((file.status === "failed" || file.status === "error") && file.error) { | ||
| errorMessages.push(file.error); | ||
| } | ||
| }); | ||
| } | ||
|
|
||
| // Also check task-level error | ||
| if (taskWithFailedFile.error) { | ||
| errorMessages.push(taskWithFailedFile.error); | ||
| } | ||
|
|
||
| // Use the first error message, or a generic message if no errors found | ||
| const errorMessage = errorMessages.length > 0 | ||
| ? errorMessages[0] | ||
| : "Sample data file failed to ingest. Please try again with a different configuration."; | ||
|
|
||
| // Set error message and jump back one step (exactly like onboardingMutation.onError) | ||
| setError(errorMessage); | ||
| setCurrentStep(totalSteps); | ||
| // Jump back one step after 1 second (go back to the step before ingestion) | ||
| // For embedding: totalSteps is 4, ingestion is step 3, so go back to step 2 | ||
| // For LLM: totalSteps is 3, ingestion is step 2, so go back to step 1 | ||
| setTimeout(() => { | ||
| // Go back to the step before the last step (which is ingestion) | ||
| const previousStep = totalSteps > 1 ? totalSteps - 2 : 0; | ||
| setCurrentStep(previousStep); | ||
| }, 1000); | ||
| return; |
There was a problem hiding this comment.
The rollbackMutation is defined but never called. When a file failure is detected, the code sets an error message and navigates back a step, but it doesn't actually invoke rollbackMutation.mutate() to trigger the backend cleanup (canceling tasks, deleting ingested files, resetting config). This means the rollback endpoint added in src/main.py and the rollback_onboarding function in src/api/settings.py will never be executed, leaving the system in an inconsistent state.
You should call rollbackMutation.mutate() when a failed file is detected to properly clean up the backend state before navigating back.
| # Generic format: {"detail": "..."} | ||
| if "detail" in error_data: | ||
| return error_data["detail"] | ||
| except (json.JSONDecodeError, ValueError, TypeError): |
There was a problem hiding this comment.
'except' clause does nothing but pass and there is no explanatory comment.
| except (json.JSONDecodeError, ValueError, TypeError): | |
| except (json.JSONDecodeError, ValueError, TypeError): | |
| # Parsing failed; return the original error text as fallback. |
This pull request introduces several improvements to onboarding, health checks, and query logic for the chat application. The main focus is on robust error handling during onboarding, enhanced provider health checks, and ensuring queries only run when appropriate (e.g., after onboarding and when providers are healthy). It also adds a mutation to roll back onboarding if ingestion tasks fail. These changes collectively improve user experience, reliability, and correctness of the onboarding and chat flows.
Onboarding error handling and rollback:
useOnboardingRollbackMutationhook to allow rolling back onboarding when ingestion tasks fail, and integrated it into the onboarding card logic to automatically handle failed sample data files and reset onboarding steps with clear error messaging. [1] [2] [3] [4]