-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Break apart session processor and the running of each session into se… #6382
Break apart session processor and the running of each session into se… #6382
Conversation
@brandonrising Do you mind taking a look at the multi-GPU support PR #5997 to estimate how difficult it would be to integrate that into what is done here? That PR has done a small amount of refactoring to break the big scary loop into smaller pieces. |
invokeai/app/services/session_processor/session_processor_default.py
Outdated
Show resolved
Hide resolved
4fac432
to
17a3a04
Compare
Notes from my changes:
I resolved a longstanding issue where pydantic validation errors for nodes were handled as processor-level errors, instead of node-level errors. @brandonrising had almost fixed it in the first iteration of this PR, but errors were being set on the previous node, when they should be on the current node. To resolve this, some error handling was added to the This makes our error reporting more accurate. It also is a better UX - the error messages are very clear. |
Here's how I smoke tested:
|
- Add `OnNodeError` and `OnNonFatalProcessorError` callbacks - Move all session/node callbacks to `SessionRunner` - this ensures we dump perf stats before resetting them and generally makes sense to me - Remove `complete` event from `SessionRunner`, it's essentially the same as `OnAfterRunSession` - Remove extraneous `next_invocation` block, which would treat a processor error as a node error - Simplify loops - Add some callbacks for testing, to be removed before merge
- Use protocol to define callbacks, this allows them to have kwargs - Shuffle the profiler around a bit - Move `thread_limit` and `polling_interval` to `__init__`; `start` is called programmatically and will never get these args in practice
We were not handling node preparation errors as node errors before. Here's the explanation, copied from a comment that is no longer required: --- TODO(psyche): Sessions only support errors on nodes, not on the session itself. When an error occurs outside node execution, it bubbles up to the processor where it is treated as a queue item error. Nodes are pydantic models. When we prepare a node in `session.next()`, we set its inputs. This can cause a pydantic validation error. For example, consider a resize image node which has a constraint on its `width` input field - it must be greater than zero. During preparation, if the width is set to zero, pydantic will raise a validation error. When this happens, it breaks the flow before `invocation` is set. We can't set an error on the invocation because we didn't get far enough to get it - we don't know its id. Hence, we just set it as a queue item error. --- This change wraps the node preparation step with exception handling. A new `NodeInputError` exception is raised when there is a validation error. This error has the node (in the state it was in just prior to the error) and an identifier of the input that failed. This allows us to mark the node that failed preparation as errored, correctly making such errors _node_ errors and not _processor_ errors. It's much easier to diagnose these situations. The error messages look like this: > Node b5ac87c6-0678-4b8c-96b9-d215aee12175 has invalid incoming input for height Some of the exception handling logic is cleaned up.
e194fc4
to
7652fbc
Compare
…_traceback` to `session_queue` table
- Add handling for new error columns `error_type`, `error_message`, `error_traceback`. - Update queue item model to include the new data. The `error_traceback` field has an alias of `error` for backwards compatibility. - Add `fail_queue_item` method. This was previously handled by `cancel_queue_item`. Splitting this functionality makes failing a queue item a bit more explicit. We also don't need to handle multiple optional error args. -
There's a race condition where a canceled session may emit a progress event or two after it's been canceled, and the progress image isn't cleared out. To resolve this, the system slice tracks canceled session ids. When a progress event comes in, we check the cancellations and skip setting the progress if canceled.
…etting reported I had set the cancel event at some point during troubleshooting an unrelated issue. It seemed logical that it should be set there, and didn't seem to break anything. However, this is not correct. The cancel event should not be set in response to a queue status change event. Doing so can cause a race condition when nodes are executed very quickly. It's possible that a previously-executed session's queue item status change event is handled after the next session starts executing. The cancel event is set and the session runner sees it aborting the session run early. In hindsight, it doesn't make sense to set the cancel event here either. It should be set in response to user action, e.g. the user cancelled the session or cleared the queue (which implicitly cancels the current session). These events actually trigger the queue item status changed event, so if we set the cancel event here, we'd be setting it twice per cancellation.
Show error toasts on queue item error events instead of invocation error events. This allows errors that occurred outside node execution to be surfaced to the user. The error description component is updated to show the new error message if available. Commercial handling is retained, but local now uses the same component to display the error message itself.
I've updated a few things in this PR:
Working on this has underscored the need for a way to test the processor (and new session runner). I'm not familiar enough with mocking to have a clear idea of how we can do this, but I think it's feasible - especially now that things are modularized. |
…parate classes
Summary
Breaks up the currently scary to update session processor into separate components and provides callback function passthroughs which allow you to define tasks be run before/after sessions, as well as before/after each node.
QA Instructions
Run invocations of various types and batch sizes. Compare output images of same inputs against those generated on current main. Validate logging/stats are formatted correctly.
Intentionally run malformed graphs and nodes which will error out to ensure error reporting is still correct.
Merge Plan
I will merge once the PR has been thoroughly reviewed, tested, and approved on multiple OS/environments.
Checklist