Feat(Model Manager): Add improved download manager with pause/resume partial download. #8864
@DustyShoe Does this handle the case where the user pauses and either:
@JPPhoto Of course you had to find the worst-case scenario...
@DustyShoe We can't predict what users or their systems will do, so coding defensively and being resilient (to a point) is always good.
@JPPhoto Have to admit, that was a good point actually. I went back and added an explicit toast message for the case where the temp file was removed and the user tries to restart the download. There was also a bug in the status bar update: it never reset to 0 in that case.
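The missing-temp-file case described above could be handled roughly as sketched below. This is only an illustration, not the PR's code: `ResumePlan`, `plan_resume`, and the `recorded_bytes` parameter are hypothetical names. The idea is to check the partial file before resuming, fall back to offset 0 when it is gone, and flag that situation so the caller can show a message and reset the progress display.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ResumePlan:
    """Hypothetical result type: where to resume from and whether to warn."""

    offset: int            # byte offset to continue from (0 means start over)
    partial_missing: bool  # True if progress was recorded but the file is gone


def plan_resume(partial_path: Path, recorded_bytes: int) -> ResumePlan:
    """Decide how to resume based on what is actually on disk."""
    if not partial_path.exists():
        # The temp file was removed while paused: restart from zero and
        # report it only if some progress had been recorded before.
        return ResumePlan(offset=0, partial_missing=recorded_bytes > 0)
    # Trust the on-disk size rather than the last reported progress value.
    return ResumePlan(offset=partial_path.stat().st_size, partial_missing=False)
```

A caller would reset the displayed byte count to `plan.offset` and surface a notification when `plan.partial_missing` is true.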
Pull request overview
This PR adds pause/resume functionality for model downloads with persistence across backend restarts. The implementation includes proper handling of servers that refuse byte-range requests, sequential multi-file downloads, and comprehensive UI controls for managing download state.
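As a rough illustration of how a resume can cope with servers that refuse byte-range requests, the sketch below builds the request headers and interprets the response status. The function names are hypothetical and this is not taken from `download_default.py`; it relies only on standard HTTP semantics, where `206 Partial Content` means the range was honored and `200 OK` means the server is sending the whole file again.

```python
from typing import Dict, Optional, Tuple


def build_resume_headers(offset: int, etag: Optional[str] = None) -> Dict[str, str]:
    """Headers asking the server to continue from `offset` bytes."""
    headers: Dict[str, str] = {}
    if offset > 0:
        headers["Range"] = f"bytes={offset}-"
        if etag:
            # If-Range makes the server send the full file when it has changed.
            headers["If-Range"] = etag
    return headers


def interpret_response(status_code: int, offset: int) -> Tuple[str, int]:
    """Return (file open mode, starting offset) for the response body."""
    if offset > 0 and status_code == 206:
        return "ab", offset  # range honored: append to the partial file
    if status_code == 200:
        return "wb", 0       # range ignored or fresh download: overwrite
    raise ValueError(f"unexpected status {status_code} for offset {offset}")
```

Treating `200` as "restart from zero" keeps the partial file from being corrupted by appending a full body to it.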
Changes:
- Adds pause/resume API endpoints and UI controls for model downloads
- Implements persistent install state using marker files to survive backend restarts
- Changes multi-file downloads to run sequentially (one file at a time) instead of in parallel
- Adds restart functionality for failed or non-resumable downloads with per-file granularity
- Updates progress calculation to aggregate bytes across all files in multi-file installs
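For the marker-file item in the list above, a minimal sketch of the general technique might look like this. The file name `.install_state.json`, the JSON fields, and both helper functions are assumptions made for illustration; the PR's actual marker format is not shown in this conversation.

```python
import json
import os
from pathlib import Path
from typing import Optional

MARKER_NAME = ".install_state.json"  # hypothetical file name


def write_marker(tmpdir: Path, source: str, status: str) -> Path:
    """Persist install state so it can be found after a backend restart."""
    marker = tmpdir / MARKER_NAME
    scratch = marker.with_name(marker.name + ".tmp")
    scratch.write_text(json.dumps({"source": source, "status": status}), encoding="utf-8")
    # os.replace swaps the file in atomically, so a crash mid-write
    # cannot leave a half-written marker behind.
    os.replace(scratch, marker)
    return marker


def read_marker(tmpdir: Path) -> Optional[dict]:
    """Load a marker on startup; a missing or corrupt marker yields None."""
    try:
        return json.loads((tmpdir / MARKER_NAME).read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return None
```

On startup, a service could scan its temporary install directories, call `read_marker` on each, and re-register any install whose marker is present.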
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| invokeai/frontend/web/src/services/api/schema.ts | Added pause/resume API types and download metadata fields for resume support |
| invokeai/frontend/web/src/services/api/endpoints/models.ts | Added pause, resume, restart failed, and restart file mutations |
| invokeai/frontend/web/src/features/modelManagerV2/subpanels/AddModelPanel/ModelInstallQueue/ModelInstallQueueItem.tsx | Added UI controls for pause/resume and restart, plus progress aggregation logic |
| invokeai/frontend/web/src/features/modelManagerV2/subpanels/AddModelPanel/ModelInstallQueue/ModelInstallQueueBadge.tsx | Added "paused" status badge |
| invokeai/frontend/web/public/locales/en.json | Added translations for pause/resume/restart messages |
| invokeai/frontend/web/openapi.json | Regenerated schema with new endpoints and types (includes unrelated changes) |
| invokeai/app/services/model_install/model_install_default.py | Implemented marker file persistence, pause/resume/restart methods, and incomplete install restoration |
| invokeai/app/services/model_install/model_install_common.py | Added PAUSED status and paused property to ModelInstallJob |
| invokeai/app/services/model_install/model_install_base.py | Added abstract methods for pause/resume/restart operations |
| invokeai/app/services/events/events_common.py | Added DownloadPausedEvent |
| invokeai/app/services/events/events_base.py | Added emit_download_paused method |
| invokeai/app/services/download/download_default.py | Implemented resume logic with byte-range requests and sequential multi-file downloads |
| invokeai/app/services/download/download_base.py | Added PAUSED status and pause-related fields to DownloadJob |
| invokeai/app/api/routers/model_manager.py | Added pause/resume/restart API endpoints |
Comments suppressed due to low confidence (6)
invokeai/frontend/web/src/features/modelManagerV2/subpanels/AddModelPanel/ModelInstallQueue/ModelInstallQueueItem.tsx:229
- The code checks `installJob.status === 'completed'` in the progressValue calculation, but the TypeScript type system should enforce the correct status values. However, for consistency with the rest of the codebase, verify that the status type from the API matches the expected InstallStatus enum values in all locations.
const progressValue = useMemo(() => {
  if (installJob.status === 'completed' || installJob.status === 'error' || installJob.status === 'cancelled') {
    return 100;
  }
  const parts = installJob.download_parts;
  if (parts && parts.length > 0) {
    const totalBytesFromParts = parts.reduce((sum, part) => sum + (part.total_bytes ?? 0), 0);
    const currentBytesFromParts = parts.reduce((sum, part) => sum + (part.bytes ?? 0), 0);
    const totalBytes = Math.max(totalBytesFromParts, installJob.total_bytes ?? 0);
    const currentBytes = Math.max(currentBytesFromParts, installJob.bytes ?? 0);
    if (totalBytes > 0) {
      return (currentBytes / totalBytes) * 100;
    }
    return 0;
  }
  if (!isNil(installJob.bytes) && !isNil(installJob.total_bytes) && installJob.total_bytes > 0) {
    return (installJob.bytes / installJob.total_bytes) * 100;
  }
  return null;
}, [installJob.bytes, installJob.download_parts, installJob.status, installJob.total_bytes]);
invokeai/frontend/web/src/features/modelManagerV2/subpanels/AddModelPanel/ModelInstallQueue/ModelInstallQueueItem.tsx:83
- The error handlers access `error.data.detail` without checking if `error.data` exists first. If the error object doesn't have a `data` property, this will throw an uncaught exception. Add a null check: `error?.data?.detail` or provide a fallback error message. This pattern is repeated in all error handlers (pause, resume, restart failed, restart file).
.catch((error) => {
  if (error) {
    toast({
      id: 'MODEL_INSTALL_PAUSE_FAILED',
      title: `${error.data.detail} `,
      status: 'error',
    });
  }
});
invokeai/frontend/web/src/features/modelManagerV2/subpanels/AddModelPanel/ModelInstallQueue/ModelInstallQueueItem.tsx:165
- Same issue with error handling: accessing `error.data.detail` without null checks. This occurs in the resume, restart failed, and restart file handlers as well.
    .catch((error) => {
      if (error) {
        toast({
          id: 'MODEL_INSTALL_RESUME_FAILED',
          title: `${error.data.detail} `,
          status: 'error',
        });
      }
    });
}, [hasRestartedFromScratch, installJob, resumeModelInstall]);
const handleRestartFailed = useCallback(() => {
  restartFailedModelInstall(installJob.id)
    .unwrap()
    .then(() => {
      toast({
        id: 'MODEL_INSTALL_RESTART_FAILED',
        title: t('toast.modelDownloadRestartFailed'),
        status: 'success',
      });
    })
    .catch((error) => {
      if (error) {
        toast({
          id: 'MODEL_INSTALL_RESTART_FAILED_ERROR',
          title: `${error.data.detail} `,
          status: 'error',
        });
      }
    });
}, [installJob.id, restartFailedModelInstall]);
const handleRestartFile = useCallback(
  (fileSource: string) => {
    restartModelInstallFile({ id: installJob.id, file_source: fileSource })
      .unwrap()
      .then(() => {
        toast({
          id: 'MODEL_INSTALL_RESTART_FILE',
          title: t('toast.modelDownloadRestartFile'),
          status: 'success',
        });
      })
      .catch((error) => {
        if (error) {
          toast({
            id: 'MODEL_INSTALL_RESTART_FILE_ERROR',
            title: `${error.data.detail} `,
            status: 'error',
          });
        }
invokeai/app/services/model_install/model_install_default.py:562
- The pause_job, resume_job, restart_failed, and restart_file methods access and modify shared state (job status, multifile_job) without holding self._lock. This could lead to race conditions if these methods are called concurrently with download callbacks or other operations. Consider adding lock protection similar to what's used in cancel_job and other methods that modify job state.
def pause_job(self, job: ModelInstallJob) -> None:
    """Pause the indicated job, preserving partial downloads."""
    if job.in_terminal_state:
        return
    job.status = InstallStatus.PAUSED
    self._logger.warning(f"Pausing {job.source}")
    if dj := job._multifile_job:
        for part in dj.download_parts:
            self._download_queue.pause_job(part)
    self._write_install_marker(job, status=InstallStatus.PAUSED)

def resume_job(self, job: ModelInstallJob) -> None:
    """Resume a previously paused job."""
    if not job.paused:
        return
    self._logger.info(f"Resuming {job.source}")
    self._resume_remote_download(job)

def restart_failed(self, job: ModelInstallJob) -> None:
    """Restart failed or non-resumable downloads for a job."""
    if not isinstance(job.source, (HFModelSource, URLModelSource)):
        return
    if not job.download_parts:
        return
    if not any(part.resume_required or part.errored for part in job.download_parts):
        return
    sources_to_restart = {str(part.source) for part in job.download_parts if not part.complete}
    if not sources_to_restart:
        return
    job.status = InstallStatus.WAITING
    remote_files, metadata = self._remote_files_from_source(job.source)
    remote_files = [rf for rf in remote_files if str(rf.url) in sources_to_restart]
    subfolders = job.source.subfolders if isinstance(job.source, HFModelSource) else []
    self._enqueue_remote_download(
        job=job,
        source=job.source,
        remote_files=remote_files,
        metadata=metadata,
        destdir=job._install_tmpdir or job.local_path,
        subfolder=job.source.subfolder if isinstance(job.source, HFModelSource) and len(subfolders) <= 1 else None,
        subfolders=subfolders if len(subfolders) > 1 else None,
        clear_partials=True,
    )

def restart_file(self, job: ModelInstallJob, file_source: str) -> None:
    """Restart a specific file download for a job."""
    if not isinstance(job.source, (HFModelSource, URLModelSource)):
        return
    job.status = InstallStatus.WAITING
    remote_files, metadata = self._remote_files_from_source(job.source)
    remote_files = [rf for rf in remote_files if str(rf.url) == file_source]
    if not remote_files:
        return
    subfolders = job.source.subfolders if isinstance(job.source, HFModelSource) else []
    self._enqueue_remote_download(
        job=job,
        source=job.source,
        remote_files=remote_files,
        metadata=metadata,
        destdir=job._install_tmpdir or job.local_path,
        subfolder=job.source.subfolder if isinstance(job.source, HFModelSource) and len(subfolders) <= 1 else None,
        subfolders=subfolders if len(subfolders) > 1 else None,
        clear_partials=True,
    )
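The locking concern raised for these methods can be illustrated with a small self-contained sketch. `Status`, `Job`, and `Installer` here are simplified stand-ins, not the real `InstallStatus`, `ModelInstallJob`, or install service. The point is only that the status check and the status change happen under one lock, so a completion callback on a worker thread cannot slip in between them.

```python
import threading
from enum import Enum


class Status(Enum):
    RUNNING = "running"
    PAUSED = "paused"
    COMPLETED = "completed"


class Job:
    """Simplified stand-in for an install job."""

    def __init__(self) -> None:
        self.status = Status.RUNNING


class Installer:
    """Simplified stand-in showing check-and-set under a single lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()

    def pause_job(self, job: Job) -> bool:
        with self._lock:
            if job.status is not Status.RUNNING:
                return False  # already paused or finished: nothing to do
            job.status = Status.PAUSED
            return True

    def on_download_complete(self, job: Job) -> bool:
        # Called from a worker thread in this sketch; takes the same lock.
        with self._lock:
            if job.status is not Status.RUNNING:
                return False  # a pause won the race; do not overwrite it
            job.status = Status.COMPLETED
            return True
```

Whichever transition acquires the lock first wins, and the other becomes a no-op instead of silently overwriting the status.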
invokeai/app/services/download/download_default.py:231
- The `_submit_next_mfd_part` method accesses and modifies `self._mfd_pending[job.id]` and `self._mfd_active[job.id]` without lock protection. Since this method is called from multiple callbacks (`_mfd_complete`) which run in worker threads, there's potential for race conditions. Consider adding lock protection around the manipulation of these shared data structures.
def _submit_next_mfd_part(self, job: MultiFileDownloadJob) -> None:
    pending = self._mfd_pending.get(job.id, [])
    if not pending:
        return
    if self._mfd_active.get(job.id) is not None:
        return
    download_job = pending.pop(0)
    self._mfd_active[job.id] = download_job
    self.submit_download_job(
        download_job,
        on_start=self._mfd_started,
        on_progress=self._mfd_progress,
        on_complete=self._mfd_complete,
        on_cancelled=self._mfd_cancelled,
        on_error=self._mfd_error,
    )
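One way to picture the suggested lock protection for a one-file-at-a-time scheduler is the sketch below. `SequentialScheduler` and its methods are hypothetical and use plain strings in place of download jobs; it is not the PR's implementation. The pending and active maps are only touched while holding the lock, and the actual submission happens after the lock is released so a callback that re-enters the scheduler cannot deadlock.

```python
import threading
from collections import deque
from typing import Callable, Deque, Dict, Iterable, Optional


class SequentialScheduler:
    """Hypothetical scheduler: at most one active part per job id."""

    def __init__(self, submit: Callable[[str], None]) -> None:
        self._lock = threading.Lock()
        self._pending: Dict[str, Deque[str]] = {}
        self._active: Dict[str, Optional[str]] = {}
        self._submit = submit

    def enqueue(self, job_id: str, parts: Iterable[str]) -> None:
        with self._lock:
            self._pending.setdefault(job_id, deque()).extend(parts)

    def submit_next(self, job_id: str) -> Optional[str]:
        with self._lock:
            pending = self._pending.get(job_id)
            if not pending or self._active.get(job_id) is not None:
                return None  # nothing queued, or a part is still running
            part = pending.popleft()
            self._active[job_id] = part
        # Submit outside the lock so callbacks may call back into us.
        self._submit(part)
        return part

    def part_finished(self, job_id: str) -> Optional[str]:
        with self._lock:
            self._active[job_id] = None
        return self.submit_next(job_id)
```

Using `deque.popleft()` instead of `list.pop(0)` is a small side choice here; the locking is the part that matters for the race described above.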
invokeai/frontend/web/openapi.json:175
- This PR includes unrelated changes to openapi.json that appear to be from other features (orphaned models detection/deletion API endpoints, FLUX model loader changes, DyPE preset modifications). These changes are not mentioned in the PR description and may have been inadvertently included from a schema regeneration. Consider whether these should be in a separate PR or if the PR description should be updated to reflect all changes.
lstein left a comment
Tested and it works well!
Suggestions:
- Would it be possible to add "pause all/resume all" and "cancel all" buttons to the install queue title bar, just to the left of "Prune"? Particularly after resuming from a crash, it would be great to be able to resume all the partial downloads with one click.
- I found that if I killed and restarted the backend while a file download was in progress, the backend would put the downloads into a "paused" state, but the frontend didn't update to show the new status. I had to pause and then resume each file, or else refresh the whole page. Would it be possible for the frontend to update its download queue display after a backend restart, or even to restart the downloads automatically?
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Given that this is an existing pattern, I treated the download state the same way. That said, I’ll see whether the download state can be updated after a backend restart without introducing major changes.
Looking good. I'll do just a little more testing tomorrow before approving.
lstein left a comment
Working as advertised. Great enhancement!



Summary
This PR adds a few cool things:
Parts of this code were written with assistance, so I’d appreciate any fixes or improvements.
QA Instructions
It might be better to run this with debug logging enabled.
Merge Plan
Checklist
What's New copy (if doing a release after this PR)