Feat(Model Manager): Add improved download manager with pause/resume partial download.#8864

Merged
lstein merged 18 commits into invoke-ai:main from DustyShoe:Feat(Backend)/improved-download-manager
Feb 24, 2026

@DustyShoe
Collaborator

@DustyShoe DustyShoe commented Feb 8, 2026

Summary

This PR adds a few cool things:

  • Adds pause/resume for model downloads with proper “restart required” handling when servers refuse Range.
  • Persists install state so downloads can resume across restarts using the same temp folder.
  • Multi‑file installs now run sequentially per job (one file at a time) for more stable resume behavior.
  • Adds restart actions (job‑level and file‑level) plus clearer inline status in the UI.
  • Progress bar now aggregates job‑level bytes so it reflects total download progress during multi‑file installs.
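
The "restart required" rule in the first bullet can be sketched as a small decision function. This is a minimal illustration, not the code in this PR: `ResumeOutcome` and `classify_response` are hypothetical names, and it assumes error statuses (4xx/5xx) have already been handled before this point.

```python
from enum import Enum


class ResumeOutcome(Enum):
    FRESH = "fresh"  # nothing on disk, ordinary download from byte 0
    RESUMED = "resumed"  # server honored Range, append to the partial file
    RESTART_REQUIRED = "restart_required"  # server ignored Range


def classify_response(status_code: int, resume_from: int) -> ResumeOutcome:
    """Decide how to treat the reply to a GET that may have carried a Range header.

    Hypothetical helper: resume_from is the number of bytes already on disk.
    """
    if resume_from == 0:
        # No Range header was sent, so any successful reply is a fresh download.
        return ResumeOutcome.FRESH
    if status_code == 206:
        # 206 Partial Content: the body starts at resume_from.
        return ResumeOutcome.RESUMED
    # A 200 reply to a Range request means the body starts at byte 0 again,
    # so appending it to the partial file would corrupt the result.
    return ResumeOutcome.RESTART_REQUIRED
```

Under these assumptions, `classify_response(200, 4096)` yields `RESTART_REQUIRED`, which corresponds to the "Restart required" file status described above.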

Parts of this code were written with assistance, so I’d appreciate any fixes or improvements.

QA Instructions

It might be better to run this with debug logging enabled.

  1. Start a multi‑file model install (e.g., Tongyi-MAI/Z-Image-Turbo).
  2. Verify progress bar increases and tooltip shows X / Y for the full job.
  3. Pause the download mid‑file, then resume.
  4. Confirm the download continues (not restarted) and bytes increase from the same point.
  5. Kill/restart backend while a download is paused; verify it resumes on start.
  6. Force a resume‑refused case (server returns 200 to Range) and confirm:
     • file shows “Restart required”
     • job status becomes Paused
     • “Restart file” works and only that file restarts
  7. Cancel a job and re‑install the same model; ensure a new temp folder is created and no old partial blocks it.
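
One possible way to force the resume‑refused case in step 6 is a throwaway local server that ignores the Range header and always answers 200. The sketch below uses only the Python standard library; `NoRangeHandler`, `start_server`, and `PAYLOAD` are hypothetical names, not part of this PR.

```python
import http.server
import threading

PAYLOAD = b"x" * 1024  # stand-in for a model file


class NoRangeHandler(http.server.BaseHTTPRequestHandler):
    """Serves PAYLOAD in full and deliberately ignores any Range header."""

    def do_GET(self) -> None:
        self.send_response(200)  # never 206, even when Range was sent
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Length", str(len(PAYLOAD)))
        self.end_headers()
        self.wfile.write(PAYLOAD)

    def log_message(self, format: str, *args) -> None:
        pass  # keep console output quiet


def start_server(port: int = 0) -> http.server.ThreadingHTTPServer:
    """Start the server on a daemon thread; port 0 lets the OS pick a free port."""
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), NoRangeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

For real QA the payload would need to be much larger (or the writes throttled) so there is time to pause mid‑file; the URL to install from is then `http://127.0.0.1:<port>/<any path>`, with the port read from `server.server_address`.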

Merge Plan

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • ❗Changes to a redux slice have a corresponding migration
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

@github-actions github-actions bot added the api, python (PRs that change python files), services (PRs that change app services), and frontend (PRs that change frontend files) labels Feb 8, 2026
@JPPhoto
Collaborator

JPPhoto commented Feb 9, 2026

@DustyShoe Does this handle the case where the user pauses and either:

  1. Quits Invoke and deletes the temporary file?
  2. Keeps Invoke running and deletes the temporary file?

@DustyShoe
Collaborator Author

@JPPhoto Of course you had to find the worst-case scenario...

  1. Why would a user even do that?
  2. User pauses, quits Invoke, deletes temp file
    On resume, we look for the .downloading file. If it’s gone, resume_from=0 and we start a fresh download (no resume).
  3. User pauses, keeps Invoke running, deletes temp file
    The file is already closed on pause, so deleting it is fine. When the user resumes, same as above: no .downloading file, so it restarts from scratch.
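
The fallback described in this reply (no `.downloading` file means `resume_from=0`) can be illustrated roughly as follows. This is a sketch under assumptions: the helper names and the exact partial-file naming are hypothetical; only the `.downloading` suffix and the `resume_from` value come from the comment above.

```python
from pathlib import Path


def partial_path_for(dest: Path) -> Path:
    """Hypothetical naming: the in-progress file sits next to the final one."""
    return dest.with_name(dest.name + ".downloading")


def resume_offset(dest: Path) -> int:
    """Return the byte offset to resume from, or 0 if the partial file is gone."""
    partial = partial_path_for(dest)
    return partial.stat().st_size if partial.is_file() else 0


def request_headers(resume_from: int) -> dict[str, str]:
    """Only ask for a byte range when there is something to resume."""
    return {"Range": f"bytes={resume_from}-"} if resume_from > 0 else {}
```

With this shape, deleting the temp file while paused simply makes `resume_offset` return 0, so the next request carries no Range header and the download starts from scratch.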

@JPPhoto
Collaborator

JPPhoto commented Feb 9, 2026

@DustyShoe We can't predict what users or their systems will do, so coding defensively and being resilient (to a point) is always good.

@DustyShoe
Collaborator Author

@JPPhoto Have to admit, that was actually a good point. I went back and added an explicit toast message for when the temp file was removed and the user tries to restart the download. There was also a bug in the status bar update: it never reset to 0 in that case.

@DustyShoe DustyShoe changed the title from "Feat(backend): Add improved download manager with pause/resume partial download." to "Feat(Model Manager): Add improved download manager with pause/resume partial download." Feb 12, 2026
@lstein lstein self-assigned this Feb 16, 2026
@lstein lstein added the v6.13.x label Feb 20, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR adds pause/resume functionality for model downloads with persistence across backend restarts. The implementation includes proper handling of servers that refuse byte-range requests, sequential multi-file downloads, and comprehensive UI controls for managing download state.

Changes:

  • Adds pause/resume API endpoints and UI controls for model downloads
  • Implements persistent install state using marker files to survive backend restarts
  • Changes multi-file downloads to run sequentially (one file at a time) instead of in parallel
  • Adds restart functionality for failed or non-resumable downloads with per-file granularity
  • Updates progress calculation to aggregate bytes across all files in multi-file installs
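
The byte aggregation in the last bullet can be restated in Python for readers following the backend side. This is an illustrative sketch only: `Part`, `bytes_done`, and `job_progress_percent` are hypothetical names, and the branch that pins terminal states to 100 is left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Part:
    """One file of a multi-file install (hypothetical, simplified)."""

    bytes_done: Optional[int] = None
    total_bytes: Optional[int] = None


def job_progress_percent(
    parts: Sequence[Part], job_bytes: Optional[int], job_total: Optional[int]
) -> Optional[float]:
    """Aggregate progress across all parts, falling back to job-level counters."""
    if parts:
        # Take the larger of the summed parts and the job-level figure, since
        # either one may lag behind the other while events arrive.
        total = max(sum(p.total_bytes or 0 for p in parts), job_total or 0)
        current = max(sum(p.bytes_done or 0 for p in parts), job_bytes or 0)
        return current / total * 100 if total > 0 else 0.0
    if job_bytes is not None and job_total:
        return job_bytes / job_total * 100
    return None  # nothing known yet
```

For example, two 100-byte parts with 50 bytes received on the first give a job-level value of 25 percent rather than 50 percent of the first file.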

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

File | Description
invokeai/frontend/web/src/services/api/schema.ts | Added pause/resume API types and download metadata fields for resume support
invokeai/frontend/web/src/services/api/endpoints/models.ts | Added pause, resume, restart failed, and restart file mutations
invokeai/frontend/web/src/features/modelManagerV2/subpanels/AddModelPanel/ModelInstallQueue/ModelInstallQueueItem.tsx | Added UI controls for pause/resume and restart, plus progress aggregation logic
invokeai/frontend/web/src/features/modelManagerV2/subpanels/AddModelPanel/ModelInstallQueue/ModelInstallQueueBadge.tsx | Added "paused" status badge
invokeai/frontend/web/public/locales/en.json | Added translations for pause/resume/restart messages
invokeai/frontend/web/openapi.json | Regenerated schema with new endpoints and types (includes unrelated changes)
invokeai/app/services/model_install/model_install_default.py | Implemented marker file persistence, pause/resume/restart methods, and incomplete install restoration
invokeai/app/services/model_install/model_install_common.py | Added PAUSED status and paused property to ModelInstallJob
invokeai/app/services/model_install/model_install_base.py | Added abstract methods for pause/resume/restart operations
invokeai/app/services/events/events_common.py | Added DownloadPausedEvent
invokeai/app/services/events/events_base.py | Added emit_download_paused method
invokeai/app/services/download/download_default.py | Implemented resume logic with byte-range requests and sequential multi-file downloads
invokeai/app/services/download/download_base.py | Added PAUSED status and pause-related fields to DownloadJob
invokeai/app/api/routers/model_manager.py | Added pause/resume/restart API endpoints

Comments suppressed due to low confidence (6)

invokeai/frontend/web/src/features/modelManagerV2/subpanels/AddModelPanel/ModelInstallQueue/ModelInstallQueueItem.tsx:229

  • The code checks installJob.status === 'completed' in the progressValue calculation, but the TypeScript type system should enforce the correct status values. However, for consistency with the rest of the codebase, verify that the status type from the API matches the expected InstallStatus enum values in all locations.
  const progressValue = useMemo(() => {
    if (installJob.status === 'completed' || installJob.status === 'error' || installJob.status === 'cancelled') {
      return 100;
    }

    const parts = installJob.download_parts;
    if (parts && parts.length > 0) {
      const totalBytesFromParts = parts.reduce((sum, part) => sum + (part.total_bytes ?? 0), 0);
      const currentBytesFromParts = parts.reduce((sum, part) => sum + (part.bytes ?? 0), 0);
      const totalBytes = Math.max(totalBytesFromParts, installJob.total_bytes ?? 0);
      const currentBytes = Math.max(currentBytesFromParts, installJob.bytes ?? 0);
      if (totalBytes > 0) {
        return (currentBytes / totalBytes) * 100;
      }
      return 0;
    }

    if (!isNil(installJob.bytes) && !isNil(installJob.total_bytes) && installJob.total_bytes > 0) {
      return (installJob.bytes / installJob.total_bytes) * 100;
    }

    return null;
  }, [installJob.bytes, installJob.download_parts, installJob.status, installJob.total_bytes]);

invokeai/frontend/web/src/features/modelManagerV2/subpanels/AddModelPanel/ModelInstallQueue/ModelInstallQueueItem.tsx:83

  • The error handlers access error.data.detail without checking if error.data exists first. If the error object doesn't have a data property, this will throw an uncaught exception. Add a null check: error?.data?.detail or provide a fallback error message. This pattern is repeated in all error handlers (pause, resume, restart failed, restart file).
      .catch((error) => {
        if (error) {
          toast({
            id: 'MODEL_INSTALL_PAUSE_FAILED',
            title: `${error.data.detail} `,
            status: 'error',
          });
        }
      });

invokeai/frontend/web/src/features/modelManagerV2/subpanels/AddModelPanel/ModelInstallQueue/ModelInstallQueueItem.tsx:165

  • Same issue with error handling - accessing error.data.detail without null checks. This occurs in the resume, restart failed, and restart file handlers as well.
      .catch((error) => {
        if (error) {
          toast({
            id: 'MODEL_INSTALL_RESUME_FAILED',
            title: `${error.data.detail} `,
            status: 'error',
          });
        }
      });
  }, [hasRestartedFromScratch, installJob, resumeModelInstall]);

  const handleRestartFailed = useCallback(() => {
    restartFailedModelInstall(installJob.id)
      .unwrap()
      .then(() => {
        toast({
          id: 'MODEL_INSTALL_RESTART_FAILED',
          title: t('toast.modelDownloadRestartFailed'),
          status: 'success',
        });
      })
      .catch((error) => {
        if (error) {
          toast({
            id: 'MODEL_INSTALL_RESTART_FAILED_ERROR',
            title: `${error.data.detail} `,
            status: 'error',
          });
        }
      });
  }, [installJob.id, restartFailedModelInstall]);

  const handleRestartFile = useCallback(
    (fileSource: string) => {
      restartModelInstallFile({ id: installJob.id, file_source: fileSource })
        .unwrap()
        .then(() => {
          toast({
            id: 'MODEL_INSTALL_RESTART_FILE',
            title: t('toast.modelDownloadRestartFile'),
            status: 'success',
          });
        })
        .catch((error) => {
          if (error) {
            toast({
              id: 'MODEL_INSTALL_RESTART_FILE_ERROR',
              title: `${error.data.detail} `,
              status: 'error',
            });
          }

invokeai/app/services/model_install/model_install_default.py:562

  • The pause_job, resume_job, restart_failed, and restart_file methods access and modify shared state (job status, multifile_job) without holding self._lock. This could lead to race conditions if these methods are called concurrently with download callbacks or other operations. Consider adding lock protection similar to what's used in cancel_job and other methods that modify job state.
    def pause_job(self, job: ModelInstallJob) -> None:
        """Pause the indicated job, preserving partial downloads."""
        if job.in_terminal_state:
            return
        job.status = InstallStatus.PAUSED
        self._logger.warning(f"Pausing {job.source}")
        if dj := job._multifile_job:
            for part in dj.download_parts:
                self._download_queue.pause_job(part)
        self._write_install_marker(job, status=InstallStatus.PAUSED)

    def resume_job(self, job: ModelInstallJob) -> None:
        """Resume a previously paused job."""
        if not job.paused:
            return
        self._logger.info(f"Resuming {job.source}")
        self._resume_remote_download(job)

    def restart_failed(self, job: ModelInstallJob) -> None:
        """Restart failed or non-resumable downloads for a job."""
        if not isinstance(job.source, (HFModelSource, URLModelSource)):
            return
        if not job.download_parts:
            return
        if not any(part.resume_required or part.errored for part in job.download_parts):
            return
        sources_to_restart = {str(part.source) for part in job.download_parts if not part.complete}
        if not sources_to_restart:
            return
        job.status = InstallStatus.WAITING
        remote_files, metadata = self._remote_files_from_source(job.source)
        remote_files = [rf for rf in remote_files if str(rf.url) in sources_to_restart]
        subfolders = job.source.subfolders if isinstance(job.source, HFModelSource) else []
        self._enqueue_remote_download(
            job=job,
            source=job.source,
            remote_files=remote_files,
            metadata=metadata,
            destdir=job._install_tmpdir or job.local_path,
            subfolder=job.source.subfolder if isinstance(job.source, HFModelSource) and len(subfolders) <= 1 else None,
            subfolders=subfolders if len(subfolders) > 1 else None,
            clear_partials=True,
        )

    def restart_file(self, job: ModelInstallJob, file_source: str) -> None:
        """Restart a specific file download for a job."""
        if not isinstance(job.source, (HFModelSource, URLModelSource)):
            return
        job.status = InstallStatus.WAITING
        remote_files, metadata = self._remote_files_from_source(job.source)
        remote_files = [rf for rf in remote_files if str(rf.url) == file_source]
        if not remote_files:
            return
        subfolders = job.source.subfolders if isinstance(job.source, HFModelSource) else []
        self._enqueue_remote_download(
            job=job,
            source=job.source,
            remote_files=remote_files,
            metadata=metadata,
            destdir=job._install_tmpdir or job.local_path,
            subfolder=job.source.subfolder if isinstance(job.source, HFModelSource) and len(subfolders) <= 1 else None,
            subfolders=subfolders if len(subfolders) > 1 else None,
            clear_partials=True,
        )
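
The locking concern raised for these methods can be illustrated with a toy example. It is a sketch under assumptions, not the service's actual code: `Installer`, `Job`, `Status`, and `on_download_complete` are hypothetical stand-ins, and the "a pause is not overwritten by a late callback" policy is chosen only to show an atomic check-then-set.

```python
import threading
from enum import Enum


class Status(Enum):
    RUNNING = "running"
    PAUSED = "paused"
    COMPLETED = "completed"


class Job:
    def __init__(self) -> None:
        self.status = Status.RUNNING


class Installer:
    """Toy stand-in for the install service; names are illustrative only."""

    def __init__(self) -> None:
        # RLock so a method holding the lock can call another locked method.
        self._lock = threading.RLock()

    def pause_job(self, job: Job) -> bool:
        """Pause the job unless it already finished; returns True if paused."""
        with self._lock:
            if job.status is Status.COMPLETED:
                return False
            job.status = Status.PAUSED
            return True

    def on_download_complete(self, job: Job) -> None:
        """Worker-thread callback; the status check and update happen atomically."""
        with self._lock:
            if job.status is Status.RUNNING:
                job.status = Status.COMPLETED
```

Because both paths take the same lock, the status test and the assignment cannot interleave with each other, which is the property the comment above asks for.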

invokeai/app/services/download/download_default.py:231

  • The _submit_next_mfd_part method accesses and modifies self._mfd_pending[job.id] and self._mfd_active[job.id] without lock protection. Since this method is called from multiple callbacks (_mfd_complete) which run in worker threads, there's potential for race conditions. Consider adding lock protection around the manipulation of these shared data structures.
    def _submit_next_mfd_part(self, job: MultiFileDownloadJob) -> None:
        pending = self._mfd_pending.get(job.id, [])
        if not pending:
            return
        if self._mfd_active.get(job.id) is not None:
            return
        download_job = pending.pop(0)
        self._mfd_active[job.id] = download_job
        self.submit_download_job(
            download_job,
            on_start=self._mfd_started,
            on_progress=self._mfd_progress,
            on_complete=self._mfd_complete,
            on_cancelled=self._mfd_cancelled,
            on_error=self._mfd_error,
        )
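
A lock-protected version of this "one part at a time" scheduling could look roughly like the sketch below. `SequentialScheduler` and its methods are hypothetical; the `submit` callable stands in for whatever actually enqueues a download, and string part names stand in for download jobs.

```python
import threading
from collections import deque
from typing import Callable, Deque, Dict, List, Optional


class SequentialScheduler:
    """Submits the parts of each job one at a time (illustrative only)."""

    def __init__(self, submit: Callable[[str], None]) -> None:
        self._submit = submit
        self._lock = threading.Lock()
        self._pending: Dict[int, Deque[str]] = {}
        self._active: Dict[int, Optional[str]] = {}

    def add_job(self, job_id: int, parts: List[str]) -> None:
        with self._lock:
            self._pending[job_id] = deque(parts)
            self._active[job_id] = None
        self.submit_next(job_id)

    def submit_next(self, job_id: int) -> None:
        with self._lock:
            pending = self._pending.get(job_id)
            # Nothing left to do, or a part of this job is still running.
            if not pending or self._active.get(job_id) is not None:
                return
            part = pending.popleft()
            self._active[job_id] = part
        # Submit outside the lock so a callback that fires immediately
        # cannot deadlock on the non-reentrant lock.
        self._submit(part)

    def part_finished(self, job_id: int) -> None:
        """Called from a worker callback when the active part is done."""
        with self._lock:
            self._active[job_id] = None
        self.submit_next(job_id)
```

The pop from the pending queue and the write to the active slot happen under one lock acquisition, so two callbacks racing into `submit_next` cannot both claim a part.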

invokeai/frontend/web/openapi.json:175

  • This PR includes unrelated changes to openapi.json that appear to be from other features (orphaned models detection/deletion API endpoints, FLUX model loader changes, DyPE preset modifications). These changes are not mentioned in the PR description and may have been inadvertently included from a schema regeneration. Consider whether these should be in a separate PR or if the PR description should be updated to reflect all changes.

Collaborator

@lstein lstein left a comment

Tested and it works well!

Suggestions:

  1. Would it be possible to add "pause all/resume all" and "cancel all" buttons to the install queue title bar, just to the left of "Prune"? Particularly after resuming from a crash, it would be great to be able to resume all the partial downloads with one click.
  2. I found that if I killed and restarted the backend while a file download was in progress, the backend would put the downloads into a "pause" state, but the frontend didn't update to show the new status. I had to pause and then resume each file, or else refresh the whole page. Would it be possible for the frontend to update its download queue display after a backend restart, or even automatically get the download going again?

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@DustyShoe
Collaborator Author

@lstein

  1. Thanks for the suggestion — that makes sense. I’ll add those buttons.

  2. This behavior already exists in other parts of the frontend. For example, if there were staged generations on the canvas and the backend is restarted, the frontend does not automatically refresh its state. The generation may complete, but the final image is not displayed until the page is refreshed. That’s also why using the launcher is recommended — it closes the frontend when the backend stops.

Given that this is an existing pattern, I treated the download state the same way. That said, I’ll see whether the download state can be updated after a backend restart without introducing major changes.

@DustyShoe
Collaborator Author

@lstein

  1. I think the cleanest UX would be to make “Pause All / Resume All” a single toggle button that switches based on the current queue state.
  2. I’ve done some additional testing and found that the behavior occurs only when the backend is stopped via Ctrl-C.

In that case, the backend explicitly pauses active download jobs during shutdown. However, the frontend does not receive the final status update because the socket connection is already closed at that moment.

If the backend is terminated via window close (X) or Task Manager, the jobs are not explicitly paused on shutdown. After restart, the backend restores and resumes the in-progress downloads automatically, which is why they continue without user interaction.

I propose keeping the current behavior and adding an explicit re-sync on reconnect for jobs that were paused during a graceful shutdown. This way, the UI will correctly reflect the paused state, and the user can resume them using the “Resume All” action.

Additionally, I added a “Backend disconnected” indicator to the downloader title bar.
It is shown when the backend crashes or is closed, and disappears once the connection is restored.

@lstein
Collaborator

lstein commented Feb 23, 2026

Looking good. I'll do just a little more testing tomorrow before approving.

Collaborator

@lstein lstein left a comment

Working as advertised. Great enhancement!

@lstein lstein enabled auto-merge (squash) February 24, 2026 02:26
@lstein lstein merged commit b9f9015 into invoke-ai:main Feb 24, 2026
13 checks passed
