V1.2: Adds SUBTITLE_FORMAT / OVERWRITE_SUBTITLES #50

Open
bridgemill-ch wants to merge 14 commits into johnpc:main from bridgemill-ch:v1.2

Conversation

@bridgemill-ch
Contributor

Summary

Fixes the scan failing to start on large subtitle libraries (caused by
OOM and event loop blocking). Adds the SUBTITLE_FORMAT /
OVERWRITE_SUBTITLES options, which preserve original filenames so
media players detect the correct language.

Changes

  • Fixed OOM by paginating all API endpoints and switching emitFileUpdate
    to direct DB queries instead of full table scans
  • Fixed event loop blocking by removing redundant findMatchingVideoFile
    calls and O(n²) DB queries during initial scan
  • Added OVERWRITE_SUBTITLES=true mode: syncs the subtitle in place and
    writes a # synced: header comment to track already-processed
    files instead of relying on filename suffixes
  • Fixed engine output filter to check all known engines, not just
    enabled ones
  • Replaced DB-based tracking with file header tracking (self-contained,
    survives DB resets)
  • Fixed CRLF line endings in entrypoint.sh

Usage

Add to compose environment:

- SUBTITLE_FORMAT=overwrite (or OVERWRITE_SUBTITLES=true)

david-steg added 14 commits May 10, 2026 21:40
Removed redundant findMatchingVideoFile call from the run:files_found
handler and eliminated O(n^2) DB queries during initial file scan.

With thousands of subtitle files, the synchronous forEach loop in the
event handler blocked the event loop by calling findMatchingVideoFile
(disk I/O) and emitFileUpdate (full table scan per file) for every
file before batch processing could begin.

Video path matching now only happens in processFile when the file is
actually processed, and video_path is stored in the DB at that point.
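The deferral described in this commit could be sketched roughly as below: the scan handler only collects paths, and the expensive per-file work (video matching, DB writes) runs later inside a bounded batch loop. `processInBatches` and its `worker` callback are hypothetical names for illustration, not the PR's actual code.

```typescript
// Hypothetical helper: runs `worker` over `items` in fixed-size batches.
// At most `batchSize` workers are in flight at once, and each batch must
// finish before the next one starts.
async function processInBatches<T>(
  items: readonly T[],
  batchSize: number,
  worker: (item: T) => Promise<void>,
): Promise<void> {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  for (let start = 0; start < items.length; start += batchSize) {
    const batch = items.slice(start, start + batchSize);
    // Per-file work (e.g. video matching) happens here, not in the scan handler.
    await Promise.all(batch.map((item) => worker(item)));
  }
}
```

In this shape the scan handler would do nothing more than push file paths into an array, so its cost stays proportional to the directory listing rather than to per-file disk or DB access.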

entrypoint.sh had Windows CRLF line endings, causing the Linux kernel to
interpret the shebang as #!/bin/bash\r (with trailing carriage return),
resulting in 'exec /entrypoint.sh: no such file or directory' at container
startup. Added .gitattributes to enforce LF line endings for shell scripts.

When OVERWRITE_SUBTITLES is enabled, the first successful engine result is
copied over the
original subtitle file and the engine output is cleaned up. This
preserves the original filename (e.g. movie.de.srt) so media players
correctly detect the language instead of showing 'ffsubsync'.

Adds a processed_files table to the database to track which files
have already been overwritten, preventing re-processing on subsequent
scans.

Adds SUBTITLE_FORMAT env var with three modes:
- standard (default): file.de.ffsubsync.srt
- engine-lang: file.ffsubsync.de.srt (preserves language tag for players)
- overwrite: replaces original file in-place

Also extracts output path logic into shared getOutputPath helper
so generators, scanner, and overwrite logic use consistent paths.
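A minimal sketch of what a shared path helper for the three modes listed in this commit might look like. The signature, the `.srt`-only handling, and the rule that a trailing 2-3 letter segment counts as a language tag are all assumptions made for illustration; the PR's real getOutputPath may differ.

```typescript
type SubtitleFormat = "standard" | "engine-lang" | "overwrite";

// Hypothetical signature: maps an input subtitle path to the engine output path.
function getOutputPath(srtPath: string, engine: string, format: SubtitleFormat): string {
  if (format === "overwrite") return srtPath; // synced result replaces the original

  const ext = ".srt";
  if (!srtPath.toLowerCase().endsWith(ext)) {
    throw new Error(`expected an .srt path, got: ${srtPath}`);
  }
  const stem = srtPath.slice(0, -ext.length); // e.g. "/media/file.de"

  if (format === "standard") return `${stem}.${engine}${ext}`;

  // engine-lang: move the engine name in front of the trailing language tag.
  const lastSlash = Math.max(stem.lastIndexOf("/"), stem.lastIndexOf("\\"));
  const lastDot = stem.lastIndexOf(".");
  const tag = lastDot > lastSlash ? stem.slice(lastDot + 1) : "";
  // Assumption: a 2-3 letter trailing segment is treated as a language tag.
  if (/^[a-z]{2,3}$/i.test(tag)) {
    return `${stem.slice(0, lastDot)}.${engine}.${tag}${ext}`;
  }
  return `${stem}.${engine}${ext}`; // no language tag found
}
```

Routing generators, the scanner, and the overwrite logic through one function like this keeps the naming rule in a single place, which is the consistency benefit the commit message describes.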

- Added getFileResult(runId, filePath) direct query to database
- Changed emitFileUpdate to query single file instead of loading all
- Removed file list from WebSocket initial state (was the OOM trigger)
- Added limit to GET /api/status file results

- Added getFileResultsPaginated to database (LIMIT/OFFSET query)
- Fixed GET /api/runs/:id to load only the latest 500 files
- Fixed GET /api/status to use a paginated query
- Added paginated GET /api/runs/:id/files endpoint
- Fixed emitFileUpdate to use a direct query instead of a full table scan
- Removed file list from WebSocket initial state
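For readers unfamiliar with LIMIT/OFFSET pagination, a rough sketch of how page parameters could be clamped and turned into a parameterized query. The table name `file_results`, the column names, and `buildFileResultsQuery` itself are hypothetical; only the LIMIT/OFFSET approach and the 500-row cap come from the commit messages.

```typescript
interface PageQuery {
  sql: string;
  params: (string | number)[];
}

const MAX_PAGE_SIZE = 500; // mirrors the "latest 500 files" cap mentioned above

// Coerces untrusted numeric input into an integer within [min, max].
function clampInt(value: number, min: number, max: number): number {
  if (!Number.isFinite(value)) return min;
  return Math.min(max, Math.max(min, Math.floor(value)));
}

// Hypothetical query builder; pages are 1-based.
function buildFileResultsQuery(runId: string, page: number, pageSize: number): PageQuery {
  const limit = clampInt(pageSize, 1, MAX_PAGE_SIZE);
  const safePage = clampInt(page, 1, Number.MAX_SAFE_INTEGER);
  const offset = (safePage - 1) * limit;
  return {
    // Assumed schema: file_results(id, run_id, ...)
    sql: "SELECT * FROM file_results WHERE run_id = ? ORDER BY id DESC LIMIT ? OFFSET ?",
    params: [runId, limit, offset],
  };
}
```

Capping the page size on the server is what bounds memory per request; without the cap a client could still ask for every row in one page.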

isEngineOutput was only checking currently enabled engines, so
old .autosubsync. files were picked up when only alass was enabled.
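A minimal sketch of the corrected check, assuming engine outputs are recognised by an `.<engine>.` segment in the filename. The engine list is taken from the names that appear in this PR (ffsubsync, autosubsync, alass); the constant name and the matching rule are assumptions.

```typescript
// All engines the tool has ever produced output for, regardless of which
// ones are enabled in the current configuration.
const KNOWN_ENGINES = ["ffsubsync", "autosubsync", "alass"] as const;

// Returns true if the filename looks like output from any known engine.
function isEngineOutput(fileName: string): boolean {
  const lower = fileName.toLowerCase();
  return KNOWN_ENGINES.some((engine) => lower.includes(`.${engine}.`));
}
```

Checking against the full list means a leftover `movie.de.autosubsync.srt` is still skipped as an input when only alass is enabled.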

When SUBTITLE_FORMAT is changed (e.g. from standard to engine-lang),
existing engine output files (.ffsubsync.srt, etc.) are renamed to
match the new naming convention before the scan begins. This prevents
re-syncing already-processed subtitles and keeps filenames consistent.

When SUBTITLE_FORMAT=overwrite, synced subtitles are now marked with
a '# synced:' comment at the top of the file instead of relying on
engine suffixes in the filename or a database table. This preserves
the original filename (e.g. movie.de.srt) so media players correctly
detect the language. The scanner reads the file header to skip
already-synced files.

Removed the engine-lang format (was confusing, not the right approach).
Both OVERWRITE_SUBTITLES=true and SUBTITLE_FORMAT=overwrite enable
overwrite mode with file-header tracking.
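The header-based tracking could be sketched as two pure string helpers. What follows the `# synced:` prefix is not stated in the commit message, so writing the engine name there is an assumption, as are the function names and the BOM handling.

```typescript
const SYNCED_PREFIX = "# synced:";

// Drops a leading UTF-8 byte order mark, if present.
function stripBom(content: string): string {
  return content.charCodeAt(0) === 0xfeff ? content.slice(1) : content;
}

// True if the subtitle text already starts with the marker line.
function hasSyncedHeader(content: string): boolean {
  return stripBom(content).startsWith(SYNCED_PREFIX);
}

// Prepends the marker line; returns the input unchanged if already marked.
// Assumption: the engine name is what gets recorded after the prefix.
function addSyncedHeader(content: string, engine: string): string {
  if (hasSyncedHeader(content)) return content;
  return `${SYNCED_PREFIX} ${engine}\n${stripBom(content)}`;
}
```

Because the marker lives in the file itself, deleting or resetting the database does not cause already-synced subtitles to be processed again.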

isSyncedSrt was reading the entire SRT file for each of 197k files,
causing massive blocking I/O. It now reads only the first 100 bytes.

Moved isSyncedSrt check out of the file scan into processFile so
the scan stays fast (only directory traversal + filename filtering).
The header check is now async using fs/promises so concurrent batches
don't block the event loop.
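A sketch of an async header check along these lines, reading only the first 100 bytes through `fs/promises`. This is an illustration under assumptions (UTF-8 files, marker at byte 0 or right after a BOM), not the PR's actual isSyncedSrt.

```typescript
import { open } from "node:fs/promises";

const HEADER_BYTES = 100; // only this much of each file is read
const SYNCED_PREFIX = "# synced:";

// Resolves to true if the file starts with the "# synced:" marker.
async function isSyncedSrt(filePath: string): Promise<boolean> {
  const handle = await open(filePath, "r");
  try {
    const buffer = Buffer.alloc(HEADER_BYTES);
    // Read at most HEADER_BYTES starting at file offset 0.
    const { bytesRead } = await handle.read(buffer, 0, HEADER_BYTES, 0);
    const head = buffer.toString("utf8", 0, bytesRead).replace(/^\uFEFF/, "");
    return head.startsWith(SYNCED_PREFIX);
  } finally {
    await handle.close(); // always release the file descriptor
  }
}
```

Since the read is asynchronous and bounded, several of these checks can be awaited concurrently in a batch without any one of them holding the event loop.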

…e mode

In overwrite mode, old engine output files from previous standard-format
runs were triggering the skip check inside generators. Disabled the
existsSync check in generators when OVERWRITE_SUBTITLES=true so the
header is the only source of truth for already-synced status.
Owner

@johnpc johnpc left a comment

Thanks for the contribution — the OOM fixes and overwrite mode are solid work. A few things need addressing before merge:

  1. Revert the CI workflow changes. The .github/workflows/docker-publish.yml modifications (pushing to bridgemill/subsyncarr, triggering on your branch instead of tags) are specific to your fork and can't be merged into this repo. Please revert that file to match main.

  2. SRT comment header concern. The # synced: marker written into SRT files — have you verified that common players (Jellyfin, Plex, VLC, mpv) don't render that line as subtitle text? SRT doesn't officially support comments, so some parsers may display it. If any do, consider using a zero-duration cue at timestamp 00:00:00,000 --> 00:00:00,000 instead, which is universally ignored.

The actual feature code (pagination, event loop fix, isEngineOutput checking all engines, lazy header check, overwrite mode) is well-structured and I'd be happy to merge once the workflow file is reverted.
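For illustration, the zero-duration cue suggested in point 2 might look something like the sketch below. The cue index 0, the `synced:` text, and the function names are all hypothetical choices, and whether a given player or parser actually ignores such a cue would still need to be tested.

```typescript
const MARKER_TIMING = "00:00:00,000 --> 00:00:00,000";
const MARKER_TEXT = "synced"; // hypothetical marker text

// Builds a zero-duration cue. Index 0 is an arbitrary choice so that the
// existing cues (normally numbered from 1) do not need renumbering.
function buildMarkerCue(engine: string): string {
  return `0\n${MARKER_TIMING}\n${MARKER_TEXT}: ${engine}\n\n`;
}

// True if the first cue of the file is the marker cue.
function hasMarkerCue(srt: string): boolean {
  const lines = srt.replace(/^\uFEFF/, "").split(/\r?\n/, 3);
  return lines[1] === MARKER_TIMING && (lines[2] ?? "").startsWith(`${MARKER_TEXT}:`);
}

// Prepends the marker cue unless the file already has one.
function addMarkerCue(srt: string, engine: string): string {
  if (hasMarkerCue(srt)) return srt;
  return buildMarkerCue(engine) + srt.replace(/^\uFEFF/, "");
}
```

Compared with a `#` comment line, this keeps the file structurally a sequence of cues, at the cost of a slightly longer header to read back when checking whether a file was already synced.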
