
Add parallel file conversion support with ThreadPoolExecutor #9

Merged
samuelduchesne merged 2 commits into main from
claude/improve-convert-speed-H0A1O
Feb 25, 2026

Conversation

@samuelduchesne
Contributor

Summary

This PR adds support for parallel file conversion in the EnergyPlus documentation build process. File conversions are now executed concurrently using a ThreadPoolExecutor when multiple workers are specified, improving build performance for large documentation sets.

Key Changes

  • Parallel conversion infrastructure: Introduced _convert_files() function that uses ThreadPoolExecutor to run file conversions in parallel when max_workers > 1. The Pandoc subprocess calls are I/O-bound, making threads well-suited for this workload.

  • Task collection refactoring: Extracted _collect_tasks() function to separate task building from execution, enabling better separation of concerns and cleaner parallel execution logic.

  • Refactored convert_doc_set(): Split into two phases:

    • Phase 1: Parallel file conversions (when applicable)
    • Phase 2: Sequential result logging and TOC generation (must happen after files are written)
  • CLI enhancements:

    • Added --max-workers argument to scripts/convert.py with default value of CPU count
    • Added --file-workers argument to scripts/convert_all.py to control per-version parallelism
    • Updated Makefile to support MAX_WORKERS environment variable
  • API updates: Added max_workers keyword-only parameter to convert_doc_set() and convert_version() functions with sensible defaults.
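The pieces above can be sketched as follows. The names `_collect_tasks()`, `_convert_files()`, `convert_doc_set()`, and the `max_workers` keyword come from this PR; the task and result shapes are illustrative stand-ins for the real Pandoc invocation, not the actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor


def _collect_tasks(sources):
    """Build the list of pending conversions without running them.

    A real task would carry the .tex path, output path, and Pandoc
    options; here each task is a zero-argument callable standing in
    for one convert_tex_file() call.
    """
    return [lambda src=src: {"source": src, "ok": True} for src in sources]


def _convert_files(tasks, max_workers):
    """Phase 1: run conversions, in parallel only when it can pay off."""
    if max_workers > 1 and len(tasks) > 1:
        # Pandoc calls are I/O-bound subprocesses, so threads are
        # sufficient; no process pool is needed at this level.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(lambda task: task(), tasks))
    # Sequential fallback keeps the single-worker path simple.
    return [task() for task in tasks]


def convert_doc_set(sources, *, max_workers=1):
    results = _convert_files(_collect_tasks(sources), max_workers)
    # Phase 2: result logging and TOC generation stay sequential so
    # entries come out in a deterministic order after files are written.
    toc = [result["source"] for result in results]
    return results, toc
```

Note that `ThreadPoolExecutor.map()` returns results in input order, which is what lets Phase 2 produce a stable TOC regardless of which conversion finishes first.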

Implementation Details

  • Parallel execution activates only when max_workers > 1 AND there is more than one task; otherwise it falls back to sequential processing
  • TOC generation and result logging remain sequential to ensure proper file ordering and consistency
  • Error handling and logging use file_result.source instead of tex_path for consistency across parallel and sequential paths
  • The convert_all.py script can now control both version-level parallelism (multiple versions) and file-level parallelism (files within a version)

https://claude.ai/code/session_01XAaaYCg6rvfpWNsmWRxZen
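Combining the two parallelism levels multiplies: each version-level process can spawn its own file-level threads. A minimal sketch of the worst-case arithmetic (the helper name is hypothetical, introduced only for illustration):

```python
import os


def worst_case_pandoc_fanout(version_workers, file_workers):
    """Upper bound on concurrent Pandoc subprocesses when convert_all
    runs version_workers processes, each spawning file_workers threads."""
    return version_workers * file_workers


cpus = os.cpu_count() or 4
# If both levels default to CPU count, the bound grows quadratically:
assert worst_case_pandoc_fanout(cpus, cpus) == cpus ** 2
# Pinning file workers to 1 keeps it at one Pandoc per process:
assert worst_case_pandoc_fanout(cpus, 1) == cpus
```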


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 67950de69d


Comment thread scripts/convert_all.py Outdated
Comment on lines +118 to +121
"--file-workers",
type=int,
default=os.cpu_count() or 4,
help="Parallel file conversions per version (default: CPU count)",


P1: Reduce default file-worker fanout in convert_all

--file-workers now defaults to os.cpu_count(), but convert_all already runs up to --max-workers processes (also defaulting to CPU count), so the default invocation can fan out to roughly cpu_count^2 concurrent file conversions and Pandoc subprocesses. On typical CI/dev machines this causes heavy oversubscription, memory pressure, and build failures or timeouts instead of speeding up conversion. A safer default for per-version file workers (for example 1, or a value derived from max_workers) avoids breaking the default workflow.


Parallelize Pandoc file conversions within a single version using
ThreadPoolExecutor, significantly speeding up the build. Each
convert_tex_file() call is independent (label_index is read-only,
output paths are unique), so files can safely be converted in parallel.

- scripts/convert.py: Add --max-workers flag (default: CPU count) that
  controls thread pool size for file conversions within convert_doc_set()
- scripts/convert_all.py: Add --file-workers flag (separate from the
  existing --max-workers for version-level process parallelism)
- Makefile: Add MAX_WORKERS variable to the convert target

https://claude.ai/code/session_01XAaaYCg6rvfpWNsmWRxZen

When convert_all runs with default settings, --max-workers (CPU count)
processes each spawned --file-workers (CPU count) threads, resulting in
cpu_count^2 concurrent Pandoc subprocesses. This causes memory pressure
and build failures on typical CI/dev machines.

Default --file-workers to 1 since the outer process pool already
saturates the machine. Users can still opt into inner parallelism
explicitly via --file-workers.

https://claude.ai/code/session_01XAaaYCg6rvfpWNsmWRxZen
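The fix described in the commit message above amounts to changing one argparse default. The flag names match the diff hunk quoted in the review thread; the surrounding parser setup is an assumption, not the script's actual code:

```python
import argparse
import os

parser = argparse.ArgumentParser(prog="convert_all.py")
# Version-level parallelism: how many versions convert at once (processes).
parser.add_argument(
    "--max-workers",
    type=int,
    default=os.cpu_count() or 4,
    help="Parallel version conversions (default: CPU count)",
)
# File-level parallelism inside each version (threads). Defaulting to 1
# avoids the cpu_count**2 fanout: the outer process pool already
# saturates the machine, and users can still opt in explicitly.
parser.add_argument(
    "--file-workers",
    type=int,
    default=1,
    help="Parallel file conversions per version (default: 1)",
)

defaults = parser.parse_args([])
opted_in = parser.parse_args(["--file-workers", "4"])
```

With this shape, the default invocation runs at most `cpu_count` Pandoc subprocesses total, while `--file-workers 4` restores inner parallelism on demand.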
@samuelduchesne samuelduchesne force-pushed the claude/improve-convert-speed-H0A1O branch from bb6e1d2 to 81c9637 on February 25, 2026 at 18:24
@samuelduchesne samuelduchesne merged commit b70244f into main Feb 25, 2026
3 of 4 checks passed
@samuelduchesne samuelduchesne deleted the claude/improve-convert-speed-H0A1O branch February 25, 2026 18:34
