Description
Bug Description
When running ilab data generate --pipeline simple, the process fails with the following error:
failed to generate data with exception: type object 'StandardPdfPipeline' has no attribute 'download_models_hf'
Environment
InstructLab version: 0.26.1
instructlab-sdg version: 0.8.3
Docling version: 2.61.1 (required >= 2.28.4)
Platform: macOS (also affects other platforms)
Root Cause
StandardPdfPipeline.download_models_hf() has been deprecated and is no longer available on StandardPdfPipeline in newer versions of Docling (the failure above was observed with 2.61.1; the installed requirement is >= 2.28.4). The method now exists only on LegacyStandardPdfPipeline, where it emits a deprecation warning.
According to Docling's deprecation notice, the recommended approach is to use:
docling.utils.model_downloader.download_models() (programmatic)
docling-tools models download (CLI)
Affected Code
File: chunkers.py (line ~142)
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline

if self.docling_model_path is None:
    logger.info("Docling models not found on disk, downloading models...")
    self.docling_model_path = StandardPdfPipeline.download_models_hf()
Proposed Fix
Option 1 (Recommended): Use the new Docling API
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline
from docling.utils.model_downloader import download_models

if self.docling_model_path is None:
    logger.info("Docling models not found on disk, downloading models...")
    self.docling_model_path = download_models(output_dir=None, force=False, progress=False)
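If chunkers.py needs to tolerate more than one Docling release, the lookup could be made defensive. The following is only a sketch under assumptions: the helper name find_docling_downloader and its injectable import_module parameter are hypothetical and not part of instructlab-sdg or Docling. It prefers docling.utils.model_downloader.download_models and falls back to the old class method only if that method is still present.

```python
import importlib
from typing import Callable, Optional


def find_docling_downloader(
    import_module: Callable[[str], object] = importlib.import_module,
) -> Optional[Callable[[], object]]:
    """Return a zero-argument callable that downloads Docling models, or None.

    Hypothetical helper for illustration; `import_module` is injectable so the
    lookup logic can be exercised without Docling installed.
    """
    # Preferred: the function named in Docling's deprecation notice.
    try:
        module = import_module("docling.utils.model_downloader")
        return module.download_models
    except (ImportError, AttributeError):
        pass

    # Fallback: Docling releases that still expose the old class method.
    try:
        module = import_module("docling.pipeline.standard_pdf_pipeline")
        return module.StandardPdfPipeline.download_models_hf
    except (ImportError, AttributeError):
        return None
```

In chunkers.py, the result would be called only when self.docling_model_path is None, and a None return would be reported as an error rather than silently ignored.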
Option 2 (Temporary): Use LegacyStandardPdfPipeline
from docling.pipeline.legacy_standard_pdf_pipeline import LegacyStandardPdfPipeline

if self.docling_model_path is None:
    logger.info("Docling models not found on disk, downloading models...")
    self.docling_model_path = LegacyStandardPdfPipeline.download_models_hf()
Steps to Reproduce
1. Install InstructLab with Docling >= 2.28.4
2. Set up a taxonomy with knowledge documents
3. Run ilab data generate --pipeline simple
4. Observe the error
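Before running the steps above, it may help to confirm whether a given environment is affected. The script below is a minimal sketch: the function names installed_version and is_affected are made up for this example, and the package names are taken from the Environment section. It prints the installed versions and checks whether StandardPdfPipeline still has download_models_hf.

```python
import importlib
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_affected() -> Optional[bool]:
    """True if StandardPdfPipeline lacks download_models_hf.

    Returns None when Docling cannot be imported at all.
    """
    try:
        module = importlib.import_module("docling.pipeline.standard_pdf_pipeline")
    except ImportError:
        return None
    return not hasattr(module.StandardPdfPipeline, "download_models_hf")


if __name__ == "__main__":
    # Package names as listed in the Environment section of this report.
    for name in ("instructlab", "instructlab-sdg", "docling"):
        print(f"{name}: {installed_version(name)}")
    print(f"affected: {is_affected()}")
```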
Impact
This bug prevents users from generating synthetic training data, blocking the entire InstructLab workflow for knowledge-based model fine-tuning.
Workaround
Manually patch chunkers.py with one of the proposed fixes above.