Fix AutoImageProcessor.from_pretrained failing on URL input #44892

Closed
he-yufeng wants to merge 2 commits into huggingface:main from he-yufeng:fix-image-processor-url-loading

Conversation

@he-yufeng
Contributor

Fixes #44821

The elif is_remote_url(...) / download_url(...) branch in get_image_processor_dict was accidentally removed during the image processor refactor in #43514. This caused AutoImageProcessor.from_pretrained(url) to break with an OSError about invalid repo id format.

The old is_remote_url and download_url utilities were intentionally removed in v5, so this restores URL support with a small local download_url helper using httpx (already a project dependency), paired with a straightforward startswith("http") check — consistent with how URLs are detected elsewhere in the codebase (e.g. video_utils.py, audio_utils.py).

Added a regression test that mocks the download and verifies the processor loads correctly from a URL.
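
For illustration, the approach described above (a prefix check plus a small download helper) could look roughly like the sketch below. This is not the PR's code: `is_http_url` and `download_to_tempfile` are hypothetical names, and the standard library's `urllib.request` is swapped in for httpx so the sketch has no third-party dependency.

```python
import tempfile
import urllib.request
from pathlib import Path


def is_http_url(name: str) -> bool:
    """Return True when the string starts with an http(s) scheme.

    Hypothetical helper: repo ids and local paths fall through as False.
    """
    return name.startswith(("http://", "https://"))


def download_to_tempfile(url: str, suffix: str = ".json") -> str:
    """Fetch `url` and write the body to a new temporary file, returning its path.

    Assumption: urllib stands in for the httpx call mentioned in the PR description.
    """
    with urllib.request.urlopen(url, timeout=30) as response:
        payload = response.read()
    # delete=False keeps the file on disk after the handle is closed
    with tempfile.NamedTemporaryFile(mode="wb", suffix=suffix, delete=False) as tmp:
        tmp.write(payload)
        return tmp.name


def local_copy_uri(path: str) -> str:
    """Convenience for trying the helper offline with a file:// URI."""
    return Path(path).resolve().as_uri()
```

Note that every call to `download_to_tempfile` writes a fresh temporary file and nothing is cached, which is the limitation the deprecation warning quoted later in this thread points out.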

@zucchini-nlp

The elif branch for URL detection (is_remote_url + download_url) was
accidentally removed in huggingface#43514 during the image processor refactor.
This restores URL support with a local download_url helper using httpx,
since the old utils.hub.download_url was intentionally dropped in v5.

Fixes huggingface#44821
Comment on lines +292 to +297
elif pretrained_model_name_or_path.startswith("http://") or pretrained_model_name_or_path.startswith(
    "https://"
):
    image_processor_file = pretrained_model_name_or_path
    resolved_image_processor_file = download_url(pretrained_model_name_or_path)
    resolved_processor_file = None
Member

There are utilities already available that we can import and use. Can you instead revert this to how it was previously, and also check the video processing and plain processing files? They were copied from the image processor and might have the same issue.

elif is_remote_url(pretrained_model_name_or_path):
    image_processor_file = pretrained_model_name_or_path
    resolved_image_processor_file = download_url(pretrained_model_name_or_path)
else:

Contributor Author

Got it, thanks for digging into that. Makes sense that it was removed on purpose — no point bringing back deprecated stuff.

Comment on lines +70 to +86
def test_image_processor_from_pretrained_url(self):
    # Regression test: loading from a URL should work (see #44821)
    config_data = {
        "image_processor_type": "ViTImageProcessor",
        "size": {"height": 224, "width": 224},
        "do_resize": True,
    }
    # Write a fake config to a temp file, then have download_url return that path
    with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as f:
        json.dump(config_data, f)
        tmp_path = f.name

    with mock.patch("transformers.image_processing_base.download_url", return_value=tmp_path):
        processor = AutoImageProcessor.from_pretrained("https://example.com/preprocessor_config.json")

    self.assertIsInstance(processor, ViTImageProcessor)
    self.assertEqual(processor.size, {"height": 224, "width": 224})
Member

Thanks for the test! We could do without the mock if we take a random checkpoint from the Hub, e.g. https://huggingface.co/google/vit-base-patch16-224-in21k/blob/main/preprocessor_config.json

Call AutoImageProcessor.from_pretrained("https://huggingface.co/google/vit-base-patch16-224-in21k/blob/main/preprocessor_config.json") and check that it results in the same internal dict as loading by repo id.

Contributor Author

Agreed, the real checkpoint approach is cleaner. Already switched to that in the latest push before seeing the closure.

Addressed review feedback:
- Reverted inline URL detection to use is_remote_url + download_url from
  utils/hub.py (restored these helpers that were dropped in the refactor)
- Applied the same URL handling fix to processing_utils.py
- Replaced mock-based test with a real HF checkpoint comparison
@he-yufeng
Contributor Author

Updated — switched to is_remote_url() / download_url() utilities as suggested. Also checked video_processing_base.py and processing_utils.py (both had the same gap from the refactor).

Test rewritten to use google/vit-base-patch16-224-in21k directly, no mock needed.

@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=44892&sha=599cdd

Member

@zucchini-nlp zucchini-nlp left a comment

I'm sorry, I just tracked down why download_url was deleted, and that seems to be on purpose. It was deprecated and we don't want to load from a URL anymore.

warnings.warn(
    f"Using `from_pretrained` with the url of a file (here {url}) is deprecated and won't be possible anymore in"
    " v5 of Transformers. You should host your file on the Hub (hf.co) instead and use the repository ID. Note"
    " that this is not compatible with the caching system (your file will be downloaded at each execution) or"
    " multiple processes (each process will download the file in a different temporary file).",
    FutureWarning,
)

Let's close this PR, sorry again 🥲
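
As a rough illustration of how a deprecation like the one quoted above surfaces to callers, the sketch below emits a `FutureWarning` and records it. Both function names are hypothetical and the message is abbreviated; this is not the removed `download_url` implementation.

```python
import warnings


def warn_url_loading_deprecated(url: str) -> None:
    """Emit a FutureWarning for URL-based loading (hypothetical helper, abbreviated message)."""
    warnings.warn(
        f"Using `from_pretrained` with the url of a file (here {url}) is deprecated. "
        "Host the file on the Hub and use the repository ID instead.",
        FutureWarning,
        stacklevel=2,  # attribute the warning to the caller, not to this helper
    )


def collect_warning_categories(url: str) -> list[type]:
    """Call the helper and return the category of every warning it raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure no warning is filtered out
        warn_url_loading_deprecated(url)
    return [w.category for w in caught]
```

A project that wanted to catch such usage early could turn the warning into an exception with `warnings.simplefilter("error", FutureWarning)` in its test setup.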

@he-yufeng
Contributor Author

No worries at all — thanks for tracking down the history on download_url. Totally makes sense that URL loading was deprecated intentionally (caching issues, temp file per call, etc.). I'll close this on my end too. Appreciate the thorough review!


Development

Successfully merging this pull request may close these issues.

Unable to load AutoImageProcessor from URL

2 participants