Support remote images #52

Open
bertsky opened this issue Sep 28, 2022 · 3 comments · May be fixed by #59

Comments

@bertsky
Contributor

bertsky commented Sep 28, 2022

We frequently have the use case where some (or even all) of the file references have not been downloaded yet.

But these URL references for images make OcrdBrowser stumble:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/ocrd_browser/ui/window.py", line 92, in _open
    self.page_list.set_document(self.document)
  File "/usr/local/lib/python3.7/site-packages/ocrd_browser/ui/page_browser.py", line 39, in set_document
    self.model = PageListStore(self.document)
  File "/usr/local/lib/python3.7/site-packages/ocrd_browser/ui/page_store.py", line 57, in __init__
    file_lookup = document.get_image_paths(self.file_group)
  File "/usr/local/lib/python3.7/site-packages/ocrd_browser/model/document.py", line 275, in get_image_paths
    image_paths[page_id] = self.path(images[0])
  File "/usr/local/lib/python3.7/site-packages/ocrd_browser/model/document.py", line 169, in path
    return self.directory.joinpath(other.local_filename)
  File "/usr/local/lib/python3.7/pathlib.py", line 922, in joinpath
    return self._make_child(args)
  File "/usr/local/lib/python3.7/pathlib.py", line 704, in _make_child
    drv, root, parts = self._parse_args(args)
  File "/usr/local/lib/python3.7/pathlib.py", line 658, in _parse_args
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType

That's because in …

if isinstance(other, OcrdFile):
    return self.directory.joinpath(other.local_filename)

… we do not differentiate between an OcrdFile's .local_filename (which may be empty) and its .url. The latter could still be downloaded into the document.directory under some name and returned here.

Or perhaps one could somehow make this downloading a lazy operation only to be triggered when actually needed.
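
A minimal sketch of that idea, not the project's actual fix. It prefers `local_filename` when it is set and otherwise downloads from `url` on first access, under a name derived from the URL so that later calls reuse the copy. `FileRef`, `resolve_path`, `default_fetch` and the `downloaded` subdirectory are hypothetical names; `FileRef` only stands in for the two `OcrdFile` attributes discussed above.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional
from urllib.parse import urlparse
from urllib.request import urlretrieve


@dataclass
class FileRef:
    """Hypothetical stand-in for OcrdFile with only the attributes used here."""
    local_filename: Optional[str]
    url: Optional[str]


def default_fetch(url: str, target: Path) -> None:
    """Download url to target with the standard library."""
    urlretrieve(url, str(target))


def resolve_path(
    directory: Path,
    ref: FileRef,
    fetch: Callable[[str, Path], None] = default_fetch,
) -> Path:
    """Return a local path for ref, downloading from ref.url only if needed."""
    if ref.local_filename:
        # Already local: same behaviour as the current code path.
        return directory.joinpath(ref.local_filename)
    if not ref.url:
        raise ValueError("file reference has neither local_filename nor url")
    # Assumption: the last URL path segment is a usable file name.
    # Two URLs with the same basename would collide in this sketch.
    name = Path(urlparse(ref.url).path).name or "download"
    target = directory / "downloaded" / name
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        fetch(ref.url, target)
    return target
```

Because the download happens inside `resolve_path`, nothing is fetched until a page's image is actually requested, and passing a different `fetch` callable makes the behaviour easy to test without network access.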

@bertsky
Contributor Author

bertsky commented Feb 2, 2023

> we do not differentiate between an OcrdFile's .local_filename (which may be empty) and its .url. The latter could still be downloaded into the document.directory under some name and returned here.
>
> Or perhaps one could somehow make this downloading a lazy operation only to be triggered when actually needed.

BTW, that's also how most OCR-D processors handle this. They rely on Workspace.download_file, which for non-local files will automatically download from the URL and store the result in the workspace (without actually changing the METS, but with a reproducible local path, so subsequent attempts will use the local copy).

hnesk added a commit that referenced this issue Feb 21, 2023
hnesk added a commit that referenced this issue Feb 21, 2023
@bertsky bertsky linked a pull request Feb 27, 2023 that will close this issue
hnesk added a commit that referenced this issue Feb 27, 2023
@hnesk
Owner

hnesk commented Mar 2, 2023

See support_remote_images branch for progress

@kba
Contributor

kba commented Mar 13, 2024

One additional feature wish: a graceful way to handle failing downloads, e.g. showing a placeholder image instead of crashing outright. This does happen in our collection for files in the PRESENTATION fileGrp, which references files by file:// URLs that are not actually usable outside the network:

<mets:fileGrp USE="PRESENTATION">
  <mets:file ID="FILE_0001_PRESENTATION" MIMETYPE="image/tiff">
    <mets:FLocat xmlns:xlink="http://www.w3.org/1999/xlink" LOCTYPE="URL" xlink:href="file:///goobi/tiff001/sbb/PPN680203753/00000001.tif"/>
  </mets:file>

I know that we should fix that on our side, but that is not trivial to do, and we're probably not the only ones (mis)using mets:FLocat like this.
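
A rough sketch of what such a fallback could look like, assuming the image is read as raw bytes and the viewer can render some placeholder payload instead. `read_url` and `load_image_bytes` are hypothetical names, not part of OcrdBrowser or OCR-D; only the standard library is used.

```python
import logging
from typing import Callable
from urllib.request import urlopen

log = logging.getLogger(__name__)


def read_url(url: str, timeout: float = 10.0) -> bytes:
    """Read the resource behind url (http://, https:// or file://)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def load_image_bytes(
    url: str,
    placeholder: bytes,
    read: Callable[[str], bytes] = read_url,
) -> bytes:
    """Return the image data for url, or placeholder if it cannot be read."""
    try:
        return read(url)
    except (OSError, ValueError) as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses;
        # urlopen raises ValueError for malformed URLs.
        log.warning("Could not load %s (%s); using placeholder", url, exc)
        return placeholder
```

With this shape, an unreachable file:// reference like the one above would be logged and replaced by the placeholder rather than raising up into the UI.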
