Skip to content

Conversation

sisp
Copy link
Collaborator

@sisp sisp commented Jan 29, 2024

I've refactored the LFSClient class some more to directly use fsspec's HTTPFileSystem implementation instead of the one from dvc-http, as most of dvc-http's implementation is irrelevant to making HTTP requests independent of DVC. I've retained the retry client, which dvc-http also uses, because that's generally consistent with the official Git LFS client's implementation. In fact, the official Git LFS client also supports retries for Batch API requests, which are disabled by dvc-http's ReadOnlyRetryClient class. To otherwise keep the retry behavior as before, I've copied the retry settings from dvc-http including the hardcoded number of parallel jobs inherited from dvc_objects.fs.base.FileSystem. In a follow-up PR, I believe we should make this value configurable and get the jobs value from DVC's used FileSystem instance in dvc.scm.lfs_prefetch like this:

 def lfs_prefetch(fs: "FileSystem", paths: List[str]):
     from scmrepo.git.lfs import fetch as _lfs_fetch

     from dvc.fs.dvc import DVCFileSystem
     from dvc.fs.git import GitFileSystem

     if isinstance(fs, DVCFileSystem) and isinstance(fs.repo.fs, GitFileSystem):
         git_fs = fs.repo.fs
         scm = fs.repo.scm
         assert isinstance(scm, Git)
     else:
         return

     try:
         if "filter=lfs" not in git_fs.open(".gitattributes").read():
             return
     except OSError:
         return
     with TqdmGit(desc="Checking for Git-LFS objects") as pbar:
         _lfs_fetch(
             scm,
             [git_fs.rev],
             include=[(path if path.startswith("/") else f"/{path}") for path in paths],
             progress=pbar.update_git,
+            jobs=fs.jobs,
         )

@sisp
Copy link
Collaborator Author

sisp commented Jan 29, 2024

The cyclic import seems to be an upstream problem in dvc-objects:

$ python -c 'from dvc_objects.executors import batch_coros'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".venv/lib/python3.11/site-packages/dvc_objects/executors.py", line 21, in <module>
    from .fs.callbacks import Callback
  File ".venv/lib/python3.11/site-packages/dvc_objects/fs/__init__.py", line 5, in <module>
    from . import generic  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/dvc_objects/fs/generic.py", line 11, in <module>
    from dvc_objects.executors import ThreadPoolExecutor, batch_coros
ImportError: cannot import name 'ThreadPoolExecutor' from partially initialized module 'dvc_objects.executors' (most likely due to a circular import) (.venv/lib/python3.11/site-packages/dvc_objects/executors.py)

I'm looking into it.

@sisp
Copy link
Collaborator Author

sisp commented Jan 29, 2024

It seems three re-exports in dvc_objects/fs/__init__.py are causing the cycle:

from . import generic  # noqa: F401
from .local import LocalFileSystem, localfs  # noqa: F401
from .memory import MemoryFileSystem  # noqa: F401L7

But removing them would break compatibility. 🤔

@sisp
Copy link
Collaborator Author

sisp commented Jan 29, 2024

We could use a slightly ugly workaround to enforce a different import order in scmrepo/git/lfs/client.py:

 import aiohttp
 from aiohttp_retry import ExponentialRetry, RetryClient
-from dvc_objects.executors import batch_coros
 from dvc_objects.fs import localfs
 from dvc_objects.fs.callbacks import DEFAULT_CALLBACK
 from dvc_objects.fs.utils import as_atomic
@@ -14,6 +13,11 @@ from fsspec.asyn import sync_wrapper
 from fsspec.implementations.http import HTTPFileSystem
 from funcy import cached_property
 
+if True:
+    # NOTE: This is a workaround to avoid an import cycle. Importing from
+    # `dvc_objects.executors` must come after importing from `dvc_objects.fs`.
+    from dvc_objects.executors import batch_coros
+
 from scmrepo.git.credentials import Credential, CredentialNotFoundError
 
 from .exceptions import LFSError

Or we could do a lazy import of batch_coros in LFSClient._download().

WDYT, @pmrowla?

@efiop efiop requested a review from pmrowla January 29, 2024 09:49
@pmrowla
Copy link
Contributor

pmrowla commented Feb 6, 2024

Or we could do a lazy import of batch_coros in LFSClient._download().

It would be better to drop the dependency on dvc-objects entirely and use the fsspec version of batch_coros. The dvc-objects implementation is more optimized, but it's not a big issue in this use case, and it's been noted elsewhere that scmrepo shouldn't really import any of the DVC libraries since it introduces the circular dependencies.

@pmrowla pmrowla force-pushed the refactor/lfs-fsspec-httpfilesystem branch from df0561d to 68b778f Compare February 6, 2024 06:11
@codecov-commenter
Copy link

codecov-commenter commented Feb 6, 2024

Codecov Report

Attention: 7 lines in your changes are missing coverage. Please review.

Comparison is base (04ca001) 78.41% compared to head (68b778f) 78.42%.

Files Patch % Lines
src/scmrepo/git/lfs/client.py 56.25% 7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #315      +/-   ##
==========================================
+ Coverage   78.41%   78.42%   +0.01%     
==========================================
  Files          39       39              
  Lines        4970     4969       -1     
  Branches      893      892       -1     
==========================================
  Hits         3897     3897              
+ Misses        919      918       -1     
  Partials      154      154              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@pmrowla
Copy link
Contributor

pmrowla commented Feb 6, 2024

In a follow-up PR, I believe we should make this value configurable and get the jobs value from DVC's used FileSystem instance in dvc.scm.lfs_prefetch like this:

scmrepo can probably just get rid of the DVC jobs/_JOBS settings entirely and just forward the batch_size kwarg into the fsspec operations (and fallback to fsspec's default batch size which is also based on cpu count), but as you noted this can be done in a follow up PR

@pmrowla pmrowla merged commit 641ee43 into iterative:main Feb 6, 2024
@sisp sisp deleted the refactor/lfs-fsspec-httpfilesystem branch February 6, 2024 06:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants