lfs: use HTTPFileSystem from fsspec instead of dvc-http #315
The cyclic import seems to be an upstream problem in
$ python -c 'from dvc_objects.executors import batch_coros'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".venv/lib/python3.11/site-packages/dvc_objects/executors.py", line 21, in <module>
    from .fs.callbacks import Callback
  File ".venv/lib/python3.11/site-packages/dvc_objects/fs/__init__.py", line 5, in <module>
    from . import generic  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/dvc_objects/fs/generic.py", line 11, in <module>
    from dvc_objects.executors import ThreadPoolExecutor, batch_coros
ImportError: cannot import name 'ThreadPoolExecutor' from partially initialized module 'dvc_objects.executors' (most likely due to a circular import) (.venv/lib/python3.11/site-packages/dvc_objects/executors.py)
I'm looking into it.
It seems three re-exports in
from . import generic  # noqa: F401
from .local import LocalFileSystem, localfs  # noqa: F401
from .memory import MemoryFileSystem  # noqa: F401
But removing them would break compatibility. 🤔
We could use a slightly ugly workaround to enforce a different import order in
 import aiohttp
 from aiohttp_retry import ExponentialRetry, RetryClient
-from dvc_objects.executors import batch_coros
 from dvc_objects.fs import localfs
 from dvc_objects.fs.callbacks import DEFAULT_CALLBACK
 from dvc_objects.fs.utils import as_atomic
@@ -14,6 +13,11 @@ from fsspec.asyn import sync_wrapper
 from fsspec.implementations.http import HTTPFileSystem
 from funcy import cached_property
+if True:
+    # NOTE: This is a workaround to avoid an import cycle. Importing from
+    # `dvc_objects.executors` must come after importing from `dvc_objects.fs`.
+    from dvc_objects.executors import batch_coros
+
 from scmrepo.git.credentials import Credential, CredentialNotFoundError
 from .exceptions import LFSError
Or we could do a lazy import of
WDYT, @pmrowla?
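To make the order dependence concrete, here is a minimal sketch that rebuilds the same shape of import cycle in a throwaway package and tries both import orders plus a call-time (lazy) import. The package name `toy_objects`, its file contents, and the helpers `build_package`, `fresh_import`, `try_import`, and `lazy_batch` are all hypothetical stand-ins; none of this is the real `dvc_objects` code.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

PKG = "toy_objects"  # hypothetical stand-in package, NOT dvc_objects itself

# File layout mirrors the modules named in the traceback (assumed contents).
FILES = {
    "__init__.py": "",
    "executors.py": """
        from .fs.callbacks import Callback

        class ThreadPoolExecutor:
            pass

        def batch_coros(coros):
            return list(coros)
    """,
    "fs/__init__.py": "from . import generic  # re-export\n",
    "fs/callbacks.py": "class Callback:\n    pass\n",
    "fs/generic.py": f"from {PKG}.executors import ThreadPoolExecutor, batch_coros\n",
}


def build_package(root: Path) -> None:
    """Write the toy package to disk under `root`."""
    for rel, body in FILES.items():
        path = root / PKG / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(textwrap.dedent(body))


def fresh_import(*names: str):
    """Import `names` in order after dropping cached copies of the toy package."""
    for mod in [m for m in sys.modules if m == PKG or m.startswith(PKG + ".")]:
        del sys.modules[mod]
    importlib.invalidate_caches()
    return [importlib.import_module(name) for name in names]


def try_import(*names: str) -> str:
    """Return "ok" if the imports succeed in this order, else "ImportError"."""
    try:
        fresh_import(*names)
        return "ok"
    except ImportError:
        return "ImportError"


def lazy_batch(items):
    """Consumer that defers the executors import until call time."""
    from toy_objects.executors import batch_coros  # runs only when called

    return batch_coros(items)


tmp = tempfile.mkdtemp()
build_package(Path(tmp))
sys.path.insert(0, tmp)

# Importing executors first re-enters it through fs/generic.py and fails.
executors_first = try_import(f"{PKG}.executors")
# Importing fs first lets executors finish before generic needs its names.
fs_first = try_import(f"{PKG}.fs", f"{PKG}.executors")
# A lazy import works as long as fs was imported at module load time.
fresh_import(f"{PKG}.fs")
lazy_result = lazy_batch([1, 2, 3])

print(executors_first, fs_first, lazy_result)
```

Both the `if True:` reordering and a lazy import rely on the same thing: `fs` must be fully imported before anything pulls names out of `executors`. Neither removes the cycle itself, which is why dropping the dependency is the cleaner option.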
It would be better to drop the dependency on dvc-objects entirely and use the fsspec version of
df0561d to 68b778f
Codecov Report
Attention:
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #315      +/-   ##
==========================================
+ Coverage   78.41%   78.42%   +0.01%
==========================================
  Files          39       39
  Lines        4970     4969       -1
  Branches      893      892       -1
==========================================
  Hits         3897     3897
+ Misses        919      918       -1
  Partials      154      154
☔ View full report in Codecov by Sentry.
scmrepo can probably just get rid of the DVC
I've refactored the `LFSClient` class some more to directly use `fsspec`'s `HTTPFileSystem` implementation instead of the one from `dvc-http`, as most of `dvc-http`'s implementation is irrelevant to making HTTP requests independent of DVC.

I've retained the retry client, which `dvc-http` also uses, because that's generally consistent with the official Git LFS client's implementation. In fact, the official Git LFS client also supports retries for Batch API requests, which are disabled by `dvc-http`'s `ReadOnlyRetryClient` class.

To otherwise keep the retry behavior as before, I've copied the retry settings from `dvc-http`, including the hardcoded number of parallel jobs inherited from `dvc_objects.fs.base.FileSystem`. In a follow-up PR, I believe we should make this value configurable and get the `jobs` value from DVC's used `FileSystem` instance in `dvc.scm.lfs_prefetch` like this:

 def lfs_prefetch(fs: "FileSystem", paths: List[str]):
     from scmrepo.git.lfs import fetch as _lfs_fetch

     from dvc.fs.dvc import DVCFileSystem
     from dvc.fs.git import GitFileSystem

     if isinstance(fs, DVCFileSystem) and isinstance(fs.repo.fs, GitFileSystem):
         git_fs = fs.repo.fs
         scm = fs.repo.scm
         assert isinstance(scm, Git)
     else:
         return

     try:
         if "filter=lfs" not in git_fs.open(".gitattributes").read():
             return
     except OSError:
         return

     with TqdmGit(desc="Checking for Git-LFS objects") as pbar:
         _lfs_fetch(
             scm,
             [git_fs.rev],
             include=[(path if path.startswith("/") else f"/{path}") for path in paths],
             progress=pbar.update_git,
+            jobs=fs.jobs,
         )
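For readers unfamiliar with what a configurable `jobs` value would control, the following is a rough, stdlib-only sketch of retries with exponential backoff combined with a cap on parallel requests. It is not the code in this PR or in `dvc-http`: the names `with_retries`, `run_batch`, and `demo`, the default retry parameters, and the fake `fetch` coroutine are all assumptions made for illustration.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


async def with_retries(
    make_call: Callable[[], Awaitable[T]],
    attempts: int = 3,
    start_timeout: float = 0.01,
    factor: float = 2.0,
) -> T:
    """Await make_call(), retrying on ConnectionError with exponential backoff."""
    delay = start_timeout
    for attempt in range(1, attempts + 1):
        try:
            return await make_call()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of attempts, surface the last error
            await asyncio.sleep(delay)
            delay *= factor
    raise RuntimeError("attempts must be at least 1")


async def run_batch(calls: Sequence[Callable[[], Awaitable[T]]], jobs: int) -> List[T]:
    """Run all calls with at most `jobs` in flight; results keep input order."""
    sem = asyncio.Semaphore(jobs)

    async def guarded(call: Callable[[], Awaitable[T]]) -> T:
        async with sem:
            return await with_retries(call)

    return await asyncio.gather(*(guarded(call) for call in calls))


async def demo(jobs: int) -> Tuple[List[str], int]:
    """Fake downloads that each fail once; also records peak concurrency."""
    active = 0
    peak = 0
    failed_once = set()

    def make_fetch(oid: int) -> Callable[[], Awaitable[str]]:
        async def fetch() -> str:
            nonlocal active, peak
            active += 1
            peak = max(peak, active)
            try:
                await asyncio.sleep(0.01)  # stand-in for network I/O
                if oid not in failed_once:
                    failed_once.add(oid)
                    raise ConnectionError(f"transient failure for {oid}")
                return f"object-{oid}"
            finally:
                active -= 1

        return fetch

    results = await run_batch([make_fetch(i) for i in range(6)], jobs=jobs)
    return results, peak


results, peak = asyncio.run(demo(jobs=2))
print(results, peak)
```

In this sketch the semaphore slot is held during backoff, so a retrying request still counts against `jobs`; whether that matches the desired behavior for LFS transfers is a design choice the follow-up PR would need to settle.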