♻️ Refactor to use UPath everywhere #1102
lamindb/_file.py
Outdated
 ) -> Tuple[Storage, bool]:
     if not skip_existence_check:
         try:  # check if file exists
             if not filepath.exists():
                 raise FileNotFoundError(filepath)
         except PermissionError:
             pass
-    if not isinstance(filepath, UPath):
-        filepath = filepath.resolve()
+    filepath = filepath.resolve()  # works for UPath also
This isn't needed for a CloudPath object: by definition, a cloud path is already fully specified, including the bucket.
Hence, the check for a local path makes sense: we only want to resolve local paths. Calling resolve() on a CloudPath as well triggers an API request, I fear, and so reduces performance unnecessarily.
I've run several experiments ingesting large numbers of files and found that, as long as the files aren't big, the number of API requests is the only factor that determines performance.
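To illustrate the point, here is a minimal sketch of resolving only local paths. `LOCAL_PATH_CLASSES`, `FakeCloudPath`, and `resolve_if_local` are hypothetical names chosen for this example. The tuple is only a stand-in for `LocalPathClasses`, and the fake class merely counts `resolve()` calls rather than contacting any storage backend.

```python
from pathlib import Path, PosixPath, WindowsPath

# Assumption: stand-in for lamindb's LocalPathClasses (concrete local path types).
LOCAL_PATH_CLASSES = (PosixPath, WindowsPath)


class FakeCloudPath:
    """Hypothetical cloud path that counts how often resolve() is called."""

    def __init__(self, url: str):
        self.url = url
        self.resolve_calls = 0

    def resolve(self) -> "FakeCloudPath":
        # In a real cloud path this call might trigger an API request.
        self.resolve_calls += 1
        return self


def resolve_if_local(filepath):
    """Resolve local paths only; return any other path object untouched."""
    if isinstance(filepath, LOCAL_PATH_CLASSES):
        return filepath.resolve()
    return filepath


cloud = FakeCloudPath("s3://my-bucket/data.h5ad")
resolve_if_local(cloud)
print(cloud.resolve_calls)  # resolve() was never called on the cloud path
print(resolve_if_local(Path("data.h5ad")).is_absolute())
```

With this shape, ingesting many cloud files costs no extra call per file for path resolution, while relative local paths are still made absolute.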
Agree, will remove it.
size = filepath_stat["size"]  # type: ignore
else:
raise e
if not isinstance(filepath, LocalPathClasses):
Great!
Follow whatever you think is best, Sergei!
But yeah, we don't want a "total mess" anymore. 😅
@@ -387,7 +386,7 @@ def log_storage_hint(
     if check_path_in_storage:
         display_root = storage.root  # type: ignore
         # check whether path is local
-        if not storage.root.startswith(("s3://", "gs://")):  # type: ignore
+        if fsspec.utils.get_protocol(storage.root) == "file":  # type: ignore
Why don't you use the isinstance(filepath, LocalPathClasses) check here?
Here storage is a model with root being a string.
Yes, you're right!
To make this more readable (fsspec.utils.get_protocol(storage.root) == "file" isn't understandable at all), could we introduce a helper function?
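from typing import Union

def is_local_path(path: Union[str, UPath, Path]) -> bool:
    if isinstance(path, str):
        # a plain string (e.g. storage.root) carries no type information, so inspect its protocol
        return fsspec.utils.get_protocol(path) == "file"
    else:
        return isinstance(path, LocalPathClasses)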
This could be imported like so:
from lamindb.dev.upath import is_local_path, is_cloud_path
And then be used anywhere.
(Not sure whether the _path suffix is needed.)
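For reference, here is a self-contained sketch of how such a helper might behave. It avoids third-party imports, so `get_protocol` below is a simplified standard-library approximation of `fsspec.utils.get_protocol`, and `LOCAL_PATH_CLASSES` is an assumed stand-in for `LocalPathClasses`. Neither is the project's actual code.

```python
import re
from pathlib import Path, PosixPath, WindowsPath
from typing import Union

# Assumption: stand-in for lamindb's LocalPathClasses.
LOCAL_PATH_CLASSES = (PosixPath, WindowsPath)


def get_protocol(url: str) -> str:
    """Return the protocol prefix of a URL, or "file" if there is none.

    Simplified approximation of fsspec.utils.get_protocol.
    """
    parts = re.split(r"(::|://)", url, maxsplit=1)
    return parts[0] if len(parts) > 1 else "file"


def is_local_path(path: Union[str, Path]) -> bool:
    """Check whether a string or a path object refers to local storage."""
    if isinstance(path, str):
        return get_protocol(path) == "file"
    return isinstance(path, LOCAL_PATH_CLASSES)


for candidate in ["/tmp/data.h5ad", "file:///tmp/data.h5ad", "s3://bucket/key", "az://container/blob"]:
    print(candidate, is_local_path(candidate))
```

Unlike a `startswith(("s3://", "gs://"))` check, the protocol comparison also classifies other remote schemes, such as `az://` in this sketch, as non-local.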
488aee9 to dd455a1
Codecov Report
@@ Coverage Diff @@
## main #1102 +/- ##
==========================================
+ Coverage 93.50% 93.74% +0.24%
==========================================
Files 41 41
Lines 3462 3455 -7
==========================================
+ Hits 3237 3239 +2
+ Misses 225 216 -9
7b1bee4 to 0d4790a
corresponds to laminlabs/lamindb-setup#490
in progress