Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

♻️ Refactor to use UPath everywhere #1102

Merged
merged 11 commits into from Aug 29, 2023
Merged

♻️ Refactor to use UPath everywhere #1102

merged 11 commits into from Aug 29, 2023

Conversation

Koncopd
Copy link
Member

@Koncopd Koncopd commented Aug 27, 2023

corresponds to laminlabs/lamindb-setup#490
in progress

lamindb/_file.py Outdated
) -> Tuple[Storage, bool]:
if not skip_existence_check:
try: # check if file exists
if not filepath.exists():
raise FileNotFoundError(filepath)
except PermissionError:
pass
if not isinstance(filepath, UPath):
filepath = filepath.resolve()
filepath = filepath.resolve() # works for UPath also
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't needed for a CloudPath object as by definition, the cloudpath is fully specified including the bucket.

Hence: The check for a local path makes sense. Only if it's a local path, we'd like to resolve it. Calling it on CloudPath, too, causes an API request, I fear, and hence unnecessarily reduces performance.

I've made several experiments ingesting high numbers of files and found that reducing the number of API requests is the only factor that determines performance if files aren't big.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, will remove it.

size = filepath_stat["size"] # type: ignore
else:
raise e
if not isinstance(filepath, LocalPathClasses):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

@Koncopd
Copy link
Member Author

Koncopd commented Aug 28, 2023

_file.py is still a total mess, maybe need to enforce UPath everywhere instead of mix due to usage of Path.cwd() etc.

@falexwolf
Copy link
Member

Follow whatever you think is best, Sergei!

@falexwolf
Copy link
Member

But yeah, we don't want a "total mess" anymore. 😅

@@ -387,7 +386,7 @@ def log_storage_hint(
if check_path_in_storage:
display_root = storage.root # type: ignore
# check whether path is local
if not storage.root.startswith(("s3://", "gs://")): # type: ignore
if fsspec.utils.get_protocol(storage.root) == "file": # type: ignore
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't you use the isinstance(filepath, LocalPathClasses) check here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here storage is a model with root being a string.

Copy link
Member

@falexwolf falexwolf Aug 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're right!

To make this more readable (fsspec.utils.get_protocol(storage.root) == "file" isn't understandable at all), could we introduce a helper function?

def is_local_path(path: [str, UPath, Path]) -> bool:
    if isinstance(path, str):
         return fsspec.utils.get_protocol(storage.root) == "file"
    else:
         return isinstance(path, LocalPathClasses)

This could be imported like so:

from lamindb.dev.upath import is_local_path, is_cloud_path

And then be used anywhere.

(Not sure whether the _path suffix is needed.)

@Koncopd Koncopd force-pushed the upath_refactor branch 2 times, most recently from 488aee9 to dd455a1 Compare August 29, 2023 15:43
@codecov
Copy link

codecov bot commented Aug 29, 2023

Codecov Report

Merging #1102 (0d4790a) into main (0e010eb) will increase coverage by 0.24%.
Report is 3 commits behind head on main.
The diff coverage is 96.96%.

@@            Coverage Diff             @@
##             main    #1102      +/-   ##
==========================================
+ Coverage   93.50%   93.74%   +0.24%     
==========================================
  Files          41       41              
  Lines        3462     3455       -7     
==========================================
+ Hits         3237     3239       +2     
+ Misses        225      216       -9     
Files Changed Coverage Δ
lamindb/dev/storage/file.py 92.85% <85.71%> (ø)
lamindb/_file.py 93.24% <100.00%> (+1.42%) ⬆️
lamindb/_save.py 80.95% <100.00%> (+1.11%) ⬆️
lamindb/dev/storage/__init__.py 100.00% <100.00%> (ø)
lamindb/dev/storage/object.py 100.00% <100.00%> (ø)

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@github-actions
Copy link

github-actions bot commented Aug 29, 2023

@github-actions github-actions bot temporarily deployed to pull request August 29, 2023 16:17 Inactive
@github-actions github-actions bot temporarily deployed to pull request August 29, 2023 17:43 Inactive
@Koncopd Koncopd merged commit 6b06748 into main Aug 29, 2023
12 checks passed
@Koncopd Koncopd deleted the upath_refactor branch August 29, 2023 17:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants