Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fsspec support #190

Merged
merged 6 commits into from
Jun 1, 2022
Merged

Add fsspec support #190

merged 6 commits into from
Jun 1, 2022

Conversation

ap--
Copy link
Contributor

@ap-- ap-- commented May 31, 2022

Hi @mdshw5,

I saw that there already was an attempt at implementing fsspec support and that the PR is basically stale.

So here's my attempt at implementing fsspec support:
This closes #165 and replaces #168

I explicitly tried to keep the required changes to a minimum, to simplify review.
This follows @manzt recommendations and accepts fsspec.core.OpenFile objects.
This can still be done cleaner by refactoring more of Faidx.__init__

Tests all pass and I added a single fsspec test. Let me know what the best tests would be for covering the base functionality of this library to test fsspec support.

Cheers,
Andreas 😃

@mdshw5
Copy link
Owner

mdshw5 commented May 31, 2022

Hey @ap-- thanks for the thoughtful contribution! I do have a few small questions listed in the review above. This does seem like a straightforward implementation and I appreciate your consideration of the existing code structure. @manzt can you comment on whether this PR addresses your requirements?

@ap--
Copy link
Contributor Author

ap-- commented May 31, 2022

Hi @mdshw5,

I currently can't see your review comments. Did you submit it?

Cheers,
Andreas

Copy link
Owner

@mdshw5 mdshw5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ap-- I had a few comments here. Looks really good, and I’ll be glad to merge after I study the changes a bit more with a clear mind.

tests/test_fsspec.py Outdated Show resolved Hide resolved
pyfaidx/__init__.py Outdated Show resolved Hide resolved
"Cannot read FASTA file %s" % filename)

self.indexname = filename + '.fai'
self.indexname = self.filename + '.fai'
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will this work when the filename is derived from a remote URI? What does the filename for the local index file become?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.
Currently this stores the index file on the remote.
So: providing fsspec.open("s3://mybucket/somefile.fasta") will make self.filename = "mybucket/somefile.fasta" and self._fs will be an s3fs filesystem instance. self.indexname = mybucket/somefile.fasta.fai.

But this requires the remote to be writable. (I basically stuck to the current behavior that stores the index next to the file.) In case the remote is readonly, it would be nice if the user could specify a local path for the index file.

I could make this configurable through a kwarg, with the default being current behavior.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the current behavior is fine, and as I haven't used fsspec I did not understand that the remote file-like object would be created. As long as there is as reasonable exception generated when the remote is read-only.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a test for this on read-only filesystem with a fasta file with no index file.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case the remote is readonly, it would be nice if the user could specify a local path for the index file.

I ran into this when experimenting with an implementation a little while ago. It's hard to know what the "correct" behavior should be. I guess one option would be to have an index_filename kwarg in the constructor, which could similarly be a fsspec.OpenFile or str and would specify explicitly where to look for (or create) an index.

So the default would be to read and write index adjacent to filename in whatever filesystem is provided. If filename is readonly, then we could raise an exception, and the user could fix either by 1.) configuring the filesystem with write permissions or 2.) specifying a local path (e.g., ./local.fasta.fai). My only concern is that it could get challenging to handle multiple file objects.

@mdshw5
Copy link
Owner

mdshw5 commented Jun 1, 2022

Hi @mdshw5,

I currently can't see your review comments. Did you submit it?

Cheers, Andreas

Ah yeah - I forgot to submit 😅

@mdshw5 mdshw5 merged commit 3faada8 into mdshw5:master Jun 1, 2022
@mdshw5
Copy link
Owner

mdshw5 commented Jun 1, 2022

@ap-- I've merged these changes and plan to release a new v0.6.5 package asap. One thing I forgot to request: can you please provide an example in the Readme.rst demonstrating usage of fsspec with this package?

Copy link

@manzt manzt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This looks great and like it would cover 95% of my use cases.

I think it would be useful to have the ability to write an index locally if the fsspec.filesystem is not writeable and one would need to (re)build an index. E.g,.

import fsspec
from pyfaidx import Fasta

fileobj = fsspec.open('http://example.com/data.fasta')
fasta = Fasta(fileobj, index='./data.fasta.fai')

However, there are likely some edge cases that further need to be considered and I don't think its worth blocking this PR.

"Cannot read FASTA file %s" % filename)

self.indexname = filename + '.fai'
self.indexname = self.filename + '.fai'
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case the remote is readonly, it would be nice if the user could specify a local path for the index file.

I ran into this when experimenting with an implementation a little while ago. It's hard to know what the "correct" behavior should be. I guess one option would be to have an index_filename kwarg in the constructor, which could similarly be a fsspec.OpenFile or str and would specify explicitly where to look for (or create) an index.

So the default would be to read and write index adjacent to filename in whatever filesystem is provided. If filename is readonly, then we could raise an exception, and the user could fix either by 1.) configuring the filesystem with write permissions or 2.) specifying a local path (e.g., ./local.fasta.fai). My only concern is that it could get challenging to handle multiple file objects.

@manzt
Copy link

manzt commented Jun 1, 2022

Oh oops, didn't see this got merged while I looked it over. Nice work, and thanks!

@mdshw5
Copy link
Owner

mdshw5 commented Jun 1, 2022

@manzt I do like the idea of adding an index_loc kwarg to both Faidx and Fasta. By default it would be set to None, and this can signify that the index should be located adjacent to the FASTA file URI. Otherwise the value will be a file-like object. This would solve other use cases, like shared read-only FASTA files on university HPC systems, or even passing a temporary file or in-memory file like object.

@ap-- ap-- deleted the fsspec-support branch June 1, 2022 19:02
@ap--
Copy link
Contributor Author

ap-- commented Jun 1, 2022

Thanks for merging ❤️
Should I open another PR with the README changes, and is there still interest in adding more fsspec tests?
Or should I wait for #192 to be implemented?

Cheers,
Andreas

@mdshw5
Copy link
Owner

mdshw5 commented Jun 1, 2022

Don't hold your breath on #192 😄. It's more complicated than I thought. If you want to take a stab at it be my guest. Otherwise, some documentation for this PR in the README will suffice, and then I'll tag a new release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

usage on google cloud storage and more general file handling
3 participants