[datasets][annotationdb] specify cloud storage platform (gcp or aws) #9605
…rm ('gcp' or 'aws'). Cleaned up ValueError messages a bit. Added type hints.
…p' or 'aws'). In `DB` constructor, added line to filter and include only annotation datasets in config (everything showing up with unified JSON now). Other minor refactoring.
…dated text in annotationdb.js that appears in "Hail Generated Code" box on annotation db ui page. In load_dataset() set default parameters region='us' and cloud='gcp', moved line in DB constructor.
danking
left a comment
This is awesome, I love the type annotations. I've just one nit on custom configs.
hail/python/hail/experimental/db.py
Outdated
        assert cloud in doc['url'], doc['url']
        url = doc['url'][cloud]
    else:
        url = doc['url']
I'm inclined to require custom configs to have the same format as the publicly supported one. AFAIK, nobody is using this functionality. Moreover, this is in the experimental module, so we have no obligation to maintain compatibility.
Do you have an argument in favor of different formats for custom vs. non-custom configs? I suppose a custom config is probably used by one lab, which will exist in one cloud. I can see it's a bit more annoying, but I prefer the consistency.
Not really any argument for different formats, and the consistency would make things nicer. With that in mind, I think I'll also modify `Dataset.from_name_and_json()`, as it currently doesn't check for region with a custom config. That way the expected formatting for a custom config will be consistent with what's in `datasets.json`.
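To make the agreed behavior concrete, here is a minimal sketch of a URL lookup that accepts only the cloud-keyed format and rejects a plain string `url`. It is not the code in `db.py`: the helper name `resolve_url`, the assumption that `url` maps cloud to region to path, and the example bucket paths are all hypothetical.

```python
from typing import Any, Dict

VALID_CLOUDS = ("gcp", "aws")


def resolve_url(version_doc: Dict[str, Any], region: str = "us", cloud: str = "gcp") -> str:
    """Return the path for (cloud, region) from one version entry.

    Hypothetical helper; assumes version_doc["url"] looks like
    {"gcp": {"us": "...", "eu": "..."}, "aws": {"us": "..."}}.
    """
    if cloud not in VALID_CLOUDS:
        raise ValueError(f"cloud must be one of {VALID_CLOUDS}, got {cloud!r}")

    urls = version_doc["url"]
    # A bare string url (the old custom-config format) is rejected here,
    # so custom and checked-in configs are held to the same shape.
    if not isinstance(urls, dict) or cloud not in urls:
        raise ValueError(f"no urls for cloud {cloud!r} in {urls!r}")

    by_region = urls[cloud]
    if not isinstance(by_region, dict) or region not in by_region:
        raise ValueError(f"no url for region {region!r} on cloud {cloud!r}")
    return by_region[region]


# Illustrative entry only; these bucket paths are placeholders.
example_version = {
    "version": "v1",
    "reference_genome": "GRCh37",
    "url": {
        "gcp": {"us": "gs://example-bucket-us/example.ht",
                "eu": "gs://example-bucket-eu/example.ht"},
        "aws": {"us": "s3://example-bucket-us/example.ht"},
    },
}
```

With this shape, a custom config that lives in a single cloud only has to list that one cloud, but it still goes through the same checks as the checked-in file.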
This allows the user to specify the cloud platform ('gcp' or 'aws') they are using when accessing datasets via the datasets API and annotation DB. A user running Hail on AWS would read from the s3 bucket, and a user running on GCP would read from the gs bucket (reading locally from the gs bucket still works with the cloud storage connector installed). This is not intended for cross-platform use, such as running a dataproc cluster and trying to access the s3 bucket, or trying to access the gs bucket on an EMR cluster. It assumes a user on AWS has their configuration set with their credentials.
Did not have permissions to set up a cluster or EC2 instance on AWS to test, but was able to access all the datasets without issue on a dataproc cluster when using s3a:// prefixes and providing my AWS credentials in the spark config. So these changes should (hopefully) work fine on an EMR cluster with the s3:// client. Everything worked as expected on GCP.
Overview of changes:
- `load_dataset()` function and methods in `db.py`.
- `datasets.json`:
- `load_dataset()` function:
  - `cloud` parameter, set default values to `region='us'` and `cloud='gcp'`.
- `DB` class:
  - `cloud` parameter to constructor, set default values to `region='us'` and `cloud='gcp'`.
  - Datasets in `datasets.json` currently end up in the `_DB__by_name` dictionary, even if not annotation datasets. Added line 279 in `db.py` to fix this and filter out datasets that are not annotation datasets (datasets missing "annotation_db" key).
- `Dataset` class:
  - `cloud` and `custom_config` parameters to `Dataset.from_name_and_json()` to pass to `DatasetVersion.from_json()` to grab correct urls for platform.
  - `Dataset.from_name_and_json()`. Will require users passing a custom config to be of same format as what is in `datasets.json` for now. So each key: value pair in a user provided config should be as below:
- `DatasetVersion` class:
  - `reference_genome` attribute, now that the version and reference genome are two separate fields in `datasets.json`.
  - `DatasetVersion.from_json()` to handle cloud parameter to grab correct version url when using checked-in config file.
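
The `DB` constructor filter described in the overview can be sketched as follows. This is an illustration rather than the actual line added in `db.py`: the function name `annotation_datasets` and the sample config entries are hypothetical, and the only rule taken from the description is that entries missing the "annotation_db" key are dropped.

```python
from typing import Any, Dict


def annotation_datasets(config: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Keep only the entries that carry an "annotation_db" key."""
    return {name: doc for name, doc in config.items() if "annotation_db" in doc}


# Hypothetical config contents, for illustration only.
sample_config = {
    "example_annotations": {"annotation_db": {}, "versions": []},
    "example_matrix_table": {"versions": []},
}

filtered = annotation_datasets(sample_config)
print(sorted(filtered))
```

Filtering once in the constructor means later lookups by name only ever see annotation datasets, instead of every entry in the unified JSON.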