Skip to content

Conversation

@pwc2
Copy link
Contributor

@pwc2 pwc2 commented Oct 15, 2020

This allows the user to specify the cloud platform ('gcp' or 'aws') they are using when accessing datasets via the datasets API and annotation DB. A user running hail on AWS would read from the s3 bucket, and a user running on GCP would read from the gs bucket (can also still read locally from gs bucket with cloud storage connector installed). Not intended for cross-platform use like running a dataproc cluster and trying to access the s3 bucket, or trying to access the gs bucket on an EMR cluster. Will assume user on AWS has their configuration set with their credentials.

Did not have permissions to set up a cluster or EC2 instance on AWS to test, but was able to access all the datasets without issue on a dataproc cluster when using s3a:// prefixes and providing my AWS credentials in the spark config. So these changes should (hopefully) work fine on an EMR cluster with the s3:// client. Everything worked as expected on GCP.

Overview of changes:

  • Added missing type hints to load_dataset() function and methods in db.py.
  • In datasets.json:
    • Added AWS urls so now for each version of a dataset in dataset["versions"], the url entry looks like:
"url": {
  "aws": {
    "us": "s3://hail-datasets-us-east-1/..."
  },
  "gcp": {
    "eu": "gs://hail-datasets-eu/...",
    "us": "gs://hail-datasets-us/...
  }
  • In load_dataset() function:
    • Added cloud parameter, set default values to region='us' and cloud='gcp.
  • In DB class:
    • Added cloud parameter to constructor, set default values to region='us' and cloud='gcp.
    • All datasets in datasets.json currently end up in the _DB__by_name dictionary, even if not annotation datasets. Added line 279 in db.py to fix this and filter out datasets that are not annotation datasets (datasets missing "annotation_db" key).
  • In Dataset class:
    • Added cloud and custom_config parameters to Dataset.from_name_and_json() to pass to DatasetVersion.from_json() to grab correct urls for platform.
    • Refactored Dataset.from_name_and_json(). Will require users passing a custom config to be of same format as what is in datasets.json for now. So each key: value pair in a user provided config should be as below:
"dataset_name": {
"annotation_db": {"key_properties": ["..."]},
"description": "...",
"url": "...",
"versions": [
  {
    "reference_genome": "...",
    "url": {"aws": {"eu": "...", "us": "..."},
            "gcp": {"eu": "...", "us": "..."}},
    "version": "..."
  }]}
  • In DatasetVersion class:
    • Added reference_genome attribute, now that the version and reference genome are two separate fields in datasets.json.
    • Modified DatasetVersion.from_json() to handle cloud parameter to grab correct version url when using checked-in config file.

pwc2 added 5 commits October 15, 2020 16:18
…rm ('gcp' or 'aws'). Cleaned up ValueError messages a bit. Added type hints.
…p' or 'aws'). In `DB` constructor, added line to filter and include only annotation datasets in config (everything showing up with unified JSON now). Other minor refactoring.
…dated text in annotationdb.js that appears in "Hail Generated Code" box on annotation db ui page. In load_dataset() set default parameters region='us' and cloud='gcp', moved line in DB constructor.
@pwc2 pwc2 marked this pull request as ready for review October 16, 2020 14:34
Copy link
Contributor

@danking danking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome, I love the type annotations. I've just one nit on custom configs.

assert cloud in doc['url'], doc['url']
url = doc['url'][cloud]
else:
url = doc['url']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm inclined to require custom configs to have the same format as the publicly supported one. AFAIK, nobody is using this functionality. Moreover, this is in the experimental module, so we have no obligation to maintain compatibility.

Do you have an argument in favor of different formats for the custom vs non-custom? I suppose a custom config is probably used by one lab which will exist in one cloud. I can see its a bit more annoying, but I prefer the consistency.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really any argument for different formats, and the consistency would make things nicer. With that in mind, I think I'll also modify Dataset.from_name_and_json(), as it currently doesn't check for region with a custom config. That way the expected formatting for a custom config will be consistent with whats in datasets.json.

@pwc2 pwc2 dismissed danking’s stale review October 16, 2020 17:13

fixed custom configs

@pwc2 pwc2 requested a review from danking October 16, 2020 17:14
@danking danking merged commit f662e44 into hail-is:main Oct 16, 2020
@pwc2 pwc2 deleted the adb-aws branch October 16, 2020 20:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants