Allow parameters in urlpath with zarr entries #505

Open
alimanfoo opened this issue Jun 4, 2020 · 1 comment

Comments

@alimanfoo
Contributor

Some drivers, such as the CSV source, support parameters that can be interpolated into the urlpath. This doesn't seem to work with the zarr_cat driver. For example, I have a catalog like this, containing an entry snp_genotypes that uses a parameter and, for comparison, an entry snp_genotypes_test that doesn't:

metadata:
  version: 1
sources:

  snp_genotypes:
    description: 'Test zarr data source, using parameters'
    driver: zarr_cat
    parameters:
      sample_set:
        description: 'Sample set.'
        type: str
    args:
      urlpath: 'gcs://vo_agam_release/v3/snp_genotypes/all/{{sample_set}}/'
      consolidated: true

  snp_genotypes_test:
    description: 'Test zarr data source, not using parameters.'
    driver: zarr_cat
    args:
      urlpath: 'gcs://vo_agam_release/v3/snp_genotypes/all/AG1000G-AO/'
      consolidated: true

I can access the snp_genotypes_test entry just fine, e.g.:

In [1]: import intake                                                         

In [2]: intake.__version__                                                    
Out[2]: '0.6.0'

In [3]: cat = intake.open_catalog('test.yml')                                 

In [4]: cat.snp_genotypes_test                                                
Out[4]: <Intake catalog: snp_genotypes_test>

In [5]: cat.snp_genotypes_test.to_zarr()                                      
Out[5]: <zarr.hierarchy.Group '/' read-only>

However, when I try to access the snp_genotypes entry I get errors, which I suspect are due to the fact that the sample_set parameter is not getting applied to the urlpath:

In [6]: cat.snp_genotypes(sample_set='AG1000G-AO')                            
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
~/malariagen/binder/conda/envs/alimanfoo-6a9e1cc/lib/python3.7/site-packages/fsspec/mapping.py in __getitem__(self, key, default)
     74         try:
---> 75             result = self.fs.cat(k)
     76         except (FileNotFoundError, IsADirectoryError, NotADirectoryError):

~/malariagen/binder/conda/envs/alimanfoo-6a9e1cc/lib/python3.7/site-packages/gcsfs/core.py in cat(self, path)
    784         u2 = self.url(path)
--> 785         r = self._call("GET", u2)
    786         r.raise_for_status()

~/malariagen/binder/conda/envs/alimanfoo-6a9e1cc/lib/python3.7/site-packages/gcsfs/core.py in _call(self, method, path, *args, **kwargs)
    486                 )
--> 487                 validate_response(r, path)
    488                 break

~/malariagen/binder/conda/envs/alimanfoo-6a9e1cc/lib/python3.7/site-packages/gcsfs/core.py in validate_response(r, path)
    121         if r.status_code == 404:
--> 122             raise FileNotFoundError
    123         elif r.status_code == 403:

FileNotFoundError: 

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
~/malariagen/binder/conda/envs/alimanfoo-6a9e1cc/lib/python3.7/site-packages/intake/catalog/base.py in __getattr__(self, item)
    338             try:
--> 339                 return self[item]  # triggers reload_on_change
    340             except KeyError:

~/malariagen/binder/conda/envs/alimanfoo-6a9e1cc/lib/python3.7/site-packages/intake/catalog/base.py in __getitem__(self, key)
    386             e._pmode = self.pmode
--> 387             return e()
    388         if isinstance(key, str) and '.' in key:

~/malariagen/binder/conda/envs/alimanfoo-6a9e1cc/lib/python3.7/site-packages/intake/catalog/entry.py in __call__(self, persist, **kwargs)
     76         persist = persist or self._pmode
---> 77         s = self.get(**kwargs)
     78         if persist != 'never' and s.has_been_persisted:

~/malariagen/binder/conda/envs/alimanfoo-6a9e1cc/lib/python3.7/site-packages/intake/catalog/local.py in get(self, **user_parameters)
    280         plugin, open_args = self._create_open_args(user_parameters)
--> 281         data_source = plugin(**open_args)
    282         data_source.catalog_object = self._catalog

~/malariagen/binder/conda/envs/alimanfoo-6a9e1cc/lib/python3.7/site-packages/intake/catalog/zarr.py in __init__(self, urlpath, storage_options, component, metadata, consolidated)
     37         self._grp = None
---> 38         super().__init__(metadata=metadata)
     39 

~/malariagen/binder/conda/envs/alimanfoo-6a9e1cc/lib/python3.7/site-packages/intake/catalog/base.py in __init__(self, name, description, metadata, auth, ttl, getenv, getshell, persist_mode, storage_options, *args)
    112         self._entries = self._make_entries_container()
--> 113         self.force_reload()
    114 

~/malariagen/binder/conda/envs/alimanfoo-6a9e1cc/lib/python3.7/site-packages/intake/catalog/base.py in force_reload(self)
    169         """Imperative reload data now"""
--> 170         self._load()
    171         self.updated = time.time()

~/malariagen/binder/conda/envs/alimanfoo-6a9e1cc/lib/python3.7/site-packages/intake/catalog/zarr.py in _load(self)
     64                     # use consolidated metadata
---> 65                     root = zarr.open_consolidated(store=store, mode='r')
     66                 else:

~/malariagen/binder/conda/envs/alimanfoo-6a9e1cc/lib/python3.7/site-packages/zarr/convenience.py in open_consolidated(store, metadata_key, mode, **kwargs)
   1178     # setup metadata sotre
-> 1179     meta_store = ConsolidatedMetadataStore(store, metadata_key=metadata_key)
   1180 

~/malariagen/binder/conda/envs/alimanfoo-6a9e1cc/lib/python3.7/site-packages/zarr/storage.py in __init__(self, store, metadata_key)
   2500         # retrieve consolidated metadata
-> 2501         meta = json_loads(store[metadata_key])
   2502 

~/malariagen/binder/conda/envs/alimanfoo-6a9e1cc/lib/python3.7/site-packages/fsspec/mapping.py in __getitem__(self, key, default)
     78                 return default
---> 79             raise KeyError(key)
     80         return result

KeyError: '.zmetadata'

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
<ipython-input-6-0a68a2150771> in <module>
----> 1 cat.snp_genotypes(sample_set='AG1000G-AO')

~/malariagen/binder/conda/envs/alimanfoo-6a9e1cc/lib/python3.7/site-packages/intake/catalog/base.py in __getattr__(self, item)
    339                 return self[item]  # triggers reload_on_change
    340             except KeyError:
--> 341                 raise AttributeError(item)
    342         raise AttributeError(item)
    343 

AttributeError: snp_genotypes

Is there something I need to do in the ZarrGroupCatalog class to make sure parameters are applied?

@martindurant
Member

@tacaswell @danielballan - this is a real regression due to entries #490

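The initial access to snp_genotypes (the attribute lookup cat.snp_genotypes) immediately instantiates the source, and since the parameter has no default, an empty string is substituted for the template placeholder, so the urlpath points at a location that does not exist. Calling the source with a parameter would give you a new instance with the given templating applied, but as the traceback shows, the failure happens during the attribute lookup, before that call is reached.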
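
To make the substitution described above concrete, here is a minimal sketch. It does not use Intake's own templating code: the `render` helper is hypothetical and only mimics filling `{{name}}` placeholders, with the assumption (taken from the comment above) that a string parameter without a default is replaced by an empty string.

```python
import re

# urlpath template from the snp_genotypes entry in the catalog above
URLPATH = "gcs://vo_agam_release/v3/snp_genotypes/all/{{sample_set}}/"


def render(template, **params):
    """Hypothetical stand-in for catalog templating (not Intake's API).

    Fills {{name}} placeholders; a placeholder with no supplied value
    becomes an empty string.
    """
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(params.get(m.group(1), "")),
        template,
    )


# Attribute access: the source is built with no user parameters.
path_on_access = render(URLPATH)
# Explicit call: the source is rebuilt with the user's value.
path_on_call = render(URLPATH, sample_set="AG1000G-AO")

print(path_on_access)
print(path_on_call)
```

Under that assumption, the first path ends in `all//`, which has no `.zmetadata` key and would explain the `KeyError` in the traceback; the second path matches the working snp_genotypes_test entry.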

So a workaround is to give the sample_set parameter a default value that resolves to a real target (for example, a `default: 'AG1000G-AO'` line under the parameter in the catalog). A better solution is for the driver class not to access the target until the user actually accesses the catalogue's members, i.e., to have the catalogue members be lazily loaded.
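
As a rough illustration of the lazy-loading idea suggested above, the following sketch defers all access to the target until an entry is first requested. `LazyGroupCatalog` and `fake_opener` are hypothetical names invented for this example; this is not how ZarrGroupCatalog is implemented, and the opener only simulates a store that fails on the empty-parameter path.

```python
class LazyGroupCatalog:
    """Hypothetical catalog that defers opening its target until first use."""

    def __init__(self, urlpath, opener):
        self.urlpath = urlpath
        self._opener = opener  # callable that actually touches storage
        self._entries = None   # nothing is loaded at construction time

    def _ensure_loaded(self):
        # Open the target only once, on first access.
        if self._entries is None:
            self._entries = self._opener(self.urlpath)
        return self._entries

    def __getitem__(self, key):
        return self._ensure_loaded()[key]

    def __iter__(self):
        return iter(self._ensure_loaded())


calls = []  # records every urlpath the opener is asked to open


def fake_opener(urlpath):
    """Simulated store: a path with an empty path segment does not exist."""
    calls.append(urlpath)
    if urlpath.endswith("//"):
        raise FileNotFoundError(urlpath)
    return {"calldata": "<array>"}


# Constructing either catalog touches no storage, so the bad path is harmless.
bad = LazyGroupCatalog("gcs://vo_agam_release/v3/snp_genotypes/all//", fake_opener)
good = LazyGroupCatalog(
    "gcs://vo_agam_release/v3/snp_genotypes/all/AG1000G-AO/", fake_opener
)
assert calls == []

print(good["calldata"])  # the first access triggers the load

try:
    bad["calldata"]
except FileNotFoundError:
    print("error surfaces only when the bad entry is actually used")
```

With this shape, instantiating the entry without parameters would be cheap and safe, and any error from an unresolved urlpath would appear only if the user tried to read from that instance.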
