CMIP(6) exclude_dirs #62

Closed
aaronspring opened this issue Apr 27, 2019 · 11 comments

@aaronspring
Contributor

Great job for the CMIP class!

Is it possible to add exclude_dirs for CMIP catalogs? The base directory contains other folders
that I don't have access to, so catalog building fails.

m300524@mlogin108:/work/kd0956/CMIP6$ ls
CMIP/         dc_repli/     dc_test_data/ test_data/
ls test_data/CMIP6/
ls: cannot open directory test_data/CMIP6/: Permission denied
m300524@mlogin108:/work/kd0956/CMIP6$ ls CMIP/CNRM-CERFACS/CNRM-CM6-1/1pctCO2/r99i1p1f2/Omon/tos/gn/v20180424/tos_Omon_CNRM-CM6-1_1pctCO2_r99i1p1f2_gn_185001-185001.nc
@andersy005
Member

@aaronspring, thank you for pointing this out. This is something missing from the CMIP implementation. The exclude_dirs parameter is hard-coded in some of the methods. I agree with you, we should let the user have control over this. I will patch this issue this weekend.

@andersy005 andersy005 added this to To do in Backlog via automation Apr 27, 2019
@matt-long
Contributor

@andersy005, are there further opportunities for consolidation/code reuse across CMIP and CESM?

@andersy005
Member

@matt-long

  1. For building catalogs, CESM doesn't care about directory structure, while CMIP relies on it. This makes their consolidation a daunting task. However, during the MPIGE and GMET implementations (#61, "Add implementation for The Gridded Meteorological Ensemble Tool (GMET) data holdings"), I found myself copying code with slight changes, and I believe some refactoring would enable code reuse across collections that don't rely on the directory structure. Let me know if you've got some ideas on this.

  2. For using the built catalogs, today, CESM and CMIP are using the exact same code. Therefore, code reuse is covered for this case.
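To illustrate why a CMIP catalog builder depends on directory structure, here is a minimal sketch that maps the directory levels of the file path quoted earlier in this thread onto field names. The function `parse_cmip6_path` and the tuple `CMIP6_LEVELS` are hypothetical names for this example, and the level names are an assumption based on the CMIP6 DRS layout; this is not intake-esm's actual parser.

```python
from pathlib import PurePosixPath

# Assumed directory levels below the root, following the CMIP6 DRS layout.
CMIP6_LEVELS = (
    "activity_id",
    "institution_id",
    "source_id",
    "experiment_id",
    "member_id",
    "table_id",
    "variable_id",
    "grid_label",
    "version",
)


def parse_cmip6_path(path, root):
    """Map each directory level between root and the file onto a field name."""
    parts = PurePosixPath(path).relative_to(root).parts
    # Expect one directory per level plus the file name itself.
    if len(parts) != len(CMIP6_LEVELS) + 1:
        raise ValueError(f"unexpected directory depth for {path!r}")
    entry = dict(zip(CMIP6_LEVELS, parts[:-1]))
    entry["file_basename"] = parts[-1]
    return entry


example = parse_cmip6_path(
    "/work/kd0956/CMIP6/CMIP/CNRM-CERFACS/CNRM-CM6-1/1pctCO2/r99i1p1f2/"
    "Omon/tos/gn/v20180424/"
    "tos_Omon_CNRM-CM6-1_1pctCO2_r99i1p1f2_gn_185001-185001.nc",
    root="/work/kd0956/CMIP6",
)
print(example["source_id"], example["experiment_id"], example["version"])
```

A collection that encodes its metadata only in file names would need a different parser, which is the part that is hard to share between the two approaches.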

@matt-long
Contributor

I think we should consider turning path_pattern and file_pattern into general methods and pointing the CMIP, CESM, and MPI collection_type entries at them, respectively.

@andersy005
Member

Sounds good. If you are not busy this coming week, let's discuss what these changes will look like.

@andersy005
Member

@aaronspring,

With #63, you will be able to add an exclude_dirs key to your YAML file:

name: cmip5_test_collection
collection_type: cmip5
data_sources:
  root_dir:
    name: GLADE
    loc_type: posix
    direct_access: True
    urlpath: ./tests/sample_data/cmip/cmip5
    exclude_dirs: ['files', 'latest']

@aaronspring
Contributor Author

aaronspring commented Apr 28, 2019

Re-installed.
It still doesn't work. The file structure is in my post above.

YAML file (I think this should work):

name: cmip6_collection_mistral
collection_type: cmip6
data_sources:
  root_dir:
    name: MISTRAL
    loc_type: posix
    direct_access: True
    urlpath: /work/kd0956/CMIP6
    exclude_dirs: ['test_data/CMIP6','dc_repli/*','dc_test_data/*']
col = intake.open_esm_metadatastore(collection_input_definition=collection_input_definition, overwrite_existing=True)
Getting list of directories

---------------------------------------------------------------------------
PermissionError                           Traceback (most recent call last)
<ipython-input-15-554feb33ee30> in <module>
----> 1 col = intake.open_esm_metadatastore(collection_input_definition=collection_input_definition, overwrite_existing=True)

/work/mh0727/m300524/anaconda3/envs/xr/lib/python3.7/site-packages/intake_esm-2019.4.26.1.post15-py3.7.egg/intake_esm/core.py in __init__(self, collection_input_definition, collection_name, overwrite_existing, metadata)
     64                 collection_input_definition
     65             )
---> 66             self.build_collection(overwrite_existing)
     67 
     68         else:

/work/mh0727/m300524/anaconda3/envs/xr/lib/python3.7/site-packages/intake_esm-2019.4.26.1.post15-py3.7.egg/intake_esm/core.py in build_collection(self, overwrite_existing)
     97             cc = ESMMetadataStoreCatalog.collection_types[ctype]
     98             cc = cc(self.input_collection)
---> 99             cc.build()
    100             self.get_built_collections()
    101         self.open_collection(name)

/work/mh0727/m300524/anaconda3/envs/xr/lib/python3.7/site-packages/intake_esm-2019.4.26.1.post15-py3.7.egg/intake_esm/cmip.py in build(self)
    145         Reference: CMIP6 DRS: http://goo.gl/v1drZl
    146         """
--> 147         self.build_cmip(depth=9)
    148 
    149     def _get_entry(self, directory):

/work/mh0727/m300524/anaconda3/envs/xr/lib/python3.7/site-packages/intake_esm-2019.4.26.1.post15-py3.7.egg/intake_esm/common.py in build_cmip(self, depth)
    102         if not os.path.exists(self.root_dir):
    103             raise NotADirectoryError(f'{os.path.abspath(self.root_dir)} does not exist')
--> 104         dirs = self.get_directories(root_dir=self.root_dir, depth=depth)
    105         dfs = [self._parse_directory(directory, self.columns) for directory in dirs]
    106         df = dd.from_delayed(dfs).compute()

/work/mh0727/m300524/anaconda3/envs/xr/lib/python3.7/site-packages/intake_esm-2019.4.26.1.post15-py3.7.egg/intake_esm/common.py in get_directories(self, root_dir, depth)
     80 
     81         print('Getting list of directories')
---> 82         y = [x[0] for x in self.walk(root_dir, depth)]
     83         diff = depth - 1
     84         base = len(root_dir.split('/'))

/work/mh0727/m300524/anaconda3/envs/xr/lib/python3.7/site-packages/intake_esm-2019.4.26.1.post15-py3.7.egg/intake_esm/common.py in <listcomp>(.0)
     80 
     81         print('Getting list of directories')
---> 82         y = [x[0] for x in self.walk(root_dir, depth)]
     83         diff = depth - 1
     84         base = len(root_dir.split('/'))

/work/mh0727/m300524/anaconda3/envs/xr/lib/python3.7/site-packages/intake_esm-2019.4.26.1.post15-py3.7.egg/intake_esm/common.py in walk(self, top, maxdepth)
     69         if maxdepth > 1:
     70             for path in dirs:
---> 71                 for x in self.walk(path, maxdepth - 1):
     72                     yield x
     73 

/work/mh0727/m300524/anaconda3/envs/xr/lib/python3.7/site-packages/intake_esm-2019.4.26.1.post15-py3.7.egg/intake_esm/common.py in walk(self, top, maxdepth)
     69         if maxdepth > 1:
     70             for path in dirs:
---> 71                 for x in self.walk(path, maxdepth - 1):
     72                     yield x
     73 

/work/mh0727/m300524/anaconda3/envs/xr/lib/python3.7/site-packages/intake_esm-2019.4.26.1.post15-py3.7.egg/intake_esm/common.py in walk(self, top, maxdepth)
     64 
     65         dirs, nondirs = [], []
---> 66         for entry in os.scandir(top):
     67             (dirs if entry.is_dir() else nondirs).append(entry.path)
     68         yield top, dirs, nondirs

PermissionError: [Errno 13] Permission denied: '/work/kd0956/CMIP6/test_data/CMIP6'

The error comes from os.scandir.

for i in os.scandir('/work/kd0956/CMIP6'):
    print(i)
<DirEntry 'dc_test_data'>
<DirEntry 'dc_repli'>
<DirEntry 'test_data'>
<DirEntry 'CMIP'>

for i in os.scandir('/work/kd0956/CMIP6/test_data'):
    print(i)
<DirEntry 'CMIP6'>

for i in os.scandir('/work/kd0956/CMIP6/test_data/CMIP6'):
    print(i)
---------------------------------------------------------------------------
PermissionError                           Traceback (most recent call last)
<ipython-input-18-9f49578bfd74> in <module>
----> 1 for i in os.scandir('/work/kd0956/CMIP6/test_data/CMIP6'):
      2     print(i)

PermissionError: [Errno 13] Permission denied: '/work/kd0956/CMIP6/test_data/CMIP6'
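
One way to keep a depth-limited walk from aborting on an unreadable directory is to prune excluded names before descending and to catch `PermissionError` around `os.scandir`. The sketch below is modeled on the `walk` method visible in the traceback, but `safe_walk` and its `exclude_dirs` argument are hypothetical names for illustration, and skipping unreadable directories silently is an assumed policy, not what intake-esm does.

```python
import os
import tempfile


def safe_walk(top, maxdepth, exclude_dirs=()):
    """Yield (top, dirs, nondirs) down to maxdepth, skipping excluded
    directory names and directories that cannot be listed."""
    dirs, nondirs = [], []
    try:
        with os.scandir(top) as entries:
            for entry in entries:
                if entry.is_dir():
                    # Compare the bare directory name, not the full path.
                    if entry.name not in exclude_dirs:
                        dirs.append(entry.path)
                else:
                    nondirs.append(entry.path)
    except PermissionError:
        # Assumed policy: skip unreadable directories instead of failing.
        return
    yield top, dirs, nondirs
    if maxdepth > 1:
        for path in dirs:
            yield from safe_walk(path, maxdepth - 1, exclude_dirs)


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "CMIP", "model"))
    os.makedirs(os.path.join(tmp, "test_data", "CMIP6"))
    visited = sorted(
        os.path.relpath(top, tmp)
        for top, _, _ in safe_walk(tmp, 3, exclude_dirs=("test_data",))
    )
    print(visited)
```

Because the exclusion is checked before recursion, `os.scandir` is never called on an excluded directory or anything beneath it.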

@aaronspring
Contributor Author

I looked into the code and don't see the problem.

exclude=['test_data','dc_repli/*','dc_test_data/*']
for root, dirs, files in os.walk('/work/kd0956/CMIP6'):
    dirs[:] = [d for d in dirs if d not in exclude]
    print(dirs)
['dc_test_data', 'dc_repli', 'CMIP']
['cmip5', 'CMIP6']
['output1']
['MOHC', 'IPSL']
['HadGEM2-ES']
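
Note that in the output above only 'test_data' is pruned: `d not in exclude` compares bare directory names, so entries such as 'dc_repli/*' never match anything. A minimal sketch of pattern-aware pruning follows, assuming the patterns are meant as shell-style globs relative to the root directory. `pruned_walk` is a hypothetical helper written for this example, not part of intake-esm.

```python
import fnmatch
import os
import tempfile


def pruned_walk(root, exclude_patterns):
    """os.walk variant that drops directories whose name or root-relative
    path matches any shell-style pattern in exclude_patterns."""
    for current, dirs, files in os.walk(root):
        kept = []
        for d in dirs:
            rel = os.path.normpath(
                os.path.join(os.path.relpath(current, root), d)
            ).replace(os.sep, "/")
            excluded = any(
                fnmatch.fnmatchcase(rel, pattern)
                or fnmatch.fnmatchcase(d, pattern)
                for pattern in exclude_patterns
            )
            if not excluded:
                kept.append(d)
        # In-place assignment is what tells os.walk not to descend.
        dirs[:] = kept
        yield current, dirs, files


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "CMIP", "model"))
    os.makedirs(os.path.join(tmp, "test_data", "CMIP6", "secret"))
    os.makedirs(os.path.join(tmp, "dc_repli", "x"))
    patterns = ["test_data/CMIP6", "dc_repli/*"]
    seen = sorted(
        os.path.relpath(current, tmp).replace(os.sep, "/")
        for current, _, _ in pruned_walk(tmp, patterns)
    )
    print(seen)
```

With these patterns the walk still lists `test_data` and `dc_repli` themselves but never enters `test_data/CMIP6` or any child of `dc_repli`.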

@kmpaul
Collaborator

kmpaul commented Apr 28, 2019 via email

@andersy005
Member

@aaronspring,

When you get time, can you confirm that #64 works?

pip install git+https://github.com/andersy005/intake-esm@cmip-exclude-dirs

andersy005 added a commit that referenced this issue May 1, 2019
@andersy005
Member

Closing this as it's been fixed in #64

Backlog automation moved this from To do to Done May 1, 2019