Creating a catalog from a list of file names and then using the gui to select a source from that catalog. #774

Open
tedhabermann opened this issue Nov 25, 2023 · 10 comments

Comments

@tedhabermann

I have read the documentation many times but am still missing something simple.
I am trying to create a catalog from a directory with a bunch of data files.

For each file I create a dictionary (cat_d) that looks like:

args:
  csv_kwargs:
    dtype:
      Agency Portal URL: object
      Datasets: object
      Grant ID: object
      Issue: object
      ORCID: object
  urlpath: ~/CHORUS/data/USAID-2023-09-18-AllReport.csv
description: CHORUS USAID All Report
driver: csv
name: USAID-2023-09-18-All
parameters:
  agency:
    default: USAID
    description: agancy acronym
    type: &id001 !!python/name:builtins.str ''
  dataType:
    default: all
    description: CHORUS data type
    type: *id001
  timestamp:
    default: '2023-09-18'
    description: YYYY-MM-DD
    type: *id001

I then try to create a LocalCatalogEntry from this, either as
localCatEntry = LocalCatalogEntry(**cat_d)
or as
localCatEntry = LocalCatalogEntry(
    name=cat_d['name'],
    description=cat_d['description'],
    parameters=cat_d['parameters'],
    driver=cat_d['driver'],
    args=cat_d['args']
)

The resulting localCatEntry has what appears to me to be some extraneous content, like the empty args key, that I don't understand:
!!python/object:intake.catalog.local.LocalCatalogEntry
args: []
cls: intake.catalog.local.LocalCatalogEntry
kwargs:
  name: USAID-2023-09-18-All
  description: CHORUS USAID All Report
  parameters:
    agency:
      default: USAID
      description: agancy acronym
      type: &id001 !!python/name:builtins.str ''
    dataType:
      default: all
      description: CHORUS data type
      type: *id001
    timestamp:
      default: '2023-09-18'
      description: YYYY-MM-DD
      type: *id001
  driver: csv
  args:
    csv_kwargs:
      dtype:
        Agency Portal URL: object
        Datasets: object
        Grant ID: object
        Issue: object
        ORCID: object
    urlpath: /Users/tedhabermann/Documents/MetadataGameChanger/ProjectsAndPlans/INFORMATE/CHORUS/data/USAID-2023-09-18-AllReport.csv

Undeterred, I add this localCatEntry to a dictionary, where sourceName = 'USAID-2023-09-18-All':
catalog_d.update({sourceName : localCatEntry})

Then I make a catalog:
mycat = Catalog.from_dict(catalog_d)

Now I try to add this catalog to the GUI:
intake.gui.add(mycat)
and get:

TypeError Traceback (most recent call last)
Cell In[33], line 1
----> 1 intake.gui.add(mycat)

File ~/anaconda3/lib/python3.11/site-packages/intake/interface/gui.py:65, in GUI.add(self, *args, **kwargs)
63 def add(self, *args, **kwargs):
64 """Add to list of cats"""
---> 65 return self.cat.select.add(*args, **kwargs)

File ~/anaconda3/lib/python3.11/site-packages/intake/interface/base.py:222, in BaseSelector.add(self, items)
220 self.widget.options.update(options)
221 self.widget.param.trigger("options")
--> 222 self.widget.value = list(options.values())[:1]

File ~/anaconda3/lib/python3.11/site-packages/param/parameterized.py:367, in instance_descriptor.<locals>._f(self, obj, val)
365 instance_param = getattr(obj, '_instance__params', {}).get(self.name)
366 if instance_param is not None and self is not instance_param:
--> 367 instance_param.set(obj, val)
368 return
369 return f(self, obj, val)

File ~/anaconda3/lib/python3.11/site-packages/param/parameterized.py:369, in instance_descriptor.<locals>._f(self, obj, val)
367 instance_param.set(obj, val)
368 return
--> 369 return f(self, obj, val)

File ~/anaconda3/lib/python3.11/site-packages/param/parameterized.py:1252, in Parameter.set(self, obj, val)
1250 # Copy watchers here since they may be modified inplace during iteration
1251 for watcher in sorted(watchers, key=lambda w: w.precedence):
-> 1252 obj.param._call_watcher(watcher, event)
1253 if not obj.param._BATCH_WATCH:
1254 obj.param._batch_call_watchers()

File ~/anaconda3/lib/python3.11/site-packages/param/parameterized.py:2043, in Parameters._call_watcher(self_, watcher, event)
2041 event = self_._update_event_type(watcher, event, self_.self_or_cls.param._TRIGGER)
2042 with _batch_call_watchers(self_.self_or_cls, enable=watcher.queued, run=False):
-> 2043 self_._execute_watcher(watcher, (event,))

File ~/anaconda3/lib/python3.11/site-packages/param/parameterized.py:2025, in Parameters._execute_watcher(self, watcher, events)
2023 async_executor(partial(watcher.fn, *args, **kwargs))
2024 else:
-> 2025 watcher.fn(*args, **kwargs)

File ~/anaconda3/lib/python3.11/site-packages/intake/interface/catalog/select.py:86, in CatSelector.callback(self, event)
85 def callback(self, event):
---> 86 self.expand_nested(event.new)
87 if self.done_callback:
88 self.done_callback(event.new)

File ~/anaconda3/lib/python3.11/site-packages/intake/interface/catalog/select.py:113, in CatSelector.expand_nested(self, cats)
111 name = next(k for k, v in old if v == cat)
112 index = next(i for i, (k, v) in enumerate(old) if v == cat)
--> 113 if right in name:
114 prefix = f"{name.split(right)[0]}{down} {right}"
115 else:

TypeError: argument of type 'NoneType' is not iterable

Running intake.gui suggests that it knows the catalog is there, since it displays None in the list of catalogs, but the source is not there.

I also tried something like:
newgui = intake.interface.gui.GUI
and
newgui.add(mycat)
but got:

AttributeError Traceback (most recent call last)
Cell In[46], line 1
----> 1 newgui.add(mycat)

File ~/anaconda3/lib/python3.11/site-packages/intake/interface/gui.py:65, in GUI.add(self, *args, **kwargs)
63 def add(self, *args, **kwargs):
64 """Add to list of cats"""
---> 65 return self.cat.select.add(*args, **kwargs)

AttributeError: 'NoneType' object has no attribute 'select'
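
A note on the first traceback: the failing line, `if right in name`, receives `name = None`, which matches the `None` shown in the GUI's list of catalogs. A catalog built with `Catalog.from_dict(catalog_d)` and no `name` argument would plausibly be listed under that key. The sketch below reproduces only the Python-level failure with plain dictionaries; `options`, `unnamed_catalog`, and the marker character are stand-ins, not intake objects.

```python
# Stand-ins for the selector state in intake/interface/catalog/select.py.
# The marker character and the dictionary contents are assumptions.
right = "►"
unnamed_catalog = object()
options = {None: unnamed_catalog}  # a catalog without a name is keyed by None

# Same kind of lookup as line 111 of the traceback: find the key for this catalog.
name = next(k for k, v in options.items() if v is unnamed_catalog)

try:
    nested = right in name  # line 113: membership test against None
except TypeError as exc:
    error = exc
else:
    error = None

# With a string key the same membership test succeeds.
named_options = {"CSV Files": unnamed_catalog}
named = next(k for k, v in named_options.items() if v is unnamed_catalog)
nested_named = right in named
```

Giving the catalog a name, as the later `Catalog.from_dict(intake_sources, name="CSV Files", ...)` call in this thread does, should avoid the `None` key.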

@rsignell

rsignell commented Nov 26, 2023

@tedhabermann , I suggest taking a look at this Project Pythia Intake notebook.

It's nearly exactly what you want to do: You can just change it to read CSV files instead of Zarr datasets!

@tedhabermann
Author

tedhabermann commented Nov 26, 2023 via email

@tedhabermann
Author

Rich - you are right, this is a very interesting and helpful tutorial. I learned a lot about ways to use catalogs. Unfortunately, there are some important differences. The tutorial uses the CSV data source to inform users about options, while I was trying to read the file names and add them as sources to the very cool intake GUI. The tutorial also seems to be aimed at users who are rather fluent in Python, which is fine, but I am aiming at users with less interest in writing code to select the datasets they are interested in...

Also I really liked the integration of different types of data sources into the catalog. I will definitely use that capability once I figure out how to create a catalog that works and add it to the GUI!

@rsignell

@tedhabermann Here's an example of creating a catalog from several CSV files -- it was trickier than I thought!
https://nbviewer.org/gist/rsignell/2b9a8f7476666f469c0ebddadbef708c

Also, if you want a nice user interface for folks, you might want to consider https://lumen.holoviz.org/ instead of the intake gui.

@tedhabermann
Author

Rich,

Thanks again. I want to make sure I understand this.

  1. Open each csv file with open_csv to create a local source and give it a name
  2. Create a dictionary (catalog) - the documentation suggests Catalog.from_dict to do this...?
  3. Write it to a file as yaml
  4. Open the file to create cat
  5. Add sources to cat - the documentation suggests LocalCatalogEntry to do this...?
  6. Write cat to a file
  7. Open the file you just wrote

This seems like quite a bit of disc access... and a little kludgey...
Is it because the functions from_dict and LocalCatalogEntry, which are in the documentation, don't work?

@rsignell

rsignell commented Nov 28, 2023

@tedhabermann, yes, I see your point. I think the add approach would be nice if you have a variety of different datasets, but if they are all CSVs, it would be nicer to just generate the entries using LocalCatalogEntry, as you suggest.

Based on this information in the Intake documentation, I created this notebook which uses this pattern:

from intake.catalog.local import LocalCatalogEntry 
from intake.catalog import Catalog

cat = Catalog()
csv_list = [ ['states1', 'states_1.csv'], 
             ['states2', 'states_2.csv'] ]
cat._entries = {name: LocalCatalogEntry(name, description='',                                     
                                        driver='intake.source.csv.CSVSource',
                                        args={"urlpath": url})  for name, url in csv_list}
cat.save('catalog.yml')

For a more complex LocalCatalogEntry, check out this StackOverflow answer from @martindurant!

@rsignell

rsignell commented Nov 29, 2023

@tedhabermann and finally, what you probably were asking for in the beginning. 🙂

My colleague pointed out that using _entries = is kind of inelegant.

So here we create a dict of sources using LocalCatalogEntry and then create the catalog using Catalog.from_dict():

import intake
from intake.catalog.local import LocalCatalogEntry 
from intake.catalog import Catalog
from pathlib import Path

source_list = ['states_1.csv', 'states_2.csv']

intake_sources={}

for source in source_list:
    name = Path(source).stem
    intake_sources[name] = LocalCatalogEntry(
        name=name,
        description=f'CSV file {name}',
        driver='intake.source.csv.CSVSource',
        args={
            'urlpath': source
             },
        metadata={
                'agency': 'blah',
                'another tag': 'blah'
            }
    )

cat = Catalog.from_dict(
    intake_sources,
    name="CSV Files",
    description="CSV Files from Intake Examples", 
    )
cat.save('catalog.yml') 

Here is the Full Notebook!
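
Since the original question starts from a directory of data files rather than a hard-coded `source_list`, here is a minimal sketch of deriving the per-file entries from the file names, using only the standard library. The file-name pattern is an assumption inferred from names in this thread such as `USAID-2023-09-18-AllReport.csv`, and `describe_csv_files` is a hypothetical helper, not part of intake.

```python
import re
from pathlib import Path

# Assumed file-name pattern, inferred from names such as
# "USAID-2023-09-18-AllReport.csv": <agency>-<YYYY-MM-DD>-<dataType>Report.csv
FILE_PATTERN = re.compile(
    r"^(?P<agency>[A-Za-z]+)-(?P<timestamp>\d{4}-\d{2}-\d{2})-(?P<data_type>\w+)Report$"
)


def describe_csv_files(directory):
    """Return {source_name: spec} for every matching CSV file in `directory`."""
    specs = {}
    for path in sorted(Path(directory).expanduser().glob("*.csv")):
        match = FILE_PATTERN.match(path.stem)
        if match is None:
            continue  # skip files that do not follow the assumed pattern
        agency, timestamp, data_type = match.group("agency", "timestamp", "data_type")
        name = f"{agency}-{timestamp}-{data_type}"
        specs[name] = {
            "name": name,
            "description": f"CHORUS {agency} {data_type} Report",
            "driver": "intake.source.csv.CSVSource",
            "args": {"urlpath": str(path)},  # a fresh dict per file
            "metadata": {
                "agency": agency,
                "timestamp": timestamp,
                "dataType": data_type,
            },
        }
    return specs
```

Each spec uses the same keyword names as the `LocalCatalogEntry(...)` call above, so it could presumably be passed along as `LocalCatalogEntry(**spec)` inside the loop.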

@tedhabermann
Author

Rich - yes, I think this is it... Thanks so much for your patience and persistence!

@tedhabermann
Author

Rich - the problem seems to be how more complex catalog structures get passed into the LocalCatalogEntry function. I was trying to pass a list of arguments like userParameter_l = [ UserParameter(x) for x in cat_d['parameters'] ], but that did not work, even when I tried to follow the single-parameter example that Martin provided on Stack Overflow, so I gave up on parameters.
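
One Python-level detail may explain part of the trouble: iterating over a dictionary yields only its keys, so the comprehension above passes just the names `agency`, `dataType`, and `timestamp`, and the descriptions, types, and defaults are dropped. The original dictionary also stores `type` as the Python class `str` (hence `!!python/name:builtins.str` in the YAML), whereas Intake catalog files spell parameter types as strings such as `str`. A minimal sketch of the conversion, using plain dictionaries only; the helper name `parameter_kwargs` is hypothetical, and passing each result as `UserParameter(**kwargs)` assumes that constructor accepts `name`, `description`, `type`, and `default` keywords.

```python
def parameter_kwargs(parameters):
    """Turn {name: spec} into a list of keyword dictionaries, one per parameter."""
    result = []
    for name, spec in parameters.items():  # .items() keeps the spec, not just the key
        kwargs = {"name": name, **spec}
        declared = kwargs.get("type")
        if isinstance(declared, type):
            kwargs["type"] = declared.__name__  # the class str becomes the string "str"
        result.append(kwargs)
    return result


parameters = {
    "agency": {"default": "USAID", "description": "agency acronym", "type": str},
    "dataType": {"default": "all", "description": "CHORUS data type", "type": str},
    "timestamp": {"default": "2023-09-18", "description": "YYYY-MM-DD", "type": str},
}

keys_only = [x for x in parameters]  # what a bare loop over the dict iterates over
converted = parameter_kwargs(parameters)
```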

Now I am trying to pass an argument dictionary that has a dtype dictionary in it, like this (urlpath is added to this dictionary as I loop over the file names):
dataTypes = {
    'All': {
        'description': 'CHORUS USAID All Report',
        'driver': 'csv',
        'args': {
            'urlpath': None,
            'csv_kwargs': {
                'dtype': {
                    'Datasets': 'object',
                    'Issue': 'object',
                    'ORCID': 'object',
                    'Agency Portal URL': 'object',
                    'Grant ID': 'object'
                }
            }
        }
    },

In the catalog this becomes something like the YAML below, which defines an anchor &id001 and later refers to it as *id001. Unfortunately, the urlpath should not be the same in the sources that reference *id001: it should be unique for each source, so this reference approach will not work.
sources:
  NSF-2023-04-27-All:
    args: &id001
      csv_kwargs:
        dtype:
          Agency Portal URL: object
          Datasets: object
          Grant ID: object
          Issue: object
          ORCID: object
      urlpath: /Users/metadatagamechanger/MetadataGameChanger/ProjectsAndPlans/INFORMATE/CHORUS/data/USGS-2023-10-04-AllReport.csv
    description: CHORUS USAID All Report
    driver: csv
    name: NSF-2023-04-27-All
    parameters: {}
  NSF-2023-04-28-All:
    args: *id001
    description: CHORUS USAID All Report
    driver: csv
    name: NSF-2023-04-28-All
    parameters: {}
  NSF-2023-08-23-All:
    args: *id001
    description: CHORUS USAID All Report
    driver: csv
    name: NSF-2023-08-23-All
    parameters: {}

I am trying to create these LocalCatalogEntry objects as
localCatEntry = LocalCatalogEntry(  # create the LocalCatalogEntry
    name=cat_d['name'],
    description=cat_d['description'],
    # parameters=userParameter_l,
    driver=cat_d['driver'],
    args=cat_d['args']
)
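
The `&id001` anchor and `*id001` aliases are how a YAML dumper records that the same Python object appears more than once. That would happen if every entry receives the one `args` dictionary stored in `dataTypes['All']`, with `urlpath` overwritten on each pass through the loop, so all entries end up with the last path. A minimal sketch of the effect and of one possible fix, `copy.deepcopy`, using plain dictionaries only; the file names and the `build_specs` helper are hypothetical.

```python
import copy

# A cut-down version of the dataTypes['All'] template from the comment above.
template = {
    "description": "CHORUS USAID All Report",
    "driver": "csv",
    "args": {"urlpath": None, "csv_kwargs": {"dtype": {"ORCID": "object"}}},
}


def build_specs(paths, share_template):
    """Build one spec per path, either reusing or deep-copying the template."""
    specs = {}
    for path in paths:
        spec = template if share_template else copy.deepcopy(template)
        spec["args"]["urlpath"] = path
        specs[path] = {"driver": spec["driver"], "args": spec["args"]}
    return specs


paths = ["a.csv", "b.csv", "c.csv"]  # hypothetical file names

shared = build_specs(paths, share_template=True)
# All three entries point at the same args object, so they share the last urlpath.
shared_paths = [shared[p]["args"]["urlpath"] for p in paths]

independent = build_specs(paths, share_template=False)
# Each entry owns its args dictionary and keeps its own urlpath.
independent_paths = [independent[p]["args"]["urlpath"] for p in paths]
```

With independent copies there is no repeated object for the dumper to alias, so each source should be written out with its own `args` block.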

@rsignell

rsignell commented Dec 1, 2023

@tedhabermann it's hard to read this without code formatting and syntax highlighting. Can you please spend a few minutes editing your questions?
