Add function to retrieve example datasets #924

Open
kandersolar opened this issue Mar 2, 2020 · 6 comments

@kandersolar
Member

It would be nice to have a clean way for gallery examples to get access to the test data files. See #860 (comment)

For a function like load_dataset('greensboro-tmy'):

Pros:

  • No need to monkey around with filepaths, especially ones that aren't really meant to be public
  • Associating files with keys means we can move and rename test data files without it being a breaking change
  • Simplifies example code

Cons:

  • The data files are in several formats (csv, json, nc, h5 etc), so this function would either have to know the appropriate reading method for each file (complicated) or just return a file handle and let the user parse the contents (less useful).
@cwhanse
Member

cwhanse commented Mar 2, 2020

+1 to having the function. I think a useful scope is to return the full path to the file. Reading file content and providing output in various formats seems like a large bite to chew.

@kandersolar
Member Author

Any ideas for what the function should be named if it returns the file path? I'm having trouble coming up with a verb that doesn't seem misleading. Maybe just dataset()?

df = pd.read_csv(dataset('greensboro-tmy'))

with open(dataset('greensboro-tmy')) as f:
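A minimal sketch of the path-returning idea, assuming a hypothetical `dataset()` helper and a hypothetical `_DATASETS` registry (neither exists in pvlib). The only mapping shown, `'greensboro-tmy'` to `723170TYA.CSV`, is taken from the gallery snippet quoted later in this thread; the `data_dir` argument is an assumption made so the sketch runs without pvlib installed.

```python
from pathlib import Path

# Hypothetical registry: keys are the stable names used in examples,
# values are file names inside the data directory. Renaming or moving a
# file only requires updating this mapping, not the example code.
_DATASETS = {
    'greensboro-tmy': '723170TYA.CSV',
}


def dataset(key, data_dir):
    """Return the full path to the data file registered under ``key``.

    ``data_dir`` is passed explicitly here (an assumption for this sketch);
    a packaged version would presumably default to the library's data
    directory.
    """
    try:
        filename = _DATASETS[key]
    except KeyError:
        available = ', '.join(sorted(_DATASETS))
        raise KeyError(
            f"unknown dataset {key!r}; available: {available}"
        ) from None
    return Path(data_dir) / filename
```

Because the function only returns a path, the caller stays free to pick the parser, whether that is `pd.read_csv`, `open`, or a format-specific reader.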

@mikofski
Member

mikofski commented Mar 2, 2020

Con: The data files are in several formats (csv, json, nc, h5 etc)

To me this is a "Pro" for having a dedicated data reader: it makes things easier for users. If it's a PITA for us, it will be a blocker for them, I think.
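For comparison, a rough sketch of what a dedicated reader could look like, dispatching on file extension. The names `load_dataset`, `_READERS`, `_read_csv`, and `_read_json` are hypothetical, and only two formats are covered, using the standard library so the sketch stays self-contained; nc and h5 files would need third-party readers that are not shown.

```python
import csv
import json
from pathlib import Path


def _read_csv(path):
    """Return the rows of a CSV file as a list of lists of strings."""
    with open(path, newline='') as f:
        return list(csv.reader(f))


def _read_json(path):
    """Return the parsed contents of a JSON file."""
    with open(path) as f:
        return json.load(f)


# Hypothetical dispatch table: file suffix -> reader function.
_READERS = {
    '.csv': _read_csv,
    '.json': _read_json,
}


def load_dataset(path):
    """Read a data file with the reader registered for its suffix."""
    path = Path(path)
    suffix = path.suffix.lower()
    try:
        reader = _READERS[suffix]
    except KeyError:
        raise ValueError(
            f"no reader registered for {suffix!r} files"
        ) from None
    return reader(path)
```

The sketch also shows where the complexity comes from: the return type differs per format, and the extension alone does not identify the right parser (a TMY3 file is a CSV that needs `read_tmy3`, not a generic CSV reader), so a real version would likely have to map each dataset key to its own reader.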

@wholmgren
Member

I would be happy if this

import pathlib
import pvlib
from pvlib.iotools import read_tmy3

# get full path to the data directory
DATA_DIR = pathlib.Path(pvlib.__file__).parent / 'data'
# get TMY3 data with rain
greensboro, _ = read_tmy3(DATA_DIR / '723170TYA.CSV', coerce_year=1990)

looked like

# get full path to the data file
file_path = dataset('greensboro-tmy')

# parse TMY3 data
greensboro, _ = read_tmy3(file_path, coerce_year=1990)

I don't think the broader scope is feasible. To be clear, this is just something for the tests/examples - not for anything else.

metpy has a get_test_data function with the same idea, but a different implementation because it uses a caching back end that I think we should avoid.

example: https://unidata.github.io/MetPy/latest/examples/XArray_Projections.html#sphx-glr-examples-xarray-projections-py

@CameronTStark CameronTStark added this to the 0.7.3 milestone Mar 6, 2020
@kandersolar
Member Author

I just stumbled across some functions in pkg_resources that might be relevant to this issue: https://setuptools.readthedocs.io/en/latest/pkg_resources.html#resource-extraction

@wholmgren wholmgren modified the milestones: 0.7.3, 0.8.0 Jul 17, 2020
@wholmgren wholmgren modified the milestones: 0.8.0, Someday Aug 28, 2020
@echedey-ls
Contributor

I just stumbled upon this issue while looking for ideas for GSoC.
@kanderso-nrel, just a quick note on that might-be-relevant package:
From my experience with both PVLIB and SciencePlots, sticking to __path__[0] (which is already in use in PVLIB) is more than enough, and I believe (or want to believe) it is free of errors. At least, in SciencePlots we haven't recorded any issues with it since the breaking changes in the v2.0.0 release.
About pkg_resources: its use is deprecated in favor of other packages. See the attention directive in the pkg_resources documentation.
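As a rough illustration of the two approaches mentioned here, the sketch below locates a file inside a package first via `__path__[0]` and then via `importlib.resources.files` (standard library, Python 3.9+), which is one of the replacements the setuptools documentation points to. The helper names are hypothetical, and the standard-library `json` package stands in for pvlib so the code runs without pvlib installed.

```python
from importlib import import_module
from importlib.resources import files  # available in Python 3.9+
from pathlib import Path


def path_via_dunder_path(package, filename):
    """Locate a file using the first entry of the package's __path__."""
    module = import_module(package)
    return Path(module.__path__[0]) / filename


def path_via_importlib(package, filename):
    """Locate the same file using importlib.resources.files."""
    return files(package) / filename


# Demonstrated on the stdlib 'json' package purely for illustration.
print(path_via_dunder_path('json', '__init__.py'))
print(path_via_importlib('json', '__init__.py'))
```

For pvlib the equivalent call would presumably be `files('pvlib') / 'data' / '723170TYA.CSV'`, assuming the data files ship inside the installed package as the `DATA_DIR` snippet earlier in the thread implies.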
