
Change local_path to save_path in get_bsrn #1282

Merged
merged 7 commits into pvlib:master on Sep 1, 2021

Conversation

AdamRJensen
Member

@AdamRJensen AdamRJensen commented Aug 15, 2021

  • Pull request is nearly complete and ready for detailed review.
  • Maintainer: Appropriate GitHub Labels and Milestone are assigned to the Pull Request and linked Issue.

This pull request changes the input argument local_path to save_path in get_bsrn as this is a more descriptive name.

I am making this pull request now as this needs to be merged before 0.9 as otherwise it will be a breaking change.

@AdamRJensen
Member Author

@kanderso-nrel Wanna add labels? :)

@kandersolar kandersolar added this to the 0.9.0 milestone Aug 15, 2021
@kandersolar
Member

@AdamRJensen as long as we're thinking about clarity... what do you think about editing the parameter description to make it explicit that save_path must be a directory (not a file)?

@wholmgren
Member

I don't recall any other iotools functions that will write data to disk. Am I forgetting something? While I'm not opposed to throwing it into the wild and seeing what the response is, I'm leery of requests to add this feature to every fetch function and there is something to be said for API consistency. Why not let users get the raw files themselves? We're still giving them a useful parser.

@AdamRJensen
Member Author

It's true that you haven't seen it in the other functions, but I think that is a mistake!

Particularly for research projects (which need to be reproducible), I find it essential to have the raw files available locally. The main reason is that many data suppliers are not always consistent (e.g., BSRN updates files when mistakes are found, and CAMS computes irradiance on the fly using the most recent inputs). Sure, users could save the returned dataframe, but there's no straightforward method for also packing the associated metadata.

The alternative is for users to get the files themselves, but then all the super nice retrieval functions are obsolete. Personally, this would reduce the value of the iotools by ~40% and add all the hassle the retrieval functions are trying to remove.

If it's because you're worried about requests to add it, then I'd be happy to implement this feature for all the retrieval functions or at least the ones it makes sense for.

I'm very interested in hearing other opinions on this. Do pvlib users not save data, and if you do, how do you do it?

@wholmgren
Member

I'd be ok with functions like download_bsrn that do just that one thing: download a file. I dislike the idea of saving files as a side effect of fetching the data and putting it into python data structures.

What other libraries can we look to for patterns? I found only one example of siphon supporting download to a file:

https://github.com/Unidata/siphon/blob/feb7f60260d157c1c282469f755c3998ef6a6e0f/siphon/catalog.py#L598-L611

but then all the super nice retrieval functions are obsolete

I typically value the parsers more than the fetchers, so maybe we're coming in with different perspectives. The BSRN parser that you wrote is a great example of this - the file is horrible to parse but not difficult to get.

In the interest of not holding up a release, I see two options:

  1. Merge this change (it's certainly an improvement). Comes with the risk of introducing a deprecation in the next release cycle.
  2. Revise this PR to remove this feature from this function. Eliminates the risk of introducing a feature that may be deprecated, but doesn't allow @AdamRJensen (or others) to use this feature.

Then we take up discussion elsewhere for the next release cycle.

@mikofski
Member

I like the pvlib pattern of functions only doing one thing, so my preference is for a separate download_bsrn(). I agree that some files, like ECMWF files, are so cumbersome to retrieve that a separate step to save them for repeatability is very useful, but for other files it's an easy step for users to save them themselves.

@kandersolar
Member

I don't recall any other iotools functions that will write data to disk.

They're not iotools functions yet, but #1264 and #1274 (WIP ERA5 and MERRA2 PRs) also have this local_path option, so BSRN would have some company if we decide to go forward with this.

I see the point about functions doing one thing. But I'm also not enthusiastic about adding download to our already somewhat cluttered get/read/parse API.

Comes with the risk of introducing a deprecation in the next release cycle.

If you think that chance is nontrivial, and the way forward isn't clear, I'd lean towards removing the feature for now until the way forward becomes clear. Deprecations are a pain for us and for users.

@AdamRJensen
Member Author

I know that other solar-related libraries save local data, e.g., SolarData and irradpy, so it seems others also find this useful.

While it is easy to manually download a single file from one of the services, this is not the case for many use cases, e.g., simulating PV systems using irradiance data from multiple sources, or getting all BSRN files for the past 5 years for Europe. Besides the ECMWF files that @mikofski mentioned, ERA5 and MERRA2 files also cannot be downloaded from a user interface but require API usage.

Taking get_bsrn as an example, the option of saving local data adds two lines of documentation and three lines of code. As previously mentioned, having a download function as an alternative would add a bunch of functions to the already cluttered iotools, and instead of the 5 lines currently used to support the feature, it would require hundreds of lines of duplicated code or defining a private fetch function that both get_ and download_ could use - sounds like a hassle to me.

A thing that I don't like about the option of just saving the data and metadata objects is that our parser functions carry out an 'irreversible' act, i.e., you cannot recreate the data files (metadata that isn't parsed is also lost). Having the raw data files locally makes it reversible, in a way, if that makes sense.

@wholmgren
Member

If you think that chance is nontrivial, and the way forward isn't clear, I'd lean towards removing the feature for now until the way forward becomes clear. Deprecations are a pain for us and for users.

I do think the chance is nontrivial. I'm less concerned about deprecation because it feels like we might need a larger reorganization of the iotools package to support this feature across more modules. And it feels like we're getting to the point where a separate repository/package may be preferable. So I'm still ok with keeping the save_path as an experiment or removing the feature. I don't think anyone's mind is going to be changed with further discussion right now.

@mikofski
Member

having a download function as an alternative would add a bunch of functions to the already cluttered iotools, and instead of the 5 lines currently used to support the feature, it would require hundreds of lines of duplicated code or defining a private fetch function that both get_ and download_ could use - sounds like a hassle to me.

I don't think this is necessarily so. For example, imagine how a user should reuse a file that they have downloaded: would they rerun the retrieval function, downloading another copy? Can't we move the parsing code to a separate function? This isolates each step into its own function so they can be grouped as needed. E.g.:

  1. download_bsrn(lat, lon, ...) - only retrieves the file and saves it to disk (perhaps optionally), and returns the file buffer. Note: it doesn't parse the contents.
  2. read_bsrn(file) - read a saved file and call the parsing function to output a dataframe.
  3. parse_bsrn(fbuf) - parse the contents of the file buffer to a data frame
  4. get_bsrn(lat, lon, ...) - calls download and parsing functions to output a dataframe. Maybe this has optional localpath arg as suggested?

@wholmgren
Member

@mikofski just to be clear, are you also thinking of a _get_raw_bsrn(lat, lon, ...) that would be used by both download_bsrn and get_bsrn?

def _get_raw_bsrn(lat, lon, ...):
    # get the data
    return data


def download_bsrn(filepath_or_buffer, lat, lon, ...):
    data = _get_raw_bsrn(lat, lon, ...)
    with open(filepath_or_buffer, 'w') as f:
        f.write(data)


def get_bsrn(lat, lon, ...):
    data = _get_raw_bsrn(lat, lon, ...)
    parsed_data, metadata = parse_bsrn(data)  # might need to be buffered
    return parsed_data, metadata


def read_bsrn(filepath):
    with open(filepath) as f:
        parsed_data, metadata = parse_bsrn(f)
        return parsed_data, metadata


def parse_bsrn(fbuf):
    # do the parsing
    return parsed_data, metadata

That could work.

Here are some attempts to simplify the iotools API by putting slightly more responsibility on the user...

Alternative 1:

def get_raw_bsrn(lat, lon, ...):
    # get the data 
    # plain text easier for most users when API returns plain text 
    # but bad fit for consistency with APIs that return binary data
    return buffered_data  


def read_bsrn(filepath_or_buffer):
    # https://docs.python.org/3/library/contextlib.html#contextlib.nullcontext
    if isinstance(filepath_or_buffer, (str, Path)):
        cm = open(filepath_or_buffer)
    else:
        cm = nullcontext(filepath_or_buffer)
    with cm as buffered_data:
        parsed_data, metadata = parse_bsrn(buffered_data)
        return parsed_data, metadata


def parse_bsrn(buffered_data):
    # do the parsing
    return parsed_data, metadata


# my code
buffered_data = get_raw_bsrn(lat, lon, ...)
parsed_data, metadata = parse_bsrn(buffered_data)

# adam's code
buffered_data = get_raw_bsrn(lat, lon, ...)
with open('myfile', 'w') as f:
    # or copyfileobj https://stackoverflow.com/a/3253819/2802993
    f.write(buffered_data.read())
parsed_data, metadata = read_bsrn('myfile')

Alternative 2

def get_raw_bsrn(lat, lon, ...):
    # get the data 
    # plain text easier for most users when API returns csv 
    # but bad fit for consistency with APIs that return binary data
    return data.read()


def read_bsrn(filepath_or_buffer):
    # https://docs.python.org/3/library/contextlib.html#contextlib.nullcontext
    if isinstance(filepath_or_buffer, (str, Path)):
        cm = open(filepath_or_buffer)
    else:
        cm = nullcontext(filepath_or_buffer)
    with cm as buffered_data:
        parsed_data, metadata = parse_bsrn(buffered_data)
        return parsed_data, metadata


def parse_bsrn(data):
    buffered_data = StringIO(data)
    # do the parsing
    return parsed_data, metadata


# my code
data = get_raw_bsrn(lat, lon, ...)
parsed_data, metadata = parse_bsrn(data)

# adam's code
data = get_raw_bsrn(lat, lon, ...)
with open('myfile', 'w') as f:
    f.write(data)
parsed_data, metadata = read_bsrn('myfile')

Reiterating the comment in the get_raw_bsrn function: Alternative 2 might look better until you consider consistency implications for PRs like #1264 #1274.

To support direct calls to APIs in read_bsrn and friends, we could consider copying or directly using some of pandas.io.common. It's not documented in the reference guide so it's unclear if it's considered public and thus safe for us to rely on. Obviously pandas has put a ton of work into making read_csv and friends as seamless as possible. Is it reasonable to expect the same for pvlib.iotools? I'd say no.

@wholmgren wholmgren mentioned this pull request Aug 20, 2021
Member

@wholmgren wholmgren left a comment

Given further discussion in #1214, I recommend that we move ahead with this PR (rather than removing the feature) despite the relatively high risk of deprecation.

Is the save file feature actually tested? I don't see anything asserting that a file was written. Perhaps write to a temporary directory and assert that the file is there. I'm not sure it's necessary to assert that the contents are as expected; I guess you could point the parse function at the (buffered) local file to ensure that it parses.

@@ -85,7 +85,7 @@ def test_get_bsrn(expected_index, bsrn_credentials):
     station='tam',
     username=username,
     password=password,
-    local_path='')
+    save_path='')
Member
What happens when save_path=''? An empty string is not None (as the function logic tests for). Is this supposed to save a file?

Member Author

The idea is that save_path can be either a directory path or a file path (filename), depending on the nature of the function.

Since get_bsrn retrieves data from a number of actual files (and not a database), the original files are retrieved and saved to the specified directory path with their original filenames.

save_path: str or path-like, optional
    If specified, a directory path of where to save files.

get_bsrn differs quite a bit from the other functions, as the other functions generally retrieve one request from an API (which doesn't have an associated file name) and then save_path would be a filename.

So in the get_bsrn case I assume the empty string represents the current working directory. I can definitely write a test that checks that (better yet, modify the current test to check it).
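A sketch of such a test in pytest style, with the network fetch stubbed out so the save logic itself can be exercised (the `_fetch` hook, stub name, and filename are illustrative, not pvlib's real API, which requires BSRN FTP credentials):

```python
import os


def get_bsrn_stub(station, start, end, save_path=None, _fetch=None):
    # Stand-in for get_bsrn: _fetch simulates the FTP download and
    # returns a mapping of original filename -> raw bytes.
    files = _fetch(station, start, end)
    if save_path is not None:
        for filename, raw in files.items():
            with open(os.path.join(save_path, filename), 'wb') as f:
                f.write(raw)
    return files  # the real function would parse and return (data, metadata)


def test_get_bsrn_save_path(tmp_path):
    # tmp_path is pytest's per-test temporary directory fixture
    fake_fetch = lambda station, start, end: {'tam0821.dat.gz': b'raw'}
    get_bsrn_stub('tam', None, None, save_path=str(tmp_path),
                  _fetch=fake_fetch)
    assert (tmp_path / 'tam0821.dat.gz').exists()
```

The same assertion pattern works against the real function by passing real credentials and pointing save_path at tmp_path.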

Member

I guess it's ok here, since the filename includes the station name and we're not matching an existing API nor committing to retaining this API.

The idea is that save_path can either be a directory or file path (filename) depending on the nature of the function.

I doubt this will be clear to most users.

Member

@wholmgren wholmgren left a comment

Test looks good, thanks.

pvlib/iotools/bsrn.py (outdated review thread, resolved)
@adriesse
Member

+1 for keeping functions mostly single-purpose

Member

@wholmgren wholmgren left a comment

@kanderso-nrel feel free to merge if you're happy with this

@kandersolar kandersolar merged commit 2dff28a into pvlib:master Sep 1, 2021
@AdamRJensen
Member Author

Just stumbled upon SunPy's Fido module which bears a strong resemblance to pvlib's iotools. The Fido functions have a keyword path that allows for files to be written to the disk, e.g.:

# Once you have located your files via a Fido.search you can download them via Fido.fetch:
downloaded_files = Fido.fetch(results, path='/ThisIs/MyPath/to/Data/{file}')

Some explanation from SunPy:

This downloads the query results into the directory /ThisIs/MyPath/to/Data, naming each downloaded file with the filename {file} obtained from the client. You can also use other properties of the returned query to define the path where the data is saved.

It's not completely parallel, but I figured I'd mention it here in regards to the previous debate:

I'd be ok with functions like download_bsrn that do just that one thing: download a file. I dislike the idea of saving files as a side effect of fetching the data and putting it into python data structures.

What other libraries can we look to for patterns? I found only one example of siphon supporting download to a file:

https://github.com/Unidata/siphon/blob/feb7f60260d157c1c282469f755c3998ef6a6e0f/siphon/catalog.py#L598-L611
