
Change local_path to save_path in get_bsrn #1282

Merged
merged 7 commits into pvlib:master on Sep 1, 2021

Conversation

AdamRJensen
Member

@AdamRJensen AdamRJensen commented Aug 15, 2021

  • Pull request is nearly complete and ready for detailed review.
  • Maintainer: Appropriate GitHub Labels and Milestone are assigned to the Pull Request and linked Issue.

This pull request changes the input argument local_path to save_path in get_bsrn as this is a more descriptive name.

I am making this pull request now as this needs to be merged before 0.9 as otherwise it will be a breaking change.

@AdamRJensen
Member Author

@kanderso-nrel Wanna add labels? :)

@kandersolar kandersolar added this to the 0.9.0 milestone Aug 15, 2021
@kandersolar
Member

@AdamRJensen as long as we're thinking about clarity... what do you think about editing the parameter description to make it explicit that save_path must be a directory (not a file)?

@wholmgren
Member

I don't recall any other iotools functions that will write data to disk. Am I forgetting something? While I'm not opposed to throwing it into the wild and seeing what the response is, I'm leery of requests to add this feature to every fetch function and there is something to be said for API consistency. Why not let users get the raw files themselves? We're still giving them a useful parser.

@AdamRJensen
Member Author

It's true that you haven't seen it in the other functions, but I think that is a mistake!

Particularly for research projects (which need to be reproducible), I find it essential to have the raw files available locally. The main reason is that many data suppliers are not always consistent (e.g., BSRN updates files when mistakes are found, and CAMS computes irradiance on the fly using the most recent inputs). Sure, users could save the returned dataframe, but there's no straightforward method for also packing the associated metadata.

The alternative is for users to get the files themselves, but then all the super nice retrieval functions are obsolete. Personally, this would reduce the value of the iotools by ~40% and add all the hassle the retrieval functions are trying to remove.

If it's because you're worried about requests to add it, then I'd be happy to implement this feature for all the retrieval functions or at least the ones it makes sense for.

I'm very interested in hearing other opinions on this. Do pvlib users not save data, and if you do, how do you do it?

@wholmgren
Member

I'd be ok with functions like download_bsrn that do just that one thing: download a file. I dislike the idea of saving files as a side effect of fetching the data and putting it into python data structures.

What other libraries can we look to for patterns? I found only one example of siphon supporting download to a file:

https://github.com/Unidata/siphon/blob/feb7f60260d157c1c282469f755c3998ef6a6e0f/siphon/catalog.py#L598-L611

but then all the super nice retrieval functions are obsolete

I typically value the parsers more than the fetchers, so maybe we're coming in with different perspectives. The BSRN parser that you wrote is a great example of this - the file is horrible to parse but not difficult to get.

In the interest of not holding up a release, I see two options:

  1. Merge this change (it's certainly an improvement). Comes with the risk of introducing a deprecation in the next release cycle.
  2. Revise this PR to remove this feature from this function. Eliminates the risk of introducing a feature that may be deprecated, but doesn't allow @AdamRJensen (or others) to use this feature.

Then we take up discussion elsewhere for the next release cycle.

@mikofski
Member

I like the pvlib pattern of functions only doing one thing, so my preference is for a separate download_bsrn(). I agree that some files, like ECMWF files, are so cumbersome to retrieve that a separate step to save them for repeatability is very useful, but for other files it's an easy step for users to save them themselves.

@kandersolar
Member

I don't recall any other iotools functions that will write data to disk.

They're not iotools functions yet, but #1264 and #1274 (WIP ERA5 and MERRA2 PRs) also have this local_path option, so BSRN would have some company if we decide to go forward with this.

I see the point about functions doing one thing. But I'm also not enthusiastic about adding download to our already somewhat cluttered get/read/parse API.

Comes with the risk of introducing a deprecation in the next release cycle.

If you think that chance is nontrivial, and the way forward isn't clear, I'd lean towards removing the feature for now until the way forward becomes clear. Deprecations are a pain for us and for users.

@AdamRJensen
Member Author

I know that other solar-related libraries save local data, e.g., SolarData and irradpy, so it seems others also find this useful.

While it is easy to manually download a single file from one of the services, this is not the case for many use cases, e.g., simulating PV systems using irradiance data from multiple sources, or getting all BSRN files for the past 5 years for Europe. Besides the ECMWF files that @mikofski mentioned, ERA5 and MERRA2 files also cannot be downloaded from a user interface but require API usage.

Taking get_bsrn as an example, the option of saving local data adds two lines of documentation and three lines of code. As previously mentioned, having a download function as an alternative would add a bunch of functions to the already cluttered iotools, and instead of the 5 lines currently used to support the feature, it would require hundreds of lines of duplicated code or defining a private fetch function that both get_ and download_ could use - sounds like a hassle to me.

A thing that I don't like about the option of just saving the data and metadata objects is that our parser functions carry out an 'irreversible' act, i.e., you cannot recreate the data files (metadata that isn't parsed is also lost). Having the raw data files locally makes it reversible, in a way, if that makes sense.

@wholmgren
Member

If you think that chance is nontrivial, and the way forward isn't clear, I'd lean towards removing the feature for now until the way forward becomes clear. Deprecations are a pain for us and for users.

I do think the chance is nontrivial. I'm less concerned about deprecation because it feels like we might need a larger reorganization of the iotools package to support this feature across more modules. And it feels like we're getting to the point where a separate repository/package may be preferable. So I'm still ok with keeping the save_path as an experiment or removing the feature. I don't think anyone's mind is going to be changed with further discussion right now.

@mikofski
Member

having a download function as an alternative would add a bunch of functions to the already cluttered iotools, and instead of the 5 lines currently used to support the feature, it would require hundreds of lines of duplicated code or defining a private fetch function that both get_ and download_ could use - sounds like a hassle to me.

I don't think this is necessarily so. For example, imagine how a user should reuse a file that they have downloaded: would they rerun the retrieval function, downloading another copy? Can't we move the parsing code to a separate function? This isolates each step into its own function so they can be grouped as needed. E.g.:

  1. download_bsrn(lat, lon, ...) - only retrieves the file and saves it to disk (perhaps optionally), and returns the file buffer. Note: it doesn't parse the contents.
  2. read_bsrn(file) - read a saved file and call the parsing function to output a dataframe.
  3. parse_bsrn(fbuf) - parse the contents of the file buffer to a data frame
  4. get_bsrn(lat, lon, ...) - calls download and parsing functions to output a dataframe. Maybe this has optional localpath arg as suggested?

@wholmgren
Member

@mikofski just to be clear, are you also thinking of a _get_raw_bsrn(lat, lon, ...) that would be used by both download_bsrn and get_bsrn?

def _get_raw_bsrn(lat, lon, ...):
    # get the data
    return data


def download_bsrn(filepath_or_buffer, lat, lon, ...):
    data = _get_raw_bsrn(lat, lon, ...)
    with open(filepath_or_buffer, 'w') as f:
        f.write(data)


def get_bsrn(lat, lon, ...):
    data = _get_raw_bsrn(lat, lon, ...)
    parsed_data, metadata = parse_bsrn(data)  # might need to be buffered
    return parsed_data, metadata


def read_bsrn(filepath):
    with open(filepath) as f:
        parsed_data, metadata = parse_bsrn(f)
        return parsed_data, metadata


def parse_bsrn(fbuf):
    # do the parsing
    return parsed_data, metadata

That could work.

Here are some attempts to simplify the iotools API by putting slightly more responsibility on the user...

Alternative 1:

def get_raw_bsrn(lat, lon, ...):
    # get the data 
    # plain text easier for most users when API returns plain text 
    # but bad fit for consistency with APIs that return binary data
    return buffered_data  


def read_bsrn(filepath_or_buffer):
    # https://docs.python.org/3/library/contextlib.html#contextlib.nullcontext
    if isinstance(filepath_or_buffer, (str, Path)):
        cm = open(filepath_or_buffer)
    else:
        cm = nullcontext(filepath_or_buffer)
    with cm as buffered_data:
        parsed_data, metadata = parse_bsrn(buffered_data)
        return parsed_data, metadata


def parse_bsrn(buffered_data):
    # do the parsing
    return parsed_data, metadata


# my code
buffered_data = get_raw_bsrn(lat, lon, ...)
parsed_data, metadata = parse_bsrn(buffered_data)

# adam's code
buffered_data = get_raw_bsrn(lat, lon, ...)
with open('myfile', 'w') as f:
    # or copyfileobj https://stackoverflow.com/a/3253819/2802993
    f.write(buffered_data.read())
parsed_data, metadata = read_bsrn('myfile')

Alternative 2

def get_raw_bsrn(lat, lon, ...):
    # get the data 
    # plain text easier for most users when API returns csv 
    # but bad fit for consistency with APIs that return binary data
    return data.read()


def read_bsrn(filepath_or_buffer):
    # https://docs.python.org/3/library/contextlib.html#contextlib.nullcontext
    if isinstance(filepath_or_buffer, (str, Path)):
        cm = open(filepath_or_buffer)
    else:
        cm = nullcontext(filepath_or_buffer)
    with cm as buffered_data:
        parsed_data, metadata = parse_bsrn(buffered_data)
        return parsed_data, metadata


def parse_bsrn(data):
    buffered_data = StringIO(data)
    # do the parsing
    return parsed_data, metadata


# my code
data = get_raw_bsrn(lat, lon, ...)
parsed_data, metadata = parse_bsrn(data)

# adam's code
data = get_raw_bsrn(lat, lon, ...)
with open('myfile', 'w') as f:
    f.write(data)
parsed_data, metadata = read_bsrn('myfile')

Reiterating the comment in the get_raw_bsrn function: Alternative 2 might look better until you consider consistency implications for PRs like #1264 #1274.

To support direct calls to APIs in read_bsrn and friends, we could consider copying or directly using some of pandas.io.common. It's not documented in the reference guide so it's unclear if it's considered public and thus safe for us to rely on. Obviously pandas has put a ton of work into making read_csv and friends as seamless as possible. Is it reasonable to expect the same for pvlib.iotools? I'd say no.

@wholmgren wholmgren mentioned this pull request Aug 20, 2021
Member

@wholmgren wholmgren left a comment

Given further discussion in #1214, I recommend that we move ahead with this PR (rather than removing the feature) despite the relatively high risk of deprecation.

Is the save file feature actually tested? I don't see anything asserting that a file was written. Perhaps write to a temporary directory and assert that the file is there. I'm not sure it's necessary to assert that the contents are as expected; I guess you could point the parse function at the (buffered) local file to ensure that it parses.

@@ -85,7 +85,7 @@ def test_get_bsrn(expected_index, bsrn_credentials):
     station='tam',
     username=username,
     password=password,
-    local_path='')
+    save_path='')
Member
What happens when save_path=''? An empty string is not None (as the function logic tests for). Is this supposed to save a file?

Member Author

The idea is that save_path can be either a directory path or a file path (filename), depending on the nature of the function.

Since get_bsrn retrieves data from a number of actual files (and not a database), the original files are retrieved and saved to the specified directory path with their original filenames.

save_path: str or path-like, optional
    If specified, a directory path of where to save files.

get_bsrn differs quite a bit from the other functions, as the other functions generally retrieve one request from an API (which doesn't have an associated file name) and then save_path would be a filename.

So in the get_bsrn case I assume the empty string represents the current working directory. I can definitely write a test that checks that (better yet, modify the current test to check it).
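A sketch of such a test in pytest style, with the network fetch stubbed out so the save logic itself can be exercised (the `_fetch` hook, stub name, and filename are illustrative, not pvlib's real API, which requires BSRN FTP credentials):

```python
import os


def get_bsrn_stub(station, start, end, save_path=None, _fetch=None):
    # Stand-in for get_bsrn: _fetch simulates the FTP download and
    # returns a mapping of original filename -> raw bytes.
    files = _fetch(station, start, end)
    if save_path is not None:
        for filename, raw in files.items():
            with open(os.path.join(save_path, filename), 'wb') as f:
                f.write(raw)
    return files  # the real function would parse and return (data, metadata)


def test_get_bsrn_save_path(tmp_path):
    # tmp_path is pytest's per-test temporary directory fixture
    fake_fetch = lambda station, start, end: {'tam0821.dat.gz': b'raw'}
    get_bsrn_stub('tam', None, None, save_path=str(tmp_path),
                  _fetch=fake_fetch)
    assert (tmp_path / 'tam0821.dat.gz').exists()
```

The same assertion pattern works against the real function by passing real credentials and pointing save_path at tmp_path.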

Member

I guess it's ok here, since the filename includes the station name and we're not matching an existing API nor committing to retaining this API.

The idea is that save_path can either be a directory or file path (filename) depending on the nature of the function.

I doubt this will be clear to most users.

Member

@wholmgren wholmgren left a comment

Test looks good, thanks.

pvlib/iotools/bsrn.py (outdated review thread, resolved)
@adriesse
Member

+1 for keeping functions mostly single-purpose

Member

@wholmgren wholmgren left a comment

@kanderso-nrel feel free to merge if you're happy with this

@kandersolar kandersolar merged commit 2dff28a into pvlib:master Sep 1, 2021
@AdamRJensen
Member Author

Just stumbled upon SunPy's Fido module which bears a strong resemblance to pvlib's iotools. The Fido functions have a keyword path that allows for files to be written to the disk, e.g.:

# Once you have located your files via a Fido.search you can download them via Fido.fetch:
downloaded_files = Fido.fetch(results, path='/ThisIs/MyPath/to/Data/{file}')

Some explanation from SunPy:

This downloads the query results into the directory /ThisIs/MyPath/to/Data, naming each downloaded file with the filename {file} obtained from the client. You can also use other properties of the returned query to define the path where the data is saved.

It's not completely parallel, but I figured I'd mention it here in regards to the previous debate:

I'd be ok with functions like download_bsrn that do just that one thing: download a file. I dislike the idea of saving files as a side effect of fetching the data and putting it into python data structures.

What other libraries can we look to for patterns? I found only one example of siphon supporting download to a file:

https://github.com/Unidata/siphon/blob/feb7f60260d157c1c282469f755c3998ef6a6e0f/siphon/catalog.py#L598-L611
