Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add read_bsrn function #1145

Merged
merged 46 commits into from
Feb 11, 2021
Merged

Add read_bsrn function #1145

merged 46 commits into from
Feb 11, 2021

Conversation

AdamRJensen
Copy link
Member

@AdamRJensen AdamRJensen commented Jan 24, 2021

  • Closes Add BSRN format reader to iotools #1015
  • I am familiar with the contributing guidelines
  • Tests added
  • Updates entries to docs/sphinx/source/api.rst for API changes.
  • Adds description and name entries in the appropriate "what's new" file in docs/sphinx/source/whatsnew for all changes. Includes link to the GitHub Issue with :issue:`num` or this Pull Request with :pull:`num`. Includes contributor name and/or GitHub username (link with :ghuser:`user`).
  • New code is fully documented. Includes numpydoc compliant docstrings, examples, and comments where necessary.
  • Pull request is nearly complete and ready for detailed review.
  • Maintainer: Appropriate GitHub Labels and Milestone are assigned to the Pull Request and linked Issue.

@AdamRJensen
Copy link
Member Author

@wholmgren I've changed it such that it now commits to pvlib:master. Hope it's correct.

@kandersolar
Copy link
Member

kandersolar commented Jan 24, 2021

Hi @AdamRJensen, could you update this PR to resolve the code style issues raised by Stickler? It's easier to review PRs once stickler is satisfied. The style issues are listed here: https://github.com/pvlib/pvlib-python/pull/1145/files

Edit: also please add a new entry here for your function: https://github.com/pvlib/pvlib-python/blob/master/pvlib/iotools/__init__.py

@AdamRJensen AdamRJensen changed the title Add bsrn_read function Add read_bsrn function Jan 25, 2021
@AdamRJensen AdamRJensen marked this pull request as ready for review January 25, 2021 20:15
@wholmgren
Copy link
Member

Thanks @AdamRJensen! We will need some tests. Do you want me to push a test to your branch or do you want to take it on yourself? It seemed like you might have been disinclined to add a test in your first PR, but I am happy to help since this will be a big help to me.

@AdamRJensen
Copy link
Member Author

Hi @kanderso-nrel, thanks for the heads up!

I wasn't sure if PVlib is very strict on style issues, e.g. line length<79 characters? I have corrected all instances of this, but in some places, it makes the code a little harder to follow / less compact.

@AdamRJensen
Copy link
Member Author

@wholmgren I am definitely up for attempting to make tests. Can you point me in a direction to get started?

@wholmgren
Copy link
Member

Check out pvlib/tests/iotools/test_surfrad.py and pvlib/tests/iotools/test_surfrad.py pvlib/tests/iotools/test_solrad.py.

Add a test file to pvlib/data. Here's an example for the SURFRAD reader. You'll find similar for the other iotools.

Start by making your test as simple as possible. A single data file, no remote io capabilities. Probably ok to only assert the index values and the columns for now.

@wholmgren
Copy link
Member

wholmgren commented Jan 25, 2021

I wasn't sure if PVlib is very strict on style issues, e.g. line length<79 characters? I have corrected all instances of this, but in some places, it makes the code a little harder to follow / less compact.

If you think it makes the code less readable then it's acceptable to add # noqa: E501 to that line. Note pep8 also says two spaces before the #.

@kandersolar
Copy link
Member

If the test file is a proper BSRN data file, we should make sure we're staying within their conditions of data release.

@AdamRJensen
Copy link
Member Author

If the test file is a proper BSRN data file, we should make sure we're staying within their conditions of data release.

@kanderso-nrel and @wholmgren any opinion on if I should create a pseudo-file or ask Amelie from BSRN if we can upload a single file for test purposes?

pvlib/iotools/bsrn.py Outdated Show resolved Hide resolved
@cwhanse
Copy link
Member

cwhanse commented Jan 25, 2021

@AdamRJensen I think we need to get explicit OK from the WRMC, because pvlib-python isn't a traditional "publication", but (in my view) qualifies as a "project for wide release." What matters, though, is WRMC's view of pvlib-python.

The way I read the Data Release Guidelines, assuming WRMC concurrence pvlib-python can include a file of BSRN data provided:

  • the file is clearly identified as originated from the WRMC (use the file name for this purpose if there isn't a free-form header)
  • users are provided with information on how to obtain the original data (function docstring does this, but we could also add a short paragraph to the LICENSE file about using the BSRN file).

@wholmgren
Copy link
Member

wholmgren commented Jan 25, 2021

My reading is that BSRN data downloaded directly from NASA websites is subject to NASA's Data and Information Policy. So I don't see any issue with including a test file from either of the GIM or LaRC sites. I'm not opposed to asking - and that may well be more polite in any case - just pointing out that there may be other options.

@wholmgren wholmgren added this to the 0.9.0 milestone Jan 26, 2021
@AdamRJensen AdamRJensen reopened this Feb 9, 2021
Air temperature was listed as air_temperature in the docstring instead of temp_air.
pvlib/iotools/bsrn.py Outdated Show resolved Hide resolved
pvlib/tests/iotools/test_bsrn.py Outdated Show resolved Hide resolved
Comment on lines 102 to 116
if line.startswith('*'): # Find start of all logical records
line_no_dict[line[2:6]] = num # key is 4 digit LR number

# Determine start and end line of logical record LR0100 to be parsed
start_row = line_no_dict['0100'] + 1 # Start line number
# If LR0100 is the last logical record, then read rest of file
if start_row-1 == max(line_no_dict.values()):
end_row = num # then parse rest of the file
else: # otherwise parse until the beginning of the next logical record
end_row = min([i for i in line_no_dict.values() if i > start_row])
nrows = end_row-start_row

# Read file as a fixed width file (fwf)
data = pd.read_fwf(filename, skiprows=start_row, nrows=nrows, header=None,
colspecs=COL_SPECS, na_values=[-999.0, -99.9])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be possible to avoid reading the file twice by accumulating the first read of just the '0100' section into an IO buffer, stopping the read at the end of the section, and then passing that buffer into pd.read_fwf.

I think this might result in meaningful performance improvements when reading many files. Or maybe we need to profile the code before we put the effort into it.

In any case, I'm ok saving this idea for future work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wholmgren The initial opening of the file and looping through all of the lines takes ~414 ms, whereas the pd.read_fwf takes 1.15 s. Tested in a Jupyter Notebook using the %%timeit command.

Other parts of the code take on the order of µs (except setting the datetime index 3 ms).

So there are potential savings, however, after searching for a while I still have no idea of how to write to a buffer line by line. I'm also very fine with saving this idea for future work though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting in the effort to profile the code! The potential speed up seems modest so let's save it for future work.

@kandersolar
Copy link
Member

kandersolar commented Feb 10, 2021

A clue for some of the test failures: pandas-dev/pandas#37909

I get failures locally with b5ed6ee and pandas==1.2.2, but adding compression='infer' to the pd.read_fwf call gets them passing.

Also this line should be assert actual.index.equals(expected_index):

assert actual.index.equals()

@wholmgren
Copy link
Member

@AdamRJensen please review changes on my bsrnfixes branch:

https://github.com/wholmgren/pvlib-python/blob/bsrnfixes/pvlib/iotools/bsrn.py
https://github.com/wholmgren/pvlib-python/blob/bsrnfixes/pvlib/tests/iotools/test_bsrn.py

The function didn't quite produce the expected results when running it against this NASA LRC file that only contains the 0100 measurements: https://cove.larc.nasa.gov/BSRN/LRC49/2021/lrc0121.dat

In the process of trying to fix it, I changed the line number convention to make it make more sense to me. Fine with me if you want to keep the original convention so long as it's fixed to work correctly for files with only 0100.

I also refactored the test to use both files. I didn't actually add the NASA LRC file to git due to the data license considerations above, so you'll need to download it and put it in the right place if you want that test to run correctly. Alternatively, we could make a second test file from the PAY file that only contains the 0100 record.

If you can, try adding my fork as another git remote, fetching my branch, and comparing against yours. Normally I would make a pull request to your fork, but for some reason github won't give me the option to select it as the base repository.

Finally, it appears the bsrn.py file uses windows line breaks. Can you change that to unix line breaks?

@AdamRJensen
Copy link
Member Author

@kanderso-nrel Thank you very much for solving the issue regarding the compression. Cleared a lot of issues! There are still 3 issues left... any ideas on these?

@wholmgren I prefer keeping it consistent with Python starting the line numbers at zero, hence I have added start=2 to the enumerate function. Also, I attempted to change the file to Unix line ending.. but I have no way of checking if it is correct now.

@kandersolar
Copy link
Member

Older pandas versions had inconsistent behavior for offset strings... there's a relevant pandas issue or PR about this but I couldn't quickly find it. Anyway I'd try T as the unit for the Timedelta because it means the same as minute in pandas lingo but might be supported by our minimum pandas version. If python36-min doesn't like that either then I'll try to find that pandas page I'm remembering and figure out what will work across all versions.

For the codecov failure, it's because not 100% of your lines of code get run by the test suite: https://github.com/pvlib/pvlib-python/pull/1145/files#annotation_969052144. If @wholmgren is okay with leaving the non-gzip branch untested then 96 can be left alone. I haven't read the code in-depth enough to know whether it would be easy to engineer a test that runs 112.

All logical records after LR0100 have been removed to reduce space (be below 25 MB), but also to test the functionality of files with few logical records.
@wholmgren
Copy link
Member

the bsrn.py file in this zip should have the right line breaks: bsrn.py.zip

Copy link
Member

@wholmgren wholmgren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can fix line breaks in a follow up PR.

@kanderso-nrel any other comments?

Copy link
Member

@kandersolar kandersolar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't run any code myself, but docs LGTM with a minor typo.

pvlib/iotools/bsrn.py Outdated Show resolved Hide resolved
Co-authored-by: Kevin Anderson <57452607+kanderso-nrel@users.noreply.github.com>
@AdamRJensen
Copy link
Member Author

I'm unsure if there's anything else I need to do, but if that's the case please let me know. Thank you all for your patience and help!

@wholmgren
Copy link
Member

Thanks @AdamRJensen for the new feature and for your patience during the review!

I expect it'll be a month or so before we release 0.9.0, but I'll make another pre-release by the end of next week.

@wholmgren wholmgren merged commit 59a8b6c into pvlib:master Feb 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add BSRN format reader to iotools
4 participants