-
Notifications
You must be signed in to change notification settings - Fork 998
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add read_bsrn function #1145
Add read_bsrn function #1145
Conversation
Related to issue #1015.
Simplified how the start and end line of the data is determined. Improved documentation, e.g. moved constants outside of function.
Add bsrn file and read_bsrn function
@wholmgren I've changed it such that it now commits to pvlib:master. Hope it's correct. |
Hi @AdamRJensen, could you update this PR to resolve the code style issues raised by Edit: also please add a new entry here for your function: https://github.com/pvlib/pvlib-python/blob/master/pvlib/iotools/__init__.py |
Thanks @AdamRJensen! We will need some tests. Do you want me to push a test to your branch or do you want to take it on yourself? It seemed like you might have been disinclined to add a test in your first PR, but I am happy to help since this will be a big help to me. |
Hi @kanderso-nrel, thanks for the heads up! I wasn't sure if PVlib is very strict on style issues, e.g. line length<79 characters? I have corrected all instances of this, but in some places, it makes the code a little harder to follow / less compact. |
@wholmgren I am definitely up for attempting to make tests. Can you point me in a direction to get started? |
Check out Add a test file to Start by making your test as simple as possible. A single data file, no remote io capabilities. Probably ok to only assert the index values and the columns for now. |
If you think it makes the code less readable then it's acceptable to add |
If the test file is a proper BSRN data file, we should make sure we're staying within their conditions of data release. |
@kanderso-nrel and @wholmgren any opinion on if I should create a pseudo-file or ask Amelie from BSRN if we can upload a single file for test purposes? |
@AdamRJensen I think we need to get explicit OK from the WRMC, because pvlib-python isn't a traditional "publication", but (in my view) qualifies as a "project for wide release." What matters, though, is WRMC's view of pvlib-python. The way I read the Data Release Guidelines, assuming WRMC concurrence pvlib-python can include a file of BSRN data provided:
|
My reading is that BSRN data downloaded directly from NASA websites is subject to NASA's Data and Information Policy. So I don't see any issue with including a test file from either of the GIM or LaRC sites. I'm not opposed to asking - and that may well be more polite in any case - just pointing out that there may be other options. |
Air temperature was listed as air_temperature in the docstring instead of temp_air.
pvlib/iotools/bsrn.py
Outdated
if line.startswith('*'): # Find start of all logical records | ||
line_no_dict[line[2:6]] = num # key is 4 digit LR number | ||
|
||
# Determine start and end line of logical record LR0100 to be parsed | ||
start_row = line_no_dict['0100'] + 1 # Start line number | ||
# If LR0100 is the last logical record, then read rest of file | ||
if start_row-1 == max(line_no_dict.values()): | ||
end_row = num # then parse rest of the file | ||
else: # otherwise parse until the beginning of the next logical record | ||
end_row = min([i for i in line_no_dict.values() if i > start_row]) | ||
nrows = end_row-start_row | ||
|
||
# Read file as a fixed width file (fwf) | ||
data = pd.read_fwf(filename, skiprows=start_row, nrows=nrows, header=None, | ||
colspecs=COL_SPECS, na_values=[-999.0, -99.9]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be possible to avoid reading the file twice by accumulating the first read of just the '0100' section into an IO buffer, stopping the read at the end of the section, and then passing that buffer into pd.read_fwf
.
I think this might result in meaningful performance improvements when reading many files. Or maybe we need to profile the code before we put the effort into it.
In any case, I'm ok saving this idea for future work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wholmgren The initial opening of the file and looping through all of the lines takes ~414 ms, whereas the pd.read_fwf takes 1.15 s. Tested in a Jupyter Notebook using the %%timeit command.
Other parts of the code take on the order of µs (except setting the datetime index 3 ms).
So there are potential savings, however, after searching for a while I still have no idea of how to write to a buffer line by line. I'm also very fine with saving this idea for future work though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for putting in the effort to profile the code! The potential speed up seems modest so let's save it for future work.
Co-authored-by: Will Holmgren <william.holmgren@gmail.com>
Co-authored-by: Will Holmgren <william.holmgren@gmail.com>
A clue for some of the test failures: pandas-dev/pandas#37909 I get failures locally with b5ed6ee and pandas==1.2.2, but adding Also this line should be
|
@AdamRJensen please review changes on my https://github.com/wholmgren/pvlib-python/blob/bsrnfixes/pvlib/iotools/bsrn.py The function didn't quite produce the expected results when running it against this NASA LRC file that only contains the 0100 measurements: https://cove.larc.nasa.gov/BSRN/LRC49/2021/lrc0121.dat In the process of trying to fix it, I changed the line number convention to make it make more sense to me. Fine with me if you want to keep the original convention so long as it's fixed to work correctly for files with only 0100. I also refactored the test to use both files. I didn't actually add the NASA LRC file to git due to the data license considerations above, so you'll need to download it and put it in the right place if you want that test to run correctly. Alternatively, we could make a second test file from the PAY file that only contains the 0100 record. If you can, try adding my fork as another git remote, fetching my branch, and comparing against yours. Normally I would make a pull request to your fork, but for some reason github won't give me the option to select it as the base repository. Finally, it appears the |
@kanderso-nrel Thank you very much for solving the issue regarding the compression. Cleared a lot of issues! There are still 3 issues left... any ideas on these? @wholmgren I prefer keeping it consistent with Python starting the line numbers at zero, hence I have added start=2 to the enumerate function. Also, I attempted to change the file to Unix line ending.. but I have no way of checking if it is correct now. |
Older pandas versions had inconsistent behavior for offset strings... there's a relevant pandas issue or PR about this but I couldn't quickly find it. Anyway I'd try For the codecov failure, it's because not 100% of your lines of code get run by the test suite: https://github.com/pvlib/pvlib-python/pull/1145/files#annotation_969052144. If @wholmgren is okay with leaving the non-gzip branch untested then 96 can be left alone. I haven't read the code in-depth enough to know whether it would be easy to engineer a test that runs 112. |
All logical records after LR0100 have been removed to reduce space (be below 25 MB), but also to test the functionality of files with few logical records.
the bsrn.py file in this zip should have the right line breaks: bsrn.py.zip |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can fix line breaks in a follow up PR.
@kanderso-nrel any other comments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't run any code myself, but docs LGTM with a minor typo.
Co-authored-by: Kevin Anderson <57452607+kanderso-nrel@users.noreply.github.com>
I'm unsure if there's anything else I need to do, but if that's the case please let me know. Thank you all for your patience and help! |
Thanks @AdamRJensen for the new feature and for your patience during the review! I expect it'll be a month or so before we release 0.9.0, but I'll make another pre-release by the end of next week. |
docs/sphinx/source/api.rst
for API changes.docs/sphinx/source/whatsnew
for all changes. Includes link to the GitHub Issue with:issue:`num`
or this Pull Request with:pull:`num`
. Includes contributor name and/or GitHub username (link with:ghuser:`user`
).