Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix parsing of non-finite values #3942

Merged
merged 20 commits into from Mar 16, 2021
Merged

Conversation

mjmckp
Copy link
Contributor

@mjmckp mjmckp commented Feb 12, 2021

Proposed solution for #3941. Current implementation silently returns zero when input string is "inf", "-inf", or "nan" when compiled with VS2017, so instead just explicitly check for these values and fail if there is no match. No attempt to optimise string allocations in this implementation since it is usually rarely invoked.

matthew-peacock and others added 16 commits November 2, 2018 13:07
…datasets.

Prior to this change, the line "score_t threshold = tmp_gradients[top_k - 1];" would generate an exception, since tmp_gradients would be empty when the cnt input value to the function is zero.
…data set given as of array of pointers to rows (as opposed to existing method LGBM_BoosterPredictForMat which requires data given as contiguous array)
…urns zero when input string is "inf", "-inf", or "nan" when compiled with VS2017, so instead just explicitly check for these values and fail if there is no match. No attempt to optimise string allocations in this implementation since it is usually rarely invoked.
@StrikerRUS
Copy link
Collaborator

@AlbertoEAF Could you please help to review?

@StrikerRUS StrikerRUS added the fix label Feb 12, 2021
Copy link
Contributor

@AlbertoEAF AlbertoEAF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch!

  1. C++98 does not define aspect of inf/nan values in text, but C99 does, and so, unlike stod, apparently stringstreams might not work in all platforms (specially MSVC) to parse inf, and nan values: https://www.boost.org/doc/libs/1_70_0/libs/math/doc/html/math_toolkit/fp_facets/facets_intro.html

  2. The old implementation used stod (which is not locale-safe for our purposes): https://en.cppreference.com/w/cpp/string/basic_string/stof. See the old version of of this very function here: 792c930#diff-ddbdb79f73cc159561b86b759b0542fec75ca9f4869a7eac5790e8f9ebaf7ef4R382

  3. We shouldn't stop parsing numbers when fast_double_parser::parse_number fails. There might be some numerical values that besides inf, or nan which are not parsed either. Hence we should keep the stringstream parsing for numerical values on the slow path instead of giving a fatal error.

Comment on lines -1087 to -1090
std::stringstream ss;
Common::C_stringstream(ss);
ss << str;
ss >> tmp;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep this code on the else branch instead of raising a fatal error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But that code silently fails, which is completely unacceptable.

In reality we are only parsing strings generated by LightGBM itself when the model was written out to file, so we should ensure there are robust round-trip tests which include models that contain inf and nan values. There are no other possible non-finite values defined for IEEE 754 floating point numbers, and if at some point in the future the standard changed, this would be quickly picked up by the round-trip tests and easily addressed.

I am shocked that such tests don't already exist.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant doing number parsing after the nan and inf checks. Your logic makes sense regarding IEEE-754, although I wouldn't recommend dropping that parsing at the end without adding said tests first, not 100.0% sure fast_double_parser parses all numbers.

tmp = std::numeric_limits<double>::infinity();
else if (strlower == std::string("-inf"))
tmp = -std::numeric_limits<double>::infinity();
else if (strlower == std::string("nan"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: Missing -nan handling.

Probably best to halve the string comparisons by parsing first the "-" sign.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok thanks, I'll add -nan. I don't think it's worth obfuscating this rarely executed branch with optimisations until profiling shows it is a bottleneck.

ss << str;
ss >> tmp;
std::string strlower(str);
std::transform(strlower.begin(), strlower.end(), strlower.begin(), [](int c) -> char { return static_cast<char>(::tolower(c)); });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great clean code @mjmckp ;)

Instead of allocating a string, and since you already have a lambda, what about defining a case-insensitive comparison lambda and use std::equal to check the "inf" and "nan" values below?

Although this is the rare branch there might be longer strings than inf or nan which might be parsed here and might slow down our parsing without need.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this hardly seems worth it, this branch is rarely invoked, meanwhile a colossal amount of strings are being allocated in splitting and parsing the input file, so these few extra allocations are a drop in the ocean.

I think our time is better spent adding robust round-trip tests to ensure major bugs like this don't occur again...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think our time is better spent adding robust round-trip tests to ensure major bugs like this don't occur again...

Agreed. Will you add such tests?
I actually run such tests but on an external lgbm provider and didn't have nan nor inf on my model, but would prefer to see them in lgbm's CI so any breakage is detected immediately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could assist in adding the tests, any idea where these should go and how best to implement them?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mjmckp #3555 has been just merged. We will really appreciate new GTest-compatible tests in this or in a follow-up PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, would you mind pointing me towards a similar kind of test that I can use as a starting point please?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, we don't have any tests yet. This is something that we should concentrate on in the near future. For now, I think you can take a look at tests from @AlbertoEAF in #3997.
https://github.com/microsoft/LightGBM/pull/3997/files#diff-c363eba6eda99d9e560f8341a1fc8fe02e885d2256db2482e1c543430a25666d

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mjmckp I've merged this PR with the aim to not delay the upcoming release. Please feel free to add tests in a new PR. We'll be very grateful! And thanks a lot for the bug fix!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@StrikerRUS Thanks a lot, I'll get up to speed on how the new tests work and add some tests for this in a new PR soon.

Remove trailing whitespace to pass linting tests

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
@AlbertoEAF AlbertoEAF mentioned this pull request Mar 6, 2021
@StrikerRUS
Copy link
Collaborator

@shiyu1994 @btrotta @guolinke Could you please review this to include in the upcoming release?

Copy link
Collaborator

@guolinke guolinke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@StrikerRUS StrikerRUS merged commit 4e47c93 into microsoft:master Mar 16, 2021
StrikerRUS added a commit that referenced this pull request Mar 25, 2021
* [docs]Add alt text on images

* Update docs/GPU-Windows.rst

Co-authored-by: James Lamb <jaylamb20@gmail.com>

* Update docs/GPU-Windows.rst

Co-authored-by: James Lamb <jaylamb20@gmail.com>

* Apply suggestions from code review

Co-authored-by: James Lamb <jaylamb20@gmail.com>

* Apply suggestions from code review

Co-authored-by: James Lamb <jaylamb20@gmail.com>

* Merge main branch commit updates (#1)

* [docs] Add alt text to image in Parameters-Tuning.rst (#4035)

* [docs] Add alt text to image in Parameters-Tuning.rst

Add alt text to Leaf-wise growth image, as part of #4028

* Update docs/Parameters-Tuning.rst

Co-authored-by: James Lamb <jaylamb20@gmail.com>

Co-authored-by: James Lamb <jaylamb20@gmail.com>

* [ci] [R-package] upgrade to R 4.0.4 in CI (#4042)

* [docs] update description of deterministic parameter (#4027)

* update description of deterministic parameter to require using with force_row_wise or force_col_wise

* Update include/LightGBM/config.h

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* update docs

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* [dask] Include support for init_score (#3950)

* include support for init_score

* use dataframe from init_score and test difference with and without init_score in local model

* revert refactoring

* initial docs. test between distributed models with and without init_score

* remove ranker from tests

* test value for root node and change docs

* comma

* re-include parametrize

* fix incorrect merge

* use single init_score and the booster_ attribute

* use np.float64 instead of float

* [ci] ignore untitle Jupyter notebooks in .gitignore (#4047)

* [ci] prevent getting incompatible dask and distributed versions (#4054)

* [ci] prevent getting incompatible dask and distributed versions

* Update .ci/test.sh

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* empty commit

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* [ci] fix R CMD CHECK note about example timings (fixes #4049) (#4055)

* [ci] fix R CMD CHECK note about example timings (fixes #4049)

* Apply suggestions from code review

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* empty commit

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* [ci] add CMake + R 3.6 test back (fixes #3469) (#4053)

* [ci] add CMake + R 3.6 test back (fixes #3469)

* Apply suggestions from code review

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Update .ci/test_r_package_windows.ps1

* -Wait and remove rtools40

* empty commit

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* [dask] include multiclass-classification task in tests (#4048)

* include multiclass-classification task and task_to_model_factory dicts

* define centers coordinates. flatten init_scores within each partition for multiclass-classification

* include issue comment and fix linting error

* Update index.rst (#4029)

Add alt text to logo image

Co-authored-by: James Lamb <jaylamb20@gmail.com>

* [dask] raise more informative error for duplicates in 'machines' (fixes #4057) (#4059)

* [dask] raise more informative error for duplicates in 'machines'

* uncomment

* avoid test failure

* Revert "avoid test failure"

This reverts commit 9442bdf.

* [dask] add tutorial documentation (fixes #3814, fixes #3838) (#4030)

* [dask] add tutorial documentation (fixes #3814, fixes #3838)

* add notes on saving the model

* quick start examples

* add examples

* fix timeouts in examples

* remove notebook

* fill out prediction section

* table of contents

* add line back

* linting

* isort

* Apply suggestions from code review

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Apply suggestions from code review

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* move examples under python-guide

* remove unused pickle import

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* set 'pending' commit status for R Solaris optional workflow (#4061)

* [docs] add Yu Shi to repo maintainers (#4060)

* Update FAQ.rst

* Update CODEOWNERS

* set is_linear_ to false when it is absent from the model file (fix #3778) (#4056)

* Add CMake option to enable sanitizers and build gtest (#3555)

* Add CMake option to enable sanitizer

* Set up gtest

* Address reviewer's feedback

* Address reviewer's feedback

* Update CMakeLists.txt

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* added type hint (#4070)

* [ci] run Dask examples on CI (#4064)

* Update Parallel-Learning-Guide.rst

* Update test.sh

* fix path

* address review comments

* [python-package] add type hints on Booster.set_network() (#4068)

* [python-package] add type hints on Booster.set_network()

* change behavior

* [python-package] Some mypy fixes (#3916)

* Some mypy fixes

* address James' comments

* Re-introduce pass in empty classes

* Update compat.py

Remove extra lines

* [dask] [ci] fix flaky network-setup test (#4071)

* [tests][dask] simplify code in Dask tests (#4075)

* simplify Dask tests code

* enable CI

* disable CI

* Revert "[ci] prevent getting incompatible dask and distributed versions (#4054)" (#4076)

This reverts commit 4e9c976.

* Fix parsing of non-finite values (#3942)

* Fix index out-of-range exception generated by BaggingHelper on small datasets.

Prior to this change, the line "score_t threshold = tmp_gradients[top_k - 1];" would generate an exception, since tmp_gradients would be empty when the cnt input value to the function is zero.

* Update goss.hpp

* Update goss.hpp

* Add API method LGBM_BoosterPredictForMats which runs prediction on a data set given as of array of pointers to rows (as opposed to existing method LGBM_BoosterPredictForMat which requires data given as contiguous array)

* Fix incorrect upstream merge

* Add link to LightGBM.NET

* Fix indenting to 2 spaces

* Dummy edit to trigger CI

* Dummy edit to trigger CI

* remove duplicate functions from merge

* Fix parsing of non-finite values.  Current implementation silently returns zero when input string is "inf", "-inf", or "nan" when compiled with VS2017, so instead just explicitly check for these values and fail if there is no match.  No attempt to optimise string allocations in this implementation since it is usually rarely invoked.

* Dummy commit to trigger CI

* Also handle -nan in double parsing method

* Update include/LightGBM/utils/common.h

Remove trailing whitespace to pass linting tests

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

Co-authored-by: matthew-peacock <matthew.peacock@whiteoakam.com>
Co-authored-by: Guolin Ke <guolin.ke@outlook.com>
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* [dask] remove unused imports from typing (#4079)

* Range check for DCG position discount lookup (#4069)

* Add check to prevent out of index lookup in the position discount table. Add debug logging to report number of queries found in the data.

* Change debug logging location so that we can print the data file name as well.

* Revert "Change debug logging location so that we can print the data file name as well."

This reverts commit 3981b34.

* Add data file name to debug logging.

* Move log line to a place where it is output even when query IDs are read from a separate file.

* Also add the out-of-range check to rank metrics.

* Perform check after number of queries is initialized.

* Update

* [ci] upgrade R CI scripts to work on Ubuntu 20.04 (#4084)

* [ci] install additional LaTeX packages in R CI jobs

* update autoconf version

* bump upper limit on package size to 100

* [SWIG] Add streaming data support + cpp tests (#3997)

* [feature] Add ChunkedArray to SWIG

* Add ChunkedArray
* Add ChunkedArray_API_extensions.i
* Add SWIG class wrappers

* Address some review comments

* Fix linting issues

* Move test to tests/test_ChunkedArray_manually.cpp

* Add test note

* Move ChunkedArray to include/LightGBM/utils/

* Declare more explicit types of ChunkedArray in the SWIG API.

* Port ChunkedArray tests to googletest

* Please C++ linter

* Address StrikerRUS' review comments

* Update SWIG doc & disable ChunkedArray<int64_t>

* Use CHECK_EQ instead of assert

* Change include order (linting)

* Rename ChunkedArray -> chunked_array files

* Change header guards

* Address last comments from StrikerRUS

* store all CMake files in one place (#4087)

* v3.2.0 release (#3872)

* Update VERSION.txt

* update appveyor.yml and configure

* fix Appveyor builds

Co-authored-by: James Lamb <jaylamb20@gmail.com>
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Co-authored-by: StrikerRUS <nekit94-12@hotmail.com>

* [ci] Bump version for development (#4094)

* Update .appveyor.yml

* Update cran-comments.md

* Update VERSION.txt

* update configure

Co-authored-by: James Lamb <jaylamb20@gmail.com>

* [ci] fix flaky Azure Pipelines jobs (#4095)

* Update test.sh

* Update setup.sh

* Update .vsts-ci.yml

* Update test.sh

* Update setup.sh

* Update .vsts-ci.yml

* Update setup.sh

* Update setup.sh

Co-authored-by: Subham Agrawal <34346812+subhamagrawal7@users.noreply.github.com>
Co-authored-by: James Lamb <jaylamb20@gmail.com>
Co-authored-by: shiyu1994 <shiyu_k1994@qq.com>
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Co-authored-by: jmoralez <jmoralz92@gmail.com>
Co-authored-by: marcelonieva7 <72712805+marcelonieva7@users.noreply.github.com>
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
Co-authored-by: Deddy Jobson <dedjob@hotmail.com>
Co-authored-by: Alberto Ferreira <AlbertoEAF@users.noreply.github.com>
Co-authored-by: mjmckp <mjmckp@users.noreply.github.com>
Co-authored-by: matthew-peacock <matthew.peacock@whiteoakam.com>
Co-authored-by: Guolin Ke <guolin.ke@outlook.com>
Co-authored-by: ashok-ponnuswami-msft <57648631+ashok-ponnuswami-msft@users.noreply.github.com>
Co-authored-by: StrikerRUS <nekit94-12@hotmail.com>

* Apply suggestions from code review

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

Co-authored-by: James Lamb <jaylamb20@gmail.com>
Co-authored-by: Subham Agrawal <34346812+subhamagrawal7@users.noreply.github.com>
Co-authored-by: shiyu1994 <shiyu_k1994@qq.com>
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Co-authored-by: jmoralez <jmoralz92@gmail.com>
Co-authored-by: marcelonieva7 <72712805+marcelonieva7@users.noreply.github.com>
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
Co-authored-by: Deddy Jobson <dedjob@hotmail.com>
Co-authored-by: Alberto Ferreira <AlbertoEAF@users.noreply.github.com>
Co-authored-by: mjmckp <mjmckp@users.noreply.github.com>
Co-authored-by: matthew-peacock <matthew.peacock@whiteoakam.com>
Co-authored-by: Guolin Ke <guolin.ke@outlook.com>
Co-authored-by: ashok-ponnuswami-msft <57648631+ashok-ponnuswami-msft@users.noreply.github.com>
Co-authored-by: StrikerRUS <nekit94-12@hotmail.com>
@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants