Skip to content

Commit

Permalink
Fix model locale issue and improve model R/W performance. (#3405)
Browse files Browse the repository at this point in the history
* Fix LightGBM models locale sensitivity and improve R/W performance.

When Java is used, the default C++ locale is broken. This is true for
Java providers that use the C API or even Python models that require JEP.

This patch solves that issue making the model reads/writes insensitive
to such settings.
To achieve it, within the model read/write codebase:
 - C++ streams are imbued with the classic locale
 - Calls to functions that are dependent on the locale are replaced
 - The default locale is not changed!

This approach means:
 - The user's locale is never tampered with, avoiding issues such as
    #2979 with the previous
    approach #2891
 - Datasets can still be read according the user's locale
 - The model file has a single format independent of locale

Changes:
 - Add CommonC namespace which provides faster locale-independent versions of Common's methods
 - Model code makes conversions through CommonC
 - Cleanup unused Common methods
 - Performance improvements. Use fast libraries for locale-agnostic conversion:
   - value->string: https://github.com/fmtlib/fmt
   - string->double: https://github.com/lemire/fast_double_parser (10x
      faster double parsing according to their benchmark)

Bugfixes:
 - #2500
 - #2890
 - ninia/jep#205 (as it is related to LGBM as well)

* Align CommonC namespace

* Add new external_libs/ to python setup

* Try fast_double_parser fix #1

Testing commit e09e5aad828bcb16bea7ed0ed8322e019112fdbe

If it works it should fix more LGBM builds

* CMake: Attempt to link fmt without explicit PUBLIC tag

* Exclude external_libs from linting

* Add exernal_libs to MANIFEST.in

* Set dynamic linking option for fmt.

* linting issues

* Try to fix lint includes

* Try to pass fPIC with static fmt lib

* Try CMake P_I_C option with fmt library

* [R-package] Add CMake support for R and CRAN

* Cleanup CMakeLists

* Try fmt hack to remove stdout

* Switch to header-only mode

* Add PRIVATE argument to target_link_libraries

* use fmt in header-only mode

* Remove CMakeLists comment

* Change OpenMP to PUBLIC linking in Mac

* Update fmt submodule to 7.1.2

* Use fmt in header-only-mode

* Remove fmt from CMakeLists.txt

* Upgrade fast_double_parser to v0.2.0

* Revert "Add PRIVATE argument to target_link_libraries"

This reverts commit 3dd45dd.

* Address James Lamb's comments

* Update R-package/.Rbuildignore

Co-authored-by: James Lamb <jaylamb20@gmail.com>

* Upgrade to fast_double_parser v0.3.0 - Solaris support

* Use legacy code only in Solaris

* Fix lint issues

* Fix comment

* Address StrikerRUS's comments (solaris ifdef).

* Change header guards

Co-authored-by: James Lamb <jaylamb20@gmail.com>
  • Loading branch information
AlbertoEAF and jameslamb committed Dec 8, 2020
1 parent 218446a commit 792c930
Show file tree
Hide file tree
Showing 15 changed files with 534 additions and 197 deletions.
4 changes: 2 additions & 2 deletions .ci/test.sh
Expand Up @@ -52,8 +52,8 @@ if [[ $TRAVIS == "true" ]] && [[ $TASK == "lint" ]]; then
"r-lintr>=2.0"
pip install --user cpplint
echo "Linting Python code"
pycodestyle --ignore=E501,W503 --exclude=./compute,./.nuget . || exit -1
pydocstyle --convention=numpy --add-ignore=D105 --match-dir="^(?!^compute|test|example).*" --match="(?!^test_|setup).*\.py" . || exit -1
pycodestyle --ignore=E501,W503 --exclude=./compute,./.nuget,./external_libs . || exit -1
pydocstyle --convention=numpy --add-ignore=D105 --match-dir="^(?!^compute|external_libs|test|example).*" --match="(?!^test_|setup).*\.py" . || exit -1
echo "Linting R code"
Rscript ${BUILD_DIRECTORY}/.ci/lint_r_code.R ${BUILD_DIRECTORY} || exit -1
echo "Linting C++ code"
Expand Down
6 changes: 6 additions & 0 deletions .gitmodules
@@ -1,3 +1,9 @@
[submodule "include/boost/compute"]
path = compute
url = https://github.com/boostorg/compute
[submodule "external_libs/fmt"]
path = external_libs/fmt
url = https://github.com/fmtlib/fmt.git
[submodule "external_libs/fast_double_parser"]
path = external_libs/fast_double_parser
url = https://github.com/lemire/fast_double_parser.git
23 changes: 23 additions & 0 deletions R-package/.Rbuildignore
@@ -1,10 +1,14 @@
\.appveyor\.yml
AUTOCONF_UBUNTU_VERSION
^autom4te.cache/.*$
^.*\.bin
^build_r.R$
\.clang-format
^cran-comments\.md$
^docs$
^.*\.dll
\.drone\.yml
\.git
\.gitkeep$
^.*\.history
^Makefile$
Expand All @@ -24,3 +28,22 @@ AUTOCONF_UBUNTU_VERSION
^src/compute/.gitignore$
^src/compute/CONTRIBUTING.md$
^src/compute/README.md$
src/external_libs/fast_double_parser/benchmarks
src/external_libs/fast_double_parser/Makefile
src/external_libs/fast_double_parser/.*\.md
src/external_libs/fast_double_parser/tests
src/external_libs/fast_double_parser/.*\.yaml
src/external_libs/fast_double_parser/.*\.yml
src/external_libs/fmt/.*\.md
src/external_libs/fmt/doc
src/external_libs/fmt/support/Android\.mk
src/external_libs/fmt/support/.*\.gradle
src/external_libs/fmt/support/.*\.pro
src/external_libs/fmt/support/.*\.py
src/external_libs/fmt/support/rtd
src/external_libs/fmt/support/.*sublime-syntax
src/external_libs/fmt/support/Vagrantfile
src/external_libs/fmt/support/.*\.xml
src/external_libs/fmt/support/.*\.yml
src/external_libs/fmt/test
\.travis\.yml
3 changes: 3 additions & 0 deletions R-package/DESCRIPTION
Expand Up @@ -20,6 +20,9 @@ Authors@R: c(
person("Jay", "Loden", role = c("cph")),
person("Dave", "Daeschler", role = c("cph")),
person("Giampaolo", "Rodola", role = c("cph")),
person("Alberto", "Ferreira", role = c("ctb")),
person("Daniel", "Lemire", role = c("ctb")),
person("Victor", "Zverovich", role = c("cph")),
person("IBM Corporation", role = c("ctb"))
)
Description: Tree based algorithms can be improved by introducing boosting frameworks.
Expand Down
1 change: 1 addition & 0 deletions R-package/README.md
Expand Up @@ -241,6 +241,7 @@ For more information on this approach, see ["Writing R Extensions"](https://cran
From the root of the repository, run the following.

```shell
git submodule update --init --recursive
sh build-cran-package.sh
```

Expand Down
19 changes: 19 additions & 0 deletions build-cran-package.sh
Expand Up @@ -28,6 +28,15 @@ cp -R R-package/* ${TEMP_R_DIR}
cp -R include ${TEMP_R_DIR}/src/
cp -R src/* ${TEMP_R_DIR}/src/

cp \
external_libs/fast_double_parser/include/fast_double_parser.h \
${TEMP_R_DIR}/src/include/LightGBM

mkdir -p ${TEMP_R_DIR}/src/include/LightGBM/fmt
cp \
external_libs/fmt/include/fmt/*.h \
${TEMP_R_DIR}/src/include/LightGBM/fmt/

cd ${TEMP_R_DIR}

# Remove files not needed for CRAN
Expand Down Expand Up @@ -67,6 +76,16 @@ cd ${TEMP_R_DIR}
done
find . -name '*.h.bak' -o -name '*.hpp.bak' -o -name '*.cpp.bak' -exec rm {} \;

sed \
-i.bak \
-e 's/\.\..*fmt\/format\.h/LightGBM\/fmt\/format\.h/' \
src/include/LightGBM/utils/common.h

sed \
-i.bak \
-e 's/\.\..*fast_double_parser\.h/LightGBM\/fast_double_parser\.h/' \
src/include/LightGBM/utils/common.h

# When building an R package with 'configure', it seems
# you're guaranteed to get a shared library called
# <packagename>.so/dll. The package source code expects
Expand Down
11 changes: 11 additions & 0 deletions build_r.R
Expand Up @@ -173,6 +173,17 @@ result <- file.remove(
)
.handle_result(result)

#------------#
# submodules #
#------------#
result <- file.copy(
from = "external_libs/"
, to = sprintf("%s/", TEMP_SOURCE_DIR)
, recursive = TRUE
, overwrite = TRUE
)
.handle_result(result)

# copy files into the place CMake expects
for (src_file in c("lightgbm_R.cpp", "lightgbm_R.h", "R_object_helper.h")) {
result <- file.copy(
Expand Down
1 change: 1 addition & 0 deletions external_libs/fast_double_parser
Submodule fast_double_parser added at ace606
1 change: 1 addition & 0 deletions external_libs/fmt
Submodule fmt added at cc09f1

0 comments on commit 792c930

Please sign in to comment.