
[R-package] make package installable with CRAN toolchain (fixes #2960) #3188

Merged: 50 commits, Jul 29, 2020

jameslamb
Collaborator

This pull request contains a proposal for the next step to get the LightGBM R package onto CRAN: building without CMake.

See conversation in #629 for some background.

Essentially, CRAN is very particular about how source packages with C++ code are built. It enforces a lot of checks to ensure portability, and will reject packages that require any of the following:

  • non-portable flags
  • non-standard / non-open-source build tools

The R package does not currently comply with CRAN's preferred build toolchain. This PR fixes that 😀

Overview

As of this PR, LightGBM's R package gains a CRAN-compliant installation toolchain using autoconf. From "Writing R Extensions":

If your package needs some system-dependent configuration before installation you can include an executable (Bourne) shell script configure in your package which (if present) is executed by R CMD INSTALL before any other action is performed. This can be a script created by the Autoconf mechanism, but may also be a script written by yourself...the full power of Autoconf is available for your extension package (including variable substitution, searching for libraries, etc.).

The details of how this is used are explained in the proposed changes to R-package/README.md added to this PR.
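
For readers unfamiliar with the mechanism, here is a minimal sketch of the variable-substitution step that a hand-written configure script could perform. It is not the configure script from this PR: the placeholder name EXAMPLE_CPPFLAGS, the flag value, and the helper substitute_placeholder are all hypothetical and only illustrate the idea of turning a template such as Makevars.in into Makevars.

```shell
#!/bin/sh
# Illustrative only: a tiny stand-in for what Autoconf's @NAME@ substitution does.
# The placeholder name and the file layout below are assumptions, not taken from this PR.

# Replace every @NAME@ in a template file and write the result to an output file.
# Note: the value must not contain "|" or "&", which are special to this sed call.
substitute_placeholder() {
    name="$1"
    value="$2"
    template="$3"
    output="$4"
    sed "s|@${name}@|${value}|g" "${template}" > "${output}"
}

# Hypothetical usage from a package root:
#   substitute_placeholder EXAMPLE_CPPFLAGS "-DEXAMPLE_FLAG=1" src/Makevars.in src/Makevars
```

A real Autoconf-generated script does much more (compiler probing, library searches), but the end result for R CMD INSTALL is the same kind of generated Makevars file.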

Notes for Reviewers

Thanks in advance for your time and thorough reviews!

@jameslamb
Collaborator Author

Some CI jobs are still failing but I think this is close enough that it's ready for the review process to start.

Ok, I think this was just because of the issues fixed in #3193. I think this is ready for review!

Collaborator

@StrikerRUS StrikerRUS left a comment


@jameslamb Wow, impressive! We are so close to CRAN!

Just had a chance to give a first look at this PR. Please see some initial comments below:

.ci/test_r_package.sh
.ci/test_r_package_windows.ps1
.ci/test_r_package_windows.ps1
.github/workflows/main.yml
.github/workflows/main.yml
R-package/src/Makevars.in
R-package/src/Makevars.win.in
build-cran-package.sh
build-cran-package.sh
build-cran-package.sh
@StrikerRUS
Collaborator

From macOS logs:

checking whether MM_PREFETCH works... no
checking whether MM_MALLOC works... no
checking whether OpenMP will work in a package... no

I think we should check for "yes" in these results, like we do when verifying the compiler and toolchain on Windows:

# Checking that we actually got the expected compiler. The R package has some logic
# to fall back to MinGW if MSVC fails, but for CI builds we need to check that the correct
# compiler was used.
$checks = Select-String -Path "${INSTALL_LOG_FILE_NAME}" -Pattern "Check for working CXX compiler.*$env:COMPILER"
if ($checks.Matches.length -eq 0) {
    Write-Output "The wrong compiler was used. Check the build logs."
    Check-Output $False
}
# Checking that we got the right toolchain for MinGW. If using MinGW, both
# MinGW and MSYS toolchains are supported
if ($env:COMPILER -eq "MINGW") {
    $checks = Select-String -Path "${INSTALL_LOG_FILE_NAME}" -Pattern "Trying to build with.*$env:TOOLCHAIN"
    if ($checks.Matches.length -eq 0) {
        Write-Output "The wrong toolchain was used. Check the build logs."
        Check-Output $False
    }
}
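
A rough shell counterpart for the macOS and Linux jobs might look like the sketch below. It assumes the install output has been saved to a log file; the helper name check_configure_results and the idea of passing the log path as an argument are hypothetical, and the three check names are copied from the macOS log quoted above.

```shell
#!/bin/sh
# Hypothetical helper: fail if any expected configure check did not report "yes".
# The list of checks mirrors the macOS log lines quoted above; adjust as needed.
check_configure_results() {
    log_file="$1"
    result=0
    for check in \
        "checking whether MM_PREFETCH works" \
        "checking whether MM_MALLOC works" \
        "checking whether OpenMP will work in a package"
    do
        # Allow optional whitespace between "..." and the result.
        if ! grep -E -q "${check}\.\.\. *yes" "${log_file}"; then
            echo "Expected 'yes' for: ${check}"
            result=1
        fi
    done
    return "${result}"
}

# Hypothetical usage in a CI script:
#   check_configure_results install.log || exit 1
```

Reporting every failed check before returning, rather than stopping at the first one, makes a failing CI log easier to read.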

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
@StrikerRUS
Collaborator

StrikerRUS commented Jul 3, 2020

I think it makes sense to port compiler version checks from our CMake configuration

LightGBM/CMakeLists.txt

Lines 24 to 42 in 4f8c32d

if(CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
  if(CMAKE_CXX_COMPILER_VERSION VERSION_LESS "4.8.2")
    message(FATAL_ERROR "Insufficient gcc version")
  endif()
elseif(CMAKE_CXX_COMPILER_ID STREQUAL "Clang")
  if(CMAKE_CXX_COMPILER_VERSION VERSION_LESS "3.8")
    message(FATAL_ERROR "Insufficient Clang version")
  endif()
elseif(CMAKE_CXX_COMPILER_ID STREQUAL "AppleClang")
  if(CMAKE_CXX_COMPILER_VERSION VERSION_LESS "8.1.0")
    message(FATAL_ERROR "Insufficient AppleClang version")
  endif()
  cmake_minimum_required(VERSION 3.16)
elseif(MSVC)
  if(MSVC_VERSION LESS 1900)
    message(FATAL_ERROR "The compiler ${CMAKE_CXX_COMPILER} doesn't support required C++11 features. Please use a newer MSVC.")
  endif()
  cmake_minimum_required(VERSION 3.8)
endif()

Some parts can be borrowed from
https://github.com/RcppCore/RcppArmadillo/blob/db3ae40795b80e0df1320f8e28447adebfa17ff2/configure.ac#L79-L153

UPD: something better: https://github.com/duckmayr/gpirt/blob/6948bb0d482a23e32494b913ab911fa6d91c80da/configure.ac#L35-L42
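
As a rough illustration of what such a port could involve, here is a sketch of a version comparison in POSIX shell that a configure script might call. The helper names version_ge and require_min_version are hypothetical, and how the compiler name and version are detected is deliberately left out; the minimum versions would come from the CMake snippet above.

```shell
#!/bin/sh
# Hypothetical helpers, not taken from any of the linked configure.ac files.

# Return 0 when version $1 >= version $2. Versions are dot-separated numbers,
# and missing trailing components are treated as 0 (so "3.8" equals "3.8.0").
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, ".")
        nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# Print an error and return 1 when the detected version is below the minimum.
require_min_version() {
    compiler_name="$1"
    found="$2"
    minimum="$3"
    if ! version_ge "${found}" "${minimum}"; then
        echo "Insufficient ${compiler_name} version: found ${found}, need >= ${minimum}"
        return 1
    fi
}

# Hypothetical usage, with the gcc minimum from CMakeLists.txt:
#   require_min_version "gcc" "${DETECTED_VERSION}" "4.8.2" || exit 1
```

Comparing component by component as numbers avoids the string-comparison trap where "4.10" would sort below "4.8".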

@guolinke
Collaborator

guolinke commented Aug 7, 2020

yeah, i think it is okay.

@jameslamb
Collaborator Author

jameslamb commented Aug 8, 2020

Ok, I just uploaded a binary for the R package, built against R 4.0 on Windows! It can be installed like this

url <- "https://github.com/microsoft/LightGBM/releases/download/v3.0.0rc1/lightgbm-3.0.0-1-r40.zip"
download.file(
    url = url
    , destfile = "lightgbm.zip"
)
install.packages(
    pkgs = "lightgbm.zip"
    , type = "binary"
    , repos = NULL
)

I'll upload Mac and Linux shortly. I also will open up a PR with docs on how to make these and how to download them.

Why the code above doesn't use remotes

I could not get remotes or install.packages() to work directly against this URL. I built the package exactly the way that is recommended in "Writing R Extensions".

sh build-cran-package.sh
R CMD INSTALL --build lightgbm_3.0.0-1.tar.gz

When I try this:

url <- "https://github.com/microsoft/LightGBM/releases/download/v3.0.0rc1/lightgbm-3.0.0-1-r40-windows.zip"
remotes::install_url(
    url = url
    , type = "binary"
    , build = FALSE
)

I get this error

Downloading package from url: https://github.com/microsoft/LightGBM/releases/download/v3.0.0rc1/lightgbm-3.0.0-1-r40-windows.zip
Installing package into ‘C:/Users/James/Documents/R/win-library/4.0’
(as ‘lib’ is unspecified)
* installing *binary* package 'lightgbm' ...
cp: unknown option -- )
Try '/usr/bin/cp --help' for more information.
ERROR: installing binary package failed
* removing 'C:/Users/James/Documents/R/win-library/4.0/lightgbm'
Error: Failed to install 'unknown package' from URL:
  (converted from warning) installation of package ‘C:/Users/James/AppData/Local/Temp/RtmpYrgJKr/remotes433027d16e0d/lightgbm’ had non-zero exit status

When I try this

url <- "https://github.com/microsoft/LightGBM/releases/download/v3.0.0rc1/lightgbm-3.0.0-1-r40-windows.zip"
install.packages(
    pkgs = url
    , type = "binary"
    , repos = NULL
)

I get this error

Installing package into ‘C:/Users/James/Documents/R/win-library/4.0’
(as ‘lib’ is unspecified)
trying URL 'https://github.com/microsoft/LightGBM/releases/download/v3.0.0rc1/lightgbm-3.0.0-1-r40.zip'
Content type 'application/octet-stream' length 1675774 bytes (1.6 MB)
downloaded 1.6 MB

Warning in install.packages :
  cannot open compressed file 'lightgbm-3.0.0-1-r40/DESCRIPTION', probable reason 'No such file or directory'
Error in install.packages : cannot open the connection

@jameslamb
Collaborator Author

@jaredlander binaries for the R package are now available, thanks for the nudge! This is great timing, because we just put up a major release candidate this week.

You can change the URL in the code below to install for whichever operating system you're on.

LGB_RELEASE <- "https://github.com/microsoft/LightGBM/releases/download/v3.0.0rc1/"
pkg_urls <- c(
    "linux" = file.path(LGB_RELEASE, "lightgbm_3.0.0-1-r40-linux.tgz")
    "mac" = file.path(LGB_RELEASE, "lightgbm_3.0.0-1-r40-macos.tgz")
    "windows" = file.path(LGB_RELEASE, "lightgbm-3.0.0-1-r40-windows.zip")
)

download.file(
    url = pkg_urls["mac"]
    , destfile = "lightgbm.zip"
)
install.packages(
    pkgs = "lightgbm.zip"
    , type = "binary"
    , repos = NULL
)
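
If you would rather pick the artifact from a terminal, a sketch along these lines could map the output of uname -s to one of the file names listed above. The helper name artifact_for_os is hypothetical, the file names are copied from this comment as written, and the Windows patterns assume a Git Bash, MSYS, or Cygwin shell.

```shell
#!/bin/sh
# Hypothetical helper: map an OS name (as printed by `uname -s`) to an artifact file name.
# File names are copied from the comment above and may change between releases.
artifact_for_os() {
    case "$1" in
        Linux)
            echo "lightgbm_3.0.0-1-r40-linux.tgz" ;;
        Darwin)
            echo "lightgbm_3.0.0-1-r40-macos.tgz" ;;
        MINGW*|MSYS*|CYGWIN*)
            echo "lightgbm-3.0.0-1-r40-windows.zip" ;;
        *)
            echo "unsupported OS: $1" >&2
            return 1 ;;
    esac
}

LGB_RELEASE="https://github.com/microsoft/LightGBM/releases/download/v3.0.0rc1"

# Hypothetical usage:
#   artifact="$(artifact_for_os "$(uname -s)")" && echo "${LGB_RELEASE}/${artifact}"
```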

@jaredlander

Thanks for pulling this off. Some issues with the instructions though.

First, there is a simple typo.

pkg_urls <- c(
    "linux" = file.path(LGB_RELEASE, "lightgbm_3.0.0-1-r40-linux.tgz")
    "mac" = file.path(LGB_RELEASE, "lightgbm_3.0.0-1-r40-macos.tgz")
    "windows" = file.path(LGB_RELEASE, "lightgbm-3.0.0-1-r40-windows.zip")
)

This needs commas (,), but that's no big deal.

pkg_urls <- c(
    "linux" = file.path(LGB_RELEASE, "lightgbm_3.0.0-1-r40-linux.tgz"),
    "mac" = file.path(LGB_RELEASE, "lightgbm_3.0.0-1-r40-macos.tgz"),
    "windows" = file.path(LGB_RELEASE, "lightgbm-3.0.0-1-r40-windows.zip")
)

While I installed successfully on Windows, I am having no such luck on Linux.

When running

install.packages(
    pkgs = "lightgbm.zip"
    , type = "binary"
    , repos = NULL
)

I get this reasonable error message

Error in (function (pkgs, lib, repos = getOption("repos"), contriburl = contrib.url(repos,  : 
  type 'binary' is not supported on this platform

So I change "binary" to "source" and it installs, but when I load the package with library(lightgbm) I get this error.

Error: package or namespace load failed for ‘lightgbm’ in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '/home/jared/consulting/talks/renv/library/R-4.0/x86_64-pc-linux-gnu/lightgbm/libs/lightgbm.so':
  /lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.29' not found (required by /home/jared/consulting/talks/renv/library/R-4.0/x86_64-pc-linux-gnu/lightgbm/libs/lightgbm.so)

I am running Ubuntu 18.04, and from what I've read, 2.27 is the highest you can get on 18.04, though I could be mistaken about this. Interestingly, when searching about this, I came across an Ubuntu help page relating specifically to R. Apparently there are ways to bump the version of glibc on Ubuntu 18.04, but with it being such a core library I am hesitant to make any changes to my machine.
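
For anyone who wants to check this kind of mismatch before installing, the sketch below extracts the highest GLIBC symbol version mentioned in objdump -T output, which can then be compared against what the system provides. The helper name max_required_glibc and the paths in the usage comments are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `objdump -T` output on stdin and print the highest
# GLIBC_x.y[.z] symbol version it mentions (for example "2.29").
max_required_glibc() {
    grep -o 'GLIBC_[0-9][0-9.]*' \
        | sed 's/^GLIBC_//' \
        | sort -t . -k 1,1n -k 2,2n -k 3,3n \
        | tail -n 1
}

# Hypothetical usage (the path is a placeholder):
#   objdump -T /path/to/lightgbm/libs/lightgbm.so | max_required_glibc
# On glibc-based systems, the version provided by the system can be printed with:
#   getconf GNU_LIBC_VERSION
```

If the first number is higher than the second, the prebuilt shared object was linked on a newer system than the one it is being loaded on, which matches the error above.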

@jaredlander

More bad news. I tested it on Windows (due to the Linux issues) in RStudio.

library(lightgbm)

data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)

mod_light <- lightgbm(data=credit_light, nrounds=100, obj='binary')

The code above came from the documentation. It hangs my session.


However, it runs when using R from the terminal, in this case Git Bash. Both are running R 4.0.2.

Here's my sessionInfo():

 sessioninfo::session_info()
- Session info ------------------------------------------------------------------------------------
 setting  value                       
 version  R version 4.0.2 (2020-06-22)
 os       Windows 10 x64              
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  English_United States.1252  
 ctype    English_United States.1252  
 tz       America/New_York            
 date     2020-08-09                  

- Packages ----------------------------------------------------------------------------------------
 package      * version    date       lib source        
 assertthat     0.2.1      2019-03-21 [1] CRAN (R 4.0.2)
 backports      1.1.8      2020-06-17 [1] CRAN (R 4.0.2)
 BBmisc         1.11       2017-03-10 [1] CRAN (R 4.0.2)
 checkmate      2.0.0      2020-02-06 [1] CRAN (R 4.0.2)
 class          7.3-17     2020-04-26 [1] CRAN (R 4.0.2)
 cli            2.0.2      2020-02-28 [1] CRAN (R 4.0.2)
 clipr          0.7.0      2019-07-23 [1] CRAN (R 4.0.2)
 codetools      0.2-16     2018-12-24 [1] CRAN (R 4.0.2)
 colorspace     1.4-1      2019-03-18 [1] CRAN (R 4.0.2)
 crayon         1.3.4      2017-09-16 [1] CRAN (R 4.0.2)
 data.table     1.12.8     2019-12-09 [1] CRAN (R 4.0.0)
 DBI            1.1.0      2019-12-15 [1] CRAN (R 4.0.2)
 desc           1.2.0      2018-05-01 [1] CRAN (R 4.0.2)
 details        0.2.1      2020-01-12 [1] CRAN (R 4.0.2)
 digest         0.6.25     2020-02-23 [1] CRAN (R 4.0.2)
 doParallel     1.0.15     2019-08-02 [1] CRAN (R 4.0.2)
 dplyr          1.0.0      2020-05-29 [1] CRAN (R 4.0.2)
 DT             0.14       2020-06-24 [1] CRAN (R 4.0.2)
 dygraphs       1.1.1.6    2018-07-11 [1] CRAN (R 4.0.2)
 ellipsis       0.3.1      2020-05-15 [1] CRAN (R 4.0.2)
 evaluate       0.14       2019-05-28 [1] CRAN (R 4.0.2)
 fansi          0.4.1      2020-01-08 [1] CRAN (R 4.0.2)
 fastmatch      1.1-0      2017-01-28 [1] CRAN (R 4.0.0)
 FNN            1.1.3      2019-02-15 [1] CRAN (R 4.0.2)
 foreach        1.5.0      2020-03-30 [1] CRAN (R 4.0.2)
 generics       0.0.2      2018-11-29 [1] CRAN (R 4.0.2)
 ggplot2        3.3.2      2020-06-19 [1] CRAN (R 4.0.2)
 ggthemes       4.2.0      2019-05-13 [1] CRAN (R 4.0.2)
 glue           1.4.1      2020-05-13 [1] CRAN (R 4.0.2)
 gower          0.2.2      2020-06-23 [1] CRAN (R 4.0.2)
 gtable         0.3.0      2019-03-25 [1] CRAN (R 4.0.2)
 here           0.1        2017-05-28 [1] CRAN (R 4.0.2)
 hms            0.5.3      2020-01-08 [1] CRAN (R 4.0.2)
 htmltools      0.5.0      2020-06-16 [1] CRAN (R 4.0.2)
 htmlwidgets    1.5.1      2019-10-08 [1] CRAN (R 4.0.2)
 httr           1.4.1      2019-08-05 [1] CRAN (R 4.0.0)
 ipred          0.9-9      2019-04-28 [1] CRAN (R 4.0.2)
 iterators      1.0.12     2019-07-26 [1] CRAN (R 4.0.2)
 knitr          1.29       2020-06-23 [1] CRAN (R 4.0.2)
 lattice        0.20-41    2020-04-02 [1] CRAN (R 4.0.2)
 lava           1.6.7      2020-03-05 [1] CRAN (R 4.0.2)
 lifecycle      0.2.0      2020-03-06 [1] CRAN (R 4.0.2)
 lubridate      1.7.9      2020-06-08 [1] CRAN (R 4.0.2)
 magrittr       1.5        2014-11-22 [1] CRAN (R 4.0.2)
 MASS           7.3-51.6   2020-04-26 [1] CRAN (R 4.0.2)
 Matrix         1.2-18     2019-11-27 [1] CRAN (R 4.0.2)
 mlr            2.17.1     2020-03-24 [1] CRAN (R 4.0.2)
 munsell        0.5.0      2018-06-12 [1] CRAN (R 4.0.2)
 nnet           7.3-14     2020-04-26 [1] CRAN (R 4.0.2)
 parallelMap    1.5.0      2020-03-26 [1] CRAN (R 4.0.2)
 ParamHelpers   1.14       2020-03-24 [1] CRAN (R 4.0.2)
 pillar         1.4.6      2020-07-10 [1] CRAN (R 4.0.2)
 pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.0.2)
 plyr           1.8.6      2020-03-03 [1] CRAN (R 4.0.2)
 png            0.1-7      2013-12-03 [1] CRAN (R 4.0.0)
 pROC           1.16.2     2020-03-19 [1] CRAN (R 4.0.2)
 prodlim        2019.11.13 2019-11-17 [1] CRAN (R 4.0.2)
 purrr          0.3.4      2020-04-17 [1] CRAN (R 4.0.2)
 R6             2.4.1      2019-11-12 [1] CRAN (R 4.0.2)
 RANN           2.6.1      2019-01-08 [1] CRAN (R 4.0.2)
 Rcpp           1.0.5      2020-07-06 [1] CRAN (R 4.0.2)
 readr          1.3.1      2018-12-21 [1] CRAN (R 4.0.2)
 recipes        0.1.13     2020-06-23 [1] CRAN (R 4.0.2)
 rlang          0.4.7      2020-07-09 [1] CRAN (R 4.0.2)
 rmarkdown      2.3        2020-06-18 [1] CRAN (R 4.0.2)
 ROSE           0.0-3      2014-07-15 [1] CRAN (R 4.0.2)
 rpart          4.1-15     2019-04-12 [1] CRAN (R 4.0.2)
 rprojroot      1.3-2      2018-01-03 [1] CRAN (R 4.0.2)
 rstudioapi     0.11       2020-02-07 [1] CRAN (R 4.0.2)
 scales         1.1.1      2020-05-11 [1] CRAN (R 4.0.2)
 sessioninfo    1.1.1      2018-11-05 [1] CRAN (R 4.0.2)
 stringi        1.4.6      2020-02-17 [1] CRAN (R 4.0.0)
 stringr        1.4.0      2019-02-10 [1] CRAN (R 4.0.2)
 survival       3.1-12     2020-04-10 [1] CRAN (R 4.0.2)
 themis         0.1.1      2020-05-17 [1] CRAN (R 4.0.2)
 tibble         3.0.3      2020-07-10 [1] CRAN (R 4.0.2)
 tidyselect     1.1.0      2020-05-11 [1] CRAN (R 4.0.2)
 timeDate       3043.102   2018-02-21 [1] CRAN (R 4.0.0)
 unbalanced     2.0        2015-06-26 [1] CRAN (R 4.0.2)
 vctrs          0.3.2      2020-07-15 [1] CRAN (R 4.0.2)
 withr          2.2.0      2020-04-20 [1] CRAN (R 4.0.2)
 xfun           0.15       2020-06-21 [1] CRAN (R 4.0.2)
 xml2           1.3.2      2020-04-23 [1] CRAN (R 4.0.2)
 yaml           2.2.1      2020-02-01 [1] CRAN (R 4.0.0)
 yardstick      0.0.7      2020-07-13 [1] CRAN (R 4.0.2)
 zoo            1.8-8      2020-05-02 [1] CRAN (R 4.0.2)

[1] C:/Users/jared/Documents/R/R-4.0.2/library

@jameslamb
Collaborator Author

Hi @jaredlander sorry, this is the first time I've ever built binaries myself (since you never do this when submitting to CRAN), so I don't know the gotchas. I just followed the instructions in "Writing R Extensions" as closely as I could.

I can't comment on how our library might interact with renv...we don't use that tool or test against it.

I also can't comment on the library breaking in RStudio but working from a Git Bash for Windows shell, other than to say that your effective PATH is almost certainly different in RStudio than it is in that shell, and maybe conflicting versions of some library are being linked in.

I did just update the names of the artifacts...two that had _ in their names have been changed to use - instead. That's a result of doing this for the first time, and manually.

Is there a reason you're opposed to installing from source? I understand that until recently LightGBM's R package had a reputation for being difficult to configure and install, but we've done a lot of work to make source installation smoother.

I just put up a source distribution on our release...exactly the package we would submit to CRAN. The only issue I know it has (that makes it not-quite-CRAN-able yet) is for 32-bit Windows (#3187), but I'm guessing that won't be a problem for you or most others.

You can install it like this:

lightgbm_source <- "https://github.com/microsoft/LightGBM/releases/download/v3.0.0rc1/lightgbm-3.0.0-1-cran.tar.gz"
remotes::install_url(lightgbm_source)

Unlike the binaries (which I just made for the first time and which we do not test), this source package is rigorously tested.

@jaredlander

jaredlander commented Aug 10, 2020

For the binaries, I'm guessing you built on the wrong machine? Did you use devtools::build(binary=TRUE)? And did you have GitHub actions build it in the target OS?

Now that cmake isn't needed I'll try to install from source on Ubuntu, though I'm still worried about glibc.

My main issue with installing from source is that if I find it complicated then it'll most likely be even harder for other users, most of whom don't necessarily have a lot of terminal experience. And I like to show people tools that they can turn the key and run, so they can focus on data, not installation issues.

Could you tell me more about what LightGBM does for paths? What is it looking for?

Also, now that cmake is removed, perhaps it can be installed by remotes::install_github("...")?

@jameslamb
Collaborator Author

For the binaries, I'm guessing you built on the wrong machine?

I built the Windows binary on a Windows machine, the Mac binary on a Mac, and the Linux binary in a Docker container running the rocker/verse:4.02 image.

Did you use devtools::build(binary=TRUE)?

I did not use devtools. I built these binaries exactly as documented in the "Building Binary Packages" section of "Writing R Extensions".

Since this is the first time I've ever done this, I just opened a PR yesterday to document the process. You can see what was done there: #3285.

And did you have GitHub actions build it in the target OS?

No, as I mentioned in #3188 (comment), I created them manually. We've opened an issue to track the work to automate building these artifacts: #3283 .

My main issue with installing from source is that if I find it complicated then it'll most likely be even harder for other users, most of whom don't necessarily have a lot of terminal experience. And I like to show people tools that they can turn the key and run, so they can focus on data, not installation issues.

Makes sense! That remotes::install_url() example I just shared should be almost identical to the experience using install.packages(), and doesn't require anything outside of the usual CRAN toolchain (like CMake).

Could you tell me more about what LightGBM does for paths? What is it looking for?

I actually have no idea if that's an issue, sorry. My main role here is as an R maintainer and I don't have a full grasp of which things we link to dynamically vs. statically. Since this is the first time we've distributed a binary of the CRAN package, we also don't have any experience with users reporting issues on it...you're probably the first person other than me to try to use those artifacts 😬

Also, now that cmake is removed, perhaps it can be installed by remotes::install_github("...")?

remotes::install_github() will not work with this project and we don't recommend trying it. It's not accurate to say "cmake is removed". The PR we're commenting on allows building a CRAN-friendly version of the R package that does not require CMake, but installing with CMake is still supported: https://github.com/microsoft/LightGBM/blob/master/R-package/README.md#install

Because we support these different installation paths, the code in R-package/ in this repo can't just be installed with R CMD INSTALL.

@jameslamb
Collaborator Author

and sorry to change the names on you again @jaredlander , but I just had to change the name of that source distribution. I forgot to add a -r to differentiate it from the other non-R artifacts there.

lightgbm_source <- "https://github.com/microsoft/LightGBM/releases/download/v3.0.0rc1/lightgbm-3.0.0-1-r-cran.tar.gz"
remotes::install_url(lightgbm_source)

@jaredlander

Good news! From your latest comment, installing from "https://github.com/microsoft/LightGBM/releases/download/v3.0.0rc1/lightgbm-3.0.0-1-r-cran.tar.gz" using remotes::install_url() worked on Ubuntu and the model ran in RStudio.

No such luck with Windows though. This is what happens

* installing *source* package 'lightgbm' ...
** using staged installation
checking whether MM_PREFETCH works...yes
checking whether MM_MALLOC works...yes
** libs
Error: (converted from warning) this package has a non-empty 'configure.win' file,
so building only the main architecture
* removing 'C:/Users/jared/Documents/R/R-4.0.2/library/lightgbm'
* restoring previous 'C:/Users/jared/Documents/R/R-4.0.2/library/lightgbm'
Error: Failed to install 'unknown package' from URL:
  (converted from warning) installation of package ‘C:/Users/jared/AppData/Local/Temp/Rtmp8yJ5px/file86e0277e32fe/lightgbm_3.0.0-1.tar.gz’ had non-zero exit status

Normally I write all my talks on Windows, but for this talk I've been using both Windows and Linux for the extra horsepower my server provides. So I should be able to work this into the content. Would be excellent to tell people they can recreate in both OSes.
Installing the binary works, but the package can only be used from the terminal, not RStudio, and that will be a non-starter for a lot of people.

I did not use devtools. I built these binaries exactly as documented in the "Building Binary Packages" section of "Writing R Extensions".

I stopped using the command line for building packages as {devtools} made it so much less error prone, for me at least.

My main role here is as an R maintainer and I don't have a full grasp of which things we link to dynamically vs. statically.

The DESCRIPTION file doesn't have any LinkingTo listings, I wonder if that's an issue at all.

@jameslamb
Collaborator Author

Thanks for trying it! I'm pretty confused by both of those results...I regularly develop LightGBM in RStudio without issue, and I do not have any special customizations in my local environment. I don't have a ~/.R/Makevars or any uses of .Rprofile or .Renviron either 😕

We also test that source package on Windows for R 3.6 and 4.0, and it's passing R CMD check there.

Error: (converted from warning) this package has a non-empty 'configure.win' file,
so building only the main architecture

ugh it's confusing that this is showing up as an Error:, when it seemed like a warning on win-builder. I'm beginning to think the Biarch flag in DESCRIPTION should just never be used.

You've already been very patient with us, so feel free to say "I don't have time for this" at any point. But if you do have the time, could you try passing INSTALL_opts = "--no-multiarch" to install_url() on Windows? Maybe that will get around that error.

The DESCRIPTION file doesn't have any LinkingTo listings, I wonder if that's an issue at all.

In theory it shouldn't need it. That field is specific to linking against other R packages that are used to distribute headers of libraries. I don't think we should need a LinkingTo for any of the things we link to (OpenMP on Mac and Linux, Iphlpapi and ws2_32 on Windows), but maybe I'm wrong about that.

@jaredlander

You've already been very patient with us, so feel free to say "I don't have time for this" at any point.

Gonna give it another try on Windows. Then I might have to punt to the next time I give the talk, since the conference is this week and I'm the host!

But if you do have the time, could you try passing INSTALL_opts = "--no-multiarch" to install_url() on Windows?

The folks behind {catboost} used this, so hopefully that's the trick. That was originally included for the talk but I had to cut it for time.

Since I did get it working on Linux, could you point me to where I can find out about computing metrics? I can't seem to get model1$best_iter or model1$best_score to return anything meaningful. Also looking for a way to extract variable importance or any sort of visualization.

@jaredlander

I am now seeing lgb.plot.interpretation() and lgb.importance(), investigating those.

@jameslamb
Collaborator Author

jameslamb commented Aug 10, 2020

sure!

So for metrics, you'll probably want to pass a validation set. You can see these tests as an example:

test_that("when early stopping is not activated, best_iter and best_score come from valids and not training data", {

data(agaricus.train, package = "lightgbm")
data(agaricus.test, package = "lightgbm")
train <- agaricus.train
test <- agaricus.test

metrics <- list("binary_error", "auc", "binary_logloss")
bst <- lightgbm(
    data = train$data
    , label = train$label
    , num_leaves = 4L
    , learning_rate = 1.0
    , nrounds = 10L
    , objective = "binary"
    , metric = metrics
)


If you train with verbose = 1, you'll get metrics printed after every iteration. You can pass multiple metrics in one training run, and multiple validation sets (if you'd like).

You can also run bst$eval_train() (where bst is the result of lightgbm() or lgb.train()) to evaluate the model on the training set.

See bst$record_evals for evaluations of all metrics, at each iteration, for the training data + all validation sets.

lgb.importance() is the right place to look for feature importance, you can see the example from ?lightgbm::lgb.importance.

For the last year all of the energy going into our R package has been focused on getting to CRAN. It's taking a lot of work. I wish I could point you to beautiful vignettes (#1944) or tell you we have really compelling visualizations (#1222), but we're just not there yet.

@jaredlander

jaredlander commented Aug 10, 2020

This is what I have so far:

library(lightgbm)
library(dplyr)
library(recipes)

data(bank)

bank_char <- bank %>% select(-y) %>% 
    purrr::map_lgl(~is.character(.x)) %>% 
    which()

bank_rec <- bank %>% 
    mutate(across(where(is.character), ~as.integer(factor(.x))) - 1) %>% 
    recipe(formula=y ~ .) %>% 
    step_mutate(y_class=factor(y)) %>% 
    themis::step_upsample(y_class) %>% 
    step_rm(y_class) %>% 
    prep()

bank_x <- bank_rec %>% juice(all_predictors(), composition='dgCMatrix')
bank_y <- bank_rec %>% juice(all_outcomes(), composition='dgCMatrix')

bank_train <- lgb.Dataset(data=bank_x, label=bank_y, categorical_feature=bank_char)
mod2 <- lightgbm(data=bank_train, nrounds=100, obj='binary', metric=list('AUC'))

mod2 %>% lgb.importance() %>% lgb.plot.importance()

For the last year all of the energy going into our R package has been focused on getting to CRAN. It's taking a lot of work. I wish I could point you to beautiful vignettes (#1944) or tell you we have really compelling visualizations (#1222), but we're just not there yet.

I can see how much work you're putting in from all the back and forth in the issues. Thanks for doing that. With {xgboost} so ingrained, more descriptive help files will definitely be a plus in your endeavour to catch up.

The way categorical features are handled is interesting. Took some digging to figure out what to do. Have you seen how {catboost} handles categorical data? They let you pass data.frames which makes it a bit easier.

I'm a big fan of {parsnip} and while they only put their attention to packages on CRAN, once you're up there you may want to talk to that team about getting {lightgbm} integrated. You can see from tidymodels/parsnip#117 that they really put a lot of thought into R-like interfaces. They offer the {hardhat} package to assist with that.

Back to getting {lightgbm} into my talk...this interface is going to take a lot of explaining compared to the other interfaces. The problem is the talk is already butting up against the time limit. So I'm going to try to fit this in, but I'm not sure I can. You worked so hard to get it ready in time, so I really want to squeeze it in somehow.

@jameslamb
Collaborator Author

thanks for all the background! Yes, we still have a long way to go, including in messaging the way LightGBM works compared to those other libraries.

If it doesn't get into the talk, no worries at all. We appreciate the time and attention you've given to LightGBM already!

@jaredlander

Thanks for all your work. The slides are at https://jaredlander.com/content/2020/08/TallestTree.html and the video will be at rstats.ai in a few weeks.

@jameslamb
Collaborator Author

Thanks! Should we interpret this slide to mean "it's expected that calling plot() on a fitted model with no other arguments does something meaningful"?

image

I noticed you had a similar slide for {ranger} and didn't attempt that for {xgboost} (I don't know if XGBoost supports that)

@jaredlander

Exactly, I would like it if calling plot() did something, but I understand that it doesn't. For any model where plot() works out of the box, I used that. If plot() didn't work, but an alternative existed, like rpart.plot() for {rpart} and xgb.plot.multitree() for {xgboost}, I used that. If I couldn't find an alternative, I showed plot() causing an error.

@jameslamb jameslamb mentioned this pull request Aug 20, 2020
@jameslamb
Collaborator Author

got it, thanks! I linked that comment into #1222

by the way...we got past what I think is the last hurdle blocking us from CRAN (#3307 ). As soon as CRAN maintainers come back from vacation, we'll submit again and I think we have a good chance of being accepted

@github-actions

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 24, 2023