[R-package] Accept data frames as inputs #4323

jameslamb · 2021-05-26T04:06:28Z

Summary

It should be possible to use R data frames directly as inputs in {lightgbm}, without converting them to matrices.

Motivation

The data.frame is a very common data structure in R, and many statistics and machine learning projects accept data frames as inputs, including but not limited to {caret}, {randomForest}, and the packages listed in #4295 (comment).

Allowing the use of data.frame objects would reduce friction for R users working with LightGBM.

Description

I think this work can be broken down into the following components:

- accept data.frame inputs for the data argument to lgb.Dataset()
- accept data.frame inputs for the data argument to lightgbm()
- accept data.frame inputs for the data argument to lgb.cv()
- automatically encode categorical-looking column types (factor, character) in a LightGBM-friendly way
- accept data.frame inputs for predict.lgb.Booster() / Booster$predict() / Predictor$predict()
  - once "automatically encode categorical-looking column types" is implemented, this should also include converting categorical-looking columns with the same rules
- allow passing a column name for init_score in lgb.Dataset() if data is a data.frame
- allow passing a column name for label in lgb.Dataset() if data is a data.frame
- allow passing a column name for weight in lgb.Dataset() if data is a data.frame
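
To make the column-name items above concrete, here is a minimal sketch of the intended behavior. It is written in plain Python for illustration and is not R package code: a dict of equal-length lists stands in for a data.frame, and `split_frame` is a hypothetical helper that does not exist in {lightgbm}.

```python
def split_frame(frame, label=None, weight=None, init_score=None):
    """Separate special columns (given by name) from the feature columns.

    `frame` is a dict mapping column name -> list of values, standing in
    for a data.frame. Hypothetical helper, for illustration only.
    """
    special = {"label": label, "weight": weight, "init_score": init_score}
    out = {}
    for role, column in special.items():
        if column is None:
            out[role] = None
            continue
        if column not in frame:
            raise KeyError(f"{role} column '{column}' not found in data")
        out[role] = list(frame[column])
    used = {c for c in special.values() if c is not None}
    # Everything not claimed as label / weight / init_score is a feature.
    out["features"] = {k: list(v) for k, v in frame.items() if k not in used}
    return out


frame = {
    "age": [23, 41, 35],
    "gender": ["m", "f", "f"],
    "w": [1.0, 0.5, 2.0],
    "y": [0, 1, 1],
}
parts = split_frame(frame, label="y", weight="w")
print(sorted(parts["features"]))  # feature columns left after removing y and w
```

The point of the sketch is that the named columns are pulled out of the feature set, so they cannot leak into training as ordinary features.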

References

See #4207 for a prior proposal and #4207 (review) for reference on this issue.

This feature precedes #4295.

jameslamb · 2021-05-26T04:08:25Z

I've added this feature request to #2302, per this project's standard approach for managing its backlog.

If you are interested in contributing to this feature or have additional information to add, please leave a comment and the issue can be re-opened.

jameslamb · 2021-08-22T01:20:03Z

Want to add that when this feature is picked up, the R package should support categorical features for data frames using the same interface as the Python package:

- setting categorical_feature = "auto" in Dataset (the default) means "automatically detect categorical features"
- if categorical_feature in Dataset is a vector of feature indices or names, use that list of features for categorical_feature instead

LightGBM/python-package/lightgbm/basic.py, line 1127 in 8a90ea3:

    feature_name='auto', categorical_feature='auto', params=None,

LightGBM/python-package/lightgbm/basic.py, lines 534 to 540 in 8a90ea3:

    if categorical_feature is not None:
        if feature_name is None:
            feature_name = list(data.columns)
        if categorical_feature == 'auto':  # use cat cols from DataFrame
            categorical_feature = cat_cols_not_ordered
        else:  # use cat cols specified by user
            categorical_feature = list(categorical_feature)

LightGBM/python-package/lightgbm/basic.py, lines 555 to 556 in 8a90ea3:

    if categorical_feature == 'auto':
        categorical_feature = None
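
The two rules described in this comment can be sketched as a standalone function. This is a simplified illustration and not the Python package's actual code: `resolve_categorical_feature` and its `detected` argument are hypothetical, treating `None` as "no categorical features" is an assumption, and integer positions are 0-based here (an R interface would presumably use 1-based indices).

```python
def resolve_categorical_feature(categorical_feature, feature_names, detected):
    """Return the names of the features to treat as categorical.

    categorical_feature: "auto", None, or a list of names / 0-based indices.
    feature_names: column names of the input data.
    detected: columns that look categorical (e.g. factor columns).
    """
    if categorical_feature is None:
        # Assumption for this sketch: None means "no categorical features".
        return []
    if isinstance(categorical_feature, str) and categorical_feature == "auto":
        # "auto": use whatever was detected from the column types.
        return list(detected)
    resolved = []
    for item in categorical_feature:
        # Accept either a column name or a 0-based position.
        name = feature_names[item] if isinstance(item, int) else item
        if name not in feature_names:
            raise ValueError(f"unknown feature: {name!r}")
        resolved.append(name)
    return resolved


names = ["age", "gender", "income"]
auto_result = resolve_categorical_feature("auto", names, detected=["gender"])
user_result = resolve_categorical_feature([0, "gender"], names, detected=["gender"])
```

With `"auto"`, the detected columns win; with an explicit list, the user's choice replaces the detected set entirely rather than being merged with it.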

mayer79 · 2022-02-14T13:16:45Z

The tricky thing here is that we would need to store factor levels in the Booster object in order to allow for safe out-of-sample predictions. This could be a named list like xlevs = list(gender = c("m", "f"), age = c("young", "old")).
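
A rough sketch of why the stored levels matter, again in plain Python rather than R, with dicts of lists standing in for data frames. `learn_levels` and `encode_with_levels` are hypothetical helpers; mapping unseen values to `None` (missing) is one possible policy and not something decided in this thread.

```python
def learn_levels(frame, categorical_columns):
    """Record each categorical column's distinct values, in order of first appearance."""
    return {col: list(dict.fromkeys(frame[col])) for col in categorical_columns}


def encode_with_levels(frame, xlevs):
    """Encode categorical columns as integer codes using the stored levels.

    Values not seen at training time become None (treated as missing)
    instead of silently receiving a new code.
    """
    encoded = {k: list(v) for k, v in frame.items()}
    for col, levels in xlevs.items():
        code_of = {level: i for i, level in enumerate(levels)}
        encoded[col] = [code_of.get(value) for value in frame[col]]
    return encoded


train = {"gender": ["m", "f", "m"], "age": ["young", "old", "old"]}
xlevs = learn_levels(train, ["gender", "age"])

# New data lists values in a different order and contains an unseen level.
new = {"gender": ["f", "m", "x"], "age": ["old", "young", "old"]}
codes = encode_with_levels(new, xlevs)
```

If the new data were encoded on its own, "f" would come first and receive code 0, which at training time meant "m". Reusing the levels stored alongside the model keeps the codes aligned between training and prediction.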

jameslamb · 2023-08-18T02:11:21Z

Sorry, this was locked accidentally. Just unlocked it. We'd still love help with this feature!

jameslamb added the "feature request" and "r-package" labels May 26, 2021

jameslamb mentioned this issue May 26, 2021

Feature Requests & Voting Hub #2302

Open

jameslamb closed this as completed May 26, 2021

jameslamb mentioned this issue Mar 9, 2022

[c-api][python-package][R-package] expose feature num bin #5048

Merged

jameslamb mentioned this issue Apr 14, 2022

[RFC] 4.0.0 Release #5153

Closed

60 tasks

jameslamb mentioned this issue Apr 28, 2022

[R-package] allow use of categorical_features in Dataset when raw data does not have column names (fixes #4374) #5184

Merged

jameslamb mentioned this issue Jul 25, 2022

test that lightgbm objective is set correctly when label is a factor tidymodels/bonsai#43

Merged

jmoralez mentioned this issue Aug 16, 2022

[python-package][R-package] load parameters from model file (fixes #2613) #5424

Merged

github-actions bot locked as resolved and limited conversation to collaborators Aug 16, 2023

microsoft unlocked this conversation Aug 18, 2023
