[R-package] Accept data frames as inputs #4323

Closed
8 tasks
Tracked by #5153
jameslamb opened this issue May 26, 2021 · 5 comments


@jameslamb
Collaborator

Summary

It should be possible to use R data frames directly as inputs in {lightgbm}, without converting them to matrices.

Motivation

The data.frame is a very common data structure in R, and many statistics and machine learning projects accept data frames as inputs, including but not limited to {caret}, {randomForest}, and the packages listed in #4295 (comment).

Allowing the use of data.frame objects would reduce friction for R users working with LightGBM.

Description

I think this work can be broken down into the following components:

  • accept data.frame inputs for data argument to lgb.Dataset()
  • accept data.frame inputs for data argument to lightgbm()
  • accept data.frame inputs for data argument to lgb.cv()
  • automatically encode categorical-looking column types (factor, character) in a LightGBM-friendly way
  • accept data.frame inputs for predict.lgb.Booster() / Booster$predict() / Predictor$predict()
    • once "automatically encode categorical-looking column types" is implemented, this should also include converting categorical-looking columns with the same rules
  • allow passing a column name for init_score in lgb.Dataset() if data is a data.frame
  • allow passing a column name for label in lgb.Dataset() if data is a data.frame
  • allow passing a column name for weight in lgb.Dataset() if data is a data.frame
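
To make the last three items concrete, here is a minimal sketch of the column-name lookup. It is written in plain Python rather than R, purely to illustrate the logic. The helper name `split_dataset_columns` and the use of a dict of lists as a stand-in for a data frame are assumptions for illustration, not part of any LightGBM API.

```python
def split_dataset_columns(frame, label=None, weight=None, init_score=None):
    """Separate special columns (given by name) from the feature columns.

    `frame` is a dict mapping column name -> list of values, standing in
    for a data frame. Every name here is hypothetical, not LightGBM API.
    """
    special = {"label": label, "weight": weight, "init_score": init_score}
    extracted = {}
    for role, column in special.items():
        if column is None:
            continue
        if column not in frame:
            raise KeyError(f"{role} column '{column}' not found in data")
        extracted[role] = frame[column]

    # Columns consumed as label/weight/init_score must not be used as features.
    used = {column for column in special.values() if column is not None}
    features = {name: values for name, values in frame.items() if name not in used}
    return features, extracted


frame = {
    "age": [23, 41, 35],
    "gender": ["m", "f", "f"],
    "w": [1.0, 0.5, 2.0],
    "y": [0, 1, 1],
}
features, extracted = split_dataset_columns(frame, label="y", weight="w")
```

The point of the design is that a column named as `label`, `weight`, or `init_score` is removed from the feature set, so it cannot leak into training as a predictor.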

References

See #4207 for a prior proposal and #4207 (review) for reference on this issue.

This feature precedes #4295.

@jameslamb
Collaborator Author

I've added this feature request to #2302, per this project's standard approach for managing its backlog.

If you are interested in contributing to this feature or have additional information to add, please leave a comment and the issue can be re-opened.

@jameslamb
Collaborator Author

Want to add that when this feature is picked up, the R package should support categorical features for data frames using the same interface as the Python package:

  • setting categorical_feature = "auto" in Dataset (the default) means "automatically detect categorical features"
  • if categorical_feature in Dataset is a vector of feature indices or names, use that list of features for categorical_feature instead

feature_name='auto', categorical_feature='auto', params=None,

if categorical_feature is not None:
    if feature_name is None:
        feature_name = list(data.columns)
    if categorical_feature == 'auto':  # use cat cols from DataFrame
        categorical_feature = cat_cols_not_ordered
    else:  # use cat cols specified by user
        categorical_feature = list(categorical_feature)

if categorical_feature == 'auto':
    categorical_feature = None
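
As a rough, self-contained sketch of the two rules above (not the Python package's actual code), the resolution could look like the following. The function names, the dict-of-lists stand-in for a data frame, and the "all non-missing values are strings" test for categorical-looking columns are assumptions made for illustration. Positions are 0-based here because the sketch is in Python; an R implementation would presumably use 1-based indices.

```python
def _looks_categorical(values):
    """Assumed rule: a column is categorical-looking if all non-missing values are str."""
    present = [v for v in values if v is not None]
    return bool(present) and all(isinstance(v, str) for v in present)


def resolve_categorical_feature(columns, categorical_feature="auto"):
    """Return the names of the columns to treat as categorical.

    `columns` maps column name -> list of values (a stand-in for a data frame).
    """
    names = list(columns)
    if categorical_feature is None:
        return []
    if isinstance(categorical_feature, str):
        if categorical_feature == "auto":
            # Automatic detection from the column contents.
            return [n for n in names if _looks_categorical(columns[n])]
        categorical_feature = [categorical_feature]  # a single column name

    resolved = []
    for item in categorical_feature:
        # Accept either a 0-based position or a column name.
        name = names[item] if isinstance(item, int) else item
        if name not in columns:
            raise KeyError(f"categorical feature '{name}' not found in data")
        resolved.append(name)
    return resolved


columns = {"age": [23, 41], "gender": ["m", "f"], "city": ["a", None]}
auto = resolve_categorical_feature(columns)
by_index = resolve_categorical_feature(columns, [0])
by_name = resolve_categorical_feature(columns, ["city"])
```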

@mayer79
Contributor

mayer79 commented Feb 14, 2022

The tricky thing here is that we would need to store factor levels in the Booster object in order to allow for safe out-of-sample predictions. This could be a named list like xlevs = list(gender = c("m", "f"), age = c("young", "old")).
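
A minimal sketch of that idea, written in plain Python only to illustrate the logic: levels are recorded at training time and reused at prediction time, so the same string always maps to the same integer code. The helper names `learn_levels` and `encode_with_levels` are hypothetical, and mapping unseen or missing values to NaN is an assumption of this sketch rather than a decided behavior for the R package.

```python
import math


def learn_levels(columns, categorical):
    """Record each categorical column's distinct values, in order of first appearance."""
    xlevs = {}
    for name in categorical:
        seen = []
        for value in columns[name]:
            if value is not None and value not in seen:
                seen.append(value)
        xlevs[name] = seen
    return xlevs


def encode_with_levels(columns, xlevs):
    """Replace categorical values with codes from the stored levels.

    Values that are missing or were not seen during training become NaN.
    """
    encoded = {}
    for name, values in columns.items():
        if name in xlevs:
            codes = {level: i for i, level in enumerate(xlevs[name])}
            encoded[name] = [
                float(codes[v]) if v in codes else math.nan for v in values
            ]
        else:
            encoded[name] = list(values)
    return encoded


train = {"gender": ["m", "f", "m"], "age": ["young", "old", "old"]}
xlevs = learn_levels(train, ["gender", "age"])

# At prediction time, reuse the stored levels instead of re-deriving them.
new_data = {"gender": ["f", "x"], "age": ["old", "young"]}
encoded = encode_with_levels(new_data, xlevs)
```

Re-deriving levels from `new_data` alone would assign "f" the code 0 instead of 1, which is exactly the out-of-sample mismatch that storing the levels with the Booster is meant to prevent.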

@github-actions

This comment was marked as off-topic.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 16, 2023
@jameslamb
Collaborator Author

Sorry, this was locked accidentally. Just unlocked it. We'd still love help with this feature!

@microsoft microsoft unlocked this conversation Aug 18, 2023