ignore_column does not work when loading a dataset from a binary file #4657

Closed
karunrao97 opened this issue Oct 8, 2021 · 5 comments

karunrao97 commented Oct 8, 2021

Description

I need to load only selected columns of a LightGBM dataset from a binary file. I am trying to use the ignore_column parameter for this, but it fails with the following error (where c1 is the first column name I'm trying to ignore):

LightGBMError: Could not find ignore column c1 in data file

Reproducible example

import lightgbm as lgb
import numpy as np

n_samples = 100
feature_names = [f'c{i}' for i in range(10)]
ds_path = 'ds.bin'

ds = lgb.Dataset(
    np.random.random((n_samples, len(feature_names))),
    label=np.random.random(n_samples),
    feature_name=feature_names,
)
ds.construct()
ds.save_binary(ds_path)

ds = lgb.Dataset(ds_path)
ds.construct()  # this works fine

ds = lgb.Dataset(ds_path, params={'ignore_column': f'name:{",".join(feature_names[1:])}'})
ds.construct()  # this throws the above-mentioned exception

Environment info

LightGBM version or commit hash: 5b7a6f3

Command(s) you used to install LightGBM: I built it from source

Additional Comments

I also tried the following:

ds = lgb.Dataset(ds_path, params={'ignore_column': ",".join([str(i) for i in range(1, len(feature_names))])})
ds.construct()

This doesn't crash, but it loads all the features. When I train a model on this dataset, all the features are used, so ignore_column doesn't seem to have had any effect in this case.

jameslamb added the bug label Oct 8, 2021
@jameslamb
Collaborator

Thanks very much for the excellent issue write-up with example code! We'll take a look at this soon.


By the way, I've slightly changed the formatting of your issue to make it a bit easier to read. If you're new to GitHub and / or not familiar with markdown, you might find this guide helpful: https://docs.github.com/en/github/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax#quoting-code.

@jmoralez
Collaborator

jmoralez commented Oct 9, 2021

This happens with the CLI as well: passing name:c1 raises the same error, and passing integer indices like 0,1,2 still leaves those columns in use for training.

@karunrao97
Author

Thank you for the follow-up, and thanks also for the formatting edits on the original post.

@hzy46
Contributor

hzy46 commented Oct 28, 2021

By design, the ignore_column parameter has no effect when a dataset is loaded from a binary file. Other parameters that behave the same way include two_round, header, label_column, weight_column, and group_column. See #4701 and #4724.
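
Since ignore_column is not applied to a binary file, one possible workaround is to drop the unwanted columns before the binary file is written. The following is a minimal sketch, assuming the raw feature matrix is still available as a 2-D NumPy array. The helper names kept_columns and save_subset_binary are hypothetical and not part of LightGBM; only lgb.Dataset and Dataset.save_binary are taken from the example in the original post.

```python
def kept_columns(feature_names, ignore):
    """Return (indices, names) of the columns that are NOT listed in `ignore`."""
    unknown = [name for name in ignore if name not in feature_names]
    if unknown:
        raise ValueError(f"unknown column(s): {unknown}")
    ignored = set(ignore)
    indices = [i for i, name in enumerate(feature_names) if name not in ignored]
    return indices, [feature_names[i] for i in indices]


def save_subset_binary(X, y, feature_names, ignore, path):
    """Build a Dataset from the kept columns only and save it as a binary file.

    Assumption: X is a 2-D NumPy array with one column per entry in feature_names.
    lightgbm is imported lazily so that kept_columns stays dependency-free.
    """
    import lightgbm as lgb

    indices, names = kept_columns(feature_names, ignore)
    ds = lgb.Dataset(X[:, indices], label=y, feature_name=names)
    ds.save_binary(path)  # the saved file contains only the kept columns
    return ds
```

With the data from the reproducible example, save_subset_binary(X, y, feature_names, feature_names[1:], 'ds_subset.bin') would write a binary file containing only c0, so a later lgb.Dataset('ds_subset.bin') needs no ignore_column at all. The trade-off is one binary file per column subset.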

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023