[python] added get_data() method to Dataset class #1870

Merged
merged 9 commits into from Dec 20, 2018

Conversation

StrikerRUS
Collaborator

Fixed #1690.
Fixed #1763.


Returns
-------
data : string, numpy array, pandas DataFrame, scipy.sparse, list of numpy arrays or None
Collaborator Author

What about string and list of numpy arrays? How to "slice" them?..

Collaborator

I think we cannot slice them; maybe we should raise a warning or an error here.
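
To make the slicing question concrete, here is a minimal sketch of a type-dispatching row slice. The helper name `slice_raw_data` is hypothetical and is not part of LightGBM; scipy.sparse input is left out to keep the example short. Whether unsupported types should warn or raise is the open point in this thread, and the sketch simply raises.

```python
import numpy as np
import pandas as pd


def slice_raw_data(data, used_indices):
    """Hypothetical helper: return only the rows of `data` listed in `used_indices`."""
    if isinstance(data, np.ndarray):
        # Integer-list indexing on the first axis selects whole rows
        return data[used_indices]
    if isinstance(data, pd.DataFrame):
        # iloc selects by position; copy() avoids writing into the parent frame
        return data.iloc[used_indices].copy()
    # A file path (string) or a list of arrays has no obvious row slice here
    raise TypeError("Cannot subset raw data of type {}".format(type(data).__name__))
```

With this sketch, a string path or a list of numpy arrays fails loudly instead of silently returning unsliced data.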

@StrikerRUS
Collaborator Author

Simplified example from #1763

import numpy as np
import pandas as pd
import lightgbm as lgb

full_data = pd.DataFrame({'x1': np.random.rand(100),
                          'x2': 5 + np.random.rand(100),
                          'target': np.random.randint(0, 2, 100)})

data_train = lgb.Dataset(full_data.drop(['target'], axis=1), full_data.target, free_raw_data=False)

def fpreproc_rebalance(dtrain, dtest, params,):
    train_data = dtrain.get_data()
    test_data = dtest.get_data()
    train_data.loc[train_data['x1'] > 0.5, ['x1']] = 0.5
    test_data.loc[test_data['x1'] > 0.5, ['x1']] = 0.5
    fdtrain = lgb.Dataset(train_data, dtrain.label, free_raw_data=False)
    fdtest = lgb.Dataset(test_data, dtest.label, free_raw_data=False)
    return fdtrain, fdtest, params

results = lgb.cv({}, data_train, fpreproc=fpreproc_rebalance)

fails with

LightGBMError: Cannot add validation data, since it has different bin mappers with training data

categorical_feature=self.categorical_feature, params=params)
ret = Dataset(self.data, reference=self, feature_name=self.feature_name,
categorical_feature=self.categorical_feature, params=params,
free_raw_data=self.free_raw_data)
Collaborator

@guolinke guolinke Dec 10, 2018

Maybe a better approach is to still pass None for data here?
Then, in construct, use reference.data in get_data?
That way, the logic of construct stays the same as before.

Collaborator Author

Reworked

@StrikerRUS
Collaborator Author

This snippet now needs to call construct() before get_data(), as follows:

...
def fpreproc_rebalance(dtrain, dtest, params,):
    train_data = dtrain.construct().get_data()
    test_data = dtest.construct().get_data()
    ...

@@ -974,6 +975,8 @@ def construct(self):
ctypes.c_int(used_indices.shape[0]),
c_str(params_str),
ctypes.byref(self.handle)))
self.data = self.reference.data
self.get_data()
Collaborator

Maybe we don't need the get_data() call here?

Collaborator Author

I thought that we need a consistent state of the object after calling construct().

Or do you think that it's unnecessary and we should allow users to access the data only via get_data() (and remove mentions of the data field)?

Collaborator

Okay, I see. Since only a subset will trigger the slicing, I think the overhead is acceptable.
Changing the definition of data may cause many dependency problems.

Collaborator Author

Yeah, it's expected that only the subsetting branch of construct() calls get_data(). Also, get_data() itself checks used_indices:
https://github.com/Microsoft/LightGBM/blob/1ff6c6d6bca05de2c13490264ba3b6bf7f993d4a/python-package/lightgbm/basic.py#L1392
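
The idea discussed in this review thread can be illustrated with a toy model. `ToyDataset` and its `_needs_slice` flag are hypothetical and are not LightGBM's code; the sketch only assumes that a subset starts with no raw data, borrows it from its reference inside construct(), and slices by used_indices exactly once so the data field is consistent afterwards.

```python
import numpy as np


class ToyDataset:
    """Hypothetical stand-in for a lazily constructed dataset (not LightGBM code)."""

    def __init__(self, data=None, reference=None, used_indices=None):
        self.data = data
        self.reference = reference
        self.used_indices = used_indices
        self.constructed = False
        self._needs_slice = used_indices is not None

    def subset(self, used_indices):
        # The subset holds no raw data until construct() is called
        return ToyDataset(None, reference=self, used_indices=list(used_indices))

    def construct(self):
        if not self.constructed:
            self.constructed = True
            if self.reference is not None and self.used_indices is not None:
                # Subsetting branch: borrow the parent's raw data, then slice it
                self.data = self.reference.construct().data
                self.get_data()
        return self

    def get_data(self):
        if not self.constructed:
            raise RuntimeError("call construct() before get_data()")
        if self.data is not None and self._needs_slice:
            self.data = self.data[self.used_indices]
            self._needs_slice = False  # slice only once
        return self.data


full = ToyDataset(np.arange(10).reshape(5, 2))
part = full.subset([0, 3]).construct()
rows = part.get_data()  # the two selected rows; part.data is the same object
```

Because the slice happens inside construct(), reading `part.data` directly and calling `part.get_data()` give the same rows, and non-subset datasets never pay the slicing cost.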

@StrikerRUS
Collaborator Author

@guolinke please help with this.

@guolinke
Collaborator

@StrikerRUS I think that will not fail with the latest code.

@StrikerRUS
Collaborator Author

@guolinke Unfortunately, it still fails (data branch with dll file from latest master):

import numpy as np
import pandas as pd
import lightgbm as lgb

full_data = pd.DataFrame({'x1': np.random.rand(100),
                          'x2': 5 + np.random.rand(100),
                          'target': np.random.randint(0, 2, 100)})

data_train = lgb.Dataset(full_data.drop(['target'], axis=1), full_data.target, free_raw_data=False)

def fpreproc_rebalance(dtrain, dtest, params,):
    train_data = dtrain.construct().get_data()
    test_data = dtest.construct().get_data()
    train_data.loc[train_data['x1'] > 0.5, ['x1']] = 0.5
    test_data.loc[test_data['x1'] > 0.5, ['x1']] = 0.5
    fdtrain = lgb.Dataset(train_data, dtrain.label, free_raw_data=False)
    fdtest = lgb.Dataset(test_data, dtest.label, free_raw_data=False)
    return fdtrain, fdtest, params

results = lgb.cv({}, data_train, fpreproc=fpreproc_rebalance)
LightGBMError: Cannot add validation data, since it has different bin mappers with training data

@guolinke
Collaborator

@StrikerRUS you need to set the reference for the validation data:

    fdtrain = lgb.Dataset(train_data, dtrain.label, free_raw_data=False)
    fdtest = lgb.Dataset(test_data, dtest.label, free_raw_data=False, reference=fdtrain )

@StrikerRUS
Collaborator Author

@guolinke Yeah, thanks! Completely forgot about this. Now everything is OK.

Maybe add this snippet as a regression test?

@guolinke
Collaborator

@StrikerRUS yeah, sure!

@StrikerRUS StrikerRUS changed the title from "[WIP][python] added get_data() method to Dataset class" to "[python] added get_data() method to Dataset class" Dec 14, 2018
@StrikerRUS
Collaborator Author

@guolinke Test has been added. Please check.

@guolinke guolinke merged commit 2323cb3 into master Dec 20, 2018
@StrikerRUS StrikerRUS deleted the data branch December 20, 2018 11:15
@lock lock bot locked as resolved and limited conversation to collaborators Mar 11, 2020
2 participants