[python] added get_data() method to Dataset class #1870

StrikerRUS · 2018-11-24T00:56:17Z

Fixed #1690.
Fixed #1763.

StrikerRUS · 2018-11-24T00:57:35Z

python-package/lightgbm/basic.py

+
+        Returns
+        -------
+        data : string, numpy array, pandas DataFrame, scipy.sparse, list of numpy arrays or None


What about string and list of numpy arrays? How to "slice" them?..

I think we cannot slice them; maybe we should throw a warning or an error here.
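
To make the question above concrete, here is a minimal sketch of type-dependent slicing with a warning fallback. It is an illustration only, not the code from this PR: `slice_raw_data` is a hypothetical helper, it handles just 2-D NumPy arrays, and the warning text is made up for the example.

```python
import warnings

import numpy as np


def slice_raw_data(data, used_indices):
    """Hypothetical helper: keep only the rows listed in used_indices."""
    if data is None or used_indices is None:
        # Nothing to slice, or no subset was requested.
        return data
    if isinstance(data, np.ndarray):
        # Row-wise fancy indexing on a 2-D array returns a copy of the rows.
        return data[used_indices, :]
    # A file path (str) or a list of arrays has no obvious row-wise slice,
    # so warn and hand the object back unchanged.
    warnings.warn("Cannot subset {} type of raw data; returning it unchanged"
                  .format(type(data).__name__))
    return data
```

Warning instead of raising keeps `get_data()` usable for every input type, at the cost of returning more rows than the subset actually contains.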

StrikerRUS · 2018-11-24T00:59:48Z

Simplified example from #1763

import numpy as np
import pandas as pd
import lightgbm as lgb

full_data = pd.DataFrame({'x1': np.random.rand(100),
                          'x2': 5 + np.random.rand(100),
                          'target': np.random.randint(0, 2, 100)})

data_train = lgb.Dataset(full_data.drop(['target'], axis=1), full_data.target, free_raw_data=False)

def fpreproc_rebalance(dtrain, dtest, params,):
    train_data = dtrain.get_data()
    test_data = dtest.get_data()
    train_data.loc[train_data['x1'] > 0.5, ['x1']] = 0.5
    test_data.loc[test_data['x1'] > 0.5, ['x1']] = 0.5
    fdtrain = lgb.Dataset(train_data, dtrain.label, free_raw_data=False)
    fdtest = lgb.Dataset(test_data, dtest.label, free_raw_data=False)
    return fdtrain, fdtest, params

results = lgb.cv({}, data_train, fpreproc=fpreproc_rebalance)

fails with

LightGBMError: Cannot add validation data, since it has different bin mappers with training data

guolinke · 2018-12-10T02:41:47Z

python-package/lightgbm/basic.py

-                      categorical_feature=self.categorical_feature, params=params)
+        ret = Dataset(self.data, reference=self, feature_name=self.feature_name,
+                      categorical_feature=self.categorical_feature, params=params,
+                      free_raw_data=self.free_raw_data)


Maybe a better approach is to still pass None for data here?
Then, in construct, use reference.data for get_data?
That way, the logic of construct stays the same as before.

StrikerRUS · 2018-12-10T13:30:51Z

Now this snippet should be modified with construct() like the following:

...
def fpreproc_rebalance(dtrain, dtest, params,):
    train_data = dtrain.construct().get_data()
    test_data = dtest.construct().get_data()
    ...

guolinke · 2018-12-11T07:55:17Z

python-package/lightgbm/basic.py

@@ -974,6 +975,8 @@ def construct(self):
                        ctypes.c_int(used_indices.shape[0]),
                        c_str(params_str),
                        ctypes.byref(self.handle)))
+                    self.data = self.reference.data
+                    self.get_data()


Maybe we don't need the get_data() call here?

I thought that we need a consistent state of the object after calling construct().

Or do you think that it's unneeded and we should allow users to access data only via get_data() (and remove mentions of the data field)?

Okay, I see. Since only a subset will trigger the slicing, I think the overhead is acceptable.
Changing the definition of data may cause many dependency problems.

Yeah, it's expected that only the subsetting branch of construct calls get_data(). Also, get_data() itself checks used_indices:
https://github.com/Microsoft/LightGBM/blob/1ff6c6d6bca05de2c13490264ba3b6bf7f993d4a/python-package/lightgbm/basic.py#L1392
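
The flow agreed on in this thread can be sketched with a toy class. Everything here is an assumption made for illustration: `ToyDataset`, the string handles, and the `need_slice` flag stand in for the real `Dataset` internals, and rows are plain Python lists rather than arrays or DataFrames.

```python
class ToyDataset:
    """Hypothetical stand-in for Dataset; rows are plain Python lists."""

    def __init__(self, data, reference=None, used_indices=None):
        self.data = data
        self.reference = reference
        self.used_indices = used_indices
        self.handle = None        # stands in for the native handle
        self.need_slice = True    # slice the raw data at most once

    def subset(self, used_indices):
        # The subset starts without raw data, as suggested in the review.
        return ToyDataset(None, reference=self,
                          used_indices=list(used_indices))

    def construct(self):
        if self.handle is None:
            if self.reference is not None and self.used_indices is not None:
                self.reference.construct()
                self.handle = "subset-handle"
                # Subsetting branch only: borrow the parent's raw data,
                # then slice it so the object is left in a consistent state.
                self.data = self.reference.data
                self.get_data()
            else:
                self.handle = "full-handle"
        return self

    def get_data(self):
        if self.handle is None:
            raise RuntimeError("Cannot get data before construct()")
        if (self.need_slice and self.used_indices is not None
                and self.data is not None):
            # Build a new list so the parent's data is left untouched.
            self.data = [self.data[i] for i in self.used_indices]
            self.need_slice = False
        return self.data


full = ToyDataset([[1, 2], [3, 4], [5, 6]])
part = full.subset([0, 2])
rows = part.construct().get_data()
```

In this sketch the slicing cost is paid once, inside the subsetting branch of `construct()`, and later `get_data()` calls return the cached rows.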

StrikerRUS · 2018-12-12T00:41:45Z

@guolinke please help with this.

guolinke · 2018-12-12T03:26:45Z

@StrikerRUS I think that will not fail with the latest code.

StrikerRUS · 2018-12-12T11:31:04Z

@guolinke Unfortunately, it still fails (data branch with dll file from latest master):

import numpy as np
import pandas as pd
import lightgbm as lgb

full_data = pd.DataFrame({'x1': np.random.rand(100),
                          'x2': 5 + np.random.rand(100),
                          'target': np.random.randint(0, 2, 100)})

data_train = lgb.Dataset(full_data.drop(['target'], axis=1), full_data.target, free_raw_data=False)

def fpreproc_rebalance(dtrain, dtest, params,):
    train_data = dtrain.construct().get_data()
    test_data = dtest.construct().get_data()
    train_data.loc[train_data['x1'] > 0.5, ['x1']] = 0.5
    test_data.loc[test_data['x1'] > 0.5, ['x1']] = 0.5
    fdtrain = lgb.Dataset(train_data, dtrain.label, free_raw_data=False)
    fdtest = lgb.Dataset(test_data, dtest.label, free_raw_data=False)
    return fdtrain, fdtest, params

results = lgb.cv({}, data_train, fpreproc=fpreproc_rebalance)

LightGBMError: Cannot add validation data, since it has different bin mappers with training data

guolinke · 2018-12-12T13:04:06Z

@StrikerRUS you need to set the reference for valid data:

    fdtrain = lgb.Dataset(train_data, dtrain.label, free_raw_data=False)
    fdtest = lgb.Dataset(test_data, dtest.label, free_raw_data=False, reference=fdtrain)
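
Why the reference matters can be shown with a conceptual sketch. This is not LightGBM's binning algorithm: `fit_bin_edges` and `apply_bins` are hypothetical helpers that use plain quantile edges, chosen only to show that edges learned independently on two samples disagree, while reusing the training edges keeps the binned values comparable.

```python
import numpy as np


def fit_bin_edges(values, n_bins=4):
    """Hypothetical 'bin mapper': interior quantile edges learned from values."""
    quantiles = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.quantile(values, quantiles)


def apply_bins(values, edges):
    # Map each value to the index of the bin it falls into.
    return np.digitize(values, edges)


rng = np.random.default_rng(0)
train = rng.random(200)
valid = rng.random(50) * 0.5   # deliberately different distribution

train_edges = fit_bin_edges(train)
valid_edges = fit_bin_edges(valid)   # what happens without a reference

# Independent edges disagree, so the same raw value can land in different bins.
edges_match = np.allclose(train_edges, valid_edges)

# Reusing the training edges (the role reference= plays) keeps bins comparable.
valid_binned = apply_bins(valid, train_edges)
```

Under these assumptions, `edges_match` is false for the two samples, which is the toy analogue of the "different bin mappers" error reported above.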

StrikerRUS · 2018-12-13T00:26:05Z

@guolinke Yeah, thanks! Completely forgot about this. Now everything is OK.

Maybe add this snippet as a regression test?

guolinke · 2018-12-13T01:29:15Z

@StrikerRUS yeah, sure!

StrikerRUS · 2018-12-14T20:42:45Z

@guolinke Test has been added. Please check.

StrikerRUS added 4 commits November 19, 2018 16:44

added get_data method

462cf16

Merge branch 'master' into data

6fa5b09

Merge branch 'master' into data

0520429

hotfix

daf85e5

StrikerRUS added the help wanted label Nov 24, 2018

StrikerRUS commented Nov 24, 2018

guolinke reviewed Dec 10, 2018

StrikerRUS added 3 commits December 10, 2018 15:17

fixed conflicts

d8aa16e

added warning for other data types

c36b89f

reworked according to review comments

1ff6c6d

guolinke reviewed Dec 11, 2018

guolinke approved these changes Dec 12, 2018

minor addition to FAQ

c15c8e4

StrikerRUS removed the help wanted label Dec 14, 2018

added test

4bed1ae

StrikerRUS force-pushed the data branch from c99aa40 to 4bed1ae Compare December 14, 2018 20:30

StrikerRUS changed the title ~~[WIP][python] added get_data() method to Dataset class~~ [python] added get_data() method to Dataset class Dec 14, 2018

guolinke approved these changes Dec 15, 2018

guolinke merged commit 2323cb3 into master Dec 20, 2018

StrikerRUS deleted the data branch December 20, 2018 11:15

StrikerRUS mentioned this pull request Jul 7, 2019

fix init_model with subset #2252

Merged

lock bot locked as resolved and limited conversation to collaborators Mar 11, 2020
