[python] added get_data() method to Dataset class #1870

Merged (9 commits) on Dec 20, 2018
8 changes: 4 additions & 4 deletions docs/FAQ.rst
@@ -211,15 +211,15 @@ Python-package
If you set ``free_raw_data=True`` (default), the raw data (with Python data struct) will be freed.
So, if you want to:

- get label (or weight/init\_score/group) before constructing a dataset, it's same as get ``self.label``
- get label (or weight/init\_score/group/data) before constructing a dataset, it's same as get ``self.label``;

- set label (or weight/init\_score/group) before constructing a dataset, it's same as ``self.label=some_label_array``
- set label (or weight/init\_score/group) before constructing a dataset, it's same as ``self.label=some_label_array``;

- get num\_data (or num\_feature) before constructing a dataset, you can get data with ``self.data``.
Then, if your data is ``numpy.ndarray``, use some code like ``self.data.shape``
Then, if your data is ``numpy.ndarray``, use some code like ``self.data.shape``. But do not do this after subsetting the Dataset, because you'll get always ``None``;

- set predictor (or reference/categorical feature) after constructing a dataset,
you should set ``free_raw_data=False`` or init a Dataset object with the same raw data
you should set ``free_raw_data=False`` or init a Dataset object with the same raw data.

--------------
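The lifecycle this FAQ entry describes can be sketched with a toy class. `ToyDataset` below is a hypothetical stand-in written for illustration only; it is not LightGBM's implementation, and the only behavior it models is that raw data stays readable until `construct()` runs and is released afterwards unless `free_raw_data=False`.

```python
class ToyDataset:
    """Hypothetical stand-in for lgb.Dataset; not LightGBM's real code."""

    def __init__(self, data, label=None, free_raw_data=True):
        self.data = data
        self.label = label
        self.free_raw_data = free_raw_data
        self.handle = None  # set once construct() has run

    def construct(self):
        if self.handle is None:
            # Stand-in for building the internal representation.
            self.handle = object()
            if self.free_raw_data:
                self.data = None  # raw Python-side data is released
        return self


rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

freed = ToyDataset(rows, label=[0, 1, 0])  # default: free_raw_data=True
shape_before = (len(freed.data), len(freed.data[0]))  # readable before construct()
freed.construct()

kept = ToyDataset(rows, label=[0, 1, 0], free_raw_data=False).construct()

print(shape_before)       # (3, 2)
print(freed.data)         # None
print(kept.data is rows)  # True
```

In this model, anything that needs the raw rows after construction has to opt in with `free_raw_data=False`, which is the same advice the FAQ gives for setting a predictor or reference later.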

27 changes: 26 additions & 1 deletion python-package/lightgbm/basic.py
@@ -687,6 +687,7 @@ def __init__(self, data, label=None, reference=None,
self.params = copy.deepcopy(params)
self.free_raw_data = free_raw_data
self.used_indices = None
self.need_slice = True
self._predictor = None
self.pandas_categorical = None
self.params_back_up = None
@@ -974,6 +975,8 @@ def construct(self):
ctypes.c_int(used_indices.shape[0]),
c_str(params_str),
ctypes.byref(self.handle)))
self.data = self.reference.data
self.get_data()
Collaborator

Maybe we don't need the get_data() call here?

Collaborator Author

I thought that we need a consistent state of the object after calling construct().

Or do you think that's unnecessary, and users should access the data only via get_data() (removing mentions of the data field)?

Collaborator

Okay, I see. Since only a subset triggers the slicing, I think the overhead is acceptable.
Changing the definition of data may cause many dependency problems.

Collaborator Author

Yes, the expectation is that only the subsetting branch of construct() calls get_data(). Also, get_data() itself checks used_indices:
https://github.com/Microsoft/LightGBM/blob/1ff6c6d6bca05de2c13490264ba3b6bf7f993d4a/python-package/lightgbm/basic.py#L1392
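The point made in this thread, that get_data() only slices for subsets and only once, can be illustrated with a small sketch. `ToySubset` is a hypothetical class that borrows the attribute names from the diff (`used_indices`, `need_slice`); plain lists stand in for the real array types, and `slice_calls` exists only to make the behavior observable.

```python
class ToySubset:
    """Hypothetical model of a subset Dataset; illustrative only."""

    def __init__(self, reference_data, used_indices):
        self.data = reference_data  # mirrors `self.data = self.reference.data`
        self.used_indices = used_indices
        self.need_slice = True
        self.slice_calls = 0  # instrumentation for this sketch only

    def get_data(self):
        if self.data is not None and self.used_indices is not None and self.need_slice:
            self.data = [self.data[i] for i in self.used_indices]
            self.slice_calls += 1
            self.need_slice = False  # later calls return the cached slice
        return self.data


parent_rows = ["r0", "r1", "r2", "r3"]
sub = ToySubset(parent_rows, used_indices=[1, 3])

first = sub.get_data()   # slices the parent's rows
second = sub.get_data()  # guard is off now, so no second slice

print(first)            # ['r1', 'r3']
print(second is first)  # True
print(sub.slice_calls)  # 1
```

In this toy model, dropping the `need_slice` guard would make the second call index the already-sliced two-element list with index 3 and raise `IndexError`, which is one reason a one-shot flag is a reasonable design here.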

if self.group is not None:
self.set_group(self.group)
if self.get_label() is None:
@@ -1041,7 +1044,8 @@ def subset(self, used_indices, params=None):
if params is None:
params = self.params
ret = Dataset(None, reference=self, feature_name=self.feature_name,
categorical_feature=self.categorical_feature, params=params)
categorical_feature=self.categorical_feature, params=params,
free_raw_data=self.free_raw_data)
Collaborator
@guolinke, Dec 10, 2018

Maybe it would be better to still pass None for data here,
and then have construct use reference.data when calling get_data?
That way the logic of construct stays the same as before.

Collaborator Author

Reworked

ret._predictor = self._predictor
ret.pandas_categorical = self.pandas_categorical
ret.used_indices = used_indices
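One possible reading of the reworked flow is sketched below: the subset is created with no data of its own, inherits `free_raw_data` from its parent, and only looks at the parent's rows when it is constructed. `ToyDataset` and its `constructed` flag are hypothetical names for this sketch; the real `construct()` does considerably more.

```python
class ToyDataset:
    """Hypothetical model of deferred subsetting; not LightGBM's real code."""

    def __init__(self, data, reference=None, free_raw_data=True):
        self.data = data
        self.reference = reference
        self.free_raw_data = free_raw_data
        self.used_indices = None
        self.constructed = False

    def subset(self, used_indices):
        # The child starts with data=None and inherits the parent's flag.
        child = ToyDataset(None, reference=self, free_raw_data=self.free_raw_data)
        child.used_indices = list(used_indices)
        return child

    def construct(self):
        if not self.constructed:
            if self.reference is not None and self.used_indices is not None:
                parent_data = self.reference.data  # looked up only at construct time
                if parent_data is not None:
                    self.data = [parent_data[i] for i in self.used_indices]
            self.constructed = True
        return self


parent = ToyDataset([[1, 2], [3, 4], [5, 6]], free_raw_data=False)
child = parent.subset([0, 2])
data_before = child.data  # nothing copied yet

child.construct()

print(data_before)          # None
print(child.data)           # [[1, 2], [5, 6]]
print(child.free_raw_data)  # False
```

Deferring the lookup keeps `subset()` cheap and leaves the constructor call the same shape as before, which appears to be the motivation given in the review comment.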
@@ -1375,6 +1379,27 @@ def get_init_score(self):
self.init_score = self.get_field('init_score')
return self.init_score

    def get_data(self):
        """Get the raw data of the Dataset.

        Returns
        -------
        data : string, numpy array, pandas DataFrame, scipy.sparse, list of numpy arrays or None
Collaborator Author

What about a string or a list of numpy arrays? How do we "slice" them?

Collaborator

I think we cannot slice them; maybe we should throw a warning or error here.

            Raw data used in the Dataset construction.
        """
        if self.handle is None:
            raise Exception("Cannot get data before construct Dataset")
        if self.data is not None and self.used_indices is not None and self.need_slice:
            if isinstance(self.data, np.ndarray) or scipy.sparse.issparse(self.data):
                self.data = self.data[self.used_indices, :]
            elif isinstance(self.data, DataFrame):
                self.data = self.data.iloc[self.used_indices].copy()
            else:
                warnings.warn("Cannot subset {} type of raw data.\n"
                              "Returning original raw data".format(type(self.data).__name__))
            self.need_slice = False
        return self.data
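The fallback branch of get_data() (warn and return the original data when the type cannot be row-sliced) can be tried in isolation with the standard library. `slice_raw_data` is a hypothetical helper, and a plain list of rows stands in for the sliceable array types; note that this differs from the PR, where a list of numpy arrays takes the warning branch.

```python
import warnings


def slice_raw_data(data, used_indices):
    """Illustrative helper: slice row-indexable data, else warn and return it unchanged."""
    if isinstance(data, (list, tuple)):
        return [data[i] for i in used_indices]
    warnings.warn("Cannot subset {} type of raw data.\n"
                  "Returning original raw data".format(type(data).__name__))
    return data


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is recorded, not suppressed
    sliced = slice_raw_data([[1, 2], [3, 4], [5, 6]], [0, 2])
    untouched = slice_raw_data("train.txt", [0, 2])  # a file path cannot be row-sliced

print(sliced)       # [[1, 2], [5, 6]]
print(untouched)    # train.txt
print(len(caught))  # 1
```

Warning rather than raising keeps callers working when the raw data is a path or another unsupported type, at the cost that the returned object no longer matches the subset's rows.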

def get_group(self):
"""Get the group of the Dataset.

20 changes: 20 additions & 0 deletions tests/python_package_test/test_engine.py
@@ -808,3 +808,23 @@ def test_constant_features_multiclassova(self):
}
self.test_constant_features([0.0, 1.0, 2.0, 0.0], [0.5, 0.25, 0.25], params)
self.test_constant_features([0.0, 1.0, 2.0, 1.0], [0.25, 0.5, 0.25], params)

    def test_fpreproc(self):
        def preprocess_data(dtrain, dtest, params):
            train_data = dtrain.construct().get_data()
            test_data = dtest.construct().get_data()
            train_data[:, 0] += 1
            test_data[:, 0] += 1
            dtrain.label[-5:] = 3
            dtest.label[-5:] = 3
            dtrain = lgb.Dataset(train_data, dtrain.label)
            dtest = lgb.Dataset(test_data, dtest.label, reference=dtrain)
            params['num_class'] = 4
            return dtrain, dtest, params

        X, y = load_iris(True)
        dataset = lgb.Dataset(X, y, free_raw_data=False)
        params = {'objective': 'multiclass', 'num_class': 3, 'verbose': -1}
        results = lgb.cv(params, dataset, num_boost_round=10, fpreproc=preprocess_data)
        self.assertIn('multi_logloss-mean', results)
        self.assertEqual(len(results['multi_logloss-mean']), 10)
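The shape of the hook this test exercises, a function that receives `(dtrain, dtest, params)` for a fold and returns transformed versions, can be sketched without LightGBM. `toy_cv` below is a hypothetical driver, not `lgb.cv`; folds are plain dicts, and the preprocessing mirrors the test only loosely (it relabels the last row instead of the last five).

```python
def toy_cv(rows, labels, nfold, fpreproc=None):
    """Toy k-fold driver (hypothetical, not lgb.cv): returns what each fold would train on."""
    base_params = {"num_class": 3}
    folds = []
    for k in range(nfold):
        train = {"data": [], "label": []}
        test = {"data": [], "label": []}
        for i, (row, label) in enumerate(zip(rows, labels)):
            part = test if i % nfold == k else train
            part["data"].append(list(row))  # copy so preprocessing cannot touch the input
            part["label"].append(label)
        params = dict(base_params)  # each fold gets its own params copy
        if fpreproc is not None:
            train, test, params = fpreproc(train, test, params)
        folds.append((train, test, params))
    return folds


def preprocess_data(dtrain, dtest, params):
    """Shift column 0, relabel the last row, and widen num_class."""
    for part in (dtrain, dtest):
        for row in part["data"]:
            row[0] += 1
        part["label"][-1] = 3
    params["num_class"] = 4
    return dtrain, dtest, params


rows = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
folds = toy_cv(rows, [0, 1, 2, 0], nfold=2, fpreproc=preprocess_data)

print(len(folds))                # 2
print(folds[0][1]["data"])       # [[1.0, 1.0], [5.0, 5.0]]
print(folds[0][2]["num_class"])  # 4
print(rows[0])                   # [0.0, 1.0]
```

Because the hook can change labels and parameters together, as the test does when it introduces a fourth class, returning all three values from one function keeps them consistent per fold.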