New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[python] added get_data() method to Dataset class #1870
Conversation
|
||
Returns | ||
------- | ||
data : string, numpy array, pandas DataFrame, scipy.sparse, list of numpy arrays or None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about string
and list of numpy arrays
? How to "slice" them?..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we cannot slide them, maybe we should throw a warning or error here.
Simplified example from #1763
fails with
|
categorical_feature=self.categorical_feature, params=params) | ||
ret = Dataset(self.data, reference=self, feature_name=self.feature_name, | ||
categorical_feature=self.categorical_feature, params=params, | ||
free_raw_data=self.free_raw_data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe a better logic still pass None
for data
here?
Then in construct
, use reference.data to get_data
?
Then, the logic of construct is the same as before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reworked
Now this snippet should be modified with
|
@@ -974,6 +975,8 @@ def construct(self): | |||
ctypes.c_int(used_indices.shape[0]), | |||
c_str(params_str), | |||
ctypes.byref(self.handle))) | |||
self.data = self.reference.data | |||
self.get_data() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we don't need the get_data
here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought that we need a consistent state of the object after calling construct()
.
Or do you think that it's unneeded and we should allow users to access data only via get_data()
(remove mentions about data
field)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, I see. As only subset will call the slicing, I think the overhead cost is acceptable.
Change the definition of data
may cause many dependency problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's expected that only subsetting branch of construct
calls get_data()
. Also, get_data()
itself checks used_indices
:
https://github.com/Microsoft/LightGBM/blob/1ff6c6d6bca05de2c13490264ba3b6bf7f993d4a/python-package/lightgbm/basic.py#L1392
@StrikerRUS I think that will not fail by the latest code |
@guolinke Unfortunately, it still fails (
|
@StrikerRUS you need to set the reference for valid data:
|
@guolinke Yeah, thanks! Completely forgot about this. Now everything is OK. Maybe add this snippet as a regression test? |
@StrikerRUS yeah, sure! |
@guolinke Test has been added. Please check. |
Fixed #1690.
Fixed #1763.