-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[python] added get_data() method to Dataset class #1870
Changes from all commits
462cf16
6fa5b09
0520429
daf85e5
d8aa16e
c36b89f
1ff6c6d
c15c8e4
4bed1ae
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -687,6 +687,7 @@ def __init__(self, data, label=None, reference=None, | |
self.params = copy.deepcopy(params) | ||
self.free_raw_data = free_raw_data | ||
self.used_indices = None | ||
self.need_slice = True | ||
self._predictor = None | ||
self.pandas_categorical = None | ||
self.params_back_up = None | ||
|
@@ -974,6 +975,8 @@ def construct(self): | |
ctypes.c_int(used_indices.shape[0]), | ||
c_str(params_str), | ||
ctypes.byref(self.handle))) | ||
self.data = self.reference.data | ||
self.get_data() | ||
if self.group is not None: | ||
self.set_group(self.group) | ||
if self.get_label() is None: | ||
|
@@ -1041,7 +1044,8 @@ def subset(self, used_indices, params=None): | |
if params is None: | ||
params = self.params | ||
ret = Dataset(None, reference=self, feature_name=self.feature_name, | ||
categorical_feature=self.categorical_feature, params=params) | ||
categorical_feature=self.categorical_feature, params=params, | ||
free_raw_data=self.free_raw_data) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. maybe a better logic still pass There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Reworked |
||
ret._predictor = self._predictor | ||
ret.pandas_categorical = self.pandas_categorical | ||
ret.used_indices = used_indices | ||
|
@@ -1375,6 +1379,27 @@ def get_init_score(self): | |
self.init_score = self.get_field('init_score') | ||
return self.init_score | ||
|
||
def get_data(self): | ||
"""Get the raw data of the Dataset. | ||
|
||
Returns | ||
------- | ||
data : string, numpy array, pandas DataFrame, scipy.sparse, list of numpy arrays or None | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What about There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we cannot slide them, maybe we should throw a warning or error here. |
||
Raw data used in the Dataset construction. | ||
""" | ||
if self.handle is None: | ||
raise Exception("Cannot get data before construct Dataset") | ||
if self.data is not None and self.used_indices is not None and self.need_slice: | ||
if isinstance(self.data, np.ndarray) or scipy.sparse.issparse(self.data): | ||
self.data = self.data[self.used_indices, :] | ||
elif isinstance(self.data, DataFrame): | ||
self.data = self.data.iloc[self.used_indices].copy() | ||
else: | ||
warnings.warn("Cannot subset {} type of raw data.\n" | ||
"Returning original raw data".format(type(self.data).__name__)) | ||
self.need_slice = False | ||
return self.data | ||
|
||
def get_group(self): | ||
"""Get the group of the Dataset. | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we don't need the
get_data
here ?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought that we need a consistent state of the object after calling
construct()
.Or do you think that it's unneeded and we should allow users to access data only via
get_data()
(remove mentions aboutdata
field)?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, I see. As only subset will call the slicing, I think the overhead cost is acceptable.
Change the definition of
data
may cause many dependency problems.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's expected that only subsetting branch of
construct
callsget_data()
. Also,get_data()
itself checksused_indices
:https://github.com/Microsoft/LightGBM/blob/1ff6c6d6bca05de2c13490264ba3b6bf7f993d4a/python-package/lightgbm/basic.py#L1392