[python-package] fix access to Dataset metadata in scikit-learn custom metrics and objectives #6108
Conversation
```diff
@@ -368,31 +368,31 @@ def _data_to_2d_numpy(
             "It should be list of lists, numpy 2-D array or pandas DataFrame")


-def _cfloat32_array_to_numpy(cptr: "ctypes._Pointer", length: int) -> np.ndarray:
+def _cfloat32_array_to_numpy(*, cptr: "ctypes._Pointer", length: int) -> np.ndarray:
```
---
This ensures that this function is only called with keyword arguments.
That syntax has been available since Python 3.0: https://peps.python.org/pep-3102/
Doing that eliminates the possibility of accidentally-passed-arguments-in-the-wrong-order types of bugs, and (in my opinion) makes the code a bit easier to read.
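For illustration, here's what that syntax looks like in a standalone sketch (the function and parameter names are hypothetical, not from this PR):

```python
# PEP 3102: a bare `*` in the signature makes every following parameter keyword-only
def scale(*, value: float, factor: float) -> float:
    return value * factor

scale(value=2.0, factor=3.0)  # OK
# scale(2.0, 3.0)             # TypeError: scale() takes 0 positional arguments
```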
```diff
@@ -2834,7 +2849,7 @@ def get_feature_name(self) -> List[str]:
             ptr_string_buffers))
         return [string_buffers[i].value.decode('utf-8') for i in range(num_feature)]

-    def get_label(self) -> Optional[np.ndarray]:
+    def get_label(self) -> Optional[_LGBM_LabelType]:
```
---
These issues in the `Dataset.get_{metadata}()` methods were the direct cause of the `mypy` errors mentioned in the description. Here's an example.
```python
import lightgbm as lgb
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0]])

dtrain = lgb.Dataset(
    data=X,
    label=[1, 2],
    params={
        "min_data_in_bin": 1,
        "min_data_in_leaf": 1,
    },
)

# 'label' was passed in as a list, get_label() returns that list
type(dtrain.get_label())
# <class 'list'>

# after construction, this is pulled from the C++ side, and is a numpy array
dtrain.construct()
type(dtrain.get_label())
# <class 'numpy.ndarray'>
```
python-package/lightgbm/sklearn.py
```diff
@@ -151,14 +151,18 @@ def __call__(self, preds: np.ndarray, dataset: Dataset) -> Tuple[np.ndarray, np.
             The value of the second order derivative (Hessian) of the loss
             with respect to the elements of preds for each sample point.
         """
-        labels = dataset.get_label()
+        labels = dataset.get_field("label")
```
---
After correcting the type hints in the `Dataset` methods, `mypy` rightly started complaining about these parts of the `sklearn` interface, where custom objective functions and metric functions are called, with many instances like this:

```text
sklearn.py:157: error: Argument 1 has incompatible type "Union[List[float], List[int], ndarray[Any, Any], Any, None]"; expected "Optional[ndarray[Any, Any]]"  [arg-type]
```

That's because we say in the documentation and type hints that `sklearn` custom objective functions and eval metric functions should only expect to be passed `numpy` arrays or `None`.
LightGBM/python-package/lightgbm/sklearn.py, lines 35 to 51 in 7c9a985:
```python
_LGBM_ScikitCustomObjectiveFunction = Union[
    # f(labels, preds)
    Callable[
        [Optional[np.ndarray], np.ndarray],
        Tuple[np.ndarray, np.ndarray]
    ],
    # f(labels, preds, weights)
    Callable[
        [Optional[np.ndarray], np.ndarray, Optional[np.ndarray]],
        Tuple[np.ndarray, np.ndarray]
    ],
    # f(labels, preds, weights, group)
    Callable[
        [Optional[np.ndarray], np.ndarray, Optional[np.ndarray], Optional[np.ndarray]],
        Tuple[np.ndarray, np.ndarray]
    ],
]
```
LightGBM/python-package/lightgbm/sklearn.py, lines 101 to 115 in 7c9a985:
```text
Expects a callable with following signatures:
``func(y_true, y_pred)``,
``func(y_true, y_pred, weight)``
or ``func(y_true, y_pred, weight, group)``
and returns (grad, hess):

    y_true : numpy 1-D array of shape = [n_samples]
        The target values.
    y_pred : numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples, n_classes] (for multi-class task)
        The predicted values.
        Predicted values are returned before any transformation,
        e.g. they are raw margin instead of probability of positive class for binary task.
    weight : numpy 1-D array of shape = [n_samples]
        The weight of samples. Weights should be non-negative.
    group : numpy 1-D array
```
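For reference, a custom objective matching the first signature above could look like this (a hypothetical squared-error objective, for illustration only, not code from this PR):

```python
from typing import Optional, Tuple

import numpy as np

def squared_error(labels: Optional[np.ndarray], preds: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    assert labels is not None
    grad = preds - labels       # first derivative of 0.5 * (preds - labels) ** 2
    hess = np.ones_like(preds)  # second derivative is constant
    return grad, hess
```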
This PR proposes fixing that by switching from `get_label()` to `get_field("label")` (and similarly for all the other metadata). Custom metrics and eval functions in the `sklearn` interface should only ever be called on already-constructed `Dataset` objects, so that should be safe. It should also reduce the risk of some mistake accidentally leading to something other than numpy arrays being passed into `sklearn` custom objective and metric functions.
---
Have you looked into the performance impact of this? This method is called on each iteration and we're currently just returning an attribute from the dataset but this would require making a call to C++ to get the field each time. I think it just returns the pointer so it may not be that expensive but it's something to consider given that we'd be making several calls to get all the fields that this method needs.
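A rough way to quantify that (a sketch, not a rigorous benchmark; the dataset shape here is chosen arbitrarily):

```python
import timeit

import lightgbm as lgb
import numpy as np

X = np.random.rand(100_000, 10)
dtrain = lgb.Dataset(X, label=np.random.rand(100_000)).construct()

# cached attribute access vs. materializing a fresh numpy array from the C++ side
t_attr = timeit.timeit(lambda: dtrain.get_label(), number=1_000)
t_field = timeit.timeit(lambda: dtrain.get_field("label"), number=1_000)
print(f"get_label():        {t_attr:.4f}s / 1000 calls")
print(f"get_field('label'): {t_field:.4f}s / 1000 calls")
```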
---
No, I didn't consider that this would be a performance issue. You're right, it could be!

> I think it just returns the pointer

Nah, your instincts that this might be more expensive were good! This call to `LGBM_DatasetGetField()` ...
LightGBM/python-package/lightgbm/basic.py, lines 2522 to 2527 in 63a882b:
```python
_safe_call(_LIB.LGBM_DatasetGetField(
    self._handle,
    _c_str(field_name),
    ctypes.byref(tmp_out_len),
    ctypes.byref(ret),
    ctypes.byref(out_type)))
```
... is just populating a pointer ...
Line 1697 in 63a882b:

```cpp
if (dataset->GetFloatField(field_name, out_len, reinterpret_cast<const float**>(out_ptr))) {
```
Line 956 in 63a882b:

```cpp
*out_ptr = metadata_.label();
```
...but then that gets materialized into a new `numpy` array (i.e. a full copy is made in memory):
LightGBM/python-package/lightgbm/basic.py, lines 2532 to 2537 in 63a882b:
```python
if out_type.value == _C_API_DTYPE_INT32:
    arr = _cint32_array_to_numpy(ctypes.cast(ret, ctypes.POINTER(ctypes.c_int32)), tmp_out_len.value)
elif out_type.value == _C_API_DTYPE_FLOAT32:
    arr = _cfloat32_array_to_numpy(ctypes.cast(ret, ctypes.POINTER(ctypes.c_float)), tmp_out_len.value)
elif out_type.value == _C_API_DTYPE_FLOAT64:
    arr = _cfloat64_array_to_numpy(ctypes.cast(ret, ctypes.POINTER(ctypes.c_double)), tmp_out_len.value)
```
LightGBM/python-package/lightgbm/basic.py, line 374 in 63a882b:

```python
return np.ctypeslib.as_array(cptr, shape=(length,)).copy()
```
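That `.copy()` is the expensive part; here's a small self-contained illustration of why the copy is needed (a hypothetical buffer, not LightGBM's actual memory):

```python
import ctypes

import numpy as np

buf = (ctypes.c_double * 3)(1.0, 2.0, 3.0)  # memory owned outside of numpy
view = np.ctypeslib.as_array(buf)           # zero-copy view over that memory
owned = view.copy()                         # new allocation, independent of buf

buf[0] = 99.0
print(view[0])   # 99.0 -- the view sees the mutation
print(owned[0])  # 1.0  -- the copy does not
```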
So I think you're right, unconditionally calling `get_field()` might not be a good way to do this. That's just wasted effort relative to `.get_label()`, `.get_weight()`, etc., since on a constructed `Dataset`, `.label`, `.weight`, etc. should already contain `numpy` arrays regardless of what format the raw data was passed in, thanks to these calls in `Dataset._lazy_init()`:
LightGBM/python-package/lightgbm/basic.py, lines 1958 to 1967 in 63a882b:
```python
if label is not None:
    self.set_label(label)
if self.get_label() is None:
    raise ValueError("Label should not be None")
if weight is not None:
    self.set_weight(weight)
if group is not None:
    self.set_group(group)
if position is not None:
    self.set_position(position)
```
So given all that... let me think about whether there's a better way to resolve these errors from `mypy`. I don't think they're just things we should ignore... I do feel we should add a stronger guarantee that custom objective and metric functions are only ever passed an already-constructed `Dataset`.
Thanks for bringing it up!
---
Alright @jmoralez, I think I found a better way to do this. I just pushed 017e5e5, with the following changes:

- switches back to using `get_label()` and `get_weight()` instead of `get_field()`
  - because those will return the data cached on the `Dataset` object without reconstructing a new `numpy` array on every iteration
- keeps `get_field("group")` for `group`
  - that'll still introduce a performance penalty (reconstructing an array on every iteration if the objective function / eval metric takes 4 arguments, i.e. is for learning-to-rank)
  - but I think we need to do that... right now the docs say such functions should expect `group` to be a `numpy` array, but `lightgbm` is currently passing them a `list` (LightGBM/python-package/lightgbm/sklearn.py, line 115 in d45dca7: `group : numpy 1-D array`)
- fixes the `mypy` errors by putting `get_label()` and `get_weight()` calls inside functions that use `assert isinstance(x, np.ndarray)` for type narrowing (see the sketch after this list)
- adds docs to `Dataset.get_{group/init_score/weight/position/label}` explaining that the return type will be different for a constructed `Dataset`
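That type-narrowing pattern, in a minimal sketch (the helper name here is illustrative, not necessarily what the pushed commit uses):

```python
import numpy as np

from lightgbm import Dataset

def _label_from_constructed_dataset(dataset: Dataset) -> np.ndarray:
    label = dataset.get_label()
    # on a constructed Dataset this is always a numpy array; the assert both
    # guarantees that at runtime and narrows the Optional[...] type for mypy
    assert isinstance(label, np.ndarray)
    return label
```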
@jmoralez whenever you have time, could you please take another look? Thanks again for bringing this up!
---
Can't we do for `group` what we do for other fields like `init_score` and `weight`, where after setting the field we set the `Dataset` attribute to the result of `get_field()`?
LightGBM/python-package/lightgbm/basic.py, lines 2742 to 2743 in 8ed371c:

```python
self.set_field('init_score', init_score)
self.init_score = self.get_field('init_score')  # original values can be modified at cpp side
```
LightGBM/python-package/lightgbm/basic.py, lines 2720 to 2721 in 8ed371c:

```python
self.set_field('weight', weight)
self.weight = self.get_field('weight')  # original values can be modified at cpp side
```
---
We could, but that would be a user-facing breaking change. Look at the test cases in this PR... even after construction, `Dataset.get_group()` returns whatever was passed in (most commonly a list).

Do you think it's worth the breaking change to get that consistency?
---
I think it is; it's weird having all of the arguments of a custom objective be arrays except for `group`. Although the main difference, I think, is not really the data structure but the format they're in (group boundaries vs. lengths). Given that the docstring says array and it's more consistent with the other fields, I'd prefer we override it with `get_field`. On the other hand, I haven't used ranking a lot, so I'm not sure which format is more convenient for the objective function.
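For concreteness, the two encodings relate like this (toy numbers):

```python
import numpy as np

sizes = np.array([3, 2, 4])                           # rows in each query group
boundaries = np.concatenate([[0], np.cumsum(sizes)])  # array([0, 3, 5, 9])

# adjacent differences of the boundaries recover the sizes
assert np.array_equal(np.diff(boundaries), sizes)
```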
---
Ok, I think your reasoning makes sense and that we should change it to an array, especially since tonight I found that it'll be set to a numpy array the first time you call `get_group()` on a `Dataset` where the `.group` attribute is `None`, which can happen when loading from a binary file:
```python
import lightgbm as lgb
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0]])

dtrain = lgb.Dataset(
    X,
    params={
        "min_data_in_bin": 1,
        "min_data_in_leaf": 1,
        "verbosity": -1
    },
    group=[1, 1],
    label=[1, 2],
)
dtrain.construct()

# get_group() returns a list of sizes
assert dtrain.get_group() == [1, 1]

# get_field() returns a numpy array of boundaries
np.testing.assert_array_equal(
    dtrain.get_field("group"),
    np.array([0, 1, 2])
)
```
```python
# round-trip to and from a binary file
dtrain.save_binary("dtrain.bin")
dtrain2 = lgb.Dataset(
    data="dtrain.bin",
    params={
        "min_data_in_bin": 1,
        "min_data_in_leaf": 1,
        "verbosity": -1
    }
)

# before construction, group is empty
assert dtrain2.group is None

# after construction, get_group() returns a numpy array of sizes
dtrain2.construct()
np.testing.assert_array_equal(
    dtrain2.get_group(),
    np.array([1, 1])
)

# ... and get_field() returns a numpy array of boundaries
np.testing.assert_array_equal(
    dtrain2.get_field("group"),
    np.array([0, 1, 2])
)
```
That doesn't matter specifically for the `scikit-learn` interface (which I don't believe could ever encounter a binary dataset file), but it does give me more confidence that other parts of the code base aren't implicitly relying on `Dataset.group` being a list.
I just did the following:

- pushed ada18f8 with these changes
- changed the label on this PR to `breaking`
- changed the title of the PR
- added a section in the description explaining why this is breaking
Thanks for talking through it with me, I know this is way down in the depths of the library and that reviewing it takes a lot of effort.
---
I won't merge this until you've had another chance to review, @jmoralez. Take your time!
The Python 3.7 jobs are failing. I'll push a fix.
---
LGTM! Just minor suggestions
Co-authored-by: José Morales <jmoralz92@gmail.com>
Thanks very much for the thorough reviews, especially @jmoralez for talking through the different options! I know it takes a lot of effort to context-switch back into something this deep in the library, I really appreciate the time you took.
[python-package] fix access to Dataset metadata in scikit-learn custom metrics and objectives (microsoft#6108)
Contributes to #3756.
Contributes to #3867.
Fixes the following errors from `mypy`. These come from the fact that the return type of methods like `Dataset.get_label()` is different based on whether or not the `Dataset` has been constructed. I've left inline comments explaining the specifics, and added a unit test to ensure we're notified if future refactorings accidentally change that.

Why is this labeled `breaking`?

Prior to this PR, custom metric functions and objective functions for learning-to-rank tasks were passed query groups as a Python `list` of group sizes, despite the documentation saying that they'd be passed a `numpy` array of group sizes.

As of this PR, query groups are passed as a `numpy` array of group sizes.

It is only breaking for uses of learning-to-rank via the `lightgbm.sklearn` estimators with a custom objective or metric function which takes 4 arguments.
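To make the affected surface concrete, here is a hypothetical 4-argument custom eval metric for `lightgbm.sklearn.LGBMRanker` (the function name and logic are illustrative, not from this PR):

```python
import numpy as np

def mean_group_size(y_true, y_pred, weight, group):
    # before this PR: `group` arrived as a Python list of group sizes
    # after this PR:  `group` is a numpy 1-D array of group sizes
    # returns (name, value, is_higher_better)
    return "mean_group_size", float(np.mean(group)), True
```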