
[python-package] fix access to Dataset metadata in scikit-learn custom metrics and objectives #6108

Merged
merged 20 commits into master from python/dataset-getters Nov 7, 2023

Conversation

jameslamb
Collaborator

@jameslamb jameslamb commented Sep 22, 2023

Contributes to #3756.
Contributes to #3867.

Fixes the following errors from mypy

basic.py:2826: error: Incompatible return value type (got "Union[List[float], List[int], ndarray[Any, Any], Any, Any, None]", expected "Optional[ndarray[Any, Any]]")  [return-value]
basic.py:2838: error: Incompatible return value type (got "Union[List[float], List[int], ndarray[Any, Any], Any, None]", expected "Optional[ndarray[Any, Any]]")  [return-value]
basic.py:2850: error: Incompatible return value type (got "Union[List[float], List[List[float]], ndarray[Any, Any], Any, Any, None]", expected "Optional[ndarray[Any, Any]]")  [return-value]
basic.py:2901: error: Incompatible return value type (got "Union[List[float], Any, List[int], ndarray[Any, dtype[Any]], ndarray[Any, Any], None]", expected "Optional[ndarray[Any, Any]]")  [return-value]

These come from the fact that the return type of methods like Dataset.get_label() are different based on whether or not the Dataset has been constructed. I've left inline comments explaining the specifics, and added a unit test to ensure we're notified if future refactorings accidentally change that.

Why is this labeled breaking?

Prior to this PR, custom metric functions and objective functions for learning-to-rank tasks were passed query groups as a Python list of group sizes, despite the documentation saying that they'd be passed a numpy array of group sizes.

As of this PR, query groups are passed as a numpy array of group sizes.

It is only breaking for uses of learning-to-rank using the lightgbm.sklearn estimators with a custom objective or metric function which takes 4 arguments.

@@ -368,31 +368,31 @@ def _data_to_2d_numpy(
"It should be list of lists, numpy 2-D array or pandas DataFrame")


-def _cfloat32_array_to_numpy(cptr: "ctypes._Pointer", length: int) -> np.ndarray:
+def _cfloat32_array_to_numpy(*, cptr: "ctypes._Pointer", length: int) -> np.ndarray:
Collaborator Author

@jameslamb jameslamb Sep 22, 2023


This ensures that this function is only called with keyword arguments.

That syntax has been available since Python 3.0: https://peps.python.org/pep-3102/

Doing that eliminates the possibility of bugs caused by accidentally passing arguments in the wrong order, and (in my opinion) makes the code a bit easier to read.
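For illustration, here's a minimal sketch of the keyword-only syntax from PEP 3102 (the function and parameter names here are hypothetical, not lightgbm's):

```python
def scale(*, values, factor):
    """Every parameter after the bare '*' must be passed by keyword."""
    return [v * factor for v in values]

scale(values=[1.0, 2.0], factor=3.0)   # OK: returns [3.0, 6.0]
# scale([1.0, 2.0], 3.0)               # TypeError: positional args not allowed
```

Swapping `values` and `factor` by position simply can't happen; a mistake fails loudly at the call site instead of producing silently-wrong results.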

@@ -2834,7 +2849,7 @@ def get_feature_name(self) -> List[str]:
ptr_string_buffers))
return [string_buffers[i].value.decode('utf-8') for i in range(num_feature)]

-def get_label(self) -> Optional[np.ndarray]:
+def get_label(self) -> Optional[_LGBM_LabelType]:
Collaborator Author


These issues in the Dataset.get_{metadata}() methods were the direct cause of the mypy errors mentioned in the description.

Here's an example.

import lightgbm as lgb
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0]])
dtrain = lgb.Dataset(
    data=X,
    label=[1, 2],
    params={
        "min_data_in_bin": 1,
        "min_data_in_leaf": 1,
    },
)

# 'label' was passed in as a list, get_label() returns that list
type(dtrain.get_label())
# <class 'list'>

# after construction, this is pulled from the C++ side, and is a numpy array
dtrain.construct()
type(dtrain.get_label())
# <class 'numpy.ndarray'>

@@ -151,14 +151,18 @@ def __call__(self, preds: np.ndarray, dataset: Dataset) -> Tuple[np.ndarray, np.
The value of the second order derivative (Hessian) of the loss
with respect to the elements of preds for each sample point.
"""
-labels = dataset.get_label()
+labels = dataset.get_field("label")
Collaborator Author


After correcting the type hints in Dataset methods, mypy rightly started complaining about these parts of the sklearn interface, where custom objective functions and metric functions are called.

With many instances like this:

sklearn.py:157: error: Argument 1 has incompatible type "Union[List[float], List[int], ndarray[Any, Any], Any, None]"; expected "Optional[ndarray[Any, Any]]"  [arg-type]

That's because we say in documentation and type hints that sklearn custom objective functions and eval metric functions should only expect to be passed numpy arrays or None.

_LGBM_ScikitCustomObjectiveFunction = Union[
    # f(labels, preds)
    Callable[
        [Optional[np.ndarray], np.ndarray],
        Tuple[np.ndarray, np.ndarray]
    ],
    # f(labels, preds, weights)
    Callable[
        [Optional[np.ndarray], np.ndarray, Optional[np.ndarray]],
        Tuple[np.ndarray, np.ndarray]
    ],
    # f(labels, preds, weights, group)
    Callable[
        [Optional[np.ndarray], np.ndarray, Optional[np.ndarray], Optional[np.ndarray]],
        Tuple[np.ndarray, np.ndarray]
    ],
]

Expects a callable with following signatures:
``func(y_true, y_pred)``,
``func(y_true, y_pred, weight)``
or ``func(y_true, y_pred, weight, group)``
and returns (grad, hess):
y_true : numpy 1-D array of shape = [n_samples]
The target values.
y_pred : numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples, n_classes] (for multi-class task)
The predicted values.
Predicted values are returned before any transformation,
e.g. they are raw margin instead of probability of positive class for binary task.
weight : numpy 1-D array of shape = [n_samples]
The weight of samples. Weights should be non-negative.
group : numpy 1-D array

This PR proposes fixing that by switching from get_label() to get_field("label") (and similarly for the other metadata getters). Custom metrics and eval functions in the sklearn interface should only ever be called on already-constructed Dataset objects, so that should be safe. It should also reduce the risk of some mistake accidentally leading to something other than numpy arrays being passed into sklearn custom objective and metric functions.

Collaborator


Have you looked into the performance impact of this? This method is called on each iteration and we're currently just returning an attribute from the dataset but this would require making a call to C++ to get the field each time. I think it just returns the pointer so it may not be that expensive but it's something to consider given that we'd be making several calls to get all the fields that this method needs.

Collaborator Author


No, I didn't consider that this would be a performance issue. You're right, it could be!

I think it just returns the pointer

Nah, your instincts that this might be more expensive were good! This call to LGBM_DatasetGetField() ...

_safe_call(_LIB.LGBM_DatasetGetField(
    self._handle,
    _c_str(field_name),
    ctypes.byref(tmp_out_len),
    ctypes.byref(ret),
    ctypes.byref(out_type)))

... is just populating a pointer ...

if (dataset->GetFloatField(field_name, out_len, reinterpret_cast<const float**>(out_ptr))) {

*out_ptr = metadata_.label();

...but then that gets materialized into a new numpy array (i.e. a full copy is made in memory):

if out_type.value == _C_API_DTYPE_INT32:
    arr = _cint32_array_to_numpy(ctypes.cast(ret, ctypes.POINTER(ctypes.c_int32)), tmp_out_len.value)
elif out_type.value == _C_API_DTYPE_FLOAT32:
    arr = _cfloat32_array_to_numpy(ctypes.cast(ret, ctypes.POINTER(ctypes.c_float)), tmp_out_len.value)
elif out_type.value == _C_API_DTYPE_FLOAT64:
    arr = _cfloat64_array_to_numpy(ctypes.cast(ret, ctypes.POINTER(ctypes.c_double)), tmp_out_len.value)

return np.ctypeslib.as_array(cptr, shape=(length,)).copy()
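To see why that final `.copy()` is the expensive part, here's a small standalone sketch of the zero-copy view vs. the materialized array (the buffer here is illustrative, standing in for the pointer populated by LGBM_DatasetGetField()):

```python
import ctypes
import numpy as np

# A C-style float buffer, standing in for memory owned by the C++ side.
buf = (ctypes.c_float * 3)(1.0, 2.0, 3.0)
cptr = ctypes.cast(buf, ctypes.POINTER(ctypes.c_float))

view = np.ctypeslib.as_array(cptr, shape=(3,))  # zero-copy view over the C memory
arr = view.copy()                               # independent allocation (what get_field() returns)

buf[0] = 99.0
# the view reflects the mutation; the copy does not
```

So every `get_field()` call pays for a fresh allocation and memcpy, which is exactly the per-iteration overhead being discussed here.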

So I think you're right, calling get_field() unconditionally might not be a good way to do this. That's just wasted effort relative to .get_label(), .get_weight(), etc., since, on a constructed Dataset, .label, .weight, etc. should already contain numpy arrays regardless of what format the raw data was passed in.

Thanks to these calls in Dataset._lazy_init():

if label is not None:
    self.set_label(label)
if self.get_label() is None:
    raise ValueError("Label should not be None")
if weight is not None:
    self.set_weight(weight)
if group is not None:
    self.set_group(group)
if position is not None:
    self.set_position(position)

So given all that... let me think about whether there's a better way to resolve these errors from mypy. I don't think they're things we should just ignore... I do feel we should add some stronger guarantees that custom objective and metric functions are only ever passed an already-constructed Dataset.

Thanks for bringing it up!

Collaborator Author


Alright @jmoralez I think I found a better way to do this. I just pushed 017e5e5, with the following changes:

  • switches back to using get_label() and get_weight() instead of get_field()
    • because those will return the data cached on the Dataset object without reconstructing a new numpy array on every iteration
  • keeps get_field("group") for group
    • that'll still introduce a performance penalty (reconstructing an array on every iteration if the objective function / eval metric takes 4 arguments, i.e. is for learning-to-rank)
    • but I think we need to do that... right now the docs say such functions should expect group to be a numpy array, but lightgbm is currently passing them a list
    • group : numpy 1-D array
  • fixes the mypy errors by putting get_label() and get_weight() calls inside functions that use assert isinstance(x, np.ndarray) for type narrowing
  • adds docs to Dataset.get_{group/init_score/weight/position/label} explaining that the return type will be different for a constructed Dataset

@jmoralez whenever you have time, could you please take another look? Thanks again for bringing this up!

Collaborator


Can't we do for group what we do for other fields like init_score and weight where after we set the field we set the Dataset attribute to the result of get_field?

self.set_field('init_score', init_score)
self.init_score = self.get_field('init_score') # original values can be modified at cpp side

self.set_field('weight', weight)
self.weight = self.get_field('weight') # original values can be modified at cpp side

Collaborator Author


We could but that would be a user-facing breaking change. Look at the test cases in this PR... even after construction, Dataset.get_group() returns whatever was passed in (most commonly a list).

Do you think it's worth the breaking change to get that consistency?

Collaborator


I think it is, it's weird having all of the arguments of a custom objective be arrays except for group. Although the main difference I think is not really the data structure but the format that they're in (group boundaries vs lengths). Given that the docstring says array and it's more consistent with the other fields I'd prefer we override it with get_field. On the other hand I haven't used ranking a lot so I'm not sure which format is more convenient for the objective function.
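For anyone less familiar with the two formats: they're easy to interconvert, since boundaries are just a cumulative sum of the sizes with a leading zero (a sketch, not lightgbm code):

```python
import numpy as np

sizes = np.array([1, 1])                               # group sizes, e.g. what get_group() describes
boundaries = np.concatenate([[0], np.cumsum(sizes)])   # boundary format, e.g. get_field("group")
# boundaries is [0, 1, 2]: group i spans rows boundaries[i]:boundaries[i + 1]

recovered = np.diff(boundaries)                        # adjacent differences recover the sizes
```

So no information is lost either way; the choice is purely about which representation is more convenient inside an objective function.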

Collaborator Author


Ok I think your reasoning makes sense and that we should change it to an array.

Especially since tonight I found that it'll be set to an array of boundaries the first time you call get_group() on a Dataset where the .group attribute is None, which can happen when loading from a binary file:

import lightgbm as lgb
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0]])

dtrain = lgb.Dataset(
    X,
    params={
        "min_data_in_bin": 1,
        "min_data_in_leaf": 1,
        "verbosity": -1
    },
    group=[1, 1],
    label=[1, 2],
)
dtrain.construct()

# get_group() returns a list of sizes
assert dtrain.get_group() == [1, 1]

# get_field() returns a numpy array of boundaries
np.testing.assert_array_equal(
    dtrain.get_field("group"),
    np.array([0, 1, 2])
)


# round-trip to and from binary file
dtrain.save_binary("dtrain.bin")
dtrain2 = lgb.Dataset(
    data="dtrain.bin",
    params={
        "min_data_in_bin": 1,
        "min_data_in_leaf": 1,
        "verbosity": -1
    }
)

# before construction, group is empty
assert dtrain2.group is None

# after construction, get_group() returns a numpy array of sizes
dtrain2.construct()
np.testing.assert_array_equal(
    dtrain2.get_group(),
    np.array([1, 1])
)

# ... and get_field() returns a numpy array of boundaries
np.testing.assert_array_equal(
    dtrain2.get_field("group"),
    np.array([0, 1, 2])
)

That doesn't matter specifically for the scikit-learn interface (which I don't believe could ever encounter a binary dataset file), but it does give me more confidence that other parts of the code base aren't implicitly relying on Dataset.group being a list.


I just did the following:

  • pushed ada18f8 with these changes
  • changed the label on this PR to breaking
  • changed the title of the PR
  • added a section in the description explaining why this is breaking

Thanks for talking through it with me, I know this is way down in the depths of the library and that reviewing it takes a lot of effort.

Collaborator Author


I won't merge this until you've had another chance to review @jmoralez . Take your time!

@jameslamb jameslamb changed the title WIP: [python-package] fix type hints and access patterns for Dataset metadata [python-package] fix type hints and access patterns for Dataset metadata Sep 22, 2023
@jameslamb jameslamb marked this pull request as ready for review September 22, 2023 05:13
@jameslamb
Copy link
Collaborator Author

The Python 3.7 jobs are failing because np.testing.assert_array_equal() only gained the keyword argument strict (which raises an assertion error on dtype and shape differences) about a year ago: numpy/numpy#21595.

For example, from the bdist (macOS-latest, Python 3.7) job (build link)

        if getenv('TASK', '') != 'cuda':
            np.testing.assert_array_equal(
                dtrain.position,
                np.array([0.0, 1.0], dtype=np.float32),
>               strict=True
            )
E           TypeError: assert_array_equal() got an unexpected keyword argument 'strict'

I'll push a fix.
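One way to keep the stricter check on newer numpy while staying compatible with the older numpy available on Python 3.7 would be a guarded helper like this (a sketch under the assumption that the strict keyword landed in numpy 1.24, per the linked numpy PR; not necessarily the fix that was pushed):

```python
import numpy as np


def assert_array_equal_strict(actual, expected):
    """Use strict=True where available; otherwise emulate it by
    checking shape and dtype explicitly before comparing values."""
    if np.lib.NumpyVersion(np.__version__) >= "1.24.0":
        np.testing.assert_array_equal(actual, expected, strict=True)
    else:
        assert actual.shape == expected.shape
        assert actual.dtype == expected.dtype
        np.testing.assert_array_equal(actual, expected)
```

Both branches then fail on a dtype or shape mismatch, so the test asserts the same invariant regardless of the numpy version on the CI runner.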

@jameslamb jameslamb changed the title [python-package] fix type hints and access patterns for Dataset metadata [python-package] fix type hints, access patterns for Dataset metadata in scikit-learn custom metrics and objectives Oct 27, 2023
@jameslamb jameslamb changed the title [python-package] fix type hints, access patterns for Dataset metadata in scikit-learn custom metrics and objectives [python-package] fix access to Dataset metadata in scikit-learn custom metrics and objectives Oct 27, 2023
Collaborator

@jmoralez jmoralez left a comment


LGTM! Just minor suggestions

python-package/lightgbm/sklearn.py
python-package/lightgbm/sklearn.py
tests/python_package_test/test_basic.py
Co-authored-by: José Morales <jmoralz92@gmail.com>
@jameslamb
Collaborator Author

Thanks very much for the thorough reviews, especially @jmoralez for talking through the different options! I know it takes a lot of effort to context-switch back into something this deep into the library, I really appreciate the time you took.

@jameslamb jameslamb merged commit aeafccf into master Nov 7, 2023
41 checks passed
@jameslamb jameslamb deleted the python/dataset-getters branch November 7, 2023 21:01
david-cortes pushed a commit to david-cortes/LightGBM that referenced this pull request Nov 8, 2023
@jameslamb jameslamb mentioned this pull request Nov 14, 2023
24 tasks