[Docs] 4.0.0 parameters missing from model.txt #6010

Closed
adfea9c0 opened this issue Jul 26, 2023 · 20 comments · Fixed by #6077

@adfea9c0

adfea9c0 commented Jul 26, 2023

Description

I noticed that when you save a model, some of the new parameters in the latest 4.0.0 release are not stored, e.g. the ones around quantized training. They appear to be explicitly marked as such ([no-save]) in config.h:

// [no-save]
// desc = whether to use gradient quantization when training
// desc = enabling this will discretize (quantize) the gradients and hessians into bins of ``num_grad_quant_bins``
// desc = with quantized training, most arithmetics in the training process will be integer operations
// desc = gradient quantization can accelerate training, with little accuracy drop in most cases
// desc = **Note**: can be used only with ``device_type = cpu``
// desc = *New in version 4.0.0*
bool use_quantized_grad = false;
// [no-save]
// desc = number of bins to quantization gradients and hessians
// desc = with more bins, the quantized training will be closer to full precision training
// desc = **Note**: can be used only with ``device_type = cpu``
// desc = *New in 4.0.0*
int num_grad_quant_bins = 4;
// [no-save]
// desc = whether to renew the leaf values with original gradients when quantized training
// desc = renewing is very helpful for good quantized training accuracy for ranking objectives
// desc = **Note**: can be used only with ``device_type = cpu``
// desc = *New in 4.0.0*
bool quant_train_renew_leaf = false;
// [no-save]
// desc = whether to use stochastic rounding in gradient quantization
// desc = *New in 4.0.0*
bool stochastic_rounding = true;

Is this intentional? It seems to me these parameters ought to be saved.
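
A minimal sketch that reproduces this (an illustrative snippet, assuming the 4.0.0 Python package and that model_to_string() returns the same text that save_model() writes):

import numpy as np
import lightgbm as lgb

# train a tiny model with quantized training enabled
X = np.random.normal(size=(1_000, 5))
y = np.random.normal(size=(1_000,))
ds = lgb.Dataset(X, label=y)
model = lgb.train({"verbosity": -1, "use_quantized_grad": True}, train_set=ds, num_boost_round=1)

# check whether the [no-save] parameters appear anywhere in the saved model text
text = model.model_to_string()
for name in ["use_quantized_grad", "num_grad_quant_bins", "quant_train_renew_leaf", "stochastic_rounding"]:
    print(name, name in text)  # prints False for each of them in 4.0.0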

@jameslamb
Collaborator

Thanks for using LightGBM.

It seems to me these parameters ought to be saved.

I believe this was intentional in #5800. Those parameters don't affect serving a model to generate predictions (similar conversation: #6017 (comment)).

Could you explain specifically why you think they should be saved?

@jameslamb
Collaborator

I thought about it some more tonight and I support adding these to model files; let's see if other maintainers agree: #6017 (comment)

But either way, please share as much detail as you can about why you think they "ought to be saved".

@jameslamb
Copy link
Collaborator

By the way, I modified the code link in your initial post to one that's tagged to a specific commit. "lines 596-622 of config.h" will point to different code in the future as that file changes, which might make it difficult to understand the intent of this issue. Doing that also has the side benefit of getting a nice inline view of the exact lines you were referring to.

If you're not familiar with that, see https://docs.github.com/en/repositories/working-with-files/using-files/getting-permanent-links-to-files#press-y-to-permalink-to-a-file-in-a-specific-commit.

@mjmckp
Contributor

mjmckp commented Aug 24, 2023

Please also ensure data_sample_strategy is persisted to the model file.

@adfea9c0
Author

adfea9c0 commented Aug 29, 2023

But either way, please share as much detail as you can about why you think they "ought to be saved".

Reproducibility and hyperparameter logging in machine learning are valuable in general -- if I go back and look at a model.txt I trained earlier, being able to see what hyperparameters I used to train it seems like a nice feature to have. I understand there are limitations to this; we can't put the whole dataset in the model.txt. But a few hyperparameters seem ~free to add.

those parameters don't affect serving a model to generate predictions (similar conversation: #6017 (comment)).

I think this is true for a lot of parameters that are currently printed? My model.txt is currently full of hyperparameters, many of which I didn't even use (tweedie_variance_power? machine_list_filename when I'm running single-machine training through Python? bagging_seed? etc.). I don't need to know any of these if I want to do inference. To be clear, I'm not arguing for them to be removed, quite the opposite :-)

If the concern here is file size, maybe there could be a succinct mode that only stores the bare minimum relevant for serving (so no irrelevant hyperparameters, no feature importances, and tree descriptions containing only splits/values, not counts/weights).
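
To illustrate the reproducibility point: whatever hyperparameters are saved can already be read back out of a model.txt. A rough sketch, assuming the parameters block in the text file is delimited by "parameters:" and "end of parameters" (format assumption) and that a model was previously saved to a hypothetical tmp.txt:

# rough sketch: read the saved hyperparameters back from a model.txt,
# assuming a block of "[name: value]" lines between "parameters:" and
# "end of parameters" (format assumption, not verified here)
def read_saved_params(path: str) -> dict:
    params = {}
    in_block = False
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if line == "parameters:":
                in_block = True
            elif line == "end of parameters":
                break
            elif in_block and line.startswith("[") and line.endswith("]"):
                name, _, value = line[1:-1].partition(": ")
                params[name] = value
    return params

print(read_saved_params("tmp.txt").get("use_quantized_grad"))  # None in 4.0.0, since it is not saved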

@jameslamb
Collaborator

Thanks for that.

I intentionally wanted to ask in an open-ended, non-leading way to see if there were other specific use cases you were thinking of that I hadn't considered.

Seems like there are not and your motivation for this request is "seems like it'd be nice". To be clear... I don't disagree! We'll pick this up at some point, thanks for taking the time to start a thread about it here.

@mjmckp
Contributor

mjmckp commented Aug 30, 2023 via email

@adfea9c0
Author

Also, in 4.0.0 I noticed that at some point the Python code will actually try to parse parameters from the model text file, namely:

def _get_loaded_param(self) -> Dict[str, Any]:
    buffer_len = 1 << 20
    tmp_out_len = ctypes.c_int64(0)
    string_buffer = ctypes.create_string_buffer(buffer_len)
    ptr_string_buffer = ctypes.c_char_p(*[ctypes.addressof(string_buffer)])
    _safe_call(_LIB.LGBM_BoosterGetLoadedParam(
        self._handle,
        ctypes.c_int64(buffer_len),
        ctypes.byref(tmp_out_len),
        ptr_string_buffer))
    actual_len = tmp_out_len.value
    # if buffer length is not long enough, re-allocate a buffer
    if actual_len > buffer_len:
        string_buffer = ctypes.create_string_buffer(actual_len)
        ptr_string_buffer = ctypes.c_char_p(*[ctypes.addressof(string_buffer)])
        _safe_call(_LIB.LGBM_BoosterGetLoadedParam(
            self._handle,
            ctypes.c_int64(actual_len),
            ctypes.byref(tmp_out_len),
            ptr_string_buffer))
    return json.loads(string_buffer.value.decode('utf-8'))

Though in practice the goal seems to be to override input parameters, see

if params:
    _log_warning('Ignoring params argument, using parameters from model file.')
params = self._get_loaded_param()

So this in practice will end up loading the wrong parameters, e.g.:

import numpy as np
import lightgbm as lgb
assert lgb.__version__ == "4.0.0"

n, p = 10_000, 10
X = np.random.normal(size=(n, p))
y = np.random.normal(size=(n,))
ds = lgb.Dataset(X, label=y, free_raw_data=False)

model = lgb.train({"verbosity": -1, "use_quantized_grad": True}, train_set=ds, num_boost_round=1)
print("use_quantized_grad" in model.params)
>>> True

model.save_model("tmp.txt")

booster = lgb.Booster({"verbosity": -1, "use_quantized_grad": True}, model_file="tmp.txt")
print("use_quantized_grad" in booster.params)
>>> False

TBH I don't really know what the idea is behind this parameter override when loading a model, but either way this doesn't really seem correct?

@adfea9c0
Author

Just to check; was the original reason for omitting these output parameters indeed to keep the output file small for inference? Would having a smaller inference-only text file format resolve this?

@jameslamb
Collaborator

was the original reason for omitting these output parameters indeed to keep the output file small for inference?

Not necessarily. The idea of intentionally excluding parameters from the model file originally came from a desire not to include CLI-only parameters in the model file.

For example, the CLI supports parameters like these:

  • output_model = filepath, if provided tells LightGBM to write a model in text format to that path during training
  • save_binary = boolean, tells LightGBM to write the training Dataset out to LightGBM's binary format

Having such parameters in the model config by default is problematic outside of the CLI, like in the R and Python packages, because it can result in side-effects like file-writing that users didn't ask for and which happen silently.

See this thread: #2589 (comment)

(which I found from the git blame: https://github.com/microsoft/LightGBM/blame/82033064005d07ae98cd3003190d675992061a61/include/LightGBM/config.h#L9)

I suspect that other parameters were then later marked [no-save] (omitted from the model file) for other reasons, like maybe a misunderstanding that the model file was supposed to contain only the information necessary for inference plus whatever it already contained. I'm not sure.

Would having a smaller inference-only text file format resolve this?

I don't support adding this at this time. The size of LightGBM text files is already dominated by the tree structure (which would be necessary for inference too), so I don't think a code path generating a model file which omits parameters and other model information just for the sake of file size is worth the added complexity.
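
To make the proportions concrete, a quick sketch (assuming the text format's "parameters:" / "end of parameters" markers; the exact numbers will depend on the model):

import numpy as np
import lightgbm as lgb

# compare the size of the parameters section to the whole model text;
# on any non-trivial model the tree section dominates
X = np.random.normal(size=(1_000, 5))
y = np.random.normal(size=(1_000,))
model = lgb.train({"verbosity": -1}, lgb.Dataset(X, label=y), num_boost_round=100)

text = model.model_to_string()
start = text.find("parameters:")
end = text.find("end of parameters")
param_chars = (end - start) if (start != -1 and end != -1) else 0
print(f"parameters section: {param_chars} of {len(text)} characters")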

@jameslamb
Collaborator

TBH I don't really know what the idea is behind this parameter override when loading a model, but either way this doesn't really seem correct?

I don't understand this comment. What specifically is not "correct" about that behavior you linked from the Python package?

@adfea9c0
Author

adfea9c0 commented Sep 5, 2023

I don't understand this comment. What specifically is not "correct" about that behavior you linked from the Python package?

See the code snippet I posted. I set use_quantized_grad=True in both boosters but it is not present in the parameters in the second booster.

@jameslamb
Collaborator

See the code snippet I posted. I set use_quantized_grad=True in both boosters but it is not present in the parameters in the second booster.

I understand the snippet, and understand that you mean it to say that you expected parameters passed to lgb.train() to override those loaded from the model.

But the fact that that's what you expected, or that that's what you'd prefer, doesn't mean LightGBM's current approach is not "correct".

That behavior was an intentional choice. If you look in the git blame view for the code you linked to (blame link), it'll lead you to the PR that introduced it: #5424.

From there, you can find that there was actually quite a lot of conversation about this topic on the PR:


@jmoralez when you have time could you read through those two threads again and give us your opinion on whether parameters passed through params in lgb.train() should override what's loaded from a model file?

Re-reading them and considering the conversation in this issue, I think we should have params specified in lgb.train() override any loaded from the model file if you pass in a model file.

@jmoralez
Collaborator

jmoralez commented Sep 7, 2023

The example is constructing a booster passing both params and model_file, which I'm pretty sure issues a warning about the params being ignored. I believe the conclusion was that if the user wanted to train more iterations it'd have to be done with train and passing the saved model as init_model.

@jameslamb
Collaborator

Oh!! I totally missed that it was a call to lgb.Booster() there, not a second call to lgb.train(). 😅

@adfea9c0
Author

adfea9c0 commented Sep 7, 2023

Just to clarify what my first snippet is supposed to show:

  • I train and save a booster with use_quantized_grad=True (emphasis on True)
  • I create a second booster which loads the above booster and also separately inputs use_quantized_grad=True (emphasis on True)
  • The resulting second booster does not have use_quantized_grad set at all, despite me setting it to True twice.

The point is that there is something off, whether my expectations are correct or not. Either the parameters from the model_file are leading (the current situation, as you suggested) or the passed-in parameters are leading (my expectation), but either way this should lead to use_quantized_grad=True in the final model, because I set it to True in both dictionaries. I accept that my expectations are evidently wrong, but there is still a bug here.

[deleted section because I said something dumb]

I experimented a bit, and the situation with lgb.train is actually already what I expected:

import numpy as np
import lightgbm as lgb
assert lgb.__version__ == "4.0.0"

np.random.seed(0)
n, p = 10_000, 10
X, y = np.random.normal(size=(n, p)), np.random.normal(size=(n,))
ds = lgb.Dataset(X, label=y, free_raw_data=False)

model = lgb.train({"verbosity": -1, "use_quantized_grad": True}, train_set=ds, num_boost_round=1)
model.save_model("tmp.txt")

model2 = lgb.train({"verbosity": -1}, train_set=ds, init_model="tmp.txt", num_boost_round=1)
print("use_quantized_grad" in model2.params)
>>> False

model3 = lgb.train({"verbosity": -1, "use_quantized_grad": False}, train_set=ds, init_model="tmp.txt", num_boost_round=1)
print("use_quantized_grad" in model3.params, model3.params["use_quantized_grad"])
>>> True False

So it seems to me the current situation is:

  • For lgb.train, the parameters in the input are loaded, the parameters in the init_model file are ignored. (EDIT: Actually, this may be a bug in itself and only happen for use_quantized_grad and other parameters not currently stored in the model file.)
  • For lgb.Booster, the claimed behaviour is that the parameters from the model_file are loaded and override those passed into the booster; however, as the first snippet I posted shows, this is not the case for use_quantized_grad and other parameters that are not stored in the model file.

The last sentence makes perfect sense, because if we don't store use_quantized_grad in the model file then we cannot know what its value is supposed to be when loading that file.

@adfea9c0
Author

adfea9c0 commented Sep 7, 2023

Just to drive the point home: can you please have a look at this code snippet, guess what it should output, and then try running it?

import numpy as np
import lightgbm as lgb
assert lgb.__version__ == "4.0.0"

np.random.seed(0)
n, p = 10_000, 10
X, y = np.random.normal(size=(n, p)), np.random.normal(size=(n,))
ds = lgb.Dataset(X, label=y, free_raw_data=False)

params1 = {"verbosity": -1, "use_quantized_grad": True, "bagging_fraction": 0.5}
params2 = {"verbosity": -1, "use_quantized_grad": False, "bagging_fraction": 1.0}

model = lgb.train(params1, train_set=ds, num_boost_round=1)
model.save_model("tmp.txt")

model2 = lgb.Booster(params2, model_file="tmp.txt")

print(f"{model2.params.get('use_quantized_grad', None)}")
print(f"{model2.params.get('bagging_fraction', None)}")

@jmoralez
Collaborator

jmoralez commented Sep 7, 2023

I train and save a booster with use_quantized_grad=True (emphasis on True)

This param isn't currently saved in the model file and thus can't be recovered, but it's being addressed in #6077.

I create a second booster which loads the above booster and also separately inputs use_quantized_grad=True (emphasis on True)

This issues the warning about the params argument being ignored, so it doesn't have any effect.

The resulting second booster does not have use_quantized_grad set at all, despite me setting it to True twice.

This is because of the two points above.

Actually, this may be a bug in itself and only happen for use_quantized_grad and other parameters not currently stored in the model file

Yes, since it's not currently saved in the model file there's no way to recover it; however, you can override the parameters for continued training with the lgb.train function.

For lgb.Booster, the claimed behaviour is that the parameters from the model_file are loaded and override those input the booster

This isn't stated anywhere, those parameters are meant to be used on the Python side to get information on the trained model, such as using the learned categorical features (#5246).
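
For example, something along these lines (a sketch only; the exact key names and value formatting of the loaded params are assumptions here, not verified):

import numpy as np
import lightgbm as lgb

# train with two columns marked categorical, save, reload, and read the
# categorical columns back from the params loaded out of the model file
X = np.random.randint(0, 5, size=(1_000, 3)).astype(float)
y = np.random.normal(size=(1_000,))
ds = lgb.Dataset(X, label=y, categorical_feature=[0, 1])

model = lgb.train({"verbosity": -1}, ds, num_boost_round=1)
model.save_model("tmp_cat.txt")

loaded = lgb.Booster(model_file="tmp_cat.txt")
print(loaded.params.get("categorical_feature"))  # key/format may differ from this sketch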

@jameslamb
Collaborator

#6077 was just merged, so as of latest master and the next LightGBM release, all of the parameters related to quantized training will be stored in the model file.
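
For anyone finding this later, a quick way to verify once that release is out (a hypothetical check, same shape as the earlier snippets in this thread):

import numpy as np
import lightgbm as lgb

# with #6077 included, the quantized-training parameters should round-trip
# through the saved model file
X = np.random.normal(size=(1_000, 5))
y = np.random.normal(size=(1_000,))
ds = lgb.Dataset(X, label=y)

model = lgb.train({"verbosity": -1, "use_quantized_grad": True}, train_set=ds, num_boost_round=1)
model.save_model("tmp.txt")

loaded = lgb.Booster(model_file="tmp.txt")
print("use_quantized_grad" in loaded.params)  # expected: True once #6077 is in the release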
