Add new task type: "save_binary". #3651

Merged: 2 commits, Feb 3, 2021

Conversation

cyfdecyf
Contributor

The main motivation for adding the save_binary task type is to save memory.

The python function Dataset.save_binary() has two drawbacks:

  1. Uses much more memory than the data itself
    • triple the memory usage, to be more exact, though I didn't investigate this too deeply
  2. Cannot use the two_round parameter, which the command-line version offers to reduce memory usage

If you're interested in merging this PR, I'll also update the corresponding docs.

@cyfdecyf
Contributor Author

I've looked at this failed check, which is a compilation failure caused by missing cstdint header in meta.h. Guess this is not caused by my change?

@jameslamb
Collaborator

> I've looked at this failed check, which is a compilation failure caused by missing cstdint header in meta.h. Guess this is not caused by my change?

No, you can ignore that one; it's very unreliable. Sorry! I'll re-run it manually.

Contributor Author

@cyfdecyf cyfdecyf left a comment

Comment for some logging tweaks.

@@ -223,7 +224,7 @@ Dataset* DatasetLoader::LoadFromFile(const char* filename, int rank, int num_mac
ConstructBinMappersFromTextData(rank, num_machines, sample_data, parser.get(), dataset.get());
// initialize label
dataset->metadata_.Init(dataset->num_data_, weight_idx_, group_idx_);
- Log::Debug("Making second pass...");
+ Log::Info("Making second pass...");
Contributor Author

This log is helpful to confirm the two_round parameter is effective.
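
As a rough illustration of why a second pass saves memory, the sketch below reads a text stream twice: the first pass only counts rows, and the second pass parses values into preallocated storage, so only one raw line is held at a time. This is a simplified assumption about the general two-pass idea, not LightGBM's loader; `LoadFirstColumnTwoRound` is a hypothetical name, and it parses only the leading number on each line.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical two-pass loader (illustrative only, not LightGBM code).
// Pass 1 counts non-empty rows; pass 2 parses the leading number of each row.
// Only a single line of raw text is kept in memory at any time.
inline std::vector<double> LoadFirstColumnTwoRound(std::istream& in) {
  std::size_t num_rows = 0;
  std::string line;

  // First pass: count rows so the output can be sized up front.
  while (std::getline(in, line)) {
    if (!line.empty()) ++num_rows;
  }

  // Rewind the stream for the second pass.
  in.clear();
  in.seekg(0);

  std::vector<double> column;
  column.reserve(num_rows);

  // Second pass: parse values directly into the preallocated vector.
  while (std::getline(in, line)) {
    if (line.empty()) continue;
    column.push_back(std::stod(line));  // std::stod stops at the first non-numeric character
  }
  return column;
}
```

Calling it on a `std::istringstream` holding `"1.5,2\n\n3,4\n"` yields a vector with the two values 1.5 and 3.0.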


auto t2 = std::chrono::high_resolution_clock::now();
Log::Info("Construct bin mappers from text data time %.2f seconds",
std::chrono::duration<double, std::milli>(t2 - t1) * 1e-3);
Contributor Author

Just to add more timing information for the whole process.

@@ -27,7 +27,7 @@ namespace LightGBM {

/*! \brief Types of tasks */
enum TaskType {
- kTrain, kPredict, kConvertModel, KRefitTree
+ kTrain, kPredict, kConvertModel, KRefitTree, kSaveBinary
Collaborator

@cyfdecyf Thanks for your contribution! Could you please resolve all conflicts and document the new enum option?

// [no-save]
// [doc-only]
// type = enum
// default = train
// options = train, predict, convert_model, refit
// alias = task_type
// desc = ``train``, for training, aliases: ``training``
// desc = ``predict``, for prediction, aliases: ``prediction``, ``test``
// desc = ``convert_model``, for converting model file into if-else format, see more information in `Convert Parameters <#convert-parameters>`__
// desc = ``refit``, for refitting existing models with new data, aliases: ``refit_tree``
// desc = **Note**: can be used only in CLI version; for language-specific packages you can use the correspondent functions
TaskType task = TaskType::kTrain;

After adding description of save_binary task, run this script to regenerate documentation.
https://github.com/microsoft/LightGBM/blob/master/helpers/parameter_generator.py

Contributor Author

Sorry for the late reply. Conflicts are resolved, and I've added documentation for the new enum option.

Collaborator

@StrikerRUS StrikerRUS left a comment

LGTM! Thanks for your contribution!

@@ -102,6 +102,7 @@ struct Config {
// desc = ``predict``, for prediction, aliases: ``prediction``, ``test``
// desc = ``convert_model``, for converting model file into if-else format, see more information in `Convert Parameters <#convert-parameters>`__
// desc = ``refit``, for refitting existing models with new data, aliases: ``refit_tree``
+ // desc = ``save_binary``, load train (and validation) data then save dataset to binary file. Typical usage: ``save_binary`` first, then run multiple ``train`` tasks in parallel using the saved binary file
Collaborator

Many thanks for Typical usage! 👍
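
For readers following the thread, the documented task names and aliases, together with the new `save_binary` option, could map onto the enum roughly as sketched below. This is an illustrative assumption, not the code merged in this PR; `ParseTaskType` is a hypothetical function, and the enum is copied from the diff in this conversation.

```cpp
#include <stdexcept>
#include <string>

// Copied from the diff in this conversation, including the new kSaveBinary value.
enum TaskType { kTrain, kPredict, kConvertModel, KRefitTree, kSaveBinary };

// Hypothetical parser (illustrative only): maps a task string, including the
// aliases listed in the parameter docs, to the corresponding enum value.
inline TaskType ParseTaskType(const std::string& value) {
  if (value == "train" || value == "training") return kTrain;
  if (value == "predict" || value == "prediction" || value == "test") return kPredict;
  if (value == "convert_model") return kConvertModel;
  if (value == "refit" || value == "refit_tree") return KRefitTree;
  if (value == "save_binary") return kSaveBinary;
  throw std::invalid_argument("Unknown task type: " + value);
}
```

In this sketch an unrecognized name throws `std::invalid_argument`; how the real CLI reports an unknown task is not covered in this thread.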

@StrikerRUS StrikerRUS merged commit 111d0c8 into microsoft:master Feb 3, 2021
@cyfdecyf cyfdecyf deleted the save-binary branch July 12, 2021 00:08
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023