This repository has been archived by the owner on Jun 24, 2020. It is now read-only.

Is it necessary to stick to schema of current binary format? #3

Closed
trivialfis opened this issue Dec 12, 2018 · 11 comments

Comments

@trivialfis
Collaborator

trivialfis commented Dec 12, 2018

Before I start working on this, please consider deprecating the current way we save the model. As specified, the schema makes the JSON file a UTF-8 transcription of the current binary format. Can we open up the possibility of a schema that matches XGBoost itself rather than the old binary format?

For example:

The // in the JSON snippets below marks comments for demonstration purposes only; they should not show up in an actual model file.

Let's take the Learner class in the draft as an example:

{
  "learner_model_param" : LearnerModelParam,
  "predictor_param" : PredictorParam,
  "name_obj" : string,
  "name_gbm" : string,
  "gbm" : GBM,
  "attributes" : StringKeyValuePairCollection,
  "eval_metrics" : [ array of string ],
  "count_poisson_max_delta_step" : floating-point
}

Here the draft specifies that we save predictor_param and count_poisson_max_delta_step, which don't belong to Learner itself. Instead, I propose we save something like this for Learner:

{
  // This belongs to learner, hence handled by learner
  "LearnerTrainParam": { LearnerTrainParam },   
  // No `LearnerModelParameter`, we don't need it since JSON can save complete model.

  "predictor" : "gpu_predictor",
  "gpu_predictor" : {
    "GPUPredictionParam": { ... }
  },

  "gbm" : "gbtree",
  "gbtree" : { GBM },

  // This can also be an array, I won't argue which one is better.
  "eval_metrics" : {
    "merror": {...},
    "mae": {...}
  }
}

Update: Actually, predictor should be handled inside gbm, but let's keep it here for the sake of demonstration.

For the actual IO of GPUPredictionParam, we leave it to gpu_predictor. The same goes for GBM.
To do that, we can implement something like this in the Learner class:

void Learner::Load(KVStore& kv_store) {
  std::string predictor_name = kv_store["predictor"].ToString();  // say "gpu_predictor" or "cpu_predictor"
  auto p_predictor = Predictor::Create(predictor_name);
  p_predictor->Load(kv_store[predictor_name]);

  // (.. other IO ...)

  KVStore& eval_metrics = kv_store["eval_metrics"];
  std::vector<std::unique_ptr<Metric>> metrics;
  for (auto const& kv : eval_metrics.ToMap()) {
    auto p_metric = Metric::Create(kv.first);  // create each metric by name
    p_metric->Load(eval_metrics);              // each metric looks itself up
    metrics.push_back(std::move(p_metric));
  }
}

Inside a Metric, let's say mae:

void Metric::Load(KVStore& kv_store) {
  KVStore self = kv_store["mae"];  // Look up itself by name.
  // Load parameters from `self`.
  // Load other stuff if needed.
}

Motivation

The reasons I want to do it this way are:

  1. No extra_attributes. That was a remedy for not being able to add new fields; now we can add them directly.
  2. Modular. Each C++ class handles what it owns, so when special code is needed to handle changes in the dumped model, Learner doesn't get bloated.
  3. Organized. The model becomes much easier for humans to interpret, since it follows a hierarchy that matches XGBoost itself: what you see in the JSON file is what the internals of XGBoost look like.
  4. Complete. Described in (3).
  5. Other models. Related to (2). We can save the linear model easily, because it handles its own IO.
  6. Can incorporate existing configuration. Inside XGBoost, functions like Configure and Init are just another way of loading the class itself from parameters.

The most important one is (2).

Possible objections

  1. Size of the model file.
    Most of the added fields are parameters. They are an important part of the model; a clean and complete representation should be worth the space.
  2. Schema-less.
    Previously I used split_evaluator as an example in RFC: JSON as Next-Generation Model Serialization Format dmlc/xgboost#3980 (comment). It's possible that we (cc @RAMitchell) replace it with something simpler because it isn't compatible with GPU. So we should still have a schema, just slightly more complicated than the current one.
@trivialfis trivialfis changed the title Is it necessary to stick to current binary format? Is it necessary to stick to schema of current binary format? Dec 12, 2018
@hcho3
Owner

hcho3 commented Dec 12, 2018

@trivialfis

Size of the model file.

With #4, the tree nodes will be more compact, so the issue is less severe.

It's possible that we remove it completely due to not being compatible with GPU

Can we keep split_evaluator? Quite a few people use monotonic and feature interaction constraints. Incompatibility with GPU alone is not a good reason to remove a well-used feature.

@trivialfis
Collaborator Author

@hcho3 Sorry for the ambiguity. The word should have been "replace": replace it with something simpler (anything without a virtual function pointer).

@hcho3
Owner

hcho3 commented Dec 12, 2018

But otherwise I concur with saving a complete snapshot of XGBoost. We do want to be careful to provide a way for the user to override the saved configuration, so that, say, a user can load a model trained with GPUs and use it on a machine without GPUs.

@hcho3
Owner

hcho3 commented Dec 12, 2018

@trivialfis Ah I see. Let's come back to it later to find a good way to simplify split_evaluator. For the current discussion, it should be sufficient to save 1) the split evaluator sequence (a string) and 2) the evaluator parameters as key-value pairs (both strings).
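A sketch of what that could look like in the model file (the field names here are illustrative, not final, and the parameter values are made up):

```json
{
  "split_evaluator": "monotonic,interaction",
  "split_evaluator_params": {
    "monotone_constraints": "(1,-1)",
    "interaction_constraints": "[[0, 1], [2, 3]]"
  }
}
```

This keeps the schema flat: a string naming the evaluator sequence, plus string key-value pairs, with no serialized C++ objects involved.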

@hcho3
Owner

hcho3 commented Dec 12, 2018

@trivialfis Back to the discussion: should every class be serialized? If not, which should be? Can we make a list?

@trivialfis
Collaborator Author

I edited the comment just before your first reply :-)

provide a way for the user to override the save configurations

No problem; at the least, the user needs to change "gpu_hist" to "hist", and other things can be configured by Learner. Besides, changing parameters after loading should work out of the box.

should every class be serialized?

I will start working on this once my proposal gets approved by the other participants.

@hcho3
Owner

hcho3 commented Dec 12, 2018

A little tidbit:

No LearnerModelParameter, we don't need it since JSON can save complete model

I think we still need LearnerModelParam, since we need to store base_score, num_feature, and num_class.
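For illustration, the kept LearnerModelParam could then serialize to something like this (the values here are made up, only the three field names come from the discussion above):

```json
{
  "LearnerModelParam": {
    "base_score": 0.5,
    "num_feature": 127,
    "num_class": 3
  }
}
```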

@trivialfis
Collaborator Author

@hcho3 Aha, totally forgot. I will review all parameters/objects when drafting the list.

@hcho3
Owner

hcho3 commented Dec 12, 2018

@trivialfis Can you look at #4 as well, when you get a chance?

@trivialfis
Collaborator Author

@hcho3 Sorry for the long delay. Should be able to make a PR in a few days.

@trivialfis
Collaborator Author

Closing, as the draft is now in #5.
