This repository has been archived by the owner on Jun 24, 2020. It is now read-only.

Is it necessary to stick to schema of current binary format? #3

Closed
trivialfis opened this issue Dec 12, 2018 · 11 comments

Comments

@trivialfis
Collaborator

trivialfis commented Dec 12, 2018

Before I start working on this, please consider deprecating the current way we save the model. As specified, the schema makes the JSON file a UTF-8 transcription of the current binary format. Can we open up the possibility of a schema that matches XGBoost itself rather than the old binary format?

For example:

The // in the JSON snippets below marks comments for demonstration purposes only; they should not show up in an actual model file.

Let's take the Learner class in the draft as an example:

{
  "learner_model_param" : LearnerModelParam,
  "predictor_param" : PredictorParam,
  "name_obj" : string,
  "name_gbm" : string,
  "gbm" : GBM,
  "attributes" : StringKeyValuePairCollection,
  "eval_metrics" : [ array of string ],
  "count_poisson_max_delta_step" : floating-point
}

Here the draft specifies that we save predictor_param and count_poisson_max_delta_step, which don't belong to Learner itself. Instead, I propose we save something like this for Learner:

{
  // This belongs to learner, hence handled by learner
  "LearnerTrainParam": { LearnerTrainParam },   
  // No `LearnerModelParameter`, we don't need it since JSON can save complete model.

  "predictor" : "gpu_predictor",
  "gpu_predictor" : {
    "GPUPredictionParam": { ... }
  },

  "gbm" : "gbtree",
  "gbtree" : { GBM },

  // This can also be an array, I won't argue which one is better.
  "eval_metrics" : {
    "merror": {...},
    "mae": {...}
  }
}

Update: Actually, predictor should be handled inside gbm, but let's keep it here for the sake of demonstration.

For the actual IO of GPUPredictionParam, we leave it to gpu_predictor. The same goes for GBM.
To do that, we can implement something like this in the Learner class:

void Learner::Load(KVStore& kv_store) {
  std::string predictor_name = kv_store["predictor"].ToString();  // say "gpu_predictor" or "cpu_predictor"
  auto p_predictor = Predictor::Create(predictor_name);
  p_predictor->Load(kv_store[predictor_name]);

  // (.. other IO ...)

  KVStore& eval_metrics = kv_store["eval_metrics"];
  std::vector<std::unique_ptr<Metric>> metrics;
  for (auto const& kv : eval_metrics.ToMap()) {
    auto p_metric = Metric::Create(kv.first);  // create each metric by name
    p_metric->Load(eval_metrics);              // each metric looks itself up
    metrics.push_back(std::move(p_metric));
  }
}

Inside a Metric, let's say mae:

void Metric::Load(KVStore& kv_store) {
  KVStore self = kv_store["mae"];  // Look up itself by name.
  // Load parameters from `self`.
  // Load other stuff if needed.
}

Motivation

The reasons I want to do it this way are:

  1. No extra_attributes. That was a remedy for not being able to add new fields; now we can add them directly.
  2. Modular. Each C++ class handles what it owns, so when special code is needed to handle changes in the dumped model, Learner doesn't get bloated.
  3. Organized. The model becomes much easier for humans to interpret, since it follows a hierarchy that matches XGBoost itself: what you see in the JSON file is what the internals of XGBoost look like.
  4. Complete. Described in (3).
  5. Other models. Related to (2). We can save the linear model easily, because it handles its own IO.
  6. Can incorporate existing configuration. Inside XGBoost, functions like Configure and Init are just another way of loading the class itself from parameters.

The most important one is (2).

Possible objections

  1. Size of the model file.
    Most of the added fields are parameters. They are an important part of the model; a clean and complete representation should be worth the space.
  2. Schema-less.
    Previously I used split_evaluator as an example in RFC: JSON as Next-Generation Model Serialization Format dmlc/xgboost#3980 (comment). It's possible that we (cc @RAMitchell) replace it with something simpler because it isn't compatible with GPU. So we should still have a schema, just slightly more complicated than the current one.
@trivialfis trivialfis changed the title Is it necessary to stick to current binary format? Is it necessary to stick to schema of current binary format? Dec 12, 2018
@hcho3
Owner

hcho3 commented Dec 12, 2018

@trivialfis

Size of the model file.

With #4, the tree nodes will be more compact, so the issue is less severe.

It's possible that we remove it completely due to not being compatible with GPU

Can we keep split_evaluator? Quite a few people use monotonic and feature interaction constraints. Incompatibility with GPU alone is not a good reason to remove a well-used feature.

@trivialfis
Collaborator Author

@hcho3 Sorry for the ambiguity. The word should have been "replace": replace it with something simpler (anything without a virtual function pointer).

@hcho3
Owner

hcho3 commented Dec 12, 2018

But otherwise I concur with saving a complete snapshot of XGBoost. We do want to be careful to provide a way for the user to override the saved configuration, so that, say, a user can load a model trained with GPUs and use it on a machine without GPUs.

@hcho3
Owner

hcho3 commented Dec 12, 2018

@trivialfis Ah I see. Let's come back to it later to find a good way to simplify split_evaluator. For the current discussion, it should be sufficient to save 1) the split evaluator sequence (a string) and 2) the evaluator parameters as key-value pairs (both strings).
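A sketch of what that could look like in the model file (the field names here are illustrative, not final, and the parameter values are made up):

```json
{
  "split_evaluator": "monotonic,interaction",
  "split_evaluator_params": {
    "monotone_constraints": "(1,-1)",
    "interaction_constraints": "[[0, 1], [2, 3]]"
  }
}
```

This keeps the schema flat: a string naming the evaluator sequence, plus string key-value pairs, with no serialized C++ objects involved.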

@hcho3
Owner

hcho3 commented Dec 12, 2018

@trivialfis Back to the discussion: should every class be serialized? If not, which should be? Can we make a list?

@trivialfis
Collaborator Author

I edited the comment just before your first reply :-)

provide a way for the user to override the save configurations

No problem; at the least, the user needs to change "gpu_hist" to "hist", and other things can be configured by Learner. Besides, changing parameters after loading should work out of the box.

should every class be serialized?

I will start working on this once my proposal gets approved by the other participants.

@hcho3
Owner

hcho3 commented Dec 12, 2018

A little tidbit:

No LearnerModelParameter, we don't need it since JSON can save complete model

I think we still need LearnerModelParam, since we need to store base_score, num_feature, and num_class.
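For illustration, the kept LearnerModelParam could then serialize to something like this (the values here are made up, only the three field names come from the discussion above):

```json
{
  "LearnerModelParam": {
    "base_score": 0.5,
    "num_feature": 127,
    "num_class": 3
  }
}
```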

@trivialfis
Collaborator Author

@hcho3 Aha, totally forgot. I will review all parameters/objects when drafting the list.

@hcho3
Owner

hcho3 commented Dec 12, 2018

@trivialfis Can you look at #4 as well, when you get a chance?

@trivialfis
Collaborator Author

@hcho3 Sorry for the long delay. Should be able to make a PR in a few days.

@trivialfis
Collaborator Author

Closing, as the draft is now in #5.
