TL;DR;
I am parsing PMML files that were generated using r2pmml from ranger models. Unfortunately, it seems that actual record counts are not available from the ranger object; r2pmml only has access to the final probabilities. This results in syntactically correct PMML files, but there are semantic issues that make parsing the files more tedious. Would it be possible to add the actual counts to the model? It would make automated processing of ranger PMML files easier and the PMML files more explicit.
Long version:
The PMML spec requires a recordCount so r2pmml sets the recordCounts to the calculated probabilities, calling the values "relative record counts". This results in PMMLs like this:
All other PMML files I've encountered, including the samples in the PMML spec, use absolute record counts and set the total number in the Node tag for convenience (although it could be calculated by aggregating the particular recordCounts). So far, only ranger PMMLs are different.
Following the spec:
"The value of recordCount in a Node serves as a base size for recordCount values in ScoreDistribution elements."
I can calculate the probabilities myself, i.e. 3/10 and 7/10. There is an optional probability attribute on the ScoreDistribution that I would always prefer while parsing.
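To illustrate what I mean, here is a minimal Python sketch of how a parser might handle this. The XML fragment is a hypothetical example I made up with absolute counts (10 records, split 3 and 7); it is not actual r2pmml output, and it omits the PMML namespace that a real file would carry. The function name `class_probabilities` is my own.

```python
import xml.etree.ElementTree as ET

# Hypothetical Node with absolute record counts (not real r2pmml output).
NODE_XML = """
<Node id="2" score="no" recordCount="10">
  <True/>
  <ScoreDistribution value="yes" recordCount="3"/>
  <ScoreDistribution value="no" recordCount="7"/>
</Node>
"""


def class_probabilities(node):
    """Return {value: probability} for one PMML Node element."""
    dists = node.findall("ScoreDistribution")
    # Prefer the optional probability attribute when every entry has one.
    if dists and all(d.get("probability") is not None for d in dists):
        return {d.get("value"): float(d.get("probability")) for d in dists}
    counts = {d.get("value"): float(d.get("recordCount")) for d in dists}
    # The Node's recordCount is the base size; fall back to the sum of
    # the ScoreDistribution counts if the Node does not state it.
    total = float(node.get("recordCount") or sum(counts.values()))
    return {value: count / total for value, count in counts.items()}


node = ET.fromstring(NODE_XML)
probs = class_probabilities(node)
print(probs)
```

With absolute counts the parser also knows how many records reached the node. With "relative record counts" the same function still returns probabilities, but the sample size is lost and the caller cannot tell which of the two conventions was used without guessing.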
Although the "relative record counts" example above results in a technically valid PMML file (because the XSD allows floats for recordCount), "relative record counts" seem odd to me. These are my main reasons, without going into the details of actually processing the PMML:
It's tedious (not impossible) to parse this, because of the ambiguity of recordCount
I am convinced that there is a reason why there is a probability attribute for the ScoreDistribution (...so that record counts can be actual record counts)
To quote the spec on ScoreDistribution:
"recordCount: This attribute of ScoreDistribution is the size (in number of records) associated with the value attribute."
At least I expect actual (absolute) counts when reading this.
In the meantime I'm already in discussions with the PMML group to find out if "relative record counts" were intended. If it turns out that they were not and that they will be excluded explicitly (in the future), this would mean that there is no way to create valid ranger PMML files. A solution provided by the PMML group could be to make either probability OR recordCount mandatory, instead of always requiring a recordCount. In this case no change on the ranger side would be required and generated PMMLs would be "cleaner".
To summarize: If ranger provides the record counts, we will be able to create more explicit and "clean" PMML files (now).
Currently, we throw away that information as soon as a terminal node is reached because we divide by the number of observations. I will check whether we can do this division later to keep the actual counts.
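The idea of dividing later can be sketched roughly as follows. This is only an illustration in Python, not ranger's actual code; the function names and the label list are made up for the example.

```python
from collections import Counter


def terminal_node_counts(labels):
    """Absolute class counts for the observations that reach a terminal node."""
    return Counter(labels)


def to_probabilities(counts):
    """Normalize the stored counts only when probabilities are requested."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Hypothetical terminal node reached by 10 observations.
labels = ["no", "yes", "no", "no", "yes", "no", "no", "yes", "no", "no"]
counts = terminal_node_counts(labels)   # kept in the model
probs = to_probabilities(counts)        # derived on demand
print(dict(counts), probs)
```

Storing the counts and deriving the probabilities on demand keeps both pieces of information available to exporters such as r2pmml.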