Question on record counts in trees #437

chmielot · 2019-09-27T10:13:28Z

TL;DR:
I am parsing PMML files that were generated with r2pmml from ranger models. Unfortunately, it seems that the actual record counts are not available from the ranger object, so r2pmml only has access to the final probabilities. The resulting PMML files are syntactically correct, but they have semantic issues that make parsing them more tedious. Would it be possible to add the actual counts to the model? That would make automated processing of ranger PMML files easier and the PMML files more explicit.

Long version:
The PMML spec requires a recordCount, so r2pmml sets the recordCount values to the calculated probabilities and calls them "relative record counts". This results in PMML like this:

<Node score="1">
    <ScoreDistribution value="0" recordCount="0.3"/>
    <ScoreDistribution value="1" recordCount="0.7"/>
</Node>

All other PMML files I've encountered, including the samples in the PMML spec, use absolute record counts and set the total number in the Node tag for convenience (although it could be calculated by summing the individual recordCounts):

<Node score="1" recordCount="10">
    <ScoreDistribution value="0" recordCount="3"/>
    <ScoreDistribution value="1" recordCount="7"/>
</Node>

So far, only ranger PMMLs are different.

Following the spec:

"The value of recordCount in a Node serves as a base size for recordCount values in ScoreDistribution elements."

I can calculate the probabilities myself, i.e. 3/10 and 7/10. There is an optional probability attribute on the ScoreDistribution that I would always prefer while parsing.
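
To illustrate the calculation described above, here is a minimal Python sketch. The function name `class_probabilities` is hypothetical, and the sketch assumes an un-namespaced Node element; real PMML files usually declare a default XML namespace, which would have to be included in the tag names. It prefers the optional probability attribute and otherwise divides each recordCount by the Node's recordCount, falling back to the sum of the counts when the Node has none.

```python
import xml.etree.ElementTree as ET


def class_probabilities(node):
    """Return {value: probability} for one PMML <Node> element.

    Hypothetical helper; assumes no XML namespace on the elements.
    """
    dists = node.findall("ScoreDistribution")
    # Prefer the explicit probability attribute when every entry has one.
    if dists and all(d.get("probability") is not None for d in dists):
        return {d.get("value"): float(d.get("probability")) for d in dists}
    counts = {d.get("value"): float(d.get("recordCount")) for d in dists}
    # Base size: the Node's recordCount if present, else the sum of the counts.
    base = node.get("recordCount")
    total = float(base) if base is not None else sum(counts.values())
    return {value: count / total for value, count in counts.items()}


absolute_node = ET.fromstring(
    '<Node score="1" recordCount="10">'
    '<ScoreDistribution value="0" recordCount="3"/>'
    '<ScoreDistribution value="1" recordCount="7"/>'
    "</Node>"
)
probs = class_probabilities(absolute_node)  # 3/10 for "0" and 7/10 for "1"
```

With the fallback to the sum, the same function also handles the "relative record counts" form, since fractions that sum to 1 are left essentially unchanged by the division.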

Although the "relative record counts" example above results in a technically valid PMML file (because the XSD allows floats for recordCount), "relative record counts" seem odd to me. These are my main reasons, without going into the details of actually processing the PMML:

It's tedious (though not impossible) to parse this, because recordCount is ambiguous: it may hold either a count or a fraction.
I am convinced that there is a reason why ScoreDistribution has a probability attribute (...so that record counts can be actual record counts).
To quote the spec on ScoreDistribution:

"recordCount: This attribute of ScoreDistribution is the size (in number of records) associated with the value attribute."

At least I expect actual (absolute) counts when reading this.
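
The ambiguity can be made concrete with a small heuristic. This is only a sketch under stated assumptions: the function name `looks_relative` and the tolerance are hypothetical, the elements are assumed to be un-namespaced, and a Node that carries its own recordCount is assumed to use absolute counts.

```python
import math
import xml.etree.ElementTree as ET


def looks_relative(node, tol=1e-6):
    """Guess whether a <Node>'s ScoreDistribution recordCounts are fractions.

    Hypothetical heuristic, not part of r2pmml, ranger, or the PMML spec.
    """
    if node.get("recordCount") is not None:
        # Assumption: an explicit base size implies absolute counts.
        return False
    counts = [
        float(d.get("recordCount")) for d in node.findall("ScoreDistribution")
    ]
    if not counts:
        return False
    # Relative counts: every value lies in [0, 1] and they sum to about 1.
    in_unit_range = all(0.0 <= c <= 1.0 for c in counts)
    return in_unit_range and math.isclose(sum(counts), 1.0, abs_tol=tol)


relative_node = ET.fromstring(
    '<Node score="1">'
    '<ScoreDistribution value="0" recordCount="0.3"/>'
    '<ScoreDistribution value="1" recordCount="0.7"/>'
    "</Node>"
)
absolute_node = ET.fromstring(
    '<Node score="1" recordCount="10">'
    '<ScoreDistribution value="0" recordCount="3"/>'
    '<ScoreDistribution value="1" recordCount="7"/>'
    "</Node>"
)
relative_flag = looks_relative(relative_node)
absolute_flag = looks_relative(absolute_node)
```

A guess like this cannot be fully reliable: a leaf that holds a single record of one class (counts 1 and 0, no Node recordCount) is indistinguishable from a pure leaf expressed as fractions, which is exactly why explicit absolute counts would be easier to consume.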

In the meantime I'm already in discussions with the PMML group to find out if "relative record counts" were intended. If it turns out that they were not and that they will be excluded explicitly (in the future), this would mean that there is no way to create valid ranger PMML files. A solution provided by the PMML group could be to make either probability OR recordCount mandatory, instead of always requiring a recordCount. In this case no change on the ranger side would be required and generated PMMLs would be "cleaner".

To summarize: If ranger provides the record counts, we will be able to create more explicit and "clean" PMML files (now).

mnwright · 2019-10-07T06:50:25Z

Currently, we throw away that information as soon as a terminal node is reached because we divide by the number of observations. I will check whether we can do this division later to keep the actual counts.

mnwright · 2023-09-25T19:34:10Z

In #690 we have the number of observations in each node. Does that help?

mnwright closed this as completed Sep 25, 2023
