Question on record counts in trees #437

Closed
chmielot opened this issue Sep 27, 2019 · 2 comments

Comments

@chmielot

chmielot commented Sep 27, 2019

TL;DR
I am parsing PMML files that were generated with r2pmml from ranger models. Unfortunately, actual record counts do not seem to be available from the ranger object; r2pmml only has access to the final probabilities. The resulting PMML files are syntactically correct, but they have semantic issues that make parsing them more tedious. Would it be possible to add the actual counts to the model? That would make automated processing of ranger PMML files easier and the files themselves more explicit.

Long version:
The PMML spec requires a recordCount, so r2pmml sets the recordCount values to the calculated probabilities and calls them "relative record counts". This results in PMML like this:

<Node score="1">
    <ScoreDistribution value="0" recordCount="0.3"/>
    <ScoreDistribution value="1" recordCount="0.7"/>
</Node>

All other PMML files I've encountered, including the samples in the PMML spec, use absolute record counts and set the total in the Node tag for convenience (although it could be calculated by summing the individual recordCounts).

<Node score="1" recordCount="10">
    <ScoreDistribution value="0" recordCount="3"/>
    <ScoreDistribution value="1" recordCount="7"/>
</Node>

So far, only ranger PMMLs are different.

Following the spec:

"The value of recordCount in a Node serves as a base size for recordCount values in ScoreDistribution elements."

I can calculate the probabilities myself, i.e. 3/10 and 7/10. There is an optional probability attribute on the ScoreDistribution that I would always prefer while parsing.
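
To illustrate the parsing order described above, here is a minimal Python sketch. It assumes a bare `<Node>` fragment without the PMML XML namespace (real files declare one, so the tag lookups would need adjusting), and `class_probabilities` is a hypothetical helper name, not part of r2pmml or any PMML library. It prefers the optional probability attribute and otherwise divides each recordCount by the Node's recordCount, falling back to the sum of the entries.

```python
import xml.etree.ElementTree as ET


def class_probabilities(node_xml: str) -> dict:
    """Return {class value: probability} for a single PMML <Node> fragment."""
    node = ET.fromstring(node_xml)
    dists = node.findall("ScoreDistribution")

    # Prefer the optional probability attribute when every entry carries it.
    if dists and all(d.get("probability") is not None for d in dists):
        return {d.get("value"): float(d.get("probability")) for d in dists}

    counts = {d.get("value"): float(d.get("recordCount")) for d in dists}

    # Base size: the Node's recordCount if present, else the sum of the entries.
    base = node.get("recordCount")
    total = float(base) if base is not None else sum(counts.values())
    if total == 0:
        raise ValueError("cannot derive probabilities from a zero base size")
    return {value: count / total for value, count in counts.items()}


# The absolute-count example from this issue.
absolute = """<Node score="1" recordCount="10">
    <ScoreDistribution value="0" recordCount="3"/>
    <ScoreDistribution value="1" recordCount="7"/>
</Node>"""
probs = class_probabilities(absolute)
```

With absolute counts the result is 3/10 and 7/10. The same function also handles the "relative" form, but only because dividing by a sum of 1 happens to leave the values unchanged.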

Although the "relative record counts" example above results in a technically valid PMML file (because the XSD allows floats for recordCount), it seems odd to have "relative record counts". These are my main reasons, without going into the details of actually processing the PMML:

  • It's tedious (not impossible) to parse this, because of the ambiguity of recordCount

  • I am convinced that there is a reason why there is a probability attribute for the ScoreDistribution (...so that record counts can be actual record counts)

  • To quote the spec on ScoreDistribution:

    "recordCount: This attribute of ScoreDistribution is the size (in number of records) associated with the value attribute."

    At least I expect actual (absolute) counts when reading this.
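
To make the ambiguity concrete, this is roughly the kind of guesswork a parser has to do today. It is only a heuristic sketch under my own assumptions (no XML namespace, a hypothetical `record_count_kind` helper, an arbitrary tolerance), not something taken from the spec or from r2pmml:

```python
import math
import xml.etree.ElementTree as ET


def record_count_kind(node_xml: str, tol: float = 1e-6) -> str:
    """Guess whether a <Node> fragment uses relative or absolute record counts."""
    node = ET.fromstring(node_xml)
    counts = [
        float(d.get("recordCount")) for d in node.findall("ScoreDistribution")
    ]
    # Heuristic: no total on the Node and the entries sum to ~1.
    no_total = node.get("recordCount") is None
    if no_total and counts and math.isclose(sum(counts), 1.0, abs_tol=tol):
        return "relative"
    return "absolute"


relative = """<Node score="1">
    <ScoreDistribution value="0" recordCount="0.3"/>
    <ScoreDistribution value="1" recordCount="0.7"/>
</Node>"""
kind = record_count_kind(relative)
```

The heuristic cannot be made reliable: a leaf that really contains a single training record (counts 0 and 1, no total on the Node) also sums to 1 and would be classified as "relative".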

In the meantime I'm already in discussions with the PMML group to find out if "relative record counts" were intended. If it turns out that they were not and that they will be excluded explicitly (in the future), this would mean that there is no way to create valid ranger PMML files. A solution provided by the PMML group could be to make either probability OR recordCount mandatory, instead of always requiring a recordCount. In this case no change on the ranger side would be required and generated PMMLs would be "cleaner".

To summarize: If ranger provides the record counts, we will be able to create more explicit and "clean" PMML files (now).

@mnwright
Member

mnwright commented Oct 7, 2019

Currently, we throw away that information as soon as a terminal node is reached because we divide by the number of observations. I will check whether we can do this division later to keep the actual counts.
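
As a rough illustration of "doing the division later", here is a Python sketch. It is not ranger's actual implementation (which is C++); the function names and the example labels are hypothetical. The idea is simply to store the absolute class counts per terminal node and normalise only when probabilities are requested.

```python
from collections import Counter


def terminal_node_counts(labels):
    """Absolute class counts for the observations that reach a terminal node."""
    return Counter(labels)


def to_probabilities(counts):
    """Normalise stored counts only at the point where probabilities are needed."""
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


# Ten hypothetical observations in one leaf: three of class 0, seven of class 1.
counts = terminal_node_counts([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
probs = to_probabilities(counts)
```

Keeping `counts` around would let an exporter write both the per-class recordCount values and the Node total, while `probs` stays available for prediction.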

@mnwright
Member

In #690 we have the number of observations in each node. Does that help?
