Random Forest Conversions and Consumption #5

koinadn · 2016-03-02T21:35:58Z

Hello,

I'm having a few issues testing a random forest classifier exported with sklearn2pmml in JPMML. I'm producing a simple PMML file with the following code:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

import pandas
import sklearn_pandas

iris = load_iris()

iris_df = pandas.concat((pandas.DataFrame(iris.data[:, :], columns = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]), pandas.DataFrame(iris.target, columns = ["Species"])), axis = 1)

iris_mapper = sklearn_pandas.DataFrameMapper([('Sepal.Length',None),
                                              ('Sepal.Width', None), 
                                              ('Petal.Width', None),
                                              ('Petal.Width', None),
                                              ('Species',None)])

iris = iris_mapper.fit_transform(iris_df)

from sklearn.ensemble import RandomForestClassifier

iris_X = iris[:, 0:4]
iris_y = iris[:, 4]

iris_classifier = RandomForestClassifier(n_estimators=10)
iris_classifier.fit(iris_X, iris_y)

from sklearn2pmml import sklearn2pmml

sklearn2pmml(iris_classifier, iris_mapper, "randomforest.pmml")

1. I'd like to do no transformations across my data set. Leaving them all blank transforms the data type from double to float.

            <DerivedField name="x1" optype="continuous" dataType="float">
                <FieldRef field="Sepal.Length"/>

This causes an issue with JPMML throwing the following error:

org.jpmml.evaluator.TypeCheckException: Expected FLOAT, but got DOUBLE (5.3)
    at org.jpmml.evaluator.TypeUtil.toFloat(TypeUtil.java:456)
    at org.jpmml.evaluator.TypeUtil.cast(TypeUtil.java:330)
    at org.jpmml.evaluator.TypeUtil.parseOrCast(TypeUtil.java:61)
    at org.jpmml.evaluator.FieldValueUtil.create(FieldValueUtil.java:92)
    at org.jpmml.evaluator.FieldValueUtil.refine(FieldValueUtil.java:144)
    at org.jpmml.evaluator.FieldValueUtil.refine(FieldValueUtil.java:116)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:82)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:63)
    at org.jpmml.evaluator.PredicateUtil.evaluateSimplePredicate(PredicateUtil.java:95)
    at org.jpmml.evaluator.PredicateUtil.evaluate(PredicateUtil.java:54)
    at org.jpmml.evaluator.TreeModelEvaluator.evaluateNode(TreeModelEvaluator.java:171)
    at org.jpmml.evaluator.TreeModelEvaluator.handleTrue(TreeModelEvaluator.java:186)
    at org.jpmml.evaluator.TreeModelEvaluator.evaluateTree(TreeModelEvaluator.java:139)
    at org.jpmml.evaluator.TreeModelEvaluator.evaluateClassification(TreeModelEvaluator.java:111)
    at org.jpmml.evaluator.TreeModelEvaluator.evaluate(TreeModelEvaluator.java:80)
    at org.jpmml.evaluator.MiningModelEvaluator.evaluateSegmentation(MiningModelEvaluator.java:463)
    at org.jpmml.evaluator.MiningModelEvaluator.evaluateClassification(MiningModelEvaluator.java:244)
    at org.jpmml.evaluator.MiningModelEvaluator.evaluate(MiningModelEvaluator.java:133)
    at org.jpmml.evaluator.MiningModelEvaluator.evaluate(MiningModelEvaluator.java:106)
    at org.jpmml.evaluator.ModelEvaluator.evaluate(ModelEvaluator.java:263)
    at com.ea.eadp.risk.service.pmml.impl.PMMLEvaluatorImpl.evaluate(PMMLEvaluatorImpl.java:114)
    at com.ea.eadp.risk.service.pmml.impl.PMMLLoadTestServiceImpl.evaluate(PMMLLoadTestServiceImpl.java:167)
    at com.ea.eadp.risk.service.pmml.impl.PMMLLoadTestServiceImpl.runLoadTest(PMMLLoadTestServiceImpl.java:99)
    at com.ea.eadp.risk.service.pmml.impl.PMMLLoadTestServiceImpl.runLoadTest(PMMLLoadTestServiceImpl.java:31)
    at com.ea.eadp.test.jpmml.Program.main(Program.java:22)

Is there a way to not use DataFrameMapper or would I have to manually change each of the float types back into double?

2. After changing the data types to work around the above issue, JPMML complains about the model with the following error, and I'm unable to evaluate it. Any ideas on the cause?

org.jpmml.evaluator.TypeCheckException: Expected org.jpmml.evaluator.HasProbability, but got org.jpmml.evaluator.ClassificationMap (ClassificationMap{type=VOTE, vote_entries=[0=0.0, 1=0.3, 2=0.7]})
    at org.jpmml.evaluator.OutputUtil.asResultFeature(OutputUtil.java:862)
    at org.jpmml.evaluator.OutputUtil.getProbability(OutputUtil.java:489)
    at org.jpmml.evaluator.OutputUtil.evaluate(OutputUtil.java:182)
    at org.jpmml.evaluator.MiningModelEvaluator.evaluate(MiningModelEvaluator.java:143)
    at org.jpmml.evaluator.MiningModelEvaluator.evaluate(MiningModelEvaluator.java:106)
    at org.jpmml.evaluator.ModelEvaluator.evaluate(ModelEvaluator.java:263)
    at com.ea.eadp.risk.service.pmml.impl.PMMLEvaluatorImpl.evaluate(PMMLEvaluatorImpl.java:114)
    at com.ea.eadp.risk.service.pmml.impl.PMMLLoadTestServiceImpl.evaluate(PMMLLoadTestServiceImpl.java:167)
    at com.ea.eadp.risk.service.pmml.impl.PMMLLoadTestServiceImpl.runLoadTest(PMMLLoadTestServiceImpl.java:99)
    at com.ea.eadp.risk.service.pmml.impl.PMMLLoadTestServiceImpl.runLoadTest(PMMLLoadTestServiceImpl.java:31)
    at com.ea.eadp.test.jpmml.Program.main(Program.java:22)

Thank you!


vruusmann · 2016-03-02T23:03:10Z

What version of JPMML are you using?

First, the conversion from double to float was not available in older versions, because it loses numeric precision. However, it has been enabled since September 2015 (eg. see jpmml/jpmml-evaluator@2079ebc).

Please note that SkLearn uses 32-bit floating-point values for representing tree split conditions. Therefore, it is absolutely necessary to use the same datatype in PMML also, because otherwise some splits may be evaluated incorrectly.
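To illustrate why the datatype matters for split evaluation, here is a minimal sketch using only the Python standard library. The threshold is the one that appears in the PMML posted later in this thread; the input value 2.6 and the helper `to_float32` are assumptions made up for the example, not part of SkLearn or JPMML.

```python
import struct


def to_float32(value):
    """Round a 64-bit Python float to the nearest 32-bit float (hypothetical helper)."""
    return struct.unpack("f", struct.pack("f", value))[0]


# Split threshold copied from the SimplePredicate element in the exported PMML.
threshold = 2.5999999046325684

# Hypothetical Petal.Length input, supplied by the caller as a 64-bit double.
petal_length = 2.6

# Comparing the raw double against the threshold fails the "lessOrEqual" test.
as_double = petal_length <= threshold

# Casting the input to a 32-bit float first makes the same test pass.
as_float = to_float32(petal_length) <= to_float32(threshold)

print(as_double, as_float)  # False True
```

The same record takes a different branch depending on the datatype, which is the kind of incorrect split evaluation the paragraph above warns about.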

Second, you're working with a classification-type problem, where the final (ie. the top-level MiningModel element) class probability distribution is calculated by applying the average aggregation function over all member (ie. nested TreeModel elements) class probability distributions. JPMML uses interface org.jpmml.evaluator.HasProbability to expose that information to interested parties.
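The average aggregation described above can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not JPMML code; the function name `average_distributions` and the per-tree probabilities are invented for the example.

```python
def average_distributions(distributions):
    """Average a list of per-tree class probability distributions, label by label."""
    labels = distributions[0].keys()
    count = len(distributions)
    return {label: sum(d[label] for d in distributions) / count for label in labels}


# Hypothetical outputs of four member trees for a single input record.
tree_outputs = [
    {"0": 0.0, "1": 0.0, "2": 1.0},
    {"0": 0.0, "1": 1.0, "2": 0.0},
    {"0": 0.0, "1": 0.0, "2": 1.0},
    {"0": 0.0, "1": 0.5, "2": 0.5},
]

averaged = average_distributions(tree_outputs)

# The predicted class is the label with the highest averaged probability.
predicted = max(averaged, key=averaged.get)

print(averaged)   # {'0': 0.0, '1': 0.375, '2': 0.625}
print(predicted)  # 2
```

The averaged values are what OutputField elements with feature="probability" are expected to report for each class.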

Both of your problems can be solved by upgrading to the latest JPMML-Evaluator library version, which is 1.2.11 at the moment. Also, the upgrade should give you a considerable performance boost.

koinadn · 2016-03-03T17:35:34Z

Thank you for the quick response.

That was a good call. I was using JPMML 1.1.17. However, it seems a new error has occurred after upgrading to 1.2.11 as soon as it hits the first field name:

org.jpmml.evaluator.MissingFieldException: x3
    at org.jpmml.evaluator.ModelEvaluationContext.resolve(ModelEvaluationContext.java:150)
    at org.jpmml.evaluator.EvaluationContext.evaluate(EvaluationContext.java:64)
    at org.jpmml.evaluator.PredicateUtil.evaluateSimplePredicate(PredicateUtil.java:106)
    at org.jpmml.evaluator.PredicateUtil.evaluatePredicate(PredicateUtil.java:63)
    at org.jpmml.evaluator.PredicateUtil.evaluate(PredicateUtil.java:51)
    at org.jpmml.evaluator.TreeModelEvaluator.evaluateNode(TreeModelEvaluator.java:188)
    at org.jpmml.evaluator.TreeModelEvaluator.handleTrue(TreeModelEvaluator.java:205)
    at org.jpmml.evaluator.TreeModelEvaluator.evaluateTree(TreeModelEvaluator.java:146)
    at org.jpmml.evaluator.TreeModelEvaluator.evaluateClassification(TreeModelEvaluator.java:118)
    at org.jpmml.evaluator.TreeModelEvaluator.evaluate(TreeModelEvaluator.java:87)
    at org.jpmml.evaluator.MiningModelEvaluator.evaluateSegmentation(MiningModelEvaluator.java:349)
    at org.jpmml.evaluator.MiningModelEvaluator.evaluateClassification(MiningModelEvaluator.java:176)
    at org.jpmml.evaluator.MiningModelEvaluator.evaluate(MiningModelEvaluator.java:143)
    at org.jpmml.evaluator.MiningModelEvaluator.evaluate(MiningModelEvaluator.java:115)
    at org.jpmml.evaluator.MiningModelEvaluator.evaluate(MiningModelEvaluator.java:110)
    at com.ea.eadp.risk.service.pmml.impl.PMMLEvaluatorImpl.evaluate(PMMLEvaluatorImpl.java:111)
    at com.ea.eadp.risk.service.pmml.impl.PMMLLoadTestServiceImpl.evaluate(PMMLLoadTestServiceImpl.java:167)
    at com.ea.eadp.risk.service.pmml.impl.PMMLLoadTestServiceImpl.runLoadTest(PMMLLoadTestServiceImpl.java:99)
    at com.ea.eadp.risk.service.pmml.impl.PMMLLoadTestServiceImpl.runLoadTest(PMMLLoadTestServiceImpl.java:31)
    at com.ea.eadp.test.jpmml.Program.main(Program.java:22)

However, the PMML (as generated from the code above) does include the names as derived fields:


    <DataDictionary>
        <DataField name="Species" optype="categorical" dataType="string">
            <Value value="0"/>
            <Value value="1"/>
            <Value value="2"/>
        </DataField>
        <DataField name="Sepal.Length" optype="continuous" dataType="double"/>
        <DataField name="Sepal.Width" optype="continuous" dataType="double"/>
        <DataField name="Petal.Length" optype="continuous" dataType="double"/>
        <DataField name="Petal.Width" optype="continuous" dataType="double"/>
    </DataDictionary>
    <MiningModel functionName="classification">
        <MiningSchema>
            <MiningField name="Species" usageType="target"/>
            <MiningField name="Sepal.Length"/>
            <MiningField name="Sepal.Width"/>
            <MiningField name="Petal.Length"/>
            <MiningField name="Petal.Width"/>
        </MiningSchema>
        <Output>
            <OutputField name="probability_0" feature="probability" value="0"/>
            <OutputField name="probability_1" feature="probability" value="1"/>
            <OutputField name="probability_2" feature="probability" value="2"/>
        </Output>
        <LocalTransformations>
            <DerivedField name="x1" optype="continuous" dataType="float">
                <FieldRef field="Sepal.Length"/>
            </DerivedField>
            <DerivedField name="x2" optype="continuous" dataType="float">
                <FieldRef field="Sepal.Width"/>
            </DerivedField>
            <DerivedField name="x3" optype="continuous" dataType="float">
                <FieldRef field="Petal.Length"/>
            </DerivedField>
            <DerivedField name="x4" optype="continuous" dataType="float">
                <FieldRef field="Petal.Width"/>
            </DerivedField>
        </LocalTransformations>
        <Segmentation multipleModelMethod="average">
            <Segment id="1">
                <True/>
                <TreeModel functionName="classification" splitCharacteristic="binarySplit">
                    <MiningSchema>
                        <MiningField name="Sepal.Width"/>
                        <MiningField name="Petal.Length"/>
                        <MiningField name="Petal.Width"/>
                    </MiningSchema>
                    <Node id="1">
                        <True/>
                        <Node id="2" score="0" recordCount="54.0">
                            <SimplePredicate field="x3" operator="lessOrEqual" value="2.5999999046325684"/>

Any idea on the cause of this?

Thanks!

vruusmann · 2016-03-03T17:45:20Z

This is a legitimate bug now. The derived field x3 depends on the user-supplied field Petal.Length, but the latter is not "imported" by the MiningSchema element of the TreeModel element.

Probably, this happens because your DataFrameMapper object does not specify any transformations for input fields.
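A rough way to look for this kind of problem in a PMML file is sketched below with Python's standard XML parser. The fragment is hand-written for the example and deliberately omits Petal.Length from the tree's MiningSchema; the function name `find_unimported_fields` is hypothetical. It also assumes the elements carry no XML namespace, whereas a real PMML document normally declares one, so the tag names would need the namespace prefix there.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment shaped like the PMML in this thread (no namespace, for brevity).
PMML_FRAGMENT = """
<MiningModel functionName="classification">
  <LocalTransformations>
    <DerivedField name="x3" optype="continuous" dataType="float">
      <FieldRef field="Petal.Length"/>
    </DerivedField>
    <DerivedField name="x4" optype="continuous" dataType="float">
      <FieldRef field="Petal.Width"/>
    </DerivedField>
  </LocalTransformations>
  <Segmentation multipleModelMethod="average">
    <Segment id="1">
      <True/>
      <TreeModel functionName="classification">
        <MiningSchema>
          <MiningField name="Petal.Width"/>
        </MiningSchema>
        <Node id="1">
          <True/>
          <Node id="2" score="0">
            <SimplePredicate field="x3" operator="lessOrEqual" value="2.5999999046325684"/>
          </Node>
        </Node>
      </TreeModel>
    </Segment>
  </Segmentation>
</MiningModel>
"""


def find_unimported_fields(model):
    """Return (predicate field, source field) pairs whose source field is not
    listed in the MiningSchema of the TreeModel that uses it."""
    # Map each derived field to the input field it references.
    derived = {}
    for field in model.iter("DerivedField"):
        ref = field.find("FieldRef")
        if ref is not None:
            derived[field.get("name")] = ref.get("field")

    problems = []
    for tree in model.iter("TreeModel"):
        imported = {f.get("name") for f in tree.find("MiningSchema").iter("MiningField")}
        for predicate in tree.iter("SimplePredicate"):
            name = predicate.get("field")
            source = derived.get(name, name)
            if source not in imported:
                problems.append((name, source))
    return problems


problems = find_unimported_fields(ET.fromstring(PMML_FRAGMENT))
print(problems)  # [('x3', 'Petal.Length')]
```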

koinadn · 2016-03-03T22:59:59Z

Thanks again for the quick response.

I tested the same code with the original PCA transformation in the example:

iris_mapper = sklearn_pandas.DataFrameMapper([
    (["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"], PCA(n_components = 3)),
    ("Species", None)
])

And the PMML was consumed and evaluated properly so that does seem to be the cause of the error.

I do believe there are some cases where no transformations would be used so that would be nice to have.

Thank you.

vruusmann · 2016-03-07T21:55:17Z

The conversion produces an invalid PMML document, because your DataFrameMapper object contains a duplicate mapping for the Petal.Width field (and no mapping for the Petal.Length field). If this typo is corrected, then the mapping to None transform works as intended.

I've updated the JPMML-SkLearn library to do extra sanity checking along those lines: jpmml/jpmml-sklearn@7d0578a
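The duplicate mapping can also be caught on the Python side before conversion. The sketch below is a plain-Python check over the same list of tuples that was passed to DataFrameMapper in the original post; it is not what the JPMML-SkLearn commit does, and the helper name `mapped_columns` is made up for the example.

```python
from collections import Counter

# The feature list from the original post, including the typo.
mapping = [
    ("Sepal.Length", None),
    ("Sepal.Width", None),
    ("Petal.Width", None),
    ("Petal.Width", None),
    ("Species", None),
]

# Columns of the iris data frame built in the original post.
expected_columns = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"]


def mapped_columns(mapping):
    """Flatten the column selectors, which may be a single name or a list of names."""
    names = []
    for columns, _transformer in mapping:
        if isinstance(columns, str):
            names.append(columns)
        else:
            names.extend(columns)
    return names


names = mapped_columns(mapping)
duplicates = [name for name, count in Counter(names).items() if count > 1]
missing = [name for name in expected_columns if name not in names]

print(duplicates)  # ['Petal.Width']
print(missing)     # ['Petal.Length']
```

Replacing the second ('Petal.Width', None) entry with ('Petal.Length', None) makes both lists empty.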

tbayrak · 2017-05-23T12:39:44Z

Hi,

I've tried the same code above to create a PMML file, but got the following error:
TypeError: The pipeline object is not an instance of PMMLPipeline

any suggestions? Thanks

vruusmann closed this as completed Mar 7, 2016
