Random Forest Conversions and Consumption #5

Closed
koinadn opened this issue Mar 2, 2016 · 6 comments
@koinadn

koinadn commented Mar 2, 2016

Hello,

I'm having a few issues testing a random forest classifier exported with sklearn2pmml in JPMML. I'm producing a simple PMML file with the following code:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

import pandas
import sklearn_pandas

iris = load_iris()

iris_df = pandas.concat((pandas.DataFrame(iris.data[:, :], columns = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]), pandas.DataFrame(iris.target, columns = ["Species"])), axis = 1)

iris_mapper = sklearn_pandas.DataFrameMapper([('Sepal.Length',None),
                                              ('Sepal.Width', None), 
                                              ('Petal.Width', None),
                                              ('Petal.Width', None),
                                              ('Species',None)])

iris = iris_mapper.fit_transform(iris_df)

from sklearn.ensemble import RandomForestClassifier

iris_X = iris[:, 0:4]
iris_y = iris[:, 4]

iris_classifier = RandomForestClassifier(n_estimators=10)
iris_classifier.fit(iris_X, iris_y)

from sklearn2pmml import sklearn2pmml

sklearn2pmml(iris_classifier, iris_mapper, "randomforest.pmml")

1. I'd like to apply no transformations to my data set. Mapping every column to None changes the data type from double to float:

            <DerivedField name="x1" optype="continuous" dataType="float">
                <FieldRef field="Sepal.Length"/>

This causes an issue with JPMML throwing the following error:

org.jpmml.evaluator.TypeCheckException: Expected FLOAT, but got DOUBLE (5.3)
    at org.jpmml.evaluator.TypeUtil.toFloat(TypeUtil.java:456)
    at org.jpmml.evaluator.TypeUtil.cast(TypeUtil.java:330)
    at org.jpmml.evaluator.TypeUtil.parseOrCast(TypeUtil.java:61)
    at org.jpmml.evaluator.FieldValueUtil.create(FieldValueUtil.java:92)
    at org.jpmml.evaluator.FieldValueUtil.refine(FieldValueUtil.java:144)
    at org.jpmml.evaluator.FieldValueUtil.refine(FieldValueUtil.java:116)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:82)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:63)
    at org.jpmml.evaluator.PredicateUtil.evaluateSimplePredicate(PredicateUtil.java:95)
    at org.jpmml.evaluator.PredicateUtil.evaluate(PredicateUtil.java:54)
    at org.jpmml.evaluator.TreeModelEvaluator.evaluateNode(TreeModelEvaluator.java:171)
    at org.jpmml.evaluator.TreeModelEvaluator.handleTrue(TreeModelEvaluator.java:186)
    at org.jpmml.evaluator.TreeModelEvaluator.evaluateTree(TreeModelEvaluator.java:139)
    at org.jpmml.evaluator.TreeModelEvaluator.evaluateClassification(TreeModelEvaluator.java:111)
    at org.jpmml.evaluator.TreeModelEvaluator.evaluate(TreeModelEvaluator.java:80)
    at org.jpmml.evaluator.MiningModelEvaluator.evaluateSegmentation(MiningModelEvaluator.java:463)
    at org.jpmml.evaluator.MiningModelEvaluator.evaluateClassification(MiningModelEvaluator.java:244)
    at org.jpmml.evaluator.MiningModelEvaluator.evaluate(MiningModelEvaluator.java:133)
    at org.jpmml.evaluator.MiningModelEvaluator.evaluate(MiningModelEvaluator.java:106)
    at org.jpmml.evaluator.ModelEvaluator.evaluate(ModelEvaluator.java:263)
    at com.ea.eadp.risk.service.pmml.impl.PMMLEvaluatorImpl.evaluate(PMMLEvaluatorImpl.java:114)
    at com.ea.eadp.risk.service.pmml.impl.PMMLLoadTestServiceImpl.evaluate(PMMLLoadTestServiceImpl.java:167)
    at com.ea.eadp.risk.service.pmml.impl.PMMLLoadTestServiceImpl.runLoadTest(PMMLLoadTestServiceImpl.java:99)
    at com.ea.eadp.risk.service.pmml.impl.PMMLLoadTestServiceImpl.runLoadTest(PMMLLoadTestServiceImpl.java:31)
    at com.ea.eadp.test.jpmml.Program.main(Program.java:22)

Is there a way to not use DataFrameMapper or would I have to manually change each of the float types back into double?


2. After working around the issue above, JPMML complains about the model with the following error, and I'm unable to evaluate it. Any ideas on what causes this?

org.jpmml.evaluator.TypeCheckException: Expected org.jpmml.evaluator.HasProbability, but got org.jpmml.evaluator.ClassificationMap (ClassificationMap{type=VOTE, vote_entries=[0=0.0, 1=0.3, 2=0.7]})
    at org.jpmml.evaluator.OutputUtil.asResultFeature(OutputUtil.java:862)
    at org.jpmml.evaluator.OutputUtil.getProbability(OutputUtil.java:489)
    at org.jpmml.evaluator.OutputUtil.evaluate(OutputUtil.java:182)
    at org.jpmml.evaluator.MiningModelEvaluator.evaluate(MiningModelEvaluator.java:143)
    at org.jpmml.evaluator.MiningModelEvaluator.evaluate(MiningModelEvaluator.java:106)
    at org.jpmml.evaluator.ModelEvaluator.evaluate(ModelEvaluator.java:263)
    at com.ea.eadp.risk.service.pmml.impl.PMMLEvaluatorImpl.evaluate(PMMLEvaluatorImpl.java:114)
    at com.ea.eadp.risk.service.pmml.impl.PMMLLoadTestServiceImpl.evaluate(PMMLLoadTestServiceImpl.java:167)
    at com.ea.eadp.risk.service.pmml.impl.PMMLLoadTestServiceImpl.runLoadTest(PMMLLoadTestServiceImpl.java:99)
    at com.ea.eadp.risk.service.pmml.impl.PMMLLoadTestServiceImpl.runLoadTest(PMMLLoadTestServiceImpl.java:31)
    at com.ea.eadp.test.jpmml.Program.main(Program.java:22)

Thank you!

@vruusmann
Member

What version of JPMML are you using?

First, the conversion from double to float was not available in older versions, because it loses numeric precision. However, it has been enabled since September 2015 (e.g. see jpmml/jpmml-evaluator@2079ebc).

Please note that SkLearn uses 32-bit floating-point values for representing tree split conditions. Therefore, it is absolutely necessary to use the same datatype in PMML also, because otherwise some splits may be evaluated incorrectly.
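
A minimal sketch of why the data type matters, using only the Python standard library. The threshold is the split value that appears in the PMML quoted later in this thread; the input value 2.6 and the helper `to_float32` are assumptions made up for illustration, not part of JPMML or SkLearn.

```python
import struct


def to_float32(value: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", value))[0]


# Split value taken from the SimplePredicate in the thread's PMML.
threshold = 2.5999999046325684
# Hypothetical user-supplied input, chosen for illustration only.
petal_length = 2.6

# Comparing as doubles: 2.6 is slightly larger than the threshold.
as_double = petal_length <= threshold
# Comparing as floats: both sides round to the same 32-bit value.
as_float = to_float32(petal_length) <= to_float32(threshold)

print("double comparison:", as_double)
print("float comparison: ", as_float)
```

With these assumed values the same record goes down different branches depending on the data type used for the comparison, which is the kind of mismatch described above.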

Second, you're working with a classification-type problem, where the final (ie. the top-level MiningModel element) class probability distribution is calculated by applying the average aggregation function over all member (ie. nested TreeModel elements) class probability distributions. JPMML uses interface org.jpmml.evaluator.HasProbability to expose that information to interested parties.
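
The average aggregation can be sketched in plain Python as follows. This is an illustration of the idea only, not JPMML code: the function name `average_distributions` and the per-tree outputs are hypothetical, picked so that ten trees average to the values 0.0, 0.3 and 0.7 seen in the error message above.

```python
def average_distributions(distributions):
    """Average a list of per-tree class probability distributions.

    Each distribution is a dict mapping a class label to a probability.
    All dicts are assumed to share the same set of class labels.
    """
    count = len(distributions)
    labels = distributions[0].keys()
    return {
        label: sum(dist[label] for dist in distributions) / count
        for label in labels
    }


# Hypothetical outputs of ten member trees: three are certain about
# class "1", seven are certain about class "2".
tree_outputs = (
    [{"0": 0.0, "1": 1.0, "2": 0.0}] * 3
    + [{"0": 0.0, "1": 0.0, "2": 1.0}] * 7
)

averaged = average_distributions(tree_outputs)
print(averaged)
```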

Both of your problems can be solved by upgrading to the latest JPMML-Evaluator library version, which is 1.2.11 at the moment. Also, the upgrade should give you a considerable performance boost.

@koinadn
Author

koinadn commented Mar 3, 2016

Thank you for the quick response.

That was a good call. I was using JPMML 1.1.17. However, a new error now occurs after upgrading to 1.2.11, as soon as it hits the first field name:

org.jpmml.evaluator.MissingFieldException: x3
    at org.jpmml.evaluator.ModelEvaluationContext.resolve(ModelEvaluationContext.java:150)
    at org.jpmml.evaluator.EvaluationContext.evaluate(EvaluationContext.java:64)
    at org.jpmml.evaluator.PredicateUtil.evaluateSimplePredicate(PredicateUtil.java:106)
    at org.jpmml.evaluator.PredicateUtil.evaluatePredicate(PredicateUtil.java:63)
    at org.jpmml.evaluator.PredicateUtil.evaluate(PredicateUtil.java:51)
    at org.jpmml.evaluator.TreeModelEvaluator.evaluateNode(TreeModelEvaluator.java:188)
    at org.jpmml.evaluator.TreeModelEvaluator.handleTrue(TreeModelEvaluator.java:205)
    at org.jpmml.evaluator.TreeModelEvaluator.evaluateTree(TreeModelEvaluator.java:146)
    at org.jpmml.evaluator.TreeModelEvaluator.evaluateClassification(TreeModelEvaluator.java:118)
    at org.jpmml.evaluator.TreeModelEvaluator.evaluate(TreeModelEvaluator.java:87)
    at org.jpmml.evaluator.MiningModelEvaluator.evaluateSegmentation(MiningModelEvaluator.java:349)
    at org.jpmml.evaluator.MiningModelEvaluator.evaluateClassification(MiningModelEvaluator.java:176)
    at org.jpmml.evaluator.MiningModelEvaluator.evaluate(MiningModelEvaluator.java:143)
    at org.jpmml.evaluator.MiningModelEvaluator.evaluate(MiningModelEvaluator.java:115)
    at org.jpmml.evaluator.MiningModelEvaluator.evaluate(MiningModelEvaluator.java:110)
    at com.ea.eadp.risk.service.pmml.impl.PMMLEvaluatorImpl.evaluate(PMMLEvaluatorImpl.java:111)
    at com.ea.eadp.risk.service.pmml.impl.PMMLLoadTestServiceImpl.evaluate(PMMLLoadTestServiceImpl.java:167)
    at com.ea.eadp.risk.service.pmml.impl.PMMLLoadTestServiceImpl.runLoadTest(PMMLLoadTestServiceImpl.java:99)
    at com.ea.eadp.risk.service.pmml.impl.PMMLLoadTestServiceImpl.runLoadTest(PMMLLoadTestServiceImpl.java:31)
    at com.ea.eadp.test.jpmml.Program.main(Program.java:22)

However, the PMML (as generated from the code above) does include the names as derived fields:


    <DataDictionary>
        <DataField name="Species" optype="categorical" dataType="string">
            <Value value="0"/>
            <Value value="1"/>
            <Value value="2"/>
        </DataField>
        <DataField name="Sepal.Length" optype="continuous" dataType="double"/>
        <DataField name="Sepal.Width" optype="continuous" dataType="double"/>
        <DataField name="Petal.Length" optype="continuous" dataType="double"/>
        <DataField name="Petal.Width" optype="continuous" dataType="double"/>
    </DataDictionary>
    <MiningModel functionName="classification">
        <MiningSchema>
            <MiningField name="Species" usageType="target"/>
            <MiningField name="Sepal.Length"/>
            <MiningField name="Sepal.Width"/>
            <MiningField name="Petal.Length"/>
            <MiningField name="Petal.Width"/>
        </MiningSchema>
        <Output>
            <OutputField name="probability_0" feature="probability" value="0"/>
            <OutputField name="probability_1" feature="probability" value="1"/>
            <OutputField name="probability_2" feature="probability" value="2"/>
        </Output>
        <LocalTransformations>
            <DerivedField name="x1" optype="continuous" dataType="float">
                <FieldRef field="Sepal.Length"/>
            </DerivedField>
            <DerivedField name="x2" optype="continuous" dataType="float">
                <FieldRef field="Sepal.Width"/>
            </DerivedField>
            <DerivedField name="x3" optype="continuous" dataType="float">
                <FieldRef field="Petal.Length"/>
            </DerivedField>
            <DerivedField name="x4" optype="continuous" dataType="float">
                <FieldRef field="Petal.Width"/>
            </DerivedField>
        </LocalTransformations>
        <Segmentation multipleModelMethod="average">
            <Segment id="1">
                <True/>
                <TreeModel functionName="classification" splitCharacteristic="binarySplit">
                    <MiningSchema>
                        <MiningField name="Sepal.Width"/>
                        <MiningField name="Petal.Length"/>
                        <MiningField name="Petal.Width"/>
                    </MiningSchema>
                    <Node id="1">
                        <True/>
                        <Node id="2" score="0" recordCount="54.0">
                            <SimplePredicate field="x3" operator="lessOrEqual" value="2.5999999046325684"/>

Any idea on the cause of this?

Thanks!

@vruusmann
Member

This is a legitimate bug now. The derived field x3 depends on the user-supplied field Petal.Length, but the latter is not "imported" by the MiningSchema element of the TreeModel element.

Probably, this happens because your DataFrameMapper object does not specify any transformations for input fields.

@koinadn
Author

koinadn commented Mar 3, 2016

Thanks again for the quick response.

I tested the same code with the original PCA transformation in the example:

iris_mapper = sklearn_pandas.DataFrameMapper([
    (["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"], PCA(n_components = 3)),
    ("Species", None)
])

The PMML was consumed and evaluated properly, so the missing transformations do seem to be the cause of the error.

I believe there are cases where no transformations are needed, so support for that would be nice to have.

Thank you.

@vruusmann
Member

The conversion produces an invalid PMML document, because your DataFrameMapper object contains a duplicate mapping for the Petal.Width field (and no mapping for the Petal.Length field). If this typo is corrected, then the mapping to None transform works as intended.

I've updated the JPMML-SkLearn library to do extra sanity checking along those lines: jpmml/jpmml-sklearn@7d0578a
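
A rough sketch of this kind of sanity check, which could be run on the mapping list before constructing the DataFrameMapper. It is not the check implemented in the commit above; the helper `find_duplicate_columns` is a hypothetical name, and the lists simply mirror the (column, transformer) tuples used in this thread.

```python
from collections import Counter


def find_duplicate_columns(features):
    """Return the column names that are mapped more than once.

    `features` is a list of (columns, transformer) tuples, where
    `columns` is either a single column name or a list of names.
    """
    counts = Counter()
    for columns, _transformer in features:
        if isinstance(columns, str):
            columns = [columns]
        counts.update(columns)
    return sorted(name for name, n in counts.items() if n > 1)


# The mapping from the original post: Petal.Width appears twice and
# Petal.Length is missing.
original = [
    ("Sepal.Length", None),
    ("Sepal.Width", None),
    ("Petal.Width", None),
    ("Petal.Width", None),
    ("Species", None),
]

# The same mapping with the typo corrected.
corrected = [
    ("Sepal.Length", None),
    ("Sepal.Width", None),
    ("Petal.Length", None),
    ("Petal.Width", None),
    ("Species", None),
]

print(find_duplicate_columns(original))
print(find_duplicate_columns(corrected))
```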

@tbayrak

tbayrak commented May 23, 2017

Hi,

I tried the same code above to create a PMML file, but got the following error:
TypeError: The pipeline object is not an instance of PMMLPipeline

Any suggestions? Thanks
