Support for transformer-only pipelines #86

Closed
AshwinSekar opened this issue Oct 9, 2018 · 4 comments

Comments

@AshwinSekar

I understand that a PMMLPipeline must end with an estimator to be valid for conversion to PMML. I have use cases where a preprocessing-only pipeline is useful on its own, and I would like to convert it to PMML for evaluation in Java.

If I stick a DummyClassifier or DummyRegressor at the end of the pipeline, it can be written to valid PMML. However, the target_fields information is lost, and I am unsure how to recover anything but the dummy prediction from the PMML.

Is there a recommended workflow in this situation? Should I use the jpmml-plugin to create some sort of "pass through" estimator that returns the input?

Thanks for your help!

@vruusmann
Member

> Is there a recommended workflow in this situation?

There was a similar situation with Apache Spark pipelines, and we managed to find a fairly elegant solution there. However, Apache Spark pipelines are far more flexible than Scikit-Learn pipelines (e.g. they can have multiple models in a pipeline, and there can be transformers following the last model), so the solution is probably not 1:1 transferable (and I really cannot recall its technical details).

> Should I use the jpmml-plugin to create some sort of "pass through" estimator that returns the input?

Probably the easiest solution to your problem:

  1. Create a dummy-like estimator. Could very well be a subclass of DummyRegressor or DummyClassifier.
  2. In its #encodeModel(Schema) method, create an empty Output element, and append an OutputField child element for every pre-processing step that you want to pass through. Be sure to use unique field names in order to avoid naming conflicts between DerivedField and OutputField elements.

Something like this should do:

<Output>
  <OutputField name="z" dataType=".." optype="..">
    <!-- refers to a DerivedField element whose name is "internal(y)" -->
    <FieldRef field="internal(y)"/>
  </OutputField>
</Output>
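
For readers who want to experiment with the shape of this element before writing the Java side, here is a minimal Python sketch that builds the same `Output` fragment with the standard library. It is not the `#encodeModel(Schema)` implementation described above; it only constructs XML. The helper name `build_output` and the `"double"` / `"continuous"` values in the usage line are assumptions for illustration, since the fragment above leaves `dataType` and `optype` unspecified. The PMML schema may also call for further attributes (such as `feature`) that the fragment above leaves out.

```python
import xml.etree.ElementTree as ET


def build_output(field_map, data_type, optype):
    """Build an <Output> element that exposes derived fields under new names.

    field_map maps each OutputField name to the DerivedField name it refers to.
    build_output is a hypothetical helper, not part of any JPMML library.
    """
    derived_names = set(field_map.values())
    output = ET.Element("Output")
    for output_name, derived_name in field_map.items():
        # OutputField names must not collide with DerivedField names.
        if output_name in derived_names:
            raise ValueError("name conflict: " + output_name)
        output_field = ET.SubElement(
            output,
            "OutputField",
            {"name": output_name, "dataType": data_type, "optype": optype},
        )
        # The FieldRef child points at the DerivedField to pass through.
        ET.SubElement(output_field, "FieldRef", {"field": derived_name})
    return output


# Assumed values for illustration only: "double" and "continuous".
output = build_output({"z": "internal(y)"}, "double", "continuous")
print(ET.tostring(output, encoding="unicode"))
```

The name check mirrors the advice in step 2: an `OutputField` that reuses a `DerivedField` name is rejected up front instead of producing an ambiguous document.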

@vruusmann changed the title from "Is it possible to access only the pre-processing steps of a pipeline?" to "Support for transformer-only pipelines" on Oct 10, 2018

@vruusmann
Member

Re-purposed this issue. Would like to provide a solution that wouldn't require defining custom estimator types and renaming fields.

Perhaps the sklearn2pmml.pipeline.PMMLPipeline class should have a marker attribute transformation_only (or similar), which would inform the JPMML-SkLearn backend that the final estimator step (if any) should be skipped.

@AshwinSekar
Author

Thanks for the suggestion, I will look into creating a dummy estimator.

I noticed that the TransformationDictionary actually contains all of the transforms in my pipeline in the form of derived fields. Is there any way I can use these derived fields to extract the transformed values? Can I apply the expression from getExpression() in some way to the input fields?

@vruusmann
Member

> Is there any way I can use these derived fields to extract the transformed values?

See this comment, and the issue referenced therein:
jpmml/jpmml-converter#11 (comment)
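
As a first step toward working with those derived fields, it can help to see which ones the `TransformationDictionary` actually contains. The sketch below lists their names using only the Python standard library. It does not evaluate any expression and is not the approach from the linked comment. The helper names and the trimmed `SAMPLE` document are made up for illustration; a real PMML file would also carry a header, a data dictionary, and a model.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def list_derived_fields(pmml_text):
    """Return the names of DerivedField elements directly under TransformationDictionary."""
    root = ET.fromstring(pmml_text)
    names = []
    for element in root.iter():
        if local_name(element.tag) != "TransformationDictionary":
            continue
        for child in element:
            if local_name(child.tag) == "DerivedField":
                names.append(child.get("name"))
    return names


# Hypothetical, heavily trimmed document used only to exercise the helper.
SAMPLE = """<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <TransformationDictionary>
    <DerivedField name="internal(y)" dataType="double" optype="continuous">
      <FieldRef field="y"/>
    </DerivedField>
  </TransformationDictionary>
</PMML>"""

print(list_derived_fields(SAMPLE))
```

Matching on the local tag name keeps the helper independent of the PMML schema version declared in the namespace.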
