Mechanism for registering custom Estimator and Transformer converter classes #20

Closed
geoHeil opened this issue Dec 12, 2016 · 7 comments

@geoHeil

geoHeil commented Dec 12, 2016

How can I integrate a custom transformer into sklearn2pmml?

I am thinking about some preprocessing code which cleans the data and handles the imputation of missing values.
Is it correct to assume that pyrolite will handle any pickled transformer?

@vruusmann
Member

First, you need to introduce an appropriate converter class into the JPMML-SkLearn project:

  1. Create a subclass of sklearn.Transformer, and implement the conversion "business logic" in the Transformer#encodeFeatures(List, List, FeatureMapper) method.
  2. Register this class with org.jpmml.sklearn.PickleUtil.

There's not much documentation about it. Here's an example of implementing a converter class for Scikit-Learn's FunctionTransformer transformation type:
jpmml/jpmml-sklearn@5c4a181

Then, build your modified JPMML-SkLearn project with Apache Maven, and drop the resulting JAR file into the sklearn2pmml /resources/ directory. Currently, you would be replacing jpmml-sklearn-1.1.4.jar with your own jpmml-sklearn-1.1-SNAPSHOT.jar.

@vruusmann changed the title from "Custom Transformer 2pmml" to "Mechanism for registering custom Estimator and Transformer converter classes" Dec 12, 2016
@vruusmann
Member

Edited the title of this issue to reflect the real user need.

Actually, one shouldn't be making custom converter classes part of the JPMML-SkLearn library. They could be isolated into a standalone (mini-)project, which depends on the JPMML-SkLearn library (and other Java libraries).

When this (mini-)project is built, it should produce a JAR file that is suitable for dropping into the sklearn2pmml /resources/ directory. At the moment, the problem is that there is no way of informing org.jpmml.sklearn.PickleUtil about those newly dropped-in converter classes: the list of supported converters is hard-coded.

A solution would be to introduce some sort of "sklearn2pmml extension module metadata" mechanism. For example, the JAR file could contain a properties file META-INF/sklearn2pmml.properties, which lists the names of new converter classes.

@geoHeil
Author

geoHeil commented Dec 13, 2016

Is there a simpler solution, e.g. something similar to http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html, where an arbitrary function can be registered?

@vruusmann
Member

Class sklearn.preprocessing.FunctionTransformer supports a limited number of 1-parameter Numpy ufuncs: https://github.com/jpmml/jpmml-sklearn/blob/master/src/main/java/sklearn/preprocessing/FunctionTransformer.java#L79

As you can see, in order to support a ufunc, you still need to write conversion "business logic" in Java, and (re-)build a modified JPMML-SkLearn library.

@vruusmann
Member

Could you perhaps share your Python Transformer class? Maybe I can suggest a simple and easy way of automatically translating it to PMML then.

For example, the JPMML-R library now includes a general-purpose R-to-PMML expression conversion functionality:

iris.rf = randomForest(Species ~ . + I(log(Sepal.Length / Sepal.Width) + 1), data = iris)

It should be possible to build a similar Python-to-PMML expression converter. Of course, the trouble is that you cannot refer to fields by name in Scikit-Learn; you have to use positional field references such as $1, $2, .., $n instead.
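
As a rough illustration of the idea (not an existing JPMML feature), here is a minimal sketch of such a converter built on Python's standard ast module. It assumes that columns are referenced positionally as X[0], X[1], ... (a Python-valid, zero-based stand-in for $1, $2, ...), and that the caller supplies the field names separately. The function name expression_to_pmml and the small operator and function tables are hypothetical; only a handful of arithmetic operators and functions are mapped to PMML built-in functions.

```python
import ast
import xml.etree.ElementTree as ET

# Python binary operator node type -> PMML built-in function name
_BINARY_OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}

# Python function name -> PMML built-in function name (a small, assumed subset)
_FUNCTIONS = {"log": "ln", "exp": "exp", "sqrt": "sqrt", "abs": "abs"}


def _convert(node, field_names):
    """Recursively translate one AST node into a PMML expression element."""
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        element = ET.Element("Apply", function=_BINARY_OPS[type(node.op)])
        element.append(_convert(node.left, field_names))
        element.append(_convert(node.right, field_names))
        return element
    if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
            and node.func.id in _FUNCTIONS and not node.keywords):
        element = ET.Element("Apply", function=_FUNCTIONS[node.func.id])
        for arg in node.args:
            element.append(_convert(arg, field_names))
        return element
    if (isinstance(node, ast.Subscript) and isinstance(node.value, ast.Name)
            and node.value.id == "X"):
        # On Python 3.9+ the subscript index is stored directly in node.slice
        index = node.slice
        if isinstance(index, ast.Constant) and type(index.value) is int:
            return ET.Element("FieldRef", field=field_names[index.value])
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        element = ET.Element("Constant")
        element.text = str(node.value)
        return element
    raise ValueError("Unsupported expression: " + ast.dump(node))


def expression_to_pmml(expression, field_names):
    """Parse a Python expression string and return the root PMML element."""
    tree = ast.parse(expression, mode="eval")
    return _convert(tree.body, field_names)


if __name__ == "__main__":
    root = expression_to_pmml("log(X[0] / X[1]) + 1",
                              ["Sepal.Length", "Sepal.Width"])
    print(ET.tostring(root, encoding="unicode"))
```

Anything outside the mapped subset (for example the ** operator) raises a ValueError rather than being silently mistranslated, which seems the safer default for a converter of this kind.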

@geoHeil
Author

geoHeil commented Dec 13, 2016

Unfortunately, sharing the code will not be possible. But I can explain the actions that are performed:

  • filtering some customer groups (no state involved)
  • handling NaN values (state required for imputation)
  • generating some features (no state involved)

The whole pipeline is a bit similar to http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html where the cleaning of the data is encapsulated in its own transformer.

@vruusmann
Member

Starting from the JPMML-SkLearn library version 1.2.0, the PickleUtil utility class will scan all the JAR files in the application classpath for the META-INF/sklearn2pmml.properties file, and if found, will automatically register all the listed converter classes with the runtime. For example, here is the list of built-in converter classes: https://github.com/jpmml/jpmml-sklearn/blob/master/src/main/resources/META-INF/sklearn2pmml.properties

You can implement your own Transformer, Selector and Estimator converter classes as appropriate. When done, create a corresponding META-INF/sklearn2pmml.properties file, and package everything as a regular JAR file.

During conversion, you can add a list of JAR files to the application classpath using the newly introduced user_classpath argument:

sklearn2pmml(estimator, mapper, "estimator_mapper.pmml", user_classpath = ["/path/to/extensions.jar"])
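
Before passing a JAR file via user_classpath, it may help to check that it actually carries the metadata file. The following is a minimal sketch using only Python's standard zipfile module (a JAR file is a ZIP archive). The helper name read_converter_mappings is hypothetical, and the parser assumes simple "key = value" lines like those in the built-in properties file linked above; it does not handle the full Java properties syntax (alternative separators, line continuations, escapes).

```python
import zipfile

PROPERTIES_ENTRY = "META-INF/sklearn2pmml.properties"


def read_converter_mappings(jar_path):
    """Return the key/value pairs listed in the JAR's metadata file.

    Returns None if the JAR does not contain the metadata file at all.
    """
    with zipfile.ZipFile(jar_path) as jar:
        if PROPERTIES_ENTRY not in jar.namelist():
            return None
        # Java properties files are traditionally ISO-8859-1 encoded
        text = jar.read(PROPERTIES_ENTRY).decode("iso-8859-1")

    mappings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines
        if not line or line.startswith(("#", "!")):
            continue
        key, _, value = line.partition("=")
        mappings[key.strip()] = value.strip()
    return mappings
```

For example, read_converter_mappings("/path/to/extensions.jar") returning None would suggest that the JAR was packaged without the properties file, in which case no custom converter classes would be registered from it.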
