
Sklearn2pmml can't handle custom function or lambda functions? #354

Closed
liuhuanshuo opened this issue Oct 27, 2022 · 29 comments

@liuhuanshuo

I like sklearn2pmml because it lets me build a model in Python and deploy it in Java.

But I've run into a problem lately.

Because my pipeline uses custom transformation functions, it cannot be converted with sklearn2pmml.

Here is my custom function code:

def calc_modify_days(X):
    X['modify_date_new'] = X['modify_date'].apply(lambda x: x[:4] + '-' + x[4:6] + '-' + x[6:8] if x != '' and x < '20221230' else '2022-12-30')
    X['modify_days'] = (pd.to_datetime(X['day_id']) - pd.to_datetime(X['modify_date_new'])).dt.days
    X['modify_days'] = X['modify_days'].apply(lambda x: -1 if x < 0 else x)
    return X['modify_days']

def transform_channel_ty_cd(X):
    return X.apply(lambda x: all_cate_dict['channel_type_cd_3'].get(x) if x in all_cate_dict['channel_type_cd_3'] else 0)

Below is the pipeline code, which works properly for prediction

mapper_encode = [
    (['day_id','modify_date'],FunctionTransformer(calc_modify_days),{'alias':'modify_days'}),
    ('channel_type_cd_3',FunctionTransformer(transform_channel_ty_cd))]

mapper = DataFrameMapper(mapper_encode, input_df=True, df_out=True)

pipeline_test = PMMLPipeline(
    steps=[("mapper", mapper),
           ("classifier", clf_1)])

But when I try to convert the pipeline to a pmml file, I get an error

Standard output is empty
Standard error:
Oct 27, 2022 3:43:25 PM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
Oct 27, 2022 3:43:25 PM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 61 ms.
Oct 27, 2022 3:43:25 PM org.jpmml.sklearn.Main run
INFO: Converting..
Oct 27, 2022 3:43:25 PM sklearn2pmml.pipeline.PMMLPipeline initTargetFields
WARNING: Attribute 'sklearn2pmml.pipeline.PMMLPipeline.target_fields' is not set. Assuming y as the name of the target field
Oct 27, 2022 3:43:25 PM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IllegalArgumentException: Attribute 'sklearn.preprocessing._function_transformer.FunctionTransformer.func' has an unsupported value (Java class net.razorvine.pickle.objects.ClassDictConstructor)
	at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:45)
	at org.jpmml.sklearn.PyClassDict.get(PyClassDict.java:82)
	at org.jpmml.sklearn.PyClassDict.getOptional(PyClassDict.java:92)
	at sklearn.preprocessing.FunctionTransformer.getFunc(FunctionTransformer.java:63)
	at sklearn.preprocessing.FunctionTransformer.encodeFeatures(FunctionTransformer.java:43)
	at sklearn.Transformer.updateAndEncodeFeatures(Transformer.java:118)
	at sklearn_pandas.DataFrameMapper.initializeFeatures(DataFrameMapper.java:73)
	at sklearn.Initializer.encodeFeatures(Initializer.java:44)
	at sklearn.Transformer.updateAndEncodeFeatures(Transformer.java:118)
	at sklearn.Composite.encodeFeatures(Composite.java:129)
	at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:208)
	at org.jpmml.sklearn.Main.run(Main.java:228)
	at org.jpmml.sklearn.Main.main(Main.java:148)
Caused by: java.lang.ClassCastException: Cannot cast net.razorvine.pickle.objects.ClassDictConstructor to numpy.core.UFunc
	at java.lang.Class.cast(Class.java:3369)
	at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:43)
	... 12 more

Exception in thread "main" java.lang.IllegalArgumentException: Attribute 'sklearn.preprocessing._function_transformer.FunctionTransformer.func' has an unsupported value (Java class net.razorvine.pickle.objects.ClassDictConstructor)
	at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:45)
	at org.jpmml.sklearn.PyClassDict.get(PyClassDict.java:82)
	at org.jpmml.sklearn.PyClassDict.getOptional(PyClassDict.java:92)
	at sklearn.preprocessing.FunctionTransformer.getFunc(FunctionTransformer.java:63)
	at sklearn.preprocessing.FunctionTransformer.encodeFeatures(FunctionTransformer.java:43)
	at sklearn.Transformer.updateAndEncodeFeatures(Transformer.java:118)
	at sklearn_pandas.DataFrameMapper.initializeFeatures(DataFrameMapper.java:73)
	at sklearn.Initializer.encodeFeatures(Initializer.java:44)
	at sklearn.Transformer.updateAndEncodeFeatures(Transformer.java:118)
	at sklearn.Composite.encodeFeatures(Composite.java:129)
	at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:208)
	at org.jpmml.sklearn.Main.run(Main.java:228)
	at org.jpmml.sklearn.Main.main(Main.java:148)
Caused by: java.lang.ClassCastException: Cannot cast net.razorvine.pickle.objects.ClassDictConstructor to numpy.core.UFunc
	at java.lang.Class.cast(Class.java:3369)
	at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:43)
	... 12 more

I tried to look it up, and the FunctionTransformer with lambda functions seems to be the problem.

But I don't know how to deal with it, and I have to finish this task. How should I solve it?

@vruusmann vruusmann changed the title Sklearn2pmml doesn't seem to support custom feature conversion functions? Sklearn2pmml can't handle lambda functions? Oct 27, 2022
@liuhuanshuo liuhuanshuo changed the title Sklearn2pmml can't handle lambda functions? Sklearn2pmml can't handle custom function or lambda functions? Oct 27, 2022
@vruusmann
Member

vruusmann commented Oct 27, 2022

Lambda functions are not picklable in Python. Therefore, your pipeline would not work even in a pure Python environment (dump it on one computer, transfer the pickle file to another computer and load it there: the pipeline won't make any predictions, because the source code of the lambda function is missing).

JPMML-SkLearn should simply throw a more relevant exception here, something like "The function definition is missing".
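The picklability point can be demonstrated with plain Python, independent of sklearn2pmml (a minimal sketch):

```python
import pickle

def is_picklable(obj):
    """Return True if obj survives a pickle round-trip in this process."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except Exception:
        return False

def calc_something(x):
    # A named, module-level function: pickle stores only its *name*,
    # never its body, so the unpickling side must be able to import it.
    return x + 1

square = lambda x: x * x  # a lambda has no importable name at all

print(b"calc_something" in pickle.dumps(calc_something))  # True: only the reference is stored
print(is_picklable(square))                               # False: lambdas cannot be pickled
```

This is exactly why a FunctionTransformer wrapping a named function pickles "successfully" yet carries no business logic for the converter to translate.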

@liuhuanshuo
Author

Thank you for your reply. I am aware of this, but even when I don't use a lambda function, it still doesn't work.

Here's my function code, which doesn't work either

def transform_channel_ty_cd(X):
    res = []
    for i in range(len(X)):
        if X[i] in all_cate_dict['channel_type_cd_3']:
            res.append(all_cate_dict['channel_type_cd_3'].get(X[i]))
        else:
            res.append(0)
    return pd.DataFrame(res)

@liuhuanshuo
Author

liuhuanshuo commented Oct 27, 2022

Lambda functions are not picklable in Python. Therefore, your pipeline would not work even in a pure Python environment (dump it on one computer, transfer the pickle file to another computer and unpickle it there: the pipeline won't make any predictions, because the source code of the lambda function is missing).

JPMML-SkLearn should simply throw a more relevant exception here, something like "The function definition is missing".

Also, you seem to be saying that my error is not caused by the lambda function, so where is the problem?

@vruusmann
Member

But I don't know how to deal with it, and I have to finish this task, how should I solve it?

Your calc_modify_days function body seems rather simple, and it should be possible to express it in terms of built-in PMML functions.

You just have to express your FunctionTransformer transformation in terms of sklearn2pmml.preprocessing.ExpressionTransformer and/or sklearn2pmml.preprocessing.DurationTransformer transformations.

For example, the modify_date_new should become:

modify_date_new = ExpressionTransformer("X[0][:4] + '-' + X[0][4:6] + '-' + X[0][6:8] if len(X[0]) > 0 and int(X[0]) < 20221230 else '2022-12-30'")
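What that expression computes per row can be checked in plain Python; here X stands for a single input row (note that with a 14-digit 'YYYYMMDDhhmmss' stamp, the numeric comparison should use only the first 8 digits, int(X[0][:8])):

```python
# One input row as ExpressionTransformer sees it: X[0] is the raw modify_date.
X = ["20220626223702"]

# With a 14-digit 'YYYYMMDDhhmmss' stamp, compare only the first 8 digits.
modify_date_new = (
    X[0][:4] + "-" + X[0][4:6] + "-" + X[0][6:8]
    if len(X[0]) > 0 and int(X[0][:8]) < 20221230
    else "2022-12-30"
)
print(modify_date_new)  # 2022-06-26
```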

@vruusmann
Member

vruusmann commented Oct 27, 2022

Also, you seem to be saying that my error is not caused by the lambda function, so where is the problem?

The problem is that your FunctionTransformer object DOES NOT INCLUDE THE BUSINESS LOGIC of the calc_modify_days Python function. It just contains the instruction "evaluate the calc_modify_days function with such-and-such arguments".

The SkLearn2PMML package cannot find/convert your Python function if it is not stored INSIDE THE PICKLE FILE.

@liuhuanshuo
Author

But I don't know how to deal with it, and I have to finish this task. How should I solve it?

Your calc_modify_days function body seems rather simple, and it is possible to express it in terms of built-in PMML functions.

You just have to express your FunctionTransformer transformation in terms of sklearn2pmml.preprocessing.ExpressionTransformer and/or sklearn2pmml.preprocessing.DurationTransformer transformations.

For example, modify_date_new should become:

modify_date_new = ExpressionTransformer("X[0][:4] + '-' + X[0][4:6] + '-' + X[0][6:8] if len(X[0]) > 0 and int(X[0]) < 20221230 else '2022-12-30'")

Thank you very much for your direct and effective reply, but it seems that the code you provided does not work.

Your code seems to handle only the first line of the function.

The calc_modify_days function computes modify_days from the two input columns.

Can you provide more detailed guidance? Thanks a million!

@vruusmann
Member

vruusmann commented Oct 27, 2022

it seems that the code you provided does not work.

I gave you an example about how to do "datetime assembly" from string fragments using Python's + operator (with slicing support). You should polish it into the final form.

This function computes a modify_days from the input two columns.

  1. Prepare two properly formatted datetime strings using ExpressionTransformer.
  2. Convert them from string to actual datetime objects using CastTransformer (actually, you can do it organically by setting the ExpressionTransformer.dtype attribute).
  3. Calculate the "time delta" against some reference datetime using DaysSinceYearTransformer.
  4. Subtract one from the other using ExpressionTransformer("X[1] - X[0]").

That's your "custom transformer", which works both in Python and (J)PMML.
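The four steps above can be sketched with the standard library to clarify the intended arithmetic (plain Python, not the transformer API; the sample values are illustrative):

```python
from datetime import date

REF_YEAR = 2022  # reference year, as in DaysSinceYearTransformer(year = 2022)

def days_since_year(iso_string, year=REF_YEAR):
    """Days elapsed between Jan 1 of `year` and a 'YYYY-MM-DD' string."""
    y, m, d = map(int, iso_string.split("-"))
    return (date(y, m, d) - date(year, 1, 1)).days

# Steps 1-2: assemble a proper 'YYYY-MM-DD' string from the compact stamp
raw = "20220626223702"
modify_date_new = raw[:4] + "-" + raw[4:6] + "-" + raw[6:8]   # '2022-06-26'

# Steps 3-4: express both dates as day counts, then subtract
day_id = "2022-07-14"
modify_days = days_since_year(day_id) - days_since_year(modify_date_new)
print(modify_days)  # 18
```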

@liuhuanshuo
Author

liuhuanshuo commented Oct 27, 2022

I gave you an example about how to do "datetime assembly" from string fragments using Python's + operator (with slicing support). You should polish it into the final form.

Thank you. I seem to understand.

I immediately followed your instructions and adapted my second custom function to the following form

mapper_encode = [
    ('channel_type_cd_3', ExpressionTransformer("all_cate_dict['channel_type_cd_3'].get(X[0]) if X[0] in all_cate_dict['channel_type_cd_3'] else 0"))]

mapper = DataFrameMapper(mapper_encode, input_df=True, df_out=True)

But when I call fit_transform, I get an error:

TypeError: <lambda>() got an unexpected keyword argument 'axis'

I didn't write a lambda myself, but the error still seems to involve a lambda function.

@vruusmann
Member

I immediately followed your instructions and adapted my second custom function to the following form

You can't simply take Python source code and copy it verbatim as the ExpressionTransformer argument. You should refactor your business logic into more PMML-friendly constructs.

For example, lookup table look-ups should be performed using sklearn2pmml.preprocessing.LookupTransformer instead.

@liuhuanshuo
Author

For example, lookup table look-ups should be performed using sklearn2pmml.preprocessing.LookupTransformer instead.

Could you please provide a demo for my second, simple function, so that I can learn from it to modify the other functions? I would be very grateful.

I'm new to this approach, and it's hard for me to find relevant materials and examples such as LookupTransformer to learn from.

@vruusmann
Member

vruusmann commented Oct 27, 2022

First, do you understand conceptually that lambda look-ups cannot possibly work, because the lookup dictionary resides in Python memory space and is never dumped with the rest of the pipeline? You need to find a way to dump the lookup dict, otherwise it'll never work.

it's hard forme to find relevant materials and cases like LookupTransformer to learn from

The source code is the primary reference:
https://github.com/jpmml/sklearn2pmml/blob/0.86.3/sklearn2pmml/preprocessing/__init__.py#L294-L327

This should work:

channel_type_cd_3_lookup = LookupTransformer(mapping = all_cate_dict['channel_type_cd_3'], default_value = 0)
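Per value, LookupTransformer boils down to a dictionary lookup with a fallback; a plain-Python sketch with made-up mapping values:

```python
# Hypothetical lookup table: category code -> numeric encoding.
mapping = {"120507": 7.0, "120516": 8.0}
default_value = 0.0  # used when a key is absent from the mapping

def lookup(value):
    """What the transformer does per cell: map known keys, else the default."""
    return mapping.get(value, default_value)

print(lookup("120507"))  # 7.0
print(lookup("999999"))  # 0.0
```

Note that all mapping values (and the default) should share a single data type, e.g. all doubles.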

@liuhuanshuo
Author

First, do you understand conceptually that lambda look-ups cannot possibly work, because the lookup dictionary resides in Python memory space and is never dumped with the rest of the pipeline? You need to find a way to dump the lookup

Thank you, I now fully understand: custom functions should not include any lambda functions, and the variables they reference will not be saved.

My previous line of research was trying to figure out how to make pipelines include lambda functions, which was wrong!

channel_type_cd_3_lookup = LookupTransformer(mapping = all_cate_dict['channel_type_cd_3'], default_value = 0)

Thank you very much, the line of code you provided works, and I don't even have to redefine a function!

I will follow your instructions to modify the first function!

I have one last question: can these transformers be chained?

Since my first function needs to convert the string first, then calculate the time delta, and then match the characters, it seems that multiple transformers will be needed. Can the code be written layer by layer?

@liuhuanshuo
Author

channel_type_cd_3_lookup = LookupTransformer(mapping = all_cate_dict['channel_type_cd_3'], default_value = 0)

A new problem arises. I just tried the following code, which has no lambda and contains only one feature-processing step (the one you provided; I tested it and it works).

mapper_encode = [('channel_type_cd_3',LookupTransformer(mapping = all_cate_dict['channel_type_cd_3'], default_value = 0))]

mapper = DataFrameMapper(mapper_encode, input_df=True, df_out=True)

pipeline_test = PMMLPipeline(
    steps=[("mapper", mapper),
           ("classifier", clf_1)])

sklearn2pmml(pipeline_test,'pipeline_test.pmml')

There was still an error converting to PMML:

Standard output is empty
Standard error:
Oct 27, 2022 5:30:53 PM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
Oct 27, 2022 5:30:53 PM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 60 ms.
Oct 27, 2022 5:30:53 PM org.jpmml.sklearn.Main run
INFO: Converting..
Oct 27, 2022 5:30:53 PM sklearn2pmml.pipeline.PMMLPipeline initTargetFields
WARNING: Attribute 'sklearn2pmml.pipeline.PMMLPipeline.target_fields' is not set. Assuming y as the name of the target field
Oct 27, 2022 5:30:53 PM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IllegalArgumentException: Expected all values to be of the same data type, got 2 different data types ([integer, double])
	at org.jpmml.converter.TypeUtil.getDataType(TypeUtil.java:129)
	at sklearn2pmml.preprocessing.LookupTransformer.encodeFeatures(LookupTransformer.java:98)
	at sklearn.Transformer.updateAndEncodeFeatures(Transformer.java:118)
	at sklearn_pandas.DataFrameMapper.initializeFeatures(DataFrameMapper.java:73)
	at sklearn.Initializer.encodeFeatures(Initializer.java:44)
	at sklearn.Transformer.updateAndEncodeFeatures(Transformer.java:118)
	at sklearn.Composite.encodeFeatures(Composite.java:129)
	at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:208)
	at org.jpmml.sklearn.Main.run(Main.java:228)
	at org.jpmml.sklearn.Main.main(Main.java:148)

Exception in thread "main" java.lang.IllegalArgumentException: Expected all values to be of the same data type, got 2 different data types ([integer, double])
	at org.jpmml.converter.TypeUtil.getDataType(TypeUtil.java:129)
	at sklearn2pmml.preprocessing.LookupTransformer.encodeFeatures(LookupTransformer.java:98)
	at sklearn.Transformer.updateAndEncodeFeatures(Transformer.java:118)
	at sklearn_pandas.DataFrameMapper.initializeFeatures(DataFrameMapper.java:73)
	at sklearn.Initializer.encodeFeatures(Initializer.java:44)
	at sklearn.Transformer.updateAndEncodeFeatures(Transformer.java:118)
	at sklearn.Composite.encodeFeatures(Composite.java:129)
	at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:208)
	at org.jpmml.sklearn.Main.run(Main.java:228)
	at org.jpmml.sklearn.Main.main(Main.java:148)

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-242-7b6f7907de3d> in <module>
----> 1 sklearn2pmml(pipeline_test,'pipeline_test.pmml')

~/.local/lib/python3.7/site-packages/sklearn2pmml/__init__.py in sklearn2pmml(pipeline, pmml, user_classpath, with_repr, debug, java_encoding)
    264                                 print("Standard error is empty")
    265                 if retcode:
--> 266                         raise RuntimeError("The JPMML-SkLearn conversion application has failed. The Java executable should have printed more information about the failure into its standard output and/or standard error streams")
    267         finally:
    268                 if debug:

RuntimeError: The JPMML-SkLearn conversion application has failed. The Java executable should have printed more information about the failure into its standard output and/or standard error streams

The following is the entire content of my pipeline, which looks OK


PMMLPipeline(steps=[('mapper', DataFrameMapper(default=False, df_out=True,
                features=[('channel_type_cd_3',
                           LookupTransformer(default_value=0,
                                             mapping={'': 5.0, '100101': 6.0,
                                                      '100102': 7.0,
                                                      '100103': 8.0,
                                                      '100201': 8.0,
                                                      '100202': 8.0,
                                                      '100203': 8.0,
                                                      '100301': 6.0,
                                                      '100401': 7.0,
                                                      '110101': 2.0,
                                                      '110102': 3.0,
                                                      '110103': 8.0,
                                                      '110201': 1.0,
                                                      '110301': 7.0,
                                                      '110302': 8.0,
                                                      '110303': 8.0,
                                                      '110304': 8.0,
                                                      '110401': 4.0,
                                                      '110501': 6.0,
                                                      '110502': 7.0,
                                                      '120101': 7.0,
                                                      '120102': 8.0,
                                                      '120103': 8.0,
                                                      '120201': 8.0,
                                                      '120202': 8.0,
                                                      '120301': 8.0,
                                                      '120302': 8.0,
                                                      '120303': 8.0,
                                                      '120304': 8.0,
                                                      '120401': 8.0, ...}),
                           {})],
                input_df=True, sparse=False)),
       ('classifier', LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=0.9,
               importance_type='split', learning_rate=0.1, max_depth=4,
               min_child_samples=200, min_child_weight=0.001,
               min_split_gain=0.0, n_estimators=3000, n_jobs=-1, num_leaves=10,
               objective=None, random_state=None, reg_alpha=0.1, reg_lambda=0.1,
               silent=True, subsample=0.9, subsample_for_bin=200000,
               subsample_freq=0))])

@vruusmann
Member

can these converters be chained?

SkLearn2PMML transformers follow Scikit-Learn API conventions, so they can (and actually should!) be chained in the form of sub-pipelines.

Please note that some SkLearn2PMML transformers support multiple input columns (e.g. ExpressionTransformer), whereas others support only a single input column (e.g. LookupTransformer). In the latter case, you may need to use FeatureUnion and friends to "parallelize" computations.

@vruusmann
Member

Exception in thread "main" java.lang.IllegalArgumentException: Expected all values to be of the same data type, got 2 different data types ([integer, double])

Kind of self-explanatory, no?

Your look-up values should all be of the same data type. You have a mix of integer and double right now, so it might make sense to convert all integers to doubles.
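One way to homogenize such a mapping before constructing the transformer (a sketch; raw_mapping is illustrative):

```python
# Illustrative mapping with mixed value types (int and float); the converter
# rejects this with "Expected all values to be of the same data type".
raw_mapping = {"": 5, "100101": 6.0, "100102": 7}

# Cast every value to float so the PMML side sees doubles only; keep the
# default value consistent with that choice.
mapping = {key: float(value) for key, value in raw_mapping.items()}
default_value = 0.0

print(sorted(set(type(v).__name__ for v in mapping.values())))  # ['float']
```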

@liuhuanshuo
Author

Exception in thread "main" java.lang.IllegalArgumentException: Expected all values to be of the same data type, got 2 different data types ([integer, double])

Kind of self-explanatory, no?

Your look-up values should all be of the same data type. You have a mix of integer and double right now, so it might make sense to convert all integers to doubles.

Thank you very much, I solved this by setting default_value = 0.0.

To be honest, I previously didn't even know how to locate the real problem in a wall of error output. Through your several analyses, I have learned how to find it.

I will continue to modify the other code. Thanks again for your help!

@vruusmann
Member

I solved this by setting default_value = 0.0.

Nice find!

Looks like my LookupTransformer source code could be improved, by making it automatically cast default_value to the same data type as dict values!
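That improvement could look roughly like this (a hypothetical sketch, not the actual library code):

```python
def cast_default_value(mapping, default_value):
    """Cast default_value to the common Python type of the mapping values.

    Mirrors the converter-side check: mixed value types are rejected.
    """
    value_types = {type(v) for v in mapping.values()}
    if len(value_types) != 1:
        raise TypeError("mapping values must share a single data type")
    (value_type,) = value_types
    return value_type(default_value)

print(cast_default_value({"a": 1.0, "b": 2.0}, 0))  # 0.0
```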

@liuhuanshuo
Author

  1. Prepare two properly formatted datetime strings using ExpressionTransformer.
  2. Convert them from string to actual datetime objects using CastTransformer (actually, you can do it organically by setting the ExpressionTransformer.dtype attribute).
  3. Calculate the "time delta" against some reference datetime using DaysSinceYearTransformer.
  4. Subtract one from the other using ExpressionTransformer("X[1] - X[0]").

That's your "custom transformer", which works both in Python and (J)PMML.

I tried the following code according to your guidance:

mapper_encode = [
    (['modify_date','day_id'], [ExpressionTransformer("X[0][:4] + '-' + X[0][4:6] + '-' + X[0][6:8] if len(X[0]) > 0 and int(X[0][0:8]) < 20221230 else '2022-12-30'"),
                                CastTransformer(dtype = "datetime64[D]"),
                                DaysSinceYearTransformer(year = 2022),
                                ExpressionTransformer("X[1] - X[0]")]),
    ('channel_type_cd_3', LookupTransformer(mapping = all_cate_dict['channel_type_cd_3'], default_value = 0.0))]

mapper = DataFrameMapper(mapper_encode, input_df=True, df_out=True)

mapper.fit_transform(data_test)

But an error occurred; here is the traceback.

It looks like the first three steps worked fine (I ran only the first three steps, and the results were as expected).

The error seems to come from ExpressionTransformer("X[1] - X[0]"); it doesn't seem to see X[1].

Can you see where my code is obviously wrong?

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-325-63f4571c8c4f> in <module>
      9 mapper = DataFrameMapper(mapper_encode, input_df=True, df_out=True)
     10 
---> 11 mapper.fit_transform(data_test)

~/.local/lib/python3.7/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    569         if y is None:
    570             # fit method of arity 1 (unsupervised transformation)
--> 571             return self.fit(X, **fit_params).transform(X)
    572         else:
    573             # fit method of arity 2 (supervised transformation)

~/.local/lib/python3.7/site-packages/sklearn_pandas/dataframe_mapper.py in transform(self, X)
    217             Xt = self._get_col_subset(X, columns)
    218             if transformers is not None:
--> 219                 Xt = transformers.transform(Xt)
    220             extracted.append(_handle_feature(Xt))
    221 

~/.local/lib/python3.7/site-packages/sklearn/pipeline.py in _transform(self, X)
    553         Xt = X
    554         for _, _, transform in self._iter():
--> 555             Xt = transform.transform(Xt)
    556         return Xt
    557 

~/.local/lib/python3.7/site-packages/sklearn2pmml/preprocessing/__init__.py in transform(self, X)
    150         def transform(self, X):
    151                 func = lambda x: self._eval_row(x)
--> 152                 Xt = eval_rows(X, func)
    153                 if self.dtype is not None:
    154                         Xt = cast(Xt, self.dtype)

~/.local/lib/python3.7/site-packages/sklearn2pmml/util.py in eval_rows(X, func, dtype)
     17         Xt = numpy.empty(shape = (nrow, ), dtype = dtype)
     18         for i in range(0, nrow):
---> 19                 Xt[i] = func(X[i])
     20         return Xt
     21 

~/.local/lib/python3.7/site-packages/sklearn2pmml/preprocessing/__init__.py in <lambda>(x)
    149 
    150         def transform(self, X):
--> 151                 func = lambda x: self._eval_row(x)
    152                 Xt = eval_rows(X, func)
    153                 if self.dtype is not None:

~/.local/lib/python3.7/site-packages/sklearn2pmml/preprocessing/__init__.py in _eval_row(self, X)
    143 
    144         def _eval_row(self, X):
--> 145                 return eval(self.expr)
    146 
    147         def fit(self, X, y = None):

~/.local/lib/python3.7/site-packages/sklearn2pmml/preprocessing/__init__.py in <module>

IndexError: index 1 is out of bounds for axis 0 with size 1

@vruusmann
Member

vruusmann commented Oct 27, 2022

It looks like the first three steps worked fine (I only performed the first three steps and the results were as expected)

My advice would be to run the mapper step individually (by performing fit_transform(X)), add steps one by one, and visually confirm that you're getting correct results as the pipeline grows. It's difficult when you have all four steps already in place, and then something misbehaves.

Are you sure that your results are correct at the end of step 3? The step 4 needs a 2D input, but it seems to me that you are having only 1D input there, because already the initial ExpressionTransformer produces a 1D output (as it only manipulates the first modify_date column, completely ignoring the existence of the second day_id column).

If I understand your business logic correctly, you want to transform modify_date and keep day_id as-is, until the final subtraction happens? If so, then you need to split the data flow, temporarily, using FeatureUnion.

Something like this:

mapper = DataFrameMapper([
  (['modify_date','day_id'], [make_feature_union(), ExpressionTransformer("X[1] - X[0]")])
])

def make_feature_union():
  return FeatureUnion([
    # Make pre-processor for `modify_date`
    (make_modify_date_pipeline()),
    # Make pre-processor for `day_id` - we just want to keep it as-is
    (make_day_id_pipeline())
  ])

# Transform the first column of the incoming data matrix
def make_modify_date_pipeline():
  return [ExpressionTransformer("X[0]"), CastTransformer(), DaysSinceYearTransformer()]

# Select the second column of the incoming data matrix, and return it as-is
def make_day_id_pipeline():
  return [ExpressionTransformer("X[1]")]
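The split-and-recombine idea can be illustrated in plain Python: each branch pre-processes one column, and the union stacks the branch outputs side by side, so the final subtraction step sees a 2D row (a sketch, using the sample values from this thread):

```python
# (modify_date, day_id) sample rows
rows = [
    ("20220626223702", "2022-07-14"),
    ("20220629204300", "2022-07-15"),
]

def branch_modify_date(modify_date):
    # Transforming branch: reformat the compact stamp to 'YYYY-MM-DD'.
    return modify_date[:4] + "-" + modify_date[4:6] + "-" + modify_date[6:8]

def branch_day_id(day_id):
    # Pass-through branch: keep the column as-is.
    return day_id

# The "union": every output row carries both branch results, so a downstream
# ExpressionTransformer("X[1] - X[0]") has both X[0] and X[1] to work with.
united = [[branch_modify_date(md), branch_day_id(di)] for md, di in rows]
print(united[0])  # ['2022-06-26', '2022-07-14']
```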

@vruusmann
Member

Try getting my make_feature_union() to assemble correctly, and then perform FeatureUnion.fit_transform(X) with a sample dataset. You should be getting a 2D data matrix as a result.

@liuhuanshuo
Author

Are you sure that your results are correct at the end of step 3?

I checked and the code worked, but the result was not as expected and only one column was output!

Try getting my make_feature_union() to assemble correctly, and then perform FeatureUnion.fit_transform(X) with a sample dataset. You should be getting a 2D data matrix as a result.

I'm sorry I didn't give you the sample data, but the simplest example data is defined as follows

data_test = pd.DataFrame({
    'modify_date': ['20220626223702','20220629204300','20220602000000'],
    'day_id': ['2022-07-14','2022-07-15','2022-09-14'],
    'channel_type_cd_3': ['120507','120516','1203033']
})

Of course, the third column has been solved under your guidance, and I'm only concerned with the first two columns now.

What I need to do is also straightforward: I need to convert the modify_date column to the same format as the day_id column (year-month-day), and then compute the day difference day_id - modify_date.

mapper = DataFrameMapper([
    (['modify_date','day_id'], [make_feature_union(), ExpressionTransformer("X[1] - X[0]")])
])

I tried to execute the code you just provided, but it kept raising errors; it seems some required parameters are missing. I added parameters one by one (CastTransformer(dtype = "datetime64[D]"), DaysSinceYearTransformer(year = 2022)), but it still doesn't work.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-339-c673557fab36> in <module>
      1 mapper_encode = [
----> 2     (['modify_date','day_id'],[make_feature_union(), ExpressionTransformer("X[1] - X[0]")]),
      3 
      4     ('channel_type_cd_3',LookupTransformer(mapping = all_cate_dict['channel_type_cd_3'], default_value = 0.0))]
      5 

<ipython-input-337-57b6e768926c> in make_feature_union()
      4     (make_modify_date_pipeline()),
      5     # Make pre-processor for `day_id` - we just want to keep it as-is
----> 6     (make_day_id_pipeline())
      7   ])
      8 

~/.local/lib/python3.7/site-packages/sklearn/pipeline.py in __init__(self, transformer_list, n_jobs, transformer_weights, verbose)
    810         self.transformer_weights = transformer_weights
    811         self.verbose = verbose
--> 812         self._validate_transformers()
    813 
    814     def get_params(self, deep=True):

~/.local/lib/python3.7/site-packages/sklearn/pipeline.py in _validate_transformers(self)
    841 
    842     def _validate_transformers(self):
--> 843         names, transformers = zip(*self.transformer_list)
    844 
    845         # validate names

ValueError: not enough values to unpack (expected 2, got 1)

Can you help me see how to solve it? The guidance you provided earlier has already saved me many detours!

@vruusmann
Member

What I need to do is also straightforward: I need to convert the modify_date column to the same format as the day_id column (year-month-day), and then compute the day difference day_id - modify_date.

My earlier comment (#354 (comment)) was very close to the final solution.

Here's the final form:

from pandas import DataFrame
from sklearn_pandas import DataFrameMapper
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn2pmml.preprocessing import CastTransformer, DaysSinceYearTransformer, ExpressionTransformer

def make_modify_date_pipeline():
	return make_pipeline(ExpressionTransformer("X[0][:4] + '-' + X[0][4:6] + '-' + X[0][6:8]"), CastTransformer(dtype = "datetime64[D]"), DaysSinceYearTransformer(year = 2022))

def make_day_id_pipeline():
	return make_pipeline(ExpressionTransformer("X[1]"), CastTransformer(dtype = "datetime64[D]"), DaysSinceYearTransformer(year = 2022))

def make_feature_union():
	return FeatureUnion([
		("modify_date", make_modify_date_pipeline()),
		("day_id", make_day_id_pipeline())
	])

mapper = DataFrameMapper([
	(['modify_date','day_id'], [make_feature_union(), ExpressionTransformer("X[1] - X[0]")])
])

X = DataFrame({
	'modify_date':['20220626223702','20220629204300','20220602000000'],
	'day_id':['2022-07-14','2022-07-15','2022-09-14']
})

Xt = mapper.fit_transform(X)
print(Xt)

The above code snippet prints [[18], [16], [104]], which look like reasonable numbers to me.
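Those numbers can be cross-checked with plain pandas (my own verification sketch, separate from the mapper pipeline):

```python
import pandas as pd

# Cross-check of the mapper output: days between modify_date and day_id.
modify_date = pd.to_datetime(
    pd.Series(['20220626223702', '20220629204300', '20220602000000']).str[:8],
    format='%Y%m%d')
day_id = pd.Series(pd.to_datetime(['2022-07-14', '2022-07-15', '2022-09-14']))
print((day_id - modify_date).dt.days.tolist())  # [18, 16, 104]
```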

Can you help me figure out how to solve it? The guidance you provided earlier has already saved me from many detours!

Nice! I'll send you an invoice later! :-)

@liuhuanshuo
Author

The above code snippet prints [[18], [16], [104]], which look like reasonable numbers to me.

Thank you very much for your reply; I didn't expect you to provide such complete code!

Yes, this code runs fit_transform properly and produces the result I want.

But I don't know if you still remember my initial problem, which was that I could not convert the pipeline into PMML.

I still get an error when I try to apply the code to my program and convert it to PMML, so I'll just paste the key part of the output:

SEVERE: Failed to convert
java.lang.IllegalArgumentException: Python expression 'X[0][:4] + '-' + X[0][4:6] + '-' + X[0][6:8] if len(X[0]) > 0 and int(X[0][0:8]) < 20221230 else '2022-12-30'' is either invalid or not supported
	at org.jpmml.sklearn.ExpressionTranslator.translate(ExpressionTranslator.java:76)
	at org.jpmml.sklearn.ExpressionTranslator.translate(ExpressionTranslator.java:63)
	at sklearn2pmml.preprocessing.ExpressionTransformer.encodeFeatures(ExpressionTransformer.java:47)
	at sklearn.Transformer.updateAndEncodeFeatures(Transformer.java:118)
	at sklearn.Composite.encodeFeatures(Composite.java:129)
	at sklearn.pipeline.PipelineTransformer.encodeFeatures(PipelineTransformer.java:66)
	at sklearn.Transformer.updateAndEncodeFeatures(Transformer.java:118)
	at sklearn.pipeline.FeatureUnion.encodeFeatures(FeatureUnion.java:45)
	at sklearn.Transformer.updateAndEncodeFeatures(Transformer.java:118)
	at sklearn_pandas.DataFrameMapper.initializeFeatures(DataFrameMapper.java:73)
	at sklearn.Initializer.encodeFeatures(Initializer.java:44)
	at sklearn.Transformer.updateAndEncodeFeatures(Transformer.java:118)
	at sklearn.Composite.encodeFeatures(Composite.java:129)
	at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:208)
	at org.jpmml.sklearn.Main.run(Main.java:228)
	at org.jpmml.sklearn.Main.main(Main.java:148)
Caused by: org.jpmml.sklearn.ParseException: Encountered unexpected token: "[" "["
    at line 1, column 5.

Was expecting one of:

    "!="
    "%"
    "*"
    "+"
    "-"
    "."
    "/"
    "<"
    "<="
    "=="
    ">"
    ">="
    "and"
    "if"
    "or"
    <EOF>

	at org.jpmml.sklearn.ExpressionTranslator.generateParseException(ExpressionTranslator.java:1558)
	at org.jpmml.sklearn.ExpressionTranslator.jj_consume_token(ExpressionTranslator.java:1426)
	at org.jpmml.sklearn.ExpressionTranslator.translateExpressionInternal(ExpressionTranslator.java:215)
	at org.jpmml.sklearn.ExpressionTranslator.translate(ExpressionTranslator.java:74)
	... 15 more

Admittedly, the code above is a slightly modified version of the code you provided (in fact, I only added one piece of conditional logic).

But even when I made no changes to your code at all, the same error still occurred.

There seems to be a problem with the ExpressionTransformer in make_modify_date_pipeline.

I tried to search for the key error message org.jpmml.sklearn.ParseException: Encountered unexpected token: "[" "["

I can see related problems in the issue history, but I haven't found a direct and effective solution; could you give me some hints?

===============
In addition, I have another question. Your code certainly works, and the output numbers are as expected.

But its output dtype is str (object), which is not useful for subsequent work; worse, when I incorporated this code into my program, it changed all my columns to object dtype, making the model unable to predict.

My current workaround is to unset the df_out parameter of the mapper so that the final output is a np.ndarray, but I don't know if there is a better way, since this loses the column names!

Finally, to clarify what I have said above, let me restate my procedure.

I have 60 feature columns; apart from modify_date and day_id, all columns are numeric and are mapped to None (no transformation) in the mapper.

When I used the code you provided, it produced output normally, but changed all the columns to string dtype!
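For what it's worth, that dtype change is standard NumPy behaviour (my own illustration, not from the thread): when string-valued and numeric columns are stacked into one array, NumPy upcasts everything to a common string/object dtype, which is why all the columns come out as strings:

```python
import numpy as np

# Stacking a string column next to a numeric column forces a common dtype.
strings = np.array([['2022-06-26'], ['2022-07-15']])
numbers = np.array([[18], [16]])
stacked = np.hstack([strings, numbers])
print(stacked.dtype.kind)  # 'U' (unicode): the numeric column is now text
```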

@liuhuanshuo
Author

I still get an error when I try to apply the code to my program and convert it to PMML, so I'll just paste the key part of the output:

After searching related posts, I found that it might be because my sklearn2pmml version was too old. I upgraded to the latest 0.86.3; it no longer raises the above error, but gives a new one!

Standard output is empty
Standard error:
Exception in thread "main" java.lang.IllegalArgumentException: The estimator object of the final step (Python class lightgbm.sklearn.LGBMClassifier) does not specify the number of outputs
	at sklearn2pmml.pipeline.PMMLPipeline.initTargetFields(PMMLPipeline.java:549)
	at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:128)
	at com.sklearn2pmml.Main.run(Main.java:91)
	at com.sklearn2pmml.Main.main(Main.java:66)

My usage is simple

  • Take raw input data,
  • Apply feature processing (this is the mapper part of the code below),
  • Apply the trained model (this is important: my model is already trained and cannot be fitted again!)
clf_1 = joblib.load(f'marketing_risk_lgb_1.pkl')
pipeline_test = PMMLPipeline(
    steps=[("mapper", mapper),
           ("classifier", clf_1)])

sklearn2pmml(pipeline_test,'pipeline_test.pmml')

@liuhuanshuo
Author

I consulted other engineers in my company, and they replied that a model can only be saved as PMML after calling fit!

But the background of my task is this: I have a model trained with sklearn and saved as a pkl file; I can load it in Python to make predictions, and the model is already in production.

Now, I need to convert this pkl model to PMML format, without retraining the model, so that it can be deployed and invoked from Java.

So the solution I came up with is to read the pkl file with joblib, perform the feature-related operations, and then save the model directly to a PMML file. That seems to be unsupported, right?

But as soon as I call fit, my model changes!

@liuhuanshuo
Author

I found a solution to this problem in #352, by adding a line:

pipeline_test.target_fields = ["my_single_target"]

Although I do not quite understand its exact role (could you explain it, if convenient?), the error no longer appears!

There seems to be only one problem left, which is the following error:

Standard output is empty
Standard error:
Exception in thread "main" java.lang.IllegalArgumentException: Function 'builtins.int' is not supported
	at org.jpmml.python.FunctionUtil.encodePythonFunction(FunctionUtil.java:103)
	at org.jpmml.python.FunctionUtil.encodeFunction(FunctionUtil.java:72)
	at org.jpmml.python.ExpressionTranslator.translateFunction(ExpressionTranslator.java:186)
	at org.jpmml.python.ExpressionTranslator.FunctionInvocationExpression(ExpressionTranslator.java:849)
	at org.jpmml.python.ExpressionTranslator.PrimaryExpression(ExpressionTranslator.java:646)
	at org.jpmml.python.ExpressionTranslator.UnaryExpression(ExpressionTranslator.java:594)

I can clearly see that there is something wrong with this line of code (it uses int), but I don't know how to deal with it. Why is int not supported? I see that similar problems (#134) were to be solved in later versions, but I am already on the latest version!

def make_modify_date_pipeline():
    return make_pipeline(ExpressionTransformer("X[0][:4] + '-' + X[0][4:6] + '-' + X[0][6:8] if len(X[0]) > 0 and int(X[0][0:8]) < 20221230 else '2022-12-30'"), CastTransformer(dtype = "datetime64[D]"), DaysSinceYearTransformer(year = 2022))

@vruusmann
Member

vruusmann commented Oct 28, 2022

java.lang.IllegalArgumentException: Python expression 'X[0][:4] + '-' + X[0][4:6] + '-' + X[0][6:8] if len(X[0]) > 0 and int(X[0][0:8]) < 20221230 else '2022-12-30'' is either invalid or not supported

This error is about "array double indexing" syntax. As you mentioned, it was probably caused by the outdated SkLearn2PMML version.

Exception in thread "main" java.lang.IllegalArgumentException: The estimator object of the final step (Python class lightgbm.sklearn.LGBMClassifier) does not specify the number of outputs

The PMMLPipeline.fit(X, y) method has not been called, so the number of model target fields is unknown. Or if the fit(X, y) method was called, it was called with "anonymous" data matrix (eg. numpy.ndarray) which did not specify column names.

Exception in thread "main" java.lang.IllegalArgumentException: Function 'builtins.int' is not supported

This happens because of the int(..) function invocation within the ExpressionTranslator body. Looks like primitive cast functions are not supported yet, and you should be using the good old CastTransformer instead.

It would make sense to split the initial Python expression into two parts:

  1. Validate that the incoming integer-like datetime string is not empty, and is numerically less than some integer value.
  2. Re-format the integer-like datetime string to a proper yyyy-mm-dd datetime string.
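A minimal pure-Python sketch of that two-part split (my own suggestion, not taken from the thread; note that comparing fixed-width digit strings lexicographically, e.g. X[0][0:8] < '20221230', would avoid the unsupported int(..) cast entirely, although whether the jpmml-sklearn expression grammar accepts that comparison is an assumption on my part):

```python
def clamp_datetime_str(s):
    # Part 1: validate the incoming integer-like datetime string.
    # Fixed-width digit strings compare lexicographically like numbers,
    # so no int() cast is needed.
    return s[0:8] if len(s) > 0 and s[0:8] < '20221230' else '20221230'

def reformat_datetime_str(s):
    # Part 2: re-format yyyymmdd into a proper yyyy-mm-dd string.
    return s[0:4] + '-' + s[4:6] + '-' + s[6:8]

print(reformat_datetime_str(clamp_datetime_str('20220626223702')))  # 2022-06-26
print(reformat_datetime_str(clamp_datetime_str('')))                # 2022-12-30
```

Each part would then map onto its own ExpressionTransformer step inside the pipeline.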

@vruusmann
Member

vruusmann commented Oct 28, 2022

I'll be AFK for a couple of hours.

There are too many new questions in this thread, which are not related to the original question ("how can I refactor my datetime lambda function into PMML transformers pipeline") anymore.

Now, please extract all your current open questions into new standalone JPMML-SkLearn/SkLearn2PMML issues - one question per issue! I'll take a look at them afterwards.

@liuhuanshuo
Author

Now, please extract all your current open questions into new standalone JPMML-SkLearn/SkLearn2PMML issues - one question per issue! I'll take a look at them afterwards.

OK, I'm sorry for taking up so much of your time. I will close this issue and open new standalone issues for the unresolved questions!
