Sklearn2pmml can't handle custom function or lambda functions? #354
Comments
Lambda functions are not pickleable in Python. Therefore, your pipeline would not work even in a pure Python environment (dump on one computer, transfer the pickle file to another computer and load it there: the pipeline won't make any predictions because the source code of the lambda function is missing). The JPMML-SkLearn should simply throw a more relevant exception here, something like "The function definition is missing" |
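The pickling point can be checked with the standard library alone. This is a minimal sketch; the names `double` and `lookup` are made up for illustration and do not come from the thread.

```python
import pickle

# A module-level lambda: pickle stores functions by name, and "<lambda>" cannot be looked up.
double = lambda x: 2 * x

try:
    pickle.dumps(double)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError) as exc:
    lambda_pickled = False
    print("Cannot pickle lambda:", exc)

# Plain data, by contrast, survives a dump/load round trip.
lookup = {"a": 1}
restored = pickle.loads(pickle.dumps(lookup))
print(restored)
```

The contrast is the useful part: data travels inside the pickle, function bodies do not.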
Thank you for your reply. I am aware of this, but it does not work even when I avoid the lambda function. Here's my function code, which doesn't work either:

    def transform_channel_ty_cd(X):
        res = []
        for i in range(len(X)):
            if X[i] in all_cate_dict['channel_type_cd_3']:
                res.append(all_cate_dict['channel_type_cd_3'].get(X[i]))
            else:
                res.append(0)
        return pd.DataFrame(res)
|
Also, it seems that you are saying that my error code is not caused by the lambda function, so where is the problem? |
Your You just have to express your For example, the

    modify_date_new = ExpressionTransformer("X[0][:4] + '-' + X[0][4:6] + '-' + X[0][6:8] if len(X[0]) > 0 and int(X[0]) < 20221230 else '2022-12-30'")
|
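To see what the expression string in the comment above computes, it can be evaluated in plain Python with `X` bound to one row at a time. This is only a stand-in for row-wise evaluation, not the library's implementation; the helper `apply_rowwise` and the sample rows are invented for illustration.

```python
# The expression from the comment above, split across two string literals for readability.
EXPR = ("X[0][:4] + '-' + X[0][4:6] + '-' + X[0][6:8] "
        "if len(X[0]) > 0 and int(X[0]) < 20221230 else '2022-12-30'")

def apply_rowwise(expr, rows):
    """Evaluate expr once per row, exposing only X, len and int to the expression."""
    allowed = {"len": len, "int": int}
    return [eval(expr, {"__builtins__": {}}, {"X": row, **allowed}) for row in rows]

# Hypothetical 8-digit inputs: a valid date, a date past the cutoff, and an empty string.
rows = [["20220626"], ["20230105"], [""]]
dates = apply_rowwise(EXPR, rows)
print(dates)
```

Note that a 14-digit timestamp such as '20220626223702' does not pass the `int(X[0]) < 20221230` check and falls through to the default, which is why a later comment in this thread compares `int(X[0][0:8])` instead.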
The problem is that your The SkLearn2PMML package can not find/convert your Python function, if it's not stored INSIDE THE PICKLE FILE. |
Thank you very much for your direct and effective reply, but it seems that the code you provided does not work. Your code seems to cover only the first line of the function calc_modify_days. This function computes modify_days from the two input columns. Can you provide more specific guidance? Thanks a million! |
I gave you an example about how to do "datetime assembly" from string fragments using Python's
That's your "custom transformer", which works both in Python and (J)PMML. |
Thank you, I think I understand. I followed your instructions right away and adapted my second custom function to the following form:

    mapper_encode = [
        ('channel_type_cd_3', ExpressionTransformer("all_cate_dict['channel_type_cd_3'].get(X[0]) if X[0] in all_cate_dict['channel_type_cd_3'] else 0"))]
    mapper = DataFrameMapper(mapper_encode, input_df=True, df_out=True)

But when I call fit_transform, I get an error:

    TypeError: <lambda>() got an unexpected keyword argument 'axis'

I didn't use a lambda, yet the error seems to come from a lambda function.
|
You can't simply take Python source code, and copy&paste it as For example, lookup table look-ups should be performed using |
Could you please provide a demo for my second simple function, so that I can learn from it and modify my other functions? I would be very grateful. I'm new to this method, and it's hard for me to find relevant materials and examples, such as for LookupTransformer, to learn from |
First, do you understand conceptually that lambda look-ups cannot possibly work, because the lookup base dictionary resides in Python memory space, and is never dumped with the rest of the pipeline? You need to find a way to dump the lookup dict, otherwise it'll never work.
The source code is the primary reference: This should work:

    channel_type_cd_3_lookup = LookupTransformer(mapping = all_cate_dict['channel_type_cd_3'], default_value = 0)
|
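In plain Python terms, the original custom function is a dictionary look-up with a fallback value, which is the same mapping-plus-default shape as the LookupTransformer call above. The sketch below only illustrates those semantics and is not the library's code; `channel_mapping` is a made-up stand-in, since the real contents of `all_cate_dict['channel_type_cd_3']` are not shown in the thread.

```python
# Hypothetical stand-in for all_cate_dict['channel_type_cd_3'].
channel_mapping = {"120507": 1, "120516": 2}

def transform_channel_ty_cd(values, mapping):
    """The loop from the original custom function, with the dict passed in explicitly."""
    res = []
    for value in values:
        if value in mapping:
            res.append(mapping.get(value))
        else:
            res.append(0)
    return res

def lookup_with_default(values, mapping, default_value=0):
    """Same result in one step: look up each value, fall back to default_value."""
    return [mapping.get(value, default_value) for value in values]

values = ["120507", "120516", "1203033"]
loop_result = transform_channel_ty_cd(values, channel_mapping)
lookup_result = lookup_with_default(values, channel_mapping)
print(loop_result, lookup_result)
```

Because the mapping is handed over as data rather than referenced from inside a function body, it can be stored together with the rest of the pipeline.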
Thank you, I now fully agree and understand that you should not include any lambda functions in custom functions, and the related variables involved will not be saved. My previous research direction has been trying to figure out how to make pipelines include lambda functions, which is wrong!
Thank you very much, the line of code you provided works, and I don't even have to redefine a function! I will follow your instructions to modify the first function. I have one last question: can these transformers be chained? Since my first function needs to convert the string first, then calculate the time, and then match the characters, it seems that multiple transformers will be needed. Can the code be written layer by layer? |
A new problem has come up. I just tried the following code, which has no lambda and contains only one feature-processing step (the one you provided; I tested it and it works):

    mapper_encode = [('channel_type_cd_3', LookupTransformer(mapping = all_cate_dict['channel_type_cd_3'], default_value = 0))]
    mapper = DataFrameMapper(mapper_encode, input_df=True, df_out=True)
    pipeline_test = PMMLPipeline(
        steps=[("mapper", mapper),
               ("classifier", clf_1)])
    sklearn2pmml(pipeline_test, 'pipeline_test.pmml')

There was still an error when converting to PMML
The following is the entire content of my pipeline, which looks OK
|
SkLearn2PMML transformers follow Scikit-Learn API conventions, so they can (and actually should!) be chained in the form of sub-pipelines. Please note that some SkLearn2PMML transformers support multiple input columns (eg. |
Kind of self-explanatory, no? Your look-up values should all be of the same datatype. You have a mix of |
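A quick way to check a lookup table for mixed value types before handing it to a transformer is sketched below. The thread does not show the actual values, so the mapping here assumes, purely for illustration, a mix of int and float; the helper names are invented as well.

```python
# Hypothetical lookup table with mixed int/float values and an int default.
mapping = {"120507": 1, "120516": 2.5, "1203033": 3}
default_value = 0

def value_types(mapping, default_value):
    """Collect the type names of all look-up values, including the default."""
    return {type(v).__name__ for v in list(mapping.values()) + [default_value]}

def to_float_lookup(mapping, default_value):
    """Coerce every look-up value and the default to float."""
    return {k: float(v) for k, v in mapping.items()}, float(default_value)

before = value_types(mapping, default_value)
clean_mapping, clean_default = to_float_lookup(mapping, default_value)
after = value_types(clean_mapping, clean_default)
print(before, after)
```

A result with more than one type name in `before` is the kind of inconsistency the comment above refers to; the default value counts too.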
Thank you very much, I solved this by setting To be honest, I didn't even know how to locate the real problem in the face of a bunch of error codes before. Through your several analyses, I have learned how to find the problem. I will continue to modify other code, thanks again for your help! |
Nice find! Looks like my |
I tried the following code according to your guidance:

    mapper_encode = [
        (['modify_date','day_id'], [(ExpressionTransformer("X[0][:4] + '-' + X[0][4:6] + '-' + X[0][6:8] if len(X[0]) > 0 and int(X[0][0:8]) < 20221230 else '2022-12-30'")),
                                    CastTransformer(dtype = "datetime64[D]"),
                                    DaysSinceYearTransformer(year = 2022),
                                    ExpressionTransformer("X[1] - X[0]")]),
        ('channel_type_cd_3', LookupTransformer(mapping = all_cate_dict['channel_type_cd_3'], default_value = 0.0))]
    mapper = DataFrameMapper(mapper_encode, input_df=True, df_out=True)
    mapper.fit_transform(data_test)

But an error occurred, and here is the error code. It looks like the first three steps worked fine (I ran only the first three steps and the results were as expected). The error seems to be in ExpressionTransformer("X[1] - X[0]"): it doesn't seem to be able to read X[1]. Can you see where my code is obviously wrong?
|
My advice would be to run the mapper step individually (by performing Are you sure that your results are correct at the end of step 3? The step 4 needs a 2D input, but it seems to me that you are having only 1D input there, because already the initial ExpressionTransformer produces a 1D output (as it only manipulates the first If I understand your business logic correctly, then you want to transform Something like this:

    mapper = DataFrameMapper([
        (['modify_date','day_id'], [make_feature_union(), ExpressionTransformer("X[1] - X[0]")])
    ])

    def make_feature_union():
        return FeatureUnion([
            # Make pre-processor for `modify_date`
            (make_modify_date_pipeline()),
            # Make pre-processor for `day_id` - we just want to keep it as-is
            (make_day_id_pipeline())
        ])

    # Transform the first column of the incoming data matrix
    def make_modify_date_pipeline():
        return [ExpressionTransformer("X[0]"), CastTransformer(), DaysSinceYearTransformer()]

    # Select the second column of the incoming data matrix, and return it as-is
    def make_day_id_pipeline():
        return [ExpressionTransformer("X[1]")]
|
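The shape problem described above can be imitated in plain Python, with a list standing in for one row of the data matrix. This is only a sketch of the idea: the helpers `chain` and `union` are invented here and are not scikit-learn or SkLearn2PMML APIs.

```python
def chain(row, steps):
    """Apply steps one after another; each step sees only the previous step's output."""
    for step in steps:
        row = step(row)
    return row

def union(row, branches):
    """Apply every branch to the same input row and concatenate the outputs."""
    out = []
    for branch in branches:
        out.extend(branch(row))
    return out

def reformat_first(X):
    """Reads only column 0, so it returns a single output column."""
    return [X[0][:4] + "-" + X[0][4:6] + "-" + X[0][6:8]]

def keep_second(X):
    """Passes column 1 through unchanged."""
    return [X[1]]

row = ["20220626223702", "2022-07-14"]

chained = chain(row, [reformat_first])               # the second column is gone
unioned = union(row, [reformat_first, keep_second])  # both columns survive
print(chained, unioned)
```

After the chained version, any later step that reads `X[1]` has nothing to read, whereas the union keeps one output column per branch for the final subtraction step.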
Try getting my |
I checked and the code ran, but the result was not as expected: only one column was output!
I'm sorry I didn't give you sample data earlier. The simplest example data is defined as follows:

    data_test = pd.DataFrame({
        'modify_date':['20220626223702','20220629204300','20220602000000'],
        'day_id':['2022-07-14','2022-07-15','2022-09-14'],
        'channel_type_cd_3':['120507','120516','1203033']
    })

Of course, the third column has been solved under your guidance, and I'm only concerned with the first two columns now. What I need to do is also straightforward, I need to convert the
I tried to execute the code you just provided, but it kept raising errors. It seems that the relevant parameters are missing, so I modified some parameters one after another
Can you help me see how to solve it? The guidance you provided earlier has already saved me many detours! |
My earlier comment (#354 (comment)) was very close to the final solution. Here's the final form:

    from pandas import DataFrame
    from sklearn_pandas import DataFrameMapper
    from sklearn.pipeline import FeatureUnion, make_pipeline
    from sklearn2pmml.preprocessing import CastTransformer, DaysSinceYearTransformer, ExpressionTransformer

    # First column: assemble a date string, cast to date, count days since 2022
    def make_modify_date_pipeline():
        return make_pipeline(
            ExpressionTransformer("X[0][:4] + '-' + X[0][4:6] + '-' + X[0][6:8]"),
            CastTransformer(dtype = "datetime64[D]"),
            DaysSinceYearTransformer(year = 2022)
        )

    # Second column: already a date string, cast to date, count days since 2022
    def make_day_id_pipeline():
        return make_pipeline(
            ExpressionTransformer("X[1]"),
            CastTransformer(dtype = "datetime64[D]"),
            DaysSinceYearTransformer(year = 2022)
        )

    def make_feature_union():
        return FeatureUnion([
            ("modify_date", make_modify_date_pipeline()),
            ("day_id", make_day_id_pipeline())
        ])

    mapper = DataFrameMapper([
        (['modify_date','day_id'], [make_feature_union(), ExpressionTransformer("X[1] - X[0]")])
    ])

    X = DataFrame({
        'modify_date':['20220626223702','20220629204300','20220602000000'],
        'day_id':['2022-07-14','2022-07-15','2022-09-14']
    })

    Xt = mapper.fit_transform(X)
    print(Xt)

The above code snippet prints
Nice! I'll send you an invoice later! :-) |
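As a plain-Python cross-check of the date arithmetic in the final form above (not the output of that snippet), the same day differences can be computed with the standard library. The helper names are invented, and counting days from January 1 of the given year is an assumption about DaysSinceYearTransformer; the final difference does not depend on that choice, since the offset cancels out.

```python
from datetime import date

def days_since_year(d, year):
    """Number of days between January 1 of `year` and date d (assumed epoch)."""
    return (d - date(year, 1, 1)).days

def parse_compact(ts):
    """Read the leading YYYYMMDD part of a compact timestamp string."""
    return date(int(ts[:4]), int(ts[4:6]), int(ts[6:8]))

# Sample data from the thread
modify_dates = ['20220626223702', '20220629204300', '20220602000000']
day_ids = ['2022-07-14', '2022-07-15', '2022-09-14']

diffs = []
for ts, day in zip(modify_dates, day_ids):
    x0 = days_since_year(parse_compact(ts), 2022)
    x1 = days_since_year(date.fromisoformat(day), 2022)
    diffs.append(x1 - x0)  # mirrors the "X[1] - X[0]" step

print(diffs)  # [18, 16, 104]
```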
Thank you very much for your reply, I didn't expect such complete code! Yes, this code works properly with fit_transform and gives the result I want. But I don't know if you still remember my initial problem, which was that I could not convert the pipeline into PMML. I still get an error when I apply the code to my program and convert it to PMML, so I'll just paste the key part:
SEVERE: Failed to convert
java.lang.IllegalArgumentException: Python expression 'X[0][:4] + '-' + X[0][4:6] + '-' + X[0][6:8] if len(X[0]) > 0 and int(X[0][0:8]) < 20221230 else '2022-12-30'' is either invalid or not supported
at org.jpmml.sklearn.ExpressionTranslator.translate(ExpressionTranslator.java:76)
at org.jpmml.sklearn.ExpressionTranslator.translate(ExpressionTranslator.java:63)
at sklearn2pmml.preprocessing.ExpressionTransformer.encodeFeatures(ExpressionTransformer.java:47)
at sklearn.Transformer.updateAndEncodeFeatures(Transformer.java:118)
at sklearn.Composite.encodeFeatures(Composite.java:129)
at sklearn.pipeline.PipelineTransformer.encodeFeatures(PipelineTransformer.java:66)
at sklearn.Transformer.updateAndEncodeFeatures(Transformer.java:118)
at sklearn.pipeline.FeatureUnion.encodeFeatures(FeatureUnion.java:45)
at sklearn.Transformer.updateAndEncodeFeatures(Transformer.java:118)
at sklearn_pandas.DataFrameMapper.initializeFeatures(DataFrameMapper.java:73)
at sklearn.Initializer.encodeFeatures(Initializer.java:44)
at sklearn.Transformer.updateAndEncodeFeatures(Transformer.java:118)
at sklearn.Composite.encodeFeatures(Composite.java:129)
at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:208)
at org.jpmml.sklearn.Main.run(Main.java:228)
at org.jpmml.sklearn.Main.main(Main.java:148)
Caused by: org.jpmml.sklearn.ParseException: Encountered unexpected token: "[" "["
at line 1, column 5.
Was expecting one of:
"!="
"%"
"*"
"+"
"-"
"."
"/"
"<"
"<="
"=="
">"
">="
"and"
"if"
"or"
<EOF>
at org.jpmml.sklearn.ExpressionTranslator.generateParseException(ExpressionTranslator.java:1558)
at org.jpmml.sklearn.ExpressionTranslator.jj_consume_token(ExpressionTranslator.java:1426)
at org.jpmml.sklearn.ExpressionTranslator.translateExpressionInternal(ExpressionTranslator.java:215)
at org.jpmml.sklearn.ExpressionTranslator.translate(ExpressionTranslator.java:74)
... 15 more
Of course, the above error comes from code that I modified slightly (in fact, I only added one condition). But even when I made no changes to your code, the same error still occurred. There seems to be a problem with the I tried to search for the key error code And I can see related problems in past issues, but I couldn't find a direct and effective solution. Could you give me some hints?
===============
But its format is str(object), which is not useful for subsequent work and, more seriously, when I incorporated this code into my program, it changed all my columns to objects, making the model unable to predict. My current workaround is to unset the df_out parameter in the mapper so that the final output is an np.array, but I don't know if there is a better way, as this loses the input feature names!
Finally, to make the above easier to understand, let me restate my setup. I have 60 feature columns; except for modify_date and day_id, all columns are numeric and mapped to None (do nothing) in the mapper. When I used the code you provided, it produced output normally, but changed all the columns to string format! |
After searching the related posts, I found that it might be because my version of
My usage is simple

    clf_1 = joblib.load(f'marketing_risk_lgb_1.pkl')
    pipeline_test = PMMLPipeline(
        steps=[("mapper", mapper),
               ("classifier", clf_1)])
    sklearn2pmml(pipeline_test, 'pipeline_test.pmml')
|
I consulted other engineers in the company, and they replied that a model can be saved as PMML only after fit! But the background of my task is this: I have a model that I've already trained using sklearn and saved as a pkl file. I can load it in Python and make predictions, and that model is already online. Now I need to convert this pkl model to PMML format without retraining it, so that it can be deployed and invoked from Java. So the solution I came up with is to use joblib to read the pkl file, then do the feature-related operations, and then save the model directly to a PMML file, which seems to be unsupported, right? But as soon as I call fit, my model changes! |
I found a solution #352 to this problem by adding a line
Although I do not quite understand its specific role (if convenient, can you explain?), the similar error no longer appears! There seems to be only one problem left, which is the following error
I can clearly see that there is something wrong with this line of code (int is used in it), but I don't know how to deal with it. Why is int not supported? I see that similar problems (#134) were to be solved in later versions, but I am already on the latest version!
|
This error is about "array double indexing" syntax. As you mentioned, it was probably caused by the outdated SkLearn2PMML version.
The
This happens because of the It would make sense to split the initial Python expression into two parts:
|
I'll be AFK for a couple of hours. There are too many new questions in this thread, which are not related to the original question ("how can I refactor my datetime lambda function into PMML transformers pipeline") anymore. Now, please extract all your current open questions into new standalone JPMML-SkLearn/SkLearn2PMML issues - one question per issue! I'll take a look at them afterwards. |
Ok, I'm sorry for taking up so much of your time. I will close this issue and open new issues for the unfinished questions! |
I like sklearn2pmml because it helps me build a model in Python and deploy it in Java. But I've been troubled by a problem lately.
Because my pipeline uses a custom conversion function, it cannot be successfully converted using sklearn2pmml.
Here is my custom function code
Below is the pipeline code, which works properly for prediction
But when I try to convert the pipeline to a pmml file, I get an error
I tried to look it up, and the FunctionTransformer and lambda functions seem to be the problem. But I don't know how to deal with it, and I have to finish this task. How should I solve it?