ONNX model for FastLinearClassifier predictions are 1 based vs 0 based #431
Comments
I don't think the problem is that the predictions are 1-based vs 0-based. I think the problem is that the model generated by NimbusML contains a ValueToKeyTransformer, which maps 0 to 1 and 1 to 2, and then trains the model on those keys, so the model itself makes predictions in terms of those keys. Notice that the generated model doesn't include a KeyToValueTransformer after the output of the predictor, only a ValueToKey one before it.

It works without ONNX, I assume, because on the way back NimbusML somehow maps the keys back to the original values automatically. I am still trying to find where and how this happens. I guess it works because in ML.NET the PredictedLabel column is of KeyDataViewType, and on the way back NimbusML perhaps automatically maps back whatever is a KeyDataViewType. With ONNX this doesn't happen, because the PredictedLabel.onnx column is not of KeyDataViewType (in this case it's float), so the conversion I would expect never runs.

In pure ML.NET (without NimbusML) this isn't a problem, because the mapping back isn't done automatically: the user always has to include a KeyToValueTransformer at the end of the pipeline, and that mapping is then also exported to ONNX as part of the model.

Ideally the solution would be to somehow automatically attach a KeyToValueTransformer at the end of the NimbusML pipeline when exporting to ONNX. But, as discussed offline, this might be a somewhat harder and more general thing to do. A workaround could have been to add a
NimbusML/src/python/nimbusml/pipeline.py, line 1954, in commit d08b702
This means that scoring the model from NimbusML would give different results than both ONNX and ML.NET. I am not sure the correct solution is to manually add the key-to-value transformer when exporting to ONNX; wouldn't it make more sense to have NimbusML create the model including the last transformer that it applies at predict time? /cc @antoniovs1029
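The key-mapping behavior discussed above can be sketched in plain Python (this is an illustration of the mechanics, not the actual ML.NET transforms): a ValueToKey-style step assigns 1-based keys before training, and because no KeyToValue step is appended after the predictor, its inverse never makes it into the exported ONNX model.

```python
# Sketch of ValueToKey / KeyToValue mechanics (plain Python, illustrative only).
original_labels = [0.0, 1.0, 0.0, 1.0]

# ValueToKeyTransformer assigns 1-based keys in order of first appearance.
value_to_key = {}
for v in original_labels:
    if v not in value_to_key:
        value_to_key[v] = len(value_to_key) + 1  # keys start at 1

keys = [value_to_key[v] for v in original_labels]
print(keys)  # the predictor is trained on, and predicts, these keys

# The missing step: a KeyToValueTransformer would invert the mapping.
key_to_value = {k: v for v, k in value_to_key.items()}
recovered = [key_to_value[k] for k in keys]
print(recovered)
```

Without the inversion at the end, the ONNX graph stops at `keys`, which is exactly the 1-based output seen in the repro below.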
I looked at the model created by the script here, and it seems that the model implicitly adds some transforms that should not be there. The pipeline looks like this:
I am guessing that this is a problem with the script, and that it should somehow specify which columns to use as features (or split the DataFrame into one containing only the features and one containing only the label). But if a user doesn't do that, this is really bad, because the trainer trains on the label column... /cc @ganik
Yes, as you mention, adding a KeyToValue transformer when exporting to ONNX, as done in my PR dotnet/machinelearning#4841 (comment), would be problematic, and adding it to the NimbusML pipeline would be more ideal. But, as discussed offline, it seems impossible to add a transform (such as KeyToValue) after a predictor in NimbusML, and allowing that appears to require major changes in the way NimbusML communicates with ML.NET. That's why the solution in my PR seems like the better (temporary) option, even with the downside you mention.
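Until a fix lands on either side, a user-side workaround is to apply the inverse mapping to the ONNX output by hand. A minimal sketch, assuming the {0→1, 1→2} key mapping from this binary repro (the column values are the 1-based keys the ONNX model emits):

```python
# User-side workaround sketch: undo the ValueToKey mapping on ONNX predictions.
onnx_predictions = [2.0, 2.0, 1.0, 1.0]  # 1-based keys as emitted by the ONNX model
key_to_value = {1.0: 0.0, 2.0: 1.0}      # inverse of the ValueToKey mapping
fixed = [key_to_value[p] for p in onnx_predictions]
print(fixed)
```

This is brittle, of course: the mapping depends on the order in which label values were first seen during training, so it has to match the exported model's internal ValueToKey table.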
To be honest, I don't know about this; it might be a separate issue. Perhaps @ganik would have something to say in that regard.
Yes, there are several "hidden" transforms added for, let's say, "smooth" integration between Python and ML.NET. These transforms are part of the model because they are added during fit(..), so they will be converted and become part of the final ONNX model.
Shouldn't NimbusML make sure not to add these transforms that make the label column part of the features? |
Repro
```python
from nimbusml.datasets import get_dataset
from nimbusml.linear_model import FastLinearClassifier
from nimbusml.preprocessing import OnnxRunner

# Build a binary-classification frame: drop the string species column,
# drop the multiclass label, and use the 'Setosa' indicator (0/1) as 'Label'.
iris_df = get_dataset("iris").as_df()
iris_df.drop(['Species'], axis=1, inplace=True)
iris_no_label_df = iris_df.drop(['Label'], axis=1)
iris_binary_df = iris_no_label_df.rename(columns={'Setosa': 'Label'})

# Train and predict through NimbusML: predictions come back as 0/1.
predictor = FastLinearClassifier()
predictor.fit(iris_binary_df)
print(predictor.predict(iris_binary_df))

# Export to ONNX and score the same data through OnnxRunner:
# PredictedLabel.onnx.0 comes back as 1/2 instead.
predictor.export_to_onnx("test.onnx", 'com.microsoft.ml')
onnxrunner = OnnxRunner(model_file="test.onnx")
print(onnxrunner.fit_transform(iris_binary_df))
```
```
Elapsed time: 00:00:00.9034678
0      1.0
1      1.0
2      1.0
3      1.0
4      1.0
      ...
145    0.0
146    0.0
147    0.0
148    0.0
149    0.0
Name: PredictedLabel, Length: 150, dtype: float64
     Sepal_Length  Sepal_Width  Petal_Length  ...  PredictedLabel.onnx.0  Score.onnx.0  Score.onnx.1
0             5.1          3.5           1.4  ...                    2.0      0.302603      0.697397
1             4.9          3.0           1.4  ...                    2.0      0.333512      0.666488
2             4.7          3.2           1.3  ...                    2.0      0.310926      0.689074
3             4.6          3.1           1.5  ...                    2.0      0.331948      0.668052
4             5.0          3.6           1.4  ...                    2.0      0.295445      0.704555
..            ...          ...           ...  ...                    ...           ...           ...
145           6.7          3.0           5.2  ...                    1.0      0.953864      0.046136
146           6.3          2.5           5.0  ...                    1.0      0.934170      0.065830
147           6.5          3.0           5.2  ...                    1.0      0.936460      0.063540
148           6.2          3.4           5.4  ...                    1.0      0.950669      0.049331
149           5.9          3.0           5.1  ...                    1.0      0.917651      0.082349

[150 rows x 25 columns]
```
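A spot-check on the values shown in the two outputs above (the first and last five rows) confirms that the ONNX column is exactly the NimbusML column pushed through the 1-based ValueToKey mapping, not an unrelated scoring difference:

```python
# Values copied from the two outputs in the repro above.
nimbusml_preds = [1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
onnx_preds     = [2.0, 2.0, 2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0]

# Assumed ValueToKey mapping for this binary case: 0.0 -> 1, 1.0 -> 2.
value_to_key = {0.0: 1.0, 1.0: 2.0}
assert [value_to_key[p] for p in nimbusml_preds] == onnx_preds
print("ONNX output = ValueToKey(NimbusML output)")
```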