SparkML GBTClassifier fails to convert to ONNX #648

Closed
yungcero opened this issue Sep 1, 2023 · 4 comments

@yungcero
Contributor

yungcero commented Sep 1, 2023

Hi all, I am unable to convert a Spark ML GBTClassifier to ONNX. I am not sure whether this is an initialization issue on my side or a feature that has not been added yet, so I would appreciate some insight into a solution; if the feature is not supported, I am happy to help contribute. The code below, which reproduces the error, is taken from the Spark ML GBTClassifier test in the onnxmltools test folder. I have also reproduced the error with other models I trained, where the label column is of double type (0.0/1.0) and the feature column is built with VectorAssembler.

lib versions:
onnxmltools: 1.11.2
onnxconverter-common: 1.13.0
pyspark: 3.3.2

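from onnxmltools import convert_sparkml
from onnxmltools.convert.common.data_types import FloatTensorType
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import StringIndexer
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

# getOrCreate() reuses an existing session if there is one.
spark = SparkSession.builder.getOrCreate()

# Two-row toy dataset: one dense and one (empty) sparse feature vector.
raw_data = spark.createDataFrame(
    [(1.0, Vectors.dense(1.0)), (0.0, Vectors.sparse(1, [], []))],
    ["label", "features"],
)
# Index the double-typed label column.
string_indexer = StringIndexer(inputCol="label", outputCol="indexed")
si_model = string_indexer.fit(raw_data)
data = si_model.transform(raw_data)
gbt = GBTClassifier(maxIter=5, maxDepth=2, labelCol="indexed", seed=42)
model = gbt.fit(data)
# Column 1 is "features"; its size is the number of input features.
feature_count = data.first()[1].size
# The AnalysisException is raised inside this call.
model_onnx = convert_sparkml(
    model,
    "Sparkml GBT Classifier",
    [("features", FloatTensorType([1, feature_count]))],
    spark_session=spark,
    target_opset=9,
)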

This fails inside convert_sparkml, specifically in https://github.com/onnx/onnxmltools/blob/main/onnxmltools/convert/sparkml/operator_converters/tree_ensemble_common.py at line 65, where the converter tries to read the saved model back as Parquet. The error message is: AnalysisException: Unable to infer schema for Parquet. It must be specified manually.
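For anyone debugging the same error: Spark typically reports "Unable to infer schema for Parquet" when the directory it is asked to read contains no Parquet part files. The sketch below is one way to check that, assuming the model was saved to a local path and that its tree data sits in a `data` subdirectory (the usual layout for saved Spark ML tree ensembles, but treat it as an assumption here). `list_parquet_parts` is a hypothetical helper written for this illustration, not part of onnxmltools or PySpark.

```python
import os


def list_parquet_parts(model_dir, subdir="data"):
    """Return the Parquet part files found under model_dir/subdir.

    Hypothetical helper: the "data" subdirectory name is an assumption
    about how the saved model is laid out on a local filesystem.
    """
    data_dir = os.path.join(model_dir, subdir)
    if not os.path.isdir(data_dir):
        # Directory missing entirely: nothing was written to this path.
        return []
    return sorted(
        name
        for name in os.listdir(data_dir)
        # Skip marker and checksum files such as _SUCCESS and .*.crc.
        if name.endswith(".parquet") and not name.startswith((".", "_"))
    )
```

An empty list for the path the converter reads from would suggest the model files were never written there (or were written to a different filesystem), which would be consistent with the schema inference error.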

Any help is appreciated. Thanks in advance!

@yungcero
Contributor Author

yungcero commented Sep 12, 2023

I found some bugs while walking through the source code, related to how Spark handles the model conversion. I will work on a fix and push up a pull request for review.

@yungcero
Contributor Author

I am going to clean up the Spark code on our forked repo and then open a pull request to merge it back into the main onnxmltools repo.

xadupre added a commit that referenced this issue Oct 2, 2023
* Check if base_score is available and it is a string type convert it to float (#637)

Signed-off-by: Donald Tolley <tolleybot@gmail.com>
Signed-off-by: Xavier Dupre <xadupre@microsoft.com>
Co-authored-by: Donald Tolley <tolleybot@gmail.com>
Signed-off-by: James Cao <james.cao@ironwoodcyber.com>

* signed (#639)

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>
Signed-off-by: James Cao <james.cao@ironwoodcyber.com>

* Bump ONNX 1.14.1 in CI pipelines (#644)

* verify onnx 1.14.1 rc2

Signed-off-by: jcwchen <jacky82226@gmail.com>

* Bump ONNX 1.14.1

Signed-off-by: jcwchen <jacky82226@gmail.com>

---------

Signed-off-by: jcwchen <jacky82226@gmail.com>
Signed-off-by: James Cao <james.cao@ironwoodcyber.com>

* fix (dev): Working start to address issue #648. This will help enable saving and reading of models from Spark, a requirement for GBTClassifier tree conversion

Signed-off-by: James Cao <james.cao@ironwoodcyber.com>

* feat: Allow conversions of SparkML models to ONNX using cluster mode

Signed-off-by: James Cao <james.cao@ironwoodcyber.com>

* fix: fix bug that did not fully create temp paths

Signed-off-by: James Cao <james.cao@ironwoodcyber.com>

* fix: reformat style

Signed-off-by: James Cao <james.cao@ironwoodcyber.com>

* fix: Fixed formatting style to pass ruff tests

Signed-off-by: James Cao <james.cao@ironwoodcyber.com>

---------

Signed-off-by: Donald Tolley <tolleybot@gmail.com>
Signed-off-by: Xavier Dupre <xadupre@microsoft.com>
Signed-off-by: James Cao <james.cao@ironwoodcyber.com>
Signed-off-by: jcwchen <jacky82226@gmail.com>
Co-authored-by: Xavier Dupré <xadupre@users.noreply.github.com>
Co-authored-by: Donald Tolley <tolleybot@gmail.com>
Co-authored-by: Chun-Wei Chen <jacky82226@gmail.com>
@xadupre
Collaborator

xadupre commented Oct 2, 2023

I'll close this since the PR fixing it was merged.

@xadupre xadupre closed this as completed Oct 2, 2023
@yungcero
Contributor Author

yungcero commented Oct 2, 2023

Awesome. Thank you and sounds good! @xadupre
