
[BUG]java.lang.ArrayIndexOutOfBoundsException on multi-node cluster run #2278

Open · 1 of 19 tasks

bjm88620 opened this issue Sep 5, 2024 · 1 comment · May be fixed by #2282

bjm88620 commented Sep 5, 2024

SynapseML version

com.microsoft.azure:synapseml_2.12:0.11.4-spark3.3

System information

  • Language version (e.g. python 3.8, scala 2.12): python 3.9
  • Spark Version (e.g. 3.2.3): 3.3.2
  • Spark Platform (e.g. Synapse, Databricks): Databricks

Describe the problem

I have a for-loop job that fits a LightGBM model repeatedly for rolling validation.
The job fails on a multi-node cluster with a Connection Refused error in the log. After checking the failed tasks, I found that the executor failed with the detailed error java.lang.ArrayIndexOutOfBoundsException, which in turn caused the Connection Refused error.

Meanwhile, the job runs on a single-node cluster without any issue.

The dataframe sent to the model has around 48,000 rows, partitioned as below:

Partition 0 has 19000 records
Partition 1 has 18000 records
Partition 2 has 7000 records
Partition 3 has 4000 records

The issue is not fixed by `df.repartition(5)`.

[Screenshot attached: 2024-09-04 at 21:16:29]

Code to reproduce issue

from pyspark.sql import functions as sf

max_base_date = '2024-09-01'
# Keep only rows strictly before the cutoff date for this validation window
tmp_train_df = train_merged_df.where(sf.col('base_date') < max_base_date).cache()
tmp_actual_df = actual_merged_df.where(sf.col('base_date') < max_base_date).cache()
model.fit(tmp_train_df, tmp_actual_df)

Other info / logs

No response

What component(s) does this bug affect?

  • area/cognitive: Cognitive project
  • area/core: Core project
  • area/deep-learning: DeepLearning project
  • area/lightgbm: Lightgbm project
  • area/opencv: Opencv project
  • area/vw: VW project
  • area/website: Website
  • area/build: Project build system
  • area/notebooks: Samples under notebooks folder
  • area/docker: Docker usage
  • area/models: models related issue

What language(s) does this bug affect?

  • language/scala: Scala source code
  • language/python: Pyspark APIs
  • language/r: R APIs
  • language/csharp: .NET APIs
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/synapse: Azure Synapse integrations
  • integrations/azureml: Azure ML integrations
  • integrations/databricks: Databricks integrations
@bjm88620 bjm88620 added the bug label Sep 5, 2024
@github-actions github-actions bot added the triage label Sep 5, 2024
dciborow added a commit that referenced this issue Sep 7, 2024
Fixes #2278

Address the `java.lang.ArrayIndexOutOfBoundsException` error in multi-node cluster runs.

* **Error Handling:**
  - Add error handling for `scoredDataOutPtr` and `scoredDataLengthLongPtr` pointers in the `score`, `predictLeaf`, `featuresShap`, and `innerPredict` methods in `LightGBMBooster.scala`.
  - Ensure proper deletion of `scoredDataOutPtr` and `scoredDataLengthLongPtr` pointers after use in the `innerPredict` method.

* **Testing:**
  - Add a new test file `LightGBMBoosterTest.scala`.
  - Add test cases to verify that the `score`, `predictLeaf`, `featuresShap`, and `innerPredict` methods handle `scoredDataOutPtr` and `scoredDataLengthLongPtr` pointers correctly.

---

For more details, open the [Copilot Workspace session](https://copilot-workspace.githubnext.com/microsoft/SynapseML/issues/2278?shareId=XXXX-XXXX-XXXX-XXXX).

bjm88620 commented Sep 11, 2024

Hi @dciborow, I can see that the fix PR has been created. Could you confirm whether the fix will be available for com.microsoft.azure:synapseml_2.12:0.11.4-spark3.3? Thanks in advance.
