Fix Horovod pyarrow IndexError: list index out of range #3255

WeichenXu123 · 2021-11-02T09:04:48Z

Checklist before submitting

Did you read the contributor guide?
Did you update the docs?
Did you write any tests to validate this change?
Did you update the CHANGELOG, if this change affects users?

Description

Fixes #2193.
Fix Horovod pyarrow IndexError: list index out of range

Review process to land

All tests and other checks must succeed.
At least one member of the technical steering committee must review and approve.
If any member of the technical steering committee requests changes, they must be addressed.

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

WeichenXu123 · 2021-11-02T09:12:56Z

This fix is similar to the approach used in the Spark dataset converter:

After saving the Spark DataFrame, we get the list of all file URLs and wait for some time until all the files become available to read.
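A rough sketch of the waiting step described above, not the code in this PR. The function name `wait_files_available` and the `exists` parameter are hypothetical; the PR's own helper is `_wait_file_available(store, saved_file_list)` and checks existence through the store. The 0.1 s poll interval, the 30 s default and the pool size of 64 are the values visible in the reviewed diff.

```python
import os
import time
from multiprocessing.pool import ThreadPool


def wait_files_available(url_list, exists=os.path.exists,
                         timeout_secs=30.0, poll_secs=0.1):
    """Block until every path in url_list exists; raise if any times out.

    `exists` is a stand-in for whatever existence check the store offers
    (assumption: here it defaults to the local filesystem).
    """
    if not url_list:
        return

    def wait_for_file(path):
        # Poll one path until it shows up or its deadline passes.
        deadline = time.monotonic() + timeout_secs
        while time.monotonic() < deadline:
            if exists(path):
                return True
            time.sleep(poll_secs)
        return exists(path)

    # Check the files concurrently so slow files do not add up serially.
    pool = ThreadPool(min(len(url_list), 64))
    try:
        results = pool.map(wait_for_file, url_list)
    finally:
        pool.close()
        pool.join()

    missing = [path for path, ok in zip(url_list, results) if not ok]
    if missing:
        raise RuntimeError(f"Timed out waiting for files: {missing}")
```

Each file gets its own deadline, so the total wait is bounded by roughly one timeout per batch of pooled files rather than by the number of files.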

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

github-actions · 2021-11-02T12:28:46Z

Unit Test Results

86 files -298 | 86 suites -298 | 55m 48s ⏱️ -3h 53m 4s
658 tests -47 | 489 ✔️ -78 | 168 💤 +30 | 1 ❌ +1
1 911 runs -6 539 | 1 311 ✔️ -4 422 | 599 💤 -2 118 | 1 ❌ +1

For more details on these failures, see this check.

Results for commit de0b894. ± Comparison against base commit 660f7ff.

This pull request removes 47 tests.

test.integration.test_elastic_spark_tensorflow2.ElasticSparkTensorflow2Tests ‑ test_auto_scale_down_by_discovery
test.integration.test_elastic_spark_tensorflow2.ElasticSparkTensorflow2Tests ‑ test_auto_scale_down_by_exception
test.integration.test_elastic_spark_tensorflow2.ElasticSparkTensorflow2Tests ‑ test_auto_scale_no_spark_black_list
test.integration.test_elastic_spark_tensorflow2.ElasticSparkTensorflow2Tests ‑ test_auto_scale_spark_blacklist_no_executor_reuse
test.integration.test_elastic_spark_tensorflow2.ElasticSparkTensorflow2Tests ‑ test_auto_scale_spark_blacklist_no_executor_reuse_in_app
test.integration.test_elastic_spark_tensorflow2.ElasticSparkTensorflow2Tests ‑ test_auto_scale_spark_blacklist_no_executor_reuse_same_task
test.integration.test_elastic_spark_tensorflow2.ElasticSparkTensorflow2Tests ‑ test_auto_scale_spark_blacklist_no_node_reuse
test.integration.test_elastic_spark_tensorflow2.ElasticSparkTensorflow2Tests ‑ test_auto_scale_spark_blacklist_no_node_reuse_in_app
test.integration.test_elastic_spark_tensorflow2.ElasticSparkTensorflow2Tests ‑ test_auto_scale_up
test.integration.test_elastic_spark_tensorflow2.ElasticSparkTensorflow2Tests ‑ test_fault_tolerance_all_hosts_lost
…

This pull request skips 36 tests.

test.integration.test_spark.SparkTests ‑ test_happy_run_with_mpi
test.integration.test_spark.SparkTests ‑ test_timeout_with_mpi
test.integration.test_static_run.StaticRunTests ‑ test_run_failure_mpi_local_cmd
test.integration.test_static_run.StaticRunTests ‑ test_run_failure_mpi_local_func
test.integration.test_static_run.StaticRunTests ‑ test_run_failure_mpi_mixed_cmd
test.integration.test_static_run.StaticRunTests ‑ test_run_failure_mpi_mixed_func
test.integration.test_static_run.StaticRunTests ‑ test_run_failure_mpi_remote_cmd
test.integration.test_static_run.StaticRunTests ‑ test_run_failure_mpi_remote_func
test.integration.test_static_run.StaticRunTests ‑ test_run_success_mpi_local_cmd
test.integration.test_static_run.StaticRunTests ‑ test_run_success_mpi_local_func
…

♻️ This comment has been updated with latest results.

github-actions · 2021-11-02T12:29:02Z

Unit Test Results (with flaky tests)

92 files -322 | 92 suites -322 | 1h 8m 14s ⏱️ -3h 57m 38s
658 tests -47 | 489 ✔️ -78 | 168 💤 +30 | 1 ❌ +1
2 091 runs -7 013 | 1 477 ✔️ -4 678 | 613 💤 -2 336 | 1 ❌ +1

For more details on these failures, see this check.

Results for commit de0b894. ± Comparison against base commit 660f7ff.

This pull request removes 47 tests.

test.integration.test_elastic_spark_tensorflow2.ElasticSparkTensorflow2Tests ‑ test_auto_scale_down_by_discovery
test.integration.test_elastic_spark_tensorflow2.ElasticSparkTensorflow2Tests ‑ test_auto_scale_down_by_exception
test.integration.test_elastic_spark_tensorflow2.ElasticSparkTensorflow2Tests ‑ test_auto_scale_no_spark_black_list
test.integration.test_elastic_spark_tensorflow2.ElasticSparkTensorflow2Tests ‑ test_auto_scale_spark_blacklist_no_executor_reuse
test.integration.test_elastic_spark_tensorflow2.ElasticSparkTensorflow2Tests ‑ test_auto_scale_spark_blacklist_no_executor_reuse_in_app
test.integration.test_elastic_spark_tensorflow2.ElasticSparkTensorflow2Tests ‑ test_auto_scale_spark_blacklist_no_executor_reuse_same_task
test.integration.test_elastic_spark_tensorflow2.ElasticSparkTensorflow2Tests ‑ test_auto_scale_spark_blacklist_no_node_reuse
test.integration.test_elastic_spark_tensorflow2.ElasticSparkTensorflow2Tests ‑ test_auto_scale_spark_blacklist_no_node_reuse_in_app
test.integration.test_elastic_spark_tensorflow2.ElasticSparkTensorflow2Tests ‑ test_auto_scale_up
test.integration.test_elastic_spark_tensorflow2.ElasticSparkTensorflow2Tests ‑ test_fault_tolerance_all_hosts_lost
…

This pull request skips 36 tests.

test.integration.test_spark.SparkTests ‑ test_happy_run_with_mpi
test.integration.test_spark.SparkTests ‑ test_timeout_with_mpi
test.integration.test_static_run.StaticRunTests ‑ test_run_failure_mpi_local_cmd
test.integration.test_static_run.StaticRunTests ‑ test_run_failure_mpi_local_func
test.integration.test_static_run.StaticRunTests ‑ test_run_failure_mpi_mixed_cmd
test.integration.test_static_run.StaticRunTests ‑ test_run_failure_mpi_mixed_func
test.integration.test_static_run.StaticRunTests ‑ test_run_failure_mpi_remote_cmd
test.integration.test_static_run.StaticRunTests ‑ test_run_failure_mpi_remote_func
test.integration.test_static_run.StaticRunTests ‑ test_run_success_mpi_local_cmd
test.integration.test_static_run.StaticRunTests ‑ test_run_success_mpi_local_func
…

♻️ This comment has been updated with latest results.

WeichenXu123 · 2021-11-02T13:39:11Z

@tgaddair Could you take a look?

tgaddair

Nice job tracking this down! Couple minor comments.

tgaddair · 2021-11-04T03:14:12Z

horovod/spark/common/util.py

+            time.sleep(0.1)
+        return False
+
+    pool = ThreadPool(64)


64 seems like a lot in some cases, maybe this can be min(len(url_list), 64)?
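One way the suggestion could look, as a sketch only. `pool_size_for` and `map_in_pool` are hypothetical names, not functions from `horovod/spark/common/util.py`. Note that `ThreadPool` rejects a size below 1, so a bare `min(len(url_list), 64)` would raise for an empty list; the sketch clamps to 1.

```python
from multiprocessing.pool import ThreadPool

MAX_POOL_SIZE = 64  # upper bound taken from the reviewed diff


def pool_size_for(url_list, max_size=MAX_POOL_SIZE):
    # Use one thread per URL, capped at max_size.
    # ThreadPool raises ValueError for a size below 1, so clamp at 1.
    return max(1, min(len(url_list), max_size))


def map_in_pool(func, url_list):
    # Run func over url_list with a pool sized to the work at hand.
    pool = ThreadPool(pool_size_for(url_list))
    try:
        return pool.map(func, url_list)
    finally:
        pool.close()
        pool.join()
```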

tgaddair · 2021-11-04T03:14:56Z

horovod/spark/common/util.py

@@ -539,6 +541,40 @@ def _train_val_split(df, validation):
    return train_df, val_df, validation_ratio


+_FILE_AVAILABILITY_WAIT_TIMEOUT_SECS = 30


Can we make this configurable, possibly with an env variable?
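A possible shape for this, assuming an environment variable is the chosen mechanism. The variable name `FILE_AVAILABILITY_WAIT_TIMEOUT_SECS` and the helper `get_wait_timeout_secs` are made up for this sketch; only the default of 30 seconds comes from the diff.

```python
import os

_DEFAULT_WAIT_TIMEOUT_SECS = 30.0  # default from the reviewed diff
# Hypothetical variable name, chosen for this sketch only.
_WAIT_TIMEOUT_ENV = "FILE_AVAILABILITY_WAIT_TIMEOUT_SECS"


def get_wait_timeout_secs(environ=os.environ):
    """Return the timeout in seconds, read from the environment if set."""
    raw = environ.get(_WAIT_TIMEOUT_ENV)
    if raw is None:
        return _DEFAULT_WAIT_TIMEOUT_SECS
    try:
        value = float(raw)
    except ValueError:
        raise ValueError(
            f"{_WAIT_TIMEOUT_ENV} must be a number of seconds, got {raw!r}")
    if value <= 0:
        raise ValueError(f"{_WAIT_TIMEOUT_ENV} must be positive, got {raw!r}")
    return value
```

Reading the variable at call time rather than at import time lets tests and long-running drivers change the timeout without reloading the module.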

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

tgaddair

LGTM! Thanks for the fix.

EnricoMi · 2021-11-08T13:17:10Z

@tgaddair this PR broke master: https://github.com/horovod/horovod/runs/4106828558?check_suite_focus=true#step:163:67
which made @romerojosh disable all failing tests in #3259

This should not have been merged into master. Can we revert this? Or have #3263 work around the issue and then reopen this PR for discussion?

EnricoMi · 2021-11-08T13:23:35Z

@WeichenXu123 I doubt that this returns the file urls where train_df and val_df have been written to:

            saved_file_list = list(train_df._jdf.inputFiles())
            if val_df:
                saved_file_list += list(val_df._jdf.inputFiles())

            _wait_file_available(store, saved_file_list)

This retrieves the inputFiles of train_df and val_df, where "saved_file" sounds more like you are looking for the output file. I suspect train_df to be immutable w.r.t. train_df.write.parquet in such a way that train_df does not know which file it has been written to through the DataFrameWriter.

This is why _wait_file_available(store, saved_file_list) fails: saved_file_list is an empty list.
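To illustrate the distinction, here is a sketch that lists the files under the output directory after the write, instead of asking the source DataFrame for its input files. It assumes a local filesystem and Parquet part files ending in `.parquet`; a remote store would need its own listing call. Both function names are hypothetical and nothing here is taken from the follow-up PR.

```python
import glob
import os


def list_written_parquet_files(output_dir):
    # Collect the part files that actually exist under the output path.
    pattern = os.path.join(output_dir, "**", "*.parquet")
    return sorted(glob.glob(pattern, recursive=True))


def require_non_empty(file_list, what):
    # Fail loudly instead of silently waiting on an empty list.
    if not file_list:
        raise ValueError(f"No files found for {what}")
    return file_list
```

The explicit empty-list check would have surfaced the problem described above as a clear error at the call site.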

EnricoMi · 2021-11-08T13:25:19Z

Looks like _wait_file_available has been tested but DataFrame._jdf.inputFiles() hasn't.

This reverts commit 3efc229. Signed-off-by: Travis Addair <tgaddair@gmail.com>


WeichenXu123 · 2021-11-11T14:01:36Z

Oh sorry, this seems to be my fault. I will fix it in a follow-up PR.
CC @tgaddair: it seems the PR was merged before I had carefully checked its CI status.


init

7e97c77

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

WeichenXu123 force-pushed the fix_issue_2193 branch from 14372eb to 7e97c77 Compare November 2, 2021 09:05

update

3222eba

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

WeichenXu123 added 2 commits November 2, 2021 17:52

fix

19e06a5

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

fix

e5976c1

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

tgaddair reviewed Nov 4, 2021


update

de0b894

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

tgaddair approved these changes Nov 4, 2021


tgaddair merged commit 3efc229 into horovod:master Nov 4, 2021

EnricoMi mentioned this pull request Nov 8, 2021

Fixing up current CI test failures. #3259

Merged

tgaddair added a commit that referenced this pull request Nov 8, 2021

Revert "Fix Horovod pyarrow IndexError: list index out of range (#3255)"

a6fe96b

This reverts commit 3efc229. Signed-off-by: Travis Addair <tgaddair@gmail.com>

tgaddair mentioned this pull request Nov 8, 2021

Revert "Fix Horovod pyarrow IndexError: list index out of range (#3255)" #3265

Merged

tgaddair added a commit that referenced this pull request Nov 8, 2021

Revert "Fix Horovod pyarrow IndexError: list index out of range (#3255)…

0c6988e

…" (#3265) This reverts commit 3efc229. Signed-off-by: Travis Addair <tgaddair@gmail.com>

WeichenXu123 mentioned this pull request Nov 12, 2021

Fix Horovod pyarrow IndexError: list index out of range #3274

Merged

4 tasks

weihanmines pushed a commit to weihanmines/horovod that referenced this pull request Dec 11, 2021

Fix Horovod pyarrow IndexError: list index out of range (horovod#3255)

9bcc76c

Signed-off-by: Weichen Xu <weichen.xu@databricks.com> Signed-off-by: weihanmines <weihan13@amd.com>
