
fix: Move model to the correct device for eval #3554

Merged
tgaddair merged 6 commits into master from hyperopt-device-mismatch on Aug 30, 2023

Conversation

jeffkinnison (Contributor)

This update addresses the issue described in #3544, in which a device mismatch error occurred when running evaluation on GPU.

Repros:

  1. The original issue repros in tests/integration_tests/test_automl.py::test_auto_train when using GPU.
  2. A similar error occurs during tests/integration_tests/test_api.py::test_api_intent_classification when using GPU.

Fixes:

  1. Move the model to GPU in ludwig.models.predictor.Predictor.batch_evaluation when using GPU.
  2. Move the model to GPU in ludwig.models.predictor.Predictor.batch_prediction when using GPU. (A minimal sketch of this pattern follows the list.)
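
The sketch below shows the general pattern in plain PyTorch, not the actual Predictor code: move the model (and each batch) to the evaluation device before the forward pass. The function name, loader, and batch layout are hypothetical.

import torch


def evaluate_sketch(model: torch.nn.Module, loader):
    # Hypothetical illustration of the fix: put the model on the evaluation
    # device so that model weights and input batches live on the same device.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)  # no-op if the model is already on `device`
    model.eval()
    outputs = []
    with torch.no_grad():
        for inputs in loader:
            inputs = inputs.to(device)  # inputs must match the model's device
            outputs.append(model(inputs))
    return outputs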

github-actions bot commented Aug 29, 2023

Unit Test Results

6 files ±0    6 suites ±0    1h 35m 59s ⏱️ +14m 11s
2 826 tests +2 792    2 813 ✔️ +2 784    12 💤 +7    1 ❌ +1
2 869 runs +2 781    2 847 ✔️ +2 775    21 💤 +5    1 ❌ +1

For more details on these failures, see this check.

Results for commit d325924. Comparison against base commit f34c272.

♻️ This comment has been updated with latest results.

@jeffkinnison (Contributor, Author)

It looks like there may have been an error in test_automl.py. _run_train_with_config checks that post-hyperopt evaluations are handled correctly. There are two cases explicitly tested:

  1. At least one trial completes and a best trial exists to evaluate
  2. No trial completes and no evaluation runs

In the first case, the test was asserting that RayTuneExecutor._evaluate_best_model is called 0 times, but a successful hyperopt run should follow up with an evaluation step that calls _evaluate_best_model exactly once.
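
For illustration, the corrected expectation would look roughly like the sketch below. It assumes RayTuneExecutor is importable from ludwig.hyperopt.execution; run_hyperopt and config are hypothetical stand-ins for the real test fixtures.

from unittest import mock

from ludwig.hyperopt.execution import RayTuneExecutor  # assumed import path


def test_evaluates_best_model_once():
    # With at least one completed trial, the best model should be evaluated
    # exactly once after hyperopt finishes.
    with mock.patch.object(RayTuneExecutor, "_evaluate_best_model") as mock_eval:
        run_hyperopt(config)  # hypothetical helper that runs the hyperopt flow
        assert mock_eval.call_count == 1  # the old test expected 0 calls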

try:
    self.dist_model = self._distributed.to_device(self.dist_model)
except AttributeError:
    logging.info("Using DDP, skipping device assignment.")
Collaborator

Why do we need this handler? DDP just inherits the base implementation.

@jeffkinnison (Contributor, Author) on Aug 29, 2023

Hmm. I ran into an AttributeError when calling distributed.to_device in a test using DistributedDataParallel (not DDPStrategy). Double-checking that now.

@jeffkinnison (Contributor, Author) on Aug 29, 2023

I can't recreate the AttributeError; it may have been a separate mistake on my part while testing the fix. The try/except block has been removed.

@jeffkinnison (Contributor, Author) on Aug 29, 2023

It showed up in the unit tests here when using DistributedDataParallel rather than DDPStrategy. The exception is raised when calling model.to_device inside BaseStrategy.to_device.

ray.exceptions.RayTaskError(AttributeError): ray::_Inner.train() (pid=2974, ip=10.1.110.4, repr=TorchTrainer)
E                 File "/opt/hostedtoolcache/Python/3.10.12/x64/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 368, in train
E                   raise skipped from exception_cause(skipped)
E                 File "/opt/hostedtoolcache/Python/3.10.12/x64/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
E                   ray.get(object_ref)
E               ray.exceptions.RayTaskError(AttributeError): ray::RayTrainWorker._RayTrainWorker__execute() (pid=3026, ip=10.1.110.4, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7ff7b49fe650>)
E                 File "/opt/hostedtoolcache/Python/3.10.12/x64/lib/python3.10/site-packages/ray/train/_internal/worker_group.py", line 31, in __execute
E                   raise skipped from exception_cause(skipped)
E                 File "/opt/hostedtoolcache/Python/3.10.12/x64/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
E                   train_func(*args, **kwargs)
E                 File "/home/runner/work/ludwig/ludwig/ludwig/backend/ray.py", line 498, in <lambda>
E                   lambda config: train_fn(**config),
E                 File "/home/runner/work/ludwig/ludwig/ludwig/backend/ray.py", line 215, in train_fn
E                   results = trainer.train(train_shard, val_shard, test_shard, return_state_dict=True, **kwargs)
E                 File "/home/runner/work/ludwig/ludwig/ludwig/distributed/base.py", line 155, in wrapped
E                   res = fn(*args, **kwargs)
E                 File "/home/runner/work/ludwig/ludwig/ludwig/trainers/trainer.py", line 812, in train
E                   should_break = self._train_loop(
E                 File "/home/runner/work/ludwig/ludwig/ludwig/trainers/trainer.py", line 981, in _train_loop
E                   should_break = self.run_evaluation(
E                 File "/home/runner/work/ludwig/ludwig/ludwig/trainers/trainer.py", line 572, in run_evaluation
E                   self.evaluation(test_set, TEST, progress_tracker.test_metrics, self.eval_batch_size, progress_tracker)
E                 File "/home/runner/work/ludwig/ludwig/ludwig/trainers/trainer.py", line 1074, in evaluation
E                   metrics, _ = predictor.batch_evaluation(dataset, collect_predictions=False, dataset_name=dataset_name)
E                 File "/home/runner/work/ludwig/ludwig/ludwig/models/predictor.py", line 219, in batch_evaluation
E                   self.dist_model = self._distributed.to_device(self.dist_model)
E                 File "/home/runner/work/ludwig/ludwig/ludwig/distributed/base.py", line 53, in to_device
E                   return model.to_device(device if device is not None else get_torch_device())
E                 File "/opt/hostedtoolcache/Python/3.10.12/x64/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
E                   raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
E               AttributeError: 'DistributedDataParallel' object has no attribute 'to_device'
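
For anyone hitting this outside Ludwig: DistributedDataParallel is a plain nn.Module wrapper and does not expose Ludwig-specific helpers such as to_device, so the attribute lookup falls through to nn.Module.__getattr__ and raises, while the generic nn.Module.to(device) is always available. A standalone sketch (the DDP construction is commented out because it needs an initialized process group):

import torch
from torch.nn.parallel import DistributedDataParallel

model = torch.nn.Linear(4, 2)
print(hasattr(model, "to_device"))  # False: to_device is a Ludwig model method, not a torch one

# Wrapping the model in DDP does not add the attribute either; the lookup below
# would raise AttributeError via nn.Module.__getattr__, as in the traceback above.
# ddp = DistributedDataParallel(model)  # requires torch.distributed.init_process_group first
# ddp.to_device("cpu")                  # AttributeError: no attribute 'to_device'
# ddp.to("cpu")                         # fine: .to() is inherited from nn.Module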

@thelinuxkid (Contributor)

This fixes the issue for me in the wild when running the predict command from one of the examples:

ludwig predict --model_path results/experiment_run/model --dataset rotten_tomatoes_test.csv

PyTorch version: 2.0.0+cu118
CUDA version: 11.8
System: Ubuntu 20.04
NVIDIA driver version: 525
GPU: NVIDIA H100

Before: [screenshot]

After: [screenshot]

@jeffkinnison (Contributor, Author)

@thelinuxkid That's great to hear! We'll get this merged ASAP.

@tgaddair (Collaborator)

The test failure is transient; merging this.

tgaddair merged commit 2bc1488 into master on Aug 30, 2023
13 of 16 checks passed
tgaddair deleted the hyperopt-device-mismatch branch on August 30, 2023 at 04:22