fix: Move model to the correct device for eval #3554
Conversation
Unit Test Results: 6 files ±0, 6 suites ±0, 1h 35m 59s ⏱️ +14m 11s. For more details on these failures, see this check. Results for commit d325924; comparison against base commit f34c272. ♻️ This comment has been updated with latest results.
It looks like there may have been an error in
In the first case, it looks like we were testing for
ludwig/models/predictor.py (Outdated)

```python
try:
    self.dist_model = self._distributed.to_device(self.dist_model)
except AttributeError:
    logging.info("Using DDP, skipping device assignment.")
```
Why do we need this handler? DDP just inherits the base implementation.
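For reference, the base implementation that comment refers to, reconstructed from the traceback further down (ludwig/distributed/base.py, line 53), where `get_torch_device` is Ludwig's helper for picking the default device; a sketch, not copied from the source tree:

```python
# Reconstructed from the traceback below (ludwig/distributed/base.py, line 53).
def to_device(self, model, device=None):
    # Delegates to the model's own `to_device` method; this is the call that
    # fails when `model` is a DistributedDataParallel wrapper.
    return model.to_device(device if device is not None else get_torch_device())
```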
Hmm. I ran into an AttributeError when calling distributed.to_device in a test using DDPStrategy (DistributedDataParallel). Double-checking that now.
I can't recreate the AttributeError; it may have been a separate mistake on my part while testing the fix. The try/except block has been removed.
It showed up in the unit tests here, when trying to use DistributedDataParallel, not DDPStrategy. It looks like the exception is raised when calling model.to_device inside of BaseStrategy.to_device:
ray.exceptions.RayTaskError(AttributeError): ray::_Inner.train() (pid=2974, ip=10.1.110.4, repr=TorchTrainer)
E File "/opt/hostedtoolcache/Python/3.10.12/x64/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 368, in train
E raise skipped from exception_cause(skipped)
E File "/opt/hostedtoolcache/Python/3.10.12/x64/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
E ray.get(object_ref)
E ray.exceptions.RayTaskError(AttributeError): ray::RayTrainWorker._RayTrainWorker__execute() (pid=3026, ip=10.1.110.4, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7ff7b49fe650>)
E File "/opt/hostedtoolcache/Python/3.10.12/x64/lib/python3.10/site-packages/ray/train/_internal/worker_group.py", line 31, in __execute
E raise skipped from exception_cause(skipped)
E File "/opt/hostedtoolcache/Python/3.10.12/x64/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
E train_func(*args, **kwargs)
E File "/home/runner/work/ludwig/ludwig/ludwig/backend/ray.py", line 498, in <lambda>
E lambda config: train_fn(**config),
E File "/home/runner/work/ludwig/ludwig/ludwig/backend/ray.py", line 215, in train_fn
E results = trainer.train(train_shard, val_shard, test_shard, return_state_dict=True, **kwargs)
E File "/home/runner/work/ludwig/ludwig/ludwig/distributed/base.py", line 155, in wrapped
E res = fn(*args, **kwargs)
E File "/home/runner/work/ludwig/ludwig/ludwig/trainers/trainer.py", line 812, in train
E should_break = self._train_loop(
E File "/home/runner/work/ludwig/ludwig/ludwig/trainers/trainer.py", line 981, in _train_loop
E should_break = self.run_evaluation(
E File "/home/runner/work/ludwig/ludwig/ludwig/trainers/trainer.py", line 572, in run_evaluation
E self.evaluation(test_set, TEST, progress_tracker.test_metrics, self.eval_batch_size, progress_tracker)
E File "/home/runner/work/ludwig/ludwig/ludwig/trainers/trainer.py", line 1074, in evaluation
E metrics, _ = predictor.batch_evaluation(dataset, collect_predictions=False, dataset_name=dataset_name)
E File "/home/runner/work/ludwig/ludwig/ludwig/models/predictor.py", line 219, in batch_evaluation
E self.dist_model = self._distributed.to_device(self.dist_model)
E File "/home/runner/work/ludwig/ludwig/ludwig/distributed/base.py", line 53, in to_device
E return model.to_device(device if device is not None else get_torch_device())
E File "/opt/hostedtoolcache/Python/3.10.12/x64/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
E raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
E AttributeError: 'DistributedDataParallel' object has no attribute 'to_device'
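In other words, the DistributedDataParallel wrapper is itself an nn.Module and does not expose the wrapped Ludwig model's custom to_device method, so the lookup falls through to nn.Module.__getattr__ and raises. A minimal sketch of the failure mode in plain PyTorch, using a hypothetical stand-in model (constructing DDP requires an initialized process group, so the failing calls are shown as comments):

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel


class LudwigLikeModel(nn.Module):
    """Hypothetical stand-in for a model that defines its own `to_device`."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def to_device(self, device):
        return self.to(device)


model = LudwigLikeModel()
model.to_device("cpu")  # fine: the method exists on the unwrapped model

# DDP wraps the model in another nn.Module that does not expose custom
# methods like `to_device`, so attribute lookup falls through to
# nn.Module.__getattr__ and raises, as in the traceback above:
#
#   dist_model = DistributedDataParallel(model)  # needs an init'd process group
#   dist_model.to_device("cpu")  # AttributeError: 'DistributedDataParallel'
#                                # object has no attribute 'to_device'
#
# Safe alternatives: `dist_model.to("cpu")` (inherited from nn.Module) or
# `dist_model.module.to_device("cpu")` (unwrap the module first).
```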
This fixes the issue for me in the wild. pytorch version: 2.0.0+cu118. [Before/After screenshots]
@thelinuxkid That's great to hear! We'll get this merged asap.
Test failure is transient, merging this.
This update addresses the issue described in #3544, which saw a device mismatch when running evaluation on GPU.

Repros:

- tests/integration_tests/test_automl.py::test_auto_train when using GPU.
- tests/integration_tests/test_api.py::test_api_intent_classification when using GPU.

Fixes:

- ludwig.models.predictor.Predictor.batch_evaluation when using GPU.
- ludwig.models.predictor.Predictor.batch_prediction when using GPU.
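For illustration, a minimal self-contained sketch of the device mismatch these fixes address (plain PyTorch with hypothetical names, not Ludwig's actual code): evaluation batches land on the GPU while the model's weights are still on the CPU, and moving the model to the same device before inference resolves it:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the model under evaluation; weights start on CPU.
model = nn.Linear(4, 2)
batch = torch.randn(8, 4)

device = "cuda" if torch.cuda.is_available() else "cpu"
batch = batch.to(device)  # evaluation data is placed on the GPU when available

# Without this step, model(batch) fails on GPU with "Expected all tensors to
# be on the same device". Moving the model first, as batch_evaluation and
# batch_prediction now do, keeps weights and inputs together.
model = model.to(device)

with torch.no_grad():
    preds = model(batch)  # model and inputs now share a device
print(preds.shape)
```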