
fix: Move model to the correct device for eval #3554

Merged
tgaddair merged 6 commits into master from hyperopt-device-mismatch on Aug 30, 2023

Conversation

jeffkinnison (Contributor)

This update addresses the issue described in #3544, in which a device mismatch error occurred when running evaluation on GPU.

Repros:

  1. The original issue repros in tests/integration_tests/test_automl.py::test_auto_train when using GPU.
  2. A similar error occurs during tests/integration_tests/test_api.py::test_api_intent_classification when using GPU.

Fixes:

  1. Move the model to GPU in ludwig.models.predictor.Predictor.batch_evaluation when using GPU.
  2. Move the model to GPU in ludwig.models.predictor.Predictor.batch_prediction when using GPU. (A minimal sketch of this pattern follows the list.)
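
The sketch below shows the general pattern in plain PyTorch, not the actual Predictor code: move the model (and each batch) to the evaluation device before the forward pass. The function name, loader, and batch layout are hypothetical.

import torch


def evaluate_sketch(model: torch.nn.Module, loader):
    # Hypothetical illustration of the fix: put the model on the evaluation
    # device so that model weights and input batches live on the same device.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)  # no-op if the model is already on `device`
    model.eval()
    outputs = []
    with torch.no_grad():
        for inputs in loader:
            inputs = inputs.to(device)  # inputs must match the model's device
            outputs.append(model(inputs))
    return outputs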

github-actions bot commented Aug 29, 2023

Unit Test Results

6 files ±0    6 suites ±0    1h 35m 59s ⏱️ +14m 11s
2 826 tests +2 792    2 813 ✔️ +2 784    12 💤 +7    1 ❌ +1
2 869 runs +2 781    2 847 ✔️ +2 775    21 💤 +5    1 ❌ +1

For more details on these failures, see this check.

Results for commit d325924. Comparison against base commit f34c272.

♻️ This comment has been updated with latest results.

@jeffkinnison (Contributor, Author)

It looks like there may have been an error in test_automl.py. _run_train_with_config checks that post-hyperopt evaluations are handled correctly. There are two cases explicitly tested:

  1. At least one trial completes and a best trial exists to evaluate
  2. No trial completes and no evaluation runs

In the first case, the test was asserting that RayTuneExecutor._evaluate_best_model is called 0 times, but a successful hyperopt run should follow up with an evaluation step that calls _evaluate_best_model exactly once.
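
For illustration, the corrected expectation would look roughly like the sketch below. It assumes RayTuneExecutor is importable from ludwig.hyperopt.execution; run_hyperopt and config are hypothetical stand-ins for the real test fixtures.

from unittest import mock

from ludwig.hyperopt.execution import RayTuneExecutor  # assumed import path


def test_evaluates_best_model_once():
    # With at least one completed trial, the best model should be evaluated
    # exactly once after hyperopt finishes.
    with mock.patch.object(RayTuneExecutor, "_evaluate_best_model") as mock_eval:
        run_hyperopt(config)  # hypothetical helper that runs the hyperopt flow
        assert mock_eval.call_count == 1  # the old test expected 0 calls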

try:
    self.dist_model = self._distributed.to_device(self.dist_model)
except AttributeError:
    logging.info("Using DDP, skipping device assignment.")
Collaborator

Why do we need this handler? DDP just inherits the base implementation.

@jeffkinnison (Contributor, Author) on Aug 29, 2023

Hmm. I ran into an AttributeError when calling distributed.to_device in a test using DistributedDataParallel (not DDPStrategy). Double-checking that now.

@jeffkinnison (Contributor, Author) on Aug 29, 2023

I can't recreate the AttributeError; it may have been a separate mistake on my part while testing the fix. The try/except block has been removed.

@jeffkinnison (Contributor, Author) on Aug 29, 2023

It showed up in the unit tests here when using DistributedDataParallel rather than DDPStrategy. The exception is raised when calling model.to_device inside BaseStrategy.to_device.

ray.exceptions.RayTaskError(AttributeError): ray::_Inner.train() (pid=2974, ip=10.1.110.4, repr=TorchTrainer)
E                 File "/opt/hostedtoolcache/Python/3.10.12/x64/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 368, in train
E                   raise skipped from exception_cause(skipped)
E                 File "/opt/hostedtoolcache/Python/3.10.12/x64/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
E                   ray.get(object_ref)
E               ray.exceptions.RayTaskError(AttributeError): ray::RayTrainWorker._RayTrainWorker__execute() (pid=3026, ip=10.1.110.4, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7ff7b49fe650>)
E                 File "/opt/hostedtoolcache/Python/3.10.12/x64/lib/python3.10/site-packages/ray/train/_internal/worker_group.py", line 31, in __execute
E                   raise skipped from exception_cause(skipped)
E                 File "/opt/hostedtoolcache/Python/3.10.12/x64/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
E                   train_func(*args, **kwargs)
E                 File "/home/runner/work/ludwig/ludwig/ludwig/backend/ray.py", line 498, in <lambda>
E                   lambda config: train_fn(**config),
E                 File "/home/runner/work/ludwig/ludwig/ludwig/backend/ray.py", line 215, in train_fn
E                   results = trainer.train(train_shard, val_shard, test_shard, return_state_dict=True, **kwargs)
E                 File "/home/runner/work/ludwig/ludwig/ludwig/distributed/base.py", line 155, in wrapped
E                   res = fn(*args, **kwargs)
E                 File "/home/runner/work/ludwig/ludwig/ludwig/trainers/trainer.py", line 812, in train
E                   should_break = self._train_loop(
E                 File "/home/runner/work/ludwig/ludwig/ludwig/trainers/trainer.py", line 981, in _train_loop
E                   should_break = self.run_evaluation(
E                 File "/home/runner/work/ludwig/ludwig/ludwig/trainers/trainer.py", line 572, in run_evaluation
E                   self.evaluation(test_set, TEST, progress_tracker.test_metrics, self.eval_batch_size, progress_tracker)
E                 File "/home/runner/work/ludwig/ludwig/ludwig/trainers/trainer.py", line 1074, in evaluation
E                   metrics, _ = predictor.batch_evaluation(dataset, collect_predictions=False, dataset_name=dataset_name)
E                 File "/home/runner/work/ludwig/ludwig/ludwig/models/predictor.py", line 219, in batch_evaluation
E                   self.dist_model = self._distributed.to_device(self.dist_model)
E                 File "/home/runner/work/ludwig/ludwig/ludwig/distributed/base.py", line 53, in to_device
E                   return model.to_device(device if device is not None else get_torch_device())
E                 File "/opt/hostedtoolcache/Python/3.10.12/x64/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
E                   raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
E               AttributeError: 'DistributedDataParallel' object has no attribute 'to_device'
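
For anyone hitting this outside Ludwig: DistributedDataParallel is a plain nn.Module wrapper and does not expose Ludwig-specific helpers such as to_device, so the attribute lookup falls through to nn.Module.__getattr__ and raises, while the generic nn.Module.to(device) is always available. A standalone sketch (the DDP construction is commented out because it needs an initialized process group):

import torch
from torch.nn.parallel import DistributedDataParallel

model = torch.nn.Linear(4, 2)
print(hasattr(model, "to_device"))  # False: to_device is a Ludwig model method, not a torch one

# Wrapping the model in DDP does not add the attribute either; the lookup below
# would raise AttributeError via nn.Module.__getattr__, as in the traceback above.
# ddp = DistributedDataParallel(model)  # requires torch.distributed.init_process_group first
# ddp.to_device("cpu")                  # AttributeError: no attribute 'to_device'
# ddp.to("cpu")                         # fine: .to() is inherited from nn.Module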

@thelinuxkid (Contributor)

This fixes the issue for me in the wild when running the predict command from one of the examples:

ludwig predict --model_path results/experiment_run/model --dataset rotten_tomatoes_test.csv

PyTorch version: 2.0.0+cu118
CUDA version: 11.8
System: Ubuntu 20.04
NVIDIA driver version: 525
GPU: NVIDIA H100

Before: [screenshot]

After: [screenshot]

@jeffkinnison (Contributor, Author)

@thelinuxkid That's great to hear! We'll get this merged ASAP.

@tgaddair (Collaborator)

The test failure is transient; merging this.

tgaddair merged commit 2bc1488 into master on Aug 30, 2023
13 of 16 checks passed
tgaddair deleted the hyperopt-device-mismatch branch on August 30, 2023 at 04:22