Can't validate on Mac MPS #566

Closed
fatevase opened this issue Oct 3, 2022 · 3 comments

fatevase commented Oct 3, 2022

Describe the bug
When I followed the docs to use MMEngine, training worked fine, but an error occurred while validating on the dataset:
User specified autocast device_type must be cuda or cpu, but got mps
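
For context, this message appears to come from MMEngine's own autocast wrapper rather than from the model code. Below is a minimal sketch of the failing call, assuming MMEngine 0.1.0 and PyTorch 1.12 on an Apple Silicon machine whose detected device is mps (versions taken from the environment dump below):

from mmengine.runner.amp import autocast

# On a machine whose device resolves to 'mps' (assumption: MMEngine 0.1.0
# with PyTorch 1.12), entering the context fails because only the 'cuda'
# and 'cpu' device types are handled.
try:
    with autocast(enabled=True):
        pass
except Exception as e:
    print(e)  # User specified autocast device_type must be cuda or cpu, but got mps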

Reproduction

  1. What command or script did you run?
    code from here: https://mmengine.readthedocs.io/zh_CN/latest/get_started/15_minutes.html

  2. Did you make any modifications on the code or config? Did you understand what you have modified?
    No

  3. What dataset did you use?
    cifar-10

Environment

------------------------------------------------------------
System environment:
    sys.platform: darwin
    Python: 3.9.13 (main, Aug 25 2022, 18:24:45) [Clang 12.0.0 ]
    CUDA available: False
    numpy_random_seed: 1950343038
    GCC: Apple clang version 13.1.6 (clang-1316.0.21.2.5)
    PyTorch: 1.12.1
    PyTorch compiling details: PyTorch built with:
  - GCC 4.2
  - C++ Version: 201402
  - clang 13.1.6
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: NO AVX
  - Build settings: BLAS_INFO=accelerate, BUILD_TYPE=Release, CXX_COMPILER=/Applications/Xcode_13.4.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/clang++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -Wno-deprecated-declarations -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_PYTORCH_METAL_EXPORT -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -DUSE_COREML_DELEGATE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-range-loop-analysis -Wno-pass-failed -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-typedef-redefinition -Wno-unknown-warning-option -Wno-unused-private-field -Wno-inconsistent-missing-override -Wno-aligned-allocation-unavailable -Wno-c++14-extensions -Wno-constexpr-not-const -Wno-missing-braces -Qunused-arguments -fcolor-diagnostics -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -DUSE_MPS -fno-objc-arc -Wno-unused-private-field -Wno-missing-braces -Wno-c++14-extensions -Wno-constexpr-not-const, LAPACK_INFO=accelerate, TORCH_VERSION=1.12.1, USE_CUDA=0, USE_CUDNN=OFF, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=ON, USE_OPENMP=OFF, USE_ROCM=OFF, 

    TorchVision: 0.13.1
    OpenCV: 4.6.0
    MMEngine: 0.1.0

Runtime environment:
    dist_cfg: {'backend': 'nccl'}
    seed: None
    Distributed launcher: none
    Distributed training: False
    GPU number: 1
------------------------------------------------------------
  1. Additional information that may help locate the problem: PyTorch was installed with conda.

Bug fix
I found the error is caused by the autocast call in runner/loops.py at line 431:

        with autocast(enabled=self.fp16):
            outputs = self.runner.model.test_step(data_batch)
zhouzaida (Member) commented

Hi @fatevase, thanks for your report. This is a bug in autocast, which throws an error when the device is mps.

def autocast(device_type: Optional[str] = None,
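
Not the actual fix, just an illustrative sketch of the kind of guard this implies: fall back to running the block without autocast when the resolved device type is neither cuda nor cpu. The name autocast_safe and its signature are hypothetical, not MMEngine API:

from contextlib import contextmanager

import torch


@contextmanager
def autocast_safe(device_type=None, enabled=True, **kwargs):
    """Hypothetical guard: only enter torch.autocast for device types that
    support it; otherwise run the wrapped block without autocast."""
    if device_type in ('cuda', 'cpu'):
        with torch.autocast(device_type=device_type, enabled=enabled, **kwargs):
            yield
    else:
        # e.g. 'mps' on PyTorch 1.12: autocast is not supported, so skip it.
        yield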


fatevase commented Oct 5, 2022

> Hi @fatevase, thanks for your report. This is a bug in autocast, which throws an error when the device is mps.
>
> def autocast(device_type: Optional[str] = None,

Thanks. I saw that training does not use autocast. Is there any difference between training and validation?


zhouzaida commented Oct 6, 2022

Both training and validation support AMP, but they are enabled in different ways. If you want to enable autocast during training, you can use AmpOptimWrapper.

outputs = self.runner.model.train_step(
    data_batch, optim_wrapper=self.runner.optim_wrapper)

def train_step(self, data: Union[dict, tuple, list],
               optim_wrapper: OptimWrapper) -> Dict[str, torch.Tensor]:
    """Interface for model forward, backward and parameters updating during
    training process.

    :meth:`train_step` will perform the following steps in order:

    - If :attr:`module` defines the preprocess method,
      call ``module.preprocess`` to pre-processing data.
    - Call ``module.forward(**data)`` and get losses.
    - Parse losses.
    - Call ``optim_wrapper.optimizer_step`` to update parameters.
    - Return log messages of losses.

    Args:
        data (dict or tuple or list): Data sampled from dataset.
        optim_wrapper (OptimWrapper): A wrapper of optimizer to
            update parameters.

    Returns:
        Dict[str, torch.Tensor]: A ``dict`` of tensor for logging.
    """
    # Enable automatic mixed precision training context.
    with optim_wrapper.optim_context(self):
        data = self.module.data_preprocessor(data, training=True)
        losses = self._run_forward(data, mode='loss')
        if self.detect_anomalous_params:
            detect_anomalous_params(losses, model=self)
    parsed_loss, log_vars = self.module.parse_losses(losses)
    optim_wrapper.update_params(parsed_loss)
    return log_vars

@contextmanager
def optim_context(self, model: nn.Module):
    """Enables the context for mixed precision training, and enables the
    context for disabling gradient synchronization during gradient
    accumulation context.

    Args:
        model (nn.Module): The training model.
    """
    from mmengine.runner.amp import autocast
    with super().optim_context(model), autocast():
        yield
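
For completeness, a runnable sketch of the AmpOptimWrapper path on a CUDA machine (the model, data, and optimizer below are placeholders, not taken from this issue; in a config-based setup the equivalent is roughly optim_wrapper=dict(type='AmpOptimWrapper', optimizer=dict(type='SGD', lr=0.01))):

import torch
import torch.nn as nn

from mmengine.optim import AmpOptimWrapper

# Minimal sketch (assumes CUDA is available; on MPS with PyTorch 1.12 this
# would hit the same autocast limitation discussed above).
model = nn.Linear(4, 2).cuda()
optim_wrapper = AmpOptimWrapper(
    optimizer=torch.optim.SGD(model.parameters(), lr=0.01))

inputs = torch.randn(8, 4).cuda()
targets = torch.randn(8, 2).cuda()

# optim_context() wraps the forward pass in torch autocast, mirroring what
# train_step does internally when an AmpOptimWrapper is configured.
with optim_wrapper.optim_context(model):
    loss = nn.functional.mse_loss(model(inputs), targets)
optim_wrapper.update_params(loss)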

ZwwWayne added this to the 0.2.0 milestone Oct 8, 2022
ZwwWayne modified the milestones: 0.2.0, 0.3.0 Oct 24, 2022