
[Training] DORT fails with new PyTorch changes #16355

Closed
wschin opened this issue Jun 14, 2023 · 0 comments · Fixed by #16394
Labels
training issues related to ONNX Runtime training; typically submitted using template

Comments

wschin (Contributor) commented Jun 14, 2023

Describe the issue

Per #16353, the Orttraining Linux Lazy Tensor CI Pipeline fails due to a recent PyTorch change. Please investigate and fix DORT.

To reproduce

Build from source and re-run the test.

Urgency

No response

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

main branch

PyTorch Version

main branch

Execution Provider

Other / Unknown

Execution Provider Library Version

No response

@wschin wschin added the training issues related to ONNX Runtime training; typically submitted using template label Jun 14, 2023
@wschin wschin self-assigned this Jun 14, 2023
@wschin wschin changed the title from [Training] to [Training] DORT fails with new PyTorch changes Jun 14, 2023
wschin added a commit that referenced this issue Jun 20, 2023
Fix #16355. The root-cause change in PyTorch is
[#103302](pytorch/pytorch#103302), which seems to
block calling make_fx inside a dynamo backend.

Changes:
1. Move decomposition to `register_backend.py`, so we no longer have to call
`make_fx` inside DORT, which triggers a bunch of new exceptions (see the sketch
after this list).
2. Remove shape inference based on FakeTensorProp, since the FX graph
received from dynamo now contains all shapes.
3. Fix a macro bug so that DORT can build without CUDA.
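
For change (1), here is a minimal sketch of the registration-time decomposition pattern, assuming the current `torch.compile`/`aot_autograd` APIs; the backend name and compiler body below are hypothetical illustrations, not DORT's actual lowering code.

```
# Minimal, hypothetical sketch (not DORT's actual code): the decomposition
# table is attached when the backend is wrapped with aot_autograd, so the
# compiler function never needs to call make_fx itself.
import torch
from torch._dynamo.backends.common import aot_autograd
from torch._decomp import core_aten_decompositions
from functorch.compile import make_boxed_func


def fw_compiler(gm: torch.fx.GraphModule, example_inputs):
    # The graph handed to this compiler is already decomposed and its nodes
    # already carry shape metadata, so no make_fx / FakeTensorProp pass is
    # needed here. A real backend would lower `gm` to ONNX Runtime; this
    # sketch simply runs the graph as-is.
    return make_boxed_func(gm.forward)


# Analogous to change (1): decompositions are supplied at wrap/registration
# time instead of inside the backend.
sketch_backend = aot_autograd(
    fw_compiler=fw_compiler,
    decompositions=core_aten_decompositions(),
)

compiled = torch.compile(
    lambda x, w: torch.nn.functional.gelu(x @ w), backend=sketch_backend
)
print(compiled(torch.randn(3, 4), torch.randn(4, 2)).shape)
```

Because AOTAutograd applies the decomposition table before invoking the compiler, the backend already sees a decomposed graph with shape metadata on each node, which is also why the FakeTensorProp pass in change (2) became unnecessary.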

Before (3),
```
#if defined(USE_CUDA) || defined(USE_ROCM)
  virtual PhiloxGenerator& PhiloxGenerator__Default() = 0;
#ifdef ENABLE_TRAINING_TORCH_INTEROP
...
#endif
#endif
```
After (3),
```
#if defined(USE_CUDA) || defined(USE_ROCM)
  virtual PhiloxGenerator& PhiloxGenerator__Default() = 0;
#endif
#ifdef ENABLE_TRAINING_TORCH_INTEROP
...
#endif
```
The latter looks better, since `ENABLE_TRAINING_TORCH_INTEROP` guards Python
bridge code, not the random-number-generating `PhiloxGenerator`.