
[Training] DORT fails with new PyTorch changes #16355

Closed
wschin opened this issue Jun 14, 2023 · 0 comments · Fixed by #16394
Labels
training issues related to ONNX Runtime training; typically submitted using template

Comments

wschin (Contributor) commented Jun 14, 2023

Describe the issue

Per #16353, the Orttraining Linux Lazy Tensor CI Pipeline fails due to a recent PyTorch change. Please investigate and fix DORT.

To reproduce

Build from source and re-run the test.

Urgency

No response

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

main branch

PyTorch Version

main branch

Execution Provider

Other / Unknown

Execution Provider Library Version

No response

@wschin wschin added the training issues related to ONNX Runtime training; typically submitted using template label Jun 14, 2023
@wschin wschin self-assigned this Jun 14, 2023
@wschin wschin changed the title from [Training] to [Training] DORT fails with new PyTorch changes Jun 14, 2023
wschin added a commit that referenced this issue Jun 20, 2023
Fix #16355. The root-cause change in PyTorch is
[#103302](pytorch/pytorch#103302), which seems to
block calling make_fx inside a dynamo backend.

Changes:
1. Move decomposition to `register_backend.py`, so we no longer have to call
`make_fx` inside DORT, which triggers a bunch of new exceptions (see the sketch
after this list).
2. Remove shape inference based on FakeTensorProp, since the FX graph
received from dynamo now contains all shapes.
3. Fix a macro bug so that DORT can build without CUDA.
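
For change (1), here is a minimal sketch of the registration-time decomposition pattern, assuming the current `torch.compile`/`aot_autograd` APIs; the backend name and compiler body below are hypothetical illustrations, not DORT's actual lowering code.

```
# Minimal, hypothetical sketch (not DORT's actual code): the decomposition
# table is attached when the backend is wrapped with aot_autograd, so the
# compiler function never needs to call make_fx itself.
import torch
from torch._dynamo.backends.common import aot_autograd
from torch._decomp import core_aten_decompositions
from functorch.compile import make_boxed_func


def fw_compiler(gm: torch.fx.GraphModule, example_inputs):
    # The graph handed to this compiler is already decomposed and its nodes
    # already carry shape metadata, so no make_fx / FakeTensorProp pass is
    # needed here. A real backend would lower `gm` to ONNX Runtime; this
    # sketch simply runs the graph as-is.
    return make_boxed_func(gm.forward)


# Analogous to change (1): decompositions are supplied at wrap/registration
# time instead of inside the backend.
sketch_backend = aot_autograd(
    fw_compiler=fw_compiler,
    decompositions=core_aten_decompositions(),
)

compiled = torch.compile(
    lambda x, w: torch.nn.functional.gelu(x @ w), backend=sketch_backend
)
print(compiled(torch.randn(3, 4), torch.randn(4, 2)).shape)
```

Because AOTAutograd applies the decomposition table before invoking the compiler, the backend already sees a decomposed graph with shape metadata on each node, which is also why the FakeTensorProp pass in change (2) became unnecessary.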

Before (3),
```
#if defined(USE_CUDA) || defined(USE_ROCM)
  virtual PhiloxGenerator& PhiloxGenerator__Default() = 0;
#ifdef ENABLE_TRAINING_TORCH_INTEROP
...
#endif
#endif
```
After (3),
```
#if defined(USE_CUDA) || defined(USE_ROCM)
  virtual PhiloxGenerator& PhiloxGenerator__Default() = 0;
#endif
#ifdef ENABLE_TRAINING_TORCH_INTEROP
...
#endif
```
The latter looks better, since `ENABLE_TRAINING_TORCH_INTEROP` guards Python
bridge code, not the random-number-generating `PhiloxGenerator`.