
Add try-except block for deepspeed import handling#267

Closed
Harthi7 wants to merge 21 commits intoinstructlab:mainfrom
Harthi7:feature/add-try-except-import-to-deepspeed

Conversation

@Harthi7
Contributor

@Harthi7 Harthi7 commented Oct 13, 2024

Context Issue:

#250 (comment)

Implemented error handling for the deepspeed import to prevent runtime crashes
when the module is unavailable. This improves the robustness of the application.

Issue: #250
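The core pattern this PR introduces can be reduced to a small sketch. The `deepspeed_available` helper below is illustrative only (it is not part of the PR); the import targets are the ones shown in the diff:

```python
# Optional-import pattern: fall back to None when deepspeed is absent,
# so the rest of the module can still be imported and used.
try:
    from deepspeed.ops.adam import DeepSpeedCPUAdam, FusedAdam
except ImportError:
    DeepSpeedCPUAdam = None
    FusedAdam = None


def deepspeed_available() -> bool:
    """True only when the optional DeepSpeed optimizers imported successfully."""
    return FusedAdam is not None
```

Callers can then branch on `deepspeed_available()` instead of crashing at import time.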

@Harthi7
Contributor Author

Harthi7 commented Oct 17, 2024

Hey @RobotSail, can you please review my PR?

Comment thread src/instructlab/training/main_ds.py Outdated
except ImportError:
    DeepSpeedCPUAdam = None
    FusedAdam = None
    print("DeepSpeed Adam optimizers are not available. Some features may be unavailable.")
Member

I like where you're headed with this. One thing we should do though is guard this print statement so it only runs on the first rank, to prevent it from being printed once per process in a multi-GPU setting:

Suggested change
print("DeepSpeed Adam optimizers are not available. Some features may be unavailable.")
local_rank = os.getenv('LOCAL_RANK')
if __name__ == '__main__' and (local_rank is None or int(local_rank) == 0):
    print("DeepSpeed Adam optimizers are not available. Some features may be unavailable.")
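One subtlety worth noting with the suggested guard: `os.getenv` returns a *string* when `LOCAL_RANK` is set (torchrun sets it to `"0"`, `"1"`, …), so comparing it to the integer `0` would never match. A minimal sketch of a correct rank check (the helper name `is_rank_zero` is illustrative, not part of the PR):

```python
import os


def is_rank_zero() -> bool:
    # LOCAL_RANK is populated by torchrun as a string ("0", "1", ...).
    # When it is unset, assume we are the only (main) process.
    local_rank = os.getenv("LOCAL_RANK")
    return local_rank is None or int(local_rank) == 0
```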

Comment thread src/instructlab/training/main_ds.py Outdated
    from deepspeed.runtime.zero.utils import ZeRORuntimeException
except ImportError:
    ZeRORuntimeException = None
    print("ZeRORuntimeException is not available. Some features may be unavailable.")
Member


In which cases would the ZeRORuntimeException not be available for import?

Contributor Author


From my understanding, the only reasons ZeRORuntimeException could fail to import are general issues that could affect any other import. I wrapped it in a try/except block because I assumed making the DeepSpeed imports optional was part of the issue.

Member


@Harthi7 I see. In that case, could you please group them all into a single section?

Comment thread src/instructlab/training/main_ds.py Outdated
@RobotSail
Member

Thank you for the PR @Harthi7! This is a great change. I've left a few comments, nothing major. Mainly, I'm not familiar with the ZeRORuntimeException being unavailable for import; when would this happen? And would this be limited to just the ZeRORuntimeException, or is it part of a greater issue where parts of DeepSpeed are not available for import?

Because I think rather than emitting a log about being unable to import that specific exception, it may be better to indicate that an entire section of DeepSpeed can't be imported due to a piece not being built properly.

@mergify mergify Bot added ci-failure and removed ci-failure labels Oct 17, 2024
@Harthi7
Contributor Author

Harthi7 commented Oct 21, 2024

Hello @RobotSail, I updated my PR. Please review the new changes.

Comment thread src/instructlab/training/main_ds.py Outdated
from deepspeed.ops.adam import DeepSpeedCPUAdam, FusedAdam
from deepspeed.runtime.zero.utils import ZeRORuntimeException
try:
from deepspeed.ops.adam import DeepSpeedCPUAdam, FusedAdam
Member


So we'll need to actually split this into two different try/except blocks: one that imports DeepSpeedCPUAdam, and another that imports everything else.

So we'll need to do the following:

try:
    # handle the CPU optimizer here; it may be missing when DeepSpeed
    # was built without CPU Adam support
    from deepspeed.ops.adam import DeepSpeedCPUAdam
except ImportError:
    DeepSpeedCPUAdam = None
    local_rank = os.getenv('LOCAL_RANK')
    if __name__ == '__main__' and (local_rank is None or int(local_rank) == 0):
        print("DeepSpeed CPU Optimizer is not available. Some features may be unavailable.")


# handle everything else separately
try:
    from deepspeed.ops.adam import FusedAdam
    from deepspeed.runtime.zero.utils import ZeRORuntimeException
except ImportError:
    FusedAdam = None
    ZeRORuntimeException = None
    local_rank = os.getenv('LOCAL_RANK')
    if __name__ == '__main__' and (local_rank is None or int(local_rank) == 0):
        print("DeepSpeed is not available. Some features may be unavailable.")

Apologies for any confusion this may have caused on your part.
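The fallback path of this split pattern can be exercised even without DeepSpeed installed. A minimal sketch using a deliberately nonexistent module (the names `_no_such_module_xyz` and `FancyOptimizer` are stand-ins, not real packages):

```python
# Demonstrate that an optional import degrades gracefully: the module-level
# name is set to None instead of the import crashing the program.
try:
    from _no_such_module_xyz.ops import FancyOptimizer  # stand-in for deepspeed
except ImportError:
    FancyOptimizer = None

# Downstream code can now feature-gate on the name being non-None.
cpu_offload_supported = FancyOptimizer is not None
```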

@mergify mergify Bot added the ci-failure label Oct 21, 2024
Comment thread src/instructlab/training/main_ds.py Outdated
DeepSpeedCPUAdam = None
FusedAdam = None
ZeRORuntimeException = None
local_rank = int(os.getenv('LOCAL_RANK', -1))
Member


LOCAL_RANK is typically populated by torchrun, so you'll want to handle the case where it may not be set - in which case we should just assume that we are the main process:

Suggested change
local_rank = int(os.getenv('LOCAL_RANK', -1))
local_rank = os.getenv('LOCAL_RANK')
if __name__ == '__main__' and (local_rank is None or int(local_rank) == 0):

@RobotSail
Member

RobotSail commented Oct 21, 2024

Thank you for updating this PR @Harthi7. Since we're trying to handle the case when DeepSpeed cannot be imported, we should also add some safeguards to prevent the code from proceeding when the necessary imports are not available. So we'll want to do this in two places:

  1. Inside of the primary train function, which will handle all possible cases:
  2. Inside of the run_training API (which is responsible for invoking this script, and is consumed by other libraries):
    if train_args.deepspeed_options.save_samples:

Basically what you'll want to do in either of these cases is check what the distributed training framework being used is, and whether the options we want are available.

So for the first piece, you'll want to add something like the following to the main function:

if args.distributed_training_framework == 'deepspeed' and not FusedAdam:
    # means we can't import anything from deepspeed
    raise ImportError("DeepSpeed was selected but we cannot import the `FusedAdam` optimizer")

if args.distributed_training_framework == 'deepspeed' and args.cpu_offload_optimizer and not DeepSpeedCPUAdam:
    raise ImportError("DeepSpeed was selected and CPU offloading was requested, but DeepSpeedCPUAdam could not be imported. This likely means you need to build DeepSpeed with the CPU adam flags.")

And then we'll want to do something similar for the second piece within the run_training function. This way we can provide cleaner exceptions. Otherwise, if we wait for the exceptions to be raised after we've already invoked torchrun, then the exceptions will get lost due to the subprocess call. So in run_training you'd do something like:

# at the top of file:
from instructlab.training.config import DistributedBackend


# inside of `run_training`:
if train_args.distributed_backend == DistributedBackend.DeepSpeed:
    if not FusedAdam:
        raise ImportError("DeepSpeed was selected as the distributed backend, but FusedAdam could not be imported. Please double-check that DeepSpeed is installed correctly")

    if train_args.deepspeed_options.cpu_offload_optimizer and not DeepSpeedCPUAdam:
        raise ImportError("DeepSpeed CPU offloading was enabled, but DeepSpeedCPUAdam could not be imported. This is most likely because DeepSpeed was not built with CPU Adam. Please rebuild DeepSpeed to have CPU Adam, or disable CPU offloading.")
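The fail-fast validation described above can be sketched as a standalone function. Everything here is illustrative: the `DistributedBackend` enum is a stand-in for the one in `instructlab.training.config`, and `validate_deepspeed_imports` is a hypothetical helper, not part of the PR:

```python
from enum import Enum


class DistributedBackend(Enum):
    # Stand-in for instructlab.training.config.DistributedBackend
    DEEPSPEED = "deepspeed"
    FSDP = "fsdp"


def validate_deepspeed_imports(backend, cpu_offload_optimizer, fused_adam, cpu_adam):
    """Raise a clean ImportError *before* torchrun is invoked, so the error
    is not lost inside the subprocess call."""
    if backend is not DistributedBackend.DEEPSPEED:
        return
    if fused_adam is None:
        raise ImportError(
            "DeepSpeed was selected as the distributed backend, but FusedAdam "
            "could not be imported. Please double-check that DeepSpeed is installed correctly."
        )
    if cpu_offload_optimizer and cpu_adam is None:
        raise ImportError(
            "DeepSpeed CPU offloading was enabled, but DeepSpeedCPUAdam could not "
            "be imported. Please rebuild DeepSpeed with CPU Adam, or disable CPU offloading."
        )
```

Calling this early in `run_training` surfaces a readable exception to the library consumer instead of a lost subprocess traceback.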

RobotSail and others added 6 commits October 22, 2024 11:17
Currently, the training library does not exit when an error is encountered
within the training loop (invoked through torchrun). This commit updates
that functionality so we correctly return an exit code of 1 on child failure.

Additionally, this commit also adds the `make fix` command which
automatically fixes all trivial issues picked up on by ruff

Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
Signed-off-by: abdullah-ibm <abdullah@ibm.com>
Signed-off-by: Harthi7 <abdullah-harthi7@live.com>
Signed-off-by: abdullah-ibm <abdullah@ibm.com>
…nto a single section

Signed-off-by: abdullah-ibm <abdullah@ibm.com>
Signed-off-by: abdullah-ibm <abdullah@ibm.com>
…ssary imports are not available

Signed-off-by: abdullah-ibm <abdullah@ibm.com>
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
Signed-off-by: abdullah-ibm <abdullah@ibm.com>
@mergify mergify Bot added CI/CD Affects CI/CD configuration testing Relates to testing and removed ci-failure labels Oct 22, 2024
abdullah-ibm added 4 commits October 22, 2024 11:22
Signed-off-by: Harthi7 <abdullah-harthi7@live.com>
Signed-off-by: abdullah-ibm <abdullah@ibm.com>
…nto a single section

Signed-off-by: abdullah-ibm <abdullah@ibm.com>
Signed-off-by: abdullah-ibm <abdullah@ibm.com>
…ssary imports are not available

Signed-off-by: abdullah-ibm <abdullah@ibm.com>
abdullah-ibm and others added 11 commits October 22, 2024 11:24
this commit adds a new E2E job meant to test integration
of training library changes with the CLI's "full" train
pipeline to prevent any regressions

it also updates the relevant mergify configuration

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
was being incorrectly labeled as 'small"

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
was still using the old AMI from the previous job

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
was still using the old instance type

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
Currently, the training library does not exit when an error is encountered
within the training loop (invoked through torchrun). This commit updates
that functionality so we correctly return an exit code of 1 on child failure.

Additionally, this commit also adds the `make fix` command which
automatically fixes all trivial issues picked up on by ruff

Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
Signed-off-by: Harthi7 <abdullah-harthi7@live.com>
Signed-off-by: abdullah-ibm <abdullah@ibm.com>
…nto a single section

Signed-off-by: abdullah-ibm <abdullah@ibm.com>
Signed-off-by: abdullah-ibm <abdullah@ibm.com>
…ssary imports are not available

Signed-off-by: abdullah-ibm <abdullah@ibm.com>
HF_TOKEN: ${{ secrets.HF_TOKEN }}
run: |
. venv/bin/activate
<<<<<<< HEAD
Member


Did you mean to include this?

Member

@RobotSail left a comment


Thank you for the changes @Harthi7 ! The new changes look good, only thing is it seems like there are some extra changes being made to one of the workflow files. If you can drop those changes then I can approve.

@Harthi7
Contributor Author

Harthi7 commented Oct 22, 2024

Thank you for the changes @Harthi7 ! The new changes look good, only thing is it seems like there are some extra changes being made to one of the workflow files. If you can drop those changes then I can approve.

I made a new branch because of a rebasing issue. Here is the new PR: #291

@mergify
Contributor

mergify Bot commented Oct 25, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @Harthi7 please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Oct 25, 2024
@RobotSail
Member

@Harthi7 Should we close this PR then?

@Harthi7 Harthi7 closed this Oct 25, 2024

Labels

CI/CD Affects CI/CD configuration needs-rebase testing Relates to testing


3 participants