
Add try-except block for deepspeed import handling#267

Closed
Harthi7 wants to merge 21 commits intoinstructlab:mainfrom
Harthi7:feature/add-try-except-import-to-deepspeed

Conversation

@Harthi7
Contributor

@Harthi7 Harthi7 commented Oct 13, 2024

Context Issue:

#250 (comment)

Implemented error handling for the deepspeed import to prevent runtime crashes
when the module is unavailable. This improves the robustness of the application.

Issue: #250
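The core pattern this PR introduces can be reduced to a small sketch. The `deepspeed_available` helper below is illustrative only (it is not part of the PR); the import targets are the ones shown in the diff:

```python
# Optional-import pattern: fall back to None when deepspeed is absent,
# so the rest of the module can still be imported and used.
try:
    from deepspeed.ops.adam import DeepSpeedCPUAdam, FusedAdam
except ImportError:
    DeepSpeedCPUAdam = None
    FusedAdam = None


def deepspeed_available() -> bool:
    """True only when the optional DeepSpeed optimizers imported successfully."""
    return FusedAdam is not None
```

Callers can then branch on `deepspeed_available()` instead of crashing at import time.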

@Harthi7
Contributor Author

Harthi7 commented Oct 17, 2024

Hey @RobotSail, can you please review my PR?

Comment thread src/instructlab/training/main_ds.py Outdated
except ImportError:
    DeepSpeedCPUAdam = None
    FusedAdam = None
    print("DeepSpeed Adam optimizers are not available. Some features may be unavailable.")
Member

I like where you're headed with this. One thing we should do though is guard this print statement so it only runs on the first rank, to prevent it from being printed once per process in a multi-GPU setting:

Suggested change
print("DeepSpeed Adam optimizers are not available. Some features may be unavailable.")
local_rank = os.getenv('LOCAL_RANK')
if __name__ == '__main__' and (local_rank is None or int(local_rank) == 0):
    print("DeepSpeed Adam optimizers are not available. Some features may be unavailable.")
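One subtlety worth noting with the suggested guard: `os.getenv` returns a *string* when `LOCAL_RANK` is set (torchrun sets it to `"0"`, `"1"`, …), so comparing it to the integer `0` would never match. A minimal sketch of a correct rank check (the helper name `is_rank_zero` is illustrative, not part of the PR):

```python
import os


def is_rank_zero() -> bool:
    # LOCAL_RANK is populated by torchrun as a string ("0", "1", ...).
    # When it is unset, assume we are the only (main) process.
    local_rank = os.getenv("LOCAL_RANK")
    return local_rank is None or int(local_rank) == 0
```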

Comment thread src/instructlab/training/main_ds.py Outdated
    from deepspeed.runtime.zero.utils import ZeRORuntimeException
except ImportError:
    ZeRORuntimeException = None
    print("ZeRORuntimeException is not available. Some features may be unavailable.")
Member


In which cases would the ZeRORuntimeException not be available for import?

Contributor Author


From my understanding, the only reasons ZeRORuntimeException could fail to import are general issues that could affect any other import. I wrapped it in a try/except block because I assumed making the DeepSpeed imports optional was part of the issue.

Member


@Harthi7 I see. In that case, could you please group them all into a single section?

Comment thread src/instructlab/training/main_ds.py Outdated
@RobotSail
Member

Thank you for the PR @Harthi7! This is a great change. I've left a few comments, nothing major. Mainly, I'm not familiar with the ZeRORuntimeException being unavailable for import; when would this happen? And would this be limited to just the ZeRORuntimeException, or is it part of a greater issue where parts of DeepSpeed are not available for import?

Because I think rather than emitting a log about being unable to import that specific exception, it may be better to indicate that an entire section of DeepSpeed can't be imported due to a piece not being built properly.

@mergify mergify Bot added ci-failure and removed ci-failure labels Oct 17, 2024
@Harthi7
Contributor Author

Harthi7 commented Oct 21, 2024

Hello @RobotSail, I updated my PR. Please review the new changes.

Comment thread src/instructlab/training/main_ds.py Outdated
from deepspeed.ops.adam import DeepSpeedCPUAdam, FusedAdam
from deepspeed.runtime.zero.utils import ZeRORuntimeException
try:
from deepspeed.ops.adam import DeepSpeedCPUAdam, FusedAdam
Member


So we'll need to actually split this into two different try/except blocks: one that imports DeepSpeedCPUAdam, and another that imports everything else.

So we'll need to do the following:

try:
    # handle the CPU optimizer here; it may be missing when DeepSpeed
    # was built without CPU Adam support
    from deepspeed.ops.adam import DeepSpeedCPUAdam
except ImportError:
    DeepSpeedCPUAdam = None
    local_rank = os.getenv('LOCAL_RANK')
    if __name__ == '__main__' and (local_rank is None or int(local_rank) == 0):
        print("DeepSpeed CPU Optimizer is not available. Some features may be unavailable.")


# handle everything else separately
try:
    from deepspeed.ops.adam import FusedAdam
    from deepspeed.runtime.zero.utils import ZeRORuntimeException
except ImportError:
    FusedAdam = None
    ZeRORuntimeException = None
    local_rank = os.getenv('LOCAL_RANK')
    if __name__ == '__main__' and (local_rank is None or int(local_rank) == 0):
        print("DeepSpeed is not available. Some features may be unavailable.")

Apologies for any confusion this may have caused on your part.
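The fallback path of this split pattern can be exercised even without DeepSpeed installed. A minimal sketch using a deliberately nonexistent module (the names `_no_such_module_xyz` and `FancyOptimizer` are stand-ins, not real packages):

```python
# Demonstrate that an optional import degrades gracefully: the module-level
# name is set to None instead of the import crashing the program.
try:
    from _no_such_module_xyz.ops import FancyOptimizer  # stand-in for deepspeed
except ImportError:
    FancyOptimizer = None

# Downstream code can now feature-gate on the name being non-None.
cpu_offload_supported = FancyOptimizer is not None
```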

@mergify mergify Bot added the ci-failure label Oct 21, 2024
Comment thread src/instructlab/training/main_ds.py Outdated
DeepSpeedCPUAdam = None
FusedAdam = None
ZeRORuntimeException = None
local_rank = int(os.getenv('LOCAL_RANK', -1))
Member


LOCAL_RANK is typically populated by torchrun, so you'll want to handle the case where it may not be set - in which case we should just assume that we are the main process:

Suggested change
local_rank = int(os.getenv('LOCAL_RANK', -1))
local_rank = os.getenv('LOCAL_RANK')
if __name__ == '__main__' and (local_rank is None or int(local_rank) == 0):

@RobotSail
Member

RobotSail commented Oct 21, 2024

Thank you for updating this PR @Harthi7. Since we're trying to handle the case when DeepSpeed cannot be imported, we should also add some safeguards to prevent the code from proceeding when the necessary imports are not available. So we'll want to do this in two places:

  1. Inside of the primary train function, which will handle all possible cases:
  2. Inside of the run_training API (which is responsible for invoking this script, and is consumed by other libraries):
    if train_args.deepspeed_options.save_samples:

Basically what you'll want to do in either of these cases is check what the distributed training framework being used is, and whether the options we want are available.

So for the first piece, you'll want to add something like the following to the main function:

if args.distributed_training_framework == 'deepspeed' and not FusedAdam:
    # means we can't import anything from deepspeed
    raise ImportError("DeepSpeed was selected but we cannot import the `FusedAdam` optimizer")

if args.distributed_training_framework == 'deepspeed' and args.cpu_offload_optimizer and not DeepSpeedCPUAdam:
    raise ImportError("DeepSpeed was selected and CPU offloading was requested, but DeepSpeedCPUAdam could not be imported. This likely means you need to build DeepSpeed with the CPU adam flags.")

And then we'll want to do something similar for the second piece within the run_training function. This way we can provide cleaner exceptions. Otherwise, if we wait for the exceptions to be raised after we've already invoked torchrun, then the exceptions will get lost due to the subprocess call. So in run_training you'd do something like:

# at the top of file:
from instructlab.training.config import DistributedBackend


# inside of `run_training`:
if train_args.distributed_backend == DistributedBackend.DeepSpeed:
    if not FusedAdam:
        raise ImportError("DeepSpeed was selected as the distributed backend, but FusedAdam could not be imported. Please double-check that DeepSpeed is installed correctly")

    if train_args.deepspeed_options.cpu_offload_optimizer and not DeepSpeedCPUAdam:
        raise ImportError("DeepSpeed CPU offloading was enabled, but DeepSpeedCPUAdam could not be imported. This is most likely because DeepSpeed was not built with CPU Adam. Please rebuild DeepSpeed to have CPU Adam, or disable CPU offloading.")
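The fail-fast validation described above can be sketched as a standalone function. Everything here is illustrative: the `DistributedBackend` enum is a stand-in for the one in `instructlab.training.config`, and `validate_deepspeed_imports` is a hypothetical helper, not part of the PR:

```python
from enum import Enum


class DistributedBackend(Enum):
    # Stand-in for instructlab.training.config.DistributedBackend
    DEEPSPEED = "deepspeed"
    FSDP = "fsdp"


def validate_deepspeed_imports(backend, cpu_offload_optimizer, fused_adam, cpu_adam):
    """Raise a clean ImportError *before* torchrun is invoked, so the error
    is not lost inside the subprocess call."""
    if backend is not DistributedBackend.DEEPSPEED:
        return
    if fused_adam is None:
        raise ImportError(
            "DeepSpeed was selected as the distributed backend, but FusedAdam "
            "could not be imported. Please double-check that DeepSpeed is installed correctly."
        )
    if cpu_offload_optimizer and cpu_adam is None:
        raise ImportError(
            "DeepSpeed CPU offloading was enabled, but DeepSpeedCPUAdam could not "
            "be imported. Please rebuild DeepSpeed with CPU Adam, or disable CPU offloading."
        )
```

Calling this early in `run_training` surfaces a readable exception to the library consumer instead of a lost subprocess traceback.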

RobotSail and others added 6 commits October 22, 2024 11:17
Currently, the training library does not exit when an error is encountered
within the training loop (invoked through torchrun). This commit updates
that functionality so we correctly return an exit code of 1 on child failure.

Additionally, this commit also adds the `make fix` command which
automatically fixes all trivial issues picked up on by ruff

Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
Signed-off-by: abdullah-ibm <abdullah@ibm.com>
Signed-off-by: Harthi7 <abdullah-harthi7@live.com>
Signed-off-by: abdullah-ibm <abdullah@ibm.com>
…nto a single section

Signed-off-by: abdullah-ibm <abdullah@ibm.com>
Signed-off-by: abdullah-ibm <abdullah@ibm.com>
…ssary imports are not available

Signed-off-by: abdullah-ibm <abdullah@ibm.com>
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
Signed-off-by: abdullah-ibm <abdullah@ibm.com>
@mergify mergify Bot added CI/CD Affects CI/CD configuration testing Relates to testing and removed ci-failure labels Oct 22, 2024
abdullah-ibm added 4 commits October 22, 2024 11:22
Signed-off-by: Harthi7 <abdullah-harthi7@live.com>
Signed-off-by: abdullah-ibm <abdullah@ibm.com>
…nto a single section

Signed-off-by: abdullah-ibm <abdullah@ibm.com>
Signed-off-by: abdullah-ibm <abdullah@ibm.com>
…ssary imports are not available

Signed-off-by: abdullah-ibm <abdullah@ibm.com>
abdullah-ibm and others added 11 commits October 22, 2024 11:24
this commit adds a new E2E job meant to test integration
of training library changes with the CLI's "full" train
pipeline to prevent any regressions

it also updates the relevant mergify configuration

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
was being incorrectly labeled as 'small"

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
was still using the old AMI from the previous job

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
was still using the old instance type

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
Currently, the training library does not exit when an error is encountered
within the training loop (invoked through torchrun). This commit updates
that functionality so we correctly return an exit code of 1 on child failure.

Additionally, this commit also adds the `make fix` command which
automatically fixes all trivial issues picked up on by ruff

Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
Signed-off-by: Harthi7 <abdullah-harthi7@live.com>
Signed-off-by: abdullah-ibm <abdullah@ibm.com>
…nto a single section

Signed-off-by: abdullah-ibm <abdullah@ibm.com>
Signed-off-by: abdullah-ibm <abdullah@ibm.com>
…ssary imports are not available

Signed-off-by: abdullah-ibm <abdullah@ibm.com>
HF_TOKEN: ${{ secrets.HF_TOKEN }}
run: |
. venv/bin/activate
<<<<<<< HEAD
Member


Did you mean to include this?

Member

@RobotSail left a comment


Thank you for the changes @Harthi7 ! The new changes look good, only thing is it seems like there are some extra changes being made to one of the workflow files. If you can drop those changes then I can approve.

@Harthi7
Contributor Author

Harthi7 commented Oct 22, 2024

Thank you for the changes @Harthi7 ! The new changes look good, only thing is it seems like there are some extra changes being made to one of the workflow files. If you can drop those changes then I can approve.

I made a new branch because of a rebasing issue. Here is the new PR: #291

@mergify
Contributor

mergify Bot commented Oct 25, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @Harthi7 please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Oct 25, 2024
@RobotSail
Member

@Harthi7 Should we close this PR then?

@Harthi7 Harthi7 closed this Oct 25, 2024

Labels

CI/CD Affects CI/CD configuration needs-rebase testing Relates to testing


3 participants