Initial Lightning DeepSpeed Integration #103

Merged: 2 commits into minimaxir:master, Mar 29, 2021

Conversation

@SeanNaren (Contributor)

Thanks @minimaxir for your hard work, this repo is super awesome! I learnt a lot about how text generation really works back when this repo was released :)

Related to #97.

I've enabled DeepSpeed with default parameters to start with, which do not include CPU offloading, since offloading comes with a speed degradation by default. I've also set up a PR to PyTorch Lightning to update the README, as the information there was slightly outdated!
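
For reference, a minimal Lightning-level sketch (my own illustration, not the exact wiring in this PR) of how CPU offloading could be switched on later, assuming the DeepSpeedPlugin API in pytorch-lightning 1.2:

from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DeepSpeedPlugin

# Assumed defaults: ZeRO optimization enabled, no CPU offload;
# setting cpu_offload=True trades throughput for lower GPU memory.
trainer = Trainer(gpus=8, precision=16, plugins=DeepSpeedPlugin(cpu_offload=False))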

Whilst testing DeepSpeed, I noticed that when training with default parameters the gradient clip value is set to 0, turning off gradient clipping entirely. Is this by choice? When training with FP16 my loss did not converge, and it NaNed even without DeepSpeed. To remedy this, I reduced the LR to 1e-4 and set max_grad_norm to 1.0, which I think is also the default in HF Transformers.
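
(If transformers is installed, a quick way to confirm that default:)

from transformers import TrainingArguments

# HF Transformers clips gradients to a norm of 1.0 unless told otherwise
print(TrainingArguments(output_dir="out").max_grad_norm)  # 1.0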

Since DeepSpeed really takes effect at larger parameter sizes (the buffers themselves take around 3GB of RAM by default), I tested it with a much larger network (roughly 1.5B parameters) across 8 GPUs and saw good scaling with parameter size. As the number of GPUs increased, per-GPU memory usage decreased, which is expected since we shard the states across GPUs. I'll continue to try to push the numbers on our A100 server.
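
For intuition, some back-of-the-envelope arithmetic (my own rough estimate using the ZeRO paper's ~16 bytes per parameter for fp16 weights, fp16 gradients, and fp32 Adam states, not measured numbers from this run):

params = 1.5e9  # simulated model size

weights_and_grads = params * (2 + 2)  # fp16 weights + fp16 gradients, in bytes
optimizer_states = params * 12        # fp32 master weights + Adam momentum/variance

for n_gpus in (1, 4, 8):
    # This rough estimate only shards the 12-byte optimizer states across GPUs
    # (ZeRO stage 2 also shards gradients, which would shrink this further).
    per_gpu_gb = (weights_and_grads + optimizer_states / n_gpus) / 1e9
    print(f"{n_gpus} GPUs: ~{per_gpu_gb:.0f} GB per GPU (excluding activations and buffers)")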

I'll also try some tests with DeepSpeed at sizes that fit DDP, and I'll continue testing. Let me know if you have any issues or feedback!

Script:

from aitextgen.TokenDataset import TokenDataset
from aitextgen.tokenizers import train_tokenizer
from aitextgen.utils import GPT2Config
from aitextgen import aitextgen

file_name = "input.txt"

# Train a new tokenizer on the input text; this writes aitextgen.tokenizer.json
train_tokenizer(file_name)
tokenizer_file = "aitextgen.tokenizer.json"

# Roughly 1.5B parameters
config = GPT2Config(n_embd=3072, n_head=16)

ai = aitextgen(tokenizer_file=tokenizer_file, config=config)

# Encode the input file into fixed-length (block_size) training examples
data = TokenDataset(file_name, tokenizer_file=tokenizer_file, block_size=64)

# Train with FP16, a lowered LR, and gradient clipping (see the notes above)
ai.train(
    data,
    batch_size=1,
    num_steps=50000,
    generate_every=5000,
    save_every=5000,
    n_gpu=8,
    num_workers=4,
    fp16=True,
    use_deepspeed=False,
    max_grad_norm=1,
    learning_rate=1e-4
)

ai.generate(10, prompt="ROMEO:")

# Reload the trained model from disk and generate again
ai2 = aitextgen(model="trained_model/pytorch_model.bin",
                tokenizer_file="aitextgen.tokenizer.json",
                config="trained_model/config.json")

ai2.generate(10, prompt="ROMEO:")

@SeanNaren (Contributor, Author)

Just additionally: we're working on ZeRO 3 right now, and once this is in place I'm more than happy to add it in a separate PR :)

@minimaxir (Owner)

Thanks for the PR! I'll test it out and see if it works.

> Whilst testing DeepSpeed, I noticed that when training with default parameters the gradient clip value is set to 0, turning off gradient clipping entirely. Is this by choice? When training with FP16 my loss did not converge, and it NaNed even without DeepSpeed. To remedy this, I reduced the LR to 1e-4 and set max_grad_norm to 1.0, which I think is also the default in HF Transformers.

I forget why I set that, so no issue changing it here.

@minimaxir (Owner)

Sorry about the late response: IRL has been busy!

Testing the 1-GPU case in Colab, things look good, although unfortunately it goes OOM with your simulated 1.5B model (even with gradient checkpointing), despite the DeepSpeed literature asserting that the framework allows this (maybe that support is a ZeRO 3 thing).

Checking the multi-GPU case now. If that works I'm good with merging.

@minimaxir (Owner)

Maybe put this on hold, since multi-GPU is harder than expected (trying on a GCP AI Notebook with 4 T4s, it hangs at the spawner, which is likely unrelated to this PR but is blocking testing).

[Screenshot: training hanging at the spawner, Mar 28, 2021]

@minimaxir merged commit fd2cfca into minimaxir:master on Mar 29, 2021
@minimaxir (Owner)

However, I'll merge it, since this PR seems to be pretty safe for non-DeepSpeed use cases. The GPU issue I mentioned above is likely unrelated (I'll file an issue with pytorch-lightning if I get a better test case).

Thanks again!

@SeanNaren deleted the feat/deepspeed branch on March 29, 2021 at 13:46
@SeanNaren (Contributor, Author)

Woop! No need to thank me, you've done a lot for the community :)

Regarding the issue: this is probably because we're using multiprocessing within a notebook... I think this opens up the discussion of a spawn-based DeepSpeed plugin! I can make an issue and cross-reference this now.
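
For context, a hedged sketch of the distinction (assuming pytorch-lightning 1.2's accelerator names; as far as I understand, the DeepSpeed plugin currently requires the subprocess-style launch):

from pytorch_lightning import Trainer

# 'ddp' re-launches the training script as subprocesses: fast, but it needs a
# real script on disk, so it tends to hang when driven from a notebook cell.
trainer = Trainer(gpus=4, accelerator="ddp")

# 'ddp_spawn' starts workers via torch.multiprocessing.spawn and works from
# notebooks; a spawn-based DeepSpeed plugin would follow this pattern.
trainer = Trainer(gpus=4, accelerator="ddp_spawn")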

@SeanNaren (Contributor, Author)

Made the issue to track this. @minimaxir, if you have the ability to test outside of a notebook via a terminal, let me know if it works, or I can prioritize getting the spawn version of DeepSpeed together!
