Question about training scripts #4
I want to express my gratitude for your assistance. I've made some initial progress in alleviating the OOM issue and have successfully trained the lm_head on an 8x3090 server. However, I'm looking to further optimize GPU memory usage, and I have a few questions about the solution you provided:

- I couldn't locate the file …
- I also considered using …
- I've implemented ZeRO3. Does this imply that I'm already utilizing FSDP?

Your insights and guidance would be greatly appreciated. Thank you for your time and assistance.
Furthermore, I encountered another issue. In the …
The initialization of the policy and reference model is in HA-DPO/ha_dpo/trainer/base_dpo_trainer.py, lines 137 to 144 at 42f72c5.
To instantiate only one model, you should pass `ref_model=None`.
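A sketch of what that looks like at the call site, assuming the HA-DPO trainer keeps a TRL `DPOTrainer`-style signature (the argument names and values here are illustrative assumptions, not taken from the repo):

```python
# With a PEFT/LoRA-wrapped policy, pass ref_model=None so that only one
# model is instantiated; reference behaviour comes from disabling the
# adapters at forward time instead of keeping a second copy in memory.
trainer = LlavaDPOTrainer(
    model=model,          # PEFT-wrapped policy model
    ref_model=None,       # do not build a separate reference model
    args=training_args,
    beta=0.1,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
```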
And in the reference-model reward modeling (HA-DPO/ha_dpo/trainer/llava_dpo_trainer.py, lines 53 to 57 at 42f72c5), you should pass a reference model, which can be obtained by

```python
# PEFT's disable_adapter() context manager temporarily deactivates the
# LoRA adapters, so this forward pass uses the frozen base weights.
with model.disable_adapter():
    all_logits = model.forward(
        inputs_embeds=batch_inputs_embeds,
        labels=None,
        attention_mask=batch_attention_mask,
    ).logits.to(torch.float32)
```

or

```python
# Alternatively, unwrap the base model and run the forward pass on it.
ref_model = model.get_base_model()
all_logits = ref_model.forward(
    inputs_embeds=batch_inputs_embeds,
    labels=None,
    attention_mask=batch_attention_mask,
).logits.to(torch.float32)
```
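To make concrete where those reference logits end up, here is a minimal sketch of the surrounding DPO computation. This is not the HA-DPO code itself: the per-sequence log-prob reduction is omitted, and names like `policy_chosen_logps` are illustrative.

```python
import torch
import torch.nn.functional as F

# Policy forward pass: LoRA adapters active.
policy_logits = model(
    inputs_embeds=batch_inputs_embeds,
    attention_mask=batch_attention_mask,
).logits.to(torch.float32)

# Reference forward pass: the same weights with adapters disabled, so no
# second full model needs to live in GPU memory.
with torch.no_grad(), model.disable_adapter():
    ref_logits = model(
        inputs_embeds=batch_inputs_embeds,
        attention_mask=batch_attention_mask,
    ).logits.to(torch.float32)

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss: push the policy's chosen-vs-rejected log-ratio
    above the frozen reference model's log-ratio."""
    policy_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratios - ref_logratios)).mean()
```

The key point is that both forward passes share a single set of weights; only the adapter toggling distinguishes policy from reference.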
Be sure that the … As for FSDP, there are no FSDP settings in the scripts provided in the current …
I assume that in your case, if you pass …
Thank you for your assistance. I've made modifications to the code and successfully reduced GPU memory usage using the following command: …

In …

In …
I've found some introductory blogs in Chinese about FSDP and ZeRO (blog1, blog2). I believe ZeRO3 should achieve effects similar to FSDP in reducing GPU memory usage.
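For concreteness, here is a minimal sketch of what a ZeRO-3 configuration looks like when passed to the HF Trainer. The field names follow DeepSpeed's config schema, but the specific values are illustrative, not the settings used in this repo:

```python
# Minimal DeepSpeed ZeRO-3 config. HF's TrainingArguments accepts either a
# path to a JSON file or a dict like this via its `deepspeed` argument.
ds_config = {
    "zero_optimization": {
        "stage": 3,            # shard parameters, gradients, and optimizer states
        "overlap_comm": True,  # overlap communication with computation
        "offload_optimizer": {"device": "cpu"},  # optional: trade speed for GPU memory
    },
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
```

Stage 3 shards parameters in addition to gradients and optimizer states across ranks, which is the same class of memory saving that FSDP's full sharding provides.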
Regarding some warnings during training, I'd like to share my thoughts on them: …

I successfully trained on an 8x3090 server, and the results before and after modifying the ref_model were consistent. After training …
I would like to express my gratitude once again for your assistance!!
Happy to see your contribution to our codebase, and congratulations on the satisfying results!
Once again, thanks for your contribution!
Hi,
Thanks for the excellent work!
I'm currently facing an issue while running LLaVA-1.5 on our 8x3090 server. Specifically, when setting `freeze_backbone=True`, `freeze_mm_mlp_adapter=True`, and `tune_mm_mlp_adapter=False`, we encounter an Out of Memory error. Here's the command we're using: …
Q1: In my understanding, I've frozen the backbone, projector, and vision tower (the vision encoder), and I haven't used LoRA training, so I wouldn't expect this setup to consume an excessive amount of memory. What could be causing this issue?
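One way to sanity-check that assumption is to count what is actually trainable. A short sketch, where `model` stands for whatever the training script builds:

```python
# Report how many parameters actually require gradients; if this is far
# from zero, one of the freeze flags is not taking effect.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable:,} / {total:,} "
      f"({100.0 * trainable / total:.2f}%)")
```

Note that freezing only removes gradient and optimizer-state memory for the frozen weights; the weights themselves, the forward-pass activations, and the fp32 logits (vocabulary size × sequence length per sample) still occupy GPU memory, so freezing alone does not guarantee fitting on 24 GB cards.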
Q2: Furthermore, sometimes we encounter the "killing subprocess" issue after loading the model parameters. What could be causing this problem?
Q3: As I read from the code, do I need to revise `train_dpo.py` in order to train `lm_head`?
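If such a revision is needed, it would presumably amount to something like the following sketch, assuming the model exposes the usual `lm_head` attribute:

```python
# Unfreeze only the LM head; every other parameter stays frozen.
for p in model.lm_head.parameters():
    p.requires_grad = True
```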
Thank you!

log