oom #13

Tomsentable · 2023-06-21T08:44:20Z

I used 8×A100 GPUs to fine-tune the LLaMA-7B weights. Once training reaches iteration 3000 or later, saving the weights fails with an out-of-memory (OOM) error. How can I fix it?

thuqinyj16 · 2023-06-29T15:12:59Z

Lower the batch size per GPU. You can also restart training from an existing checkpoint (e.g., one trained for 3000 steps). DeepSpeed still has some unfixed bugs.
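
If you want to lower the per-GPU batch size without changing the effective global batch size, one option is to raise gradient accumulation to compensate. The sketch below is only an illustration: the helper name and the numbers (global batch 128, micro-batch 4) are hypothetical, and it assumes your run is configured through the standard DeepSpeed batch keys, which this thread does not confirm.

```python
def deepspeed_batch_settings(global_batch: int, per_gpu_batch: int, num_gpus: int) -> dict:
    """Return DeepSpeed batch keys that keep the global batch size fixed.

    DeepSpeed expects:
        train_batch_size ==
            train_micro_batch_size_per_gpu * gradient_accumulation_steps * num_gpus
    """
    samples_per_step = per_gpu_batch * num_gpus
    if global_batch % samples_per_step != 0:
        raise ValueError(
            "global_batch must be divisible by per_gpu_batch * num_gpus"
        )
    return {
        "train_batch_size": global_batch,
        "train_micro_batch_size_per_gpu": per_gpu_batch,
        "gradient_accumulation_steps": global_batch // samples_per_step,
    }


if __name__ == "__main__":
    # Hypothetical values: 8 GPUs, global batch of 128, micro-batch lowered to 4.
    print(deepspeed_batch_settings(global_batch=128, per_gpu_batch=4, num_gpus=8))
```

A smaller micro-batch reduces per-GPU activation memory during training; whether that is enough to avoid the OOM at save time depends on your setup.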

thuqinyj16 closed this as completed Aug 1, 2023
