Hello, I am following the usual recipe: starting from the pre-trained model TheBloke/Llama-2-7B-fp16, doing an SFT step (which I can verify significantly improves performance on my test set), and then DPO. The samples are on-policy (sampled from the SFT model with low temperature, then annotated with an LLM as judge), and I can manually verify that the preferences look fine. I launch DPO, and after a tiny number of steps, even as few as 5 (effective batch size = 16, so only 80 samples used in total), the model COMPLETELY degenerates. E.g. here is a sample:
Prompt = "Please find the location: 'Hello, I wanna go to London'" SFT output = "{'city': 'London', 'country': 'UK'}" DPO output = "Home » News » India » 'BJP's Agenda is to Destroy Constitution': Rahul Gandhi on CAA
433 'BJP's Agenda is to Destroy Constitution': Rahul Gandhi on CAA
434 File photo of Congress leader Rahul Gandhi.
.....
So the DPO output is completely random, out-of-distribution (relative to the SFT model) text that never ends (cut at the truncation point). I am at a loss, especially because exactly the same code worked on another problem. In other words, it doesn't look like a bug, just extreme instability that I don't see reported in the literature.
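For reference, each preference pair I feed to DPO looks roughly like the row below (a minimal sketch in the prompt/chosen/rejected column format that TRL's DPOTrainer expects by default; the rejected value here is hypothetical, just to show the shape):

```python
# One illustrative preference pair in the default "prompt"/"chosen"/"rejected"
# column format expected by TRL's DPOTrainer. Values are placeholders that
# mirror the location-extraction task above, not real data.
preference_pair = {
    "prompt": "Please find the location: 'Hello, I wanna go to London'",
    "chosen": "{'city': 'London', 'country': 'UK'}",       # response rated better by the LLM judge
    "rejected": "{'city': 'Paris', 'country': 'France'}",   # response rated worse (hypothetical)
}
```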
Any hints? I am using a reasonable beta (as high as 0.3) and a tiny LR (5e-7), but to no avail. I am training with QLoRA and default hparams (the same as for my SFT, which ran without issues).
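For completeness, my DPO step is set up roughly like this (a minimal sketch assuming an older TRL API where beta is passed to DPOTrainer directly, newer versions move it into DPOConfig; the model path and LoRA values are placeholders, not my exact settings):

```python
import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig
from trl import DPOTrainer

sft_model_path = "path/to/sft-checkpoint"  # placeholder path to the SFT model

# Load the SFT policy in 4-bit for QLoRA-style DPO training.
model = AutoModelForCausalLM.from_pretrained(
    sft_model_path,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)
tokenizer = AutoTokenizer.from_pretrained(sft_model_path)

# Default-ish QLoRA adapter config (placeholder values).
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

# Preference data: rows shaped like the pair shown earlier.
preference_dataset = Dataset.from_list([
    {"prompt": "...", "chosen": "...", "rejected": "..."},  # fill with real pairs
])

training_args = TrainingArguments(
    output_dir="llama2-dpo",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # effective batch size = 16
    learning_rate=5e-7,             # the tiny LR mentioned above
    logging_steps=1,
)

trainer = DPOTrainer(
    model,
    ref_model=None,                 # with a peft_config, TRL uses the adapter-disabled base as the reference
    args=training_args,
    beta=0.3,                       # the highest beta I tried
    train_dataset=preference_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```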