
DPO training on LLAMA-2 completely corrupts the model #1884

Closed
steremma opened this issue Jul 29, 2024 · 2 comments
Labels
🏋 DPO (Related to DPO) · ❓ question (Seeking clarification or more information)

Comments

steremma commented Jul 29, 2024

Hello, I am following the usual recipe: start from the pre-trained model TheBloke/Llama-2-7B-fp16, do an SFT step (which I can verify significantly improves performance on my test set), and then run DPO. The samples are in-policy (sampled from the SFT model at low temperature, then annotated with an LLM as judge), and I can manually verify that the preferences look fine. Yet when I launch DPO, after a tiny number of steps, as few as 5 (effective batch size = 16, so only 80 samples seen in total), the model COMPLETELY degenerates. E.g. here is a sample:

Prompt = "Please find the location: 'Hello, I wanna go to London'"
SFT output = "{'city': 'London', 'country': 'UK'}"
DPO output = "Home » News » India » 'BJP's Agenda is to Destroy Constitution': Rahul Gandhi on CAA
433 'BJP's Agenda is to Destroy Constitution': Rahul Gandhi on CAA
434 File photo of Congress leader Rahul Gandhi.
.....

So the DPO output is completely random text, out of the SFT distribution, that never ends (it is cut at the truncation point). I am at a loss, especially because exactly the same code worked on another problem; in other words, it does not look like a bug, just extreme instability that I have not seen reported in the literature.
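For reference, this is roughly how the in-policy preference pairs were generated before judging; the SFT checkpoint path, temperature, and generation length below are illustrative placeholders, not my exact values:

```python
# Rough sketch of the in-policy sampling step: draw candidate completions
# from the SFT model at low temperature, to be ranked by an LLM judge.
# Checkpoint path, temperature, and max_new_tokens are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

sft_path = "my-llama2-7b-sft"  # hypothetical path to the SFT checkpoint
model = AutoModelForCausalLM.from_pretrained(
    sft_path, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(sft_path)

prompt = "Please find the location: 'Hello, I wanna go to London'"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Two low-temperature samples per prompt; the judge then labels one
# as "chosen" and the other as "rejected".
candidates = [
    tokenizer.decode(
        model.generate(
            **inputs,
            do_sample=True,
            temperature=0.3,
            max_new_tokens=64,
        )[0][inputs["input_ids"].shape[1]:],  # strip the prompt tokens
        skip_special_tokens=True,
    )
    for _ in range(2)
]
```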

Any hints? I am using a reasonable beta (as high as 0.3) and a tiny learning rate (5e-7), but to no avail. I am training with QLoRA and default hyperparameters (the same setup as my SFT, which ran without issues).
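To make the setup concrete, here is a minimal sketch of the DPO step using TRL's DPOTrainer with QLoRA; the dataset path, LoRA values, and batch split are illustrative placeholders, not my exact configuration:

```python
# Minimal sketch of the DPO step described above: TRL's DPOTrainer on a
# 4-bit (QLoRA) Llama-2 base. Dataset path, LoRA hyperparameters, and the
# batch-size split are placeholders; beta and learning rate are as reported.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import DPOConfig, DPOTrainer

model_name = "TheBloke/Llama-2-7B-fp16"  # base checkpoint named above

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama-2 ships without a pad token

# Default-ish QLoRA adapter config (placeholder values)
peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM"
)

# DPOTrainer expects "prompt", "chosen", and "rejected" columns
dataset = load_dataset("json", data_files="preferences.jsonl", split="train")

args = DPOConfig(
    output_dir="llama2-dpo",
    beta=0.3,                       # highest value tried
    learning_rate=5e-7,             # tiny LR, as reported
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size 16
    logging_steps=1,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    tokenizer=tokenizer,      # renamed to processing_class in newer TRL
    peft_config=peft_config,  # reference model derived by disabling adapters
)
trainer.train()
```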

@qgallouedec qgallouedec added ❓ question Seeking clarification or more information 🏋 DPO Related to DPO labels Oct 7, 2024
@qgallouedec
Member

Hi, thanks for sharing. System info and steps for reproducing (including data, script, etc.) would help.

@qgallouedec
Member

Closed, as there is not enough information to reproduce.
