Hello, I am following the usual recipe: starting from the pre-trained model TheBloke/Llama-2-7B-fp16, doing an SFT step (which I can verify significantly improves performance on my test set), and then DPO. The samples are on-policy (sampled from the SFT model with low temperature, then annotated with an LLM as judge), and I can manually verify that the preferences look fine. I launch DPO, and after a tiny number of steps, even as few as 5 (effective batch size = 16, so only 80 samples used in total), the model COMPLETELY degenerates. E.g. here is a sample:
Prompt = "Please find the location: 'Hello, I wanna go to London'" SFT output = "{'city': 'London', 'country': 'UK'}" DPO output = "Home » News » India » 'BJP's Agenda is to Destroy Constitution': Rahul Gandhi on CAA
433 'BJP's Agenda is to Destroy Constitution': Rahul Gandhi on CAA
434 File photo of Congress leader Rahul Gandhi.
.....
So the DPO output is completely random, out-of-distribution (relative to the SFT model) text that never ends (cut at the truncation point). I am at a loss, especially because exactly the same code worked on another problem. In other words, it doesn't look like a bug, just extreme instability that I don't see reported in the literature.
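For reference, each preference pair I feed to DPO looks roughly like the row below (a minimal sketch in the prompt/chosen/rejected column format that TRL's DPOTrainer expects by default; the rejected value here is hypothetical, just to show the shape):

```python
# One illustrative preference pair in the default "prompt"/"chosen"/"rejected"
# column format expected by TRL's DPOTrainer. Values are placeholders that
# mirror the location-extraction task above, not real data.
preference_pair = {
    "prompt": "Please find the location: 'Hello, I wanna go to London'",
    "chosen": "{'city': 'London', 'country': 'UK'}",       # response rated better by the LLM judge
    "rejected": "{'city': 'Paris', 'country': 'France'}",   # response rated worse (hypothetical)
}
```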
Any hints? I am using a reasonable beta (as high as 0.3) and a tiny LR (5e-7), but to no avail. I am training with QLoRA and default hparams (the same as for my SFT, which ran without issues).
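For completeness, my DPO step is set up roughly like this (a minimal sketch assuming an older TRL API where beta is passed to DPOTrainer directly, newer versions move it into DPOConfig; the model path and LoRA values are placeholders, not my exact settings):

```python
import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig
from trl import DPOTrainer

sft_model_path = "path/to/sft-checkpoint"  # placeholder path to the SFT model

# Load the SFT policy in 4-bit for QLoRA-style DPO training.
model = AutoModelForCausalLM.from_pretrained(
    sft_model_path,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)
tokenizer = AutoTokenizer.from_pretrained(sft_model_path)

# Default-ish QLoRA adapter config (placeholder values).
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

# Preference data: rows shaped like the pair shown earlier.
preference_dataset = Dataset.from_list([
    {"prompt": "...", "chosen": "...", "rejected": "..."},  # fill with real pairs
])

training_args = TrainingArguments(
    output_dir="llama2-dpo",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # effective batch size = 16
    learning_rate=5e-7,             # the tiny LR mentioned above
    logging_steps=1,
)

trainer = DPOTrainer(
    model,
    ref_model=None,                 # with a peft_config, TRL uses the adapter-disabled base as the reference
    args=training_args,
    beta=0.3,                       # the highest beta I tried
    train_dataset=preference_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```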