Question: What do reward models really optimize for? How much assumed context do they have?
E.g., an adversarial attack might include:

- arbitrary `\n` after some average number of words
- long semi-random sequences of words arranged into paragraphs

i.e., just formatting.
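A minimal sketch of such a probe, assuming a Hugging Face sequence-classification reward model (the checkpoint name is just one public example; `add_random_newlines`, `semi_random_paragraphs`, and `reward_score` are hypothetical helpers written for this issue, not an existing API):

```python
# Sketch: probe a reward model with formatting-only perturbations.
# Assumption: the checkpoint below exposes a single-logit scoring head and
# accepts (prompt, response) text pairs; swap in whatever reward model you use.
import random

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"  # example only
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)


def add_random_newlines(text: str, avg_words: int = 12) -> str:
    """Insert a newline after roughly one in every `avg_words` words."""
    out = []
    for word in text.split():
        out.append(word)
        if random.random() < 1.0 / avg_words:
            out.append("\n")
    return " ".join(out)


def semi_random_paragraphs(vocab: list[str], n_paragraphs: int = 3,
                           n_words: int = 60) -> str:
    """Paragraphs of semi-random words: plausible shape, no real content."""
    return "\n\n".join(
        " ".join(random.choices(vocab, k=n_words)) for _ in range(n_paragraphs)
    )


def reward_score(prompt: str, response: str) -> float:
    """Score a (prompt, response) pair with the reward model."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()


prompt = "Explain why the sky is blue."
real = ("The sky appears blue because shorter wavelengths of sunlight "
        "scatter more strongly in the atmosphere.")
print("real answer:      ", reward_score(prompt, real))
print("extra newlines:   ", reward_score(prompt, add_random_newlines(real)))
print("semi-random words:", reward_score(prompt, semi_random_paragraphs(real.split())))
```

If purely cosmetic perturbations move the score a lot, that supports the formatting-only attack; if the semi-random paragraphs still score high, the model is not really checking coherence.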
The reward model might still give a high score. If it actually detected coherence etc., that would be impressive, since it would then have to be roughly as good as an LLM itself.
Reward models might therefore assume a lot about the nature of the input data, e.g. that it is already human-readable, correct, etc.
How can RLHF prune wrong/hallucinated responses?
Also, humans may be picking up on trivial changes, like formatting, which are easy to train for (see the sketch below), e.g.:

- thesis at the front
- average words per sentence
- average sentences per paragraph
- new lines between paragraphs
- summary at the end
At least the length statistics can easily be estimated from available open data. A summary can be generated with SAMSum-style summarization models, and the thesis may not be as important for now.
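For concreteness, here is a minimal sketch of how cheap those signals are to compute (the function name and the thesis/summary heuristics are illustrative placeholders, not an existing implementation):

```python
def surface_features(text: str) -> dict[str, float]:
    """Crude proxies for the formatting signals listed above."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    sentences = [s for s in text.replace("\n", " ").split(".") if s.strip()]
    words = text.split()
    return {
        "avg_words_per_sentence": len(words) / max(1, len(sentences)),
        "avg_sentences_per_paragraph": len(sentences) / max(1, len(paragraphs)),
        "newlines_between_paragraphs": float(len(paragraphs) > 1),
        # very rough placeholders: a short opening sentence stands in for a
        # "thesis"; a closing sentence with a cue phrase stands in for a "summary"
        "thesis_at_front": float(bool(sentences) and len(sentences[0].split()) <= 25),
        "summary_at_end": float(
            bool(sentences)
            and any(cue in sentences[-1].lower()
                    for cue in ("in summary", "in short", "overall", "to conclude"))
        ),
    }
```

Since each of these is a one-liner over open text corpora, a generator could match their target statistics without its answers getting any more correct, which is the concern above.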