Learning to Summarize Reward Model Loss indexing

I've been improving on this for H4, and I think there may be a slight bug. At each reward calculation, the loss needs to know which of the two inputs was preferred (i.e., which one should get the higher reward) before the pair is passed into the model.

In TRL: https://github.com/lvwerra/trl/blob/a05ddbdd836d3217c80a4b3e679ba984bfd4fa24/examples/summarization/scripts/reward_summarization.py#L185

Here's the loss from the original paper; note how the indexing depends on which summary was selected. Or maybe this is handled elsewhere in the script (I didn't see it).
![image](https://user-images.githubusercontent.com/10695622/222621959-642f6249-cd5b-4119-a59d-f9fa33953cce.png)
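
To make the concern concrete, here is a minimal sketch of the pairwise loss as I read it from the paper: `-log(sigmoid(r(x, y_i) - r(x, y_{1-i})))`, where `i` is the index of the human-preferred summary. It is plain Python, not the TRL script's code; the function name `pairwise_reward_loss` and the `chosen` argument are hypothetical names used only for illustration.

```python
import math


def pairwise_reward_loss(rewards_0, rewards_1, chosen):
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over a batch of pairs.

    rewards_0, rewards_1: scalar rewards for candidates y_0 and y_1.
    chosen: human label i in {0, 1} per pair (index of the preferred one).
    All three names are illustrative assumptions, not TRL identifiers.
    """
    total = 0.0
    for r0, r1, i in zip(rewards_0, rewards_1, chosen):
        # The label decides which reward plays the "preferred" role.
        r_chosen, r_rejected = (r0, r1) if i == 0 else (r1, r0)
        diff = r_chosen - r_rejected
        # -log(sigmoid(d)) == log(1 + exp(-d)), computed in a stable form.
        total += max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
    return total / len(chosen)


# Same rewards, opposite labels: the loss differs, so the label matters.
loss_if_first_preferred = pairwise_reward_loss([2.0], [0.0], [0])
loss_if_second_preferred = pairwise_reward_loss([2.0], [0.0], [1])
```

If the dataset is already arranged so that the preferred summary always comes first, a fixed index would be equivalent to this; otherwise the label has to reach the loss.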


Learning to Summarize Reward Model Loss indexing #191
