About the used evaluation set #4

jiacheng-ye · 2023-03-05T11:37:00Z

Hi, thanks for your great work!

In Figure 3 of the paper, the TL;DR summarization task is used to report the ROUGE metric. Where does the dataset come from? Is it load_dataset('openai/summarize_from_feedback', 'validation'), with ROUGE computed between the generated summary and the higher-scored summary?
In Figure 4, what does the multiple-choice prompt look like?


lhao499 · 2023-03-07T00:54:45Z

Thanks for the nice words.

Yes, the validation split is used for evaluation, with the instruction 'a good summary is:'. For evaluation on hh-rlhf, the choice template is 'The following is a dialogue: {dialogue}. This dialogue is {choice}', where choice is either 'good' or 'bad', selected by likelihood.
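
For readers who want to try this, here is a minimal sketch of how the template above could be scored. It is not the evaluation code used for the paper. The `log_likelihood` argument is a hypothetical callable that returns the model's log-likelihood of a full string, and `classify_dialogue` is an illustrative name.

```python
from typing import Callable

# Template and choices as quoted in the reply above.
TEMPLATE = "The following is a dialogue: {dialogue}. This dialogue is {choice}"
CHOICES = ("good", "bad")


def classify_dialogue(dialogue: str, log_likelihood: Callable[[str], float]) -> str:
    """Return the choice whose filled-in prompt scores highest.

    `log_likelihood` is an assumed, user-supplied scorer (for example, the
    summed token log-probabilities from a language model).
    """
    prompts = {c: TEMPLATE.format(dialogue=dialogue, choice=c) for c in CHOICES}
    return max(CHOICES, key=lambda c: log_likelihood(prompts[c]))
```

Whether the likelihood is taken over the whole string or only over the final choice token is not stated in this thread. The sketch scores the whole string.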

jiacheng-ye · 2023-03-07T01:17:12Z

Thanks.
I notice that the validation set can contain multiple summaries for the same document, spread across different instances. However, I guess only the one with "policy"="ref" is human-written? Did you preprocess first to get the "ref" summary for each document (which would shrink the validation set), or just use the higher-scored summary in each instance as the oracle (which may not be human-written, as in the following figure)?

lhao499 · 2023-06-14T03:26:47Z

I apologize for the delay. We chose the higher-scored summary as the oracle in our experiments.
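
A rough sketch of that selection step follows. It assumes each validation record has a `summaries` list of dicts with a `text` field and a `choice` index pointing at the preferred one. That layout is an assumption about the dataset's comparison records and is not confirmed in this thread. `rouge1_f1` is a simplified whitespace-token stand-in, not the ROUGE implementation used for the paper.

```python
from collections import Counter


def oracle_summary(example: dict) -> str:
    """Pick the preferred (higher-scored) summary of one comparison record.

    Assumed layout: example["summaries"] is a list of {"text": ..., "policy": ...}
    and example["choice"] is the index of the preferred summary.
    """
    return example["summaries"][example["choice"]]["text"]


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap on lowercased whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

With this selection, the oracle may come from a model policy rather than the human-written "ref" summary, which is the caveat raised in the question above.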

lhao499 closed this as completed Jun 16, 2023
