SamSum dataset #1
Conversation
Fixed the README to not mangle the directory layout
src/helm/benchmark/run_specs.py
name="sam_sum", | ||
scenario_spec=scenario_spec, | ||
adapter_spec=adapter_spec, | ||
metric_specs=get_open_ended_generation_metric_specs() + get_generative_harms_metric_specs(), |
FYI get_generative_harms_metric_specs() requires a Perspective API key to work. Let me know if you need help with this.
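If a key is needed, it would presumably go in prod_env/credentials.conf alongside the other credentials shown later in this thread; the exact key name below is my assumption based on HELM's credential naming, so double-check it against the HELM docs:
perspectiveApiKey: "..."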
@weiweiy are we going to need a Perspective API key, or do we already have one?
No :( But we do have OpenAI API keys
Extractiveness: Extent to which the generated summary is extracted directly from the source text (also looks at n-grams)
Compression: Length of the original document vs. the summary
Faithfulness: Ask an LLM how good the summary is
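As a side note, here is an illustrative sketch (plain Python, not HELM's actual metric code; the function names are mine) of what the compression and extractiveness numbers in those notes roughly compute:

def compression_ratio(document: str, summary: str) -> float:
    # Length of the original document relative to the summary, in word counts.
    return len(document.split()) / max(len(summary.split()), 1)

def extractive_ngram_overlap(document: str, summary: str, n: int = 2) -> float:
    # Fraction of summary n-grams that appear verbatim in the source document.
    def ngrams(tokens: list, size: int) -> set:
        return {tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}
    doc_ngrams = ngrams(document.split(), n)
    summary_ngrams = ngrams(summary.split(), n)
    if not summary_ngrams:
        return 0.0
    return len(summary_ngrams & doc_ngrams) / len(summary_ngrams)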
Are you planning to use human evaluation / model evaluation? We have a way of using Mechanical Turk (or GPT-4, simulating a human evaluator) to evaluate generations; let me know and I can send you details.
No human evaluation is planned right now; these were just some personal notes for myself.
@yifanmai, can you share the GPT-4 simulated human eval? We could use GPT-4 if it is valid. Otherwise I guess BERTScore will have to do.
BERTScore should be sufficient!
If you want to try GPT-4 critique, put the following in prod_env/credentials.conf:
critiqueType: model
critiqueModelName: openai/gpt-4-32k-0613
Then add this to the metric_specs in your RunSpec:
MetricSpec(class_name="helm.benchmark.metrics.summarization_critique_metrics.SummarizationCritiqueMetric", args={"num_respondents": 1})
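For context, a hypothetical sketch (not code from this PR) of how that metric could be wired into the SamSum run spec from the diff above, appended to the existing metric spec helpers:

RunSpec(
    name="sam_sum",
    scenario_spec=scenario_spec,
    adapter_spec=adapter_spec,
    metric_specs=get_open_ended_generation_metric_specs()
    + get_generative_harms_metric_specs()
    + [
        MetricSpec(
            class_name="helm.benchmark.metrics.summarization_critique_metrics.SummarizationCritiqueMetric",
            args={"num_respondents": 1},
        )
    ],
)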
On second thought, I am concerned that the prompt in SummarizationCritiqueMetric is not very good (we did not use this in the official production HELM results).
You also need OpenAI API keys in prod_env/credentials.conf:
openaiOrgId: "org-..."
openaiApiKey: "sk-..."
OK, this is working now.
Right now the summarization task is about summarizing the content in one sentence; we can change this by changing the adapter spec.
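As a rough illustration, changing the instruction could look something like the sketch below; the import paths and field values are my assumptions (they vary across HELM versions), not code from this PR:

from helm.benchmark.adaptation.adapter_spec import AdapterSpec
from helm.benchmark.adaptation.adapters.adapter_factory import ADAPT_GENERATION

adapter_spec = AdapterSpec(
    method=ADAPT_GENERATION,
    # Drop the one-sentence constraint from the instructions.
    instructions="Summarize the following conversation in a few sentences.",
    input_prefix="Conversation: ",
    output_prefix="Summary: ",
    max_tokens=128,
    temperature=0.3,
)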
Used this as a dataset link since private links are not playing well with wget: https://gist.githubusercontent.com/msaroufim/3f1845a5d93b50d849c42b7baeb2f716/raw/11c2d1814a69bb2cfa54549eaa50c0dcc104b9e5/samsum.tsv