dig-team/hanna-benchmark-asg

HANNA Benchmark Repository

Resources for the paper "Do Language Models Enjoy Their Own Stories? Prompting Large Language Models For Automatic Story Evaluation" accepted in TACL and awaiting publication.

Authors: Cyril Chhun, Fabian Suchanek and Chloé Clavel.

Note: resources for the paper "Of Human Criteria and Automatic Metrics: A Benchmark of the Evaluation of Story Generation", accepted in COLING 2022, can be accessed in the coling branch.

Table of contents

  1. Updates
  2. Data
  3. Jupyter Notebook
  4. Citation
  5. Acknowledgements
  6. Get Involved

Updates

2024/05/13 - Update for TACL Paper
2022/08/24 - Initial commit

Data

This repository releases HANNA, a large dataset of Human-ANnotated NArratives for the evaluation of Automatic Story Generation (ASG). HANNA contains annotations for 1,056 stories generated from 96 prompts from the WritingPrompts dataset. Each story was annotated by 3 raters on 6 criteria (Relevance, Coherence, Empathy, Surprise, Engagement and Complexity), for a grand total of 19,008 annotations (1,056 × 3 × 6).

Additionally, we release the scores of those 1,056 stories evaluated by 72 automatic metrics and annotated by 4 different Large Language Models (Beluga-13B, Llama-13B, Mistral-7B, ChatGPT).

  • hanna_stories_annotations.csv contains the raw annotations from our experiment.
    • Story ID is the ID of the story (from 0 to 1,055). Stories are grouped by model (0 to 95 are the Human stories, 96 to 191 are the BertGeneration stories, etc.).
    • Prompt is the prompt
    • Human is the corresponding human story
    • Story is the generated story
    • Model is the model used to generate the story
    • Relevance is the Relevance (RE) score
    • Coherence is the Coherence (CH) score
    • Empathy is the Empathy (EM) score
    • Surprise is the Surprise (SU) score
    • Engagement is the Engagement (EG) score
    • Complexity is the Complexity (CX) score
    • Worker ID is the ID of the mTurk worker
    • Assignment ID is the ID of the mTurk assignment
    • Work time in seconds is the time the worker spent on the assignment in seconds
    • Name is the name entered by the worker for the first mentioned character in the story
  • hanna_metrics_scores_llm.csv contains average human annotations, average LLM annotations, and the scores of automatic measures per story per system. For instance, on row 2, you will find the scores of the stories generated by the BertGeneration model. Each list of that row contains the scores of stories 96 to 191 for each metric.
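As a rough illustration of how the raw annotation file can be aggregated, the sketch below averages the six criteria over the raters of each story using only the Python standard library. It assumes that hanna_stories_annotations.csv has one row per (story, rater) pair, that it is UTF-8 encoded, and that its header uses the column names listed above. The function names are hypothetical and are not taken from the notebook.

```python
import csv
from collections import defaultdict
from statistics import mean

# Column names as listed in the description of hanna_stories_annotations.csv.
CRITERIA = ["Relevance", "Coherence", "Empathy", "Surprise", "Engagement", "Complexity"]


def average_ratings(rows):
    """Average each criterion over all annotation rows of each story.

    `rows` is any iterable of dicts keyed by column name.
    Returns {story_id: {criterion: mean score}}.
    """
    per_story = defaultdict(lambda: defaultdict(list))
    for row in rows:
        story_id = int(row["Story ID"])
        for criterion in CRITERIA:
            per_story[story_id][criterion].append(float(row[criterion]))
    return {
        story_id: {criterion: mean(scores[criterion]) for criterion in CRITERIA}
        for story_id, scores in per_story.items()
    }


def load_annotations(path="hanna_stories_annotations.csv"):
    """Read the raw annotation file into a list of dicts (assumed UTF-8)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

With the file in the working directory, `average_ratings(load_annotations())` would return one dictionary of mean scores per Story ID.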

We also release:

  • the complete answers of the LLMs (in the llm_answers folder)
  • the ~1,500 annotations of our user study on the LLM explanations (user.study.csv)
  • the 384 stories generated by Llama-7B, Beluga-13B, Llama-30B and Platypus2-70B (hanna_llm_stories.csv).

Samples

Story ID: 99
Prompt: Write a story about an elderly wizard and his young female apprentice.
Human: His body was failing. He had taken care of it very well, but 205 years were a long time. Not a drop of alcohol all those long and lonely tavern nights, not a crumb of tobacco for the old pipe. [...]
Story: “Follow me,” his mentor said. “I must stop this wizard.” At that, Tawthorn drew his dagger and leaned towards the woman. “If you were correct, my professor was right. You could change the world, and save the kingdom.” [...]
Model: BertGeneration
RE: 3, CH: 2, EM: 2, SU: 2, EG: 2, CX: 3

Story ID: 519
Prompt: You are a immortal during the zombie apocalypse, During the apocalypse the zombies ignore you and you try to live a normal life during the outbreak.
Human: 50 years is a long time, enough time to go crazy and return sane. I remember before it happened, the CDC joked that they would have a cure “within a week” [...]
Story: After a few weeks of running, you see something inside a tube/pulse generator. I woke up groggy. The day was Monday, it was Tuesday. How was my day going so fast? [...]
Model: GPT-2
RE: 5, CH: 5, EM: 3, SU: 4, EG: 4, CX: 4

Story ID: 862
Prompt: When a new president is elected, they are given a special security briefing. In reality, this is an old tradition where various directors, military officers and current ministers present fake evidence and compete to see who can convince the president of the most ridiculous things. [...]
Human: “Mr President I want you to know I am telling you this in full confidence .” Said the head of the Secret Service. The President looked at him. “Yes go ahead .” [...]
Story: “Mr. President, you can see this! You know what the problem is. You see, President Obama, in the US, has been working on the latest model of the President 's campaign for over two years! [...]
Model: Fusion
RE: 2, CH: 1, EM: 1, SU: 1, EG: 1, CX: 1
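The Data section states that Story IDs are grouped by model in consecutive blocks (0 to 95 for Human, 96 to 191 for BertGeneration, and so on). Under that layout, the block a story belongs to can be recovered with integer division, as in the minimal sketch below. The function name is hypothetical, and only blocks 0 and 1 are named in this README, so the sketch returns a block index rather than a model name. Whether the position inside a block corresponds to a fixed prompt index is not stated here and should be checked against the Prompt column.

```python
STORIES_PER_BLOCK = 96   # 96 prompts, one story per prompt and per system
TOTAL_STORIES = 1056     # 1,056 / 96 = 11 blocks (Human plus generation systems)


def locate_story(story_id):
    """Return (block index, position inside the block) for a Story ID."""
    if not 0 <= story_id < TOTAL_STORIES:
        raise ValueError(f"Story ID must be in [0, {TOTAL_STORIES - 1}], got {story_id}")
    return divmod(story_id, STORIES_PER_BLOCK)


# The three sample stories shown above.
for story_id in (99, 519, 862):
    block, position = locate_story(story_id)
    print(f"Story {story_id}: block {block}, position {position}")
# Story 99: block 1, position 3
# Story 519: block 5, position 39
# Story 862: block 8, position 94
```

Story 99 falls in block 1, which agrees with its Model value of BertGeneration in the sample.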

Jupyter Notebook

We provide the Jupyter Notebook data_visualization.ipynb containing the code we used to generate our results. It also allows for easier visualisation of the data from the csv files.
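For readers who want to inspect hanna_metrics_scores_llm.csv outside the notebook, the sketch below shows one possible way to read its list-valued cells with the standard library. It assumes that each list is stored as a Python-style literal such as "[0.1, 0.2]" and that the file is UTF-8 encoded; neither point is documented in this README, so the notebook remains the reference for the actual format. The function names are hypothetical.

```python
import ast
import csv


def parse_score_list(cell):
    """Turn a cell such as "[0.1, 0.2]" into a list of floats.

    Assumption: lists are serialized as Python-style literals.
    """
    values = ast.literal_eval(cell)
    if not isinstance(values, (list, tuple)):
        raise ValueError("expected a list-like cell")
    return [float(value) for value in values]


def parse_row(row):
    """Parse every list-valued cell of one row; leave other cells unchanged."""
    parsed = {}
    for column, cell in row.items():
        try:
            parsed[column] = parse_score_list(cell)
        except (ValueError, SyntaxError, TypeError):
            # Not a list literal (for example a system name): keep the raw value.
            parsed[column] = cell
    return parsed


def load_metric_rows(path="hanna_metrics_scores_llm.csv"):
    """Read the file and parse each row (one row per system, per the Data section)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [parse_row(row) for row in csv.DictReader(handle)]
```

`ast.literal_eval` is used rather than `eval` so that only literals are evaluated when reading the file.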

Setup

The code was tested with Python 3.9.7. You can install the required packages with

pip install -r requirements.txt

You will also need the williams.py file from the nlp-williams repository for the Williams section of the notebook. We cannot include it in this repository for licensing reasons.

If you do not plan to run the cells of that section, simply comment out the corresponding import in the first cell.
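As an alternative to editing the first cell by hand, the import can be made optional. The sketch below is one way to do this and is not how the notebook is written: it assumes only that williams.py, when present, is importable as a module named `williams`. The helper name and the flag are hypothetical.

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None if it is not installed or not on the path."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# williams.py must be placed next to the notebook (or elsewhere on sys.path).
williams = optional_import("williams")
WILLIAMS_AVAILABLE = williams is not None

if not WILLIAMS_AVAILABLE:
    print("williams.py not found: skip the cells of the Williams section.")
```

Cells in the Williams section could then check `WILLIAMS_AVAILABLE` before running.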

Citation

Coming soon.

Acknowledgements

This work was performed using HPC resources from GENCI-IDRIS (Grant 2022-AD011013105R1) and was partially funded by the grant ANR-20-CHIA-0012-01 ("NoRDF"). We would also like to convey our appreciation to TACL Action Editor Ehud Reiter, as well as to our anonymous reviewers, for their valuable feedback.

Cyril, Fabian and Chloé are members of the NoRDF project.

Dataset

WritingPrompts (Fan et al., 2018)

Used systems

Libraries

Get Involved

Please create a GitHub issue if you have any questions, suggestions, requests or bug reports. We welcome PRs!
