
✍🏻 COMPOSE & REVIEW: SURVEY PAPER CHALLENGE

Starting kit: Open In Colab

🏁 Introduction

This competition is designed to propel advancements in the automation and fine-tuning of Large Language Models (LLMs), with an emphasis on automated prompt engineering. The primary application involves the generation of systematic review reports, overview papers, white papers, and essays that critically synthesize online information. The coverage spans multiple domains including Literary or philosophical essays (LETTERS), Scientific literature (SCIENCES), and topics surrounding the United Nations Sustainable Development Goals (SOCIAL SCIENCES).

📝 Task

Participants will work with challenge-provided keywords or brief prompts, transforming them into detailed prompts that elicit the production of concise articles or essays. These pieces, averaging four pages or about 2000 words, should include verifiable claims and precise references. Initially, the focus will be on text-only reports, with the anticipation of extending to multi-modality reports in future iterations of the challenge.
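As a purely illustrative sketch of the prompt-expansion step (the official prompt format and organizer API are not specified here; the word budget and required elements below are assumptions):

```python
# Illustrative only: expand a challenge keyword or brief prompt into a
# detailed generation prompt. The constraints encoded here are assumptions.
def build_detailed_prompt(keyword: str, max_words: int = 2000) -> str:
    return (
        f"Write a systematic review of about {max_words} words on '{keyword}'. "
        "Organize it into titled sections, make only verifiable claims, "
        "and support key statements with precise, checkable references."
    )

print(build_detailed_prompt("large language models for automated peer review"))
```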

In this machine learning competition, participants will interface their models with an organizer-provided API. The model will function autonomously, using internet resources to tackle the tasks at hand. The contest simulates a peer-review process, with both the production and review of papers executed by AI systems. Participants will submit models capable of both generating and reviewing papers automatically. As a result, this competition will also foster the development of automated review systems.

The competition's assessment strategy relies on a peer-review principle, with papers generated by one model being evaluated by other models. Final decisions on paper acceptance, rejection, and awarding of best paper titles will be made by the organizers and jury, who will assess both the papers and their reviews.

📊 Data description

This challenge consists of 2 tracks:

  • Generator track: The goal of this track is to generate systematic review papers according to the given prompts and instructions.
  • Reviewer track: The goal of this track is to review research papers according to the given criteria.

You are welcome to train your model using any external data of your choice, as long as you provide the necessary API in your code prior to submission.

📂 Files

📝 Generator track

generator/prompts.csv

  • id: A unique identifier for the paper
  • prompt: Prompts automatically generated by reverse-engineering original papers; these prompts are the input to your model (a minimal loading sketch follows this list).
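As a minimal loading sketch (assuming the CSV layout described above; pandas is used purely as an example):

```python
import pandas as pd

# Load the challenge prompts; the path follows the file layout described above.
prompts = pd.read_csv("generator/prompts.csv")

# Each row pairs a unique paper id with the prompt that your model receives as input.
for row in prompts.itertuples(index=False):
    print(row.id, row.prompt[:80])
```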

generator/instructions.txt

Instructions on good practices for generating systematic review papers.

📝 Reviewer track

reviewer/instructions.txt

Instructions on good practices for reviewing survey papers.

reviewer/papers

A folder that contains variations of papers that need to be reviewed.

💯 Evaluation

In the feed-back and development phases:

  • Generator track: AI-authors submitted by the participants are evaluated by our own automated AI-reviewer, called AI-referee-reviewer (to distinguish it from the AI-reviewers submitted by the participants).
  • Reviewer track: AI-reviewers submitted by the participants are evaluated by our automated AI-meta-reviewer.

In the final phase, the papers generated by the AI-authors of the contestants are evaluated by human reviewers and by the AI-reviewers of the other contestants:

  • Generator track: The final score is based on the human reviews of the AI-generated papers.
  • Reviewer track: The final score is based on the correlation between the human reviews and the reviews of the contestant's AI-reviewer.

Details are provided below:

AI-referee-reviewer

We provide details about our implementation of an AI-reviewer, used to evaluate AI-generated papers in the Generator track during the feed-back and development phases.

Criteria of evaluation

The overall evaluation metric for this track is the average of the ranking scores of our AI-referee-reviewer over a number of review criteria and a number of generated papers (a minimal averaging sketch follows the list below). The review criteria are the same as those used to instruct the reviewers of the participants in the Reviewer track:

  • Clarity: Is the paper written in good English, with correct grammar, and precise vocabulary? Is the paper well organized in meaningful sections and subsections? Are the concepts clearly explained, with short sentences?
  • Soundness: Does the answer present accurate facts, supported by citations of authoritative references?
  • Contribution: Does the answer provide a comprehensive overview, comparing and contrasting a plurality of viewpoints?
  • Responsibility: Does the paper address potential risks or ethical issues, and is it respectful of human moral values, including fairness and privacy?
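As a minimal sketch of this averaging (the dictionary layout and numeric values below are illustrative assumptions; only the "average over criteria and papers" step comes from the description above):

```python
# Hypothetical layout: one dict of per-criterion scores per generated paper.
papers_scores = [
    {"Clarity": 0.80, "Soundness": 0.70, "Contribution": 0.60, "Responsibility": 0.90},
    {"Clarity": 0.55, "Soundness": 0.65, "Contribution": 0.75, "Responsibility": 0.85},
]

# Overall track metric: average over all criteria of all papers.
all_scores = [score for paper in papers_scores for score in paper.values()]
overall = sum(all_scores) / len(all_scores)
print(f"Overall Generator-track score: {overall:.3f}")
```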

We use a variety of methods to produce such scores, which we call "raw scores".

Calibration

We then calibrate the raw scores using reference "good" and "bad" papers that answer the given prompts, as follows.

Each prompt was obtained from an original human paper, which we use to create a number of paraphrased versions, fulfilling the format of the AI-generated papers (2000 words including references). These paraphrased versions are either of good or bad quality, referred to as GOOD-PAPERS and BAD-PAPERS. The calibrated scores are obtained as:

Calibrated score = [(Accuracy of rating the AI-generated paper better than GOOD-PAPERS using the raw score) + (Accuracy of rating the AI-generated paper better than BAD-PAPERS using the raw score)] / 2
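As a minimal sketch of this calibration, assuming each accuracy is the fraction of the corresponding reference papers whose raw score is below that of the AI-generated paper (this interpretation, and the function below, are assumptions rather than the official implementation):

```python
def calibrated_score(ai_raw: float, good_raws: list, bad_raws: list) -> float:
    """Average of two accuracies: how often the AI-generated paper's raw score
    beats the GOOD-PAPERS references and the BAD-PAPERS references."""
    acc_good = sum(ai_raw > r for r in good_raws) / len(good_raws)
    acc_bad = sum(ai_raw > r for r in bad_raws) / len(bad_raws)
    return (acc_good + acc_bad) / 2

# Example with made-up raw scores for a single criterion.
print(calibrated_score(0.72, good_raws=[0.80, 0.65, 0.90], bad_raws=[0.30, 0.45, 0.20]))
```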

Good and bad papers are generated with a language model, with prompts such as "summarize paper x in Good English to less than 2000 words" or "summarize paper x in beginner English to less than 2000 words, making a lot of typos and grammatical errors". Examples of such good and bad papers are provided with the sample data in the Feedback Phase, in the 📝 generator/papers folder.

AI-meta-reviewer

We implemented an AI-meta-reviewer, which scores the reviews provided by the AI-reviewers of the participants during the feed-back and development phases. It uses 5 meta-review criteria:

  • Rating: Is the score consistent with the text feed-back? (for each criterion)
  • Precision: Is the text feed-back precise (does it point to a specific reason for praise or criticism)?
  • Correctness: Is the praise or criticism correct and well substantiated?
  • Recommendation: Does the text feed-back provide detailed and actionable recommendations for improvement?
  • Respectfulness: Is the language polite and non-discriminatory?

For the final phase, we use the Kendall rank correlation coefficient between the human reviews and the AI-reviewer's reviews for each criterion. The final score is the average of the per-criterion correlations.
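As a minimal sketch of this final-phase scoring (scipy.stats.kendalltau is used here as one possible implementation of the rank correlation; the array layout and values are assumptions):

```python
import numpy as np
from scipy.stats import kendalltau

# Hypothetical review scores: rows are papers, columns are the 5 meta-review criteria.
human_scores = np.array([[4, 3, 5, 2, 4],
                         [2, 4, 3, 5, 3],
                         [5, 2, 4, 3, 5],
                         [3, 5, 2, 4, 2]])
ai_scores = np.array([[5, 3, 4, 2, 4],
                      [2, 5, 3, 4, 3],
                      [4, 2, 5, 3, 5],
                      [3, 4, 2, 5, 2]])

# Kendall rank correlation per criterion (per column), averaged into the final score.
taus = []
for c in range(human_scores.shape[1]):
    tau, _ = kendalltau(human_scores[:, c], ai_scores[:, c])
    taus.append(tau)
print("Per-criterion tau:", [round(t, 3) for t in taus])
print("Final score:", round(sum(taus) / len(taus), 3))
```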

Prerequisites

Install Anaconda and create an environment with Python 3.8 (RECOMMENDED).

Usage:

  • The file README.ipynb contains step-by-step instructions on how to create a sample submission.
  • Modify sample_code_submission to provide a better model, or write your own model in the Jupyter notebook (a minimal, hypothetical model skeleton is sketched below).
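As a purely hypothetical sketch of what a submission might contain (the actual interface expected by the organizer-provided ingestion program is defined in the starting kit; the class and method names below are assumptions):

```python
class Model:
    """Hypothetical skeleton covering both tracks: generating a survey paper
    from a prompt and reviewing a given paper. Names and signatures here are
    illustrative assumptions, not the official submission interface."""

    def generate_paper(self, prompt: str) -> str:
        # Replace with your LLM pipeline; the challenge expects roughly
        # 2000 words with verifiable claims and precise references.
        return f"Survey on: {prompt}\n\n(Generated text goes here.)"

    def review_paper(self, paper: str) -> dict:
        # Replace with your automated reviewer; return a score and text
        # feedback for each review criterion.
        criteria = ["Clarity", "Soundness", "Contribution", "Responsibility"]
        return {c: {"score": 3, "comment": "Placeholder feedback."} for c in criteria}
```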

References and credits
