OpenSource-LLMs-better-than-ChatGPT

Datasets

Evaluation datasets

1/ General capabilities

  1. Measuring Massive Multitask Language Understanding (MMLU). Dan Hendrycks et al, ICLR 2021. 15,908 questions (14,079 in the test set) from 57 tasks (mathematics, US history, computer science, law, etc.); see the loading/scoring sketch after this list.
  2. LLM-as-a-judge with MT-Bench and Chatbot Arena. Lianmin Zheng et al, NeurIPS 2023. 80 multi-turn questions from 8 categories: writing, roleplay, extraction, reasoning, math, coding, knowledge I (STEM), and knowledge II (humanities/social science).
  3. AlpacaEval. 2023. 805 instructions to follow from the AlpacaFarm evaluation set.
  4. Open LLM Leaderboard. 2023. Live leaderboard ranking LLMs on ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K.
  5. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models (BIG-Bench). Aarohi Srivastava et al, TMLR 2023. Large benchmark of 204 tasks, contributed by 450 authors across 132 institutions, covering linguistics, childhood development, maths, common-sense reasoning, biology, physics, etc.
  6. Large Language Models are not Fair Evaluators (FairEval-Vicuna). Peiyi Wang et al, 2023. 80 questions from the Vicuna Benchmark, with multiple evidence calibration and balanced position calibration.
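
Most of the multiple-choice benchmarks referenced above (MMLU, and the ARC/HellaSwag/Winogrande components of the Open LLM Leaderboard) are scored by simple answer-letter accuracy. Below is a minimal sketch, assuming the `cais/mmlu` copy of MMLU on the Hugging Face Hub (with its `question`/`choices`/`answer` fields) and a placeholder `predict_letter` function standing in for the model under evaluation; it is not the official evaluation harness.

```python
# Minimal sketch: score a multiple-choice benchmark such as MMLU by exact-match
# accuracy over answer letters. Assumes the "cais/mmlu" Hub dataset with fields
# `question`, `choices` (4 options) and `answer` (gold index 0-3);
# `predict_letter` is a placeholder for the LLM being evaluated.
from datasets import load_dataset

LETTERS = ["A", "B", "C", "D"]

def format_prompt(example):
    options = "\n".join(f"{l}. {c}" for l, c in zip(LETTERS, example["choices"]))
    return f"{example['question']}\n{options}\nAnswer:"

def predict_letter(prompt):
    # Placeholder: replace with a call to the LLM under evaluation.
    return "A"

def mmlu_accuracy(split="test", limit=100):
    data = load_dataset("cais/mmlu", "all", split=split).select(range(limit))
    correct = sum(
        predict_letter(format_prompt(ex)) == LETTERS[ex["answer"]] for ex in data
    )
    return correct / len(data)

if __name__ == "__main__":
    print(f"accuracy: {mmlu_accuracy():.3f}")
```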

2/ Agent capabilities

Tool usage

  1. API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs. Minghao Li et al, EMNLP 2023.
  2. On the Tool Manipulation Capability of Open-source Large Language Models (ToolBench). Qiantong Xu et al, 2023.
  3. Gorilla: Large Language Model Connected with Massive APIs (APIBench). Shishir G. Patil et al, 2023.
  4. ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases. Qiaoyu Tang et al, 2023.
  5. MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback. Xingyao Wang et al, 2023.
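
The benchmarks above all exercise the same interaction pattern: the model proposes a tool call, the harness executes it, and the resulting observation is fed back for the next turn. The loop below is only a schematic illustration of that pattern; the `Action:`/`Final:` format, the toy tools, and `stub_model` are hypothetical, not the protocol of any specific benchmark listed here.

```python
# Schematic tool-use evaluation loop (not the API of any benchmark above).
# The model is expected to reply either "Action: <tool>[<argument>]" or
# "Final: <answer>"; the harness runs the tool and feeds back the observation.
import re

TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy tool
    "lookup": lambda key: {"capital of France": "Paris"}.get(key, "unknown"),
}

def stub_model(history):
    # Placeholder policy standing in for the LLM under evaluation.
    if not any(line.startswith("Observation:") for line in history):
        return "Action: lookup[capital of France]"
    return "Final: Paris"

def run_episode(question, max_turns=5):
    history = [f"Question: {question}"]
    for _ in range(max_turns):
        reply = stub_model(history)
        history.append(reply)
        match = re.match(r"Action: (\w+)\[(.*)\]", reply)
        if match:
            tool, arg = match.groups()
            history.append(f"Observation: {TOOLS[tool](arg)}")
        elif reply.startswith("Final:"):
            return reply.removeprefix("Final:").strip(), history
    return None, history

answer, trace = run_episode("What is the capital of France?")
print(answer)  # -> "Paris"
```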

Self-debugging

  1. InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback (InterCode-Bash and InterCode-SQL). John Yang et al, 2023.
  2. MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback (MINT-MBPP and MINT-HumanEval). Xingyao Wang et al, 2023.
  3. Code as Policies: Language Model Programs for Embodied Control (RoboCodeGen). Jacky Liang et al, 2023.

Following feedback

  1. MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback. Xingyao Wang et al, 2023.

Exploring environment

  1. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. Mohit Shridhar et al, ICLR 2021.
  2. InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback (InterCode-CTF). John Yang et al, 2023.
  3. WebArena: A Realistic Web Environment for Building Autonomous Agents. Shuyan Zhou et al, 2023.

3/ Logical reasoning (maths, coding, etc)

  1. Evaluating Large Language Models Trained on Code (HumanEval benchmark). Mark Chen et al, 2021. 164 hand-written programming problems. Each problem includes a function signature, docstring, body, and several unit tests, with an average of 7.7 tests per problem.
  2. Training Verifiers to Solve Math Word Problems (GSM8K benchmark). Karl Cobbe et al, 2021. 8.5K (7.5K training + 1K test) high-quality grade school math problems created by human problem writers. Each problem takes between 2 and 8 steps to solve, and solutions primarily involve a sequence of elementary calculations using basic arithmetic operations.
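
GSM8K reference solutions end with a final line of the form `#### <answer>`, so accuracy is usually computed by extracting the last number from the model's output and comparing it to that reference (HumanEval is instead scored by executing the generated code against its unit tests, reported as pass@k). A minimal sketch of the GSM8K-style check follows; the regex and normalization are illustrative choices, not the official script.

```python
# Minimal GSM8K-style scoring sketch: pull the final number out of a model's
# chain-of-thought and compare it with the "#### <answer>" line of the reference.
# The regex/normalization here is illustrative, not the official evaluation script.
import re

NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_reference(answer_text):
    # GSM8K references end with a line like "#### 72".
    return answer_text.split("####")[-1].strip().replace(",", "")

def extract_prediction(model_output):
    numbers = NUMBER.findall(model_output)
    return numbers[-1].replace(",", "") if numbers else None

def is_correct(model_output, answer_text):
    pred = extract_prediction(model_output)
    return pred is not None and float(pred) == float(extract_reference(answer_text))

print(is_correct("... so she makes 9 * 2 = 18 dollars per day. The answer is 18.",
                 "She sells 9 eggs ... #### 18"))  # True
```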

4/ Long-context (summarization, QA, etc)

General long-context benchmarks

  1. Long Range Arena: A Benchmark for Efficient Transformers. Yi Tay et al, ICLR 2021. Benchmark of 6 tasks, each between 1k and 16k input tokens. Tasks encompass several modalities: text, images, spatial reasoning.
  2. SCROLLS: Standardized CompaRison Over Long Language Sequences. Uri Shaham et al, EMNLP 2022. Benchmark made of 7 existing long-input datasets: 2 summarization datasets (GovReport and SummScreenFD), 1 query-focused summarization dataset (QMSum), 3 QA datasets (Qasper, NarrativeQA, QuALITY), and 1 NLI dataset (ContractNLI).
  3. ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding. Uri Shaham et al, EMNLP 2023. Extension of SCROLLS focusing on zero-shot evaluation. Compared to SCROLLS, ZeroSCROLLS discards ContractNLI, and adds 1 query-based summarization task (SQuALITY), 1 QA dataset (MuSiQue) and 2 custom aggregation tasks (SpaceDigest, BookSumSort).
  4. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. Yushi Bai et al, 2023. Bilingual English/Chinese, multi-task benchmark for long context understanding. 21 datasets across 6 task categories, with an average length of 6,711 words (English) and 13,386 characters (Chinese). The tasks cover key long-context tasks such as single-doc QA, multi-doc QA, summarization, few-shot learning, synthetic tasks, and code completion.
  5. L-Eval: Instituting Standardized Evaluation for Long Context Language Models. Chenxin An et al, 2023. 20 long-input tasks covering diverse aspects: 4 are built from scratch, 4 are re-annotations of existing datasets, and 12 are manually filtered existing datasets.
  6. BAMBOO: A Comprehensive Benchmark for Evaluating Long Text Modeling Capacities of Large Language Models. Zican Dong et al, 2023. 10 datasets from 5 tasks, all designed to avoid pre-training data contamination by collecting evaluation data from a recent period (2023).
  7. M4LE: A Multi-Ability Multi-Range Multi-Task Multi-Domain Long-Context Evaluation Benchmark for Large Language Models. Wai-Chung Kwan et al, 2023. 36 datasets covering 11 tasks and 12 domains, in English and Chinese. Datasets are split into 5 understanding abilities: explicit single-span, semantic single-span, explicit multiple-span, semantic multiple-span, and global.

Long-context (generic) summarization

  1. BookSum: A Collection of Datasets for Long-form Narrative Summarization. Kryściński et al, 2021. Collection of datasets resulting in 46,532 paragraph-level, 12,630 chapter-level (BookSum-Chapter), and 405 book-level summarization data points.
  2. Efficient Attentions for Long Document Summarization (GovReport). Huang et al, NAACL 2021. 19,466 documents split into 17,519 training, 974 validation and 973 test samples. Average length is 9,409.4 words per document and 553.4 words per summary.
  3. SummScreen: A Dataset for Abstractive Screenplay Summarization. Chen et al, ACL 2022. 22,503 episodes from TVMegaSite (SummScreen-TMS, split into 18,915/1,795/1,793 train/dev/test) and 4,021 episodes from ForeverDreaming (SummScreen-FD, split into 3,673/338/337 train/dev/test).

Long-context (query-focused) summarization

  1. QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization. Ming Zhong et al, NAACL 2021. 1,808 query-summary pairs over 232 meetings in multiple domains. Meetings come from 3 categories (product, academic, committee) and are annotated by AMT workers.
  2. SQuALITY: Building a Long-Document Summarization Dataset the Hard Way. Wang et al, EMNLP 2022. 100 stories, 500 questions, and 2,000 summaries (4 reference summaries per question).
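
The summarization benchmarks in the two groups above are typically scored with ROUGE against the reference summaries. Below is a minimal sketch using the `rouge-score` package; taking the best score over multiple references (e.g., SQuALITY's 4 per question) is an illustrative choice rather than a prescribed protocol.

```python
# Minimal ROUGE sketch for long-document summarization benchmarks.
# Uses the `rouge-score` package (pip install rouge-score); keeping the max over
# multiple references is an illustrative choice, not an official protocol.
from rouge_score import rouge_scorer

METRICS = ["rouge1", "rouge2", "rougeLsum"]
scorer = rouge_scorer.RougeScorer(METRICS, use_stemmer=True)

def rouge_f1(prediction, references):
    # Score against each reference and keep the best F1 per metric.
    best = {m: 0.0 for m in METRICS}
    for ref in references:
        scores = scorer.score(ref, prediction)  # (target, prediction) order
        for metric, value in scores.items():
            best[metric] = max(best[metric], value.fmeasure)
    return best

print(rouge_f1(
    "The committee approved the budget for next year.",
    ["Next year's budget was approved by the committee."],
))
```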

Long-context question-answering (QA)

  1. The NarrativeQA Reading Comprehension Challenge. Kočiský et al, 2017. 46,765 question–answer pairs from 1,567 stories (1,102/115/355 train/valid/test) from books and movie scripts.
  2. A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers (Qasper). Dasigi et al, NAACL 2021. 5,049 questions (2,593/1,005/1,451 train/valid/test) over 1,585 NLP papers. Each question is written by an NLP practitioner who read only the title and abstract of the corresponding paper, and the question seeks information present in the full text.
  3. QuALITY: Question Answering with Long Input Texts, Yes!. Pang et al, NAACL 2022. Multiple-choice QA dataset with context passages in English that have an average length of about 5,000 tokens. 6,737 questions split into 2,523/2,086/2,128 train/dev/test.
  4. MuSiQue: Multihop Questions via Single-hop Question Composition. Trivedi et al, TACL 2022. Multihop QA dataset with 25K 2-4 hop questions, split into 19,938/2,417/2,459 train/dev/test.
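
Free-form answers in the QA datasets above (e.g., NarrativeQA, Qasper) are commonly scored with SQuAD-style token-level F1, while multiple-choice sets such as QuALITY use accuracy. A minimal token-F1 sketch follows; the normalization mirrors the usual SQuAD recipe, and exact details vary across benchmarks, so treat it as illustrative.

```python
# SQuAD-style token-level F1, the usual metric for free-form QA answers.
# Normalization (lowercase, strip punctuation/articles) follows the common recipe;
# details differ slightly between benchmarks, so treat this as illustrative.
import re
import string
from collections import Counter

def normalize(text):
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def token_f1(prediction, reference):
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the Eiffel Tower in Paris", "Eiffel Tower"))  # ~0.67
```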

5/ Specific NLP tasks

Question-answering (QA)

Reading comprehension
  1. SQuAD: 100,000+ Questions for Machine Comprehension of Text. Rajpurkar et al, EMNLP 2016. 100k+ questions asked by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage.
  2. Know What You Don't Know: Unanswerable Questions for SQuAD (SQuAD 2.0). Rajpurkar et al, ACL 2018. SQuAD 2.0 combines existing SQuAD data with over 50k unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.
  3. QuAC: Question Answering in Context. Choi et al, EMNLP 2018. 100k questions (83,568/7,354/7,353 train/dev/test) from 14K information-seeking QA dialogs.
  4. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. Christopher Clark et al, NAACL 2019. 16k naturally occurring yes/no questions, split into 9.4k train, 3.2k dev, and 3.2k test. Each question is paired with a Wikipedia passage.
Commonsense reasoning
  1. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. Clark et al, 2018. 7,787 natural, grade-school science questions (authored for human tests), split into 3,370/869/3,548 train/dev/test.
  2. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. Zhilin Yang et al, EMNLP 2018. 113k QA pairs from Wikipedia which require reasoning from multiple documents.
  3. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering (OpenBookQA). Mihaylov et al, EMNLP 2018. Dataset modeled after open book exams for assessing human understanding of a subject. Around 6k questions (4957/500/500 train/dev/test) probe an understanding of 1,329 elementary level science facts.
  4. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. Talmor et al, NAACL 2019. 12,247 multiple-choice questions that mention the source concept and discriminate in turn between each of the target concepts.
  5. HellaSwag: Can a Machine Really Finish Your Sentence?. Zellers et al, ACL 2019. Questions collected with Adversarial Filtering (AF), a data collection paradigm wherein a series of discriminators iteratively select an adversarial set of machine-generated wrong answers.
  6. WinoGrande: An Adversarial Winograd Schema Challenge at Scale. Sakaguchi et al, 2019. Large-scale dataset of 44k problems inspired by the original Winograd Schema Challenge (WSC), a commonsense reasoning benchmark of 273 expert-crafted pronoun-resolution problems originally designed to be unsolvable for statistical models relying on selectional preferences or word associations. 12,282 instances split into 9,248/1,267/1,767 train/dev/test sets.
  7. SocialIQA: Commonsense Reasoning about Social Interactions (SIQA). Sap et al, EMNLP 2019. 38k (33,410/1,954/2,224 train/dev/test) multiple-choice commonsense questions along with correct and incorrect answers about social interactions collected through crowdsourcing.
  8. PIQA: Reasoning about Physical Commonsense in Natural Language. Bisk et al, AAAI 2020. Benchmarking progress in physical commonsense understanding, with 16k/2k/3k train/dev/test QA pairs.
World knowledge
  1. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. Joshi et al, 2017. 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average.
  2. Natural Questions: A Benchmark for Question Answering Research. Tom Kwiatkowski et al, TACL 2019. 307,373 training examples with single annotations; 7,830 development examples with 5-way annotations and 7,842 test examples with 5-way annotations. Questions are real anonymized, aggregated queries issued to the Google search engine. Each question is paired with an entire Wikipedia page.
  3. TruthfulQA: Measuring How Models Mimic Human Falsehoods. Stephanie Lin et al, ACL 2022. 817 questions spanning 38 categories. Question and answers are hand-written by human annotators and designed to elicit imitative falsehoods.
Specific domain
  1. MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering. Ankit Pal et al, ACM CHIL 2022. 194k multiple-choice questions from real world medical entrance exams.

6/ Trustworthy AI

  1. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. Sewon Min et al, EMNLP 2023. Evaluates LLMs at generating biographies of people, measuring the precision of the atomic facts they contain.

Fine-tuning / instruction-tuning datasets

  1. AgentTuning: Enabling Generalized Agent Abilities For LLMs. Zeng et al., 2023. Introduces the AgentInstruct dataset: 1,866 high-quality interaction trajectories generated by GPT-4 and verified by humans.
  2. Enhancing Chat Language Models by Scaling High-quality Instructional Conversations. Ning Ding et al, 2023. Introduces the UltraChat dataset: 1.5 million high-quality multi-turn dialogues covering a wide range of topics and instructions.
  3. OpenAssistant Conversations -- Democratizing Large Language Model Alignment. Andreas Köpf et al, 2023. 161,443 messages (91,829 prompter and 69,614 assistant messages) distributed across 66,497 conversation trees, in 35 different languages, annotated with 461,292 quality ratings.
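
Dialogue datasets such as UltraChat and OpenAssistant Conversations are usually turned into training sequences by rendering each conversation with the chat template of the target model's tokenizer. A minimal sketch with `transformers` is below; the Zephyr tokenizer is only an example, and any tokenizer that ships a chat template can be substituted.

```python
# Minimal sketch: render a multi-turn conversation (UltraChat/OASST style) into a
# single training string with a tokenizer's chat template. The model name is just
# an example; any tokenizer that defines a chat template can be substituted.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

conversation = [
    {"role": "user", "content": "Give me three uses for a paperclip."},
    {"role": "assistant", "content": "Hold papers, reset a router, act as a bookmark."},
    {"role": "user", "content": "Which one is the most creative?"},
]

# `add_generation_prompt=True` appends the assistant header so the model is
# prompted/trained to produce the next assistant turn.
text = tokenizer.apply_chat_template(
    conversation, tokenize=False, add_generation_prompt=True
)
print(text)
```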

Open-source LLMs vs ChatGPT

In the following, we report cases where an open-source LLM (e.g., Llama-2) outperforms a proprietary, paid OpenAI LLM (e.g., ChatGPT). For conciseness, we only report the best-performing version of each open-source LLM (usually the largest model when several backbone sizes are tested).

We categorize LLMs depending on the type of training performed:
- Pre-training (PT) refers to LLMs pre-trained from scratch.
- Continual pre-training (CPT) refers to LLMs initialized from an already pre-trained LLM (e.g., Llama-2) and then undergoing another phase of pre-training.
- Fine-tuning or instruction tuning (FT) refers to LLMs trained with supervised fine-tuning on instruction-tuning datasets or standard downstream task datasets.
- Inference (INF) designates techniques that improve LLM performance without changing the model weights.

Note that a proposed LLM may fall into several of the above 4 categories.

General capabilities (Table 1)

| LLM | Date released | LLM size | Training | MT-Bench | AlpacaEval | AlpacaEval-2 | Open LLM LB |
|---|---|---|---|---|---|---|---|
| WizardLM [paper] | April 24th, 2023 | 70B | FT | 7.71 | 92.91 | 12.03 | _ |
| Llama-2-chat [paper] | July 18th, 2023 | 70B | FT | 6.86 | 92.66 | 13.87 | _ |
| Godzilla [HF card] | Aug 11th, 2023 | 70B | FT | _ | _ | _ | 67.01 |
| Zephyr [paper] | Oct 25th, 2023 | 70B | FT | 7.34 | 90.60 | 10.99 | 52.15 |
| Yi-chat [HF card] | Nov 23rd, 2023 | 34B | FT | _ | 94.08 | 29.6 | 68.68 |
| Mixtral-8x7B [paper] | Jan 4th, 2024 | 13B | FT | 8.30 | 94.78 | 18.26 | 68.42 |
| Self-Rewarding 70B [paper] | Jan 18th, 2024 | 70B | FT | _ | _ | 20.44 | _ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-3.5-turbo | Nov 2022 | ? | ? | 7.94 | 81.71 | 14.13 | 70.21 |
| GPT-4 | March 2023 | ? | ? | 8.99 | 95.28 | 23.58 | 85.36 |

Agent capabilities (Table 2)

| LLM | Date released | LLM size | Training | ALFWorld | InterCode-CTF | WebArena | Code Generation |
|---|---|---|---|---|---|---|---|
| Lemur-chat [paper] | Oct 10th, 2023 | 70B | CPT + FT | 59.70 | 22.00 | 5.30 | 17.65 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-3.5-turbo | Nov 2022 | ? | ? | 41.79 | 11.00 | 7.38 | 9.56 |
| GPT-4 | March 2023 | ? | ? | 84.33 | 37.00 | 10.59 | _ |

Logical reasoning (Table 3)

| LLM | Date released | LLM size | Training | GSM8K | HumanEval |
|---|---|---|---|---|---|
| WizardCoder [paper] | June 14th, 2023 | 15B | FT | _ | 57.3 |
| Phi-1 [paper] | June 20th, 2023 | 1.3B | PT + FT | _ | 50.6 |
| WizardMath [paper] | Aug 18th, 2023 | 70B | FT | 81.6 | _ |
| OpenChat-3.5 [paper] | Sept 20th, 2023 | 70B | CPT + FT | 71.3 | 77.4 |
| Lemur-chat [paper] | Oct 10th, 2023 | 70B | CPT + FT | 66.3 | 61.0 |
| Mixtral 8x7B [blog] | Dec 11th, 2023 | 13B | FT | 58.4 | _ |
| --- | --- | --- | --- | --- | --- |
| GPT-3.5-turbo | Nov 2022 | ? | ? | 57.1 | 48.1 |
| GPT-4 | March 2023 | ? | ? | 92.0 | 67.0 |

Long-context modelling on ZeroSCROLLS (Table 4)

| LLM | Date released | LLM size | Training | GovReport | SummScreen | QMSum | SQuALITY | Qasper | NarrativeQA | QuALITY | MuSiQue | SpaceDigest | BookSumSort |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama-2-long-chat [paper] | Sept 27th, 2023 | 70B | CPT + FT | 26.0 | 15.0 | 20.0 | 20.9 | 52.0 | 31.7 | 82.6 | 27.3 | 55.5 | 46.2 |
| Llama-2-chat-32k + retrieval [paper] | Oct 4th, 2023 | 70B | FT | _ | _ | 18.3 | _ | 31.3 | 24.5 | 69.6 | 26.7 | _ | _ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-3.5-turbo | Nov 2022 | ? | ? | 21.3 | 16.1 | 15.6 | 20.4 | 49.3 | 25.1 | 66.6 | 27.1 | 49.1 | 49.8 |
| GPT-3.5-turbo-16k | June 2023 | ? | ? | 24.3 | 16.2 | 17.4 | 21.4 | 50.0 | 29.5 | 72.0 | 27.0 | 54.1 | 54.6 |
| GPT-4 | March 2023 | ? | ? | 26.3 | 17.3 | 18.5 | 22.6 | 50.7 | 27.6 | 89.2 | 41.1 | 62.8 | 60.5 |

Hallucination (Table 5)

| LLM | Date released | LLM size | Training | TruthfulQA | FactScore | HotpotQA | OpenBookQA | MedMC-QA | TriviaQA |
|---|---|---|---|---|---|---|---|---|---|
| text-davinci-002 + PKG [paper] | May 8th, 2023 | 175B | INF | _ | _ | _ | _ | 47.4 | _ |
| GPT-3.5-turbo + CRITIC [paper] | May 19th, 2023 | _ | INF | _ | _ | 38.7 | _ | _ | 75.1 |
| text-davinci-002 + LMvsLM [paper] | May 22nd, 2023 | 175B | INF | _ | _ | _ | _ | _ | 83.1 |
| GPT-3.5-turbo + CoK [paper] | May 22nd, 2023 | _ | INF | _ | _ | 35.4 | _ | 73.3 | _ |
| Platypus [paper] | Aug 14th, 2023 | 70B | FT | 62.3 | _ | _ | _ | _ | _ |
| GPT-3.5-turbo + KSL [paper] | Sept 6th, 2023 | _ | FT | _ | _ | _ | 81.6 | _ | _ |
| Llama + CoVe [paper] | Sept 20th, 2023 | 65B | FT + INF | _ | 71.4 | _ | _ | _ | _ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-3.5-turbo | Nov 2022 | ? | ? | 47.0 | 58.7 | 24.0 | 78.3 | 44.4 | 79.3 |

Citation

If our work is useful to your research, please consider citing our survey:

@article{chen2023chatgpt,
  title={ChatGPT's One-year Anniversary: Are Open-Source Large Language Models Catching up?},
  author={Chen, Hailin and Jiao, Fangkai and Li, Xingxuan and Qin, Chengwei and Ravaut, Mathieu and Zhao, Ruochen and Xiong, Caiming and Joty, Shafiq},
  journal={arXiv preprint arXiv:2311.16989},
  year={2023}
}

About

Listing all reported open-source LLMs achieving a higher score than proprietary, paid OpenAI models (ChatGPT, GPT-4).
