CREATE

Authors: Manya Wadhwa, Tiasa Singha Roy, Harvey Lederman, Junyi Jessy Li, Greg Durrett

Overview

CREATE is a benchmark designed to measure associative reasoning in models. This benchmark evaluates whether models can construct valid, diverse, and insightful paths that connect two concepts through intermediate entities or relationships. We introduce creative utility, a unified metric that captures both the quality and diversity of generated connections.

Example query:

“What are different ways to connect Dakota Johnson to people who starred in fantasy or science-fiction movies?”

We want the model to generate paths like:

Dakota Johnson co-stars with Chris Evans in Materialists; Chris Evans played Captain America in The Avengers.

Dakota Johnson is the stepdaughter of Antonio Banderas, who voiced Puss in Boots in Shrek.

These responses illustrate associative creativity: each path is coherent, factually grounded, and offers a distinct conceptual route between the two endpoints.

🛠️ Installation

First, install the required dependencies:

python3 -m venv create_env 
source create_env/bin/activate
pip install -r requirements.txt

🚀 Running CREATE

Our benchmark is available on Hugging Face. The following snippet shows how to load it.

from datasets import load_dataset

data = load_dataset('wadhma/CREATE')['train'].to_pandas()
print(data['query'])  # the benchmark questions

The base prompt we use in the paper is included in prompt.py.

📊 Evaluation

Setup

Copy keyhandler_template.py to keyhandler.py and add your API keys (OpenAI, Anthropic, and/or Hugging Face, as needed for your evaluator model).
Ensure your predictions file has the required format (see Input Format below).

Input Format

Your input must be a .jsonl file or a Hugging Face dataset with at least the following columns:

| Column | Type | Description |
|---|---|---|
| `query` | str | The benchmark question |
| `path_prediction` | list[str] | Model-generated paths (one string per path) |
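
As a minimal sketch of the format above, the following writes a predictions file and checks that each row has the two required columns with the expected types. The helper names `write_predictions` and `validate_predictions` are hypothetical and not part of this repository; the example row reuses the query and paths from the Overview.

```python
import json
import tempfile
from pathlib import Path

# Columns required by the input format described above.
REQUIRED_COLUMNS = ("query", "path_prediction")


def write_predictions(records, path):
    """Write one JSON object per line (JSONL). Hypothetical helper."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def validate_predictions(path):
    """Return the number of valid rows; raise ValueError on the first bad row."""
    count = 0
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            missing = [c for c in REQUIRED_COLUMNS if c not in row]
            if missing:
                raise ValueError(f"line {line_no}: missing columns {missing}")
            if not isinstance(row["query"], str):
                raise ValueError(f"line {line_no}: 'query' must be a string")
            paths = row["path_prediction"]
            if not isinstance(paths, list) or not all(isinstance(p, str) for p in paths):
                raise ValueError(f"line {line_no}: 'path_prediction' must be a list of strings")
            count += 1
    return count


records = [
    {
        "query": "What are different ways to connect Dakota Johnson to people "
                 "who starred in fantasy or science-fiction movies?",
        "path_prediction": [
            "Dakota Johnson co-stars with Chris Evans in Materialists; "
            "Chris Evans played Captain America in The Avengers.",
            "Dakota Johnson is the stepdaughter of Antonio Banderas, "
            "who voiced Puss in Boots in Shrek.",
        ],
    }
]

# Write to a temporary directory so the demo does not overwrite real files.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "predictions.jsonl"
    write_predictions(records, out)
    print(validate_predictions(out))  # prints 1
```

Extra columns are allowed by this check, since the format only requires "at least" these two.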

Running Evaluation

The script computes strength and factuality scores for each path, then aggregates them into the creative utility metric. In the paper, we use gpt-oss-120b for evaluation; you can use any model supported by LiteLLM (the script's default evaluator is `gpt-4.1-mini-2025-04-14`). Note: we are in the process of migrating the script to bespoke-curator for more efficient inference.

python evaluate_creative_utility.py --input_file <path_or_hf_dataset> [options]

Common options:

| Option | Default | Description |
|---|---|---|
| `--input_file` | (required) | Path to `.jsonl` file or Hugging Face dataset (e.g. `org/dataset-name`) |
| `--split` | `train` | Dataset split (Hugging Face only) |
| `--response_column` | `path_prediction` | Column name for model responses |
| `--model_name` | `gpt-4.1-mini-2025-04-14` | Evaluator model for strength/factuality |
| `--patience` | `0.9` | Patience parameter for creative utility |
| `--factuality_threshold` | `1.0` | Threshold for filtering paths |
| `--output` | None | Output path for results (JSONL) |
| `--vllm` | False | Use vLLM endpoint for inference |
| `--server_url` | `""` | Server URL for vLLM/open-source models |
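
The exact definition of creative utility lives in creative_utility.py and the paper. As a rough mental model only, the sketch below shows one way a factuality threshold and a patience parameter could interact: paths below the threshold are dropped, and each remaining path's strength is discounted geometrically by its position. The geometric discount (borrowed from rank-biased precision), the score ranges, and the function name `toy_creative_utility` are all assumptions made for illustration; this is not the repository's formula.

```python
def toy_creative_utility(paths, patience=0.9, factuality_threshold=1.0):
    """Illustrative aggregation only (NOT the formula in creative_utility.py).

    paths: list of (strength, factuality) pairs, in the order the model
    produced them. Both scores are assumed to be plain floats.
    """
    # Assumption: a path is kept only if its factuality meets the threshold.
    kept = [strength for strength, factuality in paths
            if factuality >= factuality_threshold]
    # Assumption: the k-th kept path is weighted by patience**k, so a higher
    # patience gives later paths more credit.
    return sum(strength * patience**k for k, strength in enumerate(kept))


# Three hypothetical paths; the third fails the factuality filter.
scores = [(1.0, 1.0), (0.5, 1.0), (0.8, 0.0)]
print(toy_creative_utility(scores))
print(toy_creative_utility(scores, patience=0.5))
```

Under these assumptions, lowering `--patience` shrinks the contribution of every path after the first, while lowering `--factuality_threshold` lets more paths through the filter.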

Example:

python evaluate_creative_utility.py --input_file predictions.jsonl --model_name gpt-4o --output results.jsonl

The script prints a summary including mean creative utility, average strength, average factuality, and evaluation cost. See evaluate_creative_utility.py for full details.
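
If you prefer to launch the evaluation from Python (for example, to sweep over evaluator models), the command can be assembled from the options listed above. This is a small sketch: `build_eval_command` and `run_eval` are hypothetical helper names, and `--vllm` and `--server_url` are left out because their exact argument form is not shown in this README.

```python
import subprocess
import sys


def build_eval_command(input_file, split=None, response_column=None,
                       model_name=None, patience=None,
                       factuality_threshold=None, output=None):
    """Build the argument list for evaluate_creative_utility.py.

    Only flags documented in the options table are used; anything left as
    None is omitted so the script's own default applies.
    """
    cmd = [sys.executable, "evaluate_creative_utility.py",
           "--input_file", input_file]
    optional = {
        "--split": split,
        "--response_column": response_column,
        "--model_name": model_name,
        "--patience": patience,
        "--factuality_threshold": factuality_threshold,
        "--output": output,
    }
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


def run_eval(cmd):
    """Run the evaluation; must be called from the repository root."""
    subprocess.run(cmd, check=True)


cmd = build_eval_command("predictions.jsonl", model_name="gpt-4o",
                         output="results.jsonl")
print(" ".join(cmd[1:]))
# run_eval(cmd)  # uncomment to launch the evaluation (requires API keys)
```

Passing the command as a list rather than a single string avoids shell quoting issues when file paths contain spaces.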

Citation

@InProceedings{Wadhwa-Et-Al-2026:CREATE,
  title = {CREATE: Testing LLMs for Associative Creativity},
  author = {Manya Wadhwa and Tiasa Singha Roy and Harvey Lederman and Junyi Jessy Li and Greg Durrett},
  booktitle = {arXiv},
  year = {2026},
}
