
InteractScience: Programmatic and Visually-Grounded Evaluation of Interactive Scientific Demonstration Code Generation

InteractScience is a benchmark specifically designed to evaluate the ability of large language models to generate interactive scientific demonstration code. This repository provides a complete evaluation pipeline, including model inference, automated testing, and multi-dimensional assessment.

πŸ“ Directory Structure

.
β”œβ”€β”€ data/                           # Benchmark dataset
β”‚   β”œβ”€β”€ interactscience.jsonl       # Main dataset file containing problems and references
β”‚   └── snapshots/                  # Reference screenshot directory
β”‚       β”œβ”€β”€ *_Snapshot-1.png
β”‚       β”œβ”€β”€ *_Snapshot-2.png
β”‚       └── ...
β”œβ”€β”€ PFT_tests/                      # Program Functionality Testing (PFT) scripts
β”‚   β”œβ”€β”€ *.spec.js                   # Playwright test scripts
β”‚   └── ...
β”œβ”€β”€ VQT_tests/                      # Visual Quality Testing (VQT) scripts
β”‚   β”œβ”€β”€ *.spec.js                   # Playwright test scripts
β”‚   └── ...
β”œβ”€β”€ eval/                           # Model inference results
β”‚   β”œβ”€β”€ interactscience_lm_*.jsonl  # Language model inference results
β”‚   β”œβ”€β”€ interactscience_vlm_*.jsonl # Vision-language model inference results
β”‚   └── ...
β”œβ”€β”€ results/                        # Test result data
β”‚   β”œβ”€β”€ lm_results/                 # Language model test results
β”‚   β”‚   β”œβ”€β”€ PFT_test_results/       # Program functionality test results
β”‚   β”‚   β”œβ”€β”€ VQT_test_results/       # Visual quality test results
β”‚   β”‚   β”œβ”€β”€ VQT_clip_results/       # CLIP scoring results
β”‚   β”‚   └── VQT_vlm_judge_results/  # VLM scoring results
β”‚   └── vlm_results/                # Vision-language model test results
β”œβ”€β”€ run_generation.sh               # Model inference script
β”œβ”€β”€ run_benchmark.sh                # Automated testing script
β”œβ”€β”€ run_vlm_as_judge.sh             # VLM scoring script
β”œβ”€β”€ cal_metrics.py                  # Metrics calculation script
β”œβ”€β”€ test_llm.py                     # Language model testing main program
β”œβ”€β”€ vlm_as_judge.py                 # VLM scoring main program
β”œβ”€β”€ clip_score.py                   # CLIP score calculation
└── extract_and_save_code.py        # Code extraction and saving

πŸš€ Usage Tutorial

1. Environment Setup

First install Node.js and npm, then install the Playwright testing environment:

# Install project dependencies
npm install

# Install Playwright browsers
npx playwright install
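
If you want to sanity-check the environment first, the following commands (standard Node.js and Playwright CLI calls, nothing project-specific) print the installed versions:

# Verify that Node.js, npm, and the Playwright CLI are available
node --version
npm --version
npx playwright --version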

2. Model Inference

Use the run_generation.sh script for model inference:

# Edit the model path and parameters in the script
vim run_generation.sh

# Run inference (requires model path configuration)
bash run_generation.sh

Script Description:

  • Starts a vLLM API server
  • Calls test_llm.py to run inference
  • Saves results to the eval/ directory (a minimal sketch of this flow is shown below)
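
The exact contents of run_generation.sh depend on your serving setup. The following is a minimal sketch of the flow above, not the shipped script: it assumes an OpenAI-compatible vLLM server on port 8000 and reuses the test_llm.py arguments documented later in this README; MODEL_PATH and the dump file name are placeholders.

# Sketch only: placeholder model path and output file name
MODEL_PATH="your_model_path"

# 1. Launch an OpenAI-compatible vLLM server in the background
python -m vllm.entrypoints.openai.api_server --model "$MODEL_PATH" --port 8000 &
# (in practice, wait until the server reports it is ready before querying it)

# 2. Run inference against the local endpoint; results are written to eval/
python test_llm.py \
    --dataset_path data/interactscience.jsonl \
    --prompt_type lm_system_prompt \
    --dump_path eval/interactscience_lm_mymodel.jsonl \
    --model_path "$MODEL_PATH" \
    --base_url http://localhost:8000/v1 \
    --api_key EMPTY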

3. Automated Testing

Use the run_benchmark.sh script for automated testing:

# Set the model name to test
export MODEL="your_model_name"

# Run tests
bash run_benchmark.sh

Testing Process:

  1. Extract HTML code from the inference results (extract_and_save_code.py)
  2. Execute Program Functionality Testing (PFT) using playwright_PFT.config.js
  3. Execute Visual Quality Testing (VQT) using playwright_VQT.config.js
  4. Calculate CLIP similarity scores (clip_score.py)
  5. Results are saved to the results/ directory (a sketch of this pipeline follows)
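
A rough bash illustration of the pipeline is shown here. Only the Playwright --config flag and the config file names come from this README; the argument lists of the Python scripts are not documented above, so check each script before running it this way.

# Sketch only: the shipped run_benchmark.sh may differ
export MODEL="your_model_name"

# 1. Extract HTML files from the model's inference output
python extract_and_save_code.py

# 2. Program Functionality Testing (PFT)
npx playwright test --config=playwright_PFT.config.js

# 3. Visual Quality Testing (VQT), which also produces the generated screenshots
npx playwright test --config=playwright_VQT.config.js

# 4. CLIP similarity between reference and generated screenshots
python clip_score.py

# Test outputs are collected under results/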

4. VLM Scoring

Use run_vlm_as_judge.sh for VLM-as-Judge evaluation:

# Edit model and path configuration in the script
vim run_vlm_as_judge.sh

# Run VLM scoring
bash run_vlm_as_judge.sh

Scoring Description:

  • Uses a vision-language model to score the generated results
  • Compares reference screenshots with the generated screenshots
  • Evaluates against predefined checklists (a minimal wrapper sketch is shown below)
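
A minimal sketch of what the script wraps is shown here. The vlm_as_judge.py flags mirror the invocation documented under "Core Scripts Description"; the endpoint, key, and directory values are placeholders or guesses based on the results/ layout above, so adjust them to your setup.

# Sketch only: endpoint and key are placeholders
BASE_URL="your_api_endpoint"
API_KEY="your_api_key"

# Directory arguments are assumptions based on the results/ layout shown earlier
python vlm_as_judge.py \
    --reference_image_dir data/snapshots \
    --generated_image_dir results/lm_results/VQT_test_results \
    --checklist_file data/checklists.jsonl \
    --output_path results/lm_results/VQT_vlm_judge_results/vlm_judge.jsonl \
    --base_url "$BASE_URL" \
    --api_key "$API_KEY"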

5. Results Analysis

Use cal_metrics.py and cal_vlm_as_judege_score.py to calculate final metrics:

python cal_metrics.py
python cal_vlm_as_judege_score.py

πŸ“Š Dataset Description

interactscience.jsonl

The main dataset file; each line is one test sample with the following fields:

  • id: Unique identifier
  • question: Detailed HTML implementation plan
  • lm_system_prompt: Language model system prompt
  • vlm_system_prompt: Vision-language model system prompt
  • image_path: List of reference screenshot paths
  • snapshot_checklists: Visual verification checklists

Reference Screenshots

Located in the data/snapshots/ directory, with filenames of the form:

  • {task_id}_Snapshot-{number}.png
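
To get a quick feel for the data layout, you can inspect one record and its matching screenshots from the shell (only standard tools are used; jq is optional, and TASK_ID is a placeholder for a real task id):

# Print the field names of the first sample (requires jq)
head -n 1 data/interactscience.jsonl | jq 'keys'

# List the reference screenshots belonging to one task
ls data/snapshots/ | grep "^TASK_ID_Snapshot-"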

πŸ§ͺ Test Types

1. Program Functionality Testing (PFT)

  • Validates functional correctness of HTML code
  • Checks interactive element behavior
  • Tests JavaScript logic

2. Visual Quality Testing (VQT)

  • Generates page screenshots
  • Compares with reference screenshots
  • Calculates perceptual similarity (CLIP scores)
  • Calculates semantic correctness (VLM-judge scores)
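
Both test types are ordinary Playwright suites, so individual specs can also be run directly. For example (the per-task spec filename below is an assumption about how the *.spec.js files are named; the --config values are the configs referenced above):

# Run the PFT spec for a single task
npx playwright test --config=playwright_PFT.config.js PFT_tests/<task_id>.spec.js

# Run the corresponding VQT spec
npx playwright test --config=playwright_VQT.config.js VQT_tests/<task_id>.spec.js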

πŸ› οΈ Core Scripts Description

test_llm.py

Language model testing main program:

python test_llm.py \
    --dataset_path data/interactscience.jsonl \
    --prompt_type lm_system_prompt \
    --dump_path eval/result.jsonl \
    --model_path your_model_path \
    --base_url http://localhost:8000/v1 \
    --api_key EMPTY

vlm_as_judge.py

VLM scoring main program:

python vlm_as_judge.py \
    --reference_image_dir data/snapshots \
    --generated_image_dir generated_images \
    --checklist_file data/checklists.jsonl \
    --output_path results/vlm_judge.jsonl \
    --base_url your_api_endpoint \
    --api_key your_api_key

πŸ“ˆ Evaluation Metrics

  • Program Functionality Test Pass Rate: percentage of PFT test cases passed
  • Visual Quality Score: visual similarity to the reference screenshots, computed with the CLIP model
  • VLM Score: overall score assigned by a vision-language model acting as judge

Experiments

We have evaluated 30 state-of-the-art large language models on the InteractScience benchmark. The results are available in the results/ directory.

| Model | PFT Overall (%) | PFT Average (%) | PFT Perfect (%) | VQT Action (%) | VQT CLIP | VQT VLM-judge |
|---|---|---|---|---|---|---|
| Closed-Source Large Language Models | | | | | | |
| GPT-5 | 39.47 | 37.61 | 16.08 | 89.66 | 71.95 | 57.02 |
| GPT-4.1 | 37.07 | 34.08 | 11.19 | 89.15 | 71.21 | 52.84 |
| GPT-4o | 28.27 | 27.09 | 5.59 | 85.93 | 67.11 | 42.45 |
| o3 | 34.93 | 32.09 | 13.99 | 89.83 | 72.24 | 52.82 |
| o4-mini | 37.33 | 34.90 | 13.29 | 88.64 | 71.79 | 51.90 |
| Gemini-2.5-Pro | 35.33 | 34.62 | 11.19 | 86.78 | 70.65 | 54.69 |
| Gemini-2.5-Flash | 31.60 | 31.07 | 10.49 | 86.95 | 69.59 | 49.34 |
| Claude-Sonnet-4-20250514 | 41.47 | 37.40 | 13.29 | 89.66 | 73.50 | 55.42 |
| Claude-Opus-4-20250514 | 40.27 | 36.34 | 11.19 | 89.32 | 73.22 | 54.93 |
| Claude-3.5-Sonnet | 33.33 | 31.45 | 9.79 | 90.17 | 72.32 | 49.43 |
| Open-Source Large Language Models | | | | | | |
| DeepSeek-R1-0528 | 33.87 | 32.02 | 8.39 | 88.31 | 69.54 | 49.46 |
| DeepSeek-V3-0324 | 31.73 | 30.57 | 10.49 | 85.93 | 68.68 | 49.46 |
| Kimi-K2 | 31.60 | 31.22 | 9.79 | 87.29 | 70.11 | 50.04 |
| GLM-4.5 | 29.33 | 26.65 | 8.39 | 70.51 | 55.90 | 38.57 |
| Intern-S1 | 31.87 | 28.93 | 7.69 | 87.46 | 68.74 | 45.27 |
| gpt-oss-120b | 28.00 | 27.78 | 9.79 | 90.85 | 72.13 | 49.57 |
| gpt-oss-20b | 15.20 | 12.97 | 3.50 | 80.51 | 54.68 | 21.40 |
| Qwen3-235B-A22B-Instruct-2507 | 33.33 | 31.46 | 13.29 | 78.14 | 70.02 | 45.14 |
| Qwen3-32B | 27.20 | 24.09 | 5.59 | 87.46 | 66.46 | 39.69 |
| Qwen3-14B | 24.13 | 23.58 | 7.69 | 85.08 | 66.46 | 36.53 |
| Qwen3-8B | 20.00 | 18.85 | 4.20 | 81.53 | 64.13 | 34.67 |
| Qwen3-4B | 14.67 | 13.10 | 2.80 | 82.03 | 60.90 | 28.33 |
| Qwen3-1.7B | 6.53 | 6.22 | 1.40 | 75.76 | 59.65 | 20.33 |
| Qwen2.5-Coder-32B-Instruct | 27.20 | 25.10 | 7.69 | 84.58 | 51.67 | 38.51 |
| Qwen2.5-Coder-14B-Instruct | 22.53 | 20.61 | 4.90 | 85.42 | 64.47 | 35.72 |
| Qwen2.5-Coder-7B-Instruct | 12.40 | 10.51 | 0.70 | 82.37 | 65.17 | 26.97 |
| Qwen2.5-VL-72B-Instruct | 23.73 | 22.82 | 6.99 | 87.12 | 64.33 | 37.30 |
| Qwen2.5-VL-7B-Instruct | 7.47 | 6.72 | 0.70 | 70.00 | 49.49 | 20.41 |
| Llama-3.1-70B-Instruct | 18.67 | 18.04 | 4.90 | 88.64 | 59.56 | 33.36 |
| Llama-3.1-8B-Instruct | 11.33 | 10.16 | 3.50 | 80.00 | 65.42 | 22.75 |

Comparison Across Difficulty Levels

Comparison Across Disciplines

Results on Multimodal LLMs with Reference Snapshots as Input

Example Cases

Citation

@article{InteractScience,
  author       = {Qiaosheng Chen and Yang Liu and Lei Li and Kai Chen and Qipeng Guo and Gong Cheng and Fei Yuan},
  title        = {InteractScience: Programmatic and Visually-Grounded Evaluation of Interactive Scientific Demonstration Code Generation}, 
  journal      = {arXiv preprint arXiv:2510.09724},
  year         = {2025}
}
