🔬 SciZoom: A Large-scale Benchmark for Hierarchical Scientific Summarization across the LLM Era

TL;DR

SciZoom is a large-scale benchmark for hierarchical scientific summarization comprising 44,946 papers from NeurIPS, ICLR, ICML, and EMNLP (2020-2025), stratified into Pre-LLM and Post-LLM eras.

Hierarchical Annotations: Full Text → Abstract (70:1) → Contributions (110:1) → TL;DR (600:1)
Temporal Stratification: Pre-LLM (37.3%) vs. Post-LLM (62.7%), split at the November 2022 ChatGPT release
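As a rough illustration of what a ratio such as 70:1 means, the sketch below computes a compression ratio as source length divided by summary length. The README does not state how the ratios were measured, so whitespace tokenization and the helper name `compression_ratio` are assumptions made for this example only.

```python
def compression_ratio(source: str, summary: str) -> float:
    """Return source length / summary length in whitespace-separated tokens.

    Assumption: whitespace tokenization. The benchmark may count length
    differently (for example, with a model tokenizer).
    """
    source_len = len(source.split())
    summary_len = len(summary.split())
    if summary_len == 0:
        raise ValueError("summary must contain at least one token")
    return source_len / summary_len


# Toy strings, not benchmark data: 7,000 tokens summarized in 100 tokens.
full_text = "word " * 7000
abstract = "word " * 100
print(f"{compression_ratio(full_text, abstract):.0f}:1")  # prints "70:1"
```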

Quick Start

Installation

git clone https://github.com/janghana/SciZoom.git
cd SciZoom
bash setup_scizoom.sh

Load Dataset

from datasets import load_dataset

dataset = load_dataset("hanjang/SciZoom")

for paper in dataset["test"]:
    print(paper["title"])
    print(paper["abstract"])
    print(paper["contributions"])
    print(paper["era"])  # 'pre-llm' or 'post-llm'

For detailed data exploration and analysis, see tutorials.ipynb.
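The `era` field can also be used to check the temporal split. The following is a minimal sketch, assuming each record exposes `era` as shown in the loading example; the helper name `era_shares` is hypothetical and not part of the repository.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def era_shares(papers: Iterable[Mapping[str, str]]) -> Dict[str, float]:
    """Return the fraction of papers in each era ('pre-llm' or 'post-llm')."""
    counts = Counter(paper["era"] for paper in papers)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {era: count / total for era, count in counts.items()}


# Toy records standing in for dataset rows; only the 'era' field is used.
toy_papers = [
    {"era": "pre-llm"},
    {"era": "post-llm"},
    {"era": "post-llm"},
    {"era": "post-llm"},
]
print(era_shares(toy_papers))
```

With the real data, the same function could be called as `era_shares(dataset["test"])` after the loading step above.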

LLM Inference

Summarization with vLLM

For efficient LLM inference, we recommend vLLM.

pip install vllm

See the official documentation: https://docs.vllm.ai/
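A minimal sketch of batch summarization with vLLM follows. The README does not specify prompts, models, or decoding settings, so the prompt wording, the helper names `build_prompt` and `summarize`, the greedy decoding setting, and the choice of the `abstract` field as input are all assumptions. The model name is left as a parameter for the same reason.

```python
from typing import Dict, List

# Hypothetical instructions per target granularity; not taken from the paper.
INSTRUCTIONS = {
    "contributions": "List the main contributions of this paper as short bullet points.",
    "tldr": "Summarize this paper in one sentence.",
}


def build_prompt(paper: Dict[str, str], target: str = "tldr") -> str:
    """Build a plain-text prompt from a paper's title and abstract."""
    if target not in INSTRUCTIONS:
        raise ValueError(f"unknown target: {target!r}")
    return (
        f"Title: {paper['title']}\n"
        f"Abstract: {paper['abstract']}\n\n"
        f"{INSTRUCTIONS[target]}\n"
    )


def summarize(papers: List[Dict[str, str]], model_name: str,
              target: str = "tldr", max_tokens: int = 256) -> List[str]:
    """Generate one summary per paper with vLLM (needs vllm and a supported GPU)."""
    # Imported lazily so build_prompt can be used without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_name)
    params = SamplingParams(temperature=0.0, max_tokens=max_tokens)  # greedy decoding
    prompts = [build_prompt(paper, target) for paper in papers]
    outputs = llm.generate(prompts, params)
    return [output.outputs[0].text.strip() for output in outputs]
```

Passing all prompts to `generate` in a single call lets vLLM batch them internally, which is the main reason to prefer it over a per-paper loop.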

Embedding with NV-Embed-v2

For semantic similarity analysis, we use NV-Embed-v2.

See the official model card: https://huggingface.co/nvidia/NV-Embed-v2
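Once two summaries have been embedded, their semantic similarity can be scored from the vectors alone. The sketch below uses cosine similarity, which is an assumption: the README does not name the measure. The vectors are toy stand-ins, not NV-Embed-v2 outputs, and `cosine_similarity` is a hypothetical helper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for abstract and TL;DR embeddings.
abstract_vec = [0.2, 0.5, 0.1]
tldr_vec = [0.4, 1.0, 0.2]  # same direction, different magnitude
print(round(cosine_similarity(abstract_vec, tldr_vec), 6))  # prints 1.0
```

Cosine similarity ignores vector magnitude, so it compares direction only; this makes scores comparable across texts of very different lengths.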

Tutorials

See tutorials.ipynb for:

Data loading and exploration
Era-based analysis & visualization
Hierarchical summarization evaluation
Cross-granularity similarity analysis

Citation

@article{jang2026scizoom,
  title={SciZoom: A Large-scale Benchmark for Hierarchical Scientific Summarization across the LLM Era},
  author={Jang, Han and Lee, Junhyeok and Choi, Kyu Sung},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}

License

Dataset: CC-BY-4.0
Code: MIT License

Acknowledgments

This work was conducted at the AICON Lab (Advanced Imaging and Computational Neuroimaging Laboratory), Department of Radiology, Seoul National University Hospital.

Contact

Han Jang - hanjang@snu.ac.kr | janghana | Google Scholar
AICON Lab - snuh-rad-aicon
