🌐 Project | 📄 Paper | 🤗 Dataset
SciZoom is a large-scale benchmark for hierarchical scientific summarization comprising 44,946 papers from NeurIPS, ICLR, ICML, and EMNLP (2020-2025), stratified into Pre-LLM and Post-LLM eras.
- Hierarchical Annotations: Full Text → Abstract (70:1) → Contributions (110:1) → TL;DR (600:1)
- Temporal Stratification: Pre-LLM (37.3%) vs Post-LLM (62.7%) around Nov 2022 ChatGPT release
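The ratios in the list above describe how much shorter each level is than the full text. The README does not say whether length is counted in words, tokens, or characters, so the following is only a minimal sketch that assumes whitespace-separated words. The helper name `compression_ratio` and the toy strings are hypothetical and are not part of the SciZoom code.

```python
def compression_ratio(source: str, summary: str) -> float:
    """Return source length divided by summary length, counted in words.

    Assumption: length is measured in whitespace-separated words; the
    benchmark itself may count tokens or characters instead.
    """
    source_words = len(source.split())
    summary_words = len(summary.split())
    if summary_words == 0:
        raise ValueError("summary must contain at least one word")
    return source_words / summary_words


# Toy strings for illustration only, not SciZoom data.
full_text = "word " * 700
abstract = "word " * 10

ratio = compression_ratio(full_text, abstract)
print(f"{ratio:.0f}:1")  # prints 70:1
```

Applying the same function to each pair (full text vs. abstract, contributions, TL;DR) gives one ratio per granularity level.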
git clone https://github.com/janghana/SciZoom.git
cd SciZoom
bash setup_scizoom.sh

from datasets import load_dataset
dataset = load_dataset("hanjang/SciZoom")
for paper in dataset["test"]:
    print(paper["title"])
    print(paper["abstract"])
    print(paper["contributions"])
    print(paper["era"])  # 'pre-llm' or 'post-llm'

For detailed data exploration and analysis, see tutorials.ipynb.
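As a starting point for era-based analysis, the share of papers per era can be tallied from the `era` field shown above. This is a minimal sketch on hand-written toy records rather than the real dataset. The helper name `era_shares` is hypothetical, and only the `era` values `'pre-llm'` and `'post-llm'` are taken from the loading example.

```python
from collections import Counter


def era_shares(papers):
    """Return the fraction of papers in each era.

    `papers` is any iterable of dict-like records with an "era" key,
    for example the rows of a loaded dataset split.
    """
    counts = Counter(paper["era"] for paper in papers)
    total = sum(counts.values())
    return {era: count / total for era, count in counts.items()}


# Toy records for illustration only, not SciZoom data.
toy_papers = [
    {"title": "A", "era": "pre-llm"},
    {"title": "B", "era": "post-llm"},
    {"title": "C", "era": "post-llm"},
    {"title": "D", "era": "post-llm"},
]

shares = era_shares(toy_papers)
print(shares)
```

The same function should accept a loaded split directly, since iterating over it yields one record per paper.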
For efficient LLM inference, we recommend vLLM.
pip install vllm
See the official documentation: https://docs.vllm.ai/
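A rough sketch of how batch TL;DR generation with vLLM might look is given here. The prompt wording, the helper names `build_tldr_prompt` and `generate_tldrs`, and the sampling settings are all assumptions for illustration and are not the prompts or settings used in the paper. The model name is left as a parameter because the README does not name one.

```python
def build_tldr_prompt(abstract: str) -> str:
    """Build a one-sentence summarization prompt (hypothetical wording)."""
    return (
        "Summarize the following abstract in one sentence.\n\n"
        f"Abstract: {abstract}\n\nTL;DR:"
    )


def generate_tldrs(abstracts, model_name):
    """Generate one TL;DR per abstract with vLLM.

    Assumption: `model_name` is any model identifier that vLLM can load.
    """
    # Imported lazily so the prompt helper works without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_name)
    # Greedy decoding with a short output budget; both values are assumptions.
    params = SamplingParams(temperature=0.0, max_tokens=64)
    prompts = [build_tldr_prompt(a) for a in abstracts]
    outputs = llm.generate(prompts, params)
    # Each result holds a list of candidate completions; keep the first.
    return [out.outputs[0].text.strip() for out in outputs]


example_prompt = build_tldr_prompt("We study X.")
print(example_prompt)
```

Passing all prompts in a single `generate` call lets vLLM batch them internally, which is the main reason to prefer it over a per-paper loop.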
For semantic similarity analysis, we use NV-Embed-v2.
See official model card: https://huggingface.co/nvidia/NV-Embed-v2
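Once two texts have been embedded, their semantic similarity is commonly scored with cosine similarity. The README does not state which metric is used, so treat the following as a sketch under that assumption. The vectors are toy values standing in for NV-Embed-v2 embeddings, and `cosine_similarity` is a hypothetical helper written in plain Python.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors; real embeddings have far more dimensions.
abstract_vec = [1.0, 0.0, 1.0]
tldr_vec = [1.0, 0.0, 0.0]

sim = cosine_similarity(abstract_vec, tldr_vec)
print(f"{sim:.4f}")  # prints 0.7071
```

Scoring every pair of levels for a paper (for example abstract vs. TL;DR) gives a per-paper similarity that can be compared across granularities or eras.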
See tutorials.ipynb for:
- Data loading and exploration
- Era-based analysis & visualization
- Hierarchical summarization evaluation
- Cross-granularity similarity analysis
@article{jang2026scizoom,
  title={SciZoom: A Large-scale Benchmark for Hierarchical Scientific Summarization across the LLM Era},
  author={Jang, Han and Lee, Junhyeok and Choi, Kyu Sung},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}

- Dataset: CC-BY-4.0
- Code: MIT License
This work was conducted at the AICON Lab (Advanced Imaging and Computational Neuroimaging Laboratory), Department of Radiology, Seoul National University Hospital.
- Han Jang - hanjang@snu.ac.kr | janghana | Google Scholar
- AICON Lab - snuh-rad-aicon
