[ICLR 2026] InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models


πŸ“š Introduction

We present the InternSVG family, an integrated data–benchmark–model suite.

  • 🧩 SAgoge Dataset β€” The largest and most comprehensive multimodal dataset for SVG tasks, spanning icons, long-sequence illustrations, scientific diagrams, and dynamic animations. It provides rich hierarchical structures and diverse attributes, supporting tasks of varied difficulty levels.
  • πŸ“Š SArena Benchmark β€” A companion benchmark offering unified task definitions and standardized evaluation protocols, aligned with SAgoge’s domains and difficulty spectrum. It enables consistent comparison across SVG understanding, editing, and generation tasks.
  • πŸ€– InternSVG Model β€” A unified multimodal large language model (MLLM) for SVG understanding, editing, and generation.

πŸ”₯ News

  • [2026-01-28] πŸŽ‰ InternSVG-8B is now available on HuggingFace! πŸ€—Model
  • [2026-01-28] πŸŽ‰ We release the SAgoge dataset. πŸ€—Dataset
  • [2026-01-26] πŸŽ‰ InternSVG has been accepted at ICLR 2026!
  • [2025-10-13] πŸŽ‰ We release the SArena benchmark. πŸ€—Benchmark
  • [2025-10-13] πŸ‘‹ Uploaded the paper and initialized the project. Read

πŸ“ Open-Source Plan

  • Evaluation code
  • SArena benchmark
  • SAgoge dataset
  • Fine-tuning scripts
  • Model weights
  • Paper

πŸ“Œ Quick Start

βš™οΈ Installation

git clone https://github.com/hmwang2002/InternSVG.git
cd InternSVG

conda create -n internsvg python=3.9 -y
conda activate internsvg
pip install -r requirements.txt

# install clip
pip install git+https://github.com/openai/CLIP.git

Download ViCLIP.

mkdir sarena_ckpt
cd sarena_ckpt
# Requires access to https://huggingface.co/OpenGVLab/ViCLIP. Log in first with "huggingface-cli login".
huggingface-cli download --resume-download OpenGVLab/ViCLIP ViClip-InternVid-10M-FLT.pth --local-dir .
cd ..

For training, you need to install LLaMA-Factory.

pip install deepspeed==0.16.9
pip install av==14.4.0
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
cd ..

(Optional) If you need to simplify your own SVG code, install svgo.

conda install nodejs
npm install -g svgo
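As a rough illustration of the kind of cleanup svgo performs, the snippet below strips XML comments and collapses whitespace between tags using only Python's standard library. This is a minimal sketch of the idea, not a replacement for svgo's path-level optimizations:

```python
import re

def simplify_svg(svg: str) -> str:
    """Very rough SVG cleanup: drop XML comments and collapse the
    whitespace between tags. Real minification (path merging, numeric
    precision reduction) still requires svgo."""
    svg = re.sub(r"<!--.*?-->", "", svg, flags=re.DOTALL)  # remove comments
    svg = re.sub(r">\s+<", "><", svg)                      # inter-tag whitespace
    return svg.strip()

raw = """<svg xmlns="http://www.w3.org/2000/svg">
  <!-- generated by an editor -->
  <circle cx="5" cy="5" r="4"/>
</svg>"""
print(simplify_svg(raw))
```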

πŸ€– InternSVG Model

The InternSVG-8B model is available on Hugging Face. It is built on InternVL3-8B, extends the vocabulary with SVG-specific tokens, and is trained with two-stage supervised fine-tuning (SFT) on the large-scale SVG training samples from the SAgoge dataset.

Deploy

We recommend using LMDeploy for deployment. An example of launching a proxy server with 8 parallel workers (one per GPU) is provided below:

#!/bin/bash
model_path="MODEL_PATH"
model_name="InternSVG"

# proxy
mkdir -p logs  # the workers below append their logs here
lmdeploy serve proxy --server-name 0.0.0.0 --server-port 10010 --routing-strategy "min_expected_latency" &

worker_num=8
for ((i = 0; i < worker_num; i++)); do
    timestamp=$(date +"%Y-%m-%d_%H-%M-%S")
    CUDA_VISIBLE_DEVICES="${i}" lmdeploy serve api_server ${model_path} --proxy-url http://0.0.0.0:10010 \
        --model-name ${model_name} \
        --tp 1 \
        --max-batch-size 512 \
        --backend pytorch \
        --server-port $((10000 + i)) \
        --session-len 16384 \
        --chat-template "internvl2_5" \
        --log-level WARNING &>> ./logs/api_${model_name}_${timestamp}_${i}.out  &
    sleep 10s
done
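Once the proxy is up, the workers can be queried through an OpenAI-compatible chat-completions endpoint. The sketch below only constructs the request payload; the model name and prompt are placeholders, and actually sending it requires the running proxy (a minimal illustration, not an official client for this project):

```python
import json

def build_chat_payload(model_name, prompt, temperature=0.0, max_tokens=4000):
    """Build an OpenAI-style chat-completions payload for the proxy."""
    return {
        "model": model_name,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

payload = build_chat_payload("InternSVG", "Generate an SVG of a red circle.")
# POST this as JSON to http://0.0.0.0:10010/v1/chat/completions
print(json.dumps(payload, indent=2))
```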

Train

If you need to train your own model, please follow these steps:

  1. Prepare the Dataset: Download the SAgoge dataset. After that, update the paths for the SAgoge-related subdatasets in LLaMA-Factory/data/dataset_info.json to match your local file paths.

  2. Download InternVL3-8B: Download the InternVL3-8B base model (available on Hugging Face).

  3. Add Special Tokens: Before training, you must add SVG-specific tokens to the base model. Run the utils/add_token.py script, which adds these special tokens to the original model weights and initializes their embeddings based on subwords.

  4. Start Training: We provide example configuration scripts for the two-stage training process. You can find them at:

    • Stage 1: LLaMA-Factory/examples/train_full/stage_1.yaml
    • Stage 2: LLaMA-Factory/examples/train_full/stage_2.yaml

    Then use llamafactory-cli train to start training.
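The subword-based embedding initialization in step 3 can be sketched as follows. This is a simplified NumPy illustration of the idea (new token embedding = mean of its subword embeddings); the actual utils/add_token.py operates on the real model's tokenizer and embedding matrix:

```python
import numpy as np

def init_new_token_embedding(embeddings, subword_ids):
    """Initialize a new special token's embedding as the mean of the
    embeddings of the subwords it would otherwise be split into."""
    return embeddings[subword_ids].mean(axis=0)

# Toy vocabulary of 4 subwords with 3-dim embeddings.
emb = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0],
                [1.0, 1.0, 1.0]])

# Suppose a new SVG token would be tokenized into subwords 0 and 3.
new_row = init_new_token_embedding(emb, [0, 3])
emb = np.vstack([emb, new_row])  # append the new row to the embedding matrix
print(new_row)  # [1.  0.5 0.5]
```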

🧩 SAgoge Dataset

The SAgoge dataset is available on Hugging Face. To use SAgoge, download the dataset and extract media.tar.gz to access the image files. After extraction, you will get:

SAgoge/
β”œβ”€β”€ media/
β”‚   β”œβ”€β”€ stage1/
β”‚   β”‚   β”œβ”€β”€ chem/
β”‚   β”‚   └── icon/
β”‚   └── stage2/
β”‚       β”œβ”€β”€ animation/
β”‚       β”œβ”€β”€ chem/
β”‚       β”œβ”€β”€ icon/
β”‚       └── illustration/
β”œβ”€β”€ stage1/
β”‚   β”œβ”€β”€ chem/
β”‚   β”‚   β”œβ”€β”€ img2svg/
β”‚   β”‚   └── text2svg/
β”‚   └── icon/
β”‚       β”œβ”€β”€ edit/
β”‚       β”œβ”€β”€ generation/
β”‚       β”‚   β”œβ”€β”€ img2svg/
β”‚       β”‚   └── text2svg/
β”‚       └── understanding/
└── stage2/
    β”œβ”€β”€ animation/
    β”‚   β”œβ”€β”€ text2sani/
    β”‚   └── video2sani/
    β”œβ”€β”€ chem/
    β”‚   β”œβ”€β”€ img2svg/
    β”‚   └── text2svg/
    β”œβ”€β”€ icon/
    β”‚   β”œβ”€β”€ edit/
    β”‚   β”œβ”€β”€ generation/
    β”‚   β”‚   β”œβ”€β”€ img2svg/
    β”‚   β”‚   └── text2svg/
    β”‚   └── understanding/
    └── illustration/
        β”œβ”€β”€ img2svg/
        └── text2svg/
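Given the layout above, a small helper can enumerate every annotation subdirectory to iterate over. The directory names are taken directly from the tree; the helper itself is just an illustrative sketch:

```python
import os

# Task subdirectories per (stage, domain), mirroring the tree above.
SAGOGE_TASKS = {
    ("stage1", "chem"): ["img2svg", "text2svg"],
    ("stage1", "icon"): ["edit", "generation/img2svg",
                         "generation/text2svg", "understanding"],
    ("stage2", "animation"): ["text2sani", "video2sani"],
    ("stage2", "chem"): ["img2svg", "text2svg"],
    ("stage2", "icon"): ["edit", "generation/img2svg",
                         "generation/text2svg", "understanding"],
    ("stage2", "illustration"): ["img2svg", "text2svg"],
}

def sagoge_task_dirs(root):
    """Yield every annotation directory path in a SAgoge checkout."""
    for (stage, domain), tasks in SAGOGE_TASKS.items():
        for task in tasks:
            yield os.path.join(root, stage, domain, task)

dirs = list(sagoge_task_dirs("SAgoge"))
print(len(dirs))  # 16 task directories in total
```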

Statistics of SAgoge:

| Dataset      | #SVGs | #Samples | Avg. Tokens |
| ------------ | ----- | -------- | ----------- |
| Icon         | 2.8M  | 11M      | 846         |
| Illustration | 600K  | 1.6M     | 8673        |
| Animation    | 61K   | 122K     | 847         |
| Chemistry    | 1.7M  | 3.4M     | 1752        |

πŸ“Š SArena Benchmark

Download

The SArena benchmark is available on Hugging Face (InternSVG/SArena). You can download it directly with the huggingface_hub CLI:

hf download InternSVG/SArena SArena.zip --repo-type dataset --resume-download --local-dir PATH_TO_YOUR_DIR
unzip SArena.zip

After extraction, you will get:

SArena/
β”œβ”€β”€ animation/
β”‚   β”œβ”€β”€ overall/
β”‚   β”œβ”€β”€ svg/
β”‚   β”œβ”€β”€ video/
β”‚   β”œβ”€β”€ text2sani.jsonl
β”‚   └── video2sani.jsonl
β”‚
β”œβ”€β”€ chemistry/
β”‚   β”œβ”€β”€ images/
β”‚   β”œβ”€β”€ svg/
β”‚   β”œβ”€β”€ img2svg.jsonl
β”‚   └── text2svg.jsonl
β”‚
β”œβ”€β”€ illustration/
β”‚   β”œβ”€β”€ images/
β”‚   β”œβ”€β”€ svg/
β”‚   β”œβ”€β”€ caption.jsonl
β”‚   β”œβ”€β”€ img2svg.jsonl
β”‚   └── text2svg.jsonl
β”‚
β”œβ”€β”€ Icon/
β”‚   β”œβ”€β”€ edit/
β”‚   β”‚   └── data/
β”‚   β”‚       β”œβ”€β”€ color_complex.jsonl
β”‚   β”‚       β”œβ”€β”€ color_simple.jsonl
β”‚   β”‚       β”œβ”€β”€ crop.jsonl
β”‚   β”‚       β”œβ”€β”€ flip.jsonl
β”‚   β”‚       β”œβ”€β”€ opacity.jsonl
β”‚   β”‚       β”œβ”€β”€ outline.jsonl
β”‚   β”‚       β”œβ”€β”€ rotate.jsonl
β”‚   β”‚       β”œβ”€β”€ scale.jsonl
β”‚   β”‚       β”œβ”€β”€ styletransform_openmoji.jsonl
β”‚   β”‚       └── translate.jsonl
β”‚   β”‚
β”‚   β”œβ”€β”€ generation/
β”‚   β”‚   β”œβ”€β”€ images/
β”‚   β”‚   β”œβ”€β”€ svg/
β”‚   β”‚   β”œβ”€β”€ caption.jsonl
β”‚   β”‚   β”œβ”€β”€ img2svg.jsonl
β”‚   β”‚   └── text2svg.jsonl
β”‚   β”‚
β”‚   └── understanding/
β”‚       └── sarena_un.jsonl

Inference

Template scripts for inference can be found in the scripts/inference/ folder.

For example, for the icon/illustration/chemistry generation tasks, you can adapt the template below by specifying your own paths and API configuration.

#!/bin/bash
export PYTHONPATH=$(pwd):$PYTHONPATH

BASE_URL="BASE_URL"
API_KEY="API_KEY"
MODEL_NAME="MODEL_NAME"
TEXT2SVG_TEST_PATH="PATH_TO_TEXT2SVG_TEST_PATH"
IMG2SVG_TEST_PATH="PATH_TO_IMG2SVG_TEST_PATH"
OUTPUT_DIR="PATH_TO_OUTPUT_DIR"
RETRY=1
TEMPERATURE=0.0
MAX_TOKENS=4000
MAX_WORKERS=32

python metrics/inference/inference.py \
--base_url ${BASE_URL} \
--api_key ${API_KEY} \
--model_name ${MODEL_NAME} \
--text2svg_test_path ${TEXT2SVG_TEST_PATH} \
--img2svg_test_path ${IMG2SVG_TEST_PATH} \
--output_dir ${OUTPUT_DIR} \
--temperature ${TEMPERATURE} \
--max_tokens ${MAX_TOKENS} \
--max_workers ${MAX_WORKERS}

Then run:

bash scripts/inference/gen/demo.sh

For the SVG animation generation task specifically, a template inference script is provided at scripts/inference/animation/demo.sh.

Once all test samples have been processed, each SVG file must be converted into an MP4 video for metric evaluation. Use the script utils/svg_animate.py to generate the MP4 files at the two required resolutions: 448Γ—448 and 128Γ—128. Before running, modify the OUTPUT_DIRS and FILE_DIRS variables in the run_all_mp() function. (If an output path contains '_128', the script automatically renders at 128Γ—128.)
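The '_128' path convention mentioned above amounts to a simple switch. A minimal sketch of that logic (the real rendering lives in utils/svg_animate.py):

```python
def pick_resolution(output_path: str) -> tuple:
    """Return the render resolution implied by the output path:
    any path containing '_128' renders at 128x128, otherwise 448x448."""
    return (128, 128) if "_128" in output_path else (448, 448)

print(pick_resolution("evaluate/animation/gpt4o/text2sani/video_128/0001.mp4"))  # (128, 128)
print(pick_resolution("evaluate/animation/gpt4o/text2sani/video/0001.mp4"))      # (448, 448)
```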

The directory structure of the test files is as follows:

evaluate
β”œβ”€β”€ animation/gpt4o
β”‚   β”œβ”€β”€ text2sani
β”‚   β”‚   β”œβ”€β”€ svg/
β”‚   β”‚   β”œβ”€β”€ video/
β”‚   β”‚   β”œβ”€β”€ video_128/
β”‚   β”‚   └── output.jsonl
β”‚   └── video2sani
β”‚       β”œβ”€β”€ svg/
β”‚       β”œβ”€β”€ video/
β”‚       β”œβ”€β”€ video_128/
β”‚       └── output.jsonl

Evaluate

The scripts/evaluate/ directory contains template scripts for running evaluation across different domains (e.g., icon, illustration, chemistry, and animation).

Each subfolder corresponds to a specific domain:

scripts/evaluate/
β”œβ”€β”€ icon/
β”‚   β”œβ”€β”€ edit/
β”‚   β”œβ”€β”€ gen/
β”‚   └── un/
β”œβ”€β”€ illustration/
β”œβ”€β”€ chem/
└── animation/

Below is a demo for evaluating generation tasks (Text-to-SVG and Image-to-SVG):

#!/bin/bash
export PYTHONPATH=$(pwd):$PYTHONPATH

python evaluate_gen.py \
    --model_name "GPT-4o" \
    --text2svg_test_dir "PATH_TO_TEXT2SVG_RESULTS" \
    --img2svg_test_dir "PATH_TO_IMG2SVG_RESULTS" \
    --tokenizer_path "PATH_TO_TOKENIZER" \
    --test_file_path "PATH_TO_TEST_JSONL" \
    --gt_img_dir "PATH_TO_GT_IMAGES" \
    --gt_svg_dir "PATH_TO_GT_SVGS" \
    --caption_path "PATH_TO_CAPTIONS" \
    --bench_name "Icon"

If your model does not support either the Text-to-SVG or Image-to-SVG task, simply set the corresponding test directory argument (--text2svg_test_dir or --img2svg_test_dir) to an empty string.

Note on the SArena-Icon Score Reported by Intern-S1-Pro


We thank the Intern-S1-Pro team for treating SVG generation as an important capability of the model, and for using SArena-Icon as a benchmark for evaluation.

To better align with the common practice in general-purpose foundation model technical reports (i.e., reporting results on a 0–100 scale), we additionally report a single aggregated score for SArena-Icon, computed from the original SArena metrics.

We define the final score as:

final score = 0.3 * CLIP-I2I + 0.3 * (100 * DINO) + 0.2 * (100 * SSIM) + 0.2 * (100 * (1 - LPIPS))
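As a worked example, with hypothetical metric values CLIP-I2I = 85 (already on a 0–100 scale), DINO = 0.90, SSIM = 0.80, and LPIPS = 0.15, the aggregate works out as follows (the metric values are made up for illustration):

```python
def sarena_icon_score(clip_i2i, dino, ssim, lpips):
    """Aggregate SArena-Icon score on a 0-100 scale.
    clip_i2i is already 0-100; dino, ssim, and lpips are in [0, 1]."""
    return (0.3 * clip_i2i
            + 0.3 * (100 * dino)
            + 0.2 * (100 * ssim)
            + 0.2 * (100 * (1 - lpips)))

# 0.3*85 + 0.3*90 + 0.2*80 + 0.2*85 = 25.5 + 27 + 16 + 17
print(round(sarena_icon_score(85, 0.90, 0.80, 0.15), 2))  # 85.5
```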

πŸ“œ Acknowledgements

We would like to thank Kiyotaka, yinlikestudy, and quentin-77 for their valuable contributions to this project.

The InternSVG model is developed based on InternVL and further fine-tuned with LLaMA-Factory for SVG understanding, editing, and generation tasks.

We also acknowledge the broader open-source efforts that have advanced SVG understanding and generation.

License

InternSVG is licensed under the Apache License 2.0.

πŸ“– Citation

@article{wang2025internsvg,
  title={InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models},
  author={Wang, Haomin and Yin, Jinhui and Wei, Qi and Zeng, Wenguang and Gu, Lixin and Ye, Shenglong and Gao, Zhangwei and Wang, Yaohui and Zhang, Yanting and Li, Yuanqi and others},
  journal={arXiv preprint arXiv:2510.11341},
  year={2025}
}
