Welcome to AR-Omni! 👋 AR-Omni is a single-decoder, single-token-stream autoregressive any-to-any model that generates text, images, and speech without expert decoders. It uses task-aware loss reweighting, token-level perceptual alignment for image tokens, and a finite-state decoding machine to balance modality learning, improve visual fidelity, and trade off stability vs. creativity during inference.
> [!IMPORTANT]
> Pure autoregressive “Omni” without expert decoders. AR-Omni uses a single Transformer decoder to support autoregressive text and image generation, as well as real-time speech synthesis (demonstrated on the TTS task).
🧭 Unified any-to-any AR paradigm
A single token stream with next-token prediction and one decoder, natively handling text, images, and speech—while preserving the purity of autoregressive modeling.
⚖️ Modality imbalance mitigation
Task-aware loss reweighting to prevent training from being dominated by a subset of modalities or tasks.
🎛️ Stability–creativity trade-offs
A finite-state decoding machine that selects different decoding strategies for different sub-tasks during inference.
🗣️ Real-time speech synthesis
Efficient real-time speech synthesis, demonstrated on the TTS task.
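To make the single-token-stream idea concrete, here is a minimal, self-contained sketch of how text, image, and speech tokens could be interleaved into one sequence guarded by modality sentinel tokens. The sentinel names and ID values are invented for illustration and are not AR-Omni's actual vocabulary:

```python
# Illustrative only: sentinel names and ID ranges are invented for this
# sketch, not AR-Omni's actual vocabulary layout.
SENTINELS = {
    "<boi>": 100000, "<eoi>": 100001,                # image span delimiters
    "<bos_speech>": 100002, "<eos_speech>": 100003,  # speech span delimiters
}

def interleave(segments):
    """Flatten (modality, token_ids) segments into one AR token stream.

    Text tokens pass through unchanged; image and speech spans are wrapped
    in sentinel tokens so a single decoder can tell modalities apart.
    """
    stream = []
    for modality, ids in segments:
        if modality == "text":
            stream.extend(ids)
        elif modality == "image":
            stream.append(SENTINELS["<boi>"])
            stream.extend(ids)
            stream.append(SENTINELS["<eoi>"])
        elif modality == "speech":
            stream.append(SENTINELS["<bos_speech>"])
            stream.extend(ids)
            stream.append(SENTINELS["<eos_speech>"])
        else:
            raise ValueError(f"unknown modality: {modality}")
    return stream

stream = interleave([
    ("text", [17, 42]),        # e.g. a short prompt
    ("image", [5001, 5002]),   # discrete image codes
    ("speech", [9001]),        # WavTokenizer codes
])
# One sequence, one decoder, one next-token-prediction objective over all of it.
```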
- 2026.01 Initial release of AR-Omni v0.1.
- AR-Omni-Pretrain checkpoint
- AR-Omni-Chat checkpoint
- Instruction-tuning dataset
- Training code and recipes
- Paper link
- Streaming inference
- Public demo / gradio / space
General requirements. Besides the basic libraries, we vendor two editable libraries in this repo: accelerate and transformers.

```bash
python -m venv .venv
source .venv/bin/activate
pip install -U pip wheel setuptools
pip install -r requirements.txt
# Install the vendored libs
pip install -e ./transformers
pip install -e ./accelerate
```

WavTokenizer. This project requires WavTokenizer: both the checkpoint and the YAML config.
- https://huggingface.co/novateur/WavTokenizer/resolve/main/WavTokenizer_small_600_24k_4096.ckpt
- https://huggingface.co/novateur/WavTokenizer/resolve/main/wavtokenizer_smalldata_frame40_3s_nq1_code4096_dim512_kmeans200_attn.yaml
CosyVoice. Please configure the CosyVoice environment (PyTorch/CUDA/audio dependencies, model assets, etc.) by following the official guide:

```bash
git clone https://github.com/FunAudioLLM/CosyVoice.git
```

- AR-Omni-Pretrain-v0.1: https://huggingface.co/ModalityDance/AR-Omni-Pretrain-v0.1
- AR-Omni-Chat-v0.1: https://huggingface.co/ModalityDance/AR-Omni-Chat-v0.1
Below are four commands for the four core tasks.
TTS (text → speech):

```bash
python inference/inference_pretrain.py \
    --ckpt_path /path/to/AR-Omni-Pretrain-v0.1 \
    --tokenizer_path /path/to/AR-Omni-Pretrain-v0.1 \
    --out_dir ./outputs/tts \
    --device 0 \
    tts \
    --text "Good afternoon! How are you today?" \
    --instruction "Convert this text into speech." \
    --wavtokenizer_root /path/to/WavTokenizer \
    --wavtokenizer_config /path/to/wavtokenizer.yaml \
    --wavtokenizer_ckpt /path/to/wavtokenizer.ckpt \
    --max_gen_len 1024 \
    --out_name tts.wav
```

ASR (speech → text):

```bash
python inference/inference_pretrain.py \
    --ckpt_path /path/to/AR-Omni-Pretrain-v0.1 \
    --tokenizer_path /path/to/AR-Omni-Pretrain-v0.1 \
    --out_dir ./outputs/asr \
    --device 0 \
    asr \
    --audio_path inference/ref.wav \
    --wavtokenizer_root /path/to/WavTokenizer \
    --wavtokenizer_config /path/to/wavtokenizer.yaml \
    --wavtokenizer_ckpt /path/to/wavtokenizer.ckpt \
    --instruction "Can you please convert this speech into written text?" \
    --max_seq_len 256
```

Image captioning (image → text):

```bash
python inference/inference_pretrain.py \
    --ckpt_path /path/to/AR-Omni-Pretrain-v0.1 \
    --tokenizer_path /path/to/AR-Omni-Pretrain-v0.1 \
    --out_dir ./outputs/caption \
    --device 0 \
    caption \
    --image_path inference/demo_test.jpg \
    --instruction "Describe this image in detail." \
    --max_gen_len 256
```

Text-to-image (text → image):

```bash
python inference/inference_pretrain.py \
    --ckpt_path /path/to/AR-Omni-Pretrain-v0.1 \
    --tokenizer_path /path/to/AR-Omni-Pretrain-v0.1 \
    --out_dir ./outputs/t2i \
    --device 0 \
    t2i \
    --text "a bunch of ripe strawberries on a plate" \
    --temp 1.0 \
    --guidance_scale_image 1.32 \
    --out_name t2i_test.png
```

Note
In AR-Omni-v0.1, no real speech recordings were included in training, so we recommend testing with clean, clear speech. We provide speech2tokens.py, a CosyVoice-based script that turns text into speech-token inputs, for quick use and as a development reference. The next release, which is optimized for real-world speech scenarios, will be open-sourced as soon as possible.
inference_chat.py runs the dialog(s) described in a JSON/JSONL file and saves the decoded text, images, and speech it generates.
It supports:
- (Recommended for now) Text → (CosyVoice2 TTS) → input
- WAV → input
- Optional image(s) per turn via `image_paths`
Requires CosyVoice2.
If CosyVoice is not installed as a package, set `PYTHONPATH` to include its repo.
```bash
PYTHONPATH=/path/to/CosyVoice/third_party/Matcha-TTS:/path/to/CosyVoice${PYTHONPATH:+:$PYTHONPATH} \
python3 inference/inference_chat.py \
    --input ./infer_test.json \
    --output_dir ./test_results \
    --model_root /path/to/converted_model_root \
    --hf_tokenizer /path/to/converted_model_root \
    --cosyvoice_model_dir /path/to/CosyVoice2-0.5B \
    --wavtokenizer_cfg_path /path/to/wavtokenizer.yaml \
    --wavtokenizer_ckpt_path /path/to/wavtokenizer.ckpt \
    --save_audio --save_images
```

Common optional flags:
- `--txt_temp`, `--txt_top_p`: sampling settings for text
- `--img_temp`: sampling temperature for image tokens
- `--bandwidth_id`: WavTokenizer bandwidth id
You can pass:
- a single dialog: `{"dialog_id": "...", "turns": [ ... ]}`
- a list of dialogs: `[{"dialog_id": "...", "turns": [...]}, ...]`
- a list of turns: `[{turn}, {turn}, ...]`
Each turn must provide either:
- `text`: user text is converted to speech and then tokenized
- `wav_path`: directly tokenize a WAV file
Optional fields per turn:
- `image_paths`: list of image paths for this turn
- `user_append_text`: instruction appended after the vocal tokens
- `speaker_wav`: reference speaker WAV for CosyVoice2
- `prompt_text`: optional prompt text for speaker/style
- `silence_head_sec`, `silence_tail_sec`: silence padding in seconds
- `silence_head_tokens`, `silence_tail_tokens`: explicit silence token padding
- `reset`: reset conversation history before this turn
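For batch runs it can be convenient to build the input file programmatically. A minimal sketch using only the turn fields documented above (the helper name and the paths are placeholders, not part of the repo):

```python
import json

def make_turn(text=None, wav_path=None, image_paths=None,
              user_append_text=None, reset=False):
    """Assemble one turn dict, enforcing the documented rule that each
    turn provides exactly one of `text` or `wav_path`."""
    if (text is None) == (wav_path is None):
        raise ValueError("each turn needs exactly one of `text` or `wav_path`")
    turn = {"reset": reset}
    if text is not None:
        turn["text"] = text
    else:
        turn["wav_path"] = wav_path
    if image_paths:
        turn["image_paths"] = image_paths
    if user_append_text:
        turn["user_append_text"] = user_append_text
    return turn

dialog = {
    "dialog_id": "demo_0003",
    "speaker_wav": "./inference/ref.wav",
    "turns": [
        make_turn(
            text="Describe the image in detail.",
            image_paths=["inference/demo_test.jpg"],
            user_append_text="Please acknowledge the user's vocal input, "
                             "create a textual response.",
            reset=True,
        )
    ],
}

with open("infer_test.json", "w") as f:
    json.dump(dialog, f, indent=2)
```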
Create `infer_test.json`, e.g. a captioning turn:

```json
{
  "dialog_id": "demo_0001",
  "speaker_wav": "./inference/ref.wav",
  "turns": [
    {
      "text": "Describe the image in detail.",
      "image_paths": ["inference/demo_test.jpg"],
      "user_append_text": "Please acknowledge the user's vocal input, create a textual response.",
      "reset": true
    }
  ]
}
```

Or a text-to-image turn:

```json
{
  "dialog_id": "demo_0002",
  "speaker_wav": "./inference/ref.wav",
  "turns": [
    {
      "text": "Can you show me the sunset?",
      "user_append_text": "Please transcribe the user's vocal input, create a picture of it.",
      "reset": true
    }
  ]
}
```

Artifacts are saved under:
- `output_dir/<dialog_id>/turn_<index>_<uid>/decoded_text.txt`
- `output_dir/<dialog_id>/turn_<index>_<uid>/meta.json`
- optional images/speech if enabled:
  - `.../*_seg*_img*_.png`
  - `.../*_seg*_speech_*.wav`
- a global log: `output_dir/batch_log.json`
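Given that layout, a small helper to gather every decoded text from a finished run might look like this (a sketch, not part of the repo; it only assumes the directory structure listed above):

```python
from pathlib import Path

def collect_decoded_texts(output_dir):
    """Map each turn directory to its decoded text, following the
    output_dir/<dialog_id>/turn_<index>_<uid>/decoded_text.txt layout."""
    results = {}
    for txt in sorted(Path(output_dir).glob("*/turn_*/decoded_text.txt")):
        # Key like "demo_0001/turn_0_<uid>".
        key = f"{txt.parent.parent.name}/{txt.parent.name}"
        results[key] = txt.read_text()
    return results
```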
We provide two training stages: pre-training and instruction tuning.
AR-Omni is trained on tokenized multimodal sequences. In both stages, the multimodal content has already been converted into discrete tokens and can be fed directly to the autoregressive model.
- Pretrain data: built from public corpora at a large scale. Due to the dataset size and distributed sources, we do not host a packaged pretrain dataset in this repo. Please refer to the paper for the data recipe and obtain the open-source corpora accordingly.
- Instruction tuning data: our open-source release is provided as tokenized multimodal instruction data.
```bash
cd training/pretrain
deepspeed --num_gpus 8 pretrain.py \
    --model_path /path/to/base_model \
    --output_path /path/to/output_pretrain_ckpt \
    --dataset_dir /path/to/pretrain_jsonl_shards \
    --deepspeed_config ds_config.json \
    --learning_rate 1e-5 \
    --gradient_accumulation_steps 16 \
    --response_weighted_tasks "image_caption,speech_to_text" \
    --response_seg_weight 2.0 \
    --perception_weight 1.0
```

Common options:
- `--resume_from_checkpoint /path/to/ckpt`
- `--skip_shards N --skip_samples N`
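The reweighting flags above correspond to a weighted next-token-prediction loss. A framework-free sketch of the idea, where the mapping from `--response_weighted_tasks` / `--response_seg_weight` to per-token weights is our assumption for illustration:

```python
def weighted_ntp_loss(token_losses, is_response, task,
                      response_weighted_tasks=("image_caption", "speech_to_text"),
                      response_seg_weight=2.0):
    """Weighted average of per-token losses, upweighting response tokens
    for selected tasks (mirroring --response_weighted_tasks and
    --response_seg_weight). `token_losses` are plain floats here; in real
    training they would be per-token cross-entropy values."""
    weights = [
        response_seg_weight if (task in response_weighted_tasks and resp) else 1.0
        for resp in is_response
    ]
    total = sum(w * l for w, l in zip(weights, token_losses))
    return total / sum(weights)

# Prompt tokens keep weight 1.0; the response token gets 2.0 for a listed task.
loss = weighted_ntp_loss([1.0, 1.0, 4.0], [False, False, True], "image_caption")
```

Upweighting only the task-relevant response segment keeps short-response tasks (e.g. captioning, ASR) from being drowned out by modalities with much longer token budgets.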
```bash
cd training/instruction-tuning
deepspeed --num_gpus 8 sft.py \
    --data_path /path/to/AR-Omni-Instruct-v0.1.parquet \
    --model_path /path/to/pretrained_or_base_model \
    --output_path /path/to/output_sft_ckpt \
    --deepspeed_config /path/to/ds_config.json \
    --learning_rate 1e-5 \
    --gradient_accumulation_steps 8 \
    --sl_project YOUR_PROJECT \
    --sl_experiment YOUR_EXPERIMENT \
    --max_length 2048 \
    --segment_loss_weight 1.0 \
    --global_weight 1.0
```

Common options:
- `--resume_from_checkpoint /path/to/ckpt`
- `--sl_key YOUR_SWANLAB_KEY` (optional)
AR-Omni is a unified any-to-any model in the autoregressive paradigm without expert decoders.
- One decoder, one token stream, one objective
- Multimodal generation is formulated entirely as standard next-token prediction over an interleaved sequence.
Three practical issues in unified AR modeling, and our fixes:

1. Modality imbalance → task-aware loss reweighting. Unified AR training can be dominated by modalities with longer token budgets. We use a weighted NTP objective that upweights task-relevant response tokens, keeping optimization aligned with the intended outputs and preventing skewed learning.

2. Visual fidelity → lightweight token-level perceptual alignment loss for image tokens. Cross-entropy provides exact-match supervision but lacks geometric awareness of discrete visual codes. We add a small perceptual alignment loss that aligns hidden states with a frozen image embedding space, encouraging visually coherent structure even when token-level matches are imperfect.

3. Stability–creativity trade-offs → finite-state decoding with task-aware strategy switching. Different tasks prefer different decoding behaviors. We use a finite-state decoding machine that switches strategies within a single generation, using greedy decoding for deterministic sub-tasks and sampling for open-ended generation, avoiding a one-size-fits-all decoding rule.
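The finite-state idea can be sketched as a tiny controller that switches decoding strategy whenever a modality sentinel appears in the stream. The sentinel names, state set, and strategy table below are illustrative assumptions, not AR-Omni's actual configuration:

```python
import math
import random

# Hypothetical sentinel-to-state transitions and per-state strategies.
TRANSITIONS = {"<boi>": "image", "<eoi>": "text",
               "<bos_speech>": "speech", "<eos_speech>": "text"}
STRATEGY = {"text": "greedy", "image": "sample", "speech": "sample"}

class FSMDecoder:
    """Finite-state decoding: greedy for deterministic sub-tasks (text
    here), sampling for open-ended generation (image/speech here),
    switching on sentinel tokens within a single generation."""

    def __init__(self, seed=0):
        self.state = "text"
        self.rng = random.Random(seed)

    def step(self, logits):
        # `logits` maps candidate token -> unnormalized score.
        if STRATEGY[self.state] == "greedy":
            token = max(logits, key=logits.get)
        else:
            tokens = list(logits)
            probs = [math.exp(logits[t]) for t in tokens]  # softmax weights
            token = self.rng.choices(tokens, weights=probs)[0]
        # A sentinel moves the machine into the next modality's state.
        self.state = TRANSITIONS.get(token, self.state)
        return token
```

Because the strategy is looked up per state rather than fixed globally, one generation can pass through, say, a greedy text span, a sampled image span, and back, with no change to the model itself.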
```
.
├── README.md
├── LICENSE
├── requirements.txt
│
├── assets/
│   ├── LOGO.png
│   └── overview.png
│
├── training/
│   ├── pretrain/
│   │   ├── pretrain.py           # entry
│   │   ├── pretrain_trainer.py
│   │   ├── perception.py
│   │   └── ds_config.json
│   └── instruction-tuning/
│       ├── sft.py                # entry
│       ├── trainer.py
│       └── perception.py
│
├── inference/
│   ├── inference_pretrain.py     # entry
│   ├── inference_chat.py         # entry
│   ├── speech2tokens.py
│   ├── infer_test.json
│   ├── demo_test.jpg
│   └── ref.wav
│
├── accelerate/
└── transformers/
```
We thank the open-source projects and research community that made this work possible.
This project is licensed under the MIT License. It also complies with the licenses of referenced third-party projects and dependencies, including the Chameleon Research License. Please refer to the LICENSE file for more details.
If you use AR-Omni in your research or applications, please consider citing:
@misc{cheng2026aromniunifiedautoregressivemodel,
title={AR-Omni: A Unified Autoregressive Model for Any-to-Any Generation},
author={Dongjie Cheng and Ruifeng Yuan and Yongqi Li and Runyang You and Wenjie Wang and Liqiang Nie and Lei Zhang and Wenjie Li},
year={2026},
eprint={2601.17761},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2601.17761},
}