Welcome to AR-Omni! 👋 AR-Omni is a single-decoder, single-token-stream autoregressive any-to-any model that generates text, images, and speech without expert decoders. It uses task-aware loss reweighting, token-level perceptual alignment for image tokens, and a finite-state decoding machine to balance modality learning, improve visual fidelity, and trade off stability vs. creativity during inference.
> [!IMPORTANT]
> Pure autoregressive “Omni” without expert decoders. AR-Omni uses a single Transformer decoder to support autoregressive text and image generation, as well as real-time speech synthesis (demonstrated on the TTS task).
🧭 Unified any-to-any AR paradigm
A single token stream with next-token prediction and one decoder, natively handling text, images, and speech—while preserving the purity of autoregressive modeling.
⚖️ Modality imbalance mitigation
Task-aware loss reweighting to prevent training from being dominated by a subset of modalities or tasks.
🎛️ Stability–creativity trade-offs
A finite-state decoding machine that selects different decoding strategies for different sub-tasks during inference.
🗣️ Real-time speech synthesis
Efficient real-time speech synthesis, demonstrated on the TTS task.
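To make the single-token-stream idea concrete, here is a minimal, self-contained sketch of how text, image, and speech tokens could be interleaved into one sequence guarded by modality sentinel tokens. The sentinel names and ID values are invented for illustration and are not AR-Omni's actual vocabulary:

```python
# Illustrative only: sentinel names and ID ranges are invented for this
# sketch, not AR-Omni's actual vocabulary layout.
SENTINELS = {
    "<boi>": 100000, "<eoi>": 100001,                # image span delimiters
    "<bos_speech>": 100002, "<eos_speech>": 100003,  # speech span delimiters
}

def interleave(segments):
    """Flatten (modality, token_ids) segments into one AR token stream.

    Text tokens pass through unchanged; image and speech spans are wrapped
    in sentinel tokens so a single decoder can tell modalities apart.
    """
    stream = []
    for modality, ids in segments:
        if modality == "text":
            stream.extend(ids)
        elif modality == "image":
            stream.append(SENTINELS["<boi>"])
            stream.extend(ids)
            stream.append(SENTINELS["<eoi>"])
        elif modality == "speech":
            stream.append(SENTINELS["<bos_speech>"])
            stream.extend(ids)
            stream.append(SENTINELS["<eos_speech>"])
        else:
            raise ValueError(f"unknown modality: {modality}")
    return stream

stream = interleave([
    ("text", [17, 42]),        # e.g. a short prompt
    ("image", [5001, 5002]),   # discrete image codes
    ("speech", [9001]),        # WavTokenizer codes
])
# One sequence, one decoder, one next-token-prediction objective over all of it.
```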
- 2026.01 Initial release of AR-Omni v0.1.
- AR-Omni-Pretrain checkpoint
- AR-Omni-Chat checkpoint
- Instruction-tuning dataset
- Training code and recipes
- Paper link
- Streaming inference
- Public demo / gradio / space
General requirements. Besides the basic libraries, we vendor two editable libraries in this repo: accelerate and transformers.

```bash
python -m venv .venv
source .venv/bin/activate
pip install -U pip wheel setuptools
pip install -r requirements.txt
# Install the vendored libs
pip install -e ./transformers
pip install -e ./accelerate
```

WavTokenizer. This project requires WavTokenizer: both the checkpoint and the YAML config.
- https://huggingface.co/novateur/WavTokenizer/resolve/main/WavTokenizer_small_600_24k_4096.ckpt
- https://huggingface.co/novateur/WavTokenizer/resolve/main/wavtokenizer_smalldata_frame40_3s_nq1_code4096_dim512_kmeans200_attn.yaml
CosyVoice. Please configure the CosyVoice environment (PyTorch/CUDA/audio dependencies, model assets, etc.) by following the official guide:

```bash
git clone https://github.com/FunAudioLLM/CosyVoice.git
```

- AR-Omni-Pretrain-v0.1: https://huggingface.co/ModalityDance/AR-Omni-Pretrain-v0.1
- AR-Omni-Chat-v0.1: https://huggingface.co/ModalityDance/AR-Omni-Chat-v0.1
Below are four commands for the four core tasks.
TTS (text → speech):

```bash
python inference/inference_pretrain.py \
    --ckpt_path /path/to/AR-Omni-Pretrain-v0.1 \
    --tokenizer_path /path/to/AR-Omni-Pretrain-v0.1 \
    --out_dir ./outputs/tts \
    --device 0 \
    tts \
    --text "Good afternoon! How are you today?" \
    --instruction "Convert this text into speech." \
    --wavtokenizer_root /path/to/WavTokenizer \
    --wavtokenizer_config /path/to/wavtokenizer.yaml \
    --wavtokenizer_ckpt /path/to/wavtokenizer.ckpt \
    --max_gen_len 1024 \
    --out_name tts.wav
```

ASR (speech → text):

```bash
python inference/inference_pretrain.py \
    --ckpt_path /path/to/AR-Omni-Pretrain-v0.1 \
    --tokenizer_path /path/to/AR-Omni-Pretrain-v0.1 \
    --out_dir ./outputs/asr \
    --device 0 \
    asr \
    --audio_path inference/ref.wav \
    --wavtokenizer_root /path/to/WavTokenizer \
    --wavtokenizer_config /path/to/wavtokenizer.yaml \
    --wavtokenizer_ckpt /path/to/wavtokenizer.ckpt \
    --instruction "Can you please convert this speech into written text?" \
    --max_seq_len 256
```

Image captioning (image → text):

```bash
python inference/inference_pretrain.py \
    --ckpt_path /path/to/AR-Omni-Pretrain-v0.1 \
    --tokenizer_path /path/to/AR-Omni-Pretrain-v0.1 \
    --out_dir ./outputs/caption \
    --device 0 \
    caption \
    --image_path inference/demo_test.jpg \
    --instruction "Describe this image in detail." \
    --max_gen_len 256
```

Text-to-image (text → image):

```bash
python inference/inference_pretrain.py \
    --ckpt_path /path/to/AR-Omni-Pretrain-v0.1 \
    --tokenizer_path /path/to/AR-Omni-Pretrain-v0.1 \
    --out_dir ./outputs/t2i \
    --device 0 \
    t2i \
    --text "a bunch of ripe strawberries on a plate" \
    --temp 1.0 \
    --guidance_scale_image 1.32 \
    --out_name t2i_test.png
```

Note
In AR-Omni-v0.1, no real speech recordings were included in training, so we recommend testing with clean, clear speech. We provide speech2tokens.py, a CosyVoice-based script that turns text into speech-token inputs, for quick use and as a development reference. The next release, which is optimized for real-world speech scenarios, will be open-sourced as soon as possible.
inference_chat.py runs the dialog(s) described in a JSON/JSONL file and saves the decoded text, images, and speech it generates.
It supports:
- (Recommended for now) Text → (CosyVoice2 TTS) → input
- WAV → input
- Optional image(s) per turn via `image_paths`
Requires CosyVoice2.
If CosyVoice is not installed as a package, set `PYTHONPATH` to include its repo.
```bash
PYTHONPATH=/path/to/CosyVoice/third_party/Matcha-TTS:/path/to/CosyVoice${PYTHONPATH:+:$PYTHONPATH} \
python3 inference/inference_chat.py \
    --input ./infer_test.json \
    --output_dir ./test_results \
    --model_root /path/to/converted_model_root \
    --hf_tokenizer /path/to/converted_model_root \
    --cosyvoice_model_dir /path/to/CosyVoice2-0.5B \
    --wavtokenizer_cfg_path /path/to/wavtokenizer.yaml \
    --wavtokenizer_ckpt_path /path/to/wavtokenizer.ckpt \
    --save_audio --save_images
```

Common optional flags:
- `--txt_temp`, `--txt_top_p`: sampling settings for text
- `--img_temp`: sampling temperature for image tokens
- `--bandwidth_id`: WavTokenizer bandwidth id
You can pass:
- a single dialog: `{"dialog_id": "...", "turns": [ ... ]}`
- a list of dialogs: `[{"dialog_id": "...", "turns": [...]}, ...]`
- a list of turns: `[{turn}, {turn}, ...]`
Each turn must provide either:
- `text`: user text is converted to speech and then tokenized
- `wav_path`: directly tokenize a WAV file
Optional fields per turn:
- `image_paths`: list of image paths for this turn
- `user_append_text`: instruction appended after the vocal tokens
- `speaker_wav`: reference speaker WAV for CosyVoice2
- `prompt_text`: optional prompt text for speaker/style
- `silence_head_sec`, `silence_tail_sec`: silence padding in seconds
- `silence_head_tokens`, `silence_tail_tokens`: explicit silence token padding
- `reset`: reset conversation history before this turn
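For batch runs it can be convenient to build the input file programmatically. A minimal sketch using only the turn fields documented above (the helper name and the paths are placeholders, not part of the repo):

```python
import json

def make_turn(text=None, wav_path=None, image_paths=None,
              user_append_text=None, reset=False):
    """Assemble one turn dict, enforcing the documented rule that each
    turn provides exactly one of `text` or `wav_path`."""
    if (text is None) == (wav_path is None):
        raise ValueError("each turn needs exactly one of `text` or `wav_path`")
    turn = {"reset": reset}
    if text is not None:
        turn["text"] = text
    else:
        turn["wav_path"] = wav_path
    if image_paths:
        turn["image_paths"] = image_paths
    if user_append_text:
        turn["user_append_text"] = user_append_text
    return turn

dialog = {
    "dialog_id": "demo_0003",
    "speaker_wav": "./inference/ref.wav",
    "turns": [
        make_turn(
            text="Describe the image in detail.",
            image_paths=["inference/demo_test.jpg"],
            user_append_text="Please acknowledge the user's vocal input, "
                             "create a textual response.",
            reset=True,
        )
    ],
}

with open("infer_test.json", "w") as f:
    json.dump(dialog, f, indent=2)
```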
Create `infer_test.json`, e.g. a captioning turn:

```json
{
  "dialog_id": "demo_0001",
  "speaker_wav": "./inference/ref.wav",
  "turns": [
    {
      "text": "Describe the image in detail.",
      "image_paths": ["inference/demo_test.jpg"],
      "user_append_text": "Please acknowledge the user's vocal input, create a textual response.",
      "reset": true
    }
  ]
}
```

Or a text-to-image turn:

```json
{
  "dialog_id": "demo_0002",
  "speaker_wav": "./inference/ref.wav",
  "turns": [
    {
      "text": "Can you show me the sunset?",
      "user_append_text": "Please transcribe the user's vocal input, create a picture of it.",
      "reset": true
    }
  ]
}
```

Artifacts are saved under:
- `output_dir/<dialog_id>/turn_<index>_<uid>/decoded_text.txt`
- `output_dir/<dialog_id>/turn_<index>_<uid>/meta.json`
- optional images/speech if enabled:
  - `.../*_seg*_img*_.png`
  - `.../*_seg*_speech_*.wav`
- a global log: `output_dir/batch_log.json`
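Given that layout, a small helper to gather every decoded text from a finished run might look like this (a sketch, not part of the repo; it only assumes the directory structure listed above):

```python
from pathlib import Path

def collect_decoded_texts(output_dir):
    """Map each turn directory to its decoded text, following the
    output_dir/<dialog_id>/turn_<index>_<uid>/decoded_text.txt layout."""
    results = {}
    for txt in sorted(Path(output_dir).glob("*/turn_*/decoded_text.txt")):
        # Key like "demo_0001/turn_0_<uid>".
        key = f"{txt.parent.parent.name}/{txt.parent.name}"
        results[key] = txt.read_text()
    return results
```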
We provide two training stages: pre-training and instruction tuning.
AR-Omni is trained on tokenized multimodal sequences. In both stages, the multimodal content has already been converted into discrete tokens and can be fed directly to the autoregressive model.
- Pretrain data: built from public corpora at a large scale. Due to the dataset size and distributed sources, we do not host a packaged pretrain dataset in this repo. Please refer to the paper for the data recipe and obtain the open-source corpora accordingly.
- Instruction tuning data: our open-source release is provided as tokenized multimodal instruction data.
```bash
cd training/pretrain
deepspeed --num_gpus 8 pretrain.py \
    --model_path /path/to/base_model \
    --output_path /path/to/output_pretrain_ckpt \
    --dataset_dir /path/to/pretrain_jsonl_shards \
    --deepspeed_config ds_config.json \
    --learning_rate 1e-5 \
    --gradient_accumulation_steps 16 \
    --response_weighted_tasks "image_caption,speech_to_text" \
    --response_seg_weight 2.0 \
    --perception_weight 1.0
```

Common options:
- `--resume_from_checkpoint /path/to/ckpt`
- `--skip_shards N --skip_samples N`
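The reweighting flags above correspond to a weighted next-token-prediction loss. A framework-free sketch of the idea, where the mapping from `--response_weighted_tasks` / `--response_seg_weight` to per-token weights is our assumption for illustration:

```python
def weighted_ntp_loss(token_losses, is_response, task,
                      response_weighted_tasks=("image_caption", "speech_to_text"),
                      response_seg_weight=2.0):
    """Weighted average of per-token losses, upweighting response tokens
    for selected tasks (mirroring --response_weighted_tasks and
    --response_seg_weight). `token_losses` are plain floats here; in real
    training they would be per-token cross-entropy values."""
    weights = [
        response_seg_weight if (task in response_weighted_tasks and resp) else 1.0
        for resp in is_response
    ]
    total = sum(w * l for w, l in zip(weights, token_losses))
    return total / sum(weights)

# Prompt tokens keep weight 1.0; the response token gets 2.0 for a listed task.
loss = weighted_ntp_loss([1.0, 1.0, 4.0], [False, False, True], "image_caption")
```

Upweighting only the task-relevant response segment keeps short-response tasks (e.g. captioning, ASR) from being drowned out by modalities with much longer token budgets.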
```bash
cd training/instruction-tuning
deepspeed --num_gpus 8 sft.py \
    --data_path /path/to/AR-Omni-Instruct-v0.1.parquet \
    --model_path /path/to/pretrained_or_base_model \
    --output_path /path/to/output_sft_ckpt \
    --deepspeed_config /path/to/ds_config.json \
    --learning_rate 1e-5 \
    --gradient_accumulation_steps 8 \
    --sl_project YOUR_PROJECT \
    --sl_experiment YOUR_EXPERIMENT \
    --max_length 2048 \
    --segment_loss_weight 1.0 \
    --global_weight 1.0
```

Common options:
- `--resume_from_checkpoint /path/to/ckpt`
- `--sl_key YOUR_SWANLAB_KEY` (optional)
AR-Omni is a unified any-to-any model in the autoregressive paradigm without expert decoders.
- One decoder, one token stream, one objective
- Multimodal generation is formulated entirely as standard next-token prediction over an interleaved sequence.
Three practical issues in unified AR modeling, and our fixes:

1. Modality imbalance → task-aware loss reweighting. Unified AR training can be dominated by modalities with longer token budgets. We use a weighted NTP objective that upweights task-relevant response tokens, keeping optimization aligned with the intended outputs and preventing skewed learning.

2. Visual fidelity → lightweight token-level perceptual alignment loss for image tokens. Cross-entropy provides exact-match supervision but lacks geometric awareness of discrete visual codes. We add a small perceptual alignment loss that aligns hidden states with a frozen image embedding space, encouraging visually coherent structure even when token-level matches are imperfect.

3. Stability–creativity trade-offs → finite-state decoding with task-aware strategy switching. Different tasks prefer different decoding behaviors. We use a finite-state decoding machine that switches strategies within a single generation, using greedy decoding for deterministic sub-tasks and sampling for open-ended generation, avoiding a one-size-fits-all decoding rule.
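The finite-state idea can be sketched as a tiny controller that switches decoding strategy whenever a modality sentinel appears in the stream. The sentinel names, state set, and strategy table below are illustrative assumptions, not AR-Omni's actual configuration:

```python
import math
import random

# Hypothetical sentinel-to-state transitions and per-state strategies.
TRANSITIONS = {"<boi>": "image", "<eoi>": "text",
               "<bos_speech>": "speech", "<eos_speech>": "text"}
STRATEGY = {"text": "greedy", "image": "sample", "speech": "sample"}

class FSMDecoder:
    """Finite-state decoding: greedy for deterministic sub-tasks (text
    here), sampling for open-ended generation (image/speech here),
    switching on sentinel tokens within a single generation."""

    def __init__(self, seed=0):
        self.state = "text"
        self.rng = random.Random(seed)

    def step(self, logits):
        # `logits` maps candidate token -> unnormalized score.
        if STRATEGY[self.state] == "greedy":
            token = max(logits, key=logits.get)
        else:
            tokens = list(logits)
            probs = [math.exp(logits[t]) for t in tokens]  # softmax weights
            token = self.rng.choices(tokens, weights=probs)[0]
        # A sentinel moves the machine into the next modality's state.
        self.state = TRANSITIONS.get(token, self.state)
        return token
```

Because the strategy is looked up per state rather than fixed globally, one generation can pass through, say, a greedy text span, a sampled image span, and back, with no change to the model itself.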
```
.
├── README.md
├── LICENSE
├── requirements.txt
│
├── assets/
│   ├── LOGO.png
│   └── overview.png
│
├── training/
│   ├── pretrain/
│   │   ├── pretrain.py           # entry
│   │   ├── pretrain_trainer.py
│   │   ├── perception.py
│   │   └── ds_config.json
│   └── instruction-tuning/
│       ├── sft.py                # entry
│       ├── trainer.py
│       └── perception.py
│
├── inference/
│   ├── inference_pretrain.py     # entry
│   ├── inference_chat.py         # entry
│   ├── speech2tokens.py
│   ├── infer_test.json
│   ├── demo_test.jpg
│   └── ref.wav
│
├── accelerate/
└── transformers/
```
We thank the open-source projects and research community that made this work possible.
This project is licensed under the MIT License. It also complies with the licenses of referenced third-party projects and dependencies, including the Chameleon Research License. Please refer to the LICENSE file for more details.
If you use AR-Omni in your research or applications, please consider citing:
@misc{cheng2026aromniunifiedautoregressivemodel,
title={AR-Omni: A Unified Autoregressive Model for Any-to-Any Generation},
author={Dongjie Cheng and Ruifeng Yuan and Yongqi Li and Runyang You and Wenjie Wang and Liqiang Nie and Lei Zhang and Wenjie Li},
year={2026},
eprint={2601.17761},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2601.17761},
}