X-Codec-2.0

Paper: LLaSA: Scaling Train-time and Inference-time Compute for LLaMA-based Speech Synthesis

Update (2025-02-13): Added Llasa finetuning instructions.

Update (2025-02-07): Our paper has been released!

Available directly on Hugging Face

Codec: xcodec2 (use xcodec2==0.1.5 for codec inference and Llasa fine-tuning; I have removed unnecessary dependencies and it works in my testing, but other issues may still arise. If you prefer more stability, use xcodec2==0.1.3, which exactly matches my codec-training environment.)

Llasa 1B version: Llasa-1B

Llasa 1B Multilingual version: Llasa-1B-Multilingual (not mentioned in the paper)

Llasa 3B version: Llasa-3B

Llasa 8B version: Llasa-8B
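
For direct use from Hugging Face, a minimal reconstruction sketch is shown below. It is based on the xcodec2 package's published usage; the model id and the `encode_code`/`decode_code` method names are taken from the Hugging Face model card, so verify them there before relying on this.

```python
import torch
import soundfile as sf
from xcodec2.modeling_xcodec2 import XCodec2Model

# Model id as published on Hugging Face (check the model card to confirm).
model = XCodec2Model.from_pretrained("HKUSTAudio/xcodec2")
model.eval().cuda()

wav, sr = sf.read("test.wav")  # 16 kHz mono speech expected
wav = torch.from_numpy(wav).float().unsqueeze(0)  # shape (1, T)

with torch.no_grad():
    vq_code = model.encode_code(input_waveform=wav)  # discrete tokens, 50 per second
    recon = model.decode_code(vq_code).cpu()         # reconstructed waveform

sf.write("reconstructed.wav", recon[0, 0, :].numpy(), sr)
```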

Features

  • Single Vector Quantization

    • 65,536-entry codebook using Finite Scalar Quantization, achieving 99% codebook usage (comparable to text tokenizers; LLaMA 3's vocabulary is 128,256)
    • 50 x 1 tokens per second (a single token stream, no residual quantizers)
  • Multilingual Speech Semantic Support

    • Uses Wav2Vec2-BERT, a semantic encoder pre-trained on 4.5M hours of unlabeled audio data covering more than 143 languages.
    • Codec trained on 150k hours of multilingual speech data, including Emilia (En/Zh/De/Fr/Ja/Ko) and MLS (En/Fr/De/Nl/Es/It/Pt/Pl).
  • High-Quality Speech Reconstruction

    • Transformer + Vocos Decoder
    • BigCodec encoder
    • Spectrogram discriminator with FFT sizes {78, 126, 206, 334, 542, 876, 1418, 2296}, tailored for the Transformer decoder
    • Achieves UTMOS 4.13, WER 2.47 (hubert-large-ls960-ft), speaker similarity 0.82 (wavlm_large_finetune), STOI 0.92, PESQ-NB 3.05, and PESQ-WB 2.44 on LibriSpeech test-clean reconstruction (ground truth: WER 1.96, UTMOS 4.09)
    • 16 kHz speech only
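
The headline numbers above can be sanity-checked with a little arithmetic. In the sketch below, the FSQ level factorization is an illustrative assumption (any factorization whose product is 65,536 would give the stated codebook size), not necessarily the one used in X-Codec-2.0.

```python
from math import prod

# Hypothetical FSQ factorization: 8 dimensions with 4 levels each.
fsq_levels = [4] * 8
codebook_size = prod(fsq_levels)
print(codebook_size)  # 65536

# At 16 kHz and 50 tokens per second, each token covers 320 samples (20 ms).
sample_rate = 16_000
tokens_per_second = 50
samples_per_token = sample_rate // tokens_per_second
print(samples_per_token)  # 320
```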

Commandline Usage

Setup

The code is tested on Python 3.9.

Follow these steps to set up your environment:

  1. Clone this repo
  2. conda create --name xcodec2 python=3.9
  3. conda activate xcodec2
  4. pip install -r requirements.txt
  5. Download the pretrained checkpoint here

Inference

python inference.py  

Train

To train X-Codec-2.0, first prepare your data:

  1. Make a file list:
python get_tsv.py
  2. Train X-Codec-2.0 with the default settings:
python train.py log_dir=/path/to/log_dir

For large-scale training, batch inference, and large-scale code extraction:

Batch inference

python inference_save_code.py

Training

sbatch train_slurm.sh

Code extraction

sbatch large_scale_save_code.sh

Codes are saved to the output folder, mirroring the subfolder structure of the audio files.
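
The path mapping can be sketched as follows. This is an illustrative sketch, not the repo's actual code; the function name and the .npy extension are assumptions.

```python
from pathlib import Path

def code_output_path(audio_path: Path, audio_root: Path, out_root: Path) -> Path:
    """Map audio_root/sub/dirs/file.wav -> out_root/sub/dirs/file.npy,
    preserving the subfolder structure (hypothetical extension)."""
    rel = audio_path.relative_to(audio_root)
    return (out_root / rel).with_suffix(".npy")

p = code_output_path(Path("data/audio/spk1/a.wav"),
                     Path("data/audio"),
                     Path("out/codes"))
print(p.as_posix())  # out/codes/spk1/a.npy
```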

Acknowledgement

I would like to extend a special thanks to the authors of BigCodec, since our codebase is largely borrowed from theirs.
