X-Codec-2.0

Paper: LLaSA: Scaling Train-time and Inference-time Compute for LLaMA-based Speech Synthesis

Update (2025-02-13): Added Llasa finetuning instructions.

Update (2025-02-07): Our paper has been released!

Available directly on Hugging Face

Codec: xcodec2 (use xcodec2==0.1.5 for codec inference and Llasa fine-tuning; I have removed unnecessary dependencies and it works in my testing, but other issues may still arise. If you prefer more stability, use xcodec2==0.1.3, which exactly matches my codec-training environment.)

Llasa 1B version: Llasa-1B

Llasa 1B Multilingual version: Llasa-1B-Multilingual (not mentioned in the paper)

Llasa 3B version: Llasa-3B

Llasa 8B version: Llasa-8B
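
For direct use from Hugging Face, a minimal reconstruction sketch is shown below. It is based on the xcodec2 package's published usage; the model id and the `encode_code`/`decode_code` method names are taken from the Hugging Face model card, so verify them there before relying on this.

```python
import torch
import soundfile as sf
from xcodec2.modeling_xcodec2 import XCodec2Model

# Model id as published on Hugging Face (check the model card to confirm).
model = XCodec2Model.from_pretrained("HKUSTAudio/xcodec2")
model.eval().cuda()

wav, sr = sf.read("test.wav")  # 16 kHz mono speech expected
wav = torch.from_numpy(wav).float().unsqueeze(0)  # shape (1, T)

with torch.no_grad():
    vq_code = model.encode_code(input_waveform=wav)  # discrete tokens, 50 per second
    recon = model.decode_code(vq_code).cpu()         # reconstructed waveform

sf.write("reconstructed.wav", recon[0, 0, :].numpy(), sr)
```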

Features

  • Single Vector Quantization

    • 65,536-entry codebook using Finite Scalar Quantization, achieving 99% codebook usage (comparable to text tokenizers; LLaMA 3's vocabulary is 128,256)
    • 50 x 1 tokens per second (a single token stream, no residual quantizers)
  • Multilingual Speech Semantic Support

    • Uses Wav2Vec2-BERT, a semantic encoder pre-trained on 4.5M hours of unlabeled audio data covering more than 143 languages.
    • Codec trained on 150k hours of multilingual speech data, including Emilia (En/Zh/De/Fr/Ja/Ko) and MLS (En/Fr/De/Nl/Es/It/Pt/Pl).
  • High-Quality Speech Reconstruction

    • Transformer + Vocos Decoder
    • BigCodec encoder
    • Spectrogram discriminator with FFT sizes {78, 126, 206, 334, 542, 876, 1418, 2296}, tailored for the Transformer decoder
    • Achieves UTMOS 4.13, WER 2.47 (hubert-large-ls960-ft), speaker similarity 0.82 (wavlm_large_finetune), STOI 0.92, PESQ-NB 3.05, and PESQ-WB 2.44 on LibriSpeech test-clean reconstruction (ground truth: WER 1.96, UTMOS 4.09)
    • 16 kHz speech only
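
The headline numbers above can be sanity-checked with a little arithmetic. In the sketch below, the FSQ level factorization is an illustrative assumption (any factorization whose product is 65,536 would give the stated codebook size), not necessarily the one used in X-Codec-2.0.

```python
from math import prod

# Hypothetical FSQ factorization: 8 dimensions with 4 levels each.
fsq_levels = [4] * 8
codebook_size = prod(fsq_levels)
print(codebook_size)  # 65536

# At 16 kHz and 50 tokens per second, each token covers 320 samples (20 ms).
sample_rate = 16_000
tokens_per_second = 50
samples_per_token = sample_rate // tokens_per_second
print(samples_per_token)  # 320
```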

Commandline Usage

Setup

The code is tested on Python 3.9.

Follow these steps to set up your environment:

  1. Clone this repo
  2. conda create --name xcodec2 python=3.9
  3. conda activate xcodec2
  4. pip install -r requirements.txt
  5. Download the pretrained checkpoint here

Inference

python inference.py  

Train

To train X-Codec-2.0, first prepare your data:

  1. Make a file list:
python get_tsv.py
  2. Train X-Codec-2.0 with the default settings:
python train.py log_dir=/path/to/log_dir

For large-scale training, batch inference, and large-scale code extraction:

Batch inference

python inference_save_code.py

Training

sbatch train_slurm.sh

Code extraction

sbatch large_scale_save_code.sh

Codes are saved to the output folder, mirroring the subfolder structure of the audio files.
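
The path mapping can be sketched as follows. This is an illustrative sketch, not the repo's actual code; the function name and the .npy extension are assumptions.

```python
from pathlib import Path

def code_output_path(audio_path: Path, audio_root: Path, out_root: Path) -> Path:
    """Map audio_root/sub/dirs/file.wav -> out_root/sub/dirs/file.npy,
    preserving the subfolder structure (hypothetical extension)."""
    rel = audio_path.relative_to(audio_root)
    return (out_root / rel).with_suffix(".npy")

p = code_output_path(Path("data/audio/spk1/a.wav"),
                     Path("data/audio"),
                     Path("out/codes"))
print(p.as_posix())  # out/codes/spk1/a.npy
```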

Acknowledgement

I would like to extend a special thanks to the authors of BigCodec, since our codebase is largely borrowed from theirs.
