# CosyVoice Kaggle Notebook

This notebook runs CosyVoice (Fun-CosyVoice3-0.5B) on Kaggle with a GPU accelerator enabled.

**Features:**
- Text-to-Speech with zero-shot voice cloning
- Cross-lingual voice synthesis
- Instruct-based control
- Fine-grained control (laughter, breath, etc.)

**Repository:** https://github.com/infinite-gaming-studio/CosyVoice
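
Since the notebook expects a GPU session, it may be worth confirming that one is visible before installing anything. The following is a minimal sketch that only checks whether the `nvidia-smi` tool is present and exits cleanly; the helper name `gpu_visible` is hypothetical and not part of CosyVoice.

```python
import shutil
import subprocess

def gpu_visible():
    """Return True if nvidia-smi is on PATH and exits successfully."""
    exe = shutil.which('nvidia-smi')
    if exe is None:
        return False
    try:
        # `nvidia-smi -L` lists the GPUs the driver can see.
        result = subprocess.run([exe, '-L'], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0

print("GPU visible:", gpu_visible())
```

If this prints `False`, the accelerator setting of the Kaggle session is the first thing to check.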

## 1. Clone Repository and Setup

In [None]:
# Clone the repository with submodules
!git clone --recursive https://github.com/infinite-gaming-studio/CosyVoice.git
%cd CosyVoice

In [None]:
# Update submodules (in case of network issues during clone)
!git submodule update --init --recursive
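
A quick way to confirm the submodule checkout worked is to check that the Matcha-TTS directory is populated. This sketch assumes the path `third_party/Matcha-TTS` used later in this notebook; `submodule_ready` is a hypothetical helper name.

```python
from pathlib import Path

def submodule_ready(path):
    """Return True if `path` is a directory containing at least one entry."""
    p = Path(path)
    return p.is_dir() and any(p.iterdir())

# Path taken from the sys.path line used later in this notebook.
matcha_dir = 'third_party/Matcha-TTS'
if submodule_ready(matcha_dir):
    print(f"{matcha_dir} looks populated.")
else:
    print(f"{matcha_dir} is missing or empty; re-run the submodule update cell.")
```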

## 2. Install Dependencies

In [None]:
# Install system dependencies for sox
!apt-get update -qq
!apt-get install -y -qq sox libsox-dev

In [None]:
# Install Python dependencies
!pip install -q -r requirements.txt --no-deps 2>&1 | tail -20
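
Because the install cell uses `--no-deps` and shows only the last 20 lines of output, a failed or missing package can be easy to overlook. The sketch below checks that the modules this notebook imports directly can be located; `missing_modules` is a hypothetical helper, and the list is deliberately limited to names that appear in this notebook.

```python
import importlib.util

def missing_modules(names):
    """Return the subset of `names` that cannot be located as importable modules."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Only the top-level modules this notebook imports directly; extend as needed.
needed = ['modelscope', 'torchaudio']
missing = missing_modules(needed)
print("Missing modules:", missing if missing else "none")
```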

## 3. Download Pre-trained Models

In [None]:
# Download Fun-CosyVoice3-0.5B model (recommended)
from modelscope import snapshot_download
import os

# Create models directory
os.makedirs('pretrained_models', exist_ok=True)

# Download Fun-CosyVoice3-0.5B model
print("Downloading Fun-CosyVoice3-0.5B model...")
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')

print("Model downloaded successfully!")
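
To sanity-check the download, one option is to count the files and total size under the model directory. This is a sketch under the assumption that an empty or very small directory indicates an interrupted download; `dir_summary` is a hypothetical helper and no expected file list is assumed.

```python
import os

def dir_summary(root):
    """Return (file_count, total_bytes) for all regular files under `root`."""
    count, total = 0, 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                count += 1
                total += os.path.getsize(path)
    return count, total

# Same local_dir as in the download cell above.
model_dir = 'pretrained_models/Fun-CosyVoice3-0.5B'
n_files, n_bytes = dir_summary(model_dir)
print(f"{model_dir}: {n_files} files, {n_bytes / 1e9:.2f} GB")
```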

## 4. Test Basic Inference

In [None]:
# Load the model (the Matcha-TTS submodule must be on the import path first)
import sys
sys.path.append('third_party/Matcha-TTS')

from cosyvoice.cli.cosyvoice import AutoModel
import torchaudio

# Initialize model
print("Loading model...")
cosyvoice = AutoModel(model_dir='pretrained_models/Fun-CosyVoice3-0.5B')
print("Model loaded!")

In [None]:
# Test zero-shot voice cloning
import os
os.makedirs('outputs', exist_ok=True)

# Test text
test_text = '你好，欢迎使用CosyVoice语音合成系统，这是Kaggle上的测试。'
prompt_text = '希望你以后能够做的比我还好呦。'

# Path to the sample prompt audio shipped with the repository
sample_wav = './asset/zero_shot_prompt.wav'

# Run inference only if the prompt audio is present.
# Note: inference_instruct2, as called later in this notebook, takes a prompt wav
# as its third argument, so it cannot serve as a prompt-free fallback here.
if os.path.exists(sample_wav):
    print("Running inference...")
    for i, j in enumerate(cosyvoice.inference_zero_shot(test_text, prompt_text, sample_wav, stream=False)):
        torchaudio.save(f'outputs/test_zero_shot_{i}.wav', j['tts_speech'], cosyvoice.sample_rate)
    print("Test audio generated in outputs/ directory")
else:
    print(f"Warning: {sample_wav} not found. Skipping the zero-shot test; check that the repository was cloned completely.")

## 5. Launch Web UI

Run this cell to start the Gradio Web UI. The `--share` flag requests a public URL, and the cell keeps running for as long as the server is up.

In [None]:
# Launch Web UI with public URL
!python webui.py --port 50000 --model_dir pretrained_models/Fun-CosyVoice3-0.5B --share
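
Before running the launch cell above, it can help to confirm that port 50000 is not already taken, for example by an earlier launch that was never stopped. This is a minimal sketch using only the standard library; `port_is_free` is a hypothetical helper and the check covers the local loopback interface only.

```python
import socket

def port_is_free(port, host='127.0.0.1'):
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
    return True

# 50000 is the port passed to webui.py in the launch cell.
print("Port 50000 free:", port_is_free(50000))
```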

## Alternative: Custom Inference Examples

In [None]:
# Example: Fine-grained control with laughter
text_with_control = '在他讲述那个荒诞故事的过程中，他突然[laughter]停下来，因为他自己也被逗笑了[laughter]。'

if os.path.exists('./asset/zero_shot_prompt.wav'):
    for i, j in enumerate(cosyvoice.inference_cross_lingual(text_with_control, './asset/zero_shot_prompt.wav', stream=False)):
        torchaudio.save(f'outputs/fine_grained_control_{i}.wav', j['tts_speech'], cosyvoice.sample_rate)
    print("Fine-grained control audio generated!")

In [None]:
# Example: Instruct mode with dialect
text = '收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。'
instruct = 'You are a helpful assistant. 请用四川话说这句话。<|endofprompt|>'

if os.path.exists('./asset/zero_shot_prompt.wav'):
    for i, j in enumerate(cosyvoice.inference_instruct2(text, instruct, './asset/zero_shot_prompt.wav', stream=False)):
        torchaudio.save(f'outputs/instruct_sichuan_{i}.wav', j['tts_speech'], cosyvoice.sample_rate)
    print("Instruct audio generated!")

In [None]:
# Example: Cross-lingual synthesis
cross_text = "<|en|>And then later on, fully acquiring that company. So keeping management in line, interest in line with the asset that's coming into the family is a reason why sometimes we don't buy the whole thing."

if os.path.exists('./asset/cross_lingual_prompt.wav'):
    for i, j in enumerate(cosyvoice.inference_cross_lingual(cross_text, './asset/cross_lingual_prompt.wav', stream=False)):
        torchaudio.save(f'outputs/cross_lingual_{i}.wav', j['tts_speech'], cosyvoice.sample_rate)
    print("Cross-lingual audio generated!")
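
The examples above all repeat the same loop: enumerate the generator and save each chunk's `tts_speech`. One way to factor that out is sketched below. `save_chunks` is a hypothetical helper, not part of CosyVoice, and the save function is passed in so the helper itself does not depend on torchaudio.

```python
import os

def save_chunks(chunks, prefix, sample_rate, save_fn, out_dir='outputs'):
    """Write each chunk's 'tts_speech' via save_fn and return the paths written."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, chunk in enumerate(chunks):
        path = os.path.join(out_dir, f'{prefix}_{i}.wav')
        # save_fn is called as save_fn(path, tensor, sample_rate), matching
        # how torchaudio.save is called elsewhere in this notebook.
        save_fn(path, chunk['tts_speech'], sample_rate)
        paths.append(path)
    return paths

# Intended use in this notebook (not executed here):
# save_chunks(
#     cosyvoice.inference_cross_lingual(cross_text, './asset/cross_lingual_prompt.wav', stream=False),
#     'cross_lingual', cosyvoice.sample_rate, torchaudio.save)
```

Returning the list of paths makes it easy to print or archive exactly what a given call produced.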

## Download Output Files

In [None]:
# List all generated audio files
import os
output_files = os.listdir('outputs')
print("Generated audio files:")
for f in output_files:
    if f.endswith('.wav'):
        print(f"  - outputs/{f}")
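
To fetch everything in one go, the `outputs` directory can be bundled into a single zip archive. This is a sketch, assuming the notebook's working directory is writable and its files are reachable from the Kaggle output panel; `zip_outputs` and the archive name `outputs_archive` are hypothetical.

```python
import os
import shutil

def zip_outputs(src_dir='outputs', archive_base='outputs_archive'):
    """Zip src_dir into <archive_base>.zip and return the archive path."""
    if not os.path.isdir(src_dir):
        raise FileNotFoundError(f"{src_dir} does not exist; run an inference cell first.")
    # The archive is written next to src_dir, not inside it, so it never includes itself.
    return shutil.make_archive(archive_base, 'zip', root_dir=src_dir)

if os.path.isdir('outputs'):
    print("Archive written to", zip_outputs())
```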