# Fish Speech S1-Mini - SageMaker Studio Setup

This notebook sets up Fish Speech for zero-shot voice cloning in SageMaker Studio.

**Requirements:**
- SageMaker Studio with GPU instance (ml.g4dn.xlarge, ml.g5.xlarge, etc.)
- Python 3.10+

Run each cell in order.

## Step 1: Check GPU

In [None]:
!nvidia-smi
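
If a programmatic check is preferred over reading the `nvidia-smi` table, the same information can be queried from Python. The following is a minimal sketch; the helper name `query_gpus` is hypothetical and not part of this project.

```python
import shutil
import subprocess


def query_gpus():
    """Return one 'name, total memory' string per GPU, or [] if none is visible."""
    # nvidia-smi is only on PATH when an NVIDIA driver is installed.
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = query_gpus()
print(gpus if gpus else "No NVIDIA GPU detected; switch to a GPU instance before continuing.")
```

An empty list usually means the notebook is running on a CPU-only instance type.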

## Step 2: Install Dependencies

This takes about 5-10 minutes. The script installs the dependencies in the required order.

In [None]:
!python scripts/install_sagemaker.py

## Step 3: Download Model

Download the S1-Mini model from Hugging Face (~1.7 GB).

In [None]:
from huggingface_hub import snapshot_download

snapshot_download(
    'fishaudio/openaudio-s1-mini',
    local_dir='checkpoints/openaudio-s1-mini'
)
print("Model downloaded!")
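
To confirm the download actually landed on disk, one option is to count the files and total bytes under the checkpoint directory. This is a sketch only; `summarize_dir` is a hypothetical helper, and the exact file count depends on the repository contents.

```python
from pathlib import Path


def summarize_dir(path):
    """Return (file_count, total_bytes) for all regular files under path."""
    root = Path(path)
    files = [p for p in root.rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


model_dir = Path("checkpoints/openaudio-s1-mini")
if model_dir.is_dir():
    count, total = summarize_dir(model_dir)
    print(f"{count} files, {total / 1e9:.2f} GB")
else:
    print(f"{model_dir} not found; rerun the download cell.")
```

A total far below the ~1.7 GB mentioned above suggests an interrupted download.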

## Step 4: Verify Installation

In [None]:
import torch
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

# Test imports
from s1_mini import ProductionTTSEngine, EngineConfig
print("\n✓ S1-Mini imports working!")

## Step 5: Test Basic TTS (No Voice Cloning)

In [None]:
from s1_mini import ProductionTTSEngine, EngineConfig
from IPython.display import Audio, display

# Initialize engine
config = EngineConfig(
    checkpoint_path="checkpoints/openaudio-s1-mini",
    device="cuda",
    precision="float16",
    compile_model=True,  # Enable Triton compilation on Linux
)

engine = ProductionTTSEngine(config)
engine.start()
print("Engine ready!")

In [None]:
# Generate basic TTS
response = engine.generate(
    text="Hello! This is a test of the Fish Speech text to speech system running on SageMaker Studio.",
    temperature=0.7,
    top_p=0.8,
)

if response.success:
    sample_rate, audio = response.audio
    print(f"Generated {len(audio)/sample_rate:.2f}s of audio")
    print(f"RTF: {response.metrics.realtime_factor:.2f}x")
    # A bare expression inside an if-block is not rendered, so call display() explicitly
    display(Audio(audio, rate=sample_rate))
else:
    print(f"Error: {response.error}")
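
To keep the generated audio after the kernel shuts down, it can be written to a WAV file. The sketch below uses only the standard library and assumes `audio` is a one-dimensional sequence of float samples in the range [-1.0, 1.0]; that format is an assumption, not something this notebook confirms, and `save_wav` is a hypothetical helper.

```python
import struct
import wave


def save_wav(path, samples, sample_rate):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        # Clip to the valid range, then scale to signed 16-bit integers.
        clipped = max(-1.0, min(1.0, float(s)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)       # mono
        wf.setsampwidth(2)       # 2 bytes per sample = 16-bit
        wf.setframerate(int(sample_rate))
        wf.writeframes(bytes(frames))
```

After a successful generation, `save_wav("output.wav", audio, sample_rate)` would write the clip next to the notebook.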

## Step 6: Test Zero-Shot Voice Cloning

Upload a reference audio file (WAV format) to clone a voice.

In [None]:
# Upload your reference audio file to SageMaker Studio
# Then update these variables:

REFERENCE_AUDIO_PATH = "your_reference.wav"  # <-- UPDATE THIS
REFERENCE_TEXT = "The text spoken in the reference audio"  # <-- UPDATE THIS
TEXT_TO_SYNTHESIZE = "This text will be spoken in the cloned voice."
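
Before sending the reference clip to the engine, it can help to confirm that the file really is a readable WAV and to see its duration. This sketch relies on the standard-library `wave` module, which only reads uncompressed PCM WAV files; `wav_info` is a hypothetical helper.

```python
import wave


def wav_info(path):
    """Return channel count, sample rate, and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_rate": rate,
            "duration_s": frames / rate,
        }
```

Calling `wav_info(REFERENCE_AUDIO_PATH)` raises `wave.Error` for files that are not PCM WAV, which is a quick way to catch a renamed MP3 or similar.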

In [None]:
# Load reference audio
with open(REFERENCE_AUDIO_PATH, "rb") as f:
    reference_audio_bytes = f.read()

print(f"Reference audio: {len(reference_audio_bytes):,} bytes")

# Generate with voice cloning
response = engine.generate(
    text=TEXT_TO_SYNTHESIZE,
    reference_audio=reference_audio_bytes,
    reference_text=REFERENCE_TEXT,
    temperature=0.7,
    top_p=0.8,
)

if response.success:
    sample_rate, audio = response.audio
    print(f"Generated {len(audio)/sample_rate:.2f}s of audio")
    print(f"RTF: {response.metrics.realtime_factor:.2f}x")
    # A bare expression inside an if-block is not rendered, so call display() explicitly
    display(Audio(audio, rate=sample_rate))
else:
    print(f"Error: {response.error}")

## Step 7: Cleanup

Stop the engine when done to free GPU memory.

In [None]:
engine.stop()
print("Engine stopped, GPU memory freed.")
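
If a cell raises partway through, the engine may stay loaded on the GPU. One way to guard against that is a small context manager that always calls `stop()`. This is a sketch that assumes only the `start()` and `stop()` methods used in this notebook; `running` is a hypothetical name.

```python
from contextlib import contextmanager


@contextmanager
def running(engine):
    """Start the engine, yield it, and always call stop() afterwards."""
    engine.start()
    try:
        yield engine
    finally:
        # Runs on normal exit and on exceptions alike.
        engine.stop()
```

It would be used as `with running(ProductionTTSEngine(config)) as engine: ...`, with the generation calls placed inside the block.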

---

## API Server (Optional)

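Start the API server for HTTP-based TTS. The server runs as a separate process and loads its own copy of the model, so stop the notebook engine (Step 7) first to free GPU memory.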

In [None]:
# Start API server (runs in background)
# Access at: http://localhost:8080/docs
!python -m s1_mini.server --checkpoint checkpoints/openaudio-s1-mini --port 8080 &
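
Backgrounding a process with `&` from a notebook cell can be unreliable, and the cell gives no signal when the server is ready. An alternative sketch using `subprocess.Popen` is shown below. It reuses the module name and flags from the command above; `start_server` and `wait_for_port` are hypothetical helpers.

```python
import socket
import subprocess
import sys
import time


def wait_for_port(host, port, timeout=60.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)  # server not accepting connections yet
    return False


def start_server(checkpoint="checkpoints/openaudio-s1-mini", port=8080):
    """Launch the API server as a child process and return the Popen handle."""
    cmd = [sys.executable, "-m", "s1_mini.server",
           "--checkpoint", checkpoint, "--port", str(port)]
    return subprocess.Popen(cmd)
```

A typical flow would be `server = start_server()`, then `wait_for_port("localhost", 8080)`, and finally `server.terminate()` when finished. Model loading time varies, so the timeout may need to be raised.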

## Expected Performance on SageMaker Studio

| Instance | GPU | RTF | Notes |
|----------|-----|-----|-------|
| ml.g4dn.xlarge | T4 | ~1.5x | Good for development |
| ml.g5.xlarge | A10G | ~2.0x | Best value |
| ml.p3.2xlarge | V100 | ~2.5x | High performance |

RTF here is the speed-up over real time: seconds of audio produced per second of generation time. Values above 1.0x mean faster than real time.
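
As a worked example of how the table values can be read, the sketch below computes the factor as audio duration divided by generation time. That definition is inferred from the statement that values above 1.0x are faster than real time; how `response.metrics.realtime_factor` is computed internally is not confirmed here.

```python
def realtime_factor(audio_seconds, wall_seconds):
    """Seconds of audio produced per second of wall-clock generation time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


# 6 s of audio generated in 4 s of compute
print(f"{realtime_factor(6.0, 4.0):.2f}x")  # 1.50x
```

Under this definition, a value of 0.5x would mean the clip took twice as long to generate as it takes to play.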