# Fish Speech S1-Mini - SageMaker Studio Setup

This notebook sets up Fish Speech for zero-shot voice cloning in SageMaker Studio.

**Requirements:**
- SageMaker Studio with a GPU instance (ml.g4dn.xlarge, ml.g5.xlarge, etc.)
- Python 3.10+

Run each cell in order.

## Step 1: Check GPU

In [1]:
!nvidia-smi

Sun Jan 18 17:06:46 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 591.74                 Driver Version: 591.74         CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 5070 Ti   WDDM  |   00000000:01:00.0 Off |                  N/A |
|  0%   32C    P8              9W /  300W |       0MiB /  16303MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+----------------------------------------------
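The same check can be done from Python, which is handy when a later cell needs the GPU name or memory size. This is a minimal sketch: it reuses the `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader` query that the install script prints in Step 2, and the helper names `parse_gpu_line` and `query_gpus` are made up for illustration.

```python
import subprocess


def parse_gpu_line(line: str) -> tuple[str, int]:
    """Split one CSV line such as 'Tesla T4, 15360 MiB' into (name, MiB)."""
    name, memory = (part.strip() for part in line.rsplit(",", 1))
    return name, int(memory.split()[0])


def query_gpus() -> list[tuple[str, int]]:
    """Return (name, total memory in MiB) per visible GPU, or [] if none."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        # nvidia-smi is missing or failed: treat as "no GPU visible"
        return []
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]


gpus = query_gpus()
if not gpus:
    print("No NVIDIA GPU detected")
for name, mib in gpus:
    print(f"{name}: {mib} MiB")
```

An empty result means the driver or the GPU instance is not visible to the kernel, so the later steps would fall back to CPU or fail.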

## Step 2: Install Dependencies

This will take 5-10 minutes. The script installs all dependencies in the correct order.

In [5]:
!python scripts/install_sagemaker.py

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
fish-speech 0.1.0 requires descript-audio-codec, which is not installed.
fish-speech 0.1.0 requires descript-audiotools, which is not installed.
fish-speech 0.1.0 requires einops>=0.7.0, which is not installed.
fish-speech 0.1.0 requires einx[torch]==0.2.2, which is not installed.
fish-speech 0.1.0 requires transformers>=4.45.2, which is not installed.
vector-quantize-pytorch 1.27.7 requires einops>=0.8.0, which is not installed.
vector-quantize-pytorch 1.27.7 requires einx>=0.3.0, which is not installed.
datasets 2.18.0 requires fsspec[http]<=2024.2.0,>=2023.1.0, but you have fsspec 2025.12.0 which is incompatible.
fish-speech 0.1.0 requires numpy<=1.26.4, but you have numpy 2.2.6 which is incompatible.
fish-speech 0.1.0 requires pydantic==2.9.2, but you have pydantic 2.12.5 which is incompatible.
gradio 5.49.1 r


╔══════════════════════════════════════════════════════════════════╗
║        Fish Speech - SageMaker Studio Installation               ║
╚══════════════════════════════════════════════════════════════════╝

  Not in SageMaker Studio (local environment)
Project root: c:\Users\PC\Desktop\fish-speech

CHECKING GPU

>>> nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
GPU detected: NVIDIA GeForce RTX 5070 Ti, 16303 MiB

STEP 1: Cleaning up existing packages

>>> pip uninstall -y torch
Found existing installation: torch 2.9.1+cu130
Uninstalling torch-2.9.1+cu130:
  Successfully uninstalled torch-2.9.1+cu130

>>> pip uninstall -y torchvision
Found existing installation: torchvision 0.24.1
Uninstalling torchvision-0.24.1:
  Successfully uninstalled torchvision-0.24.1

>>> pip uninstall -y torchaudio
Found existing installation: torchaudio 2.9.1+cu130
Uninstalling torchaudio-2.9.1+cu130:
  Successfully uninstalled torchaudio-2.9.1+cu130

>>> pip uninstall -y numpy
Found existin

ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: 'c:\\users\\pc\\desktop\\fish-speech\\.venv\\lib\\site-packages\\numpy-2.2.6.dist-info\\METADATA'
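When the installer stops partway, as in the output above, it helps to see which of the pinned packages are actually present before re-running it. The sketch below uses only the standard library. The version bounds are copied from the pip messages above (for `einops`, the stricter `>=0.8.0` bound is used), and the helpers `as_tuple` and `installed` are illustrative names, not part of Fish Speech.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def as_tuple(v: str) -> tuple[int, ...]:
    """Turn '2.9.1+cu130' into (2, 9, 1); stops at the first non-numeric part."""
    parts = []
    for piece in v.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def installed(name: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


# Bounds taken from the dependency-conflict messages in the cell output
checks = {
    "numpy": lambda v: as_tuple(v) <= (1, 26, 4),
    "pydantic": lambda v: as_tuple(v) == (2, 9, 2),
    "einops": lambda v: as_tuple(v) >= (0, 8, 0),
    "transformers": lambda v: as_tuple(v) >= (4, 45, 2),
}

for name, ok in checks.items():
    v = installed(name)
    status = "missing" if v is None else ("ok" if ok(v) else "conflict")
    print(f"{name:14s} {v or '-':12s} {status}")
```

This is a rough comparison of numeric release segments only; it does not implement full version-specifier semantics.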



## Step 3: Download Model

Download the S1-Mini model from Hugging Face (~1.7 GB).

In [3]:
from huggingface_hub import snapshot_download

snapshot_download(
    'fishaudio/openaudio-s1-mini',
    local_dir='checkpoints/openaudio-s1-mini'
)
print("Model downloaded!")

Fetching 7 files:   0%|          | 0/7 [00:00<?, ?it/s]

Model downloaded!
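The progress bar in the captured output stops at 0/7, so it is worth confirming that the files really landed on disk. A minimal sketch, assuming the `checkpoints/openaudio-s1-mini` directory used above; `summarize_dir` is a hypothetical helper name.

```python
from pathlib import Path


def summarize_dir(root: str) -> list[tuple[str, int]]:
    """Return (relative path, size in bytes) for every file under root."""
    base = Path(root)
    if not base.is_dir():
        return []
    return sorted(
        (p.relative_to(base).as_posix(), p.stat().st_size)
        for p in base.rglob("*")
        if p.is_file()
    )


files = summarize_dir("checkpoints/openaudio-s1-mini")
for rel, size in files:
    print(f"{size / 1e6:10.1f} MB  {rel}")
print(f"{len(files)} files, {sum(s for _, s in files) / 1e9:.2f} GB total")
```

A total far below the ~1.7 GB mentioned above suggests an incomplete download. The file count may be higher than 7 if the download tool also stores its own metadata files inside the directory.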


## Step 4: Verify Installation

In [6]:
import torch
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

# Test imports
from s1_mini import ProductionTTSEngine, EngineConfig
print("\n✓ S1-Mini imports working!")

AttributeError: module 'torch' has no attribute '__version__'
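An `AttributeError` on `torch.__version__` usually means that `import torch` did not load the real package. One likely cause, given the interrupted uninstall in Step 2, is a leftover `torch` directory without an `__init__.py`, which Python imports as an empty namespace package; a local folder named `torch` has the same effect. The sketch below reports where an import would resolve without actually importing it. `describe_module` is an illustrative helper name.

```python
import importlib.util


def describe_module(name: str) -> str:
    """Report where `import name` would load from, without importing it."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return f"{name}: not installed"
    if spec.origin is None:
        # No __init__.py found: the directory is treated as a namespace package
        locations = list(spec.submodule_search_locations)
        return f"{name}: namespace package at {locations}"
    return f"{name}: {spec.origin}"


print(describe_module("torch"))
```

If the result is a namespace package, deleting the listed directory and reinstalling PyTorch is the usual remedy. Restart the kernel afterwards so the broken module is not reused from memory.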

## Step 5: Test Basic TTS (No Voice Cloning)

In [None]:
from s1_mini import ProductionTTSEngine, EngineConfig
from IPython.display import Audio, display

# Initialize engine
config = EngineConfig(
    checkpoint_path="checkpoints/openaudio-s1-mini",
    device="cuda",
    precision="float16",
    compile_model=True,  # Enable Triton compilation on Linux
)

engine = ProductionTTSEngine(config)
engine.start()
print("Engine ready!")

In [None]:
# Generate basic TTS
response = engine.generate(
    text="Hello! This is a test of the Fish Speech text to speech system running on SageMaker Studio.",
    temperature=0.7,
    top_p=0.8,
)

if response.success:
    sample_rate, audio = response.audio
    print(f"Generated {len(audio)/sample_rate:.2f}s of audio")
    print(f"RTF: {response.metrics.realtime_factor:.2f}x")
    display(Audio(audio, rate=sample_rate))  # a bare expression inside if/else is not rendered
else:
    print(f"Error: {response.error}")
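To keep the result after the kernel shuts down, the samples can be written to a WAV file. This is a sketch using only the standard library. It assumes `audio` is a one-dimensional (mono) sequence of floats in the range [-1.0, 1.0]; the notebook does not state the sample format, so check it before relying on this. `save_wav` is a hypothetical helper.

```python
import array
import sys
import wave


def save_wav(path: str, sample_rate: int, samples) -> float:
    """Write mono float samples (assumed in [-1.0, 1.0]) as 16-bit PCM.

    Returns the duration of the written audio in seconds.
    """
    clipped = (max(-1.0, min(1.0, float(s))) for s in samples)
    pcm = array.array("h", (int(s * 32767) for s in clipped))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wf.setframerate(sample_rate)
        wf.writeframes(pcm.tobytes())
    return len(pcm) / sample_rate


# Usage in the notebook, after a successful generate() call:
# save_wav("output.wav", sample_rate, audio)
```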

## Step 6: Test Zero-Shot Voice Cloning

Upload a reference audio file (WAV format) to clone a voice.

In [None]:
# Upload your reference audio file to SageMaker Studio
# Then update these variables:

REFERENCE_AUDIO_PATH = "your_reference.wav"  # <-- UPDATE THIS
REFERENCE_TEXT = "The text spoken in the reference audio"  # <-- UPDATE THIS
TEXT_TO_SYNTHESIZE = "This text will be spoken in the cloned voice."
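Before passing the reference file to the engine, a quick inspection can catch a wrong path or an unexpected format early. The sketch below relies on the standard-library `wave` module, which reads PCM WAV files only and raises `wave.Error` for other encodings. `wav_info` is an illustrative helper, and the path repeats the placeholder from the cell above.

```python
import os
import wave


def wav_info(path: str) -> dict:
    """Return basic properties of a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        frames, rate = wf.getnframes(), wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_rate": rate,
            "bits": wf.getsampwidth() * 8,
            "seconds": frames / rate,
        }


path = "your_reference.wav"  # same placeholder as REFERENCE_AUDIO_PATH
if os.path.exists(path):
    print(wav_info(path))
else:
    print(f"Reference file not found: {path}")
```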

In [None]:
# Load reference audio
with open(REFERENCE_AUDIO_PATH, "rb") as f:
    reference_audio_bytes = f.read()

print(f"Reference audio: {len(reference_audio_bytes):,} bytes")

# Generate with voice cloning
response = engine.generate(
    text=TEXT_TO_SYNTHESIZE,
    reference_audio=reference_audio_bytes,
    reference_text=REFERENCE_TEXT,
    temperature=0.7,
    top_p=0.8,
)

if response.success:
    sample_rate, audio = response.audio
    print(f"Generated {len(audio)/sample_rate:.2f}s of audio")
    print(f"RTF: {response.metrics.realtime_factor:.2f}x")
    display(Audio(audio, rate=sample_rate))  # a bare expression inside if/else is not rendered
else:
    print(f"Error: {response.error}")

## Step 7: Cleanup

Stop the engine when done to free GPU memory.

In [None]:
engine.stop()
print("Engine stopped, GPU memory freed.")

---

## API Server (Optional)

Start the API server for HTTP-based TTS.

In [None]:
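# Start the API server as a background process.
# IPython's `!` does not support a trailing `&`, so use subprocess instead.
# Access at: http://localhost:8080/docs
import subprocess, sys
server = subprocess.Popen([sys.executable, "-m", "s1_mini.server",
                           "--checkpoint", "checkpoints/openaudio-s1-mini", "--port", "8080"])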
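The server needs time to load the model before it accepts connections, so a small readiness check avoids sending requests too early. This sketch only tests whether the TCP port is open; it makes no assumption about the server's endpoints. `wait_for_port` is a hypothetical helper.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or refused): wait and retry
            time.sleep(interval)
    return False


# Usage in the notebook, after starting the server on port 8080:
# print("Server ready" if wait_for_port("localhost", 8080) else "Server did not start")
```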

## Expected Performance on SageMaker Studio

| Instance | GPU | RTF | Notes |
|----------|-----|-----|-------|
| ml.g4dn.xlarge | T4 | ~1.5x | Good for development |
| ml.g5.xlarge | A10G | ~2.0x | Best value |
| ml.p3.2xlarge | V100 | ~2.5x | High performance |

Here RTF is a speed-up factor: a value above 1.0x means audio is generated faster than real time (good!).
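As a worked example of reading the table, the sketch below treats RTF as seconds of audio produced per second of compute. That definition is an assumption inferred from the statement that values above 1.0x are faster than real time; the exact formula behind `response.metrics.realtime_factor` is not shown in this notebook. Both function names are illustrative.

```python
def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Speed-up factor: seconds of audio produced per second of compute."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def estimated_wall_seconds(audio_seconds: float, rtf: float) -> float:
    """Compute time needed to synthesize audio_seconds at a given RTF."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return audio_seconds / rtf


# 10 s of audio generated in 5 s of compute
print(f"{realtime_factor(10.0, 5.0):.2f}x")  # 2.00x

# One minute of audio at ~1.5x (the T4 row in the table)
print(f"{estimated_wall_seconds(60.0, 1.5):.0f} s")  # 40 s
```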