<a href="https://colab.research.google.com/github/jeremiahoclark/open_source_colabs/blob/main/fish_speech_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fish Speech - Colab Version

This notebook has been adapted for Google Colab from the original Fish Speech repository.

## Initial Setup

First, let's install the required dependencies and clone the repository.

In [1]:
# Install required packages
!pip install torch torchaudio transformers accelerate gradio

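# Clone the Fish Speech repository
!git clone https://github.com/fishaudio/fish-speech.git

# Use the %cd magic here: `!cd` runs in a throwaway subshell and
# does not change the notebook's working directory for later cells
%cd fish-speech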

Collecting gradio
  Downloading gradio-5.4.0-py3-none-any.whl.metadata (16 kB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl.metadata (9.7 kB)
Collecting fastapi<1.0,>=0.115.2 (from gradio)
  Downloading fastapi-0.115.4-py3-none-any.whl.metadata (27 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.4.0-py3-none-any.whl.metadata (2.9 kB)
Collecting gradio-client==1.4.2 (from gradio)
  Downloading gradio_client-1.4.2-py3-none-any.whl.metadata (7.1 kB)
Collecting huggingface-hub<1.0,>=0.23.2 (from transformers)
  Downloading huggingface_hub-0.26.2-py3-none-any.whl.metadata (13 kB)
Collecting markupsafe~=2.0 (from gradio)
  Downloading MarkupSafe-2.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.0 kB)
Collecting pydub (from gradio)
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting python-multipart==0.0.12 (from gradio)
  Downloading python_multipart-0.0.12-py3-none-any.whl.metadata (
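
In a notebook, each `!` command runs in its own short-lived subshell, so a `cd` issued there does not carry over to later commands or cells, whereas `%cd` (or `os.chdir`) changes the directory of the notebook process itself. The sketch below illustrates the difference using only the standard library; the temporary directory merely stands in for the cloned repository and is not part of the original notebook.

```python
import os
import subprocess
import tempfile

start_dir = os.getcwd()

with tempfile.TemporaryDirectory() as target:
    target = os.path.realpath(target)

    # Roughly what `!cd <target>` does: a child shell changes directory, then exits.
    subprocess.run(f'cd "{target}"', shell=True, check=True)
    dir_after_subshell = os.getcwd()

    # Roughly what `%cd <target>` does: the current process changes directory.
    os.chdir(target)
    dir_after_chdir = os.path.realpath(os.getcwd())

    # Go back so the temporary directory can be cleaned up.
    os.chdir(start_dir)

print("after subshell cd:", dir_after_subshell)
print("after os.chdir:   ", dir_after_chdir)
```

Relative paths such as `tools/webui.py` resolve against the notebook process's working directory, which is why the distinction matters for the later cells.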

## Set UTF-8 Encoding

In [2]:
import locale
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

'en_US.UTF-8'
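
`locale.setlocale` raises `locale.Error` when the requested locale is not installed on the runtime. A minimal sketch that falls back through a list of candidates is shown below; the helper name `set_utf8_locale` and the `C.UTF-8` fallback are assumptions for illustration, not part of the original notebook.

```python
import locale

def set_utf8_locale(candidates=("en_US.UTF-8", "C.UTF-8")):
    """Try each locale name in order; return the one applied, or None if all fail."""
    for name in candidates:
        try:
            return locale.setlocale(locale.LC_ALL, name)
        except locale.Error:
            # This locale is not available on the current system; try the next one.
            continue
    return None

applied = set_utf8_locale()
# Calling setlocale without a second argument only queries the current setting.
print("locale in effect:", applied or locale.setlocale(locale.LC_ALL))
```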

## Download Model Checkpoints

Now we'll download the model checkpoints from Hugging Face.

In [3]:
# Install Hugging Face CLI if not already installed
!pip install -q huggingface_hub

# Download the model
!huggingface-cli download fishaudio/fish-speech-1.4 --local-dir checkpoints/fish-speech-1.4/

Fetching 8 files:   0% 0/8 [00:00<?, ?it/s]Downloading '.gitattributes' to 'checkpoints/fish-speech-1.4/.cache/huggingface/download/.gitattributes.a6344aac8c09253b3b630fb776ae94478aa0275b.incomplete'
Downloading 'README.md' to 'checkpoints/fish-speech-1.4/.cache/huggingface/download/README.md.c70e58a0f6ae54f86a898f60b1d6264956a74b54.incomplete'
Downloading 'config.json' to 'checkpoints/fish-speech-1.4/.cache/huggingface/download/config.json.966eef0416c72fbe6b2938856c28707e3710b072.incomplete'
Downloading 'firefly-gan-vq-fsq-8x1024-21hz-generator.pth' to 'checkpoints/fish-speech-1.4/.cache/huggingface/download/firefly-gan-vq-fsq-8x1024-21hz-generator.pth.01b81dbf753224a156c3fe139b88bf0b9a0f54b11bee864f95e66511c3ccd754.incomplete'
Downloading 'model.pth' to 'checkpoints/fish-speech-1.4/.cache/huggingface/download/model.pth.5792b78b48d1f92d0b2cef8f07aad6b7ac19db2aff23a2054a5cce1d019ab4fb.incomplete'
Downloading 'special_tokens_map.json' to 'checkpoints/fish-speech-1.4/.cache/huggingface/d
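
Before moving on, it can help to confirm that the files the later cells depend on actually arrived. The sketch below checks for three of the file names that appear in the download log; the helper `missing_checkpoint_files` is hypothetical, and the check only tests presence and non-zero size, not integrity.

```python
from pathlib import Path

# File names taken from the download log in this notebook.
EXPECTED_FILES = (
    "config.json",
    "model.pth",
    "firefly-gan-vq-fsq-8x1024-21hz-generator.pth",
)

def missing_checkpoint_files(checkpoint_dir, expected=EXPECTED_FILES):
    """Return the expected files that are absent or empty in checkpoint_dir."""
    root = Path(checkpoint_dir)
    missing = []
    for name in expected:
        path = root / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing

missing = missing_checkpoint_files("checkpoints/fish-speech-1.4")
print("missing or empty:", missing if missing else "none")
```

The relative path is resolved against the current working directory, so the result also reveals whether the download landed where the inference commands will look for it.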

## WebUI Inference

Run the WebUI with Colab's port forwarding capabilities.

In [4]:
# For Colab, we need to use port forwarding
from google.colab import output
output.serve_kernel_port_as_window(7860)

!python tools/webui.py \
    --llama-checkpoint-path checkpoints/fish-speech-1.4 \
    --decoder-checkpoint-path checkpoints/fish-speech-1.4/firefly-gan-vq-fsq-8x1024-21hz-generator.pth \
    --share  # This enables Gradio's public URL feature

Try `serve_kernel_port_as_iframe` instead.



python3: can't open file '/content/tools/webui.py': [Errno 2] No such file or directory
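
Port forwarding only shows something useful once a server is actually listening on the port. One possible approach, sketched here with the standard library, is to poll the port before opening it in the notebook; `wait_for_port` is a hypothetical helper, and the port number 7860 is the one used in the cell above.

```python
import socket
import time

def wait_for_port(port, host="127.0.0.1", timeout=60.0, interval=0.5):
    """Return True once host:port accepts TCP connections, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means a server is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet; wait briefly and retry.
            time.sleep(interval)
    return False

# Possible use in Colab, assuming the WebUI was started in the background:
#     if wait_for_port(7860):
#         output.serve_kernel_port_as_iframe(7860)
```

This only works if the server runs in the background; a blocking `!python ...` line keeps the cell busy until the server exits.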


## Break-down CLI Inference

### 1. Encode Reference Audio

In [6]:
# Upload your audio file using Colab's file uploader
from google.colab import files
uploaded = files.upload()
src_audio = next(iter(uploaded.keys()))  # Get the filename of the uploaded file

!python tools/vqgan/inference.py \
    -i "{src_audio}" \
    --checkpoint-path "checkpoints/fish-speech-1.4/firefly-gan-vq-fsq-8x1024-21hz-generator.pth"



Saving jap_speaker_1.wav to jap_speaker_1.wav
python3: can't open file '/content/tools/vqgan/inference.py': [Errno 2] No such file or directory


FileNotFoundError: [Errno 2] No such file or directory: 'fake.wav'

In [7]:
from IPython.display import Audio, display
audio = Audio(filename="jap_speaker_1.wav")
display(audio)
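
Besides listening to the reference clip, it can be useful to check its basic properties. The sketch below reads them with the standard-library `wave` module, which handles uncompressed PCM WAV only; that the uploaded file is in this format is an assumption, and `wav_info` is a hypothetical helper.

```python
import wave

def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration

# Example call in the notebook, using the uploaded file name:
#     rate, channels, seconds = wav_info(src_audio)
```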

### 2. Generate Semantic Tokens from Text

In [None]:
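# Text to synthesize (Japanese). In English: "Thank you for making this video!
# It is very easy to listen to. My favorite Japanese food is abura soba.
# I love it! I am looking forward to watching the next video."
# Kept on one logical line so the shell receives it as a single argument.
text = (
    "この動画を作成してくれてありがとうございました！"
    "とても聞きやすいです。"
    "私の好きな日本の食べ物は油そばです。大好きです！"
    "次の動画を見るのを楽しみにしています。"
)

!python tools/llama/generate.py \
    --text "{text}" \
    --prompt-text "The text corresponding to reference audio" \
    --prompt-tokens "fake.npy" \
    --checkpoint-path "checkpoints/fish-speech-1.4" \
    --num-samples 2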
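
Passing long or multi-line text through a shell line is fragile because of quoting. One alternative, sketched below, is to build the command as an argument list, which reaches the script without any shell parsing. The flags are the ones used in this notebook; the helper `build_generate_command` and the sample text are hypothetical.

```python
import sys

def build_generate_command(text, prompt_text, prompt_tokens, checkpoint_path, num_samples=2):
    """Build the argument list for tools/llama/generate.py (flags as used in this notebook)."""
    return [
        sys.executable, "tools/llama/generate.py",
        "--text", text,
        "--prompt-text", prompt_text,
        "--prompt-tokens", prompt_tokens,
        "--checkpoint-path", checkpoint_path,
        "--num-samples", str(num_samples),
    ]

cmd = build_generate_command(
    text="First line of text.\nSecond line of text.",  # newlines survive intact
    prompt_text="The text corresponding to reference audio",
    prompt_tokens="fake.npy",
    checkpoint_path="checkpoints/fish-speech-1.4",
)
print(cmd)

# To run it from inside the cloned repository:
#     import subprocess
#     subprocess.run(cmd, check=True)
```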

### 3. Generate Speech from Semantic Tokens

In [None]:
!python tools/vqgan/inference.py \
    -i "codes_0.npy" \
    --checkpoint-path "checkpoints/fish-speech-1.4/firefly-gan-vq-fsq-8x1024-21hz-generator.pth"

# Play the synthesized speech; the decoder writes its output to fake.wav by default
from IPython.display import Audio, display
audio = Audio(filename="fake.wav")
display(audio)

python3: can't open file '/content/tools/vqgan/inference.py': [Errno 2] No such file or directory
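
With `--num-samples 2`, the previous step presumably writes more than one token file. The sketch below collects them in numeric order so each can be decoded in turn; only `codes_0.npy` appears in this notebook, so the `codes_<N>.npy` naming for further samples is an assumption, and `find_code_files` is a hypothetical helper.

```python
import re
from pathlib import Path

def find_code_files(directory="."):
    """Return codes_<N>.npy files found in directory, ordered by N."""
    pattern = re.compile(r"codes_(\d+)\.npy")
    indexed = []
    for path in Path(directory).iterdir():
        match = pattern.fullmatch(path.name)
        if match and path.is_file():
            indexed.append((int(match.group(1)), path))
    # Sort numerically so codes_10.npy comes after codes_2.npy.
    return [path for _, path in sorted(indexed)]

for code_file in find_code_files("."):
    print(code_file)
```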
