# Fish Speech

### For Windows Users

In [1]:
!chcp 65001

/bin/bash: line 1: chcp: command not found


### For Linux Users

In [2]:
import locale
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

'en_US.UTF-8'

### Prepare Model

In [3]:
# Users in mainland China may want to use a mirror to speed up the download.
# `!set` / `!export` run in a throwaway subshell and do not persist, so use %env instead:
# %env HF_ENDPOINT=https://hf-mirror.com

!huggingface-cli download fishaudio/fish-speech-1.5 --local-dir checkpoints/fish-speech-1.5/

Fetching 7 files: 100%|█████████████████████████| 7/7 [00:00<00:00, 6320.80it/s]
/home/leo/Desktop/leo-ext/self/other/fish-speech/checkpoints/fish-speech-1.5
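The mirror setup and the download can also be done from Python. The following is a minimal sketch, not the project's own tooling: `use_hf_mirror` and `download_checkpoints` are hypothetical helper names, and it assumes the `huggingface_hub` package is installed and that it picks up `HF_ENDPOINT` when it is first imported.

```python
import os


def use_hf_mirror(endpoint="https://hf-mirror.com", env=None):
    """Point Hugging Face downloads at a mirror by setting HF_ENDPOINT.

    `env` defaults to the real process environment; pass a dict to experiment.
    """
    env = os.environ if env is None else env
    env["HF_ENDPOINT"] = endpoint
    return env


def download_checkpoints(repo_id="fishaudio/fish-speech-1.5",
                         local_dir="checkpoints/fish-speech-1.5"):
    """Download the model repository into `local_dir` (hypothetical helper)."""
    # Imported lazily so that HF_ENDPOINT is already set when the library loads
    # (assumption: the endpoint is read at import time).
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


# Typical use in a notebook cell:
# use_hf_mirror()
# download_checkpoints()
```

Call `use_hf_mirror()` before anything imports `huggingface_hub`; skip it entirely if the default endpoint is fast enough.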


## WebUI Inference

> Pass `--compile` to fuse CUDA kernels for roughly 10x faster inference.

In [None]:
!python tools/run_webui.py \
    --llama-checkpoint-path checkpoints/fish-speech-1.5 \
    --decoder-checkpoint-path checkpoints/fish-speech-1.5/firefly-gan-vq-fsq-8x1024-21hz-generator.pth \
    --compile

2025-05-15 10:01:00.962 | INFO     | __main__:<module>:56 - Loading Llama model...
2025-05-15 10:01:06.081 | INFO     | fish_speech.models.text2semantic.inference:load_model:681 - Restored model from checkpoint
2025-05-15 10:01:06.082 | INFO     | fish_speech.models.text2semantic.inference:load_model:687 - Using DualARTransformer
2025-05-15 10:01:06.082 | INFO     | fish_speech.models.text2semantic.inference:load_model:695 - Compiling function...
2025-05-15 10:01:06.821 | INFO     | __main__:<module>:64 - Loading VQ-GAN model...
  @autocast(enabled = False)
  @autocast(enabled = False)
  @autocast(enabled = False)
  @autocast(enabled = False)
2025-05-15 10:01:07.363 | INFO     | fish_speech.models.vqgan.infer

## Break-down CLI Inference

### 1. Encode reference audio into prompt tokens

This step should write a `fake.npy` file (the encoded prompt) to the working directory.

In [None]:
# Path to the reference audio file (replace with your own)
src_audio = r"D:\PythonProject\vo_hutao_draw_appear.wav"

# The path is quoted so that file names containing spaces still work
!python fish_speech/models/vqgan/inference.py \
    -i "{src_audio}" \
    --checkpoint-path "checkpoints/fish-speech-1.5/firefly-gan-vq-fsq-8x1024-21hz-generator.pth"

from IPython.display import Audio, display
audio = Audio(filename="fake.wav")
display(audio)
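Before moving on, it can help to confirm that the encoder actually produced its files. This is a small sketch using only the standard library; `missing_outputs` is a hypothetical helper, and the file names `fake.npy` and `fake.wav` are taken from the cell above.

```python
from pathlib import Path


def missing_outputs(names=("fake.npy", "fake.wav"), workdir="."):
    """Return the expected output files that are absent or empty."""
    root = Path(workdir)
    return [
        name for name in names
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]


missing = missing_outputs()
if missing:
    print("Missing or empty outputs:", ", ".join(missing))
else:
    print("Encoder outputs are present.")
```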

### 2. Generate semantic tokens from text

> This command creates `codes_N.npy` files in the working directory, where N is an integer starting from 0.

> Add `--compile` to fuse CUDA kernels for faster inference (~30 tokens/second -> ~300 tokens/second).

In [None]:
!python fish_speech/models/text2semantic/inference.py \
    --text "hello world" \
    --prompt-text "The text corresponding to reference audio" \
    --prompt-tokens "fake.npy" \
    --checkpoint-path "checkpoints/fish-speech-1.5" \
    --num-samples 2
    # --compile
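Outside a notebook, the same command can be assembled as an argument list, which makes toggling `--compile` less error-prone than editing a shell line. This is a sketch under assumptions: `text2semantic_command` is a hypothetical helper, and it uses only the flags shown in the cell above.

```python
import subprocess
import sys


def text2semantic_command(text, prompt_text, prompt_tokens="fake.npy",
                          checkpoint_path="checkpoints/fish-speech-1.5",
                          num_samples=2, compile_model=False):
    """Build the argv list for the text-to-semantic inference script."""
    cmd = [
        sys.executable, "fish_speech/models/text2semantic/inference.py",
        "--text", text,
        "--prompt-text", prompt_text,
        "--prompt-tokens", prompt_tokens,
        "--checkpoint-path", checkpoint_path,
        "--num-samples", str(num_samples),
    ]
    if compile_model:
        # Optional flag from the notes above: fuses CUDA kernels
        cmd.append("--compile")
    return cmd


# Run from the repository root, for example:
# subprocess.run(
#     text2semantic_command("hello world",
#                           "The text corresponding to reference audio"),
#     check=True,
# )
```

Passing a list to `subprocess.run` avoids shell quoting problems when the text contains spaces or quotes.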

### 3. Generate speech from semantic tokens

In [None]:
!python fish_speech/models/vqgan/inference.py \
    -i "codes_0.npy" \
    --checkpoint-path "checkpoints/fish-speech-1.5/firefly-gan-vq-fsq-8x1024-21hz-generator.pth"

from IPython.display import Audio, display
audio = Audio(filename="fake.wav")
display(audio)
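When step 2 is run with `--num-samples` greater than 1, there are several `codes_N.npy` files to decode. The sketch below lists them in numeric order and decodes each one; `find_code_files` and `decode_all` are hypothetical helpers, and the rename step assumes the decoder always writes its result to `fake.wav` in the current directory, as the cells above suggest.

```python
import re
import shutil
import subprocess
import sys
from pathlib import Path

DECODER = "checkpoints/fish-speech-1.5/firefly-gan-vq-fsq-8x1024-21hz-generator.pth"


def find_code_files(workdir="."):
    """Return codes_N.npy files sorted by N (numerically, not as text)."""
    found = []
    for path in Path(workdir).iterdir():
        match = re.fullmatch(r"codes_(\d+)\.npy", path.name)
        if match:
            found.append((int(match.group(1)), path))
    return [path for _, path in sorted(found)]


def decode_all(workdir="."):
    """Decode every codes_N.npy and keep each result as codes_N.wav."""
    outputs = []
    for codes in find_code_files(workdir):
        subprocess.run(
            [sys.executable, "fish_speech/models/vqgan/inference.py",
             "-i", str(codes), "--checkpoint-path", DECODER],
            check=True,
        )
        target = codes.with_suffix(".wav")
        # Assumption: the decoder wrote fake.wav; move it so the next run
        # does not overwrite it.
        shutil.move("fake.wav", str(target))
        outputs.append(target)
    return outputs


# Run from the repository root after step 2:
# decode_all()
```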