<a href="https://colab.research.google.com/github/inpremathilaka/flowbite-admin-dashboard/blob/main/kokoro_tts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# kokoro

- https://huggingface.co/hexgrad/Kokoro-82M
- https://github.com/yl4579/StyleTTS2
- [2023. StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models](https://arxiv.org/abs/2306.07691)
- [2022. iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform](https://arxiv.org/abs/2203.02395)
- Decoder only: no diffusion, no encoder release

Parameters:

Kokoro v0.19: 82M parameters (the count printed by the cell below is 81.763 million), Apache 2.0 license, trained on <100 hours of audio.

With so few parameters, the model can run directly on low-end devices such as phones and edge hardware.
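
To make the low-end-device claim concrete, here is a rough estimate of the memory needed for the weights alone. The parameter count comes from the note above; the `float16` and `int8` rows are hypothetical precisions added for comparison, not released variants, and the figures exclude activations and runtime overhead.

```python
# Rough weight-memory estimate for the 81.763M-parameter model.
PARAMS = 81_763_000  # parameter count quoted in the note above

# Bytes per parameter; float16 and int8 are hypothetical, for comparison only.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_megabytes(num_params: int, bytes_per_param: int) -> float:
    """Return the size of the raw weights in megabytes (10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


for dtype, width in BYTES_PER_PARAM.items():
    print(f"{dtype}: {weight_megabytes(PARAMS, width):.1f} MB")
```

At full `float32` precision the weights come to roughly 327 MB, which is within reach of many phones and single-board computers.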


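The released open weights are Kokoro v0.19, which does not support Chinese. Chinese text can still be converted to phonemes with phonemizer, but the resulting speech quality is poor.

Kokoro v0.23 supports Chinese, but its weights have not been released.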


https://huggingface.co/spaces/hexgrad/Kokoro-TTS


## run kokoro-tts with pytorch

In [3]:
# 1️⃣ Install dependencies silently
!git lfs install
!git clone https://huggingface.co/hexgrad/Kokoro-82M
%cd Kokoro-82M
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
!pip install -q phonemizer torch transformers scipy munch

Updated git hooks.
Git LFS initialized.
Cloning into 'Kokoro-82M'...
remote: Enumerating objects: 421, done.
remote: Counting objects: 100% (421/421), done.
remote: Compressing objects: 100% (199/199), done.
remote: Total 421 (delta 240), reused 391 (delta 221), pack-reused 0 (from 0)
Receiving objects: 100% (421/421), 1.83 MiB | 19.34 MiB/s, done.
Resolving deltas: 100% (240/240), done.
/content/Kokoro-82M/Kokoro-82M/Kokoro-82M

In [5]:
# 2️⃣ Build the model and load the default voicepack
from models import build_model
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
MODEL = build_model('kokoro-v0_19.pth', device)

total_params = 0
for key,model in MODEL.items():
    print(f'{key} Model: {model}')
    params = sum(p.numel() for p in model.parameters())
    total_params += params
    model_million_params = params / 1e6
    print(f'{key} Model has {model_million_params:.3f} million parameters')

model_million_params = total_params / 1e6
print(f'Model total has {model_million_params:.3f} million parameters')

VOICE_NAME = [
    'af', # Default voice is a 50-50 mix of Bella & Sarah
    'af_bella', 'af_sarah', 'am_adam', 'am_michael',
    'bf_emma', 'bf_isabella', 'bm_george', 'bm_lewis',
    'af_nicole', 'af_sky',
][0]
VOICEPACK = torch.load(f'voices/{VOICE_NAME}.pt', weights_only=True).to(device)
print(f'Loaded voice: {VOICE_NAME}')


ModuleNotFoundError: No module named 'models'

In [None]:
# 3️⃣ Call generate, which returns 24 kHz audio and the phonemes used
from kokoro import generate
text = "How could I know? It's an unanswerable question. Like asking an unborn child if they'll lead a good life. They haven't even been born."
audio, out_ps = generate(MODEL, text, VOICEPACK, lang=VOICE_NAME[0])
# Language is determined by the first letter of the VOICE_NAME:
# 🇺🇸 'a' => American English => en-us
# 🇬🇧 'b' => British English => en-gb


In [None]:
# 4️⃣ Display the 24 kHz audio and print the output phonemes
from IPython.display import display, Audio
display(Audio(data=audio, rate=24000, autoplay=True))
print(out_ps)


hˌaʊ kʊd aɪ nˈoʊ? ɪts ɐn ʌnˈænsɚɹəbəl kwˈɛstʃən. lˈaɪk ˈæskɪŋ ɐn ʌnbˈɔːɹn tʃˈaɪld ɪf ðeɪl lˈiːd ɐ ɡˈʊd lˈaɪf. ðeɪ hˈævənt ˈiːvən bˌɪn bˈɔːɹn.


## run kokoro-tts with onnx

In [None]:
import io
import json

import numpy as np
import requests
import torch

voices = [
    "af",
    "af_bella",
    "af_nicole",
    "af_sarah",
    "af_sky",
    "am_adam",
    "am_michael",
    "bf_emma",
    "bf_isabella",
    "bm_george",
    "bm_lewis",
]
voices_json = {}
pattern = "https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/voices/{voice}.pt"
for voice in voices:
    url = pattern.format(voice=voice)
    print(f"Downloading {url}")
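    r = requests.get(url)
    r.raise_for_status()  # stop early on a failed download
    content = io.BytesIO(r.content)
    # weights_only=True restricts unpickling to tensor data and avoids the torch.load warning
    voice_data: np.ndarray = torch.load(content, weights_only=True).numpy()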
    voices_json[voice] = voice_data.tolist()

with open("/content/voices.json", "w") as f:
    json.dump(voices_json, f, indent=4)

Downloading https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/voices/af.pt
Downloading https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/voices/af_bella.pt
Downloading https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/voices/af_nicole.pt
Downloading https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/voices/af_sarah.pt
Downloading https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/voices/af_sky.pt
Downloading https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/voices/am_adam.pt
Downloading https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/voices/am_michael.pt
Downloading https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/voices/bf_emma.pt
Downloading https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/voices/bf_isabella.pt
Downloading https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/voices/bm_george.pt
Downloading https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/voices/bm_lewis.pt


In [None]:
!ls -lh /content/voices.json

-rw-r--r-- 1 root root 52M Jan  9 05:23 /content/voices.json
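
Part of the 52M size comes from the encoding itself: every float is stored as text, and `indent=4` places each number on its own indented line. The sketch below compares the two encodings on a small made-up nested list. `fake_voice` is a hypothetical stand-in, and its shape is chosen only for illustration, not taken from the real voicepacks.

```python
import json

# Hypothetical stand-in for one voicepack: a small nested list of floats.
fake_voice = [[[0.123456789 * (i + j) for j in range(8)]] for i in range(4)]
voices = {"af": fake_voice, "af_bella": fake_voice}

# Same data, two encodings: pretty-printed vs. compact separators.
pretty = json.dumps(voices, indent=4)
compact = json.dumps(voices, separators=(",", ":"))

print(f"indent=4: {len(pretty)} characters")
print(f"compact:  {len(compact)} characters")

# Both encodings decode back to identical data.
assert json.loads(pretty) == json.loads(compact) == voices
```

If file size matters, dropping `indent` in the `json.dump` call is a safe change, since readers of the file see the same values either way.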


In [None]:
!wget https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files/kokoro-v0_19.onnx -O /content/kokoro-v0_19.onnx
!wget https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files/voices.json -O /content/kokoro-voices.json


--2025-01-09 05:39:41--  https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files/kokoro-v0_19.onnx
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 302 Found

In [None]:
!ls -lh /content/kokoro-voices.json /content/kokoro-v0_19.onnx

-rw-r--r-- 1 root root 330M Jan  3 17:01 /content/kokoro-v0_19.onnx
-rw-r--r-- 1 root root  52M Jan  3 16:34 /content/kokoro-voices.json


In [None]:
# Remove phonemizer first; phonemizer_fork is used instead for text -> phoneme conversion
!pip uninstall -y -q phonemizer
!pip install -Uq kokoro-onnx


In [None]:
import soundfile as sf
from kokoro_onnx import Kokoro

kokoro = Kokoro("/content/kokoro-v0_19.onnx", "/content/kokoro-voices.json")
samples, sample_rate = kokoro.create(
    "Hello. This audio was generated by kokoro!", voice="af_sarah", speed=1.0, lang="en-us"
)
sf.write("audio.wav", samples, sample_rate)
print("Created audio.wav")

Created audio.wav
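
If `soundfile` is not available, the same kind of array can be written with the standard library alone. This is a minimal sketch that assumes the samples are mono floats in the range [-1.0, 1.0]; that range is an assumption about the `kokoro.create` output, not something the notebook confirms. `write_wav_pcm16` is a hypothetical helper, and a generated sine tone stands in for real speech samples.

```python
import math
import struct
import wave


def write_wav_pcm16(path, samples, sample_rate):
    """Write mono float samples (assumed in [-1.0, 1.0]) as 16-bit PCM WAV."""
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against overshoot
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))


# Stand-in data: one second of a 440 Hz tone at 24 kHz.
DEMO_RATE = 24000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / DEMO_RATE) for n in range(DEMO_RATE)]
write_wav_pcm16("tone.wav", tone, DEMO_RATE)
print("Created tone.wav")
```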


In [None]:
from IPython.display import display, Audio
display(Audio(filename="audio.wav"))  # the sample rate is read from the WAV header



In [None]:
import soundfile as sf
from kokoro_onnx import Kokoro

kokoro = Kokoro("/content/kokoro-v0_19.onnx", "/content/kokoro-voices.json")
samples, sample_rate = kokoro.create(
    "Hello. 你好啊！从前，有一个小女孩，名叫莉莉。她喜欢在阳光下外面玩耍。有一天，她在后院看到一棵柠檬树。它很高，上面结满了柠檬。", voice="af_sarah", speed=1.0, lang="cmn"
)





In [None]:
from IPython.display import display, Audio
display(Audio(data=samples, rate=sample_rate, autoplay=True))