<a href="https://colab.research.google.com/github/ju-li/bark/blob/main/notebooks/long_form_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [5]:
!pip install git+https://github.com/suno-ai/bark.git


Collecting git+https://github.com/suno-ai/bark.git
  Cloning https://github.com/suno-ai/bark.git to /tmp/pip-req-build-_zwyo61j
  Running command git clone --filter=blob:none --quiet https://github.com/suno-ai/bark.git /tmp/pip-req-build-_zwyo61j
  Resolved https://github.com/suno-ai/bark.git to commit 773624d26db84278a55aacae9a16d7b25fbccab8
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting boto3 (from suno-bark==0.0.1a0)
  Downloading boto3-1.28.84-py3-none-any.whl.metadata (6.7 kB)
Collecting encodec (from suno-bark==0.0.1a0)
  Downloading encodec-0.1.1.tar.gz (3.7 MB)
  Preparing metadata (setup.py) ... done
Collecting funcy (from suno-bark==0.0.1a0)
  Downloading

In [10]:
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"


from IPython.display import Audio
import nltk  # we'll use this to split into sentences
nltk.download('punkt')
import numpy as np

from bark.generation import (
    generate_text_semantic,
    preload_models,
)
from bark.api import semantic_to_waveform
from bark import generate_audio, SAMPLE_RATE

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [7]:
preload_models()

Downloading text_2.pt:   0%|          | 0.00/5.35G [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

Downloading coarse_2.pt:   0%|          | 0.00/3.93G [00:00<?, ?B/s]

Downloading fine_2.pt:   0%|          | 0.00/3.74G [00:00<?, ?B/s]

Downloading: "https://dl.fbaipublicfiles.com/encodec/v0/encodec_24khz-d7cc33bc.th" to /root/.cache/torch/hub/checkpoints/encodec_24khz-d7cc33bc.th
100%|██████████| 88.9M/88.9M [00:00<00:00, 115MB/s]


# Simple Long-Form Generation
We split longer text into sentences using `nltk` and generate the sentences one by one.

In [8]:
script = """
Hey, have you heard about this new text-to-audio model called "Bark"?
Apparently, it's the most realistic and natural-sounding text-to-audio model
out there right now. People are saying it sounds just like a real person speaking.
I think it uses advanced machine learning algorithms to analyze and understand the
nuances of human speech, and then replicates those nuances in its own speech output.
It's pretty impressive, and I bet it could be used for things like audiobooks or podcasts.
In fact, I heard that some publishers are already starting to use Bark to create audiobooks.
It would be like having your own personal voiceover artist. I really think Bark is going to
be a game-changer in the world of text-to-audio technology.
""".replace("\n", " ").strip()

In [11]:
sentences = nltk.sent_tokenize(script)

In [None]:
SPEAKER = "v2/en_speaker_6"
GEN_TEMP = 0.6
silence = np.zeros(int(0.25 * SAMPLE_RATE))  # quarter second of silence

pieces = []
for sentence in sentences:
    # use the GEN_TEMP constant defined above instead of repeating the literal
    audio_array = generate_audio(sentence, history_prompt=SPEAKER, text_temp=GEN_TEMP)
    pieces += [audio_array, silence.copy()]


In [None]:
Audio(np.concatenate(pieces), rate=SAMPLE_RATE)
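
To keep the result outside the notebook, the joined array can be written to a WAV file. The following is a minimal sketch rather than part of Bark: `join_with_silence` and `save_wav` are hypothetical helper names, and `SAMPLE_RATE` is redefined as 24 kHz only so that the block runs on its own (in the notebook, use the value imported from `bark`).

```python
import wave

import numpy as np

SAMPLE_RATE = 24_000  # assumption: stands in for bark.SAMPLE_RATE so the sketch runs standalone


def join_with_silence(clips, pause_s=0.25, sample_rate=SAMPLE_RATE):
    """Concatenate clips, appending pause_s seconds of silence after each one."""
    silence = np.zeros(int(pause_s * sample_rate), dtype=np.float32)
    parts = []
    for clip in clips:
        parts += [np.asarray(clip, dtype=np.float32), silence]
    return np.concatenate(parts) if parts else np.zeros(0, dtype=np.float32)


def save_wav(path, audio, sample_rate=SAMPLE_RATE):
    """Write float audio in [-1, 1] as 16-bit mono PCM using the standard library."""
    pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype("<i2")
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 2 bytes per sample = 16-bit
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())
```

In the notebook this would be called as `save_wav("long_form.wav", np.concatenate(pieces))`, where the file name is only an example.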


# Advanced Long-Form Generation
Sometimes Bark will hallucinate a little extra audio at the end of the prompt.
We can mitigate this by lowering the end-of-sequence probability threshold at which Bark stops generating semantic tokens.
We do this with the `min_eos_p` kwarg of `generate_text_semantic`.
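
To make the effect of the threshold concrete, here is a toy sketch of an early-stop rule of this kind. It is an illustration under stated assumptions, not Bark's actual implementation: `should_stop` is a hypothetical function, the probability vector is made up, and placing the end-of-sequence (EOS) probability in the last slot is a choice made for this example.

```python
import numpy as np


def should_stop(probs, sampled_token, eos_token, min_eos_p=None):
    """Return True if generation should end at this step.

    probs: 1-D array of next-token probabilities; in this toy layout the
    LAST entry is the EOS probability (an assumption of the sketch).
    """
    if sampled_token == eos_token:
        return True  # the sampler picked EOS outright
    # Otherwise stop early once EOS is merely "likely enough".
    return min_eos_p is not None and bool(probs[-1] >= min_eos_p)


EOS = 2
probs = np.array([0.60, 0.32, 0.08])  # toy distribution; EOS probability is 0.08

high = should_stop(probs, sampled_token=0, eos_token=EOS, min_eos_p=0.2)
low = should_stop(probs, sampled_token=0, eos_token=EOS, min_eos_p=0.05)
print(high, low)
```

With the higher threshold the toy generator keeps going at this step, while with `min_eos_p=0.05` it stops, which is why a lower value tends to cut off trailing audio.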

In [None]:
GEN_TEMP = 0.6
SPEAKER = "v2/en_speaker_6"
silence = np.zeros(int(0.25 * SAMPLE_RATE))  # quarter second of silence

pieces = []
for sentence in sentences:
    semantic_tokens = generate_text_semantic(
        sentence,
        history_prompt=SPEAKER,
        temp=GEN_TEMP,
        min_eos_p=0.05,  # end-of-sequence probability threshold; lower values end generation sooner
    )

    audio_array = semantic_to_waveform(semantic_tokens, history_prompt=SPEAKER)
    pieces += [audio_array, silence.copy()]





In [None]:
Audio(np.concatenate(pieces), rate=SAMPLE_RATE)


# Make a Long-Form Dialog with Bark

### Step 1: Format a script and speaker lookup

In [None]:
speaker_lookup = {"Samantha": "v2/en_speaker_9", "John": "v2/en_speaker_2"}

# Script generated by ChatGPT
script = """
Samantha: Hey, have you heard about this new text-to-audio model called "Bark"?

John: No, I haven't. What's so special about it?

Samantha: Well, apparently it's the most realistic and natural-sounding text-to-audio model out there right now. People are saying it sounds just like a real person speaking.

John: Wow, that sounds amazing. How does it work?

Samantha: I think it uses advanced machine learning algorithms to analyze and understand the nuances of human speech, and then replicates those nuances in its own speech output.

John: That's pretty impressive. Do you think it could be used for things like audiobooks or podcasts?

Samantha: Definitely! In fact, I heard that some publishers are already starting to use Bark to create audiobooks. And I bet it would be great for podcasts too.

John: I can imagine. It would be like having your own personal voiceover artist.

Samantha: Exactly! I think Bark is going to be a game-changer in the world of text-to-audio technology."""
script = script.strip().split("\n")
script = [s.strip() for s in script if s]
script

['Samantha: Hey, have you heard about this new text-to-audio model called "Bark"?',
 "John: No, I haven't. What's so special about it?",
 "Samantha: Well, apparently it's the most realistic and natural-sounding text-to-audio model out there right now. People are saying it sounds just like a real person speaking.",
 'John: Wow, that sounds amazing. How does it work?',
 'Samantha: I think it uses advanced machine learning algorithms to analyze and understand the nuances of human speech, and then replicates those nuances in its own speech output.',
 "John: That's pretty impressive. Do you think it could be used for things like audiobooks or podcasts?",
 'Samantha: Definitely! In fact, I heard that some publishers are already starting to use Bark to create audiobooks. And I bet it would be great for podcasts too.',
 'John: I can imagine. It would be like having your own personal voiceover artist.',
 'Samantha: Exactly! I think Bark is going to be a game-changer in the world of text-to-audi

### Step 2: Generate the audio for every speaker turn
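
Since generation is slow, it can be worth validating the script before starting. The sketch below is a hypothetical helper (`parse_turns` is not part of Bark) that splits each line on the first `": "` only, so a colon inside the spoken text is preserved, and that fails early on a speaker with no entry in the lookup.

```python
def parse_turns(lines, speaker_lookup):
    """Turn 'Speaker: text' lines into (voice_preset, text) pairs.

    Raises ValueError for a line without a 'Speaker: ' prefix and
    KeyError for a speaker that is missing from speaker_lookup.
    """
    turns = []
    for line in lines:
        # partition splits on the FIRST ": " only, keeping later colons in the text
        speaker, sep, text = line.partition(": ")
        if not sep:
            raise ValueError(f"no 'Speaker: text' separator in line: {line!r}")
        if speaker not in speaker_lookup:
            raise KeyError(f"unknown speaker {speaker!r}")
        turns.append((speaker_lookup[speaker], text.strip()))
    return turns


example_lookup = {"Samantha": "v2/en_speaker_9", "John": "v2/en_speaker_2"}
example_lines = [
    "Samantha: Here's the thing: it sounds real.",
    "John: Wow, that sounds amazing.",
]
turns = parse_turns(example_lines, example_lookup)
```

Each resulting pair can then be passed to `generate_audio(text, history_prompt=voice_preset)`.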

In [None]:
pieces = []
silence = np.zeros(int(0.5*SAMPLE_RATE))
for line in script:
    # split on the first ": " only, so colons inside the spoken text don't break unpacking
    speaker, text = line.split(": ", 1)
    audio_array = generate_audio(text, history_prompt=speaker_lookup[speaker])
    pieces += [audio_array, silence.copy()]


### Step 3: Concatenate all of the audio and play it

In [None]:
Audio(np.concatenate(pieces), rate=SAMPLE_RATE)