<a href="https://colab.research.google.com/github/nbiish/patron-tools/blob/main/colabs/BARK_text_to_speech.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install

In [None]:
# install bark (make sure you have torch>=2 for much faster flash-attention)
!pip install git+https://github.com/suno-ai/bark.git

## Imports and Load Models

In [None]:
from bark import SAMPLE_RATE, generate_audio, preload_models
from IPython.display import Audio

preload_models()

# Basic Prompting

In [None]:
text_prompt = "" # @param {type:"string"}
audio_array = generate_audio(text_prompt)
Audio(audio_array, rate=SAMPLE_RATE)

# Advanced Prompting

## Imports

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only the first GPU

from IPython.display import Audio
import nltk  # splits the script into sentences
import numpy as np

from bark.generation import (
    generate_text_semantic,
    preload_models,
)
from bark.api import semantic_to_waveform
from bark import generate_audio, SAMPLE_RATE

## Simple Long-Form Generation

In [None]:
script = """
Here is a short cyberpunk science fiction story featuring Anishinaabe characters:

Nimiki gazed out the towering plexiglass window of her apartment, watching the hovercars zoom by in the night sky. She leaned back in her chair and took a long drag from her electronic cigarette, the tip glowing a faint blue.

"Another day in the concrete jungle," she murmured.

As a cybersecurity expert for OjibweTech, one of the largest tech firms in the region, Nimiki was constantly battling the forces that threatened to destabilize the network. Rogue AIs, foreign state actors, cyberterrorists - they all wanted a piece of the vast stream of data and commerce that flowed through the company's servers.

A notification popped up on her retinal display - time to jack into the system for her nightly monitoring. She strode over to the console on the wall and unspooled the optic cable from her wrist port. As soon as she plugged in, code began racing across her vision. To the untrained eye it was just a blur, but to Nimiki it was a whole virtual world alive with information.

Tonight the network hummed along smoothly, though Nimiki knew dangers lurked in every encrypted partition and dark node. She had to be ever vigilant. The Anishinaabe legacy of technological innovation depended on it.

After ensuring no major threats were detected, Nimiki logged off. She grabbed her sleek leather jacket and headed for the door. Maybe she'd drop by the local microchip speakeasy and unwind a bit before calling it a night. Her cyberdefense skills were top-notch, but staying human in this churning, ever-changing world? That was the real trick.
""".replace("\n", " ").strip()

In [None]:
import nltk
nltk.download('punkt')  # newer NLTK releases may need 'punkt_tab' instead
sentences = nltk.sent_tokenize(script)

In [None]:
SPEAKER = "v2/en_speaker_6"
silence = np.zeros(int(0.25 * SAMPLE_RATE))  # quarter second of silence

pieces = []
for sentence in sentences:
  audio_array = generate_audio(sentence, history_prompt=SPEAKER)
  pieces += [audio_array, silence.copy()]

In [None]:
Audio(np.concatenate(pieces), rate=SAMPLE_RATE)
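
The loop above interleaves each generated sentence with a copy of `silence`, which also leaves a trailing gap after the last sentence. The sketch below shows one way to join chunks with silence only between them and to convert the result to 16-bit PCM for saving. The helper names `join_with_silence` and `to_int16_pcm` are hypothetical, the demo chunks are stand-in data rather than Bark output, and `SAMPLE_RATE` is hard-coded here on the assumption that it matches the value imported from `bark`.

```python
import numpy as np

# Assumption: matches bark.SAMPLE_RATE; in the notebook, use the imported constant.
SAMPLE_RATE = 24_000


def join_with_silence(chunks, sample_rate=SAMPLE_RATE, gap_seconds=0.25):
    """Concatenate 1-D audio chunks with silence between (not after) them."""
    if not chunks:
        return np.zeros(0, dtype=np.float32)
    gap = np.zeros(int(gap_seconds * sample_rate), dtype=np.float32)
    joined = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            joined.append(gap)  # silence only before the 2nd, 3rd, ... chunk
        joined.append(np.asarray(chunk, dtype=np.float32))
    return np.concatenate(joined)


def to_int16_pcm(audio):
    """Clip float audio to [-1, 1] and scale it to 16-bit PCM samples."""
    clipped = np.clip(audio, -1.0, 1.0)
    return (clipped * 32767).astype(np.int16)


# Stand-in chunks (hypothetical data) so the sketch runs without Bark.
demo_chunks = [
    np.full(1000, 0.5, dtype=np.float32),
    np.full(500, -0.5, dtype=np.float32),
]
demo_audio = join_with_silence(demo_chunks)
demo_pcm = to_int16_pcm(demo_audio)
```

With real output, the list passed in would hold only the arrays returned by `generate_audio`, and the int16 result could then be written to disk, for example with `scipy.io.wavfile.write`.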

## Advanced Long-Form Generation

In [None]:
GEN_TEMP = 0.6
SPEAKER = "v2/en_speaker_6"
silence = np.zeros(int(0.25 * SAMPLE_RATE))

pieces = []
for sentence in sentences:
  semantic_tokens = generate_text_semantic(
      sentence,
      history_prompt=SPEAKER,
      temp=GEN_TEMP,
      min_eos_p=0.05, # this controls how likely the generation is to end
  )

  audio_array = semantic_to_waveform(semantic_tokens, history_prompt=SPEAKER)
  pieces += [audio_array, silence.copy()]

In [None]:
Audio(np.concatenate(pieces), rate=SAMPLE_RATE)

## Make Long-Form Dialog from Bark

In [None]:
speaker_lookup = {"Samantha": "v2/en_speaker_9", "John": "v2/en_speaker_2"}

script = """
Samantha: Hey, have you heard about this new text-to-audio model called "Bark"?

John: No, I haven't. What's so special about it?

Samantha: Well, apparently it's the most realistic and natural-sounding text-to-audio model out there right now. People are saying it sounds just like a real person speaking.

John: Wow, that sounds amazing. How does it work?

Samantha: I think it uses advanced machine learning algorithms to analyze and understand the nuances of human speech, and then replicates those nuances in its own speech output.

John: That's pretty impressive. Do you think it could be used for things like audiobooks or podcasts?

Samantha: Definitely! In fact, I heard that some publishers are already starting to use Bark to create audiobooks. And I bet it would be great for podcasts too.

John: I can imagine. It would be like having your own personal voiceover artist.

Samantha: Exactly! I think Bark is going to be a game-changer in the world of text-to-audio technology."""
script = script.strip().split("\n")
script = [s.strip() for s in script if s]
script
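
The generation loop that follows relies on every line having the form `Speaker: text` and on every speaker name appearing in `speaker_lookup`. As a minimal sketch, assuming that format, the hypothetical helper `parse_dialog` below validates the lines up front and splits only on the first `": "`, so a colon inside the spoken text is preserved. The demo script and lookup are illustrative stand-ins.

```python
def parse_dialog(script_text, known_speakers):
    """Split 'Speaker: text' lines into (speaker, text) pairs."""
    turns = []
    for raw_line in script_text.strip().split("\n"):
        line = raw_line.strip()
        if not line:
            continue  # skip the blank lines between turns
        # Split on the first ": " only, so colons inside the text survive.
        speaker, sep, text = line.partition(": ")
        if not sep or speaker not in known_speakers:
            raise ValueError(f"Cannot parse dialog line: {line!r}")
        turns.append((speaker, text))
    return turns


# Illustrative stand-in data, not part of the original notebook.
demo_lookup = {"Samantha": "v2/en_speaker_9", "John": "v2/en_speaker_2"}
demo_script = """
Samantha: Here is the plan: we test it first.

John: Sounds good.
"""
demo_turns = parse_dialog(demo_script, demo_lookup)
```

Failing early on a malformed line or an unknown speaker avoids discovering the problem partway through a long generation run.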

In [None]:
pieces = []
silence = np.zeros(int(0.5*SAMPLE_RATE))
for line in script:
  # split on the first ": " only, so colons inside the spoken text are kept
  speaker, text = line.split(": ", 1)
  audio_array = generate_audio(text, history_prompt=speaker_lookup[speaker])
  pieces += [audio_array, silence.copy()]

In [None]:
Audio(np.concatenate(pieces), rate=SAMPLE_RATE)