# Bark in 🤗 Transformers

The Bark model is available in 🤗 Transformers from v4.31.0 onwards!

In this notebook, we'll demonstrate how to use the Bark model with the 🤗 Transformers library, covering unconditional generation, speaker-prompted generation, and advanced text prompts for controllable generation.

## Bark Architecture


Bark is a transformer-based text-to-speech model proposed by Suno AI in [suno-ai/bark](https://github.com/suno-ai/bark).

Bark is made of 4 main models:

- `BarkSemanticModel` (also referred to as the 'text' model): a causal autoregressive transformer that takes tokenized text as input and predicts semantic text tokens that capture the meaning of the text.
- `BarkCoarseModel` (also referred to as the 'coarse acoustics' model): a causal autoregressive transformer that takes the output of the `BarkSemanticModel` as input. It predicts the first two audio codebooks required by EnCodec.
- `BarkFineModel` (the 'fine acoustics' model): a non-causal autoencoder transformer that iteratively predicts the remaining codebooks based on the sum of the previous codebooks' embeddings.
- `EncodecModel`: once all the codebook channels have been predicted, Bark uses it to decode the output audio array.

Each of the first three modules supports conditional speaker embeddings, which condition the output sound on a specific predefined voice.
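
To make the fine model's iterative step more concrete, here is a toy NumPy sketch. It is not Bark's implementation: the sizes (8 codebooks in total, 1024 entries each), the random embedding tables and the linear heads are all assumptions chosen for illustration. It only shows the idea of filling in the remaining codebooks one at a time from the sum of the embeddings of the codebooks already predicted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not read from a Bark config)
N_CODEBOOKS = 8       # total codebooks handed to the codec decoder
N_COARSE = 2          # codebooks produced by the coarse model
CODEBOOK_SIZE = 1024  # entries per codebook
HIDDEN = 16           # toy embedding width
T = 5                 # number of audio frames

# One toy embedding table and one toy output head per codebook
embed = rng.normal(size=(N_CODEBOOKS, CODEBOOK_SIZE, HIDDEN))
heads = rng.normal(size=(N_CODEBOOKS, HIDDEN, CODEBOOK_SIZE))

# Pretend the coarse model already produced the first two codebooks
codes = np.zeros((N_CODEBOOKS, T), dtype=np.int64)
codes[:N_COARSE] = rng.integers(0, CODEBOOK_SIZE, size=(N_COARSE, T))

# Fill in the remaining codebooks one at a time
for k in range(N_COARSE, N_CODEBOOKS):
    # Sum the embeddings of all codebooks predicted so far: shape (T, HIDDEN)
    summed = sum(embed[j][codes[j]] for j in range(k))
    # A real model would attend over all frames here (non-causal);
    # this toy uses a simple per-frame linear head instead.
    logits = summed @ heads[k]  # shape (T, CODEBOOK_SIZE)
    codes[k] = logits.argmax(axis=-1)

print(codes.shape)  # (8, 5)
```

The resulting `codes` array plays the role of the full set of codebook channels that the codec would then decode into a waveform.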


## General Functions

In [None]:
from pathlib import Path
import datetime
import os

import scipy.io.wavfile  # import the submodule explicitly so scipy.io.wavfile is available

# Home path (only works on Linux)
# https://stackoverflow.com/a/58988310
home_dir = Path(os.readlink('/proc/%s/cwd' % os.environ['JPY_PARENT_PID']))

# Directory for the generated audio files
bark_dir = home_dir / 'bark'
bark_dir.mkdir(parents=True, exist_ok=True)


def timestamp():
    # Zero-padded timestamp (YYYYMMDDHHMMSS) so file names are unambiguous and sort chronologically
    return datetime.datetime.now().strftime('%Y%m%d%H%M%S')


def save_file(fn, sampling_rate, data):
    # Write a NumPy audio array to a WAV file
    scipy.io.wavfile.write(fn, rate=sampling_rate, data=data)

## Prepare the Environment

Check which GPU is available:

In [None]:
!nvidia-smi

Install the 🤗 Transformers package from the main branch:

In [None]:
!pip install --upgrade --quiet pip
!pip install --quiet git+https://github.com/huggingface/transformers.git

## Load the Model

The pre-trained Bark small and large checkpoints can be loaded from the Hugging Face Hub (see the [pre-trained weights](https://huggingface.co/suno/bark)). Change the repo id to select the checkpoint size you want to use.

Here we load the small checkpoint, `"suno/bark-small"`, for faster inference. For better quality at the cost of slower inference, use the large checkpoint, `"suno/bark"`, instead.



In [None]:
from transformers import BarkModel

model = BarkModel.from_pretrained("suno/bark-small")

In [None]:
model

List the cached models:

In [None]:
!du -h -d 2 {home_dir}/.cache/huggingface/ | sort -hr

Move the model to an accelerator device if one is available.

In [None]:
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = model.to(device)

## Generating speech

Bark is a highly controllable text-to-speech model, meaning you can use it with various settings, as we are going to see.

First, load `BarkProcessor` to pre-process the inputs.

The processor's role here is twofold:
1. It tokenizes the input text, i.e. cuts it into small pieces that the model can understand.
2. It stores speaker embeddings, i.e. voice presets that can condition the generation.

In [None]:
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("suno/bark")

### Unconditional generation

First, let's generate speech in the simplest way possible, with no frills.

In [None]:
# prepare the inputs
text_prompt = "Let's try generating speech, with Bark, a text-to-speech model"
inputs = processor(text_prompt)

# generate speech
speech_output = model.generate(**inputs.to(device))

The audio output is a two-dimensional Torch tensor of shape `(batch_size, sequence_length)`. To listen to the generated audio samples, you can either play them in an ipynb notebook:

In [None]:
from IPython.display import Audio

sampling_rate = model.generation_config.sample_rate
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate, autoplay=True)

Or save them as a .wav file using a third-party library, e.g. scipy (note that `speech_output[0]` selects the first sample in the batch):

In [None]:
save_file(f'{bark_dir}/{timestamp()}.wav', sampling_rate, speech_output[0].cpu().numpy())
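
`scipy.io.wavfile.write` picks the WAV sample format from the array's dtype, so a float32 array is written as a 32-bit float WAV, which some players do not handle. If you need a 16-bit PCM file instead, a conversion along the following lines could be applied before saving. This is a minimal sketch: `to_int16_pcm` is a hypothetical helper, and it assumes the waveform is a float array with values roughly in `[-1, 1]`.

```python
import numpy as np


def to_int16_pcm(audio):
    """Convert a float waveform (assumed to lie in [-1, 1]) to 16-bit PCM."""
    audio = np.asarray(audio, dtype=np.float32)
    # Clip first so out-of-range samples do not wrap around after the cast
    audio = np.clip(audio, -1.0, 1.0)
    return (audio * 32767).astype(np.int16)


pcm = to_int16_pcm([0.0, 1.0, -1.0, 2.0])
print(pcm.dtype)  # int16
```

The converted array can then be passed to `save_file` in place of the float array.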

### Conditional generation

The Suno AI team provides a [library of preset voices](https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c) that can be used to condition the generated speech. In other words, the model generates speech that sounds like the chosen preset voice.

The processor can be used to automatically load these speaker prompts when tokenising the input text.

Let's try one voice preset:

In [None]:
voice_preset = "v2/en_speaker_6"

# prepare the inputs
text_prompt = "Let's try generating speech, with Bark, a text-to-speech model"
inputs = processor(text_prompt, voice_preset=voice_preset)

# generate speech
speech_output = model.generate(**inputs.to(device))

# let's hear it
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)

In [None]:
save_file(f'{bark_dir}/{timestamp()}.wav', sampling_rate, speech_output[0].cpu().numpy())

Great, let's try another voice preset:

In [None]:
voice_preset = "v2/en_speaker_3"

# prepare the inputs
text_prompt = "Let's try generating speech, with Bark, a text-to-speech model"
inputs = processor(text_prompt, voice_preset=voice_preset)

# generate speech
speech_output = model.generate(**inputs.to(device))

# let's hear it
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)

In [None]:
save_file(f'{bark_dir}/{timestamp()}.wav', sampling_rate, speech_output[0].cpu().numpy())
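
The two presets above follow the same naming pattern, so when comparing several voices it can be convenient to build the names programmatically. The sketch below assumes the `"<version>/<language>_speaker_<index>"` pattern seen in `"v2/en_speaker_6"`; `voice_preset_name` is a hypothetical helper, not part of 🤗 Transformers, and it does not check that a given preset actually exists.

```python
def voice_preset_name(language, speaker, version="v2"):
    """Build a preset name such as 'v2/en_speaker_6' (naming pattern assumed)."""
    if speaker < 0:
        raise ValueError("speaker index must be non-negative")
    return f"{version}/{language}_speaker_{speaker}"


presets = [voice_preset_name("en", i) for i in (3, 6)]
print(presets)  # ['v2/en_speaker_3', 'v2/en_speaker_6']

# Each name could then be passed to the processor as in the cells above:
# for preset in presets:
#     inputs = processor(text_prompt, voice_preset=preset)
#     speech_output = model.generate(**inputs.to(device))
```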

### More advanced generation techniques

All the previous examples used the default sampling mode (`do_sample=True`), but you can also use [more advanced generation techniques](https://huggingface.co/docs/transformers/generation_strategies) such as beam search for better quality.

You can also set generation parameters for an individual sub-model by prepending `semantic_`, `coarse_` or `fine_` to the parameter name.

Let's try it with the previous `text_prompt`.



In [None]:
speech_output = model.generate(**inputs.to(device), num_beams=4, temperature=0.5, semantic_temperature=0.8)

Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)
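
To illustrate the prefix convention, here is a small standalone sketch of how prefixed keyword arguments could be routed to each sub-model. It is not Bark's internal code: `split_generation_kwargs` is a hypothetical helper, and it assumes that unprefixed parameters apply to every sub-model while a prefixed parameter overrides the shared value for its own sub-model.

```python
PREFIXES = ("semantic_", "coarse_", "fine_")


def split_generation_kwargs(**kwargs):
    """Split generation kwargs into one dict per sub-model (illustrative only)."""
    # Parameters without a prefix are shared by all sub-models
    shared = {k: v for k, v in kwargs.items() if not k.startswith(PREFIXES)}
    per_model = {}
    for prefix in PREFIXES:
        # Strip the prefix from the parameters aimed at this sub-model
        specific = {k[len(prefix):]: v for k, v in kwargs.items() if k.startswith(prefix)}
        # Sub-model-specific values take precedence over the shared ones
        per_model[prefix.rstrip("_")] = {**shared, **specific}
    return per_model


routed = split_generation_kwargs(num_beams=4, temperature=0.5, semantic_temperature=0.8)
print(routed["semantic"])  # {'num_beams': 4, 'temperature': 0.8}
print(routed["coarse"])    # {'num_beams': 4, 'temperature': 0.5}
```

Under these assumptions, the call above would use a temperature of 0.8 for the semantic model and 0.5 for the other two.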

### Multilingual speech

Bark can also generate speech in multiple languages, such as French and Chinese.

In [None]:
# Multilingual speech - simplified Chinese
inputs = processor("惊人的！我会说中文")

# generate speech
speech_output = model.generate(**inputs.to(device))

Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)

In [None]:
save_file(f'{bark_dir}/{timestamp()}.wav', sampling_rate, speech_output[0].cpu().numpy())

In [None]:
# Multilingual speech - French - let's use a voice_preset as well
inputs = processor("Je peux générer du son facilement avec ce modèle.", voice_preset="fr_speaker_3")

# generate speech
speech_output = model.generate(**inputs.to(device))

Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)

In [None]:
save_file(f'{bark_dir}/{timestamp()}.wav', sampling_rate, speech_output[0].cpu().numpy())

### **Non-verbal** communication

The model can also produce **non-verbal communication** such as laughing, sighing and crying.


In [None]:
# Adding non-speech cues to the input text
inputs = processor("[clears throat] Hello uh ..., my dog is cute [laughter]")


# generate speech
speech_output = model.generate(**inputs.to(device))

Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)

In [None]:
save_file(f'{bark_dir}/{timestamp()}.wav', sampling_rate, speech_output[0].cpu().numpy())

### More applications:

Bark can also generate music. You can help it out by adding music notes around your lyrics.

In [None]:
inputs = processor("♪ In the jungle, the mighty jungle, the lion barks tonight ♪")

# generate speech
speech_output = model.generate(**inputs.to(device))

Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)

In [None]:
save_file(f'{bark_dir}/{timestamp()}.wav', sampling_rate, speech_output[0].cpu().numpy())

In [None]:
# more advanced prompts!

text_prompt = """
    WOMAN: I would like an oatmilk latte please.
    MAN: Wow, that's expensive!
"""

inputs = processor(text_prompt)

# generate speech
speech_output = model.generate(**inputs.to(device))

Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)

In [None]:
save_file(f'{bark_dir}/{timestamp()}.wav', sampling_rate, speech_output[0].cpu().numpy())

## Simple API example

To get this running, you'll need to launch your ngrok notebook on the side.

In [None]:
from flask import Flask, request, send_file

app = Flask(__name__)


@app.route('/text-to-speech', methods=['GET', 'POST'])
def text_to_speech():
    # Read `text` from the query string (GET) or the form body (POST)
    text = request.values.get('text')
    if not text:
        return "Missing required parameter: text", 400

    # Tokenize the input text
    inputs = processor(text)

    # Generate speech
    speech_output = model.generate(**inputs.to(device))

    # Save the audio to a WAV file
    fn = f"{bark_dir}/{timestamp()}.wav"
    save_file(fn, sampling_rate, speech_output[0].cpu().numpy())

    # Return the file as a download
    return send_file(fn, as_attachment=True)


if __name__ == '__main__':
    app.run()
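
A client could call this endpoint by passing the text as a URL-encoded query parameter. The sketch below only builds the request URL. It assumes the app is reachable at Flask's default development address, `http://127.0.0.1:5000` (replace it with your ngrok URL if you expose the server that way), and `build_tts_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def build_tts_url(text, base_url="http://127.0.0.1:5000"):
    """Build the request URL for the /text-to-speech endpoint (base URL assumed)."""
    # urlencode escapes spaces and punctuation so the text survives the query string
    return f"{base_url}/text-to-speech?{urlencode({'text': text})}"


url = build_tts_url("Hello, Bark!")
print(url)  # http://127.0.0.1:5000/text-to-speech?text=Hello%2C+Bark%21

# With the server running, the WAV file could then be downloaded with:
# from urllib.request import urlretrieve
# urlretrieve(url, "speech.wav")
```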