## Prepare the Environment
First, make sure the notebook is connected to a GPU. To get one in Colab, click Runtime -> Change runtime type, then change Hardware accelerator from None to GPU. We can verify that a GPU has been assigned and view its specifications with the `nvidia-smi` command:

In [1]:
!nvidia-smi

Wed Nov 29 17:48:28 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   58C    P8    11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
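The same check can be done from Python, which is handy if later cells should react to a missing GPU. The following is a minimal sketch that uses only the standard library; the helper name `query_gpus` is hypothetical, and it assumes `nvidia-smi` is on the `PATH` whenever a GPU runtime is attached.

```python
import shutil
import subprocess


def query_gpus():
    """Return one 'name, total memory' string per GPU, or [] if none can be queried."""
    # nvidia-smi is only present when an NVIDIA driver is installed
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = query_gpus()
if gpus:
    print("GPUs found:", gpus)
else:
    print("No GPU detected; generation will run on the CPU and be much slower.")
```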

In [2]:
!pip install --upgrade --quiet pip
!pip install --quiet git+https://github.com/huggingface/transformers.git

## Load the Model
The pre-trained Bark small and large checkpoints can be loaded from the Hugging Face Hub. To switch checkpoint size, change the repo ID passed to `from_pretrained`.

This notebook loads the small checkpoint, "suno/bark-small", for faster inference. For better quality at the cost of slower inference, use the large checkpoint, "suno/bark", instead.
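One way to make the checkpoint choice explicit is to map a size label to its repo ID and load the model and processor from the same repo. This is only a sketch: `BARK_CHECKPOINTS`, `resolve_checkpoint`, and `load_bark` are hypothetical names introduced here, and the `transformers` import is deferred so that the mapping can be used without loading the library.

```python
# Repo IDs taken from the text above
BARK_CHECKPOINTS = {"small": "suno/bark-small", "large": "suno/bark"}


def resolve_checkpoint(size="small"):
    """Map a size label ('small' or 'large') to its Hugging Face repo ID."""
    try:
        return BARK_CHECKPOINTS[size]
    except KeyError:
        valid = ", ".join(sorted(BARK_CHECKPOINTS))
        raise ValueError(f"Unknown size {size!r}; expected one of: {valid}") from None


def load_bark(size="small"):
    """Load the model and its processor from the same checkpoint."""
    # Imported here so that resolve_checkpoint works without transformers installed
    from transformers import AutoProcessor, BarkModel

    repo_id = resolve_checkpoint(size)
    model = BarkModel.from_pretrained(repo_id)
    processor = AutoProcessor.from_pretrained(repo_id)
    return model, processor
```

Calling `load_bark("large")` would then download the larger weights, so expect a longer first run.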

In [3]:
from transformers import BarkModel, AutoProcessor

model = BarkModel.from_pretrained("suno/bark-small")

In [4]:
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = model.to(device)
processor = AutoProcessor.from_pretrained("suno/bark")

## Function `generate_audio`
This function takes a text prompt and a voice preset as arguments and generates an audio file from them using the suno/bark model.

The suno/bark model is a text-to-speech model that can produce realistic speech and other sounds from text. The voice preset is a parameter that controls the voice characteristics of the speech, such as the accent, pitch, speed, and emotion.

The function does the following steps:

- It uses the `processor` object to encode the text prompt and the voice preset into a format that the model can understand.

- It uses the `model` object to generate the speech output from the encoded inputs. The speech output is a tensor on the model's device; it is moved to the CPU and converted to a NumPy array that contains the audio data.

- It uses the `Audio` class from the `IPython.display` module to play the speech output in the notebook. It also uses the `scipy.io.wavfile` module to write the speech output to a WAV file named `"speech_output.wav"` in the current directory.

- The function does not return anything; playing the audio and writing the file are side effects.

Here is an example of how to use the function:
```py
# Define the voice preset and the text prompt
voice_preset = "v2/en_speaker_9"
text_prompt = "What motivated you to pursue a career in data science?"

# Call the function
generate_audio(text_prompt, voice_preset)
```
This creates and plays an audio file that says "What motivated you to pursue a career in data science?" in the English-speaking voice selected by the `v2/en_speaker_9` preset. The file, `speech_output.wav`, is written to the notebook's current working directory.
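Because `generate_audio` always writes to the same file name, each call overwrites the previous result. When comparing several voices it helps to derive one file name per preset. The sketch below assumes the English presets are numbered `v2/en_speaker_0` through `v2/en_speaker_9`; both helper names are hypothetical, and using `output_path` would require adding a file-name parameter to `generate_audio`.

```python
def english_presets(count=10):
    """List English voice preset names, assuming they are numbered from 0."""
    return [f"v2/en_speaker_{i}" for i in range(count)]


def output_path(voice_preset):
    """Build a file name from a preset, replacing '/' so it is a valid name."""
    return "speech_" + voice_preset.replace("/", "_") + ".wav"


for preset in english_presets():
    print(preset, "->", output_path(preset))
```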

In [5]:
from IPython.display import Audio, display
import scipy.io.wavfile

def generate_audio(text_prompt, voice_preset):
    """Generates speech from a text prompt using the suno/bark model, plays it, and saves it.

    Args:
        text_prompt (str): The text to be converted to speech.
        voice_preset (str): The voice preset to be used by the model.

    Returns:
        None

    Outputs:
        A wav file named "speech_output.wav" containing the speech generated by the model.

    Example:
        >>> voice_preset = "v2/en_speaker_9"
        >>> text_prompt = "What motivated you to pursue a career in data science?"
        >>> generate_audio(text_prompt, voice_preset)
        # This will create a file named "speech_output.wav" in the current directory
    """
    # Encode the text prompt and the voice preset into model inputs
    inputs = processor(text_prompt, voice_preset=voice_preset)

    # Generate speech; the result is a tensor on `device`
    speech_output = model.generate(**inputs.to(device))

    # Move the waveform to the CPU and convert it to a NumPy array once
    audio_array = speech_output[0].cpu().numpy()
    sampling_rate = model.generation_config.sample_rate

    # Display explicitly: a bare Audio(...) expression inside a function is not rendered
    display(Audio(audio_array, rate=sampling_rate))

    # Save the waveform as a WAV file in the current directory
    scipy.io.wavfile.write("speech_output.wav", rate=sampling_rate, data=audio_array)

In [6]:
voice_preset = "v2/en_speaker_9"
text_prompt = "What motivated you to pursue a career in data science?"

generate_audio(text_prompt, voice_preset)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.
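When `scipy.io.wavfile.write` receives a floating-point array, it stores the samples as floating-point WAV data, which some audio players handle poorly. If that becomes a problem, the waveform can be converted to 16-bit PCM before saving. The following is a sketch using only the standard library; `write_pcm16_wav` is a hypothetical helper, and it assumes a mono waveform with samples in the range -1.0 to 1.0.

```python
import array
import sys
import wave


def write_pcm16_wav(path, samples, sample_rate):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    # Clip each sample to the valid range, then scale it to a signed 16-bit integer
    pcm = array.array("h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples))
    # WAV data is little-endian, so swap bytes on big-endian machines
    if sys.byteorder == "big":
        pcm.byteswap()
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())
```

Inside `generate_audio`, this could be called as `write_pcm16_wav("speech_output_pcm16.wav", speech_output[0].cpu().numpy(), sampling_rate)`; the output file name here is only an example.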


## Source
[suno/bark-small Hugging Face](https://huggingface.co/suno/bark-small)

[suno/bark Hugging Face colab notebook](https://colab.research.google.com/drive/1dWWkZzvu7L9Bunq9zvD-W02RFUXoW-Pd?usp=sharing#scrollTo=qhCf0VZ0WlET)