# **EASY IMPLEMENTATION OF TEXT TO SPEECH WITH BARK**

---

📎 README

This notebook provides a straightforward implementation of text-to-speech with the Bark model, using the Transformers library and the BetterTransformer API from Hugging Face's Optimum library.

About:

*   Uses Bark for text-to-speech.
*   Uses BetterTransformer (from Hugging Face Optimum) to optimize the small model 'suno/bark-small'.
*   The large model 'suno/bark' is not compatible with BetterTransformer.
*   Includes optimization techniques.
*   Supports text-to-speech for conversations.

References:

1.  [Hugging Face, Pre-trained models for text-to-speech](https://huggingface.co/learn/audio-course/chapter6/pre-trained_models).
2.  [GitHub suno-ai/bark](https://github.com/suno-ai/bark/tree/main).


In [None]:
# @title #1. ✨ Installing dependencies.

!pip install --quiet git+https://github.com/huggingface/transformers.git &> /dev/null
!pip install --quiet git+https://github.com/huggingface/optimum.git &> /dev/null
!pip install --quiet git+https://github.com/huggingface/accelerate.git &> /dev/null

import warnings
warnings.filterwarnings("ignore")

import torch
import transformers

# @markdown Select the Bark model:
type_model = "suno/bark" #@param ["suno/bark-small", "suno/bark"]

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# @markdown Enable optimization techniques if necessary.

# @markdown Load in fp16: slight degradation in performance, memory footprint reduced by 50%, and a speed gain of 5%.
Load_fp16 = False #@param {type:"boolean"}

# @markdown CPU offload: slight degradation in speed (10%), huge memory footprint reduction (60% 🤯).
CPU_offload = False #@param {type:"boolean"}

from transformers import  BarkModel

if type_model == "suno/bark":

    if Load_fp16:

        # Load the selected (large) checkpoint in half precision.
        model = BarkModel.from_pretrained(type_model, torch_dtype=torch.float16).to(device)

    else:

        model = BarkModel.from_pretrained(type_model).to(device)

elif type_model == "suno/bark-small":

    from optimum.bettertransformer import BetterTransformer

    if Load_fp16:

        # Load in fp16: slight degradation in performance, memory footprint reduced by 50%, speed gain of 5%.
        model = BarkModel.from_pretrained(type_model, torch_dtype=torch.float16)

        # Convert to BetterTransformer: no performance degradation (same results as without it), with a 20% to 30% speed gain.
        model = BetterTransformer.transform(model, keep_original_model=False).to(device)

    else:
        # Load the Bark model in full precision.
        model = BarkModel.from_pretrained(type_model)

        # Convert to BetterTransformer: no performance degradation (same results as without it), with a 20% to 30% speed gain.
        model = BetterTransformer.transform(model, keep_original_model=False).to(device)

# Processor: tokenizes the text and loads the voice-preset embeddings
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained(type_model)

if CPU_offload:

    model.enable_cpu_offload()
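
The setup cell chooses between two checkpoints and three optional optimizations. As a minimal sketch of the intended decision logic, the hypothetical helper below (`LoadPlan` and `plan_loading` are illustrative names, not part of Transformers or Optimum) records which options would apply without downloading any weights. It assumes, as stated in the README above, that only 'suno/bark-small' is passed through BetterTransformer, and that the checkpoint loaded is always the one selected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadPlan:
    """Hypothetical summary of how the setup cell would load a checkpoint."""
    checkpoint: str
    use_fp16: bool
    use_bettertransformer: bool
    use_cpu_offload: bool


def plan_loading(type_model: str, load_fp16: bool, cpu_offload: bool) -> LoadPlan:
    """Mirror the notebook's branching without touching any model weights."""
    if type_model not in ("suno/bark", "suno/bark-small"):
        raise ValueError(f"unknown checkpoint: {type_model!r}")
    return LoadPlan(
        checkpoint=type_model,
        use_fp16=load_fp16,
        # Assumption taken from this notebook: only the small checkpoint
        # goes through BetterTransformer.
        use_bettertransformer=(type_model == "suno/bark-small"),
        use_cpu_offload=cpu_offload,
    )


print(plan_loading("suno/bark-small", load_fp16=True, cpu_offload=False))
print(plan_loading("suno/bark", load_fp16=True, cpu_offload=True))
```

Printing the plan before loading is a cheap way to check that the form fields select the checkpoint you expect.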


# 2. 🗣 Text to speech.

*   Choose a voice_preset.
*   Optionally enable the advanced configuration.
*   Input your text.
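
Long text is synthesized one sentence at a time, with a short pause between sentences. The sketch below illustrates that preparation step using only the standard library. `split_sentences` is a naive regex stand-in for `nltk.sent_tokenize` (which the notebook actually uses), `silence_samples` is a hypothetical helper, and the 24 000 Hz sample rate is an assumed value; the notebook reads the real one from `model.generation_config.sample_rate`.

```python
import re

# Assumed value for illustration; the notebook reads the rate from the model.
SAMPLE_RATE = 24_000


def split_sentences(text: str) -> list[str]:
    """Naive stand-in for nltk.sent_tokenize: split after '.', '!' or '?'."""
    flat = " ".join(text.split())  # collapse newlines and repeated spaces
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]


def silence_samples(seconds: float, sample_rate: int = SAMPLE_RATE) -> int:
    """Number of zero-valued samples needed for a pause of the given length."""
    return int(seconds * sample_rate)


text = """Bark is a text-to-audio model. Does it handle long text?
Short sentences work best!"""

for sentence in split_sentences(text):
    print(repr(sentence))
print(silence_samples(0.25))
```

The regex splitter does not handle abbreviations or decimal numbers, which is one reason to prefer a real tokenizer such as NLTK's for production text.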




In [1]:
# @title # Run, then play the audio!
class CFG:

    # Voice speaker
    #@markdown Select the voice. Bark supports 100+ speakers. [Click here for info!](https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c)
    voice_preset = "v2/es_speaker_5" #@param {type:"string"} #"v2/es_speaker_8"

    # Bark parameters
    #@markdown Enable advanced configuration:

    #@markdown Default: fine_temperature: 0.5, coarse_temperature: 0.8

    # Advanced config
    advance = True #@param {type:"boolean"}
    fine_temperature = 0.5 # @param {type:"slider", min:0, max:1, step:0.1}
    coarse_temperature = 0.8 # @param {type:"slider", min:0, max:1, step:0.1}

import nltk  # we'll use this to split into sentences

nltk.download('punkt')

# prepare the inputs
#text_prompt =  "Hugging Face, Inc. es una empresa estadounidense que desarrolla herramientas para crear aplicaciones utilizando el aprendizaje automático.1​ Es conocida por su biblioteca de transformadores creada para aplicaciones de procesamiento de lenguaje natural y su plataforma que permite a los usuarios compartir conjuntos de datos y modelos de aprendizaje automático." #@param {type:"string"} "Podemos definir Hugging Face como una empresa de tecnologia que se dedica al desarrollo de herramientas y plataformas de procesamiento de lenguaje natural o NLP basadas en inteligencia artificial."

text_prompt = """Oye, ¿has oído hablar de este nuevo modelo de conversión de texto a audio llamado "Bark"?
Aparentemente, es el modelo de conversión de texto a audio más realista y con el sonido más natural
que existe ahora mismo. La gente dice que suena como si hablara una persona real.
Creo que utiliza algoritmos avanzados de aprendizaje automático para analizar y comprender los
matices del habla humana y luego replica esos matices en su propia producción de habla.
Es bastante impresionante y apuesto a que podría usarse para cosas como audiolibros o podcasts.
De hecho, escuché que algunos editores ya están empezando a utilizar Bark para crear audiolibros.
Sería como tener tu propio locutor personal. Realmente creo que Bark va a
cambiar las reglas del juego en el mundo de la tecnología de conversión de texto a audio."""

# For long text, it is best to use a triple-quoted string: text_prompt = """Input text here"""

text_prompt = text_prompt.replace("\n", " ").strip()

sentences = nltk.sent_tokenize(text_prompt)

#speech_output = model.generate(**inputs, num_beams = 4, temperature = 0.5, semantic_temperature = 0.8)

sampling_rate = model.generation_config.sample_rate

import numpy as np

silence = np.zeros(int(0.25 * sampling_rate))  # quarter second of silence
pieces = []
for sentence in sentences:

    inputs = processor(sentence, voice_preset=CFG.voice_preset).to(device)

    if CFG.advance:
        speech_output = model.generate(**inputs, do_sample=True, fine_temperature=CFG.fine_temperature, coarse_temperature=CFG.coarse_temperature)
    else:
        speech_output = model.generate(**inputs)

    pieces += [speech_output.cpu().numpy(), silence.copy()]

import scipy
from scipy.io.wavfile import write as write_wav

pieces =  np.concatenate(pieces, axis=None)
write_wav("tex_to_speech.wav", sampling_rate, pieces)

from IPython.display import Audio

Audio(pieces, rate=sampling_rate)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
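
`np.zeros` creates float64 arrays by default, so concatenating the silence with the generated speech likely yields a float64 array, and some audio players handle floating-point WAV files poorly. One possible workaround, sketched here with only the Python standard library, is to clip the samples to [-1.0, 1.0] and store them as 16-bit PCM. `write_pcm16_wav` is a hypothetical helper, and the 24 000 Hz rate in the usage lines is an assumed value; in the notebook the rate comes from `model.generation_config.sample_rate`.

```python
import array
import os
import sys
import tempfile
import wave


def write_pcm16_wav(path: str, samples, sample_rate: int) -> int:
    """Write mono float samples in [-1.0, 1.0] as 16-bit PCM; return frame count."""
    pcm = array.array("h")  # signed 16-bit integers
    for x in samples:
        clipped = max(-1.0, min(1.0, float(x)))  # guard against overshoot
        pcm.append(int(round(clipped * 32767)))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # bytes per sample
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())
    return len(pcm)


# Usage with a tiny made-up signal; 1.2 is clipped to full scale.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
frames = write_pcm16_wav(demo_path, [0.0, 0.5, -0.5, 1.2], 24_000)
with wave.open(demo_path, "rb") as wav:
    print(frames, wav.getframerate(), wav.getsampwidth())
```

Any iterable of floats works as input, so a flattened NumPy array such as the notebook's `pieces` could be passed directly, at the cost of a slow Python loop for long audio.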




# 3. 🗨 Text to speech for conversations.

3.1   Choose the roles and voice speakers before writing your conversation.

> For example:

```
speaker_lookup = {"Role 1": "v2/en_speaker_8", "Role 2": "v2/en_speaker_5"}
conversation = """
Role 1: Hi!
Role 2: Bye."""
```

3.2   Run the script, then play the result!
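
Each conversation line has the form `Role: text`, and the role selects the voice preset. As a rough sketch of that parsing step, the hypothetical helper `parse_conversation` below (not part of Bark or Transformers) splits on the first `": "` only, so colons inside the spoken text survive, and it fails early when a role has no voice assigned. The role names and presets are illustrative.

```python
def parse_conversation(conversation: str, speaker_lookup: dict[str, str]) -> list[tuple[str, str]]:
    """Return (voice_preset, text) pairs, one per non-empty 'Role: text' line."""
    turns = []
    for raw in conversation.strip().splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines between turns
        # Split on the first ": " only, so colons inside the text survive.
        speaker, sep, text = line.partition(": ")
        if not sep:
            raise ValueError(f"missing 'Role: ' prefix in line: {line!r}")
        if speaker not in speaker_lookup:
            raise KeyError(f"no voice preset for role {speaker!r}")
        turns.append((speaker_lookup[speaker], text.strip()))
    return turns


speaker_lookup = {"Role 1": "v2/en_speaker_8", "Role 2": "v2/en_speaker_5"}
conversation = """
Role 1: Hi!

Role 2: Note this: colons are fine.
"""

for preset, text in parse_conversation(conversation, speaker_lookup):
    print(preset, "->", text)
```

Only the `text` part of each pair would be sent to the processor, so the model does not read the role name aloud.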

In [None]:
# 3.1 Choose the roles and voice speakers, then write your conversation.

# Select the voices. Bark supports 100+ speakers: https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c

# Make a long-form dialog with Bark:

speaker_lookup = {"Samantha": "v2/es_speaker_8", "John": "v2/es_speaker_5"}

conversation = """
Samantha: Oye, ¿has oído hablar de este nuevo modelo de conversión de texto a audio llamado Bark?

John: No, no lo he hecho. ¿Qué tiene de especial?

Samantha: Bueno, aparentemente es el modelo de conversión de texto a audio más realista y natural que existe en este momento. La gente dice que suena como si hablara una persona real.

John: Vaya, eso suena increíble. ¿Como funciona?

Samantha: Creo que utiliza algoritmos avanzados de aprendizaje automático para analizar y comprender los matices del habla humana y luego replica esos matices en su propia producción de voz.

John: Eso es bastante impresionante. ¿Crees que podría usarse para cosas como audiolibros o podcasts?

Samantha: ¡definitivamente! De hecho, escuché que algunos editores ya están empezando a utilizar Bark para crear audiolibros. Y apuesto a que también sería genial para los podcasts.

John: me lo puedo imaginar. Sería como tener tu propio locutor personal.

Samantha: ¡exactamente! Creo que Bark cambiará las reglas del juego en el mundo de la tecnología de conversión de texto a audio.
"""


In [None]:
# @title # Run, then play your conversation!

class CFG:

    # Bark parameters

    #@markdown Enable advanced configuration:

    #@markdown Default: fine_temperature: 0.5, coarse_temperature: 0.8

    # Advanced config
    advance = False #@param {type:"boolean"}
    fine_temperature = 0.5 # @param {type:"slider", min:0, max:1, step:0.05}
    coarse_temperature = 0.8 # @param {type:"slider", min:0, max:1, step:0.05}

conversation = conversation.strip().split("\n")
conversation = [s.strip() for s in conversation if s]

sampling_rate = model.generation_config.sample_rate

import numpy as np
silence = np.zeros(int(0.5 * sampling_rate))  # half a second of silence
pieces = []
for sentence in conversation:
    # Split on the first ": " only, so colons inside the text are preserved.
    speaker, text = sentence.split(": ", 1)

    # Synthesize only the spoken text, not the "Role: " prefix.
    inputs = processor(text, voice_preset=speaker_lookup[speaker]).to(device)

    if CFG.advance:
        speech_output = model.generate(**inputs, do_sample=True, fine_temperature=CFG.fine_temperature, coarse_temperature=CFG.coarse_temperature)
    else:
        speech_output = model.generate(**inputs)

    pieces += [speech_output.cpu().numpy(), silence.copy()]

import scipy
from scipy.io.wavfile import write as write_wav

write_wav("bark_generation.wav", sampling_rate, np.concatenate(pieces, axis=None))

from IPython.display import Audio

Audio(np.concatenate(pieces, axis=None), rate=sampling_rate)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.