## Try Bark from Suno AI

In [1]:
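import os

path_model = "D:\\2025\\Master BKHN\\Ky thuat lap trinh noi dung so\\AI-driven-Virtual-Storyteller\\models"

# Point the Hugging Face cache at the project folder.
# This must be set before transformers is imported to take effect.
os.environ["TRANSFORMERS_CACHE"] = path_model

from transformers import BarkModel

# cache_dir pins the download location for this call explicitly.
model = BarkModel.from_pretrained("suno/bark-small", trust_remote_code=True, cache_dir=path_model)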



In [3]:
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = model.to(device)

In [4]:
device

'cuda:0'

In [5]:
from transformers import AutoProcessor

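# Load the processor from the same checkpoint as the model, into the same cache folder.
processor = AutoProcessor.from_pretrained("suno/bark-small", cache_dir=path_model)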

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


In [6]:
# prepare the inputs
text_prompt = "Once upon a time there was a young boy named Timmy. Timmy loved to play with his toys and was always eager to get something new. One day, Timmy went to the park and got to play with his toys. He quickly picked up some shiny blocks and started to play with them. Soon enough, he found himself getting more and more excited. \n\nTimmy's Mommy explained to Timmy how important it was to have a healthy snack every day."
inputs = processor(text_prompt)

# generate speech
speech_output = model.generate(**inputs.to(device))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
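
The prompt above is fairly long for a single call. Bark is commonly reported to work best on short prompts, producing on the order of a dozen seconds of audio per call, so one option is to split the story into sentence-sized chunks and generate each chunk separately. The sketch below is only an illustration: `split_into_chunks` and `max_chars` are hypothetical names, not part of `transformers`, and the 60-character limit in the example is arbitrary.

```python
import re

def split_into_chunks(text, max_chars=200):
    """Greedily pack whole sentences into chunks of roughly max_chars characters.

    Hypothetical helper: a single sentence longer than max_chars is kept whole.
    """
    # Collapse newlines and repeated spaces, then split after ., ! or ?
    normalized = " ".join(text.split())
    sentences = re.split(r"(?<=[.!?])\s+", normalized)

    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow the chunk, so start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

story = (
    "Once upon a time there was a boy named Timmy. "
    "Timmy loved his toys. One day, Timmy went to the park."
)
chunks = split_into_chunks(story, max_chars=60)
```

Each element of `chunks` could then be passed to `processor(...)` and `model.generate(...)` in turn, exactly as in the cell above.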


In [7]:
from IPython.display import Audio

sampling_rate = model.generation_config.sample_rate
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)
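
When several clips are generated separately (for instance one per sentence), they can be joined into a single waveform before playback. The following is a minimal sketch using NumPy only; `join_clips` and `pause_seconds` are hypothetical names, and the default of 24,000 Hz is an assumption that should be replaced by `model.generation_config.sample_rate` in practice.

```python
import numpy as np

def join_clips(clips, sample_rate=24_000, pause_seconds=0.25):
    """Concatenate 1-D audio clips, inserting a short silence between them."""
    silence = np.zeros(int(sample_rate * pause_seconds), dtype=np.float32)
    pieces = []
    for i, clip in enumerate(clips):
        if i > 0:
            pieces.append(silence)  # pause between consecutive clips only
        pieces.append(np.asarray(clip, dtype=np.float32).reshape(-1))
    if not pieces:
        return np.zeros(0, dtype=np.float32)
    return np.concatenate(pieces)

# Two dummy clips stand in for generated speech here.
joined = join_clips([np.ones(100), np.ones(50)], sample_rate=1000, pause_seconds=0.5)
```

The result can be passed to `Audio(joined, rate=sampling_rate)` like any single clip.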

In [9]:
from scipy.io import wavfile

# Save the generated waveform to disk as a WAV file.
wavfile.write("bark_out.wav", rate=sampling_rate, data=speech_output[0].cpu().numpy())
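
`scipy.io.wavfile.write` stores a float32 array as a 32-bit floating-point WAV, which some players do not handle well. A common workaround is to convert to 16-bit PCM first. The sketch below assumes the waveform lies roughly in [-1, 1] (an assumption about the model output, not something verified here); `to_int16_pcm` is a hypothetical helper name.

```python
import numpy as np

def to_int16_pcm(audio):
    """Convert a float waveform in [-1, 1] to 16-bit PCM samples."""
    audio = np.asarray(audio, dtype=np.float32)
    # Clip first so out-of-range samples saturate instead of wrapping around.
    clipped = np.clip(audio, -1.0, 1.0)
    return (clipped * 32767).astype(np.int16)

pcm = to_int16_pcm(np.array([0.0, 0.5, -1.0, 1.5]))
```

The converted array can be written with the same `wavfile.write` call by passing `data=to_int16_pcm(speech_output[0].cpu().numpy())`.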

In [10]:
voice_preset = "v2/en_speaker_6"

# prepare the inputs
inputs = processor(text_prompt, voice_preset=voice_preset)

# generate speech
speech_output = model.generate(**inputs.to(device))

# let's hear it
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)



In [11]:
voice_preset = "v2/en_speaker_3"

# prepare the inputs
inputs = processor(text_prompt, voice_preset=voice_preset)

# generate speech
speech_output = model.generate(**inputs.to(device))

# let's hear it
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)
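
The two cells above differ only in the preset name, so comparing voices is easier with a list of names to loop over. This sketch assumes the presets follow the pattern `v2/<language>_speaker_<n>` with speakers numbered 0 to 9, which matches the two names used above but is otherwise an assumption; `voice_presets` is a hypothetical helper.

```python
def voice_presets(language="en", count=10, prefix="v2"):
    """Build preset names such as 'v2/en_speaker_6' for one language."""
    return [f"{prefix}/{language}_speaker_{i}" for i in range(count)]

presets = voice_presets("en")
```

Each name in `presets` can be passed as `voice_preset=` to `processor(...)` in a loop to audition the voices one by one.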



In [None]:
# Beam search with lower sampling temperatures; inputs go on the same device as the model.
speech_output = model.generate(**inputs.to(device), num_beams=4, temperature=0.5, semantic_temperature=0.8)

Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)

In [13]:
# Multilingual speech - Simplified Chinese
inputs = processor("惊人的！我会说中文")

# generate speech
speech_output = model.generate(**inputs.to(device))

Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)



In [None]:
# Multilingual speech - French - let's use a voice_preset as well
inputs = processor("Je peux générer du son facilement avec ce modèle.", voice_preset="fr_speaker_3")

# generate speech
speech_output = model.generate(**inputs.to(device))

Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)

In [15]:
# Adding non-speech cues to the input text
inputs = processor("[clears throat] Hello uh ..., my dog is cute [laughter]")


speech_output = model.generate(**inputs.to(device))

Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)



In [16]:
# more advanced prompts!

text_prompt = """
    WOMAN: I would like an oatmilk latte please.
    MAN: Wow, that's expensive!
"""

inputs = processor(text_prompt)

# generate speech
speech_output = model.generate(**inputs.to(device))

Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)
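
For longer dialogues, the speaker-labelled prompt can be assembled from data instead of typed by hand. The helper below is a minimal sketch; `build_dialogue_prompt` is a hypothetical name, and the `SPEAKER: line` layout simply mirrors the prompt above rather than any documented format.

```python
def build_dialogue_prompt(turns):
    """Format (speaker, line) pairs as 'SPEAKER: line', one turn per row."""
    rows = [f"{speaker.upper()}: {line.strip()}" for speaker, line in turns]
    return "\n".join(rows)

dialogue_prompt = build_dialogue_prompt([
    ("woman", "I would like an oatmilk latte please."),
    ("man", "Wow, that's expensive!"),
])
```

The resulting string can be passed to `processor(...)` in place of `text_prompt`.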

