Doc: https://huggingface.co/facebook/mms-tts-por

In [1]:
from transformers import pipeline

modelo = 'facebook/mms-tts-por'

leitor = pipeline('text-to-speech', model=modelo)

Device set to use cpu


In [2]:
import time

texto = "Olá, estou aprendendo Python!"  # Portuguese for "Hello, I'm learning Python!"
inicio = time.time()
fala = leitor(texto)
final = time.time()

In [3]:
print(f'Took {final - inicio:.2f}s to generate the audio')
fala

Took 1.03s to generate the audio


{'audio': array([[-0.00221753, -0.00195846, -0.00168468, ...,  0.00043582,
          0.00043574,  0.00038217]], shape=(1, 38400), dtype=float32),
 'sampling_rate': 16000}

In [4]:
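import IPython.display  # import the submodule explicitly so IPython.display.Audio resolves outside a notebook kernel too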

IPython.display.Audio(data=fala['audio'], rate=fala['sampling_rate'])
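
The player above only works inside a notebook. To keep the generated speech, the samples can be written to a WAV file. The following is a minimal sketch using only the standard library; `save_wav` and the file name `fala.wav` are hypothetical names introduced here, and the sketch assumes mono float samples in the range [-1.0, 1.0], which matches the `float32` array shown in the output above.

```python
import struct
import wave


def save_wav(samples, sampling_rate, path):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    # Clip each sample to the valid range, then scale to signed 16-bit integers.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)              # mono
        wav_file.setsampwidth(2)              # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        # WAV stores PCM data as little-endian integers.
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))


# With the pipeline output above, the call would look like this
# (squeeze() drops the leading channel axis of the (1, N) array):
# save_wav(fala['audio'].squeeze().tolist(), fala['sampling_rate'], 'fala.wav')
```

Converting to a plain list is slow for long clips; for larger outputs a NumPy-based conversion would likely be a better fit.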

## Bark model
Doc: https://huggingface.co/suno/bark-small

In [5]:
modelo = 'suno/bark-small'
leitor = pipeline('text-to-speech', model=modelo, forward_params={'max_new_tokens': 50})

Device set to use cpu


In [6]:
import time

texto = "Olá, estou aprendendo Python!"
inicio = time.time()
fala = leitor(texto)
final = time.time()

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


In [7]:
print(f'Took {final - inicio:.2f}s to generate the audio')
fala

Took 38.31s to generate the audio


{'audio': array([[-0.01525873, -0.01458377, -0.01504359, ...,  0.02820617,
          0.027472  ,  0.02777063]], shape=(1, 66240), dtype=float32),
 'sampling_rate': 24000}

In [8]:
IPython.display.Audio(data=fala['audio'], rate=fala['sampling_rate'])
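
The two timings are easier to compare once they are related to the length of the audio each model produced. The sketch below computes the clip duration from the array shape and sampling rate, plus the ratio of generation time to audio duration (often called the real-time factor). The helper name `real_time_factor` is hypothetical; the numbers are copied from the cell outputs above.

```python
def real_time_factor(elapsed_s, num_samples, sampling_rate):
    """Return (audio duration in seconds, generation time / audio duration)."""
    duration_s = num_samples / sampling_rate
    return duration_s, elapsed_s / duration_s


# (generation time in s, number of samples, sampling rate) from the outputs above.
runs = {
    "facebook/mms-tts-por": (1.03, 38400, 16000),
    "suno/bark-small": (38.31, 66240, 24000),
}

for name, (elapsed_s, num_samples, sampling_rate) in runs.items():
    duration_s, rtf = real_time_factor(elapsed_s, num_samples, sampling_rate)
    print(f"{name}: {duration_s:.2f}s of audio, real-time factor {rtf:.2f}")
```

A factor below 1 means the model generates speech faster than it takes to play it back. On this CPU run, MMS stays well under 1 while Bark is more than ten times slower than real time.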

## Optimizing and configuring the model

pip install accelerate
pip install optimum

In [27]:
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cpu'

In [10]:
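modelo = 'suno/bark-small'
leitor = pipeline(
    'text-to-speech',
    model=modelo,
    # Half precision mainly pays off on a GPU; on CPU it is often slower or unsupported, so fall back to float32 there
    model_kwargs={"torch_dtype": torch.float16 if device == "cuda" else torch.float32},
    forward_params={'max_new_tokens': 50}
)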

Device set to use cpu


In [28]:
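leitor.model = leitor.model.to(device)
# Optional optimizations, left disabled here:
# leitor.model = leitor.model.to_bettertransformer()  # needs optimum, and a transformers version that still provides this method
# leitor.model.enable_cpu_offload()  # needs accelerate and a CUDA device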

## Selecting the model's voice
Doc:

In [29]:
from transformers import AutoProcessor, AutoModel

modelo = 'suno/bark-small'
# The processor only prepares the inputs; max_new_tokens is a generation option, not a processor setting
processador = AutoProcessor.from_pretrained(modelo)
leitor = AutoModel.from_pretrained(modelo)

In [30]:
texto = "Olá, estou aprendendo Python!"
voz = 'v2/pt_speaker_1'

inicio = time.time()
inputs = processador(texto, voice_preset=voz)
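vetor_audio = leitor.generate(**inputs)
fala = {
    # Move the tensor to the CPU before converting, so this also works when the model runs on a GPU
    'audio': vetor_audio.cpu().numpy(),
    'sampling_rate': leitor.generation_config.sample_rate,
}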

final = time.time()

print(f'Took {final - inicio:.2f}s to generate the audio')


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.


Took 58.02s to generate the audio


In [31]:
IPython.display.Audio(data=fala['audio'], rate=fala['sampling_rate'])
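
The preset `'v2/pt_speaker_1'` used above follows a `v2/<language>_speaker_<index>` naming pattern. The sketch below builds the candidate preset names for one language so that several voices can be tried in turn. The helper `bark_voice_presets` is hypothetical, and the default of ten speakers per language (indices 0 to 9) is an assumption based on Bark's published speaker library; check the model card before relying on it.

```python
def bark_voice_presets(language, num_speakers=10):
    """Build preset names such as 'v2/pt_speaker_1' for the given language code."""
    # Assumption: speakers are numbered from 0 to num_speakers - 1.
    return [f"v2/{language}_speaker_{i}" for i in range(num_speakers)]


pt_presets = bark_voice_presets("pt")
print(pt_presets)

# Each name could then be passed to the processor as in the cell above:
# inputs = processador(texto, voice_preset=pt_presets[3])
```

Given the generation time measured above, comparing all voices on CPU would take several minutes, so it may be worth trying only a handful.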