## Notebook 4: TTS Workflow

The podcast transcript is now ready, so we can generate the audio for the podcast.

In this notebook, we will first learn how to generate audio with both the `suno/bark` and `parler-tts/parler-tts-mini-v1` models.

After that, we will use the output from Notebook 3 to generate our complete podcast.

Note: Please feel free to extend this notebook with newer models. The above two were chosen after some tests using a sample prompt.

⚠️ Warning: This notebook expects the `transformers` version to be `4.43.3` or earlier, so we will downgrade our environment to make sure things run smoothly.
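
As a quick sanity check before running the rest of the notebook, the installed version can be compared against that bound. This is a minimal sketch using only the standard library; `version_tuple`, `is_supported`, and `installed_transformers_version` are hypothetical helper names, not part of `transformers`.

```python
import re
from importlib import metadata

MAX_TRANSFORMERS = "4.43.3"  # upper bound quoted in the warning above


def version_tuple(version):
    """Turn '4.43.3' into (4, 43, 3); stop at the first non-numeric piece (e.g. 'dev0')."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_supported(installed, maximum=MAX_TRANSFORMERS):
    """Return True when the installed version is at or below the bound."""
    return version_tuple(installed) <= version_tuple(maximum)


def installed_transformers_version():
    """Return the installed transformers version string, or None if it is missing."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


installed = installed_transformers_version()
if installed is not None and not is_supported(installed):
    print(f"transformers {installed} is newer than {MAX_TRANSFORMERS}; consider pinning it.")
```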

Credit: [This](https://colab.research.google.com/drive/1dWWkZzvu7L9Bunq9zvD-W02RFUXoW-Pd?usp=sharing#scrollTo=68QtoUqPWdLk) Colab was used for starter code


We can install these packages for speedups

In [6]:
!pip3 install optimum
!pip install -U flash-attn --no-build-isolation
!pip install -U transformers

Collecting transformers
  Using cached transformers-4.46.2-py3-none-any.whl.metadata (44 kB)
Collecting tokenizers<0.21,>=0.20 (from transformers)
  Using cached tokenizers-0.20.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Using cached transformers-4.46.2-py3-none-any.whl (10.0 MB)
Using cached tokenizers-0.20.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.19.1
    Uninstalling tokenizers-0.19.1:
      Successfully uninstalled tokenizers-0.19.1
  Attempting uninstall: transformers
    Found existing installation: transformers 4.43.3
    Uninstalling transformers-4.43.3:
      Successfully uninstalled transformers-4.43.3
Successfully installed tokenizers-0.20.3 transformers-4.46.2


Let's import the necessary frameworks

In [7]:
from IPython.display import Audio
import IPython.display as ipd
from tqdm import tqdm

In [8]:
from transformers import BarkModel, AutoProcessor, AutoTokenizer
import torch
import json
import numpy as np
from parler_tts import ParlerTTSForConditionalGeneration

ModuleNotFoundError: No module named 'parler_tts'

### Testing the Audio Generation

Let's try generating audio using both the models to understand how they work. 

Note the subtle differences in prompting:
- Parler: takes a `description` prompt that sets the speaker profile and speaking pace
- Bark (Suno): accepts expression tags such as `[sigh]` and `[laughs]` inside the text prompt. More notes on the experiments that were run for this notebook are in the [TTS_Notes.md](./TTS_Notes.md) file.

Please set `device = "cuda"` below if you're using a single GPU node.

#### Parler Model

Let's try the Parler model first and generate a short segment in speaker Laura's voice.

In [10]:
# Set up device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model and tokenizer
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

# Define text and description
text_prompt = """
Exactly! And the distillation part is where you take a LARGE-model,and compress-it down into a smaller, more efficient model that can run on devices with limited resources.
"""
description = """
Laura's voice is expressive and dramatic in delivery, speaking at a fast pace with a very close recording that almost has no background noise.
"""
# Tokenize inputs
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(text_prompt, return_tensors="pt").input_ids.to(device)

# Generate audio
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()

# Play audio in notebook
ipd.Audio(audio_arr, rate=model.config.sampling_rate)

NameError: name 'ParlerTTSForConditionalGeneration' is not defined

#### Bark Model

Amazing, let's try the same with Bark now:
- We will set the `voice_preset` to our favorite speaker
- This time we can include expression prompts inside our generation prompt
- Note that you can CAPITALISE words to make the model emphasise them
- You can add hyphens to make the model pause on certain words
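
The tips above can be applied by hand, or with a small text helper like the sketch below. `style_for_bark` and its parameters are hypothetical names introduced for illustration; it only rewrites the prompt string and makes no claim about how strongly Bark reacts to each cue.

```python
import re


def style_for_bark(text, emphasise=(), join_with_next=(), expression=None):
    """Apply the prompting tips above to a plain sentence.

    emphasise:      words to upper-case
    join_with_next: words to hyphenate onto the following word
    expression:     optional tag such as "sigh", prepended as "[sigh]"
    """
    for word in emphasise:
        pattern = rf"\b{re.escape(word)}\b"
        text = re.sub(pattern, lambda m: m.group().upper(), text, flags=re.IGNORECASE)
    for word in join_with_next:
        # Replace the whitespace after the word with a hyphen: "compress it" -> "compress-it"
        pattern = rf"\b({re.escape(word)})\s+"
        text = re.sub(pattern, r"\1-", text, flags=re.IGNORECASE)
    if expression:
        text = f"[{expression}] {text}"
    return text


styled = style_for_bark(
    "And you take a large model and compress it down.",
    emphasise=["large"],
    join_with_next=["compress"],
    expression="sigh",
)
print(styled)
```

The resulting string can be passed to the Bark processor in place of a hand-written `text_prompt`.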

In [11]:
voice_preset = "v2/en_speaker_6"
sampling_rate = 24000

In [12]:
device = "cuda:7"

processor = AutoProcessor.from_pretrained("suno/bark")

#model =  model.to_bettertransformer()
#model = BarkModel.from_pretrained("suno/bark", torch_dtype=torch.float16, attn_implementation="flash_attention_2").to(device)
model = BarkModel.from_pretrained("suno/bark", torch_dtype=torch.float16).to(device)#.to_bettertransformer()

  self.register_buffer("padding_total", torch.tensor(kernel_size - stride, dtype=torch.int64), persistent=False)
Some weights of the model checkpoint at suno/bark were not used when initializing BarkModel: ['codec_model.decoder.layers.0.conv.weight_g', 'codec_model.decoder.layers.0.conv.weight_v', 'codec_model.decoder.layers.10.block.1.conv.weight_g', 'codec_model.decoder.layers.10.block.1.conv.weight_v', 'codec_model.decoder.layers.10.block.3.conv.weight_g', 'codec_model.decoder.layers.10.block.3.conv.weight_v', 'codec_model.decoder.layers.10.shortcut.conv.weight_g', 'codec_model.decoder.layers.10.shortcut.conv.weight_v', 'codec_model.decoder.layers.12.conv.weight_g', 'codec_model.decoder.layers.12.conv.weight_v', 'codec_model.decoder.layers.13.block.1.conv.weight_g', 'codec_model.decoder.layers.13.block.1.conv.weight_v', 'codec_model.decoder.layers.13.block.3.conv.weight_g', 'codec_model.decoder.layers.13.block.3.conv.weight_v', 'codec_model.decoder.layers.13.shortcut.conv.weight_g',

RuntimeError: ('No CUDA GPUs are available', "This image doesn't seem to support current EC2 instance type, please check release notes for supported EC2 instance type")

In [11]:
text_prompt = """
Exactly! [sigh] And the distillation part is where you take a LARGE-model,and compress-it down into a smaller, more efficient model that can run on devices with limited resources.
"""
inputs = processor(text_prompt, voice_preset=voice_preset).to(device)

speech_output = model.generate(**inputs, temperature = 0.9, semantic_temperature = 0.8)
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.


## Bringing it together: Making the Podcast

Okay, now that we understand how both models work, we can use the complete pipeline to generate the entire podcast.

Let's load in our pickle file from earlier and proceed:

In [13]:
import pickle

with open('./resources/podcast_ready_data.pkl', 'rb') as file:
    PODCAST_TEXT = pickle.load(file)

Let's load the Bark model and set its hyperparameters for the discussion:

In [14]:
bark_processor = AutoProcessor.from_pretrained("suno/bark")
bark_model = BarkModel.from_pretrained("suno/bark", torch_dtype=torch.float16).to("cuda:3")
bark_sampling_rate = 24000

Some weights of the model checkpoint at suno/bark were not used when initializing BarkModel: ['codec_model.decoder.layers.0.conv.weight_g', 'codec_model.decoder.layers.0.conv.weight_v', 'codec_model.decoder.layers.10.block.1.conv.weight_g', 'codec_model.decoder.layers.10.block.1.conv.weight_v', 'codec_model.decoder.layers.10.block.3.conv.weight_g', 'codec_model.decoder.layers.10.block.3.conv.weight_v', 'codec_model.decoder.layers.10.shortcut.conv.weight_g', 'codec_model.decoder.layers.10.shortcut.conv.weight_v', 'codec_model.decoder.layers.12.conv.weight_g', 'codec_model.decoder.layers.12.conv.weight_v', 'codec_model.decoder.layers.13.block.1.conv.weight_g', 'codec_model.decoder.layers.13.block.1.conv.weight_v', 'codec_model.decoder.layers.13.block.3.conv.weight_g', 'codec_model.decoder.layers.13.block.3.conv.weight_v', 'codec_model.decoder.layers.13.shortcut.conv.weight_g', 'codec_model.decoder.layers.13.shortcut.conv.weight_v', 'codec_model.decoder.layers.15.conv.weight_g', 'codec_mo

RuntimeError: ('No CUDA GPUs are available', "This image doesn't seem to support current EC2 instance type, please check release notes for supported EC2 instance type")

Now for the Parler model:

In [None]:
parler_model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to("cuda:3")
parler_tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

In [None]:
speaker1_description = """
Laura's voice is expressive and dramatic in delivery, speaking at a moderately fast pace with a very close recording that almost has no background noise.
"""

We will collect the generated audio segments together with their respective sampling rates, since both are needed to assemble the final audio.
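
To see why the rate has to travel with each segment, here is a small sketch with made-up numbers. The 24000 value matches the Bark rate used in this notebook; the 44100 value is only an illustrative assumption, since the notebook reads the real Parler rate from `parler_model.config.sampling_rate`.

```python
def segment_duration(num_samples, sampling_rate):
    """Duration in seconds of a mono segment."""
    return num_samples / sampling_rate


# Two fake segments of silence, stored next to the rate they were generated at
example_segments = [[0.0] * 44100, [0.0] * 36000]
example_rates = [44100, 24000]

durations = [
    segment_duration(len(segment), rate)
    for segment, rate in zip(example_segments, example_rates)
]
total = sum(durations)
print(durations, total)

# Interpreting the second segment at the wrong rate shortens it (and raises its pitch)
wrong_duration = segment_duration(len(example_segments[1]), 44100)
print(round(wrong_duration, 3))
```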

In [None]:
generated_segments = []
sampling_rates = []  # We'll need to keep track of sampling rates for each segment

In [None]:
device="cuda:3"

Function to generate audio for Speaker 1

In [None]:
def generate_speaker1_audio(text):
    """Generate audio using ParlerTTS for Speaker 1"""
    input_ids = parler_tokenizer(speaker1_description, return_tensors="pt").input_ids.to(device)
    prompt_input_ids = parler_tokenizer(text, return_tensors="pt").input_ids.to(device)
    generation = parler_model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    audio_arr = generation.cpu().numpy().squeeze()
    return audio_arr, parler_model.config.sampling_rate

Function to generate audio for Speaker 2

In [None]:
def generate_speaker2_audio(text):
    """Generate audio using Bark for Speaker 2"""
    inputs = bark_processor(text, voice_preset="v2/en_speaker_6").to(device)
    speech_output = bark_model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
    audio_arr = speech_output[0].cpu().numpy()
    return audio_arr, bark_sampling_rate


Helper function to convert the NumPy output from the models into a pydub `AudioSegment`

In [None]:
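import io

from pydub import AudioSegment
from scipy.io import wavfile


def numpy_to_audio_segment(audio_arr, sampling_rate):
    """Convert numpy array to AudioSegment"""
    # Cast to float32 (in case the model returned float16) and clip to [-1, 1]
    # so the conversion to 16-bit PCM cannot overflow
    audio_float = np.clip(audio_arr.astype(np.float32), -1.0, 1.0)
    audio_int16 = (audio_float * 32767).astype(np.int16)
    
    # Create WAV file in memory
    byte_io = io.BytesIO()
    wavfile.write(byte_io, sampling_rate, audio_int16)
    byte_io.seek(0)
    
    # Convert to AudioSegment
    return AudioSegment.from_wav(byte_io)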
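
The float-to-PCM step can be tried in isolation. The sketch below is not the notebook's helper: it swaps `scipy.io.wavfile` for the standard-library `wave` module so it runs without SciPy or pydub, and `float_to_pcm16` and `pcm16_to_wav_bytes` are hypothetical names.

```python
import io
import wave

import numpy as np


def float_to_pcm16(audio):
    """Scale float samples in [-1, 1] to int16, clipping anything outside that range."""
    audio = np.asarray(audio, dtype=np.float32)
    clipped = np.clip(audio, -1.0, 1.0)
    return (clipped * 32767).astype(np.int16)


def pcm16_to_wav_bytes(pcm, sampling_rate):
    """Write mono 16-bit PCM samples to an in-memory WAV file and return its bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(pcm.astype("<i2").tobytes())  # WAV data is little-endian
    return buffer.getvalue()


pcm = float_to_pcm16([0.0, 1.0, -1.0, 2.0])  # the last sample is out of range
wav_bytes = pcm16_to_wav_bytes(pcm, 24000)

# Read the file back to confirm the header matches what was written
with wave.open(io.BytesIO(wav_bytes), "rb") as check:
    frames, rate = check.getnframes(), check.getframerate()
print(pcm.tolist(), frames, rate)
```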

In [15]:
PODCAST_TEXT

'[\n    ("Speaker 1", "Welcome to \'The Knowledge Distillation Podcast\'! I\'m your host, and today we\'re diving into the fascinating world of Knowledge Distillation, a methodology that\'s revolutionizing the way we transfer advanced capabilities from proprietary Large Language Models to their open-source counterparts. We\'re joined by [Speaker 2], who\'s new to this topic, and we\'re excited to explore the ins and outs of Knowledge Distillation together."),\n    ("Speaker 2", "Umm, hi! I\'m excited to be here. So, what is Knowledge Distillation?"),\n    ("Speaker 1", "Knowledge Distillation is a technique that enables us to transfer knowledge from a large, complex model to a smaller, more efficient model. Think of it like distilling a fine wine – we\'re trying to capture the essence of the larger model and put it into a smaller, more manageable package."),\n    ("Speaker 2", "Hmm, that\'s a great analogy! But I\'m still a bit confused – how does this work in practice?"),\n    ("Speak

Most of the time, we argue that Data Structures aren't very useful in everyday life. This time, however, the knowledge comes in handy.

We will take the string from the pickle file and load it as a list of tuples with the help of `ast.literal_eval()`
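
Here is a small self-contained sketch of that step on a made-up two-line transcript. `parse_transcript` is a hypothetical wrapper that also checks the parsed shape; `ast.literal_eval` only accepts Python literals, so a string containing a function call is rejected rather than executed.

```python
import ast

raw_transcript = """[
    ("Speaker 1", "Welcome to the show!"),
    ("Speaker 2", "Umm, hi! I'm excited to be here."),
]"""


def parse_transcript(raw):
    """Parse a stringified list of (speaker, text) tuples and validate its shape."""
    turns = ast.literal_eval(raw)
    if not isinstance(turns, list):
        raise ValueError("expected a list of (speaker, text) tuples")
    for turn in turns:
        is_pair = isinstance(turn, tuple) and len(turn) == 2
        if not is_pair or not all(isinstance(item, str) for item in turn):
            raise ValueError(f"unexpected transcript entry: {turn!r}")
    return turns


turns = parse_transcript(raw_transcript)
for speaker, text in turns:
    print(f"{speaker}: {text}")
```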

In [18]:
import ast
ast.literal_eval(PODCAST_TEXT)

[('Speaker 1',
  "Welcome to this week's episode of AI Insights, where we explore the latest developments in the field of artificial intelligence. Today, we're going to dive into the fascinating world of knowledge distillation, a methodology that transfers advanced capabilities from leading proprietary Large Language Models, or LLMs, to their open-source counterparts. Joining me on this journey is my co-host, who's new to the topic, and I'll be guiding them through the ins and outs of knowledge distillation. So, let's get started!"),
 ('Speaker 2',
  "Sounds exciting! I've heard of knowledge distillation, but I'm not entirely sure what it's all about. Can you give me a brief overview?"),
 ('Speaker 1',
  "Of course! Knowledge distillation is a technique that enables the transfer of knowledge from a large, complex model, like GPT-4 or Gemini, to a smaller, more efficient model, like LLaMA or Mistral. This process allows the smaller model to learn from the teacher model's output, enablin

#### Generating the Final Podcast

Finally, we can loop over the list of tuples and use our helper functions to generate the audio

In [39]:
final_audio = None

for speaker, text in tqdm(ast.literal_eval(PODCAST_TEXT), desc="Generating podcast segments", unit="segment"):
    if speaker == "Speaker 1":
        audio_arr, rate = generate_speaker1_audio(text)
    else:  # Speaker 2
        audio_arr, rate = generate_speaker2_audio(text)
    
    # Convert to AudioSegment (pydub will handle sample rate conversion automatically)
    audio_segment = numpy_to_audio_segment(audio_arr, rate)
    
    # Add to final audio
    if final_audio is None:
        final_audio = audio_segment
    else:
        final_audio += audio_segment

Generating podcast segments:   6%|███▉                                                          | 1/16 [00:20<05:02, 20.16s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.
Generating podcast segments:  19%|███████████▋                                                  | 3/16 [01:02<04:33, 21.06s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.
Generating podcast segments:  31%|███████████████████▍                                          | 5/16 [01:41<03:30, 19.18s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected beh

### Output the Podcast

We can now save this as an MP3 file
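
pydub relies on an external encoder (ffmpeg, or avconv from libav) to write MP3, so it can help to confirm one is on the `PATH` before exporting. A minimal sketch; `find_mp3_encoder` is a hypothetical helper name.

```python
import shutil


def find_mp3_encoder():
    """Return the path of the first available encoder binary, or None if neither is installed."""
    for tool in ("ffmpeg", "avconv"):
        path = shutil.which(tool)
        if path:
            return path
    return None


encoder = find_mp3_encoder()
if encoder is None:
    print("No ffmpeg/avconv found; the MP3 export is likely to fail.")
else:
    print(f"Using encoder at {encoder}")
```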

In [40]:
final_audio.export("./resources/_podcast.mp3", 
                  format="mp3", 
                  bitrate="192k",
                  parameters=["-q:a", "0"])

<_io.BufferedRandom name='_podcast.mp3'>

### Suggested Next Steps:

- Experiment with the prompts: try changing the `SYSTEM_PROMPT` in the earlier notebooks
- Extend the workflow beyond two speakers
- Test other TTS models
- Experiment with speech enhancer models as a step 5
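
For the second suggestion, one possible approach is to replace the `if`/`else` in the generation loop with a lookup table that maps each speaker label to a generator function. The sketch below uses stand-in generators that return silence, so it runs without any TTS model; `fake_tts`, `SPEAKER_BACKENDS`, and `render_turns` are hypothetical names, and in the notebook the table would point at functions like `generate_speaker1_audio` instead.

```python
def fake_tts(voice_name, sampling_rate=24000):
    """Stand-in for a real TTS call: returns one silent sample per character of text."""
    def generate(text):
        return [0.0] * len(text), sampling_rate
    return generate


# One entry per speaker label found in the transcript
SPEAKER_BACKENDS = {
    "Speaker 1": fake_tts("host"),
    "Speaker 2": fake_tts("guest"),
    "Speaker 3": fake_tts("second guest"),
}


def render_turns(turns, backends):
    """Generate (audio, rate) for every turn, failing loudly on unknown speakers."""
    rendered = []
    for speaker, text in turns:
        if speaker not in backends:
            raise ValueError(f"No TTS backend registered for {speaker!r}")
        rendered.append(backends[speaker](text))
    return rendered


rendered = render_turns(
    [("Speaker 1", "hi"), ("Speaker 3", "hello")],
    SPEAKER_BACKENDS,
)
print([(len(audio), rate) for audio, rate in rendered])
```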

In [None]:
#fin