# Text-to-Speech Generation using Bark Model with Intel Extension for PyTorch

## Learning Objectives

After completing this workshop, participants will be able to:

### Remember
- List the key components of the Bark text-to-speech model
- Recall the basic syntax for Intel Extension for PyTorch optimization

### Understand
- Explain how the Bark model converts text to speech
- Describe the role of voice presets in audio generation
- Interpret model inference times and performance metrics

### Apply
- Implement Intel optimizations for PyTorch models
- Use the Bark model for text-to-speech generation
- Configure different voice presets and parameters

### Analyze
- Compare performance between optimized and non-optimized models
- Examine the impact of different text inputs on audio quality

### Evaluate
- Assess the quality of generated audio outputs
- Judge the effectiveness of different optimization techniques

### Create
- Develop custom applications using the Bark model
- Design new text inputs for optimal audio generation

## Setup and Installation

First, let's install the required packages:

In [None]:
!pip install torch intel_extension_for_pytorch transformers scipy

## Import Required Libraries

In [1]:
import torch
import scipy
import time
from transformers import AutoProcessor, BarkModel
from scipy.io.wavfile import write as write_wav

## Implementing the BarkAudioSynthesizer Class

In [2]:
class BarkAudioSynthesizer:
    def __init__(self, model_path='suno/bark-small'):
        self.processor = AutoProcessor.from_pretrained(model_path)
        self.model = BarkModel.from_pretrained(model_path)
        
        # Configure dtype and device
        self.dtype = torch.float32
        if torch.xpu.is_available():
            self.device = torch.device('xpu')
        else:
            print("XPU not available, falling back to CPU")
            self.device = torch.device('cpu')
            
        # Move model to device and optimize
        self.model = self.model.to(self.device)
        self.model = self.model.to(memory_format=torch.channels_last)       
        
            
    def generate_audio(self, text, voice_preset='v2/en_speaker_6', output_file='output.wav'):
        inputs = self.processor(text, voice_preset=voice_preset)
        inputs = {k: v.to(self.device) if hasattr(v, 'to') else v for k, v in inputs.items()}
        
        with torch.inference_mode():
            try:
                # Warmup
                print("Performing warmup inference...")
                audio_array = self.model.generate(**inputs)
                
                # Actual inference
                print("Generating audio...")
                start_time = time.time()
                audio_array = self.model.generate(**inputs)
                if torch.xpu.is_available():
                    torch.xpu.synchronize()
                inference_time = time.time() - start_time
                
                print(f"Inference time: {inference_time:.2f} seconds")
                
                audio_array = audio_array.cpu().numpy().squeeze()
                sample_rate = self.model.generation_config.sample_rate
                write_wav(output_file, sample_rate, audio_array)
                
            except Exception as e:
                print(f"Error during generation: {str(e)}")
                raise
            
        return output_file

## Testing Different Voice Presets

Let's test the model with different voice presets:

In [3]:
synthesizer = BarkAudioSynthesizer()

# Test with male voice
synthesizer.generate_audio(
    "This is a test with a male voice.",
    voice_preset='v2/en_speaker_6',
    output_file='male_voice.wav'
)

# Test with female voice
synthesizer.generate_audio(
    "This is a test with a female voice.",
    voice_preset='v2/en_speaker_9',
    output_file='female_voice.wav'
)

  self.register_buffer("padding_total", torch.tensor(kernel_size - stride, dtype=torch.int64), persistent=False)


Performing warmup inference...


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.


Generating audio...
Inference time: 20.48 seconds


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.


Performing warmup inference...


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.


Generating audio...
Inference time: 31.62 seconds


'female_voice.wav'

## Testing Different Text Types

Let's experiment with different types of text input:

In [None]:
test_texts = [
    "Hello! This is a simple greeting.",
    "1, 2, 3, 4, 5. Testing numbers.",
    "Question: How does this sound as a question?",
    "[laughing] This is an example with emotion!"
]

for i, text in enumerate(test_texts):
    synthesizer.generate_audio(
        text,
        voice_preset='v2/en_speaker_6',
        output_file=f'test_{i}.wav'
    )

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.


Performing warmup inference...


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.


Generating audio...
Inference time: 26.98 seconds


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.


Performing warmup inference...


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.


Generating audio...
Inference time: 70.93 seconds


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.


Performing warmup inference...


## Performance Analysis

Let's measure the inference time for different text lengths:

In [None]:
texts = {
    'short': "Hello.",
    'medium': "This is a medium length sentence for testing.",
    'long': "This is a longer piece of text that we'll use to test how the model performs with more content. It includes multiple sentences and should take longer to process."
}

for length, text in texts.items():
    print(f"\nTesting {length} text:")
    synthesizer.generate_audio(
        text,
        voice_preset='v2/en_speaker_6',
        output_file=f'{length}_text.wav'
    )

In [None]:
##### def main():
    # Initialize synthesizer
    synthesizer = BarkAudioSynthesizer()
    
    # Get user input
    print("\nBark Audio Synthesis Menu:")
    print("1. Use preset example")
    print("2. Enter custom text")
    choice = input("Enter your choice (1 or 2): ")
    
    if choice == '1':
        examples = {
            '1': {
                'text': "This is a simple test of the text-to-speech system.",
                'voice': 'v2/en_speaker_6',
                'desc': 'Basic Test'
            },
            '2': {
                'text': "Hello! How are you doing today?",
                'voice': 'v2/en_speaker_9',
                'desc': 'Greeting'
            },
            '3': {
                'text': "The quick brown fox jumps over the lazy dog.",
                'voice': 'v2/en_speaker_3',
                'desc': 'Pangram'
            },
            '4': {
                'text': "One, two, three, four, five. Testing the numbers.",
                'voice': 'v2/en_speaker_9',
                'desc': 'Numbers'
            },
            '5': {
                'text': "Today's weather is sunny with clear skies.",
                'voice': 'v2/en_speaker_6',
                'desc': 'Weather'
            }
        }
        
        print("\nSelect an example:")
        for key, value in examples.items():
            print(f"{key}. {value['desc']}")
        
        example_choice = input("Enter your choice (1-5): ")
        selected = examples.get(example_choice)
        
        if selected:
            text = selected['text']
            voice_preset = selected['voice']
        else:
            print("Invalid choice. Using default example.")
            text = examples['1']['text']
            voice_preset = examples['1']['voice']
    else:
        text = input("\nEnter the text to synthesize: ")
        print("\nAvailable voice presets:")
        print("1. v2/en_speaker_6 (Default male voice)")
        print("2. v2/en_speaker_9 (Female voice)")
        print("3. v2/en_speaker_3 (Child voice)")
        voice_choice = input("Select voice preset (1-3): ")
        
        voice_presets = {
            '1': 'v2/en_speaker_6',
            '2': 'v2/en_speaker_9',
            '3': 'v2/en_speaker_3'
        }
        voice_preset = voice_presets.get(voice_choice, 'v2/en_speaker_6')
    
    output_file = input("\nEnter output filename (default: output.wav): ") or 'output.wav'
    
    # Generate audio
    try:
        synthesizer.generate_audio(text, voice_preset, output_file)
        print(f"\nAudio generated successfully: {output_file}")
    except Exception as e:
        print(f"Error generating audio: {str(e)}")

if __name__ == '__main__':
    main()

## Summary and Key Takeaways

### Knowledge and Comprehension
- Understood the basic architecture of the Bark text-to-speech model
- Learned how to implement Intel Extension for PyTorch optimizations

### Application and Analysis
- Successfully implemented text-to-speech generation with different voice presets
- Analyzed performance characteristics with different input types
- Applied optimization techniques for improved performance

### Synthesis and Evaluation
- Created a functional text-to-speech system with Intel optimizations
- Evaluated the performance impact of different input types and optimization techniques

### Best Practices
1. Always perform a warmup inference for more accurate timing measurements
2. Use appropriate voice presets for different types of content
3. Consider text length and complexity when optimizing performance
4. Implement proper error handling for production environments

### Next Steps
- Experiment with different model variants
- Explore advanced voice customization options
- Implement batch processing for multiple texts
- Optimize for specific hardware configurations