# Multimodal Generative AI with Hugging Face

This notebook explores generative AI across several modalities (text, images, speech, video, and music), not just text-to-image generation, using Hugging Face's Transformers, Diffusers, and related libraries with PyTorch.

## Environment Setup

First, let's install the necessary libraries for working with different modalities using PyTorch.

In [39]:
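# Install required packages - PyTorch-only version (no TensorFlow/Keras dependencies)
# Dependencies are resolved normally so that packages such as tokenizers and huggingface_hub are installed;
# sentencepiece is needed by the SpeechT5 tokenizer used later
%pip install torch torchvision torchaudio "transformers[torch]" datasets accelerate diffusers librosa scipy soundfile matplotlib tqdm sentence-transformers sentencepiece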
%pip uninstall -y tensorflow keras tf-keras

Note: you may need to restart the kernel to use updated packages.
Found existing installation: tensorflow 2.19.0
Uninstalling tensorflow-2.19.0:
  Successfully uninstalled tensorflow-2.19.0
Found existing installation: keras 3.9.2
Uninstalling keras-3.9.2:
  Successfully uninstalled keras-3.9.2
Found existing installation: tf_keras 2.19.0
Uninstalling tf_keras-2.19.0:
  Successfully uninstalled tf_keras-2.19.0
Note: you may need to restart the kernel to use updated packages.


In [40]:
# Set environment variables to force PyTorch-only usage (set these before transformers is first imported)
import os
os.environ["USE_TF"] = "0"  # Disable TensorFlow in transformers library
os.environ["TRANSFORMERS_NO_TF"] = "1"  # Explicitly tell transformers not to import TensorFlow
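
Before moving on, it can help to confirm which of the packages from the install cell are actually present in the current kernel. The sketch below uses only the standard library; the helper name `installed_versions` and the shortened package list are illustrative choices, not part of the original notebook.

```python
from importlib import metadata

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

# A subset of the packages named in the install cell above
required = ["torch", "transformers", "diffusers", "accelerate", "datasets", "soundfile"]
for name, version in installed_versions(required).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as missing points to an incomplete install rather than a problem in the later cells.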

In [41]:
# Check GPU availability for PyTorch
import torch

if torch.cuda.is_available():
    device = "cuda"
    gpu_name = torch.cuda.get_device_name(0)
    print(f"GPU available: {gpu_name}")
    print(f"CUDA version: {torch.version.cuda}")
    print(f"PyTorch CUDA enabled: {torch.cuda.is_available()}")
    print(f"Current device: {torch.cuda.current_device()}")
    print(f"Memory allocated: {torch.cuda.memory_allocated(0) / 1024**2:.2f} MB")
    print(f"Memory reserved: {torch.cuda.memory_reserved(0) / 1024**2:.2f} MB")
else:
    device = "cpu"
    print("No GPU available, using CPU. Some models will run slowly or require reduced precision.")

print(f"\nPyTorch version: {torch.__version__}")
print(f"Using device: {device}")

# Models and inputs are moved to the GPU explicitly with .to(device) in later cells.
# torch.set_default_tensor_type is deprecated since PyTorch 2.1, so it is not used here.
if device == "cuda":
    # Enable TF32 for faster computation on Ampere GPUs (RTX 30xx series and newer)
    if torch.cuda.get_device_capability(0)[0] >= 8:
        print("Enabling TF32 for faster computation on Ampere or newer GPUs")
        torch.backends.cuda.matmul.allow_tf32 = True

GPU available: NVIDIA GeForce RTX 4060
CUDA version: 12.4
PyTorch CUDA enabled: True
Current device: 0
Memory allocated: 496.22 MB
Memory reserved: 1212.00 MB

PyTorch version: 2.5.1
Using device: cuda
Enabling TF32 for faster computation on Ampere or newer GPUs


## PyTorch Configuration and Best Practices

Let's configure PyTorch for optimal performance and check available hardware acceleration features.

In [42]:
# Configure PyTorch for optimal performance
def configure_pytorch():
    config = {"device": device}
    
    # Check for MPS (Metal Performance Shaders on Apple silicon); this is independent of CUDA
    config["has_mps"] = hasattr(torch.backends, "mps") and torch.backends.mps.is_available()
    
    if device == "cuda":
        # Check half-precision support
        config["supports_half"] = True
        config["supports_bfloat16"] = torch.cuda.is_bf16_supported()
        
        # Enable cuDNN benchmark mode before recording it, so the report reflects the active setting
        # Autotuning is helpful when input sizes don't vary much
        torch.backends.cudnn.benchmark = True
        
        # Check for CUDA optimization features
        config["cudnn_enabled"] = torch.backends.cudnn.enabled
        config["cudnn_benchmark"] = torch.backends.cudnn.benchmark
        
        # Get GPU details
        config["gpu_name"] = torch.cuda.get_device_name(0)
        config["gpu_capability"] = torch.cuda.get_device_capability(0)
        config["gpu_count"] = torch.cuda.device_count()
    
    return config

# Apply and print configuration
pytorch_config = configure_pytorch()
print("PyTorch Configuration:")
for key, value in pytorch_config.items():
    print(f"  {key}: {value}")

# Set the default dtype based on hardware capabilities for better model loading
if device == "cuda" and pytorch_config.get("supports_half", False):
    if pytorch_config.get("supports_bfloat16", False):
        print("\nUsing bfloat16 precision for faster computation and better numeric stability")
        default_dtype = torch.bfloat16
    else:
        print("\nUsing float16 precision for faster computation")
        default_dtype = torch.float16
else:
    print("\nUsing standard float32 precision")
    default_dtype = torch.float32

PyTorch Configuration:
  device: cuda
  has_mps: False
  supports_half: True
  supports_bfloat16: True
  cudnn_enabled: True
  cudnn_benchmark: True
  gpu_name: NVIDIA GeForce RTX 4060
  gpu_capability: (8, 9)
  gpu_count: 1

Using bfloat16 precision for faster computation and better numeric stability
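
The "better numeric stability" of bfloat16 comes from its bit layout: it keeps the 8 exponent bits of float32 and gives up mantissa bits, whereas float16 has only 5 exponent bits. The following is a small worked calculation of the resulting limits, using plain Python rather than PyTorch; the helper name `float_format_limits` is made up for this illustration.

```python
def float_format_limits(exponent_bits, mantissa_bits):
    """Return (largest finite value, machine epsilon) for an IEEE-style binary float."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_finite = (2 - 2 ** -mantissa_bits) * 2.0 ** bias
    epsilon = 2.0 ** -mantissa_bits
    return max_finite, epsilon

# (exponent bits, mantissa bits) for each format
formats = {
    "float16": (5, 10),
    "bfloat16": (8, 7),
    "float32": (8, 23),
}
for name, (e_bits, m_bits) in formats.items():
    max_finite, eps = float_format_limits(e_bits, m_bits)
    print(f"{name:>8}: max ~ {max_finite:.4g}, epsilon = {eps:.3g}")
```

float16 overflows just above 65504, while bfloat16 covers roughly the same range as float32 at coarser precision, which is why it is preferred here when the GPU supports it.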


In [43]:
# Import common libraries
import torch
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
from diffusers import DiffusionPipeline, StableDiffusionPipeline
import IPython.display as ipd
from tqdm.notebook import tqdm

## 1. Text Generation

Let's start with text generation using a pre-trained language model from Hugging Face with PyTorch.

In [44]:
# PyTorch text generation
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_name = "gpt2"
print(f"Loading {model_name} model and tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
print(f"Model loaded on {device}")

# Generate text
prompt = "Multimodal AI can help us"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Generate with sampling; temperature, top_k, and top_p only take effect when do_sample=True
print("Generating text...")
output = model.generate(
    **inputs, 
    max_length=50,
    num_return_sequences=1,
    do_sample=True,   # Sample instead of greedy decoding
    temperature=0.8,  # Control randomness (higher = more random)
    top_k=50,         # Sample from top K likely tokens
    top_p=0.95        # Nucleus sampling - sample from tokens comprising top p probability mass
)

# Print the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print("\nGenerated output:")
print(generated_text)

Loading gpt2 model and tokenizer...


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Model loaded on cuda
Generating text...

Generated output:
Multimodal AI can help us to understand the world around us.

The AI is able to predict the future, and it can also predict the future of the world.

The AI can also predict the future of the world.
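
To make the `temperature`, `top_k`, and `top_p` arguments more concrete, here is a conceptual sketch of how they narrow the set of candidate tokens before one is drawn. It is written in plain Python on made-up logits and is not the transformers implementation; the function name `sampling_distribution` and the toy scores are assumptions for illustration.

```python
import math
import random

def sampling_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k, and top-p filtering."""
    # Temperature rescales logits before the softmax: <1 sharpens, >1 flattens
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    ranked = sorted(enumerate(weights), key=lambda pair: pair[1], reverse=True)

    # Top-k: keep only the k most likely tokens (0 disables the filter)
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    total = sum(weight for _, weight in ranked)
    kept, cumulative = [], 0.0
    for index, weight in ranked:
        kept.append((index, weight))
        cumulative += weight / total
        if cumulative >= top_p:
            break

    # Renormalise so the surviving probabilities sum to 1
    norm = sum(weight for _, weight in kept)
    return {index: weight / norm for index, weight in kept}

toy_logits = [4.0, 3.0, 2.0, 1.0, 0.0]  # made-up scores for a 5-token vocabulary
dist = sampling_distribution(toy_logits, temperature=0.8, top_k=3, top_p=0.95)
token = random.choices(list(dist), weights=list(dist.values()))[0]
print(dist, "-> sampled token", token)
```

With `top_k=1` the distribution collapses to the single most likely token, which is equivalent to greedy decoding.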



In [45]:
# Alternative: Use Hugging Face pipeline (PyTorch-based)
try:
    text_generator = pipeline('text-generation', model='gpt2', device=0 if device == 'cuda' else -1)
    use_cuda = device == 'cuda'  # Only report GPU use when the pipeline was actually placed on the GPU
except RuntimeError as e:
    print(f"GPU acceleration failed: {e}\nFalling back to CPU")
    text_generator = pipeline('text-generation', model='gpt2', device=-1)
    use_cuda = False

# Generate text
prompt = "Multimodal AI can help us"
results = text_generator(prompt, max_length=50, num_return_sequences=1)

# Print the generated text
print(f"Using {'GPU' if use_cuda else 'CPU'} for inference")
print(results[0]['generated_text'])

GPU acceleration failed: Failed to import transformers.models.gpt2.modeling_tf_gpt2 because of the following error (look up to see its traceback):
Your currently installed version of Keras is Keras 3, but this is not yet supported in Transformers. Please install the backwards-compatible tf-keras package with `pip install tf-keras`.
Falling back to CPU


RuntimeError: Failed to import transformers.models.gpt2.modeling_tf_gpt2 because of the following error (look up to see its traceback):
Your currently installed version of Keras is Keras 3, but this is not yet supported in Transformers. Please install the backwards-compatible tf-keras package with `pip install tf-keras`.

## 2. Image Generation

Next, let's explore image generation using Stable Diffusion from Hugging Face's Diffusers library.

In [None]:
# Load a stable diffusion model with appropriate device and memory optimization
def load_image_generator(model_id="runwayml/stable-diffusion-v1-5"):
    if device == "cuda":
        try:
            # Try half precision first for better memory efficiency
            return StableDiffusionPipeline.from_pretrained(
                model_id, torch_dtype=default_dtype).to(device)
        except RuntimeError as e:
            print(f"Half precision failed: {e}\nTrying full precision")
            try:
                # Try full precision
                return StableDiffusionPipeline.from_pretrained(model_id).to(device)
            except RuntimeError as e:
                print(f"Full precision failed: {e}\nFalling back to CPU")
                return StableDiffusionPipeline.from_pretrained(model_id).to("cpu")
    else:
        print("Using CPU for image generation (will be slow)")
        return StableDiffusionPipeline.from_pretrained(model_id).to("cpu")

# Generate an image with progress bar
def generate_image(prompt, num_steps=30):
    print(f"Generating image for prompt: '{prompt}'")
    if device == "cpu":
        print("Warning: Using CPU for image generation may take several minutes!")
    
    generator = load_image_generator()
    
    # Create progress bar callback
    # (callback_on_step_end replaces the deprecated `callback` argument in recent diffusers releases)
    progress_bar = tqdm(total=num_steps)
    def callback_fn(pipe, step, timestep, callback_kwargs):
        progress_bar.update(1)
        return callback_kwargs
    
    # Generate the image
    with torch.inference_mode():  # More efficient than no_grad for inference
        image = generator(prompt, num_inference_steps=num_steps, callback_on_step_end=callback_fn).images[0]
    progress_bar.close()
    return image

# Generate and display an image
try:
    prompt = "A beautiful digital painting of a futuristic city with flying vehicles"
    image = generate_image(prompt)
    display(image)
except Exception as e:
    print(f"Image generation failed: {e}")
    print("Try using a simpler model or reducing the image size")

## 3. Audio Generation and Speech Synthesis

Now let's look at audio generation and text-to-speech synthesis using PyTorch-based models.

In [None]:
# Text-to-Speech using SpeechT5 (PyTorch-based)
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
from datasets import load_dataset
import torch
import soundfile as sf

# Function to synthesize speech with error handling - PyTorch focused
def synthesize_speech(text, output_file="speech.wav"):
    try:
        print(f"Loading TTS models to {device}...")
        # Load processor, model, and vocoder
        processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
        model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts").to(device)
        vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan").to(device)
        
        # Load speaker embeddings
        print("Loading speaker embeddings...")
        embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
        speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0).to(device)
        
        # Synthesize speech
        print(f"Synthesizing speech for text: '{text}'")
        inputs = processor(text=text, return_tensors="pt").to(device)
        
        # Use inference mode for better memory efficiency
        with torch.inference_mode():
            speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
        
        # Save and play audio
        sf.write(output_file, speech.cpu().numpy(), samplerate=16000)
        print(f"Speech saved to {output_file}")
        return output_file
    except Exception as e:
        print(f"Speech synthesis failed: {e}")
        return None

# Generate speech
text = "Multimodal AI is revolutionizing how we interact with computers."
speech_file = synthesize_speech(text)
if speech_file:
    ipd.display(ipd.Audio(speech_file))

## 4. Text-to-Video Generation

Let's explore text-to-video generation using PyTorch-based models.

In [None]:
# Text-to-video generation with PyTorch-optimized implementation
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

def generate_video(prompt, num_steps=25, output_file="generated_video.mp4"):
    try:
        print(f"Loading text-to-video model on {device}...")
        if device == "cpu":
            print("WARNING: Video generation on CPU may be extremely slow or fail due to memory constraints")
            pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b")
        else:  # Using CUDA
            try:
                # Use optimal precision based on GPU capabilities
                pipe = DiffusionPipeline.from_pretrained(
                    "damo-vilab/text-to-video-ms-1.7b", 
                    torch_dtype=default_dtype
                )
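                # Enable memory-efficient attention if xformers is installed.
                # The method always exists on the pipeline but raises without xformers,
                # so handle that here instead of falling back to full precision.
                try:
                    pipe.enable_xformers_memory_efficient_attention()
                except Exception as e:
                    print(f"xFormers attention unavailable ({e}); using default attention")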
            except Exception as e:
                print(f"Loading with optimized settings failed: {e}\nTrying with default precision")
                pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b")
        
        pipe = pipe.to(device)
        
        # Add progress bar
        progress_bar = tqdm(total=num_steps)
        def callback_fn(step, timestep, latents):
            progress_bar.update(1)
        
        # Generate video with PyTorch's inference_mode for efficiency
        print(f"Generating video for prompt: '{prompt}'")
        with torch.inference_mode():
            # Recent diffusers releases return a batch of videos in `.frames`, so take the first one
            video_frames = pipe(prompt, num_inference_steps=num_steps, callback=callback_fn).frames[0]
        progress_bar.close()
        
        # Export and return video path
        export_to_video(video_frames, output_file)
        print(f"Video saved to {output_file}")
        return output_file
    except Exception as e:
        print(f"Video generation failed: {e}")
        print("This might be due to insufficient memory. Try reducing resolution or inference steps.")
        return None

# Generate a short video (use a smaller number of steps for faster generation)
prompt = "A rocket launching from Earth into space"
video_file = generate_video(prompt, num_steps=15)
if video_file:
    ipd.display(ipd.Video(video_file))

## 5. Music Generation

Now let's generate music using PyTorch-based models.

In [None]:
# Music generation using Facebook's MusicGen model (PyTorch-based)
from transformers import AutoProcessor, MusicgenForConditionalGeneration

def generate_music(prompt, duration_seconds=5, output_file="generated_music.wav"):
    try:
        print(f"Loading music generation model on {device}...")
        model_id = "facebook/musicgen-small"  # Use small model for reduced memory usage
        
        # Load model and processor with optimal settings for PyTorch
        processor = AutoProcessor.from_pretrained(model_id)
        model = MusicgenForConditionalGeneration.from_pretrained(
            model_id,
            torch_dtype=default_dtype if device == "cuda" else torch.float32
        ).to(device)
        
        # Calculate tokens based on duration (approximate conversion)
        max_tokens = int(duration_seconds * 50)  # ~50 tokens per second
        
        # Generate music with progress tracking
        print(f"Generating music for prompt: '{prompt}' ({duration_seconds} seconds)")
        inputs = processor(
            text=[prompt],
            padding=True,
            return_tensors="pt",
        ).to(device)
        
        # Generate with PyTorch inference_mode
        with torch.inference_mode():
            audio_values = model.generate(**inputs, max_new_tokens=max_tokens)
        
        # Convert to numpy and save (cast to float32 first: NumPy has no bfloat16 dtype)
        sampling_rate = model.config.audio_encoder.sampling_rate
        audio_data = audio_values[0, 0].float().cpu().numpy()
        sf.write(output_file, audio_data, sampling_rate)
        
        print(f"Music saved to {output_file}")
        return audio_data, sampling_rate
    except Exception as e:
        print(f"Music generation failed: {e}")
        return None, None

# Generate music
music_prompt = "An electronic dance track with a strong beat"
audio_data, sampling_rate = generate_music(music_prompt)
if audio_data is not None:
    ipd.display(ipd.Audio(audio_data, rate=sampling_rate))
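
The duration-to-token conversion in `generate_music` can be checked with simple arithmetic. The sketch below assumes the roughly 50 tokens per second stated in the cell and a 32 kHz output sampling rate; both constants are assumptions here, and the cell itself reads the real sampling rate from the model config.

```python
FRAME_RATE_HZ = 50        # assumed: audio tokens generated per second of audio
SAMPLE_RATE_HZ = 32_000   # assumed: output sampling rate of the audio decoder

def plan_generation(duration_seconds):
    """Return (max_new_tokens, expected number of audio samples) for a target duration."""
    max_new_tokens = int(duration_seconds * FRAME_RATE_HZ)
    expected_samples = int(max_new_tokens / FRAME_RATE_HZ * SAMPLE_RATE_HZ)
    return max_new_tokens, expected_samples

for seconds in (5, 10, 30):
    tokens, samples = plan_generation(seconds)
    print(f"{seconds:>2} s -> {tokens} tokens, about {samples} samples")
```

Generation time grows with the token count, so short durations are a sensible starting point on limited hardware.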

## 6. Cross-Modal Generation: Image-to-Text (Image Captioning)

In [None]:
# Image captioning using BLIP (PyTorch-based)
from PIL import Image
import requests
from transformers import BlipProcessor, BlipForConditionalGeneration

def caption_image(image_path=None, image_url=None):
    try:
        # Load model and processor
        print(f"Loading image captioning model on {device}...")
        processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
        model = BlipForConditionalGeneration.from_pretrained(
            "Salesforce/blip-image-captioning-base", 
            torch_dtype=default_dtype if device == "cuda" else torch.float32
        ).to(device)
        
        # Get image from URL or path
        if image_url:
            print(f"Downloading image from {image_url}")
            image = Image.open(requests.get(image_url, stream=True).raw)
        elif image_path:
            print(f"Loading image from {image_path}")
            image = Image.open(image_path)
        else:
            # Use a default example image
            print("Using default example image")
            url = "http://images.cocodataset.org/val2017/000000039769.jpg"
            image = Image.open(requests.get(url, stream=True).raw)
        
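        # Generate caption with PyTorch's inference_mode
        # Cast pixel values to the model's dtype (bfloat16/float16 on GPU) to avoid a dtype mismatch
        inputs = processor(image, return_tensors="pt").to(device, model.dtype)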
        with torch.inference_mode():
            out = model.generate(**inputs)
        caption = processor.decode(out[0], skip_special_tokens=True)
        
        return image, caption
    except Exception as e:
        print(f"Image captioning failed: {e}")
        return None, None

# Generate caption for the default example image
image, caption = caption_image()
if image:
    display(image)
    print(f"Caption: {caption}")

## 7. Visual Question Answering (VQA)

Let's explore how PyTorch-based models can answer questions about images.

In [None]:
# Visual Question Answering with PyTorch optimization
from transformers import ViltProcessor, ViltForQuestionAnswering

def answer_visual_question(image, question):
    try:
        print(f"Loading VQA model on {device}...")
        processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
        model = ViltForQuestionAnswering.from_pretrained(
            "dandelin/vilt-b32-finetuned-vqa",
            torch_dtype=default_dtype if device == "cuda" else torch.float32
        ).to(device)
        
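        # Prepare inputs
        print(f"Answering: '{question}'")
        inputs = processor(image, question, return_tensors="pt").to(device)
        # Match the model's dtype (bfloat16/float16 on GPU) to avoid a dtype mismatch
        inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)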
        
        # Run inference with memory optimization
        with torch.inference_mode():
            outputs = model(**inputs)
        logits = outputs.logits
        idx = logits.argmax(-1).item()
        return model.config.id2label[idx]
    except Exception as e:
        print(f"VQA failed: {e}")
        return None

# Use the same image from the previous cell for VQA
if 'image' in locals() and image is not None:
    questions = [
        "How many cats are in the image?",
        "What color are the cats?",
        "Where are the cats sitting?"
    ]
    
    print("Visual Question Answering results:")
    for question in questions:
        answer = answer_visual_question(image, question)
        if answer:
            print(f"Q: {question}")
            print(f"A: {answer}\n")
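
`answer_visual_question` returns only the single highest-scoring label. When an answer looks doubtful, it can be useful to see the runners-up as well. The sketch below ranks candidate answers from raw logits and attaches softmax-normalised scores; these are relative scores for illustration, not calibrated probabilities. The logits and label map are hypothetical, and `top_answers` is not part of the notebook.

```python
import math

def top_answers(logits, id2label, k=3):
    """Return the k highest-scoring (label, normalised score) pairs from raw logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    ranked = sorted(enumerate(exps), key=lambda pair: pair[1], reverse=True)
    return [(id2label[i], e / total) for i, e in ranked[:k]]

# Hypothetical logits and label map, for illustration only
toy_id2label = {0: "1", 1: "2", 2: "3", 3: "yes"}
toy_logits = [0.5, 3.0, 1.0, -2.0]
for label, score in top_answers(toy_logits, toy_id2label):
    print(f"{label}: {score:.2%}")
```

With a real model, the same idea applies to `outputs.logits[0].tolist()` and `model.config.id2label`.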

## 8. Practical Applications and Multimodal Combinations

Let's explore a more complex example combining multiple modalities with PyTorch.

In [None]:
# Example: Generate an image from text, then generate a caption for verification
def generate_and_caption_image(prompt):
    print(f"\n=== Starting multimodal pipeline for: '{prompt}' ===")
    
    # 1. Generate image from text
    print("\n[STEP 1/2] Generating image from text prompt...")
    generated_image = None
    try:
        image_generator = StableDiffusionPipeline.from_pretrained(
            "runwayml/stable-diffusion-v1-5", 
            torch_dtype=default_dtype if device == "cuda" else torch.float32
        ).to(device)
        with torch.inference_mode():
            generated_image = image_generator(prompt).images[0]
        print("✓ Image generated successfully!")
    except Exception as e:
        print(f"× Image generation failed: {e}")
        return None, None, None
    
    # 2. Generate caption from the image
    print("\n[STEP 2/2] Generating caption for the image...")
    try:
        caption_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
        caption_model = BlipForConditionalGeneration.from_pretrained(
            "Salesforce/blip-image-captioning-base",
            torch_dtype=default_dtype if device == "cuda" else torch.float32
        ).to(device)
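        # Cast pixel values to the caption model's dtype to avoid a dtype mismatch on GPU
        inputs = caption_processor(generated_image, return_tensors="pt").to(device, caption_model.dtype)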
        with torch.inference_mode():
            out = caption_model.generate(**inputs)
        generated_caption = caption_processor.decode(out[0], skip_special_tokens=True)
        print("✓ Caption generated successfully!")
    except Exception as e:
        print(f"× Caption generation failed: {e}")
        return generated_image, None, None
    
    # 3. Calculate similarity between original prompt and generated caption
    similarity_score = None
    try:
        from sentence_transformers import SentenceTransformer, util
        print("\n[BONUS] Calculating semantic similarity between prompt and caption...")
        model = SentenceTransformer('all-MiniLM-L6-v2').to(device if device == "cuda" else "cpu")
        with torch.inference_mode():
            embedding1 = model.encode(prompt, convert_to_tensor=True)
            embedding2 = model.encode(generated_caption, convert_to_tensor=True)
        similarity_score = float(util.pytorch_cos_sim(embedding1, embedding2)[0][0])
        print(f"✓ Similarity score calculated: {similarity_score:.4f} (cosine similarity, range -1 to 1)")
    except Exception as e:
        print(f"× Similarity calculation failed: {e}")
    
    print("\n=== Multimodal pipeline complete! ===")
    return generated_image, generated_caption, similarity_score

# Run the multimodal pipeline
original_prompt = "A surreal painting of a floating island with waterfalls"
generated_image, generated_caption, similarity = generate_and_caption_image(original_prompt)

if generated_image is not None:
    # Display results
    print(f"\nOriginal prompt: {original_prompt}")
    display(generated_image)
    print(f"Generated caption: {generated_caption}")
    if similarity is not None:
        print(f"Semantic similarity: {similarity:.4f}")
        # Interpret similarity
        if similarity > 0.7:
            print("✓ High similarity: The image closely matches the original prompt")
        elif similarity > 0.5:
            print("⚠ Moderate similarity: The image somewhat matches the original prompt")
        else:
            print("× Low similarity: The image may not match the original prompt well")
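
The similarity score above is a cosine similarity between two embedding vectors, so it lies between -1 and 1 and depends only on the angle between them, not on their lengths. The sketch below shows the calculation in plain Python on toy vectors; the 0.7 and 0.5 thresholds in the cell are heuristics rather than properties of the metric.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))    # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # orthogonal
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite direction
```

Sentence embeddings for unrelated texts tend to land near zero rather than at the negative extreme, which is one reason thresholds need tuning per embedding model.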

## 9. Multimodal Chatbots

This section demonstrates how to create a simple multimodal chatbot using PyTorch-based models.

In [None]:
# PyTorch-based multimodal chatbot
# LLaVA checkpoints are loaded with LlavaForConditionalGeneration rather than AutoModelForCausalLM
from transformers import AutoProcessor, LlavaForConditionalGeneration, BitsAndBytesConfig
import torch
from PIL import Image
import requests

def create_multimodal_chatbot(model_id="llava-hf/llava-1.5-7b-hf"):
    try:
        print(f"Loading multimodal chatbot model on {device}...")
        print("This may take a while as the model is large.")
        
        if device == "cuda":
            # Use optimal dtype and 4-bit quantization for memory efficiency
            # (4-bit loading requires the bitsandbytes package)
            dtype = default_dtype
            processor = AutoProcessor.from_pretrained(model_id)
            model = LlavaForConditionalGeneration.from_pretrained(
                model_id,
                torch_dtype=dtype,
                device_map="auto",  # Automatically handle device placement
                quantization_config=BitsAndBytesConfig(
                    load_in_4bit=True,  # Enable 4-bit quantization for memory efficiency
                    bnb_4bit_compute_dtype=dtype,
                ),
            )
        else:
            print("Warning: Running without GPU will be very slow and may fail")
            processor = AutoProcessor.from_pretrained(model_id)
            model = LlavaForConditionalGeneration.from_pretrained(model_id)
            
        print("Multimodal chatbot loaded successfully!")
        return processor, model
    except Exception as e:
        print(f"Failed to load multimodal chatbot: {e}")
        print("Consider using a smaller model or enabling GPU acceleration.")
        return None, None

def chat_with_image(processor, model, image_url, question):
    if processor is None or model is None:
        print("Chatbot not loaded successfully.")
        return
        
    try:
        # Download image from URL
        image = Image.open(requests.get(image_url, stream=True).raw)
        display(image)
        
        # Process inputs using the LLaVA-1.5 prompt format; keyword arguments avoid
        # depending on the processor's positional argument order
        prompt = f"USER: <image>\n{question} ASSISTANT:"
        inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, model.dtype)
        
        # Generate response
        print(f"User: {question}")
        print("Assistant: ", end="")
        
        # Generate with PyTorch's inference_mode
        with torch.inference_mode():
            output = model.generate(
                **inputs,
                max_new_tokens=500,
                do_sample=True,
                temperature=0.7,
                top_k=50,
                top_p=0.95,
            )
        
        # Process output: keep only the text after the final assistant marker
        response = processor.decode(output[0], skip_special_tokens=True)
        response = response.split("ASSISTANT:")[-1].strip()
        
        print(response)
        return response
    except Exception as e:
        print(f"Error in multimodal chat: {e}")
        return None

# Note: Uncomment this section if you have sufficient GPU memory
# print("Loading multimodal chatbot (this may take a minute)...")
# processor, model = create_multimodal_chatbot()
# if processor and model:
#     # Example usage with image URL and question
#     image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
#     question = "What do you see in this image and how many cats are there?"
#     chat_with_image(processor, model, image_url, question)
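
Whether there is "sufficient GPU memory" can be estimated roughly from the parameter count and the number of bits stored per parameter. The sketch below assumes a round figure of 7 billion parameters for a "7B" model and counts the weights only; activations, the KV cache, and quantization overhead come on top. The helper name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_parameters, bits_per_parameter):
    """Approximate memory needed for the weights alone, in GiB."""
    return num_parameters * bits_per_parameter / 8 / 1024**3

ASSUMED_PARAMS = 7_000_000_000  # round figure for a "7B" model, not an exact count
for label, bits in [("float32", 32), ("bfloat16/float16", 16), ("4-bit", 4)]:
    print(f"{label:>17}: ~{weight_memory_gib(ASSUMED_PARAMS, bits):.1f} GiB")
```

This is why the loader above combines reduced precision with 4-bit quantization on consumer GPUs.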

## 10. PyTorch Model Optimization Techniques

This section demonstrates various PyTorch optimization techniques for improving inference speed and reducing memory usage.

In [None]:
# PyTorch optimization techniques for inference
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def demonstrate_pytorch_optimizations():
    print("PyTorch Model Optimization Techniques")
    print("====================================\n")
    
    if device != "cuda":
        print("These optimizations are most effective with GPU acceleration.")
    
    # 1. Model Quantization
    print("1. Model Quantization")
    print("--------------------")
    print("Loading a small model to demonstrate quantization...")
    try:
        # Load a small model
        model_id = "gpt2"
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        
        # Standard FP32 model (baseline)
        model_fp32 = AutoModelForCausalLM.from_pretrained(model_id)
        
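        # Dynamic Quantization (CPU only)
        # Note: GPT-2 builds most of its projections from a custom Conv1D layer rather than
        # torch.nn.Linear, so only its nn.Linear modules (such as lm_head) are quantized here
        model_int8 = torch.quantization.quantize_dynamic(
            model_fp32,  # the original model
            {torch.nn.Linear},  # a set of layers to dynamically quantize
            dtype=torch.qint8  # the target dtype for quantized weights
        )
        
        # Compare model sizes (parameters and buffers only; the packed INT8 weights
        # of quantized layers are not exposed as parameters, so treat the result as approximate)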
        def get_model_size(model):
            param_size = 0
            for param in model.parameters():
                param_size += param.nelement() * param.element_size()
            buffer_size = 0
            for buffer in model.buffers():
                buffer_size += buffer.nelement() * buffer.element_size()
            return (param_size + buffer_size) / 1024 / 1024  # Size in MB
        
        print(f"FP32 model size: {get_model_size(model_fp32):.2f} MB")
        print(f"INT8 model size: {get_model_size(model_int8):.2f} MB")
        print(f"Memory savings: {(1 - get_model_size(model_int8) / get_model_size(model_fp32)) * 100:.1f}%\n")
    except Exception as e:
        print(f"Quantization demonstration failed: {e}\n")
    
    # 2. Inference Mode vs No Grad
    print("2. Inference Mode vs No Grad")
    print("---------------------------")
    try:
        # Create a sample model and input
        sample_model = torch.nn.Linear(1000, 1000).to(device)
        sample_input = torch.randn(32, 1000).to(device)
        
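        import time
        
        def sync():
            # CUDA kernels launch asynchronously; wait for them to finish before reading the clock
            if device == "cuda":
                torch.cuda.synchronize()
        
        # Warmup (without autograd tracking)
        with torch.no_grad():
            for _ in range(10):
                _ = sample_model(sample_input)
        sync()
        
        # Time with torch.no_grad
        start_time = time.perf_counter()
        for _ in range(100):
            with torch.no_grad():
                _ = sample_model(sample_input)
        sync()
        no_grad_time = time.perf_counter() - start_time
        
        # Time with torch.inference_mode
        start_time = time.perf_counter()
        for _ in range(100):
            with torch.inference_mode():
                _ = sample_model(sample_input)
        sync()
        inference_mode_time = time.perf_counter() - start_time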
        
        print(f"torch.no_grad() time: {no_grad_time:.4f} seconds")
        print(f"torch.inference_mode() time: {inference_mode_time:.4f} seconds")
        print(f"Speedup: {no_grad_time / inference_mode_time:.2f}x\n")
    except Exception as e:
        print(f"Inference mode comparison failed: {e}\n")
        
    # 3. Memory-efficient transformers
    print("3. Memory Efficient Attention")
    print("----------------------------")
    print("Memory-efficient attention methods like FlashAttention, xFormers,")
    print("or PyTorch's scaled_dot_product_attention can significantly reduce memory usage")
    print("and improve inference speed in transformer models.")
    
    # Check if xformers is installed
    try:
        import xformers
        print(f"\nxFormers version {xformers.__version__} is installed.")
        print("You can enable memory-efficient attention with:")
        print("pipe.enable_xformers_memory_efficient_attention()")
    except ImportError:
        print("\nxFormers is not installed. For better performance, install it with:")
        print("pip install xformers")
    
    # Check if Flash Attention 2 is available
    try:
        from transformers.utils import is_flash_attn_2_available
        if is_flash_attn_2_available():
            print("\nFlash Attention 2 is available for use!")
            print("It can be enabled automatically when loading models with:")
            print("attn_implementation='flash_attention_2'")
        else:
            print("\nFlash Attention 2 is not available.")
    except ImportError:
        print("\nCould not check Flash Attention 2 availability.")

# Run the optimization demonstrations
demonstrate_pytorch_optimizations()
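
To see what INT8 quantization does to individual weights, the sketch below applies a simplified symmetric per-tensor scheme in plain Python. It is an illustration of the idea only and does not reproduce the exact scheme used by PyTorch's quantized kernels; the weights are made up and the function names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Map the stored integers back to approximate float values."""
    return [q * scale for q in quantized]

weights = [0.82, -1.27, 0.003, 0.45, -0.6]  # made-up weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("int8 values:", q)
print(f"scale: {scale:.5f}, max round-trip error: {max_error:.5f}")
```

Each weight is stored in one byte instead of four, and the round-trip error stays within half a quantization step, which is the trade-off dynamic quantization makes.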

## Conclusion

This notebook demonstrated the capabilities of multimodal generative AI using PyTorch-based models from Hugging Face's transformers and related libraries. We explored:

- Text generation with transformer models
- Image generation with diffusion models
- Text-to-speech synthesis
- Text-to-video generation
- Music generation
- Cross-modal tasks like image captioning and VQA
- Combining multiple modalities in practical applications
- Building multimodal chatbots that understand both text and images
- PyTorch optimization techniques for inference

PyTorch and the Hugging Face libraries provide pretrained models and GPU-accelerated inference for each of these modalities, and reduced precision and quantization help fit the larger models on limited hardware.

## Further Resources

- [PyTorch Documentation](https://pytorch.org/docs/stable/index.html)
- [Hugging Face Documentation](https://huggingface.co/docs)
- [Diffusers Library Documentation](https://huggingface.co/docs/diffusers/index)
- [Transformers Library Documentation](https://huggingface.co/docs/transformers/index)
- [Hugging Face Model Hub](https://huggingface.co/models)
- [Audiocraft GitHub Repository](https://github.com/facebookresearch/audiocraft)
- [Sentence Transformers](https://www.sbert.net/)
- [PyTorch Performance Tuning Guide](https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html)