[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/oviya-raja/ist-402/blob/main/learning-path/W08/W8_Speech_to_Image.ipynb)

---

# Speech-to-Image Generator

## Overview
This notebook implements an end-to-end multimodal pipeline that converts spoken descriptions into AI-generated images.

## Architecture
1. **Speech-to-Text**: Transcribe audio using OpenAI Whisper
2. **Text-to-Image**: Generate images from text using Stable Diffusion v1.5

## Pipeline Flow
```
Audio File → Whisper (Transcription) → Text Prompt →
Stable Diffusion (Generation) → Generated Image
```

Alternative: direct text input bypasses the transcription stage.
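
The flow above, including the direct-text shortcut, can be sketched as one small function. This is a minimal illustration rather than the notebook's actual code: `speech_to_image`, `fake_transcribe`, and `fake_generate` are hypothetical names, and the two callables stand in for Whisper and Stable Diffusion so the sketch runs without downloading any model.

```python
from typing import Callable, Optional


def speech_to_image(
    transcribe: Callable[[str], str],
    generate: Callable[[str], object],
    audio_path: Optional[str] = None,
    text_prompt: Optional[str] = None,
) -> object:
    """Run the two-stage pipeline; typed text skips the transcription stage."""
    if text_prompt and text_prompt.strip():
        prompt = text_prompt.strip()
    elif audio_path:
        prompt = transcribe(audio_path).strip()
    else:
        raise ValueError("Provide either an audio file or a text prompt.")
    if not prompt:
        raise ValueError("The prompt is empty after transcription.")
    return generate(prompt)


# Stand-in stages (assumptions) so the sketch runs without any models.
def fake_transcribe(path: str) -> str:
    return " a serene lake at sunset "


def fake_generate(prompt: str) -> str:
    return f"<image for: {prompt}>"


print(speech_to_image(fake_transcribe, fake_generate, audio_path="clip.wav"))
print(speech_to_image(fake_transcribe, fake_generate, text_prompt="a red fox"))
```

Passing the stages in as callables keeps the control flow testable on its own; in the app, the same roles are played by the cached Whisper and Stable Diffusion pipelines.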

## Features
- **Dual Input Methods**: Upload audio files OR type text directly
- **High-Quality Transcription**: OpenAI Whisper for accurate speech recognition
- **Creative Image Generation**: Stable Diffusion v1.5 for diverse image creation
- **Adjustable Settings**: Control quality (inference steps) and prompt adherence (guidance scale)
- **User-Friendly Interface**: Clear progress indicators and image download

## Usage
1. Run the cell below to install dependencies and launch the app
2. Choose input method:
   - **Audio Tab**: Upload an audio file (WAV, MP3, M4A, FLAC) and transcribe it
   - **Text Tab**: Type your image description directly
3. Adjust quality settings (optional)
4. Click "Generate Image" to create your artwork

## Technical Stack
- **Speech Recognition**: OpenAI Whisper (tiny variant)
- **Image Generation**: Stable Diffusion v1.5 (runwayml)
- **UI Framework**: Streamlit
- **Deep Learning**: PyTorch, Transformers, Diffusers

In [None]:
# =====================================================
#  Audio-to-Image Generator: TESTED & WORKING
#  Run this entire cell in Google Colab
# =====================================================
# This cell sets up the environment, installs dependencies, and launches the app

# ==================== STEP 1: Clean Environment ====================
# Kill any existing Streamlit processes to avoid conflicts
print("🧹 Cleaning up...")
import os
import subprocess
try:
    subprocess.run(["pkill", "-f", "streamlit"], capture_output=True, timeout=5)
except Exception:
    pass  # pkill may be missing or time out; safe to ignore

# ==================== STEP 2: Install Packages ====================
# Install required packages for speech recognition and image generation
# Using latest compatible versions to avoid import errors
print("📦 Installing packages (2-3 minutes)...")
%pip install -q "transformers>=4.35.0" "diffusers>=0.24.0" accelerate streamlit soundfile torch torchvision pyngrok requests==2.32.4

print("✅ Packages installed!")

# ==================== STEP 3: Define Streamlit App Source ====================
app_code = '''
import streamlit as st
import torch
from transformers import pipeline
from diffusers import StableDiffusionPipeline
import time

# Config
st.set_page_config(page_title="🎙️ Audio-to-Image", layout="centered")

# ==================== Load Models ====================
@st.cache_resource
def load_models():
    """
    Load both Whisper (speech-to-text) and Stable Diffusion (text-to-image) models.
    Models are cached to avoid reloading on every interaction.
    First run takes 3-5 minutes to download models.
    """
    st.info("Loading AI models... (first run takes 3-5 minutes)")

    # Whisper for speech-to-text
    # Using 'tiny' variant for speed; larger variants (base, small, medium) offer better accuracy
    whisper = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-tiny",
        device=0 if torch.cuda.is_available() else -1  # GPU if available, else CPU
    )

    # Stable Diffusion for image generation
    # v1.5 provides good balance of quality and speed
    device = "cuda" if torch.cuda.is_available() else "cpu"
    sd = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16 if device == "cuda" else torch.float32,  # Float16 on GPU for efficiency
        safety_checker=None  # Disabled for faster inference and flexibility
    ).to(device)

    if device == "cuda":
        sd.enable_attention_slicing()  # Memory optimization for GPU

    return whisper, sd

whisper_model, sd_model = load_models()

# ==================== UI ====================
st.title("🎙️ Audio-to-Image Generator")
st.markdown("Transform your voice into stunning AI-generated images!")
st.markdown("---")

# Input methods
tab1, tab2 = st.tabs(["🎤 Upload Audio", "✍️ Type Text"])

prompt_text = None

with tab1:
    st.write("Upload an audio file with your image description")
    audio_file = st.file_uploader(
        "Choose audio file",
        type=["wav", "mp3", "m4a", "flac"],
        help="Speak clearly: 'A beautiful sunset over mountains'"
    )

    if audio_file:
        st.audio(audio_file)

        if st.button("🎧 Transcribe Audio", type="primary"):
            with st.spinner("Converting speech to text..."):
                # Save temp file
                with open("temp_audio.wav", "wb") as f:
                    f.write(audio_file.read())

                # Transcribe
                result = whisper_model("temp_audio.wav")
                prompt_text = result["text"].strip()  # drop surrounding whitespace from the transcript

                st.success(f"✅ Transcription: **{prompt_text}**")
                st.session_state.prompt = prompt_text

with tab2:
    manual_prompt = st.text_area(
        "Describe the image you want to generate:",
        placeholder="Example: A serene lake surrounded by autumn trees at sunset",
        height=100
    )
    if manual_prompt:
        st.session_state.prompt = manual_prompt

# Settings
with st.expander("⚙️ Advanced Settings"):
    col1, col2 = st.columns(2)
    steps = col1.slider("Quality (inference steps)", 10, 50, 25,
                       help="More steps = better quality but slower")
    guidance = col2.slider("Prompt strength", 5.0, 15.0, 7.5,
                          help="Higher = follows prompt more closely")

# Generate button
st.markdown("---")
if st.button("🎨 Generate Image", type="primary", use_container_width=True):

    # Get prompt from session state
    final_prompt = st.session_state.get('prompt', None)

    if not final_prompt:
        st.error("❌ Please provide audio or text first!")
        st.stop()

    # Generate image
    st.info(f"🎨 Generating image from: **{final_prompt}**")
    st.write("This may take 30 seconds to 3 minutes depending on your GPU...")

    progress_bar = st.progress(0)
    start_time = time.time()

    with st.spinner("Creating your masterpiece..."):
        try:
            # Generate
            image = sd_model(
                prompt=final_prompt,
                num_inference_steps=steps,
                guidance_scale=guidance,
                height=512,
                width=512
            ).images[0]

            elapsed = time.time() - start_time
            progress_bar.progress(100)

            # Display
            st.success(f"✅ Generated in {elapsed:.1f} seconds!")
            st.image(image, caption=final_prompt)

            # Save and download
            image.save("generated_image.png")
            with open("generated_image.png", "rb") as f:
                st.download_button(
                    "💾 Download Image",
                    data=f,
                    file_name=f"ai_art_{int(time.time())}.png",
                    mime="image/png",
                    use_container_width=True
                )

        except Exception as e:
            st.error(f"❌ Generation failed: {str(e)}")
            st.info("Try simplifying your prompt or reducing quality settings")

# Footer
st.markdown("---")
st.caption("🔊 Powered by OpenAI Whisper + Stable Diffusion v1.5")

# GPU info
device_info = "🚀 GPU Accelerated" if torch.cuda.is_available() else "🐢 CPU Mode (slower)"
st.caption(device_info)
'''

# app.py is written to disk (with UTF-8 encoding) after the ngrok setup below

# ==================== STEP 4: Setup ngrok ====================
import time
from pyngrok import ngrok

# ⚠️ IMPORTANT: Set your ngrok token here
# Get it from: https://dashboard.ngrok.com/get-started/your-authtoken
# Replace with your own token for public access; never share a real token in a notebook
NGROK_TOKEN = "YOUR_TOKEN_HERE"  # ⚠️ CHANGE THIS!

if NGROK_TOKEN == "YOUR_TOKEN_HERE":
    print("\n❌ ERROR: Please set your ngrok token!")
    print("   1. Go to: https://dashboard.ngrok.com/get-started/your-authtoken")
    print("   2. Copy your token")
    print("   3. Replace 'YOUR_TOKEN_HERE' in the code above")
    raise SystemExit

try:
    ngrok.set_auth_token(NGROK_TOKEN)
    print("✅ ngrok token configured")
except Exception as e:
    print(f"⚠️ Warning: Could not set ngrok token: {e}")
    print("   Continuing without ngrok (local access only)")

# Kill existing tunnels
try:
    for tunnel in ngrok.get_tunnels():
        ngrok.disconnect(tunnel.public_url)
except Exception:
    pass  # no active tunnels, or ngrok is not running yet

# ==================== STEP 5: Write app.py ====================
try:
    with open("app.py", "w", encoding="utf-8") as f:
        f.write(app_code)
    print("✅ app.py generated successfully")
except Exception as e:
    print(f"❌ Failed to write app.py: {e}")
    raise

# ==================== STEP 6: Start Streamlit and ngrok ====================
import sys

# Free port 8501 if a previous Streamlit run is still holding it
try:
    if os.name == 'nt':  # Windows: this only lists processes on the port; nothing is killed
        os.system('netstat -ano | findstr :8501')
    else:  # macOS/Linux
        os.system('lsof -ti:8501 | xargs kill -9 2>/dev/null || true')
except Exception:
    pass

# Start Streamlit
print("\n🚀 Starting Streamlit...")
try:
    if sys.platform.startswith('win'):
        subprocess.Popen(
            [sys.executable, "-m", "streamlit", "run", "app.py", "--server.port", "8501", "--server.headless", "true"],
            creationflags=subprocess.CREATE_NEW_CONSOLE
        )
    else:
        subprocess.Popen(
            ["streamlit", "run", "app.py", "--server.port", "8501", "--server.headless", "true"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            start_new_session=True
        )
    
    time.sleep(5)  # Give Streamlit time to start
    print("✅ Streamlit started!")

except Exception as e:
    print(f"⚠️ Error starting Streamlit: {e}")
    print("   You can start it manually with: streamlit run app.py")

# Create ngrok tunnel
print("\n🌐 Creating public URL with ngrok...")
try:
    public_url = ngrok.connect(8501).public_url  # keep only the URL string
    print("\n" + "="*60)
    print("✅ SUCCESS! Your app is running!")
    print("="*60)
    print("\n🌐 Public URL (share this):")
    print(f"   {public_url}")
    print("\n🏠 Local URL:")
    print("   http://localhost:8501")
    print("\n📌 Tips:")
    print("   • Keep this notebook running")
    print("   • First image generation takes longer (loading models)")
    print("   • Use short, clear voice prompts")
    print("   • CPU mode works but is slower than GPU")
    print("\n" + "="*60)

except Exception as e:
    print(f"\n⚠️ Could not create ngrok tunnel: {e}")
    print("\n📌 App is running locally at: http://localhost:8501")
    print("   (ngrok tunnel failed, but local access works)")
    print("\n🔧 Troubleshooting:")
    print("   1. Check your ngrok token is correct")
    print("   2. Make sure you replaced 'YOUR_TOKEN_HERE'")
    print("   3. Try restarting the kernel and running again")

## Example Usage

### Method 1: Audio Input
1. Record or upload an audio file describing your desired image
2. Supported formats: WAV, MP3, M4A, FLAC
3. Click "Transcribe Audio" to convert speech to text
4. Review the transcription
5. Click "Generate Image" to create the image
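
The app writes every upload to `temp_audio.wav`, whatever its real format. One cautious alternative is to check the extension against the supported formats in step 2 and keep it on the temporary file. A minimal sketch, assuming two hypothetical helpers (`is_supported_audio` and `save_upload`) that are not part of the notebook:

```python
import os
import tempfile

# Extensions accepted by the app's file uploader.
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac"}


def is_supported_audio(filename: str) -> bool:
    """Return True if the file name ends in one of the supported extensions."""
    return os.path.splitext(filename)[1].lower() in SUPPORTED_EXTENSIONS


def save_upload(filename: str, data: bytes) -> str:
    """Write uploaded bytes to a temp file that keeps the original extension."""
    if not is_supported_audio(filename):
        raise ValueError(f"Unsupported audio format: {filename}")
    suffix = os.path.splitext(filename)[1].lower()
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(data)
        return tmp.name


path = save_upload("Prompt.MP3", b"placeholder bytes, not real audio")
print(path.endswith(".mp3"))
os.remove(path)  # clean up the temporary file
```

A unique temporary name also avoids two sessions overwriting the same file, which can matter once the app is shared through a public URL.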

**Example Audio Prompts:**
- "A serene lake surrounded by autumn trees at sunset"
- "A futuristic cityscape at night with neon lights"
- "A cozy coffee shop with warm lighting"

### Method 2: Direct Text Input
1. Type your image description directly
2. Be descriptive for better results
3. Click "Generate Image"

**Example Text Prompts:**
- "A beautiful sunset over mountains with trees in the foreground"
- "A modern minimalist living room with large windows"
- "A vintage typewriter on a wooden desk with books"

### Tips for Best Results
- **Be Descriptive**: Include details about colors, mood, style, composition
- **Quality Settings**: 
  - More inference steps = higher quality but slower (25-50 recommended)
  - Higher guidance scale = follows prompt more closely (7.5-10 recommended)
- **Generation Time**: The first generation takes longer because the models must load; subsequent ones are faster
- **GPU vs CPU**: A GPU is roughly 5-10x faster; CPU mode works but is slow
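
The quality settings above can be sketched as a small helper that keeps values inside the app's slider limits (10 to 50 inference steps, guidance scale 5.0 to 15.0). This is only an illustration: `GenerationSettings` and `clamp_settings` are hypothetical names, and the defaults mirror the slider defaults in the app.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationSettings:
    steps: int = 25        # slider default in the app
    guidance: float = 7.5  # slider default in the app


def clamp_settings(steps: int, guidance: float) -> GenerationSettings:
    """Clamp values to the app's slider ranges: 10-50 steps, 5.0-15.0 guidance."""
    return GenerationSettings(
        steps=max(10, min(50, int(steps))),
        guidance=max(5.0, min(15.0, float(guidance))),
    )


print(clamp_settings(80, 3.0))  # out-of-range values are pulled to the nearest limit
```

The clamped values could then be passed as `num_inference_steps` and `guidance_scale`, the two parameters the app forwards to the Stable Diffusion pipeline.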