[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/oviya-raja/ist-402/blob/main/learning-path/W08/W8_Speech_to_Image.ipynb)

---

# Speech-to-Image Generator

## Overview
This notebook implements an end-to-end multimodal pipeline that converts spoken descriptions into AI-generated images.

## Architecture
1. **Speech-to-Text**: Transcribe audio using OpenAI Whisper
2. **Text-to-Image**: Generate images from text using Stable Diffusion v1.5

## Pipeline Flow
```
Audio File → Whisper (Transcription) → Text Prompt →
Stable Diffusion (Generation) → Generated Image
```

Alternatively, typed text input bypasses the transcription stage and goes straight to image generation.
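
The flow above, including the text-only shortcut, can be sketched as a small function that takes the two stages as callables. This is a minimal illustration rather than the notebook's actual code: `run_pipeline`, `fake_transcribe`, and `fake_generate` are hypothetical names, and the rule that typed text takes precedence over audio is an assumption made for the sketch.

```python
from typing import Callable, Optional


def run_pipeline(
    transcribe: Callable[[str], str],
    generate: Callable[[str], object],
    audio_path: Optional[str] = None,
    text_prompt: Optional[str] = None,
):
    """Return (prompt, image) for either an audio file or a typed prompt."""
    if text_prompt and text_prompt.strip():
        # Direct text input: skip the transcription stage entirely.
        prompt = text_prompt.strip()
    elif audio_path:
        # Audio input: stage 1 turns speech into the text prompt.
        prompt = transcribe(audio_path).strip()
    else:
        raise ValueError("Provide either audio_path or text_prompt")
    if not prompt:
        raise ValueError("The prompt is empty")
    # Stage 2: the text prompt drives image generation.
    return prompt, generate(prompt)


# Stand-in stages (assumptions) so the sketch runs without downloading a model.
def fake_transcribe(path: str) -> str:
    return " a serene lake at sunset "


def fake_generate(prompt: str) -> str:
    return f"<image for: {prompt}>"


print(run_pipeline(fake_transcribe, fake_generate, audio_path="clip.wav"))
print(run_pipeline(fake_transcribe, fake_generate, text_prompt="a red barn"))
```

In the real app, the two callables would wrap the Whisper pipeline and the Stable Diffusion pipeline loaded in the cell below.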

## Features
- **Dual Input Methods**: Upload audio files OR type text directly
- **Speech Transcription**: OpenAI Whisper (tiny variant) for fast speech recognition; larger Whisper variants trade speed for accuracy
- **Creative Image Generation**: Stable Diffusion v1.5 for diverse image creation
- **Adjustable Settings**: Control quality (inference steps) and prompt adherence (guidance scale)
- **User-Friendly Interface**: Clear progress indicators and image download
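
The two adjustable settings map onto the `num_inference_steps` and `guidance_scale` arguments of the Stable Diffusion call. As a rough sketch, assuming the slider ranges used in the app below (10 to 50 steps, guidance 5.0 to 15.0), the values can be validated before generation. `GenerationSettings` and its methods are hypothetical helpers, not part of the notebook.

```python
from dataclasses import dataclass

# Ranges copied from the app's sliders; treat them as assumptions elsewhere.
STEPS_RANGE = (10, 50)
GUIDANCE_RANGE = (5.0, 15.0)


@dataclass(frozen=True)
class GenerationSettings:
    steps: int = 25        # more steps: slower, usually more detail
    guidance: float = 7.5  # higher: output follows the prompt more closely

    def clamped(self) -> "GenerationSettings":
        """Return a copy with both values forced into the slider ranges."""
        steps = min(max(int(self.steps), STEPS_RANGE[0]), STEPS_RANGE[1])
        guidance = min(max(float(self.guidance), GUIDANCE_RANGE[0]), GUIDANCE_RANGE[1])
        return GenerationSettings(steps, guidance)

    def as_kwargs(self) -> dict:
        """Keyword arguments in the form the generation call expects."""
        safe = self.clamped()
        return {"num_inference_steps": safe.steps, "guidance_scale": safe.guidance}


print(GenerationSettings().as_kwargs())
print(GenerationSettings(steps=200, guidance=1.0).as_kwargs())
```

Clamping rather than raising keeps the sketch forgiving; a stricter variant could reject out-of-range values instead.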

## Usage
1. Run the cell below to install dependencies and launch the app
2. Choose input method:
   - **Audio Tab**: Upload audio file (WAV, MP3, M4A, FLAC) and transcribe
   - **Text Tab**: Type your image description directly
3. Adjust quality settings (optional)
4. Click "Generate Image" to create your artwork

## Technical Stack
- **Speech Recognition**: OpenAI Whisper (tiny variant)
- **Image Generation**: Stable Diffusion v1.5 (runwayml)
- **UI Framework**: Streamlit
- **Deep Learning**: PyTorch, Transformers, Diffusers
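
How the stack adapts to the available hardware can be summarized in a few lines. The sketch below mirrors the choices made in the cell that follows (GPU index `0` or CPU `-1` for Whisper, `float16` on CUDA and `float32` on CPU for Stable Diffusion) but deliberately avoids importing PyTorch. `select_runtime` and its dictionary keys are hypothetical names for illustration.

```python
from typing import Dict, Union


def select_runtime(cuda_available: bool) -> Dict[str, Union[str, int]]:
    """Describe the model and device choices for a given hardware setup."""
    return {
        "whisper_model": "openai/whisper-tiny",
        "sd_model": "runwayml/stable-diffusion-v1-5",
        # The transformers pipeline takes a device index: 0 = first GPU, -1 = CPU.
        "whisper_device": 0 if cuda_available else -1,
        "sd_device": "cuda" if cuda_available else "cpu",
        # Half precision halves the memory taken by the weights; CPU stays at float32.
        "sd_dtype": "float16" if cuda_available else "float32",
    }


for has_gpu in (True, False):
    print(has_gpu, select_runtime(has_gpu))
```

In the notebook itself, the boolean comes from `torch.cuda.is_available()`.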

In [1]:
# =====================================================
#  Audio-to-Image Generator - FIXED VERSION
#  Run this entire cell in Google Colab
# =====================================================

# ==================== STEP 1: Early Cleanup ====================
print("🧹 Cleaning up existing processes...")
import os
import subprocess
import time

try:
    subprocess.run(["pkill", "-f", "streamlit"], capture_output=True, timeout=5)
    os.system('pkill -9 ngrok 2>/dev/null || true')
    os.system('killall ngrok 2>/dev/null || true')
    time.sleep(1)
    print("✅ Cleanup complete")
except Exception:
    # Cleanup is best-effort; a missing pkill or a timeout must not stop the notebook.
    pass

# ==================== STEP 2: Install Packages ====================
print("📦 Installing packages (2-3 minutes)...")
%pip install -q "transformers>=4.35.0" "diffusers>=0.24.0" accelerate streamlit soundfile torch torchvision pyngrok python-dotenv requests==2.32.4
print("✅ Packages installed!")

# ==================== STEP 3: Create Streamlit App ====================
app_code = '''
import streamlit as st
import torch
from transformers import pipeline
from diffusers import StableDiffusionPipeline
import time

# Config
st.set_page_config(page_title="🎙️ Audio-to-Image", layout="centered")

# ==================== Load Models ====================
@st.cache_resource
def load_models():
    """
    Load both Whisper (speech-to-text) and Stable Diffusion (text-to-image) models.
    Models are cached to avoid reloading on every interaction.
    First run takes 3-5 minutes to download models.
    """
    st.info("Loading AI models... (first run takes 3-5 minutes)")

    # Whisper for speech-to-text
    whisper = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-tiny",
        device=0 if torch.cuda.is_available() else -1
    )

    # Stable Diffusion for image generation
    device = "cuda" if torch.cuda.is_available() else "cpu"
    sd = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16 if device == "cuda" else torch.float32,
        safety_checker=None
    ).to(device)

    if device == "cuda":
        sd.enable_attention_slicing()

    return whisper, sd

whisper_model, sd_model = load_models()

# ==================== UI ====================
st.title("🎙️ Audio-to-Image Generator")
st.markdown("Transform your voice into stunning AI-generated images!")
st.markdown("---")

# Input methods
tab1, tab2 = st.tabs(["🎤 Upload Audio", "✍️ Type Text"])

prompt_text = None

with tab1:
    st.write("Upload an audio file with your image description")
    audio_file = st.file_uploader(
        "Choose audio file",
        type=["wav", "mp3", "m4a", "flac"],
        help="Speak clearly: \'A beautiful sunset over mountains\'")

    if audio_file:
        st.audio(audio_file)

        if st.button("🎧 Transcribe Audio", type="primary"):
            with st.spinner("Converting speech to text..."):
                # getvalue() returns the full upload regardless of the stream position.
                with open("temp_audio.wav", "wb") as f:
                    f.write(audio_file.getvalue())
                result = whisper_model("temp_audio.wav")
                # Whisper output often starts with a space; strip it before use.
                prompt_text = result["text"].strip()
                st.success(f"✅ Transcription: **{prompt_text}**")
                st.session_state.prompt = prompt_text

with tab2:
    manual_prompt = st.text_area(
        "Describe the image you want to generate:",
        placeholder="Example: A serene lake surrounded by autumn trees at sunset",
        height=100
    )
    if manual_prompt:
        st.session_state.prompt = manual_prompt

# Settings
with st.expander("⚙️ Advanced Settings"):
    col1, col2 = st.columns(2)
    steps = col1.slider("Quality (inference steps)", 10, 50, 25,
                       help="More steps = better quality but slower")
    guidance = col2.slider("Prompt strength", 5.0, 15.0, 7.5,
                          help="Higher = follows prompt more closely")

# Generate button
st.markdown("---")
if st.button("🎨 Generate Image", type="primary", use_container_width=True):
    final_prompt = st.session_state.get(\'prompt\', None)

    if not final_prompt:
        st.error("❌ Please provide audio or text first!")
        st.stop()

    st.info(f"🎨 Generating image from: **{final_prompt}**")
    st.write("This may take 30 seconds to 3 minutes depending on your GPU...")

    progress_bar = st.progress(0)
    start_time = time.time()

    with st.spinner("Creating your masterpiece..."):
        try:
            image = sd_model(
                prompt=final_prompt,
                num_inference_steps=steps,
                guidance_scale=guidance,
                height=512,
                width=512
            ).images[0]

            elapsed = time.time() - start_time
            progress_bar.progress(100)

            st.success(f"✅ Generated in {elapsed:.1f} seconds!")
            st.image(image, caption=final_prompt)

            image.save("generated_image.png")
            with open("generated_image.png", "rb") as f:
                st.download_button(
                    "💾 Download Image",
                    data=f,
                    file_name=f"ai_art_{int(time.time())}.png",
                    mime="image/png",
                    use_container_width=True
                )

        except Exception as e:
            st.error(f"❌ Generation failed: {str(e)}")
            st.info("Try simplifying your prompt or reducing quality settings")

# Footer
st.markdown("---")
st.caption("🔊 Powered by OpenAI Whisper + Stable Diffusion v1.5")
device_info = "🚀 GPU Accelerated" if torch.cuda.is_available() else "🐢 CPU Mode (slower)"
st.caption(device_info)
'''

# Write app.py
try:
    with open("app.py", "w", encoding="utf-8") as f:
        f.write(app_code)
    print("✅ app.py generated successfully")
except Exception as e:
    print(f"❌ Failed to write app.py: {e}")
    raise

# ==================== STEP 4: Setup ngrok ====================
from pyngrok import ngrok
import sys

# Kill ngrok again after import
print("🧹 Killing any ngrok processes...")
try:
    ngrok.kill()
    time.sleep(1)
    print("✅ ngrok processes killed")
except Exception as e:
    print(f"   Note: {e}")

# Load ngrok token from environment variables
NGROK_TOKEN = None

# Try Google Colab first
try:
    from google.colab import userdata
    NGROK_TOKEN = userdata.get('NGROK_AUTHTOKEN')
    if NGROK_TOKEN:
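        print("✅ Loaded ngrok token from Google Colab userdata")
except Exception:
    # Not running in Colab, or the secret is missing or notebook access is off;
    # fall through to the .env and environment-variable lookups below.
    pass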

# Try .env file if not found
if not NGROK_TOKEN:
    try:
        from dotenv import load_dotenv
        load_dotenv()
        NGROK_TOKEN = os.getenv('NGROK_AUTHTOKEN')
        if NGROK_TOKEN:
            print("✅ Loaded ngrok token from .env file")
    except ImportError:
        pass

# Fallback to environment variable
if not NGROK_TOKEN:
    NGROK_TOKEN = os.getenv('NGROK_AUTHTOKEN')
    if NGROK_TOKEN:
        print("✅ Loaded ngrok token from environment variable")

# Check if token was found
if not NGROK_TOKEN:
    print("\n❌ ERROR: NGROK_AUTHTOKEN not found!")
    print("\n📝 How to set it in Google Colab:")
    print("   1. Click the 🔑 key icon in the left sidebar")
    print("   2. Add new secret: NGROK_AUTHTOKEN")
    print("   3. Paste your token from: https://dashboard.ngrok.com/get-started/your-authtoken")
    print("   4. Toggle 'Notebook access' ON")
    print("\n   For Local (Jupyter/VS Code):")
    print("   1. Create a .env file in this directory")
    print("   2. Add: NGROK_AUTHTOKEN=your_token_here")
    raise SystemExit("NGROK_AUTHTOKEN not configured")

# Configure ngrok with token
try:
    ngrok.set_auth_token(NGROK_TOKEN)
    print("✅ ngrok token configured successfully")
except Exception as e:
    print(f"⚠️ Warning: Could not set ngrok token: {e}")

# Disconnect any existing tunnels
print("🔌 Disconnecting any existing tunnels...")
try:
    tunnels = ngrok.get_tunnels()
    for tunnel in tunnels:
        ngrok.disconnect(tunnel.public_url)
        print(f"   Disconnected: {tunnel.public_url}")
    if tunnels:
        time.sleep(2)
    print("✅ All tunnels disconnected")
except Exception as e:
    print(f"   Note: {e}")

# ==================== STEP 5: Start Streamlit ====================
# Kill any existing streamlit on port 8501
try:
    os.system('lsof -ti:8501 | xargs kill -9 2>/dev/null || true')
except Exception:
    pass

print("\n🚀 Starting Streamlit...")
try:
    subprocess.Popen(
        ["streamlit", "run", "app.py", "--server.port", "8501", "--server.headless", "true"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True
    )
    time.sleep(5)
    print("✅ Streamlit started!")
except Exception as e:
    print(f"⚠️ Error starting Streamlit: {e}")
    print("   You can start it manually with: streamlit run app.py")

# ==================== STEP 6: Create ngrok Tunnel ====================
print("\n🌐 Creating public URL with ngrok...")
try:
    public_url = ngrok.connect(8501)
    print("\n" + "="*60)
    print("✅ SUCCESS! Your app is running!")
    print("="*60)
    print(f"\n🌐 Public URL (share this):")
    print(f"   {public_url}")
    print(f"\n🏠 Local URL:")
    print(f"   http://localhost:8501")
    print(f"\n📌 Tips:")
    print(f"   • Keep this notebook running")
    print(f"   • First image generation takes longer (loading models)")
    print(f"   • Use short, clear voice prompts")
    print(f"   • CPU mode works but is slower than GPU")
    print("\n" + "="*60)

except Exception as e:
    error_msg = str(e)
    print(f"\n⚠️ Could not create ngrok tunnel: {e}")

    # ERR_NGROK_108: 3 session limit
    if "ERR_NGROK_108" in error_msg or "3 simultaneous" in error_msg or "agent sessions" in error_msg:
        print("\n" + "="*60)
        print("💡 ISSUE: ngrok free account limit (3 sessions)")
        print("="*60)
        print("\n   These sessions are on OTHER machines, not this one.")
        print("\n🔧 HOW TO FIX:")
        print("   1. Go to: https://dashboard.ngrok.com/agents")
        print("   2. Click 'Disconnect' on ALL active sessions")
        print("   3. Wait 10 seconds")
        print("   4. Re-run this cell")

    # ERR_NGROK_334: Endpoint already online
    elif "ERR_NGROK_334" in error_msg or "already online" in error_msg:
        print("\n" + "="*60)
        print("💡 ISSUE: ngrok endpoint already registered")
        print("="*60)
        print("\n   A previous session didn't close properly.")
        print("\n🔧 HOW TO FIX:")
        print("   1. Go to: https://dashboard.ngrok.com/agents")
        print("   2. Click 'Disconnect' on ALL active sessions")
        print("   3. Wait 30 seconds")
        print("   4. Runtime → Restart runtime")
        print("   5. Re-run this cell")

    else:
        print("\n🔧 Troubleshooting:")
        print("   1. Check your ngrok token is correct")
        print("   2. Try: Runtime → Restart runtime")
        print("   3. Re-run this cell")

    print("\n📌 App is still running locally at: http://localhost:8501")

🧹 Cleaning up existing processes...
✅ Cleanup complete
📦 Installing packages (2-3 minutes)...
✅ Packages installed!
✅ app.py generated successfully
🧹 Killing any ngrok processes...
✅ ngrok processes killed
✅ Loaded ngrok token from Google Colab userdata
✅ ngrok token configured successfully
🔌 Disconnecting any existing tunnels...
✅ All tunnels disconnected

🚀 Starting Streamlit...
✅ Streamlit started!

🌐 Creating public URL with ngrok...

✅ SUCCESS! Your app is running!

🌐 Public URL (share this):
   NgrokTunnel: "https://unrivalable-lenna-soothfastly.ng

## Example Usage

### Method 1: Audio Input
1. Record an audio clip describing your desired image, then upload it in the audio tab
2. Supported formats: WAV, MP3, M4A, FLAC
3. Click "Transcribe Audio" to convert speech to text
4. Review the transcription
5. Click "Generate Image" to create the image
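
The format restriction in step 2 can be checked before any bytes reach the transcriber. The following is a minimal sketch, assuming the four formats listed above; `is_supported_audio` and `temp_audio_name` are hypothetical helpers. Keeping the original extension on the temporary file is a design suggestion, not what the notebook currently does.

```python
from pathlib import Path

# The four upload formats the app accepts.
SUPPORTED_AUDIO = {".wav", ".mp3", ".m4a", ".flac"}


def is_supported_audio(filename: str) -> bool:
    """Case-insensitive check of the file extension."""
    return Path(filename).suffix.lower() in SUPPORTED_AUDIO


def temp_audio_name(filename: str, stem: str = "temp_audio") -> str:
    """Build a temp file name that keeps the upload's real extension."""
    if not is_supported_audio(filename):
        raise ValueError(f"Unsupported audio format: {filename!r}")
    return stem + Path(filename).suffix.lower()


print(is_supported_audio("Voice Memo.M4A"))  # True
print(is_supported_audio("notes.txt"))       # False
print(temp_audio_name("sunset.flac"))        # temp_audio.flac
```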

**Example Audio Prompts:**
- "A serene lake surrounded by autumn trees at sunset"
- "A futuristic cityscape at night with neon lights"
- "A cozy coffee shop with warm lighting"

### Method 2: Direct Text Input
1. Type your image description directly
2. Be descriptive for better results
3. Click "Generate Image"

**Example Text Prompts:**
- "A beautiful sunset over mountains with trees in the foreground"
- "A modern minimalist living room with large windows"
- "A vintage typewriter on a wooden desk with books"

### Tips for Best Results
- **Be Descriptive**: Include details about colors, mood, style, composition
- **Quality Settings**:
  - More inference steps = higher quality but slower (25-50 recommended)
  - Higher guidance scale = follows prompt more closely (7.5-10 recommended)
- **Generation Time**: The first generation takes longer because the models must load; subsequent ones are faster
- **GPU vs CPU**: GPU is 5-10x faster; CPU works but is slower
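
One way to apply the "be descriptive" tip consistently is to assemble the prompt from named parts. This is only a sketch: `build_prompt` and its keyword names are hypothetical, and joining the details with commas is a common prompt-writing convention rather than a requirement of Stable Diffusion.

```python
def build_prompt(subject: str, *, style: str = "", mood: str = "",
                 colors: str = "", composition: str = "") -> str:
    """Join a subject with optional descriptive details, skipping empty ones."""
    subject = subject.strip()
    if not subject:
        raise ValueError("A subject is required")
    details = [part.strip() for part in (style, mood, colors, composition)]
    return ", ".join([subject] + [part for part in details if part])


print(build_prompt("a vintage typewriter on a wooden desk"))
print(build_prompt(
    "a cozy coffee shop",
    style="watercolor",
    mood="warm lighting",
    composition="wide shot",
))
```

The second call yields `a cozy coffee shop, watercolor, warm lighting, wide shot`, which can be pasted into the text tab or spoken as an audio prompt.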