# AI-Powered Audio Transcription & Note Generation Bot

This notebook creates an AI bot that can:
- Record or accept uploaded audio files
- Transcribe audio using OpenAI Whisper
- Generate summaries and structured notes using a pretrained language model (BART)
- Provide an intuitive Gradio interface

## Features:
- **Audio Input**: Record live or upload files
- **Transcription**: High-quality speech-to-text
- **Smart Notes**: AI-generated summaries and topic breakdowns
- **User-Friendly UI**: Clean Gradio interface

## 1. Setup & Imports

First, let's install and import all required libraries:

In [None]:
# Install required packages (run this first)
# Note: openai-whisper also needs the ffmpeg binary installed on the system to decode audio files
!pip install gradio openai-whisper transformers torch torchaudio soundfile librosa numpy pandas

In [None]:
# Core imports
import gradio as gr
import whisper
import torch
import numpy as np
import os
import tempfile
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Audio processing
import librosa
import soundfile as sf

# LLM and text processing
from transformers import pipeline
import re

print("✅ All libraries imported successfully!")
print(f"🔥 PyTorch version: {torch.__version__}")
print(f"🎵 Whisper available: {whisper.__version__ if hasattr(whisper, '__version__') else 'Yes'}")
print(f"🎨 Gradio version: {gr.__version__}")

## 2. Audio Recording & Upload Functions

These functions handle audio input from recording or file upload:

In [None]:
def process_audio_file(audio_file):
    """
    Process uploaded audio file and prepare it for transcription.
    
    Args:
        audio_file: Path to audio file or tuple (sample_rate, audio_data)
    
    Returns:
        str: Path to processed audio file
    """
    try:
        if audio_file is None:
            return None
        
        # Handle different input types
        if isinstance(audio_file, tuple):
            # From microphone recording (sample_rate, audio_data)
            sample_rate, audio_data = audio_file
            
            # Create a temporary WAV file; close the OS handle first so soundfile can reopen it on any OS
            fd, tmp_path = tempfile.mkstemp(suffix=".wav")
            os.close(fd)
            sf.write(tmp_path, audio_data, sample_rate)
            return tmp_path
        
        elif isinstance(audio_file, str):
            # File path from upload
            return audio_file
        
        else:
            # Gradio file object
            return audio_file.name if hasattr(audio_file, 'name') else str(audio_file)
            
    except Exception as e:
        print(f"❌ Error processing audio file: {str(e)}")
        return None

def get_audio_info(audio_path):
    """
    Get information about the audio file.
    
    Args:
        audio_path (str): Path to audio file
    
    Returns:
        dict: Audio information
    """
    try:
        if not audio_path or not os.path.exists(audio_path):
            return {"error": "Audio file not found"}
        
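        # Load at the native sample rate and keep all channels (librosa downmixes to mono by default)
        audio_data, sample_rate = librosa.load(audio_path, sr=None, mono=False)
        channels = 1 if audio_data.ndim == 1 else audio_data.shape[0]
        duration = audio_data.shape[-1] / sample_rate
        
        return {
            "duration": f"{duration:.2f} seconds",
            "sample_rate": f"{sample_rate} Hz",
            "channels": "Mono" if channels == 1 else f"{channels} channels",
            "file_size": f"{os.path.getsize(audio_path) / 1024:.1f} KB"
        }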
    except Exception as e:
        return {"error": f"Could not analyze audio: {str(e)}"}

print("✅ Audio processing functions ready!")
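To make the microphone branch of `process_audio_file` and the checks in `get_audio_info` concrete, here is a minimal, dependency-free sketch. It assumes 16-bit mono PCM samples and swaps `soundfile`/`librosa` for Python's standard-library `wave` module; the helper names `tuple_to_wav` and `wav_info` are hypothetical and not part of the notebook.

```python
import array
import math
import os
import sys
import tempfile
import wave


def tuple_to_wav(sample_rate, samples):
    """Write (sample_rate, 16-bit samples) to a temporary mono WAV file (hypothetical helper)."""
    fd, path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)  # close the OS-level handle so `wave` can reopen the file on any platform
    pcm = array.array("h", samples)  # "h" = signed 16-bit integers
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV stores samples little-endian
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes per 16-bit sample
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())
    return path


def wav_info(path):
    """Read duration, sample rate and channel count from the WAV header."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "duration_s": wav.getnframes() / rate,
            "sample_rate": rate,
            "channels": wav.getnchannels(),
        }


# One second of a 440 Hz tone at 16 kHz, standing in for a microphone recording
rate = 16000
tone = [int(0.3 * 32767 * math.sin(2 * math.pi * 440 * n / rate)) for n in range(rate)]
path = tuple_to_wav(rate, tone)
info = wav_info(path)
os.unlink(path)  # remove the temporary file once it is no longer needed
print(info)  # {'duration_s': 1.0, 'sample_rate': 16000, 'channels': 1}
```

Closing the handle returned by `mkstemp` before writing avoids the "file already open" problem that `NamedTemporaryFile` can cause on Windows.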

## 3. Transcription Function

Using OpenAI Whisper for high-quality speech-to-text transcription:

In [None]:
# Load Whisper model (using base model for balance of speed and accuracy)
print("🔄 Loading Whisper model...")
whisper_model = whisper.load_model("base")
print("✅ Whisper model loaded successfully!")

def transcribe_audio(audio_file, language="auto"):
    """
    Transcribe audio file using OpenAI Whisper.
    
    Args:
        audio_file: Audio file path or Gradio audio input
        language (str): Language code or 'auto' for auto-detection
    
    Returns:
        dict: Transcription results with text, detected language, and a status message
    """
    try:
        if not audio_file:
            return {
                "text": "",
                "language": "unknown",
                "status": "❌ No audio file provided"
            }
        
        # Process the audio file (a temporary WAV is created only for (sample_rate, data) tuples)
        audio_path = process_audio_file(audio_file)
        created_temp_file = isinstance(audio_file, tuple)
        if not audio_path:
            return {
                "text": "",
                "language": "unknown",
                "status": "❌ Could not process audio file"
            }
        
        print(f"🔄 Transcribing audio: {os.path.basename(audio_path)}")
        
        try:
            # Transcribe using Whisper; language=None lets Whisper auto-detect the language
            result = whisper_model.transcribe(
                audio_path,
                language=None if language == "auto" else language
            )
        finally:
            # Delete only the temporary file we created (never the user's uploaded file),
            # and do so even if transcription raises
            if created_temp_file and os.path.exists(audio_path):
                try:
                    os.unlink(audio_path)
                except OSError:
                    pass
        
        # Extract results
        transcribed_text = result["text"].strip()
        detected_language = result.get("language", "unknown")
        
        if not transcribed_text:
            return {
                "text": "",
                "language": detected_language,
                "status": "⚠️ No speech detected in audio"
            }
        
        print(f"✅ Transcription completed! Language: {detected_language}")
        
        return {
            "text": transcribed_text,
            "language": detected_language,
            "status": f"✅ Transcription successful ({detected_language})"
        }
        
    except Exception as e:
        print(f"❌ Transcription error: {str(e)}")
        return {
            "text": "",
            "language": "unknown",
            "status": f"❌ Transcription failed: {str(e)}"
        }

print("✅ Transcription function ready!")
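The `language` argument works by passing Whisper's `language` decoding option only when a specific language is chosen; when it is omitted, Whisper detects the language from the first 30 seconds of audio. The sketch below isolates that choice in a hypothetical helper, `build_transcribe_kwargs`, and also sets `fp16=False` on CPU, where Whisper would otherwise warn and fall back to FP32. It assumes you pass `torch.cuda.is_available()` as the `use_gpu` flag.

```python
from typing import Dict, Union


def build_transcribe_kwargs(language: str, use_gpu: bool) -> Dict[str, Union[bool, str]]:
    """Hypothetical helper: map the UI language choice to whisper transcribe() options."""
    kwargs: Dict[str, Union[bool, str]] = {
        # FP16 only helps on GPU; on CPU Whisper warns and uses FP32 anyway
        "fp16": use_gpu,
    }
    if language != "auto":
        # An explicit code skips Whisper's automatic language detection
        kwargs["language"] = language
    return kwargs


print(build_transcribe_kwargs("auto", use_gpu=False))  # {'fp16': False}
print(build_transcribe_kwargs("fr", use_gpu=True))     # {'fp16': True, 'language': 'fr'}

# In the notebook this could be used as (not executed here):
# result = whisper_model.transcribe(audio_path, **build_transcribe_kwargs(language, torch.cuda.is_available()))
```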

## 4. LLM Summary & Note Generation Functions

Using Hugging Face `transformers` pipelines for text summarization and note generation:

In [None]:
# Load summarization model
print("🔄 Loading summarization model...")
try:
    # Using BART for summarization (good balance of quality and speed)
    summarizer = pipeline(
        "summarization", 
        model="facebook/bart-large-cnn",
        device=0 if torch.cuda.is_available() else -1
    )
    print("✅ BART summarization model loaded!")
except Exception as e:
    print(f"⚠️ Could not load BART model, using fallback: {e}")
    # Fallback to a smaller model
    summarizer = pipeline(
        "summarization", 
        model="sshleifer/distilbart-cnn-12-6",
        device=0 if torch.cuda.is_available() else -1
    )
    print("✅ DistilBART summarization model loaded!")

def chunk_text(text, max_chunk_length=1000):
    """
    Split text into chunks for processing by the summarization model.
    
    Args:
        text (str): Input text
        max_chunk_length (int): Maximum length per chunk
    
    Returns:
        list: List of text chunks
    """
    # Split by sentences first
    sentences = re.split(r'(?<=[.!?])\s+', text)
    
    chunks = []
    current_chunk = ""
    
    for sentence in sentences:
        if len(current_chunk + sentence) <= max_chunk_length:
            current_chunk += sentence + " "
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = sentence + " "
    
    if current_chunk:
        chunks.append(current_chunk.strip())
    
    return chunks

def generate_summary(text):
    """
    Generate a concise summary of the transcribed text.
    
    Args:
        text (str): Input transcribed text
    
    Returns:
        str: Generated summary
    """
    try:
        if not text or len(text.strip()) < 50:
            return "⚠️ Text too short to summarize effectively."
        
        print("🔄 Generating summary...")
        
        # Handle long texts by chunking
        if len(text) > 1000:
            chunks = chunk_text(text, 900)
            summaries = []
            
            for i, chunk in enumerate(chunks):
                print(f"📝 Summarizing chunk {i+1}/{len(chunks)}")
                try:
                    summary = summarizer(
                        chunk, 
                        max_length=130, 
                        min_length=30, 
                        do_sample=False
                    )[0]['summary_text']
                    summaries.append(summary)
                except Exception as e:
                    print(f"⚠️ Error summarizing chunk {i+1}: {e}")
                    continue
            
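            # Combine chunk summaries (bail out if every chunk failed)
            if not summaries:
                return "❌ Could not summarize any part of the transcription."
            combined_summary = " ".join(summaries)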
            
            # If combined summary is still long, summarize it again
            if len(combined_summary) > 800:
                final_summary = summarizer(
                    combined_summary, 
                    max_length=200, 
                    min_length=50, 
                    do_sample=False
                )[0]['summary_text']
                return final_summary
            else:
                return combined_summary
        
        else:
            # Short text, summarize directly
            summary = summarizer(
                text, 
                max_length=150, 
                min_length=30, 
                do_sample=False
            )[0]['summary_text']
            return summary
            
    except Exception as e:
        print(f"❌ Summary generation error: {str(e)}")
        return f"❌ Could not generate summary: {str(e)}"

def extract_topics_and_notes(text):
    """
    Extract key topics and generate structured notes from the text.
    
    Args:
        text (str): Input transcribed text
    
    Returns:
        str: Formatted topic breakdown and notes
    """
    try:
        if not text or len(text.strip()) < 50:
            return "⚠️ Text too short for topic extraction."
        
        print("🔄 Extracting topics and generating notes...")
        
        # Simple topic extraction using keywords and sentence analysis
        sentences = re.split(r'(?<=[.!?])\s+', text)
        
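        # Approximate topics as the most frequent non-stop-words (a keyword heuristic, not noun detection)
        import collections
        # Lowercase first so "Model" and "model" are counted as the same word
        words = re.findall(r'\b[a-z]{4,}\b', text.lower())
        
        # Drop common filler words before counting
        stop_words = {'this', 'that', 'with', 'have', 'will', 'from', 'they', 'been', 'were', 'said', 'each', 'which', 'their', 'time', 'about',
                      'there', 'what', 'when', 'would', 'could', 'should', 'just', 'like', 'some', 'very', 'also', 'then', 'than', 'them', 'into', 'your', 'know', 'really', 'going', 'because'}
        word_freq = collections.Counter(w for w in words if w not in stop_words)
        topics = [word for word, freq in word_freq.most_common(10) if freq > 1]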
        
        # Create structured notes
        notes = "## 📋 Structured Notes\n\n"
        
        # Key topics section
        if topics:
            notes += "### 🎯 Key Topics Mentioned:\n"
            for i, topic in enumerate(topics[:5], 1):
                notes += f"{i}. **{topic.title()}**\n"
            notes += "\n"
        
        # Main points section
        notes += "### 💡 Main Points:\n"
        
        # Group sentences into roughly 5 sections of at least 3 sentences each
        section_size = max(3, len(sentences) // 5)
        sections = [sentences[i:i+section_size] for i in range(0, len(sentences), section_size)]
        
        for i, section in enumerate(sections, 1):
            if section:
                section_text = " ".join(section)
                if len(section_text.strip()) > 20:  # Only include substantial sections
                    # Try to summarize each section
                    try:
                        if len(section_text) > 100:
                            section_summary = summarizer(
                                section_text, 
                                max_length=60, 
                                min_length=15, 
                                do_sample=False
                            )[0]['summary_text']
                        else:
                            section_summary = section_text
                        
                        notes += f"- **Point {i}:** {section_summary}\n"
                    except Exception:
                        # Fallback to original text if summarization fails
                        notes += f"- **Point {i}:** {section_text[:100]}...\n"
        
        # Add timestamp
        notes += f"\n---\n*Notes generated on {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}*"
        
        return notes
        
    except Exception as e:
        print(f"❌ Topic extraction error: {str(e)}")
        return f"❌ Could not extract topics: {str(e)}"

print("✅ LLM summary and note generation functions ready!")
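`generate_summary` follows a map-reduce pattern: summarize each chunk, join the partial summaries, and summarize once more if the joined text is still longer than 800 characters. The sketch below reproduces that control flow with the standard library only. The helper names are hypothetical, and `first_sentence` is a stand-in for the BART pipeline so the example runs without downloading a model; in the notebook you would pass a function that returns `summarizer(text)[0]['summary_text']`.

```python
import re
from typing import Callable, List


def pack_sentences(text: str, max_chars: int = 900) -> List[str]:
    """Greedily pack whole sentences into chunks of up to max_chars characters.

    Like chunk_text, a single sentence longer than max_chars still becomes its own (oversized) chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # budget exceeded: close this chunk and start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def map_reduce_summary(
    text: str,
    summarize: Callable[[str], str],
    max_chars: int = 900,
    reduce_threshold: int = 800,
) -> str:
    """Map: summarize each chunk. Reduce: join the partial summaries and
    summarize them once more if the joined text is still too long."""
    partials = [summarize(chunk) for chunk in pack_sentences(text, max_chars)]
    combined = " ".join(p for p in partials if p)
    if len(combined) > reduce_threshold:
        combined = summarize(combined)
    return combined


def first_sentence(text: str) -> str:
    """Stand-in for the BART pipeline: keep only the first sentence."""
    return re.split(r"(?<=[.!?])\s+", text.strip())[0]


transcript = " ".join(f"Point number {i} is discussed here." for i in range(1, 61))
chunks = pack_sentences(transcript)
print(len(chunks))  # 3
print(map_reduce_summary(transcript, first_sentence))
# Point number 1 is discussed here. Point number 27 is discussed here. Point number 52 is discussed here.
```

Note that the budget here is measured in characters, as in `chunk_text`; BART's actual input limit is 1024 tokens, so 900 characters leaves a comfortable margin.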

## 5. Main Processing Function

Combining all components into a single processing pipeline:

In [None]:
def process_audio_complete(audio_file, language="auto"):
    """
    Complete pipeline: Audio → Transcription → Summary → Notes
    
    Args:
        audio_file: Audio input from Gradio
        language (str): Language for transcription
    
    Returns:
        tuple: (transcription, summary, notes, status)
    """
    try:
        print("🚀 Starting complete audio processing pipeline...")
        
        # Step 1: Transcribe audio
        transcription_result = transcribe_audio(audio_file, language)
        
        if not transcription_result["text"]:
            return (
                "❌ No transcription available",
                "❌ Cannot generate summary without transcription",
                "❌ Cannot generate notes without transcription",
                transcription_result["status"]
            )
        
        transcribed_text = transcription_result["text"]
        
        # Step 2: Generate summary
        print("📝 Generating summary...")
        summary = generate_summary(transcribed_text)
        
        # Step 3: Extract topics and generate notes
        print("📋 Generating structured notes...")
        notes = extract_topics_and_notes(transcribed_text)
        
        # Create status message
        status = f"""✅ **Processing Complete!**

📊 **Statistics:**
- Language detected: {transcription_result['language']}
- Transcription length: {len(transcribed_text)} characters
- Word count: ~{len(transcribed_text.split())} words
- Completed at: {datetime.now().strftime('%H:%M:%S')}
"""
        
        print("🎉 Complete processing finished successfully!")
        
        return (
            transcribed_text,
            summary,
            notes,
            status
        )
        
    except Exception as e:
        error_msg = f"❌ Processing failed: {str(e)}"
        print(error_msg)
        return (
            error_msg,
            error_msg,
            error_msg,
            error_msg
        )

print("✅ Main processing function ready!")
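`process_audio_complete` chains three stages and stops early when transcription yields no text. Below is a minimal sketch of the same composition with the stages injected as plain functions, which makes the control flow testable without loading any models and measures the actual elapsed time with `time.perf_counter()`. `run_pipeline` and the stub stages are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


def run_pipeline(
    audio: object,
    transcribe: Callable[[object], Dict[str, str]],
    summarize: Callable[[str], str],
    make_notes: Callable[[str], str],
) -> Tuple[str, str, str, str]:
    """Return (transcription, summary, notes, status), mirroring process_audio_complete."""
    start = time.perf_counter()
    result = transcribe(audio)
    text = result.get("text", "")
    if not text:
        # Nothing to summarize: skip the expensive summarization stages
        skipped = "❌ Cannot continue without a transcription"
        return "", skipped, skipped, result.get("status", "❌ No transcription available")
    summary = summarize(text)
    notes = make_notes(text)
    elapsed = time.perf_counter() - start
    status = (
        f"✅ Done: {len(text.split())} words, "
        f"language {result.get('language', 'unknown')}, {elapsed:.2f} s"
    )
    return text, summary, notes, status


# Stub stages standing in for Whisper and BART
def fake_transcribe(audio):
    return {"text": "Hello team. Today we review the roadmap.", "language": "en"}


def fake_summarize(text):
    return text.split(".")[0]


def fake_notes(text):
    return "- notes"


outputs = run_pipeline("meeting.wav", fake_transcribe, fake_summarize, fake_notes)
print(outputs[0])  # Hello team. Today we review the roadmap.
print(outputs[1])  # Hello team
```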

## 6. Gradio Interface

Creating an intuitive and beautiful user interface:

In [None]:
def create_gradio_interface():
    """
    Create and configure the Gradio interface.
    
    Returns:
        gr.Blocks: Configured Gradio Blocks app
    """
    
    # Custom CSS for better styling
    custom_css = """
    .gradio-container {
        font-family: 'Arial', sans-serif;
        max-width: 1200px;
        margin: 0 auto;
    }
    .output-text {
        font-size: 14px;
        line-height: 1.6;
    }
    .status-box {
        background-color: #f0f8ff;
        border-radius: 8px;
        padding: 10px;
    }
    """
    
    # Create the interface using Blocks for more control
    with gr.Blocks(
        css=custom_css,
        title="🎤 AI Audio Transcription & Note Generator",
        theme=gr.themes.Soft()
    ) as interface:
        
        # Header
        gr.Markdown(
            """
            # 🎤 AI-Powered Audio Transcription & Note Generator
            
            Transform your audio recordings into structured notes and summaries using AI!
            
            **Features:**
            - 🎙️ Record audio directly or upload files
            - 🗣️ High-quality speech-to-text transcription
            - 📝 AI-generated summaries and structured notes
            - 🌍 Multi-language support
            """
        )
        
        with gr.Row():
            with gr.Column(scale=1):
                # Input section
                gr.Markdown("## 🎵 Audio Input")
                
                audio_input = gr.Audio(
                    label="🎤 Record or Upload Audio",
                    type="filepath",
                    sources=["microphone", "upload"]
                )
                
                language_input = gr.Dropdown(
                    choices=[
                        ("Auto-detect", "auto"),
                        ("English", "en"),
                        ("Spanish", "es"),
                        ("French", "fr"),
                        ("German", "de"),
                        ("Italian", "it"),
                        ("Portuguese", "pt"),
                        ("Russian", "ru"),
                        ("Japanese", "ja"),
                        ("Chinese", "zh")
                    ],
                    value="auto",
                    label="🌍 Language"
                )
                
                process_btn = gr.Button(
                    "🚀 Process Audio", 
                    variant="primary",
                    size="lg"
                )
                
                # Status display
                status_output = gr.Markdown(
                    "📋 Ready to process audio...",
                    elem_classes=["status-box"]
                )
            
            with gr.Column(scale=2):
                # Output section
                gr.Markdown("## 📊 Results")
                
                with gr.Tabs():
                    with gr.TabItem("📝 Transcription"):
                        transcription_output = gr.Textbox(
                            label="🗣️ Transcribed Text",
                            placeholder="Transcription will appear here...",
                            lines=10,
                            max_lines=15,
                            elem_classes=["output-text"]
                        )
                    
                    with gr.TabItem("📋 Summary"):
                        summary_output = gr.Textbox(
                            label="📋 AI-Generated Summary",
                            placeholder="Summary will appear here...",
                            lines=8,
                            max_lines=12,
                            elem_classes=["output-text"]
                        )
                    
                    with gr.TabItem("🎯 Structured Notes"):
                        notes_output = gr.Markdown(
                            "Structured notes will appear here...",
                            elem_classes=["output-text"]
                        )
        
        # Example section
        gr.Markdown(
            """
            ---
            ## 💡 Tips for Best Results:
            
            - **Audio Quality**: Use clear audio with minimal background noise
            - **Duration**: Works best with 30 seconds to 10 minutes of audio
            - **Language**: Select the correct language for better accuracy
            - **Content**: Structured speech (meetings, lectures) works better than casual conversation
            
            ## 🔧 Supported Formats:
            WAV, MP3, MP4, M4A, FLAC, and more!
            """
        )
        
        # Connect the processing function
        process_btn.click(
            fn=process_audio_complete,
            inputs=[audio_input, language_input],
            outputs=[transcription_output, summary_output, notes_output, status_output],
            show_progress="full"
        )
        
        # Update the status message when audio is added or cleared (this does not start processing)
        audio_input.change(
            fn=lambda audio: "🎵 Audio uploaded! Click 'Process Audio' to continue." if audio else "📋 Ready to process audio...",
            inputs=[audio_input],
            outputs=[status_output]
        )
    
    return interface

print("✅ Gradio interface function ready!")

## 7. Demo Launch Cell

Launch the complete application:

In [None]:
# Create and launch the Gradio interface
print("🚀 Launching AI Audio Transcription & Note Generator...")
print("📝 This may take a moment to load all models...")

# Create the interface
demo = create_gradio_interface()

# Launch with configuration
if __name__ == "__main__":
    demo.launch(
        share=True,          # Create public link (set to False for local only)
        server_name="0.0.0.0",  # Allow external connections
        server_port=7860,    # Port number
        show_error=True,     # Show detailed errors
        quiet=False          # Show startup logs
    )
    
print("🎉 Application launched successfully!")
print("📱 Access the app through the provided URL")
print("🔄 Re-run this cell after editing code; the running app does not reload automatically")
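`share=True` together with `server_name="0.0.0.0"` makes the app reachable from outside your machine. One way to keep that opt-in is to build the `demo.launch()` arguments from the environment. This is a sketch under assumptions: the `APP_PUBLIC` and `APP_PORT` variable names and the `launch_kwargs` helper are hypothetical; only the keyword names come from the launch cell above.

```python
import os
from typing import Any, Dict, Mapping, Optional


def launch_kwargs(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Build demo.launch() arguments; public access is opt-in via a hypothetical APP_PUBLIC=1."""
    env = os.environ if env is None else env
    public = env.get("APP_PUBLIC") == "1"
    return {
        "share": public,                                      # Gradio share link only when opted in
        "server_name": "0.0.0.0" if public else "127.0.0.1",  # otherwise listen on localhost only
        "server_port": int(env.get("APP_PORT", "7860")),      # hypothetical APP_PORT override
        "show_error": True,
    }


print(launch_kwargs({}))
# {'share': False, 'server_name': '127.0.0.1', 'server_port': 7860, 'show_error': True}
# In the notebook: demo.launch(**launch_kwargs())
```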

## 🎯 Usage Instructions

1. **Run the setup cells** (Sections 1-6) to install dependencies and load the models
2. **Launch the demo** with the cell in Section 7
3. **Upload or record audio** using the interface
4. **Select language** (or use auto-detect)
5. **Click 'Process Audio'** to get transcription, summary, and notes

## 🔧 Customization Options

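- **Change Whisper model**: Modify `whisper.load_model("base")` to use "tiny", "small", "medium", or "large"; larger models are generally more accurate but slower and need more memory
- **Adjust summary length**: Modify `max_length` and `min_length` (measured in tokens, not characters) in the summarization functions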
- **Add more languages**: Extend the language dropdown in the Gradio interface
- **Customize UI**: Modify the CSS and layout in the `create_gradio_interface()` function
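One way to keep the language dropdown easy to extend is to drive it from a single mapping of Whisper language codes to display names. The sketch below assumes the `(label, value)` tuple form of `gr.Dropdown` choices already used in `create_gradio_interface()`; `LANGUAGES` and `dropdown_choices` are hypothetical names, and only codes that Whisper supports should be added.

```python
from typing import Dict, List, Tuple

# Whisper language codes mapped to display names; "auto" means let Whisper detect it
LANGUAGES: Dict[str, str] = {
    "auto": "Auto-detect",
    "en": "English",
    "es": "Spanish",
    "fr": "French",
    "de": "German",
    "it": "Italian",
    "pt": "Portuguese",
    "ru": "Russian",
    "ja": "Japanese",
    "zh": "Chinese",
    "ko": "Korean",  # example of a newly added language
    "hi": "Hindi",   # example of a newly added language
}


def dropdown_choices(languages: Dict[str, str]) -> List[Tuple[str, str]]:
    """Build (label, value) pairs, keeping Auto-detect first and sorting the rest by name."""
    rest = sorted((name, code) for code, name in languages.items() if code != "auto")
    return [(languages["auto"], "auto")] + rest


choices = dropdown_choices(LANGUAGES)
print(choices[:3])  # [('Auto-detect', 'auto'), ('Chinese', 'zh'), ('English', 'en')]
# In the interface: gr.Dropdown(choices=dropdown_choices(LANGUAGES), value="auto", label="🌍 Language")
```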

## 📊 Model Information

- **Transcription**: OpenAI Whisper (base model)
- **Summarization**: Facebook BART (`facebook/bart-large-cnn`), falling back to DistilBART (`sshleifer/distilbart-cnn-12-6`) if BART fails to load
- **Interface**: Gradio with custom styling

---
*Built with ❤️ using open-source AI models*