# 🎬 YouTube Video Transcription + AI Summary

**Simple 2-step process:**

1. **Run Cell 1** → Get your transcript (2-5 minutes)
2. **Run Cell 2** (optional) → Add AI summary (~30 seconds)

Works with any publicly accessible YouTube video URL!

💡 **Tip:** Go to Runtime → Change runtime type → GPU (optional, makes it faster)
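
To confirm that a GPU runtime is actually attached before loading the model, a quick check along these lines may help. This is a minimal sketch: looking for `nvidia-smi` on the PATH is only a heuristic chosen for illustration (the notebook itself performs no such check), and `gpu_runtime_available` is a hypothetical helper name.

```python
import shutil
import subprocess


def gpu_runtime_available() -> bool:
    """Heuristic (assumption): True if `nvidia-smi` exists and exits successfully."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        result = subprocess.run([exe], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if gpu_runtime_available():
    print("GPU runtime detected")
else:
    print("No GPU detected; transcription will run on CPU")
```

If no GPU is found, the notebook still works; transcription simply takes longer.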

## 📝 Step 1: Download & Transcribe YouTube Video

This cell will:
1. Install all dependencies
2. Ask for YouTube URL
3. Download audio
4. Transcribe to text
5. Save transcript file

**Note:** The transcript is also kept in notebook variables (such as `transcript_text`) so the summary step can reuse it.
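
Each line in the saved file's segment section pairs a start and end timestamp with the spoken text. The sketch below shows that formatting on hand-written sample data; the `segments` list is made up for illustration and only mimics the `start`, `end`, and `text` keys that the cell reads from the transcription result. `format_segment` is a hypothetical helper.

```python
def format_timestamp(seconds: float) -> str:
    """Convert a number of seconds to HH:MM:SS (fractions of a second are dropped)."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_segment(segment: dict) -> str:
    """Render one segment as '[start → end] text'."""
    start = format_timestamp(segment["start"])
    end = format_timestamp(segment["end"])
    return f"[{start} → {end}] {segment['text'].strip()}"


# Made-up sample data, not real transcription output
segments = [
    {"start": 0.0, "end": 4.2, "text": " Welcome to the video."},
    {"start": 3725.9, "end": 3730.0, "text": " Thanks for watching."},
]
lines = [format_segment(s) for s in segments]
print("\n".join(lines))
```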

In [None]:
#!/usr/bin/env python3
"""
STEP 1: YouTube Download + Transcription
"""

print("\n" + "="*70)
print("🎬  INSTALLING & SETTING UP")
print("="*70 + "\n")

# Step 1: Install dependencies
print("📦 Installing dependencies...")
import subprocess
subprocess.run(['pip', 'install', '-q', 'openai-whisper'], check=True)
subprocess.run(['pip', 'install', '-q', 'yt-dlp'], check=True)
subprocess.run(['apt-get', '-qq', 'install', '-y', 'ffmpeg'], check=True)
print("✓ Dependencies installed\n")

# Step 2: Import libraries
print("📚 Importing libraries...")
import whisper
import os
from pathlib import Path
from google.colab import files
import torch
import re
print("✓ Libraries ready\n")

# Step 3: Load Whisper model
print("🤖 Loading Whisper Large model...")
print("   (This takes ~1-2 minutes first time)\n")
model = whisper.load_model("large")
print("✓ Model loaded\n")

# Step 4: Get YouTube URL
print("="*70)
print("🔗 ENTER YOUTUBE URL")
print("="*70)
youtube_url = input("📺 YouTube URL: ").strip()

if not youtube_url:
    print("⚠️  No URL provided. Please run this cell again.")
    raise ValueError("No URL provided")

print(f"\n✓ URL received: {youtube_url}\n")

# Step 5: Download audio from YouTube
print("="*70)
print("📥 DOWNLOADING AUDIO FROM YOUTUBE")
print("="*70 + "\n")

audio_file = "youtube_audio.mp3"

try:
    result = subprocess.run([
        'yt-dlp',
        '-x',  # Extract audio
        '--audio-format', 'mp3',
        '--audio-quality', '0',  # Best quality
        '-o', audio_file.replace('.mp3', '.%(ext)s'),
        '--no-playlist',  # Only download single video
        youtube_url
    ], check=True, capture_output=True, text=True)
    
    print(f"✓ Audio downloaded: {audio_file}\n")
    
    # Step 6: Transcribe
    print("="*70)
    print("🎙️  TRANSCRIBING (wait 2-5 minutes depending on length)")
    print("="*70 + "\n")
    
    def format_timestamp(seconds):
        hours = int(seconds // 3600)
        minutes = int((seconds % 3600) // 60)
        secs = int(seconds % 60)
        return f"{hours:02d}:{minutes:02d}:{secs:02d}"
    
    print(f"📝 Processing: {audio_file}")
    print("-" * 70)
    
    # Transcribe (auto-detect language)
    result = model.transcribe(
        audio_file,
        task='transcribe',
        verbose=False
    )
    
    detected_language = result['language']
    print(f"✓ Detected language: {detected_language}\n")
    
    # Store transcript data globally for next cell
    transcript_text = result['text'].strip()
    transcript_segments = result['segments']
    transcript_url = youtube_url
    transcript_language = detected_language
    
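    # Create safe filename from URL: keep only ID characters (letters, digits,
    # '_' and '-') so query strings like '?si=...' stay out of the filename
    video_id_match = re.search(r'(?:v=|youtu\.be/)([\w-]+)', youtube_url)
    video_id = video_id_match.group(1) if video_id_match else 'youtube'
    transcript_path = f'youtube_{video_id}_transcript.txt'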
    
    # Save transcript WITHOUT summary first
    with open(transcript_path, 'w', encoding='utf-8') as f:
        f.write("YouTube Transcript\n")
        f.write(f"URL: {youtube_url}\n")
        f.write(f"Language: {detected_language}\n")
        f.write("=" * 70 + "\n\n")
        f.write("📝 FULL TRANSCRIPT\n")
        f.write("=" * 70 + "\n")
        f.write(transcript_text)
        f.write("\n\n" + "=" * 70 + "\n")
        f.write("⏱️  DETAILED SEGMENTS\n")
        f.write("=" * 70 + "\n\n")

        for segment in transcript_segments:
            start = format_timestamp(segment['start'])
            end = format_timestamp(segment['end'])
            text = segment['text'].strip()
            f.write(f"[{start} → {end}] {text}\n")

    print(f"✓ Transcript saved: {transcript_path}\n")
    
    # Show transcript preview
    print("📋 Transcript Preview:")
    print("-" * 70)
    preview = transcript_text[:300]
    print(preview)
    if len(transcript_text) > 300:
        print("...(more)")
    print("-" * 70)

    # Download transcript
    print("\n📥 Downloading transcript...")
    files.download(transcript_path)
    print(f"✓ Downloaded: {transcript_path}\n")
    
    # Cleanup audio file
    if os.path.exists(audio_file):
        os.remove(audio_file)
        print("✓ Cleaned up temporary audio file\n")

    print("\n" + "="*70)
    print("✅ STEP 1 COMPLETE!")
    print("="*70)
    print("Your transcript is ready and downloaded.\n")
    print("💡 Next: Run the cell below to add AI summary (optional)")
    print("="*70 + "\n")
    
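except subprocess.CalledProcessError as e:
    print(f"❌ Error downloading video: {e}")
    # yt-dlp's output was captured, so show its stderr to explain the failure
    if e.stderr:
        print(e.stderr.strip())
    print("Make sure the URL is valid and the video is accessible.")
except Exception as e:
    print(f"❌ Error: {e}")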

## 🧠 Step 2: Add AI Summary (Optional)

Run this cell to add an AI-powered summary to your transcript:
- Brief overview
- Key points
- Main takeaways

**You'll need a FREE Groq API key:** [console.groq.com/keys](https://console.groq.com/keys)

Skip this step if you only want the transcript.
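
The summary cell sends only the first 15,000 characters of the transcript to the model. For longer videos, one alternative is to split the transcript into chunks, summarize each chunk, and then summarize the partial summaries. The sketch below covers only the splitting step. `chunk_text` is a hypothetical helper that is not part of the notebook, and its default limit simply mirrors the cell's `max_chars`.

```python
def chunk_text(text: str, max_chars: int = 15000) -> list[str]:
    """Split text into pieces of at most max_chars, breaking on a space where possible."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks = []
    remaining = text.strip()
    while len(remaining) > max_chars:
        # Prefer the last space inside the window; fall back to a hard cut
        cut = remaining.rfind(" ", 0, max_chars + 1)
        if cut <= 0:
            cut = max_chars
        chunks.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        chunks.append(remaining)
    return chunks


# Small demonstration with a tiny limit so the split is easy to see
demo = chunk_text("one two three four five six", max_chars=10)
print(demo)
```

Each chunk could then be passed to the same chat completion call the cell already makes, at the cost of one request per chunk.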

In [None]:
#!/usr/bin/env python3
"""
STEP 2: Add AI Summary to Transcript
"""

print("\n" + "="*70)
print("🧠 AI SUMMARY GENERATION")
print("="*70 + "\n")

# Check if transcript exists from previous cell
try:
    if not transcript_text:
        raise NameError
    print("✓ Transcript found from previous step\n")
except NameError:
    print("❌ Error: No transcript found!")
    print("Please run the transcription cell above first.\n")
    raise

# Install Groq if not already installed
import subprocess
print("📦 Installing Groq API...")
subprocess.run(['pip', 'install', '-q', 'groq'], check=True)
print("✓ Groq installed\n")

from groq import Groq
from getpass import getpass
import re

# Get API key
print("🔑 Enter your Groq API key:")
print("Get it free at: https://console.groq.com/keys\n")
groq_api_key = getpass("Groq API Key: ").strip()

if not groq_api_key:
    print("⚠️  No API key provided. Skipping summary.")
else:
    try:
        print("\n" + "="*70)
        print("🧠 GENERATING AI SUMMARY (wait ~30 seconds)")
        print("="*70 + "\n")
        
        client = Groq(api_key=groq_api_key)
        
        # Truncate long transcripts to stay within the model's token limit
        max_chars = 15000
        transcript_to_summarize = transcript_text[:max_chars]
        if len(transcript_text) > max_chars:
            transcript_to_summarize += "... (transcript truncated for summary)"
        
        chat_completion = client.chat.completions.create(
            messages=[
                {
                    "role": "system",
                    "content": "You are a helpful assistant that creates clear, concise summaries of video transcripts. Provide: 1) A brief overview (2-3 sentences), 2) Key points (bullet points), 3) Main takeaways."
                },
                {
                    "role": "user",
                    "content": f"Please summarize this video transcript:\n\n{transcript_to_summarize}"
                }
            ],
            model="llama-3.3-70b-versatile",
            temperature=0.3,
            max_tokens=1000
        )
        
        summary_text = chat_completion.choices[0].message.content
        print("✓ AI Summary generated\n")

        # Display summary
        print("="*70)
        print("📋 AI SUMMARY")
        print("="*70)
        print(summary_text)
        print("="*70 + "\n")
        
        # Create updated transcript file WITH summary
        video_id_match = re.search(r'(?:v=|youtu\.be/)([\w-]+)', transcript_url)
        video_id = video_id_match.group(1) if video_id_match else 'youtube'
        transcript_with_summary_path = f'youtube_{video_id}_transcript_with_summary.txt'
        
        def format_timestamp(seconds):
            hours = int(seconds // 3600)
            minutes = int((seconds % 3600) // 60)
            secs = int(seconds % 60)
            return f"{hours:02d}:{minutes:02d}:{secs:02d}"
        
        with open(transcript_with_summary_path, 'w', encoding='utf-8') as f:
            f.write("YouTube Transcript + AI Summary\n")
            f.write(f"URL: {transcript_url}\n")
            f.write(f"Language: {transcript_language}\n")
            f.write("=" * 70 + "\n\n")

            # Add AI Summary section
            f.write("🧠 AI SUMMARY\n")
            f.write("=" * 70 + "\n")
            f.write(summary_text)
            f.write("\n\n" + "=" * 70 + "\n\n")

            f.write("📝 FULL TRANSCRIPT\n")
            f.write("=" * 70 + "\n")
            f.write(transcript_text)
            f.write("\n\n" + "=" * 70 + "\n")
            f.write("⏱️  DETAILED SEGMENTS\n")
            f.write("=" * 70 + "\n\n")

            for segment in transcript_segments:
                start = format_timestamp(segment['start'])
                end = format_timestamp(segment['end'])
                text = segment['text'].strip()
                f.write(f"[{start} → {end}] {text}\n")

        print(f"✓ Updated file created: {transcript_with_summary_path}\n")
        
        # Download updated file
        from google.colab import files
        print("📥 Downloading transcript with summary...")
        files.download(transcript_with_summary_path)
        print(f"✓ Downloaded: {transcript_with_summary_path}\n")

        print("\n" + "="*70)
        print("🎉 ALL DONE!")
        print("="*70)
        print("Your transcript + AI summary is ready!\n")
        print("="*70 + "\n")

    except Exception as e:
        print(f"❌ Summary generation failed: {e}")
        print("Check your API key and try again.\n")

## ✨ All Done!

### 📥 Downloaded Files:

**After Step 1:**
- `youtube_[id]_transcript.txt` - Basic transcript with timestamps

**After Step 2 (optional):**
- `youtube_[id]_transcript_with_summary.txt` - Transcript + AI summary
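
The `[id]` in these filenames is the video ID taken from the URL. The notebook extracts it with a regular expression; the sketch below is an alternative that uses `urllib.parse` from the standard library. `extract_video_id` and `transcript_filename` are hypothetical helpers, the example ID is a placeholder, and the sketch assumes full URLs that include the `https://` scheme.

```python
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str, default: str = "youtube") -> str:
    """Return the video ID from a watch or youtu.be URL, or `default` if none is found."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the first path component
        candidate = parsed.path.lstrip("/").split("/")[0]
    else:
        # Watch links carry the ID in the `v` query parameter
        candidate = parse_qs(parsed.query).get("v", [""])[0]
    # Keep only characters that are safe in a filename
    cleaned = "".join(ch for ch in candidate if ch.isalnum() or ch in "-_")
    return cleaned or default


def transcript_filename(url: str, with_summary: bool = False) -> str:
    """Build the output filename in the same pattern the notebook uses."""
    suffix = "_transcript_with_summary.txt" if with_summary else "_transcript.txt"
    return f"youtube_{extract_video_id(url)}{suffix}"


print(transcript_filename("https://www.youtube.com/watch?v=abc123XYZ_-&t=10s"))
```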

### 🤖 About Groq API (for AI summaries)

- **Free & Fast:** The summary cell uses the Llama 3.3 70B model (`llama-3.3-70b-versatile`)
- **Get your key:** [console.groq.com/keys](https://console.groq.com/keys)
- **Generous limits:** Perfect for this use case

### 📁 Next Steps

Organize your transcripts by running this in a terminal:
```bash
python3 scripts/trigger_colab.py --organize
```

This moves the files to the `project/transcripts/` folder. 🎉
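
The script itself is not shown in this notebook. As a rough illustration of what an organize step of this kind might do, the sketch below moves downloaded transcript files into a target folder. `organize_transcripts`, the glob pattern, and the folder arguments are assumptions made for illustration, not the actual behavior of `trigger_colab.py`.

```python
import shutil
from pathlib import Path


def organize_transcripts(source_dir: str, target_dir: str) -> list[Path]:
    """Move youtube_*_transcript*.txt files from source_dir into target_dir."""
    source = Path(source_dir)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(source.glob("youtube_*_transcript*.txt")):
        destination = target / path.name
        if destination.exists():
            # Skip rather than overwrite an existing transcript
            continue
        shutil.move(str(path), str(destination))
        moved.append(destination)
    return moved
```

A call such as `organize_transcripts("Downloads", "project/transcripts")` would return the list of new file locations; both folder names here are placeholders.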