# YouTube Video Analyser - Complete Demonstration

This notebook demonstrates all components of the **youtube-video-analyser** RAG pipeline project.

## Components Covered:
1. **TranscriptFetcher** - YouTube transcript extraction
2. **Chunker** - Semantic text chunking strategies
3. **Embedder** - Vector embedding generation
4. **VectorRetriever** - ChromaDB storage and retrieval
5. **Summarizer** - LLM-based summarization
6. **YouTubeAnalyzer** - Main orchestrator (end-to-end pipeline)

---

## Setup and Installation

First, ensure all dependencies are installed.

**⚠️ Important:** After running the installation cell below, **restart the kernel** (Kernel → Restart Kernel) before proceeding to the next cells.

In [14]:
# Install required packages (if not already installed)
# Run this cell first if you're in a fresh environment
!pip install youtube-transcript-api langchain langchain-text-splitters sentence-transformers chromadb groq openai python-dotenv spacy transformers torch



---
**🔄 RESTART THE KERNEL NOW**

After installing packages, please restart the kernel:
- Click **"Kernel"** → **"Restart Kernel"** (or press `00` in command mode)
- Then continue with the cells below

---

In [47]:
# Import required libraries
import os
import sys
from pathlib import Path
from dotenv import load_dotenv

# Add the src directory to the Python path
project_root = Path().absolute().parent
sys.path.insert(0, str(project_root / "src"))

# Load environment variables from .env file
env_path = project_root / ".env"
load_dotenv(env_path)

print(f"Project root: {project_root}")
print(f"Python path updated: {sys.path[0]}")
print(f"Environment file loaded: {env_path}")
print(f"\nAPI Keys loaded:")
print(f"  - GROQ_API_KEY: {'✓ Set' if os.getenv('GROQ_API_KEY') else '✗ Not set'}")
print(f"  - OPENAI_API_KEY: {'✓ Set' if os.getenv('OPENAI_API_KEY') else '✗ Not set'}")
print(f"  - OPENROUTER_API_KEY: {'✓ Set' if os.getenv('OPENROUTER_API_KEY') else '✗ Not set'}")
print(f"\nLLM Provider: {os.getenv('LLM_PROVIDER', 'groq')}")

Project root: /Users/vasantharajanpandian/my-development/zero-development/vasanth-experiments/youtube-video-analyser-model
Python path updated: /Users/vasantharajanpandian/my-development/zero-development/vasanth-experiments/youtube-video-analyser-model/src
Environment file loaded: /Users/vasantharajanpandian/my-development/zero-development/vasanth-experiments/youtube-video-analyser-model/.env

API Keys loaded:
  - GROQ_API_KEY: ✓ Set
  - OPENAI_API_KEY: ✓ Set
  - OPENROUTER_API_KEY: ✓ Set

LLM Provider: groq


In [48]:
# Force reload of modules to pick up code changes
import importlib
import sys

# Remove cached modules
modules_to_reload = [key for key in sys.modules.keys() if 'youtube_analyzer' in key]
for module in modules_to_reload:
    del sys.modules[module]

print(f"Reloaded {len(modules_to_reload)} youtube_analyzer modules")
print("Modules cleared - imports will be fresh")

Reloaded 12 youtube_analyzer modules
Modules cleared - imports will be fresh


## 1. TranscriptFetcher - YouTube Transcript Extraction

The `TranscriptFetcher` class handles fetching YouTube video transcripts with robust error handling.

In [49]:
from youtube_analyzer.core.transcript_fetcher import TranscriptFetcher

# Initialize the transcript fetcher
fetcher = TranscriptFetcher()

# Example YouTube video URL - Replace with a video that has captions
# Good test videos:
# - TED Talks: https://www.youtube.com/watch?v=c0KYU2j0TM4 (Susan Cain, "The power of introverts")
# - Educational content with captions
video_url = "https://www.youtube.com/watch?v=c0KYU2j0TM4"  # Replace with your desired video URL

print(f"Fetching transcript from: {video_url}")
print("-" * 80)
try:
    transcript_segments = fetcher.fetch_transcript(video_url)
    
    print(f"\n✓ Successfully fetched {len(transcript_segments)} transcript segments")
    print("\nFirst 3 segments:")
    for i, segment in enumerate(transcript_segments[:3]):
        print(f"\n[{i+1}] Time: {segment.start:.2f}s - {segment.start + segment.duration:.2f}s")
        print(f"Text: {segment.text}")
        
except Exception as e:
    print(f"✗ Error fetching transcript: {e}")
    print("\nNote: Make sure the video has captions/subtitles available")
    transcript_segments = []

Fetching transcript from: https://www.youtube.com/watch?v=c0KYU2j0TM4
--------------------------------------------------------------------------------

✓ Successfully fetched 409 transcript segments

First 3 segments:

[1] Time: 15.26s - 17.24s
Text: When I was nine years old,

[2] Time: 17.26s - 19.40s
Text: I went off to summer camp
for the first time.

[3] Time: 19.43s - 23.24s
Text: And my mother packed me a suitcase
full of books,
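
Before a transcript can be requested, the video URL has to be reduced to its video ID. How `TranscriptFetcher` does this internally is not shown in this notebook, so the following is only a minimal sketch using the standard library; `extract_video_id` is a hypothetical helper name, and the set of accepted hosts is an assumption.

```python
from urllib.parse import parse_qs, urlparse

# Hosts this sketch accepts (an assumption, not the project's actual list).
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def extract_video_id(url: str) -> str:
    """Return the video ID from a watch, short-link, embed, or shorts URL.

    Hypothetical helper for illustration; not part of youtube_analyzer.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    candidate = ""

    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host in YOUTUBE_HOSTS:
        if parsed.path == "/watch":
            # Standard watch URLs carry the ID in the "v" query parameter.
            candidate = parse_qs(parsed.query).get("v", [""])[0]
        elif parsed.path.startswith(("/embed/", "/shorts/")):
            candidate = parsed.path.split("/")[2]

    if not candidate:
        raise ValueError(f"Could not find a video ID in: {url}")
    return candidate


print(extract_video_id("https://www.youtube.com/watch?v=c0KYU2j0TM4"))  # c0KYU2j0TM4
```

Raising `ValueError` early gives a clearer message than letting a malformed URL fail later inside the transcript request.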


## 2. Chunker - Semantic Text Chunking

The `ChunkerFactory` provides different strategies for splitting text into semantic chunks.

In [50]:
from youtube_analyzer.core.chunker import ChunkerFactory

# Combine transcript segments into full text
if transcript_segments:
    full_text = " ".join([seg.text for seg in transcript_segments])
    # Use chunk_segments for transcript data with timing
    chunker = ChunkerFactory.create_chunker(
        method="langchain",
        chunk_size=500,
        chunk_overlap=100
    )
    chunks = chunker.chunk_segments(transcript_segments)
    print(f"Full text length: {len(full_text)} characters")
else:
    # Use sample text for demonstration
    full_text = """Artificial intelligence is transforming the world. 
    Machine learning enables computers to learn from data. 
    Deep learning uses neural networks with multiple layers. 
    Natural language processing helps computers understand human language.
    Computer vision allows machines to interpret visual information."""
    
    print(f"Full text length: {len(full_text)} characters")
    
    # Create chunker with LangChain method
    chunker = ChunkerFactory.create_chunker(
        method="langchain",
        chunk_size=500,
        chunk_overlap=100
    )
    
    # For plain text, use chunk_text and wrap in ChunkData manually
    from youtube_analyzer.models.schemas import ChunkData
    text_chunks = chunker.chunk_text(full_text)
    chunks = [
        ChunkData(
            text=chunk,
            chunk_index=i,
            start_time=0.0,
            end_time=0.0,
            metadata={}
        )
        for i, chunk in enumerate(text_chunks)
    ]

print(f"\nCreated {len(chunks)} chunks\n")
print("First 2 chunks:")
for i, chunk in enumerate(chunks[:2]):
    print(f"\n[Chunk {i+1}]")
    print(f"Text: {chunk.text[:200]}..." if len(chunk.text) > 200 else f"Text: {chunk.text}")
    print(f"Start time: {chunk.start_time}s, End time: {chunk.end_time}s")

Full text length: 18005 characters

Created 44 chunks

First 2 chunks:

[Chunk 1]
Text: When I was nine years old, I went off to summer camp
for the first time. And my mother packed me a suitcase
full of books, which to me seemed like
a perfectly natural thing to do. Because in my family...
Start time: 15.26s, End time: 42.117999999999995s

[Chunk 2]
Text: sitting right next to you, but you are also free to go
roaming around the adventureland inside your own mind. And I had this idea that camp was going to be
just like this, but better. (Laughter) I had...
Start time: 39.26s, End time: 71.23599999999999s
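
Chunk 2 above starts at 39.26s, before chunk 1 ends at about 42.12s, which is the effect of `chunk_overlap`. The idea can be illustrated with a plain character-based sliding window. This is a simplified sketch, not the LangChain splitter the project uses (which also tries to break on separators), and `chunk_with_overlap` is a hypothetical name.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters.

    Illustrative only: a real splitter would also respect sentence or
    paragraph boundaries instead of cutting at arbitrary characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


print(chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, so a sentence cut at a boundary still appears intact in at least one chunk when the overlap is large enough.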


## 3. Embedder - Vector Embedding Generation

The `Embedder` class generates vector embeddings using sentence-transformers.

In [51]:
from youtube_analyzer.core.embedder import Embedder

# Initialize embedder
embedder = Embedder(model_name="all-MiniLM-L6-v2")

print("Generating embeddings...")

# Generate embeddings for chunks
texts = [chunk.text for chunk in chunks]
embedding_result = embedder.encode_texts(texts)
embeddings = embedding_result.embeddings

print(f"\nGenerated {len(embeddings)} embeddings")
print(f"Embedding dimension: {embedding_result.dimension}")
print(f"Model used: {embedding_result.model_name}")
print(f"\nFirst embedding (first 10 values): {embeddings[0][:10]}")

Generating embeddings...

Generated 44 embeddings
Embedding dimension: 384
Model used: all-MiniLM-L6-v2

First embedding (first 10 values): [0.08945563435554504, -0.018094534054398537, -0.003563137259334326, 0.11281244456768036, 0.0449887178838253, 0.04932912439107895, 0.050185877829790115, -0.030048904940485954, 0.015698527917265892, -0.005386654753237963]
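
Embeddings are only useful once two of them can be compared. A common measure for sentence-transformers vectors is cosine similarity; whether `VectorRetriever` is configured for cosine or another distance is not visible in this notebook, so the sketch below is an illustration of the measure itself, with `cosine_similarity` as a hypothetical helper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length.

    Returns 0.0 if either vector has zero length (a convention chosen
    for this sketch to avoid dividing by zero).
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

With the 384-dimensional vectors produced above, the same function would apply unchanged, for example `cosine_similarity(embeddings[0], embeddings[1])`.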


## 4. VectorRetriever - ChromaDB Storage and Retrieval

The `VectorRetriever` class handles storage and similarity-based retrieval using ChromaDB.

In [52]:
from youtube_analyzer.core.retriever import VectorRetriever

# Initialize vector retriever
retriever = VectorRetriever(
    persist_directory="./data/demo_chroma_db",
    collection_name="demo_collection"
)

# Store chunks with embeddings
video_id = "demo_video_001"
print(f"Storing {len(chunks)} chunks in ChromaDB...")

retriever.add_chunks(
    video_id=video_id,
    chunks=chunks,
    embeddings=embeddings
)

print("\nChunks stored successfully!")

# Retrieve similar chunks
query = "What is machine learning?"
print(f"\nSearching for: '{query}'")

# Generate query embedding
query_embedding_result = embedder.encode_texts([query])
query_embedding = query_embedding_result.embeddings[0]

# Retrieve top 3 similar chunks
retrieval_result = retriever.search_similar_chunks(
    query_embedding=query_embedding,
    video_id=video_id,
    n_results=3
)

similar_chunks = retrieval_result.chunks

print(f"\nFound {len(similar_chunks)} relevant chunks:")
print(f"Retrieval time: {retrieval_result.retrieval_time:.3f}s")
for i, chunk in enumerate(similar_chunks):
    print(f"\n[Result {i+1}] (Similarity: {chunk.similarity_score:.3f})")
    print(f"Text: {chunk.text[:150]}..." if len(chunk.text) > 150 else f"Text: {chunk.text}")

Storing 44 chunks in ChromaDB...

Chunks stored successfully!

Searching for: 'What is machine learning?'

Found 3 relevant chunks:
Retrieval time: 0.001s

[Result 1] (Similarity: 0.620)
Text: Artificial intelligence is transforming the world. 
    Machine learning enables computers to learn from data. 
    Deep learning uses neural networks...

[Result 2] (Similarity: 0.207)
Text: for lots of stimulation. And also we have
this belief system right now that I call the new groupthink, which holds that all creativity
and all product...

[Result 3] (Similarity: 0.143)
Text: become such an expert in the first place had he not been too introverted
to leave the house when he was growing up. Now, of course, this does not mean...
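
In the results above, only the first hit scores well (0.620); the other two (0.207 and 0.143) are weak matches for the query. One way to keep such chunks out of the summarization context is a minimum-score filter. This is a sketch under assumptions: `ScoredChunk` is a hypothetical stand-in for the project's result objects (only the `text` and `similarity_score` attributes are taken from the cell above), and the 0.3 threshold is an arbitrary illustrative value.

```python
from dataclasses import dataclass


@dataclass
class ScoredChunk:
    """Hypothetical stand-in for a retrieved chunk with a similarity score."""
    text: str
    similarity_score: float


def filter_by_score(chunks: list[ScoredChunk], min_score: float) -> list[ScoredChunk]:
    """Keep only chunks whose similarity is at least `min_score`."""
    return [chunk for chunk in chunks if chunk.similarity_score >= min_score]


# Scores copied from the retrieval output above; texts abbreviated.
results = [
    ScoredChunk("Artificial intelligence is transforming the world...", 0.620),
    ScoredChunk("for lots of stimulation. And also we have...", 0.207),
    ScoredChunk("become such an expert in the first place...", 0.143),
]

kept = filter_by_score(results, min_score=0.3)
print([chunk.similarity_score for chunk in kept])  # [0.62]
```

A suitable threshold depends on the embedding model and the distance metric, so it is worth tuning against real queries rather than fixing it up front.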


## 5. Summarizer - LLM-based Summarization

The `SummarizerFactory` creates summarizers for different LLM providers.

In [53]:
from youtube_analyzer.core.summarizer import SummarizerFactory

# Get LLM provider from environment (default to groq)
llm_provider = os.getenv("LLM_PROVIDER", "groq")

print(f"Using LLM provider: {llm_provider}\n")

# Create summarizer
summarizer = SummarizerFactory.create_summarizer(
    provider=llm_provider,
    api_key=os.getenv(f"{llm_provider.upper()}_API_KEY")
)

# Prepare text from retrieved chunks
text_to_summarize = "\n\n".join([chunk.text for chunk in similar_chunks])

print("Generating summary...\n")

try:
    summary = summarizer.summarize(
        text=text_to_summarize,
        context_chunks=similar_chunks
    )
    
    print("=" * 80)
    print("SUMMARY")
    print("=" * 80)
    print(summary)
    print("=" * 80)
    
except Exception as e:
    print(f"Error generating summary: {e}")
    print("\nNote: Make sure you have set up the API key in your .env file")

Using LLM provider: groq

Generating summary...

SUMMARY
**Concise Summary**

The video discusses the impact of artificial intelligence on the world, highlighting its various applications such as machine learning, deep learning, natural language processing, and computer vision. However, the main focus of the video shifts to the importance of solitude in a world that increasingly values collaboration and group work. The speaker argues that the current education system, with its emphasis on group projects and pods of desks, may be neglecting the needs of introverted students who require solitude to thrive.

The speaker suggests that the "new groupthink" mentality, which holds that all creativity and productivity comes from collaborative work, may be misguided. They argue that solitude is essential for some people, allowing them to focus, reflect, and recharge. The example of Steve Wozniak, who became an expert in his field partly due to his introverted nature, is cited to illustrate the 
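
The summarizer cell derives the API key variable from the provider name (`GROQ_API_KEY` for `groq`, and so on) and passes whatever `os.getenv` returns, which may be `None`. A small check before creating the summarizer can surface a missing key with a clearer message. This is a sketch only; `resolve_api_key` is a hypothetical helper, not part of `youtube_analyzer`.

```python
from typing import Mapping


def resolve_api_key(provider: str, env: Mapping[str, str]) -> str:
    """Look up `<PROVIDER>_API_KEY` in `env` and fail loudly if it is missing.

    Hypothetical helper: mirrors the naming scheme used in the cell above.
    """
    var_name = f"{provider.upper()}_API_KEY"
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set; add it to your .env file")
    return key


# Example with an in-memory mapping instead of the real environment.
fake_env = {"GROQ_API_KEY": "example-key"}
print(resolve_api_key("groq", fake_env))  # example-key
```

In the notebook this could be called as `resolve_api_key(llm_provider, os.environ)` and the result passed as `api_key`.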

## 6. YouTubeAnalyzer - Complete End-to-End Pipeline

The `YouTubeAnalyzer` class orchestrates all components for a complete analysis.

In [54]:
from youtube_analyzer import YouTubeAnalyzer
from youtube_analyzer.models.schemas import AnalyzerConfig

# Create custom configuration using values from .env
config = AnalyzerConfig(
    embedding_model=os.getenv("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
    chunking_method=os.getenv("CHUNKING_METHOD", "langchain"),
    chunk_size=int(os.getenv("CHUNK_SIZE", "1000")),
    chunk_overlap=int(os.getenv("CHUNK_OVERLAP", "200")),
    llm_provider=os.getenv("LLM_PROVIDER", "groq"),
    chroma_persist_directory=os.getenv("CHROMA_PERSIST_DIRECTORY", "./data/chroma_db"),
    collection_name=os.getenv("COLLECTION_NAME", "youtube_transcripts")
)

# Initialize analyzer
analyzer = YouTubeAnalyzer(config=config)

print("YouTubeAnalyzer initialized successfully!")
print("=" * 80)
print("Configuration:")
print(f"  - Chunking method: {config.chunking_method}")
print(f"  - Chunk size: {config.chunk_size}")
print(f"  - Chunk overlap: {config.chunk_overlap}")
print(f"  - Embedding model: {config.embedding_model}")
print(f"  - LLM provider: {config.llm_provider}")
print(f"  - ChromaDB directory: {config.chroma_persist_directory}")
print(f"  - Collection name: {config.collection_name}")
print("=" * 80)

YouTubeAnalyzer initialized successfully!
Configuration:
  - Chunking method: langchain
  - Chunk size: 1000
  - Chunk overlap: 200
  - Embedding model: all-MiniLM-L6-v2
  - LLM provider: groq
  - ChromaDB directory: ./data/chroma_db
  - Collection name: youtube_transcripts
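
Because `CHUNK_SIZE` and `CHUNK_OVERLAP` come from the environment as strings, a typo or an overlap larger than the chunk size would only show up later in the pipeline. The sketch below validates the two values up front; `load_chunk_settings` is a hypothetical helper, the defaults are the ones used in the cell above, and whether `AnalyzerConfig` performs its own validation is not shown here.

```python
from typing import Mapping


def load_chunk_settings(env: Mapping[str, str]) -> tuple[int, int]:
    """Parse and validate chunk size and overlap from an environment mapping.

    Hypothetical helper; defaults match the notebook cell (1000 and 200).
    """
    chunk_size = int(env.get("CHUNK_SIZE", "1000"))
    chunk_overlap = int(env.get("CHUNK_OVERLAP", "200"))
    if chunk_size <= 0:
        raise ValueError("CHUNK_SIZE must be a positive integer")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("CHUNK_OVERLAP must be >= 0 and smaller than CHUNK_SIZE")
    return chunk_size, chunk_overlap


print(load_chunk_settings({}))  # (1000, 200)
print(load_chunk_settings({"CHUNK_SIZE": "500", "CHUNK_OVERLAP": "100"}))  # (500, 100)
```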


### Analyze a YouTube Video (End-to-End)

In [59]:
# Replace with a real YouTube video URL you want to analyze
# Example: TED Talk or educational video with captions
youtube_url = "https://www.youtube.com/watch?v=TFnJFaWwlbs"

print(f"Analyzing video: {youtube_url}\n")
print("=" * 80)
print("PIPELINE STEPS:")
print("=" * 80)
print("  1. Fetch the transcript from YouTube")
print("  2. Chunk the text semantically")
print("  3. Generate embeddings for each chunk")
print("  4. Store embeddings in ChromaDB")
print("  5. Retrieve most relevant chunks")
print("  6. Generate comprehensive summary with LLM")
print("=" * 80)
print()

try:
    # Analyze the video (this runs the complete RAG pipeline)
    print("🔄 Starting analysis...\n")
    result = analyzer.analyze_video(youtube_url)
    
    # Display results
    print("\n" + "=" * 80)
    print("✓ VIDEO ANALYSIS COMPLETED SUCCESSFULLY")
    print("=" * 80)
    
    print(f"\n📹 Video Information:")
    print(f"   Video ID: {result.metadata.video_id}")
    print(f"   Title: {result.metadata.title or 'N/A'}")
    print(f"   Duration: {result.metadata.duration:.1f}s" if result.metadata.duration else "   Duration: N/A")
    print(f"   Language: {result.metadata.language or 'N/A'}")
    
    print(f"\n📊 Processing Statistics:")
    print(f"   Transcript length: {result.metadata.transcript_length:,} characters" if result.metadata.transcript_length else "   Transcript length: N/A")
    print(f"   Number of chunks created: {result.metadata.chunk_count or 0}")
    print(f"   Chunks used for context: {len(result.context_chunks) if result.context_chunks else 0}")
    print(f"   Analysis timestamp: {result.analysis_timestamp}")
    
    print(f"\n⚙️  Configuration Used:")
    print(f"   LLM Provider: {result.config_used.get('llm_provider', 'N/A')}")
    print(f"   Embedding Model: {result.config_used.get('embedding_model', 'N/A')}")
    print(f"   Chunking Method: {result.config_used.get('chunking_method', 'N/A')}")
    
    print("\n" + "=" * 80)
    print("✅ Analysis complete! Results stored in ChromaDB for future searches.")
    print("=" * 80)
    
    print("\n" + "-" * 80)
    print("📝 SUMMARY")
    print("-" * 80)
    print(result.summary)
    
    if result.key_points:
        print("\n" + "-" * 80)
        print("🔑 KEY POINTS")
        print("-" * 80)
        for i, point in enumerate(result.key_points, 1):
            print(f"{i}. {point}")
    
except Exception as e:
    import traceback
    print("\n" + "=" * 80)
    print("✗ ERROR ANALYZING VIDEO")
    print("=" * 80)
    print(f"Error: {e}")
    print("\n⚠️  Please check:")
    print("  • The YouTube URL is valid")
    print("  • The video has captions/subtitles available")
    print("  • Your API keys are correctly set in the .env file")
    print("  • All dependencies are installed")
    print("=" * 80)
    print("\nFull traceback:")
    traceback.print_exc()

Analyzing video: https://www.youtube.com/watch?v=TFnJFaWwlbs

PIPELINE STEPS:
  1. Fetch the transcript from YouTube
  2. Chunk the text semantically
  3. Generate embeddings for each chunk
  4. Store embeddings in ChromaDB
  5. Retrieve most relevant chunks
  6. Generate comprehensive summary with LLM

🔄 Starting analysis...


✓ VIDEO ANALYSIS COMPLETED SUCCESSFULLY

📹 Video Information:
   Video ID: TFnJFaWwlbs
   Title: N/A
   Duration: 70.6s
   Language: en

📊 Processing Statistics:
   Transcript length: 642 characters
   Number of chunks created: 1
   Chunks used for context: 1
   Analysis timestamp: 2025-11-09T18:15:21.507927

⚙️  Configuration Used:
   LLM Provider: groq
   Embedding Model: all-MiniLM-L6-v2
   Chunking Method: langchain

✅ Analysis complete! Results stored in ChromaDB for future searches.

--------------------------------------------------------------------------------
📝 SUMMARY
----------------------------------------------------------------
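
The run above created a single chunk because the 642-character transcript is shorter than the configured chunk size of 1000. For a fixed-size window with overlap, the number of chunks can be estimated from the lengths alone. The function below is a rough sketch under that assumption; `estimate_chunk_count` is a hypothetical name, and a separator-aware splitter such as the one used in the pipeline may produce a somewhat different count on longer texts.

```python
import math


def estimate_chunk_count(text_length: int, chunk_size: int, chunk_overlap: int) -> int:
    """Estimate how many fixed-size, overlapping windows cover a text.

    Assumes a plain sliding window; real splitters that break on
    separators can deviate from this figure.
    """
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1  # the whole text fits in one chunk
    step = chunk_size - chunk_overlap
    return 1 + math.ceil((text_length - chunk_size) / step)


print(estimate_chunk_count(642, chunk_size=1000, chunk_overlap=200))  # 1
```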

## Advanced Features

### Semantic Search Across Videos

In [None]:
# Search for specific topics across all analyzed videos
search_query = "machine learning applications"

print(f"Searching for: '{search_query}'\n")

try:
    search_results = analyzer.search(search_query, top_k=5)
    
    print(f"Found {len(search_results)} relevant segments:\n")
    
    for i, chunk in enumerate(search_results, 1):
        print(f"[{i}] Video ID: {chunk.metadata.get('video_id', 'N/A')}")
        print(f"    Time: {chunk.start_time}s - {chunk.end_time}s")
        print(f"    Text: {chunk.text[:200]}...\n")
        
except Exception as e:
    print(f"Error during search: {e}")
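
Search results carry `start_time` in seconds, which is awkward to read and cannot be clicked. The sketch below formats a start time as a clock string and builds a link that opens the video at that position using YouTube's `t` query parameter. Both helper names are hypothetical and not part of `youtube_analyzer`.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as M:SS, or H:MM:SS for videos an hour or longer."""
    total = int(seconds)  # drop fractional seconds
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


def timestamp_link(video_id: str, start_time: float) -> str:
    """Build a watch URL that starts playback at `start_time` seconds."""
    return f"https://www.youtube.com/watch?v={video_id}&t={int(start_time)}s"


print(format_timestamp(39.26))                  # 0:39
print(timestamp_link("c0KYU2j0TM4", 39.26))
# https://www.youtube.com/watch?v=c0KYU2j0TM4&t=39s
```

Inside the search loop, these could replace the raw `Time:` line, for example `format_timestamp(chunk.start_time)`.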

### Get Analysis Statistics

In [None]:
# Get statistics about analyzed videos
try:
    stats = analyzer.get_stats()
    
    print("Analysis Statistics:")
    print("=" * 50)
    print(f"Total videos analyzed: {stats.get('total_videos', 0)}")
    print(f"Total chunks stored: {stats.get('total_chunks', 0)}")
    print(f"Collection name: {stats.get('collection_name', 'N/A')}")
    print("=" * 50)
    
except Exception as e:
    print(f"Error getting statistics: {e}")
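
How `get_stats()` computes its figures is not shown in this notebook. As an illustration of what the two counters mean, the sketch below derives them from a list of per-chunk metadata dictionaries; `summarize_collection` is a hypothetical helper, and the assumption that each chunk's metadata holds a `video_id` key is taken from the search cell above.

```python
def summarize_collection(metadatas: list[dict]) -> dict:
    """Count stored chunks and the distinct videos they belong to.

    Hypothetical helper; chunks without a `video_id` still count as
    chunks but do not contribute to the video total.
    """
    video_ids = {meta["video_id"] for meta in metadatas if "video_id" in meta}
    return {"total_videos": len(video_ids), "total_chunks": len(metadatas)}


sample = [
    {"video_id": "c0KYU2j0TM4"},
    {"video_id": "c0KYU2j0TM4"},
    {"video_id": "TFnJFaWwlbs"},
]
print(summarize_collection(sample))  # {'total_videos': 2, 'total_chunks': 3}
```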

### Batch Processing Multiple Videos

In [None]:
# Analyze multiple videos
video_urls = [
    "https://www.youtube.com/watch?v=VIDEO_ID_1",
    "https://www.youtube.com/watch?v=VIDEO_ID_2",
    # Add more URLs as needed
]

print(f"Processing {len(video_urls)} videos...\n")

results = []
for i, url in enumerate(video_urls, 1):
    print(f"[{i}/{len(video_urls)}] Analyzing: {url}")
    try:
        result = analyzer.analyze_video(url)
        results.append(result)
        print(f"  ✓ Success - {result.metadata.chunk_count} chunks created\n")
    except Exception as e:
        print(f"  ✗ Failed: {e}\n")

print(f"\nSuccessfully analyzed {len(results)}/{len(video_urls)} videos")
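
The loop above prints failures but does not keep them, so there is no record of which URLs to retry. A small variation collects successes and failures separately. This is a sketch: `run_batch` is a hypothetical helper that accepts any callable, so it can be exercised here with a stand-in function instead of a real analyzer.

```python
from typing import Any, Callable


def run_batch(
    urls: list[str],
    analyze: Callable[[str], Any],
) -> tuple[list[tuple[str, Any]], list[tuple[str, str]]]:
    """Run `analyze` on every URL, collecting results and error messages.

    One failing video does not stop the rest of the batch.
    """
    successes: list[tuple[str, Any]] = []
    failures: list[tuple[str, str]] = []
    for url in urls:
        try:
            successes.append((url, analyze(url)))
        except Exception as exc:  # keep going; record what went wrong
            failures.append((url, str(exc)))
    return successes, failures


def fake_analyze(url: str) -> str:
    """Stand-in for a real analyzer: fails for URLs containing 'bad'."""
    if "bad" in url:
        raise RuntimeError("no captions available")
    return f"summary of {url}"


ok, failed = run_batch(["https://example.com/good", "https://example.com/bad"], fake_analyze)
print(len(ok), len(failed))  # 1 1
```

With the analyzer from this notebook, the call would be `run_batch(video_urls, analyzer.analyze_video)`, and `failed` then lists the URLs worth retrying.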

## Summary

This notebook demonstrated all key components of the youtube-video-analyser project:

1. **TranscriptFetcher** - Robust YouTube transcript extraction with error handling
2. **Chunker** - Flexible semantic text chunking (LangChain or spaCy)
3. **Embedder** - Vector embeddings using sentence-transformers
4. **VectorRetriever** - ChromaDB for persistent storage and similarity search
5. **Summarizer** - LLM-based summarization with multiple provider options
6. **YouTubeAnalyzer** - Complete orchestrated RAG pipeline

### Next Steps:

- Configure your `.env` file with API keys
- Try analyzing different YouTube videos
- Experiment with different chunking strategies and sizes
- Use semantic search to find specific topics across videos
- Explore the CLI interface for quick analyses

---

**Note**: This notebook uses the installed `youtube_analyzer` package. For a standalone version, see `youtube_analyzer_interactive.ipynb`.