# Data Ingestion: YouTube Video Transcripts

This notebook handles the **data ingestion stage** of a Retrieval-Augmented Generation (RAG) pipeline.
It focuses on extracting and cleaning **YouTube video transcripts**, which serve as a representative
example of long-form, unstructured text data.

The output of this notebook is a cleaned transcript that can be passed to downstream steps such as
chunking, embedding, and retrieval.


In [4]:
"""
PROJECT: NeuralTranscript: Semantic Search & Q&A for YouTube Content
MODULE: 01_DATA_INGESTION
-------------------------------------------------------------------------
DESCRIPTION:
This notebook serves as the entry point for the NeuralTranscript pipeline. 
It automates the extraction of spoken content from YouTube videos using 
the YouTube Transcript API. The data is cleaned and prepared for 
vectorization and semantic indexing.

AUTHOR: Engr. Inam Ullah Khan
Master's Student in Data Science | Al-Farabi Kazakh National University
-------------------------------------------------------------------------
"""

import os
from typing import List, Any
from youtube_transcript_api import YouTubeTranscriptApi

# --- 1. CONFIGURATION ---
# Video: Demis Hassabis (DeepMind CEO) on AI
VIDEO_ID = "Gfr50f6ZBvo" 
OUTPUT_FOLDER = "data/transcripts"

# --- 2. CORE FUNCTIONS ---

def fetch_youtube_transcript(video_id: str) -> List[Any]:
    """
    Retrieves the raw transcript segments via the instance-based `fetch` API
    of youtube-transcript-api (v1.x).
    
    Args:
        video_id (str): The unique 11-character YouTube video ID.
        
    Returns:
        List[Any]: Raw transcript segments containing text and timestamps,
        or an empty list if retrieval fails.
    """
    print(f"📡 Accessing YouTube API for Video ID: {video_id}...")
    try:
        # Initializing the API instance (v1.x.x compatibility)
        api_instance = YouTubeTranscriptApi()
        
        # Fetching transcript and converting to standard list format
        raw_data = api_instance.fetch(video_id).to_raw_data()
        return raw_data
    
    except Exception as e:
        print(f"❌ Error: Could not retrieve transcript. Details: {e}")
        return []

def process_transcript_to_text(raw_transcript: List[Any]) -> str:
    """
    Cleans and unifies transcript segments into a single cohesive string.
    
    Args:
        raw_transcript (List[Any]): List of transcript dictionaries.
        
    Returns:
        str: A cleaned block of text ready for NLP tasks.
    """
    print("🧹 Cleaning and unifying transcript text...")
    
    # Extract 'text' field while handling potential object/dict variability
    text_segments = []
    for segment in raw_transcript:
        if isinstance(segment, dict):
            text_segments.append(segment.get("text", ""))
        else:
            text_segments.append(getattr(segment, 'text', ""))
            
    # Join segments, then collapse every whitespace run (spaces, tabs, newlines)
    # to a single space; a single .replace("  ", " ") pass misses longer runs
    unified_text = " ".join(" ".join(text_segments).split())
    return unified_text

def save_to_disk(text_content: str, filename: str):
    """
    Persists the cleaned text to a local file for downstream processing.
    """
    if not os.path.exists(OUTPUT_FOLDER):
        os.makedirs(OUTPUT_FOLDER)
        print(f"📁 Created directory: {OUTPUT_FOLDER}")
        
    file_path = os.path.join(OUTPUT_FOLDER, f"{filename}.txt")
    
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(text_content)
    
    print(f"💾 Success! Cleaned transcript saved to: {file_path}")

# --- 3. MAIN PIPELINE EXECUTION ---

if __name__ == "__main__":
    print("--- Starting NeuralTranscript Ingestion ---")
    
    # Step 1: Extraction
    raw_transcript_data = fetch_youtube_transcript(VIDEO_ID)
    
    if raw_transcript_data:
        # Step 2: Transformation
        final_text = process_transcript_to_text(raw_transcript_data)
        
        if final_text:
            print(f"📊 Stats: {len(final_text)} characters processed.")
            
            # Step 3: Loading (Saving)
            save_to_disk(final_text, VIDEO_ID)
            
            # Final Preview
            print(f"\n📝 PREVIEW (First 250 chars):\n{final_text[:250]}...")
        else:
            print("⚠️ Warning: Transcript was empty after processing.")
    else:
        print("❌ Pipeline halted: No data to process.")

--- Starting NeuralTranscript Ingestion ---
📡 Accessing YouTube API for Video ID: Gfr50f6ZBvo...
🧹 Cleaning and unifying transcript text...
📊 Stats: 133836 characters processed.
💾 Success! Cleaned transcript saved to: data/transcripts\Gfr50f6ZBvo.txt

📝 PREVIEW (First 250 chars):
the following is a conversation with demus hasabis ceo and co-founder of deepmind a company that has published and builds some of the most incredible artificial intelligence systems in the history of computing including alfred zero that learned all b...
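
The transformation step that produced this preview can be tried offline, without calling YouTube. The sketch below is a minimal illustration, not the notebook's exact code: `join_segments` is a hypothetical name, and the sample segments are made up, shaped like the `text` / `start` / `duration` dictionaries that `to_raw_data()` returns. It collapses whitespace with `str.split()`, which handles newlines and longer runs of spaces in one pass.

```python
from typing import Any, Dict, List


def join_segments(segments: List[Dict[str, Any]]) -> str:
    """Hypothetical helper: join 'text' fields and collapse whitespace runs."""
    joined = " ".join(str(seg.get("text", "")) for seg in segments)
    # str.split() with no arguments splits on any run of whitespace,
    # so newlines, tabs, and repeated spaces all collapse to one space.
    return " ".join(joined.split())


# Made-up sample data (an assumption for illustration, not real transcript text).
sample = [
    {"text": "the following is\na conversation", "start": 0.0, "duration": 2.5},
    {"text": "  with   the ceo ", "start": 2.5, "duration": 1.8},
    {"text": "", "start": 4.3, "duration": 0.5},
]

cleaned = join_segments(sample)
print(cleaned)
```

Empty segments and embedded line breaks disappear from the result, which keeps the character count closer to the amount of actual spoken text.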


## Summary

- Retrieved the transcript for YouTube video `Gfr50f6ZBvo`
- Converted timestamped segments into clean long-form text (133,836 characters)
- Saved the transcript to `data/transcripts/Gfr50f6ZBvo.txt` for downstream RAG processing
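
Retrieval in this notebook is keyed on the 11-character `VIDEO_ID`. When the starting point is a full URL instead, a small helper can extract the ID first. The sketch below is one possible approach, not part of the pipeline: `extract_video_id` is a hypothetical name, and the pattern assumes IDs consist of 11 letters, digits, hyphens, or underscores. It covers only `youtube.com/watch?v=...` and `youtu.be/...` links.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Assumption: a video ID is exactly 11 characters from [A-Za-z0-9_-].
_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{11}")


def extract_video_id(url_or_id: str) -> Optional[str]:
    """Hypothetical helper: return the video ID from a URL or bare ID, else None."""
    candidate = url_or_id.strip()
    if _ID_PATTERN.fullmatch(candidate):
        return candidate  # already a bare ID

    parsed = urlparse(candidate)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the first path component.
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host.endswith("youtube.com"):
        # Watch links carry the ID in the "v" query parameter.
        candidate = parse_qs(parsed.query).get("v", [""])[0]
    else:
        return None
    return candidate if _ID_PATTERN.fullmatch(candidate) else None


print(extract_video_id("https://www.youtube.com/watch?v=Gfr50f6ZBvo"))
```

Returning `None` for unrecognized input lets the caller stop before making a network request with a malformed ID.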

**Next step:** Text chunking and preprocessing (`02_chunking_analysis.ipynb`)
