# Notebook 02 – Text Chunking & Preprocessing

## Objective
Split long transcript text into manageable chunks for embedding and retrieval.

## Input
Clean transcript produced by Notebook 01 (`data/transcripts/<VIDEO_ID>.txt`)

## Output
List of LangChain `Document` chunks with source metadata, ready for embedding and saved to `data/chunked_docs.pkl`

## Methodology
- Load transcript
- Apply `RecursiveCharacterTextSplitter` (chunk size 1000 characters, overlap 200 characters)
- Analyze chunk distribution
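
The recursive strategy listed above can be illustrated with a small standalone sketch. This is not LangChain's implementation: `recursive_split` is a hypothetical helper written for this note, it ignores overlap, and it drops separators at split points. It only shows the core idea of trying the coarsest separator first and falling back to finer ones for pieces that are still too long.

```python
def recursive_split(text, max_len, separators):
    """Split text into pieces of at most max_len characters (simplified, no overlap)."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every max_len characters.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    chunks, buffer = [], ""
    for part in text.split(sep):
        if len(part) > max_len:
            # This part is too long on its own: flush the buffer, then
            # recurse using the next, finer separator.
            if buffer:
                chunks.append(buffer)
                buffer = ""
            chunks.extend(recursive_split(part, max_len, finer))
            continue
        # Greedily merge small parts back together while they still fit.
        candidate = part if not buffer else buffer + sep + part
        if len(candidate) <= max_len:
            buffer = candidate
        else:
            chunks.append(buffer)
            buffer = part
    if buffer:
        chunks.append(buffer)
    return chunks


demo_text = "alpha beta gamma delta.\n\nepsilon zeta eta theta iota kappa lambda"
chunks = recursive_split(demo_text, max_len=20, separators=["\n\n", "\n", ".", " ", ""])
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The real splitter additionally carries up to `chunk_overlap` characters from the end of one chunk into the start of the next.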


In [1]:
"""
PROJECT: 
NeuralTranscript: A RAG-Based Semantic Search & Q&A System for YouTube Content

-------------------------------------------------------------------------
AUTHOR: Engr. Inam Ullah Khan
Master's Student in Data Science | Al-Farabi Kazakh National University
-------------------------------------------------------------------------
"""

import os
# NEW: Import from the dedicated text-splitters package
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# --- 1. CONFIGURATION ---
VIDEO_ID = "Gfr50f6ZBvo"
INPUT_PATH = f"data/transcripts/{VIDEO_ID}.txt"

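# RAG hyperparameters (both measured in characters, not tokens)
CHUNK_SIZE = 1000     # maximum length of each chunk
CHUNK_OVERLAP = 200   # maximum number of characters shared by adjacent chunks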

# --- 2. CORE PROCESSING FUNCTIONS ---

def load_processed_transcript(file_path: str) -> str:
    """Read the cleaned transcript produced by Notebook 01 as a single string."""
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"❌ Transcript not found at {file_path}. Run Notebook 01 first.")
    with open(file_path, "r", encoding="utf-8") as f:
        return f.read()

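def create_enriched_chunks(text: str, source_id: str) -> list[Document]:
    """Split the transcript into overlapping Document chunks tagged with source metadata."""
    print(f"✂️ Initializing Recursive Splitting (Size: {CHUNK_SIZE}, Overlap: {CHUNK_OVERLAP})...")
    
    # Try paragraph, line, and sentence boundaries first, then spaces;
    # the empty string is the last-resort character-level split.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        separators=["\n\n", "\n", ".", " ", ""],
        add_start_index=True,  # store each chunk's character offset in its metadata
    )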
    
    # Generate chunks as Document objects.
    # create_documents attaches a copy of the metadata dict to every chunk.
    enriched_docs = splitter.create_documents(
        [text], 
        metadatas=[{"source": source_id, "content_type": "video_transcript"}]
    )
    
    return enriched_docs

# --- 3. EXECUTION PIPELINE ---

if __name__ == "__main__":
    print("--- Starting NeuralTranscript Chunking Pipeline ---")
    
    full_text = load_processed_transcript(INPUT_PATH)
    chunked_docs = create_enriched_chunks(full_text, VIDEO_ID)
    
    print(f"✅ Created {len(chunked_docs)} enriched chunks.")
    
    # Preview
    sample = chunked_docs[0]
    print(f"\n--- CHUNK VALIDATION ---\nMetadata: {sample.metadata}\nPreview: {sample.page_content[:150]}...")

--- Starting NeuralTranscript Chunking Pipeline ---
✂️ Initializing Recursive Splitting (Size: 1000, Overlap: 200)...
✅ Created 169 enriched chunks.

--- CHUNK VALIDATION ---
Metadata: {'source': 'Gfr50f6ZBvo', 'content_type': 'video_transcript', 'start_index': 0}
Preview: the following is a conversation with demus hasabis ceo and co-founder of deepmind a company that has published and builds some of the most incredible ...


In [2]:
import pickle

# Save the chunks so Notebook 03 can use them
with open("data/chunked_docs.pkl", "wb") as f:
    pickle.dump(chunked_docs, f)

print("✅ Chunks safely persisted to data/chunked_docs.pkl")

✅ Chunks safely persisted to data/chunked_docs.pkl
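
A downstream notebook would presumably read the pickle back along these lines. The sketch below is a self-contained round trip under stated assumptions: `load_chunks` is a hypothetical helper, plain dictionaries stand in for the `Document` objects, and a temporary directory replaces `data/`. With real `Document` objects, `langchain_core` must be importable when unpickling, and pickle files should only be loaded from a trusted source.

```python
import pickle
import tempfile
from pathlib import Path


def load_chunks(path):
    """Load a pickled list of chunks from disk (hypothetical helper)."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in data shaped like the chunk metadata shown above.
sample_chunks = [
    {"page_content": "first chunk", "metadata": {"source": "demo", "start_index": 0}},
    {"page_content": "second chunk", "metadata": {"source": "demo", "start_index": 800}},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "chunked_docs.pkl"
    with open(demo_path, "wb") as f:
        pickle.dump(sample_chunks, f)
    restored = load_chunks(demo_path)

print(f"Restored {len(restored)} chunks")
```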


## Observations

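- The transcript was split into 169 chunks.
- The 200-character overlap helps preserve context across chunk boundaries.
- Each chunk is capped at 1000 characters. Embedding models limit input in tokens rather than characters, so the fit should be confirmed against the model chosen in Notebook 03.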
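
The chunk distribution mentioned in the methodology can be checked with a few summary statistics. The following is a minimal sketch: `summarize_lengths` is a hypothetical helper, and the demo strings are synthetic stand-ins rather than real transcript chunks.

```python
import statistics


def summarize_lengths(texts):
    """Return basic length statistics (in characters) for a list of strings."""
    lengths = [len(t) for t in texts]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
    }


# Synthetic stand-ins for chunk texts.
demo_texts = ["a" * 950, "b" * 1000, "c" * 420]
stats = summarize_lengths(demo_texts)
print(stats)
```

In this notebook the call would be `summarize_lengths([d.page_content for d in chunked_docs])`. A maximum above `CHUNK_SIZE` or a large share of very short chunks would be worth investigating.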


## Summary

The transcript was segmented into overlapping chunks using a recursive character-based strategy, which prefers paragraph, line, and sentence boundaries before falling back to spaces.
The selected chunk size (1000 characters) and overlap (200 characters) trade contextual coherence against the number of chunks that must be embedded and stored.
The resulting chunks are ready for embedding and vector storage in the next stage of the RAG pipeline.
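
The trade-off between overlap and chunk count can be made concrete with a rough estimate. The sketch below assumes an idealized fixed-size sliding window, which is not how the recursive splitter behaves: it breaks at separators and often emits shorter chunks, so real counts will differ. `estimate_chunk_count` is a hypothetical helper.

```python
import math


def estimate_chunk_count(text_length, chunk_size, chunk_overlap):
    """Estimate chunk count for an idealized fixed-size sliding window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1
    # Each window after the first advances by this many new characters.
    stride = chunk_size - chunk_overlap
    return math.ceil((text_length - chunk_overlap) / stride)


baseline = estimate_chunk_count(10_000, chunk_size=1000, chunk_overlap=200)
more_overlap = estimate_chunk_count(10_000, chunk_size=1000, chunk_overlap=400)
print(baseline, more_overlap)
```

Under this model, raising the overlap shrinks the stride and increases the number of chunks to embed, which is the cost side of the extra context.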


**Next step:** Embedding generation and similarity-based retrieval  
(`03_vector_indexing.ipynb`)
