# **RAG Ingestion Pipeline**

**Notebook Setup and Autoreload**

In [1]:
# --- Notebook setup and autoreload configuration ---
# This cell runs the initial setup script and enables the autoreload extension.
# The autoreload feature ensures that updates made to imported modules (e.g., in src/)
# are automatically reloaded without restarting the kernel.
%run notebook_setup.py

%load_ext autoreload

%autoreload 2

Notebook environment configured successfully!

Project root: /home/ilfn/datascience/workspace/rag-movie-plots
Added to sys.path:
  - /home/ilfn/datascience/workspace/rag-movie-plots/src
  - /home/ilfn/datascience/workspace/rag-movie-plots/src/backend
PYTHONPATH: /home/ilfn/datascience/workspace/rag-movie-plots/src
Current working directory: /home/ilfn/datascience/workspace/rag-movie-plots/notebooks


In [None]:
# import os, sys
# print("CWD:", os.getcwd())
# print("PYTHONPATH:", sys.path[:3])

**1. ETL — Data Cleaning and Structured Output Generation**

This step runs the **ETL pipeline** responsible for preparing the raw dataset of movie plots for subsequent stages of the RAG system.

The process includes **three key operations**:
* `Data Cleaning`:
    * The **raw CSV** file is loaded and cleaned using the `DataCleaner` class (`data_cleaner`), which removes invalid tokens, standardizes formats, and applies column-specific transformations.
* `Structured Output Generation`:
    * The **cleaned dataset** is then saved as a **new CSV file** through the `DataPipeline` class (`data_pipeline`).
* `JSONL Creation`:
    * Finally, the `JsonlWriter` class (`jsonl_writer`) converts the structured dataset into **JSON Lines (.jsonl)** format, separating the main text (`Plot`) from its **metadata** (e.g., title, genre, year, cast, director).

In summary, this step transforms the raw movie dataset into a clean, structured, RAG-ready format that the chunking and embedding steps can consume directly.
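
The three operations above can be illustrated with a minimal, standard-library sketch. It is not the project's `DataCleaner`, `DataPipeline`, or `JsonlWriter` code: the set of invalid tokens, the exact column names, and the sample rows are assumptions made for illustration only.

```python
import csv
import io
import json

# Hypothetical set of placeholder tokens treated as missing values.
INVALID_TOKENS = {"", "unknown", "n/a", "nan"}

# Assumed column names, based on the fields named above; adjust to the real schema.
TEXT_COLUMN = "Plot"
METADATA_COLUMNS = ["Title", "Release Year", "Genre", "Director", "Cast"]


def clean_value(value):
    """Collapse repeated whitespace and map placeholder tokens to None."""
    text = " ".join((value or "").split())
    return None if text.lower() in INVALID_TOKENS else text


def rows_to_jsonl(rows):
    """Yield one JSON line per row, separating the plot text from its metadata."""
    for row in rows:
        text = clean_value(row.get(TEXT_COLUMN))
        if text is None:
            continue  # a record without plot text has nothing to embed
        metadata = {col: clean_value(row.get(col)) for col in METADATA_COLUMNS}
        yield json.dumps({"text": text, "metadata": metadata}, ensure_ascii=False)


# Made-up sample rows standing in for the raw CSV file.
raw_csv = io.StringIO(
    "Title,Release Year,Genre,Director,Cast,Plot\n"
    "Example Film,1999,unknown,Jane Doe,,A  detective   returns home.\n"
    "No Plot Film,2001,drama,John Roe,Some Actor,\n"
)
jsonl_lines = list(rows_to_jsonl(csv.DictReader(raw_csv)))
print(jsonl_lines[0])
```

Each emitted line carries a `text` field and a `metadata` object, which is the shape the chunking step expects to load.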


In [None]:
# --- Step 1: ETL (Data Cleaning + JSONL Creation) ---
# Loads the raw movie dataset, cleans invalid tokens using DataCleaner,
# generates a structured CSV, and produces the docs.jsonl file for downstream steps.

# Equivalent to running from the terminal:
# PYTHONPATH=src uv run src/backend/main.py --step etl

# import sys, runpy
# sys.argv = ["backend/main.py", "--step", "etl"]
# _ = runpy.run_module("main", run_name="__main__")

_Expected output_:
* `movies_clean.csv` (cleaned dataset)
* `docs.jsonl` (RAG-ready JSON lines file)

**2. Chunking — Text Splitting for Document Segmentation**

This step splits the cleaned movie plot texts into smaller, context-preserving chunks suitable for embedding and retrieval.

It performs **three major operations**:
* `Document Loading`:
    * The `.jsonl` file produced during ETL is loaded and parsed into a collection of documents containing `text` and `metadata`.
* `Splitting Strategy Application`:
    * The `ChunkingPipeline` applies the configured chunking strategy (character, token, or recursive), taking chunk size and overlap from `CHUNKING_CONFIG`.
    * The size limit caps the length of each segment, and the overlap repeats text across adjacent chunk boundaries so that context is not lost at the split points.
* `Chunk Output Serialization`:
    * Each document’s text is split into multiple sub-documents, preserving metadata such as `Title`, `Release Year`, and a unique `id`.
    * The resulting collection is stored in a new `.jsonl` file (`chunks.jsonl`) for embedding.

This stage gives the RAG system text units that are large enough to preserve context but small enough for efficient vector search.
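
As an illustration of the character strategy, here is a minimal sketch of fixed-size splitting with overlap. It is not the `ChunkingPipeline` implementation, which may rely on a library splitter. The `chunk_size` and `chunk_overlap` defaults and the `"<doc>-<chunk>"` id format are assumptions chosen to keep the example small.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that share an overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


def chunk_document(doc, doc_index, chunk_size=20, chunk_overlap=5):
    """Return chunk records that copy the parent metadata and add a unique id."""
    pieces = split_with_overlap(doc["text"], chunk_size, chunk_overlap)
    return [
        {
            "text": piece,
            # Hypothetical id scheme: document index plus chunk index.
            "metadata": {**doc["metadata"], "id": f"{doc_index}-{i}"},
        }
        for i, piece in enumerate(pieces)
    ]


# Made-up document in the text/metadata shape produced by the ETL step.
doc = {
    "text": "A detective returns home to solve one last case.",
    "metadata": {"Title": "Example Film", "Release Year": "1999"},
}
chunks = chunk_document(doc, doc_index=0)
for chunk in chunks:
    print(chunk["metadata"]["id"], repr(chunk["text"]))
```

The last five characters of each chunk reappear at the start of the next one, which is what carries context across the split points.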

In [None]:
# --- Step 2: Chunking ---
# Splits the text documents from docs.jsonl into smaller chunks using the configured strategy
# (character, token, or recursive) for better retrieval and embedding performance.

# Equivalent to running from the terminal:
# PYTHONPATH=src uv run src/backend/main.py --step chunking

# import sys, runpy
# sys.argv = ["backend/main.py", "--step", "chunking"]
# _ = runpy.run_module("main", run_name="__main__")

_Expected output_:
* `chunks.jsonl` file containing segmented text chunks with metadata.

**3. VectorStore — Embedding and Persistence**

In this stage, the pre-chunked documents are transformed into vector representations and stored in a persistent vector database.

The process includes the following steps:
* `Chunk Loading`:
    * The `chunks.jsonl` file is read and converted into `langchain_core.documents.Document` objects, combining each text segment with its corresponding metadata.
* `Embedding Generation`:
    * The `VectorStorePipeline` uses a Hugging Face embedding model (as defined in `EMBEDDING_CONFIG`) to generate dense semantic vectors for each chunk.
* `Vector Storage`:
    * These vectors are persisted into a local **ChromaDB** instance (`VECTORSTORE_CONFIG["persist_dir"]`), enabling fast similarity-based retrieval for the RAG retriever.

This step converts the text chunks into a queryable vector representation, the foundation for semantic search and contextual question answering.
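
To make the load, embed, store, and query flow concrete without external dependencies, the sketch below uses toy stand-ins. `Doc` mimics the `page_content` and `metadata` attributes of a LangChain `Document`, `toy_embed` is a word-count vector rather than a Hugging Face model, and `ToyVectorStore` is an in-memory list rather than ChromaDB. All three names and the sample records are hypothetical; only the overall flow mirrors the steps above.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a Document: a text segment plus its metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def toy_embed(text):
    """Toy embedding: a sparse word-count vector, NOT a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """In-memory stand-in for a persistent vector database."""

    def __init__(self, embed):
        self.embed = embed
        self.entries = []  # list of (vector, Doc) pairs

    def add_documents(self, docs):
        for doc in docs:
            self.entries.append((self.embed(doc.page_content), doc))

    def similarity_search(self, query, k=1):
        query_vec = self.embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(query_vec, e[0]), reverse=True)
        return [doc for _, doc in ranked[:k]]


# Made-up chunk records in the text/metadata shape of chunks.jsonl.
chunk_records = [
    {"text": "a detective investigates a murder in the city", "metadata": {"id": "0-0"}},
    {"text": "two friends travel across the desert by train", "metadata": {"id": "1-0"}},
]
store = ToyVectorStore(toy_embed)
store.add_documents(Doc(r["text"], r["metadata"]) for r in chunk_records)
top = store.similarity_search("detective murder case", k=1)
print(top[0].metadata["id"])
```

A real embedding model replaces exact word overlap with dense semantic vectors, so related wording can match even when no tokens are shared, and a persistent store keeps those vectors on disk between runs.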

In [None]:
# --- Step 3: VectorStore Creation ---
# Loads the text chunks, generates embeddings using the configured
# Hugging Face model, and stores them in a persistent ChromaDB database.

# Equivalent to running from the terminal:
# PYTHONPATH=src uv run src/backend/main.py --step vectorstore

# import sys, runpy
# sys.argv = ["backend/main.py", "--step", "vectorstore"]
# _ = runpy.run_module("main", run_name="__main__")

**4. Full Pipeline Execution**

Running the **Full Ingestion Pipeline** (`pipeline.run_full()`) executes all three stages sequentially:
1. `ETL`: cleans and structures the dataset
2. `Chunking`: splits the text into retrieval-friendly chunks
3. `VectorStore`: embeds and persists the chunks in ChromaDB

Because the notebook cell and the terminal command invoke the same entry point (`main.py --step full`), a full run behaves the same in a local notebook as in a production pipeline, which keeps ingestion reproducible across environments.
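
The `--step` interface could be organized along the following lines. This is a sketch only: the project's `main.py` is not shown in this notebook, so the stage functions, the `STEPS` mapping, and the use of a required `--step` argument are all assumptions.

```python
import argparse


# Hypothetical stage functions; each one only records that it ran.
def run_etl(log):
    log.append("etl")


def run_chunking(log):
    log.append("chunking")


def run_vectorstore(log):
    log.append("vectorstore")


# Dicts preserve insertion order, so "full" runs the stages in this order.
STEPS = {
    "etl": run_etl,
    "chunking": run_chunking,
    "vectorstore": run_vectorstore,
}


def main(argv=None):
    parser = argparse.ArgumentParser(description="Ingestion pipeline sketch")
    parser.add_argument("--step", choices=[*STEPS, "full"], required=True)
    args = parser.parse_args(argv)

    log = []  # records which stages ran, in order
    selected = list(STEPS) if args.step == "full" else [args.step]
    for name in selected:
        STEPS[name](log)
    return log


print(main(["--step", "full"]))
print(main(["--step", "chunking"]))
```

Keeping the stages in one ordered mapping means a single step and the full run share the same code path, so a stage debugged on its own behaves identically when it runs as part of `full`.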

In [2]:
# --- Optional: Full Ingestion Execution ---
# Runs all three stages (ETL -> Chunking -> VectorStore) sequentially.
# Equivalent to running from the terminal:
# PYTHONPATH=src uv run src/backend/main.py --step full

import sys, runpy

sys.argv = ["backend/main.py", "--step", "full"]
_ = runpy.run_module("main", run_name="__main__")


Starting full RAG preprocessing phase of the RAG workflow pipeline...

Loaded raw dataset with 34886 rows from /home/ilfn/datascience/workspace/rag-movie-plots/data/raw/wiki_movie_plots_deduped.csv
Cleaned CSV saved to /home/ilfn/datascience/workspace/rag-movie-plots/data/processed/v20251111/movies_clean.csv
JSONL file created at /home/ilfn/datascience/workspace/rag-movie-plots/data/processed/v20251111/docs.jsonl
Data processing completed successfully.

ETL completed.

Saving 245859 chunks to /home/ilfn/datascience/workspace/rag-movie-plots/data/processed/v20251111/chunks.jsonl

Chunking completed.

Reading chunks from /home/ilfn/datascience/workspace/rag-movie-plots/data/processed/v20251111/chunks.jsonl...
Preparing 245859 documents for vector storage...
Generating embeddings with model: sentence-transformers/all-MiniLM-L6-v2
Persisting ChromaDB at: /home/ilfn/datascience/workspace/rag-movie-plots/db/chroma

VectorStore creation completed.

Full pipeline finished successfully.

_Expected output_:
* Complete ingestion process executed end-to-end.
* CSV and JSONL artifacts generated under `data/processed/vYYYYMMDD/`, with the ChromaDB store persisted under `db/chroma`.

## **Notebook Summary**

**This notebook allows you to**:
* Run and debug each ingestion step individually.
* Inspect intermediate outputs (`movies_clean.csv`, `docs.jsonl`, `chunks.jsonl`).
* Verify that the full ingestion pipeline behaves consistently across environments (notebook, CLI, production).