# **Executing the ETL Pipeline from the Main Script**

This step runs the **ETL pipeline** responsible for preparing the raw dataset of movie plots for subsequent stages of the RAG system.

The process includes three key operations:
* `Data Cleaning`:
    * The **raw CSV** file is loaded and cleaned by the `DataCleaner` class (`data_cleaner`), which removes invalid tokens, standardizes formats, and applies column-specific transformations.
* `Structured Output Generation`:
    * The **cleaned dataset** is then saved as a **new CSV file** by the `DataPipeline` class (`data_pipeline`), so that later processing stages read from a consistent source.
* `JSONL Creation`:
    * Finally, the `JsonlWriter` class (`jsonl_writer`) converts the structured data into `.jsonl` format, separating the **main text (Plot)** from its **metadata**. This layout is well suited to document retrieval and embedding in RAG pipelines.

In summary, this execution transforms the raw movie dataset into a clean, structured, and RAG-ready format, enabling reliable chunking, embedding, and retrieval in the next stages.
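To make the text/metadata split concrete, here is a minimal sketch of how one cleaned CSV row might become one JSONL line. It is not the project's `JsonlWriter`; the function name `rows_to_jsonl`, the output keys `text` and `metadata`, and the sample columns `Title` and `Release Year` are assumptions made for illustration. Only the `Plot` column name comes from the description above.

```python
import csv
import io
import json


def rows_to_jsonl(rows, text_field="Plot"):
    """Yield one JSON line per row: main text under "text", the rest under "metadata".

    Hypothetical helper: the key names are illustrative, not the project's schema.
    """
    for row in rows:
        record = {
            "text": row[text_field].strip(),
            "metadata": {k: v for k, v in row.items() if k != text_field},
        }
        # ensure_ascii=False keeps accented characters readable in the output file.
        yield json.dumps(record, ensure_ascii=False)


# Made-up sample standing in for the cleaned CSV file.
sample_csv = io.StringIO(
    "Title,Release Year,Plot\n"
    "Example Movie,1999,  A detective follows a lead.  \n"
)

lines = list(rows_to_jsonl(csv.DictReader(sample_csv)))
print(lines[0])
```

Keeping the plot text apart from the remaining columns means a later chunking step can split only the text while copying the same metadata onto every chunk.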


**1. Notebook setup and autoreload**

In [1]:
# --- Notebook setup and autoreload configuration ---
# This cell runs the initial setup script and enables the autoreload extension.
# The autoreload feature ensures that updates made to imported modules (e.g., in src/)
# are automatically reloaded without restarting the kernel.
%run notebook_setup.py

%load_ext autoreload

%autoreload 2

CWD: /home/ilfn/datascience/workspace/rag-movie-plots/notebooks
PYTHONPATH: ['/home/ilfn/datascience/workspace/rag-movie-plots/src/backend', '/home/ilfn/datascience/workspace/rag-movie-plots/src', '/home/ilfn/datascience/workspace/rag-movie-plots/notebooks']
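The contents of `notebook_setup.py` are not shown here. As a rough sketch of what such a setup script could do to produce output like the above, the snippet below prepends `src/backend` and `src` to `sys.path` and prints the working directory. The helper name `add_source_paths` and the assumption that the notebook runs from `<project_root>/notebooks` are illustrative, not taken from the project.

```python
import os
import sys
from pathlib import Path


def add_source_paths(project_root: Path) -> list[str]:
    """Prepend src/backend and src to sys.path if missing; return the entries added.

    Hypothetical helper, not the project's actual setup code.
    """
    added = []
    # Insert src first, then src/backend, so src/backend ends up at index 0.
    for path in (project_root / "src", project_root / "src" / "backend"):
        entry = str(path)
        if entry not in sys.path:
            sys.path.insert(0, entry)
            added.append(entry)
    return added


# Assumption: the current directory is <project_root>/notebooks.
project_root = Path.cwd().parent
add_source_paths(project_root)

print("CWD:", os.getcwd())
print("PYTHONPATH:", sys.path[:3])
```

Checking for an existing entry before inserting keeps `sys.path` free of duplicates when the setup cell is re-run.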


**2. Executing the ETL Pipeline from the Main Script**

The cell below reproduces the same ETL step as the terminal command (`PYTHONPATH=src uv run src/backend/main.py --step etl`), so the notebook and the command-line run execute the same code path.

In [2]:
# import os, sys
# print("CWD:", os.getcwd())
# print("PYTHONPATH:", sys.path[:3])

In [2]:
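import runpy
import sys

# Emulate the CLI call: argv[0] is the script name, followed by its arguments.
sys.argv = ["backend/main.py", "--step", "etl"]
# Run the `main` module (located via sys.path) as if it were executed as a script.
_ = runpy.run_module("main", run_name="__main__")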

Processado em /home/ilfn/datascience/workspace/rag-movie-plots/data/processed/v20251109
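The cell above leaves `sys.argv` modified for the rest of the notebook session. A minimal sketch of a wrapper that restores it afterwards is shown below. To stay self-contained it uses `runpy.run_path` on a temporary stand-in script instead of `runpy.run_module("main")`; the helper `run_script_with_args` and the stand-in script are hypothetical and only mimic a `--step` argument.

```python
import runpy
import sys
import tempfile
from pathlib import Path


def run_script_with_args(script_path, args):
    """Run a script as __main__ with a temporary sys.argv, then restore the original."""
    saved_argv = sys.argv
    sys.argv = [str(script_path), *args]
    try:
        # run_path returns the globals of the executed script as a dict.
        return runpy.run_path(str(script_path), run_name="__main__")
    finally:
        sys.argv = saved_argv


# Hypothetical stand-in for main.py: it only parses --step.
script_source = """
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--step", required=True)
step = parser.parse_args().step
"""

with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "demo_main.py"
    script.write_text(script_source)
    argv_before = list(sys.argv)
    namespace = run_script_with_args(script, ["--step", "etl"])

print(namespace["step"])
```

Restoring `sys.argv` in a `finally` block keeps later cells unaffected even if the script raises or calls `sys.exit`.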
