# RAG QA Data Preparation

This notebook prepares and explores the dataset for the Retrieval-Augmented Generation (RAG) question-answering pipeline. It handles data loading, exploration, and embedding computation.

## Colab Setup

If you are running this notebook in Colab, the cell below first clones the repository and installs the requirements. Outside Colab, we assume the repository and its dependencies are already set up, so these steps are skipped to save execution time.

In [None]:
import sys
IS_COLAB = "google.colab" in sys.modules

if IS_COLAB:
    import os

    REPO_URL = "https://github.com/lucas937-code/rag-qa"
    REPO_DIR = "rag-qa"
    BRANCH = "dev"
    MODE = "DEBUG"

    # Clone repo only if it does not exist yet
    if not os.path.isdir(REPO_DIR):
        print(f"Cloning repository from {REPO_URL}...")
        !git clone {REPO_URL} {REPO_DIR}
    else:
        print(f"Repository '{REPO_DIR}' already exists, skipping clone.")

    # Change into repo directory
    %cd {REPO_DIR}

    # Checkout the correct branch
    if BRANCH != "main":
        remote_branches = !git branch -r
        print(remote_branches)
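        # Compare whole branch names so that e.g. "origin/dev" does not match "origin/develop"
        if not any(b.strip() == f"origin/{BRANCH}" for b in remote_branches):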
            raise ValueError(f"Branch {BRANCH} does not exist")
        !git checkout {BRANCH}

    # Install dependencies
    !pip install -r requirements.txt

## Setup Configuration

Select the configuration that matches the execution environment: a local run using the Hugging Face API, a local run using an Ollama server, or a Colab run using the Hugging Face API. Then create the directories the pipeline needs.

In [None]:
from src.config import OllamaConfig, LocalConfig, ColabConfig

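# Ollama is only used for local runs; replace True with False to disable it locally
USE_OLLAMA = not IS_COLAB and True

if USE_OLLAMA:
    OLLAMA_HOST = "172.19.176.1"  # address of the machine running the Ollama server; adjust to your setup
    OLLAMA_PORT = 11434           # Ollama's default port
    OLLAMA_URL = f"http://{OLLAMA_HOST}:{OLLAMA_PORT}/api/chat"
    config = OllamaConfig(ollama_url=OLLAMA_URL)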
else:
    config = ColabConfig() if IS_COLAB else LocalConfig()
    
config.ensure_dirs()
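
To make the directory preparation step concrete, here is a minimal sketch of what a configuration object with an `ensure_dirs()` method could look like. It is not the implementation in `src.config`: the class name `SketchConfig`, the `base_dir` field, and the `train`/`val`/`test` folder layout are assumptions made for illustration. Only the attribute names `train_dir`, `val_dir`, and `test_dir` and the method name `ensure_dirs` are taken from this notebook.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SketchConfig:
    """Hypothetical stand-in for the classes in src.config (layout is assumed)."""

    base_dir: Path

    @property
    def train_dir(self) -> Path:
        return self.base_dir / "train"

    @property
    def val_dir(self) -> Path:
        return self.base_dir / "val"

    @property
    def test_dir(self) -> Path:
        return self.base_dir / "test"

    def ensure_dirs(self) -> None:
        # exist_ok=True makes the call idempotent, so re-running the cell is safe
        for directory in (self.train_dir, self.val_dir, self.test_dir):
            directory.mkdir(parents=True, exist_ok=True)


# Usage: create the directory tree inside a throwaway temporary folder
with tempfile.TemporaryDirectory() as tmp:
    sketch_config = SketchConfig(base_dir=Path(tmp) / "data")
    sketch_config.ensure_dirs()
    sketch_config.ensure_dirs()  # second call changes nothing
    created = sorted(p.name for p in sketch_config.base_dir.iterdir())
    print(created)
```

Keeping directory creation idempotent is what allows a setup cell like the one above to be re-run without errors.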

## Load Dataset

Download and prepare the dataset in the data directories defined by the configuration.

In [None]:
from src.load_data import ensure_data_available

ensure_data_available(config=config)

print("Dataset ready")

## Explore the Dataset

Load up to three shards from each of the training, validation, and test sets and inspect them to understand the data structure and characteristics.

In [None]:
from src.explore_data import load_shards, explore_dataset

train_ds = load_shards(config.train_dir, max_shards=3)
val_ds   = load_shards(config.val_dir, max_shards=3)
test_ds  = load_shards(config.test_dir, max_shards=3)

explore_dataset(train_ds, "Train set")
explore_dataset(val_ds, "Validation set")
explore_dataset(test_ds, "Test set")

## Compute Embeddings

Generate embeddings for the corpus using the configured model. These embeddings will be used for semantic similarity search in the RAG pipeline.

In [None]:
from src.compute_embeddings import compute_embeddings

corpus, corpus_embeddings = compute_embeddings(config=config)
print("embeddings loaded")
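
As an illustration of how such embeddings can support semantic similarity search, the sketch below ranks a few toy vectors by cosine similarity to a query vector using NumPy. It is not the retrieval code of this repository: the helper `top_k_cosine` and the toy data are hypothetical, and the sketch assumes that embeddings are available as a 2-D array with one non-zero row per corpus entry.

```python
import numpy as np


def top_k_cosine(query_vec, doc_matrix, k=3):
    """Return (indices, scores) of the k rows most similar to query_vec."""
    query = np.asarray(query_vec, dtype=np.float64)
    docs = np.asarray(doc_matrix, dtype=np.float64)
    # Normalise to unit length so the dot product equals cosine similarity
    # (assumes no vector has zero norm)
    query = query / np.linalg.norm(query)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = docs @ query
    k = min(k, len(scores))
    order = np.argsort(-scores)[:k]  # indices sorted by descending similarity
    return order, scores[order]


# Toy data standing in for real corpus embeddings
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
toy_query = np.array([1.0, 0.1, 0.0])

indices, scores = top_k_cosine(toy_query, toy_embeddings, k=2)
print(indices, scores)
```

In this toy example the first row points in almost the same direction as the query and ranks highest, followed by the third row. In a real pipeline the query vector would need to come from the same embedding model as the corpus for the scores to be meaningful.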