# Baseline Comparison Setup

This Colab notebook runs the official GraphRAG, DyG-RAG, and KEDKG pipelines
on the 25-, 168-, and 500-query datasets exported from this repo.

**Steps**
1. Mount Google Drive and clone the InsightSpike-AI repo (or upload datasets).
2. Install required dependencies for each baseline.
3. Run each baseline pipeline and collect its metrics (PER, acceptance, FMR, latency).
4. Save results back to JSON for later aggregation.

Replace paths as necessary when running in Colab.

## 0. Environment / Repository Setup

In [0]:
# in Colab: mount Google Drive (optional)
# from google.colab import drive
# drive.mount('/content/drive')

# clone InsightSpike-AI if not yet available
# !git clone https://github.com/<USER>/InsightSpike-AI.git
# %cd InsightSpike-AI

# optional: install poetry if needed
# !pip install poetry

## 1. Install Baseline Dependencies

Each baseline has its own requirements. Run the appropriate cell(s) depending on which baseline you need.

In [0]:
# GraphRAG dependencies
# !pip install langchain==0.2.0 faiss-cpu pygraphviz pydantic==1.10.12 sentence-transformers
# !git clone https://github.com/microsoft/graphrag.git external/graphrag
# (Add SBERT integration as needed)

In [0]:
# DyG-RAG dependencies (example; adjust versions accordingly)
# !pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# !pip install torch-geometric torch-scatter torch-sparse torch-cluster torch-spline-conv
# (Add repository clone / setup commands here)

In [0]:
# KEDKG dependencies (example)
# !pip install torch torch-geometric networkx
# (Clone the official repo or add direct implementation commands)
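
After running the install cells, it can help to confirm which packages are actually importable before launching a baseline. The cell below is a minimal sketch. The `BASELINE_MODULES` mapping is an assumption derived from the `pip install` lines above (import names differ from package names, e.g. `faiss-cpu` imports as `faiss`), and it lists only a subset of the DyG-RAG extras, so adjust it to the baselines you actually set up.

```python
import importlib.util

# Hypothetical mapping of baseline -> import names, based on the pip lines above.
BASELINE_MODULES = {
    'GraphRAG': ['langchain', 'faiss', 'pygraphviz', 'pydantic', 'sentence_transformers'],
    'DyG-RAG': ['torch', 'torch_geometric'],
    'KEDKG': ['torch', 'torch_geometric', 'networkx'],
}

def missing_modules(modules):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]

for baseline, modules in BASELINE_MODULES.items():
    missing = missing_modules(modules)
    status = 'ok' if not missing else 'missing: ' + ', '.join(missing)
    print(f'{baseline}: {status}')
```

`find_spec` only locates the module without importing it, so the check stays fast even for heavy packages such as `torch`.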

## 2. Load Datasets (25 / 168 / 500)

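Assumes the `experiments/rag-baselines-data/*` artifacts (documents TSV, questions JSONL) are present in the repo. Point `DATA_ROOT` in the cell below at whichever directory holds the question JSONL files in your checkout.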

In [0]:
import json
from pathlib import Path

DATA_ROOT = Path('experiments/baseline-comparison/data')
DATASETS = {
    '25': DATA_ROOT / 'sample_queries.jsonl',
    '168': DATA_ROOT / 'sample_queries_168.jsonl',
    '500': DATA_ROOT / 'sample_queries_500.jsonl',
}

def load_jsonl(path: Path):
    with path.open('r', encoding='utf-8') as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)

datasets = {k: list(load_jsonl(p)) for k, p in DATASETS.items()}
print({k: len(v) for k, v in datasets.items()})
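
Before handing the queries to a baseline, it may be worth checking which fields each record carries, since the official pipelines may expect different input schemas. The sketch below counts top-level keys across records. The field names in the toy records are hypothetical and are not taken from the exported datasets.

```python
from collections import Counter

def field_coverage(records):
    """Count how many records contain each top-level key."""
    counts = Counter()
    for record in records:
        counts.update(record.keys())
    return dict(counts)

# Toy records with hypothetical field names; in the notebook, pass e.g. datasets['25'].
toy_records = [
    {'id': 1, 'question': 'q1'},
    {'id': 2, 'question': 'q2', 'answer': 'a2'},
]
print(field_coverage(toy_records))
```

A key whose count is lower than the number of records points to rows that a baseline adapter would need to handle or skip.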


## 3. Run GraphRAG

Example: call the GraphRAG pipeline with SBERT embeddings and collect `per_mean`, `acceptance_rate`, `fmr`, `latency_p50`, etc.
The exact invocation depends heavily on the repository's CLI/SDK; the cell below is a sketch that shells out to a `run_graphrag_baseline.py` wrapper script.


In [0]:
from pathlib import Path
import subprocess
import sys

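# Assumes the working directory sits three levels below `experiments/`, so that
# Path.cwd().parents[2] resolves to the experiments directory; adjust if it does not.
NOTEBOOK_DIR = Path.cwd()
EXPERIMENTS_DIR = NOTEBOOK_DIR.parents[2]
RAG_LITE_ROOT = EXPERIMENTS_DIR / 'rag-dynamic-db-v3-lite'
BASELINES_DIR = EXPERIMENTS_DIR / 'rag-baselines'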

DATASET = RAG_LITE_ROOT / 'data' / 'sample_queries.jsonl'
OUTPUT = RAG_LITE_ROOT / 'results' / 'graph_rag_baseline_25.json'

print(f'▶︎ dataset: {DATASET}')
print(f'▶︎ output:  {OUTPUT}')
OUTPUT.parent.mkdir(parents=True, exist_ok=True)

cmd = [
    sys.executable,
    str(BASELINES_DIR / 'run_graphrag_baseline.py'),
    '--dataset', str(DATASET),
    '--output', str(OUTPUT),
    '--embedding-model', 'sentence-transformers/all-MiniLM-L6-v2',
]
print('$', ' '.join(cmd))
subprocess.run(cmd, check=True)
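
If a baseline only exposes raw per-query timings, a latency summary can be derived separately. The following is a minimal sketch, assuming the timings are available as a list of seconds. Only `latency_p50` is named in this notebook; `latency_mean` and `latency_p95` are hypothetical extra keys, and the wrapper script may compute its percentiles differently.

```python
import statistics

def latency_summary(latencies_s):
    """Summarise per-query latencies (seconds) as mean, p50 and p95."""
    if not latencies_s:
        raise ValueError('no latencies recorded')
    if len(latencies_s) == 1:
        p95 = latencies_s[0]
    else:
        # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
        p95 = statistics.quantiles(latencies_s, n=20, method='inclusive')[-1]
    return {
        'latency_mean': statistics.fmean(latencies_s),
        'latency_p50': statistics.median(latencies_s),
        'latency_p95': p95,
    }

# Placeholder timings for illustration only, not measured results.
print(latency_summary([0.3, 0.1, 0.2]))
```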


## 4. Run DyG-RAG / KEDKG

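The cell below runs a DyG-RAG-like baseline through `run_dygrag_like_baseline.py`; add a similar cell for KEDKG using the official code or an adapted script. Ensure every output reports the same metrics so the results are comparable.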
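
One way to check comparability is to verify that each baseline's metrics dictionary contains the same keys. The sketch below assumes a flat metrics dict and uses the four metric names mentioned in section 3; whether every baseline script emits exactly these keys is an assumption to confirm against its actual output.

```python
# Metric names taken from section 3; treat the exact key set as an assumption.
REQUIRED_METRICS = ('per_mean', 'acceptance_rate', 'fmr', 'latency_p50')

def missing_metrics(metrics, required=REQUIRED_METRICS):
    """Return the required metric names absent from a metrics dict."""
    return [name for name in required if name not in metrics]

# Placeholder values for illustration only, not measured results.
example = {'per_mean': 0.0, 'acceptance_rate': 0.0, 'latency_p50': 0.0}
print(missing_metrics(example))
```

An empty list means the output can be placed next to the geDIG metrics without gaps.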


In [0]:
from pathlib import Path
import subprocess
import sys

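# Assumes the working directory sits three levels below `experiments/`, so that
# Path.cwd().parents[2] resolves to the experiments directory; adjust if it does not.
NOTEBOOK_DIR = Path.cwd()
EXPERIMENTS_DIR = NOTEBOOK_DIR.parents[2]
RAG_LITE_ROOT = EXPERIMENTS_DIR / 'rag-dynamic-db-v3-lite'
BASELINES_DIR = EXPERIMENTS_DIR / 'rag-baselines'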

DATASET = RAG_LITE_ROOT / 'data' / 'sample_queries.jsonl'
OUTPUT = RAG_LITE_ROOT / 'results' / 'dygrag_like_baseline_25.json'

print(f'▶︎ dataset: {DATASET}')
print(f'▶︎ output:  {OUTPUT}')
OUTPUT.parent.mkdir(parents=True, exist_ok=True)

cmd = [
    sys.executable,
    str(BASELINES_DIR / 'run_dygrag_like_baseline.py'),
    '--dataset', str(DATASET),
    '--output', str(OUTPUT),
    '--embedding-model', 'sentence-transformers/all-MiniLM-L6-v2',
    '--ag-margin', '0.05',
    '--dg-threshold', '0.7',
]
print('$', ' '.join(cmd))
subprocess.run(cmd, check=True)
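
The GraphRAG and DyG-RAG cells differ only in the script name, the output file, and a few extra flags, so the shared part can be factored into a helper. This is a sketch, not part of the repo: `build_baseline_cmd` and `run_baseline` are hypothetical names, and only the `--dataset` and `--output` flags used in the cells above are assumed.

```python
import shlex
import subprocess
import sys
import time
from pathlib import Path

def build_baseline_cmd(script, dataset, output, extra_args=()):
    """Build the command line for a baseline wrapper script."""
    return [sys.executable, str(script),
            '--dataset', str(dataset),
            '--output', str(output),
            *extra_args]

def run_baseline(script, dataset, output, extra_args=()):
    """Run a baseline script and return its wall-clock duration in seconds."""
    cmd = build_baseline_cmd(script, dataset, output, extra_args)
    Path(output).parent.mkdir(parents=True, exist_ok=True)
    print('$', shlex.join(cmd))
    start = time.perf_counter()
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return time.perf_counter() - start

# Example call, reusing the variables defined in the cell above:
# run_baseline(BASELINES_DIR / 'run_dygrag_like_baseline.py', DATASET, OUTPUT,
#              ['--ag-margin', '0.05', '--dg-threshold', '0.7'])
```

The returned duration covers the whole subprocess, including model loading, so it is not a substitute for per-query latency.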


## 5. Aggregate Metrics

Use the provided aggregation utility (or inline code) to compare with geDIG results.

In [0]:
from pathlib import Path
import json

GEDIG_RESULTS = {
    '25': Path('experiments/baseline-comparison/geDIG_results/rag_v3_lite_sample_queries_20251014_075707.json'),
    '168': Path('experiments/baseline-comparison/geDIG_results/rag_v3_lite_168_20251014_075956.json'),
    '500': Path('experiments/baseline-comparison/geDIG_results/rag_v3_lite_500_20251014_064807.json'),
}

def load_metrics(path: Path, key: str):
    """Return the metrics stored under `results[key]` in a result JSON file."""
    data = json.loads(path.read_text(encoding='utf-8'))
    return data['results'][key]

# Example aggregation
summary = {}
for k, gedig_path in GEDIG_RESULTS.items():
    summary[k] = {
        'geDIG': load_metrics(gedig_path, 'gedig_ag_dg'),
        # 'GraphRAG_official': load_metrics(Path(f'results/graph_rag_official_{k}.json'), 'graphrag'),
        # 'DyG-RAG_official': load_metrics(Path(f'results/dygrag_official_{k}.json'), 'dygrag'),
    }

print(summary)
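
For side-by-side comparison, the nested `summary` dict can be flattened into one row per dataset and method and written out as CSV. The sketch below assumes each metrics entry is a flat dict of scalar values, which may not hold for the real result files. The numbers in `toy_summary` are placeholders, not measured results.

```python
import csv
import io

def summary_to_rows(summary):
    """Flatten {dataset: {method: metrics}} into a list of row dicts."""
    rows = []
    for dataset, methods in summary.items():
        for method, metrics in methods.items():
            rows.append({'dataset': dataset, 'method': method, **metrics})
    return rows

def rows_to_csv(rows):
    """Serialise rows to CSV text, using the union of keys as columns."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval='')
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Placeholder values for illustration only.
toy_summary = {'25': {'geDIG': {'per_mean': 0.0, 'fmr': 0.0}}}
print(rows_to_csv(summary_to_rows(toy_summary)))
```

Missing metrics are written as empty cells, which makes gaps between baselines easy to spot in a spreadsheet.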

## Notes
- Replace the pseudocode sections with the actual CLI/SDK calls for each baseline.
- Save each baseline's outputs (JSON/CSV) so they can be pulled into the main repo and aggregated.
- Document the compute environment (Colab GPU/CPU specs) when reporting results.
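
To record the compute environment alongside the results, something like the following could be run once per session. It is a minimal sketch: the dictionary keys are arbitrary choices, and the GPU lookup assumes PyTorch is installed, falling back to `None` otherwise.

```python
import json
import os
import platform
import sys

def describe_environment():
    """Collect basic interpreter, OS, CPU and (if available) GPU details."""
    info = {
        'python': sys.version.split()[0],
        'platform': platform.platform(),
        'machine': platform.machine(),
        'cpu_count': os.cpu_count(),
        'gpu': None,
    }
    try:
        import torch  # optional; only used to query the GPU name
        if torch.cuda.is_available():
            info['gpu'] = torch.cuda.get_device_name(0)
    except ImportError:
        pass
    return info

print(json.dumps(describe_environment(), indent=2))
```

Saving this dictionary next to each result JSON keeps the hardware context attached to the numbers it produced.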