## 📌 What `build_faiss_index.py` Does

This script builds a FAISS index from precomputed BGE-M3 text embeddings stored in a `.jsonl` file. It:

1. Loads vector embeddings and their associated Neo4j node IDs.
2. Optionally normalizes vectors for cosine similarity.
3. Constructs a FAISS index using `IndexFlatIP` (for inner product / cosine similarity).
4. Saves the FAISS index to disk for fast semantic search.
5. Writes an ID map to match FAISS results back to original Neo4j elements.

All paths and settings (such as the vector dimension and logging options) are loaded from `config.yaml`.

### 🧰 Standard Library Modules

- `os`
  Handles file paths and directory operations (e.g., joining paths, checking if files exist).

- `json`
  Reads and writes JSON and JSONL files (e.g., for embeddings or ID maps).

- `logging`
  Outputs progress, errors, and debug messages to a log file for monitoring and troubleshooting.

---

### 📊 Numerical & Indexing Modules

- `numpy`
  Converts lists of embeddings into `float32` arrays for FAISS and supports efficient matrix operations.

- `faiss`
  Facebook AI Similarity Search. Builds a high-performance vector index for fast similarity search, here using inner product (which equals cosine similarity when the vectors are L2-normalized).

---

### 🔧 Project-Specific Module

- `digitaiCore.config_loader.ConfigLoader`
  Loads configuration from `config.yaml` with dot-notation access.
  Used to retrieve paths, model settings, batch sizes, and logging options consistently across the project.

In [None]:
import os
import json
import faiss
import numpy as np
import logging
from digitaiCore.config_loader import ConfigLoader

## Load Config 💿

In [None]:
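# Repo root is the parent of this script's directory.
# Note: __file__ is only defined when running as a script, not inside a notebook kernel.
repo_root = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
config_path = os.path.join(repo_root, "digitaiCore", "config.yaml")  # location of config.yaml
config = ConfigLoader(config_path)  # load the configuration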
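
The `ConfigLoader` source is not shown here, so the following is only a sketch of how dot-notation lookup over a nested configuration might work. `get_by_dots` and the sample dictionary are hypothetical; the real loader reads `config.yaml` and may behave differently, for example in how it handles missing keys.

```python
from typing import Any


def get_by_dots(config: dict, dotted_key: str, default: Any = None) -> Any:
    """Resolve a key such as "logging.level" against a nested dict (hypothetical helper)."""
    node: Any = config
    for part in dotted_key.split("."):
        # Stop as soon as the path leaves the nested-dict structure.
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


# Sample values are invented for illustration; only the key names come from the script.
sample_config = {
    "logging": {"level": "INFO"},
    "vectorIndex": {"dimension": 1024},
}

level = get_by_dots(sample_config, "logging.level")
dimension = get_by_dots(sample_config, "vectorIndex.dimension")
missing = get_by_dots(sample_config, "dataPaths.faissIndex")  # not in the sample, so None
```

Returning a default for missing keys is one possible design; a stricter loader could raise an error instead so that a misspelled key fails early.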

## Set Up Logging 📝
Whether logging is enabled and where the log file is written are controlled by parameters in `config.yaml`.
- Logging is enabled by default and is strongly recommended.
    - It tracks batch-processing progress and records where errors have occurred.
        - The embedding file has far too many lines to inspect by hand, so the log is what makes verification and diagnosis practical.

In [None]:
log_path = os.path.join(repo_root, config.get("logging.faissLog"))
logging.basicConfig(
    filename=log_path,
    level=getattr(logging, config.get("logging.level")),
    format=config.get("logging.format")
)
logging.info("=== FAISS Index Build Script Start ===")

### Pull Output Paths and Pre-Sets from Config 🛠️

In [None]:
embedding_path = os.path.join(repo_root, config.get("dataPaths.bgem3Embeddings"))
index_output_path = os.path.join(repo_root, config.get("dataPaths.faissIndex"))
id_map_path = os.path.join(repo_root, config.get("dataPaths.faissIdMap"))
dimension = config.get("vectorIndex.dimension")

## Load Embeddings 💿 and Build Index 👷‍♀️

In [None]:
logging.info(f"📥 Loading embeddings from: {embedding_path}")
embeddings = []
id_map = {}

with open(embedding_path, "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        embeddings.append(record["embedding"])
        id_map[i] = record["id"]  # FAISS index position → node_id

# FAISS expects a float32 matrix of shape (n_vectors, dimension)
embedding_matrix = np.asarray(embeddings, dtype="float32")
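
The loop above implies that each line of the embeddings file is a JSON object with an `id` and an `embedding` field. Here is a small self-contained sketch of that layout, using an in-memory file with made-up IDs and three-dimensional vectors (real BGE-M3 vectors are much longer):

```python
import io
import json

# Two invented records in the JSONL layout the loader reads: one JSON object per line.
sample_jsonl = io.StringIO(
    '{"id": "node-a", "embedding": [0.1, 0.2, 0.3]}\n'
    "\n"
    '{"id": "node-b", "embedding": [0.4, 0.5, 0.6]}\n'
)

embeddings = []
id_map = {}
for line in sample_jsonl:
    line = line.strip()
    if not line:  # tolerate blank lines instead of failing in json.loads
        continue
    record = json.loads(line)
    id_map[len(embeddings)] = record["id"]  # row position -> node ID
    embeddings.append(record["embedding"])
```

Keying the map on `len(embeddings)` rather than the raw line number keeps positions aligned with matrix rows even when a line is skipped. This is a variation on the script's loop, not a description of it.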

### Normalize for cosine similarity (if enabled) 🧮

In [None]:
if config.get("embedding.normalize"):
    logging.info("📐 Normalizing embeddings for cosine similarity...")
    # normalize_L2 modifies the array in place and returns None, so do not reassign the result
    faiss.normalize_L2(embedding_matrix)
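
`faiss.normalize_L2` rescales each row of the matrix to unit length, in place. This matters because the inner product of two unit-length vectors equals their cosine similarity. The NumPy-only sketch below checks that on made-up vectors; `l2_normalize_rows` is a hypothetical helper standing in for the FAISS call so the example runs without FAISS installed.

```python
import numpy as np


def l2_normalize_rows(matrix: np.ndarray) -> np.ndarray:
    """Return a copy with each row scaled to unit L2 norm (zero rows are left unchanged)."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero
    return matrix / norms


vectors = np.array([[3.0, 4.0], [1.0, 0.0]], dtype="float32")
unit = l2_normalize_rows(vectors)

# Cosine similarity computed directly from the raw vectors.
cosine = vectors[0] @ vectors[1] / (
    np.linalg.norm(vectors[0]) * np.linalg.norm(vectors[1])
)
# Inner product of the normalized vectors; this should match the cosine value.
inner = unit[0] @ unit[1]
```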

## Create FAISS Index 👷‍♀️ (Inner Product = Cosine Similarity When Normalized)

In [None]:
logging.info(f"🔧 Building FAISS index: {len(embedding_matrix)} vectors, dimension = {dimension}")
index = faiss.IndexFlatIP(dimension)
index.add(embedding_matrix)
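
`IndexFlatIP` performs an exact, exhaustive search: each query is scored against every stored vector by inner product. The NumPy sketch below reproduces that computation on a few made-up unit vectors to show the kind of result the index returns. `brute_force_ip_search` is a hypothetical name; a real query would call `index.search` instead.

```python
import numpy as np


def brute_force_ip_search(stored: np.ndarray, queries: np.ndarray, k: int):
    """Score every query against every stored vector and keep the k best per query."""
    scores = queries @ stored.T                  # shape: (n_queries, n_stored)
    order = np.argsort(-scores, axis=1)[:, :k]   # positions sorted by descending score
    top_scores = np.take_along_axis(scores, order, axis=1)
    return top_scores, order


# Invented unit-length vectors; row positions play the role of FAISS index positions.
stored = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype="float32")
queries = np.array([[0.0, 1.0]], dtype="float32")

scores, positions = brute_force_ip_search(stored, queries, k=2)
```

The returned positions are row numbers, which is why the script also writes an ID map to translate them back to node IDs.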

## Save FAISS Index 📝 and ID Map 🗺️, Log Completion ✅

In [None]:
faiss.write_index(index, index_output_path)
logging.info(f"✅ FAISS index saved to: {index_output_path}")

# Save ID map (JSON stores the integer positions as string keys, e.g. 0 → "0")
with open(id_map_path, "w") as f:
    json.dump(id_map, f)
logging.info(f"🗂️  ID map saved to: {id_map_path}")
logging.info("🏁 FAISS Index Build Script Complete.")
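
One detail to keep in mind when reading the ID map back: JSON object keys are always strings, so the integer positions written by `json.dump` come back as `"0"`, `"1"`, and so on. The sketch below converts them back and resolves a list of result positions to node IDs. The node IDs and positions are invented, and `resolve_positions` is a hypothetical helper. FAISS uses `-1` as a placeholder when fewer than `k` results are available, so the helper skips that value.

```python
import json

# Invented map in the same shape the script builds: index position -> node ID.
id_map = {0: "node-a", 1: "node-b", 2: "node-c"}

# Round trip through JSON, as happens when the map is written and later reloaded.
serialized = json.dumps(id_map)
loaded = json.loads(serialized)  # keys are now strings

# Convert the keys back to integers so they can be matched against FAISS positions.
restored = {int(position): node_id for position, node_id in loaded.items()}


def resolve_positions(positions, position_to_id):
    """Map result positions to node IDs, skipping the -1 'no result' placeholder."""
    return [position_to_id[p] for p in positions if p != -1]


hits = resolve_positions([2, 0, -1], restored)
```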