# 04. Create Vector Index (FAISS)

## What is a Vector Index?
We have our embeddings (lists of numbers). To answer a user's question, we need to find the embeddings *closest* to it, and comparing the question against every single row one by one becomes too slow once there are millions of documents.

A **Vector Index** is a data structure built for this nearest-neighbor lookup: given a query vector, it returns the closest stored vectors quickly.

Since Databricks Free Edition doesn't have the managed "Vector Search" feature, we will build our own index using a library called **FAISS** (Facebook AI Similarity Search).

## Step 1: Install FAISS
We install the CPU version of FAISS.

In [None]:
%pip install faiss-cpu

## Step 2: Load Embeddings
We load our `gold_embeddings` table.

In [None]:
import faiss
import numpy as np
import os
import pickle

spark.sql("USE rag_demo")

# Load the table
df_embeddings = spark.table("gold_embeddings")

# Convert to Pandas so we can work with it locally on the driver
# NOTE: toPandas() pulls every row into driver memory. That is okay for small
# demos, but it would not work for huge datasets.
pdf = df_embeddings.select("chunk_id", "embedding").toPandas()
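
Before the `embedding` column is stacked into one array, it can be worth confirming that every row has the same length, since a single ragged row would otherwise surface later as a confusing NumPy error. The sketch below illustrates the idea on toy data; `to_float32_matrix` is a hypothetical helper introduced here, not part of FAISS or pandas.

```python
import numpy as np

def to_float32_matrix(rows):
    """Stack equal-length embedding rows into a 2-D float32 array.

    Hypothetical helper for illustration; raises ValueError on ragged input.
    """
    lengths = {len(row) for row in rows}
    if len(lengths) != 1:
        raise ValueError(f"Embeddings have inconsistent lengths: {sorted(lengths)}")
    # FAISS expects a C-contiguous float32 array of shape (n_vectors, dim)
    return np.ascontiguousarray(np.asarray(rows, dtype="float32"))

# Toy data standing in for pdf["embedding"].tolist()
toy_rows = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
matrix = to_float32_matrix(toy_rows)
print(matrix.shape, matrix.dtype)
```

In the notebook, the same check could be applied to `pdf["embedding"].tolist()` before building the index.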

## Step 3: Build the Index
We convert our list of embeddings into a format FAISS understands and build the index.

In [None]:
# 1. Convert the 'embedding' column to a numpy array of floats
embeddings_list = pdf["embedding"].tolist()
embeddings_array = np.array(embeddings_list).astype("float32")

# 2. Get the dimension size (how many numbers in each list?)
# For our model, this should be 384.
d = embeddings_array.shape[1]

# 3. Create the index
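# IndexFlatL2 is an exact (brute-force) index: it compares the query with every
# stored vector using squared Euclidean (L2) distance. Smaller = more similar.
# That is fine at demo scale.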
index = faiss.IndexFlatL2(d)

# 4. Add our vectors to the index
index.add(embeddings_array)

print(f"Success! Built index with {index.ntotal} vectors.")
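
To make concrete what a flat L2 index does at query time, here is a NumPy-only sketch that mirrors the computation on toy data: it compares the query against every stored vector and returns the `k` smallest squared Euclidean distances together with their row positions. With the real index, the corresponding call is `index.search(queries, k)`, which returns a distances array and an indices array. The function name `flat_l2_search` and the toy vectors are assumptions made for illustration.

```python
import numpy as np

def flat_l2_search(stored, queries, k):
    """Exhaustive nearest-neighbor search by squared L2 distance.

    Illustrative stand-in for a flat index, not FAISS itself.
    Returns (distances, indices), each of shape (n_queries, k).
    """
    # Pairwise differences: (n_queries, n_stored, dim)
    diffs = queries[:, None, :] - stored[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    # Positions of the k closest stored vectors for each query
    order = np.argsort(sq_dists, axis=1)[:, :k]
    return np.take_along_axis(sq_dists, order, axis=1), order

stored = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]], dtype="float32")
queries = np.array([[0.9, 0.1]], dtype="float32")

distances, indices = flat_l2_search(stored, queries, k=2)
print(indices)    # row positions of the two nearest stored vectors
print(distances)  # their squared distances to the query
```

The returned positions are row numbers in the order the vectors were added, which is why a separate mapping back to chunk IDs is needed.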

## Step 4: Save Index and Metadata
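A flat FAISS index stores only the vectors and identifies each one by its insertion position (0, 1, 2, ...); it does not store the text or our chunk IDs. We need to save: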
1.  The **FAISS Index** file.
2.  A **Mapping** file that tells us "Vector #5 corresponds to Chunk ID #102".

In [None]:
# Create a dictionary mapping: FAISS ID -> Chunk ID
id_mapping = {i: chunk_id for i, chunk_id in enumerate(pdf["chunk_id"])}

# Define paths
local_tmp_dir = "/tmp/rag_data_tmp/"
os.makedirs(local_tmp_dir, exist_ok=True)

local_index_path = os.path.join(local_tmp_dir, "faiss_index.bin")
local_mapping_path = os.path.join(local_tmp_dir, "id_mapping.pickle")

dbfs_dir = "dbfs:/FileStore/rag_data/"
dbutils.fs.mkdirs(dbfs_dir)
dbfs_index_path = dbfs_dir + "faiss_index.bin"
dbfs_mapping_path = dbfs_dir + "id_mapping.pickle"

# Save locally first
faiss.write_index(index, local_index_path)

with open(local_mapping_path, "wb") as f:
    pickle.dump(id_mapping, f)

# Move to DBFS
dbutils.fs.cp("file:" + local_index_path, dbfs_index_path)
dbutils.fs.cp("file:" + local_mapping_path, dbfs_mapping_path)

print(f"Index saved to {dbfs_index_path}")
print(f"Mapping saved to {dbfs_mapping_path}")
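
At query time the two saved files are used together: a search returns row positions, and the mapping turns those positions into chunk IDs. The sketch below shows only the mapping half, using a temporary directory and made-up chunk IDs so that it runs anywhere; the `positions` list is a stand-in for the indices a search would return.

```python
import os
import pickle
import tempfile

# Made-up chunk IDs, in the order their vectors would have been added
chunk_ids = [101, 102, 205]
id_mapping = dict(enumerate(chunk_ids))  # row position -> chunk ID

with tempfile.TemporaryDirectory() as tmp_dir:
    mapping_path = os.path.join(tmp_dir, "id_mapping.pickle")

    # Save the mapping, then read it back as a later notebook would
    with open(mapping_path, "wb") as f:
        pickle.dump(id_mapping, f)

    with open(mapping_path, "rb") as f:
        loaded_mapping = pickle.load(f)

# Stand-in for the row positions a search would return, nearest first
positions = [2, 0]
matched_chunk_ids = [loaded_mapping[p] for p in positions]
print(matched_chunk_ids)
```

The index file itself would be read back with `faiss.read_index`, the counterpart of the `faiss.write_index` call used above.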

## Step 5: Log to MLflow (Optional)
MLflow is a tool for tracking machine learning experiments. We can log our index as an artifact so we can find it later.

In [None]:
import mlflow

# Start an MLflow run
with mlflow.start_run(run_name="faiss_index_creation"):
    # Log some stats
    mlflow.log_param("num_vectors", index.ntotal)
    mlflow.log_param("embedding_dim", d)
    
    # Log the actual files (the local copies written in Step 4)
    mlflow.log_artifact(local_index_path, artifact_path="index")
    mlflow.log_artifact(local_mapping_path, artifact_path="index")
    
    print("Logged artifacts to MLflow run.")