## Task 3: Vector Store Ingestion Logic Test

**Objective:** 
Before running the full-scale ingestion script on 1.37M rows, we will:
1. Load the pre-computed embeddings parquet file.
2. Inspect column names and data types.
3. Test the FAISS ingestion logic on a small sample (1,000 rows).
4. Verify that we can successfully search the created index.

In [4]:
import os
import pandas as pd
import numpy as np
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from tqdm.notebook import tqdm

In [None]:
# Setup paths
INPUT_PATH = "../data/processed/complaint_embeddings.parquet" # Adjust path relative to notebook location
OUTPUT_TEST_DIR = "../vector_store/test_faiss_index"


## 1. Load and Inspect Data
We load the dataframe to check the column names.
**Note:** With roughly 1.37M rows, loading the file may take a moment.

In [None]:
if not os.path.exists(INPUT_PATH):
    print(f"❌ Error: File not found at {INPUT_PATH}")
else:
    print(f"✅ Found file at {INPUT_PATH}")
    
    # Read the file
    df = pd.read_parquet(INPUT_PATH)
    print(f"Data Shape: {df.shape}")
    print("\nColumn Names:")
    print(df.columns.tolist())
    
    print("\nSample Row:")
    display(df.iloc[0])

✅ Found file at ../data/raw/complaint_embeddings.parquet
Data Shape: (1375327, 4)

Column Names:
['id', 'document', 'embedding', 'metadata']

Sample Row:


id                                                  14069121_0
document     a card was opened under my name by a fraudster...
embedding    [-0.04277738183736801, 0.025624370202422142, -...
metadata     {'chunk_index': 0, 'company': 'CITIBANK, N.A.'...
Name: 0, dtype: object

In [None]:
df

| | id | document | embedding | metadata |
|---|---|---|---|---|
| 0 | 14069121_0 | a card was opened under my name by a fraudster... | [-0.04277738183736801, 0.025624370202422142, -... | {'chunk_index': 0, 'company': 'CITIBANK, N.A.'... |
| 1 | 14061897_0 | i made the mistake of using my wellsfargo debi... | [-0.05458317697048187, 0.10340359061956406, 0.... | {'chunk_index': 0, 'company': 'WELLS FARGO & C... |
| 2 | 14061897_1 | and got a letter stating my dispute was reject... | [-0.03491289168596268, 0.059216588735580444, 0... | {'chunk_index': 1, 'company': 'WELLS FARGO & C... |
| 3 | 14047085_0 | dear cfpb, i have a secured credit card with c... | [-0.010181158781051636, 0.02354264445602894, -... | {'chunk_index': 0, 'company': 'CITIBANK, N.A.'... |
| 4 | 14047085_1 | y confirmation whatsoever to report to the pol... | [-0.017308838665485382, -0.007177562452852726,... | {'chunk_index': 1, 'company': 'CITIBANK, N.A.'... |
| ... | ... | ... | ... | ... |
| 1375322 | 6238123_1 | tract i had hey and i explained to them that i... | [-0.07657872885465622, 0.06277621537446976, 0.... | {'chunk_index': 1, 'company': 'Westlake Servic... |
| 1375323 | 6238123_2 | my balance and i have the documents to show th... | [-0.05301162227988243, 0.1226310133934021, 0.0... | {'chunk_index': 2, 'company': 'Westlake Servic... |
| 1375324 | 6238123_3 | alled a crew and then looking back at the cont... | [-0.07775353640317917, 0.027862858027219772, 0... | {'chunk_index': 3, 'company': 'Westlake Servic... |
| 1375325 | 6238123_4 | know my car was repossessed on now i've been c... | [-0.04154370725154877, 0.07054945826530457, 0.... | {'chunk_index': 4, 'company': 'Westlake Servic... |


## 2. Configuration & Validation
**CRITICAL STEP:** Compare the printed columns above with the variables below.

In [8]:
# --- MAPPING CONFIGURATION ---
TEXT_COL = 'document'          # Updated from 'text'
EMBEDDING_COL = 'embedding'    # This matches
METADATA_COL = 'metadata'      # The single column containing the dicts

# Validate columns exist
required_cols = [TEXT_COL, EMBEDDING_COL, METADATA_COL]
missing_cols = [col for col in required_cols if col not in df.columns]

if missing_cols:
    print(f"❌ CRITICAL ERROR: The following columns are missing: {missing_cols}")
else:
    print(f"✅ All target columns exist: {required_cols}")

✅ All target columns exist: ['document', 'embedding', 'metadata']
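
Column presence alone does not tell us whether every stored vector is usable. As a complementary check, the sketch below flags rows whose embedding is missing or has an unexpected length. The helper name `find_bad_embeddings` and the toy vectors are hypothetical and not part of the ingestion script. The model loaded later in this notebook, `all-MiniLM-L6-v2`, produces 384-dimensional vectors, so `expected_dim=384` would be the natural choice, assuming the parquet file was built with that model.

```python
from typing import Optional, Sequence


def find_bad_embeddings(
    embeddings: Sequence[Optional[Sequence[float]]], expected_dim: int
) -> list:
    """Return the positions whose vector is missing or has the wrong length."""
    bad = []
    for pos, vec in enumerate(embeddings):
        if vec is None or len(vec) != expected_dim:
            bad.append(pos)
    return bad


# Toy vectors stand in for df[EMBEDDING_COL]; a dimension of 4 keeps the example short.
toy = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7], None, [0.0, 0.0, 0.0, 0.1]]
bad_positions = find_bad_embeddings(toy, expected_dim=4)
print(bad_positions)  # → [1, 2]
```

On the real data this could be run on a small sample first, for example `find_bad_embeddings(df[EMBEDDING_COL].head(1000), expected_dim=384)`, before trusting the whole file.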


## 3. Test Ingestion Logic (Small Batch)
We simulate the batching logic of the full ingestion script on just 1,000 rows with a batch size of 200, which gives five batches.

In [10]:
# Initialize Embedding Model
# The stored vectors themselves determine the index dimension; the model is needed only to embed the query later.
# We are NOT generating embeddings for the documents (they are already there).
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Create a small sample
sample_size = 1000
batch_size = 200
df_sample = df.head(sample_size).copy()

vectorstore = None

print("Starting batch simulation...")

for i in tqdm(range(0, sample_size, batch_size), desc="Test Batches"):
    batch = df_sample.iloc[i : i + batch_size]
    
    # 1. Extract Text
    texts = batch[TEXT_COL].tolist()
    
    # 2. Extract Embeddings
    # Parquet list columns load as numpy arrays, so convert each row to a plain list of floats
    embeddings = [np.asarray(vec, dtype=np.float32).tolist() for vec in batch[EMBEDDING_COL]]
    
    # 3. Zip them together
    text_embeddings = list(zip(texts, embeddings))
    
    # 4. Extract Metadata
    metadatas = batch[METADATA_COL].tolist()
    
    # 5. Add to FAISS
    if vectorstore is None:
        vectorstore = FAISS.from_embeddings(
            text_embeddings=text_embeddings,
            embedding=embedding_model,
            metadatas=metadatas
        )
    else:
        vectorstore.add_embeddings(
            text_embeddings=text_embeddings,
            metadatas=metadatas
        )

print(f"✅ Ingestion test complete. Index contains {vectorstore.index.ntotal} vectors.")

Starting batch simulation...


Test Batches:   0%|          | 0/5 [00:00<?, ?it/s]

✅ Ingestion test complete. Index contains 1000 vectors.
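
The slicing used in the loop above can be checked in isolation. The sketch below is a hypothetical helper (`batch_bounds` is not part of the ingestion script) that yields the same `(start, end)` windows and makes the behaviour of a final partial batch explicit, which matters once the row count is no longer a multiple of the batch size.

```python
from typing import Iterator, Tuple


def batch_bounds(total: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) row ranges that cover `total` rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        # Clamp the end so the last batch never runs past the data
        yield start, min(start + batch_size, total)


print(list(batch_bounds(1000, 200)))
# → [(0, 200), (200, 400), (400, 600), (600, 800), (800, 1000)]
print(list(batch_bounds(1050, 200))[-1])
# → (1000, 1050)
```

`DataFrame.iloc` already tolerates an end index past the last row, so the clamp is not strictly required for slicing; it simply keeps logged batch sizes accurate.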



## 4. Test Retrieval
Now we test if the index actually works. We will ask a question relevant to the first few rows (usually credit cards or checking accounts).


In [12]:
query = "Why was I charged a fee?"

print(f"Querying: '{query}'")
results = vectorstore.similarity_search(query, k=3)

print("\n--- Results ---")
for i, res in enumerate(results):
    print(f"\nResult {i+1}:")
    print(f"Content: {res.page_content[:200]}...")
    print(f"Metadata: {res.metadata}")

Querying: 'Why was I charged a fee?'

--- Results ---

Result 1:
Content: ansactions and i should not have been issued this. i called them and they said they couldn't remove the fees and the timing of the bank was different and they didn't count it until yesterday or someth...
Metadata: {'chunk_index': 1, 'company': 'WELLS FARGO & COMPANY', 'complaint_id': '13994197', 'date_received': '2025-06-10', 'issue': 'Problem caused by your funds being low', 'product': 'Checking or savings account', 'product_category': 'Savings Account', 'state': 'PA', 'sub_issue': 'Overdrafts and overdraft fees', 'total_chunks': 3}

Result 2:
Metadata: {'chunk_index': 1, 'company': 'Atlanticus Services Corporation', 'complaint_id': '13885498', 'date_received': '2025-06-04', 'issue': 'Fees or interest', 'product': 'Credit card', 'product_category': 'Credit Card', 'state': 'KS', 'sub_issue': 'Problem with fees', 'total_chunks': 2}

Result 3:
Content: - paid the bill the day i received it in the mail. the late fee
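
To make clear what the search is doing, here is a minimal brute-force sketch of nearest-neighbour lookup by squared L2 distance. It assumes the index built by `FAISS.from_embeddings` is a flat (exhaustive) L2 index, which we believe is the LangChain default but have not verified in this notebook. The functions `l2_squared` and `top_k` and the toy vectors are illustrative only.

```python
import heapq
from typing import List, Sequence, Tuple


def l2_squared(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int = 3
) -> List[Tuple[float, int]]:
    """Return the k (distance, position) pairs closest to the query, nearest first."""
    scored = ((l2_squared(query, vec), idx) for idx, vec in enumerate(vectors))
    return heapq.nsmallest(k, scored)


# Toy 2-D vectors stand in for the stored complaint embeddings
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
hits = top_k([0.9, 0.1], toy_vectors, k=2)
print([idx for _, idx in hits])  # → [1, 0]
```

A smaller distance means a closer match, so results are ordered from most to least similar.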

## 5. Save and Load Test
Verify we can save to disk and load it back.


In [13]:

# Save
vectorstore.save_local(OUTPUT_TEST_DIR)
print(f"Saved test index to {OUTPUT_TEST_DIR}")


Saved test index to ../vector_store/test_faiss_index
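
Before loading, it can help to confirm that the save step wrote what we expect. The sketch below assumes `save_local` with its default index name produces `index.faiss` and `index.pkl` in the target directory; the helper `missing_index_files` is hypothetical, and the temporary directory only stands in for `OUTPUT_TEST_DIR`.

```python
import tempfile
from pathlib import Path

# Assumed file names for the default index name "index"
EXPECTED_FILES = ("index.faiss", "index.pkl")


def missing_index_files(index_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from index_dir."""
    root = Path(index_dir)
    return [name for name in expected if not (root / name).is_file()]


# Demonstrate on a throwaway directory holding only one of the two files
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "index.faiss").write_bytes(b"")
    print(missing_index_files(tmp))  # → ['index.pkl']
```

In the notebook itself, `missing_index_files(OUTPUT_TEST_DIR)` should return an empty list after a successful save.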


In [14]:
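# Load back
# allow_dangerous_deserialization is needed because part of the store is pickled;
# enable it only for index files you created yourself, as here.
loaded_store = FAISS.load_local(
    OUTPUT_TEST_DIR, 
    embedding_model, 
    allow_dangerous_deserialization=True
)

print(f"Loaded index size: {loaded_store.index.ntotal}")
# Fail loudly if the round trip lost or gained vectors
assert loaded_store.index.ntotal == vectorstore.index.ntotal, "Index size changed after reload"
print("Sanity check passed.")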

Loaded index size: 1000
Sanity check passed.
