# Data Processing: Chunking and Vector Store Creation

This notebook executes the data processing pipeline to prepare our complaint data for a retrieval system. It uses a modular script, `build_vector_store.py`, which handles two main tasks:

1.  **Chunking**: Breaks down long complaint narratives into smaller, manageable chunks.
2.  **Embedding & Indexing**: Converts these chunks into numerical vectors (embeddings) and stores them in a FAISS vector store for efficient similarity search.

### Step 1: Import Main Functions and Define Configuration

We import the necessary functions from our modular script and define the key parameters for our pipeline.

In [None]:
import pandas as pd
import os
from build_vector_store import chunk_complaints, create_and_save_faiss_index

# --- Configuration ---
CHUNK_SIZE = 500
CHUNK_OVERLAP = 100
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

BASE_DATA_DIR = '../data'
FILTERED_CSV_PATH = os.path.join(BASE_DATA_DIR, 'filtered_complaints_2.csv')
# The file name encodes the chunking parameters, so changing them
# cannot silently reuse a cached file built with different settings.
CHUNKED_CSV_PATH = os.path.join(
    BASE_DATA_DIR, f'chunked_complaints_{CHUNK_SIZE}_{CHUNK_OVERLAP}.csv'
)
VECTOR_STORE_PATH = '../vector_store'

### Step 2: Load Filtered Data

Load the cleaned and filtered dataset created during the EDA phase.

In [None]:
try:
    df = pd.read_csv(FILTERED_CSV_PATH)
    print(f"✅ Loaded filtered complaints: {df.shape}")
    display(df.head())
except FileNotFoundError:
    print(f"❌ Error: File not found at {FILTERED_CSV_PATH}. Please run the EDA notebook first.")
    raise  # Stop here: the following cells need `df` to be defined
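
Step 3 reads the `Complaint ID`, `Product`, and `cleaned_narrative` columns, so it can help to confirm they are present before the longer steps start. The check below is a minimal sketch: `check_required_columns` is a hypothetical helper, not part of `build_vector_store.py`, and dropping rows with an empty narrative is an assumption about what the chunker expects.

```python
import pandas as pd

# Column names taken from the chunking call in Step 3
REQUIRED_COLUMNS = ["Complaint ID", "Product", "cleaned_narrative"]


def check_required_columns(frame: pd.DataFrame, required=REQUIRED_COLUMNS) -> pd.DataFrame:
    """Fail fast on missing columns and drop rows with no usable narrative."""
    missing = [col for col in required if col not in frame.columns]
    if missing:
        raise KeyError(f"Missing required columns: {missing}")

    # Assumption: rows with a null or whitespace-only narrative carry nothing to chunk
    narrative = frame["cleaned_narrative"]
    has_text = narrative.notna() & (narrative.astype(str).str.strip() != "")
    return frame[has_text].reset_index(drop=True)


# Toy data for illustration only (not real complaints)
toy_df = pd.DataFrame({
    "Complaint ID": [1, 2, 3],
    "Product": ["Credit card", "Mortgage", "Savings account"],
    "cleaned_narrative": ["i was charged twice", None, "   "],
})
checked_df = check_required_columns(toy_df)
print(checked_df.shape)
```

In the notebook this would be called as `df = check_required_columns(df)` right after loading.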

### Step 3: Chunk Complaint Narratives

We'll now chunk the `cleaned_narrative` column to prepare it for the embedding model. This process can take a few minutes for a large dataset.

In [None]:
# Check if chunked data already exists to save time
if not os.path.exists(CHUNKED_CSV_PATH):
    chunked_df = chunk_complaints(
        df=df,
        text_column='cleaned_narrative',
        id_column='Complaint ID',
        product_column='Product',
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP
    )
    # Save the intermediate chunked data
    chunked_df.to_csv(CHUNKED_CSV_PATH, index=False)
    print(f"✅ Chunked data saved to: {CHUNKED_CSV_PATH}")
else:
    print(f"Chunked data file already exists at {CHUNKED_CSV_PATH}. Loading it directly.")
    chunked_df = pd.read_csv(CHUNKED_CSV_PATH)

print(f"\n✅ Chunked complaints dataframe ready: {chunked_df.shape}")
display(chunked_df.head())
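
The internals of `chunk_complaints` live in `build_vector_store.py` and are not shown in this notebook. To illustrate what a chunk size of 500 with an overlap of 100 implies, here is a minimal sliding-window sketch. It assumes both values are measured in characters and cuts at fixed offsets; the real function may split on separators or count tokens instead. `sliding_window_chunks` is a hypothetical name used only for this example.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> List[str]:
    """Split text into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Synthetic 1,200-character string standing in for a long narrative
sample_text = "".join(str(i % 10) for i in range(1200))
sample_chunks = sliding_window_chunks(sample_text, chunk_size=500, chunk_overlap=100)
print([len(chunk) for chunk in sample_chunks])
```

With these settings each window starts 400 characters after the previous one, so the last 100 characters of one chunk reappear at the start of the next. The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.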

### Step 4: Create and Save FAISS Vector Store

This final step converts the text chunks into vector embeddings and builds the FAISS index. Embedding every chunk is the most computationally intensive part of the pipeline, so expect a long run time on a large dataset, particularly on a CPU.

In [None]:
# Run the indexing process
create_and_save_faiss_index(
    chunk_df=chunked_df,
    embedding_model_name=EMBEDDING_MODEL,
    vector_store_path=VECTOR_STORE_PATH
)
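
The indexing logic sits inside `create_and_save_faiss_index` and is not reproduced here. To make concrete what a similarity search over the stored embeddings computes, the sketch below performs an exact nearest-neighbour lookup in NumPy on made-up 4-dimensional vectors. It does not call FAISS or the embedding model; `nearest_chunks` is a hypothetical helper, and the use of squared L2 distance is an assumption, since the script may use a different metric or index type.

```python
import numpy as np


def nearest_chunks(chunk_vectors: np.ndarray, query_vector: np.ndarray, k: int = 2):
    """Return (indices, squared L2 distances) of the k stored vectors closest to the query."""
    # Squared Euclidean distance from the query to every stored vector
    diffs = chunk_vectors - query_vector
    distances = np.sum(diffs ** 2, axis=1)
    # Indices of the k smallest distances, closest first
    order = np.argsort(distances)[:k]
    return order, distances[order]


# Toy stand-ins for chunk embeddings (real embeddings have many more dimensions)
toy_vectors = np.array(
    [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.9, 0.1, 0.0, 0.0],
    ],
    dtype="float32",
)
toy_query = np.array([1.0, 0.0, 0.0, 0.0], dtype="float32")

indices, distances = nearest_chunks(toy_vectors, toy_query, k=2)
print(indices, distances)
```

An exact search like this compares the query against every stored vector. A vector index exists to make the same lookup fast when the number of chunks is large.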