# Text Chunking, Embedding, and Vector Store Indexing

This notebook covers the second major phase of the project: transforming the cleaned text narratives into a format suitable for efficient semantic search within a RAG system.

**Objectives:**
1.  Mount Google Drive to access the cleaned dataset.
2.  Install necessary libraries for text chunking, embedding, and vector database operations.
3.  Load the preprocessed complaint narratives.
4.  Implement a text chunking strategy using `RecursiveCharacterTextSplitter`.
5.  Choose and load an appropriate embedding model (`sentence-transformers/all-MiniLM-L6-v2`).
6.  Generate vector embeddings for each text chunk.
7.  Create and persist vector stores using both **FAISS** and **ChromaDB**, ensuring metadata (original complaint ID, product category) is stored alongside embeddings.
8.  Save the generated vector stores to Google Drive.

## 1. Setup and Google Drive Mounting

Mount your Google Drive to access the cleaned data and save the generated vector stores. You will be prompted to authenticate your Google account.

**Important:** Adjust `PROJECT_ROOT` to match the actual location of your project folder within your Google Drive.

In [1]:
from google.colab import drive
import os
import sys

# Mount Google Drive
drive.mount('/content/drive')

# Define your project root within Google Drive
# e.g., if your project folder is 'My Drive/CrediTrust_RAG_Project'
PROJECT_ROOT = '/content/drive/My Drive/Colab_Project/'

# Change current working directory to your project root for easier relative imports and path handling
os.makedirs(PROJECT_ROOT, exist_ok=True)
os.chdir(PROJECT_ROOT)

print(f"Current working directory set to: {os.getcwd()}")

# Add the src directory to Python's path to import custom modules
if './src' not in sys.path:
    sys.path.insert(0, './src')

print(f"Python sys.path updated: {sys.path}")

Mounted at /content/drive
Current working directory set to: /content/drive/My Drive/Colab_Project
Python sys.path updated: ['./src', '/content', '/env/python', '/usr/lib/python311.zip', '/usr/lib/python3.11', '/usr/lib/python3.11/lib-dynload', '', '/usr/local/lib/python3.11/dist-packages', '/usr/lib/python3/dist-packages', '/usr/local/lib/python3.11/dist-packages/IPython/extensions', '/usr/local/lib/python3.11/dist-packages/setuptools/_vendor', '/root/.ipython']


## 2. Install Required Libraries

Install `langchain`, `sentence-transformers`, `faiss-cpu`, and `chromadb`. `tqdm`, `pandas`, and `numpy` are usually preinstalled in Colab but are listed explicitly for clarity.

In [2]:
# !pip install --upgrade pip
!pip install langchain sentence-transformers faiss-cpu chromadb tqdm pandas numpy --quiet


## 3. Import Custom Indexing Module and Configure Paths

Import the `main_indexing_process` function from `src/vector_indexing.py` and set up file paths.

In [4]:
# `from src.vector_indexing import ...` needs the directory that CONTAINS `src/`
# (the project root set in step 1) on sys.path, not `src/` itself.
import os
import sys
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)

from src.vector_indexing import main_indexing_process

# --- Configuration ---
# Define a variable for the base data directory in Google Drive
BASE_DATA_DIR = '/content/drive/MyDrive/10accademy/Week-6/Data'

# These are absolute paths built from BASE_DATA_DIR (they do not depend on PROJECT_ROOT)
CLEANED_DATA_PATH = os.path.join(BASE_DATA_DIR, 'filtered_complaints.csv')
FAISS_SAVE_DIR = os.path.join(BASE_DATA_DIR, 'vector_store', 'faiss_index')
CHROMADB_SAVE_DIR = os.path.join(BASE_DATA_DIR, 'vector_store', 'chroma_db')

# Ensure these column names match your cleaned CSV from Task 1
NARRATIVE_COLUMN = 'cleaned_Consumer complaint narrative'
ID_COLUMN = 'Complaint ID'
PRODUCT_COLUMN = 'Product'

# --- Chunking Parameters ---
# Experiment with these values. Common choices are 200-1000 for chunk_size
# and 10-20% of chunk_size for chunk_overlap.
CHUNK_SIZE = 500 # Characters per chunk
CHUNK_OVERLAP = 100 # Overlap between chunks

# --- Embedding Model ---
EMBEDDING_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"

# Create output directories if they don't exist
os.makedirs(FAISS_SAVE_DIR, exist_ok=True)
os.makedirs(CHROMADB_SAVE_DIR, exist_ok=True)


print("Configuration and paths set.")

Configuration and paths set.


## 4. Run the Chunking, Embedding, and Indexing Process

Execute the main function to perform all operations. This will:
1.  Load the `filtered_complaints.csv`.
2.  Chunk the `cleaned_Consumer complaint narrative` column.
3.  Generate embeddings using `all-MiniLM-L6-v2`.
4.  Create and save a FAISS index with associated metadata.
5.  Create and save a ChromaDB collection with associated metadata.
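
As a rough sketch of how steps 2, 4, and 5 fit together, each chunk can be turned into a record that carries its source complaint ID and product, so that metadata can be stored next to the embedding. The real logic lives in `src/vector_indexing.py` and is not shown in this notebook; the helper `build_chunk_records`, the `chunk_index` field, and the fixed-width stand-in splitter below are hypothetical and only illustrate the idea.

```python
from typing import Callable, Dict, Iterable, List


def build_chunk_records(
    rows: Iterable[Dict[str, str]],
    split: Callable[[str], List[str]],
    narrative_col: str = "cleaned_Consumer complaint narrative",
    id_col: str = "Complaint ID",
    product_col: str = "Product",
) -> List[Dict[str, object]]:
    """Attach the source complaint ID and product to every chunk (hypothetical helper)."""
    records: List[Dict[str, object]] = []
    for row in rows:
        text = row.get(narrative_col) or ""  # a missing or empty narrative yields no chunks
        for i, chunk in enumerate(split(text)):
            records.append({
                "complaint_id": row[id_col],
                "product": row[product_col],
                "chunk_index": i,  # position of the chunk within its complaint
                "text": chunk,
            })
    return records


def fixed_width_split(text: str, width: int = 20) -> List[str]:
    # Stand-in splitter for this demo only; the notebook uses RecursiveCharacterTextSplitter.
    return [text[i:i + width] for i in range(0, len(text), width)]


# Made-up rows that mimic the column names configured above.
rows = [
    {"Complaint ID": "1001", "Product": "Credit card",
     "cleaned_Consumer complaint narrative": "charged twice for the same purchase"},
    {"Complaint ID": "1002", "Product": "Personal loan",
     "cleaned_Consumer complaint narrative": "late fee applied"},
]
records = build_chunk_records(rows, fixed_width_split)
for record in records:
    print(record)
```

Keeping the complaint ID on every record is what later lets a retrieved chunk be traced back to its original complaint.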

In [None]:
main_indexing_process(
    cleaned_data_path=CLEANED_DATA_PATH,
    faiss_save_dir=FAISS_SAVE_DIR,
    chromadb_save_dir=CHROMADB_SAVE_DIR,
    narrative_col=NARRATIVE_COLUMN,
    id_col=ID_COLUMN,
    product_col=PRODUCT_COLUMN,
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    embedding_model_name=EMBEDDING_MODEL_NAME
)

--- Starting Chunking, Embedding, and Indexing Process ---
Cleaned data loaded successfully. Shape: (80267, 21)
Starting text chunking with chunk_size=500, chunk_overlap=100...


Processing narratives for chunking:   0%|          | 0/80267 [00:00<?, ?it/s]

Finished chunking. Total chunks created: 190335
Loading embedding model: sentence-transformers/all-MiniLM-L6-v2...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Embedding model loaded successfully.
Generating embeddings for 190335 chunks...


Batches:   0%|          | 0/5948 [00:00<?, ?it/s]
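
The counts in the log allow a quick back-of-the-envelope check of the workload. The sketch below assumes a batch size of 32 (the default of `SentenceTransformer.encode`, and consistent with the 5948 batches in the progress bar), 384-dimensional `float32` vectors from `all-MiniLM-L6-v2`, and an uncompressed flat layout. The size of the saved files will differ depending on the index type and on the metadata stored with it.

```python
import math

NUM_NARRATIVES = 80_267   # rows in the cleaned data, from the log above
NUM_CHUNKS = 190_335      # "Total chunks created" in the log above
BATCH_SIZE = 32           # assumption: default batch size of SentenceTransformer.encode
EMBEDDING_DIM = 384       # output dimension of all-MiniLM-L6-v2
BYTES_PER_FLOAT32 = 4

num_batches = math.ceil(NUM_CHUNKS / BATCH_SIZE)
chunks_per_narrative = NUM_CHUNKS / NUM_NARRATIVES
raw_bytes = NUM_CHUNKS * EMBEDDING_DIM * BYTES_PER_FLOAT32

print(f"Batches: {num_batches}")
print(f"Average chunks per narrative: {chunks_per_narrative:.2f}")
print(f"Raw float32 vectors: {raw_bytes / 1e6:.1f} MB")
```

Under these assumptions the batch count comes out to 5948, matching the progress bar, and the raw vectors occupy roughly 292 MB before any metadata is added.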

## 5. Verify Saved Vector Stores

Run the cell below to verify that the vector stores and their associated metadata files were saved to your Google Drive.

In [None]:
import os
import faiss
import chromadb
import pandas as pd

print("\n--- Verifying FAISS Save --- ")
# FAISS_SAVE_DIR is already an absolute path, so it is not joined with PROJECT_ROOT
faiss_index_path = os.path.join(FAISS_SAVE_DIR, "faiss_index.bin")
faiss_metadata_path = os.path.join(FAISS_SAVE_DIR, "faiss_metadata.csv")

if os.path.exists(faiss_index_path):
    print(f"FAISS index found at: {faiss_index_path}")
    try:
        index = faiss.read_index(faiss_index_path)
        print(f"FAISS index loaded successfully with {index.ntotal} vectors.")
    except Exception as e:
        print(f"Error loading FAISS index: {e}")
else:
    print(f"FAISS index NOT found at: {faiss_index_path}")

if os.path.exists(faiss_metadata_path):
    print(f"FAISS metadata found at: {faiss_metadata_path}")
    try:
        metadata_df = pd.read_csv(faiss_metadata_path)
        print(f"FAISS metadata loaded successfully. Shape: {metadata_df.shape}")
        print("First 5 rows of FAISS metadata:")
        display(metadata_df.head())
    except Exception as e:
        print(f"Error loading FAISS metadata: {e}")
else:
    print(f"FAISS metadata NOT found at: {faiss_metadata_path}")


print("\n--- Verifying ChromaDB Save --- ")
chromadb_path = CHROMADB_SAVE_DIR  # already an absolute path

if os.path.exists(chromadb_path) and os.path.isdir(chromadb_path):
    print(f"ChromaDB directory found at: {chromadb_path}")
    try:
        client = chromadb.PersistentClient(path=chromadb_path)
        collections = client.list_collections()
        if collections:
            # list_collections() returns Collection objects in some chromadb versions
            # and plain names in others, so handle both.
            names = [getattr(c, 'name', c) for c in collections]
            print(f"ChromaDB client initialized. Collections found: {names}")
            # Example: get a count from the expected collection, wherever it is in the list
            if 'complaint_chunks' in names:  # Assumed collection name
                coll = client.get_collection(name='complaint_chunks')
                print(f"ChromaDB collection 'complaint_chunks' has {coll.count()} items.")
                # You can also perform a sample query to check data
                # results = coll.peek(limit=5) # Peek at the first 5 entries
                # print("Sample ChromaDB entries:")
                # print(results)
        else:
            print("No collections found in ChromaDB instance.")
    except Exception as e:
        print(f"Error initializing ChromaDB client or listing collections: {e}")
else:
    print(f"ChromaDB directory NOT found at: {chromadb_path}")

print("\nVerification complete.")

## 6. Report Section Content: Text Chunking and Embedding Model Justification

### Text Chunking Strategy

**Why Chunking?**
Longer text narratives are often suboptimal for direct embedding as a single vector. When a document is too large, its embedding can become diluted, failing to capture the fine-grained semantic details. Chunking addresses this by breaking down long narratives into smaller, more semantically coherent units. This ensures that each chunk represents a focused piece of information, leading to more precise embeddings and improved relevance in semantic search.

**Implementation Choice:**
We chose LangChain's `RecursiveCharacterTextSplitter` because it splits on a prioritized list of separators (by default `['\n\n', '\n', ' ', '']`) rather than at fixed offsets. It first tries paragraph breaks, then line breaks, then spaces between words, and only as a last resort cuts between individual characters, moving to the next separator only when a piece still exceeds `chunk_size`. Splitting at the coarsest boundary that fits helps preserve the semantic coherence of each chunk. Note that the default list contains no sentence-level separator, so sentences stay intact only when they fall inside a piece that already fits.
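
The separator hierarchy can be illustrated with a small pure-Python sketch. This is a simplified stand-in and not LangChain's implementation: the function `recursive_split` is hypothetical, it ignores `chunk_overlap`, and it drops the separators at the points where it splits.

```python
from typing import List, Sequence


def recursive_split(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = ("\n\n", "\n", " ", ""),
) -> List[str]:
    """Simplified illustration of recursive splitting (no overlap handling)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        # This separator does not occur; try the next, finer one.
        return recursive_split(text, chunk_size, rest)

    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # A single piece is still too long: fall back to the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph here.\n\n"
    "Second paragraph is a bit longer than the first.\n\n"
    "Third."
)
for chunk in recursive_split(sample, chunk_size=40):
    print(repr(chunk))
```

In this example the first and third paragraphs fit as they are, while the second is too long for a 40-character chunk and is therefore split again at word boundaries.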

**Experimentation with `chunk_size` and `chunk_overlap`:**
To find an optimal balance, we experimented with various `chunk_size` and `chunk_overlap` values. For consumer complaint narratives, maintaining context is crucial, but chunks also need to be concise enough for effective embedding. Our chosen parameters are:
-   **`chunk_size = 500` characters:** This size is large enough that a complete thought or a short paragraph usually fits in a single chunk, yet small enough that each embedding stays focused on one issue. Based on a preliminary look at complaint lengths, 500 characters seemed to capture a logical unit of information. For typical English text, a chunk of this size also stays well within the 256-word-piece input limit of `all-MiniLM-L6-v2`, so chunks should not be truncated at embedding time.
-   **`chunk_overlap = 100` characters:** An overlap of 100 characters (20% of `chunk_size`) was chosen to reduce context loss at chunk boundaries. Text near a boundary appears in both adjacent chunks, so a passage that would otherwise be cut in two remains intact in at least one of them, and a query that matches it can still retrieve it. The cost is some duplicated text, and therefore more chunks to embed and store.

This combination aims to balance the need for semantically coherent chunks with the prevention of context loss, optimizing the input for the embedding model.
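
The effect of these two parameters can be sketched with plain fixed-size windows. With `chunk_size = 500` and `chunk_overlap = 100`, each window starts 400 characters after the previous one. The helper `window_spans` is hypothetical and ignores separators; the real splitter cuts at separator boundaries, so its chunks are usually shorter than 500 characters and the overlap is at most 100 characters.

```python
from typing import List, Tuple


def window_spans(
    text_len: int, chunk_size: int = 500, overlap: int = 100
) -> List[Tuple[int, int]]:
    """Character spans of fixed-size windows that overlap by `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - overlap  # distance between consecutive window starts
    spans: List[Tuple[int, int]] = []
    start = 0
    while start < text_len:
        end = min(start + chunk_size, text_len)
        spans.append((start, end))
        if end == text_len:
            break  # the last window reached the end of the text
        start += stride
    return spans


# A 1200-character narrative produces three windows; neighbours share 100 characters.
for start, end in window_spans(1200):
    print(start, end)
```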

### Embedding Model Choice

**Model Chosen:** `sentence-transformers/all-MiniLM-L6-v2`

**Justification:**
The `sentence-transformers/all-MiniLM-L6-v2` model was selected as the embedding model for the following reasons:

1.  **Performance and Semantic Meaning:** This model is distributed through the Sentence-Transformers (SBERT) library, which is known for producing high-quality sentence embeddings that capture semantic similarity. It maps each input to a 384-dimensional vector and was fine-tuned on more than one billion sentence pairs. For consumer complaints, where understanding the nuances of language is critical for accurate retrieval, `all-MiniLM-L6-v2` offers a good balance of semantic precision and computational efficiency.

2.  **Efficiency and Speed:** As a MiniLM variant with six Transformer layers and roughly 22 million parameters, it is much smaller and faster than full-size Transformer encoders while still performing well on similarity tasks. This efficiency matters both for embedding a large volume of customer complaints and for fast query encoding during the RAG retrieval phase, especially in a resource-constrained environment like Google Colab or when deploying the system.

3.  **Availability and Community Support:** As a widely used open-source model from the `sentence-transformers` library, `all-MiniLM-L6-v2` benefits from extensive documentation, community support, and ease of integration into existing pipelines, including LangChain.

This model's ability to generate semantically rich and efficient embeddings makes it an excellent choice for transforming our text chunks into numerical representations suitable for vector database indexing and subsequent semantic search.

### Vector Store Comparison (Conceptual for this phase)

For this phase, both **FAISS** and **ChromaDB** were implemented and used to store the generated embeddings along with essential metadata (original complaint ID, product category, and chunk text). Both successfully created persistent vector stores. A direct comparison of their retrieval performance and scalability will be part of the subsequent evaluation phase, where specific queries will be run against both indexes to determine their effectiveness in a RAG pipeline.
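
The verification cell in section 5 loads `faiss_index.bin` together with a separate `faiss_metadata.csv`, which suggests that vectors and metadata are matched by row position. That is an assumption about `src/vector_indexing.py`, not something shown in this notebook. The toy store below illustrates the pattern with brute-force cosine search; the class `ToyVectorStore` and the 3-dimensional vectors are made up for illustration and use neither FAISS nor ChromaDB.

```python
import math
from typing import Dict, List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


class ToyVectorStore:
    """Brute-force store: vectors and metadata are matched by list position."""

    def __init__(self) -> None:
        self._vectors: List[List[float]] = []
        self._metadata: List[Dict[str, str]] = []

    def add(self, vector: Sequence[float], metadata: Dict[str, str]) -> None:
        self._vectors.append(normalize(vector))
        self._metadata.append(metadata)  # same position as the vector

    def search(self, query: Sequence[float], k: int = 3) -> List[Tuple[float, Dict[str, str]]]:
        q = normalize(query)
        scored = [
            (sum(a * b for a, b in zip(q, v)), i)
            for i, v in enumerate(self._vectors)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)  # highest similarity first
        return [(score, self._metadata[i]) for score, i in scored[:k]]


store = ToyVectorStore()
store.add([0.9, 0.1, 0.0], {"complaint_id": "1001", "product": "Credit card"})
store.add([0.1, 0.9, 0.0], {"complaint_id": "1002", "product": "Personal loan"})
store.add([0.8, 0.2, 0.1], {"complaint_id": "1003", "product": "Credit card"})

hits = store.search([1.0, 0.0, 0.0], k=2)
for score, meta in hits:
    print(f"{score:.3f}", meta)
```

A dedicated vector store replaces the linear scan with an optimized index, but the requirement is the same: every hit must map back to its complaint ID and product.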