# Task 2: Text Chunking, Embedding, and Vector Indexing

We load pre-built embeddings, text chunks, and metadata from a parquet file and store them in a vector database for semantic retrieval.



We begin by importing required libraries and custom utility functions.


In [1]:
import sys
sys.path.append("../src")
import pandas as pd

from dataUtilities import stratified_sample
from vectorUtilities import load_parquet_embeddings, build_chroma_from_parquet


## Illustration: How Text Chunking Happens


### Load Cleaned Data

In [2]:
df = pd.read_csv("../data/processed/filtered_complaints.csv")
df.head()


Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID,narrative_length,cleaned_narrative
0,2025-06-13,Credit card,Store credit card,Getting a credit card,Card opened without my consent or knowledge,A XXXX XXXX card was opened under my name by a...,Company has responded to the consumer and the ...,"CITIBANK, N.A.",TX,78230,Servicemember,Consent provided,Web,2025-06-13,Closed with non-monetary relief,Yes,,14069121,91,a xxxx xxxx card was opened under my name by a...
1,2025-06-12,Credit card,General-purpose credit card or charge card,"Other features, terms, or problems",Other problem,"Dear CFPB, I have a secured credit card with c...",Company has responded to the consumer and the ...,"CITIBANK, N.A.",NY,11220,,Consent provided,Web,2025-06-13,Closed with monetary relief,Yes,,14047085,156,dear cfpb i have a secured credit card with ci...
2,2025-06-12,Credit card,General-purpose credit card or charge card,Incorrect information on your report,Account information incorrect,I have a Citi rewards cards. The credit balanc...,Company has responded to the consumer and the ...,"CITIBANK, N.A.",IL,60067,,Consent provided,Web,2025-06-12,Closed with explanation,Yes,,14040217,233,i have a citi rewards cards the credit balance...
3,2025-06-09,Credit card,General-purpose credit card or charge card,Problem with a purchase shown on your statement,Credit card company isn't resolving a dispute ...,b'I am writing to dispute the following charge...,Company has responded to the consumer and the ...,"CITIBANK, N.A.",TX,78413,Older American,Consent provided,Web,2025-06-09,Closed with monetary relief,Yes,,13968411,454,bi am writing to dispute the following charges...
4,2025-06-09,Credit card,General-purpose credit card or charge card,Problem when making payments,Problem during payment process,"Although the account had been deemed closed, I...",Company believes it acted appropriately as aut...,Atlanticus Services Corporation,NY,11212,Older American,Consent provided,Web,2025-06-09,Closed with monetary relief,Yes,,13965746,170,although the account had been deemed closed i ...


### Stratified Sampling

To make embedding feasible on limited hardware, we draw a stratified sample
of 12,000 complaints (within a 10,000–15,000 target range) while preserving the product distribution.


In [3]:
sampled_df = stratified_sample(
    df,
    label_column="Product",
    sample_size=12000
)

sampled_df["Product"].value_counts(normalize=True)


Product
Credit card    1.0
Name: proportion, dtype: float64
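
`stratified_sample` is a project helper whose implementation is not shown in this notebook. A minimal sketch of proportional stratified sampling with pandas, assuming every group is sampled at the same fraction, might look like the following. The name `stratified_sample_sketch` and the `random_state` parameter are illustrative and not part of the project's API.

```python
import pandas as pd


def stratified_sample_sketch(df, label_column, sample_size, random_state=42):
    """Return roughly `sample_size` rows while keeping each label's share."""
    if sample_size >= len(df):
        # Nothing to downsample: return a copy of the full frame
        return df.copy()
    frac = sample_size / len(df)
    # Sampling every group at the same fraction preserves label proportions
    return df.groupby(label_column, group_keys=False).sample(
        frac=frac, random_state=random_state
    )


# Toy data (made up for illustration): 80% one product, 20% another
demo_df = pd.DataFrame(
    {"Product": ["Credit card"] * 80 + ["Savings account"] * 20}
)
demo_sample = stratified_sample_sketch(demo_df, "Product", sample_size=10)
print(demo_sample["Product"].value_counts(normalize=True))
```

Because per-group sizes are rounded, the result can differ from `sample_size` by a few rows. When the input contains a single product, as in the output above, the sampled proportion is trivially 1.0.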

### Prepare Texts and Metadata

In [4]:
texts = sampled_df["cleaned_narrative"].tolist()

metadatas = [
    {
        # Use the CFPB complaint ID column rather than the DataFrame index
        "complaint_id": str(row["Complaint ID"]),
        "product": row["Product"],
        "issue": row["Issue"],
        "company": row["Company"],
        "state": row["State"]
    }
    for _, row in sampled_df.iterrows()
]

ids = [f"complaint_{i}" for i in range(len(texts))]


### Embedding and Vector Indexing

We embed complaint narratives using a sentence-transformer model
and store them in a persistent ChromaDB vector store. For now, we will use the pre-built embeddings instead.



## Load Pre-Built Embeddings

We load embeddings, text chunks, and metadata from `complaint_embeddings.parquet`.


In [2]:
parquet_path = "../data/raw/complaint_embeddings.parquet"
embeddings, texts, metadatas = load_parquet_embeddings(parquet_path)

print(f"Total chunks: {len(texts)}")
print("Sample text chunk:", texts[0])
print("Sample metadata:", metadatas[0])


Total chunks: 1375327
Sample text chunk: a card was opened under my name by a fraudster. i received a notice from that an account was just opened under my name. i reached out to to state that this activity was unauthorized and not me. confirmed this was fraudulent and immediately closed the card. however, they have failed to remove this from the three credit agencies and this fraud is now impacting my credit score based on a hard credit pull done by that was done by a fraudster.
Sample metadata: {'id': '14069121_0', 'metadata': {'chunk_index': 0, 'company': 'CITIBANK, N.A.', 'complaint_id': '14069121', 'date_received': '2025-06-13', 'issue': 'Getting a credit card', 'product': 'Credit card', 'product_category': 'Credit Card', 'state': 'TX', 'sub_issue': 'Card opened without my consent or knowledge', 'total_chunks': 1}}
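
The sample metadata suggests that each chunk is identified as `<complaint_id>_<chunk_index>` and carries `chunk_index` and `total_chunks` fields. The chunker that produced the parquet file is not shown here, so the following is only a sketch of how such records could be built, assuming fixed-size character windows with overlap. The functions `chunk_text` and `build_chunk_records`, and the values `chunk_size=500` and `chunk_overlap=50`, are assumptions for illustration.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into overlapping character windows."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


def build_chunk_records(complaint_id, text, **chunk_kwargs):
    """Attach an id and chunk bookkeeping fields to every chunk of one complaint."""
    chunks = chunk_text(text, **chunk_kwargs)
    return [
        {
            "id": f"{complaint_id}_{i}",
            "text": chunk,
            "metadata": {
                "complaint_id": str(complaint_id),
                "chunk_index": i,
                "total_chunks": len(chunks),
            },
        }
        for i, chunk in enumerate(chunks)
    ]


# A 1,200-character placeholder narrative is split into three windows
demo_records = build_chunk_records("14069121", "x" * 1200)
print([record["id"] for record in demo_records])
```

A short narrative that fits in one window yields a single record with `chunk_index` 0 and `total_chunks` 1, which matches the shape of the sample shown above.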


## Build Persistent ChromaDB Collection

We store embeddings, text chunks, and metadata in ChromaDB
for retrieval in Task 3.


In [3]:
import json


def flatten_metadata(meta):
    """
    Flatten metadata to only allow str, int, float, bool, or None.
    Converts nested dicts and lists to JSON strings if necessary.
    """
    flat = {}
    for key, value in meta.items():
        if isinstance(value, (dict, list)):
            # Convert dict/list to JSON string
            flat[key] = json.dumps(value)
        else:
            flat[key] = value
    return flat


In [4]:
metadatas_flat = [flatten_metadata(m) for m in metadatas]
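
Each sample record has the shape `{'id': ..., 'metadata': {...}}`, so `flatten_metadata` serializes the whole inner dict into one JSON string under the `metadata` key. If retrieval later needs to filter on individual fields such as `product` or `company`, one option is to lift the inner fields to the top level instead. A minimal sketch, assuming every record follows the shape printed earlier (`unpack_metadata` and the `chunk_id` key are hypothetical names):

```python
import json


def unpack_metadata(record):
    """Lift the nested 'metadata' dict to the top level and keep the chunk id."""
    flat = {"chunk_id": record.get("id")}
    for key, value in record.get("metadata", {}).items():
        if isinstance(value, (dict, list)):
            # Any remaining container is still serialized to a JSON string
            flat[key] = json.dumps(value)
        else:
            flat[key] = value
    return flat


# Record copied from the sample output printed by the loading cell
sample_record = {
    "id": "14069121_0",
    "metadata": {
        "chunk_index": 0,
        "company": "CITIBANK, N.A.",
        "complaint_id": "14069121",
        "product": "Credit card",
        "state": "TX",
        "total_chunks": 1,
    },
}
print(unpack_metadata(sample_record))
```

With this shape, each field is stored as its own scalar value rather than inside a JSON string.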


In [5]:
collection = build_chroma_from_parquet(
    embeddings=embeddings,
    texts=texts,
    metadatas=metadatas_flat,       # use flattened metadata
    persist_dir="../vector_store",
    collection_name="complaints_full",
    batch_size=1000
)


Added 0 → 1000
Added 1000 → 2000
Added 2000 → 3000
Added 3000 → 4000
Added 4000 → 5000
Added 5000 → 6000
Added 6000 → 7000
Added 7000 → 8000
Added 8000 → 9000
Added 9000 → 10000
Added 10000 → 11000
✅ ChromaDB collection persisted to disk
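
`build_chroma_from_parquet` is a project helper that is not shown here. The log above suggests it inserts records in slices of `batch_size`. A sketch of that batching loop follows, assuming `collection` is any object exposing an `add` method with the keyword arguments that a ChromaDB collection accepts (`ids`, `embeddings`, `documents`, `metadatas`). The function `add_in_batches` and the stand-in class `RecordingCollection` are hypothetical and exist only so the example runs without a database.

```python
def add_in_batches(collection, ids, embeddings, texts, metadatas, batch_size=1000):
    """Insert records into `collection` one slice at a time."""
    total = len(ids)
    for start in range(0, total, batch_size):
        end = min(start + batch_size, total)
        collection.add(
            ids=ids[start:end],
            embeddings=embeddings[start:end],
            documents=texts[start:end],
            metadatas=metadatas[start:end],
        )
        print(f"Added {start} -> {end}")
    return total


class RecordingCollection:
    """Stand-in for a vector store collection; it only records batch sizes."""

    def __init__(self):
        self.batch_sizes = []

    def add(self, ids, embeddings, documents, metadatas):
        self.batch_sizes.append(len(ids))


# Placeholder data: 2,500 records with 2-dimensional dummy embeddings
n_records = 2500
demo_collection = RecordingCollection()
added = add_in_batches(
    demo_collection,
    ids=[f"chunk_{i}" for i in range(n_records)],
    embeddings=[[0.0, 1.0]] * n_records,
    texts=["placeholder text"] * n_records,
    metadatas=[{"source": "demo"}] * n_records,
    batch_size=1000,
)
```

Batching keeps each insert call small, which helps when the full set (about 1.4 million chunks here) would be too large to send in one request. With a real store, the collection would typically come from a ChromaDB persistent client pointed at the `persist_dir` used above.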
