- Step 1: Importing libraries

In [1]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
from langchain.vectorstores import Chroma

print("LangChain imports work! ✅")


LangChain imports work! ✅


- Step 2: Sample the Dataset

In [5]:
import pandas as pd

# Load cleaned dataset
df = pd.read_csv(r"C:\Users\bezis\Downloads\rag-complaint-chatbot-1\data\filtered_complaints.csv")

# Stratified sampling by product
df_sample = df.groupby('product', group_keys=False).apply(
    lambda x: x.sample(frac=0.1, random_state=42)  # 10% sample
)

print(df_sample['product'].value_counts())



product
Credit reporting or other personal consumer reports        1422
Debt collection                                              51
Credit card                                                  11
Checking or savings account                                   6
Money transfer, virtual currency, or money service            4
Mortgage                                                      2
Vehicle loan or lease                                         2
Payday loan, title loan, personal loan, or advance loan       1
Name: count, dtype: int64




In [3]:
print(df.columns)

Index(['date_received', 'product', 'sub-product', 'issue', 'sub-issue',
       'consumer_complaint_narrative', 'company_public_response', 'company',
       'state', 'zip_code', 'tags', 'consumer_consent_provided?',
       'submitted_via', 'date_sent_to_company', 'company_response_to_consumer',
       'timely_response?', 'consumer_disputed?', 'complaint_id'],
      dtype='object')
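
The stratified sampling above can also be written with `DataFrameGroupBy.sample` (available in pandas 1.1 and later), which avoids calling `apply` with a lambda. The sketch below uses a small made-up table, not the real complaints file, to show one consequence of sampling 10% per group: the per-group size is rounded, so a product with very few rows can vanish from the sample.

```python
import pandas as pd

# Toy stand-in for the complaints table (hypothetical data, not the real dataset)
toy = pd.DataFrame({
    "product": ["Credit card"] * 20 + ["Mortgage"] * 10 + ["Vehicle loan or lease"] * 4,
    "complaint_id": range(34),
})

# Draw 10% of each product group; random_state makes the draw repeatable
toy_sample = toy.groupby("product").sample(frac=0.1, random_state=42)

# Count how many rows each product kept
counts = toy_sample["product"].value_counts()
print(counts)
```

Here the 20-row and 10-row groups keep 2 and 1 rows, while the 4-row group keeps none because 0.4 rounds down to 0. If every product must appear, one option is to sample with a minimum per group instead of a pure fraction.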


- Step 3: Chunk the Complaint Texts

In [7]:
all_chunks = []

# NOTE: `text_splitter` must already exist when this cell runs; it is created in the Step 5 cell.
for text in df_sample['consumer_complaint_narrative']:
    if pd.isna(text):       # Skip missing (NaN) narratives
        continue
    text = str(text)        # Ensure it's a string
    chunks = text_splitter.split_text(text)
    all_chunks.extend(chunks)

print(f"Total chunks created: {len(all_chunks)}")
print("Example chunk:", all_chunks[0])



Total chunks created: 5
Example chunk: Got locked out of my account because I was trying to link my bank account to my XXXX and called customary service to get my account unlocked. Got told that I needed an account number and or debit card number which I dont have since I cant access my account and the delivery date for card hadnt arrived. So I suggested using my social security number as verification which they agreed. I was told to provide my secret word which I have no recollection of being told to provide at all. I was told that


- Step 4: Generate Embeddings

In [10]:
# Imports
from sentence_transformers import SentenceTransformer
import os

# Local path where the model is cached between runs
local_model_path = r"C:/Users/bezis/models/all-MiniLM-L6-v2"

# Load the model from disk if present, otherwise download it
try:
    if os.path.exists(local_model_path):
        print("Loading model from local path...")
        model = SentenceTransformer(local_model_path)
    else:
        print("Downloading model from Hugging Face...")
        model = SentenceTransformer('all-MiniLM-L6-v2')
        # Save locally so later runs can skip the download
        model.save(local_model_path)
except Exception as e:
    print("Failed to load model:", e)
    print("Tip: Check your internet connection or download manually from:")
    print("https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2")
    raise  # later cells need `model`, so do not continue silently


Downloading model from Hugging Face...


Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`



- Step 5: Chunk the Text

In [11]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# 5.1: Create a text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,        # maximum chunk length, in characters (not words)
    chunk_overlap=50       # up to 50 characters shared between consecutive chunks
)

# 5.2: Apply the text splitter to the complaints
all_chunks = []
for text in df_sample['consumer_complaint_narrative']:
    if pd.notna(text):  # skip missing (NaN) narratives
        chunks = text_splitter.split_text(text)
        all_chunks.extend(chunks)

print(f"Total chunks created: {len(all_chunks)}")


Total chunks created: 8
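
To make the `chunk_size` and `chunk_overlap` units concrete, here is a simplified sliding-window splitter written in plain Python. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that class first tries to break on separators such as paragraphs and spaces); the function name `sliding_chunks` and the demo text are made up for illustration. It only shows that both settings count characters.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the previous one.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

# Hypothetical 700-character text made of repeating letters
demo_text = "".join(chr(97 + i % 26) for i in range(700))
demo_chunks = sliding_chunks(demo_text, chunk_size=300, chunk_overlap=50)
print([len(c) for c in demo_chunks])
```

With these settings the 700-character text yields chunks of 300, 300, and 200 characters, and the last 50 characters of each chunk repeat at the start of the next one.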


- Step 6: Convert chunks to embeddings

In [13]:
from tqdm import tqdm  # progress bar

# 6.1: Make sure all chunks are strings
all_chunks = [str(chunk) for chunk in all_chunks]

# 6.2: Create embeddings
embeddings = []
for chunk in tqdm(all_chunks, desc="Embedding text chunks"):
    vector = model.encode(chunk)
    embeddings.append(vector)

print("Embeddings created for all chunks.")


Embedding text chunks: 100%|██████████| 8/8 [00:00<00:00, 16.33it/s]

Embeddings created for all chunks.





- Step 7: Store vectors in a vector database (FAISS)

In [15]:
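import os
import pickle

import faiss
import numpy as np

# Assumption: the cell that built `index` is not shown in this export.
# A flat (exact) L2 index over the embeddings is used here as a stand-in.
vectors = np.asarray(embeddings, dtype="float32")
index = faiss.IndexFlatL2(vectors.shape[1])
index.add(vectors)

# Create folder if it doesn't exist
os.makedirs("vector_store", exist_ok=True)

# Save the FAISS index
faiss.write_index(index, "vector_store/faiss_index.bin")

# Save the chunk texts so search results can be mapped back to text
with open("vector_store/chunks.pkl", "wb") as f:
    pickle.dump(all_chunks, f)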

print("Vector store saved successfully in 'vector_store/'")


Vector store saved successfully in 'vector_store/'


Task 2: Text Chunking, Embedding, and Vector Store Indexing – Report Section

In this step, we prepared the complaint narratives for semantic search, which allows us to find complaints with similar content quickly.

1. Sampling Strategy:
We drew a 10% stratified sample of the cleaned dataset (1,499 complaints), sampling within each product type so that the sample keeps roughly the same product proportions as the full data. Most sampled complaints were about Credit reporting or other personal consumer reports (1,422), followed by Debt collection (51), with far smaller counts for the remaining categories such as Credit card (11), Checking or savings account (6), and Mortgage (2). Because each group is sampled at 10%, a product with very few complaints can end up with no rows in the sample.

2. Chunking Approach:
Because some complaints are very long, we split each narrative into smaller text chunks with RecursiveCharacterTextSplitter (chunk_size=300, chunk_overlap=50, both measured in characters). Shorter chunks give each embedding a narrower topic, and the overlap carries some context across chunk boundaries. For example, one chunk begins:

"Got locked out of my account because I was trying to link my bank account to my XXXX and called customer service to get my account unlocked..."

3. Embedding Model Choice:
We used the SentenceTransformer model all-MiniLM-L6-v2 to convert each chunk into a 384-dimensional vector (embedding). This model is small, fast, and accurate enough for semantic search tasks. Once embedded, the vectors were written to a FAISS index and the chunk texts were pickled alongside it. The code shown saves only the chunk texts; storing metadata such as the complaint ID and product type requires carrying those columns through the chunking loop.
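
One way to keep complaint ID and product type attached to each chunk is to build a metadata list in the same order as the chunk list, so that position `i` in the index, the chunk list, and the metadata list all refer to the same chunk. The sketch below is a minimal illustration under assumptions: `chunk_with_metadata`, the sample rows, and the word-level stand-in splitter are hypothetical, and only the column names come from the dataset.

```python
def chunk_with_metadata(rows, split_fn):
    """Return (chunks, metadata) lists that share the same positions."""
    chunks, metadata = [], []
    for row in rows:
        narrative = row.get("consumer_complaint_narrative")
        if not isinstance(narrative, str) or not narrative.strip():
            continue  # skip missing or blank narratives
        for i, chunk in enumerate(split_fn(narrative)):
            chunks.append(chunk)
            metadata.append({
                "complaint_id": row["complaint_id"],
                "product": row["product"],
                "chunk_index": i,  # position of the chunk within its complaint
            })
    return chunks, metadata

# Hypothetical rows; with the real data, df_sample.to_dict("records") yields this shape
rows = [
    {"complaint_id": 1, "product": "Credit card", "consumer_complaint_narrative": "aaaa bbbb"},
    {"complaint_id": 2, "product": "Mortgage", "consumer_complaint_narrative": None},
]

# Stand-in splitter: one chunk per word (text_splitter.split_text would go here)
chunks, metadata = chunk_with_metadata(rows, lambda text: text.split())
print(list(zip(chunks, metadata)))
```

Pickling `metadata` next to the chunk texts would then let a search hit be traced back to its complaint and product.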

This process allows us to later search and retrieve complaints based on meaning, rather than just keywords, which is much more powerful for analysis or building a chatbot.
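
Searching by meaning comes down to finding the stored vectors closest to the query vector. The NumPy sketch below does this by brute force on made-up 2-D vectors; `nearest_chunks` is a hypothetical helper, not part of FAISS. A flat FAISS index performs the same kind of exhaustive comparison, although `IndexFlatL2` reports squared distances.

```python
import numpy as np

def nearest_chunks(query_vec, chunk_vecs, k=3):
    """Return indices and L2 distances of the k chunk vectors closest to query_vec."""
    dists = np.linalg.norm(chunk_vecs - query_vec, axis=1)
    order = np.argsort(dists)[:k]
    return order, dists[order]

# Hypothetical 2-D vectors standing in for 384-dimensional embeddings
chunk_vecs = np.array([[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]], dtype="float32")
query_vec = np.array([1.0, 0.0], dtype="float32")

idx, dist = nearest_chunks(query_vec, chunk_vecs, k=2)
print(idx, dist)
```

In the real pipeline the query vector would come from `model.encode` on the user's question, and the returned indices would be used to look up the matching chunk texts.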