<a href="https://colab.research.google.com/github/lhiwi/complaint-rag-chatbot/blob/task2/notebooks/chunk_embed.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


chunk_embed

In [5]:
!pip install langchain openai faiss-cpu sentence_transformers


In [1]:
import os, json
import pandas as pd, numpy as np
from tqdm.auto import tqdm
from sentence_transformers import SentenceTransformer
from langchain.text_splitter import RecursiveCharacterTextSplitter
import faiss


In [2]:
# ensure vector_store exists
os.makedirs('vector_store', exist_ok=True)

In [4]:
# load data
INPUT_CSV = '/content/drive/MyDrive/data/filtered_complaints.csv'
df = pd.read_csv(INPUT_CSV)
print(f"Loaded {len(df)} cleaned complaints.")

Loaded 82164 cleaned complaints.


Chunk the Text

In [5]:
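# Split each narrative into chunks of at most 1000 characters, with 200 characters of overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = df['clean_narrative'].tolist()
# One metadata dict per complaint; the splitter copies it onto every chunk of that complaint
metas = [
    {'complaint_idx': int(i), 'product': p}
    for i, p in zip(df.index, df['Product'])
]
docs = splitter.create_documents(texts, metadatas=metas)
print(f"Created {len(docs)} chunks.")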


Created 142491 chunks.
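
To make the effect of chunk_size=1000 and chunk_overlap=200 concrete, here is a minimal pure-Python sketch of fixed-size chunking with overlap. It is a simplified stand-in, not the algorithm RecursiveCharacterTextSplitter uses: the real splitter prefers to break at separators such as paragraph and line boundaries, so its chunks are often shorter than the limit. The helper name sliding_chunks is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap (simplified illustration)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each new window starts 800 characters after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# A synthetic 2500-character "narrative" made of repeating digits
sample = "".join(str(i % 10) for i in range(2500))
pieces = sliding_chunks(sample)
print([len(p) for p in pieces])  # [1000, 1000, 900]
```

With these settings, the last 200 characters of one window are repeated at the start of the next, which is why 82164 complaints can yield noticeably more chunks than a non-overlapping split would.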


Embed Chunks

In [6]:
# all-MiniLM-L6-v2 maps each text to a 384-dimensional vector
model = SentenceTransformer('all-MiniLM-L6-v2')
chunk_texts = [d.page_content for d in docs]

# embeddings has shape (n_chunks, dim); row i corresponds to docs[i]
embeddings = model.encode(chunk_texts, show_progress_bar=True, convert_to_numpy=True)
print("Embeddings shape:", embeddings.shape)


Embeddings shape: (142491, 384)
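
As a rough sanity check on memory use, the shape printed above can be turned into a size estimate. This sketch assumes the embeddings are stored as float32 (4 bytes per value); that dtype is an assumption and is not shown in the output above.

```python
n_chunks, dim = 142491, 384   # shape reported by the cell above
bytes_per_value = 4           # assumption: float32 embeddings

total_bytes = n_chunks * dim * bytes_per_value
total_mib = total_bytes / 1024**2
print(f"{total_bytes:,} bytes = {total_mib:.1f} MiB")  # 218,866,176 bytes = 208.7 MiB
```

Under that assumption the embedding matrix fits comfortably in memory on a standard Colab runtime, and a flat index that stores the raw vectors should be of a similar size on disk.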


Build & Save FAISS Index

In [7]:
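dim = embeddings.shape[1]
# IndexFlatL2 runs an exact (brute-force) search using squared L2 distance
index = faiss.IndexFlatL2(dim)
# FAISS expects a C-contiguous float32 array
index.add(np.ascontiguousarray(embeddings, dtype=np.float32))
faiss.write_index(index, 'vector_store/faiss_index.bin')
print("FAISS index saved to vector_store/faiss_index.bin")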


FAISS index saved to vector_store/faiss_index.bin
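
For intuition about what a flat L2 index computes at query time, the following NumPy sketch performs the same kind of exhaustive nearest-neighbour search on a toy database. It illustrates the technique only and is not FAISS's implementation; the function name flat_l2_search and the toy vectors are made up for this example. Like IndexFlatL2, it reports squared L2 distances.

```python
import numpy as np


def flat_l2_search(xb: np.ndarray, xq: np.ndarray, k: int):
    """Exhaustive k-nearest-neighbour search by squared L2 distance (illustrative only)."""
    # ||q - b||^2 = ||q||^2 - 2 q.b + ||b||^2, computed for every query/database pair
    d2 = (xq ** 2).sum(axis=1, keepdims=True) - 2.0 * (xq @ xb.T) + (xb ** 2).sum(axis=1)
    d2 = np.maximum(d2, 0.0)  # clamp tiny negative values caused by rounding
    ids = np.argsort(d2, axis=1, kind="stable")[:, :k]
    dists = np.take_along_axis(d2, ids, axis=1)
    return dists, ids


# Toy database of three 2-D vectors and a single query
xb = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]], dtype=np.float32)
xq = np.array([[0.9, 0.0]], dtype=np.float32)

dists, ids = flat_l2_search(xb, xq, k=2)
print(ids)    # row indices of the two closest database vectors
print(dists)  # their squared L2 distances
```

Because every query is compared against every stored vector, the cost of this approach grows linearly with the number of chunks. That is usually acceptable at this scale, and it keeps results exact.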


In [8]:
# meta_list[i] describes the vector stored at row i of the FAISS index
meta_list = [d.metadata for d in docs]
with open('vector_store/metadata.json', 'w', encoding='utf-8') as f:
    json.dump(meta_list, f, ensure_ascii=False, indent=2)
print("Metadata saved to vector_store/metadata.json")


Metadata saved to vector_store/metadata.json
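
Since the metadata list is written in the same order as the vectors were added, a row id returned by a search can be used directly as a list index. The sketch below shows that round trip with the standard library only. The three sample records, the product names, and the helper rows_to_metadata are hypothetical; the id -1 stands for the placeholder FAISS returns when fewer than k neighbours are available.

```python
import json
import os
import tempfile

# Hypothetical records in the same shape as the notebook's metadata
meta_list = [
    {"complaint_idx": 0, "product": "Credit card"},
    {"complaint_idx": 0, "product": "Credit card"},
    {"complaint_idx": 1, "product": "Personal loan"},
]


def rows_to_metadata(row_ids, metadata):
    """Map index row ids to metadata records, skipping the -1 placeholder."""
    return [metadata[i] for i in row_ids if i >= 0]


# Write and reload the metadata the same way the cell above writes it
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "metadata.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(meta_list, f, ensure_ascii=False, indent=2)
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

hits = rows_to_metadata([2, 0, -1], loaded)
print(hits)
```

Note that the saved records hold only complaint_idx and product, not the chunk text, so a retriever built on these files would need to recover the text from another source, for example by keeping page_content alongside the metadata.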
