# Build and Save FAISS Index

## A system that stores and retrieves ads based on how similar they are to a given prompt or context:
- A way to convert text into numbers -> embeddings
- A fast way to search through these numbers and find similar ones -> FAISS

## Installation & Import

In [4]:
!pip install faiss-cpu
!pip install sentence-transformers



In [5]:
import os
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
import json
import argparse
import pandas as pd

## Load the dataset from Github

In [7]:
url = "https://raw.githubusercontent.com/m1chae11u/llm-ad-integration/main/generated_ad_dataset.csv"
df = pd.read_csv(url)
print(df.head())

   ad_id                      domain     category                  product  \
0      1  Consumer Goods and Retails  Electronics          EchoAir Earbuds   
1      2  Consumer Goods and Retails  Electronics  QuantumPulse Smartwatch   
2      3  Consumer Goods and Retails  Electronics          VoltCharger Pro   
3      4  Consumer Goods and Retails  Electronics       ApertureAce Camera   
4      5  Consumer Goods and Retails  Electronics        SonicEcho Speaker   

                                        ad_key_words  \
0  ['wireless', 'noise-cancelling', 'high-fidelit...   
1  ['Smartwatch', 'Health Monitor', 'Connectivity...   
2  ['Fast Charging', 'Portable', 'Universal Compa...   
3  ['Photography', 'High-Resolution', 'Versatile'...   
4  ['wireless', 'high-fidelity', 'portable', 'Blu...   

                                      ad_description  \
0  Experience sound like never before with EchoAi...   
1  Introducing the QuantumPulse Smartwatch - a pe...   
2  Experience lightning-fa

## Text Processing

Combine the `product`, `ad_description`, and `ad_benefits` columns into a single unstructured string per ad that can be passed to an embedding model.

In [10]:
import ast

# Combine relevant fields into a single string per ad
def combine_fields(row):
    # ad_benefits is a stringified list; literal_eval parses it without executing arbitrary code
    benefits = ast.literal_eval(row['ad_benefits']) if isinstance(row['ad_benefits'], str) else []
    return f"{row['product']}. {row['ad_description']} {' '.join(benefits)}"

ad_texts = df.apply(combine_fields, axis=1).tolist()

# Preview a few
ad_texts[:3]

['EchoAir Earbuds. Experience sound like never before with EchoAir Earbuds, combining cutting-edge technology with unmatched comfort. Wireless Bluetooth connectivity for seamless audio streaming. Advanced noise-cancelling technology for immersive listening. High-fidelity sound quality for crystal clear audio. Ergonomic design ensures all-day comfort. Long-lasting battery life for extended playtime.',
 'QuantumPulse Smartwatch. Introducing the QuantumPulse Smartwatch - a perfect blend of technology and style. Monitor your health with precision sensors. Stay connected with seamless notifications. Customizable watch faces to match your mood. Long-lasting battery life for all-day use.',
 'VoltCharger Pro. Experience lightning-fast charging with the VoltCharger Pro, designed to power up all your devices with ease. High-speed charging for multiple devices. Compact and travel-friendly design. Compatible with a wide range of devices. Smart technology prevents overcharging. Durable and long-las
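
The `ad_benefits` column arrives as a string that looks like a Python list. Below is a minimal sketch of a defensive parser, assuming that format holds for most rows. `parse_list_field` is a hypothetical helper name, not part of the notebook.

```python
import ast

def parse_list_field(value):
    """Parse a stringified Python list such as "['a', 'b']" into a list of strings.

    Returns an empty list for missing or malformed values instead of raising.
    """
    if not isinstance(value, str):
        return []  # e.g. NaN from pandas for a missing cell
    try:
        parsed = ast.literal_eval(value)  # evaluates literals only, never arbitrary code
    except (ValueError, SyntaxError):
        return []
    if not isinstance(parsed, (list, tuple)):
        return []
    return [str(item) for item in parsed]
```

With a helper like this, one malformed row would produce an ad text without benefits rather than aborting the whole `df.apply` call.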

## Embedding Ad Texts using Sentence Transformers

Use a pretrained sentence embedding model (all-MiniLM-L6-v2) to convert the ad texts into dense vector representations, i.e. embeddings.

These embeddings capture the semantic meaning of each ad in a numerical format that we can use later for similarity search, retrieval, or clustering.

In [13]:
# Load model (small and fast; produces 384-dimensional embeddings)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Embed the combined ad texts
embeddings = model.encode(ad_texts, show_progress_bar=True, convert_to_numpy=True, normalize_embeddings=True)
print(f"Generated {len(embeddings)} embeddings with shape {embeddings.shape}")

Generated 4251 embeddings with shape (4251, 384)
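
Passing `normalize_embeddings=True` asks the model to return unit-length vectors. As a rough illustration of what that means, here is a manual L2 normalization on a toy matrix standing in for the real `(4251, 384)` array. `l2_normalize` is a hypothetical helper written for this note, not the library's internal code.

```python
import numpy as np

def l2_normalize(vectors, eps=1e-12):
    """Scale each row to unit L2 norm; eps guards against division by zero."""
    vectors = np.asarray(vectors, dtype=np.float32)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)

# Toy stand-in for the embedding matrix
toy = np.array([[3.0, 4.0], [0.0, 2.0]])
unit = l2_normalize(toy)

# Every row should now have a norm of (approximately) 1
print(np.linalg.norm(unit, axis=1))
```

The same norm check can be run on `embeddings` before indexing to confirm the vectors are ready for inner-product search.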


## Build and Save FAISS Index

In this step, we:
- Build the FAISS index using `IndexFlatIP`.
  - This creates a vector search index that retrieves the most similar embeddings by inner product, which equals cosine similarity when the vectors are normalized.
- Add our ad embeddings to that index.
- Save the index to disk with `faiss.write_index`.
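
Why the inner product behaves as cosine similarity here, written out for a query vector $q$ and an ad vector $a$ (symbols introduced only for this note):

$$
\cos(q, a) = \frac{q \cdot a}{\lVert q \rVert \, \lVert a \rVert},
\qquad
\lVert q \rVert = \lVert a \rVert = 1 \;\Rightarrow\; \cos(q, a) = q \cdot a
$$

The equality relies on both sides being unit length, so a query embedding should be normalized the same way as the ad embeddings before searching.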

In [16]:
# Build FAISS index with cosine similarity (normalized vectors + IP)
dimension = embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)
index.add(embeddings)

# Save index to file
faiss.write_index(index, "faiss_ad_index.index")
print("Saved index to 'faiss_ad_index.index'")

Saved index to 'faiss_ad_index.index'
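
`IndexFlatIP` performs an exact, exhaustive search: the query is scored against every stored vector and the highest scores are kept. The NumPy sketch below shows the same computation on toy data, for intuition only. `brute_force_top_k` is a hypothetical name, and FAISS's own implementation is considerably more optimized.

```python
import numpy as np

def brute_force_top_k(query, vectors, k):
    """Return (scores, ids) of the k rows of `vectors` with the largest inner product with `query`."""
    scores = vectors @ query            # one inner product per stored vector
    top_ids = np.argsort(-scores)[:k]   # indices sorted by descending score
    return scores[top_ids], top_ids

# Three unit-length toy vectors and a unit-length query
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype=np.float32)
query = np.array([0.0, 1.0], dtype=np.float32)

scores, ids = brute_force_top_k(query, vectors, k=2)
print(ids, scores)
```

Row 1 is identical to the query and ranks first. Row 2 points in a similar direction and ranks second.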


## Saving Text Metadata for Retrieval

Save the original ad texts (the ones that were embedded) to a JSON file so that:
- When we retrieve vectors from the FAISS index, we can map the results back to the actual ad content. The index returns row positions, and `ad_texts[i]` is the text behind row `i`.
- This acts as a lookup table between numeric vectors and their corresponding human-readable text.

In [19]:
# Save the embedding texts (used for retrieval reference)
with open("ad_texts.json", "w", encoding="utf-8") as f:
    json.dump(ad_texts, f, indent=2, ensure_ascii=False)

print("Saved ad texts to 'ad_texts.json'")

Saved ad texts to 'ad_texts.json'
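
To show how the saved index and the saved texts fit together at query time, here is a minimal retrieval sketch. `retrieve_ads` is a hypothetical helper, and the example query string in the comments is made up. It assumes the index ids are the row positions in which the embeddings were added, which is how a flat index built as above behaves.

```python
import numpy as np

def retrieve_ads(query_vec, index, texts, k=3):
    """Map FAISS search results back to ad texts.

    `index` is any object with a FAISS-style `search(x, k)` method returning
    (scores, ids) arrays of shape (n_queries, k); `texts` is the saved text list.
    """
    query = np.asarray(query_vec, dtype=np.float32).reshape(1, -1)
    scores, ids = index.search(query, k)
    # FAISS pads with id -1 when fewer than k results are available
    return [(float(s), texts[i]) for s, i in zip(scores[0], ids[0]) if i != -1]

# Intended use in this notebook (assumes the two files written above exist):
#   index = faiss.read_index("faiss_ad_index.index")
#   with open("ad_texts.json", encoding="utf-8") as f:
#       texts = json.load(f)
#   query_vec = model.encode(["wireless earbuds for running"], normalize_embeddings=True)
#   for score, text in retrieve_ads(query_vec, index, texts):
#       print(f"{score:.3f}  {text[:80]}")
```

Keeping the function independent of how `index` and `texts` were loaded makes it easy to try on a small in-memory index before pointing it at the saved files.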
