# 🧠 E5 Embeddings & XGBoost
- Tasks:
    - Preprocessed item descriptions
    - Generated and stored embeddings in ChromaDB
    - Trained XGBoost on embeddings, pushed to HF Hub
- 🧑‍💻 Skill Level: Advanced
- ⚙️ Hardware: ⚠️ GPU required for embeddings (400K items) - use Google Colab
- 🛠️ Requirements: 🔑 Hugging Face Token — must be set in Google Colab secrets

Embeddings are stored and queried via ChromaDB — no LangChain is used for creation or retrieval.

---
📝 **Note:** This notebook is part of a series. Check out the full set [here](https://github.com/lisekarimi/lexo).


In [None]:
!pip install -q tqdm huggingface_hub numpy sentence-transformers chromadb xgboost datasets==2.21.0

In [None]:
# Standard library imports
import os

# Third-party imports
import chromadb
import joblib
import numpy as np
from datasets import load_dataset
from google.colab import userdata
from huggingface_hub import HfApi, login
from sentence_transformers import SentenceTransformer
from tqdm import tqdm
from xgboost import XGBRegressor

In [None]:
# Mount Google Drive so the model and ChromaDB persist across Colab sessions

from google.colab import drive  # type: ignore

drive.mount("/content/drive")

In [None]:
# Load the Hugging Face token from Colab's secret storage
hf_token = userdata.get("HF_TOKEN")
login(hf_token, add_to_git_credential=True)

# Configuration
ROOT = "/content/drive/MyDrive/deal_finder"
CHROMA_PATH = f"{ROOT}/chroma"

## 📥 Load Dataset

In [None]:
HF_USER = "lisekarimi"  # 🔧 Replace with your Hugging Face username
DATASET_NAME = f"{HF_USER}/pricer-data"

dataset = load_dataset(DATASET_NAME)
train = dataset["train"]
test = dataset["test"]

In [None]:
print(train[0]["text"])

In [None]:
print(train[0]["price"])

## 📦 Embed + Save Training Data to Chroma
- No LangChain used.
- We use `intfloat/e5-small-v2` for embeddings:
    - Fast, high-quality, retrieval-tuned
    - **Requires a `passage: ` prefix on stored documents** (and a `query: ` prefix on search text at retrieval time)
- We embed item descriptions and store them in ChromaDB, with the price saved as metadata.
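
E5 models expect a role prefix on every input: `passage: ` for stored documents and `query: ` for search text. Below is a minimal sketch of two helpers that keep the prefixes consistent. The names `as_passage` and `as_query` are hypothetical; they are not part of this notebook or of `sentence-transformers`.

```python
PASSAGE_PREFIX = "passage: "
QUERY_PREFIX = "query: "


def as_passage(text: str) -> str:
    """Prefix a document that will be stored in the vector index."""
    return f"{PASSAGE_PREFIX}{text.strip()}"


def as_query(text: str) -> str:
    """Prefix a search string that will be compared against stored passages."""
    return f"{QUERY_PREFIX}{text.strip()}"


# Made-up strings, only to show the two roles side by side
print(as_passage("Cordless drill with two batteries"))
print(as_query("cordless drill"))
```

Keeping both prefixes in one place makes it harder to embed a query with the document prefix by mistake.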

In [None]:
# Load embedding model
model_embedding = SentenceTransformer("intfloat/e5-small-v2", device="cuda")

In [None]:
# Init Chroma
client = chromadb.PersistentClient(path=CHROMA_PATH)
collection = client.get_or_create_collection(name="price_items")

In [None]:
# Build the text to embed: strip the prompt question and the price suffix
def description(item):
    """Return the item description with the E5 'passage: ' prefix and no price."""
    text = item["text"].replace(
        "How much does this cost to the nearest dollar?\n\n", ""
    )
    text = text.split("\n\nPrice is $")[0]
    return f"passage: {text}"


description(train[0])
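
To make the cleaning step concrete, here is the same logic applied to a made-up record. The sample text is hypothetical and only assumes the prompt layout that `description` already expects (a question line, the item text, then a `Price is $` suffix). The function is repeated so the snippet runs on its own.

```python
PROMPT = "How much does this cost to the nearest dollar?\n\n"


def description(item):
    # Same logic as the cell above, repeated so the example is self-contained
    text = item["text"].replace(PROMPT, "")
    text = text.split("\n\nPrice is $")[0]
    return f"passage: {text}"


# Hypothetical record in the assumed prompt format (not taken from the dataset)
sample = {
    "text": PROMPT + "Stainless steel kettle, 1.7 L\n\nPrice is $35.00",
    "price": 35.0,
}

cleaned = description(sample)
print(cleaned)
```

The price never reaches the embedded text, so the regressor cannot read the target from its own input.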

In [None]:
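batch_size = 300  # how many items to encode and insert into Chroma per iteration
# Upper bound on items per GPU forward pass; each encode() call below receives
# at most batch_size items, so this cap only matters if batch_size exceeds it
encode_batch_size = 1024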

for i in tqdm(range(0, len(train), batch_size), desc="Processing batches"):
    end_idx = min(i + batch_size, len(train))

    # Collect documents and metadata
    documents = [description(train[j]) for j in range(i, end_idx)]
    metadatas = [{"price": train[j]["price"]} for j in range(i, end_idx)]
    ids = [f"doc_{j}" for j in range(i, end_idx)]

    # GPU batch encoding
    vectors = model_embedding.encode(
        documents,
        batch_size=encode_batch_size,
        show_progress_bar=False,
        normalize_embeddings=True,
    ).tolist()

    # Insert into Chroma
    collection.add(
        ids=ids, documents=documents, embeddings=vectors, metadatas=metadatas
    )

print("✅ Embedding and storage to ChromaDB completed.")
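
The loop above walks the dataset in fixed-size windows, and `min(i + batch_size, len(train))` keeps the final, shorter window in range. The index arithmetic can be sketched on its own as follows; `batch_bounds` is a hypothetical helper, not something the notebook defines.

```python
from typing import Iterator, Tuple


def batch_bounds(n_items: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) index pairs that cover range(n_items) in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_items, batch_size):
        # Clamp the end so the last batch stops at n_items
        yield start, min(start + batch_size, n_items)


bounds = list(batch_bounds(7, 3))
print(bounds)  # [(0, 3), (3, 6), (6, 7)]
```

Every index appears in exactly one window, which is why the `doc_{j}` IDs built from those indices stay unique across batches.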

In [None]:
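# PersistentClient writes to disk as items are added, so no explicit save is needed.
# Drop the client reference and run garbage collection to free memory.
import gc

print("🧹 Releasing the ChromaDB client...")
client = None
gc.collect()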

Our ChromaDB is currently persisted to a Google Drive folder (`CHROMA_PATH`). For a production-ready app, we recommend moving it to object storage such as AWS S3 for better reliability and scalability.

## 📈 Embedding-Based Regression with XGBoost

In [None]:
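# Step 1: Load vectors and prices from Chroma
# Re-open the persisted collection so this cell also works after a runtime restart
client = chromadb.PersistentClient(path=CHROMA_PATH)
collection = client.get_collection(name="price_items")
result = collection.get(include=["embeddings", "documents", "metadatas"])
vectors = np.array(result["embeddings"])
documents = result["documents"]
prices = np.array([meta["price"] for meta in result["metadatas"]])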
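
Because the items were encoded with `normalize_embeddings=True`, each row loaded from Chroma should have an L2 norm close to 1. A quick sanity check is sketched here on synthetic data so it runs without the database. The helper name `is_unit_norm` and the tolerance are assumptions for illustration.

```python
import numpy as np


def is_unit_norm(vectors: np.ndarray, atol: float = 1e-3) -> bool:
    """Return True if every row has an L2 norm within atol of 1."""
    norms = np.linalg.norm(vectors, axis=1)
    return bool(np.all(np.abs(norms - 1.0) <= atol))


# Synthetic stand-in for the embeddings (384 columns, the e5-small-v2 output size)
rng = np.random.default_rng(0)
raw = rng.normal(size=(5, 384))
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)

print(is_unit_norm(unit))  # rows were scaled to unit length
print(is_unit_norm(raw))   # unscaled rows are far from unit length
```

On the real data the call would be `is_unit_norm(vectors)`; a failure would suggest the vectors were stored without normalization.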

In [None]:
# Step 2: Train XGBoost model
xgb_model = XGBRegressor(n_estimators=100, random_state=42, n_jobs=-1, verbosity=0)
xgb_model.fit(vectors, prices)
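
The model is fit on the full training set and no score is reported here. One way to gauge it, assuming test-set embeddings are produced the same way as the training ones, is to compare predictions with true prices using mean absolute error and root mean squared log error. The helper names below are hypothetical, and the numbers are toy values, not results from this model.

```python
import numpy as np


def mean_absolute_error(y_true, y_pred) -> float:
    """Average absolute difference between true and predicted prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def rmsle(y_true, y_pred) -> float:
    """Root mean squared log error; predictions are clipped at 0 before the log."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0.0, None)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


# Toy values for illustration only
toy_true = [10.0, 20.0, 30.0]
toy_pred = [12.0, 18.0, 33.0]
print(mean_absolute_error(toy_true, toy_pred))
print(rmsle(toy_true, toy_pred))
```

With the real model this could be called as `mean_absolute_error(test_prices, xgb_model.predict(test_vectors))`, where `test_vectors` and `test_prices` are assumed to come from embedding the `test` split with the same `description` formatting.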

In [None]:
# Step 3: Serialize the XGBoost model to Google Drive, ready for Hugging Face upload
MODEL_DIR = os.path.join(ROOT, "models")
MODEL_FILENAME = "xgboost_model.pkl"
LOCAL_MODEL = os.path.join(MODEL_DIR, MODEL_FILENAME)

os.makedirs(MODEL_DIR, exist_ok=True)
joblib.dump(xgb_model, LOCAL_MODEL)

In [None]:
# Step 4: Push serialized XGBoost model to Hugging Face Hub
api = HfApi(token=hf_token)
REPO_NAME = "smart-deal-finder-models"
REPO_ID = f"{HF_USER}/{REPO_NAME}"

# Create the model repo if it doesn't exist
api.create_repo(repo_id=REPO_ID, repo_type="model", private=True, exist_ok=True)

# Upload the saved model
api.upload_file(
    path_or_fileobj=LOCAL_MODEL,
    path_in_repo=MODEL_FILENAME,
    repo_id=REPO_ID,
    repo_type="model",
)