This notebook processes a dataset of product descriptions by encoding them into dense vector embeddings using a sentence transformer model. The resulting vectors, along with relevant metadata (e.g., price), are stored in a Pinecone index for efficient semantic search and retrieval.

Key Components:
* Data preprocessing and description extraction
* Batch encoding using a transformer model
* Metadata handling (category and price)
* Pinecone initialization and index creation
* Uploading vectors to the Pinecone index in batches

**Note:**  
Before running this notebook, set `PINECONE_API_KEY` (and `HF_TOKEN` for the Hugging Face login) in your environment variables or in a `.env` file, which `load_dotenv` reads.
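
A quick way to confirm the keys are present before running the later cells is sketched below. The helper name `missing_env_vars` and the `REQUIRED_VARS` tuple are illustrative assumptions, not part of the notebook; the variable names themselves are the ones this notebook reads.

```python
import os

# Assumed list of required keys, taken from the variables this notebook reads
REQUIRED_VARS = ("PINECONE_API_KEY", "HF_TOKEN")

def missing_env_vars(names, environ=os.environ):
    """Return the names that are unset or empty in the given environment."""
    return [name for name in names if not environ.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```

If you keep the keys in a `.env` file, run this after `load_dotenv(override=True)`, since that call is what copies them into `os.environ`.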


In [13]:
# imports

import os
import pickle

import numpy as np
from datasets import load_dataset
from dotenv import load_dotenv
from huggingface_hub import login
from pinecone import Pinecone, ServerlessSpec
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

from items import Item

In [None]:
# HF_USER = "ed-donner"
# DATASET_NAME = f"{HF_USER}/pricer-data"

# # Load the dataset
# dataset = load_dataset(DATASET_NAME)

# # Access train and test splits
# train = dataset["train"]
# test = dataset["test"]

# # Save to folders in your current directory
# train.save_to_disk("./train")
# test.save_to_disk("./test")


In [2]:
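load_dotenv(override=True)

# load_dotenv has already copied the .env values into os.environ.
# Fail early with a clear message if any expected key is missing.
for key in ('GROQ_API_KEY', 'HF_TOKEN', 'PINECONE_API_KEY'):
    if not os.getenv(key):
        raise EnvironmentError(f"{key} is not set; add it to your environment or .env file")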

In [3]:
# Log in to HuggingFace
hf_token = os.environ['HF_TOKEN']
login(hf_token, add_to_git_credential=True)

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


In [6]:
# With train.pkl in this folder, you can run this:

with open('train.pkl', 'rb') as file:
    train = pickle.load(file)


In [11]:
train[200].price

122.65

## Create a Pinecone Vector Datastore

In [14]:
# Load API key from environment
pinecone_api = os.environ['PINECONE_API_KEY']

# Create a Pinecone client instance
pc = Pinecone(api_key=pinecone_api)

# Index name
index_name = "products"

# Check if index exists and delete it if it does
if index_name in pc.list_indexes().names():
    pc.delete_index(index_name)
    print(f"Deleted existing index: {index_name}")

# Create the index with a serverless spec
pc.create_index(
    name=index_name,
    dimension=384,
    metric="cosine",
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )
)

# Connect to the index
index = pc.Index(index_name)


## Introducing the SentenceTransformer

all-MiniLM-L6-v2 is a sentence-transformers model hosted on HuggingFace that maps sentences and paragraphs to a 384-dimensional dense vector space, which makes it well suited to tasks like semantic search. Its output size is why the Pinecone index was created with `dimension=384`.

https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

It is small enough to run reasonably quickly on a local machine.
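
The index was created with `metric="cosine"`, so it ranks vectors by the cosine of the angle between them. The sketch below illustrates that measure in plain Python on toy 3-dimensional vectors standing in for the 384-dimensional embeddings. The function name `cosine_similarity` is an illustrative choice, and this is not how Pinecone computes scores internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score close to 1, regardless of length
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])

# Perpendicular vectors score 0
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])

print(same_direction, orthogonal)
```

Because the measure ignores vector length, two descriptions of very different sizes can still score as close matches if their embeddings point in a similar direction.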


In [15]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

In [16]:
def description(item):
    text = item.prompt.replace("How much does this cost to the nearest dollar?\n\n", "")
    return text.split("\n\nPrice is $")[0]

In [18]:
# Before passing into the description function
train[0].prompt

'How much does this cost to the nearest dollar?\n\nDelphi FG0166 Fuel Pump Module\nDelphi brings 80 years of OE Heritage into each Delphi pump, ensuring quality and fitment for each Delphi part. Part is validated, tested and matched to the right vehicle application Delphi brings 80 years of OE Heritage into each Delphi assembly, ensuring quality and fitment for each Delphi part Always be sure to check and clean fuel tank to avoid unnecessary returns Rigorous OE-testing ensures the pump can withstand extreme temperatures Brand Delphi, Fit Type Vehicle Specific Fit, Dimensions LxWxH 19.7 x 7.7 x 5.1 inches, Weight 2.2 Pounds, Auto Part Position Unknown, Operation Mode Mechanical, Manufacturer Delphi, Model FUEL PUMP, Dimensions 19.7\n\nPrice is $227.00'

In [19]:
# After passing into the description function
description(train[0])

'Delphi FG0166 Fuel Pump Module\nDelphi brings 80 years of OE Heritage into each Delphi pump, ensuring quality and fitment for each Delphi part. Part is validated, tested and matched to the right vehicle application Delphi brings 80 years of OE Heritage into each Delphi assembly, ensuring quality and fitment for each Delphi part Always be sure to check and clean fuel tank to avoid unnecessary returns Rigorous OE-testing ensures the pump can withstand extreme temperatures Brand Delphi, Fit Type Vehicle Specific Fit, Dimensions LxWxH 19.7 x 7.7 x 5.1 inches, Weight 2.2 Pounds, Auto Part Position Unknown, Operation Mode Mechanical, Manufacturer Delphi, Model FUEL PUMP, Dimensions 19.7'
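
The same parsing idea can be tried on plain strings, without needing `Item` objects. The sketch below is a hypothetical variant, not the notebook's `description` function: `split_prompt`, `QUESTION`, and `PRICE_MARKER` are names introduced here for illustration. It also returns the price that follows the marker, which `description` discards.

```python
QUESTION = "How much does this cost to the nearest dollar?\n\n"
PRICE_MARKER = "\n\nPrice is $"

def split_prompt(prompt):
    """Return (description, price) from a prompt string; price is None if absent."""
    # Drop the leading question only if it is actually there
    body = prompt[len(QUESTION):] if prompt.startswith(QUESTION) else prompt
    # Everything before the marker is the description, everything after is the price
    text, marker, price_text = body.partition(PRICE_MARKER)
    price = float(price_text) if marker and price_text.strip() else None
    return text, price

sample = QUESTION + "Delphi FG0166 Fuel Pump Module" + PRICE_MARKER + "227.00"
description_text, price = split_prompt(sample)
print(description_text, price)
```

Using `partition` rather than `split(...)[0]` keeps the price text available and makes the missing-marker case explicit.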

In [23]:
train_to_use = train[:50_000]
batch_size = 1000

# Loop through the dataset in batches of 1000
for i in tqdm(range(0, len(train_to_use), batch_size)):
    batch = train_to_use[i: i + batch_size]

    # Extract the description from each item
    documents = [description(item) for item in batch]

    # Generate a vector embedding for each description
    vectors = model.encode(documents).astype(float).tolist()

    # Attach category and price as metadata for each vector
    metadatas = [{"category": item.category, "price": item.price} for item in batch]

    # Create unique IDs from each item's position in train_to_use
    ids = [f"doc_{j}" for j in range(i, i + len(batch))]

    # Format for Pinecone: (id, vector, metadata)
    to_upsert = list(zip(ids, vectors, metadatas))

    # Upload the batch to the Pinecone index
    index.upsert(vectors=to_upsert)


  0%|          | 0/50 [00:00<?, ?it/s]

100%|██████████| 50/50 [1:44:21<00:00, 125.24s/it]
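
Once the upload finishes, the index can be searched by embedding a new description and asking Pinecone for its nearest neighbours. The helper below is a minimal sketch, not part of the original notebook: `find_similar` is a hypothetical name, and it assumes the response from `index.query` supports dictionary-style access to `matches`, `id`, `score`, and `metadata`, which may vary between Pinecone client versions.

```python
def find_similar(index, model, text, top_k=5):
    """Embed `text` and return (id, score, metadata) tuples for the nearest matches."""
    # Encode a one-item list and take the first (only) embedding
    vector = model.encode([text])[0].astype(float).tolist()

    # Ask for the stored metadata (category, price) alongside each match
    response = index.query(vector=vector, top_k=top_k, include_metadata=True)

    return [(m["id"], m["score"], m["metadata"]) for m in response["matches"]]

# Example usage with the objects created earlier in this notebook:
# for doc_id, score, metadata in find_similar(index, model, "fuel pump for a pickup truck"):
#     print(doc_id, round(score, 3), metadata)
```

The returned metadata carries the price of each similar product, which is what makes the index useful as context for a pricing model.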
