# Pinecone Vector Ingestion Notebook

**Purpose:**  
This notebook processes a dataset of product descriptions by encoding them into dense vector embeddings using a sentence transformer model. The resulting vectors, along with relevant metadata (e.g., price), are stored in a Pinecone index for efficient semantic search and retrieval.

**Key Components:**
* Data preprocessing and description extraction
* Batch encoding using a transformer model
* Metadata handling (price only for now)
* Pinecone initialization and index creation
* Uploading vectors to the Pinecone index in batches

**Note:**  
Ensure that your Pinecone API key is correctly set in your environment variables before running this notebook.


In [None]:
# imports

import os
import pickle

import numpy as np
from tqdm import tqdm
from dotenv import load_dotenv
from huggingface_hub import login
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
from pinecone import Pinecone, ServerlessSpec

In [None]:
HF_USER = "ed-donner"
DATASET_NAME = f"{HF_USER}/pricer-data"

# Load the dataset
dataset = load_dataset(DATASET_NAME)

# Access train and test splits
train = dataset["train"]
test = dataset["test"]

# Save to folders in your current directory
train.save_to_disk("./train")
test.save_to_disk("./test")


In [18]:
# Pickle both splits so later cells can reload them from disk
with open('train.pkl', 'wb') as file:
    pickle.dump(train, file)

with open('test.pkl', 'wb') as file:
    pickle.dump(test, file)

print(len(train))
print(len(test))

400000
2000


In [11]:
test[0]

{'text': "How much does this cost to the nearest dollar?\n\nOEM AC Compressor w/A/C Repair Kit For Ford F150 F-150 V8 & Lincoln Mark LT 2007 2008 - BuyAutoParts NEW\nAs one of the world's largest automotive parts suppliers, our parts are trusted every day by mechanics and vehicle owners worldwide. This A/C Compressor and Components Kit is manufactured and tested to the strictest OE standards for unparalleled performance. Built for trouble-free ownership and 100% visually inspected and quality tested, this A/C Compressor and Components Kit is backed by our 100% satisfaction guarantee. Guaranteed Exact Fit for easy installation 100% BRAND NEW, premium ISO/TS 16949 quality - tested to meet or exceed OEM specifications Engineered for superior durability, backed by industry-leading unlimited-mileage warranty Included in this K\n\nPrice is $",
 'price': 374.41}

In [None]:
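load_dotenv(override=True)

# Fail early with a clear message if a required key is missing.
# (Assigning os.getenv(...) back into os.environ raises TypeError when the value is None.)
for key in ('HF_TOKEN', 'PINECONE_API_KEY'):
    if not os.getenv(key):
        raise RuntimeError(f"{key} is not set; add it to your .env file or environment")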

In [None]:
# Log in to HuggingFace
hf_token = os.environ['HF_TOKEN']
login(hf_token, add_to_git_credential=True)

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.



https://drive.google.com/drive/folders/1f_IZGybvs9o0J5sb3xmtTEQB3BXllzrW?usp=drive_link

In [29]:
# With train.pkl in this folder, you can run this:

with open('train.pkl', 'rb') as file:
    train = pickle.load(file)

In [33]:
train[200]

{'text': 'How much does this cost to the nearest dollar?\n\nFit System Passenger Side Mirror for Toyota Tacoma, Black, Foldaway, Power\nPassenger Side Mirror for Toyota Tacoma. Black. Foldaway. Power. Mirror glass is power adjustable. Convex Lens. Mirror glass does not have heating capabilities. Manual folding for additional clearance. Mirror has no turn signal. Housing finish is Black. Passenger side Mirror, tested to fit and function like the original, Meets or exceeds OEM standards Mirror glass is power adjustable OE-comparable wiring harness/ connection (no pigtail connector) for hassle-free installation Manual folding for additional clearance Auto Part Position Right, Dimensions LxWxH 13.25 x 5.25 x 9.25 inches, Lens Curvature Description Convex, Brand Fit System, Color Black, Mounting Type Door Mount, Special\n\nPrice is $123.00',
 'price': 122.65}

In [41]:
# Convert the Dataset to a plain list of dicts so it can be sliced into batches of rows
train = train.to_list()

In [27]:
# Load API key from environment
pinecone_api = os.environ['PINECONE_API_KEY']

# Create a Pinecone client instance
pc = Pinecone(api_key=pinecone_api)

# Index name
index_name = "products"

# Check if index exists and delete it if it does
if index_name in pc.list_indexes().names():
    pc.delete_index(index_name)
    print(f"Deleted existing index: {index_name}")

# Create the index with a serverless spec
pc.create_index(
    name=index_name,
    dimension=384,
    metric="cosine",
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )
)

# Connect to the index
index = pc.Index(index_name)


# Introducing the SentenceTransformer

all-MiniLM-L6-v2 is a HuggingFace model that maps sentences and paragraphs to a 384-dimensional dense vector space, which makes it well suited to tasks like semantic search. The index above is created with `dimension=384` to match.

https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

It is small enough to run quickly on a local machine.
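
Because the index uses the cosine metric, a query is compared to stored vectors by the angle between them rather than by their length. A minimal sketch of that comparison in plain Python, using made-up 3-dimensional vectors in place of real 384-dimensional embeddings (the vectors and the `cosine_similarity` helper are illustrative and not part of this notebook):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for embeddings (assumed values, chosen for illustration)
query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]   # same direction, different length
orthogonal = [0.0, 3.0, 0.0]       # shares no component with the query

print(cosine_similarity(query, same_direction))  # close to 1
print(cosine_similarity(query, orthogonal))      # 0.0
```

Scaling a vector does not change its score, which is why two descriptions of different lengths can still rank as near-identical matches.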


In [28]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

In [None]:
# Pass in a list of texts, get back a numpy array of vectors

vector = model.encode(["Well hi there"])

In [None]:
vector

In [35]:
def description(item):
    text = item['text']  # access the 'text' key in the dictionary
    text = text.replace("How much does this cost to the nearest dollar?\n\n", "")
    return text.split("\n\nPrice is $")[0]

In [42]:
description(train[0])

'Delphi FG0166 Fuel Pump Module\nDelphi brings 80 years of OE Heritage into each Delphi pump, ensuring quality and fitment for each Delphi part. Part is validated, tested and matched to the right vehicle application Delphi brings 80 years of OE Heritage into each Delphi assembly, ensuring quality and fitment for each Delphi part Always be sure to check and clean fuel tank to avoid unnecessary returns Rigorous OE-testing ensures the pump can withstand extreme temperatures Brand Delphi, Fit Type Vehicle Specific Fit, Dimensions LxWxH 19.7 x 7.7 x 5.1 inches, Weight 2.2 Pounds, Auto Part Position Unknown, Operation Mode Mechanical, Manufacturer Delphi, Model FUEL PUMP, Dimensions 19.7'
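
To see exactly what `description` strips, here is a self-contained check on a made-up item with the same shape as the dataset rows. The sample text is invented for illustration, and the function body is repeated only so the snippet runs on its own:

```python
PREFIX = "How much does this cost to the nearest dollar?\n\n"
SUFFIX_MARKER = "\n\nPrice is $"

def description(item):
    # Drop the question prefix, then keep everything before the price suffix
    text = item["text"].replace(PREFIX, "")
    return text.split(SUFFIX_MARKER)[0]

# Hypothetical item, not taken from the dataset
sample = {
    "text": PREFIX + "Example Widget\nA small widget for testing." + SUFFIX_MARKER + "10.00",
    "price": 9.99,
}

result = description(sample)
print(repr(result))
```

Only the product text is embedded, so the price never ends up inside the vector; it travels separately as metadata.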

In [None]:
# Embed and upload the first 25,000 training items in batches of 1,000
train_to_use = train[0:25_000]
for i in tqdm(range(0, len(train_to_use), 1000)):
    batch = train_to_use[i: i+1000]

    # Get the text description from each item
    documents = [description(item) for item in batch]

    # Get the vector embeddings
    vectors = model.encode(documents).astype(float).tolist()

    # Build the list of items to send to Pinecone
    to_upsert = []
    for j, (vector, item) in enumerate(zip(vectors, batch), start=i):
        doc_id = f"doc_{j}"  # unique id for Pinecone
        metadata = {"price": item["price"]}  # only price in metadata
        to_upsert.append((doc_id, vector, metadata))

    # Upload to Pinecone
    index.upsert(vectors=to_upsert)


  0%|          | 0/400 [00:00<?, ?it/s]

  3%|▎         | 13/400 [54:15<26:55:00, 250.39s/it]


MaxRetryError: HTTPSConnectionPool(host='products-ggmnn7i.svc.aped-4627-b74a.pinecone.io', port=443): Max retries exceeded with url: /vectors/upsert (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate (_ssl.c:992)')))
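
The run above stopped partway through. Because each ID is derived from the item's position (`doc_{j}`), re-sending a batch should overwrite the same records rather than duplicate them, so the loop can be restarted from a later offset. A minimal sketch of that idea follows, with the encoder and the upload call passed in as plain functions. `build_records`, `upsert_in_batches`, `start_at`, and the stand-in encoder are hypothetical names, not part of this notebook or of the Pinecone client. The sketch does not address the certificate error itself, which comes from TLS verification rather than from the batching.

```python
def build_records(batch, vectors, start):
    """Pair each item with its vector as (id, values, metadata) tuples."""
    return [
        (f"doc_{j}", vector, {"price": item["price"]})
        for j, (vector, item) in enumerate(zip(vectors, batch), start=start)
    ]

def upsert_in_batches(items, encode, upsert, batch_size=1000, start_at=0):
    """Encode and upload items[start_at:] in batches; return how many records were sent."""
    sent = 0
    for i in range(start_at, len(items), batch_size):
        batch = items[i:i + batch_size]
        vectors = encode(batch)
        upsert(build_records(batch, vectors, start=i))
        sent += len(batch)
    return sent

# Stand-ins so the sketch runs without a model or a Pinecone index
items = [{"text": f"item {n}", "price": float(n)} for n in range(5)]

def fake_encode(batch):
    # Assumed placeholder: returns a fixed 2-dimensional vector per item
    return [[0.0, 1.0] for _ in batch]

uploaded = []
sent = upsert_in_batches(items, fake_encode, uploaded.extend, batch_size=2, start_at=2)
print(sent, [record[0] for record in uploaded])
```

With the notebook's real objects, `encode` could wrap `model.encode([description(x) for x in batch]).astype(float).tolist()` and `upsert` could wrap `index.upsert(vectors=records)`, with `start_at` set to the first position that was not yet stored.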