# Pinecone Vector Database Demo

This notebook demonstrates using **Pinecone Serverless** with 100 sample articles.

## Pinecone Key Features
- **Serverless Architecture** - Auto-scaling, no infrastructure management
- **Enterprise-Grade** - Used by production AI applications worldwide
- **Simple API** - Clean, intuitive Python client
- **Metadata Filtering** - Filter by indexed metadata fields
- **Free Tier** - Serverless with generous limits (us-east-1)
- **High Performance** - Optimized for low-latency queries
- **Cloud Options** - AWS, GCP, Azure support

## 1. Setup and Imports

In [1]:
import os
import sys
from pathlib import Path
import time

# Add parent directory to path
parent_dir = Path().resolve().parent
sys.path.insert(0, str(parent_dir))

# Load environment variables
from dotenv import load_dotenv
load_dotenv()

# Import utilities
from utils.embeddings import EmbeddingGenerator
from utils.data_loader import load_articles, get_article_metadata

print("✓ All imports successful")

✓ All imports successful


## 2. Load Embedding Model

Using `sentence-transformers/all-MiniLM-L6-v2` (384 dimensions)

In [2]:
# Initialize embedding model
embedding_model = EmbeddingGenerator()

# Test the model
test_text = "This is a test sentence for embedding generation."
test_embedding = embedding_model.embed_text(test_text)

print(f"  - Embedding dimension: {len(test_embedding)}")
print(f"  - Sample values: {test_embedding[:5]}")

Loading embedding model: sentence-transformers/all-MiniLM-L6-v2
✓ Model loaded successfully. Embedding dimension: 384
  - Embedding dimension: 384
  - Sample values: [0.00306019 0.00200206 0.05544939 0.07702641 0.00857853]


## 3. Load Sample Articles

In [3]:
import json
import random

# Load articles
articles = load_articles("../sample_articles.json")

print(f"\nLoaded {len(articles)} articles")

# Pick a random article and preview it
print("\nRandom Sample article:")
selected_index = random.randint(0, len(articles) - 1)
print(json.dumps(articles[selected_index], indent=2))

Loaded 100 articles from ../sample_articles.json

Loaded 100 articles

Random Sample article:
{
  "id": 15573824,
  "item_source": "CLIMBING",
  "item_title": "How a Dirtbag Became a Billionaire\u2014Without Compromising His Ethics",
  "item_subtitle": "The new biography Dirtbag Billionaire explores Yvon Chouinard's philanthropy, as well as his life as a Yosemite climber and alpinist.",
  "body_content": "\u201cI\u2019d feel a lot more comfortable on top of a mountain than here right now,\u201d Yvon Chouinard once told President Bill Clinton while participating in a conference on corporate responsibility.\nThat\u2019s because Chouinard is a reluctant corporate leader. His biggest accident was no blunder in the mountains, but becoming a billionaire. And his greatest legacy is not one of his first ascents or gear innovations, but giving his billions away to help save the planet he loves exploring. In short, this is the narrative arc of the life of Chouinard, a humble descendant of French

## 4. Connect to Pinecone

Using Pinecone Serverless (free tier available in us-east-1)

In [5]:
from pinecone import Pinecone, ServerlessSpec

PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")

# Initialize Pinecone client
pc = Pinecone(api_key=PINECONE_API_KEY)

print("✓ Connected to Pinecone")
print(f"  - Existing indexes: {pc.list_indexes().names()}")

✓ Connected to Pinecone
  - Existing indexes: []
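
A missing API key is easier to diagnose if the notebook checks for it before constructing the client. The following is a minimal sketch of such a check; `require_env` is a hypothetical helper introduced here for illustration and is not part of the Pinecone client or of this project's `utils` package.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is missing or blank."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file."
        )
    return value


# Possible usage in this notebook (assumes the key is defined in .env):
# pc = Pinecone(api_key=require_env("PINECONE_API_KEY"))
```

Raising early keeps the error message next to its cause instead of leaving it to whichever call first needs the key.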


## 5. Create or Get Index

Pinecone Serverless index creation.

**Note on Metadata Indexing (Oct 21, 2025):**  
Pinecone has a new [Metadata Indexing](https://docs.pinecone.io/guides/index-data/create-an-index#metadata-indexing) feature that allows you to specify filterable fields upfront for better performance. However, this feature is currently in **early access** and not widely available yet. We'll use the default metadata filtering approach instead, which works for all metadata fields automatically.

In [7]:
INDEX_NAME = "articles"

# Check if index exists
existing_indexes = pc.list_indexes().names()

if INDEX_NAME in existing_indexes:
    print(f"Index '{INDEX_NAME}' already exists")
    
    # Get index object
    index = pc.Index(INDEX_NAME)
    
    # Get stats
    stats = index.describe_index_stats()
    vector_count = stats.get('total_vector_count', 0)
    
    print(f"✓ Using existing index: {INDEX_NAME}")
    print(f"  - Current count: {vector_count} vectors")
    
    # Ask user if they want to delete and recreate
    recreate = input("\nDo you want to delete and recreate? (y/n): ").lower().strip()
    if recreate == 'y':
        pc.delete_index(INDEX_NAME)
        print(f"✓ Deleted index: {INDEX_NAME}")
        existing_indexes.remove(INDEX_NAME)

# Create index if it doesn't exist
if INDEX_NAME not in existing_indexes:
    print(f"Creating new serverless index: {INDEX_NAME}")
    
    # Create serverless index
    # Note: Using default metadata filtering (no schema parameter needed)
    # All metadata fields are automatically filterable
    pc.create_index(
        name=INDEX_NAME,
        dimension=384,  # all-MiniLM-L6-v2 dimensions
        metric="cosine",
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"  # Free tier region
        ),
        deletion_protection="disabled"
    )
    
    # Wait for index to be ready
    print("\nWaiting for index to be ready...")
    while not pc.describe_index(INDEX_NAME).status['ready']:
        time.sleep(1)
    print("✓ Index is ready!")
    
    # Get index object
    index = pc.Index(INDEX_NAME)

Creating new serverless index: articles

Waiting for index to be ready...
✓ Index is ready!
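
The readiness loop in the cell above polls indefinitely, so a failed index creation would hang the notebook. One way to bound the wait is a small polling helper with a timeout. This is a sketch under stated assumptions: `wait_until` is a hypothetical helper, and the default timeout and interval are arbitrary values, not Pinecone recommendations.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool],
               timeout: float = 60.0,
               interval: float = 1.0) -> bool:
    """Poll predicate() until it returns True or `timeout` seconds elapse.

    Returns True if the predicate succeeded, False if the wait timed out.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Possible usage with the objects defined in this notebook:
# ready = wait_until(lambda: pc.describe_index(INDEX_NAME).status['ready'], timeout=120)
# if not ready:
#     raise TimeoutError(f"Index '{INDEX_NAME}' was not ready in time")
```

`time.monotonic()` is used rather than `time.time()` so that system clock adjustments cannot shorten or extend the wait.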


## 6. Generate Embeddings and Upsert Data

Process articles in batches, matching the approach used in other notebooks

In [8]:
# Check current count
stats = index.describe_index_stats()
current_count = stats.get('total_vector_count', 0)


# Process in batches - same approach as other notebooks
BATCH_SIZE = 20
total_articles = len(articles)

print(f"Processing {total_articles} articles in batches of {BATCH_SIZE}...\n")

start_time = time.time()

from tqdm.auto import tqdm

for i in tqdm(range(0, total_articles, BATCH_SIZE), desc="Inserting batches"):
    batch = articles[i:i + BATCH_SIZE]

    # Generate embeddings for batch - same as other notebooks
    texts = [
        f"Title: {a['item_title']}\nSubtitle: {a.get('item_subtitle', '')}\nContent: {a['body_content'][:500]}"
        for a in batch
    ]
    embeddings = embedding_model.embed_batch(texts, show_progress=False)

    # Prepare metadata - use "pinecone" to get timestamps
    metadatas = [get_article_metadata(a, db_type="pinecone") for a in batch]

    # Prepare vectors for Pinecone
    vectors = []
    for metadata, embedding in zip(metadatas, embeddings):
        # Pinecone record format: a dict with "id", "values", and "metadata"
        vector = {
            "id": f"article_{metadata['id']}",
            "values": embedding.tolist(),
            "metadata": {
                "title": metadata["title"],
                "subtitle": metadata["subtitle"],
                "category": metadata["category"],
                "source": metadata["source"],
                "tags": metadata["tags"],  # Comma-separated string in this demo
                "evergreen": metadata["evergreen"],
                "url": metadata["url"],
                "created_at": metadata["created_at"]  # Unix timestamp (query results return it as a float)
            }
        }
        vectors.append(vector)

    # Upsert batch into Pinecone
    index.upsert(vectors=vectors)

elapsed_time = time.time() - start_time

# Get updated count (index stats are eventually consistent, so this can lag right after an upsert)
stats = index.describe_index_stats()
final_count = stats.get('total_vector_count', 0)

print(f"\n✓ Successfully inserted {total_articles} articles")
print(f"  - Time taken: {elapsed_time:.2f} seconds")
print(f"  - Average: {elapsed_time/total_articles:.2f} seconds per article")
print(f"  - Index vector count: {final_count}")

Processing 100 articles in batches of 20...



✓ Successfully inserted 100 articles
  - Time taken: 2.06 seconds
  - Average: 0.02 seconds per article
  - Index vector count: 0
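
This notebook stores tags as one comma-separated string. Pinecone's documentation also lists a list of strings as a supported metadata value type, so, assuming that holds for your index, tags could instead be stored as a list and matched individually. The sketch below shows the conversion; `split_tags` is a hypothetical helper, and the `$in` filter assumes the `tags` field was upserted as a list rather than as the string used in this demo.

```python
def split_tags(tags: str) -> list[str]:
    """Split a comma-separated tag string into a list of trimmed, non-empty tags."""
    return [tag.strip() for tag in tags.split(",") if tag.strip()]


tags = split_tags("evergreen, Halloween, Hiking")
print(tags)

# Filter intended to match records whose tag list contains "Halloween"
# (assumption: "tags" was stored as a list of strings, not a single string).
tag_filter = {"tags": {"$in": ["Halloween"]}}
```

With the comma-separated form used in this notebook, an `$eq` filter on `tags` would only match the exact full string, which is why individual tags are not filtered on here.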


## 7. Basic Semantic Search

Search using vector similarity

In [9]:
# Test query - SAME AS OTHER NOTEBOOKS
query_text = "Most haunted hikes in the US"

print(f"Query: '{query_text}'\n")

# Generate query embedding
query_embedding = embedding_model.embed_text(query_text)

# Perform search
results = index.query(
    vector=query_embedding.tolist(),
    top_k=5,
    include_metadata=True
)

print(f"Top {len(results['matches'])} results:\n")
for i, match in enumerate(results['matches']):
    # Pinecone returns score (higher = more similar for COSINE)
    score = match['score']
    metadata = match['metadata']
    
    print(f"{i+1}. {metadata['title'][:70]}...")
    print(f"   Category: {metadata['category']} | Source: {metadata['source']}")
    print(f"   Score: {score:.4f}")
    print(f"   URL: {metadata['url']}...")

Query: 'Most haunted hikes in the US'

Top 5 results:

1. 13 of the Most Haunted Hikes in the U.S....
   Category: Destinations | Source: OUTSIDE
   Score: 0.8054
   URL: https://www.outsideonline.com/adventure-travel/destinations/haunted-hikes/...
2. A Missing Dog Helped a Stranded Hiker Return to Shadow Mountain Trail....
   Category: Hiking | Source: OUTSIDE
   Score: 0.4575
   URL: https://www.outsideonline.com/outdoor-adventure/hiking-and-backpacking/arizona-lost-hiker-missing-dog-shadow-mountain/...
3. An Inside Look at Outside’s 2025 Winter Editors’ Choice Testing Trip...
   Category: Gear | Source: OUTSIDE
   Score: 0.3667
   URL: https://www.outsideonline.com/outdoor-gear/winter-editors-choice-trip-maine/...
4. Two Hikers in British Columbia Were Hospitalized After a Grizzly Sow A...
   Category: Hiking | Source: OUTSIDE
   Score: 0.3366
   URL: https://www.outsideonline.com/outdoor-adventure/hiking-and-backpacking/two-hikers-in-british-columbia-were-hospitalized-after-a-grizz
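
Because the index was created with `metric="cosine"`, each `score` above is a cosine similarity between the query embedding and the stored embedding. As an illustration of what that number measures, here is a plain-Python sketch on toy two-dimensional vectors (not real embeddings); `cosine_similarity` is written here for illustration and is not how Pinecone computes scores internally.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(round(cosine_similarity([1.0, 0.0], [1.0, 1.0]), 4))  # 0.7071 (45 degree angle)
print(round(cosine_similarity([1.0, 2.0], [2.0, 4.0]), 4))  # 1.0 (same direction)
```

The second example shows that cosine similarity ignores vector length: scaling a vector does not change the score.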

## 8. Metadata Filtering - Category

Pinecone's metadata filtering syntax

In [10]:
# Filter by category - SAME AS OTHER NOTEBOOKS
query_text = "Women's Ironman World Championship"
target_category = "News"

print(f"Query: '{query_text}'")
print(f"Filter: category = '{target_category}'\n")

# Generate query embedding
query_embedding = embedding_model.embed_text(query_text)

# Search with metadata filter
results = index.query(
    vector=query_embedding.tolist(),
    top_k=5,
    filter={
        "category": {"$eq": target_category}
    },
    include_metadata=True
)

print(f"Top 5 Results (Category: {target_category}):\n")
for i, match in enumerate(results['matches']):
    score = match['score']
    metadata = match['metadata']
    
    print(f"{i+1}. {metadata['title'][:70]}...")
    print(f"   Category: {metadata['category']} | Source: {metadata['source']}")
    print(f"   Created: {metadata['created_at']}")
    print(f"   Score: {score:.4f}")

Query: 'Women's Ironman World Championship'
Filter: category = 'News'

Top 5 Results (Category: News):

1. After Joy of Women's-Only Ironman World Championship, Grief Sets In...
   Category: News | Source: TRIATHLETE
   Created: 1760296873.0
   Score: 0.7764
2. What a Race! Here's Where the Ironman Pro Series Stands After the Iron...
   Category: News | Source: TRIATHLETE
   Created: 1760352009.0
   Score: 0.6456
3. The Fastest Shoes at 2025 Ironman World Championship Kona...
   Category: News | Source: TRIATHLETE
   Created: 1760353908.0
   Score: 0.6184
4. The DNF Files: 2025 Ironman World Championship Kona...
   Category: News | Source: TRIATHLETE
   Created: 1760441445.0
   Score: 0.5915
5. In Sweltering Conditions, Norway’s Solveig Løvseth Takes 2025 Ironman ...
   Category: News | Source: TRIATHLETE
   Created: 1760160735.0
   Score: 0.5609


## 9. Metadata Filtering - Date Range

Filter by Unix timestamp (stored as a numeric metadata field)

In [11]:
from utils.date_utils import date_string_to_timestamp, timestamp_to_datetime_string

# Filter by date - SAME AS OTHER NOTEBOOKS
query_text = "cycling deals"
cutoff_date = "2025-10-08"

print(f"Query: '{query_text}'")
print(f"Filter: created_at >= '{cutoff_date}'\n")

# Generate query embedding
query_embedding = embedding_model.embed_text(query_text)

# Convert date to timestamp for Pinecone
cutoff_timestamp = date_string_to_timestamp(cutoff_date)

# Search with date filter
results = index.query(
    vector=query_embedding.tolist(),
    top_k=5,
    filter={
        "created_at": {"$gte": cutoff_timestamp}
    },
    include_metadata=True
)

print(f"Top 5 Recent Results (after {cutoff_date}):\n")
for i, match in enumerate(results['matches']):
    score = match['score']
    metadata = match['metadata']
    created_str = timestamp_to_datetime_string(metadata['created_at'])
    
    print(f"{i+1}. {metadata['title']}")
    print(f"   Category: {metadata['category']}")
    print(f"   Created: {created_str}")
    print(f"   Tags: {metadata.get('tags', 'No tags')}")

Query: 'cycling deals'
Filter: created_at >= '2025-10-08'

Top 5 Recent Results (after 2025-10-08):

1. Opinion: Cycling's Soccer-Inspired Relegation System Is a Hot Mess That Solves Nothing
   Category: Road Racing
   Created: 2025-10-15 22:42:10
   Tags: Analysis, ASO, Cofidis, Tour de France, Tour de Hoody
2. Deal: Tailwind Endurance Fuel Is the Cycling Nutrition I Actually Use
   Category: Road Gear
   Created: 2025-10-13 04:30:52
   Tags: Velo Deals
3. Pogačar's Bonuses and Brand Deals Revealed: Inside His $14 Million Pay Check
   Category: Road Racing
   Created: 2025-10-13 20:39:12
   Tags: Alex Carera, Remco Evenepoel, Tadej Pogačar, Transfers, UAE Emirates
4. Shop Evo's Anniversary Sale and Save up to 50% on Ski, Snowboard, and MTB Gear
   Category: Gear News
   Created: 2025-10-14 03:53:27
   Tags: Commerce, Deals
5. Deal: One of the Best Headphones for Cycling Is 50% Off
   Category: Road Gear
   Created: 2025-10-15 05:12:34
   Tags: headphones, Velo Deals
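
The date filter relies on two helpers from `utils.date_utils` whose source is not shown in this notebook. The sketch below shows one way such conversions could be written. The names `to_unix_utc` and `from_unix_utc` are hypothetical stand-ins, and interpreting dates as midnight UTC is an assumption; the project's own helpers may use a different timezone.

```python
from datetime import datetime, timezone


def to_unix_utc(date_str: str) -> int:
    """Convert a 'YYYY-MM-DD' string to a Unix timestamp at midnight UTC."""
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


def from_unix_utc(timestamp: float) -> str:
    """Format a Unix timestamp as 'YYYY-MM-DD HH:MM:SS' in UTC."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


cutoff = to_unix_utc("2025-10-08")
print(cutoff)                 # 1759881600
print(from_unix_utc(cutoff))  # 2025-10-08 00:00:00
```

Pinning the timezone explicitly keeps the cutoff stable across machines; a naive `datetime` would be interpreted in the local timezone and could shift the filter boundary by several hours.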


## 10. Combined Filters - Evergreen + Date

Pinecone supports combining filters with the `$and` operator

In [12]:
# Combine multiple filters - SAME AS OTHER NOTEBOOKS
query_text = "Halloween outdoor activities"
cutoff_date = "2025-10-09"

print(f"Query: '{query_text}'")
print(f"Filters:")
print(f"  - evergreen = True (timeless content)")
print(f"  - created_at >= '{cutoff_date}'\n")

# Generate query embedding
query_embedding = embedding_model.embed_text(query_text)

# Convert date to timestamp
cutoff_timestamp = date_string_to_timestamp(cutoff_date)

# Combine filters with $and
combined_filter = {
    "$and": [
        {"evergreen": {"$eq": True}},
        {"created_at": {"$gte": cutoff_timestamp}}
    ]
}

# Search with combined filters
results = index.query(
    vector=query_embedding.tolist(),
    top_k=10,  # Increased to 10 since evergreen articles might be fewer
    filter=combined_filter,
    include_metadata=True
)

if results['matches']:
    print(f"Top Evergreen Results (After {cutoff_date}):\n")
    for i, match in enumerate(results['matches']):
        score = match['score']
        metadata = match['metadata']
        created_str = timestamp_to_datetime_string(metadata['created_at'])
        
        print(f"{i+1}. {metadata['title'][:70]}...")
        print(f"   Category: {metadata['category']} | Evergreen: {metadata['evergreen']}")
        print(f"   Tags: {metadata.get('tags', 'No tags')}")
        print(f"   Created: {created_str}")
    print(f"\nTotal results: {len(results['matches'])}")
else:
    print("No evergreen articles found after this date.")

Query: 'Halloween outdoor activities'
Filters:
  - evergreen = True (timeless content)
  - created_at >= '2025-10-09'

Top Evergreen Results (After 2025-10-09):

1. 13 of the Most Haunted Hikes in the U.S....
   Category: Destinations | Evergreen: True
   Tags: evergreen, Halloween, Hiking
   Created: 2025-10-16 04:22:41
2. The Thule Outset Hitch-Mounted Tent Turns Your Car Into a Campsite on ...
   Category: Camping | Evergreen: True
   Tags: 2025 Gear Reviews, Car Camping, Car Racks, Commerce, evergreen
   Created: 2025-10-14 03:30:11
3. The Best Daypacks for Every Kind of Hiker (2025)...
   Category: Daypacks | Evergreen: True
   Tags: 2025 Gear Reviews, 2025 Summer Gear Guide, backpack, Commerce, Day Packs
   Created: 2025-10-16 04:31:44
4. Everything You Need To Know Before Skiing Telluride For The First Time...
   Category: Resort Skiing | Evergreen: True
   Tags: evergreen, Telluride Ski Resort
   Created: 2025-10-13 07:39:24
5. He’s Hunted for Elk for 40 Years but Hasn’t Killed
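
The filter dictionaries in the last three sections share the same shape, so they can be assembled from optional arguments. The following is a minimal sketch, assuming the metadata field names used in this notebook (`category`, `evergreen`, `created_at`); `build_filter` is a hypothetical helper, not a Pinecone API.

```python
from typing import Any, Optional


def build_filter(category: Optional[str] = None,
                 evergreen: Optional[bool] = None,
                 created_after: Optional[float] = None) -> Optional[dict[str, Any]]:
    """Build a Pinecone-style metadata filter from the conditions that are set.

    Returns None when no condition is given, a single clause when exactly one
    is given, and an "$and" of all clauses otherwise.
    """
    clauses: list[dict[str, Any]] = []
    if category is not None:
        clauses.append({"category": {"$eq": category}})
    if evergreen is not None:
        clauses.append({"evergreen": {"$eq": evergreen}})
    if created_after is not None:
        clauses.append({"created_at": {"$gte": created_after}})

    if not clauses:
        return None
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}


print(build_filter(evergreen=True, created_after=1759968000))
```

The result can be passed as the `filter` argument of `index.query`; returning `None` for the empty case matches a query with no filter at all.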

## 11. Performance Summary

Standardized benchmark across all vector databases

In [13]:
from utils.benchmark import benchmark_queries

# Define query function for Pinecone
def pinecone_query_fn(query_text: str):  
    """Query function for Pinecone benchmarking."""
    query_embedding = embedding_model.embed_text(query_text)
    return index.query(
        vector=query_embedding.tolist(),
        top_k=10,
        include_metadata=True
    )

# Run standardized benchmark
results = benchmark_queries(pinecone_query_fn)

# Get index stats
stats = index.describe_index_stats()
print(f"\nIndex Statistics:")
print(f"  - Total vectors: {stats.get('total_vector_count', 0)}")
print(f"  - Vector dimensions: 384")
print(f"  - Distance metric: cosine")
print(f"  - Cloud: AWS us-east-1 (serverless)")

Running performance benchmark...

'outdoor hiking adventures' -> 107.7ms
'cycling race performance' -> 99.8ms
'travel destinations and tips' -> 97.6ms
'fitness training techniques' -> 127.5ms
'gear reviews and recommendations' -> 99.0ms

Performance Summary:
  - Average query time: 106.3ms
  - Min query time: 97.6ms
  - Max query time: 127.5ms

Index Statistics:
  - Total vectors: 100
  - Vector dimensions: 384
  - Distance metric: cosine
  - Cloud: AWS us-east-1 (serverless)
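
The source of `utils.benchmark.benchmark_queries` is not shown here. As a rough illustration of how per-query timings like the ones above could be collected, the sketch below times a query function with `time.perf_counter`. `time_queries` is a hypothetical function and is not the project's actual benchmark.

```python
import time
from statistics import mean
from typing import Any, Callable, Iterable


def time_queries(query_fn: Callable[[str], Any],
                 queries: Iterable[str]) -> dict[str, float]:
    """Run query_fn once per query and summarize wall-clock latency in milliseconds."""
    timings_ms = []
    for query in queries:
        start = time.perf_counter()
        query_fn(query)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    if not timings_ms:
        raise ValueError("at least one query is required")
    return {
        "avg_ms": mean(timings_ms),
        "min_ms": min(timings_ms),
        "max_ms": max(timings_ms),
    }


# Possible usage with the function defined in the cell above:
# summary = time_queries(pinecone_query_fn, ["outdoor hiking adventures", "cycling race performance"])
```

Note that `pinecone_query_fn` generates the query embedding inside the timed call, so timings measured this way include local embedding time as well as the network round trip to Pinecone.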


## 12. Key Takeaways - Pinecone

### ✅ Strengths
1. **Serverless Architecture** - Zero infrastructure management
2. **Simple API** - Clean, intuitive Python client
3. **Enterprise-Grade** - Production-ready with SLAs
4. **Auto-Scaling** - Handles traffic spikes automatically
5. **Metadata Filtering** - Powerful filter syntax ($eq, $gte, $and, $or)
6. **Free Tier** - Generous limits for prototyping (us-east-1)
7. **High Performance** - Optimized for low-latency queries
8. **Multi-Cloud** - AWS, GCP, Azure support

### ⚠️ Considerations
1. **Free Tier Region** - Limited to us-east-1 on free plan
2. **Metadata Indexing** - New feature in early access (not used in this demo)
3. **Limited Array Support** - Metadata lists can hold strings only; this demo stores tags as comma-separated strings
4. **Timestamps Only** - No native date type (store Unix timestamps as numeric metadata)
5. **Costs** - Can get expensive at scale (monitor usage)

### 🎯 Best For
- **Production AI apps** - Reliable, scalable, managed service
- **Serverless workloads** - Auto-scaling, pay-per-use
- **Simple deployments** - Easy setup, no infrastructure
- **Global apps** - Multi-cloud, multi-region support
- **Startups** - Quick to launch, easy to scale

### 📊 Comparison Notes
- **vs Chroma**: More enterprise features, better scaling, serverless
- **vs Qdrant**: Simpler API, serverless, but less flexible metadata
- **vs Weaviate**: Hybrid search requires supplying sparse vectors (not covered in this demo), simpler schema, fully managed
- **vs Milvus/Zilliz**: Simpler expressions, serverless, easier setup

### 💡 Unique Pinecone Features
1. **True serverless** - Auto-scaling without pods/nodes
2. **Simple filter syntax** - MongoDB-style operators ($eq, $gte, $and)
3. **Multi-cloud** - Deploy to AWS, GCP, or Azure
4. **Production SLAs** - Enterprise-grade reliability
5. **Easy migration** - Simple import/export tools
6. **Metadata Indexing** - Early access feature for performance optimization

### 🏆 When Pinecone is the Best Choice
Use Pinecone when you need:
- Zero infrastructure management (true serverless)
- Production-grade reliability with SLAs
- Simple API without complex configuration
- Auto-scaling for unpredictable traffic
- Quick time-to-market

### 💻 Filter Examples
```python
# Equality filter
filter={"category": {"$eq": "News"}}

# Range filter
filter={"created_at": {"$gte": timestamp}}

# Combined filters (AND)
filter={
    "$and": [
        {"evergreen": {"$eq": True}},
        {"created_at": {"$gte": timestamp}}
    ]
}

# Combined filters (OR)
filter={
    "$or": [
        {"category": {"$eq": "News"}},
        {"category": {"$eq": "Events"}}
    ]
}
```

### 📝 Note on Metadata Indexing
As of October 21, 2025, Pinecone has introduced a **Metadata Indexing** feature that allows you to specify which metadata fields should be indexed for filtering, which can improve query performance. However, this feature is currently in **early access** and not widely available.

**Default behavior (used in this notebook):** All metadata fields are automatically filterable without any schema declaration. This works well for most use cases.

**With metadata indexing (future):** You can explicitly declare filterable fields in the index schema for optimized performance.

See: https://docs.pinecone.io/guides/index-data/create-an-index#metadata-indexing

**Other DBs' metadata approach:**
- Chroma/Milvus: Timestamps (INT64)
- Weaviate: Native DATE type
- Qdrant: Native datetime objects + payload indexes
- Pinecone: **Numeric Unix timestamps with automatic filtering (or metadata indexing in early access)**