# 02_pinecone_setup
- Initializes Pinecone with API keys
- Creates an index or connects to an existing one
- Sets up the proper dimensions (1024) for Jina embeddings
- Configures metadata fields for efficient filtering
- Tests basic vector operations (insert, query, filter, delete)
- Creates a helper function for use in other notebooks

In [2]:
!pip install pinecone

Collecting pinecone
  Downloading pinecone-6.0.1-py3-none-any.whl.metadata (8.8 kB)
Collecting pinecone-plugin-interface<0.0.8,>=0.0.7 (from pinecone)
  Downloading pinecone_plugin_interface-0.0.7-py3-none-any.whl.metadata (1.2 kB)
Downloading pinecone-6.0.1-py3-none-any.whl (421 kB)
Downloading pinecone_plugin_interface-0.0.7-py3-none-any.whl (6.2 kB)
Installing collected packages: pinecone-plugin-interface, pinecone
Successfully installed pinecone-6.0.1 pinecone-plugin-interface-0.0.7


In [5]:
import json
import pinecone
import uuid
import numpy as np
import pandas as pd

In [6]:

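PINECONE_API_KEY = "redacted"  # Placeholder; load the real key from an environment variable instead of hard-coding it
PINECONE_ENVIRONMENT = "us-east-1"  # Index region (kept for reference; the Pinecone(...) client call below does not take it)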


In [4]:
from pinecone import Pinecone

# initialize Pinecone with API key
pc = Pinecone(api_key=PINECONE_API_KEY)

existing_indexes = pc.list_indexes()

In [5]:
# Define index parameters
INDEX_NAME = "mirra"
DIMENSION = 1024  # jina-embeddings-v3 uses 1024 dimensions
METRIC = "cosine"  # Cosine similarity is a common choice for semantic matching of text embeddings

# Define metadata fields to index for efficient filtering
INDEXED_METADATA_FIELDS = [
    "source_type",        # job_description, resume
    "chunk_type",         # skill, education, experience, credential
    "requirement_level",  # mandatory, preferred, responsibility
    "job_id",             # For grouping chunks by job
    "resume_id"           # For grouping chunks by resume
]

print(f"Index name: {INDEX_NAME}")
print(f"Vector dimension: {DIMENSION}")
print(f"Similarity metric: {METRIC}")
print(f"Indexed metadata fields: {INDEXED_METADATA_FIELDS}")

Index name: mirra
Vector dimension: 1024
Similarity metric: cosine
Indexed metadata fields: ['source_type', 'chunk_type', 'requirement_level', 'job_id', 'resume_id']
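
Filtering on these fields only works if every chunk carries them with values Pinecone accepts: strings, numbers, booleans, or lists of strings. The check below is a minimal sketch of a pre-upsert validation step. `validate_metadata` and its `required_fields` argument are hypothetical names introduced here, not part of the Pinecone SDK.

```python
from numbers import Real


def validate_metadata(metadata, required_fields=()):
    """Return a list of problems found in a metadata dict (empty list means OK)."""
    problems = []
    # Report any field that later filters rely on but this chunk lacks.
    for field in required_fields:
        if field not in metadata:
            problems.append(f"missing field: {field}")
    # Accept only flat values: str, number, bool, or a list of strings.
    for key, value in metadata.items():
        is_scalar = isinstance(value, (str, bool, Real))
        is_string_list = isinstance(value, list) and all(
            isinstance(item, str) for item in value
        )
        if not (is_scalar or is_string_list):
            problems.append(
                f"unsupported value type for {key}: {type(value).__name__}"
            )
    return problems
```

Because a chunk belongs to either a job or a resume, one option is to require only the shared fields, for example `validate_metadata(metadata, ["source_type", "chunk_type"])`, and to check `job_id` or `resume_id` separately.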


In [6]:
index = pc.Index(INDEX_NAME)
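
`pc.Index(INDEX_NAME)` assumes the index already exists; nothing in this notebook creates it. A minimal sketch of a create-if-missing step, to run before connecting, could look like the following. `ensure_index` is a hypothetical helper; it assumes `pc` is a `pinecone.Pinecone` client and `spec` is a `pinecone.ServerlessSpec` (or `PodSpec`) chosen by the caller.

```python
def ensure_index(pc, name, dimension, metric, spec):
    """Create the index if it is missing; return True when a new index was created.

    `pc` is assumed to be a pinecone.Pinecone client and `spec` a
    ServerlessSpec/PodSpec instance describing where the index is hosted.
    """
    existing_names = pc.list_indexes().names()
    if name in existing_names:
        return False  # Nothing to do; the index is already there.
    pc.create_index(name=name, dimension=dimension, metric=metric, spec=spec)
    return True
```

In this notebook the call might be `ensure_index(pc, INDEX_NAME, DIMENSION, METRIC, ServerlessSpec(cloud="aws", region=PINECONE_ENVIRONMENT))` after `from pinecone import ServerlessSpec`. The `"aws"` cloud is an assumption; the source only gives the region.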

In [7]:
# Create a test vector
def create_test_vector():
    # Generate a random vector of the correct dimension
    numpy_values = np.random.rand(DIMENSION)
    # Convert to native Python floats
    vector_values = [float(val) for val in numpy_values]
    
    # Create a unique ID for the test vector
    vector_id = f"test_{uuid.uuid4()}"
    
    # Create metadata for the test vector
    metadata = {
        "source_type": "test",
        "chunk_type": "skill",
        "requirement_level": "mandatory",
        "job_id": "test_job",
        "skill_name": "Python programming",
        "chunk_text": "Required skill: Python programming with 3 years experience"
    }
    
    # Create the vector object
    vector = {
        "id": vector_id,
        "values": vector_values,
        "metadata": metadata
    }
    return vector, vector_id

# create and insert a test vector
test_vector, test_vector_id = create_test_vector()

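# Safeguard: re-cast values to native Python floats before upserting
# (redundant here, since create_test_vector already does this)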
if 'values' in test_vector:
    test_vector['values'] = [float(val) for val in test_vector['values']]

print(f"Inserting test vector with ID: {test_vector_id}")
index.upsert(vectors=[test_vector])

Inserting test vector with ID: test_f3ca2536-54c6-4499-b50c-7a5758137d6b


{'upserted_count': 1}

In [8]:
# Query the test vector
query_results = index.query(
    vector=test_vector["values"],
    top_k=1,
    include_metadata=True
)

print("Query results:")
# Convert the QueryResponse to a dictionary first
query_results_dict = query_results.to_dict()
print(json.dumps(query_results_dict, indent=2))

# Verify the top result is our test vector
if query_results_dict["matches"] and query_results_dict["matches"][0]["id"] == test_vector_id:
    print("Vector query successful! Retrieved the test vector correctly.")
else:
    print("Vector query issue: Test vector not retrieved as expected.")

Query results:
{
  "matches": [],
  "namespace": "",
  "usage": {
    "read_units": 1
  }
}
Vector query issue: Test vector not retrieved as expected.
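
The empty `matches` list is most likely a timing effect rather than a failed upsert: Pinecone is eventually consistent, so a freshly upserted vector may not be visible to queries for a short while. One way to handle this is to poll until the vector can be fetched before querying. The sketch below is a generic polling helper; `wait_until`, `timeout_s`, and `interval_s` are hypothetical names, and the timeout values are arbitrary.

```python
import time


def wait_until(predicate, timeout_s=30.0, interval_s=1.0):
    """Call `predicate` repeatedly until it returns a truthy value or time runs out.

    Returns True if the predicate succeeded, False if the timeout expired first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A possible use before the query cell is `wait_until(lambda: test_vector_id in index.fetch(ids=[test_vector_id]).vectors)`, which assumes the fetch response exposes a `vectors` mapping keyed by ID.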


In [9]:
# query with metadata filter
filtered_results = index.query(
    vector=test_vector["values"],
    filter={"source_type": "test"},
    top_k=10,
    include_metadata=True
)

print("\nFiltered query results:")
# convert the QueryResponse to a dictionary
filtered_results_dict = filtered_results.to_dict()
print(json.dumps(filtered_results_dict, indent=2))

if filtered_results_dict["matches"]:
    print(f"Filter query successful! Retrieved {len(filtered_results_dict['matches'])} vectors.")
else:
    print("Filter query issue: No vectors retrieved with filter.")


Filtered query results:
{
  "matches": [],
  "namespace": "",
  "usage": {
    "read_units": 1
  }
}
Filter query issue: No vectors retrieved with filter.
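
The filter above uses the shorthand `{"source_type": "test"}`, which matches on equality. For the combined filters this project is likely to need (for example, a source type plus several chunk types), a small builder that emits Pinecone's explicit `$eq`, `$in`, and `$and` operators can help. This is a sketch only; `build_filter` is a hypothetical helper.

```python
def build_filter(**conditions):
    """Build a Pinecone metadata filter from keyword arguments.

    A scalar value becomes an `$eq` clause, a list or tuple becomes `$in`,
    and `None` values are skipped. Multiple clauses are joined with `$and`.
    Returns None when no clause remains, meaning "no filter".
    """
    clauses = []
    for field, value in conditions.items():
        if value is None:
            continue
        if isinstance(value, (list, tuple)):
            clauses.append({field: {"$in": list(value)}})
        else:
            clauses.append({field: {"$eq": value}})
    if not clauses:
        return None
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}
```

For instance, `build_filter(source_type="job_description", chunk_type=["skill", "experience"])` produces a filter that can be passed as `filter=` to `index.query`.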


In [10]:
# Delete the test vector
index.delete(ids=[test_vector_id])

# Verify deletion
index_stats_after_delete = index.describe_index_stats()
# Convert to dictionary before JSON serialization
index_stats_dict = index_stats_after_delete.to_dict()
print(f"Index statistics after deletion: {json.dumps(index_stats_dict, indent=2)}")

Index statistics after deletion: {
  "namespaces": {},
  "index_fullness": 0.0,
  "total_vector_count": 0,
  "dimension": 1024,
  "metric": "cosine",
  "vector_type": "dense"
}


In [11]:
def get_pinecone_client(index_name=INDEX_NAME):
    """
    Helper function to initialize Pinecone and return the index.
    Args:
        index_name: Name of the Pinecone index
    Returns:
        Pinecone index object
    """
    # initialize Pinecone with the new API
    from pinecone import Pinecone
    
    pc = Pinecone(api_key=PINECONE_API_KEY)
    
    # return the index
    return pc.Index(index_name)

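# Attempt to save this function for use in other notebooks.
# Note: %store cannot persist interactively defined functions (see the warning
# in the output), so the message printed next does not confirm a successful store.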
%store get_pinecone_client
print("Stored get_pinecone_client function for use in other notebooks.")

Proper storage of interactively declared classes (or instances
of those classes) is not possible! Only instances
of classes in real modules on file system can be %store'd.

Stored get_pinecone_client function for use in other notebooks.
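
Given the warning above, `%store` did not actually persist the function. A common alternative is to write the helper to a real module file that other notebooks can import. The following is a minimal sketch: the file name `pinecone_utils.py`, the function name `get_pinecone_index`, and the `PINECONE_API_KEY` environment variable are all assumptions made for illustration.

```python
import textwrap
from pathlib import Path

# Source of the shared helper module (hypothetical file written by this notebook).
HELPER_SOURCE = textwrap.dedent('''\
    """Shared Pinecone helper (hypothetical module written by the setup notebook)."""
    import os


    def get_pinecone_index(index_name="mirra", api_key=None):
        """Return a handle to the named Pinecone index."""
        api_key = api_key or os.environ.get("PINECONE_API_KEY")
        if not api_key:
            raise RuntimeError("Set PINECONE_API_KEY or pass api_key explicitly")
        # Imported lazily so the module can be loaded without the SDK installed.
        from pinecone import Pinecone
        return Pinecone(api_key=api_key).Index(index_name)
''')


def write_helper_module(directory=".", filename="pinecone_utils.py"):
    """Write the helper source to a .py file and return its path."""
    path = Path(directory) / filename
    path.write_text(HELPER_SOURCE, encoding="utf-8")
    return path
```

After calling `write_helper_module()`, notebooks in the same directory could use `from pinecone_utils import get_pinecone_index`. Reading the key from the environment also keeps it out of the notebook.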


In [12]:
print("""
Pinecone Setup Complete.

Successfully:
1. Initialized Pinecone with credentials
2. Created a vector index for resume-job matching
3. Configured the index for jina-embeddings-v3 (1024 dimensions)
4. Set up metadata indexing for efficient filtering
5. Tested basic vector operations (insert, query, filter, delete)

The vector database is now ready for storing job description and resume embeddings.
""")


Pinecone Setup Complete.

Successfully:
1. Initialized Pinecone with credentials
2. Created a vector index for resume-job matching
3. Configured the index for jina-embeddings-v3 (1024 dimensions)
4. Set up metadata indexing for efficient filtering
5. Tested basic vector operations (insert, query, filter, delete)

The vector database is now ready for storing job description and resume embeddings.

