# Embeddings and Vector Databases with Pinecone

**Embeddings** are numeric vector representations of text. Texts with similar meaning map to nearby points in the vector space, so the distance between two vectors reflects the semantic similarity of the texts they represent.
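To make "distance reflects similarity" concrete, here is a minimal sketch using cosine similarity, one common similarity measure. The three-dimensional vectors are invented for illustration; real embeddings have far more dimensions, and the metric used in practice depends on how the index is configured.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", not produced by any real model.
cars = [0.9, 0.1, 0.0]
trucks = [0.8, 0.2, 0.1]
mangoes = [0.0, 0.2, 0.9]

# Related texts should score higher than unrelated ones.
print(cosine_similarity(cars, trucks) > cosine_similarity(cars, mangoes))  # True
```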

**Vector Databases** store vectors together with their associated metadata and retrieve them by similarity. In Pinecone, vectors live in an index, which is roughly analogous to a table in a relational database, and each index can be partitioned into namespaces.

[Pinecone](https://www.pinecone.io/) is a hosted vector database service for storing and retrieving embeddings that scales as your data grows.

This notebook walks through the two basic operations:

1. **Store an embedding** - Store an embedding in a vector database.
    - Embed the text you want to store.
    - Create a document with the embedding and metadata.
    - Store the document in a vector database.
2. **Query a vector database** - Query a vector database for the most similar embeddings to a given query.
    - Embed the query.
    - Query the vector database with the embedded query.
    - Retrieve the most similar embeddings to the query.
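The two workflows above can be sketched without any external service. `ToyVectorStore` below is a hypothetical in-memory stand-in, not the Pinecone API; it ranks by dot product for brevity, and the two-dimensional vectors are invented for illustration.

```python
class ToyVectorStore:
    """In-memory stand-in for a vector index (illustration only)."""

    def __init__(self):
        self.records = {}  # id -> (values, metadata)

    def upsert(self, record_id, values, metadata):
        # Insert a new record or overwrite an existing one with the same id.
        self.records[record_id] = (values, metadata)

    def query(self, vector, top_k):
        # Score every stored vector against the query, highest score first.
        scored = [
            {
                "id": rid,
                "score": sum(x * y for x, y in zip(vector, values)),
                "metadata": meta,
            }
            for rid, (values, meta) in self.records.items()
        ]
        return sorted(scored, key=lambda m: m["score"], reverse=True)[:top_k]

store = ToyVectorStore()
store.upsert("a", [0.9, 0.1], {"payload": "I like cars."})
store.upsert("b", [0.1, 0.9], {"payload": "I like mangoes."})

matches = store.query([0.8, 0.2], top_k=1)
print(matches[0]["metadata"]["payload"])  # I like cars.
```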

### Install Pinecone and OpenAI

In [1]:
!pip install pinecone openai python-dotenv


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


### Import Libraries

In [1]:
import os
import uuid
from datetime import datetime, timezone
from pinecone import Pinecone
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()


True

### Define Environment variables

In [5]:
PINECONE_API_KEY="pcsk_XXXXXXXXXXXXXXXXXXXX" # Pinecone API key
PINECONE_INDEX_NAME="data-agent" # Name of the vector database index
PINECONE_NAMESPACE="https://XXXXXXXXXXXXXX.pinecone.io" # Namespace in your index on Pinecone.io

### Initialize Pinecone and OpenAI

In [6]:
# Initialize the Pinecone client for the vector database
pc = Pinecone(api_key=PINECONE_API_KEY)
# Connect to the vector database index
index = pc.Index(PINECONE_INDEX_NAME)
# Initialize OpenAI for embeddings (reads OPENAI_API_KEY from the environment)
client = OpenAI()

## 1. Store an embedding
-----------

In [7]:
string_to_store = "I like cars."

### Embed the string

In [8]:
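# OpenAI embeddings
def get_embeddings(string_to_embed):
    """Return the embedding vector for a single string."""
    response = client.embeddings.create(
        input=string_to_embed,
        model="text-embedding-ada-002"
    )
    # One input string yields one item in response.data
    return response.data[0].embedding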
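When many strings need embedding, the embeddings endpoint also accepts a list as `input`, which avoids one request per string. The helper below is a sketch under that assumption; `get_embeddings_batch` and the offline stub client are hypothetical names introduced here so the example runs without network access.

```python
from types import SimpleNamespace

def get_embeddings_batch(client, strings, model="text-embedding-ada-002"):
    """Embed several strings in one request; returns one vector per string."""
    response = client.embeddings.create(input=strings, model=model)
    # Each returned item carries the index of its input; sort to keep input order.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]

# Offline stand-in for the OpenAI client (hypothetical, for demonstration only).
# It returns a one-element "embedding" holding the length of each string.
def _fake_create(input, model):
    data = [
        SimpleNamespace(index=i, embedding=[float(len(s))])
        for i, s in enumerate(input)
    ]
    return SimpleNamespace(data=data)

fake_client = SimpleNamespace(embeddings=SimpleNamespace(create=_fake_create))

vectors = get_embeddings_batch(fake_client, ["I like cars.", "I like mangoes."])
print(len(vectors))  # 2
```

In the notebook, passing the real `client` in place of `fake_client` would send a single request for all strings.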

In [9]:
vector = get_embeddings(string_to_store)

In [10]:
print(f"Vector representation of {string_to_store}: \n", vector)

Vector representation of I like cars.: 
 [-0.016522599384188652, -0.00037778893602080643, 0.006583814509212971, -0.02332082949578762, -0.02381272427737713, -0.012795555405318737, -0.01558926235884428, -0.011679333634674549, 0.005253177601844072, -0.023800110444426537, -0.003348664380609989, 0.02036946453154087, 0.007548683788627386, -0.007996433414518833, 0.002733796602115035, -0.00278424727730453, 0.02301812544465065, -0.004969392437487841, 0.015463135205209255, -0.01851540245115757, -0.004203172866255045, -0.010569418780505657, -0.0021236585453152657, -0.02812625840306282, -0.0008032695041038096, 0.0025651021860539913, 0.015374846756458282, -0.009427972137928009, 1.2661939763347618e-05, -0.01971360482275486, 0.0475497730076313, -0.017090169712901115, -0.012991052120923996, -0.021996499970555305, -0.0015450522769242525, -0.025401920080184937, 0.006489219609647989, -0.012953213416039944, -0.004562634043395519, 0.006148677319288254, -0.007618053816258907, 0.019473964348435402, 0.0026912

### Define the vector metadata to store in the vector database

In [11]:
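user_id = "1234"
# Path template: one recall event per user
path = "user/{user_id}/recall/{event_id}"
# Timezone-aware UTC timestamp for the record
current_time = datetime.now(tz=timezone.utc)
path = path.format(
    user_id=user_id,
    event_id=str(uuid.uuid4()),  # Random unique event ID
)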

### Build the vector document to be stored

In [12]:
# Build document dictionary
documents = [
    {
        "id": str(uuid.uuid4()),
        "values": vector,
        "metadata": {
            "payload": string_to_store,
            "path": path,
            "timestamp": str(current_time),
            "type": "recall",  # Type of document, i.e., recall memory
            "user_id": user_id,
        },
    }
]


### Store the vector document in the vector database

In [13]:
index.upsert(
    vectors=documents,
    namespace=PINECONE_NAMESPACE
)

{'upserted_count': 1}
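With more than a handful of documents, it is common to upsert in batches rather than in one large call. The sketch below shows a generic chunking helper; `chunked` is a hypothetical name, and the batch size of 100 is an assumption to tune against your record sizes and the service's request limits.

```python
from itertools import islice

def chunked(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch

# Usage against the notebook's index (not executed here):
# for batch in chunked(documents, batch_size=100):
#     index.upsert(vectors=batch, namespace=PINECONE_NAMESPACE)

print([len(b) for b in chunked(range(250), 100)])  # [100, 100, 50]
```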

## 2. Query a vector database
-----------

In [30]:
query_string = "What do I like?"
user_id = "1234"
top_k = 10 # This is the number of most similar embeddings to return

### Embed the query

In [31]:
vector = get_embeddings(query_string)

### Query the vector database for the top_k most similar embeddings, with metadata filters

In [32]:
response = index.query(
    vector=vector,
    filter={
        "user_id": {"$eq": user_id},
        "type": {"$eq": "recall"},
    },
    namespace=PINECONE_NAMESPACE,
    include_metadata=True,
    top_k=top_k,
)
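The `filter` argument restricts the search to records whose metadata satisfies every clause, so here only `recall` records for user `1234` are candidates. The pure-Python sketch below mimics that behaviour for the `$eq` operator only; `matches_filter` is a hypothetical helper and the sample records are invented, so treat it as an illustration of the semantics rather than how Pinecone evaluates filters internally.

```python
def matches_filter(metadata, filter_spec):
    """Return True if metadata satisfies every {"field": {"$eq": value}} clause."""
    return all(
        field in metadata and metadata[field] == condition["$eq"]
        for field, condition in filter_spec.items()
    )

# Hypothetical metadata records for illustration.
records = [
    {"payload": "I like cars.", "type": "recall", "user_id": "1234"},
    {"payload": "hypothetical other user", "type": "recall", "user_id": "5678"},
    {"payload": "hypothetical other type", "type": "note", "user_id": "1234"},
]
query_filter = {"user_id": {"$eq": "1234"}, "type": {"$eq": "recall"}}

kept = [r["payload"] for r in records if matches_filter(r, query_filter)]
print(kept)  # ['I like cars.']
```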

In [33]:
response

{'matches': [{'id': 'user/1234/recall/15d72db4-bfa7-4982-bc18-9e0164e06113',
              'metadata': {'path': 'user/1234/recall/15d72db4-bfa7-4982-bc18-9e0164e06113',
                           'payload': 'I like mangoes.',
                           'timestamp': '2025-01-15 09:15:36.344248+00:00',
                           'type': 'recall',
                           'user_id': '1234'},
              'score': 0.83735317,
              'values': []},
             {'id': '4dceebf5-90e6-465e-8534-3ec01fc176c6',
              'metadata': {'path': 'user/1234/recall/1dc67b72-359f-4c7d-b000-1f24bfa3cf41',
                           'payload': 'I like oranges.',
                           'timestamp': '2025-01-15 10:51:06.270951+00:00',
                           'type': 'recall',
                           'user_id': '1234'},
              'score': 0.836027,
              'values': []},
             {'id': 'user/1234/recall/847ae4bd-2f01-4dcb-8470-380bc4c86590',
              'metadata': 

### Build the memories list

In [42]:
memories = []
if matches := response.get("matches"):
    # Keep only the stored text of each match
    memories = [m["metadata"]["payload"] for m in matches]
memories

['I like mangoes.',
 'I like oranges.',
 'You like mangoes and charts!',
 'User likes charts.',
 'User likes Python.']
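Because `top_k` always returns the nearest records, however weak the match, it can help to drop low-scoring or duplicate memories before using them. The sketch below assumes a dict-shaped response like the one printed earlier; `select_memories` and the `0.8` threshold are hypothetical choices, and the third sample match is invented for illustration.

```python
def select_memories(response, min_score=0.8):
    """Collect unique payloads from matches scoring at least min_score."""
    seen = set()
    selected = []
    for match in response.get("matches", []):
        payload = match["metadata"]["payload"]
        if match["score"] >= min_score and payload not in seen:
            seen.add(payload)
            selected.append(payload)
    return selected

sample_response = {
    "matches": [
        {"score": 0.83735317, "metadata": {"payload": "I like mangoes."}},
        {"score": 0.836027, "metadata": {"payload": "I like oranges."}},
        # Invented low-scoring match to show the threshold in action.
        {"score": 0.41, "metadata": {"payload": "hypothetical weak match"}},
    ]
}

print(select_memories(sample_response))  # ['I like mangoes.', 'I like oranges.']
```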