# Pinecone Experiment Example

## Installation

In [1]:
# !pip install --quiet --force-reinstall prompttools

## Setup imports and API keys

First, we will set the API key and Pinecone environment name.

In [1]:
import os

# os.environ["DEBUG"] = "1"  # Set this to "" to call the API
os.environ["PINECONE_API_KEY"] = ""  # Insert your key here
os.environ["PINECONE_ENVIRONMENT"] = ""  # Insert the environment name here

We'll import the relevant `prompttools` modules to set up our experiment.

In [2]:
from prompttools.experiment import PineconeExperiment
import pinecone

## Inserting data in advance

In general, we recommend inserting your data before the experiment because Pinecone is **eventually consistent**: there will be a **delay** before you can successfully query the data that you just inserted.

Here is an example of how you can insert your data:

In [7]:
index_name = "test"
data = [
    ("A", [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]),
    ("B", [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]),
    ("C", [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]),
    ("D", [0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4]),
    ("E", [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5])
]

pinecone.init(api_key=os.environ["PINECONE_API_KEY"], environment=os.environ["PINECONE_ENVIRONMENT"])
try:
    pinecone.delete_index(index_name)  # Optional, delete your index if it already exists
except Exception:
    pass
create_index_params = {"dimension": 8, "metric": "euclidean"}
pinecone.create_index(index_name, **create_index_params)
index = pinecone.Index(index_name)
index.upsert(data)


You can have a look at your index's status:

In [25]:
pinecone.describe_index(index_name)

IndexDescription(name='test', metric='euclidean', replicas=1, dimension=8.0, shards=1, pods=1, pod_type='starter', status={'ready': True, 'state': 'Ready'}, metadata_config=None, source_collection='')

Make sure the vector count matches what you expect before you query. Because Pinecone is eventually consistent, there is generally a delay after insertion. You can check the current count with:

In [None]:
index.describe_index_stats()
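
Rather than re-running the cell above by hand, the check can be repeated until the count catches up. The following is a minimal sketch, not part of `prompttools` or the Pinecone client: `wait_for_vector_count` is a hypothetical helper, and it takes any zero-argument callable that returns the current count, so it does not depend on a particular client version.

```python
import time
from typing import Callable


def wait_for_vector_count(
    get_count: Callable[[], int],
    expected: int,
    timeout_s: float = 60.0,
    poll_interval_s: float = 2.0,
) -> bool:
    """Poll `get_count` until it reports at least `expected` vectors.

    Returns True once the count is reached, or False if `timeout_s` elapses first.
    (Hypothetical helper for illustration only.)
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if get_count() >= expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
```

With the index from the earlier cell, a call might look like `wait_for_vector_count(lambda: index.describe_index_stats().total_vector_count, expected=len(data))`. This assumes the stats response exposes a `total_vector_count` field, which may differ between client versions.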

## Run an experiment

You can also insert your data during the experiment, but the experiment will then be delayed while it waits for the data to show up in Pinecone.

If you choose to do this, a new Pinecone index will be temporarily created. The data will be added into it. Then, we will query from it and examine the results. The experiment will automatically clean up the index afterwards.

In [3]:
index_name = "test"

# Index configuration
create_index_params = {"dimension": 8, "metric": "euclidean"}

# Documents that will be added into the database
data = [
    ("A", [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]),
    ("B", [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]),
    ("C", [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]),
    ("D", [0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4]),
    ("E", [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5])
]

# Our test queries
test_queries =  [
    [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3],
    [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2],
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
]

query_index_params = {
  "vector": test_queries,
  "top_k": [3],
  "include_values": [True],
}


# Set up the experiment
experiment = PineconeExperiment(
    index_name,
    use_existing_index = False,  # Switch to `True` if you inserted data in advance
    query_index_params = query_index_params,
    create_index_params = create_index_params,  # Optional if you inserted data in advance
    data = data,  # Optional if you inserted data in advance
)
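
Every value in `query_index_params` is a list, and the results further down contain one row per test vector. A plausible reading is that each combination of the listed values becomes one query. The sketch below illustrates that idea with a Cartesian product. `expand_params` is a hypothetical helper, and the expansion rule is an assumption about how the experiment treats its arguments, not something this notebook confirms.

```python
from itertools import product
from typing import Any, Dict, List


def expand_params(params: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Return one dict per combination of the listed values (hypothetical helper)."""
    keys = list(params)
    return [dict(zip(keys, combo)) for combo in product(*(params[k] for k in keys))]


# Same shape as the notebook's `query_index_params`.
example_params = {
    "vector": [[0.3] * 8, [0.2] * 8, [0.1] * 8],
    "top_k": [3],
    "include_values": [True],
}

combos = expand_params(example_params)
print(len(combos))  # 3 vectors x 1 top_k x 1 include_values
```

Under this assumption, adding a second value to `top_k` (for example `[1, 3]`) would double the number of queries to six.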

We can then run the experiment to get results.

In [4]:
experiment.run()


Waiting for Pinecone's eventual consistency after inserting data.
Waiting for Pinecone's eventual consistency after inserting data.
Waiting for Pinecone's eventual consistency after inserting data.
Waiting for Pinecone's eventual consistency after inserting data.


You can see the top 3 doc IDs for each of your queries.

In [7]:
experiment.visualize()

Unnamed: 0,vector,top doc ids,scores,documents,latency
0,"[0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]","[C, D, B]","[0.0, 0.0799999237, 0.0800000429]","[[0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3], [0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4], [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]]",0.368411
1,"[0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]","[B, A, C]","[0.0, 0.0800000131, 0.0800000429]","[[0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2], [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1], [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]]",0.107679
2,"[0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]","[A, B, C]","[0.0, 0.0800000131, 0.32]","[[0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1], [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2], [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]]",0.127923
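
The scores in this table are consistent with the squared Euclidean distance (no square root) between the query and each returned vector. For example, eight components that each differ by 0.1 give 8 × 0.01 = 0.08, and eight that differ by 0.2 give 8 × 0.04 = 0.32. The small deviations such as 0.0799999237 are presumably floating-point rounding on the service side. A minimal sketch for checking a score by hand, where `squared_euclidean` is a hypothetical helper and not part of any library used here:

```python
from typing import Sequence


def squared_euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of squared component-wise differences (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b))


query = [0.3] * 8
doc_d = [0.4] * 8
print(round(squared_euclidean(query, doc_d), 6))
```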


## Evaluate the model response

To evaluate the results, we'll define an evaluation function. Sometimes you know the order in which the most relevant documents should be returned for a given query, so you can compute the correlation between the expected ranking and the actual ranking.

Note: there is a built-in version of this function that you can import (scroll further below to see an example).

In [5]:
import scipy.stats as stats

# For each query, you can define what the expected ranking is.
EXPECTED_RANKING = {
    (0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3): ["C", "D", "B"],
    (0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2): ["B", "C", "A"],
    (0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1): ["A", "C", "B"],
}


def measure_correlation(row: "pandas.core.series.Series", ranking_column_name: str = "top doc ids") -> float:
    r"""
    A simple test that compares the expected ranking for a given query with the actual ranking produced
    by the embedding function being tested.
    """
    input_query = tuple(row["vector"])
    correlation, _ = stats.spearmanr(row[ranking_column_name], EXPECTED_RANKING[input_query])
    return correlation
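
To see where a Spearman value comes from, it can be worked out by hand with the no-ties formula $\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$. The sketch below is a pure-Python illustration, not the SciPy implementation. `ranks_by_sort_order` and `spearman_no_ties` are hypothetical helpers, and the sketch assumes that each list of doc IDs is turned into ranks by the sort order of the IDs within that list, an assumption chosen because it reproduces the `ranking_correlation` values printed in this notebook.

```python
from typing import List, Sequence


def ranks_by_sort_order(items: Sequence[str]) -> List[int]:
    """Rank each item by its position in the sorted list (1 = smallest)."""
    if len(set(items)) != len(items):
        raise ValueError("this sketch assumes no repeated IDs")
    order = {item: i + 1 for i, item in enumerate(sorted(items))}
    return [order[item] for item in items]


def spearman_no_ties(actual: Sequence[str], expected: Sequence[str]) -> float:
    """Spearman correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid without ties."""
    n = len(actual)
    if n != len(expected) or n < 2:
        raise ValueError("need two rankings of equal length >= 2")
    diffs = zip(ranks_by_sort_order(actual), ranks_by_sort_order(expected))
    d_squared = sum((a - e) ** 2 for a, e in diffs)
    return 1 - 6 * d_squared / (n * (n * n - 1))


print(spearman_no_ties(["A", "B", "C"], ["A", "C", "B"]))  # ranks [1,2,3] vs [1,3,2]
```

Note that this compares the ID labels position by position. It does not compare where each individual document lands in the two rankings, so a position-based formulation could give different numbers.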

Finally, we can evaluate and visualize the results.

In [6]:
experiment.evaluate("ranking_correlation", measure_correlation)

In [7]:
experiment.visualize()

Unnamed: 0,vector,top doc ids,scores,documents,latency,ranking_correlation
0,"[0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]","[C, D, B]","[0.0, 0.0799999237, 0.0800000429]","[[0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3], [0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4], [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]]",0.390167,1.0
1,"[0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]","[B, A, C]","[0.0, 0.0800000131, 0.0800000429]","[[0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2], [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1], [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]]",0.102859,-1.0
2,"[0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]","[A, B, C]","[0.0, 0.0800000131, 0.32]","[[0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1], [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2], [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]]",0.112139,0.5
