# Lab #2
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/basic-operations-workshop/blob/main/lab2.ipynb)
1. Install dependencies
2. Create a wrongly sized Pinecone index - s1.x2 should be s1.x1
3. Insert data and get statistics about your index
4. Query for top_k=10 with meta-data filter on category, timestamp, and transaction_id
5. Create a backup (aka collection) and delete the misconfigured index
6. Restore the index - s1.x1 with high cardinality meta-data filter exclusion
7. Query for top_k=10 with meta-data filter
8. TEARDOWN: Delete the index and backup (aka collection)

# 1. Install Pinecone client 
Use the following shell command to install Pinecone:

In [10]:
!pip install -U "pinecone-client[grpc]" "python-dotenv"

try:
    import pinecone
    import dotenv
    import numpy
    print("SUCCESS: lab dependencies are installed.")
except ImportError as ie:
    print(f"ERROR: key dependencies are not installed: {ie}")

SUCCESS: lab dependencies are installed.


# 2. Create a wrongly sized Pinecone index - s1.x2 should be s1.x1

* To use Pinecone, you must have an API key. To find your API key, open the [Pinecone console](https://app.pinecone.io/organizations/-NF9xx-MFLRfp0AAuCon/projects/us-east4-gcp:55a4eee/indexes) and click API Keys. This view also displays the environment for your project. Note both your API key and your environment.
* Create a `.env` file and make sure the following properties are specified: `PINECONE_API_KEY`, `PINECONE_ENVIRONMENT`, `PINECONE_INDEX_NAME`, and `DIMENSIONS`.
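
Before running the configuration cell, a small pre-flight check can report which properties are still missing instead of failing on the first `KeyError`. The following is a minimal sketch: `REQUIRED_SETTINGS` and `missing_settings` are hypothetical names introduced here for illustration, and the property list is taken from the variables the lab reads from the environment.

```python
import os

# Property names read by the configuration cell in this lab.
REQUIRED_SETTINGS = (
    "PINECONE_API_KEY",
    "PINECONE_ENVIRONMENT",
    "PINECONE_INDEX_NAME",
    "DIMENSIONS",
)


def missing_settings(env=os.environ):
    """Return the required property names that are absent or empty in `env`."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


# Example with an in-memory mapping instead of the real environment.
example_env = {"PINECONE_API_KEY": "xxxx", "DIMENSIONS": "512"}
print(missing_settings(example_env))
```

In the notebook, calling `missing_settings()` right after `load_dotenv('.env')` would check the real environment; an empty list means every property is set.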

In [11]:
import os
from dotenv import load_dotenv

load_dotenv('.env')

PINECONE_API_KEY = os.environ['PINECONE_API_KEY']
PINECONE_ENVIRONMENT = os.environ['PINECONE_ENVIRONMENT']
PINECONE_INDEX_NAME = os.environ['PINECONE_INDEX_NAME']
PINECONE_COLLECTION_NAME = PINECONE_INDEX_NAME
DIMENSIONS = int(os.environ['DIMENSIONS'])
METRIC = "euclidean"

# print all values to verify (the API key is masked so it does not leak into the notebook output)
print(f"PINECONE_API_KEY: {PINECONE_API_KEY[:4]}...")
print(f"PINECONE_ENVIRONMENT: {PINECONE_ENVIRONMENT}")
print(f"PINECONE_INDEX_NAME: {PINECONE_INDEX_NAME}")
print(f"PINECONE_COLLECTION_NAME: {PINECONE_COLLECTION_NAME}")
print(f"DIMENSIONS: {DIMENSIONS}")
print(f"METRIC: {METRIC}")


PINECONE_API_KEY: d374...
PINECONE_ENVIRONMENT: us-east4-gcp
PINECONE_INDEX_NAME: james-williams
PINECONE_COLLECTION_NAME: james-williams
DIMENSIONS: 512
METRIC: euclidean


In [12]:
# initialize connection to pinecone
import pinecone

pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENVIRONMENT)

if PINECONE_INDEX_NAME not in pinecone.list_indexes():
    pinecone.create_index(PINECONE_INDEX_NAME, dimension=DIMENSIONS, metric=METRIC, pods=1, replicas=1, pod_type="s1.x2")
else:
    print(f"Index {PINECONE_INDEX_NAME} already exists")

print(f"Index Description: {pinecone.describe_index(name=PINECONE_INDEX_NAME)}")

Index Description: IndexDescription(name='james-williams', metric='euclidean', replicas=1, dimension=512.0, shards=1, pods=1, pod_type='s1.x2', status={'ready': True, 'state': 'Ready'}, metadata_config=None, source_collection='')


# 3. Insert data and get statistics about your index

* The upsert operation inserts a new vector in the index or updates the vector if a vector with the same ID is already present.
* The following code upserts a batch of 500 vectors with meta-data into your index.
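
This lab sends all 500 vectors in a single upsert call. For larger data sets it is common to split the vectors into smaller batches so that each request stays small. The sketch below shows one way to do that; `chunked` is a hypothetical helper, the batch size of 100 is an assumption rather than a value from this lab, and the list of dictionaries is a stand-in for the output of `generate_vectors`.

```python
def chunked(items, batch_size=100):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in for generate_vectors(DIMENSIONS): 500 records with string IDs.
vectors = [{"id": str(i)} for i in range(1, 501)]
batches = list(chunked(vectors))
print(len(batches), len(batches[0]))

# In the lab, the single upsert call could then be written as:
# for batch in chunked(generate_vectors(DIMENSIONS)):
#     index.upsert(batch)
```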

In [13]:
import numpy as np
import random
import time
import uuid

def generate_vectors(dimensions):
    vectors = []
    id_seed = 1
    value_seed = 0.1

    for _ in range(500):
        meta_data = {"category": random.choice(["one", "two", "three"]),
                     "timestamp": time.time(),
                     "transaction_id": str(uuid.uuid4())}
        embeddings = np.full(shape=dimensions, fill_value=value_seed).tolist()
        vectors.append({'id': str(id_seed),
                        'values': embeddings,
                        'metadata': meta_data})
        id_seed = id_seed + 1
        value_seed = value_seed + 0.1
    return vectors

index = pinecone.Index(PINECONE_INDEX_NAME)
index.upsert(generate_vectors(DIMENSIONS))

print(f"Index Description: {pinecone.describe_index(name=PINECONE_INDEX_NAME)}")
print(f"Index Stats: {index.describe_index_stats()}")

Index Description: IndexDescription(name='james-williams', metric='euclidean', replicas=1, dimension=512.0, shards=1, pods=1, pod_type='s1.x2', status={'ready': True, 'state': 'Ready'}, metadata_config=None, source_collection='')
Index Stats: {'dimension': 512,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 500}},
 'total_vector_count': 500}


# 4. Query for top_k=10 with meta-data filter on category, timestamp, and transaction_id

1. Run the query below as-is. This will select the top 10 embeddings that match "category" = "one".
2. Add a timestamp greater-than filter and re-run the query: `,"timestamp": {"$gt": SOMETIMESTAMP}`
3. Add a transaction_id equality filter and re-run the query: `,"transaction_id": {"$eq": "SOME_TRANSACTION_ID"}`

Both the timestamp filter and the transaction_id filter should work at this point. Later in the lab we re-configure the index to disable meta-data filtering by "transaction_id".
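
The three steps above build up a single filter dictionary, one key at a time. As a minimal sketch, the hypothetical helper `build_filter` below assembles that dictionary from optional arguments; the operators `$eq` and `$gt` are the ones used in this lab, and the timestamp value is only an illustration.

```python
def build_filter(category, min_timestamp=None, transaction_id=None):
    """Assemble the meta-data filter used in this step, adding optional clauses."""
    query_filter = {"category": {"$eq": category}}
    if min_timestamp is not None:
        # Strictly greater than, matching the "$gt" operator in step 2.
        query_filter["timestamp"] = {"$gt": min_timestamp}
    if transaction_id is not None:
        query_filter["transaction_id"] = {"$eq": transaction_id}
    return query_filter


print(build_filter("one"))
print(build_filter("one", min_timestamp=1692025616.0))
print(build_filter("one", transaction_id="SOME_TRANSACTION_ID"))
```

The returned dictionary can be passed as the `filter` argument of `index.query` in the cell below.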

In [14]:
embedding = np.full(DIMENSIONS,0.5).tolist()

query_results = index.query(
  vector = embedding,
  top_k=10,
  include_values=False,
  include_metadata=True,
  filter={
        "category": {"$eq": "one"}
  },).matches
print(f"Query results: {query_results}")

Query results: [{'id': '4',
 'metadata': {'category': 'one',
              'timestamp': 1692025616.716545,
              'transaction_id': '664b73b1-0e75-4514-9b9c-d0d27aeedd98'},
 'score': 5.1199646,
 'values': []}, {'id': '6',
 'metadata': {'category': 'one',
              'timestamp': 1692025616.716604,
              'transaction_id': 'aab294d0-4354-4c1c-9ad5-a29e874bad6a'},
 'score': 5.11999512,
 'values': []}, {'id': '7',
 'metadata': {'category': 'one',
              'timestamp': 1692025616.716626,
              'transaction_id': '63376adb-1613-4aa8-9588-1dbd5beaf673'},
 'score': 20.480011,
 'values': []}, {'id': '8',
 'metadata': {'category': 'one',
              'timestamp': 1692025616.716654,
              'transaction_id': '1fd78a9f-5b06-476c-9589-b697e432392f'},
 'score': 46.0799561,
 'values': []}, {'id': '1',
 'metadata': {'category': 'one',
              'timestamp': 1692025616.7161732,
              'transaction_id': '10a23b6a-2a63-45bf-868a-4b84d9f68db2'},
 'score': 81.

Notice that the index description printed in steps 2 and 3 says `metadata_config=None`. We are going to change that when we create the new index.
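
The scores in the query results above appear to be squared Euclidean distances, with the lowest score being the closest match. As a rough check, the sketch below recomputes the distance between the query vector (every component 0.5) and the stored vectors for IDs 4, 6, 7, and 8, assuming the vector with ID `n` has every component equal to `n * 0.1`, as `generate_vectors` produces. The helper name `squared_euclidean` is introduced here for illustration.

```python
DIMENSIONS = 512  # matches the DIMENSIONS value printed earlier in this lab


def squared_euclidean(a, b):
    """Sum of squared component-wise differences between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


query = [0.5] * DIMENSIONS
for vector_id in (4, 6, 7, 8):
    stored = [vector_id * 0.1] * DIMENSIONS  # assumed contents of the vector with this ID
    print(vector_id, round(squared_euclidean(stored, query), 2))
```

The computed values (about 5.12, 5.12, 20.48, and 46.08) line up with the reported scores. The small differences in the last digits are plausibly due to floating-point precision on the server side and to the accumulated `value_seed` in the lab code.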

# 5. Create a backup (aka collection) and delete the misconfigured index

In [15]:
import time

pinecone.create_collection(name=PINECONE_COLLECTION_NAME, source=PINECONE_INDEX_NAME)

while pinecone.describe_collection(name=PINECONE_COLLECTION_NAME).status != "Ready":
    print("collection initializing, please hold...")
    time.sleep(10)
print(pinecone.describe_collection(name=PINECONE_COLLECTION_NAME))

pinecone.delete_index(PINECONE_INDEX_NAME)
print(f"{PINECONE_INDEX_NAME} should not exist in: {pinecone.list_indexes()}")

collection initializing, please hold...
collection initializing, please hold...
{'name': 'james-williams', 'status': 'Ready', 'size': 3262431, 'dimension': 512.0, 'vector_count': 500.0}
james-williams should not exist in: []


### WARNING: You must wait for the collection status to be 'Ready' before moving on
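
One way to make this wait explicit is to poll with a timeout instead of relying on a fixed sleep. The sketch below is a generic helper, not part of the Pinecone client: `wait_until`, its parameters, and the default values are assumptions made for illustration. The demo uses a fake sequence of states so that it runs without a Pinecone connection.

```python
import time


def wait_until(is_ready, timeout_s=300.0, interval_s=10.0):
    """Call `is_ready` repeatedly until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while not is_ready():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"resource not ready after {timeout_s} seconds")
        time.sleep(interval_s)


# Demo with a fake status sequence instead of a real collection.
states = iter(["Initializing", "Initializing", "Ready"])
wait_until(lambda: next(states) == "Ready", timeout_s=5, interval_s=0)
print("ready")

# In the lab, the same helper could wrap the status check used in step 5:
# wait_until(lambda: pinecone.describe_collection(name=PINECONE_COLLECTION_NAME).status == "Ready")
```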

# 6. Restore the index - s1.x1 with high cardinality meta-data filter exclusion

Create a new index with `metadata_config` and the right size (scaled down to s1.x1), using PINECONE_COLLECTION_NAME as the source.

In [17]:
import time
# check if index already exists (it shouldn't because we just deleted it)
if PINECONE_INDEX_NAME not in pinecone.list_indexes():
    # if does not exist, create index
    pinecone.create_index(
        PINECONE_INDEX_NAME,
        dimension=DIMENSIONS,
        metric=METRIC,
        replicas=1,
        pods=1,
        pod_type='s1.x1',
        source_collection=PINECONE_COLLECTION_NAME,
        # Only these fields are indexed for filtering; all other fields will be stored-only.
        # If no fields need to be indexed, you can put a dummy value here as a placeholder.
        metadata_config={"indexed": ["category", "timestamp"]}
    )

    print("Sleeping for additional 10 seconds to give the index time to be created")
    time.sleep(10)
print(f"Index Description: {pinecone.describe_index(name=PINECONE_INDEX_NAME)}")
index = pinecone.Index(PINECONE_INDEX_NAME)
print(f"Index Stats: {index.describe_index_stats()}")

Index Description: IndexDescription(name='james-williams', metric='euclidean', replicas=1, dimension=512.0, shards=1, pods=1, pod_type='s1.x1', status={'ready': True, 'state': 'Ready'}, metadata_config={'indexed': ['category', 'timestamp']}, source_collection='james-williams')
Index Stats: {'dimension': 512,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 500}},
 'total_vector_count': 500}


Notice that the index description now says `metadata_config={'indexed': ['category', 'timestamp']}`.

As a result, only the metadata fields 'category' and 'timestamp' are indexed. All other fields are stored-only: you can retrieve them with query results, but you cannot use them in filters.

We have also resized the index from s1.x2 to s1.x1, scaling the pod size down to what this data set needs. The pod count stays at 1.
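
Because a filter on a stored-only field returns no matches rather than raising an error, it can help to check a filter locally before sending it. The sketch below flags filter keys that are not in the indexed list. `unindexed_filter_fields` is a hypothetical helper, and it only inspects top-level field names, so filters nested under operators such as `$and` or `$or` are not examined.

```python
# Fields listed under metadata_config["indexed"] in this lab.
INDEXED_FIELDS = frozenset({"category", "timestamp"})


def unindexed_filter_fields(query_filter, indexed=INDEXED_FIELDS):
    """Return top-level filter fields that are not indexed (operator keys are skipped)."""
    return sorted(
        key for key in query_filter
        if not key.startswith("$") and key not in indexed
    )


query_filter = {
    "category": {"$eq": "one"},
    "transaction_id": {"$eq": "SOME_TRANSACTION_ID"},
}
print(unindexed_filter_fields(query_filter))
```

An empty list means every top-level field in the filter is indexed.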

# 7. Query for top_k=10 with meta-data filter

1. Run the query below as-is. This will select the top 10 embeddings that match "category" = "one" and the timestamp filter.
2. Change the timestamp greater-than filter and re-run the query: `,"timestamp": {"$gt": SOMETIMESTAMP}`
3. Add a transaction_id equality filter and re-run the query: `,"transaction_id": {"$eq": "SOME_TRANSACTION_ID"}`

The transaction_id filter should **NOT** return any results. This shows that we have successfully re-configured the index to disable meta-data filtering by "transaction_id".

In [None]:
embedding = np.full(DIMENSIONS,0.5).tolist()

query_results = index.query(
  vector = embedding,
  top_k=10,
  include_values=False,
  include_metadata=True,
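  filter={
        "category": {"$eq": "one"},
        # Placeholder value: replace it with a Unix timestamp that makes sense for your data.
        # A value later than every stored timestamp returns no matches.
        "timestamp": {"$gt": 1791769493},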
  },).matches
print(f"Query results: {query_results}")

# 8. TEARDOWN: Delete the index and backup (aka collection)
### WARNING: This next step will delete the PINECONE_INDEX_NAME index and all data in it. Do not run it until you are ready, or remove the index manually instead.

In [9]:
if PINECONE_INDEX_NAME in pinecone.list_indexes():
    pinecone.delete_index(PINECONE_INDEX_NAME)
if PINECONE_COLLECTION_NAME in pinecone.list_collections():
    pinecone.delete_collection(PINECONE_COLLECTION_NAME)
    
# Verify that both the index and the collection are gone
print(f"{PINECONE_INDEX_NAME} index should not exist in index list: {pinecone.list_indexes()}")
print(f"{PINECONE_COLLECTION_NAME} collection should not exist in collection list: {pinecone.list_collections()}")

james-williams index should not exist in index list: []
james-williams collection should not exist in collection list: []
