# Lab #3 
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/basic-operations-workshop/blob/main/lab3.ipynb)
1. Install dependencies
2. Create a Pinecone index
3. Load a public image dataset (fashion-mnist) and create vector embeddings from the dataset
4. Insert the fashion-mnist embeddings into Pinecone
5. Run a nearest neighbor search on a sample image that is not in the training dataset
6. Run a nearest neighbor search on 100 random test images that are not in the training dataset
7. Run a load test script to simulate 10 concurrent users querying the index
8. TEARDOWN: Delete the index 

# 1. Install dependencies
Use the following shell command to install the Pinecone client and the other libraries this lab uses:

In [None]:
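# The code in this lab uses the pinecone-client 2.x API (pinecone.init), which was removed in 3.0, so pin below 3.
!pip install -U "pinecone-client[grpc]<3" "python-dotenv" "torch" "torchvision" "pillow" "ftfy" "regex" "tqdm" "git+https://github.com/openai/clip.git" "datasets" "locust"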

# 2. Create a Pinecone index
We will create an index that we will use to load and query a Hugging Face dataset.

In [None]:
from dotenv import load_dotenv
import os
import pinecone

load_dotenv('.env')

PINECONE_INDEX_NAME = os.environ['PINECONE_INDEX_NAME']
PINECONE_API_KEY = os.environ['PINECONE_API_KEY']
PINECONE_ENVIRONMENT = os.environ['PINECONE_ENVIRONMENT']
METRIC = os.environ['METRIC']
DIMENSIONS = 512

pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENVIRONMENT)
pinecone.create_index(PINECONE_INDEX_NAME, dimension=DIMENSIONS, metric=METRIC, pods=1, replicas=1, pod_type="s1")
pinecone.describe_index(PINECONE_INDEX_NAME)
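
The cell above reads four settings from `.env` with `os.environ[...]`, which raises a bare `KeyError` when one is missing. A minimal sketch of a pre-flight check is shown below; `missing_settings` is a hypothetical helper (not part of the lab or of any library), and the example values are made up.

```python
import os

# Names the lab reads from the environment in step 2.
REQUIRED_VARS = ("PINECONE_INDEX_NAME", "PINECONE_API_KEY", "PINECONE_ENVIRONMENT", "METRIC")

def missing_settings(env, required=REQUIRED_VARS):
    """Return the required names that are absent or empty in env (hypothetical helper)."""
    return [name for name in required if not env.get(name)]

# Made-up environment for illustration; pass os.environ in the notebook instead.
example_env = {"PINECONE_INDEX_NAME": "lab3", "PINECONE_API_KEY": "", "METRIC": "cosine"}
print(missing_settings(example_env))  # ['PINECONE_API_KEY', 'PINECONE_ENVIRONMENT']
```

Calling `missing_settings(os.environ)` right after `load_dotenv('.env')` would list every missing name at once instead of failing on the first one.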

# 3. Load a public image dataset (fashion-mnist) and create vector embeddings from the dataset

The following code downloads the [fashion-mnist](https://huggingface.co/datasets/fashion_mnist) training dataset from Hugging Face and creates a vector embedding for each image, storing the image's label (image class) as metadata. The label mappings are:

| Label  | Description |
| ------ | ----------- |
| 0      | T-shirt/top |
| 1      | Trouser     |
| 2      | Pullover    |
| 3      | Dress       |
| 4      | Coat        |
| 5      | Sandal      |
| 6      | Shirt       |
| 7      | Sneaker     |
| 8      | Bag         |
| 9      | Ankle boot  |

In [None]:
from datasets import load_dataset
from tqdm.auto import tqdm  # progress bar
import torch
import clip
import time

# Load the fashion-mnist training split and keep 6000 shuffled images (10% of the 60,000 training images)
dataset = load_dataset("fashion_mnist")['train'].shuffle(seed=42).select(range(0,6000))

label_descriptions = {0: "T-shirt/top", 
           1: "Trouser",
           2: "Pullover",
           3: "Dress",
           4: "Coat",
           5: "Sandal",
           6: "Shirt",
           7: "Sneaker",
           8: "Bag",
           9: "Ankle boot"}

# Check to see if a GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device=device)

# Generate vector embeddings for each image in the dataset
id = 0
vectors = []
for image in tqdm(dataset, total=dataset.num_rows):
    with torch.no_grad():
        image_pp = preprocess(image['image']).unsqueeze(0).to(device)
        image_features = model.encode_image(image_pp)
        embedding_numpy = image_features.cpu().numpy().squeeze().tolist()
        id += 1
        meta_data = {"description": label_descriptions[image["label"]], "timestamp": time.time()}
        vectors.append({'id': str(id),
                        'values': embedding_numpy,
                        'metadata': meta_data})

# 4. Insert the fashion-mnist embeddings into Pinecone

Bulk inserts work best when the dataset is split into batches, so we upsert 100 vectors per request. We also store the data in a namespace (`fashion-mnist`).

In [None]:
from tqdm.auto import tqdm  # progress bar
import pinecone
import itertools

def chunks(iterable, batch_size=100):
    """A helper function to break an iterable into chunks of size batch_size."""
    it = iter(iterable)
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))

index = pinecone.Index(PINECONE_INDEX_NAME)

# Upsert the embeddings in batches of 100
batch_size = 100
num_batches = (len(vectors) + batch_size - 1) // batch_size  # ceiling division so tqdm gets an integer total
for vector_batch in tqdm(chunks(vectors, batch_size=batch_size), total=num_batches):
    index.upsert(vector_batch, namespace="fashion-mnist")
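
To see how the batching helper behaves without calling Pinecone, here is a small standalone sketch. It repeats the `chunks` function from the cell above and runs it on a plain list of integers, which stands in for the list of vector dictionaries (the 250-item list is made up for illustration).

```python
import itertools

def chunks(iterable, batch_size=100):
    """Break an iterable into tuples of at most batch_size items."""
    it = iter(iterable)
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))

items = list(range(250))  # stand-in for the list of vector dicts
batches = list(chunks(items, batch_size=100))

print([len(batch) for batch in batches])  # [100, 100, 50]
print(batches[-1][0], batches[-1][-1])    # 200 249
```

The last batch is simply shorter than the others, so no vectors are dropped when the dataset size is not a multiple of the batch size.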

# 5. Run a nearest neighbor search on a sample image that is not in the training dataset

Download a sneaker image file from GitHub and use it as a query to check whether the Pinecone search returns the correct description, "Sneaker".
You can change top_k from 1 to 10 to 100 to see how the ANN results vary.

In [None]:
import pinecone
from PIL import Image
import torch
import clip
import requests

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device=device)

def image_to_embedding():
    
    url = "https://github.com/pinecone-io/basic-operations-workshop/blob/main/sneaker.png?raw=true"
    response = requests.get(url)
    response.raise_for_status()  # fail early if the download did not succeed
    with open("sneaker.png", "wb") as file:
        file.write(response.content)

    # Load and preprocess the image
    image = preprocess(Image.open("./sneaker.png")).unsqueeze(0).to(device)
    
    # Generate the image features
    with torch.no_grad():
        image_features = model.encode_image(image)
    return image_features

embedding = image_to_embedding().cpu().numpy().squeeze().tolist()

index = pinecone.Index(PINECONE_INDEX_NAME)
top_k = 10

query_result = index.query(
  vector = embedding,
  namespace="fashion-mnist",
  top_k=top_k,
  include_values=False,
  include_metadata=True
)

top_k_contains = False
match_cnt = 0
miss_categories = set()

for match in query_result.matches:
  if match.metadata['description'] == "Sneaker":
    match_cnt += 1
    top_k_contains = True
  else:
    miss_categories.add(match.metadata['description'])

print(f"top_k contains matching result: {top_k_contains}")
print(f"top_k: {top_k} match percentage is: {match_cnt/top_k * 100}%")
print(f"Match miss categories: {miss_categories} expected 'Sneaker'")
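
As a point of comparison for the ANN query above, the sketch below does an exact (brute-force) nearest neighbor search in pure Python. It assumes the index metric is cosine similarity (the lab reads `METRIC` from `.env`, so this may differ in your setup), and the two-dimensional toy vectors and the helper names are made up for illustration; this is not how Pinecone implements search.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_top_k(query, stored, top_k):
    """Score every stored (id, vector, description) and return the top_k best."""
    scored = [(cosine_similarity(query, vec), vec_id, desc) for vec_id, vec, desc in stored]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]

# Toy data: real CLIP ViT-B/32 embeddings have 512 dimensions.
toy_vectors = [
    ("1", [1.0, 0.0], "Sneaker"),
    ("2", [0.9, 0.1], "Sneaker"),
    ("3", [0.0, 1.0], "Bag"),
]
results = brute_force_top_k([1.0, 0.05], toy_vectors, top_k=2)
print([(vec_id, desc) for _, vec_id, desc in results])  # [('1', 'Sneaker'), ('2', 'Sneaker')]
```

Exact search scores every stored vector, which becomes slow as the collection grows; an ANN index trades a small amount of accuracy for much faster queries.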

# 6. Run a nearest neighbor search on 100 random test images that are not in the training dataset

Select 100 random test images. Keep in mind that these images come from the test split, so none of them were inserted into the index. Obtain the percentage of Pinecone queries that return the correct result in top_k.

In [None]:
import clip
import torch
import pinecone
from datasets import load_dataset
from tqdm.auto import tqdm  # progress bar

test_dataset = load_dataset("fashion_mnist")['test'].shuffle(seed=42).select(range(0, 100))

# Check to see if a GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device=device)

# Generate vector embeddings for each image in the dataset
test_vectors = []
for image in tqdm(test_dataset, total=test_dataset.num_rows):
    with torch.no_grad():
        image_pp = preprocess(image['image']).unsqueeze(0).to(device)
        image_features = model.encode_image(image_pp)
        embedding_numpy = image_features.cpu().numpy().squeeze().tolist()
        test_vectors.append({'embedding': embedding_numpy,
                        'description': label_descriptions[image["label"]]})

index = pinecone.Index(PINECONE_INDEX_NAME)
top_k = 10
top_k_contains_cnt = 0

for v in test_vectors:

  top_k_contains = False

  query_result = index.query(
    vector = v['embedding'],
    namespace="fashion-mnist",
    top_k=top_k,
    include_values=False,
    include_metadata=True
  )
  
  for match in query_result.matches:
    if match.metadata['description'] == v['description']:
      top_k_contains = True

  if top_k_contains:
    top_k_contains_cnt += 1

print(f"top_k contains matching result: {top_k_contains_cnt / (len(test_vectors)) * 100}%")
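
Steps 5 and 6 report two different numbers: step 5 prints the share of returned matches that carry the expected label, while step 6 prints the share of queries for which at least one returned match carries it. A minimal sketch of both metrics on plain label lists follows; the function names and the toy results are hypothetical, and no Pinecone calls are involved.

```python
def precision_at_k(expected, returned):
    """Fraction of returned labels equal to the expected label (as in step 5)."""
    if not returned:
        return 0.0
    return sum(1 for label in returned if label == expected) / len(returned)

def hit_rate(queries):
    """Fraction of (expected, returned) pairs where expected appears at least once (as in step 6)."""
    if not queries:
        return 0.0
    hits = sum(1 for expected, returned in queries if expected in returned)
    return hits / len(queries)

# Made-up query results with top_k = 4.
toy_queries = [
    ("Sneaker", ["Sneaker", "Ankle boot", "Sneaker", "Sandal"]),
    ("Shirt", ["T-shirt/top", "Pullover", "Coat", "Dress"]),
]
print(precision_at_k(*toy_queries[0]))  # 0.5
print(hit_rate(toy_queries))            # 0.5
```

Note that step 5 divides by `top_k` rather than by the number of matches returned; the two agree whenever the namespace holds at least `top_k` vectors. The hit rate can only stay the same or rise as `top_k` grows, which is why a larger `top_k` pushes the step 6 percentage toward 100%.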

# 7. Run a load test script to simulate 10 concurrent users querying the index

Locust (locust.io) is an open-source load testing tool written in Python. It lets you define user behaviour in Python code and simulate millions of simultaneous users that bombard a system with traffic to test its resilience under heavy load. The [locustfile.py](./locustfile.py) script reuses the logic from step 6 to query Pinecone. It has a custom event hook that records a failure when none of the top_k results match the description of the query image. The run will likely report a low failure rate, but you can increase top_k to reach a 100% pass rate. The Locust summary includes P50 to P100 response-time percentiles and QPS (req/s).

In [None]:
%%bash
locust -f locustfile.py --headless -u 10 -r 1 --run-time 60s --host https://pinecone.io --only-summary
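
To help read the summary, the sketch below shows what response-time percentiles and QPS mean, using a nearest-rank percentile over a made-up list of latencies. This is an illustration only and is not how Locust computes its statistics; the `percentile` helper, the latency values, and the run time are all assumptions.

```python
import math

def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list, for pct in (0, 100]."""
    rank = math.ceil(pct * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]

# Made-up response times in milliseconds for 10 requests over 2 seconds.
latencies_ms = sorted([12, 15, 11, 40, 13, 14, 18, 95, 16, 17])
run_time_s = 2.0

for pct in (50, 90, 100):
    print(f"P{pct}: {percentile(latencies_ms, pct)} ms")  # P50: 15 ms, P90: 40 ms, P100: 95 ms
print(f"QPS: {len(latencies_ms) / run_time_s}")           # QPS: 5.0
```

P50 is the median response time, P100 is the slowest request, and the gap between them shows how much tail latency a few slow queries add.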

# 8. TEARDOWN: Delete the index
# WARNING: This next step will delete the PINECONE_INDEX_NAME index and all data in it. DO NOT RUN THIS UNTIL YOU ARE READY OR MANUALLY REMOVE THE INDEX INSTEAD!!! 

In [None]:
if PINECONE_INDEX_NAME in pinecone.list_indexes():
    pinecone.delete_index(PINECONE_INDEX_NAME)
    
pinecone.list_indexes()