# Lab #3 
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/basic-operations-workshop/blob/main/lab3.ipynb)
1. Install dependencies
2. Create a Pinecone index
3. Load a public image dataset (fashion-mnist) and create vector embeddings from the dataset
4. Create a local parquet backup of your image embeddings
5. Insert the fashion-mnist embeddings into Pinecone
6. Run a nearest neighbor search on a sample image that is not in the training dataset
7. Run a nearest neighbor search on 100 random test images that are not in the training dataset
8. Run a load test script to simulate 10 concurrent users querying the index
9. TEARDOWN: Delete the index 

# 1. Install dependencies
Use the following shell command to install the Pinecone client and the other packages used in this lab:

In [1]:
!pip install -U "pinecone-client[grpc]" "python-dotenv" "torch" "torchvision" "pillow" "ftfy" "regex" "tqdm" "git+https://github.com/openai/clip.git" "datasets" "locust"


# 2. Create a Pinecone index
We will create an index that will be used to load and query a Hugging Face dataset.

In [4]:
from dotenv import load_dotenv
import os
import pinecone

load_dotenv('.env')

PINECONE_INDEX_NAME = os.environ['PINECONE_INDEX_NAME']
PINECONE_API_KEY = os.environ['PINECONE_API_KEY']
PINECONE_ENVIRONMENT = os.environ['PINECONE_ENVIRONMENT']
METRIC = os.environ['METRIC']
DIMENSIONS = int(os.environ['DIMENSIONS'])
INDEX_NAMESPACE = os.environ['INDEX_NAMESPACE']

pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENVIRONMENT)

# Create the index only if it does not exist yet. Calling create_index for an
# existing index fails with "ApiException: (409) Reason: Conflict".
if PINECONE_INDEX_NAME not in pinecone.list_indexes():
    pinecone.create_index(PINECONE_INDEX_NAME, dimension=DIMENSIONS, metric=METRIC, pods=1, replicas=1, pod_type="s1")
pinecone.describe_index(PINECONE_INDEX_NAME)
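The cell above assumes that `.env` defines all six settings, and a missing one surfaces only as a bare `KeyError`. A minimal sketch of checking them up front follows. `REQUIRED_SETTINGS` and `missing_settings` are hypothetical names introduced here for illustration and are not part of the workshop code.

```python
import os

# The six settings that the step 2 cell reads from the environment.
REQUIRED_SETTINGS = (
    "PINECONE_INDEX_NAME",
    "PINECONE_API_KEY",
    "PINECONE_ENVIRONMENT",
    "METRIC",
    "DIMENSIONS",
    "INDEX_NAMESPACE",
)


def missing_settings(env=None):
    """Return the names of required settings that are absent or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]
```

Calling `missing_settings()` right after `load_dotenv('.env')` should return an empty list before the index is created. Note that the CLIP `ViT-B/32` model used in step 3 produces 512-dimensional image embeddings, so `DIMENSIONS` needs to be 512 for the upsert in step 5 to work.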


# 3. Load public image dataset (fashion-mnist) and create vector embeddings from the dataset

The code below downloads the [fashion-mnist](https://huggingface.co/datasets/fashion_mnist) training dataset from Hugging Face so that we can create vector embeddings that carry each image's label (image class) as metadata. The label mappings are:

| Label  | Description |
| ------ | ----------- |
| 0      | T-shirt/top |
| 1      | Trouser     |
| 2      | Pullover    |
| 3      | Dress       |
| 4      | Coat        |
| 5      | Sandal      |
| 6      | Shirt       |
| 7      | Sneaker     |
| 8      | Bag         |
| 9      | Ankle boot  |

The Fashion-MNIST dataset is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.

The accuracy you can achieve depends on the model and the preprocessing steps you use. Here's a rough guideline for what you might expect with some classic machine learning algorithms:

1. **Random Forest:** Around 85-89% accuracy.
2. **Support Vector Machines (SVM):** Around 85-90% accuracy, depending on kernel and hyperparameters.
3. **k-Nearest Neighbors (k-NN):** Around 85-88% accuracy.
4. **Logistic Regression:** Around 82-85% accuracy.
5. **Gradient Boosting Machines (e.g., XGBoost):** Around 87-90% accuracy.

Keep in mind that these numbers are approximate and vary with the exact preprocessing, feature extraction, and hyperparameter tuning you do. In general, deep learning models, especially Convolutional Neural Networks (CNNs), tend to perform better on image classification tasks like Fashion-MNIST, typically reaching 90-95% accuracy or more.

But for classic machine learning models, anything in the 85-90% range can be considered a reasonable result for the Fashion-MNIST dataset. It reflects a model that has learned something meaningful from the data but isn't necessarily state-of-the-art for this particular task.
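Of the classic algorithms listed, k-NN is the closest to what this lab does: steps 6 and 7 retrieve the `top_k` nearest vectors and inspect their `description` metadata. A minimal sketch of turning those neighbor labels into a single prediction by majority vote, assuming the labels arrive ordered from most to least similar. `majority_label` is a hypothetical helper, not part of the workshop code.

```python
from collections import Counter


def majority_label(neighbor_labels):
    """Return the most common label among the nearest neighbors.

    On a tie, the label that appears first in the list wins, which is the
    closer neighbor when the list is sorted by similarity.
    """
    if not neighbor_labels:
        raise ValueError("need at least one neighbor label")
    return Counter(neighbor_labels).most_common(1)[0][0]
```

With the query result from step 6, the input would be something like `[m.metadata['description'] for m in query_result.matches]`.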

In [1]:
from datasets import load_dataset
from tqdm.auto import tqdm  # progress bar
from PIL import Image
import torch
import clip
import time
import numpy as np

#  Load the fashion-mnist dataset - only retrieve 6000 random images (10% of total dataset)
dataset = load_dataset("fashion_mnist")['train'].shuffle(seed=42).select(range(0,6000))
#dataset = load_dataset("fashion_mnist")['train']

label_descriptions = {0: "T-shirt/top", 
           1: "Trouser",
           2: "Pullover",
           3: "Dress",
           4: "Coat",
           5: "Sandal",
           6: "Shirt",
           7: "Sneaker",
           8: "Bag",
           9: "Ankle boot"}

# Check to see if a GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device=device)
   
# Generate a vector embedding for each image in the dataset
vectors = []
with torch.no_grad():  # inference only, so skip gradient tracking
    for vector_id, img in enumerate(tqdm(dataset, total=dataset.num_rows, desc='Images', position=0), start=1):
        image_pp = preprocess(img['image']).unsqueeze(0).to(device)
        image_features = model.encode_image(image_pp)
        embedding = image_features.cpu().numpy().squeeze().tolist()
        # Store the human-readable class name and the creation time as metadata
        meta_data = {"description": label_descriptions[img["label"]], "timestamp": time.time()}
        vectors.append({'id': str(vector_id),  # IDs are the strings "1", "2", ...
                        'values': embedding,
                        'metadata': meta_data})

Images: 100%|██████████| 6000/6000 [03:14<00:00, 30.79it/s]


# 4. Create a local parquet backup of your image embeddings

This is good practice because generating embeddings can be expensive and time-consuming when calling hosted models such as OpenAI's. As the progress bar above shows, even locally generated embeddings take a while.

In [2]:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame(vectors)
df.to_parquet('fashion-mnist-clip.parquet')
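Before writing the backup, it can help to confirm that every record has the shape the upsert in step 5 relies on: a string `id`, a `values` list of the expected length, and a `metadata` dictionary. The following is a minimal sketch under that assumption. `find_bad_records` is a hypothetical helper and not part of the workshop code, and `dimensions` would be the `DIMENSIONS` setting from step 2.

```python
def find_bad_records(records, dimensions):
    """Return the ids of records that do not look like upsert-ready vectors."""
    bad = []
    for record in records:
        values = record.get("values")
        ok = (
            isinstance(record.get("id"), str)
            and isinstance(values, list)
            and len(values) == dimensions
            and isinstance(record.get("metadata"), dict)
        )
        if not ok:
            bad.append(record.get("id"))
    return bad
```

An empty result from `find_bad_records(vectors, DIMENSIONS)` suggests the list is safe to save and upsert.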

# 5. Insert the fashion-mnist embeddings into Pinecone

The most efficient way to do bulk upserts is to send the vectors in batches. We will also write the data into a namespace.

In [8]:
from tqdm.auto import tqdm  # progress bar
import pinecone
import itertools

# Read Parquet file into a DataFrame
df = pd.read_parquet('fashion-mnist-clip.parquet')
df['values'] = df['values'].apply(lambda x: x.tolist())

# Convert DataFrame to a list of dictionaries
data_list = df.to_dict(orient='records')

def chunks(iterable, batch_size=100):
    """A helper function to break an iterable into chunks of size batch_size."""
    it = iter(iterable)
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))

index = pinecone.Index(PINECONE_INDEX_NAME)

# Upsert the embeddings in batches of 100
batch_size = 100
# Ceiling division keeps the progress bar total an integer number of batches
num_batches = (len(data_list) + batch_size - 1) // batch_size
for vector_batch in tqdm(chunks(data_list, batch_size=batch_size), total=num_batches):
    index.upsert(vector_batch, namespace=INDEX_NAMESPACE)

100%|██████████| 60/60.0 [00:25<00:00,  2.32it/s]
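To make the batching concrete, here is a small self-contained sketch that repeats the `chunks` helper from the cell above and applies it to a plain range instead of real vectors. The 250-item input is a made-up example.

```python
import itertools


def chunks(iterable, batch_size=100):
    """Break an iterable into tuples of at most batch_size items."""
    it = iter(iterable)
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))


# 250 items in batches of 100: two full batches and one partial batch.
batch_sizes = [len(batch) for batch in chunks(range(250), batch_size=100)]
print(batch_sizes)  # [100, 100, 50]
```

The last batch is simply smaller than the others, so the number of upsert calls is the item count divided by the batch size, rounded up. For the 6000 vectors in this lab that gives 60 batches, which matches the progress bar above.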


# 6. Run a nearest neighbor search on a sample image that is not in the training dataset

Download a sneaker image file from GitHub and use it as the query to see whether the Pinecone search returns the correct description, "Sneaker".
You can change top_k from 1 to 10 to 100 to see how the ANN results vary.

In [None]:
import pinecone
from PIL import Image
import torch
import clip
import requests

# Check to see if a GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device=device) 

def image_to_embedding():
    """Download the sample sneaker image and return its CLIP embedding as a list of floats."""
    url = "https://github.com/pinecone-io/basic-operations-workshop/blob/main/sneaker.jpeg?raw=true"
    response = requests.get(url)
    response.raise_for_status()  # fail early if the download did not succeed
    with open("sneaker.jpeg", "wb") as file:
        file.write(response.content)
    image_pp = preprocess(Image.open("./sneaker.jpeg")).unsqueeze(0).to(device)
    with torch.no_grad():
        embedding = model.encode_image(image_pp).squeeze().tolist()

    return embedding

index = pinecone.Index(PINECONE_INDEX_NAME)
top_k = 10

query_result = index.query(
  vector = image_to_embedding(),
  namespace=INDEX_NAMESPACE,
  top_k=top_k,
  include_values=False,
  include_metadata=True
)

top_k_contains = False
match_cnt = 0
miss_categories = set()

for match in query_result.matches:
  if match.metadata['description'] == "Sneaker":
    match_cnt += 1
    top_k_contains = True
  else:
    miss_categories.add(match.metadata['description'])

print(f"top_k contains matching result: {top_k_contains}")
print(f"top_k: {top_k} match percentage is: {match_cnt/top_k * 100}%")
print(f"Match miss categories: {miss_categories} expected 'Sneaker'")
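For intuition about what the query returns, the sketch below ranks a handful of toy vectors by exhaustive comparison. It assumes the index uses the cosine metric (the actual `METRIC` value lives in `.env` and is not shown here), and it is not how Pinecone works internally: Pinecone uses an approximate nearest neighbor index rather than scanning every vector. `cosine_similarity` and `brute_force_top_k` are hypothetical helpers, and the 2-dimensional records are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_top_k(query, records, top_k):
    """Score every record against the query and return the top_k ids, best first."""
    scored = [(cosine_similarity(query, r["values"]), r["id"]) for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record_id for _, record_id in scored[:top_k]]


toy_records = [
    {"id": "a", "values": [1.0, 0.0]},
    {"id": "b", "values": [0.0, 1.0]},
    {"id": "c", "values": [1.0, 1.0]},
]
nearest = brute_force_top_k([1.0, 0.0], toy_records, top_k=2)
```

Here `nearest` holds the ids of the two records that point most nearly in the same direction as the query. The exhaustive scan costs one comparison per stored vector, which is the cost that an approximate index is designed to avoid.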

# 7. Run a nearest neighbor search on 100 random test images that are not in the training dataset

Select 100 random test images. Keep in mind that none of these images were inserted into the index, which only holds images from the training split. Obtain the percentage of Pinecone queries that return the correct description somewhere in the top_k results.

In [None]:
import clip
import torch
import pinecone
from datasets import load_dataset
from tqdm.auto import tqdm 

test_dataset = load_dataset("fashion_mnist")['test'].shuffle(seed=42).select(range(0, 100))

# Check to see if a GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device=device) 

# Generate vector embeddings for each image in the test dataset
test_vectors = []
with torch.no_grad():  # inference only, so skip gradient tracking
  for img in tqdm(test_dataset, total=test_dataset.num_rows):
    image_pp = preprocess(img['image']).unsqueeze(0).to(device)
    embedding = model.encode_image(image_pp).squeeze().tolist()

    # label_descriptions is the label-to-name mapping defined in step 3
    test_vectors.append({'embedding': embedding,
                         'description': label_descriptions[img["label"]]})
    
index = pinecone.Index(PINECONE_INDEX_NAME)
top_k = 10
top_k_contains_cnt = 0

for v in test_vectors:

  top_k_contains = False

  query_result = index.query(
    vector = v['embedding'],
    namespace=INDEX_NAMESPACE,
    top_k=top_k,
    include_values=False,
    include_metadata=True
  )
  
  for match in query_result.matches:
    if match.metadata['description'] == v['description']:
      top_k_contains = True

  if top_k_contains:
    top_k_contains_cnt += 1

print(f"top_k contains matching result: {top_k_contains_cnt / (len(test_vectors)) * 100}%")
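The single percentage above hides which categories are responsible for the misses. A minimal sketch of breaking the hit rate down per class follows, assuming you collect one `(expected_description, hit)` pair per query inside the loop. `hit_rate_by_class` and the sample outcomes are hypothetical and not part of the workshop code.

```python
from collections import defaultdict


def hit_rate_by_class(outcomes):
    """Map each expected description to the fraction of its queries that hit.

    outcomes is an iterable of (expected_description, hit) pairs, where hit
    is True when the top_k results contained the expected description.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for description, hit in outcomes:
        totals[description] += 1
        if hit:
            hits[description] += 1
    return {description: hits[description] / totals[description] for description in totals}


sample_outcomes = [("Shirt", True), ("Shirt", False), ("Bag", True)]
rates = hit_rate_by_class(sample_outcomes)
```

In the loop above, the pair for each query would be `(v['description'], top_k_contains)`.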

# 8. Run a load test script to simulate 10 concurrent users querying the index

Locust.io is an open-source load testing tool written in Python. It lets you define user behaviour in Python code and simulate millions of simultaneous users that bombard a system with traffic to test its resilience under heavy load. The [locustfile.py](./locustfile.py) script re-uses the logic in step #6 to query Pinecone. It has a custom event hook that records a failure if the top_k result set does not match the search image description. The run will likely report a low failure rate, but you can increase top_k to get a 100% pass rate. The locust summary includes P50 to P100 response time percentiles and QPS (req/s).

In [None]:
%%bash
locust -f locustfile.py --headless -u 10 -r 1 --run-time 60s --host https://pinecone.io --only-summary
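The command above runs 10 users (`-u 10`), spawned at one per second (`-r 1`), for 60 seconds. To help read the summary it prints, the sketch below shows one common way to derive response time percentiles and QPS from raw latencies, using the nearest-rank method. Locust computes its own statistics internally and may round or bucket differently, so treat this as an illustration only. `percentile`, `summarize`, and the sample latencies are hypothetical.

```python
def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list, for an integer pct in 1..100."""
    if not sorted_values:
        raise ValueError("no samples")
    # Integer ceiling of pct * n / 100, which avoids floating point rounding
    rank = (pct * len(sorted_values) + 99) // 100
    return sorted_values[rank - 1]


def summarize(latencies_ms, duration_s):
    """Return a few percentiles and the average requests per second."""
    ordered = sorted(latencies_ms)
    return {
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "p100": percentile(ordered, 100),
        "qps": len(ordered) / duration_s,
    }


# Made-up sample: 100 requests with latencies of 1..100 ms over 10 seconds.
summary = summarize(range(1, 101), duration_s=10)
```

P100 is simply the slowest request, so a large gap between P95 and P100 usually points to a few outliers rather than a generally slow index.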

# 9. TEARDOWN: Delete the index 
# WARNING: This next step will delete the PINECONE_INDEX_NAME index and all data in it. DO NOT RUN THIS UNTIL YOU ARE READY OR MANUALLY REMOVE THE INDEX INSTEAD!!! 

In [None]:
if PINECONE_INDEX_NAME in pinecone.list_indexes():
    pinecone.delete_index(PINECONE_INDEX_NAME)
    
pinecone.list_indexes()