<!--<badge>--><a href="https://colab.research.google.com/github/pinecone-io/examples/blob/master/image_search/image_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a><!--</badge>-->

# Hybrid Image and Keyword Search

## Background

### What is Image Search and how will we use it?

You may have an image and want to find similar images in a large image corpus. The difficult part is retrieving similar images instantly and at scale, especially when there are tens of millions or billions of images to choose from.

In this example, we will walk you through the mechanics of solving this problem using an off-the-shelf, pretrained neural network to generate data structures known as [vector embeddings](https://www.pinecone.io/learn/vector-embeddings/). We will use Pinecone's vector database offering to find images whose vector embeddings are similar to that of a _query image_.

### Learning Goals and Estimated Reading Time

_By the end of this 15 minute demo (on a recent MacBook Pro, or up to an hour on Google Colab), you will have:_
 1. Learned about Pinecone's value for solving realtime image search requirements!
 2. Stored and retrieved vectors from your very own Pinecone Vector Database.
 3. Encoded images as vectors using a pretrained neural network (i.e. no model training necessary).
 4. Queried Pinecone's Vector Database to find similar images to the query in question.
 
Once all data is encoded as vectors, and is in your Pinecone Index, results of Pinecone queries are returned, on average, in tens of milliseconds.

## Setup: Prerequisites and Image Data

### Python 3.7+

This code has been tested with Python 3.7. It is recommended to run this code in a virtual environment or Google Colab.

### Acquiring your Pinecone API Key

A Pinecone API key is required. You can obtain a complimentary key on [our website](https://app.pinecone.io/). Either add `PINECONE_EXAMPLE_API_KEY` to your environment variables, or enter it manually after running the cell below (a prompt will pop up requesting the API key and store the result for this kernel session).
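
The lookup order described above can be sketched as follows. This is a minimal illustration, not the code in `helper.py` (which is not shown here); the function name `get_api_key` is hypothetical.

```python
import os
from getpass import getpass


def get_api_key(env_var: str = "PINECONE_EXAMPLE_API_KEY") -> str:
    """Return the API key from the environment, or prompt for it once.

    Hypothetical helper for illustration; the notebook's real lookup
    lives in helper.py.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        # getpass hides the typed key so it does not end up in cell output.
        api_key = getpass(f"Enter your Pinecone API key ({env_var}): ")
    return api_key
```

Setting the environment variable before starting the kernel avoids the prompt entirely.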

### Installing and Importing Prerequisite Libraries:
All prerequisites are installed and listed in the next cell.

#### Installing via `pip`

In [None]:
!pip install -qU \
                 torchvision \
                 seaborn \
                 tqdm \
                 mmh3 \
                 nltk \
                 pycocotools \
                 transformers
#                 #  pinecone-client \

Temporary: install Pinecone's client from a side branch containing the new hybrid API.

In [None]:
!pip3 install -U git+https://github.com/pinecone-io/pinecone-python-client.git@add-hybridapi-wiring

#### Importing and Defining Constants

In [None]:
import os

from tqdm.notebook import tqdm
import pinecone
import numpy as np
from PIL import Image

import torch
import torchvision
import torch.nn.functional as f

import nltk
nltk.download('punkt')

DATA_DIRECTORY = 'tmp'
INDEX_NAME = 'image-hybrid-search'
INDEX_DIMENSION = 768
BATCH_SIZE=100
NUM_CAPTIONS = 3

### Helper Module

This helper module will be imported and will enable this notebook to be self-contained.

In [None]:
!curl -o helper.py https://raw.githubusercontent.com/pinecone-io/examples/ilai/hybrid_image_search/hybrid_image_search//helper.py

In [None]:
import helper as h

### Downloading Data

To demonstrate image search using Pinecone, we will download the COCO 2017 images and captions and load them with one of the [built-in dataset classes](https://pytorch.org/vision/stable/datasets.html) available in the `torchvision` library. We will later upsert 100,000 of the training images.

Get the COCO images and annotations:

In [None]:
!wget -N http://images.cocodataset.org/annotations/annotations_trainval2017.zip
!wget -N http://images.cocodataset.org/zips/val2017.zip
!wget -N http://images.cocodataset.org/zips/train2017.zip
!unzip -qn train2017.zip
!unzip -qn annotations_trainval2017.zip
!unzip -qn val2017.zip

In [None]:
dataset = torchvision.datasets.CocoCaptions('train2017/', 'annotations/captions_train2017.json', transform=h.preprocess, target_transform=lambda x: x[:3])

In [None]:
val_dataset = torchvision.datasets.CocoCaptions('val2017/', 'annotations/captions_val2017.json', transform=h.preprocess, target_transform=lambda x: x[:3])

### Inspecting Images
These are some of the images from what was just downloaded. If interested, read about the COCO image dataset [here](https://cocodataset.org/#captions-2015).

In [None]:
import matplotlib.pyplot as plt


In [None]:
h.show_random_images_from_full_dataset(dataset, num_rows=5, num_cols=2)

## Generating Embeddings and Sending them to Pinecone

### Loading a Pretrained Image Embedding Model

We will use a pretrained Vision Transformer (ViT) model to generate image embedding vectors. 
This model will create a 768-dimensional sequence of floats for each input image. We will use this output as an embedding associated with an image.
You can read more about ViT models [here](https://arxiv.org/abs/2010.11929).

In [None]:
from transformers import ViTFeatureExtractor, ViTModel
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
model = model.to(device).eval()

### On Comparing Embeddings

Two embeddings might look something like this:

- \[-0.02, 0.06, 0.0, 0.01, 0.08, -0.03, 0.01, 0.02, 0.01, 0.02, -0.07, -0.11, -0.01, 0.08, -0.04\]
- \[-0.04, -0.09, 0.04, -0.1, -0.05, -0.01, -0.06, -0.04, -0.02, -0.04, -0.04, 0.07, 0.03, 0.02, 0.03\]

In order to determine how similar they are, we use a [simple](https://towardsdatascience.com/importance-of-distance-metrics-in-machine-learning-modelling-e51395ffe60d) formula that takes a very short time to compute. Similarity scores are, in general, an excellent proxy for image similarity.
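
As a concrete illustration, the sketch below computes the cosine similarity of the two example vectors above with NumPy. Cosine similarity is one common choice of formula; the helper name `cosine_similarity` is only for illustration.

```python
import numpy as np

# The two example embeddings shown above.
a = np.array([-0.02, 0.06, 0.0, 0.01, 0.08, -0.03, 0.01, 0.02,
              0.01, 0.02, -0.07, -0.11, -0.01, 0.08, -0.04])
b = np.array([-0.04, -0.09, 0.04, -0.1, -0.05, -0.01, -0.06, -0.04,
              -0.02, -0.04, -0.04, 0.07, 0.03, 0.02, 0.03])


def cosine_similarity(u, v):
    """Dot product divided by the product of the L2 norms; lies in [-1, 1]."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


score = cosine_similarity(a, b)
print(f"cosine similarity: {score:.3f}")
```

A score near 1 means the vectors point in almost the same direction, while a score near -1 means they point in opposite directions.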

### Creating Our Pinecone Index

The process for creating a Pinecone Index requires your Pinecone API key, the name of your index, and the number of dimensions of each vector (768 in this example).

In this example, to compare embeddings, we will use the [cosine similarity score](https://en.wikipedia.org/wiki/Cosine_similarity) because this model generates un-normalized embedding vectors. While this calculation is trivial when comparing two vectors, it takes a long time when a query vector must be compared against millions or billions of vectors to find those most similar to it.
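
To make the cost concrete, here is a minimal sketch of exact (brute-force) top-k search with NumPy. The corpus is random stand-in data rather than real image embeddings, and `brute_force_top_k` is a hypothetical name. Every query touches every stored vector, so the work grows linearly with the size of the corpus.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 768)).astype(np.float32)  # stand-in embeddings
query = rng.normal(size=768).astype(np.float32)


def brute_force_top_k(query, corpus, k=4):
    """Exact cosine-similarity search: scores every corpus vector, O(N * d)."""
    corpus_unit = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = corpus_unit @ query_unit
    top = np.argpartition(-scores, k)[:k]   # indices of the k best, unordered
    top = top[np.argsort(-scores[top])]     # order them by descending score
    return top, scores[top]


ids, scores = brute_force_top_k(query, corpus)
print(ids, scores)
```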

Note: For the moment, hybrid search only supports the plain dot product, so we will need to normalize the vectors by their L2 norm to get the cosine similarity.
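
The equivalence this relies on can be checked numerically. The sketch below uses random stand-in vectors and NumPy; in the notebook itself the same normalization is done per row with `torch.nn.functional.normalize(..., p=2, dim=1)`.

```python
import numpy as np

rng = np.random.default_rng(42)
u, v = rng.normal(size=(2, 768))  # two random stand-in embeddings

# Cosine similarity computed directly from the un-normalized vectors.
cosine = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Dot product after scaling each vector to unit L2 norm.
u_unit = u / np.linalg.norm(u)
v_unit = v / np.linalg.norm(v)
dot_of_units = u_unit @ v_unit

print(np.isclose(cosine, dot_of_units))
```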

### What is Pinecone for?

There is often a technical requirement to compare one vector to tens or hundreds of millions or more vectors, to do so with low latency (less than 50ms) and a high throughput. Pinecone solves this problem with its managed vector database service, and we will demonstrate this below.

In [None]:
pinecone.init(h.pinecone_api_key, environment='us-west1-gcp')

metadata_config = {
    "indexed": []
}

if INDEX_NAME not in pinecone.list_indexes():
    pinecone.create_index(name=INDEX_NAME, dimension=INDEX_DIMENSION,
                          metric="dotproduct",
                          pod_type="s1h",
                          index_config={"hybrid_search": {"avg_doc_len": 52}},
                          metadata_config=metadata_config)

index = pinecone.Index(INDEX_NAME)

## Creating keywords sparse vectors

The textual image captions are tokenized individually, and the resulting tokens will be used by Pinecone's BM25 hybrid API to create sparse vector representations of the keywords.
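
Before the NLTK-based encoder, the shape of such a sparse representation can be sketched with the standard library alone. This is a stand-in under stated assumptions: a regular expression replaces `word_tokenize` and stemming, and a truncated MD5 digest replaces MurmurHash3 (`mmh3`), so the ids will differ from those the real encoder produces. The name `toy_sparse_encode` is hypothetical.

```python
import hashlib
import re
from collections import Counter
from typing import Dict


def toy_sparse_encode(text: str) -> Dict[int, int]:
    """Map text to {token_id: term_count} using stdlib stand-ins."""
    # Crude tokenizer: lowercase alphanumeric runs, no stemming.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # First 4 bytes of the MD5 digest give an unsigned 32-bit id per token.
    ids = [
        int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:4], "big")
        for tok in tokens
    ]
    return dict(Counter(ids))


sparse = toy_sparse_encode("A woman playing tennis. A woman holding a racket.")
print(sparse)
```

Each key is a hashed token and each value is how often that token occurs, which is the term-frequency input a BM25-style scorer needs.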

In [None]:
import mmh3
import nltk
from typing import Dict
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from collections import Counter


class WordEncoder():

    def __init__(self):
        nltk.download('punkt')
        self.stemmer = SnowballStemmer('english')

    def encode(self, text: str) -> Dict[int, int]:
        words = [self.stemmer.stem(word) for word in word_tokenize(text)]
        ids = [mmh3.hash(word, signed=False) for word in words]
        return dict(Counter(ids))

In [None]:
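from collections.abc import Iterable  # `collections.Iterable` was removed in Python 3.10
sparse_encoder = WordEncoder()

def encode_image_captions(image_captions):
    # Join a list/tuple of captions into one string; leave a plain string
    # untouched (a str is itself Iterable and would be split into characters).
    if isinstance(image_captions, Iterable) and not isinstance(image_captions, str):
        image_captions = ". ".join(image_captions)
    return sparse_encoder.encode(image_captions)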

def get_sparse_representation(captions):
    return map(encode_image_captions, captions)

### Preparing Vector Embeddings

We will encode the downloaded images for upload to Pinecone, and use the captions associated with each image either as sparse keyword vectors (hybrid mode) or as metadata (non-hybrid mode).

#### Creating Vector IDs
Each vector ID will have a prefix corresponding to the dataset's name

In [None]:
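def get_vector_ids(batch_number, batch_size, prefix):
    """Return an iterator of vector ids such as 'CocoCaptions.0' for one batch.

    Ids are consecutive integers starting at `batch_number * batch_size`, so
    `batch_size` must be the same for every batch to avoid reusing ids.
    """
    start_index = batch_number * batch_size
    end_index = start_index + batch_size
    ids = np.arange(start_index, end_index)
    ids_with_prefix = map(lambda x: f'{prefix}.{x}', ids)
    return ids_with_prefix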
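
One caveat is worth illustrating. Later in this notebook the function is called with the length of the current batch as `batch_size`, so a shorter final batch starts from a smaller offset and reuses ids from an earlier batch. The sketch below reproduces that arithmetic with plain Python; `ids_from_offset` is a hypothetical alternative, not part of the notebook.

```python
def ids_from_batch_length(batch_number, num_records, prefix):
    """Mirrors the notebook's call pattern: start = batch_number * len(batch)."""
    start = batch_number * num_records
    return [f"{prefix}.{i}" for i in range(start, start + num_records)]


def ids_from_offset(offset, num_records, prefix):
    """Hypothetical alternative: start from the number of records already seen."""
    return [f"{prefix}.{i}" for i in range(offset, offset + num_records)]


num_rows, batch_size = 100_000, 300
# 333 full batches of 300, then one final batch of 100 (batch number 333).
last_batch_number, last_batch_len = divmod(num_rows, batch_size)

colliding = ids_from_batch_length(last_batch_number, last_batch_len, "CocoCaptions")
expected = ids_from_offset(last_batch_number * batch_size, last_batch_len, "CocoCaptions")
print(colliding[0], expected[0])  # CocoCaptions.33300 CocoCaptions.99900
```

Choosing `num_rows` as a multiple of the batch size, or tracking a running offset as in `ids_from_offset`, avoids the overlap.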

#### Creating metadata for each vector containing its captions

In [None]:
def get_vector_metadata(labels):
    """Return an iterator of dicts like {'label_0': <caption>, 'label_1': <caption>, ...}."""
    get_caption_sentences = lambda caps: {f'label_{i}': cap for i, cap in enumerate(caps)}
    return map(get_caption_sentences, labels)

#### Constructing Vector Embeddings

In a Pinecone Vector Database, there are three components to every Pinecone vector embedding:

 - a vector ID
 - a sequence of floats of a user-defined, fixed dimension
 - vector metadata (a key-value mapping, used for filtering at runtime)

In [None]:
def get_vectors_from_batch(preprocessed_data, captions, batch_number, dataset, normalize=True, hybrid=True):
    """Return list of tuples: (id, values, sparse_values, metadata) if hybrid, else (id, values, metadata)."""
    num_records = len(preprocessed_data)
    prefix = dataset.__class__.__name__
    with torch.no_grad():
        # generate image embeddings with PyTorch model
        preprocessed_data = preprocessed_data.to(device)
        vector_values = model(preprocessed_data).pooler_output
        if normalize:
            vector_values = f.normalize(vector_values, p=2, dim=1)
        vector_values = vector_values.cpu().tolist()

    # return respective IDs/metadata for each image embedding
    vector_ids = get_vector_ids(batch_number, num_records, prefix)
    if hybrid:
        vector_metadata = [{}] * num_records
        sparse_rep = get_sparse_representation(captions)
        return list(zip(vector_ids, vector_values, sparse_rep, vector_metadata))
    else:
        vector_metadata = get_vector_metadata(captions)
        return list(zip(vector_ids, vector_values, vector_metadata))

#### Upsert Vectors to Pinecone
This function iterates through a dataset in batches, generates a list of vector embeddings (as in the above example) and upserts in batches to Pinecone.

In [None]:
def upsert_image_embeddings(dataset, pinecone_index, batch_size=BATCH_SIZE, num_rows=np.inf, hybrid=True):
    """Iterate through dataset, generate embeddings and upsert in batches to Pinecone index.
    
    Args:
     - dataset: a PyTorch Dataset
     - pinecone_index: your Pinecone index
     - batch_size: batch size
     - num_rows: Number of initial rows of the dataset to use; use all rows if `np.inf` (the default).
    """
    if num_rows < np.inf:
        if num_rows > len(dataset):
            raise ValueError(f'`num_rows` should not exceed length of dataset: {len(dataset)}')
        sampler = range(num_rows)
    else:
        sampler = None
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, sampler=sampler, num_workers=8)
    tqdm_kwargs = h.get_tqdm_kwargs(dataloader, num_rows=num_rows)
    for batch_number, (images, captions) in tqdm(enumerate(dataloader), **tqdm_kwargs):
        captions = list(zip(*captions))
        vectors = get_vectors_from_batch(
            images, 
            captions, 
            batch_number, 
            dataloader.dataset,
            normalize=hybrid,
            hybrid=hybrid)
        
        if hybrid:
            pinecone_index.hybrid.upsert(vectors)
        else:
            pinecone_index.upsert(vectors)

### Begin Upsert for all 100,000 Images
One progress bar is generated per dataset. Truncate number of rows in each dataset by modifying `num_rows` parameter value in the cell below.

In [None]:
upsert_image_embeddings(dataset, index, num_rows=100_000, batch_size=300)

### View Progress On The [Pinecone Console](https://app.pinecone.io) (sample screenshot below)

## Querying Pinecone

Now that all the embeddings of the images are on Pinecone's database, it's time to demonstrate Pinecone's lightning fast query capabilities.

###  Pinecone Example Usage

In the example below, we query Pinecone's API with an embedding of a query image to return the vector embeddings that have the highest similarity score. Pinecone efficiently estimates which of the uploaded vector embeddings have the highest similarity when paired with the query image's embedding, and the database will scale to billions of embeddings while maintaining low latency and high throughput. In this example we have upserted 100,000 embeddings. Our starter plan supports up to one million.

#### Example: Pinecone API Request and Response

Let's find images similar to the `query_image` variable, shown below.

#### Example Query Image

In [None]:
def query_hybrid_index(query_image, query_captions=[""], alpha = 0.5, normalize=True):
    with torch.no_grad():
        vector_values = model(query_image.unsqueeze(0).to(device)).pooler_output
        if normalize:
            vector_values = f.normalize(vector_values, p=2, dim=1)
        query_embedding = vector_values.cpu().tolist()

    query_sparse_rep = encode_image_captions(query_captions)
    response = index.hybrid.query(vector=query_embedding,
                                sparse_vector = query_sparse_rep,
                                top_k=4, 
                                alpha=alpha,
                                include_metadata=False)
    return response

In [None]:
idx = np.random.randint(0, len(val_dataset))
query_image, query_captions = val_dataset[idx]

In [None]:
h.show_query_image(query_image, query_captions)

#### Enriched Response
In the next few lines, we look up the actual images associated with the returned vector embeddings.

In [None]:
def plot_result(query_image, query_captions, alpha=0.5):
    response = query_hybrid_index(query_image, query_captions, alpha=alpha)
    h.show_response_as_grid(response, dataset, figsize=(8, 8), nrows=2, num_captions=1, wrap_len=40,)
    fig = plt.gcf()
    fig.text(0.5, .97, f"Alpha: {alpha}", horizontalalignment='center', fontsize=16)

In [None]:
plot_result(query_image, query_captions, alpha=0.5)

### Analysis

Let's take one particular image, and explore how different keywords in the query affect the results

In [None]:
idx = 4501 # woman playing tennis
query_image, query_captions = val_dataset[idx]

In [None]:
h.show_query_image(query_image, query_captions)

In [None]:
keywords = [[""], ["woman"], ["yellow"]]
fig, axes = plt.subplots(4, len(keywords), figsize=(12, 16))
for i, query_captions in enumerate(keywords):
    response = query_hybrid_index(query_image, query_captions)
    h.show_response_as_grid(response, dataset, num_captions=1, wrap_len=36, axes=axes[:, i])
    fig.text(0.25 * (i + 1), .92, f"Keyword: {query_captions[0]}", horizontalalignment='center', fontsize=16)

line = plt.Line2D((.37,.37),(.1,.93), color="k", linewidth=3)
fig.add_artist(line)    
line = plt.Line2D((.65,.65),(.1,.93), color="k", linewidth=3)
fig.add_artist(line)

Also, let's explore how the hybrid search results differ from those of sparse-only or dense-only search.
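
One way to picture what `alpha` does is as a weight in a convex combination of a dense (image) score and a sparse (keyword) score. The sketch below is an assumption for illustration only: this notebook does not show how the hybrid API combines scores internally, and which end of the range corresponds to dense-only search should be checked against the client documentation. The function name and the example scores are hypothetical.

```python
def hybrid_score(dense_score: float, sparse_score: float, alpha: float) -> float:
    """Convex combination of two scores; the dense/sparse orientation is assumed."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * dense_score + (1.0 - alpha) * sparse_score


# Made-up scores for a single candidate image.
for alpha in (0, 0.5, 1):
    print(alpha, hybrid_score(dense_score=0.8, sparse_score=0.2, alpha=alpha))
```

Under this assumption, the two extreme values of `alpha` each ignore one of the two signals, and intermediate values blend them.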

In [None]:
alpha_values = [0, 0.5, 1]
query_captions = ["woman playing tennis"]
fig, axes = plt.subplots(4, len(alpha_values), figsize=(12, 16))
for i, alpha in enumerate(alpha_values):
    response = query_hybrid_index(query_image, query_captions, alpha=alpha)
    h.show_response_as_grid(response, dataset, num_captions=1, wrap_len=36, axes=axes[:, i])
    fig.text(0.25 * (i + 1), .92, f"Alpha: {alpha}", horizontalalignment='center', fontsize=16)

line = plt.Line2D((.37,.37),(.1,.93), color="k", linewidth=3)
fig.add_artist(line)    
line = plt.Line2D((.65,.65),(.1,.93), color="k", linewidth=3)
fig.add_artist(line)

#### Results

We invite the reader to explore various queries to see what they return. In the one above, we chose one of the COCO validation images as the query image. Note that the query image embedding need not exist in your Pinecone index in order to find similar images. Additionally, the search results are only as good as the embeddings, which depend on the quality and quantity of the images as well as on how expressive the model is. There are plenty of other out-of-the-box, pretrained models in PyTorch and elsewhere!

### Pinecone Example Usage with Metadata

Extensive predicate logic can be applied to metadata filtering, just like the [WHERE clause](https://www.pinecone.io/learn/vector-search-filtering/) in SQL! Pinecone's [metadata feature](https://www.pinecone.io/docs/metadata-filtering/) provides easy-to-implement filtering.
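
To show what such a filter expresses, the sketch below evaluates a small subset of the filter syntax (`$eq`, `$ne`, `$in`) against plain Python dictionaries. It is a local emulation for illustration, not Pinecone's implementation, and edge cases such as missing fields may behave differently in the real service. The records and the `matches` function are made up.

```python
from typing import Any, Dict


def matches(metadata: Dict[str, Any], flt: Dict[str, Dict[str, Any]]) -> bool:
    """Evaluate a tiny subset of the filter syntax; all conditions are ANDed."""
    ops = {
        "$eq": lambda value, arg: value == arg,
        "$ne": lambda value, arg: value != arg,
        "$in": lambda value, arg: value in arg,
    }
    return all(
        ops[op](metadata.get(field), arg)
        for field, conditions in flt.items()
        for op, arg in conditions.items()
    )


# Made-up records standing in for vectors with metadata.
records = [
    {"id": "a", "metadata": {"label": "seal"}},
    {"id": "b", "metadata": {"label": "otter"}},
    {"id": "c", "metadata": {"label": "seal"}},
]
kept = [r["id"] for r in records if matches(r["metadata"], {"label": {"$eq": "seal"}})]
print(kept)  # ['a', 'c']
```

In the database the filter is applied together with the similarity search, so only matching vectors are candidates for the top results.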

#### Example using Metadata

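For demonstration, let's use metadata to find all images labeled as a _seal_ that look like the `query_image` variable shown above. Note that the filter only matches vectors whose metadata contains a `label` field; the hybrid upsert above stores empty metadata, and the non-hybrid path stores captions under `label_0`, `label_1`, and so on.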

In [None]:
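# Embed the query image the same way as in `query_hybrid_index`.
with torch.no_grad():
    query_embedding = f.normalize(
        model(query_image.unsqueeze(0).to(device)).pooler_output, p=2, dim=1
    ).cpu().tolist()

response = index.query(
    query_embedding, 
    top_k=25, 
    filter={"label": {"$eq": "seal"}},
    include_metadata=True
)
h.show_response_as_grid(response, dataset, 5, 5, figsize=(10, 10))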

#### Results

All of the results returned are indeed seals, and many of them do look like the query image! Note how the cosine similarity scores are returned in descending order.

#### Additional Note On Querying Pinecone

Note that a query embedding does not need to exist in the index. In this example, the query images came from the validation set, which was never upserted. For this index, _any 768-dimensional embedding_ can be used to query Pinecone.

## Conclusion

In this example, we demonstrated how Pinecone makes it possible to do realtime image similarity search using a pre-trained computer vision model! We also demonstrated the use of metadata filtering with querying Pinecone's vector database.

### Like what you see? Explore our [community](https://www.pinecone.io/community/) 

Learn more about semantic search and the rich, performant, and production-level feature set of Pinecone's Vector Database by visiting https://pinecone.io, connecting with us [here](https://www.pinecone.io/contact/) and following us on [LinkedIn](https://www.linkedin.com/company/pinecone-io). If you are interested in some of the algorithms that allow for efficient estimation of similar vectors, visit the Algorithms and Libraries section of our [Learning Center](https://www.pinecone.io/learn/).