<a href="https://colab.research.google.com/github/jeffvestal/elastic_jupyter_notebooks/blob/main/image_vector_search_load_and_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# !!! WORK IN PROGRESS DO NOT USE YET !!!

# Searching for images by description with vector search 

This code will show you how to set up an ingest pipeline to generate vectors for images on ingest. We will then use kNN vector search to find images matching a description.

Overview of steps
1. Set up our python environment
2. Load model from Hugging Face into Elastic
3. Set up the index mapping
4. Configure ingest pipeline
5. Download and unzip [Unsplash Lite Data set](https://github.com/unsplash/datasets#lite-dataset)
6. Index an image data set and vectors
7. Run a sample kNN search

### Requirements
This notebook assumes you have already loaded an embedding model into Elasticsearch. If you haven't, please start with [this notebook example](https://github.com/jeffvestal/elastic_jupyter_notebooks/blob/main/load_embedding_model_from_hf_to_elastic.ipynb)


### Elastic version support
Requires Elastic version 8.0+ with a platinum or enterprise license (or trial license)

You can set up a [free trial Elasticsearch deployment in Elastic Cloud](https://cloud.elastic.co/registration).

---
---
# Setup
---
---
This section will set up the python environment with the required libraries

## Install and import required python libraries

Elastic uses the [eland python library](https://github.com/elastic/eland) to download models from the Hugging Face hub and load them into Elasticsearch

In [None]:
pip install eland

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
pip install elasticsearch

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
pip install sentence_transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/

In [None]:
pip install torch==1.11

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torch==1.11
  Using cached torch-1.11.0-cp38-cp38-manylinux1_x86_64.whl (750.6 MB)
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 1.13.1
    Uninstalling torch-1.13.1:
      Successfully uninstalled torch-1.13.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.14.1+cu116 requires torch==1.13.1, but you have torch 1.11.0 which is incompatible.
torchtext 0.14.1 requires torch==1.13.1, but you have torch 1.11.0 which is incompatible.
torchaudio 0.13.1+cu116 requires torch==1.13.1, but you have torch 1.11.0 which is incompatible.
Successfully installed torch-1.11.0


In [None]:
from pathlib import Path
from eland.ml.pytorch import PyTorchModel
from eland.ml.pytorch.transformers import TransformerModel
from elasticsearch import Elasticsearch, helpers
from elasticsearch.client import MlClient
from pprint import pprint

import requests
import zipfile
import io

import pandas as pd
import base64



## Configure Elasticsearch authentication
The recommended authentication approach is using the [Elastic Cloud ID](https://www.elastic.co/guide/en/cloud/current/ec-cloud-id.html) and a [cluster level API key](https://www.elastic.co/guide/en/kibana/current/api-keys.html).

You can use any method you wish to set the required credentials. We are using getpass in this example to prompt for credentials to avoid storing them in GitHub.

In [None]:
import getpass

In [None]:
es_cloud_id = getpass.getpass('Enter Elastic Cloud ID:  ')
es_api_id = getpass.getpass('Enter cluster API key ID:  ') 
es_api_key = getpass.getpass('Enter cluster API key:  ')

## Connect to Elastic Cloud

In [None]:
es = Elasticsearch(cloud_id=es_cloud_id, 
                   api_key=(es_api_id, es_api_key)
                   )
es.info() # should return cluster info

---
---
# Load Model from Hugging Face
---
---
We will be using the [clip-ViT-B-32-multilingual-v1](https://huggingface.co/sentence-transformers/clip-ViT-B-32-multilingual-v1) sentence transformers model. This model's capabilities are described as:

> You can map text (in 50+ languages) and images to a common dense vector space such that images and the matching texts are close. This model can be used for image search (users search through a large collection of images)

### Download the Model
Here we specify the model id from Hugging Face. The easiest way to get this id is to click the "copy model name" icon next to the name on the model page.

When calling `TransformerModel` you specify the HF model id and the task type. You can try specifying `auto`, and eland will attempt to determine the correct type from the model config. This is not always possible, so the list of specific `task_type` values can be viewed in the following code:
[Supported values](https://github.com/elastic/eland/blob/15a300728876022b206161d71055c67b500a0192/eland/ml/pytorch/transformers.py#*L41*)

In [None]:
hf_model_id='sentence-transformers/clip-ViT-B-32-multilingual-v1'
tm = TransformerModel(hf_model_id, "text_embedding")

### Set and confirm the model ID
To make the name compatible with Elasticsearch, the '/' is replaced with '__'.
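
As a rough illustration of that renaming, here is a minimal sketch in plain Python. The helper `to_es_model_id` is hypothetical and is not eland's implementation. The lowercasing step is an assumption, made because Elasticsearch model IDs must be lowercase. In the notebook itself, `tm.elasticsearch_model_id()` remains the source of truth.

```python
def to_es_model_id(hf_model_id: str) -> str:
    """Hypothetical helper approximating the Hugging Face -> Elasticsearch ID mapping."""
    # '/' is not allowed in an Elasticsearch model ID, so it becomes '__'.
    # Lowercasing is an assumption made for this sketch.
    return hf_model_id.replace("/", "__").lower()


example_id = to_es_model_id("sentence-transformers/clip-ViT-B-32-multilingual-v1")
print(example_id)
```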



In [None]:
es_model_id = tm.elasticsearch_model_id()
es_model_id

## Export the model in a TorchScript representation which Elasticsearch uses

In [None]:
tmp_path = "models"
Path(tmp_path).mkdir(parents=True, exist_ok=True)
model_path, config, vocab_path = tm.save(tmp_path)

## Load the model into Elasticsearch
The model must not already exist in Elasticsearch.

In [None]:
ptm = PyTorchModel(es, es_model_id)
ptm.import_model(model_path=model_path, config_path=None, vocab_path=vocab_path, config=config) 

# Starting the Model

## View information about the model
This is not required but can be handy to get a model overview

In [None]:
m = MlClient.get_trained_models(es, model_id=es_model_id)
m.body

## Deploy the model
This will load the model on the ML nodes and start the process(es) making it available for the NLP task

In [None]:
s = MlClient.start_trained_model_deployment(es, model_id=es_model_id)
s.body

## Verify the model started without issue

In [None]:
stats = MlClient.get_trained_models_stats(es, model_id=es_model_id)
stats.body['trained_model_stats'][0]['deployment_stats']['nodes'][0]['routing_state']
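
The lookup above only inspects the first node of the first deployment. A minimal sketch of checking every node is shown here, run against a made-up response fragment rather than real cluster output. The helper name `node_routing_states` and the nested `routing_state` shape in the sample are assumptions for illustration.

```python
def node_routing_states(stats_body: dict) -> list:
    """Collect the routing_state entry for every node of every deployment."""
    states = []
    for model in stats_body.get("trained_model_stats", []):
        nodes = model.get("deployment_stats", {}).get("nodes", [])
        for node in nodes:
            states.append(node.get("routing_state"))
    return states


# Made-up fragment shaped like the path used in the cell above (illustration only).
sample_stats = {
    "trained_model_stats": [
        {"deployment_stats": {"nodes": [{"routing_state": {"routing_state": "started"}}]}}
    ]
}

states = node_routing_states(sample_stats)
# True only if at least one node was found and every node reports "started".
all_started = bool(states) and all(
    s is not None and s.get("routing_state") == "started" for s in states
)
print(states, all_started)
```

With a live cluster, `stats.body` would be passed in place of `sample_stats`.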

---
---
# Elasticsearch index setup
---
---
Here we will configure an index template with settings and mappings to store our vectors and text data

The **important** part here will be setting our vector field to be a `dense_vector` type. This will tell elasticsearch to build the HNSW graph for the vectors so we can then use kNN search later. 

## Define the index template
We will have the following fields

- `vectors` of type `dense_vector`
  - it is important to set `dims` to the number of dimensions output by the model you will use
- `title` of type `text`
- `summary` of type `text`

We will have 
- 1 primary shard
- 0 replicas -> *note* in production you will want at least 1 replica

This will match new indices whose names match the pattern `jupyter-vector-image-search-demo*`

In [None]:
index_id = 'jupyter-vector-image-search-demo'
index_patterns = index_id + "*"
settings= {
        "number_of_shards": 1,
        "number_of_replicas": 0
    }
mappings= {
        "properties": {
            "vectors": {
                "type": "dense_vector",
                "dims": 512,
                "index" : True,
                "similarity" : "cosine"
            },
            "title": {
                "type": "text"
            },
            "summary": {
                "type": "text"
            }
        }
    }
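
To make the `dims` and `similarity` settings more concrete, here is a small plain-Python sketch. It is not how Elasticsearch computes similarity internally. The helpers `check_dims` and `cosine_similarity` are hypothetical names used only for illustration.

```python
import math


def check_dims(vector, dims=512):
    """Raise if a vector does not have the length declared in the mapping."""
    if len(vector) != dims:
        raise ValueError(f"expected {dims} dimensions, got {len(vector)}")
    return True


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
print(same_direction, orthogonal)
```

Vectors that point in the same direction score highest regardless of their length, which is why cosine is a common choice for embedding vectors.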

## Apply the template
Here we apply the template and give it the name `jupyter-vector-image-search-demo-template`. This name is only used to refer to the template if we need to modify it later on.

In [None]:
template_id = index_id + '-template'
es.indices.put_template(name=template_id, 
                        index_patterns=index_patterns,
                        settings=settings,
                        mappings=mappings
                        )

---
---
# The Ingest Pipeline
---
---

An ingest pipeline has one or more processors and processes documents before they are written into an Elasticsearch index. 

Each processor is designed to perform a specific task, such as parsing fields or enriching data. 

The main processor for this pipeline is the `inference` processor. The inference processor sends a specified field to a trained model and writes the output from the model to a new field along with the original fields in the document. 

To make it simpler to access the vector, we will copy the vectors to a field named `vectors` and then remove the `ml` field tree, which is the default output.
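
The effect of the copy and remove steps on a single document can be sketched in plain Python. This only mimics the document shape before and after. It does not run inside Elasticsearch, and the sample text and three-element vector are made up.

```python
import copy


def simulate_set_and_remove(doc: dict) -> dict:
    """Mimic the `set` (copy_from) and `remove` steps on a dict."""
    out = copy.deepcopy(doc)
    # set: copy ml.inference.predicted_value into a top-level `vectors` field
    out["vectors"] = out["ml"]["inference"]["predicted_value"]
    # remove: drop the whole `ml` field tree
    del out["ml"]
    return out


# Made-up document as it might look right after the inference step.
after_inference = {
    "summary": "trees during night time",
    "ml": {"inference": {"predicted_value": [0.1, 0.2, 0.3]}},
}

final_doc = simulate_set_and_remove(after_inference)
print(final_doc)
```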

## Configure the pipeline

In [None]:
pipeline_definition = {
    "description": "A pipeline for generating and storing vectors on ingest",
    "processors": [
      {
       "inference": {
          "model_id": es_model_id,
          "field_map": {
           "summary": "text_field"
          }
       }
     },
     {
      "set": {
        "field": "vectors",
        "copy_from": "ml.inference.predicted_value"
        }
     },
    {
      "remove": {
        "field": "ml"
      }
    }
  ]
}



## Create the pipeline if it doesn't exist

In [None]:
pipeline_id = index_id + '-pipeline'
if es.ingest.put_pipeline(id=pipeline_id, body=pipeline_definition):
    print("Pipeline created successfully")
else:
    print("Failed to create pipeline")


## Verify the pipeline
Not required but nice to verify everything looks correct

In [None]:
pipeline = es.ingest.get_pipeline(id=pipeline_id)
pipeline.body

---
---
# Ingest Docs and Generate Vectors
---
---
We will be using the [Unsplash Lite dataset](https://github.com/unsplash/datasets#lite-dataset). 

*note* the data set size: [~650MB compressed, ~1.4GB raw]. This will fit in the storage space supplied by Colab.

## Download the images dataset

In [None]:
url = "https://unsplash.com/data/lite/latest"
r = requests.get(url)

#### Open the zip file

In [None]:
z = zipfile.ZipFile(io.BytesIO(r.content))

#### Extract the contents of the zip file to a folder

In [None]:
z.extractall("unsplash_images")
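
The download cells above read the whole archive into memory and open it through `io.BytesIO`. Here is a self-contained sketch of the same pattern that needs no network access. The tiny archive and its contents are made up and stand in for `r.content`.

```python
import io
import zipfile

# Build a tiny in-memory zip that stands in for the downloaded bytes (r.content).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("photos.tsv000", "photo_id\tai_description\nabc123\ta sample row\n")
archive_bytes = buffer.getvalue()

# Same pattern as the notebook: wrap the bytes in BytesIO and open them as a zip.
archive = zipfile.ZipFile(io.BytesIO(archive_bytes))
names = archive.namelist()
first_line = archive.read("photos.tsv000").decode("utf-8").splitlines()[0]
print(names, first_line)
```

Listing `namelist()` before calling `extractall` is a cheap way to confirm that the archive contains the files you expect.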

#### Load the photo metadata
The path here is where the files are extracted to in Google Colab.

In [None]:
filename = '/content/unsplash_images/photos.tsv000'
df = pd.read_csv(filename, sep='\t', header=0)


#### Check a sample of the data (selecting only the columns we need)

In [None]:
df[['photo_id', 'photo_image_url', 'ai_description']]

| | photo_id | photo_image_url | ai_description |
|---|---|---|---|
| 0 | XMyPniM9LF0 | https://images.unsplash.com/uploads/1411949294... | woman walking in the middle of forest |
| 1 | rDLBArZUl1c | https://images.unsplash.com/photo-141633941111... | succulent plants in clear glass terrarium |
| 2 | cNDGZ2sQ3Bo | https://images.unsplash.com/photo-142014251503... | rocky mountain under gray sky at daytime |
| 3 | iuZ_D1eoq9k | https://images.unsplash.com/photo-141487280988... | red common poppy flower selective focus phography |
| 4 | BeD3vjQ8SI0 | https://images.unsplash.com/photo-141700759404... | trees during night time |
| ... | ... | ... | ... |
| 24995 | c7OrOMxrurA | https://images.unsplash.com/photo-159300793778... | black metal fence during daytime |
| 24996 | 15IuQ5a0Qwg | https://images.unsplash.com/photo-159296761254... | white and brown seashell on white surface |
| 24997 | w8nrcXz8pwk | https://images.unsplash.com/photo-159299937329... | leopard on brown tree trunk during daytime |
| 24998 | n1jHrRhehUI | https://images.unsplash.com/photo-159192792878... | woman in beige coat and white hat standing on ... |


#### Build list of docs to index

We need to accomplish three tasks at this step
1. download an image 
2. convert it to base64 (this will be passed to the clip model to generate a vector)
3. create the payload for the bulk insert
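
Step 2 can be tried in isolation with a minimal sketch of the base64 round trip, using only the standard library. The bytes below are a made-up stand-in for a downloaded image, not real image data.

```python
import base64

# Made-up bytes standing in for response.content of a downloaded image.
fake_image_bytes = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8

# Encode to base64, then decode the result to a UTF-8 string so it can be placed in JSON.
b64_string = base64.b64encode(fake_image_bytes).decode("utf-8")

# Decoding the string gives back the original bytes.
restored = base64.b64decode(b64_string)
print(len(fake_image_bytes), len(b64_string))
```

Base64 output is roughly a third larger than the raw bytes, which is worth keeping in mind when sizing bulk requests.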

In [None]:
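docs = []
sample_count = 100  # only use the first 100 photos to keep the demo quick

for i, row in df.head(sample_count).iterrows():
  photo_image_url = row['photo_image_url']

  # 1. download the image
  response = requests.get(photo_image_url)
  img = response.content

  # 2. convert the raw bytes to a base64 string
  # (note: b64_string is not yet added to the document below)
  b64 = base64.b64encode(img)
  b64_string = b64.decode("utf-8")

  # 3. create the payload for the bulk insert
  docs.append(
    {   "_index": index_id,
        "_source": {
          'photo_image_url' : photo_image_url,
          'photo_id' : row['photo_id'],
          'ai_description' : row['ai_description']
        }
    }
  )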



## Index the docs 
This will send a bulk index request to Elasticsearch, sending all the docs through the ingest pipeline, generating vectors, and storing them in the index

In [None]:
helpers.bulk(es, docs, pipeline=pipeline_id)
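
`helpers.bulk` accepts any iterable of actions, so the list can also be produced lazily. The following is a minimal sketch of that idea. The generator `generate_actions` and the sample rows are hypothetical, and nothing here contacts a cluster.

```python
def generate_actions(rows, index_name):
    """Yield one bulk action per metadata row (hypothetical helper)."""
    for row in rows:
        yield {
            "_index": index_name,
            "_source": {
                "photo_image_url": row["photo_image_url"],
                "photo_id": row["photo_id"],
                "ai_description": row["ai_description"],
            },
        }


# Made-up rows standing in for records from the photo metadata.
sample_rows = [
    {"photo_id": "a1", "photo_image_url": "https://example.com/a1.jpg", "ai_description": "a red door"},
    {"photo_id": "b2", "photo_image_url": "https://example.com/b2.jpg", "ai_description": "a blue boat"},
]

actions = list(generate_actions(sample_rows, "jupyter-vector-image-search-demo"))
print(len(actions), actions[0]["_source"]["photo_id"])
```

With a live connection, the generator could be passed directly to `helpers.bulk` together with the pipeline id defined earlier, so the full list never has to be held in memory.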

## Verify one of the docs 
Let's take a look at one doc and see how it was indexed

In [None]:
result = es.search(index=index_id, body={}, size=1)
result.body['hits']['hits'][0]['_source']

---
---
# Searching for an image by text description
---
---



## Generate Vector for Query

Before we can run an approximate k-nearest neighbor (kNN) query, we need to convert our query string to a vector.

Set a sample query doc

Depending on your specific model, you may need to change the field name from "text_field"

In [None]:
docs =  [
    {
      "text_field": "State of the art nlp models"
    }
  ]

We call the `_infer` endpoint supplying the model_id and the doc[s] we want to vectorize. 

In [None]:
vec = MlClient.infer_trained_model(es, model_id=es_model_id, docs=docs, )

The vector for the first doc can be accessed in the response dict as shown below

In [None]:
doc_0_vector = vec['inference_results'][0]['predicted_value']
doc_0_vector

## Run the Search

We will call the `_search` api and specify the `knn` section. 

This is a simple example of a search query. Elastic supports combining kNN search with "traditional" BM25 search. You can also filter documents to reduce the number of docs that need to be searched. See the [docs](https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search-api.html) for more information.

This will be a very simple example to get started

### Create the search parameters
Here we are just specifying the `knn` section, but you can also set any of the other search parameters.

In [None]:
knn = {
    "field": "vectors",
    "query_vector": doc_0_vector,
    "k": 2,
    "num_candidates": 10
  }
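
A small helper can build this section and catch an easy mistake: `num_candidates` must be at least `k`. The following is a sketch only. `build_knn` is a hypothetical helper, and the zero vector is a placeholder for a real query vector.

```python
def build_knn(query_vector, k=2, num_candidates=10, field="vectors"):
    """Build the `knn` section of a search request with basic sanity checks."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if num_candidates < k:
        raise ValueError("num_candidates must be greater than or equal to k")
    return {
        "field": field,
        "query_vector": query_vector,
        "k": k,
        "num_candidates": num_candidates,
    }


# Placeholder vector with the 512 dimensions declared in the mapping.
knn_params = build_knn([0.0] * 512)
print(knn_params["k"], knn_params["num_candidates"])
```

A larger `num_candidates` considers more candidates per shard, which generally improves accuracy at the cost of slower searches.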

Send in the search request

In [None]:
results = es.search(index=index_id, knn=knn, size=1)


### View the results

In [None]:
results.body

### Search Query without returning vectors

Often when running kNN search, you don't need to return the vectors themselves; you only want the fields to display to the end user.

We are adding a new parameter, `fields`, which lets us specify the fields to return rather than all of them.

By setting `source` to False (`_source: false`) we avoid getting the entire source payload back in the response.

We set the `size` value as a variable here simply to gather the parameters together.

In [None]:
knn = {
    "field": "vectors",
    "query_vector": doc_0_vector,
    "k": 2,
    "num_candidates": 10
  }
fields = ["summary", "title"]
size = 1
source = False

In [None]:
results = es.search(index=index_id, 
                    knn=knn, 
                    source=source, 
                    fields=fields, 
                    size=size
                  )


### View the results

In [None]:
results.body
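
When `fields` is used, each requested field comes back as a list of values under the hit's `fields` key. Here is a minimal sketch of flattening such a response into display rows. The helper `summarize_hits` is hypothetical, and the sample body is made up rather than real search output.

```python
def summarize_hits(response_body: dict) -> list:
    """Flatten search hits into simple dicts for display (hypothetical helper)."""
    rows = []
    for hit in response_body["hits"]["hits"]:
        row = {"id": hit["_id"], "score": hit["_score"]}
        # Values under `fields` are lists, so take the first value of each.
        for name, values in hit.get("fields", {}).items():
            if values:
                row[name] = values[0]
        rows.append(row)
    return rows


# Made-up response body for illustration only.
sample_body = {
    "hits": {
        "hits": [
            {
                "_index": "jupyter-vector-image-search-demo",
                "_id": "abc",
                "_score": 0.91,
                "fields": {"title": ["A title"], "summary": ["A summary"]},
            }
        ]
    }
}

rows = summarize_hits(sample_body)
print(rows)
```

With a live cluster, `results.body` would be passed in place of `sample_body`.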