# Set Up the Hazelcast Connection
The documentation for the Hazelcast Python client is here: https://hazelcast.readthedocs.io/en/latest/

The first thing we will do is establish a connection to the cluster, retrieve the "images" vector collection, and print out its size.

In [9]:
import hazelcast

# connect to Hazelcast
hz = hazelcast.HazelcastClient(cluster_name = "dev", cluster_members = ["hz:5701"])

# retrieve the vector collection and print the size
vectors = hz.get_vector_collection('images').blocking()
vectors.size()

204

# Set Up the Embedding 

A vector collection only deals with vectors. To run a vector similarity search, we need a query vector, which we obtain by applying the same embedding to the query that we applied to the images. In this case, we are using the sentence_transformers package and, specifically, the "clip-ViT-B-32" model.

The SentenceTransformer class performs the embedding. It can be initialized with just a model name, in which case the model is downloaded from the Hugging Face model repository. We have already downloaded the model so, to speed things up, we load it from the file system.

In [10]:
# initialize the sentence transformer 
import sentence_transformers
encoder = sentence_transformers.SentenceTransformer('/project/models/clip-ViT-B-32')

# Run the Query

This is a two-step process. First we transform the input query into a vector, then we call the Hazelcast
"search_near_vector" API to find similar images.

In [11]:
from IPython.display import display, Image
from hazelcast.vector import Vector, VectorType
import requests
from io import BytesIO

# Embed the Query
#
# we turn the query, which is text, into a vector using the previously initialized sentence transformer 
# 
query = encoder.encode("dragonfly")

# Create a hazelcast.vector.Vector instance representing the search
#
# In addition to the vector itself (an array of numbers), we also pass the name of the index we wish to search
# and the vector type.
#
# The encoder returns a numpy array but we need to pass a list of floats to the vector API,
# so we call "tolist()" to perform the conversion
#
search_vector = Vector(name='semantic-search', type=VectorType.DENSE, vector=query.tolist())

# Search the vector collection
#
# Now we can run the search, passing the previously created Vector object.
# We can also specify what we want the query to return (e.g. the actual vector, the vector metadata).
# In our case, the vector "value" is the URL of the matching image, so we pass "include_value=True"
#
results = vectors.search_near_vector(search_vector, include_value=True, limit=3)

# Display the results 
for result in results:
    url = result.value
    # the image will actually be retrieved by the browser, which is outside of the Docker environment 
    # so we change "www" to "localhost"
    url = url.replace('www','localhost')
    display(Image(url=url))
    print(f'score={result.score}')
    

score=0.6422616839408875


score=0.6386593580245972


score=0.6339011192321777


# Your Turn

You may have noticed that the images don't always match the query. The similarity search just finds the nearest vectors. If you ask for "Zebras" and there are no zebras, the code above will just return the 3 closest images. The score is the proximity of the image to the query vector, measured using cosine similarity.
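
To make the score more concrete, here is a minimal sketch of cosine similarity in plain Python. The helpers `cosine_similarity` and `rescale_to_unit_range` are hypothetical names used for illustration; they are not part of the Hazelcast client. The rescaling `(1 + cos) / 2` is an assumption about how a raw cosine in the range -1 to 1 could be mapped onto a 0-to-1 score, so check the Hazelcast documentation for the exact formula it uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def rescale_to_unit_range(cos):
    # ASSUMPTION: maps a raw cosine in [-1, 1] onto [0, 1].
    # This is an illustrative guess, not a confirmed Hazelcast formula.
    return (1.0 + cos) / 2.0

# Vectors pointing the same way, at right angles, and in opposite directions
same = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

for label, cos in [("same", same), ("orthogonal", orthogonal), ("opposite", opposite)]:
    print(f"{label}: cosine={cos:.3f} rescaled={rescale_to_unit_range(cos):.3f}")
```

Only the direction of the vectors matters, not their length, which is why `[1.0, 2.0]` and `[2.0, 4.0]` count as identical here.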

There is no universal distance threshold that indicates a match. Additionally, the scores do not cover a very wide range, typically 0.55 to 0.70 or so. However, through trial and error, it appears that a threshold of 0.63 does a decent job of discriminating matches from non-matches.

Rework the query example above (the cell that calls "search_near_vector") to display only images that have a similarity score of 0.63 or higher.

Note: although it is sometimes called a "distance", the cosine score reported here varies from 0 to 1, with 1 being the closest.
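
One possible way to approach the exercise is to filter the results by score before displaying them. The sketch below uses a hypothetical helper, `filter_by_score`, and stand-in result objects built with `SimpleNamespace` (with made-up URLs and scores) so that it runs without a cluster. It assumes only that each result exposes `value` and `score` attributes, as the results in the query cell do.

```python
from types import SimpleNamespace

SCORE_THRESHOLD = 0.63  # the trial-and-error threshold suggested in the text

def filter_by_score(results, threshold=SCORE_THRESHOLD):
    """Keep only the results whose score is at or above the threshold."""
    return [r for r in results if r.score >= threshold]

# Stand-in results with hypothetical URLs and scores, for illustration only
sample_results = [
    SimpleNamespace(value="http://localhost/images/a.jpg", score=0.642),
    SimpleNamespace(value="http://localhost/images/b.jpg", score=0.631),
    SimpleNamespace(value="http://localhost/images/c.jpg", score=0.598),
]

matches = filter_by_score(sample_results)
for match in matches:
    print(f"{match.value} score={match.score}")
```

In the notebook, the same filter could be applied to `results` before the display loop. If no result reaches the threshold, nothing is displayed, which is the intended behavior for a query with no real match.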