
# Binary Vector and Metadata Search with Hyperspace
This notebook demonstrates how to use the Hyperspace engine for hybrid search, combining vector search over binary vectors with filtering on their corresponding metadata.

# The Dataset
The dataset consists of randomly generated binary vectors of dimension 800, each paired with metadata that describes a store.

## The Dataset Fields
The metadata includes the following fields:
1. **country** [string] - The country in which the store is located
2. **city** [string] - The city in which the store is located
3. **street** [keyword] - The street on which the store is located
4. **zip_code** [integer] - The store zip code
5. **open_now** [boolean] - Whether the store is currently open
6. **vertical** [keyword] - The store vertical (industry)

# Setting up the Hyperspace environment
Setting up the Hyperspace environment and running queries includes the following steps:

1. Download and install the client API
2. Connect to a server
3. Create data schema file
4. Create collection
5. Ingest data
6. Define Logic and Run a Query


# 1. Install the client API
The Hyperspace API can be installed directly from GitHub using the following command:

In [None]:
pip install git+https://github.com/hyper-space-io/hyperspace-py

# 2. Connect to a server

Once the Hyperspace API is installed, the database can be accessed by creating a local instance of the Hyperspace client. This step requires a host address, a username, and a password.

In [None]:
import hyperspace

# `username` and `password` must be defined beforehand with your Hyperspace credentials.
hyperspace_client = hyperspace.HyperspaceClientApi(host='https://search-master-demo.development.hyper-space.xyz',
                                                   username=username, password=password)

Before continuing, let us check that the cluster is live:

In [None]:
cluster_status = hyperspace_client.cluster_status()
display(cluster_status)

# 3. Create a Data Schema File

Similarly to other search databases, the Hyperspace database requires a configuration file that outlines the data schema. Here, we create a config file that corresponds to the fields of the given dataset.

For vector fields, we also specify the index type and the metric. Current index options include "**brute_force**", "**hnsw**", "**ivf**", and, for binary vectors, "**bin_ivf**". Metric options are "**IP**" (inner product) for floating point vectors and "**Hamming**" ([Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance)) for binary vectors.
Note that the key 'low_cardinality' enables faster search for low cardinality fields.

In [None]:
import json
vector_dimension = 800 # bits
config = {
    "configuration": {
        'city': {"type": 'keyword'},
        'country': {"type": 'keyword'},
        'open_now': {"type": 'boolean'},
        'zip_code': {"type": 'integer'},
        'street': {"type": 'keyword'},
        'vertical': {"type": 'keyword', 'low_cardinality': True},
        "vector": {
            "type": "dense_vector",
            "index_type": "bin_ivf",
            "dim": vector_dimension,
            "metric": "hamming"
        }
    }
}

with open('config.json', 'w') as f:
    f.write(json.dumps(config, indent=2))
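
To make the Hamming metric in the schema concrete, here is a minimal pure-Python sketch that counts differing bits between two binary vectors stored as bytes. The helper name `hamming_distance` is an illustrative assumption and is not part of the Hyperspace API; the engine computes the metric internally.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length byte strings differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # XOR leaves a 1 bit exactly where the inputs differ; count those bits.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Two 16-bit vectors that differ in the two lowest bits of the first byte.
a = bytes([0b11110000, 0b00001111])
b = bytes([0b11110011, 0b00001111])
print(hamming_distance(a, b))  # 2
```

A smaller distance means the vectors share more bits, so the nearest neighbors of a query are the documents with the lowest Hamming distance to it.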


# 4. Create Collection
The Hyperspace engine stores data in collections, where each collection typically hosts data of a similar context. Each search is then performed within a single collection. We create a collection using the command "**create_collection**(schema_filename, collection_name)".

In [None]:
collection_name = 'GeneratedData'
delete_collection = False

if delete_collection:
    hyperspace_client.delete_collection(collection_name)

hyperspace_client.create_collection('config.json', collection_name)

# 5. Ingest data

In the next step we ingest the dataset in batches of 250 documents. The batch size is set by the user and can be increased to improve ingestion time. We add batches of data using the command **add_batch**(batch, collection_name).

In [None]:
import random
import secrets
import base64

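def generate_data(metadata, vector_dimension):
    # Copy the chosen entry so that adding a vector does not mutate the shared metadata
    data_point = dict(random.choice(metadata))
    # vector_dimension is in bits; the vector is stored as a base64-encoded byte string
    random_bytes = secrets.token_bytes(vector_dimension // 8)
    data_point['vector'] = base64.b64encode(random_bytes).decode()
    return data_point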
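
As a quick check of the encoding used above, the following sketch round-trips a random 800-bit vector through base64. It uses only the standard library; the variable names are illustrative.

```python
import base64
import secrets

vector_dimension = 800  # bits, matching the schema

# 800 bits are packed into 100 bytes; the dimension should be a multiple of 8,
# since integer division would otherwise drop the trailing bits.
raw = secrets.token_bytes(vector_dimension // 8)
encoded = base64.b64encode(raw).decode()  # ASCII string, as stored in the 'vector' field
decoded = base64.b64decode(encoded)       # back to the original bytes

print(len(raw), len(encoded))
```

Base64 emits 4 characters for every 3 input bytes (with padding), which is why the 100-byte vector becomes a 136-character string.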

In [None]:
import pickle

# Load the pre-generated store metadata (a sequence of dicts).
# Note: only unpickle files that come from a trusted source.
with open("Generated_data.hsv", 'rb') as file:
    metadata = pickle.load(file)

BATCH_SIZE = 250
NUM_DOCUMENTS = 100000

batch = []

for i in range(NUM_DOCUMENTS):
    data_point = generate_data(metadata, vector_dimension)
    batch.append(hyperspace.Document(str(i), data_point))

    # Send a full batch to the server, then start a new one
    if (i + 1) % BATCH_SIZE == 0:
        response = hyperspace_client.add_batch(batch, collection_name)
        print(i, response)
        batch.clear()

# Send any remaining documents that did not fill a complete batch
if batch:
    hyperspace_client.add_batch(batch, collection_name)

hyperspace_client.commit(collection_name)
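
The batching logic above can also be separated from the ingestion call. The sketch below shows one way to do this with a small generator; `chunked` is a hypothetical helper written for this notebook, not part of the Hyperspace client.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# The last chunk holds the leftover items, so no separate flush step is needed.
batches = list(chunked(range(10), 4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

With such a helper, each chunk of documents could be passed to `add_batch` in a single loop, and the trailing partial batch is handled automatically.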



In [None]:
data_point

{'city': 'North Mistyview',
 'country': 'Gibraltar',
 'open_now': True,
 'zip_code': 31158,
 'street': 'banana',
 'vector': '9BHcg8G55CcOpuKrwHePARQnqT2NERUXcWY4u40tRCZKI+5/VA6lRb7I3mCfVeeOZvF2y11I8H8A9DJ4RLZY8aY8r44eaQr8vHgBjFDve1OCoNm+My0O2O+7BvUbiJj80CFgQg=='}

# 6. Define Logic and Run a Query
We will build a hybrid search query using Hyperspace. The query takes an input document (here, a random binary vector together with a country value) and finds the documents most similar to it.
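
Conceptually, a hybrid query of this kind combines a metadata filter with a nearest-neighbor ranking. The following brute-force sketch illustrates the idea locally on a toy dataset. It is not Hyperspace's implementation, and it does not reproduce the contents of the score function file used below; the filter on `country`, the function names, and the toy documents are all assumptions made for illustration.

```python
def hamming(a: bytes, b: bytes) -> int:
    """Hamming distance between two equal-length byte strings."""
    return bin(int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).count("1")


def hybrid_search(docs, query_vector, country, size):
    """Filter by metadata first, then rank the survivors by Hamming distance."""
    candidates = [d for d in docs if d["country"] == country]
    candidates.sort(key=lambda d: hamming(d["vector"], query_vector))
    return candidates[:size]


# Toy collection with 8-bit vectors (assumed data, for illustration only)
docs = [
    {"id": "0", "country": "France", "vector": bytes([0b00000000])},
    {"id": "1", "country": "France", "vector": bytes([0b00000111])},
    {"id": "2", "country": "Spain", "vector": bytes([0b00000001])},
    {"id": "3", "country": "France", "vector": bytes([0b00000001])},
]

top = hybrid_search(docs, bytes([0b00000001]), "France", size=2)
print([d["id"] for d in top])  # ['3', '0']
```

Document "2" matches the query vector exactly but is excluded by the country filter, which is the behavior a hybrid query is meant to provide.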

In [None]:
score_function_filename = 'binary_score_function.py'
hyperspace_client.set_function(score_function_filename, collection_name=collection_name, function_name='score_function')

In [None]:
from pprint import pprint

# Query document: a random binary vector (base64-encoded) plus a country value
input_vector = {'vector': base64.b64encode(secrets.token_bytes(vector_dimension // 8)).decode(),
                'country': 'France'}

query = {
    'params': input_vector,
    "knn": {
        "query": {"boost": 1},
        "vector": {
            "boost": 10,
        }
    }
}

results = hyperspace_client.search(query,
                                   size=15,
                                   function_name='score_function',
                                   collection_name=collection_name)
candidates = results['candidates']

print(f"Query run time: {results['took_ms']:.2f}ms")
print(f"Query run time / (candidates + 1) = {results['took_ms'] / (candidates + 1):.2f}ms")
pprint(results['similarity'])


# Results
Let's view the results. Since the vectors are random, it is hard to evaluate the quality of the vector search, beyond the fact that results are sorted by Hamming distance. On the classic search side, we can see that the filters behave as expected.

In [None]:
for i, x in enumerate(results['similarity']):
    document = hyperspace_client.get_document(collection_name, x["document_id"])
    print(i, document)

This notebook gave a simple example of using the Hyperspace engine for hybrid search. Hyperspace can support significantly more complicated use cases with large databases, at extremely low latency.