# Binary vector and metadata search with Hyperspace
This notebook demonstrates how to use the Hyperspace engine to combine binary vector search with metadata filtering.

# The Dataset
The dataset is randomly generated and consists of binary vectors of dimension 800 with corresponding metadata.

## Setting up the Hyperspace environment
Setting up the environment requires the following steps:

1. Download and install the client API
2. Connect to a server
3. Create data schema file
4. Create collection
5. Ingest data
6. Define Logic and Run a Query


# 1. Download and install the client API
Installing the Hyperspace client requires a single command:

In [None]:
pip install git+https://github.com/hyper-space-io/hyperspace-py

# 2. Connect to a server

To use the Hyperspace engine, connect to a remote machine with the credentials provided to you.

In [3]:
import hyperspace

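# username and password hold the credentials provided to you;
# they must be defined before this cell runs
hyperspace_client = hyperspace.HyperspaceClientApi(host='https://search-master-demo.development.hyper-space.xyz',
                                                   username=username, password=password)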
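
The connection cell uses `username` and `password` without showing where they come from. One way to supply them without hard-coding secrets in the notebook is to read them from environment variables. The sketch below is a suggestion only: `load_credentials` and the variable names `HYPERSPACE_USERNAME` and `HYPERSPACE_PASSWORD` are hypothetical choices, not something the Hyperspace client defines or requires.

```python
import os

def load_credentials(env=os.environ):
    """Return (username, password) read from the given environment mapping.

    HYPERSPACE_USERNAME and HYPERSPACE_PASSWORD are hypothetical names
    chosen for this sketch; the Hyperspace client does not define them.
    """
    try:
        return env["HYPERSPACE_USERNAME"], env["HYPERSPACE_PASSWORD"]
    except KeyError as missing:
        # Fail early with a clear message instead of a bare KeyError
        raise RuntimeError(f"Missing environment variable: {missing}") from None
```

With both variables set, `username, password = load_credentials()` can run before the client is created.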

# 3. Create data schema file

Like other search databases, Hyperspace requires a configuration file that outlines the data schema. We create a config file whose fields correspond to those of the random data. Note that setting the key 'low_cardinality' enables faster search on low-cardinality fields.

The similarity between binary vectors is calculated using the [Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance).

In [4]:
import json
vector_dimension = 800 # bits
config = {
    "configuration": {
        'city': {"type": 'keyword'},
        'country': {"type": 'keyword'},
        'open_now': {"type": 'boolean'},
        'zip_code': {"type": 'integer'},
        'street': {"type": 'keyword'},
        'vertical': {"type": 'keyword', 'low_cardinality': True},
        "vector": {
            "type": "dense_vector",
            "index_type": "bin_ivf",
            "dim": vector_dimension,
            "metric": "hamming"
        }
    }
}

with open('config.json', 'w') as f:
    f.write(json.dumps(config, indent=2))
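
To illustrate the `hamming` metric named in the schema, here is a minimal sketch that counts the differing bits between two base64-encoded binary vectors (the same encoding used for the `vector` field later in this notebook). `hamming_distance` is a helper written for this example; it shows what the metric measures, not how the engine computes it internally.

```python
import base64

def hamming_distance(a_b64: str, b_b64: str) -> int:
    """Count the differing bits between two base64-encoded binary vectors."""
    a = base64.b64decode(a_b64)
    b = base64.b64decode(b_b64)
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # XOR leaves a 1 bit wherever the inputs differ; count those bits
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# Two 16-bit toy vectors: the first bytes differ in all 8 bits, the second in 1
v1 = base64.b64encode(bytes([0b11110000, 0b00000000])).decode()
v2 = base64.b64encode(bytes([0b00001111, 0b00000001])).decode()
print(hamming_distance(v1, v2))  # 9
```

A smaller distance means more similar vectors; identical vectors have distance 0.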


# 4. Create collection
A collection stores data that shares a similar context. We delete any existing collection with the same name before creating a new one.

In [None]:
collection_name = 'GeneratedData'
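try:
    # Remove a collection left over from a previous run, if any
    hyperspace_client.delete_collection(collection_name)
except Exception:
    # The collection did not exist yet; nothing to delete
    pass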

hyperspace_client.create_collection('config.json', collection_name)

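# 5. Ingest data

We ingest 100,000 documents in batches of 250. Each document combines a randomly chosen metadata record with a freshly generated random binary vector, encoded as a base64 string.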

In [None]:
import secrets
import base64
import pickle
import random

# Load the pre-generated metadata records
with open("Generated_data.hsv", 'rb') as file:
    metadata = pickle.load(file)

BATCH_SIZE = 250
NUM_DOCUMENTS = 100000

batch = []

for i in range(NUM_DOCUMENTS):
    # Copy the record so that documents drawn from the same metadata entry
    # do not end up sharing (and overwriting) one vector
    data_point = dict(random.choice(metadata))

    # Random binary vector of vector_dimension bits, base64-encoded
    random_bytes = secrets.token_bytes(vector_dimension // 8)
    data_point['vector'] = base64.b64encode(random_bytes).decode()
    batch.append(hyperspace.Document(str(i), data_point))
    if (i + 1) % BATCH_SIZE == 0:
        response = hyperspace_client.add_batch(batch, collection_name)
        print(i, response)
        batch.clear()

# Send any remaining documents that did not fill a complete batch
if batch:
    hyperspace_client.add_batch(batch, collection_name)

hyperspace_client.commit(collection_name)
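
The batching logic in the ingestion cell can also be factored into a small generator, which handles the final partial batch without a separate check. This is an optional sketch using only the standard library; `batched` is a helper defined here, not part of the Hyperspace client.

```python
from itertools import islice

def batched(items, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return  # the iterable is exhausted
        yield chunk

# 600 items in batches of 250: two full batches and one partial batch
sizes = [len(chunk) for chunk in batched(range(600), 250)]
print(sizes)  # [250, 250, 100]
```

Each yielded chunk could then be passed to `hyperspace_client.add_batch(chunk, collection_name)`, as in the ingestion cell.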



In [None]:
data_point

{'city': 'North Mistyview',
 'country': 'Gibraltar',
 'open_now': True,
 'zip_code': 31158,
 'street': 'banana',
 'vector': '9BHcg8G55CcOpuKrwHePARQnqT2NERUXcWY4u40tRCZKI+5/VA6lRb7I3mCfVeeOZvF2y11I8H8A9DJ4RLZY8aY8r44eaQr8vHgBjFDve1OCoNm+My0O2O+7BvUbiJj80CFgQg=='}
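
The `vector` field holds the binary vector as a base64 string. As a quick sanity check, the sketch below builds a vector the same way the ingestion cell does and confirms that it decodes back to the expected number of bits. `decoded_bit_length` is a helper introduced for this example only.

```python
import base64
import secrets

VECTOR_DIMENSION = 800  # bits, matching the schema above

def decoded_bit_length(encoded: str) -> int:
    """Return the number of bits in a base64-encoded binary vector."""
    return len(base64.b64decode(encoded)) * 8

# 800 bits = 100 bytes, which base64 encodes as 136 characters
sample = base64.b64encode(secrets.token_bytes(VECTOR_DIMENSION // 8)).decode()
print(len(sample), decoded_bit_length(sample))  # 136 800
```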

# 6. Define Logic and Run a Query
We will build a hybrid search query using Hyperspace. In the query, we define an input document (a random vector and a country) and find the documents most similar to it. The weights under the "boost" fields of the query object control the relative contributions of the classic search score and the vector search score.

In [None]:
score_function_filename = 'binary_score_function.py'
hyperspace_client.set_function(score_function_filename, collection_name=collection_name, function_name='score_function')

In [None]:
from pprint import pprint

input_vector = {'vector': base64.b64encode(secrets.token_bytes(vector_dimension // 8)).decode(), 'country': 'France'}

query = {
    'params': input_vector,
    "knn": {
        "query": {"boost": 1},
        "vector": {
            "boost": 10,
        }
    }
}

results = hyperspace_client.search(query,
                                   size=15,
                                   function_name='score_function',
                                   collection_name=collection_name)

candidates = results['candidates']

print(f"Query run time: {results['took_ms']:.2f}ms")
print(f"Query run time / (candidates + 1) = {results['took_ms'] / (candidates + 1):.2f}ms")
print()
pprint(results['similarity'])
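
To make the role of the two `boost` values more concrete, the sketch below combines a classic search score and a vector search score as a weighted sum. This is an illustration only: it assumes a linear combination, and `combined_score` is a hypothetical helper, not part of the Hyperspace client. How the engine actually combines the two scores is not shown in this notebook.

```python
def combined_score(query_score, vector_score, query_boost=1, vector_boost=10):
    """Weighted sum of a classic search score and a vector search score.

    ASSUMPTION: a linear combination, used here only to illustrate what
    relative weights mean; the engine's real formula may differ.
    """
    return query_boost * query_score + vector_boost * vector_score

# With boosts of 1 and 10, the vector score dominates the total
print(combined_score(2.0, 0.5))  # 7.0
```

Under this assumption, raising the vector boost relative to the query boost makes vector similarity count for more in the final ranking.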


# Results
Let's view the results

In [None]:
for i, x in enumerate(results['similarity']):
    # Fetch the full stored document for each match
    document = hyperspace_client.get_document(collection_name, x["document_id"])
    print(i, document)