# Academic paper hybrid search with Hyperspace
This notebook demonstrates how to use the Hyperspace engine for hybrid search over academic papers, combining vector search with keyword matching.

# The Dataset
The dataset is taken from [benchmarking sets](https://github.com/qdrant/ann-filtering-benchmark-datasets#data) and includes a list of academic papers from arXiv, together with their metadata.
We will combine the embedded vector data with the metadata to create a hybrid search.

## Setting up the Hyperspace environment
Setting up the environment requires the following steps:


1. Download and install the client API
2. Connect to a server
3. Create data schema file
4. Create collection
5. Ingest data
6. Run query

We mount a cloud folder which hosts the client files and install the client

### Install the Hyperspace client API


In [4]:
pip install git+https://github.com/hyper-space-io/hyperspace-py

Collecting git+https://github.com/hyper-space-io/hyperspace-py
  Cloning https://github.com/hyper-space-io/hyperspace-py to /tmp/pip-req-build-5xvax89k
  Running command git clone --filter=blob:none --quiet https://github.com/hyper-space-io/hyperspace-py /tmp/pip-req-build-5xvax89k
  Resolved https://github.com/hyper-space-io/hyperspace-py to commit e397b51e57fd6c3d83cdde8a8ed1b6b81d0509a7
  Preparing metadata (setup.py) ... done


### Connect to Server

Using the Hyperspace engine requires a connection to a remote machine, using credentials provided in advance.

In [5]:
import hyperspace

# `username` and `password` must already hold the credentials provided for your account
hyperspace_client = hyperspace.HyperspaceClientApi(host='https://search-master-demo.development.hyper-space.xyz',
                                                   username=username, password=password)
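
The connection cell expects `username` and `password` to exist already. One possible way to supply them without hard-coding secrets in the notebook is to read them from environment variables. The sketch below is only an illustration: the variable names `HYPERSPACE_USERNAME` and `HYPERSPACE_PASSWORD` and the helper `load_credentials` are hypothetical and are not part of the Hyperspace client.

```python
import os


def load_credentials(env=os.environ):
    """Return (username, password) read from environment variables.

    HYPERSPACE_USERNAME and HYPERSPACE_PASSWORD are hypothetical names
    chosen for this sketch; they are not defined by the Hyperspace client.
    """
    try:
        return env["HYPERSPACE_USERNAME"], env["HYPERSPACE_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None


# Example usage (uncomment once the variables are set):
# username, password = load_credentials()
```

Failing early with a clear error avoids a less obvious authentication failure later, when the client first contacts the server.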

We check the cluster status before proceeding.

In [6]:
cluster_status = hyperspace_client.cluster_status()
display(cluster_status)

[{'Collections size': {'all-MiniLM-L6-v2%20ArXiv%20titles': 376501},
  'FPGA memory usage in GB': '0.1045GB',
  'FPGA memory usage in percentage': '0.1045%',
  'Hostname': 'hyperspace-demo-0',
  'Number of total vectors': 376501},
 {'Number of data nodes': 1}]

### Create the data schema file

Similarly to other search databases, Hyperspace requires a schema file which describes the fields of the data and their types.

In [7]:
import json

config = {
    "configuration": {
        "id": {
            "type":"float"
        },
        "title": {
            "type":"keyword"
        },
        "submitter": {
            "type":"keyword"
        },
        "categories": {
            "type":"keyword",
            "struct_type":"list"
        },
        "labels": {
            "type":"keyword",
            "struct_type":"list"
        },
        "license": {
            "type":"keyword"
        },
        "update_date": {
            "type":"keyword"
        },
        "update_date_ts": {
            "type":"integer"
        },
        "embedded_abstract": {
            "type": "dense_vector",
            "dim": 384,
            "index_type": "brute_force",
            "metric": "IP"
        }
    },
    "settings": {
      "list_delimiter": ","
    }
}

with open('arXiv_config.json', 'w') as f:
    f.write(json.dumps(config, indent=2))
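
Before ingesting data, it can help to check that each row agrees with the schema, for example that the vector has the declared dimension and that list fields really are lists. The following is a minimal sketch in plain Python. The helper `validate_row` is hypothetical and is not part of the Hyperspace client, and it only checks the few properties shown here.

```python
def validate_row(row, schema):
    """Return a list of problems found in `row`; an empty list means none were found."""
    problems = []
    fields = schema["configuration"]
    for name, value in row.items():
        if name not in fields:
            problems.append(f"unknown field: {name}")
            continue
        spec = fields[name]
        # Dense vectors must have exactly the declared number of dimensions.
        if spec["type"] == "dense_vector" and len(value) != spec["dim"]:
            problems.append(f"{name}: expected {spec['dim']} values, got {len(value)}")
        # Fields declared with struct_type "list" must hold a Python list.
        if spec.get("struct_type") == "list" and not isinstance(value, list):
            problems.append(f"{name}: expected a list")
    return problems


# A trimmed-down copy of two schema fields, kept small for the example.
example_schema = {
    "configuration": {
        "categories": {"type": "keyword", "struct_type": "list"},
        "embedded_abstract": {"type": "dense_vector", "dim": 384,
                              "index_type": "brute_force", "metric": "IP"},
    }
}

good_row = {"categories": ["cs.LG"], "embedded_abstract": [0.0] * 384}
bad_row = {"categories": "cs.LG", "embedded_abstract": [0.0] * 3, "year": 2020}

print(validate_row(good_row, example_schema))
print(validate_row(bad_row, example_schema))
```

The first call reports no problems, while the second reports the string-valued `categories`, the short vector, and the undeclared `year` field.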




# The Dataset Fields
The metadata includes the following fields:


1. id [float] - unique paper id
2. title [string] - paper title
3. submitter [string] - name of the person who submitted the paper
4. categories [list[string]] - list of categories which include the paper
5. labels [list[string]] - labels applied to the paper
6. license [string] - license type
7. update_date_ts [integer] - update time in Unix format

We build a simple filtering function, which keeps papers of the same category, gives a positive bias to papers by the same submitter, and gives a negative bias to papers without a given license. We first select a paper as input for the query.
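
To make the filtering logic concrete, here is a minimal plain-Python sketch of the three rules described above. It is not the content of the score function file used later in this notebook and does not use Hyperspace's score function syntax. The function name `toy_score`, the weights, the reading of "same category" as "shares at least one category", and the reading of "without given license" as "has no license" are all assumptions made for illustration.

```python
def toy_score(candidate, query_paper,
              submitter_bonus=1.0, missing_license_penalty=0.5):
    """Return a score for `candidate`, or None if it is filtered out.

    The weights are hypothetical values chosen for this sketch.
    """
    # Filter: keep only candidates that share at least one category.
    if not set(candidate["categories"]) & set(query_paper["categories"]):
        return None

    score = 0.0
    # Positive bias for papers by the same submitter.
    if candidate["submitter"] == query_paper["submitter"]:
        score += submitter_bonus
    # Negative bias for papers without a license.
    if not candidate.get("license"):
        score -= missing_license_penalty
    return score


query_paper = {"categories": ["cs.LG", "stat.ML"], "submitter": "A. Author", "license": "cc-by"}
same_submitter = {"categories": ["cs.LG"], "submitter": "A. Author", "license": None}
other_field = {"categories": ["math.AG"], "submitter": "A. Author", "license": "cc-by"}

print(toy_score(same_submitter, query_paper))  # 0.5
print(toy_score(other_field, query_paper))     # None
```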


### Create Collection
Collections are used to store data of a similar context. We optionally delete the existing collection (when `delete_collections` is set to `True`), create a new one, and ingest the data.

In [None]:
delete_collections = False
collection_name = 'all-MiniLM-L6-v2 ArXiv titles'

if delete_collections:
  if collection_name in cluster_status[0]['Collections size']:
    hyperspace_client.delete_collection(collection_name)

hyperspace_client.create_collection('arXiv_config.json', collection_name)
hyperspace_client.cluster_status()
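
In the cluster status output shown earlier, the collection name appears percent-encoded (`%20` in place of spaces), so a direct comparison with a name that contains spaces may not match. The sketch below decodes the listed names before comparing. Whether the server always returns encoded names is an assumption based only on that single output, and `collection_exists` is a hypothetical helper, not a Hyperspace API.

```python
from urllib.parse import unquote


def collection_exists(name, status):
    """Check for `name` among the collections listed in a cluster status."""
    listed = status[0]["Collections size"]
    # Decode percent-escapes such as %20 before comparing names.
    return name in {unquote(key) for key in listed}


# Status copied from the output shown earlier in this notebook.
example_status = [
    {"Collections size": {"all-MiniLM-L6-v2%20ArXiv%20titles": 376501}},
    {"Number of data nodes": 1},
]

print(collection_exists("all-MiniLM-L6-v2 ArXiv titles", example_status))  # True
```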

### Ingest data

We load the dataset from local files and ingest it in batches of 500 data points (the batch size can be increased for faster ingestion).

In [None]:
import numpy as np
vecs = np.load('vectors.npy')
metadata = open('payloads.jsonl')

In [None]:
BATCH_SIZE = 500

batch = []
for i, (metadata_row, vec) in enumerate(zip(metadata, vecs)):
    # Keep only the fields declared in the schema
    row = {key: value for key, value in json.loads(metadata_row).items() if key in config["configuration"]}
    row['categories'] = row['categories'].split()
    row['embedded_abstract'] = vec.tolist()

    batch.append(hyperspace.Document(str(i), row))

    # Send the batch once it is full
    if len(batch) == BATCH_SIZE:
        response = hyperspace_client.add_batch(batch, collection_name)
        batch.clear()
        print(i, response)

# Send the remaining documents that did not fill a whole batch
if batch:
    response = hyperspace_client.add_batch(batch, collection_name)
    print(i, response)
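
A point worth checking in any batched ingestion loop is that the last, partially filled batch is also sent. The generic helper below sketches one way to split any iterable into fixed-size batches, including the remainder. The name `batched` is chosen for this sketch and the helper is independent of the Hyperspace client.

```python
def batched(items, size):
    """Yield lists of at most `size` items, including a final partial list."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # remainder that did not fill a whole batch
        yield batch


sizes = [len(chunk) for chunk in batched(range(1234), 500)]
print(sizes)  # [500, 500, 234]
```

With such a helper, the ingestion loop would reduce to one `add_batch` call per yielded list.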



In [None]:
hyperspace_client.commit(collection_name)

# Define Logic and Run a Query
We will build a hybrid search query using Hyperspace. In the query, we select a paper from the database (document `63` in this example) and search for similar papers.

In [None]:
input_vector = hyperspace_client.get_document(collection_name, "63")
print(input_vector['title'], "\n======================================\n", input_vector['submitter'], "\n", input_vector['categories'])

We run a query which combines the similarity score function and vector search. Each search type can be given a weight (`boost`) in the query object. The final score of each search type will be multiplied by its weight.
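
The effect of the weights can be illustrated with a small plain-Python sketch. It runs a brute-force inner-product search over a few toy vectors, keeps the `top_k` best, multiplies each score type by its weight, and adds the two. This is not Hyperspace's implementation: summing the weighted scores is an assumption of the sketch, and `toy_hybrid_search`, `knn_boost`, `query_boost`, and the toy data are hypothetical names and values.

```python
import heapq


def inner_product(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def toy_hybrid_search(query_vec, docs, metadata_scores,
                      knn_boost=10, query_boost=1, top_k=2):
    """Rank documents by boosted vector similarity plus boosted metadata score.

    Summing the two weighted scores is an assumption of this sketch.
    """
    # Brute-force kNN: score every document, keep the top_k by inner product.
    knn = heapq.nlargest(
        top_k,
        ((inner_product(query_vec, vec), doc_id) for doc_id, vec in docs.items()),
    )
    # Multiply each score type by its weight, then combine.
    combined = {
        doc_id: knn_boost * sim + query_boost * metadata_scores.get(doc_id, 0.0)
        for sim, doc_id in knn
    }
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


docs = {"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.0, 1.0]}
metadata_scores = {"a": 0.0, "b": 3.0, "c": 5.0}

ranking = toy_hybrid_search([1.0, 0.0], docs, metadata_scores)
print(ranking)
```

In this toy data, document `c` has the highest metadata score but never enters the ranking because it falls outside the `top_k` vector matches, while `b` overtakes `a` thanks to its metadata score.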

In [None]:
from pprint import pprint
response = hyperspace_client.set_function('/content/drive/MyDrive/Demos/ArXiv/arXiv_score_func.py', collection_name=collection_name, function_name='similarity_sf')

query_with_knn = {
    'params': input_vector,
    "knn": {
        "query": {"boost": 1}, # boost 0 means no run
        "embedded_abstract": {
            "boost": 10,
            "top_k": 100,
            "nprobe": 80
        }
    }
}

results = hyperspace_client.search(query_with_knn,
                                   size=15,
                                   function_name='similarity_sf',
                                   collection_name=collection_name)

for i, result in enumerate(results['similarity']):
  api_response = hyperspace_client.get_document(document_id=result['document_id'], collection_name=collection_name)
  print(i + 1, "id", result['document_id'],  ":", api_response['title'], ",", api_response['submitter'], ",", api_response['categories'])
  print("\n")


The returned documents have similar submitter names, as expected from the metadata filtering.

This notebook gave a simple example of using the Hyperspace engine for hybrid search. Hyperspace can also support complicated use cases with large databases at extremely low latency.