
# Academic Papers Hybrid Search with Hyperspace
This notebook demonstrates how to use the Hyperspace engine for hybrid search over academic papers, combining vector search with keyword matching.
We will build a simple score function that keeps only papers in the same category, gives a positive bias to papers by the same submitter, and gives a negative bias to papers without a license. We first select a paper to use as the query input.

# The Dataset
The dataset includes a list of academic papers from arXiv, and their metadata. The dataset can be downloaded from [here](http://hyperspace-datasets.s3.amazonaws.com/arXiv_data.zip).
We will combine the embedded vector data with the metadata to create a hybrid search.


## The Metadata Fields
The metadata includes the following fields:
1. **id** [float] - paper unique id
2. **title** [string] - paper title
3. **submitter** [string] - name of person who submitted the paper
4. **categories** [list[string]] - list of categories the paper belongs to
5. **labels** [list[string]] - labels applied to the paper
6. **license** [string] - license type
7. **update_date_ts** [integer] - update time in unix format
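
As a small illustration of these fields, the sketch below builds a record shaped like the metadata and converts `update_date_ts` from Unix seconds into a readable UTC date. Only the field names come from the list above. The record values are invented, and `describe_record` is a hypothetical helper, not part of the Hyperspace API.

```python
from datetime import datetime, timezone

# Hypothetical record: field names follow the metadata list, values are invented.
record = {
    "id": 1.0,
    "title": "An example paper title",
    "submitter": "A. Author",
    "categories": ["hep-th", "math-ph"],
    "license": None,
    "update_date_ts": 1200000000,
}

def describe_record(rec):
    """Return a one-line summary with the update time converted from Unix seconds to UTC."""
    updated = datetime.fromtimestamp(rec["update_date_ts"], tz=timezone.utc)
    cats = ", ".join(rec["categories"])
    return f"{rec['title']} [{cats}] updated {updated:%Y-%m-%d}"

print(describe_record(record))
```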


# Setting up the Hyperspace environment
Working with Hyperspace requires the following steps:

1. Install the client API
2. Connect to a server
3. Create a data schema file
4. Create collection
5. Ingest data
6. Run query

## 1. Install the client API
The Hyperspace API can be installed directly from Git with the following command:

In [None]:
pip install git+https://github.com/hyper-space-io/hyperspace-py

Collecting git+https://github.com/hyper-space-io/hyperspace-py
  Cloning https://github.com/hyper-space-io/hyperspace-py to /tmp/pip-req-build-fi8u9fjf
  Running command git clone --filter=blob:none --quiet https://github.com/hyper-space-io/hyperspace-py /tmp/pip-req-build-fi8u9fjf
  Resolved https://github.com/hyper-space-io/hyperspace-py to commit 2bf116d8871d27401cc6032ababc99f72a78dc24
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: hyperspace
  Building wheel for hyperspace (setup.py) ... done
  Created wheel for hyperspace: filename=hyperspace-1.0.0-py3-none-any.whl size=36068 sha256=69cc8a8115372b9699079bc7a3b2633b0c15f36aae2e2c2191162249a9f6db9e
  Stored in directory: /tmp/pip-ephem-wheel-cache-2kfj1yxi/wheels/c4/96/59/f4b91d653fdbfc819e48a7dacbea1c9f3de59a1bc113aa840d
Successfully built hyperspace
Installing collected packages: hyperspace
Successfully installed hyperspace-1.0.0


## 2. Connect to a server

Once the Hyperspace API is installed, the database can be accessed by creating a local instance of the Hyperspace client. This step requires a host address, a username, and a password.

In [None]:
import hyperspace
from getpass import getpass

# Credentials are provided by Hyperspace
username = 'username provided by hyperspace'
password = getpass("Password:")  # prompt for the password instead of hard-coding it

host = 'https://search-master-demo.development.hyper-space.xyz'

hyperspace_client = hyperspace.HyperspaceClientApi(host=host,
                                                   username=username,
                                                   password=password)


We check the server status before proceeding.

In [None]:
collections_info = hyperspace_client.collections_info()
display(collections_info)

[{'Collections size': {'all-MiniLM-L6-v2_arXiv_titles': 1720000},
  'FPGA memory usage in GB': '0.6836GB',
  'FPGA memory usage in percentage': '0.6836%',
  'Hostname': 'hyperspace-demo-0',
  'Number of total vectors': 1720000},
 {'Number of data nodes': 1}]
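
The status output is a list of dictionaries, so the collection names and sizes can be pulled out with a few lines of Python. The sketch below works on a copy of the output shown above rather than on a live server. `collection_sizes` is a hypothetical helper, and other servers or client versions may return different keys.

```python
# Structure copied from the status output shown above; a live server may return other keys.
sample_info = [
    {"Collections size": {"all-MiniLM-L6-v2_arXiv_titles": 1720000},
     "FPGA memory usage in GB": "0.6836GB",
     "FPGA memory usage in percentage": "0.6836%",
     "Hostname": "hyperspace-demo-0",
     "Number of total vectors": 1720000},
    {"Number of data nodes": 1},
]

def collection_sizes(info):
    """Merge the 'Collections size' mappings of all entries into one dict."""
    sizes = {}
    for entry in info:
        sizes.update(entry.get("Collections size", {}))
    return sizes

sizes = collection_sizes(sample_info)
print(sizes)
```

A check such as `collection_name in collection_sizes(info)` could then be used to test whether a collection already exists before creating it.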

## 3. Create a Data Schema File

Similarly to other search databases, the Hyperspace database requires a configuration file that outlines the data schema. Here, we create a config file that corresponds to the fields of the given dataset.

For vector fields, we also provide the index type and the metric. Current index options are "**brute_force**", "**hnsw**", "**ivf**", and, for binary vectors, "**bin_ivf**". The metric options are "**IP**" (inner product) for floating-point vectors and "**Hamming**" ([hamming distance](https://en.wikipedia.org/wiki/Hamming_distance)) for binary vectors.
Here, we use "brute_force" (exact KNN) with inner product.

In [None]:
import json

config = {
    "configuration": {
        "id": {
            "type":"float"
        },
        "title": {
            "type":"keyword"
        },
        "submitter": {
            "type":"keyword"
        },
        "categories": {
            "type":"keyword",
            "struct_type":"list"
        },
        "labels": {
            "type":"keyword",
            "struct_type":"list"
        },
        "license": {
            "type":"keyword"
        },
        "update_date": {
            "type":"keyword"
        },
        "update_date_ts": {
            "type":"integer"
        },
        "embedded_abstract": {
            "type": "dense_vector",
            "dim": 384,
            "index_type": "brute_force",
            "metric": "IP"
        }
    }
}

with open('arXiv_config.json', 'w') as f:
    f.write(json.dumps(config, indent=2))
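
To make the "brute_force" index type and the two metrics concrete, here is a minimal plain-Python sketch of exact KNN by inner product, together with a Hamming distance for binary vectors. It only illustrates the concepts and says nothing about how the Hyperspace engine implements them. The function names and toy vectors are made up for this example.

```python
def inner_product(a, b):
    """Inner product ("IP") of two equal-length float vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k_inner_product(query, vectors, k):
    """Exact (brute-force) KNN: score every stored vector, return the k best as (index, score)."""
    scored = [(inner_product(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # highest score first, ties broken by index
    return [(i, score) for score, i in scored[:k]]

def hamming_distance(a, b):
    """Number of positions at which two equal-length binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))

vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query = [1.0, 0.0]
neighbors = top_k_inner_product(query, vectors, k=2)
print(neighbors)
print(hamming_distance([1, 0, 1, 1], [1, 1, 0, 1]))
```

Brute force compares the query against every stored vector, so results are exact but the cost grows linearly with the collection size. Approximate indexes such as "hnsw" or "ivf" generally trade some recall for speed.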



## 4. Create Collection
The Hyperspace engine stores data in collections, where each collection typically hosts data of a similar context. Each search is then performed within a collection. We create a collection using the command "**create_collection**(schema_filename, collection_name)".

In [None]:
collection_name = 'all-MiniLM-L6-v2_arXiv_titles'

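# Remove any existing collection with the same name so the notebook can be re-run
try:
    hyperspace_client.delete_collection(collection_name)
except Exception:
    pass  # most likely the collection did not exist yet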
hyperspace_client.create_collection('arXiv_config.json', collection_name)
hyperspace_client.collections_info()

## 5. Ingest data

In the next step we ingest the dataset in batches of 500 documents. The batch size can be controlled by the user and, in particular, can be increased in order to improve ingestion time. We add batches of data using the command **add_batch**(batch, collection_name).

In [None]:
import numpy as np
vecs = np.load('1m-vectors.npy')
metadata_file = open('1m-payloads.jsonl')

In [None]:
BATCH_SIZE = 500

batch = []
for i, (metadata_row, vec) in enumerate(zip(metadata_file, vecs)):
    # Keep only the fields declared in the schema
    row = {key: value for key, value in json.loads(metadata_row).items() if key in config["configuration"]}
    row['categories'] = row['categories'].split()
    row['embedded_abstract'] = vec.tolist()

    batch.append(hyperspace.Document(str(i), row))

    # Send the batch once it is full
    if len(batch) == BATCH_SIZE:
        response = hyperspace_client.add_batch(batch, collection_name)
        batch.clear()
        print(i, response)

# Send the remaining documents of the last, partial batch
if batch:
    response = hyperspace_client.add_batch(batch, collection_name)
    print(i, response)

hyperspace_client.commit(collection_name)
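
A common pitfall in batched ingestion is dropping the final, partially filled batch. The batching pattern can be factored into a small generator that always yields it. This is a sketch independent of Hyperspace: `iter_batches` is a hypothetical helper, and the item count is arbitrary.

```python
def iter_batches(items, batch_size):
    """Yield lists of at most batch_size items; the last batch may be smaller."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []  # start a fresh list so the yielded one is not mutated
    if batch:
        yield batch     # flush the final, partial batch

batch_sizes = [len(b) for b in iter_batches(range(1050), 500)]
print(batch_sizes)
```

With such a helper, the ingestion loop reduces to one `add_batch` call per yielded batch, followed by a single `commit`.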

## 6. Define Logic and Run a Query
We will build a hybrid search query using Hyperspace. In the query, we select a paper from the database and search for similar papers. The weights under the "boost" fields of the query object control the relative weights of the classic search and vector search scores. We will first select the input document, and search for similar papers in the next step.

In [None]:
input_document = hyperspace_client.get_document(collection_name, 65)
print(input_document['title'], "\n======================================\n", input_document['submitter'], "\n", input_document['categories'])

Lagrangian quantum field theory in momentum picture. IV. Commutation
  relations for free fields 
 Bozhidar Zakhariev Iliev 
 ['hep-th']


In [None]:
input_document.keys()

dict_keys(['categories', 'embedded_abstract', 'id', 'labels', 'license', 'submitter', 'title', 'update_date', 'update_date_ts'])

In [None]:
# Register the score function file under the name 'similarity_sf'
response = hyperspace_client.set_function('arXiv_score_func.py', collection_name=collection_name, function_name='similarity_sf')

query_with_knn = {
    'params': input_document,
    'query': {'boost': 1},               # weight of the classic search score
    'embedded_abstract': {'boost': 10}   # weight of the vector search score
}

results = hyperspace_client.search(query_with_knn,
                                   size=5,
                                   function_name='similarity_sf',
                                   collection_name=collection_name)

for i, result in enumerate(results['similarity']):
  vector_api_response = hyperspace_client.get_document(document_id=result['document_id'], collection_name=collection_name)
  print(i + 1, "id", result['document_id'],  ":", vector_api_response['title'], ",", vector_api_response['submitter'], ",", vector_api_response['categories'])


1 id 65 : Lagrangian quantum field theory in momentum picture. IV. Commutation
  relations for free fields , Bozhidar Zakhariev Iliev , ['hep-th']
2 id 342280 : A General Field-Covariant Formulation Of Quantum Field Theory , Damiano Anselmi , ['hep-th', 'hep-ph', 'math-ph', 'math.MP']
3 id 834408 : A Free-Field Lagrangian for a Gauge Theory of the CPT Symmetry , Kurt Koltko , ['physics.gen-ph', 'astro-ph.GA', 'hep-th']
4 id 1120808 : Towards Divergence-free Theory of Quantum Fields , Nagabhushana Prabhu , ['physics.gen-ph', 'hep-th']
5 id 1663058 : Momentum Gauge Fields and Non-Commutative Space-Time , Eduardo Guendelman I , ['quant-ph', 'gr-qc', 'hep-th']


The returned documents all share the 'hep-th' category with the input paper, as expected from the metadata filtering.
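
The score function file `arXiv_score_func.py` is not shown in this notebook. As a rough plain-Python illustration of the logic described in the introduction, the sketch below drops candidates that share no category with the query paper, adds a bonus for the same submitter, subtracts a penalty when the license is missing, and combines the result with a vector similarity using boost weights. The weighted-sum combination, the bonus and penalty values, and the function name `hybrid_score` are all assumptions made for illustration. This is not Hyperspace score function syntax.

```python
def hybrid_score(query_doc, candidate, vector_similarity,
                 query_boost=1.0, knn_boost=10.0,
                 submitter_bonus=1.0, license_penalty=1.0):
    """Return a combined score, or None when the candidate shares no category."""
    if not set(query_doc["categories"]) & set(candidate["categories"]):
        return None  # filtered out: no category in common
    metadata_score = 0.0
    if candidate["submitter"] == query_doc["submitter"]:
        metadata_score += submitter_bonus   # positive bias: same submitter
    if not candidate.get("license"):
        metadata_score -= license_penalty   # negative bias: no license given
    # Assumed combination: boost-weighted sum of metadata and vector scores
    return query_boost * metadata_score + knn_boost * vector_similarity

# Invented documents, used only to exercise the three branches
query_doc = {"categories": ["hep-th"], "submitter": "A. Author", "license": "cc-by"}
same_author = {"categories": ["hep-th", "math-ph"], "submitter": "A. Author", "license": "cc-by"}
no_license = {"categories": ["hep-th"], "submitter": "B. Writer", "license": None}
other_field = {"categories": ["cs.LG"], "submitter": "A. Author", "license": "cc-by"}

print(hybrid_score(query_doc, same_author, 0.5))
print(hybrid_score(query_doc, no_license, 0.5))
print(hybrid_score(query_doc, other_field, 0.9))
```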

This notebook gave a simple example of using the Hyperspace engine for hybrid search. Hyperspace can support complex use cases with large databases at extremely low latency.