# Academic paper hybrid search with Hyperspace
This notebook demonstrates how to use the Hyperspace engine for hybrid search over academic papers, combining vector search with keyword matching.

# The Dataset
The dataset is taken from [benchmarking sets](https://github.com/qdrant/ann-filtering-benchmark-datasets#data) and includes a list of academic papers from arXiv together with their metadata.
We will combine the embedded vector data with the metadata to create a hybrid search.

## Setting up the Hyperspace environment
Setting up the environment requires the following steps:


1. Download and install the client API
2. Create a data config file
3. Connect to a server
4. Create a collection
5. Ingest data
6. Run a query

We mount a cloud folder that hosts the client files and install the client from it.

### Install the Hyperspace Client API
Installing the Hyperspace client is straightforward and can be done with any standard Python package installer, such as pip.

In [None]:
!pip install drive/MyDrive/search_master.zip

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Processing ./drive/MyDrive/search_master.zip
  Preparing metadata (setup.py) ... done
Collecting newlinejson
  Downloading NewlineJSON-1.0-py2.py3-none-any.whl (12 kB)
Building wheels for collected packages: search-master
  Building wheel for search-master (setup.py) ... done
  Created wheel for search-master: filename=search_master-1.0.0-py3-none-any.whl size=39147 sha256=deeb76b3f6d5c0ad7d00216f302beebc59b29d95cab80777c58f249f971ce62a
  Stored in directory: /tmp/pip-ephem-wheel-cache-gr23dlmw/wheels/16/4a/3d/5f117bdb31fe9ec055a07467b1949da65cd6246ecd3a3599fd
Successfully built search-master
Installing collected packages: newlinejson, search-master
Successfully installed newlinejson-1.0 search-master-1.0.0


In [None]:
import search_master as hyperspace

### Connect to Server

Using the Hyperspace engine requires a connection to a remote machine with pre-provided credentials.

In [None]:
# Point the client at the Hyperspace server.
conf = hyperspace.configuration.Configuration()
conf.host = 'https://search-master-demo.development.hyper-space.xyz'

# Log in with the pre-provided credentials; `username` and `password` must be defined beforehand.
hyperspace_client = hyperspace.SearchMasterApi(api_client=hyperspace.api_client.ApiClient(configuration=conf))
login_response = hyperspace_client.login({"username": username, "password": password})

# Create an authenticated client that sends the login token as a bearer token.
api_client = hyperspace.api_client.ApiClient(configuration=conf,
                                             header_name='Authorization',
                                             header_value="Bearer " + login_response.token)

hyperspace_client = hyperspace.SearchMasterApi(api_client=api_client)


In [None]:
hyperspace_client.cluster_status()

### Create the Configuration file

Similarly to other search databases, Hyperspace requires a configuration file that outlines the data schema.

In [None]:
import json

config = {
    "configuration": {
        "id": {
            "type":"float"
        },
        "title": {
            "type":"keyword"
        },
        "submitter": {
            "type":"keyword"
        },
        "categories": {
            "type":"keyword",
            "struct_type":"list"
        },
        "labels": {
            "type":"keyword",
            "struct_type":"list"
        },
        "license": {
            "type":"keyword"
        },
        "update_date": {
            "type":"keyword"
        },
        "update_date_ts": {
            "type":"integer"
        },
        "embedded_abstract": {
            "type": "dense_vector",
            "dim": 384,
            "metric": "IP"
        }
    },
}

with open('arXiv_config.json', 'w') as f:
    f.write(json.dumps(config, indent=2))
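The `"metric": "IP"` entry in the configuration is assumed here to stand for inner product similarity. The following is a minimal pure-Python sketch of how candidates are ranked under that metric; it is not the Hyperspace implementation, and the three-dimensional vectors and paper names are hypothetical stand-ins for the 384-dimensional embeddings. For unit-length vectors, the inner product equals cosine similarity.

```python
import math

def inner_product(a, b):
    """Inner (dot) product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(inner_product(v, v))
    return [x / norm for x in v]

# Hypothetical 3-dimensional vectors; the real embeddings have 384 dimensions.
query = normalize([1.0, 2.0, 2.0])
candidates = {
    "paper_a": normalize([1.0, 2.0, 2.0]),   # same direction as the query
    "paper_b": normalize([2.0, 1.0, 0.0]),   # partially aligned with the query
    "paper_c": normalize([-1.0, 0.0, 0.0]),  # points away from the query
}

# Rank candidates by inner product with the query, highest first.
ranked = sorted(
    candidates,
    key=lambda name: inner_product(query, candidates[name]),
    reverse=True,
)
print(ranked)
```

A higher inner product means a closer match, so the candidate that points in the same direction as the query is ranked first.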




# The Dataset Fields
The metadata includes the following fields:


1. **id** [float] - unique paper id
2. **title** [Keyword] - paper title
3. **submitter** [Keyword] - name of the person who submitted the paper
4. **categories** [list[Keyword]] - list of categories that include the paper
5. **labels** [list[Keyword]] - labels applied to the paper
6. **license** [Keyword] - license type
7. **update_date** [Keyword] - update date
8. **update_date_ts** [integer] - update time in Unix format

We build a simple filtering function that keeps only papers of the same category, gives a positive bias to papers by the same submitter and a negative bias to papers without a given license. We first select a paper as input for the query.
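The logic described above can be sketched in plain Python. This is only an illustration of the filtering and biasing idea, not the contents of the score function file used later in the notebook. The function name `score_candidate`, the boost and penalty values, and the example records are all hypothetical, and "without given license" is interpreted here as a missing or empty `license` field.

```python
def score_candidate(query_paper, candidate,
                    same_submitter_boost=2.0, missing_license_penalty=1.0):
    """Return a score for `candidate`, or None if it is filtered out."""
    # Filter: keep only candidates sharing at least one category with the query paper.
    if not set(query_paper["categories"]) & set(candidate["categories"]):
        return None

    score = 1.0  # base score for any candidate that passes the filter
    if candidate["submitter"] == query_paper["submitter"]:
        score += same_submitter_boost      # positive bias: same submitter
    if not candidate.get("license"):
        score -= missing_license_penalty   # negative bias: no license given
    return score

# Hypothetical records, using the field names from the configuration.
query_paper = {"submitter": "A. Author", "categories": ["cond-mat.mes-hall"], "license": "cc-by"}
same_author = {"submitter": "A. Author", "categories": ["cond-mat.mes-hall", "quant-ph"], "license": "cc-by"}
no_license = {"submitter": "B. Writer", "categories": ["cond-mat.mes-hall"], "license": None}
other_field = {"submitter": "A. Author", "categories": ["math.AG"], "license": "cc-by"}

for paper in (same_author, no_license, other_field):
    print(paper["submitter"], paper["categories"], score_candidate(query_paper, paper))
```

Returning `None` for filtered-out papers keeps the filter step distinct from the bias step, so a paper in a different category is never ranked regardless of its submitter.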


### Create Collection
Collections are used to store data of a similar context. Each collection is created from a configuration file.

In [None]:
hyperspace_client = hyperspace.SearchMasterApi(api_client=api_client)

collection_name = 'all-MiniLM-L6-v2 ArXiv titles'

hyperspace_client.delete_collection(collection_name)
hyperspace_client.create_collection('arXiv_config.json', collection_name)
hyperspace_client.cluster_status()

[{'Collections size': {'all-MiniLM-L6-v2%20ArXiv%20titles': 0},
  'FPGA memory usage in GB': '0.0080GB',
  'FPGA memory usage in percentage': '0.0080%',
  'Hostname': 'hyperspace-demo-0',
  'Number of total vectors': 0},
 {'Number of data nodes': 1}]

### Ingest data

We load the dataset from the local files `vectors.npy` and `payloads.jsonl` and ingest it in batches of 500 data points (the batch size can be increased for faster ingestion).

In [None]:
import numpy as np
vecs = np.load('vectors.npy')
metadata_file = open('payloads.jsonl')

In [None]:
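from search_master import VectorDto  # VectorDto is the basic Hyperspace database object and has a dictionary-like structure
BATCH_SIZE = 500

batch = []
config_fields = config["configuration"].keys()

for i, (metadata_row, vec) in enumerate(zip(metadata_file, vecs)):
    # Keep only the metadata fields declared in the configuration
    row = {key: value for key, value in json.loads(metadata_row).items() if key in config_fields}
    row['categories'] = row['categories'].split()  # space-separated string -> list
    row['embedded_abstract'] = vec.tolist()

    batch.append(VectorDto(str(i), row))

    # Send the batch once it is full
    if len(batch) == BATCH_SIZE:
        response = hyperspace_client.add_batch(batch, collection_name)
        batch.clear()
        print(i, response)

# Send the remaining data points that did not fill a whole batch
if batch:
    response = hyperspace_client.add_batch(batch, collection_name)
    print(i, response)

metadata_file.close()
hyperspace_client.commit(collection_name)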
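The batching pattern used for ingestion can be isolated in a small generic helper. The sketch below is independent of the Hyperspace client; the name `iter_batches` is hypothetical. It yields full batches of a fixed size and makes sure that a final, partially filled batch is emitted as well, so that no data points are dropped at the end of the input.

```python
def iter_batches(items, batch_size):
    """Yield lists of up to `batch_size` items, including a final partial batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []  # start a new list so the yielded batch is not mutated
    if batch:
        yield batch  # leftover items that did not fill a whole batch

# 1200 items in batches of 500 give two full batches and one partial batch.
sizes = [len(batch) for batch in iter_batches(range(1200), 500)]
print(sizes)  # [500, 500, 200]
```

With such a helper, the ingestion loop reduces to one `add_batch` call per yielded batch.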

# Define Logic and Run a Query
We will build a hybrid search query using Hyperspace. In the query, we select a paper from the database and search for similar papers.

In [None]:
input_vector = hyperspace_client.find_vector_by_id(vector_id=65, collection_name=collection_name)
print(input_vector['title'], "\n======================================\n", input_vector['submitter'], "\n", input_vector['categories'])

Hall field induced magnetoresistance oscillations of a two-dimensional
  electron system 
 Alejandro Kunold 
 ['cond-mat.mes-hall']


We run a query that combines keyword matching and vector search. Each search field can be given a weight in the query object. The final score of each search type will be multiplied by its weight.

In [None]:
from pprint import pprint
response = hyperspace_client.set_function('score_func.py', collection_name=collection_name, function_name='similarity_sf')

query_with_knn = {
    'vector_Content': input_vector,
    'boost': {
        'query': 0.4,
        'embedded_abstract': 0.2
    }
}

results = hyperspace_client.search_data(query_with_knn,
                                        size=5,
                                        function_name='similarity_sf',
                                        collection_name=collection_name)

# Look up each result by id and print its rank, title, submitter and categories
for i, result in enumerate(results['similarity']):
    vector_api_response = hyperspace_client.find_vector_by_id(vector_id=result['vector_id'], collection_name=collection_name)
    print(i + 1, ".", "vector id", result['vector_id'], ":", vector_api_response['title'], ",", vector_api_response['submitter'], ",", vector_api_response['categories'])
    print("\n")


1 . vector id 65 : Hall field induced magnetoresistance oscillations of a two-dimensional
  electron system , Alejandro Kunold , ['cond-mat.mes-hall']


2 . vector id 366812 : Quantum and classical dissipation of charged particles , Alejandro Kunold , ['quant-ph', 'cond-mat.mes-hall']


3 . vector id 117319 : Non linear transport theory for negative-differential resistance states
  of two dimensional electron systems in strong magnetic fields , Alejandro Kunold , ['cond-mat.mes-hall']


4 . vector id 242529 : Symmetry breaking as the origin of zero-differential resistance states
  of a 2DEG in strong magnetic fields , Alejandro Kunold , ['cond-mat.mes-hall', 'cond-mat.other']


5 . vector id 178121 : Impact of heavy hole-light hole coupling on optical selection rules in
  GaAs quantum dots , Alejandro Kunold , ['cond-mat.mes-hall']




The returned documents have the same submitter as the query paper, as expected from the metadata filtering.
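To illustrate the effect of the `boost` weights in the query, the sketch below multiplies each score by its weight, as described earlier. How Hyperspace combines the weighted scores is not stated in this notebook; summing them is an assumption made for illustration only, and the function name `combine_scores` and the example score values are hypothetical.

```python
def combine_scores(metadata_score, vector_score, boost):
    """Weight each score by its boost and sum them (the summation is an assumption)."""
    weighted_metadata = boost["query"] * metadata_score
    weighted_vector = boost["embedded_abstract"] * vector_score
    return weighted_metadata + weighted_vector

# Same weights as in the query object of this notebook.
boost = {"query": 0.4, "embedded_abstract": 0.2}

# Hypothetical scores for two candidates: (metadata score, vector similarity).
strong_metadata_match = combine_scores(3.0, 0.5, boost)
strong_vector_match = combine_scores(1.0, 0.9, boost)

print(strong_metadata_match > strong_vector_match)
```

Under these assumed values, the candidate with the stronger metadata match wins because the metadata weight is twice the vector weight.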

This notebook gave a simple example of using the Hyperspace engine for hybrid search. Hyperspace can support complicated use cases with large databases at extremely low latency.