
# Classic Search With Hyperspace
This notebook demonstrates how to use the Hyperspace engine for classic (keyword and value matching) search.

## The Dataset - Crimes In Chicago Dataset
From Kaggle:
"This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. Data is extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system. In order to protect the privacy of crime victims, addresses are shown at the block level only and specific locations are not identified. This data includes unverified reports supplied to the Police Department. The preliminary crime classifications may be changed at a later date based upon additional investigation and there is always the possibility of mechanical or human error. Therefore, the Chicago Police Department does not guarantee (either expressed or implied) the accuracy, completeness, timeliness, or correct sequencing of the information and the information should not be used for comparison purposes over time."

The [dataset](https://www.kaggle.com/datasets/chicago/chicago-crime) can be downloaded from this [link](https://github.com/hyper-space-io/QuickStart/blob/main/DataSets/CrimesInChicago/100k-crimes-dataset-processed_data.zip).

## The dataset fields
1. **Case Number {'type': 'keyword'}** - The Chicago Police Department RD Number (Records Division Number), which is unique to the incident.
2. **Date {'type': 'date', 'format': 'MM/dd/yyyy hh:mm:ss a'}** - Date when the incident occurred. This is sometimes a best estimate.
3. **Block {'type': 'keyword'}** - The partially redacted address where the incident occurred, placing it on the same block as the actual address.
4. **IUCR {'type': 'keyword'}** - The Illinois Uniform Crime Reporting code. This is directly linked to the Primary Type and Description. See the list of IUCR codes at https://data.cityofchicago.org/d/c7ck-438e.
5. **Primary Type {'type': 'keyword'}** - The primary description of the IUCR code.
6. **Description {'type': 'keyword'}** - The secondary description of the IUCR code, a subcategory of the primary description.
7. **Location Description {'type': 'keyword'}** - Description of the location where the incident occurred.
8. **Arrest {'type': 'boolean'}** - Indicates whether an arrest was made.
9. **Domestic {'type': 'boolean'}** - Indicates whether the incident was domestic-related as defined by the Illinois Domestic Violence Act.
10. **Beat {'type': 'integer'}** - Indicates the beat where the incident occurred. A beat is the smallest police geographic area – each beat has a dedicated police beat car. Three to five beats make up a police sector, and three sectors make up a police district. The Chicago Police Department has 22 police districts. See the beats at https://data.cityofchicago.org/d/aerh-rz74.
11. **District {'type': 'integer'}** - Indicates the police district where the incident occurred. See the districts at https://data.cityofchicago.org/d/fthy-xz3r.
12. **Ward {'type': 'integer'}** - The ward (City Council district) where the incident occurred. See the wards at https://data.cityofchicago.org/d/sp34-6z76.
13. **Community Area {'type': 'integer'}** - Indicates the community area where the incident occurred. Chicago has 77 community areas. See the community areas at https://data.cityofchicago.org/d/cauq-8yn6.
14. **FBI Code {'type': 'keyword'}** - Indicates the crime classification as outlined in the FBI's National Incident-Based Reporting System (NIBRS). See the Chicago Police Department listing of these classifications at http://gis.chicagopolice.org/clearmap_crime_sums/crime_types.html.
15. **X Coordinate {'type': 'integer'}** - The x coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block.
16. **Y Coordinate {'type': 'integer'}** - The y coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block.
17. **Year {'type': 'integer'}** - Year the incident occurred.
18. **Updated On {'type': 'date', 'format': 'MM/dd/yyyy hh:mm:ss a'}** - Date and time the record was last updated.
19. **Latitude {'type': 'float'}** - The latitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.
20. **Longitude {'type': 'float'}** - The longitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.
21. **Location {'type': 'geo_point', 'struct_type': 'list'}** - The location where the incident occurred in a format that allows for creation of maps and other geographic operations on this data portal. This location is shifted from the actual location for partial redaction but falls on the same block.

We mount a cloud folder which hosts the client files and install the client


# Setting up the Hyperspace Environment
Setting up the environment and running the query involves the following steps:
1. Download and install the client API
2. Connect to a server
3. Create data schema file
4. Create collection
5. Ingest data
6. Run query

## 1. Install the Hyperspace client API
Hyperspace API can be installed directly from git, using the following command:

In [1]:
pip install git+https://github.com/hyper-space-io/hyperspace-py

Collecting git+https://github.com/hyper-space-io/hyperspace-py
  Cloning https://github.com/hyper-space-io/hyperspace-py to /tmp/pip-req-build-axx4to3s
  Running command git clone --filter=blob:none --quiet https://github.com/hyper-space-io/hyperspace-py /tmp/pip-req-build-axx4to3s
  Resolved https://github.com/hyper-space-io/hyperspace-py to commit fa0e83afe20c732fc8edf77c6b6201b2110a717b
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: hyperspace
  Building wheel for hyperspace (setup.py) ... done
  Created wheel for hyperspace: filename=hyperspace-1.0.0-py3-none-any.whl size=36135 sha256=f8e8d28b7b8dca0cbf597df022599d401ba09c9e67bca323d37eba0db55cd5cf
  Stored in directory: /tmp/pip-ephem-wheel-cache-5vbju5uh/wheels/c4/96/59/f4b91d653fdbfc819e48a7dacbea1c9f3de59a1bc113aa840d
Successfully built hyperspace
Installing collected packages: hyperspace
Successfully installed hyperspace-1.0.0


## 2. Connect to a server

Once the Hyperspace API is installed, the database can be accessed by creating a local instance of the Hyperspace client. This step requires a host address, a username, and a password.

In [2]:
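import hyperspace
from getpass import getpass

# Replace the placeholder with your Hyperspace username; the password is prompted for.
username = 'YOUR_USERNAME'

hyperspace_client = hyperspace.HyperspaceClientApi(host='https://search-master-demo.development.hyper-space.xyz',
                                                   username=username, password=getpass())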

··········
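
As an alternative to typing credentials in every session, they can be read from environment variables. The following is only a minimal sketch: the variable names `HYPERSPACE_HOST`, `HYPERSPACE_USERNAME` and `HYPERSPACE_PASSWORD`, and the helper `load_credentials`, are hypothetical conventions and not part of the Hyperspace client.

```python
import os
from getpass import getpass


def load_credentials(env=None):
    """Return (host, username, password) from a mapping of environment variables.

    The variable names are hypothetical. If no password is set,
    fall back to an interactive prompt.
    """
    env = os.environ if env is None else env
    host = env["HYPERSPACE_HOST"]
    username = env["HYPERSPACE_USERNAME"]
    password = env.get("HYPERSPACE_PASSWORD") or getpass()
    return host, username, password
```

The returned values can then be passed to `hyperspace.HyperspaceClientApi(host=..., username=..., password=...)` as in the cell above.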


We check the status of the collections before proceeding.

In [None]:
collections_info = hyperspace_client.collections_info()
display(collections_info)

## 3. Create a Data Schema File

Similar to other search databases, the Hyperspace database requires a configuration file that outlines the data schema. Here, we create a config file that corresponds to the fields of the given dataset.

For vector fields, we also provide the index type and the metric. Current index options include "**brute_force**", "**hnsw**", and "**ivf**", plus "**bin_ivf**" for binary vectors. The available metrics are "**IP**" (inner product) for floating point vectors and "**Hamming**" ([Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance)) for binary vectors.
Note that the key 'low_cardinality' enables faster search for low-cardinality fields.
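
To make the two metrics concrete, here is a small plain-Python illustration of what each one computes. It does not use the Hyperspace client; the function names are chosen for this example only.

```python
def inner_product(u, v):
    """Sum of element-wise products of two equal-length float vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def hamming_distance(u, v):
    """Number of positions at which two equal-length binary vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v))


print(inner_product([0.5, 1.0, 2.0], [2.0, 0.0, 1.0]))  # 0.5*2 + 1*0 + 2*1 = 3.0
print(hamming_distance([1, 0, 1, 1], [1, 1, 0, 1]))     # positions 1 and 2 differ: 2
```

A larger inner product means more similar vectors, while a smaller Hamming distance means more similar vectors.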

In [4]:
import json

config = {
  "configuration": {
    "ID": {
      "type": "integer"
    },
    "Case Number": {
      "type": "keyword"
    },
    "Date": {
      "type": "date",
      "format": "MM/dd/yyyy hh:mm:ss a"
    },
    "Block": {
      "type": "keyword"
    },
    "IUCR": {
      "type": "keyword"
    },
    "Primary Type": {
      "type": "keyword"
    },
    "Description": {
      "type": "keyword"
    },
    "Location Description": {
      "type": "keyword"
    },
    "Arrest": {
      "type": "boolean"
    },
    "Domestic": {
      "type": "boolean"
    },
    "Beat": {
      "type": "integer"
    },
    "District": {
      "type": "integer"
    },
    "Ward": {
      "type": "integer"
    },
    "Community Area": {
      "type": "integer"
    },
    "FBI Code": {
      "type": "keyword"
    },
    "X Coordinate": {
      "type": "integer"
    },
    "Y Coordinate": {
      "type": "integer"
    },
    "Year": {
      "type": "integer"
    },
    "Updated On": {
      "type": "date",
      "format": "MM/dd/yyyy hh:mm:ss a"
    },
    "Latitude": {
      "type": "float"
    },
    "Longitude": {
      "type": "float"
    },
    "Location": {
      "type": "geo_point",
      "struct_type": "list"
    }
  }
}

with open('crime-config.json', 'w') as f:
    f.write(json.dumps(config, indent=2))
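
Before creating the collection, it can be useful to confirm that the schema contains the expected fields. The following is a minimal sketch using only the standard library. The helper name `fields_by_type` is hypothetical, and the abbreviated `sample_schema` stands in for the full `config` above.

```python
import json
from collections import defaultdict


def fields_by_type(schema):
    """Group the field names of a schema dict by their declared type."""
    groups = defaultdict(list)
    for name, spec in schema["configuration"].items():
        groups[spec["type"]].append(name)
    return dict(groups)


# Abbreviated stand-in for the full schema defined above.
sample_schema = {
    "configuration": {
        "Case Number": {"type": "keyword"},
        "Arrest": {"type": "boolean"},
        "Beat": {"type": "integer"},
        "District": {"type": "integer"},
        "Location": {"type": "geo_point", "struct_type": "list"},
    }
}

# Round-trip through JSON text, mirroring how the schema file is written.
reloaded = json.loads(json.dumps(sample_schema, indent=2))
print(fields_by_type(reloaded))
```

In the notebook, the same helper could be applied to the contents of `crime-config.json` after loading it with `json.load`.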



## 4. Create Collection
The Hyperspace engine stores data in collections, where each collection commonly hosts data of a similar context. Each search is then performed within a collection. We create a collection using the command "**create_collection**(schema_filename, collection_name)".

In [None]:
collection_name = 'CrimesInChicago'

# Delete any existing collection with the same name; ignore the error if none exists.
try:
    hyperspace_client.delete_collection(collection_name)
except Exception:
    pass
hyperspace_client.create_collection('crime-config.json', collection_name)
hyperspace_client.collections_info()

## 5. Ingest Data

In the next step we ingest the dataset in batches of 500 documents. The batch size can be controlled by the user and, in particular, can be increased in order to improve ingestion time. We add batches of data using the command **add_batch**(batch, collection_name).

In [20]:
BATCH_SIZE = 500

batch = []
# The dataset file holds one JSON document per line.
with open('100k-crimes-dataset-processed_data.json') as metadata:
    for i, metadata_row in enumerate(metadata):
        # Keep only the fields declared in the schema.
        row = {key: value for key, value in json.loads(metadata_row).items() if key in config["configuration"]}
        batch.append(hyperspace.Document(str(i), row))

        # Flush whenever the row index is a multiple of BATCH_SIZE
        # (the first flush, at i == 0, holds a single document).
        if i % BATCH_SIZE == 0:
            response = hyperspace_client.add_batch(batch, collection_name)
            batch.clear()
            print(i, response)
# Send the remaining documents, if any.
if batch:
    response = hyperspace_client.add_batch(batch, collection_name)
    batch.clear()
    print(i, response)
hyperspace_client.commit(collection_name)


0 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}
500 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}
1000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}
1500 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}
2000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}
2500 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}
3000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}
3500 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}
4000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}
4500 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}
5000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}
5500 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}
6000 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}
6500 {'code': 200, 'message': 'Batch succe

{'code': 200, 'message': 'Dataset committed successfully', 'status': 'OK'}
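
The batching pattern used during ingestion can also be factored into a small generator that yields batches of equal size, with a shorter final batch if the input does not divide evenly. This is a sketch in plain Python; the helper name `batched` is chosen for this example and is not part of the Hyperspace client.

```python
def batched(items, size):
    """Yield consecutive lists of at most `size` items from an iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []  # start a fresh list so yielded batches are not mutated
    if batch:
        yield batch  # final, possibly shorter, batch


print([len(b) for b in batched(range(1201), 500)])  # [500, 500, 201]
```

Each yielded batch could then be passed to `hyperspace_client.add_batch(batch, collection_name)`, followed by a single `commit` once all batches are sent.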

## 6. Define Logic and Run a Query
We will build a classic (keyword and value matching) search query using Hyperspace. In the query, we will select a document and find similar ones.
The score function is given in the file Crime_Score_Function.py.


In [21]:
from pprint import pprint

input_vector = hyperspace_client.get_document(collection_name, "65")
pprint(input_vector)

{'Arrest': False,
 'Beat': 1711,
 'Block': '057XX N SPAULDING AVE',
 'Case Number': 'HY411327',
 'Community Area': 13,
 'Date': 1441396800,
 'Description': 'TO VEHICLE',
 'District': 17,
 'Domestic': False,
 'FBI Code': '14',
 'ID': 10224814,
 'IUCR': '1320',
 'Latitude': 41.985148529,
 'Location': [41.985148529, -87.711378169],
 'Location Description': 'STREET',
 'Longitude': -87.711378169,
 'Primary Type': 'CRIMINAL DAMAGE',
 'Updated On': 1518270601,
 'Ward': 39,
 'X Coordinate': 1153345,
 'Y Coordinate': 1937792,
 'Year': 2015}
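
Note that `Date` and `Updated On` come back as integers rather than formatted strings. They look like Unix epoch seconds, although the notebook does not state this. The following sketch converts between the schema's date format and epoch seconds, under the assumption that the values are epoch seconds in UTC.

```python
from datetime import datetime, timezone

# Python strptime equivalent of the schema format 'MM/dd/yyyy hh:mm:ss a'.
PY_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def to_epoch_seconds(text):
    """Parse a schema-formatted date string as UTC and return epoch seconds."""
    parsed = datetime.strptime(text, PY_FORMAT).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def from_epoch_seconds(seconds):
    """Format epoch seconds as a schema-formatted date string in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(PY_FORMAT)


print(from_epoch_seconds(1441396800))  # 09/04/2015 08:00:00 PM (assuming UTC)
```

Under that assumption the `Date` value falls in 2015, which is consistent with the document's `Year` field.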


We will use a very simple logic that matches the description and location and requires the case number to differ, so that the query document itself is not returned.

We use the following logic:


*   Match crime description and not case number
*   geo_dist match (geo distance)
*   Match district and window_match the date
*   Match Block
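
To illustrate the intent of these rules, here is a plain-Python sketch of a comparable scoring logic. It is not the content of Crime_Score_Function.py and does not use the Hyperspace score function syntax. The thresholds, the weights, and the candidate document are hypothetical values chosen only for illustration.

```python
import math

# Hypothetical thresholds, chosen only for illustration.
MAX_DISTANCE_KM = 2.0
DATE_WINDOW_SECONDS = 30 * 24 * 3600


def haversine_km(a, b):
    """Great-circle distance in kilometres between two [lat, lon] points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def similarity(query, doc):
    """Score a candidate document against the query document."""
    # Rule 1: same description, different case number (otherwise no match).
    if doc["Case Number"] == query["Case Number"]:
        return 0.0
    if doc["Description"] != query["Description"]:
        return 0.0
    score = 1.0
    # Rule 2: geographic proximity.
    if haversine_km(query["Location"], doc["Location"]) <= MAX_DISTANCE_KM:
        score += 1.0
    # Rule 3: same district and date within a window.
    if (doc["District"] == query["District"]
            and abs(doc["Date"] - query["Date"]) <= DATE_WINDOW_SECONDS):
        score += 1.0
    # Rule 4: same block.
    if doc["Block"] == query["Block"]:
        score += 1.0
    return score


query = {"Case Number": "HY411327", "Description": "TO VEHICLE", "District": 17,
         "Block": "057XX N SPAULDING AVE", "Date": 1441396800,
         "Location": [41.985148529, -87.711378169]}
# Hypothetical candidate: nearby, one day later, same district, different block.
candidate = {"Case Number": "XX000000", "Description": "TO VEHICLE", "District": 17,
             "Block": "058XX N EXAMPLE AVE", "Date": 1441396800 + 86400,
             "Location": [41.99, -87.71]}

print(similarity(query, candidate))  # 3.0
print(similarity(query, query))      # 0.0 (same case number is excluded)
```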

The score function is loaded from the file Crime_Score_Function.py in the cell below.





In [22]:
# Register the score function file under the name 'similarity_sf'.
response = hyperspace_client.set_function('Crime_Score_Function.py', collection_name=collection_name, function_name='similarity_sf')

# The selected document is passed as the query parameters.
query = {
    'params': input_vector,
    'query': {'boost': 1}
}

results = hyperspace_client.search(query,
                                   size=30,
                                   function_name='similarity_sf',
                                   collection_name=collection_name)
print("query run time:", results["took_ms"])
for i, result in enumerate(results['similarity']):
    # Fetch the full document for each hit (available for inspection).
    vector_api_response = hyperspace_client.get_document(document_id=result['document_id'], collection_name=collection_name)
    print(i + 1, "id", result['document_id'], "score = ", result["score"])



query run time: 3.17509
1 id 65 score =  16.02469253540039
2 id 30954 score =  9.138360977172852
3 id 30984 score =  9.138360977172852
4 id 30985 score =  9.138360977172852
5 id 30989 score =  9.138360977172852
6 id 31004 score =  9.138360977172852
7 id 31176 score =  9.138360977172852
8 id 31192 score =  9.138360977172852
9 id 31195 score =  9.138360977172852
10 id 31230 score =  9.138360977172852
11 id 31239 score =  9.138360977172852
12 id 31270 score =  9.138360977172852
13 id 31304 score =  9.138360977172852
14 id 31305 score =  9.138360977172852
15 id 31323 score =  9.138360977172852
16 id 31333 score =  9.138360977172852
17 id 31356 score =  9.138360977172852
18 id 31357 score =  9.138360977172852
19 id 31363 score =  9.138360977172852
20 id 31382 score =  9.138360977172852
21 id 31429 score =  9.138360977172852
22 id 31430 score =  9.138360977172852
23 id 31447 score =  9.138360977172852
24 id 31453 score =  9.138360977172852
25 id 31462 score =  9.138360977172852
26 id 31521 s

We display the top 30 results. Note that results with identical scores are ordered arbitrarily, so a more discriminating score function will likely produce a better ranking.
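
If a reproducible order is needed for tied scores, the hits can be re-sorted on the client side with a secondary key. A minimal sketch, assuming each hit is a dict with `document_id` and `score` keys as in the loop above, and that document IDs are numeric strings (as they are in this notebook, where IDs are row indices); the helper name `rank` is hypothetical.

```python
def rank(hits):
    """Sort hits by descending score, breaking ties by ascending numeric document ID."""
    return sorted(hits, key=lambda hit: (-hit["score"], int(hit["document_id"])))


# Sample hits using scores and IDs from the output above, in shuffled order.
hits = [
    {"document_id": "30984", "score": 9.138360977172852},
    {"document_id": "65", "score": 16.02469253540039},
    {"document_id": "30954", "score": 9.138360977172852},
]

print([hit["document_id"] for hit in rank(hits)])  # ['65', '30954', '30984']
```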

For more information, visit us at [Hyperspace](https://www.hyper-space.io/)