# Classic Search With Hyperspace
This notebook demonstrates the use of the Hyperspace engine for classic search (keyword and value matching). The data is taken from [Kaggle](https://www.kaggle.com/datasets/chicago/chicago-crime).

## The Dataset - Crimes In Chicago Dataset
From Kaggle:
This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. Data is extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system. In order to protect the privacy of crime victims, addresses are shown at the block level only and specific locations are not identified. This data includes unverified reports supplied to the Police Department. The preliminary crime classifications may be changed at a later date based upon additional investigation and there is always the possibility of mechanical or human error. Therefore, the Chicago Police Department does not guarantee (either expressed or implied) the accuracy, completeness, timeliness, or correct sequencing of the information and the information should not be used for comparison purposes over time.

The dataset includes 8 million data points.


## Setting up the Hyperspace environment
Setting up the environment requires the following steps:

1. Download and install the client API
2. Create a data config file
3. Connect to a server
4. Create a collection
5. Ingest data
6. Run a query

We mount a cloud folder that hosts the client files and install the client.

### Install the Hyperspace client


In [None]:
!pip install drive/MyDrive/search_master.zip newlinejson
# hide output

Processing ./drive/MyDrive/search_master.zip
  Preparing metadata (setup.py) ... done
Collecting newlinejson
  Downloading NewlineJSON-1.0-py2.py3-none-any.whl (12 kB)
Building wheels for collected packages: search-master
  Building wheel for search-master (setup.py) ... done
  Created wheel for search-master: filename=search_master-1.0.0-py3-none-any.whl size=39147 sha256=a7196ff2543a27557cf0c817a1a374faf079ebd5815e0b28661b8ae09a74f677
  Stored in directory: /tmp/pip-ephem-wheel-cache-18lde5u2/wheels/16/4a/3d/5f117bdb31fe9ec055a07467b1949da65cd6246ecd3a3599fd
Successfully built search-master
Installing collected packages: newlinejson, search-master
Successfully installed newlinejson-1.0 search-master-1.0.0


In [None]:
import numpy as np
import json
import search_master

### Connect to server

Using the Hyperspace engine requires a connection to a remote machine with pre-provided credentials.

In [None]:
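# `username` and `password` are the pre-provided credentials; define them before running this cell.
conf = search_master.configuration.Configuration()
conf.host = 'Server-address'  # placeholder: replace with the address of your server

# Log in with an unauthenticated client to obtain a bearer token.
hyperspace_client = search_master.SearchMasterApi(api_client=search_master.api_client.ApiClient(configuration=conf))
login_response = hyperspace_client.login({"username": username, "password": password})

# Re-create the API client so that every request carries the token.
api_client = search_master.api_client.ApiClient(configuration=conf,
                                                header_name='Authorization',
                                                header_value="Bearer " + login_response.token)

hyperspace_client = search_master.SearchMasterApi(api_client=api_client)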
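
The login call expects `username` and `password` to exist already. One way to supply them without hard-coding secrets in the notebook is to read them from environment variables. The sketch below is only an illustration: the variable names `HYPERSPACE_USERNAME` and `HYPERSPACE_PASSWORD` and the helper `load_credentials` are hypothetical and are not part of the Hyperspace client.

```python
import os


def load_credentials(env=os.environ):
    """Return (username, password) from a mapping, or raise if one is missing."""
    # HYPERSPACE_USERNAME / HYPERSPACE_PASSWORD are hypothetical names chosen for this sketch.
    try:
        return env["HYPERSPACE_USERNAME"], env["HYPERSPACE_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None


# Example with an explicit mapping instead of the real environment.
username, password = load_credentials(
    {"HYPERSPACE_USERNAME": "demo", "HYPERSPACE_PASSWORD": "secret"}
)
```

Calling `load_credentials()` with no argument reads from the real process environment.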


In [None]:
hyperspace_client.cluster_status()

[{'Collections size': {'CrimesInChicago': 0},
  'FPGA memory usage in GB': '0.0080GB',
  'FPGA memory usage in percentage': '0.0080%',
  'Hostname': 'hyperspace-demo-0',
  'Number of total vectors': 0},
 {'Number of data nodes': 1}]

### Configuration file

Like other search databases, Hyperspace requires a configuration file that outlines the data schema. Let us first explore the database configuration.

In [None]:
config_path = r'crime-config.json'
with open(config_path, 'r') as file:
    config = json.load(file)
display(config)

{'configuration': {'ID': {'type': 'integer'},
  'Case Number': {'type': 'keyword'},
  'Date': {'type': 'date', 'format': 'MM/dd/yyyy hh:mm:ss a'},
  'Block': {'type': 'keyword'},
  'IUCR': {'type': 'keyword'},
  'Primary Type': {'type': 'keyword'},
  'Description': {'type': 'keyword'},
  'Location Description': {'type': 'keyword'},
  'Arrest': {'type': 'boolean'},
  'Domestic': {'type': 'boolean'},
  'Beat': {'type': 'integer'},
  'District': {'type': 'integer'},
  'Ward': {'type': 'integer'},
  'Community Area': {'type': 'integer'},
  'FBI Code': {'type': 'keyword'},
  'X Coordinate': {'type': 'integer'},
  'Y Coordinate': {'type': 'integer'},
  'Year': {'type': 'integer'},
  'Updated On': {'type': 'date', 'format': 'MM/dd/yyyy hh:mm:ss a'},
  'Latitude': {'type': 'float'},
  'Longitude': {'type': 'float'},
  'Location': {'type': 'geo_point', 'struct_type': 'list'}}}
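
The two `date` fields declare the pattern `MM/dd/yyyy hh:mm:ss a`, while records retrieved from the collection later in this notebook show dates as integers such as `1441396800`, which look like epoch seconds. As a rough sketch of that conversion in plain Python, the snippet below parses a timestamp in this pattern. The helper name `to_epoch_seconds` is hypothetical, and treating the timestamps as UTC is an assumption made for illustration; the engine's actual time zone handling is not stated here.

```python
from datetime import datetime, timezone

# Python strptime equivalent of the 'MM/dd/yyyy hh:mm:ss a' pattern.
PY_DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def to_epoch_seconds(text):
    """Parse a dataset timestamp and return epoch seconds, assuming UTC."""
    parsed = datetime.strptime(text, PY_DATE_FORMAT)
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


print(to_epoch_seconds("01/01/1970 12:00:00 AM"))  # 0
print(to_epoch_seconds("01/02/1970 01:00:00 PM"))  # 133200
```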

The dataset fields:

1. **Case Number {'type': 'keyword'}** - The Chicago Police Department RD Number (Records Division Number), which is unique to the incident.
2. **Date {'type': 'date', 'format': 'MM/dd/yyyy hh:mm:ss a'}** - Date when the incident occurred. This is sometimes a best estimate.
3. **Block {'type': 'keyword'}** - The partially redacted address where the incident occurred, placing it on the same block as the actual address.
4. **IUCR {'type': 'keyword'}** - The Illinois Uniform Crime Reporting code. This is directly linked to the Primary Type and Description. See the list of IUCR codes at https://data.cityofchicago.org/d/c7ck-438e.
5. **Primary Type {'type': 'keyword'}** - The primary description of the IUCR code.
6. **Description {'type': 'keyword'}** - The secondary description of the IUCR code, a subcategory of the primary description.
7. **Location Description {'type': 'keyword'}** - Description of the location where the incident occurred.
8. **Arrest {'type': 'boolean'}** - Indicates whether an arrest was made.
9. **Domestic {'type': 'boolean'}** - Indicates whether the incident was domestic-related as defined by the Illinois Domestic Violence Act.
10. **Beat {'type': 'integer'}** - Indicates the beat where the incident occurred. A beat is the smallest police geographic area; each beat has a dedicated police beat car. Three to five beats make up a police sector, and three sectors make up a police district. The Chicago Police Department has 22 police districts. See the beats at https://data.cityofchicago.org/d/aerh-rz74.
11. **District {'type': 'integer'}** - Indicates the police district where the incident occurred. See the districts at https://data.cityofchicago.org/d/fthy-xz3r.
12. **Ward {'type': 'integer'}** - The ward (City Council district) where the incident occurred. See the wards at https://data.cityofchicago.org/d/sp34-6z76.
13. **Community Area {'type': 'integer'}** - Indicates the community area where the incident occurred. Chicago has 77 community areas. See the community areas at https://data.cityofchicago.org/d/cauq-8yn6.
14. **FBI Code {'type': 'keyword'}** - Indicates the crime classification as outlined in the FBI's National Incident-Based Reporting System (NIBRS). See the Chicago Police Department listing of these classifications at http://gis.chicagopolice.org/clearmap_crime_sums/crime_types.html.
15. **X Coordinate {'type': 'integer'}** - The x coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block.
16. **Y Coordinate {'type': 'integer'}** - The y coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block.
17. **Year {'type': 'integer'}** - Year the incident occurred.
18. **Updated On {'type': 'date', 'format': 'MM/dd/yyyy hh:mm:ss a'}** - Date and time the record was last updated.
19. **Latitude {'type': 'float'}** - The latitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.
20. **Longitude {'type': 'float'}** - The longitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.
21. **Location {'type': 'geo_point', 'struct_type': 'list'}** - The location where the incident occurred in a format that allows for creation of maps and other geographic operations on this data portal. This location is shifted from the actual location for partial redaction but falls on the same block.

### Create collection
We create a Hyperspace collection using the config file.

In [None]:
hyperspace_client = search_master.SearchMasterApi(api_client=api_client)
collection_name = 'CrimesInChicago'
# hyperspace_client.delete_collection(collection_name)
hyperspace_client.create_collection('crime-config.json', collection_name)
hyperspace_client.cluster_status()

[{'Collections size': {'CrimesInChicago': 0},
  'FPGA memory usage in GB': '0.0080GB',
  'FPGA memory usage in percentage': '0.0080%',
  'Hostname': 'hyperspace-demo-0',
  'Number of total vectors': 0},
 {'Number of data nodes': 1}]

### Ingest data

We load the dataset from a file with one JSON record per line and ingest it in batches.

In [None]:
metadata = open('crimes-dataset-processed_data.json')

In [None]:
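from search_master import VectorDto

BATCH_SIZE = 500
config_fields = config["configuration"].keys()

batch = []
for i, metadata_row in enumerate(metadata):
    # Keep only the fields declared in the collection configuration.
    row = {key: value for key, value in json.loads(metadata_row).items() if key in config_fields}

    # The row index doubles as the vector id.
    batch.append(VectorDto(str(i), row))

    # Send the batch once it holds BATCH_SIZE rows.
    if len(batch) == BATCH_SIZE:
        response = hyperspace_client.add_batch(batch, collection_name)
        batch.clear()
        print(i + 1, response)

# Send the final partial batch, which the loop above does not flush.
if batch:
    response = hyperspace_client.add_batch(batch, collection_name)
    print(i + 1, response)

hyperspace_client.commit(collection_name)

Streaming output truncated to the last 5000 lines.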
5285000 {'status': 'OK', 'code': 200, 'message': 'Batch successfully added'}
5285500 {'status': 'OK', 'code': 200, 'message': 'Batch successfully added'}
5286000 {'status': 'OK', 'code': 200, 'message': 'Batch successfully added'}
5286500 {'status': 'OK', 'code': 200, 'message': 'Batch successfully added'}
5287000 {'status': 'OK', 'code': 200, 'message': 'Batch successfully added'}
5287500 {'status': 'OK', 'code': 200, 'message': 'Batch successfully added'}
5288000 {'status': 'OK', 'code': 200, 'message': 'Batch successfully added'}
5288500 {'status': 'OK', 'code': 200, 'message': 'Batch successfully added'}
5289000 {'status': 'OK', 'code': 200, 'message': 'Batch successfully added'}
5289500 {'status': 'OK', 'code': 200, 'message': 'Batch successfully added'}
5290000 {'status': 'OK', 'code': 200, 'message': 'Batch successfully added'}
5290500 {'status': 'OK', 'code': 200, 'message': 'Batch successfully added'}
5291000 {'s

ApiException: ignored
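
The ingestion pattern (filter each record to the configured fields, group rows into fixed-size batches, and send any leftover partial batch) can be sketched without the Hyperspace client. In the sketch below, `iter_batches` is a hypothetical helper, plain `(id, row)` tuples stand in for `VectorDto`, and the sample records are made up for illustration.

```python
import io
import json


def iter_batches(lines, allowed_fields, batch_size):
    """Yield lists of (id, row) pairs parsed from newline-delimited JSON."""
    batch = []
    for i, line in enumerate(lines):
        record = json.loads(line)
        # Drop every key that the collection configuration does not declare.
        row = {k: v for k, v in record.items() if k in allowed_fields}
        batch.append((str(i), row))
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Emit the final partial batch, if any rows remain.
    if batch:
        yield batch


# Made-up sample data: three records, one with an undeclared field.
sample = io.StringIO(
    '{"ID": 1, "Year": 2015, "Extra": "x"}\n'
    '{"ID": 2, "Year": 2016}\n'
    '{"ID": 3, "Year": 2017}\n'
)
batches = list(iter_batches(sample, {"ID", "Year"}, batch_size=2))
print([len(b) for b in batches])  # [2, 1]
```

With the real client, each yielded batch would be passed to `add_batch` after wrapping the pairs in `VectorDto`.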

## Define Logic and Run a Query
We will build a classic search query using Hyperspace. In the query, we select an incident and find similar ones. The score function is stored at
/content/drive/MyDrive/Demos/CrimesInChicago/Crime_Score_Function.py


Let's start with a simple example: matching a randomly selected data point.

In [None]:
from pprint import pprint

input_vector = hyperspace_client.find_vector_by_id(collection_name, 65)
pprint(input_vector)

{'Arrest': False,
 'Beat': 1711,
 'Block': '057XX N SPAULDING AVE',
 'Case Number': 'HY411327',
 'Community Area': 13,
 'Date': 1441396800,
 'Description': 'TO VEHICLE',
 'District': 17,
 'Domestic': False,
 'FBI Code': '14',
 'ID': 10224814,
 'IUCR': '1320',
 'Latitude': 41.985148529,
 'Location': [41.985148529, -87.711378169],
 'Location Description': 'STREET',
 'Longitude': -87.711378169,
 'Primary Type': 'CRIMINAL DAMAGE',
 'Updated On': 1518270601,
 'Ward': 39,
 'X Coordinate': 1153345,
 'Y Coordinate': 1937792,
 'Year': 2015}


We will use very simple logic that matches the description and location, and makes sure the case number doesn't match so that we don't get the query incident itself back.

We use the following logic:

*   Match crime description and not case number
*   geo_dist_match (geographic distance)
*   Match district and window_match the date
*   Match Block

The score function can be viewed in the next block.

In [None]:
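def similarity_sf(Q, V):
    score = 0.0

    # Same description but a different case number (i.e. not the query incident itself).
    if match('Description') and not match('Case Number'):
        score = rarity_max('Description')
    # Bonus when the two locations pass the geographic distance check.
    if geo_dist_match('Location', 4):
        score += 10
    # Applied when the district matches and the dates fall within the window.
    if match('District') and window_match('Date', 100, 40):
        score -= 5
    return score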
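
To get a feel for how such a score function combines its conditions, here is a plain-Python imitation that runs on two records as dictionaries. It is not how Hyperspace evaluates the function. Everything in it is a stand-in chosen for illustration: `rarity` replaces `rarity_max` with a constant, `max_km` is a hypothetical haversine threshold, and `window_seconds` is a hypothetical symmetric date window. None of these values is derived from the arguments passed to `geo_dist_match` or `window_match`.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in kilometres between two [lat, lon] points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def toy_score(q, v, rarity=1.0, max_km=1.0, window_seconds=30 * 86400):
    """Imitate the structure of similarity_sf with hypothetical stand-ins."""
    score = 0.0
    if q["Description"] == v["Description"] and q["Case Number"] != v["Case Number"]:
        score = rarity  # stand-in for rarity_max('Description')
    if haversine_km(q["Location"], v["Location"]) <= max_km:
        score += 10
    if q["District"] == v["District"] and abs(q["Date"] - v["Date"]) <= window_seconds:
        score -= 5
    return score


# Field values copied from the query vector and the first search result in this notebook.
query = {"Description": "TO VEHICLE", "Case Number": "HY411327", "District": 17,
         "Date": 1441396800, "Location": [41.985148529, -87.711378169]}
candidate = {"Description": "TO VEHICLE", "Case Number": "HK667044", "District": 17,
             "Date": 1096914600, "Location": [41.985244591, -87.711380912]}

print(toy_score(query, candidate))  # 11.0
print(toy_score(query, query))      # 5.0
```

The candidate earns the description and distance terms but falls outside the toy date window, while the query compared with itself loses the description term because the case numbers are equal.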


In [None]:
response = hyperspace_client.set_function('/content/drive/MyDrive/Demos/CrimesInChicago/Crime_Score_Function.py', collection_name=collection_name, function_name='similarity_sf')

query_with_knn = {
    'vector_Content': input_vector,
    'boost': {
        'query': 1,
    }
}

results = hyperspace_client.search_data(query_with_knn,
                                        size=30,
                                        function_name='similarity_sf',
                                        collection_name=collection_name)

# Fetch and print the full record behind each result.
for i, result in enumerate(results['similarity']):
    vector_api_response = hyperspace_client.find_vector_by_id(vector_id=result['vector_id'], collection_name=collection_name)
    print(i + 1, "vector id", result['vector_id'], "score = " , result["score"] ,":", vector_api_response)


1 vector id 2808182 score =  24.30988311767578 : {'Arrest': False, 'Beat': 1711, 'Block': '057XX N SPAULDING AVE', 'Case Number': 'HK667044', 'Community Area': 13, 'Date': 1096914600, 'Description': 'TO VEHICLE', 'District': 17, 'Domestic': False, 'FBI Code': '14', 'ID': 3588859, 'IUCR': '1320', 'Latitude': 41.985244591, 'Location': [41.985244591, -87.711380912], 'Location Description': 'STREET', 'Longitude': -87.711380912, 'Primary Type': 'CRIMINAL DAMAGE', 'Updated On': 1519826185, 'Ward': 39, 'X Coordinate': 1153344, 'Y Coordinate': 1937827, 'Year': 2004}
2 vector id 332712 score =  24.30988311767578 : {'Arrest': False, 'Beat': 1711, 'Block': '057XX N SPAULDING AVE', 'Case Number': 'JA145909', 'Community Area': 13, 'Date': 1486490400, 'Description': 'TO VEHICLE', 'District': 17, 'Domestic': False, 'FBI Code': '14', 'ID': 10842749, 'IUCR': '1320', 'Latitude': 41.986298546, 'Location': [41.986298546, -87.711414792], 'Location Description': 'STREET', 'Longitude': -87.711414792, 'Primar

We display the top 30 results. Note that results with identical scores are ordered arbitrarily.
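
If a reproducible order is needed for display, ties can be broken on the client side with a secondary sort key. The sketch below assumes each result is a dictionary with `score` and `vector_id` keys, as in the loop above. The helper `order_results` is hypothetical, and the third hit is made up for illustration.

```python
def order_results(results):
    """Sort by descending score, breaking ties by ascending vector id."""
    return sorted(results, key=lambda r: (-r["score"], r["vector_id"]))


hits = [
    {"vector_id": 2808182, "score": 24.30988311767578},
    {"vector_id": 332712, "score": 24.30988311767578},
    {"vector_id": 17, "score": 10.0},  # hypothetical extra hit
]
print([h["vector_id"] for h in order_results(hits)])  # [332712, 2808182, 17]
```
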
For more information, visit [Hyperspace](https://www.hyper-space.io/)