## the `vector-db` module

This document reviews the `vector-db` module, which takes a numpy array as input, indexes its vectors, and returns an indexed [faiss database](https://github.com/facebookresearch/faiss).

This document includes an overview of custom pipeline setup, current model set, parameters, and `.process` usage for this module.

To follow along with this demonstration be sure to initialize your krixik session with your api key and url as shown below. 

We illustrate loading these required secrets via [python-dotenv](https://pypi.org/project/python-dotenv/), storing those secrets in a `.env` file.  This is always good practice for storing / loading secrets (e.g., doing so will reduce the chance you inadvertently push secrets to a repo).

In [1]:
import sys 
sys.path.append('../../')
from docs.utilities.reset import reset_pipeline

In [2]:
# load secrets from a .env file using python-dotenv
from dotenv import load_dotenv
import os
load_dotenv("../.env")
MY_API_KEY = os.getenv('MY_API_KEY')
MY_API_URL = os.getenv('MY_API_URL')

# import krixik and initialize it with your personal secrets
from krixik import krixik
krixik.init(api_key = MY_API_KEY, 
            api_url = MY_API_URL)

SUCCESS: You are now authenticated.
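
Note that `os.getenv` returns `None` when a variable is not set, so a missing or misnamed entry in the `.env` file may only surface later as an authentication failure. A minimal sketch of a fail-fast check is given below; `require_env` is a hypothetical helper name used for illustration and is not part of krixik or python-dotenv.

```python
import os

def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"environment variable {name} is not set - check your .env file")
    return value

# example usage, after load_dotenv has run:
# MY_API_KEY = require_env("MY_API_KEY")
# MY_API_URL = require_env("MY_API_URL")
```

Raising early keeps the error message close to its cause, rather than letting a `None` value travel into the initialization call.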


This small function prints dictionaries very nicely in notebooks / markdown.

In [3]:
# print dictionaries / json nicely in notebooks / markdown
import json
def json_print(data):
    print(json.dumps(data, indent=2))

A table of contents for the remainder of this document is shown below.


- [pipeline setup](#pipeline-setup)
- [using the `faiss` model](#using-the-faiss-model)
- [querying output databases locally](#querying-output-databases-locally)
- [using the `vector_search` method](#using-the-vector_search-method)

## pipeline setup

Below we set up a simple one-module pipeline using the `vector-db` module.

In [6]:
pipeline = krixik.create_pipeline(name="vector-db-pipeline-1",
                                  module_chain=["vector-db"])

In [5]:
pipeline.save('test.yml')

In [4]:
# import custom module creation tools
from krixik.pipeline_builder.module import Module
from krixik.pipeline_builder.pipeline import CreatePipeline

# instantiate module
module_1 = Module(module_type="vector-db")

# create custom pipeline object
custom = CreatePipeline(name='vector-db-pipeline-1', 
                        module_chain=[module_1])

# pass the custom object to the krixik operator (note you can also do this by passing its config)
pipeline = krixik.load_pipeline(pipeline=custom)

The `vector-db` module comes with a single model:

- `faiss`: (default) indexes a numpy array of input vectors

These available modeling options and parameters are stored in our custom pipeline's configuration (described further in LINK HERE).  We can examine this configuration as shown below.

In [10]:
# nicely print the configuration of our custom pipeline
json_print(custom.config)

{
  "pipeline": {
    "name": "vector-db-pipeline-1",
    "modules": [
      {
        "name": "vector-db",
        "models": [
          {
            "name": "faiss"
          }
        ],
        "defaults": {
          "model": "faiss"
        },
        "input": {
          "type": "npy",
          "permitted_extensions": [
            ".npy"
          ]
        },
        "output": {
          "type": "faiss"
        }
      }
    ]
  }
}


Here we can see the models and their associated parameters available for use.
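
The same information can be pulled out of the configuration programmatically. The sketch below hard-codes a copy of the configuration printed above so that it runs on its own; `summarize_modules` is a hypothetical helper written for illustration, not a krixik function.

```python
# a copy of the configuration printed above, hard-coded so this sketch runs on its own
config = {
    "pipeline": {
        "name": "vector-db-pipeline-1",
        "modules": [
            {
                "name": "vector-db",
                "models": [{"name": "faiss"}],
                "defaults": {"model": "faiss"},
                "input": {"type": "npy", "permitted_extensions": [".npy"]},
                "output": {"type": "faiss"},
            }
        ],
    }
}

def summarize_modules(config: dict) -> list:
    """Collect the model names, default model, and input extensions of each module."""
    summary = []
    for module in config["pipeline"]["modules"]:
        summary.append({
            "module": module["name"],
            "models": [m["name"] for m in module["models"]],
            "default_model": module["defaults"]["model"],
            "input_extensions": module["input"]["permitted_extensions"],
        })
    return summary

print(summarize_modules(config))
```

For this pipeline the summary contains a single entry: the `vector-db` module, with `faiss` as both its only and its default model, accepting `.npy` input.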

In [3]:
pipeline = krixik.create_pipeline(name="vector-db-pipeline-1",
                                  module_chain=["vector-db"])

In [4]:
# delete all processed datapoints belonging to this pipeline
reset_pipeline(pipeline)

## using the `faiss` model

We first load in a handful of vectors from disk.

In [12]:
# define path to an input file from examples directory
test_file = "../input_data/vectors.npy"

Let's take a quick look at this file before processing.

In [13]:
# examine contents of input file
import numpy as np
np.load(test_file)

array([[0, 1],
       [1, 0],
       [1, 1]])

Three simple two dimensional vectors.
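
To try the module on vectors of your own, the input needs to be saved in `.npy` format (the only permitted extension in the configuration above). Below is a minimal sketch of writing and reloading such a file with `numpy.save` and `numpy.load`; the file name `my_vectors.npy` is a hypothetical choice, and a temporary directory is used only so the sketch leaves nothing behind.

```python
import os
import tempfile
import numpy as np

# the same three two-dimensional vectors shown above
vectors = np.array([[0, 1],
                    [1, 0],
                    [1, 1]])

# write to a temporary directory so the sketch leaves nothing behind
with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "my_vectors.npy")  # hypothetical file name
    np.save(file_path, vectors)

    # read the file back to confirm the round trip
    reloaded = np.load(file_path)

print(reloaded.shape)
```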

Now let's process it using our `faiss` model.  Because `faiss` is the default model we need not input the optional `modules` argument into `.process`.

In [5]:
# define path to an input file from examples directory
test_file = "../input_data/vectors.npy"

# process for search
process_output = pipeline.process(local_file_path = test_file,
                                  local_save_directory=".", # save output in current directory
                                  expire_time=60*3,         # set all process data to expire in 3 minutes
                                  wait_for_process=True,    # wait for process to complete before regaining ide
                                  verbose=False)            # set verbosity to False

The output of this process is printed below.  Because the output of this particular module-model pair is a faiss database, the process output provided in this object is null.  However, the database file itself has been saved to the location noted in the `process_output_files` key.  The `file_id` of the processed input is used as a filename prefix for the output file.

In [6]:
# nicely print the output of this process
json_print(process_output)

### querying output databases locally

We can now perform queries on the pulled vector database whose location is given in `process_output_files`.

Below is a simple function for performing single vector queries on this database locally.  Note: you will need to install the faiss library to execute this cell.  Install [faiss-cpu](https://pypi.org/project/faiss-cpu/) or [faiss-gpu](https://pypi.org/project/faiss-gpu/) depending on the specs of your local setup.

In [11]:
import faiss
import numpy as np
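from typing import Tuple

def query_vector_db(query_vector: np.ndarray,
                    k: int,
                    db_file_path: str) -> Tuple[np.ndarray, np.ndarray]:
    # read in vector db
    faiss_index = faiss.read_index(db_file_path)

    # faiss expects contiguous float32 input, so cast the query before searching
    query_vector = np.ascontiguousarray(query_vector, dtype=np.float32)

    # perform query - returns the k best matches for each row of query_vector
    similarities, indices = faiss_index.search(query_vector, k)

    # convert similarities to distances (assumes the index returns cosine similarities)
    distances = 1 - similarities
    return distances, indices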

Perform a simple query using the test function above.

In [12]:
# perform test query using the above query function
original_vectors = np.load(test_file)
query_vector = np.array([[0,1]])
distances, indices = query_vector_db(query_vector, 2, process_output['process_output_files'][0])
print(f"input query vector: {query_vector[0]}")
print(f"closest vector from original: {original_vectors[indices[0][0]]}")
print(f"distance from query to this vector: {distances[0][0]}")
print(f"second closest vector from original: {original_vectors[indices[0][1]]}")
print(f"distance from query to this vector: {distances[0][1]}")

input query vector: [0 1]
closest vector from original: [0 1]
distance from query to this vector: 0.0
second closest vector from original: [1 1]
distance from query to this vector: 0.2928932309150696
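
The reported distances are consistent with cosine distance, that is, one minus the cosine similarity between the query and each stored vector. This is an inference from the `1 - similarities` conversion and the numbers above, not a statement about how the database is built internally. The sketch below reproduces the ranking and the second distance using only the standard library, so that the arithmetic is visible; `cosine_distance` is a helper defined here for illustration.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity between two vectors given as sequences of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

query = [0, 1]
stored = [[0, 1], [1, 0], [1, 1]]  # the vectors from the input file above

# rank stored vectors by distance to the query, closest first
ranked = sorted(stored, key=lambda v: cosine_distance(query, v))
for v in ranked[:2]:
    print(v, round(cosine_distance(query, v), 4))
```

For the vector `[1, 1]` this gives `1 - 1/sqrt(2)`, roughly 0.2929, which agrees with the value returned by the local query up to floating point precision.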


### using the `vector_search` method

krixik's `vector_search` method is a convenience function for both embedding and querying - and so can only be used with pipelines containing both `text-embedder` and `vector-db` modules in succession.

Below we construct the simplest custom pipeline that satisfies these criteria - a standard vector search pipeline consisting of three modules: a `parser`, a `text-embedder`, and a `vector-db` index.

In [20]:
# import custom module creation tools
from krixik.pipeline_builder.module import Module
from krixik.pipeline_builder.pipeline import CreatePipeline

# instantiate module
module_1 = Module(module_type="parser")
module_2 = Module(module_type="text-embedder")
module_3 = Module(module_type="vector-db")

# create custom pipeline object
custom = CreatePipeline(name='vector-search-pipeline-1', 
                        module_chain=[module_1, module_2, module_3])

# pass the custom object to the krixik operator (note you can also do this by passing its config)
pipeline = krixik.load_pipeline(pipeline=custom)

In [21]:
reset_pipeline(pipeline)

We can now perform any of the core system methods on our custom pipeline (e.g., `.process`, `.list`, etc.).  Additionally we can invoke the `vector_search` method.

Let's first process a file with our new pipeline.  This pipeline takes in a text file and returns a `faiss` vector database built from all non-trivial `(snippet, line_numbers)` tuples parsed from the input.

In [23]:
# define path to an input file from examples directory
test_file = "../input_data/1984_very_short.txt"

# process for search
process_output = pipeline.process(local_file_path = test_file,
                                  local_save_directory=".", # save output in current directory
                                  expire_time=60*10,        # set all process data to expire in 10 minutes
                                  wait_for_process=True,    # wait for process to complete before regaining ide
                                  verbose=True)             # set verbosity to True

# nicely print the output of this process
json_print(process_output)

INFO: hydrated input modules: {'module_1': {'model': 'sentence', 'params': {}}, 'module_2': {'model': 'all-MiniLM-L6-v2', 'params': {'quantize': True}}, 'module_3': {'model': 'faiss', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_file_name_pnzmdtxamt.txt
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 600 seconds, at Thu May  2 10:23:50 2024 UTC
INFO: vector-search-pipeline-1 file process and input processing started...
INFO: metadata can be updated using the .update api.
INFO: This process's request_id is: b1461aaf-cf57-5980-a3c3-29bc7ec6b204
INFO: File process and processing status:
SUCCESS: module 1 (of 3) - parser processing complete.
SUCCESS: module 2 (of 3) - text-embedder processing complete.
SUCCESS: module 3 (of 3) - vector-db processing complete.
SUCCESS: pipeline process complete.
SUCCESS: process

Now we can query our text with natural language as shown below.

In [None]:
# perform vector_search over the input file
vector_output = pipeline.vector_search(query="it was cold night",
                                       file_ids=[process_output["file_id"]])

# nicely print the output of this process
json_print(vector_output)

{
  "status_code": 200,
  "request_id": "98e6c488-440d-4a1a-978b-bb82affcb1b4",
  "message": "Successfully queried 1 user file.",
  "items": [
    {
      "file_id": "656c6486-8a89-43ea-8658-762dbf8b9c9c",
      "file_metadata": {
        "file_name": "krixik_generated_file_name_advielayge.txt",
        "symbolic_directory_path": "/etc",
        "file_tags": [],
        "num_vectors": 2,
        "created_at": "2024-04-28 16:21:06",
        "last_updated": "2024-04-28 16:21:06"
      },
      "search_results": [
        {
          "snippet": "It was a bright cold day in April, and the clocks were striking thirteen.",
          "line_numbers": [
            1
          ],
          "distance": 0.224
        },
        {
          "snippet": "Winston Smith, his chin nuzzled into his breast in an effort to escape the\nvile wind, slipped quickly through the glass doors of Victory Mansions,\nthough not quickly enough to prevent a swirl of gritty dust from entering\nalong with him.",
       