## A simple semantic search pipeline

This document walks through a simple semantic search pipeline that makes input text semantically searchable.


- [pipeline setup](#pipeline-setup)
- [processing an input file](#processing-an-input-file)
- [using the `semantic_search` method](#using-the-semantic_search-method)


In [1]:
# import utilities
import sys 
import json
sys.path.append('../../../')
import importlib
reset = importlib.import_module("utilities.reset")
reset_pipeline = reset.reset_pipeline

# load secrets from a .env file using python-dotenv
from dotenv import load_dotenv
import os
load_dotenv("../../.env")
MY_API_KEY = os.getenv('MY_API_KEY')
MY_API_URL = os.getenv('MY_API_URL')

# import krixik and initialize it with your personal secrets
from krixik import krixik
krixik.init(api_key = MY_API_KEY, 
            api_url = MY_API_URL)

SUCCESS: You are now authenticated.
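
If the `.env` file is missing or a variable is misnamed, `os.getenv` silently returns `None`, and the failure only surfaces later at authentication. The sketch below shows one way to fail fast before calling `krixik.init`. The helper `require_env` is hypothetical (it is not part of krixik or python-dotenv), and the values passed in the example are placeholders.

```python
import os


def require_env(names, environ=None):
    """Return the requested variables as a dict, raising if any are unset or empty.

    Hypothetical helper for illustration; not part of krixik or python-dotenv.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in names}


# Example using an explicit mapping in place of os.environ (placeholder values).
secrets = require_env(
    ["MY_API_KEY", "MY_API_URL"],
    environ={"MY_API_KEY": "example-key", "MY_API_URL": "https://example.invalid"},
)
print(sorted(secrets))
```

In the notebook itself, you would call `require_env(["MY_API_KEY", "MY_API_URL"])` after `load_dotenv` so that it reads from the real environment.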


## Pipeline setup

Below we set up a pipeline consisting of the [`parser`](../../modules/parser.md), [`text-embedder`](../../modules/text-embedder.md), and [`vector-db`](../../modules/vector-db.md) modules. This lets us process input files and make them semantically searchable.

We do this by passing the module names to the `module_chain` argument of [`create_pipeline`](../../system/create_save_load.md) along with a name for our pipeline.

In [2]:
# create a multi-module pipeline
pipeline = krixik.create_pipeline(name="examples-text-search-semantic-pipeline",
                                  module_chain=["parser", "text-embedder", "vector-db"])

The modeling options and parameters available to this pipeline are stored in your [pipeline's configuration](../../system/create_save_load.md).

In [3]:
# delete all processed datapoints belonging to this pipeline
reset_pipeline(pipeline)

## Processing an input file

Let's take a quick look at a short test file before processing.

In [4]:
# examine contents of input file
test_file = "../../../data/input/1984_short.txt"
with open(test_file, "r") as file:
    print(file.read())
    

It was a bright cold day in April, and the clocks were striking thirteen.
Winston Smith, his chin nuzzled into his breast in an effort to escape the
vile wind, slipped quickly through the glass doors of Victory Mansions,
though not quickly enough to prevent a swirl of gritty dust from entering
along with him.

The hallway smelt of boiled cabbage and old rag mats. At one end of it a
coloured poster, too large for indoor display, had been tacked to the wall.
It depicted simply an enormous face, more than a metre wide: the face of a
man of about forty-five, with a heavy black moustache and ruggedly handsome
features. Winston made for the stairs. It was no use trying the lift. Even
at the best of times it was seldom working, and at present the electric
current was cut off during daylight hours. It was part of the economy drive
in preparation for Hate Week. The flat was seven flights up, and Winston,
who was thirty-nine and had a varicose ulcer above his right ankle, went
slowly, resting se

Below we [process](system/process.md) the input file through our pipeline, using the default model for each of the three modules.

In [5]:
# define path to an input file from examples directory
test_file = "../../../data/input/1984_short.txt"

# process a file through the pipeline
process_output = pipeline.process(local_file_path=test_file,
                                  local_save_directory="../../../data/output",  # save output in the data/output directory
                                  expire_time=60*10,         # set all process data to expire in 10 minutes
                                  wait_for_process=True,     # wait for process to complete before returning control to the IDE
                                  verbose=False)             # set verbosity to False

The output of this process is printed below. Because the final module of this pipeline produces a faiss database file, the `process_output` key in this object is null. The file itself has been saved to the location listed under the `process_output_files` key. The `file_id` of the processed input is used as a filename prefix for the output file.

In [6]:
# nicely print the output of this process
print(json.dumps(process_output, indent=2))

{
  "status_code": 200,
  "pipeline": "examples-text-search-semantic-pipeline",
  "request_id": "262563c3-6841-4aa8-8694-01c056fb4ce8",
  "file_id": "b20fa17b-1df9-4185-94c9-ae17ac187822",
  "message": "SUCCESS - output fetched for file_id b20fa17b-1df9-4185-94c9-ae17ac187822.Output saved to location(s) listed in process_output_files.",
  "process_output": null,
  "process_output_files": [
    "../../../data/output/b20fa17b-1df9-4185-94c9-ae17ac187822.faiss"
  ]
}
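
Before using the result, it can help to check the status code and confirm where the output file was written. The sketch below does this on a plain dictionary shaped like the printed output above. The helper `summarize_process_output` is hypothetical, and the assumption that a non-200 `status_code` indicates failure is ours rather than something stated by the krixik documentation.

```python
import os


def summarize_process_output(output):
    """Return (file_id, output file paths), raising if the status code is not 200.

    Hypothetical helper for illustration; it only reads keys shown in the
    printed output above.
    """
    if output.get("status_code") != 200:
        raise RuntimeError(f"process failed: {output.get('message')}")
    files = output.get("process_output_files") or []
    return output["file_id"], files


# Sample values copied from the printed output above.
sample = {
    "status_code": 200,
    "file_id": "b20fa17b-1df9-4185-94c9-ae17ac187822",
    "process_output": None,
    "process_output_files": [
        "../../../data/output/b20fa17b-1df9-4185-94c9-ae17ac187822.faiss"
    ],
}

file_id, files = summarize_process_output(sample)

# The file_id is used as the filename prefix of each output file.
prefixed = all(os.path.basename(path).startswith(file_id) for path in files)
print(file_id, files, prefixed)
```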


## Using the `semantic_search` method

Krixik's [`semantic_search` method](../../system/semantic_search.md) is a convenience function that both embeds a query and searches with it, so it can only be used with pipelines that contain the `text-embedder` and `vector-db` modules in succession. Our pipeline satisfies this condition, so it has access to the `semantic_search` method.
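
The "in succession" condition can be illustrated with a small check on a module chain. This is only a sketch of the rule as stated above: `supports_semantic_search` is a hypothetical helper, not a krixik function, and krixik performs its own validation internally.

```python
def supports_semantic_search(module_chain):
    """Return True if "text-embedder" is immediately followed by "vector-db".

    Hypothetical helper illustrating the rule described in the text.
    """
    return any(
        first == "text-embedder" and second == "vector-db"
        for first, second in zip(module_chain, module_chain[1:])
    )


# The chain used in this document satisfies the condition.
print(supports_semantic_search(["parser", "text-embedder", "vector-db"]))

# A chain without a vector-db module does not.
print(supports_semantic_search(["parser", "text-embedder"]))
```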

Now we can query our text with natural language as shown below.

In [7]:
# perform semantic_search over the input file
semantic_output = pipeline.semantic_search(
    query="it was cold night", file_ids=[process_output["file_id"]]
)

# nicely print the output of this process
print(json.dumps(semantic_output, indent=2))

{
  "status_code": 200,
  "request_id": "57a797f5-bc77-4774-a2b0-b5f13007b356",
  "message": "Successfully queried 1 user file.",
  "items": [
    {
      "file_id": "b20fa17b-1df9-4185-94c9-ae17ac187822",
      "file_metadata": {
        "file_name": "krixik_generated_file_name_toyzuamynf.txt",
        "symbolic_directory_path": "/etc",
        "file_tags": [],
        "num_vectors": 50,
        "created_at": "2024-05-07 18:30:32",
        "last_updated": "2024-05-07 18:30:32"
      },
      "search_results": [
        {
          "snippet": "Outside, even through the shut window-pane, the world looked cold.",
          "line_numbers": [
            32,
            33
          ],
          "distance": 0.232
        },
        {
          "snippet": "It was a bright cold day in April, and the clocks were striking thirteen.",
          "line_numbers": [
            1
          ],
          "distance": 0.236
        },
        {
          "snippet": "His hair was very fair, his face\nna
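
The results nest `search_results` inside each entry of `items`, so a small helper can flatten them into a ranked list. The sketch below assumes that a smaller `distance` means a closer match, which is consistent with the ordering of the printed output but is our reading rather than a documented guarantee. The helper `top_snippets` is hypothetical, and the sample data reuses the two complete results shown above.

```python
def top_snippets(search_output, limit=3):
    """Flatten all search results and return up to `limit` rows,
    ordered by ascending distance, as (distance, file_id, snippet) tuples.

    Hypothetical helper; it only reads keys visible in the printed output.
    """
    rows = []
    for item in search_output.get("items", []):
        for result in item.get("search_results", []):
            rows.append((result["distance"], item["file_id"], result["snippet"]))
    rows.sort(key=lambda row: row[0])
    return rows[:limit]


# Sample data copied from the two complete results in the printed output.
sample = {
    "items": [
        {
            "file_id": "b20fa17b-1df9-4185-94c9-ae17ac187822",
            "search_results": [
                {
                    "snippet": "It was a bright cold day in April, and the clocks were striking thirteen.",
                    "line_numbers": [1],
                    "distance": 0.236,
                },
                {
                    "snippet": "Outside, even through the shut window-pane, the world looked cold.",
                    "line_numbers": [32, 33],
                    "distance": 0.232,
                },
            ],
        }
    ]
}

ranked = top_snippets(sample, limit=2)
for distance, file_id, snippet in ranked:
    print(f"{distance:.3f}  {snippet}")
```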

In [8]:
# delete all processed datapoints belonging to this pipeline
reset_pipeline(pipeline)