# Example summary

Description: make your documents searchable in any language

Business use cases:

- make documents searchable (in any language)
- create searchable summaries of documents (in any language)

Pipelines described:

- parser --> text-embedder --> vector-search

- translate --> json-to-txt --> parser --> text-embedder --> vector-search




# Code walkthrough

### boilerplate setup

In [1]:
import sys 
sys.path.append('..')

In [4]:

from dotenv import load_dotenv
import os
load_dotenv()

TEST_DUMMY_API_KEY = os.getenv('TEST_DUMMY_API_KEY_DEV')
TEST_DUMMY_API_URL = os.getenv('TEST_DUMMY_API_URL_DEV')

from krixik import krixik
krixik.init(api_key = TEST_DUMMY_API_KEY, 
            api_url = TEST_DUMMY_API_URL)

import json
def json_print(data):
    print(json.dumps(data, indent=2))
    
# define directory for input files 
input_directory = 'input_files/'

# define directory for output files
output_directory = 'output_files'

# define directory for pipeline_configs
pipeline_configs_directory = 'pipeline_configs'

%load_ext autoreload
%autoreload 2 

SUCCESS: You are now authenticated.
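
If either variable is missing from the `.env` file, `os.getenv` returns `None`, and the problem may only surface later during authentication. A minimal sketch of an early check is shown here. `require_env` and the variable name `EXAMPLE_API_KEY_DEV` are hypothetical, introduced only for illustration; they are not part of krixik or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"environment variable {name!r} is not set; check your .env file")
    return value


# demo with a throwaway variable so the sketch runs without a real .env file
os.environ["EXAMPLE_API_KEY_DEV"] = "dummy-key"
print(require_env("EXAMPLE_API_KEY_DEV"))
```

Calling a helper like this right after `load_dotenv()` would stop the notebook at the setup cell rather than at the first request.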


In [2]:
from krixik.pipeline_builder.module import Module
from krixik.pipeline_builder.pipeline import CreatePipeline

# select modules
module_1 = Module(name="parser")

# create custom pipeline object
custom = CreatePipeline(name='parser-pipeline-1', 
                        module_chain=[module_1])

# pass the custom object to the krixik operator (note you can also do this by passing its config)
pipeline = krixik.load_pipeline(pipeline=custom)

In [4]:
# define path to an input file
test_file = "1984_full.txt"

# process the file with the parser pipeline
output = pipeline.process(local_file_path = input_directory + test_file,
                          expire_time=60*5)

INFO: hydrated input modules: {'parser': {'model': 'sentence', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_file_name_iwxijryfod.txt
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 300 seconds, at Thu Apr 25 17:53:30 2024 UTC
INFO: parser-pipeline-1 file process and input processing started...
INFO: metadata can be updated using the .update api.
INFO: This process's request_id is: bc3e5081-02ea-039c-e776-f4bbd163cffe
INFO: File process and processing status:
SUCCESS: module 1 (of 1) - parser processing complete.
SUCCESS: pipeline process complete.
SUCCESS: process output downloaded


In [5]:
from krixik.pipeline_builder.module import Module
from krixik.pipeline_builder.pipeline import CreatePipeline

# select modules
module_1 = Module(name="transcribe")
module_2 = Module(name="translate")

# create custom pipeline object
custom = CreatePipeline(name='translate-pipeline-1', 
                        module_chain=[module_1, module_2])

# pass the custom object to the krixik operator (note you can also do this by passing its config)
pipeline = krixik.load_pipeline(pipeline=custom)

In [6]:
# define path to an input file
test_file = 'Interesting Facts About Colombia.mp4'

# transcribe the audio, then translate the transcript
output = pipeline.process(local_file_path = input_directory + test_file,
                          file_name = test_file,
                          expire_time=60*5)

INFO: Checking that file size falls within acceptable parameters...
INFO:...success!
converted input_files/Interesting Facts About Colombia.mp4 to: /var/folders/k9/0vtmhf0s5h56gt15mkf07b1r0000gn/T/tmpowejric2/krixik_converted_version_Interesting Facts About Colombia.mp3
INFO: hydrated input modules: {'transcribe': {'model': 'whisper-tiny', 'params': {}}, 'translate': {'model': 'opus-mt-en-es', 'params': {}}}
INFO: lower casing file_name Interesting Facts About Colombia.mp4 to interesting facts about colombia.mp4
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 300 seconds, at Thu Apr 25 11:10:43 2024 UTC
INFO: translate-pipeline-1 file process and input processing started...
INFO: metadata can be updated using the .update api.
INFO: This process's request_id is: 277349ac-f13a-9d04-5bed-efb5f4f69dd0
INFO: File process and processing status:
SUCCESS: module 1 (of

In [1]:
import sys 
sys.path.append('..')

from dotenv import load_dotenv
import os
load_dotenv()

TEST_DUMMY_API_KEY = os.getenv('TEST_DUMMY_API_KEY_DEV')
TEST_DUMMY_API_URL = os.getenv('TEST_DUMMY_API_URL_DEV')

from krixik import krixik
krixik.init(api_key = TEST_DUMMY_API_KEY, 
            api_url = TEST_DUMMY_API_URL)

import json
def json_print(data):
    print(json.dumps(data, indent=2))
    
# define directory for input files 
input_directory = 'input_data/'

# define directory for output files
output_directory = 'output_data'

# define directory for pipeline_configs
pipeline_configs_directory = 'pipeline_configs'

%load_ext autoreload
%autoreload 2 

SUCCESS: You are now authenticated.


### vector search pipeline

Start by building a pipeline for semantic search in English.

It consists of three modules: a parser, a text embedder, and a vector search module.

In [2]:
from krixik.pipeline_builder.module import Module
from krixik.pipeline_builder.pipeline import CreatePipeline

# select modules
module_1 = Module(module_type="parser")
module_2 = Module(module_type="text-embedder")
module_3 = Module(module_type="vector-db")

# create custom pipeline object
custom = CreatePipeline(name='vector-pipeline-1', 
                        module_chain=[module_1, module_2, module_3])

# pass the custom object to the krixik operator (note you can also do this by passing its config)
pipeline = krixik.load_pipeline(pipeline=custom)

This pipeline can be used in many ways, since most modules expose options (a model and its parameters) that you can set when processing a file.
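
The `INFO: hydrated input modules` log lines in this walkthrough show the shape of these options: each module name maps to a `model` and a `params` dictionary. The sketch below illustrates how user overrides could be layered over defaults. `hydrate_modules` is a hypothetical helper written for this example (krixik does its own hydration internally, and may do it differently); the default values are copied from the log output in this document.

```python
from copy import deepcopy

# defaults copied from the "hydrated input modules" log line in this walkthrough
DEFAULTS = {
    "parser": {"model": "sentence", "params": {}},
    "text-embedder": {"model": "multi-qa-MiniLM-L6-cos-v1", "params": {"quantize": True}},
    "vector-db": {"model": "faiss", "params": {}},
}


def hydrate_modules(defaults: dict, overrides: dict) -> dict:
    """Hypothetical helper: overlay user-chosen models and params on the defaults."""
    hydrated = deepcopy(defaults)
    for module_name, options in overrides.items():
        if module_name not in hydrated:
            raise KeyError(f"unknown module: {module_name}")
        if "model" in options:
            hydrated[module_name]["model"] = options["model"]
        hydrated[module_name]["params"].update(options.get("params", {}))
    return hydrated


# switch the parser to fixed-size chunks, leaving the other modules untouched
modules = hydrate_modules(
    DEFAULTS,
    {"parser": {"model": "fixed", "params": {"chunk_size": 10, "overlap_size": 2}}},
)
print(modules["parser"])
```

A dictionary of this shape is what the `modules` argument of `pipeline.process` receives in the translation example later in this walkthrough.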

With the pipeline defined, we can process a file for search.

Here we keep the verbose logging on and wait for processing to finish before moving on.

In [3]:
# define path to an input file
test_file = "chapter_1_short.pdf"

# process for search
output = pipeline.process(local_file_path = input_directory + test_file,
                          expire_time=60*5)

Ignoring wrong pointing object 9 0 (offset 0)

INFO: Converting pdf to text...
SUCCESS: File conversion complete with pydf, result saved to: /var/folders/k9/0vtmhf0s5h56gt15mkf07b1r0000gn/T/tmpf7iy49ho/krixik_converted_version_chapter_1_short.txt
INFO: Checking that file size falls within acceptable parameters...
INFO:...success!
converted input_data/chapter_1_short.pdf to: /var/folders/k9/0vtmhf0s5h56gt15mkf07b1r0000gn/T/tmpf7iy49ho/krixik_converted_version_chapter_1_short.txt
INFO: hydrated input modules: {'parser': {'model': 'sentence', 'params': {}}, 'text-embedder': {'model': 'multi-qa-MiniLM-L6-cos-v1', 'params': {'quantize': True}}, 'vector-db': {'model': 'faiss', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_file_name_drkstwiacu.txt
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 300 seconds, at Sun Apr 28 13:56:45 2024 UTC
INFO: vector-pipeline-

The output contains metadata about the file we just processed, including its `file_id`.

In [4]:
json_print(output)

{
  "status_code": 200,
  "pipeline": "vector-pipeline-1",
  "request_id": "7d0b4339-7b45-43d8-9190-6bbc1437a921",
  "file_id": "e014905c-55e8-404a-96e2-f39763acb1c2",
  "message": "SUCCESS - output fetched for file_id e014905c-55e8-404a-96e2-f39763acb1c2.Output saved to location(s) listed in process_output_files.",
  "process_output": null,
  "process_output_files": [
    "/Users/jeremywatt/Desktop/krixik-cli/examples/e014905c-55e8-404a-96e2-f39763acb1c2.faiss"
  ]
}
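
Since later calls are keyed on `file_id`, it can help to pull out the fields you need and fail early on a bad status. The following is a small sketch over an abbreviated copy of the output above; `summarize_output` is a hypothetical helper, not a krixik function.

```python
from pathlib import PurePosixPath

# abbreviated copy of the process output printed above
output = {
    "status_code": 200,
    "pipeline": "vector-pipeline-1",
    "file_id": "e014905c-55e8-404a-96e2-f39763acb1c2",
    "process_output": None,
    "process_output_files": [
        "/Users/jeremywatt/Desktop/krixik-cli/examples/e014905c-55e8-404a-96e2-f39763acb1c2.faiss"
    ],
}


def summarize_output(output: dict) -> tuple:
    """Hypothetical helper: return (file_id, output file names) or raise on failure."""
    if output.get("status_code") != 200:
        raise RuntimeError(f"processing failed: {output.get('message')}")
    file_names = [PurePosixPath(p).name for p in output["process_output_files"]]
    return output["file_id"], file_names


file_id, file_names = summarize_output(output)
print(file_id, file_names)
```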


With the file processed we can now search it semantically.

In [5]:
# search the file we just processed semantically
pipeline.semantic_search(query="one with the machines",
                         file_ids=[output["file_id"]])

{'status_code': 200,
 'request_id': 'de72c9c2-c28c-4645-b83e-bae604940c71',
 'message': 'Successfully queried 1 user file.',
 'items': [{'file_id': 'e014905c-55e8-404a-96e2-f39763acb1c2',
   'file_metadata': {'file_name': 'krixik_generated_file_name_drkstwiacu.txt',
    'symbolic_directory_path': '/etc',
    'file_tags': [],
    'num_vectors': 13,
    'created_at': '2024-04-28 20:51:48',
    'last_updated': '2024-04-28 20:51:48'},
   'search_results': [{'snippet': 'While still a young dis-\ncipline with much more awaiting discovery than is currently known, today\nmachine learning can be used to teach computers to perform a wide array\nof useful tasks including automatic detection of objects in images (a crucial\ncomponent of driver-assisted and self-driving cars), speech recognition (which\npowers voice command technology), knowledge discovery in the medical sci-\nences (used to improve our understanding of complex diseases), and predictive\nanalytics (leveraged for sales and economic 
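
To make the idea behind this search concrete, here is a toy sketch of "embed, then rank by distance". It is not how the pipeline works internally: the log above shows a neural embedder (`multi-qa-MiniLM-L6-cos-v1`) and a `faiss` index, whereas this sketch swaps in plain word counts and a brute-force cosine distance. All function names are made up for the example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lower-cased word counts (a stand-in for a neural text embedder)."""
    return Counter(text.lower().split())


def cosine_distance(a: Counter, b: Counter) -> float:
    """Return 1 - cosine similarity; 0 means same direction, 1 means no shared words."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def search(query: str, snippets: list, k: int = 2) -> list:
    """Return the k snippets closest to the query."""
    query_vector = embed(query)
    ranked = sorted(snippets, key=lambda s: cosine_distance(query_vector, embed(s)))
    return ranked[:k]


snippets = [
    "machine learning can teach computers useful tasks",
    "the hallway smelt of boiled cabbage",
    "speech recognition powers voice command technology",
]
print(search("machine learning tasks", snippets, k=1))
```

Unlike this toy, a neural embedder can place a query near a snippet that shares its meaning but none of its exact words, which is the point of semantic search.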

We can also process files without the verbose output and without waiting for each one to finish.

Setting a `file_name` makes the search results easier to read.

In [10]:
# process more files but don't wait 
more_test_files = ["1984_full.txt", "slides.pptx"]

for test_file in more_test_files:
    output = pipeline.process(local_file_path = input_directory + test_file,
                              file_name=test_file,
                              expire_time=60*5,
                              wait_for_process=False,
                              verbose=False)
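
With `wait_for_process=False` the call returns before processing has finished, so a search issued straight away may not yet cover the new files. The sketch below shows a generic polling loop. `wait_until` and `fake_is_done` are hypothetical: `fake_is_done` stands in for whatever status check your krixik version provides, which this walkthrough does not show.

```python
import time


def wait_until(check, timeout_s: float = 5.0, interval_s: float = 0.1) -> bool:
    """Call check() repeatedly until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval_s)
    return False


# stand-in for a real status lookup: reports "done" from the third call onward
calls = {"n": 0}


def fake_is_done() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


print(wait_until(fake_is_done, timeout_s=2.0, interval_s=0.01))
```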

In [12]:
output = pipeline.vector_search(query = 'he loves the machine',
                                symbolic_directory_paths=['/*'],
                                k=2)

json_print(output)


{
  "status_code": 200,
  "request_id": "fa458d98-c160-4be7-a581-a314822e0061",
  "message": "Successfully queried 3 user files.",
  "items": [
    {
      "file_id": "899235b5-7202-4e43-90dd-d26601911efa",
      "file_metadata": {
        "file_name": "slides.pptx",
        "symbolic_directory_path": "/etc",
        "file_tags": [],
        "num_vectors": 16,
        "created_at": "2024-04-18 19:52:01",
        "last_updated": "2024-04-18 19:52:01"
      },
      "search_results": [
        {
          "snippet": "a sample slidewith some text in itMancala infra overviewCost, Problems,",
          "line_numbers": [
            1
          ],
          "distance": 0.483
        },
        {
          "snippet": "negligible data transfer in / outapproximate base user - $0.50DynamoDB",
          "line_numbers": [
            1
          ],
          "distance": 0.495
        }
      ]
    },
    {
      "file_id": "05e1792b-e6b5-4beb-91fb-3593d350b2b6",
      "file_metadata": {
        "f

### translated semantic search

What if our users want to search the text in another language?

Let's insert a `translate` module at the start of the pipeline so that users can query in Spanish instead of English.

In [2]:
from krixik.pipeline_builder.module import Module
from krixik.pipeline_builder.pipeline import CreatePipeline

# select modules
module_1 = Module(name="translate")
module_2 = Module(name="json-to-txt")
module_3 = Module(name="parser")
module_4 = Module(name="text-embedder")
module_5 = Module(name="vector-search")

# create custom pipeline object
custom = CreatePipeline(name='vector-pipeline-2', 
                        module_chain=[module_1, module_2, module_3, module_4, module_5])

# pass the custom object to the krixik operator (note you can also do this by passing its config)
pipeline = krixik.load_pipeline(pipeline=custom)

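In [5]:
# preview of sent_text, the sentence list built in the tokenization cell below
# (the notebook cells were executed out of order)
sent_text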


['It was a bright cold day in April, and the clocks were striking thirteen.',
 'Winston Smith, his chin nuzzled into his breast in an effort to escape the\nvile wind, slipped quickly through the glass doors of Victory Mansions,\nthough not quickly enough to prevent a swirl of gritty dust from entering\nalong with him.',
 'The hallway smelt of boiled cabbage and old rag mats.',
 'At one end of it a\ncoloured poster, too large for indoor display, had been tacked to the wall.',
 'It depicted simply an enormous face, more than a metre wide: the face of a\nman of about forty-five, with a heavy black moustache and ruggedly handsome\nfeatures.',
 'Winston made for the stairs.',
 'It was no use trying the lift.',
 'Even\nat the best of times it was seldom working, and at present the electric\ncurrent was cut off during daylight hours.',
 'It was part of the economy drive\nin preparation for Hate Week.',
 'The flat was seven flights up, and Winston,\nwho was thirty-nine and had a varicose ulcer

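In [8]:
# note: sent_tokenize needs NLTK tokenizer data; if it is missing,
# nltk raises a LookupError that names the resource to download
import nltk

test_file = "1984_short.txt"

# local workaround: split the text into sentences before upload,
# because this pipeline has no sentence parser ahead of the translate module
with open(input_directory + test_file, "r") as readfile:
    text = readfile.read()

sent_text = nltk.sent_tokenize(text)

# write one {"snippet": sentence} object per sentence
with open(input_directory + "1984_short.json", "w") as outfile:
    json.dump([{"snippet": t} for t in sent_text], outfile)
# end local workaround

# process the JSON file, selecting the English-to-Spanish translation model
output = pipeline.process(local_file_path = input_directory + "1984_short.json",
                          file_name="1984_short.json",
                          expire_time=60*10,
                          modules={"translate":{"model":"opus-mt-en-es"}})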

INFO: hydrated input modules: {'translate': {'model': 'opus-mt-en-es', 'params': {}}, 'json-to-txt': {'model': 'base', 'params': {}}, 'parser': {'model': 'fixed', 'params': {'chunk_size': 10, 'overlap_size': 2}}, 'text-embedder': {'model': 'multi-qa-MiniLM-L6-cos-v1', 'params': {'quantize': True}}, 'vector-search': {'model': 'faiss', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 600 seconds, at Thu Apr 18 13:36:58 2024 UTC
INFO: vector-pipeline-2 file process and input processing started...
INFO: metadata can be updated using the .update api.
INFO: This file's process_id is: 70928fc8-248b-f2d9-beb4-0888e3c58d1b
INFO: File process and processing status:
SUCCESS: module 1 (of 5) - translate processing complete.
SUCCESS: module 2 (of 5) - json-to-txt processing complete.
SUCCESS: module 3 (of 5) - parser processing complete.
SUCCESS: module 4 (of
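
The hydrated modules log above shows the parser running its `fixed` model with `chunk_size` 10 and `overlap_size` 2. The sketch below shows one plausible reading of those parameters. It assumes that chunks are counted in whitespace-separated words and that `overlap_size` is the number of words shared by consecutive chunks; krixik's actual parser may behave differently, and `fixed_chunks` is a name invented for this example.

```python
def fixed_chunks(words: list, chunk_size: int = 10, overlap_size: int = 2) -> list:
    """Split a word list into fixed-size chunks that share overlap_size words."""
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap_size
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


words = [f"w{i}" for i in range(18)]
for chunk in fixed_chunks(words):
    print(" ".join(chunk))
```

Under these assumptions every snippet except possibly the last holds exactly ten words, and the last two words of one chunk reappear as the first two of the next.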

In [6]:
pipeline.vector_search(query = 'te está mirando',
                        symbolic_directory_paths=['/*'],
                        k=2)

{'status_code': 200,
 'request_id': 'a7bb6fbc-fde9-4201-b86d-7596cf539156',
 'message': 'Successfully queried 1 user file.',
 'items': [{'file_id': '2a849b9c-6b06-476a-853d-fd74c57aadde',
   'file_metadata': {'file_name': '1984_short.json',
    'symbolic_directory_path': '/etc',
    'file_tags': [],
    'num_vectors': 128,
    'created_at': '2024-04-18 20:26:58',
    'last_updated': '2024-04-18 20:26:58'},
   'search_results': [{'snippet': 'hermano grande te est mirando, el pie de foto debajo',
     'line_numbers': [13],
     'distance': 0.24},
    {'snippet': 'todo el mundo todo el tiempo. Pero, en cualquier caso,',
     'line_numbers': [35, 36],
     'distance': 0.241}]}]}
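
Search responses nest hits under each file, which is awkward when you only want the best matches overall. The sketch below flattens a response of this shape and sorts it, assuming that a smaller `distance` means a closer match. The response is an abbreviated copy of the one printed above, and `flatten_hits` is a hypothetical helper.

```python
# abbreviated copy of the search response printed above
response = {
    "status_code": 200,
    "items": [
        {
            "file_id": "2a849b9c-6b06-476a-853d-fd74c57aadde",
            "file_metadata": {"file_name": "1984_short.json"},
            "search_results": [
                {"snippet": "hermano grande te est mirando, el pie de foto debajo",
                 "line_numbers": [13], "distance": 0.24},
                {"snippet": "todo el mundo todo el tiempo. Pero, en cualquier caso,",
                 "line_numbers": [35, 36], "distance": 0.241},
            ],
        }
    ],
}


def flatten_hits(response: dict) -> list:
    """Hypothetical helper: one row per hit across all files, closest first."""
    hits = [
        {
            "file_name": item["file_metadata"]["file_name"],
            "distance": hit["distance"],
            "line_numbers": hit["line_numbers"],
            "snippet": hit["snippet"],
        }
        for item in response["items"]
        for hit in item["search_results"]
    ]
    return sorted(hits, key=lambda h: h["distance"])


for hit in flatten_hits(response):
    print(f'{hit["distance"]:.3f}  {hit["file_name"]}  {hit["snippet"]}')
```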

# Extra

### keyword search

We can also process text for keyword search by using the `keyword-search` module.

In [23]:
from krixik.pipeline_builder.module import Module
from krixik.pipeline_builder.pipeline import CreatePipeline

# select modules
module_1 = Module(name="keyword-search")

# create custom pipeline object
custom = CreatePipeline(name='simple-keyword-search', 
                        module_chain=[module_1])

# pass the custom object to the krixik operator (note you can also do this by passing its config)
pipeline = krixik.load_pipeline(pipeline=custom)

In [24]:
# define path to an input file
test_file = "chapter_1_short.pdf"

# process for search
output = pipeline.process(local_file_path = input_directory + test_file,
                          file_name = test_file,
                          expire_time=60*5)

INFO: Converting pdf to text...
SUCCESS: File conversion complete with pydf, result saved to: /var/folders/k9/0vtmhf0s5h56gt15mkf07b1r0000gn/T/tmp0t_ew036/krixik_converted_version_chapter_1_short.txt
INFO: Checking that file size falls within acceptable parameters...
INFO:...success!
converted input_files/chapter_1_short.pdf to: /var/folders/k9/0vtmhf0s5h56gt15mkf07b1r0000gn/T/tmp0t_ew036/krixik_converted_version_chapter_1_short.txt
INFO: hydrated input modules: {'keyword-search': {'model': 'base', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 300 seconds, at Thu Apr 18 11:21:35 2024 UTC
INFO: simple-keyword-search file process and input processing started...
INFO: metadata can be updated using the .update api.
INFO: This file's process_id is: 3410d296-9e26-9fd4-627c-5ec5269340df
INFO: File process and processing status:
SUCCESS: module 1 (of 
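
The keyword search results later in this section report a `line_number` and a `keyword_number` for each match. The toy sketch below shows one way such results could be produced. It assumes that both numbers are 1-based, that `keyword_number` is the position of the word within its line, and that common words such as "and" are ignored. These are assumptions for illustration, not a description of the `keyword-search` module, and `STOP_WORDS` is a made-up list.

```python
STOP_WORDS = {"and", "the", "a", "of"}  # tiny illustrative list, not krixik's


def keyword_search(query: str, lines: list) -> list:
    """Return one hit per occurrence of a non-stop-word query term."""
    keywords = {w for w in query.lower().split() if w not in STOP_WORDS}
    hits = []
    for line_number, line in enumerate(lines, start=1):
        for keyword_number, word in enumerate(line.lower().split(), start=1):
            token = word.strip(".,;:!?\"'()")  # ignore surrounding punctuation
            if token in keywords:
                hits.append({
                    "keyword": token,
                    "line_number": line_number,
                    "keyword_number": keyword_number,
                })
    return hits


lines = ["telling cats apart from dogs", "nothing relevant here"]
for hit in keyword_search("cats and dogs", lines):
    print(hit)
```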

In [26]:
output = pipeline.keyword_search(symbolic_directory_paths=["/*"], 
                                    query="cats and dogs")
json_print(output)

{
  "status_code": 200,
  "request_id": "11c927d4-16da-421a-bef0-6e3d4c8e5f1b",
  "message": "Successfully queried 1 user file.",
    {
        "and"
      ]
    }
  ],
  "items": [
    {
      "file_id": "4666a8ae-ca8b-4934-b6d3-489d7f27b318",
      "file_metadata": {
        "file_name": "chapter_1_short.pdf",
        "symbolic_directory_path": "/etc",
        "file_tags": [],
        "num_lines": 32,
        "created_at": "2024-04-18 18:16:37",
        "last_updated": "2024-04-18 18:16:37"
      },
      "search_results": [
        {
          "keyword": "cats",
          "line_number": 16,
          "keyword_number": 3
        },
        {
          "keyword": "dogs",
          "line_number": 16,
          "keyword_number": 5
        },
        {
          "keyword": "cats",
          "line_number": 20,
          "keyword_number": 3
        },
        {
          "keyword": "dogs",
          "line_number": 20,
          "keyword_number": 7
        },
        {
          "keyword": 