In [1]:
import sys 
sys.path.append('..')
from dotenv import load_dotenv
import os
load_dotenv()

TEST_DUMMY_API_KEY = os.getenv('TEST_DUMMY_API_KEY_STA')
TEST_DUMMY_API_URL = os.getenv('TEST_DUMMY_API_URL_STA')

from krixik import krixik
krixik.init(api_key = TEST_DUMMY_API_KEY, 
            api_url = TEST_DUMMY_API_URL)

import json
def json_print(data):
    print(json.dumps(data, indent=2))

%load_ext autoreload
%autoreload 2 

SUCCESS: You are now authenticated.


## Some minor notes before starting

Because of the flexibility we've added with true modularity, we need to be sticklers about input format.

We'll talk more about this later.  But for now - if you want to input a `json` of your own design, first study how the examples look in the `input_files` directory.  Copy that pattern or your input won't upload.

Also - this is a playground version.  Not all tests have been carried over to our new modular system yet.  So you break it, you buy it.

To make usage easier, define

- an input path to your test data files
- a `local_save_directory` for your output files
- a directory for your pipeline configs

Let's do this below.

In [2]:
# define directory for input files 
input_directory = 'input_files'

# define directory for output files
output_directory = 'output_files'

# define directory for pipeline_configs
pipeline_configs_directory = 'pipeline_configs'
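If any of these directories is missing on your machine, a small helper like the one below can create them first. This is only a convenience sketch: `ensure_directories` is a hypothetical name, not part of krixik, and it uses nothing beyond the standard library.

```python
from pathlib import Path


def ensure_directories(*names, root="."):
    """Create each named directory under root if it is missing; return the paths."""
    paths = [Path(root) / name for name in names]
    for path in paths:
        # exist_ok=True makes the call safe to repeat on every notebook run
        path.mkdir(parents=True, exist_ok=True)
    return paths


# Example call with the directory names defined above (not executed here):
# ensure_directories("input_files", "output_files", "pipeline_configs")
```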

## Vector search

In [3]:
from krixik.pipeline_builder.module import Module
from krixik.pipeline_builder.pipeline import CreatePipeline

# create modules for text (vector) search
module_1 = Module(name="parser")
module_2 = Module(name="text-embedder")
module_3 = Module(name="vector-search")

# create your custom pipeline
custom = CreatePipeline(name='vector-pipeline-1', 
                        module_chain=[module_1, module_2, module_3])

# pass the custom object to the krixik operator (note you can also do this by passing its config - we'll do below)
pipeline = krixik.load_pipeline(pipeline=custom)

# define a test file in your input_files directory
test_file = "chapter_1_short.pdf"

# process the file
output = pipeline.process(local_file_path = input_directory + "/" + test_file,
                             expire_time=60*3, # have the output expire in 3 minutes
                             modules={},  # purposefully placing modules={}, they are hydrated in as necessary, see printout 
                             local_save_directory=output_directory)  # save the output to the output directory

INFO: Converting pdf to text...
SUCCESS: File conversion complete with pydf, result saved to: /var/folders/k9/0vtmhf0s5h56gt15mkf07b1r0000gn/T/tmpxawcel9c/krixik_converted_version_chapter_1_short.txt
INFO: Checking that file size falls within acceptable parameters...
INFO:...success!
converted input_files/chapter_1_short.pdf to: /var/folders/k9/0vtmhf0s5h56gt15mkf07b1r0000gn/T/tmpxawcel9c/krixik_converted_version_chapter_1_short.txt
INFO: hydrated input modules: {'parser': {'model': 'fixed', 'params': {'chunk_size': 10, 'overlap_size': 2}}, 'text-embedder': {'model': 'multi-qa-MiniLM-L6-cos-v1', 'params': {'quantize': True}}, 'vector-search': {'model': 'faiss', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_qmnoqbmjre.txt
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 180 seconds, at Tue Apr 16 05:27:18 2024

Notice the last line 

```SUCCESS: process output downloaded```

What's the "process output" of the pipeline?  It's not json, right - that comes from using `vector_search`.  

It's the vector search database itself.

Now that our pipelines are truly modular, people need to be able to pass input to, and get output from, any module.  More generally - they need to be able to pass input into whichever first module they choose for a pipeline, and retrieve output from its final module.  For all modules.

(We (kinda) did this previously - but only for pipelines whose final module produced `json` output - e.g., transcribe (a one module pipeline).  That was bespoke wiring.  And we didn't download json, we passed it back to the cli via the request.)  

For example - what if the pipeline ends with `text-embedder`?  Well - as you suggested a while back - the user would need to get back the embeddings.  (and then - if you had another pipeline that began with `vector-search` - how would you be able to use it as a single module pipeline if you couldn't upload embeddings to it?).

So in any event - now that our pipelines are truly modular, `.fetch_output` needs to return the output of every possible module.  Now it does.

And by default `.process` calls `.fetch_output` as it finishes (that hasn't changed).


Fine.  Back to our pipeline here.

For our standard vector search pipeline the output of `vector-search` is a `.faiss` database.

If you look in the `output_files` directory you can see the file.  It is named `{file_id}.faiss`.  Likewise, every output is called `{file_id}.{its_extension}`.
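To see which outputs have been downloaded so far, you can list the save directory and split each file name into its stem and extension. The sketch below assumes only the `{file_id}.{its_extension}` naming described above; `list_outputs` is a hypothetical helper, not a krixik function.

```python
from pathlib import Path


def list_outputs(output_directory):
    """Map each file stem (assumed to be a file_id) to the extensions found for it."""
    outputs = {}
    for path in sorted(Path(output_directory).iterdir()):
        if path.is_file():
            outputs.setdefault(path.stem, []).append(path.suffix)
    return outputs


# Example call (not executed here):
# list_outputs("output_files")
```

A `.faiss` entry would then correspond to a vector search pipeline, and other extensions to whatever the final module of the pipeline produces.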


Fine.

In any event - now we can use our `vector_search` api on this pipeline since it ends with the `vector-search` module and we just processed a file through it.


In [5]:
# vector search the file
pipeline.vector_search(query="do you like cats",
                       symbolic_directory_paths = ["/*"])

{'status_code': 200,
 'request_id': 'f4021539-f480-4914-8c50-11a6bcd3a149',
 'message': 'Successfully queried 2 user files.',
 'items': [{'file_id': 'ca71efb3-bce4-4539-be9f-f85bc778561f',
   'file_metadata': {'file_name': 'krixik_generated_cfcxwtdsnp.txt',
    'symbolic_directory_path': '/etc',
    'file_tags': [],
    'num_vectors': 43,
    'created_at': '2024-04-13 21:30:02',
    'last_updated': '2024-04-13 21:30:02'},
   'search_results': [{'snippet': 'cats and dogs? Intuitively, when',
     'line_numbers': [32],
     'distance': 0.235},
    {'snippet': 'etc.) are either cats or dogs, until they fully grasp',
     'line_numbers': [30, 31],
     'distance': 0.247},
    {'snippet': 'of cats from those with dogs . This will allow',
     'line_numbers': [20],
     'distance': 0.261},
    {'snippet': 'a computer how to distinguish between pic- tures of cats',
     'line_numbers': [19, 20],
     'distance': 0.277},
    {'snippet': 'learned about the di ↵erence between cats and dogs, and'

Goody gumdrops.

Now above we built our custom pipeline object (`custom`) and passed it directly to our `krixik` factory operator.

As we discussed, probably best to be able to save that `custom` object as a configuration file for easier re-use.

You can create the same pipeline by first saving the `custom` config, then re-loading it.  I'll show you below.

In [8]:
# save the configuration of our custom pipeline object
custom.save(pipeline_configs_directory + "/" + "vector-pipeline-1.yaml")

# instantiate our krixik factory processor by re-loading the config from file
pipeline = krixik.load_pipeline(config_path=pipeline_configs_directory + "/" + "vector-pipeline-1.yaml")

# from here - same steps as shown above

# define a test file in your input_files directory
test_file = "chapter_1_short.pdf"

# process the file
output = pipeline.process(local_file_path = input_directory + "/" + test_file,
                             expire_time=60*3, # have the output expire in 3 minutes
                             modules={},  # purposefully placing modules={}, they are hydrated in as necessary, see printout 
                             local_save_directory=output_directory)  # save the output to the output directory

INFO: Converting pdf to text...
SUCCESS: File conversion complete with pydf, result saved to: /var/folders/k9/0vtmhf0s5h56gt15mkf07b1r0000gn/T/tmpiltsoj1d/krixik_converted_version_chapter_1_short.txt
INFO: Checking that file size falls within acceptable parameters...
INFO:...success!
converted input_files/chapter_1_short.pdf to: /var/folders/k9/0vtmhf0s5h56gt15mkf07b1r0000gn/T/tmpiltsoj1d/krixik_converted_version_chapter_1_short.txt
INFO: hydrated input modules: {'parser': {'model': 'fixed', 'params': {'chunk_size': 10, 'overlap_size': 2}}, 'text-embedder': {'model': 'multi-qa-MiniLM-L6-cos-v1', 'params': {'quantize': True}}, 'vector-search': {'model': 'faiss', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_nctsdcedmm.txt
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 180 seconds, at Sat Apr 13 14:44:49 2024
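The `hydrated input modules` line in the log shows what passing `modules={}` ends up as: every module gets a default model and default params. The sketch below illustrates that idea in plain Python. It is not krixik's implementation; `hydrate` is a hypothetical helper, the defaults are copied from the log line, and whether `.process` accepts partial overrides like this is an assumption the log does not confirm.

```python
import copy

# Defaults copied from the "hydrated input modules" log line above.
DEFAULTS = {
    "parser": {"model": "fixed", "params": {"chunk_size": 10, "overlap_size": 2}},
    "text-embedder": {"model": "multi-qa-MiniLM-L6-cos-v1", "params": {"quantize": True}},
    "vector-search": {"model": "faiss", "params": {}},
}


def hydrate(modules, defaults=DEFAULTS):
    """Hypothetical helper: fill in any model or param the user left out."""
    hydrated = copy.deepcopy(defaults)  # never mutate the shared defaults
    for name, overrides in modules.items():
        if name not in hydrated:
            raise KeyError(f"unknown module: {name}")
        if "model" in overrides:
            hydrated[name]["model"] = overrides["model"]
        hydrated[name]["params"].update(overrides.get("params", {}))
    return hydrated


# An empty dict hydrates to the defaults; a partial dict only changes what it names.
print(hydrate({"parser": {"params": {"chunk_size": 20}}})["parser"])
```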

In [9]:
# vector search the file
pipeline.vector_search(query="do you like cats",
                       symbolic_directory_paths = ["/*"])

{'status_code': 200,
 'request_id': 'ea9dd67a-bb20-48a8-9968-1927f0974b0d',
 'message': 'Successfully queried 1 user file.',
 'items': [{'file_id': '572185a8-be42-4582-a547-d0222b341871',
   'file_metadata': {'file_name': 'krixik_generated_nctsdcedmm.txt',
    'symbolic_directory_path': '/etc',
    'file_tags': [],
    'num_vectors': 43,
    'created_at': '2024-04-13 21:41:53',
    'last_updated': '2024-04-13 21:41:53'},
   'search_results': [{'snippet': 'cats and dogs? Intuitively, when',
     'line_numbers': [32],
     'distance': 0.235},
    {'snippet': 'etc.) are either cats or dogs, until they fully grasp',
     'line_numbers': [30, 31],
     'distance': 0.247},
    {'snippet': 'of cats from those with dogs . This will allow',
     'line_numbers': [20],
     'distance': 0.261},
    {'snippet': 'a computer how to distinguish between pic- tures of cats',
     'line_numbers': [19, 20],
     'distance': 0.277},
    {'snippet': 'learned about the di ↵erence between cats and dogs, and',
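
The `vector_search` response is a plain dictionary, so pulling out the closest snippets takes only a few lines. The sketch below relies on the keys visible in the output above (`items`, `file_metadata`, `search_results`, `snippet`, `distance`) and assumes that a smaller `distance` means a closer match, which the ordering of the output suggests. `best_matches` and the trimmed `sample` are illustrative, not part of krixik.

```python
def best_matches(response, top_k=3):
    """Return up to top_k (distance, file_name, snippet) tuples, closest first."""
    rows = []
    for item in response.get("items", []):
        file_name = item["file_metadata"]["file_name"]
        for result in item.get("search_results", []):
            rows.append((result["distance"], file_name, result["snippet"]))
    rows.sort(key=lambda row: row[0])  # ascending distance
    return rows[:top_k]


# A trimmed copy of the response shown above, reordered to show the sort.
sample = {
    "status_code": 200,
    "items": [
        {
            "file_metadata": {"file_name": "krixik_generated_nctsdcedmm.txt"},
            "search_results": [
                {"snippet": "of cats from those with dogs . This will allow",
                 "line_numbers": [20], "distance": 0.261},
                {"snippet": "cats and dogs? Intuitively, when",
                 "line_numbers": [32], "distance": 0.235},
            ],
        }
    ],
}

for distance, file_name, snippet in best_matches(sample):
    print(distance, file_name, snippet)
```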

## Keyword search

Like we discussed, keyword search has been spun off into its own module.

If you're uploading a text file, it's now a one-module pipeline.  Let's see it.

In [10]:
from krixik.pipeline_builder.module import Module
from krixik.pipeline_builder.pipeline import CreatePipeline

# create the module for keyword search
module_1 = Module(name="keyword-search")

# create your custom pipeline
custom = CreatePipeline(name='keyword-pipeline-1', 
                        module_chain=[module_1])

# pass the custom object to the krixik operator (note you can also do this by passing its config, as shown in the vector search section)
pipeline = krixik.load_pipeline(pipeline=custom)

# define a test file in your input_files directory
test_file = "chapter_1_short.pdf"

# process the file
output = pipeline.process(local_file_path = input_directory + "/" + test_file,
                             expire_time=60*3, # have the output expire in 3 minutes
                             modules={},  # purposefully placing modules={}, they are hydrated in as necessary, see printout 
                             local_save_directory=output_directory)  # save the output to the output directory

INFO: Converting pdf to text...
SUCCESS: File conversion complete with pydf, result saved to: /var/folders/k9/0vtmhf0s5h56gt15mkf07b1r0000gn/T/tmp5fxec6la/krixik_converted_version_chapter_1_short.txt
INFO: Checking that file size falls within acceptable parameters...
INFO:...success!
converted input_files/chapter_1_short.pdf to: /var/folders/k9/0vtmhf0s5h56gt15mkf07b1r0000gn/T/tmp5fxec6la/krixik_converted_version_chapter_1_short.txt
INFO: hydrated input modules: {'keyword-search': {'model': 'base', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_dzzhhiprkd.txt
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 180 seconds, at Sat Apr 13 14:46:54 2024 UTC
INFO: keyword-pipeline-1 file process and input processing started...
INFO: metadata can be updated using the .update api.
INFO: This file's process_id is: 57ea3

Again notice the 

```SUCCESS: process output downloaded```

That's not a json, that's a SQLite database (extension `.db`).  Look in your `local_save_directory` and you'll see a file called `{file_id}.db`.  That's it.
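
Because the download is an ordinary SQLite file, Python's built-in `sqlite3` module can open it. The table layout krixik writes is not documented here, so the sketch below only lists whatever tables the file contains; `list_tables` is a hypothetical helper name.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in a SQLite database file."""
    connection = sqlite3.connect(db_path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]


# Example call, with the file_id left for you to fill in (not executed here):
# list_tables("output_files/<file_id>.db")
```

Note that `sqlite3.connect` creates an empty database if the path does not exist, so check the file name before drawing conclusions from an empty list.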


Now let's keyword search our file.

In [11]:
# keyword search the file
pipeline.keyword_search(query="do you like cats",
                        symbolic_directory_paths = ["/*"])

{'status_code': 200,
 'request_id': '6e4e62c9-f61a-4e95-9615-bca238ee323e',
 'message': 'Successfully queried 1 user file.',
    'you']}],
 'items': [{'file_id': 'c282820d-0699-4baa-93fe-5bb7324459a9',
   'file_metadata': {'file_name': 'krixik_generated_dzzhhiprkd.txt',
    'symbolic_directory_path': '/etc',
    'file_tags': [],
    'num_lines': 32,
    'created_at': '2024-04-13 21:43:56',
    'last_updated': '2024-04-13 21:43:56'},
   'search_results': [{'keyword': 'cats',
     'line_number': 16,
     'keyword_number': 3},
    {'keyword': 'cats', 'line_number': 20, 'keyword_number': 3},
    {'keyword': 'cats', 'line_number': 23, 'keyword_number': 13},
    {'keyword': 'like', 'line_number': 25, 'keyword_number': 8},
    {'keyword': 'cats', 'line_number': 28, 'keyword_number': 11},
    {'keyword': 'cats', 'line_number': 30, 'keyword_number': 15},
    {'keyword': 'cats', 'line_number': 32, 'keyword_number': 6}]}]}
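
Each keyword hit comes back as its own record, so grouping the hits by keyword makes the response easier to scan. The sketch below uses only the keys shown in the output above (`items`, `search_results`, `keyword`, `line_number`); `lines_by_keyword` is a hypothetical helper and `sample` is a trimmed copy of that output.

```python
from collections import defaultdict


def lines_by_keyword(response):
    """Group the line numbers of every hit under the keyword that matched."""
    grouped = defaultdict(list)
    for item in response.get("items", []):
        for result in item.get("search_results", []):
            grouped[result["keyword"]].append(result["line_number"])
    return dict(grouped)


# Search results copied from the response above.
sample = {
    "items": [
        {
            "search_results": [
                {"keyword": "cats", "line_number": 16, "keyword_number": 3},
                {"keyword": "cats", "line_number": 20, "keyword_number": 3},
                {"keyword": "cats", "line_number": 23, "keyword_number": 13},
                {"keyword": "like", "line_number": 25, "keyword_number": 8},
                {"keyword": "cats", "line_number": 28, "keyword_number": 11},
                {"keyword": "cats", "line_number": 30, "keyword_number": 15},
                {"keyword": "cats", "line_number": 32, "keyword_number": 6},
            ]
        }
    ]
}

print(lines_by_keyword(sample))
```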

A final note.

If your input was a list of dictionaries (a json), you could attach `json-to-txt` to the front of this pipeline to create a keyword search pipeline for your json.

## A few longer examples

Let's do one or two more long ones and then I'll leave you with my current laundry list of multi-module pipelines.

In [3]:
from krixik.pipeline_builder.module import Module
from krixik.pipeline_builder.pipeline import CreatePipeline

# create a few modules
module_1 = Module(name="transcribe")
module_2 = Module(name="translate")
module_3 = Module(name="text-embedder")
module_4 = Module(name="vector-search")

custom = CreatePipeline(name='my-fancy-transcribe-pipeline', 
                               module_chain=[module_1, module_2, module_3, module_4])

# pass the custom object to the krixik operator (note you can also do this by passing its config, as shown in the vector search section)
pipeline = krixik.load_pipeline(pipeline=custom)

# define a test file in your input_files directory
test_file = 'too_long.mp3'

# process the file
output = pipeline.process(local_file_path = input_directory + "/" + test_file,
                             expire_time=60*3, # have the output expire in 3 minutes
                             modules={},  # purposefully placing modules={}, they are hydrated in as necessary, see printout 
                             local_save_directory=output_directory)  # save the output to the output directory

INFO: hydrated input modules: {'transcribe': {'model': 'whisper-tiny', 'params': {}}, 'translate': {'model': 'opus-mt-en-es', 'params': {}}, 'text-embedder': {'model': 'multi-qa-MiniLM-L6-cos-v1', 'params': {'quantize': True}}, 'vector-search': {'model': 'faiss', 'params': {}}}


ValueError: file size is less than 1e-05 megabytes (current minimum size allowable) or greater than 3.000001 megabytes (current maximum size allowable) - input_files/too_long.mp3
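
The error above can be anticipated locally by checking the file size before calling `.process`. The sketch below takes its limits from the error message, which calls them "current", so they may change. How krixik converts bytes to megabytes is not stated, so the factor of 1,000,000 is an assumption, and `size_ok` is a hypothetical helper.

```python
import os

# Limits copied from the ValueError message above (they may change).
MIN_MEGABYTES = 1e-05
MAX_MEGABYTES = 3.000001
# Assumption: 1 megabyte = 10**6 bytes; the source does not say 10**6 vs 2**20.
BYTES_PER_MEGABYTE = 1_000_000


def size_ok(path):
    """Return True if the file size falls within the limits quoted in the error."""
    megabytes = os.path.getsize(path) / BYTES_PER_MEGABYTE
    return MIN_MEGABYTES <= megabytes <= MAX_MEGABYTES


# Example call (not executed here):
# size_ok("input_files/too_long.mp3")
```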

In [13]:
output = pipeline.vector_search(query="ellos importan y tu no", 
                                   symbolic_directory_paths=['/*'])

json_print(output)

{
  "status_code": 200,
  "request_id": "8724acaf-1a98-4ce2-97cd-3d069c9be110",
  "message": "Successfully queried 1 user file.",
  "items": [
    {
      "file_id": "408bf0fe-0d90-4f6e-bd9e-c4cd042526c1",
      "file_metadata": {
        "file_name": "krixik_generated_whdafsdhon.mp3",
        "symbolic_directory_path": "/etc",
        "file_tags": [],
        "num_vectors": 1,
        "created_at": "2024-04-13 21:58:03",
        "last_updated": "2024-04-13 21:58:03"
      },
      "search_results": [
        {
          "snippet": "Cada vez que uso el trmino Latinx hay una rabia palpable que llena mis secciones de comentarios que simplemente no lo entiendo Todava puedes usar la palabra Latino. Todava me identifico como Latino todo el tiempo",
          "line_numbers": [
            1
          ],
          "distance": 0.296
        }
      ]
    }
  ]
}


In [14]:
# one more example of a custom pipeline
module_1 = Module(name="ocr")
module_2 = Module(name="json-to-txt")
module_3 = Module(name="parser")
module_4 = Module(name="text-embedder")
module_5 = Module(name="vector-search")

custom = CreatePipeline(name='my-fancy-ocr-pipeline', 
                               module_chain=[module_1, module_2, module_3, module_4, module_5])

# pass the custom object to the krixik operator (note you can also do this by passing its config, as shown in the vector search section)
pipeline = krixik.load_pipeline(pipeline=custom)

# define a test file in your input_files directory
test_file = 'seal.png'

# process the file
output = pipeline.process(local_file_path = input_directory + "/" + test_file,
                             expire_time=60*3, # have the output expire in 3 minutes
                             modules={},  # purposefully placing modules={}, they are hydrated in as necessary, see printout 
                             local_save_directory=output_directory)  # save the output to the output directory

INFO: hydrated input modules: {'ocr': {'model': 'tesseract-en', 'params': {}}, 'json-to-txt': {'model': 'base', 'params': {}}, 'parser': {'model': 'fixed', 'params': {'chunk_size': 10, 'overlap_size': 2}}, 'text-embedder': {'model': 'multi-qa-MiniLM-L6-cos-v1', 'params': {'quantize': True}}, 'vector-search': {'model': 'faiss', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_ndvtwtdhsk.png
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 180 seconds, at Sat Apr 13 15:03:45 2024 UTC
INFO: my-fancy-ocr-pipeline file process and input processing started...
INFO: metadata can be updated using the .update api.
INFO: This file's process_id is: b8b5cac2-f4b3-c6f0-77ab-39dca2b16262
INFO: File process and processing status:
SUCCESS: module 1 (of 5) - ocr processing complete.
SUCCESS: module 2 (of 5) - json-to-txt process

In [15]:
output = pipeline.vector_search(query="some respite from the temperature", 
                                   symbolic_directory_paths=['/*'])

json_print(output)

{
  "status_code": 200,
  "request_id": "ab63ff1d-2bb3-410e-82fe-67ae20a053f4",
  "message": "Successfully queried 1 user file.",
  "items": [
    {
      "file_id": "8018029e-e6db-4a13-88cc-096763ace678",
      "file_metadata": {
        "file_name": "krixik_generated_ndvtwtdhsk.png",
        "symbolic_directory_path": "/etc",
        "file_tags": [],
        "num_vectors": 14,
        "created_at": "2024-04-13 22:00:48",
        "last_updated": "2024-04-13 22:00:48"
      },
      "search_results": [
        {
          "snippet": "relief from the heat, and at dawn a hot gust",
          "line_numbers": [
            3
          ],
          "distance": 0.311
        },
        {
          "snippet": "hot gust of wind blows across the colorless sea. The",
          "line_numbers": [
            3,
            4
          ],
          "distance": 0.343
        },
        {
          "snippet": "horses stir, stretching their parched muzzles towards the sea. They",
          "line_numbe

## Some more examples

Here are some more examples I've tested manually.

In [None]:
multi_module_pipeline_examples = [
    {
        "name": "caption-keyword-search",
        "module_chain": ["caption", "json-to-txt", "keyword-search"],
    },
    {
        "name": "caption-vector-search",
        "module_chain": [
            "caption",
            "json-to-txt",
            "parser",
            "text-embedder",
            "vector-search",
        ],
    },
    {"name": "txt-keyword-search", "module_chain": ["json-to-txt", "keyword-search"]},
    {
        "name": "ocr-vector-search",
        "module_chain": [
            "ocr",
            "json-to-txt",
            "parser",
            "text-embedder",
            "vector-search",
        ],
    },
    {
        "name": "ocr-keyword-search",
        "module_chain": ["ocr", "json-to-txt", "keyword-search"],
    },
    {
        "name": "ocr-sentiment",
        "module_chain": ["ocr", "json-to-txt", "parser", "sentiment"],
    },
    {
        "name": "standard-vector-search",
        "module_chain": ["parser", "text-embedder", "vector-search"],
    },
    {"name": "summarize-sentiment", "module_chain": ["summarize", "sentiment"]},
    {
        "name": "summarize-vector-search",
        "module_chain": [
            "summarize",
            "json-to-txt",
            "parser",
            "text-embedder",
            "vector-search",
        ],
    },
    {
        "name": "summarize-keyword-search",
        "module_chain": ["summarize", "json-to-txt", "keyword-search"],
    },
    {
        "name": "transcribe-vector-search",
        "module_chain": ["transcribe", "text-embedder", "vector-search"],
    },
    {
        "name": "transcribe-keyword-search",
        "module_chain": ["transcribe", "json-to-txt", "keyword-search"],
    },
    {
        "name": "transcribe-translate-vector-search",
        "module_chain": ["transcribe", "translate", "text-embedder", "vector-search"],
    },
    {"name": "transcribe-summarize", "module_chain": ["transcribe", "summarize"]},
    {"name": "transcribe-sentiment", "module_chain": ["transcribe", "sentiment"]},
    {
        "name": "translate-vector-search",
        "module_chain": ["translate", "text-embedder", "vector-search"],
    },
    {
        "name": "translate-keyword-search",
        "module_chain": ["translate", "json-to-txt", "keyword-search"],
    },
]
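
Each entry in this list has the same two fields that `CreatePipeline` took in the cells above, so one can be turned into a pipeline object with a short helper. In the sketch below `build_pipeline` is a hypothetical name, and the two classes are passed in as arguments so the function can be read and tried without krixik installed. It assumes only the `Module(name=...)` and `CreatePipeline(name=..., module_chain=...)` calls used earlier in this notebook.

```python
def build_pipeline(example, module_cls, pipeline_cls):
    """Turn one {"name", "module_chain"} dict into a pipeline object."""
    chain = [module_cls(name=module_name) for module_name in example["module_chain"]]
    return pipeline_cls(name=example["name"], module_chain=chain)


# With krixik installed, the call would look like this (not executed here):
#   from krixik.pipeline_builder.module import Module
#   from krixik.pipeline_builder.pipeline import CreatePipeline
#   custom = build_pipeline(multi_module_pipeline_examples[0], Module, CreatePipeline)
#   pipeline = krixik.load_pipeline(pipeline=custom)
```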
