In [1]:
import sys 
sys.path.append('../../')

%load_ext autoreload
%autoreload 2

## 1.  Building your first custom krixik pipeline

In krixik, *modules* are the building blocks of *pipelines*.  *Modules* consist of both AI models and supporting functions.

We start this section by describing the steps needed to begin building pipelines with modules.  Advanced details on modules may be found in Section 2.

We will use the following small function to pretty-print dictionaries and json to cell output.

In [2]:
# print dictionaries / json nicely in notebooks / markdown
import json
def json_print(data):
    print(json.dumps(data, indent=2))

### 1.1  Viewing available modules

To see all available modules use the following krixik api:

In [2]:
from krixik import krixik

# see all currently available modules
krixik.available_modules

['caption',
 'json-to-txt',
 'keyword-search',
 'ocr',
 'parser',
 'sentiment',
 'summarize',
 'text-embedder',
 'transcribe',
 'translate',
 'vector-search']

### 1.2  Creating modules

Let's create a few instances of the available modules shown above.

In [3]:
from krixik.pipeline_builder.module import Module

# create a few modules
module_1 = Module(module_type='transcribe')
module_2 = Module(module_type='text-embedder')
module_3 = Module(module_type="vector-search")
module_4 = Module(module_type="parser")

Once instantiated we can examine the metadata of these instances or connect them into a pipeline.

### 1.3  Viewing a module's `config` metadata

The highest-level metadata on a module can be viewed via its `.config` property.

This high-level information is especially useful when *processing* data with a module in a pipeline, but it's also a great place to start in understanding current module offerings.

Specifically, `.config` tells you about a module's defaults, its available models, the kind of input data it accepts, and the kind of output data it produces.

Let's take a look at our first module's `.config`.

In [4]:
# print a dictionary nicely in an ide or notebook
json_print(module_1.config)

{
  "module": {
    "name": "transcribe",
    "models": [
      {
        "name": "whisper-tiny"
      },
      {
        "name": "whisper-base"
      },
      {
        "name": "whisper-small"
      },
      {
        "name": "whisper-medium"
      },
      {
        "name": "whisper-large-v3"
      }
    ],
    "input": {
      "type": "audio",
      "permitted_extensions": [
        ".mp3",
        ".mp4"
      ]
    },
    "output": {
      "type": "json",
      "permitted_extensions": [
        ".json"
      ]
    },
    "defaults": {
      "model": "whisper-tiny"
    }
  }
}
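
Because `.config` is a plain dictionary, its fields can be read programmatically. The sketch below copies the `transcribe` output above into a literal, so it runs without krixik installed, and pulls out the model names, default model, and permitted input extensions. The helper name `summarize_module_config` is hypothetical and not part of krixik.

```python
# Minimal sketch: summarize a module config dictionary.
# `transcribe_config` is copied by hand from the output above; with krixik
# installed you would pass module_1.config instead.
transcribe_config = {
    "module": {
        "name": "transcribe",
        "models": [
            {"name": "whisper-tiny"},
            {"name": "whisper-base"},
            {"name": "whisper-small"},
            {"name": "whisper-medium"},
            {"name": "whisper-large-v3"},
        ],
        "input": {"type": "audio", "permitted_extensions": [".mp3", ".mp4"]},
        "output": {"type": "json", "permitted_extensions": [".json"]},
        "defaults": {"model": "whisper-tiny"},
    }
}


def summarize_module_config(config):
    """Hypothetical helper: collect the key facts from a module config."""
    module = config["module"]
    return {
        "name": module["name"],
        "models": [model["name"] for model in module["models"]],
        "default_model": module["defaults"]["model"],
        "input_type": module["input"]["type"],
        "input_extensions": module["input"]["permitted_extensions"],
        "output_type": module["output"]["type"],
    }


summary = summarize_module_config(transcribe_config)
print(summary["default_model"], summary["input_extensions"])
```

A summary like this is a quick way to confirm which file types a module accepts before placing it at the start of a pipeline.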


Now the second module's `.config`.  In this case we see that each model has a `quantize` parameter that can be set at processing time.

In [5]:
# print a dictionary nicely in an ide or notebook
json_print(module_2.config)

{
  "module": {
    "name": "text-embedder",
    "models": [
      {
        "name": "multi-qa-MiniLM-L6-cos-v1",
        "params": {
          "quantize": {
            "type": "bool",
            "default": true
          }
        }
      },
      {
        "name": "msmarco-distilbert-dot-v5",
        "params": {
          "quantize": {
            "type": "bool",
            "default": true
          }
        }
      },
      {
        "name": "all-MiniLM-L12-v2",
        "params": {
          "quantize": {
            "type": "bool",
            "default": true
          }
        }
      },
      {
        "name": "all-mpnet-base-v2",
        "params": {
          "quantize": {
            "type": "bool",
            "default": true
          }
        }
      },
      {
        "name": "all-MiniLM-L6-v2",
        "params": {
          "quantize": {
            "type": "bool",
            "default": true
          }
        }
      }
    ],
    "input": {
      "type": "json",
 

Advanced module data properties are described in Section 2.  Knowing them is not a prerequisite for building pipelines.
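
The per-model `quantize` entries in the `text-embedder` config above can be gathered into a simple lookup of default values. This is a minimal sketch that reproduces two of the model entries as a literal. The helper `default_params` is hypothetical and only reads the dictionary structure shown in the output; it says nothing about how krixik consumes these parameters.

```python
# Minimal sketch: collect default parameter values per model.
# The list reproduces two of the model entries printed above;
# the rest of the config is omitted here.
models = [
    {
        "name": "multi-qa-MiniLM-L6-cos-v1",
        "params": {"quantize": {"type": "bool", "default": True}},
    },
    {
        "name": "all-MiniLM-L6-v2",
        "params": {"quantize": {"type": "bool", "default": True}},
    },
]


def default_params(model_entries):
    """Hypothetical helper: map model name -> {param name: default value}."""
    return {
        entry["name"]: {
            param: spec["default"]
            for param, spec in entry.get("params", {}).items()
        }
        for entry in model_entries
    }


defaults = default_params(models)
print(defaults)
```

Models that list no `params` (such as the `transcribe` models shown earlier) simply map to an empty dictionary.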

### 1.4  Building your first pipeline

Let's build a standard text search pipeline using modules.
 
First we instantiate our modules - here we need the `parser`, `text-embedder`, and `vector-search` modules.  The `parser` currently takes care of setting up `keyword-search`.

In [6]:
from krixik.pipeline_builder.module import Module

# define a text search pipeline using modules
parser = Module(module_type='parser')
text_embedder = Module(module_type='text-embedder')
vector_search = Module(module_type='vector-search')

We want to make a pipeline from these three modules that looks like this

`parser` --> `text-embedder` --> `vector-search`

That is, a sequence of discrete processing steps:

- the `parser` module takes a *text* file as *input* and outputs a *json* file of text snippets
- the `text-embedder` module takes as *input* the *json* output of the `parser` and produces numpy *output*
- the `vector-search` module takes as *input* the numpy *output* from `text-embedder` and produces a vector index as *output*
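
The flow described in the list above can be sketched as a walk over adjacent modules, comparing each module's output format with the next module's input format. The format names are taken from the configs and `click_data` printed elsewhere in this document. The table `FORMATS` and the function `first_format_mismatch` are hypothetical and illustrate the idea only; they are not krixik's internal check.

```python
# Minimal sketch of walking a module chain and comparing adjacent formats.
# Each entry maps a module name to (input format, output format).
FORMATS = {
    "parser": ("text", "json"),
    "text-embedder": ("json", "npy"),
    "vector-search": ("npy", "faiss"),
}


def first_format_mismatch(chain):
    """Return (upstream, downstream) for the first mismatched pair, else None."""
    for upstream, downstream in zip(chain, chain[1:]):
        if FORMATS[upstream][1] != FORMATS[downstream][0]:
            return upstream, downstream
    return None


good_chain = ["parser", "text-embedder", "vector-search"]
bad_chain = ["vector-search", "text-embedder"]

print(first_format_mismatch(good_chain))
print(first_format_mismatch(bad_chain))
```

The first chain has no mismatch, while the reversed pair is flagged because `faiss` output cannot feed a `json` input.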

With our modules instantiated, they can be added one at a time using the pipeline's `.add` api, or all together at instantiation of the pipeline.

When taking the latter approach, the modules are simply placed in order into a list called `module_chain`, as shown below.

In [7]:
from krixik.pipeline_builder.pipeline import CreatePipeline

text_search_pipeline = CreatePipeline(name='my-text-search-pipeline', 
                                      module_chain=[parser, text_embedder, vector_search])

Connection or "click-ability" tests are performed on the instantiation of this object.  These guarantee proper flow of input/output information through the defined module chain of the pipeline.

These tests catch incompatible module connections.  For example if we try the pipeline

`vector-search` --> `text-embedder`

our instantiation will fail with a message about *why* the connection won't work. 

Let's try (and fail) to build this pipeline.

In [8]:
from krixik.pipeline_builder.pipeline import CreatePipeline

fail_pipeline = CreatePipeline(name='my-failed-pipeline', 
                               module_chain=[vector_search, text_embedder])

TypeError: format type mismatch between vector-search - whose output format is faiss - and text-embedder - whose input format is json

For more details on what's happening with these tests see Section 2 of this document.  For now the details are not critical.

### 1.5  Testing input in your pipeline

You can test whether inputs to your pipeline will flow properly through it by using your pipeline's `.test_input` api. 

We illustrate this below with both a valid and invalid file for our `text_search_pipeline` above.

Make sure to examine your modules' configs or your pipeline config (detailed in the next subsection) - and in particular the first module's config - to understand allowable input data types and file extensions for your pipeline.  

This test does not execute your pipeline.  It makes sure your input file is consumable by the first module of your pipeline.

In [9]:
# define path to an input file from examples directory
test_file = "../../examples/input_data/1984_very_short.txt"

# use .test_input to ensure the pipeline is working as expected on test files
text_search_pipeline.test_input(local_file_path=test_file)

SUCCESS: local file ../../examples/input_data/1984_very_short.txt passed pipeline input test passed


In [10]:
# define path to an input file from examples directory
test_file = "../../examples/input_data/seal.png"

# use .test_input to ensure the pipeline is working as expected on test files
text_search_pipeline.test_input(local_file_path=test_file)

Exception: file extension .png does not match the expected input format text
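
As a rough illustration of the kind of check involved, the sketch below compares a file's extension against the `parser` module's permitted extensions, which are listed in the pipeline config in Section 1.6. The helper `extension_allowed` is hypothetical. It is not krixik's implementation, which may validate more than the extension.

```python
from pathlib import Path

# Permitted extensions for the parser module, copied from the pipeline
# config shown in this document.
PARSER_EXTENSIONS = {".txt", ".pdf", ".docx", ".pptx"}


def extension_allowed(local_file_path, permitted=PARSER_EXTENSIONS):
    """Hypothetical helper: True if the file's extension is permitted."""
    return Path(local_file_path).suffix in permitted


txt_ok = extension_allowed("../../examples/input_data/1984_very_short.txt")
png_ok = extension_allowed("../../examples/input_data/seal.png")
print(txt_ok, png_ok)
```

The `.txt` file passes and the `.png` file does not, mirroring the two `.test_input` results above.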

Examine the relevant data class of your starting module to ensure your input satisfies its structure requirements.

You can get a quick sense of its required structure by looking at a sample datapoint as shown in the next few cells.

In [11]:
# examine the required input / output data structure for the parser module by printing an example of each
from krixik.modules.parser import io
import json
print('input data example')
print('-----')
print(io.InputStructure().data_example)
print('\n')
print('output data example')
print('-----')
print(json.dumps(io.OutputStructure().data_example, indent=2))

input data example
-----
sample text looks like this.


output data example
-----
{
  "snippet": "This is the main text.",
  "line_numbers": [
    1,
    2,
    3,
    4
  ],
  "other": null
}


Here `other` denotes any other key in your input.  Its value is arbitrary.

For a deeper understanding of module io you can examine its `dataclass` as detailed in Section 2.

### 1.6 Useful pipeline data properties

Let's look at some valuable data properties and apis of our pipeline `text_search_pipeline`. 

To view the module chain of your pipeline, use the `.module_chain` property.

In [12]:
# view the module chain of your pipeline using the .module_chain property
text_search_pipeline.module_chain

['parser', 'text-embedder', 'vector-search']

For a more detailed view of your pipeline, including details on permissible input/output data types and extensions, use the `.config` property.  This essentially centralizes your pipeline's module configs in one place.

Your pipeline config file is also how you save / load your pipeline (so you do not need to go through the pythonic steps of building it each time you want to use it).

In [13]:
# examine a pipeline's high level data by using the .config property
# print a dictionary nicely in an ide or notebook
json_print(text_search_pipeline.config)

{
  "pipeline": {
    "name": "my-text-search-pipeline",
    "modules": [
      {
        "name": "parser",
        "models": [
          {
            "name": "sentence"
          },
          {
            "name": "fixed",
            "params": {
              "chunk_size": {
                "type": "int",
                "default": 10
              },
              "overlap_size": {
                "type": "int",
                "default": 2
              }
            }
          }
        ],
        "defaults": {
          "model": "sentence"
        },
        "input": {
          "type": "text",
          "permitted_extensions": [
            ".txt",
            ".pdf",
            ".docx",
            ".pptx"
          ]
        },
        "output": {
          "type": "json"
        }
      },
      {
        "name": "text-embedder",
        "models": [
          {
            "name": "multi-qa-MiniLM-L6-cos-v1",
            "params": {
              "quantize": {
            

### 1.7  Saving your pipeline config

Once your pipeline is built, connection tested, and input tested, it's a good idea to save its config.  This allows you to load your pipeline directly from file in the future, saving the hassle of having to rebuild it pythonically each time you want to use it. 

To save the config file of a pipeline use the `.save` method, providing a path to a local `.yaml` file.

In [14]:
# save your pipeline config to a .yaml file
text_search_pipeline.save('text_search_pipeline.yaml')

### 1.8  Loading your pipeline from file

Load your pipeline either directly on instantiation or by using the `.load` method.

In [15]:
from krixik.pipeline_builder.pipeline import CreatePipeline

# load pipeline directly on instantiation
reloaded_pipeline = CreatePipeline(config_path = 'text_search_pipeline.yaml')

In [17]:
# examine a pipeline's high level data by using the .config property
# print a dictionary nicely in an ide or notebook
json_print(reloaded_pipeline.config)

{
  "pipeline": {
    "name": "my-text-search-pipeline",
    "modules": [
      {
        "name": "parser",
        "models": [
          {
            "name": "sentence"
          },
          {
            "name": "fixed",
            "params": {
              "chunk_size": {
                "type": "int",
                "default": 10
              },
              "overlap_size": {
                "type": "int",
                "default": 2
              }
            }
          }
        ],
        "defaults": {
          "model": "sentence"
        },
        "input": {
          "type": "text",
          "permitted_extensions": [
            ".txt",
            ".pdf",
            ".docx",
            ".pptx"
          ]
        },
        "output": {
          "type": "json"
        }
      },
      {
        "name": "text-embedder",
        "models": [
          {
            "name": "multi-qa-MiniLM-L6-cos-v1",
            "params": {
              "quantize": {
            

## 2.  Modules - advanced details

This section contains advanced topics on module usage.  This includes discussion of additional module data properties: `click_data` and the `_example` properties (for instance `output_example`).

In [18]:
from krixik.pipeline_builder.module import Module

# create a few modules
module_1 = Module(module_type='transcribe')
module_2 = Module(module_type='text-embedder')
module_3 = Module(module_type="vector-search")
module_4 = Module(module_type="parser")

### 2.1  Viewing a module's `click_data`

The module property `click_data` displays all the basic data required to know which other modules it can be "clicked" into in a pipeline.  This is precisely the data referenced "under the hood" of krixik when you build a pipeline using the `pipeline` api.

First there's the module's input / output data format.  A module like `transcribe` takes in `audio` and outputs `json`, while the `text-embedder` takes in `json` and outputs `npy`.

Checking that the *output* format of a module matches the *input* format of another module is the *first* of two steps in determining if two modules can be clicked together.  If the output format of "module A"  matches the input format of "module B" you'll likely be able to connect "module A" --> "module B" in a pipeline.

The *second* step to determine module click-ability is to make sure the input/output  `process_type`'s match.  A module might input a `json` format, but only *process* on certain key-value pairs of it.  

Checking this alignment of `process_type` guarantees modules can be connected.

Let's take a look at the `click_data` of two modules and discuss what it says about their "click-ability".

In [19]:
# examine a module's "click-ability" data by using the click_data property
# print a dictionary nicely in an ide or notebook
json_print(module_2.click_data)
json_print(module_3.click_data)

{
  "module_name": "text-embedder",
  "input_format": "json",
  "output_format": "npy",
  "input_process_key": "snippet",
  "input_process_type": "<class 'str'>",
  "output_process_key": "data",
  "output_process_type": "<class 'numpy.ndarray'>"
}
{
  "module_name": "vector-search",
  "input_format": "npy",
  "output_format": "faiss",
  "input_process_key": "data",
  "input_process_type": "<class 'numpy.ndarray'>",
  "output_process_key": null,
  "output_process_type": null
}


This data suggests that we can "click" the modules together like this:

`text-embedder` -> `vector-search`

but *not* like this

 `vector-search` -> `text-embedder`

The first module connection (`text-embedder` -> `vector-search`) will work since - from the `click_data` of both modules - we can see that 

- `text-embedder` output_format (`npy`) == `vector-search` input_format (`npy`), and 
- `text-embedder` output_process_type (`<class 'numpy.ndarray'>`) == `vector-search` input_process_type (`<class 'numpy.ndarray'>`)


The latter connection ( `vector-search` -> `text-embedder`) will not work since we can see from the same data 

- `vector-search` output_format (`faiss`) != `text-embedder` input_format (`json`)
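
The two-step comparison above can be sketched directly on the `click_data` dictionaries. The literals below are copied from the printed output, and the function `can_click` is a hypothetical illustration of the logic described in this section, not krixik's own code.

```python
# click_data for two modules, copied from the output above.
text_embedder_click = {
    "module_name": "text-embedder",
    "input_format": "json",
    "output_format": "npy",
    "input_process_key": "snippet",
    "input_process_type": "<class 'str'>",
    "output_process_key": "data",
    "output_process_type": "<class 'numpy.ndarray'>",
}
vector_search_click = {
    "module_name": "vector-search",
    "input_format": "npy",
    "output_format": "faiss",
    "input_process_key": "data",
    "input_process_type": "<class 'numpy.ndarray'>",
    "output_process_key": None,
    "output_process_type": None,
}


def can_click(upstream, downstream):
    """Hypothetical two-step check: formats first, then process types."""
    # Step 1: the upstream output format must equal the downstream input format.
    if upstream["output_format"] != downstream["input_format"]:
        return False
    # Step 2: the process types on either side of the connection must match.
    return upstream["output_process_type"] == downstream["input_process_type"]


forward = can_click(text_embedder_click, vector_search_click)
backward = can_click(vector_search_click, text_embedder_click)
print(forward, backward)
```

The forward connection passes both steps, while the reversed one already fails at the format comparison.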



### 2.2  Viewing a module's i/o details

You can use a module's `_example` properties, such as `output_example`, to see an example of its input/output.

In [20]:
# examine an example of a module's output by using the output_example property
# print a dictionary nicely in an ide or notebook
json_print(module_1.output_example)

{
  "transcript": "This is the full transcript.",
  "segments": [
    {
      "id": 1,
      "seek": 0,
      "start": 0.0,
      "end": 10.0,
      "text": "This is the",
      "tokens": [
        20,
        34
      ],
      "temperature": 0.0,
      "avg_logprob": 0.0,
      "compression_ratio": 0.0,
      "no_speech_prob": 0.0,
      "confidence": 0.0,
      "words": [
        {
          "text": "This",
          "start": 0.0,
          "end": 1.0,
          "confidence": 0.5
        },
        {
          "text": "is the",
          "start": 1.0,
          "end": 2.0,
          "confidence": 0.6
        }
      ]
    },
    {
      "id": 2,
      "seek": 10,
      "start": 10.0,
      "end": 20.0,
      "text": "main text",
      "tokens": [
        44,
        101
      ],
      "temperature": 0.0,
      "avg_logprob": 0.0,
      "compression_ratio": 0.0,
      "no_speech_prob": 0.0,
      "confidence": 0.0,
      "words": [
        {
          "text": "main",
          "start"

### 2.3 Examining input/output dataclasses

To get a deeper understanding of each module's input/output data structure you can examine its associated dataclasses.

As an example, the first module in our `text_search_pipeline` pipeline is `parser`.  The io dataclass for this module is shown below.  Your input must match this class's requirements in order for your input test to pass, and in order for your pipeline to function properly.

In [21]:
# load in io.py from krixik.modules.parser
from krixik.modules.parser import io
import inspect
print(inspect.getsource(io.InputStructure))

@dataclass
class InputStructure:
    format: Literal["text"] = "text"
    filename: str = "input_text.txt"
    process_key: None = None

    @property
    def data_example(self):
        return "sample text looks like this."

    @property
    def process_type(self):
        return (
            str(self.__annotations__[self.process_key])
            if self.process_key is not None
            else None
        )
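
For the `parser` input above, `process_key` is `None`, so `process_type` is `None` as well. To see what the property yields when a key is set, here is a minimal sketch with a hypothetical dataclass, `ExampleJsonInput`, which is not a krixik class. It follows the same pattern as the source above, except that the annotations are looked up on the class via `type(self)`.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class ExampleJsonInput:
    # Hypothetical structure for illustration; not a krixik class.
    format: Literal["json"] = "json"
    snippet: str = "This is the main text."
    process_key: Optional[str] = "snippet"

    @property
    def process_type(self):
        # Look up the annotation of the field named by process_key.
        # type(self) is used so the lookup goes through the class.
        if self.process_key is None:
            return None
        return str(type(self).__annotations__[self.process_key])


with_key = ExampleJsonInput().process_type
without_key = ExampleJsonInput(process_key=None).process_type
print(with_key)
print(without_key)
```

With `process_key="snippet"` the property returns the string `<class 'str'>`, the same form that appears as `input_process_type` in the `text-embedder` `click_data`.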



## 3.  Pipelines - advanced details

### 3.1   Building a pipeline one module at a time

You can also build a pipeline by adding modules one at a time using the `.add` api.  Each time a module is added, the same sort of connection test described above is performed on the entire module chain.

In [22]:
from krixik.pipeline_builder.module import Module
from krixik.pipeline_builder.pipeline import CreatePipeline

# define a module
module_1 = Module(module_type='transcribe')

# instantiate an empty custom pipeline
pipeline = CreatePipeline(name='my-custom-pipeline')

# add the first module to the pipeline
pipeline.add(module_1)

In [23]:
# define another module
module_2 = Module(module_type='sentiment')

# add the second module to the pipeline
pipeline.add(module_2)

In [24]:
# define another module
module_3 = Module(module_type='translate')

# add the third module to the pipeline
pipeline.add(module_3)

You can now use all of the previously detailed attributes to view your pipeline's configuration, for example the `.config` attribute.

In [25]:
# examine a pipeline's config
# print a dictionary nicely in an ide or notebook
json_print(pipeline.config)


{
  "pipeline": {
    "name": "my-custom-pipeline",
    "modules": [
      {
        "name": "transcribe",
        "models": [
          {
            "name": "whisper-tiny"
          },
          {
            "name": "whisper-base"
          },
          {
            "name": "whisper-small"
          },
          {
            "name": "whisper-medium"
          },
          {
            "name": "whisper-large-v3"
          }
        ],
        "defaults": {
          "model": "whisper-tiny"
        },
        "input": {
          "type": "audio",
          "permitted_extensions": [
            ".mp3",
            ".mp4"
          ]
        },
        "output": {
          "type": "json"
        }
      },
      {
        "name": "sentiment",
        "models": [
          {
            "name": "distilbert-base-uncased-finetuned-sst-2-english"
          },
          {
            "name": "bert-base-multilingual-uncased-sentiment"
          },
          {
            "name": "distilbe