<a href="https://colab.research.google.com/github/krixik-ai/krixik-docs/blob/main/docs/system/pipeline_creation/create_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import os
import sys
import json
import importlib
from pathlib import Path

# demo setup - including secrets instantiation, requirements installation, and path setting
if os.getenv("COLAB_RELEASE_TAG"):
    # if running this notebook in Google Colab - make sure to enter your secrets
    MY_API_KEY = "YOUR_API_KEY_HERE"
    MY_API_URL = "YOUR_API_URL_HERE"

    # if running this notebook on Google Colab - install requirements and pull required subdirectories
    # install Krixik python client
    !pip install krixik

    # install github clone - allows for easy cloning of subdirectories from docs repo: https://github.com/krixik-ai/krixik-docs
    !pip install github-clone

    # clone datasets
    if not Path("data").is_dir():
        !ghclone https://github.com/krixik-ai/krixik-docs/tree/main/data
    else:
        print("docs datasets already cloned!")

    # define data dir
    data_dir = "./data/"

    # create output dir
    Path(data_dir, "output").mkdir(parents=True, exist_ok=True)

    # pull utilities
    if not Path("utilities").is_dir():
        !ghclone https://github.com/krixik-ai/krixik-docs/tree/main/utilities
    else:
        print("docs utilities already cloned!")
else:
    # if running local pull of docs - set paths relative to local docs structure
    # import utilities
    sys.path.append("../../../")

    # define data_dir
    data_dir = "../../../data/"

    # if running this notebook locally from Krixik docs repo - load secrets from a .env placed at the base of the docs repo
    from dotenv import load_dotenv

    load_dotenv("../../../.env")

    MY_API_KEY = os.getenv("MY_API_KEY")
    MY_API_URL = os.getenv("MY_API_URL")


# load in reset
reset = importlib.import_module("utilities.reset")
reset_pipeline = reset.reset_pipeline


# import Krixik and initialize it with your personal secrets
from krixik import krixik

krixik.init(api_key=MY_API_KEY, api_url=MY_API_URL)

SUCCESS: You are now authenticated.


## Creating a Pipeline

This overview on creating pipelines is divided into the following sections:

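- [The `create_pipeline` Method](#the-create_pipeline-method)
- [A Single-Module Pipeline](#a-single-module-pipeline)
- [A Multi-Module Pipeline](#a-multi-module-pipeline)
- [Module Sequence Validation](#module-sequence-validation)
- [Pipeline Name Repetition](#pipeline-name-repetition)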

### The `create_pipeline` Method

The `create_pipeline` method instantiates new pipelines. It takes two arguments, both required:

- `name` (str): The name of your new pipeline. Set it wisely: a pipeline's name is its key identifier, and no two pipelines can share the same name.
- `module_chain` (list): The sequential list of modules that make up your new pipeline.

[Click here](../../modules/modules_overview.md) to see the current list of available Krixik modules. Remember that as long as each module's output matches the next module's input, any combination of modules is fair game, including chains that repeat a module.

### A Single-Module Pipeline

Let's use the `create_pipeline` method to create a single-module pipeline. We'll use the [`parser`](../../modules/support_function_modules/parser_module.md) module, which divides input text files into shorter snippets.

In [2]:
# create a pipeline with a single parser module
pipeline = krixik.create_pipeline(name="create_pipeline_1_parser", module_chain=["parser"])

Make sure that you have [initialized your session](../initialization/initialize_and_authenticate.md) before executing this code.

Note that the `name` argument can be any string you like. The `module_chain` list, however, can only contain established [module identifiers](../convenience_methods/convenience_methods.md#view-all-available-modules-with-the-available_modules-property).
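The constraint on `module_chain` can be checked locally before calling `create_pipeline`. The following is a minimal sketch, not part of the Krixik client: `find_unknown_modules` is a hypothetical helper, and `example_known_modules` is an illustrative subset rather than the full list of identifiers.

```python
from typing import Iterable, List


def find_unknown_modules(module_chain: List[str], known_modules: Iterable[str]) -> List[str]:
    """Return the entries of module_chain that are not in known_modules, in order."""
    known = set(known_modules)
    return [module for module in module_chain if module not in known]


# Illustrative subset only (an assumption, not the full list of Krixik modules).
example_known_modules = ["parser", "text-embedder", "vector-db", "caption"]

# Every identifier is known, so nothing is flagged.
print(find_unknown_modules(["parser", "text-embedder"], example_known_modules))

# "embedder" is not an identifier in the example list, so it is flagged.
print(find_unknown_modules(["parser", "embedder"], example_known_modules))
```

In a live, initialized session you would presumably pass the list exposed by the `available_modules` property linked above instead of the hard-coded example list.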

### A Multi-Module Pipeline

Now let's set up a pipeline that chains three modules in sequence: a [`parser`](../../modules/support_function_modules/parser_module.md) module, a [`text-embedder`](../../modules/ai_modules/text-embedder_module.md) module, and a [`vector-db`](../../modules/database_modules/vector-db_module.md) module. This `module_chain` comes up often: it's the basic document-based semantic (a.k.a. vector) search [pipeline](../../examples/search_pipeline_examples/multi_basic_semantic_search.md).

As you can see, pipeline setup syntax is the same as above. The order of the modules in `module_chain` is the order in which they'll process pipeline input:

In [3]:
# create a basic semantic (vector) search multi-module pipeline
pipeline = krixik.create_pipeline(name="create_pipeline_2_parser_embedder_vector", module_chain=["parser", "text-embedder", "vector-db"])

An array of multi-module pipeline examples can be [found here](../../examples/pipeline_examples_overview.md).

### Module Sequence Validation

When `create_pipeline` runs, the Krixik CLI confirms that the indicated modules will run properly in the provided sequence. If they cannot (generally because one module's output does not match the next module's input), an explanatory local exception is thrown.

For example, attempting to build a two-module pipeline that consists of a [`parser`](../../modules/support_function_modules/parser_module.md) module followed by a [`caption`](../../modules/ai_modules/caption_module.md) module will fail with a local exception. The [`parser`](../../modules/support_function_modules/parser_module.md) module outputs a JSON file, while the [`caption`](../../modules/ai_modules/caption_module.md) module accepts only image input, as the error message below indicates:

In [4]:
# attempt to create a pipeline sequentially comprised of a parser and a caption module
pipeline = krixik.create_pipeline(name="create_pipeline_3_parser_caption", module_chain=["parser", "caption"])

TypeError: format type mismatch between parser - whose output format is json - and caption - whose input format is image
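To make the idea of the sequence check concrete, here is a minimal local sketch of comparing each module's output format with the next module's input format. It is not Krixik's actual implementation. `MODULE_FORMATS` and `validate_chain` are hypothetical names, and only two entries in the table come from the text above (the `parser` module outputs JSON, the `caption` module accepts image input); every other format listed is an assumption for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical table of (input format, output format) per module.
# Only "parser outputs json" and "caption accepts image" come from the docs text;
# the remaining formats are assumptions made for this illustration.
MODULE_FORMATS: Dict[str, Tuple[str, str]] = {
    "parser": ("text", "json"),
    "text-embedder": ("json", "npy"),
    "vector-db": ("npy", "faiss"),
    "caption": ("image", "json"),
}


def validate_chain(module_chain: List[str]) -> None:
    """Raise TypeError if a module's output format differs from the next module's input format."""
    # Walk over adjacent pairs: (first, second), (second, third), ...
    for upstream, downstream in zip(module_chain, module_chain[1:]):
        out_format = MODULE_FORMATS[upstream][1]
        in_format = MODULE_FORMATS[downstream][0]
        if out_format != in_format:
            raise TypeError(
                f"format type mismatch between {upstream} - whose output format is {out_format} "
                f"- and {downstream} - whose input format is {in_format}"
            )


# Under the assumed formats, this chain passes without raising.
validate_chain(["parser", "text-embedder", "vector-db"])

# This chain fails: json output cannot feed a module that expects image input.
try:
    validate_chain(["parser", "caption"])
except TypeError as error:
    print(error)
```

The sketch checks only adjacent pairs, which is all that a strictly sequential chain requires: each module sees nothing but the output of the module directly before it.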

### Pipeline Name Repetition

Krixik does not allow two different pipelines to share a `name`. The only exception is a new pipeline whose module chain is identical to the old one's.

If you attempt to create a new pipeline with the `name` of a previous pipeline but a different `module_chain`, initial pipeline instantiation will not fail; in other words, you will be able to run the `create_pipeline` method without issue. However, once two pipelines with the same name and different `module_chain`s exist and you have already [`processed`](../parameters_processing_files_through_pipelines/process_method.md) a file through one of them, you will **not** be allowed to process a file through the other, because the pipeline `name` is duplicated.
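Because the name clash only surfaces at processing time, it can help to catch it earlier in your own code. The following is a minimal sketch of a purely local guard, assuming you track the pipelines you create yourself; `PipelineNameRegistry` is a hypothetical helper and not part of the Krixik client.

```python
from typing import Dict, List


class PipelineNameRegistry:
    """Hypothetical local record of pipeline names and the module chains they were created with."""

    def __init__(self) -> None:
        self._chains: Dict[str, List[str]] = {}

    def register(self, name: str, module_chain: List[str]) -> None:
        """Record a pipeline, refusing a reused name unless the module chain is identical."""
        existing = self._chains.get(name)
        if existing is not None and existing != list(module_chain):
            raise ValueError(
                f"pipeline name '{name}' is already used with module chain {existing}"
            )
        self._chains[name] = list(module_chain)


registry = PipelineNameRegistry()
registry.register("create_pipeline_1_parser", ["parser"])

# Same name with an identical module chain: allowed, mirroring the exception described above.
registry.register("create_pipeline_1_parser", ["parser"])

# Same name with a different module chain: rejected locally before any file is processed.
try:
    registry.register("create_pipeline_1_parser", ["parser", "text-embedder"])
except ValueError as error:
    print(error)
```

You would call `register` right before each `create_pipeline` call, so a conflicting name is reported at creation time instead of at processing time.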

In [6]:
# delete all processed datapoints belonging to this pipeline
reset_pipeline(pipeline)