<a href="https://colab.research.google.com/github/krixik-ai/krixik-docs/blob/main/docs/examples/single_module_pipelines/single_text-embedder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> <a href="https://youtu.be/qNZIElHXmLA" target="_parent"><img src="https://badges.aleen42.com/src/youtube.svg" alt="Youtube"/></a>

In [1]:
import os
import sys
import json
from pathlib import Path

# demo setup - including secrets instantiation, requirements installation, and path setting
if os.getenv("COLAB_RELEASE_TAG"):
    # if running this notebook in Google Colab - make sure to enter your secrets
    MY_API_KEY = "YOUR_API_KEY_HERE"
    MY_API_URL = "YOUR_API_URL_HERE"

    # if running this notebook on Google Colab - install requirements and pull required subdirectories
    # install Krixik python client
    !pip install krixik

    # install github clone - allows for easy cloning of subdirectories from docs repo: https://github.com/krixik-ai/krixik-docs
    !pip install github-clone

    # clone datasets
    if not Path("data").is_dir():
        !ghclone https://github.com/krixik-ai/krixik-docs/tree/main/data
    else:
        print("docs datasets already cloned!")

    # define data dir
    data_dir = "./data/"

    # create output dir (Path is already imported at the top of this cell)
    Path(data_dir + "output").mkdir(parents=True, exist_ok=True)

else:
    # if running local pull of docs - set paths relative to local docs structure
    # import utilities
    sys.path.append("../../../")

    # define data_dir
    data_dir = "../../../data/"

    # if running this notebook locally from Krixik docs repo - load secrets from a .env placed at the base of the docs repo
    from dotenv import load_dotenv

    load_dotenv("../../../.env")

    MY_API_KEY = os.getenv("MY_API_KEY")
    MY_API_URL = os.getenv("MY_API_URL")

# import Krixik and initialize it with your personal secrets
from krixik import krixik

krixik.init(api_key=MY_API_KEY, api_url=MY_API_URL)

SUCCESS: You are now authenticated.


## Single-Module Pipeline: `text-embedder`
[🇨🇴 Versión en español de este documento](https://krixik-docs.readthedocs.io/es-main/ejemplos/ejemplos_pipelines_modulo_unico/unico_text-embedder_encaje_lexico/)

This document is a walkthrough of how to assemble and use a single-module pipeline that only includes a [`text-embedder`](../../modules/ai_modules/text-embedder_module.md) module. 

Text embedding converts words or sentences into numerical vectors in a high-dimensional space. These vectors are mathematical representations that preserve the semantic relationships and contextual meaning of the text. Embeddings support Natural Language Processing (NLP) tasks such as semantic (vector) search, sentiment analysis, machine translation, information retrieval, and recommendation systems.

For a slightly more complex pipeline that combines a [`text-embedder`](../../modules/ai_modules/text-embedder_module.md) module with a [`vector-db`](../../modules/database_modules/vector-db_module.md) module to enable [`semantic (vector) search`](../../system/search_methods/semantic_search_method.md), [click here](../search_pipeline_examples/multi_basic_semantic_search.md).

The document is divided into the following sections:

- [Pipeline Setup](#pipeline-setup)
- [Required Input Format](#required-input-format)
- [Using the Default Model](#using-the-default-model)
- [Examining Process Output Locally](#examining-process-output-locally)
- [Using a Non-Default Model](#using-a-non-default-model)
- [Using the `semantic_search` Method](#using-the-semantic_search-method)

### Pipeline Setup

Let's first instantiate a single-module [`text-embedder`](../../modules/ai_modules/text-embedder_module.md) pipeline.

We use the [`create_pipeline`](../../system/pipeline_creation/create_pipeline.md) method for this, passing only the [`text-embedder`](../../modules/ai_modules/text-embedder_module.md) module name into `module_chain`.

In [2]:
# create a pipeline with a single text-embedder module
pipeline = krixik.create_pipeline(name="single_text-embedder-1", module_chain=["text-embedder"])

### Required Input Format

The [`text-embedder`](../../modules/ai_modules/text-embedder_module.md) module accepts JSON file input. The input JSON must respect [this format](../../system/parameters_processing_files_through_pipelines/JSON_input_format.md).

The JSON file may optionally also include, along with each snippet, a key-value pair in which the key is the string `"line_numbers"` and the value is a list of integers giving every line of the original document that the snippet spans. This makes it easier to identify which document lines each vector is an embedding of.

Let's take a quick look at a valid input file, and then process it.

In [3]:
# examine contents of a valid input file
with open(data_dir + "input/1984_snippets.json", "r") as file:
    print(json.dumps(json.load(file), indent=2))

[
  {
    "snippet": "It was a bright cold day in April, and the clocks were striking thirteen.",
    "line_numbers": [
      1
    ]
  },
  {
    "snippet": "Winston Smith, his chin nuzzled into his breast in an effort to escape the\nvile wind, slipped quickly through the glass doors of Victory Mansions,\nthough not quickly enough to prevent a swirl of gritty dust from entering\nalong with him.",
    "line_numbers": [
      2,
      3,
      4,
      5
    ]
  }
]
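
A file in this shape can also be assembled programmatically. The following is a minimal sketch, not part of the Krixik client: the helper `build_snippet_records`, its `group_size` parameter, the example lines, and the filename `my_snippets.json` are all hypothetical names chosen for illustration. It assumes line numbers are 1-indexed, as in the file printed above.

```python
import json
import tempfile
from pathlib import Path


def build_snippet_records(lines, group_size=1):
    """Group consecutive lines into snippet records with 1-indexed line numbers."""
    records = []
    for start in range(0, len(lines), group_size):
        chunk = lines[start:start + group_size]
        records.append(
            {
                "snippet": "\n".join(chunk),
                "line_numbers": list(range(start + 1, start + 1 + len(chunk))),
            }
        )
    return records


# hypothetical two-line document, used only for illustration
example_lines = [
    "It was a bright cold day in April, and the clocks were striking thirteen.",
    "Winston Smith slipped quickly through the glass doors of Victory Mansions.",
]
records = build_snippet_records(example_lines)

# write the records to a JSON file (hypothetical filename, placed in the temp directory)
output_path = Path(tempfile.gettempdir()) / "my_snippets.json"
output_path.write_text(json.dumps(records, indent=2), encoding="utf-8")
print(output_path)
```

The resulting path could then be passed as `local_file_path` when processing, in the same way as `1984_snippets.json`.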


### Using the Default Model

Let's process our test input file using the [`text-embedder`](../../modules/ai_modules/text-embedder_module.md) module's [default model](../../modules/ai_modules/text-embedder_module.md#available-models-in-the-text-embedder-module): [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).

Given that this is the default model, we need not specify model selection through the optional [`modules`](../../system/parameters_processing_files_through_pipelines/process_method.md#selecting-models-via-the-modules-argument) argument in the [`process`](../../system/parameters_processing_files_through_pipelines/process_method.md) method.

In a later section of this document we will process the same file again, but select our model and the quantization thereof explicitly.

In [4]:
# process the file with the default model
process_output = pipeline.process(
    local_file_path=data_dir + "input/1984_snippets.json",  # the initial local filepath where the input file is stored
    local_save_directory=data_dir + "output",  # the local directory that the output file will be saved to
    expire_time=60 * 30,  # process data will be deleted from the Krixik system in 30 minutes
    wait_for_process=True,  # wait for process to complete before returning IDE control to user
    verbose=False,  # do not display process update printouts upon running code
)

The output of this process is printed below. To learn more about each component of the output, review documentation for the [`process`](../../system/parameters_processing_files_through_pipelines/process_method.md) method.

Moreover, the output file itself has been saved to the location noted in the `process_output_files` key. The `file_id` of the processed input is used as a filename prefix for the output file.

In [5]:
# nicely print the output of this process
print(json.dumps(process_output, indent=2))

{
  "status_code": 200,
  "pipeline": "single_text-embedder-1",
  "request_id": "ce2e57ce-c2be-49ac-8d43-59e6db2bcf25",
  "file_id": "ce4ddfa5-12c6-4dcb-86af-4f6d30ed6188",
  "message": "SUCCESS - output fetched for file_id ce4ddfa5-12c6-4dcb-86af-4f6d30ed6188.Output saved to location(s) listed in process_output_files.",
  "process_output": null,
  "process_output_files": [
    "../../../data/output/ce4ddfa5-12c6-4dcb-86af-4f6d30ed6188.npy"
  ]
}
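
Before loading the saved file, it can be useful to confirm that the process succeeded. The sketch below relies only on the keys visible in the printout above (`status_code`, `message`, `file_id`, `process_output_files`). The helper `get_output_paths` is a hypothetical name, and `example_output` is a hand-written stand-in dictionary rather than a live API response.

```python
from pathlib import Path


def get_output_paths(process_output):
    """Return output file paths from a process result, or raise if the call failed."""
    if process_output.get("status_code") != 200:
        raise RuntimeError(f"process failed: {process_output.get('message')}")
    return [Path(p) for p in process_output.get("process_output_files") or []]


# stand-in dictionary mirroring the printout above (not a live API response)
example_output = {
    "status_code": 200,
    "file_id": "ce4ddfa5-12c6-4dcb-86af-4f6d30ed6188",
    "message": "SUCCESS",
    "process_output_files": [
        "../../../data/output/ce4ddfa5-12c6-4dcb-86af-4f6d30ed6188.npy"
    ],
}

for path in get_output_paths(example_output):
    # the file_id is the filename prefix, so the stem should match it
    print(path.name, path.stem == example_output["file_id"])
```

In the notebook itself, `process_output` would take the place of `example_output`.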


### Examining Process Output Locally

The output NPY file, which contains the embedding vectors of our input data, can be examined as follows. For brevity, this example prints only the shape of the returned array, not its contents:

In [6]:
# examine vector output
import numpy as np

vectors = np.load(process_output["process_output_files"][0])
print(vectors.shape)

(2, 384)


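In other words, the array has 2 rows, one per input snippet, with 384 values in each row. 384 is the embedding dimension of the default `all-MiniLM-L6-v2` model.

In the context of the input file, the first row is the vectorized form of our first snippet: "It was a bright cold day in April, and the clocks were striking thirteen." The second row is the vectorized form of the second snippet.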
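
To keep track of which row belongs to which snippet, the rows can be paired with the input records by index. This is a minimal sketch that assumes the output rows follow the order of the input snippets. The helper `pair_snippets_with_vectors` is a hypothetical name, and the stand-in data uses plain lists so that the sketch runs on its own; an array returned by `np.load` can be passed in the same way.

```python
def pair_snippets_with_vectors(snippet_records, vectors):
    """Attach each embedding row to the snippet record at the same index."""
    if len(snippet_records) != len(vectors):
        raise ValueError("expected one vector per snippet")
    return [
        {
            "snippet": record["snippet"],
            "line_numbers": record.get("line_numbers"),
            "vector": vector,
        }
        for record, vector in zip(snippet_records, vectors)
    ]


# stand-in data: two records and two 384-value rows of zeros
example_records = [
    {"snippet": "first snippet", "line_numbers": [1]},
    {"snippet": "second snippet", "line_numbers": [2, 3, 4, 5]},
]
example_vectors = [[0.0] * 384, [0.0] * 384]

paired = pair_snippets_with_vectors(example_records, example_vectors)
for item in paired:
    print(item["line_numbers"], len(item["vector"]))
```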

### Using a Non-Default Model

To use a [non-default model](../../modules/ai_modules/text-embedder_module.md#available-models-in-the-text-embedder-module) like [`all-mpnet-base-v2`](https://huggingface.co/sentence-transformers/all-mpnet-base-v2), we must enter it explicitly through the [`modules`](../../system/parameters_processing_files_through_pipelines/process_method.md#selecting-models-via-the-modules-argument) argument when invoking the [`process`](../../system/parameters_processing_files_through_pipelines/process_method.md) method. As [module documentation](../../modules/ai_modules/text-embedder_module.md) indicates, you can also specify whether or not you wish to use the quantized version of the model.

In [7]:
# process the file with a non-default model
process_output = pipeline.process(
    local_file_path=data_dir + "input/1984_snippets.json",  # all parameters save 'modules' as above
    local_save_directory=data_dir + "output",
    expire_time=60 * 30,
    wait_for_process=True,
    verbose=False,
    modules={"text-embedder": {"model": "all-mpnet-base-v2", "params": {"quantize": False}}},  # specify a non-default model for this process
)
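
If you switch between models often, the nested `modules` dictionary can be built by a small helper. This is only a convenience sketch: `build_modules_arg` is a hypothetical function, not part of the Krixik client, and it reproduces the dictionary shape used in the call above.

```python
def build_modules_arg(model, quantize):
    """Build the `modules` dictionary for a single text-embedder module."""
    return {"text-embedder": {"model": model, "params": {"quantize": bool(quantize)}}}


modules_arg = build_modules_arg("all-mpnet-base-v2", quantize=False)
print(modules_arg)
```

The result would be passed to `process` as `modules=modules_arg`.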

Now we can examine the output as we did above.

In [8]:
# nicely print the output of this process
print(json.dumps(process_output, indent=2))

{
  "status_code": 200,
  "pipeline": "single_text-embedder-1",
  "request_id": "f53c93da-cf1b-41e0-a578-16dc62736ed4",
  "file_id": "1dcbde04-d4f2-414f-acaf-577c355bbb88",
  "message": "SUCCESS - output fetched for file_id 1dcbde04-d4f2-414f-acaf-577c355bbb88.Output saved to location(s) listed in process_output_files.",
  "process_output": null,
  "process_output_files": [
    "../../../data/output/1dcbde04-d4f2-414f-acaf-577c355bbb88.npy"
  ]
}
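
The saved array can be inspected in the same way as the default model's output. According to its model card, `all-mpnet-base-v2` produces 768-dimensional vectors, so a shape of `(2, 768)` would be expected for this input; that figure comes from the model card and was not printed in the run above. The sketch below uses a hypothetical helper, `embedding_shape`, and a stand-in file of zeros so that it runs without a live process call.

```python
import tempfile
from pathlib import Path

import numpy as np


def embedding_shape(npy_path):
    """Load an NPY file and return (number_of_vectors, vector_dimension)."""
    vectors = np.load(npy_path)
    if vectors.ndim != 2:
        raise ValueError(f"expected a 2-D array, got {vectors.ndim} dimension(s)")
    return vectors.shape


# stand-in file so the sketch runs on its own;
# in the notebook, pass process_output["process_output_files"][0] instead
with tempfile.TemporaryDirectory() as tmp_dir:
    stand_in_path = Path(tmp_dir) / "stand_in.npy"
    np.save(stand_in_path, np.zeros((2, 768)))
    rows, dim = embedding_shape(stand_in_path)

print(rows, dim)
```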


### Using the `semantic_search` Method

Any pipeline containing a [`vector-db`](../../modules/database_modules/vector-db_module.md) module preceded by a [`text-embedder`](../../modules/ai_modules/text-embedder_module.md) module has access to the [`semantic_search`](../../system/search_methods/semantic_search_method.md) method. This lets you run semantic queries against the vector database(s) that the pipeline creates.

As the single-module pipeline created above lacks the [`vector-db`](../../modules/database_modules/vector-db_module.md) module, the [`semantic_search`](../../system/search_methods/semantic_search_method.md) method will not work on it. Review documentation for this [pipeline example](../../examples/search_pipeline_examples/multi_basic_semantic_search.md) or this [pipeline example](../../examples/search_pipeline_examples/multi_snippet_semantic_search.md), both of which meet the requirements for the method: the former ingests TXT files, and the latter JSON files.
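
To get an intuition for what a semantic query does with embedding vectors, the idea can be illustrated locally with a brute-force cosine similarity ranking. This sketch is not how the `vector-db` module or the `semantic_search` method is implemented, and it does not call Krixik at all. The function `cosine_top_k` is a hypothetical name, and the vectors are toy 3-dimensional values. With real data, the query would first have to be embedded by the same model that produced the stored vectors, and all vectors are assumed to be non-zero.

```python
import numpy as np


def cosine_top_k(query_vector, vectors, k=1):
    """Return (row_index, similarity) pairs for the k rows most similar to the query."""
    query = np.asarray(query_vector, dtype=float)
    matrix = np.asarray(vectors, dtype=float)
    # normalize so that a dot product equals cosine similarity
    query_unit = query / np.linalg.norm(query)
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    similarities = matrix_unit @ query_unit
    # sort descending by similarity and keep the first k row indices
    top = np.argsort(similarities)[::-1][:k]
    return [(int(i), float(similarities[i])) for i in top]


# toy 3-dimensional vectors standing in for real embeddings
toy_vectors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
toy_query = [1.0, 0.1, 0.0]

print(cosine_top_k(toy_query, toy_vectors, k=2))
```

Here the query points almost along the first axis, so row 0 ranks first and row 2, which shares that direction in part, ranks second.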

In [9]:
# delete all processed datapoints belonging to this pipeline
krixik.reset_pipeline(pipeline)