## Semantically searchable translation

This document reviews a semantic search pipeline that translates input text from one language and makes it semantically searchable in another.


- [pipeline setup](#pipeline-setup)
- [processing an input file](#processing-an-input-file)
- [using the `semantic_search` method](#using-the-semantic_search-method)


In [1]:
# import utilities
import sys 
import json
sys.path.append('../../../')
import importlib
reset = importlib.import_module("utilities.reset")
reset_pipeline = reset.reset_pipeline

# load secrets from a .env file using python-dotenv
from dotenv import load_dotenv
import os
load_dotenv("../../.env")
MY_API_KEY = os.getenv('MY_API_KEY')
MY_API_URL = os.getenv('MY_API_URL')

# import krixik and initialize it with your personal secrets
from krixik import krixik
krixik.init(api_key = MY_API_KEY, 
            api_url = MY_API_URL)

SUCCESS: You are now authenticated.


## Pipeline setup

Below we set up two pipelines.

The first is a three-module pipeline consisting of [`parser`](modules/parser.md), [`translate`](modules/translate.md), and `json-to-txt` modules.  It translates an input text file and returns the translation as a text file.

The second chains [`parser`](modules/parser.md), [`translate`](modules/translate.md), `json-to-txt`, [`parser`](modules/parser.md), [`text-embedder`](modules/text-embedder.md), and [`vector-db`](modules/vector-db.md) modules.  This allows us to translate input files and make the translation semantically searchable.

We create each pipeline by passing its module names to the `module_chain` argument of [`create_pipeline`](system/create_save_load.md) along with a name for the pipeline.

In [2]:
# create a multi module pipeline to translate an input text file
translate_pipeline = krixik.create_pipeline(name="examples-translate-text-search-pipeline",
                                            module_chain=["parser", 
                                                          "translate", 
                                                          "json-to-txt"])

# create a multi module pipeline to translate an input text file and make it semantically searchable
pipeline = krixik.create_pipeline(name="examples-translate-text-search-semantic-pipeline",
                                  module_chain=["parser", 
                                                "translate", 
                                                "json-to-txt", 
                                                "parser",
                                                "text-embedder", 
                                                "vector-db"])

Each pipeline's available modeling options and parameters are stored in its [configuration](system/create_save_load.md).

In [3]:
# delete all processed datapoints belonging to these pipelines
reset_pipeline(translate_pipeline)
reset_pipeline(pipeline)

## Processing an input file

Let's take a quick look at a short test file before processing.

In [4]:
# examine contents of input file
test_file = "../../../data/input/don_esp.txt"
with open(test_file, "r") as file:
    print(file.read())
    

PRÓLOGO

Desocupado lector: sin juramento me podrás creer que quisiera que este
libro, como hijo del entendimiento, fuera el más hermoso, el más gallardo y
más discreto que pudiera imaginarse. Pero no he podido yo contravenir al
orden de naturaleza; que en ella cada cosa engendra su semejante. Y así,
¿qué podrá engendrar el estéril y mal cultivado ingenio mío, sino la
historia de un hijo seco, avellanado, antojadizo y lleno de pensamientos
varios y nunca imaginados de otro alguno, bien como quien se engendró en
una cárcel, donde toda incomodidad tiene su asiento y donde todo triste
ruido hace su habitación? El sosiego, el lugar apacible, la amenidad de los
campos, la serenidad de los cielos, el murmurar de las fuentes, la quietud
del espíritu son grande parte para que las musas más estériles se muestren
fecundas y ofrezcan partos al mundo que le colmen de maravilla y de
contento. Acontece tener un padre un hijo feo y sin gracia alguna, y el
amor que le tiene le pone una venda en los oj

Below we [process](system/process.md) the input through our first pipeline, using the default models for the `parser` and `json-to-txt` modules and the `opus-mt-es-en` model for `translate`.

In [5]:
# define path to an input file from examples directory
test_file = "../../../data/input/don_esp.txt"

# process a file through the pipeline
process_output = translate_pipeline.process(local_file_path = test_file,
                                            local_save_directory="../../../data/output",  # save output in this directory
                                            expire_time=60*10,         # set all process data to expire in 10 minutes
                                            wait_for_process=True,     # wait for process to complete before returning control
                                            verbose=True,              # print process updates as they occur
                                            modules={"translate": {"model": "opus-mt-es-en"}})

INFO: hydrated input modules: {'module_1': {'model': 'sentence', 'params': {}}, 'module_2': {'model': 'opus-mt-es-en', 'params': {}}, 'module_3': {'model': 'base', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_file_name_jypdyflhds.txt
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 600 seconds, at Tue May  7 12:41:19 2024 UTC
INFO: examples-translate-text-search-pipeline file process and input processing started...
INFO: metadata can be updated using the .update api.
INFO: This process's request_id is: f5f96398-3b3b-526f-cdf9-546d16616be4
INFO: File process and processing status:
SUCCESS: module 1 (of 3) - parser processing complete.
SUCCESS: module 2 (of 3) - translate processing complete.
SUCCESS: module 3 (of 3) - json-to-txt processing complete.
SUCCESS: pipeline process complete.
SUCCESS: process output

The output of this process is printed below.  Because the output of this particular pipeline is a text file, the `process_output` value in this object is null.  However, the file itself has been saved to the location listed under the `process_output_files` key.  The `file_id` of the processed input is used as the filename prefix for the output file.

In [6]:
# nicely print the output of this process
print(json.dumps(process_output, indent=2))

{
  "status_code": 200,
  "pipeline": "examples-translate-text-search-pipeline",
  "request_id": "be0f4674-2117-496d-be14-2ea0e819ad65",
  "file_id": "e98c5065-8544-4127-ab6e-288b70435e76",
  "message": "SUCCESS - output fetched for file_id e98c5065-8544-4127-ab6e-288b70435e76.Output saved to location(s) listed in process_output_files.",
  "process_output": null,
  "process_output_files": [
    "../../../data/output/e98c5065-8544-4127-ab6e-288b70435e76.txt"
  ]
}
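
As a small illustration of the naming convention described above, the sketch below checks that each returned file name starts with the `file_id`.  The `sample_output` dictionary copies values from the output printed above rather than coming from a live call, and `output_matches_file_id` is a hypothetical helper, not part of the krixik client.

```python
from pathlib import Path

# sample values copied from the process output printed above (not a live call)
sample_output = {
    "file_id": "e98c5065-8544-4127-ab6e-288b70435e76",
    "process_output": None,
    "process_output_files": [
        "../../../data/output/e98c5065-8544-4127-ab6e-288b70435e76.txt"
    ],
}


def output_matches_file_id(output: dict) -> bool:
    """Hypothetical helper: True if every output file name starts with the file_id."""
    return all(
        Path(path).name.startswith(output["file_id"])
        for path in output["process_output_files"]
    )


print(output_matches_file_id(sample_output))
```

A check like this can be handy when several processed files share one output directory and you need to pair each output file with the input that produced it.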


We load in the text file output from `process_output_files` below. 

In [7]:
# load in process output from file
import json
with open(process_output["process_output_files"][0], "r") as file:
    print(file.read())

It was a bright cold day in April, and the clocks were striking thirteen.
Winston Smith, his chin nuzzled into his great in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough enough to prevent a swirl of gritty dust from entering along with him.
The hallway smelt of built cabbage and old rag mats.
At one end of it a Coloured poster, too large for indoor display, had been tagged to the wall.
It described simply an enormous face, more than a metre wide: the face of a man of about forty-five, with a heavy black moustache and roaredly handsome features.
Winston made for the stairs.
It was not to use trying the lift.
Even at the best of times it was self working, and at present the electric current was cut off during daylight hours.
It was part of the economic drive in preparation for Hate Week.
The flat was seven flights up, and Winston, who was thirty-nine and had a varicose ulcer above his right ankle, went slowly, re

We now process the file with our second pipeline, which will make the output searchable in English.

In [8]:
# define path to an input file from examples directory
test_file = "../../../data/input/don_esp.txt"

# process a file through the pipeline
process_output = pipeline.process(local_file_path = test_file,
                                  local_save_directory="../../../data/output",  # save output in this directory
                                  expire_time=60*10,         # set all process data to expire in 10 minutes
                                  wait_for_process=True,     # wait for process to complete before returning control
                                  verbose=True,              # print process updates as they occur
                                  modules={"module_2": {"model": "opus-mt-es-en"}})

ValueError: when using duplicate modules your module selection labels must lie in the index set: ['module_1', 'module_2', 'module_3', 'module_4', 'module_5', 'module_6'] so that your model/param selections can be correctly mapped to the proper modules
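
The labels listed in the message above are positional.  The sketch below shows one way to build that mapping, assuming labels are 1-based positions in `module_chain`, as the `hydrated input modules` log line from the first pipeline suggests.  `index_labels` is a hypothetical helper for illustration and not part of the krixik client.

```python
def index_labels(module_chain):
    """Hypothetical helper: map 1-based position labels to module names."""
    return {f"module_{i}": name for i, name in enumerate(module_chain, start=1)}


# the module chain of the second pipeline defined earlier
module_chain = [
    "parser",
    "translate",
    "json-to-txt",
    "parser",
    "text-embedder",
    "vector-db",
]
labels = index_labels(module_chain)

# find the label(s) attached to the translate module
translate_labels = [label for label, name in labels.items() if name == "translate"]
print(translate_labels)
```

Because `parser` appears twice in this chain, a module name alone would be ambiguous, which is why positional labels are needed when selecting models for a pipeline with duplicate modules.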

## Using the `semantic_search` method

Krixik's [`semantic_search` method](system/semantic_search.md) is a convenience function for both embedding and querying, and so can only be used with pipelines containing `text-embedder` and `vector-db` modules in succession.  Since our second pipeline satisfies this condition, it has access to the `semantic_search` method.
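
The "in succession" condition can be illustrated with a short check over a module chain.  This is only a sketch of the stated rule: `supports_semantic_search` is a hypothetical helper, and the krixik client performs its own validation, which may differ.

```python
def supports_semantic_search(module_chain):
    """Hypothetical check: is "text-embedder" immediately followed by "vector-db"?"""
    adjacent_pairs = list(zip(module_chain, module_chain[1:]))
    return ("text-embedder", "vector-db") in adjacent_pairs


# the two module chains defined in the pipeline setup section
translate_chain = ["parser", "translate", "json-to-txt"]
semantic_chain = [
    "parser",
    "translate",
    "json-to-txt",
    "parser",
    "text-embedder",
    "vector-db",
]

print(supports_semantic_search(translate_chain))
print(supports_semantic_search(semantic_chain))
```

Under this rule only the second chain qualifies, since the first contains neither module.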

Now we can query our text with natural language as shown below.

In [None]:
# perform semantic_search over the input file
semantic_output = pipeline.semantic_search(
    query="it was cold night", file_ids=[process_output["file_id"]]
)

# nicely print the output of this process
print(json.dumps(semantic_output, indent=2))

{
  "status_code": 200,
  "request_id": "57a797f5-bc77-4774-a2b0-b5f13007b356",
  "message": "Successfully queried 1 user file.",
  "items": [
    {
      "file_id": "b20fa17b-1df9-4185-94c9-ae17ac187822",
      "file_metadata": {
        "file_name": "krixik_generated_file_name_toyzuamynf.txt",
        "symbolic_directory_path": "/etc",
        "file_tags": [],
        "num_vectors": 50,
        "created_at": "2024-05-07 18:30:32",
        "last_updated": "2024-05-07 18:30:32"
      },
      "search_results": [
        {
          "snippet": "Outside, even through the shut window-pane, the world looked cold.",
          "line_numbers": [
            32,
            33
          ],
          "distance": 0.232
        },
        {
          "snippet": "It was a bright cold day in April, and the clocks were striking thirteen.",
          "line_numbers": [
            1
          ],
          "distance": 0.236
        },
        {
          "snippet": "His hair was very fair, his face\nna

In [None]:
# delete all processed datapoints belonging to these pipelines
reset_pipeline(translate_pipeline)
reset_pipeline(pipeline)