## the `translate` module

This document reviews the `translate` module, which takes a JSON file of text snippets as input and returns their translations, also as JSON.
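As a rough illustration of what such an input might look like, the sketch below writes a small JSON file of text snippets. The `snippet` key name, the example sentences, and the file name are assumptions made for illustration only; check the krixik documentation for the exact input schema.

```python
import json
import os
import tempfile

# hypothetical snippets; the "snippet" key name is an assumption, not a confirmed schema
snippets = [
    {"snippet": "The weather is lovely today."},
    {"snippet": "Where is the nearest train station?"},
]

# write the snippets to a .json file (the only extension the module accepts)
input_path = os.path.join(tempfile.gettempdir(), "translate_input_example.json")
with open(input_path, "w", encoding="utf-8") as f:
    json.dump(snippets, f, ensure_ascii=False, indent=2)

print(input_path)
```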

This document includes an overview of custom pipeline setup, current model set, parameters, and `.process` usage for this module.

To follow along with this demonstration, be sure to initialize your krixik session with your API key and URL as shown below.

We illustrate loading these required secrets via [python-dotenv](https://pypi.org/project/python-dotenv/), storing them in a `.env` file.  This is good practice for storing and loading secrets (e.g., it reduces the chance that you inadvertently push secrets to a repo).

In [1]:
import sys 
sys.path.append('../../')

In [2]:
# load secrets from a .env file using python-dotenv
from dotenv import load_dotenv
import os
load_dotenv("../.env")
MY_API_KEY = os.getenv('MY_API_KEY')
MY_API_URL = os.getenv('MY_API_URL')

# import krixik and initialize it with your personal secrets
from krixik import krixik
krixik.init(api_key = MY_API_KEY, 
            api_url = MY_API_URL)

SUCCESS: You are now authenticated.
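Note that `os.getenv` returns `None` when a variable is missing, so a typo in the `.env` file only surfaces later, at authentication time. A minimal sketch of a guard that fails early instead; `require_env` is a hypothetical helper, not part of krixik or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"missing required secret: {name} (check your .env file)")
    return value


# usage, after load_dotenv(...) has run:
# MY_API_KEY = require_env("MY_API_KEY")
# MY_API_URL = require_env("MY_API_URL")
```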


In [3]:
# reset pipelines for demo
def reset_pipeline(pipeline):
    current_files = pipeline.list(symbolic_directory_paths=["/*"])
    assert current_files["status_code"] != 500
    for item in current_files["items"]:
        delete_result = pipeline.delete(file_id=item["file_id"])
        assert delete_result["status_code"] != 500
    current_files = pipeline.list(symbolic_directory_paths=["/*"])
    assert current_files["status_code"] != 500
    assert len(current_files["items"]) == 0

This small function prints dictionaries very nicely in notebooks / markdown.

In [4]:
# print dictionaries / json nicely in notebooks / markdown
import json
def json_print(data):
    print(json.dumps(data, indent=2))

A table of contents for the remainder of this document is shown below.


- [pipeline setup](#pipeline-setup)
- [using the English to Spanish translation model](#using-the-opus-mt-en-es-model)
- [using the Spanish to English translation model](#using-the-opus-mt-es-en-model)


## Pipeline setup

Below we set up a simple one-module pipeline using the `translate` module.

In [5]:
# import custom module creation tools
from krixik.pipeline_builder.module import Module
from krixik.pipeline_builder.pipeline import CreatePipeline

# instantiate module
module_1 = Module(module_type="translate")

# create custom pipeline object
custom = CreatePipeline(name='translate-pipeline-1', 
                        module_chain=[module_1])

# pass the custom object to the krixik operator (note you can also do this by passing its config)
pipeline = krixik.load_pipeline(pipeline=custom)

The `translate` module comes with a subset of popular translation models created at the [University of Helsinki](https://huggingface.co/Helsinki-NLP).  These include:

- [opus-mt-en-es](https://huggingface.co/Helsinki-NLP/opus-mt-en-es): English to Spanish translation model (default)
- [opus-mt-es-en](https://huggingface.co/Helsinki-NLP/opus-mt-es-en): Spanish to English translation model
- [opus-mt-de-en](https://huggingface.co/Helsinki-NLP/opus-mt-de-en): German to English translation model
- [opus-mt-en-fr](https://huggingface.co/Helsinki-NLP/opus-mt-en-fr): English to French translation model
- [opus-mt-fr-en](https://huggingface.co/Helsinki-NLP/opus-mt-fr-en): French to English translation model
- [opus-mt-it-en](https://huggingface.co/Helsinki-NLP/opus-mt-it-en): Italian to English translation model
- [opus-mt-zh-en](https://huggingface.co/Helsinki-NLP/opus-mt-zh-en): Chinese to English translation model

These available models and their parameters are stored in our custom pipeline's configuration.  We can examine this configuration as shown below.

In [6]:
# nicely print the configuration of our custom pipeline
json_print(custom.config)

{
  "pipeline": {
    "name": "translate-pipeline-1",
    "modules": [
      {
        "name": "translate",
        "models": [
          {
            "name": "opus-mt-de-en"
          },
          {
            "name": "opus-mt-en-es"
          },
          {
            "name": "opus-mt-es-en"
          },
          {
            "name": "opus-mt-en-fr"
          },
          {
            "name": "opus-mt-fr-en"
          },
          {
            "name": "opus-mt-it-en"
          },
          {
            "name": "opus-mt-zh-en"
          }
        ],
        "defaults": {
          "model": "opus-mt-en-es"
        },
        "input": {
          "type": "json",
          "permitted_extensions": [
            ".json"
          ]
        },
        "output": {
          "type": "json"
        }
      }
    ]
  }
}


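Here we can see the models available for use.  The `defaults` entry shows that `opus-mt-en-es` is used when no model is specified, and the `input` entry shows that the module accepts only `.json` files.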
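The configuration is an ordinary dictionary, so the available models, the default model, and the permitted input extensions can also be read off programmatically. The sketch below works on a copy of the configuration printed above; `summarize_modules` is a hypothetical helper, not part of krixik.

```python
# copy of the configuration printed above
config = {
    "pipeline": {
        "name": "translate-pipeline-1",
        "modules": [
            {
                "name": "translate",
                "models": [
                    {"name": "opus-mt-de-en"},
                    {"name": "opus-mt-en-es"},
                    {"name": "opus-mt-es-en"},
                    {"name": "opus-mt-en-fr"},
                    {"name": "opus-mt-fr-en"},
                    {"name": "opus-mt-it-en"},
                    {"name": "opus-mt-zh-en"},
                ],
                "defaults": {"model": "opus-mt-en-es"},
                "input": {"type": "json", "permitted_extensions": [".json"]},
                "output": {"type": "json"},
            }
        ],
    }
}


def summarize_modules(config):
    """Map each module name to its models, default model, and permitted extensions."""
    summary = {}
    for module in config["pipeline"]["modules"]:
        summary[module["name"]] = {
            "models": [model["name"] for model in module["models"]],
            "default": module["defaults"]["model"],
            "extensions": module["input"]["permitted_extensions"],
        }
    return summary


summary = summarize_modules(config)
print(summary["translate"]["default"])     # opus-mt-en-es
print(summary["translate"]["extensions"])  # ['.json']
```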

In [7]:
# delete all processed datapoints belonging to this pipeline
reset_pipeline(pipeline)

## using the `opus-mt-en-es` model

We first define a path to a local input file.

In [None]:
# define path to an input file
test_file = "../input_data/Interesting Facts About Colombia.mp4"

Let's take a quick look at this file before processing.

In [None]:
# examine contents of input file
from IPython.display import Video
Video(test_file)

Now let's process it using our `opus-mt-en-es` model.  Because `opus-mt-en-es` is the default model, we need not pass the optional `modules` argument to `.process`.

In [None]:
# define path to an input file from examples directory
test_file = "../input_data/Interesting Facts About Colombia.mp4"

# process the input file through the pipeline
process_output = pipeline.process(local_file_path = test_file,
                                  local_save_directory=".", # save output in current directory
                                  expire_time=60*5,         # set all process data to expire in 5 minutes
                                  wait_for_process=True,    # wait for process to complete before returning control
                                  verbose=False)            # set verbosity to False

The output of this process is printed below.  Because this module-model pair outputs JSON, the output is also included in this object under the `process_output` key.  The output file itself is saved to the path listed under the `process_output_files` key.

In [None]:
# nicely print the output of this process
json_print(process_output)

{
  "status_code": 200,
  "pipeline": "transcribe-pipeline-1",
  "request_id": "242d366f-e5a6-43eb-8069-4766c2274243",
  "file_id": "8fb91c66-45f5-42b7-8e3f-44f97e75ea6e",
  "message": "SUCCESS - output fetched for file_id 8fb91c66-45f5-42b7-8e3f-44f97e75ea6e.Output saved to location(s) listed in process_output_files.",
  "process_output": [
    {
      "transcript": " That's the episode looking at the great country of Columbia. We looked at some really basic facts. It's name, a bit of its history, the type of people that live there, land size, and all that jazz. But in this video, we're going to go into a little bit more of a detailed look. Yo, what is going on guys? Welcome back to F2D facts. The channel where I look at people cultures and places. My name is Dave Wouple, and today we are going to be looking more at Columbia and our Columbia part two video. Which just reminds me guys, this is part of our Columbia playlist. So put it down in the description box below, and I'll talk abo

We load the output file listed in `process_output_files` below.

In [None]:
# load in process output from file
import json
with open(process_output['process_output_files'][0], "r") as file:
    print(file.read())  

[{"transcript": " That's the episode looking at the great country of Columbia. We looked at some really basic facts. It's name, a bit of its history, the type of people that live there, land size, and all that jazz. But in this video, we're going to go into a little bit more of a detailed look. Yo, what is going on guys? Welcome back to F2D facts. The channel where I look at people cultures and places. My name is Dave Wouple, and today we are going to be looking more at Columbia and our Columbia part two video. Which just reminds me guys, this is part of our Columbia playlist. So put it down in the description box below, and I'll talk about that more in the video. But if you're new here, join me every single Monday to learn about new countries from around the world. You can do that by hitting that subscribe and that belt notification button. But let's get started. Learn about Columbia. So we all know Columbia is famous for its coffee, right? Yes, right. I know. You guys are sitting the
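The saved output can also be parsed into Python objects rather than printed as raw text. A minimal sketch, assuming the response carries the `status_code` and `process_output_files` keys shown above; `load_process_output` is a hypothetical helper, and the demo file and its contents are stand-ins, not real krixik output.

```python
import json
import os
import tempfile


def load_process_output(process_output):
    """Parse every saved output file after checking that the process succeeded."""
    if process_output.get("status_code") != 200:
        raise RuntimeError(f"process failed: {process_output.get('message')}")
    results = []
    for path in process_output.get("process_output_files", []):
        with open(path, "r", encoding="utf-8") as f:
            results.append(json.load(f))
    return results


# stand-in data so the sketch runs without a krixik session
demo_path = os.path.join(tempfile.gettempdir(), "demo_process_output.json")
with open(demo_path, "w", encoding="utf-8") as f:
    json.dump([{"example_key": "example value"}], f)

demo_output = {"status_code": 200, "process_output_files": [demo_path]}
loaded = load_process_output(demo_output)
print(loaded)
```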

## using the `opus-mt-es-en` model

To use a non-default model like `opus-mt-es-en` we enter it explicitly as a `modules` selection when invoking `.process`.

We use it below to process the same input file shown above.

In [None]:
# define path to an input file from examples directory
test_file = "../input_data/Interesting Facts About Colombia.mp4"

# process the input file through the pipeline
process_output = pipeline.process(local_file_path = test_file,
                                  local_save_directory=".", # save output in current directory
                                  expire_time=60*5,         # set all process data to expire in 5 minutes
                                  wait_for_process=True,    # wait for process to complete before returning control
                                  verbose=False,            # set verbosity to False
                                  modules={"translate":{"model":"opus-mt-es-en"}}) # select a non-default model
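A misspelled module or model name in the `modules` argument is easy to miss. The sketch below builds the selection dictionary and checks it against the pipeline configuration first. `build_model_selection` is a hypothetical helper, not part of krixik, and the configuration is an abbreviated copy of the one printed earlier.

```python
# abbreviated copy of the pipeline configuration printed earlier
config = {
    "pipeline": {
        "name": "translate-pipeline-1",
        "modules": [
            {
                "name": "translate",
                "models": [
                    {"name": "opus-mt-en-es"},
                    {"name": "opus-mt-es-en"},
                ],
                "defaults": {"model": "opus-mt-en-es"},
            }
        ],
    }
}


def build_model_selection(config, module_name, model_name):
    """Return a `modules` dictionary after validating both names against the config."""
    for module in config["pipeline"]["modules"]:
        if module["name"] == module_name:
            available = [model["name"] for model in module["models"]]
            if model_name not in available:
                raise ValueError(f"model {model_name!r} not available; choose from {available}")
            return {module_name: {"model": model_name}}
    raise ValueError(f"module {module_name!r} is not part of this pipeline")


selection = build_model_selection(config, "translate", "opus-mt-es-en")
print(selection)  # {'translate': {'model': 'opus-mt-es-en'}}
```

The returned dictionary can then be passed as `modules=selection` when invoking `.process`.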

The output of this process is printed below.  Because this module-model pair outputs JSON, the output is also included in this object under the `process_output` key.  The output file itself is saved to the path listed under the `process_output_files` key.

In [None]:
# nicely print the output of this process
json_print(process_output)