# multilingual transcription pipeline

This document details a modular pipeline that takes in an audio/video file, transcribes it, and translates the transcription into a desired language.

To follow along with this demonstration, be sure to initialize your krixik session with your API key and URL as shown below.

We illustrate loading these required secrets via [python-dotenv](https://pypi.org/project/python-dotenv/), which reads them from a `.env` file. Keeping secrets in a `.env` file is good practice: it reduces the chance that you inadvertently push them to a repo.


In [1]:
import sys 
sys.path.append('../../../')

from docs.utilities.reset import reset_pipeline

In [2]:
# load secrets from a .env file using python-dotenv
from dotenv import load_dotenv
import os
load_dotenv("../../.env")
MY_API_KEY = os.getenv('MY_API_KEY')
MY_API_URL = os.getenv('MY_API_URL')

# import krixik and initialize it with your personal secrets
from krixik import krixik
krixik.init(api_key=MY_API_KEY,
            api_url=MY_API_URL)
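
`os.getenv` returns `None` when a variable is not set, so a missing or misnamed entry in `.env` may only surface later, when the session is initialized. A minimal sketch of an early check is shown below. The helper `require_env` is a hypothetical name introduced here for illustration; it is not part of krixik or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        # hypothetical helper: fail early with a readable message instead of passing None along
        raise RuntimeError(f"environment variable {name!r} is not set; check your .env file")
    return value
```

For example, calling `MY_API_KEY = require_env("MY_API_KEY")` after `load_dotenv(...)` would fail immediately with a clear message if the key is absent.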

The small helper function below pretty-prints dictionaries / json in notebooks / markdown.

In [3]:
# print dictionaries / json nicely in notebooks / markdown
import json
def json_print(data):
    print(json.dumps(data, indent=2))
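
By default `json.dumps` escapes non-ASCII characters as `\uXXXX` sequences, which can make translated text with accented characters hard to read. The variant below is a small sketch that passes `ensure_ascii=False` to keep such characters as they are. The names `json_format_unicode` and `json_print_unicode` are hypothetical and introduced here only for illustration.

```python
import json


def json_format_unicode(data) -> str:
    """Serialize `data` as indented JSON, leaving non-ASCII characters unescaped."""
    return json.dumps(data, indent=2, ensure_ascii=False)


def json_print_unicode(data) -> None:
    """Print `data` as indented JSON with readable non-ASCII characters."""
    print(json_format_unicode(data))
```

Either helper can be used interchangeably with `json_print`; only the rendering of non-ASCII text differs.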

A table of contents for the remainder of this document is shown below.


- [pipeline setup](#pipeline-setup)
- [processing a file](#processing-a-file)
- [saving the pipeline config for future use](#saving-the-pipeline-config-for-future-use)

## pipeline setup

Below we set up a multi-module pipeline that will transcribe any audio/video file and translate the resulting transcript into a desired language.

To do this we will use the following modules:

- [`transcribe`](modules/transcribe.md): takes in audio/video input, outputs json of content transcription
- [`translate`](modules/translate.md): takes in json of text snippets, outputs json of translated snippets
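
To make the data flow between these two modules concrete, the toy sketch below chains two stand-in functions: the first returns json-like text snippets, and the second consumes those snippets and returns translated ones. Everything in it (`fake_transcribe`, `fake_translate`, `run_chain`, and the hard-coded text) is hypothetical and only mimics the shape of the data; it does not call krixik or any real model.

```python
from typing import Dict, List

Snippets = List[Dict[str, str]]


def fake_transcribe(media_path: str) -> Snippets:
    # stand-in for a transcription module: returns text snippets as json-like dicts
    return [{"snippet": "hello world"}]


def fake_translate(snippets: Snippets) -> Snippets:
    # stand-in for a translation module: maps each snippet to a "translated" snippet
    lookup = {"hello world": "hola mundo"}
    return [{"snippet": lookup.get(s["snippet"], s["snippet"])} for s in snippets]


def run_chain(media_path: str) -> Snippets:
    # the output of the first stage becomes the input of the second
    return fake_translate(fake_transcribe(media_path))
```

The point of the sketch is only that the output format of one module must match the input format of the next for the chain to be valid.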


In [4]:
from krixik.pipeline_builder.module import Module
from krixik.pipeline_builder.pipeline import CreatePipeline

# select modules
module_1 = Module(module_type="transcribe")
module_2 = Module(module_type="translate")

# create custom pipeline object
custom = CreatePipeline(name='transcribe-translate-pipeline', 
                        module_chain=[module_1, module_2])

# pass the custom object to the krixik operator (note you can also do this by passing its config)
pipeline = krixik.load_pipeline(pipeline=custom)

With our `custom` pipeline built and loaded, we can now use it to process a test file.

## processing a file

We first define a path to a local input file.

In [5]:
# define path to an input file
test_file = "../input_data/Interesting Facts About Colombia.mp4"

Let's take a quick look at this file before processing.

In [6]:
# examine contents of input file
from IPython.display import Video
Video(test_file)

The spoken language of the input video is English.  We will use the `opus-mt-en-es` model of the [`translate`](modules/translate.md) module to translate the transcript of this video into Spanish.

For the remaining module, `transcribe`, we will use the default model.


In [7]:
# delete all processed datapoints belonging to this pipeline
reset_pipeline(pipeline)

In [8]:
# test file (same path as defined above)
test_file = "../input_data/Interesting Facts About Colombia.mp4"

# process test input, translating the transcript with the opus-mt-en-es model
process_output = pipeline.process(local_file_path=test_file,
                                  expire_time=60*5,  # file expires after 300 seconds
                                  modules={"translate": {"model": "opus-mt-en-es"}})

INFO: Checking that file size falls within acceptable parameters...
INFO:...success!
converted ../input_data/Interesting Facts About Colombia.mp4 to: /var/folders/k9/0vtmhf0s5h56gt15mkf07b1r0000gn/T/tmppaeads7s/krixik_converted_version_Interesting Facts About Colombia.mp3
INFO: hydrated input modules: {'transcribe': {'model': 'whisper-tiny', 'params': {}}, 'translate': {'model': 'opus-mt-en-es', 'params': {}}}
INFO: symbolic_directory_path was not set by user - setting to default of /etc
INFO: file_name was not set by user - setting to random file name: krixik_generated_file_name_xqpbbvidoq.mp3
INFO: wait_for_process is set to True.
INFO: file will expire and be removed from you account in 300 seconds, at Mon Apr 29 15:12:17 2024 UTC
INFO: transcribe-translate-pipeline file process and input processing started...
INFO: metadata can be updated using the .update api.
INFO: This process's request_id is: 2cbc5bbb-bc0e-a552-a439-61e25bdfa4cc
INFO: File process and processing status:
SUCCESS
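
The `hydrated input modules` line in the log above shows that the requested `translate` model was kept, while `transcribe` fell back to `whisper-tiny` with empty `params`. The merge can be pictured with the sketch below. It only illustrates the behaviour visible in the log and is not krixik's actual implementation; `hydrate_modules`, `defaults`, and `overrides` are hypothetical names, and only the `transcribe` default is listed because the log does not reveal a default for `translate`.

```python
from copy import deepcopy


def hydrate_modules(defaults: dict, overrides: dict) -> dict:
    """Fill in each module's settings from `defaults`, letting `overrides` win."""
    hydrated = deepcopy(defaults)  # leave the caller's defaults untouched
    for module_name, settings in overrides.items():
        # start from existing settings (or empty params) and apply the user's choices
        hydrated.setdefault(module_name, {"params": {}}).update(settings)
    return hydrated


# default for transcribe taken from the log; no translate default is assumed here
defaults = {"transcribe": {"model": "whisper-tiny", "params": {}}}
overrides = {"translate": {"model": "opus-mt-en-es"}}
hydrated = hydrate_modules(defaults, overrides)
```

With these inputs, `hydrated` holds the same two entries as the dictionary reported in the log.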

The output of this process is printed below.  Because the output of this particular pipeline is a json file, the process output is shown as well.  The output file itself has been saved locally at the location listed under the `process_output_files` key.
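
Once `process_output` is available, the translated text sits under its `process_output` key as a list of dictionaries, each with a `snippet` field (as in the printed result in this section). A minimal sketch for stitching these into one block of text follows. The helper `join_snippets` is a hypothetical name, and the sketch assumes every entry carries a `snippet` string.

```python
def join_snippets(process_output: dict, separator: str = " ") -> str:
    """Concatenate the `snippet` text of every entry under the `process_output` key."""
    # treat a missing or empty `process_output` key as "no snippets"
    entries = process_output.get("process_output") or []
    return separator.join(entry["snippet"].strip() for entry in entries)
```

Calling `join_snippets(process_output)` on the result of `pipeline.process` would then return the full translated transcript as a single string.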

In [9]:
# nicely print the output of this process
json_print(process_output)

{
  "status_code": 200,
  "pipeline": "transcribe-translate-pipeline",
  "request_id": "47c08992-bbe6-4d4a-83b6-abb51ed53c8b",
  "file_id": "82713863-4978-4909-b7ae-c61617b33ee8",
  "message": "SUCCESS - output fetched for file_id 82713863-4978-4909-b7ae-c61617b33ee8.Output saved to location(s) listed in process_output_files.",
  "process_output": [
    {
      "snippet": "Ese es el episodio que mira al gran país de Columbia. Miramos algunos hechos realmente básicos. Es el nombre, un poco de su historia, el tipo de gente que vive allí, el tamaño de la tierra, y todo ese jazz. Pero en este video, vamos a entrar en un poco más de una mirada detallada. Yo, qué está pasando chicos? Bienvenidos de nuevo a los hechos F2D. El canal donde miro las culturas y lugares de la gente. Mi nombre es Dave Wouple, y hoy vamos a ver más en Columbia y nuestro video de la segunda parte de Columbia. Lo que me recuerda chicos, esto es parte de nuestra lista de Columbia. Así que póngalo en el cuadro de descripción a con

## saving the pipeline config for future use

You can save the configuration of this pipeline using the `custom` object and load it later directly, without rebuilding it in python.

In [10]:
# save your config for later use (that way you don't need to re-build in python)
custom.save(config_path='transcribe-translate-pipeline.yml')

See the documentation on saving and loading pipeline configuration files for more detail.